Re: [PATCH bpf-next v2 1/1] docs/bpf: Add description of register liveness tracking algorithm

public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed

From: Eduard Zingerman <eddyz87@gmail.com>
To: David Vernet <void@manifault.com>
Cc: bpf@vger.kernel.org, ast@kernel.org, andrii@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev, kernel-team@fb.com,
	yhs@fb.com
Subject: Re: [PATCH bpf-next v2 1/1] docs/bpf: Add description of register liveness tracking algorithm
Date: Tue, 31 Jan 2023 01:45:29 +0200	[thread overview]
Message-ID: <d8a6ac211053c27463c87beeaa74ee511d3ca796.camel@gmail.com> (raw)
In-Reply-To: <Y9ghp6G8KiG6fO2a@maniforge>

On Mon, 2023-01-30 at 13:59 -0600, David Vernet wrote:
> On Mon, Jan 30, 2023 at 08:24:00PM +0200, Eduard Zingerman wrote:
> > This is a followup for [1], adds an overview for the register liveness
> > tracking, covers the following points:
> > - why register liveness tracking is useful;
> > - how register parentage chains are constructed;
> > - how liveness marks are applied using the parentage chains.
> > 
> > [1] https://lore.kernel.org/bpf/CAADnVQKs2i1iuZ5SUGuJtxWVfGYR9kDgYKhq3rNV+kBLQCu7rA@mail.gmail.com/
> > 
> > Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
> 
> This is great, thanks a lot for writing this up. Left some comments.
> 
> > ---
> >  Documentation/bpf/verifier.rst | 266 +++++++++++++++++++++++++++++++++
> >  1 file changed, 266 insertions(+)
> > 
> > diff --git a/Documentation/bpf/verifier.rst b/Documentation/bpf/verifier.rst
> > index 3afa548ec28c..7269933428e1 100644
> > --- a/Documentation/bpf/verifier.rst
> > +++ b/Documentation/bpf/verifier.rst
> > @@ -316,6 +316,272 @@ Pruning considers not only the registers but also the stack (and any spilled
> >  registers it may hold).  They must all be safe for the branch to be pruned.
> >  This is implemented in states_equal().
> >  
> > +Some technical details about state pruning implementation could be found below.

Hi David,

Thank you for commenting. I'll update the doc according to your
s/xxx/yyy/ suggestions, sorry for the amount of missing comas, etc.
Regarding some bigger points please see below.

> > +
> > +Register liveness tracking
> > +--------------------------
> > +
> > +In order to make state pruning effective liveness state is tracked for each
> 
> Could you please add a comma after "effective"? Will use shorthand
> s/xxx/yyy for future suggestions like this.
> 
> > +register and stack slot. The basic idea is to identify which registers or stack
> > +slots were used by children of the state to reach the program exit. Registers
> 
> IMO the phrasing "used by children of the state" may be a bit confusing.
> What do you think about this?
> 
> "The basic idea is to track which registers and stack slots are actually
> used during subseqeuent execution of the program, until program exit is
> reached."
> 
> > +that were never used could be removed from the cached state thus making more
> 
> s/Registers that were never/Registers and stack slots that were never
> 
> > +states equivalent to a cached state. This could be illustrated by the following
> > +program::
> > +
> > +  0: call bpf_get_prandom_u32()
> > +  1: r1 = 0
> > +  2: if r0 == 0 goto +1
> > +  3: r0 = 1
> > +  --- checkpoint ---
> > +  4: r0 = r1
> > +  5: exit
> > +
> > +Suppose that state cache entry is created at instruction #4 (such entries are
> > +also called "checkpoints" in the text below), verifier reaches this instruction
> > +in two states:
> 
> Can we split these into two sentences? Wdyt about the following:
> 
> "Suppose that a state cache entry is created at instruction #4 (such
> entries are also called "checkpoints" in the text below). The verifier
> could reach the instruction with one of two possible register states:"
> 
> > +
> > +* r0 = 1, r1 = 0
> > +* r0 = 0, r1 = 0
> > +
> > +However, only the value of register ``r1`` is important to successfully finish
> > +verification. The goal of liveness tracking algorithm is to spot this fact and
> 
> s/ of liveness tracking/of the liveness tracking
> 
> > +figure out that both states are actually equivalent.
> > +
> > +Data structures
> > +~~~~~~~~~~~~~~~
> > +
> > +Liveness is tracked using the following data structures::
> > +
> > +  enum bpf_reg_liveness {
> > +	REG_LIVE_NONE = 0,
> > +	REG_LIVE_READ32 = 0x1,
> > +	REG_LIVE_READ64 = 0x2,
> > +	REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64,
> > +	REG_LIVE_WRITTEN = 0x4,
> > +	REG_LIVE_DONE = 0x8,
> > +  };
> 
> FYI, you have the option of using the .. kernel-doc:: directive here, as
> in e.g.:
> 
> .. kernel-doc:: include/linux/bpf_verifier.h
>   :identifiers: bpf_reg_liveness
> 
> For enum bpf_reg_liveness and the other types. Doing so would require
> you updating the kernel-doc header comment in
> include/linux/bpf_verifier.h to reflect the description in [0], i.e.
> something like this for enum bpf_reg_liveness:
> 
> [0]: https://docs.kernel.org/doc-guide/kernel-doc.html#structure-union-and-enumeration-documentation
> 
> /**
>  * enum bpf_reg_liveness - Liveness marks, used for registers and spilled-regs
>  * @REG_LIVE_NONE: reg hasn't been read or written this branch
>  * @REG_LIVE_READ32: reg was read, so we're sensitive to initial value
>  * @REG_LIVE_READ64: likewise, but full 64-bit content matters
>  * @REG_LIVE_READ: An OR of both 32 and 64 bit read reg states.
>  * @REG_LIVE_WRITTEN: reg was written first, screening off later reads
>  * @REG_LIVE_DONE: liveness won't be updating this register anymore
>  *
>  * Read marks propagate upwards until they find a write mark; they record that
>  * "one of this state's descendants read this reg" (and therefore the reg is
>  * relevant for states_equal() checks).
>  * Write marks collect downwards and do not propagate; they record that "the
>  * straight-line code that reached this state (from its parent) wrote this reg"
>  * (and therefore that reads propagated from this state or its descendants
>  * should not propagate to its parent).
>  * A state with a write mark can receive read marks; it just won't propagate
>  * them to its parent, since the write mark is a property, not of the state,
>  * but of the link between it and its parent.  See mark_reg_read() and
>  * mark_stack_slot_read() in kernel/bpf/verifier.c.
>  */
> 
> Not everyone likes how the rendered HTML looks, and I tend to agree that
> it looks pretty blocky, but it at least keeps all of this documentation
> in one place and avoids going stale. Also, it has the benefit of
> automatically creating a link to the ..kernel-doc:: comment anywhere
> that "enum bpf_reg_liveness" is written in the Documentation subtree.

Thank you for pointing to this capability. I'd like to stick to the
current variant because I don't need full definitions of struct
bpf_reg_state and others. The only complete definition I want to
describe is enum bpf_reg_liveness, for all others just want to show
the dependencies. I just tried mixing `.. kernel-doc::` for the enum,
and code block for the structs but it looks awkward in the final HTML.

> 
> > +
> > +  struct bpf_reg_state {
> > + 	...
> > +	struct bpf_reg_state *parent;
> > + 	...
> > +	enum bpf_reg_liveness live;
> > + 	...
> > +  };
> > +
> > +  struct bpf_stack_state {
> > +	struct bpf_reg_state spilled_ptr;
> > +	...
> > +  };
> > +
> > +  struct bpf_func_state {
> > +	struct bpf_reg_state regs[MAX_BPF_REG];
> > +        ...
> > +	struct bpf_stack_state *stack;
> > +  }
> > +  
> > +  struct bpf_verifier_state {
> > +	struct bpf_func_state *frame[MAX_CALL_FRAMES];
> > +	struct bpf_verifier_state *parent;
> > +        ...
> > +  }
> > +
> > +* ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new
> > +  verifier state creation;
> > +
> > +* ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is
> > +  defined by some instruction "executed" within verifier state;
> > +
> > +* ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot)
> > +  is read by some instruction "executed" within verifier state;
> > +
> > +* ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid
> > +  processing same verifier state multiple times and for some sanity checks;
> > +
> > +* ``->live`` field values are formed by combining ``enum bpf_reg_liveness``
> > +  values using bitwise or.
> > +
> > +Register parentage chains
> > +~~~~~~~~~~~~~~~~~~~~~~~~~
> > +  
> > +In order to propagate information between parent and child states register
> 
> Suggestion:
> 
> s/parent and child states register parentage chain/
>   parent and child states, a *register parentage chain*
> 
> > +parentage chain is established. Each register or stack slot is linked to a
> > +corresponding register or stack slot in it's parent state via a ``->parent``
> 
> s/it's/its
> 
> > +pointer. This link is established upon state creation in function
> 
> suggestion: I think you can remove the word 'function'.
> 
> > +``is_state_visited()`` and is never changed.
> > +
> > +The rules for correspondence between registers / stack slots are as follows:
> > +
> > +* For current stack frame registers and stack slots of the new state are linked
> > +  to the registers and stack slots of the parent state with the same indices.
> 
> s/For current stack frame registers/For the current stack frame,
> registers
> 
> > +
> > +* For outer stack frames only caller saved registers (r6-r9) and stack slots are
> > +  linked to the registers and stack slots of the parent state with the same
> > +  indices.
> 
> s/For the outer stack frames only/For the outer stack frames, only
> 
> > +
> > +This could be illustrated by the following diagram (arrows stand for
> > +``->parent`` pointers)::
> > +
> > +      ...                    ; Frame #0, some instructions
> > +  --- checkpoint #0 ---
> > +  1 : r6 = 42                ; Frame #0
> > +  --- checkpoint #1 ---
> > +  2 : call foo()             ; Frame #0
> > +      ...                    ; Frame #1, instructions from foo()
> > +  --- checkpoint #2 ---
> > +      ...                    ; Frame #1, instructions from foo()
> > +  --- checkpoint #3 ---
> > +      exit                   ; Frame #1, return from foo()
> > +  3 : r1 = r6                ; Frame #0  <- current state
> > +  
> > +             +--------------------------+--------------------------+ 
> > +             |         Frame #0         |         Frame #1         |
> > +  Checkpoint +--------------------------+--------------------------+
> > +  #0         | r0-r5 | r6-r9 | fp-8 ... |                           
> > +             +--------------------------+                           
> > +                ^       ^       ^
> > +                |       |       |          
> > +  Checkpoint +--------------------------+                           
> > +  #1         | r0-r5 | r6-r9 | fp-8 ... |
> > +             +--------------------------+
> > +                        ^       ^         nil     nil     nil 
> > +                        |       |          |       |       |          
> > +  Checkpoint +--------------------------+--------------------------+
> > +  #2         | r0-r5 | r6-r9 | fp-8 ... | r0-r5 | r6-r9 | fp-8 ... |
> > +             +--------------------------+--------------------------+
> > +                        ^       ^          ^       ^       ^  
> > +                        |       |          |       |       |
> > +  Checkpoint +--------------------------+--------------------------+
> > +  #3         | r0-r5 | r6-r9 | fp-8 ... | r0-r5 | r6-r9 | fp-8 ... |
> > +             +--------------------------+--------------------------+
> > +                        ^       ^
> > +                        |       |          
> > +  Current    +--------------------------+                           
> > +  state      | r0-r5 | r6-r9 | fp-8 ... |                           
> > +             +--------------------------+
> > +                        \           
> > +                          r6 read mark is propagated via
> > +                          these links all the way up to
> > +                          checkpoint #1.
> 
> IMO this illustration is kind of confusing as shown, relative to the
> descriptions of current / outer above. 'Frame #1' is potentially in a
> different function / subprog, correct? So it seems kind of confusing to
> see it propagating registers from checkpoint 3 -> checkpoint 2, etc. And
> it's propagating r0-r5. Similarly, it looks like 'Current state' isn't
> propagating r0-r5. I'm sure I'm misunderstanding something.
> 

Good point. I reviewed the logic in __check_func_call() and it turns
out I missed the actions performed by:
- init_func_state() which cuts all parent links for the newly created
  callee state.
- set_callee_state() which copies r1-r5 from caller byte-to-byte
  (thus including parent link).
I'll update the description and diagram to reflect this.

> > +
> > +Liveness marks tracking
> > +~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +For each processed instruction verifier propagates information about reads up
> 
> s/processed instruction verifier/processed instruction, the verifier
> 
> > +the parentage chain and saves information about writes in the current state.
> > +The information about reads is propagated by function ``mark_reg_read()`` which
> > +could be summarized as follows::
> > +
> > +  mark_reg_read(struct bpf_reg_state *state):
> > +      parent = state->parent
> > +      while parent:
> > +          if parent->live & REG_LIVE_WRITTEN:
> > +              break
> > +          if parent->live & REG_LIVE_READ64:
> > +              break
> > +          parent->live |= REG_LIVE_READ64
> > +          state = parent
> > +          parent = state->parent
> > +
> > +Note: details about REG_LIVE_READ32 are omitted.
> > +
> > +Also note: the read marks are applied to the **parent** state while write marks
> > +are applied to the **current** state.
> > +
> > +Because stack writes could have different sizes ``REG_LIVE_WRITTEN`` marks are
> > +applied conservatively: stack slots are marked as written only if write size
> > +corresponds to the size of the register, e.g. see function ``save_register_state()``.
> > +
> > +Consider the following example::
> > +
> > +  0: (*u64)(r10 - 8) = 0
> > +  --- checkpoint #0 ---
> > +  1: (*u32)(r10 - 8) = 1
> > +  2: r1 = (*u32)(r10 - 8)
> > +
> > +Because write size at instruction (1) is smaller than register size the write
> 
> s/write size/the write size
> 
> also
> 
> s/than register size the/than register size, the
> 
> > +mark will not be added to fp-8 slot when (1) is verified, thus the fp-8 read at
> > +instruction (2) would propagate read mark for fp-8 up to checkpoint #0.
> 
> s/is verified, thus/is verifier. Thus
> 
> also
> 
> s/would propagate read mark/would propagate the read mark
> 
> In general, this sentence explains mechanically why the write won't
> propagate (the write size is smaller than the register size), but I
> think it would be useful to explain why sizeof(write) < sizeof(reg)
> implies that it won't be propagated.
> 
> > +
> > +Once ``BPF_EXIT`` instruction is reached function ``update_branch_counts()`` is
> 
> s/Once ``BPF_EXIT``/Once the ``BPF_EXIT``
> 
> also
> 
> s/is reached function/is reached,
> 
> > +called to update the ``->branches`` counter for each verifier state in a chain
> > +of parent verifier states. When ``->branches`` counter reaches zero the verifier
> 
> s/When ``->branches`` counter reaches zero/
>   When the ``branches`` counter reaches zero,
> 
> > +state becomes a valid entry in a set of cached verifier states.
> > +
> > +Each entry of the verifier states cache is post-processed by a function
> > +``clean_live_states()``. This function marks all registers and stack slots
> > +without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``.
> > +Registers/stack slots marked in this way are ignored in function ``stacksafe()``
> > +called from ``states_equal()`` when state cache entry is considered for
> 
> s/when state cache entry/when a state cache entry
> 
> > +equivalence with a current state.
> > +
> > +Now it is possible to explain how the example from the beginning of the section
> > +works::
> > +
> > +  0: call bpf_get_prandom_u32()
> > +  1: r1 = 0
> > +  2: if r0 == 0 goto +1
> > +  3: r0 = 1
> > +  --- checkpoint[0] ---
> > +  4: r0 = r1
> > +  5: exit
> > +
> > +* At instruction #2 branching point is reached and state ``{ r0 == 0, r1 == 0, pc == 4 }``
> > +  is pushed to states processing queue (pc stands for program counter).
> > +  
> > +* At instruction #4:
> > +  
> > +  * ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``;
> > +  * ``checkpoint[0].r0`` is marked as written;
> > +  * ``checkpoint[0].r1`` is marked as read;
> > +
> > +* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
> > +  by ``clean_live_states()``, after this processing ``checkpoint[0].r0`` has a
> 
> s/by ``clean_live_states()``, after/by ``clean_live_states()``. After
> 
> > +  read mark and all other registers and stack slots are marked as ``NOT_INIT``
> > +  or ``STACK_INVALID``
> > +  
> > +* The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue
> > +  and is compared against a cached state ``{ r1 == 0, pc == 4 }``, the states
> > +  are considered equivalent.
> > +
> > +Read marks propagation for cache hits
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Another important point is handling of read marks when a previously verified
> 
> s/is handling of/is the handling of
> 
> > +state is found in the states cache. All read marks present on registers and
> > +stack slots of the cached state must be propagated over the parentage chain of
> > +the current state. Function ``propagate_liveness()`` handles this case.
> 
> Suggestion: Consider expanding the sentence to include why it's
> necessary to propagate like this. You explain below after the diagram,
> but it might be helpful to clarify here so the reader has some context
> when reading over the diagram.

Well, I figured that the diagram below is the simplest way to describe
why this logic is necessary. I'll try to come up with some condensed
description as you suggest.

> 
> > +
> > +For example, consider the following state parentage chain (S is a
> > +starting state, A-E are derived states, -> arrows show which state is
> > +derived from which)::
> > +
> > +                      r1 read
> > +               <-------------                    A[r1] == 0
> > +                    +---+                        C[r1] == 0
> > +      S ---> A ---> | B | ---> exit              E[r1] == 1
> > +      |             |   |
> > +      ` ---> C ---> | D |
> > +      |             +---+
> > +      ` ---> E        
> > +                      ^
> > +             ^        |___   suppose all these
> > +             |             states are at insn #Y
> > +      suppose all these
> > +    states are at insn #X
> > +
> > +* Chain of states ``S -> A -> B -> exit`` is verified first.
> > +
> > +* While ``B -> exit`` is verified, register ``r1`` is read and this read mark is
> > +  propagated up to state ``A``.
> > +
> > +* When chain of states ``C -> D`` is verified the state ``D`` turns out to be
> > +  equivalent to state ``B``.
> > +
> > +* The read mark for ``r1`` has to be propagated to state ``C``, otherwise state
> > +  ``C`` might get mistakenly marked as equivalent to state ``E`` even though
> > +  values for register ``r1`` differ between ``C`` and ``E``.
> > +
> >  Understanding eBPF verifier messages
> >  ====================================
> >  
> > -- 
> > 2.39.0
> >

next prev parent reply	other threads:[~2023-01-30 23:45 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-30 18:23 [PATCH bpf-next v2 0/1] docs/bpf: Add description of register liveness tracking algorithm Eduard Zingerman
2023-01-30 18:24 ` [PATCH bpf-next v2 1/1] " Eduard Zingerman
2023-01-30 19:59   ` David Vernet
2023-01-30 23:45     ` Eduard Zingerman [this message]
2023-01-31  2:33   ` Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d8a6ac211053c27463c87beeaa74ee511d3ca796.camel@gmail.com \
    --to=eddyz87@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@fb.com \
    --cc=martin.lau@linux.dev \
    --cc=void@manifault.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox