Re: [PATCH v3] bpf/verifier: optimize precision backtracking by skipping precise bits

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Yonghong Song <yonghong.song@linux.dev>
To: Qiliang Yuan <realwujing@gmail.com>, eddyz87@gmail.com
Cc: andrii@kernel.org, ast@kernel.org, bpf@vger.kernel.org,
	daniel@iogearbox.net, haoluo@google.com, jolsa@kernel.org,
	kpsingh@kernel.org, linux-kernel@vger.kernel.org,
	martin.lau@linux.dev, sdf@fomichev.me, song@kernel.org,
	yuanql9@chinatelecom.cn
Subject: Re: [PATCH v3] bpf/verifier: optimize precision backtracking by skipping precise bits
Date: Sun, 18 Jan 2026 22:20:07 -0800	[thread overview]
Message-ID: <930f735f-cb35-41b9-9ab9-b3ee558b27ee@linux.dev> (raw)
In-Reply-To: <20260117100922.38459-1-realwujing@gmail.com>



On 1/17/26 2:09 AM, Qiliang Yuan wrote:
> Optimize __mark_chain_precision() by skipping registers or stack slots
> that are already marked as precise. This prevents redundant history
> walks when multiple verification paths converge on the same state.
>
> Centralizing this check in __mark_chain_precision() improves efficiency
> for all entry points (mark_chain_precision, propagate_precision) and
> simplifies call-site logic.
>
> Performance Results (Extreme Stress Test on 32-core system):
> Under system-wide saturation (32 parallel veristat instances) using a
> high-stress backtracking payload (~290k insns, 34k states), this
> optimization demonstrated significant micro-architectural gains:
>
> - Total Retired Instructions:  -82.2 Billion (-1.94%)
> - Total CPU Cycles:            -161.3 Billion (-3.11%)
> - Avg. Insns per Verify:       -17.2 Million (-2.84%)
> - Page Faults:                 -39.90% (Significant reduction in memory pressure)
>
> The massive reduction in page faults suggests that avoiding redundant
> backtracking significantly lowers memory subsystem churn during deep
> state history walks.
>
> Verified that total instruction and state counts (per veristat) remain
> identical across all tests, confirming logic equivalence.
>
> Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
> Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
> ---
> On Fri, Jan 16, 2026 at 11:27 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>> As I said before, this is a useful change.
>>
>>> 4. bpf/verifier: optimize precision backtracking by skipping precise bits
>>>     (https://lore.kernel.org/all/20260115152037.449362-1-realwujing@gmail.com/)
>>>     Following your suggestion to refactor the logic into the core engine for
>>>     better coverage and clarity, I have provided a v2 version of this patch here:
>>>     (https://lore.kernel.org/all/20260116045839.23743-1-realwujing@gmail.com/)
>>>     This v2 version specifically addresses your feedback by centralizing the
>>>     logic and includes a comprehensive performance comparison (veristat results)
>>>     in the commit log. It reduces the complexity of redundant backtracking
>>>     requests from O(D) (where D is history depth) to O(1) by utilizing the
>>>     'precise' flag to skip already-processed states.
>> Same as with #1: using veristat duration metric, especially for such
>> small programs, is not a reasonable performance analysis.
> Link: https://lore.kernel.org/all/75807149f7de7a106db0ccda88e5d4439b94a1e7.camel@gmail.com/
>
> Hi Eduard,
>
> Acknowledged. To provide a more robust performance analysis, I have moved away
> from veristat duration and instead used hardware performance counters (perf stat)
> under system-wide saturation with a custom backtracking stress test. This
> demonstrates the optimization's hardware-level efficiency (retired instructions
> and page faults) more reliably.
>
> Best regards,
> Qiliang
>
> Test case (backtrack_stress.c):
> #include "vmlinux.h"
> #include <bpf/bpf_helpers.h>
>
> struct {
>      __uint(type, BPF_MAP_TYPE_ARRAY);
>      __uint(max_entries, 1);
>      __type(key, __u32);
>      __type(value, __u64);
> } dummy_map SEC(".maps");
>
> SEC("tc")
> int backtrack_stress(struct __sk_buff *skb)
> {
>      __u32 key = 0;
>      __u64 *val = bpf_map_lookup_elem(&dummy_map, &key);
>      if (!val) return 0;
>      __u64 x = *val;
>      
>      /* 1. Create a deep dependency chain to fill history for 'x' */
>      x += 1; x *= 2; x -= 1; x ^= 0x55;
>      x += 1; x *= 2; x -= 1; x ^= 0xAA;
>      x += 1; x *= 2; x -= 1; x ^= 0x55;
>      x += 1; x *= 2; x -= 1; x ^= 0xAA;
>
>      /* 2. Create many states via conditional branches */
> #define CHECK_X(n) if (x == n) { x += 1; } if (x == n + 1) { x -= 1; }
> #define CHECK_X10(n)  CHECK_X(n) CHECK_X(n+2) CHECK_X(n+4) CHECK_X(n+6) CHECK_X(n+8) \
>                        CHECK_X(n+10) CHECK_X(n+12) CHECK_X(n+14) CHECK_X(n+16) CHECK_X(n+18)
> #define CHECK_X100(n) CHECK_X10(n) CHECK_X10(n+20) CHECK_X10(n+40) CHECK_X10(n+60) CHECK_X10(n+80) \
>                        CHECK_X10(n+100) CHECK_X10(n+120) CHECK_X10(n+140) CHECK_X10(n+160) CHECK_X10(n+180)
>
>      CHECK_X100(0)
>      CHECK_X100(200)
>      CHECK_X100(400)
>      CHECK_X100(600)
>      CHECK_X100(800)
>      CHECK_X100(1000)
>
>      /* 3. Trigger mark_chain_precision() multiple times on 'x' */
>      #pragma clang loop unroll(full)
>      for (int i = 0; i < 500; i++) {
>          if (x == (2000 + i)) {
>              x += 1;
>          }
>      }
>
>      return x;
> }
>
> char _license[] SEC("license") = "GPL";
>
> How to Test:
> -----------
> 1. Compile the BPF program (from kernel root):
>     clang -O2 -target bpf \
>           -I./tools/testing/selftests/bpf/ \
>           -I./tools/lib/ \
>           -I./tools/include/uapi/ \
>           -I./tools/testing/selftests/bpf/include \
>           -c backtrack_stress.c -o backtrack_stress.bpf.o
>
> 2. System-wide saturation profiling (32 cores):
>     # Start perf in background
>     sudo perf stat -a -- sleep 60 &
>     # Start 32 parallel loops of veristat
>     for i in {1..32}; do (while true; do ./veristat backtrack_stress.bpf.o > /dev/null; done &); done
>
> Raw Performance Data:
> ---------------------
> Baseline (6.19.0-rc5-baseline, git commit 944aacb68baf):
> File                    Program           Verdict  Duration (us)   Insns  States  Program size  Jited size
> ----------------------  ----------------  -------  -------------  ------  ------  ------------  ----------
> backtrack_stress.bpf.o  backtrack_stress  success         197924  289939   34331          5437       28809
> ----------------------  ----------------  -------  -------------  ------  ------  ------------  ----------
>
>           1,388,149      context-switches                 #    722.5 cs/sec  cs_per_second
>        1,921,399.69 msec cpu-clock                        #     32.0 CPUs  CPUs_utilized
>              25,113      cpu-migrations                   #     13.1 migrations/sec  migrations_per_second
>           8,108,516      page-faults                      #   4220.1 faults/sec  page_faults_per_second
>      97,445,724,421      branch-misses                    #      8.1 %  branch_miss_rate         (50.07%)
>     903,852,287,721      branches                         #    470.4 M/sec  branch_frequency     (66.76%)
>   5,190,519,089,751      cpu-cycles                       #      2.7 GHz  cycles_frequency       (66.81%)
>   4,230,500,391,043      instructions                     #      0.8 instructions  insn_per_cycle  (66.76%)
>   1,853,856,616,836      stalled-cycles-frontend          #     0.36 frontend_cycles_idle        (66.52%)
>
>        60.031936126 seconds time elapsed
>
> Patched (6.19.0-rc5-optimized):
> File                    Program           Verdict  Duration (us)   Insns  States  Program size  Jited size
> ----------------------  ----------------  -------  -------------  ------  ------  ------------  ----------
> backtrack_stress.bpf.o  backtrack_stress  success         214600  289939   34331          5437       28809
> ----------------------  ----------------  -------  -------------  ------  ------  ------------  ----------
>
>           1,433,270      context-switches                 #    745.9 cs/sec  cs_per_second
>        1,921,604.54 msec cpu-clock                        #     32.0 CPUs  CPUs_utilized
>              22,795      cpu-migrations                   #     11.9 migrations/sec  migrations_per_second
>           4,873,895      page-faults                      #   2536.4 faults/sec  page_faults_per_second
>      97,038,959,375      branch-misses                    #      8.1 %  branch_miss_rate         (50.07%)
>     890,170,312,491      branches                         #    463.2 M/sec  branch_frequency     (66.76%)
>   5,029,192,994,167      cpu-cycles                       #      2.6 GHz  cycles_frequency       (66.81%)
>   4,148,237,426,723      instructions                     #      0.8 instructions  insn_per_cycle  (66.77%)
>   1,818,457,318,301      stalled-cycles-frontend          #     0.36 frontend_cycles_idle        (66.51%)
>
>        60.032523872 seconds time elapsed
>
>   kernel/bpf/verifier.c | 27 +++++++++++++++++++++++++--
>   1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3135643d5695..250f1dc0298e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4765,14 +4765,37 @@ static int __mark_chain_precision(struct bpf_verifier_env *env,
>   	 * slot, but don't set precise flag in current state, as precision
>   	 * tracking in the current state is unnecessary.
>   	 */
> -	func = st->frame[bt->frame];
>   	if (regno >= 0) {
> -		reg = &func->regs[regno];
> +		reg = &st->frame[bt->frame]->regs[regno];
>   		if (reg->type != SCALAR_VALUE) {
>   			verifier_bug(env, "backtracking misuse");
>   			return -EFAULT;
>   		}
> +		if (reg->precise)
> +			return 0;
>   		bt_set_reg(bt, regno);
> +	} else {
> +		for (fr = bt->frame; fr >= 0; fr--) {
> +			u32 reg_mask = bt_frame_reg_mask(bt, fr);
> +			u64 stack_mask = bt_frame_stack_mask(bt, fr);
> +			DECLARE_BITMAP(mask, 64);
> +
> +			func = st->frame[fr];
> +			if (reg_mask) {
> +				bitmap_from_u64(mask, reg_mask);
> +				for_each_set_bit(i, mask, 32) {
> +					if (func->regs[i].precise)
> +						bt_clear_frame_reg(bt, fr, i);
> +				}
> +			}
> +			if (stack_mask) {
> +				bitmap_from_u64(mask, stack_mask);
> +				for_each_set_bit(i, mask, 64) {
> +					if (func->stack[i].spilled_ptr.precise)
> +						bt_clear_frame_slot(bt, fr, i);
> +				}
> +			}
> +		}
>   	}
>   
>   	if (bt_empty(bt))

Here is another data point. I applied this patch to latest bpf-next
and adding printk()'s like below:
...
+               if (reg->precise) {
+                       printk("%s %s %d\n", __FILE__, __func__, __LINE__);
+                       return 0;
+               }
...
+               for (fr = bt->frame; fr >= 0; fr--) {
+                       u32 reg_mask = bt_frame_reg_mask(bt, fr);
+                       u64 stack_mask = bt_frame_stack_mask(bt, fr);
+                       DECLARE_BITMAP(mask, 64);
+
+                       func = st->frame[fr];
+                       if (reg_mask) {
+                               bitmap_from_u64(mask, reg_mask);
+                               for_each_set_bit(i, mask, 32) {
+                                       if (func->regs[i].precise) {
+                                               printk("%s %s %d\n", __FILE__, __func__, __LINE__);
+                                               bt_clear_frame_reg(bt, fr, i);
+                                       }
+                               }
+                       }
+                       if (stack_mask) {
+                               bitmap_from_u64(mask, stack_mask);
+                               for_each_set_bit(i, mask, 64) {
+                                       if (func->stack[i].spilled_ptr.precise) {
+                                               printk("%s %s %d\n", __FILE__, __func__, __LINE__);
+                                               bt_clear_frame_slot(bt, fr, i);
+                                       }
+                               }
+                       }

I run through the bpf selftests, and didn't see a single printk dump in the above.
So looks like this patch is a noop for me.

next prev parent reply	other threads:[~2026-01-19  6:20 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-15 15:04 [PATCH] bpf/verifier: optimize precision backtracking by skipping precise bits wujing
2026-01-15 18:52 ` Eduard Zingerman
2026-01-16  4:58   ` [PATCH v2] " Qiliang Yuan
2026-01-17 10:09   ` [PATCH v3] " Qiliang Yuan
2026-01-17 17:12     ` Alexei Starovoitov
2026-01-19  6:20     ` Yonghong Song [this message]
2026-01-19 18:43     ` Eduard Zingerman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=930f735f-cb35-41b9-9ab9-b3ee558b27ee@linux.dev \
    --to=yonghong.song@linux.dev \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=eddyz87@gmail.com \
    --cc=haoluo@google.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=realwujing@gmail.com \
    --cc=sdf@fomichev.me \
    --cc=song@kernel.org \
    --cc=yuanql9@chinatelecom.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.