[PATCH] bpf/verifier: compress bpf_reg_state by using bitfields
Posted by wujing 3 weeks, 1 day ago
The struct bpf_reg_state is 112 bytes long on 64-bit architectures,
with several fields that have limited ranges. In particular, the bool
'precise' at the end causes 7 bytes of padding.

This patch packs 'frameno', 'subreg_def', and 'precise' into a single
u32/s32 bitfield block:
- frameno: 4 bits (sufficient for MAX_CALL_FRAMES=8)
- subreg_def: 27 bits (sufficient for 1M insns limit)
- precise: 1 bit

This reduces the size of struct bpf_reg_state from 112 to 104 bytes,
saving 8 bytes per register. This also reduces the size of
struct bpf_stack_state. Overall, it reduces peak memory usage of the
verifier for complex programs with millions of states.

The patch also updates states_maybe_looping() to use a non-bitfield
boundary for memcmp(), as offsetof() cannot be used on bitfields.

Signed-off-by: wujing <realwujing@gmail.com>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
---
 include/linux/bpf_verifier.h | 6 +++---
 kernel/bpf/verifier.c        | 6 ++++--
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 562f7e63be29..0f83360afb03 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -196,14 +196,14 @@ struct bpf_reg_state {
 	 * is used which is an index in bpf_verifier_state->frame[] array
 	 * pointing to bpf_func_state.
 	 */
-	u32 frameno;
+	u32 frameno:4;
 	/* Tracks subreg definition. The stored value is the insn_idx of the
 	 * writing insn. This is safe because subreg_def is used before any insn
 	 * patching which only happens after main verification finished.
 	 */
-	s32 subreg_def;
+	s32 subreg_def:27;
 	/* if (!precise && SCALAR_VALUE) min/max/tnum don't affect safety */
-	bool precise;
+	bool precise:1;
 };
 
 enum bpf_stack_slot_type {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 378341e1177f..5e5b831a518c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -19641,10 +19641,12 @@ static bool states_maybe_looping(struct bpf_verifier_state *old,
 
 	fold = old->frame[fr];
 	fcur = cur->frame[fr];
-	for (i = 0; i < MAX_BPF_REG; i++)
+	for (i = 0; i < MAX_BPF_REG; i++) {
 		if (memcmp(&fold->regs[i], &fcur->regs[i],
-			   offsetof(struct bpf_reg_state, frameno)))
+			   offsetof(struct bpf_reg_state, ref_obj_id) +
+			   sizeof(fold->regs[i].ref_obj_id)))
 			return false;
+	}
 	return true;
 }
 
-- 
2.39.5
Re: [PATCH] bpf/verifier: compress bpf_reg_state by using bitfields
Posted by Eduard Zingerman 3 weeks, 1 day ago
On Thu, 2026-01-15 at 23:20 +0800, wujing wrote:
> The struct bpf_reg_state is 112 bytes long on 64-bit architectures,
> with several fields that have limited ranges. In particular, the bool
> 'precise' at the end causes 7 bytes of padding.
> 
> This patch packs 'frameno', 'subreg_def', and 'precise' into a single
> u32/s32 bitfield block:
> - frameno: 4 bits (sufficient for MAX_CALL_FRAMES=8)
> - subreg_def: 27 bits (sufficient for 1M insns limit)
> - precise: 1 bit
> 
> This reduces the size of struct bpf_reg_state from 112 to 104 bytes,
> saving 8 bytes per register. This also reduces the size of
> struct bpf_stack_state. Overall, it reduces peak memory usage of the
> verifier for complex programs with millions of states.
> 
> The patch also updates states_maybe_looping() to use a non-bitfield
> boundary for memcmp(), as offsetof() cannot be used on bitfields.
> 
> Signed-off-by: wujing <realwujing@gmail.com>
> Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
> ---

veristat collects verifier memory usage statistics.
Does this change have an impact on programs generated for
e.g. selftests and sched_ext?

In general, you posted 4 patches claiming performance improvements,
but none of them are supported by any measurements.

P.S.
Is this LLM-generated?
Re: [PATCH] bpf/verifier: compress bpf_reg_state by using bitfields
Posted by Qiliang Yuan 3 weeks, 1 day ago
Hi Eduard,

On Thu, Jan 15, 2026, Eduard Zingerman wrote:
> veristat collects verifier memory usage statistics.
> Does this change have an impact on programs generated for
> e.g. selftests and sched_ext?
> 
> In general, you posted 4 patches claiming performance improvements,
> but none of them are supported by any measurements.
> 
> P.S.
> Is this LLM-generated?

Thank you for the feedback. I would like to clarify that these optimizations
are the result of a deliberate engineering effort to address specific
performance bottlenecks in the BPF verifier. These improvements were identified
through my personal code analysis over the past two months, though I have only
recently started submitting them to the community.

Regarding the impact on selftests and sched_ext: I have verified these changes
using 'veristat' against the BPF selftests. Since these optimizations target
the core verifier engine and structural layout, they benefit any complex BPF
program, including those in sched_ext. The results show a clear reduction in
verification duration (up to 56%) and peak memory usage (due to the reduction of
struct bpf_reg_state from 112 to 104 bytes), with zero changes in the total
instruction or state counts. This confirms that the verification logic remains
identical while resource efficiency is significantly improved.

The specific order and context of the four patches are as follows:

1. bpf/verifier: implement slab cache for verifier state list
   (https://lore.kernel.org/all/tencent_0074C23A28B59EA264C502FA3C9EF6622A0A@qq.com/)
   Focuses on reducing allocation overhead. Detailed benchmark results added in:
   (https://lore.kernel.org/all/tencent_9C541313B9B3C381AB950BC531F6C627ED05@qq.com/)

2. bpf/verifier: compress bpf_reg_state by using bitfields
   (https://lore.kernel.org/all/20260115144946.439069-1-realwujing@gmail.com/)
   This is a structural memory optimization. By packing 'frameno', 'subreg_def',
   and 'precise' into bitfields, we eliminated 7 bytes of padding, reducing
   the struct size from 112 to 104 bytes. This is a deterministic memory
   saving based on object layout, which is particularly effective for
   large-scale verification states.

3. bpf/verifier: optimize ID mapping reset in states_equal
   (https://lore.kernel.org/all/20260115150405.443581-1-realwujing@gmail.com/)
   This is an algorithmic optimization similar to memoization. By tracking the
   high-water mark of used IDs, it avoids a full 4.7KB memset in every
   states_equal() call. This reduces the complexity of resetting the ID map
   from O(MAX_SIZE) to O(ACTUAL_USED), which significantly speeds up state
   pruning during complex verification.

4. bpf/verifier: optimize precision backtracking by skipping precise bits
   (https://lore.kernel.org/all/20260115152037.449362-1-realwujing@gmail.com/)
   Following your suggestion to refactor the logic into the core engine for
   better coverage and clarity, I have provided a v2 version of this patch here:
   (https://lore.kernel.org/all/20260116045839.23743-1-realwujing@gmail.com/)
   This v2 version specifically addresses your feedback by centralizing the
   logic and includes a comprehensive performance comparison (veristat results)
   in the commit log. It reduces the complexity of redundant backtracking
   requests from O(D) (where D is history depth) to O(1) by utilizing the
   'precise' flag to skip already-processed states.

Best regards,

Qiliang Yuan
Re: [PATCH] bpf/verifier: compress bpf_reg_state by using bitfields
Posted by Eduard Zingerman 3 weeks, 1 day ago
On Fri, 2026-01-16 at 14:07 +0800, Qiliang Yuan wrote:
> Hi Eduard,
> 
> On Thu, Jan 15, 2026, Eduard Zingerman wrote:
> > veristat collects verifier memory usage statistics.
> > Does this change have an impact on programs generated for
> > e.g. selftests and sched_ext?
> > 
> > In general, you posted 4 patches claiming performance improvements,
> > but none of them are supported by any measurements.
> > 
> > P.S.
> > Is this LLM-generated?
> 
> Thank you for the feedback. I would like to clarify that these optimizations
> are the result of a deliberate engineering effort to address specific
> performance bottlenecks in the BPF verifier. These improvements were identified
> through my personal code analysis over the past two months, though I have only
> recently started submitting them to the community.
> 
> Regarding the impact on selftests and sched_ext: I have verified these changes
> using 'veristat' against the BPF selftests. Since these optimizations target
> the core verifier engine and structural layout, they benefit any complex BPF
> program, including those in sched_ext. The results show a clear reduction in
> verification duration (up to 56%) and peak memory usage (due to the reduction of
> struct bpf_reg_state from 112 to 104 bytes), with zero changes in the total
> instruction or state counts. This confirms that the verification logic remains
> identical while resource efficiency is significantly improved.
> 
> The specific order and context of the four patches are as follows:
> 
> 1. bpf/verifier: implement slab cache for verifier state list
>    (https://lore.kernel.org/all/tencent_0074C23A28B59EA264C502FA3C9EF6622A0A@qq.com/)
>    Focuses on reducing allocation overhead. Detailed benchmark results added in:
>    (https://lore.kernel.org/all/tencent_9C541313B9B3C381AB950BC531F6C627ED05@qq.com/)

From that report:

>  arena_strsearch                 121 us               64 us               -47.11%
>  bpf_loop:stack_check            747 us               469 us              -37.22%
>  bpf_loop:test_prog              519 us               386 us              -25.63%
>  bpf_loop:prog_null_ctx          202 us               162 us              -19.80%

Instructions processed from verifier from the report:
- arena_strsearch:        20
- bpf_loop:stack_check:   64
- bpf_loop:test_prog:     325
- bpf_loop:prog_null_ctx: 22

With such a small number of instructions processed, the results are
nothing more than a random fluke. To get more or less reasonable
impact measurements, please use the 'perf' tool on programs where the
verifier needs to process tens or hundreds of thousands of instructions.

> 2. bpf/verifier: compress bpf_reg_state by using bitfields
>    (https://lore.kernel.org/all/20260115144946.439069-1-realwujing@gmail.com/)
>    This is a structural memory optimization. By packing 'frameno', 'subreg_def',
>    and 'precise' into bitfields, we eliminated 7 bytes of padding, reducing
>    the struct size from 112 to 104 bytes. This is a deterministic memory
>    saving based on object layout, which is particularly effective for
>    large-scale verification states.

For this optimization veristat is a reasonable tool to use; it can
track a 'mem_peak' value.

> 3. bpf/verifier: optimize ID mapping reset in states_equal
>    (https://lore.kernel.org/all/20260115150405.443581-1-realwujing@gmail.com/)
>    This is an algorithmic optimization similar to memoization. By tracking the
>    high-water mark of used IDs, it avoids a full 4.7KB memset in every
>    states_equal() call. This reduces the complexity of resetting the ID map
>    from O(MAX_SIZE) to O(ACTUAL_USED), which significantly speeds up state
>    pruning during complex verification.

As I said before, this is a useful change.

> 4. bpf/verifier: optimize precision backtracking by skipping precise bits
>    (https://lore.kernel.org/all/20260115152037.449362-1-realwujing@gmail.com/)
>    Following your suggestion to refactor the logic into the core engine for
>    better coverage and clarity, I have provided a v2 version of this patch here:
>    (https://lore.kernel.org/all/20260116045839.23743-1-realwujing@gmail.com/)
>    This v2 version specifically addresses your feedback by centralizing the
>    logic and includes a comprehensive performance comparison (veristat results)
>    in the commit log. It reduces the complexity of redundant backtracking
>    requests from O(D) (where D is history depth) to O(1) by utilizing the
>    'precise' flag to skip already-processed states.

Same as with #1: using veristat duration metric, especially for such
small programs, is not a reasonable performance analysis.

> Best regards,
> 
> Qiliang Yuan