[v1] Introduce load-acquire and store-release BPF instructions

[PATCH RFC bpf-next v1 0/4] Introduce load-acquire and store-release BPF instructions

Posted by Peilin Ye 1 year, 1 month ago

Hi all!

This RFC patchset adds kernel support for BPF load-acquire and store-release
instructions (for background, please see [1]).  Currently only arm64 is
supported for RFC.  The corresponding LLVM changes can be found at:
  https://github.com/llvm/llvm-project/pull/108636

As discussed on GitHub [2], define both load-acquire and store-release as
BPF_STX | BPF_ATOMIC instructions.  The following new flags are introduced:

  BPF_ATOMIC_LOAD    0x10
  BPF_ATOMIC_STORE   0x20

  BPF_RELAXED        0x0
  BPF_ACQUIRE        0x1
  BPF_RELEASE        0x2
  BPF_ACQ_REL        0x3
  BPF_SEQ_CST        0x4

  BPF_LOAD_ACQ       (BPF_ATOMIC_LOAD | BPF_ACQUIRE)
  BPF_STORE_REL      (BPF_ATOMIC_STORE | BPF_RELEASE)

Bits 4-7 of 'imm' encode the new atomic operations (load and store), and bits
0-3 specify the memory order.  A load-acquire is a BPF_STX | BPF_ATOMIC
instruction with 'imm' set to BPF_LOAD_ACQ (0x11).  Similarly, a store-release
is a BPF_STX | BPF_ATOMIC instruction with 'imm' set to BPF_STORE_REL (0x22).
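
The 'imm' layout described above can be sketched in C as follows.  This is
only an illustration: the flag values are copied from the list above, while
the helpers atomic_op() and atomic_order() are hypothetical names that are not
part of this patchset.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as proposed in this RFC. */
#define BPF_ATOMIC_LOAD   0x10
#define BPF_ATOMIC_STORE  0x20

#define BPF_RELAXED       0x0
#define BPF_ACQUIRE       0x1
#define BPF_RELEASE       0x2
#define BPF_ACQ_REL       0x3
#define BPF_SEQ_CST       0x4

#define BPF_LOAD_ACQ      (BPF_ATOMIC_LOAD | BPF_ACQUIRE)
#define BPF_STORE_REL     (BPF_ATOMIC_STORE | BPF_RELEASE)

/* Hypothetical helper: bits 4-7 of 'imm' select the atomic operation. */
static inline int32_t atomic_op(int32_t imm)
{
	return imm & 0xf0;
}

/* Hypothetical helper: bits 0-3 of 'imm' select the memory order. */
static inline int32_t atomic_order(int32_t imm)
{
	return imm & 0x0f;
}

/* The combined values match the ones quoted in the text. */
static_assert(BPF_LOAD_ACQ == 0x11, "load-acquire imm");
static_assert(BPF_STORE_REL == 0x22, "store-release imm");
```

Note that this split only describes the two new operations; it is not meant
as a statement about how bits 0-3 are used by the existing atomic
instructions.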

For bits 4-7 of 'imm', we need to avoid conflicts with existing
BPF_STX | BPF_ATOMIC instructions.  Currently the following values (a subset
of BPFArithOp<>) are in use:

  def BPF_ADD  : BPFArithOp<0x0>;
  def BPF_OR   : BPFArithOp<0x4>;
  def BPF_AND  : BPFArithOp<0x5>;
  def BPF_XOR  : BPFArithOp<0xa>;
  def BPF_XCHG    : BPFArithOp<0xe>;
  def BPF_CMPXCHG : BPFArithOp<0xf>;

0x1 and 0x2 were chosen for the new instructions because:

  * BPFArithOp<0x1> is BPF_SUB.  Compilers already handle atomic subtraction
    by generating a BPF NEG followed by a BPF ADD instruction.

  * BPFArithOp<0x2> is BPF_MUL, and we do not have a plan for adding BPF
    atomic multiplication instructions.

So we think that choosing 0x1 and 0x2 avoids future conflicts with
BPFArithOp<>.  Previously 0xb was chosen because we will never need BPF_MOV
(BPFArithOp<0xb>) for BPF_ATOMIC.  Please let us know if you think different
values should be used.
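
To make the conflict argument concrete, here is a small sketch that checks the
two new values against the BPFArithOp<> values listed above (shifted into
bits 4-7 of 'imm').  The table and the helper conflicts_with_existing() are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bits 4-7 of 'imm' used by existing BPF_STX | BPF_ATOMIC instructions,
 * i.e. the BPFArithOp<> values listed above shifted left by 4.
 */
static const uint8_t existing_ops[] = {
	0x00,	/* BPF_ADD */
	0x40,	/* BPF_OR */
	0x50,	/* BPF_AND */
	0xa0,	/* BPF_XOR */
	0xe0,	/* BPF_XCHG */
	0xf0,	/* BPF_CMPXCHG */
};

/* Proposed in this RFC. */
#define BPF_ATOMIC_LOAD   0x10	/* BPFArithOp<0x1>, the BPF_SUB slot */
#define BPF_ATOMIC_STORE  0x20	/* BPFArithOp<0x2>, the BPF_MUL slot */

/* Hypothetical helper: true if bits 4-7 of 'op' are already taken. */
static bool conflicts_with_existing(uint8_t op)
{
	size_t n = sizeof(existing_ops) / sizeof(existing_ops[0]);

	for (size_t i = 0; i < n; i++)
		if ((op & 0xf0) == existing_ops[i])
			return true;
	return false;
}

/* Usage: neither new value collides with an existing operation. */
void check_new_ops(void)
{
	assert(!conflicts_with_existing(BPF_ATOMIC_LOAD));
	assert(!conflicts_with_existing(BPF_ATOMIC_STORE));
}
```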

Based on [3], the BPF load-acquire, the arm64 JIT compiler generates LDAR
(RCsc) instead of LDAPR (RCpc).  Will Deacon also suggested LDAR over LDAPR in
an offlist conversation for the following reasons:

  a. Not all CPUs support LDAPR, as also pointed out in Paul E. McKenney's
     email (search for "older ARM64 hardware" in [3]).

  b. The extra ordering provided by RCsc is important in some use cases e.g.
     locks.

  c. The arm64 ISA does not provide e.g. other atomic memory operations in
     RCpc.  In other words, it is not worth losing the extra ordering that
     LDAR provides, if we would still be using RCsc for all other cases.

Unlike existing atomic operations that only support BPF_W (32-bit) and
BPF_DW (64-bit) size modifiers, load-acquires and store-releases also
support BPF_B (8-bit) and BPF_H (16-bit).  An 8- or 16-bit load-acquire
zero-extends the value before writing it to a 32-bit register, just like
LDARH and friends.
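
The zero-extension behaviour can be modelled on the host with the
__atomic_load_n() builtin.  This is a sketch of the semantics only, not of the
JIT or the verifier; the model_load_acquire_*() names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model: an unsigned 8-bit acquire load widened to 32 bits.
 * The conversion from uint8_t to uint32_t zero-extends.
 */
static inline uint32_t model_load_acquire_u8(const uint8_t *ptr)
{
	return __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
}

/* Same for 16 bits. */
static inline uint32_t model_load_acquire_u16(const uint16_t *ptr)
{
	return __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
}

/* Usage: the high bit of the narrow value is not propagated upwards. */
void zero_extension_example(void)
{
	uint8_t b = 0xff;
	uint16_t h = 0x8000;

	assert(model_load_acquire_u8(&b) == 0x000000ffu);
	assert(model_load_acquire_u16(&h) == 0x00008000u);
}
```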

Examples of using the new instructions (assuming little-endian):

  long foo(long *ptr) {
      return __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
  }

Using clang -mcpu=v4, foo() can be compiled to:

  db 10 00 00 11 00 00 00  r0 = load_acquire((u64 *)(r1 + 0x0))
  95 00 00 00 00 00 00 00  exit

  opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
  imm (0x00000011): BPF_LOAD_ACQ

For arm64, an LDAR instruction would be generated by the JIT compiler for
the above, e.g.:

  ldar  x7, [x0]

Similarly, consider this 16-bit store-release:

  void bar(short *ptr, short val) {
      __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
  }

bar() can be compiled to (again, using clang -mcpu=v4):

  cb 21 00 00 22 00 00 00  store_release((u16 *)(r1 + 0x0), w2)
  95 00 00 00 00 00 00 00  exit

  opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
  imm (0x00000022): BPF_STORE_REL (BPF_ATOMIC_STORE | BPF_RELEASE)

An STLRH will be generated for it, e.g.:

  stlrh  w1, [x0]
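
The two 8-byte encodings shown above can be reproduced with a short sketch.
The opcode field values (BPF_STX, BPF_ATOMIC, BPF_H, BPF_DW) are the existing
UAPI ones; encode_atomic() is a hypothetical helper that assumes the
little-endian layout used in the examples.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Existing opcode fields from the BPF UAPI headers. */
#define BPF_STX     0x03
#define BPF_ATOMIC  0xc0
#define BPF_H       0x08
#define BPF_DW      0x18

/* Proposed in this RFC. */
#define BPF_LOAD_ACQ   0x11
#define BPF_STORE_REL  0x22

static_assert((BPF_ATOMIC | BPF_DW | BPF_STX) == 0xdb, "64-bit opcode");
static_assert((BPF_ATOMIC | BPF_H | BPF_STX) == 0xcb, "16-bit opcode");

/* Hypothetical helper: encode one BPF_STX | BPF_ATOMIC instruction into
 * its 8-byte little-endian form (opcode, regs, off, imm).
 */
static inline void encode_atomic(uint8_t out[8], uint8_t size,
				 uint8_t dst_reg, uint8_t src_reg,
				 int16_t off, int32_t imm)
{
	uint16_t uoff = (uint16_t)off;
	uint32_t uimm = (uint32_t)imm;

	out[0] = BPF_STX | BPF_ATOMIC | size;
	/* src_reg in the high nibble, dst_reg in the low nibble */
	out[1] = (uint8_t)((src_reg << 4) | (dst_reg & 0xf));
	out[2] = uoff & 0xff;
	out[3] = uoff >> 8;
	out[4] = uimm & 0xff;
	out[5] = (uimm >> 8) & 0xff;
	out[6] = (uimm >> 16) & 0xff;
	out[7] = uimm >> 24;
}
```

For foo(), encode_atomic(insn, BPF_DW, 0, 1, 0, BPF_LOAD_ACQ) yields
"db 10 00 00 11 00 00 00"; for bar(), encode_atomic(insn, BPF_H, 1, 2, 0,
BPF_STORE_REL) yields "cb 21 00 00 22 00 00 00".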

The complete mapping for arm64:

  load-acquire     8-bit  LDARB
 (BPF_LOAD_ACQ)   16-bit  LDARH
                  32-bit  LDAR (32-bit)
                  64-bit  LDAR (64-bit)
  store-release    8-bit  STLRB
 (BPF_STORE_REL)  16-bit  STLRH
                  32-bit  STLR (32-bit)
                  64-bit  STLR (64-bit)
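
The table above can also be written as a lookup function.  This is an
illustrative sketch, not the code in arch/arm64/net/bpf_jit_comp.c;
arm64_mnemonic() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Proposed in this RFC. */
#define BPF_LOAD_ACQ   0x11
#define BPF_STORE_REL  0x22

/* Hypothetical helper: arm64 mnemonic for a given 'imm' and access size
 * in bytes, or NULL if the combination is not in the table.  The 32- and
 * 64-bit cases share a mnemonic and differ only in register width.
 */
static const char *arm64_mnemonic(int imm, int size_bytes)
{
	int is_load = (imm == BPF_LOAD_ACQ);

	if (!is_load && imm != BPF_STORE_REL)
		return NULL;

	switch (size_bytes) {
	case 1:
		return is_load ? "ldarb" : "stlrb";
	case 2:
		return is_load ? "ldarh" : "stlrh";
	case 4:
	case 8:
		return is_load ? "ldar" : "stlr";
	default:
		return NULL;
	}
}

/* Usage: the 16-bit store-release from bar() maps to STLRH. */
void mapping_example(void)
{
	assert(strcmp(arm64_mnemonic(BPF_STORE_REL, 2), "stlrh") == 0);
}
```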

The new instructions can also be used on arena memory.  Inline assembly is
supported as well.  For example:

  asm volatile("%0 = load_acquire((u64 *)(%1 + 0x0))" :
               "=r"(ret) : "r"(ptr) : "memory");

A new pre-defined macro, __BPF_FEATURE_LOAD_ACQ_STORE_REL, can be used to
detect if clang supports BPF load-acquire and store-release.
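
As a sketch of how the macro might be used, the wrapper below takes the
inline-assembly path only when the compiler advertises support.  The name
load_acquire_u64() is hypothetical, and the fallback to __atomic_load_n() is
an assumption chosen so that the sketch also builds with a host compiler; a
real BPF program may prefer to skip the code path entirely when the macro is
undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper around a 64-bit load-acquire. */
static inline uint64_t load_acquire_u64(const uint64_t *ptr)
{
#ifdef __BPF_FEATURE_LOAD_ACQ_STORE_REL
	uint64_t ret;

	/* Same inline assembly as in the example above. */
	asm volatile("%0 = load_acquire((u64 *)(%1 + 0x0))" :
		     "=r"(ret) : "r"(ptr) : "memory");
	return ret;
#else
	/* Assumed fallback: let the compiler pick the instruction. */
	return __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
#endif
}

/* Usage: when the macro is undefined, the fallback path runs. */
void load_acquire_example(void)
{
	uint64_t v = 42;

	assert(load_acquire_u64(&v) == 42);
}
```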

Please refer to individual kernel patches (and LLVM commits) for details.
Any suggestions or corrections would be much appreciated!

[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/
[2] https://github.com/llvm/llvm-project/pull/108636#issuecomment-2389403477
[3] https://lore.kernel.org/bpf/75d1352e-c05e-4fdf-96bf-b1c3daaf41f0@paulmck-laptop/

Thanks,
Peilin Ye (4):
  bpf/verifier: Factor out check_load()
  bpf: Introduce load-acquire and store-release instructions
  selftests/bpf: Delete duplicate verifier/atomic_invalid tests
  selftests/bpf: Add selftests for load-acquire and store-release
    instructions

 arch/arm64/include/asm/insn.h                 |  8 ++
 arch/arm64/lib/insn.c                         | 34 +++++++
 arch/arm64/net/bpf_jit.h                      | 20 +++++
 arch/arm64/net/bpf_jit_comp.c                 | 85 +++++++++++++++++-
 include/linux/filter.h                        |  2 +
 include/uapi/linux/bpf.h                      | 13 +++
 kernel/bpf/core.c                             | 41 ++++++++-
 kernel/bpf/disasm.c                           | 14 +++
 kernel/bpf/verifier.c                         | 88 ++++++++++++-------
 tools/include/uapi/linux/bpf.h                | 13 +++
 .../selftests/bpf/prog_tests/arena_atomics.c  | 61 ++++++++++++-
 .../selftests/bpf/prog_tests/atomics.c        | 57 +++++++++++-
 .../selftests/bpf/progs/arena_atomics.c       | 62 ++++++++++++-
 tools/testing/selftests/bpf/progs/atomics.c   | 62 ++++++++++++-
 .../selftests/bpf/verifier/atomic_invalid.c   | 28 +++---
 .../selftests/bpf/verifier/atomic_load.c      | 71 +++++++++++++++
 .../selftests/bpf/verifier/atomic_store.c     | 70 +++++++++++++++
 17 files changed, 672 insertions(+), 57 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_load.c
 create mode 100644 tools/testing/selftests/bpf/verifier/atomic_store.c

-- 
2.47.1.613.gc27f4b7a9f-goog

Re: [PATCH RFC bpf-next v1 0/4] Introduce load-acquire and store-release BPF instructions

Posted by Peilin Ye 1 year, 1 month ago

By this:

On Sat, Dec 21, 2024 at 01:22:04AM +0000, Peilin Ye wrote:
> Based on [3], the BPF load-acquire, the arm64 JIT compiler generates LDAR
> (RCsc) instead of LDAPR (RCpc).

I actually meant:

  Based on [3], for BPF load-acquire, make the arm64 JIT compiler
  generate LDAR (RCsc) instead of LDAPR (RCpc).

Sorry for any confusion.