[RFC 0/2] block: Introduce a BPF-based I/O scheduler
Posted by Chengkaitao 6 days, 2 hours ago
From: Kaitao Cheng <chengkaitao@kylinos.cn>

I have been working on adding a new BPF-based I/O scheduler. It has both
kernel and user-space parts. In kernel space, using per-ctx, I implemented
a simple elevator that exposes a set of BPF hooks. The goal is to move the
policy side of I/O scheduling out of the kernel and into user space, which
should greatly increase flexibility and applicability. To verify that the
whole stack works end to end, I wrote a simple BPF example program. I am
calling this feature the UFQ (User-programmable Flexible Queueing) I/O
scheduler.

This series depends on new BPF functionality that I have already posted to
the BPF community but that is not yet in mainline. Details are in these
two threads:

https://lore.kernel.org/all/20260214124042.62229-1-pilgrimtao@gmail.com/
https://lore.kernel.org/all/20260316112843.78657-1-pilgrimtao@gmail.com/

To try it, you need to apply the patches from those series first.

Note: This is still somewhat experimental. I have only done basic testing,
so there may be bugs or security issues, which I plan to address in
follow-up work. I am also looking for community feedback on whether this
direction and the implementation approach make sense, and what else we
should consider.

Kaitao Cheng (2):
  block: Introduce the UFQ I/O scheduler
  tools/ufq_iosched: add BPF example scheduler and build scaffolding

 block/Kconfig.iosched                         |   8 +
 block/Makefile                                |   1 +
 block/blk-merge.c                             |  49 +-
 block/blk-mq-sched.h                          |   4 +
 block/blk-mq.c                                |   8 +-
 block/blk-mq.h                                |   2 +-
 block/blk.h                                   |   2 +
 block/ufq-bpfops.c                            | 213 +++++++
 block/ufq-iosched.c                           | 526 ++++++++++++++++++
 block/ufq-iosched.h                           |  38 ++
 block/ufq-kfunc.c                             |  91 +++
 tools/ufq_iosched/.gitignore                  |   2 +
 tools/ufq_iosched/Makefile                    | 262 +++++++++
 tools/ufq_iosched/README.md                   | 136 +++++
 .../include/bpf-compat/gnu/stubs.h            |  12 +
 tools/ufq_iosched/include/ufq/common.bpf.h    |  73 +++
 tools/ufq_iosched/include/ufq/common.h        |  91 +++
 tools/ufq_iosched/include/ufq/simple_stat.h   |  21 +
 tools/ufq_iosched/ufq_simple.bpf.c            | 445 +++++++++++++++
 tools/ufq_iosched/ufq_simple.c                | 118 ++++
 20 files changed, 2094 insertions(+), 8 deletions(-)
 create mode 100644 block/ufq-bpfops.c
 create mode 100644 block/ufq-iosched.c
 create mode 100644 block/ufq-iosched.h
 create mode 100644 block/ufq-kfunc.c
 create mode 100644 tools/ufq_iosched/.gitignore
 create mode 100644 tools/ufq_iosched/Makefile
 create mode 100644 tools/ufq_iosched/README.md
 create mode 100644 tools/ufq_iosched/include/bpf-compat/gnu/stubs.h
 create mode 100644 tools/ufq_iosched/include/ufq/common.bpf.h
 create mode 100644 tools/ufq_iosched/include/ufq/common.h
 create mode 100644 tools/ufq_iosched/include/ufq/simple_stat.h
 create mode 100644 tools/ufq_iosched/ufq_simple.bpf.c
 create mode 100644 tools/ufq_iosched/ufq_simple.c

-- 
2.43.0
Re: [RFC 0/2] block: Introduce a BPF-based I/O scheduler
Posted by Bart Van Assche 5 days, 22 hours ago
On 3/27/26 4:47 AM, Chengkaitao wrote:
> I have been working on adding a new BPF-based I/O scheduler. It has both
> kernel and user-space parts. In kernel space, using per-ctx,

Does "ctx" perhaps refer to struct blk_mq_hw_ctx? If so, please use the 
abbreviation "hctx" to prevent confusion with struct blk_mq_ctx (block
layer software queue).

For what type of block devices is this new type of I/O scheduler 
intended? This new type of I/O scheduler is not appropriate for hard
disks. To schedule I/O effectively for hard disks, an I/O scheduler must
be aware of all pending I/O requests. This is why the mq-deadline I/O
scheduler maintains a single list of requests across all hardware
queues.

Additionally, this new I/O scheduler is not appropriate for the fastest
block devices. For very fast block devices, any I/O scheduler incurs a
measurable overhead.

> I implemented
> a simple elevator that exposes a set of BPF hooks. The goal is to move the
> policy side of I/O scheduling out of the kernel and into user space,

What does "into user space" mean in this context? As you know BPF code
runs in kernel context.

Thanks,

Bart.
Re: [RFC 0/2] block: Introduce a BPF-based I/O scheduler
Posted by Chengkaitao 5 days ago
On Fri, Mar 27, 2026 at 11:48 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 3/27/26 4:47 AM, Chengkaitao wrote:
> > I have been working on adding a new BPF-based I/O scheduler. It has both
> > kernel and user-space parts. In kernel space, using per-ctx,
>
> Does "ctx" perhaps refer to struct blk_mq_hw_ctx? If so, please use the
> abbreviation "hctx" to prevent confusion with struct blk_mq_ctx (block
> layer software queue).

Here, ctx refers to struct blk_mq_ctx. The intent is: when no eBPF
policy is attached, the new I/O scheduler behaves like the none
scheduler; when an eBPF policy is attached, the blk_mq_ctx queues
maintained by the new I/O scheduler serve as backup and fallback
for the eBPF program.

> For what type of block devices is this new type of I/O scheduler
> intended? This new type of I/O scheduler is not appropriate for hard
> disks. To schedule I/O effectively for hard disks, an I/O scheduler must
> be aware of all pending I/O requests. This is why the mq-deadline I/O
> scheduler maintains a single list of requests across all hardware
> queues.

This new I/O scheduler targets mechanical hard disks. The scheduler
can be aware of all pending I/O requests, and users can maintain a
single list of requests in an eBPF program.

> Additionally, this new I/O scheduler is not appropriate for the fastest
> block devices. For very fast block devices, any I/O scheduler incurs a
> measurable overhead.

For the fastest block devices, adding no extra scheduling is often the
best policy. In some customized workloads, however, extra policy may be
needed, for example priority scheduling, cgroup-aware differentiation,
or fine-grained metrics. Those scenarios have not been validated with
real demos yet, but the approach seems viable.

> > I implemented
> > a simple elevator that exposes a set of BPF hooks. The goal is to move the
> > policy side of I/O scheduling out of the kernel and into user space,
>
> What does "into user space" mean in this context? As you know BPF code
> runs in kernel context.

Sorry, my wording was inaccurate. It would be more appropriate to
phrase it as "into user-defined BPF programs".

-- 
Yours,
Chengkaitao