ublk_drv is a driver that simply passes all blk-mq rqs to a userspace
target (such as ublksrv[1]). For each ublk queue, there is one
ubq_daemon (pthread). All ubq_daemons share the same process,
which opens /dev/ublkcX. The ubq_daemon code loops forever on
io_uring_enter() to send/receive io_uring cmds which carry
information about blk-mq rqs.
Since the real IO handler (the process/thread opening /dev/ublkcX) is
in userspace, it could crash if:
(1) the user sends it kill -9 because of an IO hang on the backend, a
system reboot, etc.
(2) the process/thread hits an exception (segfault, divide error,
oom...).
Therefore, the kernel driver has to deal with a dying
ubq_daemon or process.
Currently, if one ubq_daemon (pthread) or the process crashes, ublk_drv
must abort the dying ubq, stop the device and delete everything.
This is not a good choice in practice because users do not expect
aborted requests, I/O errors and a deleted device. They may want
a recovery mechanism so that no requests are aborted and no I/O
error occurs. In short, users just want everything to work as usual.
This patchset implements USER_RECOVERY support. If the process
or any ubq_daemon (pthread) crashes (exits unexpectedly), we allow
the user to provide a new process and new ubq_daemons.
Note: The responsibility for recovery belongs to the user who opens
/dev/ublkcX. After a crash, the kernel driver only switches the
device's state so that it is ready for recovery (START_USER_RECOVERY) or
termination (STOP_DEV). This state is defined as UBLK_S_DEV_QUIESCED.
This patchset does not provide a way to detect such a crash in userspace.
The user has many ways to do so. For example, the user may:
(1) send GET_DEV_INFO for a specific dev_id and check whether its state is
UBLK_S_DEV_QUIESCED.
(2) run 'ps' on ublksrv_pid.
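As a rough illustration of option (2), a supervisor could probe the daemon
pid directly instead of parsing 'ps' output. The sketch below is not part of
this patchset; the function name daemon_alive() is hypothetical, and it
assumes the caller already knows ublksrv_pid (for example from GET_DEV_INFO).
It relies only on the POSIX behaviour that kill(pid, 0) performs the
existence and permission checks without delivering a signal.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: return true if a process with this pid exists.
 * kill(pid, 0) sends no signal; it only reports whether one could be sent.
 */
bool daemon_alive(pid_t pid)
{
	if (pid <= 0)
		return false;	/* 0 and negative pids address process groups */
	if (kill(pid, 0) == 0)
		return true;
	/* EPERM means the process exists but we are not allowed to signal it. */
	return errno == EPERM;
}
```

Note that a pid can be reused by an unrelated process, and a killed process
that has not been reaped yet still counts as existing, so checking the device
state as in option (1) is likely the more reliable signal.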
The recovery feature is quite useful for real products. In detail,
we support this scenario:
(1) /dev/ublkc0 is opened by process 0.
(2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
rqs are handled by process 0.
(3) Process 0 suddenly crashes (e.g. segfault).
(4) Fio is still running and submitting IOs (but these IOs cannot
be dispatched now).
(5) The user starts process 1 and attaches it to /dev/ublkc0.
(6) All rqs are now handled by process 1 and IOs can be
completed again.
Note: The backend must tolerate double-write because we re-issue
rqs that were already sent to the old process 0.
We provide a sample script here to simulate the above steps:
***************************script***************************
LOOPS=10
__ublk_get_pid() {
    pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
    echo $pid
}
ublk_recover_kill()
{
    for CNT in `seq $LOOPS`; do
        dmesg -C
        pid=`__ublk_get_pid`
        echo -e "*** kill $pid now ***"
        kill -9 $pid
        sleep 6
        echo -e "*** recover now ***"
        ./ublk recover -n 0
        sleep 6
    done
}
ublk_test()
{
    echo -e "*** add ublk device ***"
    ./ublk add -t null -d 4 -i 1
    sleep 2
    echo -e "*** start fio ***"
    fio --bs=4k \
        --filename=/dev/ublkb0 \
        --runtime=140s \
        --rw=read &
    sleep 4
    ublk_recover_kill
    wait
    echo -e "*** delete ublk device ***"
    ./ublk del -n 0
}
for CNT in `seq 4`; do
    modprobe -rv ublk_drv
    modprobe ublk_drv
    echo -e "************ round $CNT ************"
    ublk_test
    sleep 5
done
***************************script***************************
You may run it with our modified ublksrv[2], which supports the
recovery feature. No I/O error occurs, and you can verify this
by running
$ perf-tools/bin/tpoint block:block_rq_error
The basic idea of USER_RECOVERY is quite straightforward:
(1) quiesce ublk queues and requeue/abort rqs.
(2) release/free everything that belongs to the dying process.
Note: Since ublk_drv does save information about the user process,
this work is important because we don't expect any resource
leakage. In particular, ioucmds from the dying ubq_daemons
need to be completed (freed).
(3) allow new ubq_daemons to issue FETCH_REQ.
Note: ublk_ch_uring_cmd() checks some states and flags. We
have to set them to correct values.
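The requeue/abort choice in (1) can be pictured as a small decision function.
This is only a sketch of one reading of the series, not the driver code: the
SKETCH_* flag bits, the enum and the function name are hypothetical, and the
real flags (UBLK_F_USER_RECOVERY and UBLK_F_USER_RECOVERY_REISSUE) are defined
in include/uapi/linux/ublk_cmd.h with their own values. The assumption made
here is that rqs already issued to userspace are requeued only when the
REISSUE flag is also set, and aborted otherwise.

```c
#include <stdbool.h>

/* Illustrative bits only; not the values used by the uapi header. */
#define SKETCH_F_USER_RECOVERY		(1UL << 0)
#define SKETCH_F_USER_RECOVERY_REISSUE	(1UL << 1)

enum sketch_rq_action { SKETCH_RQ_ABORT, SKETCH_RQ_REQUEUE };

/*
 * Decide what happens to one rq once its ubq_daemon is dying.
 * issued_to_userspace: the rq was already handed to the old daemon.
 */
enum sketch_rq_action sketch_rq_action(unsigned long flags,
				       bool issued_to_userspace)
{
	/* Without recovery, every rq of the dying queue is aborted. */
	if (!(flags & SKETCH_F_USER_RECOVERY))
		return SKETCH_RQ_ABORT;
	/* rqs not yet seen by userspace can safely wait for the new daemon. */
	if (!issued_to_userspace)
		return SKETCH_RQ_REQUEUE;
	/* rqs seen by the old daemon are re-issued only if the user opted in. */
	if (flags & SKETCH_F_USER_RECOVERY_REISSUE)
		return SKETCH_RQ_REQUEUE;
	return SKETCH_RQ_ABORT;
}
```

The last branch is where the double-write requirement on the backend comes
from: a re-issued rq may already have been partially or fully handled by the
old process.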
Here are the steps to recover:
(0) requests dispatched after the corresponding ubq_daemon started dying
are requeued.
(1) monitor_work finds one dying ubq_daemon, schedules quiesce_work,
and requeues/aborts requests that were issued to userspace before the
ubq_daemon started dying.
(2) quiesce_work must (a) quiesce the request queue to ban any incoming
ublk_queue_rq(), (b) wait until all rqs are IDLE, (c) complete old
ioucmds. Then the ublk device is ready for recovery or stop.
(3) The user sends the START_USER_RECOVERY ctrl-cmd to /dev/ublk-control
with a dev_id X (such as 3 for /dev/ublkc3).
(4) Then ublk_drv prepares for a new process to attach to /dev/ublkcX.
All ublk_io structures are cleared and ubq_daemons are reset.
(5) Then, the user should start a new process and ubq_daemons (pthreads) and
send FETCH_REQ via io_uring_enter() to make all ubqs ready. The
user must correctly set up queues, flags and so on (how to persist
the user's information is not related to this patchset).
(6) The user sends the END_USER_RECOVERY ctrl-cmd to /dev/ublk-control with a
dev_id X.
(7) After receiving END_USER_RECOVERY, ublk_drv waits for all ubq_daemons
to get ready. Then it unquiesces the request queue and new rqs are
allowed.
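The ordering constraints in steps (1) to (7) can be summarised as a small
state machine. The model below is a simplified sketch for illustration, not
the driver implementation: every sketch_* and EV_* name is hypothetical, the
separate RECOVERING state is an invention of this model (the cover letter
only names UBLK_S_DEV_QUIESCED), FETCH_REQ is counted once per queue, and
END_USER_RECOVERY is rejected when queues are missing, whereas the driver is
described above as waiting for them.

```c
#include <stdbool.h>

enum sketch_state { SKETCH_LIVE, SKETCH_QUIESCED, SKETCH_RECOVERING };

enum sketch_event {
	EV_DAEMON_DIED,		/* steps (1)-(2): monitor_work + quiesce_work */
	EV_START_RECOVERY,	/* steps (3)-(4): START_USER_RECOVERY */
	EV_FETCH_REQ,		/* step (5): one new ubq_daemon becomes ready */
	EV_END_RECOVERY,	/* steps (6)-(7): END_USER_RECOVERY */
};

struct sketch_dev {
	enum sketch_state state;
	int nr_queues;
	int nr_ready;		/* queues whose new daemon sent FETCH_REQ */
};

/* Apply one event; return false if it is not valid in the current state. */
bool sketch_handle(struct sketch_dev *d, enum sketch_event ev)
{
	switch (ev) {
	case EV_DAEMON_DIED:
		if (d->state != SKETCH_LIVE)
			return false;
		d->state = SKETCH_QUIESCED;
		return true;
	case EV_START_RECOVERY:
		if (d->state != SKETCH_QUIESCED)
			return false;
		d->nr_ready = 0;	/* old daemons are forgotten */
		d->state = SKETCH_RECOVERING;
		return true;
	case EV_FETCH_REQ:
		if (d->state != SKETCH_RECOVERING || d->nr_ready >= d->nr_queues)
			return false;
		d->nr_ready++;
		return true;
	case EV_END_RECOVERY:
		if (d->state != SKETCH_RECOVERING || d->nr_ready != d->nr_queues)
			return false;
		d->state = SKETCH_LIVE;	/* queue unquiesced, new rqs allowed */
		return true;
	}
	return false;
}
```

In this model, a device with two queues goes LIVE, QUIESCED, RECOVERING and
back to LIVE only after both queues have reported ready, which mirrors the
requirement that END_USER_RECOVERY completes once all ubq_daemons are ready.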
You should use the ublksrv[2] and tests[3] provided by us. We add 3 additional
tests to verify that the recovery feature works. Our code will be PR-ed to
Ming's repo soon.
[1] https://github.com/ming1/ubdsrv
[2] https://github.com/old-memories/ubdsrv/tree/recovery-v1
[3] https://github.com/old-memories/ubdsrv/tree/recovery-v1/tests/generic
Since V5:
(1) add mod_delayed_work() into __ublk_abort_rq() wrapper
(2) update documentation on UBLK_F_USER_RECOVERY
Since V4:
(1) remove ublk_cancel_dev() refactor patch
(2) keep START_USER_RECOVERY and END_USER_RECOVERY
(3) avoid UAF on ubq_daemon in monitor_work
(4) add one helper for requeuing/ending rqs
Since V3:
(1) do not kick the requeue list in ublk_queue_rq() or the io_uring fallback wq
with a dying ubq_daemon; instead, kick the list once while unquiescing the dev
(2) add a comment on requeuing rqs in ublk_queue_rq() or the io_uring fallback wq
with a dying ubq_daemon
(3) split support for UBLK_F_USER_RECOVERY_REISSUE into a single patch
(4) let monitor_work abort/requeue rqs issued to userspace instead of
quiesce_work with recovery enabled
(5) always wait until no INFLIGHT rq exists in ublk_quiesce_dev()
(6) move ublk re-init stuff into ublk_ch_release()
(7) let ublk_quiesce_dev() go on as long as one ubq_daemon is dying
(8) add only one ctrl-cmd and rename it as RESTART_DEV
(9) check ub.dev_info->flags instead of iterating on all ubqs
(10) do not disable the recovery feature, but always quiesce the dev in
ublk_stop_dev() and then unquiesce it
(11) add doc on USER_RECOVERY feature
Since V2:
(1) run ublk_quiesce_dev() in a standalone work.
(2) do not run monitor_work after START_USER_RECOVERY is handled.
(3) refactor recovery feature code so that it does not affect current code.
Since V1:
(1) refactor cover letter. Add introduction on "how to detect a crash" and
"why we need recovery feature".
(2) do not refactor task_work and ublk_queue_rq().
(3) allow users freely stop/recover the device.
(4) add comment on ublk_cancel_queue().
(5) refactor monitor_work and the aborting mechanism since we add the recovery
mechanism in monitor_work.
ZiyangZhang (7):
ublk_drv: check 'current' instead of 'ubq_daemon'
ublk_drv: define macros for recovery feature and check them
ublk_drv: requeue rqs with recovery feature enabled
ublk_drv: consider recovery feature in aborting mechanism
ublk_drv: support UBLK_F_USER_RECOVERY_REISSUE
ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support
Documentation: document ublk user recovery feature
Documentation/block/ublk.rst | 36 ++++
drivers/block/ublk_drv.c | 302 ++++++++++++++++++++++++++++++++--
include/uapi/linux/ublk_cmd.h | 8 +-
3 files changed, 328 insertions(+), 18 deletions(-)
--
2.27.0
On 9/23/22 9:39 AM, ZiyangZhang wrote:
> [...]
I'm going to apply 1-6 for 6.1, applying the doc patch is difficult as
it only went into 6.0 past forking off the 6.1 block branch. Would you
mind resending the 7/7 patch once the merge window opens and I've pushed
the previous bits? I may forget otherwise...
--
Jens Axboe
On 2022/9/24 09:12, Jens Axboe wrote:
>
> I'm going to apply 1-6 for 6.1, applying the doc patch is difficult as
> it only went into 6.0 past forking off the 6.1 block branch. Would you
> mind resending the 7/7 patch once the merge window opens and I've pushed
> the previous bits? I may forget otherwise...
OK, I will resend the doc patch.
Regards,
Zhang