Hi, all

We implement virtio-fs crash reconnection in this patchset. The crash
reconnection of virtiofsd here is completely transparent to the guest: no
remount in the guest is needed, and even the inflight requests can be
handled normally after reconnection. We are looking forward to any comments.

Thanks,
Jiachen


OVERVIEW:

To support virtio-fs crash reconnection, we need to support the recovery
of 1) inflight FUSE requests, and 2) virtiofsd internal status information.

Fortunately, QEMU's vhost-user reconnection framework already supports
inflight I/O tracking by using VHOST_USER_GET_INFLIGHT_FD and
VHOST_USER_SET_INFLIGHT_FD (see 5ad204bf2 and 5f9ff1eff for details).
As the FUSE requests are transferred by virtqueue I/O requests, by using
the vhost-user inflight I/O tracking, we can recover the inflight FUSE
requests.

To support virtiofsd internal status recovery, we introduce 4 new
vhost-user message types. As shown in the following diagram, two of them
are used to persist shared lo_maps and opened fds to QEMU, and the other
two message types are used to restore the status when reconnecting.

                        VHOST_USER_SLAVE_SHM
                        VHOST_USER_SLAVE_FD
    +--------------+       Persist       +--------------------+
    |              | <-------------------+                    |
    |     QEMU     |                     |  Virtio-fs Daemon  |
    |              | +------------------->                    |
    +--------------+       Restore       +--------------------+
                        VHOST_USER_SET_SHM
                        VHOST_USER_SET_FD

Although the 4 newly added message types are introduced to support
virtiofsd reconnection in this patchset, they may be useful in other
situations as well, so we kept them general when adding them to the
vhost-related source files. VHOST_USER_SLAVE_SHM and VHOST_USER_SET_SHM
can be used for memory sharing between a vhost-user daemon and QEMU, and
VHOST_USER_SLAVE_FD and VHOST_USER_SET_FD would be useful if we want to
share opened fds between the QEMU process and a vhost-user daemon process.


USAGE and NOTES:

- The commits are rebased to a recent QEMU master commit b4d939133dca0fa2b.
- ",reconnect=1" should be added to the "-chardev socket" of
  vhost-user-fs-pci in the QEMU command line, for example:

    qemu-system-x86_64 ... \
        -chardev socket,id=char0,path=/tmp/vhostqemu,reconnect=1 \
        -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs \
        ...

- We add new options for virtiofsd to enable or disable crash
  reconnection. Some options are not supported by crash reconnection, so
  the following options are needed to enable reconnection:

    virtiofsd ... -o reconnect -o no_mount_ns -o no_flock -o no_posix_lock
        -o no_xattr ...

- The reasons why virtiofsd-side locking, extended attributes, and mount
  namespace are not supported are explained in the commit message of the
  6th patch (virtiofsd: Add two new options for crash reconnection).

- The 9th patch is a work-around that will not affect the overall
  correctness. We removed the qsort-related code because we found that
  when resubmit_num is larger than 64, seccomp will kill the virtiofsd
  process.

- Support for the DAX version of virtiofsd is very possible and requires
  almost no additional change to this patchset.
Jiachen Zhang (9):
  vhost-user-fs: Add support for reconnection of vhost-user-fs backend
  vhost: Add vhost-user message types for sending shared memory and file
    fds
  vhost-user-fs: Support virtiofsd crash reconnection
  libvhost-user: Add vhost-user message types for sending shared memory
    and file fds
  virtiofsd: Convert the struct lo_map array to a more flatten layout
  virtiofsd: Add two new options for crash reconnection
  virtiofsd: Persist/restore lo_map and opened fds to/from QEMU
  virtiofsd: Ensure crash consistency after reconnection
  virtiofsd: (work around) Comment qsort in inflight I/O tracking

 contrib/libvhost-user/libvhost-user.c | 106 +++-
 contrib/libvhost-user/libvhost-user.h |  70 +++
 docs/interop/vhost-user.rst           |  41 ++
 hw/virtio/vhost-user-fs.c             | 334 ++++++++++-
 hw/virtio/vhost-user.c                | 123 ++++
 hw/virtio/vhost.c                     |  42 ++
 include/hw/virtio/vhost-backend.h     |   6 +
 include/hw/virtio/vhost-user-fs.h     |  16 +-
 include/hw/virtio/vhost.h             |  42 ++
 tools/virtiofsd/fuse_lowlevel.c       |  24 +-
 tools/virtiofsd/fuse_virtio.c         |  44 ++
 tools/virtiofsd/fuse_virtio.h         |   1 +
 tools/virtiofsd/helper.c              |   9 +
 tools/virtiofsd/passthrough_helpers.h |   2 +-
 tools/virtiofsd/passthrough_ll.c      | 830 ++++++++++++++++++--------
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 16 files changed, 1413 insertions(+), 278 deletions(-)

--
2.20.1
Patchew URL: https://patchew.org/QEMU/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com/

Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com
Subject: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]  patchew/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com -> patchew/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com
 * [new tag]  patchew/20201215174824.76017-1-richard.henderson@linaro.org -> patchew/20201215174824.76017-1-richard.henderson@linaro.org
Switched to a new branch 'test'
c6015f6 virtiofsd: (work around) Comment qsort in inflight I/O tracking
a285357 virtiofsd: Ensure crash consistency after reconnection
d5f69c8 virtiofsd: Persist/restore lo_map and opened fds to/from QEMU
4e0a56d virtiofsd: Add two new options for crash reconnection
d4b638e virtiofsd: Convert the struct lo_map array to a more flatten layout
6dadec2 libvhost-user: Add vhost-user message types for sending shared memory and file fds
91f265b vhost-user-fs: Support virtiofsd crash reconnection
bb587ea vhost: Add vhost-user message types for sending shared memory and file fds
ccb8eca vhost-user-fs: Add support for reconnection of vhost-user-fs backend

=== OUTPUT BEGIN ===
1/9 Checking commit ccb8eca38f55 (vhost-user-fs: Add support for reconnection of vhost-user-fs backend)
2/9 Checking commit bb587eaf556e (vhost: Add vhost-user message types for sending shared memory and file fds)
3/9 Checking commit 91f265b55b12 (vhost-user-fs: Support virtiofsd crash reconnection)
4/9 Checking commit 6dadec226d05 (libvhost-user: Add vhost-user message types for sending shared memory and file fds)
5/9 Checking commit d4b638e34242 (virtiofsd: Convert the struct lo_map array to a more flatten layout)
ERROR: use qemu_real_host_page_size instead of getpagesize()
#195: FILE: tools/virtiofsd/passthrough_ll.c:383:
+    int page_size = getpagesize();

total: 1 errors, 0 warnings, 997 lines checked

Patch 5/9 has style problems, please review.  If any of these errors are
false positives report them to the maintainer, see CHECKPATCH in
MAINTAINERS.
6/9 Checking commit 4e0a56d0217d (virtiofsd: Add two new options for crash reconnection)
7/9 Checking commit d5f69c8465ac (virtiofsd: Persist/restore lo_map and opened fds to/from QEMU)
8/9 Checking commit a285357c7bff (virtiofsd: Ensure crash consistency after reconnection)
9/9 Checking commit c6015f602fb0 (virtiofsd: (work around) Comment qsort in inflight I/O tracking)
=== OUTPUT END ===

Test command exited with code: 1

The full log is available at
http://patchew.org/logs/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Hi all,

We are going to develop the v2 patch for virtio-fs crash reconnection. As
suggested by Marc-André and Stefan, except for the inflight I/O tracking
log area, all the other internal statuses of virtiofsd will be saved to
places other than QEMU. Specifically, the three lo_maps (ino_map,
dirp_map, and fd_map) could be saved to several mmapped files, and the
opened fds could be saved to systemd. I'd like to get some feedback on our
further thoughts before we work on the revision.

1. What about saving the opened fds as file handles to the host kernel by
default, instead of saving them to systemd? After some internal
discussion, we think introducing systemd may add more uncertainty to the
system, as we need to create one service for each daemon, and all the
daemons may suffer from a single point of failure in the systemd process.

2. Like the btree map implementation (multikey.rs) of virtiofsd-rs, what
about splitting the flattened lo_map implementation, which supports being
persisted to files, out of passthrough_ll.c into a new separate source
file? This way, maybe we can more easily wrap it with some Rust-compatible
interfaces, and enable crash recovery for virtiofsd-rs based on it.

3. What about dropping the dirp_map and integrating the opened directory
fds into fd_map? The virtiofsd-rs implementation only has two maps (inodes
and handles). In the C version, dirp_map may also be unnecessary.

All the best,
Jiachen

On Wed, Dec 16, 2020 at 12:21 AM Jiachen Zhang <
zhangjiachen.jaycee@bytedance.com> wrote:

> Hi, all
>
> We implement virtio-fs crash reconnection in this patchset. The crash
> reconnection of virtiofsd here is completely transparent to guest, no
> remount in guest is needed, even the inflight requests can be handled
> normally after reconnection. We are looking forward to any comments.
> [...]
On Mon, May 10, 2021 at 10:38:05PM +0800, Jiachen Zhang wrote:
> Hi all,
>
> We are going to develop the v2 patch for virtio-fs crash reconnection. As
> suggested by Marc-André and Stefan, except for the inflight I/O tracking
> log area, all the other internal statuses of virtiofsd will be saved to
> some places other than QEMU. Specifically, the three lo_maps (ino_map,
> dirp_map, and fd_map) could be saved to several mmapped files, and the
> opened fds could be saved to systemd. I'd like to get some feedback on our
> further thoughts before we work on the revision.
>
> 1. What about by default save the opened fds as file handles to host
> kernel, instead of saving them to systemd. After some internal discussion,
> we think introducing systemd may introduce more uncertainness to the
> system, as we need to create one service for each daemon, and all the
> daemons may suffer the single-point failure of the systemd process.

I don't think saving file handles works 100%. The difference between an
open fd and a file handle is what happens when the inode is deleted. If an
external process deletes the inode during restart, then an open fd keeps
the inode alive, while a file handle becomes stale and the inode is gone.

Regarding systemd, it's pid 1 and cannot die - otherwise the system is
broken. But in any case I think there are multiple options here. Whether
you choose systemd, implement the sd_notify(3) protocol in your own parent
process, or take a different approach like a parent process with clone(2)
CLONE_FILES to avoid the communication overhead of saving every fd, I
think all of those approaches would be reasonable.

> 2. Like the btree map implementation (multikey.rs) of virtiofsd-rs, what
> about splitting the flatten lo_map implementation, which supports to be
> persisted to files, from passhtrough_ll.c to a new separated source file.
> This way, maybe we can more easily wrap it with some Rust compatible
> interfaces, and enable crash recovery for virtiofsd-rs based on it.

In the past two months I've noticed the number of virtiofsd-rs merge
requests has increased and I think the trend is that new development is
focussing on virtiofsd-rs. If it fits into your plans then focussing on
virtiofsd-rs would be fine, and then there is no need to worry about
Rust-compatible interfaces for the C virtiofsd.

> 3. What about dropping the dirp_map, and integrate the opened directory fds
> to fd_map. The virtiofsd-rs implementation only has two maps (inodes and
> handles). In the C version, dirp_map may also unnecessary.

Maybe, but carefully: the purpose of the maps is to safely isolate the
client from virtiofsd's internal objects. The way I remember it is that
the C virtiofsd has a separate dirp map to prevent type confusion between
regular open files and directories. The client must not trick the server
into calling readdir(3) on something that's not a directory stream (DIR *),
because that could be a security issue.

However, it's possible that virtiofsd-rs is able to combine the two
because it uses syscall APIs on file descriptors instead of libc
opendir(3), so there is no possibility of type confusion. The syscall
would simply fail if the file descriptor is not O_DIRECTORY.

Stefan
Hi

On Tue, Dec 15, 2020 at 8:22 PM Jiachen Zhang <
zhangjiachen.jaycee@bytedance.com> wrote:

> [...]
>
> Although the 4 newly added message types are to support virtiofsd
> reconnection in this patchset, it might be potential in other situation.
> So we keep in mind to make them more general when add them to vhost
> related source files. VHOST_USER_SLAVE_SHM and VHOST_USER_SET_SHM can be
> used for memory sharing between a vhost-user daemon and QEMU,
> VHOST_USER_SLAVE_FD and VHOST_USER_SET_FD would be useful if we want to
> shared opened fds between QEMU process and vhost-user daemon process.
>

Before adding new messages to the already complex vhost-user protocol, can
we evaluate other options?

First thing that came to my mind is that the memory state could be saved
to disk or with a POSIX shared memory object.

Eventually, the protocol could just pass around the fds, and not make a
special treatment for shared memory.

Then I remembered systemd has a pretty good API & protocol for this sort
of thing: sd_notify(3) (afaik, it is quite easy to implement a minimal
handler). You can store fds with FDSTORE=1 (with an optional associated
FDNAME), and use sd_listen_fds() & others to get them back (note: passed
by inheritance only, I think). systemd seems to not make shm a special
case either, just treating it like any other opened fd to restore.

If we consider that backend processes are going to be managed by libvirt
or even a systemd service, is it a better alternative? sd_notify() offers
a number of interesting features as well to monitor services.

> [...]

--
Marc-André Lureau
On Wed, Dec 16, 2020 at 11:36 PM Marc-André Lureau <
marcandre.lureau@gmail.com> wrote:

> Hi
>
> [...]
>
> Before adding new messages to the already complex vhost-user protocol, can
> we evaluate other options?
>
> First thing that came to my mind is that the memory state could be saved
> to disk or with a POSIX shared memory object.
>
> Eventually, the protocol could just pass around the fds, and not make a
> special treatment for shared memory.
>
> Then I remember systemd has a pretty good API & protocol for this sort of
> thing: sd_notify(3) (afaik, it is quite easy to implement a minimal
> handler)
>
> You can store fds with FDSTORE=1 (with an optional associated FDNAME).
> sd_listen_fds() & others to get them back (note: passed by inheritance
> only I think). systemd seems to not make shm a special case either, just
> treat it like an opened fd to restore.
>
> If we consider backend processes are going to be managed by libvirt or
> even a systemd service, is it a better alternative? sd_notify() offers a
> number of interesting features as well to monitor services.
>

Thanks for the suggestions. Actually, we chose to save all state
information to QEMU because a virtiofsd has the same lifecycle as its QEMU
master. However, saving things to a file does avoid communication with
QEMU, and we would no longer need to increase the complexity of the
vhost-user protocol. The suggestion to save fds to systemd is also very
reasonable if we don't consider the lifecycle issues; we will try it.

All the best,
Jiachen

> [...]
On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> Thanks for the suggestions. Actually, we choose to save all state
> information to QEMU because a virtiofsd has the same lifecycle as its
> QEMU master. However, saving things to a file do avoid communication with
> QEMU, and we no longer need to increase the complexity of vhost-user
> protocol. The suggestion to save fds to the systemd is also very
> reasonable if we don't consider the lifecycle issues, we will try it.

Hi,
We recently discussed crash recovery in the virtio-fs bi-weekly call and
I read some of this email thread because it's a topic I'm interested in.

I agree with Marc-André that storing file descriptors does not need to be
in the vhost-user protocol. The lifetime of a vhost-user device backend is
not controlled by the VMM, since the device backend is launched
separately. Therefore it's reasonable for the component that launched the
device backend to also have the responsibility of cleaning up the
vhost-user device backend.

Using the sd_notify(3) interface is a neat idea. It's supported natively
by systemd, but you can also implement a compatible interface in your own
software. This way the vhost-user device backend can be launched using
systemd or your own software.

That said, if people find it more convenient to store fds using the
vhost-user protocol, then I think that is enough justification to add a
generic message to the vhost-user protocol. The important thing is to make
the message generic so it solves all crash recovery use cases. The
inflight fd messages were too specific, and now we're having to think
about adding more messages again.

Stefan
On Mittwoch, 17. März 2021 11:05:32 CET Stefan Hajnoczi wrote:
> On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > Thanks for the suggestions. Actually, we choose to save all state
> > information to QEMU because a virtiofsd has the same lifecycle as its
> > QEMU master. However, saving things to a file do avoid communication
> > with QEMU, and we no longer need to increase the complexity of
> > vhost-user protocol. The suggestion to save fds to the systemd is also
> > very reasonable if we don't consider the lifecycle issues, we will try
> > it.
>
> Hi,
> We recently discussed crash recovery in the virtio-fs bi-weekly call and
> I read some of this email thread because it's a topic I'm interested in.

I just had a quick fly-over of the patches so far. Shouldn't there be some
kind of constraint on an automatic reconnection feature after a crash, to
prevent it from being exploited by ROP brute-force attacks?

E.g. adding some (maybe continuously increasing) delay and/or limiting the
number of reconnects within a certain time frame would come to my mind.

Best regards,
Christian Schoenebeck
On Wed, Mar 17, 2021 at 7:50 PM Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> I just had a quick fly over the patches so far. Shouldn't there be some
> kind of constraint for an automatic reconnection feature after a crash to
> prevent this being exploited by ROP brute force attacks?
>
> E.g. adding some (maybe continuously increasing) delay and/or limiting the
> amount of reconnects within a certain time frame would come to my mind.
>
> Best regards,
> Christian Schoenebeck

Thanks, Christian. I am still trying to figure out the details of the ROP
attacks.

However, QEMU's vhost-user reconnection is based on chardev socket
reconnection. The socket reconnection can be enabled with the "--chardev
socket,...,reconnect=N" QEMU command-line option, where N means QEMU will
try to reconnect to the disconnected socket every N seconds. We can
increase N to increase the reconnect delay. If we want to change the
reconnect delay dynamically, I think we should change the chardev socket
reconnection code, as it is a more generic mechanism than the
vhost-user-fs device or the vhost-user backend.

By the way, I also considered the socket reconnection delay time in the
performance aspect. As the reconnection delay increases, if an application
in the guest is doing I/Os, it will suffer larger tail latency. And for
now, the smallest delay is 1 second, which is rather large for
high-performance virtual I/O devices today. I think maybe a more performant
and safer reconnect delay adjustment mechanism should be considered in the
future. What are your thoughts?

Jiachen
On Mittwoch, 17. März 2021 13:57:47 CET Jiachen Zhang wrote:
> By the way, I also considered the socket reconnection delay time in the
> performance aspect. As the reconnection delay increases, if an application
> in the guest is doing I/Os, it will suffer larger tail latency. And for
> now, the smallest delay is 1 second, which is rather large for
> high-performance virtual I/O devices today. I think maybe a more performant
> and safer reconnect delay adjustment mechanism should be considered in the
> future. What are your thoughts?

So with N=1 an attacker could e.g. bypass a 16-bit PAC by brute force in
~18 hours (e.g. on Arm if PAC + MTE was enabled). With a 24-bit PAC (no
MTE) it would be ~194 days. Independent of which architecture and defence
mechanism is used, there is always the possibility that some kind of side
channel attack exists that requires a much lower number of attempts.

So in an untrusted environment I would personally limit the number of
automatic reconnects and rather accept some downtime for further
investigation if a suspiciously high number of crashes happened.

And yes, if a dynamic delay scheme is deployed in the future, then
starting with a value smaller than 1 second would make sense.

Best regards,
Christian Schoenebeck
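For reference, the arithmetic behind these estimates, assuming one reconnect attempt per second (N=1) and a worst-case exhaustive search of a k-bit PAC space:

```c
/* Worst case: 2^k attempts at one attempt per second. */
static double brute_force_hours(unsigned bits)
{
    return (double)(1UL << bits) / 3600.0;   /* seconds to hours */
}

static double brute_force_days(unsigned bits)
{
    return (double)(1UL << bits) / 86400.0;  /* seconds to days */
}
/* brute_force_hours(16) is ~18.2 hours and brute_force_days(24) is
 * ~194.2 days, matching the figures above. */
```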
On Wed, Mar 17, 2021 at 6:05 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> Using the sd_notify(3) interface is a neat idea. It's supported natively
> by systemd but you can also implement a compatible interface in your own
> software. This way the vhost-user device backend can be launched using
> systemd or your own software.
>
> That said, if people find it more convenient to store fds using the
> vhost-user protocol, then I think that is enough justification to add a
> generic message to the vhost-user protocol. The important thing is to
> make the message generic so it solves all crash recovery use cases. The
> inflight fd messages were too specific and now we're having to think
> about adding more messages again.
>
> Stefan

Hi, Stefan,

I agree with you that a virtiofsd must be launched by software like
systemd. So we are planning to define more generic persist/restore
interfaces (callbacks). Then anyone can implement their own persist/restore
callbacks to store state in a proper place. And I think in the next version
we will implement default callbacks for the interfaces. Instead of
vhost-user messages, systemd's sd_notify(3) will be the default method for
storing fds, and several tmpfs files can be the default place to store the
shm regions.

Jiachen
On Wed, Mar 17, 2021 at 08:32:31PM +0800, Jiachen Zhang wrote:
> I agree with you that a virtiofsd must be launched by software like
> systemd. So we are planning to define more generic persist/restore
> interfaces (callbacks). Then anyone can implement their own persist/restore
> callbacks to store state in a proper place. And I think in the next version
> we will implement default callbacks for the interfaces. Instead of
> vhost-user messages, systemd's sd_notify(3) will be the default method for
> storing fds, and several tmpfs files can be the default place to store the
> shm regions.

Okay, great!

I was thinking about how to make the crash recovery mechanism reusable
as a C library or Rust crate. The mechanism is a combination of:
1. sd_listen_fds(3) for restoring the fds on restart.
2. sd_notify(3) for storing the fds.
3. memfd or tmpfs for storing state (could be mmapped).

I'm not sure if there is enough common behavior to create a reusable API
or if this is quite application-specific.

Stefan
On Mon, Mar 22, 2021 at 11:00:52AM +0000, Stefan Hajnoczi wrote:
> I was thinking about how to make the crash recovery mechanism reusable
> as a C library or Rust crate. The mechanism is a combination of:
> 1. sd_listen_fds(3) for restoring the fds on restart.
> 2. sd_notify(3) for storing the fds.
> 3. memfd or tmpfs for storing state (could be mmapped).
>
> I'm not sure if there is enough common behavior to create a reusable API
> or if this is quite application-specific.

I am wondering what will happen for use cases where virtiofsd is running
inside a container (with no systemd inside the container). Do container
managers offer systemd-like services to save and restore state?

Vivek
On Mon, Mar 22, 2021 at 04:13:26PM -0400, Vivek Goyal wrote:
> I am wondering what will happen for use cases where virtiofsd is running
> inside a container (with no systemd inside the container). Do container
> managers offer systemd-like services to save and restore state?

It appears so, at least for Podman, where sd_notify is explicitly
mentioned:
https://www.redhat.com/sysadmin/improved-systemd-podman

As mentioned in this email thread, the socket activation and sd_notify(3)
interfaces can also be implemented by non-systemd software. Anyone running
a custom container runtime or orchestration software could add these
interfaces (they are reasonably simple and the protocols are documented in
the systemd documentation).

Stefan