[RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection

Jiachen Zhang posted 9 patches 3 years, 4 months ago
Test checkpatch failed
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com
Maintainers: "Dr. David Alan Gilbert" <dgilbert@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>
contrib/libvhost-user/libvhost-user.c | 106 +++-
contrib/libvhost-user/libvhost-user.h |  70 +++
docs/interop/vhost-user.rst           |  41 ++
hw/virtio/vhost-user-fs.c             | 334 ++++++++++-
hw/virtio/vhost-user.c                | 123 ++++
hw/virtio/vhost.c                     |  42 ++
include/hw/virtio/vhost-backend.h     |   6 +
include/hw/virtio/vhost-user-fs.h     |  16 +-
include/hw/virtio/vhost.h             |  42 ++
tools/virtiofsd/fuse_lowlevel.c       |  24 +-
tools/virtiofsd/fuse_virtio.c         |  44 ++
tools/virtiofsd/fuse_virtio.h         |   1 +
tools/virtiofsd/helper.c              |   9 +
tools/virtiofsd/passthrough_helpers.h |   2 +-
tools/virtiofsd/passthrough_ll.c      | 830 ++++++++++++++++++--------
tools/virtiofsd/passthrough_seccomp.c |   1 +
16 files changed, 1413 insertions(+), 278 deletions(-)
[RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Jiachen Zhang 3 years, 4 months ago
Hi all,

We implement virtio-fs crash reconnection in this patchset. The crash
reconnection of virtiofsd is completely transparent to the guest: no remount
in the guest is needed, and even inflight requests are handled normally
after reconnection. We look forward to any comments.

Thanks,
Jiachen


OVERVIEW:

To support virtio-fs crash reconnection, we need to support the recovery
of 1) inflight FUSE requests, and 2) virtiofsd's internal state.

Fortunately, QEMU's vhost-user reconnection framework already supports
inflight I/O tracking via VHOST_USER_GET_INFLIGHT_FD and
VHOST_USER_SET_INFLIGHT_FD (see commits 5ad204bf2 and 5f9ff1eff for details).
Because FUSE requests are carried by virtqueue I/O requests, this vhost-user
inflight I/O tracking lets us recover the inflight FUSE requests.
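
For reference, the payload carried by those two messages (following the
layout documented in docs/interop/vhost-user.rst) describes a shared memory
region in which the backend logs every inflight descriptor, so the log
survives a backend crash:

    #include <stdint.h>

    /* Payload of VHOST_USER_GET_INFLIGHT_FD / VHOST_USER_SET_INFLIGHT_FD.
     * The backend returns an fd to this shared region; after a reconnect,
     * QEMU passes the same fd and layout back so the backend can resubmit
     * the descriptors that were still inflight when it crashed. */
    typedef struct VhostUserInflight {
        uint64_t mmap_size;   /* size of the inflight shared memory area */
        uint64_t mmap_offset; /* offset of the area within the mapping */
        uint16_t num_queues;  /* number of virtqueues covered */
        uint16_t queue_size;  /* size of each virtqueue */
    } VhostUserInflight;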

To support recovery of virtiofsd's internal state, we introduce 4 new
vhost-user message types. As shown in the following diagram, two of them
are used to persist the shared lo_maps and opened fds to QEMU, and the
other two are used to restore that state when reconnecting.

                               VHOST_USER_SLAVE_SHM
                               VHOST_USER_SLAVE_FD
    +--------------+       Persist       +--------------------+
    |              <---------------------+                    |
    |     QEMU     |                     |  Virtio-fs Daemon  |
    |              +--------------------->                    |
    +--------------+       Restore       +--------------------+
            VHOST_USER_SET_SHM
            VHOST_USER_SET_FD

Although the 4 new message types are added here to support virtiofsd
reconnection, they may be useful in other situations as well, so we have
tried to keep them general when adding them to the vhost-related source
files. VHOST_USER_SLAVE_SHM and VHOST_USER_SET_SHM can be used for memory
sharing between a vhost-user daemon and QEMU, and VHOST_USER_SLAVE_FD and
VHOST_USER_SET_FD would be useful wherever opened fds need to be shared
between the QEMU process and a vhost-user daemon process.
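
As a rough illustration of how these could slot into the existing vhost-user
request enums (a sketch only: the message names follow the diagram above, but
the numeric values and comments here are placeholders, not the values chosen
by the patches):

    /* Sketch, not the actual patch: new slave-to-master requests let the
     * daemon push state to QEMU, and new master-to-slave requests let QEMU
     * push it back on reconnect. Real values must not collide with the
     * existing entries in hw/virtio/vhost-user.c and libvhost-user.h. */
    typedef enum VhostUserSlaveRequest {
        /* ... existing slave requests ... */
        VHOST_USER_SLAVE_SHM = 5,  /* persist a shared memory region */
        VHOST_USER_SLAVE_FD  = 6,  /* persist an opened fd */
    } VhostUserSlaveRequest;

    typedef enum VhostUserRequest {
        /* ... existing master requests ... */
        VHOST_USER_SET_SHM = 41,   /* restore a shared memory region */
        VHOST_USER_SET_FD  = 42,   /* restore a previously persisted fd */
    } VhostUserRequest;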


USAGE and NOTES:

- The commits are rebased onto a recent QEMU master commit, b4d939133dca0fa2b.

- ",reconnect=1" should be added to the "-chardev socket" of vhost-user-fs-pci
in the QEMU command line, for example:

    qemu-system-x86_64 ... \
    -chardev socket,id=char0,path=/tmp/vhostqemu,reconnect=1 \
    -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs \
    ...

- We add new options to virtiofsd to enable or disable crash reconnection.
Some existing options are not supported together with crash reconnection, so
add the following options to virtiofsd to enable reconnection:

    virtiofsd ... -o reconnect -o no_mount_ns -o no_flock -o no_posix_lock
    -o no_xattr ...

- The reasons why virtiofsd-side locking, extended attributes, and mount
namespaces are not supported are explained in the commit message of the 6th
patch (virtiofsd: Add two new options for crash reconnection).

- The 9th patch is a work-around that does not affect overall correctness.
We remove the qsort-related code because we found that when resubmit_num is
larger than 64, seccomp kills the virtiofsd process.

- Support for the DAX version of virtiofsd is entirely feasible and requires
almost no additional changes to this patchset.


Jiachen Zhang (9):
  vhost-user-fs: Add support for reconnection of vhost-user-fs backend
  vhost: Add vhost-user message types for sending shared memory and file
    fds
  vhost-user-fs: Support virtiofsd crash reconnection
  libvhost-user: Add vhost-user message types for sending shared memory
    and file fds
  virtiofsd: Convert the struct lo_map array to a more flatten layout
  virtiofsd: Add two new options for crash reconnection
  virtiofsd: Persist/restore lo_map and opened fds to/from QEMU
  virtiofsd: Ensure crash consistency after reconnection
  virtiofsd: (work around) Comment qsort in inflight I/O tracking

 contrib/libvhost-user/libvhost-user.c | 106 +++-
 contrib/libvhost-user/libvhost-user.h |  70 +++
 docs/interop/vhost-user.rst           |  41 ++
 hw/virtio/vhost-user-fs.c             | 334 ++++++++++-
 hw/virtio/vhost-user.c                | 123 ++++
 hw/virtio/vhost.c                     |  42 ++
 include/hw/virtio/vhost-backend.h     |   6 +
 include/hw/virtio/vhost-user-fs.h     |  16 +-
 include/hw/virtio/vhost.h             |  42 ++
 tools/virtiofsd/fuse_lowlevel.c       |  24 +-
 tools/virtiofsd/fuse_virtio.c         |  44 ++
 tools/virtiofsd/fuse_virtio.h         |   1 +
 tools/virtiofsd/helper.c              |   9 +
 tools/virtiofsd/passthrough_helpers.h |   2 +-
 tools/virtiofsd/passthrough_ll.c      | 830 ++++++++++++++++++--------
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 16 files changed, 1413 insertions(+), 278 deletions(-)

-- 
2.20.1


Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by no-reply@patchew.org 3 years, 4 months ago
Patchew URL: https://patchew.org/QEMU/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com
Subject: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com -> patchew/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com
 * [new tag]         patchew/20201215174824.76017-1-richard.henderson@linaro.org -> patchew/20201215174824.76017-1-richard.henderson@linaro.org
Switched to a new branch 'test'
c6015f6 virtiofsd: (work around) Comment qsort in inflight I/O tracking
a285357 virtiofsd: Ensure crash consistency after reconnection
d5f69c8 virtiofsd: Persist/restore lo_map and opened fds to/from QEMU
4e0a56d virtiofsd: Add two new options for crash reconnection
d4b638e virtiofsd: Convert the struct lo_map array to a more flatten layout
6dadec2 libvhost-user: Add vhost-user message types for sending shared memory and file fds
91f265b vhost-user-fs: Support virtiofsd crash reconnection
bb587ea vhost: Add vhost-user message types for sending shared memory and file fds
ccb8eca vhost-user-fs: Add support for reconnection of vhost-user-fs backend

=== OUTPUT BEGIN ===
1/9 Checking commit ccb8eca38f55 (vhost-user-fs: Add support for reconnection of vhost-user-fs backend)
2/9 Checking commit bb587eaf556e (vhost: Add vhost-user message types for sending shared memory and file fds)
3/9 Checking commit 91f265b55b12 (vhost-user-fs: Support virtiofsd crash reconnection)
4/9 Checking commit 6dadec226d05 (libvhost-user: Add vhost-user message types for sending shared memory and file fds)
5/9 Checking commit d4b638e34242 (virtiofsd: Convert the struct lo_map array to a more flatten layout)
ERROR: use qemu_real_host_page_size instead of getpagesize()
#195: FILE: tools/virtiofsd/passthrough_ll.c:383:
+    int page_size = getpagesize();

total: 1 errors, 0 warnings, 997 lines checked

Patch 5/9 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

6/9 Checking commit 4e0a56d0217d (virtiofsd: Add two new options for crash reconnection)
7/9 Checking commit d5f69c8465ac (virtiofsd: Persist/restore lo_map and opened fds to/from QEMU)
8/9 Checking commit a285357c7bff (virtiofsd: Ensure crash consistency after reconnection)
9/9 Checking commit c6015f602fb0 (virtiofsd: (work around) Comment qsort in inflight I/O tracking)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Jiachen Zhang 3 years ago
Hi all,


We are going to develop the v2 patchset for virtio-fs crash reconnection. As
suggested by Marc-André and Stefan, except for the inflight I/O tracking
log area, all the other internal state of virtiofsd will be saved to
places other than QEMU. Specifically, the three lo_maps (ino_map,
dirp_map, and fd_map) could be saved to several mmapped files, and the
opened fds could be saved to systemd. I'd like to get some feedback on our
further thoughts before we work on the revision.


1. What about saving the opened fds as file handles to the host kernel by
default, instead of saving them to systemd? After some internal discussion,
we think introducing systemd may add more uncertainty to the system, as we
would need to create one service for each daemon, and all the daemons would
suffer a single point of failure in the systemd process.


2. Like the btree map implementation (multikey.rs) of virtiofsd-rs, what
about splitting the flattened lo_map implementation, which supports being
persisted to files, out of passthrough_ll.c into a separate source file?
This way, we could more easily wrap it with Rust-compatible interfaces and
enable crash recovery for virtiofsd-rs based on it.


3. What about dropping the dirp_map and integrating the opened directory fds
into the fd_map? The virtiofsd-rs implementation only has two maps (inodes
and handles), so the dirp_map may also be unnecessary in the C version.


All the best,

Jiachen

On Wed, Dec 16, 2020 at 12:21 AM Jiachen Zhang <
zhangjiachen.jaycee@bytedance.com> wrote:

> [...]
>
Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Stefan Hajnoczi 3 years ago
On Mon, May 10, 2021 at 10:38:05PM +0800, Jiachen Zhang wrote:
> Hi all,
> 
> 
> We are going to develop the v2 patchset for virtio-fs crash reconnection. As
> suggested by Marc-André and Stefan, except for the inflight I/O tracking
> log area, all the other internal state of virtiofsd will be saved to
> places other than QEMU. Specifically, the three lo_maps (ino_map,
> dirp_map, and fd_map) could be saved to several mmapped files, and the
> opened fds could be saved to systemd. I'd like to get some feedback on our
> further thoughts before we work on the revision.
> 
> 
> 1. What about saving the opened fds as file handles to the host kernel by
> default, instead of saving them to systemd? After some internal discussion,
> we think introducing systemd may add more uncertainty to the system, as we
> would need to create one service for each daemon, and all the daemons would
> suffer a single point of failure in the systemd process.

I don't think saving file handles works 100%. The difference between an
open fd and a file handle is what happens when the inode is deleted: if
an external process deletes the inode during the restart, an open fd keeps
the inode alive, while a file handle becomes stale and the inode is gone.
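
To make the difference concrete, here is a minimal sketch of the file-handle
round trip under discussion, using name_to_handle_at(2) and
open_by_handle_at(2) (the two-call size-probing pattern follows the man page;
error handling is mostly trimmed, and the function name is illustrative):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>

    /* Turn a path into a kernel file handle and reopen it later. Unlike an
     * open fd, the handle does not pin the inode: if the file is deleted
     * in between, open_by_handle_at() fails with ESTALE. */
    static int reopen_via_handle(int mount_fd, const char *path)
    {
        struct file_handle *fh = calloc(1, sizeof(*fh));
        int mount_id;

        /* First call with handle_bytes == 0 fails with EOVERFLOW and
         * reports the required size in fh->handle_bytes. */
        fh->handle_bytes = 0;
        name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0);

        fh = realloc(fh, sizeof(*fh) + fh->handle_bytes);
        if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
            free(fh);
            return -1;
        }

        /* ... the handle bytes could be persisted across a restart ... */

        /* Requires CAP_DAC_READ_SEARCH; mount_fd is any fd on the mount. */
        int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
        free(fh);
        return fd;
    }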

Regarding systemd, it's pid 1 and cannot die - otherwise the system is
broken.

But in any case I think there are multiple options here. Whether you
choose to use systemd, implement the sd_notify(3) protocol in your own
parent process, or take a different approach like a parent process with
clone(2) CLONE_FILES to avoid the communication overhead of saving
every fd, I think all of those approaches would be reasonable.

> 2. Like the btree map implementation (multikey.rs) of virtiofsd-rs, what
> about splitting the flattened lo_map implementation, which supports being
> persisted to files, out of passthrough_ll.c into a separate source file?
> This way, we could more easily wrap it with Rust-compatible interfaces and
> enable crash recovery for virtiofsd-rs based on it.

In the past two months I've noticed the number of virtiofsd-rs merge
requests has increased and I think the trend is that new development is
focussing on virtiofsd-rs.

If it fits into your plans then focussing on virtiofsd-rs would be fine,
and then there is no need to worry about Rust-compatible interfaces for
C virtiofsd.

> 3. What about dropping the dirp_map and integrating the opened directory fds
> into the fd_map? The virtiofsd-rs implementation only has two maps (inodes
> and handles), so the dirp_map may also be unnecessary in the C version.

Maybe, but carefully:

The purpose of the maps is to safely isolate the client from
virtiofsd's internal objects. The way I remember it is that C virtiofsd
has a separate dirp map to prevent type confusion between regular open
files and directories. The client must not be able to trick the server
into calling readdir(3) on something that is not really a directory
stream, because that could be a security issue.

However, it's possible that virtiofsd-rs is able to combine the two maps
because it uses syscall APIs on file descriptors instead of libc
opendir(3), so there is no possibility of type confusion: the syscall
simply fails if the file descriptor is not O_DIRECTORY.
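
A small sketch of that fd-based pattern (assumed, not taken from
virtiofsd-rs): opening with O_DIRECTORY makes the kernel itself reject
non-directories, so a single map of plain fds cannot be type-confused:

    #include <errno.h>
    #include <fcntl.h>

    /* Open an entry strictly as a directory. The kernel returns ENOTDIR
     * for anything else, so a malicious client cannot pass a regular file
     * where a directory is expected. */
    static int open_dir_fd(int parent_fd, const char *name)
    {
        int fd = openat(parent_fd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
        if (fd < 0 && errno == ENOTDIR) {
            /* caller can reject the request as invalid */
        }
        return fd;
    }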

Stefan
Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Marc-André Lureau 3 years, 4 months ago
Hi

On Tue, Dec 15, 2020 at 8:22 PM Jiachen Zhang <
zhangjiachen.jaycee@bytedance.com> wrote:

> Hi all,
>
> [...]
>
> Although the 4 new message types are added here to support virtiofsd
> reconnection, they may be useful in other situations as well, so we have
> tried to keep them general when adding them to the vhost-related source
> files. VHOST_USER_SLAVE_SHM and VHOST_USER_SET_SHM can be used for memory
> sharing between a vhost-user daemon and QEMU, and VHOST_USER_SLAVE_FD and
> VHOST_USER_SET_FD would be useful wherever opened fds need to be shared
> between the QEMU process and a vhost-user daemon process.
>

Before adding new messages to the already complex vhost-user protocol, can
we evaluate other options?

First thing that came to my mind is that the memory state could be saved to
disk or with a POSIX shared memory object.

Eventually, the protocol could just pass around the fds, and not give
special treatment to shared memory.

Then I remember systemd has a pretty good API & protocol for this sort of
thing: sd_notify(3) (afaik, it is quite easy to implement a minimal handler)

You can store fds with FDSTORE=1 (with an optional associated FDNAME), and
use sd_listen_fds() & friends to get them back (note: they are passed by
inheritance only, I think). systemd does not seem to make shm a special case
either; it just treats it like any other opened fd to restore.
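
For concreteness, a minimal sketch of that store/restore cycle against the
real systemd APIs, sd_pid_notify_with_fds(3) and sd_listen_fds(3) (link with
-lsystemd; the fd bookkeeping around them is illustrative):

    #include <systemd/sd-daemon.h>

    /* Ask the service manager to hold our fds across a crash/restart.
     * Requires FileDescriptorStoreMax= to be set in the unit file. */
    static void store_fds(const int *fds, unsigned n_fds)
    {
        sd_pid_notify_with_fds(0, 0, "FDSTORE=1\nFDNAME=virtiofsd-state",
                               fds, n_fds);
    }

    /* After a restart, the stored fds are passed back by inheritance,
     * starting at SD_LISTEN_FDS_START (fd 3). */
    static int restore_fds(void)
    {
        int n = sd_listen_fds(0);
        for (int i = 0; i < n; i++) {
            int fd = SD_LISTEN_FDS_START + i;
            /* ... reattach fd to the daemon's internal maps ... */
            (void)fd;
        }
        return n;
    }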

If we expect backend processes to be managed by libvirt or even a systemd
service, is this a better alternative? sd_notify() also offers a number of
interesting features for monitoring services.


> [...]

-- 
Marc-André Lureau
Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Jiachen Zhang 3 years, 4 months ago
On Wed, Dec 16, 2020 at 11:36 PM Marc-André Lureau <
marcandre.lureau@gmail.com> wrote:

> Hi
>
> On Tue, Dec 15, 2020 at 8:22 PM Jiachen Zhang <
> zhangjiachen.jaycee@bytedance.com> wrote:
>
>> [...]
>
> Before adding new messages to the already complex vhost-user protocol, can
> we evaluate other options?
>
> First thing that came to my mind is that the memory state could be saved
> to disk or with a POSIX shared memory object.
>
> Eventually, the protocol could just pass around the fds, and not give
> special treatment to shared memory.
>
> Then I remember systemd has a pretty good API & protocol for this sort of
> thing: sd_notify(3) (afaik, it is quite easy to implement a minimal handler)
>
> You can store fds with FDSTORE=1 (with an optional associated FDNAME), and
> use sd_listen_fds() & friends to get them back (note: they are passed by
> inheritance only, I think). systemd does not seem to make shm a special
> case either; it just treats it like any other opened fd to restore.
>
> If we expect backend processes to be managed by libvirt or even a systemd
> service, is this a better alternative? sd_notify() also offers a number of
> interesting features for monitoring services.
>

Thanks for the suggestions. Actually, we chose to save all state
information to QEMU because a virtiofsd has the same lifecycle as its
QEMU master. However, saving things to a file does avoid communication with
QEMU, and we would no longer need to increase the complexity of the
vhost-user protocol. The suggestion to save fds to systemd is also very
reasonable if we don't consider the lifecycle issues; we will try it.

All the best,
Jiachen



>> [...]
Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Stefan Hajnoczi 3 years, 1 month ago
On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> Thanks for the suggestions. Actually, we chose to save all state
> information to QEMU because a virtiofsd has the same lifecycle as its
> QEMU master. However, saving things to a file does avoid communication with
> QEMU, and we would no longer need to increase the complexity of the
> vhost-user protocol. The suggestion to save fds to systemd is also very
> reasonable if we don't consider the lifecycle issues; we will try it.

Hi,
We recently discussed crash recovery in the virtio-fs bi-weekly call and
I read some of this email thread because it's a topic I'm interested in.

I agree with Marc-André that storing file descriptors does not need to
be in the vhost-user protocol. The lifetime of a vhost-user device
backend is not controlled by the VMM since the device backend is
launched separately. Therefore it's reasonable for the component that
launched the device backend to also have the responsibility of cleaning
up the vhost-user device backend.

Using the sd_notify(3) interface is a neat idea. It's supported natively
by systemd but you can also implement a compatible interface in your own
software. This way the vhost-user device backend can be launched using
systemd or your own software.

That said, if people find it more convenient to store fds using the
vhost-user protocol, then I think that is enough justification to add a
generic message to the vhost-user protocol. The important thing is to
make the message generic so it solves all crash recovery use cases. The
inflight fd messages were too specific and now we're having to think
about adding more messages again.

Stefan
Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Christian Schoenebeck 3 years, 1 month ago
On Wednesday, 17 March 2021 11:05:32 CET Stefan Hajnoczi wrote:
> On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > [...]
> 
> Hi,
> We recently discussed crash recovery in the virtio-fs bi-weekly call and
> I read some of this email thread because it's a topic I'm interested in.

I just had a quick fly over the patches so far. Shouldn't there be some
kind of constraint on an automatic reconnection feature after a crash, to
prevent it being exploited by ROP brute-force attacks?

E.g. adding some (maybe continuously increasing) delay and/or limiting the
number of reconnects within a certain time frame would come to my mind.

Best regards,
Christian Schoenebeck



Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Jiachen Zhang 3 years, 1 month ago
On Wed, Mar 17, 2021 at 7:50 PM Christian Schoenebeck <
qemu_oss@crudebyte.com> wrote:

> On Wednesday, 17 March 2021 11:05:32 CET Stefan Hajnoczi wrote:
> > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > > [...]
> >
> > Hi,
> > We recently discussed crash recovery in the virtio-fs bi-weekly call and
> > I read some of this email thread because it's a topic I'm interested in.
>
> I just had a quick fly over the patches so far. Shouldn't there be some
> kind of constraint on an automatic reconnection feature after a crash, to
> prevent it being exploited by ROP brute-force attacks?
>
> E.g. adding some (maybe continuously increasing) delay and/or limiting the
> number of reconnects within a certain time frame would come to my mind.
>
> Best regards,
> Christian Schoenebeck
>
>
>

Thanks, Christian. I am still trying to figure out the details of the ROP
attacks.

However, QEMU's vhost-user reconnection is based on chardev socket
reconnection. Socket reconnection can be enabled with "-chardev
socket,...,reconnect=N" in the QEMU command options, where N means QEMU
will try to reconnect the disconnected socket every N seconds. We can
increase N to increase the reconnect delay. If we want to change the
reconnect delay dynamically, I think we should change the chardev socket
reconnection code, as it is a more generic mechanism than vhost-user-fs
or the vhost-user backend.

By the way, I have also considered the socket reconnection delay time from
the performance perspective. As the reconnection delay increases, an
application in the guest doing I/O will suffer larger tail latency. And for
now, the smallest delay is 1 second, which is rather large for today's
high-performance virtual I/O devices. I think a more performant and safer
reconnect-delay adjustment mechanism should be considered in the future.
What are your thoughts?


Jiachen
Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Christian Schoenebeck 3 years, 1 month ago
On Wednesday, 17 March 2021 13:57:47 CET Jiachen Zhang wrote:
> On Wed, Mar 17, 2021 at 7:50 PM Christian Schoenebeck <
> 
> qemu_oss@crudebyte.com> wrote:
> > On Wednesday, 17 March 2021 11:05:32 CET Stefan Hajnoczi wrote:
> > > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > > > [...]
> > > 
> > > Hi,
> > > We recently discussed crash recovery in the virtio-fs bi-weekly call and
> > > I read some of this email thread because it's a topic I'm interested in.
> > 
> > I just had a quick fly over the patches so far. Shouldn't there be some
> > kind of constraint on an automatic reconnection feature after a crash, to
> > prevent it being exploited by ROP brute-force attacks?
> > 
> > E.g. adding some (maybe continuously increasing) delay and/or limiting the
> > number of reconnects within a certain time frame would come to my mind.
> > 
> > Best regards,
> > Christian Schoenebeck
> 
> Thanks, Christian. I am still trying to figure out the details of the ROP
> attacks.
> 
> However, QEMU's vhost-user reconnection is based on chardev socket
> reconnection. Socket reconnection can be enabled with "-chardev
> socket,...,reconnect=N" in the QEMU command options, where N means QEMU
> will try to reconnect the disconnected socket every N seconds. We can
> increase N to increase the reconnect delay. If we want to change the
> reconnect delay dynamically, I think we should change the chardev socket
> reconnection code, as it is a more generic mechanism than vhost-user-fs
> or the vhost-user backend.
> 
> By the way, I have also considered the socket reconnection delay time from
> the performance perspective. As the reconnection delay increases, an
> application in the guest doing I/O will suffer larger tail latency. And for
> now, the smallest delay is 1 second, which is rather large for today's
> high-performance virtual I/O devices. I think a more performant and safer
> reconnect-delay adjustment mechanism should be considered in the future.
> What are your thoughts?

So with N=1 an attacker could e.g. bypass a 16-bit PAC by brute force in ~18
hours (e.g. on Arm if PAC + MTE was enabled). With a 24-bit PAC (no MTE) it
would be ~194 days. Independent of what architecture and defence mechanism is
used, there is always the possibility that some kind of side-channel attack
exists that might require a much lower number of attempts. So in an untrusted
environment I would personally limit the number of automatic reconnects and
rather accept some downtime for further investigation if a suspiciously high
number of crashes happened.
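
(For the arithmetic behind those figures: one reconnect per second allows one
forgery attempt per second, so exhausting a 16-bit PAC space takes at most
2^16 = 65536 seconds ≈ 18.2 hours, and a 24-bit space at most
2^24 = 16777216 seconds ≈ 194 days.)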

And yes, if a dynamic delay scheme were deployed in the future then starting
with a value smaller than 1 second would make sense.

Best regards,
Christian Schoenebeck



Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Jiachen Zhang 3 years, 1 month ago
On Wed, Mar 17, 2021 at 6:05 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > [...]
>
> Hi,
> We recently discussed crash recovery in the virtio-fs bi-weekly call and
> I read some of this email thread because it's a topic I'm interested in.
>
> I agree with Marc-André that storing file descriptors does not need to
> be in the vhost-user protocol. The lifetime of a vhost-user device
> backend is not controlled by the VMM since the device backend is
> launched separately. Therefore it's reasonable for the component that
> launched the device backend to also have the responsibility of cleaning
> up the vhost-user device backend.
>
> Using the sd_notify(3) interface is a neat idea. It's supported natively
> by systemd but you can also implement a compatible interface in your own
> software. This way the vhost-user device backend can be launched using
> systemd or your own software.
>
> That said, if people find it more convenient to store fds using the
> vhost-user protocol, then I think that is enough justification to add a
> generic message to the vhost-user protocol. The important thing is to
> make the message generic so it solves all crash recovery use cases. The
> inflight fd messages were too specific and now we're having to think
> about adding more messages again.
>
> Stefan
>


Hi, Stefan,

I agree with you that a virtiofsd must be launched by software like
systemd. So we are planning to define more generic persist/restore
interfaces (callbacks). Then anyone can implement their own persist/restore
callbacks to store state in the proper places. And I think in the next
version we will implement default callbacks for the interfaces: instead of
vhost-user messages, systemd's sd_notify(3) will be the default method for
storing fds, and several tmpfs files can be the default place to store the
shm regions.
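
A rough sketch of what such an interface could look like (entirely
hypothetical: the struct and its members are illustrative, not from the
planned v2):

    #include <stddef.h>

    /* Hypothetical persist/restore hooks: the daemon calls these at the
     * appropriate moments, and a backend (systemd fd store, tmpfs files,
     * vhost-user messages, ...) supplies the implementations. */
    struct crash_recovery_ops {
        /* Persist an opened fd under a stable name. */
        int (*persist_fd)(const char *name, int fd);
        /* Get a persisted fd back after restart; -1 if not found. */
        int (*restore_fd)(const char *name);
        /* Persist/restore a state region such as an lo_map. */
        int (*persist_shm)(const char *name, const void *data, size_t size);
        int (*restore_shm)(const char *name, void *data, size_t size);
    };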

Jiachen
Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Stefan Hajnoczi 3 years, 1 month ago
On Wed, Mar 17, 2021 at 08:32:31PM +0800, Jiachen Zhang wrote:
> On Wed, Mar 17, 2021 at 6:05 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> I agree with you that a virtiofsd must be launched by software like
> systemd. So we are planning to define more generic persist/restore
> interfaces (callbacks). Then anyone can implement their own persist/restore
> callbacks to store state in the proper places. And I think in the next
> version we will implement default callbacks for the interfaces: instead of
> vhost-user messages, systemd's sd_notify(3) will be the default method for
> storing fds, and several tmpfs files can be the default place to store the
> shm regions.

Okay, great!

I was thinking about how to make the crash recovery mechanism reusable
as a C library or Rust crate. The mechanism is a combination of:
1. sd_listen_fds(3) for restoring the fds on restart.
2. sd_notify(3) for storing the fds.
3. memfd or tmpfs for storing state (could be mmapped).
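
For item 3, a minimal sketch of the memfd variant (assumed usage, not from an
actual implementation): create an anonymous memory fd, mmap it for the state,
and hand the fd itself to the fd store so the contents survive a restart:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Create (or remap after a restart) a state region backed by a memfd.
     * The fd can be stored via sd_notify(FDSTORE=1); after a restart the
     * same fd comes back from sd_listen_fds() and mmap() recovers the
     * state that was written before the crash. */
    static void *map_state(int *fd_inout, size_t size)
    {
        int fd = *fd_inout;
        if (fd < 0) { /* first start: create a fresh region */
            fd = memfd_create("virtiofsd-state", 0);
            if (fd < 0 || ftruncate(fd, size) < 0) {
                return NULL;
            }
            *fd_inout = fd;
        }
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        return p == MAP_FAILED ? NULL : p;
    }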

I'm not sure if there is enough common behavior to create a reusable API
or if this is quite application-specific.

Stefan
Re: [Virtio-fs] [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Vivek Goyal 3 years, 1 month ago
On Mon, Mar 22, 2021 at 11:00:52AM +0000, Stefan Hajnoczi wrote:
> On Wed, Mar 17, 2021 at 08:32:31PM +0800, Jiachen Zhang wrote:
> > On Wed, Mar 17, 2021 at 6:05 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > [...]
> 
> Okay, great!
> 
> I was thinking about how to make the crash recovery mechanism reusable
> as a C library or Rust crate. The mechanism is a combination of:
> 1. sd_listen_fds(3) for restoring the fds on restart.
> 2. sd_notify(3) for storing the fds.
> 3. memfd or tmpfs for storing state (could be mmapped).
> 
> I'm not sure if there is enough common behavior to create a reusable API
> or if this is quite application-specific.

I am wondering what will happen for use cases where virtiofsd is running
inside a container (with no systemd inside the container).

Do container managers offer systemd-like services to save and restore
state?

Vivek


Re: [Virtio-fs] [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
Posted by Stefan Hajnoczi 3 years, 1 month ago
On Mon, Mar 22, 2021 at 04:13:26PM -0400, Vivek Goyal wrote:
> On Mon, Mar 22, 2021 at 11:00:52AM +0000, Stefan Hajnoczi wrote:
> > On Wed, Mar 17, 2021 at 08:32:31PM +0800, Jiachen Zhang wrote:
> > > On Wed, Mar 17, 2021 at 6:05 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > > [...]
> > 
> > Okay, great!
> > 
> > I was thinking about how to make the crash recovery mechanism reusable
> > as a C library or Rust crate. The mechanism is a combination of:
> > 1. sd_listen_fds(3) for restoring the fds on restart.
> > 2. sd_notify(3) for storing the fds.
> > 3. memfd or tmpfs for storing state (could be mmapped).
> > 
> > I'm not sure if there is enough common behavior to create a reusable API
> > or if this is quite application-specific.
> 
> I am wondering what will happen for use cases where virtiofsd is running
> inside a container (with no systemd inside the container).
> 
> Do container managers offer systemd-like services to save and restore
> state?

It appears so, at least for Podman where sd_notify is explicitly
mentioned:
https://www.redhat.com/sysadmin/improved-systemd-podman

As mentioned in this email thread, the socket activation and
sd_notify(3) interface can also be implemented by non-systemd software.
Anyone running a custom container runtime or orchestration software
could add these interfaces (they are reasonably simple and the protocols
are documented in the systemd documentation).

Stefan