We probably want to support larger write sizes than just 4k; 64k seems
nice. However, we cannot read partial requests from the FUSE FD; we
always have to read requests in full, so our read buffer must be large
enough to accommodate potential 64k writes if we want to support that.
Always allocating FuseRequest objects with 64k buffers in them seems
wasteful, though. But we can get around the issue by splitting the
buffer into two and using readv(): One part will hold all normal (up to
4k) write requests and all other requests, and a second part (the
"spill-over buffer") will be used only for larger write requests. Each
FuseQueue has its own spill-over buffer, and only if we find it used
when reading a request will we move its ownership into the FuseRequest
object and allocate a new spill-over buffer for the queue.
This way, we get to support "large" write sizes without having to
allocate big buffers when they aren't used.
Also, this even reduces the size of the FuseRequest objects because the
read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
the requests we support are not quite so large (except for >4k writes),
so until now, we basically had to have useless padding in there.
With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
is easily met and we can decrease the size of the buffer portion that is
right inside of FuseRequest.
As for benchmarks, the benefit of this patch can be shown easily by
writing a 4G image (with qemu-img convert) to a FUSE export:
- Before this patch: Takes 25.6 s (14.4 s with -t none)
- After this patch: Takes 4.5 s (5.5 s with -t none)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 137 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 118 insertions(+), 19 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 4ae448bc14..c0ad4696ce 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -50,8 +50,17 @@
/* Prevent overly long bounce buffer allocations */
#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
-/* Small enough to fit in the request buffer */
-#define FUSE_MAX_WRITE_BYTES (4 * 1024)
+/*
+ * FUSE_MAX_WRITE_BYTES determines the maximum number of bytes we support in a
+ * write request; FUSE_IN_PLACE_WRITE_BYTES and FUSE_SPILLOVER_BUF_SIZE
+ * determine the split between the size of the in-place buffer in FuseRequest
+ * and the spill-over buffer in FuseQueue. See FuseQueue.spillover_buf for a
+ * detailed explanation.
+ */
+#define FUSE_IN_PLACE_WRITE_BYTES (4 * 1024)
+#define FUSE_MAX_WRITE_BYTES (64 * 1024)
+#define FUSE_SPILLOVER_BUF_SIZE \
+ (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
typedef struct FuseExport FuseExport;
@@ -67,15 +76,49 @@ typedef struct FuseQueue {
/*
* The request buffer must be able to hold a full write, and/or at least
- * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes
+ * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes.
+ * This however is just the first part of the buffer; every read is given
+ * a vector of this buffer (which should be enough for all normal requests,
+ * which we check via the static assertion in FUSE_IN_OP_STRUCT()) and the
+ * spill-over buffer below.
+ * Therefore, the size of this buffer plus FUSE_SPILLOVER_BUF_SIZE must be
+ * FUSE_MIN_READ_BUFFER or more (checked via static assertion below).
+ */
+ char request_buf[sizeof(struct fuse_in_header) +
+ sizeof(struct fuse_write_in) +
+ FUSE_IN_PLACE_WRITE_BYTES];
+
+ /*
+ * When retrieving a FUSE request, the destination buffer must always be
+ * sufficiently large for the whole request, i.e. with max_write=64k, we
+ * must provide a buffer that fits the WRITE header and 64 kB of space for
+ * data.
+ * We do want to support 64k write requests without requiring them to be
+ * split up, but at the same time, do not want to do such a large allocation
+ * for every single request.
+ * Therefore, the FuseRequest object provides an in-line buffer that is
+ * enough for write requests up to 4k (and all other requests), and for
+ * every request that is bigger, we provide a spill-over buffer here (for
+ * the remaining 64k - 4k = 60k).
+ * When poll_fuse_fd() reads a FUSE request, it passes these buffers as an
+ * I/O vector, and then checks the return value (number of bytes read) to
+ * find out whether the spill-over buffer was used. If so, it will move the
+ * buffer to the request, and will allocate a new spill-over buffer for the
+ * next request.
+ *
+ * Free this buffer with qemu_vfree().
*/
- char request_buf[MAX_CONST(
- sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
- FUSE_MAX_WRITE_BYTES,
- FUSE_MIN_READ_BUFFER
- )];
+ void *spillover_buf;
} FuseQueue;
+/*
+ * Verify that FuseQueue.request_buf plus the spill-over buffer together
+ * are big enough to be accepted by the FUSE kernel driver.
+ */
+QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
+ FUSE_SPILLOVER_BUF_SIZE <
+ FUSE_MIN_READ_BUFFER);
+
struct FuseExport {
BlockExport common;
@@ -131,7 +174,8 @@ static int clone_fuse_fd(int fd, Error **errp);
static bool is_regular_file(const char *path, Error **errp);
static void read_from_fuse_fd(void *opaque);
-static void coroutine_fn fuse_co_process_request(FuseQueue *q);
+static void coroutine_fn
+fuse_co_process_request(FuseQueue *q, void *spillover_buf);
static void fuse_inc_in_flight(FuseExport *exp)
{
@@ -476,12 +520,27 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
FuseExport *exp = q->exp;
ssize_t ret;
const struct fuse_in_header *in_hdr;
+ struct iovec iov[2];
+ void *spillover_buf = NULL;
if (unlikely(exp->halted)) {
goto no_request;
}
- ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
+ /*
+ * If handling the last request consumed the spill-over buffer, allocate a
+ * new one. Align it to the block device's alignment, which admittedly is
+ * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.
+ */
+ if (unlikely(!q->spillover_buf)) {
+ q->spillover_buf = blk_blockalign(exp->common.blk,
+ FUSE_SPILLOVER_BUF_SIZE);
+ }
+ /* Construct the I/O vector to hold the FUSE request */
+ iov[0] = (struct iovec) { q->request_buf, sizeof(q->request_buf) };
+ iov[1] = (struct iovec) { q->spillover_buf, FUSE_SPILLOVER_BUF_SIZE };
+
+ ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));
if (ret < 0 && errno == EAGAIN) {
/* No request available */
goto no_request;
@@ -510,7 +569,13 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
goto no_request;
}
- fuse_co_process_request(q);
+ if (unlikely(ret > sizeof(q->request_buf))) {
+ /* Spillover buffer used, take ownership */
+ spillover_buf = q->spillover_buf;
+ q->spillover_buf = NULL;
+ }
+
+ fuse_co_process_request(q, spillover_buf);
no_request:
fuse_dec_in_flight(exp);
@@ -560,6 +625,9 @@ static void fuse_export_delete(BlockExport *blk_exp)
if (i > 0 && q->fuse_fd >= 0) {
close(q->fuse_fd);
}
+ if (q->spillover_buf) {
+ qemu_vfree(q->spillover_buf);
+ }
}
g_free(exp->queues);
@@ -869,17 +937,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
}
/**
- * Handle client writes to the exported image. @buf has the data to be written
- * and will be copied to a bounce buffer before yielding for the first time.
+ * Handle client writes to the exported image. @in_place_buf has the first
+ * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
+ * contains the rest (if any; NULL otherwise).
+ * Data in @in_place_buf is assumed to be overwritten after yielding, so will
+ * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
+ * assumed to be exclusively owned and will be used as-is.
* Return the number of bytes written to *out on success, and -errno on error.
*/
static ssize_t coroutine_fn
fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
- uint64_t offset, uint32_t size, const void *buf)
+ uint64_t offset, uint32_t size,
+ const void *in_place_buf, const void *spillover_buf)
{
+ size_t in_place_size;
void *copied;
int64_t blk_len;
int ret;
+ struct iovec iov[2];
+ QEMUIOVector qiov;
/* Limited by max_write, should not happen */
if (size > BDRV_REQUEST_MAX_BYTES) {
@@ -891,8 +967,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
}
/* Must copy to bounce buffer before potentially yielding */
- copied = blk_blockalign(exp->common.blk, size);
- memcpy(copied, buf, size);
+ in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
+ copied = blk_blockalign(exp->common.blk, in_place_size);
+ memcpy(copied, in_place_buf, in_place_size);
/**
* Clients will expect short writes at EOF, so we have to limit
@@ -916,7 +993,21 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
}
}
- ret = blk_co_pwrite(exp->common.blk, offset, size, copied, 0);
+ iov[0] = (struct iovec) {
+ .iov_base = copied,
+ .iov_len = in_place_size,
+ };
+ if (size > FUSE_IN_PLACE_WRITE_BYTES) {
+ assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
+ iov[1] = (struct iovec) {
+ .iov_base = (void *)spillover_buf,
+ .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
+ };
+ qemu_iovec_init_external(&qiov, iov, 2);
+ } else {
+ qemu_iovec_init_external(&qiov, iov, 1);
+ }
+ ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);
if (ret < 0) {
goto fail_free_buffer;
}
@@ -1275,8 +1366,14 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
* Note that yielding in any request-processing function can overwrite the
* contents of q->request_buf. Anything that takes a buffer needs to take
* care that the content is copied before yielding.
+ *
+ * @spillover_buf can contain the tail of a write request too large to fit into
+ * q->request_buf. This function takes ownership of it (i.e. will free it),
+ * which assumes that its contents will not be overwritten by concurrent
+ * requests (as opposed to q->request_buf).
*/
-static void coroutine_fn fuse_co_process_request(FuseQueue *q)
+static void coroutine_fn
+fuse_co_process_request(FuseQueue *q, void *spillover_buf)
{
FuseExport *exp = q->exp;
uint32_t opcode;
@@ -1372,7 +1469,7 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
* yielding.
*/
ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
- in->offset, in->size, in + 1);
+ in->offset, in->size, in + 1, spillover_buf);
break;
}
@@ -1414,6 +1511,8 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
ret < 0 ? ret : 0,
ret < 0 ? 0 : ret);
}
+
+ qemu_vfree(spillover_buf);
}
const BlockExportDriver blk_exp_fuse = {
--
2.49.0
On 01.07.2025 at 13:44, Hanna Czenczek wrote:
> [...]

The commit message seems outdated; there is no such thing as a
FuseRequest object.

I agree with the idea of allocating a separate buffer for the data to
be written. I'm not so sure that the approach taken here, combining an
in-place and a spillover buffer, actually does much for us in exchange
for the additional complexity.

The allocation + memcpy for in_place_buf in fuse_co_write() bothers me
a bit. I'd rather have a buffer for the data to write that can be used
directly. And let's be real, we already allocate a 1 MB stack per
request. I don't think 64k more or less makes a big difference, but it
would allow us to save the memcpy() for 4k requests and additionally
an allocation for larger requests.

The tradeoff, if you build the iov from a FuseQueue buffer that is
only big enough for the header and fuse_write_in, followed directly by
the per-request buffer owned by the coroutine, is that for requests
whose header is larger than fuse_write_in, you'll have to copy the
rest back from the data buffer first. This seems to affect only
fuse_setattr_in, which shouldn't be a hot path at all, and only a few
bytes.

All that said, what you have is an obvious improvement over limiting
write requests to 4k.

Kevin
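Sketched in code, that layout might look like the following. This is a
hypothetical sketch, not code from the series: fuse_read_request,
header_buf and data_buf are made-up names, and it assumes the existing
fuse.c helpers (RETRY_ON_EINTR, ARRAY_SIZE, FUSE_MAX_WRITE_BYTES) plus
<sys/uio.h> and <linux/fuse.h>.

/*
 * Hypothetical alternative layout: the queue-owned buffer covers only
 * the FUSE header plus fuse_write_in; all payload past that is read
 * directly into a buffer owned by the request's coroutine, so
 * fuse_co_write() could use it without a bounce copy.
 */
#define FUSE_HEADER_IOV_LEN \
    (sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in))

static ssize_t fuse_read_request(int fuse_fd, char *header_buf,
                                 void *data_buf)
{
    /*
     * One iovec for the header part, one for the payload; the kernel
     * always delivers a FUSE request in full, so one readv() suffices.
     */
    struct iovec iov[2] = {
        { .iov_base = header_buf, .iov_len = FUSE_HEADER_IOV_LEN },
        { .iov_base = data_buf,   .iov_len = FUSE_MAX_WRITE_BYTES },
    };

    return RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));
}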
On 23.10.25 17:11, Kevin Wolf wrote:
> On 01.07.2025 at 13:44, Hanna Czenczek wrote:
>> [...]
>
> The commit message seems outdated; there is no such thing as a
> FuseRequest object.
>
> [...]
>
> The tradeoff, if you build the iov from a FuseQueue buffer that is
> only big enough for the header and fuse_write_in, followed directly
> by the per-request buffer owned by the coroutine, is that for
> requests whose header is larger than fuse_write_in, you'll have to
> copy the rest back from the data buffer first. This seems to affect
> only fuse_setattr_in, which shouldn't be a hot path at all, and only
> a few bytes.

So I understand that first, you disagree with “Always allocating
FuseRequest objects with 64k buffers in them seems wasteful, though.”,
i.e. to just use a 64k buffer per request. OK, fair.

Second, you suggest improving performance by having an aligned 64k
data buffer separate from the request metadata buffer to save on
memcpy(). I did consider this, but discarded it because I was afraid
of the complexity. Actually probably too afraid.

I’ll take a look, it actually can’t be too hard. (Just have two
buffers; make the metadata buffer long enough to capture all request
headers, but only pass the sizeof(fuse_write_in)-long head into
readv(), then check the request opcode. If metadata spilled over to
the data buffer, copy it back into the “shadowed” metadata buffer
tail.)

Hanna

> All that said, what you have is an obvious improvement over limiting
> write requests to 4k.
>
> Kevin
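The copy-back step Hanna outlines above could look roughly like this.
Again a hypothetical sketch with made-up names; FUSE_HEADER_IOV_LEN is
the same assumed constant as in the previous sketch, and the opcode
test reflects the assumption that only requests like FUSE_SETATTR have
headers longer than the fuse_write_in-sized head.

/*
 * Hypothetical copy-back: header_buf is sized for the largest request
 * header (e.g. fuse_setattr_in), but only its first
 * FUSE_HEADER_IOV_LEN bytes were passed to readv(). For non-WRITE
 * requests with a longer header, the tail spilled into data_buf and
 * must be copied back into the shadowed part of header_buf.
 */
static void fuse_copy_back_header_tail(char *header_buf,
                                       size_t header_buf_size,
                                       const void *data_buf,
                                       size_t total_read)
{
    const struct fuse_in_header *in_hdr = (const void *)header_buf;

    if (in_hdr->opcode != FUSE_WRITE && total_read > FUSE_HEADER_IOV_LEN) {
        size_t tail_len = MIN(total_read - FUSE_HEADER_IOV_LEN,
                              header_buf_size - FUSE_HEADER_IOV_LEN);

        /* Cold path: only oversized headers such as FUSE_SETATTR */
        memcpy(header_buf + FUSE_HEADER_IOV_LEN, data_buf, tail_len);
    }
}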
On 31.10.2025 at 13:13, Hanna Czenczek wrote:
> On 23.10.25 17:11, Kevin Wolf wrote:
> > [...]
>
> So I understand that first, you disagree with “Always allocating
> FuseRequest objects with 64k buffers in them seems wasteful,
> though.”, i.e. to just use a 64k buffer per request. OK, fair.

I think in practice most write requests will exceed the 4k anyway, so
we'd already use the spillover buffer. Maybe the most common exception
is fio, if we want to optimise for that one. :-)

> Second, you suggest improving performance by having an aligned 64k
> data buffer separate from the request metadata buffer to save on
> memcpy(). I did consider this, but discarded it because I was afraid
> of the complexity. Actually probably too afraid.
>
> I’ll take a look, it actually can’t be too hard. (Just have two
> buffers; make the metadata buffer long enough to capture all request
> headers, but only pass the sizeof(fuse_write_in)-long head into
> readv(), then check the request opcode. If metadata spilled over to
> the data buffer, copy it back into the “shadowed” metadata buffer
> tail.)

Yes, that's what I had in mind. Not sure if it's premature
optimisation...

Kevin
On 31.10.25 13:33, Kevin Wolf wrote:
> On 31.10.2025 at 13:13, Hanna Czenczek wrote:
>> [...]
>
> I think in practice most write requests will exceed the 4k anyway, so
> we'd already use the spillover buffer. Maybe the most common exception
> is fio, if we want to optimise for that one. :-)
>
>> Second, you suggest improving performance by having an aligned 64k
>> data buffer separate from the request metadata buffer to save on
>> memcpy(). [...]
>
> Yes, that's what I had in mind. Not sure if it's premature
> optimisation...

Just to confirm – this means we will have to do an allocation for each
write request. I think this should still be cheaper than the current
memcpy(), but I still wanted to ask whether you agree.

(We could have a pool of buffers, but that would most definitely be a
premature optimization – and I hope our memalloc algorithm can do that
internally just fine.)

Hanna
On 31.10.2025 at 16:03, Hanna Czenczek wrote:
> On 31.10.25 13:33, Kevin Wolf wrote:
> > [...]
> >
> > Yes, that's what I had in mind. Not sure if it's premature
> > optimisation...
>
> Just to confirm – this means we will have to do an allocation for
> each write request. I think this should still be cheaper than the
> current memcpy(), but I still wanted to ask whether you agree.

For write requests, we will save the unconditional 4k bounce buffer
allocation and make the currently conditional 64k allocation (where in
practice I argued the condition is usually met) unconditional instead.
All other requests have no allocations before and after. Is this right?

If so, I expect that the change could only ever reduce the number of
allocations. It can increase the size of allocations (for small write
requests < 4k), but I'm not very worried about this.

> (We could have a pool of buffers, but that would most definitely be a
> premature optimization – and I hope our memalloc algorithm can do
> that internally just fine.)

Yes.

Kevin
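Put as code, the write path this thread converges on might look
roughly like the following sketch. The signature is hypothetical (the
per-request data_buf parameter is an assumption), the patch's
max_write and EOF-shortening checks are elided, and the return
convention mirrors the patch's "bytes written to *out" comment.

/*
 * Hypothetical bounce-free write path: data_buf was filled directly
 * by readv() and is owned by this coroutine, so no bounce-buffer
 * allocation or memcpy() is needed before yielding.
 */
static ssize_t coroutine_fn
fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
              uint64_t offset, uint32_t size, const void *data_buf)
{
    int ret;

    /* ... max_write limit and EOF shortening checks as in the patch ... */

    ret = blk_co_pwrite(exp->common.blk, offset, size, data_buf, 0);
    if (ret < 0) {
        return ret;
    }

    out->size = size;
    return sizeof(*out);
}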