From: Brian Song
To: qemu-block@nongnu.org
Cc: qemu-devel@nongnu.org, armbru@redhat.com, bschubert@ddn.com, fam@euphon.net, hibriansong@gmail.com, hreitz@redhat.com, kwolf@redhat.com, stefanha@redhat.com
Subject: [PATCH RFC 1/1] block/export: FUSE-over-io_uring Support for QEMU FUSE Exports
Date: Wed, 16 Jul 2025 14:38:24 -0400
Message-ID: <20250716183824.216257-2-hibriansong@gmail.com>
X-Mailer: git-send-email 2.50.1
In-Reply-To: <20250716183824.216257-1-hibriansong@gmail.com>
References: <20250716183824.216257-1-hibriansong@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

This patch provides an initial implementation of FUSE-over-io_uring
support for QEMU FUSE exports.

According to the FUSE-over-io_uring protocol specification, the
userspace side must create as many queues as there are CPUs (nr_cpu),
just like the kernel. Currently, each queue contains only a single
ring entry (one SQE), which is enough to validate the correctness of
the FUSE-over-io_uring code path.

All FUSE read and write operations exchange data with the kernel
through an I/O vector that is referenced by the SQE, both when an
entry is submitted and when its CQE is fetched. The req_header and
op_payload members of each entry are the two parts of that I/O vector:
req_header carries the FUSE operation header, and op_payload carries
the data payload, such as file attributes in a getattr reply, file
content in a read reply, or the data to be written in a write request.

At present, multi-threading support is still incomplete. Handling
connection termination and managing the "drained" state of a FUSE
block export in QEMU also remain as pending work.
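
For readers who want to picture the two-part I/O vector described above,
here is a minimal standalone sketch. It is not code from this patch: the
demo_* names are hypothetical, and the header struct is only a stand-in
whose field sizes are made up rather than taken from the kernel's
struct fuse_uring_req_header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Hypothetical stand-in for struct fuse_uring_req_header; sizes are made up. */
struct demo_req_header {
    char in_out[128];
    char op_in[128];
    char ring_ent_in_out[32];
};

/* Hypothetical, simplified counterpart of the patch's FuseRingEnt. */
struct demo_ring_ent {
    struct demo_req_header *req_header;
    void *op_payload;
    size_t req_payload_sz;
    struct iovec iov[2];
};

/* Allocate both buffers and describe them in the two-element I/O vector. */
int demo_ring_ent_init(struct demo_ring_ent *ent, size_t payload_sz)
{
    assert(payload_sz > 0);

    ent->req_header = calloc(1, sizeof(*ent->req_header));
    ent->op_payload = calloc(1, payload_sz);
    if (!ent->req_header || !ent->op_payload) {
        free(ent->req_header);
        free(ent->op_payload);
        return -1;
    }
    ent->req_payload_sz = payload_sz;

    /* iov[0]: operation header, iov[1]: data payload */
    ent->iov[0] = (struct iovec) {
        .iov_base = ent->req_header,
        .iov_len = sizeof(*ent->req_header),
    };
    ent->iov[1] = (struct iovec) {
        .iov_base = ent->op_payload,
        .iov_len = payload_sz,
    };
    return 0;
}

/* Release both buffers again. */
void demo_ring_ent_free(struct demo_ring_ent *ent)
{
    free(ent->req_header);
    free(ent->op_payload);
    ent->req_header = NULL;
    ent->op_payload = NULL;
}
```

In the patch itself the address of such a two-element vector is what goes
into sqe->addr, with sqe->len set to 2, when the entry is registered.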

Suggested-by: Kevin Wolf
Suggested-by: Stefan Hajnoczi
Signed-off-by: Brian Song
---
 block/export/fuse.c                  | 423 +++++++++++++++++++++++++--
 docs/tools/qemu-storage-daemon.rst   |  10 +-
 qapi/block-export.json               |   6 +-
 storage-daemon/qemu-storage-daemon.c |   1 +
 util/fdmon-io_uring.c                |   5 +-
 5 files changed, 420 insertions(+), 25 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index c0ad4696ce..637d36186a 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -48,6 +48,11 @@
 #include
 #endif
 
+#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
+
+/* room needed in buffer to accommodate header */
+#define FUSE_BUFFER_HEADER_SIZE 0x1000
+
 /* Prevent overly long bounce buffer allocations */
 #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
 /*
@@ -64,6 +69,26 @@
 
 typedef struct FuseExport FuseExport;
 
+struct FuseQueue;
+
+typedef struct FuseRingEnt {
+    /* back pointer */
+    struct FuseQueue *q;
+
+    /* commit id of a fuse request */
+    uint64_t req_commit_id;
+
+    /* fuse request header and payload */
+    struct fuse_uring_req_header *req_header;
+    void *op_payload;
+    size_t req_payload_sz;
+
+    /* The vector passed to the kernel */
+    struct iovec iov[2];
+
+    CqeHandler fuse_cqe_handler;
+} FuseRingEnt;
+
 /*
  * One FUSE "queue", representing one FUSE FD from which requests are fetched
  * and processed. Each queue is tied to an AioContext.
@@ -73,6 +98,7 @@ typedef struct FuseQueue {
 
     AioContext *ctx;
     int fuse_fd;
+    int qid;
 
     /*
      * The request buffer must be able to hold a full write, and/or at least
@@ -109,6 +135,17 @@ typedef struct FuseQueue {
      * Free this buffer with qemu_vfree().
      */
     void *spillover_buf;
+
+#ifdef CONFIG_LINUX_IO_URING
+    FuseRingEnt ent;
+
+    /*
+     * TODO
+     * Support multi-threaded FUSE over io_uring by using eventfd and allocating
+     * an extra SQE for each thread to be notified when the connection
+     * shuts down.
+     */
+#endif
 } FuseQueue;
 
 /*
@@ -148,6 +185,7 @@ struct FuseExport {
     bool growable;
     /* Whether allow_other was used as a mount option or not */
     bool allow_other;
+    bool is_uring;
 
     mode_t st_mode;
     uid_t st_uid;
@@ -257,6 +295,126 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
     .drained_poll = fuse_export_drained_poll,
 };
 
+#ifdef CONFIG_LINUX_IO_URING
+static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
+
+static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
+{
+    CqeHandler *cqe_handler = opaque;
+    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
+    FuseExport *exp = ent->q->exp;
+
+    fuse_uring_co_process_request(ent);
+
+    fuse_dec_in_flight(exp);
+}
+
+static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
+{
+    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
+    FuseQueue *q = ent->q;
+    Coroutine *co;
+    FuseExport *exp = ent->q->exp;
+
+    int err = cqe_handler->cqe.res;
+    if (err != 0) {
+        /* TODO end_conn support */
+
+        /* -ENOTCONN is ok on umount */
+        if (err != -EINTR && err != -EOPNOTSUPP &&
+            err != -EAGAIN && err != -ENOTCONN) {
+            fuse_export_halt(exp);
+        }
+    } else {
+        co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes,
+                                   cqe_handler);
+        /* Decremented by co_fuse_uring_queue_handle_cqes() */
+        fuse_inc_in_flight(q->exp);
+        qemu_coroutine_enter(co);
+    }
+}
+
+static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
+                                        const unsigned int qid,
+                                        const uint64_t commit_id)
+{
+    req->qid = qid;
+    req->commit_id = commit_id;
+    req->flags = 0;
+}
+
+static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseRingEnt *ent,
+                                   __u32 cmd_op)
+{
+    sqe->opcode = IORING_OP_URING_CMD;
+
+    sqe->fd = ent->q->fuse_fd;
+    sqe->rw_flags = 0;
+    sqe->ioprio = 0;
+    sqe->off = 0;
+
+    sqe->cmd_op = cmd_op;
+    sqe->__pad1 = 0;
+}
+
+static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
+{
+    FuseQueue *q = opaque;
+    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
+
+    fuse_uring_sqe_prepare(sqe, &q->ent, FUSE_IO_URING_CMD_REGISTER);
+
+    sqe->addr = (uint64_t)(q->ent.iov);
+    sqe->len = 2;
+
+    fuse_uring_sqe_set_req_data(req, q->qid, 0);
+}
+
+static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
+{
+    /*
+     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
+     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
+     * the kernel by default. Also, max_write should not exceed
+     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
+     */
+    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
+
+    if (!(out->flags & FUSE_MAX_PAGES)) {
+        /*
+         * bufsize = MIN(FUSE_DEFAULT_MAX_PAGES_PER_REQ *
+         *       qemu_real_host_page_size() + FUSE_BUFFER_HEADER_SIZE, bufsize);
+         */
+        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
+                  + FUSE_BUFFER_HEADER_SIZE;
+    }
+
+    for (int i = 0; i < exp->num_queues; i++) {
+        FuseQueue *q = &exp->queues[i];
+
+        q->ent.q = q;
+
+        q->ent.req_header = g_malloc0(sizeof(struct fuse_uring_req_header));
+        q->ent.req_payload_sz = bufsize - FUSE_BUFFER_HEADER_SIZE;
+        q->ent.op_payload = g_malloc0(q->ent.req_payload_sz);
+
+        q->ent.iov[0] = (struct iovec) {
+            q->ent.req_header,
+            sizeof(struct fuse_uring_req_header)
+        };
+        q->ent.iov[1] = (struct iovec) {
+            q->ent.op_payload,
+            q->ent.req_payload_sz
+        };
+
+        exp->queues[i].ent.fuse_cqe_handler.cb = fuse_uring_cqe_handler;
+
+        aio_add_sqe(fuse_uring_prep_sqe_register, &(exp->queues[i]),
+                    &(exp->queues[i].ent.fuse_cqe_handler));
+    }
+}
+#endif
+
 static int fuse_export_create(BlockExport *blk_exp,
                               BlockExportOptions *blk_exp_args,
                               AioContext *const *multithread,
@@ -280,6 +438,7 @@ static int fuse_export_create(BlockExport *blk_exp,
 
         for (size_t i = 0; i < mt_count; i++) {
             exp->queues[i] = (FuseQueue) {
+                .qid = i,
                 .exp = exp,
                 .ctx = multithread[i],
                 .fuse_fd = -1,
@@ -293,6 +452,7 @@ static int fuse_export_create(BlockExport *blk_exp,
         exp->num_queues = 1;
         exp->queues = g_new(FuseQueue, 1);
         exp->queues[0] = (FuseQueue) {
+            .qid = 0,
             .exp = exp,
             .ctx = exp->common.ctx,
             .fuse_fd = -1,
@@ -312,6 +472,8 @@ static int fuse_export_create(BlockExport *blk_exp,
         }
     }
 
+    exp->is_uring = args->uring ? true : false;
+
     blk_set_dev_ops(exp->common.blk, &fuse_export_blk_dev_ops, exp);
 
     /*
@@ -597,6 +759,22 @@ static void read_from_fuse_fd(void *opaque)
     qemu_coroutine_enter(co);
 }
 
+#ifdef CONFIG_LINUX_IO_URING
+static void fuse_export_delete_uring(FuseExport *exp)
+{
+    exp->is_uring = false;
+
+    /*
+     * TODO
+     * end_conn handling
+     */
+    for (size_t qid = 0; qid < exp->num_queues; qid++) {
+        g_free(exp->queues[qid].ent.req_header);
+        g_free(exp->queues[qid].ent.op_payload);
+    }
+}
+#endif
+
 static void fuse_export_shutdown(BlockExport *blk_exp)
 {
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
@@ -618,6 +796,11 @@ static void fuse_export_delete(BlockExport *blk_exp)
 {
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
 
+#ifdef CONFIG_LINUX_IO_URING
+    if (exp->is_uring)
+        fuse_export_delete_uring(exp);
+#endif
+
     for (int i = 0; i < exp->num_queues; i++) {
         FuseQueue *q = &exp->queues[i];
 
@@ -687,15 +870,22 @@ static ssize_t coroutine_fn
 fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
              uint32_t max_readahead, uint32_t flags)
 {
-    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
+    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO
+                                     | FUSE_INIT_EXT;
+    uint64_t outargflags = flags;
+
+#ifdef CONFIG_LINUX_IO_URING
+    if (exp->is_uring)
+        outargflags |= FUSE_OVER_IO_URING;
+#endif
 
     *out = (struct fuse_init_out) {
         .major = FUSE_KERNEL_VERSION,
         .minor = FUSE_KERNEL_MINOR_VERSION,
         .max_readahead = max_readahead,
         .max_write = FUSE_MAX_WRITE_BYTES,
-        .flags = flags & supported_flags,
-        .flags2 = 0,
+        .flags = outargflags & supported_flags,
+        .flags2 = outargflags >> 32,
 
         /* libfuse maximum: 2^16 - 1 */
         .max_background = UINT16_MAX,
@@ -943,6 +1133,9 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
  * Data in @in_place_buf is assumed to be overwritten after yielding, so will
  * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
  * assumed to be exclusively owned and will be used as-is.
+ * In FUSE-over-io_uring mode, the actual op_payload content is stored in
+ * @spillover_buf. To ensure this buffer is used for writing, @in_place_buf
+ * is explicitly set to NULL.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
 static ssize_t coroutine_fn
@@ -950,8 +1143,8 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
               uint64_t offset, uint32_t size,
               const void *in_place_buf, const void *spillover_buf)
 {
-    size_t in_place_size;
-    void *copied;
+    size_t in_place_size = 0;
+    void *copied = NULL;
     int64_t blk_len;
     int ret;
     struct iovec iov[2];
@@ -966,10 +1159,12 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
         return -EACCES;
     }
 
-    /* Must copy to bounce buffer before potentially yielding */
-    in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
-    copied = blk_blockalign(exp->common.blk, in_place_size);
-    memcpy(copied, in_place_buf, in_place_size);
+    if (in_place_buf) {
+        /* Must copy to bounce buffer before potentially yielding */
+        in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
+        copied = blk_blockalign(exp->common.blk, in_place_size);
+        memcpy(copied, in_place_buf, in_place_size);
+    }
 
     /**
      * Clients will expect short writes at EOF, so we have to limit
@@ -993,26 +1188,37 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
         }
     }
 
-    iov[0] = (struct iovec) {
-        .iov_base = copied,
-        .iov_len = in_place_size,
-    };
-    if (size > FUSE_IN_PLACE_WRITE_BYTES) {
-        assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
-        iov[1] = (struct iovec) {
-            .iov_base = (void *)spillover_buf,
-            .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
+    if (in_place_buf) {
+        iov[0] = (struct iovec) {
+            .iov_base = copied,
+            .iov_len = in_place_size,
         };
-        qemu_iovec_init_external(&qiov, iov, 2);
+        if (size > FUSE_IN_PLACE_WRITE_BYTES) {
+            assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
+            iov[1] = (struct iovec) {
+                .iov_base = (void *)spillover_buf,
+                .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
+            };
+            qemu_iovec_init_external(&qiov, iov, 2);
+        } else {
+            qemu_iovec_init_external(&qiov, iov, 1);
+        }
     } else {
+        /* fuse over io_uring */
+        iov[0] = (struct iovec) {
+            .iov_base = (void *)spillover_buf,
+            .iov_len = size,
+        };
         qemu_iovec_init_external(&qiov, iov, 1);
     }
+
     ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);
     if (ret < 0) {
         goto fail_free_buffer;
     }
 
-    qemu_vfree(copied);
+    if (in_place_buf)
+        qemu_vfree(copied);
 
     *out = (struct fuse_write_out) {
         .size = size,
@@ -1020,7 +1226,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
     return sizeof(*out);
 
 fail_free_buffer:
-    qemu_vfree(copied);
+    if (in_place_buf) {
+        qemu_vfree(copied);
+    }
     return ret;
 }
 
@@ -1409,6 +1617,12 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
         const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, q);
         ret = fuse_co_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
                            in->max_readahead, in->flags);
+#ifdef CONFIG_LINUX_IO_URING
+        /* Set up fuse over io_uring after replying to the first FUSE_INIT */
+        if (exp->is_uring) {
+            fuse_uring_start(exp, FUSE_OUT_OP_STRUCT(init, out_buf));
+        }
+#endif
         break;
     }
 
@@ -1515,6 +1729,173 @@ fuse_co_process_request(FuseQueue *q, void *spillover_buf)
         qemu_vfree(spillover_buf);
     }
 
+#ifdef CONFIG_LINUX_IO_URING
+static void fuse_uring_prep_sqe_commit(struct io_uring_sqe *sqe, void *opaque)
+{
+    FuseRingEnt *ent = opaque;
+    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
+
+    fuse_uring_sqe_prepare(sqe, ent, FUSE_IO_URING_CMD_COMMIT_AND_FETCH);
+    fuse_uring_sqe_set_req_data(req, ent->q->qid,
+                                ent->req_commit_id);
+}
+
+static void
+fuse_uring_write_response(FuseRingEnt *ent, uint64_t req_id, ssize_t ret,
+                          const void *out_op_hdr, const void *buf)
+{
+    struct fuse_uring_req_header *rrh = ent->req_header;
+    struct fuse_out_header *out_header = (struct fuse_out_header *)&rrh->in_out;
+    struct fuse_uring_ent_in_out *ent_in_out =
+        (struct fuse_uring_ent_in_out *)&rrh->ring_ent_in_out;
+
+    if (buf) {
+        memcpy(ent->op_payload, buf, ret);
+    } else if (ret > 0) {
+        if (ret > ent->req_payload_sz) {
+            warn_report("data size %zd exceeds payload buffer size %zu",
+                        ret, ent->req_payload_sz);
+            ret = -EINVAL;
+        } else {
+            memcpy(ent->op_payload, out_op_hdr, ret);
+        }
+    }
+
+    out_header->error = ret < 0 ? ret : 0;
+    out_header->unique = req_id;
+    /* out_header->len = ret > 0 ? ret : 0; */
+    ent_in_out->payload_sz = ret > 0 ? ret : 0;
+
+    aio_add_sqe(fuse_uring_prep_sqe_commit, ent,
+                &ent->fuse_cqe_handler);
+}
+
+static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent)
+{
+    FuseQueue *q = ent->q;
+    FuseExport *exp = q->exp;
+    struct fuse_uring_req_header *rrh = ent->req_header;
+    struct fuse_uring_ent_in_out *ent_in_out =
+        (struct fuse_uring_ent_in_out *)&rrh->ring_ent_in_out;
+
+    char out_op_hdr[MAX_CONST(sizeof(struct fuse_init_out),
+                 MAX_CONST(sizeof(struct fuse_open_out),
+                 MAX_CONST(sizeof(struct fuse_attr_out),
+                 MAX_CONST(sizeof(struct fuse_write_out),
+                           sizeof(struct fuse_lseek_out)))))];
+
+    void *out_data_buffer = NULL;
+
+    uint32_t opcode;
+    uint64_t req_id;
+
+    struct fuse_in_header *in_hdr = (struct fuse_in_header *)&rrh->in_out;
+    opcode = in_hdr->opcode;
+    req_id = in_hdr->unique;
+
+    ent->req_commit_id = ent_in_out->commit_id;
+
+    if (unlikely(ent->req_commit_id == 0)) {
+        /*
+         * If this happens kernel will not find the response - it will
+         * be stuck forever - better to abort immediately.
+         */
+        error_report("fuse: io_uring request carries commit ID 0;"
+                     " halting export");
+        fuse_export_halt(exp);
+        /* The caller drops the in-flight reference after we return */
+        return;
+    }
+
+    ssize_t ret;
+
+    switch (opcode) {
+    case FUSE_OPEN:
+        ret = fuse_co_open(exp, (struct fuse_open_out *)out_op_hdr);
+        break;
+
+    case FUSE_RELEASE:
+        ret = 0;
+        break;
+
+    case FUSE_LOOKUP:
+        ret = -ENOENT; /* There is no node but the root node */
+        break;
+
+    case FUSE_GETATTR:
+        ret = fuse_co_getattr(exp, (struct fuse_attr_out *)out_op_hdr);
+        break;
+
+    case FUSE_SETATTR: {
+        const struct fuse_setattr_in *in =
+                        (const struct fuse_setattr_in *)&rrh->op_in;
+        ret = fuse_co_setattr(exp, (struct fuse_attr_out *)out_op_hdr,
+                              in->valid, in->size, in->mode, in->uid, in->gid);
+        break;
+    }
+
+    case FUSE_READ: {
+        const struct fuse_read_in *in =
+                        (const struct fuse_read_in *)&rrh->op_in;
+        ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
+        break;
+    }
+
+    case FUSE_WRITE: {
+        const struct fuse_write_in *in =
+                        (const struct fuse_write_in *)&rrh->op_in;
+
+        assert(in->size == ent_in_out->payload_sz);
+
+        /*
+         * poll_fuse_fd() has checked that in_hdr->len matches the number of
+         * bytes read, which cannot exceed the max_write value we set
+         * (FUSE_MAX_WRITE_BYTES). So we know that FUSE_MAX_WRITE_BYTES >=
+         * in_hdr->len >= in->size + X, so this assertion must hold.
+         */
+        assert(in->size <= FUSE_MAX_WRITE_BYTES);
+
+        ret = fuse_co_write(exp, (struct fuse_write_out *)out_op_hdr,
+                            in->offset, in->size, NULL, ent->op_payload);
+        break;
+    }
+
+    case FUSE_FALLOCATE: {
+        const struct fuse_fallocate_in *in =
+                        (const struct fuse_fallocate_in *)&rrh->op_in;
+        ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
+        break;
+    }
+
+    case FUSE_FSYNC:
+        ret = fuse_co_fsync(exp);
+        break;
+
+    case FUSE_FLUSH:
+        ret = fuse_co_flush(exp);
+        break;
+
+#ifdef CONFIG_FUSE_LSEEK
+    case FUSE_LSEEK: {
+        const struct fuse_lseek_in *in =
+                        (const struct fuse_lseek_in *)&rrh->op_in;
+        ret = fuse_co_lseek(exp, (struct fuse_lseek_out *)out_op_hdr,
+                            in->offset, in->whence);
+        break;
+    }
+#endif
+
+    default:
+        ret = -ENOSYS;
+    }
+
+    fuse_uring_write_response(ent, req_id, ret, out_op_hdr, out_data_buffer);
+
+    if (out_data_buffer)
+        qemu_vfree(out_data_buffer);
+}
+#endif
+
 const BlockExportDriver blk_exp_fuse = {
     .type = BLOCK_EXPORT_TYPE_FUSE,
     .instance_size = sizeof(FuseExport),
diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
index 35ab2d7807..4ec0648e95 100644
--- a/docs/tools/qemu-storage-daemon.rst
+++ b/docs/tools/qemu-storage-daemon.rst
@@ -78,7 +78,7 @@ Standard options:
 .. option:: --export [type=]nbd,id=,node-name=[,name=][,writable=on|off][,bitmap=]
   --export [type=]vhost-user-blk,id=,node-name=,addr.type=unix,addr.path=[,writable=on|off][,logical-block-size=][,num-queues=]
   --export [type=]vhost-user-blk,id=,node-name=,addr.type=fd,addr.str=[,writable=on|off][,logical-block-size=][,num-queues=]
-  --export [type=]fuse,id=,node-name=,mountpoint=[,growable=on|off][,writable=on|off][,allow-other=on|off|auto]
+  --export [type=]fuse,id=,node-name=,mountpoint=[,growable=on|off][,writable=on|off][,allow-other=on|off|auto][,uring=on|off]
   --export [type=]vduse-blk,id=,node-name=,name=[,writable=on|off][,num-queues=][,queue-size=][,logical-block-size=][,serial=]
 
   is a block export definition. ``node-name`` is the block node that should be
@@ -111,7 +111,13 @@ Standard options:
   that enabling this option as a non-root user requires enabling the
   user_allow_other option in the global fuse.conf configuration file. Setting
   ``allow-other`` to auto (the default) will try enabling this option, and on
-  error fall back to disabling it.
+  error fall back to disabling it. When ``uring`` is enabled (off by
+  default), FUSE-over-io_uring is set up while handling the FUSE_INIT
+  request. After that, FUSE requests are exchanged with the kernel through
+  io_uring instead of through reads and writes on the /dev/fuse file
+  descriptor.
+
+
 
   The ``vduse-blk`` export type takes a ``name`` (must be unique across the host)
   to create the VDUSE device.
diff --git a/qapi/block-export.json b/qapi/block-export.json
index 9ae703ad01..7d14f3f1ba 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -184,12 +184,16 @@
 #     mount the export with allow_other, and if that fails, try again
 #     without.  (since 6.1; default: auto)
 #
+# @uring: Use FUSE-over-io_uring to exchange requests with the kernel
+#     for this FUSE export.  (default: false)
+#
 # Since: 6.0
 ##
 { 'struct': 'BlockExportOptionsFuse',
   'data': { 'mountpoint': 'str',
             '*growable': 'bool',
-            '*allow-other': 'FuseExportAllowOther' },
+            '*allow-other': 'FuseExportAllowOther',
+            '*uring': 'bool' },
   'if': 'CONFIG_FUSE' }
 
 ##
diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
index eb72561358..803538db29 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -107,6 +107,7 @@ static void help(void)
 #ifdef CONFIG_FUSE
 "  --export [type=]fuse,id=,node-name=,mountpoint=<file>\n"
 "           [,growable=on|off][,writable=on|off][,allow-other=on|off|auto]\n"
+"           [,uring=on|off]\n"
 "                         export the specified block node over FUSE\n"
 "\n"
 #endif /* CONFIG_FUSE */
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index d2433d1d99..68d3fe8e01 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -452,10 +452,13 @@ static const FDMonOps fdmon_io_uring_ops = {
 void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
 {
     int ret;
+    int flags;
 
     ctx->io_uring_fd_tag = NULL;
+    flags = IORING_SETUP_SQE128;
 
-    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
+    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES,
+                              &ctx->fdmon_io_uring, flags);
     if (ret != 0) {
         error_setg_errno(errp, -ret, "Failed to initialize io_uring");
         return;
-- 
2.50.1
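
A note on the fuse_co_init() hunk: the reply carries a 64-bit feature
mask split across the 32-bit flags and flags2 fields, which is why
FUSE_INIT_EXT has to be acknowledged for the io_uring bit to be seen.
The sketch below is standalone and not part of the patch. The demo_*
names are hypothetical, and the two bit positions (30 and 41) are my
assumption about what linux/fuse.h defines for FUSE_INIT_EXT and
FUSE_OVER_IO_URING; please check them against your kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; verify against linux/fuse.h before relying on them. */
#define DEMO_FUSE_INIT_EXT      (1u << 30)
#define DEMO_FUSE_OVER_IO_URING (1ULL << 41)

/* Hypothetical pair mirroring the flags/flags2 members of the init reply. */
struct demo_init_flags {
    uint32_t flags;   /* low 32 bits, filtered by what we support */
    uint32_t flags2;  /* high 32 bits of the 64-bit feature mask */
};

/* Split a 64-bit feature mask the same way the hunk above does. */
struct demo_init_flags demo_split_init_flags(uint64_t wanted,
                                             uint32_t supported_low)
{
    struct demo_init_flags out = {
        .flags = (uint32_t)wanted & supported_low,
        .flags2 = (uint32_t)(wanted >> 32),
    };
    return out;
}
```

With those assumed values, asking for DEMO_FUSE_OVER_IO_URING ends up as
bit 9 of flags2 (41 - 32), while DEMO_FUSE_INIT_EXT stays in flags.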