From: Stefan Hajnoczi <stefanha@redhat.com>
To: qemu-devel@nongnu.org
Cc: Aarushi Mehta, Fam Zheng, Stefano Garzarella, Hanna Czenczek,
    eblake@redhat.com, Stefan Hajnoczi, Kevin Wolf, qemu-block@nongnu.org,
    hibriansong@gmail.com, Stefan Weil, Paolo Bonzini
Subject: [PATCH v4 09/12] aio-posix: add aio_add_sqe() API for user-defined io_uring requests
Date: Wed, 10 Sep 2025 13:57:00 -0400
Message-ID: <20250910175703.374499-10-stefanha@redhat.com>
In-Reply-To: <20250910175703.374499-1-stefanha@redhat.com>
References: <20250910175703.374499-1-stefanha@redhat.com>

Introduce the aio_add_sqe() API for submitting io_uring requests in the
current AioContext. This allows other components in QEMU, like the block
layer, to take advantage of io_uring features without creating their own
io_uring context.

This API supports nested event loops just like file descriptor monitoring
and BHs do. This comes at a complexity cost: a BH is required to dispatch
CQE callbacks and they are placed on a list so that a nested event loop
can invoke its parent's pending CQE callbacks. If you're wondering why
CqeHandler exists instead of just a callback function pointer, this is why.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
v2:
- Fix pre_sqe -> prep_sqe typo [Eric]
- Add #endif terminator comment [Eric]
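As a minimal usage sketch (not part of the patch: MyRequest, my_prep_read(),
my_read_complete(), and my_start_read() are invented names for illustration,
CONFIG_LINUX_IO_URING is assumed, and error handling is omitted), a caller
embeds a CqeHandler in its per-request state, fills in the sqe from its
@prep_sqe callback, and recovers the state with container_of() when ->cb()
runs:

    #include "qemu/osdep.h"
    #include "block/aio.h"
    #include <liburing.h>

    typedef struct {
        CqeHandler cqe_handler; /* must stay alive until ->cb() runs */
        int fd;
        void *buf;
        size_t len;
    } MyRequest;

    /* Invoked by aio_add_sqe() with an sqe that should be prepared */
    static void my_prep_read(struct io_uring_sqe *sqe, void *opaque)
    {
        MyRequest *req = opaque;

        /* Fill in the sqe; do not call io_uring_sqe_set_data() */
        io_uring_prep_read(sqe, req->fd, req->buf, req->len, 0);
    }

    /* Invoked in the submitting AioContext once the cqe is available */
    static void my_read_complete(CqeHandler *cqe_handler)
    {
        MyRequest *req = container_of(cqe_handler, MyRequest, cqe_handler);

        /* cqe.res holds the io_uring result (bytes read or -errno) */
        fprintf(stderr, "read returned %d\n", cqe_handler->cqe.res);
        g_free(req);
    }

    static void my_start_read(int fd, void *buf, size_t len)
    {
        MyRequest *req = g_new0(MyRequest, 1);

        req->fd = fd;
        req->buf = buf;
        req->len = len;
        req->cqe_handler.cb = my_read_complete;

        assert(aio_has_io_uring()); /* callers must check availability */
        aio_add_sqe(my_prep_read, req, &req->cqe_handler);
    }

Because ->cb() is dispatched from a BH, my_read_complete() may also run from
a nested aio_poll() inside another handler; that is why completions go
through a ready list instead of being invoked directly.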
---
 include/block/aio.h   |  84 ++++++++++++++++++++++-
 util/aio-posix.h      |   1 +
 util/aio-posix.c      |   9 +++
 util/fdmon-io_uring.c | 152 ++++++++++++++++++++++++++++++++----------
 4 files changed, 208 insertions(+), 38 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index d919d7c8f4..56ea0d47b7 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -61,6 +61,27 @@ typedef struct LuringState LuringState;
 /* Is polling disabled? */
 bool aio_poll_disabled(AioContext *ctx);
 
+#ifdef CONFIG_LINUX_IO_URING
+/*
+ * Each io_uring request must have a unique CqeHandler that processes the cqe.
+ * The lifetime of a CqeHandler must be at least from aio_add_sqe() until
+ * ->cb() invocation.
+ */
+typedef struct CqeHandler CqeHandler;
+struct CqeHandler {
+    /* Called by the AioContext when the request has completed */
+    void (*cb)(CqeHandler *handler);
+
+    /* Used internally, do not access this */
+    QSIMPLEQ_ENTRY(CqeHandler) next;
+
+    /* This field is filled in before ->cb() is called */
+    struct io_uring_cqe cqe;
+};
+
+typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
+#endif /* CONFIG_LINUX_IO_URING */
+
 /* Callbacks for file descriptor monitoring implementations */
 typedef struct {
     /*
@@ -138,6 +159,27 @@ typedef struct {
      * Called with list_lock incremented.
      */
     void (*gsource_dispatch)(AioContext *ctx, AioHandlerList *ready_list);
+
+#ifdef CONFIG_LINUX_IO_URING
+    /**
+     * aio_add_sqe: Add an io_uring sqe for submission.
+     * @prep_sqe: invoked with an sqe that should be prepared for submission
+     * @opaque: user-defined argument to @prep_sqe()
+     * @cqe_handler: the unique cqe handler associated with this request
+     *
+     * The caller's @prep_sqe() function is invoked to fill in the details of
+     * the sqe. Do not call io_uring_sqe_set_data() on this sqe.
+     *
+     * The kernel may see the sqe as soon as @prep_sqe() returns or it may take
+     * until the next event loop iteration.
+     *
+     * This function is called from the current AioContext and is not
+     * thread-safe.
+     */
+    void (*add_sqe)(AioContext *ctx,
+                    void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+                    void *opaque, CqeHandler *cqe_handler);
+#endif /* CONFIG_LINUX_IO_URING */
 } FDMonOps;
 
 /*
@@ -255,7 +297,11 @@ struct AioContext {
     struct io_uring fdmon_io_uring;
     AioHandlerSList submit_list;
     gpointer io_uring_fd_tag;
-#endif
+
+    /* Pending callback state for cqe handlers */
+    CqeHandlerSimpleQ cqe_handler_ready_list;
+    QEMUBH *cqe_handler_bh;
+#endif /* CONFIG_LINUX_IO_URING */
 
     /* TimerLists for calling timers - one per clock type.  Has its own
      * locking.
@@ -761,4 +807,40 @@ void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch);
  */
 void aio_context_set_thread_pool_params(AioContext *ctx, int64_t min,
                                         int64_t max, Error **errp);
+
+#ifdef CONFIG_LINUX_IO_URING
+/**
+ * aio_has_io_uring: Return whether io_uring is available.
+ *
+ * io_uring is either available in all AioContexts or in none, so this only
+ * needs to be called once from within any thread's AioContext.
+ */
+static inline bool aio_has_io_uring(void)
+{
+    AioContext *ctx = qemu_get_current_aio_context();
+    return ctx->fdmon_ops->add_sqe;
+}
+
+/**
+ * aio_add_sqe: Add an io_uring sqe for submission.
+ * @prep_sqe: invoked with an sqe that should be prepared for submission
+ * @opaque: user-defined argument to @prep_sqe()
+ * @cqe_handler: the unique cqe handler associated with this request
+ *
+ * The caller's @prep_sqe() function is invoked to fill in the details of the
+ * sqe. Do not call io_uring_sqe_set_data() on this sqe.
+ *
+ * The sqe is submitted by the current AioContext. The kernel may see the sqe
+ * as soon as @prep_sqe() returns or it may take until the next event loop
+ * iteration.
+ *
+ * When the AioContext is destroyed, pending sqes are ignored and their
+ * CqeHandlers are not invoked.
+ *
+ * This function must be called only when aio_has_io_uring() returns true.
+ */
+void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+                 void *opaque, CqeHandler *cqe_handler);
+#endif /* CONFIG_LINUX_IO_URING */
+
 #endif
diff --git a/util/aio-posix.h b/util/aio-posix.h
index 6f9d97d866..57ef801a5f 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -36,6 +36,7 @@ struct AioHandler {
 #ifdef CONFIG_LINUX_IO_URING
     QSLIST_ENTRY(AioHandler) node_submitted;
     unsigned flags; /* see fdmon-io_uring.c */
+    CqeHandler cqe_handler;
 #endif
     int64_t poll_idle_timeout; /* when to stop userspace polling */
     bool poll_ready; /* has polling detected an event? */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 800b4debbf..df945312b3 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -797,3 +797,12 @@ void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch)
 
     aio_notify(ctx);
 }
+
+#ifdef CONFIG_LINUX_IO_URING
+void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+                 void *opaque, CqeHandler *cqe_handler)
+{
+    AioContext *ctx = qemu_get_current_aio_context();
+    ctx->fdmon_ops->add_sqe(ctx, prep_sqe, opaque, cqe_handler);
+}
+#endif /* CONFIG_LINUX_IO_URING */
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 3181601767..7f7c50907d 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -46,6 +46,7 @@
 #include "qemu/osdep.h"
 #include <liburing.h>
 #include "qapi/error.h"
+#include "qemu/defer-call.h"
 #include "qemu/rcu_queue.h"
 #include "aio-posix.h"
 
@@ -76,8 +77,8 @@ static inline int pfd_events_from_poll(int poll_events)
 }
 
 /*
- * Returns an sqe for submitting a request. Only be called within
- * fdmon_io_uring_wait().
+ * Returns an sqe for submitting a request. Only called from the AioContext
+ * thread.
  */
 static struct io_uring_sqe *get_sqe(AioContext *ctx)
 {
@@ -168,23 +169,43 @@ static void fdmon_io_uring_update(AioContext *ctx,
     }
 }
 
+static void fdmon_io_uring_add_sqe(AioContext *ctx,
+        void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+        void *opaque, CqeHandler *cqe_handler)
+{
+    struct io_uring_sqe *sqe = get_sqe(ctx);
+
+    prep_sqe(sqe, opaque);
+    io_uring_sqe_set_data(sqe, cqe_handler);
+}
+
+static void fdmon_special_cqe_handler(CqeHandler *cqe_handler)
+{
+    /*
+     * This is an empty function that is never called. It is used as a function
+     * pointer to distinguish it from ordinary cqe handlers.
+     */
+}
+
 static void add_poll_add_sqe(AioContext *ctx, AioHandler *node)
 {
     struct io_uring_sqe *sqe = get_sqe(ctx);
     int events = poll_events_from_pfd(node->pfd.events);
 
     io_uring_prep_poll_add(sqe, node->pfd.fd, events);
-    io_uring_sqe_set_data(sqe, node);
+    node->cqe_handler.cb = fdmon_special_cqe_handler;
+    io_uring_sqe_set_data(sqe, &node->cqe_handler);
 }
 
 static void add_poll_remove_sqe(AioContext *ctx, AioHandler *node)
 {
     struct io_uring_sqe *sqe = get_sqe(ctx);
+    CqeHandler *cqe_handler = &node->cqe_handler;
 
 #ifdef LIBURING_HAVE_DATA64
-    io_uring_prep_poll_remove(sqe, (uintptr_t)node);
+    io_uring_prep_poll_remove(sqe, (uintptr_t)cqe_handler);
 #else
-    io_uring_prep_poll_remove(sqe, node);
+    io_uring_prep_poll_remove(sqe, cqe_handler);
 #endif
     io_uring_sqe_set_data(sqe, NULL);
 }
@@ -233,19 +254,13 @@ static void fill_sq_ring(AioContext *ctx)
     }
 }
 
-/* Returns true if a handler became ready */
-static bool process_cqe(AioContext *ctx,
-                        AioHandlerList *ready_list,
-                        struct io_uring_cqe *cqe)
+static bool process_cqe_aio_handler(AioContext *ctx,
+                                    AioHandlerList *ready_list,
+                                    AioHandler *node,
+                                    struct io_uring_cqe *cqe)
 {
-    AioHandler *node = io_uring_cqe_get_data(cqe);
     unsigned flags;
 
-    /* poll_timeout and poll_remove have a zero user_data field */
-    if (!node) {
-        return false;
-    }
-
     /*
      * Deletion can only happen when IORING_OP_POLL_ADD completes. If we race
      * with enqueue() here then we can safely clear the FDMON_IO_URING_REMOVE
@@ -269,6 +284,61 @@ static bool process_cqe(AioContext *ctx,
     return true;
 }
 
+/* Process CqeHandlers from the ready list */
+static void cqe_handler_bh(void *opaque)
+{
+    AioContext *ctx = opaque;
+    CqeHandlerSimpleQ *ready_list = &ctx->cqe_handler_ready_list;
+
+    /*
+     * If cqe_handler->cb() calls aio_poll() it must continue processing
+     * ready_list. Schedule a BH so the inner event loop calls us again.
+     */
+    qemu_bh_schedule(ctx->cqe_handler_bh);
+
+    /* Handlers may use defer_call() to coalesce frequent operations */
+    defer_call_begin();
+
+    while (!QSIMPLEQ_EMPTY(ready_list)) {
+        CqeHandler *cqe_handler = QSIMPLEQ_FIRST(ready_list);
+
+        QSIMPLEQ_REMOVE_HEAD(ready_list, next);
+
+        cqe_handler->cb(cqe_handler);
+    }
+
+    defer_call_end();
+
+    qemu_bh_cancel(ctx->cqe_handler_bh);
+}
+
+/* Returns true if a handler became ready */
+static bool process_cqe(AioContext *ctx,
+                        AioHandlerList *ready_list,
+                        struct io_uring_cqe *cqe)
+{
+    CqeHandler *cqe_handler = io_uring_cqe_get_data(cqe);
+
+    /* poll_timeout and poll_remove have a zero user_data field */
+    if (!cqe_handler) {
+        return false;
+    }
+
+    /*
+     * Special handling for AioHandler cqes. They need ready_list and have a
+     * return value.
+     */
+    if (cqe_handler->cb == fdmon_special_cqe_handler) {
+        AioHandler *node = container_of(cqe_handler, AioHandler, cqe_handler);
+        return process_cqe_aio_handler(ctx, ready_list, node, cqe);
+    }
+
+    cqe_handler->cqe = *cqe;
+    QSIMPLEQ_INSERT_TAIL(&ctx->cqe_handler_ready_list, cqe_handler, next);
+    qemu_bh_schedule(ctx->cqe_handler_bh);
+    return false;
+}
+
 static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
 {
     struct io_uring *ring = &ctx->fdmon_io_uring;
@@ -368,6 +438,7 @@ static const FDMonOps fdmon_io_uring_ops = {
     .gsource_prepare = fdmon_io_uring_gsource_prepare,
     .gsource_check = fdmon_io_uring_gsource_check,
     .gsource_dispatch = fdmon_io_uring_gsource_dispatch,
+    .add_sqe = fdmon_io_uring_add_sqe,
 };
 
 void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
@@ -383,6 +454,8 @@ void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
     }
 
     QSLIST_INIT(&ctx->submit_list);
+    QSIMPLEQ_INIT(&ctx->cqe_handler_ready_list);
+    ctx->cqe_handler_bh = aio_bh_new(ctx, cqe_handler_bh, ctx);
     ctx->fdmon_ops = &fdmon_io_uring_ops;
     ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
             ctx->fdmon_io_uring.ring_fd, G_IO_IN);
@@ -390,33 +463,38 @@ void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
 
 void fdmon_io_uring_destroy(AioContext *ctx)
 {
-    if (ctx->fdmon_ops == &fdmon_io_uring_ops) {
-        AioHandler *node;
+    AioHandler *node;
 
-        io_uring_queue_exit(&ctx->fdmon_io_uring);
+    if (ctx->fdmon_ops != &fdmon_io_uring_ops) {
+        return;
+    }
 
-        /* Move handlers due to be removed onto the deleted list */
-        while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
-            unsigned flags = qatomic_fetch_and(&node->flags,
-                                               ~(FDMON_IO_URING_PENDING |
-                                                 FDMON_IO_URING_ADD |
-                                                 FDMON_IO_URING_REMOVE |
-                                                 FDMON_IO_URING_DELETE_AIO_HANDLER));
+    io_uring_queue_exit(&ctx->fdmon_io_uring);
 
-            if ((flags & FDMON_IO_URING_REMOVE) ||
-                (flags & FDMON_IO_URING_DELETE_AIO_HANDLER)) {
-                QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers,
-                                      node, node_deleted);
-            }
+    /* Move handlers due to be removed onto the deleted list */
+    while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
+        unsigned flags = qatomic_fetch_and(&node->flags,
+                                           ~(FDMON_IO_URING_PENDING |
+                                             FDMON_IO_URING_ADD |
+                                             FDMON_IO_URING_REMOVE |
+                                             FDMON_IO_URING_DELETE_AIO_HANDLER));
 
-            QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
+        if ((flags & FDMON_IO_URING_REMOVE) ||
+            (flags & FDMON_IO_URING_DELETE_AIO_HANDLER)) {
+            QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers,
+                                  node, node_deleted);
        }
-        g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
-        ctx->io_uring_fd_tag = NULL;
-
-        qemu_lockcnt_lock(&ctx->list_lock);
-        fdmon_poll_downgrade(ctx);
-        qemu_lockcnt_unlock(&ctx->list_lock);
+        QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
     }
+
+    g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
+    ctx->io_uring_fd_tag = NULL;
+
+    assert(QSIMPLEQ_EMPTY(&ctx->cqe_handler_ready_list));
+    qemu_bh_delete(ctx->cqe_handler_bh);
+
+    qemu_lockcnt_lock(&ctx->list_lock);
+    fdmon_poll_downgrade(ctx);
+    qemu_lockcnt_unlock(&ctx->list_lock);
 }
-- 
2.51.0