From: Stefan Hajnoczi <stefanha@redhat.com>
To: qemu-devel@nongnu.org
Cc: Aarushi Mehta, Fam Zheng, Stefano Garzarella, Hanna Czenczek, eblake@redhat.com, Stefan Hajnoczi, Kevin Wolf, qemu-block@nongnu.org, hibriansong@gmail.com, Stefan Weil, Paolo Bonzini
Subject: [PATCH v4 11/12] block/io_uring: use aio_add_sqe()
Date: Wed, 10 Sep 2025 13:57:02 -0400
Message-ID: <20250910175703.374499-12-stefanha@redhat.com>
In-Reply-To: <20250910175703.374499-1-stefanha@redhat.com>
References: <20250910175703.374499-1-stefanha@redhat.com>

AioContext has its own io_uring instance for file descriptor monitoring.
The disk I/O io_uring code was developed separately. Originally I thought
the characteristics of file descriptor monitoring and disk I/O were too
different, requiring separate io_uring instances.
Now it has become clear to me that it is feasible to share a single
io_uring instance for file descriptor monitoring and disk I/O. We are not
using io_uring's IOPOLL feature or anything else that would require a
separate instance.

Unify block/io_uring.c and util/fdmon-io_uring.c using the new
aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
block/io_uring.c just needs to submit readv/writev/fsync and most of the
io_uring-specific logic is handled by fdmon-io_uring.c.

There are two immediate advantages:

1. Fewer system calls. There is no need to monitor the disk I/O io_uring
   ring fd from the file descriptor monitoring io_uring instance. Disk I/O
   completions are now picked up directly. Also, sqes are accumulated in
   the sq ring until the end of the event loop iteration, so there are
   fewer io_uring_enter(2) syscalls.

2. Less code duplication.

Note that error_setg() messages are not supposed to end with punctuation,
so I removed a '.' from the non-io_uring build error message.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 include/block/aio.h     |   7 -
 include/block/raw-aio.h |   5 -
 block/file-posix.c      |  40 ++--
 block/io_uring.c        | 489 ++++++++++------------------------------
 stubs/io_uring.c        |  32 ---
 util/async.c            |  35 ---
 util/fdmon-io_uring.c   |   6 +
 block/trace-events      |  12 +-
 stubs/meson.build       |   3 -
 util/trace-events       |   4 +
 10 files changed, 140 insertions(+), 493 deletions(-)
 delete mode 100644 stubs/io_uring.c

diff --git a/include/block/aio.h b/include/block/aio.h
index 56ea0d47b7..60e76d4227 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -291,8 +291,6 @@ struct AioContext {
     struct LinuxAioState *linux_aio;
 #endif
 #ifdef CONFIG_LINUX_IO_URING
-    LuringState *linux_io_uring;
-
     /* State for file descriptor monitoring using Linux io_uring */
     struct io_uring fdmon_io_uring;
     AioHandlerSList submit_list;
@@ -597,11 +595,6 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
 /* Return the LinuxAioState bound to this AioContext */
 struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
 
-/* Setup the LuringState bound to this AioContext */
-LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
-
-/* Return the LuringState bound to this AioContext */
-LuringState *aio_get_linux_io_uring(AioContext *ctx);
 /**
  * aio_timer_new_with_attrs:
  * @ctx: the aio context
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 6570244496..30e5fc9a9f 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -74,15 +74,10 @@ static inline bool laio_has_fua(void)
 #endif
 /* io_uring.c - Linux io_uring implementation */
 #ifdef CONFIG_LINUX_IO_URING
-LuringState *luring_init(Error **errp);
-void luring_cleanup(LuringState *s);
-
 /* luring_co_submit: submit I/O requests in the thread's current AioContext.
  */
 int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
                                   QEMUIOVector *qiov, int type, BdrvRequestFlags flags);
-void luring_detach_aio_context(LuringState *s, AioContext *old_context);
-void luring_attach_aio_context(LuringState *s, AioContext *new_context);
 bool luring_has_fua(void);
 #else
 static inline bool luring_has_fua(void)
diff --git a/block/file-posix.c b/block/file-posix.c
index 8c738674ce..8b7c02d19a 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -755,14 +755,23 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 }
 #endif /* !defined(CONFIG_LINUX_AIO) */
 
-#ifndef CONFIG_LINUX_IO_URING
     if (s->use_linux_io_uring) {
+#ifdef CONFIG_LINUX_IO_URING
+        if (!aio_has_io_uring()) {
+            error_setg(errp, "aio=io_uring was specified, but is not "
+                             "available (disabled via io_uring_disabled "
+                             "sysctl or blocked by container runtime "
+                             "seccomp policy?)");
+            ret = -EINVAL;
+            goto fail;
+        }
+#else
         error_setg(errp, "aio=io_uring was specified, but is not supported "
-                   "in this build.");
+                   "in this build");
         ret = -EINVAL;
         goto fail;
-    }
 #endif /* !defined(CONFIG_LINUX_IO_URING) */
+    }
 
     s->has_discard = true;
     s->has_write_zeroes = true;
@@ -2522,27 +2531,6 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
     return true;
 }
 
-#ifdef CONFIG_LINUX_IO_URING
-static inline bool raw_check_linux_io_uring(BDRVRawState *s)
-{
-    Error *local_err = NULL;
-    AioContext *ctx;
-
-    if (!s->use_linux_io_uring) {
-        return false;
-    }
-
-    ctx = qemu_get_current_aio_context();
-    if (unlikely(!aio_setup_linux_io_uring(ctx, &local_err))) {
-        error_reportf_err(local_err, "Unable to use linux io_uring, "
-                                     "falling back to thread pool: ");
-        s->use_linux_io_uring = false;
-        return false;
-    }
-    return true;
-}
-#endif
-
 #ifdef CONFIG_LINUX_AIO
 static inline bool raw_check_linux_aio(BDRVRawState *s)
 {
@@ -2595,7 +2583,7 @@ raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr, uint64_t bytes,
     if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
         type |= QEMU_AIO_MISALIGNED;
 #ifdef CONFIG_LINUX_IO_URING
-    } else if (raw_check_linux_io_uring(s)) {
+    } else if (s->use_linux_io_uring) {
         assert(qiov->size == bytes);
         ret = luring_co_submit(bs, s->fd, offset, qiov, type, flags);
         goto out;
@@ -2692,7 +2680,7 @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
     };
 
 #ifdef CONFIG_LINUX_IO_URING
-    if (raw_check_linux_io_uring(s)) {
+    if (s->use_linux_io_uring) {
         return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
     }
 #endif
diff --git a/block/io_uring.c b/block/io_uring.c
index dd4f304910..dd930ee57e 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -11,28 +11,20 @@
 #include "qemu/osdep.h"
 #include <liburing.h>
 #include "block/aio.h"
-#include "qemu/queue.h"
 #include "block/block.h"
 #include "block/raw-aio.h"
 #include "qemu/coroutine.h"
-#include "qemu/defer-call.h"
-#include "qapi/error.h"
 #include "system/block-backend.h"
 #include "trace.h"
 
-/* Only used for assertions. */
-#include "qemu/coroutine_int.h"
-
-/* io_uring ring size */
-#define MAX_ENTRIES 128
-
-typedef struct LuringAIOCB {
+typedef struct {
     Coroutine *co;
-    struct io_uring_sqe sqeq;
-    ssize_t ret;
     QEMUIOVector *qiov;
-    bool is_read;
-    QSIMPLEQ_ENTRY(LuringAIOCB) next;
+    uint64_t offset;
+    ssize_t ret;
+    int type;
+    int fd;
+    BdrvRequestFlags flags;
 
     /*
      * Buffered reads may require resubmission, see
@@ -40,36 +32,51 @@ typedef struct LuringAIOCB {
      */
     int total_read;
     QEMUIOVector resubmit_qiov;
-} LuringAIOCB;
 
-typedef struct LuringQueue {
-    unsigned int in_queue;
-    unsigned int in_flight;
-    bool blocked;
-    QSIMPLEQ_HEAD(, LuringAIOCB) submit_queue;
-} LuringQueue;
+    CqeHandler cqe_handler;
+} LuringRequest;
 
-struct LuringState {
-    AioContext *aio_context;
-
-    struct io_uring ring;
-
-    /* No locking required, only accessed from AioContext home thread */
-    LuringQueue io_q;
-
-    QEMUBH *completion_bh;
-};
-
-/**
- * luring_resubmit:
- *
- * Resubmit a request by appending it to submit_queue. The caller must ensure
- * that ioq_submit() is called later so that submit_queue requests are started.
- */
-static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
+static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
 {
-    QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
-    s->io_q.in_queue++;
+    LuringRequest *req = opaque;
+    QEMUIOVector *qiov = req->qiov;
+    uint64_t offset = req->offset;
+    int fd = req->fd;
+    BdrvRequestFlags flags = req->flags;
+
+    switch (req->type) {
+    case QEMU_AIO_WRITE:
+#ifdef HAVE_IO_URING_PREP_WRITEV2
+    {
+        int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
+        io_uring_prep_writev2(sqe, fd, qiov->iov,
+                              qiov->niov, offset, luring_flags);
+    }
+#else
+        assert(flags == 0);
+        io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
+#endif
+        break;
+    case QEMU_AIO_ZONE_APPEND:
+        io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
+        break;
+    case QEMU_AIO_READ:
+    {
+        if (req->resubmit_qiov.iov != NULL) {
+            qiov = &req->resubmit_qiov;
+        }
+        io_uring_prep_readv(sqe, fd, qiov->iov, qiov->niov,
+                            offset + req->total_read);
+        break;
+    }
+    case QEMU_AIO_FLUSH:
+        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
+        break;
+    default:
+        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
+                __func__, req->type);
+        abort();
+    }
 }
 
 /**
@@ -78,385 +85,115 @@ static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
  * Short reads are rare but may occur. The remaining read request needs to be
  * resubmitted.
  */
-static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
-                                       int nread)
+static void luring_resubmit_short_read(LuringRequest *req, int nread)
 {
     QEMUIOVector *resubmit_qiov;
     size_t remaining;
 
-    trace_luring_resubmit_short_read(s, luringcb, nread);
+    trace_luring_resubmit_short_read(req, nread);
 
     /* Update read position */
-    luringcb->total_read += nread;
-    remaining = luringcb->qiov->size - luringcb->total_read;
+    req->total_read += nread;
+    remaining = req->qiov->size - req->total_read;
 
     /* Shorten qiov */
-    resubmit_qiov = &luringcb->resubmit_qiov;
+    resubmit_qiov = &req->resubmit_qiov;
     if (resubmit_qiov->iov == NULL) {
-        qemu_iovec_init(resubmit_qiov, luringcb->qiov->niov);
+        qemu_iovec_init(resubmit_qiov, req->qiov->niov);
     } else {
         qemu_iovec_reset(resubmit_qiov);
     }
-    qemu_iovec_concat(resubmit_qiov, luringcb->qiov, luringcb->total_read,
-                      remaining);
+    qemu_iovec_concat(resubmit_qiov, req->qiov, req->total_read, remaining);
 
-    /* Update sqe */
-    luringcb->sqeq.off += nread;
-    luringcb->sqeq.addr = (uintptr_t)luringcb->resubmit_qiov.iov;
-    luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
-
-    luring_resubmit(s, luringcb);
+    aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
 }
 
-/**
- * luring_process_completions:
- * @s: AIO state
- *
- * Fetches completed I/O requests, consumes cqes and invokes their callbacks
- * The function is somewhat tricky because it supports nested event loops, for
- * example when a request callback invokes aio_poll().
- *
- * Function schedules BH completion so it can be called again in a nested
- * event loop. When there are no events left to complete the BH is being
- * canceled.
- *
- */
-static void luring_process_completions(LuringState *s)
+static void luring_cqe_handler(CqeHandler *cqe_handler)
 {
-    struct io_uring_cqe *cqes;
-    int total_bytes;
+    LuringRequest *req = container_of(cqe_handler, LuringRequest, cqe_handler);
+    int ret = cqe_handler->cqe.res;
 
-    defer_call_begin();
+    trace_luring_cqe_handler(req, ret);
 
-    /*
-     * Request completion callbacks can run the nested event loop.
-     * Schedule ourselves so the nested event loop will "see" remaining
-     * completed requests and process them. Without this, completion
-     * callbacks that wait for other requests using a nested event loop
-     * would hang forever.
-     *
-     * This workaround is needed because io_uring uses poll_wait, which
-     * is woken up when new events are added to the uring, thus polling on
-     * the same uring fd will block unless more events are received.
-     *
-     * Other leaf block drivers (drivers that access the data themselves)
-     * are networking based, so they poll sockets for data and run the
-     * correct coroutine.
-     */
-    qemu_bh_schedule(s->completion_bh);
-
-    while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
-        LuringAIOCB *luringcb;
-        int ret;
-
-        if (!cqes) {
-            break;
+    if (ret < 0) {
+        /*
+         * Only writev/readv/fsync requests on regular files or host block
+         * devices are submitted. Therefore -EAGAIN is not expected but it's
+         * known to happen sometimes with Linux SCSI. Submit again and hope
+         * the request completes successfully.
+         *
+         * For more information, see:
+         * https://lore.kernel.org/io-uring/20210727165811.284510-3-axboe@kernel.dk/T/#u
+         *
+         * If the code is changed to submit other types of requests in the
+         * future, then this workaround may need to be extended to deal with
+         * genuine -EAGAIN results that should not be resubmitted
+         * immediately.
+         */
+        if (ret == -EINTR || ret == -EAGAIN) {
+            aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
+            return;
         }
-
-        luringcb = io_uring_cqe_get_data(cqes);
-        ret = cqes->res;
-        io_uring_cqe_seen(&s->ring, cqes);
-        cqes = NULL;
-
-        /* Change counters one-by-one because we can be nested. */
-        s->io_q.in_flight--;
-        trace_luring_process_completion(s, luringcb, ret);
-
+    } else if (req->qiov) {
         /* total_read is non-zero only for resubmitted read requests */
-        total_bytes = ret + luringcb->total_read;
+        int total_bytes = ret + req->total_read;
 
-        if (ret < 0) {
-            /*
-             * Only writev/readv/fsync requests on regular files or host block
-             * devices are submitted. Therefore -EAGAIN is not expected but it's
-             * known to happen sometimes with Linux SCSI. Submit again and hope
-             * the request completes successfully.
-             *
-             * For more information, see:
-             * https://lore.kernel.org/io-uring/20210727165811.284510-3-axboe@kernel.dk/T/#u
-             *
-             * If the code is changed to submit other types of requests in the
-             * future, then this workaround may need to be extended to deal with
-             * genuine -EAGAIN results that should not be resubmitted
-             * immediately.
-             */
-            if (ret == -EINTR || ret == -EAGAIN) {
-                luring_resubmit(s, luringcb);
-                continue;
-            }
-        } else if (!luringcb->qiov) {
-            goto end;
-        } else if (total_bytes == luringcb->qiov->size) {
+        if (total_bytes == req->qiov->size) {
             ret = 0;
-            /* Only read/write */
         } else {
             /* Short Read/Write */
-            if (luringcb->is_read) {
+            if (req->type == QEMU_AIO_READ) {
                 if (ret > 0) {
-                    luring_resubmit_short_read(s, luringcb, ret);
-                    continue;
-                } else {
-                    /* Pad with zeroes */
-                    qemu_iovec_memset(luringcb->qiov, total_bytes, 0,
-                                      luringcb->qiov->size - total_bytes);
-                    ret = 0;
+                    luring_resubmit_short_read(req, ret);
+                    return;
                 }
+
+                /* Pad with zeroes */
+                qemu_iovec_memset(req->qiov, total_bytes, 0,
+                                  req->qiov->size - total_bytes);
+                ret = 0;
             } else {
                 ret = -ENOSPC;
             }
         }
-end:
-        luringcb->ret = ret;
-        qemu_iovec_destroy(&luringcb->resubmit_qiov);
-
-        /*
-         * If the coroutine is already entered it must be in ioq_submit()
-         * and will notice luringcb->ret has been filled in when it
-         * eventually runs later. Coroutines cannot be entered recursively
-         * so avoid doing that!
-         */
-        assert(luringcb->co->ctx == s->aio_context);
-        if (!qemu_coroutine_entered(luringcb->co)) {
-            aio_co_wake(luringcb->co);
-        }
     }
 
-    qemu_bh_cancel(s->completion_bh);
+    req->ret = ret;
+    qemu_iovec_destroy(&req->resubmit_qiov);
 
-    defer_call_end();
-}
-
-static int ioq_submit(LuringState *s)
-{
-    int ret = 0;
-    LuringAIOCB *luringcb, *luringcb_next;
-
-    while (s->io_q.in_queue > 0) {
-        /*
-         * Try to fetch sqes from the ring for requests waiting in
-         * the overflow queue
-         */
-        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.submit_queue, next,
-                              luringcb_next) {
-            struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
-            if (!sqes) {
-                break;
-            }
-            /* Prep sqe for submission */
-            *sqes = luringcb->sqeq;
-            QSIMPLEQ_REMOVE_HEAD(&s->io_q.submit_queue, next);
-        }
-        ret = io_uring_submit(&s->ring);
-        trace_luring_io_uring_submit(s, ret);
-        /* Prevent infinite loop if submission is refused */
-        if (ret <= 0) {
-            if (ret == -EAGAIN || ret == -EINTR) {
-                continue;
-            }
-            break;
-        }
-        s->io_q.in_flight += ret;
-        s->io_q.in_queue -= ret;
-    }
-    s->io_q.blocked = (s->io_q.in_queue > 0);
-
-    if (s->io_q.in_flight) {
-        /*
-         * We can try to complete something just right away if there are
-         * still requests in-flight.
-         */
-        luring_process_completions(s);
-    }
-    return ret;
-}
-
-static void luring_process_completions_and_submit(LuringState *s)
-{
-    luring_process_completions(s);
-
-    if (s->io_q.in_queue > 0) {
-        ioq_submit(s);
+    /*
+     * If the coroutine is already entered it must be in luring_co_submit() and
+     * will notice req->ret has been filled in when it eventually runs later.
+     * Coroutines cannot be entered recursively so avoid doing that!
+     */
+    if (!qemu_coroutine_entered(req->co)) {
+        aio_co_wake(req->co);
     }
 }
 
-static void qemu_luring_completion_bh(void *opaque)
+int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd,
+                                  uint64_t offset, QEMUIOVector *qiov,
+                                  int type, BdrvRequestFlags flags)
 {
-    LuringState *s = opaque;
-    luring_process_completions_and_submit(s);
-}
-
-static void qemu_luring_completion_cb(void *opaque)
-{
-    LuringState *s = opaque;
-    luring_process_completions_and_submit(s);
-}
-
-static bool qemu_luring_poll_cb(void *opaque)
-{
-    LuringState *s = opaque;
-
-    return io_uring_cq_ready(&s->ring);
-}
-
-static void qemu_luring_poll_ready(void *opaque)
-{
-    LuringState *s = opaque;
-
-    luring_process_completions_and_submit(s);
-}
-
-static void ioq_init(LuringQueue *io_q)
-{
-    QSIMPLEQ_INIT(&io_q->submit_queue);
-    io_q->in_queue = 0;
-    io_q->in_flight = 0;
-    io_q->blocked = false;
-}
-
-static void luring_deferred_fn(void *opaque)
-{
-    LuringState *s = opaque;
-    trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
-                           s->io_q.in_flight);
-    if (!s->io_q.blocked && s->io_q.in_queue > 0) {
-        ioq_submit(s);
-    }
-}
-
-/**
- * luring_do_submit:
- * @fd: file descriptor for I/O
- * @luringcb: AIO control block
- * @s: AIO state
- * @offset: offset for request
- * @type: type of request
- *
- * Fetches sqes from ring, adds to pending queue and preps them
- *
- */
-static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
-                            uint64_t offset, int type, BdrvRequestFlags flags)
-{
-    int ret;
-    struct io_uring_sqe *sqes = &luringcb->sqeq;
-
-    switch (type) {
-    case QEMU_AIO_WRITE:
-#ifdef HAVE_IO_URING_PREP_WRITEV2
-    {
-        int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
-        io_uring_prep_writev2(sqes, fd, luringcb->qiov->iov,
-                              luringcb->qiov->niov, offset, luring_flags);
-    }
-#else
-        assert(flags == 0);
-        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
-                             luringcb->qiov->niov, offset);
-#endif
-        break;
-    case QEMU_AIO_ZONE_APPEND:
-        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
-                             luringcb->qiov->niov, offset);
-        break;
-    case QEMU_AIO_READ:
-        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
-                            luringcb->qiov->niov, offset);
-        break;
-    case QEMU_AIO_FLUSH:
-        io_uring_prep_fsync(sqes, fd, IORING_FSYNC_DATASYNC);
-        break;
-    default:
-        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
-                __func__, type);
-        abort();
-    }
-    io_uring_sqe_set_data(sqes, luringcb);
-
-    QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
-    s->io_q.in_queue++;
-    trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
-                           s->io_q.in_flight);
-    if (!s->io_q.blocked) {
-        if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
-            ret = ioq_submit(s);
-            trace_luring_do_submit_done(s, ret);
-            return ret;
-        }
-
-        defer_call(luring_deferred_fn, s);
-    }
-    return 0;
-}
-
-int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
-                                  QEMUIOVector *qiov, int type,
-                                  BdrvRequestFlags flags)
-{
-    int ret;
-    AioContext *ctx = qemu_get_current_aio_context();
-    LuringState *s = aio_get_linux_io_uring(ctx);
-    LuringAIOCB luringcb = {
+    LuringRequest req = {
         .co = qemu_coroutine_self(),
-        .ret = -EINPROGRESS,
         .qiov = qiov,
-        .is_read = (type == QEMU_AIO_READ),
+        .ret = -EINPROGRESS,
+        .type = type,
+        .fd = fd,
+        .offset = offset,
+        .flags = flags,
     };
-    trace_luring_co_submit(bs, s, &luringcb, fd, offset, qiov ? qiov->size : 0,
-                           type);
-    ret = luring_do_submit(fd, &luringcb, s, offset, type, flags);
 
-    if (ret < 0) {
-        return ret;
-    }
+    req.cqe_handler.cb = luring_cqe_handler;
 
-    if (luringcb.ret == -EINPROGRESS) {
+    trace_luring_co_submit(bs, &req, fd, offset, qiov ? qiov->size : 0, type);
+    aio_add_sqe(luring_prep_sqe, &req, &req.cqe_handler);
+
+    if (req.ret == -EINPROGRESS) {
         qemu_coroutine_yield();
     }
-    return luringcb.ret;
-}
-
-void luring_detach_aio_context(LuringState *s, AioContext *old_context)
-{
-    aio_set_fd_handler(old_context, s->ring.ring_fd,
-                       NULL, NULL, NULL, NULL, s);
-    qemu_bh_delete(s->completion_bh);
-    s->aio_context = NULL;
-}
-
-void luring_attach_aio_context(LuringState *s, AioContext *new_context)
-{
-    s->aio_context = new_context;
-    s->completion_bh = aio_bh_new(new_context, qemu_luring_completion_bh, s);
-    aio_set_fd_handler(s->aio_context, s->ring.ring_fd,
-                       qemu_luring_completion_cb, NULL,
-                       qemu_luring_poll_cb, qemu_luring_poll_ready, s);
-}
-
-LuringState *luring_init(Error **errp)
-{
-    int rc;
-    LuringState *s = g_new0(LuringState, 1);
-    struct io_uring *ring = &s->ring;
-
-    trace_luring_init_state(s, sizeof(*s));
-
-    rc = io_uring_queue_init(MAX_ENTRIES, ring, 0);
-    if (rc < 0) {
-        error_setg_errno(errp, -rc, "failed to init linux io_uring ring");
-        g_free(s);
-        return NULL;
-    }
-
-    ioq_init(&s->io_q);
-    return s;
-
-}
-
-void luring_cleanup(LuringState *s)
-{
-    io_uring_queue_exit(&s->ring);
-    trace_luring_cleanup_state(s);
-    g_free(s);
+    return req.ret;
 }
 
 bool luring_has_fua(void)
diff --git a/stubs/io_uring.c b/stubs/io_uring.c
deleted file mode 100644
index 622d1e4648..0000000000
--- a/stubs/io_uring.c
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Linux io_uring support.
- *
- * Copyright (C) 2009 IBM, Corp.
- * Copyright (C) 2009 Red Hat, Inc.
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- */
-#include "qemu/osdep.h"
-#include "block/aio.h"
-#include "block/raw-aio.h"
-
-void luring_detach_aio_context(LuringState *s, AioContext *old_context)
-{
-    abort();
-}
-
-void luring_attach_aio_context(LuringState *s, AioContext *new_context)
-{
-    abort();
-}
-
-LuringState *luring_init(Error **errp)
-{
-    abort();
-}
-
-void luring_cleanup(LuringState *s)
-{
-    abort();
-}
diff --git a/util/async.c b/util/async.c
index 2bb7e1f3c1..6a880714d9 100644
--- a/util/async.c
+++ b/util/async.c
@@ -383,14 +383,6 @@ aio_ctx_finalize(GSource *source)
     }
 #endif
 
-#ifdef CONFIG_LINUX_IO_URING
-    if (ctx->linux_io_uring) {
-        luring_detach_aio_context(ctx->linux_io_uring, ctx);
-        luring_cleanup(ctx->linux_io_uring);
-        ctx->linux_io_uring = NULL;
-    }
-#endif
-
     assert(QSLIST_EMPTY(&ctx->scheduled_coroutines));
     qemu_bh_delete(ctx->co_schedule_bh);
 
@@ -465,29 +457,6 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 }
 #endif
 
-#ifdef CONFIG_LINUX_IO_URING
-LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp)
-{
-    if (ctx->linux_io_uring) {
-        return ctx->linux_io_uring;
-    }
-
-    ctx->linux_io_uring = luring_init(errp);
-    if (!ctx->linux_io_uring) {
-        return NULL;
-    }
-
-    luring_attach_aio_context(ctx->linux_io_uring, ctx);
-    return ctx->linux_io_uring;
-}
-
-LuringState *aio_get_linux_io_uring(AioContext *ctx)
-{
-    assert(ctx->linux_io_uring);
-    return ctx->linux_io_uring;
-}
-#endif
-
 void aio_notify(AioContext *ctx)
 {
     /*
@@ -611,10 +580,6 @@ AioContext *aio_context_new(Error **errp)
     ctx->linux_aio = NULL;
 #endif
 
-#ifdef CONFIG_LINUX_IO_URING
-    ctx->linux_io_uring = NULL;
-#endif
-
     ctx->thread_pool = NULL;
     qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 9ed02597d6..28b93c8ab9 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -49,6 +49,7 @@
 #include "qemu/defer-call.h"
 #include "qemu/rcu_queue.h"
 #include "aio-posix.h"
+#include "trace.h"
 
 enum {
     FDMON_IO_URING_ENTRIES = 128, /* sq/cq ring size */
@@ -177,6 +178,9 @@ static void fdmon_io_uring_add_sqe(AioContext *ctx,
 
     prep_sqe(sqe, opaque);
     io_uring_sqe_set_data(sqe, cqe_handler);
+
+    trace_fdmon_io_uring_add_sqe(ctx, opaque, sqe->opcode, sqe->fd, sqe->off,
+                                 cqe_handler);
 }
 
 static void fdmon_special_cqe_handler(CqeHandler *cqe_handler)
@@ -304,6 +308,8 @@ static void cqe_handler_bh(void *opaque)
 
         QSIMPLEQ_REMOVE_HEAD(ready_list, next);
 
+        trace_fdmon_io_uring_cqe_handler(ctx, cqe_handler,
+                                         cqe_handler->cqe.res);
         cqe_handler->cb(cqe_handler);
     }
 
diff --git a/block/trace-events b/block/trace-events
index 8e789e1f12..c9b4736ff8 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -62,15 +62,9 @@ qmp_block_stream(void *bs) "bs %p"
 file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
 
 # io_uring.c
-luring_init_state(void *s, size_t size) "s %p size %zu"
-luring_cleanup_state(void *s) "%p freed"
-luring_unplug_fn(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
-luring_do_submit(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
-luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
-luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
-luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
-luring_io_uring_submit(void *s, int ret) "LuringState %p ret %d"
-luring_resubmit_short_read(void *s, void *luringcb, int nread) "LuringState %p luringcb %p nread %d"
+luring_cqe_handler(void *req, int ret) "req %p ret %d"
+luring_co_submit(void *bs, void *req, int fd, uint64_t offset, size_t nbytes, int type) "bs %p req %p fd %d offset %" PRId64 " nbytes %zd type %d"
+luring_resubmit_short_read(void *req, int nread) "req %p nread %d"
 
 # qcow2.c
 qcow2_add_task(void *co, void *bs, void *pool, const char *action, int cluster_type, uint64_t host_offset, uint64_t offset, uint64_t bytes, void *qiov, size_t qiov_offset) "co %p bs %p pool %p: %s: cluster_type %d file_cluster_offset %" PRIu64 " offset %" PRIu64 " bytes %" PRIu64 " qiov %p qiov_offset %zu"
diff --git a/stubs/meson.build b/stubs/meson.build
index cef046e685..93aa90c817 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -32,9 +32,6 @@ if have_block or have_ga
   stub_ss.add(files('cpus-virtual-clock.c'))
   stub_ss.add(files('icount.c'))
   stub_ss.add(files('graph-lock.c'))
-  if linux_io_uring.found()
-    stub_ss.add(files('io_uring.c'))
-  endif
   if libaio.found()
     stub_ss.add(files('linux-aio.c'))
   endif
diff --git a/util/trace-events b/util/trace-events
index bd8f25fb59..540d662507 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -24,6 +24,10 @@ buffer_move_empty(const char *buf, size_t len, const char *from) "%s: %zd bytes
 buffer_move(const char *buf, size_t len, const char *from) "%s: %zd bytes from %s"
 buffer_free(const char *buf, size_t len) "%s: capacity %zd"
 
+# fdmon-io_uring.c
+fdmon_io_uring_add_sqe(void *ctx, void *opaque, int opcode, int fd, uint64_t off, void *cqe_handler) "ctx %p opaque %p opcode %d fd %d off %"PRId64" cqe_handler %p"
+fdmon_io_uring_cqe_handler(void *ctx, void *cqe_handler, int cqe_res) "ctx %p cqe_handler %p cqe_res %d"
+
 # filemonitor-inotify.c
 qemu_file_monitor_add_watch(void *mon, const char *dirpath, const char *filename, void *cb, void *opaque, int64_t id) "File monitor %p add watch dir='%s' file='%s' cb=%p opaque=%p id=%" PRId64
 qemu_file_monitor_remove_watch(void *mon, const char *dirpath, int64_t id) "File monitor %p remove watch dir='%s' id=%" PRId64
-- 
2.51.0