[v5] io_uring/uring_cmd: allow non-iopoll cmds with IORING_SETUP_IOPOLL

[PATCH v5 3/5] io_uring: count CQEs in io_iopoll_check()

Posted by Caleb Sander Mateos 1 month ago

A subsequent commit will allow uring_cmds that don't use iopoll on
IORING_SETUP_IOPOLL io_urings. As a result, CQEs can be posted without
setting the iopoll_completed flag for a request in iopoll_list or going
through task work. For example, a UBLK_U_IO_FETCH_IO_CMDS command could
call io_uring_mshot_cmd_post_cqe() to directly post a CQE. The
io_iopoll_check() loop currently only counts completions posted in
io_do_iopoll() when determining whether the min_events threshold has
been met. It also exits early if there are any existing CQEs before
polling, or if any CQEs are posted while running task work. CQEs posted
via io_uring_mshot_cmd_post_cqe() or other mechanisms won't be counted
against min_events.

Explicitly check the available CQEs in each io_iopoll_check() loop
iteration to account for CQEs posted in any fashion.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
---
 io_uring/io_uring.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 46f39831d27c..b4625695bb3a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1184,11 +1184,10 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
 		io_move_task_work_from_local(ctx);
 }
 
 static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
 {
-	unsigned int nr_events = 0;
 	unsigned long check_cq;
 
 	min_events = min(min_events, ctx->cq_entries);
 
 	lockdep_assert_held(&ctx->uring_lock);
@@ -1227,34 +1226,30 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
 		 * the poll to the issued list. Otherwise we can spin here
 		 * forever, while the workqueue is stuck trying to acquire the
 		 * very same mutex.
 		 */
 		if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
-			u32 tail = ctx->cached_cq_tail;
-
 			(void) io_run_local_work_locked(ctx, min_events);
 
 			if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
 				mutex_unlock(&ctx->uring_lock);
 				io_run_task_work();
 				mutex_lock(&ctx->uring_lock);
 			}
 			/* some requests don't go through iopoll_list */
-			if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
+			if (list_empty(&ctx->iopoll_list))
 				break;
 		}
 		ret = io_do_iopoll(ctx, !min_events);
 		if (unlikely(ret < 0))
 			return ret;
 
 		if (task_sigpending(current))
 			return -EINTR;
 		if (need_resched())
 			break;
-
-		nr_events += ret;
-	} while (nr_events < min_events);
+	} while (io_cqring_events(ctx) < min_events);
 
 	return 0;
 }
 
 void io_req_task_complete(struct io_tw_req tw_req, io_tw_token_t tw)
-- 
2.45.2

Re: [PATCH v5 3/5] io_uring: count CQEs in io_iopoll_check()

Posted by Ming Lei 4 weeks, 1 day ago

On Mon, Mar 02, 2026 at 10:29:12AM -0700, Caleb Sander Mateos wrote:
> A subsequent commit will allow uring_cmds that don't use iopoll on
> IORING_SETUP_IOPOLL io_urings. As a result, CQEs can be posted without
> setting the iopoll_completed flag for a request in iopoll_list or going
> through task work. For example, a UBLK_U_IO_FETCH_IO_CMDS command could
> call io_uring_mshot_cmd_post_cqe() to directly post a CQE. The
> io_iopoll_check() loop currently only counts completions posted in
> io_do_iopoll() when determining whether the min_events threshold has
> been met. It also exits early if there are any existing CQEs before
> polling, or if any CQEs are posted while running task work. CQEs posted
> via io_uring_mshot_cmd_post_cqe() or other mechanisms won't be counted
> against min_events.
> 
> Explicitly check the available CQEs in each io_iopoll_check() loop
> iteration to account for CQEs posted in any fashion.
> 
> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> ---
>  io_uring/io_uring.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 46f39831d27c..b4625695bb3a 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -1184,11 +1184,10 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
>  		io_move_task_work_from_local(ctx);
>  }
>  
>  static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
>  {
> -	unsigned int nr_events = 0;
>  	unsigned long check_cq;
>  
>  	min_events = min(min_events, ctx->cq_entries);
>  
>  	lockdep_assert_held(&ctx->uring_lock);
> @@ -1227,34 +1226,30 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
>  		 * the poll to the issued list. Otherwise we can spin here
>  		 * forever, while the workqueue is stuck trying to acquire the
>  		 * very same mutex.
>  		 */
>  		if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
> -			u32 tail = ctx->cached_cq_tail;
> -
>  			(void) io_run_local_work_locked(ctx, min_events);
>  
>  			if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
>  				mutex_unlock(&ctx->uring_lock);
>  				io_run_task_work();
>  				mutex_lock(&ctx->uring_lock);
>  			}
>  			/* some requests don't go through iopoll_list */
> -			if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
> +			if (list_empty(&ctx->iopoll_list))
>  				break;
>  		}
>  		ret = io_do_iopoll(ctx, !min_events);
>  		if (unlikely(ret < 0))
>  			return ret;
>  
>  		if (task_sigpending(current))
>  			return -EINTR;
>  		if (need_resched())
>  			break;
> -
> -		nr_events += ret;
> -	} while (nr_events < min_events);
> +	} while (io_cqring_events(ctx) < min_events);

Before entering the loop, if io_cqring_events() finds any queued CQE,
io_iopoll_check() returns immediately without polling.

If the queued CQE is originated from non-iopoll uring_cmd, iopoll request
will not be polled, may this be one issue?


Thanks,
Ming

Re: [PATCH v5 3/5] io_uring: count CQEs in io_iopoll_check()

Posted by Caleb Sander Mateos 4 weeks, 1 day ago

On Wed, Mar 4, 2026 at 2:33 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Mar 02, 2026 at 10:29:12AM -0700, Caleb Sander Mateos wrote:
> > A subsequent commit will allow uring_cmds that don't use iopoll on
> > IORING_SETUP_IOPOLL io_urings. As a result, CQEs can be posted without
> > setting the iopoll_completed flag for a request in iopoll_list or going
> > through task work. For example, a UBLK_U_IO_FETCH_IO_CMDS command could
> > call io_uring_mshot_cmd_post_cqe() to directly post a CQE. The
> > io_iopoll_check() loop currently only counts completions posted in
> > io_do_iopoll() when determining whether the min_events threshold has
> > been met. It also exits early if there are any existing CQEs before
> > polling, or if any CQEs are posted while running task work. CQEs posted
> > via io_uring_mshot_cmd_post_cqe() or other mechanisms won't be counted
> > against min_events.
> >
> > Explicitly check the available CQEs in each io_iopoll_check() loop
> > iteration to account for CQEs posted in any fashion.
> >
> > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > ---
> >  io_uring/io_uring.c | 9 ++-------
> >  1 file changed, 2 insertions(+), 7 deletions(-)
> >
> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > index 46f39831d27c..b4625695bb3a 100644
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> > @@ -1184,11 +1184,10 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
> >               io_move_task_work_from_local(ctx);
> >  }
> >
> >  static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
> >  {
> > -     unsigned int nr_events = 0;
> >       unsigned long check_cq;
> >
> >       min_events = min(min_events, ctx->cq_entries);
> >
> >       lockdep_assert_held(&ctx->uring_lock);
> > @@ -1227,34 +1226,30 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
> >                * the poll to the issued list. Otherwise we can spin here
> >                * forever, while the workqueue is stuck trying to acquire the
> >                * very same mutex.
> >                */
> >               if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
> > -                     u32 tail = ctx->cached_cq_tail;
> > -
> >                       (void) io_run_local_work_locked(ctx, min_events);
> >
> >                       if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
> >                               mutex_unlock(&ctx->uring_lock);
> >                               io_run_task_work();
> >                               mutex_lock(&ctx->uring_lock);
> >                       }
> >                       /* some requests don't go through iopoll_list */
> > -                     if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
> > +                     if (list_empty(&ctx->iopoll_list))
> >                               break;
> >               }
> >               ret = io_do_iopoll(ctx, !min_events);
> >               if (unlikely(ret < 0))
> >                       return ret;
> >
> >               if (task_sigpending(current))
> >                       return -EINTR;
> >               if (need_resched())
> >                       break;
> > -
> > -             nr_events += ret;
> > -     } while (nr_events < min_events);
> > +     } while (io_cqring_events(ctx) < min_events);
>
> Before entering the loop, if io_cqring_events() finds any queued CQE,
> io_iopoll_check() returns immediately without polling.
>
> If the queued CQE is originated from non-iopoll uring_cmd, iopoll request
> will not be polled, may this be one issue?

I also noticed that logic and thought it seemed odd. I would think
we'd always want to wait for min_events CQEs (and iopoll once even if
min_events is 0). Looks like Jens added the early return in commit
a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs
pending"), perhaps he can shed some light on it?

Thanks,
Caleb

Re: [PATCH v5 3/5] io_uring: count CQEs in io_iopoll_check()

Posted by Jens Axboe 4 weeks, 1 day ago

On 3/4/26 8:46 AM, Caleb Sander Mateos wrote:
> On Wed, Mar 4, 2026 at 2:33?AM Ming Lei <ming.lei@redhat.com> wrote:
>>
>> On Mon, Mar 02, 2026 at 10:29:12AM -0700, Caleb Sander Mateos wrote:
>>> A subsequent commit will allow uring_cmds that don't use iopoll on
>>> IORING_SETUP_IOPOLL io_urings. As a result, CQEs can be posted without
>>> setting the iopoll_completed flag for a request in iopoll_list or going
>>> through task work. For example, a UBLK_U_IO_FETCH_IO_CMDS command could
>>> call io_uring_mshot_cmd_post_cqe() to directly post a CQE. The
>>> io_iopoll_check() loop currently only counts completions posted in
>>> io_do_iopoll() when determining whether the min_events threshold has
>>> been met. It also exits early if there are any existing CQEs before
>>> polling, or if any CQEs are posted while running task work. CQEs posted
>>> via io_uring_mshot_cmd_post_cqe() or other mechanisms won't be counted
>>> against min_events.
>>>
>>> Explicitly check the available CQEs in each io_iopoll_check() loop
>>> iteration to account for CQEs posted in any fashion.
>>>
>>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
>>> ---
>>>  io_uring/io_uring.c | 9 ++-------
>>>  1 file changed, 2 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index 46f39831d27c..b4625695bb3a 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -1184,11 +1184,10 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
>>>               io_move_task_work_from_local(ctx);
>>>  }
>>>
>>>  static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
>>>  {
>>> -     unsigned int nr_events = 0;
>>>       unsigned long check_cq;
>>>
>>>       min_events = min(min_events, ctx->cq_entries);
>>>
>>>       lockdep_assert_held(&ctx->uring_lock);
>>> @@ -1227,34 +1226,30 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
>>>                * the poll to the issued list. Otherwise we can spin here
>>>                * forever, while the workqueue is stuck trying to acquire the
>>>                * very same mutex.
>>>                */
>>>               if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
>>> -                     u32 tail = ctx->cached_cq_tail;
>>> -
>>>                       (void) io_run_local_work_locked(ctx, min_events);
>>>
>>>                       if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
>>>                               mutex_unlock(&ctx->uring_lock);
>>>                               io_run_task_work();
>>>                               mutex_lock(&ctx->uring_lock);
>>>                       }
>>>                       /* some requests don't go through iopoll_list */
>>> -                     if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
>>> +                     if (list_empty(&ctx->iopoll_list))
>>>                               break;
>>>               }
>>>               ret = io_do_iopoll(ctx, !min_events);
>>>               if (unlikely(ret < 0))
>>>                       return ret;
>>>
>>>               if (task_sigpending(current))
>>>                       return -EINTR;
>>>               if (need_resched())
>>>                       break;
>>> -
>>> -             nr_events += ret;
>>> -     } while (nr_events < min_events);
>>> +     } while (io_cqring_events(ctx) < min_events);
>>
>> Before entering the loop, if io_cqring_events() finds any queued CQE,
>> io_iopoll_check() returns immediately without polling.
>>
>> If the queued CQE is originated from non-iopoll uring_cmd, iopoll request
>> will not be polled, may this be one issue?
> 
> I also noticed that logic and thought it seemed odd. I would think
> we'd always want to wait for min_events CQEs (and iopoll once even if
> min_events is 0). Looks like Jens added the early return in commit
> a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs
> pending"), perhaps he can shed some light on it?

I don't  recall the bug in question, it's been a while... But it always
makes sense to return events that are ready, and skip polling. It should
only be done if there are no ready events to reap.

-- 
Jens Axboe

Re: [PATCH v5 3/5] io_uring: count CQEs in io_iopoll_check()

Posted by Caleb Sander Mateos 3 weeks, 5 days ago

On Wed, Mar 4, 2026 at 8:29 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/4/26 8:46 AM, Caleb Sander Mateos wrote:
> > On Wed, Mar 4, 2026 at 2:33?AM Ming Lei <ming.lei@redhat.com> wrote:
> >>
> >> On Mon, Mar 02, 2026 at 10:29:12AM -0700, Caleb Sander Mateos wrote:
> >>> A subsequent commit will allow uring_cmds that don't use iopoll on
> >>> IORING_SETUP_IOPOLL io_urings. As a result, CQEs can be posted without
> >>> setting the iopoll_completed flag for a request in iopoll_list or going
> >>> through task work. For example, a UBLK_U_IO_FETCH_IO_CMDS command could
> >>> call io_uring_mshot_cmd_post_cqe() to directly post a CQE. The
> >>> io_iopoll_check() loop currently only counts completions posted in
> >>> io_do_iopoll() when determining whether the min_events threshold has
> >>> been met. It also exits early if there are any existing CQEs before
> >>> polling, or if any CQEs are posted while running task work. CQEs posted
> >>> via io_uring_mshot_cmd_post_cqe() or other mechanisms won't be counted
> >>> against min_events.
> >>>
> >>> Explicitly check the available CQEs in each io_iopoll_check() loop
> >>> iteration to account for CQEs posted in any fashion.
> >>>
> >>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> >>> ---
> >>>  io_uring/io_uring.c | 9 ++-------
> >>>  1 file changed, 2 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> >>> index 46f39831d27c..b4625695bb3a 100644
> >>> --- a/io_uring/io_uring.c
> >>> +++ b/io_uring/io_uring.c
> >>> @@ -1184,11 +1184,10 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
> >>>               io_move_task_work_from_local(ctx);
> >>>  }
> >>>
> >>>  static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
> >>>  {
> >>> -     unsigned int nr_events = 0;
> >>>       unsigned long check_cq;
> >>>
> >>>       min_events = min(min_events, ctx->cq_entries);
> >>>
> >>>       lockdep_assert_held(&ctx->uring_lock);
> >>> @@ -1227,34 +1226,30 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
> >>>                * the poll to the issued list. Otherwise we can spin here
> >>>                * forever, while the workqueue is stuck trying to acquire the
> >>>                * very same mutex.
> >>>                */
> >>>               if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
> >>> -                     u32 tail = ctx->cached_cq_tail;
> >>> -
> >>>                       (void) io_run_local_work_locked(ctx, min_events);
> >>>
> >>>                       if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
> >>>                               mutex_unlock(&ctx->uring_lock);
> >>>                               io_run_task_work();
> >>>                               mutex_lock(&ctx->uring_lock);
> >>>                       }
> >>>                       /* some requests don't go through iopoll_list */
> >>> -                     if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
> >>> +                     if (list_empty(&ctx->iopoll_list))
> >>>                               break;
> >>>               }
> >>>               ret = io_do_iopoll(ctx, !min_events);
> >>>               if (unlikely(ret < 0))
> >>>                       return ret;
> >>>
> >>>               if (task_sigpending(current))
> >>>                       return -EINTR;
> >>>               if (need_resched())
> >>>                       break;
> >>> -
> >>> -             nr_events += ret;
> >>> -     } while (nr_events < min_events);
> >>> +     } while (io_cqring_events(ctx) < min_events);
> >>
> >> Before entering the loop, if io_cqring_events() finds any queued CQE,
> >> io_iopoll_check() returns immediately without polling.
> >>
> >> If the queued CQE is originated from non-iopoll uring_cmd, iopoll request
> >> will not be polled, may this be one issue?
> >
> > I also noticed that logic and thought it seemed odd. I would think
> > we'd always want to wait for min_events CQEs (and iopoll once even if
> > min_events is 0). Looks like Jens added the early return in commit
> > a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs
> > pending"), perhaps he can shed some light on it?
>
> I don't  recall the bug in question, it's been a while... But it always
> makes sense to return events that are ready, and skip polling. It should
> only be done if there are no ready events to reap.

Ming, are you okay with preserving that behavior in this patch then? I
guess there's a potential fairness concern where REQ_F_IOPOLL requests
may not be polled for some time if non-REQ_F_IOPOLL requests continue
to frequently post CQEs.

Jens, any thoughts on this series? Is it ready to merge?

Thanks,
Caleb

Re: [PATCH v5 3/5] io_uring: count CQEs in io_iopoll_check()

Posted by Ming Lei 3 weeks, 5 days ago

On Fri, Mar 06, 2026 at 05:38:15PM -0800, Caleb Sander Mateos wrote:
> On Wed, Mar 4, 2026 at 8:29 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > On 3/4/26 8:46 AM, Caleb Sander Mateos wrote:
> > > On Wed, Mar 4, 2026 at 2:33?AM Ming Lei <ming.lei@redhat.com> wrote:
> > >>
> > >> On Mon, Mar 02, 2026 at 10:29:12AM -0700, Caleb Sander Mateos wrote:
> > >>> A subsequent commit will allow uring_cmds that don't use iopoll on
> > >>> IORING_SETUP_IOPOLL io_urings. As a result, CQEs can be posted without
> > >>> setting the iopoll_completed flag for a request in iopoll_list or going
> > >>> through task work. For example, a UBLK_U_IO_FETCH_IO_CMDS command could
> > >>> call io_uring_mshot_cmd_post_cqe() to directly post a CQE. The
> > >>> io_iopoll_check() loop currently only counts completions posted in
> > >>> io_do_iopoll() when determining whether the min_events threshold has
> > >>> been met. It also exits early if there are any existing CQEs before
> > >>> polling, or if any CQEs are posted while running task work. CQEs posted
> > >>> via io_uring_mshot_cmd_post_cqe() or other mechanisms won't be counted
> > >>> against min_events.
> > >>>
> > >>> Explicitly check the available CQEs in each io_iopoll_check() loop
> > >>> iteration to account for CQEs posted in any fashion.
> > >>>
> > >>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > >>> ---
> > >>>  io_uring/io_uring.c | 9 ++-------
> > >>>  1 file changed, 2 insertions(+), 7 deletions(-)
> > >>>
> > >>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > >>> index 46f39831d27c..b4625695bb3a 100644
> > >>> --- a/io_uring/io_uring.c
> > >>> +++ b/io_uring/io_uring.c
> > >>> @@ -1184,11 +1184,10 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
> > >>>               io_move_task_work_from_local(ctx);
> > >>>  }
> > >>>
> > >>>  static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
> > >>>  {
> > >>> -     unsigned int nr_events = 0;
> > >>>       unsigned long check_cq;
> > >>>
> > >>>       min_events = min(min_events, ctx->cq_entries);
> > >>>
> > >>>       lockdep_assert_held(&ctx->uring_lock);
> > >>> @@ -1227,34 +1226,30 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
> > >>>                * the poll to the issued list. Otherwise we can spin here
> > >>>                * forever, while the workqueue is stuck trying to acquire the
> > >>>                * very same mutex.
> > >>>                */
> > >>>               if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
> > >>> -                     u32 tail = ctx->cached_cq_tail;
> > >>> -
> > >>>                       (void) io_run_local_work_locked(ctx, min_events);
> > >>>
> > >>>                       if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
> > >>>                               mutex_unlock(&ctx->uring_lock);
> > >>>                               io_run_task_work();
> > >>>                               mutex_lock(&ctx->uring_lock);
> > >>>                       }
> > >>>                       /* some requests don't go through iopoll_list */
> > >>> -                     if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
> > >>> +                     if (list_empty(&ctx->iopoll_list))
> > >>>                               break;
> > >>>               }
> > >>>               ret = io_do_iopoll(ctx, !min_events);
> > >>>               if (unlikely(ret < 0))
> > >>>                       return ret;
> > >>>
> > >>>               if (task_sigpending(current))
> > >>>                       return -EINTR;
> > >>>               if (need_resched())
> > >>>                       break;
> > >>> -
> > >>> -             nr_events += ret;
> > >>> -     } while (nr_events < min_events);
> > >>> +     } while (io_cqring_events(ctx) < min_events);
> > >>
> > >> Before entering the loop, if io_cqring_events() finds any queued CQE,
> > >> io_iopoll_check() returns immediately without polling.
> > >>
> > >> If the queued CQE is originated from non-iopoll uring_cmd, iopoll request
> > >> will not be polled, may this be one issue?
> > >
> > > I also noticed that logic and thought it seemed odd. I would think
> > > we'd always want to wait for min_events CQEs (and iopoll once even if
> > > min_events is 0). Looks like Jens added the early return in commit
> > > a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs
> > > pending"), perhaps he can shed some light on it?
> >
> > I don't  recall the bug in question, it's been a while... But it always
> > makes sense to return events that are ready, and skip polling. It should
> > only be done if there are no ready events to reap.
> 
> Ming, are you okay with preserving that behavior in this patch then? I
> guess there's a potential fairness concern where REQ_F_IOPOLL requests
> may not be polled for some time if non-REQ_F_IOPOLL requests continue
> to frequently post CQEs.

IMO, the fairness may not a big deal given userspace should keep polling
if the iopoll IO isn't done.

But forget to mention, if non-iopoll CQE is posted and ->cq_flush becomes
true, io_submit_flush_completions() may not get chance to run in case of
the early return. 

Maybe something like below is needed:

@@ -1556,8 +1556,10 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
         * If we do, we can potentially be spinning for commands that
         * already triggered a CQE (eg in error).
         */
-       if (io_cqring_events(ctx))
+       if (io_cqring_events(ctx)) {
+               io_submit_flush_completions(ctx);
                return 0;
+       }



Thanks,
Ming

Re: [PATCH v5 3/5] io_uring: count CQEs in io_iopoll_check()

Posted by Jens Axboe 2 weeks, 2 days ago

On 3/6/26 7:35 PM, Ming Lei wrote:
> On Fri, Mar 06, 2026 at 05:38:15PM -0800, Caleb Sander Mateos wrote:
>> On Wed, Mar 4, 2026 at 8:29?AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 3/4/26 8:46 AM, Caleb Sander Mateos wrote:
>>>> On Wed, Mar 4, 2026 at 2:33?AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>
>>>>> On Mon, Mar 02, 2026 at 10:29:12AM -0700, Caleb Sander Mateos wrote:
>>>>>> A subsequent commit will allow uring_cmds that don't use iopoll on
>>>>>> IORING_SETUP_IOPOLL io_urings. As a result, CQEs can be posted without
>>>>>> setting the iopoll_completed flag for a request in iopoll_list or going
>>>>>> through task work. For example, a UBLK_U_IO_FETCH_IO_CMDS command could
>>>>>> call io_uring_mshot_cmd_post_cqe() to directly post a CQE. The
>>>>>> io_iopoll_check() loop currently only counts completions posted in
>>>>>> io_do_iopoll() when determining whether the min_events threshold has
>>>>>> been met. It also exits early if there are any existing CQEs before
>>>>>> polling, or if any CQEs are posted while running task work. CQEs posted
>>>>>> via io_uring_mshot_cmd_post_cqe() or other mechanisms won't be counted
>>>>>> against min_events.
>>>>>>
>>>>>> Explicitly check the available CQEs in each io_iopoll_check() loop
>>>>>> iteration to account for CQEs posted in any fashion.
>>>>>>
>>>>>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
>>>>>> ---
>>>>>>  io_uring/io_uring.c | 9 ++-------
>>>>>>  1 file changed, 2 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>>>> index 46f39831d27c..b4625695bb3a 100644
>>>>>> --- a/io_uring/io_uring.c
>>>>>> +++ b/io_uring/io_uring.c
>>>>>> @@ -1184,11 +1184,10 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
>>>>>>               io_move_task_work_from_local(ctx);
>>>>>>  }
>>>>>>
>>>>>>  static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
>>>>>>  {
>>>>>> -     unsigned int nr_events = 0;
>>>>>>       unsigned long check_cq;
>>>>>>
>>>>>>       min_events = min(min_events, ctx->cq_entries);
>>>>>>
>>>>>>       lockdep_assert_held(&ctx->uring_lock);
>>>>>> @@ -1227,34 +1226,30 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
>>>>>>                * the poll to the issued list. Otherwise we can spin here
>>>>>>                * forever, while the workqueue is stuck trying to acquire the
>>>>>>                * very same mutex.
>>>>>>                */
>>>>>>               if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
>>>>>> -                     u32 tail = ctx->cached_cq_tail;
>>>>>> -
>>>>>>                       (void) io_run_local_work_locked(ctx, min_events);
>>>>>>
>>>>>>                       if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
>>>>>>                               mutex_unlock(&ctx->uring_lock);
>>>>>>                               io_run_task_work();
>>>>>>                               mutex_lock(&ctx->uring_lock);
>>>>>>                       }
>>>>>>                       /* some requests don't go through iopoll_list */
>>>>>> -                     if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
>>>>>> +                     if (list_empty(&ctx->iopoll_list))
>>>>>>                               break;
>>>>>>               }
>>>>>>               ret = io_do_iopoll(ctx, !min_events);
>>>>>>               if (unlikely(ret < 0))
>>>>>>                       return ret;
>>>>>>
>>>>>>               if (task_sigpending(current))
>>>>>>                       return -EINTR;
>>>>>>               if (need_resched())
>>>>>>                       break;
>>>>>> -
>>>>>> -             nr_events += ret;
>>>>>> -     } while (nr_events < min_events);
>>>>>> +     } while (io_cqring_events(ctx) < min_events);
>>>>>
>>>>> Before entering the loop, if io_cqring_events() finds any queued CQE,
>>>>> io_iopoll_check() returns immediately without polling.
>>>>>
>>>>> If the queued CQE is originated from non-iopoll uring_cmd, iopoll request
>>>>> will not be polled, may this be one issue?
>>>>
>>>> I also noticed that logic and thought it seemed odd. I would think
>>>> we'd always want to wait for min_events CQEs (and iopoll once even if
>>>> min_events is 0). Looks like Jens added the early return in commit
>>>> a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs
>>>> pending"), perhaps he can shed some light on it?
>>>
>>> I don't  recall the bug in question, it's been a while... But it always
>>> makes sense to return events that are ready, and skip polling. It should
>>> only be done if there are no ready events to reap.
>>
>> Ming, are you okay with preserving that behavior in this patch then? I
>> guess there's a potential fairness concern where REQ_F_IOPOLL requests
>> may not be polled for some time if non-REQ_F_IOPOLL requests continue
>> to frequently post CQEs.
> 
> IMO, the fairness may not a big deal given userspace should keep polling
> if the iopoll IO isn't done.

I think so.

> But forget to mention, if non-iopoll CQE is posted and ->cq_flush becomes
> true, io_submit_flush_completions() may not get chance to run in case of
> the early return. 
> 
> Maybe something like below is needed:
> 
> @@ -1556,8 +1556,10 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
>          * If we do, we can potentially be spinning for commands that
>          * already triggered a CQE (eg in error).
>          */
> -       if (io_cqring_events(ctx))
> +       if (io_cqring_events(ctx)) {
> +               io_submit_flush_completions(ctx);
>                 return 0;
> +       }

We should probably experiment with this a bit, I don't think it's a
showstopper for merging this.

Caleb, I think we can stage this for 7.1 and see how it goes.

-- 
Jens Axboe

Re: [PATCH v5 3/5] io_uring: count CQEs in io_iopoll_check()

Posted by Caleb Sander Mateos 2 weeks, 2 days ago

On Mon, Mar 16, 2026 at 3:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/6/26 7:35 PM, Ming Lei wrote:
> > On Fri, Mar 06, 2026 at 05:38:15PM -0800, Caleb Sander Mateos wrote:
> >> On Wed, Mar 4, 2026 at 8:29?AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>> On 3/4/26 8:46 AM, Caleb Sander Mateos wrote:
> >>>> On Wed, Mar 4, 2026 at 2:33?AM Ming Lei <ming.lei@redhat.com> wrote:
> >>>>>
> >>>>> On Mon, Mar 02, 2026 at 10:29:12AM -0700, Caleb Sander Mateos wrote:
> >>>>>> A subsequent commit will allow uring_cmds that don't use iopoll on
> >>>>>> IORING_SETUP_IOPOLL io_urings. As a result, CQEs can be posted without
> >>>>>> setting the iopoll_completed flag for a request in iopoll_list or going
> >>>>>> through task work. For example, a UBLK_U_IO_FETCH_IO_CMDS command could
> >>>>>> call io_uring_mshot_cmd_post_cqe() to directly post a CQE. The
> >>>>>> io_iopoll_check() loop currently only counts completions posted in
> >>>>>> io_do_iopoll() when determining whether the min_events threshold has
> >>>>>> been met. It also exits early if there are any existing CQEs before
> >>>>>> polling, or if any CQEs are posted while running task work. CQEs posted
> >>>>>> via io_uring_mshot_cmd_post_cqe() or other mechanisms won't be counted
> >>>>>> against min_events.
> >>>>>>
> >>>>>> Explicitly check the available CQEs in each io_iopoll_check() loop
> >>>>>> iteration to account for CQEs posted in any fashion.
> >>>>>>
> >>>>>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> >>>>>> ---
> >>>>>>  io_uring/io_uring.c | 9 ++-------
> >>>>>>  1 file changed, 2 insertions(+), 7 deletions(-)
> >>>>>>
> >>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> >>>>>> index 46f39831d27c..b4625695bb3a 100644
> >>>>>> --- a/io_uring/io_uring.c
> >>>>>> +++ b/io_uring/io_uring.c
> >>>>>> @@ -1184,11 +1184,10 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
> >>>>>>               io_move_task_work_from_local(ctx);
> >>>>>>  }
> >>>>>>
> >>>>>>  static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
> >>>>>>  {
> >>>>>> -     unsigned int nr_events = 0;
> >>>>>>       unsigned long check_cq;
> >>>>>>
> >>>>>>       min_events = min(min_events, ctx->cq_entries);
> >>>>>>
> >>>>>>       lockdep_assert_held(&ctx->uring_lock);
> >>>>>> @@ -1227,34 +1226,30 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
> >>>>>>                * the poll to the issued list. Otherwise we can spin here
> >>>>>>                * forever, while the workqueue is stuck trying to acquire the
> >>>>>>                * very same mutex.
> >>>>>>                */
> >>>>>>               if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
> >>>>>> -                     u32 tail = ctx->cached_cq_tail;
> >>>>>> -
> >>>>>>                       (void) io_run_local_work_locked(ctx, min_events);
> >>>>>>
> >>>>>>                       if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
> >>>>>>                               mutex_unlock(&ctx->uring_lock);
> >>>>>>                               io_run_task_work();
> >>>>>>                               mutex_lock(&ctx->uring_lock);
> >>>>>>                       }
> >>>>>>                       /* some requests don't go through iopoll_list */
> >>>>>> -                     if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
> >>>>>> +                     if (list_empty(&ctx->iopoll_list))
> >>>>>>                               break;
> >>>>>>               }
> >>>>>>               ret = io_do_iopoll(ctx, !min_events);
> >>>>>>               if (unlikely(ret < 0))
> >>>>>>                       return ret;
> >>>>>>
> >>>>>>               if (task_sigpending(current))
> >>>>>>                       return -EINTR;
> >>>>>>               if (need_resched())
> >>>>>>                       break;
> >>>>>> -
> >>>>>> -             nr_events += ret;
> >>>>>> -     } while (nr_events < min_events);
> >>>>>> +     } while (io_cqring_events(ctx) < min_events);
> >>>>>
> >>>>> Before entering the loop, if io_cqring_events() finds any queued CQE,
> >>>>> io_iopoll_check() returns immediately without polling.
> >>>>>
> >>>>> If the queued CQE is originated from non-iopoll uring_cmd, iopoll request
> >>>>> will not be polled, may this be one issue?
> >>>>
> >>>> I also noticed that logic and thought it seemed odd. I would think
> >>>> we'd always want to wait for min_events CQEs (and iopoll once even if
> >>>> min_events is 0). Looks like Jens added the early return in commit
> >>>> a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs
> >>>> pending"), perhaps he can shed some light on it?
> >>>
> >>> I don't  recall the bug in question, it's been a while... But it always
> >>> makes sense to return events that are ready, and skip polling. It should
> >>> only be done if there are no ready events to reap.
> >>
> >> Ming, are you okay with preserving that behavior in this patch then? I
> >> guess there's a potential fairness concern where REQ_F_IOPOLL requests
> >> may not be polled for some time if non-REQ_F_IOPOLL requests continue
> >> to frequently post CQEs.
> >
> > IMO, the fairness may not a big deal given userspace should keep polling
> > if the iopoll IO isn't done.
>
> I think so.
>
> > But forget to mention, if non-iopoll CQE is posted and ->cq_flush becomes
> > true, io_submit_flush_completions() may not get chance to run in case of
> > the early return.
> >
> > Maybe something like below is needed:
> >
> > @@ -1556,8 +1556,10 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
> >          * If we do, we can potentially be spinning for commands that
> >          * already triggered a CQE (eg in error).
> >          */
> > -       if (io_cqring_events(ctx))
> > +       if (io_cqring_events(ctx)) {
> > +               io_submit_flush_completions(ctx);
> >                 return 0;
> > +       }
>
> We should probably experiment with this a bit, I don't think it's a
> showstopper for merging this.

I don't think it's an issue currently. Any path that posts CQEs
(synchronously during submit, from task work, or iopoll) already calls
io_submit_flush_completions()/__io_submit_flush_completions(). In
other words, there's no way to reach this io_cqring_events() branch
without the CQEs having already been flushed. If a new path were added
in the future that could post CQEs outside of those contexts, it would
of course need to ensure io_submit_flush_completions() is called after
posting the CQEs.

>
> Caleb, I think we can stage this for 7.1 and see how it goes.

Sounds good, thank you.

--Caleb

[PATCH v5 1/5] io_uring: add REQ_F_IOPOLL
[PATCH v5 2/5] io_uring: remove iopoll_queue from struct io_issue_def
[PATCH v5 3/5] io_uring: count CQEs in io_iopoll_check()
[PATCH v5 4/5] io_uring/uring_cmd: allow non-iopoll cmds with IORING_SETUP_IOPOLL
[PATCH v5 5/5] nvme: remove nvme_dev_uring_cmd() IO_URING_F_IOPOLL check