[RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB

Sidong Yang posted 5 patches 1 month, 1 week ago
[RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Sidong Yang 1 month, 1 week ago
The pdu field in io_uring_cmd may contain stale data when a request
object is recycled from the slab cache. Accessing uninitialized or
garbage memory can lead to undefined behavior in users of the pdu.

Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
each command starts from a well-defined state. This avoids exposing
uninitialized memory and prevents potential misinterpretation of data
from previous requests.

No functional change is intended other than guaranteeing that pdu is
always zero-initialized before use.

Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
---
 io_uring/uring_cmd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 053bac89b6c0..2492525d4e43 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (!ac)
 		return -ENOMEM;
 	ioucmd->sqe = sqe;
+	memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
 	return 0;
 }
 
-- 
2.43.0
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Caleb Sander Mateos 1 month ago
On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
>
> The pdu field in io_uring_cmd may contain stale data when a request
> object is recycled from the slab cache. Accessing uninitialized or
> garbage memory can lead to undefined behavior in users of the pdu.
>
> Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> each command starts from a well-defined state. This avoids exposing
> uninitialized memory and prevents potential misinterpretation of data
> from previous requests.
>
> No functional change is intended other than guaranteeing that pdu is
> always zero-initialized before use.
>
> Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> ---
>  io_uring/uring_cmd.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 053bac89b6c0..2492525d4e43 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>         if (!ac)
>                 return -ENOMEM;
>         ioucmd->sqe = sqe;
> +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));

Adding this overhead to every existing uring_cmd() implementation is
unfortunate. Could we instead track the initialized/uninitialized
state by using different types on the Rust side? The io_uring_cmd
could start as an IoUringCmd, where the PDU field is MaybeUninit,
write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
PDU has been initialized.

Best,
Caleb

>         return 0;
>  }
>
> --
> 2.43.0
>
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Sidong Yang 1 month ago
On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> >
> > The pdu field in io_uring_cmd may contain stale data when a request
> > object is recycled from the slab cache. Accessing uninitialized or
> > garbage memory can lead to undefined behavior in users of the pdu.
> >
> > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > each command starts from a well-defined state. This avoids exposing
> > uninitialized memory and prevents potential misinterpretation of data
> > from previous requests.
> >
> > No functional change is intended other than guaranteeing that pdu is
> > always zero-initialized before use.
> >
> > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > ---
> >  io_uring/uring_cmd.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > index 053bac89b6c0..2492525d4e43 100644
> > --- a/io_uring/uring_cmd.c
> > +++ b/io_uring/uring_cmd.c
> > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> >         if (!ac)
> >                 return -ENOMEM;
> >         ioucmd->sqe = sqe;
> > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> 
> Adding this overhead to every existing uring_cmd() implementation is
> unfortunate. Could we instead track the initialized/uninitialized
> state by using different types on the Rust side? The io_uring_cmd
> could start as an IoUringCmd, where the PDU field is MaybeUninit,
> write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> PDU has been initialized.

I've found a flag IORING_URING_CMD_REISSUE that we could initialize
the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
But I don't know that we could call T::default() in miscdevice. If we
make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.

How about assign a byte in pdu for checking initialized? In uring_cmd(),
We could set a byte flag that it's not initialized. And we could return
error that it's not initialized in read_pdu().

Thanks,
Sidong

> 
> Best,
> Caleb
> 
> >         return 0;
> >  }
> >
> > --
> > 2.43.0
> >
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Caleb Sander Mateos 1 month ago
On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
>
> On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > >
> > > The pdu field in io_uring_cmd may contain stale data when a request
> > > object is recycled from the slab cache. Accessing uninitialized or
> > > garbage memory can lead to undefined behavior in users of the pdu.
> > >
> > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > each command starts from a well-defined state. This avoids exposing
> > > uninitialized memory and prevents potential misinterpretation of data
> > > from previous requests.
> > >
> > > No functional change is intended other than guaranteeing that pdu is
> > > always zero-initialized before use.
> > >
> > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > ---
> > >  io_uring/uring_cmd.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > index 053bac89b6c0..2492525d4e43 100644
> > > --- a/io_uring/uring_cmd.c
> > > +++ b/io_uring/uring_cmd.c
> > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > >         if (!ac)
> > >                 return -ENOMEM;
> > >         ioucmd->sqe = sqe;
> > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> >
> > Adding this overhead to every existing uring_cmd() implementation is
> > unfortunate. Could we instead track the initialized/uninitialized
> > state by using different types on the Rust side? The io_uring_cmd
> > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > PDU has been initialized.
>
> I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> But I don't know that we could call T::default() in miscdevice. If we
> make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
>
> How about assign a byte in pdu for checking initialized? In uring_cmd(),
> We could set a byte flag that it's not initialized. And we could return
> error that it's not initialized in read_pdu().

Could we do the zero-initialization (or T::default()) in
MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
isn't set (i.e. on the initial issue)? That way, we avoid any
performance penalty for the existing C uring_cmd() implementations.
I'm not quite sure what you mean by "assign a byte in pdu for checking
initialized".

Best,
Caleb
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Sidong Yang 3 weeks, 6 days ago
On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote:
> On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> >
> > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > >
> > > > The pdu field in io_uring_cmd may contain stale data when a request
> > > > object is recycled from the slab cache. Accessing uninitialized or
> > > > garbage memory can lead to undefined behavior in users of the pdu.
> > > >
> > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > > each command starts from a well-defined state. This avoids exposing
> > > > uninitialized memory and prevents potential misinterpretation of data
> > > > from previous requests.
> > > >
> > > > No functional change is intended other than guaranteeing that pdu is
> > > > always zero-initialized before use.
> > > >
> > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > > ---
> > > >  io_uring/uring_cmd.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > index 053bac89b6c0..2492525d4e43 100644
> > > > --- a/io_uring/uring_cmd.c
> > > > +++ b/io_uring/uring_cmd.c
> > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > >         if (!ac)
> > > >                 return -ENOMEM;
> > > >         ioucmd->sqe = sqe;
> > > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> > >
> > > Adding this overhead to every existing uring_cmd() implementation is
> > > unfortunate. Could we instead track the initialized/uninitialized
> > > state by using different types on the Rust side? The io_uring_cmd
> > > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > > PDU has been initialized.
> >
> > I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> > the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> > But I don't know that we could call T::default() in miscdevice. If we
> > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
> >
> > How about assign a byte in pdu for checking initialized? In uring_cmd(),
> > We could set a byte flag that it's not initialized. And we could return
> > error that it's not initialized in read_pdu().
> 
> Could we do the zero-initialization (or T::default()) in
> MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
> isn't set (i.e. on the initial issue)? That way, we avoid any
> performance penalty for the existing C uring_cmd() implementations.
> I'm not quite sure what you mean by "assign a byte in pdu for checking
> initialized".

Sure, we could fill zero when it's the first time uring_cmd called with
checking the flag. I would remove this commit for next version. I also
suggests that we would provide the method that read_pdu() and write_pdu().
In read_pdu() I want to check write_pdu() is called before. So along the
20 bytes for pdu, maybe we could use a bytes for the flag that pdu is
initialized?

But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and
issue_flags. How about some additional field for pdu is initialized like below?

struct IoUringCmdArgs {
  ioucmd: Pin<&mut IoUringCmd>,
  issue_flags: u32,
  pdu_initialized: bool,
}

> 
> Best,
> Caleb
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Caleb Sander Mateos 3 weeks, 4 days ago
On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
>
> On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote:
> > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > >
> > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > >
> > > > > The pdu field in io_uring_cmd may contain stale data when a request
> > > > > object is recycled from the slab cache. Accessing uninitialized or
> > > > > garbage memory can lead to undefined behavior in users of the pdu.
> > > > >
> > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > > > each command starts from a well-defined state. This avoids exposing
> > > > > uninitialized memory and prevents potential misinterpretation of data
> > > > > from previous requests.
> > > > >
> > > > > No functional change is intended other than guaranteeing that pdu is
> > > > > always zero-initialized before use.
> > > > >
> > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > > > ---
> > > > >  io_uring/uring_cmd.c | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > >
> > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > index 053bac89b6c0..2492525d4e43 100644
> > > > > --- a/io_uring/uring_cmd.c
> > > > > +++ b/io_uring/uring_cmd.c
> > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > >         if (!ac)
> > > > >                 return -ENOMEM;
> > > > >         ioucmd->sqe = sqe;
> > > > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> > > >
> > > > Adding this overhead to every existing uring_cmd() implementation is
> > > > unfortunate. Could we instead track the initialized/uninitialized
> > > > state by using different types on the Rust side? The io_uring_cmd
> > > > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > > > PDU has been initialized.
> > >
> > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> > > But I don't know that we could call T::default() in miscdevice. If we
> > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
> > >
> > > How about assign a byte in pdu for checking initialized? In uring_cmd(),
> > > We could set a byte flag that it's not initialized. And we could return
> > > error that it's not initialized in read_pdu().
> >
> > Could we do the zero-initialization (or T::default()) in
> > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
> > isn't set (i.e. on the initial issue)? That way, we avoid any
> > performance penalty for the existing C uring_cmd() implementations.
> > I'm not quite sure what you mean by "assign a byte in pdu for checking
> > initialized".
>
> Sure, we could fill zero when it's the first time uring_cmd called with
> checking the flag. I would remove this commit for next version. I also
> suggests that we would provide the method that read_pdu() and write_pdu().
> In read_pdu() I want to check write_pdu() is called before. So along the
> 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is
> initialized?

Not sure what you mean about "20 bytes for pdu".
It seems like it would be preferable to enforce that write_pdu() has
been called before read_pdu() using the Rust type system instead of a
runtime check. I was thinking a signature like fn write_pdu(cmd:
IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a
reason that wouldn't work and a runtime check would be necessary?

>
> But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and
> issue_flags. How about some additional field for pdu is initialized like below?
>
> struct IoUringCmdArgs {
>   ioucmd: Pin<&mut IoUringCmd>,
>   issue_flags: u32,
>   pdu_initialized: bool,
> }

One other thing I realized is that issue_flags should come from the
*current* context rather than the context the uring_cmd() callback was
called in. For example, if io_uring_cmd_done() is called from task
work context, issue_flags should match the issue_flags passed to the
io_uring_cmd_tw_t callback, not the issue_flags originally passed to
the uring_cmd() callback. So it probably makes more sense to decouple
issue_flags from the (owned) IoUringCmd. I think you could pass it by
reference (&IssueFlags) or with a phantom reference lifetime
(IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to
ensure it can't be used after those callbacks have returned.

Best,
Caleb
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Sidong Yang 3 weeks, 3 days ago
On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote:
> On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> >
> > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote:
> > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > >
> > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > >
> > > > > > The pdu field in io_uring_cmd may contain stale data when a request
> > > > > > object is recycled from the slab cache. Accessing uninitialized or
> > > > > > garbage memory can lead to undefined behavior in users of the pdu.
> > > > > >
> > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > > > > each command starts from a well-defined state. This avoids exposing
> > > > > > uninitialized memory and prevents potential misinterpretation of data
> > > > > > from previous requests.
> > > > > >
> > > > > > No functional change is intended other than guaranteeing that pdu is
> > > > > > always zero-initialized before use.
> > > > > >
> > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > > > > ---
> > > > > >  io_uring/uring_cmd.c | 1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > >
> > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > > index 053bac89b6c0..2492525d4e43 100644
> > > > > > --- a/io_uring/uring_cmd.c
> > > > > > +++ b/io_uring/uring_cmd.c
> > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > > >         if (!ac)
> > > > > >                 return -ENOMEM;
> > > > > >         ioucmd->sqe = sqe;
> > > > > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> > > > >
> > > > > Adding this overhead to every existing uring_cmd() implementation is
> > > > > unfortunate. Could we instead track the initialized/uninitialized
> > > > > state by using different types on the Rust side? The io_uring_cmd
> > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > > > > PDU has been initialized.
> > > >
> > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> > > > But I don't know that we could call T::default() in miscdevice. If we
> > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
> > > >
> > > > How about assign a byte in pdu for checking initialized? In uring_cmd(),
> > > > We could set a byte flag that it's not initialized. And we could return
> > > > error that it's not initialized in read_pdu().
> > >
> > > Could we do the zero-initialization (or T::default()) in
> > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
> > > isn't set (i.e. on the initial issue)? That way, we avoid any
> > > performance penalty for the existing C uring_cmd() implementations.
> > > I'm not quite sure what you mean by "assign a byte in pdu for checking
> > > initialized".
> >
> > Sure, we could fill zero when it's the first time uring_cmd called with
> > checking the flag. I would remove this commit for next version. I also
> > suggests that we would provide the method that read_pdu() and write_pdu().
> > In read_pdu() I want to check write_pdu() is called before. So along the
> > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is
> > initialized?
> 
> Not sure what you mean about "20 bytes for pdu".
> It seems like it would be preferable to enforce that write_pdu() has
> been called before read_pdu() using the Rust type system instead of a
> runtime check. I was thinking a signature like fn write_pdu(cmd:
> IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a
> reason that wouldn't work and a runtime check would be necessary?

I didn't think about make write_pdu() to return IoUringCmdPdu<T> before.
I think it's good way to pdu is safe without adding a new generic param for
MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call
IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe.

> 
> >
> > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and
> > issue_flags. How about some additional field for pdu is initialized like below?
> >
> > struct IoUringCmdArgs {
> >   ioucmd: Pin<&mut IoUringCmd>,
> >   issue_flags: u32,
> >   pdu_initialized: bool,
> > }
> 
> One other thing I realized is that issue_flags should come from the
> *current* context rather than the context the uring_cmd() callback was
> called in. For example, if io_uring_cmd_done() is called from task
> work context, issue_flags should match the issue_flags passed to the
> io_uring_cmd_tw_t callback, not the issue_flags originally passed to
> the uring_cmd() callback. So it probably makes more sense to decouple
> issue_flags from the (owned) IoUringCmd. I think you could pass it by
> reference (&IssueFlags) or with a phantom reference lifetime
> (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to
> ensure it can't be used after those callbacks have returned.

I have had no idea about task work context. I agree with you that
it would be better to separate issue_flags from IoUringCmd. So, 
IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>?

> 
> Best,
> Caleb
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Caleb Sander Mateos 3 weeks, 3 days ago
On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
>
> On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote:
> > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > >
> > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote:
> > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > >
> > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > >
> > > > > > > The pdu field in io_uring_cmd may contain stale data when a request
> > > > > > > object is recycled from the slab cache. Accessing uninitialized or
> > > > > > > garbage memory can lead to undefined behavior in users of the pdu.
> > > > > > >
> > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > > > > > each command starts from a well-defined state. This avoids exposing
> > > > > > > uninitialized memory and prevents potential misinterpretation of data
> > > > > > > from previous requests.
> > > > > > >
> > > > > > > No functional change is intended other than guaranteeing that pdu is
> > > > > > > always zero-initialized before use.
> > > > > > >
> > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > > > > > ---
> > > > > > >  io_uring/uring_cmd.c | 1 +
> > > > > > >  1 file changed, 1 insertion(+)
> > > > > > >
> > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > > > index 053bac89b6c0..2492525d4e43 100644
> > > > > > > --- a/io_uring/uring_cmd.c
> > > > > > > +++ b/io_uring/uring_cmd.c
> > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > > > >         if (!ac)
> > > > > > >                 return -ENOMEM;
> > > > > > >         ioucmd->sqe = sqe;
> > > > > > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> > > > > >
> > > > > > Adding this overhead to every existing uring_cmd() implementation is
> > > > > > unfortunate. Could we instead track the initialized/uninitialized
> > > > > > state by using different types on the Rust side? The io_uring_cmd
> > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > > > > > PDU has been initialized.
> > > > >
> > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> > > > > But I don't know that we could call T::default() in miscdevice. If we
> > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
> > > > >
> > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(),
> > > > > We could set a byte flag that it's not initialized. And we could return
> > > > > error that it's not initialized in read_pdu().
> > > >
> > > > Could we do the zero-initialization (or T::default()) in
> > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
> > > > isn't set (i.e. on the initial issue)? That way, we avoid any
> > > > performance penalty for the existing C uring_cmd() implementations.
> > > > I'm not quite sure what you mean by "assign a byte in pdu for checking
> > > > initialized".
> > >
> > > Sure, we could fill zero when it's the first time uring_cmd called with
> > > checking the flag. I would remove this commit for next version. I also
> > > suggests that we would provide the method that read_pdu() and write_pdu().
> > > In read_pdu() I want to check write_pdu() is called before. So along the
> > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is
> > > initialized?
> >
> > Not sure what you mean about "20 bytes for pdu".
> > It seems like it would be preferable to enforce that write_pdu() has
> > been called before read_pdu() using the Rust type system instead of a
> > runtime check. I was thinking a signature like fn write_pdu(cmd:
> > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a
> > reason that wouldn't work and a runtime check would be necessary?
>
> I didn't think about make write_pdu() to return IoUringCmdPdu<T> before.
> I think it's good way to pdu is safe without adding a new generic param for
> MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call
> IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe.

Yes, that's what I was thinking.

>
> >
> > >
> > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and
> > > issue_flags. How about some additional field for pdu is initialized like below?
> > >
> > > struct IoUringCmdArgs {
> > >   ioucmd: Pin<&mut IoUringCmd>,
> > >   issue_flags: u32,
> > >   pdu_initialized: bool,
> > > }
> >
> > One other thing I realized is that issue_flags should come from the
> > *current* context rather than the context the uring_cmd() callback was
> > called in. For example, if io_uring_cmd_done() is called from task
> > work context, issue_flags should match the issue_flags passed to the
> > io_uring_cmd_tw_t callback, not the issue_flags originally passed to
> > the uring_cmd() callback. So it probably makes more sense to decouple
> > issue_flags from the (owned) IoUringCmd. I think you could pass it by
> > reference (&IssueFlags) or with a phantom reference lifetime
> > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to
> > ensure it can't be used after those callbacks have returned.
>
> I have had no idea about task work context. I agree with you that
> it would be better to separate issue_flags from IoUringCmd. So,
> IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>?

"Task work" is a mechanism io_uring uses to queue work to run on the
thread that submitted an io_uring operation. It's basically a
per-thread atomic queue of callbacks that the thread will process
whenever it returns from the kernel to userspace (after a syscall or
an interrupt). This is the context where asynchronous uring_cmd
completions are generally processed (see
io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I
can't speak to the history of why io_uring uses task work, but my
guess would be that it provides a safe context to acquire the
io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be
called from an interrupt handler, so it's not allowed to take a
mutex). Processing all the task work at once also provides natural
opportunities for batching.
Yes, we probably don't need to bundle anything else with the
IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut
IoUringCmd> will work for uring_cmds that complete asynchronously, as
they will need to outlive the uring_cmd() call. So uring_cmd() needs
to transfer ownership of the struct io_uring_cmd.

Best,
Caleb
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Sidong Yang 3 weeks ago
On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote:
> On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> >
> > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote:
> > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > >
> > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote:
> > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > >
> > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > >
> > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request
> > > > > > > > object is recycled from the slab cache. Accessing uninitialized or
> > > > > > > > garbage memory can lead to undefined behavior in users of the pdu.
> > > > > > > >
> > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > > > > > > each command starts from a well-defined state. This avoids exposing
> > > > > > > > uninitialized memory and prevents potential misinterpretation of data
> > > > > > > > from previous requests.
> > > > > > > >
> > > > > > > > No functional change is intended other than guaranteeing that pdu is
> > > > > > > > always zero-initialized before use.
> > > > > > > >
> > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > > > > > > ---
> > > > > > > >  io_uring/uring_cmd.c | 1 +
> > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > >
> > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > > > > index 053bac89b6c0..2492525d4e43 100644
> > > > > > > > --- a/io_uring/uring_cmd.c
> > > > > > > > +++ b/io_uring/uring_cmd.c
> > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > > > > >         if (!ac)
> > > > > > > >                 return -ENOMEM;
> > > > > > > >         ioucmd->sqe = sqe;
> > > > > > > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> > > > > > >
> > > > > > > Adding this overhead to every existing uring_cmd() implementation is
> > > > > > > unfortunate. Could we instead track the initialized/uninitialized
> > > > > > > state by using different types on the Rust side? The io_uring_cmd
> > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > > > > > > PDU has been initialized.
> > > > > >
> > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> > > > > > But I don't know that we could call T::default() in miscdevice. If we
> > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
> > > > > >
> > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(),
> > > > > > We could set a byte flag that it's not initialized. And we could return
> > > > > > error that it's not initialized in read_pdu().
> > > > >
> > > > > Could we do the zero-initialization (or T::default()) in
> > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
> > > > > isn't set (i.e. on the initial issue)? That way, we avoid any
> > > > > performance penalty for the existing C uring_cmd() implementations.
> > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking
> > > > > initialized".
> > > >
> > > > Sure, we could fill zero when it's the first time uring_cmd called with
> > > > checking the flag. I would remove this commit for next version. I also
> > > > suggests that we would provide the method that read_pdu() and write_pdu().
> > > > In read_pdu() I want to check write_pdu() is called before. So along the
> > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is
> > > > initialized?
> > >
> > > Not sure what you mean about "20 bytes for pdu".
> > > It seems like it would be preferable to enforce that write_pdu() has
> > > been called before read_pdu() using the Rust type system instead of a
> > > runtime check. I was thinking a signature like fn write_pdu(cmd:
> > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a
> > > reason that wouldn't work and a runtime check would be necessary?
> >
> > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before.
> > I think it's good way to pdu is safe without adding a new generic param for
> > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call
> > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe.
> 
> Yes, that's what I was thinking.

Good, I'll change api in this way. Thanks!

> 
> >
> > >
> > > >
> > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and
> > > > issue_flags. How about some additional field for pdu is initialized like below?
> > > >
> > > > struct IoUringCmdArgs {
> > > >   ioucmd: Pin<&mut IoUringCmd>,
> > > >   issue_flags: u32,
> > > >   pdu_initialized: bool,
> > > > }
> > >
> > > One other thing I realized is that issue_flags should come from the
> > > *current* context rather than the context the uring_cmd() callback was
> > > called in. For example, if io_uring_cmd_done() is called from task
> > > work context, issue_flags should match the issue_flags passed to the
> > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to
> > > the uring_cmd() callback. So it probably makes more sense to decouple
> > > issue_flags from the (owned) IoUringCmd. I think you could pass it by
> > > reference (&IssueFlags) or with a phantom reference lifetime
> > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to
> > > ensure it can't be used after those callbacks have returned.
> >
> > I have had no idea about task work context. I agree with you that
> > it would be better to separate issue_flags from IoUringCmd. So,
> > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>?
> 
> "Task work" is a mechanism io_uring uses to queue work to run on the
> thread that submitted an io_uring operation. It's basically a
> per-thread atomic queue of callbacks that the thread will process
> whenever it returns from the kernel to userspace (after a syscall or
> an interrupt). This is the context where asynchronous uring_cmd
> completions are generally processed (see
> io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I
> can't speak to the history of why io_uring uses task work, but my
> guess would be that it provides a safe context to acquire the
> io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be
> called from an interrupt handler, so it's not allowed to take a
> mutex). Processing all the task work at once also provides natural
> opportunities for batching.

Thanks, I've checked io_uring_cmd_complete_in_task() that it receives
callback that has issue_flags different with io_uring_cmd(). I'll try to add
a api that wrapping io_uring_cmd_complete_in_task() for next version.

>
> Yes, we probably don't need to bundle anything else with the
> IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut
> IoUringCmd> will work for uring_cmds that complete asynchronously, as
> they will need to outlive the uring_cmd() call. So uring_cmd() needs
> to transfer ownership of the struct io_uring_cmd.

I can't think that how to take ownership of struct io_uring_cmd. The
struct allocated with io_alloc_req() and should be freed with io_free_req().
If taking ownership means having pointer of struct io_uring_cmd, I think
it's no difference with current version. Also could it be called with
mem::forget() if it has ownership?

Thanks,
Sidong
 
> Best,
> Caleb
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Caleb Sander Mateos 3 weeks ago
On Fri, Sep 12, 2025 at 9:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
>
> On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote:
> > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > >
> > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote:
> > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > >
> > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote:
> > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > >
> > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > > >
> > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request
> > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or
> > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu.
> > > > > > > > >
> > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > > > > > > > each command starts from a well-defined state. This avoids exposing
> > > > > > > > > uninitialized memory and prevents potential misinterpretation of data
> > > > > > > > > from previous requests.
> > > > > > > > >
> > > > > > > > > No functional change is intended other than guaranteeing that pdu is
> > > > > > > > > always zero-initialized before use.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > > > > > > > ---
> > > > > > > > >  io_uring/uring_cmd.c | 1 +
> > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > >
> > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > > > > > index 053bac89b6c0..2492525d4e43 100644
> > > > > > > > > --- a/io_uring/uring_cmd.c
> > > > > > > > > +++ b/io_uring/uring_cmd.c
> > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > > > > > >         if (!ac)
> > > > > > > > >                 return -ENOMEM;
> > > > > > > > >         ioucmd->sqe = sqe;
> > > > > > > > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> > > > > > > >
> > > > > > > > Adding this overhead to every existing uring_cmd() implementation is
> > > > > > > > unfortunate. Could we instead track the initialized/uninitialized
> > > > > > > > state by using different types on the Rust side? The io_uring_cmd
> > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > > > > > > > PDU has been initialized.
> > > > > > >
> > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> > > > > > > But I don't know that we could call T::default() in miscdevice. If we
> > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
> > > > > > >
> > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(),
> > > > > > > We could set a byte flag that it's not initialized. And we could return
> > > > > > > error that it's not initialized in read_pdu().
> > > > > >
> > > > > > Could we do the zero-initialization (or T::default()) in
> > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
> > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any
> > > > > > performance penalty for the existing C uring_cmd() implementations.
> > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking
> > > > > > initialized".
> > > > >
> > > > > Sure, we could fill zero when it's the first time uring_cmd called with
> > > > > checking the flag. I would remove this commit for next version. I also
> > > > > suggests that we would provide the method that read_pdu() and write_pdu().
> > > > > In read_pdu() I want to check write_pdu() is called before. So along the
> > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is
> > > > > initialized?
> > > >
> > > > Not sure what you mean about "20 bytes for pdu".
> > > > It seems like it would be preferable to enforce that write_pdu() has
> > > > been called before read_pdu() using the Rust type system instead of a
> > > > runtime check. I was thinking a signature like fn write_pdu(cmd:
> > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a
> > > > reason that wouldn't work and a runtime check would be necessary?
> > >
> > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before.
> > > I think it's good way to pdu is safe without adding a new generic param for
> > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call
> > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe.
> >
> > Yes, that's what I was thinking.
>
> Good, I'll change api in this way. Thanks!
>
> >
> > >
> > > >
> > > > >
> > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and
> > > > > issue_flags. How about some additional field for pdu is initialized like below?
> > > > >
> > > > > struct IoUringCmdArgs {
> > > > >   ioucmd: Pin<&mut IoUringCmd>,
> > > > >   issue_flags: u32,
> > > > >   pdu_initialized: bool,
> > > > > }
> > > >
> > > > One other thing I realized is that issue_flags should come from the
> > > > *current* context rather than the context the uring_cmd() callback was
> > > > called in. For example, if io_uring_cmd_done() is called from task
> > > > work context, issue_flags should match the issue_flags passed to the
> > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to
> > > > the uring_cmd() callback. So it probably makes more sense to decouple
> > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by
> > > > reference (&IssueFlags) or with a phantom reference lifetime
> > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to
> > > > ensure it can't be used after those callbacks have returned.
> > >
> > > I have had no idea about task work context. I agree with you that
> > > it would be better to separate issue_flags from IoUringCmd. So,
> > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>?
> >
> > "Task work" is a mechanism io_uring uses to queue work to run on the
> > thread that submitted an io_uring operation. It's basically a
> > per-thread atomic queue of callbacks that the thread will process
> > whenever it returns from the kernel to userspace (after a syscall or
> > an interrupt). This is the context where asynchronous uring_cmd
> > completions are generally processed (see
> > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I
> > can't speak to the history of why io_uring uses task work, but my
> > guess would be that it provides a safe context to acquire the
> > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be
> > called from an interrupt handler, so it's not allowed to take a
> > mutex). Processing all the task work at once also provides natural
> > opportunities for batching.
>
> Thanks, I've checked io_uring_cmd_complete_in_task() that it receives
> callback that has issue_flags different with io_uring_cmd(). I'll try to add
> a api that wrapping io_uring_cmd_complete_in_task() for next version.
>
> >
> > Yes, we probably don't need to bundle anything else with the
> > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut
> > IoUringCmd> will work for uring_cmds that complete asynchronously, as
> > they will need to outlive the uring_cmd() call. So uring_cmd() needs
> > to transfer ownership of the struct io_uring_cmd.
>
> I can't think that how to take ownership of struct io_uring_cmd. The
> struct allocated with io_alloc_req() and should be freed with io_free_req().
> If taking ownership means having pointer of struct io_uring_cmd, I think
> it's no difference with current version. Also could it be called with
> mem::forget() if it has ownership?

I don't mean ownership of the io_uring_cmd allocation; that's the
responsibility of the io_uring layer. But once the io_uring_cmd is
handed to the uring_cmd() implementation, it belongs to that layer
until it completes the command back to io_uring. Maybe a better way to
describe it would be as ownership of the "executing io_uring_cmd". The
problem with Pin<&mut IoUringCmd> is that it is a borrowed reference
to the io_uring_cmd, so it can't outlive the uring_cmd() callback.
Yes, it's possible to leak the io_uring_cmd by never calling
io_uring_cmd_done() to return it to the io_uring layer.

I would imagine something like this:

#[derive(Clone, Copy)]
struct IssueFlags<'a>(c_uint, PhantomData<&'a ()>);

// Indicates ownership of the io_uring_cmd between uring_cmd() and
io_uring_cmd_done()
struct IoUringCmd(NonNull<bindings::io_uring_cmd>);

impl IoUringCmd {
        // ...

        fn done(self, ret: i32, res2: u64, issue_flags: IssueFlags<'_>) {
                let cmd = self.0.as_ptr();
                let issue_flags = issue_flags.0;
                unsafe {
                        bindings::io_uring_cmd_done(cmd, ret, res2, issue_flags)
                }
        }
}

// Can choose whether to complete the command synchronously or asynchronously.
// If take_async() is called, IoUringCmd::done() needs to be called to
complete the command.
// If take_async() isn't called, the command is completed synchronously
// with the return value from MiscDevice::uring_cmd().
struct UringCmdInput<'a>(&mut Option<NonNull<bindings::io_uring_cmd>>);

impl UringCmdInput<'_> {
        fn take_async(self) -> IoUringCmd {
                IoUringCmd(self.0.take().unwrap())
        }
}

trait MiscDevice {
        // ...

        fn uring_cmd(
                _device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>,
                _cmd: UringCmdInput<'_>,
                _ issue_flags: IssueFlags <'_>,
        ) -> Result<i32> {
                build_error!(VTABLE_DEFAULT_ERROR)
        }
}

impl<T: MiscDevice> MiscdeviceVTable<T> {
        // ...

        unsafe extern "C" fn uring_cmd(
                ioucmd: *mut bindings::io_uring_cmd,
                issue_flags: c_uint,
        ) -> c_int {
                let raw_file = unsafe { (*ioucmd).file };
                let private = unsafe { (*raw_file).private_data }.cast();
                let device = unsafe { <T::Ptr as
ForeignOwnable>::borrow(private) };
                let mut ioucmd = Some(NonNull::new(ioucmd).unwrap());
                let issue_flags = IssueFlags(issue_flags, PhantomData);
                let ret = T::uring_cmd(device, UringCmdInput(&mut
ioucmd), issue_flags);
                // -EIOCBQUEUED indicates ownership of io_uring_cmd
has been taken
                if iou_cmd.is_none() {
                        return -bindings::EIOCBQUEUED;
                }

                let ret = from_result(|| ret);
                if ret == -bindings::EIOCBQUEUED {
                        -bindings::EINVAL
                } else {
                        ret
                }
        }
}

Best,
Caleb
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Sidong Yang 2 weeks, 6 days ago
On Fri, Sep 12, 2025 at 10:56:31AM -0700, Caleb Sander Mateos wrote:
> On Fri, Sep 12, 2025 at 9:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> >
> > On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote:
> > > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > >
> > > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote:
> > > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > >
> > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote:
> > > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > >
> > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > > > >
> > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request
> > > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or
> > > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu.
> > > > > > > > > >
> > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > > > > > > > > each command starts from a well-defined state. This avoids exposing
> > > > > > > > > > uninitialized memory and prevents potential misinterpretation of data
> > > > > > > > > > from previous requests.
> > > > > > > > > >
> > > > > > > > > > No functional change is intended other than guaranteeing that pdu is
> > > > > > > > > > always zero-initialized before use.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > > > > > > > > ---
> > > > > > > > > >  io_uring/uring_cmd.c | 1 +
> > > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > > > > > > index 053bac89b6c0..2492525d4e43 100644
> > > > > > > > > > --- a/io_uring/uring_cmd.c
> > > > > > > > > > +++ b/io_uring/uring_cmd.c
> > > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > > > > > > >         if (!ac)
> > > > > > > > > >                 return -ENOMEM;
> > > > > > > > > >         ioucmd->sqe = sqe;
> > > > > > > > > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> > > > > > > > >
> > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is
> > > > > > > > > unfortunate. Could we instead track the initialized/uninitialized
> > > > > > > > > state by using different types on the Rust side? The io_uring_cmd
> > > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > > > > > > > > PDU has been initialized.
> > > > > > > >
> > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> > > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> > > > > > > > But I don't know that we could call T::default() in miscdevice. If we
> > > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
> > > > > > > >
> > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(),
> > > > > > > > We could set a byte flag that it's not initialized. And we could return
> > > > > > > > error that it's not initialized in read_pdu().
> > > > > > >
> > > > > > > Could we do the zero-initialization (or T::default()) in
> > > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
> > > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any
> > > > > > > performance penalty for the existing C uring_cmd() implementations.
> > > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking
> > > > > > > initialized".
> > > > > >
> > > > > > Sure, we could fill zero when it's the first time uring_cmd called with
> > > > > > checking the flag. I would remove this commit for next version. I also
> > > > > > suggests that we would provide the method that read_pdu() and write_pdu().
> > > > > > In read_pdu() I want to check write_pdu() is called before. So along the
> > > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is
> > > > > > initialized?
> > > > >
> > > > > Not sure what you mean about "20 bytes for pdu".
> > > > > It seems like it would be preferable to enforce that write_pdu() has
> > > > > been called before read_pdu() using the Rust type system instead of a
> > > > > runtime check. I was thinking a signature like fn write_pdu(cmd:
> > > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a
> > > > > reason that wouldn't work and a runtime check would be necessary?
> > > >
> > > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before.
> > > > I think it's good way to pdu is safe without adding a new generic param for
> > > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call
> > > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe.
> > >
> > > Yes, that's what I was thinking.
> >
> > Good, I'll change api in this way. Thanks!
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and
> > > > > > issue_flags. How about some additional field for pdu is initialized like below?
> > > > > >
> > > > > > struct IoUringCmdArgs {
> > > > > >   ioucmd: Pin<&mut IoUringCmd>,
> > > > > >   issue_flags: u32,
> > > > > >   pdu_initialized: bool,
> > > > > > }
> > > > >
> > > > > One other thing I realized is that issue_flags should come from the
> > > > > *current* context rather than the context the uring_cmd() callback was
> > > > > called in. For example, if io_uring_cmd_done() is called from task
> > > > > work context, issue_flags should match the issue_flags passed to the
> > > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to
> > > > > the uring_cmd() callback. So it probably makes more sense to decouple
> > > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by
> > > > > reference (&IssueFlags) or with a phantom reference lifetime
> > > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to
> > > > > ensure it can't be used after those callbacks have returned.
> > > >
> > > > I have had no idea about task work context. I agree with you that
> > > > it would be better to separate issue_flags from IoUringCmd. So,
> > > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>?
> > >
> > > "Task work" is a mechanism io_uring uses to queue work to run on the
> > > thread that submitted an io_uring operation. It's basically a
> > > per-thread atomic queue of callbacks that the thread will process
> > > whenever it returns from the kernel to userspace (after a syscall or
> > > an interrupt). This is the context where asynchronous uring_cmd
> > > completions are generally processed (see
> > > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I
> > > can't speak to the history of why io_uring uses task work, but my
> > > guess would be that it provides a safe context to acquire the
> > > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be
> > > called from an interrupt handler, so it's not allowed to take a
> > > mutex). Processing all the task work at once also provides natural
> > > opportunities for batching.
> >
> > Thanks, I've checked io_uring_cmd_complete_in_task() that it receives
> > callback that has issue_flags different with io_uring_cmd(). I'll try to add
> > a api that wrapping io_uring_cmd_complete_in_task() for next version.
> >
> > >
> > > Yes, we probably don't need to bundle anything else with the
> > > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut
> > > IoUringCmd> will work for uring_cmds that complete asynchronously, as
> > > they will need to outlive the uring_cmd() call. So uring_cmd() needs
> > > to transfer ownership of the struct io_uring_cmd.
> >
> > I can't think that how to take ownership of struct io_uring_cmd. The
> > struct allocated with io_alloc_req() and should be freed with io_free_req().
> > If taking ownership means having pointer of struct io_uring_cmd, I think
> > it's no difference with current version. Also could it be called with
> > mem::forget() if it has ownership?
> 
> I don't mean ownership of the io_uring_cmd allocation; that's the
> responsibility of the io_uring layer. But once the io_uring_cmd is
> handed to the uring_cmd() implementation, it belongs to that layer
> until it completes the command back to io_uring. Maybe a better way to
> describe it would be as ownership of the "executing io_uring_cmd". The
> problem with Pin<&mut IoUringCmd> is that it is a borrowed reference
> to the io_uring_cmd, so it can't outlive the uring_cmd() callback.
> Yes, it's possible to leak the io_uring_cmd by never calling
> io_uring_cmd_done() to return it to the io_uring layer.

Thanks, I understood that IoUringCmd could be outlive uring_cmd callback.
But it's sad that it could be leaked without any unsafe code.

> 
> I would imagine something like this:
> 
> #[derive(Clone, Copy)]
> struct IssueFlags<'a>(c_uint, PhantomData<&'a ()>);
> 
> // Indicates ownership of the io_uring_cmd between uring_cmd() and
> io_uring_cmd_done()
> struct IoUringCmd(NonNull<bindings::io_uring_cmd>);
> 
> impl IoUringCmd {
>         // ...
> 
>         fn done(self, ret: i32, res2: u64, issue_flags: IssueFlags<'_>) {
>                 let cmd = self.0.as_ptr();
>                 let issue_flags = issue_flags.0;
>                 unsafe {
>                         bindings::io_uring_cmd_done(cmd, ret, res2, issue_flags)
>                 }
>         }
> }
> 
> // Can choose whether to complete the command synchronously or asynchronously.
> // If take_async() is called, IoUringCmd::done() needs to be called to
> complete the command.
> // If take_async() isn't called, the command is completed synchronously
> // with the return value from MiscDevice::uring_cmd().
> struct UringCmdInput<'a>(&mut Option<NonNull<bindings::io_uring_cmd>>);

Thanks for a detailed example!

But rather than this, We could introduce new return type that has a callback that
user could take IoUringCmd instead of returning -EIOCBQUEUED. 

But I prefer that we provide just one type IoUringCmd without UringCmdInput.
Although UringCmdInput, user could call done() in uring_cmd callback and 
it makes confusion that whether task_async() called and returning -EIOCBQUEUED
is mismatched that returns -EINVAL. We don't need to make it complex.

Thanks,
Sidong
> 
> impl UringCmdInput<'_> {
>         fn take_async(self) -> IoUringCmd {
>                 IoUringCmd(self.0.take().unwrap())
>         }
> }
> 
> trait MiscDevice {
>         // ...
> 
>         fn uring_cmd(
>                 _device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>,
>                 _cmd: UringCmdInput<'_>,
>                 _ issue_flags: IssueFlags <'_>,
>         ) -> Result<i32> {
>                 build_error!(VTABLE_DEFAULT_ERROR)
>         }
> }
> 
> impl<T: MiscDevice> MiscdeviceVTable<T> {
>         // ...
> 
>         unsafe extern "C" fn uring_cmd(
>                 ioucmd: *mut bindings::io_uring_cmd,
>                 issue_flags: c_uint,
>         ) -> c_int {
>                 let raw_file = unsafe { (*ioucmd).file };
>                 let private = unsafe { (*raw_file).private_data }.cast();
>                 let device = unsafe { <T::Ptr as
> ForeignOwnable>::borrow(private) };
>                 let mut ioucmd = Some(NonNull::new(ioucmd).unwrap());
>                 let issue_flags = IssueFlags(issue_flags, PhantomData);
>                 let ret = T::uring_cmd(device, UringCmdInput(&mut
> ioucmd), issue_flags);
>                 // -EIOCBQUEUED indicates ownership of io_uring_cmd
> has been taken
>                 if iou_cmd.is_none() {
>                         return -bindings::EIOCBQUEUED;
>                 }
> 
>                 let ret = from_result(|| ret);
>                 if ret == -bindings::EIOCBQUEUED {
>                         -bindings::EINVAL
>                 } else {
>                         ret
>                 }
>         }
> }
> 
> Best,
> Caleb
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Caleb Sander Mateos 2 weeks, 4 days ago
On Sat, Sep 13, 2025 at 5:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
>
> On Fri, Sep 12, 2025 at 10:56:31AM -0700, Caleb Sander Mateos wrote:
> > On Fri, Sep 12, 2025 at 9:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > >
> > > On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote:
> > > > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > >
> > > > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote:
> > > > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote:
> > > > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > > > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > > > > >
> > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request
> > > > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or
> > > > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu.
> > > > > > > > > > >
> > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > > > > > > > > > each command starts from a well-defined state. This avoids exposing
> > > > > > > > > > > uninitialized memory and prevents potential misinterpretation of data
> > > > > > > > > > > from previous requests.
> > > > > > > > > > >
> > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is
> > > > > > > > > > > always zero-initialized before use.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > > > > > > > > > ---
> > > > > > > > > > >  io_uring/uring_cmd.c | 1 +
> > > > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > > > > > > > index 053bac89b6c0..2492525d4e43 100644
> > > > > > > > > > > --- a/io_uring/uring_cmd.c
> > > > > > > > > > > +++ b/io_uring/uring_cmd.c
> > > > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > > > > > > > >         if (!ac)
> > > > > > > > > > >                 return -ENOMEM;
> > > > > > > > > > >         ioucmd->sqe = sqe;
> > > > > > > > > > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> > > > > > > > > >
> > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is
> > > > > > > > > > unfortunate. Could we instead track the initialized/uninitialized
> > > > > > > > > > state by using different types on the Rust side? The io_uring_cmd
> > > > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > > > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > > > > > > > > > PDU has been initialized.
> > > > > > > > >
> > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> > > > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> > > > > > > > > But I don't know that we could call T::default() in miscdevice. If we
> > > > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
> > > > > > > > >
> > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(),
> > > > > > > > > We could set a byte flag that it's not initialized. And we could return
> > > > > > > > > error that it's not initialized in read_pdu().
> > > > > > > >
> > > > > > > > Could we do the zero-initialization (or T::default()) in
> > > > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
> > > > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any
> > > > > > > > performance penalty for the existing C uring_cmd() implementations.
> > > > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking
> > > > > > > > initialized".
> > > > > > >
> > > > > > > Sure, we could fill zero when it's the first time uring_cmd called with
> > > > > > > checking the flag. I would remove this commit for next version. I also
> > > > > > > suggests that we would provide the method that read_pdu() and write_pdu().
> > > > > > > In read_pdu() I want to check write_pdu() is called before. So along the
> > > > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is
> > > > > > > initialized?
> > > > > >
> > > > > > Not sure what you mean about "20 bytes for pdu".
> > > > > > It seems like it would be preferable to enforce that write_pdu() has
> > > > > > been called before read_pdu() using the Rust type system instead of a
> > > > > > runtime check. I was thinking a signature like fn write_pdu(cmd:
> > > > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a
> > > > > > reason that wouldn't work and a runtime check would be necessary?
> > > > >
> > > > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before.
> > > > > I think it's good way to pdu is safe without adding a new generic param for
> > > > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call
> > > > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe.
> > > >
> > > > Yes, that's what I was thinking.
> > >
> > > Good, I'll change api in this way. Thanks!
> > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and
> > > > > > > issue_flags. How about some additional field for pdu is initialized like below?
> > > > > > >
> > > > > > > struct IoUringCmdArgs {
> > > > > > >   ioucmd: Pin<&mut IoUringCmd>,
> > > > > > >   issue_flags: u32,
> > > > > > >   pdu_initialized: bool,
> > > > > > > }
> > > > > >
> > > > > > One other thing I realized is that issue_flags should come from the
> > > > > > *current* context rather than the context the uring_cmd() callback was
> > > > > > called in. For example, if io_uring_cmd_done() is called from task
> > > > > > work context, issue_flags should match the issue_flags passed to the
> > > > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to
> > > > > > the uring_cmd() callback. So it probably makes more sense to decouple
> > > > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by
> > > > > > reference (&IssueFlags) or with a phantom reference lifetime
> > > > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to
> > > > > > ensure it can't be used after those callbacks have returned.
> > > > >
> > > > > I have had no idea about task work context. I agree with you that
> > > > > it would be better to separate issue_flags from IoUringCmd. So,
> > > > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>?
> > > >
> > > > "Task work" is a mechanism io_uring uses to queue work to run on the
> > > > thread that submitted an io_uring operation. It's basically a
> > > > per-thread atomic queue of callbacks that the thread will process
> > > > whenever it returns from the kernel to userspace (after a syscall or
> > > > an interrupt). This is the context where asynchronous uring_cmd
> > > > completions are generally processed (see
> > > > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I
> > > > can't speak to the history of why io_uring uses task work, but my
> > > > guess would be that it provides a safe context to acquire the
> > > > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be
> > > > called from an interrupt handler, so it's not allowed to take a
> > > > mutex). Processing all the task work at once also provides natural
> > > > opportunities for batching.
> > >
> > > Thanks, I've checked io_uring_cmd_complete_in_task() that it receives
> > > callback that has issue_flags different with io_uring_cmd(). I'll try to add
> > > a api that wrapping io_uring_cmd_complete_in_task() for next version.
> > >
> > > >
> > > > Yes, we probably don't need to bundle anything else with the
> > > > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut
> > > > IoUringCmd> will work for uring_cmds that complete asynchronously, as
> > > > they will need to outlive the uring_cmd() call. So uring_cmd() needs
> > > > to transfer ownership of the struct io_uring_cmd.
> > >
> > > I can't think that how to take ownership of struct io_uring_cmd. The
> > > struct allocated with io_alloc_req() and should be freed with io_free_req().
> > > If taking ownership means having pointer of struct io_uring_cmd, I think
> > > it's no difference with current version. Also could it be called with
> > > mem::forget() if it has ownership?
> >
> > I don't mean ownership of the io_uring_cmd allocation; that's the
> > responsibility of the io_uring layer. But once the io_uring_cmd is
> > handed to the uring_cmd() implementation, it belongs to that layer
> > until it completes the command back to io_uring. Maybe a better way to
> > describe it would be as ownership of the "executing io_uring_cmd". The
> > problem with Pin<&mut IoUringCmd> is that it is a borrowed reference
> > to the io_uring_cmd, so it can't outlive the uring_cmd() callback.
> > Yes, it's possible to leak the io_uring_cmd by never calling
> > io_uring_cmd_done() to return it to the io_uring layer.
>
> Thanks, I understood that IoUringCmd could be outlive uring_cmd callback.
> But it's sad that it could be leaked without any unsafe code.

Safety in Rust doesn't require destructors to run, which means any
resource can be safely leaked
(https://faultlore.com/blah/everyone-poops/ has some historical
background on why Rust decided leaks had to be considered safe).
Leaking an io_uring_cmd is analogous to leaking a Box, both are
perfectly possible in safe Rust.

>
> >
> > I would imagine something like this:
> >
> > #[derive(Clone, Copy)]
> > struct IssueFlags<'a>(c_uint, PhantomData<&'a ()>);
> >
> > // Indicates ownership of the io_uring_cmd between uring_cmd() and
> > io_uring_cmd_done()
> > struct IoUringCmd(NonNull<bindings::io_uring_cmd>);
> >
> > impl IoUringCmd {
> >         // ...
> >
> >         fn done(self, ret: i32, res2: u64, issue_flags: IssueFlags<'_>) {
> >                 let cmd = self.0.as_ptr();
> >                 let issue_flags = issue_flags.0;
> >                 unsafe {
> >                         bindings::io_uring_cmd_done(cmd, ret, res2, issue_flags)
> >                 }
> >         }
> > }
> >
> > // Can choose whether to complete the command synchronously or asynchronously.
> > // If take_async() is called, IoUringCmd::done() needs to be called to
> > complete the command.
> > // If take_async() isn't called, the command is completed synchronously
> > // with the return value from MiscDevice::uring_cmd().
> > struct UringCmdInput<'a>(&mut Option<NonNull<bindings::io_uring_cmd>>);
>
> Thanks for a detailed example!
>
> But rather than this, We could introduce new return type that has a callback that
> user could take IoUringCmd instead of returning -EIOCBQUEUED.

I'm not following what you're suggesting, maybe a code sample would help?

>
> But I prefer that we provide just one type IoUringCmd without UringCmdInput.
> Although UringCmdInput, user could call done() in uring_cmd callback and
> it makes confusion that whether task_async() called and returning -EIOCBQUEUED
> is mismatched that returns -EINVAL. We don't need to make it complex.

Sure, if you only want to support asynchronous io_uring_cmd
completions, than you can just pass IoUringCmd to
MiscDevice::uring_cmd() and require it to call IoUringCmd::done() to
complete the command. There's a small performance overhead to that
over just returning the result from the uring_cmd() callback for
synchronous completions (and it's more verbose), but I think that
would be fine for an initial implementation.

Best,
Caleb
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Sidong Yang 2 weeks, 2 days ago
On Mon, Sep 15, 2025 at 09:54:50AM -0700, Caleb Sander Mateos wrote:
> On Sat, Sep 13, 2025 at 5:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> >
> > On Fri, Sep 12, 2025 at 10:56:31AM -0700, Caleb Sander Mateos wrote:
> > > On Fri, Sep 12, 2025 at 9:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > >
> > > > On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote:
> > > > > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > >
> > > > > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote:
> > > > > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote:
> > > > > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > > > > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request
> > > > > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or
> > > > > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu.
> > > > > > > > > > > >
> > > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > > > > > > > > > > each command starts from a well-defined state. This avoids exposing
> > > > > > > > > > > > uninitialized memory and prevents potential misinterpretation of data
> > > > > > > > > > > > from previous requests.
> > > > > > > > > > > >
> > > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is
> > > > > > > > > > > > always zero-initialized before use.
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  io_uring/uring_cmd.c | 1 +
> > > > > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > > > > > > > > index 053bac89b6c0..2492525d4e43 100644
> > > > > > > > > > > > --- a/io_uring/uring_cmd.c
> > > > > > > > > > > > +++ b/io_uring/uring_cmd.c
> > > > > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > > > > > > > > >         if (!ac)
> > > > > > > > > > > >                 return -ENOMEM;
> > > > > > > > > > > >         ioucmd->sqe = sqe;
> > > > > > > > > > > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> > > > > > > > > > >
> > > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is
> > > > > > > > > > > unfortunate. Could we instead track the initialized/uninitialized
> > > > > > > > > > > state by using different types on the Rust side? The io_uring_cmd
> > > > > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > > > > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > > > > > > > > > > PDU has been initialized.
> > > > > > > > > >
> > > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> > > > > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> > > > > > > > > > But I don't know that we could call T::default() in miscdevice. If we
> > > > > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
> > > > > > > > > >
> > > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(),
> > > > > > > > > > We could set a byte flag that it's not initialized. And we could return
> > > > > > > > > > error that it's not initialized in read_pdu().
> > > > > > > > >
> > > > > > > > > Could we do the zero-initialization (or T::default()) in
> > > > > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
> > > > > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any
> > > > > > > > > performance penalty for the existing C uring_cmd() implementations.
> > > > > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking
> > > > > > > > > initialized".
> > > > > > > >
> > > > > > > > Sure, we could fill zero when it's the first time uring_cmd called with
> > > > > > > > checking the flag. I would remove this commit for next version. I also
> > > > > > > > suggests that we would provide the method that read_pdu() and write_pdu().
> > > > > > > > In read_pdu() I want to check write_pdu() is called before. So along the
> > > > > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is
> > > > > > > > initialized?
> > > > > > >
> > > > > > > Not sure what you mean about "20 bytes for pdu".
> > > > > > > It seems like it would be preferable to enforce that write_pdu() has
> > > > > > > been called before read_pdu() using the Rust type system instead of a
> > > > > > > runtime check. I was thinking a signature like fn write_pdu(cmd:
> > > > > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a
> > > > > > > reason that wouldn't work and a runtime check would be necessary?
> > > > > >
> > > > > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before.
> > > > > > I think it's good way to pdu is safe without adding a new generic param for
> > > > > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call
> > > > > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe.
> > > > >
> > > > > Yes, that's what I was thinking.
> > > >
> > > > Good, I'll change api in this way. Thanks!
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and
> > > > > > > > issue_flags. How about some additional field for pdu is initialized like below?
> > > > > > > >
> > > > > > > > struct IoUringCmdArgs {
> > > > > > > >   ioucmd: Pin<&mut IoUringCmd>,
> > > > > > > >   issue_flags: u32,
> > > > > > > >   pdu_initialized: bool,
> > > > > > > > }
> > > > > > >
> > > > > > > One other thing I realized is that issue_flags should come from the
> > > > > > > *current* context rather than the context the uring_cmd() callback was
> > > > > > > called in. For example, if io_uring_cmd_done() is called from task
> > > > > > > work context, issue_flags should match the issue_flags passed to the
> > > > > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to
> > > > > > > the uring_cmd() callback. So it probably makes more sense to decouple
> > > > > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by
> > > > > > > reference (&IssueFlags) or with a phantom reference lifetime
> > > > > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to
> > > > > > > ensure it can't be used after those callbacks have returned.
> > > > > >
> > > > > > I have had no idea about task work context. I agree with you that
> > > > > > it would be better to separate issue_flags from IoUringCmd. So,
> > > > > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>?
> > > > >
> > > > > "Task work" is a mechanism io_uring uses to queue work to run on the
> > > > > thread that submitted an io_uring operation. It's basically a
> > > > > per-thread atomic queue of callbacks that the thread will process
> > > > > whenever it returns from the kernel to userspace (after a syscall or
> > > > > an interrupt). This is the context where asynchronous uring_cmd
> > > > > completions are generally processed (see
> > > > > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I
> > > > > can't speak to the history of why io_uring uses task work, but my
> > > > > guess would be that it provides a safe context to acquire the
> > > > > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be
> > > > > called from an interrupt handler, so it's not allowed to take a
> > > > > mutex). Processing all the task work at once also provides natural
> > > > > opportunities for batching.
> > > >
> > > > Thanks, I've checked io_uring_cmd_complete_in_task() that it receives
> > > > callback that has issue_flags different with io_uring_cmd(). I'll try to add
> > > > a api that wrapping io_uring_cmd_complete_in_task() for next version.
> > > >
> > > > >
> > > > > Yes, we probably don't need to bundle anything else with the
> > > > > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut
> > > > > IoUringCmd> will work for uring_cmds that complete asynchronously, as
> > > > > they will need to outlive the uring_cmd() call. So uring_cmd() needs
> > > > > to transfer ownership of the struct io_uring_cmd.
> > > >
> > > > I can't think that how to take ownership of struct io_uring_cmd. The
> > > > struct allocated with io_alloc_req() and should be freed with io_free_req().
> > > > If taking ownership means having pointer of struct io_uring_cmd, I think
> > > > it's no difference with current version. Also could it be called with
> > > > mem::forget() if it has ownership?
> > >
> > > I don't mean ownership of the io_uring_cmd allocation; that's the
> > > responsibility of the io_uring layer. But once the io_uring_cmd is
> > > handed to the uring_cmd() implementation, it belongs to that layer
> > > until it completes the command back to io_uring. Maybe a better way to
> > > describe it would be as ownership of the "executing io_uring_cmd". The
> > > problem with Pin<&mut IoUringCmd> is that it is a borrowed reference
> > > to the io_uring_cmd, so it can't outlive the uring_cmd() callback.
> > > Yes, it's possible to leak the io_uring_cmd by never calling
> > > io_uring_cmd_done() to return it to the io_uring layer.
> >
> > Thanks, I understood that IoUringCmd could be outlive uring_cmd callback.
> > But it's sad that it could be leaked without any unsafe code.
> 
> Safety in Rust doesn't require destructors to run, which means any
> resource can be safely leaked
> (https://faultlore.com/blah/everyone-poops/ has some historical
> background on why Rust decided leaks had to be considered safe).
> Leaking an io_uring_cmd is analogous to leaking a Box, both are
> perfectly possible in safe Rust.

Thanks for the reference. If driver just drops `IoUringCmd` without taking,
the request wouldn't be completed until io-uring instance deinitialized.
I understood that we cannot handle this.

> 
> >
> > >
> > > I would imagine something like this:
> > >
> > > #[derive(Clone, Copy)]
> > > struct IssueFlags<'a>(c_uint, PhantomData<&'a ()>);
> > >
> > > // Indicates ownership of the io_uring_cmd between uring_cmd() and
> > > io_uring_cmd_done()
> > > struct IoUringCmd(NonNull<bindings::io_uring_cmd>);
> > >
> > > impl IoUringCmd {
> > >         // ...
> > >
> > >         fn done(self, ret: i32, res2: u64, issue_flags: IssueFlags<'_>) {
> > >                 let cmd = self.0.as_ptr();
> > >                 let issue_flags = issue_flags.0;
> > >                 unsafe {
> > >                         bindings::io_uring_cmd_done(cmd, ret, res2, issue_flags)
> > >                 }
> > >         }
> > > }
> > >
> > > // Can choose whether to complete the command synchronously or asynchronously.
> > > // If take_async() is called, IoUringCmd::done() needs to be called to
> > > complete the command.
> > > // If take_async() isn't called, the command is completed synchronously
> > > // with the return value from MiscDevice::uring_cmd().
> > > struct UringCmdInput<'a>(&mut Option<NonNull<bindings::io_uring_cmd>>);
> >
> > Thanks for a detailed example!
> >
> > But rather than this, We could introduce new return type that has a callback that
> > user could take IoUringCmd instead of returning -EIOCBQUEUED.
> 
> I'm not following what you're suggesting, maybe a code sample would help?

maybe like below...

#[vtable]
pub trait MiscDevice: Sized {
    /// ...

    /// If user returns `EIOCBQUEUED`, this function would be called for
    /// users who want to take `IoUringCmdAsync`. It could call `done()` for complete
    /// request.
    fn uring_cmd_async_prep(
          device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>,
          io_uring_cmd_async: IoUringCmdAsync);
    /// ...
}

impl<T: MiscDevice> MiscdeviceVTable<T> {
    // ...
    unsafe extern "C" fn uring_cmd(
          ioucmd: *mut bindings::io_uring_cmd,
          issue_flags: ffi::c_uint,
      ) -> c_int {
          // ...
          let result = T::uring_cmd(device, ioucmd, issue_flags);

          if let Err(EIOCBQUEUED) = result {
              T::uring_cmd_async_prep(device, ioucmd.to_async());
          }

          from_result(|| result)
      }
}
> 
> >
> > But I prefer that we provide just one type IoUringCmd without UringCmdInput.
> > Although UringCmdInput, user could call done() in uring_cmd callback and
> > it makes confusion that whether task_async() called and returning -EIOCBQUEUED
> > is mismatched that returns -EINVAL. We don't need to make it complex.
> 
> Sure, if you only want to support asynchronous io_uring_cmd
> completions, than you can just pass IoUringCmd to
> MiscDevice::uring_cmd() and require it to call IoUringCmd::done() to
> complete the command. There's a small performance overhead to that
> over just returning the result from the uring_cmd() callback for
> synchronous completions (and it's more verbose), but I think that
> would be fine for an initial implementation.

I appreciate for your understanding. I think it would be good to have just one
simple struct `IoUringCmd`. I'll make next version patch soon.

Thanks,
Sidong

> 
> Best,
> Caleb
Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB
Posted by Caleb Sander Mateos 1 week, 4 days ago
On Wed, Sep 17, 2025 at 7:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
>
> On Mon, Sep 15, 2025 at 09:54:50AM -0700, Caleb Sander Mateos wrote:
> > On Sat, Sep 13, 2025 at 5:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > >
> > > On Fri, Sep 12, 2025 at 10:56:31AM -0700, Caleb Sander Mateos wrote:
> > > > On Fri, Sep 12, 2025 at 9:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > >
> > > > > On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote:
> > > > > > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > >
> > > > > > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote:
> > > > > > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote:
> > > > > > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote:
> > > > > > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request
> > > > > > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or
> > > > > > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that
> > > > > > > > > > > > > each command starts from a well-defined state. This avoids exposing
> > > > > > > > > > > > > uninitialized memory and prevents potential misinterpretation of data
> > > > > > > > > > > > > from previous requests.
> > > > > > > > > > > > >
> > > > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is
> > > > > > > > > > > > > always zero-initialized before use.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  io_uring/uring_cmd.c | 1 +
> > > > > > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > > > > > > > > > index 053bac89b6c0..2492525d4e43 100644
> > > > > > > > > > > > > --- a/io_uring/uring_cmd.c
> > > > > > > > > > > > > +++ b/io_uring/uring_cmd.c
> > > > > > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > > > > > > > > > >         if (!ac)
> > > > > > > > > > > > >                 return -ENOMEM;
> > > > > > > > > > > > >         ioucmd->sqe = sqe;
> > > > > > > > > > > > > +       memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu));
> > > > > > > > > > > >
> > > > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is
> > > > > > > > > > > > unfortunate. Could we instead track the initialized/uninitialized
> > > > > > > > > > > > state by using different types on the Rust side? The io_uring_cmd
> > > > > > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit,
> > > > > > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the
> > > > > > > > > > > > PDU has been initialized.
> > > > > > > > > > >
> > > > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize
> > > > > > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued.
> > > > > > > > > > > But I don't know that we could call T::default() in miscdevice. If we
> > > > > > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>.
> > > > > > > > > > >
> > > > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(),
> > > > > > > > > > > We could set a byte flag that it's not initialized. And we could return
> > > > > > > > > > > error that it's not initialized in read_pdu().
> > > > > > > > > >
> > > > > > > > > > Could we do the zero-initialization (or T::default()) in
> > > > > > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag
> > > > > > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any
> > > > > > > > > > performance penalty for the existing C uring_cmd() implementations.
> > > > > > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking
> > > > > > > > > > initialized".
> > > > > > > > >
> > > > > > > > > Sure, we could fill zero when it's the first time uring_cmd called with
> > > > > > > > > checking the flag. I would remove this commit for next version. I also
> > > > > > > > > suggests that we would provide the method that read_pdu() and write_pdu().
> > > > > > > > > In read_pdu() I want to check write_pdu() is called before. So along the
> > > > > > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is
> > > > > > > > > initialized?
> > > > > > > >
> > > > > > > > Not sure what you mean about "20 bytes for pdu".
> > > > > > > > It seems like it would be preferable to enforce that write_pdu() has
> > > > > > > > been called before read_pdu() using the Rust type system instead of a
> > > > > > > > runtime check. I was thinking a signature like fn write_pdu(cmd:
> > > > > > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a
> > > > > > > > reason that wouldn't work and a runtime check would be necessary?
> > > > > > >
> > > > > > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before.
> > > > > > > I think it's good way to pdu is safe without adding a new generic param for
> > > > > > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call
> > > > > > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe.
> > > > > >
> > > > > > Yes, that's what I was thinking.
> > > > >
> > > > > Good, I'll change api in this way. Thanks!
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and
> > > > > > > > > issue_flags. How about some additional field for pdu is initialized like below?
> > > > > > > > >
> > > > > > > > > struct IoUringCmdArgs {
> > > > > > > > >   ioucmd: Pin<&mut IoUringCmd>,
> > > > > > > > >   issue_flags: u32,
> > > > > > > > >   pdu_initialized: bool,
> > > > > > > > > }
> > > > > > > >
> > > > > > > > One other thing I realized is that issue_flags should come from the
> > > > > > > > *current* context rather than the context the uring_cmd() callback was
> > > > > > > > called in. For example, if io_uring_cmd_done() is called from task
> > > > > > > > work context, issue_flags should match the issue_flags passed to the
> > > > > > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to
> > > > > > > > the uring_cmd() callback. So it probably makes more sense to decouple
> > > > > > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by
> > > > > > > > reference (&IssueFlags) or with a phantom reference lifetime
> > > > > > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to
> > > > > > > > ensure it can't be used after those callbacks have returned.
> > > > > > >
> > > > > > > I have had no idea about task work context. I agree with you that
> > > > > > > it would be better to separate issue_flags from IoUringCmd. So,
> > > > > > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>?
> > > > > >
> > > > > > "Task work" is a mechanism io_uring uses to queue work to run on the
> > > > > > thread that submitted an io_uring operation. It's basically a
> > > > > > per-thread atomic queue of callbacks that the thread will process
> > > > > > whenever it returns from the kernel to userspace (after a syscall or
> > > > > > an interrupt). This is the context where asynchronous uring_cmd
> > > > > > completions are generally processed (see
> > > > > > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I
> > > > > > can't speak to the history of why io_uring uses task work, but my
> > > > > > guess would be that it provides a safe context to acquire the
> > > > > > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be
> > > > > > called from an interrupt handler, so it's not allowed to take a
> > > > > > mutex). Processing all the task work at once also provides natural
> > > > > > opportunities for batching.
> > > > >
> > > > > Thanks, I've checked io_uring_cmd_complete_in_task() that it receives
> > > > > callback that has issue_flags different with io_uring_cmd(). I'll try to add
> > > > > a api that wrapping io_uring_cmd_complete_in_task() for next version.
> > > > >
> > > > > >
> > > > > > Yes, we probably don't need to bundle anything else with the
> > > > > > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut
> > > > > > IoUringCmd> will work for uring_cmds that complete asynchronously, as
> > > > > > they will need to outlive the uring_cmd() call. So uring_cmd() needs
> > > > > > to transfer ownership of the struct io_uring_cmd.
> > > > >
> > > > > I can't think that how to take ownership of struct io_uring_cmd. The
> > > > > struct allocated with io_alloc_req() and should be freed with io_free_req().
> > > > > If taking ownership means having pointer of struct io_uring_cmd, I think
> > > > > it's no difference with current version. Also could it be called with
> > > > > mem::forget() if it has ownership?
> > > >
> > > > I don't mean ownership of the io_uring_cmd allocation; that's the
> > > > responsibility of the io_uring layer. But once the io_uring_cmd is
> > > > handed to the uring_cmd() implementation, it belongs to that layer
> > > > until it completes the command back to io_uring. Maybe a better way to
> > > > describe it would be as ownership of the "executing io_uring_cmd". The
> > > > problem with Pin<&mut IoUringCmd> is that it is a borrowed reference
> > > > to the io_uring_cmd, so it can't outlive the uring_cmd() callback.
> > > > Yes, it's possible to leak the io_uring_cmd by never calling
> > > > io_uring_cmd_done() to return it to the io_uring layer.
> > >
> > > Thanks, I understood that IoUringCmd could be outlive uring_cmd callback.
> > > But it's sad that it could be leaked without any unsafe code.
> >
> > Safety in Rust doesn't require destructors to run, which means any
> > resource can be safely leaked
> > (https://faultlore.com/blah/everyone-poops/ has some historical
> > background on why Rust decided leaks had to be considered safe).
> > Leaking an io_uring_cmd is analogous to leaking a Box, both are
> > perfectly possible in safe Rust.
>
> Thanks for the reference. If driver just drops `IoUringCmd` without taking,
> the request wouldn't be completed until io-uring instance deinitialized.
> I understood that we cannot handle this.
>
> >
> > >
> > > >
> > > > I would imagine something like this:
> > > >
> > > > #[derive(Clone, Copy)]
> > > > struct IssueFlags<'a>(c_uint, PhantomData<&'a ()>);
> > > >
> > > > // Indicates ownership of the io_uring_cmd between uring_cmd() and
> > > > io_uring_cmd_done()
> > > > struct IoUringCmd(NonNull<bindings::io_uring_cmd>);
> > > >
> > > > impl IoUringCmd {
> > > >         // ...
> > > >
> > > >         fn done(self, ret: i32, res2: u64, issue_flags: IssueFlags<'_>) {
> > > >                 let cmd = self.0.as_ptr();
> > > >                 let issue_flags = issue_flags.0;
> > > >                 unsafe {
> > > >                         bindings::io_uring_cmd_done(cmd, ret, res2, issue_flags)
> > > >                 }
> > > >         }
> > > > }
> > > >
> > > > // Can choose whether to complete the command synchronously or asynchronously.
> > > > // If take_async() is called, IoUringCmd::done() needs to be called to
> > > > complete the command.
> > > > // If take_async() isn't called, the command is completed synchronously
> > > > // with the return value from MiscDevice::uring_cmd().
> > > > struct UringCmdInput<'a>(&mut Option<NonNull<bindings::io_uring_cmd>>);
> > >
> > > Thanks for a detailed example!
> > >
> > > But rather than this, We could introduce new return type that has a callback that
> > > user could take IoUringCmd instead of returning -EIOCBQUEUED.
> >
> > I'm not following what you're suggesting, maybe a code sample would help?
>
> maybe like below...
>
> #[vtable]
> pub trait MiscDevice: Sized {
>     /// ...
>
>     /// If user returns `EIOCBQUEUED`, this function would be called for
>     /// users who want to take `IoUringCmdAsync`. It could call `done()` for complete
>     /// request.
>     fn uring_cmd_async_prep(
>           device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>,
>           io_uring_cmd_async: IoUringCmdAsync);
>     /// ...
> }
>
> impl<T: MiscDevice> MiscdeviceVTable<T> {
>     // ...
>     unsafe extern "C" fn uring_cmd(
>           ioucmd: *mut bindings::io_uring_cmd,
>           issue_flags: ffi::c_uint,
>       ) -> c_int {
>           // ...
>           let result = T::uring_cmd(device, ioucmd, issue_flags);
>
>           if let Err(EIOCBQUEUED) = result {
>               T::uring_cmd_async_prep(device, ioucmd.to_async());
>           }

I see. Yes, this could work, but I worry separating the implementation
into 2 calls makes it difficult to pass state between them. If
T::uring_cmd() can determine whether the command will complete
synchronously or asynchronously just by inspecting the
io_uring_cmd/device, it may be a convenient interface. But if
T::uring_cmd() performs a bunch of work before deciding the command
should go async, it may be a pain to pass those computed values to
T::uring_cmd_async_prep().

>
>           from_result(|| result)
>       }
> }
> >
> > >
> > > But I prefer that we provide just one type IoUringCmd without UringCmdInput.
> > > Although UringCmdInput, user could call done() in uring_cmd callback and
> > > it makes confusion that whether task_async() called and returning -EIOCBQUEUED
> > > is mismatched that returns -EINVAL. We don't need to make it complex.
> >
> > Sure, if you only want to support asynchronous io_uring_cmd
> > completions, than you can just pass IoUringCmd to
> > MiscDevice::uring_cmd() and require it to call IoUringCmd::done() to
> > complete the command. There's a small performance overhead to that
> > over just returning the result from the uring_cmd() callback for
> > synchronous completions (and it's more verbose), but I think that
> > would be fine for an initial implementation.
>
> I appreciate for your understanding. I think it would be good to have just one
> simple struct `IoUringCmd`. I'll make next version patch soon.

Yeah, I think it will probably be easiest for an initial
implementation to always transfer ownership of the io_uring_cmd to
MiscDevice::uring_cmd() and requiring it to call IoUringCmd::done().
The synchronous case could be optimized in the future, but that's not
the primary io_uring_cmd use case.

Best,
Caleb