[SeaBIOS] [PATCH 5/5] nvme: Build the page list in the existing dma buffer

Kevin O'Connor posted 5 patches 2 years, 9 months ago
There is a newer version of this series
Posted by Kevin O'Connor 2 years, 9 months ago
Commit 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
introduced multi-page requests using the NVMe PRP mechanism. To store the
list and "first page to write to" hints, it added fields to the NVMe
namespace struct.

Unfortunately, that struct resides in fseg which is read-only at runtime.
While KVM ignores the read-only part and allows writes, real hardware and
TCG adhere to the semantics and ignore writes to the fseg region. The net
effect of that is that reads and writes were always happening on address 0,
unless they went through the bounce buffer logic.

This patch builds the PRP maintenance data in the existing "dma bounce
buffer" and only builds it when needed.

Fixes: 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
Reported-by: Matt DeVillier <matt.devillier@gmail.com>
Signed-off-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Kevin O'Connor <kevin@koconnor.net>
---
 src/hw/nvme-int.h |  6 ------
 src/hw/nvme.c     | 51 +++++++++++++++++------------------------------
 2 files changed, 18 insertions(+), 39 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 9564c17..f0d717d 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -10,8 +10,6 @@
 #include "types.h" // u32
 #include "pcidevice.h" // struct pci_device
 
-#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
-
 /* Data structures */
 
 /* The register file of a NVMe host controller. This struct follows the naming
@@ -122,10 +120,6 @@ struct nvme_namespace {
 
     /* Page aligned buffer of size NVME_PAGE_SIZE. */
     char *dma_buffer;
-
-    /* Page List */
-    u32 prpl_len;
-    u64 prpl[NVME_MAX_PRPL_ENTRIES];
 };
 
 /* Data structures for NVMe admin identify commands */
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 3a73784..39b9138 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -470,35 +470,19 @@ nvme_bounce_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
     return res;
 }
 
-static void nvme_reset_prpl(struct nvme_namespace *ns)
-{
-    ns->prpl_len = 0;
-}
-
-static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
-{
-    if (ns->prpl_len >= NVME_MAX_PRPL_ENTRIES)
-        return -1;
-
-    ns->prpl[ns->prpl_len++] = base;
-
-    return 0;
-}
+#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
 
 // Transfer data using page list (if applicable)
 static int
 nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
                int write)
 {
-    int first_page = 1;
     u32 base = (long)buf;
     s32 size;
 
     if (count > ns->max_req_size)
         count = ns->max_req_size;
 
-    nvme_reset_prpl(ns);
-
     size = count * ns->block_size;
     /* Special case for transfers that fit into PRP1, but are unaligned */
     if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE))
@@ -512,28 +496,29 @@ nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
     if (size & (ns->block_size - 1ULL))
         return nvme_bounce_xfer(ns, lba, buf, count, write);
 
-    for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
-        if (first_page) {
-            /* First page is special */
-            first_page = 0;
-            continue;
-        }
-        if (nvme_add_prpl(ns, base))
-            return nvme_bounce_xfer(ns, lba, buf, count, write);
-    }
-
-    void *prp2;
     if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
-        /* We need to describe more than 2 pages, rely on PRP List */
-        prp2 = ns->prpl;
+        /* We need to describe more than 2 pages, build PRP List */
+        u32 prpl_len = 0;
+        u64 *prpl = (void*)ns->dma_buffer;
+        int first_page = 1;
+        for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
+            if (first_page) {
+                /* First page is special */
+                first_page = 0;
+                continue;
+            }
+            if (prpl_len >= NVME_MAX_PRPL_ENTRIES)
+                return nvme_bounce_xfer(ns, lba, buf, count, write);
+            prpl[prpl_len++] = base;
+        }
+        return nvme_io_xfer(ns, lba, buf, prpl, count, write);
     } else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
         /* Directly embed the 2nd page if we only need 2 pages */
-        prp2 = (void *)(long)ns->prpl[0];
+        return nvme_io_xfer(ns, lba, buf, buf + NVME_PAGE_SIZE, count, write);
     } else {
         /* One page is enough, don't expose anything else */
-        prp2 = NULL;
+        return nvme_io_xfer(ns, lba, buf, NULL, count, write);
     }
-    return nvme_io_xfer(ns, lba, buf, prp2, count, write);
 }
 
 static int
-- 
2.31.1

_______________________________________________
SeaBIOS mailing list -- seabios@seabios.org
To unsubscribe send an email to seabios-leave@seabios.org
[SeaBIOS] Re: [PATCH 5/5] nvme: Build the page list in the existing dma buffer
Posted by Alexander Graf via SeaBIOS 2 years, 9 months ago
On 19.01.22 19:45, Kevin O'Connor wrote:
> Commit 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
> introduced multi-page requests using the NVMe PRP mechanism. To store the
> list and "first page to write to" hints, it added fields to the NVMe
> namespace struct.
>
> Unfortunately, that struct resides in fseg which is read-only at runtime.
> While KVM ignores the read-only part and allows writes, real hardware and
> TCG adhere to the semantics and ignore writes to the fseg region. The net
> effect of that is that reads and writes were always happening on address 0,
> unless they went through the bounce buffer logic.
>
> This patch builds the PRP maintenance data in the existing "dma bounce
> buffer" and only builds it when needed.
>
> Fixes: 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
> Reported-by: Matt DeVillier <matt.devillier@gmail.com>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Signed-off-by: Kevin O'Connor <kevin@koconnor.net>
> ---
>   src/hw/nvme-int.h |  6 ------
>   src/hw/nvme.c     | 51 +++++++++++++++++------------------------------
>   2 files changed, 18 insertions(+), 39 deletions(-)
>
> diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
> index 9564c17..f0d717d 100644
> --- a/src/hw/nvme-int.h
> +++ b/src/hw/nvme-int.h
> @@ -10,8 +10,6 @@
>   #include "types.h" // u32
>   #include "pcidevice.h" // struct pci_device
>
> -#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
> -
>   /* Data structures */
>
>   /* The register file of a NVMe host controller. This struct follows the naming
> @@ -122,10 +120,6 @@ struct nvme_namespace {
>
>       /* Page aligned buffer of size NVME_PAGE_SIZE. */
>       char *dma_buffer;
> -
> -    /* Page List */
> -    u32 prpl_len;
> -    u64 prpl[NVME_MAX_PRPL_ENTRIES];
>   };
>
>   /* Data structures for NVMe admin identify commands */
> diff --git a/src/hw/nvme.c b/src/hw/nvme.c
> index 3a73784..39b9138 100644
> --- a/src/hw/nvme.c
> +++ b/src/hw/nvme.c
> @@ -470,35 +470,19 @@ nvme_bounce_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
>       return res;
>   }
>
> -static void nvme_reset_prpl(struct nvme_namespace *ns)
> -{
> -    ns->prpl_len = 0;
> -}
> -
> -static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
> -{
> -    if (ns->prpl_len >= NVME_MAX_PRPL_ENTRIES)
> -        return -1;
> -
> -    ns->prpl[ns->prpl_len++] = base;
> -
> -    return 0;
> -}
> +#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
>
>   // Transfer data using page list (if applicable)
>   static int
>   nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
>                  int write)
>   {
> -    int first_page = 1;
>       u32 base = (long)buf;
>       s32 size;
>
>       if (count > ns->max_req_size)
>           count = ns->max_req_size;
>
> -    nvme_reset_prpl(ns);
> -
>       size = count * ns->block_size;
>       /* Special case for transfers that fit into PRP1, but are unaligned */
>       if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE))
> @@ -512,28 +496,29 @@ nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
>       if (size & (ns->block_size - 1ULL))
>           return nvme_bounce_xfer(ns, lba, buf, count, write);
>
> -    for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
> -        if (first_page) {
> -            /* First page is special */
> -            first_page = 0;
> -            continue;
> -        }
> -        if (nvme_add_prpl(ns, base))
> -            return nvme_bounce_xfer(ns, lba, buf, count, write);
> -    }
> -
> -    void *prp2;
>       if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
> -        /* We need to describe more than 2 pages, rely on PRP List */
> -        prp2 = ns->prpl;
> +        /* We need to describe more than 2 pages, build PRP List */
> +        u32 prpl_len = 0;
> +        u64 *prpl = (void*)ns->dma_buffer;


At 15*8=120 bytes of data, do we even need to put the prpl into 
dma_buffer or can we just use the stack? Will the array be guaranteed 
64bit aligned or would we need to add an attribute to it?

     u64 prpl[NVME_MAX_PRPL_ENTRIES];

Either way, I don't have strong feelings one way or another. It just 
seems more natural to keep the bounce buffer purely as bounce buffer :).


Alex


> +        int first_page = 1;
> +        for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
> +            if (first_page) {
> +                /* First page is special */
> +                first_page = 0;
> +                continue;
> +            }
> +            if (prpl_len >= NVME_MAX_PRPL_ENTRIES)
> +                return nvme_bounce_xfer(ns, lba, buf, count, write);
> +            prpl[prpl_len++] = base;
> +        }
> +        return nvme_io_xfer(ns, lba, buf, prpl, count, write);
>       } else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
>           /* Directly embed the 2nd page if we only need 2 pages */
> -        prp2 = (void *)(long)ns->prpl[0];
> +        return nvme_io_xfer(ns, lba, buf, buf + NVME_PAGE_SIZE, count, write);
>       } else {
>           /* One page is enough, don't expose anything else */
> -        prp2 = NULL;
> +        return nvme_io_xfer(ns, lba, buf, NULL, count, write);
>       }
> -    return nvme_io_xfer(ns, lba, buf, prp2, count, write);
>   }
>
>   static int
> --
> 2.31.1
>



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879


[SeaBIOS] Re: [PATCH 5/5] nvme: Build the page list in the existing dma buffer
Posted by Kevin O'Connor 2 years, 9 months ago
On Fri, Jan 21, 2022 at 03:28:33PM +0100, Alexander Graf wrote:
> 
> On 19.01.22 19:45, Kevin O'Connor wrote:
> > Commit 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
> > introduced multi-page requests using the NVMe PRP mechanism. To store the
> > list and "first page to write to" hints, it added fields to the NVMe
> > namespace struct.
> > 
> > Unfortunately, that struct resides in fseg which is read-only at runtime.
> > While KVM ignores the read-only part and allows writes, real hardware and
> > TCG adhere to the semantics and ignore writes to the fseg region. The net
> > effect of that is that reads and writes were always happening on address 0,
> > unless they went through the bounce buffer logic.
> > 
> > This patch builds the PRP maintenance data in the existing "dma bounce
> > buffer" and only builds it when needed.
> > 
> > Fixes: 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
> > Reported-by: Matt DeVillier <matt.devillier@gmail.com>
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > Signed-off-by: Kevin O'Connor <kevin@koconnor.net>
> > ---
> >   src/hw/nvme-int.h |  6 ------
> >   src/hw/nvme.c     | 51 +++++++++++++++++------------------------------
> >   2 files changed, 18 insertions(+), 39 deletions(-)
> > 
> > diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
> > index 9564c17..f0d717d 100644
> > --- a/src/hw/nvme-int.h
> > +++ b/src/hw/nvme-int.h
> > @@ -10,8 +10,6 @@
> >   #include "types.h" // u32
> >   #include "pcidevice.h" // struct pci_device
> > 
> > -#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
> > -
> >   /* Data structures */
> > 
> >   /* The register file of a NVMe host controller. This struct follows the naming
> > @@ -122,10 +120,6 @@ struct nvme_namespace {
> > 
> >       /* Page aligned buffer of size NVME_PAGE_SIZE. */
> >       char *dma_buffer;
> > -
> > -    /* Page List */
> > -    u32 prpl_len;
> > -    u64 prpl[NVME_MAX_PRPL_ENTRIES];
> >   };
> > 
> >   /* Data structures for NVMe admin identify commands */
> > diff --git a/src/hw/nvme.c b/src/hw/nvme.c
> > index 3a73784..39b9138 100644
> > --- a/src/hw/nvme.c
> > +++ b/src/hw/nvme.c
> > @@ -470,35 +470,19 @@ nvme_bounce_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
> >       return res;
> >   }
> > 
> > -static void nvme_reset_prpl(struct nvme_namespace *ns)
> > -{
> > -    ns->prpl_len = 0;
> > -}
> > -
> > -static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
> > -{
> > -    if (ns->prpl_len >= NVME_MAX_PRPL_ENTRIES)
> > -        return -1;
> > -
> > -    ns->prpl[ns->prpl_len++] = base;
> > -
> > -    return 0;
> > -}
> > +#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
> > 
> >   // Transfer data using page list (if applicable)
> >   static int
> >   nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
> >                  int write)
> >   {
> > -    int first_page = 1;
> >       u32 base = (long)buf;
> >       s32 size;
> > 
> >       if (count > ns->max_req_size)
> >           count = ns->max_req_size;
> > 
> > -    nvme_reset_prpl(ns);
> > -
> >       size = count * ns->block_size;
> >       /* Special case for transfers that fit into PRP1, but are unaligned */
> >       if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE))
> > @@ -512,28 +496,29 @@ nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
> >       if (size & (ns->block_size - 1ULL))
> >           return nvme_bounce_xfer(ns, lba, buf, count, write);
> > 
> > -    for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
> > -        if (first_page) {
> > -            /* First page is special */
> > -            first_page = 0;
> > -            continue;
> > -        }
> > -        if (nvme_add_prpl(ns, base))
> > -            return nvme_bounce_xfer(ns, lba, buf, count, write);
> > -    }
> > -
> > -    void *prp2;
> >       if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
> > -        /* We need to describe more than 2 pages, rely on PRP List */
> > -        prp2 = ns->prpl;
> > +        /* We need to describe more than 2 pages, build PRP List */
> > +        u32 prpl_len = 0;
> > +        u64 *prpl = (void*)ns->dma_buffer;
> 
> 
> At 15*8=120 bytes of data, do we even need to put the prpl into dma_buffer
> or can we just use the stack? Will the array be guaranteed 64bit aligned or
> would we need to add an attribute to it?
> 
>     u64 prpl[NVME_MAX_PRPL_ENTRIES];
> 
> Either way, I don't have strong feelings one way or another. It just seems
> more natural to keep the bounce buffer purely as bounce buffer :).

In general it's possible to DMA from the stack (eg,
src/hw/ohci.c:ohci_send_pipe() ).  However, SeaBIOS doesn't guarantee
stack alignment so it would need to be done manually.  Also, I'm not
sure how tight the nvme request completion code is - if it returns
early for some reason it could cause havoc (device dma into random
memory).

Another option might be to make a single global prpl array, but that
would consume the memory even if nvme isn't in use.

FWIW though, I don't see a harm in the single ( u64 *prpl =
(void*)ns->dma_buffer ) line.

Thanks,
-Kevin


> 
> 
> Alex
> 
> 
> > +        int first_page = 1;
> > +        for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
> > +            if (first_page) {
> > +                /* First page is special */
> > +                first_page = 0;
> > +                continue;
> > +            }
> > +            if (prpl_len >= NVME_MAX_PRPL_ENTRIES)
> > +                return nvme_bounce_xfer(ns, lba, buf, count, write);
> > +            prpl[prpl_len++] = base;
> > +        }
> > +        return nvme_io_xfer(ns, lba, buf, prpl, count, write);
> >       } else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
> >           /* Directly embed the 2nd page if we only need 2 pages */
> > -        prp2 = (void *)(long)ns->prpl[0];
> > +        return nvme_io_xfer(ns, lba, buf, buf + NVME_PAGE_SIZE, count, write);
> >       } else {
> >           /* One page is enough, don't expose anything else */
> > -        prp2 = NULL;
> > +        return nvme_io_xfer(ns, lba, buf, NULL, count, write);
> >       }
> > -    return nvme_io_xfer(ns, lba, buf, prp2, count, write);
> >   }
> > 
> >   static int
> > --
> > 2.31.1
> > 
> 
> 
> 
> 
> 
[SeaBIOS] Re: [PATCH 5/5] nvme: Build the page list in the existing dma buffer
Posted by Alexander Graf via SeaBIOS 2 years, 9 months ago
On 21.01.22 17:02, Kevin O'Connor wrote:
> On Fri, Jan 21, 2022 at 03:28:33PM +0100, Alexander Graf wrote:
>> On 19.01.22 19:45, Kevin O'Connor wrote:
>>> Commit 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
>>> introduced multi-page requests using the NVMe PRP mechanism. To store the
>>> list and "first page to write to" hints, it added fields to the NVMe
>>> namespace struct.
>>>
>>> Unfortunately, that struct resides in fseg which is read-only at runtime.
>>> While KVM ignores the read-only part and allows writes, real hardware and
>>> TCG adhere to the semantics and ignore writes to the fseg region. The net
>>> effect of that is that reads and writes were always happening on address 0,
>>> unless they went through the bounce buffer logic.
>>>
>>> This patch builds the PRP maintenance data in the existing "dma bounce
>>> buffer" and only builds it when needed.
>>>
>>> Fixes: 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
>>> Reported-by: Matt DeVillier <matt.devillier@gmail.com>
>>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>> Signed-off-by: Kevin O'Connor <kevin@koconnor.net>
>>> ---
>>>    src/hw/nvme-int.h |  6 ------
>>>    src/hw/nvme.c     | 51 +++++++++++++++++------------------------------
>>>    2 files changed, 18 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
>>> index 9564c17..f0d717d 100644
>>> --- a/src/hw/nvme-int.h
>>> +++ b/src/hw/nvme-int.h
>>> @@ -10,8 +10,6 @@
>>>    #include "types.h" // u32
>>>    #include "pcidevice.h" // struct pci_device
>>>
>>> -#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
>>> -
>>>    /* Data structures */
>>>
>>>    /* The register file of a NVMe host controller. This struct follows the naming
>>> @@ -122,10 +120,6 @@ struct nvme_namespace {
>>>
>>>        /* Page aligned buffer of size NVME_PAGE_SIZE. */
>>>        char *dma_buffer;
>>> -
>>> -    /* Page List */
>>> -    u32 prpl_len;
>>> -    u64 prpl[NVME_MAX_PRPL_ENTRIES];
>>>    };
>>>
>>>    /* Data structures for NVMe admin identify commands */
>>> diff --git a/src/hw/nvme.c b/src/hw/nvme.c
>>> index 3a73784..39b9138 100644
>>> --- a/src/hw/nvme.c
>>> +++ b/src/hw/nvme.c
>>> @@ -470,35 +470,19 @@ nvme_bounce_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
>>>        return res;
>>>    }
>>>
>>> -static void nvme_reset_prpl(struct nvme_namespace *ns)
>>> -{
>>> -    ns->prpl_len = 0;
>>> -}
>>> -
>>> -static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
>>> -{
>>> -    if (ns->prpl_len >= NVME_MAX_PRPL_ENTRIES)
>>> -        return -1;
>>> -
>>> -    ns->prpl[ns->prpl_len++] = base;
>>> -
>>> -    return 0;
>>> -}
>>> +#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
>>>
>>>    // Transfer data using page list (if applicable)
>>>    static int
>>>    nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
>>>                   int write)
>>>    {
>>> -    int first_page = 1;
>>>        u32 base = (long)buf;
>>>        s32 size;
>>>
>>>        if (count > ns->max_req_size)
>>>            count = ns->max_req_size;
>>>
>>> -    nvme_reset_prpl(ns);
>>> -
>>>        size = count * ns->block_size;
>>>        /* Special case for transfers that fit into PRP1, but are unaligned */
>>>        if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE))
>>> @@ -512,28 +496,29 @@ nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
>>>        if (size & (ns->block_size - 1ULL))
>>>            return nvme_bounce_xfer(ns, lba, buf, count, write);
>>>
>>> -    for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
>>> -        if (first_page) {
>>> -            /* First page is special */
>>> -            first_page = 0;
>>> -            continue;
>>> -        }
>>> -        if (nvme_add_prpl(ns, base))
>>> -            return nvme_bounce_xfer(ns, lba, buf, count, write);
>>> -    }
>>> -
>>> -    void *prp2;
>>>        if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
>>> -        /* We need to describe more than 2 pages, rely on PRP List */
>>> -        prp2 = ns->prpl;
>>> +        /* We need to describe more than 2 pages, build PRP List */
>>> +        u32 prpl_len = 0;
>>> +        u64 *prpl = (void*)ns->dma_buffer;
>>
>> At 15*8=120 bytes of data, do we even need to put the prpl into dma_buffer
>> or can we just use the stack? Will the array be guaranteed 64bit aligned or
>> would we need to add an attribute to it?
>>
>>      u64 prpl[NVME_MAX_PRPL_ENTRIES];
>>
>> Either way, I don't have strong feelings one way or another. It just seems
>> more natural to keep the bounce buffer purely as bounce buffer :).
> In general it's possible to DMA from the stack (eg,
> src/hw/ohci.c:ohci_send_pipe() ).  However, SeaBIOS doesn't guarantee
> stack alignment so it would need to be done manually.  Also, I'm not
> sure how tight the nvme request completion code is - if it returns
> early for some reason it could cause havoc (device dma into random
> memory).
>
> Another option might be to make a single global prpl array, but that
> would consume the memory even if nvme isn't in use.
>
> FWIW though, I don't see a harm in the single ( u64 *prpl =
> (void*)ns->dma_buffer ) line.


Fair, works for me. But then we probably want to also adjust 
MAX_PRPL_ENTRIES to match the full page we now have available, no?


Alex







[SeaBIOS] Re: [PATCH 5/5] nvme: Build the page list in the existing dma buffer
Posted by Kevin O'Connor 2 years, 9 months ago
On Fri, Jan 21, 2022 at 05:41:17PM +0100, Alexander Graf wrote:
> On 21.01.22 17:02, Kevin O'Connor wrote:
> > On Fri, Jan 21, 2022 at 03:28:33PM +0100, Alexander Graf wrote:
> > > On 19.01.22 19:45, Kevin O'Connor wrote:
> > > >        if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
> > > > -        /* We need to describe more than 2 pages, rely on PRP List */
> > > > -        prp2 = ns->prpl;
> > > > +        /* We need to describe more than 2 pages, build PRP List */
> > > > +        u32 prpl_len = 0;
> > > > +        u64 *prpl = (void*)ns->dma_buffer;
> > > 
> > > At 15*8=120 bytes of data, do we even need to put the prpl into dma_buffer
> > > or can we just use the stack? Will the array be guaranteed 64bit aligned or
> > > would we need to add an attribute to it?
> > > 
> > >      u64 prpl[NVME_MAX_PRPL_ENTRIES];
> > > 
> > > Either way, I don't have strong feelings one way or another. It just seems
> > > more natural to keep the bounce buffer purely as bounce buffer :).
> > In general it's possible to DMA from the stack (eg,
> > src/hw/ohci.c:ohci_send_pipe() ).  However, SeaBIOS doesn't guarantee
> > stack alignment so it would need to be done manually.  Also, I'm not
> > sure how tight the nvme request completion code is - if it returns
> > early for some reason it could cause havoc (device dma into random
> > memory).
> > 
> > Another option might be to make a single global prpl array, but that
> > would consume the memory even if nvme isn't in use.
> > 
> > FWIW though, I don't see a harm in the single ( u64 *prpl =
> > (void*)ns->dma_buffer ) line.
> 
> 
> Fair, works for me. But then we probably want to also adjust
> MAX_PRPL_ENTRIES to match the full page we now have available, no?
> 

I don't think a BIOS disk request can be over 64KiB so I don't think
it matters.  I got the impression the current checks are just "sanity
checks".  I don't see a harm in keeping the sanity check and that size
is as good as any other as far as I know.

Cheers,
-Kevin
[SeaBIOS] Re: [PATCH 5/5] nvme: Build the page list in the existing dma buffer
Posted by Alexander Graf via SeaBIOS 2 years, 9 months ago
On 21.01.22 17:54, Kevin O'Connor wrote:
> On Fri, Jan 21, 2022 at 05:41:17PM +0100, Alexander Graf wrote:
>> On 21.01.22 17:02, Kevin O'Connor wrote:
>>> On Fri, Jan 21, 2022 at 03:28:33PM +0100, Alexander Graf wrote:
>>>> On 19.01.22 19:45, Kevin O'Connor wrote:
>>>>>         if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
>>>>> -        /* We need to describe more than 2 pages, rely on PRP List */
>>>>> -        prp2 = ns->prpl;
>>>>> +        /* We need to describe more than 2 pages, build PRP List */
>>>>> +        u32 prpl_len = 0;
>>>>> +        u64 *prpl = (void*)ns->dma_buffer;
>>>> At 15*8=120 bytes of data, do we even need to put the prpl into dma_buffer
>>>> or can we just use the stack? Will the array be guaranteed 64bit aligned or
>>>> would we need to add an attribute to it?
>>>>
>>>>       u64 prpl[NVME_MAX_PRPL_ENTRIES];
>>>>
>>>> Either way, I don't have strong feelings one way or another. It just seems
>>>> more natural to keep the bounce buffer purely as bounce buffer :).
>>> In general it's possible to DMA from the stack (eg,
>>> src/hw/ohci.c:ohci_send_pipe() ).  However, SeaBIOS doesn't guarantee
>>> stack alignment so it would need to be done manually.  Also, I'm not
>>> sure how tight the nvme request completion code is - if it returns
>>> early for some reason it could cause havoc (device dma into random
>>> memory).
>>>
>>> Another option might be to make a single global prpl array, but that
>>> would consume the memory even if nvme isn't in use.
>>>
>>> FWIW though, I don't see a harm in the single ( u64 *prpl =
>>> (void*)ns->dma_buffer ) line.
>>
>> Fair, works for me. But then we probably want to also adjust
>> MAX_PRPL_ENTRIES to match the full page we now have available, no?
>>
> I don't think a BIOS disk request can be over 64KiB so I don't think
> it matters.  I got the impression the current checks are just "sanity
> checks".  I don't see a harm in keeping the sanity check and that size
> is as good as any other as far as I know.


It's a bit of both: Sanity checks and code that potentially can be 
reused outside of the SeaBIOS context. So I would try as hard as 
possible to not make assumptions that we can only handle max 64kb I/O 
requests. Plus, if we really wanted to, we could even introduce a new 
SeaBIOS specific INT to do larger I/O requests.

I won't make a fuss if you want to keep it at 64kb max request size 
though :).


Alex







[SeaBIOS] Re: [PATCH 5/5] nvme: Build the page list in the existing dma buffer
Posted by Kevin O'Connor 2 years, 8 months ago
On Fri, Jan 21, 2022 at 09:44:46PM +0100, Alexander Graf wrote:
> On 21.01.22 17:54, Kevin O'Connor wrote:
> > On Fri, Jan 21, 2022 at 05:41:17PM +0100, Alexander Graf wrote:
> > > On 21.01.22 17:02, Kevin O'Connor wrote:
> > > > On Fri, Jan 21, 2022 at 03:28:33PM +0100, Alexander Graf wrote:
> > > > > On 19.01.22 19:45, Kevin O'Connor wrote:
> > > > > >         if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
> > > > > > -        /* We need to describe more than 2 pages, rely on PRP List */
> > > > > > -        prp2 = ns->prpl;
> > > > > > +        /* We need to describe more than 2 pages, build PRP List */
> > > > > > +        u32 prpl_len = 0;
> > > > > > +        u64 *prpl = (void*)ns->dma_buffer;
> > > > > At 15*8=120 bytes of data, do we even need to put the prpl into dma_buffer
> > > > > or can we just use the stack? Will the array be guaranteed 64bit aligned or
> > > > > would we need to add an attribute to it?
> > > > > 
> > > > >       u64 prpl[NVME_MAX_PRPL_ENTRIES];
> > > > > 
> > > > > Either way, I don't have strong feelings one way or another. It just seems
> > > > > more natural to keep the bounce buffer purely as bounce buffer :).
> > > > In general it's possible to DMA from the stack (eg,
> > > > src/hw/ohci.c:ohci_send_pipe() ).  However, SeaBIOS doesn't guarantee
> > > > stack alignment so it would need to be done manually.  Also, I'm not
> > > > sure how tight the nvme request completion code is - if it returns
> > > > early for some reason it could cause havoc (device dma into random
> > > > memory).
> > > > 
> > > > Another option might be to make a single global prpl array, but that
> > > > would consume the memory even if nvme isn't in use.
> > > > 
> > > > FWIW though, I don't see a harm in the single ( u64 *prpl =
> > > > (void*)ns->dma_buffer ) line.
> > > 
> > > Fair, works for me. But then we probably want to also adjust
> > > MAX_PRPL_ENTRIES to match the full page we now have available, no?
> > > 
> > I don't think a BIOS disk request can be over 64KiB so I don't think
> > it matters.  I got the impression the current checks are just "sanity
> > checks".  I don't see a harm in keeping the sanity check and that size
> > is as good as any other as far as I know.
> 
> It's a bit of both: Sanity checks and code that potentially can be reused
> outside of the SeaBIOS context. So I would try as hard as possible to not
> make assumptions that we can only handle max 64kb I/O requests.

I agree.  I also think it would be good to address the two items I
raised at:
https://www.mail-archive.com/seabios@seabios.org/msg12833.html

I'd prefer if someone more familiar with nvme (and has a better
testing infrastructure) could look at the above items though.  I'm not
familiar with the nvme hardware specs and I'm just testing by
"checking if qemu starts".

I'd be inclined to go forward with the current patch series as the
above items seem orthogonal.  That is, if there is interest, I'd guess
they would be best merged on top of this series anyway.

Cheers,
-Kevin
[SeaBIOS] Re: [PATCH 5/5] nvme: Build the page list in the existing dma buffer
Posted by Graf (AWS), Alexander via SeaBIOS 2 years, 8 months ago

> Am 22.01.2022 um 17:34 schrieb Kevin O'Connor <kevin@koconnor.net>:
> 
>> On Fri, Jan 21, 2022 at 09:44:46PM +0100, Alexander Graf wrote:
>>> On 21.01.22 17:54, Kevin O'Connor wrote:
>>> On Fri, Jan 21, 2022 at 05:41:17PM +0100, Alexander Graf wrote:
>>>> On 21.01.22 17:02, Kevin O'Connor wrote:
>>>>> On Fri, Jan 21, 2022 at 03:28:33PM +0100, Alexander Graf wrote:
>>>>>> On 19.01.22 19:45, Kevin O'Connor wrote:
>>>>>>>        if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
>>>>>>> -        /* We need to describe more than 2 pages, rely on PRP List */
>>>>>>> -        prp2 = ns->prpl;
>>>>>>> +        /* We need to describe more than 2 pages, build PRP List */
>>>>>>> +        u32 prpl_len = 0;
>>>>>>> +        u64 *prpl = (void*)ns->dma_buffer;
>>>>>> At 15*8=120 bytes of data, do we even need to put the prpl into dma_buffer
>>>>>> or can we just use the stack? Will the array be guaranteed 64bit aligned or
>>>>>> would we need to add an attribute to it?
>>>>>> 
>>>>>>      u64 prpl[NVME_MAX_PRPL_ENTRIES];
>>>>>> 
>>>>>> Either way, I don't have strong feelings one way or another. It just seems
>>>>>> more natural to keep the bounce buffer purely as bounce buffer :).
>>>>> In general it's possible to DMA from the stack (eg,
>>>>> src/hw/ohci.c:ohci_send_pipe() ).  However, SeaBIOS doesn't guarantee
>>>>> stack alignment so it would need to be done manually.  Also, I'm not
>>>>> sure how tight the nvme request completion code is - if it returns
>>>>> early for some reason it could cause havoc (device dma into random
>>>>> memory).
>>>>> 
>>>>> Another option might be to make a single global prpl array, but that
>>>>> would consume the memory even if nvme isn't in use.
>>>>> 
>>>>> FWIW though, I don't see a harm in the single ( u64 *prpl =
>>>>> (void*)ns->dma_buffer ) line.
>>>> 
>>>> Fair, works for me. But then we probably want to also adjust
>>>> MAX_PRPL_ENTRIES to match the full page we now have available, no?
>>>> 
>>> I don't think a BIOS disk request can be over 64KiB so I don't think
>>> it matters.  I got the impression the current checks are just "sanity
>>> checks".  I don't see a harm in keeping the sanity check and that size
>>> is as good as any other as far as I know.
>> 
>> It's a bit of both: Sanity checks and code that potentially can be reused
>> outside of the SeaBIOS context. So I would try as hard as possible to not
>> make assumptions that we can only handle max 64kb I/O requests.
> 
> I agree.  I also think it would be good to address the two items I
> raised at:
> https://www.mail-archive.com/seabios@seabios.org/msg12833.html
> 
> I'd prefer if someone more familiar with nvme (and has a better
> testing infrastructure) could look at the above items though.  I'm not
> familiar with the nvme hardware specs and I'm just testing by
> "checking if qemu starts".
> 
> I'd be inclined to go forward with the current patch series as the
> above items seem orthogonal.  That is, if there is interest, I'd guess
> they would be best merged on top of this series anyway.

I agree with that plan :). Let me see if I can find a few cycles to be that person some time soon.

Alex

> 
> Cheers,
> -Kevin





