[SeaBIOS] [PATCH v3 4/4] nvme: Split requests by maximum allowed size

Posted by Alexander Graf 1 month, 3 weeks ago
Some NVMe controllers only support small maximum request sizes, such as
the AWS EBS NVMe implementation, which only supports requests of up to
32 pages (256 KB) at once.

BIOS callers can exceed those request sizes by defining sector counts
above this threshold. Currently we fall back to the bounce buffer
implementation for those. This is slow.

This patch introduces splitting logic to the NVMe I/O request code so
that every NVMe I/O request gets handled in a chunk size that is
consumable by the NVMe adapter, while maintaining the fast path PRPL
logic we just introduced.

Signed-off-by: Alexander Graf <graf@amazon.com>
---
 src/hw/nvme.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index b92ca52..cc37bca 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
     u16 i;
 
+    /* Split up requests that are larger than the device can handle */
+    if (op->count > ns->max_req_size) {
+        u16 count = op->count;
+
+        /* Handle the first max_req_size elements */
+        op->count = ns->max_req_size;
+        if (nvme_cmd_readwrite(ns, op, write))
+            return res;
+
+        /* Handle the remainder of the request */
+        op->count = count - ns->max_req_size;
+        op->lba += ns->max_req_size;
+        op->buf_fl += (ns->max_req_size * ns->block_size);
+        return nvme_cmd_readwrite(ns, op, write);
+    }
+
     if (!nvme_build_prpl(ns, op)) {
         /* Request goes via PRP List logic */
         return nvme_io_readwrite(ns, op->lba, ns->prp1, op->count, write);
-- 
2.16.4

[SeaBIOS] Re: [PATCH v3 4/4] nvme: Split requests by maximum allowed size

Posted by Kevin O'Connor 1 month, 3 weeks ago
On Wed, Sep 30, 2020 at 11:10:56PM +0200, Alexander Graf wrote:
> Some NVMe controllers only support small maximum request sizes, such as
> the AWS EBS NVMe implementation, which only supports requests of up to
> 32 pages (256 KB) at once.
> 
> BIOS callers can exceed those request sizes by defining sector counts
> above this threshold. Currently we fall back to the bounce buffer
> implementation for those. This is slow.
> 
> This patch introduces splitting logic to the NVMe I/O request code so
> that every NVMe I/O request gets handled in a chunk size that is
> consumable by the NVMe adapter, while maintaining the fast path PRPL
> logic we just introduced.
> 
> Signed-off-by: Alexander Graf <graf@amazon.com>
> ---
>  src/hw/nvme.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/src/hw/nvme.c b/src/hw/nvme.c
> index b92ca52..cc37bca 100644
> --- a/src/hw/nvme.c
> +++ b/src/hw/nvme.c
> @@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
>      u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
>      u16 i;
>  
> +    /* Split up requests that are larger than the device can handle */
> +    if (op->count > ns->max_req_size) {
> +        u16 count = op->count;
> +
> +        /* Handle the first max_req_size elements */
> +        op->count = ns->max_req_size;
> +        if (nvme_cmd_readwrite(ns, op, write))
> +            return res;
> +
> +        /* Handle the remainder of the request */
> +        op->count = count - ns->max_req_size;
> +        op->lba += ns->max_req_size;
> +        op->buf_fl += (ns->max_req_size * ns->block_size);
> +        return nvme_cmd_readwrite(ns, op, write);
> +    }

Depending on the disk access, this code could run with a small stack.
I would avoid recursion.

Otherwise, the patch series looks okay to me.  (I don't have enough
knowledge of the nvme code to give a full review though.)

-Kevin

[SeaBIOS] Re: [PATCH v3 4/4] nvme: Split requests by maximum allowed size

Posted by Graf (AWS), Alexander 1 month, 3 weeks ago
Hey Kevin,

> On 03.10.2020 at 03:47, Kevin O'Connor <kevin@koconnor.net> wrote:
> 
> 
>> On Wed, Sep 30, 2020 at 11:10:56PM +0200, Alexander Graf wrote:
>> Some NVMe controllers only support small maximum request sizes, such as
>> the AWS EBS NVMe implementation, which only supports requests of up to
>> 32 pages (256 KB) at once.
>> 
>> BIOS callers can exceed those request sizes by defining sector counts
>> above this threshold. Currently we fall back to the bounce buffer
>> implementation for those. This is slow.
>> 
>> This patch introduces splitting logic to the NVMe I/O request code so
>> that every NVMe I/O request gets handled in a chunk size that is
>> consumable by the NVMe adapter, while maintaining the fast path PRPL
>> logic we just introduced.
>> 
>> Signed-off-by: Alexander Graf <graf@amazon.com>
>> ---
>> src/hw/nvme.c | 16 ++++++++++++++++
>> 1 file changed, 16 insertions(+)
>> 
>> diff --git a/src/hw/nvme.c b/src/hw/nvme.c
>> index b92ca52..cc37bca 100644
>> --- a/src/hw/nvme.c
>> +++ b/src/hw/nvme.c
>> @@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
>>     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
>>     u16 i;
>> 
>> +    /* Split up requests that are larger than the device can handle */
>> +    if (op->count > ns->max_req_size) {
>> +        u16 count = op->count;
>> +
>> +        /* Handle the first max_req_size elements */
>> +        op->count = ns->max_req_size;
>> +        if (nvme_cmd_readwrite(ns, op, write))
>> +            return res;
>> +
>> +        /* Handle the remainder of the request */
>> +        op->count = count - ns->max_req_size;
>> +        op->lba += ns->max_req_size;
>> +        op->buf_fl += (ns->max_req_size * ns->block_size);
>> +        return nvme_cmd_readwrite(ns, op, write);
>> +    }
> 
> Depending on the disk access, this code could run with a small stack.
> I would avoid recursion.

This is tail recursion, which any reasonable compiler converts into a jmp :). Take a look at the .o file - for me it did become a plain jump.
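
Strictly speaking, only the second call is in tail position; the first
call cannot recurse again, because op->count has just been clamped to
max_req_size, so the stack depth stays bounded either way. Purely as an
illustration (this is not part of the patch), the split is equivalent
to the following loop, which is roughly what a tail-call-optimizing
compiler emits; do_chunk() is a hypothetical stand-in for the non-split
PRPL/bounce path:

static int
nvme_split_readwrite(struct nvme_namespace *ns, struct disk_op_s *op,
                     int write)
{
    /* Sketch only, not from the patch: the split rewritten as a loop.
     * do_chunk() is a hypothetical helper for the non-split path. */
    while (op->count > ns->max_req_size) {
        u16 count = op->count;
        int ret;

        /* Handle the first max_req_size elements */
        op->count = ns->max_req_size;
        ret = do_chunk(ns, op, write);
        if (ret != DISK_RET_SUCCESS)
            return ret;

        /* Advance past the chunk just issued */
        op->count = count - ns->max_req_size;
        op->lba += ns->max_req_size;
        op->buf_fl += ns->max_req_size * ns->block_size;
    }
    return do_chunk(ns, op, write);
}

Note that the loop sketch propagates the error from the first chunk,
whereas the hunk above returns res, which is still DISK_RET_SUCCESS at
that point, if the first sub-request fails.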

> 
> Otherwise, the patch series looks okay to me.  (I don't have enough
> knowledge of the nvme code to give a full review though).

Thanks :)

Alex

[SeaBIOS] Re: [PATCH v3 4/4] nvme: Split requests by maximum allowed size

Posted by Kevin O'Connor 1 month, 2 weeks ago
On Sat, Oct 03, 2020 at 05:40:58AM +0000, Graf (AWS), Alexander wrote:
> Hey Kevin,
> 
> > On 03.10.2020 at 03:47, Kevin O'Connor <kevin@koconnor.net> wrote:
> > 
> > 
> >> On Wed, Sep 30, 2020 at 11:10:56PM +0200, Alexander Graf wrote:
> >> Some NVMe controllers only support small maximum request sizes, such as
> >> the AWS EBS NVMe implementation, which only supports requests of up to
> >> 32 pages (256 KB) at once.
> >> 
> >> BIOS callers can exceed those request sizes by defining sector counts
> >> above this threshold. Currently we fall back to the bounce buffer
> >> implementation for those. This is slow.
> >> 
> >> This patch introduces splitting logic to the NVMe I/O request code so
> >> that every NVMe I/O request gets handled in a chunk size that is
> >> consumable by the NVMe adapter, while maintaining the fast path PRPL
> >> logic we just introduced.
> >> 
> >> Signed-off-by: Alexander Graf <graf@amazon.com>
> >> ---
> >> src/hw/nvme.c | 16 ++++++++++++++++
> >> 1 file changed, 16 insertions(+)
> >> 
> >> diff --git a/src/hw/nvme.c b/src/hw/nvme.c
> >> index b92ca52..cc37bca 100644
> >> --- a/src/hw/nvme.c
> >> +++ b/src/hw/nvme.c
> >> @@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
> >>     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
> >>     u16 i;
> >> 
> >> +    /* Split up requests that are larger than the device can handle */
> >> +    if (op->count > ns->max_req_size) {
> >> +        u16 count = op->count;
> >> +
> >> +        /* Handle the first max_req_size elements */
> >> +        op->count = ns->max_req_size;
> >> +        if (nvme_cmd_readwrite(ns, op, write))
> >> +            return res;
> >> +
> >> +        /* Handle the remainder of the request */
> >> +        op->count = count - ns->max_req_size;
> >> +        op->lba += ns->max_req_size;
> >> +        op->buf_fl += (ns->max_req_size * ns->block_size);
> >> +        return nvme_cmd_readwrite(ns, op, write);
> >> +    }
> > 
> > Depending on the disk access, this code could run with a small stack.
> > I would avoid recursion.
> 
> This is tail recursion, which any reasonable compiler converts into a jmp :). Take a look at the .o file - for me it did become a plain jump.
> 

Okay, that's fine then.

> > 
> > Otherwise, the patch series looks okay to me.  (I don't have enough
> > knowledge of the nvme code to give a full review though).
> 
> Thanks :)

I'll leave a few more days to allow others to comment, but otherwise
I'm fine with committing.

Cheers,
-Kevin

[SeaBIOS] Re: [PATCH v3 4/4] nvme: Split requests by maximum allowed size

Posted by David Woodhouse 3 weeks, 5 days ago
On Wed, 2020-10-07 at 12:13 -0400, Kevin O'Connor wrote:
> > > > --- a/src/hw/nvme.c
> > > > +++ b/src/hw/nvme.c
> > > > @@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
> > > >      u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
> > > >      u16 i;
> > > > 
> > > > +    /* Split up requests that are larger than the device can handle */
> > > > +    if (op->count > ns->max_req_size) {
> > > > +        u16 count = op->count;
> > > > +
> > > > +        /* Handle the first max_req_size elements */
> > > > +        op->count = ns->max_req_size;
> > > > +        if (nvme_cmd_readwrite(ns, op, write))
> > > > +            return res;
> > > > +
> > > > +        /* Handle the remainder of the request */
> > > > +        op->count = count - ns->max_req_size;
> > > > +        op->lba += ns->max_req_size;
> > > > +        op->buf_fl += (ns->max_req_size * ns->block_size);
> > > > +        return nvme_cmd_readwrite(ns, op, write);
> > > > +    }
> > > 
> > > Depending on the disk access, this code could run with a small stack.
> > > I would avoid recursion.
> > 
> > This is tail recursion, which any reasonable compiler converts into
> > a jmp :). Take a look at the .o file - for me it did become a plain
> > jump.
> > 
> 
> Okay, that's fine then.

It's still kind of icky. This used to be a purely iterative function.
Now for a large unaligned request (which nvme_build_prpl refuses to
handle) you get a mixture of (mostly tail) recursion and iteration.

It'll recurse for each max_req_size, then iterate over each page within
that.

I'd much rather see just a simple iterative loop. Something like this
(incremental, completely untested):

--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -472,19 +472,20 @@ static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
 
 int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
 {
-    int first_page = 1;
+    int first_page = 1, count = op->count;
     u32 base = (long)op->buf_fl;
-    s32 size = op->count * ns->block_size;
+    s32 size;
 
-    if (op->count > ns->max_req_size)
-        return -1;
+    if (count > ns->max_req_size)
+        count = ns->max_req_size;
 
     nvme_reset_prpl(ns);
 
+    size = count * ns->block_size;
     /* Special case for transfers that fit into PRP1, but are unaligned */
     if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
         ns->prp1 = op->buf_fl;
-        return 0;
+        return count;
     }
 
     /* Every request has to be page aligned */
@@ -506,7 +507,7 @@ int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
             return -1;
     }
 
-    return 0;
+    return count;
 }
 
 static int
@@ -725,27 +726,16 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
 {
     int res = DISK_RET_SUCCESS;
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
-    u16 i;
-
-    /* Split up requests that are larger than the device can handle */
-    if (op->count > ns->max_req_size) {
-        u16 count = op->count;
+    u16 i, count;
 
-        /* Handle the first max_req_size elements */
-        op->count = ns->max_req_size;
-        if (nvme_cmd_readwrite(ns, op, write))
-            return res;
-
-        /* Handle the remainder of the request */
-        op->count = count - ns->max_req_size;
-        op->lba += ns->max_req_size;
-        op->buf_fl += (ns->max_req_size * ns->block_size);
-        return nvme_cmd_readwrite(ns, op, write);
-    }
+    while (op->count && ((count = nvme_build_prpl(ns, op)) > 0)) {
+        res = nvme_io_readwrite(ns, op->lba, ns->prp1, count, write);
+        if (res != DISK_RET_SUCCESS)
+            break;
 
-    if (!nvme_build_prpl(ns, op)) {
-        /* Request goes via PRP List logic */
-        return nvme_io_readwrite(ns, op->lba, ns->prp1, op->count, write);
+        op->count -= count;
+        op->lba += count;
+        op->buf_fl += (count * ns->block_size);
     }
 
     for (i = 0; i < op->count && res == DISK_RET_SUCCESS;) {


[SeaBIOS] Re: [PATCH v3 4/4] nvme: Split requests by maximum allowed size

Posted by Kevin O'Connor 3 weeks, 3 days ago
On Thu, Oct 29, 2020 at 05:21:04PM +0000, David Woodhouse wrote:
> On Wed, 2020-10-07 at 12:13 -0400, Kevin O'Connor wrote:
> > > > > --- a/src/hw/nvme.c
> > > > > +++ b/src/hw/nvme.c
> > > > > @@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
> > > > >      u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
> > > > >      u16 i;
> > > > > 
> > > > > +    /* Split up requests that are larger than the device can handle */
> > > > > +    if (op->count > ns->max_req_size) {
> > > > > +        u16 count = op->count;
> > > > > +
> > > > > +        /* Handle the first max_req_size elements */
> > > > > +        op->count = ns->max_req_size;
> > > > > +        if (nvme_cmd_readwrite(ns, op, write))
> > > > > +            return res;
> > > > > +
> > > > > +        /* Handle the remainder of the request */
> > > > > +        op->count = count - ns->max_req_size;
> > > > > +        op->lba += ns->max_req_size;
> > > > > +        op->buf_fl += (ns->max_req_size * ns->block_size);
> > > > > +        return nvme_cmd_readwrite(ns, op, write);
> > > > > +    }
> > > > 
> > > > Depending on the disk access, this code could run with a small stack.
> > > > I would avoid recursion.
> > > 
> > > This is tail recursion, which any reasonable compiler converts into
> > > a jmp :). Take a look at the .o file - for me it did become a plain
> > > jump.
> > > 
> > 
> > Okay, that's fine then.
> 
> It's still kind of icky. This used to be a purely iterative function.
> Now for a large unaligned request (which nvme_build_prpl refuses to
> handle) you get a mixture of (mostly tail) recursion and iteration.
> 
> It'll recurse for each max_req_size, then iterate over each page within
> that.
> 
> I'd much rather see just a simple iterative loop. Something like this
> (incremental, completely untested):

FWIW, I agree that a version without recursion (even tail recursion)
would be nice.  If someone wants to test and confirm that, then that
would be great.

-Kevin

[SeaBIOS] [PATCH] nvme: Clean up nvme_cmd_readwrite()

Posted by David Woodhouse 3 weeks, 3 days ago
From: David Woodhouse <dwmw@amazon.co.uk>

This ended up with an odd mix of recursion (albeit *mostly* tail-recursion)
and iteration that could have been prettier. Make it prettier by having
nvme_build_prpl() return the number of blocks it consumes, and just
looping.

If nvme_build_prpl() doesn't want to play, just fall through to the
original loop.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
On Fri, 2020-10-30 at 19:09 -0400, Kevin O'Connor wrote:
> FWIW, I agree that a version without recursion (even tail recursion)
> would be nice.  If someone wants to test and confirm that, then that
> would be great.

Seems to work here... I just tidied it up a little more and added a
dprintf() to match the original loop. Didn't see it actually limiting
the requests in my testing, so I made nvme_build_prpl() limit to 32
blocks, and then it did... and it still seems to put things in the
right place.
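
The cap was just a throwaway hack in nvme_build_prpl(), along these
lines (illustrative only, not included in the patch below):

    /* Test hack, not part of the patch: pretend the controller only
     * accepts 32 blocks per request so the splitting loop gets
     * exercised even on a disk with a large MDTS. */
    if (count > 32)
        count = 32;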

 src/hw/nvme.c | 47 +++++++++++++++++++++--------------------------
 1 file changed, 21 insertions(+), 26 deletions(-)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index cc37bca..5363d3f 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -472,19 +472,20 @@ static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
 
 int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
 {
-    int first_page = 1;
+    int first_page = 1, count = op->count;
     u32 base = (long)op->buf_fl;
-    s32 size = op->count * ns->block_size;
+    s32 size;
 
-    if (op->count > ns->max_req_size)
-        return -1;
+    if (count > ns->max_req_size)
+        count = ns->max_req_size;
 
     nvme_reset_prpl(ns);
 
+    size = count * ns->block_size;
     /* Special case for transfers that fit into PRP1, but are unaligned */
     if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
         ns->prp1 = op->buf_fl;
-        return 0;
+        return count;
     }
 
     /* Every request has to be page aligned */
@@ -506,7 +507,7 @@ int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
             return -1;
     }
 
-    return 0;
+    return count;
 }
 
 static int
@@ -725,34 +726,28 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
 {
     int res = DISK_RET_SUCCESS;
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
-    u16 i;
+    u16 i, blocks;
 
-    /* Split up requests that are larger than the device can handle */
-    if (op->count > ns->max_req_size) {
-        u16 count = op->count;
+    while (op->count && ((blocks = nvme_build_prpl(ns, op)) > 0)) {
+        res = nvme_io_readwrite(ns, op->lba, ns->prp1, blocks, write);
+        dprintf(5, "ns %u %s lba %llu+%u: %d\n", ns->ns_id, write ? "write"
+                                                                  : "read",
+                op->lba, blocks, res);
 
-        /* Handle the first max_req_size elements */
-        op->count = ns->max_req_size;
-        if (nvme_cmd_readwrite(ns, op, write))
+        if (res != DISK_RET_SUCCESS)
             return res;
 
-        /* Handle the remainder of the request */
-        op->count = count - ns->max_req_size;
-        op->lba += ns->max_req_size;
-        op->buf_fl += (ns->max_req_size * ns->block_size);
-        return nvme_cmd_readwrite(ns, op, write);
-    }
-
-    if (!nvme_build_prpl(ns, op)) {
-        /* Request goes via PRP List logic */
-        return nvme_io_readwrite(ns, op->lba, ns->prp1, op->count, write);
+        op->count -= blocks;
+        op->lba += blocks;
+        op->buf_fl += (blocks * ns->block_size);
     }
 
     for (i = 0; i < op->count && res == DISK_RET_SUCCESS;) {
-        u16 blocks_remaining = op->count - i;
-        u16 blocks = blocks_remaining < max_blocks ? blocks_remaining
-                                                   : max_blocks;
         char *op_buf = op->buf_fl + i * ns->block_size;
+        u16 blocks_remaining = op->count - i;
+
+        blocks = blocks_remaining < max_blocks ? blocks_remaining
+                                               : max_blocks;
 
         if (write) {
             memcpy(ns->dma_buffer, op_buf, blocks * ns->block_size);
-- 
2.17.1