[PATCH v2] pflash: Only read non-zero parts of backend image

Posted by Gerd Hoffmann 3 years, 1 month ago

From: Xiang Zheng <zhengxiang9@huawei.com>

Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
when using persistent UEFI variables on the virt board. In practice only
a very small (non-zero) part of that memory is used, while the much
larger (zero) remainder is wasted.

So this patch checks the block status and only writes the non-zero parts
into memory. This requires pflash devices to use sparse files as
backends.

Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>

[ kraxel: rebased to latest master ]

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 hw/block/block.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/hw/block/block.c b/hw/block/block.c
index f9c4fe67673b..142ebe4267e4 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -14,6 +14,40 @@
 #include "qapi/error.h"
 #include "qapi/qapi-types-block.h"
 
+/*
+ * Read the non-zero parts of @blk into @buf.
+ * Reading all of @blk is expensive if the zero parts of @blk
+ * are large. Therefore check the block status and only write
+ * the non-zero blocks into @buf.
+ *
+ * Return 0 on success, a negative errno value on error.
+ */
+static int blk_pread_nonzeroes(BlockBackend *blk, hwaddr size, void *buf)
+{
+    int ret;
+    int64_t bytes, offset = 0;
+    BlockDriverState *bs = blk_bs(blk);
+
+    for (;;) {
+        bytes = MIN(size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (!(ret & BDRV_BLOCK_ZERO)) {
+            ret = blk_pread(blk, offset, bytes,
+                            (uint8_t *) buf + offset, 0);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+        offset += bytes;
+    }
+}
+
 /*
  * Read the entire contents of @blk into @buf.
  * @blk's contents must be @size bytes, and @size must be at most
@@ -53,7 +87,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, size, buf, 0);
+    ret = blk_pread_nonzeroes(blk, size, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
-- 
2.38.1
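
For readers who want to try the idea outside QEMU, here is a minimal
sketch of the same technique at the plain-file level. It is not QEMU
code: it assumes a Linux-like host where lseek() supports SEEK_DATA and
SEEK_HOLE, and the names read_data_extents() and sparse_read_demo() are
made up for illustration. As in the patch, the destination buffer is
assumed to be zero-initialised already, so holes are simply skipped.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes SEEK_DATA / SEEK_HOLE on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA               /* assumption: fall back to Linux ABI values */
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Copy only the data extents of fd into buf[0..size).  The caller must
 * pass a zero-initialised buffer, so holes need no read at all.
 * Returns 0 on success or a negative errno value on failure.
 */
static int read_data_extents(int fd, void *buf, off_t size)
{
    off_t offset = 0;

    assert(size >= 0);
    while (offset < size) {
        off_t data = lseek(fd, offset, SEEK_DATA);
        if (data < 0) {
            /* ENXIO means there is no data at or after offset */
            return errno == ENXIO ? 0 : -errno;
        }
        if (data >= size) {
            return 0;
        }
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0) {
            return -errno;
        }
        if (hole <= data) {
            return -EIO;        /* file changed underneath us */
        }
        if (hole > size) {
            hole = size;
        }
        while (data < hole) {
            ssize_t n = pread(fd, (char *)buf + data,
                              (size_t)(hole - data), data);
            if (n < 0) {
                return -errno;
            }
            if (n == 0) {
                return -EIO;    /* unexpected end of file */
            }
            data += n;
        }
        offset = hole;
    }
    return 0;
}

/* Usage: a 1 MiB sparse file with five bytes of data at offset 64 KiB. */
static int sparse_read_demo(void)
{
    char path[] = "/tmp/pflash-demo-XXXXXX";
    const off_t size = 1 << 20;
    int ret = -EIO;
    int fd = mkstemp(path);

    if (fd < 0) {
        return -errno;
    }
    unlink(path);               /* file disappears once fd is closed */
    unsigned char *buf = calloc(1, (size_t)size);
    if (buf && ftruncate(fd, size) == 0 &&
        pwrite(fd, "hello", 5, 65536) == 5) {
        ret = read_data_extents(fd, buf, size);
        if (ret == 0 &&
            (memcmp(buf + 65536, "hello", 5) != 0 || buf[0] != 0)) {
            ret = -EIO;         /* data missing or a hole is not zero */
        }
    }
    free(buf);
    close(fd);
    return ret;
}
```

On a filesystem without hole support the whole file is reported as one
data extent, so the sketch degrades to a full read rather than failing.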

Re: [PATCH v2] pflash: Only read non-zero parts of backend image

Posted by Kevin Wolf 3 years, 1 month ago

On 20.12.2022 at 09:42, Gerd Hoffmann wrote:
> From: Xiang Zheng <zhengxiang9@huawei.com>
> 
> Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> when using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
> 
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.
> 
> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
> 
> [ kraxel: rebased to latest master ]
> 
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

Thanks, applied to the block branch.

Even though discussion is ongoing about using alternative devices, it
seems to me that this is a simple optimisation that doesn't change the
behaviour as seen by the guest and that we want to have either way. If
anyone objects and wants me to drop the patch again, let me know.

Kevin

Re: [PATCH v2] pflash: Only read non-zero parts of backend image

Posted by Daniel P. Berrangé 3 years, 1 month ago

On Tue, Dec 20, 2022 at 09:42:46AM +0100, Gerd Hoffmann wrote:
> From: Xiang Zheng <zhengxiang9@huawei.com>
> 
> Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> when using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
> 
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.
> 
> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
> 
> [ kraxel: rebased to latest master ]
> 
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  hw/block/block.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH v2] pflash: Only read non-zero parts of backend image

Posted by Philippe Mathieu-Daudé 3 years, 1 month ago

[Extending to people using UEFI VARStore on Virt machines]

On 20/12/22 09:42, Gerd Hoffmann wrote:
> From: Xiang Zheng <zhengxiang9@huawei.com>
> 
> Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> when using persistent UEFI variables on virt board. Actually we only use
> a very small(non-zero) part of the memory while the rest significant
> large(zero) part of memory is wasted.
> 
> So this patch checks the block status and only writes the non-zero part
> into memory. This requires pflash devices to use sparse files for
> backends.

I like the idea, but I'm not sure how it relates to NOR flash devices.

From the block layer, we get BDRV_BLOCK_ZERO when a block is entirely
filled with zeroes ('\0').

We don't want to waste host memory, I get it.

Now what does the guest "see"? Is the UEFI VARStore filled with zeroes?
If so, is it an EDK2-specific case for all virt machines? This would
be a virtualization optimization and in that case, this patch would
work.

On hardware, the NOR flash "erased state" is all '\xff'. If
EDK2 requires a 64MiB VARStore on NOR flash, I'd expect the unused
area to be filled with \xff, at least up to the sector size. Otherwise
it is a sub-optimal use of persistent storage on hardware.
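
To make the zero-versus-erased distinction concrete, here is a small
sketch that classifies a region as all 0x00, all 0xff, or real data.
The names region_is_filled() and classify_region() are hypothetical and
not part of QEMU or EDK2. The point it illustrates is that a check for
zero blocks alone would never treat an erased (0xff) NOR region as
skippable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NOR_ERASED_BYTE 0xff    /* erased NOR cells read back as all ones */

enum region_kind { REGION_ZERO, REGION_ERASED, REGION_DATA };

/* Return true if every byte of buf[0..len) equals value. */
static bool region_is_filled(const uint8_t *buf, size_t len, uint8_t value)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != value) {
            return false;
        }
    }
    return true;
}

/* Classify a non-empty region as zero padding, erased flash, or data. */
static enum region_kind classify_region(const uint8_t *buf, size_t len)
{
    assert(len > 0);
    if (region_is_filled(buf, len, 0x00)) {
        return REGION_ZERO;
    }
    if (region_is_filled(buf, len, NOR_ERASED_BYTE)) {
        return REGION_ERASED;
    }
    return REGION_DATA;
}
```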

But instead of insisting on that, I'd like to step back a little
and discuss. What is the use case?

* Either you want to test UEFI on real hardware, and a NOR flash makes
   sense,

* or you are trying to optimize paravirtualized guests. In that case
   why insist on emulated NOR devices? Why not have EDK2 directly
   use a paravirtualized block driver which we can optimize / tune
   without interfering with emulated models?

Insisting on optimizing guests using the QEMU pflash device
seems wrong to me. I'm pretty sure we can do better at optimizing cloud
payloads.

Re: [PATCH v2] pflash: Only read non-zero parts of backend image

Posted by Gerd Hoffmann 3 years, 1 month ago

On Tue, Dec 20, 2022 at 10:30:43AM +0100, Philippe Mathieu-Daudé wrote:
> [Extending to people using UEFI VARStore on Virt machines]
> 
> On 20/12/22 09:42, Gerd Hoffmann wrote:
> > From: Xiang Zheng <zhengxiang9@huawei.com>
> > 
> > Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
> > when using persistent UEFI variables on virt board. Actually we only use
> > a very small(non-zero) part of the memory while the rest significant
> > large(zero) part of memory is wasted.
> > 
> > So this patch checks the block status and only writes the non-zero part
> > into memory. This requires pflash devices to use sparse files for
> > backends.
> 
> I like the idea, but I'm not sure how it relates to NOR flash devices.
> 
> From the block layer, we get BDRV_BLOCK_ZERO when a block is entirely
> filled with zeroes ('\0').
> 
> We don't want to waste host memory, I get it.
> 
> Now what does the guest "see"? Is the UEFI VARStore filled with zeroes?

The varstore is filled with 0xff.  It's 768k in size.  The padding
following (63M plus a bit) is 0x00.  To be exact:

kraxel@sirius ~# hex /usr/share/edk2/aarch64/vars-template-pflash.raw 
00000000  00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  ................
00000010  8d 2b f1 ff  96 76 8b 4c  a9 85 27 47  07 5b 4f 50  .+...v.L..'G.[OP
00000020  00 00 0c 00  00 00 00 00  5f 46 56 48  ff fe 04 00  ........_FVH....
00000030  48 00 28 09  00 00 00 02  03 00 00 00  00 00 04 00  H.(.............
00000040  00 00 00 00  00 00 00 00  78 2c f3 aa  7b 94 9a 43  ........x,..{..C
00000050  a1 80 2e 14  4e c3 77 92  b8 ff 03 00  5a fe 00 00  ....N.w.....Z...
00000060  00 00 00 00  ff ff ff ff  ff ff ff ff  ff ff ff ff  ................
00000070  ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff  ................
*
00040000  2b 29 58 9e  68 7c 7d 49  a0 ce 65 00  fd 9f 1b 95  +)X.h|}I..e.....
00040010  5b e7 c6 86  fe ff ff ff  e0 ff 03 00  00 00 00 00  [...............
00040020  ff ff ff ff  ff ff ff ff  ff ff ff ff  ff ff ff ff  ................
*
000c0000  00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  ................
*

> If so, is it an EDK2-specific case for all virt machines?  This would
> be a virtualization optimization and in that case, this patch would
> work.

vars-template-pflash.raw (padded image) is simply QEMU_VARS.fd (unpadded
image) with 'truncate --size 64M' applied.

Yes, that's a pure virtual machine thing.  On physical hardware you
would probably just flash the first 768k and leave the remaining flash
capacity untouched.
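
The sizes above can be checked with a little arithmetic. This is only a
sketch: the macro and function names are made up for illustration, and
the 768 KiB and 64 MiB figures are the ones quoted in this mail. It
confirms that the zero padding starts at 0xc0000, matching the hex dump,
and that the padding is 63 MiB plus 256 KiB.

```c
#include <assert.h>
#include <stdint.h>

#define KiB UINT64_C(1024)
#define MiB (1024 * KiB)

#define VARSTORE_SIZE (768 * KiB)   /* unpadded QEMU_VARS.fd, per this mail */
#define FLASH_SIZE    (64 * MiB)    /* size after 'truncate --size 64M' */

/* Number of padding bytes between the end of the used area and flash end. */
static uint64_t padding_bytes(uint64_t flash_size, uint64_t used)
{
    assert(used <= flash_size);
    return flash_size - used;
}
```

Because truncate only extends the file length, that padding is a hole on
filesystems that support sparse files, which is what lets the block layer
report it as zero without reading it.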

> * or you are trying to optimize paravirtualized guests.

This.  Ideally without turning everything upside down.

>   In that case why insist on emulated NOR devices? Why not have EDK2
>   directly use a paravirtualized block driver which we can optimize /
>   tune without interfering with emulated models?

While that would probably work for the variable store (I think we could
very well do with the variable store not being mapped and requiring
explicit read/write requests), that idea is not going to work very well
for the firmware code, which must be mapped into the address space.
pflash is almost the only device we have which serves that need.  The
only other option I can see would be a ROM (the code is usually mapped
r/o anyway), but that has pretty much the same problem space.  We would
likewise want a big enough fixed-size ROM, to avoid live migration
problems and all that, and we want the unused space not to waste memory.

> Insisting on optimizing guests using the QEMU pflash device
> seems wrong to me. I'm pretty sure we can do better at optimizing cloud
> payloads.

Moving away from pflash for EFI variable storage would cause a lot of
churn through the whole stack.  Firmware, QEMU, libvirt, upper
management: all affected.  Is that worth the trouble?  Using pflash
isn't that much of a problem IMHO.

Re: [PATCH v2] pflash: Only read non-zero parts of backend image

Posted by Ard Biesheuvel 3 years, 1 month ago

On Tue, 20 Dec 2022 at 16:33, Gerd Hoffmann <kraxel@redhat.com> wrote:
> Moving away from pflash for EFI variable storage would cause a lot of
> churn through the whole stack.  Firmware, QEMU, libvirt, upper
> management: all affected.  Is that worth the trouble?  Using pflash
> isn't that much of a problem IMHO.
>

Agreed. pflash is a bit clunky but not a huge problem atm (although
setting up and tearing down the r/o memslot for every read resp. write
results in some performance issues under kvm/arm64)

*If* we decide to replace it, I would suggest an emulated ROM for the
executable image (without any emulated programming facility
whatsoever) and a paravirtualized get/setvariable interface which can
be used in a sane way to virtualize secure boot without having to
emulate SMM or other secure world firmware interfaces.

Re: [PATCH v2] pflash: Only read non-zero parts of backend image

Posted by Gerd Hoffmann 3 years, 1 month ago

  Hi,

> > Moving away from pflash for EFI variable storage would cause a lot of
> > churn through the whole stack.  Firmware, QEMU, libvirt, upper
> > management: all affected.  Is that worth the trouble?  Using pflash
> > isn't that much of a problem IMHO.
> 
> Agreed. pflash is a bit clunky but not a huge problem atm (although
> setting up and tearing down the r/o memslot for every read resp. write
> results in some performance issues under kvm/arm64)
> 
> *If* we decide to replace it, I would suggest an emulated ROM for the
> executable image (without any emulated programming facility
> whatsoever)

Sure.

> and a paravirtualized get/setvariable interface which can
> be used in a sane way to virtualize secure boot without having to
> emulate SMM or other secure world firmware interfaces.

Any suggestions on how best to do that?  The only option I can see is
moving the variable policy processing to the host, so any variable update
requests are checked even in case the guest OS bypasses the firmware
(which it can easily do when we don't have SMM mode to restrict access
to the paravirtual EFI variable service device).

take care,
  Gerd