[Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation

Eric Blake posted 5 patches 7 years, 5 months ago
There is a newer version of this series
Our code was already checking that we did not attempt to
allocate more clusters than would fit in an int64_t (the
physical maximum if we can access a full off_t's worth of
data).  But this does not catch smaller limits enforced by
various spots in the qcow2 image description: L1 and normal
clusters of L2 are documented as having bits 63-56 reserved
for other purposes, capping our maximum offset at 64PB (bit
55 is the maximum bit set).  And for compressed images with
2M clusters, the cap drops the maximum offset to bit 48, or
a maximum offset of 512TB.  If we overflow that offset, we
would write compressed data into one place, but try to
decompress from another, which won't work.

I don't have 512TB handy to prove whether things break if we
compress so much data that we overflow that limit, and don't
think that iotests can (quickly) test it either.  Test 138
comes close (it corrupts an image into thinking something lives
at 32PB, which is half the maximum for L1 sizing - although
it relies on 512-byte clusters).  But that test points out
that we will generally hit other limits first (such as running
out of memory for the refcount table, or exceeding file system
limits like 16TB on ext4), so this is more a theoretical
safety valve than something likely to be hit.
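The field widths described above can be sketched numerically. The following is an illustrative sketch only; the `csize_shift` derivation mirrors how the qcow2 driver splits a compressed cluster descriptor and is an assumption restated from the spec, not code quoted from this patch:

```python
# Maximum host offsets implied by qcow2 field widths (illustrative arithmetic).
# Standard L1/L2 entries reserve bits 63-56, so bit 55 is the highest usable
# offset bit: offsets top out just below 64 PB.
QCOW_MAX_CLUSTER_OFFSET = (1 << 56) - 1

def compressed_offset_limit(cluster_bits):
    """Highest host offset a compressed cluster descriptor can encode.

    The descriptor's low 62 bits are split between an offset field and a
    size field; the size field grows with the cluster size, stealing bits
    from the offset.
    """
    csize_shift = 62 - (cluster_bits - 8)          # bits left for the offset
    cluster_offset_mask = (1 << csize_shift) - 1   # size field uses the rest
    return min(cluster_offset_mask, QCOW_MAX_CLUSTER_OFFSET)

# 2M clusters (cluster_bits = 21): 49 offset bits, so bit 48 is the highest
# set bit, i.e. a 512 TB cap.
# 16k clusters (cluster_bits = 14) and smaller: the mask reaches or exceeds
# 2**56 - 1, so the 64 PB cap from the reserved bits dominates.
```

This matches the numbers in the commit message: the cap stays at 64 PB up through 16k clusters, then ramps down to 512 TB at 2M clusters.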

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v3: use s->cluster_offset_mask instead of open-coding it [Berto],
use MIN() to clamp offset on small cluster size, add spec patch
first to justify clamping even on refcount allocations
---
 block/qcow2.h          |  6 ++++++
 block/qcow2-refcount.c | 21 ++++++++++++++-------
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 1df15a18aa1..a3b9faa9d53 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -41,6 +41,12 @@
 #define QCOW_MAX_CRYPT_CLUSTERS 32
 #define QCOW_MAX_SNAPSHOTS 65536

+/* Field widths in qcow2 mean normal cluster offsets cannot reach
+ * 64PB; depending on cluster size, compressed clusters can have a
+ * smaller limit (64PB for up to 16k clusters, then ramps down to
+ * 512TB for 2M clusters).  */
+#define QCOW_MAX_CLUSTER_OFFSET ((1ULL << 56) - 1)
+
 /* 8 MB refcount table is enough for 2 PB images at 64k cluster size
  * (128 GB for 512 byte clusters, 2 EB for 2 MB clusters) */
 #define QCOW_MAX_REFTABLE_SIZE 0x800000
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 6bfc11bb48f..fcbea3c9644 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -31,7 +31,8 @@
 #include "qemu/bswap.h"
 #include "qemu/cutils.h"

-static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
+static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
+                                    uint64_t max);
 static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                             int64_t offset, int64_t length, uint64_t addend,
                             bool decrease, enum qcow2_discard_type type);
@@ -362,7 +363,8 @@ static int alloc_refcount_block(BlockDriverState *bs,
     }

     /* Allocate the refcount block itself and mark it as used */
-    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size);
+    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size,
+                                             QCOW_MAX_CLUSTER_OFFSET);
     if (new_block < 0) {
         return new_block;
     }
@@ -954,7 +956,8 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,


 /* return < 0 if error */
-static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size)
+static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
+                                    uint64_t max)
 {
     BDRVQcow2State *s = bs->opaque;
     uint64_t i, nb_clusters, refcount;
@@ -979,9 +982,9 @@ retry:
     }

     /* Make sure that all offsets in the "allocated" range are representable
-     * in an int64_t */
+     * in the requested max */
     if (s->free_cluster_index > 0 &&
-        s->free_cluster_index - 1 > (INT64_MAX >> s->cluster_bits))
+        s->free_cluster_index - 1 > (max >> s->cluster_bits))
     {
         return -EFBIG;
     }
@@ -1001,7 +1004,7 @@ int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size)

     BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC);
     do {
-        offset = alloc_clusters_noref(bs, size);
+        offset = alloc_clusters_noref(bs, size, QCOW_MAX_CLUSTER_OFFSET);
         if (offset < 0) {
             return offset;
         }
@@ -1083,7 +1086,11 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
     free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
     do {
         if (!offset || free_in_cluster < size) {
-            int64_t new_cluster = alloc_clusters_noref(bs, s->cluster_size);
+            int64_t new_cluster;
+
+            new_cluster = alloc_clusters_noref(bs, s->cluster_size,
+                                               MIN(s->cluster_offset_mask,
+                                                   QCOW_MAX_CLUSTER_OFFSET));
             if (new_cluster < 0) {
                 return new_cluster;
             }
-- 
2.14.3


Re: [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation
Posted by Max Reitz 7 years, 5 months ago
On 2018-04-24 00:33, Eric Blake wrote:
> Our code was already checking that we did not attempt to
> allocate more clusters than what would fit in an INT64 (the
> physical maximimum if we can access a full off_t's worth of
> data).  But this does not catch smaller limits enforced by
> various spots in the qcow2 image description: L1 and normal
> clusters of L2 are documented as having bits 63-56 reserved
> for other purposes, capping our maximum offset at 64PB (bit
> 55 is the maximum bit set).  And for compressed images with
> 2M clusters, the cap drops the maximum offset to bit 48, or
> a maximum offset of 512TB.  If we overflow that offset, we
> would write compressed data into one place, but try to
> decompress from another, which won't work.
> 
> I don't have 512TB handy to prove whether things break if we
> compress so much data that we overflow that limit, and don't
> think that iotests can (quickly) test it either.  Test 138
> comes close (it corrupts an image into thinking something lives
> at 32PB, which is half the maximum for L1 sizing - although
> it relies on 512-byte clusters).  But that test points out
> that we will generally hit other limits first (such as running
> out of memory for the refcount table, or exceeding file system
> limits like 16TB on ext4, etc), so this is more a theoretical
> safety valve than something likely to be hit.

You don't need 512 TB, though; 36 MB is sufficient.

Here's what you do:
(1) Create a 513 TB image with cluster_size=2M,refcount_bits=1
(2) Take a hex editor and enter 16 refblocks into the reftable
(3) Fill all of those refblocks with 1s
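A quick sanity check of why 16 refblocks are the right count for this recipe (my own arithmetic under the stated parameters, not taken from the mail):

```python
# With 2M clusters and refcount_bits=1, each refblock is one cluster
# holding one refcount bit per cluster of the image.
cluster_size = 2 * 1024 ** 2                                # 2 MiB
entries_per_refblock = cluster_size * 8                     # 16 Mi one-bit refcounts
covered_per_refblock = entries_per_refblock * cluster_size  # 32 TiB of image each

# 16 all-ones refblocks mark exactly the first 512 TiB as allocated,
# which is precisely the compressed-cluster offset limit (bit 48), so the
# next free cluster any compressed write would allocate already lies past
# the limit.
covered_by_16 = 16 * covered_per_refblock
assert covered_by_16 == 1 << 49                             # 512 TiB
```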

(Funny side note: qemu-img check thinks that image is clean because it
doesn't check refcounts beyond the image end...)

I've attached a compressed test image (unsurprisingly, it compresses
really well).

Before this series:
$ ./qemu-io -c 'write -c 0 2M' test.qcow2
qcow2: Marking image as corrupt: Preventing invalid write on metadata
(overlaps with refcount block); further corruption events will be suppressed
write failed: Input/output error

Aw.

After this series:
$ ./qemu-io -c 'write -c 0 2M' test.qcow2
write failed: Input/output error

(Normal writes just work fine.)


Maybe you want to add a test still -- creating the image is rather quick
(well, you have to write 64 MB of 1s, but other than that).  The only
thing that takes a bit of time is qemu figuring out where the first free
cluster is...  That takes like 15 seconds here.

And another issue of course is...

$ ls -lhs test.qcow2
42M -rw-r--r--. 1 maxx maxx 513T 25. Apr 16:42 test.qcow2

Yeah, that.  Depends on the host file system, of course, whether that is
a real issue. O:-)

Max

> Signed-off-by: Eric Blake <eblake@redhat.com>
> Reviewed-by: Alberto Garcia <berto@igalia.com>
> 
> ---
> v3: use s->cluster_offset_mask instead of open-coding it [Berto],
> use MIN() to clamp offset on small cluster size, add spec patch
> first to justify clamping even on refcount allocations
> ---
>  block/qcow2.h          |  6 ++++++
>  block/qcow2-refcount.c | 21 ++++++++++++++-------
>  2 files changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 1df15a18aa1..a3b9faa9d53 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -41,6 +41,12 @@
>  #define QCOW_MAX_CRYPT_CLUSTERS 32
>  #define QCOW_MAX_SNAPSHOTS 65536
> 
> +/* Field widths in qcow2 mean normal cluster offsets cannot reach
> + * 64PB; depending on cluster size, compressed clusters can have a
> + * smaller limit (64PB for up to 16k clusters, then ramps down to
> + * 512TB for 2M clusters).  */
> +#define QCOW_MAX_CLUSTER_OFFSET ((1ULL << 56) - 1)
> +
>  /* 8 MB refcount table is enough for 2 PB images at 64k cluster size
>   * (128 GB for 512 byte clusters, 2 EB for 2 MB clusters) */
>  #define QCOW_MAX_REFTABLE_SIZE 0x800000
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 6bfc11bb48f..fcbea3c9644 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -31,7 +31,8 @@
>  #include "qemu/bswap.h"
>  #include "qemu/cutils.h"
> 
> -static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
> +static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
> +                                    uint64_t max);
>  static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>                              int64_t offset, int64_t length, uint64_t addend,
>                              bool decrease, enum qcow2_discard_type type);
> @@ -362,7 +363,8 @@ static int alloc_refcount_block(BlockDriverState *bs,
>      }
> 
>      /* Allocate the refcount block itself and mark it as used */
> -    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size);
> +    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size,
> +                                             QCOW_MAX_CLUSTER_OFFSET);
>      if (new_block < 0) {
>          return new_block;
>      }
> @@ -954,7 +956,8 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,
> 
> 
>  /* return < 0 if error */
> -static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size)
> +static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
> +                                    uint64_t max)
>  {
>      BDRVQcow2State *s = bs->opaque;
>      uint64_t i, nb_clusters, refcount;
> @@ -979,9 +982,9 @@ retry:
>      }
> 
>      /* Make sure that all offsets in the "allocated" range are representable
> -     * in an int64_t */
> +     * in the requested max */
>      if (s->free_cluster_index > 0 &&
> -        s->free_cluster_index - 1 > (INT64_MAX >> s->cluster_bits))
> +        s->free_cluster_index - 1 > (max >> s->cluster_bits))
>      {
>          return -EFBIG;
>      }
> @@ -1001,7 +1004,7 @@ int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size)
> 
>      BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC);
>      do {
> -        offset = alloc_clusters_noref(bs, size);
> +        offset = alloc_clusters_noref(bs, size, QCOW_MAX_CLUSTER_OFFSET);
>          if (offset < 0) {
>              return offset;
>          }
> @@ -1083,7 +1086,11 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
>      free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
>      do {
>          if (!offset || free_in_cluster < size) {
> -            int64_t new_cluster = alloc_clusters_noref(bs, s->cluster_size);
> +            int64_t new_cluster;
> +
> +            new_cluster = alloc_clusters_noref(bs, s->cluster_size,
> +                                               MIN(s->cluster_offset_mask,
> +                                                   QCOW_MAX_CLUSTER_OFFSET));
>              if (new_cluster < 0) {
>                  return new_cluster;
>              }
> 

Re: [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation
Posted by Eric Blake 7 years, 5 months ago
On 04/25/2018 09:44 AM, Max Reitz wrote:
> On 2018-04-24 00:33, Eric Blake wrote:
>> Our code was already checking that we did not attempt to
>> allocate more clusters than what would fit in an INT64 (the
>> physical maximimum if we can access a full off_t's worth of

s/maximimum/maximum/

>> data).  But this does not catch smaller limits enforced by
>> various spots in the qcow2 image description: L1 and normal
>> clusters of L2 are documented as having bits 63-56 reserved
>> for other purposes, capping our maximum offset at 64PB (bit
>> 55 is the maximum bit set).  And for compressed images with
>> 2M clusters, the cap drops the maximum offset to bit 48, or
>> a maximum offset of 512TB.  If we overflow that offset, we
>> would write compressed data into one place, but try to
>> decompress from another, which won't work.
>>
>> I don't have 512TB handy to prove whether things break if we
>> compress so much data that we overflow that limit, and don't
>> think that iotests can (quickly) test it either.  Test 138
>> comes close (it corrupts an image into thinking something lives
>> at 32PB, which is half the maximum for L1 sizing - although
>> it relies on 512-byte clusters).  But that test points out
>> that we will generally hit other limits first (such as running
>> out of memory for the refcount table, or exceeding file system
>> limits like 16TB on ext4, etc), so this is more a theoretical
>> safety valve than something likely to be hit.
> 
> You don't need 512 TB, though, 36 MB is sufficient.

Cool.  I'll have to attempt that as a followup patch.

> 
> Here's what you do:
> (1) Create a 513 TB image with cluster_size=2M,refcount_bits=1
> (2) Take a hex editor and enter 16 refblocks into the reftable
> (3) Fill all of those refblocks with 1s

That's a lot of leaked clusters ;)

> 
> (Funny side note: qemu-img check thinks that image is clean because it
> doesn't check refcounts beyond the image end...)

Eww - yet another bug to fix...

> 
> I've attached a compressed test image (unsurprisingly, it compresses
> really well).
> 
> Before this series:
> $ ./qemu-io -c 'write -c 0 2M' test.qcow2
> qcow2: Marking image as corrupt: Preventing invalid write on metadata
> (overlaps with refcount block); further corruption events will be suppressed
> write failed: Input/output error
> 
> Aw.
> 
> After this series:
> $ ./qemu-io -c 'write -c 0 2M' test.qcow2
> write failed: Input/output error
> 
> (Normal writes just work fine.)
> 
> 
> Maybe you want to add a test still -- creating the image is rather quick
> (well, you have to write 64 MB of 1s, but other than that).  The only
> thing that takes a bit of time is qemu figuring out where the first free
> cluster is...  That takes like 15 seconds here.

Then the test doesn't belong in '-g quick'.

> 
> And another issue of course is...
> 
> $ ls -lhs test.qcow2
> 42M -rw-r--r--. 1 maxx maxx 513T 25. Apr 16:42 test.qcow2
> 
> Yeah, that.  Depends on the host file system, of course, whether that is
> a real issue. O:-)

As long as iotests can gracefully skip if qemu-img fails to create the
image, then the test should still run on all remaining filesystems that
support sparse files that large.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation
Posted by Eric Blake 7 years, 5 months ago
On 04/25/2018 09:44 AM, Max Reitz wrote:

> Here's what you do:
> (1) Create a 513 TB image with cluster_size=2M,refcount_bits=1
> (2) Take a hex editor and enter 16 refblocks into the reftable
> (3) Fill all of those refblocks with 1s
> 
> (Funny side note: qemu-img check thinks that image is clean because it
> doesn't check refcounts beyond the image end...)
> 
> I've attached a compressed test image (unsurprisingly, it compresses
> really well).
> 
> Before this series:
> $ ./qemu-io -c 'write -c 0 2M' test.qcow2
> qcow2: Marking image as corrupt: Preventing invalid write on metadata
> (overlaps with refcount block); further corruption events will be suppressed
> write failed: Input/output error
> 
> Aw.
> 
> After this series:
> $ ./qemu-io -c 'write -c 0 2M' test.qcow2
> write failed: Input/output error

Hmm, I wonder if ENOSPC would be a better failure than EIO when the reason
we can't write a compressed cluster is that the image has grown too large.

> 
> (Normal writes just work fine.)

I also wonder if there should be a knob to control whether attempts to
write a compressed cluster should fail vs. silently fall back to a
normal write when encountering this too-large situation (that is, a
trade-off between keeping guest writes working no matter what, even if
we lose out on compression savings, vs. learning as soon as possible
when compression is no longer possible).  Then again, other than
carefully-crafted super-sparse files, it's much harder to argue that
you'll hit this in real usage scenarios, so it may not be worth the effort.

> 
> 
> Maybe you want to add a test still -- creating the image is rather quick
> (well, you have to write 64 MB of 1s, but other than that).  The only
> thing that takes a bit of time is qemu figuring out where the first free
> cluster is...  That takes like 15 seconds here.
> 
> And another issue of course is...
> 
> $ ls -lhs test.qcow2
> 42M -rw-r--r--. 1 maxx maxx 513T 25. Apr 16:42 test.qcow2
> 
> Yeah, that.  Depends on the host file system, of course, whether that is
> a real issue. O:-)

Well, I'm writing the iotest now, and thus will post a v6 of this series.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org