From: Sam Eiderman
To: fam@euphon.net, kwolf@redhat.com, mreitz@redhat.com, qemu-block@nongnu.org
Cc: eyal.moscovici@oracle.com, arbel.moshe@oracle.com, qemu-devel@nongnu.org,
    shmuel.eiderman@oracle.com, liran.alon@oracle.com, karl.heubaum@oracle.com
Date: Wed, 24 Apr 2019 10:49:00 +0300
Message-Id: <20190424074901.31430-2-shmuel.eiderman@oracle.com>
In-Reply-To: <20190424074901.31430-1-shmuel.eiderman@oracle.com>
References: <20190424074901.31430-1-shmuel.eiderman@oracle.com>
Subject: [Qemu-devel] [PATCH 1/2] vmdk: Fix comment regarding max l1_size coverage

Commit b0651b8c246d ("vmdk: Move l1_size
check into vmdk_add_extent") extended the l1_size check from VMDK4 to
VMDK3 but did not update the default coverage in the moved comment.

The previous vmdk4 calculation:

    (512 * 1024 * 1024) * 512(l2 entries) * 65536(grain) = 16PB

The added vmdk3 calculation:

    (512 * 1024 * 1024) * 4096(l2 entries) * 512(grain) = 1PB

Add the vmdk3 calculation to the comment.

In any case, VMware does not offer virtual disks larger than 2TB for
vmdk4/vmdk3, or larger than 64TB for the new undocumented seSparse
format, which is not yet implemented in QEMU.

Reviewed-by: Karl Heubaum
Reviewed-by: Eyal Moscovici
Reviewed-by: Liran Alon
Reviewed-by: Arbel Moshe
Signed-off-by: Sam Eiderman
Reviewed-by: yuchenlin
---
 block/vmdk.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index de8cb859f8..fc7378da78 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -426,10 +426,15 @@ static int vmdk_add_extent(BlockDriverState *bs,
         return -EFBIG;
     }
     if (l1_size > 512 * 1024 * 1024) {
-        /* Although with big capacity and small l1_entry_sectors, we can get a
+        /*
+         * Although with big capacity and small l1_entry_sectors, we can get a
          * big l1_size, we don't want unbounded value to allocate the table.
-         * Limit it to 512M, which is 16PB for default cluster and L2 table
-         * size */
+         * Limit it to 512M, which is:
+         *   16PB - for default "Hosted Sparse Extent" (VMDK4)
+         *          cluster size: 64KB, L2 table size: 512 entries
+         *   1PB  - for default "ESXi Host Sparse Extent" (VMDK3/vmfsSparse)
+         *          cluster size: 512B, L2 table size: 4096 entries
+         */
         error_setg(errp, "L1 size too big");
         return -EFBIG;
     }
-- 
2.13.3
From: Sam Eiderman
To: fam@euphon.net, kwolf@redhat.com, mreitz@redhat.com, qemu-block@nongnu.org
Date: Wed, 24 Apr 2019 10:49:01 +0300
Message-Id: <20190424074901.31430-3-shmuel.eiderman@oracle.com>
In-Reply-To: <20190424074901.31430-1-shmuel.eiderman@oracle.com>
References: <20190424074901.31430-1-shmuel.eiderman@oracle.com>
Subject: [Qemu-devel] [PATCH 2/2] vmdk: Add read-only support for seSparse snapshots
Cc: eyal.moscovici@oracle.com, arbel.moshe@oracle.com, qemu-devel@nongnu.org,
    shmuel.eiderman@oracle.com, liran.alon@oracle.com, karl.heubaum@oracle.com

Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3
in QEMU).

This format was lacking in the following:

  * Grain directory (L1) and grain table (L2) entries were 32-bit,
    allowing access to only 2TB (slightly less) of data.
  * The grain size (default) was 512 bytes - leading to data
    fragmentation and many grain tables.
  * For space reclamation purposes, it was necessary to find all the
    grains which are not pointed to by any grain table - so a reverse
    mapping of "offset of grain in vmdk" to "grain table" must be
    constructed - which takes large amounts of CPU/RAM.

The format specification can be found in VMware's documentation:
https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf

In ESXi 6.5, to support snapshot files larger than 2TB, a new format
was introduced: SESparse (Space Efficient).

This format fixes the above issues:

  * All entries are now 64-bit.
  * The grain size (default) is 4KB.
  * Grain directory and grain tables are now located at the beginning
    of the file.
      + seSparse format reserves space for all grain tables.
      + Grain tables can be addressed using an index.
      + Grains are located in the end of the file and can also be
        addressed with an index.
      - seSparse vmdks of large disks (64TB) have huge preallocated
        headers - mainly due to L2 tables, even for empty snapshots.
  * The header contains a reverse mapping ("backmap") of "offset of
    grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
    specifies for each grain - whether it is allocated or not.
    Using these data structures we can implement space reclamation
    efficiently.
  * Due to the fact that the header now maintains two mappings:
      * The regular one (grain directory & grain tables)
      * A reverse one (backmap and free bitmap)
    these data structures can lose consistency upon crash and result
    in a corrupted VMDK.
    Therefore, a journal is also added to the VMDK and is replayed
    when VMware reopens the file after a crash.

Since ESXi 6.7 - SESparse is the only snapshot format available.

Unfortunately, VMware does not provide documentation regarding the new
seSparse format.

This commit is based on black-box research of the seSparse format.
Various in-guest block operations and their effect on the snapshot
file were tested.

The only VMware-provided source of information (regarding the
underlying implementation) was a log file on the ESXi host:

    /var/log/hostd.log

Whenever an seSparse snapshot is created - the log is populated with
seSparse records.

Relevant log records are of the form:

    [...] Const Header:
    [...]  constMagic     = 0xcafebabe
    [...]  version        = 2.1
    [...]  capacity       = 204800
    [...]  grainSize      = 8
    [...]  grainTableSize = 64
    [...]  flags          = 0
    [...] Extents:
    [...]  Header         : <1 : 1>
    [...]  JournalHdr     : <2 : 2>
    [...]  Journal        : <2048 : 2048>
    [...]  GrainDirectory : <4096 : 2048>
    [...]  GrainTables    : <6144 : 2048>
    [...]  FreeBitmap     : <8192 : 2048>
    [...]  BackMap        : <10240 : 2048>
    [...]  Grain          : <12288 : 204800>
    [...] Volatile Header:
    [...]  volatileMagic    = 0xcafecafe
    [...]  FreeGTNumber     = 0
    [...]  nextTxnSeqNumber = 0
    [...]  replayJournal    = 0

The sizes that are seen in the log file are in sectors.
Extents are of the following format:

This commit is a strict implementation which enforces:

  * magics
  * version number 2.1
  * grain size of 8 sectors (4KB)
  * grain table size of 64 sectors
  * zero flags
  * extent locations

Additionally, this commit provides only a subset of the functionality
offered by seSparse's format:

  * Read-only
  * No journal replay
  * No space reclamation
  * No unmap support

Hence, journal header, journal, free bitmap and backmap extents are
unused; only the "classic" (L1 -> L2 -> data) grain access is
implemented.

However, there are several differences in the grain access itself.

Grain directory (L1):

  * Grain directory entries are indexes (not offsets) to grain tables.
  * Valid grain directory entries have their highest nibble set to 0x1.
  * Since grain tables are always located at the beginning of the
    file, the index can fit into 32 bits, so we can use its low part
    if it's valid.

Grain table (L2):

  * Grain table entries are indexes (not offsets) to grains.
  * If the highest nibble of the entry is:
      0x0: The grain is not allocated.
           The rest of the bytes are 0.
      0x1: The grain is unmapped - the guest sees a zero grain.
           The rest of the bits point to the previously mapped grain,
           see the 0x3 case.
      0x2: The grain is zero.
      0x3: The grain is allocated - to get the index calculate:
           ((entry & 0x0fff000000000000) >> 48) |
           ((entry & 0x0000ffffffffffff) << 12)
  * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
    grain which results from the guest using sg_unmap to unmap the
    grain - but the grain itself still exists in the grain extent - a
    space reclamation procedure should delete it.
    Unmapping a zero grain has no effect (0x2 will not change to 0x1),
    but unmapping an unallocated grain will (0x0 to 0x1) - naturally.

In order to implement seSparse some fields had to be changed to
support both 32-bit and 64-bit entry sizes.
Read-only support is implemented by failing on pwrite, since qemu-img
opens the vmdk with read-write flags even for "convert", which is a
read-only operation.

Reviewed-by: Karl Heubaum
Reviewed-by: Eyal Moscovici
Reviewed-by: Arbel Moshe
Signed-off-by: Sam Eiderman
---
 block/vmdk.c | 475 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 459 insertions(+), 16 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index fc7378da78..e599c08b95 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -91,6 +91,44 @@ typedef struct {
     uint16_t compressAlgorithm;
 } QEMU_PACKED VMDK4Header;
 
+typedef struct VMDKSESparseConstHeader {
+    uint64_t magic;
+    uint64_t version;
+    uint64_t capacity;
+    uint64_t grain_size;
+    uint64_t grain_table_size;
+    uint64_t flags;
+    uint64_t reserved1;
+    uint64_t reserved2;
+    uint64_t reserved3;
+    uint64_t reserved4;
+    uint64_t volatile_header_offset;
+    uint64_t volatile_header_size;
+    uint64_t journal_header_offset;
+    uint64_t journal_header_size;
+    uint64_t journal_offset;
+    uint64_t journal_size;
+    uint64_t grain_dir_offset;
+    uint64_t grain_dir_size;
+    uint64_t grain_tables_offset;
+    uint64_t grain_tables_size;
+    uint64_t free_bitmap_offset;
+    uint64_t free_bitmap_size;
+    uint64_t backmap_offset;
+    uint64_t backmap_size;
+    uint64_t grains_offset;
+    uint64_t grains_size;
+    uint8_t pad[304];
+} QEMU_PACKED VMDKSESparseConstHeader;
+
+typedef struct VMDKSESparseVolatileHeader {
+    uint64_t magic;
+    uint64_t free_gt_number;
+    uint64_t next_txn_seq_number;
+    uint64_t replay_journal;
+    uint8_t pad[480];
+} QEMU_PACKED VMDKSESparseVolatileHeader;
+
 #define L2_CACHE_SIZE 16
 
 typedef struct VmdkExtent {
@@ -99,19 +137,23 @@ typedef struct VmdkExtent {
     bool compressed;
     bool has_marker;
     bool has_zero_grain;
+    bool sesparse;
+    uint64_t sesparse_l2_tables_offset;
+    uint64_t sesparse_clusters_offset;
+    int32_t entry_size;
     int version;
     int64_t sectors;
     int64_t end_sector;
     int64_t flat_start_offset;
     int64_t l1_table_offset;
     int64_t l1_backup_table_offset;
-    uint32_t *l1_table;
+    void *l1_table;
     uint32_t *l1_backup_table;
     unsigned int l1_size;
     uint32_t l1_entry_sectors;
 
     unsigned int l2_size;
-    uint32_t *l2_cache;
+    void *l2_cache;
     uint32_t l2_cache_offsets[L2_CACHE_SIZE];
     uint32_t l2_cache_counts[L2_CACHE_SIZE];
 
@@ -434,6 +476,8 @@ static int vmdk_add_extent(BlockDriverState *bs,
      *          cluster size: 64KB, L2 table size: 512 entries
      *   1PB  - for default "ESXi Host Sparse Extent" (VMDK3/vmfsSparse)
      *          cluster size: 512B, L2 table size: 4096 entries
+     *   8PB  - for default "ESXi seSparse Extent"
+     *          cluster size: 4KB, L2 table size: 4096 entries
      */
         error_setg(errp, "L1 size too big");
         return -EFBIG;
@@ -459,6 +503,7 @@ static int vmdk_add_extent(BlockDriverState *bs,
     extent->l2_size = l2_size;
     extent->cluster_sectors = flat ? sectors : cluster_sectors;
     extent->next_cluster_sector = ROUND_UP(nb_sectors, cluster_sectors);
+    extent->entry_size = sizeof(uint32_t);
 
     if (s->num_extents > 1) {
         extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
@@ -480,7 +525,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
     int i;
 
     /* read the L1 table */
-    l1_size = extent->l1_size * sizeof(uint32_t);
+    l1_size = extent->l1_size * extent->entry_size;
     extent->l1_table = g_try_malloc(l1_size);
     if (l1_size && extent->l1_table == NULL) {
         return -ENOMEM;
@@ -498,10 +543,15 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
         goto fail_l1;
     }
     for (i = 0; i < extent->l1_size; i++) {
-        le32_to_cpus(&extent->l1_table[i]);
+        if (extent->sesparse) {
+            le64_to_cpus((uint64_t *)extent->l1_table + i);
+        } else {
+            le32_to_cpus((uint32_t *)extent->l1_table + i);
+        }
     }
 
     if (extent->l1_backup_table_offset) {
+        assert(!extent->sesparse);
         extent->l1_backup_table = g_try_malloc(l1_size);
         if (l1_size && extent->l1_backup_table == NULL) {
             ret = -ENOMEM;
@@ -524,7 +574,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
     }
 
     extent->l2_cache =
-        g_new(uint32_t, extent->l2_size * L2_CACHE_SIZE);
+        g_malloc(extent->entry_size * extent->l2_size * L2_CACHE_SIZE);
     return 0;
 fail_l1b:
     g_free(extent->l1_backup_table);
@@ -570,6 +620,331 @@ static int vmdk_open_vmfs_sparse(BlockDriverState *bs,
     return ret;
 }
 
+#define SESPARSE_CONST_HEADER_MAGIC 0x00000000cafebabe
+#define SESPARSE_VOLATILE_HEADER_MAGIC 0x00000000cafecafe
+
+static const char zero_pad[BDRV_SECTOR_SIZE];
+
+/* Strict checks - format not officially documented */
+static int check_se_sparse_const_header(VMDKSESparseConstHeader *header,
+                                        Error **errp)
+{
+    uint64_t grain_table_coverage;
+    uint64_t needed_grain_tables;
+    uint64_t needed_grain_dir_size;
+    uint64_t needed_grain_tables_size;
+    uint64_t needed_free_bitmap_size;
+
+    header->magic = le64_to_cpu(header->magic);
+    header->version = le64_to_cpu(header->version);
+    header->grain_size = le64_to_cpu(header->grain_size);
+    header->grain_table_size = le64_to_cpu(header->grain_table_size);
+    header->flags = le64_to_cpu(header->flags);
+    header->reserved1 = le64_to_cpu(header->reserved1);
+    header->reserved2 = le64_to_cpu(header->reserved2);
+    header->reserved3 = le64_to_cpu(header->reserved3);
+    header->reserved4 = le64_to_cpu(header->reserved4);
+
+    header->volatile_header_offset =
+        le64_to_cpu(header->volatile_header_offset);
+    header->volatile_header_size = le64_to_cpu(header->volatile_header_size);
+
+    header->journal_header_offset = le64_to_cpu(header->journal_header_offset);
+    header->journal_header_size = le64_to_cpu(header->journal_header_size);
+
+    header->journal_offset = le64_to_cpu(header->journal_offset);
+    header->journal_size = le64_to_cpu(header->journal_size);
+
+    header->grain_dir_offset = le64_to_cpu(header->grain_dir_offset);
+    header->grain_dir_size = le64_to_cpu(header->grain_dir_size);
+
+    header->grain_tables_offset = le64_to_cpu(header->grain_tables_offset);
+    header->grain_tables_size = le64_to_cpu(header->grain_tables_size);
+
+    header->free_bitmap_offset = le64_to_cpu(header->free_bitmap_offset);
+    header->free_bitmap_size = le64_to_cpu(header->free_bitmap_size);
+
+    header->backmap_offset = le64_to_cpu(header->backmap_offset);
+    header->backmap_size = le64_to_cpu(header->backmap_size);
+
+    header->grains_offset = le64_to_cpu(header->grains_offset);
+    header->grains_size = le64_to_cpu(header->grains_size);
+
+    if (header->magic != SESPARSE_CONST_HEADER_MAGIC) {
+        error_setg(errp, "Bad const header magic: 0x%016" PRIx64,
+                   header->magic);
+        return -EINVAL;
+    }
+
+    if (header->version != 0x0000000200000001) {
+        error_setg(errp, "Unsupported version: 0x%016" PRIx64,
+                   header->version);
+        return -ENOTSUP;
+    }
+
+    if (header->grain_size != 8) {
+        error_setg(errp, "Unsupported grain size: %" PRIu64,
+                   header->grain_size);
+        return -ENOTSUP;
+    }
+
+    if (header->grain_table_size != 64) {
+        error_setg(errp, "Unsupported grain table size: %" PRIu64,
+                   header->grain_table_size);
+        return -ENOTSUP;
+    }
+
+    if (header->flags != 0) {
+        error_setg(errp, "Unsupported flags: 0x%016" PRIx64,
+                   header->flags);
+        return -ENOTSUP;
+    }
+
+    if (header->reserved1 != 0 || header->reserved2 != 0 ||
+        header->reserved3 != 0 || header->reserved4 != 0) {
+        error_setg(errp, "Unsupported reserved bits:"
+                   " 0x%016" PRIx64 " 0x%016" PRIx64
+                   " 0x%016" PRIx64 " 0x%016" PRIx64,
+                   header->reserved1, header->reserved2,
+                   header->reserved3, header->reserved4);
+        return -ENOTSUP;
+    }
+
+    if (header->volatile_header_offset != 1) {
+        error_setg(errp, "Unsupported volatile header offset: %" PRIu64,
+                   header->volatile_header_offset);
+        return -ENOTSUP;
+    }
+
+    if (header->volatile_header_size != 1) {
+        error_setg(errp, "Unsupported volatile header size: %" PRIu64,
+                   header->volatile_header_size);
+        return -ENOTSUP;
+    }
+
+    if (header->journal_header_offset != 2) {
+        error_setg(errp, "Unsupported journal header offset: %" PRIu64,
+                   header->journal_header_offset);
+        return -ENOTSUP;
+    }
+
+    if (header->journal_header_size != 2) {
+        error_setg(errp, "Unsupported journal header size: %" PRIu64,
+                   header->journal_header_size);
+        return -ENOTSUP;
+    }
+
+    if (header->journal_offset != 2048) {
+        error_setg(errp, "Unsupported journal offset: %" PRIu64,
+                   header->journal_offset);
+        return -ENOTSUP;
+    }
+
+    if (header->journal_size != 2048) {
+        error_setg(errp, "Unsupported journal size: %" PRIu64,
+                   header->journal_size);
+        return -ENOTSUP;
+    }
+
+    if (header->grain_dir_offset != 4096) {
+        error_setg(errp, "Unsupported grain directory offset: %" PRIu64,
+                   header->grain_dir_offset);
+        return -ENOTSUP;
+    }
+
+    /* in sectors */
+    grain_table_coverage = ((header->grain_table_size *
+                             BDRV_SECTOR_SIZE / sizeof(uint64_t)) *
+                            header->grain_size);
+    needed_grain_tables = header->capacity / grain_table_coverage;
+    needed_grain_dir_size = DIV_ROUND_UP(needed_grain_tables * sizeof(uint64_t),
+                                         BDRV_SECTOR_SIZE);
+    needed_grain_dir_size = ROUND_UP(needed_grain_dir_size, 2048);
+
+    if (header->grain_dir_size != needed_grain_dir_size) {
+        error_setg(errp, "Invalid grain directory size: %" PRIu64
+                   ", needed: %" PRIu64,
+                   header->grain_dir_size, needed_grain_dir_size);
+        return -EINVAL;
+    }
+
+    if (header->grain_tables_offset !=
+        header->grain_dir_offset + header->grain_dir_size) {
+        error_setg(errp, "Grain directory must be followed by grain tables");
+        return -EINVAL;
+    }
+
+    needed_grain_tables_size = needed_grain_tables * header->grain_table_size;
+    needed_grain_tables_size = ROUND_UP(needed_grain_tables_size, 2048);
+
+    if (header->grain_tables_size != needed_grain_tables_size) {
+        error_setg(errp, "Invalid grain tables size: %" PRIu64
+                   ", needed: %" PRIu64,
+                   header->grain_tables_size, needed_grain_tables_size);
+        return -EINVAL;
+    }
+
+    if (header->free_bitmap_offset !=
+        header->grain_tables_offset + header->grain_tables_size) {
+        error_setg(errp, "Grain tables must be followed by free bitmap");
+        return -EINVAL;
+    }
+
+    /* in bits */
+    needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
+                                           header->grain_size);
+    /* in bytes */
+    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
+                                           BITS_PER_BYTE);
+    /* in sectors */
+    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
+                                           BDRV_SECTOR_SIZE);
+    needed_free_bitmap_size = ROUND_UP(needed_free_bitmap_size, 2048);
+
+    if (header->free_bitmap_size != needed_free_bitmap_size) {
+        error_setg(errp, "Invalid free bitmap size: %" PRIu64
+                   ", needed: %" PRIu64,
+                   header->free_bitmap_size, needed_free_bitmap_size);
+        return -EINVAL;
+    }
+
+    if (header->backmap_offset !=
+        header->free_bitmap_offset + header->free_bitmap_size) {
+        error_setg(errp, "Free bitmap must be followed by backmap");
+        return -EINVAL;
+    }
+
+    if (header->backmap_size != header->grain_tables_size) {
+        error_setg(errp, "Invalid backmap size: %" PRIu64
+                   ", needed: %" PRIu64,
+                   header->backmap_size, header->grain_tables_size);
+        return -EINVAL;
+    }
+
+    if (header->grains_offset !=
+        header->backmap_offset + header->backmap_size) {
+        error_setg(errp, "Backmap must be followed by grains");
+        return -EINVAL;
+    }
+
+    if (header->grains_size != header->capacity) {
+        error_setg(errp, "Invalid grains size: %" PRIu64
+                   ", needed: %" PRIu64,
+                   header->grains_size, header->capacity);
+        return -EINVAL;
+    }
+
+    /* check that padding is 0 */
+    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
+        error_setg(errp, "Unsupported non-zero const header padding");
+        return -ENOTSUP;
+    }
+
+    return 0;
+}
+
+static int check_se_sparse_volatile_header(VMDKSESparseVolatileHeader *header,
+                                           Error **errp)
+{
+    header->magic = le64_to_cpu(header->magic);
+    header->free_gt_number = le64_to_cpu(header->free_gt_number);
+    header->next_txn_seq_number = le64_to_cpu(header->next_txn_seq_number);
+    header->replay_journal = le64_to_cpu(header->replay_journal);
+
+    if (header->magic != SESPARSE_VOLATILE_HEADER_MAGIC) {
+        error_setg(errp, "Bad volatile header magic: 0x%016" PRIx64,
+                   header->magic);
+        return -EINVAL;
+    }
+
+    if (header->replay_journal) {
+        error_setg(errp, "Replaying journal not supported");
+        return -ENOTSUP;
+    }
+
+    /* check that padding is 0 */
+    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
+        error_setg(errp, "Unsupported non-zero volatile header padding");
+        return -ENOTSUP;
+    }
+
+    return 0;
+}
+
+static int vmdk_open_se_sparse(BlockDriverState *bs,
+                               BdrvChild *file,
+                               int flags, Error **errp)
+{
+    int ret;
+    VMDKSESparseConstHeader const_header;
+    VMDKSESparseVolatileHeader volatile_header;
+    VmdkExtent *extent;
+
+    assert(sizeof(const_header) == BDRV_SECTOR_SIZE);
+
+    ret = bdrv_pread(file, 0, &const_header, sizeof(const_header));
+    if (ret < 0) {
+        bdrv_refresh_filename(file->bs);
+        error_setg_errno(errp, -ret,
+                         "Could not read const header from file '%s'",
+                         file->bs->filename);
+        return ret;
+    }
+
+    /* check const header */
+    ret = check_se_sparse_const_header(&const_header, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    assert(sizeof(volatile_header) == BDRV_SECTOR_SIZE);
+
+    ret = bdrv_pread(file,
+                     const_header.volatile_header_offset * BDRV_SECTOR_SIZE,
+                     &volatile_header, sizeof(volatile_header));
+    if (ret < 0) {
+        bdrv_refresh_filename(file->bs);
+        error_setg_errno(errp, -ret,
+                         "Could not read volatile header from file '%s'",
+                         file->bs->filename);
+        return ret;
+    }
+
+    /* check volatile header */
+    ret = check_se_sparse_volatile_header(&volatile_header, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = vmdk_add_extent(bs, file, false,
+                          const_header.capacity,
+                          const_header.grain_dir_offset * BDRV_SECTOR_SIZE,
+                          0,
+                          const_header.grain_dir_size *
+                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
+                          const_header.grain_table_size *
+                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
+                          const_header.grain_size,
+                          &extent,
+                          errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    extent->sesparse = true;
+    extent->sesparse_l2_tables_offset = const_header.grain_tables_offset;
+    extent->sesparse_clusters_offset = const_header.grains_offset;
+    extent->entry_size = sizeof(uint64_t);
+
+    ret = vmdk_init_tables(bs, extent, errp);
+    if (ret) {
+        /* free extent allocated by vmdk_add_extent */
+        vmdk_free_last_extent(bs);
+    }
+
+    return ret;
+}
+
 static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
                                QDict *options, Error **errp);
 
@@ -847,6 +1222,7 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
      * RW [size in sectors] SPARSE "file-name.vmdk"
      * RW [size in sectors] VMFS "file-name.vmdk"
      * RW [size in sectors] VMFSSPARSE "file-name.vmdk"
+     * RW [size in sectors] SESPARSE "file-name.vmdk"
      */
     flat_offset = -1;
     matches = sscanf(p, "%10s %" SCNd64 " %10s \"%511[^\n\r\"]\" %" SCNd64,
@@ -869,7 +1245,8 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
 
         if (sectors <= 0 ||
             (strcmp(type, "FLAT") && strcmp(type, "SPARSE") &&
-             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE")) ||
+             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE") &&
+             strcmp(type, "SESPARSE")) ||
             (strcmp(access, "RW"))) {
             continue;
         }
@@ -922,6 +1299,13 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
                 return ret;
             }
             extent = &s->extents[s->num_extents - 1];
+        } else if (!strcmp(type, "SESPARSE")) {
+            ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
+            if (ret) {
+                bdrv_unref_child(bs, extent_file);
+                return ret;
+            }
+            extent = &s->extents[s->num_extents - 1];
         } else {
             error_setg(errp, "Unsupported extent type '%s'", type);
             bdrv_unref_child(bs, extent_file);
@@ -956,6 +1340,7 @@ static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
     if (strcmp(ct, "monolithicFlat") &&
         strcmp(ct, "vmfs") &&
         strcmp(ct, "vmfsSparse") &&
+        strcmp(ct, "seSparse") &&
         strcmp(ct, "twoGbMaxExtentSparse") &&
         strcmp(ct, "twoGbMaxExtentFlat")) {
         error_setg(errp, "Unsupported image type '%s'", ct);
@@ -1206,10 +1591,12 @@ static int get_cluster_offset(BlockDriverState *bs,
 {
     unsigned int l1_index, l2_offset, l2_index;
     int min_index, i, j;
-    uint32_t min_count, *l2_table;
+    uint32_t min_count;
+    void *l2_table;
     bool zeroed = false;
     int64_t ret;
     int64_t cluster_sector;
+    unsigned int l2_size_bytes = extent->l2_size * extent->entry_size;
 
     if (m_data) {
         m_data->valid = 0;
@@ -1224,7 +1611,31 @@ static int get_cluster_offset(BlockDriverState *bs,
     if (l1_index >= extent->l1_size) {
         return VMDK_ERROR;
     }
-    l2_offset = extent->l1_table[l1_index];
+    if (extent->sesparse) {
+        uint64_t l2_offset_u64 = ((uint64_t *)extent->l1_table)[l1_index];
+        if (l2_offset_u64 == 0) {
+            l2_offset = 0;
+        } else if ((l2_offset_u64 & 0xffffffff00000000) != 0x1000000000000000) {
+            /*
+             * Top most nibble is 0x1 if grain table is allocated.
+             * strict check - top most 4 bytes must be 0x10000000 since max
+             * supported size is 64TB for disk - so no more than 64TB / 16MB
+             * grain directories which is smaller than uint32,
+             * where 16MB is the only supported default grain table coverage.
+ */ + return VMDK_ERROR; + } else { + l2_offset_u64 =3D l2_offset_u64 & 0x00000000ffffffff; + l2_offset_u64 =3D extent->sesparse_l2_tables_offset + + l2_offset_u64 * l2_size_bytes / BDRV_SECTOR_SIZE; + if (l2_offset_u64 > 0x00000000ffffffff) { + return VMDK_ERROR; + } + l2_offset =3D (unsigned int)(l2_offset_u64); + } + } else { + l2_offset =3D ((uint32_t *)extent->l1_table)[l1_index]; + } if (!l2_offset) { return VMDK_UNALLOC; } @@ -1236,7 +1647,7 @@ static int get_cluster_offset(BlockDriverState *bs, extent->l2_cache_counts[j] >>=3D 1; } } - l2_table =3D extent->l2_cache + (i * extent->l2_size); + l2_table =3D (char *)extent->l2_cache + (i * l2_size_bytes); goto found; } } @@ -1249,13 +1660,13 @@ static int get_cluster_offset(BlockDriverState *bs, min_index =3D i; } } - l2_table =3D extent->l2_cache + (min_index * extent->l2_size); + l2_table =3D (char *)extent->l2_cache + (min_index * l2_size_bytes); BLKDBG_EVENT(extent->file, BLKDBG_L2_LOAD); if (bdrv_pread(extent->file, (int64_t)l2_offset * 512, l2_table, - extent->l2_size * sizeof(uint32_t) - ) !=3D extent->l2_size * sizeof(uint32_t)) { + l2_size_bytes + ) !=3D l2_size_bytes) { return VMDK_ERROR; } =20 @@ -1263,16 +1674,45 @@ static int get_cluster_offset(BlockDriverState *bs, extent->l2_cache_counts[min_index] =3D 1; found: l2_index =3D ((offset >> 9) / extent->cluster_sectors) % extent->l2_si= ze; - cluster_sector =3D le32_to_cpu(l2_table[l2_index]); =20 - if (extent->has_zero_grain && cluster_sector =3D=3D VMDK_GTE_ZEROED) { - zeroed =3D true; + if (extent->sesparse) { + cluster_sector =3D le64_to_cpu(((uint64_t *)l2_table)[l2_index]); + switch (cluster_sector & 0xf000000000000000) { + case 0x0000000000000000: + /* unallocated grain */ + if (cluster_sector !=3D 0) { + return VMDK_ERROR; + } + break; + case 0x1000000000000000: + /* scsi-unmapped grain - fallthrough */ + case 0x2000000000000000: + /* zero grain */ + zeroed =3D true; + break; + case 0x3000000000000000: + /* allocated grain */ + 
cluster_sector =3D (((cluster_sector & 0x0fff000000000000) >> = 48) | + ((cluster_sector & 0x0000ffffffffffff) << 12= )); + cluster_sector =3D extent->sesparse_clusters_offset + + cluster_sector * extent->cluster_sectors; + break; + default: + return VMDK_ERROR; + } + } else { + cluster_sector =3D le32_to_cpu(((uint32_t *)l2_table)[l2_index]); + + if (extent->has_zero_grain && cluster_sector =3D=3D VMDK_GTE_ZEROE= D) { + zeroed =3D true; + } } =20 if (!cluster_sector || zeroed) { if (!allocate) { return zeroed ? VMDK_ZEROED : VMDK_UNALLOC; } + assert(!extent->sesparse); =20 if (extent->next_cluster_sector >=3D VMDK_EXTENT_MAX_SECTORS) { return VMDK_ERROR; @@ -1296,7 +1736,7 @@ static int get_cluster_offset(BlockDriverState *bs, m_data->l1_index =3D l1_index; m_data->l2_index =3D l2_index; m_data->l2_offset =3D l2_offset; - m_data->l2_cache_entry =3D &l2_table[l2_index]; + m_data->l2_cache_entry =3D ((uint32_t *)l2_table) + l2_index; } } *cluster_offset =3D cluster_sector << BDRV_SECTOR_BITS; @@ -1622,6 +2062,9 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_= t offset, if (!extent) { return -EIO; } + if (extent->sesparse) { + return -ENOTSUP; + } offset_in_cluster =3D vmdk_find_offset_in_cluster(extent, offset); n_bytes =3D MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE - offset_in_cluster); --=20 2.13.3