From nobody Tue Dec 16 16:20:33 2025
From: Michael Roth
To: qemu-devel@nongnu.org
Date: Wed, 6 Dec 2017 13:15:58 -0600
Message-Id: <20171206191648.18208-6-mdroth@linux.vnet.ibm.com>
In-Reply-To: <20171206191648.18208-1-mdroth@linux.vnet.ibm.com>
References: <20171206191648.18208-1-mdroth@linux.vnet.ibm.com>
X-Mailer: git-send-email 2.11.0
Subject: [Qemu-devel] [PATCH 05/55] block: Perform copy-on-read in loop
Cc: Kevin Wolf, qemu-stable@nongnu.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Eric Blake

Improve our braindead copy-on-read implementation.  Pre-patch, we have
multiple issues:

- we create a bounce buffer and perform a write for the entire request,
  even if the active image already has 99% of the clusters occupied, and
  really only needs to copy-on-read the remaining 1% of the clusters

- our bounce buffer was as large as the read request, and can needlessly
  exhaust our memory by using double the memory of the request size (the
  original request plus our bounce buffer), rather than a capped maximum
  overhead beyond the original

- if a driver has a max_transfer limit, we are bypassing the normal code
  in bdrv_aligned_preadv() that fragments to that limit, and instead
  attempt to read the entire buffer from the driver in one go, which
  some drivers may assert on

- a client can request a large request of nearly 2G such that rounding
  the request out to cluster boundaries results in a byte count larger
  than 2G.
  While this cannot exceed 32 bits, it DOES have some
  follow-on problems:
  -- the call to bdrv_driver_pread() can assert for exceeding
     BDRV_REQUEST_MAX_BYTES, if the driver is old and lacks
     .bdrv_co_preadv
  -- if the buffer is all zeroes, the subsequent call to
     bdrv_co_do_pwrite_zeroes is a no-op due to a negative size,
     which means we did not actually copy on read

Fix all of these issues by breaking up the action into a loop, where
each iteration is capped to sane limits.

Also, querying the allocation status allows us to optimize: when data
is already present in the active layer, we don't need to bounce.

Note that the code has a telling comment that copy-on-read should
probably be a filter driver rather than a bolt-on hack in io.c; but
that remains a task for another day.

CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake
Reviewed-by: Kevin Wolf
Reviewed-by: Stefan Hajnoczi
Signed-off-by: Kevin Wolf
(cherry picked from commit cb2e28780c7080af489e72227683fe374f05022d)
 Conflicts:
   block/io.c
 * remove context dep on d855ebcd3
Signed-off-by: Michael Roth
---
 block/io.c | 118 ++++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 81 insertions(+), 37 deletions(-)

diff --git a/block/io.c b/block/io.c
index 26003814eb..fce856ea8a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -34,6 +34,9 @@
 
 #define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */
 
+/* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
+#define MAX_BOUNCE_BUFFER (32768 << BDRV_SECTOR_BITS)
+
 static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int64_t offset, int bytes, BdrvRequestFlags flags);
 
@@ -945,11 +948,14 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
 
     BlockDriver *drv = bs->drv;
     struct iovec iov;
-    QEMUIOVector bounce_qiov;
+    QEMUIOVector local_qiov;
     int64_t cluster_offset;
     unsigned int cluster_bytes;
     size_t skip_bytes;
     int ret;
+    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+                                    BDRV_REQUEST_MAX_BYTES);
+    unsigned int progress = 0;
 
     /* FIXME We cannot require callers to have write permissions when all they
      * are doing is a read request. If we did things right, write permissions
@@ -961,52 +967,94 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
     // assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
 
     /* Cover entire cluster so no additional backing file I/O is required when
-     * allocating cluster in the image file.
+     * allocating cluster in the image file.  Note that this value may exceed
+     * BDRV_REQUEST_MAX_BYTES (even when the original read did not), which
+     * is one reason we loop rather than doing it all at once.
      */
     bdrv_round_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
+    skip_bytes = offset - cluster_offset;
 
     trace_bdrv_co_do_copy_on_readv(bs, offset, bytes, cluster_offset,
                                    cluster_bytes);
 
-    iov.iov_len = cluster_bytes;
-    iov.iov_base = bounce_buffer = qemu_try_blockalign(bs, iov.iov_len);
+    bounce_buffer = qemu_try_blockalign(bs,
+                                        MIN(MIN(max_transfer, cluster_bytes),
+                                            MAX_BOUNCE_BUFFER));
     if (bounce_buffer == NULL) {
         ret = -ENOMEM;
         goto err;
     }
 
-    qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+    while (cluster_bytes) {
+        int64_t pnum;
 
-    ret = bdrv_driver_preadv(bs, cluster_offset, cluster_bytes,
-                             &bounce_qiov, 0);
-    if (ret < 0) {
-        goto err;
-    }
+        ret = bdrv_is_allocated(bs, cluster_offset,
+                                MIN(cluster_bytes, max_transfer), &pnum);
+        if (ret < 0) {
+            /* Safe to treat errors in querying allocation as if
+             * unallocated; we'll probably fail again soon on the
+             * read, but at least that will set a decent errno.
+             */
+            pnum = MIN(cluster_bytes, max_transfer);
+        }
 
-    if (drv->bdrv_co_pwrite_zeroes &&
-        buffer_is_zero(bounce_buffer, iov.iov_len)) {
-        /* FIXME: Should we (perhaps conditionally) be setting
-         * BDRV_REQ_MAY_UNMAP, if it will allow for a sparser copy
-         * that still correctly reads as zero? */
-        ret = bdrv_co_do_pwrite_zeroes(bs, cluster_offset, cluster_bytes, 0);
-    } else {
-        /* This does not change the data on the disk, it is not necessary
-         * to flush even in cache=writethrough mode.
-         */
-        ret = bdrv_driver_pwritev(bs, cluster_offset, cluster_bytes,
-                                  &bounce_qiov, 0);
-    }
+        assert(skip_bytes < pnum);
 
-    if (ret < 0) {
-        /* It might be okay to ignore write errors for guest requests.  If this
-         * is a deliberate copy-on-read then we don't want to ignore the error.
-         * Simply report it in all cases.
-         */
-        goto err;
-    }
+        if (ret <= 0) {
+            /* Must copy-on-read; use the bounce buffer */
+            iov.iov_base = bounce_buffer;
+            iov.iov_len = pnum = MIN(pnum, MAX_BOUNCE_BUFFER);
+            qemu_iovec_init_external(&local_qiov, &iov, 1);
 
-    skip_bytes = offset - cluster_offset;
-    qemu_iovec_from_buf(qiov, 0, bounce_buffer + skip_bytes, bytes);
+            ret = bdrv_driver_preadv(bs, cluster_offset, pnum,
+                                     &local_qiov, 0);
+            if (ret < 0) {
+                goto err;
+            }
+
+            if (drv->bdrv_co_pwrite_zeroes &&
+                buffer_is_zero(bounce_buffer, pnum)) {
+                /* FIXME: Should we (perhaps conditionally) be setting
+                 * BDRV_REQ_MAY_UNMAP, if it will allow for a sparser copy
+                 * that still correctly reads as zero? */
+                ret = bdrv_co_do_pwrite_zeroes(bs, cluster_offset, pnum, 0);
+            } else {
+                /* This does not change the data on the disk, it is not
+                 * necessary to flush even in cache=writethrough mode.
+                 */
+                ret = bdrv_driver_pwritev(bs, cluster_offset, pnum,
+                                          &local_qiov, 0);
+            }
+
+            if (ret < 0) {
+                /* It might be okay to ignore write errors for guest
+                 * requests.  If this is a deliberate copy-on-read
+                 * then we don't want to ignore the error.  Simply
+                 * report it in all cases.
+                 */
+                goto err;
+            }
+
+            qemu_iovec_from_buf(qiov, progress, bounce_buffer + skip_bytes,
+                                pnum - skip_bytes);
+        } else {
+            /* Read directly into the destination */
+            qemu_iovec_init(&local_qiov, qiov->niov);
+            qemu_iovec_concat(&local_qiov, qiov, progress, pnum - skip_bytes);
+            ret = bdrv_driver_preadv(bs, offset + progress, local_qiov.size,
+                                     &local_qiov, 0);
+            qemu_iovec_destroy(&local_qiov);
+            if (ret < 0) {
+                goto err;
+            }
+        }
+
+        cluster_offset += pnum;
+        cluster_bytes -= pnum;
+        progress += pnum - skip_bytes;
+        skip_bytes = 0;
+    }
+    ret = 0;
 
 err:
     qemu_vfree(bounce_buffer);
@@ -1212,9 +1260,6 @@ int coroutine_fn bdrv_co_readv(BdrvChild *child, int64_t sector_num,
     return bdrv_co_do_readv(child, sector_num, nb_sectors, qiov, 0);
 }
 
-/* Maximum buffer for write zeroes fallback, in bytes */
-#define MAX_WRITE_ZEROES_BOUNCE_BUFFER (32768 << BDRV_SECTOR_BITS)
-
 static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int64_t offset, int bytes, BdrvRequestFlags flags)
 {
@@ -1229,8 +1274,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
     int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
                         bs->bl.request_alignment);
-    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
-                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);
+    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer, MAX_BOUNCE_BUFFER);
 
     assert(alignment % bs->bl.request_alignment == 0);
     head = offset % alignment;
-- 
2.11.0