From: Cédric Le Goater <clg@redhat.com>
To: qemu-devel@nongnu.org
Cc: Alex Williamson, Ankit Agrawal, Shameer Kolothum, Jason Gunthorpe,
	Cédric Le Goater
Subject: [PULL 4/5] hw/vfio: align mmap to power-of-2 of region size for hugepfnmap
Date: Wed, 18 Feb 2026 15:00:02 +0100
Message-ID: <20260218140003.1554502-5-clg@redhat.com>
In-Reply-To: <20260218140003.1554502-1-clg@redhat.com>
References: <20260218140003.1554502-1-clg@redhat.com>

From: Ankit Agrawal

On Grace-based systems such as GB200, device memory is exposed as a BAR,
but the actual mappable size is not power-of-2 aligned. The previous
algorithm aligned each sparse mmap area based on its individual size
using ctz64(), which prevented efficient huge page usage by the kernel.
Adjust VFIO region mapping alignment to use the next power-of-2 of the
total region size and place the sparse subregions at their appropriate
offsets. This provides better opportunities for huge alignment, allowing
the kernel to use larger page sizes for the VMA. It enables the use of
PMD-level huge pages, which can significantly improve memory access
performance and reduce TLB pressure for large device memory regions.

With this change:
- Create a single aligned base mapping for the entire region
- Change alignment to be based on pow2ceil(region->size), capped at 1GiB
- Unmap gaps between sparse regions
- Use MAP_FIXED to overlay sparse mmap areas at their offsets

Example VMA for device memory of size 0x2F00F00000 on GB200:

Before (misaligned, no hugepfnmap):
ff88ff000000-ffb7fff00000 rw-s 400000000000 00:06 727  /dev/vfio/devices/vfio1

After (aligned to 1GiB boundary, hugepfnmap enabled):
ff8ac0000000-ffb9c0f00000 rw-s 400000000000 00:06 727  /dev/vfio/devices/vfio1

This requires sparse regions to be sorted by offset (done in the
previous patch) to correctly identify and handle gaps.
cc: Alex Williamson
Reviewed-by: Alex Williamson
Reviewed-by: Shameer Kolothum
Suggested-by: Jason Gunthorpe
Signed-off-by: Ankit Agrawal
Reviewed-by: Cédric Le Goater
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-4-ankita@nvidia.com
Signed-off-by: Cédric Le Goater
---
 hw/vfio/region.c | 86 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 59 insertions(+), 27 deletions(-)

diff --git a/hw/vfio/region.c b/hw/vfio/region.c
index d464eadf9c048e29981da8af48f8f86933a98ad5..47fdc2df349b65c6be6c9605b7a38a4e367f0475 100644
--- a/hw/vfio/region.c
+++ b/hw/vfio/region.c
@@ -344,8 +344,11 @@ static bool vfio_region_create_dma_buf(VFIORegion *region, Error **errp)
 
 int vfio_region_mmap(VFIORegion *region)
 {
-    int i, ret, prot = 0;
+    void *map_base, *map_align;
     Error *local_err = NULL;
+    int i, ret, prot = 0;
+    off_t map_offset = 0;
+    size_t align;
     char *name;
     int fd;
 
@@ -356,41 +359,61 @@ int vfio_region_mmap(VFIORegion *region)
     prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
     prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
 
-    for (i = 0; i < region->nr_mmaps; i++) {
-        size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB);
-        void *map_base, *map_align;
+    /*
+     * Align the mmap for more efficient mapping in the kernel. Ideally
+     * we'd know the PMD and PUD mapping sizes to use as discrete alignment
+     * intervals, but we don't. As of Linux v6.19, the largest PUD size
+     * supporting huge pfnmap is 1GiB (ARCH_SUPPORTS_PUD_PFNMAP is only set
+     * on x86_64).
+     *
+     * Align by power-of-two of the size of the entire region - capped
+     * by 1G - and place the sparse subregions at their appropriate offset.
+     * This will get maximum alignment.
+     *
+     * NB. qemu_memalign() and friends actually allocate memory, whereas
+     * the region size here can exceed host memory, therefore we manually
+     * create an oversized anonymous mapping and clean it up for alignment.
+     */
 
-        /*
-         * Align the mmap for more efficient mapping in the kernel. Ideally
-         * we'd know the PMD and PUD mapping sizes to use as discrete alignment
-         * intervals, but we don't. As of Linux v6.12, the largest PUD size
-         * supporting huge pfnmap is 1GiB (ARCH_SUPPORTS_PUD_PFNMAP is only set
-         * on x86_64). Align by power-of-two size, capped at 1GiB.
-         *
-         * NB. qemu_memalign() and friends actually allocate memory, whereas
-         * the region size here can exceed host memory, therefore we manually
-         * create an oversized anonymous mapping and clean it up for alignment.
-         */
-        map_base = mmap(0, region->mmaps[i].size + align, PROT_NONE,
-                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
-        if (map_base == MAP_FAILED) {
-            ret = -errno;
-            goto no_mmap;
-        }
+    align = MIN(pow2ceil(region->size), 1 * GiB);
 
-        fd = vfio_device_get_region_fd(region->vbasedev, region->nr);
+    map_base = mmap(0, region->size + align, PROT_NONE,
+                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (map_base == MAP_FAILED) {
+        ret = -errno;
+        trace_vfio_region_mmap_fault(memory_region_name(region->mem), -1,
+                                     region->fd_offset,
+                                     region->fd_offset + region->size - 1,
+                                     ret);
+        return ret;
+    }
+
+    fd = vfio_device_get_region_fd(region->vbasedev, region->nr);
 
-        map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align);
-        munmap(map_base, map_align - map_base);
-        munmap(map_align + region->mmaps[i].size,
-               align - (map_align - map_base));
+    map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align);
+    munmap(map_base, map_align - map_base);
+    munmap(map_align + region->size,
+           align - (map_align - map_base));
 
-        region->mmaps[i].mmap = mmap(map_align, region->mmaps[i].size, prot,
+    /*
+     * Regions should already be sorted by vfio_setup_region_sparse_mmaps().
+     * This is critical for the following algorithm which relies on range
+     * offsets being in ascending order.
+     */
+    for (i = 0; i < region->nr_mmaps; i++) {
+        munmap(map_align + map_offset, region->mmaps[i].offset - map_offset);
+        region->mmaps[i].mmap = mmap(map_align + region->mmaps[i].offset,
+                                     region->mmaps[i].size, prot,
                                      MAP_SHARED | MAP_FIXED, fd,
                                      region->fd_offset + region->mmaps[i].offset);
         if (region->mmaps[i].mmap == MAP_FAILED) {
             ret = -errno;
+            /*
+             * Only unmap the rest of the region. Any mmaps that were
+             * successful will be unmapped in no_mmap.
+             */
+            munmap(map_align + region->mmaps[i].offset,
+                   region->size - region->mmaps[i].offset);
             goto no_mmap;
         }
 
@@ -408,6 +431,15 @@ int vfio_region_mmap(VFIORegion *region)
                                                region->mmaps[i].offset,
                                                region->mmaps[i].offset +
                                                region->mmaps[i].size - 1);
+
+        map_offset = region->mmaps[i].offset + region->mmaps[i].size;
+    }
+
+    /*
+     * Unmap the rest of the region not covered by sparse mmap.
+     */
+    if (map_offset < region->size) {
+        munmap(map_align + map_offset, region->size - map_offset);
+    }
 
     if (!vfio_region_create_dma_buf(region, &local_err)) {
-- 
2.53.0