From: John Johnson
To: qemu-devel@nongnu.org
Subject: [RFC v4 01/21] vfio-user: introduce vfio-user protocol specification
Date: Tue, 11 Jan 2022 16:43:37 -0800

From: Thanos Makatos

This patch introduces the vfio-user protocol specification (formerly
known as VFIO-over-socket), which is designed to allow devices to be
emulated outside QEMU, in a separate process. vfio-user reuses the
existing VFIO defines, structs and concepts.

It has been earlier discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson
Signed-off-by: Thanos Makatos
Signed-off-by: John Levon
---
 docs/devel/index.rst     |    1 +
 docs/devel/vfio-user.rst | 1810 ++++++++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS              |    6 +
 3 files changed, 1817 insertions(+)
 create mode 100644 docs/devel/vfio-user.rst

diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index afd9375..23d2c30 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -48,3 +48,4 @@ modifying QEMU's source code.
    trivial-patches
    submitting-a-patch
    submitting-a-pull-request
+   vfio-user
diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
new file mode 100644
index 0000000..97a7506
--- /dev/null
+++ b/docs/devel/vfio-user.rst
@@ -0,0 +1,1810 @@
+.. include::
+********************************
+vfio-user Protocol Specification
+********************************
+
+--------------
+Version_ 0.9.1
+--------------
+
+.. contents:: Table of Contents
+
+Introduction
+============
+vfio-user is a protocol that allows a device to be emulated in a separate
+process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
+of a generic VFIO device type, living inside the VMM, which we call the client,
+and the core device implementation, living outside the VMM, which we call the
+server.
+
+The vfio-user specification is partly based on the
+`Linux VFIO ioctl interface `_.
+
+VFIO is a mature and stable API, backed by an extensively used framework. The
+existing VFIO client implementation in QEMU (``qemu/hw/vfio/``) can be largely
+re-used, though there is nothing in this specification that requires that
+particular implementation. None of the VFIO kernel modules are required for
+supporting the protocol, on either the client or server side. Some source
+definitions in VFIO are re-used for vfio-user.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (``AF_UNIX``) is
+chosen because file descriptors can be trivially sent over it, which in turn
+allows:
+
+* Sharing of client memory for DMA with the server.
+* Sharing of server memory with the client for fast MMIO.
+* Efficient sharing of eventfds for triggering interrupts.
+
+Other socket types could be used which allow the server to run in a separate
+guest in the same host (``AF_VSOCK``) or remotely (``AF_INET``). Theoretically,
+the underlying transport does not necessarily have to be a socket; however, we
+do not examine such alternatives. In this protocol version we focus on using a
+UNIX domain socket and introduce basic support for the other two types of
+sockets without considering performance implications.
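
The file descriptor passing that motivates the choice of ``AF_UNIX`` can be sketched in a few lines. The following is only an illustration and not part of the patch or of the vfio-user wire format: the function name, the ``b"fd"`` payload, and the use of a temporary file as backing memory are all assumptions made for the example. It relies on Python 3.9+ ``socket.send_fds()`` / ``socket.recv_fds()``, which wrap ``SCM_RIGHTS`` ancillary data.

```python
import mmap
import os
import socket
import tempfile


def share_memory_over_unix_socket(size: int = 4096) -> bytes:
    """Pass a file descriptor over an AF_UNIX socket pair and map it on both ends.

    Hypothetical illustration only: both "client" and "server" live in this
    one process, and the message payload is a placeholder, not vfio-user.
    """
    client_sock, server_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as backing:
        backing.truncate(size)
        client_map = mmap.mmap(backing.fileno(), size)

        # "Client" side: send the descriptor as SCM_RIGHTS ancillary data.
        socket.send_fds(client_sock, [b"fd"], [backing.fileno()])

        # "Server" side: receive a new descriptor referring to the same file.
        _data, fds, _flags, _addr = socket.recv_fds(server_sock, 16, 1)
        server_map = mmap.mmap(fds[0], size)

        # A write through one mapping is visible through the other.
        client_map[:5] = b"hello"
        seen = bytes(server_map[:5])

        server_map.close()
        client_map.close()
        os.close(fds[0])
    client_sock.close()
    server_sock.close()
    return seen


print(share_memory_over_unix_socket())
```

Both mappings are shared mappings of the same file, which is why no data needs to cross the socket after the descriptor has been handed over.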
+
+While passing of file descriptors is desirable for performance reasons, support
+is not necessary for either the client or the server in order to implement the
+protocol. There is always an in-band, message-passing fallback mechanism.
+
+Overview
+========
+
+VFIO is a framework that allows a physical device to be securely passed through
+to a user space process; the device-specific kernel driver does not drive the
+device at all. Typically, the user space process is a VMM and the device is
+passed through to it in order to achieve high performance. VFIO provides an API
+and the required functionality in the kernel. QEMU has adopted VFIO to allow a
+guest to directly access physical devices, instead of emulating them in
+software.
+
+vfio-user reuses the core VFIO concepts defined in its API, but implements them
+as messages to be sent over a socket. It does not change the kernel-based VFIO
+in any way; in fact, none of the VFIO kernel modules need to be loaded to use
+vfio-user. It is also possible for the client to concurrently use the current
+kernel-based VFIO for one device, and vfio-user for another device.
+
+VFIO Device Model
+-----------------
+
+A device under VFIO presents a standard interface to the user process. Many of
+the VFIO operations in the existing interface use the ``ioctl()`` system call, and
+references to the existing interface are called the ``ioctl()`` implementation in
+this document.
+
+The following sections describe the set of messages that implement the vfio-user
+interface over a socket. In many cases, the messages are analogous to data
+structures used in the ``ioctl()`` implementation. Messages derived from the
+``ioctl()`` will have a name derived from the ``ioctl()`` command name. E.g., the
+``VFIO_DEVICE_GET_INFO`` ``ioctl()`` command becomes a
+``VFIO_USER_DEVICE_GET_INFO`` message.
+The purpose of this reuse is to share as much code as feasible with the
+``ioctl()`` implementation.
+
+Connection Initiation
+^^^^^^^^^^^^^^^^^^^^^
+
+After the client connects to the server, the initial client message is
+``VFIO_USER_VERSION`` to propose a protocol version and set of capabilities to
+apply to the session. The server replies with a compatible version and set of
+capabilities it supports, or closes the connection if it cannot support the
+advertised version.
+
+Device Information
+^^^^^^^^^^^^^^^^^^
+
+The client uses a ``VFIO_USER_DEVICE_GET_INFO`` message to query the server for
+information about the device. This information includes:
+
+* The device type and whether it supports reset (``VFIO_DEVICE_FLAGS_``),
+* the number of device regions, and
+* the number of interrupt types the device supports.
+
+Region Information
+^^^^^^^^^^^^^^^^^^
+
+The client uses ``VFIO_USER_DEVICE_GET_REGION_INFO`` messages to query the
+server for information about the device's regions. This information describes:
+
+* Read and write permissions, whether it can be memory mapped, and whether it
+  supports additional capabilities (``VFIO_REGION_INFO_CAP_``).
+* Region index, size, and offset.
+
+When a device region can be mapped by the client, the server provides a file
+descriptor which the client can ``mmap()``. The server is responsible for
+polling for client updates to memory mapped regions.
+
+Region Capabilities
+"""""""""""""""""""
+
+Some regions have additional capabilities that cannot be described adequately
+by the region info data structure. These capabilities are returned in the
+region info reply in a list similar to PCI capabilities in a PCI device's
+configuration space.
+
+Sparse Regions
+""""""""""""""
+A region can be memory-mappable in whole or in part.
+When only a subset of a region can be mapped by the client, a
+``VFIO_REGION_INFO_CAP_SPARSE_MMAP`` capability is included in the region info
+reply. This capability describes which portions can be mapped by the client.
+
+.. Note::
+   For example, in a virtual NVMe controller, sparse regions can be used so
+   that accesses to the NVMe registers (found in the beginning of BAR0) are
+   trapped (an infrequent event), while allowing direct access to the doorbells
+   (an extremely frequent event as every I/O submission requires a write to
+   BAR0), found in the next page after the NVMe registers in BAR0.
+
+Device-Specific Regions
+"""""""""""""""""""""""
+
+A device can define regions additional to the standard ones (e.g. PCI indexes
+0-8). This is achieved by including a ``VFIO_REGION_INFO_CAP_TYPE`` capability
+in the region info reply of a device-specific region. Such regions are reflected
+in ``struct vfio_user_device_info.num_regions``. Thus, for PCI devices this
+value can be equal to, or higher than, ``VFIO_PCI_NUM_REGIONS``.
+
+Region I/O via file descriptors
+-------------------------------
+
+For unmapped regions, region I/O from the client is done via
+``VFIO_USER_REGION_READ/WRITE``. As an optimization, ioeventfds or ioregionfds
+may be configured for sub-regions of some regions. A client may request
+information on these sub-regions via ``VFIO_USER_DEVICE_GET_REGION_IO_FDS``; by
+configuring the returned file descriptors as ioeventfds or ioregionfds, the
+server can be directly notified of I/O (for example, by KVM) without taking a
+trip through the client.
+
+Interrupts
+^^^^^^^^^^
+
+The client uses ``VFIO_USER_DEVICE_GET_IRQ_INFO`` messages to query the server
+for the device's interrupt types. The interrupt types are specific to the bus
+the device is attached to, and the client is expected to know the capabilities
+of each interrupt type.
+The server can signal an interrupt by directly injecting interrupts into the
+guest via an event file descriptor. The client configures how the server
+signals an interrupt with ``VFIO_USER_DEVICE_SET_IRQS`` messages.
+
+Device Read and Write
+^^^^^^^^^^^^^^^^^^^^^
+
+When the guest executes load or store operations to an unmapped device region,
+the client forwards these operations to the server with
+``VFIO_USER_REGION_READ`` or ``VFIO_USER_REGION_WRITE`` messages. The server
+will reply with data from the device on read operations or an acknowledgement on
+write operations. See `Read and Write Operations`_.
+
+Client memory access
+--------------------
+
+The client uses ``VFIO_USER_DMA_MAP`` and ``VFIO_USER_DMA_UNMAP`` messages to
+inform the server of the valid DMA ranges that the server can access on behalf
+of a device (typically, VM guest memory). DMA memory may be accessed by the
+server via ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages over the
+socket. In this case, the "DMA" part of the naming is a misnomer.
+
+Actual direct memory access of client memory from the server is possible if the
+client provides file descriptors the server can ``mmap()``. Note that ``mmap()``
+privileges cannot be revoked by the client; therefore, file descriptors should
+only be exported in environments where the client trusts the server not to
+corrupt guest memory.
+
+See `Read and Write Operations`_.
+
+Client/server interactions
+==========================
+
+Socket
+------
+
+A server can serve:
+
+1) one or more clients, and/or
+2) one or more virtual devices, belonging to one or more clients.
+
+The current protocol specification requires a dedicated socket per
+client/server connection. It is a server-side implementation detail whether a
+single server handles multiple virtual devices from the same or multiple
+clients. The location of the socket is implementation-specific.
+Multiplexing clients, devices, and servers over the same socket is not supported
+in this version of the protocol.
+
+Authentication
+--------------
+
+For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files;
+therefore, it is up to the management layer to set up the socket as required.
+Socket types that span guests or hosts will require a proper authentication
+mechanism. Defining that mechanism is deferred to a future version of the
+protocol.
+
+Command Concurrency
+-------------------
+
+A client may pipeline multiple commands without waiting for previous command
+replies. The server will process commands in the order they are received. A
+consequence of this is that if a client issues a command with the *No_reply*
+bit, then subsequently issues a command without *No_reply*, the older command
+will have been processed before the reply to the younger command is sent by the
+server. The client must be aware of the device's capability to process
+concurrent commands if pipelining is used. For example, pipelining allows
+multiple client threads to concurrently access device regions; the client must
+ensure these accesses obey device semantics.
+
+An example is a frame buffer device, where the device may allow concurrent
+access to different areas of video memory, but may have indeterminate behavior
+if concurrent accesses are performed to command or status registers.
+
+Note that unrelated messages sent from the server to the client can appear in
+between a client to server request/reply and vice versa.
+
+Implementers should be prepared for certain commands to exhibit potentially
+unbounded latencies. For example, ``VFIO_USER_DEVICE_RESET`` may take an
+arbitrarily long time to complete; clients should take care not to block
+unnecessarily.
+
+Socket Disconnection Behavior
+-----------------------------
+The server and the client can disconnect from each other, either intentionally
+or unexpectedly.
+Both the client and the server need to know how to handle such events.
+
+Server Disconnection
+^^^^^^^^^^^^^^^^^^^^
+A server disconnecting from the client may indicate that:
+
+1) A virtual device has been restarted, either intentionally (e.g. because of a
+   device update) or unintentionally (e.g. because of a crash).
+2) A virtual device has been shut down with no intention to be restarted.
+
+It is impossible for the client to know whether or not a failure is
+intermittent or innocuous and should be retried; therefore, the client should
+reset the VFIO device when it detects the socket has been disconnected.
+Error recovery will be driven by the guest's device error handling
+behavior.
+
+Client Disconnection
+^^^^^^^^^^^^^^^^^^^^
+The client disconnecting from the server primarily means that the client
+has exited. Currently, this means that the guest is shut down, so the device is
+no longer needed and the server can automatically exit. However, there
+can be cases where a client disconnection should not result in a server exit:
+
+1) A single server serving multiple clients.
+2) A multi-process QEMU upgrading itself step by step, which is not yet
+   implemented.
+
+Therefore, in order for the protocol to be forward compatible, the server should
+respond to a client disconnection as follows:
+
+ - all client memory regions are unmapped and cleaned up (including closing any
+   passed file descriptors)
+ - all IRQ file descriptors passed from the old client are closed
+ - the device state should otherwise be retained
+
+The expectation is that when a client reconnects, it will re-establish IRQ and
+client memory mappings.
+
+If anything happens to the client (such as QEMU really exiting), the control
+stack will know about it and can clean up resources accordingly.
+
+Security Considerations
+-----------------------
+
+Speaking generally, vfio-user clients should not trust servers, and vice versa.
+Standard tools and mechanisms should be used on both sides to validate input and
+protect against denial of service scenarios, buffer overflow, etc.
+
+Request Retry and Response Timeout
+----------------------------------
+A failed command is a command that has been successfully sent and has been
+responded to with an error code. Failure to send the command in the first place
+(e.g. because the socket is disconnected) is a different type of error examined
+earlier in the disconnect section.
+
+.. Note::
+   QEMU's VFIO retries certain operations if they fail. While this makes sense
+   for real HW, we don't know for sure whether it makes sense for virtual
+   devices.
+
+Defining a retry and timeout scheme is deferred to a future version of the
+protocol.
+
+Message sizes
+-------------
+
+Some requests have an ``argsz`` field. In a request, it defines the maximum
+expected reply payload size, which should be at least the size of the fixed
+reply payload headers defined here. The *request* payload size is defined by the
+usual ``msg_size`` field in the header, not the ``argsz`` field.
+
+In a reply, the server sets the ``argsz`` field to the size needed for a full
+payload. This may be less than the requested maximum size. It may also be
+larger than the requested maximum size: in that case, the full payload is not
+included in the reply, but the ``argsz`` field in the reply indicates the needed
+size, allowing a client to allocate a larger buffer for holding the reply before
+trying again.
+
+In addition, during negotiation (see `Version`_), the client and server may
+each specify a ``max_data_xfer_size`` value; this defines the maximum data that
+may be read or written via one of the ``VFIO_USER_DMA/REGION_READ/WRITE``
+messages; see `Read and Write Operations`_.
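
The ``argsz`` retry described above can be sketched as a small client-side loop. This is a minimal illustration, not the patch's implementation: ``fake_server_reply`` and ``query_with_argsz`` are hypothetical names, the reply is modeled as an ``(argsz, payload)`` pair rather than a real vfio-user message, and returning an empty payload when the buffer is too small is a simplification made for the example.

```python
from typing import Callable, Tuple


def fake_server_reply(request_argsz: int, full_payload: bytes) -> Tuple[int, bytes]:
    """Hypothetical stand-in for a server: always reports the needed size."""
    needed = len(full_payload)
    if needed > request_argsz:
        # Simplification for this sketch: omit the payload when it does not fit.
        return needed, b""
    return needed, full_payload


def query_with_argsz(send_request: Callable[[int], Tuple[int, bytes]],
                     initial_argsz: int) -> bytes:
    """Send a request and retry once with the argsz value the reply asked for."""
    argsz = initial_argsz
    for _attempt in range(2):
        reply_argsz, payload = send_request(argsz)
        if reply_argsz <= argsz:
            # The full payload fitted in the size we offered.
            return payload
        # The reply tells us how much room is needed; offer that on the retry.
        argsz = reply_argsz
    raise RuntimeError("reply still does not fit after retrying with the reported argsz")


payload = bytes(range(40))
result = query_with_argsz(lambda argsz: fake_server_reply(argsz, payload), 16)
print(len(result))
```

The retry is bounded so that a misbehaving peer that keeps reporting a larger ``argsz`` cannot keep the client looping forever.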
+
+Protocol Specification
+======================
+
+To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
+with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the
+endianness of the host system, although this may be relaxed in future
+revisions in cases where the client and server run on different hosts
+with different endianness.
+
+Unless otherwise specified, all sizes should be presumed to be in bytes.
+
+.. _Commands:
+
+Commands
+--------
+The following table lists the VFIO message command IDs, and whether the
+message command is sent from the client or the server.
+
+======================================  =========  =================
+Name                                    Command    Request Direction
+======================================  =========  =================
+``VFIO_USER_VERSION``                   1          client -> server
+``VFIO_USER_DMA_MAP``                   2          client -> server
+``VFIO_USER_DMA_UNMAP``                 3          client -> server
+``VFIO_USER_DEVICE_GET_INFO``           4          client -> server
+``VFIO_USER_DEVICE_GET_REGION_INFO``    5          client -> server
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS``  6          client -> server
+``VFIO_USER_DEVICE_GET_IRQ_INFO``       7          client -> server
+``VFIO_USER_DEVICE_SET_IRQS``           8          client -> server
+``VFIO_USER_REGION_READ``               9          client -> server
+``VFIO_USER_REGION_WRITE``              10         client -> server
+``VFIO_USER_DMA_READ``                  11         server -> client
+``VFIO_USER_DMA_WRITE``                 12         server -> client
+``VFIO_USER_DEVICE_RESET``              13         client -> server
+``VFIO_USER_DIRTY_PAGES``               14         client -> server
+======================================  =========  =================
+
+Header
+------
+
+All messages, both command messages and reply messages, are preceded by a
+16-byte header that contains basic information about the message. The header is
+followed by message-specific data described in the sections below.
+
++----------------+--------+-------------+
+| Name           | Offset | Size        |
++================+========+=============+
+| Message ID     | 0      | 2           |
++----------------+--------+-------------+
+| Command        | 2      | 2           |
++----------------+--------+-------------+
+| Message size   | 4      | 4           |
++----------------+--------+-------------+
+| Flags          | 8      | 4           |
++----------------+--------+-------------+
+|                | +-----+------------+ |
+|                | | Bit | Definition | |
+|                | +=====+============+ |
+|                | | 0-3 | Type       | |
+|                | +-----+------------+ |
+|                | | 4   | No_reply   | |
+|                | +-----+------------+ |
+|                | | 5   | Error      | |
+|                | +-----+------------+ |
++----------------+--------+-------------+
+| Error          | 12     | 4           |
++----------------+--------+-------------+
+|                | 16     | variable    |
++----------------+--------+-------------+
+
+* *Message ID* identifies the message, and is echoed in the command's reply
+  message. Message IDs belong entirely to the sender, can be re-used (even
+  concurrently) and the receiver must not make any assumptions about their
+  uniqueness.
+* *Command* specifies the command to be executed, listed in Commands_. It is
+  also set in the reply header.
+* *Message size* contains the size of the entire message, including the header.
+* *Flags* contains attributes of the message:
+
+  * The *Type* bits indicate the message type.
+
+    * *Command* (value 0x0) indicates a command message.
+    * *Reply* (value 0x1) indicates a reply message acknowledging a previous
+      command with the same message ID.
+  * *No_reply* in a command message indicates that no reply is needed for this
+    command.
+    This is commonly used when multiple commands are sent, and only
+    the last needs acknowledgement.
+  * *Error* in a reply message indicates the command being acknowledged had
+    an error. In this case, the *Error* field will be valid.
+
+* *Error* in a reply message is an optional UNIX errno value. It may be zero
+  even if the Error bit is set in Flags. It is reserved in a command message.
+
+Each command message in Commands_ must be replied to with a reply message,
+unless the message sets the *No_reply* bit. The reply consists of the header
+with the *Reply* bit set, plus any additional data.
+
+If an error occurs, the reply message must only include the reply header.
+
+As the header is standard in both requests and replies, it is not included in
+the command-specific specifications below; each message definition should be
+appended to the standard header, and the offsets are given from the end of the
+standard header.
+
+``VFIO_USER_VERSION``
+---------------------
+
+.. _Version:
+
+This is the initial message sent by the client after the socket connection is
+established; the same format is used for the server's reply.
+
+Upon establishing a connection, the client must send a ``VFIO_USER_VERSION``
+message proposing a protocol version and a set of capabilities. The server
+compares these with the versions and capabilities it supports and sends a
+``VFIO_USER_VERSION`` reply according to the following rules.
+
+* The major version in the reply must be the same as proposed. If the server
+  does not support the proposed major version, it closes the connection.
+* The minor version in the reply must be equal to or less than the minor
+  version proposed.
+* The capability list must be a subset of those proposed. If the server
+  requires a capability the client did not include, it closes the connection.
+
+The protocol major version will only change when incompatible protocol changes
+are made, such as changing the message format.
+The minor version may change
+when compatible changes are made, such as adding new messages or capabilities.
+Both the client and the server must support all minor versions lower than the
+maximum minor version they support. E.g., an implementation that supports
+version 1.3 must also support 1.0 through 1.2.
+
+When making a change to this specification, the protocol version number must
+be included in the form "added in version X.Y".
+
+Request
+^^^^^^^
+
+==============  ======  ====
+Name            Offset  Size
+==============  ======  ====
+version major   0       2
+version minor   2       2
+version data    4       variable (including terminating NUL). Optional.
+==============  ======  ====
+
+The version data is an optional UTF-8 encoded JSON byte array with the following
+format:
+
++--------------+--------+-----------------------------------+
+| Name         | Type   | Description                       |
++==============+========+===================================+
+| capabilities | object | Contains common capabilities that |
+|              |        | the sender supports. Optional.    |
++--------------+--------+-----------------------------------+
+
+Capabilities:
+
++--------------------+--------+------------------------------------------------+
+| Name               | Type   | Description                                    |
++====================+========+================================================+
+| max_msg_fds        | number | Maximum number of file descriptors that can be |
+|                    |        | received by the sender in one message.         |
+|                    |        | Optional. If not specified then the receiver   |
+|                    |        | must assume a value of ``1``.                  |
++--------------------+--------+------------------------------------------------+
+| max_data_xfer_size | number | Maximum ``count`` for data transfer messages;  |
+|                    |        | see `Read and Write Operations`_. Optional,    |
+|                    |        | with a default value of 1048576 bytes.         |
++--------------------+--------+------------------------------------------------+
+| migration          | object | Migration capability parameters. If missing    |
+|                    |        | then migration is not supported by the sender. |
++--------------------+--------+------------------------------------------------+
+
+The migration capability contains the following name/value pairs:
+
++--------+--------+-----------------------------------------------+
+| Name   | Type   | Description                                   |
++========+========+===============================================+
+| pgsize | number | Page size of dirty pages bitmap. The smallest |
+|        |        | between the client and the server is used.    |
++--------+--------+-----------------------------------------------+
+
+Reply
+^^^^^
+
+The same message format is used in the server's reply with the semantics
+described above.
+
+``VFIO_USER_DMA_MAP``
+---------------------
+
+This command message is sent by the client to the server to inform it of the
+memory regions the server can access. It must be sent before the server can
+perform any DMA to the client. It is normally sent directly after the version
+handshake is completed, but may also occur when memory is added to the client,
+or if the client uses a vIOMMU.
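Since ``VFIO_USER_DMA_MAP`` normally follows the version handshake, it may help to see the handshake message concretely first. The following is a minimal Python sketch, not part of the specification: it packs the 16-byte header and the ``VFIO_USER_VERSION`` payload in host byte order and applies the major/minor rules listed earlier. The function names ``build_version_message`` and ``negotiate_version`` are hypothetical, and capability negotiation is deliberately left out.

```python
import json
import struct

VFIO_USER_VERSION = 1                       # command ID from the Commands table
HEADER_FMT = "=HHIII"                       # message ID, command, size, flags, error
HEADER_SIZE = struct.calcsize(HEADER_FMT)   # 16 bytes
TYPE_COMMAND, TYPE_REPLY = 0x0, 0x1         # values of the Type bits (0-3) in Flags


def build_version_message(msg_id, major, minor, capabilities=None, reply=False):
    """Pack a VFIO_USER_VERSION command or reply in host byte order."""
    payload = struct.pack("=HH", major, minor)
    if capabilities is not None:
        # Optional version data: UTF-8 JSON followed by a terminating NUL.
        text = json.dumps({"capabilities": capabilities})
        payload += text.encode("utf-8") + b"\0"
    flags = TYPE_REPLY if reply else TYPE_COMMAND
    # Message size covers the header as well as the payload.
    header = struct.pack(HEADER_FMT, msg_id, VFIO_USER_VERSION,
                         HEADER_SIZE + len(payload), flags, 0)
    return header + payload


def negotiate_version(supported, proposed):
    """Return the (major, minor) for the reply, or None to close the connection."""
    if proposed[0] != supported[0]:
        return None
    return proposed[0], min(proposed[1], supported[1])
```

For instance, ``build_version_message(7, 0, 1, {"max_msg_fds": 8})`` yields a message whose size field equals its total length, and a side supporting 0.1 that is offered 0.3 would answer with 0.1.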
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format:
+
++-------------+--------+-------------+
+| Name        | Offset | Size        |
++=============+========+=============+
+| argsz       | 0      | 4           |
++-------------+--------+-------------+
+| flags       | 4      | 4           |
++-------------+--------+-------------+
+|             | +-----+------------+ |
+|             | | Bit | Definition | |
+|             | +=====+============+ |
+|             | | 0   | readable   | |
+|             | +-----+------------+ |
+|             | | 1   | writeable  | |
+|             | +-----+------------+ |
++-------------+--------+-------------+
+| offset      | 8      | 8           |
++-------------+--------+-------------+
+| address     | 16     | 8           |
++-------------+--------+-------------+
+| size        | 24     | 8           |
++-------------+--------+-------------+
+
+* *argsz* is the size of the above structure. Note there is no reply payload,
+  so this field differs from other message types.
+* *flags* contains the following region attributes:
+
+  * *readable* indicates that the region can be read from.
+
+  * *writeable* indicates that the region can be written to.
+
+* *offset* is the file offset of the region with respect to the associated file
+  descriptor, or zero if the region is not mappable.
+* *address* is the base DMA address of the region.
+* *size* is the size of the region.
+
+This structure is 32 bytes in size, so the message size is 16 + 32 bytes.
+
+If the DMA region being added can be directly mapped by the server, a file
+descriptor must be sent as part of the message meta-data. The region can be
+mapped via the mmap() system call. On ``AF_UNIX`` sockets, the file descriptor
+must be passed as ``SCM_RIGHTS`` type ancillary data.
+Otherwise, if the DMA
+region cannot be directly mapped by the server, a file descriptor must not be
+sent as part of the message meta-data, and the DMA region can be accessed by the
+server using ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages,
+explained in `Read and Write Operations`_. A command to map over an existing
+region must be failed by the server with ``EEXIST`` set in the error field of
+the reply.
+
+Reply
+^^^^^
+
+There is no payload in the reply message.
+
+``VFIO_USER_DMA_UNMAP``
+-----------------------
+
+This command message is sent by the client to the server to inform it that a
+DMA region, previously made available via a ``VFIO_USER_DMA_MAP`` command
+message, is no longer available for DMA. It typically occurs when memory is
+removed from the client or if the client uses a vIOMMU. The DMA region is
+described by the request structure below.
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format:
+
++--------------+--------+------------------------+
+| Name         | Offset | Size                   |
++==============+========+========================+
+| argsz        | 0      | 4                      |
++--------------+--------+------------------------+
+| flags        | 4      | 4                      |
++--------------+--------+------------------------+
+|              | +-----+-----------------------+ |
+|              | | Bit | Definition            | |
+|              | +=====+=======================+ |
+|              | | 0   | get dirty page bitmap | |
+|              | +-----+-----------------------+ |
++--------------+--------+------------------------+
+| address      | 8      | 8                      |
++--------------+--------+------------------------+
+| size         | 16     | 8                      |
++--------------+--------+------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+* *flags* contains the following DMA region attributes:
+
+  * *get dirty page bitmap* indicates that a dirty page bitmap must be
+    populated before unmapping the DMA region. The client must provide a
+    `VFIO Bitmap`_ structure, explained below, immediately following this
+    entry.
+
+* *address* is the base DMA address of the DMA region.
+* *size* is the size of the DMA region.
+
+The address and size of the DMA region being unmapped must match exactly a
+previous mapping. The size of the request message depends on whether or not the
+*get dirty page bitmap* bit is set in *flags*:
+
+* If not set, the size of the total request message is: 16 + 24.
+
+* If set, the size of the total request message is: 16 + 24 + 16.
+
+.. _VFIO Bitmap:
+
+VFIO Bitmap Format
+""""""""""""""""""
+
++--------+--------+------+
+| Name   | Offset | Size |
++========+========+======+
+| pgsize | 0      | 8    |
++--------+--------+------+
+| size   | 8      | 8    |
++--------+--------+------+
+
+* *pgsize* is the page size for the bitmap, in bytes.
+* *size* is the size for the bitmap, in bytes, excluding the VFIO bitmap header.
+
+Reply
+^^^^^
+
+Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is
+mapped then the server must release all references to that DMA region before
+replying, which potentially includes in-flight DMA transactions.
+
+The server responds with the original DMA entry in the request. If the
+*get dirty page bitmap* bit is set in flags in the request, then
+the server also includes the `VFIO Bitmap`_ structure sent in the request,
+followed by the corresponding dirty page bitmap, where each bit represents
+one page of size *pgsize* in `VFIO Bitmap`_.
+
+The total size of the reply message is:
+16 + 24 + (16 + *size* in `VFIO Bitmap`_ if *get dirty page bitmap* is set).
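The size arithmetic above can be checked with a short Python sketch. It is illustrative only: ``build_dma_unmap`` and ``reply_message_size`` are hypothetical helper names, the byte order is the host's, and the caller is assumed to supply the bitmap size in bytes, since the text above does not state a rounding rule.

```python
import struct

HEADER_SIZE = 16
DMA_UNMAP_FMT = "=IIQQ"        # argsz, flags, address, size (24 bytes)
VFIO_BITMAP_FMT = "=QQ"        # pgsize, size (16 bytes)
GET_DIRTY_PAGE_BITMAP = 1 << 0


def build_dma_unmap(address, size, pgsize=None, bitmap_size=0):
    """Return the request payload; pass pgsize to ask for a dirty page bitmap."""
    flags = 0
    trailer = b""
    reply_extra = 0
    if pgsize is not None:
        flags |= GET_DIRTY_PAGE_BITMAP
        # The VFIO Bitmap structure immediately follows the unmap structure.
        trailer = struct.pack(VFIO_BITMAP_FMT, pgsize, bitmap_size)
        reply_extra = len(trailer) + bitmap_size
    # argsz is the maximum size of the reply payload.
    argsz = struct.calcsize(DMA_UNMAP_FMT) + reply_extra
    return struct.pack(DMA_UNMAP_FMT, argsz, flags, address, size) + trailer


def reply_message_size(bitmap_size=None):
    """Total reply size including the header; bitmap_size=None means no bitmap."""
    total = HEADER_SIZE + struct.calcsize(DMA_UNMAP_FMT)
    if bitmap_size is not None:
        total += struct.calcsize(VFIO_BITMAP_FMT) + bitmap_size
    return total
```

Adding the 16-byte header to the returned payload gives 40 bytes without the bitmap flag and 56 bytes with it, matching the two cases listed for the request.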
+
+``VFIO_USER_DEVICE_GET_INFO``
+-----------------------------
+
+This command message is sent by the client to the server to query for basic
+information about the device.
+
+Request
+^^^^^^^
+
++-------------+--------+--------------------------+
+| Name        | Offset | Size                     |
++=============+========+==========================+
+| argsz       | 0      | 4                        |
++-------------+--------+--------------------------+
+| flags       | 4      | 4                        |
++-------------+--------+--------------------------+
+|             | +-----+-------------------------+ |
+|             | | Bit | Definition              | |
+|             | +=====+=========================+ |
+|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
+|             | +-----+-------------------------+ |
+|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
+|             | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 8      | 4                        |
++-------------+--------+--------------------------+
+| num_irqs    | 12     | 4                        |
++-------------+--------+--------------------------+
+
+* *argsz* is the maximum size of the reply payload
+* all other fields must be zero.
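As a small illustration of the request layout above, the Python sketch below packs a ``VFIO_USER_DEVICE_GET_INFO`` request payload with only *argsz* set. The helper name ``build_get_info_request`` is hypothetical, and host byte order is assumed as stated at the top of this specification.

```python
import struct

DEVICE_INFO_FMT = "=IIII"   # argsz, flags, num_regions, num_irqs
DEVICE_INFO_SIZE = struct.calcsize(DEVICE_INFO_FMT)   # 16 bytes


def build_get_info_request(argsz=DEVICE_INFO_SIZE):
    """Pack the request payload: argsz set, every other field zero."""
    return struct.pack(DEVICE_INFO_FMT, argsz, 0, 0, 0)
```

The standard 16-byte header still has to be prepended before the payload is sent.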
+
+Reply
+^^^^^
+
++-------------+--------+--------------------------+
+| Name        | Offset | Size                     |
++=============+========+==========================+
+| argsz       | 0      | 4                        |
++-------------+--------+--------------------------+
+| flags       | 4      | 4                        |
++-------------+--------+--------------------------+
+|             | +-----+-------------------------+ |
+|             | | Bit | Definition              | |
+|             | +=====+=========================+ |
+|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
+|             | +-----+-------------------------+ |
+|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
+|             | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 8      | 4                        |
++-------------+--------+--------------------------+
+| num_irqs    | 12     | 4                        |
++-------------+--------+--------------------------+
+
+* *argsz* is the size required for the full reply payload (16 bytes today)
+* *flags* contains the following device attributes.
+
+  * ``VFIO_DEVICE_FLAGS_RESET`` indicates that the device supports the
+    ``VFIO_USER_DEVICE_RESET`` message.
+  * ``VFIO_DEVICE_FLAGS_PCI`` indicates that the device is a PCI device.
+
+* *num_regions* is the number of memory regions that the device exposes.
+* *num_irqs* is the number of distinct interrupt types that the device supports.
+
+This version of the protocol only supports PCI devices. Additional devices may
+be supported in future versions.
+
+``VFIO_USER_DEVICE_GET_REGION_INFO``
+------------------------------------
+
+This command message is sent by the client to the server to query for
+information about device regions. The VFIO region info structure is defined in
+```` (``struct vfio_region_info``).
+
+Request
+^^^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 0      | 4                            |
++------------+--------+------------------------------+
+| flags      | 4      | 4                            |
++------------+--------+------------------------------+
+| index      | 8      | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 12     | 4                            |
++------------+--------+------------------------------+
+| size       | 16     | 8                            |
++------------+--------+------------------------------+
+| offset     | 24     | 8                            |
++------------+--------+------------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+* *index* is the index of the memory region being queried; it is the only field
+  that is required to be set in the command message.
+* all other fields must be zero.
+
+Reply
+^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 0      | 4                            |
++------------+--------+------------------------------+
+| flags      | 4      | 4                            |
++------------+--------+------------------------------+
+|            | +-----+-----------------------------+ |
+|            | | Bit | Definition                  | |
+|            | +=====+=============================+ |
+|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
+|            | +-----+-----------------------------+ |
+|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
+|            | +-----+-----------------------------+ |
+|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
+|            | +-----+-----------------------------+ |
+|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
+|            | +-----+-----------------------------+ |
++------------+--------+------------------------------+
+| index      | 8      | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 12     | 4                            |
++------------+--------+------------------------------+
+| size       | 16     | 8                            |
++------------+--------+------------------------------+
+| offset     | 24     | 8                            |
++------------+--------+------------------------------+
+
+* *argsz* is the size required for the full reply payload (region info structure
+  plus the size of any region capabilities)
+* *flags* are attributes of the region:
+
+  * ``VFIO_REGION_INFO_FLAG_READ`` allows client read access to the region.
+  * ``VFIO_REGION_INFO_FLAG_WRITE`` allows client write access to the region.
+  * ``VFIO_REGION_INFO_FLAG_MMAP`` specifies the client can mmap() the region.
+    When this flag is set, the reply will include a file descriptor in its
+    meta-data. On ``AF_UNIX`` sockets, the file descriptors will be passed as
+    ``SCM_RIGHTS`` type ancillary data.
+  * ``VFIO_REGION_INFO_FLAG_CAPS`` indicates additional capabilities found in the
+    reply.
+
+* *index* is the index of the memory region being queried; it is the only field
+  that is required to be set in the command message.
+* *cap_offset* describes where additional region capabilities can be found.
+  cap_offset is relative to the beginning of the VFIO region info structure.
+  The data structure it points to is a VFIO cap header defined in
+  ````.
+* *size* is the size of the region.
+* *offset* is the offset that should be given to the mmap() system call for
+  regions with the MMAP attribute. It is also used as the base offset when
+  mapping a VFIO sparse mmap area, described below.
+
+VFIO region capabilities
+""""""""""""""""""""""""
+
+The VFIO region information can also include a capabilities list. This list is
+similar to a PCI capability list: each entry has a common header that
+identifies a capability and where the next capability in the list can be found.
+The VFIO capability header format is defined in ```` (``struct
+vfio_info_cap_header``).
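The fixed 32-byte region info layout described above can be decoded as in the following Python sketch. This is an illustration under stated assumptions rather than a reference implementation: the helper names and the returned dictionary keys are hypothetical, host byte order is assumed, and walking the capability list itself is left out.

```python
import struct

REGION_INFO_FMT = "=IIIIQQ"   # argsz, flags, index, cap_offset, size, offset
REGION_INFO_SIZE = struct.calcsize(REGION_INFO_FMT)   # 32 bytes
FLAG_BITS = {0: "READ", 1: "WRITE", 2: "MMAP", 3: "CAPS"}


def build_region_info_request(index, argsz=REGION_INFO_SIZE):
    """Pack the request payload: only argsz and index are set."""
    return struct.pack(REGION_INFO_FMT, argsz, 0, index, 0, 0, 0)


def parse_region_info_reply(payload):
    """Decode the fixed part of the reply payload into a dictionary."""
    argsz, flags, index, cap_offset, size, offset = struct.unpack_from(
        REGION_INFO_FMT, payload)
    return {
        "argsz": argsz,
        "flags": [name for bit, name in FLAG_BITS.items() if flags & (1 << bit)],
        "index": index,
        # cap_offset is only meaningful when the CAPS flag is set.
        "cap_offset": cap_offset if flags & (1 << 3) else None,
        "size": size,
        "offset": offset,
    }
```

A client would typically check the ``MMAP`` entry before trying to map the file descriptor received in the ancillary data, using ``offset`` as the mmap() offset.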
+
+VFIO cap header format
+""""""""""""""""""""""
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| id      | 0      | 2    |
++---------+--------+------+
+| version | 2      | 2    |
++---------+--------+------+
+| next    | 4      | 4    |
++---------+--------+------+
+
+* *id* is the capability identity.
+* *version* is a capability-specific version number.
+* *next* specifies the offset of the next capability in the capability list. It
+  is relative to the beginning of the VFIO region info structure.
+
+VFIO sparse mmap cap header
+"""""""""""""""""""""""""""
+
++------------------+----------------------------------+
+| Name             | Value                            |
++==================+==================================+
+| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
++------------------+----------------------------------+
+| version          | 0x1                              |
++------------------+----------------------------------+
+| next             |                                  |
++------------------+----------------------------------+
+| sparse mmap info | VFIO region info sparse mmap     |
++------------------+----------------------------------+
+
+This capability is defined when only a subrange of the region supports
+direct access by the client via mmap(). The VFIO sparse mmap area is defined in
+```` (``struct vfio_region_sparse_mmap_area`` and ``struct
+vfio_region_info_cap_sparse_mmap``).
+
+VFIO region info cap sparse mmap
+""""""""""""""""""""""""""""""""
+
++----------+--------+------+
+| Name     | Offset | Size |
++==========+========+======+
+| nr_areas | 0      | 4    |
++----------+--------+------+
+| reserved | 4      | 4    |
++----------+--------+------+
+| offset   | 8      | 8    |
++----------+--------+------+
+| size     | 16     | 8    |
++----------+--------+------+
+| ...      |        |      |
++----------+--------+------+
+
+* *nr_areas* is the number of sparse mmap areas in the region.
+* *offset* and *size* describe a single area that can be mapped by the client.
+  There will be *nr_areas* pairs of offset and size. The offset will be added to
+  the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` reply to
+  form the offset argument of the subsequent mmap() call.
+
+The VFIO sparse mmap area is defined in ```` (``struct
+vfio_region_info_cap_sparse_mmap``).
+
+VFIO region type cap header
+"""""""""""""""""""""""""""
+
++------------------+---------------------------+
+| Name             | Value                     |
++==================+===========================+
+| id               | VFIO_REGION_INFO_CAP_TYPE |
++------------------+---------------------------+
+| version          | 0x1                       |
++------------------+---------------------------+
+| next             |                           |
++------------------+---------------------------+
+| region info type | VFIO region info type     |
++------------------+---------------------------+
+
+This capability is defined when a region is specific to the device.
+
+VFIO region info type cap
+"""""""""""""""""""""""""
+
+The VFIO region info type is defined in ````
+(``struct vfio_region_info_cap_type``).
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| type    | 0      | 4    |
++---------+--------+------+
+| subtype | 4      | 4    |
++---------+--------+------+
+
+The only device-specific region type and subtype supported by vfio-user are
+``VFIO_REGION_TYPE_MIGRATION`` (3) and ``VFIO_REGION_SUBTYPE_MIGRATION`` (1).
+
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS``
+--------------------------------------
+
+Clients can access regions via ``VFIO_USER_REGION_READ/WRITE`` or, if provided, by
+``mmap()`` of a file descriptor provided by the server.
+
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` provides an alternative access mechanism via
+file descriptors.
+This is an optional feature intended for performance
+improvements where an underlying sub-system (such as KVM) supports communication
+across such file descriptors to the vfio-user server, without needing to
+round-trip through the client.
+
+The server returns an array of sub-regions for the requested region. Each
+sub-region describes a span (offset and size) of a region, along with the
+requested file descriptor notification mechanism to use. Each sub-region in the
+response message may choose to use a different method, as defined below. The
+two mechanisms supported in this specification are ioeventfds and ioregionfds.
+
+In addition, the server returns a file descriptor in the ancillary data; clients
+are expected to configure each sub-region's file descriptor with the requested
+notification method. For example, a client could configure KVM with the
+requested ioeventfd via a ``KVM_IOEVENTFD`` ``ioctl()``.
+
+Request
+^^^^^^^
+
++-------------+--------+------+
+| Name        | Offset | Size |
++=============+========+======+
+| argsz       | 0      | 4    |
++-------------+--------+------+
+| flags       | 4      | 4    |
++-------------+--------+------+
+| index       | 8      | 4    |
++-------------+--------+------+
+| count       | 12     | 4    |
++-------------+--------+------+
+
+* *argsz* is the maximum size of the reply payload
+* *index* is the index of the memory region being queried
+* all other fields must be zero
+
+The client must set ``flags`` to zero and specify the region being queried in
+the ``index``.
+
+Reply
+^^^^^
+
++-------------+--------+------+
+| Name        | Offset | Size |
++=============+========+======+
+| argsz       | 0      | 4    |
++-------------+--------+------+
+| flags       | 4      | 4    |
++-------------+--------+------+
+| index       | 8      | 4    |
++-------------+--------+------+
+| count       | 12     | 4    |
++-------------+--------+------+
+| sub-regions | 16     | ...  |
++-------------+--------+------+
+
+* *argsz* is the size of the region IO FD info structure plus the
+  total size of the sub-region array. Thus, each array entry "i" is at offset
+  i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO
+  FD types, but this is not to be relied on. As elsewhere, this indicates the
+  full reply payload size needed.
+* *flags* must be zero
+* *index* is the index of the memory region being queried
+* *count* is the number of sub-regions in the array
+* *sub-regions* is the array of Sub-Region IO FD info structures
+
+The reply message will additionally include at least one file descriptor in the
+ancillary data. Note that more than one sub-region may share the same file
+descriptor.
+
+Note that it is the client's responsibility to verify the requested values (for
+example, that the requested offset does not exceed the region's bounds).
+
+Each sub-region given in the response has one of two possible structures,
+depending on whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or
+``VFIO_USER_IO_FD_TYPE_IOREGIONFD``:
+
+Sub-Region IO FD info format (ioeventfd)
+""""""""""""""""""""""""""""""""""""""""
+
++-----------+--------+------+
+| Name      | Offset | Size |
++===========+========+======+
+| offset    | 0      | 8    |
++-----------+--------+------+
+| size      | 8      | 8    |
++-----------+--------+------+
+| fd_index  | 16     | 4    |
++-----------+--------+------+
+| type      | 20     | 4    |
++-----------+--------+------+
+| flags     | 24     | 4    |
++-----------+--------+------+
+| padding   | 28     | 4    |
++-----------+--------+------+
+| datamatch | 32     | 8    |
++-----------+--------+------+
+
+* *offset* is the offset of the start of the sub-region within the region
+  requested ("physical address offset" for the region)
+* *size* is the length of the sub-region.
+  This may be zero if the access size is
+  not relevant, which may allow for optimizations
+* *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
+  notification; it may be shared.
+* *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD``
+* *flags* is any of:
+
+  * ``KVM_IOEVENTFD_FLAG_DATAMATCH``
+  * ``KVM_IOEVENTFD_FLAG_PIO``
+  * ``KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY`` (FIXME: makes sense?)
+
+* *datamatch* is the datamatch value if needed
+
+See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt, *4.59
+KVM_IOEVENTFD* for further context on the ioeventfd-specific fields.
+
+Sub-Region IO FD info format (ioregionfd)
+"""""""""""""""""""""""""""""""""""""""""
+
++-----------+--------+------+
+| Name      | Offset | Size |
++===========+========+======+
+| offset    | 0      | 8    |
++-----------+--------+------+
+| size      | 8      | 8    |
++-----------+--------+------+
+| fd_index  | 16     | 4    |
++-----------+--------+------+
+| type      | 20     | 4    |
++-----------+--------+------+
+| flags     | 24     | 4    |
++-----------+--------+------+
+| padding   | 28     | 4    |
++-----------+--------+------+
+| user_data | 32     | 8    |
++-----------+--------+------+
+
+* *offset* is the offset of the start of the sub-region within the region
+  requested ("physical address offset" for the region)
+* *size* is the length of the sub-region.
+  This may be zero if the access size is
+  not relevant, which may allow for optimizations; ``KVM_IOREGION_POSTED_WRITES``
+  must be set in *flags* in this case
+* *fd_index* is the index in the ancillary data of the FD to use for ioregionfd
+  messages; it may be shared
+* *type* is ``VFIO_USER_IO_FD_TYPE_IOREGIONFD``
+* *flags* is any of:
+
+  * ``KVM_IOREGION_PIO``
+  * ``KVM_IOREGION_POSTED_WRITES``
+
+* *user_data* is an opaque value passed back to the server via a message on the
+  file descriptor
+
+For further information on the ioregionfd-specific fields, see:
+https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/
+
+(FIXME: update with final API docs.)
+
+``VFIO_USER_DEVICE_GET_IRQ_INFO``
+---------------------------------
+
+This command message is sent by the client to the server to query for
+information about device interrupt types. The VFIO IRQ info structure is
+defined in ```` (``struct vfio_irq_info``).
+
+Request
+^^^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 0      | 4                         |
++-------+--------+---------------------------+
+| flags | 4      | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 8      | 4                         |
++-------+--------+---------------------------+
+| count | 12     | 4                         |
++-------+--------+---------------------------+
+
+* *argsz* is the maximum size of the reply payload (16 bytes today)
+* *index* is the index of the IRQ type being queried (e.g. ``VFIO_PCI_MSIX_IRQ_INDEX``)
+* all other fields must be zero
+
+Reply
+^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 0      | 4                         |
++-------+--------+---------------------------+
+| flags | 4      | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 8      | 4                         |
++-------+--------+---------------------------+
+| count | 12     | 4                         |
++-------+--------+---------------------------+
+
+* *argsz* is the size required for the full reply payload (16 bytes today)
+* *flags* defines IRQ attributes:
+
+  * ``VFIO_IRQ_INFO_EVENTFD`` indicates the IRQ type can support server eventfd
+    signalling.
+  * ``VFIO_IRQ_INFO_MASKABLE`` indicates that the IRQ type supports the ``MASK``
+    and ``UNMASK`` actions in a ``VFIO_USER_DEVICE_SET_IRQS`` message.
+  * ``VFIO_IRQ_INFO_AUTOMASKED`` indicates the IRQ type masks itself after being
+    triggered, and the client must send an ``UNMASK`` action to receive new
+    interrupts.
+  * ``VFIO_IRQ_INFO_NORESIZE`` indicates ``VFIO_USER_DEVICE_SET_IRQS`` operations
+    set up interrupts as a set, and new sub-indexes cannot be enabled without
+    disabling the entire type.
+* *index* is the index of the IRQ type being queried
+* *count* describes the number of interrupts of the queried type.
+
+``VFIO_USER_DEVICE_SET_IRQS``
+-----------------------------
+
+This command message is sent by the client to the server to set actions for
+device interrupt types. The VFIO IRQ set structure is defined in
+```` (``struct vfio_irq_set``).
+
+Request
+^^^^^^^
+
++-------+--------+------------------------------+
+| Name  | Offset | Size                         |
++=======+========+==============================+
+| argsz | 0      | 4                            |
++-------+--------+------------------------------+
+| flags | 4      | 4                            |
++-------+--------+------------------------------+
+|       | +-----+-----------------------------+ |
+|       | | Bit | Definition                  | |
+|       | +=====+=============================+ |
+|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
+|       | +-----+-----------------------------+ |
+|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
+|       | +-----+-----------------------------+ |
+|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
+|       | +-----+-----------------------------+ |
+|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
+|       | +-----+-----------------------------+ |
+|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
+|       | +-----+-----------------------------+ |
+|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
+|       | +-----+-----------------------------+ |
++-------+--------+------------------------------+
+| index | 8      | 4                            |
++-------+--------+------------------------------+
+| start | 12     | 4                            |
++-------+--------+------------------------------+
+| count | 16     | 4                            |
++-------+--------+------------------------------+
+| data  | 20     | variable                     |
++-------+--------+------------------------------+
+
+* *argsz* is the size of the VFIO IRQ set request payload, including any *data*
+  field. Note there is no reply payload, so this field differs from other
+  message types.
+* *flags* defines the action performed on the interrupt range. The ``DATA`` + flags describe the data field sent in the message; the ``ACTION`` flags + describe the action to be performed. The flags are mutually exclusive for + both sets. + + * ``VFIO_IRQ_SET_DATA_NONE`` indicates there is no data field in the com= mand. + The action is performed unconditionally. + * ``VFIO_IRQ_SET_DATA_BOOL`` indicates the data field is an array of boo= lean + bytes. The action is performed if the corresponding boolean is true. + * ``VFIO_IRQ_SET_DATA_EVENTFD`` indicates an array of event file descrip= tors + was sent in the message meta-data. These descriptors will be signalled= when + the action defined by the action flags occurs. In ``AF_UNIX`` sockets,= the + descriptors are sent as ``SCM_RIGHTS`` type ancillary data. + If no file descriptors are provided, this de-assigns the specified + previously configured interrupts. + * ``VFIO_IRQ_SET_ACTION_MASK`` indicates a masking event. It can be used= with + ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to mask an in= terrupt, + or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the gu= est masks + the interrupt. + * ``VFIO_IRQ_SET_ACTION_UNMASK`` indicates an unmasking event. It can be= used + with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to unmas= k an + interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event = when the + guest unmasks the interrupt. + * ``VFIO_IRQ_SET_ACTION_TRIGGER`` indicates a triggering event. It can b= e used + with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to trigg= er an + interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event = when the + server triggers the interrupt. + +* *index* is the index of IRQ type being setup. +* *start* is the start of the sub-index being set. +* *count* describes the number of sub-indexes being set. 
As a special case, a + count (and start) of 0, with data flags of ``VFIO_IRQ_SET_DATA_NONE`` disables + all interrupts of the index. +* *data* is an optional field included when the + ``VFIO_IRQ_SET_DATA_BOOL`` flag is present. It contains an array of booleans + that specify whether the action is to be performed on the corresponding + index. It's used when the action is only performed on a subset of the range + specified. + +Not all interrupt types support every combination of data and action flags. +The client must know the capabilities of the device and IRQ index before it +sends a ``VFIO_USER_DEVICE_SET_IRQS`` message. + +In typical operation, a specific IRQ may operate as follows: + +1. The client sends a ``VFIO_USER_DEVICE_SET_IRQS`` message with + ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_TRIGGER)`` along + with an eventfd. This associates the IRQ with a particular eventfd on the + server side. + +#. The client may send a ``VFIO_USER_DEVICE_SET_IRQS`` message with + ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_MASK/UNMASK)`` along + with another eventfd. This associates the given eventfd with the + mask/unmask state on the server side. + +#. The server may trigger the IRQ by writing 1 to the eventfd. + +#. The server may mask/unmask an IRQ, which writes 1 to the corresponding + mask/unmask eventfd, if there is one. + +5. A client may trigger a device IRQ itself, by sending a + ``VFIO_USER_DEVICE_SET_IRQS`` message with + ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_TRIGGER)``. + +6. A client may mask or unmask the IRQ, by sending a + ``VFIO_USER_DEVICE_SET_IRQS`` message with + ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_MASK/UNMASK)``. + +Reply +^^^^^ + +There is no payload in the reply. + +.. _Read and Write Operations: + +Note that all of these operations must be supported by the client and/or server, +even if the corresponding memory or device region has been shared as mappable.
+ +The ``count`` field must not exceed the value of ``max_data_xfer_size`` of= the +peer, for both reads and writes. + +``VFIO_USER_REGION_READ`` +------------------------- + +If a device region is not mappable, it's not directly accessible by the cl= ient +via ``mmap()`` of the underlying file descriptor. In this case, a client c= an +read from a device region with this message. + +Request +^^^^^^^ + ++--------+--------+----------+ +| Name | Offset | Size | ++=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ + +* *offset* into the region being accessed. +* *region* is the index of the region being accessed. +* *count* is the size of the data to be transferred. + +Reply +^^^^^ + ++--------+--------+----------+ +| Name | Offset | Size | ++=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ +| data | 16 | variable | ++--------+--------+----------+ + +* *offset* into the region accessed. +* *region* is the index of the region accessed. +* *count* is the size of the data transferred. +* *data* is the data that was read from the device region. + +``VFIO_USER_REGION_WRITE`` +-------------------------- + +If a device region is not mappable, it's not directly accessible by the cl= ient +via mmap() of the underlying fd. In this case, a client can write to a dev= ice +region with this message. 
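A minimal Python sketch of how a client might respect the ``max_data_xfer_size`` limit when writing to a region: it splits the buffer into several ``VFIO_USER_REGION_WRITE`` payloads, each using the request layout given in the table below (``offset``, ``region``, ``count``, then ``data``). Splitting is only one way to comply and is not mandated by this specification. The helper name ``region_write_payloads`` and the little-endian byte order are assumptions made for illustration, and the vfio-user message header is omitted.

```python
import struct

# offset (8 bytes), region (4 bytes), count (4 bytes).
# Little-endian byte order is an assumption of this sketch.
REGION_ACCESS_HEADER = struct.Struct("<QII")


def region_write_payloads(region, offset, data, max_data_xfer_size):
    """Hypothetical helper: build one request payload per chunk of data.

    Each chunk carries at most max_data_xfer_size bytes, so the count
    field of every request stays within the peer's advertised limit.
    """
    if max_data_xfer_size <= 0:
        raise ValueError("max_data_xfer_size must be positive")
    payloads = []
    for start in range(0, len(data), max_data_xfer_size):
        chunk = data[start:start + max_data_xfer_size]
        header = REGION_ACCESS_HEADER.pack(offset + start, region, len(chunk))
        payloads.append(header + chunk)
    return payloads
```

With a 4-byte limit, a 10-byte write at offset ``0x100`` becomes three payloads whose ``offset`` fields advance by the size of the preceding chunk.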
+ +Request +^^^^^^^ + ++--------+--------+----------+ +| Name | Offset | Size | ++=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ +| data | 16 | variable | ++--------+--------+----------+ + +* *offset* into the region being accessed. +* *region* is the index of the region being accessed. +* *count* is the size of the data to be transferred. +* *data* is the data to write + +Reply +^^^^^ + ++--------+--------+----------+ +| Name | Offset | Size | ++=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D+ +| offset | 0 | 8 | ++--------+--------+----------+ +| region | 8 | 4 | ++--------+--------+----------+ +| count | 12 | 4 | ++--------+--------+----------+ + +* *offset* into the region accessed. +* *region* is the index of the region accessed. +* *count* is the size of the data transferred. + +``VFIO_USER_DMA_READ`` +----------------------- + +If the client has not shared mappable memory, the server can use this mess= age to +read from guest memory. + +Request +^^^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D+ +| address | 0 | 8 | ++---------+--------+----------+ +| count | 8 | 8 | ++---------+--------+----------+ + +* *address* is the client DMA memory address being accessed. This address = must have + been previously exported to the server with a ``VFIO_USER_DMA_MAP`` mess= age. +* *count* is the size of the data to be transferred. 
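The request payload above is just two 64-bit fields, which can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: little-endian byte order is assumed, the function names are hypothetical, and the vfio-user message header that precedes the payload is left out.

```python
import struct

# address (8 bytes at offset 0) followed by count (8 bytes at offset 8).
# Little-endian byte order is an assumption of this sketch.
DMA_READ_REQUEST = struct.Struct("<QQ")


def pack_dma_read_request(address, count):
    """Hypothetical helper: build the payload the server would send."""
    return DMA_READ_REQUEST.pack(address, count)


def unpack_dma_read_request(payload):
    """Hypothetical helper: parse the payload on the client side."""
    if len(payload) != DMA_READ_REQUEST.size:
        raise ValueError("DMA read request payload must be 16 bytes")
    return DMA_READ_REQUEST.unpack(payload)
```

A real client would additionally check that the ``address`` range was previously exported with ``VFIO_USER_DMA_MAP`` before servicing the read.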
+ +Reply +^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| address | 0 | 8 | ++---------+--------+----------+ +| count | 8 | 8 | ++---------+--------+----------+ +| data | 16 | variable | ++---------+--------+----------+ + +* *address* is the client DMA memory address being accessed. +* *count* is the size of the data transferred. +* *data* is the data read. + +``VFIO_USER_DMA_WRITE`` +----------------------- + +If the client has not shared mappable memory, the server can use this message to +write to guest memory. + +Request +^^^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| address | 0 | 8 | ++---------+--------+----------+ +| count | 8 | 8 | ++---------+--------+----------+ +| data | 16 | variable | ++---------+--------+----------+ + +* *address* is the client DMA memory address being accessed. This address must have + been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message. +* *count* is the size of the data to be transferred. +* *data* is the data to write. + +Reply +^^^^^ + ++---------+--------+----------+ +| Name | Offset | Size | ++=========+========+==========+ +| address | 0 | 8 | ++---------+--------+----------+ +| count | 8 | 4 | ++---------+--------+----------+ + +* *address* is the client DMA memory address being accessed. +* *count* is the size of the data transferred. + +``VFIO_USER_DEVICE_RESET`` +-------------------------- + +This command message is sent from the client to the server to reset the device. +Neither the request nor the reply has a payload. + +``VFIO_USER_DIRTY_PAGES`` +------------------------- + +This command is analogous to ``VFIO_IOMMU_DIRTY_PAGES``.
It is sent by the= client +to the server in order to control logging of dirty pages, usually during a= live +migration. + +Dirty page tracking is optional for server implementation; clients should = not +rely on it. + +Request +^^^^^^^ + ++-------+--------+-----------------------------------------+ +| Name | Offset | Size | ++=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D+ +| argsz | 0 | 4 | ++-------+--------+-----------------------------------------+ +| flags | 4 | 4 | ++-------+--------+-----------------------------------------+ +| | +-----+----------------------------------------+ | +| | | Bit | Definition | | +| | +=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= + | +| | | 0 | VFIO_IOMMU_DIRTY_PAGES_FLAG_START | | +| | +-----+----------------------------------------+ | +| | | 1 | VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP | | +| | +-----+----------------------------------------+ | +| | | 2 | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | | +| | +-----+----------------------------------------+ | ++-------+--------+-----------------------------------------+ + +* *argsz* is the size of the VFIO dirty bitmap info structure for + ``START/STOP``; and for ``GET_BITMAP``, the maximum size of the reply pa= yload + +* *flags* defines the action to be performed by the server: + + * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` instructs the server to start lo= gging + pages it dirties. Logging continues until explicitly disabled by + ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``. + + * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP`` instructs the server to stop logg= ing + dirty pages. + + * ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP`` requests the server to retu= rn + the dirty bitmap for a specific IOVA range. 
The IOVA range is specifie= d by + a "VFIO Bitmap Range" structure, which must immediately follow this + "VFIO Dirty Pages" structure. See `VFIO Bitmap Range Format`_. + This operation is only valid if logging of dirty pages has been previo= usly + started. + + These flags are mutually exclusive with each other. + +This part of the request is analogous to VFIO's ``struct +vfio_iommu_type1_dirty_bitmap``. + +.. _VFIO Bitmap Range Format: + +VFIO Bitmap Range Format +"""""""""""""""""""""""" + ++--------+--------+------+ +| Name | Offset | Size | ++=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D+ +| iova | 0 | 8 | ++--------+--------+------+ +| size | 8 | 8 | ++--------+--------+------+ +| bitmap | 16 | 24 | ++--------+--------+------+ + +* *iova* is the IOVA offset + +* *size* is the size of the IOVA region + +* *bitmap* is the VFIO Bitmap explained in `VFIO Bitmap`_. + +This part of the request is analogous to VFIO's ``struct +vfio_iommu_type1_dirty_bitmap_get``. + +Reply +^^^^^ + +For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_START`` or +``VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP``, there is no reply payload. 
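Because ``START`` and ``STOP`` carry no bitmap range and have no reply payload, the whole exchange reduces to an 8-byte request. A hedged Python sketch of that request follows. The flag values come from the bit positions in the request table; the little-endian byte order and the helper name ``pack_dirty_pages_control`` are assumptions, and the vfio-user message header is omitted.

```python
import struct

# Bit positions taken from the flags table of the request.
VFIO_IOMMU_DIRTY_PAGES_FLAG_START = 1 << 0
VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP = 1 << 1
VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP = 1 << 2

# argsz (4 bytes) followed by flags (4 bytes).
# Little-endian byte order is an assumption of this sketch.
DIRTY_PAGES_HEADER = struct.Struct("<II")


def pack_dirty_pages_control(flag):
    """Hypothetical helper: build a START or STOP request payload.

    The flags are mutually exclusive, so exactly one of START or STOP
    is accepted; GET_BITMAP needs a bitmap range and is not handled here.
    """
    if flag not in (VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
                    VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP):
        raise ValueError("expected exactly one of START or STOP")
    # For START/STOP, argsz is the size of the dirty pages structure itself.
    return DIRTY_PAGES_HEADER.pack(DIRTY_PAGES_HEADER.size, flag)
```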
+ +For ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP``, the reply payload is as fo= llows: + ++--------------+--------+-----------------------------------------+ +| Name | Offset | Size | ++=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D+=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+ +| argsz | 0 | 4 | ++--------------+--------+-----------------------------------------+ +| flags | 4 | 4 | ++--------------+--------+-----------------------------------------+ +| | +-----+----------------------------------------+ | +| | | Bit | Definition | | +| | +=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D+ | +| | | 2 | VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP | | +| | +-----+----------------------------------------+ | ++--------------+--------+-----------------------------------------+ +| bitmap range | 8 | 40 | ++--------------+--------+-----------------------------------------+ +| bitmap | 48 | variable | ++--------------+--------+-----------------------------------------+ + +* *argsz* is the size required for the full reply payload (dirty pages str= ucture + + bitmap range structure + actual bitmap) +* *flags* is ``VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP`` +* *bitmap range* is the same bitmap range struct provided in the request, = as + defined in `VFIO Bitmap Range Format`_. +* *bitmap* is the actual dirty pages bitmap corresponding to the range req= uest + +VFIO Device Migration Info +-------------------------- + +A device may contain a migration region (of type +``VFIO_REGION_TYPE_MIGRATION``). The beginning of the region must contain +``struct vfio_device_migration_info``, defined in ````. This +subregion is accessed like any other part of a standard vfio-user region +using ``VFIO_USER_REGION_READ``/``VFIO_USER_REGION_WRITE``. 
+ ++---------------+--------+-----------------------------+ +| Name | Offset | Size | ++=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D+= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D+ +| device_state | 0 | 4 | ++---------------+--------+-----------------------------+ +| | +-----+----------------------------+ | +| | | Bit | Definition | | +| | +=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+ | +| | | 0 | VFIO_DEVICE_STATE_RUNNING | | +| | +-----+----------------------------+ | +| | | 1 | VFIO_DEVICE_STATE_SAVING | | +| | +-----+----------------------------+ | +| | | 2 | VFIO_DEVICE_STATE_RESUMING | | +| | +-----+----------------------------+ | ++---------------+--------+-----------------------------+ +| reserved | 4 | 4 | ++---------------+--------+-----------------------------+ +| pending_bytes | 8 | 8 | ++---------------+--------+-----------------------------+ +| data_offset | 16 | 8 | ++---------------+--------+-----------------------------+ +| data_size | 24 | 8 | ++---------------+--------+-----------------------------+ + +* *device_state* defines the state of the device: + + The client initiates device state transition by writing the intended sta= te. + The server must respond only after it has successfully transitioned to t= he new + state. If an error occurs then the server must respond to the + ``VFIO_USER_REGION_WRITE`` operation with the Error field set accordingl= y and + must remain at the previous state, or in case of internal error it must + transition to the error state, defined as + ``VFIO_DEVICE_STATE_RESUMING | VFIO_DEVICE_STATE_SAVING``. The client mu= st + re-read the device state in order to determine it afresh. 
+ + The following device states are defined: + + +-----------+---------+----------+-----------------------------------+ + | _RESUMING | _SAVING | _RUNNING | Description | + +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+ + | 0 | 0 | 0 | Device is stopped. | + +-----------+---------+----------+-----------------------------------+ + | 0 | 0 | 1 | Device is running, default state. | + +-----------+---------+----------+-----------------------------------+ + | 0 | 1 | 0 | Stop-and-copy state | + +-----------+---------+----------+-----------------------------------+ + | 0 | 1 | 1 | Pre-copy state | + +-----------+---------+----------+-----------------------------------+ + | 1 | 0 | 0 | Resuming | + +-----------+---------+----------+-----------------------------------+ + | 1 | 0 | 1 | Invalid state | + +-----------+---------+----------+-----------------------------------+ + | 1 | 1 | 0 | Error state | + +-----------+---------+----------+-----------------------------------+ + | 1 | 1 | 1 | Invalid state | + +-----------+---------+----------+-----------------------------------+ + + Valid state transitions are shown in the following table: + + +-------------------------+---------+---------+---------------+---------= -+----------+ + | |darr| From / To |rarr| | Stopped | Running | Stop-and-copy | Pre-copy= | Resuming | + +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D+ + | Stopped | \- | 1 | 0 | 0 = | 0 | + +-------------------------+---------+---------+---------------+---------= -+----------+ + | Running | 1 | \- | 1 | 1 = | 1 | + +-------------------------+---------+---------+---------------+---------= -+----------+ + | 
Stop-and-copy | 1 | 1 | \- | 0 | 0 | + +-------------------------+---------+---------+---------------+----------+----------+ + | Pre-copy | 0 | 0 | 1 | \- | 0 | + +-------------------------+---------+---------+---------------+----------+----------+ + | Resuming | 0 | 1 | 0 | 0 | \- | + +-------------------------+---------+---------+---------------+----------+----------+ + + A device is migrated to the destination as follows: + + * The source client transitions the device state from the running state to + the pre-copy state. This transition is optional for the client but must be + supported by the server. The source server starts sending device state data + to the source client through the migration region while the device is + running. + + * The source client transitions the device state from the running state or the + pre-copy state to the stop-and-copy state. The source server stops the + device, saves device state and sends it to the source client through the + migration region. + + The source client is responsible for sending the migration data to the + destination client. + + A device is resumed on the destination as follows: + + * The destination client transitions the device state from the running state + to the resuming state. The destination server uses the device state data + received through the migration region to resume the device. + + * The destination client provides saved device state to the destination + server and then transitions the device back to the running state. + +* *reserved* This field is reserved and any access to it must be ignored by the + server. + +* *pending_bytes* Remaining bytes to be migrated by the server. This field is + read only. + +* *data_offset* Offset in the migration region where the client must: + + * read from, during the pre-copy or stop-and-copy state, or + + * write to, during the resuming state. + + This field is read only.
+ +* *data_size* Contains the size, in bytes, of the data copied to: + + * the source migration region by the source server during the pre-copy or + stop-and-copy state, or + + * the destination migration region by the destination client during the + resuming state. + +Device-specific data must be stored at any position after +``struct vfio_device_migration_info``. Note that the migration region can be +memory mappable, even partially. In practice, only the migration data portion +can be memory mapped. + +The client processes device state data during the pre-copy and the +stop-and-copy state in the following iterative manner: + + 1. The client reads ``pending_bytes`` to mark a new iteration. Reading this + field repeatedly is an idempotent operation. If there is no migration data + to be consumed then the next step depends on the current device state: + + * pre-copy: the client must try again. + + * stop-and-copy: this procedure can end and the device can now start + resuming on the destination. + + 2. The client reads ``data_offset``; at this point the server must make + available a portion of migration data at this offset to be read by the + client, which must happen *before* completing the read operation. The + amount of data to be read must be stored in the ``data_size`` field, which + the client reads next. + + 3. The client reads ``data_size`` to determine the amount of migration data + available. + + 4. The client reads and processes the migration data. + + 5. Go to step 1. + +Note that the client can transition the device from the pre-copy state to the +stop-and-copy state at any time; ``pending_bytes`` does not need to become zero. + +The client initializes the device state on the destination by setting the +device state to the resuming state and writing the migration data to the +destination migration region at the ``data_offset`` offset.
The client can write the +source migration data in an iterative manner and the server must consume this +data before completing each write operation, updating the ``data_offset`` field. +The server must apply the source migration data on the device resume state. The +client must write data in the same order and with the same transaction size as it was read. + +If an error occurs then the server must fail the read or write operation. It is +an implementation detail of the client how to handle errors. + +Appendices +========== + +Unused VFIO ``ioctl()`` commands +-------------------------------- + +The following VFIO commands do not have an equivalent vfio-user command: + +* ``VFIO_GET_API_VERSION`` +* ``VFIO_CHECK_EXTENSION`` +* ``VFIO_SET_IOMMU`` +* ``VFIO_GROUP_GET_STATUS`` +* ``VFIO_GROUP_SET_CONTAINER`` +* ``VFIO_GROUP_UNSET_CONTAINER`` +* ``VFIO_GROUP_GET_DEVICE_FD`` +* ``VFIO_IOMMU_GET_INFO`` + +However, once support for live migration for VFIO devices is finalized some +of the above commands may have to be handled by the client in their +corresponding vfio-user form. This will be addressed in a future protocol +version. + +VFIO groups and containers +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The current VFIO implementation includes group and container idioms that +describe how a device relates to the host IOMMU. In the vfio-user +implementation, the IOMMU is implemented in software by the client, and is not +visible to the server. The simplest approach is for the client to put each +device into its own group and container. + +Backend Program Conventions +--------------------------- + +vfio-user backend program conventions are based on the vhost-user ones. + +* The backend program must not daemonize itself. +* No assumptions must be made as to what access the backend program has on the + system. +* File descriptors 0, 1 and 2 must exist, must have regular + stdin/stdout/stderr semantics, and can be redirected. +* The backend program must honor the SIGTERM signal.
+* The backend program must accept the following command line options: + + * ``--socket-path=PATH``: path to UNIX domain socket, + * ``--fd=FDNUM``: file descriptor for UNIX domain socket, incompatible with + ``--socket-path`` +* The backend program must be accompanied by a JSON file stored under + ``/usr/share/vfio-user``. + +TODO add schema similar to docs/interop/vhost-user.json. diff --git a/MAINTAINERS b/MAINTAINERS index 7543eb4..1258e11 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1900,6 +1900,12 @@ F: hw/vfio/ap.c F: docs/system/s390x/vfio-ap.rst L: qemu-s390x@nongnu.org +vfio-user +M: John G Johnson +M: Thanos Makatos +S: Supported +F: docs/devel/vfio-user.rst + vhost M: Michael S. Tsirkin S: Supported -- 1.8.3.1