hw/acpi/Kconfig | 5 + hw/acpi/meson.build | 1 + hw/acpi/vmclock-abi.h | 186 +++++++++++++++++++++++++++++++++++++ hw/acpi/vmclock.c | 189 ++++++++++++++++++++++++++++++++++++++ hw/i386/Kconfig | 1 + hw/i386/acpi-build.c | 10 +- include/hw/acpi/vmclock.h | 34 +++++++ 7 files changed, 425 insertions(+), 1 deletion(-) create mode 100644 hw/acpi/vmclock-abi.h create mode 100644 hw/acpi/vmclock.c create mode 100644 include/hw/acpi/vmclock.h
From: David Woodhouse <dwmw@amazon.co.uk>
The vmclock device addresses the problem of live migration with
precision clocks. The tolerances of a hardware counter (e.g. TSC) are
typically around ±50PPM. A guest will use NTP/PTP/PPS to discipline that
counter against an external source of 'real' time, and track the precise
frequency of the counter as it changes with environmental conditions.
When a guest is live migrated, anything it knows about the frequency of
the underlying counter becomes invalid. It may move from a host where
the counter running at -50PPM of its nominal frequency, to a host where
it runs at +50PPM. There will also be a step change in the value of the
counter, as the correctness of its absolute value at migration is
limited by the accuracy of the source and destination host's time
synchronization.
The device exposes a shared memory region to guests, which can be mapped
all the way to userspace. In the first phase, this merely advertises a
'disruption_marker', which indicates that the guest should throw away any
NTP synchronization it thinks it has, and start again.
Because the region can be exposed all the way to userspace, applications
can still use time from a fast vDSO 'system call', and check the
disruption marker to be sure that their timestamp is indeed truthful.
The structure also allows for the precise time, as known by the host, to
be exposed directly to guests so that they don't have to wait for NTP to
resync from scratch.
The values and fields are based on the nascent virtio-rtc specification,
and the intent is that a version (hopefully precisely this version) of
this structure will be included as an optional part of that spec. In the
meantime, a simple ACPI device along the lines of VMGENID is perfectly
sufficient and is compatible with what's being shipped in certain
commercial hypervisors.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
v4:
• Trivial checkpatch fixes and comment improvements.
v3:
• Add comment that vmclock-abi.h will come from the Linux kernel
headers once it gets merged there.
v2:
• Change esterror/maxerror fields to nanoseconds.
• Change to officially assigned AMZNC10C ACPI HID.
• Fix little-endian handling of fields in update.
Guest support:
https://lore.kernel.org/all/78969a39b51ec00e85551b752767be65f6794b46.camel@infradead.org/
hw/acpi/Kconfig | 5 +
hw/acpi/meson.build | 1 +
hw/acpi/vmclock-abi.h | 186 +++++++++++++++++++++++++++++++++++++
hw/acpi/vmclock.c | 189 ++++++++++++++++++++++++++++++++++++++
hw/i386/Kconfig | 1 +
hw/i386/acpi-build.c | 10 +-
include/hw/acpi/vmclock.h | 34 +++++++
7 files changed, 425 insertions(+), 1 deletion(-)
create mode 100644 hw/acpi/vmclock-abi.h
create mode 100644 hw/acpi/vmclock.c
create mode 100644 include/hw/acpi/vmclock.h
diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
index e07d3204eb..1d4e9f0845 100644
--- a/hw/acpi/Kconfig
+++ b/hw/acpi/Kconfig
@@ -60,6 +60,11 @@ config ACPI_VMGENID
default y
depends on PC
+config ACPI_VMCLOCK
+ bool
+ default y
+ depends on PC
+
config ACPI_VIOT
bool
depends on ACPI
diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
index 7f8ccc9b7a..995f99084a 100644
--- a/hw/acpi/meson.build
+++ b/hw/acpi/meson.build
@@ -16,6 +16,7 @@ acpi_ss.add(when: 'CONFIG_ACPI_NVDIMM', if_false: files('acpi-nvdimm-stub.c'))
acpi_ss.add(when: 'CONFIG_ACPI_PCI', if_true: files('pci.c'))
acpi_ss.add(when: 'CONFIG_ACPI_CXL', if_true: files('cxl.c'), if_false: files('cxl-stub.c'))
acpi_ss.add(when: 'CONFIG_ACPI_VMGENID', if_true: files('vmgenid.c'))
+acpi_ss.add(when: 'CONFIG_ACPI_VMCLOCK', if_true: files('vmclock.c'))
acpi_ss.add(when: 'CONFIG_ACPI_HW_REDUCED', if_true: files('generic_event_device.c'))
acpi_ss.add(when: 'CONFIG_ACPI_HMAT', if_true: files('hmat.c'))
acpi_ss.add(when: 'CONFIG_ACPI_APEI', if_true: files('ghes.c'), if_false: files('ghes-stub.c'))
diff --git a/hw/acpi/vmclock-abi.h b/hw/acpi/vmclock-abi.h
new file mode 100644
index 0000000000..19cbf85efd
--- /dev/null
+++ b/hw/acpi/vmclock-abi.h
@@ -0,0 +1,186 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
+
+/*
+ * This structure provides a vDSO-style clock to VM guests, exposing the
+ * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
+ * counter, etc.) and real time. It is designed to address the problem of
+ * live migration, which other clock enlightenments do not.
+ *
+ * When a guest is live migrated, this affects the clock in two ways.
+ *
+ * First, even between identical hosts the actual frequency of the underlying
+ * counter will change within the tolerances of its specification (typically
+ * ±50PPM, or 4 seconds a day). This frequency also varies over time on the
+ * same host, but can be tracked by NTP as it generally varies slowly. With
+ * live migration there is a step change in the frequency, with no warning.
+ *
+ * Second, there may be a step change in the value of the counter itself, as
+ * its accuracy is limited by the precision of the NTP synchronization on the
+ * source and destination hosts.
+ *
+ * So any calibration (NTP, PTP, etc.) which the guest has done on the source
+ * host before migration is invalid, and needs to be redone on the new host.
+ *
+ * In its most basic mode, this structure provides only an indication to the
+ * guest that live migration has occurred. This allows the guest to know that
+ * its clock is invalid and take remedial action. For applications that need
+ * reliable accurate timestamps (e.g. distributed databases), the structure
+ * can be mapped all the way to userspace. This allows the application to see
+ * directly for itself that the clock is disrupted and take appropriate
+ * action, even when using a vDSO-style method to get the time instead of a
+ * system call.
+ *
+ * In its more advanced mode. this structure can also be used to expose the
+ * precise relationship of the CPU counter to real time, as calibrated by the
+ * host. This means that userspace applications can have accurate time
+ * immediately after live migration, rather than having to pause operations
+ * and wait for NTP to recover. This mode does, of course, rely on the
+ * counter being reliable and consistent across CPUs.
+ *
+ * Note that this must be true UTC, never with smeared leap seconds. If a
+ * guest wishes to construct a smeared clock, it can do so. Presenting a
+ * smeared clock through this interface would be problematic because it
+ * actually messes with the apparent counter *period*. A linear smearing
+ * of 1 ms per second would effectively tweak the counter period by 1000PPM
+ * at the start/end of the smearing period, while a sinusoidal smear would
+ * basically be impossible to represent.
+ *
+ * This structure is offered with the intent that it be adopted into the
+ * nascent virtio-rtc standard, as a virtio-rtc that does not address the live
+ * migration problem seems a little less than fit for purpose. For that
+ * reason, certain fields use precisely the same numeric definitions as in
+ * the virtio-rtc proposal. The structure can also be exposed through an ACPI
+ * device with the CID "VMCLOCK", modelled on the "VMGENID" device except for
+ * the fact that it uses a real _CRS to convey the address of the structure
+ * (which should be a full page, to allow for mapping directly to userspace).
+ */
+
+#ifndef __VMCLOCK_ABI_H__
+#define __VMCLOCK_ABI_H__
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <stdint.h>
+#endif
+
+struct vmclock_abi {
+ /* CONSTANT FIELDS */
+ uint32_t magic;
+#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
+ uint32_t size; /* Size of region containing this structure */
+ uint16_t version; /* 1 */
+ uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */
+#define VMCLOCK_COUNTER_ARM_VCNT 0
+#define VMCLOCK_COUNTER_X86_TSC 1
+#define VMCLOCK_COUNTER_INVALID 0xff
+ uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */
+#define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */
+#define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */
+#define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */
+#define VMCLOCK_TIME_INVALID_SMEARED 3 /* Not supported */
+#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4 /* Not supported */
+
+ /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */
+ uint32_t seq_count; /* Low bit means an update is in progress */
+ /*
+ * This field changes to another non-repeating value when the CPU
+ * counter is disrupted, for example on live migration. This lets
+ * the guest know that it should discard any calibration it has
+ * performed of the counter against external sources (NTP/PTP/etc.).
+ */
+ uint64_t disruption_marker;
+ uint64_t flags;
+ /* Indicates that the tai_offset_sec field is valid */
+#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0)
+ /*
+ * Optionally used to notify guests of pending maintenance events.
+ * A guest which provides latency-sensitive services may wish to
+ * remove itself from service if an event is coming up. Two flags
+ * indicate the approximate imminence of the event.
+ */
+#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */
+#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */
+#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID (1 << 3)
+#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID (1 << 4)
+#define VMCLOCK_FLAG_TIME_ESTERROR_VALID (1 << 5)
+#define VMCLOCK_FLAG_TIME_MAXERROR_VALID (1 << 6)
+ /*
+ * If the MONOTONIC flag is set then (other than leap seconds) it is
+ * guaranteed that the time calculated according this structure at
+ * any given moment shall never appear to be later than the time
+ * calculated via the structure at any *later* moment.
+ *
+ * In particular, a timestamp based on a counter reading taken
+ * immediately after setting the low bit of seq_count (and the
+ * associated memory barrier), using the previously-valid time and
+ * period fields, shall never be later than a timestamp based on
+ * a counter reading taken immediately before *clearing* the low
+ * bit again after the update, using the about-to-be-valid fields.
+ */
+#define VMCLOCK_FLAG_TIME_MONOTONIC (1 << 7)
+
+ uint8_t pad[2];
+ uint8_t clock_status;
+#define VMCLOCK_STATUS_UNKNOWN 0
+#define VMCLOCK_STATUS_INITIALIZING 1
+#define VMCLOCK_STATUS_SYNCHRONIZED 2
+#define VMCLOCK_STATUS_FREERUNNING 3
+#define VMCLOCK_STATUS_UNRELIABLE 4
+
+ /*
+ * The time exposed through this device is never smeared. This field
+ * corresponds to the 'subtype' field in virtio-rtc, which indicates
+ * the smearing method. However in this case it provides a *hint* to
+ * the guest operating system, such that *if* the guest OS wants to
+ * provide its users with an alternative clock which does not follow
+ * UTC, it may do so in a fashion consistent with the other systems
+ * in the nearby environment.
+ */
+ uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */
+#define VMCLOCK_SMEARING_STRICT 0
+#define VMCLOCK_SMEARING_NOON_LINEAR 1
+#define VMCLOCK_SMEARING_UTC_SLS 2
+ int16_t tai_offset_sec;
+ uint8_t leap_indicator;
+ /*
+ * This field is based on the VIRTIO_RTC_LEAP_xxx values as defined
+ * in the current draft of virtio-rtc, but since smearing cannot be
+ * used with the shared memory device, some values are not used.
+ *
+ * The _POST_POS and _POST_NEG values allow the guest to perform
+ * its own smearing during the day or so after a leap second when
+ * such smearing may need to continue being applied for a leap
+ * second which is now theoretically "historical".
+ */
+#define VMCLOCK_LEAP_NONE 0x00 /* No known nearby leap second */
+#define VMCLOCK_LEAP_PRE_POS 0x01 /* Positive leap second at EOM */
+#define VMCLOCK_LEAP_PRE_NEG 0x02 /* Negative leap second at EOM */
+#define VMCLOCK_LEAP_POS 0x03 /* Set during 23:59:60 second */
+#define VMCLOCK_LEAP_POST_POS 0x04
+#define VMCLOCK_LEAP_POST_NEG 0x05
+
+ /* Bit shift for counter_period_frac_sec and its error rate */
+ uint8_t counter_period_shift;
+ /*
+ * Paired values of counter and UTC at a given point in time.
+ */
+ uint64_t counter_value;
+ /*
+ * Counter period, and error margin of same. The unit of these
+ * fields is 1/2^(64 + counter_period_shift) of a second.
+ */
+ uint64_t counter_period_frac_sec;
+ uint64_t counter_period_esterror_rate_frac_sec;
+ uint64_t counter_period_maxerror_rate_frac_sec;
+
+ /*
+ * Time according to time_type field above.
+ */
+ uint64_t time_sec; /* Seconds since time_type epoch */
+ uint64_t time_frac_sec; /* Units of 1/2^64 of a second */
+ uint64_t time_esterror_nanosec;
+ uint64_t time_maxerror_nanosec;
+};
+
+#endif /* __VMCLOCK_ABI_H__ */
diff --git a/hw/acpi/vmclock.c b/hw/acpi/vmclock.c
new file mode 100644
index 0000000000..e7df047c33
--- /dev/null
+++ b/hw/acpi/vmclock.c
@@ -0,0 +1,189 @@
+/*
+ * Virtual Machine Clock Device
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * Authors: David Woodhouse <dwmw2@infradead.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/module.h"
+#include "hw/i386/e820_memory_layout.h"
+#include "hw/acpi/acpi.h"
+#include "hw/acpi/aml-build.h"
+#include "hw/acpi/vmclock.h"
+#include "hw/nvram/fw_cfg.h"
+#include "hw/qdev-properties.h"
+#include "hw/qdev-properties-system.h"
+#include "migration/vmstate.h"
+#include "sysemu/reset.h"
+
+/* This will come from Linux headers in the end, but we carry our own for now. */
+#include "vmclock-abi.h"
+
+void vmclock_build_acpi(VmclockState *vms, GArray *table_data,
+ BIOSLinker *linker, const char *oem_id)
+{
+ Aml *ssdt, *dev, *scope, *method, *addr, *crs;
+ AcpiTable table = { .sig = "SSDT", .rev = 1,
+ .oem_id = oem_id, .oem_table_id = "VMCLOCK" };
+
+ /* Put VMCLOCK into a separate SSDT table */
+ acpi_table_begin(&table, table_data);
+ ssdt = init_aml_allocator();
+
+ scope = aml_scope("\\_SB");
+ dev = aml_device("VCLK");
+ aml_append(dev, aml_name_decl("_HID", aml_string("AMZNC10C")));
+ aml_append(dev, aml_name_decl("_CID", aml_string("VMCLOCK")));
+ aml_append(dev, aml_name_decl("_DDN", aml_string("VMCLOCK")));
+
+ /* Simple status method */
+ method = aml_method("_STA", 0, AML_NOTSERIALIZED);
+ addr = aml_local(0);
+ aml_append(method, aml_store(aml_int(0xf), addr));
+ aml_append(method, aml_return(addr));
+ aml_append(dev, method);
+
+ crs = aml_resource_template();
+ aml_append(crs, aml_qword_memory(AML_POS_DECODE,
+ AML_MIN_FIXED, AML_MAX_FIXED,
+ AML_CACHEABLE, AML_READ_ONLY,
+ 0xffffffffffffffffULL,
+ vms->physaddr,
+ vms->physaddr + VMCLOCK_SIZE - 1,
+ 0, VMCLOCK_SIZE));
+ aml_append(dev, aml_name_decl("_CRS", crs));
+ aml_append(scope, dev);
+ aml_append(ssdt, scope);
+
+ g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
+ acpi_table_end(linker, &table);
+ free_aml_allocator();
+}
+
+static void vmclock_update_guest(VmclockState *vms)
+{
+ uint64_t disruption_marker;
+ uint32_t seq_count;
+
+ if (!vms->clk) {
+ return;
+ }
+
+ seq_count = le32_to_cpu(vms->clk->seq_count) | 1;
+ vms->clk->seq_count = cpu_to_le32(seq_count);
+ /* These barriers pair with read barriers in the guest */
+ smp_wmb();
+
+ disruption_marker = le64_to_cpu(vms->clk->disruption_marker);
+ disruption_marker++;
+ vms->clk->disruption_marker = cpu_to_le64(disruption_marker);
+
+ /* These barriers pair with read barriers in the guest */
+ smp_wmb();
+ vms->clk->seq_count = cpu_to_le32(seq_count + 1);
+}
+
+/*
+ * After restoring an image, we need to update the guest memory to notify
+ * it of clock disruption.
+ */
+static int vmclock_post_load(void *opaque, int version_id)
+{
+ VmclockState *vms = opaque;
+
+ vmclock_update_guest(vms);
+ return 0;
+}
+
+static const VMStateDescription vmstate_vmclock = {
+ .name = "vmclock",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .post_load = vmclock_post_load,
+ .fields = (const VMStateField[]) {
+ VMSTATE_UINT64(physaddr, VmclockState),
+ VMSTATE_END_OF_LIST()
+ },
+};
+
+static void vmclock_handle_reset(void *opaque)
+{
+ VmclockState *vms = VMCLOCK(opaque);
+
+ if (!memory_region_is_mapped(&vms->clk_page)) {
+ memory_region_add_subregion_overlap(get_system_memory(),
+ vms->physaddr,
+ &vms->clk_page, 0);
+ }
+}
+
+static void vmclock_realize(DeviceState *dev, Error **errp)
+{
+ VmclockState *vms = VMCLOCK(dev);
+
+ /*
+ * Given that this function is executing, there is at least one VMCLOCK
+ * device. Check if there are several.
+ */
+ if (!find_vmclock_dev()) {
+ error_setg(errp, "at most one %s device is permitted", TYPE_VMCLOCK);
+ return;
+ }
+
+ vms->physaddr = VMCLOCK_ADDR;
+
+ e820_add_entry(vms->physaddr, VMCLOCK_SIZE, E820_RESERVED);
+
+ memory_region_init_ram(&vms->clk_page, OBJECT(dev), "vmclock_page",
+ VMCLOCK_SIZE, &error_abort);
+ memory_region_set_enabled(&vms->clk_page, true);
+ vms->clk = memory_region_get_ram_ptr(&vms->clk_page);
+ memset(vms->clk, 0, VMCLOCK_SIZE);
+
+ vms->clk->magic = cpu_to_le32(VMCLOCK_MAGIC);
+ vms->clk->size = cpu_to_le16(VMCLOCK_SIZE);
+ vms->clk->version = cpu_to_le16(1);
+
+ /* These are all zero and thus default, but be explicit */
+ vms->clk->clock_status = VMCLOCK_STATUS_UNKNOWN;
+ vms->clk->counter_id = VMCLOCK_COUNTER_INVALID;
+
+ qemu_register_reset(vmclock_handle_reset, vms);
+
+ vmclock_update_guest(vms);
+}
+
+static Property vmclock_device_properties[] = {
+ DEFINE_PROP_END_OF_LIST(),
+};
+
+static void vmclock_device_class_init(ObjectClass *klass, void *data)
+{
+ DeviceClass *dc = DEVICE_CLASS(klass);
+
+ dc->vmsd = &vmstate_vmclock;
+ dc->realize = vmclock_realize;
+ device_class_set_props(dc, vmclock_device_properties);
+ dc->hotpluggable = false;
+ set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+}
+
+static const TypeInfo vmclock_device_info = {
+ .name = TYPE_VMCLOCK,
+ .parent = TYPE_DEVICE,
+ .instance_size = sizeof(VmclockState),
+ .class_init = vmclock_device_class_init,
+};
+
+static void vmclock_register_types(void)
+{
+ type_register_static(&vmclock_device_info);
+}
+
+type_init(vmclock_register_types)
diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index f4a33b6c08..86b00bb94f 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -43,6 +43,7 @@ config PC
select SERIAL_ISA
select ACPI_PCI
select ACPI_VMGENID
+ select ACPI_VMCLOCK
select VIRTIO_PMEM_SUPPORTED
select VIRTIO_MEM_SUPPORTED
select HV_BALLOON_SUPPORTED
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 4967aa7459..097fe08836 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -43,6 +43,7 @@
#include "sysemu/tpm.h"
#include "hw/acpi/tpm.h"
#include "hw/acpi/vmgenid.h"
+#include "hw/acpi/vmclock.h"
#include "hw/acpi/erst.h"
#include "hw/acpi/piix4.h"
#include "sysemu/tpm_backend.h"
@@ -2429,7 +2430,7 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
uint8_t *u;
GArray *tables_blob = tables->table_data;
AcpiSlicOem slic_oem = { .id = NULL, .table_id = NULL };
- Object *vmgenid_dev;
+ Object *vmgenid_dev, *vmclock_dev;
char *oem_id;
char *oem_table_id;
@@ -2502,6 +2503,13 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
tables->vmgenid, tables->linker, x86ms->oem_id);
}
+ vmclock_dev = find_vmclock_dev();
+ if (vmclock_dev) {
+ acpi_add_table(table_offsets, tables_blob);
+ vmclock_build_acpi(VMCLOCK(vmclock_dev), tables_blob, tables->linker,
+ x86ms->oem_id);
+ }
+
if (misc.has_hpet) {
acpi_add_table(table_offsets, tables_blob);
build_hpet(tables_blob, tables->linker, x86ms->oem_id,
diff --git a/include/hw/acpi/vmclock.h b/include/hw/acpi/vmclock.h
new file mode 100644
index 0000000000..5605605812
--- /dev/null
+++ b/include/hw/acpi/vmclock.h
@@ -0,0 +1,34 @@
+#ifndef ACPI_VMCLOCK_H
+#define ACPI_VMCLOCK_H
+
+#include "hw/acpi/bios-linker-loader.h"
+#include "hw/qdev-core.h"
+#include "qemu/uuid.h"
+#include "qom/object.h"
+
+#define TYPE_VMCLOCK "vmclock"
+
+#define VMCLOCK_ADDR 0xfeffb000
+#define VMCLOCK_SIZE 0x1000
+
+OBJECT_DECLARE_SIMPLE_TYPE(VmclockState, VMCLOCK)
+
+struct vmclock_abi;
+
+struct VmclockState {
+ DeviceState parent_obj;
+ MemoryRegion clk_page;
+ uint64_t physaddr;
+ struct vmclock_abi *clk;
+};
+
+/* returns NULL unless there is exactly one device */
+static inline Object *find_vmclock_dev(void)
+{
+ return object_resolve_path_type("", TYPE_VMCLOCK, NULL);
+}
+
+void vmclock_build_acpi(VmclockState *vms, GArray *table_data,
+ BIOSLinker *linker, const char *oem_id);
+
+#endif
--
2.44.0
On Mon, 07 Oct 2024 14:53:54 +0100 David Woodhouse <dwmw2@infradead.org> wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > The vmclock device addresses the problem of live migration with > precision clocks. The tolerances of a hardware counter (e.g. TSC) are > typically around ±50PPM. A guest will use NTP/PTP/PPS to discipline that > counter against an external source of 'real' time, and track the precise > frequency of the counter as it changes with environmental conditions. > > When a guest is live migrated, anything it knows about the frequency of > the underlying counter becomes invalid. It may move from a host where > the counter running at -50PPM of its nominal frequency, to a host where > it runs at +50PPM. There will also be a step change in the value of the > counter, as the correctness of its absolute value at migration is > limited by the accuracy of the source and destination host's time > synchronization. > > The device exposes a shared memory region to guests, which can be mapped > all the way to userspace. In the first phase, this merely advertises a > 'disruption_marker', which indicates that the guest should throw away any > NTP synchronization it thinks it has, and start again. > > Because the region can be exposed all the way to userspace, applications > can still use time from a fast vDSO 'system call', and check the > disruption marker to be sure that their timestamp is indeed truthful. > > The structure also allows for the precise time, as known by the host, to > be exposed directly to guests so that they don't have to wait for NTP to > resync from scratch. > > The values and fields are based on the nascent virtio-rtc specification, > and the intent is that a version (hopefully precisely this version) of > this structure will be included as an optional part of that spec. In the > meantime, a simple ACPI device along the lines of VMGENID is perfectly > sufficient and is compatible with what's being shipped in certain > commercial hypervisors. > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Hi David, A couple of drive by comments (as we all love those!) Jonathan > --- > diff --git a/hw/acpi/vmclock-abi.h b/hw/acpi/vmclock-abi.h > new file mode 100644 > index 0000000000..19cbf85efd > --- /dev/null > +++ b/hw/acpi/vmclock-abi.h > @@ -0,0 +1,186 @@ > +#endif /* __VMCLOCK_ABI_H__ */ > diff --git a/hw/acpi/vmclock.c b/hw/acpi/vmclock.c > new file mode 100644 > index 0000000000..e7df047c33 > --- /dev/null > +++ b/hw/acpi/vmclock.c > + > +void vmclock_build_acpi(VmclockState *vms, GArray *table_data, > + BIOSLinker *linker, const char *oem_id) > +{ > + Aml *ssdt, *dev, *scope, *method, *addr, *crs; > + AcpiTable table = { .sig = "SSDT", .rev = 1, > + .oem_id = oem_id, .oem_table_id = "VMCLOCK" }; > + > + /* Put VMCLOCK into a separate SSDT table */ > + acpi_table_begin(&table, table_data); > + ssdt = init_aml_allocator(); > + > + scope = aml_scope("\\_SB"); > + dev = aml_device("VCLK"); > + aml_append(dev, aml_name_decl("_HID", aml_string("AMZNC10C"))); Nice _HID :) > + aml_append(dev, aml_name_decl("_CID", aml_string("VMCLOCK"))); > + aml_append(dev, aml_name_decl("_DDN", aml_string("VMCLOCK"))); > + > + /* Simple status method */ > + method = aml_method("_STA", 0, AML_NOTSERIALIZED); > + addr = aml_local(0); > + aml_append(method, aml_store(aml_int(0xf), addr)); > + aml_append(method, aml_return(addr)); > + aml_append(dev, method); It's static so you should be able to do: aml_append(dev, aml_name_decl("_STA", aml_int(0xf))); > + > + crs = aml_resource_template(); > + aml_append(crs, aml_qword_memory(AML_POS_DECODE, > + AML_MIN_FIXED, AML_MAX_FIXED, > + AML_CACHEABLE, AML_READ_ONLY, > + 0xffffffffffffffffULL, > + vms->physaddr, > + vms->physaddr + VMCLOCK_SIZE - 1, > + 0, VMCLOCK_SIZE)); > + aml_append(dev, aml_name_decl("_CRS", crs)); > + aml_append(scope, dev); > + aml_append(ssdt, scope); > + > + g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len); > + acpi_table_end(linker, &table); > + free_aml_allocator(); > +} > +static Property vmclock_device_properties[] = { > + DEFINE_PROP_END_OF_LIST(), > +}; > + > +static void vmclock_device_class_init(ObjectClass *klass, void *data) > +{ > + DeviceClass *dc = DEVICE_CLASS(klass); > + > + dc->vmsd = &vmstate_vmclock; > + dc->realize = vmclock_realize; > + device_class_set_props(dc, vmclock_device_properties); Probably a silly question but why register no properties? Is that any different to not registering them? + a quick look suggests not all class_init functions set them. > + dc->hotpluggable = false; > + set_bit(DEVICE_CATEGORY_MISC, dc->categories); > +} > + > +static const TypeInfo vmclock_device_info = { > + .name = TYPE_VMCLOCK, > + .parent = TYPE_DEVICE, > + .instance_size = sizeof(VmclockState), > + .class_init = vmclock_device_class_init, > +}; > + > +static void vmclock_register_types(void) > +{ > + type_register_static(&vmclock_device_info); > +} > + > +type_init(vmclock_register_types)
On Tue, 2024-10-08 at 12:04 +0100, Jonathan Cameron wrote: > On Mon, 07 Oct 2024 14:53:54 +0100 > David Woodhouse <dwmw2@infradead.org> wrote: > > > > > + aml_append(dev, aml_name_decl("_HID", > > aml_string("AMZNC10C"))); > > Nice _HID :) Can't take credit for that; it was mst's idea :) > > + aml_append(dev, aml_name_decl("_CID", aml_string("VMCLOCK"))); > > + aml_append(dev, aml_name_decl("_DDN", aml_string("VMCLOCK"))); > > + > > + /* Simple status method */ > > + method = aml_method("_STA", 0, AML_NOTSERIALIZED); > > + addr = aml_local(0); > > + aml_append(method, aml_store(aml_int(0xf), addr)); > > + aml_append(method, aml_return(addr)); > > + aml_append(dev, method); > > It's static so you should be able to do: > aml_append(dev, aml_name_decl("_STA", aml_int(0xf))); Thanks. Tested and pushed to https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock > > > +static Property vmclock_device_properties[] = { > > + DEFINE_PROP_END_OF_LIST(), > > +}; > > + > > +static void vmclock_device_class_init(ObjectClass *klass, void *data) > > +{ > > + DeviceClass *dc = DEVICE_CLASS(klass); > > + > > + dc->vmsd = &vmstate_vmclock; > > + dc->realize = vmclock_realize; > > + device_class_set_props(dc, vmclock_device_properties); > > Probably a silly question but why register no properties? > Is that any different to not registering them? > + a quick look suggests not all class_init functions > set them. They'll be used in the future. Phase 1 of the VMCLOCK device just signals *disruption* to guests, so they know that their clocks are wrong immediately after migration and they have to resync before performing any operations that require accurate time. Phase 2 will advertise the current time (based on the TSC of the new host) directly to the guest. That's when it'll need a little more setup. So when I copied the boilerplate from vmgenid I just ripped out the actual UUID property, not the whole setup. For now I've just been testing the guest side with this hack, FWIW: --- a/hw/acpi/vmclock.c +++ b/hw/acpi/vmclock.c @@ -151,6 +151,18 @@ static void vmclock_realize(DeviceState *dev, Error **errp) qemu_register_reset(vmclock_handle_reset, vms); + vms->clk->time_type = VMCLOCK_TIME_TAI; + vms->clk->flags = VMCLOCK_FLAG_TAI_OFFSET_VALID; + vms->clk->tai_offset_sec = -3600; + vms->clk->clock_status = VMCLOCK_STATUS_SYNCHRONIZED; + vms->clk->counter_value = 0; + vms->clk->counter_id = VMCLOCK_COUNTER_X86_TSC; + vms->clk->time_sec = 1704067200; + vms->clk->time_frac_sec = 0x8000000000000000ULL; + vms->clk->counter_period_frac_sec = 0x1a6e39b3e0ULL; + vms->clk->counter_period_shift = 4; + //vms->clk->counter_period_frac_sec = 0x1934c67f9b2ce6ULL; + vmclock_update_guest(vms); }
© 2016 - 2024 Red Hat, Inc.