1
From: David Woodhouse <dwmw@amazon.co.uk>
1
From: David Woodhouse <dwmw@amazon.co.uk>
2
2
3
The vmclock device addresses the problem of live migration with precision
3
The vmclock device addresses the problem of live migration with
4
clocks. The tolerances of a hardware counter (e.g. TSC) are typically
4
precision clocks. The tolerances of a hardware counter (e.g. TSC) are
5
around ±50PPM. We use NTP/PTP/PPS to discipline that counter against an
5
typically around ±50PPM. A guest will use NTP/PTP/PPS to discipline that
6
external source of 'real' time, and track the precise frequency of the
6
counter against an external source of 'real' time, and track the precise
7
counter as it changes with environmental conditions.
7
frequency of the counter as it changes with environmental conditions.
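
As a rough illustration of what a ±50PPM tolerance means in practice, the arithmetic can be sketched as follows. The helper names are hypothetical and not part of this patch; the only fact relied on is that 1 PPM corresponds to one microsecond of error per elapsed second.

```c
#include <assert.h>
#include <stdint.h>

#define SECONDS_PER_DAY 86400

/*
 * Hypothetical helper: accumulated error, in microseconds per day, of a
 * counter running 'ppm' parts per million away from its nominal frequency.
 * 1 PPM is one microsecond of error per second of elapsed time.
 */
static int64_t drift_us_per_day(int64_t ppm)
{
    /* Sanity bound for this sketch only; real counters are far tighter. */
    assert(ppm > -1000000 && ppm < 1000000);
    return ppm * SECONDS_PER_DAY;
}

/*
 * Hypothetical helper: step change in the drift rate seen by a guest that
 * moves from a host running at src_ppm to a host running at dst_ppm.
 */
static int64_t migration_step_us_per_day(int64_t src_ppm, int64_t dst_ppm)
{
    return drift_us_per_day(dst_ppm) - drift_us_per_day(src_ppm);
}
```

Under these assumptions, drift_us_per_day(50) is 4,320,000 µs (about 4.3 seconds a day), and a move from a -50PPM host to a +50PPM host doubles that as a step change in rate.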
8
8
9
When a guest is live migrated, anything it knows about the frequency of
9
When a guest is live migrated, anything it knows about the frequency of
10
the underlying counter becomes invalid. It may move from a host where
10
the underlying counter becomes invalid. It may move from a host where
11
the counter is running at -50PPM of its nominal frequency, to a host where
11
the counter is running at -50PPM of its nominal frequency, to a host where
12
it runs at +50PPM. There will also be a step change in the value of the
12
it runs at +50PPM. There will also be a step change in the value of the
...
...
17
The device exposes a shared memory region to guests, which can be mapped
17
The device exposes a shared memory region to guests, which can be mapped
18
all the way to userspace. In the first phase, this merely advertises a
18
all the way to userspace. In the first phase, this merely advertises a
19
'disruption_marker', which indicates that the guest should throw away any
19
'disruption_marker', which indicates that the guest should throw away any
20
NTP synchronization it thinks it has, and start again.
20
NTP synchronization it thinks it has, and start again.
21
21
22
Because it can be exposed all the way to userspace, applications can still
22
Because the region can be exposed all the way to userspace, applications
23
use time from a vDSO 'system call', and check the disruption marker to
23
can still use time from a fast vDSO 'system call', and check the
24
be sure that their timestamp is indeed truthful.
24
disruption marker to be sure that their timestamp is indeed truthful.
25
25
26
The structure also allows for the precise time, as known by the host, to
26
The structure also allows for the precise time, as known by the host, to
27
be exposed directly to guests so that they don't have to wait for NTP to
27
be exposed directly to guests so that they don't have to wait for NTP to
28
resync from scratch.
28
resync from scratch.
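
The conversion from a counter reading to time can be sketched as below, using the field meanings given in the ABI header (counter_value paired with time_sec/time_frac_sec, and a counter period expressed in units of 1/2^(64 + counter_period_shift) of a second). This is an illustrative sketch, not the guest driver: the function and struct names are hypothetical, it relies on the GCC/Clang unsigned __int128 extension, it assumes a shift below 64, and it takes values already converted to host byte order and read under the sequence lock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: whole seconds plus a 1/2^64 binary fraction. */
struct vmclock_time {
    uint64_t sec;
    uint64_t frac_sec;
};

/*
 * Extrapolate the time at counter reading 'counter_now' from the paired
 * (counter_value, time_sec, time_frac_sec) reference point.
 */
static struct vmclock_time vmclock_time_at(uint64_t counter_now,
                                           uint64_t counter_value,
                                           uint64_t period_frac_sec,
                                           uint8_t period_shift,
                                           uint64_t time_sec,
                                           uint64_t time_frac_sec)
{
    struct vmclock_time t;
    unsigned __int128 elapsed;
    uint64_t delta = counter_now - counter_value;

    /* Limit of this sketch, not a statement about the ABI. */
    assert(period_shift < 64);

    /* ticks * period gives units of 1/2^(64+shift) s; shift down to 1/2^64 s. */
    elapsed = ((unsigned __int128)delta * period_frac_sec) >> period_shift;

    /* Add the reference fraction; the carry lands in the upper 64 bits. */
    elapsed += time_frac_sec;

    t.sec = time_sec + (uint64_t)(elapsed >> 64);
    t.frac_sec = (uint64_t)elapsed;
    return t;
}
```

Keeping the intermediate product in 128 bits avoids losing precision when many counter ticks have elapsed since the reference point.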
29
29
30
The values and fields are based on the nascent virtio-rtc specification,
31
and the intent is that a version (hopefully precisely this version) of
32
this structure will be included as an optional part of that spec. In the
33
meantime, a simple ACPI device along the lines of VMGENID is perfectly
34
sufficient and is compatible with what's being shipped in certain
35
commercial hypervisors.
36
37
Linux guest support was merged into the 6.13-rc1 kernel:
38
https://git.kernel.org/torvalds/c/205032724226
39
30
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
40
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
41
Reviewed-by: Paul Durrant <paul@xen.org>
31
---
42
---
32
43
v6:
33
Guest kernel support at
44
• Rebase for DEFINE_PROP_END_OF_LIST removal and sysemu→system
34
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
45
rename.
35
and discussion at
46
36
https://lore.kernel.org/lkml/51dcda5b675fb68c54b74fd19c408a3a086fc412.camel@infradead.org/
47
v5:
37
48
• Trivial simplification to AML generation.
38
hw/acpi/Kconfig | 5 ++
49
• Import vmclock-abi.h from Linux now the guest support is merged.
39
hw/acpi/meson.build | 1 +
50
40
hw/acpi/vmclock-abi.h | 175 +++++++++++++++++++++++++++++++++++++
51
v4:
41
hw/acpi/vmclock.c | 177 ++++++++++++++++++++++++++++++++++++++
52
• Trivial checkpatch fixes and comment improvements.
42
hw/i386/Kconfig | 1 +
53
43
hw/i386/acpi-build.c | 10 ++-
54
v3:
44
include/hw/acpi/vmclock.h | 34 ++++++++
55
• Add comment that vmclock-abi.h will come from the Linux kernel
45
7 files changed, 402 insertions(+), 1 deletion(-)
56
headers once it gets merged there.
46
create mode 100644 hw/acpi/vmclock-abi.h
57
58
v2:
59
• Change esterror/maxerror fields to nanoseconds.
60
• Change to officially assigned AMZNC10C ACPI HID.
61
• Fix little-endian handling of fields in update.
62
63
64
hw/acpi/Kconfig | 5 +
65
hw/acpi/meson.build | 1 +
66
hw/acpi/vmclock.c | 179 ++++++++++++++++++
67
hw/i386/Kconfig | 1 +
68
hw/i386/acpi-build.c | 10 +-
69
include/hw/acpi/vmclock.h | 34 ++++
70
include/standard-headers/linux/vmclock-abi.h | 182 +++++++++++++++++++
71
scripts/update-linux-headers.sh | 1 +
72
8 files changed, 412 insertions(+), 1 deletion(-)
47
create mode 100644 hw/acpi/vmclock.c
73
create mode 100644 hw/acpi/vmclock.c
48
create mode 100644 include/hw/acpi/vmclock.h
74
create mode 100644 include/hw/acpi/vmclock.h
75
create mode 100644 include/standard-headers/linux/vmclock-abi.h
49
76
50
diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
77
diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
51
index XXXXXXX..XXXXXXX 100644
78
index XXXXXXX..XXXXXXX 100644
52
--- a/hw/acpi/Kconfig
79
--- a/hw/acpi/Kconfig
53
+++ b/hw/acpi/Kconfig
80
+++ b/hw/acpi/Kconfig
...
...
73
acpi_ss.add(when: 'CONFIG_ACPI_VMGENID', if_true: files('vmgenid.c'))
100
acpi_ss.add(when: 'CONFIG_ACPI_VMGENID', if_true: files('vmgenid.c'))
74
+acpi_ss.add(when: 'CONFIG_ACPI_VMCLOCK', if_true: files('vmclock.c'))
101
+acpi_ss.add(when: 'CONFIG_ACPI_VMCLOCK', if_true: files('vmclock.c'))
75
acpi_ss.add(when: 'CONFIG_ACPI_HW_REDUCED', if_true: files('generic_event_device.c'))
102
acpi_ss.add(when: 'CONFIG_ACPI_HW_REDUCED', if_true: files('generic_event_device.c'))
76
acpi_ss.add(when: 'CONFIG_ACPI_HMAT', if_true: files('hmat.c'))
103
acpi_ss.add(when: 'CONFIG_ACPI_HMAT', if_true: files('hmat.c'))
77
acpi_ss.add(when: 'CONFIG_ACPI_APEI', if_true: files('ghes.c'), if_false: files('ghes-stub.c'))
104
acpi_ss.add(when: 'CONFIG_ACPI_APEI', if_true: files('ghes.c'), if_false: files('ghes-stub.c'))
78
diff --git a/hw/acpi/vmclock-abi.h b/hw/acpi/vmclock-abi.h
79
new file mode 100644
80
index XXXXXXX..XXXXXXX
81
--- /dev/null
82
+++ b/hw/acpi/vmclock-abi.h
83
@@ -XXX,XX +XXX,XX @@
84
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
85
+
86
+/*
87
+ * This structure provides a vDSO-style clock to VM guests, exposing the
88
+ * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
89
+ * counter, etc.) and real time. It is designed to address the problem of
90
+ * live migration, which other clock enlightenments do not.
91
+ *
92
+ * When a guest is live migrated, this affects the clock in two ways.
93
+ *
94
+ * First, even between identical hosts the actual frequency of the underlying
95
+ * counter will change within the tolerances of its specification (typically
96
+ * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
97
+ * same host, but can be tracked by NTP as it generally varies slowly. With
98
+ * live migration there is a step change in the frequency, with no warning.
99
+ *
100
+ * Second, there may be a step change in the value of the counter itself, as
101
+ * its accuracy is limited by the precision of the NTP synchronization on the
102
+ * source and destination hosts.
103
+ *
104
+ * So any calibration (NTP, PTP, etc.) which the guest has done on the source
105
+ * host before migration is invalid, and needs to be redone on the new host.
106
+ *
107
+ * In its most basic mode, this structure provides only an indication to the
108
+ * guest that live migration has occurred. This allows the guest to know that
109
+ * its clock is invalid and take remedial action. For applications that need
110
+ * reliable accurate timestamps (e.g. distributed databases), the structure
111
+ * can be mapped all the way to userspace. This allows the application to see
112
+ * directly for itself that the clock is disrupted and take appropriate
113
+ * action, even when using a vDSO-style method to get the time instead of a
114
+ * system call.
115
+ *
116
+ * In its more advanced mode, this structure can also be used to expose the
117
+ * precise relationship of the CPU counter to real time, as calibrated by the
118
+ * host. This means that userspace applications can have accurate time
119
+ * immediately after live migration, rather than having to pause operations
120
+ * and wait for NTP to recover. This mode does, of course, rely on the
121
+ * counter being reliable and consistent across CPUs.
122
+ *
123
+ * Note that this must be true UTC, never with smeared leap seconds. If a
124
+ * guest wishes to construct a smeared clock, it can do so. Presenting a
125
+ * smeared clock through this interface would be problematic because it
126
+ * actually messes with the apparent counter *period*. A linear smearing
127
+ * of 1 ms per second would effectively tweak the counter period by 1000PPM
128
+ * at the start/end of the smearing period, while a sinusoidal smear would
129
+ * basically be impossible to represent.
130
+ */
131
+
132
+#ifndef __VMCLOCK_ABI_H__
133
+#define __VMCLOCK_ABI_H__
134
+
135
+#ifdef __KERNEL__
136
+#include <linux/types.h>
137
+#else
138
+#include <stdint.h>
139
+#endif
140
+
141
+struct vmclock_abi {
142
+    uint32_t magic;
143
+#define VMCLOCK_MAGIC    0x4b4c4356 /* "VCLK" */
144
+    uint16_t size;        /* Size of page containing this structure */
145
+    uint16_t version;    /* 1 */
146
+
147
+    /* Sequence lock. Low bit means an update is in progress. */
148
+    uint32_t seq_count;
149
+
150
+    uint32_t flags;
151
+    /* Indicates that the tai_offset_sec field is valid */
152
+#define VMCLOCK_FLAG_TAI_OFFSET_VALID        (1 << 0)
153
+    /*
154
+     * Optionally used to notify guests of pending maintenance events.
155
+     * A guest may wish to remove itself from service if an event is
156
+     * coming up. Two flags indicate the rough imminence of the event.
157
+     */
158
+#define VMCLOCK_FLAG_DISRUPTION_SOON        (1 << 1) /* About a day */
159
+#define VMCLOCK_FLAG_DISRUPTION_IMMINENT    (1 << 2) /* About an hour */
160
+    /* Indicates that the utc_time_maxerror_picosec field is valid */
161
+#define VMCLOCK_FLAG_UTC_MAXERROR_VALID        (1 << 3)
162
+    /* Indicates counter_period_error_rate_frac_sec is valid */
163
+#define VMCLOCK_FLAG_PERIOD_ERROR_VALID        (1 << 4)
164
+
165
+    /*
166
+     * This field changes to another non-repeating value when the CPU
167
+     * counter is disrupted, for example on live migration. This lets
168
+     * the guest know that it should discard any calibration it has
169
+     * performed of the counter against external sources (NTP/PTP/etc.).
170
+     */
171
+    uint64_t disruption_marker;
172
+
173
+    uint8_t clock_status;
174
+#define VMCLOCK_STATUS_UNKNOWN        0
175
+#define VMCLOCK_STATUS_INITIALIZING    1
176
+#define VMCLOCK_STATUS_SYNCHRONIZED    2
177
+#define VMCLOCK_STATUS_FREERUNNING    3
178
+#define VMCLOCK_STATUS_UNRELIABLE    4
179
+
180
+    uint8_t counter_id;
181
+#define VMCLOCK_COUNTER_INVALID        0
182
+#define VMCLOCK_COUNTER_X86_TSC        1
183
+#define VMCLOCK_COUNTER_ARM_VCNT    2
184
+#define VMCLOCK_COUNTER_X86_ART        3
185
+
186
+    /*
187
+     * By providing the offset from UTC to TAI, the guest can know both
188
+     * UTC and TAI reliably, whichever is indicated in the time_type
189
+     * field. Valid if VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags.
190
+     */
191
+    int16_t tai_offset_sec;
192
+
193
+    /*
194
+     * The time exposed through this device is never smeared; if it
195
+     * claims to be VMCLOCK_TIME_UTC then it MUST be UTC. This field
196
+     * provides a hint to the guest operating system, such that *if*
197
+     * the guest OS wants to provide its users with an alternative
198
+     * clock which does not follow the POSIX CLOCK_REALTIME standard,
199
+     * it may do so in a fashion consistent with the other systems
200
+     * in the nearby environment.
201
+     */
202
+    uint8_t leap_second_smearing_hint;
203
+    /* Provide true UTC to users, unsmeared. */
204
+#define VMCLOCK_SMEARING_NONE            0
205
+    /*
206
+     * https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/
207
+     * From noon on the day before to noon on the day after, smear the
208
+     * clock by a linear 1/86400s per second.
209
+     */
210
+#define VMCLOCK_SMEARING_LINEAR_86400        1
211
+    /*
212
+     * draft-kuhn-leapsecond-00
213
+     * For the 1000s leading up to the leap second, smear the clock by
214
+     * a linear 1ms per second.
215
+     */
216
+#define VMCLOCK_SMEARING_UTC_SLS        2
217
+
218
+    /*
219
+     * What time is exposed in the time_sec/time_frac_sec fields?
220
+     */
221
+    uint8_t time_type;
222
+#define VMCLOCK_TIME_UNKNOWN        0    /* Invalid / no time exposed */
223
+#define VMCLOCK_TIME_UTC        1    /* Since 1970-01-01 00:00:00z */
224
+#define VMCLOCK_TIME_TAI        2    /* Since 1970-01-01 00:00:00z */
225
+#define VMCLOCK_TIME_MONOTONIC        3    /* Since undefined epoch */
226
+
227
+    /* Bit shift for counter_period_frac_sec and its error rate */
228
+    uint8_t counter_period_shift;
229
+
230
+    /*
231
+     * Unlike in NTP, this can indicate a leap second in the past. This
232
+     * is needed to allow guests to derive an imprecise clock with
233
+     * smeared leap seconds for themselves, as some modes of smearing
234
+     * need the adjustments to continue even after the moment at which
235
+     * the leap second should have occurred.
236
+     */
237
+    int8_t leapsecond_direction;
238
+    uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
239
+
240
+    /*
241
+     * Paired values of counter and UTC at a given point in time.
242
+     */
243
+    uint64_t counter_value;
244
+    uint64_t time_sec;
245
+    uint64_t time_frac_sec;
246
+
247
+    /*
248
+     * Counter frequency, and error margin. The unit of these fields is
249
+     * seconds >> (64 + counter_period_shift)
250
+     */
251
+    uint64_t counter_period_frac_sec;
252
+    uint64_t counter_period_error_rate_frac_sec;
253
+
254
+    /* Error margin of UTC reading above (± picoseconds) */
255
+    uint64_t utc_time_maxerror_picosec;
256
+};
257
+
258
+#endif /* __VMCLOCK_ABI_H__ */
259
diff --git a/hw/acpi/vmclock.c b/hw/acpi/vmclock.c
105
diff --git a/hw/acpi/vmclock.c b/hw/acpi/vmclock.c
260
new file mode 100644
106
new file mode 100644
261
index XXXXXXX..XXXXXXX
107
index XXXXXXX..XXXXXXX
262
--- /dev/null
108
--- /dev/null
263
+++ b/hw/acpi/vmclock.c
109
+++ b/hw/acpi/vmclock.c
...
...
282
+#include "hw/acpi/vmclock.h"
128
+#include "hw/acpi/vmclock.h"
283
+#include "hw/nvram/fw_cfg.h"
129
+#include "hw/nvram/fw_cfg.h"
284
+#include "hw/qdev-properties.h"
130
+#include "hw/qdev-properties.h"
285
+#include "hw/qdev-properties-system.h"
131
+#include "hw/qdev-properties-system.h"
286
+#include "migration/vmstate.h"
132
+#include "migration/vmstate.h"
287
+#include "sysemu/reset.h"
133
+#include "system/reset.h"
288
+
134
+
289
+#include "vmclock-abi.h"
135
+#include "standard-headers/linux/vmclock-abi.h"
290
+
136
+
291
+void vmclock_build_acpi(VmclockState *vms, GArray *table_data,
137
+void vmclock_build_acpi(VmclockState *vms, GArray *table_data,
292
+ BIOSLinker *linker, const char *oem_id)
138
+ BIOSLinker *linker, const char *oem_id)
293
+{
139
+{
294
+ Aml *ssdt, *dev, *scope, *method, *addr, *crs;
140
+ Aml *ssdt, *dev, *scope, *crs;
295
+ AcpiTable table = { .sig = "SSDT", .rev = 1,
141
+ AcpiTable table = { .sig = "SSDT", .rev = 1,
296
+ .oem_id = oem_id, .oem_table_id = "VMCLOCK" };
142
+ .oem_id = oem_id, .oem_table_id = "VMCLOCK" };
297
+
143
+
298
+ /* Put VMCLOCK into a separate SSDT table */
144
+ /* Put VMCLOCK into a separate SSDT table */
299
+ acpi_table_begin(&table, table_data);
145
+ acpi_table_begin(&table, table_data);
300
+ ssdt = init_aml_allocator();
146
+ ssdt = init_aml_allocator();
301
+
147
+
302
+ scope = aml_scope("\\_SB");
148
+ scope = aml_scope("\\_SB");
303
+ dev = aml_device("VCLK");
149
+ dev = aml_device("VCLK");
304
+ aml_append(dev, aml_name_decl("_HID", aml_string("QEMUVCLK")));
150
+ aml_append(dev, aml_name_decl("_HID", aml_string("AMZNC10C")));
305
+ aml_append(dev, aml_name_decl("_CID", aml_string("VMCLOCK")));
151
+ aml_append(dev, aml_name_decl("_CID", aml_string("VMCLOCK")));
306
+ aml_append(dev, aml_name_decl("_DDN", aml_string("VMCLOCK")));
152
+ aml_append(dev, aml_name_decl("_DDN", aml_string("VMCLOCK")));
307
+
153
+
308
+ /* Simple status method */
154
+ /* Simple status method */
309
+ method = aml_method("_STA", 0, AML_NOTSERIALIZED);
155
+ aml_append(dev, aml_name_decl("_STA", aml_int(0xf)));
310
+ addr = aml_local(0);
311
+ aml_append(method, aml_store(aml_int(0xf), addr));
312
+ aml_append(method, aml_return(addr));
313
+ aml_append(dev, method);
314
+
156
+
315
+ crs = aml_resource_template();
157
+ crs = aml_resource_template();
316
+ aml_append(crs, aml_qword_memory(AML_POS_DECODE,
158
+ aml_append(crs, aml_qword_memory(AML_POS_DECODE,
317
+ AML_MIN_FIXED, AML_MAX_FIXED,
159
+ AML_MIN_FIXED, AML_MAX_FIXED,
318
+ AML_CACHEABLE, AML_READ_ONLY,
160
+ AML_CACHEABLE, AML_READ_ONLY,
...
...
329
+ free_aml_allocator();
171
+ free_aml_allocator();
330
+}
172
+}
331
+
173
+
332
+static void vmclock_update_guest(VmclockState *vms)
174
+static void vmclock_update_guest(VmclockState *vms)
333
+{
175
+{
176
+ uint64_t disruption_marker;
177
+ uint32_t seq_count;
178
+
334
+ if (!vms->clk) {
179
+ if (!vms->clk) {
335
+ return;
180
+ return;
336
+ }
181
+ }
337
+ vms->clk->seq_count |= 1;
182
+
183
+ seq_count = le32_to_cpu(vms->clk->seq_count) | 1;
184
+ vms->clk->seq_count = cpu_to_le32(seq_count);
185
+ /* These barriers pair with read barriers in the guest */
338
+ smp_wmb();
186
+ smp_wmb();
339
+
187
+
340
+ vms->clk->disruption_marker++;
188
+ disruption_marker = le64_to_cpu(vms->clk->disruption_marker);
341
+
189
+ disruption_marker++;
190
+ vms->clk->disruption_marker = cpu_to_le64(disruption_marker);
191
+
192
+ /* These barriers pair with read barriers in the guest */
342
+ smp_wmb();
193
+ smp_wmb();
343
+ vms->clk->seq_count += 1;
194
+ vms->clk->seq_count = cpu_to_le32(seq_count + 1);
344
+}
195
+}
345
+
196
+
346
+/* After restoring an image, we need to update the guest memory and notify
197
+/*
347
+ * it of a potential change to VM Generation ID
198
+ * After restoring an image, we need to update the guest memory to notify
199
+ * it of clock disruption.
348
+ */
200
+ */
349
+static int vmclock_post_load(void *opaque, int version_id)
201
+static int vmclock_post_load(void *opaque, int version_id)
350
+{
202
+{
351
+ VmclockState *vms = opaque;
203
+ VmclockState *vms = opaque;
204
+
352
+ vmclock_update_guest(vms);
205
+ vmclock_update_guest(vms);
353
+ return 0;
206
+ return 0;
354
+}
207
+}
355
+
208
+
356
+static const VMStateDescription vmstate_vmclock = {
209
+static const VMStateDescription vmstate_vmclock = {
...
...
377
+
230
+
378
+static void vmclock_realize(DeviceState *dev, Error **errp)
231
+static void vmclock_realize(DeviceState *dev, Error **errp)
379
+{
232
+{
380
+ VmclockState *vms = VMCLOCK(dev);
233
+ VmclockState *vms = VMCLOCK(dev);
381
+
234
+
382
+ /* Given that this function is executing, there is at least one VMCLOCK
235
+ /*
236
+ * Given that this function is executing, there is at least one VMCLOCK
383
+ * device. Check if there are several.
237
+ * device. Check if there are several.
384
+ */
238
+ */
385
+ if (!find_vmclock_dev()) {
239
+ if (!find_vmclock_dev()) {
386
+ error_setg(errp, "at most one %s device is permitted", TYPE_VMCLOCK);
240
+ error_setg(errp, "at most one %s device is permitted", TYPE_VMCLOCK);
387
+ return;
241
+ return;
...
...
400
+ vms->clk->magic = cpu_to_le32(VMCLOCK_MAGIC);
254
+ vms->clk->magic = cpu_to_le32(VMCLOCK_MAGIC);
401
+ vms->clk->size = cpu_to_le16(VMCLOCK_SIZE);
255
+ vms->clk->size = cpu_to_le16(VMCLOCK_SIZE);
402
+ vms->clk->version = cpu_to_le16(1);
256
+ vms->clk->version = cpu_to_le16(1);
403
+
257
+
404
+ /* These are all zero and thus default, but be explicit */
258
+ /* These are all zero and thus default, but be explicit */
405
+ vms->clk->time_type = VMCLOCK_TIME_UNKNOWN;
406
+ vms->clk->clock_status = VMCLOCK_STATUS_UNKNOWN;
259
+ vms->clk->clock_status = VMCLOCK_STATUS_UNKNOWN;
407
+ vms->clk->counter_id = VMCLOCK_COUNTER_INVALID;
260
+ vms->clk->counter_id = VMCLOCK_COUNTER_INVALID;
408
+
261
+
409
+ qemu_register_reset(vmclock_handle_reset, vms);
262
+ qemu_register_reset(vmclock_handle_reset, vms);
410
+
263
+
411
+ vmclock_update_guest(vms);
264
+ vmclock_update_guest(vms);
412
+}
265
+}
413
+
266
+
414
+static Property vmclock_device_properties[] = {
415
+ DEFINE_PROP_END_OF_LIST(),
416
+};
417
+
418
+static void vmclock_device_class_init(ObjectClass *klass, void *data)
267
+static void vmclock_device_class_init(ObjectClass *klass, void *data)
419
+{
268
+{
420
+ DeviceClass *dc = DEVICE_CLASS(klass);
269
+ DeviceClass *dc = DEVICE_CLASS(klass);
421
+
270
+
422
+ dc->vmsd = &vmstate_vmclock;
271
+ dc->vmsd = &vmstate_vmclock;
423
+ dc->realize = vmclock_realize;
272
+ dc->realize = vmclock_realize;
424
+ device_class_set_props(dc, vmclock_device_properties);
425
+ dc->hotpluggable = false;
273
+ dc->hotpluggable = false;
426
+ set_bit(DEVICE_CATEGORY_MISC, dc->categories);
274
+ set_bit(DEVICE_CATEGORY_MISC, dc->categories);
427
+}
275
+}
428
+
276
+
429
+static const TypeInfo vmclock_device_info = {
277
+static const TypeInfo vmclock_device_info = {
...
...
454
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
302
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
455
index XXXXXXX..XXXXXXX 100644
303
index XXXXXXX..XXXXXXX 100644
456
--- a/hw/i386/acpi-build.c
304
--- a/hw/i386/acpi-build.c
457
+++ b/hw/i386/acpi-build.c
305
+++ b/hw/i386/acpi-build.c
458
@@ -XXX,XX +XXX,XX @@
306
@@ -XXX,XX +XXX,XX @@
459
#include "sysemu/tpm.h"
307
#include "system/tpm.h"
460
#include "hw/acpi/tpm.h"
308
#include "hw/acpi/tpm.h"
461
#include "hw/acpi/vmgenid.h"
309
#include "hw/acpi/vmgenid.h"
462
+#include "hw/acpi/vmclock.h"
310
+#include "hw/acpi/vmclock.h"
463
#include "hw/acpi/erst.h"
311
#include "hw/acpi/erst.h"
464
#include "hw/acpi/piix4.h"
312
#include "hw/acpi/piix4.h"
465
#include "sysemu/tpm_backend.h"
313
#include "system/tpm_backend.h"
466
@@ -XXX,XX +XXX,XX @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
314
@@ -XXX,XX +XXX,XX @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
467
size_t aml_len = 0;
315
uint8_t *u;
468
GArray *tables_blob = tables->table_data;
316
GArray *tables_blob = tables->table_data;
469
AcpiSlicOem slic_oem = { .id = NULL, .table_id = NULL };
317
AcpiSlicOem slic_oem = { .id = NULL, .table_id = NULL };
470
- Object *vmgenid_dev;
318
- Object *vmgenid_dev;
471
+ Object *vmgenid_dev, *vmclock_dev;
319
+ Object *vmgenid_dev, *vmclock_dev;
472
char *oem_id;
320
char *oem_id;
...
...
524
+
372
+
525
+void vmclock_build_acpi(VmclockState *vms, GArray *table_data,
373
+void vmclock_build_acpi(VmclockState *vms, GArray *table_data,
526
+ BIOSLinker *linker, const char *oem_id);
374
+ BIOSLinker *linker, const char *oem_id);
527
+
375
+
528
+#endif
376
+#endif
377
diff --git a/include/standard-headers/linux/vmclock-abi.h b/include/standard-headers/linux/vmclock-abi.h
378
new file mode 100644
379
index XXXXXXX..XXXXXXX
380
--- /dev/null
381
+++ b/include/standard-headers/linux/vmclock-abi.h
382
@@ -XXX,XX +XXX,XX @@
383
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
384
+
385
+/*
386
+ * This structure provides a vDSO-style clock to VM guests, exposing the
387
+ * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
388
+ * counter, etc.) and real time. It is designed to address the problem of
389
+ * live migration, which other clock enlightenments do not.
390
+ *
391
+ * When a guest is live migrated, this affects the clock in two ways.
392
+ *
393
+ * First, even between identical hosts the actual frequency of the underlying
394
+ * counter will change within the tolerances of its specification (typically
395
+ * ±50PPM, or 4 seconds a day). This frequency also varies over time on the
396
+ * same host, but can be tracked by NTP as it generally varies slowly. With
397
+ * live migration there is a step change in the frequency, with no warning.
398
+ *
399
+ * Second, there may be a step change in the value of the counter itself, as
400
+ * its accuracy is limited by the precision of the NTP synchronization on the
401
+ * source and destination hosts.
402
+ *
403
+ * So any calibration (NTP, PTP, etc.) which the guest has done on the source
404
+ * host before migration is invalid, and needs to be redone on the new host.
405
+ *
406
+ * In its most basic mode, this structure provides only an indication to the
407
+ * guest that live migration has occurred. This allows the guest to know that
408
+ * its clock is invalid and take remedial action. For applications that need
409
+ * reliable accurate timestamps (e.g. distributed databases), the structure
410
+ * can be mapped all the way to userspace. This allows the application to see
411
+ * directly for itself that the clock is disrupted and take appropriate
412
+ * action, even when using a vDSO-style method to get the time instead of a
413
+ * system call.
414
+ *
415
+ * In its more advanced mode, this structure can also be used to expose the
416
+ * precise relationship of the CPU counter to real time, as calibrated by the
417
+ * host. This means that userspace applications can have accurate time
418
+ * immediately after live migration, rather than having to pause operations
419
+ * and wait for NTP to recover. This mode does, of course, rely on the
420
+ * counter being reliable and consistent across CPUs.
421
+ *
422
+ * Note that this must be true UTC, never with smeared leap seconds. If a
423
+ * guest wishes to construct a smeared clock, it can do so. Presenting a
424
+ * smeared clock through this interface would be problematic because it
425
+ * actually messes with the apparent counter *period*. A linear smearing
426
+ * of 1 ms per second would effectively tweak the counter period by 1000PPM
427
+ * at the start/end of the smearing period, while a sinusoidal smear would
428
+ * basically be impossible to represent.
429
+ *
430
+ * This structure is offered with the intent that it be adopted into the
431
+ * nascent virtio-rtc standard, as a virtio-rtc that does not address the live
432
+ * migration problem seems a little less than fit for purpose. For that
433
+ * reason, certain fields use precisely the same numeric definitions as in
434
+ * the virtio-rtc proposal. The structure can also be exposed through an ACPI
435
+ * device with the CID "VMCLOCK", modelled on the "VMGENID" device except for
436
+ * the fact that it uses a real _CRS to convey the address of the structure
437
+ * (which should be a full page, to allow for mapping directly to userspace).
438
+ */
439
+
440
+#ifndef __VMCLOCK_ABI_H__
441
+#define __VMCLOCK_ABI_H__
442
+
443
+#include "standard-headers/linux/types.h"
444
+
445
+struct vmclock_abi {
446
+    /* CONSTANT FIELDS */
447
+    uint32_t magic;
448
+#define VMCLOCK_MAGIC    0x4b4c4356 /* "VCLK" */
449
+    uint32_t size;        /* Size of region containing this structure */
450
+    uint16_t version;    /* 1 */
451
+    uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */
452
+#define VMCLOCK_COUNTER_ARM_VCNT    0
453
+#define VMCLOCK_COUNTER_X86_TSC        1
454
+#define VMCLOCK_COUNTER_INVALID        0xff
455
+    uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */
456
+#define VMCLOCK_TIME_UTC            0    /* Since 1970-01-01 00:00:00z */
457
+#define VMCLOCK_TIME_TAI            1    /* Since 1970-01-01 00:00:00z */
458
+#define VMCLOCK_TIME_MONOTONIC            2    /* Since undefined epoch */
459
+#define VMCLOCK_TIME_INVALID_SMEARED        3    /* Not supported */
460
+#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED    4    /* Not supported */
461
+
462
+    /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */
463
+    uint32_t seq_count;    /* Low bit means an update is in progress */
464
+    /*
465
+     * This field changes to another non-repeating value when the CPU
466
+     * counter is disrupted, for example on live migration. This lets
467
+     * the guest know that it should discard any calibration it has
468
+     * performed of the counter against external sources (NTP/PTP/etc.).
469
+     */
470
+    uint64_t disruption_marker;
471
+    uint64_t flags;
472
+    /* Indicates that the tai_offset_sec field is valid */
473
+#define VMCLOCK_FLAG_TAI_OFFSET_VALID        (1 << 0)
474
+    /*
475
+     * Optionally used to notify guests of pending maintenance events.
476
+     * A guest which provides latency-sensitive services may wish to
477
+     * remove itself from service if an event is coming up. Two flags
478
+     * indicate the approximate imminence of the event.
479
+     */
480
+#define VMCLOCK_FLAG_DISRUPTION_SOON        (1 << 1) /* About a day */
481
+#define VMCLOCK_FLAG_DISRUPTION_IMMINENT    (1 << 2) /* About an hour */
482
+#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID    (1 << 3)
483
+#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID    (1 << 4)
484
+#define VMCLOCK_FLAG_TIME_ESTERROR_VALID    (1 << 5)
485
+#define VMCLOCK_FLAG_TIME_MAXERROR_VALID    (1 << 6)
486
+    /*
487
+     * If the MONOTONIC flag is set then (other than leap seconds) it is
488
+     * guaranteed that the time calculated according to this structure at
489
+     * any given moment shall never appear to be later than the time
490
+     * calculated via the structure at any *later* moment.
491
+     *
492
+     * In particular, a timestamp based on a counter reading taken
493
+     * immediately after setting the low bit of seq_count (and the
494
+     * associated memory barrier), using the previously-valid time and
495
+     * period fields, shall never be later than a timestamp based on
496
+     * a counter reading taken immediately before *clearing* the low
497
+     * bit again after the update, using the about-to-be-valid fields.
498
+     */
499
+#define VMCLOCK_FLAG_TIME_MONOTONIC        (1 << 7)
500
+
501
+    uint8_t pad[2];
502
+    uint8_t clock_status;
503
+#define VMCLOCK_STATUS_UNKNOWN        0
504
+#define VMCLOCK_STATUS_INITIALIZING    1
505
+#define VMCLOCK_STATUS_SYNCHRONIZED    2
506
+#define VMCLOCK_STATUS_FREERUNNING    3
507
+#define VMCLOCK_STATUS_UNRELIABLE    4
508
+
509
+    /*
510
+     * The time exposed through this device is never smeared. This field
511
+     * corresponds to the 'subtype' field in virtio-rtc, which indicates
512
+     * the smearing method. However in this case it provides a *hint* to
513
+     * the guest operating system, such that *if* the guest OS wants to
514
+     * provide its users with an alternative clock which does not follow
515
+     * UTC, it may do so in a fashion consistent with the other systems
516
+     * in the nearby environment.
517
+     */
518
+    uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */
519
+#define VMCLOCK_SMEARING_STRICT        0
520
+#define VMCLOCK_SMEARING_NOON_LINEAR    1
521
+#define VMCLOCK_SMEARING_UTC_SLS    2
522
+    uint16_t tai_offset_sec; /* Actually two's complement signed */
523
+    uint8_t leap_indicator;
524
+    /*
525
+     * This field is based on the VIRTIO_RTC_LEAP_xxx values as defined
526
+     * in the current draft of virtio-rtc, but since smearing cannot be
527
+     * used with the shared memory device, some values are not used.
528
+     *
529
+     * The _POST_POS and _POST_NEG values allow the guest to perform
530
+     * its own smearing during the day or so after a leap second when
531
+     * such smearing may need to continue being applied for a leap
532
+     * second which is now theoretically "historical".
533
+     */
534
+#define VMCLOCK_LEAP_NONE    0x00    /* No known nearby leap second */
535
+#define VMCLOCK_LEAP_PRE_POS    0x01    /* Positive leap second at EOM */
536
+#define VMCLOCK_LEAP_PRE_NEG    0x02    /* Negative leap second at EOM */
537
+#define VMCLOCK_LEAP_POS    0x03    /* Set during 23:59:60 second */
538
+#define VMCLOCK_LEAP_POST_POS    0x04
539
+#define VMCLOCK_LEAP_POST_NEG    0x05
540
+
541
+    /* Bit shift for counter_period_frac_sec and its error rate */
542
+    uint8_t counter_period_shift;
543
+    /*
544
+     * Paired values of counter and UTC at a given point in time.
545
+     */
546
+    uint64_t counter_value;
547
+    /*
548
+     * Counter period, and error margin of same. The unit of these
549
+     * fields is 1/2^(64 + counter_period_shift) of a second.
550
+     */
551
+    uint64_t counter_period_frac_sec;
552
+    uint64_t counter_period_esterror_rate_frac_sec;
553
+    uint64_t counter_period_maxerror_rate_frac_sec;
554
+
555
+    /*
556
+     * Time according to time_type field above.
557
+     */
558
+    uint64_t time_sec;        /* Seconds since time_type epoch */
559
+    uint64_t time_frac_sec;        /* Units of 1/2^64 of a second */
560
+    uint64_t time_esterror_nanosec;
561
+    uint64_t time_maxerror_nanosec;
562
+};
563
+
564
+#endif /* __VMCLOCK_ABI_H__ */
565
diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
566
index XXXXXXX..XXXXXXX 100755
567
--- a/scripts/update-linux-headers.sh
568
+++ b/scripts/update-linux-headers.sh
569
@@ -XXX,XX +XXX,XX @@ for i in "$hdrdir"/include/linux/*virtio*.h \
570
"$hdrdir/include/linux/kernel.h" \
571
"$hdrdir/include/linux/kvm_para.h" \
572
"$hdrdir/include/linux/vhost_types.h" \
573
+ "$hdrdir/include/linux/vmclock-abi.h" \
574
"$hdrdir/include/linux/sysinfo.h"; do
575
cp_portable "$i" "$output/include/standard-headers/linux"
576
done
529
--
577
--
530
2.44.0
578
2.47.0
531
579
532
580