1 | From: David Woodhouse <dwmw@amazon.co.uk> | 1 | From: David Woodhouse <dwmw@amazon.co.uk> |
---|---|---|---|
2 | 2 | ||
3 | The vmclock device addresses the problem of live migration with precision | 3 | The vmclock device addresses the problem of live migration with |
4 | clocks. The tolerances of a hardware counter (e.g. TSC) are typically | 4 | precision clocks. The tolerances of a hardware counter (e.g. TSC) are |
5 | around ±50PPM. We use NTP/PTP/PPS to discipline that counter against an | 5 | typically around ±50PPM. A guest will use NTP/PTP/PPS to discipline that |
6 | external source of 'real' time, and track the precise frequency of the | 6 | counter against an external source of 'real' time, and track the precise |
7 | counter as it changes with environmental conditions. | 7 | frequency of the counter as it changes with environmental conditions. |
8 | 8 | ||
9 | When a guest is live migrated, anything it knows about the frequency of | 9 | When a guest is live migrated, anything it knows about the frequency of |
10 | the underlying counter becomes invalid. It may move from a host where | 10 | the underlying counter becomes invalid. It may move from a host where |
11 | the counter is running at -50PPM of its nominal frequency, to a host where | 11 | the counter is running at -50PPM of its nominal frequency, to a host where |
12 | it runs at +50PPM. There will also be a step change in the value of the | 12 | it runs at +50PPM. There will also be a step change in the value of the |
... | ... | ||
17 | The device exposes a shared memory region to guests, which can be mapped | 17 | The device exposes a shared memory region to guests, which can be mapped |
18 | all the way to userspace. In the first phase, this merely advertises a | 18 | all the way to userspace. In the first phase, this merely advertises a |
19 | 'disruption_marker', which indicates that the guest should throw away any | 19 | 'disruption_marker', which indicates that the guest should throw away any |
20 | NTP synchronization it thinks it has, and start again. | 20 | NTP synchronization it thinks it has, and start again. |
21 | 21 | ||
22 | Because can be exposed all the way to userspace, applications can still | 22 | Because the region can be exposed all the way to userspace, applications |
23 | use time from a vDSO 'system call', and check the disruption marker to | 23 | can still use time from a fast vDSO 'system call', and check the |
24 | be sure that their timestamp is indeed truthful. | 24 | disruption marker to be sure that their timestamp is indeed truthful. |
25 | 25 | ||
26 | The structure also allows for the precise time, as known by the host, to | 26 | The structure also allows for the precise time, as known by the host, to |
27 | be exposed directly to guests so that they don't have to wait for NTP to | 27 | be exposed directly to guests so that they don't have to wait for NTP to |
28 | resync from scratch. | 28 | resync from scratch. |
29 | 29 | ||
30 | The values and fields are based on the nascent virtio-rtc specification, | ||
31 | and the intent is that a version (hopefully precisely this version) of | ||
32 | this structure will be included as an optional part of that spec. In the | ||
33 | meantime, a simple ACPI device along the lines of VMGENID is perfectly | ||
34 | sufficient and is compatible with what's being shipped in certain | ||
35 | commercial hypervisors. | ||
36 | |||
37 | Linux guest support was merged into the 6.13-rc1 kernel: | ||
38 | https://git.kernel.org/torvalds/c/205032724226 | ||
39 | |||
30 | Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> | 40 | Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> |
41 | Reviewed-by: Paul Durrant <paul@xen.org> | ||
31 | --- | 42 | --- |
32 | 43 | v6: | |
33 | Guest kernel support at | 44 | • Rebase for DEFINE_PROP_END_OF_LIST removal and sysemu→system |
34 | https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock | 45 | rename. |
35 | and discussion at | 46 | |
36 | https://lore.kernel.org/lkml/51dcda5b675fb68c54b74fd19c408a3a086fc412.camel@infradead.org/ | 47 | v5: |
37 | 48 | • Trivial simplification to AML generation. | |
38 | hw/acpi/Kconfig | 5 ++ | 49 | • Import vmclock-abi.h from Linux now the guest support is merged. |
39 | hw/acpi/meson.build | 1 + | 50 | |
40 | hw/acpi/vmclock-abi.h | 175 +++++++++++++++++++++++++++++++++++++ | 51 | v4: |
41 | hw/acpi/vmclock.c | 177 ++++++++++++++++++++++++++++++++++++++ | 52 | • Trivial checkpatch fixes and comment improvements. |
42 | hw/i386/Kconfig | 1 + | 53 | |
43 | hw/i386/acpi-build.c | 10 ++- | 54 | v3: |
44 | include/hw/acpi/vmclock.h | 34 ++++++++ | 55 | • Add comment that vmclock-abi.h will come from the Linux kernel |
45 | 7 files changed, 402 insertions(+), 1 deletion(-) | 56 | headers once it gets merged there. |
46 | create mode 100644 hw/acpi/vmclock-abi.h | 57 | |
58 | v2: | ||
59 | • Change esterror/maxerror fields to nanoseconds. | ||
60 | • Change to officially assigned AMZNC10C ACPI HID. | ||
61 | • Fix little-endian handling of fields in update. | ||
62 | |||
63 | |||
64 | hw/acpi/Kconfig | 5 + | ||
65 | hw/acpi/meson.build | 1 + | ||
66 | hw/acpi/vmclock.c | 179 ++++++++++++++++++ | ||
67 | hw/i386/Kconfig | 1 + | ||
68 | hw/i386/acpi-build.c | 10 +- | ||
69 | include/hw/acpi/vmclock.h | 34 ++++ | ||
70 | include/standard-headers/linux/vmclock-abi.h | 182 +++++++++++++++++++ | ||
71 | scripts/update-linux-headers.sh | 1 + | ||
72 | 8 files changed, 412 insertions(+), 1 deletion(-) | ||
47 | create mode 100644 hw/acpi/vmclock.c | 73 | create mode 100644 hw/acpi/vmclock.c |
48 | create mode 100644 include/hw/acpi/vmclock.h | 74 | create mode 100644 include/hw/acpi/vmclock.h |
75 | create mode 100644 include/standard-headers/linux/vmclock-abi.h | ||
49 | 76 | ||
50 | diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig | 77 | diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig |
51 | index XXXXXXX..XXXXXXX 100644 | 78 | index XXXXXXX..XXXXXXX 100644 |
52 | --- a/hw/acpi/Kconfig | 79 | --- a/hw/acpi/Kconfig |
53 | +++ b/hw/acpi/Kconfig | 80 | +++ b/hw/acpi/Kconfig |
... | ... | ||
73 | acpi_ss.add(when: 'CONFIG_ACPI_VMGENID', if_true: files('vmgenid.c')) | 100 | acpi_ss.add(when: 'CONFIG_ACPI_VMGENID', if_true: files('vmgenid.c')) |
74 | +acpi_ss.add(when: 'CONFIG_ACPI_VMCLOCK', if_true: files('vmclock.c')) | 101 | +acpi_ss.add(when: 'CONFIG_ACPI_VMCLOCK', if_true: files('vmclock.c')) |
75 | acpi_ss.add(when: 'CONFIG_ACPI_HW_REDUCED', if_true: files('generic_event_device.c')) | 102 | acpi_ss.add(when: 'CONFIG_ACPI_HW_REDUCED', if_true: files('generic_event_device.c')) |
76 | acpi_ss.add(when: 'CONFIG_ACPI_HMAT', if_true: files('hmat.c')) | 103 | acpi_ss.add(when: 'CONFIG_ACPI_HMAT', if_true: files('hmat.c')) |
77 | acpi_ss.add(when: 'CONFIG_ACPI_APEI', if_true: files('ghes.c'), if_false: files('ghes-stub.c')) | 104 | acpi_ss.add(when: 'CONFIG_ACPI_APEI', if_true: files('ghes.c'), if_false: files('ghes-stub.c')) |
78 | diff --git a/hw/acpi/vmclock-abi.h b/hw/acpi/vmclock-abi.h | ||
79 | new file mode 100644 | ||
80 | index XXXXXXX..XXXXXXX | ||
81 | --- /dev/null | ||
82 | +++ b/hw/acpi/vmclock-abi.h | ||
83 | @@ -XXX,XX +XXX,XX @@ | ||
84 | +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */ | ||
85 | + | ||
86 | +/* | ||
87 | + * This structure provides a vDSO-style clock to VM guests, exposing the | ||
88 | + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch | ||
89 | + * counter, etc.) and real time. It is designed to address the problem of | ||
90 | + * live migration, which other clock enlightenments do not. | ||
91 | + * | ||
92 | + * When a guest is live migrated, this affects the clock in two ways. | ||
93 | + * | ||
94 | + * First, even between identical hosts the actual frequency of the underlying | ||
95 | + * counter will change within the tolerances of its specification (typically | ||
96 | + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the | ||
97 | + * same host, but can be tracked by NTP as it generally varies slowly. With | ||
98 | + * live migration there is a step change in the frequency, with no warning. | ||
99 | + * | ||
100 | + * Second, there may be a step change in the value of the counter itself, as | ||
101 | + * its accuracy is limited by the precision of the NTP synchronization on the | ||
102 | + * source and destination hosts. | ||
103 | + * | ||
104 | + * So any calibration (NTP, PTP, etc.) which the guest has done on the source | ||
105 | + * host before migration is invalid, and needs to be redone on the new host. | ||
106 | + * | ||
107 | + * In its most basic mode, this structure provides only an indication to the | ||
108 | + * guest that live migration has occurred. This allows the guest to know that | ||
109 | + * its clock is invalid and take remedial action. For applications that need | ||
110 | + * reliable accurate timestamps (e.g. distributed databases), the structure | ||
111 | + * can be mapped all the way to userspace. This allows the application to see | ||
112 | + * directly for itself that the clock is disrupted and take appropriate | ||
113 | + * action, even when using a vDSO-style method to get the time instead of a | ||
114 | + * system call. | ||
115 | + * | ||
116 | + * In its more advanced mode, this structure can also be used to expose the | ||
117 | + * precise relationship of the CPU counter to real time, as calibrated by the | ||
118 | + * host. This means that userspace applications can have accurate time | ||
119 | + * immediately after live migration, rather than having to pause operations | ||
120 | + * and wait for NTP to recover. This mode does, of course, rely on the | ||
121 | + * counter being reliable and consistent across CPUs. | ||
122 | + * | ||
123 | + * Note that this must be true UTC, never with smeared leap seconds. If a | ||
124 | + * guest wishes to construct a smeared clock, it can do so. Presenting a | ||
125 | + * smeared clock through this interface would be problematic because it | ||
126 | + * actually messes with the apparent counter *period*. A linear smearing | ||
127 | + * of 1 ms per second would effectively tweak the counter period by 1000PPM | ||
128 | + * at the start/end of the smearing period, while a sinusoidal smear would | ||
129 | + * basically be impossible to represent. | ||
130 | + */ | ||
131 | + | ||
132 | +#ifndef __VMCLOCK_ABI_H__ | ||
133 | +#define __VMCLOCK_ABI_H__ | ||
134 | + | ||
135 | +#ifdef __KERNEL__ | ||
136 | +#include <linux/types.h> | ||
137 | +#else | ||
138 | +#include <stdint.h> | ||
139 | +#endif | ||
140 | + | ||
141 | +struct vmclock_abi { | ||
142 | + uint32_t magic; | ||
143 | +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ | ||
144 | + uint16_t size; /* Size of page containing this structure */ | ||
145 | + uint16_t version; /* 1 */ | ||
146 | + | ||
147 | + /* Sequence lock. Low bit means an update is in progress. */ | ||
148 | + uint32_t seq_count; | ||
149 | + | ||
150 | + uint32_t flags; | ||
151 | + /* Indicates that the tai_offset_sec field is valid */ | ||
152 | +#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0) | ||
153 | + /* | ||
154 | + * Optionally used to notify guests of pending maintenance events. | ||
155 | + * A guest may wish to remove itself from service if an event is | ||
156 | + * coming up. Two flags indicate the rough imminence of the event. | ||
157 | + */ | ||
158 | +#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */ | ||
159 | +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */ | ||
160 | + /* Indicates that the utc_time_maxerror_picosec field is valid */ | ||
161 | +#define VMCLOCK_FLAG_UTC_MAXERROR_VALID (1 << 3) | ||
162 | + /* Indicates counter_period_error_rate_frac_sec is valid */ | ||
163 | +#define VMCLOCK_FLAG_PERIOD_ERROR_VALID (1 << 4) | ||
164 | + | ||
165 | + /* | ||
166 | + * This field changes to another non-repeating value when the CPU | ||
167 | + * counter is disrupted, for example on live migration. This lets | ||
168 | + * the guest know that it should discard any calibration it has | ||
169 | + * performed of the counter against external sources (NTP/PTP/etc.). | ||
170 | + */ | ||
171 | + uint64_t disruption_marker; | ||
172 | + | ||
173 | + uint8_t clock_status; | ||
174 | +#define VMCLOCK_STATUS_UNKNOWN 0 | ||
175 | +#define VMCLOCK_STATUS_INITIALIZING 1 | ||
176 | +#define VMCLOCK_STATUS_SYNCHRONIZED 2 | ||
177 | +#define VMCLOCK_STATUS_FREERUNNING 3 | ||
178 | +#define VMCLOCK_STATUS_UNRELIABLE 4 | ||
179 | + | ||
180 | + uint8_t counter_id; | ||
181 | +#define VMCLOCK_COUNTER_INVALID 0 | ||
182 | +#define VMCLOCK_COUNTER_X86_TSC 1 | ||
183 | +#define VMCLOCK_COUNTER_ARM_VCNT 2 | ||
184 | +#define VMCLOCK_COUNTER_X86_ART 3 | ||
185 | + | ||
186 | + /* | ||
187 | + * By providing the offset from UTC to TAI, the guest can know both | ||
188 | + * UTC and TAI reliably, whichever is indicated in the time_type | ||
189 | + * field. Valid if VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags. | ||
190 | + */ | ||
191 | + int16_t tai_offset_sec; | ||
192 | + | ||
193 | + /* | ||
194 | + * The time exposed through this device is never smeared; if it | ||
195 | + * claims to be VMCLOCK_TIME_UTC then it MUST be UTC. This field | ||
196 | + * provides a hint to the guest operating system, such that *if* | ||
197 | + * the guest OS wants to provide its users with an alternative | ||
198 | + * clock which does not follow the POSIX CLOCK_REALTIME standard, | ||
199 | + * it may do so in a fashion consistent with the other systems | ||
200 | + * in the nearby environment. | ||
201 | + */ | ||
202 | + uint8_t leap_second_smearing_hint; | ||
203 | + /* Provide true UTC to users, unsmeared. */ | ||
204 | +#define VMCLOCK_SMEARING_NONE 0 | ||
205 | + /* | ||
206 | + * https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/ | ||
207 | + * From noon on the day before to noon on the day after, smear the | ||
208 | + * clock by a linear 1/86400s per second. | ||
209 | + */ | ||
210 | +#define VMCLOCK_SMEARING_LINEAR_86400 1 | ||
211 | + /* | ||
212 | + * draft-kuhn-leapsecond-00 | ||
213 | + * For the 1000s leading up to the leap second, smear the clock by | ||
214 | + * a linear 1ms per second. | ||
215 | + */ | ||
216 | +#define VMCLOCK_SMEARING_UTC_SLS 2 | ||
217 | + | ||
218 | + /* | ||
219 | + * What time is exposed in the time_sec/time_frac_sec fields? | ||
220 | + */ | ||
221 | + uint8_t time_type; | ||
222 | +#define VMCLOCK_TIME_UNKNOWN 0 /* Invalid / no time exposed */ | ||
223 | +#define VMCLOCK_TIME_UTC 1 /* Since 1970-01-01 00:00:00z */ | ||
224 | +#define VMCLOCK_TIME_TAI 2 /* Since 1970-01-01 00:00:00z */ | ||
225 | +#define VMCLOCK_TIME_MONOTONIC 3 /* Since undefined epoch */ | ||
226 | + | ||
227 | + /* Bit shift for counter_period_frac_sec and its error rate */ | ||
228 | + uint8_t counter_period_shift; | ||
229 | + | ||
230 | + /* | ||
231 | + * Unlike in NTP, this can indicate a leap second in the past. This | ||
232 | + * is needed to allow guests to derive an imprecise clock with | ||
233 | + * smeared leap seconds for themselves, as some modes of smearing | ||
234 | + * need the adjustments to continue even after the moment at which | ||
235 | + * the leap second should have occurred. | ||
236 | + */ | ||
237 | + int8_t leapsecond_direction; | ||
238 | + uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */ | ||
239 | + | ||
240 | + /* | ||
241 | + * Paired values of counter and UTC at a given point in time. | ||
242 | + */ | ||
243 | + uint64_t counter_value; | ||
244 | + uint64_t time_sec; | ||
245 | + uint64_t time_frac_sec; | ||
246 | + | ||
247 | + /* | ||
248 | + * Counter frequency, and error margin. The unit of these fields is | ||
249 | + * seconds >> (64 + counter_period_shift) | ||
250 | + */ | ||
251 | + uint64_t counter_period_frac_sec; | ||
252 | + uint64_t counter_period_error_rate_frac_sec; | ||
253 | + | ||
254 | + /* Error margin of UTC reading above (± picoseconds) */ | ||
255 | + uint64_t utc_time_maxerror_picosec; | ||
256 | +}; | ||
257 | + | ||
258 | +#endif /* __VMCLOCK_ABI_H__ */ | ||
259 | diff --git a/hw/acpi/vmclock.c b/hw/acpi/vmclock.c | 105 | diff --git a/hw/acpi/vmclock.c b/hw/acpi/vmclock.c |
260 | new file mode 100644 | 106 | new file mode 100644 |
261 | index XXXXXXX..XXXXXXX | 107 | index XXXXXXX..XXXXXXX |
262 | --- /dev/null | 108 | --- /dev/null |
263 | +++ b/hw/acpi/vmclock.c | 109 | +++ b/hw/acpi/vmclock.c |
... | ... | ||
282 | +#include "hw/acpi/vmclock.h" | 128 | +#include "hw/acpi/vmclock.h" |
283 | +#include "hw/nvram/fw_cfg.h" | 129 | +#include "hw/nvram/fw_cfg.h" |
284 | +#include "hw/qdev-properties.h" | 130 | +#include "hw/qdev-properties.h" |
285 | +#include "hw/qdev-properties-system.h" | 131 | +#include "hw/qdev-properties-system.h" |
286 | +#include "migration/vmstate.h" | 132 | +#include "migration/vmstate.h" |
287 | +#include "sysemu/reset.h" | 133 | +#include "system/reset.h" |
288 | + | 134 | + |
289 | +#include "vmclock-abi.h" | 135 | +#include "standard-headers/linux/vmclock-abi.h" |
290 | + | 136 | + |
291 | +void vmclock_build_acpi(VmclockState *vms, GArray *table_data, | 137 | +void vmclock_build_acpi(VmclockState *vms, GArray *table_data, |
292 | + BIOSLinker *linker, const char *oem_id) | 138 | + BIOSLinker *linker, const char *oem_id) |
293 | +{ | 139 | +{ |
294 | + Aml *ssdt, *dev, *scope, *method, *addr, *crs; | 140 | + Aml *ssdt, *dev, *scope, *crs; |
295 | + AcpiTable table = { .sig = "SSDT", .rev = 1, | 141 | + AcpiTable table = { .sig = "SSDT", .rev = 1, |
296 | + .oem_id = oem_id, .oem_table_id = "VMCLOCK" }; | 142 | + .oem_id = oem_id, .oem_table_id = "VMCLOCK" }; |
297 | + | 143 | + |
298 | + /* Put VMCLOCK into a separate SSDT table */ | 144 | + /* Put VMCLOCK into a separate SSDT table */ |
299 | + acpi_table_begin(&table, table_data); | 145 | + acpi_table_begin(&table, table_data); |
300 | + ssdt = init_aml_allocator(); | 146 | + ssdt = init_aml_allocator(); |
301 | + | 147 | + |
302 | + scope = aml_scope("\\_SB"); | 148 | + scope = aml_scope("\\_SB"); |
303 | + dev = aml_device("VCLK"); | 149 | + dev = aml_device("VCLK"); |
304 | + aml_append(dev, aml_name_decl("_HID", aml_string("QEMUVCLK"))); | 150 | + aml_append(dev, aml_name_decl("_HID", aml_string("AMZNC10C"))); |
305 | + aml_append(dev, aml_name_decl("_CID", aml_string("VMCLOCK"))); | 151 | + aml_append(dev, aml_name_decl("_CID", aml_string("VMCLOCK"))); |
306 | + aml_append(dev, aml_name_decl("_DDN", aml_string("VMCLOCK"))); | 152 | + aml_append(dev, aml_name_decl("_DDN", aml_string("VMCLOCK"))); |
307 | + | 153 | + |
308 | + /* Simple status method */ | 154 | + /* Simple status method */ |
309 | + method = aml_method("_STA", 0, AML_NOTSERIALIZED); | 155 | + aml_append(dev, aml_name_decl("_STA", aml_int(0xf))); |
310 | + addr = aml_local(0); | ||
311 | + aml_append(method, aml_store(aml_int(0xf), addr)); | ||
312 | + aml_append(method, aml_return(addr)); | ||
313 | + aml_append(dev, method); | ||
314 | + | 156 | + |
315 | + crs = aml_resource_template(); | 157 | + crs = aml_resource_template(); |
316 | + aml_append(crs, aml_qword_memory(AML_POS_DECODE, | 158 | + aml_append(crs, aml_qword_memory(AML_POS_DECODE, |
317 | + AML_MIN_FIXED, AML_MAX_FIXED, | 159 | + AML_MIN_FIXED, AML_MAX_FIXED, |
318 | + AML_CACHEABLE, AML_READ_ONLY, | 160 | + AML_CACHEABLE, AML_READ_ONLY, |
... | ... | ||
329 | + free_aml_allocator(); | 171 | + free_aml_allocator(); |
330 | +} | 172 | +} |
331 | + | 173 | + |
332 | +static void vmclock_update_guest(VmclockState *vms) | 174 | +static void vmclock_update_guest(VmclockState *vms) |
333 | +{ | 175 | +{ |
176 | + uint64_t disruption_marker; | ||
177 | + uint32_t seq_count; | ||
178 | + | ||
334 | + if (!vms->clk) { | 179 | + if (!vms->clk) { |
335 | + return; | 180 | + return; |
336 | + } | 181 | + } |
337 | + vms->clk->seq_count |= 1; | 182 | + |
183 | + seq_count = le32_to_cpu(vms->clk->seq_count) | 1; | ||
184 | + vms->clk->seq_count = cpu_to_le32(seq_count); | ||
185 | + /* These barriers pair with read barriers in the guest */ | ||
338 | + smp_wmb(); | 186 | + smp_wmb(); |
339 | + | 187 | + |
340 | + vms->clk->disruption_marker++; | 188 | + disruption_marker = le64_to_cpu(vms->clk->disruption_marker); |
341 | + | 189 | + disruption_marker++; |
190 | + vms->clk->disruption_marker = cpu_to_le64(disruption_marker); | ||
191 | + | ||
192 | + /* These barriers pair with read barriers in the guest */ | ||
342 | + smp_wmb(); | 193 | + smp_wmb(); |
343 | + vms->clk->seq_count += 1; | 194 | + vms->clk->seq_count = cpu_to_le32(seq_count + 1); |
344 | +} | 195 | +} |
345 | + | 196 | + |
346 | +/* After restoring an image, we need to update the guest memory and notify | 197 | +/* |
347 | + * it of a potential change to VM Generation ID | 198 | + * After restoring an image, we need to update the guest memory to notify |
199 | + * it of clock disruption. | ||
348 | + */ | 200 | + */ |
349 | +static int vmclock_post_load(void *opaque, int version_id) | 201 | +static int vmclock_post_load(void *opaque, int version_id) |
350 | +{ | 202 | +{ |
351 | + VmclockState *vms = opaque; | 203 | + VmclockState *vms = opaque; |
204 | + | ||
352 | + vmclock_update_guest(vms); | 205 | + vmclock_update_guest(vms); |
353 | + return 0; | 206 | + return 0; |
354 | +} | 207 | +} |
355 | + | 208 | + |
356 | +static const VMStateDescription vmstate_vmclock = { | 209 | +static const VMStateDescription vmstate_vmclock = { |
... | ... | ||
377 | + | 230 | + |
378 | +static void vmclock_realize(DeviceState *dev, Error **errp) | 231 | +static void vmclock_realize(DeviceState *dev, Error **errp) |
379 | +{ | 232 | +{ |
380 | + VmclockState *vms = VMCLOCK(dev); | 233 | + VmclockState *vms = VMCLOCK(dev); |
381 | + | 234 | + |
382 | + /* Given that this function is executing, there is at least one VMCLOCK | 235 | + /* |
236 | + * Given that this function is executing, there is at least one VMCLOCK | ||
383 | + * device. Check if there are several. | 237 | + * device. Check if there are several. |
384 | + */ | 238 | + */ |
385 | + if (!find_vmclock_dev()) { | 239 | + if (!find_vmclock_dev()) { |
386 | + error_setg(errp, "at most one %s device is permitted", TYPE_VMCLOCK); | 240 | + error_setg(errp, "at most one %s device is permitted", TYPE_VMCLOCK); |
387 | + return; | 241 | + return; |
... | ... | ||
400 | + vms->clk->magic = cpu_to_le32(VMCLOCK_MAGIC); | 254 | + vms->clk->magic = cpu_to_le32(VMCLOCK_MAGIC); |
401 | + vms->clk->size = cpu_to_le16(VMCLOCK_SIZE); | 255 | + vms->clk->size = cpu_to_le16(VMCLOCK_SIZE); |
402 | + vms->clk->version = cpu_to_le16(1); | 256 | + vms->clk->version = cpu_to_le16(1); |
403 | + | 257 | + |
404 | + /* These are all zero and thus default, but be explicit */ | 258 | + /* These are all zero and thus default, but be explicit */ |
405 | + vms->clk->time_type = VMCLOCK_TIME_UNKNOWN; | ||
406 | + vms->clk->clock_status = VMCLOCK_STATUS_UNKNOWN; | 259 | + vms->clk->clock_status = VMCLOCK_STATUS_UNKNOWN; |
407 | + vms->clk->counter_id = VMCLOCK_COUNTER_INVALID; | 260 | + vms->clk->counter_id = VMCLOCK_COUNTER_INVALID; |
408 | + | 261 | + |
409 | + qemu_register_reset(vmclock_handle_reset, vms); | 262 | + qemu_register_reset(vmclock_handle_reset, vms); |
410 | + | 263 | + |
411 | + vmclock_update_guest(vms); | 264 | + vmclock_update_guest(vms); |
412 | +} | 265 | +} |
413 | + | 266 | + |
414 | +static Property vmclock_device_properties[] = { | ||
415 | + DEFINE_PROP_END_OF_LIST(), | ||
416 | +}; | ||
417 | + | ||
418 | +static void vmclock_device_class_init(ObjectClass *klass, void *data) | 267 | +static void vmclock_device_class_init(ObjectClass *klass, void *data) |
419 | +{ | 268 | +{ |
420 | + DeviceClass *dc = DEVICE_CLASS(klass); | 269 | + DeviceClass *dc = DEVICE_CLASS(klass); |
421 | + | 270 | + |
422 | + dc->vmsd = &vmstate_vmclock; | 271 | + dc->vmsd = &vmstate_vmclock; |
423 | + dc->realize = vmclock_realize; | 272 | + dc->realize = vmclock_realize; |
424 | + device_class_set_props(dc, vmclock_device_properties); | ||
425 | + dc->hotpluggable = false; | 273 | + dc->hotpluggable = false; |
426 | + set_bit(DEVICE_CATEGORY_MISC, dc->categories); | 274 | + set_bit(DEVICE_CATEGORY_MISC, dc->categories); |
427 | +} | 275 | +} |
428 | + | 276 | + |
429 | +static const TypeInfo vmclock_device_info = { | 277 | +static const TypeInfo vmclock_device_info = { |
... | ... | ||
454 | diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c | 302 | diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c |
455 | index XXXXXXX..XXXXXXX 100644 | 303 | index XXXXXXX..XXXXXXX 100644 |
456 | --- a/hw/i386/acpi-build.c | 304 | --- a/hw/i386/acpi-build.c |
457 | +++ b/hw/i386/acpi-build.c | 305 | +++ b/hw/i386/acpi-build.c |
458 | @@ -XXX,XX +XXX,XX @@ | 306 | @@ -XXX,XX +XXX,XX @@ |
459 | #include "sysemu/tpm.h" | 307 | #include "system/tpm.h" |
460 | #include "hw/acpi/tpm.h" | 308 | #include "hw/acpi/tpm.h" |
461 | #include "hw/acpi/vmgenid.h" | 309 | #include "hw/acpi/vmgenid.h" |
462 | +#include "hw/acpi/vmclock.h" | 310 | +#include "hw/acpi/vmclock.h" |
463 | #include "hw/acpi/erst.h" | 311 | #include "hw/acpi/erst.h" |
464 | #include "hw/acpi/piix4.h" | 312 | #include "hw/acpi/piix4.h" |
465 | #include "sysemu/tpm_backend.h" | 313 | #include "system/tpm_backend.h" |
466 | @@ -XXX,XX +XXX,XX @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine) | 314 | @@ -XXX,XX +XXX,XX @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine) |
467 | size_t aml_len = 0; | 315 | uint8_t *u; |
468 | GArray *tables_blob = tables->table_data; | 316 | GArray *tables_blob = tables->table_data; |
469 | AcpiSlicOem slic_oem = { .id = NULL, .table_id = NULL }; | 317 | AcpiSlicOem slic_oem = { .id = NULL, .table_id = NULL }; |
470 | - Object *vmgenid_dev; | 318 | - Object *vmgenid_dev; |
471 | + Object *vmgenid_dev, *vmclock_dev; | 319 | + Object *vmgenid_dev, *vmclock_dev; |
472 | char *oem_id; | 320 | char *oem_id; |
... | ... | ||
524 | + | 372 | + |
525 | +void vmclock_build_acpi(VmclockState *vms, GArray *table_data, | 373 | +void vmclock_build_acpi(VmclockState *vms, GArray *table_data, |
526 | + BIOSLinker *linker, const char *oem_id); | 374 | + BIOSLinker *linker, const char *oem_id); |
527 | + | 375 | + |
528 | +#endif | 376 | +#endif |
377 | diff --git a/include/standard-headers/linux/vmclock-abi.h b/include/standard-headers/linux/vmclock-abi.h | ||
378 | new file mode 100644 | ||
379 | index XXXXXXX..XXXXXXX | ||
380 | --- /dev/null | ||
381 | +++ b/include/standard-headers/linux/vmclock-abi.h | ||
382 | @@ -XXX,XX +XXX,XX @@ | ||
383 | +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */ | ||
384 | + | ||
385 | +/* | ||
386 | + * This structure provides a vDSO-style clock to VM guests, exposing the | ||
387 | + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch | ||
388 | + * counter, etc.) and real time. It is designed to address the problem of | ||
389 | + * live migration, which other clock enlightenments do not. | ||
390 | + * | ||
391 | + * When a guest is live migrated, this affects the clock in two ways. | ||
392 | + * | ||
393 | + * First, even between identical hosts the actual frequency of the underlying | ||
394 | + * counter will change within the tolerances of its specification (typically | ||
395 | + * ±50PPM, or 4 seconds a day). This frequency also varies over time on the | ||
396 | + * same host, but can be tracked by NTP as it generally varies slowly. With | ||
397 | + * live migration there is a step change in the frequency, with no warning. | ||
398 | + * | ||
399 | + * Second, there may be a step change in the value of the counter itself, as | ||
400 | + * its accuracy is limited by the precision of the NTP synchronization on the | ||
401 | + * source and destination hosts. | ||
402 | + * | ||
403 | + * So any calibration (NTP, PTP, etc.) which the guest has done on the source | ||
404 | + * host before migration is invalid, and needs to be redone on the new host. | ||
405 | + * | ||
406 | + * In its most basic mode, this structure provides only an indication to the | ||
407 | + * guest that live migration has occurred. This allows the guest to know that | ||
408 | + * its clock is invalid and take remedial action. For applications that need | ||
409 | + * reliable accurate timestamps (e.g. distributed databases), the structure | ||
410 | + * can be mapped all the way to userspace. This allows the application to see | ||
411 | + * directly for itself that the clock is disrupted and take appropriate | ||
412 | + * action, even when using a vDSO-style method to get the time instead of a | ||
413 | + * system call. | ||
414 | + * | ||
415 | + * In its more advanced mode, this structure can also be used to expose the | ||
416 | + * precise relationship of the CPU counter to real time, as calibrated by the | ||
417 | + * host. This means that userspace applications can have accurate time | ||
418 | + * immediately after live migration, rather than having to pause operations | ||
419 | + * and wait for NTP to recover. This mode does, of course, rely on the | ||
420 | + * counter being reliable and consistent across CPUs. | ||
421 | + * | ||
422 | + * Note that this must be true UTC, never with smeared leap seconds. If a | ||
423 | + * guest wishes to construct a smeared clock, it can do so. Presenting a | ||
424 | + * smeared clock through this interface would be problematic because it | ||
425 | + * actually messes with the apparent counter *period*. A linear smearing | ||
426 | + * of 1 ms per second would effectively tweak the counter period by 1000PPM | ||
427 | + * at the start/end of the smearing period, while a sinusoidal smear would | ||
428 | + * basically be impossible to represent. | ||
429 | + * | ||
430 | + * This structure is offered with the intent that it be adopted into the | ||
431 | + * nascent virtio-rtc standard, as a virtio-rtc that does not address the live | ||
432 | + * migration problem seems a little less than fit for purpose. For that | ||
433 | + * reason, certain fields use precisely the same numeric definitions as in | ||
434 | + * the virtio-rtc proposal. The structure can also be exposed through an ACPI | ||
435 | + * device with the CID "VMCLOCK", modelled on the "VMGENID" device except for | ||
436 | + * the fact that it uses a real _CRS to convey the address of the structure | ||
437 | + * (which should be a full page, to allow for mapping directly to userspace). | ||
438 | + */ | ||
439 | + | ||
440 | +#ifndef __VMCLOCK_ABI_H__ | ||
441 | +#define __VMCLOCK_ABI_H__ | ||
442 | + | ||
443 | +#include "standard-headers/linux/types.h" | ||
444 | + | ||
445 | +struct vmclock_abi { | ||
446 | + /* CONSTANT FIELDS */ | ||
447 | + uint32_t magic; | ||
448 | +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */ | ||
449 | + uint32_t size; /* Size of region containing this structure */ | ||
450 | + uint16_t version; /* 1 */ | ||
451 | + uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */ | ||
452 | +#define VMCLOCK_COUNTER_ARM_VCNT 0 | ||
453 | +#define VMCLOCK_COUNTER_X86_TSC 1 | ||
454 | +#define VMCLOCK_COUNTER_INVALID 0xff | ||
455 | + uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */ | ||
456 | +#define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */ | ||
457 | +#define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */ | ||
458 | +#define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */ | ||
459 | +#define VMCLOCK_TIME_INVALID_SMEARED 3 /* Not supported */ | ||
460 | +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4 /* Not supported */ | ||
461 | + | ||
462 | + /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */ | ||
463 | + uint32_t seq_count; /* Low bit means an update is in progress */ | ||
464 | + /* | ||
465 | + * This field changes to another non-repeating value when the CPU | ||
466 | + * counter is disrupted, for example on live migration. This lets | ||
467 | + * the guest know that it should discard any calibration it has | ||
468 | + * performed of the counter against external sources (NTP/PTP/etc.). | ||
469 | + */ | ||
470 | + uint64_t disruption_marker; | ||
471 | + uint64_t flags; | ||
472 | + /* Indicates that the tai_offset_sec field is valid */ | ||
473 | +#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0) | ||
474 | + /* | ||
475 | + * Optionally used to notify guests of pending maintenance events. | ||
476 | + * A guest which provides latency-sensitive services may wish to | ||
477 | + * remove itself from service if an event is coming up. Two flags | ||
478 | + * indicate the approximate imminence of the event. | ||
479 | + */ | ||
480 | +#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */ | ||
481 | +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */ | ||
482 | +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID (1 << 3) | ||
483 | +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID (1 << 4) | ||
484 | +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID (1 << 5) | ||
485 | +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID (1 << 6) | ||
486 | + /* | ||
487 | + * If the MONOTONIC flag is set then (other than leap seconds) it is | ||
488 | + * guaranteed that the time calculated according to this structure at | |
489 | + * any given moment shall never appear to be later than the time | ||
490 | + * calculated via the structure at any *later* moment. | ||
491 | + * | ||
492 | + * In particular, a timestamp based on a counter reading taken | ||
493 | + * immediately after setting the low bit of seq_count (and the | ||
494 | + * associated memory barrier), using the previously-valid time and | ||
495 | + * period fields, shall never be later than a timestamp based on | ||
496 | + * a counter reading taken immediately before *clearing* the low | ||
497 | + * bit again after the update, using the about-to-be-valid fields. | ||
498 | + */ | ||
499 | +#define VMCLOCK_FLAG_TIME_MONOTONIC (1 << 7) | ||
500 | + | ||
501 | + uint8_t pad[2]; | ||
502 | + uint8_t clock_status; | ||
503 | +#define VMCLOCK_STATUS_UNKNOWN 0 | ||
504 | +#define VMCLOCK_STATUS_INITIALIZING 1 | ||
505 | +#define VMCLOCK_STATUS_SYNCHRONIZED 2 | ||
506 | +#define VMCLOCK_STATUS_FREERUNNING 3 | ||
507 | +#define VMCLOCK_STATUS_UNRELIABLE 4 | ||
508 | + | ||
509 | + /* | ||
510 | + * The time exposed through this device is never smeared. This field | ||
511 | + * corresponds to the 'subtype' field in virtio-rtc, which indicates | ||
512 | + * the smearing method. However in this case it provides a *hint* to | ||
513 | + * the guest operating system, such that *if* the guest OS wants to | ||
514 | + * provide its users with an alternative clock which does not follow | ||
515 | + * UTC, it may do so in a fashion consistent with the other systems | ||
516 | + * in the nearby environment. | ||
517 | + */ | ||
518 | + uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */ | ||
519 | +#define VMCLOCK_SMEARING_STRICT 0 | ||
520 | +#define VMCLOCK_SMEARING_NOON_LINEAR 1 | ||
521 | +#define VMCLOCK_SMEARING_UTC_SLS 2 | ||
522 | + uint16_t tai_offset_sec; /* Actually two's complement signed */ | ||
523 | + uint8_t leap_indicator; | ||
524 | + /* | ||
525 | + * This field is based on the VIRTIO_RTC_LEAP_xxx values as defined | ||
526 | + * in the current draft of virtio-rtc, but since smearing cannot be | ||
527 | + * used with the shared memory device, some values are not used. | ||
528 | + * | ||
529 | + * The _POST_POS and _POST_NEG values allow the guest to perform | ||
530 | + * its own smearing during the day or so after a leap second when | ||
531 | + * such smearing may need to continue being applied for a leap | ||
532 | + * second which is now theoretically "historical". | ||
533 | + */ | ||
534 | +#define VMCLOCK_LEAP_NONE 0x00 /* No known nearby leap second */ | ||
535 | +#define VMCLOCK_LEAP_PRE_POS 0x01 /* Positive leap second at EOM */ | ||
536 | +#define VMCLOCK_LEAP_PRE_NEG 0x02 /* Negative leap second at EOM */ | ||
537 | +#define VMCLOCK_LEAP_POS 0x03 /* Set during 23:59:60 second */ | ||
538 | +#define VMCLOCK_LEAP_POST_POS 0x04 | ||
539 | +#define VMCLOCK_LEAP_POST_NEG 0x05 | ||
540 | + | ||
541 | + /* Bit shift for counter_period_frac_sec and its error rate */ | ||
542 | + uint8_t counter_period_shift; | ||
543 | + /* | ||
544 | + * Paired values of counter and UTC at a given point in time. | ||
545 | + */ | ||
546 | + uint64_t counter_value; | ||
547 | + /* | ||
548 | + * Counter period, and error margin of same. The unit of these | ||
549 | + * fields is 1/2^(64 + counter_period_shift) of a second. | ||
550 | + */ | ||
551 | + uint64_t counter_period_frac_sec; | ||
552 | + uint64_t counter_period_esterror_rate_frac_sec; | ||
553 | + uint64_t counter_period_maxerror_rate_frac_sec; | ||
554 | + | ||
555 | + /* | ||
556 | + * Time according to time_type field above. | ||
557 | + */ | ||
558 | + uint64_t time_sec; /* Seconds since time_type epoch */ | ||
559 | + uint64_t time_frac_sec; /* Units of 1/2^64 of a second */ | ||
560 | + uint64_t time_esterror_nanosec; | ||
561 | + uint64_t time_maxerror_nanosec; | ||
562 | +}; | ||
563 | + | ||
564 | +#endif /* __VMCLOCK_ABI_H__ */ | ||
565 | diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh | ||
566 | index XXXXXXX..XXXXXXX 100755 | ||
567 | --- a/scripts/update-linux-headers.sh | ||
568 | +++ b/scripts/update-linux-headers.sh | ||
569 | @@ -XXX,XX +XXX,XX @@ for i in "$hdrdir"/include/linux/*virtio*.h \ | ||
570 | "$hdrdir/include/linux/kernel.h" \ | ||
571 | "$hdrdir/include/linux/kvm_para.h" \ | ||
572 | "$hdrdir/include/linux/vhost_types.h" \ | ||
573 | + "$hdrdir/include/linux/vmclock-abi.h" \ | ||
574 | "$hdrdir/include/linux/sysinfo.h"; do | ||
575 | cp_portable "$i" "$output/include/standard-headers/linux" | ||
576 | done | ||
529 | -- | 577 | -- |
530 | 2.44.0 | 578 | 2.47.0 |
531 | 579 | ||
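The header comments describe two mechanisms that a guest reader combines: the seq_count protocol (low bit set while an update is in progress) and the fixed-point period, whose unit is 1/2^(64 + counter_period_shift) of a second. The following is a minimal sketch of one way a reader might use them, not the driver's actual implementation. The names `clk_fields`, `clk_time` and `clk_read` are hypothetical, the struct is a reduced stand-in that does not reproduce the `struct vmclock_abi` layout, and `unsigned __int128` is a GCC/Clang extension assumed here for the 128-bit product.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical subset of the fields; NOT the real struct vmclock_abi layout. */
struct clk_fields {
    _Atomic uint32_t seq_count;        /* low bit set: update in progress */
    uint8_t  counter_period_shift;
    uint64_t counter_value;            /* counter reading paired with time_* */
    uint64_t counter_period_frac_sec;  /* units of 1/2^(64 + shift) second */
    uint64_t time_sec;
    uint64_t time_frac_sec;            /* units of 1/2^64 second */
};

struct clk_time {
    uint64_t sec;
    uint64_t frac_sec;                 /* units of 1/2^64 second */
};

/*
 * Convert a counter reading into a time, retrying while an update is in
 * progress or if seq_count changed during the read. Returns 0 on success,
 * -1 if no stable snapshot was seen. Assumes counter_period_shift < 64 and
 * that counter_now is not earlier than counter_value.
 */
static int clk_read(struct clk_fields *c, uint64_t counter_now,
                    struct clk_time *out)
{
    for (int tries = 0; tries < 1000; tries++) {
        uint32_t seq = atomic_load_explicit(&c->seq_count,
                                            memory_order_acquire);
        if (seq & 1)
            continue;                  /* writer is mid-update */

        uint64_t delta = counter_now - c->counter_value;

        /* delta * period is in units of 1/2^(64 + shift) second. */
        unsigned __int128 frac =
            (unsigned __int128)delta * c->counter_period_frac_sec;
        frac >>= c->counter_period_shift;  /* now units of 1/2^64 second */
        frac += c->time_frac_sec;

        uint64_t sec = c->time_sec + (uint64_t)(frac >> 64);
        uint64_t frac_sec = (uint64_t)frac;

        /* Order the field reads before re-checking the sequence count. */
        atomic_thread_fence(memory_order_acquire);
        if (atomic_load_explicit(&c->seq_count,
                                 memory_order_relaxed) == seq) {
            out->sec = sec;
            out->frac_sec = frac_sec;
            return 0;
        }
    }
    return -1;
}
```

A real consumer would additionally check magic, version, counter_id and clock_status before trusting the result, and would compare disruption_marker against a saved value to decide whether its own NTP/PTP calibration is still usable.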