arch/x86/kernel/cpu/mshyperv.c | 11 ----------- drivers/clocksource/i8253.c | 12 ------------ include/linux/i8253.h | 1 - 3 files changed, 24 deletions(-)
From: Li RongQing <lirongqing@baidu.com>
Zeroing the counter register in pit_shutdown() isn't actually supposed to
stop it from counting, will causes the PIT to start running again,
From the spec:
The largest possible initial count is 0; this is equivalent to 216 for
binary counting and 104 for BCD counting.
The Counter does not stop when it reaches zero. In Modes 0, 1, 4, and 5 the
Counter "wraps around" to the highest count, either FFFF hex for binary
count- ing or 9999 for BCD counting, and continues counting.
Mode 0 is typically used for event counting. After the Control Word is
written, OUT is initially low, and will remain low until the Counter
reaches zero. OUT then goes high and remains high until a new count or a
new Mode 0 Control Word is written into the Counter.
Hyper-V and KVM follow the spec, the issue that 35b69a42 "(clockevents/drivers/
i8253: Add support for PIT shutdown quirk") fixed is in i8253 drivers, not Hyper-v,
so delete the zero timer counter register in shutdown, and delete PIT shutdown
quirk for Hyper-v
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
arch/x86/kernel/cpu/mshyperv.c | 11 -----------
drivers/clocksource/i8253.c | 12 ------------
include/linux/i8253.h | 1 -
3 files changed, 24 deletions(-)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 46668e2..f788889 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -16,7 +16,6 @@
#include <linux/interrupt.h>
#include <linux/irq.h>
#include <linux/kexec.h>
-#include <linux/i8253.h>
#include <linux/random.h>
#include <linux/swiotlb.h>
#include <asm/processor.h>
@@ -399,16 +398,6 @@ static void __init ms_hyperv_init_platform(void)
if (efi_enabled(EFI_BOOT))
x86_platform.get_nmi_reason = hv_get_nmi_reason;
- /*
- * Hyper-V VMs have a PIT emulation quirk such that zeroing the
- * counter register during PIT shutdown restarts the PIT. So it
- * continues to interrupt @18.2 HZ. Setting i8253_clear_counter
- * to false tells pit_shutdown() not to zero the counter so that
- * the PIT really is shutdown. Generation 2 VMs don't have a PIT,
- * and setting this value has no effect.
- */
- i8253_clear_counter_on_shutdown = false;
-
#if IS_ENABLED(CONFIG_HYPERV)
/*
* Setup the hook to get control post apic initialization.
diff --git a/drivers/clocksource/i8253.c b/drivers/clocksource/i8253.c
index d4350bb..169474d 100644
--- a/drivers/clocksource/i8253.c
+++ b/drivers/clocksource/i8253.c
@@ -20,13 +20,6 @@
DEFINE_RAW_SPINLOCK(i8253_lock);
EXPORT_SYMBOL(i8253_lock);
-/*
- * Handle PIT quirk in pit_shutdown() where zeroing the counter register
- * restarts the PIT, negating the shutdown. On platforms with the quirk,
- * platform specific code can set this to false.
- */
-bool i8253_clear_counter_on_shutdown __ro_after_init = true;
-
#ifdef CONFIG_CLKSRC_I8253
/*
* Since the PIT overflows every tick, its not very useful
@@ -117,11 +110,6 @@ static int pit_shutdown(struct clock_event_device *evt)
outb_p(0x30, PIT_MODE);
- if (i8253_clear_counter_on_shutdown) {
- outb_p(0, PIT_CH0);
- outb_p(0, PIT_CH0);
- }
-
raw_spin_unlock(&i8253_lock);
return 0;
}
diff --git a/include/linux/i8253.h b/include/linux/i8253.h
index 8336b2f..e6bb36a 100644
--- a/include/linux/i8253.h
+++ b/include/linux/i8253.h
@@ -21,7 +21,6 @@
#define PIT_LATCH ((PIT_TICK_RATE + HZ/2) / HZ)
extern raw_spinlock_t i8253_lock;
-extern bool i8253_clear_counter_on_shutdown;
extern struct clock_event_device i8253_clockevent;
extern void clockevent_i8253_init(bool oneshot);
--
2.9.4
On Tue, Feb 07 2023 at 09:14, lirongqing@baidu.com wrote: > @@ -117,11 +110,6 @@ static int pit_shutdown(struct clock_event_device *evt) > > outb_p(0x30, PIT_MODE); > > - if (i8253_clear_counter_on_shutdown) { > - outb_p(0, PIT_CH0); > - outb_p(0, PIT_CH0); > - } > - The stop sequence is wrong: When there is a count in progress, writing a new LSB before the counter has counted down to 0 and rolled over to FFFFh, WILL stop the counter. However, if the LSB is loaded AFTER the counter has rolled over to FFFFh, so that an MSB now exists in the counter, then the counter WILL NOT stop. The original i8253 datasheet says: 1) Write 1st byte stops the current counting 2) Write 2nd byte starts the new count The above does not make sure it actually has not rolled over and it obviously is initiating the new count by writing the MSB too. So it does not work on real hardware either. The proper sequence is: // Switch to mode 0 outb_p(0x30, PIT_MODE); // Load the maximum value to ensure there is no rollover outb_p(0xff, PIT_CH0); // Writing MSB starts the counter from 0xFFFF and clears rollover outb_p(0xff, PIT_CH0); // Stop the counter by writing only LSB outb_p(0xff, PIT_CH0); That works on real hardware and fails on KVM... So much for the claim that KVM follows the spec :) So yes, the code is buggy, but instead of deleting it, we rather fix it, no? User space test program below. Thanks, tglx --- #include <stdio.h> #include <unistd.h> #include <sys/io.h> typedef unsigned char u8; typedef unsigned short u16; #define BUILDIO(bwl, bw, type) \ static __always_inline void __out##bwl(type value, u16 port) \ { \ asm volatile("out" #bwl " %" #bw "0, %w1" \ : : "a"(value), "Nd"(port)); \ } \ \ static __always_inline type __in##bwl(u16 port) \ { \ type value; \ asm volatile("in" #bwl " %w1, %" #bw "0" \ : "=a"(value) : "Nd"(port)); \ return value; \ } BUILDIO(b, b, u8) #define inb __inb #define outb __outb #define PIT_MODE 0x43 #define PIT_CH0 0x40 #define PIT_CH2 0x42 static int is8254; static void dump_pit(void) { if (is8254) { // Latch and output counter and status outb(0xC2, PIT_MODE); printf("%02x %02x %02x\n", inb(PIT_CH0), inb(PIT_CH0), inb(PIT_CH0)); } else { // Latch and output counter outb(0x0, PIT_MODE); printf("%02x %02x\n", inb(PIT_CH0), inb(PIT_CH0)); } } int main(int argc, char* argv[]) { if (argc > 1) is8254 = 1; if (ioperm(0x40, 4, 1) != 0) return 1; dump_pit(); printf("Set periodic\n"); outb(0x34, PIT_MODE); outb(0x00, PIT_CH0); outb(0x0F, PIT_CH0); dump_pit(); usleep(1000); dump_pit(); printf("Set oneshot\n"); outb(0x38, PIT_MODE); outb(0x00, PIT_CH0); outb(0x0F, PIT_CH0); dump_pit(); usleep(1000); dump_pit(); printf("Set stop\n"); outb(0x30, PIT_MODE); outb(0xFF, PIT_CH0); outb(0xFF, PIT_CH0); outb(0xFF, PIT_CH0); dump_pit(); usleep(100000); dump_pit(); usleep(100000); dump_pit(); return 0; }
On Thu, 2024-08-01 at 16:21 +0200, Thomas Gleixner wrote: > On Tue, Feb 07 2023 at 09:14, lirongqing@baidu.com wrote: > > @@ -117,11 +110,6 @@ static int pit_shutdown(struct clock_event_device *evt) > > > > outb_p(0x30, PIT_MODE); > > > > - if (i8253_clear_counter_on_shutdown) { > > - outb_p(0, PIT_CH0); > > - outb_p(0, PIT_CH0); > > - } > > - > > The stop sequence is wrong: > > When there is a count in progress, writing a new LSB before the > counter has counted down to 0 and rolled over to FFFFh, WILL stop > the counter. However, if the LSB is loaded AFTER the counter has > rolled over to FFFFh, so that an MSB now exists in the counter, then > the counter WILL NOT stop. > > The original i8253 datasheet says: > > 1) Write 1st byte stops the current counting > 2) Write 2nd byte starts the new count It says that for mode zero ("Interrupt on Terminal Count"), yes. But in that mode, shouldn't the IRQ only fire *one* more time anyway, rather than repeatedly? That should be OK, shouldn't it? "When terminal count is reached, the output will go high and remain high until the selected count register is reloaded wityh the mode or a new count is loaded". It's OK for it to keep *counting* as long as it stops firing interrupts, isn't it? Either way, this is somewhat orthogonal to the patch I posted in https://lore.kernel.org/kvm/6cd62b5058e11a6262cb2e798cc85cc5daead3b1.camel@infradead.org/T/#u for the fact that we don't shut down the PIT at *all* if we aren't ever going to use it. I'm glad I decided to export a function from the clocksource driver and just *call* it from pit_timer_init() though. Means we can bikeshed the shutdown sequence in *one* place and it isn't duplicated.
On Thu, Aug 01 2024 at 18:49, David Woodhouse wrote: > On Thu, 2024-08-01 at 16:21 +0200, Thomas Gleixner wrote: >> The stop sequence is wrong: >> >> When there is a count in progress, writing a new LSB before the >> counter has counted down to 0 and rolled over to FFFFh, WILL stop >> the counter. However, if the LSB is loaded AFTER the counter has >> rolled over to FFFFh, so that an MSB now exists in the counter, then >> the counter WILL NOT stop. >> >> The original i8253 datasheet says: >> >> 1) Write 1st byte stops the current counting >> 2) Write 2nd byte starts the new count > > It says that for mode zero ("Interrupt on Terminal Count"), yes. But in > that mode, shouldn't the IRQ only fire *one* more time anyway, rather > than repeatedly? That should be OK, shouldn't it? > > "When terminal count is reached, the output will go high and remain > high until the selected count register is reloaded wityh the mode or a > new count is loaded". I just confirmed that this is the case on KVM. > It's OK for it to keep *counting* as long as it stops firing > interrupts, isn't it? Yes. So the sequence should stop KVM from trying to inject interrupts. Maybe someone fixes it to actually stop fiddling with the counter too :) > Either way, this is somewhat orthogonal to the patch I posted in > https://lore.kernel.org/kvm/6cd62b5058e11a6262cb2e798cc85cc5daead3b1.camel@infradead.org/T/#u > for the fact that we don't shut down the PIT at *all* if we aren't ever > going to use it. > > I'm glad I decided to export a function from the clocksource driver and > just *call* it from pit_timer_init() though. Means we can bikeshed the > shutdown sequence in *one* place and it isn't duplicated. Right. Though we don't have to make this conditional on hypervisor I think. Thanks, tglx
On Thu, 2024-08-01 at 21:06 +0200, Thomas Gleixner wrote: > On Thu, Aug 01 2024 at 18:49, David Woodhouse wrote: > > On Thu, 2024-08-01 at 16:21 +0200, Thomas Gleixner wrote: > > > The stop sequence is wrong: > > > > > > When there is a count in progress, writing a new LSB before the > > > counter has counted down to 0 and rolled over to FFFFh, WILL stop > > > the counter. However, if the LSB is loaded AFTER the counter has > > > rolled over to FFFFh, so that an MSB now exists in the counter, then > > > the counter WILL NOT stop. > > > > > > The original i8253 datasheet says: > > > > > > 1) Write 1st byte stops the current counting > > > 2) Write 2nd byte starts the new count > > > > It says that for mode zero ("Interrupt on Terminal Count"), yes. But in > > that mode, shouldn't the IRQ only fire *one* more time anyway, rather > > than repeatedly? That should be OK, shouldn't it? > > > > "When terminal count is reached, the output will go high and remain > > high until the selected count register is reloaded wityh the mode or a > > new count is loaded". > > I just confirmed that this is the case on KVM. > > > It's OK for it to keep *counting* as long as it stops firing > > interrupts, isn't it? > > Yes. So the sequence should stop KVM from trying to inject > interrupts. Maybe someone fixes it to actually stop fiddling with the > counter too :) I don't think we care about the counter value, as that's *calculated* on demand when the guest tries to read from it. Or, more to the point, *if* the guest tries to read from it. As opposed to the interrupt, which is a timer in the VMM which takes a CPU out of guest mode and incurs steal time, just to waggle a pin on the emulated PICs for no good reason. > > Either way, this is somewhat orthogonal to the patch I posted in > > https://lore.kernel.org/kvm/6cd62b5058e11a6262cb2e798cc85cc5daead3b1.camel@infradead.org/T/#u > > for the fact that we don't shut down the PIT at *all* if we aren't ever > > going to use it. > > > > I'm glad I decided to export a function from the clocksource driver and > > just *call* it from pit_timer_init() though. Means we can bikeshed the > > shutdown sequence in *one* place and it isn't duplicated. > > Right. Though we don't have to make this conditional on hypervisor I > think. Right, we don't *have* to. I vacillated about that and almost ripped it out before sending the patch, but came down on the side of "hardware is a steaming pile of crap and if I don't *have* to change its behaviour, let's not touch it". I justify my cowardice on the basis that it doesn't *matter* if a hardware implementation is still toggling the IRQ pin; in that case it's only a few irrelevant transistors which are busy, and it doesn't translate to steal time.
On Thu, Aug 01 2024 at 20:21, David Woodhouse wrote: > On Thu, 2024-08-01 at 21:06 +0200, Thomas Gleixner wrote: >> Yes. So the sequence should stop KVM from trying to inject >> interrupts. Maybe someone fixes it to actually stop fiddling with the >> counter too :) > > I don't think we care about the counter value, as that's *calculated* > on demand when the guest tries to read from it. Or, more to the point, > *if* the guest tries to read from it. > > As opposed to the interrupt, which is a timer in the VMM which takes a > CPU out of guest mode and incurs steal time, just to waggle a pin on > the emulated PICs for no good reason. Well, if the implementation still arms the timer in the background, then it matters because that has to be processed too. I haven't looked at that code at all, so what do I know. >> > I'm glad I decided to export a function from the clocksource driver and >> > just *call* it from pit_timer_init() though. Means we can bikeshed the >> > shutdown sequence in *one* place and it isn't duplicated. >> >> Right. Though we don't have to make this conditional on hypervisor I >> think. > > Right, we don't *have* to. I vacillated about that and almost ripped it > out before sending the patch, but came down on the side of "hardware is > a steaming pile of crap and if I don't *have* to change its behaviour, > let's not touch it". > > I justify my cowardice on the basis that it doesn't *matter* if a > hardware implementation is still toggling the IRQ pin; in that case > it's only a few irrelevant transistors which are busy, and it doesn't > translate to steal time. On real hardware it translates to power... Thanks, tglx
On Thu, 2024-08-01 at 22:00 +0200, Thomas Gleixner wrote: > On Thu, Aug 01 2024 at 20:21, David Woodhouse wrote: > > On Thu, 2024-08-01 at 21:06 +0200, Thomas Gleixner wrote: > > > Yes. So the sequence should stop KVM from trying to inject > > > interrupts. Maybe someone fixes it to actually stop fiddling with the > > > counter too :) > > > > I don't think we care about the counter value, as that's *calculated* > > on demand when the guest tries to read from it. Or, more to the point, > > *if* the guest tries to read from it. > > > > As opposed to the interrupt, which is a timer in the VMM which takes a > > CPU out of guest mode and incurs steal time, just to waggle a pin on > > the emulated PICs for no good reason. > > Well, if the implementation still arms the timer in the background, then > it matters because that has to be processed too. I haven't looked at > that code at all, so what do I know. It only needs to arm the timer if it needs to send an interrupt. Otherwise, it's all event-driven from the guest interaction. > > > > I'm glad I decided to export a function from the clocksource driver and > > > > just *call* it from pit_timer_init() though. Means we can bikeshed the > > > > shutdown sequence in *one* place and it isn't duplicated. > > > > > > Right. Though we don't have to make this conditional on hypervisor I > > > think. > > > > Right, we don't *have* to. I vacillated about that and almost ripped it > > out before sending the patch, but came down on the side of "hardware is > > a steaming pile of crap and if I don't *have* to change its behaviour, > > let's not touch it". > > > > I justify my cowardice on the basis that it doesn't *matter* if a > > hardware implementation is still toggling the IRQ pin; in that case > > it's only a few irrelevant transistors which are busy, and it doesn't > > translate to steal time. > > On real hardware it translates to power... Perhaps, although I'd guess it's a negligible amount. Still, happy to be brave and make it unconditional. Want a new version of the patch?
On Thu, Aug 01 2024 at 21:49, David Woodhouse wrote: > On Thu, 2024-08-01 at 22:00 +0200, Thomas Gleixner wrote: >> > I justify my cowardice on the basis that it doesn't *matter* if a >> > hardware implementation is still toggling the IRQ pin; in that case >> > it's only a few irrelevant transistors which are busy, and it doesn't >> > translate to steal time. >> >> On real hardware it translates to power... > > Perhaps, although I'd guess it's a negligible amount. Still, happy to > be brave and make it unconditional. Want a new version of the patch? Let'ss fix the shutdown sequence first (See Michaels latest mail) and then do the clockevents_i8253_init() change on top. Thanks, tglx
On 1 August 2024 22:22:56 BST, Thomas Gleixner <tglx@linutronix.de> wrote: >On Thu, Aug 01 2024 at 21:49, David Woodhouse wrote: >> On Thu, 2024-08-01 at 22:00 +0200, Thomas Gleixner wrote: >>> > I justify my cowardice on the basis that it doesn't *matter* if a >>> > hardware implementation is still toggling the IRQ pin; in that case >>> > it's only a few irrelevant transistors which are busy, and it doesn't >>> > translate to steal time. >>> >>> On real hardware it translates to power... >> >> Perhaps, although I'd guess it's a negligible amount. Still, happy to >> be brave and make it unconditional. Want a new version of the patch? > >Let'ss fix the shutdown sequence first (See Michaels latest mail) and >then do the clockevents_i8253_init() change on top. Makes sense.
On Thu, 2024-08-01 at 22:31 +0100, David Woodhouse wrote: > On 1 August 2024 22:22:56 BST, Thomas Gleixner <tglx@linutronix.de> wrote: > > On Thu, Aug 01 2024 at 21:49, David Woodhouse wrote: > > > On Thu, 2024-08-01 at 22:00 +0200, Thomas Gleixner wrote: > > > > > I justify my cowardice on the basis that it doesn't *matter* if a > > > > > hardware implementation is still toggling the IRQ pin; in that case > > > > > it's only a few irrelevant transistors which are busy, and it doesn't > > > > > translate to steal time. > > > > > > > > On real hardware it translates to power... > > > > > > Perhaps, although I'd guess it's a negligible amount. Still, happy to > > > be brave and make it unconditional. Want a new version of the patch? > > > > Let'ss fix the shutdown sequence first (See Michaels latest mail) and > > then do the clockevents_i8253_init() change on top. > > Makes sense. On second thoughts, let's add clockevent_i8253_disable() first and call it when the PIT isn't being used, then we can bikeshed the precise behaviour of that function to our hearts' content. I think it should look like this. Revised version of your test program attached. void clockevent_i8253_disable(void) { raw_spin_lock(&i8253_lock); /* * Writing the MODE register should stop the counter, according to * the datasheet. This appears to work on real hardware (well, on * modern Intel and AMD boxes; I didn't dig the Pegasos out of the * shed). * * However, some virtual implementations differ, and the MODE change * doesn't have any effect until either the counter is written (KVM * in-kernel PIT) or the next interrupt (QEMU). And in those cases, * it may not stop the *count*, only the interrupts. Although in * the virt case, that probably doesn't matter, as the value of the * counter will only be calculated on demand if the guest reads it; * it's the interrupts which cause steal time. * * Hyper-V apparently has a bug where even in mode 0, the IRQ keeps * firing repeatedly if the counter is running. But it *does* do the * right thing when the MODE register is written. * * So: write the MODE and then load the counter, which ensures that * the IRQ is stopped on those buggy virt implementations. And then * write the MODE again, which is the right way to stop it. */ outb_p(0x30, PIT_MODE); outb_p(0, PIT_CH0); outb_p(0, PIT_CH0); outb_p(0x30, PIT_MODE); raw_spin_unlock(&i8253_lock); }
On Thu, 2024-08-01 at 18:49 +0100, David Woodhouse wrote: > On Thu, 2024-08-01 at 16:21 +0200, Thomas Gleixner wrote: > > On Tue, Feb 07 2023 at 09:14, lirongqing@baidu.com wrote: > > > @@ -117,11 +110,6 @@ static int pit_shutdown(struct clock_event_device *evt) > > > > > > outb_p(0x30, PIT_MODE); > > > > > > - if (i8253_clear_counter_on_shutdown) { > > > - outb_p(0, PIT_CH0); > > > - outb_p(0, PIT_CH0); > > > - } > > > - > > > > The stop sequence is wrong: > > > > When there is a count in progress, writing a new LSB before the > > counter has counted down to 0 and rolled over to FFFFh, WILL stop > > the counter. However, if the LSB is loaded AFTER the counter has > > rolled over to FFFFh, so that an MSB now exists in the counter, then > > the counter WILL NOT stop. > > > > The original i8253 datasheet says: > > > > 1) Write 1st byte stops the current counting > > 2) Write 2nd byte starts the new count > It also prefixes that with "Rewriting a counter register during counting results in the following:". But after you write the MODE register, is it actually supposed to be counting? Just a little further up, under 'Counter Loading', it says: "The count register is not loaded until the count value is written (one or two bytes, depending on the mode selected by the RL bits), followed by a rising edge and a falling edge of the clock. Any read of the counter prior to that falling clock edge may yield invalid data". OK, but what *triggers* that invalid state? Given that it explicitly says that a one-byte counter write ends that state, it isn't the first of two bytes. Surely that means that from the time the MODE register is written, any read of the counter may yield invalid data, until the counter is written? I suspect there are as many implementations (virt and hardware) as there are reasonable interpretations of the spec... and then some.
On Thu, Aug 01 2024 at 19:25, David Woodhouse wrote: > On Thu, 2024-08-01 at 18:49 +0100, David Woodhouse wrote: >> > The stop sequence is wrong: >> > >> > When there is a count in progress, writing a new LSB before the >> > counter has counted down to 0 and rolled over to FFFFh, WILL stop >> > the counter. However, if the LSB is loaded AFTER the counter has >> > rolled over to FFFFh, so that an MSB now exists in the counter, then >> > the counter WILL NOT stop. >> > >> > The original i8253 datasheet says: >> > >> > 1) Write 1st byte stops the current counting >> > 2) Write 2nd byte starts the new count >> > > It also prefixes that with "Rewriting a counter register during > counting results in the following:". > > But after you write the MODE register, is it actually supposed to be > counting? Just a little further up, under 'Counter Loading', it says: It's not counting right out of reset. But once it started counting it's tedious to stop :) > "The count register is not loaded until the count value is written (one > or two bytes, depending on the mode selected by the RL bits), followed > by a rising edge and a falling edge of the clock. Any read of the > counter prior to that falling clock edge may yield invalid data". > > OK, but what *triggers* that invalid state? Given that it explicitly > says that a one-byte counter write ends that state, it isn't the first > of two bytes. Surely that means that from the time the MODE register is > written, any read of the counter may yield invalid data, until the > counter is written? It seems to keep ticking with the old value. > I suspect there are as many implementations (virt and hardware) as > there are reasonable interpretations of the spec... and then some. Indeed. Thanks, tglx
On Thu, 2024-08-01 at 20:57 +0200, Thomas Gleixner wrote: > On Thu, Aug 01 2024 at 19:25, David Woodhouse wrote: > > On Thu, 2024-08-01 at 18:49 +0100, David Woodhouse wrote: > > > > The stop sequence is wrong: > > > > > > > > When there is a count in progress, writing a new LSB before the > > > > counter has counted down to 0 and rolled over to FFFFh, WILL stop > > > > the counter. However, if the LSB is loaded AFTER the counter has > > > > rolled over to FFFFh, so that an MSB now exists in the counter, then > > > > the counter WILL NOT stop. > > > > > > > > The original i8253 datasheet says: > > > > > > > > 1) Write 1st byte stops the current counting > > > > 2) Write 2nd byte starts the new count > > > > > > > It also prefixes that with "Rewriting a counter register during > > counting results in the following:". > > > > But after you write the MODE register, is it actually supposed to be > > counting? Just a little further up, under 'Counter Loading', it says: > > It's not counting right out of reset. But once it started counting it's > tedious to stop :) My reading of the data sheet definitely suggests that it *shouldn't* be. Mode 0 says: "The output will be initially low after the mode set operation. After the count is loaded into the selected count register... the counter will count." Mode 2 says: "When this mode is set, the output will remain high until after the count register is loaded." Mode 4 says: "After the mode is set, the output will be high. When the count is loaded, the counter will begin counting". All of that strongly hints to me that a *compliant* implementation (haha) would stop the interrupts (and the count) when the MODE is set. So writing *only* the mode ought to work, in theory. If the failure mode is just that a bad implementation takes a little bit more power or steal time, I wonder if that's what we should do? It's what Hyper-V wants, it seems. And if other hypervisors/VMMs want to avoid taking pointless steal time, they can fix their bugs. (... gets more coffee and starts fixing bugs ...) > > "The count register is not loaded until the count value is written (one > > or two bytes, depending on the mode selected by the RL bits), followed > > by a rising edge and a falling edge of the clock. Any read of the > > counter prior to that falling clock edge may yield invalid data". > > > > OK, but what *triggers* that invalid state? Given that it explicitly > > says that a one-byte counter write ends that state, it isn't the first > > of two bytes. Surely that means that from the time the MODE register is > > written, any read of the counter may yield invalid data, until the > > counter is written? > > It seems to keep ticking with the old value. > > > I suspect there are as many implementations (virt and hardware) as > > there are reasonable interpretations of the spec... and then some. > > Indeed. > > Thanks, > > tglx >
On Fri, Aug 02 2024 at 09:07, David Woodhouse wrote: > On Thu, 2024-08-01 at 20:57 +0200, Thomas Gleixner wrote: >> It's not counting right out of reset. But once it started counting it's >> tedious to stop :) > > My reading of the data sheet definitely suggests that it *shouldn't* > be. > > Mode 0 says: "The output will be initially low after the mode set > operation. After the count is loaded into the selected count > register... the counter will count." Hmm. Indeed. That does not stop the counter, but it prevents the interrupt from firing over and over. So fine, we can go with the patch from Li, but the changelog needs a rewrite and the code want's a big fat comment. Thanks, tglx
On Fri, 2024-08-02 at 12:49 +0200, Thomas Gleixner wrote: > On Fri, Aug 02 2024 at 09:07, David Woodhouse wrote: > > On Thu, 2024-08-01 at 20:57 +0200, Thomas Gleixner wrote: > > > It's not counting right out of reset. But once it started counting it's > > > tedious to stop :) > > > > My reading of the data sheet definitely suggests that it *shouldn't* > > be. > > > > Mode 0 says: "The output will be initially low after the mode set > > operation. After the count is loaded into the selected count > > register... the counter will count." > > Hmm. Indeed. That does not stop the counter, but it prevents the > interrupt from firing over and over. > > So fine, we can go with the patch from Li, but the changelog needs a > rewrite and the code want's a big fat comment. Nah, it wants to be MODE, COUNT, COUNT, MODE to handle all known implementations. Already posted as [PATCH 2/1] (with big fat comments and a version of your test) at https://lore.kernel.org/kvm/3bc237678ade809cc685fedb8c1a3d435e590639.camel@infradead.org/ Although I just realised that I edited the first patch (to *remove* the now-bogus comments about the stop sequence) before posting that one, so they don't follow cleanly from one another; there's a trivial conflict. I also forgot to remove the pre-1999 typedefs from the test program, despite fixing it to use <stdint.h> like it's the 21st century :) Top two commits of https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks I'll repost properly if you're happy with them?
On Fri, Aug 02 2024 at 12:04, David Woodhouse wrote: > On Fri, 2024-08-02 at 12:49 +0200, Thomas Gleixner wrote: >> So fine, we can go with the patch from Li, but the changelog needs a >> rewrite and the code want's a big fat comment. > > Nah, it wants to be MODE, COUNT, COUNT, MODE to handle all known > implementations. Yes. That works for whatever reason :) > Already posted as [PATCH 2/1] (with big fat comments and a version of > your test) at > > https://lore.kernel.org/kvm/3bc237678ade809cc685fedb8c1a3d435e590639.camel@infradead.org/ > > Although I just realised that I edited the first patch (to *remove* the > now-bogus comments about the stop sequence) before posting that one, so > they don't follow cleanly from one another; there's a trivial conflict. > I also forgot to remove the pre-1999 typedefs from the test program, > despite fixing it to use <stdint.h> like it's the 21st century :) Grandpas are allowed to use pre-1999 typedefs. :) > Top two commits of > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks > > I'll repost properly if you're happy with them? Just make the disable unconditional. Thanks, tglx
On Fri, 2024-08-02 at 15:27 +0200, Thomas Gleixner wrote: > > Top two commits of > > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks > > > > I'll repost properly if you're happy with them? > > Just make the disable unconditional. Oops, thought I'd done that too. Turns out I have to press the buttons on that big black slab in front of me, not just think about it. New series coming up after a brief smoke test.
From: lirongqing@baidu.com <lirongqing@baidu.com> Sent: Monday, February 6, 2023 5:15 PM > > Zeroing the counter register in pit_shutdown() isn't actually supposed to > stop it from counting, will causes the PIT to start running again, > From the spec: > > The largest possible initial count is 0; this is equivalent to 216 for > binary counting and 104 for BCD counting. > > The Counter does not stop when it reaches zero. In Modes 0, 1, 4, and 5 the > Counter "wraps around" to the highest count, either FFFF hex for binary > count- ing or 9999 for BCD counting, and continues counting. > > Mode 0 is typically used for event counting. After the Control Word is > written, OUT is initially low, and will remain low until the Counter > reaches zero. OUT then goes high and remains high until a new count or a > new Mode 0 Control Word is written into the Counter. > > Hyper-V and KVM follow the spec, the issue that 35b69a42 "(clockevents/drivers/ > i8253: Add support for PIT shutdown quirk") fixed is in i8253 drivers, not Hyper-v, > so delete the zero timer counter register in shutdown, and delete PIT shutdown > quirk for Hyper-v From the standpoint of Hyper-V, I'm good with this change. But there's a risk that old hardware might not be compliant with the spec, and needs the zero'ing for some reason. The experts in the x86 space will be in the best position to assess the risk. At the time, the quirk approach was taken so the change applied only to Hyper-V, and any such risk was avoided. For Hyper-V, Reviewed-by: Michael Kelley <mikelley@microsoft.com> > > Suggested-by: Sean Christopherson <seanjc@google.com> > Signed-off-by: Li RongQing <lirongqing@baidu.com> > --- > arch/x86/kernel/cpu/mshyperv.c | 11 ----------- > drivers/clocksource/i8253.c | 12 ------------ > include/linux/i8253.h | 1 - > 3 files changed, 24 deletions(-) > > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c > index 46668e2..f788889 100644 > --- a/arch/x86/kernel/cpu/mshyperv.c > +++ b/arch/x86/kernel/cpu/mshyperv.c > @@ -16,7 +16,6 @@ > #include <linux/interrupt.h> > #include <linux/irq.h> > #include <linux/kexec.h> > -#include <linux/i8253.h> > #include <linux/random.h> > #include <linux/swiotlb.h> > #include <asm/processor.h> > @@ -399,16 +398,6 @@ static void __init ms_hyperv_init_platform(void) > if (efi_enabled(EFI_BOOT)) > x86_platform.get_nmi_reason = hv_get_nmi_reason; > > - /* > - * Hyper-V VMs have a PIT emulation quirk such that zeroing the > - * counter register during PIT shutdown restarts the PIT. So it > - * continues to interrupt @18.2 HZ. Setting i8253_clear_counter > - * to false tells pit_shutdown() not to zero the counter so that > - * the PIT really is shutdown. Generation 2 VMs don't have a PIT, > - * and setting this value has no effect. > - */ > - i8253_clear_counter_on_shutdown = false; > - > #if IS_ENABLED(CONFIG_HYPERV) > /* > * Setup the hook to get control post apic initialization. > diff --git a/drivers/clocksource/i8253.c b/drivers/clocksource/i8253.c > index d4350bb..169474d 100644 > --- a/drivers/clocksource/i8253.c > +++ b/drivers/clocksource/i8253.c > @@ -20,13 +20,6 @@ > DEFINE_RAW_SPINLOCK(i8253_lock); > EXPORT_SYMBOL(i8253_lock); > > -/* > - * Handle PIT quirk in pit_shutdown() where zeroing the counter register > - * restarts the PIT, negating the shutdown. On platforms with the quirk, > - * platform specific code can set this to false. > - */ > -bool i8253_clear_counter_on_shutdown __ro_after_init = true; > - > #ifdef CONFIG_CLKSRC_I8253 > /* > * Since the PIT overflows every tick, its not very useful > @@ -117,11 +110,6 @@ static int pit_shutdown(struct clock_event_device *evt) > > outb_p(0x30, PIT_MODE); > > - if (i8253_clear_counter_on_shutdown) { > - outb_p(0, PIT_CH0); > - outb_p(0, PIT_CH0); > - } > - > raw_spin_unlock(&i8253_lock); > return 0; > } > diff --git a/include/linux/i8253.h b/include/linux/i8253.h > index 8336b2f..e6bb36a 100644 > --- a/include/linux/i8253.h > +++ b/include/linux/i8253.h > @@ -21,7 +21,6 @@ > #define PIT_LATCH ((PIT_TICK_RATE + HZ/2) / HZ) > > extern raw_spinlock_t i8253_lock; > -extern bool i8253_clear_counter_on_shutdown; > extern struct clock_event_device i8253_clockevent; > extern void clockevent_i8253_init(bool oneshot); > > -- > 2.9.4
On 2023-02-08 at 01:04:19 +0000, "Michael Kelley (LINUX)" <mikelley@microsoft.com> said: > From: lirongqing@baidu.com <lirongqing@baidu.com> Sent: Monday, February 6, 2023 5:15 PM > > > > Zeroing the counter register in pit_shutdown() isn't actually supposed to > > stop it from counting, will causes the PIT to start running again, > > From the spec: > > > > The largest possible initial count is 0; this is equivalent to 216 for > > binary counting and 104 for BCD counting. > > > > The Counter does not stop when it reaches zero. In Modes 0, 1, 4, and 5 the > > Counter "wraps around" to the highest count, either FFFF hex for binary > > count- ing or 9999 for BCD counting, and continues counting. > > > > Mode 0 is typically used for event counting. After the Control Word is > > written, OUT is initially low, and will remain low until the Counter > > reaches zero. OUT then goes high and remains high until a new count or a > > new Mode 0 Control Word is written into the Counter. > > > > Hyper-V and KVM follow the spec, the issue that 35b69a42 "(clockevents/drivers/ > > i8253: Add support for PIT shutdown quirk") fixed is in i8253 drivers, not Hyper-v, > > so delete the zero timer counter register in shutdown, and delete PIT shutdown > > quirk for Hyper-v > > From the standpoint of Hyper-V, I'm good with this change. But there's a > risk that old hardware might not be compliant with the spec, and needs the > zero'ing for some reason. The experts in the x86 space will be in the best > position to assess the risk. At the time, the quirk approach was taken so > the change applied only to Hyper-V, and any such risk was avoided. > > For Hyper-V, > Reviewed-by: Michael Kelley <mikelley@microsoft.com> It's not entirely clear why we zero it at all. What was it supposed to achieve? I suppose if we want to be ultra-careful, perhaps we could do the zeroing only on real hardware, if hypervisor_is_type(X86_HYPER_NATIVE)? This does seem to be making VMMs spend a bunch of time stealing cycles from the guest to emulate an unwanted timer, so it would be good to fix it.
On Thu, 2024-08-01 at 10:00 +0100, David Woodhouse wrote: > > > > > > > > On 2023-02-08 at 01:04:19 +0000, "Michael Kelley (LINUX)" <mikelley@microsoft.com> said: > > > > From: lirongqing@baidu.com <lirongqing@baidu.com> Sent: Monday, February 6, 2023 5:15 PM > > > > > > > > > > > > Zeroing the counter register in pit_shutdown() isn't actually supposed to > > > > > > stop it from counting, will causes the PIT to start running again, > > > > > > From the spec: > > > > > > > > > > > > The largest possible initial count is 0; this is equivalent to 216 for > > > > > > binary counting and 104 for BCD counting. > > > > > > > > > > > > The Counter does not stop when it reaches zero. In Modes 0, 1, 4, and 5 the > > > > > > Counter "wraps around" to the highest count, either FFFF hex for binary > > > > > > count- ing or 9999 for BCD counting, and continues counting. > > > > > > > > > > > > Mode 0 is typically used for event counting. After the Control Word is > > > > > > written, OUT is initially low, and will remain low until the Counter > > > > > > reaches zero. OUT then goes high and remains high until a new count or a > > > > > > new Mode 0 Control Word is written into the Counter. > > > > > > > > > > > > Hyper-V and KVM follow the spec, the issue that 35b69a42 "(clockevents/drivers/ > > > > > > i8253: Add support for PIT shutdown quirk") fixed is in i8253 drivers, not Hyper-v, > > > > > > so delete the zero timer counter register in shutdown, and delete PIT shutdown > > > > > > quirk for Hyper-v > > > > > > > > From the standpoint of Hyper-V, I'm good with this change. But there's a > > > > risk that old hardware might not be compliant with the spec, and needs the > > > > zero'ing for some reason. The experts in the x86 space will be in the best > > > > position to assess the risk. At the time, the quirk approach was taken so > > > > the change applied only to Hyper-V, and any such risk was avoided. > > > > > > > > For Hyper-V, > > > > Reviewed-by: Michael Kelley <mikelley@microsoft.com> > > > > It's not entirely clear why we zero it at all. What was it supposed to > > achieve? Answering my own question here, it seems that *some* implementations don't apply the mode change until the counter is subsequently written. I think QEMU applies it immediately. But the KVM in-kernel one (and the AWS Nitro Hypervisor) do not. So the interrupt doesn't shut up until you write the counter. I don't see any justification for writing the counter causing any more than *one* further interrupt though, in single-shot mode. If that's what Hyper-V does, that seems like a bug (which is worked around already). I do see the VMM wasting a bunch of time emulating a PIT IRQ that goes nowhere, but that seems to be because the bootloader or BIOS turns it on, and nothing in Linux ever turns it back *off* again unless it's registered at boot and then pit_shutdown() is called when we switch to something better. I'll look at an optimisation in the VMM which can stop firing the timer if the PIT IRQ is already masked or pending in the PIC and IOAPIC; that's just a waste of steal time. But the guest kernel should probably do something like this... diff --git a/arch/x86/kernel/i8253.c b/arch/x86/kernel/i8253.c index 2b7999a1a50a..d13fd793254e 100644 --- a/arch/x86/kernel/i8253.c +++ b/arch/x86/kernel/i8253.c @@ -8,6 +8,7 @@ #include <linux/timex.h> #include <linux/i8253.h> +#include <asm/hypervisor.h> #include <asm/apic.h> #include <asm/hpet.h> #include <asm/time.h> @@ -39,8 +40,29 @@ static bool __init use_pit(void) bool __init pit_timer_init(void) { - if (!use_pit()) + if (!use_pit()) { + if (!hypervisor_is_type(X86_HYPER_NATIVE)) { + /* + * Don't just ignore the PIT. Ensure it's stopped, + * because VMMs otherwise steal CPU time just to + * pointlessly waggle the (masked) IRQ. + */ + raw_spin_lock(&i8253_lock); + outb_p(0x30, PIT_MODE); + + /* + * It's not entirely clear from the datasheet, but some + * virtual implementations don't stop until the counter + * is actually written. + */ + if (i8253_clear_counter_on_shutdown) { + outb_p(0, PIT_CH0); + outb_p(0, PIT_CH0); + } + raw_spin_unlock(&i8253_lock); + } return false; + } clockevent_i8253_init(true); global_clock_event = &i8253_clockevent; >
On Thu, Aug 01 2024 at 16:22, David Woodhouse wrote: > On Thu, 2024-08-01 at 10:00 +0100, David Woodhouse wrote: > bool __init pit_timer_init(void) > { > - if (!use_pit()) > + if (!use_pit()) { > + if (!hypervisor_is_type(X86_HYPER_NATIVE)) { > + /* > + * Don't just ignore the PIT. Ensure it's stopped, > + * because VMMs otherwise steal CPU time just to > + * pointlessly waggle the (masked) IRQ. > + */ > + raw_spin_lock(&i8253_lock); > + outb_p(0x30, PIT_MODE); > + > + /* > + * It's not entirely clear from the datasheet, but some > + * virtual implementations don't stop until the counter > + * is actually written. > + */ > + if (i8253_clear_counter_on_shutdown) { > + outb_p(0, PIT_CH0); > + outb_p(0, PIT_CH0); > + } > + raw_spin_unlock(&i8253_lock); > + } > return false; > + } That's just wrong. What we want is to have the underlying problem fixed in the driver and then make: > clockevent_i8253_init(true); bool clockevent_i8253_init(bool enable, bool oneshot); so it can invoke the shutdown sequence instead of registering the pile: if (!enable) { shutdown(); return false; } ... return true; and the call site becomes: if (!clockevent_i8253_init(use_pit(), true)) return false; No? Thanks, tglx
On 1 August 2024 22:07:25 BST, Thomas Gleixner <tglx@linutronix.de> wrote: >On Thu, Aug 01 2024 at 16:22, David Woodhouse wrote: >> On Thu, 2024-08-01 at 10:00 +0100, David Woodhouse wrote: >> bool __init pit_timer_init(void) >> { >> - if (!use_pit()) >> + if (!use_pit()) { >> + if (!hypervisor_is_type(X86_HYPER_NATIVE)) { >> + /* >> + * Don't just ignore the PIT. Ensure it's stopped, >> + * because VMMs otherwise steal CPU time just to >> + * pointlessly waggle the (masked) IRQ. >> + */ >> + raw_spin_lock(&i8253_lock); >> + outb_p(0x30, PIT_MODE); >> + >> + /* >> + * It's not entirely clear from the datasheet, but some >> + * virtual implementations don't stop until the counter >> + * is actually written. >> + */ >> + if (i8253_clear_counter_on_shutdown) { >> + outb_p(0, PIT_CH0); >> + outb_p(0, PIT_CH0); >> + } >> + raw_spin_unlock(&i8253_lock); >> + } >> return false; >> + } > >That's just wrong. What we want is to have the underlying problem >fixed in the driver and then make: > >> clockevent_i8253_init(true); > >bool clockevent_i8253_init(bool enable, bool oneshot); > >so it can invoke the shutdown sequence instead of registering the pile: > > if (!enable) { > shutdown(); > return false; > } > ... > return true; > >and the call site becomes: > > if (!clockevent_i8253_init(use_pit(), true)) > return false; > >No? Yes. Well, kind of. The way I actually did it was by exposing the shutdown function instead of an "init" function which optionally did the opposite. But yes, I left the hardware-bashing in precisely once place, in the driver.
On Wed, Feb 08, 2023, Michael Kelley (LINUX) wrote: > From: lirongqing@baidu.com <lirongqing@baidu.com> Sent: Monday, February 6, 2023 5:15 PM > > > > Zeroing the counter register in pit_shutdown() isn't actually supposed to > > stop it from counting, will causes the PIT to start running again, > > From the spec: > > > > The largest possible initial count is 0; this is equivalent to 216 for > > binary counting and 104 for BCD counting. > > > > The Counter does not stop when it reaches zero. In Modes 0, 1, 4, and 5 the > > Counter "wraps around" to the highest count, either FFFF hex for binary > > count- ing or 9999 for BCD counting, and continues counting. > > > > Mode 0 is typically used for event counting. After the Control Word is > > written, OUT is initially low, and will remain low until the Counter > > reaches zero. OUT then goes high and remains high until a new count or a > > new Mode 0 Control Word is written into the Counter. > > > > Hyper-V and KVM follow the spec, the issue that 35b69a42 "(clockevents/drivers/ > > i8253: Add support for PIT shutdown quirk") fixed is in i8253 drivers, not Hyper-v, > > so delete the zero timer counter register in shutdown, and delete PIT shutdown > > quirk for Hyper-v > > From the standpoint of Hyper-V, I'm good with this change. But there's a > risk that old hardware might not be compliant with the spec, and needs the > zero'ing for some reason. The experts in the x86 space will be in the best > position to assess the risk. Yep, my feeling exactly. My input is purely from reading those crusty old specs.
> > > Hyper-V and KVM follow the spec, the issue that 35b69a42 > > > "(clockevents/drivers/ > > > i8253: Add support for PIT shutdown quirk") fixed is in i8253 > > > drivers, not Hyper-v, so delete the zero timer counter register in > > > shutdown, and delete PIT shutdown quirk for Hyper-v > > > > From the standpoint of Hyper-V, I'm good with this change. But > > there's a risk that old hardware might not be compliant with the spec, > > and needs the zero'ing for some reason. The experts in the x86 space > > will be in the best position to assess the risk. > > Yep, my feeling exactly. My input is purely from reading those crusty old specs. Hi Thomas: Could you give some suggestion about this patch? Thanks -Li
© 2016 - 2025 Red Hat, Inc.