Found because of yesterday's Pentium errata fun, and trying to
complete/publish the XSA-462 PoC.

Andrew Cooper (2):
  x86/vlapic: Fix handling of writes to APIC_ESR
  x86/vlapic: Drop vlapic->esr_lock

 xen/arch/x86/hvm/vlapic.c              | 27 +++++++++++---------------
 xen/arch/x86/include/asm/hvm/vlapic.h  |  1 -
 xen/include/public/arch-x86/hvm/save.h |  1 +
 3 files changed, 12 insertions(+), 17 deletions(-)

-- 
2.39.5
Xen currently presents APIC_ESR to guests as a simple read/write register.
This is incorrect.  The SDM states:

  The ESR is a write/read register. Before attempt to read from the ESR,
  software should first write to it. (The value written does not affect
  the values read subsequently; only zero may be written in x2APIC mode.)
  This write clears any previously logged errors and updates the ESR with
  any errors detected since the last write to the ESR. This write also
  rearms the APIC error interrupt triggering mechanism.

Introduce a new pending_esr field in hvm_hw_lapic.  Update vlapic_error()
to accumulate errors here, and extend vlapic_reg_write() to discard the
written value and instead transfer pending_esr into APIC_ESR.  Reads are
still as before.

Importantly, this means that a guest no longer destroys the ESR value it's
looking for in the LVTERR handler when following the SDM instructions.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

Slightly RFC.  This collides with Alejandro's patch which adds the apic_id
field to hvm_hw_lapic too.  However, this is a far more obvious backport
candidate.

lapic_check_hidden() might in principle want to audit this field, but it's
not clear what to check.  While prior Xen will never have produced it in
the migration stream, Intel APIC-V will set APIC_ESR_ILLREGA above and
beyond what Xen will currently emulate.

I've checked that this does behave correctly under Intel APIC-V.  Writes
to APIC_ESR drop the written value into the backing page then take a
trap-style EXIT_REASON_APIC_WRITE which allows us to sample/latch
properly.
---
 xen/arch/x86/hvm/vlapic.c              | 17 +++++++++++++++--
 xen/include/public/arch-x86/hvm/save.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index XXXXXXX..XXXXXXX 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -XXX,XX +XXX,XX @@ static void vlapic_error(struct vlapic *vlapic, unsigned int errmask)
     uint32_t esr;
 
     spin_lock_irqsave(&vlapic->esr_lock, flags);
-    esr = vlapic_get_reg(vlapic, APIC_ESR);
+    esr = vlapic->hw.pending_esr;
     if ( (esr & errmask) != errmask )
     {
         uint32_t lvterr = vlapic_get_reg(vlapic, APIC_LVTERR);
@@ -XXX,XX +XXX,XX @@ static void vlapic_error(struct vlapic *vlapic, unsigned int errmask)
                 errmask |= APIC_ESR_RECVILL;
         }
 
-        vlapic_set_reg(vlapic, APIC_ESR, esr | errmask);
+        vlapic->hw.pending_esr |= errmask;
 
         if ( inj )
             vlapic_set_irq(vlapic, lvterr & APIC_VECTOR_MASK, 0);
@@ -XXX,XX +XXX,XX @@ void vlapic_reg_write(struct vcpu *v, unsigned int reg, uint32_t val)
         vlapic_set_reg(vlapic, APIC_ID, val);
         break;
 
+    case APIC_ESR:
+    {
+        unsigned long flags;
+
+        spin_lock_irqsave(&vlapic->esr_lock, flags);
+        val = vlapic->hw.pending_esr;
+        vlapic->hw.pending_esr = 0;
+        spin_unlock_irqrestore(&vlapic->esr_lock, flags);
+
+        vlapic_set_reg(vlapic, APIC_ESR, val);
+        break;
+    }
+
     case APIC_TASKPRI:
         vlapic_set_reg(vlapic, APIC_TASKPRI, val & 0xff);
         break;
diff --git a/xen/include/public/arch-x86/hvm/save.h b/xen/include/public/arch-x86/hvm/save.h
index XXXXXXX..XXXXXXX 100644
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -XXX,XX +XXX,XX @@ struct hvm_hw_lapic {
     uint32_t disabled; /* VLAPIC_xx_DISABLED */
     uint32_t timer_divisor;
     uint64_t tdt_msr;
+    uint32_t pending_esr;
 };
 
 DECLARE_HVM_SAVE_TYPE(LAPIC, 5, struct hvm_hw_lapic);
-- 
2.39.5
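To see why the old behaviour hurt guests: the SDM sequence quoted above has
the LVTERR handler write APIC_ESR before reading it, and under the previous
read/write emulation that write wiped out the very errors the handler was
about to read.  A minimal guest-side sketch of the prescribed access
pattern (the MMIO base, offset, and helper names here are illustrative
assumptions, not code from this series):

    #include <stdint.h>

    #define APIC_MMIO_BASE  0xfee00000UL  /* default xAPIC base (assumed) */
    #define APIC_ESR_OFF    0x280         /* Error Status Register offset */

    static inline uint32_t apic_read(unsigned long off)
    {
        return *(volatile uint32_t *)(APIC_MMIO_BASE + off);
    }

    static inline void apic_write(unsigned long off, uint32_t val)
    {
        *(volatile uint32_t *)(APIC_MMIO_BASE + off) = val;
    }

    /* SDM-prescribed ESR access: write first (value ignored), then read. */
    static uint32_t lvterr_read_esr(void)
    {
        apic_write(APIC_ESR_OFF, 0);     /* latch pending errors, rearm  */
        return apic_read(APIC_ESR_OFF);  /* read what was just latched   */
    }

With the patch, the write transfers pending_esr into APIC_ESR, so the
subsequent read observes the accumulated errors rather than zero.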
With vlapic->hw.pending_esr held outside of the main regs page, it's much
easier to use atomic operations.  Use xchg() in vlapic_reg_write(), and
*set_bit() in vlapic_error().

The only interesting change is that vlapic_error() now needs to take an
err_bit rather than an errmask, but that's fine for all current callers
and foreseeable changes.

No practical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

It turns out that XSA-462 had an indentation bug in it.

Our spinlock infrastructure is obscenely large.  Bloat-o-meter reports:

  add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-111 (-111)
  Function                old     new   delta
  vlapic_init             208     190     -18
  vlapic_error            112      67     -45
  vlapic_reg_write       1145    1097     -48

In principle we could revert the XSA-462 patch now, and remove the LVTERR
vector handling special case.  MISRA is going to complain either way,
because it will see the cycle through vlapic_set_irq() without considering
the surrounding logic.
---
 xen/arch/x86/hvm/vlapic.c             | 32 ++++++---------------------
 xen/arch/x86/include/asm/hvm/vlapic.h |  1 -
 2 files changed, 7 insertions(+), 26 deletions(-)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index XXXXXXX..XXXXXXX 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -XXX,XX +XXX,XX @@ static int vlapic_find_highest_irr(struct vlapic *vlapic)
     return vlapic_find_highest_vector(&vlapic->regs->data[APIC_IRR]);
 }
 
-static void vlapic_error(struct vlapic *vlapic, unsigned int errmask)
+static void vlapic_error(struct vlapic *vlapic, unsigned int err_bit)
 {
-    unsigned long flags;
-    uint32_t esr;
-
-    spin_lock_irqsave(&vlapic->esr_lock, flags);
-    esr = vlapic->hw.pending_esr;
-    if ( (esr & errmask) != errmask )
+    if ( !test_and_set_bit(err_bit, &vlapic->hw.pending_esr) )
     {
         uint32_t lvterr = vlapic_get_reg(vlapic, APIC_LVTERR);
         bool inj = false;
@@ -XXX,XX +XXX,XX @@ static void vlapic_error(struct vlapic *vlapic, unsigned int errmask)
             if ( (lvterr & APIC_VECTOR_MASK) >= 16 )
                 inj = true;
             else
-                errmask |= APIC_ESR_RECVILL;
+                set_bit(ilog2(APIC_ESR_RECVILL), &vlapic->hw.pending_esr);
         }
 
-        vlapic->hw.pending_esr |= errmask;
-
         if ( inj )
             vlapic_set_irq(vlapic, lvterr & APIC_VECTOR_MASK, 0);
     }
-    spin_unlock_irqrestore(&vlapic->esr_lock, flags);
 }
 
 bool vlapic_test_irq(const struct vlapic *vlapic, uint8_t vec)
@@ -XXX,XX +XXX,XX @@ void vlapic_set_irq(struct vlapic *vlapic, uint8_t vec, uint8_t trig)
 
     if ( unlikely(vec < 16) )
     {
-        vlapic_error(vlapic, APIC_ESR_RECVILL);
+        vlapic_error(vlapic, ilog2(APIC_ESR_RECVILL));
         return;
     }
 
@@ -XXX,XX +XXX,XX @@ void vlapic_ipi(
            vlapic_domain(vlapic), vlapic, short_hand, dest, dest_mode);
 
         if ( unlikely((icr_low & APIC_VECTOR_MASK) < 16) )
-            vlapic_error(vlapic, APIC_ESR_SENDILL);
+            vlapic_error(vlapic, ilog2(APIC_ESR_SENDILL));
         else if ( target )
             vlapic_accept_irq(vlapic_vcpu(target), icr_low);
         break;
@@ -XXX,XX +XXX,XX @@ void vlapic_ipi(
     case APIC_DM_FIXED:
         if ( unlikely((icr_low & APIC_VECTOR_MASK) < 16) )
         {
-            vlapic_error(vlapic, APIC_ESR_SENDILL);
+            vlapic_error(vlapic, ilog2(APIC_ESR_SENDILL));
             break;
         }
         /* fall through */
@@ -XXX,XX +XXX,XX @@ void vlapic_reg_write(struct vcpu *v, unsigned int reg, uint32_t val)
         break;
 
     case APIC_ESR:
-    {
-        unsigned long flags;
-
-        spin_lock_irqsave(&vlapic->esr_lock, flags);
-        val = vlapic->hw.pending_esr;
-        vlapic->hw.pending_esr = 0;
-        spin_unlock_irqrestore(&vlapic->esr_lock, flags);
-
+        val = xchg(&vlapic->hw.pending_esr, 0);
         vlapic_set_reg(vlapic, APIC_ESR, val);
         break;
-    }
 
     case APIC_TASKPRI:
         vlapic_set_reg(vlapic, APIC_TASKPRI, val & 0xff);
@@ -XXX,XX +XXX,XX @@ int vlapic_init(struct vcpu *v)
 
     vlapic_reset(vlapic);
 
-    spin_lock_init(&vlapic->esr_lock);
-
     tasklet_init(&vlapic->init_sipi.tasklet, vlapic_init_sipi_action, v);
 
     if ( v->vcpu_id == 0 )
diff --git a/xen/arch/x86/include/asm/hvm/vlapic.h b/xen/arch/x86/include/asm/hvm/vlapic.h
index XXXXXXX..XXXXXXX 100644
--- a/xen/arch/x86/include/asm/hvm/vlapic.h
+++ b/xen/arch/x86/include/asm/hvm/vlapic.h
@@ -XXX,XX +XXX,XX @@ struct vlapic {
         bool hw, regs;
         uint32_t id, ldr;
     } loaded;
-    spinlock_t esr_lock;
     struct periodic_time pt;
     s_time_t timer_last_update;
     struct page_info *regs_page;
-- 
2.39.5
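The resulting lock-free pattern is easy to state in isolation.  A sketch
using C11 atomics as a stand-in for Xen's test_and_set_bit()/xchg()
primitives (names and types here are illustrative, not Xen's):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    static _Atomic uint32_t pending_esr;

    /* Log one error; true if newly set (~ test_and_set_bit()). */
    static bool log_error(unsigned int err_bit)
    {
        uint32_t mask = 1u << err_bit;

        return !(atomic_fetch_or(&pending_esr, mask) & mask);
    }

    /* Guest write to ESR: atomically drain everything logged (~ xchg()). */
    static uint32_t drain_esr(void)
    {
        return atomic_exchange(&pending_esr, 0);
    }

A logger and a drainer can race freely: each error bit is observed by
exactly one drain, either in the current snapshot or the next one, which
is why no lock is needed.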
Found because of yesterday's Pentium errata fun, and trying to
complete/publish the XSA-462 PoC.

v2 has involved substantial CPU archeology, but has confirmed that the
LVTERR delivery behaviour is implementation specific, and therefore Xen is
fine to stay with its current behaviour.

Andrew Cooper (2):
  x86/vlapic: Fix handling of writes to APIC_ESR
  x86/vlapic: Drop vlapic->esr_lock

 xen/arch/x86/hvm/vlapic.c              | 34 ++++++++++++++------------
 xen/arch/x86/include/asm/hvm/vlapic.h  |  1 -
 xen/include/public/arch-x86/hvm/save.h |  1 +
 3 files changed, 19 insertions(+), 17 deletions(-)

-- 
2.34.1
Xen currently presents APIC_ESR to guests as a simple read/write register.
This is incorrect.  The SDM states:

  The ESR is a write/read register. Before attempt to read from the ESR,
  software should first write to it. (The value written does not affect
  the values read subsequently; only zero may be written in x2APIC mode.)
  This write clears any previously logged errors and updates the ESR with
  any errors detected since the last write to the ESR.

Introduce a new pending_esr field in hvm_hw_lapic.  Update vlapic_error()
to accumulate errors here, and extend vlapic_reg_write() to discard the
written value and transfer pending_esr into APIC_ESR.  Reads are still as
before.

Importantly, this means that a guest no longer destroys the ESR value it's
looking for in the LVTERR handler when following the SDM instructions.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Minor adjustment to the commit message
---
 xen/arch/x86/hvm/vlapic.c              | 17 +++++++++++++++--
 xen/include/public/arch-x86/hvm/save.h |  1 +
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index XXXXXXX..XXXXXXX 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -XXX,XX +XXX,XX @@ static void vlapic_error(struct vlapic *vlapic, unsigned int errmask)
     uint32_t esr;
 
     spin_lock_irqsave(&vlapic->esr_lock, flags);
-    esr = vlapic_get_reg(vlapic, APIC_ESR);
+    esr = vlapic->hw.pending_esr;
     if ( (esr & errmask) != errmask )
     {
         uint32_t lvterr = vlapic_get_reg(vlapic, APIC_LVTERR);
@@ -XXX,XX +XXX,XX @@ static void vlapic_error(struct vlapic *vlapic, unsigned int errmask)
                 errmask |= APIC_ESR_RECVILL;
         }
 
-        vlapic_set_reg(vlapic, APIC_ESR, esr | errmask);
+        vlapic->hw.pending_esr |= errmask;
 
         if ( inj )
             vlapic_set_irq(vlapic, lvterr & APIC_VECTOR_MASK, 0);
@@ -XXX,XX +XXX,XX @@ void vlapic_reg_write(struct vcpu *v, unsigned int reg, uint32_t val)
         vlapic_set_reg(vlapic, APIC_ID, val);
         break;
 
+    case APIC_ESR:
+    {
+        unsigned long flags;
+
+        spin_lock_irqsave(&vlapic->esr_lock, flags);
+        val = vlapic->hw.pending_esr;
+        vlapic->hw.pending_esr = 0;
+        spin_unlock_irqrestore(&vlapic->esr_lock, flags);
+
+        vlapic_set_reg(vlapic, APIC_ESR, val);
+        break;
+    }
+
     case APIC_TASKPRI:
         vlapic_set_reg(vlapic, APIC_TASKPRI, val & 0xff);
         break;
diff --git a/xen/include/public/arch-x86/hvm/save.h b/xen/include/public/arch-x86/hvm/save.h
index XXXXXXX..XXXXXXX 100644
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -XXX,XX +XXX,XX @@ struct hvm_hw_lapic {
     uint32_t disabled; /* VLAPIC_xx_DISABLED */
     uint32_t timer_divisor;
     uint64_t tdt_msr;
+    uint32_t pending_esr;
 };
 
 DECLARE_HVM_SAVE_TYPE(LAPIC, 5, struct hvm_hw_lapic);
-- 
2.34.1
The exact behaviour of LVTERR interrupt generation is implementation
specific.

 * Newer Intel CPUs generate an interrupt when pending_esr becomes
   nonzero.
 * Older Intel and all AMD CPUs generate an interrupt when any individual
   bit in pending_esr becomes set.

Neither vendor documents their behaviour very well.

Xen implements the per-bit behaviour, and has done so since support was
added.  Importantly, the per-bit behaviour can be expressed using the
atomic operations available in the x86 architecture, whereas the former
(interrupt only on pending_esr becoming nonzero) cannot.

With vlapic->hw.pending_esr held outside of the main regs page, it's much
easier to use atomic operations.  Use xchg() in vlapic_reg_write(), and
*set_bit() in vlapic_error().

The only interesting change is that vlapic_error() now needs to take a
single bit rather than a mask, but this is fine for all current callers
and foreseeable changes.

No practical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

Confirmed by Intel, AMD, and 3rd party sources.
https://sandpile.org/x86/apic.htm has been updated to note this behaviour.
None of the vendors have indicated any enthusiasm to clarify the behaviour
in their docs.

v2:
 * Rewrite the commit message from scratch.
---
 xen/arch/x86/hvm/vlapic.c             | 39 ++++++++++++-----------------
 xen/arch/x86/include/asm/hvm/vlapic.h |  1 -
 2 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index XXXXXXX..XXXXXXX 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -XXX,XX +XXX,XX @@ static int vlapic_find_highest_irr(struct vlapic *vlapic)
     return vlapic_find_highest_vector(&vlapic->regs->data[APIC_IRR]);
 }
 
-static void vlapic_error(struct vlapic *vlapic, unsigned int errmask)
+static void vlapic_error(struct vlapic *vlapic, unsigned int err_bit)
 {
-    unsigned long flags;
-    uint32_t esr;
-
-    spin_lock_irqsave(&vlapic->esr_lock, flags);
-    esr = vlapic->hw.pending_esr;
-    if ( (esr & errmask) != errmask )
+    /*
+     * Whether LVTERR is delivered on a per-bit basis, or only on
+     * pending_esr becoming nonzero is implementation specific.
+     *
+     * Xen implements the per-bit behaviour as it can be expressed
+     * locklessly.
+     */
+    if ( !test_and_set_bit(err_bit, &vlapic->hw.pending_esr) )
     {
         uint32_t lvterr = vlapic_get_reg(vlapic, APIC_LVTERR);
         bool inj = false;
@@ -XXX,XX +XXX,XX @@ static void vlapic_error(struct vlapic *vlapic, unsigned int errmask)
             if ( (lvterr & APIC_VECTOR_MASK) >= 16 )
                 inj = true;
             else
-                errmask |= APIC_ESR_RECVILL;
+                set_bit(ilog2(APIC_ESR_RECVILL), &vlapic->hw.pending_esr);
         }
 
-        vlapic->hw.pending_esr |= errmask;
-
         if ( inj )
             vlapic_set_irq(vlapic, lvterr & APIC_VECTOR_MASK, 0);
     }
-    spin_unlock_irqrestore(&vlapic->esr_lock, flags);
 }
 
 bool vlapic_test_irq(const struct vlapic *vlapic, uint8_t vec)
@@ -XXX,XX +XXX,XX @@ void vlapic_set_irq(struct vlapic *vlapic, uint8_t vec, uint8_t trig)
 
     if ( unlikely(vec < 16) )
     {
-        vlapic_error(vlapic, APIC_ESR_RECVILL);
+        vlapic_error(vlapic, ilog2(APIC_ESR_RECVILL));
         return;
     }
 
@@ -XXX,XX +XXX,XX @@ void vlapic_ipi(
            vlapic_domain(vlapic), vlapic, short_hand, dest, dest_mode);
 
         if ( unlikely((icr_low & APIC_VECTOR_MASK) < 16) )
-            vlapic_error(vlapic, APIC_ESR_SENDILL);
+            vlapic_error(vlapic, ilog2(APIC_ESR_SENDILL));
         else if ( target )
             vlapic_accept_irq(vlapic_vcpu(target), icr_low);
         break;
@@ -XXX,XX +XXX,XX @@ void vlapic_ipi(
     case APIC_DM_FIXED:
         if ( unlikely((icr_low & APIC_VECTOR_MASK) < 16) )
         {
-            vlapic_error(vlapic, APIC_ESR_SENDILL);
+            vlapic_error(vlapic, ilog2(APIC_ESR_SENDILL));
             break;
         }
         /* fall through */
@@ -XXX,XX +XXX,XX @@ void vlapic_reg_write(struct vcpu *v, unsigned int reg, uint32_t val)
         break;
 
     case APIC_ESR:
-    {
-        unsigned long flags;
-
-        spin_lock_irqsave(&vlapic->esr_lock, flags);
-        val = vlapic->hw.pending_esr;
-        vlapic->hw.pending_esr = 0;
-        spin_unlock_irqrestore(&vlapic->esr_lock, flags);
-
+        val = xchg(&vlapic->hw.pending_esr, 0);
         vlapic_set_reg(vlapic, APIC_ESR, val);
         break;
-    }
 
     case APIC_TASKPRI:
         vlapic_set_reg(vlapic, APIC_TASKPRI, val & 0xff);
@@ -XXX,XX +XXX,XX @@ int vlapic_init(struct vcpu *v)
 
     vlapic_reset(vlapic);
 
-    spin_lock_init(&vlapic->esr_lock);
-
     tasklet_init(&vlapic->init_sipi.tasklet, vlapic_init_sipi_action, v);
 
     if ( v->vcpu_id == 0 )
diff --git a/xen/arch/x86/include/asm/hvm/vlapic.h b/xen/arch/x86/include/asm/hvm/vlapic.h
index XXXXXXX..XXXXXXX 100644
--- a/xen/arch/x86/include/asm/hvm/vlapic.h
+++ b/xen/arch/x86/include/asm/hvm/vlapic.h
@@ -XXX,XX +XXX,XX @@ struct vlapic {
         bool hw, regs;
         uint32_t id, ldr;
     } loaded;
-    spinlock_t esr_lock;
     struct periodic_time pt;
     s_time_t timer_last_update;
     struct page_info *regs_page;
-- 
2.34.1
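To illustrate the point about x86 atomics (my reading of the commit
message above, expressed in illustrative C11 rather than Xen code): the
per-bit predicate only needs the old value of one bit, which a single
locked RMW such as LOCK BTS provides, whereas the 0 -> nonzero transition
predicate needs the whole old value, which requires a compare-exchange
retry loop:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    static _Atomic uint32_t pending_esr;

    /* Per-bit delivery: one locked RMW suffices (lowers to LOCK BTS when
     * only the tested bit is consumed). */
    static bool newly_set(unsigned int err_bit)
    {
        uint32_t mask = 1u << err_bit;

        return !(atomic_fetch_or(&pending_esr, mask) & mask);
    }

    /* Transition delivery: the whole old value is needed, so a cmpxchg
     * loop is required. */
    static bool became_nonzero(unsigned int err_bit)
    {
        uint32_t old = atomic_load(&pending_esr);

        /* On failure, old is reloaded with the current value. */
        while ( !atomic_compare_exchange_weak(&pending_esr, &old,
                                              old | (1u << err_bit)) )
            ;

        return old == 0;  /* interrupt only when ESR went 0 -> nonzero */
    }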