Since both kernel and user mode run in ring 3, they run in the same
"predictor mode". While the kernel could take care of this itself, doing
so would be yet another item distinguishing PV from native. Additionally
we're in a much better position to issue the barrier command, and we can
save a #GP (for privileged instruction emulation) this way.

To allow performance to be recovered, introduce a new VM assist allowing the
guest kernel to suppress this barrier.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Leverage entry-IBPB. Add VM assist. Re-base.
---
I'm not entirely happy with re-using opt_ibpb_ctxt_switch here (it's a
mode switch after all, but v1 used opt_ibpb here), but it also didn't
seem very reasonable to introduce yet another command line option. The
only feasible alternative I would see is to check the CPUID bits directly.
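
As a reading aid for the condition this patch adds to toggle_guest_mode(),
the three inputs that gate the barrier can be modelled in isolation. This is
only a sketch: need_mode_switch_ibpb() and its plain bool parameters are
hypothetical stand-ins for opt_ibpb_ctxt_switch, the domain's SCF_entry_ibpb
flag and the new VM assist; none of this is Xen code.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the check added on the user->kernel path.
 *
 * ctxt_switch_ibpb: stands in for opt_ibpb_ctxt_switch (the reused control).
 * entry_ibpb:       stands in for SCF_entry_ibpb being set for the domain,
 *                   in which case a barrier is already issued on entry.
 * assist_no_ibpb:   stands in for the guest having enabled
 *                   VMASST_TYPE_mode_switch_no_ibpb.
 */
static bool need_mode_switch_ibpb(bool ctxt_switch_ibpb, bool entry_ibpb,
                                  bool assist_no_ibpb)
{
    /* Barrier only if enabled, not already covered, and not opted out of. */
    return ctxt_switch_ibpb && !entry_ibpb && !assist_no_ibpb;
}
```

With these inputs, the barrier is only needed when the control is on, entry-IBPB
is not already in use, and the guest has not opted out; any other combination
skips the MSR write.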
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -757,7 +757,8 @@ static inline void pv_inject_sw_interrup
  * but we can't make such requests fail all of the sudden.
  */
 #define PV64_VM_ASSIST_MASK (PV32_VM_ASSIST_MASK | \
-                             (1UL << VMASST_TYPE_m2p_strict))
+                             (1UL << VMASST_TYPE_m2p_strict) | \
+                             (1UL << VMASST_TYPE_mode_switch_no_ibpb))
 #define HVM_VM_ASSIST_MASK (1UL << VMASST_TYPE_runstate_update_flag)
 
 #define arch_vm_assist_valid_mask(d) \
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -467,7 +467,15 @@ void toggle_guest_mode(struct vcpu *v)
     if ( v->arch.flags & TF_kernel_mode )
         v->arch.pv.gs_base_kernel = gs_base;
     else
+    {
         v->arch.pv.gs_base_user = gs_base;
+
+        if ( opt_ibpb_ctxt_switch &&
+             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
+             !VM_ASSIST(d, mode_switch_no_ibpb) )
+            wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB);
+    }
+
     asm volatile ( "swapgs" );
 
     _toggle_guest_pt(v);
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -571,6 +571,16 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t);
  */
 #define VMASST_TYPE_m2p_strict 32
 
+/*
+ * x86-64 guests: Suppress IBPB on guest-user to guest-kernel mode switch.
+ *
+ * By default (on affected and capable hardware) as a safety measure Xen,
+ * to cover for the fact that guest-kernel and guest-user modes are both
+ * running in ring 3 (and hence share prediction context), would issue a
+ * barrier for user->kernel mode switches of PV guests.
+ */
+#define VMASST_TYPE_mode_switch_no_ibpb 33
+
 #if __XEN_INTERFACE_VERSION__ < 0x00040600
 #define MAX_VMASST_TYPE 3
 #endif
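
For illustration, a guest kernel that handles prediction isolation itself
might opt out roughly as sketched here. This is an assumption-laden sketch,
not guest code from this series: stub_vm_assist() is a hypothetical stand-in
for the vm_assist hypercall (a real guest would go through its own hypercall
wrapper), and the type value 33 is the one proposed by this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define VMASST_CMD_enable                0   /* as in xen/include/public/xen.h */
#define VMASST_TYPE_mode_switch_no_ibpb  33  /* value proposed by this patch */

/* Hypothetical model of the hypervisor side: accepted and enabled assists. */
static uint64_t stub_valid_mask = UINT64_C(1) << VMASST_TYPE_mode_switch_no_ibpb;
static uint64_t stub_enabled;

/* Hypothetical stand-in for the vm_assist hypercall; 0 means success. */
static int stub_vm_assist(unsigned int cmd, unsigned int type)
{
    if ( cmd != VMASST_CMD_enable || type >= 64 ||
         !(stub_valid_mask & (UINT64_C(1) << type)) )
        return -1; /* modelled as a generic failure */

    stub_enabled |= UINT64_C(1) << type;
    return 0;
}

/*
 * Returns true if the opt-out was accepted. On failure (e.g. a hypervisor
 * without this assist) the guest simply keeps the default behaviour.
 */
static bool guest_suppress_mode_switch_ibpb(void)
{
    return stub_vm_assist(VMASST_CMD_enable,
                          VMASST_TYPE_mode_switch_no_ibpb) == 0;
}
```

A failed request is deliberately treated as harmless: the guest only loses the
performance optimisation, as Xen would continue to issue the barrier.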

On Tue, Jul 19, 2022 at 02:55:17PM +0200, Jan Beulich wrote:
> Since both kernel and user mode run in ring 3, they run in the same
> "predictor mode". While the kernel could take care of this itself, doing
> so would be yet another item distinguishing PV from native. Additionally
> we're in a much better position to issue the barrier command, and we can
> save a #GP (for privileged instruction emulation) this way.
>
> To allow to recover performance, introduce a new VM assist allowing the guest
> kernel to suppress this barrier.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Leverage entry-IBPB. Add VM assist. Re-base.
> ---
> I'm not entirely happy with re-using opt_ibpb_ctxt_switch here (it's a
> mode switch after all, but v1 used opt_ibpb here), but it also didn't
> seem very reasonable to introduce yet another command line option. The
> only feasible alternative I would see is to check the CPUID bits directly.

Likely needs a mention in xen-command-line.md that the `ibpb` option
also controls whether a barrier is executed by Xen in PV vCPU context
switches from user-space to kernel. The current text only mentions
vCPU context switches.

The rest LGTM.

Thanks, Roger.

On 13.09.2022 17:41, Roger Pau Monné wrote:
> On Tue, Jul 19, 2022 at 02:55:17PM +0200, Jan Beulich wrote:
>> Since both kernel and user mode run in ring 3, they run in the same
>> "predictor mode". While the kernel could take care of this itself, doing
>> so would be yet another item distinguishing PV from native. Additionally
>> we're in a much better position to issue the barrier command, and we can
>> save a #GP (for privileged instruction emulation) this way.
>>
>> To allow to recover performance, introduce a new VM assist allowing the guest
>> kernel to suppress this barrier.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v2: Leverage entry-IBPB. Add VM assist. Re-base.
>> ---
>> I'm not entirely happy with re-using opt_ibpb_ctxt_switch here (it's a
>> mode switch after all, but v1 used opt_ibpb here), but it also didn't
>> seem very reasonable to introduce yet another command line option. The
>> only feasible alternative I would see is to check the CPUID bits directly.
>
> Likely needs a mention in xen-command-line.md that the `ibpb` option
> also controls whether a barrier is executed by Xen in PV vCPU context
> switches from user-space to kernel. The current text only mentions
> vCPU context switches.

Andrew and I actually discussed this perhaps better having a separate
control.

Jan

On Tue, Sep 13, 2022 at 06:05:30PM +0200, Jan Beulich wrote:
> On 13.09.2022 17:41, Roger Pau Monné wrote:
> > On Tue, Jul 19, 2022 at 02:55:17PM +0200, Jan Beulich wrote:
> >> Since both kernel and user mode run in ring 3, they run in the same
> >> "predictor mode". While the kernel could take care of this itself, doing
> >> so would be yet another item distinguishing PV from native. Additionally
> >> we're in a much better position to issue the barrier command, and we can
> >> save a #GP (for privileged instruction emulation) this way.
> >>
> >> To allow to recover performance, introduce a new VM assist allowing the guest
> >> kernel to suppress this barrier.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> v2: Leverage entry-IBPB. Add VM assist. Re-base.
> >> ---
> >> I'm not entirely happy with re-using opt_ibpb_ctxt_switch here (it's a
> >> mode switch after all, but v1 used opt_ibpb here), but it also didn't
> >> seem very reasonable to introduce yet another command line option. The
> >> only feasible alternative I would see is to check the CPUID bits directly.
> >
> > Likely needs a mention in xen-command-line.md that the `ibpb` option
> > also controls whether a barrier is executed by Xen in PV vCPU context
> > switches from user-space to kernel. The current text only mentions
> > vCPU context switches.
>
> Andrew and I actually discussed this perhaps better having a separate
> control.

OK, didn't know there was some feedback here already. A separate
control would indeed be clearer. I guess a new patch will appear then?

Thanks, Roger.