For INT $N instructions (besides $0x80 for which there is a dedicated fast
path), handling is mostly fault-based because of DPL0 gates in the IDT. This
means that when the guest kernel allows the instruction too, Xen must
increment %rip to the end of the instruction before passing a trap to the
guest kernel.

When an INT $N instruction has a prefix, it's longer than two bytes, and Xen
will deliver the "trap" with %rip pointing into the middle of the instruction.

Introduce a new pv_emulate_sw_interrupt() which uses x86_insn_length() to
determine the instruction length, rather than assuming two.

This is a change in behaviour for PV guests, but the prior behaviour cannot
reasonably be said to be intentional.

This change does not affect the INT $0x80 fastpath. Prefixed INT $N
instructions occur almost exclusively in test code or exploits, and INT $0x80
appears to be the only user-usable interrupt gate in contemporary PV guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* New
---
xen/arch/x86/include/asm/pv/traps.h | 2 ++
xen/arch/x86/pv/emul-priv-op.c | 48 +++++++++++++++++++++++++++++
xen/arch/x86/traps.c | 3 +-
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/include/asm/pv/traps.h b/xen/arch/x86/include/asm/pv/traps.h
index 8c201190923d..16e9a8d2aa3f 100644
--- a/xen/arch/x86/include/asm/pv/traps.h
+++ b/xen/arch/x86/include/asm/pv/traps.h
@@ -17,6 +17,7 @@
int pv_raise_nmi(struct vcpu *v);
int pv_emulate_privileged_op(struct cpu_user_regs *regs);
+void pv_emulate_sw_interrupt(struct cpu_user_regs *regs);
void pv_emulate_gate_op(struct cpu_user_regs *regs);
bool pv_emulate_invalid_op(struct cpu_user_regs *regs);
@@ -31,6 +32,7 @@ static inline bool pv_trap_callback_registered(const struct vcpu *v,
static inline int pv_raise_nmi(struct vcpu *v) { return -EOPNOTSUPP; }
static inline int pv_emulate_privileged_op(struct cpu_user_regs *regs) { return 0; }
+static inline void pv_emulate_sw_interrupt(struct cpu_user_regs *regs) {}
static inline void pv_emulate_gate_op(struct cpu_user_regs *regs) {}
static inline bool pv_emulate_invalid_op(struct cpu_user_regs *regs) { return true; }
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index a3c1fd12621d..87d3bbcf901f 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -8,6 +8,7 @@
*/
#include <xen/domain_page.h>
+#include <xen/err.h>
#include <xen/event.h>
#include <xen/guest_access.h>
#include <xen/hypercall.h>
@@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
return 0;
}
+/*
+ * Hardware already decoded the INT $N instruction and determined that there
+ * was a DPL issue, hence the #GP. Xen has already determined that the guest
+ * kernel has permitted this software interrupt.
+ *
+ * All that is needed is the instruction length, to turn the fault into a
+ * trap. All errors are turned back into the original #GP, as that's the
+ * action that really happened.
+ */
+void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
+{
+ struct vcpu *curr = current;
+ struct domain *currd = curr->domain;
+ struct priv_op_ctxt ctxt = {
+ .ctxt.regs = regs,
+ .ctxt.lma = !is_pv_32bit_domain(currd),
+ };
+ struct x86_emulate_state *state;
+ uint8_t vector = regs->error_code >> 3;
+ unsigned int len, ar;
+
+ if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
+ &ctxt.cs.limit, &ar, 1) ||
+ !(ar & _SEGMENT_S) ||
+ !(ar & _SEGMENT_P) ||
+ !(ar & _SEGMENT_CODE) )
+ goto error;
+
+ state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
+ if ( IS_ERR_OR_NULL(state) )
+ goto error;
+
+ len = x86_insn_length(state, &ctxt.ctxt);
+ x86_emulate_free_state(state);
+
+ /* Note: Checked slightly late to simplify 'state' handling. */
+ if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
+ goto error;
+
+ regs->rip += len;
+ pv_inject_sw_interrupt(vector);
+ return;
+
+ error:
+ pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
+}
+
/*
* Local variables:
* mode: C
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 5feac88d6c0b..907fb4c186c0 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
if ( permit_softint(TI_GET_DPL(ti), v, regs) )
{
- regs->rip += 2;
- pv_inject_sw_interrupt(vector);
+ pv_emulate_sw_interrupt(regs);
return;
}
}
--
2.39.5
On 28.02.2026 00:16, Andrew Cooper wrote:
> For INT $N instructions (besides $0x80 for which there is a dedicated fast
> path), handling is mostly fault-based because of DPL0 gates in the IDT. This
> means that when the guest kernel allows the instruction too, Xen must
> increment %rip to the end of the instruction before passing a trap to the
> guest kernel.
>
> When an INT $N instruction has a prefix, it's longer than two bytes, and Xen
> will deliver the "trap" with %rip pointing into the middle of the instruction.
>
> Introduce a new pv_emulate_sw_interrupt() which uses x86_insn_length() to
> determine the instruction length, rather than assuming two.
>
> This is a change in behaviour for PV guests, but the prior behaviour cannot
> reasonably be said to be intentional.
>
> This change does not affect the INT $0x80 fastpath. Prefixed INT $N
> instructions occur almost exclusively in test code or exploits, and INT $0x80
> appears to be the only user-usable interrupt gate in contemporary PV guests.
As for the slow path: while subtracting 2 from %rip there isn't quite right
either, the insn size determination here would simply yield 2 as well, so all
is fine for that case too.
> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
> return 0;
> }
>
> +/*
> + * Hardware already decoded the INT $N instruction and determined that there
> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
> + * kernel has permitted this software interrupt.
> + *
> + * All that is needed is the instruction length, to turn the fault into a
> + * trap. All errors are turned back into the original #GP, as that's the
> + * action that really happened.
> + */
> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
> +{
> + struct vcpu *curr = current;
> + struct domain *currd = curr->domain;
> + struct priv_op_ctxt ctxt = {
> + .ctxt.regs = regs,
> + .ctxt.lma = !is_pv_32bit_domain(currd),
The difference may not be overly significant here, but 64-bit guests can run
32-bit code, so setting .lma seems wrong in that case. As it ought to be
largely benign, perhaps the code could even be left as is, just with a comment
to clarify things?
> + };
> + struct x86_emulate_state *state;
> + uint8_t vector = regs->error_code >> 3;
> + unsigned int len, ar;
> +
> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
> + &ctxt.cs.limit, &ar, 1) ||
> + !(ar & _SEGMENT_S) ||
> + !(ar & _SEGMENT_P) ||
> + !(ar & _SEGMENT_CODE) )
> + goto error;
> +
> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
> + if ( IS_ERR_OR_NULL(state) )
> + goto error;
> +
> + len = x86_insn_length(state, &ctxt.ctxt);
> + x86_emulate_free_state(state);
> +
> + /* Note: Checked slightly late to simplify 'state' handling. */
> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
> + goto error;
> +
> + regs->rip += len;
> + pv_inject_sw_interrupt(vector);
> + return;
> +
> + error:
> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
DYM regs->error_code here? Might it alternatively make sense to return a
boolean here, for ...
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>
> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
> {
> - regs->rip += 2;
> - pv_inject_sw_interrupt(vector);
> + pv_emulate_sw_interrupt(regs);
> return;
... the return here to become conditional, leveraging the #GP injection at
the bottom of this function?
Jan
On 02/03/2026 11:03 am, Jan Beulich wrote:
> On 28.02.2026 00:16, Andrew Cooper wrote:
>> For INT $N instructions (besides $0x80 for which there is a dedicated fast
>> path), handling is mostly fault-based because of DPL0 gates in the IDT. This
>> means that when the guest kernel allows the instruction too, Xen must
>> increment %rip to the end of the instruction before passing a trap to the
>> guest kernel.
>>
>> When an INT $N instruction has a prefix, it's longer than two bytes, and Xen
>> will deliver the "trap" with %rip pointing into the middle of the instruction.
>>
>> Introduce a new pv_emulate_sw_interrupt() which uses x86_insn_length() to
>> determine the instruction length, rather than assuming two.
>>
>> This is a change in behaviour for PV guests, but the prior behaviour cannot
>> reasonably be said to be intentional.
>>
>> This change does not affect the INT $0x80 fastpath. Prefixed INT $N
>> instructions occur almost exclusively in test code or exploits, and INT $0x80
>> appears to be the only user-usable interrupt gate in contemporary PV guests.
> As for the slow path: while subtracting 2 from %rip there isn't quite right
> either, the insn size determination here would simply yield 2 as well, so all
> is fine for that case too.
I've covered that in the docs patch (patch 2). Because INT $0x80 is
DPL3 and therefore traps, this is the best we can do.
>
>> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
>> return 0;
>> }
>>
>> +/*
>> + * Hardware already decoded the INT $N instruction and determined that there
>> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
>> + * kernel has permitted this software interrupt.
>> + *
>> + * All that is needed is the instruction length, to turn the fault into a
>> + * trap. All errors are turned back into the original #GP, as that's the
>> + * action that really happened.
>> + */
>> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
>> +{
>> + struct vcpu *curr = current;
>> + struct domain *currd = curr->domain;
>> + struct priv_op_ctxt ctxt = {
>> + .ctxt.regs = regs,
>> + .ctxt.lma = !is_pv_32bit_domain(currd),
> The difference may not be overly significant here, but 64-bit guests can run
> 32-bit code, so setting .lma seems wrong in that case. As it ought to be
> largely benign, perhaps the code could even be left as is, just with a comment
> to clarify things?
LMA must be set for a 64bit guest. Are you confusing it with %cs.l ?
What's potentially wrong is having LMA clear for a 32bit guest, but this
is how pv_emulate_privileged_op() behaves. LMA is active in real
hardware when running in a compatibility mode segment.
I don't think anything actually cares about LMA.
pv_emul_read_descriptor() doesn't audit L and instead relies on us not
permitting a PV32 guest to write a 64bit code segment.
>
>> + };
>> + struct x86_emulate_state *state;
>> + uint8_t vector = regs->error_code >> 3;
>> + unsigned int len, ar;
>> +
>> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
>> + &ctxt.cs.limit, &ar, 1) ||
>> + !(ar & _SEGMENT_S) ||
>> + !(ar & _SEGMENT_P) ||
>> + !(ar & _SEGMENT_CODE) )
>> + goto error;
>> +
>> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
>> + if ( IS_ERR_OR_NULL(state) )
>> + goto error;
>> +
>> + len = x86_insn_length(state, &ctxt.ctxt);
>> + x86_emulate_free_state(state);
>> +
>> + /* Note: Checked slightly late to simplify 'state' handling. */
>> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
>> + goto error;
>> +
>> + regs->rip += len;
>> + pv_inject_sw_interrupt(vector);
>> + return;
>> +
>> + error:
>> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
> DYM regs->error_code here?
Oh. I'm sure I fixed this bug already. I wonder where the fix got lost.
Yes, it should be regs->error_code.
> Might it alternatively make sense to return a
> boolean here, for ...
>
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>>
>> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
>> {
>> - regs->rip += 2;
>> - pv_inject_sw_interrupt(vector);
>> + pv_emulate_sw_interrupt(regs);
>> return;
> ... the return here to become conditional, leveraging the #GP injection at
> the bottom of this function?
To make this bool, I need to insert a new label into the function. I
considered that, but delayed it. do_general_protection() wants a lot
more cleaning up than just this, and proportionability is a concern.
What I was actually considering was splitting out a new pv_handle_GP()
function to remove the ifdef-ary, and doing a wholesale rework at that
point.
~Andrew
P.S. Something I'm still trying to figure out is how to make
guest_mode() able to DCE based on the caller being
entry_from_{xen,pv}(), because the function can be bifurcated for FRED.
It doesn't appear that the assume() constructs work, probably because
do_general_protection() can't be inlined due to IDT mode.
On 02.03.2026 12:43, Andrew Cooper wrote:
> On 02/03/2026 11:03 am, Jan Beulich wrote:
>> On 28.02.2026 00:16, Andrew Cooper wrote:
>>> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
>>> return 0;
>>> }
>>>
>>> +/*
>>> + * Hardware already decoded the INT $N instruction and determined that there
>>> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
>>> + * kernel has permitted this software interrupt.
>>> + *
>>> + * All that is needed is the instruction length, to turn the fault into a
>>> + * trap. All errors are turned back into the original #GP, as that's the
>>> + * action that really happened.
>>> + */
>>> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
>>> +{
>>> + struct vcpu *curr = current;
>>> + struct domain *currd = curr->domain;
>>> + struct priv_op_ctxt ctxt = {
>>> + .ctxt.regs = regs,
>>> + .ctxt.lma = !is_pv_32bit_domain(currd),
>> The difference may not be overly significant here, but 64-bit guests can run
>> 32-bit code, so setting .lma seems wrong in that case. As it ought to be
>> largely benign, perhaps the code could even be left as is, just with a comment
>> to clarify things?
>
> LMA must be set for a 64bit guest. Are you confusing it with %cs.l ?
Indeed I am, sorry.
>>> + struct x86_emulate_state *state;
>>> + uint8_t vector = regs->error_code >> 3;
>>> + unsigned int len, ar;
>>> +
>>> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
>>> + &ctxt.cs.limit, &ar, 1) ||
>>> + !(ar & _SEGMENT_S) ||
>>> + !(ar & _SEGMENT_P) ||
>>> + !(ar & _SEGMENT_CODE) )
>>> + goto error;
>>> +
>>> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
>>> + if ( IS_ERR_OR_NULL(state) )
>>> + goto error;
>>> +
>>> + len = x86_insn_length(state, &ctxt.ctxt);
>>> + x86_emulate_free_state(state);
>>> +
>>> + /* Note: Checked slightly late to simplify 'state' handling. */
>>> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
>>> + goto error;
>>> +
>>> + regs->rip += len;
>>> + pv_inject_sw_interrupt(vector);
>>> + return;
>>> +
>>> + error:
>>> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
>> DYM regs->error_code here?
>
> Oh. I'm sure I fixed this bug already. I wonder where the fix got lost.
>
> Yes, it should be regs->error_code.
Then (plus with my confusion above sorted)
Reviewed-by: Jan Beulich <jbeulich@suse.com>
>> Might it alternatively make sense to return a
>> boolean here, for ...
>>
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>>>
>>> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
>>> {
>>> - regs->rip += 2;
>>> - pv_inject_sw_interrupt(vector);
>>> + pv_emulate_sw_interrupt(regs);
>>> return;
>> ... the return here to become conditional, leveraging the #GP injection at
>> the bottom of this function?
>
> To make this bool, I need to insert a new label into the function.
Why would that be? Simply skipping the return and falling through will do,
afaics.
> I
> considered that, but delayed it. do_general_protection() wants a lot
> more cleaning up than just this, and proportionability is a concern.
Whatever you exactly mean with this.
Jan
On 02/03/2026 12:57 pm, Jan Beulich wrote:
> On 02.03.2026 12:43, Andrew Cooper wrote:
>> On 02/03/2026 11:03 am, Jan Beulich wrote:
>>> On 28.02.2026 00:16, Andrew Cooper wrote:
>>>> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
>>>> return 0;
>>>> }
>>>>
>>>> +/*
>>>> + * Hardware already decoded the INT $N instruction and determined that there
>>>> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
>>>> + * kernel has permitted this software interrupt.
>>>> + *
>>>> + * All that is needed is the instruction length, to turn the fault into a
>>>> + * trap. All errors are turned back into the original #GP, as that's the
>>>> + * action that really happened.
>>>> + */
>>>> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
>>>> +{
>>>> + struct vcpu *curr = current;
>>>> + struct domain *currd = curr->domain;
>>>> + struct priv_op_ctxt ctxt = {
>>>> + .ctxt.regs = regs,
>>>> + .ctxt.lma = !is_pv_32bit_domain(currd),
>>> The difference may not be overly significant here, but 64-bit guests can run
>>> 32-bit code, so setting .lma seems wrong in that case. As it ought to be
>>> largely benign, perhaps the code could even be left as is, just with a comment
>>> to clarify things?
>> LMA must be set for a 64bit guest. Are you confusing it with %cs.l ?
> Indeed I am, sorry.
>
>>>> + struct x86_emulate_state *state;
>>>> + uint8_t vector = regs->error_code >> 3;
>>>> + unsigned int len, ar;
>>>> +
>>>> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
>>>> + &ctxt.cs.limit, &ar, 1) ||
>>>> + !(ar & _SEGMENT_S) ||
>>>> + !(ar & _SEGMENT_P) ||
>>>> + !(ar & _SEGMENT_CODE) )
>>>> + goto error;
>>>> +
>>>> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
>>>> + if ( IS_ERR_OR_NULL(state) )
>>>> + goto error;
>>>> +
>>>> + len = x86_insn_length(state, &ctxt.ctxt);
>>>> + x86_emulate_free_state(state);
>>>> +
>>>> + /* Note: Checked slightly late to simplify 'state' handling. */
>>>> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
>>>> + goto error;
>>>> +
>>>> + regs->rip += len;
>>>> + pv_inject_sw_interrupt(vector);
>>>> + return;
>>>> +
>>>> + error:
>>>> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
>>> DYM regs->error_code here?
>> Oh. I'm sure I fixed this bug already. I wonder where the fix got lost.
>>
>> Yes, it should be regs->error_code.
> Then (plus with my confusion above sorted)
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>
>>> Might it alternatively make sense to return a
>>> boolean here, for ...
>>>
>>>> --- a/xen/arch/x86/traps.c
>>>> +++ b/xen/arch/x86/traps.c
>>>> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>>>>
>>>> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
>>>> {
>>>> - regs->rip += 2;
>>>> - pv_inject_sw_interrupt(vector);
>>>> + pv_emulate_sw_interrupt(regs);
>>>> return;
>>> ... the return here to become conditional, leveraging the #GP injection at
>>> the bottom of this function?
>> To make this bool, I need to insert a new label into the function.
> Why would that be? Simply skipping the return and falling through will do,
> afaics.
>
>> I
>> considered that, but delayed it. do_general_protection() wants a lot
>> more cleaning up than just this, and proportionability is a concern.
> Whatever you exactly mean with this.
Hmm. That was supposed to say backportability, but I have no idea how it
ended up like that.
The other advantage of being void functions is that they can be tailcalled.
Anyway, I have a plan for cleanup once FRED is settled, which looks a
little like this:
handle_GP_IDT()
if ( guest_regs() )
return handle_GP_guest()
else
return handle_GP_xen()
handle_GP_guest()
...
handle_GP_xen()
...
where the two FRED entrypoints can now call the context-specific
function rather than the generic one.
This does involve duplicating the X86_XEC_EXT check which is the only
common aspect in the #GP handler. Next I need to figure out whether the
other handlers can be rearranged similarly.
~Andrew