For INT $N instructions (besides $0x80 for which there is a dedicated fast
path), handling is mostly fault-based because of DPL0 gates in the IDT. This
means that when the guest kernel allows the instruction too, Xen must
increment %rip to the end of the instruction before passing a trap to the
guest kernel.

When an INT $N instruction has a prefix, it's longer than two bytes, and Xen
will deliver the "trap" with %rip pointing into the middle of the instruction.

Introduce a new pv_emulate_sw_interrupt() which uses x86_insn_length() to
determine the instruction length, rather than assuming two.

This is a change in behaviour for PV guests, but the prior behaviour cannot
reasonably be said to be intentional.

This change does not affect the INT $0x80 fastpath. Prefixed INT $N
instructions occur almost exclusively in test code or exploits, and INT $0x80
appears to be the only user-usable interrupt gate in contemporary PV guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
v4:
* New
---
xen/arch/x86/include/asm/pv/traps.h | 2 ++
xen/arch/x86/pv/emul-priv-op.c | 48 +++++++++++++++++++++++++++++
xen/arch/x86/traps.c | 3 +-
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/include/asm/pv/traps.h b/xen/arch/x86/include/asm/pv/traps.h
index 8c201190923d..16e9a8d2aa3f 100644
--- a/xen/arch/x86/include/asm/pv/traps.h
+++ b/xen/arch/x86/include/asm/pv/traps.h
@@ -17,6 +17,7 @@
int pv_raise_nmi(struct vcpu *v);
int pv_emulate_privileged_op(struct cpu_user_regs *regs);
+void pv_emulate_sw_interrupt(struct cpu_user_regs *regs);
void pv_emulate_gate_op(struct cpu_user_regs *regs);
bool pv_emulate_invalid_op(struct cpu_user_regs *regs);
@@ -31,6 +32,7 @@ static inline bool pv_trap_callback_registered(const struct vcpu *v,
static inline int pv_raise_nmi(struct vcpu *v) { return -EOPNOTSUPP; }
static inline int pv_emulate_privileged_op(struct cpu_user_regs *regs) { return 0; }
+static inline void pv_emulate_sw_interrupt(struct cpu_user_regs *regs) {}
static inline void pv_emulate_gate_op(struct cpu_user_regs *regs) {}
static inline bool pv_emulate_invalid_op(struct cpu_user_regs *regs) { return true; }
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index a3c1fd12621d..87d3bbcf901f 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -8,6 +8,7 @@
*/
#include <xen/domain_page.h>
+#include <xen/err.h>
#include <xen/event.h>
#include <xen/guest_access.h>
#include <xen/hypercall.h>
@@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
return 0;
}
+/*
+ * Hardware already decoded the INT $N instruction and determined that there
+ * was a DPL issue, hence the #GP. Xen has already determined that the guest
+ * kernel has permitted this software interrupt.
+ *
+ * All that is needed is the instruction length, to turn the fault into a
+ * trap. All errors are turned back into the original #GP, as that's the
+ * action that really happened.
+ */
+void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
+{
+ struct vcpu *curr = current;
+ struct domain *currd = curr->domain;
+ struct priv_op_ctxt ctxt = {
+ .ctxt.regs = regs,
+ .ctxt.lma = !is_pv_32bit_domain(currd),
+ };
+ struct x86_emulate_state *state;
+ uint8_t vector = regs->error_code >> 3;
+ unsigned int len, ar;
+
+ if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
+ &ctxt.cs.limit, &ar, 1) ||
+ !(ar & _SEGMENT_S) ||
+ !(ar & _SEGMENT_P) ||
+ !(ar & _SEGMENT_CODE) )
+ goto error;
+
+ state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
+ if ( IS_ERR_OR_NULL(state) )
+ goto error;
+
+ len = x86_insn_length(state, &ctxt.ctxt);
+ x86_emulate_free_state(state);
+
+ /* Note: Checked slightly late to simplify 'state' handling. */
+ if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
+ goto error;
+
+ regs->rip += len;
+ pv_inject_sw_interrupt(vector);
+ return;
+
+ error:
+ pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
+}
+
/*
* Local variables:
* mode: C
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 5feac88d6c0b..907fb4c186c0 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
if ( permit_softint(TI_GET_DPL(ti), v, regs) )
{
- regs->rip += 2;
- pv_inject_sw_interrupt(vector);
+ pv_emulate_sw_interrupt(regs);
return;
}
}
--
2.39.5
On 28.02.2026 00:16, Andrew Cooper wrote:
> For INT $N instructions (besides $0x80 for which there is a dedicated fast
> path), handling is mostly fault-based because of DPL0 gates in the IDT. This
> means that when the guest kernel allows the instruction too, Xen must
> increment %rip to the end of the instruction before passing a trap to the
> guest kernel.
>
> When an INT $N instruction has a prefix, it's longer than two bytes, and Xen
> will deliver the "trap" with %rip pointing into the middle of the instruction.
>
> Introduce a new pv_emulate_sw_interrupt() which uses x86_insn_length() to
> determine the instruction length, rather than assuming two.
>
> This is a change in behaviour for PV guests, but the prior behaviour cannot
> reasonably be said to be intentional.
>
> This change does not affect the INT $0x80 fastpath. Prefixed INT $N
> instructions occur almost exclusively in test code or exploits, and INT $0x80
> appears to be the only user-usable interrupt gate in contemporary PV guests.
As for the slow path: while subtracting 2 from %rip there isn't quite right
either, the insn size determination here would simply yield 2 as well, so all
is fine for that case too.
> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
> return 0;
> }
>
> +/*
> + * Hardware already decoded the INT $N instruction and determined that there
> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
> + * kernel has permitted this software interrupt.
> + *
> + * All that is needed is the instruction length, to turn the fault into a
> + * trap. All errors are turned back into the original #GP, as that's the
> + * action that really happened.
> + */
> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
> +{
> + struct vcpu *curr = current;
> + struct domain *currd = curr->domain;
> + struct priv_op_ctxt ctxt = {
> + .ctxt.regs = regs,
> + .ctxt.lma = !is_pv_32bit_domain(currd),
The difference may not be overly significant here, but 64-bit guests can run
32-bit code, so setting .lma seems wrong in that case. As it ought to be
largely benign, perhaps the code could even be left as is, just with a comment
to clarify things?
> + };
> + struct x86_emulate_state *state;
> + uint8_t vector = regs->error_code >> 3;
> + unsigned int len, ar;
> +
> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
> + &ctxt.cs.limit, &ar, 1) ||
> + !(ar & _SEGMENT_S) ||
> + !(ar & _SEGMENT_P) ||
> + !(ar & _SEGMENT_CODE) )
> + goto error;
> +
> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
> + if ( IS_ERR_OR_NULL(state) )
> + goto error;
> +
> + len = x86_insn_length(state, &ctxt.ctxt);
> + x86_emulate_free_state(state);
> +
> + /* Note: Checked slightly late to simplify 'state' handling. */
> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
> + goto error;
> +
> + regs->rip += len;
> + pv_inject_sw_interrupt(vector);
> + return;
> +
> + error:
> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
DYM regs->error_code here? Might it alternatively make sense to return a
boolean here, for ...
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>
> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
> {
> - regs->rip += 2;
> - pv_inject_sw_interrupt(vector);
> + pv_emulate_sw_interrupt(regs);
> return;
... the return here to become conditional, leveraging the #GP injection at
the bottom of this function?
Jan
On 02/03/2026 11:03 am, Jan Beulich wrote:
> On 28.02.2026 00:16, Andrew Cooper wrote:
>> For INT $N instructions (besides $0x80 for which there is a dedicated fast
>> path), handling is mostly fault-based because of DPL0 gates in the IDT. This
>> means that when the guest kernel allows the instruction too, Xen must
>> increment %rip to the end of the instruction before passing a trap to the
>> guest kernel.
>>
>> When an INT $N instruction has a prefix, it's longer than two bytes, and Xen
>> will deliver the "trap" with %rip pointing into the middle of the instruction.
>>
>> Introduce a new pv_emulate_sw_interrupt() which uses x86_insn_length() to
>> determine the instruction length, rather than assuming two.
>>
>> This is a change in behaviour for PV guests, but the prior behaviour cannot
>> reasonably be said to be intentional.
>>
>> This change does not affect the INT $0x80 fastpath. Prefixed INT $N
>> instructions occur almost exclusively in test code or exploits, and INT $0x80
>> appears to be the only user-usable interrupt gate in contemporary PV guests.
> As for the slow path: while subtracting 2 from %rip there isn't quite right
> either, the insn size determination here would simply yield 2 as well, so all
> is fine for that case too.
I've covered that in the docs patch (patch 2). Because INT $0x80 is
DPL3 and therefore traps, this is the best we can do.
>
>> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
>> return 0;
>> }
>>
>> +/*
>> + * Hardware already decoded the INT $N instruction and determined that there
>> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
>> + * kernel has permitted this software interrupt.
>> + *
>> + * All that is needed is the instruction length, to turn the fault into a
>> + * trap. All errors are turned back into the original #GP, as that's the
>> + * action that really happened.
>> + */
>> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
>> +{
>> + struct vcpu *curr = current;
>> + struct domain *currd = curr->domain;
>> + struct priv_op_ctxt ctxt = {
>> + .ctxt.regs = regs,
>> + .ctxt.lma = !is_pv_32bit_domain(currd),
> The difference may not be overly significant here, but 64-bit guests can run
> 32-bit code, so setting .lma seems wrong in that case. As it ought to be
> largely benign, perhaps the code could even be left as is, just with a comment
> to clarify things?
LMA must be set for a 64bit guest. Are you confusing it with %cs.l ?
What's potentially wrong is having LMA clear for a 32bit guest, but this
is how pv_emulate_privileged_op() behaves. LMA is active in real
hardware when running in a compatibility mode segment.
I don't think anything actually cares about LMA.
pv_emul_read_descriptor() doesn't audit L and instead relies on us not
permitting a PV32 guest to write a 64bit code segment.
>
>> + };
>> + struct x86_emulate_state *state;
>> + uint8_t vector = regs->error_code >> 3;
>> + unsigned int len, ar;
>> +
>> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
>> + &ctxt.cs.limit, &ar, 1) ||
>> + !(ar & _SEGMENT_S) ||
>> + !(ar & _SEGMENT_P) ||
>> + !(ar & _SEGMENT_CODE) )
>> + goto error;
>> +
>> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
>> + if ( IS_ERR_OR_NULL(state) )
>> + goto error;
>> +
>> + len = x86_insn_length(state, &ctxt.ctxt);
>> + x86_emulate_free_state(state);
>> +
>> + /* Note: Checked slightly late to simplify 'state' handling. */
>> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
>> + goto error;
>> +
>> + regs->rip += len;
>> + pv_inject_sw_interrupt(vector);
>> + return;
>> +
>> + error:
>> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
> DYM regs->error_code here?
Oh. I'm sure I fixed this bug already. I wonder where the fix got lost.
Yes, it should be regs->error_code.
> Might it alternatively make sense to return a
> boolean here, for ...
>
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>>
>> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
>> {
>> - regs->rip += 2;
>> - pv_inject_sw_interrupt(vector);
>> + pv_emulate_sw_interrupt(regs);
>> return;
> ... the return here to become conditional, leveraging the #GP injection at
> the bottom of this function?
To make this bool, I need to insert a new label into the function. I
considered that, but delayed it. do_general_protection() wants a lot
more cleaning up than just this, and proportionability is a concern.
What I was actually considering was splitting out a new pv_handle_GP()
function to remove the ifdef-ary, and doing a wholesale rework at that
point.
~Andrew
P.S. Something I'm still trying to figure out is how to make
guest_mode() able to DCE based on the caller being
entry_from_{xen,pv}(), because the function can be bifurcated for FRED.
It doesn't appear that the assume() constructs work, probably because
do_general_protection() can't be inlined due to IDT mode.
On 02.03.2026 12:43, Andrew Cooper wrote:
> On 02/03/2026 11:03 am, Jan Beulich wrote:
>> On 28.02.2026 00:16, Andrew Cooper wrote:
>>> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
>>> return 0;
>>> }
>>>
>>> +/*
>>> + * Hardware already decoded the INT $N instruction and determined that there
>>> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
>>> + * kernel has permitted this software interrupt.
>>> + *
>>> + * All that is needed is the instruction length, to turn the fault into a
>>> + * trap. All errors are turned back into the original #GP, as that's the
>>> + * action that really happened.
>>> + */
>>> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
>>> +{
>>> + struct vcpu *curr = current;
>>> + struct domain *currd = curr->domain;
>>> + struct priv_op_ctxt ctxt = {
>>> + .ctxt.regs = regs,
>>> + .ctxt.lma = !is_pv_32bit_domain(currd),
>> The difference may not be overly significant here, but 64-bit guests can run
>> 32-bit code, so setting .lma seems wrong in that case. As it ought to be
>> largely benign, perhaps the code could even be left as is, just with a comment
>> to clarify things?
>
> LMA must be set for a 64bit guest. Are you confusing it with %cs.l ?
Indeed I am, sorry.
>>> + struct x86_emulate_state *state;
>>> + uint8_t vector = regs->error_code >> 3;
>>> + unsigned int len, ar;
>>> +
>>> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
>>> + &ctxt.cs.limit, &ar, 1) ||
>>> + !(ar & _SEGMENT_S) ||
>>> + !(ar & _SEGMENT_P) ||
>>> + !(ar & _SEGMENT_CODE) )
>>> + goto error;
>>> +
>>> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
>>> + if ( IS_ERR_OR_NULL(state) )
>>> + goto error;
>>> +
>>> + len = x86_insn_length(state, &ctxt.ctxt);
>>> + x86_emulate_free_state(state);
>>> +
>>> + /* Note: Checked slightly late to simplify 'state' handling. */
>>> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
>>> + goto error;
>>> +
>>> + regs->rip += len;
>>> + pv_inject_sw_interrupt(vector);
>>> + return;
>>> +
>>> + error:
>>> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
>> DYM regs->error_code here?
>
> Oh. I'm sure I fixed this bug already. I wonder where the fix got lost.
>
> Yes, it should be regs->error_code.
Then (plus with my confusion above sorted)
Reviewed-by: Jan Beulich <jbeulich@suse.com>
>> Might it alternatively make sense to return a
>> boolean here, for ...
>>
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>>>
>>> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
>>> {
>>> - regs->rip += 2;
>>> - pv_inject_sw_interrupt(vector);
>>> + pv_emulate_sw_interrupt(regs);
>>> return;
>> ... the return here to become conditional, leveraging the #GP injection at
>> the bottom of this function?
>
> To make this bool, I need to insert a new label into the function.
Why would that be? Simply skipping the return and falling through will do,
afaics.
> I
> considered that, but delayed it. do_general_protection() wants a lot
> more cleaning up than just this, and proportionability is a concern.
Whatever you exactly mean with this.
Jan
On 02/03/2026 12:57 pm, Jan Beulich wrote:
> On 02.03.2026 12:43, Andrew Cooper wrote:
>> On 02/03/2026 11:03 am, Jan Beulich wrote:
>>> On 28.02.2026 00:16, Andrew Cooper wrote:
>>>> @@ -1401,6 +1402,53 @@ int pv_emulate_privileged_op(struct cpu_user_regs *regs)
>>>> return 0;
>>>> }
>>>>
>>>> +/*
>>>> + * Hardware already decoded the INT $N instruction and determined that there
>>>> + * was a DPL issue, hence the #GP. Xen has already determined that the guest
>>>> + * kernel has permitted this software interrupt.
>>>> + *
>>>> + * All that is needed is the instruction length, to turn the fault into a
>>>> + * trap. All errors are turned back into the original #GP, as that's the
>>>> + * action that really happened.
>>>> + */
>>>> +void pv_emulate_sw_interrupt(struct cpu_user_regs *regs)
>>>> +{
>>>> + struct vcpu *curr = current;
>>>> + struct domain *currd = curr->domain;
>>>> + struct priv_op_ctxt ctxt = {
>>>> + .ctxt.regs = regs,
>>>> + .ctxt.lma = !is_pv_32bit_domain(currd),
>>> The difference may not be overly significant here, but 64-bit guests can run
>>> 32-bit code, so setting .lma seems wrong in that case. As it ought to be
>>> largely benign, perhaps the code could even be left as is, just with a comment
>>> to clarify things?
>> LMA must be set for a 64bit guest. Are you confusing it with %cs.l ?
> Indeed I am, sorry.
>
>>>> + struct x86_emulate_state *state;
>>>> + uint8_t vector = regs->error_code >> 3;
>>>> + unsigned int len, ar;
>>>> +
>>>> + if ( !pv_emul_read_descriptor(regs->cs, curr, &ctxt.cs.base,
>>>> + &ctxt.cs.limit, &ar, 1) ||
>>>> + !(ar & _SEGMENT_S) ||
>>>> + !(ar & _SEGMENT_P) ||
>>>> + !(ar & _SEGMENT_CODE) )
>>>> + goto error;
>>>> +
>>>> + state = x86_decode_insn(&ctxt.ctxt, insn_fetch);
>>>> + if ( IS_ERR_OR_NULL(state) )
>>>> + goto error;
>>>> +
>>>> + len = x86_insn_length(state, &ctxt.ctxt);
>>>> + x86_emulate_free_state(state);
>>>> +
>>>> + /* Note: Checked slightly late to simplify 'state' handling. */
>>>> + if ( ctxt.ctxt.opcode != 0xcd /* INT $imm8 */ )
>>>> + goto error;
>>>> +
>>>> + regs->rip += len;
>>>> + pv_inject_sw_interrupt(vector);
>>>> + return;
>>>> +
>>>> + error:
>>>> + pv_inject_hw_exception(X86_EXC_GP, regs->entry_vector);
>>> DYM regs->error_code here?
>> Oh. I'm sure I fixed this bug already. I wonder where the fix got lost.
>>
>> Yes, it should be regs->error_code.
> Then (plus with my confusion above sorted)
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Thanks.
>
>>> Might it alternatively make sense to return a
>>> boolean here, for ...
>>>
>>>> --- a/xen/arch/x86/traps.c
>>>> +++ b/xen/arch/x86/traps.c
>>>> @@ -1379,8 +1379,7 @@ void do_general_protection(struct cpu_user_regs *regs)
>>>>
>>>> if ( permit_softint(TI_GET_DPL(ti), v, regs) )
>>>> {
>>>> - regs->rip += 2;
>>>> - pv_inject_sw_interrupt(vector);
>>>> + pv_emulate_sw_interrupt(regs);
>>>> return;
>>> ... the return here to become conditional, leveraging the #GP injection at
>>> the bottom of this function?
>> To make this bool, I need to insert a new label into the function.
> Why would that be? Simply skipping the return and falling through will do,
> afaics.
>
>> I
>> considered that, but delayed it. do_general_protection() wants a lot
>> more cleaning up than just this, and proportionability is a concern.
> Whatever you exactly mean with this.
Hmm. That was supposed to say backportability, but I have no idea how it
ended up like that.
The other advantage of being void functions is that they can be tailcalled.
Anyway, I have a plan for cleanup once FRED is settled, which looks a
little like this:
handle_GP_IDT()
if ( guest_regs() )
return handle_GP_guest()
else
return handle_GP_xen()
handle_GP_guest()
...
handle_GP_xen()
...
where the two FRED entrypoints can now call the context-specific
function rather than the generic one.
This does involve duplicating the X86_XEC_EXT check which is the only
common aspect in the #GP handler. Next I need to figure out whether the
other handlers can be rearranged similarly.
~Andrew