[PATCH v4 02/14] docs/guest-guide: Describe the PV traps and entrypoints ABI

Andrew Cooper posted 14 patches 3 days, 4 hours ago
Posted by Andrew Cooper 3 days, 4 hours ago
... seeing as I've had to thoroughly reverse engineer it for FRED and make
tweaks in places.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

Obviously there's a lot more in need of doing, but this is at least a start.

v4:
 * New
---
 docs/glossary.rst                 |   3 +
 docs/guest-guide/x86/index.rst    |   1 +
 docs/guest-guide/x86/pv-traps.rst | 123 ++++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+)
 create mode 100644 docs/guest-guide/x86/pv-traps.rst

diff --git a/docs/glossary.rst b/docs/glossary.rst
index 6adeec77e14c..c8ab2386bc6e 100644
--- a/docs/glossary.rst
+++ b/docs/glossary.rst
@@ -43,6 +43,9 @@ Glossary
      Sapphire Rapids (Server, 2023) CPUs.  AMD support only CET-SS, starting
      with Zen3 (Both client and server, 2020) CPUs.
 
+   event channel
+     A paravirtual facility for guests to send and receive interrupts.
+
    guest
      The term 'guest' has two different meanings, depending on context, and
      should not be confused with :term:`domain`.
diff --git a/docs/guest-guide/x86/index.rst b/docs/guest-guide/x86/index.rst
index 502968490d9d..5b38ae397a9f 100644
--- a/docs/guest-guide/x86/index.rst
+++ b/docs/guest-guide/x86/index.rst
@@ -7,3 +7,4 @@ x86
    :maxdepth: 2
 
    hypercall-abi
+   pv-traps
diff --git a/docs/guest-guide/x86/pv-traps.rst b/docs/guest-guide/x86/pv-traps.rst
new file mode 100644
index 000000000000..2ff18e2f9454
--- /dev/null
+++ b/docs/guest-guide/x86/pv-traps.rst
@@ -0,0 +1,123 @@
+.. SPDX-License-Identifier: CC-BY-4.0
+
+PV Traps and Entrypoints
+========================
+
+.. note::
+
+   The details here are specific to 64bit builds of Xen.  Details for 32bit
+   builds of Xen are different and not discussed further.
+
+PV guests are subject to Xen's linkage setup for events (interrupts,
+exceptions and system calls).  x86's IDT architecture and limitations are the
+major influence on the PV ABI.
+
+All external interrupts are routed to PV guests via the :term:`Event Channel`
+interface, and are not discussed further here.
+
+What remain are exceptions, and the instructions which cause control
+transfers.  In the x86 architecture, the instructions relevant for PV guests
+are:
+
+ * ``INT3``, which generates ``#BP``.
+
+ * ``INTO``, which generates ``#OF`` only if the overflow flag is set.  It is
+   only usable in compatibility mode, and will ``#UD`` in 64bit mode.
+
+ * ``CALL (far)`` referencing a gate in the GDT.
+
+ * ``INT $N``, which invokes an arbitrary IDT gate.  The four instructions
+   listed so far all check the gate DPL, and will ``#GP`` if the check fails.
+
+ * ``INT1``, also known as ``ICEBP``, which generates ``#DB``.  This
+   instruction does *not* check DPL, and can be used unconditionally by
+   userspace.
+
+ * ``SYSCALL``, which enters CPL0 as configured by the ``{C,L,}STAR`` MSRs.
+   It is usable if enabled by ``MSR_EFER.SCE``, and will ``#UD`` otherwise.
+   On Intel parts, ``SYSCALL`` is unusable outside of 64bit mode.
+
+ * ``SYSENTER``, which enters CPL0 as configured by the ``SEP`` MSRs.  It is
+   usable if enabled by ``MSR_SYSENTER_CS`` having a non-NUL selector, and
+   will ``#GP`` otherwise.  On AMD parts, ``SYSENTER`` is unusable in Long
+   mode.
+
+
+Xen's configuration
+-------------------
+
+Xen maintains a complete IDT, with most gates configured with DPL0.  This
+causes most ``INT $N`` instructions to ``#GP``.  This allows Xen to emulate
+the instruction, referring to the guest kernel's vDPL choice.
+
+ * Vectors 3 ``#BP`` and 4 ``#OF`` are DPL3, in order to allow the ``INT3``
+   and ``INTO`` instructions to function in userspace.
+
+ * Vector 0x80 is DPL3 in order to implement the legacy system call fastpath
+   commonly found in UNIXes.
+
+ * Vector 0x82 is DPL1 when PV32 is enabled, allowing the guest kernel to make
+   hypercalls to Xen.  All other cases (PV32 guest userspace, and both PV64
+   modes) operate in CPL3 and this vector behaves like all others to ``INT
+   $N`` instructions.
+
+A range of the GDT is guest-owned, allowing for call gates.  During audit, Xen
+forces all call gates to DPL0, causing their use to ``#GP``, which allows for
+emulation.
+
+Xen enables ``SYSCALL`` in all cases as it is mandatory in 64bit mode, and
+enables ``SYSENTER`` when available in 64bit mode.
+
+When Xen is using FRED delivery the hardware configuration is substantially
+different, but the behaviour for guests remains as unchanged as possible.
+
+
+PV Guest's configuration
+------------------------
+
+The PV ABI contains the "trap table", modelled very closely on the IDT.  It is
+manipulated by ``HYPERVISOR_set_trap_table``, has 256 entries, each containing
+a code segment selector, an address, and flags.  A guest is expected to
+configure handlers for all exceptions; failure to do so is terminal, similar to
+a Triple Fault.
+
+Part of the GDT is guest owned with descriptors audited by Xen.  This range
+can be manipulated with ``HYPERVISOR_set_gdt`` and
+``HYPERVISOR_update_descriptor``.
+
+Other entrypoints are configured via ``HYPERVISOR_callback_op``.  Of note here
+are the callback types ``syscall``, ``syscall32`` (relevant for AMD parts) and
+``sysenter`` (relevant for Intel parts).
+
+.. warning::
+
+   Prior to Xen 4.15, there was no check that the ``syscall`` or ``syscall32``
+   callbacks had been registered before attempting to deliver via them.
+   Guests are strongly advised to ensure the entrypoints are registered before
+   running userspace.
+
+
+Notes
+-----
+
+``INT3`` vs ``INT $3`` and ``INTO`` vs ``INT $4`` are hard to distinguish
+architecturally as both forms have a DPL check and use the same IDT vectors.
+Because Xen configures both as DPL3, the ``INT $`` forms do not fault for
+emulation, and are treated as if they were exceptions.  This means the guest
+can't block these instructions by trying to configure them with vDPL0.
+
+The instructions which trap into Xen (``INT $0x80``, ``SYSCALL``,
+``SYSENTER``) but can be disabled by guest configuration need turning back
+into faults for the guest kernel to process.
+
+ * When using IDT delivery, instruction lengths are not provided by hardware
+   and Xen does not account for possible prefixes.  ``%rip`` only gets rewound
+   by the length of the unprefixed instruction.  This is observable, but not
+   expected to be an issue in practice.
+
+ * When Xen is using FRED delivery, the full instruction length is provided by
+   hardware, and ``%rip`` is rewound fully.
+
+While both PV32 and PV64 guests are permitted to write Call Gates into the
+GDT, emulation is only wired up for PV32.  At the time of writing, the x86
+maintainers feel no specific need to fix this omission.
-- 
2.39.5


Re: [PATCH v4 02/14] docs/guest-guide: Describe the PV traps and entrypoints ABI
Posted by Jan Beulich 16 hours ago
On 28.02.2026 00:16, Andrew Cooper wrote:
> ... seeing as I've had to thoroughly reverse engineer it for FRED and make
> tweaks in places.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>

> --- /dev/null
> +++ b/docs/guest-guide/x86/pv-traps.rst
> @@ -0,0 +1,123 @@
> +.. SPDX-License-Identifier: CC-BY-4.0
> +
> +PV Traps and Entrypoints
> +========================
> +
> +.. note::
> +
> +   The details here are specific to 64bit builds of Xen.  Details for 32bit
> +   builds of Xen, are different and not discussed further.

Nit: Stray comma?

> +PV guests are subject to Xen's linkage setup for events (interrupts,
> +exceptions and system calls).  x86's IDT architecture and limitations are the
> +majority influence on the PV ABI.
> +
> +All external interrupts are routed to PV guests via the :term:`Event Channel`
> +interface, and not discussed further here.
> +
> +What remain are exceptions, and the instructions which cause a control
> +transfers.  In the x86 architecture, the instructions relevant for PV guests
> +are:
> +
> + * ``INT3``, which generates ``#BP``.
> +
> + * ``INTO``, which generates ``#OF`` only if the overflow flag is set.  It is
> +   only usable in compatibility mode, and will ``#UD`` in 64bit mode.
> +
> + * ``CALL (far)`` referencing a gate in the GDT.
> +
> + * ``INT $N``, which invokes an arbitrary IDT gate.  These four instructions
> +   so far all check the gate DPL and will ``#GP`` otherwise.
> +
> + * ``INT1``, also known as ``ICEBP``, which generates ``#DB``.  This
> +   instruction does *not* check DPL, and can be used unconditionally by
> +   userspace.
> +
> + * ``SYSCALL``, which enters CPL0 as configured by the ``{C,L,}STAR`` MSRs.
> +   It is usable if enabled by ``MSR_EFER.SCE``, and will ``#UD`` otherwise.
> +   On Intel parts, ``SYSCALL`` is unusable outside of 64bit mode.
> +
> + * ``SYSENTER``, which enters CPL0 as configured by the ``SEP`` MSRs.  It is
> +   usable if enabled by ``MSR_SYSENTER_CS`` having a non-NUL selector, and
> +   will ``#GP`` otherwise.  On AMD parts, ``SYSENTER`` is unusable in Long
> +   mode.

The UD<n> family of insns is kind of a hybrid: They explicitly generate #UD,
and hence do a control transfer. Same for at least BOUND. It's not quite clear
whether they should be enumerated here as well.

> +Xen's configuration
> +-------------------
> +
> +Xen maintains a complete IDT, with most gates configured with DPL0.  This
> +causes most ``INT $N`` instructions to ``#GP``.  This allows Xen to emulate
> +the instruction, referring to the guest kernels vDPL choice.
> +
> + * Vectors 3 ``#BP`` and 4 ``#OF`` are DPL3, in order to allow the ``INT3``
> +   and ``INTO`` instructions to function in userspace.
> +
> + * Vector 0x80 is DPL3 in order to implement the legacy system call fastpath
> +   commonly found in UNIXes.

Much like we make this DPL0 when PV=n, should we perhaps make vectors 3 and 4
DPL0 as well in that case (just for formality's sake)? Maybe 4, like 9, would
even want to be an autogen entry point then?

Jan
Re: [PATCH v4 02/14] docs/guest-guide: Describe the PV traps and entrypoints ABI
Posted by Andrew Cooper 12 hours ago
On 02/03/2026 11:19 am, Jan Beulich wrote:
> On 28.02.2026 00:16, Andrew Cooper wrote:
>> ... seeing as I've had to thoroughly reverse engineer it for FRED and make
>> tweaks in places.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>
>> --- /dev/null
>> +++ b/docs/guest-guide/x86/pv-traps.rst
>> @@ -0,0 +1,123 @@
>> +.. SPDX-License-Identifier: CC-BY-4.0
>> +
>> +PV Traps and Entrypoints
>> +========================
>> +
>> +.. note::
>> +
>> +   The details here are specific to 64bit builds of Xen.  Details for 32bit
>> +   builds of Xen, are different and not discussed further.
> Nit: Stray comma?

Yes.  From a sentence refactor.  Will drop.

>
>> +PV guests are subject to Xen's linkage setup for events (interrupts,
>> +exceptions and system calls).  x86's IDT architecture and limitations are the
>> +majority influence on the PV ABI.
>> +
>> +All external interrupts are routed to PV guests via the :term:`Event Channel`
>> +interface, and not discussed further here.
>> +
>> +What remain are exceptions, and the instructions which cause a control
>> +transfers.  In the x86 architecture, the instructions relevant for PV guests
>> +are:
>> +
>> + * ``INT3``, which generates ``#BP``.
>> +
>> + * ``INTO``, which generates ``#OF`` only if the overflow flag is set.  It is
>> +   only usable in compatibility mode, and will ``#UD`` in 64bit mode.
>> +
>> + * ``CALL (far)`` referencing a gate in the GDT.
>> +
>> + * ``INT $N``, which invokes an arbitrary IDT gate.  These four instructions
>> +   so far all check the gate DPL and will ``#GP`` otherwise.
>> +
>> + * ``INT1``, also known as ``ICEBP``, which generates ``#DB``.  This
>> +   instruction does *not* check DPL, and can be used unconditionally by
>> +   userspace.
>> +
>> + * ``SYSCALL``, which enters CPL0 as configured by the ``{C,L,}STAR`` MSRs.
>> +   It is usable if enabled by ``MSR_EFER.SCE``, and will ``#UD`` otherwise.
>> +   On Intel parts, ``SYSCALL`` is unusable outside of 64bit mode.
>> +
>> + * ``SYSENTER``, which enters CPL0 as configured by the ``SEP`` MSRs.  It is
>> +   usable if enabled by ``MSR_SYSENTER_CS`` having a non-NUL selector, and
>> +   will ``#GP`` otherwise.  On AMD parts, ``SYSENTER`` is unusable in Long
>> +   mode.
> The UD<n> family of insns is kind of a hybrid: They explicitly generate #UD,
> and hence do a control transfer. Same for at least BOUND. It's not quite clear
> whether they should be enumerated here as well.

UDn and BOUND are strictly faults, not traps.  They're type 3 (hardware
exception) and provide no instruction length.

The simplest implementation of UDn is nothing.  The decoder already
needs a signal for "not an instruction I know" which is wired into #UD. 
All the manual does by defining these "instructions" is promise that
nothing else will be allocated in that opcode space.

BOUND is weird.  I'm not sure what more can be said about it.

Either way, #UD and #BR are not interestingly different from e.g. #PF
from a PV guest's point of view.

>
>> +Xen's configuration
>> +-------------------
>> +
>> +Xen maintains a complete IDT, with most gates configured with DPL0.  This
>> +causes most ``INT $N`` instructions to ``#GP``.  This allows Xen to emulate
>> +the instruction, referring to the guest kernels vDPL choice.
>> +
>> + * Vectors 3 ``#BP`` and 4 ``#OF`` are DPL3, in order to allow the ``INT3``
>> +   and ``INTO`` instructions to function in userspace.
>> +
>> + * Vector 0x80 is DPL3 in order to implement the legacy system call fastpath
>> +   commonly found in UNIXes.
> Much like we make this DPL0 when PV=n, should we perhaps make vectors 3 and 4
> DPL0 as well in that case (just for formality's sake)? Maybe 4, like 9, would
> even want to be an autogen entry point then?

We could, but does that gain us anything?

For 0x80, we get another vector to use for regular interrupts, but that
doesn't work for the vectors below 0x10.

~Andrew