[v2] x86/alternatives: Adjust all insn-relative fields

[PATCH v2 1/7] x86: Introduce x86_decode_lite()

Posted by Andrew Cooper 1 month, 2 weeks ago

In order to relocate all IP-relative fields in an alternative replacement
block, we need to decode the instructions enough to obtain their length and
any relative fields.

Full x86_decode() is far too heavyweight, so introduce a minimal form which
can make several simplifying assumptions.

This a mostly-complete decoder for integer instruction in the onebyte and
twobyte maps.  Some instructions are intentionally unsupported, owing to being
unlikely to find in alternatives, and so as to reduce decode complexity.  See
the subsequent patch adding a userspace test harness for further details.

This logic can decode all alternative blocks that exist in Xen right now.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Switch to 0 on failure, rel_sz in bytes
 * Mostly complete the integer instructions; paird with userspace harness
 * Put in .init when !CONFIG_LIVEPATCH
---
 xen/arch/x86/x86_emulate/Makefile      |   6 +
 xen/arch/x86/x86_emulate/decode-lite.c | 311 +++++++++++++++++++++++++
 xen/arch/x86/x86_emulate/private.h     |   2 +
 xen/arch/x86/x86_emulate/x86_emulate.h |  14 ++
 4 files changed, 333 insertions(+)
 create mode 100644 xen/arch/x86/x86_emulate/decode-lite.c

diff --git a/xen/arch/x86/x86_emulate/Makefile b/xen/arch/x86/x86_emulate/Makefile
index 2e20d65d788a..7a56de877f25 100644
--- a/xen/arch/x86/x86_emulate/Makefile
+++ b/xen/arch/x86/x86_emulate/Makefile
@@ -6,3 +6,9 @@ obj-y += decode.o
 obj-$(CONFIG_HVM) += fpu.o
 obj-y += util.o
 obj-y += util-xen.o
+
+ifeq ($(CONFIG_LIVEPATCH),y)
+obj-y += decode-lite.o
+else
+obj-bin-y += decode-lite.init.o
+endif
diff --git a/xen/arch/x86/x86_emulate/decode-lite.c b/xen/arch/x86/x86_emulate/decode-lite.c
new file mode 100644
index 000000000000..6e2581eeab83
--- /dev/null
+++ b/xen/arch/x86/x86_emulate/decode-lite.c
@@ -0,0 +1,311 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include "private.h"
+
+#define Imm8   (1 << 0)
+#define Imm    (1 << 1)
+#define Moffs  (1 << 2)
+#define Branch (1 << 5) /* ... that we care about */
+/*      ModRM  (1 << 6) */
+#define Known  (1 << 7)
+
+#define ALU_OPS                                 \
+    (Known|ModRM),                              \
+    (Known|ModRM),                              \
+    (Known|ModRM),                              \
+    (Known|ModRM),                              \
+    (Known|Imm8),                               \
+    (Known|Imm)
+
+static const uint8_t init_or_livepatch_const onebyte[256] = {
+    [0x00] = ALU_OPS, /* ADD */ [0x08] = ALU_OPS, /* OR  */
+    [0x10] = ALU_OPS, /* ADC */ [0x18] = ALU_OPS, /* SBB */
+    [0x20] = ALU_OPS, /* AND */ [0x28] = ALU_OPS, /* SUB */
+    [0x30] = ALU_OPS, /* XOR */ [0x38] = ALU_OPS, /* CMP */
+/*  [0x40 ... 0x4f] = REX prefixes */
+    [0x50 ... 0x5f] = (Known),             /* PUSH/POP %reg */
+
+    [0x63]          = (Known|ModRM),       /* MOVSxd */
+
+    [0x68]          = (Known|Imm),         /* PUSH $imm */
+    [0x69]          = (Known|ModRM|Imm),   /* IMUL $imm/r/rm */
+    [0x6a]          = (Known|Imm8),        /* PUSH $imm8 */
+    [0x6b]          = (Known|ModRM|Imm8),  /* PUSH $imm8/r/rm */
+    [0x6c ... 0x6f] = (Known),             /* INS/OUTS */
+
+    [0x70 ... 0x7f] = (Known|Branch|Imm8), /* Jcc disp8 */
+    [0x80]          = (Known|ModRM|Imm8),  /* Grp1 */
+    [0x81]          = (Known|ModRM|Imm),   /* Grp1 */
+
+    [0x83]          = (Known|ModRM|Imm8),  /* Grp1 */
+    [0x84 ... 0x8e] = (Known|ModRM),       /* TEST/XCHG/MOV/MOV-SREG/LEA r/rm */
+
+    [0x90 ... 0x99] = (Known),             /* NOP/XCHG rAX/CLTQ/CQTO */
+
+    [0x9b ... 0x9f] = (Known),             /* FWAIT/PUSHF/POPF/SAHF/LAHF */
+
+    [0xa0 ... 0xa3] = (Known|Moffs),       /* MOVABS */
+    [0xa4 ... 0xa7] = (Known),             /* MOVS/CMPS */
+    [0xa8]          = (Known|Imm8),        /* TEST %al */
+    [0xa9]          = (Known|Imm),         /* TEST %al */
+    [0xaa ... 0xaf] = (Known),             /* STOS/LODS/SCAS */
+    [0xb0 ... 0xb7] = (Known|Imm8),        /* MOV $imm8, %reg */
+    [0xb8 ... 0xbf] = (Known|Imm),         /* MOV $imm{16,32,64}, %reg */
+    [0xc0 ... 0xc1] = (Known|ModRM|Imm8),  /* Grp2 (ROL..SAR $imm8, %reg) */
+
+    [0xc3]          = (Known),             /* RET */
+    [0xc6]          = (Known|ModRM|Imm8),  /* Grp11, Further ModRM decode */
+    [0xc7]          = (Known|ModRM|Imm),   /* Grp11, Further ModRM decode */
+
+    [0xcb ... 0xcc] = (Known),             /* LRET/INT3 */
+    [0xcd]          = (Known|Imm8),        /* INT $imm8 */
+
+    [0xd0 ... 0xd3] = (Known|ModRM),       /* Grp2 (ROL..SAR {$1,%cl}, %reg) */
+
+    [0xe4 ... 0xe7] = (Known|Imm8),        /* IN/OUT $imm8 */
+    [0xe8 ... 0xe9] = (Known|Branch|Imm),  /* CALL/JMP disp32 */
+    [0xeb]          = (Known|Branch|Imm8), /* JMP disp8 */
+    [0xec ... 0xef] = (Known),             /* IN/OUT %dx */
+
+    [0xf1]          = (Known),             /* ICEBP */
+    [0xf4]          = (Known),             /* HLT */
+    [0xf5]          = (Known),             /* CMC */
+    [0xf6 ... 0xf7] = (Known|ModRM),       /* Grp3, Further ModRM decode */
+    [0xf8 ... 0xfd] = (Known),             /* CLC ... STD */
+    [0xfe ... 0xff] = (Known|ModRM),       /* Grp4 */
+};
+static const uint8_t init_or_livepatch_const twobyte[256] = {
+    [0x00 ... 0x03] = (Known|ModRM),       /* Grp6/Grp7/LAR/LSL */
+    [0x06]          = (Known),             /* CLTS */
+    [0x09]          = (Known),             /* WBINVD */
+    [0x0b]          = (Known),             /* UD2 */
+
+    [0x18 ... 0x1f] = (Known|ModRM),       /* Grp16 (Hint Nop) */
+    [0x20 ... 0x23] = (Known|ModRM),       /* MOV %cr/%dr */
+
+    [0x30 ... 0x33] = (Known),             /* WRMSR/RDTSC/RDMSR/RDPMC */
+
+    [0x40 ... 0x4f] = (Known|ModRM),       /* CMOVcc */
+
+    [0x80 ... 0x8f] = (Known|Branch|Imm),  /* Jcc disp32 */
+    [0x90 ... 0x9f] = (Known|ModRM),       /* SETcc */
+
+    [0xa0 ... 0xa2] = (Known),             /* PUSH/POP %fs/CPUID */
+    [0xa3]          = (Known|ModRM),       /* BT r/rm */
+    [0xa4]          = (Known|ModRM|Imm8),  /* SHLD $imm8 */
+    [0xa5]          = (Known|ModRM),       /* SHLD %cl */
+
+    [0xa8 ... 0xa9] = (Known),             /* PUSH/POP %gs */
+
+    [0xab]          = (Known|ModRM),       /* BTS */
+    [0xac]          = (Known|ModRM|Imm8),  /* SHRD $imm8 */
+    [0xad ... 0xaf] = (Known|ModRM),       /* SHRD %cl/Grp15/IMUL */
+
+    [0xb0 ... 0xb9] = (Known|ModRM),       /* CMPXCHG/LSS/BTR/LFS/LGS/MOVZxx/POPCNT/Grp10 */
+    [0xba]          = (Known|ModRM|Imm8),  /* Grp8 */
+    [0xbb ... 0xbf] = (Known|ModRM),       /* BSR/BSF/BSR/MOVSX */
+    [0xc0 ... 0xc1] = (Known|ModRM),       /* XADD */
+    [0xc7]          = (Known|ModRM),       /* Grp9 */
+    [0xc8 ... 0xcf] = (Known),             /* BSWAP */
+};
+
+/*
+ * Bare minimum x86 instruction decoder to parse the alternative replacement
+ * instructions and locate the IP-relative references that may need updating.
+ *
+ * These are:
+ *  - disp8/32 from near branches
+ *  - RIP-relative memory references
+ *
+ * The following simplifications are used:
+ *  - All code is 64bit, and the instruction stream is safe to read.
+ *  - The 67 prefix is not implemented, so the address size is only 64bit.
+ *
+ * Inputs:
+ *  @ip  The position to start decoding from.
+ *  @end End of the replacement block.  Exceeding this is considered an error.
+ *
+ * Returns: x86_decode_lite_t
+ *  - On failure, length of 0.
+ *  - On success, length > 0.  For rel_sz > 0, rel points at the relative
+ *    field in the instruction stream.
+ */
+x86_decode_lite_t init_or_livepatch x86_decode_lite(void *ip, void *end)
+{
+    void *start = ip, *rel = NULL;
+    unsigned int opc, rel_sz = 0;
+    uint8_t b, d, rex = 0, osize = 4;
+
+#define OPC_TWOBYTE (1 << 8)
+
+    /* Mutates IP, uses END. */
+#define FETCH(ty)                                       \
+    ({                                                  \
+        ty _val;                                        \
+                                                        \
+        if ( (ip + sizeof(ty)) > end )                  \
+            goto overrun;                               \
+        _val = *(ty *)ip;                               \
+        ip += sizeof(ty);                               \
+        _val;                                           \
+    })
+
+    for ( ;; ) /* Prefixes */
+    {
+        switch ( b = FETCH(uint8_t) )
+        {
+        case 0x26: /* ES override */
+        case 0x2e: /* CS override */
+        case 0x36: /* DS override */
+        case 0x3e: /* SS override */
+        case 0x64: /* FS override */
+        case 0x65: /* GS override */
+        case 0xf0: /* LOCK */
+        case 0xf2: /* REPNE */
+        case 0xf3: /* REP */
+            break;
+
+        case 0x66: /* Operand size override */
+            osize = 2;
+            break;
+
+        /* case 0x67: Address size override, not implemented */
+
+        case 0x40 ... 0x4f: /* REX */
+            rex = b;
+            continue;
+
+        default:
+            goto prefixes_done;
+        }
+        rex = 0; /* REX cancelled by subsequent legacy prefix. */
+    }
+ prefixes_done:
+
+    if ( rex & 0x08 ) /* REX.W */
+        osize = 8;
+
+    /* Fetch the main opcode byte(s) */
+    if ( b == 0x0f )
+    {
+        b = FETCH(uint8_t);
+        opc = OPC_TWOBYTE | b;
+
+        d = twobyte[b];
+    }
+    else
+    {
+        opc = b;
+        d = onebyte[b];
+    }
+
+    if ( unlikely(!(d & Known)) )
+        goto unknown;
+
+    if ( d & ModRM )
+    {
+        uint8_t modrm = FETCH(uint8_t);
+        uint8_t mod = modrm >> 6;
+        uint8_t reg = (modrm >> 3) & 7;
+        uint8_t rm = modrm & 7;
+
+        /* ModRM/SIB decode */
+        if ( mod == 0 && rm == 5 ) /* RIP relative */
+        {
+            rel = ip;
+            rel_sz = 4;
+            FETCH(uint32_t);
+        }
+        else if ( mod != 3 && rm == 4 ) /* SIB */
+        {
+            uint8_t sib = FETCH(uint8_t);
+            uint8_t base = sib & 7;
+
+            if ( mod == 0 && base == 5 )
+                goto disp32;
+        }
+
+        if ( mod == 1 ) /* disp8 */
+            FETCH(uint8_t);
+        else if ( mod == 2 ) /* disp32 */
+        {
+        disp32:
+            FETCH(uint32_t);
+        }
+
+        /* ModRM based decode adjustements */
+        switch ( opc )
+        {
+        case 0xc7: /* Grp11 XBEGIN is a branch. */
+            if ( modrm == 0xf8 )
+                d |= Branch;
+            break;
+        case 0xf6: /* Grp3 TEST(s) have extra Imm8 */
+            if ( reg == 0 || reg == 1 )
+                d |= Imm8;
+            break;
+        case 0xf7: /* Grp3 TEST(s) have extra Imm */
+            if ( reg == 0 || reg == 1 )
+                d |= Imm;
+            break;
+        }
+    }
+
+    if ( d & Branch )
+    {
+        /*
+         * We don't tolerate 66-prefixed call/jmp in alternatives.  Some are
+         * genuinely decoded differently between Intel and AMD CPUs.
+         *
+         * We also don't support APX instructions, so don't have to cope with
+         * JMPABS which is the first branch to have an 8-byte immediate.
+         */
+        if ( osize < 4 )
+            goto bad_osize;
+
+        rel = ip;
+        rel_sz = (d & Imm8) ? 1 : 4;
+    }
+
+    if ( d & (Imm | Imm8 | Moffs) )
+    {
+        if ( d & Imm8 )
+            osize = 1;
+        else if ( d & Moffs )
+            osize = 8;
+        else if ( osize == 8 && !(opc >= 0xb8 && opc <= 0xbf) )
+            osize = 4;
+
+        switch ( osize )
+        {
+        case 1: FETCH(uint8_t);  break;
+        case 2: FETCH(uint16_t); break;
+        case 4: FETCH(uint32_t); break;
+        case 8: FETCH(uint64_t); break;
+        default: goto bad_osize;
+        }
+    }
+
+    return (x86_decode_lite_t){ ip - start, rel_sz, rel };
+
+ bad_osize:
+    printk(XENLOG_ERR "%s() Bad osize %u in %*ph\n",
+           __func__, osize,
+           (int)(unsigned long)(end - start), start);
+    return (x86_decode_lite_t){ 0, 0, NULL };
+
+ unknown:
+    printk(XENLOG_ERR "%s() Unknown opcode in %*ph <%02x> %*ph\n",
+           __func__,
+           (int)(unsigned long)(ip - 1 - start), start, b,
+           (int)(unsigned long)(end - ip), ip);
+    return (x86_decode_lite_t){ 0, 0, NULL };
+
+ overrun:
+    printk(XENLOG_ERR "%s() Decode overrun, got %*ph\n",
+           __func__,
+           (int)(unsigned long)(end - start), start);
+    return (x86_decode_lite_t){ 0, 0, NULL };
+
+#undef FETCH
+}
diff --git a/xen/arch/x86/x86_emulate/private.h b/xen/arch/x86/x86_emulate/private.h
index 172270a458bd..c5e0983948ce 100644
--- a/xen/arch/x86/x86_emulate/private.h
+++ b/xen/arch/x86/x86_emulate/private.h
@@ -9,7 +9,9 @@
 #ifdef __XEN__
 
 # include <xen/bug.h>
+# include <xen/init.h>
 # include <xen/kernel.h>
+# include <xen/livepatch.h>
 # include <asm/endbr.h>
 # include <asm/msr-index.h>
 # include <asm/x86-vendors.h>
diff --git a/xen/arch/x86/x86_emulate/x86_emulate.h b/xen/arch/x86/x86_emulate/x86_emulate.h
index 3e819d41746e..01c371057626 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -846,4 +846,18 @@ static inline void x86_emul_reset_event(struct x86_emulate_ctxt *ctxt)
     ctxt->event = (struct x86_event){};
 }
 
+/*
+ * x86_decode_lite().  Very minimal decoder for managing alternatives.
+ *
+ * @len is 0 on error, or nonzero on success.  If the instruction has a
+ * relative field, @rel_sz is nonzero, and @rel points at the field.
+ */
+typedef struct {
+    uint8_t len;
+    uint8_t rel_sz; /* bytes: 0, 1 or 4 */
+    void *rel;
+} x86_decode_lite_t;
+
+x86_decode_lite_t x86_decode_lite(void *ip, void *end);
+
 #endif /* __X86_EMULATE_H__ */
-- 
2.39.5

Re: [PATCH v2 1/7] x86: Introduce x86_decode_lite()

Posted by Jan Beulich 1 month, 2 weeks ago

On 02.10.2024 17:27, Andrew Cooper wrote:
> --- /dev/null
> +++ b/xen/arch/x86/x86_emulate/decode-lite.c
> @@ -0,0 +1,311 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#include "private.h"
> +
> +#define Imm8   (1 << 0)
> +#define Imm    (1 << 1)
> +#define Moffs  (1 << 2)
> +#define Branch (1 << 5) /* ... that we care about */
> +/*      ModRM  (1 << 6) */
> +#define Known  (1 << 7)
> +
> +#define ALU_OPS                                 \
> +    (Known|ModRM),                              \
> +    (Known|ModRM),                              \
> +    (Known|ModRM),                              \
> +    (Known|ModRM),                              \
> +    (Known|Imm8),                               \
> +    (Known|Imm)
> +
> +static const uint8_t init_or_livepatch_const onebyte[256] = {
> +    [0x00] = ALU_OPS, /* ADD */ [0x08] = ALU_OPS, /* OR  */
> +    [0x10] = ALU_OPS, /* ADC */ [0x18] = ALU_OPS, /* SBB */
> +    [0x20] = ALU_OPS, /* AND */ [0x28] = ALU_OPS, /* SUB */
> +    [0x30] = ALU_OPS, /* XOR */ [0x38] = ALU_OPS, /* CMP */
> +/*  [0x40 ... 0x4f] = REX prefixes */

For these and other aspects further down, may I ask for a comment at the
top of the file setting the scope for this new function (and its
associated data) as being strictly 64-bit only? And perhaps even strictly
covering only what Xen may legitimately use (largely excluding APX for
the foreseeable future, i.e. until such time that we might decide to
allow Xen itself to use APX throughout its code).

Besides APX, with more immediate uses in mind, I wonder about e.g.
BMI/BMI2 insns, which live outside the one/two-byte maps.

I would further appreciate if we could be consistent with the mentioning
(or not) of prefixes: The REX ones here are the only ones that the table
mentions right now. In fact I wonder whether a Prefix attribute wouldn't
be nice to have, so you don't need to open-code all of them in the
function itself.

> +    [0x50 ... 0x5f] = (Known),             /* PUSH/POP %reg */
> +
> +    [0x63]          = (Known|ModRM),       /* MOVSxd */
> +
> +    [0x68]          = (Known|Imm),         /* PUSH $imm */
> +    [0x69]          = (Known|ModRM|Imm),   /* IMUL $imm/r/rm */
> +    [0x6a]          = (Known|Imm8),        /* PUSH $imm8 */
> +    [0x6b]          = (Known|ModRM|Imm8),  /* PUSH $imm8/r/rm */

IMUL? I'm also slightly irritated by $imm{,8}/r/rm - better to use commas
to separate operands, and then rm is the middle one (2nd source) while r
is the destination (last), if already you want AT&T operand order.

> +    [0x6c ... 0x6f] = (Known),             /* INS/OUTS */
> +
> +    [0x70 ... 0x7f] = (Known|Branch|Imm8), /* Jcc disp8 */
> +    [0x80]          = (Known|ModRM|Imm8),  /* Grp1 */
> +    [0x81]          = (Known|ModRM|Imm),   /* Grp1 */
> +
> +    [0x83]          = (Known|ModRM|Imm8),  /* Grp1 */
> +    [0x84 ... 0x8e] = (Known|ModRM),       /* TEST/XCHG/MOV/MOV-SREG/LEA r/rm */
> +
> +    [0x90 ... 0x99] = (Known),             /* NOP/XCHG rAX/CLTQ/CQTO */

Omitting PUSH (at 0x8f) is somewhat odd, considering that it's a pretty
"normal" insn.

> +    [0x9b ... 0x9f] = (Known),             /* FWAIT/PUSHF/POPF/SAHF/LAHF */
> +
> +    [0xa0 ... 0xa3] = (Known|Moffs),       /* MOVABS */
> +    [0xa4 ... 0xa7] = (Known),             /* MOVS/CMPS */
> +    [0xa8]          = (Known|Imm8),        /* TEST %al */
> +    [0xa9]          = (Known|Imm),         /* TEST %al */

%rAX?

> +    [0xaa ... 0xaf] = (Known),             /* STOS/LODS/SCAS */
> +    [0xb0 ... 0xb7] = (Known|Imm8),        /* MOV $imm8, %reg */
> +    [0xb8 ... 0xbf] = (Known|Imm),         /* MOV $imm{16,32,64}, %reg */
> +    [0xc0 ... 0xc1] = (Known|ModRM|Imm8),  /* Grp2 (ROL..SAR $imm8, %reg) */
> +
> +    [0xc3]          = (Known),             /* RET */

With the absence of Branch here I think it would be nice if at least the
description went into a little more detail as to the comment higher up
saying "... that we care about".

> +    [0xc6]          = (Known|ModRM|Imm8),  /* Grp11, Further ModRM decode */
> +    [0xc7]          = (Known|ModRM|Imm),   /* Grp11, Further ModRM decode */
> +
> +    [0xcb ... 0xcc] = (Known),             /* LRET/INT3 */
> +    [0xcd]          = (Known|Imm8),        /* INT $imm8 */

No IRET, when you have things like e.g. ICEBP and CLTS?

> +    [0xd0 ... 0xd3] = (Known|ModRM),       /* Grp2 (ROL..SAR {$1,%cl}, %reg) */
> +
> +    [0xe4 ... 0xe7] = (Known|Imm8),        /* IN/OUT $imm8 */
> +    [0xe8 ... 0xe9] = (Known|Branch|Imm),  /* CALL/JMP disp32 */
> +    [0xeb]          = (Known|Branch|Imm8), /* JMP disp8 */
> +    [0xec ... 0xef] = (Known),             /* IN/OUT %dx */
> +
> +    [0xf1]          = (Known),             /* ICEBP */
> +    [0xf4]          = (Known),             /* HLT */
> +    [0xf5]          = (Known),             /* CMC */
> +    [0xf6 ... 0xf7] = (Known|ModRM),       /* Grp3, Further ModRM decode */
> +    [0xf8 ... 0xfd] = (Known),             /* CLC ... STD */
> +    [0xfe ... 0xff] = (Known|ModRM),       /* Grp4 */
> +};
> +static const uint8_t init_or_livepatch_const twobyte[256] = {
> +    [0x00 ... 0x03] = (Known|ModRM),       /* Grp6/Grp7/LAR/LSL */
> +    [0x06]          = (Known),             /* CLTS */
> +    [0x09]          = (Known),             /* WBINVD */
> +    [0x0b]          = (Known),             /* UD2 */
> +
> +    [0x18 ... 0x1f] = (Known|ModRM),       /* Grp16 (Hint Nop) */
> +    [0x20 ... 0x23] = (Known|ModRM),       /* MOV %cr/%dr */
> +
> +    [0x30 ... 0x33] = (Known),             /* WRMSR/RDTSC/RDMSR/RDPMC */
> +
> +    [0x40 ... 0x4f] = (Known|ModRM),       /* CMOVcc */
> +
> +    [0x80 ... 0x8f] = (Known|Branch|Imm),  /* Jcc disp32 */
> +    [0x90 ... 0x9f] = (Known|ModRM),       /* SETcc */
> +
> +    [0xa0 ... 0xa2] = (Known),             /* PUSH/POP %fs/CPUID */
> +    [0xa3]          = (Known|ModRM),       /* BT r/rm */
> +    [0xa4]          = (Known|ModRM|Imm8),  /* SHLD $imm8 */
> +    [0xa5]          = (Known|ModRM),       /* SHLD %cl */
> +
> +    [0xa8 ... 0xa9] = (Known),             /* PUSH/POP %gs */
> +
> +    [0xab]          = (Known|ModRM),       /* BTS */
> +    [0xac]          = (Known|ModRM|Imm8),  /* SHRD $imm8 */
> +    [0xad ... 0xaf] = (Known|ModRM),       /* SHRD %cl/Grp15/IMUL */
> +
> +    [0xb0 ... 0xb9] = (Known|ModRM),       /* CMPXCHG/LSS/BTR/LFS/LGS/MOVZxx/POPCNT/Grp10 */

Grp10 is kind of odd - I think UD1 would be more concise, despite the SDM
indeed using group 10 there.

> +    [0xba]          = (Known|ModRM|Imm8),  /* Grp8 */
> +    [0xbb ... 0xbf] = (Known|ModRM),       /* BSR/BSF/BSR/MOVSX */

0xbb is BTC I think.

> +    [0xc0 ... 0xc1] = (Known|ModRM),       /* XADD */
> +    [0xc7]          = (Known|ModRM),       /* Grp9 */
> +    [0xc8 ... 0xcf] = (Known),             /* BSWAP */

With UD1 and UD2, perhaps UD0 would make sense to also have in the table?

> +};
> +
> +/*
> + * Bare minimum x86 instruction decoder to parse the alternative replacement
> + * instructions and locate the IP-relative references that may need updating.
> + *
> + * These are:
> + *  - disp8/32 from near branches
> + *  - RIP-relative memory references
> + *
> + * The following simplifications are used:
> + *  - All code is 64bit, and the instruction stream is safe to read.
> + *  - The 67 prefix is not implemented, so the address size is only 64bit.

As to the earlier remark - maybe this part of the comment could simply
move to the top of the file?

> + * Inputs:
> + *  @ip  The position to start decoding from.
> + *  @end End of the replacement block.  Exceeding this is considered an error.
> + *
> + * Returns: x86_decode_lite_t
> + *  - On failure, length of 0.
> + *  - On success, length > 0.  For rel_sz > 0, rel points at the relative
> + *    field in the instruction stream.
> + */
> +x86_decode_lite_t init_or_livepatch x86_decode_lite(void *ip, void *end)

Imo both pointers would be nice to be to-const.

> +{
> +    void *start = ip, *rel = NULL;
> +    unsigned int opc, rel_sz = 0;
> +    uint8_t b, d, rex = 0, osize = 4;
> +
> +#define OPC_TWOBYTE (1 << 8)
> +
> +    /* Mutates IP, uses END. */
> +#define FETCH(ty)                                       \
> +    ({                                                  \
> +        ty _val;                                        \
> +                                                        \
> +        if ( (ip + sizeof(ty)) > end )                  \
> +            goto overrun;                               \
> +        _val = *(ty *)ip;                               \
> +        ip += sizeof(ty);                               \
> +        _val;                                           \
> +    })
> +
> +    for ( ;; ) /* Prefixes */
> +    {
> +        switch ( b = FETCH(uint8_t) )
> +        {
> +        case 0x26: /* ES override */
> +        case 0x2e: /* CS override */
> +        case 0x36: /* DS override */
> +        case 0x3e: /* SS override */
> +        case 0x64: /* FS override */
> +        case 0x65: /* GS override */

If you don't like the idea of a Prefix attribute I wonder in how far we
actually need all of the above, when you already ...

> +        case 0xf0: /* LOCK */
> +        case 0xf2: /* REPNE */
> +        case 0xf3: /* REP */
> +            break;
> +
> +        case 0x66: /* Operand size override */
> +            osize = 2;
> +            break;
> +
> +        /* case 0x67: Address size override, not implemented */

... deliberately leave of this one.

> +        case 0x40 ... 0x4f: /* REX */
> +            rex = b;
> +            continue;
> +
> +        default:
> +            goto prefixes_done;
> +        }
> +        rex = 0; /* REX cancelled by subsequent legacy prefix. */
> +    }
> + prefixes_done:
> +
> +    if ( rex & 0x08 ) /* REX.W */

Can you please use REX_W here?

> +        osize = 8;
> +
> +    /* Fetch the main opcode byte(s) */
> +    if ( b == 0x0f )
> +    {
> +        b = FETCH(uint8_t);
> +        opc = OPC_TWOBYTE | b;
> +
> +        d = twobyte[b];
> +    }
> +    else
> +    {
> +        opc = b;
> +        d = onebyte[b];
> +    }
> +
> +    if ( unlikely(!(d & Known)) )
> +        goto unknown;
> +
> +    if ( d & ModRM )
> +    {
> +        uint8_t modrm = FETCH(uint8_t);
> +        uint8_t mod = modrm >> 6;
> +        uint8_t reg = (modrm >> 3) & 7;
> +        uint8_t rm = modrm & 7;
> +
> +        /* ModRM/SIB decode */
> +        if ( mod == 0 && rm == 5 ) /* RIP relative */
> +        {
> +            rel = ip;
> +            rel_sz = 4;
> +            FETCH(uint32_t);
> +        }
> +        else if ( mod != 3 && rm == 4 ) /* SIB */
> +        {
> +            uint8_t sib = FETCH(uint8_t);
> +            uint8_t base = sib & 7;
> +
> +            if ( mod == 0 && base == 5 )
> +                goto disp32;
> +        }
> +
> +        if ( mod == 1 ) /* disp8 */
> +            FETCH(uint8_t);
> +        else if ( mod == 2 ) /* disp32 */
> +        {
> +        disp32:
> +            FETCH(uint32_t);

The values aren't used, so the types don't matter overly much, yet int8_t
and int32_t would be a more accurate representation of what's being
fetched.

> +        }
> +
> +        /* ModRM based decode adjustements */
> +        switch ( opc )
> +        {
> +        case 0xc7: /* Grp11 XBEGIN is a branch. */
> +            if ( modrm == 0xf8 )
> +                d |= Branch;
> +            break;
> +        case 0xf6: /* Grp3 TEST(s) have extra Imm8 */
> +            if ( reg == 0 || reg == 1 )
> +                d |= Imm8;
> +            break;
> +        case 0xf7: /* Grp3 TEST(s) have extra Imm */
> +            if ( reg == 0 || reg == 1 )
> +                d |= Imm;
> +            break;
> +        }

In this switch() you don't distinguish 1- and 2-byte maps at all. This
is an issue in particular for 0xc7.

Blank lines between case blocks would also be nice, as this switch() is
going to only ever grow.

> +    }
> +
> +    if ( d & Branch )
> +    {
> +        /*
> +         * We don't tolerate 66-prefixed call/jmp in alternatives.  Some are
> +         * genuinely decoded differently between Intel and AMD CPUs.
> +         *
> +         * We also don't support APX instructions, so don't have to cope with
> +         * JMPABS which is the first branch to have an 8-byte immediate.
> +         */
> +        if ( osize < 4 )
> +            goto bad_osize;
> +
> +        rel = ip;
> +        rel_sz = (d & Imm8) ? 1 : 4;
> +    }
> +
> +    if ( d & (Imm | Imm8 | Moffs) )
> +    {
> +        if ( d & Imm8 )
> +            osize = 1;
> +        else if ( d & Moffs )
> +            osize = 8;
> +        else if ( osize == 8 && !(opc >= 0xb8 && opc <= 0xbf) )

Again want to also take the opcode map into account, even if - by luck -
this would work as is for now.

Also could I talk you into converting the two comparisons into one, along
the lines of "(opc | 7) != 0xbf"?

> --- a/xen/arch/x86/x86_emulate/private.h
> +++ b/xen/arch/x86/x86_emulate/private.h
> @@ -9,7 +9,9 @@
>  #ifdef __XEN__
>  
>  # include <xen/bug.h>
> +# include <xen/init.h>
>  # include <xen/kernel.h>
> +# include <xen/livepatch.h>
>  # include <asm/endbr.h>
>  # include <asm/msr-index.h>
>  # include <asm/x86-vendors.h>

It's only the new file that needs these - can we limit the dependencies
to just that one by putting these new #include-s there?

Jan

Re: [PATCH v2 1/7] x86: Introduce x86_decode_lite()

Posted by Andrew Cooper 1 month, 2 weeks ago

On 07/10/2024 1:56 pm, Jan Beulich wrote:
> On 02.10.2024 17:27, Andrew Cooper wrote:
>> --- /dev/null
>> +++ b/xen/arch/x86/x86_emulate/decode-lite.c
>> @@ -0,0 +1,311 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +
>> +#include "private.h"
>> +
>> +#define Imm8   (1 << 0)
>> +#define Imm    (1 << 1)
>> +#define Moffs  (1 << 2)
>> +#define Branch (1 << 5) /* ... that we care about */
>> +/*      ModRM  (1 << 6) */
>> +#define Known  (1 << 7)
>> +
>> +#define ALU_OPS                                 \
>> +    (Known|ModRM),                              \
>> +    (Known|ModRM),                              \
>> +    (Known|ModRM),                              \
>> +    (Known|ModRM),                              \
>> +    (Known|Imm8),                               \
>> +    (Known|Imm)
>> +
>> +static const uint8_t init_or_livepatch_const onebyte[256] = {
>> +    [0x00] = ALU_OPS, /* ADD */ [0x08] = ALU_OPS, /* OR  */
>> +    [0x10] = ALU_OPS, /* ADC */ [0x18] = ALU_OPS, /* SBB */
>> +    [0x20] = ALU_OPS, /* AND */ [0x28] = ALU_OPS, /* SUB */
>> +    [0x30] = ALU_OPS, /* XOR */ [0x38] = ALU_OPS, /* CMP */
>> +/*  [0x40 ... 0x4f] = REX prefixes */
> For these and other aspects further down, may I ask for a comment at the
> top of the file setting the scope for this new function (and its
> associated data) as being strictly 64-bit only? And perhaps even strictly
> covering only what Xen may legitimately use (largely excluding APX for
> the foreseeable future, i.e. until such time that we might decide to
> allow Xen itself to use APX throughout its code).
>
> Besides APX, with more immediate uses in mind, I wonder about e.g.
> BMI/BMI2 insns, which live outside the one/two-byte maps.

They're not needed yet, and it would require extra decode complexity.

> I would further appreciate if we could be consistent with the mentioning
> (or not) of prefixes: The REX ones here are the only ones that the table
> mentions right now. In fact I wonder whether a Prefix attribute wouldn't
> be nice to have, so you don't need to open-code all of them in the
> function itself.

The comment about REX prefixes is only here because you insisted on it.

But I really don't like double-explaining things.

>> +    [0x6c ... 0x6f] = (Known),             /* INS/OUTS */
>> +
>> +    [0x70 ... 0x7f] = (Known|Branch|Imm8), /* Jcc disp8 */
>> +    [0x80]          = (Known|ModRM|Imm8),  /* Grp1 */
>> +    [0x81]          = (Known|ModRM|Imm),   /* Grp1 */
>> +
>> +    [0x83]          = (Known|ModRM|Imm8),  /* Grp1 */
>> +    [0x84 ... 0x8e] = (Known|ModRM),       /* TEST/XCHG/MOV/MOV-SREG/LEA r/rm */
>> +
>> +    [0x90 ... 0x99] = (Known),             /* NOP/XCHG rAX/CLTQ/CQTO */
> Omitting PUSH (at 0x8f) is somewhat odd, considering that it's a pretty
> "normal" insn.

Except it's not.  It's the XOP prefix too, and would require extra
decode complexity.

>> +    [0xc6]          = (Known|ModRM|Imm8),  /* Grp11, Further ModRM decode */
>> +    [0xc7]          = (Known|ModRM|Imm),   /* Grp11, Further ModRM decode */
>> +
>> +    [0xcb ... 0xcc] = (Known),             /* LRET/INT3 */
>> +    [0xcd]          = (Known|Imm8),        /* INT $imm8 */
> No IRET, when you have things like e.g. ICEBP and CLTS?

The absence of IRET is intentional, because it can't be used safely. 
SYSRET/EXIT are excluded too for consistency.

IRET can be added if/when it is needed, and someone has figured out a
way of adjusting the exception table entry.

>> +};
>> +
>> +/*
>> + * Bare minimum x86 instruction decoder to parse the alternative replacement
>> + * instructions and locate the IP-relative references that may need updating.
>> + *
>> + * These are:
>> + *  - disp8/32 from near branches
>> + *  - RIP-relative memory references
>> + *
>> + * The following simplifications are used:
>> + *  - All code is 64bit, and the instruction stream is safe to read.
>> + *  - The 67 prefix is not implemented, so the address size is only 64bit.
> As to the earlier remark - maybe this part of the comment could simply
> move to the top of the file?

I really don't want to split the comment.  It needs to all live together.

Given this is a single-function file, I also don't see the need for this
comment to move away from here.  You can't interpret the decode tables
without reading the function.

>
>> + * Inputs:
>> + *  @ip  The position to start decoding from.
>> + *  @end End of the replacement block.  Exceeding this is considered an error.
>> + *
>> + * Returns: x86_decode_lite_t
>> + *  - On failure, length of 0.
>> + *  - On success, length > 0.  For rel_sz > 0, rel points at the relative
>> + *    field in the instruction stream.
>> + */
>> +x86_decode_lite_t init_or_livepatch x86_decode_lite(void *ip, void *end)
> Imo both pointers would be nice to be to-const.

In v1, you also identified why that wouldn't compile.

>
>> +{
>> +    void *start = ip, *rel = NULL;
>> +    unsigned int opc, rel_sz = 0;
>> +    uint8_t b, d, rex = 0, osize = 4;
>> +
>> +#define OPC_TWOBYTE (1 << 8)
>> +
>> +    /* Mutates IP, uses END. */
>> +#define FETCH(ty)                                       \
>> +    ({                                                  \
>> +        ty _val;                                        \
>> +                                                        \
>> +        if ( (ip + sizeof(ty)) > end )                  \
>> +            goto overrun;                               \
>> +        _val = *(ty *)ip;                               \
>> +        ip += sizeof(ty);                               \
>> +        _val;                                           \
>> +    })
>> +
>> +    for ( ;; ) /* Prefixes */
>> +    {
>> +        switch ( b = FETCH(uint8_t) )
>> +        {
>> +        case 0x26: /* ES override */
>> +        case 0x2e: /* CS override */
>> +        case 0x36: /* DS override */
>> +        case 0x3e: /* SS override */
>> +        case 0x64: /* FS override */
>> +        case 0x65: /* GS override */
> If you don't like the idea of a Prefix attribute

I don't like the idea of making this intentionally dissimilar to the
main decoder, just to safe a few lines of source code.

GCC optimises it into a bitmap anyway.

>  I wonder in how far we
> actually need all of the above, when you already ...
>
>> +        case 0xf0: /* LOCK */
>> +        case 0xf2: /* REPNE */
>> +        case 0xf3: /* REP */
>> +            break;
>> +
>> +        case 0x66: /* Operand size override */
>> +            osize = 2;
>> +            break;
>> +
>> +        /* case 0x67: Address size override, not implemented */
> ... deliberately leave of this one.

Excluding 67 is intentional because it a) has no business being used,
and b) adds a huge amount of decode complexity.

Whereas two of segment prefixes are already necessary to decode the
alternatives we have today.
>> +        case 0x40 ... 0x4f: /* REX */
>> +            rex = b;
>> +            continue;
>> +
>> +        default:
>> +            goto prefixes_done;
>> +        }
>> +        rex = 0; /* REX cancelled by subsequent legacy prefix. */
>> +    }
>> + prefixes_done:
>> +
>> +    if ( rex & 0x08 ) /* REX.W */
> Can you please use REX_W here?

Oh, it is available.  Ok.

>
>> +        osize = 8;
>> +
>> +    /* Fetch the main opcode byte(s) */
>> +    if ( b == 0x0f )
>> +    {
>> +        b = FETCH(uint8_t);
>> +        opc = OPC_TWOBYTE | b;
>> +
>> +        d = twobyte[b];
>> +    }
>> +    else
>> +    {
>> +        opc = b;
>> +        d = onebyte[b];
>> +    }
>> +
>> +    if ( unlikely(!(d & Known)) )
>> +        goto unknown;
>> +
>> +    if ( d & ModRM )
>> +    {
>> +        uint8_t modrm = FETCH(uint8_t);
>> +        uint8_t mod = modrm >> 6;
>> +        uint8_t reg = (modrm >> 3) & 7;
>> +        uint8_t rm = modrm & 7;
>> +
>> +        /* ModRM/SIB decode */
>> +        if ( mod == 0 && rm == 5 ) /* RIP relative */
>> +        {
>> +            rel = ip;
>> +            rel_sz = 4;
>> +            FETCH(uint32_t);
>> +        }
>> +        else if ( mod != 3 && rm == 4 ) /* SIB */
>> +        {
>> +            uint8_t sib = FETCH(uint8_t);
>> +            uint8_t base = sib & 7;
>> +
>> +            if ( mod == 0 && base == 5 )
>> +                goto disp32;
>> +        }
>> +
>> +        if ( mod == 1 ) /* disp8 */
>> +            FETCH(uint8_t);
>> +        else if ( mod == 2 ) /* disp32 */
>> +        {
>> +        disp32:
>> +            FETCH(uint32_t);
> The values aren't used, so the types don't matter overly much, yet int8_t
> and int32_t would be a more accurate representation of what's being
> fetched.

Why does that matter?  I'm skipping a number of bytes, not interpreting
the result.

>
>> +        }
>> +
>> +        /* ModRM based decode adjustements */
>> +        switch ( opc )
>> +        {
>> +        case 0xc7: /* Grp11 XBEGIN is a branch. */
>> +            if ( modrm == 0xf8 )
>> +                d |= Branch;
>> +            break;
>> +        case 0xf6: /* Grp3 TEST(s) have extra Imm8 */
>> +            if ( reg == 0 || reg == 1 )
>> +                d |= Imm8;
>> +            break;
>> +        case 0xf7: /* Grp3 TEST(s) have extra Imm */
>> +            if ( reg == 0 || reg == 1 )
>> +                d |= Imm;
>> +            break;
>> +        }
> In this switch() you don't distinguish 1- and 2-byte maps at all.

See OPC_TWOBYTE.  They are distinguished here.
>> +    }
>> +
>> +    if ( d & Branch )
>> +    {
>> +        /*
>> +         * We don't tolerate 66-prefixed call/jmp in alternatives.  Some are
>> +         * genuinely decoded differently between Intel and AMD CPUs.
>> +         *
>> +         * We also don't support APX instructions, so don't have to cope with
>> +         * JMPABS which is the first branch to have an 8-byte immediate.
>> +         */
>> +        if ( osize < 4 )
>> +            goto bad_osize;
>> +
>> +        rel = ip;
>> +        rel_sz = (d & Imm8) ? 1 : 4;
>> +    }
>> +
>> +    if ( d & (Imm | Imm8 | Moffs) )
>> +    {
>> +        if ( d & Imm8 )
>> +            osize = 1;
>> +        else if ( d & Moffs )
>> +            osize = 8;
>> +        else if ( osize == 8 && !(opc >= 0xb8 && opc <= 0xbf) )
> Again want to also take the opcode map into account, even if - by luck -
> this would work as is for now.
>
> Also could I talk you into converting the two comparisons into one, along
> the lines of "(opc | 7) != 0xbf"?

That's the kind of obfuscation which should be left to the compiler.

I know you think like that, but most others (including myself) do not.
>> --- a/xen/arch/x86/x86_emulate/private.h
>> +++ b/xen/arch/x86/x86_emulate/private.h
>> @@ -9,7 +9,9 @@
>>  #ifdef __XEN__
>>  
>>  # include <xen/bug.h>
>> +# include <xen/init.h>
>>  # include <xen/kernel.h>
>> +# include <xen/livepatch.h>
>>  # include <asm/endbr.h>
>>  # include <asm/msr-index.h>
>>  # include <asm/x86-vendors.h>
> It's only the new file that needs these - can we limit the dependencies
> to just that one by putting these new #include-s there?

Not if you want the userspace harness in patch 2 to compile.

~Andrew

Re: [PATCH v2 1/7] x86: Introduce x86_decode_lite()

Posted by Jan Beulich 1 month, 2 weeks ago

On 07.10.2024 19:17, Andrew Cooper wrote:
> On 07/10/2024 1:56 pm, Jan Beulich wrote:
>> On 02.10.2024 17:27, Andrew Cooper wrote:
>>> --- /dev/null
>>> +++ b/xen/arch/x86/x86_emulate/decode-lite.c
>>> @@ -0,0 +1,311 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +
>>> +#include "private.h"
>>> +
>>> +#define Imm8   (1 << 0)
>>> +#define Imm    (1 << 1)
>>> +#define Moffs  (1 << 2)
>>> +#define Branch (1 << 5) /* ... that we care about */
>>> +/*      ModRM  (1 << 6) */
>>> +#define Known  (1 << 7)
>>> +
>>> +#define ALU_OPS                                 \
>>> +    (Known|ModRM),                              \
>>> +    (Known|ModRM),                              \
>>> +    (Known|ModRM),                              \
>>> +    (Known|ModRM),                              \
>>> +    (Known|Imm8),                               \
>>> +    (Known|Imm)
>>> +
>>> +static const uint8_t init_or_livepatch_const onebyte[256] = {
>>> +    [0x00] = ALU_OPS, /* ADD */ [0x08] = ALU_OPS, /* OR  */
>>> +    [0x10] = ALU_OPS, /* ADC */ [0x18] = ALU_OPS, /* SBB */
>>> +    [0x20] = ALU_OPS, /* AND */ [0x28] = ALU_OPS, /* SUB */
>>> +    [0x30] = ALU_OPS, /* XOR */ [0x38] = ALU_OPS, /* CMP */
>>> +/*  [0x40 ... 0x4f] = REX prefixes */
>> For these and other aspects further down, may I ask for a comment at the
>> top of the file setting the scope for this new function (and its
>> associated data) as being strictly 64-bit only? And perhaps even strictly
>> covering only what Xen may legitimately use (largely excluding APX for
>> the foreseeable future, i.e. until such time that we might decide to
>> allow Xen itself to use APX throughout its code).
>>
>> Besides APX, with more immediate uses in mind, I wonder about e.g.
>> BMI/BMI2 insns, which live outside the one/two-byte maps.
> 
> They're not needed yet, and it would require extra decode complexity.
> 
>> I would further appreciate if we could be consistent with the mentioning
>> (or not) of prefixes: The REX ones here are the only ones that the table
>> mentions right now. In fact I wonder whether a Prefix attribute wouldn't
>> be nice to have, so you don't need to open-code all of them in the
>> function itself.
> 
> The comment about REX prefixes is only here because you insisted on it.

Did I? Looking back at my v1 replies I can't spot such a comment. But anyway,
what I'd like to ask for is consistency: Mention all (relevant) prefixes, or
none of them.

>>> +    [0x6c ... 0x6f] = (Known),             /* INS/OUTS */
>>> +
>>> +    [0x70 ... 0x7f] = (Known|Branch|Imm8), /* Jcc disp8 */
>>> +    [0x80]          = (Known|ModRM|Imm8),  /* Grp1 */
>>> +    [0x81]          = (Known|ModRM|Imm),   /* Grp1 */
>>> +
>>> +    [0x83]          = (Known|ModRM|Imm8),  /* Grp1 */
>>> +    [0x84 ... 0x8e] = (Known|ModRM),       /* TEST/XCHG/MOV/MOV-SREG/LEA r/rm */
>>> +
>>> +    [0x90 ... 0x99] = (Known),             /* NOP/XCHG rAX/CLTQ/CQTO */
>> Omitting PUSH (at 0x8f) is somewhat odd, considering that it's a pretty
>> "normal" insn.
> 
> Except it's not.  It's the XOP prefix too, and would require extra
> decode complexity.

Well, just like you don't decode VEX/EVEX, you could opt not to decode XOP
despite decoding this PUSH form. Otherwise I'm inclined to ask that no PUSH
or POP should be decoded.

>>> +    [0xc6]          = (Known|ModRM|Imm8),  /* Grp11, Further ModRM decode */
>>> +    [0xc7]          = (Known|ModRM|Imm),   /* Grp11, Further ModRM decode */
>>> +
>>> +    [0xcb ... 0xcc] = (Known),             /* LRET/INT3 */
>>> +    [0xcd]          = (Known|Imm8),        /* INT $imm8 */
>> No IRET, when you have things like e.g. ICEBP and CLTS?
> 
> The absence of IRET is intentional, because it can't be used safely. 
> SYSRET/EXIT are excluded too for consistency.
> 
> IRET can be added if/when it is needed, and someone has figured out a
> way of adjusting the exception table entry.

If it has an attached exception table entry. If it doesn't, I think its
use would in principle be fine. And an attached exception table entry could
similarly be a problem for any other insn, couldn't it? The same argument
would in particular apply to LRET, I think.

Anyway, once again my main concern here are apparent inconsistencies. Any
of them will make it more difficult to judge what needs / doesn't need
adding here going forward.

>>> +};
>>> +
>>> +/*
>>> + * Bare minimum x86 instruction decoder to parse the alternative replacement
>>> + * instructions and locate the IP-relative references that may need updating.
>>> + *
>>> + * These are:
>>> + *  - disp8/32 from near branches
>>> + *  - RIP-relative memory references
>>> + *
>>> + * The following simplifications are used:
>>> + *  - All code is 64bit, and the instruction stream is safe to read.
>>> + *  - The 67 prefix is not implemented, so the address size is only 64bit.
>> As to the earlier remark - maybe this part of the comment could simply
>> move to the top of the file?
> 
> I really don't want to split the comment.  It needs to all live together.
> 
> Given this is a single-function file, I also don't see the need for this
> comment to move away from here.  You can't interpret the decode tables
> without reading the function.

I'm torn; I think having some fundamental information at the top of the file
would help. Yet keeping commentary together is also a valid aspect.

>>> + * Inputs:
>>> + *  @ip  The position to start decoding from.
>>> + *  @end End of the replacement block.  Exceeding this is considered an error.
>>> + *
>>> + * Returns: x86_decode_lite_t
>>> + *  - On failure, length of 0.
>>> + *  - On success, length > 0.  For rel_sz > 0, rel points at the relative
>>> + *    field in the instruction stream.
>>> + */
>>> +x86_decode_lite_t init_or_livepatch x86_decode_lite(void *ip, void *end)
>> Imo both pointers would be nice to be to-const.
> 
> In v1, you also identified why that wouldn't compile.

Hmm, right. It's been a while. Would it be too much to ask that you add a
sentence to that effect to the description?

>>> +{
>>> +    void *start = ip, *rel = NULL;
>>> +    unsigned int opc, rel_sz = 0;
>>> +    uint8_t b, d, rex = 0, osize = 4;
>>> +
>>> +#define OPC_TWOBYTE (1 << 8)
>>> +
>>> +    /* Mutates IP, uses END. */
>>> +#define FETCH(ty)                                       \
>>> +    ({                                                  \
>>> +        ty _val;                                        \
>>> +                                                        \
>>> +        if ( (ip + sizeof(ty)) > end )                  \
>>> +            goto overrun;                               \
>>> +        _val = *(ty *)ip;                               \
>>> +        ip += sizeof(ty);                               \
>>> +        _val;                                           \
>>> +    })
>>> +
>>> +    for ( ;; ) /* Prefixes */
>>> +    {
>>> +        switch ( b = FETCH(uint8_t) )
>>> +        {
>>> +        case 0x26: /* ES override */
>>> +        case 0x2e: /* CS override */
>>> +        case 0x36: /* DS override */
>>> +        case 0x3e: /* SS override */
>>> +        case 0x64: /* FS override */
>>> +        case 0x65: /* GS override */
>> If you don't like the idea of a Prefix attribute
> 
> I don't like the idea of making this intentionally dissimilar to the
> main decoder, just to safe a few lines of source code.
> 
> GCC optimises it into a bitmap anyway.

Well, okay then.

>>  I wonder in how far we
>> actually need all of the above, when you already ...
>>
>>> +        case 0xf0: /* LOCK */
>>> +        case 0xf2: /* REPNE */
>>> +        case 0xf3: /* REP */
>>> +            break;
>>> +
>>> +        case 0x66: /* Operand size override */
>>> +            osize = 2;
>>> +            break;
>>> +
>>> +        /* case 0x67: Address size override, not implemented */
>> ... deliberately leave of this one.
> 
> Excluding 67 is intentional because it a) has no business being used,
> and b) adds a huge amount of decode complexity.

I largely agree with a), but why b)? The only difference in decode is
how to treat moffs. In any event - the comments here again were because
things looks somewhat inconsistent the way they are right now. As that's
deliberate, it's perhaps tolerable.

> Whereas two of segment prefixes are already necessary to decode the
> alternatives we have today.
>>> +        case 0x40 ... 0x4f: /* REX */
>>> +            rex = b;
>>> +            continue;
>>> +
>>> +        default:
>>> +            goto prefixes_done;
>>> +        }
>>> +        rex = 0; /* REX cancelled by subsequent legacy prefix. */
>>> +    }
>>> + prefixes_done:
>>> +
>>> +    if ( rex & 0x08 ) /* REX.W */
>> Can you please use REX_W here?
> 
> Oh, it is available.  Ok.
> 
>>
>>> +        osize = 8;
>>> +
>>> +    /* Fetch the main opcode byte(s) */
>>> +    if ( b == 0x0f )
>>> +    {
>>> +        b = FETCH(uint8_t);
>>> +        opc = OPC_TWOBYTE | b;
>>> +
>>> +        d = twobyte[b];
>>> +    }
>>> +    else
>>> +    {
>>> +        opc = b;
>>> +        d = onebyte[b];
>>> +    }
>>> +
>>> +    if ( unlikely(!(d & Known)) )
>>> +        goto unknown;
>>> +
>>> +    if ( d & ModRM )
>>> +    {
>>> +        uint8_t modrm = FETCH(uint8_t);
>>> +        uint8_t mod = modrm >> 6;
>>> +        uint8_t reg = (modrm >> 3) & 7;
>>> +        uint8_t rm = modrm & 7;
>>> +
>>> +        /* ModRM/SIB decode */
>>> +        if ( mod == 0 && rm == 5 ) /* RIP relative */
>>> +        {
>>> +            rel = ip;
>>> +            rel_sz = 4;
>>> +            FETCH(uint32_t);
>>> +        }
>>> +        else if ( mod != 3 && rm == 4 ) /* SIB */
>>> +        {
>>> +            uint8_t sib = FETCH(uint8_t);
>>> +            uint8_t base = sib & 7;
>>> +
>>> +            if ( mod == 0 && base == 5 )
>>> +                goto disp32;
>>> +        }
>>> +
>>> +        if ( mod == 1 ) /* disp8 */
>>> +            FETCH(uint8_t);
>>> +        else if ( mod == 2 ) /* disp32 */
>>> +        {
>>> +        disp32:
>>> +            FETCH(uint32_t);
>> The values aren't used, so the types don't matter overly much, yet int8_t
>> and int32_t would be a more accurate representation of what's being
>> fetched.
> 
> Why does that matter?  I'm skipping a number of bytes, not interpreting
> the result.

It matters from a doc pov only at this point. When one simply reads this
code, one may wonder "why unsigned". Just like I did.

>>> +        }
>>> +
>>> +        /* ModRM based decode adjustements */
>>> +        switch ( opc )
>>> +        {
>>> +        case 0xc7: /* Grp11 XBEGIN is a branch. */
>>> +            if ( modrm == 0xf8 )
>>> +                d |= Branch;
>>> +            break;
>>> +        case 0xf6: /* Grp3 TEST(s) have extra Imm8 */
>>> +            if ( reg == 0 || reg == 1 )
>>> +                d |= Imm8;
>>> +            break;
>>> +        case 0xf7: /* Grp3 TEST(s) have extra Imm */
>>> +            if ( reg == 0 || reg == 1 )
>>> +                d |= Imm;
>>> +            break;
>>> +        }
>> In this switch() you don't distinguish 1- and 2-byte maps at all.
> 
> See OPC_TWOBYTE.  They are distinguished here.

Oh, I've overlooked that, sorry.

>>> +    }
>>> +
>>> +    if ( d & Branch )
>>> +    {
>>> +        /*
>>> +         * We don't tolerate 66-prefixed call/jmp in alternatives.  Some are
>>> +         * genuinely decoded differently between Intel and AMD CPUs.
>>> +         *
>>> +         * We also don't support APX instructions, so don't have to cope with
>>> +         * JMPABS which is the first branch to have an 8-byte immediate.
>>> +         */
>>> +        if ( osize < 4 )
>>> +            goto bad_osize;
>>> +
>>> +        rel = ip;
>>> +        rel_sz = (d & Imm8) ? 1 : 4;
>>> +    }
>>> +
>>> +    if ( d & (Imm | Imm8 | Moffs) )
>>> +    {
>>> +        if ( d & Imm8 )
>>> +            osize = 1;
>>> +        else if ( d & Moffs )
>>> +            osize = 8;
>>> +        else if ( osize == 8 && !(opc >= 0xb8 && opc <= 0xbf) )
>> Again want to also take the opcode map into account, even if - by luck -
>> this would work as is for now.
>>
>> Also could I talk you into converting the two comparisons into one, along
>> the lines of "(opc | 7) != 0xbf"?
> 
> That's the kind of obfuscation which should be left to the compiler.

If only it actually did. I didn't check recently, but last I checked they
still didn't take opportunities like this.

>>> --- a/xen/arch/x86/x86_emulate/private.h
>>> +++ b/xen/arch/x86/x86_emulate/private.h
>>> @@ -9,7 +9,9 @@
>>>  #ifdef __XEN__
>>>  
>>>  # include <xen/bug.h>
>>> +# include <xen/init.h>
>>>  # include <xen/kernel.h>
>>> +# include <xen/livepatch.h>
>>>  # include <asm/endbr.h>
>>>  # include <asm/msr-index.h>
>>>  # include <asm/x86-vendors.h>
>> It's only the new file that needs these - can we limit the dependencies
>> to just that one by putting these new #include-s there?
> 
> Not if you want the userspace harness in patch 2 to compile.

This is inside #ifdef __XEN__, so how can the userspace harness'es build
matter? Of course the #include-s, once moved, would need to be inside a
similar #ifdef.

Jan