Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
code generation options slated to land in GCC 15, implement emulation
for unaligned LDx_L and STx_C operations, for the unlikely case where an
alignment violation has resulted from improperly written code and caused
these operations to trap in the atomic RMW memory access sequences made
to provide data consistency for non-BWX byte and word write operations
and for writes to unaligned data objects causing partial memory updates.
The principle of operation is as follows:
1. A trapping unaligned LDx_L operation results in the pair of adjacent
aligned whole data quantities spanned being read and stored for later
reference by a subsequent STx_C operation, along with the width of the
data accessed, its virtual address, and the referring task (or NULL if
the kernel).  The validity marker is set.
2. Regular memory load operations are used to retrieve the data, because
no atomicity is needed at this stage.  They match the width accessed:
either LDQ_U or LDL is used, the latter even though it requires extra
operations, so as to avoid the complication of handling an unaligned
longword located entirely within an aligned quadword.
3. Data is masked, shifted and merged appropriately and returned in the
intended register as the result of the trapping LDx_L instruction.
4. A trapping unaligned STx_C operation results in the validity marker
being checked for being true, and in the width of the data accessed,
the virtual address, and the referring task (or the kernel) being
checked for a match.  The pair of whole data quantities previously read
by LDx_L emulation is retrieved and the validity marker is cleared.
5. If the checks succeeded, then in an atomic loop the location of the
first whole data quantity is reread and the data retrieved is compared
with the value previously obtained.  If there's no match, then the loop
is aborted, 0 is returned in the intended register as the result of the
trapping STx_C instruction, and emulation completes.  Otherwise new
data obtained from the source operand of STx_C is combined with the
data retrieved, replacing by byte insertion the part intended, and an
atomic write of this new data is attempted.  If it fails, the loop
continues from the beginning.  Otherwise processing proceeds to the
next step (a plain-C sketch of this flow follows the list).
6. The same operations are performed on the second whole data quantity.
7. At this point both whole data quantities have been written, ensuring
that no third-party intervening write has changed them at the point
of the write from the values held at previous LDx_L. Therefore 1 is
returned in the intended register as the result of the trapping STx_C
instruction.
8. No user accesses are permitted in traps from kernel mode, as the
only LDx_L/STx_C accesses made to user memory locations by the kernel
are supposed to be those from handcrafted code, which has to be written
such that it does not trap.
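
A plain-C sketch of the longword flow in steps 1 and 4-7, for
illustration only: it mirrors the ll_va/ll_data/ll_quad/ll_bit state from
the patch, stands in GCC __atomic builtins for the actual LDL_L/STL_C
loops, assumes a single task (so the ll_task check is omitted) and leaves
out the quadword variant.  It is a user-space model, not the kernel code:

	#include <stdbool.h>
	#include <stdint.h>

	static struct {
		uintptr_t va;		/* unaligned virtual address accessed */
		uint32_t data[2];	/* the two aligned longwords spanned */
		bool quad;		/* quadword (LDQ_L/STQ_C) access? */
		bool valid;		/* the validity marker */
	} ll;

	/* Step 1: record the spanned pair and return the unaligned longword. */
	static uint32_t emulate_ldl_l(uintptr_t va)
	{
		uint32_t *base = (uint32_t *)(va & ~(uintptr_t)3);
		unsigned int sh = (va & 3) * 8;

		ll.va = va;
		ll.data[0] = base[0];
		ll.data[1] = base[1];
		ll.quad = false;
		ll.valid = true;
		return (base[0] >> sh) | (sh ? base[1] << (32 - sh) : 0);
	}

	/* Steps 4-7: merge VAL into both longwords, each in an atomic loop,
	   failing with 0 if either has changed since emulate_ldl_l().  */
	static int emulate_stl_c(uintptr_t va, uint32_t val)
	{
		uint32_t *base = (uint32_t *)(va & ~(uintptr_t)3);
		unsigned int sh = (va & 3) * 8;
		bool match = ll.valid && !ll.quad && ll.va == va;
		int i;

		ll.valid = false;
		/* Higher-addressed longword first, as in the patch.  */
		for (i = 1; match && i >= 0; i--) {
			uint32_t mask, part;

			if (i == 1) {
				if (!sh)
					continue;	/* aligned: nothing spills over */
				mask = ~0u >> (32 - sh);	/* low bytes */
				part = val >> (32 - sh);
			} else {
				mask = ~0u << sh;		/* high bytes */
				part = val << sh;
			}
			for (;;) {
				uint32_t old, new;

				old = __atomic_load_n(&base[i], __ATOMIC_RELAXED);
				if (old != ll.data[i])
					return 0;	/* intervening write: STL_C fails */
				new = (old & ~mask) | (part & mask);
				if (__atomic_compare_exchange_n(&base[i], &old, new,
								false, __ATOMIC_SEQ_CST,
								__ATOMIC_RELAXED))
					break;		/* this longword written */
			}
		}
		return match;
	}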
Since atomic loops are used for data updates, the approach works equally
well in both UP and SMP environments.  No data atomicity is guaranteed,
but data consistency is; that is, concurrent RMW accesses won't clobber
each other.  However, if the same data is concurrently written as already
there with a regular write between the emulated LDx_L and STx_C, then
STx_C will still succeed; likewise if data is modified, but then restored
before STx_C has had a chance to run.
This fulfils consistency requirements and guarantees that data outside
the quantity written has not changed between emulated LDx_L and STx_C.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
---
Hi,
This has cleared the pair of `-msafe-bwa -msafe-partial' regressions
observed in GCC verification (the third one was a Modula 2 frontend bug,
now fixed in the compiler). I have verified individual misalignments with
a small program by hand as well, for both the data retrieved by emulated
LDx_L and the data stored by emulated STx_C.
The kernel itself built with `-mcpu=ev4 -msafe-bwa -msafe-partial' boots
and has passed GCC verification, and triggered no extra unaligned traps.
Full verification was run with 6.3.0-rc5 and Ivan's stack alignment fixes
applied just because I was confident already that version works correctly.
Interestingly enough no kernel mode traps have triggered with a kernel
built with GCC 12 (and with most user traps coming from GCC verification):
kernel unaligned acc : 0 (pc=0,va=0)
user unaligned acc : 1766720 (pc=20000053064,va=120020189)
but with GCC 15 a small quantity happened (even before I ran GCC testing):
kernel unaligned acc : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
user unaligned acc : 883452 (pc=20000053064,va=120020189)
It seems a compiler regression worth checking -- the trap recorded was in
`icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr',
which however by its type (`struct in6_addr') is only longword-aligned and
indeed starts at offset 148 from the outermost struct. I have a sneaking
suspicion one of my earlier GCC changes might be at fault. At least I now
have a test case to experiment with.
I've also built and booted 6.9.0-rc3 as at commit 82c525bfafb4 ("alpha:
trim the unused stuff from asm-offsets.c"), the last one before support
for my system was axed. It has passed the verification with my small
program (available by request; I'm not sure if it's worth turning into a
kernel selftest).
NB I'm going to ignore the 72 errors checkpatch.pl issues for EXC usage.
The coding style of the new additions is consistent with the rest of the
file and any change to that would best be made separately (but I fail to
see the point).
Questions, comments, concerns? Otherwise please apply, and I'll proceed
with the rest of the GCC effort, followed by cleaning up the handwritten
assembly that uses STQ_U in our port and in glibc.
Maciej
---
arch/alpha/kernel/traps.c | 409 ++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 400 insertions(+), 9 deletions(-)
linux-alpha-llsc-unaligned.diff
Index: linux-macro/arch/alpha/kernel/traps.c
===================================================================
--- linux-macro.orig/arch/alpha/kernel/traps.c
+++ linux-macro/arch/alpha/kernel/traps.c
@@ -368,6 +368,13 @@ struct unaligned_stat {
unsigned long count, va, pc;
} unaligned[2];
+/* Unaligned LDx_L/STx_C emulation state. */
+static DEFINE_RAW_SPINLOCK(ll_lock);
+static struct task_struct *ll_task;
+static unsigned long ll_data[2];
+static unsigned long ll_va;
+static bool ll_quad;
+static bool ll_bit;
/* Macro for exception fixup code to access integer registers. */
#define una_reg(r) (_regs[(r) >= 16 && (r) <= 18 ? (r)+19 : (r)])
@@ -381,6 +388,9 @@ do_entUna(void * va, unsigned long opcod
unsigned long pc = regs->pc - 4;
unsigned long *_regs = regs->regs;
const struct exception_table_entry *fixup;
+ unsigned long flags;
+ unsigned long la;
+ bool ll_match;
unaligned[0].count++;
unaligned[0].va = (unsigned long) va;
@@ -439,6 +449,65 @@ do_entUna(void * va, unsigned long opcod
una_reg(reg) = tmp1|tmp2;
return;
+ case 0x2a: /* ldl_l */
+ la = (unsigned long)va;
+ if (la < TASK_SIZE)
+ break;
+ __asm__ __volatile__(
+ "1: ldl %3,0(%5)\n"
+ "2: ldl %4,4(%5)\n"
+ " srl %3,%6,%1\n"
+ " sll %4,%7,%2\n"
+ " zapnot %1,15,%1\n"
+ " zapnot %2,15,%2\n"
+ "3:\n"
+ EXC(1b,3b,%1,%0)
+ EXC(2b,3b,%2,%0)
+ : "=r"(error),
+ "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+ : "r"(la & ~3ul),
+ "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
+ if (error)
+ goto got_exception;
+ raw_spin_lock_irqsave(&ll_lock, flags);
+ ll_va = la;
+ ll_task = NULL;
+ ll_data[0] = tmp3;
+ ll_data[1] = tmp4;
+ ll_quad = false;
+ ll_bit = true;
+ raw_spin_unlock_irqrestore(&ll_lock, flags);
+ una_reg(reg) = (int)(tmp1|tmp2);
+ return;
+
+ case 0x2b: /* ldq_l */
+ la = (unsigned long)va;
+ if (la < TASK_SIZE)
+ break;
+ __asm__ __volatile__(
+ "1: ldq_u %3,0(%5)\n"
+ "2: ldq_u %4,7(%5)\n"
+ " extql %3,%5,%1\n"
+ " extqh %4,%5,%2\n"
+ "3:\n"
+ EXC(1b,3b,%1,%0)
+ EXC(2b,3b,%2,%0)
+ : "=r"(error),
+ "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+ : "r"(va), "0"(0));
+ if (error)
+ goto got_exception;
+ raw_spin_lock_irqsave(&ll_lock, flags);
+ ll_va = la;
+ ll_task = NULL;
+ ll_data[0] = tmp3;
+ ll_data[1] = tmp4;
+ ll_quad = true;
+ ll_bit = true;
+ raw_spin_unlock_irqrestore(&ll_lock, flags);
+ una_reg(reg) = tmp1|tmp2;
+ return;
+
/* Note that the store sequences do not indicate that they change
memory because it _should_ be affecting nothing in this context.
(Otherwise we have other, much larger, problems.) */
@@ -513,6 +582,134 @@ do_entUna(void * va, unsigned long opcod
if (error)
goto got_exception;
return;
+
+ case 0x2e: /* stl_c */
+ la = (unsigned long)va;
+ if (la < TASK_SIZE)
+ break;
+ raw_spin_lock_irqsave(&ll_lock, flags);
+ ll_match = ll_bit;
+ ll_match &= !ll_quad;
+ ll_match &= ll_task == NULL;
+ ll_match &= ll_va == la;
+ tmp3 = ll_data[0];
+ tmp4 = ll_data[1];
+ ll_bit = false;
+ raw_spin_unlock_irqrestore(&ll_lock, flags);
+ if (ll_match) {
+ __asm__ __volatile__(
+ " srl %6,%5,%3\n"
+ " zapnot %3,%8,%3\n"
+ "1: ldl_l %2,4(%4)\n"
+ " cmpeq %7,%2,%1\n"
+ " beq %1,4f\n"
+ " zap %2,%8,%2\n"
+ " or %2,%3,%1\n"
+ "2: stl_c %1,4(%4)\n"
+ " beq %1,3f\n"
+ " .subsection 2\n"
+ "3: br 1b\n"
+ " .previous\n"
+ "4:\n"
+ EXC(1b,4b,%2,%0)
+ EXC(2b,4b,%1,%0)
+ : "=r"(error), "=&r"(ll_match),
+ "=&r"(tmp1), "=&r"(tmp2)
+ : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
+ "r"(una_reg(reg)), "r"(tmp4),
+ "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
+ if (error)
+ goto got_exception;
+ }
+ if (ll_match) {
+ __asm__ __volatile__(
+ " sll %6,%5,%3\n"
+ " zapnot %3,%8,%3\n"
+ "1: ldl_l %2,0(%4)\n"
+ " cmpeq %7,%2,%1\n"
+ " beq %1,4f\n"
+ " zap %2,%8,%2\n"
+ " or %2,%3,%1\n"
+ "2: stl_c %1,0(%4)\n"
+ " beq %1,3f\n"
+ " .subsection 2\n"
+ "3: br 1b\n"
+ " .previous\n"
+ "4:\n"
+ EXC(1b,4b,%2,%0)
+ EXC(2b,4b,%1,%0)
+ : "=r"(error), "=&r"(ll_match),
+ "=&r"(tmp1), "=&r"(tmp2)
+ : "r"(la & ~3ul), "r"((la & 3) * 8),
+ "r"(una_reg(reg)), "r"(tmp3),
+ "r"((15 << (la & 3)) & 0xf), "0"(0));
+ if (error)
+ goto got_exception;
+ }
+ una_reg(reg) = ll_match;
+ return;
+
+ case 0x2f: /* stq_c */
+ la = (unsigned long)va;
+ if (la < TASK_SIZE)
+ break;
+ raw_spin_lock_irqsave(&ll_lock, flags);
+ ll_match = ll_bit;
+ ll_match &= ll_quad;
+ ll_match &= ll_task == NULL;
+ ll_match &= ll_va == la;
+ tmp3 = ll_data[0];
+ tmp4 = ll_data[1];
+ ll_bit = false;
+ raw_spin_unlock_irqrestore(&ll_lock, flags);
+ if (ll_match) {
+ __asm__ __volatile__(
+ " insqh %6,%4,%3\n"
+ "1: ldq_l %2,8(%5)\n"
+ " cmpeq %7,%2,%1\n"
+ " beq %1,4f\n"
+ " mskqh %2,%4,%2\n"
+ " or %2,%3,%1\n"
+ "2: stq_c %1,8(%5)\n"
+ " beq %1,3f\n"
+ " .subsection 2\n"
+ "3: br 1b\n"
+ " .previous\n"
+ "4:\n"
+ EXC(1b,4b,%2,%0)
+ EXC(2b,4b,%1,%0)
+ : "=r"(error), "=&r"(ll_match),
+ "=&r"(tmp1), "=&r"(tmp2)
+ : "r"(va), "r"(la & ~7ul),
+ "r"(una_reg(reg)), "r"(tmp4), "0"(0));
+ if (error)
+ goto got_exception;
+ }
+ if (ll_match) {
+ __asm__ __volatile__(
+ " insql %6,%4,%3\n"
+ "1: ldq_l %2,0(%5)\n"
+ " cmpeq %7,%2,%1\n"
+ " beq %1,4f\n"
+ " mskql %2,%4,%2\n"
+ " or %2,%3,%1\n"
+ "2: stq_c %1,0(%5)\n"
+ " beq %1,3f\n"
+ " .subsection 2\n"
+ "3: br 1b\n"
+ " .previous\n"
+ "4:\n"
+ EXC(1b,4b,%2,%0)
+ EXC(2b,4b,%1,%0)
+ : "=r"(error), "=&r"(ll_match),
+ "=&r"(tmp1), "=&r"(tmp2)
+ : "r"(va), "r"(la & ~7ul),
+ "r"(una_reg(reg)), "r"(tmp3), "0"(0));
+ if (error)
+ goto got_exception;
+ }
+ una_reg(reg) = ll_match;
+ return;
}
printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
@@ -624,24 +821,33 @@ s_reg_to_mem (unsigned long s_reg)
* so finding the appropriate registers is a little more difficult
* than in the kernel case.
*
- * Finally, we handle regular integer load/stores only. In
- * particular, load-linked/store-conditionally and floating point
- * load/stores are not supported. The former make no sense with
- * unaligned faults (they are guaranteed to fail) and I don't think
- * the latter will occur in any decent program.
+ * We have three classes of operations to handle:
*
- * Sigh. We *do* have to handle some FP operations, because GCC will
- * uses them as temporary storage for integer memory to memory copies.
- * However, we need to deal with stt/ldt and sts/lds only.
+ * - We handle regular integer load/stores transparently to faulting
+ * code, preserving the semantics of the triggering instruction.
+ *
+ * - We handle some FP operations as well, because GCC will use them as
+ * temporary storage for integer memory to memory copies. However,
+ * we need to deal with stt/ldt and sts/lds only.
+ *
+ * - We handle load-locked/store-conditional operations by maintaining
+ * data consistency only, within the two adjacent longwords or
+ * quadwords partially spanned. This is sufficient to guarantee an
+ * unaligned RMW sequence using these operations won't clobber data
+ * *outside* the location intended but does *not* guarantee atomicity
+ * for the data quantity itself.
*/
#define OP_INT_MASK ( 1L << 0x28 | 1L << 0x2c /* ldl stl */ \
+ | 1L << 0x2a | 1L << 0x2e /* ldl_l stl_c */ \
| 1L << 0x29 | 1L << 0x2d /* ldq stq */ \
+ | 1L << 0x2b | 1L << 0x2f /* ldq_l stq_c */ \
| 1L << 0x0c | 1L << 0x0d /* ldwu stw */ \
| 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
#define OP_WRITE_MASK ( 1L << 0x26 | 1L << 0x27 /* sts stt */ \
| 1L << 0x2c | 1L << 0x2d /* stl stq */ \
+ | 1L << 0x2e | 1L << 0x2d /* stl_c stq_c */ \
| 1L << 0x0d | 1L << 0x0e ) /* stw stb */
#define R(x) ((size_t) &((struct pt_regs *)0)->x)
@@ -666,6 +872,9 @@ do_entUnaUser(void __user * va, unsigned
unsigned long tmp1, tmp2, tmp3, tmp4;
unsigned long fake_reg, *reg_addr = &fake_reg;
+ unsigned long flags;
+ unsigned long la;
+ bool ll_match;
int si_code;
long error;
@@ -794,6 +1003,61 @@ do_entUnaUser(void __user * va, unsigned
*reg_addr = tmp1|tmp2;
break;
+ case 0x2a: /* ldl_l */
+ la = (unsigned long)va;
+ __asm__ __volatile__(
+ "1: ldl %3,0(%5)\n"
+ "2: ldl %4,4(%5)\n"
+ " srl %3,%6,%1\n"
+ " sll %4,%7,%2\n"
+ " zapnot %1,15,%1\n"
+ " zapnot %2,15,%2\n"
+ "3:\n"
+ EXC(1b,3b,%1,%0)
+ EXC(2b,3b,%2,%0)
+ : "=r"(error),
+ "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+ : "r"(la & ~3ul),
+ "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
+ if (error)
+ goto give_sigsegv;
+ raw_spin_lock_irqsave(&ll_lock, flags);
+ ll_va = la;
+ ll_task = current;
+ ll_data[0] = tmp3;
+ ll_data[1] = tmp4;
+ ll_quad = false;
+ ll_bit = true;
+ raw_spin_unlock_irqrestore(&ll_lock, flags);
+ *reg_addr = (int)(tmp1|tmp2);
+ break;
+
+ case 0x2b: /* ldq_l */
+ la = (unsigned long)va;
+ __asm__ __volatile__(
+ "1: ldq_u %3,0(%5)\n"
+ "2: ldq_u %4,7(%5)\n"
+ " extql %3,%5,%1\n"
+ " extqh %4,%5,%2\n"
+ "3:\n"
+ EXC(1b,3b,%1,%0)
+ EXC(2b,3b,%2,%0)
+ : "=r"(error),
+ "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+ : "r"(va), "0"(0));
+ if (error)
+ goto give_sigsegv;
+ raw_spin_lock_irqsave(&ll_lock, flags);
+ ll_va = la;
+ ll_task = current;
+ ll_data[0] = tmp3;
+ ll_data[1] = tmp4;
+ ll_quad = true;
+ ll_bit = true;
+ raw_spin_unlock_irqrestore(&ll_lock, flags);
+ *reg_addr = tmp1|tmp2;
+ break;
+
/* Note that the store sequences do not indicate that they change
memory because it _should_ be affecting nothing in this context.
(Otherwise we have other, much larger, problems.) */
@@ -877,12 +1141,139 @@ do_entUnaUser(void __user * va, unsigned
goto give_sigsegv;
return;
+ case 0x2e: /* stl_c */
+ la = (unsigned long)va;
+ raw_spin_lock_irqsave(&ll_lock, flags);
+ ll_match = ll_bit;
+ ll_match &= !ll_quad;
+ ll_match &= ll_task == current;
+ ll_match &= ll_va == la;
+ tmp3 = ll_data[0];
+ tmp4 = ll_data[1];
+ ll_bit = false;
+ raw_spin_unlock_irqrestore(&ll_lock, flags);
+ if (ll_match) {
+ __asm__ __volatile__(
+ " srl %6,%5,%3\n"
+ " zapnot %3,%8,%3\n"
+ "1: ldl_l %2,4(%4)\n"
+ " cmpeq %7,%2,%1\n"
+ " beq %1,4f\n"
+ " zap %2,%8,%2\n"
+ " or %2,%3,%1\n"
+ "2: stl_c %1,4(%4)\n"
+ " beq %1,3f\n"
+ " .subsection 2\n"
+ "3: br 1b\n"
+ " .previous\n"
+ "4:\n"
+ EXC(1b,4b,%2,%0)
+ EXC(2b,4b,%1,%0)
+ : "=r"(error), "=&r"(ll_match),
+ "=&r"(tmp1), "=&r"(tmp2)
+ : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
+ "r"(*reg_addr), "r"(tmp4),
+ "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
+ if (error)
+ goto give_sigsegv;
+ }
+ if (ll_match) {
+ __asm__ __volatile__(
+ " sll %6,%5,%3\n"
+ " zapnot %3,%8,%3\n"
+ "1: ldl_l %2,0(%4)\n"
+ " cmpeq %7,%2,%1\n"
+ " beq %1,4f\n"
+ " zap %2,%8,%2\n"
+ " or %2,%3,%1\n"
+ "2: stl_c %1,0(%4)\n"
+ " beq %1,3f\n"
+ " .subsection 2\n"
+ "3: br 1b\n"
+ " .previous\n"
+ "4:\n"
+ EXC(1b,4b,%2,%0)
+ EXC(2b,4b,%1,%0)
+ : "=r"(error), "=&r"(ll_match),
+ "=&r"(tmp1), "=&r"(tmp2)
+ : "r"(la & ~3ul), "r"((la & 3) * 8),
+ "r"(*reg_addr), "r"(tmp3),
+ "r"((15 << (la & 3)) & 0xf), "0"(0));
+ if (error)
+ goto give_sigsegv;
+ }
+ *reg_addr = ll_match;
+ break;
+
+ case 0x2f: /* stq_c */
+ la = (unsigned long)va;
+ raw_spin_lock_irqsave(&ll_lock, flags);
+ ll_match = ll_bit;
+ ll_match &= ll_quad;
+ ll_match &= ll_task == current;
+ ll_match &= ll_va == la;
+ tmp3 = ll_data[0];
+ tmp4 = ll_data[1];
+ ll_bit = false;
+ raw_spin_unlock_irqrestore(&ll_lock, flags);
+ if (ll_match) {
+ __asm__ __volatile__(
+ " insqh %6,%4,%3\n"
+ "1: ldq_l %2,8(%5)\n"
+ " cmpeq %7,%2,%1\n"
+ " beq %1,4f\n"
+ " mskqh %2,%4,%2\n"
+ " or %2,%3,%1\n"
+ "2: stq_c %1,8(%5)\n"
+ " beq %1,3f\n"
+ " .subsection 2\n"
+ "3: br 1b\n"
+ " .previous\n"
+ "4:\n"
+ EXC(1b,4b,%2,%0)
+ EXC(2b,4b,%1,%0)
+ : "=r"(error), "=&r"(ll_match),
+ "=&r"(tmp1), "=&r"(tmp2)
+ : "r"(va), "r"(la & ~7ul),
+ "r"(*reg_addr), "r"(tmp4), "0"(0));
+ if (error)
+ goto give_sigsegv;
+ }
+ if (ll_match) {
+ __asm__ __volatile__(
+ " insql %6,%4,%3\n"
+ "1: ldq_l %2,0(%5)\n"
+ " cmpeq %7,%2,%1\n"
+ " beq %1,4f\n"
+ " mskql %2,%4,%2\n"
+ " or %2,%3,%1\n"
+ "2: stq_c %1,0(%5)\n"
+ " beq %1,3f\n"
+ " .subsection 2\n"
+ "3: br 1b\n"
+ " .previous\n"
+ "4:\n"
+ EXC(1b,4b,%2,%0)
+ EXC(2b,4b,%1,%0)
+ : "=r"(error), "=&r"(ll_match),
+ "=&r"(tmp1), "=&r"(tmp2)
+ : "r"(va), "r"(la & ~7ul),
+ "r"(*reg_addr), "r"(tmp3), "0"(0));
+ if (error)
+ goto give_sigsegv;
+ }
+ *reg_addr = ll_match;
+ break;
+
default:
/* What instruction were you trying to use, exactly? */
goto give_sigbus;
}
- /* Only integer loads should get here; everyone else returns early. */
+ /*
+ * Only integer loads and stores conditional should get here;
+ * everyone else returns early.
+ */
if (reg == 30)
wrusp(fake_reg);
return;
On Wed, 19 Feb 2025, Maciej W. Rozycki wrote:
> Interestingly enough no kernel mode traps have triggered with a kernel
> built with GCC 12 (and with most user traps coming from GCC verification):
>
> kernel unaligned acc : 0 (pc=0,va=0)
> user unaligned acc : 1766720 (pc=20000053064,va=120020189)
>
> but with GCC 15 a small quantity happened (even before I ran GCC testing):
>
> kernel unaligned acc : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
> user unaligned acc : 883452 (pc=20000053064,va=120020189)
>
> It seems a compiler regression worth checking -- the trap recorded was in
> `icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr',
> which however by its type (`struct in6_addr') is only longword-aligned and
> indeed starts at offset 148 from the outermost struct. I have a sneaking
> suspicion one of my earlier GCC changes might be at fault. At least I now
> have a test case to experiment with.
FYI my suspicion wasn't wrong, I have submitted a compiler fix now[1].
My plan has been to complete the GCC side first as it's more urgent given
its annual-only release cycle model targeting April/May, whereas I think
the Linux side can slip a release or two in our roughly bi-monthly cycle.
I'm going to schedule my time accordingly and with my upcoming holiday also
in the picture I may not be able to post v2 of this proposal until around
the end of March at the earliest.
References:
[1] "Alpha: Fix base block alignment calculation regression",
<https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2502251934260.65342@angie.orcam.me.uk/>
Maciej
On 2/19/25 04:46, Maciej W. Rozycki wrote:
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15,

Pointer?  I can't find it on the gcc-patches list.

> implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
[...]
> 7. At this point both whole data quantities have been written, ensuring
>    that no third-party intervening write has changed them at the point
>    of the write from the values held at previous LDx_L.  Therefore 1 is
>    returned in the intended register as the result of the trapping STx_C
>    instruction.

I think general-purpose non-atomic emulation of STx_C is a really bad idea.

Without looking at your gcc patches, I can guess what you're after: you've
generated a ll/sc sequence for (aligned) short, and want to emulate if it
happens to be unaligned.

Crucially, when emulating non-aligned, you should not strive to make it
atomic.  No other architecture promises atomic non-aligned stores, so why
should you do that here?

I suggest some sort of magic code sequence,

	bic	addr_in, 6, addr_al
loop:
	ldq_l	t0, 0(addr_al)
	magic-nop  done - loop
	inswl	data, addr_in, t1
	mskwl	t0, addr_in, t0
	bis	t0, t1, t0
	stq_c	t0, 0(addr_al)
	beq	t0, loop
done:

With the trap, match the magic-nop, pick out the input registers from the
following inswl, perform the two (atomic!) byte stores to accomplish the
emulation, adjust the pc forward to the done label.

Choose anything you like for the magic-nop.  The (done - loop) displacement
is small, so any 8-bit immediate would suffice.  E.g. "eqv $31, disp, $31".
You might require the displacement to be constant and not actually extract
"disp"; just match the entire uint32_t instruction pattern.

r~
On Thu, 20 Feb 2025, Richard Henderson wrote:

> > Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> > code generation options slated to land in GCC 15,
>
> Pointer?  I can't find it on the gcc-patches list.

Here:

<https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2501050246590.49841@angie.orcam.me.uk/>

and hopefully in your inbox/archive somewhere as well.

> > 7. At this point both whole data quantities have been written, ensuring
> >    that no third-party intervening write has changed them at the point
> >    of the write from the values held at previous LDx_L.  Therefore 1 is
> >    returned in the intended register as the result of the trapping STx_C
> >    instruction.
>
> I think general-purpose non-atomic emulation of STx_C is a really bad idea.
>
> Without looking at your gcc patches, I can guess what you're after: you've
> generated a ll/sc sequence for (aligned) short, and want to emulate if it
> happens to be unaligned.

It's a corner case, yes, when the compiler was told the access would be
aligned, but it turns out not.  It's where you cast a (char *) pointer to
(short *) that wasn't suitably aligned for such a cast and dereference it
(and the quadword case is similar, for the ends of misaligned inline
`memcpy'/`memset').  Only two cases (plus a bug in the GM2 frontend)
hitting this throughout the GCC testsuite show the rarity of this case.

> Crucially, when emulating non-aligned, you should not strive to make it
> atomic.  No other architecture promises atomic non-aligned stores, so why
> should you do that here?

This code doesn't strive to be atomic, but to preserve data *outside* the
quantity accessed from being clobbered, and for this purpose an atomic
sequence is both inevitable and sufficient, for both partial quantities
around the unaligned quantity written.  The trapping code does not expect
atomicity for the unaligned quantity itself -- it is handled in pieces
just as say with MIPS SWL/SWR masked store instruction pairs -- and this
code, effectively an Alpha/Linux psABI extension, does not guarantee it
either.

> I suggest some sort of magic code sequence,
[...]
> With the trap, match the magic-nop, pick out the input registers from the
> following inswl, perform the two (atomic!) byte stores to accomplish the
> emulation, adjust the pc forward to the done label.

It seems to make no sense to me to penalise all user code for the corner
case mentioned above while still having the emulation in the kernel, given
that 99.999...% of accesses will have been correctly aligned by GCC.

And it gets even more complex when you have an awkward number of bytes to
mask, such as 3, 5, 6, 7, which will happen for example if inline `memcpy'
is expanded by GCC for a quadword-aligned block of 31 bytes, in which case
other instructions will be used for masking/insertion for the trailing 7
bytes, and the block turns out misaligned at run time.

I'm unconvinced, it seems a lot of hassle for little gain to me.

  Maciej
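
The kind of improperly written code referred to above boils down to a cast
like the following, purely illustrative, snippet; whether a given build
actually takes the unaligned LDx_L/STx_C trap depends on the alignment GCC
assumed when expanding the store:

	#include <stdint.h>

	/* Undefined behaviour if P is not 2-byte aligned; with
	   `-msafe-bwa' on a non-BWX target the store below is done
	   with an LDx_L/STx_C RMW sequence, which may then trap and
	   be emulated as described above.  */
	void put16(char *p, uint16_t v)
	{
		*(uint16_t *)p = v;
	}

Called with an odd pointer, say put16(buf + 1, 0x1234) on a byte buffer,
this is the misaligned dereference the emulation exists to catch.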
On Thu, Feb 20, 2025 at 12:54 PM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 2/19/25 04:46, Maciej W. Rozycki wrote:
> > Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> > code generation options slated to land in GCC 15,
>
> Pointer?  I can't find it on the gcc-patches list.

I believe it's this:

https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2411141652300.9262@angie.orcam.me.uk/

The first half or so have landed so far, right Maciej?
On Thu, 20 Feb 2025, Matt Turner wrote:

> > > Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> > > code generation options slated to land in GCC 15,
> >
> > Pointer?  I can't find it on the gcc-patches list.
>
> I believe it's this:
>
> https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2411141652300.9262@angie.orcam.me.uk/

That's the original series; v2 is here:

https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2501050246590.49841@angie.orcam.me.uk/

> The first half or so have landed so far, right Maciej?

Yes, fixes for bugs discovered in the course have been committed and the
rest of the changes already approved, although there were a couple of
comments I yet want to investigate/address by the end of next week.
Getting clean GCC test results with the kernel emulation was my objective
before moving forward.

  Maciej
On Thu, 20 Feb 2025 at 09:54, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Crucially, when emulating non-aligned, you should not strive to make it atomic. No other
> architecture promises atomic non-aligned stores, so why should you do that here?
I'm not disagreeing with the "it doesn't necessarily have to be
atomic", but I will point out that x86 does indeed promise atomic
non-aligned accesses.
It will actually lock both cachelines when straddling a cacheline.
It's slow, it's horrendous, and people are trying to get away from it
(google "split lock"), but it is actually architecturally supported.
Linus
On 2/20/25 09:59, Linus Torvalds wrote:
> On Thu, 20 Feb 2025 at 09:54, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Crucially, when emulating non-aligned, you should not strive to make it atomic. No other
>> architecture promises atomic non-aligned stores, so why should you do that here?
>
> I'm not disagreeing with the "it doesn't necessarily have to be
> atomic", but I will point out that x86 does indeed promise atomic
> non-aligned accesses.

I should have been more expansive with that statement: I didn't mean "no
unaligned atomics" (e.g. lock orw), but "unaligned normal stores may be
non-atomic" (e.g. movw across a cacheline).

My guess about the gcc patches is that it's the latter that wanted
emulation here.

r~
On Wed, Feb 19, 2025 at 7:46 AM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15, implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
>
> The principle of operation is as follows:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
> aligned whole data quantities spanned being read and stored for the
> reference with a subsequent STx_C operation, along with the width of
> the data accessed and its virtual address, and the task referring or
> NULL if the kernel. The valitidy marker is set.
Typo: validity (occurs twice more in the commit message below)

[...]

> #define OP_INT_MASK ( 1L << 0x28 | 1L << 0x2c /* ldl stl */ \
> + | 1L << 0x2a | 1L << 0x2e /* ldl_l stl_c */ \
> | 1L << 0x29 | 1L << 0x2d /* ldq stq */ \
> + | 1L << 0x2b | 1L << 0x2f /* ldq_l stq_c */ \
> | 1L << 0x0c | 1L << 0x0d /* ldwu stw */ \
> | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
>
> #define OP_WRITE_MASK ( 1L << 0x26 | 1L << 0x27 /* sts stt */ \
> | 1L << 0x2c | 1L << 0x2d /* stl stq */ \
> + | 1L << 0x2e | 1L << 0x2d /* stl_c stq_c */ \
stq_c should be 0x2f, not 0x2d. Looks like a copy-n-paste mistake.
On Thu, Feb 20, 2025 at 11:46 AM Matt Turner <mattst88@gmail.com> wrote:
>
> On Wed, Feb 19, 2025 at 7:46 AM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
> > #define OP_INT_MASK ( 1L << 0x28 | 1L << 0x2c /* ldl stl */ \
> > + | 1L << 0x2a | 1L << 0x2e /* ldl_l stl_c */ \
> > | 1L << 0x29 | 1L << 0x2d /* ldq stq */ \
> > + | 1L << 0x2b | 1L << 0x2f /* ldq_l stq_c */ \
> > | 1L << 0x0c | 1L << 0x0d /* ldwu stw */ \
> > | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
> >
> > #define OP_WRITE_MASK ( 1L << 0x26 | 1L << 0x27 /* sts stt */ \
> > | 1L << 0x2c | 1L << 0x2d /* stl stq */ \
> > + | 1L << 0x2e | 1L << 0x2d /* stl_c stq_c */ \
>
> stq_c should be 0x2f, not 0x2d. Looks like a copy-n-paste mistake.

The good news is that OP_WRITE_MASK appears to be unused going all the
way back to the import into git, so this doesn't indicate a problem
with any of the testing that's been done.
On Thu, 20 Feb 2025, Matt Turner wrote:

> > On Wed, Feb 19, 2025 at 7:46 AM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
> > > #define OP_INT_MASK ( 1L << 0x28 | 1L << 0x2c /* ldl stl */ \
> > > + | 1L << 0x2a | 1L << 0x2e /* ldl_l stl_c */ \
> > > | 1L << 0x29 | 1L << 0x2d /* ldq stq */ \
> > > + | 1L << 0x2b | 1L << 0x2f /* ldq_l stq_c */ \
> > > | 1L << 0x0c | 1L << 0x0d /* ldwu stw */ \
> > > | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
> > >
> > > #define OP_WRITE_MASK ( 1L << 0x26 | 1L << 0x27 /* sts stt */ \
> > > | 1L << 0x2c | 1L << 0x2d /* stl stq */ \
> > > + | 1L << 0x2e | 1L << 0x2d /* stl_c stq_c */ \
> >
> > stq_c should be 0x2f, not 0x2d. Looks like a copy-n-paste mistake.
>
> The good news is that OP_WRITE_MASK appears to be unused going all the
> way back to the import into git, so this doesn't indicate a problem
> with any of the testing that's been done.

Good catch, thank you, and I guess the lack of use is why things haven't
broken.  I'll make a preparatory change in v2 and remove this macro then.
FWIW it came with 2.1.36, already unused, so presumably a leftover from a
WIP version.

I'll fix the typo in the description as well, thank you.

  Maciej
I've tested this patch on an Alphaserver ES40 (running 6.14.0-rc3),
using the test-suit provided by Maciej and can verify that it works.
Tested-by: Magnus Lindholm <linmag7@gmail.com>
On Wed, Feb 19, 2025 at 1:46 PM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
[...]
> | 1L << 0x2c | 1L << 0x2d /* stl stq */ \
> + | 1L << 0x2e | 1L << 0x2f /* stl_c stq_c */ \
> | 1L << 0x0d | 1L << 0x0e ) /* stw stb */
>
> #define R(x) ((size_t) &((struct pt_regs *)0)->x)
> @@ -666,6 +872,9 @@ do_entUnaUser(void __user * va, unsigned
>
> unsigned long tmp1, tmp2, tmp3, tmp4;
> unsigned long fake_reg, *reg_addr = &fake_reg;
> + unsigned long flags;
> + unsigned long la;
> + bool ll_match;
> int si_code;
> long error;
>
> @@ -794,6 +1003,61 @@ do_entUnaUser(void __user * va, unsigned
> *reg_addr = tmp1|tmp2;
> break;
>
> + case 0x2a: /* ldl_l */
> + la = (unsigned long)va;
> + __asm__ __volatile__(
> + "1: ldl %3,0(%5)\n"
> + "2: ldl %4,4(%5)\n"
> + " srl %3,%6,%1\n"
> + " sll %4,%7,%2\n"
> + " zapnot %1,15,%1\n"
> + " zapnot %2,15,%2\n"
> + "3:\n"
> + EXC(1b,3b,%1,%0)
> + EXC(2b,3b,%2,%0)
> + : "=r"(error),
> + "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> + : "r"(la & ~3ul),
> + "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> + if (error)
> + goto give_sigsegv;
> + raw_spin_lock_irqsave(&ll_lock, flags);
> + ll_va = la;
> + ll_task = current;
> + ll_data[0] = tmp3;
> + ll_data[1] = tmp4;
> + ll_quad = false;
> + ll_bit = true;
> + raw_spin_unlock_irqrestore(&ll_lock, flags);
> + *reg_addr = (int)(tmp1|tmp2);
> + break;
> +
> + case 0x2b: /* ldq_l */
> + la = (unsigned long)va;
> + __asm__ __volatile__(
> + "1: ldq_u %3,0(%5)\n"
> + "2: ldq_u %4,7(%5)\n"
> + " extql %3,%5,%1\n"
> + " extqh %4,%5,%2\n"
> + "3:\n"
> + EXC(1b,3b,%1,%0)
> + EXC(2b,3b,%2,%0)
> + : "=r"(error),
> + "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> + : "r"(va), "0"(0));
> + if (error)
> + goto give_sigsegv;
> + raw_spin_lock_irqsave(&ll_lock, flags);
> + ll_va = la;
> + ll_task = current;
> + ll_data[0] = tmp3;
> + ll_data[1] = tmp4;
> + ll_quad = true;
> + ll_bit = true;
> + raw_spin_unlock_irqrestore(&ll_lock, flags);
> + *reg_addr = tmp1|tmp2;
> + break;
> +
> /* Note that the store sequences do not indicate that they change
> memory because it _should_ be affecting nothing in this context.
> (Otherwise we have other, much larger, problems.) */
> @@ -877,12 +1141,139 @@ do_entUnaUser(void __user * va, unsigned
> goto give_sigsegv;
> return;
>
> + case 0x2e: /* stl_c */
> + la = (unsigned long)va;
> + raw_spin_lock_irqsave(&ll_lock, flags);
> + ll_match = ll_bit;
> + ll_match &= !ll_quad;
> + ll_match &= ll_task == current;
> + ll_match &= ll_va == la;
> + tmp3 = ll_data[0];
> + tmp4 = ll_data[1];
> + ll_bit = false;
> + raw_spin_unlock_irqrestore(&ll_lock, flags);
> + if (ll_match) {
> + __asm__ __volatile__(
> + " srl %6,%5,%3\n"
> + " zapnot %3,%8,%3\n"
> + "1: ldl_l %2,4(%4)\n"
> + " cmpeq %7,%2,%1\n"
> + " beq %1,4f\n"
> + " zap %2,%8,%2\n"
> + " or %2,%3,%1\n"
> + "2: stl_c %1,4(%4)\n"
> + " beq %1,3f\n"
> + " .subsection 2\n"
> + "3: br 1b\n"
> + " .previous\n"
> + "4:\n"
> + EXC(1b,4b,%2,%0)
> + EXC(2b,4b,%1,%0)
> + : "=r"(error), "=&r"(ll_match),
> + "=&r"(tmp1), "=&r"(tmp2)
> + : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> + "r"(*reg_addr), "r"(tmp4),
> + "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> + if (error)
> + goto give_sigsegv;
> + }
> + if (ll_match) {
> + __asm__ __volatile__(
> + " sll %6,%5,%3\n"
> + " zapnot %3,%8,%3\n"
> + "1: ldl_l %2,0(%4)\n"
> + " cmpeq %7,%2,%1\n"
> + " beq %1,4f\n"
> + " zap %2,%8,%2\n"
> + " or %2,%3,%1\n"
> + "2: stl_c %1,0(%4)\n"
> + " beq %1,3f\n"
> + " .subsection 2\n"
> + "3: br 1b\n"
> + " .previous\n"
> + "4:\n"
> + EXC(1b,4b,%2,%0)
> + EXC(2b,4b,%1,%0)
> + : "=r"(error), "=&r"(ll_match),
> + "=&r"(tmp1), "=&r"(tmp2)
> + : "r"(la & ~3ul), "r"((la & 3) * 8),
> + "r"(*reg_addr), "r"(tmp3),
> + "r"((15 << (la & 3)) & 0xf), "0"(0));
> + if (error)
> + goto give_sigsegv;
> + }
> + *reg_addr = ll_match;
> + break;
> +
> + case 0x2f: /* stq_c */
> + la = (unsigned long)va;
> + raw_spin_lock_irqsave(&ll_lock, flags);
> + ll_match = ll_bit;
> + ll_match &= ll_quad;
> + ll_match &= ll_task == current;
> + ll_match &= ll_va == la;
> + tmp3 = ll_data[0];
> + tmp4 = ll_data[1];
> + ll_bit = false;
> + raw_spin_unlock_irqrestore(&ll_lock, flags);
> + if (ll_match) {
> + __asm__ __volatile__(
> + " insqh %6,%4,%3\n"
> + "1: ldq_l %2,8(%5)\n"
> + " cmpeq %7,%2,%1\n"
> + " beq %1,4f\n"
> + " mskqh %2,%4,%2\n"
> + " or %2,%3,%1\n"
> + "2: stq_c %1,8(%5)\n"
> + " beq %1,3f\n"
> + " .subsection 2\n"
> + "3: br 1b\n"
> + " .previous\n"
> + "4:\n"
> + EXC(1b,4b,%2,%0)
> + EXC(2b,4b,%1,%0)
> + : "=r"(error), "=&r"(ll_match),
> + "=&r"(tmp1), "=&r"(tmp2)
> + : "r"(va), "r"(la & ~7ul),
> + "r"(*reg_addr), "r"(tmp4), "0"(0));
> + if (error)
> + goto give_sigsegv;
> + }
> + if (ll_match) {
> + __asm__ __volatile__(
> + " insql %6,%4,%3\n"
> + "1: ldq_l %2,0(%5)\n"
> + " cmpeq %7,%2,%1\n"
> + " beq %1,4f\n"
> + " mskql %2,%4,%2\n"
> + " or %2,%3,%1\n"
> + "2: stq_c %1,0(%5)\n"
> + " beq %1,3f\n"
> + " .subsection 2\n"
> + "3: br 1b\n"
> + " .previous\n"
> + "4:\n"
> + EXC(1b,4b,%2,%0)
> + EXC(2b,4b,%1,%0)
> + : "=r"(error), "=&r"(ll_match),
> + "=&r"(tmp1), "=&r"(tmp2)
> + : "r"(va), "r"(la & ~7ul),
> + "r"(*reg_addr), "r"(tmp3), "0"(0));
> + if (error)
> + goto give_sigsegv;
> + }
> + *reg_addr = ll_match;
> + break;
> +
> default:
> /* What instruction were you trying to use, exactly? */
> goto give_sigbus;
> }
>
> - /* Only integer loads should get here; everyone else returns early. */
> + /*
> + * Only integer loads and stores conditional should get here;
> + * everyone else returns early.
> + */
> if (reg == 30)
> wrusp(fake_reg);
> return;
>
On Wed, 19 Feb 2025 at 04:46, Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
> aligned whole data quantities spanned being read and stored for the
> reference with a subsequent STx_C operation, along with the width of
> the data accessed and its virtual address, and the task referring or
> NULL if the kernel. The validity marker is set.
So I have a couple of comments. I don't care deeply because I do think
alpha is dead, but for completeness:
(a) I don't think the task checking is correct.
You're only checking the task pointer, so what can happen is that a
task exits and another starts up with the same task pointer value, and
it all matches across one task doing a ld_l and another doing a st_c.
Does this matter? No. You'd literally have to *try* to create that
situation with identical mis-aligned addresses and data contents, and
an exit after a 'ld_l', and doing a 'st_c' in the new task without the
matching ld_l.
So I suspect this works in practice, but it's still worth mentioning.
(b) this is not truly atomic wrt concurrent aligned non-trapping
operations to the same words. Or in fact to current trapping ones,
since you end up inevitably releasing the spinlock before the final
stc emulation.
I think this is fundamental and non-fixable, because the stc is done
as two operations, and the first can succeed with the second failing
(or they can both succeed, just interleaved with other accesses).
Again, I don't think we care, and it works in practice, but it does
mean that I *really* think that:
(c) you should not handle the kernel case at all.
If the kernel does an unaligned ld_l/st_c, that's a fundamental kernel
bug. Don't emulate it. Particularly when the emulation fundamentally
is not truly atomic wrt other accesses.
Finally:
(d) I think you're doing a few too many inline asms by hand, and
you're masking the results too much.
On the read-side emulation, why do you do that
+ "1: ldl %3,0(%5)\n"
+ "2: ldl %4,4(%5)\n"
+ " srl %3,%6,%1\n"
+ " sll %4,%7,%2\n"
+ " zapnot %1,15,%1\n"
+ " zapnot %2,15,%2\n"
at all? Just do two aligned loads, and don't mask the bytes around
them. A *real* ldl/stc will fail not just when the exact bytes are
different, but when somebody has touched the same cacheline. So if the
aligned values have changed, you should fail the stc even if the
change was in other bytes.
And doing two aligned loads don't need any inline asm at all.
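
As a rough illustration of that suggestion (not the actual patch, and not
necessarily what v2 will do), the read side could record the two aligned
longwords unmasked and do the byte extraction in C, reusing the ll_*
bookkeeping from the patch and leaning on __get_user() in place of the
hand-written EXC() fixups.  A sketch for the user-mode LDL_L case:

	case 0x2a: /* ldl_l */
	{
		/*
		 * Sketch only: plain aligned loads, whole longwords kept
		 * for the later STL_C comparison, no byte masking.
		 */
		unsigned long la = (unsigned long)va;
		unsigned int shift = (la & 3) * 8;	/* non-zero, or we would not have trapped */
		u32 lo, hi;

		if (__get_user(lo, (u32 __user *)(la & ~3ul)) ||
		    __get_user(hi, (u32 __user *)((la & ~3ul) + 4)))
			goto give_sigsegv;

		raw_spin_lock_irqsave(&ll_lock, flags);
		ll_va = la;
		ll_task = current;
		ll_data[0] = lo;	/* unmasked aligned longwords */
		ll_data[1] = hi;
		ll_quad = false;
		ll_bit = true;
		raw_spin_unlock_irqrestore(&ll_lock, flags);

		/* Reassemble the unaligned longword for the result. */
		*reg_addr = (int)((lo >> shift) | (hi << (32 - shift)));
		break;
	}

The STL_C side would then compare against the full recorded longwords, so
any intervening write to either of them, not just to the bytes being
replaced, makes the emulated store-conditional fail.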
On the st_c side, I think you're repeating the same inline asm twice,
and should have a single helper.
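
Since each aligned word is reread, compared against the value recorded at
LDx_L time and conditionally rewritten, such a helper essentially boils
down to a cmpxchg().  A hypothetical single helper for the longword case
(name illustrative, user-access fixups omitted) might look like:

	/*
	 * Hypothetical helper: update one aligned longword only if it still
	 * holds the value recorded when the LDL_L was emulated.  Bits set in
	 * keep_mask are preserved from the old value, the rest come from ins.
	 * cmpxchg() (from <linux/atomic.h>) retries internally on a spurious
	 * STL_C failure and bails out if the comparison fails, matching the
	 * patch's ldl_l/cmpeq/stl_c loops.
	 */
	static bool emul_stl_c_word(u32 *p, u32 old, u32 ins, u32 keep_mask)
	{
		u32 val = (old & keep_mask) | (ins & ~keep_mask);

		return cmpxchg(p, old, val) == old;
	}

The emulated STL_C result is then just the logical AND of two such calls,
one per aligned longword, with an analogous 64-bit variant for STQ_C; the
real code would still need exception fixups so that a faulting user access
produces SIGSEGV rather than an oops.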
Is this a NAK for the patch? No. But I do think it should be massaged a bit.
Linus
On Wed, 19 Feb 2025, Linus Torvalds wrote:

> > 1. A trapping unaligned LDx_L operation results in the pair of adjacent
> >    aligned whole data quantities spanned being read and stored for the
> >    reference with a subsequent STx_C operation, along with the width of
> >    the data accessed and its virtual address, and the task referring or
> >    NULL if the kernel.  The validity marker is set.
>
> So I have a couple of comments. I don't care deeply because I do think
> alpha is dead, but for completeness:
>
> (a) I don't think the task checking is correct.
>
> You're only checking the task pointer, so what can happen is that a
> task exits and another starts up with the same task pointer value, and
> it all matches across one task doing a ld_l and another doing a st_c.
>
> Does this matter? No. You'd literally have to *try* to create that
> situation with identical mis-aligned addresses and data contents, and
> an exit after a 'ld_l', and doing a 'st_c' in the new task without the
> matching ld_l.
>
> So I suspect this works in practice, but it's still worth mentioning.

It's an interesting corner case, but executing STx_C without a preceding
matching LDx_L is hardly useful and depending on the circumstances may or
may not cause unpredictable results.  We can make it a part of the psABI
that such usage is not supported.  Otherwise we could clear `ll_bit' in a
task's termination path.

> (b) this is not truly atomic wrt concurrent aligned non-trapping
> operations to the same words. Or in fact to current trapping ones,
> since you end up inevitably releasing the spinlock before the final
> stc emulation.

It is not supposed to be.  The only objective of this code is to protect
the *unchanged* part of the longword/quadword.

> I think this is fundamental and non-fixable, because the stc is done
> as two operations, and the first can succeed with the second failing
> (or they can both succeed, just interleaved with other accesses).

It is absolutely fine and by design.  Atomicity of the unaligned store of
the quantity itself is not guaranteed by this sequence, just as it is not
with the original one using a pair of STQ_U operations, where a concurrent
write to the same location may cause the parts of the quantity to become
inconsistent with each other.

If the trapping code wanted to make the data quantity accessed atomic, it
should have declared it atomic as well as made it aligned, in which case
the compiler would have done the right thing, including padding/separation
from other data objects where necessary for the minimum width of data that
hardware can guarantee atomicity for.  And then we would not have arrived
in this emulation path in the first place.

> Again, I don't think we care, and it works in practice, but it does
> mean that I *really* think that:
>
> (c) you should not handle the kernel case at all.
>
> If the kernel does an unaligned ld_l/st_c, that's a fundamental kernel
> bug. Don't emulate it. Particularly when the emulation fundamentally
> is not truly atomic wrt other accesses.

Good point actually, I think I mentally drove myself into a dead end here.
Yes, absolutely, it is not expected to happen unless we have a bug in our
code somewhere!

> Finally:
>
> (d) I think you're doing a few too many inline asms by hand, and
> you're masking the results too much.
>
> On the read-side emulation, why do you do that
>
> + "1: ldl %3,0(%5)\n"
> + "2: ldl %4,4(%5)\n"
> + " srl %3,%6,%1\n"
> + " sll %4,%7,%2\n"
> + " zapnot %1,15,%1\n"
> + " zapnot %2,15,%2\n"
>
> at all?  Just do two aligned loads, and don't mask the bytes around
> them. A *real* ldl/stc will fail not just when the exact bytes are
> different, but when somebody has touched the same cacheline. So if the
> aligned values have changed, you should fail the stc even if the
> change was in other bytes.

We do need to extract the bytes desired for the result of LDx_L though.
Yes, it can be done in C, but the same stands for all the other emulation
pieces here and yet they use inline assembly.  GCC is happy to do byte
extraction itself where necessary, it has machine description patterns for
that as it is a fairly common operation (it does not for INSxH and MSKxH
though, they're handled as intrinsics only, which however we could use
instead).

I think there is value in consistency, and with this piece written as
inline assembly you can spot the difference from the other variants right
away.  Or I could rewrite the byte extraction in C across other patterns.

> And doing two aligned loads don't need any inline asm at all.

Neither do unaligned loads, as GCC is happy to emit LDQ_U itself where
necessary, but we want to catch exceptions or we'd get an oops rather than
SIGSEGV.

> On the st_c side, I think you're repeating the same inline asm twice,
> and should have a single helper.

The masks are different, as are the displacements, but a good point
otherwise.  I think this lets the prologue of the atomic loop go away,
which is a good thing, but I'll still double-check the quality of the code
produced.

> Is this a NAK for the patch? No. But I do think it should be massaged a bit.

Thank you for your review, I'll post v2 once I'm done with the massage.

  Maciej
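
For reference, clearing `ll_bit' in the task termination path, as
suggested above, could be as small as the sketch below (the function name
and call site are illustrative only; it would have to be hooked into the
arch's task-exit processing):

	/* Hypothetical: invalidate a stale reservation when its owner exits. */
	void ll_task_exit(struct task_struct *tsk)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&ll_lock, flags);
		if (ll_bit && ll_task == tsk)
			ll_bit = false;
		raw_spin_unlock_irqrestore(&ll_lock, flags);
	}

That would remove the task-pointer reuse window described in point (a).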
On Thu, 20 Feb 2025, Maciej W. Rozycki wrote:

> > Again, I don't think we care, and it works in practice, but it does
> > mean that I *really* think that:
> >
> > (c) you should not handle the kernel case at all.
> >
> > If the kernel does an unaligned ld_l/st_c, that's a fundamental kernel
> > bug. Don't emulate it. Particularly when the emulation fundamentally
> > is not truly atomic wrt other accesses.
>
> Good point actually, I think I mentally drove myself into a dead end
> here.  Yes, absolutely, it is not expected to happen unless we have a bug
> in our code somewhere!

Well, I take it back.  I knew I had something in mind while writing this
code, just couldn't remember at the time of the reply, and I have now
brought it back at a mental relief break: it's networking.

It's been decades, but I kept in memory a discussion from when Alpha was
still a thing, up to remembering it was DaveM saying we chose to use
aligned frame/packet header accesses for performance reasons even where we
don't know beforehand whether a given piece of networking data will or
will not actually be aligned, so as not to penalise the common aligned
case.

Here's one reference I could track down: "TCP SYNs broken in 2.3.41",
<https://marc.info/?l=linux-kernel&m=94927689929463> (not present on lore;
I have the original thread of discussion with complete e-mail headers in
my personal archive though), likely the very one I remembered.

So unless I'm proved otherwise (e.g. that all such code paths are now gone
from networking, which may or may not be the case: I saw IPX go but I can
see AppleTalk still around; or that no sub-longword accesses are ever used
in the relevant networking paths), I'm going to keep kernel emulation in
v2, because what just used to be wrapped in an unaligned LDQ/STQ pair,
which we trapped on and emulated, will now become an LDQ_L/STQ_C loop.

Do you happen to know what the situation is here?

NB my network usage with my Alpha environment is virtually exclusively
FDDI, and there the wrong alignment phenomena may or may not be present at
all for the various networking protocols, owing to different link-layer
framing (including the Packet Request Header the Motorola FDDI chipset
uses for the very purpose of keeping things aligned).  I don't know
offhand, my FDDI-fu is a bit dusty even in the areas I have already
explored, and this is a new one; I've certainly not tried (to learn
anything about) AppleTalk over FDDI (or indeed otherwise).

All in all, the lack of kernel unaligned traps in my setup and hence no
recorded way to trigger them merely means I haven't used any of the stuff
that used to cause them.

  Maciej
On Mon, 7 Apr 2025 at 13:46, Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> So unless I'm proved otherwise (e.g. that all such code paths are now
> gone from networking, which may or may not be the case: I saw IPX go but I
> can see AppleTalk still around; or that no sub-longword accesses are ever
> used in the relevant networking paths), I'm going to keep kernel emulation
> in v2, because what just used to be wrapped in an unaligned LDQ/STQ pair,
> which we trapped on and emulated, will now become an LDQ_L/STQ_C loop.
>
> Do you happen to know what the situation is here?
I think networking ends up using 'get_unaligned()' properly for header
accesses these days for any of this.
If you don't, some architectures will literally silently give you
garbage back and not even fault.
Admittedly that's mainly some really broken old 32-bit ARM stuff and
hopefully it's all dead by now.
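For context, get_unaligned() takes a pointer to a possibly misaligned
datum and expands to whatever access sequence is safe for the
architecture, so no alignment trap is taken in the first place.  A typical
(purely illustrative) header-field read:

	#include <linux/unaligned.h>	/* <asm/unaligned.h> in older trees */

	/* 'hdr' is illustrative: any 32-bit field at an arbitrary offset. */
	u32 field = get_unaligned((const u32 *)hdr);

On Alpha this typically compiles to an LDQ_U/extract sequence rather than
a trapping LDL.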
So unless you actually *see* the unaligned faults, I really think you
shouldn't emulate them.
And I'd like to know where they are if you do see them.
Linus
Hi,
Very impressive work! Much appreciated!
I have the patch applied on a system running
6.14.0-rc3-00061-gd4ca652e53ca. You mentioned that you have a
testsuite/small program. If you can provide it to me, I can do some
testing on my ES40.
Regards
Magnus Lindholm