[PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Maciej W. Rozycki 10 months ago
Complementing compiler support for the `-msafe-bwa' and `-msafe-partial' 
code generation options slated to land in GCC 15, implement emulation 
for unaligned LDx_L and STx_C operations for the unlikely case where an 
alignment violation has resulted from improperly written code and caused 
these operations to trap in atomic RMW memory access sequences made to 
provide data consistency for non-BWX byte and word write operations, and 
writes to unaligned data objects causing partial memory updates.

The principle of operation is as follows:

1. A trapping unaligned LDx_L operation results in the pair of adjacent 
   aligned whole data quantities spanned being read and stored for the 
   reference with a subsequent STx_C operation, along with the width of 
   the data accessed and its virtual address, and the task referring or 
   NULL if the kernel.  The valitidy marker is set.

2. Regular memory load operations are used to retrieve data because no 
   atomicity is needed at this stage, and matching the width accessed, 
   either LDQ_U or LDL even though the latter instruction requires extra 
   operations, to avoid the case where an unaligned longword located 
   entirely within an aligned quadword would complicate handling.

3. Data is masked, shifted and merged appropriately and returned in the
   intended register as the result of the trapping LDx_L instruction.

4. A trapping unaligned STx_C operation results in the valitidy marker 
   being checked for being true, and the width of the data accessed 
   along with the virtual address and the task referring or the kernel 
   for a match.  The pair of whole data quantities previously read by 
   LDx_L emulation is retrieved and the valitidy marker is cleared.

5. If the checks succeeded, then in an atomic loop the location of the 
   first whole data quantity is reread, and data retrieved compared with 
   the value previously obtained.  If there's no match, then the loop is 
   aborted and 0 is returned in the intended register as the result of 
   the trapping STx_C instruction and emulation completes.  Otherwise 
   new data obtained from the source operand of STx_C is combined with 
   the data retrieved, replacing by byte insertion the part intended, 
   and an atomic write of this new data is attempted.  If it fails, the 
   loop continues from the beginning.  Otherwise processing proceeds to 
   the next step.

6. The same operations are performed on the second whole data quantity.

7. At this point both whole data quantities have been written, ensuring 
   that no third-party intervening write has changed them at the point 
   of the write from the values held at previous LDx_L.  Therefore 1 is 
   returned in the intended register as the result of the trapping STx_C 
   instruction.

8. No user accesses are permitted in traps from the kernel mode as the
   only LDx_L/STx_C accesses made to user memory locations by the kernel 
   are supposed to be those from handcrafted code, which has to be 
   written so as not to trap.

Since atomic loops are used for data updates the approach works equally 
well in both UP and SMP environments.  No data atomicity is guaranteed, 
but data consistency is: concurrent RMW accesses won't clobber each 
other.  However, if a regular write concurrently stores the same data as 
already held there between the emulated LDx_L and STx_C, then STx_C will 
still succeed; likewise if data is modified, but then restored before 
STx_C has had a chance to run.

This fulfils consistency requirements and guarantees that data outside 
the quantity written has not changed between emulated LDx_L and STx_C.
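
For illustration, here is a minimal user-space sketch (not the kernel 
code itself, and assuming a little-endian host) of the mask-and-merge 
arithmetic behind steps 5 and 6; where the kernel performs the two 
merges inside LDx_L/STx_C retry loops, plain loads and stores stand in 
for them here:

#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Merge VAL into the two aligned longwords spanned by the misaligned
   address P, touching no byte outside P[0..3].  OFF is never 0 here,
   as the emulation only triggers for misaligned addresses.  */
static void emulated_unaligned_store32(uint8_t *p, uint32_t val)
{
	uintptr_t la = (uintptr_t)p;
	unsigned int off = la & 3;			/* 1, 2 or 3 */
	uint32_t *lo = (uint32_t *)(la & ~(uintptr_t)3);
	uint32_t *hi = lo + 1;

	/* First longword: keep bytes 0..OFF-1, insert the low 4-OFF
	   bytes of VAL at byte offset OFF.  */
	*lo = (*lo & ((1u << (off * 8)) - 1)) | (val << (off * 8));
	/* Second longword: keep bytes OFF..3, insert the high OFF
	   bytes of VAL at byte offset 0.  */
	*hi = (*hi & (0xffffffffu << (off * 8))) | (val >> ((4 - off) * 8));
}

int main(void)
{
	uint32_t words[4];
	uint8_t *buf = (uint8_t *)words;
	uint32_t readback;

	memset(words, 0xaa, sizeof(words));
	emulated_unaligned_store32(buf + 3, 0x11223344);
	memcpy(&readback, buf + 3, sizeof(readback));
	assert(readback == 0x11223344);
	assert(buf[2] == 0xaa && buf[7] == 0xaa);	/* neighbours intact */
	return 0;
}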

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
---
Hi,

 This has cleared the pair of `-msafe-bwa -msafe-partial' regressions 
observed in GCC verification (the third one was a Modula 2 frontend bug, 
now fixed in the compiler).  I have verified individual misalignments with 
a small program by hand as well, for both the data retrieved by emulated 
LDx_L and the data stored by emulated STx_C.
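
 For illustration (this is not the author's program), a minimal sketch 
of how emulated LDL_L/STL_C might be exercised from user space on Alpha 
with a deliberately misaligned address; without the emulation such traps 
are not handled for user mode:

#include <assert.h>
#include <string.h>

static long store32_ll_sc(void *p, unsigned int val)
{
	unsigned long old, status;

	__asm__ __volatile__(
	"	ldl_l	%0,0(%2)\n"	/* unaligned: traps, emulated */
	"	bis	%3,%3,%1\n"
	"	stl_c	%1,0(%2)\n"	/* unaligned: traps, emulated */
		: "=&r"(old), "=&r"(status)
		: "r"(p), "r"(val)
		: "memory");
	(void)old;			/* value not needed for the check */
	return status;			/* 1 on success, 0 on failure */
}

int main(void)
{
	char buf[16] __attribute__((aligned(8)));
	unsigned int readback;

	memset(buf, 0x55, sizeof(buf));
	assert(store32_ll_sc(buf + 1, 0xdeadbeef) == 1);
	memcpy(&readback, buf + 1, sizeof(readback));
	assert(readback == 0xdeadbeef);
	assert(buf[0] == 0x55 && buf[5] == 0x55);	/* neighbours intact */
	return 0;
}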

 The kernel itself built with `-mcpu=ev4 -msafe-bwa -msafe-partial' boots 
and has passed GCC verification, and triggered no extra unaligned traps.

 Full verification was run with 6.3.0-rc5 and Ivan's stack alignment fixes 
applied, simply because I was already confident that version works correctly.  
Interestingly enough no kernel mode traps have triggered with a kernel 
built with GCC 12 (and with most user traps coming from GCC verification):

kernel unaligned acc	: 0 (pc=0,va=0)
user unaligned acc	: 1766720 (pc=20000053064,va=120020189)

but with GCC 15 a small quantity happened (even before I ran GCC testing):

kernel unaligned acc    : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
user unaligned acc      : 883452 (pc=20000053064,va=120020189)

It seems a compiler regression worth checking -- the trap recorded was in 
`icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr', 
which however by its type (`struct in6_addr') is only longword-aligned and 
indeed starts at offset 148 from the outermost struct.  I have a sneaking 
suspicion one of my earlier GCC changes might be at fault.  At least I now 
have a test case to experiment with.

 I've also built and booted 6.9.0-rc3 as at commit 82c525bfafb4 ("alpha: 
trim the unused stuff from asm-offsets.c"), the last one before support 
for my system was axed.  It has passed the verification with my small 
program (available by request; I'm not sure if it's worth turning into a 
kernel selftest).

 NB I'm going to ignore the 72 errors checkpatch.pl issues for EXC usage.  
The coding style of the new additions is consistent with the rest of the 
file and any change to that would best be made separately (but I fail to 
see the point).

 Questions, comments, concerns?  Otherwise please apply, and I'll proceed 
with the rest of the GCC effort, followed by cleaning up the handwritten 
assembly that uses STQ_U in our port and in glibc.

  Maciej
---
 arch/alpha/kernel/traps.c |  409 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 400 insertions(+), 9 deletions(-)

linux-alpha-llsc-unaligned.diff
Index: linux-macro/arch/alpha/kernel/traps.c
===================================================================
--- linux-macro.orig/arch/alpha/kernel/traps.c
+++ linux-macro/arch/alpha/kernel/traps.c
@@ -368,6 +368,13 @@ struct unaligned_stat {
 	unsigned long count, va, pc;
 } unaligned[2];
 
+/* Unaligned LDx_L/STx_C emulation state.  */
+static DEFINE_RAW_SPINLOCK(ll_lock);
+static struct task_struct *ll_task;
+static unsigned long ll_data[2];
+static unsigned long ll_va;
+static bool ll_quad;
+static bool ll_bit;
 
 /* Macro for exception fixup code to access integer registers.  */
 #define una_reg(r)  (_regs[(r) >= 16 && (r) <= 18 ? (r)+19 : (r)])
@@ -381,6 +388,9 @@ do_entUna(void * va, unsigned long opcod
 	unsigned long pc = regs->pc - 4;
 	unsigned long *_regs = regs->regs;
 	const struct exception_table_entry *fixup;
+	unsigned long flags;
+	unsigned long la;
+	bool ll_match;
 
 	unaligned[0].count++;
 	unaligned[0].va = (unsigned long) va;
@@ -439,6 +449,65 @@ do_entUna(void * va, unsigned long opcod
 		una_reg(reg) = tmp1|tmp2;
 		return;
 
+	case 0x2a: /* ldl_l */
+		la = (unsigned long)va;
+		if (la < TASK_SIZE)
+			break;
+		__asm__ __volatile__(
+		"1:	ldl %3,0(%5)\n"
+		"2:	ldl %4,4(%5)\n"
+		"	srl %3,%6,%1\n"
+		"	sll %4,%7,%2\n"
+		"	zapnot %1,15,%1\n"
+		"	zapnot %2,15,%2\n"
+		"3:\n"
+		EXC(1b,3b,%1,%0)
+		EXC(2b,3b,%2,%0)
+			: "=r"(error),
+			  "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+			: "r"(la & ~3ul),
+			  "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
+		if (error)
+			goto got_exception;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_va = la;
+		ll_task = NULL;
+		ll_data[0] = tmp3;
+		ll_data[1] = tmp4;
+		ll_quad = false;
+		ll_bit = true;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		una_reg(reg) = (int)(tmp1|tmp2);
+		return;
+
+	case 0x2b: /* ldq_l */
+		la = (unsigned long)va;
+		if (la < TASK_SIZE)
+			break;
+		__asm__ __volatile__(
+		"1:	ldq_u %3,0(%5)\n"
+		"2:	ldq_u %4,7(%5)\n"
+		"	extql %3,%5,%1\n"
+		"	extqh %4,%5,%2\n"
+		"3:\n"
+		EXC(1b,3b,%1,%0)
+		EXC(2b,3b,%2,%0)
+			: "=r"(error),
+			  "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+			: "r"(va), "0"(0));
+		if (error)
+			goto got_exception;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_va = la;
+		ll_task = NULL;
+		ll_data[0] = tmp3;
+		ll_data[1] = tmp4;
+		ll_quad = true;
+		ll_bit = true;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		una_reg(reg) = tmp1|tmp2;
+		return;
+
 	/* Note that the store sequences do not indicate that they change
 	   memory because it _should_ be affecting nothing in this context.
 	   (Otherwise we have other, much larger, problems.)  */
@@ -513,6 +582,134 @@ do_entUna(void * va, unsigned long opcod
 		if (error)
 			goto got_exception;
 		return;
+
+	case 0x2e: /* stl_c */
+		la = (unsigned long)va;
+		if (la < TASK_SIZE)
+			break;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_match = ll_bit;
+		ll_match &= !ll_quad;
+		ll_match &= ll_task == NULL;
+		ll_match &= ll_va == la;
+		tmp3 = ll_data[0];
+		tmp4 = ll_data[1];
+		ll_bit = false;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	srl %6,%5,%3\n"
+			"	zapnot %3,%8,%3\n"
+			"1:	ldl_l %2,4(%4)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	zap %2,%8,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stl_c %1,4(%4)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
+				  "r"(una_reg(reg)), "r"(tmp4),
+				  "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
+			if (error)
+				goto got_exception;
+		}
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	sll %6,%5,%3\n"
+			"	zapnot %3,%8,%3\n"
+			"1:	ldl_l %2,0(%4)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	zap %2,%8,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stl_c %1,0(%4)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(la & ~3ul), "r"((la & 3) * 8),
+				  "r"(una_reg(reg)), "r"(tmp3),
+				  "r"((15 << (la & 3)) & 0xf), "0"(0));
+			if (error)
+				goto got_exception;
+		}
+		una_reg(reg) = ll_match;
+		return;
+
+	case 0x2f: /* stq_c */
+		la = (unsigned long)va;
+		if (la < TASK_SIZE)
+			break;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_match = ll_bit;
+		ll_match &= ll_quad;
+		ll_match &= ll_task == NULL;
+		ll_match &= ll_va == la;
+		tmp3 = ll_data[0];
+		tmp4 = ll_data[1];
+		ll_bit = false;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	insqh %6,%4,%3\n"
+			"1:	ldq_l %2,8(%5)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	mskqh %2,%4,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stq_c %1,8(%5)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(va), "r"(la & ~7ul),
+				  "r"(una_reg(reg)), "r"(tmp4), "0"(0));
+			if (error)
+				goto got_exception;
+		}
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	insql %6,%4,%3\n"
+			"1:	ldq_l %2,0(%5)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	mskql %2,%4,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stq_c %1,0(%5)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(va), "r"(la & ~7ul),
+				  "r"(una_reg(reg)), "r"(tmp3), "0"(0));
+			if (error)
+				goto got_exception;
+		}
+		una_reg(reg) = ll_match;
+		return;
 	}
 
 	printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
@@ -624,24 +821,33 @@ s_reg_to_mem (unsigned long s_reg)
  * so finding the appropriate registers is a little more difficult
  * than in the kernel case.
  *
- * Finally, we handle regular integer load/stores only.  In
- * particular, load-linked/store-conditionally and floating point
- * load/stores are not supported.  The former make no sense with
- * unaligned faults (they are guaranteed to fail) and I don't think
- * the latter will occur in any decent program.
+ * We have three classes of operations to handle:
  *
- * Sigh. We *do* have to handle some FP operations, because GCC will
- * uses them as temporary storage for integer memory to memory copies.
- * However, we need to deal with stt/ldt and sts/lds only.
+ * - We handle regular integer load/stores transparently to faulting
+ *   code, preserving the semantics of the triggering instruction.
+ *
+ * - We handle some FP operations as well, because GCC will use them as
+ *   temporary storage for integer memory to memory copies.  However,
+ *   we need to deal with stt/ldt and sts/lds only.
+ *
+ * - We handle load-locked/store-conditional operations by maintaining
+ *   data consistency only, within the two adjacent longwords or
+ *   quadwords partially spanned.  This is sufficient to guarantee an
+ *   unaligned RMW sequence using these operations won't clobber data
+ *   *outside* the location intended but does *not* guarantee atomicity
+ *   for the data quantity itself.
  */
 
 #define OP_INT_MASK	( 1L << 0x28 | 1L << 0x2c   /* ldl stl */	\
+			| 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */	\
 			| 1L << 0x29 | 1L << 0x2d   /* ldq stq */	\
+			| 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */	\
 			| 1L << 0x0c | 1L << 0x0d   /* ldwu stw */	\
 			| 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
 
 #define OP_WRITE_MASK	( 1L << 0x26 | 1L << 0x27   /* sts stt */	\
 			| 1L << 0x2c | 1L << 0x2d   /* stl stq */	\
+			| 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */	\
 			| 1L << 0x0d | 1L << 0x0e ) /* stw stb */
 
 #define R(x)	((size_t) &((struct pt_regs *)0)->x)
@@ -666,6 +872,9 @@ do_entUnaUser(void __user * va, unsigned
 
 	unsigned long tmp1, tmp2, tmp3, tmp4;
 	unsigned long fake_reg, *reg_addr = &fake_reg;
+	unsigned long flags;
+	unsigned long la;
+	bool ll_match;
 	int si_code;
 	long error;
 
@@ -794,6 +1003,61 @@ do_entUnaUser(void __user * va, unsigned
 		*reg_addr = tmp1|tmp2;
 		break;
 
+	case 0x2a: /* ldl_l */
+		la = (unsigned long)va;
+		__asm__ __volatile__(
+		"1:	ldl %3,0(%5)\n"
+		"2:	ldl %4,4(%5)\n"
+		"	srl %3,%6,%1\n"
+		"	sll %4,%7,%2\n"
+		"	zapnot %1,15,%1\n"
+		"	zapnot %2,15,%2\n"
+		"3:\n"
+		EXC(1b,3b,%1,%0)
+		EXC(2b,3b,%2,%0)
+			: "=r"(error),
+			  "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+			: "r"(la & ~3ul),
+			  "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
+		if (error)
+			goto give_sigsegv;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_va = la;
+		ll_task = current;
+		ll_data[0] = tmp3;
+		ll_data[1] = tmp4;
+		ll_quad = false;
+		ll_bit = true;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		*reg_addr = (int)(tmp1|tmp2);
+		break;
+
+	case 0x2b: /* ldq_l */
+		la = (unsigned long)va;
+		__asm__ __volatile__(
+		"1:	ldq_u %3,0(%5)\n"
+		"2:	ldq_u %4,7(%5)\n"
+		"	extql %3,%5,%1\n"
+		"	extqh %4,%5,%2\n"
+		"3:\n"
+		EXC(1b,3b,%1,%0)
+		EXC(2b,3b,%2,%0)
+			: "=r"(error),
+			  "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+			: "r"(va), "0"(0));
+		if (error)
+			goto give_sigsegv;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_va = la;
+		ll_task = current;
+		ll_data[0] = tmp3;
+		ll_data[1] = tmp4;
+		ll_quad = true;
+		ll_bit = true;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		*reg_addr = tmp1|tmp2;
+		break;
+
 	/* Note that the store sequences do not indicate that they change
 	   memory because it _should_ be affecting nothing in this context.
 	   (Otherwise we have other, much larger, problems.)  */
@@ -877,12 +1141,139 @@ do_entUnaUser(void __user * va, unsigned
 			goto give_sigsegv;
 		return;
 
+	case 0x2e: /* stl_c */
+		la = (unsigned long)va;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_match = ll_bit;
+		ll_match &= !ll_quad;
+		ll_match &= ll_task == current;
+		ll_match &= ll_va == la;
+		tmp3 = ll_data[0];
+		tmp4 = ll_data[1];
+		ll_bit = false;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	srl %6,%5,%3\n"
+			"	zapnot %3,%8,%3\n"
+			"1:	ldl_l %2,4(%4)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	zap %2,%8,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stl_c %1,4(%4)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
+				  "r"(*reg_addr), "r"(tmp4),
+				  "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
+			if (error)
+				goto give_sigsegv;
+		}
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	sll %6,%5,%3\n"
+			"	zapnot %3,%8,%3\n"
+			"1:	ldl_l %2,0(%4)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	zap %2,%8,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stl_c %1,0(%4)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(la & ~3ul), "r"((la & 3) * 8),
+				  "r"(*reg_addr), "r"(tmp3),
+				  "r"((15 << (la & 3)) & 0xf), "0"(0));
+			if (error)
+				goto give_sigsegv;
+		}
+		*reg_addr = ll_match;
+		break;
+
+	case 0x2f: /* stq_c */
+		la = (unsigned long)va;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_match = ll_bit;
+		ll_match &= ll_quad;
+		ll_match &= ll_task == current;
+		ll_match &= ll_va == la;
+		tmp3 = ll_data[0];
+		tmp4 = ll_data[1];
+		ll_bit = false;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	insqh %6,%4,%3\n"
+			"1:	ldq_l %2,8(%5)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	mskqh %2,%4,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stq_c %1,8(%5)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(va), "r"(la & ~7ul),
+				  "r"(*reg_addr), "r"(tmp4), "0"(0));
+			if (error)
+				goto give_sigsegv;
+		}
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	insql %6,%4,%3\n"
+			"1:	ldq_l %2,0(%5)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	mskql %2,%4,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stq_c %1,0(%5)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(va), "r"(la & ~7ul),
+				  "r"(*reg_addr), "r"(tmp3), "0"(0));
+			if (error)
+				goto give_sigsegv;
+		}
+		*reg_addr = ll_match;
+		break;
+
 	default:
 		/* What instruction were you trying to use, exactly?  */
 		goto give_sigbus;
 	}
 
-	/* Only integer loads should get here; everyone else returns early. */
+	/*
+	 * Only integer loads and stores conditional should get here;
+	 * everyone else returns early.
+	 */
 	if (reg == 30)
 		wrusp(fake_reg);
 	return;
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Maciej W. Rozycki 9 months, 3 weeks ago
On Wed, 19 Feb 2025, Maciej W. Rozycki wrote:

> Interestingly enough no kernel mode traps have triggered with a kernel 
> built with GCC 12 (and with most user traps coming from GCC verification):
> 
> kernel unaligned acc	: 0 (pc=0,va=0)
> user unaligned acc	: 1766720 (pc=20000053064,va=120020189)
> 
> but with GCC 15 a small quantity happened (even before I ran GCC testing):
> 
> kernel unaligned acc    : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
> user unaligned acc      : 883452 (pc=20000053064,va=120020189)
> 
> It seems a compiler regression worth checking -- the trap recorded was in 
> `icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr', 
> which however by its type (`struct in6_addr') is only longword-aligned and 
> indeed starts at offset 148 from the outermost struct.  I have a sneaking 
> suspicion one of my earlier GCC changes might be at fault.  At least I now 
> have a test case to experiment with.

 FYI my suspicion wasn't wrong; I have submitted a compiler fix now [1].  

 My plan has been to complete the GCC side first as it's more urgent given 
its annual-only release cycle model targeting April/May, whereas I think 
the Linux side can slip a release or two in our roughly bi-monthly cycle.  
I'm going to schedule my time accordingly and, with my upcoming holiday 
also in the picture, I may not be able to post v2 of this proposal until 
around the end of March at the earliest.

References:

[1] "Alpha: Fix base block alignment calculation regression", 
    <https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2502251934260.65342@angie.orcam.me.uk/>

  Maciej
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Richard Henderson 10 months ago
On 2/19/25 04:46, Maciej W. Rozycki wrote:
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15,

Pointer?  I can't find it on the gcc-patches list.


> implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
> 
> The principle of operation is as follows:
> 
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>     aligned whole data quantities spanned being read and stored for the
>     reference with a subsequent STx_C operation, along with the width of
>     the data accessed and its virtual address, and the task referring or
>     NULL if the kernel.  The valitidy marker is set.
> 
> 2. Regular memory load operations are used to retrieve data because no
>     atomicity is needed at this stage, and matching the width accessed,
>     either LDQ_U or LDL even though the latter instruction requires extra
>     operations, to avoid the case where an unaligned longword located
>     entirely within an aligned quadword would complicate handling.
> 
> 3. Data is masked, shifted and merged appropriately and returned in the
>     intended register as the result of the trapping LDx_L instruction.
> 
> 4. A trapping unaligned STx_C operation results in the valitidy marker
>     being checked for being true, and the width of the data accessed
>     along with the virtual address and the task referring or the kernel
>     for a match.  The pair of whole data quantities previously read by
>     LDx_L emulation is retrieved and the valitidy marker is cleared.
> 
> 5. If the checks succeeded, then in an atomic loop the location of the
>     first whole data quantity is reread, and data retrieved compared with
>     the value previously obtained.  If there's no match, then the loop is
>     aborted and 0 is returned in the intended register as the result of
>     the trapping STx_C instruction and emulation completes.  Otherwise
>     new data obtained from the source operand of STx_C is combined with
>     the data retrieved, replacing by byte insertion the part intended,
>     and an atomic write of this new data is attempted.  If it fails, the
>     loop continues from the beginning.  Otherwise processing proceeds to
>     the next step.
> 
> 6. The same operations are performed on the second whole data quantity.
> 
> 7. At this point both whole data quantities have been written, ensuring
>     that no third-party intervening write has changed them at the point
>     of the write from the values held at previous LDx_L.  Therefore 1 is
>     returned in the intended register as the result of the trapping STx_C
>     instruction.

I think general-purpose non-atomic emulation of STx_C is a really bad idea.

Without looking at your gcc patches, I can guess what you're after: you've generated a 
ll/sc sequence for (aligned) short, and want to emulate if it happens to be unaligned.

Crucially, when emulating non-aligned, you should not strive to make it atomic.  No other 
architecture promises atomic non-aligned stores, so why should you do that here?

I suggest some sort of magic code sequence,

	bic	addr_in, 6, addr_al
loop:
	ldq_l	t0, 0(addr_al)
	magic-nop done - loop
	inswl	data, addr_in, t1
	mskwl	t0, addr_in, t0
	bis	t0, t1, t0
	stq_c	t0, 0(addr_al)
	beq	t0, loop
done:

With the trap, match the magic-nop, pick out the input registers from the following inswl, 
perform the two (atomic!) byte stores to accomplish the emulation, adjust the pc forward 
to the done label.

Choose anything you like for the magic-nop.  The (done - loop) displacement is small, so 
any 8-bit immediate would suffice.  E.g. "eqv $31, disp, $31".  You might require the 
displacement to be constant and not actually extract "disp"; just match the entire 
uint32_t instruction pattern.


r~
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Maciej W. Rozycki 10 months ago
On Thu, 20 Feb 2025, Richard Henderson wrote:

> > Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> > code generation options slated to land in GCC 15,
> 
> Pointer?  I can't find it on the gcc-patches list.

 Here: 
<https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2501050246590.49841@angie.orcam.me.uk/>
and hopefully in your inbox/archive somewhere as well.

> > 7. At this point both whole data quantities have been written, ensuring
> >     that no third-party intervening write has changed them at the point
> >     of the write from the values held at previous LDx_L.  Therefore 1 is
> >     returned in the intended register as the result of the trapping STx_C
> >     instruction.
> 
> I think general-purpose non-atomic emulation of STx_C is a really bad idea.
> 
> Without looking at your gcc patches, I can guess what you're after: you've
> generated a ll/sc sequence for (aligned) short, and want to emulate if it
> happens to be unaligned.

 It's a corner case, yes, when the compiler was told the access would be 
aligned, but it turns out not to be.  It's where you cast a (char *) 
pointer to (short *) that wasn't suitably aligned for such a cast and 
dereference it (and the quadword case similarly arises at the ends of 
misaligned inline `memcpy'/`memset').
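
 For illustration (hypothetical code, not taken from the GCC testsuite), 
the trapping corner case looks like this; with `-msafe-bwa' on a non-BWX 
target the halfword store is compiled into an LDx_L/STx_C insertion 
sequence that assumes the 2-byte alignment the type promises:

void put_short(short *p, short v)
{
	*p = v;			/* ll/sc insertion sequence on non-BWX */
}

void misuse(char *buf, short v)
{
	/* If BUF + 1 turns out not to be 2-byte aligned at run time,
	   the generated ll/sc sequence traps and relies on the
	   emulation proposed here.  */
	put_short((short *)(buf + 1), v);
}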

 Only two cases (plus a bug in the GM2 frontend) hit this throughout the 
GCC testsuite, which shows how rare it is.

> Crucially, when emulating non-aligned, you should not strive to make it
> atomic.  No other architecture promises atomic non-aligned stores, so why
> should you do that here?

 This code doesn't strive to be atomic, but to preserve data *outside* the 
quantity accessed from being clobbered, and for this purpose an atomic 
sequence is both inevitable and sufficient, for both partial quantities 
around the unaligned quantity written.  The trapping code does not expect 
atomicity for the unaligned quantity itself -- it is handled in pieces 
just as say with MIPS SWL/SWR masked store instruction pairs -- and this 
code, effectively an Alpha/Linux psABI extension, does not guarantee it 
either.

> I suggest some sort of magic code sequence,
> 
> 	bic	addr_in, 6, addr_al
> loop:
> 	ldq_l	t0, 0(addr_al)
> 	magic-nop done - loop
> 	inswl	data, addr_in, t1
> 	mskwl	t0, addr_in, t0
> 	bis	t0, t1, t0
> 	stq_c	t0, 0(addr_al)
> 	beq	t0, loop
> done:
> 
> With the trap, match the magic-nop, pick out the input registers from the
> following inswl, perform the two (atomic!) byte stores to accomplish the
> emulation, adjust the pc forward to the done label.

 It seems to make no sense to me to penalise all user code for the corner 
case mentioned above while still having the emulation in the kernel, given 
that 99.999...% of accesses will have been correctly aligned by GCC.  And 
it gets even more complex when you have an awkward number of bytes to 
mask, such as 3, 5, 6 or 7, which will happen for example if an inline 
`memcpy' is expanded by GCC for a quadword-aligned block of 31 bytes and 
the block turns out misaligned at run time; in that case other 
instructions are used for masking/insertion of the trailing 7 bytes.
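
 A hedged sketch of that situation (hypothetical types, not from the 
testsuite); the copy may be expanded inline by GCC on the assumption of 
the quadword alignment the type promises, with an LDx_L/STx_C masked 
insertion covering the trailing 7 bytes under `-msafe-partial':

#include <string.h>

struct blob {
	char data[31];
} __attribute__((aligned(8)));

void copy_blob(struct blob *dst, const struct blob *src)
{
	/* If DST was in fact derived from a misaligned pointer cast,
	   the tail sequence traps at run time and is emulated.  */
	memcpy(dst, src, sizeof(*dst));
}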

 I'm unconvinced; it seems a lot of hassle for little gain to me.

  Maciej
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Matt Turner 10 months ago
On Thu, Feb 20, 2025 at 12:54 PM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 2/19/25 04:46, Maciej W. Rozycki wrote:
> > Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> > code generation options slated to land in GCC 15,
>
> Pointer?  I can't find it on the gcc-patches list.

I believe it's this:

https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2411141652300.9262@angie.orcam.me.uk/

The first half or so have landed so far, right Maciej?
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Maciej W. Rozycki 10 months ago
On Thu, 20 Feb 2025, Matt Turner wrote:

> > > Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> > > code generation options slated to land in GCC 15,
> >
> > Pointer?  I can't find it on the gcc-patches list.
> 
> I believe it's this:
> 
> https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2411141652300.9262@angie.orcam.me.uk/

 That's the original series; v2 is here:

https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2501050246590.49841@angie.orcam.me.uk/

> The first half or so have landed so far, right Maciej?

 Yes, fixes for bugs discovered in the course of this work have been 
committed and the rest of the changes already approved, although there 
were a couple of comments I still want to investigate/address by the end 
of next week.  Getting clean GCC test results with the kernel emulation 
was my objective before moving forward.

  Maciej
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Linus Torvalds 10 months ago
On Thu, 20 Feb 2025 at 09:54, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Crucially, when emulating non-aligned, you should not strive to make it atomic.  No other
> architecture promises atomic non-aligned stores, so why should you do that here?

I'm not disagreeing with the "it doesn't necessarily have to be
atomic", but I will point out that x86 does indeed promise atomic
non-aligned accesses.

It will actually lock both cachelines when straddling a cacheline.

It's slow, it's horrendous, and people are trying to get away from it
(google "split lock"), but it is actually architecturally supported.

                   Linus
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Richard Henderson 10 months ago
On 2/20/25 09:59, Linus Torvalds wrote:
> On Thu, 20 Feb 2025 at 09:54, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Crucially, when emulating non-aligned, you should not strive to make it atomic.  No other
>> architecture promises atomic non-aligned stores, so why should you do that here?
> 
> I'm not disagreeing with the "it doesn't necessarily have to be
> atomic", but I will point out that x86 does indeed promise atomic
> non-aligned accesses.

I should have been more expansive with that statement: I didn't mean "no unaligned 
atomics" (e.g. lock orw), but "unaligned normal stores may be non-atomic" (e.g. movw 
across a cacheline).

My guess about the gcc patches is that it's the latter that wanted emulation here.


r~
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Matt Turner 10 months ago
On Wed, Feb 19, 2025 at 7:46 AM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15, implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
>
> The principle of operation is as follows:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>    aligned whole data quantities spanned being read and stored for the
>    reference with a subsequent STx_C operation, along with the width of
>    the data accessed and its virtual address, and the task referring or
>    NULL if the kernel.  The valitidy marker is set.

Typo: validity (occurs twice more in the commit message below)

>
>  #define OP_INT_MASK    ( 1L << 0x28 | 1L << 0x2c   /* ldl stl */       \
> +                       | 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */   \
>                         | 1L << 0x29 | 1L << 0x2d   /* ldq stq */       \
> +                       | 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */   \
>                         | 1L << 0x0c | 1L << 0x0d   /* ldwu stw */      \
>                         | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
>
>  #define OP_WRITE_MASK  ( 1L << 0x26 | 1L << 0x27   /* sts stt */       \
>                         | 1L << 0x2c | 1L << 0x2d   /* stl stq */       \
> +                       | 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */   \

stq_c should be 0x2f, not 0x2d. Looks like a copy-n-paste mistake.
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Matt Turner 10 months ago
On Thu, Feb 20, 2025 at 11:46 AM Matt Turner <mattst88@gmail.com> wrote:
>
> On Wed, Feb 19, 2025 at 7:46 AM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
> >  #define OP_INT_MASK    ( 1L << 0x28 | 1L << 0x2c   /* ldl stl */       \
> > +                       | 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */   \
> >                         | 1L << 0x29 | 1L << 0x2d   /* ldq stq */       \
> > +                       | 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */   \
> >                         | 1L << 0x0c | 1L << 0x0d   /* ldwu stw */      \
> >                         | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
> >
> >  #define OP_WRITE_MASK  ( 1L << 0x26 | 1L << 0x27   /* sts stt */       \
> >                         | 1L << 0x2c | 1L << 0x2d   /* stl stq */       \
> > +                       | 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */   \
>
> stq_c should be 0x2f, not 0x2d. Looks like a copy-n-paste mistake.

The good news is that OP_WRITE_MASK appears to be unused going all the
way back to the import into git, so this doesn't indicate a problem
with any of the testing that's been done.
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Maciej W. Rozycki 10 months ago
On Thu, 20 Feb 2025, Matt Turner wrote:

> > On Wed, Feb 19, 2025 at 7:46 AM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
> > >  #define OP_INT_MASK    ( 1L << 0x28 | 1L << 0x2c   /* ldl stl */       \
> > > +                       | 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */   \
> > >                         | 1L << 0x29 | 1L << 0x2d   /* ldq stq */       \
> > > +                       | 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */   \
> > >                         | 1L << 0x0c | 1L << 0x0d   /* ldwu stw */      \
> > >                         | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
> > >
> > >  #define OP_WRITE_MASK  ( 1L << 0x26 | 1L << 0x27   /* sts stt */       \
> > >                         | 1L << 0x2c | 1L << 0x2d   /* stl stq */       \
> > > +                       | 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */   \
> >
> > stq_c should be 0x2f, not 0x2d. Looks like a copy-n-paste mistake.
> 
> The good news is that OP_WRITE_MASK appears to be unused going all the
> way back to the import into git, so this doesn't indicate a problem
> with any of the testing that's been done.

 Good catch, thank you, and I guess the lack of use is why things haven't 
broken.  I'll make a preparatory change in v2 and remove this macro then.

 FWIW it came with 2.1.36, already unused, so presumably a leftover from a 
WIP version.

 I'll fix the typo in the description as well, thank you.

  Maciej
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Magnus Lindholm 10 months ago
I've tested this patch on an Alphaserver ES40 (running 6.14.0-rc3),
using the test suite provided by Maciej, and can verify that it works.


Tested-by: Magnus Lindholm <linmag7@gmail.com>

On Wed, Feb 19, 2025 at 1:46 PM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15, implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
>
> The principle of operation is as follows:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>    aligned whole data quantities spanned being read and stored for the
>    reference with a subsequent STx_C operation, along with the width of
>    the data accessed and its virtual address, and the task referring or
>    NULL if the kernel.  The valitidy marker is set.
>
> 2. Regular memory load operations are used to retrieve data because no
>    atomicity is needed at this stage, and matching the width accessed,
>    either LDQ_U or LDL even though the latter instruction requires extra
>    operations, to avoid the case where an unaligned longword located
>    entirely within an aligned quadword would complicate handling.
>
> 3. Data is masked, shifted and merged appropriately and returned in the
>    intended register as the result of the trapping LDx_L instruction.
>
> 4. A trapping unaligned STx_C operation results in the valitidy marker
>    being checked for being true, and the width of the data accessed
>    along with the virtual address and the task referring or the kernel
>    for a match.  The pair of whole data quantities previously read by
>    LDx_L emulation is retrieved and the valitidy marker is cleared.
>
> 5. If the checks succeeded, then in an atomic loop the location of the
>    first whole data quantity is reread, and data retrieved compared with
>    the value previously obtained.  If there's no match, then the loop is
>    aborted and 0 is returned in the intended register as the result of
>    the trapping STx_C instruction and emulation completes.  Otherwise
>    new data obtained from the source operand of STx_C is combined with
>    the data retrieved, replacing by byte insertion the part intended,
>    and an atomic write of this new data is attempted.  If it fails, the
>    loop continues from the beginning.  Otherwise processing proceeds to
>    the next step.
>
> 6. The same operations are performed on the second whole data quantity.
>
> 7. At this point both whole data quantities have been written, ensuring
>    that no third-party intervening write has changed them at the point
>    of the write from the values held at previous LDx_L.  Therefore 1 is
>    returned in the intended register as the result of the trapping STx_C
>    instruction.
>
> 8. No user accesses are permitted in traps from the kernel mode as the
>    only LDx_L/STx_C accesses made to user memory locations by the kernel
>    are supposed to be those from handcrafted code, which has to be written
>    such as not to trap.
>
> Since atomic loops are used for data updates the approach works equally
> well in both UP and SMP environments.  No data atomicity is guaranteed,
> but data consistency is, that is concurrent RMW accesses won't clobber
> each other, however if the same data is concurrently written as already
> there with a regular write between emulated LDx_L and STx_C, then STx_C
> will still succeed.  Likewise if data is modified, but then restored
> before STx_C has had a chance to run.
>
> This fulfils consistency requirements and guarantees that data outside
> the quantity written has not changed between emulated LDx_L and STx_C.
>
> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
> ---
> Hi,
>
>  This has cleared the pair of `-msafe-bwa -msafe-partial' regressions
> observed in GCC verification (the third one was a Modula 2 frontend bug,
> now fixed in the compiler).  I have verified individual misalignments with
> a small program by hand as well, for both the data retrieved by emulated
> LDx_L and the data stored by emulated STx_C.
>
>  The kernel itself built with `-mcpu=ev4 -msafe-bwa -msafe-partial' boots
> and has passed GCC verification, and triggered no extra unaligned traps.
>
>  Full verification was run with 6.3.0-rc5 and Ivan's stack alignment fixes
> applied just because I was confident already that version works correctly.
> Interestingly enough no kernel mode traps have triggered with a kernel
> built with GCC 12 (and with most user traps coming from GCC verification):
>
> kernel unaligned acc    : 0 (pc=0,va=0)
> user unaligned acc      : 1766720 (pc=20000053064,va=120020189)
>
> but with GCC 15 a small quantity happened (even before I ran GCC testing):
>
> kernel unaligned acc    : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
> user unaligned acc      : 883452 (pc=20000053064,va=120020189)
>
> It seems a compiler regression worth checking -- the trap recorded was in
> `icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr',
> which however by its type (`struct in6_addr') is only longword-aligned and
> indeed starts at offset 148 from the outermost struct.  I have a sneaking
> suspicion one of my earlier GCC changes might be at fault.  At least I now
> have a test case to experiment with.
>
>  I've also built and booted 6.9.0-rc3 as at commit 82c525bfafb4 ("alpha:
> trim the unused stuff from asm-offsets.c"), the last one before support
> for my system was axed.  It has passed the verification with my small
> program (available by request; I'm not sure if it's worth turning into a
> kernel selftest).
>
>  NB I'm going to ignore the 72 errors checkpatch.pl issues for EXC usage.
> The coding style of the new additions is consistent with the rest of the
> file and any change to that would best be made separately (but I fail to
> see the point).
>
>  Questions, comments, concerns?  Otherwise please apply, and I'll proceed
> with the rest of the GCC effort, followed by cleaning handwritten assembly
> up that uses STQ_U in our port and in glibc.
>
>   Maciej
> ---
>  arch/alpha/kernel/traps.c |  409 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 400 insertions(+), 9 deletions(-)
>
> linux-alpha-llsc-unaligned.diff
> Index: linux-macro/arch/alpha/kernel/traps.c
> ===================================================================
> --- linux-macro.orig/arch/alpha/kernel/traps.c
> +++ linux-macro/arch/alpha/kernel/traps.c
> @@ -368,6 +368,13 @@ struct unaligned_stat {
>         unsigned long count, va, pc;
>  } unaligned[2];
>
> +/* Unaligned LDx_L/STx_C emulation state.  */
> +static DEFINE_RAW_SPINLOCK(ll_lock);
> +static struct task_struct *ll_task;
> +static unsigned long ll_data[2];
> +static unsigned long ll_va;
> +static bool ll_quad;
> +static bool ll_bit;
>
>  /* Macro for exception fixup code to access integer registers.  */
>  #define una_reg(r)  (_regs[(r) >= 16 && (r) <= 18 ? (r)+19 : (r)])
> @@ -381,6 +388,9 @@ do_entUna(void * va, unsigned long opcod
>         unsigned long pc = regs->pc - 4;
>         unsigned long *_regs = regs->regs;
>         const struct exception_table_entry *fixup;
> +       unsigned long flags;
> +       unsigned long la;
> +       bool ll_match;
>
>         unaligned[0].count++;
>         unaligned[0].va = (unsigned long) va;
> @@ -439,6 +449,65 @@ do_entUna(void * va, unsigned long opcod
>                 una_reg(reg) = tmp1|tmp2;
>                 return;
>
> +       case 0x2a: /* ldl_l */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               __asm__ __volatile__(
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       srl %3,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %1,15,%1\n"
> +               "       zapnot %2,15,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(la & ~3ul),
> +                         "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> +               if (error)
> +                       goto got_exception;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = NULL;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = false;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               una_reg(reg) = (int)(tmp1|tmp2);
> +               return;
> +
> +       case 0x2b: /* ldq_l */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               __asm__ __volatile__(
> +               "1:     ldq_u %3,0(%5)\n"
> +               "2:     ldq_u %4,7(%5)\n"
> +               "       extql %3,%5,%1\n"
> +               "       extqh %4,%5,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(va), "0"(0));
> +               if (error)
> +                       goto got_exception;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = NULL;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = true;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               una_reg(reg) = tmp1|tmp2;
> +               return;
> +
>         /* Note that the store sequences do not indicate that they change
>            memory because it _should_ be affecting nothing in this context.
>            (Otherwise we have other, much larger, problems.)  */
> @@ -513,6 +582,134 @@ do_entUna(void * va, unsigned long opcod
>                 if (error)
>                         goto got_exception;
>                 return;
> +
> +       case 0x2e: /* stl_c */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= !ll_quad;
> +               ll_match &= ll_task == NULL;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       srl %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,4(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,4(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> +                                 "r"(una_reg(reg)), "r"(tmp4),
> +                                 "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       sll %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,0(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,0(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((la & 3) * 8),
> +                                 "r"(una_reg(reg)), "r"(tmp3),
> +                                 "r"((15 << (la & 3)) & 0xf), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               una_reg(reg) = ll_match;
> +               return;
> +
> +       case 0x2f: /* stq_c */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= ll_quad;
> +               ll_match &= ll_task == NULL;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insqh %6,%4,%3\n"
> +                       "1:     ldq_l %2,8(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskqh %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,8(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(una_reg(reg)), "r"(tmp4), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insql %6,%4,%3\n"
> +                       "1:     ldq_l %2,0(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskql %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,0(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(una_reg(reg)), "r"(tmp3), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               una_reg(reg) = ll_match;
> +               return;
>         }
>
>         printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
> @@ -624,24 +821,33 @@ s_reg_to_mem (unsigned long s_reg)
>   * so finding the appropriate registers is a little more difficult
>   * than in the kernel case.
>   *
> - * Finally, we handle regular integer load/stores only.  In
> - * particular, load-linked/store-conditionally and floating point
> - * load/stores are not supported.  The former make no sense with
> - * unaligned faults (they are guaranteed to fail) and I don't think
> - * the latter will occur in any decent program.
> + * We have three classes of operations to handle:
>   *
> - * Sigh. We *do* have to handle some FP operations, because GCC will
> - * uses them as temporary storage for integer memory to memory copies.
> - * However, we need to deal with stt/ldt and sts/lds only.
> + * - We handle regular integer load/stores transparently to faulting
> + *   code, preserving the semantics of the triggering instruction.
> + *
> + * - We handle some FP operations as well, because GCC will use them as
> + *   temporary storage for integer memory to memory copies.  However,
> + *   we need to deal with stt/ldt and sts/lds only.
> + *
> + * - We handle load-locked/store-conditional operations by maintaining
> + *   data consistency only, within the two adjacent longwords or
> + *   quadwords partially spanned.  This is sufficient to guarantee an
> + *   unaligned RMW sequence using these operations won't clobber data
> + *   *outside* the location intended but does *not* guarantee atomicity
> + *   for the data quantity itself.
>   */
>
>  #define OP_INT_MASK    ( 1L << 0x28 | 1L << 0x2c   /* ldl stl */       \
> +                       | 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */   \
>                         | 1L << 0x29 | 1L << 0x2d   /* ldq stq */       \
> +                       | 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */   \
>                         | 1L << 0x0c | 1L << 0x0d   /* ldwu stw */      \
>                         | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
>
>  #define OP_WRITE_MASK  ( 1L << 0x26 | 1L << 0x27   /* sts stt */       \
>                         | 1L << 0x2c | 1L << 0x2d   /* stl stq */       \
> +                       | 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */   \
>                         | 1L << 0x0d | 1L << 0x0e ) /* stw stb */
>
>  #define R(x)   ((size_t) &((struct pt_regs *)0)->x)
> @@ -666,6 +872,9 @@ do_entUnaUser(void __user * va, unsigned
>
>         unsigned long tmp1, tmp2, tmp3, tmp4;
>         unsigned long fake_reg, *reg_addr = &fake_reg;
> +       unsigned long flags;
> +       unsigned long la;
> +       bool ll_match;
>         int si_code;
>         long error;
>
> @@ -794,6 +1003,61 @@ do_entUnaUser(void __user * va, unsigned
>                 *reg_addr = tmp1|tmp2;
>                 break;
>
> +       case 0x2a: /* ldl_l */
> +               la = (unsigned long)va;
> +               __asm__ __volatile__(
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       srl %3,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %1,15,%1\n"
> +               "       zapnot %2,15,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(la & ~3ul),
> +                         "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> +               if (error)
> +                       goto give_sigsegv;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = current;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = false;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               *reg_addr = (int)(tmp1|tmp2);
> +               break;
> +
> +       case 0x2b: /* ldq_l */
> +               la = (unsigned long)va;
> +               __asm__ __volatile__(
> +               "1:     ldq_u %3,0(%5)\n"
> +               "2:     ldq_u %4,7(%5)\n"
> +               "       extql %3,%5,%1\n"
> +               "       extqh %4,%5,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(va), "0"(0));
> +               if (error)
> +                       goto give_sigsegv;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = current;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = true;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               *reg_addr = tmp1|tmp2;
> +               break;
> +
>         /* Note that the store sequences do not indicate that they change
>            memory because it _should_ be affecting nothing in this context.
>            (Otherwise we have other, much larger, problems.)  */
> @@ -877,12 +1141,139 @@ do_entUnaUser(void __user * va, unsigned
>                         goto give_sigsegv;
>                 return;
>
> +       case 0x2e: /* stl_c */
> +               la = (unsigned long)va;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= !ll_quad;
> +               ll_match &= ll_task == current;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       srl %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,4(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,4(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> +                                 "r"(*reg_addr), "r"(tmp4),
> +                                 "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       sll %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,0(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,0(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((la & 3) * 8),
> +                                 "r"(*reg_addr), "r"(tmp3),
> +                                 "r"((15 << (la & 3)) & 0xf), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               *reg_addr = ll_match;
> +               break;
> +
> +       case 0x2f: /* stq_c */
> +               la = (unsigned long)va;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= ll_quad;
> +               ll_match &= ll_task == current;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insqh %6,%4,%3\n"
> +                       "1:     ldq_l %2,8(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskqh %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,8(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(*reg_addr), "r"(tmp4), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insql %6,%4,%3\n"
> +                       "1:     ldq_l %2,0(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskql %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,0(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(*reg_addr), "r"(tmp3), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               *reg_addr = ll_match;
> +               break;
> +
>         default:
>                 /* What instruction were you trying to use, exactly?  */
>                 goto give_sigbus;
>         }
>
> -       /* Only integer loads should get here; everyone else returns early. */
> +       /*
> +        * Only integer loads and stores conditional should get here;
> +        * everyone else returns early.
> +        */
>         if (reg == 30)
>                 wrusp(fake_reg);
>         return;
>
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Linus Torvalds 10 months ago
On Wed, 19 Feb 2025 at 04:46, Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>    aligned whole data quantities spanned being read and stored for the
>    reference with a subsequent STx_C operation, along with the width of
>    the data accessed and its virtual address, and the task referring or
>    NULL if the kernel.  The validity marker is set.

So I have a couple of comments. I don't care deeply because I do think
alpha is dead, but for completeness:

 (a) I don't think the task checking is correct.

You're only checking the task pointer, so what can happen is that a
task exits and another starts up with the same task pointer value, and
it all matches across one task doing a ld_l and another doing a st_c.

Does this matter? No. You'd literally have to *try* to create that
situation with identical mis-aligned addresses and data contents, and
an exit after a 'ld_l', and doing a 'st_c' in the new task without the
matching ld_l.

So I suspect this works in practice, but it's still worth mentioning.

 (b) this is not truly atomic wrt concurrent aligned non-trapping
operations to the same words. Or in fact to concurrent trapping ones,
since you end up inevitably releasing the spinlock before the final
stc emulation.

I think this is fundamental and non-fixable, because the stc is done
as two operations, and the first can succeed with the second failing
(or they can both succeed, just interleaved with other accesses).

Again, I don't think we care, and it works in practice, but it does
mean that I *really* think that:

 (c) you should not handle the kernel case at all.

If the kernel does an unaligned ld_l/st_c, that's a fundamental kernel
bug. Don't emulate it. Particularly when the emulation fundamentally
is not truly atomic wrt other accesses.

Finally:

 (d) I think you're doing a few too many inline asms by hand, and
you're masking the results too much.

On the read-side emulation, why do you do that

+               "1:     ldl %3,0(%5)\n"
+               "2:     ldl %4,4(%5)\n"
+               "       srl %3,%6,%1\n"
+               "       sll %4,%7,%2\n"
+               "       zapnot %1,15,%1\n"
+               "       zapnot %2,15,%2\n"

at all? Just do two aligned loads, and don't mask the bytes around
them. A *real* ldl/stc will fail not just when the exact bytes are
different, but when somebody has touched the same cacheline. So if the
aligned values have changed, you should fail the stc even if the
change was in other bytes.

And doing two aligned loads doesn't need any inline asm at all.
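
For the user-side path that could be as simple as the sketch below
(illustrative only; it reuses the patch's `la' naming for the faulting
address and lets get_user() supply the exception fixup that the
hand-written EXC() entries provide today):

/* Sketch only: read the two aligned longwords spanning the unaligned
   user address, with no inline asm in sight.  */
static int ll_read_aligned_pair(unsigned long la, u32 *lo, u32 *hi)
{
        u32 __user *p = (u32 __user *)(la & ~3ul);

        if (get_user(*lo, p) || get_user(*hi, p + 1))
                return -EFAULT;
        return 0;
}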

On the st_c side, I think you're repeating the same inline asm twice,
and should have a single helper.

Is this a NAK for the patch? No. But I do think it should be massaged a bit.

            Linus
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Maciej W. Rozycki 10 months ago
On Wed, 19 Feb 2025, Linus Torvalds wrote:

> > 1. A trapping unaligned LDx_L operation results in the pair of adjacent
> >    aligned whole data quantities spanned being read and stored for the
> >    reference with a subsequent STx_C operation, along with the width of
> >    the data accessed and its virtual address, and the task referring or
> >    NULL if the kernel.  The validity marker is set.
> 
> So I have a couple of comments. I don't care deeply because I do think
> alpha is dead, but for completeness:
> 
>  (a) I don't think the task checking is correct.
> 
> You're only checking the task pointer, so what can happen is that a
> task exits and another starts up with the same task pointer value, and
> it all matches across one task doing a ld_l and another doing a st_c.
> 
> Does this matter? No. You'd literally have to *try* to create that
> situation with identical mis-aligned addresses and data contents, and
> an exit after a 'ld_l', and doing a 'st_c' in the new task without the
> matching ld_l.
> 
> So I suspect this works in practice, but it's still worth mentioning.

 It's an interesting corner case, but executing STx_C without a preceding 
matching LDx_L is hardly useful and, depending on the circumstances, may or 
may not cause unpredictable results.  We can make it a part of the psABI 
that such usage is not supported.  Otherwise we could clear `ll_bit' in a 
task's termination path.
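
 For illustration, the latter could be as small as the sketch below,
reusing the emulation state from the patch; where exactly on the exit
path it would be hooked is left open here and is not something v1 wires
up:

/* Sketch only: drop any pending LL reservation owned by an exiting
   task, so that a recycled task_struct cannot satisfy a stale STx_C.  */
static void ll_invalidate(struct task_struct *tsk)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&ll_lock, flags);
        if (ll_task == tsk)
                ll_bit = false;
        raw_spin_unlock_irqrestore(&ll_lock, flags);
}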

>  (b) this is not truly atomic wrt concurrent aligned non-trapping
> operations to the same words. Or in fact to concurrent trapping ones,
> since you end up inevitably releasing the spinlock before the final
> stc emulation.

 It is not supposed to be.  The only objective of this code is to protect 
the *unchanged* part of the longword/quadword.

> I think this is fundamental and non-fixable, because the stc is done
> as two operations, and the first can succeed with the second failing
> (or they can both succeed, just interleaved with other accesses).

 It is absolutely fine and by design.  Atomicity of the unaligned store of 
the quantity itself is not guaranteed by this sequence, just as it is not 
with the original one using a pair of STQ_U operations, where a concurrent 
write to the same location may cause the parts of the quantity to become 
inconsistent with each other.

 If the trapping code wanted to make the data quantity accessed atomic, it 
should have declared it atomic as well as made it aligned, in which case 
the compiler would have done the right thing, including padding/separation 
from other data objects where necessary for the minimum width of data that 
hardware can guarantee atomicity for.  And then we would not have arrived 
in this emulation path in the first place.
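
 To illustrate (an example of mine rather than anything from the patch):
an object meant to be updated atomically would be declared along these
lines, and the compiler then guarantees alignment and emits a regular
aligned LDL_L/STL_C loop that never reaches this emulation path:

#include <stdatomic.h>

/* Illustrative only: _Atomic guarantees suitable alignment, and on
   Alpha the increment compiles to an aligned LDL_L/STL_C loop, so no
   unaligned trap and hence no emulation is ever involved.  */
static _Atomic unsigned int counter;

static void bump(void)
{
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}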

> Again, I don't think we care, and it works in practice, but it does
> mean that I *really* think that:
> 
>  (c) you should not handle the kernel case at all.
> 
> If the kernel does an unaligned ld_l/st_c, that's a fundamental kernel
> bug. Don't emulate it. Particularly when the emulation fundamentally
> is not truly atomic wrt other accesses.

 Good point actually, I think I mentally drove myself into a dead end 
here.  Yes, absolutely, it is not expected to happen unless we have a bug 
in our code somewhere!

> Finally:
> 
>  (d) I think you're doing a few too many inline asms by hand, and
> you're masking the results too much.
> 
> On the read-side emulation, why do you do that
> 
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       srl %3,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %1,15,%1\n"
> +               "       zapnot %2,15,%2\n"
> 
> at all? Just do two aligned loads, and don't mask the bytes around
> them. A *real* ldl/stc will fail not just when the exact bytes are
> different, but when somebody has touched the same cacheline. So if the
> aligned values have changed, you should fail the stc even if the
> change was in other bytes.

 We do need to extract the bytes desired for the result of LDx_L though.

 Yes, it can be done in C, but the same stands for all the other emulation
pieces here and yet they use inline assembly.  GCC is happy to do byte 
extraction itself where necessary; it has machine description patterns for 
that, as it is a fairly common operation (it does not for INSxH and MSKxH 
though, they're handled as intrinsics only, which however we could use 
instead).

 I think there is value in consistency, and with this piece written as 
inline assembly you can spot the difference from the other variants right 
away.  Or I could rewrite the byte extraction in C across other patterns.
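
 For reference, the read-side extraction done in C would amount to
something like the sketch below (mine, for the LDL_L operand layout; not
what I intend to post as-is):

/* Sketch only: merge an unaligned longword from the two aligned
   longwords lo/hi that span it, then sign-extend as LDL_L would.  */
static inline long ldl_l_result(u32 lo, u32 hi, unsigned long la)
{
        unsigned int shift = (la & 3) * 8;
        u32 v = lo >> shift;

        if (shift)                      /* shift == 0: aligned access */
                v |= hi << (32 - shift);
        return (int)v;
}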

> And doing two aligned loads doesn't need any inline asm at all.

 Neither do unaligned loads, as GCC is happy to emit LDQ_U itself where 
necessary, but we want to catch exceptions or we'd get an oops rather than 
SIGSEGV.

> On the st_c side, I think you're repeating the same inline asm twice,
> and should have a single helper.

 The masks are different, as are the displacements, but a good point 
otherwise.  I think this guarantees that the prologue to the atomic loop 
goes away, which is a good thing, but I'll still double-check the quality 
of the code produced.
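
 Shape-wise I have something like the sketch below in mind, shown here
with cmpxchg() for brevity and with the user-access fixups the real
helper needs left out; the two call sites would then differ only in the
address, the pre-shifted store data and the byte mask passed:

/* Sketch only: install the merged longword only if the location still
   holds the value sampled by the emulated LDL_L; cmpxchg() internally
   retries on a spurious STL_C failure.  */
static bool stl_c_half(u32 *addr, u32 old, u32 store_data, u32 mask)
{
        u32 val = (old & ~mask) | (store_data & mask);

        return cmpxchg(addr, old, val) == old;
}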

> Is this a NAK for the patch? No. But I do think it should be massaged a bit.

 Thank you for your review, I'll post v2 once I'm done with the massage.

  Maciej
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Maciej W. Rozycki 8 months, 2 weeks ago
On Thu, 20 Feb 2025, Maciej W. Rozycki wrote:

> > Again, I don't think we care, and it works in practice, but it does
> > mean that I *really* think that:
> > 
> >  (c) you should not handle the kernel case at all.
> > 
> > If the kernel does an unaligned ld_l/st_c, that's a fundamental kernel
> > bug. Don't emulate it. Particularly when the emulation fundamentally
> > is not truly atomic wrt other accesses.
> 
>  Good point actually, I think I mentally drove myself into a dead end 
> here.  Yes, absolutely, it is not expected to happen unless we have a bug 
> in our code somewhere!

 Well, I take it back.  I knew I had something in mind while writing this 
code, just couldn't remember it at the time of the reply, and it came back 
to me at a mental relief break: it's networking.

 It's been decades, but I remember a discussion from when Alpha was still 
a thing, and I recall it was DaveM saying we chose to use aligned 
frame/packet header accesses for performance reasons even where we don't 
know beforehand whether a given piece of networking data will actually be 
aligned, so as not to penalise the common aligned case.

 Here's one reference I could track down: "TCP SYNs broken in 2.3.41", 
<https://marc.info/?l=linux-kernel&m=94927689929463> (not present on lore; 
I have the original thread of discussion with complete e-mail headers in 
my personal archive though), likely the very one I remembered.

 So unless I'm proved otherwise (e.g. that all such code paths are now 
gone from networking, which may or may not be the case: I saw IPX go but I 
can see AppleTalk still around; or that no sub-longword accesses are ever 
used in the relevant networking paths), I'm going to keep kernel emulation 
in v2, because what used to be just an unaligned LDQ/STQ pair, which we 
trapped on and emulated, will now become an LDQ_L/STQ_C loop.

 Do you happen to know what the situation is here?

 NB my network usage with my Alpha environment is virtually exclusively 
FDDI, and there the wrong-alignment phenomena may or may not be present at 
all for the various networking protocols, owing to different link-layer 
framing (including the Packet Request Header, which the Motorola FDDI 
chipset uses the way it does for the very purpose of keeping things 
aligned).  I don't know offhand, my FDDI-fu is a bit dusty even in the 
areas I have already explored, and this is a new one; I've certainly not 
tried (to learn anything about) AppleTalk over FDDI (or indeed otherwise).

 All in all, the lack of kernel unaligned traps in my setup, and hence no 
recorded way to trigger them, merely means I haven't used any of the stuff 
that used to cause them.

  Maciej
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Linus Torvalds 8 months, 2 weeks ago
On Mon, 7 Apr 2025 at 13:46, Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
>  So unless I'm proved otherwise (e.g. that all such code paths are now
> gone from networking, which may or may not be the case: I saw IPX go but I
> can see AppleTalk still around; or that no sub-longword accesses are ever
> used in the relevant networking paths), I'm going to keep kernel emulation
> in v2, because what just used to be wrapped in an unaligned LDQ/STQ pair,
> which we trapped on and emulated, will now become an LDQ_L/STQ_C loop.
>
>  Do you happen to know what the situation is here?

I think networking ends up using 'get_unaligned()' properly for header
accesses these days for any of this.
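
For example (illustrative only, not pointing at any particular call
site), a header field gets read along these lines, with get_unaligned()
telling the compiler not to assume natural alignment, so no trap is
ever taken:

#include <linux/ip.h>
#include <linux/unaligned.h>    /* <asm/unaligned.h> in older trees */

/* Illustrative only: safe access to a possibly misaligned header field.  */
static inline __be32 ip_hdr_saddr(const struct iphdr *iph)
{
        return get_unaligned(&iph->saddr);
}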

If you don't, some architectures will literally silently give you
garbage back and not even fault.

Admittedly that's mainly some really broken old 32-bit ARM stuff and
hopefully it's all dead by now.

So unless you actually *see* the unaligned faults, I really think you
shouldn't emulate them.

And I'd like to know where they are if you do see them.

             Linus
Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
Posted by Magnus Lindholm 10 months ago
Hi,

Very impressive work, much appreciated!

I have the patch applied on a system running
6.14.0-rc3-00061-gd4ca652e53ca. You mentioned that you have a
test suite / small program. If you can provide it to me, I can do some
testing on my ES40.

Regards

Magnus Lindholm

On Wed, Feb 19, 2025 at 1:46 PM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15, implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
>
> The principle of operation is as follows:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>    aligned whole data quantities spanned being read and stored for the
>    reference with a subsequent STx_C operation, along with the width of
>    the data accessed and its virtual address, and the task referring or
>    NULL if the kernel.  The validity marker is set.
>
> 2. Regular memory load operations are used to retrieve data because no
>    atomicity is needed at this stage, and matching the width accessed,
>    either LDQ_U or LDL even though the latter instruction requires extra
>    operations, to avoid the case where an unaligned longword located
>    entirely within an aligned quadword would complicate handling.
>
> 3. Data is masked, shifted and merged appropriately and returned in the
>    intended register as the result of the trapping LDx_L instruction.
>
> 4. A trapping unaligned STx_C operation results in the validity marker
>    being checked for being true, and the width of the data accessed
>    along with the virtual address and the task referring or the kernel
>    for a match.  The pair of whole data quantities previously read by
>    LDx_L emulation is retrieved and the validity marker is cleared.
>
> 5. If the checks succeeded, then in an atomic loop the location of the
>    first whole data quantity is reread, and data retrieved compared with
>    the value previously obtained.  If there's no match, then the loop is
>    aborted and 0 is returned in the intended register as the result of
>    the trapping STx_C instruction and emulation completes.  Otherwise
>    new data obtained from the source operand of STx_C is combined with
>    the data retrieved, replacing by byte insertion the part intended,
>    and an atomic write of this new data is attempted.  If it fails, the
>    loop continues from the beginning.  Otherwise processing proceeds to
>    the next step.
>
> 6. The same operations are performed on the second whole data quantity.
>
> 7. At this point both whole data quantities have been written, ensuring
>    that no third-party intervening write has changed them at the point
>    of the write from the values held at previous LDx_L.  Therefore 1 is
>    returned in the intended register as the result of the trapping STx_C
>    instruction.
>
> 8. No user accesses are permitted in traps from the kernel mode as the
>    only LDx_L/STx_C accesses made to user memory locations by the kernel
>    are supposed to be those from handcrafted code, which has to be written
>    such as not to trap.
>
> Since atomic loops are used for data updates the approach works equally
> well in both UP and SMP environments.  No data atomicity is guaranteed,
> but data consistency is, that is concurrent RMW accesses won't clobber
> each other, however if the same data is concurrently written as already
> there with a regular write between emulated LDx_L and STx_C, then STx_C
> will still succeed.  Likewise if data is modified, but then restored
> before STx_C has had a chance to run.
>
> This fulfils consistency requirements and guarantees that data outside
> the quantity written has not changed between emulated LDx_L and STx_C.
>
> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
> ---
> Hi,
>
>  This has cleared the pair of `-msafe-bwa -msafe-partial' regressions
> observed in GCC verification (the third one was a Modula 2 frontend bug,
> now fixed in the compiler).  I have verified individual misalignments with
> a small program by hand as well, for both the data retrieved by emulated
> LDx_L and the data stored by emulated STx_C.
>
>  The kernel itself built with `-mcpu=ev4 -msafe-bwa -msafe-partial' boots
> and has passed GCC verification, and triggered no extra unaligned traps.
>
>  Full verification was run with 6.3.0-rc5 and Ivan's stack alignment fixes
> applied just because I was confident already that version works correctly.
> Interestingly enough no kernel mode traps have triggered with a kernel
> built with GCC 12 (and with most user traps coming from GCC verification):
>
> kernel unaligned acc    : 0 (pc=0,va=0)
> user unaligned acc      : 1766720 (pc=20000053064,va=120020189)
>
> but with GCC 15 a small quantity happened (even before I ran GCC testing):
>
> kernel unaligned acc    : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
> user unaligned acc      : 883452 (pc=20000053064,va=120020189)
>
> It seems a compiler regression worth checking -- the trap recorded was in
> `icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr',
> which however by its type (`struct in6_addr') is only longword-aligned and
> indeed starts at offset 148 from the outermost struct.  I have a sneaking
> suspicion one of my earlier GCC changes might be at fault.  At least I now
> have a test case to experiment with.
>
>  I've also built and booted 6.9.0-rc3 as at commit 82c525bfafb4 ("alpha:
> trim the unused stuff from asm-offsets.c"), the last one before support
> for my system was axed.  It has passed the verification with my small
> program (available by request; I'm not sure if it's worth turning into a
> kernel selftest).
>
>  NB I'm going to ignore the 72 errors checkpatch.pl issues for EXC usage.
> The coding style of the new additions is consistent with the rest of the
> file and any change to that would best be made separately (but I fail to
> see the point).
>
>  Questions, comments, concerns?  Otherwise please apply, and I'll proceed
> with the rest of the GCC effort, followed by cleaning handwritten assembly
> up that uses STQ_U in our port and in glibc.
>
>   Maciej
> ---
>  arch/alpha/kernel/traps.c |  409 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 400 insertions(+), 9 deletions(-)
>
> linux-alpha-llsc-unaligned.diff
> Index: linux-macro/arch/alpha/kernel/traps.c
> ===================================================================
> --- linux-macro.orig/arch/alpha/kernel/traps.c
> +++ linux-macro/arch/alpha/kernel/traps.c
> @@ -368,6 +368,13 @@ struct unaligned_stat {
>         unsigned long count, va, pc;
>  } unaligned[2];
>
> +/* Unaligned LDx_L/STx_C emulation state.  */
> +static DEFINE_RAW_SPINLOCK(ll_lock);
> +static struct task_struct *ll_task;
> +static unsigned long ll_data[2];
> +static unsigned long ll_va;
> +static bool ll_quad;
> +static bool ll_bit;
>
>  /* Macro for exception fixup code to access integer registers.  */
>  #define una_reg(r)  (_regs[(r) >= 16 && (r) <= 18 ? (r)+19 : (r)])
> @@ -381,6 +388,9 @@ do_entUna(void * va, unsigned long opcod
>         unsigned long pc = regs->pc - 4;
>         unsigned long *_regs = regs->regs;
>         const struct exception_table_entry *fixup;
> +       unsigned long flags;
> +       unsigned long la;
> +       bool ll_match;
>
>         unaligned[0].count++;
>         unaligned[0].va = (unsigned long) va;
> @@ -439,6 +449,65 @@ do_entUna(void * va, unsigned long opcod
>                 una_reg(reg) = tmp1|tmp2;
>                 return;
>
> +       case 0x2a: /* ldl_l */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               __asm__ __volatile__(
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       srl %3,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %1,15,%1\n"
> +               "       zapnot %2,15,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(la & ~3ul),
> +                         "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> +               if (error)
> +                       goto got_exception;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = NULL;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = false;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               una_reg(reg) = (int)(tmp1|tmp2);
> +               return;
> +
> +       case 0x2b: /* ldq_l */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               __asm__ __volatile__(
> +               "1:     ldq_u %3,0(%5)\n"
> +               "2:     ldq_u %4,7(%5)\n"
> +               "       extql %3,%5,%1\n"
> +               "       extqh %4,%5,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(va), "0"(0));
> +               if (error)
> +                       goto got_exception;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = NULL;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = true;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               una_reg(reg) = tmp1|tmp2;
> +               return;
> +
>         /* Note that the store sequences do not indicate that they change
>            memory because it _should_ be affecting nothing in this context.
>            (Otherwise we have other, much larger, problems.)  */
> @@ -513,6 +582,134 @@ do_entUna(void * va, unsigned long opcod
>                 if (error)
>                         goto got_exception;
>                 return;
> +
> +       case 0x2e: /* stl_c */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= !ll_quad;
> +               ll_match &= ll_task == NULL;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       srl %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,4(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,4(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> +                                 "r"(una_reg(reg)), "r"(tmp4),
> +                                 "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       sll %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,0(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,0(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((la & 3) * 8),
> +                                 "r"(una_reg(reg)), "r"(tmp3),
> +                                 "r"((15 << (la & 3)) & 0xf), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               una_reg(reg) = ll_match;
> +               return;
> +
> +       case 0x2f: /* stq_c */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= ll_quad;
> +               ll_match &= ll_task == NULL;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insqh %6,%4,%3\n"
> +                       "1:     ldq_l %2,8(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskqh %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,8(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(una_reg(reg)), "r"(tmp4), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insql %6,%4,%3\n"
> +                       "1:     ldq_l %2,0(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskql %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,0(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(una_reg(reg)), "r"(tmp3), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               una_reg(reg) = ll_match;
> +               return;
>         }
>
>         printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
> @@ -624,24 +821,33 @@ s_reg_to_mem (unsigned long s_reg)
>   * so finding the appropriate registers is a little more difficult
>   * than in the kernel case.
>   *
> - * Finally, we handle regular integer load/stores only.  In
> - * particular, load-linked/store-conditionally and floating point
> - * load/stores are not supported.  The former make no sense with
> - * unaligned faults (they are guaranteed to fail) and I don't think
> - * the latter will occur in any decent program.
> + * We have three classes of operations to handle:
>   *
> - * Sigh. We *do* have to handle some FP operations, because GCC will
> - * uses them as temporary storage for integer memory to memory copies.
> - * However, we need to deal with stt/ldt and sts/lds only.
> + * - We handle regular integer load/stores transparently to faulting
> + *   code, preserving the semantics of the triggering instruction.
> + *
> + * - We handle some FP operations as well, because GCC will use them as
> + *   temporary storage for integer memory to memory copies.  However,
> + *   we need to deal with stt/ldt and sts/lds only.
> + *
> + * - We handle load-locked/store-conditional operations by maintaining
> + *   data consistency only, within the two adjacent longwords or
> + *   quadwords partially spanned.  This is sufficient to guarantee an
> + *   unaligned RMW sequence using these operations won't clobber data
> + *   *outside* the location intended but does *not* guarantee atomicity
> + *   for the data quantity itself.
>   */
>
>  #define OP_INT_MASK    ( 1L << 0x28 | 1L << 0x2c   /* ldl stl */       \
> +                       | 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */   \
>                         | 1L << 0x29 | 1L << 0x2d   /* ldq stq */       \
> +                       | 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */   \
>                         | 1L << 0x0c | 1L << 0x0d   /* ldwu stw */      \
>                         | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
>
>  #define OP_WRITE_MASK  ( 1L << 0x26 | 1L << 0x27   /* sts stt */       \
>                         | 1L << 0x2c | 1L << 0x2d   /* stl stq */       \
> +                       | 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */   \
>                         | 1L << 0x0d | 1L << 0x0e ) /* stw stb */
>
>  #define R(x)   ((size_t) &((struct pt_regs *)0)->x)
> @@ -666,6 +872,9 @@ do_entUnaUser(void __user * va, unsigned
>
>         unsigned long tmp1, tmp2, tmp3, tmp4;
>         unsigned long fake_reg, *reg_addr = &fake_reg;
> +       unsigned long flags;
> +       unsigned long la;
> +       bool ll_match;
>         int si_code;
>         long error;
>
> @@ -794,6 +1003,61 @@ do_entUnaUser(void __user * va, unsigned
>                 *reg_addr = tmp1|tmp2;
>                 break;
>
> +       case 0x2a: /* ldl_l */
> +               la = (unsigned long)va;
> +               __asm__ __volatile__(
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       zapnot %3,15,%1\n"
> +               "       srl %1,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %2,15,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(la & ~3ul),
> +                         "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> +               if (error)
> +                       goto give_sigsegv;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = current;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = false;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               *reg_addr = (int)(tmp1|tmp2);
> +               break;
> +
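
 Stated in plain C, the longword value the emulation above hands back is,
 I believe, equivalent to the following sketch (unaligned_ldl_value(),
 "lo" and "hi" are invented names for the two aligned longwords just
 loaded; only a misaligned address can trap, so k is 1..3 and the shifts
 stay in range):

static inline int unaligned_ldl_value(unsigned int lo, unsigned int hi,
                                      unsigned long la)
{
        unsigned int k = la & 3;        /* 1..3 for a trapping access */

        /* Low 4 - k bytes from "lo", high k bytes from "hi", then
           sign-extend the 32-bit result as LDL_L itself would.  */
        return (int)((lo >> (8 * k)) | (hi << (8 * (4 - k))));
}
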
> +       case 0x2b: /* ldq_l */
> +               la = (unsigned long)va;
> +               __asm__ __volatile__(
> +               "1:     ldq_u %3,0(%5)\n"
> +               "2:     ldq_u %4,7(%5)\n"
> +               "       extql %3,%5,%1\n"
> +               "       extqh %4,%5,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(va), "0"(0));
> +               if (error)
> +                       goto give_sigsegv;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = current;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = true;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               *reg_addr = tmp1|tmp2;
> +               break;
> +
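
 And the quadword counterpart of the same sketch, mirroring the
 extql/extqh pair (invented names again; unsigned long is assumed to be
 the 64-bit quadword type, as it is on Alpha):

static inline unsigned long unaligned_ldq_value(unsigned long lo,
                                                unsigned long hi,
                                                unsigned long la)
{
        unsigned int k = la & 7;        /* 1..7 for a trapping access */

        /* Low 8 - k bytes from "lo", high k bytes from "hi".  */
        return (lo >> (8 * k)) | (hi << (8 * (8 - k)));
}
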
>         /* Note that the store sequences do not indicate that they change
>            memory because it _should_ be affecting nothing in this context.
>            (Otherwise we have other, much larger, problems.)  */
> @@ -877,12 +1141,139 @@ do_entUnaUser(void __user * va, unsigned
>                         goto give_sigsegv;
>                 return;
>
> +       case 0x2e: /* stl_c */
> +               la = (unsigned long)va;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= !ll_quad;
> +               ll_match &= ll_task == current;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       srl %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,4(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,4(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> +                                 "r"(*reg_addr), "r"(tmp4),
> +                                 "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       sll %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,0(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,0(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((la & 3) * 8),
> +                                 "r"(*reg_addr), "r"(tmp3),
> +                                 "r"((15 << (la & 3)) & 0xf), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               *reg_addr = ll_match;
> +               break;
> +
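
 The byte merging done by the two atomic loops above amounts to the
 following, written out as plain C for clarity (merge_stl_low() and
 merge_stl_high() are invented names; "old" is the aligned longword
 reread by LDL_L, "val" the STL_C source operand, and k = la & 3 is
 nonzero for any trapping access):

/* First aligned longword: keep bytes 0 .. k-1 of "old", replace bytes
   k .. 3 with the low 4 - k bytes of "val" (the sll/zapnot/zap/or
   sequence).  */
static inline unsigned int merge_stl_low(unsigned int old, unsigned int val,
                                         unsigned int k)
{
        unsigned int keep = (1U << (8 * k)) - 1;        /* bytes 0 .. k-1 */

        return (old & keep) | (val << (8 * k));
}

/* Second aligned longword: keep bytes k .. 3 of "old", replace bytes
   0 .. k-1 with the high k bytes of "val" (the srl/zapnot/zap/or
   sequence).  */
static inline unsigned int merge_stl_high(unsigned int old, unsigned int val,
                                          unsigned int k)
{
        unsigned int repl = (1U << (8 * k)) - 1;        /* bytes 0 .. k-1 */

        return (old & ~repl) | (val >> (8 * (4 - k)));
}
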
> +       case 0x2f: /* stq_c */
> +               la = (unsigned long)va;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= ll_quad;
> +               ll_match &= ll_task == current;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insqh %6,%4,%3\n"
> +                       "1:     ldq_l %2,8(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskqh %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,8(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(*reg_addr), "r"(tmp4), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insql %6,%4,%3\n"
> +                       "1:     ldq_l %2,0(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskql %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,0(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(*reg_addr), "r"(tmp3), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               *reg_addr = ll_match;
> +               break;
> +
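
 Likewise for the quadword loops, where the insql/mskql and insqh/mskqh
 pairs boil down to the sketch below (invented names, same conventions as
 the longword sketch above, with k = la & 7, nonzero):

static inline unsigned long merge_stq_low(unsigned long old,
                                          unsigned long val, unsigned int k)
{
        unsigned long keep = (1UL << (8 * k)) - 1;      /* bytes 0 .. k-1 */

        return (old & keep) | (val << (8 * k));
}

static inline unsigned long merge_stq_high(unsigned long old,
                                           unsigned long val, unsigned int k)
{
        unsigned long repl = (1UL << (8 * k)) - 1;      /* bytes 0 .. k-1 */

        return (old & ~repl) | (val >> (8 * (8 - k)));
}
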
>         default:
>                 /* What instruction were you trying to use, exactly?  */
>                 goto give_sigbus;
>         }
>
> -       /* Only integer loads should get here; everyone else returns early. */
> +       /*
> +        * Only integer loads and stores conditional should get here;
> +        * everyone else returns early.
> +        */
>         if (reg == 30)
>                 wrusp(fake_reg);
>         return;
>