[PATCH] Alpha: Emulate unaligned LDx_L/STx

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
@ 2025-02-19 12:46 Maciej W. Rozycki
  2025-02-19 17:38 ` Magnus Lindholm
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Maciej W. Rozycki @ 2025-02-19 12:46 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner
  Cc: Arnd Bergmann, John Paul Adrian Glaubitz, Magnus Lindholm,
	Paul E. McKenney, Linus Torvalds, Al Viro, linux-alpha,
	linux-kernel

Complementing compiler support for the `-msafe-bwa' and `-msafe-partial' 
code generation options slated to land in GCC 15, implement emulation 
for unaligned LDx_L and STx_C operations for the unlikely case where an 
alignment violation has resulted from improperly written code and caused 
these operations to trap in atomic RMW memory access sequences made to 
provide data consistency for non-BWX byte and word write operations, and 
writes to unaligned data objects causing partial memory updates.

The principle of operation is as follows:

1. A trapping unaligned LDx_L operation results in the pair of adjacent 
   aligned whole data quantities spanned being read and stored for the 
   reference with a subsequent STx_C operation, along with the width of 
   the data accessed and its virtual address, and the task referring or 
   NULL if the kernel.  The valitidy marker is set.

2. Regular memory load operations are used to retrieve data because no 
   atomicity is needed at this stage, and matching the width accessed, 
   either LDQ_U or LDL even though the latter instruction requires extra 
   operations, to avoid the case where an unaligned longword located 
   entirely within an aligned quadword would complicate handling.

3. Data is masked, shifted and merged appropriately and returned in the
   intended register as the result of the trapping LDx_L instruction.

4. A trapping unaligned STx_C operation results in the valitidy marker 
   being checked for being true, and the width of the data accessed 
   along with the virtual address and the task referring or the kernel 
   for a match.  The pair of whole data quantities previously read by 
   LDx_L emulation is retrieved and the valitidy marker is cleared.

5. If the checks succeeded, then in an atomic loop the location of the 
   first whole data quantity is reread, and data retrieved compared with 
   the value previously obtained.  If there's no match, then the loop is 
   aborted and 0 is returned in the intended register as the result of 
   the trapping STx_C instruction and emulation completes.  Otherwise 
   new data obtained from the source operand of STx_C is combined with 
   the data retrieved, replacing by byte insertion the part intended, 
   and an atomic write of this new data is attempted.  If it fails, the 
   loop continues from the beginning.  Otherwise processing proceeds to 
   the next step.

6. The same operations are performed on the second whole data quantity.

7. At this point both whole data quantities have been written, ensuring 
   that no third-party intervening write has changed them at the point 
   of the write from the values held at previous LDx_L.  Therefore 1 is 
   returned in the intended register as the result of the trapping STx_C 
   instruction.

8. No user accesses are permitted in traps from the kernel mode as the
   only LDx_L/STx_C accesses made to user memory locations by the kernel 
   are supposed to be those from handcrafted code, which has to written 
   such as not to trap.

Since atomic loops are used for data updates the approach works equally 
well in both UP and SMP environments.  No data atomicity is guaranteed, 
but data consistency is, that is concurrent RMW accesses won't clobber 
each other, however if the same data is concurrently written as already 
there with a regular write between emulated LDx_L and STx_C, then STx_C 
will still succeed.  Likewise if data is modified, but then restored 
before STx_C has had a chance to run.

This fulfils consistency requirements and guarantees that data outside 
the quantity written has not changed between emulated LDx_L and STx_C.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
---
Hi,

 This has cleared the pair of `-msafe-bwa -msafe-partial' regressions 
observed in GCC verification (the third one was a Modula 2 frontend bug, 
now fixed in the compiler).  I have verified individual misalignments with 
a small program by hand as well, for both the data retrieved by emulated 
LDx_L and the data stored by emulated STx_C.

 The kernel itself built with `-mcpu=ev4 -msafe-bwa -msafe-partial' boots 
and has passed GCC verification, and triggered no extra unaligned traps.

 Full verification was run with 6.3.0-rc5 and Ivan's stack alignment fixes 
applied just because I was confident already that version works correctly.  
Interestingly enough no kernel mode traps have triggered with a kernel 
built with GCC 12 (and with most user traps coming from GCC verification):

kernel unaligned acc	: 0 (pc=0,va=0)
user unaligned acc	: 1766720 (pc=20000053064,va=120020189)

but with GCC 15 a small quantity happened (even before I ran GCC testing):

kernel unaligned acc    : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
user unaligned acc      : 883452 (pc=20000053064,va=120020189)

It seems a compiler regression worth checking -- the trap recorded was in 
`icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr', 
which however by its type (`struct in6_addr') is only longword-aligned and 
indeed starts at offset 148 from the outermost struct.  I have a sneaking 
suspicion one of my earlier GCC changes might be at fault.  At least I now 
have a test case to experiment with.

 I've also built and booted 6.9.0-rc3 as at commit 82c525bfafb4 ("alpha: 
trim the unused stuff from asm-offsets.c"), the last one before support 
for my system was axed.  It has passed the verification with my small 
program (available by request; I'm not sure if it's worth turning into a 
kernel selftest).

 NB I'm going to ignore the 72 errors checkpatch.pl issues for EXC usage.  
The coding style of the new additions is consistent with the rest of the 
file and any change to that would best be made separately (but I fail to 
see the point).

 Questions, comments, concerns?  Otherwise please apply, and I'll proceed 
with the rest of the GCC effort, followed by cleaning handwritten assembly 
up that uses STQ_U in our port and in glibc.

  Maciej
---
 arch/alpha/kernel/traps.c |  409 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 400 insertions(+), 9 deletions(-)

linux-alpha-llsc-unaligned.diff
Index: linux-macro/arch/alpha/kernel/traps.c
===================================================================
--- linux-macro.orig/arch/alpha/kernel/traps.c
+++ linux-macro/arch/alpha/kernel/traps.c
@@ -368,6 +368,13 @@ struct unaligned_stat {
 	unsigned long count, va, pc;
 } unaligned[2];
 
+/* Unaligned LDx_L/STx_C emulation state.  */
+static DEFINE_RAW_SPINLOCK(ll_lock);
+static struct task_struct *ll_task;
+static unsigned long ll_data[2];
+static unsigned long ll_va;
+static bool ll_quad;
+static bool ll_bit;
 
 /* Macro for exception fixup code to access integer registers.  */
 #define una_reg(r)  (_regs[(r) >= 16 && (r) <= 18 ? (r)+19 : (r)])
@@ -381,6 +388,9 @@ do_entUna(void * va, unsigned long opcod
 	unsigned long pc = regs->pc - 4;
 	unsigned long *_regs = regs->regs;
 	const struct exception_table_entry *fixup;
+	unsigned long flags;
+	unsigned long la;
+	bool ll_match;
 
 	unaligned[0].count++;
 	unaligned[0].va = (unsigned long) va;
@@ -439,6 +449,65 @@ do_entUna(void * va, unsigned long opcod
 		una_reg(reg) = tmp1|tmp2;
 		return;
 
+	case 0x2a: /* ldl_l */
+		la = (unsigned long)va;
+		if (la < TASK_SIZE)
+			break;
+		__asm__ __volatile__(
+		"1:	ldl %3,0(%5)\n"
+		"2:	ldl %4,4(%5)\n"
+		"	srl %3,%6,%1\n"
+		"	sll %4,%7,%2\n"
+		"	zapnot %1,15,%1\n"
+		"	zapnot %2,15,%2\n"
+		"3:\n"
+		EXC(1b,3b,%1,%0)
+		EXC(2b,3b,%2,%0)
+			: "=r"(error),
+			  "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+			: "r"(la & ~3ul),
+			  "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
+		if (error)
+			goto got_exception;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_va = la;
+		ll_task = NULL;
+		ll_data[0] = tmp3;
+		ll_data[1] = tmp4;
+		ll_quad = false;
+		ll_bit = true;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		una_reg(reg) = (int)(tmp1|tmp2);
+		return;
+
+	case 0x2b: /* ldq_l */
+		la = (unsigned long)va;
+		if (la < TASK_SIZE)
+			break;
+		__asm__ __volatile__(
+		"1:	ldq_u %3,0(%5)\n"
+		"2:	ldq_u %4,7(%5)\n"
+		"	extql %3,%5,%1\n"
+		"	extqh %4,%5,%2\n"
+		"3:\n"
+		EXC(1b,3b,%1,%0)
+		EXC(2b,3b,%2,%0)
+			: "=r"(error),
+			  "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+			: "r"(va), "0"(0));
+		if (error)
+			goto got_exception;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_va = la;
+		ll_task = NULL;
+		ll_data[0] = tmp3;
+		ll_data[1] = tmp4;
+		ll_quad = true;
+		ll_bit = true;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		una_reg(reg) = tmp1|tmp2;
+		return;
+
 	/* Note that the store sequences do not indicate that they change
 	   memory because it _should_ be affecting nothing in this context.
 	   (Otherwise we have other, much larger, problems.)  */
@@ -513,6 +582,134 @@ do_entUna(void * va, unsigned long opcod
 		if (error)
 			goto got_exception;
 		return;
+
+	case 0x2e: /* stl_c */
+		la = (unsigned long)va;
+		if (la < TASK_SIZE)
+			break;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_match = ll_bit;
+		ll_match &= !ll_quad;
+		ll_match &= ll_task == NULL;
+		ll_match &= ll_va == la;
+		tmp3 = ll_data[0];
+		tmp4 = ll_data[1];
+		ll_bit = false;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	srl %6,%5,%3\n"
+			"	zapnot %3,%8,%3\n"
+			"1:	ldl_l %2,4(%4)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	zap %2,%8,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stl_c %1,4(%4)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
+				  "r"(una_reg(reg)), "r"(tmp4),
+				  "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
+			if (error)
+				goto got_exception;
+		}
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	sll %6,%5,%3\n"
+			"	zapnot %3,%8,%3\n"
+			"1:	ldl_l %2,0(%4)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	zap %2,%8,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stl_c %1,0(%4)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(la & ~3ul), "r"((la & 3) * 8),
+				  "r"(una_reg(reg)), "r"(tmp3),
+				  "r"((15 << (la & 3)) & 0xf), "0"(0));
+			if (error)
+				goto got_exception;
+		}
+		una_reg(reg) = ll_match;
+		return;
+
+	case 0x2f: /* stq_c */
+		la = (unsigned long)va;
+		if (la < TASK_SIZE)
+			break;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_match = ll_bit;
+		ll_match &= ll_quad;
+		ll_match &= ll_task == NULL;
+		ll_match &= ll_va == la;
+		tmp3 = ll_data[0];
+		tmp4 = ll_data[1];
+		ll_bit = false;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	insqh %6,%4,%3\n"
+			"1:	ldq_l %2,8(%5)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	mskqh %2,%4,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stq_c %1,8(%5)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(va), "r"(la & ~7ul),
+				  "r"(una_reg(reg)), "r"(tmp4), "0"(0));
+			if (error)
+				goto got_exception;
+		}
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	insql %6,%4,%3\n"
+			"1:	ldq_l %2,0(%5)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	mskql %2,%4,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stq_c %1,0(%5)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(va), "r"(la & ~7ul),
+				  "r"(una_reg(reg)), "r"(tmp3), "0"(0));
+			if (error)
+				goto got_exception;
+		}
+		una_reg(reg) = ll_match;
+		return;
 	}
 
 	printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
@@ -624,24 +821,33 @@ s_reg_to_mem (unsigned long s_reg)
  * so finding the appropriate registers is a little more difficult
  * than in the kernel case.
  *
- * Finally, we handle regular integer load/stores only.  In
- * particular, load-linked/store-conditionally and floating point
- * load/stores are not supported.  The former make no sense with
- * unaligned faults (they are guaranteed to fail) and I don't think
- * the latter will occur in any decent program.
+ * We have three classes of operations to handle:
  *
- * Sigh. We *do* have to handle some FP operations, because GCC will
- * uses them as temporary storage for integer memory to memory copies.
- * However, we need to deal with stt/ldt and sts/lds only.
+ * - We handle regular integer load/stores transparently to faulting
+ *   code, preserving the semantics of the triggering instruction.
+ *
+ * - We handle some FP operations as well, because GCC will use them as
+ *   temporary storage for integer memory to memory copies.  However,
+ *   we need to deal with stt/ldt and sts/lds only.
+ *
+ * - We handle load-locked/store-conditional operations by maintaining
+ *   data consistency only, within the two adjacent longwords or
+ *   quadwords partially spanned.  This is sufficient to guarantee an
+ *   unaligned RMW sequence using these operations won't clobber data
+ *   *outside* the location intended but does *not* guarantee atomicity
+ *   for the data quantity itself.
  */
 
 #define OP_INT_MASK	( 1L << 0x28 | 1L << 0x2c   /* ldl stl */	\
+			| 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */	\
 			| 1L << 0x29 | 1L << 0x2d   /* ldq stq */	\
+			| 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */	\
 			| 1L << 0x0c | 1L << 0x0d   /* ldwu stw */	\
 			| 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
 
 #define OP_WRITE_MASK	( 1L << 0x26 | 1L << 0x27   /* sts stt */	\
 			| 1L << 0x2c | 1L << 0x2d   /* stl stq */	\
+			| 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */	\
 			| 1L << 0x0d | 1L << 0x0e ) /* stw stb */
 
 #define R(x)	((size_t) &((struct pt_regs *)0)->x)
@@ -666,6 +872,9 @@ do_entUnaUser(void __user * va, unsigned
 
 	unsigned long tmp1, tmp2, tmp3, tmp4;
 	unsigned long fake_reg, *reg_addr = &fake_reg;
+	unsigned long flags;
+	unsigned long la;
+	bool ll_match;
 	int si_code;
 	long error;
 
@@ -794,6 +1003,61 @@ do_entUnaUser(void __user * va, unsigned
 		*reg_addr = tmp1|tmp2;
 		break;
 
+	case 0x2a: /* ldl_l */
+		la = (unsigned long)va;
+		__asm__ __volatile__(
+		"1:	ldl %3,0(%5)\n"
+		"2:	ldl %4,4(%5)\n"
+		"	srl %3,%6,%1\n"
+		"	sll %4,%7,%2\n"
+		"	zapnot %1,15,%1\n"
+		"	zapnot %2,15,%2\n"
+		"3:\n"
+		EXC(1b,3b,%1,%0)
+		EXC(2b,3b,%2,%0)
+			: "=r"(error),
+			  "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+			: "r"(la & ~3ul),
+			  "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
+		if (error)
+			goto give_sigsegv;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_va = la;
+		ll_task = current;
+		ll_data[0] = tmp3;
+		ll_data[1] = tmp4;
+		ll_quad = false;
+		ll_bit = true;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		*reg_addr = (int)(tmp1|tmp2);
+		break;
+
+	case 0x2b: /* ldq_l */
+		la = (unsigned long)va;
+		__asm__ __volatile__(
+		"1:	ldq_u %3,0(%5)\n"
+		"2:	ldq_u %4,7(%5)\n"
+		"	extql %3,%5,%1\n"
+		"	extqh %4,%5,%2\n"
+		"3:\n"
+		EXC(1b,3b,%1,%0)
+		EXC(2b,3b,%2,%0)
+			: "=r"(error),
+			  "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
+			: "r"(va), "0"(0));
+		if (error)
+			goto give_sigsegv;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_va = la;
+		ll_task = current;
+		ll_data[0] = tmp3;
+		ll_data[1] = tmp4;
+		ll_quad = true;
+		ll_bit = true;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		*reg_addr = tmp1|tmp2;
+		break;
+
 	/* Note that the store sequences do not indicate that they change
 	   memory because it _should_ be affecting nothing in this context.
 	   (Otherwise we have other, much larger, problems.)  */
@@ -877,12 +1141,139 @@ do_entUnaUser(void __user * va, unsigned
 			goto give_sigsegv;
 		return;
 
+	case 0x2e: /* stl_c */
+		la = (unsigned long)va;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_match = ll_bit;
+		ll_match &= !ll_quad;
+		ll_match &= ll_task == current;
+		ll_match &= ll_va == la;
+		tmp3 = ll_data[0];
+		tmp4 = ll_data[1];
+		ll_bit = false;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	srl %6,%5,%3\n"
+			"	zapnot %3,%8,%3\n"
+			"1:	ldl_l %2,4(%4)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	zap %2,%8,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stl_c %1,4(%4)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
+				  "r"(*reg_addr), "r"(tmp4),
+				  "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
+			if (error)
+				goto give_sigsegv;
+		}
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	sll %6,%5,%3\n"
+			"	zapnot %3,%8,%3\n"
+			"1:	ldl_l %2,0(%4)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	zap %2,%8,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stl_c %1,0(%4)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(la & ~3ul), "r"((la & 3) * 8),
+				  "r"(*reg_addr), "r"(tmp3),
+				  "r"((15 << (la & 3)) & 0xf), "0"(0));
+			if (error)
+				goto give_sigsegv;
+		}
+		*reg_addr = ll_match;
+		break;
+
+	case 0x2f: /* stq_c */
+		la = (unsigned long)va;
+		raw_spin_lock_irqsave(&ll_lock, flags);
+		ll_match = ll_bit;
+		ll_match &= ll_quad;
+		ll_match &= ll_task == current;
+		ll_match &= ll_va == la;
+		tmp3 = ll_data[0];
+		tmp4 = ll_data[1];
+		ll_bit = false;
+		raw_spin_unlock_irqrestore(&ll_lock, flags);
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	insqh %6,%4,%3\n"
+			"1:	ldq_l %2,8(%5)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	mskqh %2,%4,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stq_c %1,8(%5)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(va), "r"(la & ~7ul),
+				  "r"(*reg_addr), "r"(tmp4), "0"(0));
+			if (error)
+				goto give_sigsegv;
+		}
+		if (ll_match) {
+			__asm__ __volatile__(
+			"	insql %6,%4,%3\n"
+			"1:	ldq_l %2,0(%5)\n"
+			"	cmpeq %7,%2,%1\n"
+			"	beq %1,4f\n"
+			"	mskql %2,%4,%2\n"
+			"	or %2,%3,%1\n"
+			"2:	stq_c %1,0(%5)\n"
+			"	beq %1,3f\n"
+			"	.subsection 2\n"
+			"3:	br 1b\n"
+			"	.previous\n"
+			"4:\n"
+			EXC(1b,4b,%2,%0)
+			EXC(2b,4b,%1,%0)
+				: "=r"(error), "=&r"(ll_match),
+				  "=&r"(tmp1), "=&r"(tmp2)
+				: "r"(va), "r"(la & ~7ul),
+				  "r"(*reg_addr), "r"(tmp3), "0"(0));
+			if (error)
+				goto give_sigsegv;
+		}
+		*reg_addr = ll_match;
+		break;
+
 	default:
 		/* What instruction were you trying to use, exactly?  */
 		goto give_sigbus;
 	}
 
-	/* Only integer loads should get here; everyone else returns early. */
+	/*
+	 * Only integer loads and stores conditional should get here;
+	 * everyone else returns early.
+	 */
 	if (reg == 30)
 		wrusp(fake_reg);
 	return;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-19 12:46 [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency Maciej W. Rozycki
@ 2025-02-19 17:38 ` Magnus Lindholm
  2025-02-19 23:40 ` Linus Torvalds
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Magnus Lindholm @ 2025-02-19 17:38 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Arnd Bergmann,
	John Paul Adrian Glaubitz, Paul E. McKenney, Linus Torvalds,
	Al Viro, linux-alpha, linux-kernel

Hi,

Very impressive work! much appreciated!

I have the patch applied on a system running
6.14.0-rc3-00061-gd4ca652e53ca. You mentioned that you have a
testsuite/small program. If you can provide it to me I can do some
testing on my ES40.

Regards

Magnus Lindholm

On Wed, Feb 19, 2025 at 1:46 PM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15, implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
>
> The principle of operation is as follows:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>    aligned whole data quantities spanned being read and stored for the
>    reference with a subsequent STx_C operation, along with the width of
>    the data accessed and its virtual address, and the task referring or
>    NULL if the kernel.  The valitidy marker is set.
>
> 2. Regular memory load operations are used to retrieve data because no
>    atomicity is needed at this stage, and matching the width accessed,
>    either LDQ_U or LDL even though the latter instruction requires extra
>    operations, to avoid the case where an unaligned longword located
>    entirely within an aligned quadword would complicate handling.
>
> 3. Data is masked, shifted and merged appropriately and returned in the
>    intended register as the result of the trapping LDx_L instruction.
>
> 4. A trapping unaligned STx_C operation results in the valitidy marker
>    being checked for being true, and the width of the data accessed
>    along with the virtual address and the task referring or the kernel
>    for a match.  The pair of whole data quantities previously read by
>    LDx_L emulation is retrieved and the valitidy marker is cleared.
>
> 5. If the checks succeeded, then in an atomic loop the location of the
>    first whole data quantity is reread, and data retrieved compared with
>    the value previously obtained.  If there's no match, then the loop is
>    aborted and 0 is returned in the intended register as the result of
>    the trapping STx_C instruction and emulation completes.  Otherwise
>    new data obtained from the source operand of STx_C is combined with
>    the data retrieved, replacing by byte insertion the part intended,
>    and an atomic write of this new data is attempted.  If it fails, the
>    loop continues from the beginning.  Otherwise processing proceeds to
>    the next step.
>
> 6. The same operations are performed on the second whole data quantity.
>
> 7. At this point both whole data quantities have been written, ensuring
>    that no third-party intervening write has changed them at the point
>    of the write from the values held at previous LDx_L.  Therefore 1 is
>    returned in the intended register as the result of the trapping STx_C
>    instruction.
>
> 8. No user accesses are permitted in traps from the kernel mode as the
>    only LDx_L/STx_C accesses made to user memory locations by the kernel
>    are supposed to be those from handcrafted code, which has to written
>    such as not to trap.
>
> Since atomic loops are used for data updates the approach works equally
> well in both UP and SMP environments.  No data atomicity is guaranteed,
> but data consistency is, that is concurrent RMW accesses won't clobber
> each other, however if the same data is concurrently written as already
> there with a regular write between emulated LDx_L and STx_C, then STx_C
> will still succeed.  Likewise if data is modified, but then restored
> before STx_C has had a chance to run.
>
> This fulfils consistency requirements and guarantees that data outside
> the quantity written has not changed between emulated LDx_L and STx_C.
>
> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
> ---
> Hi,
>
>  This has cleared the pair of `-msafe-bwa -msafe-partial' regressions
> observed in GCC verification (the third one was a Modula 2 frontend bug,
> now fixed in the compiler).  I have verified individual misalignments with
> a small program by hand as well, for both the data retrieved by emulated
> LDx_L and the data stored by emulated STx_C.
>
>  The kernel itself built with `-mcpu=ev4 -msafe-bwa -msafe-partial' boots
> and has passed GCC verification, and triggered no extra unaligned traps.
>
>  Full verification was run with 6.3.0-rc5 and Ivan's stack alignment fixes
> applied just because I was confident already that version works correctly.
> Interestingly enough no kernel mode traps have triggered with a kernel
> built with GCC 12 (and with most user traps coming from GCC verification):
>
> kernel unaligned acc    : 0 (pc=0,va=0)
> user unaligned acc      : 1766720 (pc=20000053064,va=120020189)
>
> but with GCC 15 a small quantity happened (even before I ran GCC testing):
>
> kernel unaligned acc    : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
> user unaligned acc      : 883452 (pc=20000053064,va=120020189)
>
> It seems a compiler regression worth checking -- the trap recorded was in
> `icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr',
> which however by its type (`struct in6_addr') is only longword-aligned and
> indeed starts at offset 148 from the outermost struct.  I have a sneaking
> suspicion one of my earlier GCC changes might be at fault.  At least I now
> have a test case to experiment with.
>
>  I've also built and booted 6.9.0-rc3 as at commit 82c525bfafb4 ("alpha:
> trim the unused stuff from asm-offsets.c"), the last one before support
> for my system was axed.  It has passed the verification with my small
> program (available by request; I'm not sure if it's worth turning into a
> kernel selftest).
>
>  NB I'm going to ignore the 72 errors checkpatch.pl issues for EXC usage.
> The coding style of the new additions is consistent with the rest of the
> file and any change to that would best be made separately (but I fail to
> see the point).
>
>  Questions, comments, concerns?  Otherwise please apply, and I'll proceed
> with the rest of the GCC effort, followed by cleaning handwritten assembly
> up that uses STQ_U in our port and in glibc.
>
>   Maciej
> ---
>  arch/alpha/kernel/traps.c |  409 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 400 insertions(+), 9 deletions(-)
>
> linux-alpha-llsc-unaligned.diff
> Index: linux-macro/arch/alpha/kernel/traps.c
> ===================================================================
> --- linux-macro.orig/arch/alpha/kernel/traps.c
> +++ linux-macro/arch/alpha/kernel/traps.c
> @@ -368,6 +368,13 @@ struct unaligned_stat {
>         unsigned long count, va, pc;
>  } unaligned[2];
>
> +/* Unaligned LDx_L/STx_C emulation state.  */
> +static DEFINE_RAW_SPINLOCK(ll_lock);
> +static struct task_struct *ll_task;
> +static unsigned long ll_data[2];
> +static unsigned long ll_va;
> +static bool ll_quad;
> +static bool ll_bit;
>
>  /* Macro for exception fixup code to access integer registers.  */
>  #define una_reg(r)  (_regs[(r) >= 16 && (r) <= 18 ? (r)+19 : (r)])
> @@ -381,6 +388,9 @@ do_entUna(void * va, unsigned long opcod
>         unsigned long pc = regs->pc - 4;
>         unsigned long *_regs = regs->regs;
>         const struct exception_table_entry *fixup;
> +       unsigned long flags;
> +       unsigned long la;
> +       bool ll_match;
>
>         unaligned[0].count++;
>         unaligned[0].va = (unsigned long) va;
> @@ -439,6 +449,65 @@ do_entUna(void * va, unsigned long opcod
>                 una_reg(reg) = tmp1|tmp2;
>                 return;
>
> +       case 0x2a: /* ldl_l */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               __asm__ __volatile__(
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       srl %3,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %1,15,%1\n"
> +               "       zapnot %2,15,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(la & ~3ul),
> +                         "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> +               if (error)
> +                       goto got_exception;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = NULL;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = false;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               una_reg(reg) = (int)(tmp1|tmp2);
> +               return;
> +
> +       case 0x2b: /* ldq_l */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               __asm__ __volatile__(
> +               "1:     ldq_u %3,0(%5)\n"
> +               "2:     ldq_u %4,7(%5)\n"
> +               "       extql %3,%5,%1\n"
> +               "       extqh %4,%5,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(va), "0"(0));
> +               if (error)
> +                       goto got_exception;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = NULL;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = true;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               una_reg(reg) = tmp1|tmp2;
> +               return;
> +
>         /* Note that the store sequences do not indicate that they change
>            memory because it _should_ be affecting nothing in this context.
>            (Otherwise we have other, much larger, problems.)  */
> @@ -513,6 +582,134 @@ do_entUna(void * va, unsigned long opcod
>                 if (error)
>                         goto got_exception;
>                 return;
> +
> +       case 0x2e: /* stl_c */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= !ll_quad;
> +               ll_match &= ll_task == NULL;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       srl %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,4(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,4(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> +                                 "r"(una_reg(reg)), "r"(tmp4),
> +                                 "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       sll %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,0(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,0(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((la & 3) * 8),
> +                                 "r"(una_reg(reg)), "r"(tmp3),
> +                                 "r"((15 << (la & 3)) & 0xf), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               una_reg(reg) = ll_match;
> +               return;
> +
> +       case 0x2f: /* stq_c */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= ll_quad;
> +               ll_match &= ll_task == NULL;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insqh %6,%4,%3\n"
> +                       "1:     ldq_l %2,8(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskqh %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,8(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(una_reg(reg)), "r"(tmp4), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insql %6,%4,%3\n"
> +                       "1:     ldq_l %2,0(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskql %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,0(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(una_reg(reg)), "r"(tmp3), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               una_reg(reg) = ll_match;
> +               return;
>         }
>
>         printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
> @@ -624,24 +821,33 @@ s_reg_to_mem (unsigned long s_reg)
>   * so finding the appropriate registers is a little more difficult
>   * than in the kernel case.
>   *
> - * Finally, we handle regular integer load/stores only.  In
> - * particular, load-linked/store-conditionally and floating point
> - * load/stores are not supported.  The former make no sense with
> - * unaligned faults (they are guaranteed to fail) and I don't think
> - * the latter will occur in any decent program.
> + * We have three classes of operations to handle:
>   *
> - * Sigh. We *do* have to handle some FP operations, because GCC will
> - * uses them as temporary storage for integer memory to memory copies.
> - * However, we need to deal with stt/ldt and sts/lds only.
> + * - We handle regular integer load/stores transparently to faulting
> + *   code, preserving the semantics of the triggering instruction.
> + *
> + * - We handle some FP operations as well, because GCC will use them as
> + *   temporary storage for integer memory to memory copies.  However,
> + *   we need to deal with stt/ldt and sts/lds only.
> + *
> + * - We handle load-locked/store-conditional operations by maintaining
> + *   data consistency only, within the two adjacent longwords or
> + *   quadwords partially spanned.  This is sufficient to guarantee an
> + *   unaligned RMW sequence using these operations won't clobber data
> + *   *outside* the location intended but does *not* guarantee atomicity
> + *   for the data quantity itself.
>   */
>
>  #define OP_INT_MASK    ( 1L << 0x28 | 1L << 0x2c   /* ldl stl */       \
> +                       | 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */   \
>                         | 1L << 0x29 | 1L << 0x2d   /* ldq stq */       \
> +                       | 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */   \
>                         | 1L << 0x0c | 1L << 0x0d   /* ldwu stw */      \
>                         | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
>
>  #define OP_WRITE_MASK  ( 1L << 0x26 | 1L << 0x27   /* sts stt */       \
>                         | 1L << 0x2c | 1L << 0x2d   /* stl stq */       \
> +                       | 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */   \
>                         | 1L << 0x0d | 1L << 0x0e ) /* stw stb */
>
>  #define R(x)   ((size_t) &((struct pt_regs *)0)->x)
> @@ -666,6 +872,9 @@ do_entUnaUser(void __user * va, unsigned
>
>         unsigned long tmp1, tmp2, tmp3, tmp4;
>         unsigned long fake_reg, *reg_addr = &fake_reg;
> +       unsigned long flags;
> +       unsigned long la;
> +       bool ll_match;
>         int si_code;
>         long error;
>
> @@ -794,6 +1003,61 @@ do_entUnaUser(void __user * va, unsigned
>                 *reg_addr = tmp1|tmp2;
>                 break;
>
> +       case 0x2a: /* ldl_l */
> +               la = (unsigned long)va;
> +               __asm__ __volatile__(
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       srl %3,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %1,15,%1\n"
> +               "       zapnot %2,15,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(la & ~3ul),
> +                         "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> +               if (error)
> +                       goto give_sigsegv;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = current;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = false;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               *reg_addr = (int)(tmp1|tmp2);
> +               break;
> +
> +       case 0x2b: /* ldq_l */
> +               la = (unsigned long)va;
> +               __asm__ __volatile__(
> +               "1:     ldq_u %3,0(%5)\n"
> +               "2:     ldq_u %4,7(%5)\n"
> +               "       extql %3,%5,%1\n"
> +               "       extqh %4,%5,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(va), "0"(0));
> +               if (error)
> +                       goto give_sigsegv;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = current;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = true;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               *reg_addr = tmp1|tmp2;
> +               break;
> +
>         /* Note that the store sequences do not indicate that they change
>            memory because it _should_ be affecting nothing in this context.
>            (Otherwise we have other, much larger, problems.)  */
> @@ -877,12 +1141,139 @@ do_entUnaUser(void __user * va, unsigned
>                         goto give_sigsegv;
>                 return;
>
> +       case 0x2e: /* stl_c */
> +               la = (unsigned long)va;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= !ll_quad;
> +               ll_match &= ll_task == current;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       srl %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,4(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,4(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> +                                 "r"(*reg_addr), "r"(tmp4),
> +                                 "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       sll %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,0(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,0(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((la & 3) * 8),
> +                                 "r"(*reg_addr), "r"(tmp3),
> +                                 "r"((15 << (la & 3)) & 0xf), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               *reg_addr = ll_match;
> +               break;
> +
> +       case 0x2f: /* stq_c */
> +               la = (unsigned long)va;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= ll_quad;
> +               ll_match &= ll_task == current;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insqh %6,%4,%3\n"
> +                       "1:     ldq_l %2,8(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskqh %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,8(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(*reg_addr), "r"(tmp4), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insql %6,%4,%3\n"
> +                       "1:     ldq_l %2,0(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskql %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,0(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(*reg_addr), "r"(tmp3), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               *reg_addr = ll_match;
> +               break;
> +
>         default:
>                 /* What instruction were you trying to use, exactly?  */
>                 goto give_sigbus;
>         }
>
> -       /* Only integer loads should get here; everyone else returns early. */
> +       /*
> +        * Only integer loads and stores conditional should get here;
> +        * everyone else returns early.
> +        */
>         if (reg == 30)
>                 wrusp(fake_reg);
>         return;
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-19 12:46 [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency Maciej W. Rozycki
  2025-02-19 17:38 ` Magnus Lindholm
@ 2025-02-19 23:40 ` Linus Torvalds
  2025-02-20 22:09   ` Maciej W. Rozycki
  2025-02-20 14:25 ` Magnus Lindholm
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2025-02-19 23:40 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Al Viro, linux-alpha, linux-kernel

On Wed, 19 Feb 2025 at 04:46, Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>    aligned whole data quantities spanned being read and stored for the
>    reference with a subsequent STx_C operation, along with the width of
>    the data accessed and its virtual address, and the task referring or
>    NULL if the kernel.  The validity marker is set.

So I have a couple of comments. I don't care deeply because I do think
alpha is dead, but for completeness:

 (a) I don't think the task checking is correct.

You're only checking the task pointer, so what can happen is that a
task exits and another starts up with the same task pointer value, and
it all matches across one task doing a ld_l and another doing a st_c.

Does this matter? No. You'd literally have to *try* to create that
situation with identical mis-aligned addresses and data contents, and
an exit after a 'ld_l', and doing a 'st_c' in the new task without the
matching ld_l.

So I suspect this works in practice, but it's still worth mentioning.

 (b) this is not truly atomic wrt concurrent aligned non-trapping
operations to the same words. Or in fact to current trapping ones,
since you end up inevitably releasing the spinlock before the final
stc emulation.

I think this is fundamental and non-fixable, because the stc is done
as two operations, and the first can succeed with the second failing
(or they can both succeed, just interleaved with other accesses).

Again, I don't think we care, and it works in practice, but it does
mean that I *really* think that:

 (c) you should not handle the kernel case at all.

If the kernel does an unaligned ld_l/st_c, that's a fundamental kernel
bug. Don't emulate it. Particularly when the emulation fundamentally
is not truly atomic wrt other accesses.

Finally:

 (d) I think you're doing a few too many inline asms by hand, and
you're masking the results too much.

On the read-side emulation, why do you do that

+               "1:     ldl %3,0(%5)\n"
+               "2:     ldl %4,4(%5)\n"
+               "       srl %3,%6,%1\n"
+               "       sll %4,%7,%2\n"
+               "       zapnot %1,15,%1\n"
+               "       zapnot %2,15,%2\n"

at all? Just do two aligned loads, and don't mask the bytes around
them. A *real* ldl/stc will fail not just when the exact bytes are
different, but when somebody has touched the same cacheline. So if the
aligned values have changed, you should fail the stc even if the
change was in other bytes.

And doing two aligned loads don't need any inline asm at all.

On the st_c side, I think you're repeating the same inline asm twice,
and should have a single helper.

Is this a NAK for the patch? No. But I do think it should be massaged a bit.

            Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-19 23:40 ` Linus Torvalds
@ 2025-02-20 22:09   ` Maciej W. Rozycki
  2025-04-07 20:46     ` Maciej W. Rozycki
  0 siblings, 1 reply; 21+ messages in thread
From: Maciej W. Rozycki @ 2025-02-20 22:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Al Viro, linux-alpha, linux-kernel

On Wed, 19 Feb 2025, Linus Torvalds wrote:

> > 1. A trapping unaligned LDx_L operation results in the pair of adjacent
> >    aligned whole data quantities spanned being read and stored for the
> >    reference with a subsequent STx_C operation, along with the width of
> >    the data accessed and its virtual address, and the task referring or
> >    NULL if the kernel.  The validity marker is set.
> 
> So I have a couple of comments. I don't care deeply because I do think
> alpha is dead, but for completeness:
> 
>  (a) I don't think the task checking is correct.
> 
> You're only checking the task pointer, so what can happen is that a
> task exits and another starts up with the same task pointer value, and
> it all matches across one task doing a ld_l and another doing a st_c.
> 
> Does this matter? No. You'd literally have to *try* to create that
> situation with identical mis-aligned addresses and data contents, and
> an exit after a 'ld_l', and doing a 'st_c' in the new task without the
> matching ld_l.
> 
> So I suspect this works in practice, but it's still worth mentioning.

 It's an interesting corner case, but executing STx_C without preceding 
matching LDx_L is hardly useful and depending on the circumstances may or 
may not cause unpredictable results.  We can make it a part of the psABI 
that such usage is not supported.  Otherwise we could clear `ll_bit' in 
a task's termination path.

>  (b) this is not truly atomic wrt concurrent aligned non-trapping
> operations to the same words. Or in fact to current trapping ones,
> since you end up inevitably releasing the spinlock before the final
> stc emulation.

 It is not supposed to be.  The only objective of this code is to protect 
the *unchanged* part of the longword/quadword.

> I think this is fundamental and non-fixable, because the stc is done
> as two operations, and the first can succeed with the second failing
> (or they can both succeed, just interleaved with other accesses).

 It is absolutely fine and by design.  Atomicity of the unaligned store of 
the quantity itself is not guaranteed by this sequence, just as it is not 
with the original one using a pair of STQ_U operations, where a concurrent 
write to the same location may cause the parts of the quantity to become 
inconsistent with each other.

 If the trapping code wanted to make the data quantity accessed atomic, it 
should have declared it atomic as well as made it aligned, in which case 
the compiler would have done the right thing, including padding/separation 
from other data objects where necessary for the minimum width of data that 
hardware can guarantee atomicity for.  And then we would not have arrived 
in this emulation path in the first place.

> Again, I don't think we care, and it works in practice, but it does
> mean that I *really* think that:
> 
>  (c) you should not handle the kernel case at all.
> 
> If the kernel does an unaligned ld_l/st_c, that's a fundamental kernel
> bug. Don't emulate it. Particularly when the emulation fundamentally
> is not truly atomic wrt other accesses.

 Good point actually, I think I mentally drove myself into a dead end 
here.  Yes, absolutely, it is not expected to happen unless we have a bug 
in our code somewhere!

> Finally:
> 
>  (d) I think you're doing a few too many inline asms by hand, and
> you're masking the results too much.
> 
> On the read-side emulation, why do you do that
> 
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       srl %3,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %1,15,%1\n"
> +               "       zapnot %2,15,%2\n"
> 
> at all? Just do two aligned loads, and don't mask the bytes around
> them. A *real* ldl/stc will fail not just when the exact bytes are
> different, but when somebody has touched the same cacheline. So if the
> aligned values have changed, you should fail the stc even if the
> change was in other bytes.

 We do need to extract the bytes desired for the result of LDx_L though.

 Yes, it can be done in C, but the same stands for all the other emulation
pieces here and yet they use inline assembly.  GCC is happy to do byte 
extraction itself where necessary, it has machine description patterns for 
that as it a fairly common operation (it does not for INSxH and MSKxH 
though, they're handled as intrinsics only, which however we could use 
instead).

 I think there is value in consistency, and with this piece written as 
inline assembly you can spot the difference from the other variants right 
away.  Or I could rewrite the byte extraction in C across other patterns.

> And doing two aligned loads don't need any inline asm at all.

 Neither does unaligned loads, as GCC is happy to emit LDQ_U itself where 
necessary, but we want to catch exceptions or we'd get an oops rather than 
SIGSEGV.

> On the st_c side, I think you're repeating the same inline asm twice,
> and should have a single helper.

 The masks are different as are displacements, but a good point otherwise.  
I think this guarantees the prologue to the atomic loop to go away, which 
is a good thing, but I'll yet double-check the quality of code produced.

> Is this a NAK for the patch? No. But I do think it should be massaged a bit.

 Thank you for your review, I'll post v2 once I'm done with the massage.

  Maciej

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-20 22:09   ` Maciej W. Rozycki
@ 2025-04-07 20:46     ` Maciej W. Rozycki
  2025-04-08  0:34       ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Maciej W. Rozycki @ 2025-04-07 20:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Al Viro, linux-alpha, linux-kernel

On Thu, 20 Feb 2025, Maciej W. Rozycki wrote:

> > Again, I don't think we care, and it works in practice, but it does
> > mean that I *really* think that:
> > 
> >  (c) you should not handle the kernel case at all.
> > 
> > If the kernel does an unaligned ld_l/st_c, that's a fundamental kernel
> > bug. Don't emulate it. Particularly when the emulation fundamentally
> > is not truly atomic wrt other accesses.
> 
>  Good point actually, I think I mentally drove myself into a dead end 
> here.  Yes, absolutely, it is not expected to happen unless we have a bug 
> in our code somewhere!

 Well, I take it back.  I knew I had something in mind while writing this 
code, just couldn't remember at the time of the reply, and I now brought 
it back at a mental relief break: it's networking.

 It's been decades ago, but I kept in memory a discussion when Alpha still 
was a thing, up to remembering it was DaveM saying we chose to use aligned 
frame/packet header accesses for performance reasons even where we don't 
know beforehand whether a given piece of networking data will or will not 
be actually aligned, so as not to penalise the common aligned case.

 Here's one reference I could track down: "TCP SYNs broken in 2.3.41", 
<https://marc.info/?l=linux-kernel&m=94927689929463> (not present on lore; 
I have the original thread of discussion with complete e-mail headers in 
my personal archive though), likely the very one I remembered.

 So unless I'm proved otherwise (e.g. that all such code paths are now 
gone from networking, which may or may not be the case: I saw IPX go but I 
can see AppleTalk still around; or that no sub-longword accesses are ever 
used in the relevant networking paths), I'm going to keep kernel emulation 
in v2, because what just used to be wrapped in an unaligned LDQ/STQ pair, 
which we trapped on and emulated, will now become an LDQ_L/STQ_C loop.

 Do you happen to know what the situation is here?

 NB my network usage with my Alpha environment is virtually exclusively 
FDDI, and there the wrong alignment phenomena may or may not be present at 
all for the various networking protocols, owing to different link-layer 
framing (and including the Packet Request Header used by the Motorola FDDI 
chipset the way it does for the very purpose to keep things aligned).  I 
don't know offhand, my FDDI-fu is a bit dusty even in the areas I have 
already explored, and this is a new one; I've certainly not tried (to 
learn anything about) AppleTalk over FDDI (or indeed otherwise).

 All in all the lack of kernel unaligned traps in my setup and hence no 
recorded way to trigger them merely means I haven't used any of the stuff 
that used to cause them.

  Maciej

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-04-07 20:46     ` Maciej W. Rozycki
@ 2025-04-08  0:34       ` Linus Torvalds
  2025-04-08  8:37         ` Arnd Bergmann
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2025-04-08  0:34 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Al Viro, linux-alpha, linux-kernel

On Mon, 7 Apr 2025 at 13:46, Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
>  So unless I'm proved otherwise (e.g. that all such code paths are now
> gone from networking, which may or may not be the case: I saw IPX go but I
> can see AppleTalk still around; or that no sub-longword accesses are ever
> used in the relevant networking paths), I'm going to keep kernel emulation
> in v2, because what just used to be wrapped in an unaligned LDQ/STQ pair,
> which we trapped on and emulated, will now become an LDQ_L/STQ_C loop.
>
>  Do you happen to know what the situation is here?

I think networking ends up using 'get_unaligned()' properly for header
accesses these days for any of this.

If you don't, some architectures will literally silently give you
garbage back and not even fault.

Admittedly that's mainly some really broken old 32-bit ARM stuff and
hopefully it's all dead by now.

So unless you actually *see* the unaligned faults, I really think you
shouldn't emulate them.

And I'd like to know where they are if you do see them

             Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-04-08  0:34       ` Linus Torvalds
@ 2025-04-08  8:37         ` Arnd Bergmann
  2025-04-09 15:52           ` Eric W. Biederman
  0 siblings, 1 reply; 21+ messages in thread
From: Arnd Bergmann @ 2025-04-08  8:37 UTC (permalink / raw)
  To: Linus Torvalds, Maciej W. Rozycki
  Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Alexander Viro, linux-alpha, linux-kernel

On Tue, Apr 8, 2025, at 02:34, Linus Torvalds wrote:
> On Mon, 7 Apr 2025 at 13:46, Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>>
>>  So unless I'm proved otherwise (e.g. that all such code paths are now
>> gone from networking, which may or may not be the case: I saw IPX go but I
>> can see AppleTalk still around; or that no sub-longword accesses are ever
>> used in the relevant networking paths), I'm going to keep kernel emulation
>> in v2, because what just used to be wrapped in an unaligned LDQ/STQ pair,
>> which we trapped on and emulated, will now become an LDQ_L/STQ_C loop.
>>
>>  Do you happen to know what the situation is here?
>
> I think networking ends up using 'get_unaligned()' properly for header
> accesses these days for any of this.
>
> If you don't, some architectures will literally silently give you
> garbage back and not even fault.
>
> Admittedly that's mainly some really broken old 32-bit ARM stuff and
> hopefully it's all dead by now.

Yes, the last one doing this was EBSA110, which we removed in 2020.

> So unless you actually *see* the unaligned faults, I really think you
> shouldn't emulate them.
>
> And I'd like to know where they are if you do see them

FWIW, all the major architectures that have variants without
unaligned load/store (arm32, mips, ppc, riscv) trap and emulate
them for both user and kernel access for normal memory, but
they don't emulate it for atomic ll/sc type instructions.
These instructions also trap and kill the task on the
architectures that can do hardware unaligned access (x86
cmpxchg8b being a notable exception).

     Arnd

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-04-08  8:37         ` Arnd Bergmann
@ 2025-04-09 15:52           ` Eric W. Biederman
  2025-04-09 20:59             ` Maciej W. Rozycki
  0 siblings, 1 reply; 21+ messages in thread
From: Eric W. Biederman @ 2025-04-09 15:52 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, Maciej W. Rozycki, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, John Paul Adrian Glaubitz,
	Magnus Lindholm, Paul E. McKenney, Alexander Viro, linux-alpha,
	linux-kernel

"Arnd Bergmann" <arnd@arndb.de> writes:

> On Tue, Apr 8, 2025, at 02:34, Linus Torvalds wrote:
>> On Mon, 7 Apr 2025 at 13:46, Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>>>
>>>  So unless I'm proved otherwise (e.g. that all such code paths are now
>>> gone from networking, which may or may not be the case: I saw IPX go but I
>>> can see AppleTalk still around; or that no sub-longword accesses are ever
>>> used in the relevant networking paths), I'm going to keep kernel emulation
>>> in v2, because what just used to be wrapped in an unaligned LDQ/STQ pair,
>>> which we trapped on and emulated, will now become an LDQ_L/STQ_C loop.
>>>
>>>  Do you happen to know what the situation is here?
>>
>> I think networking ends up using 'get_unaligned()' properly for header
>> accesses these days for any of this.
>>
>> If you don't, some architectures will literally silently give you
>> garbage back and not even fault.
>>
>> Admittedly that's mainly some really broken old 32-bit ARM stuff and
>> hopefully it's all dead by now.
>
> Yes, the last one doing this was EBSA110, which we removed in 2020.
>
>> So unless you actually *see* the unaligned faults, I really think you
>> shouldn't emulate them.
>>
>> And I'd like to know where they are if you do see them

I was nerd sniped by this so I took a look.

I have a distinct memory that even the ipv4 stack can generate unaligned
loads.  Looking at the code in net/ipv4/ip_input.c:ip_rcv_finish_core
there are several unprotected accesses to iph->daddr.

Which means that if the lower layers ever give something that is not 4
byte aligned for ipv4 just reading the destination address will be an
unaligned read.

There are similar unprotected accesses to the ipv6 destination address
but it is declared as an array of bytes.  So that address can not
be misaligned.

There is a theoretical path through 802.2 that adds a 3 byte sap
header that could cause problems.  We have LLC_SAP_IP defined
but I don't see anything calling register_8022_client that would
be needed to hook that up to the ipv4 stack.

As long as the individual ethernet drivers have the hardware deliver
packets 2 bytes into an aligned packet buffer the 14 byte ethernet
header will end on a 16 byte aligned location, I don't think there
is a way to trigger unaligned behavior with ipv4 or ipv6.

Hmm.  Looking appletalk appears to be built on top of SNAP.
So after the ethernet header processing the code goes through
net/llc/llc_input.c:llc_rcv and then net/802/snap_rcv before
reaching any of the appletalk protocols.

I think the common case for llc would be 3 bytes + 5 bytes for snap,
for 8 bytes in the common case.  But the code seems to be reading
4 or 5 bytes for llc so I am confused.  In either case it definitely
appears there are cases where the ethernet headers before appletalk
can be an odd number of bytes which has the possibility of unaligning
everything.

Both of the appletalk protocols appear to make unguarded 16bit reads
from their headers.  So having a buffer that is only 1 byte aligned
looks like it will definitely be a problem.

> FWIW, all the major architectures that have variants without
> unaligned load/store (arm32, mips, ppc, riscv) trap and emulate
> them for both user and kernel access for normal memory, but
> they don't emulate it for atomic ll/sc type instructions.
> These instructions also trap and kill the task on the
> architectures that can do hardware unaligned access (x86
> cmpxchg8b being a notable exception).

I don't see anything that would get atomics involved in the networking
stack.  No READ_ONCE on packet data or anything like that.  I believe
that is fairly fundamental as well.  Whatever is processing a packet is
the only code processing that packet.

So I would be very surprised if the kernel needed emulation of any
atomics, just emulation of normal unaligned reads.  I haven't looked to
see if the transmission paths do things that will result in unaligned
writes.

Eric

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-04-09 15:52           ` Eric W. Biederman
@ 2025-04-09 20:59             ` Maciej W. Rozycki
  2025-04-10  4:37               ` Eric W. Biederman
  0 siblings, 1 reply; 21+ messages in thread
From: Maciej W. Rozycki @ 2025-04-09 20:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Arnd Bergmann, Linus Torvalds, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, John Paul Adrian Glaubitz, Magnus Lindholm,
	Paul E. McKenney, Alexander Viro, linux-alpha, linux-kernel

On Wed, 9 Apr 2025, Eric W. Biederman wrote:

> >> So unless you actually *see* the unaligned faults, I really think you
> >> shouldn't emulate them.
> >>
> >> And I'd like to know where they are if you do see them
> 
> I was nerd sniped by this so I took a look.
> 
> I have a distinct memory that even the ipv4 stack can generate unaligned
> loads.  Looking at the code in net/ipv4/ip_input.c:ip_rcv_finish_core
> there are several unprotected accesses to iph->daddr.
> 
> Which means that if the lower layers ever give something that is not 4
> byte aligned for ipv4 just reading the destination address will be an
> unaligned read.
> 
> There are similar unprotected accesses to the ipv6 destination address
> but it is declared as an array of bytes.  So that address can not
> be misaligned.
> 
> There is a theoretical path through 802.2 that adds a 3 byte sap
> header that could cause problems.  We have LLC_SAP_IP defined
> but I don't see anything calling register_8022_client that would
> be needed to hook that up to the ipv4 stack.
> 
> As long as the individual ethernet drivers have the hardware deliver
> packets 2 bytes into an aligned packet buffer the 14 byte ethernet
> header will end on a 16 byte aligned location, I don't think there
> is a way to trigger unaligned behavior with ipv4 or ipv6.
> 
> Hmm.  Looking appletalk appears to be built on top of SNAP.
> So after the ethernet header processing the code goes through
> net/llc/llc_input.c:llc_rcv and then net/802/snap_rcv before
> reaching any of the appletalk protocols.
> 
> I think the common case for llc would be 3 bytes + 5 bytes for snap,
> for 8 bytes in the common case.  But the code seems to be reading
> 4 or 5 bytes for llc so I am confused.  In either case it definitely
> appears there are cases where the ethernet headers before appletalk
> can be an odd number of bytes which has the possibility of unaligning
> everything.
> 
> Both of the appletalk protocols appear to make unguarded 16bit reads
> from their headers.  So having a buffer that is only 1 byte aligned
> looks like it will definitely be a problem.

 Thank you for your analysis, really insightful.

> > FWIW, all the major architectures that have variants without
> > unaligned load/store (arm32, mips, ppc, riscv) trap and emulate
> > them for both user and kernel access for normal memory, but
> > they don't emulate it for atomic ll/sc type instructions.
> > These instructions also trap and kill the task on the
> > architectures that can do hardware unaligned access (x86
> > cmpxchg8b being a notable exception).

 But all those architectures have 1-byte and 2-byte memory access machine 
instructions as well, and consequently none requires an RMW sequence to 
update such data quantities that implies the data consistency issue that 
we have on non-BWX Alpha.

> I don't see anything that would get atomics involved in the networking
> stack.  No READ_ONCE on packet data or anything like that.  I believe
> that is fairly fundamental as well.  Whatever is processing a packet is
> the only code processing that packet.
> 
> So I would be very surprised if the kernel needed emulation of any
> atomics, just emulation of normal unaligned reads.  I haven't looked to
> see if the transmission paths do things that will result in unaligned
> writes.

 The problem we have on the non-BWX Alpha target is that hardware has no 
memory access instructions narrower than 4 bytes.  Consequently to write a 
1- or 2-byte quantity an RMW instruction sequence is required, in the way 
of reading the whole 4-byte quantity, inserting the bytes to be modified, 
and writing the whole 4-byte quantity back to memory.  However such a 
sequence is not safe for concurrent writes, as described below.

 A pair of concurrent RMW sequences targetting the same part of an aligned 
4-byte data quantity is not an issue: it's just an execution race and 
software may be prepared for it (or otherwise either prevent the race via 
a mutex or alternatively use an atomic data type along with the associated 
accessors, which will move data locations in memory suitably apart).

 The issue is a pair of concurrent RMW sequences targetting different 
parts of the same aligned 4-byte data quantity: software can legitimately 
expect that writes to disjoint memory locations (e.g. adjacent struct 
members, except for bit-fields) won't affect each other.  But here where a 
pair of such RMW sequences runs interleaved, the later write to one 
location will clobber the value written previously to the other.  So we 
have a data race.  Note that no atomicity is concerned here, we are 
talking plain memory writes, such as with ordinary assignments to regular 
variables in C code.

 So I have come up with a solution where such RMW sequences are actually 
emitted by GCC as an LDL_L/STL_C atomic access loop which ensures that no 
intervening write has changed the aligned 4-byte data quantity containing 
the 1- or 2-byte quantity accessed.  This guarantees consistency of the 
part(s) of the aligned 4-byte data quantity *outside* the 1- or 2-byte 
quantity written.  Atomicity is guaranteed by hardware as a side effect, 
but not a part of this Alpha/Linux psABI extension (i.e. not in our 
contract).

 For known-unaligned 2-byte quantities (such as packed structure members) 
the compiler knows that they may span 2 aligned 4-byte data quantities and 
produces two LDL_L/STL_C loops with suitable address adjustments and data 
masking.  This still guarantess consistency of data *outside* the 2-byte 
quantity written.  No atomicity is guaranteed, because parts of the 2-byte 
quantity may be stored by pieces (if the 2-byte quantity is in the middle 
of an aligned 4-byte quantity, then it'll be written twice).

 The problem is with the case where the compiler has been told to produce 
code to write an aligned 2-byte quantity, but at run time it turns out 
unaligned.  Now we have to emulate the LDL_L and STL_C instructions of the 
atomic access loop or otherwise the code will crash.

 My approach for this scenario is simple: LDL_L emulation remembers the 
address accessed and data present in the 2 aligned 4-byte data quantities 
spanned, and STL_C emulation returns failure in the case of an address 
mismatch and otherwise uses two LDL_L/STL_C loops to load the the 2 
aligned 4-byte data quantities by piece, compare each with data retrieved 
previously at LDL_L emulation time, returning failure in the case of a 
mismatch, insert the requested value and then store the resulting 
quantity.  Again this guarantees consistency of the parts of the 2 aligned 
4-byte data quantities *outside* the unaligned 2-byte quantity written.  
And again, no atomicity is guaranteed.

 So while there are no atomic operations in our code at the C language 
level, we get them sneaked in by the compiler under our feet to solve the 
data consistency issue.  Now if we can ascertain the code paths concerned 
won't ever exercise concurrency, we could tell the compiler not to produce 
these atomics for 1-byte and 2-byte accesses, on a file-by-file or even 
function-by-function basis, but it seems to me like the very maintenance 
effort we want to avoid for a legacy platform.  Whereas if we build the 
kernel with the atomics enabled universally, we won't have to be bothered 
with analysing individual cases (at performance cost, but that's assumed).

 I've left 8-byte data quantities out for clarity from the consideration 
above; they're used by the compiler as suitable and handled accordingly.

 Let me know if you find anything here unclear.

  Maciej

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-04-09 20:59             ` Maciej W. Rozycki
@ 2025-04-10  4:37               ` Eric W. Biederman
  0 siblings, 0 replies; 21+ messages in thread
From: Eric W. Biederman @ 2025-04-10  4:37 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Arnd Bergmann, Linus Torvalds, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, John Paul Adrian Glaubitz, Magnus Lindholm,
	Paul E. McKenney, Alexander Viro, linux-alpha, linux-kernel

"Maciej W. Rozycki" <macro@orcam.me.uk> writes:

> On Wed, 9 Apr 2025, Eric W. Biederman wrote:
>
>> >> So unless you actually *see* the unaligned faults, I really think you
>> >> shouldn't emulate them.
>> >>
>> >> And I'd like to know where they are if you do see them
>> 
>> I was nerd sniped by this so I took a look.
>> 
>> I have a distinct memory that even the ipv4 stack can generate unaligned
>> loads.  Looking at the code in net/ipv4/ip_input.c:ip_rcv_finish_core
>> there are several unprotected accesses to iph->daddr.
>> 
>> Which means that if the lower layers ever give something that is not 4
>> byte aligned for ipv4 just reading the destination address will be an
>> unaligned read.
>> 
>> There are similar unprotected accesses to the ipv6 destination address
>> but it is declared as an array of bytes.  So that address can not
>> be misaligned.
>> 
>> There is a theoretical path through 802.2 that adds a 3 byte sap
>> header that could cause problems.  We have LLC_SAP_IP defined
>> but I don't see anything calling register_8022_client that would
>> be needed to hook that up to the ipv4 stack.
>> 
>> As long as the individual ethernet drivers have the hardware deliver
>> packets 2 bytes into an aligned packet buffer the 14 byte ethernet
>> header will end on a 16 byte aligned location, I don't think there
>> is a way to trigger unaligned behavior with ipv4 or ipv6.
>> 
>> Hmm.  Looking appletalk appears to be built on top of SNAP.
>> So after the ethernet header processing the code goes through
>> net/llc/llc_input.c:llc_rcv and then net/802/snap_rcv before
>> reaching any of the appletalk protocols.
>> 
>> I think the common case for llc would be 3 bytes + 5 bytes for snap,
>> for 8 bytes in the common case.  But the code seems to be reading
>> 4 or 5 bytes for llc so I am confused.  In either case it definitely
>> appears there are cases where the ethernet headers before appletalk
>> can be an odd number of bytes which has the possibility of unaligning
>> everything.
>> 
>> Both of the appletalk protocols appear to make unguarded 16bit reads
>> from their headers.  So having a buffer that is only 1 byte aligned
>> looks like it will definitely be a problem.
>
>  Thank you for your analysis, really insightful.
>
>> > FWIW, all the major architectures that have variants without
>> > unaligned load/store (arm32, mips, ppc, riscv) trap and emulate
>> > them for both user and kernel access for normal memory, but
>> > they don't emulate it for atomic ll/sc type instructions.
>> > These instructions also trap and kill the task on the
>> > architectures that can do hardware unaligned access (x86
>> > cmpxchg8b being a notable exception).
>
>  But all those architectures have 1-byte and 2-byte memory access machine 
> instructions as well, and consequently none requires an RMW sequence to 
> update such data quantities that implies the data consistency issue that 
> we have on non-BWX Alpha.
>
>> I don't see anything that would get atomics involved in the networking
>> stack.  No READ_ONCE on packet data or anything like that.  I believe
>> that is fairly fundamental as well.  Whatever is processing a packet is
>> the only code processing that packet.
>> 
>> So I would be very surprised if the kernel needed emulation of any
>> atomics, just emulation of normal unaligned reads.  I haven't looked to
>> see if the transmission paths do things that will result in unaligned
>> writes.
>
>  The problem we have on the non-BWX Alpha target is that hardware has no 
> memory access instructions narrower than 4 bytes.  Consequently to write a 
> 1- or 2-byte quantity an RMW instruction sequence is required, in the way 
> of reading the whole 4-byte quantity, inserting the bytes to be modified, 
> and writing the whole 4-byte quantity back to memory.  However such a 
> sequence is not safe for concurrent writes, as described below.
>
>  A pair of concurrent RMW sequences targetting the same part of an aligned 
> 4-byte data quantity is not an issue: it's just an execution race and 
> software may be prepared for it (or otherwise either prevent the race via 
> a mutex or alternatively use an atomic data type along with the associated 
> accessors, which will move data locations in memory suitably apart).
>
>  The issue is a pair of concurrent RMW sequences targetting different 
> parts of the same aligned 4-byte data quantity: software can legitimately 
> expect that writes to disjoint memory locations (e.g. adjacent struct 
> members, except for bit-fields) won't affect each other.  But here where a 
> pair of such RMW sequences runs interleaved, the later write to one 
> location will clobber the value written previously to the other.  So we 
> have a data race.  Note that no atomicity is concerned here, we are 
> talking plain memory writes, such as with ordinary assignments to regular 
> variables in C code.
>
>  So I have come up with a solution where such RMW sequences are actually 
> emitted by GCC as an LDL_L/STL_C atomic access loop which ensures that no 
> intervening write has changed the aligned 4-byte data quantity containing 
> the 1- or 2-byte quantity accessed.  This guarantees consistency of the 
> part(s) of the aligned 4-byte data quantity *outside* the 1- or 2-byte 
> quantity written.  Atomicity is guaranteed by hardware as a side effect, 
> but not a part of this Alpha/Linux psABI extension (i.e. not in our 
> contract).
>
>  For known-unaligned 2-byte quantities (such as packed structure members) 
> the compiler knows that they may span 2 aligned 4-byte data quantities and 
> produces two LDL_L/STL_C loops with suitable address adjustments and data 
> masking.  This still guarantess consistency of data *outside* the 2-byte 
> quantity written.  No atomicity is guaranteed, because parts of the 2-byte 
> quantity may be stored by pieces (if the 2-byte quantity is in the middle 
> of an aligned 4-byte quantity, then it'll be written twice).
>
>  The problem is with the case where the compiler has been told to produce 
> code to write an aligned 2-byte quantity, but at run time it turns out 
> unaligned.  Now we have to emulate the LDL_L and STL_C instructions of the 
> atomic access loop or otherwise the code will crash.
>
>  My approach for this scenario is simple: LDL_L emulation remembers the 
> address accessed and data present in the 2 aligned 4-byte data quantities 
> spanned, and STL_C emulation returns failure in the case of an address 
> mismatch and otherwise uses two LDL_L/STL_C loops to load the the 2 
> aligned 4-byte data quantities by piece, compare each with data retrieved 
> previously at LDL_L emulation time, returning failure in the case of a 
> mismatch, insert the requested value and then store the resulting 
> quantity.  Again this guarantees consistency of the parts of the 2 aligned 
> 4-byte data quantities *outside* the unaligned 2-byte quantity written.  
> And again, no atomicity is guaranteed.
>
>  So while there are no atomic operations in our code at the C language 
> level, we get them sneaked in by the compiler under our feet to solve the 
> data consistency issue.  Now if we can ascertain the code paths concerned 
> won't ever exercise concurrency, we could tell the compiler not to produce 
> these atomics for 1-byte and 2-byte accesses, on a file-by-file or even 
> function-by-function basis, but it seems to me like the very maintenance 
> effort we want to avoid for a legacy platform.  Whereas if we build the 
> kernel with the atomics enabled universally, we won't have to be bothered 
> with analysing individual cases (at performance cost, but that's assumed).
>
>  I've left 8-byte data quantities out for clarity from the consideration 
> above; they're used by the compiler as suitable and handled accordingly.
>
>  Let me know if you find anything here unclear.

The emulation you are doing makes sense.

Just a few more points.  I am not current but I have never seen
concurrency (inside of a packet) at the network layer.

I don't recall ever hearing the write paths in the network stack
were ever a problem.

I suspect the write side you can verify fairly easily by simply
compiling in appletalk and opening a PF_APPLETALK socket, and sending a
message.  If that doesn't trigger emulation I can't image any other
write path in the kernel will.

Eric

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-19 12:46 [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency Maciej W. Rozycki
  2025-02-19 17:38 ` Magnus Lindholm
  2025-02-19 23:40 ` Linus Torvalds
@ 2025-02-20 14:25 ` Magnus Lindholm
  2025-02-20 16:46 ` Matt Turner
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Magnus Lindholm @ 2025-02-20 14:25 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Arnd Bergmann,
	John Paul Adrian Glaubitz, Paul E. McKenney, Linus Torvalds,
	Al Viro, linux-alpha, linux-kernel

I've tested this patch on an Alphaserver ES40 (running 6.14.0-rc3),
using the test-suit provided by Maciej and can verify that it works.


Tested-by: Magnus Lindholm <linmag7@gmail.com>

On Wed, Feb 19, 2025 at 1:46 PM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15, implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
>
> The principle of operation is as follows:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>    aligned whole data quantities spanned being read and stored for the
>    reference with a subsequent STx_C operation, along with the width of
>    the data accessed and its virtual address, and the task referring or
>    NULL if the kernel.  The valitidy marker is set.
>
> 2. Regular memory load operations are used to retrieve data because no
>    atomicity is needed at this stage, and matching the width accessed,
>    either LDQ_U or LDL even though the latter instruction requires extra
>    operations, to avoid the case where an unaligned longword located
>    entirely within an aligned quadword would complicate handling.
>
> 3. Data is masked, shifted and merged appropriately and returned in the
>    intended register as the result of the trapping LDx_L instruction.
>
> 4. A trapping unaligned STx_C operation results in the valitidy marker
>    being checked for being true, and the width of the data accessed
>    along with the virtual address and the task referring or the kernel
>    for a match.  The pair of whole data quantities previously read by
>    LDx_L emulation is retrieved and the valitidy marker is cleared.
>
> 5. If the checks succeeded, then in an atomic loop the location of the
>    first whole data quantity is reread, and data retrieved compared with
>    the value previously obtained.  If there's no match, then the loop is
>    aborted and 0 is returned in the intended register as the result of
>    the trapping STx_C instruction and emulation completes.  Otherwise
>    new data obtained from the source operand of STx_C is combined with
>    the data retrieved, replacing by byte insertion the part intended,
>    and an atomic write of this new data is attempted.  If it fails, the
>    loop continues from the beginning.  Otherwise processing proceeds to
>    the next step.
>
> 6. The same operations are performed on the second whole data quantity.
>
> 7. At this point both whole data quantities have been written, ensuring
>    that no third-party intervening write has changed them at the point
>    of the write from the values held at previous LDx_L.  Therefore 1 is
>    returned in the intended register as the result of the trapping STx_C
>    instruction.
>
> 8. No user accesses are permitted in traps from the kernel mode as the
>    only LDx_L/STx_C accesses made to user memory locations by the kernel
>    are supposed to be those from handcrafted code, which has to written
>    such as not to trap.
>
> Since atomic loops are used for data updates the approach works equally
> well in both UP and SMP environments.  No data atomicity is guaranteed,
> but data consistency is, that is concurrent RMW accesses won't clobber
> each other, however if the same data is concurrently written as already
> there with a regular write between emulated LDx_L and STx_C, then STx_C
> will still succeed.  Likewise if data is modified, but then restored
> before STx_C has had a chance to run.
>
> This fulfils consistency requirements and guarantees that data outside
> the quantity written has not changed between emulated LDx_L and STx_C.
>
> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
> ---
> Hi,
>
>  This has cleared the pair of `-msafe-bwa -msafe-partial' regressions
> observed in GCC verification (the third one was a Modula 2 frontend bug,
> now fixed in the compiler).  I have verified individual misalignments with
> a small program by hand as well, for both the data retrieved by emulated
> LDx_L and the data stored by emulated STx_C.
>
>  The kernel itself built with `-mcpu=ev4 -msafe-bwa -msafe-partial' boots
> and has passed GCC verification, and triggered no extra unaligned traps.
>
>  Full verification was run with 6.3.0-rc5 and Ivan's stack alignment fixes
> applied just because I was confident already that version works correctly.
> Interestingly enough no kernel mode traps have triggered with a kernel
> built with GCC 12 (and with most user traps coming from GCC verification):
>
> kernel unaligned acc    : 0 (pc=0,va=0)
> user unaligned acc      : 1766720 (pc=20000053064,va=120020189)
>
> but with GCC 15 a small quantity happened (even before I ran GCC testing):
>
> kernel unaligned acc    : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
> user unaligned acc      : 883452 (pc=20000053064,va=120020189)
>
> It seems a compiler regression worth checking -- the trap recorded was in
> `icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr',
> which however by its type (`struct in6_addr') is only longword-aligned and
> indeed starts at offset 148 from the outermost struct.  I have a sneaking
> suspicion one of my earlier GCC changes might be at fault.  At least I now
> have a test case to experiment with.
>
>  I've also built and booted 6.9.0-rc3 as at commit 82c525bfafb4 ("alpha:
> trim the unused stuff from asm-offsets.c"), the last one before support
> for my system was axed.  It has passed the verification with my small
> program (available by request; I'm not sure if it's worth turning into a
> kernel selftest).
>
>  NB I'm going to ignore the 72 errors checkpatch.pl issues for EXC usage.
> The coding style of the new additions is consistent with the rest of the
> file and any change to that would best be made separately (but I fail to
> see the point).
>
>  Questions, comments, concerns?  Otherwise please apply, and I'll proceed
> with the rest of the GCC effort, followed by cleaning handwritten assembly
> up that uses STQ_U in our port and in glibc.
>
>   Maciej
> ---
>  arch/alpha/kernel/traps.c |  409 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 400 insertions(+), 9 deletions(-)
>
> linux-alpha-llsc-unaligned.diff
> Index: linux-macro/arch/alpha/kernel/traps.c
> ===================================================================
> --- linux-macro.orig/arch/alpha/kernel/traps.c
> +++ linux-macro/arch/alpha/kernel/traps.c
> @@ -368,6 +368,13 @@ struct unaligned_stat {
>         unsigned long count, va, pc;
>  } unaligned[2];
>
> +/* Unaligned LDx_L/STx_C emulation state.  */
> +static DEFINE_RAW_SPINLOCK(ll_lock);
> +static struct task_struct *ll_task;
> +static unsigned long ll_data[2];
> +static unsigned long ll_va;
> +static bool ll_quad;
> +static bool ll_bit;
>
>  /* Macro for exception fixup code to access integer registers.  */
>  #define una_reg(r)  (_regs[(r) >= 16 && (r) <= 18 ? (r)+19 : (r)])
> @@ -381,6 +388,9 @@ do_entUna(void * va, unsigned long opcod
>         unsigned long pc = regs->pc - 4;
>         unsigned long *_regs = regs->regs;
>         const struct exception_table_entry *fixup;
> +       unsigned long flags;
> +       unsigned long la;
> +       bool ll_match;
>
>         unaligned[0].count++;
>         unaligned[0].va = (unsigned long) va;
> @@ -439,6 +449,65 @@ do_entUna(void * va, unsigned long opcod
>                 una_reg(reg) = tmp1|tmp2;
>                 return;
>
> +       case 0x2a: /* ldl_l */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               __asm__ __volatile__(
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       srl %3,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %1,15,%1\n"
> +               "       zapnot %2,15,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(la & ~3ul),
> +                         "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> +               if (error)
> +                       goto got_exception;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = NULL;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = false;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               una_reg(reg) = (int)(tmp1|tmp2);
> +               return;
> +
> +       case 0x2b: /* ldq_l */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               __asm__ __volatile__(
> +               "1:     ldq_u %3,0(%5)\n"
> +               "2:     ldq_u %4,7(%5)\n"
> +               "       extql %3,%5,%1\n"
> +               "       extqh %4,%5,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(va), "0"(0));
> +               if (error)
> +                       goto got_exception;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = NULL;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = true;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               una_reg(reg) = tmp1|tmp2;
> +               return;
> +
>         /* Note that the store sequences do not indicate that they change
>            memory because it _should_ be affecting nothing in this context.
>            (Otherwise we have other, much larger, problems.)  */
> @@ -513,6 +582,134 @@ do_entUna(void * va, unsigned long opcod
>                 if (error)
>                         goto got_exception;
>                 return;
> +
> +       case 0x2e: /* stl_c */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= !ll_quad;
> +               ll_match &= ll_task == NULL;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       srl %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,4(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,4(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> +                                 "r"(una_reg(reg)), "r"(tmp4),
> +                                 "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       sll %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,0(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,0(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((la & 3) * 8),
> +                                 "r"(una_reg(reg)), "r"(tmp3),
> +                                 "r"((15 << (la & 3)) & 0xf), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               una_reg(reg) = ll_match;
> +               return;
> +
> +       case 0x2f: /* stq_c */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= ll_quad;
> +               ll_match &= ll_task == NULL;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insqh %6,%4,%3\n"
> +                       "1:     ldq_l %2,8(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskqh %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,8(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(una_reg(reg)), "r"(tmp4), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insql %6,%4,%3\n"
> +                       "1:     ldq_l %2,0(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskql %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,0(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(una_reg(reg)), "r"(tmp3), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               una_reg(reg) = ll_match;
> +               return;
>         }
>
>         printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
> @@ -624,24 +821,33 @@ s_reg_to_mem (unsigned long s_reg)
>   * so finding the appropriate registers is a little more difficult
>   * than in the kernel case.
>   *
> - * Finally, we handle regular integer load/stores only.  In
> - * particular, load-linked/store-conditionally and floating point
> - * load/stores are not supported.  The former make no sense with
> - * unaligned faults (they are guaranteed to fail) and I don't think
> - * the latter will occur in any decent program.
> + * We have three classes of operations to handle:
>   *
> - * Sigh. We *do* have to handle some FP operations, because GCC will
> - * uses them as temporary storage for integer memory to memory copies.
> - * However, we need to deal with stt/ldt and sts/lds only.
> + * - We handle regular integer load/stores transparently to faulting
> + *   code, preserving the semantics of the triggering instruction.
> + *
> + * - We handle some FP operations as well, because GCC will use them as
> + *   temporary storage for integer memory to memory copies.  However,
> + *   we need to deal with stt/ldt and sts/lds only.
> + *
> + * - We handle load-locked/store-conditional operations by maintaining
> + *   data consistency only, within the two adjacent longwords or
> + *   quadwords partially spanned.  This is sufficient to guarantee an
> + *   unaligned RMW sequence using these operations won't clobber data
> + *   *outside* the location intended but does *not* guarantee atomicity
> + *   for the data quantity itself.
>   */
>
>  #define OP_INT_MASK    ( 1L << 0x28 | 1L << 0x2c   /* ldl stl */       \
> +                       | 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */   \
>                         | 1L << 0x29 | 1L << 0x2d   /* ldq stq */       \
> +                       | 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */   \
>                         | 1L << 0x0c | 1L << 0x0d   /* ldwu stw */      \
>                         | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
>
>  #define OP_WRITE_MASK  ( 1L << 0x26 | 1L << 0x27   /* sts stt */       \
>                         | 1L << 0x2c | 1L << 0x2d   /* stl stq */       \
> +                       | 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */   \
>                         | 1L << 0x0d | 1L << 0x0e ) /* stw stb */
>
>  #define R(x)   ((size_t) &((struct pt_regs *)0)->x)
> @@ -666,6 +872,9 @@ do_entUnaUser(void __user * va, unsigned
>
>         unsigned long tmp1, tmp2, tmp3, tmp4;
>         unsigned long fake_reg, *reg_addr = &fake_reg;
> +       unsigned long flags;
> +       unsigned long la;
> +       bool ll_match;
>         int si_code;
>         long error;
>
> @@ -794,6 +1003,61 @@ do_entUnaUser(void __user * va, unsigned
>                 *reg_addr = tmp1|tmp2;
>                 break;
>
> +       case 0x2a: /* ldl_l */
> +               la = (unsigned long)va;
> +               __asm__ __volatile__(
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       srl %3,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %1,15,%1\n"
> +               "       zapnot %2,15,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(la & ~3ul),
> +                         "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> +               if (error)
> +                       goto give_sigsegv;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = current;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = false;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               *reg_addr = (int)(tmp1|tmp2);
> +               break;
> +
> +       case 0x2b: /* ldq_l */
> +               la = (unsigned long)va;
> +               __asm__ __volatile__(
> +               "1:     ldq_u %3,0(%5)\n"
> +               "2:     ldq_u %4,7(%5)\n"
> +               "       extql %3,%5,%1\n"
> +               "       extqh %4,%5,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(va), "0"(0));
> +               if (error)
> +                       goto give_sigsegv;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = current;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = true;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               *reg_addr = tmp1|tmp2;
> +               break;
> +
>         /* Note that the store sequences do not indicate that they change
>            memory because it _should_ be affecting nothing in this context.
>            (Otherwise we have other, much larger, problems.)  */
> @@ -877,12 +1141,139 @@ do_entUnaUser(void __user * va, unsigned
>                         goto give_sigsegv;
>                 return;
>
> +       case 0x2e: /* stl_c */
> +               la = (unsigned long)va;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= !ll_quad;
> +               ll_match &= ll_task == current;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       srl %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,4(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,4(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> +                                 "r"(*reg_addr), "r"(tmp4),
> +                                 "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       sll %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,0(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,0(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((la & 3) * 8),
> +                                 "r"(*reg_addr), "r"(tmp3),
> +                                 "r"((15 << (la & 3)) & 0xf), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               *reg_addr = ll_match;
> +               break;
> +
> +       case 0x2f: /* stq_c */
> +               la = (unsigned long)va;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= ll_quad;
> +               ll_match &= ll_task == current;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insqh %6,%4,%3\n"
> +                       "1:     ldq_l %2,8(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskqh %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,8(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(*reg_addr), "r"(tmp4), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insql %6,%4,%3\n"
> +                       "1:     ldq_l %2,0(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskql %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,0(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(*reg_addr), "r"(tmp3), "0"(0));
> +                       if (error)
> +                               goto give_sigsegv;
> +               }
> +               *reg_addr = ll_match;
> +               break;
> +
>         default:
>                 /* What instruction were you trying to use, exactly?  */
>                 goto give_sigbus;
>         }
>
> -       /* Only integer loads should get here; everyone else returns early. */
> +       /*
> +        * Only integer loads and stores conditional should get here;
> +        * everyone else returns early.
> +        */
>         if (reg == 30)
>                 wrusp(fake_reg);
>         return;
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-19 12:46 [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency Maciej W. Rozycki
                   ` (2 preceding siblings ...)
  2025-02-20 14:25 ` Magnus Lindholm
@ 2025-02-20 16:46 ` Matt Turner
  2025-02-20 16:52   ` Matt Turner
  2025-02-20 17:54 ` Richard Henderson
  2025-02-25 20:44 ` Maciej W. Rozycki
  5 siblings, 1 reply; 21+ messages in thread
From: Matt Turner @ 2025-02-20 16:46 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Richard Henderson, Ivan Kokshaysky, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Linus Torvalds, Al Viro, linux-alpha, linux-kernel

On Wed, Feb 19, 2025 at 7:46 AM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15, implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
>
> The principle of operation is as follows:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>    aligned whole data quantities spanned being read and stored for the
>    reference with a subsequent STx_C operation, along with the width of
>    the data accessed and its virtual address, and the task referring or
>    NULL if the kernel.  The valitidy marker is set.

Typo: validity (occurs twice more in the commit message below)

>
> 2. Regular memory load operations are used to retrieve data because no
>    atomicity is needed at this stage, and matching the width accessed,
>    either LDQ_U or LDL even though the latter instruction requires extra
>    operations, to avoid the case where an unaligned longword located
>    entirely within an aligned quadword would complicate handling.
>
> 3. Data is masked, shifted and merged appropriately and returned in the
>    intended register as the result of the trapping LDx_L instruction.
>
> 4. A trapping unaligned STx_C operation results in the valitidy marker
>    being checked for being true, and the width of the data accessed
>    along with the virtual address and the task referring or the kernel
>    for a match.  The pair of whole data quantities previously read by
>    LDx_L emulation is retrieved and the valitidy marker is cleared.
>
> 5. If the checks succeeded, then in an atomic loop the location of the
>    first whole data quantity is reread, and data retrieved compared with
>    the value previously obtained.  If there's no match, then the loop is
>    aborted and 0 is returned in the intended register as the result of
>    the trapping STx_C instruction and emulation completes.  Otherwise
>    new data obtained from the source operand of STx_C is combined with
>    the data retrieved, replacing by byte insertion the part intended,
>    and an atomic write of this new data is attempted.  If it fails, the
>    loop continues from the beginning.  Otherwise processing proceeds to
>    the next step.
>
> 6. The same operations are performed on the second whole data quantity.
>
> 7. At this point both whole data quantities have been written, ensuring
>    that no third-party intervening write has changed them at the point
>    of the write from the values held at previous LDx_L.  Therefore 1 is
>    returned in the intended register as the result of the trapping STx_C
>    instruction.
>
> 8. No user accesses are permitted in traps from the kernel mode as the
>    only LDx_L/STx_C accesses made to user memory locations by the kernel
>    are supposed to be those from handcrafted code, which has to written
>    such as not to trap.
>
> Since atomic loops are used for data updates the approach works equally
> well in both UP and SMP environments.  No data atomicity is guaranteed,
> but data consistency is, that is concurrent RMW accesses won't clobber
> each other, however if the same data is concurrently written as already
> there with a regular write between emulated LDx_L and STx_C, then STx_C
> will still succeed.  Likewise if data is modified, but then restored
> before STx_C has had a chance to run.
>
> This fulfils consistency requirements and guarantees that data outside
> the quantity written has not changed between emulated LDx_L and STx_C.
>
> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
> ---
> Hi,
>
>  This has cleared the pair of `-msafe-bwa -msafe-partial' regressions
> observed in GCC verification (the third one was a Modula 2 frontend bug,
> now fixed in the compiler).  I have verified individual misalignments with
> a small program by hand as well, for both the data retrieved by emulated
> LDx_L and the data stored by emulated STx_C.
>
>  The kernel itself built with `-mcpu=ev4 -msafe-bwa -msafe-partial' boots
> and has passed GCC verification, and triggered no extra unaligned traps.
>
>  Full verification was run with 6.3.0-rc5 and Ivan's stack alignment fixes
> applied just because I was confident already that version works correctly.
> Interestingly enough no kernel mode traps have triggered with a kernel
> built with GCC 12 (and with most user traps coming from GCC verification):
>
> kernel unaligned acc    : 0 (pc=0,va=0)
> user unaligned acc      : 1766720 (pc=20000053064,va=120020189)
>
> but with GCC 15 a small quantity happened (even before I ran GCC testing):
>
> kernel unaligned acc    : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
> user unaligned acc      : 883452 (pc=20000053064,va=120020189)
>
> It seems a compiler regression worth checking -- the trap recorded was in
> `icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr',
> which however by its type (`struct in6_addr') is only longword-aligned and
> indeed starts at offset 148 from the outermost struct.  I have a sneaking
> suspicion one of my earlier GCC changes might be at fault.  At least I now
> have a test case to experiment with.
>
>  I've also built and booted 6.9.0-rc3 as at commit 82c525bfafb4 ("alpha:
> trim the unused stuff from asm-offsets.c"), the last one before support
> for my system was axed.  It has passed the verification with my small
> program (available by request; I'm not sure if it's worth turning into a
> kernel selftest).
>
>  NB I'm going to ignore the 72 errors checkpatch.pl issues for EXC usage.
> The coding style of the new additions is consistent with the rest of the
> file and any change to that would best be made separately (but I fail to
> see the point).
>
>  Questions, comments, concerns?  Otherwise please apply, and I'll proceed
> with the rest of the GCC effort, followed by cleaning handwritten assembly
> up that uses STQ_U in our port and in glibc.
>
>   Maciej
> ---
>  arch/alpha/kernel/traps.c |  409 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 400 insertions(+), 9 deletions(-)
>
> linux-alpha-llsc-unaligned.diff
> Index: linux-macro/arch/alpha/kernel/traps.c
> ===================================================================
> --- linux-macro.orig/arch/alpha/kernel/traps.c
> +++ linux-macro/arch/alpha/kernel/traps.c
> @@ -368,6 +368,13 @@ struct unaligned_stat {
>         unsigned long count, va, pc;
>  } unaligned[2];
>
> +/* Unaligned LDx_L/STx_C emulation state.  */
> +static DEFINE_RAW_SPINLOCK(ll_lock);
> +static struct task_struct *ll_task;
> +static unsigned long ll_data[2];
> +static unsigned long ll_va;
> +static bool ll_quad;
> +static bool ll_bit;
>
>  /* Macro for exception fixup code to access integer registers.  */
>  #define una_reg(r)  (_regs[(r) >= 16 && (r) <= 18 ? (r)+19 : (r)])
> @@ -381,6 +388,9 @@ do_entUna(void * va, unsigned long opcod
>         unsigned long pc = regs->pc - 4;
>         unsigned long *_regs = regs->regs;
>         const struct exception_table_entry *fixup;
> +       unsigned long flags;
> +       unsigned long la;
> +       bool ll_match;
>
>         unaligned[0].count++;
>         unaligned[0].va = (unsigned long) va;
> @@ -439,6 +449,65 @@ do_entUna(void * va, unsigned long opcod
>                 una_reg(reg) = tmp1|tmp2;
>                 return;
>
> +       case 0x2a: /* ldl_l */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               __asm__ __volatile__(
> +               "1:     ldl %3,0(%5)\n"
> +               "2:     ldl %4,4(%5)\n"
> +               "       srl %3,%6,%1\n"
> +               "       sll %4,%7,%2\n"
> +               "       zapnot %1,15,%1\n"
> +               "       zapnot %2,15,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(la & ~3ul),
> +                         "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> +               if (error)
> +                       goto got_exception;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = NULL;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = false;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               una_reg(reg) = (int)(tmp1|tmp2);
> +               return;
> +
> +       case 0x2b: /* ldq_l */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               __asm__ __volatile__(
> +               "1:     ldq_u %3,0(%5)\n"
> +               "2:     ldq_u %4,7(%5)\n"
> +               "       extql %3,%5,%1\n"
> +               "       extqh %4,%5,%2\n"
> +               "3:\n"
> +               EXC(1b,3b,%1,%0)
> +               EXC(2b,3b,%2,%0)
> +                       : "=r"(error),
> +                         "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +                       : "r"(va), "0"(0));
> +               if (error)
> +                       goto got_exception;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_va = la;
> +               ll_task = NULL;
> +               ll_data[0] = tmp3;
> +               ll_data[1] = tmp4;
> +               ll_quad = true;
> +               ll_bit = true;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               una_reg(reg) = tmp1|tmp2;
> +               return;
> +
>         /* Note that the store sequences do not indicate that they change
>            memory because it _should_ be affecting nothing in this context.
>            (Otherwise we have other, much larger, problems.)  */
> @@ -513,6 +582,134 @@ do_entUna(void * va, unsigned long opcod
>                 if (error)
>                         goto got_exception;
>                 return;
> +
> +       case 0x2e: /* stl_c */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= !ll_quad;
> +               ll_match &= ll_task == NULL;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       srl %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,4(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,4(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> +                                 "r"(una_reg(reg)), "r"(tmp4),
> +                                 "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       sll %6,%5,%3\n"
> +                       "       zapnot %3,%8,%3\n"
> +                       "1:     ldl_l %2,0(%4)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       zap %2,%8,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stl_c %1,0(%4)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(la & ~3ul), "r"((la & 3) * 8),
> +                                 "r"(una_reg(reg)), "r"(tmp3),
> +                                 "r"((15 << (la & 3)) & 0xf), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               una_reg(reg) = ll_match;
> +               return;
> +
> +       case 0x2f: /* stq_c */
> +               la = (unsigned long)va;
> +               if (la < TASK_SIZE)
> +                       break;
> +               raw_spin_lock_irqsave(&ll_lock, flags);
> +               ll_match = ll_bit;
> +               ll_match &= ll_quad;
> +               ll_match &= ll_task == NULL;
> +               ll_match &= ll_va == la;
> +               tmp3 = ll_data[0];
> +               tmp4 = ll_data[1];
> +               ll_bit = false;
> +               raw_spin_unlock_irqrestore(&ll_lock, flags);
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insqh %6,%4,%3\n"
> +                       "1:     ldq_l %2,8(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskqh %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,8(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(una_reg(reg)), "r"(tmp4), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               if (ll_match) {
> +                       __asm__ __volatile__(
> +                       "       insql %6,%4,%3\n"
> +                       "1:     ldq_l %2,0(%5)\n"
> +                       "       cmpeq %7,%2,%1\n"
> +                       "       beq %1,4f\n"
> +                       "       mskql %2,%4,%2\n"
> +                       "       or %2,%3,%1\n"
> +                       "2:     stq_c %1,0(%5)\n"
> +                       "       beq %1,3f\n"
> +                       "       .subsection 2\n"
> +                       "3:     br 1b\n"
> +                       "       .previous\n"
> +                       "4:\n"
> +                       EXC(1b,4b,%2,%0)
> +                       EXC(2b,4b,%1,%0)
> +                               : "=r"(error), "=&r"(ll_match),
> +                                 "=&r"(tmp1), "=&r"(tmp2)
> +                               : "r"(va), "r"(la & ~7ul),
> +                                 "r"(una_reg(reg)), "r"(tmp3), "0"(0));
> +                       if (error)
> +                               goto got_exception;
> +               }
> +               una_reg(reg) = ll_match;
> +               return;
>         }
>
>         printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
> @@ -624,24 +821,33 @@ s_reg_to_mem (unsigned long s_reg)
>   * so finding the appropriate registers is a little more difficult
>   * than in the kernel case.
>   *
> - * Finally, we handle regular integer load/stores only.  In
> - * particular, load-linked/store-conditionally and floating point
> - * load/stores are not supported.  The former make no sense with
> - * unaligned faults (they are guaranteed to fail) and I don't think
> - * the latter will occur in any decent program.
> + * We have three classes of operations to handle:
>   *
> - * Sigh. We *do* have to handle some FP operations, because GCC will
> - * uses them as temporary storage for integer memory to memory copies.
> - * However, we need to deal with stt/ldt and sts/lds only.
> + * - We handle regular integer load/stores transparently to faulting
> + *   code, preserving the semantics of the triggering instruction.
> + *
> + * - We handle some FP operations as well, because GCC will use them as
> + *   temporary storage for integer memory to memory copies.  However,
> + *   we need to deal with stt/ldt and sts/lds only.
> + *
> + * - We handle load-locked/store-conditional operations by maintaining
> + *   data consistency only, within the two adjacent longwords or
> + *   quadwords partially spanned.  This is sufficient to guarantee an
> + *   unaligned RMW sequence using these operations won't clobber data
> + *   *outside* the location intended but does *not* guarantee atomicity
> + *   for the data quantity itself.
>   */
>
>  #define OP_INT_MASK    ( 1L << 0x28 | 1L << 0x2c   /* ldl stl */       \
> +                       | 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */   \
>                         | 1L << 0x29 | 1L << 0x2d   /* ldq stq */       \
> +                       | 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */   \
>                         | 1L << 0x0c | 1L << 0x0d   /* ldwu stw */      \
>                         | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
>
>  #define OP_WRITE_MASK  ( 1L << 0x26 | 1L << 0x27   /* sts stt */       \
>                         | 1L << 0x2c | 1L << 0x2d   /* stl stq */       \
> +                       | 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */   \

stq_c should be 0x2f, not 0x2d. Looks like a copy-n-paste mistake.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-20 16:46 ` Matt Turner
@ 2025-02-20 16:52   ` Matt Turner
  2025-02-20 19:57     ` Maciej W. Rozycki
  0 siblings, 1 reply; 21+ messages in thread
From: Matt Turner @ 2025-02-20 16:52 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Richard Henderson, Ivan Kokshaysky, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Linus Torvalds, Al Viro, linux-alpha, linux-kernel

On Thu, Feb 20, 2025 at 11:46 AM Matt Turner <mattst88@gmail.com> wrote:
>
> On Wed, Feb 19, 2025 at 7:46 AM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
> >  #define OP_INT_MASK    ( 1L << 0x28 | 1L << 0x2c   /* ldl stl */       \
> > +                       | 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */   \
> >                         | 1L << 0x29 | 1L << 0x2d   /* ldq stq */       \
> > +                       | 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */   \
> >                         | 1L << 0x0c | 1L << 0x0d   /* ldwu stw */      \
> >                         | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
> >
> >  #define OP_WRITE_MASK  ( 1L << 0x26 | 1L << 0x27   /* sts stt */       \
> >                         | 1L << 0x2c | 1L << 0x2d   /* stl stq */       \
> > +                       | 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */   \
>
> stq_c should be 0x2f, not 0x2d. Looks like a copy-n-paste mistake.

The good news is that OP_WRITE_MASK appears to be unused going all the
way back to the import into git, so this doesn't indicate a problem
with any of the testing that's been done.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-20 16:52   ` Matt Turner
@ 2025-02-20 19:57     ` Maciej W. Rozycki
  0 siblings, 0 replies; 21+ messages in thread
From: Maciej W. Rozycki @ 2025-02-20 19:57 UTC (permalink / raw)
  To: Matt Turner
  Cc: Richard Henderson, Ivan Kokshaysky, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Linus Torvalds, Al Viro, linux-alpha, linux-kernel

On Thu, 20 Feb 2025, Matt Turner wrote:

> > On Wed, Feb 19, 2025 at 7:46 AM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
> > >  #define OP_INT_MASK    ( 1L << 0x28 | 1L << 0x2c   /* ldl stl */       \
> > > +                       | 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */   \
> > >                         | 1L << 0x29 | 1L << 0x2d   /* ldq stq */       \
> > > +                       | 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */   \
> > >                         | 1L << 0x0c | 1L << 0x0d   /* ldwu stw */      \
> > >                         | 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
> > >
> > >  #define OP_WRITE_MASK  ( 1L << 0x26 | 1L << 0x27   /* sts stt */       \
> > >                         | 1L << 0x2c | 1L << 0x2d   /* stl stq */       \
> > > +                       | 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */   \
> >
> > stq_c should be 0x2f, not 0x2d. Looks like a copy-n-paste mistake.
> 
> The good news is that OP_WRITE_MASK appears to be unused going all the
> way back to the import into git, so this doesn't indicate a problem
> with any of the testing that's been done.

 Good catch, thank you, and I guess the lack of use is why things haven't 
broken.  I'll make a preparatory change in v2 and remove this macro then.

 FWIW it came with 2.1.36, already unused, so presumably a leftover from a 
WIP version.

 I'll fix the typo in the description as well, thank you.

  Maciej

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-19 12:46 [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency Maciej W. Rozycki
                   ` (3 preceding siblings ...)
  2025-02-20 16:46 ` Matt Turner
@ 2025-02-20 17:54 ` Richard Henderson
  2025-02-20 17:59   ` Linus Torvalds
                     ` (2 more replies)
  2025-02-25 20:44 ` Maciej W. Rozycki
  5 siblings, 3 replies; 21+ messages in thread
From: Richard Henderson @ 2025-02-20 17:54 UTC (permalink / raw)
  To: Maciej W. Rozycki, Ivan Kokshaysky, Matt Turner
  Cc: Arnd Bergmann, John Paul Adrian Glaubitz, Magnus Lindholm,
	Paul E. McKenney, Linus Torvalds, Al Viro, linux-alpha,
	linux-kernel

On 2/19/25 04:46, Maciej W. Rozycki wrote:
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15,

Pointer?  I can't find it on the gcc-patches list.


> implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
> 
> The principle of operation is as follows:
> 
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>     aligned whole data quantities spanned being read and stored for the
>     reference with a subsequent STx_C operation, along with the width of
>     the data accessed and its virtual address, and the task referring or
>     NULL if the kernel.  The valitidy marker is set.
> 
> 2. Regular memory load operations are used to retrieve data because no
>     atomicity is needed at this stage, and matching the width accessed,
>     either LDQ_U or LDL even though the latter instruction requires extra
>     operations, to avoid the case where an unaligned longword located
>     entirely within an aligned quadword would complicate handling.
> 
> 3. Data is masked, shifted and merged appropriately and returned in the
>     intended register as the result of the trapping LDx_L instruction.
> 
> 4. A trapping unaligned STx_C operation results in the valitidy marker
>     being checked for being true, and the width of the data accessed
>     along with the virtual address and the task referring or the kernel
>     for a match.  The pair of whole data quantities previously read by
>     LDx_L emulation is retrieved and the valitidy marker is cleared.
> 
> 5. If the checks succeeded, then in an atomic loop the location of the
>     first whole data quantity is reread, and data retrieved compared with
>     the value previously obtained.  If there's no match, then the loop is
>     aborted and 0 is returned in the intended register as the result of
>     the trapping STx_C instruction and emulation completes.  Otherwise
>     new data obtained from the source operand of STx_C is combined with
>     the data retrieved, replacing by byte insertion the part intended,
>     and an atomic write of this new data is attempted.  If it fails, the
>     loop continues from the beginning.  Otherwise processing proceeds to
>     the next step.
> 
> 6. The same operations are performed on the second whole data quantity.
> 
> 7. At this point both whole data quantities have been written, ensuring
>     that no third-party intervening write has changed them at the point
>     of the write from the values held at previous LDx_L.  Therefore 1 is
>     returned in the intended register as the result of the trapping STx_C
>     instruction.

I think general-purpose non-atomic emulation of STx_C is a really bad idea.

Without looking at your gcc patches, I can guess what you're after: you've generated a 
ll/sc sequence for (aligned) short, and want to emulate if it happens to be unaligned.

Crucially, when emulating non-aligned, you should not strive to make it atomic.  No other 
architecture promises atomic non-aligned stores, so why should you do that here?

I suggest some sort of magic code sequence,

	bic	addr_in, 6, addr_al
loop:
	ldq_l	t0, 0(addr_al)
	magic-nop done - loop
	inswl	data, addr_in, t1
	mskwl	t0, addr_in, t0
	bis	t0, t1, t0
	stq_c	t0, 0(addr_al)
	beq	t0, loop
done:

With the trap, match the magic-nop, pick out the input registers from the following inswl, 
perform the two (atomic!) byte stores to accomplish the emulation, adjust the pc forward 
to the done label.

Choose anything you like for the magic-nop.  The (done - loop) displacement is small, so 
any 8-bit immediate would suffice.  E.g. "eqv $31, disp, $31".  You might require the 
displacement to be constant and not actually extract "disp"; just match the entire 
uint32_t instruction pattern.


r~

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-20 17:54 ` Richard Henderson
@ 2025-02-20 17:59   ` Linus Torvalds
  2025-02-20 18:12     ` Richard Henderson
  2025-02-20 19:07   ` Matt Turner
  2025-02-20 21:05   ` Maciej W. Rozycki
  2 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2025-02-20 17:59 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Maciej W. Rozycki, Ivan Kokshaysky, Matt Turner, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Al Viro, linux-alpha, linux-kernel

On Thu, 20 Feb 2025 at 09:54, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Crucially, when emulating non-aligned, you should not strive to make it atomic.  No other
> architecture promises atomic non-aligned stores, so why should you do that here?

I'm not disagreeing with the "it doesn't necessarily have to be
atomic", but I will point out that x86 does indeed promise atomic
non-aligned accesses.

It will actually lock both cachelines when straddling a cacheline.

It's slow, it's horrendous, and people are trying to get away from it
(google "split lock"), but it is actually architecturally supported.

                   Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-20 17:59   ` Linus Torvalds
@ 2025-02-20 18:12     ` Richard Henderson
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2025-02-20 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Maciej W. Rozycki, Ivan Kokshaysky, Matt Turner, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Al Viro, linux-alpha, linux-kernel

On 2/20/25 09:59, Linus Torvalds wrote:
> On Thu, 20 Feb 2025 at 09:54, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Crucially, when emulating non-aligned, you should not strive to make it atomic.  No other
>> architecture promises atomic non-aligned stores, so why should you do that here?
> 
> I'm not disagreeing with the "it doesn't necessarily have to be
> atomic", but I will point out that x86 does indeed promise atomic
> non-aligned accesses.

I should have been more expansive with that statement: I didn't mean "no unaligned 
atomics" (e.g. lock orw), but "unaligned normal stores may be non-atomic" (e.g. movw 
across a cacheline).

My guess about the gcc patches is that it's the latter that wanted emulation here.


r~

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-20 17:54 ` Richard Henderson
  2025-02-20 17:59   ` Linus Torvalds
@ 2025-02-20 19:07   ` Matt Turner
  2025-02-20 20:09     ` Maciej W. Rozycki
  2025-02-20 21:05   ` Maciej W. Rozycki
  2 siblings, 1 reply; 21+ messages in thread
From: Matt Turner @ 2025-02-20 19:07 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Maciej W. Rozycki, Ivan Kokshaysky, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Linus Torvalds, Al Viro, linux-alpha, linux-kernel

On Thu, Feb 20, 2025 at 12:54 PM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 2/19/25 04:46, Maciej W. Rozycki wrote:
> > Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> > code generation options slated to land in GCC 15,
>
> Pointer?  I can't find it on the gcc-patches list.

I believe it's this:

https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2411141652300.9262@angie.orcam.me.uk/

The first half or so have landed so far, right Maciej?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-20 19:07   ` Matt Turner
@ 2025-02-20 20:09     ` Maciej W. Rozycki
  0 siblings, 0 replies; 21+ messages in thread
From: Maciej W. Rozycki @ 2025-02-20 20:09 UTC (permalink / raw)
  To: Matt Turner
  Cc: Richard Henderson, Ivan Kokshaysky, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Linus Torvalds, Al Viro, linux-alpha, linux-kernel

On Thu, 20 Feb 2025, Matt Turner wrote:

> > > Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> > > code generation options slated to land in GCC 15,
> >
> > Pointer?  I can't find it on the gcc-patches list.
> 
> I believe it's this:
> 
> https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2411141652300.9262@angie.orcam.me.uk/

 That's the original series; v2 is here:

https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2501050246590.49841@angie.orcam.me.uk/

> The first half or so have landed so far, right Maciej?

 Yes, fixes for bugs discovered in the course have been committed and the 
rest of changes already approved, although there were a couple of comments 
I yet want to investigate/address by the end of next week.  Getting clean 
GCC test results with the kernel emulation was my objective before moving 
forward.

  Maciej

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-20 17:54 ` Richard Henderson
  2025-02-20 17:59   ` Linus Torvalds
  2025-02-20 19:07   ` Matt Turner
@ 2025-02-20 21:05   ` Maciej W. Rozycki
  2 siblings, 0 replies; 21+ messages in thread
From: Maciej W. Rozycki @ 2025-02-20 21:05 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Ivan Kokshaysky, Matt Turner, Arnd Bergmann,
	John Paul Adrian Glaubitz, Magnus Lindholm, Paul E. McKenney,
	Linus Torvalds, Al Viro, linux-alpha, linux-kernel

On Thu, 20 Feb 2025, Richard Henderson wrote:

> > Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> > code generation options slated to land in GCC 15,
> 
> Pointer?  I can't find it on the gcc-patches list.

 Here: 
<https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2501050246590.49841@angie.orcam.me.uk/>
and hopefully in your inbox/archive somewhere as well.

> > 7. At this point both whole data quantities have been written, ensuring
> >     that no third-party intervening write has changed them at the point
> >     of the write from the values held at previous LDx_L.  Therefore 1 is
> >     returned in the intended register as the result of the trapping STx_C
> >     instruction.
> 
> I think general-purpose non-atomic emulation of STx_C is a really bad idea.
> 
> Without looking at your gcc patches, I can guess what you're after: you've
> generated a ll/sc sequence for (aligned) short, and want to emulate if it
> happens to be unaligned.

 It's a corner case, yes, when the compiler was told the access would be 
aligned, but it turns out not.  It's where you cast a (char *) pointer to 
(short *) that wasn't suitably aligned for such a cast and dereference it 
(and the quadword case is similarly for the ends of misaligned inline 
`memcpy'/`memset').

 Only two cases (plus a bug in GM2 frontend) hitting this throughout the 
GCC testsuite show the rarity of this case.

> Crucially, when emulating non-aligned, you should not strive to make it
> atomic.  No other architecture promises atomic non-aligned stores, so why
> should you do that here?

 This code doesn't strive to be atomic, but to preserve data *outside* the 
quantity accessed from being clobbered, and for this purpose an atomic 
sequence is both inevitable and sufficient, for both partial quantities 
around the unaligned quantity written.  The trapping code does not expect 
atomicity for the unaligned quantity itself -- it is handled in pieces 
just as say with MIPS SWL/SWR masked store instruction pairs -- and this 
code, effectively an Alpha/Linux psABI extension, does not guarantee it 
either.

> I suggest some sort of magic code sequence,
> 
> 	bic	addr_in, 6, addr_al
> loop:
> 	ldq_l	t0, 0(addr_al)
> 	magic-nop done - loop
> 	inswl	data, addr_in, t1
> 	mskwl	t0, addr_in, t0
> 	bis	t0, t1, t0
> 	stq_c	t0, 0(addr_al)
> 	beq	t0, loop
> done:
> 
> With the trap, match the magic-nop, pick out the input registers from the
> following inswl, perform the two (atomic!) byte stores to accomplish the
> emulation, adjust the pc forward to the done label.

 It seems to make no sense to me to penalise all user code for the corner 
case mentioned above while still having the emulation in the kernel, given 
that 99.999...% of accesses will have been correctly aligned by GCC.  And 
it gets even more complex when you have an awkward number of bytes to 
mask, such as 3, 5, 6, 7, which will happen for example if inline `memcpy' 
is expanded by GCC for a quadword-aligned block of 31-bytes, in which case 
other instructions will be used for masking/insertion for the trailing 7 
bytes, and the block turns out misaligned at run time.

 I'm inconvinced, it seems a lot of hassle for little gain to me.

  Maciej

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
  2025-02-19 12:46 [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency Maciej W. Rozycki
                   ` (4 preceding siblings ...)
  2025-02-20 17:54 ` Richard Henderson
@ 2025-02-25 20:44 ` Maciej W. Rozycki
  5 siblings, 0 replies; 21+ messages in thread
From: Maciej W. Rozycki @ 2025-02-25 20:44 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner
  Cc: Arnd Bergmann, John Paul Adrian Glaubitz, Magnus Lindholm,
	Paul E. McKenney, Linus Torvalds, Al Viro, linux-alpha,
	linux-kernel

On Wed, 19 Feb 2025, Maciej W. Rozycki wrote:

> Interestingly enough no kernel mode traps have triggered with a kernel 
> built with GCC 12 (and with most user traps coming from GCC verification):
> 
> kernel unaligned acc	: 0 (pc=0,va=0)
> user unaligned acc	: 1766720 (pc=20000053064,va=120020189)
> 
> but with GCC 15 a small quantity happened (even before I ran GCC testing):
> 
> kernel unaligned acc    : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
> user unaligned acc      : 883452 (pc=20000053064,va=120020189)
> 
> It seems a compiler regression worth checking -- the trap recorded was in 
> `icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr', 
> which however by its type (`struct in6_addr') is only longword-aligned and 
> indeed starts at offset 148 from the outermost struct.  I have a sneaking 
> suspicion one of my earlier GCC changes might be at fault.  At least I now 
> have a test case to experiment with.

 FYI my suspicion wasn't wrong, I have submitted a compiler fix now[1].  

 My plan has been to complete the GCC side first as it's more urgent given 
its annual only release cycle model targetting April/May, whereas I think 
the Linux side can slip a release or two in our roughly bi-monthly cycle.  
I'm going to schedule my time accordinly and with my upcoming holiday also 
in the picture I may not be able to post v2 of this proposal until around 
end of March the earliest.

References:

[1] "Alpha: Fix base block alignment calculation regression", 
    <https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2502251934260.65342@angie.orcam.me.uk/>

  Maciej

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-04-10  4:38 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-19 12:46 [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency Maciej W. Rozycki
2025-02-19 17:38 ` Magnus Lindholm
2025-02-19 23:40 ` Linus Torvalds
2025-02-20 22:09   ` Maciej W. Rozycki
2025-04-07 20:46     ` Maciej W. Rozycki
2025-04-08  0:34       ` Linus Torvalds
2025-04-08  8:37         ` Arnd Bergmann
2025-04-09 15:52           ` Eric W. Biederman
2025-04-09 20:59             ` Maciej W. Rozycki
2025-04-10  4:37               ` Eric W. Biederman
2025-02-20 14:25 ` Magnus Lindholm
2025-02-20 16:46 ` Matt Turner
2025-02-20 16:52   ` Matt Turner
2025-02-20 19:57     ` Maciej W. Rozycki
2025-02-20 17:54 ` Richard Henderson
2025-02-20 17:59   ` Linus Torvalds
2025-02-20 18:12     ` Richard Henderson
2025-02-20 19:07   ` Matt Turner
2025-02-20 20:09     ` Maciej W. Rozycki
2025-02-20 21:05   ` Maciej W. Rozycki
2025-02-25 20:44 ` Maciej W. Rozycki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox