All of lore.kernel.org
 help / color / mirror / Atom feed
From: Charlie Jenkins <charlie@rivosinc.com>
To: Helge Deller <deller@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>, Helge Deller <deller@gmx.de>,
	"James E . J . Bottomley" <James.Bottomley@hansenpartnership.com>,
	linux-parisc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Palmer Dabbelt <palmer@rivosinc.com>
Subject: Re: [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems
Date: Fri, 16 Feb 2024 22:00:16 -0500	[thread overview]
Message-ID: <ZdAhQHFXUF7wEWea@ghost> (raw)
In-Reply-To: <Zc9XW-TxQKp84vMt@p100>

On Fri, Feb 16, 2024 at 01:38:51PM +0100, Helge Deller wrote:
> * Guenter Roeck <linux@roeck-us.net>:
> > hppa 64-bit systems calculates the IPv6 checksum using 64-bit add
> > operations. The last add folds protocol and length fields into the 64-bit
> > result. While unlikely, this operation can overflow. The overflow can be
> > triggered with a code sequence such as the following.
> > 
> > 	/* try to trigger massive overflows */
> > 	memset(tmp_buf, 0xff, sizeof(struct in6_addr));
> > 	csum_result = csum_ipv6_magic((struct in6_addr *)tmp_buf,
> > 				      (struct in6_addr *)tmp_buf,
> > 				      0xffff, 0xff, 0xffffffff);
> > 
> > Fix the problem by adding any overflows from the final add operation into
> > the calculated checksum. Fortunately, we can do this without additional
> > cost by replacing the add operation used to fold the checksum into 32 bit
> > with "add,dc" to add in the missing carry.
> 
> 
> Gunter,
> 
> Thanks for your patch for 32- and 64-bit systems.
> But I think it's time to sunset the parisc inline assembly for ipv4/ipv6
> checksumming.
> The patch below converts the code to use standard C-coding (taken from the
> s390 header file) and it survives the v8 lib/checksum patch.
> 
> Opinions?
> 
> Helge
> 
> Subject: [PATCH] parisc: Fix csum_ipv6_magic on 32- and 64-bit machines
> 
> Guenter noticed that the 32- and 64-bit checksum functions may calculate
> wrong values under certain circumstances. He fixed it by usining
> corrected carry-flags added, but overall I think it's better to convert
> away from inline assembly to generic C-coding.  That way the code is
> much cleaner and the compiler can do much better optimizations based on
> the target architecture (32- vs. 64-bit). Furthermore, the compiler
> generates nowadays much better code, so inline assembly usually won't
> give any benefit any longer.
> 
> Signed-off-by: Helge Deller <deller@gmx.de>
> Noticed-by: Guenter Roeck <linux@roeck-us.net>
> 
> diff --git a/arch/parisc/include/asm/checksum.h b/arch/parisc/include/asm/checksum.h
> index 3c43baca7b39..c72f14536353 100644
> --- a/arch/parisc/include/asm/checksum.h
> +++ b/arch/parisc/include/asm/checksum.h
> @@ -18,160 +18,93 @@
>   */
>  extern __wsum csum_partial(const void *, int, __wsum);
>  
> +
>  /*
> - *	Optimized for IP headers, which always checksum on 4 octet boundaries.
> - *
> - *	Written by Randolph Chung <tausq@debian.org>, and then mucked with by
> - *	LaMont Jones <lamont@debian.org>
> + * Fold a partial checksum without adding pseudo headers.
>   */
> -static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> +static inline __sum16 csum_fold(__wsum sum)
>  {
> -	unsigned int sum;
> -	unsigned long t0, t1, t2;
> +	u32 csum = (__force u32) sum;
>  
> -	__asm__ __volatile__ (
> -"	ldws,ma		4(%1), %0\n"
> -"	addib,<=	-4, %2, 2f\n"
> -"\n"
> -"	ldws		4(%1), %4\n"
> -"	ldws		8(%1), %5\n"
> -"	add		%0, %4, %0\n"
> -"	ldws,ma		12(%1), %3\n"
> -"	addc		%0, %5, %0\n"
> -"	addc		%0, %3, %0\n"
> -"1:	ldws,ma		4(%1), %3\n"
> -"	addib,<		0, %2, 1b\n"
> -"	addc		%0, %3, %0\n"
> -"\n"
> -"	extru		%0, 31, 16, %4\n"
> -"	extru		%0, 15, 16, %5\n"
> -"	addc		%4, %5, %0\n"
> -"	extru		%0, 15, 16, %5\n"
> -"	add		%0, %5, %0\n"
> -"	subi		-1, %0, %0\n"
> -"2:\n"
> -	: "=r" (sum), "=r" (iph), "=r" (ihl), "=r" (t0), "=r" (t1), "=r" (t2)
> -	: "1" (iph), "2" (ihl)
> -	: "memory");
> -
> -	return (__force __sum16)sum;
> +	csum += (csum >> 16) | (csum << 16);
> +	csum >>= 16;
> +	return (__force __sum16) ~csum;
>  }

For all of my analysis I am using gcc 12.3.0.

We can do better than this! By inspection this looks like a performance
regression. The generic version of csum_fold in
include/asm-generic/checksum.h is better than this so should be used
instead.

I am not familiar with hppa assembly but this is the assembly for the
original csum_fold here:

0000000000000000 <csum_fold>:
   0:	08 03 02 41 	copy r3,r1
   4:	08 1e 02 43 	copy sp,r3
   8:	73 c1 00 88 	std,ma r1,40(sp)
   c:	d3 5a 09 fc 	shrpw r26,r26,16,ret0
  10:	0b 9a 0a 1c 	add,l r26,ret0,ret0
  14:	0b 80 09 9c 	uaddcm r0,ret0,ret0
  18:	db 9c 09 f0 	extrd,u,* ret0,47,16,ret0
  1c:	34 7e 00 80 	ldo 40(r3),sp
  20:	e8 40 d0 00 	bve (rp)
  24:	53 c3 3f 8d 	ldd,mb -40(sp),r3

This is the assembly for the generic version:
0000000000000000 <csum_fold_generic>:
   0:	08 03 02 41 	copy r3,r1
   4:	08 1e 02 43 	copy sp,r3
   8:	73 c1 00 88 	std,ma r1,40(sp)
   c:	0b 40 09 9c 	uaddcm r0,r26,ret0
  10:	d3 5a 09 fa 	shrpw r26,r26,16,r26
  14:	0b 5c 04 1c 	sub ret0,r26,ret0
  18:	db 9c 09 f0 	extrd,u,* ret0,47,16,ret0
  1c:	34 7e 00 80 	ldo 40(r3),sp
  20:	e8 40 d0 00 	bve (rp)
  24:	53 c3 3f 8d 	ldd,mb -40(sp),r3

This is the assembly for yours:
0000000000000028 <csum_fold_new>:
  28:	08 03 02 41 	copy r3,r1
  2c:	08 1e 02 43 	copy sp,r3
  30:	73 c1 00 88 	std,ma r1,40(sp)
  34:	d3 5a 09 fc 	shrpw r26,r26,16,ret0
  38:	0b 9a 0a 1c 	add,l r26,ret0,ret0
  3c:	d3 9c 19 f0 	extrw,u ret0,15,16,ret0
  40:	0b 80 09 9c 	uaddcm r0,ret0,ret0
  44:	db 9c 0b f0 	extrd,u,* ret0,63,16,ret0
  48:	34 7e 00 80 	ldo 40(r3),sp
  4c:	e8 40 d0 00 	bve (rp)
  50:	53 c3 3f 8d 	ldd,mb -40(sp),r3
  54:	00 00 00 00 	break 0,0

 The assembly is almost the same for the generic and the original code,
 except for rearranging the shift and add operation which allows the
 addition to become a subtraction and shortens the dependency chain on
 some architectures (looks like it doesn't change much here). However,
 this new code introduces an additional extrw instruction.

>  
>  /*
> - *	Fold a partial checksum
> + * This is a version of ip_compute_csum() optimized for IP headers,
> + * which always checksums on 4 octet boundaries.
>   */
> -static inline __sum16 csum_fold(__wsum csum)
> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>  {
> -	u32 sum = (__force u32)csum;
> -	/* add the swapped two 16-bit halves of sum,
> -	   a possible carry from adding the two 16-bit halves,
> -	   will carry from the lower half into the upper half,
> -	   giving us the correct sum in the upper half. */
> -	sum += (sum << 16) + (sum >> 16);
> -	return (__force __sum16)(~sum >> 16);
> +	__u64 csum = 0;
> +	__u32 *ptr = (u32 *)iph;
> +
> +	csum += *ptr++;
> +	csum += *ptr++;
> +	csum += *ptr++;
> +	csum += *ptr++;
> +	ihl -= 4;
> +	while (ihl--)
> +		csum += *ptr++;
> +	csum += (csum >> 32) | (csum << 32);
> +	return csum_fold((__force __wsum)(csum >> 32));
>  }

This doesn't leverage add with carry well. This causes the code size of this
to be dramatically larger than the original assembly, which I assume
nicely correlates to an increased execution time.

This is the original assembly in 64-bit (almost the same in 32-bit):
0000000000000028 <ip_fast_csum>:
  28:	08 03 02 41 	copy r3,r1
  2c:	08 1e 02 43 	copy sp,r3
  30:	73 c1 00 88 	std,ma r1,40(sp)
  34:	0f 48 10 bc 	ldw,ma 4(r26),ret0
  38:	a7 39 60 70 	addib,<= -4,r25,78 <ip_fast_csum+0x50>
  3c:	0f 48 10 93 	ldw 4(r26),r19
  40:	0f 50 10 94 	ldw 8(r26),r20
  44:	0a 7c 06 1c 	add ret0,r19,ret0
  48:	0f 58 10 bf 	ldw,ma c(r26),r31
  4c:	0a 9c 07 1c 	add,c ret0,r20,ret0
  50:	0b fc 07 1c 	add,c ret0,r31,ret0
  54:	0f 48 10 bf 	ldw,ma 4(r26),r31
  58:	a7 20 5f ed 	addib,< 0,r25,54 <ip_fast_csum+0x2c>
  5c:	0b fc 07 1c 	add,c ret0,r31,ret0
  60:	d3 93 1b f0 	extrw,u ret0,31,16,r19
  64:	d3 94 19 f0 	extrw,u ret0,15,16,r20
  68:	0a 93 07 1c 	add,c r19,r20,ret0
  6c:	d3 94 19 f0 	extrw,u ret0,15,16,r20
  70:	0a 9c 06 1c 	add ret0,r20,ret0
  74:	97 9c 07 ff 	subi -1,ret0,ret0
  78:	db 9c 0b f0 	extrd,u,* ret0,63,16,ret0
  7c:	34 7e 00 80 	ldo 40(r3),sp
  80:	e8 40 d0 00 	bve (rp)
  84:	53 c3 3f 8d 	ldd,mb -40(sp),r3

This is the 64-bit assembly of your proposal:

0000000000000058 <ip_fast_csum_new>:
  58:	08 03 02 41 	copy r3,r1
  5c:	0f c2 12 c1 	std rp,-10(sp)
  60:	08 1e 02 43 	copy sp,r3
  64:	73 c1 01 08 	std,ma r1,80(sp)
  68:	37 39 3f f9 	ldo -4(r25),r25
  6c:	0c 64 12 d0 	std r4,8(r3)
  70:	0f 48 10 9f 	ldw 4(r26),r31
  74:	db 39 0b e0 	extrd,u,* r25,63,32,r25
  78:	37 5a 00 20 	ldo 10(r26),r26
  7c:	0f 41 10 9c 	ldw -10(r26),ret0
  80:	0b fc 0a 1c 	add,l ret0,r31,ret0
  84:	0f 51 10 9f 	ldw -8(r26),r31
  88:	0b 9f 0a 1f 	add,l r31,ret0,r31
  8c:	0f 59 10 9c 	ldw -4(r26),ret0
  90:	0b fc 0a 1c 	add,l ret0,r31,ret0
  94:	37 3f 3f ff 	ldo -1(r25),r31
  98:	8f ff 20 68 	cmpib,<> -1,r31,d4 <ip_fast_csum_new+0x7c>
  9c:	db f9 0b e0 	extrd,u,* r31,63,32,r25
  a0:	d3 9c 07 fa 	shrpd,* ret0,ret0,32,r26
  a4:	37 dd 3f a1 	ldo -30(sp),ret1
  a8:	0b 9a 0a 1a 	add,l r26,ret0,r26
  ac:	00 00 14 a1 	mfia r1
  b0:	28 20 00 00 	addil L%0,r1,r1
  b4:	34 21 00 00 	ldo 0(r1),r1
  b8:	e8 20 f0 00 	bve,l (r1),rp
  bc:	db 5a 03 e0 	extrd,u,* r26,31,32,r26
  c0:	0c 70 10 c4 	ldd 8(r3),r4
  c4:	0c 61 10 c2 	ldd -10(r3),rp
  c8:	34 7e 00 80 	ldo 40(r3),sp
  cc:	e8 40 d0 00 	bve (rp)
  d0:	53 c3 3f 8d 	ldd,mb -40(sp),r3
  d4:	0f 40 10 9f 	ldw 0(r26),r31
  d8:	37 5a 00 08 	ldo 4(r26),r26
  dc:	e8 1f 1f 65 	b,l 94 <ip_fast_csum_new+0x3c>,r0
  e0:	0b fc 0a 1c 	add,l ret0,r31,ret0
  e4:	00 00 00 00 	break 0,0

The 32-bit assembly is even longer.

Maybe you can get some inspiration from my riscv implementation
arch/riscv/include/asm/checksum.h. You can swap out the line:
csum += csum < ((const unsigned int *)iph)[pos];
With the addc macro defined in arch/parisc/lib/checksum.c.

I will guess that this will improve the code but I haven't checked.

> - 
> -static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
> -					__u32 len, __u8 proto,
> -					__wsum sum)
> +
> +/*
> + * Computes the checksum of the TCP/UDP pseudo-header.
> + * Returns a 32-bit checksum.
> + */
> +static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len,
> +					__u8 proto, __wsum sum)
>  {
> -	__asm__(
> -	"	add  %1, %0, %0\n"
> -	"	addc %2, %0, %0\n"
> -	"	addc %3, %0, %0\n"
> -	"	addc %%r0, %0, %0\n"
> -		: "=r" (sum)
> -		: "r" (daddr), "r"(saddr), "r"(proto+len), "0"(sum));
> -	return sum;
> +	__u64 csum = (__force __u64)sum;
> +
> +	csum += (__force __u32)saddr;
> +	csum += (__force __u32)daddr;
> +	csum += len;
> +	csum += proto;
> +	csum += (csum >> 32) | (csum << 32);
> +	return (__force __wsum)(csum >> 32);
>  }

The assembly from the original:

0000000000000088 <csum_tcpudp_nofold>:
  88:	08 03 02 41 	copy r3,r1
  8c:	08 1e 02 43 	copy sp,r3
  90:	73 c1 00 88 	std,ma r1,40(sp)
  94:	da f7 0b f8 	extrd,u,* r23,63,8,r23
  98:	0a f8 0a 17 	add,l r24,r23,r23
  9c:	0a d9 06 16 	add r25,r22,r22
  a0:	0a da 07 16 	add,c r26,r22,r22
  a4:	0a d7 07 16 	add,c r23,r22,r22
  a8:	0a c0 07 16 	add,c r0,r22,r22
  ac:	da dc 0b e0 	extrd,u,* r22,63,32,ret0
  b0:	34 7e 00 80 	ldo 40(r3),sp
  b4:	e8 40 d0 00 	bve (rp)
  b8:	53 c3 3f 8d 	ldd,mb -40(sp),r3
  bc:	00 00 00 00 	break 0,0

The 64-bit assembly from your proposal:

0000000000000140 <csum_tcpudp_nofold_new>:
 140:	08 03 02 41 	copy r3,r1
 144:	08 1e 02 43 	copy sp,r3
 148:	73 c1 00 88 	std,ma r1,40(sp)
 14c:	db 5a 0b e0 	extrd,u,* r26,63,32,r26
 150:	db 39 0b e0 	extrd,u,* r25,63,32,r25
 154:	da f7 0b f8 	extrd,u,* r23,63,8,r23
 158:	db 18 0b e0 	extrd,u,* r24,63,32,r24
 15c:	da d6 0b e0 	extrd,u,* r22,63,32,r22
 160:	0a f8 0a 18 	add,l r24,r23,r24
 164:	0b 38 0a 18 	add,l r24,r25,r24
 168:	0b 58 0a 18 	add,l r24,r26,r24
 16c:	0a d8 0a 16 	add,l r24,r22,r22
 170:	d2 d6 07 fc 	shrpd,* r22,r22,32,ret0
 174:	0a dc 0a 1c 	add,l ret0,r22,ret0
 178:	db 9c 03 e0 	extrd,u,* ret0,31,32,ret0
 17c:	34 7e 00 80 	ldo 40(r3),sp
 180:	e8 40 d0 00 	bve (rp)
 184:	53 c3 3f 8d 	ldd,mb -40(sp),r3

There are a lot of extra extrd instructions, and again is really bad on
32-bit parisc.

Maybe there is a good way to represent this in C, but the assembly is
so simple and clean for this function already.

>  
>  /*
> - * computes the checksum of the TCP/UDP pseudo-header
> - * returns a 16-bit checksum, already complemented
> + * Computes the checksum of the TCP/UDP pseudo-header.
> + * Returns a 16-bit checksum, already complemented.
>   */
> -static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> -					__u32 len, __u8 proto,
> -					__wsum sum)
> +static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
> +					__u8 proto, __wsum sum)
>  {
> -	return csum_fold(csum_tcpudp_nofold(saddr,daddr,len,proto,sum));
> +	return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
>  }
>  
The implementation of csum_tcpudp_magic is the same as the generic, so
the generic version should be used instead.

>  /*
> - * this routine is used for miscellaneous IP-like checksums, mainly
> - * in icmp.c
> + * Used for miscellaneous IP-like checksums, mainly icmp.
>   */
> -static inline __sum16 ip_compute_csum(const void *buf, int len)
> +static inline __sum16 ip_compute_csum(const void *buff, int len)
>  {
> -	 return csum_fold (csum_partial(buf, len, 0));
> +	return csum_fold(csum_partial(buff, len, 0));
>  }

The generic version is better than this so it should be used instead.

>  
> -
>  #define _HAVE_ARCH_IPV6_CSUM
> -static __inline__ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> -					  const struct in6_addr *daddr,
> -					  __u32 len, __u8 proto,
> -					  __wsum sum)
> +static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> +				      const struct in6_addr *daddr,
> +				      __u32 len, __u8 proto, __wsum csum)
>  {
> -	unsigned long t0, t1, t2, t3;
> -
> -	len += proto;	/* add 16-bit proto + len */
> -
> -	__asm__ __volatile__ (
> -
> -#if BITS_PER_LONG > 32
> -
> -	/*
> -	** We can execute two loads and two adds per cycle on PA 8000.
> -	** But add insn's get serialized waiting for the carry bit.
> -	** Try to keep 4 registers with "live" values ahead of the ALU.
> -	*/
> -
> -"	ldd,ma		8(%1), %4\n"	/* get 1st saddr word */
> -"	ldd,ma		8(%2), %5\n"	/* get 1st daddr word */
> -"	add		%4, %0, %0\n"
> -"	ldd,ma		8(%1), %6\n"	/* 2nd saddr */
> -"	ldd,ma		8(%2), %7\n"	/* 2nd daddr */
> -"	add,dc		%5, %0, %0\n"
> -"	add,dc		%6, %0, %0\n"
> -"	add,dc		%7, %0, %0\n"
> -"	add,dc		%3, %0, %0\n"  /* fold in proto+len | carry bit */
> -"	extrd,u		%0, 31, 32, %4\n"/* copy upper half down */
> -"	depdi		0, 31, 32, %0\n"/* clear upper half */
> -"	add		%4, %0, %0\n"	/* fold into 32-bits */
> -"	addc		0, %0, %0\n"	/* add carry */
> -
> -#else
> -
> -	/*
> -	** For PA 1.x, the insn order doesn't matter as much.
> -	** Insn stream is serialized on the carry bit here too.
> -	** result from the previous operation (eg r0 + x)
> -	*/
> -"	ldw,ma		4(%1), %4\n"	/* get 1st saddr word */
> -"	ldw,ma		4(%2), %5\n"	/* get 1st daddr word */
> -"	add		%4, %0, %0\n"
> -"	ldw,ma		4(%1), %6\n"	/* 2nd saddr */
> -"	addc		%5, %0, %0\n"
> -"	ldw,ma		4(%2), %7\n"	/* 2nd daddr */
> -"	addc		%6, %0, %0\n"
> -"	ldw,ma		4(%1), %4\n"	/* 3rd saddr */
> -"	addc		%7, %0, %0\n"
> -"	ldw,ma		4(%2), %5\n"	/* 3rd daddr */
> -"	addc		%4, %0, %0\n"
> -"	ldw,ma		4(%1), %6\n"	/* 4th saddr */
> -"	addc		%5, %0, %0\n"
> -"	ldw,ma		4(%2), %7\n"	/* 4th daddr */
> -"	addc		%6, %0, %0\n"
> -"	addc		%7, %0, %0\n"
> -"	addc		%3, %0, %0\n"	/* fold in proto+len, catch carry */
> -
> -#endif
> -	: "=r" (sum), "=r" (saddr), "=r" (daddr), "=r" (len),
> -	  "=r" (t0), "=r" (t1), "=r" (t2), "=r" (t3)
> -	: "0" (sum), "1" (saddr), "2" (daddr), "3" (len)
> -	: "memory");
> -	return csum_fold(sum);
> +	__u64 sum = (__force __u64)csum;
> +
> +	sum += (__force __u32)saddr->s6_addr32[0];
> +	sum += (__force __u32)saddr->s6_addr32[1];
> +	sum += (__force __u32)saddr->s6_addr32[2];
> +	sum += (__force __u32)saddr->s6_addr32[3];
> +	sum += (__force __u32)daddr->s6_addr32[0];
> +	sum += (__force __u32)daddr->s6_addr32[1];
> +	sum += (__force __u32)daddr->s6_addr32[2];
> +	sum += (__force __u32)daddr->s6_addr32[3];
> +	sum += len;
> +	sum += proto;
> +	sum += (sum >> 32) | (sum << 32);
> +	return csum_fold((__force __wsum)(sum >> 32));
>  }
>  
>  #endif
> -

Similar story again here where the add with carry is not well translated
into C, resulting in significantly worse assembly. Using __u64 seems to
be a big contributing factor for why the 32-bit assembly is worse.

I am not sure the best way to represent this in a clean way in C.

add with carry is not well understood by GCC 12.3 it seems. These
functions are generally heavily optimized on every architecture, so I
think it is worth it to default to assembly if you aren't able to
achieve comparable performance in C.

- Charlie


  parent reply	other threads:[~2024-02-17  3:00 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-13 23:46 [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems Guenter Roeck
2024-02-14 20:22 ` Charlie Jenkins
2024-02-16 12:38 ` Helge Deller
2024-02-16 15:31   ` Guenter Roeck
2024-02-17  3:00   ` Charlie Jenkins [this message]
2024-02-17 22:47     ` David Laight
2024-02-19 21:11     ` Helge Deller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZdAhQHFXUF7wEWea@ghost \
    --to=charlie@rivosinc.com \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=deller@gmx.de \
    --cc=deller@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-parisc@vger.kernel.org \
    --cc=linux@roeck-us.net \
    --cc=palmer@rivosinc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.