[PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O
@ 2012-08-22  1:17 Ben Hutchings
  2012-08-22  1:20 ` [PATCH 1/3] x86_64: Define 128-bit types for kernel code only Ben Hutchings
                   ` (3 more replies)
  0 siblings, 4 replies; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22  1:17 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: netdev, linux-net-drivers, x86

Current Solarflare network controllers have 128-bit memory-mapped
registers which are normally accessed through a series of I/O
operations.  However, it is also possible to access them with a single
MOVAPS instruction on x86_64, and this is measurably faster as it
requires only one PCIe transaction.

This series introduces reado(), writeo() and the corresponding raw
operations on x86_64.  These could also be implemented on i386 with an
appropriate CPU selection, but it would require u128 etc. to be defined
as structure types.

The sfc driver can potentially take advantage of both 128-bit reads and
writes, but as a first step this changes only one particular write that
may be done on the data path.

Ben.

Ben Hutchings (3):
  x86_64: Define 128-bit types for kernel code only
  x86_64: Define 128-bit memory-mapped I/O operations
  sfc: Use __raw_writeo() to perform TX descriptor push where possible

 arch/x86/Kconfig.cpu                |    4 ++
 arch/x86/include/asm/io.h           |   14 +++++++
 arch/x86/include/asm/types.h        |    8 ++++
 arch/x86/lib/Makefile               |    1 +
 arch/x86/lib/oword_io.c             |   65 +++++++++++++++++++++++++++++++++++
 drivers/net/ethernet/sfc/bitfield.h |    3 ++
 drivers/net/ethernet/sfc/io.h       |   18 ++++++++-
 include/linux/types.h               |   12 ++++++
 8 files changed, 123 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/lib/oword_io.c

-- 
1.7.7.6

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 1/3] x86_64: Define 128-bit types for kernel code only
  2012-08-22  1:17 [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O Ben Hutchings
@ 2012-08-22  1:20 ` Ben Hutchings
  2012-08-22  1:23 ` [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations Ben Hutchings
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22  1:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: netdev, linux-net-drivers, x86

These types will initially be used for 128-bit I/O operations.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
I'm not sure whether there is any point in all the _t type aliases, but
they're defined for all the other explicit-width types.

Ben.

 arch/x86/include/asm/types.h |    8 ++++++++
 include/linux/types.h        |   12 ++++++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/types.h b/arch/x86/include/asm/types.h
index 8e8c23f..0829082 100644
--- a/arch/x86/include/asm/types.h
+++ b/arch/x86/include/asm/types.h
@@ -3,4 +3,12 @@
 
 #include <asm-generic/types.h>
 
+#ifdef __KERNEL__
+#if defined(CONFIG_X86_64) && !defined(__ASSEMBLY__) && !defined(_SETUP)
+#define HAVE_INT128
+typedef __int128 s128;
+typedef unsigned __int128 u128;
+#endif
+#endif
+
 #endif /* _ASM_X86_TYPES_H */
diff --git a/include/linux/types.h b/include/linux/types.h
index 9c1bd53..46e67ad 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -127,6 +127,12 @@ typedef		__s64		int64_t;
 #define aligned_be64 __be64 __attribute__((aligned(8)))
 #define aligned_le64 __le64 __attribute__((aligned(8)))
 
+#ifdef HAVE_INT128
+typedef		u128		u_int128_t;
+typedef		u128		uint128_t;
+typedef		s128		int128_t;
+#endif
+
 /**
  * The type used for indexing onto a disc or disc partition.
  *
@@ -199,6 +205,12 @@ typedef __u32 __bitwise __wsum;
 #define __aligned_le64 __le64 __attribute__((aligned(8)))
 
 #ifdef __KERNEL__
+
+#ifdef HAVE_INT128
+typedef u128 __bitwise __le128;
+typedef u128 __bitwise __be128;
+#endif
+
 typedef unsigned __bitwise__ gfp_t;
 typedef unsigned __bitwise__ fmode_t;
 
-- 
1.7.7.6



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  1:17 [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O Ben Hutchings
  2012-08-22  1:20 ` [PATCH 1/3] x86_64: Define 128-bit types for kernel code only Ben Hutchings
@ 2012-08-22  1:23 ` Ben Hutchings
  2012-08-22  1:37   ` H. Peter Anvin
  2012-08-22  1:26 ` [PATCH 3/3] sfc: Use __raw_writeo() to perform TX descriptor push where possible Ben Hutchings
  2012-08-22  1:38 ` [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O H. Peter Anvin
  3 siblings, 1 reply; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22  1:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: netdev, linux-net-drivers, x86

Define reado(), writeo() and their raw counterparts using SSE.

Based on work by Stuart Hodgson <smhodgson@solarflare.com>.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 arch/x86/Kconfig.cpu      |    4 +++
 arch/x86/include/asm/io.h |   14 +++++++++
 arch/x86/lib/Makefile     |    1 +
 arch/x86/lib/oword_io.c   |   65 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 84 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/lib/oword_io.c

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 706e12e..802508e 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -372,6 +372,10 @@ config X86_USE_3DNOW
 	def_bool y
 	depends on (MCYRIXIII || MK7 || MGEODE_LX) && !UML
 
+config X86_USE_SSE
+	def_bool y
+	depends on X86_64
+
 config X86_OOSTORE
 	def_bool y
 	depends on (MWINCHIP3D || MWINCHIPC6) && MTRR
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d8e8eef..06b3e23 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -95,6 +95,20 @@ build_mmio_write(writeq, "q", unsigned long, "r", :"memory")
 
 #endif
 
+#ifdef CONFIG_X86_USE_SSE
+
+u128 reado(const volatile void __iomem *addr);
+void writeo(u128 val, volatile void __iomem *addr);
+
+#define __raw_reado(addr) reado(addr)
+#define __raw_writeo(val, addr)	writeo(val, addr)
+
+/* Let people know that we have them */
+#define reado			reado
+#define writeo			writeo
+
+#endif
+
 /**
  *	virt_to_phys	-	map virtual addresses to physical
  *	@address: address to remap
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index b00f678..1791198 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -25,6 +25,7 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-$(CONFIG_X86_USE_SSE) += oword_io.o
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/arch/x86/lib/oword_io.c b/arch/x86/lib/oword_io.c
new file mode 100644
index 0000000..8189bf3
--- /dev/null
+++ b/arch/x86/lib/oword_io.c
@@ -0,0 +1,65 @@
+/****************************************************************************
+ * 128-bit MMIO for x86
+ * Copyright 2012 Solarflare Communications Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation, incorporated herein by reference.
+ */
+
+#include <linux/export.h>
+#include <linux/preempt.h>
+#include <linux/types.h>
+#include <asm/io.h>
+
+/*
+ * We copy the data between a memory buffer and the MMIO address via
+ * register xmm0.  We have to save and restore the SSE state, and
+ * disable preemption.  Only the MMIO address is required to be
+ * 128-bit aligned, since the stack generally is not.
+ */
+
+u128 reado(const volatile void __iomem *addr)
+{
+	u128 ret;
+	u64 cr0;
+	u128 xmm0;
+
+	preempt_disable();
+	asm volatile (
+		"movq %%cr0,%0\n\t"
+		"clts\n\t"
+		"movups %%xmm0,%1\n\t"
+		"movaps %3,%%xmm0\n\t"
+		"movups %%xmm0,%2\n\t"
+		"sfence\n\t"
+		"movups %1,%%xmm0\n\t"
+		"movq %0,%%cr0\n\t"
+		: "=r"(cr0), "=m"(xmm0), "=m"(ret)
+		: "m"(*(const volatile u128 __iomem *)addr));
+	preempt_enable();
+
+	return ret;
+}
+EXPORT_SYMBOL(reado);
+
+void writeo(u128 val, volatile void __iomem *addr)
+{
+	u64 cr0;
+	u128 xmm0;
+
+	preempt_disable();
+	asm volatile (
+		"movq %%cr0,%0\n\t"
+		"clts\n\t"
+		"movups %%xmm0,%1\n\t"
+		"movups %3,%%xmm0\n\t"
+		"movaps %%xmm0,%2\n\t"
+		"sfence\n\t"
+		"movups %1,%%xmm0\n\t"
+		"movq %0,%%cr0\n\t"
+		: "=r"(cr0), "=m"(xmm0), "=m"(*(volatile u128 __iomem *)addr)
+		: "m"(val));
+	preempt_enable();
+}
+EXPORT_SYMBOL(writeo);
-- 
1.7.7.6



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  1:23 ` [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations Ben Hutchings
@ 2012-08-22  1:37   ` H. Peter Anvin
  2012-08-22  2:04     ` Ben Hutchings
  0 siblings, 1 reply; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22  1:37 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Thomas Gleixner, Ingo Molnar, netdev, linux-net-drivers, x86

On 08/21/2012 06:23 PM, Ben Hutchings wrote:
> Define reado(), writeo() and their raw counterparts using SSE.
> 
> Based on work by Stuart Hodgson <smhodgson@solarflare.com>.

It would be vastly better if we explicitly controlled this with
kernel_fpu_begin()/kernel_fpu_end() rather than hiding it in primitives
than might tempt the user to do very much the wrong thing.

Also, it needs to be extremely clear to the user that these operations
use the FPU, and all the requirements there need to be met, including
not using them at interrupt time.

	-hpa

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  1:37   ` H. Peter Anvin
@ 2012-08-22  2:04     ` Ben Hutchings
  2012-08-22  2:34       ` David Miller
  0 siblings, 1 reply; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22  2:04 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, Ingo Molnar, netdev, linux-net-drivers, x86

On Tue, 2012-08-21 at 18:37 -0700, H. Peter Anvin wrote:
> On 08/21/2012 06:23 PM, Ben Hutchings wrote:
> > Define reado(), writeo() and their raw counterparts using SSE.
> > 
> > Based on work by Stuart Hodgson <smhodgson@solarflare.com>.
> 
> It would be vastly better if we explicitly controlled this with
> kernel_fpu_begin()/kernel_fpu_end() rather than hiding it in primitives
> than might tempt the user to do very much the wrong thing.
> 
> Also, it needs to be extremely clear to the user that these operations
> use the FPU, and all the requirements there need to be met, including
> not using them at interrupt time.

Well we can sometimes use the FPU state at IRQ time, can't we
(irq_fpu_usable())?   So we might need, say, try_reado() and
try_writeo() with callers expected to fall back to alternatives.  (Which
they must have anyway for any architecture that doesn't support this.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  2:04     ` Ben Hutchings
@ 2012-08-22  2:34       ` David Miller
  2012-08-22  3:24         ` H. Peter Anvin
  0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2012-08-22  2:34 UTC (permalink / raw)
  To: bhutchings; +Cc: hpa, tglx, mingo, netdev, linux-net-drivers, x86

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Wed, 22 Aug 2012 03:04:11 +0100

> On Tue, 2012-08-21 at 18:37 -0700, H. Peter Anvin wrote:
>> On 08/21/2012 06:23 PM, Ben Hutchings wrote:
>> > Define reado(), writeo() and their raw counterparts using SSE.
>> > 
>> > Based on work by Stuart Hodgson <smhodgson@solarflare.com>.
>> 
>> It would be vastly better if we explicitly controlled this with
>> kernel_fpu_begin()/kernel_fpu_end() rather than hiding it in primitives
>> than might tempt the user to do very much the wrong thing.
>> 
>> Also, it needs to be extremely clear to the user that these operations
>> use the FPU, and all the requirements there need to be met, including
>> not using them at interrupt time.
> 
> Well we can sometimes use the FPU state at IRQ time, can't we
> (irq_fpu_usable())?   So we might need, say, try_reado() and
> try_writeo() with callers expected to fall back to alternatives.  (Which
> they must have anyway for any architecture that doesn't support this.)

I really hope we eventually get rid of this rediculous restriction the
x86 code has.

It really needs a proper stack of FPU state saves like sparc64 has.

Half of the code and complexity in arch/x86/crypto/ would just
disappear, because most of it has to do with handling this obtuse
FPU usage restriction which shouldn't even be an issue in the first
place.

I continually see more and more code that has to check this
irq_fpu_usable() thing, and have ugly fallback code, and therefore is
a sign that this really needs to be fixed properly.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  2:34       ` David Miller
@ 2012-08-22  3:24         ` H. Peter Anvin
  2012-08-22  3:29           ` David Miller
  2012-08-22  3:52           ` Linus Torvalds
  0 siblings, 2 replies; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22  3:24 UTC (permalink / raw)
  To: David Miller
  Cc: bhutchings, tglx, mingo, netdev, linux-net-drivers, x86,
	Linus Torvalds

On 08/21/2012 07:34 PM, David Miller wrote:
> 
> I really hope we eventually get rid of this rediculous restriction the
> x86 code has.
> 
> It really needs a proper stack of FPU state saves like sparc64 has.
> 
> Half of the code and complexity in arch/x86/crypto/ would just
> disappear, because most of it has to do with handling this obtuse
> FPU usage restriction which shouldn't even be an issue in the first
> place.
> 
> I continually see more and more code that has to check this
> irq_fpu_usable() thing, and have ugly fallback code, and therefore is
> a sign that this really needs to be fixed properly.
> 

[Cc: Linus, since he has had very strong opinions on this in the past.]

I'm all ears... tell me how sparc64 deals with this, maybe we can
implement something similar.  At the same time, do keep in mind that on
x86 this is not just a matter of the FPU state, but the entire "extended
state" which can be very large.

Given the cost of state save/enable, however, nothing is going to change
the need for kernel_fpu_begin/end, nor the fact that those things will
want to bracket large regions.

	-hpa

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  3:24         ` H. Peter Anvin
@ 2012-08-22  3:29           ` David Miller
  2012-08-22  3:49             ` H. Peter Anvin
  2012-08-22  3:52           ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: David Miller @ 2012-08-22  3:29 UTC (permalink / raw)
  To: hpa; +Cc: bhutchings, tglx, mingo, netdev, linux-net-drivers, x86, torvalds

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Tue, 21 Aug 2012 20:24:18 -0700

> I'm all ears... tell me how sparc64 deals with this, maybe we can
> implement something similar.  At the same time, do keep in mind that on
> x86 this is not just a matter of the FPU state, but the entire "extended
> state" which can be very large.

Sparc's state is pretty huge too.  256 bytes worth of FPU registers,
plus a set of 64-bit control registers.

What we do is we have a FPU stack that grows up from the end of the
thread_info struct, towards the bottom of the kernel stack.

Slot 0 is always the user FPU state.

Slot 1 and further are kernel FPU state save areas.

We hold a counter which keep track of how far deeply saved we are
in the stack.

Not for the purpose of space saving, but for overhead reduction we
sometimes can get away with only saving away half of the FPU
registers.  The chip provides a pair of dirty bits, one for the lower
half of the FPU register file and one for the upper half.  We only
save the bits that are actually dirty.

Furthermore, when we have FPU using code in the kernel that only uses
the lower half of the registers, we only save away that part of the
state around the routine.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  3:29           ` David Miller
@ 2012-08-22  3:49             ` H. Peter Anvin
  0 siblings, 0 replies; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22  3:49 UTC (permalink / raw)
  To: David Miller
  Cc: bhutchings, tglx, mingo, netdev, linux-net-drivers, x86, torvalds

On 08/21/2012 08:29 PM, David Miller wrote:
> 
> What we do is we have a FPU stack that grows up from the end of the
> thread_info struct, towards the bottom of the kernel stack.
> 

We have 8K of kernel stack, and an xstate which is pushing a kilobyte
already.  This seems like a nightmare.  Even if we allocate a larger
stack for this sole purpose, we'd have to put a pretty hard cap on how
far it could grow.

> Slot 0 is always the user FPU state.
> 
> Slot 1 and further are kernel FPU state save areas.
> 
> We hold a counter which keep track of how far deeply saved we are
> in the stack.
> 
> Not for the purpose of space saving, but for overhead reduction we
> sometimes can get away with only saving away half of the FPU
> registers.  The chip provides a pair of dirty bits, one for the lower
> half of the FPU register file and one for the upper half.  We only
> save the bits that are actually dirty.
> 
> Furthermore, when we have FPU using code in the kernel that only uses
> the lower half of the registers, we only save away that part of the
> state around the routine.

This is messy on x86; it is somewhat doable but it gets really hairy
because of the monolithic [f]xsave/[f]xrstor instruction.

	-hpa

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  3:24         ` H. Peter Anvin
  2012-08-22  3:29           ` David Miller
@ 2012-08-22  3:52           ` Linus Torvalds
  2012-08-22  3:59             ` H. Peter Anvin
                               ` (2 more replies)
  1 sibling, 3 replies; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22  3:52 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: David Miller, bhutchings, tglx, mingo, netdev, linux-net-drivers,
	x86

On Tue, Aug 21, 2012 at 8:24 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
>> I continually see more and more code that has to check this
>> irq_fpu_usable() thing, and have ugly fallback code, and therefore is
>> a sign that this really needs to be fixed properly.
>>
>
> [Cc: Linus, since he has had very strong opinions on this in the past.]

I have strong opinions, because I don't think you *can* do it any
other way on x86.

Saving FPU state on kernel entry IS NOT GOING TO HAPPEN. End of story.
It's too f*cking slow.

And if we don't save it, we cannot use the FP state in kernel mode
without explicitly checking, because the FPU state may not be there,
and touching FP registers can cause exceptions etc. In fact, under
many loads the exception is going to be the common case.

Keeping the FP state live all the time (and then perhaps just saving a
single MMX register for kernel use) and switching it synchronously on
every task switch sounds like a really really bad idea too. It
apparently works better on modern CPU's, but it sucked horribly on
older ones. The FPU state save is horribly expensive, to the point
that we saw it very clearly on benchmarks for signal handling and task
switching.

So the normal operation is going to be "touching the FPU causes a fault".

That said, we might *try* to make the kernel FPU use be cheaper. We
could do something like:

 - if we need to restore FPU state for user mode, don't do it
synchronously in kernel_fpu_end(), just set a work-flag, and do it
just once before returning to user mode

 - that makes kernel_fpu_end() a no-op, and while we can't get rid of
kernel_fpu_begin(), we could at least make it cheap to do many times
consecutively (ie we'd have to check "who owns FPU state" and perhaps
save it, but the save would have to be done only the first time).

I haven't seen the patch being discussed, or the rationale for it. But
I doubt it makes sense to do 128-bit MMIO and expect any kind of
atomicity things anyway, and I very much doubt that using SSE would
make all that much sense. What's the background, and why would you
want to do this crap?

              Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  3:52           ` Linus Torvalds
@ 2012-08-22  3:59             ` H. Peter Anvin
  2012-08-22  4:14               ` David Miller
  2012-08-22  4:35               ` Linus Torvalds
  2012-08-22  4:42             ` Linus Torvalds
  2012-08-22 13:26             ` Ben Hutchings
  2 siblings, 2 replies; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22  3:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, bhutchings, tglx, mingo, netdev, linux-net-drivers,
	x86

On 08/21/2012 08:52 PM, Linus Torvalds wrote:
> 
> That said, we might *try* to make the kernel FPU use be cheaper. We
> could do something like:
> 
>  - if we need to restore FPU state for user mode, don't do it
> synchronously in kernel_fpu_end(), just set a work-flag, and do it
> just once before returning to user mode
> 
>  - that makes kernel_fpu_end() a no-op, and while we can't get rid of
> kernel_fpu_begin(), we could at least make it cheap to do many times
> consecutively (ie we'd have to check "who owns FPU state" and perhaps
> save it, but the save would have to be done only the first time).
> 

The zeroeth order thing we should do is to make it nest; there are cases
where we do multiple kernel_fpu_begin/_end in sequence for no good reason.

kernel_fpu_end() would still have to re-enable preemption (and
preemption would have to check the work flag), but that should be cheap.

We could allow the FPU in the kernel to have preemption, if we allocated
space for two xstates per thread instead of one.  That is, however, a
fair hunk of memory.

	-hpa

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  3:59             ` H. Peter Anvin
@ 2012-08-22  4:14               ` David Miller
  2012-08-22 21:14                 ` David Miller
  2012-08-22  4:35               ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: David Miller @ 2012-08-22  4:14 UTC (permalink / raw)
  To: hpa; +Cc: torvalds, bhutchings, tglx, mingo, netdev, linux-net-drivers, x86

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Tue, 21 Aug 2012 20:59:26 -0700

> kernel_fpu_end() would still have to re-enable preemption (and
> preemption would have to check the work flag), but that should be cheap.
> 
> We could allow the FPU in the kernel to have preemption, if we allocated
> space for two xstates per thread instead of one.  That is, however, a
> fair hunk of memory.

Once you have done the first FPU save for the sake of the kernel, you
can minimize what you save for any deeper nesting because the kernel
only cares about a very limited part of that FPU state not the whole
1K thing.

Those bits you can save by hand with a bunch of explicit stores of the
XMM registers, or something like that.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  4:14               ` David Miller
@ 2012-08-22 21:14                 ` David Miller
  2012-08-22 21:28                   ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2012-08-22 21:14 UTC (permalink / raw)
  To: hpa; +Cc: torvalds, bhutchings, tglx, mingo, netdev, linux-net-drivers, x86

From: David Miller <davem@davemloft.net>
Date: Tue, 21 Aug 2012 21:14:27 -0700 (PDT)

> From: "H. Peter Anvin" <hpa@zytor.com>
> Date: Tue, 21 Aug 2012 20:59:26 -0700
> 
>> kernel_fpu_end() would still have to re-enable preemption (and
>> preemption would have to check the work flag), but that should be cheap.
>> 
>> We could allow the FPU in the kernel to have preemption, if we allocated
>> space for two xstates per thread instead of one.  That is, however, a
>> fair hunk of memory.
> 
> Once you have done the first FPU save for the sake of the kernel, you
> can minimize what you save for any deeper nesting because the kernel
> only cares about a very limited part of that FPU state not the whole
> 1K thing.
> 
> Those bits you can save by hand with a bunch of explicit stores of the
> XMM registers, or something like that.

BTW, just to clarify, I'm not saying that we should save the FPU on
every trap where we find the FPU enabled or anything stupid like that.

Definitely keep the kern_fpu_begin()/kern_fpu_end() type markers
around FPU usage, but allow some kind of nesting facility.

Here's one idea.  Anyone using the existing kern_fpu_*() markers get
the existing behavior.  Only one level of kernel FPU usage is allowed.

But a new interface allows specification of a state-save mask.  And it
is only users of this interface for which we allow nesting past the
first FPU user.

If this is the first kernel FPU user, we always do the full fxsave or
whatever to push out the full state.  For any level of kernel FPU
nesting we save only what is in the save-mask, by hand.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 21:14                 ` David Miller
@ 2012-08-22 21:28                   ` Linus Torvalds
  2012-08-22 21:38                     ` David Miller
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 21:28 UTC (permalink / raw)
  To: David Miller; +Cc: hpa, bhutchings, tglx, mingo, netdev, linux-net-drivers, x86

On Wed, Aug 22, 2012 at 2:14 PM, David Miller <davem@davemloft.net> wrote:
>
> BTW, just to clarify, I'm not saying that we should save the FPU on
> every trap where we find the FPU enabled or anything stupid like that.
>
> Definitely keep the kern_fpu_begin()/kern_fpu_end() type markers
> around FPU usage, but allow some kind of nesting facility.

So nesting shouldn't be horrible, but the thing that really screws
with people like the crypto use is not nesting, but the fact that
sometimes you can't save at all, and the whole "kernel_fpu_possible()"
or whatever we call the checking function.

IOW, in [soft]irq context, to avoid races with the irq happening as
the process is going to do something with the FPU state, we don't
allow saving and changing state, because that would mean that the
normal FP state paths would have to be irq-safe, and they aren't.

And once you have to have that fpu possible check, if it happens to
also disallow nested use, I doubt that's going to really affect
anybody. The code has to take the case of "I'm not allowed to change
FPU state" case into account regardless.

                 Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 21:28                   ` Linus Torvalds
@ 2012-08-22 21:38                     ` David Miller
  0 siblings, 0 replies; 56+ messages in thread
From: David Miller @ 2012-08-22 21:38 UTC (permalink / raw)
  To: torvalds; +Cc: hpa, bhutchings, tglx, mingo, netdev, linux-net-drivers, x86

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 22 Aug 2012 14:28:50 -0700

> On Wed, Aug 22, 2012 at 2:14 PM, David Miller <davem@davemloft.net> wrote:
>>
>> BTW, just to clarify, I'm not saying that we should save the FPU on
>> every trap where we find the FPU enabled or anything stupid like that.
>>
>> Definitely keep the kern_fpu_begin()/kern_fpu_end() type markers
>> around FPU usage, but allow some kind of nesting facility.
> 
> So nesting shouldn't be horrible, but the thing that really screws
> with people like the crypto use is not nesting, but the fact that
> sometimes you can't save at all, and the whole "kernel_fpu_possible()"
> or whatever we call the checking function.
> 
> IOW, in [soft]irq context, to avoid races with the irq happening as
> the process is going to do something with the FPU state, we don't
> allow saving and changing state, because that would mean that the
> normal FP state paths would have to be irq-safe, and they aren't.
> 
> And once you have to have that fpu possible check, if it happens to
> also disallow nested use, I doubt that's going to really affect
> anybody. The code has to take the case of "I'm not allowed to change
> FPU state" case into account regardless.

I don't think you really have to do anything special to handle
interrupts properly.

Let's assume that we use some variable length save area at the end of
thread_info to do this nested saving.

When you are asked for FPU usage, you first figure out how much you're
going to save.

Then you advance the allocation pointer in the thread_info, and save
into the space you allocated.

If an interrupt wants to use the FPU, that should be fine as well.
Whether the interrupt FPU save does it's save after you did, or
before, it should work out fine.

I suppose you might have some issues in determining whether we need to
do the full fxsave stuff or not.  There could be a state bit for that,
or similar.

Another idea, instead of doing this in thread_info, is to do it on the
local stack.  That way if we're in an interrupt, we'll use that
interrupt type's kernel stack.

You might be able to get away with always doing the full FPU
save/restore in that situation.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  3:59             ` H. Peter Anvin
  2012-08-22  4:14               ` David Miller
@ 2012-08-22  4:35               ` Linus Torvalds
  2012-08-22  5:00                 ` David Miller
  1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22  4:35 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: David Miller, bhutchings, tglx, mingo, netdev, linux-net-drivers,
	x86

On Tue, Aug 21, 2012 at 8:59 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> kernel_fpu_end() would still have to re-enable preemption (and
> preemption would have to check the work flag), but that should be cheap.

No, done right, you don't even have to disable preemption. Sure, you
might disable preemption inside of kernel_fpu_begin() itself (around
the test for "do we need to save FPU state and set the flag to restore
it at return-to-user-mode time"), but I think that you should be able
to run the code afterwards with preemption enabled.

Why? Because you've saved the FPU state, and you've set the per-thread
flag saying "I will restore at return to user mode". And that, coupled
with teaching the scheduler about this case ("don't set TS, only save
xmm0 in a special kernel-xmm0-save-area") means that the thing can
preempt fine.

The nesting case would be about just saving (again) that xmm0 register
in any nested kernel_fpu_begin/end pair. But that doesn't need
preemption, that just needs local storage for the kernel_fpu_begin/end
(and it can do the xmm0 save/restore with regular "mov" instructions,
because the kernel owns the FPU at that point, since it's nested).

So it's doable. One question is how many registers to save for kernel
use. It might be that we'd need to make that a "actual FPU user needs
to save/restore the registers it uses", and not do the saving in
kernel_fpu_begin()/end() at all.

My biggest reason to question this all is that I don't think it's
worth it. Why would we ever care to do all this in the first place?
There's no really sane use for it.

Judging from the subject line, some crazy person wants to do 128-bit
MMIO. That's just crap. Nobody sane cares. Do it non-atomically with
regular accesses. If it's high-performance IO, it uses DMA. If it
doesn't use DMA and depends on CPU MMIO accesses, WHO THE F*CK CARES?
Nobody.

And if some crazy device depends on 128-bit atomic writes, the device
is shit. Forget about it. Tell people to stop digging themselves
deeper down a hole it's simply not worth going.

                Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  4:35               ` Linus Torvalds
@ 2012-08-22  5:00                 ` David Miller
  2012-08-22 14:06                   ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: David Miller @ 2012-08-22  5:00 UTC (permalink / raw)
  To: torvalds; +Cc: hpa, bhutchings, tglx, mingo, netdev, linux-net-drivers, x86

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 21 Aug 2012 21:35:22 -0700

> My biggest reason to question this all is that I don't think it's
> worth it. Why would we ever care to do all this in the first place?
> There's no really sane use for it.

All the x86 crypto code hits this case all the time, easiest example
is doing a dm-crypt on a block device when an IPSEC packet arrives.

The crypto code has all of this special code and layering that is
there purely so it can fall back to the slow non-optimized version
of the crypto operation when it hits this can't-nest-fpu-saving
situation.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  5:00                 ` David Miller
@ 2012-08-22 14:06                   ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 14:06 UTC (permalink / raw)
  To: David Miller; +Cc: hpa, bhutchings, tglx, mingo, netdev, linux-net-drivers, x86

On Tue, Aug 21, 2012 at 10:00 PM, David Miller <davem@davemloft.net> wrote:
>
> All the x86 crypto code hits this case all the time, easiest example
> is doing a dm-crypt on a block device when an IPSEC packet arrives.
>
> The crypto code has all of this special code and layering that is
> there purely so it can fall back to the slow non-optimized version
> of the crypto operation when it hits this can't-nest-fpu-saving
> situation.

Right. And it needs to be there. The current interface is fine and correct.

We can maybe optimize the current model (as outlined earlier), but no,
there's no way in hell we're doing lazy saving of FPU state from
interrupts etc. So all the "check if I can use FPU state at all", and
the explicit kernel_fpu_begin/end() interfaces are absolutely the
right model.

I realize that the people who write that code think that *their* code
is the most important thing in the world, and everything else should
revolve around them, and we should make everything else slower just to
make them happy. But they are wrong.

Deal with it, or do not. You can do the crypto entirely in software. I
think the current model is absolutely the right one, exactly because
it *allows* you to use the FPU for crypto, but it doesn't force the
rest of the kernel to make sure the FPU is always available to you.

I think your complaining about the interface is entirely bogus, and
comes from not appreciating that other areas and uses have other
concerns.

What I would suggest is trying to do profiling, and seeing if the
"let's try to save only once, and restore only when returning to user
space" approach helps. But that's an *implementation* change, not an
interface change. The interface is doing the right thing, the
implementation is just not perhaps optimal.

                  Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  3:52           ` Linus Torvalds
  2012-08-22  3:59             ` H. Peter Anvin
@ 2012-08-22  4:42             ` Linus Torvalds
  2012-08-22 13:26             ` Ben Hutchings
  2 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22  4:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: David Miller, bhutchings, tglx, mingo, netdev, linux-net-drivers,
	x86

On Tue, Aug 21, 2012 at 8:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I haven't seen the patch being discussed, or the rationale for it. But
> I doubt it makes sense to do 128-bit MMIO and expect any kind of
> atomicity things anyway, and I very much doubt that using SSE would
> make all that much sense. What's the background, and why would you
> want to do this crap?

Btw, for the 64-bit case, we did have ordering issues, and maybe some
128-bit model has similar ordering issues. Fine. You can't rely on
128-bit atomic accesses anyway on 99% of all hardware - either the CPU
itself cannot do it, or it's too damn inconvenient with XMM only
registers, or the bus itself is limited to 64 bits at a time anyway,
so the CPU or the IO interface would split such an access *anyway*.

So the whole concept of "we rely on atomic 128-bit MMIO accesses"
seems terminally broken. Any driver that thinks it needs that is just
crazy.

And those issues have nothing to do with x86 kernel_fpu_begin/end()
what-so-ever.

So judging by that, I would say that some driver writer needs to take
a few pills, clear up their head, and take another look at their life.
Tell them to look at

    include/asm-generic/io-64-nonatomic-hi-lo.h

(and *-lo-hi.h) instead, and ponder the wisdom of just doing it that
way. Tell them to go f*ck themselves if they think they need XMM
registers. They are wrong, for one reason or another.

                 Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22  3:52           ` Linus Torvalds
  2012-08-22  3:59             ` H. Peter Anvin
  2012-08-22  4:42             ` Linus Torvalds
@ 2012-08-22 13:26             ` Ben Hutchings
  2012-08-22 14:20               ` Linus Torvalds
  2 siblings, 1 reply; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22 13:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

On Tue, 2012-08-21 at 20:52 -0700, Linus Torvalds wrote:
[...]
> I haven't seen the patch being discussed, or the rationale for it. But
> I doubt it makes sense to do 128-bit MMIO and expect any kind of
> atomicity things anyway, and I very much doubt that using SSE would
> make all that much sense. What's the background, and why would you
> want to do this crap?

It's <1345598804.2659.78.camel@bwh-desktop.uk.solarflarecom.com>.
When updating a TX DMA ring pointer in sfc, we can push the first new
descriptor at the same time, so that for a linear packet only one DMA
read is then required before the packet goes out on the wire.  Currently
this requires 2-4 MMIO writes (each a separate PCIe transactions),
depending on the host word size.  There is a measurable reduction in
latency if we can reduce it to 1 PCIe transaction.  (But as previously
discussed, write-combining isn't suitable for this.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 13:26             ` Ben Hutchings
@ 2012-08-22 14:20               ` Linus Torvalds
  2012-08-22 14:24                 ` Ben Hutchings
  2012-08-22 14:42                 ` David Laight
  0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 14:20 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

On Wed, Aug 22, 2012 at 6:26 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
>
> It's <1345598804.2659.78.camel@bwh-desktop.uk.solarflarecom.com>.

That doesn't help me in the least. I don't *have* that email. It was
never sent to me. I don't know what list it went on, and googling the
ID doesn't get me anything. So sending me the mail ID is kind of
pointless.

So I still don't see the actual patch. But everything I heard about it
indirectly makes me go "That's just crazy".

> When updating a TX DMA ring pointer in sfc, we can push the first new
> descriptor at the same time, so that for a linear packet only one DMA
> read is then required before the packet goes out on the wire.  Currently
> this requires 2-4 MMIO writes (each a separate PCIe transactions),
> depending on the host word size.  There is a measurable reduction in
> latency if we can reduce it to 1 PCIe transaction.  (But as previously
> discussed, write-combining isn't suitable for this.)

Quite frankly, this isn't even *remotely* changing my mind about our
FPU model. I'm like "ok, some random driver is trying to be clever,
and it's almost certainly getting things entirely wrong and doing
things that only work on certain machines".

If you really think it's a big deal, do it in *your* driver, and make
it do the whole irq_fpu_usable() check together with
kernel_fpu_begin/end(). And make it very much x86-specific and with a
fallback to non-atomic accesses.

Any patch that exports some "atomic 128-bit MMIO writes" for general
use sounds totally and utterly broken. You can't do it. It's
*fundamentally* an operation that many CPU's cannot even do. 64-bit
buses (or even 32-bit ones) will make the 128-bit write be split up
*anyway*, regardless of any 128-bit register issues.

And nobody else sane cares about this, so it shouldn't even be a
x86-64 specific thing. It should be a driver hack, since that's what
it is. We don't want to support crazy stuff like this in general, that
is not just architecture-, but microarchitecture-specific.

I think you'd be crazy to even want to do this in the first place, but
if you do, keep it internal to your driver, and don't expose the crazy
to anybody else.

                 Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 14:20               ` Linus Torvalds
@ 2012-08-22 14:24                 ` Ben Hutchings
  2012-08-22 14:30                   ` Benjamin LaHaise
  2012-08-22 14:50                   ` Linus Torvalds
  2012-08-22 14:42                 ` David Laight
  1 sibling, 2 replies; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22 14:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

On Wed, 2012-08-22 at 07:20 -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2012 at 6:26 AM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
> >
> > It's <1345598804.2659.78.camel@bwh-desktop.uk.solarflarecom.com>.
> 
> That doesn't help me in the least. I don't *have* that email. It was
> never sent to me. I don't know what list it went on, and googling the
> ID doesn't get me anything. So sending me the mail ID is kind of
> pointless.
> 
> So I still don't see the actual patch. But everything I heard about it
> indirectly makes me go "That's just crazy".

Sorry, I'll paste it below.

> > When updating a TX DMA ring pointer in sfc, we can push the first new
> > descriptor at the same time, so that for a linear packet only one DMA
> > read is then required before the packet goes out on the wire.  Currently
> > this requires 2-4 MMIO writes (each a separate PCIe transactions),
> > depending on the host word size.  There is a measurable reduction in
> > latency if we can reduce it to 1 PCIe transaction.  (But as previously
> > discussed, write-combining isn't suitable for this.)
> 
> Quite frankly, this isn't even *remotely* changing my mind about our
> FPU model. I'm like "ok, some random driver is trying to be clever,
> and it's almost certainly getting things entirely wrong and doing
> things that only work on certain machines".
> 
> If you really think it's a big deal, do it in *your* driver, and make
> it do the whole irq_fpu_usable() check together with
> kernel_fpu_begin/end(). And make it very much x86-specific and with a
> fallback to non-atomic accesses.
[...]

Which is what I first submitted, but David wanted it to be generic.

Ben.

---
Subject: sfc: Use __raw_writeo() to perform TX descriptor push where possible

Use the new __raw_writeo() function for TX descriptor push where
available.  This means we now use only a single PCIe transaction
on x86_64 (vs 2 before), reducing TX latency slightly.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/bitfield.h |    3 +++
 drivers/net/ethernet/sfc/io.h       |   18 ++++++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/sfc/bitfield.h b/drivers/net/ethernet/sfc/bitfield.h
index b26a954..5feeba2 100644
--- a/drivers/net/ethernet/sfc/bitfield.h
+++ b/drivers/net/ethernet/sfc/bitfield.h
@@ -83,6 +83,9 @@ typedef union efx_qword {
 
 /* An octword (eight-word, i.e. 16 byte) datatype - little-endian in HW */
 typedef union efx_oword {
+#ifdef HAVE_INT128
+       __le128 u128;
+#endif
        __le64 u64[2];
        efx_qword_t qword[2];
        __le32 u32[4];
diff --git a/drivers/net/ethernet/sfc/io.h b/drivers/net/ethernet/sfc/io.h
index 751d1ec..fbcdc6d 100644
--- a/drivers/net/ethernet/sfc/io.h
+++ b/drivers/net/ethernet/sfc/io.h
@@ -57,10 +57,22 @@
  *   current state.
  */
 
-#if BITS_PER_LONG == 64
+#if defined(writeo)
+#define EFX_USE_OWORD_IO 1
+#endif
+
+#if defined(readq) && defined(writeq)
 #define EFX_USE_QWORD_IO 1
 #endif
 
+#ifdef EFX_USE_OWORD_IO
+static inline void _efx_writeo(struct efx_nic *efx, __le128 value,
+                              unsigned int reg)
+{
+       __raw_writeo((__force u128)value, efx->membase + reg);
+}
+#endif
+
 #ifdef EFX_USE_QWORD_IO
 static inline void _efx_writeq(struct efx_nic *efx, __le64 value,
                                  unsigned int reg)
@@ -235,7 +247,9 @@ static inline void _efx_writeo_page(struct efx_nic *efx, efx_oword_t *value,
                   "writing register %x with " EFX_OWORD_FMT "\n", reg,
                   EFX_OWORD_VAL(*value));
 
-#ifdef EFX_USE_QWORD_IO
+#if defined(EFX_USE_OWORD_IO)
+       _efx_writeo(efx, value->u128, reg);
+#elif defined(EFX_USE_QWORD_IO)
        _efx_writeq(efx, value->u64[0], reg + 0);
        _efx_writeq(efx, value->u64[1], reg + 8);
 #else

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 14:24                 ` Ben Hutchings
@ 2012-08-22 14:30                   ` Benjamin LaHaise
  2012-08-22 14:58                     ` Ben Hutchings
  2012-08-22 14:50                   ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: Benjamin LaHaise @ 2012-08-22 14:30 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Linus Torvalds, H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

On Wed, Aug 22, 2012 at 03:24:59PM +0100, Ben Hutchings wrote:
> -#ifdef EFX_USE_QWORD_IO
> +#if defined(EFX_USE_OWORD_IO)
> +       _efx_writeo(efx, value->u128, reg);
> +#elif defined(EFX_USE_QWORD_IO)
>         _efx_writeq(efx, value->u64[0], reg + 0);
>         _efx_writeq(efx, value->u64[1], reg + 8);
>  #else

This looks like a perfect fit for write combining.  I have traces showing 
that enabling write combining on MMIO does indeed generate a single PCIe 
transaction on at least a couple of different current systems.  Why is 
that not an option?

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 14:30                   ` Benjamin LaHaise
@ 2012-08-22 14:58                     ` Ben Hutchings
  2012-08-22 15:13                       ` H. Peter Anvin
  0 siblings, 1 reply; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22 14:58 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Linus Torvalds, H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

On Wed, 2012-08-22 at 10:30 -0400, Benjamin LaHaise wrote:
> On Wed, Aug 22, 2012 at 03:24:59PM +0100, Ben Hutchings wrote:
> > -#ifdef EFX_USE_QWORD_IO
> > +#if defined(EFX_USE_OWORD_IO)
> > +       _efx_writeo(efx, value->u128, reg);
> > +#elif defined(EFX_USE_QWORD_IO)
> >         _efx_writeq(efx, value->u64[0], reg + 0);
> >         _efx_writeq(efx, value->u64[1], reg + 8);
> >  #else
> 
> This looks like a perfect fit for write combining.  I have traces showing 
> that enabling write combining on MMIO does indeed generate a single PCIe 
> transaction on at least a couple of different current systems.  Why is 
> that not an option?

Because reordering, and see the comment at the top of this file.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 14:58                     ` Ben Hutchings
@ 2012-08-22 15:13                       ` H. Peter Anvin
  2012-08-22 15:27                         ` David Laight
  0 siblings, 1 reply; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22 15:13 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Benjamin LaHaise, Linus Torvalds, David Miller, tglx, mingo,
	netdev, linux-net-drivers, x86

On 08/22/2012 07:58 AM, Ben Hutchings wrote:
>
> Because reordering, and see the comment at the top of this file.
>

Your architecture sounds similar to one I once worked on (Orion 
Microsystems CNIC/OPA-2).  That architecture had a descriptor ring in 
device memory, and a single trigger bit would move the head pointer.

We used write combining to write out a set of descriptors, and then used 
a non-write-combining write to do the final write which bumps the head 
pointer.  The UC write flushes the write combiners ahead of it, so it 
ends up with two transactions (one for the WC data and one for the UC 
trigger) but it could frequently push quite a few descriptors in that 
operation.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 15:13                       ` H. Peter Anvin
@ 2012-08-22 15:27                         ` David Laight
  2012-08-22 15:49                           ` H. Peter Anvin
                                             ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: David Laight @ 2012-08-22 15:27 UTC (permalink / raw)
  To: H. Peter Anvin, Ben Hutchings
  Cc: Benjamin LaHaise, Linus Torvalds, David Miller, tglx, mingo,
	netdev, linux-net-drivers, x86

> Your architecture sounds similar to one I once worked on (Orion
> Microsystems CNIC/OPA-2).  That architecture had a descriptor ring in
> device memory, and a single trigger bit would move the head pointer.
> 
> We used write combining to write out a set of descriptors, and then
> used
> a non-write-combining write to do the final write which bumps the head
> pointer.  The UC write flushes the write combiners ahead of it, so it
> ends up with two transactions (one for the WC data and one for the UC
> trigger) but it could frequently push quite a few descriptors in that
> operation.

The code actually looks more like a normal ethernet ring interface
with an 'owner' bit in each entry.
So it is important to write the owner bit last.

It might be possibly to set multiple ring entries in two TLPs
by first writing all of them (maybe with write combining)
but without changing the ownership of the first entry.
Then doing a second transfer to update the owner bit it
the first entry.
The order of the writes in the first transfer would then not
matter.

FWIW can you even guarantee to do an atomic 64bit PCIe transfer
on many systems (without resorting to a dma unit).

	David

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 15:27                         ` David Laight
@ 2012-08-22 15:49                           ` H. Peter Anvin
  2012-08-22 15:51                           ` Ben Hutchings
  2012-08-22 15:51                           ` H. Peter Anvin
  2 siblings, 0 replies; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22 15:49 UTC (permalink / raw)
  To: David Laight
  Cc: Ben Hutchings, Benjamin LaHaise, Linus Torvalds, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On 08/22/2012 08:27 AM, David Laight wrote:
>> Your architecture sounds similar to one I once worked on (Orion
>> Microsystems CNIC/OPA-2).  That architecture had a descriptor ring in
>> device memory, and a single trigger bit would move the head pointer.
>>
>> We used write combining to write out a set of descriptors, and then
>> used
>> a non-write-combining write to do the final write which bumps the head
>> pointer.  The UC write flushes the write combiners ahead of it, so it
>> ends up with two transactions (one for the WC data and one for the UC
>> trigger) but it could frequently push quite a few descriptors in that
>> operation.
>
> The code actually looks more like a normal ethernet ring interface
> with an 'owner' bit in each entry.
> So it is important to write the owner bit last.
>
> It might be possibly to set multiple ring entries in two TLPs
> by first writing all of them (maybe with write combining)
> but without changing the ownership of the first entry.
> Then doing a second transfer to update the owner bit it
> the first entry.
> The order of the writes in the first transfer would then not
> matter.
>
> FWIW can you even guarantee to do an atomic 64bit PCIe transfer
> on many systems (without resorting to a dma unit).
>

On many systems, perhaps, but I suspect that 32 bits is the maximum you 
can truly guarantee.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 15:27                         ` David Laight
  2012-08-22 15:49                           ` H. Peter Anvin
@ 2012-08-22 15:51                           ` Ben Hutchings
  2012-08-22 15:54                             ` H. Peter Anvin
  2012-08-22 15:51                           ` H. Peter Anvin
  2 siblings, 1 reply; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22 15:51 UTC (permalink / raw)
  To: David Laight
  Cc: H. Peter Anvin, Benjamin LaHaise, Linus Torvalds, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On Wed, 2012-08-22 at 16:27 +0100, David Laight wrote:
> > Your architecture sounds similar to one I once worked on (Orion
> > Microsystems CNIC/OPA-2).  That architecture had a descriptor ring in
> > device memory, and a single trigger bit would move the head pointer.
> > 
> > We used write combining to write out a set of descriptors, and then
> > used
> > a non-write-combining write to do the final write which bumps the head
> > pointer.  The UC write flushes the write combiners ahead of it, so it
> > ends up with two transactions (one for the WC data and one for the UC
> > trigger) but it could frequently push quite a few descriptors in that
> > operation.
> 
> The code actually looks more like a normal ethernet ring interface
> with an 'owner' bit in each entry.
> So it is important to write the owner bit last.

You're confused.  The 'owner' field in the descriptor pointer is part of
the memory protection mechanism for user-level networking.  And we don't
have up to 1024 TX descriptors in a single ring, we have up to 1024
separate rings - in host memory, of course.  Which is why we have the
'TX push' feature to reduce latency for a currently empty TX queue.

> It might be possibly to set multiple ring entries in two TLPs
> by first writing all of them (maybe with write combining)
> but without changing the ownership of the first entry.
> Then doing a second transfer to update the owner bit it
> the first entry.
> The order of the writes in the first transfer would then not
> matter.
> 
> FWIW can you even guarantee to do an atomic 64bit PCIe transfer
> on many systems (without resorting to a dma unit).

On any architecture that implements readq and writeq these had better be
atomic.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 15:51                           ` Ben Hutchings
@ 2012-08-22 15:54                             ` H. Peter Anvin
  2012-08-22 16:44                               ` Ben Hutchings
  2012-08-22 16:51                               ` Linus Torvalds
  0 siblings, 2 replies; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22 15:54 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Laight, Benjamin LaHaise, Linus Torvalds, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On 08/22/2012 08:51 AM, Ben Hutchings wrote:
>>
>> FWIW can you even guarantee to do an atomic 64bit PCIe transfer
>> on many systems (without resorting to a dma unit).
>
> On any architecture that implements readq and writeq these had better be
> atomic.
>

Sorry, you fail.  There are definitely systems in the field where 
readq() and writeq() are implemented, because the CPU supports them, 
where the fabric does not guarantee they are intact.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 15:54                             ` H. Peter Anvin
@ 2012-08-22 16:44                               ` Ben Hutchings
  2012-08-22 16:49                                 ` H. Peter Anvin
  2012-08-22 16:55                                 ` Linus Torvalds
  2012-08-22 16:51                               ` Linus Torvalds
  1 sibling, 2 replies; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22 16:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: David Laight, Benjamin LaHaise, Linus Torvalds, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On Wed, 2012-08-22 at 08:54 -0700, H. Peter Anvin wrote:
> On 08/22/2012 08:51 AM, Ben Hutchings wrote:
> >>
> >> FWIW can you even guarantee to do an atomic 64bit PCIe transfer
> >> on many systems (without resorting to a dma unit).
> >
> > On any architecture that implements readq and writeq these had better be
> > atomic.
> >
> 
> Sorry, you fail.  There are definitely systems in the field where 
> readq() and writeq() are implemented, because the CPU supports them, 
> where the fabric does not guarantee they are intact.

Well, when the issue of 64-bit MMIO was discussed earlier this year, you
said nothing about this.  I thought the conclusion was that any
definitions provided by <asm/io.h> *must* be atomic and drivers can use
<asm-generic/io-64-nonatomic-hi-lo.h> or
<asm-generic/io-64-nonatomic-lo-hi.h> as a fallback.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 16:44                               ` Ben Hutchings
@ 2012-08-22 16:49                                 ` H. Peter Anvin
  2012-08-22 16:55                                 ` Linus Torvalds
  1 sibling, 0 replies; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22 16:49 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Laight, Benjamin LaHaise, Linus Torvalds, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On 08/22/2012 09:44 AM, Ben Hutchings wrote:
>>
>> Sorry, you fail.  There are definitely systems in the field where
>> readq() and writeq() are implemented, because the CPU supports them,
>> where the fabric does not guarantee they are intact.
>
> Well, when the issue of 64-bit MMIO was discussed earlier this year, you
> said nothing about this.  I thought the conclusion was that any
> definitions provided by <asm/io.h> *must* be atomic and drivers can use
> <asm-generic/io-64-nonatomic-hi-lo.h> or
> <asm-generic/io-64-nonatomic-lo-hi.h> as a fallback.
>

That is true at the exit interface from the CPU core.  Beyond that 
drivers have to keep in mind the possible limitations of the 
communications fabric between the CPU and the device.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 16:44                               ` Ben Hutchings
  2012-08-22 16:49                                 ` H. Peter Anvin
@ 2012-08-22 16:55                                 ` Linus Torvalds
  2012-08-22 17:09                                   ` Ben Hutchings
  1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 16:55 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: H. Peter Anvin, David Laight, Benjamin LaHaise, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On Wed, Aug 22, 2012 at 9:44 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
>
> Well, when the issue of 64-bit MMIO was discussed earlier this year, you
> said nothing about this.  I thought the conclusion was that any
> definitions provided by <asm/io.h> *must* be atomic and drivers can use
> <asm-generic/io-64-nonatomic-hi-lo.h> or
> <asm-generic/io-64-nonatomic-lo-hi.h> as a fallback.

Think 32-bit PCI with a 64-bit CPU.

The CPU itself does the 64-bit access no problem. The bus? Not so
much. Even if it's a burst transaction with a single packet, the
actual device on the other side will see the 64-bit value as two
separate parts. Sometimes that matters, sometimes it doesn't (ask
yourself: "What's the atomicity guarantee at the device end? Burst
transaction or individual word of a transaction?").

Again, being limited to PCIe, you are unlikely to hit these issues,
but system bridges can do odd things sometimes, and in the *general*
case it's definitely true that "writeq()" can generate multiple
accesses at the device end even if the *CPU* only generated a single
one.

            Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 16:55                                 ` Linus Torvalds
@ 2012-08-22 17:09                                   ` Ben Hutchings
  2012-08-22 17:12                                     ` H. Peter Anvin
  2012-08-22 17:26                                     ` Linus Torvalds
  0 siblings, 2 replies; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22 17:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, David Laight, Benjamin LaHaise, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On Wed, 2012-08-22 at 09:55 -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2012 at 9:44 AM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
> >
> > Well, when the issue of 64-bit MMIO was discussed earlier this year, you
> > said nothing about this.  I thought the conclusion was that any
> > definitions provided by <asm/io.h> *must* be atomic and drivers can use
> > <asm-generic/io-64-nonatomic-hi-lo.h> or
> > <asm-generic/io-64-nonatomic-lo-hi.h> as a fallback.
> 
> Think 32-bit PCI with a 64-bit CPU.
[...]

Well, sure, I'm assuming that the driver is responsible for checking
that the device and its bus interface support an MMIO of the requested
width.

But the architecture code must be responsible for reporting whether the
host supports it, right?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 17:09                                   ` Ben Hutchings
@ 2012-08-22 17:12                                     ` H. Peter Anvin
  2012-08-22 17:27                                       ` Ben Hutchings
  2012-08-22 17:26                                     ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22 17:12 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Linus Torvalds, David Laight, Benjamin LaHaise, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On 08/22/2012 10:09 AM, Ben Hutchings wrote:
>
> Well, sure, I'm assuming that the driver is responsible for checking
> that the device and its bus interface support an MMIO of the requested
> width.
>
> But the architecture code must be responsible for reporting whether the
> host supports it, right?
>

No, the architecture code *can't*.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 17:12                                     ` H. Peter Anvin
@ 2012-08-22 17:27                                       ` Ben Hutchings
  2012-08-22 17:54                                         ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22 17:27 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, David Laight, Benjamin LaHaise, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On Wed, 2012-08-22 at 10:12 -0700, H. Peter Anvin wrote:
> On 08/22/2012 10:09 AM, Ben Hutchings wrote:
> >
> > Well, sure, I'm assuming that the driver is responsible for checking
> > that the device and its bus interface support an MMIO of the requested
> > width.
> >
> > But the architecture code must be responsible for reporting whether the
> > host supports it, right?
> >
> 
> No, the architecture code *can't*.

So, let me check that I understand this right:
- To support 32-bit architectures, a driver should include one of two
different definitions of readq/writeq depending on which order the
device needs to receive 32-bit operations.
- On 64-bit architectures (or at least x86_64), the system might split
up readq/writeq into 32-bit operations in unspecified order, and the
driver can't control this.

If this is right, how can it be safe to use readq/writeq at all?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 17:27                                       ` Ben Hutchings
@ 2012-08-22 17:54                                         ` Linus Torvalds
  2012-08-22 18:11                                           ` Ben Hutchings
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 17:54 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: H. Peter Anvin, David Laight, Benjamin LaHaise, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On Wed, Aug 22, 2012 at 10:27 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
>
> If this is right, how can it be safe to use readq/writeq at all?

Pray.

Or don't care about ordering: use hardware that is well-designed and
doesn't have crap interfaces that are fragile.

If you care about ordering, you need to do them as two separate
accesses, and have a fence in between. Which, quite frankly, sounds
like the right model for you *anyway*, since then you could use
write-combining memory and you might even go faster, despite an
explicit fence and thus a minimum of 2 transactions.

Seriously. If you care that deeply about the ordering of the bytes you
write out, MAKE THAT ORDERING VERY EXPLICIT IN THE SOURCE CODE. Don't
say "oh, with this hack, I win 100ns". You need to ask yourself: what
do you care about more? Going really fast on some machine that you can
test, or being safe?

With PCIe, it's *probably* fine to just say "we expect 64-bit accesses
to make it through unmolested".

The 128-bit case I really don't know about. It probably works too. But
while I'd call the 64-bit case almost certain (in the absence of truly
crap hardware), the 128-bit case I have a hard time judging how
certain it is going to be.

                   Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 17:54                                         ` Linus Torvalds
@ 2012-08-22 18:11                                           ` Ben Hutchings
  2012-08-22 18:18                                             ` H. Peter Anvin
  2012-08-22 18:28                                             ` Linus Torvalds
  0 siblings, 2 replies; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22 18:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, David Laight, Benjamin LaHaise, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On Wed, 2012-08-22 at 10:54 -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2012 at 10:27 AM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
> >
> > If this is right, how can it be safe to use readq/writeq at all?
> 
> Pray.
> 
> Or don't care about ordering: use hardware that is well-designed and
> doesn't have crap interfaces that are fragile.

Well the whole point of having the two 32-bit generic implementations is
that hardware may care about the order!  How can it be right that a
64-bit implementation assumes it doesn't?

> If you care about ordering, you need to do them as two separate
> accesses, and have a fence in between. Which, quite frankly, sounds
> like the right model for you *anyway*, since then you could use
> write-combining memory and you might even go faster, despite an
> explicit fence and thus a minimum of 2 transactions.

Yes, which unfortunately is no better than we have at the moment.

> Seriously. If you care that deeply about the ordering of the bytes you
> write out, MAKE THAT ORDERING VERY EXPLICIT IN THE SOURCE CODE. Don't
> say "oh, with this hack, I win 100ns". You need to ask yourself: what
> do you care about more? Going really fast on some machine that you can
> test, or being safe?

I have to care quite a lot about both. :-)  But yes, safety first.

> With PCIe, it's *probably* fine to just say "we expect 64-bit accesses
> to make it through unmolested".

I have to hope so.

> The 128-bit case I really don't know about. It probably works too. But
> while I'd call the 64-bit case almost certain (in the absence of truly
> crap hardware), the 128-bit case I have a hard time judging how
> certain it is going to be.

Right, I think it's been made pretty clear that it's going to be
dependent on more than just architecture.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 18:11                                           ` Ben Hutchings
@ 2012-08-22 18:18                                             ` H. Peter Anvin
  2012-08-22 18:28                                             ` Linus Torvalds
  1 sibling, 0 replies; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22 18:18 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Linus Torvalds, David Laight, Benjamin LaHaise, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On 08/22/2012 11:11 AM, Ben Hutchings wrote:
> 
> Well the whole point of having the two 32-bit generic implementations is
> that hardware may care about the order!  How can it be right that a
> 64-bit implementation assumes it doesn't?
> 

On x86 platforms it is pretty much universal (as in: I have never seen
an exception) that transactions that are broken up are broken up in
littleendian order.  That was the original readq/writeq implementation
on x86-32, but it screwed up some other architectures.

The other reason to not have readq/writeq by default where we *know* it
can't be supported is that some drivers (e.g. the CNIC/OPA-2 driver I
mentioned) can do better, performance-wise, if it knows that.

	-hpa

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 18:11                                           ` Ben Hutchings
  2012-08-22 18:18                                             ` H. Peter Anvin
@ 2012-08-22 18:28                                             ` Linus Torvalds
  2012-08-22 19:01                                               ` Ben Hutchings
  1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 18:28 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: H. Peter Anvin, David Laight, Benjamin LaHaise, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On Wed, Aug 22, 2012 at 11:11 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
>
> Right, I think it's been made pretty clear that it's going to be
> dependent on more than just architecture.

Well, it's entirely possible that the 128-bit case will work correctly
on all x86-64 hardware out there on PCIe.

I just can't guarantee it, because we certainly have had issues with
hw doing odd things before. But maybe PCIe really is well-specified
enough, and maybe nobody has done a odd PCIe bridges, and maybe every
time some 128-bit access is split, the bus in question still always
remembers the original 128-bit size in the transaction. It's not at
all impossible. I just wouldn't *guarantee* it.

And to some degree, for high-end server-only hardware in particular,
it really *is* acceptable to say "If you have odd hardware, odd things
will happen". So for this particular driver, maybe the right approach
is simply to say "we require that your fabric works right". And see if
anybody ever complains.

The 100ns may be worth those kinds of "you'd better not have old/crap
hardware" decisions. It's not acceptable for some drivers (a driver
for some consumer ATA chip might not want to make that kind of choice,
and say "whatever, we'll be really conservative), but "Quod licet
Jovi, non licet bovi".

The fact that something might not be *guaranteed* to always work
doesn't necessarily mean that it is always the wrong thing to do..

              Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 18:28                                             ` Linus Torvalds
@ 2012-08-22 19:01                                               ` Ben Hutchings
  0 siblings, 0 replies; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, David Laight, Benjamin LaHaise, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On Wed, 2012-08-22 at 11:28 -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2012 at 11:11 AM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
> >
> > Right, I think it's been made pretty clear that it's going to be
> > dependent on more than just architecture.
> 
> Well, it's entirely possible that the 128-bit case will work correctly
> on all x86-64 hardware out there on PCIe.
> 
> I just can't guarantee it, because we certainly have had issues with
> hw doing odd things before. But maybe PCIe really is well-specified
> enough, and maybe nobody has done a odd PCIe bridges, and maybe every
> time some 128-bit access is split, the bus in question still always
> remembers the original 128-bit size in the transaction. It's not at
> all impossible. I just wouldn't *guarantee* it.

Even then, everything works out OK if the particular MMIO write I'm
concerned about *is* split - just as long as the resulting operations
are in ascending address order.  This is not true on all systems if we
enable write-combining (without a fence in the middle, which defeats the
purpose), and I think hpa was saying that it may not be the case with
SSE writes either.

> And to some degree, for high-end server-only hardware in particular,
> it really *is* acceptable to say "If you have odd hardware, odd things
> will happen". So for this particular driver, maybe the right approach
> is simply to say "we require that your fabric works right". And see if
> anybody ever complains.

Maybe.  At the moment reordering tends to cause the hardware to complain
that we sent an invalid sequence of DMA descriptors, but that's only
because we're not being as smart as we could about using TX push.  I
don't want to run the risk of sending out corrupted packets (with
offloaded checksums, so they're not that obviously invalid) on the wire.

Ben.

> The 100ns may be worth those kinds of "you'd better not have old/crap
> hardware" decisions. It's not acceptable for some drivers (a driver
> for some consumer ATA chip might not want to make that kind of choice,
> and say "whatever, we'll be really conservative), but "Quod licet
> Jovi, non licet bovi".
> 
> The fact that something might not be *guaranteed* to always work
> doesn't necessarily mean that it is always the wrong thing to do..

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 17:09                                   ` Ben Hutchings
  2012-08-22 17:12                                     ` H. Peter Anvin
@ 2012-08-22 17:26                                     ` Linus Torvalds
  1 sibling, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 17:26 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: H. Peter Anvin, David Laight, Benjamin LaHaise, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On Wed, Aug 22, 2012 at 10:09 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
>
> But the architecture code must be responsible for reporting whether the
> host supports it, right?

How? It's impossible. As far as the CPU is concerned, the writes
happen atomically. Look how you use it: you don't even query
dynamically about whether the stuff outside the CPU can handle atomic
128-bit writes. You just assume it at compile-time with an #ifdef. How
the hell do you expect that to be able to then say "oh, on this
machine the device you are doing the access to is behind an odd
PCI->PCIe bridge that will split the access"?

Not that we even tend to *know* those kinds of things. It's really
esoteric chipset knowledge. I wouldn't even expect it to be
necessarily documented in the chipset docs, it *might* be in some
NDA'd BIOS writer's guide thing.

You don't even seem to realize that things like the Intel FSB was
patented and wasn't fully documented by Intel at all? And that's for a
bus interface that was used for over a decade from the dominant CPU
manufacturer. What do you think happens with odd random chipsets? Who
do you expects to know?

I *suspect* that 128-bit writes would generally make it intact over
PCIe in real life, but I absolutely wouldn't guarantee it on all
machines. Exactly because of issues like "what happens with a nVidia
host bridge and the old FSB model on older Intel chips?" or "What does
the AMD memory pipeline do?".

Many CPU cores have 64-bit buses even *internally*, much less
externally. Yes, they have atomicity guarantees in their architecture
manual, but go look at it: those talk about memory accesses, and they
are based on cache coherency (these days - they *used* to be based on
certain bus guarantees). The MMIO side is a completely different
animal, and is still based on the bus - and nobody documents that,
afaik.

                 Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 15:54                             ` H. Peter Anvin
  2012-08-22 16:44                               ` Ben Hutchings
@ 2012-08-22 16:51                               ` Linus Torvalds
  2012-08-22 16:59                                 ` H. Peter Anvin
  1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 16:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ben Hutchings, David Laight, Benjamin LaHaise, David Miller, tglx,
	mingo, netdev, linux-net-drivers, x86

On Wed, Aug 22, 2012 at 8:54 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Sorry, you fail.  There are definitely systems in the field where readq()
> and writeq() are implemented, because the CPU supports them, where the
> fabric does not guarantee they are intact.

Indeed.

It's unlikely to be an issue with a PCIe driver, though. I'm pretty
sure you can rely on 64-bit transfers there, especially with a CPU
that is modern enough to run 64-bit mode.

That said, even with PCIe, I wonder if older CPU's (think Intel with a
front-side bus, rather than PCIe on die) necessarily always do 128-bit
writes. The FSB is just 64 bits wide, and I could *imagine* that a
PCIe chipset behind the FSB might end up just always generating at
most 64-bit PCIe transactions for host accesses just because that
would be "natural".

Sounds unlikely, but hey, hardware sometimes does odd things.

             Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 16:51                               ` Linus Torvalds
@ 2012-08-22 16:59                                 ` H. Peter Anvin
  0 siblings, 0 replies; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22 16:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben Hutchings, David Laight, Benjamin LaHaise, David Miller, tglx,
	mingo, netdev, linux-net-drivers, x86

On 08/22/2012 09:51 AM, Linus Torvalds wrote:
> On Wed, Aug 22, 2012 at 8:54 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> Sorry, you fail.  There are definitely systems in the field where readq()
>> and writeq() are implemented, because the CPU supports them, where the
>> fabric does not guarantee they are intact.
>
> Indeed.
>
> It's unlikely to be an issue with a PCIe driver, though. I'm pretty
> sure you can rely on 64-bit transfers there, especially with a CPU
> that is modern enough to run 64-bit mode.
>
> That said, even with PCIe, I wonder if older CPU's (think Intel with a
> front-side bus, rather than PCIe on die) necessarily always do 128-bit
> writes. The FSB is just 64 bits wide, and I could *imagine* that a
> PCIe chipset behind the FSB might end up just always generating at
> most 64-bit PCIe transactions for host accesses just because that
> would be "natural".
>
> Sounds unlikely, but hey, hardware sometimes does odd things.
>

I'm wondering how e.g. a K8 would work (CPU -> HT -> PCIe) on UC memory 
there.  I know for a fact that some CPU cores break up SSE transactions 
into 64-bit transactions.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 15:27                         ` David Laight
  2012-08-22 15:49                           ` H. Peter Anvin
  2012-08-22 15:51                           ` Ben Hutchings
@ 2012-08-22 15:51                           ` H. Peter Anvin
  2 siblings, 0 replies; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22 15:51 UTC (permalink / raw)
  To: David Laight
  Cc: Ben Hutchings, Benjamin LaHaise, Linus Torvalds, David Miller,
	tglx, mingo, netdev, linux-net-drivers, x86

On 08/22/2012 08:27 AM, David Laight wrote:
>> Your architecture sounds similar to one I once worked on (Orion
>> Microsystems CNIC/OPA-2).  That architecture had a descriptor ring in
>> device memory, and a single trigger bit would move the head pointer.
>>
>> We used write combining to write out a set of descriptors, and then
>> used
>> a non-write-combining write to do the final write which bumps the head
>> pointer.  The UC write flushes the write combiners ahead of it, so it
>> ends up with two transactions (one for the WC data and one for the UC
>> trigger) but it could frequently push quite a few descriptors in that
>> operation.
>
> The code actually looks more like a normal ethernet ring interface
> with an 'owner' bit in each entry.
> So it is important to write the owner bit last.
>
> It might be possibly to set multiple ring entries in two TLPs
> by first writing all of them (maybe with write combining)
> but without changing the ownership of the first entry.
> Then doing a second transfer to update the owner bit it
> the first entry.
> The order of the writes in the first transfer would then not
> matter.
>

The design flaw in that kind of design would be the need to set the 
owner bit on every entry.  Now, in the case of CNIC/OPA-2 support for 
write combining was an explicit design goal, so writes are inert until 
the trigger bit is written, at which point the head pointer is moved to 
the entry containing the trigger bit.  Very effective.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 14:24                 ` Ben Hutchings
  2012-08-22 14:30                   ` Benjamin LaHaise
@ 2012-08-22 14:50                   ` Linus Torvalds
  2012-08-22 14:56                     ` Linus Torvalds
  2012-08-22 15:41                     ` Ben Hutchings
  1 sibling, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 14:50 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

On Wed, Aug 22, 2012 at 7:24 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
>
> Sorry, I'll paste it below.

The thing you pasted isn't actually the thing in the subject line.
It's just you *using* it.

I wanted to see what that "writeo()" looks like for x86-64.

But I got google to find it for me by looking for "__raw_writeo", so I
can see the patch now. It looks like it might work. Does it really
help performance despite always doing those TS games in CR0 for each
access?

                 Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 14:50                   ` Linus Torvalds
@ 2012-08-22 14:56                     ` Linus Torvalds
  2012-08-22 15:05                       ` David Laight
  2012-08-22 15:41                     ` Ben Hutchings
  1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 14:56 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

On Wed, Aug 22, 2012 at 7:50 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I got google to find it for me by looking for "__raw_writeo", so I
> can see the patch now. It looks like it might work. Does it really
> help performance despite always doing those TS games in CR0 for each
> access?

Btw, are we even certain that a 128-bit PCIe write is going to remain
atomic across a bus (ie over various PCIe bridges etc)? Do you you
care? Is it just a "one transaction is cheaper than two", and it
doesn't really have any ordering constraints? If the thing gets split
into two 64-bit transactions (in whatever order) by a bridge on the
way, would that be ok?

We've seen buses split accesses before (ie 64-bit writes on 32-bit PCI).

                     Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 14:56                     ` Linus Torvalds
@ 2012-08-22 15:05                       ` David Laight
  2012-08-22 15:16                         ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: David Laight @ 2012-08-22 15:05 UTC (permalink / raw)
  To: Linus Torvalds, Ben Hutchings
  Cc: H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

> Btw, are we even certain that a 128-bit PCIe write is going to remain
> atomic across a bus (ie over various PCIe bridges etc)? Do you you
> care? Is it just a "one transaction is cheaper than two", and it
> doesn't really have any ordering constraints? If the thing gets split
> into two 64-bit transactions (in whatever order) by a bridge on the
> way, would that be ok?

PCIe transfers are basically hdlc packets containing the address,
command and any associated data. Unless they get bridged
though some strange PCIe<->PCI<->PCIe system they are very
unlikely to get broken up.
Maybe if they are longer than the maximum TLP size for the
target somewhere - but that is probably at least 128 bytes.

The time taken is largely independent of the transfer size.
The systems I've used (ppc accessing an Altera FPGA) have
PCIe cycles types of the order of microseconds.
Even for slow comms it is important to generate long TLP.

	David

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 15:05                       ` David Laight
@ 2012-08-22 15:16                         ` Linus Torvalds
  0 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2012-08-22 15:16 UTC (permalink / raw)
  To: David Laight
  Cc: Ben Hutchings, H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

On Wed, Aug 22, 2012 at 8:05 AM, David Laight <David.Laight@aculab.com> wrote:
>
> PCIe transfers are basically hdlc packets containing the address,
> command and any associated data. Unless they get bridged
> though some strange PCIe<->PCI<->PCIe system they are very
> unlikely to get broken up.

It's exactly the odd kind of "mix in a non-native PCIe bridge" setups
I'd worry about. But I guess that is pretty unlikely in any modern
machine (except for thunderbolt, and I think that's going to pass any
PCIe stuff through unchanged).

                  Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 14:50                   ` Linus Torvalds
  2012-08-22 14:56                     ` Linus Torvalds
@ 2012-08-22 15:41                     ` Ben Hutchings
  1 sibling, 0 replies; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

On Wed, 2012-08-22 at 07:50 -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2012 at 7:24 AM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
> >
> > Sorry, I'll paste it below.
> 
> The thing you pasted isn't actually the thing in the subject line.
> It's just you *using* it.
> 
> I wanted to see what that "writeo()" looks like for x86-64.
> 
> But I got google to find it for me by looking for "__raw_writeo", so I
> can see the patch now. It looks like it might work. Does it really
> help performance despite always doing those TS games in CR0 for each
> access?

I haven't run the experiment myself, but my colleagues observed a net
reduction of 100s of nanoseconds of latency.  That may not sound like
much, but for a small packet traversing an uncongested twinax link it's
around 5-10% of the total latency from the descriptor pointer write to
DMA completion on the peer.

Later, you wrote:
> Btw, are we even certain that a 128-bit PCIe write is going to remain
> atomic across a bus (ie over various PCIe bridges etc)?

I don't think PCIe bridges are allowed to split up TLPs (this is why the
PCI core has to be so careful about programming Max Payload Size).  What
happens between the processor core and the host bridge is another
matter, though.

> Do you you
> care? Is it just a "one transaction is cheaper than two", and it
> doesn't really have any ordering constraints? If the thing gets split
> into two 64-bit transactions (in whatever order) by a bridge on the
> way, would that be ok?

We care if the two transactions are not in ascending address order;
that's why we had to abandon write combining.

> We've seen buses split accesses before (ie 64-bit writes on 32-bit
> PCI).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations
  2012-08-22 14:20               ` Linus Torvalds
  2012-08-22 14:24                 ` Ben Hutchings
@ 2012-08-22 14:42                 ` David Laight
  1 sibling, 0 replies; 56+ messages in thread
From: David Laight @ 2012-08-22 14:42 UTC (permalink / raw)
  To: Linus Torvalds, Ben Hutchings
  Cc: H. Peter Anvin, David Miller, tglx, mingo, netdev,
	linux-net-drivers, x86

> Any patch that exports some "atomic 128-bit MMIO writes" for general
> use sounds totally and utterly broken. You can't do it. It's
> *fundamentally* an operation that many CPU's cannot even do. 64-bit
> buses (or even 32-bit ones) will make the 128-bit write be split up
> *anyway*, regardless of any 128-bit register issues.
> 
> And nobody else sane cares about this, so it shouldn't even be a
> x86-64 specific thing...

There are several other processors that can generate long
PCIe transfers, sometimes by using a DMA engine associated
with the PCIe interface (eg freescale 83xx ppc).

Given the slow speed of PCIe transactions (think ISA bus
speeds - at least with some targets), it is important to
be able to request multi-word transfers in a generic way
on systems that can support it.

This support is a property of the PCIe interface block,
not that of the driver that wishes to use the function.

I don't know if XMM register transfers generate 16byte
TLP on any x86 cpus - they might on some.

Perhaps claiming the function is atomic is the real
problem - otherwise a single TLP could be used on
systems (and in contexts) where it is possible, but
a slower mulit-TLP transfer done (possibly without
guaranteeing the transfer order) done where it is not.

	David

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 3/3] sfc: Use __raw_writeo() to perform TX descriptor push where possible
  2012-08-22  1:17 [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O Ben Hutchings
  2012-08-22  1:20 ` [PATCH 1/3] x86_64: Define 128-bit types for kernel code only Ben Hutchings
  2012-08-22  1:23 ` [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations Ben Hutchings
@ 2012-08-22  1:26 ` Ben Hutchings
  2012-08-22  1:38 ` [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O H. Peter Anvin
  3 siblings, 0 replies; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22  1:26 UTC (permalink / raw)
  To: David Miller, Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: netdev, linux-net-drivers, x86

Use the new __raw_writeo() function for TX descriptor push where
available.  This means we now use only a single PCIe transaction
on x86_64 (vs 2 before), reducing TX latency slightly.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
David,

Assuming the previous 2 patches are accepted for x86, I expect you'll
still want to take this into the net-next tree.  So perhaps they should
go onto a separate branch that you can pull?

Ben.

 drivers/net/ethernet/sfc/bitfield.h |    3 +++
 drivers/net/ethernet/sfc/io.h       |   18 ++++++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/sfc/bitfield.h b/drivers/net/ethernet/sfc/bitfield.h
index b26a954..5feeba2 100644
--- a/drivers/net/ethernet/sfc/bitfield.h
+++ b/drivers/net/ethernet/sfc/bitfield.h
@@ -83,6 +83,9 @@ typedef union efx_qword {
 
 /* An octword (eight-word, i.e. 16 byte) datatype - little-endian in HW */
 typedef union efx_oword {
+#ifdef HAVE_INT128
+	__le128 u128;
+#endif
 	__le64 u64[2];
 	efx_qword_t qword[2];
 	__le32 u32[4];
diff --git a/drivers/net/ethernet/sfc/io.h b/drivers/net/ethernet/sfc/io.h
index 751d1ec..fbcdc6d 100644
--- a/drivers/net/ethernet/sfc/io.h
+++ b/drivers/net/ethernet/sfc/io.h
@@ -57,10 +57,22 @@
  *   current state.
  */
 
-#if BITS_PER_LONG == 64
+#if defined(writeo)
+#define EFX_USE_OWORD_IO 1
+#endif
+
+#if defined(readq) && defined(writeq)
 #define EFX_USE_QWORD_IO 1
 #endif
 
+#ifdef EFX_USE_OWORD_IO
+static inline void _efx_writeo(struct efx_nic *efx, __le128 value,
+			       unsigned int reg)
+{
+	__raw_writeo((__force u128)value, efx->membase + reg);
+}
+#endif
+
 #ifdef EFX_USE_QWORD_IO
 static inline void _efx_writeq(struct efx_nic *efx, __le64 value,
 				  unsigned int reg)
@@ -235,7 +247,9 @@ static inline void _efx_writeo_page(struct efx_nic *efx, efx_oword_t *value,
 		   "writing register %x with " EFX_OWORD_FMT "\n", reg,
 		   EFX_OWORD_VAL(*value));
 
-#ifdef EFX_USE_QWORD_IO
+#if defined(EFX_USE_OWORD_IO)
+	_efx_writeo(efx, value->u128, reg);
+#elif defined(EFX_USE_QWORD_IO)
 	_efx_writeq(efx, value->u64[0], reg + 0);
 	_efx_writeq(efx, value->u64[1], reg + 8);
 #else
-- 
1.7.7.6


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O
  2012-08-22  1:17 [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O Ben Hutchings
                   ` (2 preceding siblings ...)
  2012-08-22  1:26 ` [PATCH 3/3] sfc: Use __raw_writeo() to perform TX descriptor push where possible Ben Hutchings
@ 2012-08-22  1:38 ` H. Peter Anvin
  2012-08-22  1:43   ` Ben Hutchings
  3 siblings, 1 reply; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22  1:38 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Thomas Gleixner, Ingo Molnar, netdev, linux-net-drivers, x86

On 08/21/2012 06:17 PM, Ben Hutchings wrote:
> Current Solarflare network controllers have 128-bit memory-mapped
> registers which are normally accessed through a series of I/O
> operations.  However, it is also possible to access them with a single
> MOVAPS instruction on x86_64, and this is measurably faster as it
> requires only one PCIe transaction.

Also, have you considered doing this with write combining instead?

	-hpa

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O
  2012-08-22  1:38 ` [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O H. Peter Anvin
@ 2012-08-22  1:43   ` Ben Hutchings
  2012-08-22  1:59     ` H. Peter Anvin
  0 siblings, 1 reply; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22  1:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, Ingo Molnar, netdev, linux-net-drivers, x86

On Tue, 2012-08-21 at 18:38 -0700, H. Peter Anvin wrote:
> On 08/21/2012 06:17 PM, Ben Hutchings wrote:
> > Current Solarflare network controllers have 128-bit memory-mapped
> > registers which are normally accessed through a series of I/O
> > operations.  However, it is also possible to access them with a single
> > MOVAPS instruction on x86_64, and this is measurably faster as it
> > requires only one PCIe transaction.
> 
> Also, have you considered doing this with write combining instead?

We tried it, and it goes horribly wrong.  On some systems, the writes
are not combined, but they are reordered in a way the hardware doesn't
support.  See the comment at the top of drivers/net/ethernet/sfc/io.h.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O
  2012-08-22  1:43   ` Ben Hutchings
@ 2012-08-22  1:59     ` H. Peter Anvin
  2012-08-22  2:10       ` Ben Hutchings
  0 siblings, 1 reply; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22  1:59 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Thomas Gleixner, Ingo Molnar, netdev, linux-net-drivers, x86

On 08/21/2012 06:43 PM, Ben Hutchings wrote:
> On Tue, 2012-08-21 at 18:38 -0700, H. Peter Anvin wrote:
>> On 08/21/2012 06:17 PM, Ben Hutchings wrote:
>>> Current Solarflare network controllers have 128-bit memory-mapped
>>> registers which are normally accessed through a series of I/O
>>> operations.  However, it is also possible to access them with a single
>>> MOVAPS instruction on x86_64, and this is measurably faster as it
>>> requires only one PCIe transaction.
>>
>> Also, have you considered doing this with write combining instead?
> 
> We tried it, and it goes horribly wrong.  On some systems, the writes
> are not combined, but they are reordered in a way the hardware doesn't
> support.  See the comment at the top of drivers/net/ethernet/sfc/io.h.
> 

Yes, you have to make sure you properly enforce the necessary ordering
requirements manually (I think you can do that with sfence).

	-hpa

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O
  2012-08-22  1:59     ` H. Peter Anvin
@ 2012-08-22  2:10       ` Ben Hutchings
  2012-08-22  2:31         ` H. Peter Anvin
  0 siblings, 1 reply; 56+ messages in thread
From: Ben Hutchings @ 2012-08-22  2:10 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, Ingo Molnar, netdev, linux-net-drivers, x86

On Tue, 2012-08-21 at 18:59 -0700, H. Peter Anvin wrote:
> On 08/21/2012 06:43 PM, Ben Hutchings wrote:
> > On Tue, 2012-08-21 at 18:38 -0700, H. Peter Anvin wrote:
> >> On 08/21/2012 06:17 PM, Ben Hutchings wrote:
> >>> Current Solarflare network controllers have 128-bit memory-mapped
> >>> registers which are normally accessed through a series of I/O
> >>> operations.  However, it is also possible to access them with a single
> >>> MOVAPS instruction on x86_64, and this is measurably faster as it
> >>> requires only one PCIe transaction.
> >>
> >> Also, have you considered doing this with write combining instead?
> > 
> > We tried it, and it goes horribly wrong.  On some systems, the writes
> > are not combined, but they are reordered in a way the hardware doesn't
> > support.  See the comment at the top of drivers/net/ethernet/sfc/io.h.
> > 
> 
> Yes, you have to make sure you properly enforce the necessary ordering
> requirements manually (I think you can do that with sfence).

We did put an sfence after the writes to each register.  But some
systems only want to combine writes that cover an entire cache line, and
the writes covering a 128-bit register get broken back up into multiple
writes at the PCIe level.  And on some systems these are sent in
decreasing address order, which breaks the rules for writing to
TX_DESC_UPD.

To avoid this we'd have to put an sfence in between the writes to a
register, leaving us back where we started.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O
  2012-08-22  2:10       ` Ben Hutchings
@ 2012-08-22  2:31         ` H. Peter Anvin
  0 siblings, 0 replies; 56+ messages in thread
From: H. Peter Anvin @ 2012-08-22  2:31 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Thomas Gleixner, Ingo Molnar, netdev, linux-net-drivers, x86

On 08/21/2012 07:10 PM, Ben Hutchings wrote:
>>
>> Yes, you have to make sure you properly enforce the necessary ordering
>> requirements manually (I think you can do that with sfence).
> 
> We did put an sfence after the writes to each register.  But some
> systems only want to combine writes that cover an entire cache line, and
> the writes covering a 128-bit register get broken back up into multiple
> writes at the PCIe level.  And on some systems these are sent in
> decreasing address order, which breaks the rules for writing to
> TX_DESC_UPD.
> 
> To avoid this we'd have to put an sfence in between the writes to a
> register, leaving us back where we started.
> 

You realize the same applies to 128-bit writes, right?  Some cores
and/or systems will break them up.

	-hpa

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2012-08-22 21:38 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-08-22  1:17 [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O Ben Hutchings
2012-08-22  1:20 ` [PATCH 1/3] x86_64: Define 128-bit types for kernel code only Ben Hutchings
2012-08-22  1:23 ` [PATCH 2/3] x86_64: Define 128-bit memory-mapped I/O operations Ben Hutchings
2012-08-22  1:37   ` H. Peter Anvin
2012-08-22  2:04     ` Ben Hutchings
2012-08-22  2:34       ` David Miller
2012-08-22  3:24         ` H. Peter Anvin
2012-08-22  3:29           ` David Miller
2012-08-22  3:49             ` H. Peter Anvin
2012-08-22  3:52           ` Linus Torvalds
2012-08-22  3:59             ` H. Peter Anvin
2012-08-22  4:14               ` David Miller
2012-08-22 21:14                 ` David Miller
2012-08-22 21:28                   ` Linus Torvalds
2012-08-22 21:38                     ` David Miller
2012-08-22  4:35               ` Linus Torvalds
2012-08-22  5:00                 ` David Miller
2012-08-22 14:06                   ` Linus Torvalds
2012-08-22  4:42             ` Linus Torvalds
2012-08-22 13:26             ` Ben Hutchings
2012-08-22 14:20               ` Linus Torvalds
2012-08-22 14:24                 ` Ben Hutchings
2012-08-22 14:30                   ` Benjamin LaHaise
2012-08-22 14:58                     ` Ben Hutchings
2012-08-22 15:13                       ` H. Peter Anvin
2012-08-22 15:27                         ` David Laight
2012-08-22 15:49                           ` H. Peter Anvin
2012-08-22 15:51                           ` Ben Hutchings
2012-08-22 15:54                             ` H. Peter Anvin
2012-08-22 16:44                               ` Ben Hutchings
2012-08-22 16:49                                 ` H. Peter Anvin
2012-08-22 16:55                                 ` Linus Torvalds
2012-08-22 17:09                                   ` Ben Hutchings
2012-08-22 17:12                                     ` H. Peter Anvin
2012-08-22 17:27                                       ` Ben Hutchings
2012-08-22 17:54                                         ` Linus Torvalds
2012-08-22 18:11                                           ` Ben Hutchings
2012-08-22 18:18                                             ` H. Peter Anvin
2012-08-22 18:28                                             ` Linus Torvalds
2012-08-22 19:01                                               ` Ben Hutchings
2012-08-22 17:26                                     ` Linus Torvalds
2012-08-22 16:51                               ` Linus Torvalds
2012-08-22 16:59                                 ` H. Peter Anvin
2012-08-22 15:51                           ` H. Peter Anvin
2012-08-22 14:50                   ` Linus Torvalds
2012-08-22 14:56                     ` Linus Torvalds
2012-08-22 15:05                       ` David Laight
2012-08-22 15:16                         ` Linus Torvalds
2012-08-22 15:41                     ` Ben Hutchings
2012-08-22 14:42                 ` David Laight
2012-08-22  1:26 ` [PATCH 3/3] sfc: Use __raw_writeo() to perform TX descriptor push where possible Ben Hutchings
2012-08-22  1:38 ` [PATCH 0/3] x86_64, sfc: 128-bit memory-mapped I/O H. Peter Anvin
2012-08-22  1:43   ` Ben Hutchings
2012-08-22  1:59     ` H. Peter Anvin
2012-08-22  2:10       ` Ben Hutchings
2012-08-22  2:31         ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).