* [PATCH net-next 0/2] Add pgtable API to query if write combining is available
@ 2014-10-01 14:54 Or Gerlitz
2014-10-01 14:54 ` [PATCH net-next 1/2] pgtable: Add " Or Gerlitz
2014-10-01 14:54 ` [PATCH net-next 2/2] net/mlx4_core: Disable BF when write combining is not available Or Gerlitz
0 siblings, 2 replies; 6+ messages in thread
From: Or Gerlitz @ 2014-10-01 14:54 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Amir Vadai, Jack Morgenstein, Moshe Lazer, Tal Alon,
Yevgeny Petrilin, Or Gerlitz
Currently the kernel write-combining interface provides a best effort
mechanism in which the caller simply invokes pgprot_writecombine().
If write combining is available, the region is mapped for it, otherwise
the region is (silently) mapped as non-cached. In some cases, however,
the calling driver must know if write combining is available, so a silent
best effort mechanism is not sufficient. Add writecombine_available(), which
returns 1 if the system supports write combining and 0 if it doesn't.
In mlx4 for better latency, we write send descriptors to a write-combining
(WC) mapped buffer instead of ringing a doorbell and having the HW fetch
the descriptor from system memory.
However, if write-combining is not supported on the host, then we
obtain better latency by using the doorbell-ring/HW fetch mechanism.
This series from Moshe and Jack adds the API and uses it in mlx4.
We are sending through netdev to get feedback from the networking
community and extend the reviewer audience if required.
Or
Moshe Lazer (2):
pgtable: Add API to query if write combining is available
net/mlx4_core: Disable BF when write combining is not available
arch/arm/include/asm/pgtable.h | 6 ++++++
arch/arm64/include/asm/pgtable.h | 5 +++++
arch/ia64/include/asm/pgtable.h | 6 ++++++
arch/powerpc/include/asm/pgtable.h | 6 ++++++
arch/x86/include/asm/pgtable_types.h | 2 ++
arch/x86/mm/pat.c | 9 +++++++++
drivers/net/ethernet/mellanox/mlx4/fw.c | 2 +-
include/asm-generic/pgtable.h | 8 ++++++++
8 files changed, 43 insertions(+), 1 deletions(-)
* [PATCH net-next 1/2] pgtable: Add API to query if write combining is available
2014-10-01 14:54 [PATCH net-next 0/2] Add pgtable API to query if write combining is available Or Gerlitz
@ 2014-10-01 14:54 ` Or Gerlitz
2014-10-03 19:31 ` David Miller
2014-10-01 14:54 ` [PATCH net-next 2/2] net/mlx4_core: Disable BF when write combining is not available Or Gerlitz
1 sibling, 1 reply; 6+ messages in thread
From: Or Gerlitz @ 2014-10-01 14:54 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Amir Vadai, Jack Morgenstein, Moshe Lazer, Tal Alon,
Yevgeny Petrilin, Or Gerlitz
From: Moshe Lazer <moshel@mellanox.com>
Currently the kernel write-combining interface provides a best effort
mechanism in which the caller simply invokes pgprot_writecombine().
If write combining is available, the region is mapped for it, otherwise
the region is (silently) mapped as non-cached.
In some cases, however, the calling driver must know if write combining
is available, so a silent best effort mechanism is not sufficient.
Add writecombine_available(), which returns 1 if the system
supports write combining and 0 if it doesn't.
Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
arch/arm/include/asm/pgtable.h | 6 ++++++
arch/arm64/include/asm/pgtable.h | 5 +++++
arch/ia64/include/asm/pgtable.h | 6 ++++++
arch/powerpc/include/asm/pgtable.h | 6 ++++++
arch/x86/include/asm/pgtable_types.h | 2 ++
arch/x86/mm/pat.c | 9 +++++++++
include/asm-generic/pgtable.h | 8 ++++++++
7 files changed, 42 insertions(+), 0 deletions(-)
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index 01baef0..87224dc 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -119,6 +119,12 @@ extern pgprot_t pgprot_s2_device;
#define pgprot_writecombine(prot) \
__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_BUFFERABLE)
+#define writecombine_available writecombine_available
+static inline int writecombine_available(void)
+{
+ return 1;
+}
+
#define pgprot_stronglyordered(prot) \
__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_UNCACHED)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ffe1ba0..f284cb6 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -296,6 +296,11 @@ static inline int has_transparent_hugepage(void)
__pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_DEVICE_nGnRnE) | PTE_PXN | PTE_UXN)
#define pgprot_writecombine(prot) \
__pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)
+#define writecombine_available writecombine_available
+static inline int writecombine_available(void)
+{
+ return 1;
+}
#define __HAVE_PHYS_MEM_ACCESS_PROT
struct file;
extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 7935115..66bfd00 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -356,6 +356,12 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
#define pgprot_noncached(prot) __pgprot((pgprot_val(prot) & ~_PAGE_MA_MASK) | _PAGE_MA_UC)
#define pgprot_writecombine(prot) __pgprot((pgprot_val(prot) & ~_PAGE_MA_MASK) | _PAGE_MA_WC)
+#define writecombine_available writecombine_available
+static inline int writecombine_available(void)
+{
+ return 1;
+}
+
struct file;
extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
unsigned long size, pgprot_t vma_prot);
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index d98c1ec..a52e035 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -267,6 +267,12 @@ extern int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long addre
#define pgprot_writecombine pgprot_noncached_wc
+#define writecombine_available writecombine_available
+static inline int writecombine_available(void)
+{
+ return 1;
+}
+
struct file;
extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
unsigned long size, pgprot_t vma_prot);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f216963..5f8d2b4 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -337,6 +337,8 @@ extern int nx_enabled;
#define pgprot_writecombine pgprot_writecombine
extern pgprot_t pgprot_writecombine(pgprot_t prot);
+#define writecombine_available writecombine_available
+int writecombine_available(void);
/* Indicate that x86 has its own track and untrack pfn vma functions */
#define __HAVE_PFNMAP_TRACKING
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 6574388..c144ab3 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -797,6 +797,15 @@ pgprot_t pgprot_writecombine(pgprot_t prot)
}
EXPORT_SYMBOL_GPL(pgprot_writecombine);
+int writecombine_available(void)
+{
+ if (pat_enabled)
+ return 1;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(writecombine_available);
+
#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_X86_PAT)
static struct memtype *memtype_get_idx(loff_t pos)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 53b2acc..f503e5b 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -249,6 +249,14 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
#define pgprot_writecombine pgprot_noncached
#endif
+#ifndef writecombine_available
+#define writecombine_available writecombine_available
+static inline int writecombine_available(void)
+{
+ return 0;
+}
+#endif
+
/*
* When walking page tables, get the address of the next boundary,
* or the end address of the range if that comes earlier. Although no
--
1.7.1
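For readers unfamiliar with the "#define name name" idiom used in the hunks above, here is a minimal userspace sketch of how the per-arch definition and the asm-generic fallback interact. It is not the kernel code: ARCH_HAS_WC is a hypothetical switch that stands in for "the arch header was included", and wc_status() is a helper made up for the illustration.

```c
/* Stand-in for an arch header. ARCH_HAS_WC is a hypothetical demo switch,
 * not a real kernel symbol; it is left undefined here on purpose. */
#ifdef ARCH_HAS_WC
#define writecombine_available writecombine_available
static inline int writecombine_available(void)
{
	return 1;
}
#endif

/* Stand-in for include/asm-generic/pgtable.h: only provides the function
 * when no arch header has already claimed the name via the macro. */
#ifndef writecombine_available
#define writecombine_available writecombine_available
static inline int writecombine_available(void)
{
	return 0;	/* conservative default: report WC as unavailable */
}
#endif

/* Hypothetical helper, for illustration only. */
static inline const char *wc_status(void)
{
	return writecombine_available() ? "available" : "not available";
}
```

The macro expands to its own name, so callers still reach the function; its only job is to let the generic header detect that a definition already exists. Building the sketch with -DARCH_HAS_WC selects the arch variant instead of the fallback.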
* [PATCH net-next 2/2] net/mlx4_core: Disable BF when write combining is not available
2014-10-01 14:54 [PATCH net-next 0/2] Add pgtable API to query if write combining is available Or Gerlitz
2014-10-01 14:54 ` [PATCH net-next 1/2] pgtable: Add " Or Gerlitz
@ 2014-10-01 14:54 ` Or Gerlitz
2014-10-01 16:52 ` Alexei Starovoitov
1 sibling, 1 reply; 6+ messages in thread
From: Or Gerlitz @ 2014-10-01 14:54 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Amir Vadai, Jack Morgenstein, Moshe Lazer, Tal Alon,
Yevgeny Petrilin, Or Gerlitz
From: Moshe Lazer <moshel@mellanox.com>
In mlx4 for better latency, we write send descriptors to a write-combining
(WC) mapped buffer instead of ringing a doorbell and having the HW fetch
the descriptor from system memory.
However, if write-combining is not supported on the host, then we
obtain better latency by using the doorbell-ring/HW fetch mechanism.
The mechanism that uses WC is called Blue-Flame (BF). BF is beneficial
only when the system supports write combining. When the BF buffer is
mapped as a write-combine buffer, the HCA receives data in multi-word
bursts. However, if the BF buffer is mapped only as non-cached, the
HCA receives data in individual dword chunks, which harms performance.
Therefore, disable blueflame when write combining is not available.
Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx4/fw.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 2e88a23..f7bb548 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -671,7 +671,7 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
dev_cap->min_page_sz = 1 << field;
MLX4_GET(field, outbox, QUERY_DEV_CAP_BF_OFFSET);
- if (field & 0x80) {
+ if ((field & 0x80) && writecombine_available()) {
MLX4_GET(field, outbox, QUERY_DEV_CAP_LOG_BF_REG_SZ_OFFSET);
dev_cap->bf_reg_size = 1 << (field & 0x1f);
MLX4_GET(field, outbox, QUERY_DEV_CAP_LOG_MAX_BF_REGS_PER_PAGE_OFFSET);
--
1.7.1
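The decision made by the changed condition can be sketched in isolation as below. This is a standalone illustration under stated assumptions, not mlx4 code: bf_enabled() and bf_reg_size() are hypothetical helpers, BF_SUPPORTED_BIT is a made-up name for the 0x80 bit tested in the hunk, and the wc_available argument stands in for the return value of writecombine_available().

```c
#include <stdbool.h>
#include <stdint.h>

#define BF_SUPPORTED_BIT 0x80	/* hypothetical name for the bit tested in the hunk */

/* Hypothetical helper: BF is used only if the device reports it AND the
 * host can actually map the BF buffer write-combining. */
static inline bool bf_enabled(uint8_t field, bool wc_available)
{
	return (field & BF_SUPPORTED_BIT) && wc_available;
}

/* Hypothetical helper mirroring "1 << (field & 0x1f)" from the hunk:
 * the device reports the BF register size as a log2 value. */
static inline uint32_t bf_reg_size(uint8_t log_field)
{
	return 1u << (log_field & 0x1f);
}
```

With this shape, a device that advertises BF on a host without write combining simply falls through to the doorbell path, which is the behaviour the commit message asks for.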
* Re: [PATCH net-next 2/2] net/mlx4_core: Disable BF when write combining is not available
2014-10-01 14:54 ` [PATCH net-next 2/2] net/mlx4_core: Disable BF when write combining is not available Or Gerlitz
@ 2014-10-01 16:52 ` Alexei Starovoitov
2014-10-02 14:37 ` Or Gerlitz
0 siblings, 1 reply; 6+ messages in thread
From: Alexei Starovoitov @ 2014-10-01 16:52 UTC (permalink / raw)
To: Or Gerlitz
Cc: David S. Miller, netdev@vger.kernel.org, Amir Vadai,
Jack Morgenstein, Moshe Lazer, Tal Alon, Yevgeny Petrilin
On Wed, Oct 1, 2014 at 7:54 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> From: Moshe Lazer <moshel@mellanox.com>
>
> In mlx4 for better latency, we write send descriptors to a write-combining
> (WC) mapped buffer instead of ringing a doorbell and having the HW fetch
> the descriptor from system memory.
>
> However, if write-combining is not supported on the host, then we
> obtain better latency by using the doorbell-ring/HW fetch mechanism.
>
> The mechanism that uses WC is called Blue-Flame (BF). BF is beneficial
> only when the system supports write combining. When the BF buffer is
> mapped as a write-combine buffer, the HCA receives data in multi-word
> bursts. However, if the BF buffer is mapped only as non-cached, the
> HCA receives data in individual dword chunks, which harms performance.
>
> Therefore, disable blueflame when write combining is not available.
curious, what numbers you're seeing:
- bf=on with wc
- bf=on without wc
- bf=off and doorbell
they will help to justify this change.
* Re: [PATCH net-next 2/2] net/mlx4_core: Disable BF when write combining is not available
2014-10-01 16:52 ` Alexei Starovoitov
@ 2014-10-02 14:37 ` Or Gerlitz
0 siblings, 0 replies; 6+ messages in thread
From: Or Gerlitz @ 2014-10-02 14:37 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David S. Miller, netdev@vger.kernel.org, Amir Vadai,
Jack Morgenstein, Moshe Lazer, Tal Alon, Yevgeny Petrilin,
Amir Ancel
On 10/1/2014 7:52 PM, Alexei Starovoitov wrote:
> On Wed, Oct 1, 2014 at 7:54 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
>> From: Moshe Lazer <moshel@mellanox.com>
>>
>> In mlx4 for better latency, we write send descriptors to a write-combining
>> (WC) mapped buffer instead of ringing a doorbell and having the HW fetch
>> the descriptor from system memory.
>>
>> However, if write-combining is not supported on the host, then we
>> obtain better latency by using the doorbell-ring/HW fetch mechanism.
>>
>> The mechanism that uses WC is called Blue-Flame (BF). BF is beneficial
>> only when the system supports write combining. When the BF buffer is
>> mapped as a write-combine buffer, the HCA receives data in multi-word
>> bursts. However, if the BF buffer is mapped only as non-cached, the
>> HCA receives data in individual dword chunks, which harms performance.
>>
>> Therefore, disable blueflame when write combining is not available.
> curious, what numbers you're seeing:
> - [1] bf=on with wc
> - [2] bf=on without wc
> - [3] bf=off and doorbell
> they will help to justify this change.
Sure, see below:
The 1st set of results was obtained by running the latency test
with the HCA passed through to a VM running on a KVM host, so WC
isn't available.
The problematic range is 32-128B. For example, with 128-byte
messages the typical latency is 1.47us with BF versus only 1us
without it. When WC isn't really available, every 64B write
actually translates into 8 writes of 8 bytes, which obviously
hurts the latency.
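The arithmetic behind the "8 writes of 8 bytes" remark can be sketched as a ceiling division. This is only an illustration: writes_needed() is a hypothetical helper, and treating a write-combined burst as a single 64-byte write is an assumption made for the comparison, not something measured here.

```c
#include <stddef.h>

/* Hypothetical helper: number of bus writes needed to push msg_bytes when
 * each write carries at most chunk_bytes. chunk_bytes must be non-zero. */
static inline size_t writes_needed(size_t msg_bytes, size_t chunk_bytes)
{
	return (msg_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

For a 64-byte descriptor, writes_needed(64, 8) gives the 8 separate writes described above, while writes_needed(64, 64) gives 1 under the single-burst assumption.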
# /usr/bin/taskset -c 0 ib_write_lat -d mlx4_0 -i 1 -F -a -n 1000000
[2] BF on without WC
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
2       1000000      0.74         186.16       0.79
4       1000000      0.70         103.62       0.78
8       1000000      0.74         77.02        0.78
16      1000000      0.65         640.75       0.86
32      1000000      0.90         134.63       0.96
64      1000000      1.05         808.52       1.11
128     1000000      1.05         405.58       1.47
[3] BF off and using doorbell
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
2       1000000      0.85         107.29       0.89
4       1000000      0.84         705.90       0.89
8       1000000      0.85         457.72       0.89
16      1000000      0.85         1041.43      0.90
32      1000000      0.88         773.67       0.92
64      1000000      0.90         82.70        0.93
128     1000000      0.96         78.20        1.00
The 2nd set of results was obtained by running the latency test
on a bare-metal host where WC is available. Here BF clearly gives
better latency than the doorbell-based path (around 300ns of
improvement; on some systems this climbs to 500ns).
# /usr/bin/taskset -c 0 ib_write_lat -d mlx4_0 -i 1 -F -a -n 1000000
[1] BF on, WC available
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
2       1000000      0.74         131.62       0.79
4       1000000      0.74         134.51       0.79
8       1000000      0.74         154.30       0.79
16      1000000      0.74         1437.57      0.79
32      1000000      0.79         138.23       0.83
64      1000000      0.82         135.86       0.85
128     1000000      0.94         131.11       0.98
[3] BF off and using doorbell
#bytes  #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
2       1000000      1.05         137.55       1.10
4       1000000      1.04         422.50       1.10
8       1000000      1.05         141.26       1.10
16      1000000      1.06         1261.99      1.11
32      1000000      1.09         141.47       1.14
64      1000000      1.11         435.44       1.16
128     1000000      1.22         212.19       1.27
* Re: [PATCH net-next 1/2] pgtable: Add API to query if write combining is available
2014-10-01 14:54 ` [PATCH net-next 1/2] pgtable: Add " Or Gerlitz
@ 2014-10-03 19:31 ` David Miller
0 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2014-10-03 19:31 UTC (permalink / raw)
To: ogerlitz; +Cc: netdev, amirv, jackm, moshel, talal, yevgenyp
From: Or Gerlitz <ogerlitz@mellanox.com>
Date: Wed, 1 Oct 2014 17:54:41 +0300
> +#define writecombine_available writecombine_available
> +static inline int writecombine_available(void)
> +{
> + return 1;
> +}
> +
Please use 'bool' and 'true'/'false'.
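One way the requested change might look is sketched below as a standalone userspace illustration; it is not a posted revision of the patch. The function names carry suffixes to mark them as hypothetical, and pat_enabled_demo is a made-up stand-in for the x86 pat_enabled flag tested in the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the x86 pat_enabled flag (an int in the patch). */
static int pat_enabled_demo = 1;

/* Sketch of the always-available arch variant, returning bool. */
static inline bool writecombine_available_always(void)
{
	return true;
}

/* Sketch of the x86 variant: the if/return 1/return 0 sequence collapses
 * into a single boolean expression. */
static inline bool writecombine_available_x86_demo(void)
{
	return pat_enabled_demo != 0;
}
```

The caller in mlx4 would not need to change under this sketch, since the result is only used as a truth value in the condition.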