Netdev List
 help / color / mirror / Atom feed
* [PATCH 5/7] x86: remove the experimental forcesac boot option
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: x86, iommu, linux-kernel, linux-ia64, netdev
In-Reply-To: <20180525143512.1466-1-hch@lst.de>

Limiting the dma mask to avoid PCI (pre-PCIe) DAC cycles while paying
the huge overhead of an IOMMU is rather pointless, and this seriously
gets in the way of dma mapping work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 .../admin-guide/kernel-parameters.txt         |  1 -
 Documentation/x86/x86_64/boot-options.txt     |  4 +---
 arch/x86/kernel/pci-dma.c                     | 21 +------------------
 drivers/net/ethernet/sfc/efx.c                |  5 ++---
 drivers/net/ethernet/sfc/falcon/efx.c         |  5 ++---
 5 files changed, 6 insertions(+), 30 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f2040d46f095..cc0ac035b8fe 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1705,7 +1705,6 @@
 		nopanic
 		merge
 		nomerge
-		forcesac
 		soft
 		pt		[x86, IA-64]
 		nobypass	[PPC/POWERNV]
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 153b3a57fba2..341588ec4e29 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -208,7 +208,7 @@ IOMMU (input/output memory management unit)
       Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
 
  iommu=[<size>][,noagp][,off][,force][,noforce][,leak[=<nr_of_leak_pages>]
-	[,memaper[=<order>]][,merge][,forcesac][,fullflush][,nomerge]
+	[,memaper[=<order>]][,merge][,fullflush][,nomerge]
 	[,noaperture][,calgary]
 
   General iommu options:
@@ -235,8 +235,6 @@ IOMMU (input/output memory management unit)
                        (experimental).
     nomerge            Don't do scatter-gather (SG) merging.
     noaperture         Ask the IOMMU not to touch the aperture for AGP.
-    forcesac           Force single-address cycle (SAC) mode for masks <40bits
-                       (experimental).
     noagp              Don't initialize the AGP driver and use full aperture.
     allowdac           Allow double-address cycle (DAC) mode, i.e. DMA >4GB.
                        DAC is used with 32-bit PCI to push a 64-bit address in
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 77625b60a510..91dff954b745 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -20,8 +20,6 @@ static int forbid_dac __read_mostly;
 const struct dma_map_ops *dma_ops = &dma_direct_ops;
 EXPORT_SYMBOL(dma_ops);
 
-static int iommu_sac_force __read_mostly;
-
 #ifdef CONFIG_IOMMU_DEBUG
 int panic_on_overflow __read_mostly = 1;
 int force_iommu __read_mostly = 1;
@@ -125,7 +123,7 @@ static __init int iommu_setup(char *p)
 		if (!strncmp(p, "nomerge", 7))
 			iommu_merge = 0;
 		if (!strncmp(p, "forcesac", 8))
-			iommu_sac_force = 1;
+			pr_warn("forcesac option ignored.\n");
 		if (!strncmp(p, "allowdac", 8))
 			forbid_dac = 0;
 		if (!strncmp(p, "nodac", 5))
@@ -165,23 +163,6 @@ int arch_dma_supported(struct device *dev, u64 mask)
 	}
 #endif
 
-	/* Tell the device to use SAC when IOMMU force is on.  This
-	   allows the driver to use cheaper accesses in some cases.
-
-	   Problem with this is that if we overflow the IOMMU area and
-	   return DAC as fallback address the device may not handle it
-	   correctly.
-
-	   As a special case some controllers have a 39bit address
-	   mode that is as efficient as 32bit (aic79xx). Don't force
-	   SAC for these.  Assume all masks <= 40 bits are of this
-	   type. Normally this doesn't make any difference, but gives
-	   more gentle handling of IOMMU overflow. */
-	if (iommu_sac_force && (mask >= DMA_BIT_MASK(40))) {
-		dev_info(dev, "Force SAC with mask %Lx\n", mask);
-		return 0;
-	}
-
 	return 1;
 }
 EXPORT_SYMBOL(arch_dma_supported);
diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index a4ebd8715494..661828e8fdcf 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -1289,9 +1289,8 @@ static int efx_init_io(struct efx_nic *efx)
 
 	pci_set_master(pci_dev);
 
-	/* Set the PCI DMA mask.  Try all possibilities from our
-	 * genuine mask down to 32 bits, because some architectures
-	 * (e.g. x86_64 with iommu_sac_force set) will allow 40 bit
+	/* Set the PCI DMA mask.  Try all possibilities from our genuine mask
+	 * down to 32 bits, because some architectures will allow 40 bit
 	 * masks event though they reject 46 bit masks.
 	 */
 	while (dma_mask > 0x7fffffffUL) {
diff --git a/drivers/net/ethernet/sfc/falcon/efx.c b/drivers/net/ethernet/sfc/falcon/efx.c
index 3d6c91e96589..dd5530a4f8c8 100644
--- a/drivers/net/ethernet/sfc/falcon/efx.c
+++ b/drivers/net/ethernet/sfc/falcon/efx.c
@@ -1242,9 +1242,8 @@ static int ef4_init_io(struct ef4_nic *efx)
 
 	pci_set_master(pci_dev);
 
-	/* Set the PCI DMA mask.  Try all possibilities from our
-	 * genuine mask down to 32 bits, because some architectures
-	 * (e.g. x86_64 with iommu_sac_force set) will allow 40 bit
+	/* Set the PCI DMA mask.  Try all possibilities from our genuine mask
+	 * down to 32 bits, because some architectures will allow 40 bit
 	 * masks event though they reject 46 bit masks.
 	 */
 	while (dma_mask > 0x7fffffffUL) {
-- 
2.17.0

^ permalink raw reply related

* [PATCH 4/7] x86: remove a stray reference to pci-nommu.c
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-ia64-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180525143512.1466-1-hch-jcswGhMUV9g@public.gmane.org>

This is just the minimal workaround.  The file file is mostly either stale
and/or duplicative of Documentation/admin-guide/kernel-parameters.txt,
but that is much more work than I'm willing to do right now.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 Documentation/x86/x86_64/boot-options.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index b297c48389b9..153b3a57fba2 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -187,9 +187,9 @@ PCI
 
 IOMMU (input/output memory management unit)
 
- Currently four x86-64 PCI-DMA mapping implementations exist:
+ Multiple x86-64 PCI-DMA mapping implementations exist, for example:
 
-   1. <arch/x86_64/kernel/pci-nommu.c>: use no hardware/software IOMMU at all
+   1. <lib/dma-direct.c>: use no hardware/software IOMMU at all
       (e.g. because you have < 3 GB memory).
       Kernel boot message: "PCI-DMA: Disabling IOMMU"
 
-- 
2.17.0

^ permalink raw reply related

* [PATCH 3/7] ia64: remove iommu_dma_supported
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-ia64-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180525143512.1466-1-hch-jcswGhMUV9g@public.gmane.org>

The generic dma_direct_supported helper already used by intel-iommu on
x86 does a better job than the ia64 reimplementation.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 arch/ia64/kernel/pci-dma.c  | 13 -------------
 drivers/iommu/intel-iommu.c |  2 --
 2 files changed, 15 deletions(-)

diff --git a/arch/ia64/kernel/pci-dma.c b/arch/ia64/kernel/pci-dma.c
index 50b6ad282a90..3c2884bef3d4 100644
--- a/arch/ia64/kernel/pci-dma.c
+++ b/arch/ia64/kernel/pci-dma.c
@@ -51,18 +51,6 @@ iommu_dma_init(void)
 	return;
 }
 
-int iommu_dma_supported(struct device *dev, u64 mask)
-{
-	/* Copied from i386. Doesn't make much sense, because it will
-	   only work for pci_alloc_coherent.
-	   The caller just has to use GFP_DMA in this case. */
-	if (mask < DMA_BIT_MASK(24))
-		return 0;
-
-	return 1;
-}
-EXPORT_SYMBOL(iommu_dma_supported);
-
 void __init pci_iommu_alloc(void)
 {
 	dma_ops = &intel_dma_ops;
@@ -71,7 +59,6 @@ void __init pci_iommu_alloc(void)
 	intel_dma_ops.sync_sg_for_cpu = machvec_dma_sync_sg;
 	intel_dma_ops.sync_single_for_device = machvec_dma_sync_single;
 	intel_dma_ops.sync_sg_for_device = machvec_dma_sync_sg;
-	intel_dma_ops.dma_supported = iommu_dma_supported;
 
 	/*
 	 * The order of these functions is important for
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 749d8f235346..5e0bef3754d1 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3841,9 +3841,7 @@ const struct dma_map_ops intel_dma_ops = {
 	.map_page = intel_map_page,
 	.unmap_page = intel_unmap_page,
 	.mapping_error = intel_mapping_error,
-#ifdef CONFIG_X86
 	.dma_supported = dma_direct_supported,
-#endif
 };
 
 static inline int iommu_domain_cache_init(void)
-- 
2.17.0

^ permalink raw reply related

* [PATCH 2/7] ia64: remove the dead iommu_sac_force variable
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: x86, iommu, linux-kernel, linux-ia64, netdev
In-Reply-To: <20180525143512.1466-1-hch@lst.de>

Looks like copy and paste from x86 that never actually got used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/ia64/kernel/pci-dma.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/arch/ia64/kernel/pci-dma.c b/arch/ia64/kernel/pci-dma.c
index b5df084c0af4..50b6ad282a90 100644
--- a/arch/ia64/kernel/pci-dma.c
+++ b/arch/ia64/kernel/pci-dma.c
@@ -18,8 +18,6 @@
 dma_addr_t bad_dma_address __read_mostly;
 EXPORT_SYMBOL(bad_dma_address);
 
-static int iommu_sac_force __read_mostly;
-
 int no_iommu __read_mostly;
 #ifdef CONFIG_IOMMU_DEBUG
 int force_iommu __read_mostly = 1;
@@ -61,23 +59,6 @@ int iommu_dma_supported(struct device *dev, u64 mask)
 	if (mask < DMA_BIT_MASK(24))
 		return 0;
 
-	/* Tell the device to use SAC when IOMMU force is on.  This
-	   allows the driver to use cheaper accesses in some cases.
-
-	   Problem with this is that if we overflow the IOMMU area and
-	   return DAC as fallback address the device may not handle it
-	   correctly.
-
-	   As a special case some controllers have a 39bit address
-	   mode that is as efficient as 32bit (aic79xx). Don't force
-	   SAC for these.  Assume all masks <= 40 bits are of this
-	   type. Normally this doesn't make any difference, but gives
-	   more gentle handling of IOMMU overflow. */
-	if (iommu_sac_force && (mask >= DMA_BIT_MASK(40))) {
-		dev_info(dev, "Force SAC with mask %llx\n", mask);
-		return 0;
-	}
-
 	return 1;
 }
 EXPORT_SYMBOL(iommu_dma_supported);
-- 
2.17.0

^ permalink raw reply related

* [PATCH 1/7] core, dma-direct: add a flag 32-bit dma limits
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-ia64-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180525143512.1466-1-hch-jcswGhMUV9g@public.gmane.org>

Various PCI bridges (VIA PCI, Xilinx PCIe) limit DMA to only 32-bits
even if the device itself supports more.  Add a single bit flag to
struct device (to be moved into the dma extension once we around it)
to flag such devices and reject larger DMA to them.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 include/linux/device.h | 3 +++
 lib/dma-direct.c       | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 477956990f5e..fa317e45f5e6 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -904,6 +904,8 @@ struct dev_links_info {
  * @offline:	Set after successful invocation of bus type's .offline().
  * @of_node_reused: Set if the device-tree node is shared with an ancestor
  *              device.
+ * @dma_32bit_limit: bridge limited to 32bit DMA even if the device itself
+ *		indicates support for a higher limit in the dma_mask field.
  *
  * At the lowest level, every device in a Linux system is represented by an
  * instance of struct device. The device structure contains the information
@@ -992,6 +994,7 @@ struct device {
 	bool			offline_disabled:1;
 	bool			offline:1;
 	bool			of_node_reused:1;
+	bool			dma_32bit_limit:1;
 };
 
 static inline struct device *kobj_to_dev(struct kobject *kobj)
diff --git a/lib/dma-direct.c b/lib/dma-direct.c
index bbfb229aa067..0151a7b2bc87 100644
--- a/lib/dma-direct.c
+++ b/lib/dma-direct.c
@@ -165,6 +165,12 @@ int dma_direct_supported(struct device *dev, u64 mask)
 	if (mask < DMA_BIT_MASK(32))
 		return 0;
 #endif
+	/*
+	 * Various PCI/PCIe bridges have broken support for > 32bit DMA even
+	 * if the device itself might support it.
+	 */
+	if (dev->dma_32bit_limit && mask > DMA_BIT_MASK(32))
+		return 0;
 	return 1;
 }
 
-- 
2.17.0

^ permalink raw reply related

* refactor 32-bit dma limit quirks
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-ia64-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi all,

x86 currently has some quirks to force lower dma masks.  They are mostly
useful for certain VIA systems that otherwise corrupt data, but otherwise
don't make much sense given that the modern DMA APIs do the right thing
automatically.

This series dumps a few of these old kernel command lines (including their
not really working version on ia64), and moves the VIA quirk to a flag
in struct device so that it can be apply generically.  This will be needed
to support Xylinx root ports with a similar issue that show up in common
RISC-V boards.

^ permalink raw reply

* Re: [PATCH net-next] net:sched: add action inheritdsfield to skbmod
From: Marcelo Ricardo Leitner @ 2018-05-25 14:34 UTC (permalink / raw)
  To: Fu, Qiaobin
  Cc: davem@davemloft.net, netdev@vger.kernel.org, jhs@mojatatu.com,
	Michel Machado
In-Reply-To: <C7516012-947F-4485-B5DA-DD9AD45427F8@bu.edu>

On Fri, May 25, 2018 at 05:45:03AM +0000, Fu, Qiaobin wrote:
> Hi Marcelo,
> 
> Thanks for pointing out these style issues. Below is the updated version:

Hi Qiaobin,

Looks good to me. Now you have to submit it like you submitted the
original patch, but add the version tag to the summary. Like '[PATCH
v2 net-next] ....'
And without the text before the changelog.

Thanks.

> 
> ---
> The new action inheritdsfield copies the field DS of
> IPv4 and IPv6 packets into skb->priority. This enables
> later classification of packets based on the DS field.
> 
> Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
> 
> Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
> Reviewed-by: Michel Machado <michel@digirati.com.br>
> ---
> 
> Note that the motivation for this patch is found in the following discussion:
> https://www.spinics.net/lists/netdev/msg501061.html
> ---
> 
> diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
> index 38c072f..0718b48 100644
> --- a/include/uapi/linux/tc_act/tc_skbmod.h
> +++ b/include/uapi/linux/tc_act/tc_skbmod.h
> @@ -19,6 +19,7 @@
>  #define SKBMOD_F_SMAC	0x2
>  #define SKBMOD_F_ETYPE	0x4
>  #define SKBMOD_F_SWAPMAC 0x8
> +#define SKBMOD_F_INHERITDSFIELD 0x10
>  
>  struct tc_skbmod {
>  	tc_gen;
> diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
> index ad050d7..e2082f6 100644
> --- a/net/sched/act_skbmod.c
> +++ b/net/sched/act_skbmod.c
> @@ -16,6 +16,9 @@
>  #include <linux/rtnetlink.h>
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
> +#include <net/ip.h>
> +#include <net/ipv6.h>
> +#include <net/dsfield.h>
>  
>  #include <linux/tc_act/tc_skbmod.h>
>  #include <net/tc_act/tc_skbmod.h>
> @@ -72,6 +75,26 @@ static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a,
>  		ether_addr_copy(eth_hdr(skb)->h_source, (u8 *)tmpaddr);
>  	}
>  
> +	if (flags & SKBMOD_F_INHERITDSFIELD) {
> +		int wlen = skb_network_offset(skb);
> +
> +		switch (tc_skb_protocol(skb)) {
> +		case htons(ETH_P_IP):
> +			wlen += sizeof(struct iphdr);
> +			if (!pskb_may_pull(skb, wlen))
> +				return TC_ACT_SHOT;
> +			skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
> +			break;
> +
> +		case htons(ETH_P_IPV6):
> +			wlen += sizeof(struct ipv6hdr);
> +			if (!pskb_may_pull(skb, wlen))
> +				return TC_ACT_SHOT;
> +			skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
> +			break;
> +		}
> +	}
> +
>  	return action;
>  }
>  
> @@ -127,6 +150,9 @@ static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
>  	if (parm->flags & SKBMOD_F_SWAPMAC)
>  		lflags = SKBMOD_F_SWAPMAC;
>  
> +	if (parm->flags & SKBMOD_F_INHERITDSFIELD)
> +		lflags |= SKBMOD_F_INHERITDSFIELD;
> +
>  	exists = tcf_idr_check(tn, parm->index, a, bind);
>  	if (exists && bind)
>  		return 0;
> 
> > On May 23, 2018, at 2:06 PM, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> > 
> > Hi,
> > 
> > Some style fixes:
> > 
> > On Thu, May 17, 2018 at 07:33:08PM +0000, Fu, Qiaobin wrote:
> >> net/sched: add action inheritdsfield to skbmod
> > 
> > This extra line above should not be here.
> > 
> >> 
> >> The new action inheritdsfield copies the field DS of
> >> IPv4 and IPv6 packets into skb->prioriry. This enables
> >                              typo -----^
> > 
> >> later classification of packets based on the DS field.
> >> 
> >> Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
> >> 
> >> Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
> >> Reviewed-by: Michel Machado <michel@digirati.com.br>
> >> ---
> >> 
> >> Note that the motivation for this patch is found in the following discussion:
> >> https://www.spinics.net/lists/netdev/msg501061.html
> >> ---
> >> 
> >> diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
> >> index 38c072f..0718b48 100644
> >> --- a/include/uapi/linux/tc_act/tc_skbmod.h
> >> +++ b/include/uapi/linux/tc_act/tc_skbmod.h
> >> @@ -19,6 +19,7 @@
> >> #define SKBMOD_F_SMAC	0x2
> >> #define SKBMOD_F_ETYPE	0x4
> >> #define SKBMOD_F_SWAPMAC 0x8
> >> +#define SKBMOD_F_INHERITDSFIELD 0x10
> >> 
> >> struct tc_skbmod {
> >> 	tc_gen;
> >> diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
> >> index ad050d7..21d5bec 100644
> >> --- a/net/sched/act_skbmod.c
> >> +++ b/net/sched/act_skbmod.c
> >> @@ -16,6 +16,9 @@
> >> #include <linux/rtnetlink.h>
> >> #include <net/netlink.h>
> >> #include <net/pkt_sched.h>
> >> +#include <net/ip.h>
> >> +#include <net/ipv6.h>
> >> +#include <net/dsfield.h>
> >> 
> >> #include <linux/tc_act/tc_skbmod.h>
> >> #include <net/tc_act/tc_skbmod.h>
> >> @@ -72,6 +75,25 @@ static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a,
> >> 		ether_addr_copy(eth_hdr(skb)->h_source, (u8 *)tmpaddr);
> >> 	}
> >> 
> >> +	if (flags & SKBMOD_F_INHERITDSFIELD) {
> >> +		int wlen = skb_network_offset(skb);
> > 
> > You need a blank line here, between var declaration and the rest.
> > 
> >> +		switch (tc_skb_protocol(skb)) {
> >> +		case htons(ETH_P_IP):
> >> +			wlen += sizeof(struct iphdr);
> >> +			if (!pskb_may_pull(skb, wlen))
> >> +				return TC_ACT_SHOT;
> >> +			skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
> >> +			break;
> >> +
> >> +		case htons(ETH_P_IPV6):
> >> +			wlen += sizeof(struct ipv6hdr);
> >> +			if (!pskb_may_pull(skb, wlen))
> >> +				return TC_ACT_SHOT;
> >> +			skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
> >> +			break;
> >> +		}
> >> +	}
> >> +
> >> 	return action;
> >> }
> >> 
> >> @@ -127,6 +149,9 @@ static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
> >> 	if (parm->flags & SKBMOD_F_SWAPMAC)
> >> 		lflags = SKBMOD_F_SWAPMAC;
> >> 
> >> +	if (parm->flags & SKBMOD_F_INHERITDSFIELD)
> >> +		lflags |= SKBMOD_F_INHERITDSFIELD;
> >> +
> >> 	exists = tcf_idr_check(tn, parm->index, a, bind);
> >> 	if (exists && bind)
> >> 		return 0;
> 

^ permalink raw reply

* Re: STMMAC driver with TSO enabled issue
From: Jose Abreu @ 2018-05-25 14:32 UTC (permalink / raw)
  To: Bhadram Varka, Jose Abreu, netdev@vger.kernel.org, Joao Pinto
In-Reply-To: <94cda7c4-127c-cae1-e51e-8853224065e2@nvidia.com>

On 25-05-2018 15:25, Bhadram Varka wrote:
> Hi Jose,
>
> On 5/25/2018 7:35 PM, Jose Abreu wrote:
>> Hi Bhadram,
>>
>> On 25-05-2018 05:41, Bhadram Varka wrote:
>>> Hi Jose,
>>>
>>> On 5/24/2018 3:01 PM, Jose Abreu wrote:
>>>> Hi Bhadram,
>>>>
>>>> On 24-05-2018 06:58, Bhadram Varka wrote:
>>>>>
>>>>> After some time if check Tx descriptor status - then I see
>>>>> only
>>>>> below
>>>>>
>>>>> [..]
>>>>> [85788.286730] 027 [0x827951b0]: 0xf854f000 0x0 0x16d8
>>>>> 0x90000000
>>>>>
>>>>> index 025 and 026 descriptors processed but not index 027.
>>>>>
>>>>> At this stage Tx DMA is always in below state -
>>>>>
>>>>> ■ 3'b011: Running (Reading Data from system memory
>>>>> buffer and queuing it to the Tx buffer (Tx FIFO))
>>>>
>>>> Thats strange, I think the descriptors look okay though. I will
>>>> need the registers values (before the lock) and, if
>>>> possible, the
>>>> git bisect output.
>>>
>>> Attaching the register dump file after the issue observed.
>>> Please check once.
>>>
>>
>> ----->8-----
>> 0x112c = 0x0000003F
>> 0x11ac = 0x0000003F
>> 0x122c = 0x0000003F
>> 0x12ac = 0x0000003F
>>
>> 0x1130 = 0x0000003F
>> 0x11b0 = 0x0000003F
>> 0x1230 = 0x0000003F
>> 0x12b0 = 0x0000003F
>> ----->8-----
>>
>> This can't be right, it should be DMA_{RX/TX}_SIZE - 1 = 511. Did
>> you change these values in the code?
>>
>
> Yes. I have changed the descriptor length to 64 - so that
> searching for the current descriptor status would be easy.

Ok, it shouldn't impact anything. The only thing I'm remembering
now is that you can have TSO not enabled in all DMA channels (HW
configuration allows this). Please check if TSO in single-queue
works.

Thanks and Best Regards,
Jose Miguel Abreu

^ permalink raw reply

* hello
From: las8ms @ 2018-05-25 14:28 UTC (permalink / raw)


we came across your e-mail contact prior a private search while in need 
of help

Get your bank atm card, I prepare card that worth huge amount of money
loaded in the card is $3,000,000m us dollars, with this card you can 
make
maximum withdrawals of $500 us dollars daily from any atm machine in
your country. get back to us by this email, mathewthomson746@gmail,com

Mr. mr. mathewthomson,
Manager  director atm card department

^ permalink raw reply

* [PATCH net-next] net: sched: shrink struct Qdisc
From: Paolo Abeni @ 2018-05-25 14:28 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Jamal Hadi Salim, Cong Wang, Jiri Pirko

The struct Qdisc has a lot of holes, especially after commit
a53851e2c321 ("net: sched: explicit locking in gso_cpu fallback"),
which as a side effect, moved the fields just after 'busylock'
on a new cacheline.

Since both 'padded' and 'refcnt' are not updated frequently, and
there is a hole before 'gso_skb', we can move such fields there,
saving a cacheline without any performance side effect.

Before this commit:

pahole -C Qdisc net/sche/sch_generic.o
	# ...
        /* size: 384, cachelines: 6, members: 25 */
        /* sum members: 236, holes: 3, sum holes: 92 */
        /* padding: 56 */

After this commit:
pahole -C Qdisc net/sche/sch_generic.o
	# ...
	/* size: 320, cachelines: 5, members: 25 */
	/* sum members: 236, holes: 2, sum holes: 28 */
	/* padding: 56 */

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/sch_generic.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 98c10a28cd01..827a3711dc68 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -85,6 +85,8 @@ struct Qdisc {
 	struct net_rate_estimator __rcu *rate_est;
 	struct gnet_stats_basic_cpu __percpu *cpu_bstats;
 	struct gnet_stats_queue	__percpu *cpu_qstats;
+	int			padded;
+	refcount_t		refcnt;
 
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
@@ -97,8 +99,6 @@ struct Qdisc {
 	unsigned long		state;
 	struct Qdisc            *next_sched;
 	struct sk_buff_head	skb_bad_txq;
-	int			padded;
-	refcount_t		refcnt;
 
 	spinlock_t		busylock ____cacheline_aligned_in_smp;
 	spinlock_t		seqlock;
-- 
2.17.0

^ permalink raw reply related

* Re: STMMAC driver with TSO enabled issue
From: Bhadram Varka @ 2018-05-25 14:25 UTC (permalink / raw)
  To: Jose Abreu, netdev@vger.kernel.org, Joao Pinto
In-Reply-To: <78914761-0375-5929-ed88-5225e0e260b9@synopsys.com>

Hi Jose,

On 5/25/2018 7:35 PM, Jose Abreu wrote:
> Hi Bhadram,
> 
> On 25-05-2018 05:41, Bhadram Varka wrote:
>> Hi Jose,
>>
>> On 5/24/2018 3:01 PM, Jose Abreu wrote:
>>> Hi Bhadram,
>>>
>>> On 24-05-2018 06:58, Bhadram Varka wrote:
>>>>
>>>> After some time if check Tx descriptor status - then I see only
>>>> below
>>>>
>>>> [..]
>>>> [85788.286730] 027 [0x827951b0]: 0xf854f000 0x0 0x16d8
>>>> 0x90000000
>>>>
>>>> index 025 and 026 descriptors processed but not index 027.
>>>>
>>>> At this stage Tx DMA is always in below state -
>>>>
>>>> ■ 3'b011: Running (Reading Data from system memory
>>>> buffer and queuing it to the Tx buffer (Tx FIFO))
>>>
>>> Thats strange, I think the descriptors look okay though. I will
>>> need the registers values (before the lock) and, if possible, the
>>> git bisect output.
>>
>> Attaching the register dump file after the issue observed.
>> Please check once.
>>
> 
> ----->8-----
> 0x112c = 0x0000003F
> 0x11ac = 0x0000003F
> 0x122c = 0x0000003F
> 0x12ac = 0x0000003F
> 
> 0x1130 = 0x0000003F
> 0x11b0 = 0x0000003F
> 0x1230 = 0x0000003F
> 0x12b0 = 0x0000003F
> ----->8-----
> 
> This can't be right, it should be DMA_{RX/TX}_SIZE - 1 = 511. Did
> you change these values in the code?
> 

Yes. I have changed the descriptor length to 64 - so that searching for 
the current descriptor status would be easy.

-- 
Thanks,
Bhadram.

^ permalink raw reply

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: David Miller @ 2018-05-25 14:23 UTC (permalink / raw)
  To: qing.huang
  Cc: tariqt, haakon.bugge, yanjun.zhu, netdev, linux-rdma,
	linux-kernel, gi-oh.kim
In-Reply-To: <20180523232246.20445-1-qing.huang@oracle.com>

From: Qing Huang <qing.huang@oracle.com>
Date: Wed, 23 May 2018 16:22:46 -0700

> When a system is under memory presure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
> 
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
> 
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
> 
> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
> 
> However in order to support smaller ICM chunk size, we need to fix
> another issue in large size kcalloc allocations.
> 
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
> 
> The solution is to use kvzalloc to replace kcalloc which will fall back
> to vmalloc automatically if kmalloc fails.
> 
> Signed-off-by: Qing Huang <qing.huang@oracle.com>
> Acked-by: Daniel Jurgens <danielj@mellanox.com>
> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] ath10k: transmit queued frames after waking queues
From: Niklas Cassel @ 2018-05-25 14:21 UTC (permalink / raw)
  To: Bob Copeland
  Cc: Adrian Chadd, Kalle Valo, David Miller, ath10k, linux-wireless,
	netdev, Linux Kernel Mailing List
In-Reply-To: <20180525125023.alc42lkgehc6iodg@localhost>

On Fri, May 25, 2018 at 08:50:23AM -0400, Bob Copeland wrote:
> On Fri, May 25, 2018 at 02:36:56PM +0200, Niklas Cassel wrote:
> > A spin lock does have the advantage of ordering: memory operations issued
> > before the spin_unlock_bh() will be completed before the spin_unlock_bh()
> > operation has completed.
> > 
> > However, ath10k_htt_tx_dec_pending() was called earlier in the same function,
> > which decreases htt->num_pending_tx, so that write will be completed before
> > our read. That is the only ordering we care about here (if we should call
> > ath10k_mac_tx_push_pending() or not).
> 
> Sure.  I also understand that reading inside a lock and operating on the
> value outside the lock isn't really the definition of synchronization
> (doesn't really matter in this case though).
> 
> I was just suggesting that the implicit memory barrier in the spin unlock
> that we are already paying for would be sufficient here too, and it matches
> the semantic of "tx fields under tx_lock."  On the other hand, maybe it's
> just me, but I tend to look askance at just-in-case READ_ONCEs sprinkled
> about.

I agree, because of the implicit memory barrier from spin_unlock_bh(),
READ_ONCE shouldn't really be needed in this case.

I think that it's a good thing to be critical of all "just-in-case" things,
however, it's not always that obvious if you actually need READ_ONCE or not.

E.g. you might need to use it even when you are holding a spin_lock.

Some people recommend to use it for all concurrent non-read-only shared memory
accesses: https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE

Is there a better guideline somewhere..?


Kind regards,
Niklas

^ permalink raw reply

* Re: [patch iproute2/net-next 2/2] devlink: introduce support for showing port number and split subport number
From: David Ahern @ 2018-05-25 14:19 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, idosch, jakub.kicinski, mlxsw, andrew, vivien.didelot,
	f.fainelli, michael.chan, ganeshgr, saeedm, simon.horman,
	pieter.jansenvanvuuren, john.hurley, dirk.vandermerwe,
	alexander.h.duyck, ogerlitz, vijaya.guvva, satananda.burla,
	raghu.vatsavayi, felix.manlunas, gospo, sathya.perla,
	vasundhara-v.volam, tariqt, eranbe, jeffrey.t.kirsher, roopa
In-Reply-To: <20180524063952.GC2295@nanopsycho>

On 5/24/18 12:39 AM, Jiri Pirko wrote:
> Wed, May 23, 2018 at 10:05:49PM CEST, dsahern@gmail.com wrote:
>> On 5/20/18 2:15 AM, Jiri Pirko wrote:
>>> From: Jiri Pirko <jiri@mellanox.com>
>>>
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>> ---
>>>  devlink/devlink.c            | 6 ++++++
>>>  include/uapi/linux/devlink.h | 2 ++
>>>  2 files changed, 8 insertions(+)
>>>
>>> diff --git a/devlink/devlink.c b/devlink/devlink.c
>>> index df2c66dac1c7..b0ae17767dab 100644
>>> --- a/devlink/devlink.c
>>> +++ b/devlink/devlink.c
>>> @@ -1737,9 +1737,15 @@ static void pr_out_port(struct dl *dl, struct nlattr **tb)
>>>  
>>>  		pr_out_str(dl, "flavour", port_flavour_name(port_flavour));
>>>  	}
>>> +	if (tb[DEVLINK_ATTR_PORT_NUMBER])
>>> +		pr_out_uint(dl, "number",
>>> +			    mnl_attr_get_u32(tb[DEVLINK_ATTR_PORT_NUMBER]));
>>
>> "number" is a label means nothing. "port" is more descriptive.
> 
> That attribute name is "port_number". As the other attributes are
> named "port_something", and the "something" is printed out here, the
> "number" is consistent with it. Each line represents a port with a list
> of attributes.

The name of the attribute is not relevant here. That's an API that very
few people will see. I am looking at this from a user perspective and
the word "number" followed by a number is not clear.

> 
>>
>> # ./devlink port
>> pci/0000:03:00.0/1: type eth netdev swp17 flavour physical number 17
>> pci/0000:03:00.0/3: type eth netdev swp18 flavour physical number 18
>> pci/0000:03:00.0/5: type eth netdev swp19 flavour physical number 19
>> pci/0000:03:00.0/7: type eth netdev swp20 flavour physical number 20
>> pci/0000:03:00.0/9: type eth netdev swp21 flavour physical number 21
>> ...
>> pci/0000:03:00.0/61: type eth netdev swp1s0 flavour physical number 1
>> split_group 1 subport 0
>> pci/0000:03:00.0/62: type eth netdev swp1s1 flavour physical number 1
>> split_group 1 subport 1
>>

^ permalink raw reply

* KASAN: use-after-free Write in tls_push_record
From: syzbot @ 2018-05-25 14:16 UTC (permalink / raw)
  To: aviadye, borisp, davejwatson, davem, linux-kernel, netdev,
	syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    13405468f49d bpfilter: don't pass O_CREAT when opening con..
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=109ad82f800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8be0182d69f8d422
dashboard link: https://syzkaller.appspot.com/bug?extid=709f2810a6a05f11d4d3
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=151ec3a7800000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=154d302f800000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+709f2810a6a05f11d4d3@syzkaller.appspotmail.com

RDX: 00000000fffffdef RSI: 00000000200005c0 RDI: 0000000000000003
RBP: 00007ffd6ccdd780 R08: 0000000020000000 R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000212 R12: 0000000000000004
R13: ffffffffffffffff R14: 0000000000000000 R15: 0000000000000000
==================================================================
BUG: KASAN: use-after-free in tls_fill_prepend include/net/tls.h:339  
[inline]
BUG: KASAN: use-after-free in tls_push_record+0x1023/0x13e0  
net/tls/tls_sw.c:240
Write of size 1 at addr ffff8801d88d5000 by task syz-executor377/4600

CPU: 1 PID: 4600 Comm: syz-executor377 Not tainted 4.17.0-rc6+ #61
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
  __asan_report_store1_noabort+0x17/0x20 mm/kasan/report.c:435
  tls_fill_prepend include/net/tls.h:339 [inline]
  tls_push_record+0x1023/0x13e0 net/tls/tls_sw.c:240
  tls_sw_sendmsg+0x9de/0x12b0 net/tls/tls_sw.c:484
  inet_sendmsg+0x19f/0x690 net/ipv4/af_inet.c:798
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xd5/0x120 net/socket.c:639
  __sys_sendto+0x3d7/0x670 net/socket.c:1789
  __do_sys_sendto net/socket.c:1801 [inline]
  __se_sys_sendto net/socket.c:1797 [inline]
  __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1797
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4416d9
RSP: 002b:00007ffd6ccdd758 EFLAGS: 00000212 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004416d9
RDX: 00000000fffffdef RSI: 00000000200005c0 RDI: 0000000000000003
RBP: 00007ffd6ccdd780 R08: 0000000020000000 R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000212 R12: 0000000000000004
R13: ffffffffffffffff R14: 0000000000000000 R15: 0000000000000000

The buggy address belongs to the page:
page:ffffea0007623540 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x2fffc0000000000()
raw: 02fffc0000000000 0000000000000000 0000000000000000 00000000ffffffff
raw: ffffea0007592b60 ffff8801dae2fdd8 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8801d88d4f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff8801d88d4f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff8801d88d5000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                    ^
  ffff8801d88d5080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff8801d88d5100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: STMMAC driver with TSO enabled issue
From: Jose Abreu @ 2018-05-25 14:05 UTC (permalink / raw)
  To: Bhadram Varka, Jose Abreu, netdev@vger.kernel.org, Joao Pinto
In-Reply-To: <3143ce1f-30e7-6b8d-06b5-6048abab54bc@nvidia.com>

Hi Bhadram,

On 25-05-2018 05:41, Bhadram Varka wrote:
> Hi Jose,
>
> On 5/24/2018 3:01 PM, Jose Abreu wrote:
>> Hi Bhadram,
>>
>> On 24-05-2018 06:58, Bhadram Varka wrote:
>>>
>>> After some time if check Tx descriptor status - then I see only
>>> below
>>>
>>> [..]
>>> [85788.286730] 027 [0x827951b0]: 0xf854f000 0x0 0x16d8
>>> 0x90000000
>>>
>>> index 025 and 026 descriptors processed but not index 027.
>>>
>>> At this stage Tx DMA is always in below state -
>>>
>>> ■ 3'b011: Running (Reading Data from system memory
>>> buffer and queuing it to the Tx buffer (Tx FIFO))
>>
>> Thats strange, I think the descriptors look okay though. I will
>> need the registers values (before the lock) and, if possible, the
>> git bisect output.
>
> Attaching the register dump file after the issue observed.
> Please check once.
>

----->8-----
0x112c = 0x0000003F
0x11ac = 0x0000003F
0x122c = 0x0000003F
0x12ac = 0x0000003F

0x1130 = 0x0000003F
0x11b0 = 0x0000003F
0x1230 = 0x0000003F
0x12b0 = 0x0000003F
----->8-----

This can't be right, it should be DMA_{RX/TX}_SIZE - 1 = 511. Did
you change these values in the code?

Thanks and Best Regards,
Jose Miguel Abreu

^ permalink raw reply

* Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0
From: Bjorn Helgaas @ 2018-05-25 14:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Bjorn Helgaas, linux-pci, netdev, Sathya Perla, Felix Manlunas,
	alexander.duyck, john.fastabend, Jacob Keller, Donald Dutile,
	oss-drivers, Christoph Hellwig
In-Reply-To: <20180524182015.1af7b4c9@cakuba>

On Thu, May 24, 2018 at 06:20:15PM -0700, Jakub Kicinski wrote:
> Hi Bjorn!
> 
> On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:
> > On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> > > Some user space depends on enabling sriov_totalvfs number of VFs
> > > to not fail, e.g.:
> > > 
> > > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > > 
> > > For devices which VF support depends on loaded FW we have the
> > > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > > to 0.  Remove the special values completely and simply initialize
> > > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > > Add a helper for drivers to reset the VF limit back to total.  
> > 
> > I still can't really make sense out of the changelog.
> >
> > I think part of the reason it's confusing is because there are two
> > things going on:
> > 
> >   1) You want this:
> >   
> >        pci_sriov_set_totalvfs(dev, 0);
> >        x = pci_sriov_get_totalvfs(dev) 
> > 
> >      to return 0 instead of total_VFs.  That seems to connect with
> >      your subject line.  It means "sriov_totalvfs" in sysfs could be
> >      0, but I don't know how that is useful (I'm sure it is; just
> >      educate me :))
> 
> Let me just quote the bug report that got filed on our internal bug
> tracker :)
> 
>   When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
>   errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
>   then tries to set that as the sriov_numvfs parameter.
> 
>   For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0, 
>   but it's set to max.  When FW is switched to flower*, the correct 
>   sriov_totalvfs value is presented.
> 
> * flower is a project name

>From the point of view of the PCI core (which knows nothing about
device firmware and relies on the architected config space described
by the PCIe spec), this sounds like an erratum: with some firmware
installed, the device is not capable of SR-IOV, but still advertises
an SR-IOV capability with "TotalVFs > 0".

Regardless of whether that's an erratum, we do allow PF drivers to use
pci_sriov_set_totalvfs() to limit the number of VFs that may be
enabled by writing to the PF's "sriov_numvfs" sysfs file.

But the current implementation does not allow a PF driver to limit VFs
to 0, and that does seem nonsensical.

> My understanding is OpenStack uses sriov_totalvfs to determine how many
> VFs can be enabled, looks like this is the code:
> 
> http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464
> 
> >   2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
> >      sure what you intend for this.  Is *every* driver supposed to
> >      call it in .remove()?  Could/should this be done in the core
> >      somehow instead of depending on every driver?
> 
> Good question, I was just thinking yesterday we may want to call it
> from the core, but I don't think it's strictly necessary nor always
> sufficient (we may reload FW without re-probing).
> 
> We have a device which supports different number of VFs based on the FW
> loaded.  Some legacy FWs does not inform the driver how many VFs it can
> support, because it supports max.  So the flow in our driver is this:
> 
> load_fw(dev);
> ...
> max_vfs = ask_fw_for_max_vfs(dev);
> if (max_vfs >= 0)
> 	return pci_sriov_set_totalvfs(dev, max_vfs);
> else /* FW didn't tell us, assume max */
> 	return pci_sriov_reset_totalvfs(dev); 
> 
> We also reset the max on device remove, but that's not strictly
> necessary.
> 
> Other users of pci_sriov_set_totalvfs() always know the value to set
> the total to (either always get it from FW or it's a constant).
> 
> If you prefer we can work out the correct max for those legacy cases in
> the driver as well, although it seemed cleaner to just ask the core,
> since it already has total_VFs value handy :)
> 
> > I'm also having a hard time connecting your user-space command example
> > with the rest of this.  Maybe it will make more sense to me tomorrow
> > after some coffee.
> 
> OpenStack assumes it will always be able to set sriov_numvfs to
> sriov_totalvfs, see this 'if':
> 
> http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n512

Thanks for educating me.  I think there are two issues here that we
can separate.  I extracted the patch below for the first.

The second is the question of resetting driver_max_VFs.  I think we
currently have a general issue in the core:

  - load PF driver 1
  - driver calls pci_sriov_set_totalvfs() to reduce driver_max_VFs
  - unload PF driver 1
  - load PF driver 2

Now driver_max_VFs is still stuck at the lower value set by driver 1.
I don't think that's the way this should work.

I guess this is partly a consequence of setting driver_max_VFs in
sriov_init(), which is called before driver attach and should only
depend on hardware characteristics, so it is related to the patch
below.  But I think we should fix it in general, not just for
netronome.


commit 4a338bc6f94b9ad824ac944f5dfc249d6838719c
Author: Jakub Kicinski <jakub.kicinski@netronome.com>
Date:   Fri May 25 08:18:34 2018 -0500

    PCI/IOV: Allow PF drivers to limit total_VFs to 0
    
    Some SR-IOV PF drivers implement .sriov_configure(), which allows
    user-space to enable VFs by writing the desired number of VFs to the sysfs
    "sriov_numvfs" file (see sriov_numvfs_store()).
    
    The PCI core limits the number of VFs to the TotalVFs advertised by the
    device in its SR-IOV capability.  The PF driver can limit the number of VFs
    to even fewer (it may have pre-allocated data structures or knowledge of
    device limitations) by calling pci_sriov_set_totalvfs(), but previously it
    could not limit the VFs to 0.
    
    Change pci_sriov_get_totalvfs() so it always respects the VF limit imposed
    by the PF driver, even if the limit is 0.
    
    This sequence:
    
      pci_sriov_set_totalvfs(dev, 0);
      x = pci_sriov_get_totalvfs(dev);
    
    previously set "x" to TotalVFs from the SR-IOV capability.  Now it will set
    "x" to 0.
    
    Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 192b82898a38..d0d73dbbd5ca 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -469,6 +469,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
 	iov->nres = nres;
 	iov->ctrl = ctrl;
 	iov->total_VFs = total;
+	iov->driver_max_VFs = total;
 	pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device);
 	iov->pgsz = pgsz;
 	iov->self = dev;
@@ -827,10 +828,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
 	if (!dev->is_physfn)
 		return 0;
 
-	if (dev->sriov->driver_max_VFs)
-		return dev->sriov->driver_max_VFs;
-
-	return dev->sriov->total_VFs;
+	return dev->sriov->driver_max_VFs;
 }
 EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
 

^ permalink raw reply related

* Re: [PATCH bpf-next 0/5] fix test_sockmap
From: John Fastabend @ 2018-05-25 14:01 UTC (permalink / raw)
  To: Prashant Bhole, Alexei Starovoitov, Daniel Borkmann
  Cc: David S . Miller, Shuah Khan, netdev
In-Reply-To: <a7a84334-b650-23e7-9df5-43a942ac9666@lab.ntt.co.jp>

On 05/25/2018 01:28 AM, Prashant Bhole wrote:
> 
> 
> On 5/24/2018 1:58 PM, John Fastabend wrote:
>> On 05/23/2018 09:47 PM, Prashant Bhole wrote:
>>>
>>>
>>> On 5/23/2018 6:44 PM, Prashant Bhole wrote:
>>>>
>>>>
>>>> On 5/22/2018 2:08 AM, John Fastabend wrote:
>>>>> On 05/20/2018 10:13 PM, Prashant Bhole wrote:
>>>>>>
>>>>>>
>>>>>> On 5/19/2018 1:42 AM, John Fastabend wrote:
>>>>>>> On 05/18/2018 12:17 AM, Prashant Bhole wrote:
>>>>>>>> This series fixes bugs in test_sockmap code. They weren't caught
>>>>>>>> previously because failure in RX/TX thread was not notified to the
>>>>>>>> main thread.
>>>>>>>>
>>>>>>>> Also fixed data verification logic and slightly improved test
>>>>>>>> output
>>>>>>>> such that parameters values (cork, apply, start, end) of failed
>>>>>>>> test
>>>>>>>> can be easily seen.
>>>>>>>>
>>>>>>>
>>>>>>> Great, this was on my list so thanks for taking care of it.
>>>>>>>
>>>>>>>> Note: Even after fixing above problems there are issues with tests
>>>>>>>> which set cork parameter. Tests fail (RX thread timeout) when cork
>>>>>>>> value is non-zero and overall data sent by TX thread isn't
>>>>>>>> multiples
>>>>>>>> of cork value.
>>>>>>>
>>>>>>>
>>>>>>> This is expected. When 'cork' is set the sender should only xmit
>>>>>>> the data when 'cork' bytes are available. If the user doesn't
>>>>>>> provide the N bytes the data is cork'ed waiting for the bytes and
>>>>>>> if the socket is closed the state is cleaned up. What these tests
>>>>>>> are testing is the cleanup path when a user doesn't provide the
>>>>>>> N bytes. In practice this is used to validate headers and prevent
>>>>>>> users from sending partial headers. We want to keep these tests
>>>>>>> because
>>>>>>> they verify a tear-down path in the code.
>>>>>>
>>>>>> Ok.
>>>>>>
>>>>>>>
>>>>>>> After your changes do these get reported as failures? If so we
>>>>>>> need to account for the above in the calculations.
>>>>>>
>>>>>> Yes, cork related test are reported as failures because of RX thread
>>>>>> timeout.
>>>>>>
>>>>>> So with your above description, I think we need to differentiate cork
>>>>>> tests with partial data and full data. In partial data test we can
>>>>>> have
>>>>>> something like "timeout_expected" flag. Any other way to fix it?
>>>>>>
>>>>>
>>>>> Adding a flag seems reasonable to me. Lets do this for now. Also I
>>>>> plan to add more negative tests so we can either use the same
>>>>> flag or a new one for those cases as well.
>>>>>
>>>>
>>>> John,
>>>> I worked on this for some time and noticed that the RX-timeout of
>>>> tests with cork parameter is dependent on various parameters. So we
>>>> can not set a flag like the way 'drop_expected' flag is set before
>>>> executing the test.
>>>>
>>>> So I decided to write a function which judges all parameters before
>>>> each test and decides whether a test with cork parameter will
>>>> timeout or not. Then the conditions in the function became
>>>> complicated. For example some tests fail if opt->rate < 17 (with
>>>> some other conditions). Here is 17 is related to FRAGS_PER_SKB.
>>>> Consider following two examples.
>>> I'm sorry. Correction: s/FRAGS_PER_SKB/MAX_SKB_FRAGS/
>>>
>>>>
>>>> ./test_sockmap --cgroup /mnt/cgroup2 -r 16 -i 1 -l 30 -t sendpage
>>>> --txmsg --txmsg_cork 1024   # RX timeout occurs
>>>>
>>>> ./test_sockmap --cgroup /mnt/cgroup2 -r 17 -i 1 -l 30 -t sendpage
>>>> --txmsg --txmsg_cork 1024   # Success!
>>>>
>>
>> Ah yes this hits the buffer limit and flushes the queue. The kernel
>> side doesn't know how to merge those specific sendpage requests so
>> it gives each request its own buffer and when the limit is reached
>> we flush it.
>>
>>>> Do we need to keep such tests? if yes, then I will continue with
>>>> adding such conditions in the function.
>>>>
>>
>> Yes, these tests are needed because they are testing the edge cases.
>> These are probably the most important tests because my normal usage
>> will catch any issues in the "good" cases its these types of things
>> that can go unnoticed (at least for a short while) if we don't have
>> specific tests for them.
> 
> I tried but it is difficult to come up with a right set of conditions
> which lead to test failure.
> 

Agreed, it can be yes. How about adding your logic for all tests except
"cork" cases. If there is a flag to set if the timeout is expected we
can always manually set it in the test invocation. Might not be as
nice as automatically learning the expected results but possibly easier
than building some complicated logic to figure it out.

Would you mind submitting your series again without the "cork" tests
being tracked? And if you want add a bit to tell if the "cork" tests are
going to timeout or not setting it per test manually. But I think
your series can just omit the cork test for now and still be useful.

> -Prashant
>>
>> Thanks for doing this.
>> John
> 
> 

^ permalink raw reply

* Re: KASAN: use-after-free Read in ccid2_hc_tx_packet_recv
From: syzbot @ 2018-05-25 13:52 UTC (permalink / raw)
  To: alexey.kodanev, davem, dccp, edumazet, gerrit, keescook,
	linux-kernel, netdev, soheil, syzkaller-bugs
In-Reply-To: <0000000000003872fd0568da185f@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    b50694381cfc Merge branch 'stable/for-linus-4.17' of git:/..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17151cb7800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
dashboard link: https://syzkaller.appspot.com/bug?extid=554ccde221001ab5479a
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1363ccb7800000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1272e2b7800000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+554ccde221001ab5479a@syzkaller.appspotmail.com

Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
==================================================================
  fail_dump lib/fault-inject.c:51 [inline]
  should_fail.cold.4+0xa/0x1a lib/fault-inject.c:149
BUG: KASAN: use-after-free in ccid2_hc_tx_packet_recv+0x2383/0x275e  
net/dccp/ccids/ccid2.c:597
Read of size 1 at addr ffff8801ba4911c2 by task syz-executor940/4542

  __should_failslab+0x124/0x180 mm/failslab.c:32
  should_failslab+0x9/0x14 mm/slab_common.c:1522
  slab_pre_alloc_hook mm/slab.h:423 [inline]
  slab_alloc mm/slab.c:3378 [inline]
  kmem_cache_alloc_trace+0x4b/0x780 mm/slab.c:3618
  kmalloc include/linux/slab.h:512 [inline]
  dccp_ackvec_parsed_add+0xa1/0x310 net/dccp/ackvec.c:352
  ccid2_hc_tx_parse_options+0x9a/0xb0 net/dccp/ccids/ccid2.c:510
  ccid_hc_tx_parse_options net/dccp/ccid.h:207 [inline]
  dccp_parse_options+0x658/0x11f0 net/dccp/options.c:233
  dccp_rcv_established+0x44/0xb0 net/dccp/input.c:374
  dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654
  sk_backlog_rcv include/net/sock.h:909 [inline]
  __release_sock+0x12f/0x3a0 net/core/sock.c:2335
  release_sock+0xa4/0x2b0 net/core/sock.c:2850
  dccp_sendmsg+0x771/0x1020 net/dccp/proto.c:820
  inet_sendmsg+0x19f/0x690 net/ipv4/af_inet.c:798
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xd5/0x120 net/socket.c:639
  ___sys_sendmsg+0x525/0x940 net/socket.c:2117
  __sys_sendmmsg+0x240/0x6f0 net/socket.c:2212
  __do_sys_sendmmsg net/socket.c:2241 [inline]
  __se_sys_sendmmsg net/socket.c:2238 [inline]
  __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2238
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441819
RSP: 002b:00007ffdb9a9df08 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441819
RDX: 04000000000001e6 RSI: 0000000020000c00 RDI: 0000000000000005
RBP: 00007ffdb9a9df20 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 04000000000001e6 R14: 0000000000000006 R15: 0000000000000000
CPU: 0 PID: 4542 Comm: syz-executor940 Not tainted 4.17.0-rc6+ #66
dccp_parse_options: DCCP(        (ptrval)): Option 38 (len=1) error=5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
  __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
  ccid2_hc_tx_packet_recv+0x2383/0x275e net/dccp/ccids/ccid2.c:597
  ccid_hc_tx_packet_recv net/dccp/ccid.h:192 [inline]
  dccp_deliver_input_to_ccids+0x203/0x280 net/dccp/input.c:186
  dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378
  dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654
  sk_backlog_rcv include/net/sock.h:909 [inline]
  __release_sock+0x12f/0x3a0 net/core/sock.c:2335
  release_sock+0xa4/0x2b0 net/core/sock.c:2850
  dccp_sendmsg+0x771/0x1020 net/dccp/proto.c:820
  inet_sendmsg+0x19f/0x690 net/ipv4/af_inet.c:798
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xd5/0x120 net/socket.c:639
  ___sys_sendmsg+0x525/0x940 net/socket.c:2117
  __sys_sendmmsg+0x240/0x6f0 net/socket.c:2212
  __do_sys_sendmmsg net/socket.c:2241 [inline]
  __se_sys_sendmmsg net/socket.c:2238 [inline]
  __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2238
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441819
RSP: 002b:00007ffdb9a9df08 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441819
RDX: 04000000000001e6 RSI: 0000000020000c00 RDI: 0000000000000005
RBP: 00007ffdb9a9df20 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 04000000000001e6 R14: 0000000000000006 R15: 0000000000000000

Allocated by task 4542:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
  __do_kmalloc_node mm/slab.c:3682 [inline]
  __kmalloc_node_track_caller+0x47/0x70 mm/slab.c:3696
  __kmalloc_reserve.isra.38+0x3a/0xe0 net/core/skbuff.c:137
  __alloc_skb+0x14d/0x780 net/core/skbuff.c:205
  alloc_skb include/linux/skbuff.h:987 [inline]
  dccp_send_ack+0xd2/0x340 net/dccp/output.c:580
  ccid2_hc_rx_packet_recv+0x139/0x1b0 net/dccp/ccids/ccid2.c:776
  ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline]
  dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180
  dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378
  dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654
  sk_backlog_rcv include/net/sock.h:909 [inline]
  __sk_receive_skb+0x3a2/0xd60 net/core/sock.c:513
  dccp_v4_rcv+0x10e5/0x1f3f net/dccp/ipv4.c:875
  ip_local_deliver_finish+0x2e3/0xd80 net/ipv4/ip_input.c:215
  NF_HOOK include/linux/netfilter.h:288 [inline]
  ip_local_deliver+0x1e1/0x720 net/ipv4/ip_input.c:256
  dst_input include/net/dst.h:450 [inline]
  ip_rcv_finish+0x81b/0x2200 net/ipv4/ip_input.c:396
  NF_HOOK include/linux/netfilter.h:288 [inline]
  ip_rcv+0xb70/0x143d net/ipv4/ip_input.c:492
  __netif_receive_skb_core+0x26f5/0x3630 net/core/dev.c:4592
  __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4657
  process_backlog+0x219/0x760 net/core/dev.c:5337
  napi_poll net/core/dev.c:5735 [inline]
  net_rx_action+0x7b7/0x1930 net/core/dev.c:5801
  __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285

Freed by task 4542:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
  __cache_free mm/slab.c:3498 [inline]
  kfree+0xd9/0x260 mm/slab.c:3813
  skb_free_head+0x99/0xc0 net/core/skbuff.c:550
  skb_release_data+0x690/0x860 net/core/skbuff.c:570
  skb_release_all+0x4a/0x60 net/core/skbuff.c:627
  __kfree_skb net/core/skbuff.c:641 [inline]
  kfree_skb+0x195/0x560 net/core/skbuff.c:659
  dccp_v4_do_rcv+0x12b/0x180 net/dccp/ipv4.c:689
  sk_backlog_rcv include/net/sock.h:909 [inline]
  __release_sock+0x12f/0x3a0 net/core/sock.c:2335
  release_sock+0xa4/0x2b0 net/core/sock.c:2850
  dccp_sendmsg+0x771/0x1020 net/dccp/proto.c:820
  inet_sendmsg+0x19f/0x690 net/ipv4/af_inet.c:798
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xd5/0x120 net/socket.c:639
  ___sys_sendmsg+0x525/0x940 net/socket.c:2117
  __sys_sendmmsg+0x240/0x6f0 net/socket.c:2212
  __do_sys_sendmmsg net/socket.c:2241 [inline]
  __se_sys_sendmmsg net/socket.c:2238 [inline]
  __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2238
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8801ba490d40
  which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 1154 bytes inside of
  2048-byte region [ffff8801ba490d40, ffff8801ba491540)
The buggy address belongs to the page:
page:ffffea0006e92400 count:1 mapcount:0 mapping:ffff8801ba4904c0 index:0x0  
compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffff8801ba4904c0 0000000000000000 0000000100000003
raw: ffffea0006ed9a20 ffffea0006fa5fa0 ffff8801da800c40 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8801ba491080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8801ba491100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801ba491180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                            ^
  ffff8801ba491200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8801ba491280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

^ permalink raw reply

* Re: [PATCH net] tun: Fix NULL pointer dereference in XDP redirect
From: Toshiaki Makita @ 2018-05-25 13:43 UTC (permalink / raw)
  To: Jason Wang, Toshiaki Makita, David S. Miller; +Cc: netdev
In-Reply-To: <491862dc-b6d2-854d-4d5f-8dfe34da882d@redhat.com>

On 18/05/25 (金) 18:59, Jason Wang wrote:
> On 2018年05月25日 12:32, Toshiaki Makita wrote:
>> Calling XDP redirection requires preempt/bh disabled. Especially softirq
>> can call another XDP function and redirection functions, then percpu
>> value ri->map can be overwritten to NULL.
>>
>> This is a generic XDP case called from tun.
>>
>> [ 3535.736058] BUG: unable to handle kernel NULL pointer dereference 
>> at 0000000000000018
>> [ 3535.743974] PGD 0 P4D 0
>> [ 3535.746530] Oops: 0000 [#1] SMP PTI
>> [ 3535.750049] Modules linked in: vhost_net vhost tap tun bridge stp 
>> llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
>> sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm 
>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif 
>> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses 
>> aesni_intel crypto_simd cryptd enclosure hpwdt hpilo glue_helper 
>> ipmi_si pcspkr wmi mei_me ioatdma mei ipmi_devintf shpchp dca 
>> ipmi_msghandler lpc_ich acpi_power_meter sch_fq_codel ip_tables xfs 
>> libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea 
>> sysfillrect sysimgblt fb_sys_fops ttm drm smartpqi i40e crc32c_intel 
>> scsi_transport_sas tg3 i2c_core ptp pps_core
>> [ 3535.813456] CPU: 5 PID: 1630 Comm: vhost-1614 Not tainted 
>> 4.17.0-rc4 #2
>> [ 3535.820127] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 
>> Gen10, BIOS U32 11/14/2017
>> [ 3535.828732] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
>> [ 3535.833740] RSP: 0018:ffffb4bc47bf7c58 EFLAGS: 00010246
>> [ 3535.839009] RAX: ffff9fdfcfea1c40 RBX: 0000000000000000 RCX: 
>> ffff9fdf27fe3100
>> [ 3535.846205] RDX: ffff9fdfca769200 RSI: 0000000000000000 RDI: 
>> 0000000000000000
>> [ 3535.853402] RBP: ffffb4bc491d9000 R08: 00000000000045ad R09: 
>> 0000000000000ec0
>> [ 3535.860597] R10: 0000000000000001 R11: ffff9fdf26c3ce4e R12: 
>> ffff9fdf9e72c000
>> [ 3535.867794] R13: 0000000000000000 R14: fffffffffffffff2 R15: 
>> ffff9fdfc82cdd00
>> [ 3535.874990] FS:  0000000000000000(0000) GS:ffff9fdfcfe80000(0000) 
>> knlGS:0000000000000000
>> [ 3535.883152] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 3535.888948] CR2: 0000000000000018 CR3: 0000000bde724004 CR4: 
>> 00000000007626e0
>> [ 3535.896145] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>> 0000000000000000
>> [ 3535.903342] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>> 0000000000000400
>> [ 3535.910538] PKRU: 55555554
>> [ 3535.913267] Call Trace:
>> [ 3535.915736]  xdp_do_generic_redirect+0x7a/0x310
>> [ 3535.920310]  do_xdp_generic.part.117+0x285/0x370
>> [ 3535.924970]  tun_get_user+0x5b9/0x1260 [tun]
>> [ 3535.929279]  tun_sendmsg+0x52/0x70 [tun]
>> [ 3535.933237]  handle_tx+0x2ad/0x5f0 [vhost_net]
>> [ 3535.937721]  vhost_worker+0xa5/0x100 [vhost]
>> [ 3535.942030]  kthread+0xf5/0x130
>> [ 3535.945198]  ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
>> [ 3535.950031]  ? kthread_bind+0x10/0x10
>> [ 3535.953727]  ret_from_fork+0x35/0x40
>> [ 3535.957334] Code: 0e 74 15 83 f8 10 75 05 e9 49 aa b3 ff f3 c3 0f 
>> 1f 80 00 00 00 00 f3 c3 e9 29 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 
>> 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 49 a9 b3 ff 31 
>> c0 c3
>> [ 3535.976387] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffffb4bc47bf7c58
>> [ 3535.982883] CR2: 0000000000000018
>> [ 3535.987096] ---[ end trace 383b299dd1430240 ]---
>> [ 3536.131325] Kernel panic - not syncing: Fatal exception
>> [ 3536.137484] Kernel Offset: 0x26a00000 from 0xffffffff81000000 
>> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 3536.281406] ---[ end Kernel panic - not syncing: Fatal exception ]---
>>
>> And a kernel with generic case fixed still panics in tun driver XDP
>> redirect, because it did not disable bh.
>>
>> [ 2055.128746] BUG: unable to handle kernel NULL pointer dereference 
>> at 0000000000000018
>> [ 2055.136662] PGD 0 P4D 0
>> [ 2055.139219] Oops: 0000 [#1] SMP PTI
>> [ 2055.142736] Modules linked in: vhost_net vhost tap tun bridge stp 
>> llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
>> sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm 
>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
>> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel 
>> ipmi_ssif crypto_simd enclosure cryptd hpwdt glue_helper ioatdma hpilo 
>> wmi dca pcspkr ipmi_si acpi_power_meter ipmi_devintf shpchp mei_me 
>> ipmi_msghandler mei lpc_ich sch_fq_codel ip_tables xfs libcrc32c 
>> sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect 
>> sysimgblt fb_sys_fops ttm drm i40e smartpqi tg3 scsi_transport_sas 
>> crc32c_intel i2c_core ptp pps_core
>> [ 2055.206142] CPU: 6 PID: 1693 Comm: vhost-1683 Tainted: G        
>> W         4.17.0-rc5-fix-tun+ #1
>> [ 2055.215011] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 
>> Gen10, BIOS U32 11/14/2017
>> [ 2055.223617] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
>> [ 2055.228624] RSP: 0018:ffff998b07607cc0 EFLAGS: 00010246
>> [ 2055.233892] RAX: ffff8dbd8e235700 RBX: ffff8dbd8ff21c40 RCX: 
>> 0000000000000004
>> [ 2055.241089] RDX: ffff998b097a9000 RSI: 0000000000000000 RDI: 
>> 0000000000000000
>> [ 2055.248286] RBP: 0000000000000000 R08: 00000000000065a8 R09: 
>> 0000000000005d80
>> [ 2055.255483] R10: 0000000000000040 R11: ffff8dbcf0100000 R12: 
>> ffff998b097a9000
>> [ 2055.262681] R13: ffff8dbd8c98c000 R14: 0000000000000000 R15: 
>> ffff998b07607d78
>> [ 2055.269879] FS:  0000000000000000(0000) GS:ffff8dbd8ff00000(0000) 
>> knlGS:0000000000000000
>> [ 2055.278039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 2055.283834] CR2: 0000000000000018 CR3: 0000000c0c8cc005 CR4: 
>> 00000000007626e0
>> [ 2055.291030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>> 0000000000000000
>> [ 2055.298227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>> 0000000000000400
>> [ 2055.305424] PKRU: 55555554
>> [ 2055.308153] Call Trace:
>> [ 2055.310624]  xdp_do_redirect+0x7b/0x380
>> [ 2055.314499]  tun_get_user+0x10fe/0x12a0 [tun]
>> [ 2055.318895]  tun_sendmsg+0x52/0x70 [tun]
>> [ 2055.322852]  handle_tx+0x2ad/0x5f0 [vhost_net]
>> [ 2055.327337]  vhost_worker+0xa5/0x100 [vhost]
>> [ 2055.331646]  kthread+0xf5/0x130
>> [ 2055.334813]  ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
>> [ 2055.339646]  ? kthread_bind+0x10/0x10
>> [ 2055.343343]  ret_from_fork+0x35/0x40
>> [ 2055.346950] Code: 0e 74 15 83 f8 10 75 05 e9 e9 aa b3 ff f3 c3 0f 
>> 1f 80 00 00 00 00 f3 c3 e9 c9 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 
>> 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 e9 a9 b3 ff 31 
>> c0 c3
>> [ 2055.366004] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffff998b07607cc0
>> [ 2055.372500] CR2: 0000000000000018
>> [ 2055.375856] ---[ end trace 2a2dcc5e9e174268 ]---
>> [ 2055.523626] Kernel panic - not syncing: Fatal exception
>> [ 2055.529796] Kernel Offset: 0x2e000000 from 0xffffffff81000000 
>> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 2055.677539] ---[ end Kernel panic - not syncing: Fatal exception ]---
>>
>> Fixes: 761876c857cb ("tap: XDP support")
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>> ---
>>   drivers/net/tun.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 45d8077..4fc7dbf 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1650,6 +1650,7 @@ static struct sk_buff *tun_build_skb(struct 
>> tun_struct *tun,
>>       else
>>           *skb_xdp = 0;
>> +    local_bh_disable();
>>       preempt_disable();
>>       rcu_read_lock();
>>       xdp_prog = rcu_dereference(tun->xdp_prog);
>> @@ -1676,6 +1677,7 @@ static struct sk_buff *tun_build_skb(struct 
>> tun_struct *tun,
>>                   goto err_redirect;
>>               rcu_read_unlock();
>>               preempt_enable();
>> +            local_bh_enable();
>>               return NULL;
>>           case XDP_TX:
>>               get_page(alloc_frag->page);
>> @@ -1685,6 +1687,7 @@ static struct sk_buff *tun_build_skb(struct 
>> tun_struct *tun,
>>               tun_xdp_flush(tun->dev);
>>               rcu_read_unlock();
>>               preempt_enable();
>> +            local_bh_enable();
>>               return NULL;
>>           case XDP_PASS:
>>               delta = orig_data - xdp.data;
>> @@ -1704,6 +1707,7 @@ static struct sk_buff *tun_build_skb(struct 
>> tun_struct *tun,
>>       if (!skb) {
>>           rcu_read_unlock();
>>           preempt_enable();
>> +        local_bh_enable();
>>           return ERR_PTR(-ENOMEM);
>>       }
>> @@ -1714,6 +1718,7 @@ static struct sk_buff *tun_build_skb(struct 
>> tun_struct *tun,
>>       rcu_read_unlock();
>>       preempt_enable();
>> +    local_bh_enable();
>>       return skb;
>> @@ -1722,6 +1727,7 @@ static struct sk_buff *tun_build_skb(struct 
>> tun_struct *tun,
>>   err_xdp:
>>       rcu_read_unlock();
>>       preempt_enable();
>> +    local_bh_enable();
>>       this_cpu_inc(tun->pcpu_stats->rx_dropped);
>>       return NULL;
>>   }
>> @@ -1917,16 +1923,22 @@ static ssize_t tun_get_user(struct tun_struct 
>> *tun, struct tun_file *tfile,
>>           struct bpf_prog *xdp_prog;
>>           int ret;
>> +        local_bh_disable();
>> +        preempt_disable();
>>           rcu_read_lock();
>>           xdp_prog = rcu_dereference(tun->xdp_prog);
>>           if (xdp_prog) {
>>               ret = do_xdp_generic(xdp_prog, skb);
>>               if (ret != XDP_PASS) {
>>                   rcu_read_unlock();
>> +                preempt_enable();
>> +                local_bh_enable();
>>                   return total_len;
>>               }
>>           }
>>           rcu_read_unlock();
>> +        preempt_enable();
>> +        local_bh_enable();
>>       }
>>       rcu_read_lock();
> 
> Good catch, thanks.
> 
> But I think we can just replace preempt_disable()/enable() with 
> local_bh_disable()/local_bh_enable() ?

I actually thought the same, but noticed this patch.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ea4c380066fbe

It looks like they do not think local_bh_disable() implies 
preempt_disable(). But I'm not sure why..

Toshiaki Makita

^ permalink raw reply

* Re: [PATCH] ath6kl: mark expected switch fall-throughs
From: Kalle Valo @ 2018-05-25 13:30 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Gustavo A. R. Silva, David S. Miller, linux-wireless, netdev,
	linux-kernel
In-Reply-To: <060f93a0-7210-79b6-37a7-cd8900719d40@cogentembedded.com>

Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> writes:

> On 5/25/2018 2:13 AM, Gustavo A. R. Silva wrote:
>
>> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
>> where we are expecting to fall through.
>>
>> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
>> ---
>>   drivers/net/wireless/ath/ath6kl/cfg80211.c | 6 +++---
>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath6kl/cfg80211.c b/drivers/net/wireless/ath/ath6kl/cfg80211.c
>> index 2ba8cf3..29e32cd 100644
>> --- a/drivers/net/wireless/ath/ath6kl/cfg80211.c
>> +++ b/drivers/net/wireless/ath/ath6kl/cfg80211.c
>> @@ -3898,17 +3898,17 @@ int ath6kl_cfg80211_init(struct ath6kl *ar)
>>   	wiphy->max_scan_ie_len = 1000; /* FIX: what is correct limit? */
>>   	switch (ar->hw.cap) {
>>   	case WMI_11AN_CAP:
>> -		ht = true;
>> +		ht = true; /* fall through */
>>   	case WMI_11A_CAP:
>>   		band_5gig = true;
>>   		break;
>>   	case WMI_11GN_CAP:
>> -		ht = true;
>> +		ht = true; /* fall through */
>>   	case WMI_11G_CAP:
>>   		band_2gig = true;
>>   		break;
>>   	case WMI_11AGN_CAP:
>> -		ht = true;
>> +		ht = true; /* fall through */
>>   	case WMI_11AG_CAP:
>>   		band_2gig = true;
>>   		band_5gig = true;
>
>    Hm, typically such comments are done on a line of their own, have
> never seen this style...

Yeah, I was wondering the same. Was there a particular reason for this?

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH] net: netsec: reduce DMA mask to 40 bits
From: Robin Murphy @ 2018-05-25 13:08 UTC (permalink / raw)
  To: Ard Biesheuvel, netdev
  Cc: davem, Jassi Brar, Masahisa Kojima, Ilias Apalodimas
In-Reply-To: <20180525125037.779-1-ard.biesheuvel@linaro.org>

On 25/05/18 13:50, Ard Biesheuvel wrote:
> The netsec network controller IP can drive 64 address bits for DMA, and
> the DMA mask is set accordingly in the driver. However, the SynQuacer
> SoC, which is the only silicon incorporating this IP at the moment,
> integrates this IP in a manner that leaves address bits [63:40]
> unconnected.
> 
> Up until now, this has not resulted in any problems, given that the DDR
> controller doesn't decode those bits to begin with. However, recent
> firmware updates for platforms incorporating this SoC allow the IOMMU
> to be enabled, which does decode address bits [47:40], and allocates
> top down from the IOVA space, producing DMA addresses that have bits
> set that have been left unconnected.
> 
> Both the DT and ACPI (IORT) descriptions of the platform take this into
> account, and only describe a DMA address space of 40 bits (using either
> dma-ranges DT properties, or DMA address limits in IORT named component
> nodes). However, even though our IOMMU and bus layers may take such
> limitations into account by setting a narrower DMA mask when creating
> the platform device, the netsec probe() entrypoint follows the common
> practice of setting the DMA mask uncondionally, according to the
> capabilities of the IP block itself rather than to its integration into
> the chip.
> 
> It is currently unclear what the correct fix is here. We could hack around
> it by only setting the DMA mask if it deviates from its default value of
> DMA_BIT_MASK(32). However, this makes it impossible for the bus layer to
> use DMA_BIT_MASK(32) as the bus limit, and so it appears that a more
> comprehensive approach is required to take DMA limits imposed by the
> SoC as a whole into account.
> 
> In the mean time, let's limit the DMA mask to 40 bits. Given that there
> is currently only one SoC that incorporates this IP, this is a reasonable
> approach that can be backported to -stable and buys us some time to come
> up with a proper fix going forward.

It's a little bit of a dodge, but until another SoC comes along with 
different requirements I agree it's the reasonable thing to do.

Reviewed-by: Robin Murphy <robin.murphy@arm.com>

> Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Jassi Brar <jaswinder.singh@linaro.org>
> Cc: Masahisa Kojima <masahisa.kojima@linaro.org>
> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
> Please cc to -stable (v4.16+)
> 
>   drivers/net/ethernet/socionext/netsec.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
> index f4c0b02ddad8..59fbf74dcada 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -1674,8 +1674,8 @@ static int netsec_probe(struct platform_device *pdev)
>   	if (ret)
>   		goto unreg_napi;
>   
> -	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
> -		dev_warn(&pdev->dev, "Failed to enable 64-bit DMA\n");
> +	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(40)))
> +		dev_warn(&pdev->dev, "Failed to set DMA mask\n");
>   
>   	ret = register_netdev(ndev);
>   	if (ret) {
> 

^ permalink raw reply

* [PATCH net-next] qmi_wwan: apply SET_DTR quirk to the SIMCOM shared device ID
From: Bjørn Mork @ 2018-05-25 13:00 UTC (permalink / raw)
  To: netdev; +Cc: linux-usb, Bjørn Mork, Reinhard Speyerer

SIMCOM are reusing a single device ID for many (all of their?)
different modems, based on different chipsets and firmwares. Newer
Qualcomm chipset generations require setting DTR to wake the QMI
function.  The SIM7600E modem is using such a chipset, making it
fail to work with this driver despite the device ID match.

Fix by unconditionally enabling the SET_DTR quirk for all SIMCOM
modems using this specific device ID.  This is similar to what
we already have done for another case of device IDs recycled over
multiple chipset generations: 14cf4a771b30 ("drivers: net: usb:
qmi_wwan: add QMI_QUIRK_SET_DTR for Telit PID 0x1201")

Initial testing on an older SIM7100 modem shows no immediate side
effects.

Reported-by: Sebastian Sjoholm <sebastian.sjoholm@gmail.com>
Cc: Reinhard Speyerer <rspmn@arcor.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
I propose this for net-next for now to get some extra testing
exposure before going into stable.  I'll send a separate request
for stable backport as soon as I'm satisfied that there are no
hidden issues with some specific modem.


Bjørn 

 drivers/net/usb/qmi_wwan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 42565dd33aa6..148e78f8b48c 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1248,7 +1248,7 @@ static const struct usb_device_id products[] = {
 	{QMI_FIXED_INTF(0x03f0, 0x4e1d, 8)},	/* HP lt4111 LTE/EV-DO/HSPA+ Gobi 4G Module */
 	{QMI_FIXED_INTF(0x03f0, 0x9d1d, 1)},	/* HP lt4120 Snapdragon X5 LTE */
 	{QMI_FIXED_INTF(0x22de, 0x9061, 3)},	/* WeTelecom WPD-600N */
-	{QMI_FIXED_INTF(0x1e0e, 0x9001, 5)},	/* SIMCom 7230E */
+	{QMI_QUIRK_SET_DTR(0x1e0e, 0x9001, 5)},	/* SIMCom 7100E, 7230E, 7600E ++ */
 	{QMI_QUIRK_SET_DTR(0x2c7c, 0x0125, 4)},	/* Quectel EC25, EC20 R2.0  Mini PCIe */
 	{QMI_QUIRK_SET_DTR(0x2c7c, 0x0121, 4)},	/* Quectel EC21 Mini PCIe */
 	{QMI_FIXED_INTF(0x2c7c, 0x0296, 4)},	/* Quectel BG96 */
-- 
2.11.0

^ permalink raw reply related

* [PATCH] net: netsec: reduce DMA mask to 40 bits
From: Ard Biesheuvel @ 2018-05-25 12:50 UTC (permalink / raw)
  To: netdev
  Cc: davem, Ard Biesheuvel, Robin Murphy, Jassi Brar, Masahisa Kojima,
	Ilias Apalodimas

The netsec network controller IP can drive 64 address bits for DMA, and
the DMA mask is set accordingly in the driver. However, the SynQuacer
SoC, which is the only silicon incorporating this IP at the moment,
integrates this IP in a manner that leaves address bits [63:40]
unconnected.

Up until now, this has not resulted in any problems, given that the DDR
controller doesn't decode those bits to begin with. However, recent
firmware updates for platforms incorporating this SoC allow the IOMMU
to be enabled, which does decode address bits [47:40], and allocates
top down from the IOVA space, producing DMA addresses that have bits
set that have been left unconnected.

Both the DT and ACPI (IORT) descriptions of the platform take this into
account, and only describe a DMA address space of 40 bits (using either
dma-ranges DT properties, or DMA address limits in IORT named component
nodes). However, even though our IOMMU and bus layers may take such
limitations into account by setting a narrower DMA mask when creating
the platform device, the netsec probe() entrypoint follows the common
practice of setting the DMA mask uncondionally, according to the
capabilities of the IP block itself rather than to its integration into
the chip.

It is currently unclear what the correct fix is here. We could hack around
it by only setting the DMA mask if it deviates from its default value of
DMA_BIT_MASK(32). However, this makes it impossible for the bus layer to
use DMA_BIT_MASK(32) as the bus limit, and so it appears that a more
comprehensive approach is required to take DMA limits imposed by the
SoC as a whole into account.

In the mean time, let's limit the DMA mask to 40 bits. Given that there
is currently only one SoC that incorporates this IP, this is a reasonable
approach that can be backported to -stable and buys us some time to come
up with a proper fix going forward.

Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Jassi Brar <jaswinder.singh@linaro.org>
Cc: Masahisa Kojima <masahisa.kojima@linaro.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
Please cc to -stable (v4.16+)

 drivers/net/ethernet/socionext/netsec.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index f4c0b02ddad8..59fbf74dcada 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -1674,8 +1674,8 @@ static int netsec_probe(struct platform_device *pdev)
 	if (ret)
 		goto unreg_napi;
 
-	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
-		dev_warn(&pdev->dev, "Failed to enable 64-bit DMA\n");
+	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(40)))
+		dev_warn(&pdev->dev, "Failed to set DMA mask\n");
 
 	ret = register_netdev(ndev);
 	if (ret) {
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH] ath10k: transmit queued frames after waking queues
From: Bob Copeland @ 2018-05-25 12:50 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Adrian Chadd, Kalle Valo, David Miller, ath10k, linux-wireless,
	netdev, Linux Kernel Mailing List
In-Reply-To: <20180525123656.GB8780@localhost.localdomain>

On Fri, May 25, 2018 at 02:36:56PM +0200, Niklas Cassel wrote:
> A spin lock does have the advantage of ordering: memory operations issued
> before the spin_unlock_bh() will be completed before the spin_unlock_bh()
> operation has completed.
> 
> However, ath10k_htt_tx_dec_pending() was called earlier in the same function,
> which decreases htt->num_pending_tx, so that write will be completed before
> our read. That is the only ordering we care about here (if we should call
> ath10k_mac_tx_push_pending() or not).

Sure.  I also understand that reading inside a lock and operating on the
value outside the lock isn't really the definition of synchronization
(doesn't really matter in this case though).

I was just suggesting that the implicit memory barrier in the spin unlock
that we are already paying for would be sufficient here too, and it matches
the semantic of "tx fields under tx_lock."  On the other hand, maybe it's
just me, but I tend to look askance at just-in-case READ_ONCEs sprinkled
about.

-- 
Bob Copeland %% https://bobcopeland.com/

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox