Netdev List
 help / color / mirror / Atom feed
* [PATCH 6/7] net: ethernet: ti: cpsw: add support for descs_pool_size dt property
From: Grygorii Strashko @ 2016-12-01 23:34 UTC (permalink / raw)
  To: David S. Miller, netdev, Mugunthan V N
  Cc: Sekhar Nori, linux-kernel, linux-omap, Ivan Khoronzhuk,
	Grygorii Strashko
In-Reply-To: <20161201233432.6182-1-grygorii.strashko@ti.com>

The CPSW CPDMA can process buffer descriptors placed as in internal
CPPI RAM as in DDR. This patch adds support in CPSW and CPDMA for
descs_pool_size DT property, which defines total number of CPDMA CPPI
descriptors to be used for both ingress/egress packets processing:
 - memory size required for CPDMA descriptor pool is calculated basing
on number of descriptors specified by user in descs_pool_size and
CPDMA descriptor size;
 - allocate CPDMA descriptor pool in DDR if pool memory size >
internal CPPI RAM or use internal CPPI RAM otherwise;
 - if descs_pool_size not specified in DT - the default value 256 will
be used which will allow to place CPDMA descriptors pool into the
internal CPPI RAM (current default behaviour);
 - CPDMA will ignore descs_pool_size if descs_pool_size = 0 for
backward comaptiobility with davinci_emac.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c          |  5 +++++
 drivers/net/ethernet/ti/cpsw.h          |  1 +
 drivers/net/ethernet/ti/davinci_cpdma.c | 12 ++++++++++++
 drivers/net/ethernet/ti/davinci_cpdma.h |  1 +
 4 files changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index dd5d830..a98c6260 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -145,6 +145,7 @@ do {								\
 		cpsw->data.active_slave)
 #define IRQ_NUM			2
 #define CPSW_MAX_QUEUES		8
+#define CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT 256
 
 static int debug_level;
 module_param(debug_level, int, 0);
@@ -2557,6 +2558,9 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 	if (of_property_read_bool(node, "dual_emac"))
 		data->dual_emac = 1;
 
+	data->descs_pool_size = CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT;
+	of_property_read_u32(node, "descs_pool_size", &data->descs_pool_size);
+
 	/*
 	 * Populate all the child nodes here...
 	 */
@@ -2967,6 +2971,7 @@ static int cpsw_probe(struct platform_device *pdev)
 	dma_params.has_ext_regs		= true;
 	dma_params.desc_hw_addr         = dma_params.desc_mem_phys;
 	dma_params.bus_freq_mhz		= cpsw->bus_freq_mhz;
+	dma_params.descs_pool_size	= cpsw->data.descs_pool_size;
 
 	cpsw->dma = cpdma_ctlr_create(&dma_params);
 	if (!cpsw->dma) {
diff --git a/drivers/net/ethernet/ti/cpsw.h b/drivers/net/ethernet/ti/cpsw.h
index 16b54c6..8835d79 100644
--- a/drivers/net/ethernet/ti/cpsw.h
+++ b/drivers/net/ethernet/ti/cpsw.h
@@ -38,6 +38,7 @@ struct cpsw_platform_data {
 	u32	mac_control;	/* Mac control register */
 	u16	default_vlan;	/* Def VLAN for ALE lookup in VLAN aware mode*/
 	bool	dual_emac;	/* Enable Dual EMAC mode */
+	u32	descs_pool_size;	/* Number of Rx/Tx Descriptios */
 };
 
 void cpsw_phy_sel(struct device *dev, phy_interface_t phy_mode, int slave);
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index ba892bb..f45bb8a 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -219,6 +219,18 @@ int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
 				cpdma_params->desc_align);
 	pool->num_desc	= pool->mem_size / pool->desc_size;
 
+	if (cpdma_params->descs_pool_size) {
+		/* recalculate memory size required cpdma descriptor pool
+		 * basing on number of descriptors specified by user and
+		 * if memory size > CPPI internal RAM size (desc_mem_size)
+		 * then switch to use DDR
+		 */
+		pool->num_desc = cpdma_params->descs_pool_size;
+		pool->mem_size = pool->desc_size * pool->num_desc;
+		if (pool->mem_size > cpdma_params->desc_mem_size)
+			cpdma_params->desc_mem_phys = 0;
+	}
+
 	pool->gen_pool = devm_gen_pool_create(ctlr->dev, ilog2(pool->desc_size),
 					      -1, "cpdma");
 	if (IS_ERR(pool->gen_pool)) {
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.h b/drivers/net/ethernet/ti/davinci_cpdma.h
index 4a167db..cb45f8f 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.h
+++ b/drivers/net/ethernet/ti/davinci_cpdma.h
@@ -37,6 +37,7 @@ struct cpdma_params {
 	int			desc_mem_size;
 	int			desc_align;
 	u32			bus_freq_mhz;
+	u32			descs_pool_size;
 
 	/*
 	 * Some instances of embedded cpdma controllers have extra control and
-- 
2.10.1

^ permalink raw reply related

* [PATCH 5/7] Documentation: DT: net: cpsw: allow to specify descriptors pool size
From: Grygorii Strashko @ 2016-12-01 23:34 UTC (permalink / raw)
  To: David S. Miller, netdev, Mugunthan V N
  Cc: Sekhar Nori, linux-kernel, linux-omap, Ivan Khoronzhuk,
	Grygorii Strashko
In-Reply-To: <20161201233432.6182-1-grygorii.strashko@ti.com>

Add optional property "descs_pool_size" to specify buffer descriptor's
pool size. The "descs_pool_size" should define total number of CPDMA
CPPI descriptors to be used for both ingress/egress packets
processing. If not specified - the default value 256 will be used
which will allow to place descriptor's pool into the internal CPPI
RAM on most of TI SoC.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 Documentation/devicetree/bindings/net/cpsw.txt | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
index 5ad439f..b99d196 100644
--- a/Documentation/devicetree/bindings/net/cpsw.txt
+++ b/Documentation/devicetree/bindings/net/cpsw.txt
@@ -35,6 +35,11 @@ Optional properties:
 			  For example in dra72x-evm, pcf gpio has to be
 			  driven low so that cpsw slave 0 and phy data
 			  lines are connected via mux.
+- descs_pool_size	: total number of CPDMA CPPI descriptors to be used for
+			  both ingress/egress packets processing. if not
+			  specified the default value 256 will be used which
+			  will allow to place descriptors pool into the
+			  internal CPPI RAM.
 
 
 Slave Properties:
-- 
2.10.1

^ permalink raw reply related

* [PATCH 4/7] net: ethernet: ti: cpdma: use devm_ioremap
From: Grygorii Strashko @ 2016-12-01 23:34 UTC (permalink / raw)
  To: David S. Miller, netdev, Mugunthan V N
  Cc: Sekhar Nori, linux-kernel, linux-omap, Ivan Khoronzhuk,
	Grygorii Strashko
In-Reply-To: <20161201233432.6182-1-grygorii.strashko@ti.com>

Use devm_ioremap() and simplify the code.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index db0a7fd..ba892bb 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -195,8 +195,6 @@ static void cpdma_desc_pool_destroy(struct cpdma_ctlr *ctlr)
 	if (pool->cpumap)
 		dma_free_coherent(ctlr->dev, pool->mem_size, pool->cpumap,
 				  pool->phys);
-	else
-		iounmap(pool->iomap);
 }
 
 /*
@@ -231,7 +229,8 @@ int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
 
 	if (cpdma_params->desc_mem_phys) {
 		pool->phys  = cpdma_params->desc_mem_phys;
-		pool->iomap = ioremap(pool->phys, pool->mem_size);
+		pool->iomap = devm_ioremap(ctlr->dev, pool->phys,
+					   pool->mem_size);
 		pool->hw_addr = cpdma_params->desc_hw_addr;
 	} else {
 		pool->cpumap = dma_alloc_coherent(ctlr->dev,  pool->mem_size,
-- 
2.10.1

^ permalink raw reply related

* [PATCH 3/7] net: ethernet: ti: cpdma: minimize number of parameters in cpdma_desc_pool_create/destroy()
From: Grygorii Strashko @ 2016-12-01 23:34 UTC (permalink / raw)
  To: David S. Miller, netdev, Mugunthan V N
  Cc: Sekhar Nori, linux-kernel, linux-omap, Ivan Khoronzhuk,
	Grygorii Strashko
In-Reply-To: <20161201233432.6182-1-grygorii.strashko@ti.com>

Update cpdma_desc_pool_create/destroy() to accept only one parameter
struct cpdma_ctlr*, as this structure contains all required
information for pool creation/destruction.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 66 ++++++++++++++++-----------------
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 379314f..db0a7fd 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -181,8 +181,10 @@ static struct cpdma_control_info controls[] = {
 				 (directed << CPDMA_TO_PORT_SHIFT));	\
 	} while (0)
 
-static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
+static void cpdma_desc_pool_destroy(struct cpdma_ctlr *ctlr)
 {
+	struct cpdma_desc_pool *pool = ctlr->pool;
+
 	if (!pool)
 		return;
 
@@ -191,7 +193,7 @@ static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
 	     gen_pool_size(pool->gen_pool),
 	     gen_pool_avail(pool->gen_pool));
 	if (pool->cpumap)
-		dma_free_coherent(pool->dev, pool->mem_size, pool->cpumap,
+		dma_free_coherent(ctlr->dev, pool->mem_size, pool->cpumap,
 				  pool->phys);
 	else
 		iounmap(pool->iomap);
@@ -203,57 +205,60 @@ static void cpdma_desc_pool_destroy(struct cpdma_desc_pool *pool)
  * devices (e.g. cpsw switches) use plain old memory.  Descriptor pools
  * abstract out these details
  */
-static struct cpdma_desc_pool *
-cpdma_desc_pool_create(struct device *dev, u32 phys, dma_addr_t hw_addr,
-				int size, int align)
+int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
 {
+	struct cpdma_params *cpdma_params = &ctlr->params;
 	struct cpdma_desc_pool *pool;
-	int ret;
+	int ret = 0;
 
-	pool = devm_kzalloc(dev, sizeof(*pool), GFP_KERNEL);
+	pool = devm_kzalloc(ctlr->dev, sizeof(*pool), GFP_KERNEL);
 	if (!pool)
 		goto gen_pool_create_fail;
+	ctlr->pool = pool;
 
-	pool->dev	= dev;
-	pool->mem_size	= size;
-	pool->desc_size	= ALIGN(sizeof(struct cpdma_desc), align);
-	pool->num_desc	= size / pool->desc_size;
+	pool->mem_size	= cpdma_params->desc_mem_size;
+	pool->desc_size	= ALIGN(sizeof(struct cpdma_desc),
+				cpdma_params->desc_align);
+	pool->num_desc	= pool->mem_size / pool->desc_size;
 
-	pool->gen_pool = devm_gen_pool_create(dev, ilog2(pool->desc_size), -1,
-					      "cpdma");
+	pool->gen_pool = devm_gen_pool_create(ctlr->dev, ilog2(pool->desc_size),
+					      -1, "cpdma");
 	if (IS_ERR(pool->gen_pool)) {
-		dev_err(dev, "pool create failed %ld\n",
-			PTR_ERR(pool->gen_pool));
+		ret = PTR_ERR(pool->gen_pool);
+		dev_err(ctlr->dev, "pool create failed %d\n", ret);
 		goto gen_pool_create_fail;
 	}
 
-	if (phys) {
-		pool->phys  = phys;
-		pool->iomap = ioremap(phys, size); /* should be memremap? */
-		pool->hw_addr = hw_addr;
+	if (cpdma_params->desc_mem_phys) {
+		pool->phys  = cpdma_params->desc_mem_phys;
+		pool->iomap = ioremap(pool->phys, pool->mem_size);
+		pool->hw_addr = cpdma_params->desc_hw_addr;
 	} else {
-		pool->cpumap = dma_alloc_coherent(dev, size, &pool->hw_addr,
-						  GFP_KERNEL);
+		pool->cpumap = dma_alloc_coherent(ctlr->dev,  pool->mem_size,
+						  &pool->hw_addr, GFP_KERNEL);
 		pool->iomap = (void __iomem __force *)pool->cpumap;
 		pool->phys = pool->hw_addr; /* assumes no IOMMU, don't use this value */
 	}
 
-	if (!pool->iomap)
+	if (!pool->iomap) {
+		ret = -ENOMEM;
 		goto gen_pool_create_fail;
+	}
 
 	ret = gen_pool_add_virt(pool->gen_pool, (unsigned long)pool->iomap,
 				pool->phys, pool->mem_size, -1);
 	if (ret < 0) {
-		dev_err(dev, "pool add failed %d\n", ret);
+		dev_err(ctlr->dev, "pool add failed %d\n", ret);
 		goto gen_pool_add_virt_fail;
 	}
 
-	return pool;
+	return 0;
 
 gen_pool_add_virt_fail:
-	cpdma_desc_pool_destroy(pool);
+	cpdma_desc_pool_destroy(ctlr);
 gen_pool_create_fail:
-	return NULL;
+	ctlr->pool = NULL;
+	return ret;
 }
 
 static inline dma_addr_t desc_phys(struct cpdma_desc_pool *pool,
@@ -502,12 +507,7 @@ struct cpdma_ctlr *cpdma_ctlr_create(struct cpdma_params *params)
 	ctlr->chan_num = 0;
 	spin_lock_init(&ctlr->lock);
 
-	ctlr->pool = cpdma_desc_pool_create(ctlr->dev,
-					    ctlr->params.desc_mem_phys,
-					    ctlr->params.desc_hw_addr,
-					    ctlr->params.desc_mem_size,
-					    ctlr->params.desc_align);
-	if (!ctlr->pool)
+	if (cpdma_desc_pool_create(ctlr))
 		return NULL;
 
 	if (WARN_ON(ctlr->num_chan > CPDMA_MAX_CHANNELS))
@@ -623,7 +623,7 @@ int cpdma_ctlr_destroy(struct cpdma_ctlr *ctlr)
 	for (i = 0; i < ARRAY_SIZE(ctlr->channels); i++)
 		cpdma_chan_destroy(ctlr->channels[i]);
 
-	cpdma_desc_pool_destroy(ctlr->pool);
+	cpdma_desc_pool_destroy(ctlr);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(cpdma_ctlr_destroy);
-- 
2.10.1

^ permalink raw reply related

* [PATCH 2/7] net: ethernet: ti: cpdma: fix desc re-queuing
From: Grygorii Strashko @ 2016-12-01 23:34 UTC (permalink / raw)
  To: David S. Miller, netdev, Mugunthan V N
  Cc: Sekhar Nori, linux-kernel, linux-omap, Ivan Khoronzhuk,
	Grygorii Strashko
In-Reply-To: <20161201233432.6182-1-grygorii.strashko@ti.com>

The currently processing cpdma descriptor with EOQ flag set may
contain two values in Next Descriptor Pointer field:
- valid pointer: means CPDMA missed addition of new desc in queue;
- null: no more descriptors in queue.
In the later case, it's not required to write to HDP register, but now
CPDMA does it.

Hence, add additional check for Next Descriptor Pointer != null in
cpdma_chan_process() function before writing in HDP register.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 0924014..379314f 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -1152,7 +1152,7 @@ static int __cpdma_chan_process(struct cpdma_chan *chan)
 	chan->count--;
 	chan->stats.good_dequeue++;
 
-	if (status & CPDMA_DESC_EOQ) {
+	if ((status & CPDMA_DESC_EOQ) && chan->head) {
 		chan->stats.requeue++;
 		chan_write(chan, hdp, desc_phys(pool, chan->head));
 	}
-- 
2.10.1

^ permalink raw reply related

* [PATCH 1/7] net: ethernet: ti: cpdma: am437x: allow descs to be plased in ddr
From: Grygorii Strashko @ 2016-12-01 23:34 UTC (permalink / raw)
  To: David S. Miller, netdev, Mugunthan V N
  Cc: Sekhar Nori, linux-kernel, linux-omap, Ivan Khoronzhuk,
	Grygorii Strashko
In-Reply-To: <20161201233432.6182-1-grygorii.strashko@ti.com>

It's observed that cpsw/cpdma is not working properly when CPPI
descriptors are placed in DDR instead of internal CPPI RAM on am437x
SoC:
- rx/tx silently stops processing packets;
- or - after boot it's working for sometime, but stuck once Network
load is increased (ping is working, but iperf is not).
(The same issue has not been reproduced on am335x and am57xx).

It seems that write to HDP register processed faster by interconnect
than writing of descriptor memory buffer in DDR, which is probably
caused by store buffer / write buffer differences as these function
are implemented differently across devices. So, to fix this i come up
with two changes:

1) all accesses to the channel register HDP/CP/RXFREE registers should
be done using sync IO accessors readl()/writel(), because all previous
memory writes writes have to be completed before starting channel
(write to HDP) or completing desc processing.

2) the change 1 only doesn't work on am437x and additional reading of
desc's field is required right after the new descriptor was filled
with data and before pointer on it will be stored in
prev_desc->hw_next field or HDP register.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c776e45..0924014 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -167,10 +167,10 @@ static struct cpdma_control_info controls[] = {
 
 /* various accessors */
 #define dma_reg_read(ctlr, ofs)		__raw_readl((ctlr)->dmaregs + (ofs))
-#define chan_read(chan, fld)		__raw_readl((chan)->fld)
+#define chan_read(chan, fld)		readl((chan)->fld)
 #define desc_read(desc, fld)		__raw_readl(&(desc)->fld)
 #define dma_reg_write(ctlr, ofs, v)	__raw_writel(v, (ctlr)->dmaregs + (ofs))
-#define chan_write(chan, fld, v)	__raw_writel(v, (chan)->fld)
+#define chan_write(chan, fld, v)	writel(v, (chan)->fld)
 #define desc_write(desc, fld, v)	__raw_writel((u32)(v), &(desc)->fld)
 
 #define cpdma_desc_to_port(chan, mode, directed)			\
@@ -1064,6 +1064,7 @@ int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
 	desc_write(desc, sw_token,  token);
 	desc_write(desc, sw_buffer, buffer);
 	desc_write(desc, sw_len,    len);
+	desc_read(desc, sw_len);
 
 	__cpdma_chan_submit(chan, desc);
 
-- 
2.10.1

^ permalink raw reply related

* [PATCH 0/7]  net: ethernet: ti: cpsw: support placing CPDMA descriptors into DDR
From: Grygorii Strashko @ 2016-12-01 23:34 UTC (permalink / raw)
  To: David S. Miller, netdev, Mugunthan V N
  Cc: Sekhar Nori, linux-kernel, linux-omap, Ivan Khoronzhuk,
	Grygorii Strashko

This series intended to add support for placing CPDMA descriptors into DDR by
introducing new DT property "descs_pool_size" to specify buffer descriptor's
pool size. The "descs_pool_size" defines total number of CPDMA
CPPI descriptors to be used for both ingress/egress packets
processing. If not specified - the default value 256 will be used
which will allow to place descriptor's pool into the internal CPPI
RAM.

This allows significantly to reduce UDP packets drop rate
for bandwidth >301 Mbits/sec (am57x).  

Before enabling this feature, the am437x SoC has to be fixed as it's proved
that it's not working when CPDMA descriptors placed in DDR.
So, the patch 1 fixes this issue.

Grygorii Strashko (7):
  net: ethernet: ti: cpdma: am437x: allow descs to be plased in ddr
  net: ethernet: ti: cpdma: fix desc re-queuing
  net: ethernet: ti: cpdma: minimize number of parameters in
    cpdma_desc_pool_create/destroy()
  net: ethernet: ti: cpdma: use devm_ioremap
  Documentation: DT: net: cpsw: allow to specify descriptors pool size
  net: ethernet: ti: cpsw: add support for descs_pool_size dt property
  Documentation: DT: net: cpsw: remove no_bd_ram property

 Documentation/devicetree/bindings/net/cpsw.txt |  8 ++-
 arch/arm/boot/dts/am33xx.dtsi                  |  1 -
 arch/arm/boot/dts/am4372.dtsi                  |  1 -
 arch/arm/boot/dts/dm814x.dtsi                  |  1 -
 arch/arm/boot/dts/dra7.dtsi                    |  1 -
 drivers/net/ethernet/ti/cpsw.c                 |  5 ++
 drivers/net/ethernet/ti/cpsw.h                 |  1 +
 drivers/net/ethernet/ti/davinci_cpdma.c        | 90 +++++++++++++++-----------
 drivers/net/ethernet/ti/davinci_cpdma.h        |  1 +
 9 files changed, 63 insertions(+), 46 deletions(-)

-- 
2.10.1

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 23:27 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161201231445.vtx7mjxuxusar3mu@splinter>

On 02.12.2016 00:14, Ido Schimmel wrote:

[...]

>> Basically, if you delete a node right now the kernel might simply do a
>> RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers
>> or synchronization with the reader side. Thus you might get a callback
>> from the notifier for a delete event on the one CPU and you end up
>> queueing this fib entry after the delete queue, because the RCU walk
>> isn't protected by any means.
>>
>> Looking closer at this series again, I overlooked the fact that you
>> fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all
>> orders fetching of fib_seq and thus the RCU dumping after any concurrent
>> executing fib table update, also the mutex_lock and unlock provide
>> proper acquire and release fences, so the CPU indeed sees the effect of
>> a RCU_INIT_POINTER update done on another CPU, because they pair with
>> the rtnl_unlock which might happen on the other CPU.
> 
> Yep, Exactly. I had a feeling this is the issue you were referring to,
> but then you were the one to suggest the use of RTNL, so I was quite
> confused.

At that time I actually had in mind that the fib_register would happen
under the sequence lock, so I didn't look closely to the memory barrier
pairings. I kinda still consider this to be a happy accident. ;)

>> My question is if this is a bit of luck and if we should make this
>> explicit by putting the registration itself under the protection of the
>> sequence counter. I favor the additional protection, e.g. if we some day
>> actually we optimize the fib_seq code? Otherwise we might probably
>> document this fact. :)
> 
> Well, some listeners don't require a dump, but only registration
> (rocker) and in the future we might only need a dump (e.g., port being
> moved to a different net namespace). So I'm not sure if bundling both
> together is a good idea.
> 
> Maybe we can keep register_fib_notifier() as-is and add 'bool register'
> to fib_notifier_dump() so that when set, 'nb' is also registered after
> RCU walk, but before we check if the dump is consistent (unregistered if
> inconsistent)?

I really like that. Would you mind adding this?

[...]

>> Quick follow-up question: How can I quickly find out the hw limitations
>> via the kernel api?
> 
> That's a good question. Currently, you can't. However, we already have a
> mechanism in place to read device's capabilities from the firmware and
> we can (and should) expose some of them to the user. The best API for
> that would be devlink, as it can represent the entire device as opposed
> to only a port netdev like other tools.
> 
> We're also working on making the pipeline more visible to the user, so
> that it would be easier for users to understand and debug their
> networks. I believe a colleague of mine (Matty) presented this during
> the last netdev conference.

Thanks, I will look it up!

Bye,
Hannes

^ permalink raw reply

* [PATCH 3/3] netns: fix net_generic() "id - 1" bloat
From: Alexey Dobriyan @ 2016-12-02  1:21 UTC (permalink / raw)
  To: davem; +Cc: netdev, xemul

net_generic() function is both a) inline and b) used ~600 times.

It has the following code inside

		...
	ptr = ng->ptr[id - 1];
		...

"id" is never compile time constant so compiler is forced to subtract 1.
And those decrements or LEA [r32 - 1] instructions add up.

We also start id'ing from 1 to catch bugs where pernet sybsystem id
is not initialized and 0. This is quite pointless idea (nothing will
work or immediate interference with first registered subsystem) in
general but it hints what needs to be done for code size reduction.

Namely, overlaying allocation of pointer array and fixed part of
structure in the beginning and using usual base-0 addressing.

Ids are just cookies, their exact values do not matter, so lets start
with 3 on x86_64.

Code size savings (oh boy): -4.2 KB

As usual, ignore the initial compiler stupidity part of the table.

	add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
	function                                     old     new   delta
	tipc_nametbl_insert_publ                    1250    1270     +20
	nlmclnt_lookup_host                          686     703     +17
	nfsd4_encode_fattr                          5930    5941     +11
	nfs_get_client                              1050    1061     +11
	register_pernet_operations                   333     342      +9
	tcf_mirred_init                              843     849      +6
	tcf_bpf_init                                1143    1149      +6
	gss_setup_upcall                             990     994      +4
	idmap_name_to_id                             432     434      +2
	ops_init                                     274     275      +1
	nfsd_inject_forget_client                    259     260      +1
	nfs4_alloc_client                            612     613      +1
	tunnel_key_walker                            164     163      -1

		...

	tipc_bcbase_select_primary                   392     360     -32
	mac80211_hwsim_new_radio                    2808    2767     -41
	ipip6_tunnel_ioctl                          2228    2186     -42
	tipc_bcast_rcv                               715     672     -43
	tipc_link_build_proto_msg                   1140    1089     -51
	nfsd4_lock                                  3851    3796     -55
	tipc_mon_rcv                                1012     956     -56
	Total: Before=156643951, After=156639743, chg -0.00%


Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 include/net/netns/generic.h |   16 +++++++++-------
 net/core/net_namespace.c    |   20 ++++++++++++--------
 2 files changed, 21 insertions(+), 15 deletions(-)

--- a/include/net/netns/generic.h
+++ b/include/net/netns/generic.h
@@ -25,12 +25,14 @@
  */
 
 struct net_generic {
-	struct {
-		unsigned int len;
-		struct rcu_head rcu;
-	} s;
-
-	void *ptr[0];
+	union {
+		struct {
+			unsigned int len;
+			struct rcu_head rcu;
+		} s;
+
+		void *ptr[0];
+	};
 };
 
 static inline void *net_generic(const struct net *net, unsigned int id)
@@ -40,7 +42,7 @@ static inline void *net_generic(const struct net *net, unsigned int id)
 
 	rcu_read_lock();
 	ng = rcu_dereference(net->gen);
-	ptr = ng->ptr[id - 1];
+	ptr = ng->ptr[id];
 	rcu_read_unlock();
 
 	return ptr;
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -39,6 +39,9 @@ EXPORT_SYMBOL(init_net);
 
 static bool init_net_initialized;
 
+#define MIN_PERNET_OPS_ID	\
+	((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *))
+
 #define INITIAL_NET_GEN_PTRS	13 /* +1 for len +2 for rcu_head */
 
 static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
@@ -46,7 +49,7 @@ static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
 static struct net_generic *net_alloc_generic(void)
 {
 	struct net_generic *ng;
-	size_t generic_size = offsetof(struct net_generic, ptr[max_gen_ptrs]);
+	unsigned int generic_size = offsetof(struct net_generic, ptr[max_gen_ptrs]);
 
 	ng = kzalloc(generic_size, GFP_KERNEL);
 	if (ng)
@@ -60,12 +63,12 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	struct net_generic *ng, *old_ng;
 
 	BUG_ON(!mutex_is_locked(&net_mutex));
-	BUG_ON(id == 0);
+	BUG_ON(id < MIN_PERNET_OPS_ID);
 
 	old_ng = rcu_dereference_protected(net->gen,
 					   lockdep_is_held(&net_mutex));
-	if (old_ng->s.len >= id) {
-		old_ng->ptr[id - 1] = data;
+	if (old_ng->s.len > id) {
+		old_ng->ptr[id] = data;
 		return 0;
 	}
 
@@ -84,8 +87,9 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	 * the old copy for kfree after a grace period.
 	 */
 
-	memcpy(&ng->ptr, &old_ng->ptr, old_ng->s.len * sizeof(void*));
-	ng->ptr[id - 1] = data;
+	memcpy(&ng->ptr[MIN_PERNET_OPS_ID], &old_ng->ptr[MIN_PERNET_OPS_ID],
+	       (old_ng->s.len - MIN_PERNET_OPS_ID) * sizeof(void *));
+	ng->ptr[id] = data;
 
 	rcu_assign_pointer(net->gen, ng);
 	kfree_rcu(old_ng, s.rcu);
@@ -874,7 +878,7 @@ static int register_pernet_operations(struct list_head *list,
 
 	if (ops->id) {
 again:
-		error = ida_get_new_above(&net_generic_ids, 1, ops->id);
+		error = ida_get_new_above(&net_generic_ids, MIN_PERNET_OPS_ID, ops->id);
 		if (error < 0) {
 			if (error == -EAGAIN) {
 				ida_pre_get(&net_generic_ids, GFP_KERNEL);
@@ -882,7 +886,7 @@ static int register_pernet_operations(struct list_head *list,
 			}
 			return error;
 		}
-		max_gen_ptrs = max(max_gen_ptrs, *ops->id);
+		max_gen_ptrs = max(max_gen_ptrs, *ops->id + 1);
 	}
 	error = __register_pernet_operations(list, ops);
 	if (error) {

^ permalink raw reply

* Re: [PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
From: Eric Dumazet @ 2016-12-01 23:15 UTC (permalink / raw)
  To: Mahesh Bandewar; +Cc: netdev, Eric Dumazet, David Miller, Mahesh Bandewar
In-Reply-To: <1480632994-12128-1-git-send-email-mahesh@bandewar.net>

On Thu, 2016-12-01 at 14:56 -0800, Mahesh Bandewar wrote:
> From: Mahesh Bandewar <maheshb@google.com>
> 
> If initial broadcast probe(s) is/are lost, the neigh entry wont have
> valid address of the neighbour. In a situation like this, the fall
> back should be to send a broadcast probe, however the code logic
> continues sending ucast probes to 00:00:00:00:00:00. The default value
> of ucast probes is 3 so system usually recovers after three such probes
> but if the value configured is larger it takes those many probes
> (a probe is sent every second in default config) / seconds to recover
> making machine not-available on the network.
> 
> This patch just ensures that the unicast address is not NULL otherwise
> falls back to sending broadcast probe.
> 
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> ---
>  net/ipv4/arp.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
> index 89a8cac4726a..56fb33d5ed31 100644
> --- a/net/ipv4/arp.c
> +++ b/net/ipv4/arp.c
> @@ -330,6 +330,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  {
>  	__be32 saddr = 0;
>  	u8 dst_ha[MAX_ADDR_LEN], *dst_hw = NULL;
> +	u8 null_dev_hw_addr[MAX_ADDR_LEN];
>  	struct net_device *dev = neigh->dev;
>  	__be32 target = *(__be32 *)neigh->primary_key;
>  	int probes = atomic_read(&neigh->probes);
> @@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  
>  	probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES);
>  	if (probes < 0) {
> +		memset(&null_dev_hw_addr, 0, dev->addr_len);
>  		if (!(neigh->nud_state & NUD_VALID))
>  			pr_debug("trying to ucast probe in NUD_INVALID\n");
>  		neigh_ha_snapshot(dst_ha, neigh, dev);
> -		dst_hw = dst_ha;
> +		if (memcmp(&dst_ha, &null_dev_hw_addr, dev->addr_len) != 0)
> +			dst_hw = dst_ha;
>  	} else {
>  		probes -= NEIGH_VAR(neigh->parms, APP_PROBES);
>  		if (probes < 0) {

Why is is an IPv4 specific issue ?
What about IPv6 ?


I would try something in neighbour code, maybe :

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 782dd866366554e53dda3e6c69c807ec90bd0e08..fdfb177eecb6a9b1479eedde457cb1f652d32c68 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -916,7 +916,10 @@ static void neigh_timer_handler(unsigned long arg)
 			neigh_dbg(2, "neigh %p is probed\n", neigh);
 			neigh->nud_state = NUD_PROBE;
 			neigh->updated = jiffies;
-			atomic_set(&neigh->probes, 0);
+			atomic_set(&neigh->probes,
+				   (neigh->output == neigh_blackhole) ?
+					NEIGH_VAR(neigh->parms, UCAST_PROBES) :
+					0);
 			notify = 1;
 			next = now + NEIGH_VAR(neigh->parms, RETRANS_TIME);
 		}

Thanks.

^ permalink raw reply related

* [PATCH] netlink: 2-clause nla_ok()
From: Alexey Dobriyan @ 2016-12-02  0:59 UTC (permalink / raw)
  To: davem; +Cc: netdev

nla_ok() consists of 3 clauses:

	1) int rem >= (int)sizeof(struct nlattr)

	2) u16 nla_len >= sizeof(struct nlattr)

	3) u16 nla_len <= int rem

The statement is that clause (1) is redundant.

What it does is ensuring that "rem" is a positive number,
so that in clause (3) positive number will be compared to positive number
with no problems.

However, "u16" fully fits into "int" and integers do not change value
when upcasting even to signed type. Negative integers will be rejected
by clause (3) just fine. Small positive integers will be rejected
by transitivity of comparison operator.

NOTE: all of the above DOES NOT apply to nlmsg_ok() where ->nlmsg_len is
u32(!), so 3 clauses AND A CAST TO INT are necessary.

Obligatory space savings report: -1.6 KB

	$ ./scripts/bloat-o-meter ../vmlinux-000* ../vmlinux-001*
	add/remove: 0/0 grow/shrink: 3/63 up/down: 35/-1692 (-1657)
	function                                     old     new   delta
	validate_scan_freqs                          142     155     +13
	tcf_em_tree_validate                         867     879     +12
	dcbnl_ieee_del                               328     338     +10
	netlbl_cipsov4_add_common.isra               218     215      -3
		...
	ovs_nla_put_actions                          888     806     -82
	netlbl_cipsov4_add_std                      1648    1566     -82
	nl80211_parse_sched_scan                    2889    2780    -109
	ip_tun_from_nlattr                          3086    2945    -141

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 include/net/netlink.h |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -698,8 +698,7 @@ static inline int nla_len(const struct nlattr *nla)
  */
 static inline int nla_ok(const struct nlattr *nla, int remaining)
 {
-	return remaining >= (int) sizeof(*nla) &&
-	       nla->nla_len >= sizeof(*nla) &&
+	return nla->nla_len >= sizeof(*nla) &&
 	       nla->nla_len <= remaining;
 }
 

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Ido Schimmel @ 2016-12-01 23:14 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <ea2a8856-59d8-1b5f-9428-b16efa8f6979@stressinduktion.org>

On Thu, Dec 01, 2016 at 10:57:52PM +0100, Hannes Frederic Sowa wrote:
> On 30.11.2016 19:22, Ido Schimmel wrote:
> > On Wed, Nov 30, 2016 at 05:49:56PM +0100, Hannes Frederic Sowa wrote:
> >> On 30.11.2016 17:32, Ido Schimmel wrote:
> >>> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
> >>>> On 30.11.2016 11:09, Jiri Pirko wrote:
> >>>>> From: Ido Schimmel <idosch@mellanox.com>
> >>>>>
> >>>>> Make sure the device has a complete view of the FIB tables by invoking
> >>>>> their dump during module init.
> >>>>>
> >>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> >>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> >>>>> ---
> >>>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
> >>>>>  1 file changed, 23 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> >>>>> index 14bed1d..d176047 100644
> >>>>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> >>>>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> >>>>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
> >>>>>  	return NOTIFY_DONE;
> >>>>>  }
> >>>>>  
> >>>>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
> >>>>> +{
> >>>>> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
> >>>>> +
> >>>>> +	/* Flush pending FIB notifications and then flush the device's
> >>>>> +	 * table before requesting another dump. Do that with RTNL held,
> >>>>> +	 * as FIB notification block is already registered.
> >>>>> +	 */
> >>>>> +	mlxsw_core_flush_owq();
> >>>>> +	rtnl_lock();
> >>>>> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
> >>>>> +	rtnl_unlock();
> >>>>> +}
> >>>>> +
> >>>>>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
> >>>>>  {
> >>>>> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
> >>>>>  	int err;
> >>>>>  
> >>>>>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
> >>>>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
> >>>>>  
> >>>>>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
> >>>>>  	register_fib_notifier(&mlxsw_sp->fib_nb);
> >>>>
> >>>> Sorry to pick in here again:
> >>>>
> >>>> There is a race here. You need to protect the registration of the fib
> >>>> notifier as well by the sequence counter. Updates here are not ordered
> >>>> in relation to this code below.
> >>>
> >>> You mean updates that can be received after you registered the notifier
> >>> and until the dump started? I'm aware of that and that's OK. This
> >>> listener should be able to handle duplicates.
> >>
> >> I am not concerned about duplicates, but about ordering deletes and
> >> getting an add from the RCU code you will add the node to hw while it is
> >> deleted in the software path. You probably will ignore the delete
> >> because nothing is installed in hw and later add the node which was
> >> actually deleted but just reordered which happend on another CPU, no?
> > 
> > Are you referring to reordering in the workqueue? We already covered
> > this using an ordered workqueue, which has one context of execution
> > system-wide.
> 
> Ups, sorry, I missed that mail. Probably read it on the mobile phone and
> it became invisible for me later on. Busy day... ;)

Yet another reason not to read emails on your phone ;)

> The reordering in the workqueue seems fine to me and also still necessary.

Correct.

> Basically, if you delete a node right now the kernel might simply do a
> RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers
> or synchronization with the reader side. Thus you might get a callback
> from the notifier for a delete event on the one CPU and you end up
> queueing this fib entry after the delete queue, because the RCU walk
> isn't protected by any means.
> 
> Looking closer at this series again, I overlooked the fact that you
> fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all
> orders fetching of fib_seq and thus the RCU dumping after any concurrent
> executing fib table update, also the mutex_lock and unlock provide
> proper acquire and release fences, so the CPU indeed sees the effect of
> a RCU_INIT_POINTER update done on another CPU, because they pair with
> the rtnl_unlock which might happen on the other CPU.

Yep, Exactly. I had a feeling this is the issue you were referring to,
but then you were the one to suggest the use of RTNL, so I was quite
confused.

> My question is if this is a bit of luck and if we should make this
> explicit by putting the registration itself under the protection of the
> sequence counter. I favor the additional protection, e.g. if we some day
> actually we optimize the fib_seq code? Otherwise we might probably
> document this fact. :)

Well, some listeners don't require a dump, but only registration
(rocker) and in the future we might only need a dump (e.g., port being
moved to a different net namespace). So I'm not sure if bundling both
together is a good idea.

Maybe we can keep register_fib_notifier() as-is and add 'bool register'
to fib_notifier_dump() so that when set, 'nb' is also registered after
RCU walk, but before we check if the dump is consistent (unregistered if
inconsistent)?

> >>> I've a follow up patchset that introduces a new event in switchdev
> >>> notification chain called SWITCHDEV_SYNC, which is sent when port
> >>> netdevs are enslaved / released  from a master device (points in time
> >>> where kernel<->device can get out of sync). It will invoke
> >>> re-propagation of configuration from different parts of the stack
> >>> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
> >>> in duplicates.
> >>
> >> Okay, understood. I wonder how we can protect against accidentally abort
> >> calls actually. E.g. if I start to inject routes into my routing domain
> >> how can I make sure the box doesn't die after I try to insert enough
> >> routes. Do we need to touch quagga etc?
> > 
> > The whole point of moving abort mechanism to the driver is that the
> > system won't die, but instead routing will be done in the kernel. If you
> > respect hardware limitations, then there's no reason for abort mechanism
> > to kick in.
> 
> Quick follow-up question: How can I quickly find out the hw limitations
> via the kernel api?

That's a good question. Currently, you can't. However, we already have a
mechanism in place to read device's capabilities from the firmware and
we can (and should) expose some of them to the user. The best API for
that would be devlink, as it can represent the entire device as opposed
to only a port netdev like other tools.

We're also working on making the pipeline more visible to the user, so
that it would be easier for users to understand and debug their
networks. I believe a colleague of mine (Matty) presented this during
the last netdev conference.

^ permalink raw reply

* [PATCH 2/3] netns: add dummy struct inside "struct net_generic"
From: Alexey Dobriyan @ 2016-12-02  1:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, xemul

This is precursor to fixing "[id - 1]" bloat inside net_generic().

Name "s" is chosen to complement name "u" often used for dummy unions.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 include/net/netns/generic.h |    6 ++++--
 net/core/net_namespace.c    |    8 ++++----
 2 files changed, 8 insertions(+), 6 deletions(-)

--- a/include/net/netns/generic.h
+++ b/include/net/netns/generic.h
@@ -25,8 +25,10 @@
  */
 
 struct net_generic {
-	unsigned int len;
-	struct rcu_head rcu;
+	struct {
+		unsigned int len;
+		struct rcu_head rcu;
+	} s;
 
 	void *ptr[0];
 };
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -50,7 +50,7 @@ static struct net_generic *net_alloc_generic(void)
 
 	ng = kzalloc(generic_size, GFP_KERNEL);
 	if (ng)
-		ng->len = max_gen_ptrs;
+		ng->s.len = max_gen_ptrs;
 
 	return ng;
 }
@@ -64,7 +64,7 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 
 	old_ng = rcu_dereference_protected(net->gen,
 					   lockdep_is_held(&net_mutex));
-	if (old_ng->len >= id) {
+	if (old_ng->s.len >= id) {
 		old_ng->ptr[id - 1] = data;
 		return 0;
 	}
@@ -84,11 +84,11 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	 * the old copy for kfree after a grace period.
 	 */
 
-	memcpy(&ng->ptr, &old_ng->ptr, old_ng->len * sizeof(void*));
+	memcpy(&ng->ptr, &old_ng->ptr, old_ng->s.len * sizeof(void*));
 	ng->ptr[id - 1] = data;
 
 	rcu_assign_pointer(net->gen, ng);
-	kfree_rcu(old_ng, rcu);
+	kfree_rcu(old_ng, s.rcu);
 	return 0;
 }
 

^ permalink raw reply

* [PATCH 1/3] netns: publish net_generic correctly
From: Alexey Dobriyan @ 2016-12-02  1:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, xemul

Publishing net_generic pointer is done with silly mistake: new array is
published BEFORE setting freshly acquired pernet subsystem pointer.

	memcpy
	rcu_assign_pointer
	kfree_rcu
	ng->ptr[id - 1] = data;

This bug was introduced with commit dec827d174d7f76c457238800183ca864a639365
("[NETNS]: The generic per-net pointers.") in the glorious days of
chopping networking stack into containers proper 8.5 years ago (whee...)

How it didn't trigger for so long?
Well, you need quite specific set of conditions:

*) race window opens once per pernet subsystem addition
   (read: modprobe or boot)

*) not every pernet subsystem is eligible (need ->id and ->size)

*) not every pernet subsystem is vulnerable (need incorrect or absense
   of ordering of register_pernet_sybsys() and actually using net_generic())

*) to hide the bug even more, default is to preallocate 13 pointers which
   is actually quite a lot. You need IPv6, netfilter, bridging etc together
   loaded to trigger reallocation in the first place. Trimmed down
   config are OK.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 net/core/net_namespace.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -64,9 +64,10 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 
 	old_ng = rcu_dereference_protected(net->gen,
 					   lockdep_is_held(&net_mutex));
-	ng = old_ng;
-	if (old_ng->len >= id)
-		goto assign;
+	if (old_ng->len >= id) {
+		old_ng->ptr[id - 1] = data;
+		return 0;
+	}
 
 	ng = net_alloc_generic();
 	if (ng == NULL)
@@ -84,11 +85,10 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	 */
 
 	memcpy(&ng->ptr, &old_ng->ptr, old_ng->len * sizeof(void*));
+	ng->ptr[id - 1] = data;
 
 	rcu_assign_pointer(net->gen, ng);
 	kfree_rcu(old_ng, rcu);
-assign:
-	ng->ptr[id - 1] = data;
 	return 0;
 }
 

^ permalink raw reply

* [PATCH next] arp: avoid sending ucast probes to 00:00:00:00:00:00
From: Mahesh Bandewar @ 2016-12-01 22:56 UTC (permalink / raw)
  To: netdev, Eric Dumazet, David Miller; +Cc: Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

If initial broadcast probe(s) is/are lost, the neigh entry wont have
valid address of the neighbour. In a situation like this, the fall
back should be to send a broadcast probe, however the code logic
continues sending ucast probes to 00:00:00:00:00:00. The default value
of ucast probes is 3 so system usually recovers after three such probes
but if the value configured is larger it takes those many probes
(a probe is sent every second in default config) / seconds to recover
making machine not-available on the network.

This patch just ensures that the unicast address is not NULL otherwise
falls back to sending broadcast probe.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 net/ipv4/arp.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 89a8cac4726a..56fb33d5ed31 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -330,6 +330,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
 	__be32 saddr = 0;
 	u8 dst_ha[MAX_ADDR_LEN], *dst_hw = NULL;
+	u8 null_dev_hw_addr[MAX_ADDR_LEN];
 	struct net_device *dev = neigh->dev;
 	__be32 target = *(__be32 *)neigh->primary_key;
 	int probes = atomic_read(&neigh->probes);
@@ -371,10 +372,12 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 
 	probes -= NEIGH_VAR(neigh->parms, UCAST_PROBES);
 	if (probes < 0) {
+		memset(&null_dev_hw_addr, 0, dev->addr_len);
 		if (!(neigh->nud_state & NUD_VALID))
 			pr_debug("trying to ucast probe in NUD_INVALID\n");
 		neigh_ha_snapshot(dst_ha, neigh, dev);
-		dst_hw = dst_ha;
+		if (memcmp(&dst_ha, &null_dev_hw_addr, dev->addr_len) != 0)
+			dst_hw = dst_ha;
 	} else {
 		probes -= NEIGH_VAR(neigh->parms, APP_PROBES);
 		if (probes < 0) {
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* Re: Initial thoughts on TXDP
From: Hannes Frederic Sowa @ 2016-12-01 22:55 UTC (permalink / raw)
  To: Sowmini Varadhan, Tom Herbert; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161201201324.GJ24547@oracle.com>

On 01.12.2016 21:13, Sowmini Varadhan wrote:
> On (12/01/16 11:05), Tom Herbert wrote:
>>
>> Polling does not necessarily imply that networking monopolizes the CPU
>> except when the CPU is otherwise idle. Presumably the application
>> drives the polling when it is ready to receive work.
> 
> I'm not grokking that- "if the cpu is idle, we want to busy-poll
> and make it 0% idle"?  Keeping CPU 0% idle has all sorts
> of issues, see slide 20 of
>  http://www.slideshare.net/shemminger/dpdk-performance
>
>>> and one other critical difference from the hot-potato-forwarding
>>> model (the sort of OVS model that DPDK etc might aruguably be a fit for)
>>> does not apply: in order to figure out the ethernet and IP headers
>>> in the response correctly at all times (in the face of things like VRRP,
>>> gw changes, gw's mac addr changes etc) the application should really
>>> be listening on NETLINK sockets for modifications to the networking
>>> state - again points to needing a select() socket set where you can
>>> have both the I/O fds and the netlink socket,
>>>
>> I would think that that is management would not be implemented in a
>> fast path processing thread for an application.
> 
> sure, but my point was that *XDP and other stack-bypass methods needs 
> to provide a select()able socket: when your use-case is not about just
> networking, you have to snoop on changes to the control plane, and update
> your data path. In the OVS case (pure networking) the OVS control plane
> updates are intrinsic to OVS. For the rest of the request/response world,
> we need a select()able socket set to do this elegantly (not really
> possible in DPDK, for example)

Busypoll on steroids is what windows does by mapping the user space
"doorbell" into a vDSO and let user space loop on that maybe with
MWAIT/MONITOR. The interesting thing is that you can map other events to
this notification event, too. It sounds like a usable idea to me and
reassembles what we already do with futexes.

Bye,
Hannes

^ permalink raw reply

* stmmac: turn coalescing / NAPI off in stmmac
From: Pavel Machek @ 2016-12-01 22:48 UTC (permalink / raw)
  To: David Miller, alexandre.torgue; +Cc: peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <20161201.152303.425589678238707778.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 7554 bytes --]


> > @@ -2771,12 +2771,8 @@ static netdev_features_t stmmac_fix_features(struct net_device *dev,
> >  		features &= ~NETIF_F_CSUM_MASK;
> >  
> >  	/* Disable tso if asked by ethtool */
> > -	if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) {
> > -		if (features & NETIF_F_TSO)
> > -			priv->tso = true;
> > -		else
> > -			priv->tso = false;
> > -	}
> > +	if ((priv->plat->tso_en) && (priv->dma_cap.tsoen))
> > +		priv->tso = !!(features & NETIF_F_TSO);
> >  
> 
> Pavel, this really seems arbitrary.
> 
> Whilst I really appreciate you're looking into this driver a bit because
> of some issues you are trying to resolve, I'd like to ask that you not
> start bombarding me with nit-pick cleanups here and there and instead
> concentrate on the real bug or issue.

Well, fixing clean code is easier than fixing strange code... Plus I
was hoping to make the mainainers to talk. The driver is listed as
supported after all.

Anyway... since you asked. I belive I have way to disable NAPI / tx
coalescing in the driver. Unfortunately, locking is missing on the rx
path, and needs to be extended to _irqsave variant on tx path.

So patch currently looks like this (hand edited, can't be
applied, got it working few hours ago). Does it look acceptable?

I'd prefer this to go after the patch that pulls common code to single
place, so that single place needs to be patched. Plus I guess I should
add ifdefs, so that more advanced NAPI / tx coalescing code can be
reactivated when it is fixed. Trivial fixes can go on top. Does that
sound like a plan?

Which tree do you want patches against?

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/ ?

Best regards,
								Pavel


diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 0b706a7..c0016c8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1395,9 +1397,10 @@ static void __stmmac_tx_clean(struct stmmac_priv *priv)
 
 static void stmmac_tx_clean(struct stmmac_priv *priv)
 {
-	spin_lock(&priv->tx_lock);
+	unsigned long flags;
+	spin_lock_irqsave(&priv->tx_lock, flags);
 	__stmmac_tx_clean(priv);
-	spin_unlock(&priv->tx_lock);	
+	spin_unlock_irqrestore(&priv->tx_lock, flags);	
 }
 
 static inline void stmmac_enable_dma_irq(struct stmmac_priv *priv)
@@ -1441,6 +1444,8 @@ static void stmmac_tx_err(struct stmmac_priv *priv)
 	netif_wake_queue(priv->dev);
 }
 
+static int stmmac_rx(struct stmmac_priv *priv, int limit);
+
 /**
  * stmmac_dma_interrupt - DMA ISR
  * @priv: driver private structure
@@ -1452,10 +1457,17 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 {
 	int status;
 	int rxfifosz = priv->plat->rx_fifo_size;
+	unsigned long flags;
 
 	status = priv->hw->dma->dma_interrupt(priv->ioaddr, &priv->xstats);
 	if (likely((status & handle_rx)) || (status & handle_tx)) {
+		int r;
+		spin_lock_irqsave(&priv->tx_lock, flags);
+		r = stmmac_rx(priv, 999);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);		
+#if 0
 		if (likely(napi_schedule_prep(&priv->napi))) {
 			//pr_err("napi: schedule\n");
 			stmmac_disable_dma_irq(priv);
@@ -1463,7 +1475,8 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 		} else
 			pr_err("napi: schedule failed\n");
 #endif
+		stmmac_tx_clean(priv);
 	}
 	if (unlikely(status & tx_hard_error_bump_tc)) {
 		/* Try to bump up the dma threshold on this failure */
@@ -1638,7 +1651,7 @@ static void stmmac_tx_timer(unsigned long data)
 {
 	struct stmmac_priv *priv = (struct stmmac_priv *)data;
 
-	stmmac_tx_clean(priv);
+	//stmmac_tx_clean(priv);
 }
 
 /**
@@ -1990,6 +2003,8 @@ static void stmmac_xmit_common(struct sk_buff *skb, struct net_device *dev, int
 	if (likely(priv->tx_coal_frames > priv->tx_count_frames)) {
 		mod_timer(&priv->txtimer,
 			  STMMAC_COAL_TIMER(priv->tx_coal_timer));
+		priv->hw->desc->set_tx_ic(desc);
+		priv->xstats.tx_set_ic_bit++;
 	} else {
 		priv->tx_count_frames = 0;
 		priv->hw->desc->set_tx_ic(desc);
@@ -2038,8 +2053,9 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct dma_desc *desc, *first, *mss_desc = NULL;
 	u8 proto_hdr_len;
 	int i;
+	unsigned long flags;
 
-	spin_lock(&priv->tx_lock);
+	spin_lock_irqsave(&priv->tx_lock, flags);
 
 	/* Compute header lengths */
 	proto_hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
@@ -2052,7 +2068,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 			/* This is a hard error, log it. */
 			pr_err("%s: Tx Ring full when queue awake\n", __func__);
 		}
-		spin_unlock(&priv->tx_lock);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -2168,11 +2184,11 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 				       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -2197,6 +2213,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct dma_desc *desc, *first;
 	unsigned int enh_desc;
 	unsigned int des;
+	unsigned int flags;
 
 	/* Manage oversized TCP frames for GMAC4 device */
 	if (skb_is_gso(skb) && priv->tso) {
@@ -2204,7 +2221,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 			return stmmac_tso_xmit(skb, dev);
 	}
 
-	spin_lock(&priv->tx_lock);
+	spin_lock_irqsave(&priv->tx_lock, flags);
 
 	if (unlikely(stmmac_tx_avail(priv) < nfrags + 1)) {
 		if (!netif_queue_stopped(dev)) {
@@ -2212,7 +2229,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 			/* This is a hard error, log it. */
 			pr_err("%s: Tx Ring full when queue awake\n", __func__);
 		}
-		spin_unlock(&priv->tx_lock);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -2347,11 +2364,11 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 		priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 					       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -2634,7 +2651,8 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit)
 			else
 				skb->ip_summed = CHECKSUM_UNNECESSARY;
 
-			napi_gro_receive(&priv->napi, skb);
+			//napi_gro_receive(&priv->napi, skb);
+			netif_rx(skb);
 
 			priv->dev->stats.rx_packets++;
 			priv->dev->stats.rx_bytes += frame_len;
@@ -2662,6 +2680,7 @@ static int stmmac_poll(struct napi_struct *napi, int budget)
 	struct stmmac_priv *priv = container_of(napi, struct stmmac_priv, napi);
 	int work_done;
 
+	BUG();
 	priv->xstats.napi_poll++;
 	stmmac_tx_clean(priv);
 



-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply related

* Re: Initial thoughts on TXDP
From: Hannes Frederic Sowa @ 2016-12-01 22:47 UTC (permalink / raw)
  To: Tom Herbert, Florian Westphal
  Cc: Linux Kernel Network Developers, Jesper Dangaard Brouer
In-Reply-To: <CALx6S36ywu3ruY7AFKYk=N4Ekr5zjY33ivx92EgNNT36XoXhFA@mail.gmail.com>

Side note:

On 01.12.2016 20:51, Tom Herbert wrote:
>> > E.g. "mini-skb": Even if we assume that this provides a speedup
>> > (where does that come from? should make no difference if a 32 or
>> >  320 byte buffer gets allocated).
>> >
> It's the zero'ing of three cache lines. I believe we talked about that
> as netdev.

Jesper and me played with that again very recently:

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590

In micro-benchmarks we saw a pretty good speed up not using the rep
stosb generated by gcc builtin but plain movq's. Probably the cost model
for __builtin_memset in gcc is wrong?

When Jesper is free we wanted to benchmark this and maybe come up with a
arch specific way of cleaning if it turns out to really improve throughput.

SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
all the benefits.

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
From: Hannes Frederic Sowa @ 2016-12-01 22:28 UTC (permalink / raw)
  To: Erik Nordmark, davem; +Cc: netdev, Bob Gilligan
In-Reply-To: <1480549159-8142-1-git-send-email-nordmark@arista.com>

On 01.12.2016 00:39, Erik Nordmark wrote:
> Implemented RFC7527 Enhanced DAD.
> IPv6 duplicate address detection can fail if there is some temporary
> loopback of Ethernet frames. RFC7527 solves this by including a random
> nonce in the NS messages used for DAD, and if an NS is received with the
> same nonce it is assumed to be a looped back DAD probe and is ignored.
> RFC7527 is enabled by default. Can be disabled by setting both of
> conf/{all,interface}/enhanced_dad to zero.
> 
> Signed-off-by: Erik Nordmark <nordmark@arista.com>
> Signed-off-by: Bob Gilligan <gilligan@arista.com>
> ---

Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Thanks!
> @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb)
>  have_ifp:
>  		if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
>  			if (dad) {
> +				if (nonce != 0 && ifp->dad_nonce == nonce) {
> +					u8 *np = (u8 *)&nonce;
> +					/* Matching nonce if looped back */
> +					ND_PRINTK(2, notice,
> +						  "%s: IPv6 DAD loopback for address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n",
> +						  ifp->idev->dev->name,
> +						  &ifp->addr,
> +						  np[0], np[1], np[2], np[3],
> +						  np[4], np[5]);
> +					goto out;
> +				}
>  				/*
>  				 * We are colliding with another node
>  				 * who is doing DAD
> 

I think it could be a "%pM" because it looks like a MAC address, but
better leave it like that. :)

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
From: Paolo Abeni @ 2016-12-01 22:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, Andrew Morton, Christoph Lameter, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Linux-MM, Linux-Kernel,
	Rick Jones, netdev@vger.kernel.org, Hannes Frederic Sowa
In-Reply-To: <20161201183402.2fbb8c5b@redhat.com>

On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> small socket queues)
> 
> On Wed, 30 Nov 2016 16:35:20 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > > I don't quite get why you are setting the socket recv size
> > > (with -- -s and -S) to such a small number, size + 256.
> > >   
> > 
> > Maybe I missed something at the time I wrote that but why would it
> > need to be larger?
> 
> Well, to me it is quite obvious that we need some queue to avoid packet
> drops.  We have two processes netperf and netserver, that are sending
> packets between each-other (UDP_STREAM mostly netperf -> netserver).
> These PIDs are getting scheduled and migrated between CPUs, and thus
> does not get executed equally fast, thus a queue is need absorb the
> fluctuations.
> 
> The network stack is even partly catching your config "mistake" and
> increase the socket queue size, so we minimum can handle one max frame
> (due skb "truesize" concept approx PAGE_SIZE + overhead).
> 
> Hopefully for localhost testing a small queue should hopefully not
> result in packet drops.  Testing... ups, this does result in packet
> drops.
> 
> Test command extracted from mmtests, UDP_STREAM size 1024:
> 
>  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
>    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> 
>  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
>   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
>  Socket  Message  Elapsed      Messages                
>  Size    Size     Time         Okay Errors   Throughput
>  bytes   bytes    secs            #      #   10^6bits/sec
> 
>    4608    1024   60.00     50024301      0    6829.98
>    2560           60.00     46133211           6298.72
> 
>  Dropped packets: 50024301-46133211=3891090
> 
> To get a better drop indication, during this I run a command, to get
> system-wide network counters from the last second, so below numbers are
> per second.
> 
>  $ nstat > /dev/null && sleep 1  && nstat
>  #kernel
>  IpInReceives                    885162             0.0
>  IpInDelivers                    885161             0.0
>  IpOutRequests                   885162             0.0
>  UdpInDatagrams                  776105             0.0
>  UdpInErrors                     109056             0.0
>  UdpOutDatagrams                 885160             0.0
>  UdpRcvbufErrors                 109056             0.0
>  IpExtInOctets                   931190476          0.0
>  IpExtOutOctets                  931189564          0.0
>  IpExtInNoECTPkts                885162             0.0
> 
> So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
> UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
> happens kernel side in __udp_queue_rcv_skb[1], because receiving
> process didn't empty it's queue fast enough see [2].
> 
> Although upstream changes are coming in this area, [2] is replaced with
> __udp_enqueue_schedule_skb, which I actually tested with... hmm
> 
> Retesting with kernel 4.7.0-baseline+ ... show something else. 
> To Paolo, you might want to look into this.  And it could also explain why
> I've not see the mentioned speedup by mm-change, as I've been testing
> this patch on top of net-next (at 93ba2222550) with Paolo's UDP changes.

Thank you for reporting this.

It seems that the commit 123b4a633580 ("udp: use it's own memory
accounting schema") is too strict while checking the rcvbuf. 

For very small value of rcvbuf, it allows a single skb to be enqueued,
while previously we allowed 2 of them to enter the queue, even if the
first one truesize exceeded rcvbuf, as in your test-case.

Can you please try the following patch ?

Thank you,

Paolo
---
 net/ipv4/udp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1d0bf8..2f5dc92 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, delta, amt, err = -ENOMEM;
 	int size = skb->truesize;
+	int limit;
 
 	/* try to avoid the costly atomic add/sub pair when the receive
 	 * queue is full; always allow at least a packet
 	 */
 	rmem = atomic_read(&sk->sk_rmem_alloc);
-	if (rmem && (rmem + size > sk->sk_rcvbuf))
+	limit = size + sk->sk_rcvbuf;
+	if (rmem > limit)
 		goto drop;
 
 	/* we drop only if the receive buf is full and the receive
 	 * queue contains some other skb
 	 */
 	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
-	if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+	if (rmem > limit)
 		goto uncharge_drop;
 
 	spin_lock(&list->lock);





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH 1/1] NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
From: Bjørn Mork @ 2016-12-01 22:13 UTC (permalink / raw)
  To: Daniele Palmas; +Cc: netdev
In-Reply-To: <1480607525-23044-2-git-send-email-dnlplm@gmail.com>

Daniele Palmas <dnlplm@gmail.com> writes:

> This patch adds support for PID 0x1040 of Telit LE922A.

LGTM

Acked-by: Bjørn Mork <bjorn@mork.no>

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 22:12 UTC (permalink / raw)
  To: Rick Jones; +Cc: Sowmini Varadhan, Linux Kernel Network Developers
In-Reply-To: <1e88fd64-0045-beb5-101a-a55b8f54bd08@hpe.com>

On Thu, Dec 1, 2016 at 1:47 PM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 12/01/2016 12:18 PM, Tom Herbert wrote:
>>
>> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jones2@hpe.com> wrote:
>>>
>>> Just how much per-packet path-length are you thinking will go away under
>>> the
>>> likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does
>>> some
>>> non-trivial things to effective overhead (service demand) and so
>>> throughput:
>>>
>> For plain in order TCP packets I believe we should be able process
>> each packet at nearly same speed as GRO. Most of the protocol
>> processing we do between GRO and the stack are the same, the
>> differences are that we need to do a connection lookup in the stack
>> path (note we now do this is UDP GRO and that hasn't show up as a
>> major hit). We also need to consider enqueue/dequeue on the socket
>> which is a major reason to try for lockless sockets in this instance.
>
>
> So waving hands a bit, and taking the service demand for the GRO-on receive
> test in my previous message (860 ns/KB), that would be ~ (1448/1024)*860 or
> ~1.216 usec of CPU time per TCP segment, including ACK generation which
> unless an explicit ACK-avoidance heuristic a la HP-UX 11/Solaris 2 is put in
> place would be for every-other segment. Etc etc.
>
>> Sure, but trying running something emulates a more realistic workload
>> than a TCP stream, like RR test with relative small payload and many
>> connections.
>
>
> That is a good point, which of course is why the RR tests are there in
> netperf :) Don't get me wrong, I *like* seeing path-length reductions. What
> would you posit is a relatively small payload?  The promotion of IR10
> suggests that perhaps 14KB or so is a sufficiently common so I'll grasp at
> that as the length of a piece of string:
>
We have consider both request size and response side in RPC.
Presumably, something like a memcache server is most serving data as
opposed to reading it, we are looking to receiving much smaller
packets than being sent. Requests are going to be quite small say 100
bytes and unless we are doing significant amount of pipelining on
connections GRO would rarely kick-in. Response size will have a lot of
variability, anything from a few kilobytes up to a megabyte. I'm sorry
I can't be more specific this is an artifact of datacenters that have
100s of different applications and communication patterns. Maybe 100b
request size, 8K, 16K, 64K response sizes might be good for test.

> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,14K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     14336  10.00   8118.31  1.57   -1.00  46.410  -1.000
> 16384  87380
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,14K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     14336  10.00   5837.35  2.20   -1.00  90.628  -1.000
> 16384  87380
>
> So, losing GRO doubled the service demand.  I suppose I could see cutting
> path-length in half based on the things you listed which would be bypassed?
>
> I'm sure mileage will vary with different NICs and CPUs.  The ones used here
> happened to be to hand.
>
This is also biased because you're using a single connection, but is
consistent with data we've seen in the past. To be clear I'm not
saying GRO is bad, the fact that GRO has such a visible impact in your
test means that the GRO path is significantly more efficient. Closing
that gap seen in your numbers would be a benefit, that means we have
improved per packet processing.

Tom

> happy benchmarking,
>
> rick
>
> Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly
> goes to more than doubling, and halving to 7K narrows the delta:
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,28K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     28672  10.00   6732.32  1.79   -1.00  63.819  -1.000
> 16384  87380
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,28K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     28672  10.00   3780.47  2.32   -1.00  147.280  -1.000
> 16384  87380
>
>
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,7K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     7168   10.00   10535.01  1.52   -1.00  34.664  -1.000
> 16384  87380
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,7K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     7168   10.00   8225.17  1.80   -1.00  52.661  -1.000
> 16384  87380
>
>

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01 22:10 UTC (permalink / raw)
  To: David Miller; +Cc: brouer, saeedm, rick.jones2, netdev, saeedm, tariqt
In-Reply-To: <20161201.152029.2045474106824619667.davem@davemloft.net>

On Thu, 2016-12-01 at 15:20 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 01 Dec 2016 09:04:17 -0800
> 
> > On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> > 
> >> When qdisc layer or trafgen/af_packet see this indication it knows it
> >> should/must flush the queue when it don't have more work left.  Perhaps
> >> through net_tx_action(), by registering itself and e.g. if qdisc_run()
> >> is called and queue is empty then check if queue needs a flush. I would
> >> also allow driver to flush and clear this bit.
> > 
> > net_tx_action() is not normally called, unless BQL limit is hit and/or
> > some qdiscs with throttling (HTB, TBF, FQ, ...)
> 
> The one thing I wonder about is whether we should "ramp up" into a mode
> where the NAPI poll does the doorbells instead of going directly there.
> 
> Maybe I misunderstand your algorithm, but it looks to me like if there
> are any active packets in the TX queue at enqueue time you will defer
> the doorbell to the interrupt handler.
> 
> Let's say we put 1 packet in, and hit the doorbell.
> 
> Then another packet comes in and we defer the doorbell to the IRQ.
> 
> At this point there are a couple things I'm unclear about.


> 
> For example, if we didn't hit the doorbell, will the chip still take a
> peek at the second descriptor?  Depending upon how the doorbell works
> it might, or it might not.

It might depend on the hardware. I can easily check on mlx4, by
increasing tx-usecs and tx-frames, and sending 2 packets back to back.

> 
> Either way, wouldn't there be a possible condition where the chip
> wouldn't see the second enqueued packet and we'd thus have the wire
> idle until the interrupt + NAPI runs and hits the doorbell?
> 
> This is why I think we should "ramp up" the doorbell deferral, in
> order to avoid this potential wire idle time situation.


> 
> Maybe the situation I'm worried about is not possible, so please
> explain it to me :-)

This is absolutely the problem. We might need to enable this mode only
above a given load. We could have an EWMA of the number of packets
that TX completion runs can dequeue. And enable auto doorbell only if we
have that many packets in the TX ring (instead of the  "1 packet
threshold" of the WIP)

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01 22:04 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, Rick Jones, Netdev, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <CAKgT0Uf_EeWYqGinyV6U1r-Jqd07R9gN-7GeA82emk4EoN=1mQ@mail.gmail.com>

On Thu, 2016-12-01 at 13:32 -0800, Alexander Duyck wrote:

> A few years back when I did something like this on ixgbe I was told by
> you that the issue was that doing something like this would add too
> much latency.  I was just wondering what the latency impact is on a
> change like this and if this now isn't that much of an issue?

I believe it is clear that we can not use this trick without admin
choice.

In my experience, mlx4 can deliver way more interrupts than ixgbe.
(No idea why)

This "auto doorbell" is tied to proper "ethtool -c tx-usecs|tx-frames|
tx-frames-irq", or simply a choice for hosts dedicated to content
delivery (like video servers) with 40GBit+ cards.

Under stress, softirq can be handled by ksoftirqd, and might be delayed
by ~10 ms or even more.

This WIP was minimal, but we certainly need to add belts and suspenders.

1) <maybe> timestamps
  use a ktime_get_ns() to remember a timestamp for the last doorbell,
  and force a doorbell if it gets too old, checked in ndo_start_xmit()
  every time we would like to not send the doorbell.

2) <maybe> max numbers of packets not yet doorbelled.

But we can not rely on another high resolution timer, since they also
require softirq being processed timely.

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 21:57 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130182243.cpoyoqvxuv2bnn4y@splinter.mtl.com>

On 30.11.2016 19:22, Ido Schimmel wrote:
> On Wed, Nov 30, 2016 at 05:49:56PM +0100, Hannes Frederic Sowa wrote:
>> On 30.11.2016 17:32, Ido Schimmel wrote:
>>> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
>>>> On 30.11.2016 11:09, Jiri Pirko wrote:
>>>>> From: Ido Schimmel <idosch@mellanox.com>
>>>>>
>>>>> Make sure the device has a complete view of the FIB tables by invoking
>>>>> their dump during module init.
>>>>>
>>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>> ---
>>>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
>>>>>  1 file changed, 23 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> index 14bed1d..d176047 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>>>>>  	return NOTIFY_DONE;
>>>>>  }
>>>>>  
>>>>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
>>>>> +{
>>>>> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
>>>>> +
>>>>> +	/* Flush pending FIB notifications and then flush the device's
>>>>> +	 * table before requesting another dump. Do that with RTNL held,
>>>>> +	 * as FIB notification block is already registered.
>>>>> +	 */
>>>>> +	mlxsw_core_flush_owq();
>>>>> +	rtnl_lock();
>>>>> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
>>>>> +	rtnl_unlock();
>>>>> +}
>>>>> +
>>>>>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>>>  {
>>>>> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
>>>>>  	int err;
>>>>>  
>>>>>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
>>>>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>>>  
>>>>>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
>>>>>  	register_fib_notifier(&mlxsw_sp->fib_nb);
>>>>
>>>> Sorry to pick in here again:
>>>>
>>>> There is a race here. You need to protect the registration of the fib
>>>> notifier as well by the sequence counter. Updates here are not ordered
>>>> in relation to this code below.
>>>
>>> You mean updates that can be received after you registered the notifier
>>> and until the dump started? I'm aware of that and that's OK. This
>>> listener should be able to handle duplicates.
>>
>> I am not concerned about duplicates, but about ordering deletes and
>> getting an add from the RCU code you will add the node to hw while it is
>> deleted in the software path. You probably will ignore the delete
>> because nothing is installed in hw and later add the node which was
>> actually deleted but just reordered which happend on another CPU, no?
> 
> Are you referring to reordering in the workqueue? We already covered
> this using an ordered workqueue, which has one context of execution
> system-wide.

Ups, sorry, I missed that mail. Probably read it on the mobile phone and
it became invisible for me later on. Busy day... ;)

The reordering in the workqueue seems fine to me and also still necessary.

Basically, if you delete a node right now the kernel might simply do a
RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers
or synchronization with the reader side. Thus you might get a callback
from the notifier for a delete event on the one CPU and you end up
queueing this fib entry after the delete queue, because the RCU walk
isn't protected by any means.

Looking closer at this series again, I overlooked the fact that you
fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all
orders fetching of fib_seq and thus the RCU dumping after any concurrent
executing fib table update, also the mutex_lock and unlock provide
proper acquire and release fences, so the CPU indeed sees the effect of
a RCU_INIT_POINTER update done on another CPU, because they pair with
the rtnl_unlock which might happen on the other CPU.

My question is if this is a bit of luck and if we should make this
explicit by putting the registration itself under the protection of the
sequence counter. I favor the additional protection, e.g. if we some day
actually we optimize the fib_seq code? Otherwise we might probably
document this fact. :)

>>> I've a follow up patchset that introduces a new event in switchdev
>>> notification chain called SWITCHDEV_SYNC, which is sent when port
>>> netdevs are enslaved / released  from a master device (points in time
>>> where kernel<->device can get out of sync). It will invoke
>>> re-propagation of configuration from different parts of the stack
>>> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
>>> in duplicates.
>>
>> Okay, understood. I wonder how we can protect against accidentally abort
>> calls actually. E.g. if I start to inject routes into my routing domain
>> how can I make sure the box doesn't die after I try to insert enough
>> routes. Do we need to touch quagga etc?
> 
> The whole point of moving abort mechanism to the driver is that the
> system won't die, but instead routing will be done in the kernel. If you
> respect hardware limitations, then there's no reason for abort mechanism
> to kick in.

Quick follow-up question: How can I quickly find out the hw limitations
via the kernel api?

Thanks,
Hannes

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox