Netdev List
* [PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
From: Zhu Yanjun @ 2014-12-16 10:28 UTC
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a logic change similar to the one that commit 6cc7aae
[e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked]
introduces for newer kernels.

When PHY reset is intentionally blocked on 82577/8/9, do not toggle the
LANPHYPC value bit (essentially performing a hard power reset of the
device); otherwise the PHY can be put into an unknown state.

Clean up whitespace in the same function.

[yanjun.zhu: whitespace remains unchanged]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/ich8lan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index c4b2d15..8c7e4aa 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -280,7 +280,8 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
-	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID)) {
+	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID) && 
+		!e1000_check_reset_block(hw)) {
 		/*Set Phy Config Counter to 50msec */
 		ctrl = er32(FEXTNVM3);
 		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
-- 
1.9.1
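
For readers following the condition change, the combined check can be sketched as a small pure function. This is only an illustration under stated assumptions: `should_toggle_lanphypc` and its `reset_blocked` parameter are hypothetical names standing in for the `er32(FWSM)` read and the `e1000_check_reset_block(hw)` call in the patch; only the `E1000_ICH_FWSM_FW_VALID` value is taken from patch 1/5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value taken from patch 1/5 of this series. */
#define E1000_ICH_FWSM_FW_VALID 0x00008000u

/*
 * Hypothetical helper, not part of the driver: returns true only when
 * no valid firmware mode is reported AND PHY reset is not blocked.
 */
static bool should_toggle_lanphypc(uint32_t fwsm, bool reset_blocked)
{
	return !(fwsm & E1000_ICH_FWSM_FW_VALID) && !reset_blocked;
}
```

Before this patch only the first half of the condition was tested, so a blocked PHY reset did not prevent the toggle.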


* [PATCH 2/5] e1000e: workaround EEPROM configuration change on 82579 on kernel 2.6.x
From: Zhu Yanjun @ 2014-12-16 10:28 UTC
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a logic change similar to the one that commit 62bc813
[e1000e: workaround EEPROM configuration change on 82579] introduces
for newer kernels.

An update to the EEPROM on 82579 will extend a delay in hardware to fix an
issue with WoL not working after a G3->S5 transition which is unrelated to
the driver.  However, this extended delay conflicts with nominal operation
of the device when it is initialized by the driver and after every reset
of the hardware (i.e. the driver starts configuring the device before the
hardware is done with its own configuration work).  The workaround for
when the driver is in control of the device is to tell the hardware after
every reset the configuration delay should be the original shorter one.

Some pre-existing variables are renamed generically to be re-used with
new register accesses.

[e1000_toggle_lanphypc_value_ich8lan does not exist; its implementation
is in e1000_init_phy_params_pchlan. The variable renames are not applied.]

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/hw.h      |  1 +
 drivers/net/e1000e/ich8lan.c | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/net/e1000e/hw.h b/drivers/net/e1000e/hw.h
index 11f3b7c..b055d78 100644
--- a/drivers/net/e1000e/hw.h
+++ b/drivers/net/e1000e/hw.h
@@ -60,6 +60,7 @@ enum e1e_registers {
 	E1000_FEXTNVM  = 0x00028, /* Future Extended NVM - RW */
 	E1000_FCT      = 0x00030, /* Flow Control Type - RW */
 	E1000_VET      = 0x00038, /* VLAN Ether Type - RW */
+	E1000_FEXTNVM3 = 0x0003C, /* Future Extended NVM 3 - RW */
 	E1000_ICR      = 0x000C0, /* Interrupt Cause Read - R/clr */
 	E1000_ITR      = 0x000C4, /* Interrupt Throttling Rate - RW */
 	E1000_ICS      = 0x000C8, /* Interrupt Cause Set - WO */
diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index 020657c..c4b2d15 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -108,6 +108,9 @@
 #define E1000_FEXTNVM_SW_CONFIG		1
 #define E1000_FEXTNVM_SW_CONFIG_ICH8M (1 << 27) /* Bit redefined for ICH8M :/ */
 
+#define E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK    0x0C000000
+#define E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC  0x08000000
+
 #define PCIE_ICH8_SNOOP_ALL		PCIE_NO_SNOOP_ALL
 
 #define E1000_ICH_RAR_ENTRIES		7
@@ -278,6 +281,12 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
 	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID)) {
+		/*Set Phy Config Counter to 50msec */
+		ctrl = er32(FEXTNVM3);
+		ctrl &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
+		ctrl |= E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC;
+		ew32(FEXTNVM3, ctrl);
+
 		/*
 		 * The MAC-PHY interconnect may still be in SMBus mode
 		 * after Sx->S0.  Toggle the LANPHYPC Value bit to force
@@ -2685,6 +2694,14 @@ static s32 e1000_reset_hw_ich8lan(struct e1000_hw *hw)
 	ew32(CTRL, (ctrl | E1000_CTRL_RST));
 	msleep(20);
 
+	/* Set Phy Config Counter to 50msec */
+	if (hw->mac.type == e1000_pch2lan) {
+		u32 phycc_reg = er32(FEXTNVM3);
+		phycc_reg &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
+		phycc_reg |= E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC;
+		ew32(FEXTNVM3, phycc_reg);
+	}
+
 	if (!ret_val)
 		e1000_release_swflag_ich8lan(hw);
 
-- 
1.9.1
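
The register update in this patch is a plain read-modify-write of a two-bit field. A minimal sketch of that step, assuming a hypothetical helper `phy_cfg_counter_50msec` that operates on a value already read from FEXTNVM3 (the two constants are the ones the patch adds; the helper itself is not in the driver):

```c
#include <assert.h>
#include <stdint.h>

/* Constants as added to ich8lan.c by this patch. */
#define E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK   0x0C000000u
#define E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC 0x08000000u

/*
 * Hypothetical helper: clear the PHY config counter field, then set it
 * to the 50 msec encoding. All other bits are left untouched.
 */
static uint32_t phy_cfg_counter_50msec(uint32_t fextnvm3)
{
	fextnvm3 &= ~E1000_FEXTNVM3_PHY_CFG_COUNTER_MASK;
	fextnvm3 |= E1000_FEXTNVM3_PHY_CFG_COUNTER_50MSEC;
	return fextnvm3;
}
```

Clearing the mask first matters: OR-ing the new value into a field that already holds 0x04000000 or 0x0C000000 would leave 0x0C000000 instead of the 50 msec encoding.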


* [PATCH 1/5] e1000e: reset MAC-PHY interconnect on 82577/82578
From: Zhu Yanjun @ 2014-12-16 10:28 UTC
  To: netdev, w, zyjzyj2000
  Cc: Zhu Yanjun, Bruce Allan, Jeff Kirsher, David S. Miller
In-Reply-To: <1418725700-31465-1-git-send-email-Yanjun.Zhu@windriver.com>

2.6.x kernels require a logic change similar to the one that commit 6dfaa76
[e1000e: reset MAC-PHY interconnect on 82577/82578] introduces
for newer kernels.

During Sx->S0 transitions, the interconnect between the MAC and PHY on
82577/82578 can remain in SMBus mode instead of transitioning to the
PCIe-like mode required during normal operation.  Toggling the LANPHYPC
Value bit essentially resets the interconnect forcing it to the correct
mode.

After a review of all Intel drivers, several instances were found where
drivers had the incorrect pattern of:
memory mapped write();
delay();

which should always be:
memory mapped write();
write flush(); /* aka memory mapped read */
delay();

Explanation:
The reason for including the flush is that writes can be held
(posted) in PCI/PCIe bridges, but the read always has to complete
synchronously and therefore has to flush all pending writes to a
device.  If a write is held and followed by a delay, the delay
means nothing because the write may not have reached hardware
(maybe not even until the next read).

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
---
 drivers/net/e1000e/defines.h |  2 ++
 drivers/net/e1000e/ich8lan.c | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/drivers/net/e1000e/defines.h b/drivers/net/e1000e/defines.h
index 1190167..52283a6 100644
--- a/drivers/net/e1000e/defines.h
+++ b/drivers/net/e1000e/defines.h
@@ -214,6 +214,8 @@
 #define E1000_CTRL_SPD_1000 0x00000200  /* Force 1Gb */
 #define E1000_CTRL_FRCSPD   0x00000800  /* Force Speed */
 #define E1000_CTRL_FRCDPX   0x00001000  /* Force Duplex */
+#define E1000_CTRL_LANPHYPC_OVERRIDE 0x00010000 /* SW control of LANPHYPC */
+#define E1000_CTRL_LANPHYPC_VALUE    0x00020000 /* SW value of LANPHYPC */
 #define E1000_CTRL_SWDPIN0  0x00040000  /* SWDPIN 0 value */
 #define E1000_CTRL_SWDPIN1  0x00080000  /* SWDPIN 1 value */
 #define E1000_CTRL_SWDPIO0  0x00400000  /* SWDPIN 0 Input or output */
diff --git a/drivers/net/e1000e/ich8lan.c b/drivers/net/e1000e/ich8lan.c
index de39f9a..020657c 100644
--- a/drivers/net/e1000e/ich8lan.c
+++ b/drivers/net/e1000e/ich8lan.c
@@ -88,6 +88,8 @@
 
 
 #define E1000_ICH_FWSM_RSPCIPHY	0x00000040 /* Reset PHY on PCI Reset */
+/* FW established a valid mode */ 
+#define E1000_ICH_FWSM_FW_VALID                0x00008000
 
 #define E1000_ICH_MNG_IAMT_MODE		0x2
 
@@ -260,6 +262,7 @@ static inline void __ew32flash(struct e1000_hw *hw, unsigned long reg, u32 val)
 static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 {
 	struct e1000_phy_info *phy = &hw->phy;
+	u32 ctrl;
 	s32 ret_val = 0;
 
 	phy->addr                     = 1;
@@ -274,6 +277,23 @@ static s32 e1000_init_phy_params_pchlan(struct e1000_hw *hw)
 	phy->ops.write_phy_reg_locked = e1000_write_phy_reg_hv_locked;
 	phy->autoneg_mask             = AUTONEG_ADVERTISE_SPEED_DEFAULT;
 
+	if (!(er32(FWSM) & E1000_ICH_FWSM_FW_VALID)) {
+		/*
+		 * The MAC-PHY interconnect may still be in SMBus mode
+		 * after Sx->S0.  Toggle the LANPHYPC Value bit to force
+		 * the interconnect to PCIe mode, but only if there is no
+		 * firmware present otherwise firmware will have done it.
+		*/
+		ctrl = er32(CTRL);
+		ctrl |=  E1000_CTRL_LANPHYPC_OVERRIDE;
+		ctrl &= ~E1000_CTRL_LANPHYPC_VALUE;
+		ew32(CTRL, ctrl);
+		e1e_flush();
+		udelay(10);
+		ctrl &= ~E1000_CTRL_LANPHYPC_OVERRIDE;
+		ew32(CTRL, ctrl);
+		msleep(50);
+	}
 	/*
 	 * Reset the PHY before any acccess to it.  Doing so, ensures that
 	 * the PHY is in a known good state before we read/write PHY registers.
-- 
1.9.1
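
The write/flush/delay pattern described in the commit message can be illustrated with a toy model. This is a sketch only: `struct toy_bridge` and the `toy_*` functions are hypothetical and simulate a bridge that holds one posted write; they are not driver code, and `toy_read` merely plays the role that `e1e_flush()` plays in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a device register behind a bridge that may post writes. */
struct toy_bridge {
	uint32_t device_reg;   /* value the hardware actually sees */
	uint32_t posted_val;   /* write still held in the bridge */
	bool     write_posted; /* true while a write is in flight */
};

/* A write may be held in the bridge instead of reaching the device. */
static void toy_write(struct toy_bridge *b, uint32_t val)
{
	b->posted_val = val;
	b->write_posted = true;
}

/* A read completes synchronously, so pending writes are pushed out first. */
static uint32_t toy_read(struct toy_bridge *b)
{
	if (b->write_posted) {
		b->device_reg = b->posted_val;
		b->write_posted = false;
	}
	return b->device_reg;
}

/* write(); flush(); a delay started after this is timed from a known point. */
static void toy_write_flush(struct toy_bridge *b, uint32_t val)
{
	toy_write(b, val);
	(void)toy_read(b);
}
```

In the model, a delay placed directly after `toy_write` would run while `device_reg` still holds the old value, which is the situation the commit message warns about.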


* [PATCH V2 0/5] e1000e: fix nic not boot after rebooting
From: Zhu Yanjun @ 2014-12-16 10:28 UTC
  To: netdev, w, zyjzyj2000; +Cc: Zhu Yanjun

With kernel 2.6.x, an e1000e NIC with 82577/8/9 sometimes fails to come
up after a reboot.

When a kernel 2.6.x board with an 82577/8/9 e1000e NIC is rebooted
100 times, the NIC fails to come up normally about 7 to 8 times.

V2:
Following Willy's advice, the wrong upstream commit IDs in the commit
messages of these 5 patches are fixed.

Zhu Yanjun (5):
  e1000e: reset MAC-PHY interconnect on 82577/82578
  e1000e: workaround EEPROM configuration change on 82579 on kernel
    2.6.x
  e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
  e1000e: update workaround for 82579 intermittently disabled during
    S0->Sx
  e1000e: cleanup use of check_reset_block function pointer

 drivers/net/e1000e/defines.h |  2 ++
 drivers/net/e1000e/hw.h      |  1 +
 drivers/net/e1000e/ich8lan.c | 37 +++++++++++++++++++++++++++++++++++++
 3 files changed, 40 insertions(+)

-- 
1.9.1


* Re: [PATCH net 1/2] net: dsa: handle non-existing PHYs on switch internal bus
From: Andrew Lunn @ 2014-12-16 10:24 UTC
  To: Florian Fainelli
  Cc: netdev, davem, computersforpeace, andrey.volkov, Andrew Lunn
In-Reply-To: <1418330956-17151-2-git-send-email-f.fainelli@gmail.com>

On Thu, Dec 11, 2014 at 12:49:15PM -0800, Florian Fainelli wrote:
> In case there is no PHY at the designated address on the internal
> switch, we would basically de-reference a null pointer here:
> 
> dsa_slave_phy_setup(...)
> {
> 	p->phy = ds->slave_mii_bus->phy_map[p->port];
> 	phy_connect_direct(slave_dev, p->phy, dsa_slave_adjust_link,
> 				      ^------
> 
> This can be triggered when the platform configuration (platform_data or
> Device Tree) indicates there should be a PHY device at this address, but
> the HW is non-responsive, such that we cannot attach a PHY device at
> this specific location.
> 
> Fix this by checking the return value prior to calling
> phy_connect_direct().
> 
> CC: Andrew Lunn <andrew@lunn.ch>
> Fixes: b31f65fb4383 ("net: dsa: slave: Fix autoneg for phys on switch MDIO bus")
> Reported-by: Brian Norris <computersforpeace@gmail.com>
> Signed-off-by: Andrey Volkov <andrey.volkov@nexvision.fr>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Acked-by: Andrew Lunn <andrew@lunn.ch>

	  Andrew


> ---
>  net/dsa/slave.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index ab03e00ffe8f..0efaab1e408b 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -491,6 +491,9 @@ static void dsa_slave_phy_setup(struct dsa_slave_priv *p,
>  	 */
>  	if (!p->phy) {
>  		p->phy = ds->slave_mii_bus->phy_map[p->port];
> +		if (!p->phy)
> +			return;
> +
>  		phy_connect_direct(slave_dev, p->phy, dsa_slave_adjust_link,
>  				   p->phy_interface);
>  	} else {
> -- 
> 2.1.0
> 
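
The fix quoted above amounts to checking the looked-up PHY pointer before using it. A minimal stand-alone sketch of that pattern, assuming hypothetical `toy_*` types in place of the real `phy_map` and `phy_connect_direct()` (the real function returns void and simply returns early; the error code here is an illustration choice):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PORTS 4

/* Hypothetical stand-ins for the PHY device and the slave MII bus. */
struct toy_phy { int addr; };
struct toy_bus { struct toy_phy *phy_map[TOY_MAX_PORTS]; };

/*
 * Look up the PHY for a port and hand it back only if it exists.
 * Returns 0 on success, -1 if the port is out of range or no PHY
 * was found at that address.
 */
static int toy_phy_setup(const struct toy_bus *bus, int port,
			 struct toy_phy **out)
{
	struct toy_phy *phy;

	if (port < 0 || port >= TOY_MAX_PORTS)
		return -1;

	phy = bus->phy_map[port];
	if (!phy)
		return -1; /* nothing to connect, avoid the NULL dereference */

	*out = phy;
	return 0;
}
```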


* Re: [PATCH net-next v9 0/3] add hisilicon hip04 ethernet driver
From: Ding Tianhong @ 2014-12-16 10:08 UTC
  To: Arnd Bergmann
  Cc: linux-arm-kernel, zhangfei.gao, davem, linux, f.fainelli,
	sergei.shtylyov, mark.rutland, David.Laight, eric.dumazet, xuwei5,
	netdev, devicetree
In-Reply-To: <26790181.hAWQCCSkRa@wuerfel>

On 2014/12/16 16:54, Arnd Bergmann wrote:
> On Tuesday 16 December 2014 15:57:21 Ding Tianhong wrote:
>> On 2014/12/15 21:29, Arnd Bergmann wrote:
>>> On Thursday 11 December 2014 19:42:27 Ding Tianhong wrote:
>>> @@ -381,57 +392,37 @@ static void hip04_tx_reclaim(struct net_device *ndev, bool force)
>>>  		dev_kfree_skb(priv->tx_skb[tx_tail]);
>>>  		priv->tx_skb[tx_tail] = NULL;
>>>  		tx_tail = TX_NEXT(tx_tail);
>>> -		priv->tx_count--;
>>> -
>>> -		if (priv->tx_count <= 0)
>>> -			break;
>>> +		count--;
>>>  	}
>>>  
> ...
>>> -	queue_delayed_work(priv->wq, &priv->tx_clean_task, delta_in_ticks);
>>> +	return count;
>>
>> I think this should return pkts_compl: the loop may break early, so
>> pkts_compl may be smaller than count.
> 
> The calling convention I used is to return the packets that are remaining
> on the queue. Only if that is nonzero do we need to reschedule the timer.
> 

OK, agreed.

>> We also need to add netif_tx_lock() to protect this function against concurrent access.
> 
> Oh, did I miss something? The idea was that the start_xmit function only updates
> the tx_head pointer and reads the tx_tail, while the tx_reclaim function does
> the reverse, and writes to a different cache line, in order to allow a lockless
> queue traversal.
> 
> Can you point to a specific struct member that still needs to be protected by
> the lock? Did I miss a race that would allow both functions to exit with
> the timer disabled and entries left on the queue?
> 
OK, got it, no problem.

>>> @@ -623,8 +648,6 @@ static int hip04_mac_stop(struct net_device *ndev)
>>>  	struct hip04_priv *priv = netdev_priv(ndev);
>>>  	int i;
>>>  
>>> -	cancel_delayed_work_sync(&priv->tx_clean_task);
>>> -
>> I think we should cancel the hrtimer on close and queue the timer on open.
> 
> I was expecting that force-cleaning up the tx queue would be enough for that.
> Is it not?
> 
> I suppose it can't hurt to cancel the timer here anyway, and maybe use
> WARN_ON() if it's still active.
> 

OK, I found there is no need to worry about this: when the device is closed, NAPI is disabled and the timer will not be entered again.

> Starting the timer after opening seems wrong though: at that point there are
> no packets on the queue yet. The timer should always start ticking at the
> exact point when the first packet is put on the queue while the timer is
> not already pending.
> 
Ok.

>>>  	napi_disable(&priv->napi);
>>>  	netif_stop_queue(ndev);
>>>  	hip04_mac_disable(ndev);
>>> @@ -725,6 +748,7 @@ static int hip04_mac_probe(struct platform_device *pdev)
>>>  	struct hip04_priv *priv;
>>>  	struct resource *res;
>>>  	unsigned int irq;
>>> +	ktime_t txtime;
>>>  	int ret;
>>>  
>>>  	ndev = alloc_etherdev(sizeof(struct hip04_priv));
>>> @@ -751,6 +775,21 @@ static int hip04_mac_probe(struct platform_device *pdev)
>>>  	priv->port = arg.args[0];
>>>  	priv->chan = arg.args[1] * RX_DESC_NUM;
>>>  
>>> +	hrtimer_init(&priv->tx_coalesce_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>>> +
>>> +	/*
>>> +	 * BQL will try to keep the TX queue as short as possible, but it can't
>>> +	 * be faster than tx_coalesce_usecs, so we need a fast timeout here,
>>> +	 * but also long enough to gather up enough frames to ensure we don't
>>> +	 * get more interrupts than necessary.
>>> +	 * 200us is enough for 16 frames of 1500 bytes at gigabit ethernet rate
>>> +	 */
>>> +	priv->tx_coalesce_frames = TX_DESC_NUM * 3 / 4;
>>> +	priv->tx_coalesce_usecs = 200;
>>> +	/* allow timer to fire after half the time at the earliest */
>>> +	txtime = ktime_set(0, priv->tx_coalesce_usecs * NSEC_PER_USEC / 2);
>>> +	hrtimer_set_expires_range(&priv->tx_coalesce_timer, txtime, txtime);
>>> +
>>
>> I think this is missing the line:
>>  priv->tx_coalesce_timer.function = tx_done;
> 
> Yes, good point.
> 

I will send v10 when net-next opens again, and will test this driver in the meantime. Thanks a lot.

Ding

> 	Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> .
> 
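
The "200us is enough for 16 frames of 1500 bytes" figure in the quoted comment can be checked with simple arithmetic. The helper below is a hypothetical sketch, not driver code; whether the 1500 bytes are meant as payload or as the whole frame is not stated in the thread, so both readings are noted.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: time in nanoseconds needed to put `frames`
 * frames of `bytes` bytes each on a link running at `mbps` Mbit/s.
 * bits / (mbps * 10^6 bit/s) seconds == bits * 1000 / mbps nanoseconds.
 */
static uint64_t wire_time_ns(uint64_t frames, uint64_t bytes, uint64_t mbps)
{
	return frames * bytes * 8u * 1000u / mbps;
}
```

At 1000 Mbit/s, 16 frames of 1500 bytes take 192000 ns (192 us). Counting the usual Ethernet overhead per frame (14-byte header, 4-byte FCS, 8-byte preamble, 12-byte inter-frame gap, 1538 bytes in total) gives 196864 ns, which is still just under 200 us.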


* Re: [PATCH net-next v9 0/3] add hisilicon hip04 ethernet driver
From: Arnd Bergmann @ 2014-12-16  8:54 UTC
  To: Ding Tianhong
  Cc: linux-arm-kernel, zhangfei.gao, davem, linux, f.fainelli,
	sergei.shtylyov, mark.rutland, David.Laight, eric.dumazet, xuwei5,
	netdev, devicetree
In-Reply-To: <548FE5E1.1090300@huawei.com>

On Tuesday 16 December 2014 15:57:21 Ding Tianhong wrote:
> On 2014/12/15 21:29, Arnd Bergmann wrote:
> > On Thursday 11 December 2014 19:42:27 Ding Tianhong wrote:
> > @@ -381,57 +392,37 @@ static void hip04_tx_reclaim(struct net_device *ndev, bool force)
> >  		dev_kfree_skb(priv->tx_skb[tx_tail]);
> >  		priv->tx_skb[tx_tail] = NULL;
> >  		tx_tail = TX_NEXT(tx_tail);
> > -		priv->tx_count--;
> > -
> > -		if (priv->tx_count <= 0)
> > -			break;
> > +		count--;
> >  	}
> >  
...
> > -	queue_delayed_work(priv->wq, &priv->tx_clean_task, delta_in_ticks);
> > +	return count;
> 
> I think this should return pkts_compl: the loop may break early, so
> pkts_compl may be smaller than count.

The calling convention I used is to return the packets that are remaining
on the queue. Only if that is nonzero do we need to reschedule the timer.

> We also need to add netif_tx_lock() to protect this function against concurrent access.

Oh, did I miss something? The idea was that the start_xmit function only updates
the tx_head pointer and reads the tx_tail, while the tx_reclaim function does
the reverse, and writes to a different cache line, in order to allow a lockless
queue traversal.

Can you point to a specific struct member that still needs to be protected by
the lock? Did I miss a race that would allow both functions to exit with
the timer disabled and entries left on the queue?

> > @@ -623,8 +648,6 @@ static int hip04_mac_stop(struct net_device *ndev)
> >  	struct hip04_priv *priv = netdev_priv(ndev);
> >  	int i;
> >  
> > -	cancel_delayed_work_sync(&priv->tx_clean_task);
> > -
> I think we should cancel the hrtimer on close and queue the timer on open.

I was expecting that force-cleaning up the tx queue would be enough for that.
Is it not?

I suppose it can't hurt to cancel the timer here anyway, and maybe use
WARN_ON() if it's still active.

Starting the timer after opening seems wrong though: at that point there are
no packets on the queue yet. The timer should always start ticking at the
exact point when the first packet is put on the queue while the timer is
not already pending.

> >  	napi_disable(&priv->napi);
> >  	netif_stop_queue(ndev);
> >  	hip04_mac_disable(ndev);
> > @@ -725,6 +748,7 @@ static int hip04_mac_probe(struct platform_device *pdev)
> >  	struct hip04_priv *priv;
> >  	struct resource *res;
> >  	unsigned int irq;
> > +	ktime_t txtime;
> >  	int ret;
> >  
> >  	ndev = alloc_etherdev(sizeof(struct hip04_priv));
> > @@ -751,6 +775,21 @@ static int hip04_mac_probe(struct platform_device *pdev)
> >  	priv->port = arg.args[0];
> >  	priv->chan = arg.args[1] * RX_DESC_NUM;
> >  
> > +	hrtimer_init(&priv->tx_coalesce_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > +
> > +	/*
> > +	 * BQL will try to keep the TX queue as short as possible, but it can't
> > +	 * be faster than tx_coalesce_usecs, so we need a fast timeout here,
> > +	 * but also long enough to gather up enough frames to ensure we don't
> > +	 * get more interrupts than necessary.
> > +	 * 200us is enough for 16 frames of 1500 bytes at gigabit ethernet rate
> > +	 */
> > +	priv->tx_coalesce_frames = TX_DESC_NUM * 3 / 4;
> > +	priv->tx_coalesce_usecs = 200;
> > +	/* allow timer to fire after half the time at the earliest */
> > +	txtime = ktime_set(0, priv->tx_coalesce_usecs * NSEC_PER_USEC / 2);
> > +	hrtimer_set_expires_range(&priv->tx_coalesce_timer, txtime, txtime);
> > +
> 
> I think this is missing the line:
>  priv->tx_coalesce_timer.function = tx_done;

Yes, good point.

	Arnd
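
The head/tail split described in this reply (xmit advances only tx_head, reclaim advances only tx_tail) can be sketched with plain index arithmetic. Everything below is hypothetical: `TOY_RING_SIZE` is assumed to be a power of two, the count uses a mask, which is an assumption of this sketch and differs from the `% (TX_DESC_NUM - 1)` expression in the quoted patch, and the memory barriers the real code needs between the two contexts are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RING_SIZE 8u /* assumed power of two */

/* Advance an index by one slot, wrapping at the end of the ring. */
static unsigned int toy_next(unsigned int idx)
{
	return (idx + 1u) & (TOY_RING_SIZE - 1u);
}

/*
 * Entries in flight. The producer owns head and the consumer owns tail,
 * so each side writes one index and only reads the other.
 */
static unsigned int toy_count(unsigned int head, unsigned int tail)
{
	return (head - tail) & (TOY_RING_SIZE - 1u);
}

/* One slot stays empty so that head == tail always means "empty". */
static bool toy_full(unsigned int head, unsigned int tail)
{
	return toy_count(head, tail) == TOY_RING_SIZE - 1u;
}
```

Keeping one slot unused is what lets the two sides tell a full ring from an empty one without a shared counter, which matches the `count == (TX_DESC_NUM - 1)` test in the quoted xmit path.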


* Re: [PATCH net-next v9 0/3] add hisilicon hip04 ethernet driver
From: Ding Tianhong @ 2014-12-16  7:57 UTC
  To: Arnd Bergmann, linux-arm-kernel
  Cc: zhangfei.gao, davem, linux, f.fainelli, sergei.shtylyov,
	mark.rutland, David.Laight, eric.dumazet, xuwei5, netdev,
	devicetree
In-Reply-To: <2708715.thQWqYuW6a@wuerfel>

On 2014/12/15 21:29, Arnd Bergmann wrote:
> On Thursday 11 December 2014 19:42:27 Ding Tianhong wrote:
>> v9:
>> - There are no tx completion interrupts to free DMAd Tx packets, which means that
>>   we rely on new tx packets arriving to run the destructors of completed packets,
>>   which opens up space in their sockets' send queues. Sometimes we don't get such
>>   new packets, causing Tx to stall; a single UDP transmitter is a good example of
>>   this situation. So we need a cleanup workqueue to reclaim completed packets;
>>   the workqueue will only free the last packets, which have already been pending for several jiffies.
>>   Also includes some formatting cleanups.
> 
> You must have missed the reply in which David Miller explained why
> you can't call skb_orphan and rely on an occasional cleanup of the
> queue. Please use something like the patch below:
> 
> - drop the broken skb_orphan call
> - drop the workqueue
> - batch cleanup based on tx_coalesce_frames/usecs for better throughput
> - use a reasonable default tx timeout (200us, could be shortened
>   based on measurements) with a range timer
> - fix napi poll function return value
> - use a lockless queue for cleanup
> 

OK, agreed; see my comments below.

> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> 
> Feel free to fold this into your patch rather than keeping it separate.
> Completely untested, probably contains some bugs. Please ask if you
> have questions about this.
> 
> diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
> index 9d37b670db6a..d85c5287654e 100644
> --- a/drivers/net/ethernet/hisilicon/hip04_eth.c
> +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
> @@ -12,6 +12,7 @@
>  #include <linux/etherdevice.h>
>  #include <linux/platform_device.h>
>  #include <linux/interrupt.h>
> +#include <linux/ktime.h>
>  #include <linux/of_address.h>
>  #include <linux/phy.h>
>  #include <linux/of_mdio.h>
> @@ -157,9 +158,11 @@ struct hip04_priv {
>  	struct sk_buff *tx_skb[TX_DESC_NUM];
>  	dma_addr_t tx_phys[TX_DESC_NUM];
>  	unsigned int tx_head;
> -	unsigned int tx_tail;
> -	int tx_count;
> -	unsigned long last_tx;
> +
> +	/* FIXME: make these adjustable through ethtool */
> +	int tx_coalesce_frames;
> +	int tx_coalesce_usecs;
> +	struct hrtimer tx_coalesce_timer;
>  
>  	unsigned char *rx_buf[RX_DESC_NUM];
>  	dma_addr_t rx_phys[RX_DESC_NUM];
> @@ -171,10 +174,15 @@ struct hip04_priv {
>  	struct regmap *map;
>  	struct work_struct tx_timeout_task;
>  
> -	struct workqueue_struct *wq;
> -	struct delayed_work tx_clean_task;
> +	/* written only by tx cleanup */
> +	unsigned int tx_tail ____cacheline_aligned_in_smp;
>  };
>  
> +static inline unsigned int tx_count(unsigned int head, unsigned int tail)
> +{
> +	return (head - tail) % (TX_DESC_NUM - 1);
> +}
> +
>  static void hip04_config_port(struct net_device *ndev, u32 speed, u32 duplex)
>  {
>  	struct hip04_priv *priv = netdev_priv(ndev);
> @@ -351,18 +359,21 @@ static int hip04_set_mac_address(struct net_device *ndev, void *addr)
>  	return 0;
>  }
>  
> -static void hip04_tx_reclaim(struct net_device *ndev, bool force)
> +static int hip04_tx_reclaim(struct net_device *ndev, bool force)
>  {
>  	struct hip04_priv *priv = netdev_priv(ndev);
>  	unsigned tx_tail = priv->tx_tail;
>  	struct tx_desc *desc;
>  	unsigned int bytes_compl = 0, pkts_compl = 0;
> +	unsigned int count;
>  
> -	if (priv->tx_count == 0)
> +	smp_rmb();
> +	count = tx_count(ACCESS_ONCE(priv->tx_head), tx_tail);
> +	if (count == 0)
>  		goto out;
>  
> -	while ((tx_tail != priv->tx_head) || (priv->tx_count == TX_DESC_NUM)) {
> -		desc = &priv->tx_desc[priv->tx_tail];
> +	while (count) {
> +		desc = &priv->tx_desc[tx_tail];
>  		if (desc->send_addr != 0) {
>  			if (force)
>  				desc->send_addr = 0;
> @@ -381,57 +392,37 @@ static void hip04_tx_reclaim(struct net_device *ndev, bool force)
>  		dev_kfree_skb(priv->tx_skb[tx_tail]);
>  		priv->tx_skb[tx_tail] = NULL;
>  		tx_tail = TX_NEXT(tx_tail);
> -		priv->tx_count--;
> -
> -		if (priv->tx_count <= 0)
> -			break;
> +		count--;
>  	}
>  
>  	priv->tx_tail = tx_tail;
> +	smp_wmb(); /* Ensure tx_tail visible to xmit */
>  
> -	/* Ensure tx_tail & tx_count visible to xmit */
> -	smp_mb();
>  out:
> -
>  	if (pkts_compl || bytes_compl)
>  		netdev_completed_queue(ndev, pkts_compl, bytes_compl);
>  
> -	if (unlikely(netif_queue_stopped(ndev)) &&
> -	    (priv->tx_count < TX_DESC_NUM))
> +	if (unlikely(netif_queue_stopped(ndev)) && (count < (TX_DESC_NUM - 1)))
>  		netif_wake_queue(ndev);
> -}
>  
> -static void hip04_tx_clean_monitor(struct work_struct *work)
> -{
> -	struct hip04_priv *priv = container_of(work, struct hip04_priv,
> -					       tx_clean_task.work);
> -	struct net_device *ndev = priv->ndev;
> -	int delta_in_ticks = msecs_to_jiffies(1000);
> -
> -	if (!time_in_range(jiffies, priv->last_tx,
> -			   priv->last_tx + delta_in_ticks)) {
> -		netif_tx_lock(ndev);
> -		hip04_tx_reclaim(ndev, false);
> -		netif_tx_unlock(ndev);
> -	}
> -	queue_delayed_work(priv->wq, &priv->tx_clean_task, delta_in_ticks);
> +	return count;

I think this should return pkts_compl: the loop may break early, so pkts_compl may be smaller than count.
We also need to add netif_tx_lock() to protect this function against concurrent access.

>  }
>  
>  static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  {
>  	struct hip04_priv *priv = netdev_priv(ndev);
>  	struct net_device_stats *stats = &ndev->stats;
> -	unsigned int tx_head = priv->tx_head;
> +	unsigned int tx_head = priv->tx_head, count;
>  	struct tx_desc *desc = &priv->tx_desc[tx_head];
>  	dma_addr_t phys;
>  
> -	if (priv->tx_count >= TX_DESC_NUM) {
> +	smp_rmb();
> +	count = tx_count(tx_head, ACCESS_ONCE(priv->tx_tail));
> +	if (count == (TX_DESC_NUM - 1)) {
>  		netif_stop_queue(ndev);
>  		return NETDEV_TX_BUSY;
>  	}
>  
> -	hip04_tx_reclaim(ndev, false);
> -
>  	phys = dma_map_single(&ndev->dev, skb->data, skb->len, DMA_TO_DEVICE);
>  	if (dma_mapping_error(&ndev->dev, phys)) {
>  		dev_kfree_skb(skb);
> @@ -447,20 +438,33 @@ static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  	desc->wb_addr = cpu_to_be32(phys);
>  	skb_tx_timestamp(skb);
>  
> -	/* Don't wait up for transmitted skbs to be freed. */
> -	skb_orphan(skb);
> -
> +	/* FIXME: eliminate this mmio access if xmit_more is set */
>  	hip04_set_xmit_desc(priv, phys);
>  	priv->tx_head = TX_NEXT(tx_head);
> +	count++;
>  	netdev_sent_queue(ndev, skb->len);
>  
>  	stats->tx_bytes += skb->len;
>  	stats->tx_packets++;
> -	priv->tx_count++;
> -	priv->last_tx = jiffies;
>  
> -	/* Ensure tx_head & tx_count update visible to tx reclaim */
> -	smp_mb();
> +	/* Ensure tx_head update visible to tx reclaim */
> +	smp_wmb();
> +
> +	/* queue is getting full, better start cleaning up now */
> +	if (count >= priv->tx_coalesce_frames) {
> +		if (napi_schedule_prep(&priv->napi)) {
> +			/* disable rx interrupt and timer */
> +			priv->reg_inten &= ~(RCV_INT);
> +			writel_relaxed(DEF_INT_MASK & ~RCV_INT,
> +				       priv->base + PPE_INTEN);
> +			hrtimer_cancel(&priv->tx_coalesce_timer);
> +			__napi_schedule(&priv->napi);
> +		}
> +	} else if (!hrtimer_is_queued(&priv->tx_coalesce_timer)) {
> +		/* cleanup not pending yet, start a new timer */
> +		hrtimer_start_expires(&priv->tx_coalesce_timer,
> +				      HRTIMER_MODE_REL);
> +	}
>  
>  	return NETDEV_TX_OK;
>  }
> @@ -477,6 +481,7 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
>  	bool last = false;
>  	dma_addr_t phys;
>  	int rx = 0;
> +	int tx_remaining;
>  	u16 len;
>  	u32 err;
>  
> @@ -513,11 +518,11 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
>  
>  		buf = netdev_alloc_frag(priv->rx_buf_size);
>  		if (!buf)
> -			return -ENOMEM;
> +			goto done;
>  		phys = dma_map_single(&ndev->dev, buf,
>  				      RX_BUF_SIZE, DMA_FROM_DEVICE);
>  		if (dma_mapping_error(&ndev->dev, phys))
> -			return -EIO;
> +			goto done;
>  		priv->rx_buf[priv->rx_head] = buf;
>  		priv->rx_phys[priv->rx_head] = phys;
>  		hip04_set_recv_desc(priv, phys);
> @@ -537,6 +542,11 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
>  	}
>  	napi_complete(napi);
>  done:
> +	/* clean up tx descriptors and start a new timer if necessary */
> +	tx_remaining = hip04_tx_reclaim(ndev, false);
> +	if (rx < budget && tx_remaining)
> +		hrtimer_start_expires(&priv->tx_coalesce_timer, HRTIMER_MODE_REL);
> +
>  	return rx;
>  }
>  
> @@ -547,6 +557,9 @@ static irqreturn_t hip04_mac_interrupt(int irq, void *dev_id)
>  	struct net_device_stats *stats = &ndev->stats;
>  	u32 ists = readl_relaxed(priv->base + PPE_INTSTS);
>  
> +	if (!ists)
> +		return IRQ_NONE;
> +
>  	writel_relaxed(DEF_INT_MASK, priv->base + PPE_RINT);
>  
>  	if (unlikely(ists & DEF_INT_ERR)) {
> @@ -560,16 +573,32 @@ static irqreturn_t hip04_mac_interrupt(int irq, void *dev_id)
>  		}
>  	}
>  
> -	if (ists & RCV_INT) {
> +	if (ists & RCV_INT && napi_schedule_prep(&priv->napi)) {
>  		/* disable rx interrupt */
>  		priv->reg_inten &= ~(RCV_INT);
> -		writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
> -		napi_schedule(&priv->napi);
> +		writel_relaxed(DEF_INT_MASK & ~RCV_INT, priv->base + PPE_INTEN);
> +		hrtimer_cancel(&priv->tx_coalesce_timer);
> +		__napi_schedule(&priv->napi);
>  	}
>  
>  	return IRQ_HANDLED;
>  }
>  
> +enum hrtimer_restart tx_done(struct hrtimer *hrtimer)
> +{
> +	struct hip04_priv *priv;
> +	priv = container_of(hrtimer, struct hip04_priv, tx_coalesce_timer);
> +
> +	if (napi_schedule_prep(&priv->napi)) {
> +		/* disable rx interrupt */
> +		priv->reg_inten &= ~(RCV_INT);
> +		writel_relaxed(DEF_INT_MASK & ~RCV_INT, priv->base + PPE_INTEN);
> +		__napi_schedule(&priv->napi);
> +	}
> +
> +	return HRTIMER_NORESTART;
> +}
> +
>  static void hip04_adjust_link(struct net_device *ndev)
>  {
>  	struct hip04_priv *priv = netdev_priv(ndev);
> @@ -589,7 +618,6 @@ static int hip04_mac_open(struct net_device *ndev)
>  	priv->rx_head = 0;
>  	priv->tx_head = 0;
>  	priv->tx_tail = 0;
> -	priv->tx_count = 0;
>  	hip04_reset_ppe(priv);
>  
>  	for (i = 0; i < RX_DESC_NUM; i++) {
> @@ -612,9 +640,6 @@ static int hip04_mac_open(struct net_device *ndev)
>  	hip04_mac_enable(ndev);
>  	napi_enable(&priv->napi);
>  
> -	INIT_DELAYED_WORK(&priv->tx_clean_task, hip04_tx_clean_monitor);
> -	queue_delayed_work(priv->wq, &priv->tx_clean_task, 0);
> -
>  	return 0;
>  }
>  
> @@ -623,8 +648,6 @@ static int hip04_mac_stop(struct net_device *ndev)
>  	struct hip04_priv *priv = netdev_priv(ndev);
>  	int i;
>  
> -	cancel_delayed_work_sync(&priv->tx_clean_task);
> -
I think we should cancel the hrtimer on close and queue the timer on open.

>  	napi_disable(&priv->napi);
>  	netif_stop_queue(ndev);
>  	hip04_mac_disable(ndev);
> @@ -725,6 +748,7 @@ static int hip04_mac_probe(struct platform_device *pdev)
>  	struct hip04_priv *priv;
>  	struct resource *res;
>  	unsigned int irq;
> +	ktime_t txtime;
>  	int ret;
>  
>  	ndev = alloc_etherdev(sizeof(struct hip04_priv));
> @@ -751,6 +775,21 @@ static int hip04_mac_probe(struct platform_device *pdev)
>  	priv->port = arg.args[0];
>  	priv->chan = arg.args[1] * RX_DESC_NUM;
>  
> +	hrtimer_init(&priv->tx_coalesce_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +
> +	/*
> +	 * BQL will try to keep the TX queue as short as possible, but it can't
> +	 * be faster than tx_coalesce_usecs, so we need a fast timeout here,
> +	 * but also long enough to gather up enough frames to ensure we don't
> +	 * get more interrupts than necessary.
> +	 * 200us is enough for 16 frames of 1500 bytes at gigabit ethernet rate
> +	 */
> +	priv->tx_coalesce_frames = TX_DESC_NUM * 3 / 4;
> +	priv->tx_coalesce_usecs = 200;
> +	/* allow timer to fire after half the time at the earliest */
> +	txtime = ktime_set(0, priv->tx_coalesce_usecs * NSEC_PER_USEC / 2);
> +	hrtimer_set_expires_range(&priv->tx_coalesce_timer, txtime, txtime);
> +

I think this is missing the line:
 priv->tx_coalesce_timer.function = tx_done;
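
As a side check on the numbers in the comment quoted above, here is a small
user-space sketch (not driver code) of the arithmetic behind the 200us window.
The helper names are made up for illustration, and the wire-time estimate
ignores preamble, FCS and inter-frame gap.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Wire time in nanoseconds for `frames` frames of `bytes` bytes each at
 * `bits_per_sec`. Assumption: preamble, FCS and inter-frame gap are ignored.
 */
static uint64_t wire_time_ns(uint64_t frames, uint64_t bytes,
			     uint64_t bits_per_sec)
{
	assert(bits_per_sec > 0);
	return frames * bytes * 8u * 1000000000ull / bits_per_sec;
}

/* Half of the coalescing window in ns, mirroring usecs * NSEC_PER_USEC / 2. */
static uint64_t half_window_ns(uint64_t usecs)
{
	return usecs * 1000ull / 2;
}
```

Under these assumptions, wire_time_ns(16, 1500, 1000000000) gives 192000 ns,
just under the 200us window, and half_window_ns(200) gives 100000 ns, which is
the value the patch passes to ktime_set().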


Regards
Ding
>  	priv->map = syscon_node_to_regmap(arg.np);
>  	if (IS_ERR(priv->map)) {
>  		dev_warn(d, "no syscon hisilicon,hip04-ppe\n");
> @@ -788,12 +827,6 @@ static int hip04_mac_probe(struct platform_device *pdev)
>  		}
>  	}
>  
> -	priv->wq = create_singlethread_workqueue(ndev->name);
> -	if (!priv->wq) {
> -		ret = -ENOMEM;
> -		goto init_fail;
> -	}
> -
>  	INIT_WORK(&priv->tx_timeout_task, hip04_tx_timeout_task);
>  
>  	ether_setup(ndev);
> @@ -848,8 +881,6 @@ static int hip04_remove(struct platform_device *pdev)
>  	free_irq(ndev->irq, ndev);
>  	of_node_put(priv->phy_node);
>  	cancel_work_sync(&priv->tx_timeout_task);
> -	if (priv->wq)
> -		destroy_workqueue(priv->wq);
>  	free_netdev(ndev);
>  
>  	return 0;
> 
> 
> .
> 


* Re: [PATCH net 0/3] net: broadcom: fix FIXED_PHY dependencies
From: David Miller @ 2014-12-16  5:57 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev
In-Reply-To: <1418666235-3109-1-git-send-email-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Mon, 15 Dec 2014 09:57:12 -0800

> This patch series removes the bogus "select FIXED_PHY if FOO=y" that
> I have been using in GENET, SYSTEMPORT and the SF2 DSA switch
> driver.

Series applied, thanks Florian.


* Re: errors in alignment changes..
From: Simon Horman @ 2014-12-16  5:33 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: David Miller, mitsuhiro.kimura.kc, netdev
In-Reply-To: <548AD8D0.4040207@cogentembedded.com>

On Fri, Dec 12, 2014 at 03:00:16PM +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 12/12/2014 7:30 AM, Simon Horman wrote:
> 
> >>From: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
> >>Date: Thu, 11 Dec 2014 01:08:07 +0300
> 
> >>>    I guess we can just do:
> 
> >>>		rxdesc->addr = dma_map_single(...);
> 
> >>Best not to leave a potentially invalid DMA address in a
> >>receive descriptor the chip can potentially fetch and
> >>look at.
> 
> >>That's why I said to put it into a local variable and
> >>check for errors first.
> 
> >Hi Dave,
> 
> >this patch ending up in net is partially my fault.
> 
> >Sergei, do you have time to address David's concerns in relation to this
> >patch?
> 
>    No, I probably don't.
> 
> >If not I would like to suggest reverting it for now.
> 
>    Why? The patch does what it was intended for. Getting rid of
> virt_to_phys() calls is a separate issue, IMO.

Thanks, that is fine by me.


* RE: Bug: mv643xxx fails with highmem
From: fugang.duan @ 2014-12-16  2:19 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: David Miller, Fabio.Estevam@freescale.com,
	ezequiel.garcia@free-electrons.com, netdev@vger.kernel.org
In-Reply-To: <20141215180430.GK11285@n2100.arm.linux.org.uk>

From: Russell King - ARM Linux <linux@arm.linux.org.uk> Sent: Tuesday, December 16, 2014 2:05 AM
> To: Duan Fugang-B38611
> Cc: David Miller; Estevam Fabio-R49496;
> ezequiel.garcia@free-electrons.com; netdev@vger.kernel.org
> Subject: Re: Bug: mv643xxx fails with highmem
> 
> On Fri, Dec 12, 2014 at 05:34:01AM +0000, fugang.duan@freescale.com wrote:
> > I will submit one patch to fix the issue.
> 
> There's more bugs in the FEC driver... here's the relevant bits:
> 
> static void
> fec_enet_tx_queue(struct net_device *ndev, u16 queue_id) {
>         bdp = txq->dirty_tx;
> 
>         bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
> 
>         while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
>                 /* current queue is empty */
>                 if (bdp == txq->cur_tx)
>                         break;
> 
>                 skb = txq->tx_skbuff[index];
>                 txq->tx_skbuff[index] = NULL;
>                 if (!IS_TSO_HEADER(txq, bdp->cbd_bufaddr))
>                         dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
>                                         bdp->cbd_datlen, DMA_TO_DEVICE);
>                 bdp->cbd_bufaddr = 0;
>                 if (!skb) {
>                         bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
>                         continue;
>                 }
> ...
>                 txq->dirty_tx = bdp;
>                 bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
>         }
> 
> Consider the following code path:
> - we enter this function
> - get the dirty_tx pointer
> - move to the next descriptor (which we'll call descriptor A)
> - next descriptor indicates that TX_READY = 0
> - bdp != txq->cur_tx
> - we unmap if needed
> - we set bdp->cbd_bufaddr = 0
> - assume skb is NULL, so we move to the next descriptor (we'll call this
> B)
> - next descriptor _may_ have TX_READY = 1
> - we break out of the loop, and return
> 
> Some time later, we re-enter:
> - get the dirty_tx pointer
> - move to the next descriptor (which is descriptor A above)
> - next descriptor indicates that TX_READY = 0
> - bdp != txq->cur_tx
> - we call dma_unmap_single(..., bdp->cbd_bufaddr, which we previously
> zeroed
>   - the DMA API debugging complains that FEC is unmapping memory which it
>     doesn't own
> 
> Unfortunately, this does appear to happen - from a paste from Jon
> Nettleton from iMX6Q:
> 
>  32. [   45.033001] unmapping this address 0x0 size 66
>  33. [   45.037470] ------------[ cut here ]------------
>  34. [   45.042127] WARNING: CPU: 0 PID: 102 at lib/dma-debug.c:1080 check_unmap+0x784/0x9f4()
>  35. [   45.050066] fec 2188000.ethernet: DMA-API: device driver tries to free DMA memory it has not a]
> 
> (where the printk at line 32 is something that was added to debug this.)
> 
> The sad thing is that the remainder of my FEC patches did go a long way
> to clean up these kinds of issues in the driver (and there's /many/ of
> them), but unfortunately other conflicting changes got merged before I
> could finish rebasing them, I decided to move on to other things and
> discard the remainder of my patch set.  Marek showed some interest in
> taking the patch set over, but I've not heard anything more - and I'm not
> about to resurect my efforts only to get into the same situation where
> I'm carrying 50 odd patches which I can't merge back into mainline
> without spending weeks endlessly rebasing them.
> 
Russell, many thanks for your effort, and thanks for pointing out the bug.
I will think of a way to fix it.
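
The re-entry path described in the quoted text can be reproduced in a small
user-space model of the cleanup loop. This is only a sketch: the struct and
function names are invented for illustration, the ring has four entries, and
the IS_TSO_HEADER check is left out. The advance_on_null_skb flag models one
hypothetical change (updating dirty_tx on the skb == NULL path as well); it is
shown for contrast and is not a claim about the fix that should be applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4

struct desc {
	bool tx_ready;          /* models BD_ENET_TX_READY */
	unsigned long bufaddr;  /* models cbd_bufaddr, 0 means "not mapped" */
	bool has_skb;           /* models tx_skbuff[index] != NULL */
};

struct txq {
	struct desc ring[RING_SIZE];
	size_t dirty_tx;
	size_t cur_tx;
	unsigned int bad_unmaps; /* unmap attempts on address 0 */
};

static size_t next_desc(size_t i)
{
	return (i + 1) % RING_SIZE;
}

/* Stand-in for dma_unmap_single(): only counts unmaps of a zeroed address. */
static void sim_unmap(struct txq *q, unsigned long addr)
{
	if (addr == 0)
		q->bad_unmaps++;
}

/* Mirrors the structure of the quoted loop, one call per "entry". */
static void sim_tx_clean(struct txq *q, bool advance_on_null_skb)
{
	size_t i = next_desc(q->dirty_tx);

	assert(q->cur_tx < RING_SIZE && q->dirty_tx < RING_SIZE);
	while (!q->ring[i].tx_ready) {
		if (i == q->cur_tx)     /* current queue is empty */
			break;
		sim_unmap(q, q->ring[i].bufaddr);
		q->ring[i].bufaddr = 0;
		if (!q->ring[i].has_skb) {
			if (advance_on_null_skb)
				q->dirty_tx = i; /* hypothetical change */
			i = next_desc(i);
			continue;
		}
		q->ring[i].has_skb = false;
		q->dirty_tx = i;
		i = next_desc(i);
	}
}

/* Descriptor 1 is "A" (done, no skb), descriptor 2 is "B" (still ready). */
static struct txq sim_scenario(void)
{
	struct txq q = {0};

	q.dirty_tx = 0;
	q.cur_tx = 3;
	q.ring[1] = (struct desc){ .tx_ready = false, .bufaddr = 0x1000,
				   .has_skb = false };
	q.ring[2] = (struct desc){ .tx_ready = true, .bufaddr = 0x2000,
				   .has_skb = true };
	return q;
}
```

Calling sim_tx_clean() twice on sim_scenario() with the flag off leaves
dirty_tx untouched after the first call, so the second call visits descriptor
A again and unmaps address 0. With the flag on, the second call starts at B
and does nothing.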

I also have one question about the highmem DMA mapping issue, shown below:
fec_enet_txq_submit_frag_skb(struct fec_enet_priv_tx_q *txq, struct sk_buff *skb, struct net_device *ndev)
{
	...
                bufaddr = page_address(this_frag->page.p) + this_frag->page_offset;

                index = fec_enet_get_bd_index(txq->tx_bd_base, bdp, fep);
                if (((unsigned long) bufaddr) & fep->tx_align ||
                        fep->quirks & FEC_QUIRK_SWAP_FRAME) {
                        memcpy(txq->tx_bounce[index], bufaddr, frag_len);
                        bufaddr = txq->tx_bounce[index];

                        if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
                                swap_buffer(bufaddr, frag_len);
                }

                addr = dma_map_single(&fep->pdev->dev, bufaddr, frag_len,
                                      DMA_TO_DEVICE);
                if (dma_mapping_error(&fep->pdev->dev, addr)) {
                        dev_kfree_skb_any(skb);
                        if (net_ratelimit())
                                netdev_err(ndev, "Tx DMA memory map failed\n");
                        goto dma_mapping_error;
                }
	...
}
If the frag page is located in high memory, using dma_map_single() is not right; we must use skb_frag_dma_map() or dma_map_page() instead.
But before mapping, if tx has a buffer alignment limitation (tx_align is not zero), we need to do a memcpy for buffer alignment.
So we need to check whether the page is in highmem; if so, we need to call kmap_atomic() or kmap_high_get() to get a CPU address,
and then do the memcpy or swap buffer operation.

Do you think the above solution is right? Or is there a better method to fix it?
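
To make the proposal concrete, here is a user-space sketch of the decision it
describes. It only restates the idea above and does not settle whether it is
the right fix. The enum and function names are invented for illustration, and
the alignment test uses the fragment's page offset instead of a CPU address,
on the assumption that tx_align is a small mask below the page size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the three paths discussed in the message. */
enum tx_map_plan {
	MAP_FRAG_DIRECT,           /* map the fragment page itself */
	COPY_THEN_MAP_BOUNCE,      /* memcpy to the bounce buffer, map that */
	KMAP_COPY_THEN_MAP_BOUNCE  /* get a temporary CPU mapping first */
};

/*
 * A bounce copy is needed when the buffer is misaligned for the hardware
 * or when the frame must be byte-swapped (FEC_QUIRK_SWAP_FRAME).
 */
static enum tx_map_plan plan_frag_mapping(uintptr_t page_offset,
					  uintptr_t tx_align,
					  bool swap_frame_quirk,
					  bool page_is_highmem)
{
	bool bounce;

	assert(tx_align < 4096); /* assumption: mask smaller than a page */
	bounce = (page_offset & tx_align) != 0 || swap_frame_quirk;
	if (!bounce)
		return MAP_FRAG_DIRECT;
	return page_is_highmem ? KMAP_COPY_THEN_MAP_BOUNCE
			       : COPY_THEN_MAP_BOUNCE;
}
```

In this model only the bounce path ever needs a CPU address for the fragment,
so the temporary mapping is confined to the case where the page is in highmem
and a copy or swap is required.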

Regards,
Andy


* Re: [PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
From: yzhu1 @ 2014-12-16  2:08 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Sergei Shtylyov, Zhu Yanjun, netdev@vger.kernel.org, ALLAN, BRUCE,
	KIRSHER, JEFFREY
In-Reply-To: <20141215133335.GC28701@1wt.eu>

Hi, Willy

Thanks for your reply.

I will fix it and send V2.

Best Regards!
Zhu Yanjun
On 12/15/2014 09:33 PM, Willy Tarreau wrote:
> On Mon, Dec 15, 2014 at 01:21:43PM +0000, Zhu, Yanjun wrote:
>> Hi, Willy
>>
>> Thanks for your reply.
>>
>> This patch "[PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked" exists in tag v3.1.
>>
>> Please follow these steps, this patch can be found:
>>
>> 1. git checkout -f v3.1
>>
>> 2. git log -p drivers/net/e1000e/ich8lan.c
>>
>> 3. search "b7d6e335"
>>
>> Then we will find this patch.
> Ah it's because you truncated the commit ID from the right instead of from
> the left. Truncated commit IDs are valid from the left, not from the right.
> In your case, the commit is 6cc7aaed70c96c3933fbacbad582fc79b7d6e335
> ("e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked"), so
> the truncated ID is 6cc7aae, not b7d6e335. It's important to fix that in
> your commit messages so that a "git show" works correctly (it failed for me
> for this precise reason).
>
> Thanks,
> Willy
>
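
The rule Willy describes (an abbreviated commit ID is a leading prefix of the
full ID, never a trailing part) can be illustrated with a short sketch. The
function name is hypothetical, and the check covers only the prefix property;
git additionally requires the abbreviation to be unambiguous in the
repository, which is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `abbrev` is a non-empty leading prefix of `full`. */
static bool is_commit_abbrev(const char *full, const char *abbrev)
{
	size_t n;

	assert(full != NULL && abbrev != NULL);
	n = strlen(abbrev);
	return n > 0 && n <= strlen(full) && strncmp(full, abbrev, n) == 0;
}
```

With the full ID from this thread, "6cc7aae" passes the check and "b7d6e335"
does not, because the latter is the tail of the ID.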


* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Roopa Prabhu @ 2014-12-16  1:20 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: Jamal Hadi Salim, John Fastabend, netdev@vger.kernel.org,
	Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DA6149@ORSMSX101.amr.corp.intel.com>

On 12/15/14, 4:58 PM, Arad, Ronen wrote:
>
>> -----Original Message-----
>> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
>> Sent: Tuesday, December 16, 2014 1:28 AM
>> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
>> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
>> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> On 12/15/14 13:36, Arad, Ronen wrote:
>>>
>>>> -----Original Message-----
>>> The behavior of a driver could depend on the presence of a bridge and
>> features such as FDB LEARNING and LEARNING_SYNC.
>>
>> Indeed, those are bridge attributes.
>>
>>> A switch port driver which is not enslaved to a bridge might need to
>>> implement VLAN-aware FDB within the driver and report its content to user-
>> space using ndo_fdb_dump.
>>   >
>>> A switch port driver which is enslaved to a bridge could do with only
>>> pass through for static FDB configuration
>>   > to the HW when LEARNING_SYNC is configured. FDB reporting to user-
>> space and soft aging are left to the bridge module FDB.
>>> Such driver, without LEARNING_SYNC could still avoid maintaing in-driver
>> FDB as long as it could dump the HW FDB on demand.
>>> LEARNING_SYNC also requires periodic updates of freshness information
>> from the driver to the bridge module.
>>
>> If you have an fdb - shouldnt that be exposed only if you have a bridge
>> abstraction exposed? i.e thats where the Linux tools would work.
> I'm trying to find out what are the opinions of other people in the netdev list.
> John have clearly stated that he'd like to see full L2 switching functionality (at least) supported without making a bridge device mandatory.
> The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that with proper setting of SELF/MASTER flags by iproute2.
> I see the value in supporting both approaches (bridge device mandatory and bridge device optional). If the choice is left to user-driven policy decision, we need to document both use models and map traditional L2 features to each model.
> The L2 offloading (or NETFUNC as it is currently called), which is being discussed on a different patch-set, is only needed when a bridge device is used.
> Without a bridge device, all configuration has to be targeted at the switch port driver directly using the SELF flag. FDB remains relevant and it is used to configure static MAC table entries and dump the HW MAC table.
Your understanding is right here. So far all patches have kept both 
models in mind.
> When the HW device is a L2 switch or a multi-layer switch (L2-L3 or even higher), there is a gap between what the HW is doing and what is explicitly modeled in Linux.
Can you elaborate more here? We use the Linux model to accelerate a
multi-layer (L2-L3) switch today. There may be a few gaps, but these gaps
can be closed by having equivalent functionality in the software path.

> Without a bridge device, the HW is represented by a set of switch port devices and the bridging (both control and data planes) takes place only in the HW and switch port driver.
> Each switch port driver has to implement its own FDB as there is no common shared code among drivers for different HW devices.
> Using a bridge device could partially alleviate that, but it comes with a cost. There is a need to properly implement offloading of both configuration and data-path. The transmit and receive path in the bridge module should be somehow bypassed to avoid unnecessary overhead or duplicate packets coming from both software bridging and HW bridging.
>
>> What i was refering to was a scenario where i have no interest in the fdb
>> despite such a hardware capabilities. VLANs is a different issue;
>>
> VLAN is fundamental feature of L2 and L3 switching and Linux is unclear about it. Bridge device could model bridging of untagged packets which requires a bridge device for each VLAN and a vlan device on each port that is a member of the bridge's VLAN.
> This different from the behavior and configuration of classic closed-source switches.
> An alternative model is VLAN filtering where a bridge is VLAN-aware and switches tagged traffic. A bridge device represents multiple L2 domains with VLAN filtering policy that defines the switching rules within each domain.
And the linux bridge driver supports both models today.

> Forwarding (e.g. L3 routing) is expected across such L2 domains using L3 entities.
> The modeling of L3 entities per L2 domain (e.g. per-VLAN) in the VLAN filtering model is yet unclear to me.
In the VLAN filtering bridge model, you can create a vlan device on the
bridge for L3 ...

>   
>>>>> Will the decision about using a bridge device or avoiding it be left
>>>>> to the end-user?
>>>> Its a user policy decision. Again the offload bit gets us this in a
>>>> reasonably configurable way IMO.
>>>>
>>>>> (This requires switch port drivers to be able to work and provide
>>>>> similar functionality in both setups).
>>>> Right, but if the drivers "care" who is calling their ndo ops
>>>> something is seriously broken. For the driver it should not need to
>>>> know anything about the callers so it doesn't matter to the driver if
>>>> its a netlink call from user space or an internal call fro bridge.ko
>>> LEARNING_SYNC only makes sense when a switch port driver is enslaved to
>> a bridge.
>>   > Rocker switch driver indeed monitors upper change notifications and keep
>> track of master bridge presence.
>>> So bridge presence is not transparent.
>>>
>> Agreed - the challenge so far is that people have been fascinated by "switch"
>> point of view. I think we are learning and the class device will eventually
>> become obvious as useful.
>>
>> cheers,
>> jamal


* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-16  0:58 UTC (permalink / raw)
  To: Jamal Hadi Salim, John Fastabend, netdev@vger.kernel.org
  Cc: Roopa Prabhu, Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org,
	tgraf@suug.ch, stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <548F6E62.1040500@mojatatu.com>



> -----Original Message-----
> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
> Sent: Tuesday, December 16, 2014 1:28 AM
> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 12/15/14 13:36, Arad, Ronen wrote:
> >
> >
> >> -----Original Message-----
> 
> > The behavior of a driver could depend on the presence of a bridge and
> features such as FDB LEARNING and LEARNING_SYNC.
> 
> Indeed, those are bridge attributes.
> 
> > A switch port driver which is not enslaved to a bridge might need to
> >implement VLAN-aware FDB within the driver and report its content to user-
> space using ndo_fdb_dump.
>  >
> > A switch port driver which is enslaved to a bridge could do with only
> > pass through for static FDB configuration
>  > to the HW when LEARNING_SYNC is configured. FDB reporting to user-
> space and soft aging are left to the bridge module FDB.
> > Such driver, without LEARNING_SYNC could still avoid maintaing in-driver
> FDB as long as it could dump the HW FDB on demand.
> > LEARNING_SYNC also requires periodic updates of freshness information
> from the driver to the bridge module.
> >
> 
> 
> If you have an fdb - shouldnt that be exposed only if you have a bridge
> abstraction exposed? i.e thats where the Linux tools would work.

I'm trying to find out what the opinions of other people on the netdev list are.
John has clearly stated that he'd like to see full L2 switching functionality (at least) supported without making a bridge device mandatory.
The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that with proper setting of SELF/MASTER flags by iproute2.
I see the value in supporting both approaches (bridge device mandatory and bridge device optional). If the choice is left to user-driven policy decision, we need to document both use models and map traditional L2 features to each model. 
The L2 offloading (or NETFUNC as it is currently called), which is being discussed on a different patch-set, is only needed when a bridge device is used.
Without a bridge device, all configuration has to be targeted at the switch port driver directly using the SELF flag. FDB remains relevant and it is used to configure static MAC table entries and dump the HW MAC table.
When the HW device is an L2 switch or a multi-layer switch (L2-L3 or even higher), there is a gap between what the HW is doing and what is explicitly modeled in Linux. Without a bridge device, the HW is represented by a set of switch port devices and the bridging (both control and data planes) takes place only in the HW and switch port driver.
Each switch port driver has to implement its own FDB as there is no common shared code among drivers for different HW devices.
Using a bridge device could partially alleviate that, but it comes with a cost. There is a need to properly implement offloading of both configuration and data-path. The transmit and receive path in the bridge module should be somehow bypassed to avoid unnecessary overhead or duplicate packets coming from both software bridging and HW bridging.

> What i was refering to was a scenario where i have no interest in the fdb
> despite such a hardware capabilities. VLANs is a different issue;
>
VLAN is a fundamental feature of L2 and L3 switching, and Linux is unclear about it. A bridge device could model bridging of untagged packets, which requires a bridge device for each VLAN and a vlan device on each port that is a member of the bridge's VLAN.
This is different from the behavior and configuration of classic closed-source switches.
An alternative model is VLAN filtering where a bridge is VLAN-aware and switches tagged traffic. A bridge device represents multiple L2 domains with VLAN filtering policy that defines the switching rules within each domain. Forwarding (e.g. L3 routing) is expected across such L2 domains using L3 entities.
The modeling of L3 entities per L2 domain (e.g. per-VLAN) in the VLAN filtering model is still unclear to me.
 
> >>> Will the decision about using a bridge device or avoiding it be left
> >>> to the end-user?
> >>
> >> Its a user policy decision. Again the offload bit gets us this in a
> >> reasonably configurable way IMO.
> >>
> >>> (This requires switch port drivers to be able to work and provide
> >>> similar functionality in both setups).
> >>
> >> Right, but if the drivers "care" who is calling their ndo ops
> >> something is seriously broken. For the driver it should not need to
> >> know anything about the callers so it doesn't matter to the driver if
> >> its a netlink call from user space or an internal call fro bridge.ko
> >
> > LEARNING_SYNC only makes sense when a switch port driver is enslaved to
> a bridge.
>  > Rocker switch driver indeed monitors upper change notifications and keep
> track of master bridge presence.
> > So bridge presence is not transparent.
> >
> 
> Agreed - the challenge so far is that people have been fascinated by "switch"
> point of view. I think we are learning and the class device will eventually
> become obvious as useful.
> 
> cheers,
> jamal


* RE: [E1000-devel] [PATCH] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode
From: Hiroshi Shimamoto @ 2014-12-16  0:49 UTC (permalink / raw)
  To: Alexander Duyck, e1000-devel@lists.sourceforge.net
  Cc: netdev@vger.kernel.org, Choi, Sy Jong, Hayato Momma,
	linux-kernel@vger.kernel.org
In-Reply-To: <54809D57.9060804@gmail.com>

> > Subject: Re: [E1000-devel] [PATCH] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode
> >
> > On 11/27/2014 02:39 AM, Hiroshi Shimamoto wrote:
> > > From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
> > >
> > > The limitation of the number of multicast address for VF is not enough
> > > for the large scale server with SR-IOV feature.
> > > IPv6 requires the multicast MAC address for each IP address to handle
> > > the Neighbor Solicitation message.
> > > We couldn't assign over 30 IPv6 addresses to a single VF interface.
> > >
> > > The easy way to solve this is enabling multicast promiscuous mode.
> > > It is good to have a functionality to enable multicast promiscuous mode
> > > for each VF from VF driver.
> > >
> > > This patch introduces the new mbox API, IXGBE_VF_SET_MC_PROMISC, to
> > > enable/disable multicast promiscuous mode in VF. If multicast promiscuous
> > > mode is enabled the VF can receive all multicast packets.
> > >
> > > With this patch, the ixgbevf driver automatically enable multicast
> > > promiscuous mode when the number of multicast addresses is over than 30
> > > if possible.
> > >
> > > This also bump the API version up to 1.2 to check whether the API,
> > > IXGBE_VF_SET_MC_PROMISC is available.
> > >
> > > Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
> > > CC: Choi, Sy Jong <sy.jong.choi@intel.com>
> > > Reviewed-by: Hayato Momma <h-momma@ce.jp.nec.com>
> >
> > This is a REALLY bad idea unless you plan to limit this to privileged VFs.
> >
> > I would recommend looking at adding an ndo operation to control this
> > feature so that it could be disabled by default in the PF and only
> > enabled on the host side if specifically requested.  Otherwise the
> 
> Do you mean that PF driver should have the flag to enable or disable per VF
> and disallow the request from VF?

Could you answer that?

> 
> > problem is I can easily see this leading security issues as the VFs
> > might begin getting access to messages that they aren't supposed to.
> 
> OK, by the way, I think that the current ixgbe and ixgbevf implementation
> has already such issue. The guest can add hash entry to receive MAC and it
> can get every multicast MAC frame with the current mbox API.
> Does your concern come from the easiness of doing that?

There is a single MTA per PF, not one per VF.
The VF asks the PF to register the hash of an MC MAC; the PF then sets a bit
in the MTA and sets the IXGBE_VMOLR_ROMPE flag of the VF, which enables packet
switching to the VF if an MC MAC hits a hash entry in the MTA.
If VM1 has VF1 which uses MC MAC1 and VM2 has VF2 which uses MC MAC2, both
VM1 and VM2 will receive MC MAC1. VM2 doesn't know why it receives MAC1.
In other words, in the current implementation, a VF receives all multicast
packets that are registered by other VFs.
For that reason, I had not expected that enabling MC promiscuous mode would
increase the number of MC messages that VFs receive but aren't supposed to.
I think that this patch doesn't change that behavior.
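
The sharing effect described above can be modeled in a few lines of user-space
C. This is a sketch under stated assumptions, not driver code: the struct and
function names, the table size and the hash function are all invented for
illustration (the real hardware hash is different). It keeps only the two
facts from the explanation: one hash table per PF and one ROMPE flag per VF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MTA_BITS 4096 /* assumption: size of the shared hash table */
#define NUM_VFS  4

struct pf_state {
	uint8_t mta[MTA_BITS / 8]; /* one table per PF, shared by all VFs */
	bool rompe[NUM_VFS];       /* models IXGBE_VMOLR_ROMPE per VF */
};

/* Hypothetical hash over the last two MAC bytes, NOT the hardware's. */
static unsigned int mc_hash(const uint8_t mac[6])
{
	return (((unsigned int)mac[4] << 8) | mac[5]) % MTA_BITS;
}

/* VF asks the PF to register an MC MAC: set the MTA bit and the VF flag. */
static void vf_register_mc(struct pf_state *pf, int vf, const uint8_t mac[6])
{
	unsigned int h = mc_hash(mac);

	assert(vf >= 0 && vf < NUM_VFS);
	pf->mta[h / 8] |= (uint8_t)(1u << (h % 8));
	pf->rompe[vf] = true;
}

/* A VF gets the frame if its flag is set and the MAC hits any MTA entry. */
static bool vf_receives(const struct pf_state *pf, int vf,
			const uint8_t mac[6])
{
	unsigned int h = mc_hash(mac);
	bool hit;

	assert(vf >= 0 && vf < NUM_VFS);
	hit = ((pf->mta[h / 8] >> (h % 8)) & 1u) != 0;
	return pf->rompe[vf] && hit;
}
```

If VF1 registers MAC1 and VF2 registers MAC2, vf_receives() is true for VF2
and MAC1 in this model, while a VF that never registered anything receives
neither address.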

thanks,
Hiroshi


* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: John Fastabend @ 2014-12-16  0:45 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Hubert Sokolowski, Roopa Prabhu, netdev@vger.kernel.org,
	Vlad Yasevich
In-Reply-To: <548EF05E.6050401@mojatatu.com>

On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:
> On 12/12/14 15:05, John Fastabend wrote:
>> On 12/12/2014 06:35 AM, Jamal Hadi Salim wrote:
>
>
>> I'll wake up ;)
>
>
> Vlad made me go over those patches in a few iterations to make
> sure that the use cases covered in the test case work. It is
> holiday season, so he may be offline.
>

Yep.

>> First quick grep of code finds some strange uses of ndo_fdb_dump like
>> this in macvlan,
>>
>>    ./drivers/net/macvlan.c
>>          .ndo_fdb_dump           = ndo_dflt_fdb_dump,
>>
>> I'll be sending a patch once net-next opens up again to resolve it. Its
>> harmless though so not really a fix for net.
>>
>> There seem to be a few places that have the potential to return
>> different values then the uc/mc lists.
>>
>>      ./drivers/net/vxlan.c
>>      ./drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
>>      ./drivers/net/ethernet/rocker/rocker.c
>>
>>      ./net/bridge/br_device.c
>>
>
> Yes, thats my observation as well.
> The question is: Are multi/unicast address unconditionally dumped?

Hmm, good question. When I implemented this on the host NICs with SR-IOV,
VMDQ, etc., the multi/unicast addresses were propagated into the FDB by
the driver. My logic was: if some netdev ethx has a set of MAC addresses
above it, then any virtual function or virtual device also behind
the hardware shouldn't be sending those addresses out the egress
switch-facing port. Otherwise the switch will see packets it knows are behind
that port and drop them, or flood them if it hasn't learned the address
yet. Either way they will never get to the right netdev.

Admittedly I wasn't thinking about switches with many ports at the time.

> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
> the other driver did).

My original thinking here was: if a device didn't implement fdb_add, fdb_del
and fdb_dump, then you could think of it as having a forwarding
database if you wanted, but it was really just a two-port MAC relay. In
that case, just dump all the MAC addresses it knows about. If
it was something fancier, it could do its own dump, like vxlan or
macvlan.

> If you go over the original thread exchange with Vlad, you'll notice
> i was kind of unsure why dumping of unicast/multicast had anything to
> do with fdb dumping.
> It is still my view that we shouldnt be treating these addresses as if
> they were fdb entries. But: The problem is once you allow an API to
> user space you cant take it back even if people are depending on bugs.
>

For a host NIC, ucast/multicast and fdb are the same, I think? The
code we had was just shorthand to allow the common case, a host NIC,
to work. Notice vxlan and bridge drivers didn't dump their addr lists
from fdb_dump until your patch.

Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
I shouldn't overload the addr lists.

>
>> So I guess we can walk through the list and analyse them a bit.
>>
>> vxlan:
>>
>> Try stacking devices on top of the vxlan device this will call a uc_add
>> routine if you then change the mac addr on the vlan. This would get
>> reported by the dflt fdb dump handlers but not the drivers fdb dump
>> handlers. So removing the dflt dump handler from this patch at least
>> changes things. We should either explain why this is OK or accept that
>> the driver needs to be fixed. Or I guess that the patch is just wrong.
>> My guess is one of the latter options.
>>
>> Also Jamal, your original patch seems like it might of changed this
>> and Hubert's patch is reverting back to its original case. Was this
>> specific part of your patch intentional?
>>
>
> Yes.
> This is based on the view that unicast/multicast must be dumped
> *unconditionally*. If the view is that uni/mcast addresses are
> dumped conditionally based on what the driver thinks, then Hubert's
> one liner is good. But i really would like Vlad to comment. 80%
> of the effort on my part if you look at the thread was the refactoring
> of the code to meet the use case.

I'm interested to see what Vlad says as well. But the situation is that
previously some drivers dumped their addr lists and others didn't.
Specifically, the more switch-like devices (bridge, vxlan) didn't. Now
every device will dump the addr lists. I'm not entirely convinced that
is correct.

>
> I thought the abstraction which requires that your own MAC addresses
> are treated as fdb entries was broken - but it is too late to change
> that.
>

It works OK for host NICs (NICs that can't forward between ports) and
seems at best confusing for real switch ASICs. On a related question, do
you expect the switch ASIC to trap any packets with MAC addresses in
the multi/unicast address lists and send them to the correct netdev? Or
will the switch forward them using normal FDB tables?

Also, I don't think it's too late to fix it. Maybe we just had some
buggy drivers.

> cheers,
> jamal


-- 
John Fastabend         Intel Corporation


* [ANNOUNCE] nftables 0.4 release
From: Pablo Neira Ayuso @ 2014-12-16  0:40 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, netfilter, netfilter-announce, lwn, kaber

[-- Attachment #1: Type: text/plain, Size: 4742 bytes --]

Hi!

The Netfilter project proudly presents:

        nftables 0.4

This release contains a lot of bug fixes and new features contained up
to the recent 3.18 kernel release (and some features coming up in the
yet unreleased 3.19-rc).

New features
============

* Add support for global ruleset operations (available since 3.18).
  Get rid of all tables, chains, and rules in one go:

        # nft flush ruleset

  List the ruleset for all existing families:

        # nft list ruleset

  You can save the ruleset and restore it via:

        # echo "flush ruleset" > ruleset.file
        # nft list ruleset >> ruleset.file
        # nft -f ruleset.file

  This mimics iptables-restore, including the ruleset for all
  supported families: ip, ip6, inet, bridge and arp.

* Full logging support for all the families, including nfnetlink_log
  support (available since 3.17).

* Automatic selection of the optimal set implementation (available
  since 3.16).

  You can tell the kernel to optimize your set representation
  according to the space-time tradeoff, e.g. to optimize memory:

        # nft add set filter set1 { type ipv4_addr ; policy memory ; }

  Or optimize performance:

        # nft add set filter set1 { type ipv4_addr ; policy performance ; }

  You can also use this in maps:

        # nft add map filter map1 { type ipv4_addr : verdict ; policy performance ; }

  And indicate the expected size to assist the set selection routine:

        # nft add set filter set1 { type ipv4_addr ; size 1024 ; }

* Complete reject support (available for ip, ip6 and inet since 3.14.
  bridge support and the icmpx abstraction since 3.18).

        # nft add rule filter input reject with icmp type host-unreachable

  and for IPv6:

        # nft add rule ip6 filter input reject with icmpv6 type no-route

  you can use the ICMPx abstraction from the inet table:

        # nft add rule inet filter input reject with icmpx type no-route

  and reject TCP traffic with reset packets:

        # nft add rule filter input reject with tcp reset

* Masquerading support (available since 3.18).

        # nft add rule nat postrouting masquerade

* Redirect support (available since upcoming Linux kernel 3.19-rc).

        # nft add rule nat prerouting tcp dport 22 redirect to 2222

* Support for NAT flags: random, fully-random, persistent.

* Consistency checks for interferences between updates and ruleset dumps
  (initially available since 3.16, enhanced with ruleset generations
   since 3.18).

* Extend meta to support pkttype, cpu and devgroup matching.

* Automatic regression tests through our customized python shell
  script.

* Allow disabling libreadline and debug at the configure stage.

* Full conversion to autotools.
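
The save-and-restore sequence shown under the global ruleset operations
above can also be scripted. What follows is a minimal sketch in Python,
assuming nft 0.4 or later is on the PATH and the script runs with enough
privileges to read and load rulesets. The helper names
(build_ruleset_file, save_ruleset, restore_ruleset) and the default file
name are illustrative only and are not part of nftables.

```python
import subprocess


def build_ruleset_file(listing: str) -> str:
    """Prefix a ruleset listing with 'flush ruleset'.

    Loading the resulting text with 'nft -f' then replaces the running
    ruleset instead of adding to it.
    """
    return "flush ruleset\n" + listing


def save_ruleset(path: str = "ruleset.file") -> None:
    """Write the current ruleset to 'path'.

    Mirrors the two shell redirections from the announcement:
    echo "flush ruleset" > file; nft list ruleset >> file
    """
    listing = subprocess.run(
        ["nft", "list", "ruleset"],
        check=True, capture_output=True, text=True,
    ).stdout
    with open(path, "w") as f:
        f.write(build_ruleset_file(listing))


def restore_ruleset(path: str = "ruleset.file") -> None:
    """Load the saved file, as 'nft -f ruleset.file' does."""
    subprocess.run(["nft", "-f", path], check=True)
```

Keeping the text assembly in its own function means that part can be
checked without touching the running ruleset.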

Syntax changes
==============

* 'queue' flags are now expressed as a list of comma-separated symbols:

        # nft add filter input counter queue num 0-3 fanout,bypass

  for consistency with flags, that are always expressed like this.

* nft doesn't resolve names by default anymore. IP addresses are
  always expressed in the numeric representation. A new '-N' option
  allows you to request name resolution.

Bug fixes
=========

* Crash with anonymous sets with lots of elements.

* Several annoying byteorder issues that resulted in incorrect bytecode
  generation and wrong listings.

* Endianness problems reported from little endian archs.

* Named verdict maps, eg.

        # nft add map filter my_vmap { type ipv4_addr : verdict\; }
        # nft add element filter my_vmap { 1.1.1.1 : drop, 2.2.2.2 : drop}
        # nft add rule filter input ip saddr vmap @my_vmap

* Crash in 'nft describe' with wrong expressions.

* Parsing of ether types.

* Crash on usage of basetypes, eg.

        # nft add rule filter input ct state 8 accept

  instead of 'ct state new'.

* Crash on wrong values when performing basetype parsing, eg.

     <cmdline>:1:29-31: Error: Could not parse conntrack state
     add rule test test ct state xxx accept
                                 ^^^

* Broken listing of meta and ct range expressions, eg.

        nft add rule filter input meta length 33-55 counter

* Don't display a BUG message on too large decimal/hexadecimal values.

Resources
=========

The nftables code can be obtained from:

* http://netfilter.org/projects/nftables/downloads.html
* ftp://ftp.netfilter.org/pub/nftables
* git://git.netfilter.org/nftables

To build the code, libnftnl and libmnl are required:

* http://netfilter.org/projects/libnftnl/index.html

Thanks
======

Thanks to all our contributors, testers and bug reporters, who have
all helped to get rid of a good bunch of bugs and push new features.

On behalf of the Netfilter Core Team,
Happy bytecode execution :)

[-- Attachment #2: changes-nftables-0.4.txt --]
[-- Type: text/plain, Size: 7597 bytes --]

Alvaro Neira (15):
      linealize: generate unary expression with the appropiate operation
      payload: generate dependency in the appropriate byteorder
      src: Enhance payload_gen_dependency()
      datatype: Enhance symbolic_constant_parse()
      nft: complete reject support
      evaluate: fix a crash if we specify ether type or meta nfproto in reject
      delinearize: list the icmpx reason with the string associated
      evaluate: reject: fix crash if we specify ether type or meta nfproto
      evaluate: reject: fix crash if we have transport protocol conflict from inet
      test: update and add the reject tests for ip, ip6, bridge and inet.
      evaluate: reject: accept a reject reason with incorrect network context
      evaluate: reject: check in bridge and inet the network context in reject
      evaluate: reject: check the context in reject without reason for bridge and inet tables
      evaluate: reject: enhance the error support throwing message with more details
      evaluate: reject: fix crash on NULL location with bridge and tcp reset

Alvaro Neira Ayuso (1):
      src: add specific byteorder to the struct proto_hdr_template

Ana Rey (15):
      src: Add support for pkttype in meta expresion
      src: Add support for cpu in meta expresion
      src: meta: Fix the size of cpu attribute
      src: Add devgroup support in meta expresion
      tests: Add automated regression testing
      tests: Add ip folder with test files
      tests: Add ip6 folder with test files.
      tests: Add inet folder with test files.
      tests: Add arp folder with test files.
      tests: Add bridge folder with test files.
      tests: Add any folder with test files.
      tests: regression: Delete all reference to wlan0 in test files
      tests: regression: Delete an unnecessary whitespace in an output messages
      meta: Add support for datatype devgroup
      src: Add cgroup support in meta expresion

Arturo Borrero (18):
      netlink: monitor: add a helper function to handle sets referenced by a rule
      netlink: monitor: fix how rules with intervals are printed
      doc: update documentation with 'monitor' and 'export'
      src: add `flush ruleset'
      netlink: include file and line in netlink ABI errors
      src: add set optimization options
      rule: rename do_command_list_cleanup() to table_cleanup()
      rule: factorize chain and table listing code
      src: add list ruleset command
      src: add nat persistent and random options
      src: add masquerade support
      tests: add tests for masquerade
      mnl: delete useless parameter nf_sock in batch functions
      src: add redirect support
      nft: don't resolve hostnames by default
      tests/regression: masquerade: fix invalid syntax
      tests/regression: redirect: fix invalid syntax
      parser: allow both nat_flags and port specification in redirect

David Kozub (1):
      build: add missing \ in src/Makefile.am (AM_CPPFLAGS)

Eric Leblond (2):
      scanner: fix reading of really long line
      datatype: fix name of icmp* code

Giorgio Dal Molin (2):
      build: add autotools support for the 'doc' subdir
      build: add autotools support for the 'files' subdir

Kevin Fenzi (1):
      doc: nft: Fix trivial error in man page where flush should be rename

Pablo Neira Ayuso (53):
      proto: initialize result expression in ethertype_parse()
      mnl: immediately return on errors in mnl_nft_ruleset_dump()
      mnl: check for NLM_F_DUMP_INTR when dumping object lists
      mnl: add nft_batch_continue() helper
      mnl: add nft_nlmsg_batch_current() helper
      src: rework batching logic to fix possible use of uninitialized pages
      main: propagate error to shell
      mnl: introduce NFT_NLMSG_MAXSIZE
      mnl: fix crashes when using sets with many elements
      src: add level option to the log statement
      src: don't return error in netlink_linearize_rule()
      include: refresh include/linux/nf_tables.h cached copy
      log: netlink_linearize: don't set level if user didn't specify
      src: fix 'describe' command when passing wrong expressions
      mnl: consistency checks across several netlink dumps
      mnl: use nft_batch_begin and nft_batch_end from libnftnl
      src: interpret the event type from the evaluation step
      netlink: use switch whenever possible in the monitor code
      utils: indicate file and line on memory allocation errors
      include: refresh cached copy of nf_tables.h
      build: use PKG_CHECK_MODULES to check for libmnl and libnftnl
      build: use AC_PROG_YACC and AM_PROG_LEX
      rename parser.y to parser_bison.y
      include: add cli.h
      build: autotools conversion
      netlink: don't bug on unknown events
      src: restore nft --debug
      parser: restore named vmap
      tests: regression: any/queue.t: use new syntax
      tests: regression: don't use -nnn for non-list commands
      tests: regression: fix bogus error due to bash
      tests: regression: test masquerade from nat/postrouting too
      datatype: fix crash when using basetype instead of symbolic constants
      datatype: relax datatype check in integer_type_parse()
      netlink_delinearize: clone on netlink_get_register(), release previous on _set()
      meta: set base field on clones
      tests: regression: fix "Listing is broken" instead of output mismatch
      tests: regression: any/ct: remove wrong output
      scanner: don't bug on too large values
      payload: fix endianess issue in payload_expr_pctx_update()
      src: generate set members using integer_type in the appropriate byteorder
      netlink_delinearize: fix listing of set members in host byteorder using integer_type
      netlink: fix listing of range set elements in host byteorder
      rule: fix segmentation faults on kernels without nftables support
      tests: regression: adapt nat tests to use random-fully
      tests: regression: redirect.t: fix bogus errors
      parser: use 'redirect to PORT' instead of 'redirect :PORT'
      tests: regression: fix wrong number of test files
      tests: regression: simplify run_test_file() in case `-e' is used
      tests: regression: log.t: this works for bridge and arp since 3.17
      build: restore --disable-debug
      datatype: missing byteorder in string_type
      Bump version to v0.4

Patrick McHardy (16):
      netlink: check and handle errors from netlink_delinearize_set()
      evaluate: fix concat expressions as map arguments
      payload: take endianess into account when updating the payload context
      datatype: take endianess into account in symbolic_constant_print()
      proto: fix byteorder of ETH_P_* values
      verdict type: handle verdict flags and encoded additional information
      parser: simplify monitor command parsing
      parser: compact log level grammar
      expr: make range_low()/range_high() usable outside of segtree
      queue: clean up queue statement
      parser: rearrange monitor/export rules
      dtype: remove unnecessary icmp* parse/print functions
      stmt: rename nat "random-fully" option to "fully-random"
      meta: properly align types in meta_template table
      dtype: fix memory leak in concat_type_destroy()
      datatype: print datatype name in datatype_print() BUG message

Steven Barth (2):
      build: allow disabling libreadline-support
      build: remove unnecessary libintl.h check

Yanchuan Nian (2):
      Fix memory leak in nft get operation
      Fix typo in chain hook parsing

Yuxuan Shui (1):
      payload: use proto_unknown for raw protocol header


^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Bjorn Helgaas @ 2014-12-16  0:31 UTC (permalink / raw)
  To: Nils Holland; +Cc: David Miller, netdev, linux-pci@vger.kernel.org, Rajat Jain
In-Reply-To: <20141213210251.GA12812@teela.fritz.box>

On Sat, Dec 13, 2014 at 2:02 PM, Nils Holland <nholland@tisys.org> wrote:
> On Fri, Dec 12, 2014 at 08:18:31PM -0500, David Miller wrote:
>> From: Nils Holland <nholland@tisys.org>
>> Date: Sat, 13 Dec 2014 02:14:08 +0100
>>
>> >
>> > My bisect exercise suggests that the following commit is the culprit:
>> >
>> > 89665a6a71408796565bfd29cfa6a7877b17a667 (PCI: Check only the Vendor
>> > ID to identify Configuration Request Retry)
>>
>> You definitely need to bring this up with the author of that change
>> and the relevant list for the PCI subsystem and/or linux-kernel.
>
> I've now already sent an inquiry to Rajat Jain, the author of the
> patch in question, and this message here is now also CC'd to
> linux-pci@.
>
> With this message, I'd like to add one last result of investigation
> I've done today, in the hope that it will aid the folks with more
> knowledge to go after the issue.
>
> Basically, I've added a little debug output to tg3.c in the function
> tg3_poll_fw(), as that function contained the code that would print
> out the "No firmware running" line that was visible in dmesg on those
> kernels where tg3 would not work for me. So, I basically had this:
>
> static int tg3_poll_fw(struct tg3 *tp)
> {
>         int i;
>         u32 val;
>
>         netdev_info(tp->dev, "XX: Boom!\n");
>         [...]
> }
>
> Now, I was looking through dmesg searching for occurrences of this
> debug output, using a standard 3.18.0 kernel (where my tg3 doesn't
> work) as well as using a 3.18.0 kernel with
> 89665a6a71408796565bfd29cfa6a7877b17a667 reverted (where my tg3
> works). Here are the results:
>
> [standard 3.18.0 (=problematic)]:
> [    2.197653] libphy: tg3 mdio bus: probed
> [    2.257488] tg3 0000:02:00.0 eth0:
>         Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address
>         00:19:99:ce:13:a6
> [    2.259589] tg3 0000:02:00.0 eth0:
>         attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
> [    2.261740] tg3 0000:02:00.0 eth0:
>         RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> [    2.263912] tg3 0000:02:00.0 eth0:
>         dma_rwctrl[76180000] dma_mask[64-bit]
> [...]
> [   10.028002] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
> [   10.028247] tg3 0000:02:00.0 enp2s0: XX: Boom!
> [   12.157034] tg3 0000:02:00.0 enp2s0: No firmware running
>
>
> [3.18.0 without above mentioned patch, 3.17.3 is the same, both result
> in a working tg3]:
> [    1.397167] libphy: tg3 mdio bus: probed
> [    1.456473] tg3 0000:02:00.0
>         (unnamed net_device) (uninitialized): XX: Boom!
> [    1.464987] tg3 0000:02:00.0 eth0:
>         Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address
>         00:19:99:ce:13:a6
> [    1.467118] tg3 0000:02:00.0 eth0:
>         attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
> [    1.469311] tg3 0000:02:00.0 eth0:
>         RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> [    1.471500] tg3 0000:02:00.0 eth0:
>         dma_rwctrl[76180000] dma_mask[64-bit]
> [...]
> [    9.631629] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
> [    9.631962] tg3 0000:02:00.0 enp2s0: XX: Boom!
> [    9.634339] tg3 0000:02:00.0 enp2s0: XX: Boom!
> [    9.642741] IPv6:
>         ADDRCONF(NETDEV_UP): enp2s0: link is not ready
> [   10.479636] tg3 0000:02:00.0
>         enp2s0: Link is down
> [   11.484498] tg3 0000:02:00.0
>         enp2s0: Link is up at 100 Mbps, full duplex
>
> As can be seen, there are two tg3-related sections in my dmesg in both
> the working and non-working scenarios: At about 1 - 2 secs, the card
> seems to begin initializing, and at about 9 - 10 seconds it is (or
> should be) ready to establish a network connection.
>
> My debug section, or tg3.c's tg3_poll_fw(), seems to be called thrice
> in the working situation: The first hit occurs at 1.456473 where the tg3
> device is still reported as "(unnamed net_device) (uninitialized)".
> Then, the section gets hit twice again at around 9.63 - at this point
> the driver already reports the card as initialized / by its real name.
>
> In the non-working situation, the debug section seems to be hit only
> once, at 10.028247. At this point, the tg3 is already reported as
> initialized - just like when it's hit the second and third time in the
> working situation.
>
> Bottom line is that commit 89665a6a71408796565bfd29cfa6a7877b17a667
> really makes a difference regarding the way the tg3 card is
> initialized, which seems to cause the problem.

Hi Nils,

Thanks a lot for the bug report.  Can you open a bugzilla at
http://bugzilla.kernel.org, put it in the drivers/PCI component, mark
it as a regression, and attach the complete dmesg log for both the
working and non-working cases, as well as "lspci -vv" output for the
working case?

I don't yet see how 89665a6a7140 makes a difference here.  We must
eventually read PCI_VENDOR_ID_BROADCOM (0x14e4) because the tg3 driver
claimed the device.

Can you still reproduce the problem if you print out the value of "l"
every time we read PCI_VENDOR_ID in pci_bus_read_dev_vendor_id()?
That will change the timing, so it's possible that will make it harder
to reproduce.
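
The check in question can be modeled in a few lines. This is a
simplified sketch in Python, not the kernel code: it assumes that before
89665a6a7140 the retry loop compared the whole 32-bit value against
0xffff0001, and that afterwards it looks only at the low 16 bits (the
Vendor ID), as the commit title says. The real function also sleeps with
an increasing delay and gives up after a timeout; both are omitted here.
The function name, the "reads" list and the device ID 0x1692 are
illustrative assumptions.

```python
CRS_VENDOR_ID = 0x0001    # Vendor ID value that signals "retry later"
CRS_DWORD = 0xFFFF0001    # full dword: synthetic device ID 0xffff + vendor 0x0001


def read_vendor_id(reads, vendor_only=True, log=print):
    """Model of the Configuration Request Retry loop.

    reads       -- 32-bit values returned by successive config reads of
                   the vendor/device ID dword
    vendor_only -- True: retry while the low 16 bits equal 0x0001
                   False: retry only on the full dword 0xffff0001
    log         -- called with one line per read, like the suggested printk

    Returns the first value not treated as a retry, or None if the
    simulated reads run out first.
    """
    for l in reads:
        log("PCI_VENDOR_ID read: l = 0x%08x" % l)
        if vendor_only:
            retry = (l & 0xFFFF) == CRS_VENDOR_ID
        else:
            retry = l == CRS_DWORD
        if not retry:
            return l
    return None


if __name__ == "__main__":
    # A root port that reports the real device ID while the device is
    # still signalling retry, then the final Broadcom (0x14e4) value.
    read_vendor_id([0x16920001, 0x169214E4], vendor_only=True)
```

In this model the two variants differ only when a read returns vendor
0x0001 with a device ID other than 0xffff: the full-dword check accepts
that value at once, while the vendor-only check keeps retrying.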

Bjorn

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Eric Dumazet @ 2014-12-15 23:40 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, Josef Bacik, Alexei Starovoitov, Laurent Chavey,
	Yuchung Cheng, Martin KaFai Lau, netdev@vger.kernel.org,
	David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Kernel Team
In-Reply-To: <548F6EB7.8040802@mojatatu.com>

On Mon, 2014-12-15 at 18:28 -0500, Jamal Hadi Salim wrote:
> On 12/15/14 17:01, Tom Herbert wrote:
> 
> 
> >
> > Maybe this would be good for discussion at netdev01?
> >
> 
> Yes it would be a good fit,
> I just pinged Eric when i saw his email saying the same thing ;->
> 

For the record, I made this suggestion to Josef in a private mail, sent
at 10am PST ;)

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Jamal Hadi Salim @ 2014-12-15 23:28 UTC (permalink / raw)
  To: Tom Herbert, Josef Bacik
  Cc: Eric Dumazet, Alexei Starovoitov, Laurent Chavey, Yuchung Cheng,
	Martin KaFai Lau, netdev@vger.kernel.org, David S. Miller,
	Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
	Kernel Team
In-Reply-To: <CA+mtBx8tB6EE6i9C5KdOmwJ1D1nnaX3bvia71oj=N9U5h3KKBA@mail.gmail.com>

On 12/15/14 17:01, Tom Herbert wrote:


>
> Maybe this would be good for discussion at netdev01?
>

Yes it would be a good fit,
I just pinged Eric when i saw his email saying the same thing ;->

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Jamal Hadi Salim @ 2014-12-15 23:27 UTC (permalink / raw)
  To: Arad, Ronen, John Fastabend, netdev@vger.kernel.org
  Cc: Roopa Prabhu, Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org,
	tgraf@suug.ch, stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505D9FA8C@ORSMSX106.amr.corp.intel.com>

On 12/15/14 13:36, Arad, Ronen wrote:
>
>
>> -----Original Message-----

> The behavior of a driver could depend on the presence of a bridge and features such as FDB LEARNING and LEARNING_SYNC.

Indeed, those are bridge attributes.

> A switch port driver which is not enslaved to a bridge might need to implement VLAN-aware FDB
> within the driver and report its content to user-space using ndo_fdb_dump.
>
> A switch port driver which is enslaved to a bridge could do with only pass through for static FDB configuration
> to the HW when LEARNING_SYNC is configured. FDB reporting to user-space and soft aging are left to the bridge module FDB.
> Such a driver, without LEARNING_SYNC, could still avoid maintaining an in-driver FDB as long as it could dump the HW FDB on demand.
> LEARNING_SYNC also requires periodic updates of freshness information from the driver to the bridge module.
>


If you have an fdb - shouldn't that be exposed only if you have a bridge
abstraction exposed? i.e. that's where the Linux tools would work.
What i was referring to was a scenario where i have no interest in the
fdb despite such hardware capabilities. VLANs are a different issue;

>>> Will the decision about using a bridge device or avoiding it be left
>>> to the end-user?
>>
>> It's a user policy decision. Again the offload bit gets us this in a reasonably
>> configurable way IMO.
>>
>>> (This requires switch port drivers to be able to work and provide
>>> similar functionality in both setups).
>>
>> Right, but if the drivers "care" who is calling their ndo ops something is
>> seriously broken. For the driver it should not need to know anything about
>> the callers so it doesn't matter to the driver if its a netlink call from user
>> space or an internal call from bridge.ko
>
> LEARNING_SYNC only makes sense when a switch port driver is enslaved to a bridge.
> Rocker switch driver indeed monitors upper change notifications and keeps track of master bridge presence.
> So bridge presence is not transparent.
>

Agreed - the challenge so far is that people have been fascinated by
the "switch" point of view. I think we are learning, and the usefulness
of the class device will eventually become obvious.

cheers,
jamal

^ permalink raw reply

* [ANNOUNCE] libnftnl 1.0.3 release
From: Pablo Neira Ayuso @ 2014-12-15 23:21 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, netfilter, netfilter-announce, lwn

[-- Attachment #1: Type: text/plain, Size: 559 bytes --]

Hi!

The Netfilter project proudly presents:

        libnftnl 1.0.3

libnftnl is a userspace library providing a low-level netlink
programming interface (API) to the in-kernel nf_tables subsystem. The
library libnftnl has been previously known as libnftables. This
library is currently used by the nft command line tool.

This release comes with new features available up to 3.18, see
ChangeLog for more details.

You can download this library from:

http://www.netfilter.org/projects/libnftnl/downloads.html
ftp://ftp.netfilter.org/pub/libnftnl/

Have fun!

[-- Attachment #2: changes-libnftnl-1.0.3.txt --]
[-- Type: text/plain, Size: 3665 bytes --]

Alvaro Neira (2):
      ruleset: add set id to parsed sets
      src: internal set id allocation from nft_ruleset_parse*()

Ana Rey (15):
      chain: Free memory in the same function that is reserved
      chain: Use nft_rule_expr_set_* in the xml parsing code
      table: Free memory in the same function that is reserved
      table: Use nft_table_attr_set_* in the xml functions
      table: Add set, unset and parse implementation for the use attribute
      table: Do not print unset values in xml file
      table: Do not print unset values in json file
      chain: Add all support of use attribute
      chain: Do not print unset attributes in xml
      chain: Rename variables in nft_jansson_parse_chain functions
      chain: Do not print unset attributes in json
      expr: meta: Add pkttype support
      expr: meta: Add cpu support for meta expresion
      expr: meta: Add devgroup support
      expr: meta: Add cgroup support

Arturo Borrero (33):
      set: add support for set mechanism selection
      examples: nft-set-add: use batch infraestructure
      examples: nft-chain-del: add chain_del_parse()
      examples: nft-chain-del: support new batching interface
      set_elem: use proper free function
      examples: merge nft-chain-{xml|json}-add.c
      examples: nft-chain-parse-add: add batching support
      examples: merge nft-table-{xml|json}-add.c
      examples: nft-table-parse-add: add batching support
      examples: nft-table-add: add table_add_parse()
      examples: nft-table-add: add batching support
      examples: nft-table-del: add table_del_parse()
      examples: nft-table-del: add batching support
      src: fix printing of XML/JSON event wrapper header/footer
      expr: nat: add support for the new flags attribute
      expr: add new nft_masq expression
      nf_tables.h: add NFTA_MASQ_UNSPEC
      utils: nft_fprintf: prevent an empty buffer from being printed
      set: fix set nlmsg desc parsing
      examples: merge nft-rule-{xml|json}-add.c
      examples: nft-rule-parse-add: add batching support
      examples: nft-set-json-add: generalize parsing format support
      examples: nft-set-parse-add: add batching support
      examples: nft-table-add: fix wrong buffer pointer
      expr: masq: optional printing of flags attr in snprintf_default
      tests: add tests for the masq expression
      tests: also test nat flags attribute
      src: cleanup in mxml and jansson regarding set_id parsing
      utils: fix arp family number
      ruleset: deconstify _get interface
      src: add support for nft_redir expression
      tests: add tests for nft_redir expression
      examples: nft-rule-parse-add: fix wrong buffer usage when building rule header

Giuseppe Longo (1):
      buffer: include stdarg header

Pablo Neira Ayuso (16):
      expr: log: add support for level and flags
      src: stricter netlink attribute length validation
      set_elem: add nft_set_elems_nlmsg_build_payload_iter()
      common: add batching interfaces
      examples: nft-chain-add: add chain_add_parse()
      examples: nft-chain-add: support new batching interface
      utils: define xfree() as macro
      src: get rid of cached copies of x_tables.h and xt_LOG.h
      src: add ruleset generation class
      src: fix compilation without xml/json support
      remove empty src/attr.c
      expr: nat: use 'nat_type' instead of 'type' in the parser
      src: consolidate XML/JSON exportation
      expr: data_reg: use 'reg' instead of 'data_reg'
      bump version to 1.0.3
      include: add missing gen.h to Makefile.am

Álvaro Neira Ayuso (1):
      expr: log: define variable flags in xml parser


^ permalink raw reply

* ethtool 3.18 released
From: Ben Hutchings @ 2014-12-14 19:35 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 448 bytes --]

ethtool version 3.18 has been released.

Home page: https://www.kernel.org/pub/software/network/ethtool/
Download link:
https://www.kernel.org/pub/software/network/ethtool/ethtool-3.18.tar.xz

Release notes:

	* Fix: Lookup of SFP Tx bias in SFF-8472 module diagnostics (-m option)
	* Fix: Build with musl by using more common typedefs

Ben.

-- 
Ben Hutchings
The two most common things in the universe are hydrogen and stupidity.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: rapier @ 2014-12-15 22:17 UTC (permalink / raw)
  To: Tom Herbert, Josef Bacik
  Cc: Eric Dumazet, Alexei Starovoitov, Laurent Chavey, Yuchung Cheng,
	Martin KaFai Lau, netdev@vger.kernel.org, David S. Miller,
	Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
	Kernel Team
In-Reply-To: <CA+mtBx8tB6EE6i9C5KdOmwJ1D1nnaX3bvia71oj=N9U5h3KKBA@mail.gmail.com>

The Web10g development team at PSC (we've been working with
a number of other organizations on this) will be submitting
the kernel instrument set tomorrow morning. We'd be happy to
join any discussion then.

Chris rapier

On 12/15/14, 5:01 PM, Tom Herbert wrote:
> On Mon, Dec 15, 2014 at 8:42 AM, Josef Bacik <jbacik@fb.com> wrote:
>> On 12/15/2014 11:03 AM, Eric Dumazet wrote:
>>>
>>> On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:
>>>
>>>> I think patches 1 and 3 are good additions, since they establish
>>>> few permanent points of instrumentation in tcp stack.
>>>> Patches 4-5 look more like use cases of tracepoints established
>>>> before. They may feel like simple additions and, no doubt,
>>>> they are useful, but since they expose things via tracing
>>>> infra they become part of api and cannot be changed later,
>>>> when more stats would be needed.
>>>> I think systemtap like scripting on top of patches 1 and 3
>>>> should solve your use case ?
>>>> Also, have you looked at recent eBPF work?
>>>> Though it's not completely ready yet, soon it should
>>>> be able to do the same stats collection as you have
>>>> in 4/5 without adding permanent pieces to the kernel.
>>>
>>>
>>> So it looks like web10g like interfaces are very often requested by
>>> various teams.
>>>
>>> And we have many different views on how to hack this. I am astonished by
>>> number of hacks I saw about this stuff going on.
>>>
>>> What about a clean way, extending current TCP_INFO, which is both
>>> available as a getsockopt() for socket owners and ss/iproute2
>>> information for 'external entities'
>>>
>>> If we consider web10g info needed, then adding a ftrace/eBPF like
>>> interface is simply yet another piece of code we need to maintain,
>>> and the argument of 'this should cost nothing if not activated' is
>>> nonsense since major players need to constantly monitor TCP metrics and
>>> behavior.
>>>
>>> It seems both FaceBook and Google are working on a subset of web10g.
>>>
>>> I suggest we meet together and establish a common ground, preferably
>>> after Christmas holidays.
>>>
>>
>> We've set up something for exactly this case at the end of January but have
>> yet to get a response from Google.  If any of the Google people cc'ed (or
>> really anybody, it's not a strictly FB/Google thing) is interested please
>> email me directly and I'll send you the details, we will be meeting face to
>> face in the bay area at the end of January.  Thanks,
>>
>
> Maybe this would be good for discussion at netdev01?
>
>> Josef
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* [iproute2] tc: Show classes more hierarchically
From: vadim4j @ 2014-12-15 22:48 UTC (permalink / raw)
  To: netdev; +Cc: vadim4j

Hi All,

I am playing with showing classes in a more hierarchical format. I
have some code, and example output from my TC looks like:

# tc/tc -t class show dev tap0

 \---1:2 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:40 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:50 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:60 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
 \---1:1 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:10 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
               \---1:11 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
                      \---1:111 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:20 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:30 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 


which in standard output mode looks like:

# tc/tc class show dev tap0

class htb 1:11 parent 1:10 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:111 parent 1:11 prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:10 parent 1:1 rate 5Mbit ceil 5Mbit burst 15Kb cburst 1600b 
class htb 1:1 root rate 6Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:20 parent 1:1 leaf 20: prio 0 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:2 root rate 6Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:30 parent 1:1 leaf 30: prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:40 parent 1:2 leaf 40: prio 0 rate 5Mbit ceil 5Mbit burst 15Kb cburst 1600b 
class htb 1:50 parent 1:2 leaf 50: prio 0 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:60 parent 1:2 leaf 60: prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 

So I'd like to ask whether this would be useful for TC users (maybe
in a better format?).
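
To make the idea concrete outside of tc itself, here is a minimal
sketch in Python that rebuilds the hierarchy from the flat "tc class
show" output by following each class's "parent" field. It is not the
patch: the function names are made up for illustration, the
indentation only imitates the example above, and the parser assumes
every line has the form "class KIND ID root ..." or
"class KIND ID parent PARENT ...".

```python
from collections import defaultdict


def parse_classes(text):
    """Parse 'tc class show' lines into (classid, kind, parent, options).

    parent is None for root classes. Lines that do not match the
    assumed layout are skipped.
    """
    classes = []
    for line in text.splitlines():
        words = line.split()
        if len(words) < 4 or words[0] != "class":
            continue
        kind, classid = words[1], words[2]
        if words[3] == "root":
            parent, rest = None, words[4:]
        elif words[3] == "parent" and len(words) >= 5:
            parent, rest = words[4], words[5:]
        else:
            continue
        classes.append((classid, kind, parent, " ".join(rest)))
    return classes


def format_tree(classes):
    """Return the classes as indented lines, children under their parent.

    Siblings keep the order of the input. Classes whose parent is
    missing from the input are not printed.
    """
    children = defaultdict(list)
    for entry in classes:
        children[entry[2]].append(entry)

    lines = []

    def walk(parent, depth):
        for classid, kind, _, options in children[parent]:
            indent = " " * (1 + 7 * depth)
            lines.append("%s\\---%s (%s) %s" % (indent, classid, kind, options))
            walk(classid, depth + 1)

    walk(None, 0)
    return lines
```

Feeding the output of "tc class show dev tap0" to parse_classes() and
printing each line of format_tree() gives one tree per root class.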

Thanks,

^ permalink raw reply

