Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next 2/2] net: phy: bcm7xxx: Plug in support for reading PHY error counters
From: Florian Fainelli @ 2016-11-28 21:06 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, bcm-kernel-feedback-list, allan.nielsen,
	raju.lakkaraju, Florian Fainelli
In-Reply-To: <20161128210614.12621-1-f.fainelli@gmail.com>

Broadcom BCM7xxx internal PHYs can leverage the Broadcom PHY library
module PHY error counters helper functions, just implement the
appropriate PHY driver function calls to do so. We need to allocate some
storage space for our PHY statistics, and provide it to the Broadcom PHY
library, so do this in a specific probe function, and slightly wrap the
get_stats function call.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/bcm7xxx.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/net/phy/bcm7xxx.c b/drivers/net/phy/bcm7xxx.c
index 5b3be4c67be8..fb976ab2ab92 100644
--- a/drivers/net/phy/bcm7xxx.c
+++ b/drivers/net/phy/bcm7xxx.c
@@ -45,6 +45,10 @@
 #define AFE_VDAC_OTHERS_0		MISC_ADDR(0x39, 3)
 #define AFE_HPF_TRIM_OTHERS		MISC_ADDR(0x3a, 0)
 
+struct bcm7xxx_phy_priv {
+	u64	*stats;
+};
+
 static void r_rc_cal_reset(struct phy_device *phydev)
 {
 	/* Reset R_CAL/RC_CAL Engine */
@@ -350,6 +354,32 @@ static int bcm7xxx_28nm_set_tunable(struct phy_device *phydev,
 	return genphy_restart_aneg(phydev);
 }
 
+static void bcm7xxx_28nm_get_phy_stats(struct phy_device *phydev,
+				       struct ethtool_stats *stats, u64 *data)
+{
+	struct bcm7xxx_phy_priv *priv = phydev->priv;
+
+	bcm_phy_get_stats(phydev, priv->stats, stats, data);
+}
+
+static int bcm7xxx_28nm_probe(struct phy_device *phydev)
+{
+	struct bcm7xxx_phy_priv *priv;
+
+	priv = devm_kzalloc(&phydev->mdio.dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	phydev->priv = priv;
+
+	priv->stats = devm_kzalloc(&phydev->mdio.dev,
+				   bcm_phy_get_sset_count(phydev), GFP_KERNEL);
+	if (!priv->stats)
+		return -ENOMEM;
+
+	return 0;
+}
+
 #define BCM7XXX_28NM_GPHY(_oui, _name)					\
 {									\
 	.phy_id		= (_oui),					\
@@ -364,6 +394,10 @@ static int bcm7xxx_28nm_set_tunable(struct phy_device *phydev,
 	.resume		= bcm7xxx_28nm_resume,				\
 	.get_tunable	= bcm7xxx_28nm_get_tunable,			\
 	.set_tunable	= bcm7xxx_28nm_set_tunable,			\
+	.get_sset_count	= bcm_phy_get_sset_count,			\
+	.get_strings	= bcm_phy_get_strings,				\
+	.get_stats	= bcm7xxx_28nm_get_phy_stats,			\
+	.probe		= bcm7xxx_28nm_probe,				\
 }
 
 #define BCM7XXX_40NM_EPHY(_oui, _name)					\
-- 
2.9.3

^ permalink raw reply related

* [PATCH] rtl8xxxu: fix tx rate debug output
From: Arnd Bergmann @ 2016-11-28 21:08 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Arnd Bergmann, Kalle Valo, linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

We accidentally print the rate before we know it for txdesc_v2:

wireless/realtek/rtl8xxxu/rtl8xxxu_core.c: In function 'rtl8xxxu_fill_txdesc_v2':
wireless/realtek/rtl8xxxu/rtl8xxxu_core.c:4848:3: error: 'rate' may be used uninitialized in this function [-Werror=maybe-uninitialized]

txdesc_v1 got it right, so let's do it the same way here.

Fixes: b4c3d9cfb607 ("rtl8xxxu: Pass tx_info to fill_txdesc in order to have access to retry count")
Signed-off-by: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org>
---
 drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
index 04141e57b8ae..a9137abc3ad9 100644
--- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
+++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
@@ -4844,16 +4844,16 @@ rtl8xxxu_fill_txdesc_v2(struct ieee80211_hw *hw, struct ieee80211_hdr *hdr,
 
 	tx_desc40 = (struct rtl8xxxu_txdesc40 *)tx_desc32;
 
-	if (rtl8xxxu_debug & RTL8XXXU_DEBUG_TX)
-		dev_info(dev, "%s: TX rate: %d, pkt size %d\n",
-			 __func__, rate, cpu_to_le16(tx_desc40->pkt_size));

^ permalink raw reply related

* Re: [PATCH net-next v3 0/4] Documentation: net: phy: Improve documentation
From: David Miller @ 2016-11-28 21:08 UTC (permalink / raw)
  To: f.fainelli
  Cc: netdev, andrew, sf84, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet
In-Reply-To: <20161128024515.13070-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Sun, 27 Nov 2016 18:45:11 -0800

> This patch series addresses discussions and feedback that was recently received
> on the mailing-list in the area of: flow control/pause frames, interpretation of
> phy_interface_t and finally add some links to useful standards documents.

I'm always happy to see documentation improvements, series applied,
thanks!

^ permalink raw reply

* Re: [PATCH net 3/5] l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind()
From: kbuild test robot @ 2016-11-28 21:10 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: kbuild-all, netdev, James Chapman, Chris Elston
In-Reply-To: <de1fbe689e143ca160dd4da9a61c585c44ff6e78.1480360512.git.g.nault@alphalink.fr>

[-- Attachment #1: Type: text/plain, Size: 2705 bytes --]

Hi Guillaume,

[auto build test WARNING on net/master]

url:    https://github.com/0day-ci/linux/commits/Guillaume-Nault/l2tp-fixes-for-l2tp_ip-and-l2tp_ip6-socket-handling/20161129-043208
config: x86_64-rhel (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   net/l2tp/l2tp_ip.c: In function 'l2tp_ip_bind':
>> net/l2tp/l2tp_ip.c:299:9: warning: 'ret' may be used uninitialized in this function [-Wmaybe-uninitialized]
     return ret;
            ^~~

vim +/ret +299 net/l2tp/l2tp_ip.c

3fb4e5ea Guillaume Nault 2016-11-28  283  		goto out;
3fb4e5ea Guillaume Nault 2016-11-28  284  	}
3fb4e5ea Guillaume Nault 2016-11-28  285  
3fb4e5ea Guillaume Nault 2016-11-28  286  	sk_dst_reset(sk);
0d76751f James Chapman   2010-04-02  287  	l2tp_ip_sk(sk)->conn_id = addr->l2tp_conn_id;
0d76751f James Chapman   2010-04-02  288  
0d76751f James Chapman   2010-04-02  289  	sk_add_bind_node(sk, &l2tp_ip_bind_table);
0d76751f James Chapman   2010-04-02  290  	sk_del_node_init(sk);
0d76751f James Chapman   2010-04-02  291  	write_unlock_bh(&l2tp_ip_lock);
3fb4e5ea Guillaume Nault 2016-11-28  292  
0d76751f James Chapman   2010-04-02  293  	ret = 0;
c51ce497 James Chapman   2012-05-29  294  	sock_reset_flag(sk, SOCK_ZAPPED);
c51ce497 James Chapman   2012-05-29  295  
0d76751f James Chapman   2010-04-02  296  out:
0d76751f James Chapman   2010-04-02  297  	release_sock(sk);
0d76751f James Chapman   2010-04-02  298  
0d76751f James Chapman   2010-04-02 @299  	return ret;
0d76751f James Chapman   2010-04-02  300  }
0d76751f James Chapman   2010-04-02  301  
0d76751f James Chapman   2010-04-02  302  static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
0d76751f James Chapman   2010-04-02  303  {
0d76751f James Chapman   2010-04-02  304  	struct sockaddr_l2tpip *lsa = (struct sockaddr_l2tpip *) uaddr;
de3c7a18 James Chapman   2012-04-29  305  	int rc;
0d76751f James Chapman   2010-04-02  306  
0d76751f James Chapman   2010-04-02  307  	if (addr_len < sizeof(*lsa))

:::::: The code at line 299 was first introduced by commit
:::::: 0d76751fad7739014485ba5bd388d4f1b4fd4143 l2tp: Add L2TPv3 IP encapsulation (no UDP) support

:::::: TO: James Chapman <jchapman@katalix.com>
:::::: CC: David S. Miller <davem@davemloft.net>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37847 bytes --]

^ permalink raw reply

* Re: [PATCH] rtl8xxxu: fix tx rate debug output
From: Jes Sorensen @ 2016-11-28 21:12 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Kalle Valo, linux-wireless, netdev, linux-kernel
In-Reply-To: <20161128210815.2368509-1-arnd@arndb.de>

Arnd Bergmann <arnd@arndb.de> writes:
> We accidentally print the rate before we know it for txdesc_v2:

Hi Arnd,

Thanks for the patch - Barry Day already posted a patch for this which
Kalle has applied to the wireless tree.

Cheers,
Jes


>
> wireless/realtek/rtl8xxxu/rtl8xxxu_core.c: In function 'rtl8xxxu_fill_txdesc_v2':
> wireless/realtek/rtl8xxxu/rtl8xxxu_core.c:4848:3: error: 'rate' may be used uninitialized in this function [-Werror=maybe-uninitialized]
>
> txdesc_v1 got it right, so let's do it the same way here.
>
> Fixes: b4c3d9cfb607 ("rtl8xxxu: Pass tx_info to fill_txdesc in order to have access to retry count")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> index 04141e57b8ae..a9137abc3ad9 100644
> --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> @@ -4844,16 +4844,16 @@ rtl8xxxu_fill_txdesc_v2(struct ieee80211_hw *hw, struct ieee80211_hdr *hdr,
>  
>  	tx_desc40 = (struct rtl8xxxu_txdesc40 *)tx_desc32;
>  
> -	if (rtl8xxxu_debug & RTL8XXXU_DEBUG_TX)
> -		dev_info(dev, "%s: TX rate: %d, pkt size %d\n",
> -			 __func__, rate, cpu_to_le16(tx_desc40->pkt_size));
> -
>  	if (rate_flags & IEEE80211_TX_RC_MCS &&
>  	    !ieee80211_is_mgmt(hdr->frame_control))
>  		rate = tx_info->control.rates[0].idx + DESC_RATE_MCS0;
>  	else
>  		rate = tx_rate->hw_value;
>  
> +	if (rtl8xxxu_debug & RTL8XXXU_DEBUG_TX)
> +		dev_info(dev, "%s: TX rate: %d, pkt size %d\n",
> +			 __func__, rate, cpu_to_le16(tx_desc40->pkt_size));
> +
>  	seq_number = IEEE80211_SEQ_TO_SN(le16_to_cpu(hdr->seq_ctrl));
>  
>  	tx_desc40->txdw4 = cpu_to_le32(rate);

^ permalink raw reply

* Re: [PATCH] net: handle no dst on skb in icmp6_send
From: David Miller @ 2016-11-28 21:13 UTC (permalink / raw)
  To: dsa; +Cc: netdev, andreyknvl
In-Reply-To: <1480301573-21183-1-git-send-email-dsa@cumulusnetworks.com>

From: David Ahern <dsa@cumulusnetworks.com>
Date: Sun, 27 Nov 2016 18:52:53 -0800

> Andrey reported the following while fuzzing the kernel with syzkaller:
 ...
> icmp6_send / icmpv6_send is invoked for both rx and tx paths. In both
> cases the dst->dev should be preferred for determining the L3 domain
> if the dst has been set on the skb. Fallback to the skb->dev if it has
> not. This covers the case reported here where icmp6_send is invoked on
> Rx before the route lookup.
> 
> Fixes: 5d41ce29e ("net: icmp6_send should use dst dev to determine L3 domain")
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>

Applied, thanks David.

^ permalink raw reply

* Re: Receive offloads, small RCVBUF and zero TCP window
From: Alex Sidorenko @ 2016-11-28 21:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20161128.155459.1527519991492144879.davem@davemloft.net>

On Monday, November 28, 2016 3:54:59 PM EST David Miller wrote:
> From: Alex Sidorenko <alexandre.sidorenko@hpe.com>
> Date: Mon, 28 Nov 2016 15:49:26 -0500
> 
> > Now the question is whether is is OK to have icsk->icsk_ack.rcv_mss
> > larger than MTU.
> 
> It absolutely is not OK.
> 
> If VMWare wants to receive large frames for batching purposes it must
> use GRO or similar to achieve that, not just send vanilla frames into
> the stack which are larger than the device MTU.
> 

As VMWare's vmxnet3 driver is open-sourced and part of generic kernel, do you think the problem is in that driver or elsewhere? I looked at vmxnet3 sources and see that it uses LRO/GRO subroutines. Unfortunately, I don't understand its logic enough to see whether they are doing anything incorrectly.

Alex 

-- 

------------------------------------------------------------------
Alex Sidorenko	email: asid@hpe.com
ERT  Linux 	Hewlett-Packard Enterprise (Canada)
------------------------------------------------------------------

^ permalink raw reply

* Re: [PATCH] geneve: fix ip_hdr_len reserved for geneve6 tunnel.
From: David Miller @ 2016-11-28 21:15 UTC (permalink / raw)
  To: yanhaishuang; +Cc: hannes, aduyck, pshelar, jbenc, netdev, linux-kernel
In-Reply-To: <1480310818-78456-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Mon, 28 Nov 2016 13:26:58 +0800

> It shold reserved sizeof(ipv6hdr) for geneve in ipv6 tunnel.
> 
> Fixes: c3ef5aa5e5 ('geneve: Merge ipv4 and ipv6 geneve_build_skb()')
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>

Applied, thanks.

^ permalink raw reply

* Re: [patch net] net: dsa: fix unbalanced dsa_switch_tree reference counting
From: David Miller @ 2016-11-28 21:16 UTC (permalink / raw)
  To: nikita.yoush; +Cc: netdev, cphealy, andrew, linux-kernel
In-Reply-To: <1480315728-23398-1-git-send-email-nikita.yoush@cogentembedded.com>

From: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Date: Mon, 28 Nov 2016 09:48:48 +0300

> _dsa_register_switch() gets a dsa_switch_tree object either via
> dsa_get_dst() or via dsa_add_dst(). Former path does not increase kref
> in returned object (resulting into caller not owning a reference),
> while later path does create a new object (resulting into caller owning
> a reference).
> 
> The rest of _dsa_register_switch() assumes that it owns a reference, and
> calls dsa_put_dst().
> 
> This causes a memory breakage if first switch in the tree initialized
> successfully, but second failed to initialize. In particular, freed
> dsa_swith_tree object is left referenced by switch that was initialized,
> and later access to sysfs attributes of that switch cause OOPS.
> 
> To fix, need to add kref_get() call to dsa_get_dst().
> 
> Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
> Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation")
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next 0/6] tcp: sender chronographs instrumentation
From: David Miller @ 2016-11-28 21:19 UTC (permalink / raw)
  To: ycheng; +Cc: soheil, francisyyan, netdev, ncardwell, edumazet
In-Reply-To: <1480191016-73210-1-git-send-email-ycheng@google.com>

From: Yuchung Cheng <ycheng@google.com>
Date: Sat, 26 Nov 2016 12:10:10 -0800

> This patch set provides instrumentation on TCP sender limitations.
> While developing the BBR congestion control, we noticed that TCP
> sending process is often limited by factors unrelated to congestion
> control: insufficient sender buffer and/or insufficient receive
> window/buffer to saturate the network bandwidth. Unfortunately these
> limits are not visible to the users and often the poor performance
> is attributed to the congestion control of choice.
> 
> Thie patch aims to help users get the high level understanding of
> where sending process is limited by, similar to the TCP_INFO design.
> It is not to replace detailed kernel tracing and instrumentation
> facilities.
> 
> In addition this patch set provides a new option to the timestamping
> work to instrument these limits on application data unit. For exampe,
> one can use SO_TIMESTAMPING and this patch set to measure the how
> long a particular HTTP response is limited by small receive window.
> 
> Patch set was initially written by Francis Yan then polished
> by Yuchung Cheng, with lots of help from Eric Dumazet and Soheil
> Hassas Yeganeh.

Looks great, series applied, thanks!

^ permalink raw reply

* Re: net: GPF in eth_header
From: Eric Dumazet @ 2016-11-28 21:18 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Florian Westphal, syzkaller, Hannes Frederic Sowa, David Miller,
	Tom Herbert, Alexander Duyck, Jiri Benc, Sabrina Dubroca, netdev,
	LKML
In-Reply-To: <1480367152.18162.86.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, 2016-11-28 at 13:05 -0800, Eric Dumazet wrote:
> On Mon, 2016-11-28 at 11:47 -0800, Eric Dumazet wrote:
> > On Mon, 2016-11-28 at 20:34 +0100, Dmitry Vyukov wrote:
> > > On Mon, Nov 28, 2016 at 8:04 PM, 'Andrey Konovalov' via syzkaller
> > 
> > > > Hi Eric,
> > > >
> > > > As far as I can see, skb_network_offset() becomes negative after
> > > > pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data) in nf_ct_frag6_queue().
> > > > At least I'm able to detect that with a BUG_ON().
> > > >
> > > > Also it seems that the issue is only reproducible (at least with the
> > > > poc I provided) for a short time after boot.
> > > 
> > > 
> > > Eric,
> > > 
> > > Is it enough to debug? Or maybe Andrey can trace some values for you.
> > 
> > Well, now we are talking, if you tell me how many modules you load, it
> > might help ;)
> > 
> > nf_ct_frag6_queue is nowhere to be seen in my kernels, that might
> > explain why I could not reproduce the bug.
> > 
> > Let me try ;)
> > 
> 
> Might be a bug added in commit daaa7d647f81f3
> ("netfilter: ipv6: avoid nf_iterate recursion")
> 
> Florian, what do you think of dropping a packet that presumably was
> mangled badly by nf_ct_frag6_queue() ?
> 
> (Like about 48 byte pulled :(, and/or skb->csum changed )
> 
> diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> index f7aab5ab93a5..31aa947332d8 100644
> --- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> +++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> @@ -65,8 +65,8 @@ static unsigned int ipv6_defrag(void *priv,
>  
>  	err = nf_ct_frag6_gather(state->net, skb,
>  				 nf_ct6_defrag_user(state->hook, skb));
> -	/* queued */
> -	if (err == -EINPROGRESS)
> +	/* queued or mangled ... */
> +	if (err)
>  		return NF_STOLEN;

Or more exactly , returning NF_DROP so that skb is really freed ;)


diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index f7aab5ab93a5..508739a3ca2a 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -65,9 +65,9 @@ static unsigned int ipv6_defrag(void *priv,
 
 	err = nf_ct_frag6_gather(state->net, skb,
 				 nf_ct6_defrag_user(state->hook, skb));
-	/* queued */
-	if (err == -EINPROGRESS)
-		return NF_STOLEN;
+	/* queued or mangled ... */
+	if (err)
+		return (err == -EINPROGRESS) ? NF_STOLEN : NF_DROP;
 
 	return NF_ACCEPT;
 }

^ permalink raw reply related

* Re: [PATCH v4] cpsw: ethtool: add support for getting/setting EEE registers
From: David Miller @ 2016-11-28 21:23 UTC (permalink / raw)
  To: yegorslists; +Cc: netdev, linux-omap, grygorii.strashko, mugunthanvnm
In-Reply-To: <1480322493-25308-1-git-send-email-yegorslists@googlemail.com>

From: yegorslists@googlemail.com
Date: Mon, 28 Nov 2016 09:41:33 +0100

> From: Yegor Yefremov <yegorslists@googlemail.com>
> 
> Add the ability to query and set Energy Efficient Ethernet parameters
> via ethtool for applicable devices.
> 
> This patch doesn't activate full EEE support in cpsw driver, but it
> enables reading and writing EEE advertising settings. This way one
> can disable advertising EEE for certain speeds.
> 
> Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
> Acked-by: Rami Rosen <roszenrami@gmail.com>
> ---
> Changes:
> 	v4: respine against net-next (David Miller)
> 	v3: explain what features will be available with this patch (Florian Fainelli)
> 	v2: make routines static (Rami Rosen)

Applied, thank you.

^ permalink raw reply

* Re: [patch net] sched: cls_flower: remove from hashtable only in case skip sw flag is not set
From: Jiri Pirko @ 2016-11-28 21:23 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Amir Vadai, Linux Netdev List, David Miller, Jamal Hadi Salim,
	Ido Schimmel, Elad Raz, Or Gerlitz, Hadar Hen Zion
In-Reply-To: <CAJ3xEMhdyBfgtOgJqsY_x-Qr6hrhF01nm3VPL-kJnCofA0r3nA@mail.gmail.com>

Mon, Nov 28, 2016 at 10:04:56PM CET, gerlitz.or@gmail.com wrote:
>On Mon, Nov 28, 2016 at 4:40 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>>
>> Be symmetric to hashtable insert and remove filter from hashtable only
>> in case skip sw flag is not set.
>>
>> Fixes: e69985c67c33 ("net/sched: cls_flower: Introduce support in SKIP SW flag")
>
>Amir, Jiri - what was the impact of running without this fix for the
>last 3-4 kernels? I haven't seen any crashes, is that leaking took
>place? or this is just a cleanup to make things more clear and
>maintainable?

It's a fix for real bug. If you add rule with skip_sw flag, it is not
inserted into hashtable. But once you remove it, the current code
removes it from hashtable (did not inspect how rhashtable implementation
handles this).

^ permalink raw reply

* Re: [PATCH] bpf/samples: Fix PT_REGS_IP on s390x and use it
From: David Miller @ 2016-11-28 21:27 UTC (permalink / raw)
  To: holzheu; +Cc: ast, netdev, heiko.carstens, schwidefsky
In-Reply-To: <20161128134830.44cd7c1f@TP-holzheu>

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Date: Mon, 28 Nov 2016 13:48:30 +0100

> The files "sampleip_kern.c" and "trace_event_kern.c" directly access
> "ctx->regs.ip" which is not available on s390x. Fix this and use the
> PT_REGS_IP() macro instead.
> 
> Also fix the macro for s390x and use "psw.addr" from "pt_regs".
> 
> Reported-by: Zvonko Kosic <zvonko.kosic@de.ibm.com>
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> Acked-by: Alexei Starovoitov <ast@kernel.org>

Applied, thanks.

^ permalink raw reply

* Re: Large performance regression with 6in4 tunnel (sit)
From: Alexander Duyck @ 2016-11-28 21:32 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Sven-Haegar Koch, Eli Cooper, Netdev, Eric Dumazet
In-Reply-To: <20161127142340.3a5c197e@canb.auug.org.au>

On Sat, Nov 26, 2016 at 7:23 PM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> Hi Sven-Haegar,
>
> On Fri, 25 Nov 2016 05:06:53 +0100 (CET) Sven-Haegar Koch <haegar@sdinet.de> wrote:
>>
>> Somehow this problem description really reminds me of a report on
>> netdev a bit ago, which the following patch fixed:
>>
>> commit 9ee6c5dc816aa8256257f2cd4008a9291ec7e985
>> Author: Lance Richardson <lrichard@redhat.com>
>> Date:   Wed Nov 2 16:36:17 2016 -0400
>>
>>     ipv4: allow local fragmentation in ip_finish_output_gso()
>>
>>     Some configurations (e.g. geneve interface with default
>>     MTU of 1500 over an ethernet interface with 1500 MTU) result
>>     in the transmission of packets that exceed the configured MTU.
>>     While this should be considered to be a "bad" configuration,
>>     it is still allowed and should not result in the sending
>>     of packets that exceed the configured MTU.
>>
>> Could this be related?
>>
>> I suppose it would be difficult to test this patch on this machine?
>
> The kernel I am running on is based on 4.7.8, so the above patch
> doesn't come close to applying. Most fo what it is reverting was
> introduced in commit 359ebda25aa0 ("net/ipv4: Introduce IPSKB_FRAG_SEGS
> bit to inet_skb_parm.flags") in v4.8-rc1.

So I think I have this root caused.  The problem seems to be the fact
that I chose to use lco_csum when trying to cancel out the inner IP
header from the checksum and it turns out that the transport offset is
never updated in the case of these tunnels.

For now a workaround is to just set tx-gso-partial to off on the
interface the tunnel is running over and you should be able to pass
traffic without any issues.

I have a patch for igb/igbvf that should be out in the next hour or so
which should address it.

Thanks.

- Alex

^ permalink raw reply

* Re: net: GPF in eth_header
From: Florian Westphal @ 2016-11-28 21:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dmitry Vyukov, Florian Westphal, syzkaller, Hannes Frederic Sowa,
	David Miller, Tom Herbert, Alexander Duyck, Jiri Benc,
	Sabrina Dubroca, netdev, LKML
In-Reply-To: <1480367886.18162.88.camel@edumazet-glaptop3.roam.corp.google.com>

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Might be a bug added in commit daaa7d647f81f3
> > ("netfilter: ipv6: avoid nf_iterate recursion")
> > 
> > Florian, what do you think of dropping a packet that presumably was
> > mangled badly by nf_ct_frag6_queue() ?

ipv4 definitely frees malformed packets.
In general, I think netfilter should avoid 'silent' drops if possible
and let skb continue, but of course such skbs should not be made worse
as what we ate to begin with...

> > (Like about 48 byte pulled :(, and/or skb->csum changed )

I think this warrants a review of ipv6 reassembly too, bug reported here
is because ipv6 nf defrag is also done on output.

> diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> index f7aab5ab93a5..508739a3ca2a 100644
> --- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> +++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> @@ -65,9 +65,9 @@ static unsigned int ipv6_defrag(void *priv,
>  
>  	err = nf_ct_frag6_gather(state->net, skb,
>  				 nf_ct6_defrag_user(state->hook, skb));
> -	/* queued */
> -	if (err == -EINPROGRESS)
> -		return NF_STOLEN;
> +	/* queued or mangled ... */
> +	if (err)
> +		return (err == -EINPROGRESS) ? NF_STOLEN : NF_DROP;
>  
>  	return NF_ACCEPT;

Looks good, we'll need to change some of the errno return codes in
nf_ct_frag6_gather to 0 though for this to work, which should not be too
hard ;)

^ permalink raw reply

* Re: [PATCH v3 net-next 1/2] net: ethernet: slicoss: add slicoss gigabit ethernet driver
From: Lino Sanfilippo @ 2016-11-28 21:41 UTC (permalink / raw)
  To: Florian Fainelli, davem, charrer, liodot, gregkh, andrew
  Cc: devel, netdev, linux-kernel
In-Reply-To: <a6f99dc8-60e4-7727-0764-abdfac81c713@gmail.com>

Hi Florian,

On 28.11.2016 05:56, Florian Fainelli wrote:
> On 11/26/2016 04:20 AM, Lino Sanfilippo wrote:
>> Add driver for Alacritech gigabit ethernet cards with SLIC (session-layer
>> interface control) technology. The driver provides basic support without
>> SLIC for the following devices:
>> 
>> - Mojave cards (single port PCI Gigabit) both copper and fiber
>> - Oasis cards (single and dual port PCI-x Gigabit) copper and fiber
>> - Kalahari cards (dual and quad port PCI-e Gigabit) copper and fiber
> 
> This looks great, a few nits below:
> 
> 
>> +#define SLIC_MAX_TX_COMPLETIONS		100
> 
> You usually don't want to limit the number of TX completion, if the
> entire TX ring needs to be cleaned, you would want to allow that.
> 

The problem is that the HW does not provide a tx completion index. Instead we have to 
iterate the status descriptors until we get an invalid idx which indicates that there 
are no further tx descriptors done for now. I am afraid that if we do not limit the 
number of descriptors processed in the tx completion handler, a continuous transmission 
of frames could keep the loop in xmit_complete() run endlessly. I dont know if this 
can actually happen but I wanted to make sure that this is avoided.

> [snip]
> 
>> +	while (slic_get_free_rx_descs(rxq) > SLIC_MAX_REQ_RX_DESCS) {
>> +		skb = alloc_skb(maplen + ALIGN_MASK, gfp);
>> +		if (!skb)
>> +			break;
>> +
>> +		paddr = dma_map_single(&sdev->pdev->dev, skb->data, maplen,
>> +				       DMA_FROM_DEVICE);
>> +		if (dma_mapping_error(&sdev->pdev->dev, paddr)) {
>> +			netdev_err(dev, "mapping rx packet failed\n");
>> +			/* drop skb */
>> +			dev_kfree_skb_any(skb);
>> +			break;
>> +		}
>> +		/* ensure head buffer descriptors are 256 byte aligned */
>> +		offset = 0;
>> +		misalign = paddr & ALIGN_MASK;
>> +		if (misalign) {
>> +			offset = SLIC_RX_BUFF_ALIGN - misalign;
>> +			skb_reserve(skb, offset);
>> +		}
>> +		/* the HW expects dma chunks for descriptor + frame data */
>> +		desc = (struct slic_rx_desc *)skb->data;
>> +		memset(desc, 0, sizeof(*desc));
> 
> Do you really need to zero-out the prepending RX descriptor? Are not you
> missing a write barrier here?

Indeed, it should be sufficient to make sure that the bit SLIC_IRHDDR_SVALID is not set.
I will adjust it. 
Concerning the write barrier: You mean a wmb() before slic_write() to ensure that the zeroing
 of the status desc is done before the descriptor is passed to the HW, right?


> [snip]
> 
>> +
>> +		dma_sync_single_for_cpu(&sdev->pdev->dev,
>> +					dma_unmap_addr(buff, map_addr),
>> +					buff->addr_offset + sizeof(*desc),
>> +					DMA_FROM_DEVICE);
>> +
>> +		status = le32_to_cpu(desc->status);
>> +		if (!(status & SLIC_IRHDDR_SVALID))
>> +			break;
>> +
>> +		buff->skb = NULL;
>> +
>> +		dma_unmap_single(&sdev->pdev->dev,
>> +				 dma_unmap_addr(buff, map_addr),
>> +				 dma_unmap_len(buff, map_len),
>> +				 DMA_FROM_DEVICE);
> 
> This is potentially inefficient, you already did a cache invalidation
> for the RX descriptor here, you could be more efficient with just
> invalidating the packet length, minus the descriptor length.
> 

I am not sure I understand: We have to unmap the complete dma area, no matter if we synced
part of it before, dont we? AFAIK a dma sync is different from unmapping dma, or do I miss
something?


>> +
>> +		/* skip rx descriptor that is placed before the frame data */
>> +		skb_reserve(skb, SLIC_RX_BUFF_HDR_SIZE);
>> +
>> +		if (unlikely(status & SLIC_IRHDDR_ERR)) {
>> +			slic_handle_frame_error(sdev, skb);
>> +			dev_kfree_skb_any(skb);
>> +		} else {
>> +			struct ethhdr *eh = (struct ethhdr *)skb->data;
>> +
>> +			if (is_multicast_ether_addr(eh->h_dest))
>> +				SLIC_INC_STATS_COUNTER(&sdev->stats, rx_mcasts);
>> +
>> +			len = le32_to_cpu(desc->length) & SLIC_IRHDDR_FLEN_MSK;
>> +			skb_put(skb, len);
>> +			skb->protocol = eth_type_trans(skb, dev);
>> +			skb->ip_summed = CHECKSUM_UNNECESSARY;
>> +			skb->dev = dev;
> 
> eth_type_trans() already assigns skb->dev = dev;
> 

Right, this is unnecessary, I will fix it.

>> +static int slic_poll(struct napi_struct *napi, int todo)
>> +{
>> +	struct slic_device *sdev = container_of(napi, struct slic_device, napi);
>> +	struct slic_shmem *sm = &sdev->shmem;
>> +	struct slic_shmem_data *sm_data = sm->shmem_data;
>> +	u32 isr = le32_to_cpu(sm_data->isr);
>> +	unsigned int done = 0;
>> +
>> +	slic_handle_irq(sdev, isr, todo, &done);
>> +
>> +	if (done < todo) {
>> +		napi_complete(napi);
> 
> napi_complete_done() since you know how many packets you completed.
> 

Ok, will change it.

>> +		/* reenable irqs */
>> +		sm_data->isr = 0;
>> +		/* make sure sm_data->isr is cleard before irqs are reenabled */
>> +		wmb();
>> +		slic_write(sdev, SLIC_REG_ISR, 0);
>> +		slic_flush_write(sdev);
>> +	}
>> +
>> +	return done;
>> +}
>> +
>> +static irqreturn_t slic_irq(int irq, void *dev_id)
>> +{
>> +	struct slic_device *sdev = dev_id;
>> +	struct slic_shmem *sm = &sdev->shmem;
>> +	struct slic_shmem_data *sm_data = sm->shmem_data;
>> +
>> +	slic_write(sdev, SLIC_REG_ICR, SLIC_ICR_INT_MASK);
>> +	slic_flush_write(sdev);
>> +	/* make sure sm_data->isr is read after ICR_INT_MASK is set */
>> +	wmb();
>> +
>> +	if (!sm_data->isr) {
>> +		dma_rmb();
>> +		/* spurious interrupt */
>> +		slic_write(sdev, SLIC_REG_ISR, 0);
>> +		slic_flush_write(sdev);
>> +		return IRQ_NONE;
>> +	}
>> +
>> +	napi_schedule(&sdev->napi);
> 
> Likewise napi_schedule_irqoff() can be used here.
> 

Ok, will change it.

Thanks a lot Florian!

Regards,
Lino

^ permalink raw reply

* [net PATCH 0/2] Don't use lco_csum to compute IPv4 checksum
From: Alexander Duyck @ 2016-11-28 15:42 UTC (permalink / raw)
  To: netdev, intel-wired-lan; +Cc: sfr, davem, jeffrey.t.kirsher

When I implemented the GSO partial support in the Intel drivers I was using
lco_csum to compute the checksum that we needed to plug into the IPv4
checksum field in order to cancel out the data that was not a part of the
IPv4 header.  However this didn't take into account that the transport
offset might be pointing to the inner transport header.

Instead of using lco_csum I have just coded around it so that we can use
the outer IP header plus the IP header length to determine where we need to
start our checksum and then just call csum_partial ourselves.

This should fix the SIT issue reported on igb interfaces as well as simliar
issues that would pop up on other Intel NICs.

---

Alexander Duyck (2):
      igb/igbvf: Don't use lco_csum to compute IPv4 checksum
      ixgbe/ixgbevf: Don't use lco_csum to compute IPv4 checksum


 drivers/net/ethernet/intel/igb/igb_main.c         |    8 ++++++--
 drivers/net/ethernet/intel/igbvf/netdev.c         |    8 ++++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    8 ++++++--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    8 ++++++--
 4 files changed, 24 insertions(+), 8 deletions(-)

^ permalink raw reply

* [net PATCH 1/2] igb/igbvf: Don't use lco_csum to compute IPv4 checksum
From: Alexander Duyck @ 2016-11-28 15:42 UTC (permalink / raw)
  To: netdev, intel-wired-lan; +Cc: sfr, davem, jeffrey.t.kirsher
In-Reply-To: <20161128153927.15466.99193.stgit@ahduyck-blue-test.jf.intel.com>

In the case of IPIP and SIT tunnel frames the outer transport header
offset is actually set to the same offset as the inner transport header.
This results in the lco_csum call not doing any checksum computation over
the inner IPv4/v6 header data.

In order to account for that I am updating the code so that we determine
the location to start the checksum ourselves based on the location of the
IPv4 header and the length.

Fixes: e10715d3e961 ("igb/igbvf: Add support for GSO partial")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c |    8 ++++++--
 drivers/net/ethernet/intel/igbvf/netdev.c |    8 ++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 59125b1..76ce0cc 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -4922,11 +4922,15 @@ static int igb_tso(struct igb_ring *tx_ring,
 
 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
+		unsigned char *csum_start = skb_checksum_start(skb);
+		unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);
+
 		/* IP header will have to cancel out any data that
 		 * is not a part of the outer IP header
 		 */
-		ip.v4->check = csum_fold(csum_add(lco_csum(skb),
-						  csum_unfold(l4.tcp->check)));
+		ip.v4->check = csum_fold(csum_partial(trans_start,
+						      csum_start - trans_start,
+						      0));
 		type_tucmd |= E1000_ADVTXD_TUCMD_IPV4;
 
 		ip.v4->tot_len = 0;
diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index b0778ba..e3a80ac 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -1965,11 +1965,15 @@ static int igbvf_tso(struct igbvf_ring *tx_ring,
 
 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
+		unsigned char *csum_start = skb_checksum_start(skb);
+		unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);
+
 		/* IP header will have to cancel out any data that
 		 * is not a part of the outer IP header
 		 */
-		ip.v4->check = csum_fold(csum_add(lco_csum(skb),
-						  csum_unfold(l4.tcp->check)));
+		ip.v4->check = csum_fold(csum_partial(trans_start,
+						      csum_start - trans_start,
+						      0));
 		type_tucmd |= E1000_ADVTXD_TUCMD_IPV4;
 
 		ip.v4->tot_len = 0;

^ permalink raw reply related

* [net PATCH 2/2] ixgbe/ixgbevf: Don't use lco_csum to compute IPv4 checksum
From: Alexander Duyck @ 2016-11-28 15:42 UTC (permalink / raw)
  To: netdev, intel-wired-lan; +Cc: sfr, davem, jeffrey.t.kirsher
In-Reply-To: <20161128153927.15466.99193.stgit@ahduyck-blue-test.jf.intel.com>

In the case of IPIP and SIT tunnel frames the outer transport header
offset is actually set to the same offset as the inner transport header.
This results in the lco_csum call not doing any checksum computation over
the inner IPv4/v6 header data.

In order to account for that I am updating the code so that we determine
the location to start the checksum ourselves based on the location of the
IPv4 header and the length.

Fixes: b83e30104bd9 ("ixgbe/ixgbevf: Add support for GSO partial")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    8 ++++++--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    8 ++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index b4f0374..c72f783 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7284,11 +7284,15 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 
 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
+		unsigned char *csum_start = skb_checksum_start(skb);
+		unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);
+
 		/* IP header will have to cancel out any data that
 		 * is not a part of the outer IP header
 		 */
-		ip.v4->check = csum_fold(csum_add(lco_csum(skb),
-						  csum_unfold(l4.tcp->check)));
+		ip.v4->check = csum_fold(csum_partial(trans_start,
+						      csum_start - trans_start,
+						      0));
 		type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;
 
 		ip.v4->tot_len = 0;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index d9d6616..baa885e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3326,11 +3326,15 @@ static int ixgbevf_tso(struct ixgbevf_ring *tx_ring,
 
 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
+		unsigned char *csum_start = skb_checksum_start(skb);
+		unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);
+
 		/* IP header will have to cancel out any data that
 		 * is not a part of the outer IP header
 		 */
-		ip.v4->check = csum_fold(csum_add(lco_csum(skb),
-						  csum_unfold(l4.tcp->check)));
+		ip.v4->check = csum_fold(csum_partial(trans_start,
+						      csum_start - trans_start,
+						      0));
 		type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;
 
 		ip.v4->tot_len = 0;

^ permalink raw reply related

* (unknown), 
From: salome.khum @ 2016-11-28 21:46 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: MESSAGE_5999780_netdev.zip --]
[-- Type: application/zip, Size: 2062 bytes --]

^ permalink raw reply

* Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Saeed Mahameed @ 2016-11-28 21:55 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Linux Netdev List, Tariq Toukan
In-Reply-To: <20161128.154057.1355209532933900181.davem@davemloft.net>

On Mon, Nov 28, 2016 at 10:40 PM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sat, 26 Nov 2016 18:16:00 -0800
>
>> On Sun, 2016-11-27 at 00:47 +0200, Saeed Mahameed wrote:
>>> On Fri, Nov 25, 2016 at 5:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>> As you see here in SRIOV mode (PF only) reads   sw stats from FW.
>>> Tariq, I think we need to fix this.
>>
>> Sure, my patch does not change this at all.
>>
>> If mlx4_is_master() is false, then we aggregate the software states and
>> only the software stats.
>>
>> My patch makes this aggregation possible at the time ethtool or
>> ndo_get_stat64() are called, since this absolutely not depend on the 250
>> ms timer fetching hardware stats.
>
> Saeed please provide counter arguments or ACK this patch, thank you.

I have nothing against this patch, I just wanted to point out that
this patch is just fixing the symptom.
We keep ignoring the root cause that dev_get_stats is called under a
spin_lock which really ties our hands "us device drivers developers"
and push us towards those fragile solutions like deferred work for
caching statistics.
I am not saying that Eric should fix this in his patch, i just wanted
to raise the awareness of the root cause.

Regarding the SRIOV PF stats, i will take it with Tariq internally.

Bottom line, I ACK this patch, I might even do the same myself for
mlx5 :), but there are some follow ups that need to be done.

^ permalink raw reply

* Re: Receive offloads, small RCVBUF and zero TCP window
From: Marcelo Ricardo Leitner @ 2016-11-28 22:01 UTC (permalink / raw)
  To: David Miller; +Cc: alexandre.sidorenko, netdev, jmaxwell37, eric.dumazet
In-Reply-To: <20161128.155459.1527519991492144879.davem@davemloft.net>

On Mon, Nov 28, 2016 at 03:54:59PM -0500, David Miller wrote:
> From: Alex Sidorenko <alexandre.sidorenko@hpe.com>
> Date: Mon, 28 Nov 2016 15:49:26 -0500
> 
> > Now the question is whether is is OK to have icsk->icsk_ack.rcv_mss
> > larger than MTU.
> 
> It absolutely is not OK.
> 

Would it make sense to add a pr_warn_once() and perhaps even clamp it
down to known/saner MSS?

> If VMWare wants to receive large frames for batching purposes it must
> use GRO or similar to achieve that, not just send vanilla frames into
> the stack which are larger than the device MTU.
> 

It's not the first report I've seen on this type of issue. IBM also had
this issue recently while not being able to send the gso_size from tx
side to rx, and the warning probably could have saved quite some
debugging time.

Something like (but with a better msg, for sure):

--8<--

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a27b9c0e27c0..3a59cffae3fa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -144,7 +144,9 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 	 */
 	len = skb_shinfo(skb)->gso_size ? : skb->len;
 	if (len >= icsk->icsk_ack.rcv_mss) {
-		icsk->icsk_ack.rcv_mss = len;
+		icsk->icsk_ack.rcv_mss = max(len, tcp_sk(sk)->advmss);
+		if (icsk->icsk_ack.rcv_mss != len)
+			pr_warn_once("Your driver is likely doing bad rx acceleration.\n");
 	} else {
 		/* Otherwise, we make more careful check taking into account,
 		 * that SACKs block is variable.

^ permalink raw reply related

* Re: Aw: Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Lino Sanfilippo @ 2016-11-28 22:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Laight, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480101551.8455.557.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On 25.11.2016 20:19, Eric Dumazet wrote:
> On Fri, 2016-11-25 at 17:30 +0100, Lino Sanfilippo wrote:
>> Hi,
>> 
>> 
>> > 
>> > The READ_ONCE() are documenting the fact that no lock is taken to fetch
>> > the stats, while another cpus might being changing them.
>> > 
>> > I had no answer yet from https://patchwork.ozlabs.org/patch/698449/
>> > 
>> > So I thought it was not needed to explain this in the changelog, given
>> > that it apparently is one of the few things that can block someone to
>> > understand one of my changes :/
>> > 
>> > Apparently nobody really understands READ_ONCE() purpose, it is really a
>> > pity we have to explain this over and over.
>> > 
>> 
>> Even at the risk of showing once more a lack of understanding for
>> READ_ONCE():
>> Does not a READ_ONCE() have to e paired with some kind of
>> WRITE_ONCE()? 
> 
> You are right.
> 
> Although in this case, the producers are using a lock, and do
> 
> ring->packets++;
> 
> We hopefully have compilers/cpus that do not put intermediate garbage in
> ring->packets while doing the increment.
> 
> One problem with :
> 
> WRITE_ONCE(ring->packets, ring->packets + 1);
> 
> is that gcc no longer uses an INC instruction.

I see. So we would have to do something like

tmp = ring->packets;
tmp++;
WRITE_ONCE(ring->packets, tmp);

to use WRITE_ONCE in this case? If so, could it be worth doing something like this to 
have a balanced READ_ONCE, WRITE_ONCE usage?

> 
> Maybe we need some ADD_ONCE(ptr, val) macro doing the proper thing.
> 
>> Furthermore: there a quite some network drivers that ensure visibility
>> of 
>> the descriptor queue indices between xmit and xmit completion function
>> by means of
>> smp barriers. Could all these drivers theoretically be adjusted to use
>> READ_ONCE(),
>> WRITE_ONCE() for the indices instead?
>> 
> 
> Well, for this precise case we do need appropriate smp barriers.
> 
> READ_ONCE() can be better than poor barrier(), look at 
> https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=b668534c1d9b80f4cda4d761eb11d3a6c9f4ced8
> 
> 

Regards,
Lino

^ permalink raw reply

* Re: net: GPF in eth_header
From: Eric Dumazet @ 2016-11-28 22:14 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Dmitry Vyukov, syzkaller, Hannes Frederic Sowa, David Miller,
	Tom Herbert, Alexander Duyck, Jiri Benc, Sabrina Dubroca, netdev,
	LKML
In-Reply-To: <20161128213444.GA9858@breakpoint.cc>

On Mon, 2016-11-28 at 22:34 +0100, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > Might be a bug added in commit daaa7d647f81f3
> > > ("netfilter: ipv6: avoid nf_iterate recursion")
> > > 
> > > Florian, what do you think of dropping a packet that presumably was
> > > mangled badly by nf_ct_frag6_queue() ?
> 
> ipv4 definitely frees malformed packets.
> In general, I think netfilter should avoid 'silent' drops if possible
> and let skb continue, but of course such skbs should not be made worse
> as what we ate to begin with...
> 
> > > (Like about 48 byte pulled :(, and/or skb->csum changed )
> 
> I think this warrants a review of ipv6 reassembly too, bug reported here
> is because ipv6 nf defrag is also done on output.


ip6_frag_queue() definitely frees bad/mangled skbs()


> Looks good, we'll need to change some of the errno return codes in
> nf_ct_frag6_gather to 0 though for this to work, which should not be too
> hard ;)

If the goal is to let buggy packets pass, then we might need to undo
changes in nf_ct_frag6_queue()

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox