Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 2/2] openvswitch: Fix skb->protocol for vlan frames.
From: Jiri Benc @ 2016-11-28 22:42 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev
In-Reply-To: <76814927-D373-4C3A-BC85-5771304235A7@ovn.org>

On Mon, 28 Nov 2016 14:29:39 -0800, Jarno Rajahalme wrote:
> I’m not sure what you suggest here. Obviously the kernel ABI can not
> be changed as existing userspace code expects upcalled packets to be
> non-accelerated. Also, if userspace pushes vlan headers, the packet
> will actually have them.

The user space API needs to be preserved, of course. I'm talking about
what happens internally in the kernel.

See this patchset: https://www.spinics.net/lists/netdev/msg398827.html

> Would this incremental fix this:
> 
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index 9be9fda..37f1bb9 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -354,6 +354,8 @@ static int parse_vlan(struct sk_buff *skb, struct
> sw_flow_key *key) res = parse_vlan_tag(skb, &key->eth.vlan);
>  		if (res <= 0)
>  			return res;
> +		if (skb->protocol == htons(ETH_P_TEB))
> +			skb->protocol = key->eth.vlan.tpid;
>  	}
>  
>  	/* Parse inner vlan tag. */

I'll look at this tomorrow. But it seems we're adding more and more
hacks instead of cleaning up the vlan handling.

 Jiri

^ permalink raw reply

* Re: Large performance regression with 6in4 tunnel (sit)
From: Stephen Rothwell @ 2016-11-28 22:38 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Sven-Haegar Koch, Eli Cooper, Netdev, Eric Dumazet
In-Reply-To: <CAKgT0Ue3f0hUnCvwBupqmhBE_bbxZUO7ODXsz2cbHrPu5gAqyA@mail.gmail.com>

Hi Alex,

On Mon, 28 Nov 2016 13:32:21 -0800 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
> So I think I have this root caused.  The problem seems to be the fact
> that I chose to use lco_csum when trying to cancel out the inner IP
> header from the checksum and it turns out that the transport offset is
> never updated in the case of these tunnels.
> 
> For now a workaround is to just set tx-gso-partial to off on the
> interface the tunnel is running over and you should be able to pass
> traffic without any issues.

OK, so that works (even with gso and tso set to "on" on the sit
interface).  Thanks.

> I have a patch for igb/igbvf that should be out in the next hour or so
> which should address it.

That will be a bit harder to test, but I will see what I can do.

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply

* Re: [PATCH net-next 2/2] openvswitch: Fix skb->protocol for vlan frames.
From: Jarno Rajahalme @ 2016-11-28 22:29 UTC (permalink / raw)
  To: Jiri Benc; +Cc: netdev
In-Reply-To: <20161124171046.7eb0e287@griffin>


> On Nov 24, 2016, at 8:10 AM, Jiri Benc <jbenc@redhat.com> wrote:
> 
> On Tue, 22 Nov 2016 20:09:34 -0800, Jarno Rajahalme wrote:
>> Do not set skb->protocol to be the ethertype of the L3 header, unless
>> the packet only has the L3 header.  For a non-hardware offloaded VLAN
>> frame skb->protocol needs to be one of the VLAN ethertypes.
>> 
>> Any VLAN offloading is undone on the OVS netlink interface.  Due to
>> this all VLAN packets sent to openvswitch module from userspace are
>> non-offloaded.
> 
> This is exactly why I wanted to always accelerate the vlan tag, the
> same way it is done in other parts of the networking stack: to prevent
> all those weird corner cases.
> 
> Looks to me this is the only real way forward.
> 

I’m not sure what you suggest here. Obviously the kernel ABI can not be changed as existing userspace code expects upcalled packets to be non-accelerated. Also, if userspace pushes vlan headers, the packet will actually have them.

> This patch is wrong, it would leave skb->protocol as ETH_P_TEB for L2
> frames received via ARPHRD_NONE interface.
> 

Would this incremental fix this:

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 9be9fda..37f1bb9 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -354,6 +354,8 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
 		res = parse_vlan_tag(skb, &key->eth.vlan);
 		if (res <= 0)
 			return res;
+		if (skb->protocol == htons(ETH_P_TEB))
+			skb->protocol = key->eth.vlan.tpid;
 	}
 
 	/* Parse inner vlan tag. */


  Jarno

^ permalink raw reply related

* Re: [net PATCH 0/2] Don't use lco_csum to compute IPv4 checksum
From: Jeff Kirsher @ 2016-11-28 22:26 UTC (permalink / raw)
  To: sfr, netdev, intel-wired-lan; +Cc: Alexander Duyck, davem
In-Reply-To: <20161128153927.15466.99193.stgit@ahduyck-blue-test.jf.intel.com>

[-- Attachment #1: Type: text/plain, Size: 1076 bytes --]

On Mon, 2016-11-28 at 10:42 -0500, Alexander Duyck wrote:
> When I implemented the GSO partial support in the Intel drivers I was
> using
> lco_csum to compute the checksum that we needed to plug into the IPv4
> checksum field in order to cancel out the data that was not a part of the
> IPv4 header.  However this didn't take into account that the transport
> offset might be pointing to the inner transport header.
> 
> Instead of using lco_csum I have just coded around it so that we can use
> the outer IP header plus the IP header length to determine where we need
> to
> start our checksum and then just call csum_partial ourselves.
> 
> This should fix the SIT issue reported on igb interfaces as well as
> simliar
> issues that would pop up on other Intel NICs.
> 
> ---
> 
> Alexander Duyck (2):
>       igb/igbvf: Don't use lco_csum to compute IPv4 checksum
>       ixgbe/ixgbevf: Don't use lco_csum to compute IPv4 checksum

Stephen, I have applied Alex's patches to my net-queue tree.  Can you
confirm they resolve the bug seen?

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: net: GPF in eth_header
From: Florian Westphal @ 2016-11-28 22:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Florian Westphal, Dmitry Vyukov, syzkaller, Hannes Frederic Sowa,
	David Miller, Tom Herbert, Alexander Duyck, Jiri Benc,
	Sabrina Dubroca, netdev, LKML
In-Reply-To: <1480371244.18162.91.camel@edumazet-glaptop3.roam.corp.google.com>

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-11-28 at 22:34 +0100, Florian Westphal wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > Might be a bug added in commit daaa7d647f81f3
> > > > ("netfilter: ipv6: avoid nf_iterate recursion")
> > > > 
> > > > Florian, what do you think of dropping a packet that presumably was
> > > > mangled badly by nf_ct_frag6_queue() ?
> > 
> > ipv4 definitely frees malformed packets.
> > In general, I think netfilter should avoid 'silent' drops if possible
> > and let skb continue, but of course such skbs should not be made worse
> > as what we ate to begin with...
> > 
> > > > (Like about 48 byte pulled :(, and/or skb->csum changed )
> > 
> > I think this warrants a review of ipv6 reassembly too, bug reported here
> > is because ipv6 nf defrag is also done on output.
> 
> 
> ip6_frag_queue() definitely frees bad/mangled skbs()

Yes, sorry.  nf_ct_frag6_queue is mostly derived from ip6_frag_queue
so any bugs in one might also exist in other.
Thats all I wanted to say here.  I'll check this tomorrow.

> > Looks good, we'll need to change some of the errno return codes in
> > nf_ct_frag6_gather to 0 though for this to work, which should not be too
> > hard ;)
> 
> If the goal is to let buggy packets pass, then we might need to undo
> changes in nf_ct_frag6_queue()

It currently returns -EINVAL in cases where skb wasn't changed/altered
(e.g. because it doesn't have a fragment header), so we should ACCEPT in
that case.

As for 'buggy' packet, I think its ok to mimic ip6_frag_queue, i.e.
if it tosses returning NF_DROP under same circumstance seems ok.

(Passing however will -- on ingress side -- cause snmp stat increments
 in ipv6 reassembly, this still might be desireable).

I'll check where undo might be possible/not too hard.

Thanks Eric for debugging this!

^ permalink raw reply

* Re: Checking for MDIO phy address 0
From: Phil Endecott @ 2016-11-28 22:17 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev
In-Reply-To: <acbf27e0-48d3-736c-2ff4-69969ced9da7@gmail.com>

Hi Florian,

Florian Fainelli wrote:
> On 11/28/2016 10:53 AM, Phil Endecott wrote:

>> Anyway, my reason for this message is to suggest a runtime diagnostic 
>> message somewhere if address 0 is used - this could have saved me a lot of 
>> work! If someone can suggest the right place to add this I can prepare a 
>> patch.
>
> Address 0 can be made special or not, but there is no programmatic way
> to tell without scanning the MDIO bus for devices, so I don't think a
> warning is going to help at all to warn about that. There are also tons
> of cases where the address is just treated like any other address, which
> is typical with MDIO connected Ethernet switches where PHY@0 is just the
> switch's port 0 built-in PHY for instance.

Thanks for the quick reply; I'll forget that idea.

Regarding the actual problem with the 0 address on this board, I
have been told that someone at APM is going to look into it.  Fingers
crossed that it's fixable somewhere in a driver.  I'll update the list
if necessary.


Cheers,  Phil.

^ permalink raw reply

* [PATCH net-next] bpf, xdp: allow to pass flags to dev_change_xdp_fd
From: Daniel Borkmann @ 2016-11-28 22:16 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, netdev, Daniel Borkmann

Add an IFLA_XDP_FLAGS attribute that can be passed for setting up
XDP along with IFLA_XDP_FD, which eventually allows user space to
implement typical add/replace/delete logic for programs. Right now,
calling into dev_change_xdp_fd() will always replace previous programs.

When passed XDP_FLAGS_UPDATE_IF_NOEXIST, we can handle this more
graceful when requested by returning -EBUSY in case we try to
attach a new program, but we find that another one is already
attached. This will be used by upcoming front-end for iproute2 as
well.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/netdevice.h    |  2 +-
 include/uapi/linux/if_link.h |  4 ++++
 net/core/dev.c               | 20 ++++++++++++++++++--
 net/core/rtnetlink.c         | 14 +++++++++++++-
 4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4ffcd87..3755317 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3253,7 +3253,7 @@ int dev_get_phys_port_id(struct net_device *dev,
 int dev_get_phys_port_name(struct net_device *dev,
 			   char *name, size_t len);
 int dev_change_proto_down(struct net_device *dev, bool proto_down);
-int dev_change_xdp_fd(struct net_device *dev, int fd);
+int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags);
 struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 92b2d49..6b13e59 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -876,10 +876,14 @@ enum {
 
 /* XDP section */
 
+#define XDP_FLAGS_UPDATE_IF_NOEXIST	(1U << 0)
+#define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST)
+
 enum {
 	IFLA_XDP_UNSPEC,
 	IFLA_XDP_FD,
 	IFLA_XDP_ATTACHED,
+	IFLA_XDP_FLAGS,
 	__IFLA_XDP_MAX,
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 048b46b..a44aefe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6692,26 +6692,42 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
  *	dev_change_xdp_fd - set or clear a bpf program for a device rx path
  *	@dev: device
  *	@fd: new program fd or negative value to clear
+ *	@flags: xdp-related flags
  *
  *	Set or clear a bpf program for a device
  */
-int dev_change_xdp_fd(struct net_device *dev, int fd)
+int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
 	struct bpf_prog *prog = NULL;
-	struct netdev_xdp xdp = {};
+	struct netdev_xdp xdp;
 	int err;
 
+	ASSERT_RTNL();
+
 	if (!ops->ndo_xdp)
 		return -EOPNOTSUPP;
 	if (fd >= 0) {
+		if (flags & XDP_FLAGS_UPDATE_IF_NOEXIST) {
+			memset(&xdp, 0, sizeof(xdp));
+			xdp.command = XDP_QUERY_PROG;
+
+			err = ops->ndo_xdp(dev, &xdp);
+			if (err < 0)
+				return err;
+			if (xdp.prog_attached)
+				return -EBUSY;
+		}
+
 		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
 		if (IS_ERR(prog))
 			return PTR_ERR(prog);
 	}
 
+	memset(&xdp, 0, sizeof(xdp));
 	xdp.command = XDP_SETUP_PROG;
 	xdp.prog = prog;
+
 	err = ops->ndo_xdp(dev, &xdp);
 	if (err < 0 && prog)
 		bpf_prog_put(prog);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 4e60525..bd85570 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1505,6 +1505,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = {
 	[IFLA_XDP_FD]		= { .type = NLA_S32 },
 	[IFLA_XDP_ATTACHED]	= { .type = NLA_U8 },
+	[IFLA_XDP_FLAGS]	= { .type = NLA_U32 },
 };
 
 static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr *nla)
@@ -2164,6 +2165,7 @@ static int do_setlink(const struct sk_buff *skb,
 
 	if (tb[IFLA_XDP]) {
 		struct nlattr *xdp[IFLA_XDP_MAX + 1];
+		u32 xdp_flags = 0;
 
 		err = nla_parse_nested(xdp, IFLA_XDP_MAX, tb[IFLA_XDP],
 				       ifla_xdp_policy);
@@ -2174,9 +2176,19 @@ static int do_setlink(const struct sk_buff *skb,
 			err = -EINVAL;
 			goto errout;
 		}
+
+		if (xdp[IFLA_XDP_FLAGS]) {
+			xdp_flags = nla_get_u32(xdp[IFLA_XDP_FLAGS]);
+			if (xdp_flags & ~XDP_FLAGS_MASK) {
+				err = -EINVAL;
+				goto errout;
+			}
+		}
+
 		if (xdp[IFLA_XDP_FD]) {
 			err = dev_change_xdp_fd(dev,
-						nla_get_s32(xdp[IFLA_XDP_FD]));
+						nla_get_s32(xdp[IFLA_XDP_FD]),
+						xdp_flags);
 			if (err)
 				goto errout;
 			status |= DO_SETLINK_NOTIFY;
-- 
1.9.3

^ permalink raw reply related

* Re: net: GPF in eth_header
From: Eric Dumazet @ 2016-11-28 22:14 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Dmitry Vyukov, syzkaller, Hannes Frederic Sowa, David Miller,
	Tom Herbert, Alexander Duyck, Jiri Benc, Sabrina Dubroca, netdev,
	LKML
In-Reply-To: <20161128213444.GA9858@breakpoint.cc>

On Mon, 2016-11-28 at 22:34 +0100, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > Might be a bug added in commit daaa7d647f81f3
> > > ("netfilter: ipv6: avoid nf_iterate recursion")
> > > 
> > > Florian, what do you think of dropping a packet that presumably was
> > > mangled badly by nf_ct_frag6_queue() ?
> 
> ipv4 definitely frees malformed packets.
> In general, I think netfilter should avoid 'silent' drops if possible
> and let skb continue, but of course such skbs should not be made worse
> as what we ate to begin with...
> 
> > > (Like about 48 byte pulled :(, and/or skb->csum changed )
> 
> I think this warrants a review of ipv6 reassembly too, bug reported here
> is because ipv6 nf defrag is also done on output.


ip6_frag_queue() definitely frees bad/mangled skbs()


> Looks good, we'll need to change some of the errno return codes in
> nf_ct_frag6_gather to 0 though for this to work, which should not be too
> hard ;)

If the goal is to let buggy packets pass, then we might need to undo
changes in nf_ct_frag6_queue()

^ permalink raw reply

* Re: Aw: Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Lino Sanfilippo @ 2016-11-28 22:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Laight, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480101551.8455.557.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On 25.11.2016 20:19, Eric Dumazet wrote:
> On Fri, 2016-11-25 at 17:30 +0100, Lino Sanfilippo wrote:
>> Hi,
>> 
>> 
>> > 
>> > The READ_ONCE() are documenting the fact that no lock is taken to fetch
>> > the stats, while another cpus might being changing them.
>> > 
>> > I had no answer yet from https://patchwork.ozlabs.org/patch/698449/
>> > 
>> > So I thought it was not needed to explain this in the changelog, given
>> > that it apparently is one of the few things that can block someone to
>> > understand one of my changes :/
>> > 
>> > Apparently nobody really understands READ_ONCE() purpose, it is really a
>> > pity we have to explain this over and over.
>> > 
>> 
>> Even at the risk of showing once more a lack of understanding for
>> READ_ONCE():
>> Does not a READ_ONCE() have to e paired with some kind of
>> WRITE_ONCE()? 
> 
> You are right.
> 
> Although in this case, the producers are using a lock, and do
> 
> ring->packets++;
> 
> We hopefully have compilers/cpus that do not put intermediate garbage in
> ring->packets while doing the increment.
> 
> One problem with :
> 
> WRITE_ONCE(ring->packets, ring->packets + 1);
> 
> is that gcc no longer uses an INC instruction.

I see. So we would have to do something like

tmp = ring->packets;
tmp++;
WRITE_ONCE(ring->packets, tmp);

to use WRITE_ONCE in this case? If so, could it be worth doing something like this to 
have a balanced READ_ONCE, WRITE_ONCE usage?

> 
> Maybe we need some ADD_ONCE(ptr, val) macro doing the proper thing.
> 
>> Furthermore: there a quite some network drivers that ensure visibility
>> of 
>> the descriptor queue indices between xmit and xmit completion function
>> by means of
>> smp barriers. Could all these drivers theoretically be adjusted to use
>> READ_ONCE(),
>> WRITE_ONCE() for the indices instead?
>> 
> 
> Well, for this precise case we do need appropriate smp barriers.
> 
> READ_ONCE() can be better than poor barrier(), look at 
> https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=b668534c1d9b80f4cda4d761eb11d3a6c9f4ced8
> 
> 

Regards,
Lino

^ permalink raw reply

* Re: Receive offloads, small RCVBUF and zero TCP window
From: Marcelo Ricardo Leitner @ 2016-11-28 22:01 UTC (permalink / raw)
  To: David Miller; +Cc: alexandre.sidorenko, netdev, jmaxwell37, eric.dumazet
In-Reply-To: <20161128.155459.1527519991492144879.davem@davemloft.net>

On Mon, Nov 28, 2016 at 03:54:59PM -0500, David Miller wrote:
> From: Alex Sidorenko <alexandre.sidorenko@hpe.com>
> Date: Mon, 28 Nov 2016 15:49:26 -0500
> 
> > Now the question is whether is is OK to have icsk->icsk_ack.rcv_mss
> > larger than MTU.
> 
> It absolutely is not OK.
> 

Would it make sense to add a pr_warn_once() and perhaps even clamp it
down to known/saner MSS?

> If VMWare wants to receive large frames for batching purposes it must
> use GRO or similar to achieve that, not just send vanilla frames into
> the stack which are larger than the device MTU.
> 

It's not the first report I've seen on this type of issue. IBM also had
this issue recently while not being able to send the gso_size from tx
side to rx, and the warning probably could have saved quite some
debugging time.

Something like (but with a better msg, for sure):

--8<--

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a27b9c0e27c0..3a59cffae3fa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -144,7 +144,9 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 	 */
 	len = skb_shinfo(skb)->gso_size ? : skb->len;
 	if (len >= icsk->icsk_ack.rcv_mss) {
-		icsk->icsk_ack.rcv_mss = len;
+		icsk->icsk_ack.rcv_mss = max(len, tcp_sk(sk)->advmss);
+		if (icsk->icsk_ack.rcv_mss != len)
+			pr_warn_once("Your driver is likely doing bad rx acceleration.\n");
 	} else {
 		/* Otherwise, we make more careful check taking into account,
 		 * that SACKs block is variable.

^ permalink raw reply related

* Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Saeed Mahameed @ 2016-11-28 21:55 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Linux Netdev List, Tariq Toukan
In-Reply-To: <20161128.154057.1355209532933900181.davem@davemloft.net>

On Mon, Nov 28, 2016 at 10:40 PM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sat, 26 Nov 2016 18:16:00 -0800
>
>> On Sun, 2016-11-27 at 00:47 +0200, Saeed Mahameed wrote:
>>> On Fri, Nov 25, 2016 at 5:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>> As you see here in SRIOV mode (PF only) reads   sw stats from FW.
>>> Tariq, I think we need to fix this.
>>
>> Sure, my patch does not change this at all.
>>
>> If mlx4_is_master() is false, then we aggregate the software states and
>> only the software stats.
>>
>> My patch makes this aggregation possible at the time ethtool or
>> ndo_get_stat64() are called, since this absolutely not depend on the 250
>> ms timer fetching hardware stats.
>
> Saeed please provide counter arguments or ACK this patch, thank you.

I have nothing against this patch, I just wanted to point out that
this patch is just fixing the symptom.
We keep ignoring the root cause that dev_get_stats is called under a
spin_lock which really ties our hands "us device drivers developers"
and push us towards those fragile solutions like deferred work for
caching statistics.
I am not saying that Eric should fix this in his patch, i just wanted
to raise the awareness of the root cause.

Regarding the SRIOV PF stats, i will take it with Tariq internally.

Bottom line, I ACK this patch, I might even do the same myself for
mlx5 :), but there are some follow ups that need to be done.

^ permalink raw reply

* (unknown), 
From: salome.khum @ 2016-11-28 21:46 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: MESSAGE_5999780_netdev.zip --]
[-- Type: application/zip, Size: 2062 bytes --]

^ permalink raw reply

* [net PATCH 2/2] ixgbe/ixgbevf: Don't use lco_csum to compute IPv4 checksum
From: Alexander Duyck @ 2016-11-28 15:42 UTC (permalink / raw)
  To: netdev, intel-wired-lan; +Cc: sfr, davem, jeffrey.t.kirsher
In-Reply-To: <20161128153927.15466.99193.stgit@ahduyck-blue-test.jf.intel.com>

In the case of IPIP and SIT tunnel frames the outer transport header
offset is actually set to the same offset as the inner transport header.
This results in the lco_csum call not doing any checksum computation over
the inner IPv4/v6 header data.

In order to account for that I am updating the code so that we determine
the location to start the checksum ourselves based on the location of the
IPv4 header and the length.

Fixes: b83e30104bd9 ("ixgbe/ixgbevf: Add support for GSO partial")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    8 ++++++--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    8 ++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index b4f0374..c72f783 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7284,11 +7284,15 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 
 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
+		unsigned char *csum_start = skb_checksum_start(skb);
+		unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);
+
 		/* IP header will have to cancel out any data that
 		 * is not a part of the outer IP header
 		 */
-		ip.v4->check = csum_fold(csum_add(lco_csum(skb),
-						  csum_unfold(l4.tcp->check)));
+		ip.v4->check = csum_fold(csum_partial(trans_start,
+						      csum_start - trans_start,
+						      0));
 		type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;
 
 		ip.v4->tot_len = 0;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index d9d6616..baa885e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3326,11 +3326,15 @@ static int ixgbevf_tso(struct ixgbevf_ring *tx_ring,
 
 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
+		unsigned char *csum_start = skb_checksum_start(skb);
+		unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);
+
 		/* IP header will have to cancel out any data that
 		 * is not a part of the outer IP header
 		 */
-		ip.v4->check = csum_fold(csum_add(lco_csum(skb),
-						  csum_unfold(l4.tcp->check)));
+		ip.v4->check = csum_fold(csum_partial(trans_start,
+						      csum_start - trans_start,
+						      0));
 		type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;
 
 		ip.v4->tot_len = 0;

^ permalink raw reply related

* [net PATCH 1/2] igb/igbvf: Don't use lco_csum to compute IPv4 checksum
From: Alexander Duyck @ 2016-11-28 15:42 UTC (permalink / raw)
  To: netdev, intel-wired-lan; +Cc: sfr, davem, jeffrey.t.kirsher
In-Reply-To: <20161128153927.15466.99193.stgit@ahduyck-blue-test.jf.intel.com>

In the case of IPIP and SIT tunnel frames the outer transport header
offset is actually set to the same offset as the inner transport header.
This results in the lco_csum call not doing any checksum computation over
the inner IPv4/v6 header data.

In order to account for that I am updating the code so that we determine
the location to start the checksum ourselves based on the location of the
IPv4 header and the length.

Fixes: e10715d3e961 ("igb/igbvf: Add support for GSO partial")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c |    8 ++++++--
 drivers/net/ethernet/intel/igbvf/netdev.c |    8 ++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 59125b1..76ce0cc 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -4922,11 +4922,15 @@ static int igb_tso(struct igb_ring *tx_ring,
 
 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
+		unsigned char *csum_start = skb_checksum_start(skb);
+		unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);
+
 		/* IP header will have to cancel out any data that
 		 * is not a part of the outer IP header
 		 */
-		ip.v4->check = csum_fold(csum_add(lco_csum(skb),
-						  csum_unfold(l4.tcp->check)));
+		ip.v4->check = csum_fold(csum_partial(trans_start,
+						      csum_start - trans_start,
+						      0));
 		type_tucmd |= E1000_ADVTXD_TUCMD_IPV4;
 
 		ip.v4->tot_len = 0;
diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index b0778ba..e3a80ac 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -1965,11 +1965,15 @@ static int igbvf_tso(struct igbvf_ring *tx_ring,
 
 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
+		unsigned char *csum_start = skb_checksum_start(skb);
+		unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);
+
 		/* IP header will have to cancel out any data that
 		 * is not a part of the outer IP header
 		 */
-		ip.v4->check = csum_fold(csum_add(lco_csum(skb),
-						  csum_unfold(l4.tcp->check)));
+		ip.v4->check = csum_fold(csum_partial(trans_start,
+						      csum_start - trans_start,
+						      0));
 		type_tucmd |= E1000_ADVTXD_TUCMD_IPV4;
 
 		ip.v4->tot_len = 0;

^ permalink raw reply related

* [net PATCH 0/2] Don't use lco_csum to compute IPv4 checksum
From: Alexander Duyck @ 2016-11-28 15:42 UTC (permalink / raw)
  To: netdev, intel-wired-lan; +Cc: sfr, davem, jeffrey.t.kirsher

When I implemented the GSO partial support in the Intel drivers I was using
lco_csum to compute the checksum that we needed to plug into the IPv4
checksum field in order to cancel out the data that was not a part of the
IPv4 header.  However this didn't take into account that the transport
offset might be pointing to the inner transport header.

Instead of using lco_csum I have just coded around it so that we can use
the outer IP header plus the IP header length to determine where we need to
start our checksum and then just call csum_partial ourselves.

This should fix the SIT issue reported on igb interfaces as well as simliar
issues that would pop up on other Intel NICs.

---

Alexander Duyck (2):
      igb/igbvf: Don't use lco_csum to compute IPv4 checksum
      ixgbe/ixgbevf: Don't use lco_csum to compute IPv4 checksum

 drivers/net/ethernet/intel/igb/igb_main.c         |    8 ++++++--
 drivers/net/ethernet/intel/igbvf/netdev.c         |    8 ++++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    8 ++++++--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    8 ++++++--
 4 files changed, 24 insertions(+), 8 deletions(-)

^ permalink raw reply

* Re: [PATCH v3 net-next 1/2] net: ethernet: slicoss: add slicoss gigabit ethernet driver
From: Lino Sanfilippo @ 2016-11-28 21:41 UTC (permalink / raw)
  To: Florian Fainelli, davem, charrer, liodot, gregkh, andrew
  Cc: devel, netdev, linux-kernel
In-Reply-To: <a6f99dc8-60e4-7727-0764-abdfac81c713@gmail.com>

Hi Florian,

On 28.11.2016 05:56, Florian Fainelli wrote:
> On 11/26/2016 04:20 AM, Lino Sanfilippo wrote:
>> Add driver for Alacritech gigabit ethernet cards with SLIC (session-layer
>> interface control) technology. The driver provides basic support without
>> SLIC for the following devices:
>> 
>> - Mojave cards (single port PCI Gigabit) both copper and fiber
>> - Oasis cards (single and dual port PCI-x Gigabit) copper and fiber
>> - Kalahari cards (dual and quad port PCI-e Gigabit) copper and fiber
> 
> This looks great, a few nits below:
> 
> 
>> +#define SLIC_MAX_TX_COMPLETIONS		100
> 
> You usually don't want to limit the number of TX completion, if the
> entire TX ring needs to be cleaned, you would want to allow that.
> 

The problem is that the HW does not provide a tx completion index. Instead we have to 
iterate the status descriptors until we get an invalid idx which indicates that there 
are no further tx descriptors done for now. I am afraid that if we do not limit the 
number of descriptors processed in the tx completion handler, a continuous transmission 
of frames could keep the loop in xmit_complete() run endlessly. I dont know if this 
can actually happen but I wanted to make sure that this is avoided.

> [snip]
> 
>> +	while (slic_get_free_rx_descs(rxq) > SLIC_MAX_REQ_RX_DESCS) {
>> +		skb = alloc_skb(maplen + ALIGN_MASK, gfp);
>> +		if (!skb)
>> +			break;
>> +
>> +		paddr = dma_map_single(&sdev->pdev->dev, skb->data, maplen,
>> +				       DMA_FROM_DEVICE);
>> +		if (dma_mapping_error(&sdev->pdev->dev, paddr)) {
>> +			netdev_err(dev, "mapping rx packet failed\n");
>> +			/* drop skb */
>> +			dev_kfree_skb_any(skb);
>> +			break;
>> +		}
>> +		/* ensure head buffer descriptors are 256 byte aligned */
>> +		offset = 0;
>> +		misalign = paddr & ALIGN_MASK;
>> +		if (misalign) {
>> +			offset = SLIC_RX_BUFF_ALIGN - misalign;
>> +			skb_reserve(skb, offset);
>> +		}
>> +		/* the HW expects dma chunks for descriptor + frame data */
>> +		desc = (struct slic_rx_desc *)skb->data;
>> +		memset(desc, 0, sizeof(*desc));
> 
> Do you really need to zero-out the prepending RX descriptor? Are not you
> missing a write barrier here?

Indeed, it should be sufficient to make sure that the bit SLIC_IRHDDR_SVALID is not set.
I will adjust it. 
Concerning the write barrier: You mean a wmb() before slic_write() to ensure that the zeroing
 of the status desc is done before the descriptor is passed to the HW, right?


> [snip]
> 
>> +
>> +		dma_sync_single_for_cpu(&sdev->pdev->dev,
>> +					dma_unmap_addr(buff, map_addr),
>> +					buff->addr_offset + sizeof(*desc),
>> +					DMA_FROM_DEVICE);
>> +
>> +		status = le32_to_cpu(desc->status);
>> +		if (!(status & SLIC_IRHDDR_SVALID))
>> +			break;
>> +
>> +		buff->skb = NULL;
>> +
>> +		dma_unmap_single(&sdev->pdev->dev,
>> +				 dma_unmap_addr(buff, map_addr),
>> +				 dma_unmap_len(buff, map_len),
>> +				 DMA_FROM_DEVICE);
> 
> This is potentially inefficient, you already did a cache invalidation
> for the RX descriptor here, you could be more efficient with just
> invalidating the packet length, minus the descriptor length.
> 

I am not sure I understand: We have to unmap the complete dma area, no matter if we synced
part of it before, dont we? AFAIK a dma sync is different from unmapping dma, or do I miss
something?


>> +
>> +		/* skip rx descriptor that is placed before the frame data */
>> +		skb_reserve(skb, SLIC_RX_BUFF_HDR_SIZE);
>> +
>> +		if (unlikely(status & SLIC_IRHDDR_ERR)) {
>> +			slic_handle_frame_error(sdev, skb);
>> +			dev_kfree_skb_any(skb);
>> +		} else {
>> +			struct ethhdr *eh = (struct ethhdr *)skb->data;
>> +
>> +			if (is_multicast_ether_addr(eh->h_dest))
>> +				SLIC_INC_STATS_COUNTER(&sdev->stats, rx_mcasts);
>> +
>> +			len = le32_to_cpu(desc->length) & SLIC_IRHDDR_FLEN_MSK;
>> +			skb_put(skb, len);
>> +			skb->protocol = eth_type_trans(skb, dev);
>> +			skb->ip_summed = CHECKSUM_UNNECESSARY;
>> +			skb->dev = dev;
> 
> eth_type_trans() already assigns skb->dev = dev;
> 

Right, this is unnecessary, I will fix it.

>> +static int slic_poll(struct napi_struct *napi, int todo)
>> +{
>> +	struct slic_device *sdev = container_of(napi, struct slic_device, napi);
>> +	struct slic_shmem *sm = &sdev->shmem;
>> +	struct slic_shmem_data *sm_data = sm->shmem_data;
>> +	u32 isr = le32_to_cpu(sm_data->isr);
>> +	unsigned int done = 0;
>> +
>> +	slic_handle_irq(sdev, isr, todo, &done);
>> +
>> +	if (done < todo) {
>> +		napi_complete(napi);
> 
> napi_complete_done() since you know how many packets you completed.
> 

Ok, will change it.

>> +		/* reenable irqs */
>> +		sm_data->isr = 0;
>> +		/* make sure sm_data->isr is cleard before irqs are reenabled */
>> +		wmb();
>> +		slic_write(sdev, SLIC_REG_ISR, 0);
>> +		slic_flush_write(sdev);
>> +	}
>> +
>> +	return done;
>> +}
>> +
>> +static irqreturn_t slic_irq(int irq, void *dev_id)
>> +{
>> +	struct slic_device *sdev = dev_id;
>> +	struct slic_shmem *sm = &sdev->shmem;
>> +	struct slic_shmem_data *sm_data = sm->shmem_data;
>> +
>> +	slic_write(sdev, SLIC_REG_ICR, SLIC_ICR_INT_MASK);
>> +	slic_flush_write(sdev);
>> +	/* make sure sm_data->isr is read after ICR_INT_MASK is set */
>> +	wmb();
>> +
>> +	if (!sm_data->isr) {
>> +		dma_rmb();
>> +		/* spurious interrupt */
>> +		slic_write(sdev, SLIC_REG_ISR, 0);
>> +		slic_flush_write(sdev);
>> +		return IRQ_NONE;
>> +	}
>> +
>> +	napi_schedule(&sdev->napi);
> 
> Likewise napi_schedule_irqoff() can be used here.
> 

Ok, will change it.

Thanks a lot Florian!

Regards,
Lino

^ permalink raw reply

* Re: net: GPF in eth_header
From: Florian Westphal @ 2016-11-28 21:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dmitry Vyukov, Florian Westphal, syzkaller, Hannes Frederic Sowa,
	David Miller, Tom Herbert, Alexander Duyck, Jiri Benc,
	Sabrina Dubroca, netdev, LKML
In-Reply-To: <1480367886.18162.88.camel@edumazet-glaptop3.roam.corp.google.com>

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Might be a bug added in commit daaa7d647f81f3
> > ("netfilter: ipv6: avoid nf_iterate recursion")
> > 
> > Florian, what do you think of dropping a packet that presumably was
> > mangled badly by nf_ct_frag6_queue() ?

ipv4 definitely frees malformed packets.
In general, I think netfilter should avoid 'silent' drops if possible
and let skb continue, but of course such skbs should not be made worse
as what we ate to begin with...

> > (Like about 48 byte pulled :(, and/or skb->csum changed )

I think this warrants a review of ipv6 reassembly too, bug reported here
is because ipv6 nf defrag is also done on output.

> diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> index f7aab5ab93a5..508739a3ca2a 100644
> --- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> +++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> @@ -65,9 +65,9 @@ static unsigned int ipv6_defrag(void *priv,
>  
>  	err = nf_ct_frag6_gather(state->net, skb,
>  				 nf_ct6_defrag_user(state->hook, skb));
> -	/* queued */
> -	if (err == -EINPROGRESS)
> -		return NF_STOLEN;
> +	/* queued or mangled ... */
> +	if (err)
> +		return (err == -EINPROGRESS) ? NF_STOLEN : NF_DROP;
>  
>  	return NF_ACCEPT;

Looks good, we'll need to change some of the errno return codes in
nf_ct_frag6_gather to 0 though for this to work, which should not be too
hard ;)

^ permalink raw reply

* Re: Large performance regression with 6in4 tunnel (sit)
From: Alexander Duyck @ 2016-11-28 21:32 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Sven-Haegar Koch, Eli Cooper, Netdev, Eric Dumazet
In-Reply-To: <20161127142340.3a5c197e@canb.auug.org.au>

On Sat, Nov 26, 2016 at 7:23 PM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> Hi Sven-Haegar,
>
> On Fri, 25 Nov 2016 05:06:53 +0100 (CET) Sven-Haegar Koch <haegar@sdinet.de> wrote:
>>
>> Somehow this problem description really reminds me of a report on
>> netdev a bit ago, which the following patch fixed:
>>
>> commit 9ee6c5dc816aa8256257f2cd4008a9291ec7e985
>> Author: Lance Richardson <lrichard@redhat.com>
>> Date:   Wed Nov 2 16:36:17 2016 -0400
>>
>>     ipv4: allow local fragmentation in ip_finish_output_gso()
>>
>>     Some configurations (e.g. geneve interface with default
>>     MTU of 1500 over an ethernet interface with 1500 MTU) result
>>     in the transmission of packets that exceed the configured MTU.
>>     While this should be considered to be a "bad" configuration,
>>     it is still allowed and should not result in the sending
>>     of packets that exceed the configured MTU.
>>
>> Could this be related?
>>
>> I suppose it would be difficult to test this patch on this machine?
>
> The kernel I am running on is based on 4.7.8, so the above patch
> doesn't come close to applying. Most fo what it is reverting was
> introduced in commit 359ebda25aa0 ("net/ipv4: Introduce IPSKB_FRAG_SEGS
> bit to inet_skb_parm.flags") in v4.8-rc1.

So I think I have this root caused.  The problem seems to be the fact
that I chose to use lco_csum when trying to cancel out the inner IP
header from the checksum and it turns out that the transport offset is
never updated in the case of these tunnels.

For now a workaround is to just set tx-gso-partial to off on the
interface the tunnel is running over and you should be able to pass
traffic without any issues.

I have a patch for igb/igbvf that should be out in the next hour or so
which should address it.

Thanks.

- Alex

^ permalink raw reply

* Re: [PATCH] bpf/samples: Fix PT_REGS_IP on s390x and use it
From: David Miller @ 2016-11-28 21:27 UTC (permalink / raw)
  To: holzheu; +Cc: ast, netdev, heiko.carstens, schwidefsky
In-Reply-To: <20161128134830.44cd7c1f@TP-holzheu>

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Date: Mon, 28 Nov 2016 13:48:30 +0100

> The files "sampleip_kern.c" and "trace_event_kern.c" directly access
> "ctx->regs.ip" which is not available on s390x. Fix this and use the
> PT_REGS_IP() macro instead.
> 
> Also fix the macro for s390x and use "psw.addr" from "pt_regs".
> 
> Reported-by: Zvonko Kosic <zvonko.kosic@de.ibm.com>
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> Acked-by: Alexei Starovoitov <ast@kernel.org>

Applied, thanks.

^ permalink raw reply

* Re: [patch net] sched: cls_flower: remove from hashtable only in case skip sw flag is not set
From: Jiri Pirko @ 2016-11-28 21:23 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Amir Vadai, Linux Netdev List, David Miller, Jamal Hadi Salim,
	Ido Schimmel, Elad Raz, Or Gerlitz, Hadar Hen Zion
In-Reply-To: <CAJ3xEMhdyBfgtOgJqsY_x-Qr6hrhF01nm3VPL-kJnCofA0r3nA@mail.gmail.com>

Mon, Nov 28, 2016 at 10:04:56PM CET, gerlitz.or@gmail.com wrote:
>On Mon, Nov 28, 2016 at 4:40 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>>
>> Be symmetric to hashtable insert and remove filter from hashtable only
>> in case skip sw flag is not set.
>>
>> Fixes: e69985c67c33 ("net/sched: cls_flower: Introduce support in SKIP SW flag")
>
>Amir, Jiri - what was the impact of running without this fix for the
>last 3-4 kernels? I haven't seen any crashes, is that leaking took
>place? or this is just a cleanup to make things more clear and
>maintainable?

It's a fix for real bug. If you add rule with skip_sw flag, it is not
inserted into hashtable. But once you remove it, the current code
removes it from hashtable (did not inspect how rhashtable implementation
handles this).

^ permalink raw reply

* Re: [PATCH v4] cpsw: ethtool: add support for getting/setting EEE registers
From: David Miller @ 2016-11-28 21:23 UTC (permalink / raw)
  To: yegorslists; +Cc: netdev, linux-omap, grygorii.strashko, mugunthanvnm
In-Reply-To: <1480322493-25308-1-git-send-email-yegorslists@googlemail.com>

From: yegorslists@googlemail.com
Date: Mon, 28 Nov 2016 09:41:33 +0100

> From: Yegor Yefremov <yegorslists@googlemail.com>
> 
> Add the ability to query and set Energy Efficient Ethernet parameters
> via ethtool for applicable devices.
> 
> This patch doesn't activate full EEE support in cpsw driver, but it
> enables reading and writing EEE advertising settings. This way one
> can disable advertising EEE for certain speeds.
> 
> Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
> Acked-by: Rami Rosen <roszenrami@gmail.com>
> ---
> Changes:
> 	v4: respine against net-next (David Miller)
> 	v3: explain what features will be available with this patch (Florian Fainelli)
> 	v2: make routines static (Rami Rosen)

Applied, thank you.

^ permalink raw reply

* Re: net: GPF in eth_header
From: Eric Dumazet @ 2016-11-28 21:18 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Florian Westphal, syzkaller, Hannes Frederic Sowa, David Miller,
	Tom Herbert, Alexander Duyck, Jiri Benc, Sabrina Dubroca, netdev,
	LKML
In-Reply-To: <1480367152.18162.86.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, 2016-11-28 at 13:05 -0800, Eric Dumazet wrote:
> On Mon, 2016-11-28 at 11:47 -0800, Eric Dumazet wrote:
> > On Mon, 2016-11-28 at 20:34 +0100, Dmitry Vyukov wrote:
> > > On Mon, Nov 28, 2016 at 8:04 PM, 'Andrey Konovalov' via syzkaller
> > 
> > > > Hi Eric,
> > > >
> > > > As far as I can see, skb_network_offset() becomes negative after
> > > > pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data) in nf_ct_frag6_queue().
> > > > At least I'm able to detect that with a BUG_ON().
> > > >
> > > > Also it seems that the issue is only reproducible (at least with the
> > > > poc I provided) for a short time after boot.
> > > 
> > > 
> > > Eric,
> > > 
> > > Is it enough to debug? Or maybe Andrey can trace some values for you.
> > 
> > Well, now we are talking, if you tell me how many modules you load, it
> > might help ;)
> > 
> > nf_ct_frag6_queue is nowhere to be seen in my kernels, that might
> > explain why I could not reproduce the bug.
> > 
> > Let me try ;)
> > 
> 
> Might be a bug added in commit daaa7d647f81f3
> ("netfilter: ipv6: avoid nf_iterate recursion")
> 
> Florian, what do you think of dropping a packet that presumably was
> mangled badly by nf_ct_frag6_queue() ?
> 
> (Like about 48 byte pulled :(, and/or skb->csum changed )
> 
> diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> index f7aab5ab93a5..31aa947332d8 100644
> --- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> +++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
> @@ -65,8 +65,8 @@ static unsigned int ipv6_defrag(void *priv,
>  
>  	err = nf_ct_frag6_gather(state->net, skb,
>  				 nf_ct6_defrag_user(state->hook, skb));
> -	/* queued */
> -	if (err == -EINPROGRESS)
> +	/* queued or mangled ... */
> +	if (err)
>  		return NF_STOLEN;

Or more exactly , returning NF_DROP so that skb is really freed ;)


diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index f7aab5ab93a5..508739a3ca2a 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -65,9 +65,9 @@ static unsigned int ipv6_defrag(void *priv,
 
 	err = nf_ct_frag6_gather(state->net, skb,
 				 nf_ct6_defrag_user(state->hook, skb));
-	/* queued */
-	if (err == -EINPROGRESS)
-		return NF_STOLEN;
+	/* queued or mangled ... */
+	if (err)
+		return (err == -EINPROGRESS) ? NF_STOLEN : NF_DROP;
 
 	return NF_ACCEPT;
 }

^ permalink raw reply related

* Re: [PATCH net-next 0/6] tcp: sender chronographs instrumentation
From: David Miller @ 2016-11-28 21:19 UTC (permalink / raw)
  To: ycheng; +Cc: soheil, francisyyan, netdev, ncardwell, edumazet
In-Reply-To: <1480191016-73210-1-git-send-email-ycheng@google.com>

From: Yuchung Cheng <ycheng@google.com>
Date: Sat, 26 Nov 2016 12:10:10 -0800

> This patch set provides instrumentation on TCP sender limitations.
> While developing the BBR congestion control, we noticed that TCP
> sending process is often limited by factors unrelated to congestion
> control: insufficient sender buffer and/or insufficient receive
> window/buffer to saturate the network bandwidth. Unfortunately these
> limits are not visible to the users and often the poor performance
> is attributed to the congestion control of choice.
> 
> Thie patch aims to help users get the high level understanding of
> where sending process is limited by, similar to the TCP_INFO design.
> It is not to replace detailed kernel tracing and instrumentation
> facilities.
> 
> In addition this patch set provides a new option to the timestamping
> work to instrument these limits on application data unit. For exampe,
> one can use SO_TIMESTAMPING and this patch set to measure the how
> long a particular HTTP response is limited by small receive window.
> 
> Patch set was initially written by Francis Yan then polished
> by Yuchung Cheng, with lots of help from Eric Dumazet and Soheil
> Hassas Yeganeh.

Looks great, series applied, thanks!

^ permalink raw reply

* Re: [patch net] net: dsa: fix unbalanced dsa_switch_tree reference counting
From: David Miller @ 2016-11-28 21:16 UTC (permalink / raw)
  To: nikita.yoush; +Cc: netdev, cphealy, andrew, linux-kernel
In-Reply-To: <1480315728-23398-1-git-send-email-nikita.yoush@cogentembedded.com>

From: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Date: Mon, 28 Nov 2016 09:48:48 +0300

> _dsa_register_switch() gets a dsa_switch_tree object either via
> dsa_get_dst() or via dsa_add_dst(). Former path does not increase kref
> in returned object (resulting into caller not owning a reference),
> while later path does create a new object (resulting into caller owning
> a reference).
> 
> The rest of _dsa_register_switch() assumes that it owns a reference, and
> calls dsa_put_dst().
> 
> This causes a memory breakage if first switch in the tree initialized
> successfully, but second failed to initialize. In particular, freed
> dsa_swith_tree object is left referenced by switch that was initialized,
> and later access to sysfs attributes of that switch cause OOPS.
> 
> To fix, need to add kref_get() call to dsa_get_dst().
> 
> Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
> Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation")
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH] geneve: fix ip_hdr_len reserved for geneve6 tunnel.
From: David Miller @ 2016-11-28 21:15 UTC (permalink / raw)
  To: yanhaishuang; +Cc: hannes, aduyck, pshelar, jbenc, netdev, linux-kernel
In-Reply-To: <1480310818-78456-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Mon, 28 Nov 2016 13:26:58 +0800

> It shold reserved sizeof(ipv6hdr) for geneve in ipv6 tunnel.
> 
> Fixes: c3ef5aa5e5 ('geneve: Merge ipv4 and ipv6 geneve_build_skb()')
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>

Applied, thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox