Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 1/1] driver: ipvlan: Free the port memory directly with kfree instead of kfree_rcu
From: Gao Feng @ 2016-12-06  7:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Mahesh Bandewar, Eric Dumazet,
	Linux Kernel Network Developers
In-Reply-To: <1481007190.18162.570.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On Tue, Dec 6, 2016 at 2:53 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-12-06 at 14:31 +0800, Gao Feng wrote:
>
>> Because I don't fully hold the ipvlan codes now, I am afraid of that
>> there is someone which may get the port address when
>> ipvlan_port_destroy. So the original ipvlan_port_destroy uses the
>> kfree_rcu to avoid it.
>>
>> I am sure there is unnecessary to use kfree in ipvlan_port_create.
>
> And I am pretty sure it is unnecessary to use kfree_rcu() in
> ipvlan_port_destroy() as well.
>
> I highly suggest you spend time on learning why.
>
>
>

Thanks your suggestion.
I will send v2 patch after get the reason by myself.

Begards
Feng

^ permalink raw reply

* [PATCH] net: return value of skb_linearize should be handled in Linux kernel
From: Zhouyi Zhou @ 2016-12-06  7:10 UTC (permalink / raw)
  To: faisal.latif, dledford, sean.hefty, hal.rosenstock,
	jeffrey.t.kirsher, QLogic-Storage-Upstream, jejb, martin.petersen,
	jth, jon.maloy, ying.xue, davem, linux-rdma, linux-kernel,
	intel-wired-lan, netdev, linux-scsi, fcoe-devel, tipc-discussion
  Cc: Zhouyi Zhou

kmalloc_reserve may fail to allocate memory inside skb_linearize, 
which means skb_linearize's return value should not be ignored. 
Following patch correct the uses of skb_linearize.

Compiled in x86_64

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
---
 drivers/infiniband/hw/nes/nes_nic.c           | 5 +++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 6 +++++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +--
 drivers/scsi/bnx2fc/bnx2fc_fcoe.c             | 7 +++++--
 drivers/scsi/fcoe/fcoe.c                      | 5 ++++-
 net/tipc/link.c                               | 3 ++-
 net/tipc/name_distr.c                         | 5 ++++-
 7 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
index 2b27d13..69372ea 100644
--- a/drivers/infiniband/hw/nes/nes_nic.c
+++ b/drivers/infiniband/hw/nes/nes_nic.c
@@ -662,10 +662,11 @@ static int nes_netdev_start_xmit(struct sk_buff *skb, struct net_device *netdev)
 				nesnic->sq_head &= nesnic->sq_size-1;
 			}
 		} else {
-			nesvnic->linearized_skbs++;
 			hoffset = skb_transport_header(skb) - skb->data;
 			nhoffset = skb_network_header(skb) - skb->data;
-			skb_linearize(skb);
+			if (skb_linearize(skb))
+				return NETDEV_TX_BUSY;
+			nesvnic->linearized_skbs++;
 			skb_set_transport_header(skb, hoffset);
 			skb_set_network_header(skb, nhoffset);
 			if (!nes_nic_send(skb, netdev))
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
index 2a653ec..ab787cb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
@@ -490,7 +490,11 @@ int ixgbe_fcoe_ddp(struct ixgbe_adapter *adapter,
 	 */
 	if ((fh->fh_r_ctl == FC_RCTL_DD_SOL_DATA) &&
 	    (fctl & FC_FC_END_SEQ)) {
-		skb_linearize(skb);
+		int err = 0;
+
+		err = skb_linearize(skb);
+		if (err)
+			return err;
 		crc = (struct fcoe_crc_eof *)skb_put(skb, sizeof(*crc));
 		crc->fcoe_eof = FC_EOF_T;
 	}
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fee1f29..4926d48 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2173,8 +2173,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				total_rx_bytes += ddp_bytes;
 				total_rx_packets += DIV_ROUND_UP(ddp_bytes,
 								 mss);
-			}
-			if (!ddp_bytes) {
+			} else {
 				dev_kfree_skb_any(skb);
 				continue;
 			}
diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index f9ddb61..197d02e 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -542,8 +542,11 @@ static void bnx2fc_recv_frame(struct sk_buff *skb)
 		return;
 	}
 
-	if (skb_is_nonlinear(skb))
-		skb_linearize(skb);
+	if (skb_linearize(skb)) {
+		kfree_skb(skb);
+		return;
+	}
+
 	mac = eth_hdr(skb)->h_source;
 	dest_mac = eth_hdr(skb)->h_dest;
 
diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index 9bd41a3..f691b97 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -1685,7 +1685,10 @@ static void fcoe_recv_frame(struct sk_buff *skb)
 			skb->dev ? skb->dev->name : "<NULL>");
 
 	port = lport_priv(lport);
-	skb_linearize(skb); /* check for skb_is_nonlinear is within skb_linearize */
+	if (skb_linearize(skb)) {
+		kfree_skb(skb);
+		return;
+	}
 
 	/*
 	 * Frame length checks and setting up the header pointers
diff --git a/net/tipc/link.c b/net/tipc/link.c
index bda89bf..077c570 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1446,7 +1446,8 @@ static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
 	if (tipc_own_addr(l->net) > msg_prevnode(hdr))
 		l->net_plane = msg_net_plane(hdr);
 
-	skb_linearize(skb);
+	if (skb_linearize(skb))
+		goto exit;
 	hdr = buf_msg(skb);
 	data = msg_data(hdr);
 
diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index c1cfd92..4e05d2a 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -356,7 +356,10 @@ void tipc_named_rcv(struct net *net, struct sk_buff_head *inputq)
 
 	spin_lock_bh(&tn->nametbl_lock);
 	for (skb = skb_dequeue(inputq); skb; skb = skb_dequeue(inputq)) {
-		skb_linearize(skb);
+		if (skb_linearize(skb)) {
+			kfree_skb(skb);
+			continue;
+		}
 		msg = buf_msg(skb);
 		mtype = msg_type(msg);
 		item = (struct distr_item *)msg_data(msg);
-- 
1.9.1

^ permalink raw reply related

* [PATCHv2 net] team: team_port_add should check link_up before enable port
From: Xin Long @ 2016-12-06  7:29 UTC (permalink / raw)
  To: network dev; +Cc: davem, Marcelo Ricardo Leitner, Jiri Pirko

Now when users add a nic to team dev, the option 'enable' of the port
is true by default, as team_port_enable enables it after dev_open in
team_port_add.

But even if the port_dev has no carrier, like it's cable was unpluged,
the port is still enabled. It leads to that team dev couldn't work well
if this port was chosen to connect, and has no chance to change to use
other ports if link_watch is ethtool.

This patch is to enable the port only when the port_dev has carrier in
team_port_add.

v1 -> v2:
  use netif_carrier_ok() instead of !!netif_carrier_ok(), as it returns
  bool now.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 drivers/net/team/team.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index a380649..4bc0103 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1140,6 +1140,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
 	struct net_device *dev = team->dev;
 	struct team_port *port;
 	char *portname = port_dev->name;
+	bool linkup;
 	int err;
 
 	if (port_dev->flags & IFF_LOOPBACK) {
@@ -1249,9 +1250,12 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
 
 	port->index = -1;
 	list_add_tail_rcu(&port->list, &team->port_list);
-	team_port_enable(team, port);
+	linkup = netif_carrier_ok(port_dev);
+	if (linkup)
+		team_port_enable(team, port);
+
 	__team_compute_features(team);
-	__team_port_change_port_added(port, !!netif_carrier_ok(port_dev));
+	__team_port_change_port_added(port, linkup);
 	__team_options_change_check(team);
 
 	netdev_info(dev, "Port device %s added\n", portname);
-- 
2.1.0

^ permalink raw reply related

* Re: commit : ppp: add rtnetlink device creation support - breaks netcf on my machine.
From: Brad Campbell @ 2016-12-06  7:47 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: netdev
In-Reply-To: <20161205175320.mbtswpmdjumfgh7y@alphalink.fr>

On 06/12/16 01:53, Guillaume Nault wrote:
>>
> Probably not a mistake on your side. I've started looking at netcf'
> source code, but haven't found anything that could explain your issue.
> It'd really help if you could provide steps to reproduce the bug.

Further to my message this morning, I started with a clean linux.git 
4.9.0-rc7-00198-g0cb65c8 and did two runs. One untouched and one with 
the identified patch reverted. I logged both of these with NLCB=debug, 
then split out the ppp section and diffed them.

It appears the only difference of note is the new ATTR 18. I did a diff 
of the entire dump for both and nothing else popped out.


brad@test:~$ diff -u ppp-ok ppp-fail
--- ppp-ok	2016-12-06 13:32:04.358393578 +0800
+++ ppp-fail	2016-12-06 13:32:18.577864406 +0800
@@ -1,10 +1,10 @@
  --------------------------   BEGIN NETLINK MESSAGE 
---------------------------
    [HEADER] 16 octets
-    .nlmsg_len = 628
+    .nlmsg_len = 644
      .nlmsg_type = 16 <route/link::new>
      .nlmsg_flags = 2 <MULTI>
-    .nlmsg_seq = 1481001940
-    .nlmsg_pid = 7462
+    .nlmsg_seq = 1481002252
+    .nlmsg_pid = 7376
    [PAYLOAD] 16 octets
      00 00 00 02 0a 00 00 00 d1 10 01 00 00 00 00 00       ................
    [ATTR 03] 5 octets
@@ -71,6 +71,8 @@
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
..................
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
..................
      00 00 00 00 00 00                                     ......
+  [ATTR 18] 12 octets
+    08 00 01 00 70 70 70 00 04 00 02 00                   ....ppp.....
    [ATTR 26] 132 octets
      84 00 02 00 80 00 01 00 01 00 00 00 00 00 00 00 00 00 
..................
      00 00 01 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00 
..................
@@ -81,3 +83,4 @@
      00 00 00 00 10 27 00 00 e8 03 00 00 00 00 00 00 00 00 
.....'............
      00 00 00 00 00 00                                     ......
  ---------------------------  END NETLINK MESSAGE 
---------------------------

Running with NLDBG=4 seems to generate this :
DBG<2>: While picking up for 0x26d2e00 <route/link>, recvmsgs() returned 
-34:  (errno = Numerical result out of range)DBG<1>: Clearing cache 
0x26d2e00 <route/link>...

(skip forward 4 hours)

Ok, so I've spent the afternoon compiling and installing software.

I'm afraid I gave you a bum steer. The issue only manifests itself on 
libnl1. I had both installed and netcf was compiling against 1 and not 3.

I spent the afternoon compiling and installing various combinations of 
libnl and netcf and can only reproduce the issue if netcf is compiled 
against libnl <= 1.1.4. It won't compile against 2, 3, or 3.1 and it 
works against 3.2. That explains why it manifests itself on my clean 
Debian 7 machines.

I can work around it locally by recompiling all my stuff against libnl3 
if you don't feel inclined to chase it down, but it is certainly 
reproducible on nl1. I compiled up 1.1.4 and compiled netcf-0.2.8 
against that and the problem shows.

Regards,
Brad

^ permalink raw reply

* Re: [PATCH]net:witchdev:release the lock before function return to avoid dead lock
From: Jiri Pirko @ 2016-12-06  8:06 UTC (permalink / raw)
  To: Feng Deng; +Cc: David S. Miller, netdev, linux-kernel, feng.deng
In-Reply-To: <CAEX1niVLkt=LQhjrALsUrN+oEa9ZEUJ7X+tbNFs71RLm6ryZCA@mail.gmail.com>

Tue, Dec 06, 2016 at 03:58:29AM CET, cxdx2006@gmail.com wrote:
>From : Feng Deng <cxdx2006@gmail.com>
>
>In switchdev_deferred_dequeue(),only get the lock at the beginning,
>but forgot to release before return,
>we must release the lock to avoid dead lock
 
Feng, your patches are corrupted. Please use "git send-email"
Also, when you send a fit for -net tree, you have to pass byt the
"Fixes" tag. Please see the git log for details.

Thanks!


>
>Signed-off-by: Feng Deng <feng.deng@cortina-access.com>
>---
> net/switchdev/switchdev.c | 1 +
> 1 file changed, 1 insertion(+)
> mode change 100644 => 100755 net/switchdev/switchdev.c
>
>diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>index 3b95fe9..c0a1ad4
>--- a/net/switchdev/switchdev.c
>+++ b/net/switchdev/switchdev.c
>@@ -120,6 +120,7 @@ static struct switchdev_deferred_item
>*switchdev_deferred_dequeue(void)
>  dfitem = list_first_entry(&deferred,
>   struct switchdev_deferred_item, list);
>  list_del(&dfitem->list);
>+ spin_unlock_bh(&deferred_lock);
> unlock:
>  spin_unlock_bh(&deferred_lock);
>  return dfitem;
>-- 
>1.8.3.1

^ permalink raw reply

* Re: [PATCHv2 net] team: team_port_add should check link_up before enable port
From: Jiri Pirko @ 2016-12-06  8:22 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, davem, Marcelo Ricardo Leitner
In-Reply-To: <62f7aa9c00b68beed68a907c51dde22e7847d6c9.1481009348.git.lucien.xin@gmail.com>

Tue, Dec 06, 2016 at 08:29:08AM CET, lucien.xin@gmail.com wrote:
>Now when users add a nic to team dev, the option 'enable' of the port
>is true by default, as team_port_enable enables it after dev_open in
>team_port_add.
>
>But even if the port_dev has no carrier, like it's cable was unpluged,
>the port is still enabled. It leads to that team dev couldn't work well
>if this port was chosen to connect, and has no chance to change to use
>other ports if link_watch is ethtool.
>
>This patch is to enable the port only when the port_dev has carrier in
>team_port_add.
>
>v1 -> v2:
>  use netif_carrier_ok() instead of !!netif_carrier_ok(), as it returns
>  bool now.
>
>Signed-off-by: Xin Long <lucien.xin@gmail.com>
>---
> drivers/net/team/team.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index a380649..4bc0103 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -1140,6 +1140,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
> 	struct net_device *dev = team->dev;
> 	struct team_port *port;
> 	char *portname = port_dev->name;
>+	bool linkup;
> 	int err;
> 
> 	if (port_dev->flags & IFF_LOOPBACK) {
>@@ -1249,9 +1250,12 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
> 
> 	port->index = -1;
> 	list_add_tail_rcu(&port->list, &team->port_list);
>-	team_port_enable(team, port);
>+	linkup = netif_carrier_ok(port_dev);
>+	if (linkup)
>+		team_port_enable(team, port);

This is obviously wrong change. It you use a simple setup without
userspace part (e.g. round robin), The port gets never enabled.
team_port_enabl is called from here and team_port_en_option_set.

By default, all ports should be enabled. Only in case the userspace
daemon decides to disable, it does so.

Could you send me the exact configuration where you see the issue?
This should be definitelly fixed in user part.

Thanks.

^ permalink raw reply

* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
From: Paolo Abeni @ 2016-12-06  9:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn
In-Reply-To: <1480960639.18162.556.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> In UDP recvmsg() path we currently access 3 cache lines from an skb
> while holding receive queue lock, plus another one if packet is
> dequeued, since we need to change skb->next->prev
> 
> 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> 
> skb->peeked is only needed to make sure 0-length packets are properly
> handled while MSG_PEEK is operated.
> 
> I had first the intent to remove skb->peeked but the "MSG_PEEK at
> non-zero offset" support added by Sam Kumar makes this not possible.

I'm wondering if peeking with offset is going to complicate the 2 queues
patch, too.

> This patch avoids one cache line miss during the locked section, when
> skb->len and skb->peeked do not have to be read.
> 
> It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/core/datagram.c |   19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 49816af8586bb832e806972b486588041a99524c..9482037a5c8c64aec79e42c65bd2691bdd9450a3 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -214,6 +214,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
>  	if (error)
>  		goto no_packet;
>  
> +	*peeked = 0;
>  	do {
>  		/* Again only user level code calls this function, so nothing
>  		 * interrupt level will suddenly eat the receive_queue.
> @@ -227,22 +228,22 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
>  		spin_lock_irqsave(&queue->lock, cpu_flags);
>  		skb_queue_walk(queue, skb) {
>  			*last = skb;
> -			*peeked = skb->peeked;
>  			if (flags & MSG_PEEK) {
>  				if (_off >= skb->len && (skb->len || _off ||
>  							 skb->peeked)) {
>  					_off -= skb->len;
>  					continue;
>  				}
> -
> -				skb = skb_set_peeked(skb);
> -				error = PTR_ERR(skb);
> -				if (IS_ERR(skb)) {
> -					spin_unlock_irqrestore(&queue->lock,
> -							       cpu_flags);
> -					goto no_packet;
> +				if (!skb->len) {
> +					skb = skb_set_peeked(skb);
> +					if (IS_ERR(skb)) {
> +						error = PTR_ERR(skb);
> +						spin_unlock_irqrestore(&queue->lock,
> +								       cpu_flags);
> +						goto no_packet;
> +					}
>  				}

I don't understand why we can avoid setting skb->peek if len > 0. I
think that will change the kernel behavior if:
- peek with offset is set
- 3 skbs with len > 0 are enqueued
- the u/s peek (with offset) the second one
- the u/s disable peeking with offset and peeks 2 more skbs.

With the current code in the last step the u/s is going to peek the 1#
and the 3# skbs, after this patch will peek the 1# and the 2#. Am I
missing something ? Probably the new behavior is more correct, but still
is a change. 

I gave this a run in my test bed on top of your udp-related patches I
see additional ~3 improvement in the udp flood scenario, and a bit more
in the un-contended scenario.

Thank you,

Paolo

^ permalink raw reply

* Re: [PATCH iproute2/net-next 3/3] tc: flower: support matching on ICMP type and code
From: Simon Horman @ 2016-12-06 10:18 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <1480703079-25422-4-git-send-email-simon.horman@netronome.com>

On Fri, Dec 02, 2016 at 07:24:39PM +0100, Simon Horman wrote:
> Support matching on ICMP type and code.
> 
> Example usage:
> 
> tc qdisc add dev eth0 ingress
> 
> tc filter add dev eth0 protocol ip parent ffff: flower \
> 	indev eth0 ip_proto icmp type 8 code 0 action drop
> 
> tc filter add dev eth0 protocol ipv6 parent ffff: flower \
> 	indev eth0 ip_proto icmpv6 type 128 code 0 action drop
> 
> Signed-off-by: Simon Horman <simon.horman@netronome.com>

This series no longer applies to net-next.

...

> @@ -524,6 +597,12 @@ static void flower_print_port(FILE *f, char *name, struct rtattr *attr)
>  		fprintf(f, "\n  %s %d", name, ntohs(rta_getattr_u16(attr)));
>  }
>  
> +static void flower_print_icmp(FILE *f, char *name, struct rtattr *attr)
> +{
> +	if (attr)
> +		fprintf(f, "\n  %s %d", name, ntohs(rta_getattr_u8(attr)));

ntohs() should be removed from the above as its argument is a u8.

...

^ permalink raw reply

* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
From: Paolo Abeni @ 2016-12-06 10:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn
In-Reply-To: <1480960639.18162.556.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> In UDP recvmsg() path we currently access 3 cache lines from an skb
> while holding receive queue lock, plus another one if packet is
> dequeued, since we need to change skb->next->prev
> 
> 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> 
> skb->peeked is only needed to make sure 0-length packets are properly
> handled while MSG_PEEK is operated.
> 
> I had first the intent to remove skb->peeked but the "MSG_PEEK at
> non-zero offset" support added by Sam Kumar makes this not possible.
> 
> This patch avoids one cache line miss during the locked section, when
> skb->len and skb->peeked do not have to be read.
> 
> It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thank you for all the good work.

After all your improvement, I see the cacheline miss in inet_recvmsg()
as a major perf offender for the user space process in the udp flood
scenario due to skc_rxhash sharing the same sk_drops cacheline.

Using an udp-specific drop counter (and an sk_drops accessor to wrap
sk_drops access where needed), we could avoid such cache miss. With that
- patch for udp.h only below - I get 3% improvement on top of all the
pending udp patches, and the gain should be more relevant after the 2
queues rework. What do you think ?

Cheers,

Paolo.
---
diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd..2bbf5db 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,7 +49,11 @@ struct udp_sock {
        unsigned int     corkflag;      /* Cork is required */
        __u8             encap_type;    /* Is this an Encapsulation socket? */
        unsigned char    no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
-                        no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
+                        no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
+                        pcflag:6;      /* UDP-Lite specific, moved here to */
+                                       /* fill an hole, marks socket as */
+                                       /* UDP-Lite if > 0    */
+
        /*
         * Following member retains the information to create a UDP header
         * when the socket is uncorked.
@@ -64,8 +68,7 @@ struct udp_sock {
 #define UDPLITE_BIT      0x1           /* set by udplite proto init function */
 #define UDPLITE_SEND_CC  0x2           /* set via udplite setsockopt         */
 #define UDPLITE_RECV_CC  0x4           /* set via udplite setsocktopt        */
-       __u8             pcflag;        /* marks socket as UDP-Lite if > 0    */
-       __u8             unused[3];
+
        /*
         * For encapsulation sockets.
         */
@@ -79,6 +82,9 @@ struct udp_sock {
        int                     (*gro_complete)(struct sock *sk,
                                                struct sk_buff *skb,
                                                int nhoff);
+
+       /* since we are prone to drops, avoid dirtying any sk cacheline */
+       atomic_t                drops ____cacheline_aligned_in_smp;
 };
 
 static inline struct udp_sock *udp_sk(const struct sock *sk)

^ permalink raw reply related

* Re: [PATCHv2 net] team: team_port_add should check link_up before enable port
From: Xin Long @ 2016-12-06 10:38 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: network dev, davem, Marcelo Ricardo Leitner
In-Reply-To: <20161206082233.GD1984@nanopsycho>

[-- Attachment #1: Type: text/plain, Size: 2719 bytes --]

On Tue, Dec 6, 2016 at 4:22 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Dec 06, 2016 at 08:29:08AM CET, lucien.xin@gmail.com wrote:
>>Now when users add a nic to team dev, the option 'enable' of the port
>>is true by default, as team_port_enable enables it after dev_open in
>>team_port_add.
>>
>>But even if the port_dev has no carrier, like it's cable was unpluged,
>>the port is still enabled. It leads to that team dev couldn't work well
>>if this port was chosen to connect, and has no chance to change to use
>>other ports if link_watch is ethtool.
>>
>>This patch is to enable the port only when the port_dev has carrier in
>>team_port_add.
>>
>>v1 -> v2:
>>  use netif_carrier_ok() instead of !!netif_carrier_ok(), as it returns
>>  bool now.
>>
>>Signed-off-by: Xin Long <lucien.xin@gmail.com>
>>---
>> drivers/net/team/team.c | 8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>
>>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>>index a380649..4bc0103 100644
>>--- a/drivers/net/team/team.c
>>+++ b/drivers/net/team/team.c
>>@@ -1140,6 +1140,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>>       struct net_device *dev = team->dev;
>>       struct team_port *port;
>>       char *portname = port_dev->name;
>>+      bool linkup;
>>       int err;
>>
>>       if (port_dev->flags & IFF_LOOPBACK) {
>>@@ -1249,9 +1250,12 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>>
>>       port->index = -1;
>>       list_add_tail_rcu(&port->list, &team->port_list);
>>-      team_port_enable(team, port);
>>+      linkup = netif_carrier_ok(port_dev);
>>+      if (linkup)
>>+              team_port_enable(team, port);
>
> This is obviously wrong change. It you use a simple setup without
> userspace part (e.g. round robin), The port gets never enabled.
> team_port_enabl is called from here and team_port_en_option_set.
yes, without userspace part, that would be a problem.

>
> By default, all ports should be enabled. Only in case the userspace
> daemon decides to disable, it does so.
>
> Could you send me the exact configuration where you see the issue?
attachment is the scripts,

# ./setup.sh
# ip netns exec ns1 ./team1.sh
# ./team0.sh
# ping 192.168.11.3

the issue is team0 cannot switch to veth2, even if the veth0 has no carrier.

> This should be definitelly fixed in user part.
now it can disable/enable it in lw_ethtool_event_watch_port_changed()
when receive event from kernel, but at the beginning, the first event from
adding port will not going to team_port_en_option_set, as new_link_up
== common_ppriv->link_up in lw_ethtool_event_watch_port_changed().

maybe it should be fixed there ?

>
> Thanks.

[-- Attachment #2: team-scripts.tar.gz --]
[-- Type: application/x-gzip, Size: 561 bytes --]

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] flow dissector: ICMP support
From: Simon Horman @ 2016-12-06 10:51 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jiri Pirko, David Miller, Linux Kernel Network Developers,
	Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, Jamal Hadi Salim,
	Jiri Pirko
In-Reply-To: <CALx6S37f-TFAhAidri12iD0KnhFE2kYuVnsXsj7yjivk-ovRiw@mail.gmail.com>

On Mon, Dec 05, 2016 at 09:29:45AM -0800, Tom Herbert wrote:
> On Mon, Dec 5, 2016 at 6:23 AM, Simon Horman <simon.horman@netronome.com> wrote:
> > On Sat, Dec 03, 2016 at 10:52:32AM -0800, Tom Herbert wrote:
> >> On Sat, Dec 3, 2016 at 2:49 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> >> > Fri, Dec 02, 2016 at 09:31:41PM CET, simon.horman@netronome.com wrote:
> >> >>Allow dissection of ICMP(V6) type and code. This re-uses transport layer
> >> >>port dissection code as although ICMP is not a transport protocol and their
> >> >>type and code are not ports this allows sharing of both code and storage.
> >> >>
> >> >>Signed-off-by: Simon Horman <simon.horman@netronome.com>
> >
> > ...
> >
> >> > Digging into this a bit more. I think it would be much nice not to mix
> >> > up l4 ports and icmp stuff.
> >> >
> >> > How about to have FLOW_DISSECTOR_KEY_ICMP
> >> > and
> >> > struct flow_dissector_key_icmp {
> >> >         u8 type;
> >> >         u8 code;
> >> > };
> >> >
> >> > The you can make this structure and struct flow_dissector_key_ports into
> >> > an union in struct flow_keys.
> >> >
> >> > Looks much cleaner to me.
> >
> > Hi Jiri,
> >
> > I wondered about that too. The main reason that I didn't do this the first
> > time around is that I wasn't sure what to do to differentiate between ports
> > and icmp in fl_init_dissector() given that using a union would could to
> > mask bits being set in both if they are set in either.
> >
> > I've taken a stab at addressing that below along with implementing your
> > suggestion. If the treatment in fl_init_dissector() is acceptable
> > perhaps something similar should be done for FLOW_DISSECTOR_KEY_IPV[46]_ADDRS
> > too?
> >
> >> I agree, this patch adds to many conditionals into the fast path for
> >> ICMP handling. Neither is there much point in using type and code as
> >> input to the packet hash.
> >
> > Hi Tom,
> >
> > I'm not sure that I follow this. A packet may be ICMP or not regardless of
> > if FLOW_DISSECTOR_KEY_PORT (and now FLOW_DISSECTOR_KEY_ICMP) port is set or
> > not. I'd appreciate some guidance here.
> >
> > First-pass at implementing Jiri's suggestion follows:
> >

...

> > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> > index 0584b4bb4390..33e5fa2b3e87 100644
> > --- a/net/core/flow_dissector.c
> > +++ b/net/core/flow_dissector.c

...

> > @@ -975,6 +983,10 @@ static const struct flow_dissector_key flow_keys_dissector_keys[] = {
> >                 .offset = offsetof(struct flow_keys, ports),
> >         },
> >         {
> > +               .key_id = FLOW_DISSECTOR_KEY_ICMP,
> > +               .offset = offsetof(struct flow_keys, icmp),
> > +       },
> 
> This is the problem I was referring to. This adds ICMP to the keys for
> computing the skb hash and the ICMP type and code are in a union with
> port numbers so they are in the range over which the hash is computed.
> This means that we are treating type and code as some sort of flow and
> and so different type/code between the same src/dst could follow
> different paths in ECMP. For the purposes of computing a packet hash I
> don't think we want ICMP information included. This might be a matter
> of not putting FLOW_DISSECTOR_KEY_ICMP in flow_keys_dissector_keys,
> where we need this information we could define another structure that
> has all the same fields as in flow_keys_dissector_keys and include
> FLOW_DISSECTOR_KEY_ICMP. Alternatively, the flow_dissector_key_icmp
> could also be outside of the area that is used in the hash: that is no
> in flow_keys_hash_start(keys) to flow_keys_hash_length(keys), keyval);

...

> > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> > index c96b2a349779..f4ba69bd362f 100644
> > --- a/net/sched/cls_flower.c
> > +++ b/net/sched/cls_flower.c
> > @@ -38,7 +38,10 @@ struct fl_flow_key {
> >                 struct flow_dissector_key_ipv4_addrs ipv4;
> >                 struct flow_dissector_key_ipv6_addrs ipv6;
> >         };
> > -       struct flow_dissector_key_ports tp;
> > +       union {
> > +               struct flow_dissector_key_ports tp;
> > +               struct flow_dissector_key_icmp icmp;
> > +       };
> 
> When an ICMP packet is encapsulated within UDP then icmp overwrites
> valid port information with is a stronger signal of the flow (unless
> FLOW_DISSECTOR_F_STOP_AT_ENCAP is set). This is another reason not to
> use a union with ports.

...

Hi Tom,

thanks for your explanation. I think I have a clearer picture now.

I have reworked things to try to address your concerns.
In particular:

* FLOW_DISSECTOR_KEY_ICMP is no longer added to flow_keys_dissector_keys.
  I don't think it is needed at all for the use-case I currently have in
  mind, which is classifying packets using tc_flower. And adding it there
  was an error on my part.

* I stopped using a union for ports and icmp. At the very least this seems
  to make it easier to reason about things as we no longer need to consider
  that a port value may in fact be an ICMP type or code.

  This seems to allow us avoid adding a number of is_icmp checks (as I believe
  you pointed out earlier). And should also address the problem you state
  immediately above.

* I have also placed icmp outside of the range flow_keys_hash_start(keys)
  to flow_keys_hash_length(keys), keyval). This is because I now see no
  value of having it inside that range and it both avoids any chance of
  contaminating the hash with ICMP values and hashing over unwanted
  (hopefully zero) values.

  This required an update to flow_keys_hash_length() as the bound
  of the end of fields the hash is no longer the end of struct flow_keys.
  It seems clean but I wonder if there are similar assumptions lurking
  elsewhere.

I have lightly tested this for the tc_flower case (only).

Incremental patch on top of this series:

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index a6f75cfb2bf7..8029dd4912b6 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3181,8 +3181,7 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
 	} else {
 		return false;
 	}
-	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 &&
-	    proto >= 0 && !skb_flow_is_icmp_any(skb, proto))
+	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && proto >= 0)
 		fk->ports.ports = skb_flow_get_ports(skb, noff, proto);
 
 	return true;
@@ -3210,8 +3209,7 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
 		return bond_eth_hash(skb);
 
 	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER23 ||
-	    bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23 ||
-	    flow_keys_are_icmp_any(&flow))
+	    bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23)
 		hash = bond_eth_hash(skb);
 	else
 		hash = (__force u32)flow.ports.ports;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 44a8f69a9198..9c535fbccf2c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1094,11 +1094,6 @@ u32 __skb_get_poff(const struct sk_buff *skb, void *data,
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
 			    void *data, int hlen_proto);
 
-static inline bool skb_flow_is_icmp_any(const struct sk_buff *skb, u8 ip_proto)
-{
-	return flow_protos_are_icmp_any(skb->protocol, ip_proto);
-}
-
 static inline __be32 skb_flow_get_ports(const struct sk_buff *skb,
 					int thoff, u8 ip_proto)
 {
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 5540dfa18872..058c9df8a4d8 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -91,14 +91,10 @@ struct flow_dissector_key_addrs {
 
 /**
  * flow_dissector_key_ports:
- *	@ports: port numbers of Transport header or
- *		type and code of ICMP header
+ *	@ports: port numbers of Transport header
  *		ports: source (high) and destination (low) port numbers
  *		src: source port number
  *		dst: destination port number
- *		icmp: ICMP type (high) and code (low)
- *		type: ICMP type
- *		type: ICMP code
  */
 struct flow_dissector_key_ports {
 	union {
@@ -107,6 +103,18 @@ struct flow_dissector_key_ports {
 			__be16 src;
 			__be16 dst;
 		};
+	};
+};
+
+/**
+ * flow_dissector_key_icmp:
+ *	@ports: type and code of ICMP header
+ *		icmp: ICMP type (high) and code (low)
+ *		type: ICMP type
+ *		code: ICMP code
+ */
+struct flow_dissector_key_icmp {
+	union {
 		__be16 icmp;
 		struct {
 			u8 type;
@@ -115,7 +123,6 @@ struct flow_dissector_key_ports {
 	};
 };
 
-
 /**
  * struct flow_dissector_key_eth_addrs:
  * @src: source Ethernet address
@@ -133,6 +140,7 @@ enum flow_dissector_key_id {
 	FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
 	FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
 	FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
+	FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
 	FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
 	FLOW_DISSECTOR_KEY_TIPC_ADDRS, /* struct flow_dissector_key_tipc_addrs */
 	FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
@@ -173,11 +181,16 @@ struct flow_keys {
 	struct flow_dissector_key_keyid keyid;
 	struct flow_dissector_key_ports ports;
 	struct flow_dissector_key_addrs addrs;
+#define FLOW_KEYS_HASH_END_FIELD addrs
+	struct flow_dissector_key_icmp icmp;
 };
 
 #define FLOW_KEYS_HASH_OFFSET		\
 	offsetof(struct flow_keys, FLOW_KEYS_HASH_START_FIELD)
 
+#define FLOW_KEYS_HASH_END		\
+	offsetofend(struct flow_keys, FLOW_KEYS_HASH_END_FIELD)
+
 __be32 flow_get_u32_src(const struct flow_keys *flow);
 __be32 flow_get_u32_dst(const struct flow_keys *flow);
 
@@ -233,8 +246,7 @@ static inline bool flow_keys_are_icmp_any(const struct flow_keys *keys)
 
 static inline bool flow_keys_have_l4(const struct flow_keys *keys)
 {
-	return (!flow_keys_are_icmp_any(keys) && keys->ports.ports) ||
-		keys->tags.flow_label;
+	return (keys->ports.ports || keys->tags.flow_label);
 }
 
 u32 flow_hash_from_keys(struct flow_keys *keys);
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0584b4bb4390..ed6d46475343 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -139,6 +139,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 	struct flow_dissector_key_basic *key_basic;
 	struct flow_dissector_key_addrs *key_addrs;
 	struct flow_dissector_key_ports *key_ports;
+	struct flow_dissector_key_icmp *key_icmp;
 	struct flow_dissector_key_tags *key_tags;
 	struct flow_dissector_key_vlan *key_vlan;
 	struct flow_dissector_key_keyid *key_keyid;
@@ -559,18 +560,25 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 		break;
 	}
 
-	if (dissector_uses_key(flow_dissector,
-			       FLOW_DISSECTOR_KEY_PORTS)) {
-		key_ports = skb_flow_dissector_target(flow_dissector,
-						      FLOW_DISSECTOR_KEY_PORTS,
-						      target_container);
-		if (flow_protos_are_icmp_any(proto, ip_proto))
-			key_ports->icmp = skb_flow_get_be16(skb, nhoff, data,
-							    hlen);
-		else
+	if (flow_protos_are_icmp_any(proto, ip_proto)) {
+		if (dissector_uses_key(flow_dissector,
+				       FLOW_DISSECTOR_KEY_ICMP)) {
+			key_icmp = skb_flow_dissector_target(flow_dissector,
+							     FLOW_DISSECTOR_KEY_ICMP,
+							     target_container);
+			key_icmp->icmp = skb_flow_get_be16(skb, nhoff, data,
+							   hlen);
+		}
+	} else {
+		if (dissector_uses_key(flow_dissector,
+				       FLOW_DISSECTOR_KEY_PORTS)) {
+			key_ports = skb_flow_dissector_target(flow_dissector,
+							      FLOW_DISSECTOR_KEY_PORTS,
+							      target_container);
 			key_ports->ports = __skb_flow_get_ports(skb, nhoff,
 								ip_proto, data,
 								hlen);
+		}
 	}
 
 out_good:
@@ -613,9 +621,10 @@ static inline const u32 *flow_keys_hash_start(const struct flow_keys *flow)
 static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
 {
 	size_t diff = FLOW_KEYS_HASH_OFFSET + sizeof(flow->addrs);
-	BUILD_BUG_ON((sizeof(*flow) - FLOW_KEYS_HASH_OFFSET) % sizeof(u32));
+	BUILD_BUG_ON((FLOW_KEYS_HASH_END - FLOW_KEYS_HASH_OFFSET) %
+		     sizeof(u32));
 	BUILD_BUG_ON(offsetof(typeof(*flow), addrs) !=
-		     sizeof(*flow) - sizeof(flow->addrs));
+		     FLOW_KEYS_HASH_END - sizeof(flow->addrs));
 
 	switch (flow->control.addr_type) {
 	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
@@ -628,7 +637,7 @@ static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
 		diff -= sizeof(flow->addrs.tipcaddrs);
 		break;
 	}
-	return (sizeof(*flow) - diff) / sizeof(u32);
+	return (FLOW_KEYS_HASH_END - diff) / sizeof(u32);
 }
 
 __be32 flow_get_u32_src(const struct flow_keys *flow)
@@ -745,8 +754,7 @@ void make_flow_keys_digest(struct flow_keys_digest *digest,
 
 	data->n_proto = flow->basic.n_proto;
 	data->ip_proto = flow->basic.ip_proto;
-	if (flow_keys_have_l4(flow))
-		data->ports = flow->ports.ports;
+	data->ports = flow->ports.ports;
 	data->src = flow->addrs.v4addrs.src;
 	data->dst = flow->addrs.v4addrs.dst;
 }
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index 408313e33bbe..6575aba87630 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -96,7 +96,7 @@ static u32 flow_get_proto(const struct sk_buff *skb,
 static u32 flow_get_proto_src(const struct sk_buff *skb,
 			      const struct flow_keys *flow)
 {
-	if (!flow_keys_are_icmp_any(flow) && flow->ports.ports)
+	if (flow->ports.ports)
 		return ntohs(flow->ports.src);
 
 	return addr_fold(skb->sk);
@@ -105,7 +105,7 @@ static u32 flow_get_proto_src(const struct sk_buff *skb,
 static u32 flow_get_proto_dst(const struct sk_buff *skb,
 			      const struct flow_keys *flow)
 {
-	if (!flow_keys_are_icmp_any(flow) && flow->ports.ports)
+	if (flow->ports.ports)
 		return ntohs(flow->ports.dst);
 
 	return addr_fold(skb_dst(skb)) ^ (__force u16) tc_skb_protocol(skb);
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index c96b2a349779..56df0368125a 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -39,6 +39,7 @@ struct fl_flow_key {
 		struct flow_dissector_key_ipv6_addrs ipv6;
 	};
 	struct flow_dissector_key_ports tp;
+	struct flow_dissector_key_icmp icmp;
 	struct flow_dissector_key_keyid enc_key_id;
 	union {
 		struct flow_dissector_key_ipv4_addrs enc_ipv4;
@@ -511,19 +512,23 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 			       &mask->tp.dst, TCA_FLOWER_KEY_SCTP_DST_MASK,
 			       sizeof(key->tp.dst));
 	} else if (flow_basic_key_is_icmpv4(&key->basic)) {
-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
-			       sizeof(key->tp.type));
-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-			       sizeof(key->tp.code));
+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
+			       &mask->icmp.type,
+			       TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
+			       sizeof(key->icmp.type));
+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
+			       &mask->icmp.code,
+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
+			       sizeof(key->icmp.code));
 	} else if (flow_basic_key_is_icmpv6(&key->basic)) {
-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
-			       sizeof(key->tp.type));
-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-			       sizeof(key->tp.code));
+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
+			       &mask->icmp.type,
+			       TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
+			       sizeof(key->icmp.type));
+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
+			       &mask->icmp.code,
+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
+			       sizeof(key->icmp.code));
 	}
 
 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
@@ -634,6 +639,8 @@ static void fl_init_dissector(struct cls_fl_head *head,
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_PORTS, tp);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+			     FLOW_DISSECTOR_KEY_ICMP, icmp);
+	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_VLAN, vlan);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id);
@@ -1000,24 +1007,24 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 				  sizeof(key->tp.dst))))
 		goto nla_put_failure;
 	else if (flow_basic_key_is_icmpv4(&key->basic) &&
-		 (fl_dump_key_val(skb, &key->tp.type,
-				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->tp.type,
+		 (fl_dump_key_val(skb, &key->icmp.type,
+				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->icmp.type,
 				  TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
-				  sizeof(key->tp.type)) ||
-		  fl_dump_key_val(skb, &key->tp.code,
-				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->tp.code,
+				  sizeof(key->icmp.type)) ||
+		  fl_dump_key_val(skb, &key->icmp.code,
+				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->icmp.code,
 				  TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
-				  sizeof(key->tp.code))))
+				  sizeof(key->icmp.code))))
 		goto nla_put_failure;
 	else if (flow_basic_key_is_icmpv6(&key->basic) &&
-		 (fl_dump_key_val(skb, &key->tp.type,
-				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->tp.type,
+		 (fl_dump_key_val(skb, &key->icmp.type,
+				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->icmp.type,
 				  TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
-				  sizeof(key->tp.type)) ||
-		  fl_dump_key_val(skb, &key->tp.code,
-				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->tp.code,
+				  sizeof(key->icmp.type)) ||
+		  fl_dump_key_val(skb, &key->icmp.code,
+				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->icmp.code,
 				  TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
-				  sizeof(key->tp.code))))
+				  sizeof(key->icmp.code))))
 		goto nla_put_failure;
 
 	if (key->enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS &&

^ permalink raw reply related

* [PATCH V4 net-next] net: hns: Fix to conditionally convey RX checksum flag to stack
From: Salil Mehta @ 2016-12-06 11:09 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm

This patch introduces the RX checksum function to check the
status of the hardware calculated checksum and its error and
appropriately convey status to the upper stack in skb->ip_summed
field.

In hardware, we only support checksum for the following
protocols:
1) IPv4,
2) TCP(over IPv4 or IPv6),
3) UDP(over IPv4 or IPv6),
4) SCTP(over IPv4 or IPv6)
but we support many L3(IPv4, IPv6, MPLS, PPPoE etc) and
L4(TCP, UDP, GRE, SCTP, IGMP, ICMP etc.) protocols.

Hardware limitation:
Our present hardware RX Descriptor lacks L3/L4 checksum
"Status & Error" bit (which usually can be used to indicate whether
checksum was calculated by the hardware and if there was any error
encountered during checksum calculation).

Software workaround:
We do get info within the RX descriptor about the kind of
L3/L4 protocol coming in the packet and the error status. These
errors might not just be checksum errors but could be related to
version, length of IPv4, UDP, TCP etc.
Because there is no-way of knowing if it is a L3/L4 error due
to bad checksum or any other L3/L4 error, we will not (cannot)
convey hardware checksum status(CHECKSUM_UNNECESSARY) for such
cases to upper stack and will not maintain the RX L3/L4 checksum
counters as well.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
Change Log:
Patch V4: Accidentally floated a wrong V3 patch. corrected this:
          Link: https://lkml.org/lkml/2016/12/5/604
Patch V3: Re-structured the code.
Patch V2: Addressed the comment by David Miller
          Link: https://www.spinics.net/lists/netdev/msg406697.html
Patch V1: This patch is a result of the comments given by
          David Miller <davem@davemloft.net>
          Link: https://lkml.org/lkml/2016/6/15/27
---
 drivers/net/ethernet/hisilicon/hns/hnae.h     |    2 +
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |   76 ++++++++++++++++++++++---
 2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hnae.h b/drivers/net/ethernet/hisilicon/hns/hnae.h
index 09602f1..8016854 100644
--- a/drivers/net/ethernet/hisilicon/hns/hnae.h
+++ b/drivers/net/ethernet/hisilicon/hns/hnae.h
@@ -99,6 +99,8 @@ enum hnae_led_state {
 #define HNS_RX_FLAG_L3ID_IPV6 0x1
 #define HNS_RX_FLAG_L4ID_UDP 0x0
 #define HNS_RX_FLAG_L4ID_TCP 0x1
+#define HNS_RX_FLAG_L4ID_SCTP 0x3
+
 
 #define HNS_TXD_ASID_S 0
 #define HNS_TXD_ASID_M (0xff << HNS_TXD_ASID_S)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 255fede..c02449b 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -565,6 +565,71 @@ static void get_rx_desc_bnum(u32 bnum_flag, int *out_bnum)
 				   HNS_RXD_BUFNUM_M, HNS_RXD_BUFNUM_S);
 }
 
+static void hns_nic_rx_checksum(struct hns_nic_ring_data *ring_data,
+				struct sk_buff *skb, u32 flag)
+{
+	struct net_device *netdev = ring_data->napi.dev;
+	u32 l3id;
+	u32 l4id;
+
+	/* check if RX checksum offload is enabled */
+	if (unlikely(!(netdev->features & NETIF_F_RXCSUM)))
+		return;
+
+	/* In hardware, we only support checksum for the following protocols:
+	 * 1) IPv4,
+	 * 2) TCP(over IPv4 or IPv6),
+	 * 3) UDP(over IPv4 or IPv6),
+	 * 4) SCTP(over IPv4 or IPv6)
+	 * but we support many L3(IPv4, IPv6, MPLS, PPPoE etc) and L4(TCP,
+	 * UDP, GRE, SCTP, IGMP, ICMP etc.) protocols.
+	 *
+	 * Hardware limitation:
+	 * Our present hardware RX Descriptor lacks L3/L4 checksum "Status &
+	 * Error" bit (which usually can be used to indicate whether checksum
+	 * was calculated by the hardware and if there was any error encountered
+	 * during checksum calculation).
+	 *
+	 * Software workaround:
+	 * We do get info within the RX descriptor about the kind of L3/L4
+	 * protocol coming in the packet and the error status. These errors
+	 * might not just be checksum errors but could be related to version,
+	 * length of IPv4, UDP, TCP etc.
+	 * Because there is no-way of knowing if it is a L3/L4 error due to bad
+	 * checksum or any other L3/L4 error, we will not (cannot) convey
+	 * checksum status for such cases to upper stack and will not maintain
+	 * the RX L3/L4 checksum counters as well.
+	 */
+
+	l3id = hnae_get_field(flag, HNS_RXD_L3ID_M, HNS_RXD_L3ID_S);
+	l4id = hnae_get_field(flag, HNS_RXD_L4ID_M, HNS_RXD_L4ID_S);
+
+	/*  check L3 protocol for which checksum is supported */
+	if ((l3id != HNS_RX_FLAG_L3ID_IPV4) && (l3id != HNS_RX_FLAG_L3ID_IPV6))
+		return;
+
+	/* check for any(not just checksum)flagged L3 protocol errors */
+	if (unlikely(hnae_get_bit(flag, HNS_RXD_L3E_B)))
+		return;
+
+	/* we do not support checksum of fragmented packets */
+	if (unlikely(hnae_get_bit(flag, HNS_RXD_FRAG_B)))
+		return;
+
+	/*  check L4 protocol for which checksum is supported */
+	if ((l4id != HNS_RX_FLAG_L4ID_TCP) &&
+	    (l4id != HNS_RX_FLAG_L4ID_UDP) &&
+	    (l4id != HNS_RX_FLAG_L4ID_SCTP))
+		return;
+
+	/* check for any(not just checksum)flagged L4 protocol errors */
+	if (unlikely(hnae_get_bit(flag, HNS_RXD_L4E_B)))
+		return;
+
+	/* now, this has to be a packet with valid RX checksum */
+	skb->ip_summed = CHECKSUM_UNNECESSARY;
+}
+
 static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 			       struct sk_buff **out_skb, int *out_bnum)
 {
@@ -683,13 +748,10 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 	ring->stats.rx_pkts++;
 	ring->stats.rx_bytes += skb->len;
 
-	if (unlikely(hnae_get_bit(bnum_flag, HNS_RXD_L3E_B) ||
-		     hnae_get_bit(bnum_flag, HNS_RXD_L4E_B))) {
-		ring->stats.l3l4_csum_err++;
-		return 0;
-	}
-
-	skb->ip_summed = CHECKSUM_UNNECESSARY;
+	/* indicate to upper stack if our hardware has already calculated
+	 * the RX checksum
+	 */
+	hns_nic_rx_checksum(ring_data, skb, bnum_flag);
 
 	return 0;
 }
-- 
1.7.9.5

^ permalink raw reply related

* Re: bpf bounded loops. Was: [flamebait] xdp
From: Hannes Frederic Sowa @ 2016-12-06 11:35 UTC (permalink / raw)
  To: Edward Cree, Alexei Starovoitov
  Cc: Tom Herbert, Thomas Graf, Linux Kernel Network Developers,
	Daniel Borkmann, David S. Miller
In-Reply-To: <aafa452d-26dd-d58b-2649-21ccce9370a4@solarflare.com>

On 05.12.2016 17:54, Edward Cree wrote:
> On 05/12/16 16:50, Hannes Frederic Sowa wrote:
>> On 05.12.2016 17:40, Edward Cree wrote:
>>> I may be completely mistaken here, but can't the verifier unroll the loop 'for
>>> verification' without it actually being unrolled in the program?
>>> I.e., any "proof that the loop terminates" should translate into "rewrite of
>>> the directed graph to make it a DAG, possibly duplicating a lot of insns", and
>>> you feed the rewritten graph to the verifier, while using the original loopy
>>> version as the actual program to store and later execute.
>>> Then the verifier happily checks things like array indices being valid, without
>>> having to know about the bounded loops.
>> That is what is already happening. E.g. __builtin_memset is expanded up
>> to 128 rounds (which is a lot) but at some point llvm doesn't do enoug
>> unrolling of that.
>>
>> The BPF target configures that in
>> http://llvm.org/docs/doxygen/html/BPFISelLowering_8cpp_source.html on
>> line 166-169.
> I think you're talking about the _compiler_ unrolling loops before it
> submits the program to the kernel.  I'm talking about having the _verifier_
> unroll them, so that we can execute the original (non-unrolled) version.
> Or am I misunderstanding?

Ah, in the verifier this would be part of flow control analysis what we
are talking about in the other part of this thread.

Bye,
Hannes

^ permalink raw reply

* Re: arp_filter and IPv6 ND
From: Hannes Frederic Sowa @ 2016-12-06 12:07 UTC (permalink / raw)
  To: Saku Ytti; +Cc: netdev
In-Reply-To: <CAAeewD9CDQZp0u6L6JtZ96cboKDCRB=yY92F+-ufsA8=OEgyUw@mail.gmail.com>

On 03.12.2016 15:21, Saku Ytti wrote:
> On 2 December 2016 at 20:39, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> 
> Hey,
> 
>> E.g. you can use IP addresses bound to other interfaces to send replys
>> on another interface. This can be useful if you have a limited amount of
>> IP addresses on the system but much more interfaces. Especially if they
>> are limited in scope, like in IPv6.
>>
>> Basically Cisco's feature of "unnumbered interface" is always provided
>> in Linux. And there are certainly cases where you would want to use it,
>> e.g. emulate private-vlan feature for network separation.
> 
> Got it, thanks, the explanation makes sense. And indeed it's valid
> case, but also it is the exception, not the rule. I think it would be
> entirely change the default and people who want 'unnumbered' style
> behaviour (like some BRAS scenarios), will know how to and why to
> configure it.

The limited ip address scenario is actually more common for normal
routers. ;)

In retrospect I don't know what what would win if the decision would be
made again. Mostly all operating systems switched to strong end host
model over time, Linux remaining in the weak host end camp alone. It
probably is also easier to go from strong end to weak end system by
policy than vice versa, so I would probably also picked strong end
system semantics by default today.

>> Also in the BGP setup, you might have it easier to establish loopback
>> neighbor contact by just using static on-link routes, without caring
>> about more complex numbering there (otherwise you pretty soon introduce
>> OSPF or some other routing protocol to do the recursive forward resolution).
> 
> The BGP is running on-link, it's just that the BGP is advertising loop
> of Linux. Why the loop ends up in ND cache, I don't know.

Did you check neighbor advertisements and solicitations with tcpdump?

Did you have force_tllao, ndisc_notify enabled? Which source address
does the BGP/TCP connection use?

>>> Grand, not that I feel comfortable writing it. I'd rather see the
>>> whole suppression functionality moved to neighbour.c from being AFI
>>> specific.
>>
>> Yes sure, please provide a patch. A separate sysctl is necessary anyway
>> because the current one is within the ipv4 procfs directory hierarchy.
> 
> Sorry, not a comfortable C programmer, I'm pretty confident I could
> get it working, but I'm more confident that patch would be entirely
> rejected and rewritten by someone who knows what they are doing.
> I see no reason not to have AFI specific toggle, just logic and code
> should be AFI agnostic, like GC (ARP/ND cache time) stuff in
> neighbour.c is nicely done. Frankly whole ARP/ND code could do with
> refactoring to make arp.c and ndisc.c more wire-format stuff and
> behavioural code more in neighbour.c.

Let's first see what the real problem is.

^ permalink raw reply

* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
From: Paolo Abeni @ 2016-12-06 12:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn
In-Reply-To: <1481017989.6225.21.camel@redhat.com>

On Tue, 2016-12-06 at 10:53 +0100, Paolo Abeni wrote:
> On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > In UDP recvmsg() path we currently access 3 cache lines from an skb
> > while holding receive queue lock, plus another one if packet is
> > dequeued, since we need to change skb->next->prev
> > 
> > 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> > 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> > 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> > 
> > skb->peeked is only needed to make sure 0-length packets are properly
> > handled while MSG_PEEK is operated.
> > 
> > I had first the intent to remove skb->peeked but the "MSG_PEEK at
> > non-zero offset" support added by Sam Kumar makes this not possible.
> 
> I'm wondering if peeking with offset is going to complicate the 2 queues
> patch, too.
> 
> > This patch avoids one cache line miss during the locked section, when
> > skb->len and skb->peeked do not have to be read.
> > 
> > It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  net/core/datagram.c |   19 ++++++++++---------
> >  1 file changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index 49816af8586bb832e806972b486588041a99524c..9482037a5c8c64aec79e42c65bd2691bdd9450a3 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -214,6 +214,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> >  	if (error)
> >  		goto no_packet;
> >  
> > +	*peeked = 0;
> >  	do {
> >  		/* Again only user level code calls this function, so nothing
> >  		 * interrupt level will suddenly eat the receive_queue.
> > @@ -227,22 +228,22 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> >  		spin_lock_irqsave(&queue->lock, cpu_flags);
> >  		skb_queue_walk(queue, skb) {
> >  			*last = skb;
> > -			*peeked = skb->peeked;
> >  			if (flags & MSG_PEEK) {
> >  				if (_off >= skb->len && (skb->len || _off ||
> >  							 skb->peeked)) {
> >  					_off -= skb->len;
> >  					continue;
> >  				}
> > -
> > -				skb = skb_set_peeked(skb);
> > -				error = PTR_ERR(skb);
> > -				if (IS_ERR(skb)) {
> > -					spin_unlock_irqrestore(&queue->lock,
> > -							       cpu_flags);
> > -					goto no_packet;
> > +				if (!skb->len) {
> > +					skb = skb_set_peeked(skb);
> > +					if (IS_ERR(skb)) {
> > +						error = PTR_ERR(skb);
> > +						spin_unlock_irqrestore(&queue->lock,
> > +								       cpu_flags);
> > +						goto no_packet;
> > +					}
> >  				}
> 
> I don't understand why we can avoid setting skb->peek if len > 0. I
> think that will change the kernel behavior if:
> - peek with offset is set
> - 3 skbs with len > 0 are enqueued
> - the u/s peek (with offset) the second one
> - the u/s disable peeking with offset and peeks 2 more skbs.
> 
> With the current code in the last step the u/s is going to peek the 1#
> and the 3# skbs, after this patch will peek the 1# and the 2#. Am I
> missing something ? Probably the new behavior is more correct, but still
> is a change. 

Please ignore the above dumb comment. I misread the 'skip condition'.

I'm fine with the patch in its current form.

Acked-by: Paolo Abeni <pabeni@redhat.com>

^ permalink raw reply

* Re: [PATCH v2] netfilter: avoid warn and OOM killer on vmalloc call
From: Pablo Neira Ayuso @ 2016-12-06 12:21 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: andreyknvl, fw, nhorman, netfilter-devel, netdev, linux-kernel
In-Reply-To: <e10fd8b5206bfde5526e3bbf2b5ff9af584b0a6f.1480671893.git.marcelo.leitner@gmail.com>

On Fri, Dec 02, 2016 at 07:46:38AM -0200, Marcelo Ricardo Leitner wrote:
> Andrey Konovalov reported that this vmalloc call is based on an
> userspace request and that it's spewing traces, which may flood the logs
> and cause DoS if abused.
> 
> Florian Westphal also mentioned that this call should not trigger OOM
> killer.
> 
> This patch brings the vmalloc call in sync to kmalloc and disables the
> warn trace on allocation failure and also disable OOM killer invocation.
> 
> Note, however, that under such stress situation, other places may
> trigger OOM killer invocation.

Unless anyone has an objection, I'm going to place this in nf-next.

Thanks.

> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Cc: Florian Westphal <fw@strlen.de>
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  net/netfilter/x_tables.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index fc4977456c30e098197b4f987b758072c9cf60d9..dece525bf83a0098dad607fce665cd0bde228362 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -958,7 +958,9 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
>  	if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
>  		info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
>  	if (!info) {
> -		info = vmalloc(sz);
> +		info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
> +				     __GFP_NORETRY | __GFP_HIGHMEM,
> +				 PAGE_KERNEL);
>  		if (!info)
>  			return NULL;
>  	}
> -- 
> 2.9.3
> 

^ permalink raw reply

* Re: [PATCH v2 net-next v2 4/4] net: dsa: mv88e6xxx: add PPU operations
From: Stefan Eichenberger @ 2016-12-06 12:27 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Richard Cochran
In-Reply-To: <874m2ic95g.fsf@ketchup.i-did-not-set--mail-host-address--so-tickle-me>

On 5 December 2016 at 23:18, Vivien Didelot
<vivien.didelot@savoirfairelinux.com> wrote:
> Stefan Eichenberger <eichest@gmail.com> writes:
>
>> Hi Vivien,
>>
>> On Mon, Dec 05, 2016 at 11:27:03AM -0500, Vivien Didelot wrote:
>>> @@ -3266,6 +3220,8 @@ static const struct mv88e6xxx_ops mv88e6097_ops = {
>>>      .g1_set_cpu_port = mv88e6095_g1_set_cpu_port,
>>>      .g1_set_egress_port = mv88e6095_g1_set_egress_port,
>>>      .mgmt_rsvd2cpu = mv88e6095_g2_mgmt_rsvd2cpu,
>>> +    .ppu_enable = mv88e6185_g1_ppu_enable,
>>> +    .ppu_disable = mv88e6185_g1_ppu_disable,
>>>      .reset = mv88e6185_g1_reset,
>>>  };
>>
>> The mv88e6097 should use the indirect access to the phys, bit 14 in g1
>> control is marked as reserved. They write in the datasheet that
>> disabling the PPU is still supported but indirect access via g2 should
>> be used because disabling the PPU  is no longer recommended.
>
> Ho ok thanks, I respin a v3 right away with this removed and with
> mv88e6352_g1_reset instead.

Perfect thank you, now it's fine.

Regards,
Stefan

^ permalink raw reply

* Re: [PATCHv2 net] team: team_port_add should check link_up before enable port
From: Jiri Pirko @ 2016-12-06 12:27 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, davem, Marcelo Ricardo Leitner
In-Reply-To: <CADvbK_e01bz-PGmDrdxztyzLqmR5od3rahO960Oe2OhYQ6fp9w@mail.gmail.com>

Tue, Dec 06, 2016 at 11:38:53AM CET, lucien.xin@gmail.com wrote:
>On Tue, Dec 6, 2016 at 4:22 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Tue, Dec 06, 2016 at 08:29:08AM CET, lucien.xin@gmail.com wrote:
>>>Now when users add a nic to team dev, the option 'enable' of the port
>>>is true by default, as team_port_enable enables it after dev_open in
>>>team_port_add.
>>>
>>>But even if the port_dev has no carrier, like it's cable was unpluged,
>>>the port is still enabled. It leads to that team dev couldn't work well
>>>if this port was chosen to connect, and has no chance to change to use
>>>other ports if link_watch is ethtool.
>>>
>>>This patch is to enable the port only when the port_dev has carrier in
>>>team_port_add.
>>>
>>>v1 -> v2:
>>>  use netif_carrier_ok() instead of !!netif_carrier_ok(), as it returns
>>>  bool now.
>>>
>>>Signed-off-by: Xin Long <lucien.xin@gmail.com>
>>>---
>>> drivers/net/team/team.c | 8 ++++++--
>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>>>index a380649..4bc0103 100644
>>>--- a/drivers/net/team/team.c
>>>+++ b/drivers/net/team/team.c
>>>@@ -1140,6 +1140,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>>>       struct net_device *dev = team->dev;
>>>       struct team_port *port;
>>>       char *portname = port_dev->name;
>>>+      bool linkup;
>>>       int err;
>>>
>>>       if (port_dev->flags & IFF_LOOPBACK) {
>>>@@ -1249,9 +1250,12 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>>>
>>>       port->index = -1;
>>>       list_add_tail_rcu(&port->list, &team->port_list);
>>>-      team_port_enable(team, port);
>>>+      linkup = netif_carrier_ok(port_dev);
>>>+      if (linkup)
>>>+              team_port_enable(team, port);
>>
>> This is obviously wrong change. It you use a simple setup without
>> userspace part (e.g. round robin), The port gets never enabled.
>> team_port_enabl is called from here and team_port_en_option_set.
>yes, without userspace part, that would be a problem.
>
>>
>> By default, all ports should be enabled. Only in case the userspace
>> daemon decides to disable, it does so.
>>
>> Could you send me the exact configuration where you see the issue?
>attachment is the scripts,
>
># ./setup.sh
># ip netns exec ns1 ./team1.sh
># ./team0.sh
># ping 192.168.11.3
>
>the issue is team0 cannot switch to veth2, even if the veth0 has no carrier.
>
>> This should be definitelly fixed in user part.
>now it can disable/enable it in lw_ethtool_event_watch_port_changed()
>when receive event from kernel, but at the beginning, the first event from
>adding port will not going to team_port_en_option_set, as new_link_up
>== common_ppriv->link_up in lw_ethtool_event_watch_port_changed().
>
>maybe it should be fixed there ?

Yes please.

>
>>
>> Thanks.

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] flow dissector: ICMP support
From: Jiri Pirko @ 2016-12-06 12:26 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tom Herbert, David Miller, Linux Kernel Network Developers,
	Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, Jamal Hadi Salim,
	Jiri Pirko
In-Reply-To: <20161206105145.GA27020@penelope.horms.nl>

Tue, Dec 06, 2016 at 11:51:46AM CET, simon.horman@netronome.com wrote:
>On Mon, Dec 05, 2016 at 09:29:45AM -0800, Tom Herbert wrote:
>> On Mon, Dec 5, 2016 at 6:23 AM, Simon Horman <simon.horman@netronome.com> wrote:
>> > On Sat, Dec 03, 2016 at 10:52:32AM -0800, Tom Herbert wrote:
>> >> On Sat, Dec 3, 2016 at 2:49 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> >> > Fri, Dec 02, 2016 at 09:31:41PM CET, simon.horman@netronome.com wrote:
>> >> >>Allow dissection of ICMP(V6) type and code. This re-uses transport layer
>> >> >>port dissection code as although ICMP is not a transport protocol and their
>> >> >>type and code are not ports this allows sharing of both code and storage.
>> >> >>
>> >> >>Signed-off-by: Simon Horman <simon.horman@netronome.com>
>> >
>> > ...
>> >
>> >> > Digging into this a bit more. I think it would be much nice not to mix
>> >> > up l4 ports and icmp stuff.
>> >> >
>> >> > How about to have FLOW_DISSECTOR_KEY_ICMP
>> >> > and
>> >> > struct flow_dissector_key_icmp {
>> >> >         u8 type;
>> >> >         u8 code;
>> >> > };
>> >> >
>> >> > The you can make this structure and struct flow_dissector_key_ports into
>> >> > an union in struct flow_keys.
>> >> >
>> >> > Looks much cleaner to me.
>> >
>> > Hi Jiri,
>> >
>> > I wondered about that too. The main reason that I didn't do this the first
>> > time around is that I wasn't sure what to do to differentiate between ports
>> > and icmp in fl_init_dissector() given that using a union would could to
>> > mask bits being set in both if they are set in either.
>> >
>> > I've taken a stab at addressing that below along with implementing your
>> > suggestion. If the treatment in fl_init_dissector() is acceptable
>> > perhaps something similar should be done for FLOW_DISSECTOR_KEY_IPV[46]_ADDRS
>> > too?
>> >
>> >> I agree, this patch adds to many conditionals into the fast path for
>> >> ICMP handling. Neither is there much point in using type and code as
>> >> input to the packet hash.
>> >
>> > Hi Tom,
>> >
>> > I'm not sure that I follow this. A packet may be ICMP or not regardless of
>> > if FLOW_DISSECTOR_KEY_PORT (and now FLOW_DISSECTOR_KEY_ICMP) port is set or
>> > not. I'd appreciate some guidance here.
>> >
>> > First-pass at implementing Jiri's suggestion follows:
>> >
>
>...
>
>> > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>> > index 0584b4bb4390..33e5fa2b3e87 100644
>> > --- a/net/core/flow_dissector.c
>> > +++ b/net/core/flow_dissector.c
>
>...
>
>> > @@ -975,6 +983,10 @@ static const struct flow_dissector_key flow_keys_dissector_keys[] = {
>> >                 .offset = offsetof(struct flow_keys, ports),
>> >         },
>> >         {
>> > +               .key_id = FLOW_DISSECTOR_KEY_ICMP,
>> > +               .offset = offsetof(struct flow_keys, icmp),
>> > +       },
>> 
>> This is the problem I was referring to. This adds ICMP to the keys for
>> computing the skb hash and the ICMP type and code are in a union with
>> port numbers so they are in the range over which the hash is computed.
>> This means that we are treating type and code as some sort of flow and
>> and so different type/code between the same src/dst could follow
>> different paths in ECMP. For the purposes of computing a packet hash I
>> don't think we want ICMP information included. This might be a matter
>> of not putting FLOW_DISSECTOR_KEY_ICMP in flow_keys_dissector_keys,
>> where we need this information we could define another structure that
>> has all the same fields as in flow_keys_dissector_keys and include
>> FLOW_DISSECTOR_KEY_ICMP. Alternatively, the flow_dissector_key_icmp
>> could also be outside of the area that is used in the hash: that is no
>> in flow_keys_hash_start(keys) to flow_keys_hash_length(keys), keyval);
>
>...
>
>> > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>> > index c96b2a349779..f4ba69bd362f 100644
>> > --- a/net/sched/cls_flower.c
>> > +++ b/net/sched/cls_flower.c
>> > @@ -38,7 +38,10 @@ struct fl_flow_key {
>> >                 struct flow_dissector_key_ipv4_addrs ipv4;
>> >                 struct flow_dissector_key_ipv6_addrs ipv6;
>> >         };
>> > -       struct flow_dissector_key_ports tp;
>> > +       union {
>> > +               struct flow_dissector_key_ports tp;
>> > +               struct flow_dissector_key_icmp icmp;
>> > +       };
>> 
>> When an ICMP packet is encapsulated within UDP then icmp overwrites
>> valid port information with is a stronger signal of the flow (unless
>> FLOW_DISSECTOR_F_STOP_AT_ENCAP is set). This is another reason not to
>> use a union with ports.
>
>...
>
>Hi Tom,
>
>thanks for your explanation. I think I have a clearer picture now.
>
>I have reworked things to try to address your concerns.
>In particular:
>
>* FLOW_DISSECTOR_KEY_ICMP is no longer added to flow_keys_dissector_keys.
>  I don't think it is needed at all for the use-case I currently have in
>  mind, which is classifying packets using tc_flower. And adding it there
>  was an error on my part.

I was just about to ask why you are adding it there. Good.


>
>* I stopped using a union for ports and icmp. At the very least this seems
>  to make it easier to reason about things as we no longer need to consider
>  that a port value may in fact be an ICMP type or code.

This should be within csl_flower code. You can easily have it as a union
in struct fl_flow_key. 


>
>  This seems to allow us avoid adding a number of is_icmp checks (as I believe
>  you pointed out earlier). And should also address the problem you state
>  immediately above.

I don't understand the check you are talking about. The union just allow
to share the mem. I don't see any checks needed.


>
>* I have also placed icmp outside of the range flow_keys_hash_start(keys)
>  to flow_keys_hash_length(keys), keyval). This is because I now see no
>  value of having it inside that range and it both avoids any chance of
>  contaminating the hash with ICMP values and hashing over unwanted
>  (hopefully zero) values.

Okay, now I'm confused again. You don't want to add this to
flow_keys_dissector_keys. Why would you?


>
>  This required an update to flow_keys_hash_length() as the bound
>  of the end of fields the hash is no longer the end of struct flow_keys.
>  It seems clean but I wonder if there are similar assumptions lurking
>  elsewhere.
>
>I have lightly tested this for the tc_flower case (only).
>
>Incremental patch on top of this series:
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index a6f75cfb2bf7..8029dd4912b6 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -3181,8 +3181,7 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
> 	} else {
> 		return false;
> 	}
>-	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 &&
>-	    proto >= 0 && !skb_flow_is_icmp_any(skb, proto))
>+	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && proto >= 0)
> 		fk->ports.ports = skb_flow_get_ports(skb, noff, proto);
> 
> 	return true;
>@@ -3210,8 +3209,7 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
> 		return bond_eth_hash(skb);
> 
> 	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER23 ||
>-	    bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23 ||
>-	    flow_keys_are_icmp_any(&flow))
>+	    bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23)
> 		hash = bond_eth_hash(skb);
> 	else
> 		hash = (__force u32)flow.ports.ports;
>diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>index 44a8f69a9198..9c535fbccf2c 100644
>--- a/include/linux/skbuff.h
>+++ b/include/linux/skbuff.h
>@@ -1094,11 +1094,6 @@ u32 __skb_get_poff(const struct sk_buff *skb, void *data,
> __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
> 			    void *data, int hlen_proto);
> 
>-static inline bool skb_flow_is_icmp_any(const struct sk_buff *skb, u8 ip_proto)
>-{
>-	return flow_protos_are_icmp_any(skb->protocol, ip_proto);
>-}
>-
> static inline __be32 skb_flow_get_ports(const struct sk_buff *skb,
> 					int thoff, u8 ip_proto)
> {
>diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
>index 5540dfa18872..058c9df8a4d8 100644
>--- a/include/net/flow_dissector.h
>+++ b/include/net/flow_dissector.h
>@@ -91,14 +91,10 @@ struct flow_dissector_key_addrs {
> 
> /**
>  * flow_dissector_key_ports:
>- *	@ports: port numbers of Transport header or
>- *		type and code of ICMP header
>+ *	@ports: port numbers of Transport header
>  *		ports: source (high) and destination (low) port numbers
>  *		src: source port number
>  *		dst: destination port number
>- *		icmp: ICMP type (high) and code (low)
>- *		type: ICMP type
>- *		type: ICMP code
>  */
> struct flow_dissector_key_ports {
> 	union {
>@@ -107,6 +103,18 @@ struct flow_dissector_key_ports {
> 			__be16 src;
> 			__be16 dst;
> 		};
>+	};
>+};
>+
>+/**
>+ * flow_dissector_key_icmp:
>+ *	@ports: type and code of ICMP header
>+ *		icmp: ICMP type (high) and code (low)
>+ *		type: ICMP type
>+ *		code: ICMP code
>+ */
>+struct flow_dissector_key_icmp {
>+	union {
> 		__be16 icmp;
> 		struct {
> 			u8 type;
>@@ -115,7 +123,6 @@ struct flow_dissector_key_ports {
> 	};
> };
> 
>-
> /**
>  * struct flow_dissector_key_eth_addrs:
>  * @src: source Ethernet address
>@@ -133,6 +140,7 @@ enum flow_dissector_key_id {
> 	FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
> 	FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
> 	FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
>+	FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
> 	FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
> 	FLOW_DISSECTOR_KEY_TIPC_ADDRS, /* struct flow_dissector_key_tipc_addrs */
> 	FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
>@@ -173,11 +181,16 @@ struct flow_keys {
> 	struct flow_dissector_key_keyid keyid;
> 	struct flow_dissector_key_ports ports;
> 	struct flow_dissector_key_addrs addrs;
>+#define FLOW_KEYS_HASH_END_FIELD addrs
>+	struct flow_dissector_key_icmp icmp;
> };
> 
> #define FLOW_KEYS_HASH_OFFSET		\
> 	offsetof(struct flow_keys, FLOW_KEYS_HASH_START_FIELD)
> 
>+#define FLOW_KEYS_HASH_END		\
>+	offsetofend(struct flow_keys, FLOW_KEYS_HASH_END_FIELD)
>+
> __be32 flow_get_u32_src(const struct flow_keys *flow);
> __be32 flow_get_u32_dst(const struct flow_keys *flow);
> 
>@@ -233,8 +246,7 @@ static inline bool flow_keys_are_icmp_any(const struct flow_keys *keys)
> 
> static inline bool flow_keys_have_l4(const struct flow_keys *keys)
> {
>-	return (!flow_keys_are_icmp_any(keys) && keys->ports.ports) ||
>-		keys->tags.flow_label;
>+	return (keys->ports.ports || keys->tags.flow_label);
> }
> 
> u32 flow_hash_from_keys(struct flow_keys *keys);
>diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>index 0584b4bb4390..ed6d46475343 100644
>--- a/net/core/flow_dissector.c
>+++ b/net/core/flow_dissector.c
>@@ -139,6 +139,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> 	struct flow_dissector_key_basic *key_basic;
> 	struct flow_dissector_key_addrs *key_addrs;
> 	struct flow_dissector_key_ports *key_ports;
>+	struct flow_dissector_key_icmp *key_icmp;
> 	struct flow_dissector_key_tags *key_tags;
> 	struct flow_dissector_key_vlan *key_vlan;
> 	struct flow_dissector_key_keyid *key_keyid;
>@@ -559,18 +560,25 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> 		break;
> 	}
> 
>-	if (dissector_uses_key(flow_dissector,
>-			       FLOW_DISSECTOR_KEY_PORTS)) {
>-		key_ports = skb_flow_dissector_target(flow_dissector,
>-						      FLOW_DISSECTOR_KEY_PORTS,
>-						      target_container);
>-		if (flow_protos_are_icmp_any(proto, ip_proto))
>-			key_ports->icmp = skb_flow_get_be16(skb, nhoff, data,
>-							    hlen);
>-		else
>+	if (flow_protos_are_icmp_any(proto, ip_proto)) {
>+		if (dissector_uses_key(flow_dissector,
>+				       FLOW_DISSECTOR_KEY_ICMP)) {
>+			key_icmp = skb_flow_dissector_target(flow_dissector,
>+							     FLOW_DISSECTOR_KEY_ICMP,
>+							     target_container);
>+			key_icmp->icmp = skb_flow_get_be16(skb, nhoff, data,
>+							   hlen);
>+		}
>+	} else {
>+		if (dissector_uses_key(flow_dissector,
>+				       FLOW_DISSECTOR_KEY_PORTS)) {
>+			key_ports = skb_flow_dissector_target(flow_dissector,
>+							      FLOW_DISSECTOR_KEY_PORTS,
>+							      target_container);
> 			key_ports->ports = __skb_flow_get_ports(skb, nhoff,
> 								ip_proto, data,
> 								hlen);
>+		}
> 	}
> 
> out_good:
>@@ -613,9 +621,10 @@ static inline const u32 *flow_keys_hash_start(const struct flow_keys *flow)
> static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
> {
> 	size_t diff = FLOW_KEYS_HASH_OFFSET + sizeof(flow->addrs);
>-	BUILD_BUG_ON((sizeof(*flow) - FLOW_KEYS_HASH_OFFSET) % sizeof(u32));
>+	BUILD_BUG_ON((FLOW_KEYS_HASH_END - FLOW_KEYS_HASH_OFFSET) %
>+		     sizeof(u32));
> 	BUILD_BUG_ON(offsetof(typeof(*flow), addrs) !=
>-		     sizeof(*flow) - sizeof(flow->addrs));
>+		     FLOW_KEYS_HASH_END - sizeof(flow->addrs));
> 
> 	switch (flow->control.addr_type) {
> 	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
>@@ -628,7 +637,7 @@ static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
> 		diff -= sizeof(flow->addrs.tipcaddrs);
> 		break;
> 	}
>-	return (sizeof(*flow) - diff) / sizeof(u32);
>+	return (FLOW_KEYS_HASH_END - diff) / sizeof(u32);
> }
> 
> __be32 flow_get_u32_src(const struct flow_keys *flow)
>@@ -745,8 +754,7 @@ void make_flow_keys_digest(struct flow_keys_digest *digest,
> 
> 	data->n_proto = flow->basic.n_proto;
> 	data->ip_proto = flow->basic.ip_proto;
>-	if (flow_keys_have_l4(flow))
>-		data->ports = flow->ports.ports;
>+	data->ports = flow->ports.ports;
> 	data->src = flow->addrs.v4addrs.src;
> 	data->dst = flow->addrs.v4addrs.dst;
> }
>diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
>index 408313e33bbe..6575aba87630 100644
>--- a/net/sched/cls_flow.c
>+++ b/net/sched/cls_flow.c
>@@ -96,7 +96,7 @@ static u32 flow_get_proto(const struct sk_buff *skb,
> static u32 flow_get_proto_src(const struct sk_buff *skb,
> 			      const struct flow_keys *flow)
> {
>-	if (!flow_keys_are_icmp_any(flow) && flow->ports.ports)
>+	if (flow->ports.ports)
> 		return ntohs(flow->ports.src);
> 
> 	return addr_fold(skb->sk);
>@@ -105,7 +105,7 @@ static u32 flow_get_proto_src(const struct sk_buff *skb,
> static u32 flow_get_proto_dst(const struct sk_buff *skb,
> 			      const struct flow_keys *flow)
> {
>-	if (!flow_keys_are_icmp_any(flow) && flow->ports.ports)
>+	if (flow->ports.ports)
> 		return ntohs(flow->ports.dst);
> 
> 	return addr_fold(skb_dst(skb)) ^ (__force u16) tc_skb_protocol(skb);
>diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>index c96b2a349779..56df0368125a 100644
>--- a/net/sched/cls_flower.c
>+++ b/net/sched/cls_flower.c
>@@ -39,6 +39,7 @@ struct fl_flow_key {
> 		struct flow_dissector_key_ipv6_addrs ipv6;
> 	};
> 	struct flow_dissector_key_ports tp;
>+	struct flow_dissector_key_icmp icmp;
> 	struct flow_dissector_key_keyid enc_key_id;
> 	union {
> 		struct flow_dissector_key_ipv4_addrs enc_ipv4;
>@@ -511,19 +512,23 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
> 			       &mask->tp.dst, TCA_FLOWER_KEY_SCTP_DST_MASK,
> 			       sizeof(key->tp.dst));
> 	} else if (flow_basic_key_is_icmpv4(&key->basic)) {
>-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
>-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>-			       sizeof(key->tp.type));
>-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-			       sizeof(key->tp.code));
>+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV4_TYPE,
>+			       &mask->icmp.type,
>+			       TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>+			       sizeof(key->icmp.type));
>+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>+			       &mask->icmp.code,
>+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>+			       sizeof(key->icmp.code));
> 	} else if (flow_basic_key_is_icmpv6(&key->basic)) {
>-		fl_set_key_val(tb, &key->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
>-			       &mask->tp.type, TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>-			       sizeof(key->tp.type));
>-		fl_set_key_val(tb, &key->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>-			       &mask->tp.code, TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-			       sizeof(key->tp.code));
>+		fl_set_key_val(tb, &key->icmp.type, TCA_FLOWER_KEY_ICMPV6_TYPE,
>+			       &mask->icmp.type,
>+			       TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>+			       sizeof(key->icmp.type));
>+		fl_set_key_val(tb, &key->icmp.code, TCA_FLOWER_KEY_ICMPV4_CODE,
>+			       &mask->icmp.code,
>+			       TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>+			       sizeof(key->icmp.code));
> 	}
> 
> 	if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
>@@ -634,6 +639,8 @@ static void fl_init_dissector(struct cls_fl_head *head,
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
> 			     FLOW_DISSECTOR_KEY_PORTS, tp);
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
>+			     FLOW_DISSECTOR_KEY_ICMP, icmp);
>+	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
> 			     FLOW_DISSECTOR_KEY_VLAN, vlan);
> 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
> 			     FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id);
>@@ -1000,24 +1007,24 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
> 				  sizeof(key->tp.dst))))
> 		goto nla_put_failure;
> 	else if (flow_basic_key_is_icmpv4(&key->basic) &&
>-		 (fl_dump_key_val(skb, &key->tp.type,
>-				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->tp.type,
>+		 (fl_dump_key_val(skb, &key->icmp.type,
>+				  TCA_FLOWER_KEY_ICMPV4_TYPE, &mask->icmp.type,
> 				  TCA_FLOWER_KEY_ICMPV4_TYPE_MASK,
>-				  sizeof(key->tp.type)) ||
>-		  fl_dump_key_val(skb, &key->tp.code,
>-				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->tp.code,
>+				  sizeof(key->icmp.type)) ||
>+		  fl_dump_key_val(skb, &key->icmp.code,
>+				  TCA_FLOWER_KEY_ICMPV4_CODE, &mask->icmp.code,
> 				  TCA_FLOWER_KEY_ICMPV4_CODE_MASK,
>-				  sizeof(key->tp.code))))
>+				  sizeof(key->icmp.code))))
> 		goto nla_put_failure;
> 	else if (flow_basic_key_is_icmpv6(&key->basic) &&
>-		 (fl_dump_key_val(skb, &key->tp.type,
>-				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->tp.type,
>+		 (fl_dump_key_val(skb, &key->icmp.type,
>+				  TCA_FLOWER_KEY_ICMPV6_TYPE, &mask->icmp.type,
> 				  TCA_FLOWER_KEY_ICMPV6_TYPE_MASK,
>-				  sizeof(key->tp.type)) ||
>-		  fl_dump_key_val(skb, &key->tp.code,
>-				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->tp.code,
>+				  sizeof(key->icmp.type)) ||
>+		  fl_dump_key_val(skb, &key->icmp.code,
>+				  TCA_FLOWER_KEY_ICMPV6_CODE, &mask->icmp.code,
> 				  TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
>-				  sizeof(key->tp.code))))
>+				  sizeof(key->icmp.code))))
> 		goto nla_put_failure;
> 
> 	if (key->enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS &&

^ permalink raw reply

* Re: [PATCH] net:switchdev: fix to release the lock before function return to avoid deadlock
From: Jiri Pirko @ 2016-12-06 12:29 UTC (permalink / raw)
  To: Feng Deng; +Cc: davem, netdev, linux-kernel, cxdx2006
In-Reply-To: <1481018964-47554-1-git-send-email-feng.deng@cortina-access.com>

Tue, Dec 06, 2016 at 11:09:24AM CET, feng.deng@cortina-access.com wrote:
>before switchdev_deferred_dequeue() normal return,show release the lock,
>if not,maybe there will be deadlock sometimes
>
>Signed-off-by: Feng Deng <feng.deng@cortina-access.com>
>---
> net/switchdev/switchdev.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>index 3b95fe9..c0a1ad4 100644
>--- a/net/switchdev/switchdev.c
>+++ b/net/switchdev/switchdev.c
>@@ -120,6 +120,7 @@ static struct switchdev_deferred_item *switchdev_deferred_dequeue(void)
> 	dfitem = list_first_entry(&deferred,
> 				  struct switchdev_deferred_item, list);
> 	list_del(&dfitem->list);
>+	spin_unlock_bh(&deferred_lock);
> unlock:
> 	spin_unlock_bh(&deferred_lock);

You are joking right?

^ permalink raw reply

* Re: [PATCH]net:switchdev:add judgement and proper return vlaue to make switchdev_port_vlan_fill() more robust
From: Jiri Pirko @ 2016-12-06 12:32 UTC (permalink / raw)
  To: Feng Deng; +Cc: David S. Miller, netdev, linux-kernel, feng.deng
In-Reply-To: <CAEX1niU24TLaRff+TMPu4ek4vT=R2mv9yaE6j1W1ZsFYh99QGg@mail.gmail.com>

Tue, Dec 06, 2016 at 05:46:36AM CET, cxdx2006@gmail.com wrote:
>From: Feng Deng<cxdx2006@gmail.com>
>
>1.add judgement to make sure switchdev_port_vlan_fill() could
>  return right,  even when switchdev_port_vlan_dump_put() failed  ,
>  which will make switchdev_port_vlan_fill()  more robust
>2.add proper return vlaue at the end of the switchdev_port_vlan_fill()
>
>Signed-off-by: Feng Deng <feng.deng@cortina-access.com>

Please don't send patches like this. You are wasting everyone's time.

Thanks!

^ permalink raw reply

* [PATCH net-next 2/3] net/sched: cls_flower: Add support for matching on flags
From: Or Gerlitz @ 2016-12-06 12:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jiri Pirko, Roi Dayan, Hadar Har-Zion, Or Gerlitz
In-Reply-To: <1481027579-23195-1-git-send-email-ogerlitz@mellanox.com>

Add UAPI to provide set of flags for matching, where the flags
provided from user-space are mapped to flow-dissector flags.

The 1st flag allows to match on whether the packet is an
IP fragment and corresponds to the FLOW_DIS_IS_FRAGMENT flag.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
---
 include/uapi/linux/pkt_cls.h |  7 ++++
 net/sched/cls_flower.c       | 83 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 86786d4..f114277 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -457,11 +457,18 @@ enum {
 	TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK,	/* be16 */
 	TCA_FLOWER_KEY_ENC_UDP_DST_PORT,	/* be16 */
 	TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK,	/* be16 */
+
+	TCA_FLOWER_KEY_FLAGS,		/* be32 */
+	TCA_FLOWER_KEY_FLAGS_MASK,	/* be32 */
 	__TCA_FLOWER_MAX,
 };
 
 #define TCA_FLOWER_MAX (__TCA_FLOWER_MAX - 1)
 
+enum {
+	TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT = BIT(0),
+};
+
 /* Match-all classifier */
 
 enum {
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index c5cea78..d9f4124 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -383,6 +383,8 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
 	[TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK]	= { .type = NLA_U16 },
 	[TCA_FLOWER_KEY_ENC_UDP_DST_PORT]	= { .type = NLA_U16 },
 	[TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK]	= { .type = NLA_U16 },
+	[TCA_FLOWER_KEY_FLAGS]		= { .type = NLA_U32 },
+	[TCA_FLOWER_KEY_FLAGS_MASK]	= { .type = NLA_U32 },
 };
 
 static void fl_set_key_val(struct nlattr **tb,
@@ -417,6 +419,40 @@ static void fl_set_key_vlan(struct nlattr **tb,
 	}
 }
 
+static void set_flags(u32 flower_key, u32 flower_mask,
+		      u32 *dissector_key, u32 *dissector_mask,
+		      u32 flower_flag_bit, u32 dissector_flag_bit)
+{
+	if (flower_mask & flower_flag_bit) {
+		*dissector_mask |= dissector_flag_bit;
+		if (flower_key & flower_flag_bit)
+			*dissector_key |= dissector_flag_bit;
+	}
+}
+
+static void fl_set_key_flags(struct nlattr **tb,
+			     u32 *flags_key,
+			     u32 *flags_mask)
+{
+	u32 key, mask;
+
+	if (!tb[TCA_FLOWER_KEY_FLAGS])
+		return;
+
+	key = be32_to_cpu(nla_get_u32(tb[TCA_FLOWER_KEY_FLAGS]));
+
+	if (!tb[TCA_FLOWER_KEY_FLAGS_MASK])
+		mask = ~0;
+	else
+		mask = be32_to_cpu(nla_get_u32(tb[TCA_FLOWER_KEY_FLAGS_MASK]));
+
+	*flags_key  = 0;
+	*flags_mask = 0;
+
+	set_flags(key, mask, flags_key, flags_mask,
+		  TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT, FLOW_DIS_IS_FRAGMENT);
+}
+
 static int fl_set_key(struct net *net, struct nlattr **tb,
 		      struct fl_flow_key *key, struct fl_flow_key *mask)
 {
@@ -543,6 +579,8 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 		       &mask->enc_tp.dst, TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK,
 		       sizeof(key->enc_tp.dst));
 
+	fl_set_key_flags(tb, &key->control.flags, &mask->control.flags);
+
 	return 0;
 }
 
@@ -877,6 +915,48 @@ static int fl_dump_key_vlan(struct sk_buff *skb,
 	return 0;
 }
 
+static void get_flags(u32 dissector_key, u32 dissector_mask,
+		      u32 *flower_key, u32 *flower_mask,
+		      u32 flower_flag_bit, u32 dissector_flag_bit)
+{
+	if (dissector_mask & dissector_flag_bit) {
+		*flower_mask |= flower_flag_bit;
+		if (dissector_key & dissector_flag_bit)
+			*flower_key |= flower_flag_bit;
+	}
+}
+
+static int fl_dump_key_flags(struct sk_buff *skb,
+			     u32 flags_key,
+			     u32 flags_mask)
+{
+	u32 key, mask;
+	__be32 _key, _mask;
+	int err;
+
+	if (!memchr_inv(&flags_mask, 0, sizeof(flags_mask)))
+		return 0;
+
+	key  = 0;
+	mask = 0;
+
+	get_flags(flags_key, flags_mask, &key, &mask,
+		  TCA_FLOWER_KEY_FLAGS_IS_FRAGMENT, FLOW_DIS_IS_FRAGMENT);
+
+	_key  = cpu_to_be32(key);
+	_mask = cpu_to_be32(mask);
+
+	err = nla_put(skb, TCA_FLOWER_KEY_FLAGS, 4, &_key);
+	if (err)
+		return err;
+
+	err = nla_put(skb, TCA_FLOWER_KEY_FLAGS_MASK, 4, &_mask);
+	if (err)
+		return err;
+
+	return 0;
+}
+
 static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 		   struct sk_buff *skb, struct tcmsg *t)
 {
@@ -1012,6 +1092,9 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 			    sizeof(key->enc_tp.dst)))
 		goto nla_put_failure;
 
+	if (fl_dump_key_flags(skb, key->control.flags, mask->control.flags))
+		goto nla_put_failure;
+
 	nla_put_u32(skb, TCA_FLOWER_FLAGS, f->flags);
 
 	if (tcf_exts_dump(skb, &f->exts))
-- 
2.3.7

^ permalink raw reply related

* [PATCH net-next 1/3] net/flow_dissector: Enable matching on flags for TC filter consumers
From: Or Gerlitz @ 2016-12-06 12:32 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Jiri Pirko, Roi Dayan, Hadar Har-Zion, Or Gerlitz,
	Tom Herbert
In-Reply-To: <1481027579-23195-1-git-send-email-ogerlitz@mellanox.com>

This is a pre-step to allow TC classifiers that makes use of the
flow-dissector (e.g Flower) to match on flow-dissector flags which
are located in the control struct.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 include/net/flow_dissector.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index c4f3166..a679e6c 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -154,8 +154,8 @@ struct flow_dissector {
 };
 
 struct flow_keys {
+#define FLOW_KEYS_HASH_START_FIELD control
 	struct flow_dissector_key_control control;
-#define FLOW_KEYS_HASH_START_FIELD basic
 	struct flow_dissector_key_basic basic;
 	struct flow_dissector_key_tags tags;
 	struct flow_dissector_key_vlan vlan;
-- 
2.3.7
Cc: Tom Herbert <tom@herbertland.com>

^ permalink raw reply related

* [PATCH net-next 3/3] net/mlx5e: Offload TC matching on packets being IP fragments
From: Or Gerlitz @ 2016-12-06 12:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jiri Pirko, Roi Dayan, Hadar Har-Zion, Or Gerlitz
In-Reply-To: <1481027579-23195-1-git-send-email-ogerlitz@mellanox.com>

Enable offloading of matching on packets being fragments.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index f07ef8c..f8829b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -31,6 +31,7 @@
  */
 
 #include <net/flow_dissector.h>
+#include <net/sch_generic.h>
 #include <net/pkt_cls.h>
 #include <net/tc_act/tc_gact.h>
 #include <net/tc_act/tc_skbedit.h>
@@ -363,7 +364,18 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 			skb_flow_dissector_target(f->dissector,
 						  FLOW_DISSECTOR_KEY_CONTROL,
 						  f->key);
+
+		struct flow_dissector_key_control *mask =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_CONTROL,
+						  f->mask);
 		addr_type = key->addr_type;
+
+		if (mask->flags & FLOW_DIS_IS_FRAGMENT) {
+			MLX5_SET(fte_match_set_lyr_2_4, headers_c, frag, 1);
+			MLX5_SET(fte_match_set_lyr_2_4, headers_v, frag,
+				 key->flags & FLOW_DIS_IS_FRAGMENT);
+		}
 	}
 
 	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
-- 
2.3.7

^ permalink raw reply related

* [PATCH net-next 0/3] net/sched: cls_flower: Add support for matching on dissection flags
From: Or Gerlitz @ 2016-12-06 12:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jiri Pirko, Roi Dayan, Hadar Har-Zion, Or Gerlitz

This series add the UAPI to provide set of flags for matching, where the 
flags provided from user-space are mapped to flow-dissector flags.

The 1st flag allows to match on whether the packet is an
IP fragment and corresponds to the FLOW_DIS_IS_FRAGMENT flag.

Or

Or Gerlitz (3):
  net/flow_dissector: Enable matching on flags for TC filter consumers
  net/sched: cls_flower: Add support for matching on flags
  net/mlx5e: Offload TC matching on packets being IP fragments

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 12 ++++
 include/net/flow_dissector.h                    |  2 +-
 include/uapi/linux/pkt_cls.h                    |  7 +++
 net/sched/cls_flower.c                          | 83 +++++++++++++++++++++++++
 4 files changed, 103 insertions(+), 1 deletion(-)

-- 
2.3.7

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox