Netdev List
* Re: [PATCH net-next v2 2/3] netlink: eliminate nl_sk_hash_lock
From: Thomas Graf @ 2015-01-13  0:08 UTC (permalink / raw)
  To: Ying Xue; +Cc: davem, netdev
In-Reply-To: <1421045544-13670-3-git-send-email-ying.xue@windriver.com>

On 01/12/15 at 02:52pm, Ying Xue wrote:
> As rhashtable_lookup_compare_insert() can guarantee the process
> of search and insertion is atomic, it's safe to eliminate the
> nl_sk_hash_lock. After this, object insertion or removal will
> be protected with per bucket lock on write side while object
> lookup is guarded with rcu read lock on read side.
> 
> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> Cc: Thomas Graf <tgraf@suug.ch>

LGTM now, nice!

Acked-by: Thomas Graf <tgraf@suug.ch>


* [PATCH net-next] bridge: fix uninitialized variable warning
From: roopa @ 2015-01-13  0:25 UTC (permalink / raw)
  To: netdev, davem; +Cc: tgraf, roopa

From: Roopa Prabhu <roopa@cumulusnetworks.com>

net/bridge/br_netlink.c: In function ‘br_fill_ifinfo’:
net/bridge/br_netlink.c:146:32: warning: ‘vid_range_flags’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  err = br_fill_ifvlaninfo_range(skb, vid_range_start,
                                ^
net/bridge/br_netlink.c:108:6: note: ‘vid_range_flags’ was declared here
  u16 vid_range_flags;

Reported-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 net/bridge/br_netlink.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 0b03879..66ece91 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -105,7 +105,7 @@ static int br_fill_ifvlaninfo_compressed(struct sk_buff *skb,
 					 const struct net_port_vlans *pv)
 {
 	u16 vid_range_start = 0, vid_range_end = 0;
-	u16 vid_range_flags;
+	u16 vid_range_flags = 0;
 	u16 pvid, vid, flags;
 	int err = 0;
 
@@ -142,12 +142,14 @@ initvars:
 		vid_range_flags = flags;
 	}
 
-	/* Call it once more to send any left over vlans */
-	err = br_fill_ifvlaninfo_range(skb, vid_range_start,
-				       vid_range_end,
-				       vid_range_flags);
-	if (err)
-		return err;
+	if (vid_range_start != 0) {
+		/* Call it once more to send any left over vlans */
+		err = br_fill_ifvlaninfo_range(skb, vid_range_start,
+					       vid_range_end,
+					       vid_range_flags);
+		if (err)
+			return err;
+	}
 
 	return 0;
 }
-- 
1.7.10.4


* [3.19-rc3] tg3: BUG: sleeping function called from invalid context
From: Peter Hurley @ 2015-01-13  0:59 UTC (permalink / raw)
  To: Prashant Sreedharan, Michael Chan; +Cc: netdev, Linux kernel

On 3.19-rc3, I'm seeing this might_sleep() warning [1] from the tg3_open()
call stack. Let me know if I need to bisect this.

Regards,
Peter Hurley

[1]

[   17.203009] BUG: sleeping function called from invalid context at /home/peter/src/kernels/mainline/kernel/irq/manage.c:104
[   17.203067] in_atomic(): 1, irqs_disabled(): 0, pid: 1106, name: ip
[   17.203092] 2 locks held by ip/1106:
[   17.205255]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816adf1f>] rtnetlink_rcv+0x1f/0x40
[   17.207445]  #1:  (&(&tp->lock)->rlock){+.....}, at: [<ffffffffa01073e6>] tg3_start+0xc06/0x11f0 [tg3]
[   17.209725] CPU: 2 PID: 1106 Comm: ip Not tainted 3.19.0-rc3+wip-xeon+lockdep #rc3+wip
[   17.211900] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
[   17.214086]  0000000000000068 ffff8802ac823498 ffffffff817af7e8 0000000000000005
[   17.216265]  ffffffff81a9be78 ffff8802ac8234a8 ffffffff810998a5 ffff8802ac8234d8
[   17.218446]  ffffffff8109991a ffff8802ac8234c8 ffff8802af0aae00 ffffffffa00ed000
[   17.220636] Call Trace:
[   17.222743]  [<ffffffff817af7e8>] dump_stack+0x4f/0x7b
[   17.224808]  [<ffffffff810998a5>] ___might_sleep+0x105/0x140
[   17.226842]  [<ffffffff8109991a>] __might_sleep+0x3a/0xa0
[   17.228869]  [<ffffffffa00ed000>] ? 0xffffffffa00ed000
[   17.230939]  [<ffffffff810d7d78>] synchronize_irq+0x38/0xa0
[   17.232967]  [<ffffffffa00ed000>] ? 0xffffffffa00ed000
[   17.234991]  [<ffffffffa010105f>] tg3_chip_reset+0x13f/0x9c0 [tg3]
[   17.236988]  [<ffffffffa01020ae>] tg3_reset_hw+0x7e/0x2d20 [tg3]
[   17.238996]  [<ffffffff813bfaff>] ? __udelay+0x2f/0x40
[   17.241007]  [<ffffffffa00ef2f7>] ? _tw32_flush+0x47/0x80 [tg3]
[   17.243066]  [<ffffffffa0104dac>] tg3_init_hw+0x5c/0x70 [tg3]
[   17.245438]  [<ffffffffa010740b>] tg3_start+0xc2b/0x11f0 [tg3]
[   17.247444]  [<ffffffffa0107ad7>] ? tg3_open+0x107/0x2e0 [tg3]
[   17.249556]  [<ffffffff810c338d>] ? trace_hardirqs_on+0xd/0x10
[   17.251581]  [<ffffffff8107806f>] ? __local_bh_enable_ip+0x6f/0x100
[   17.253710]  [<ffffffffa0107af8>] tg3_open+0x128/0x2e0 [tg3]
[   17.255758]  [<ffffffff816ba3f5>] ? netpoll_poll_disable+0x5/0xa0
[   17.257932]  [<ffffffff816a14af>] __dev_open+0xbf/0x140
[   17.260091]  [<ffffffff816a17c1>] __dev_change_flags+0xa1/0x160
[   17.262222]  [<ffffffff816a18a9>] dev_change_flags+0x29/0x60
[   17.264360]  [<ffffffff816b0e02>] do_setlink+0x2f2/0xa30
[   17.266431]  [<ffffffff816b1b7f>] rtnl_newlink+0x51f/0x750
[   17.268485]  [<ffffffff816b1749>] ? rtnl_newlink+0xe9/0x750
[   17.270483]  [<ffffffff811869c2>] ? free_pages_prepare+0x1d2/0x270
[   17.272507]  [<ffffffff810c32bd>] ? trace_hardirqs_on_caller+0x11d/0x1e0
[   17.274531]  [<ffffffff813dd1b2>] ? nla_parse+0x32/0x120
[   17.276531]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
[   17.278514]  [<ffffffff816adfd5>] rtnetlink_rcv_msg+0x95/0x250
[   17.280485]  [<ffffffff8109f699>] ? preempt_count_sub+0x49/0x50
[   17.282448]  [<ffffffff817b4a02>] ? mutex_lock_nested+0x382/0x530
[   17.284402]  [<ffffffff816adf1f>] ? rtnetlink_rcv+0x1f/0x40
[   17.286290]  [<ffffffff816adf1f>] ? rtnetlink_rcv+0x1f/0x40
[   17.288142]  [<ffffffff816adf40>] ? rtnetlink_rcv+0x40/0x40
[   17.290031]  [<ffffffff816cedc1>] netlink_rcv_skb+0xc1/0xe0
[   17.291836]  [<ffffffff816adf2e>] rtnetlink_rcv+0x2e/0x40
[   17.293615]  [<ffffffff816ce473>] netlink_unicast+0xf3/0x1d0
[   17.295420]  [<ffffffff816ce863>] netlink_sendmsg+0x313/0x690
[   17.297132]  [<ffffffff811ada4f>] ? might_fault+0x5f/0xb0
[   17.298799]  [<ffffffff8168253c>] do_sock_sendmsg+0x8c/0x100
[   17.300493]  [<ffffffff81681e3e>] ? copy_msghdr_from_user+0x15e/0x1f0
[   17.302173]  [<ffffffff81682aeb>] ___sys_sendmsg+0x30b/0x320
[   17.303798]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
[   17.305431]  [<ffffffff810bdee0>] ? cpuacct_account_field+0x80/0xb0
[   17.307085]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
[   17.308744]  [<ffffffff810a4f35>] ? sched_clock_local+0x25/0x90
[   17.310375]  [<ffffffff810a5dc1>] ? vtime_account_user+0x91/0xa0
[   17.311948]  [<ffffffff810a5198>] ? sched_clock_cpu+0xb8/0xe0
[   17.313509]  [<ffffffff810bf8be>] ? put_lock_stats.isra.26+0xe/0x30
[   17.315069]  [<ffffffff810c007e>] ? lock_release_holdtime.part.27+0x12e/0x1b0
[   17.316618]  [<ffffffff810a5dc1>] ? vtime_account_user+0x91/0xa0
[   17.318162]  [<ffffffff8109f5d1>] ? get_parent_ip+0x11/0x50
[   17.319703]  [<ffffffff8109f699>] ? preempt_count_sub+0x49/0x50
[   17.321235]  [<ffffffff811807e5>] ? context_tracking_user_exit+0x55/0x130
[   17.322732]  [<ffffffff811807e5>] ? context_tracking_user_exit+0x55/0x130
[   17.324197]  [<ffffffff816834f2>] __sys_sendmsg+0x42/0x80
[   17.325634]  [<ffffffff81683542>] SyS_sendmsg+0x12/0x20
[   17.327048]  [<ffffffff817ba12d>] system_call_fastpath+0x16/0x1b


* [PATCH net-next v2 0/2] net: Remote checksum offload for VXLAN
From: Tom Herbert @ 2015-01-13  1:00 UTC (permalink / raw)
  To: davem, netdev

This patch set adds support for remote checksum offload in VXLAN.

The remote checksum offload is generalized by creating a common
function (remcsum_adjust) that does the work of modifying the
checksum in remote checksum offload. This function can be called
from the normal or the GRO path. GUE was modified to use this function.
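
The body of remcsum_adjust is not shown in this thread, so the following
is only a userspace sketch of the idea as it can be read from the call
sites in patch 2/2: the receiver already holds a ones' complement sum over
the whole payload, subtracts the part that precedes the checksum start,
and writes the complemented result into the checksum field. The names
oc_add, oc_sum and demo_remcsum_adjust are made up for illustration, and
unlike the kernel function (whose return value the callers treat as a
delta to add to skb->csum) this sketch simply returns the value written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement addition of two 16-bit values (end-around carry). */
static uint16_t oc_add(uint16_t a, uint16_t b)
{
	uint32_t s = (uint32_t)a + b;

	return (uint16_t)((s & 0xffff) + (s >> 16));
}

/* Ones' complement sum over len bytes, read as big-endian 16-bit words.
 * len must be even in this sketch.
 */
static uint16_t oc_sum(const uint8_t *p, size_t len)
{
	uint16_t sum = 0;
	size_t i;

	assert(len % 2 == 0);
	for (i = 0; i < len; i += 2)
		sum = oc_add(sum, (uint16_t)((p[i] << 8) | p[i + 1]));
	return sum;
}

/* Hypothetical stand-in for the adjustment step.
 * pkt      - first byte after the encapsulation header
 * full_sum - ones' complement sum over the whole buffer, as supplied
 *            by the device or computed once in software
 * start    - where the transport checksum coverage begins (even)
 * offset   - where the 16-bit checksum field lives, relative to pkt
 */
static uint16_t demo_remcsum_adjust(uint8_t *pkt, uint16_t full_sum,
				    size_t start, size_t offset)
{
	/* Subtracting in ones' complement is adding the complement. */
	uint16_t from_start = oc_add(full_sum, (uint16_t)~oc_sum(pkt, start));
	uint16_t check = (uint16_t)~from_start;

	/* Store the derived checksum in network byte order. */
	pkt[offset] = (uint8_t)(check >> 8);
	pkt[offset + 1] = (uint8_t)(check & 0xff);
	return check;
}
```

Only the bytes before start are read again here; the sum over the rest
of the payload is reused, which is where the receive-side saving would
come from.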

To support RCO in VXLAN we use the 10th bit in the reserved
flags to indicate remote checksum offload. The start and offset
values are encoded in a compressed form in the low order (reserved)
byte of the vni field.

Remote checksum offload is described in
https://tools.ietf.org/html/draft-herbert-remotecsumoffload-01

Changes in v2:
  - Add udp_offload_callbacks which has GRO functions that take a
    udp_offload pointer argument. This argument can be used to retrieve
    a per port structure of the encapsulation for use in gro processing
    (mostly by doing container_of on the structure).
  - Use the 10th bit in VXLAN flags for RCO which does not seem to
    conflict with other proposals at this time (i.e. VXLAN-GPE and
    VXLAN-GBP)
  - Require that RCO must be explicitly enabled on the receiver
    as well as the sender.

Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).

With UDP checksums and Remote Checksum Offload
  IPv4
      Client
        11.84% CPU utilization
      Server
        12.96% CPU utilization
      9197 Mbps
  IPv6
      Client
        12.46% CPU utilization
      Server
        14.48% CPU utilization
      8963 Mbps

With UDP checksums, no remote checksum offload
  IPv4
      Client
        15.67% CPU utilization
      Server
        14.83% CPU utilization
      9094 Mbps
  IPv6
      Client
        16.21% CPU utilization
      Server
        14.32% CPU utilization
      9058 Mbps
 
No UDP checksums
  IPv4
      Client
        15.03% CPU utilization
      Server
        23.09% CPU utilization
      9089 Mbps
  IPv6
      Client
        16.18% CPU utilization
      Server
        26.57% CPU utilization
      8954 Mbps

Tom Herbert (2):
  udp: pass udp_offload struct to UDP gro callbacks
  vxlan: Remote checksum offload

 drivers/net/vxlan.c          | 198 +++++++++++++++++++++++++++++++++++++++++--
 include/linux/netdevice.h    |  15 +++-
 include/net/vxlan.h          |  11 +++
 include/uapi/linux/if_link.h |   2 +
 net/ipv4/fou.c               |  12 ++-
 net/ipv4/geneve.c            |   6 +-
 net/ipv4/udp_offload.c       |   7 +-
 7 files changed, 233 insertions(+), 18 deletions(-)

-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next v2 1/2] udp: pass udp_offload struct to UDP gro callbacks
From: Tom Herbert @ 2015-01-13  1:00 UTC (permalink / raw)
  To: davem, netdev
In-Reply-To: <1421110838-5146-1-git-send-email-therbert@google.com>

This patch introduces udp_offload_callbacks, which has the same
GRO functions (but not a GSO function) as offload_callbacks,
except that a pointer to the udp_offload struct is passed as an
additional argument to the gro_receive and gro_complete functions.
This argument can be used to retrieve the per-port structure of the
encapsulation for use in GRO processing (mostly by doing
container_of on the structure).
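
The container_of pattern mentioned above can be sketched in userspace
roughly as follows. This is an illustration only: the container_of macro
is a simplified stand-in for the kernel one, the structs are cut down to
the fields needed here, and demo_sock, demo_gro_complete and
demo_dispatch are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct udp_offload;

/* Cut-down callbacks: the offload itself is passed back to the callback. */
struct udp_offload_callbacks {
	int (*gro_complete)(int nhoff, struct udp_offload *uoff);
};

struct udp_offload {
	uint16_t port;
	struct udp_offload_callbacks callbacks;
};

/* Hypothetical per-port encapsulation state embedding the offload. */
struct demo_sock {
	uint32_t flags;
	struct udp_offload udp_offloads;
};

/* The callback recovers its enclosing per-port structure from uoff. */
static int demo_gro_complete(int nhoff, struct udp_offload *uoff)
{
	struct demo_sock *ds = container_of(uoff, struct demo_sock,
					    udp_offloads);

	(void)nhoff;
	return (int)ds->flags;
}

/* Mimics the UDP layer: hand the offload to its own callback. */
static int demo_dispatch(struct udp_offload *uoff, int nhoff)
{
	assert(uoff->callbacks.gro_complete != NULL);
	return uoff->callbacks.gro_complete(nhoff, uoff);
}
```

Because the offload struct is embedded in the per-port structure, no
lookup table or extra pointer is needed to get from one to the other.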

Signed-off-by: Tom Herbert <therbert@google.com>
---
 drivers/net/vxlan.c       |  7 +++++--
 include/linux/netdevice.h | 15 +++++++++++++--
 net/ipv4/fou.c            | 12 ++++++++----
 net/ipv4/geneve.c         |  6 ++++--
 net/ipv4/udp_offload.c    |  7 +++++--
 5 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 3a18d8e..90e2f49 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -539,7 +539,9 @@ static int vxlan_fdb_append(struct vxlan_fdb *f,
 	return 1;
 }
 
-static struct sk_buff **vxlan_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+static struct sk_buff **vxlan_gro_receive(struct sk_buff **head,
+					  struct sk_buff *skb,
+					  struct udp_offload *uoff)
 {
 	struct sk_buff *p, **pp = NULL;
 	struct vxlanhdr *vh, *vh2;
@@ -578,7 +580,8 @@ out:
 	return pp;
 }
 
-static int vxlan_gro_complete(struct sk_buff *skb, int nhoff)
+static int vxlan_gro_complete(struct sk_buff *skb, int nhoff,
+			      struct udp_offload *uoff)
 {
 	udp_tunnel_gro_complete(skb, nhoff);
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 679e6e9..47921c2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1969,7 +1969,7 @@ struct offload_callbacks {
 	struct sk_buff		*(*gso_segment)(struct sk_buff *skb,
 						netdev_features_t features);
 	struct sk_buff		**(*gro_receive)(struct sk_buff **head,
-					       struct sk_buff *skb);
+						 struct sk_buff *skb);
 	int			(*gro_complete)(struct sk_buff *skb, int nhoff);
 };
 
@@ -1979,10 +1979,21 @@ struct packet_offload {
 	struct list_head	 list;
 };
 
+struct udp_offload;
+
+struct udp_offload_callbacks {
+	struct sk_buff		**(*gro_receive)(struct sk_buff **head,
+						 struct sk_buff *skb,
+						 struct udp_offload *uoff);
+	int			(*gro_complete)(struct sk_buff *skb,
+						int nhoff,
+						struct udp_offload *uoff);
+};
+
 struct udp_offload {
 	__be16			 port;
 	u8			 ipproto;
-	struct offload_callbacks callbacks;
+	struct udp_offload_callbacks callbacks;
 };
 
 /* often modified stats are per cpu, other are shared (netdev->stats) */
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 2197c36..3bc0cf0 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -174,7 +174,8 @@ drop:
 }
 
 static struct sk_buff **fou_gro_receive(struct sk_buff **head,
-					struct sk_buff *skb)
+					struct sk_buff *skb,
+					struct udp_offload *uoff)
 {
 	const struct net_offload *ops;
 	struct sk_buff **pp = NULL;
@@ -195,7 +196,8 @@ out_unlock:
 	return pp;
 }
 
-static int fou_gro_complete(struct sk_buff *skb, int nhoff)
+static int fou_gro_complete(struct sk_buff *skb, int nhoff,
+			    struct udp_offload *uoff)
 {
 	const struct net_offload *ops;
 	u8 proto = NAPI_GRO_CB(skb)->proto;
@@ -254,7 +256,8 @@ static struct guehdr *gue_gro_remcsum(struct sk_buff *skb, unsigned int off,
 }
 
 static struct sk_buff **gue_gro_receive(struct sk_buff **head,
-					struct sk_buff *skb)
+					struct sk_buff *skb,
+					struct udp_offload *uoff)
 {
 	const struct net_offload **offloads;
 	const struct net_offload *ops;
@@ -360,7 +363,8 @@ out:
 	return pp;
 }
 
-static int gue_gro_complete(struct sk_buff *skb, int nhoff)
+static int gue_gro_complete(struct sk_buff *skb, int nhoff,
+			    struct udp_offload *uoff)
 {
 	const struct net_offload **offloads;
 	struct guehdr *guehdr = (struct guehdr *)(skb->data + nhoff);
diff --git a/net/ipv4/geneve.c b/net/ipv4/geneve.c
index 5b52046..31244bc 100644
--- a/net/ipv4/geneve.c
+++ b/net/ipv4/geneve.c
@@ -147,7 +147,8 @@ static int geneve_hlen(struct genevehdr *gh)
 }
 
 static struct sk_buff **geneve_gro_receive(struct sk_buff **head,
-					   struct sk_buff *skb)
+					   struct sk_buff *skb,
+					   struct udp_offload *uoff)
 {
 	struct sk_buff *p, **pp = NULL;
 	struct genevehdr *gh, *gh2;
@@ -211,7 +212,8 @@ out:
 	return pp;
 }
 
-static int geneve_gro_complete(struct sk_buff *skb, int nhoff)
+static int geneve_gro_complete(struct sk_buff *skb, int nhoff,
+			       struct udp_offload *uoff)
 {
 	struct genevehdr *gh;
 	struct packet_offload *ptype;
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index d3e537e..d10f6f4 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -339,7 +339,8 @@ unflush:
 	skb_gro_pull(skb, sizeof(struct udphdr)); /* pull encapsulating udp header */
 	skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
 	NAPI_GRO_CB(skb)->proto = uo_priv->offload->ipproto;
-	pp = uo_priv->offload->callbacks.gro_receive(head, skb);
+	pp = uo_priv->offload->callbacks.gro_receive(head, skb,
+						     uo_priv->offload);
 
 out_unlock:
 	rcu_read_unlock();
@@ -395,7 +396,9 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff)
 
 	if (uo_priv != NULL) {
 		NAPI_GRO_CB(skb)->proto = uo_priv->offload->ipproto;
-		err = uo_priv->offload->callbacks.gro_complete(skb, nhoff + sizeof(struct udphdr));
+		err = uo_priv->offload->callbacks.gro_complete(skb,
+				nhoff + sizeof(struct udphdr),
+				uo_priv->offload);
 	}
 
 	rcu_read_unlock();
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next v2 2/2] vxlan: Remote checksum offload
From: Tom Herbert @ 2015-01-13  1:00 UTC (permalink / raw)
  To: davem, netdev
In-Reply-To: <1421110838-5146-1-git-send-email-therbert@google.com>

Add support for remote checksum offload in VXLAN. This uses a
reserved bit to indicate that RCO is being done, and uses the low order
reserved eight bits of the VNI to hold the start and offset values in a
compressed manner.

Start is encoded in the low order seven bits of VNI. This is start >> 1
so that the checksum start offset is 0-254 using even values only.
Checksum offset (transport checksum field) is indicated in the high
order bit in the low order byte of the VNI. If the bit is set, the
checksum field is for UDP (so offset = start + 6), else checksum
field is for TCP (so offset = start + 16). Only TCP and UDP are
supported in this implementation.
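
The encoding described above can be sketched as a pair of small helpers.
The three VXLAN_RCO_* values are copied from the include/net/vxlan.h hunk
of this patch; rco_encode and rco_decode are hypothetical names, and the
two checksum field offsets are hard-coded from the UDP and TCP header
layouts rather than taken from offsetof() on the kernel structs.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in this patch. */
#define VXLAN_RCO_MASK  0x7f	/* low seven bits hold start >> 1 */
#define VXLAN_RCO_UDP   0x80	/* set: UDP checksum field, clear: TCP */
#define VXLAN_RCO_SHIFT 1

/* Checksum field offsets within the UDP and TCP headers. */
#define UDP_CHECK_OFF 6
#define TCP_CHECK_OFF 16

/* Hypothetical helper: pack start and protocol into the low VNI byte. */
static uint8_t rco_encode(unsigned int start, int is_udp)
{
	uint8_t data;

	/* Only even start values up to 254 can be represented. */
	assert((start & 1) == 0 && start <= 254);

	data = (uint8_t)((start >> VXLAN_RCO_SHIFT) & VXLAN_RCO_MASK);
	if (is_udp)
		data |= VXLAN_RCO_UDP;
	return data;
}

/* Hypothetical helper: recover start and the checksum field offset. */
static void rco_decode(uint8_t data, unsigned int *start,
		       unsigned int *offset)
{
	*start = (unsigned int)(data & VXLAN_RCO_MASK) << VXLAN_RCO_SHIFT;
	*offset = *start + ((data & VXLAN_RCO_UDP) ? UDP_CHECK_OFF
						   : TCP_CHECK_OFF);
}
```

For example, a TCP checksum start of 34 is carried as 0x11 and decodes
to start 34, offset 50; the same start for UDP is carried as 0x91 and
decodes to offset 40.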

Remote checksum offload for VXLAN is described in:

https://tools.ietf.org/html/draft-herbert-vxlan-rco-00

Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).

With UDP checksums and Remote Checksum Offload
  IPv4
      Client
        11.84% CPU utilization
      Server
        12.96% CPU utilization
      9197 Mbps
  IPv6
      Client
        12.46% CPU utilization
      Server
        14.48% CPU utilization
      8963 Mbps

With UDP checksums, no remote checksum offload
  IPv4
      Client
        15.67% CPU utilization
      Server
        14.83% CPU utilization
      9094 Mbps
  IPv6
      Client
        16.21% CPU utilization
      Server
        14.32% CPU utilization
      9058 Mbps

No UDP checksums
  IPv4
      Client
        15.03% CPU utilization
      Server
        23.09% CPU utilization
      9089 Mbps
  IPv6
      Client
        16.18% CPU utilization
      Server
        26.57% CPU utilization
      8954 Mbps

Signed-off-by: Tom Herbert <therbert@google.com>
---
 drivers/net/vxlan.c          | 191 +++++++++++++++++++++++++++++++++++++++++--
 include/net/vxlan.h          |  11 +++
 include/uapi/linux/if_link.h |   2 +
 3 files changed, 198 insertions(+), 6 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 90e2f49..2cdab2b 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -539,6 +539,46 @@ static int vxlan_fdb_append(struct vxlan_fdb *f,
 	return 1;
 }
 
+static struct vxlanhdr *vxlan_gro_remcsum(struct sk_buff *skb,
+					  unsigned int off,
+					  struct vxlanhdr *vh, size_t hdrlen,
+					  u32 data)
+{
+	size_t start, offset, plen;
+	__wsum delta;
+
+	if (skb->remcsum_offload)
+		return vh;
+
+	if (!NAPI_GRO_CB(skb)->csum_valid)
+		return NULL;
+
+	start = (data & VXLAN_RCO_MASK) << VXLAN_RCO_SHIFT;
+	offset = start + ((data & VXLAN_RCO_UDP) ?
+			  offsetof(struct udphdr, check) :
+			  offsetof(struct tcphdr, check));
+
+	plen = hdrlen + offset + sizeof(u16);
+
+	/* Pull checksum that will be written */
+	if (skb_gro_header_hard(skb, off + plen)) {
+		vh = skb_gro_header_slow(skb, off + plen, off);
+		if (!vh)
+			return NULL;
+	}
+
+	delta = remcsum_adjust((void *)vh + hdrlen,
+			       NAPI_GRO_CB(skb)->csum, start, offset);
+
+	/* Adjust skb->csum since we changed the packet */
+	skb->csum = csum_add(skb->csum, delta);
+	NAPI_GRO_CB(skb)->csum = csum_add(NAPI_GRO_CB(skb)->csum, delta);
+
+	skb->remcsum_offload = 1;
+
+	return vh;
+}
+
 static struct sk_buff **vxlan_gro_receive(struct sk_buff **head,
 					  struct sk_buff *skb,
 					  struct udp_offload *uoff)
@@ -547,6 +587,9 @@ static struct sk_buff **vxlan_gro_receive(struct sk_buff **head,
 	struct vxlanhdr *vh, *vh2;
 	unsigned int hlen, off_vx;
 	int flush = 1;
+	struct vxlan_sock *vs = container_of(uoff, struct vxlan_sock,
+					     udp_offloads);
+	u32 flags;
 
 	off_vx = skb_gro_offset(skb);
 	hlen = off_vx + sizeof(*vh);
@@ -557,6 +600,19 @@ static struct sk_buff **vxlan_gro_receive(struct sk_buff **head,
 			goto out;
 	}
 
+	skb_gro_pull(skb, sizeof(struct vxlanhdr)); /* pull vxlan header */
+	skb_gro_postpull_rcsum(skb, vh, sizeof(struct vxlanhdr));
+
+	flags = ntohl(vh->vx_flags);
+
+	if ((flags & VXLAN_HF_RCO) && (vs->flags & VXLAN_F_REMCSUM_RX)) {
+		vh = vxlan_gro_remcsum(skb, off_vx, vh, sizeof(struct vxlanhdr),
+				       ntohl(vh->vx_vni));
+
+		if (!vh)
+			goto out;
+	}
+
 	flush = 0;
 
 	for (p = *head; p; p = p->next) {
@@ -570,8 +626,6 @@ static struct sk_buff **vxlan_gro_receive(struct sk_buff **head,
 		}
 	}
 
-	skb_gro_pull(skb, sizeof(struct vxlanhdr));
-	skb_gro_postpull_rcsum(skb, vh, sizeof(struct vxlanhdr));
 	pp = eth_gro_receive(head, skb);
 
 out:
@@ -1087,6 +1141,42 @@ static void vxlan_igmp_leave(struct work_struct *work)
 	dev_put(vxlan->dev);
 }
 
+static struct vxlanhdr *vxlan_remcsum(struct sk_buff *skb, struct vxlanhdr *vh,
+				      size_t hdrlen, u32 data)
+{
+	size_t start, offset, plen;
+	__wsum delta;
+
+	if (skb->remcsum_offload) {
+		/* Already processed in GRO path */
+		skb->remcsum_offload = 0;
+		return vh;
+	}
+
+	start = (data & VXLAN_RCO_MASK) << VXLAN_RCO_SHIFT;
+	offset = start + ((data & VXLAN_RCO_UDP) ?
+			  offsetof(struct udphdr, check) :
+			  offsetof(struct tcphdr, check));
+
+	plen = hdrlen + offset + sizeof(u16);
+
+	if (!pskb_may_pull(skb, plen))
+		return NULL;
+
+	vh = (struct vxlanhdr *)(udp_hdr(skb) + 1);
+
+	if (unlikely(skb->ip_summed != CHECKSUM_COMPLETE))
+		__skb_checksum_complete(skb);
+
+	delta = remcsum_adjust((void *)vh + hdrlen,
+			       skb->csum, start, offset);
+
+	/* Adjust skb->csum since we changed the packet */
+	skb->csum = csum_add(skb->csum, delta);
+
+	return vh;
+}
+
 /* Callback from net/ipv4/udp.c to receive packets */
 static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 {
@@ -1111,12 +1201,22 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 
 	if (iptunnel_pull_header(skb, VXLAN_HLEN, htons(ETH_P_TEB)))
 		goto drop;
+	vxh = (struct vxlanhdr *)(udp_hdr(skb) + 1);
 
 	vs = rcu_dereference_sk_user_data(sk);
 	if (!vs)
 		goto drop;
 
-	if (flags || (vni & 0xff)) {
+	if ((flags & VXLAN_HF_RCO) && (vs->flags & VXLAN_F_REMCSUM_RX)) {
+		vxh = vxlan_remcsum(skb, vxh, sizeof(struct vxlanhdr), vni);
+		if (!vxh)
+			goto drop;
+
+		flags &= ~VXLAN_HF_RCO;
+		vni &= VXLAN_VID_MASK;
+	}
+
+	if (flags || (vni & ~VXLAN_VID_MASK)) {
 		/* If there are any unprocessed flags remaining treat
 		 * this as a malformed packet. This behavior diverges from
 		 * VXLAN RFC (RFC7348) which stipulates that bits in reserved
@@ -1553,8 +1653,23 @@ static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 	int min_headroom;
 	int err;
 	bool udp_sum = !udp_get_no_check6_tx(vs->sock->sk);
+	int type = udp_sum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+	u16 hdrlen = sizeof(struct vxlanhdr);
+
+	if ((vs->flags & VXLAN_F_REMCSUM_TX) &&
+	    skb->ip_summed == CHECKSUM_PARTIAL) {
+		int csum_start = skb_checksum_start_offset(skb);
+
+		if (csum_start <= VXLAN_MAX_REMCSUM_START &&
+		    !(csum_start & VXLAN_RCO_SHIFT_MASK) &&
+		    (skb->csum_offset == offsetof(struct udphdr, check) ||
+		     skb->csum_offset == offsetof(struct tcphdr, check))) {
+			udp_sum = false;
+			type |= SKB_GSO_TUNNEL_REMCSUM;
+		}
+	}
 
-	skb = udp_tunnel_handle_offloads(skb, udp_sum);
+	skb = iptunnel_handle_offloads(skb, udp_sum, type);
 	if (IS_ERR(skb)) {
 		err = -EINVAL;
 		goto err;
@@ -1583,6 +1698,22 @@ static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 	vxh->vx_flags = htonl(VXLAN_HF_VNI);
 	vxh->vx_vni = vni;
 
+	if (type & SKB_GSO_TUNNEL_REMCSUM) {
+		u32 data = (skb_checksum_start_offset(skb) - hdrlen) >>
+			   VXLAN_RCO_SHIFT;
+
+		if (skb->csum_offset == offsetof(struct udphdr, check))
+			data |= VXLAN_RCO_UDP;
+
+		vxh->vx_vni |= htonl(data);
+		vxh->vx_flags |= htonl(VXLAN_HF_RCO);
+
+		if (!skb_is_gso(skb)) {
+			skb->ip_summed = CHECKSUM_NONE;
+			skb->encapsulation = 0;
+		}
+	}
+
 	skb_set_inner_protocol(skb, htons(ETH_P_TEB));
 
 	udp_tunnel6_xmit_skb(vs->sock, dst, skb, dev, saddr, daddr, prio,
@@ -1603,8 +1734,23 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
 	int min_headroom;
 	int err;
 	bool udp_sum = !vs->sock->sk->sk_no_check_tx;
+	int type = udp_sum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+	u16 hdrlen = sizeof(struct vxlanhdr);
+
+	if ((vs->flags & VXLAN_F_REMCSUM_TX) &&
+	    skb->ip_summed == CHECKSUM_PARTIAL) {
+		int csum_start = skb_checksum_start_offset(skb);
+
+		if (csum_start <= VXLAN_MAX_REMCSUM_START &&
+		    !(csum_start & VXLAN_RCO_SHIFT_MASK) &&
+		    (skb->csum_offset == offsetof(struct udphdr, check) ||
+		     skb->csum_offset == offsetof(struct tcphdr, check))) {
+			udp_sum = false;
+			type |= SKB_GSO_TUNNEL_REMCSUM;
+		}
+	}
 
-	skb = udp_tunnel_handle_offloads(skb, udp_sum);
+	skb = iptunnel_handle_offloads(skb, udp_sum, type);
 	if (IS_ERR(skb))
 		return PTR_ERR(skb);
 
@@ -1627,6 +1773,22 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
 	vxh->vx_flags = htonl(VXLAN_HF_VNI);
 	vxh->vx_vni = vni;
 
+	if (type & SKB_GSO_TUNNEL_REMCSUM) {
+		u32 data = (skb_checksum_start_offset(skb) - hdrlen) >>
+			   VXLAN_RCO_SHIFT;
+
+		if (skb->csum_offset == offsetof(struct udphdr, check))
+			data |= VXLAN_RCO_UDP;
+
+		vxh->vx_vni |= htonl(data);
+		vxh->vx_flags |= htonl(VXLAN_HF_RCO);
+
+		if (!skb_is_gso(skb)) {
+			skb->ip_summed = CHECKSUM_NONE;
+			skb->encapsulation = 0;
+		}
+	}
+
 	skb_set_inner_protocol(skb, htons(ETH_P_TEB));
 
 	return udp_tunnel_xmit_skb(vs->sock, rt, skb, src, dst, tos,
@@ -2218,6 +2380,8 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
 	[IFLA_VXLAN_UDP_CSUM]	= { .type = NLA_U8 },
 	[IFLA_VXLAN_UDP_ZERO_CSUM6_TX]	= { .type = NLA_U8 },
 	[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]	= { .type = NLA_U8 },
+	[IFLA_VXLAN_REMCSUM_TX]	= { .type = NLA_U8 },
+	[IFLA_VXLAN_REMCSUM_RX]	= { .type = NLA_U8 },
 };
 
 static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
@@ -2339,6 +2503,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
 	atomic_set(&vs->refcnt, 1);
 	vs->rcv = rcv;
 	vs->data = data;
+	vs->flags = flags;
 
 	/* Initialize the vxlan udp offloads structure */
 	vs->udp_offloads.port = port;
@@ -2533,6 +2698,14 @@ static int vxlan_newlink(struct net *net, struct net_device *dev,
 	    nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
 		vxlan->flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
 
+	if (data[IFLA_VXLAN_REMCSUM_TX] &&
+	    nla_get_u8(data[IFLA_VXLAN_REMCSUM_TX]))
+		vxlan->flags |= VXLAN_F_REMCSUM_TX;
+
+	if (data[IFLA_VXLAN_REMCSUM_RX] &&
+	    nla_get_u8(data[IFLA_VXLAN_REMCSUM_RX]))
+		vxlan->flags |= VXLAN_F_REMCSUM_RX;
+
 	if (vxlan_find_vni(net, vni, use_ipv6 ? AF_INET6 : AF_INET,
 			   vxlan->dst_port)) {
 		pr_info("duplicate VNI %u\n", vni);
@@ -2601,6 +2774,8 @@ static size_t vxlan_get_size(const struct net_device *dev)
 		nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_UDP_CSUM */
 		nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_UDP_ZERO_CSUM6_TX */
 		nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_UDP_ZERO_CSUM6_RX */
+		nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_REMCSUM_TX */
+		nla_total_size(sizeof(__u8)) + /* IFLA_VXLAN_REMCSUM_RX */
 		0;
 }
 
@@ -2666,7 +2841,11 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	    nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
 			!!(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM6_TX)) ||
 	    nla_put_u8(skb, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
-			!!(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM6_RX)))
+			!!(vxlan->flags & VXLAN_F_UDP_ZERO_CSUM6_RX)) ||
+	    nla_put_u8(skb, IFLA_VXLAN_REMCSUM_TX,
+			!!(vxlan->flags & VXLAN_F_REMCSUM_TX)) ||
+	    nla_put_u8(skb, IFLA_VXLAN_REMCSUM_RX,
+			!!(vxlan->flags & VXLAN_F_REMCSUM_RX)))
 		goto nla_put_failure;
 
 	if (nla_put(skb, IFLA_VXLAN_PORT_RANGE, sizeof(ports), &ports))
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index a0d8073..0a7443b 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -19,6 +19,14 @@ struct vxlanhdr {
 
 /* VXLAN header flags. */
 #define VXLAN_HF_VNI 0x08000000
+#define VXLAN_HF_RCO 0x00200000
+
+/* Remote checksum offload header option */
+#define VXLAN_RCO_MASK  0x7f    /* Low seven bits of last vni byte */
+#define VXLAN_RCO_UDP   0x80    /* Indicate UDP RCO (TCP when not set) */
+#define VXLAN_RCO_SHIFT 1       /* Left shift of start */
+#define VXLAN_RCO_SHIFT_MASK ((1 << VXLAN_RCO_SHIFT) - 1)
+#define VXLAN_MAX_REMCSUM_START (VXLAN_RCO_MASK << VXLAN_RCO_SHIFT)
 
 #define VXLAN_N_VID     (1u << 24)
 #define VXLAN_VID_MASK  (VXLAN_N_VID - 1)
@@ -38,6 +46,7 @@ struct vxlan_sock {
 	struct hlist_head vni_list[VNI_HASH_SIZE];
 	atomic_t	  refcnt;
 	struct udp_offload udp_offloads;
+	u32		  flags;
 };
 
 #define VXLAN_F_LEARN			0x01
@@ -49,6 +58,8 @@ struct vxlan_sock {
 #define VXLAN_F_UDP_CSUM		0x40
 #define VXLAN_F_UDP_ZERO_CSUM6_TX	0x80
 #define VXLAN_F_UDP_ZERO_CSUM6_RX	0x100
+#define VXLAN_F_REMCSUM_TX		0x200
+#define VXLAN_F_REMCSUM_RX		0x400
 
 struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
 				  vxlan_rcv_t *rcv, void *data,
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index f7d0d2d..b2723f6 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -370,6 +370,8 @@ enum {
 	IFLA_VXLAN_UDP_CSUM,
 	IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
 	IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
+	IFLA_VXLAN_REMCSUM_TX,
+	IFLA_VXLAN_REMCSUM_RX,
 	__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
-- 
2.2.0.rc0.207.ga3a616c


* Re: [PATCH 6/6] openvswitch: Support VXLAN Group Policy extension
From: Thomas Graf @ 2015-01-13  1:02 UTC (permalink / raw)
  To: Jesse Gross
  Cc: David Miller, Stephen Hemminger, Pravin Shelar, Tom Herbert,
	Alexei Starovoitov, dev@openvswitch.org, netdev
In-Reply-To: <CAEP_g=8TnrwdGiTOB_mKeQSsEakV8y2OU7_ARwi+j9WDYh=Wag@mail.gmail.com>

On 01/12/15 at 01:54pm, Jesse Gross wrote:
> On Mon, Jan 12, 2015 at 4:26 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > +       if (tb[OVS_VXLAN_EXT_MAX])
> > +               opts.gbp = nla_get_u32(tb[OVS_VXLAN_EXT_MAX]);
> 
> Shouldn't this be OVS_VXLAN_EXT_GBP instead of OVS_VXLAN_EXT_MAX?
> (They have the same value.)

Good catch, thanks!

> > +       if (!is_mask)
> > +               SW_FLOW_KEY_PUT(match, tun_opts_len, sizeof(opts), false);
> > +       else
> > +               SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
> 
> Have you thought carefully about how the masking model works as other
> extensions are potentially added? This was a little tricky with Geneve
> because I wanted to be able to match on both "no options present" as
> well as wildcard all options. The other interesting thing is how you
> serialize them back correctly to userspace, which was the genesis of
> the TUNNEL_OPTIONS_PRESENT flag.
> 
> My guess is that this may basically work fine now that there is only
> one extension present but it is important to think about how it might
> work with multiple independent extensions in the future. (I haven't
> thought about it, I'm just asking.)

I currently don't see a reason why adding another extension would be
a problem. It should work like Geneve options except that the order
of the options in the flow is given (struct vxlan_opts).

Matching on "no options present" is supported in the datapath via
the TUNNEL_VXLAN_OPT flag, although there is no way in user space
to express this intent yet. I haven't come across a need to support it
yet.

Since the Netlink API is decoupled from the datapath flow
representation, all of this can be changed if needed without breaking
the Netlink ABI.

> If you set Geneve options and output to a VXLAN port (or vice versa),
> you will get garbage, right? Is there any way that we can sanity check
> that?

What about if we only apply tun_info->options on Geneve if
TUNNEL_GENEVE_OPT is set and vice versa?
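
A minimal sketch of that check, under stated assumptions: the two flag
names come from this thread, but their bit values here are placeholders,
and demo_tun_info and demo_opts_for are hypothetical, cut-down stand-ins
rather than the real datapath structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag names from the thread; the bit values are placeholders. */
#define TUNNEL_GENEVE_OPT 0x1
#define TUNNEL_VXLAN_OPT  0x2

/* Hypothetical, cut-down tunnel metadata. */
struct demo_tun_info {
	uint16_t tun_flags;
	const void *options;
	size_t options_len;
};

/* Return the options only if they were written for this tunnel type;
 * otherwise report "no options" so they are not misinterpreted.
 */
static const void *demo_opts_for(const struct demo_tun_info *info,
				 uint16_t wanted_flag, size_t *len)
{
	assert(info != NULL && len != NULL);

	if (!(info->tun_flags & wanted_flag)) {
		*len = 0;
		return NULL;
	}
	*len = info->options_len;
	return info->options;
}
```

A Geneve transmit path would call this with TUNNEL_GENEVE_OPT and a
VXLAN one with TUNNEL_VXLAN_OPT, so options set for the other tunnel
type would be dropped instead of being emitted as garbage.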


* Re: [PATCH 2/6] vxlan: Group Policy extension
From: Thomas Graf @ 2015-01-13  1:03 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Jesse Gross, Stephen Hemminger, Pravin B Shelar,
	Alexei Starovoitov, Linux Netdev List, dev@openvswitch.org
In-Reply-To: <CA+mtBx8SWiwjBfxx05omur1gDeF9QcKtjDKJQ+XdUgi_U8LWig@mail.gmail.com>

On 01/12/15 at 10:14am, Tom Herbert wrote:
> > diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> > index f7d0d2d..9f07bf5 100644
> > --- a/include/uapi/linux/if_link.h
> > +++ b/include/uapi/linux/if_link.h
> > @@ -370,10 +370,18 @@ enum {
> >         IFLA_VXLAN_UDP_CSUM,
> >         IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
> >         IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
> > +       IFLA_VXLAN_EXTENSION,
> >         __IFLA_VXLAN_MAX
> >  };
> >  #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
> >
> > +enum {
> > +       IFLA_VXLAN_EXT_UNSPEC,
> > +       IFLA_VXLAN_EXT_GBP,
> > +       __IFLA_VXLAN_EXT_MAX,
> > +};
> > +#define IFLA_VXLAN_EXT_MAX (__IFLA_VXLAN_EXT_MAX - 1)
> > +
> 
> Creating a level of indirection for extensions seems overly
> complicated to me. Why not just define IFLA_VXLAN_GBP as just another
> enum above?

I think it's cleaner to group them in a nested attribute.
It clearly separates the optional extensions from the base
attributes. RCO, GPE, GBP can all live in there.
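
To illustrate the layout I have in mind, here is a self-contained
sketch that builds the nested attribute by hand. The header struct is
only modelled on struct nlattr, the SKETCH_* numbering is made up, and
GBP is assumed to be a flag without payload; real code would use the
kernel's nla_nest_start()/nla_nest_end() helpers instead:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified attribute header, modelled on struct nlattr. */
struct attr_hdr {
	uint16_t len;	/* header plus payload, without padding */
	uint16_t type;
};

#define ATTR_ALIGN(n)	(((n) + 3u) & ~3u)
#define ATTR_HDRLEN	((uint16_t)sizeof(struct attr_hdr))

/* Illustrative numbering only; the real values come from if_link.h. */
enum { SKETCH_VXLAN_EXTENSION = 1 };
enum { SKETCH_EXT_UNSPEC, SKETCH_EXT_GBP };

/* Append one attribute at offset off, return the new aligned offset. */
static size_t put_attr(uint8_t *buf, size_t cap, size_t off,
		       uint16_t type, const void *data, uint16_t dlen)
{
	struct attr_hdr h = { (uint16_t)(ATTR_HDRLEN + dlen), type };

	assert(off + ATTR_ALIGN(h.len) <= cap);
	memcpy(buf + off, &h, sizeof(h));
	if (dlen)
		memcpy(buf + off + ATTR_HDRLEN, data, dlen);
	return off + ATTR_ALIGN(h.len);
}

/* Build EXTENSION { GBP (flag, no payload) } and return total length. */
static size_t build_vxlan_ext(uint8_t *buf, size_t cap)
{
	size_t start = 0, end;
	struct attr_hdr outer;

	/* Skip the outer header for now and emit the nested attribute. */
	end = put_attr(buf, cap, start + ATTR_HDRLEN, SKETCH_EXT_GBP, NULL, 0);

	/* Close the nest: the outer length covers everything inside. */
	outer.len = (uint16_t)(end - start);
	outer.type = SKETCH_VXLAN_EXTENSION;
	memcpy(buf + start, &outer, sizeof(outer));
	return end;
}
```

Adding RCO or GPE later would then only mean another put_attr() call
inside the nest, without touching the base attribute enum.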

^ permalink raw reply

* Re: [PATCH 2/6] vxlan: Group Policy extension
From: Thomas Graf @ 2015-01-13  1:04 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov,
	netdev, dev
In-Reply-To: <54B40661.9020408@6wind.com>

On 01/12/15 at 06:37pm, Nicolas Dichtel wrote:
> >+	if (data[IFLA_VXLAN_EXTENSION])
> >+		configure_vxlan_exts(vxlan, data[IFLA_VXLAN_EXTENSION]);
> >+
> Can you also update vxlan_fill_info() so that these new attributes can be
> dumped via netlink?

Sure, will do.

^ permalink raw reply

* Re: [PATCH net-next v2 2/2] vxlan: Remote checksum offload
From: Thomas Graf @ 2015-01-13  1:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <1421110838-5146-3-git-send-email-therbert@google.com>

On 01/12/15 at 05:00pm, Tom Herbert wrote:
> +	if ((flags & VXLAN_HF_RCO) && (vs->flags & VXLAN_F_REMCSUM_RX)) {
> +		vxh = vxlan_remcsum(skb, vxh, sizeof(struct vxlanhdr), vni);
> +		if (!vxh)
> +			goto drop;
> +
> +		flags &= ~VXLAN_HF_RCO;
> +		vni &= VXLAN_VID_MASK;
> +	}

Nice.

Would you mind basing this on top of the extension framework being put
in place by GBP? I think that all VXLAN extensions should be exposed as
such in a universal way to user space.

^ permalink raw reply

* Re: [PATCH 2/6] vxlan: Group Policy extension
From: Tom Herbert @ 2015-01-13  2:28 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, Jesse Gross, Stephen Hemminger, Pravin B Shelar,
	Alexei Starovoitov, Linux Netdev List, dev@openvswitch.org
In-Reply-To: <20150113010357.GB20387@casper.infradead.org>

On Mon, Jan 12, 2015 at 5:03 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 01/12/15 at 10:14am, Tom Herbert wrote:
>> > diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>> > index f7d0d2d..9f07bf5 100644
>> > --- a/include/uapi/linux/if_link.h
>> > +++ b/include/uapi/linux/if_link.h
>> > @@ -370,10 +370,18 @@ enum {
>> >         IFLA_VXLAN_UDP_CSUM,
>> >         IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
>> >         IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
>> > +       IFLA_VXLAN_EXTENSION,
>> >         __IFLA_VXLAN_MAX
>> >  };
>> >  #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
>> >
>> > +enum {
>> > +       IFLA_VXLAN_EXT_UNSPEC,
>> > +       IFLA_VXLAN_EXT_GBP,
>> > +       __IFLA_VXLAN_EXT_MAX,
>> > +};
>> > +#define IFLA_VXLAN_EXT_MAX (__IFLA_VXLAN_EXT_MAX - 1)
>> > +
>>
>> Creating a level of indirection for extensions seems overly
>> complicated to me. Why not just define IFLA_VXLAN_GBP as just another
>> enum above?
>
> I think it's cleaner to group them in a nested attribute.
> It clearly separates the optional extensions from the base
> attributes. RCO, GPE, GBP can all live in there.

This is inconsistent with similar things in GRE and GUE. For instance,
GRE keyid is set as its own attribute. It just seems like this is adding
more code to the driver than is necessary for the functionality
needed.

^ permalink raw reply

* Re: [3.19-rc3] tg3: BUG: sleeping function called from invalid context
From: Prashant Sreedharan @ 2015-01-13  2:30 UTC (permalink / raw)
  To: Peter Hurley; +Cc: Michael Chan, netdev, Linux kernel
In-Reply-To: <54B46DD5.9050802@hurleysoftware.com>

On Mon, 2015-01-12 at 19:59 -0500, Peter Hurley wrote:
> On 3.19-rc3, I'm seeing this might_sleep() warning [1] from the tg3_open()
> call stack. Let me know if I need to bisect this.
> 
> Regards,
> Peter Hurley
> 
> [1]
> 
> [   17.203009] BUG: sleeping function called from invalid context at /home/peter/src/kernels/mainline/kernel/irq/manage.c:104
> [   17.203067] in_atomic(): 1, irqs_disabled(): 0, pid: 1106, name: ip
> [   17.203092] 2 locks held by ip/1106:
> [   17.205255]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816adf1f>] rtnetlink_rcv+0x1f/0x40
> [   17.207445]  #1:  (&(&tp->lock)->rlock){+.....}, at: [<ffffffffa01073e6>] tg3_start+0xc06/0x11f0 [tg3]
> [   17.209725] CPU: 2 PID: 1106 Comm: ip Not tainted 3.19.0-rc3+wip-xeon+lockdep #rc3+wip
> [   17.211900] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
> [   17.214086]  0000000000000068 ffff8802ac823498 ffffffff817af7e8 0000000000000005
> [   17.216265]  ffffffff81a9be78 ffff8802ac8234a8 ffffffff810998a5 ffff8802ac8234d8
> [   17.218446]  ffffffff8109991a ffff8802ac8234c8 ffff8802af0aae00 ffffffffa00ed000
> [   17.220636] Call Trace:
> [   17.222743]  [<ffffffff817af7e8>] dump_stack+0x4f/0x7b
> [   17.224808]  [<ffffffff810998a5>] ___might_sleep+0x105/0x140
> [   17.226842]  [<ffffffff8109991a>] __might_sleep+0x3a/0xa0
> [   17.228869]  [<ffffffffa00ed000>] ? 0xffffffffa00ed000
> [   17.230939]  [<ffffffff810d7d78>] synchronize_irq+0x38/0xa0
> [   17.232967]  [<ffffffffa00ed000>] ? 0xffffffffa00ed000
> [   17.234991]  [<ffffffffa010105f>] tg3_chip_reset+0x13f/0x9c0 [tg3]
> [   17.236988]  [<ffffffffa01020ae>] tg3_reset_hw+0x7e/0x2d20 [tg3]
> [   17.238996]  [<ffffffff813bfaff>] ? __udelay+0x2f/0x40
> [   17.241007]  [<ffffffffa00ef2f7>] ? _tw32_flush+0x47/0x80 [tg3]
> [   17.243066]  [<ffffffffa0104dac>] tg3_init_hw+0x5c/0x70 [tg3]
> [   17.245438]  [<ffffffffa010740b>] tg3_start+0xc2b/0x11f0 [tg3]
> [   17.247444]  [<ffffffffa0107ad7>] ? tg3_open+0x107/0x2e0 [tg3]
> [   17.249556]  [<ffffffff810c338d>] ? trace_hardirqs_on+0xd/0x10
> [   17.251581]  [<ffffffff8107806f>] ? __local_bh_enable_ip+0x6f/0x100
> [   17.253710]  [<ffffffffa0107af8>] tg3_open+0x128/0x2e0 [tg3]
> [   17.255758]  [<ffffffff816ba3f5>] ? netpoll_poll_disable+0x5/0xa0
> [   17.257932]  [<ffffffff816a14af>] __dev_open+0xbf/0x140
> [   17.260091]  [<ffffffff816a17c1>] __dev_change_flags+0xa1/0x160
> [   17.262222]  [<ffffffff816a18a9>] dev_change_flags+0x29/0x60
> [   17.264360]  [<ffffffff816b0e02>] do_setlink+0x2f2/0xa30
> [   17.266431]  [<ffffffff816b1b7f>] rtnl_newlink+0x51f/0x750
> [   17.268485]  [<ffffffff816b1749>] ? rtnl_newlink+0xe9/0x750
> [   17.270483]  [<ffffffff811869c2>] ? free_pages_prepare+0x1d2/0x270
> [   17.272507]  [<ffffffff810c32bd>] ? trace_hardirqs_on_caller+0x11d/0x1e0
> [   17.274531]  [<ffffffff813dd1b2>] ? nla_parse+0x32/0x120
> [   17.276531]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
> [   17.278514]  [<ffffffff816adfd5>] rtnetlink_rcv_msg+0x95/0x250
> [   17.280485]  [<ffffffff8109f699>] ? preempt_count_sub+0x49/0x50
> [   17.282448]  [<ffffffff817b4a02>] ? mutex_lock_nested+0x382/0x530
> [   17.284402]  [<ffffffff816adf1f>] ? rtnetlink_rcv+0x1f/0x40
> [   17.286290]  [<ffffffff816adf1f>] ? rtnetlink_rcv+0x1f/0x40
> [   17.288142]  [<ffffffff816adf40>] ? rtnetlink_rcv+0x40/0x40
> [   17.290031]  [<ffffffff816cedc1>] netlink_rcv_skb+0xc1/0xe0
> [   17.291836]  [<ffffffff816adf2e>] rtnetlink_rcv+0x2e/0x40
> [   17.293615]  [<ffffffff816ce473>] netlink_unicast+0xf3/0x1d0
> [   17.295420]  [<ffffffff816ce863>] netlink_sendmsg+0x313/0x690
> [   17.297132]  [<ffffffff811ada4f>] ? might_fault+0x5f/0xb0
> [   17.298799]  [<ffffffff8168253c>] do_sock_sendmsg+0x8c/0x100
> [   17.300493]  [<ffffffff81681e3e>] ? copy_msghdr_from_user+0x15e/0x1f0
> [   17.302173]  [<ffffffff81682aeb>] ___sys_sendmsg+0x30b/0x320
> [   17.303798]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
> [   17.305431]  [<ffffffff810bdee0>] ? cpuacct_account_field+0x80/0xb0
> [   17.307085]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
> [   17.308744]  [<ffffffff810a4f35>] ? sched_clock_local+0x25/0x90
> [   17.310375]  [<ffffffff810a5dc1>] ? vtime_account_user+0x91/0xa0
> [   17.311948]  [<ffffffff810a5198>] ? sched_clock_cpu+0xb8/0xe0
> [   17.313509]  [<ffffffff810bf8be>] ? put_lock_stats.isra.26+0xe/0x30
> [   17.315069]  [<ffffffff810c007e>] ? lock_release_holdtime.part.27+0x12e/0x1b0
> [   17.316618]  [<ffffffff810a5dc1>] ? vtime_account_user+0x91/0xa0
> [   17.318162]  [<ffffffff8109f5d1>] ? get_parent_ip+0x11/0x50
> [   17.319703]  [<ffffffff8109f699>] ? preempt_count_sub+0x49/0x50
> [   17.321235]  [<ffffffff811807e5>] ? context_tracking_user_exit+0x55/0x130
> [   17.322732]  [<ffffffff811807e5>] ? context_tracking_user_exit+0x55/0x130
> [   17.324197]  [<ffffffff816834f2>] __sys_sendmsg+0x42/0x80
> [   17.325634]  [<ffffffff81683542>] SyS_sendmsg+0x12/0x20
> [   17.327048]  [<ffffffff817ba12d>] system_call_fastpath+0x16/0x1b

Please bisect; there haven't been any tg3 code changes in this path that
might cause this. It would help to know which commit is triggering
the problem. Also, could you provide the device details? In syslog,
look for "Tigon3 [partno(BCMxxxxx) rev xxxxxxx]". Thanks.

^ permalink raw reply

* [PATCHv2 net-next] openvswitch: Introduce ovs_tunnel_route_lookup
From: Fan Du @ 2015-01-13  2:41 UTC (permalink / raw)
  To: pshelar; +Cc: dev, netdev, fengyuleidian0615
In-Reply-To: <1421054087-25632-1-git-send-email-fan.du@intel.com>

Introduce ovs_tunnel_route_lookup to consolidate route lookup
shared by vxlan, gre, and geneve ports.

Signed-off-by: Fan Du <fan.du@intel.com>
---
Change log:
v2:
  - Use inline instead of function call
  - Rename vport_route_lookup to ovs_tunnel_route_lookup
---
 net/openvswitch/vport-geneve.c |   11 +----------
 net/openvswitch/vport-gre.c    |   10 +---------
 net/openvswitch/vport-vxlan.c  |   10 +---------
 net/openvswitch/vport.h        |   18 ++++++++++++++++++
 4 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
index 484864d..0953c9f 100644
--- a/net/openvswitch/vport-geneve.c
+++ b/net/openvswitch/vport-geneve.c
@@ -191,16 +191,7 @@ static int geneve_tnl_send(struct vport *vport, struct sk_buff *skb)
 	}
 
 	tun_key = &tun_info->tunnel;
-
-	/* Route lookup */
-	memset(&fl, 0, sizeof(fl));
-	fl.daddr = tun_key->ipv4_dst;
-	fl.saddr = tun_key->ipv4_src;
-	fl.flowi4_tos = RT_TOS(tun_key->ipv4_tos);
-	fl.flowi4_mark = skb->mark;
-	fl.flowi4_proto = IPPROTO_UDP;
-
-	rt = ip_route_output_key(net, &fl);
+	rt = ovs_tunnel_route_lookup(net, tun_key, skb, &fl, IPPROTO_UDP);
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		goto error;
diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c
index d4168c4..3171e03 100644
--- a/net/openvswitch/vport-gre.c
+++ b/net/openvswitch/vport-gre.c
@@ -148,15 +148,7 @@ static int gre_tnl_send(struct vport *vport, struct sk_buff *skb)
 	}
 
 	tun_key = &OVS_CB(skb)->egress_tun_info->tunnel;
-	/* Route lookup */
-	memset(&fl, 0, sizeof(fl));
-	fl.daddr = tun_key->ipv4_dst;
-	fl.saddr = tun_key->ipv4_src;
-	fl.flowi4_tos = RT_TOS(tun_key->ipv4_tos);
-	fl.flowi4_mark = skb->mark;
-	fl.flowi4_proto = IPPROTO_GRE;
-
-	rt = ip_route_output_key(net, &fl);
+	rt = ovs_tunnel_route_lookup(net, tun_key, skb, &fl, IPPROTO_GRE);
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		goto err_free_skb;
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index d7c46b3..1528090 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -158,15 +158,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 	}
 
 	tun_key = &OVS_CB(skb)->egress_tun_info->tunnel;
-	/* Route lookup */
-	memset(&fl, 0, sizeof(fl));
-	fl.daddr = tun_key->ipv4_dst;
-	fl.saddr = tun_key->ipv4_src;
-	fl.flowi4_tos = RT_TOS(tun_key->ipv4_tos);
-	fl.flowi4_mark = skb->mark;
-	fl.flowi4_proto = IPPROTO_UDP;
-
-	rt = ip_route_output_key(net, &fl);
+	rt = ovs_tunnel_route_lookup(net, tun_key, skb, &fl, IPPROTO_UDP);
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		goto error;
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 99c8e71..7b97661 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -236,4 +236,22 @@ static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb,
 int ovs_vport_ops_register(struct vport_ops *ops);
 void ovs_vport_ops_unregister(struct vport_ops *ops);
 
+static inline struct rtable *ovs_tunnel_route_lookup(struct net *net,
+						     struct ovs_key_ipv4_tunnel *key,
+						     struct sk_buff *skb,
+						     struct flowi4 *fl,
+						     u8 protocol)
+{
+	struct rtable *rt;
+
+	memset(fl, 0, sizeof(*fl));
+	fl->daddr = key->ipv4_dst;
+	fl->saddr = key->ipv4_src;
+	fl->flowi4_tos = RT_TOS(key->ipv4_tos);
+	fl->flowi4_mark = skb->mark;
+	fl->flowi4_proto = protocol;
+
+	rt = ip_route_output_key(net, fl);
+	return rt;
+}
 #endif /* vport.h */
-- 
1.7.1

^ permalink raw reply related

* NetDev 0.1 new proposals accepted update
From: Richard Guy Briggs @ 2015-01-13  2:51 UTC (permalink / raw)
  To: netdev, linux-wireless, lwn, netdev01, lartc, netfilter,
	netfilter-devel

Fellow netheads:

Here is an update on new proposals accepted this past week for NetDev 0.1 that
you may have missed if you aren't following the RSS feed or twitter:


Accepted talks: (All listed: https://www.netdev01.org/sessions )
===============
Offloading to yet another software switch
Michio Honda
https://www.netdev01.org/sessions/13

Implementing Open vSwitch datapath using TC
Jiri Pirko
https://www.netdev01.org/sessions/14

MPTCP Upstreaming
Doru Gucea
https://www.netdev01.org/sessions/16


Accepted workshops:
===================
TCP stack instrumentation BoF
Chris Rapier
https://www.netdev01.org/sessions/17

802.1ad HW acceleration and MTU handling
Toshiaki Makita
https://www.netdev01.org/sessions/18


Accepted Tutorials:
===================
Tutorial on perf Usage
Hannes Frederic Sowa
https://www.netdev01.org/sessions/12

BPF In-kernel Virtual Machine
Alexei Starovoitov
https://www.netdev01.org/sessions/15


All of the accepted proposals are new work.  A couple of proposals have been
returned for rework prior to acceptance.  There are still more excellent
proposals currently making their way through the technical committee vetting
process.  The committee has so far been very impressed with the quality of
proposals submitted.


A reminder that the Westin Hotel is holding a block of rooms for Netdev01 at a
guaranteed rate of $159.00 or $179.00 (depending on the type of room required)
and the rooms are going fast due to Winterlude bookings.  That guarantee
expires on January 23, so book now to avoid disappointment and get the low
rate. The rooms are going faster than we expected!
Reservations: https://www.starwoodmeeting.com/StarGroupsWeb/res?id=1412035802&key=1AC9C1F8


Registration https://onlineregistrations.ca/netdev01/
$100/day, or $350 for 4 days (Cdn dollars). (online reg closes Feb 12th)
Registering helps us plan properly for numbers of attendees,
ensuring venue sizes and supplies are appropriate without
wasting resources.


NetDev 0.1 would like to gratefully acknowledge our sponsors: https://netdev01.org/sponsors
Verizon http://www.verizon.com/
Cumulus Networks http://cumulusnetworks.com/ 
Mojatatu Networks http://mojatatu.com/ 


THE Technical Conference on Linux Networking, February 14-17, 2015, Ottawa, Canada
https://netdev01.org/

Travel advice: https://netdev01.org/travel
RSS feed: https://netdev01.org/atom
Follow us on Twitter: @netdev01 https://twitter.com/netdev01

^ permalink raw reply

* Re: [PATCH] i40e: don't enable and init FCOE by default when do PF reset
From: Ethan Zhao @ 2015-01-13  2:56 UTC (permalink / raw)
  To: Dev, Vasu
  Cc: Ronciak, John, Ethan Zhao, Kirsher, Jeffrey T, Brandeburg, Jesse,
	Allan, Bruce W, Wyborny, Carolyn, Skidmore, Donald C,
	Rose, Gregory V, Vick, Matthew, Williams, Mitch A, Parikh, Neerav,
	Linux NICS, e1000-devel@lists.sourceforge.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	brian.maly@oracle.com
In-Reply-To: <933BEC2E04D6A5458F4B0239FB547F9A34CC3608@fmsmsx118.amr.corp.intel.com>

Vasu,

On Sat, Jan 10, 2015 at 2:18 AM, Dev, Vasu <vasu.dev@intel.com> wrote:
>> -----Original Message-----
>> From: Ronciak, John
>> Sent: Friday, January 09, 2015 8:42 AM
>> To: Ethan Zhao; Kirsher, Jeffrey T; Brandeburg, Jesse; Allan, Bruce W;
>> Wyborny, Carolyn; Skidmore, Donald C; Rose, Gregory V; Vick, Matthew;
>> Williams, Mitch A; Dev, Vasu; Parikh, Neerav
>> Cc: Linux NICS; e1000-devel@lists.sourceforge.net; netdev@vger.kernel.org;
>> linux-kernel@vger.kernel.org; ethan.kernel@gmail.com;
>> brian.maly@oracle.com
>> Subject: RE: [PATCH] i40e: don't enable and init FCOE by default when do PF
>> reset
>>
>> Adding Vasu and Neerav
>>
>> Cheers,
>> John
>>
>> > -----Original Message-----
>> > From: Ethan Zhao [mailto:ethan.zhao@oracle.com]
>> > Sent: Friday, January 9, 2015 8:38 AM
>> > To: Kirsher, Jeffrey T; Brandeburg, Jesse; Allan, Bruce W; Wyborny,
>> > Carolyn; Skidmore, Donald C; Rose, Gregory V; Vick, Matthew; Ronciak,
>> > John; Williams, Mitch A
>> > Cc: Linux NICS; e1000-devel@lists.sourceforge.net;
>> > netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
>> > ethan.kernel@gmail.com; brian.maly@oracle.com; Ethan Zhao
>> > Subject: [PATCH] i40e: don't enable and init FCOE by default when do
>> > PF reset
>> >
>> > When a PF reset is done with i40e_reset_and_rebuild(), it calls
>> > i40e_init_pf_fcoe() by default if FCOE is defined, so after the PF is
>> > reset, FCoE will be enabled regardless of whether it was enabled before.
>> >
>> > Such bug might be hit when PF resumes from suspend, run diagnostic
>> > test with ethtool, setup VLAN etc.
>> >
>> > Passed building with v3.19-rc3.
>> >
>> > Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
>> > ---
>> >  drivers/net/ethernet/intel/i40e/i40e_main.c | 9 ++++++---
>> >  1 file changed, 6 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> > b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> > index a5f2660..a2572cc 100644
>> > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> > @@ -6180,9 +6180,12 @@ static void i40e_reset_and_rebuild(struct
>> > i40e_pf *pf, bool reinit)
>> >     }
>> >  #endif /* CONFIG_I40E_DCB */
>> >  #ifdef I40E_FCOE
>> > -   ret = i40e_init_pf_fcoe(pf);
>> > -   if (ret)
>> > -           dev_info(&pf->pdev->dev, "init_pf_fcoe failed: %d\n", ret);
>> > +   if (pf->flags & I40E_FLAG_FCOE_ENABLED) {
>> > +           ret = i40e_init_pf_fcoe(pf);
>
> Calling i40e_init_pf_fcoe() here conflicts with its I40E_FLAG_FCOE_ENABLED pre-condition since I40E_FLAG_FCOE_ENABLED is set by very same i40e_init_pf_fcoe(), in turn i40e_init_pf_fcoe() will never get called.

I don't think so. i40e_reset_and_rebuild() is neither the only nor the
first place where i40e_init_pf_fcoe() is called; see i40e_probe(), which
is the first chance.

i40e_probe()
-->i40e_sw_init()
     -->i40e_init_pf_fcoe()

And I40E_FLAG_FCOE_ENABLED may be changed by the
i40e_fcoe_enable() or i40e_fcoe_disable() interface before the reset
is done.
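
To make the sequence concrete, here is a small stand-alone model of
it. All names prefixed with sketch_ are hypothetical and the functions
are heavily simplified; it is not the driver code, it only mirrors the
call order described in this thread:

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_FLAG_FCOE_ENABLED (1u << 0)

struct pf_sketch {
	unsigned int flags;
	int fcoe_inits;		/* how often the FCoE init ran */
};

/* Stand-in for i40e_init_pf_fcoe(): sets the flag as a side effect. */
static void sketch_init_pf_fcoe(struct pf_sketch *pf)
{
	pf->flags |= SKETCH_FLAG_FCOE_ENABLED;
	pf->fcoe_inits++;
}

/* Probe path: the first, unconditional init. */
static void sketch_probe(struct pf_sketch *pf)
{
	pf->flags = 0;
	pf->fcoe_inits = 0;
	sketch_init_pf_fcoe(pf);
}

/* Stand-in for the user disabling FCoE before a reset. */
static void sketch_fcoe_disable(struct pf_sketch *pf)
{
	pf->flags &= ~SKETCH_FLAG_FCOE_ENABLED;
}

/* Reset path; guarded selects the behaviour with the patch applied. */
static void sketch_reset_and_rebuild(struct pf_sketch *pf, bool guarded)
{
	assert(pf);
	if (!guarded || (pf->flags & SKETCH_FLAG_FCOE_ENABLED))
		sketch_init_pf_fcoe(pf);
}
```

With the guard, a PF that was probed and then disabled stays disabled
across a reset, while a PF that is still enabled is re-initialized as
before. Without the guard, the reset turns FCoE back on.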

BTW, the reason I posted this patch is that we hit a bug: after setting
up a VLAN, FCoE gets enabled on the PF.

>
> Jeff Kirsher should be getting out a patch queued by me which adds I40E_FCoE Kbuild option, in that FCoE is disabled by default and  user could enable FCoE only if needed, that patch would do same of skipping i40e_init_pf_fcoe() whether FCoE capability in device enabled or not in default config.
>

The following patch will not fix the above issue: the PF configuration
is still changed by a reset.
What if FCoE is configured and then disabled by i40e_fcoe_disable(),
and a reset happens afterwards?

> From patchwork Wed Oct  2 23:26:08 2013
> Content-Type: text/plain; charset="utf-8"
> MIME-Version: 1.0
> Content-Transfer-Encoding: 7bit
> Subject: [net] i40e: adds FCoE configure option
> Date: Thu, 03 Oct 2013 07:26:08 -0000
> From: Vasu Dev <vasu.dev@intel.com>
> X-Patchwork-Id: 11797
>
> Adds FCoE config option I40E_FCOE, so that FCoE can be enabled
> as needed but otherwise have it disabled by default.
>
> This also eliminate multiple FCoE config checks, instead now just
> one config check for CONFIG_I40E_FCOE.
>
> The I40E FCoE was added with 3.17 kernel and therefore this patch
> shall be applied to stable 3.17 kernel also.
>
> CC: <stable@vger.kernel.org>
> Signed-off-by: Vasu Dev <vasu.dev@intel.com>
> Tested-by: Jim Young <jamesx.m.young@intel.com>
>
> ---
> drivers/net/ethernet/intel/Kconfig           |   11 +++++++++++
>  drivers/net/ethernet/intel/i40e/Makefile     |    2 +-
>  drivers/net/ethernet/intel/i40e/i40e_osdep.h |    4 ++--
>  3 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
> index 5b8300a..4d61ef5 100644
> --- a/drivers/net/ethernet/intel/Kconfig
> +++ b/drivers/net/ethernet/intel/Kconfig
> @@ -281,6 +281,17 @@ config I40E_DCB
>
>           If unsure, say N.
>
> +config I40E_FCOE
> +       bool "Fibre Channel over Ethernet (FCoE)"
> +       default n
> +       depends on I40E && DCB && FCOE
> +       ---help---
> +         Say Y here if you want to use Fibre Channel over Ethernet (FCoE)
> +         in the driver. This will create new netdev for exclusive FCoE
> +         use with XL710 FCoE offloads enabled.
> +
> +         If unsure, say N.
> +
>  config I40EVF
>         tristate "Intel(R) XL710 X710 Virtual Function Ethernet support"
>         depends on PCI_MSI
> diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
> index 4b94ddb..c405819 100644
> --- a/drivers/net/ethernet/intel/i40e/Makefile
> +++ b/drivers/net/ethernet/intel/i40e/Makefile
> @@ -44,4 +44,4 @@ i40e-objs := i40e_main.o \
>         i40e_virtchnl_pf.o
>
>  i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
> -i40e-$(CONFIG_FCOE:m=y) += i40e_fcoe.o
> +i40e-$(CONFIG_I40E_FCOE) += i40e_fcoe.o
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_osdep.h b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
> index 045b5c4..ad802dd 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_osdep.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
> @@ -78,7 +78,7 @@ do {                                                            \
>  } while (0)
>
>  typedef enum i40e_status_code i40e_status;
> -#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
> +#ifdef CONFIG_I40E_FCOE
>  #define I40E_FCOE
> -#endif /* CONFIG_FCOE or CONFIG_FCOE_MODULE */
> +#endif
>  #endif /* _I40E_OSDEP_H_ */
>
>> > +           if (ret)
>> > +                   dev_info(&pf->pdev->dev,
>> > +                            "init_pf_fcoe failed: %d\n", ret);
>> > +   }
>> >
>> >  #endif
>> >     /* do basic switch setup */
>> > --
>> > 1.8.3.1
>

Thanks,
Ethan

^ permalink raw reply

* [PATCH] net/fsl: fix a bug in xgmac_mdio
From: shh.xie @ 2015-01-13  2:30 UTC (permalink / raw)
  To: netdev, davem; +Cc: Shaohui Xie

From: Shaohui Xie <Shaohui.Xie@freescale.com>

There is a bug in xgmac_mdio_read() when clearing the MDIO_STAT_ENC bit:
the '&' is missing from 'mdio_stat &= ~MDIO_STAT_ENC'.
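
The effect can be shown with a small stand-alone example. BIT() here
is a local stand-in for the kernel macro, and the two helper functions
are hypothetical names used only for this illustration:

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n)		(1u << (n))	/* local stand-in for the kernel macro */
#define MDIO_STAT_ENC	BIT(6)

/* Old code: overwrites the whole value, setting every other bit. */
static uint32_t clear_enc_buggy(uint32_t mdio_stat)
{
	(void)mdio_stat;
	return ~MDIO_STAT_ENC;
}

/* Fixed code: clears only bit 6 and keeps the other bits. */
static uint32_t clear_enc_fixed(uint32_t mdio_stat)
{
	uint32_t r = mdio_stat & ~MDIO_STAT_ENC;

	assert(!(r & MDIO_STAT_ENC));
	return r;
}
```

For an input of 0x0000ff41 the fixed version yields 0x0000ff01, while
the old assignment yields 0xffffffbf regardless of the input.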

Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
---
 drivers/net/ethernet/freescale/xgmac_mdio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/xgmac_mdio.c b/drivers/net/ethernet/freescale/xgmac_mdio.c
index e0fc3d1..a492e50 100644
--- a/drivers/net/ethernet/freescale/xgmac_mdio.c
+++ b/drivers/net/ethernet/freescale/xgmac_mdio.c
@@ -156,7 +156,7 @@ static int xgmac_mdio_read(struct mii_bus *bus, int phy_id, int regnum)
 		mdio_stat |= MDIO_STAT_ENC;
 	} else {
 		dev_addr = regnum & 0x1f;
-		mdio_stat = ~MDIO_STAT_ENC;
+		mdio_stat &= ~MDIO_STAT_ENC;
 	}
 
 	out_be32(&regs->mdio_stat, mdio_stat);
-- 
1.8.4.1

^ permalink raw reply related

* [PATCH] net/fsl: replace (1 << x) with BIT(x) for bit definitions in xgmac_mdio
From: shh.xie @ 2015-01-13  2:30 UTC (permalink / raw)
  To: netdev, davem; +Cc: Shaohui Xie

From: Shaohui Xie <Shaohui.Xie@freescale.com>

Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
---
 drivers/net/ethernet/freescale/xgmac_mdio.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/freescale/xgmac_mdio.c b/drivers/net/ethernet/freescale/xgmac_mdio.c
index a492e50..3a76e23 100644
--- a/drivers/net/ethernet/freescale/xgmac_mdio.c
+++ b/drivers/net/ethernet/freescale/xgmac_mdio.c
@@ -34,17 +34,17 @@ struct tgec_mdio_controller {
 
 #define MDIO_STAT_ENC		BIT(6)
 #define MDIO_STAT_CLKDIV(x)	(((x>>1) & 0xff) << 8)
-#define MDIO_STAT_BSY		(1 << 0)
-#define MDIO_STAT_RD_ER		(1 << 1)
+#define MDIO_STAT_BSY		BIT(0)
+#define MDIO_STAT_RD_ER		BIT(1)
 #define MDIO_CTL_DEV_ADDR(x) 	(x & 0x1f)
 #define MDIO_CTL_PORT_ADDR(x)	((x & 0x1f) << 5)
-#define MDIO_CTL_PRE_DIS	(1 << 10)
-#define MDIO_CTL_SCAN_EN	(1 << 11)
-#define MDIO_CTL_POST_INC	(1 << 14)
-#define MDIO_CTL_READ		(1 << 15)
+#define MDIO_CTL_PRE_DIS	BIT(10)
+#define MDIO_CTL_SCAN_EN	BIT(11)
+#define MDIO_CTL_POST_INC	BIT(14)
+#define MDIO_CTL_READ		BIT(15)
 
 #define MDIO_DATA(x)		(x & 0xffff)
-#define MDIO_DATA_BSY		(1 << 31)
+#define MDIO_DATA_BSY		BIT(31)
 
 /*
  * Wait until the MDIO bus is free
-- 
1.8.4.1

^ permalink raw reply related

* Re: [PATCH net] ipv6: Prevent ipv6_find_hdr() from returning ENOENT for valid non-first fragments
From: Rahul Sharma @ 2015-01-13  4:23 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Hannes Frederic Sowa, netdev, linux-kernel, netfilter-devel
In-Reply-To: <20150112115111.GA3506@salvia>

Hi

On Mon, Jan 12, 2015 at 5:21 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Jan 12, 2015 at 04:38:16PM +0530, Rahul Sharma wrote:
>> Hi Pablo, Hannes
>>
>> On Fri, Jan 9, 2015 at 9:20 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> > On Fr, 2015-01-09 at 12:45 +0100, Pablo Neira Ayuso wrote:
>> >> Hi Hannes,
>> >>
>> >> On Fri, Jan 09, 2015 at 12:34:15PM +0100, Hannes Frederic Sowa wrote:
>> >> > On Fri, Jan 9, 2015, at 08:18, Rahul Sharma wrote:
>> >> > > Hi Pablo,
>> >> > >
>> >> > > On Fri, Jan 9, 2015 at 5:35 AM, Pablo Neira Ayuso <pablo@netfilter.org>
>> >> > > wrote:
>> >> > > > On Thu, Jan 08, 2015 at 11:39:16PM +0100, Hannes Frederic Sowa wrote:
>> >> > > >> Hi Pablo,
>> >> > > >>
>> >> > > >> On Thu, Jan 8, 2015, at 21:53, Pablo Neira Ayuso wrote:
>> >> > > >> > I'm afraid we cannot just get rid of that !ipv6_ext_hdr() check. The
>> >> > > >> > ipv6_find_hdr() function is designed to return the transport protocol.
>> >> > > >> > After the proposed change, it will return extension header numbers.
>> >> > > >> > This will break existing ip6tables rulesets since the `-p' option
>> >> > > >> > relies on this function to match the transport protocol.
>> >> > > >> >
>> >> > > >> > Note that the AH header is skipped (see code a bit below this
>> >> > > >> > problematic fragmentation handling) so the follow up header after the
>> >> > > >> > AH header is returned as the transport header.
>> >> > > >> >
>> >> > > >> > We can probably return the AH protocol number for non-1st fragments.
>> >> > > >> > However, that would be something new to ip6tables since nobody has
>> >> > > >> > ever seen packet matching `-p ah' rules. Thus, we restore control to
>> >> > > >> > the user to allow this, but we would accept all kind of fragmented AH
>> >> > > >> > traffic through the firewall since we cannot know what transport
>> >> > > >> > protocol contains from non-1st fragments (unless I'm missing anything,
>> >> > > >> > I need to have a closer look at this again tomorrow with fresher
>> >> > > >> > mind).
>> >> > > >>
>> >> > > >> The code in question is guarded by (_frag_off != 0), so we are
>> >> > > >> definitely processing a non-1st fragment currently. The -p match would
>> >> > > >> happen at the time when the packet is reassembled and thus ipv6_find_hdr
>> >> > > >> will find the real transport (final) header at this point (I hope I
>> >> > > >> followed the code correctly here).
>> >> > > >
>> >> > > > Then, Rahul should get things working by modprobing nf_defrag_ipv6.
>> >> > >
>> >> > > I already had nf_defrag_ipv6 installed when the issue occured. But I
>> >> > > see ip6table_raw_hook returning NF_DROP for the second fragment.
>> >> >
>> >> > That's what I expected. I think the change only affects hooks before
>> >> > reassembly.
>> >>
>> >> reassembly happens at NF_IP6_PRI_CONNTRACK_DEFRAG (-400), so that
>> >> happens before NF_IP6_PRI_RAW (-300) in IPv6 which is where the raw
>> >> table is placed.
>> >
>> > I tried to reproduce it, but couldn't get non-1st fragments getting
>> > dropped during traversal of the raw table. They get dropped earlier
>> > during reassembly or pass.
>> >
>> > I agree with Pablo, I also would like to see more data.
>> >
>> > Thanks,
>> > Hannes
>> >
>> >
>>
>> I enabled pr_debug() and there was no error in nf_ct_frag6_gather().
>> It seems to have defragmented the packet correctly. As expected,
>> ipv6_defrag() returns NF_STOLEN for the first packet after queuing it.
>> For the next fragment, ipv6_defrag() calls nf_ct_frag6_output() after
>> reassembling it.
>
> nf_ct_frag6_output() doesn't exist anymore. You're using an old
> kernel, you should have started by telling so in your report.
>
> See 6aafeef ("netfilter: push reasm skb through instead of original
> frag skbs").

I apologize for not mentioning the kernel version in my first mail. I
had suspected a problem in ipv6_find_hdr(), the code for which is the same.
Anyway, thanks for the help. I'll try to figure out how to make this
work in my kernel.

Thanks,
Rahul

^ permalink raw reply

* [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: John Fastabend @ 2015-01-13  4:35 UTC (permalink / raw)
  To: netdev; +Cc: danny.zhou, nhorman, dborkman, john.ronciak, hannes, brouer

This patch adds net_device ops to split off a set of driver queues
from the driver and map the queues into user space via mmap. This
allows the queues to be directly manipulated from user space. For a
raw packet interface this removes any overhead from the kernel network
stack.

With these operations we bypass the network stack and packet_type
handlers that would typically send traffic to an af_packet socket.
This means hardware must do the forwarding. To do this we can use
the ETHTOOL_SRXCLSRLINS ops in the ethtool command set. It is
currently supported by multiple drivers including sfc, mlx4, niu,
ixgbe, and i40e. Supporting some way to steer traffic to a queue
is the _only_ hardware requirement to support this interface.

A follow-on patch adds support for ixgbe, but we expect at least
the subset of drivers implementing ETHTOOL_SRXCLSRLINS can be
supported later.

The high level flow, leveraging the af_packet control path, looks
like:

	bind(fd, &sockaddr, sizeof(sockaddr));

	/* Get the device type and info */
	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
		   &optlen);

	/* With device info we can look up descriptor format */

	/* Get the layout of ring space offset, page_sz, cnt */
	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
		   &info, &optlen);

	/* request some queues from the driver */
	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		   &qpairs_info, sizeof(qpairs_info));

	/* If we let the driver pick the queues, learn which queues
	 * we were given
	 */
	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		   &qpairs_info, sizeof(qpairs_info));

	/* And mmap queue pairs to user space */
	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
	     MAP_SHARED, fd, 0);

	/* Now we have some user space queues to read/write to */
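
The split step of the flow above could be wrapped with error checking
roughly as follows. This is only a sketch: the option number and the
structure are copied locally from this RFC because they are not in
mainline headers, the names split_queue_pairs and struct qpairs_info are
made up for the example, and the start value of -1 meaning "let the
driver pick" is taken from the ixgbe patch in this series.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/* Values proposed by this RFC. On a kernel without the patch these
 * numbers either fail or mean something else.
 */
#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif
#define PACKET_RXTX_QPAIRS_SPLIT 21

/* Local mirror of struct tpacket_dev_qpairs_info from the patch. */
struct qpairs_info {
	unsigned int tp_qpairs_start_from;
	unsigned int tp_qpairs_num;
};

/* Ask the driver for `num` queue pairs, then read back which ones were
 * assigned. Returns 0 on success, -1 with errno set on failure.
 */
static int split_queue_pairs(int fd, unsigned int num,
			     struct qpairs_info *out)
{
	struct qpairs_info req = {
		.tp_qpairs_start_from = (unsigned int)-1, /* driver picks */
		.tp_qpairs_num = num,
	};
	socklen_t optlen = sizeof(*out);

	if (setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		       &req, sizeof(req)) < 0)
		return -1;

	memset(out, 0, sizeof(*out));
	if (getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		       out, &optlen) < 0)
		return -1;

	return 0;
}
```

Note that getsockopt() takes a pointer to the length, so the caller can
also check how many bytes the kernel wrote back.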

There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model, because we are working directly with
driver queues, the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO. The device
is described by its vendor/device ID and a descriptor layout given
as offset/length/width/alignment/byte_ordering.
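
To show how such a layout might be consumed, here is a small sketch
that pulls one field out of a raw descriptor given its offset and width.
It assumes offsets and widths are counted in bits, which is how the ixgbe
tables in patch 2/2 read, and it assumes little-endian bit numbering; the
patch itself does not spell either out. The names struct desc_field and
desc_get_field are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct tpacket_nic_desc_fld: only the two
 * members needed to locate a field. Both are assumed to be in bits.
 */
struct desc_field {
	uint8_t offset;
	uint8_t width;
};

/* Extract a field of up to 64 bits from a descriptor, treating bit 0 as
 * the least significant bit of byte 0 (little-endian assumption).
 * Bits that fall outside the descriptor are read as zero.
 */
static uint64_t desc_get_field(const uint8_t *desc, size_t desc_len,
			       struct desc_field f)
{
	unsigned int width = f.width > 64 ? 64 : f.width;
	uint64_t v = 0;
	unsigned int i;

	for (i = 0; i < width; i++) {
		unsigned int bit = (unsigned int)f.offset + i;

		if (bit / 8 >= desc_len)
			break;
		if (desc[bit / 8] & (1u << (bit % 8)))
			v |= (uint64_t)1 << i;
	}
	return v;
}
```

With the legacy ixgbe RX layout from patch 2/2, the packet length would
be the field at offset 64 with width 16.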

To protect against arbitrary DMA writes, IOMMU devices put memory
in a single domain to stop arbitrary DMA to memory. Note it would
still be possible to DMA into another socket's pages, because most
NIC devices only support a single domain. This would require being
able to guess the other socket's page layout. However, the socket
operation does require CAP_NET_ADMIN privileges.

Additionally we have a set of DPDK patches to enable DPDK with this
interface. DPDK can be downloaded at dpdk.org, although, as I hope is
clear from the above, DPDK is just our particular test environment; we
expect other libraries could be built on this interface.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/netdevice.h      |   79 ++++++++
 include/uapi/linux/if_packet.h |   88 +++++++++
 net/packet/af_packet.c         |  397 ++++++++++++++++++++++++++++++++++++++++
 net/packet/internal.h          |   10 +
 4 files changed, 573 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 679e6e9..b71c97d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -52,6 +52,8 @@
 #include <linux/neighbour.h>
 #include <uapi/linux/netdevice.h>
 
+#include <linux/if_packet.h>
+
 struct netpoll_info;
 struct device;
 struct phy_device;
@@ -1030,6 +1032,54 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state);
  *	Called to notify switch device port of bridge port STP
  *	state change.
+ *
+ * int (*ndo_split_queue_pairs) (struct net_device *dev,
+ *				 unsigned int qpairs_start_from,
+ *				 unsigned int qpairs_num,
+ *				 struct sock *sk)
+ *	Called to request a set of queues from the driver to be handed to the
+ *	caller for management. After this returns the driver will not use the
+ *	queues.
+ *
+ * int (*ndo_get_split_queue_pairs) (struct net_device *dev,
+ *				 unsigned int *qpairs_start_from,
+ *				 unsigned int *qpairs_num,
+ *				 struct sock *sk)
+ *	Called to get the location of queues that have been split for user
+ *	space to use. The socket must have previously requested the queues via
+ *	ndo_split_queue_pairs successfully.
+ *
+ * int (*ndo_return_queue_pairs) (struct net_device *dev,
+ *				  struct sock *sk)
+ *	Called to return a set of queues identified by sock to the driver. The
+ *	socket must have previously requested the queues via
+ *	ndo_split_queue_pairs for this action to be performed.
+ *
+ * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
+ *				struct tpacket_dev_qpair_map_region_info *info)
+ *	Called to return mapping of queue memory region.
+ *
+ * int (*ndo_get_device_desc_info) (struct net_device *dev,
+ *				    struct tpacket_dev_info *dev_info)
+ *	Called to get device specific information. This should uniquely identify
+ *	the hardware so that descriptor formats can be learned by the stack/user
+ *	space.
+ *
+ * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
+ *				     struct net_device *dev)
+ *	Called to map queue pair range from split_queue_pairs into mmap region.
+ *
+ * int (*ndo_validate_dma_mem_region_map)
+ *					(struct net_device *dev,
+ *					 struct tpacket_dma_mem_region *region,
+ *					 struct sock *sk)
+ *	Called to validate DMA address remapping for a userspace memory region.
+ *
+ * int (*ndo_get_dma_region_info)
+ *				 (struct net_device *dev,
+ *				  struct tpacket_dma_mem_region *region,
+ *				  struct sock *sk)
+ *	Called to get DMA region information such as the iova.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1190,6 +1240,35 @@ struct net_device_ops {
 	int			(*ndo_switch_port_stp_update)(struct net_device *dev,
 							      u8 state);
 #endif
+	int			(*ndo_split_queue_pairs)(struct net_device *dev,
+					 unsigned int qpairs_start_from,
+					 unsigned int qpairs_num,
+					 struct sock *sk);
+	int			(*ndo_get_split_queue_pairs)
+					(struct net_device *dev,
+					 unsigned int *qpairs_start_from,
+					 unsigned int *qpairs_num,
+					 struct sock *sk);
+	int			(*ndo_return_queue_pairs)
+					(struct net_device *dev,
+					 struct sock *sk);
+	int			(*ndo_get_device_qpair_map_region_info)
+					(struct net_device *dev,
+					 struct tpacket_dev_qpair_map_region_info *info);
+	int			(*ndo_get_device_desc_info)
+					(struct net_device *dev,
+					 struct tpacket_dev_info *dev_info);
+	int			(*ndo_direct_qpair_page_map)
+					(struct vm_area_struct *vma,
+					 struct net_device *dev);
+	int			(*ndo_validate_dma_mem_region_map)
+					(struct net_device *dev,
+					 struct tpacket_dma_mem_region *region,
+					 struct sock *sk);
+	int			(*ndo_get_dma_region_info)
+					(struct net_device *dev,
+					 struct tpacket_dma_mem_region *region,
+					 struct sock *sk);
 };
 
 /**
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index da2d668..eb7a727 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -54,6 +54,13 @@ struct sockaddr_ll {
 #define PACKET_FANOUT			18
 #define PACKET_TX_HAS_OFF		19
 #define PACKET_QDISC_BYPASS		20
+#define PACKET_RXTX_QPAIRS_SPLIT	21
+#define PACKET_RXTX_QPAIRS_RETURN	22
+#define PACKET_DEV_QPAIR_MAP_REGION_INFO	23
+#define PACKET_DEV_DESC_INFO		24
+#define PACKET_DMA_MEM_REGION_MAP       25
+#define PACKET_DMA_MEM_REGION_RELEASE   26
+
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
@@ -64,6 +71,87 @@ struct sockaddr_ll {
 #define PACKET_FANOUT_FLAG_ROLLOVER	0x1000
 #define PACKET_FANOUT_FLAG_DEFRAG	0x8000
 
+#define PACKET_MAX_NUM_MAP_MEMORY_REGIONS 64
+#define PACKET_MAX_NUM_DESC_FORMATS	  8
+#define PACKET_MAX_NUM_DESC_FIELDS	  64
+#define PACKET_NIC_DESC_FIELD(fseq, foffset, fwidth, falign, fbo) \
+		.seqn = (__u8)fseq,				\
+		.offset = (__u8)foffset,			\
+		.width = (__u8)fwidth,				\
+		.align = (__u8)falign,				\
+		.byte_order = (__u8)fbo
+
+#define MAX_MAP_MEMORY_REGIONS	64
+
+/* setsockopt takes addr, size, direction parameters; getsockopt takes
+ * iova, size, direction.
+ */
+struct tpacket_dma_mem_region {
+	void *addr;		/* userspace virtual address */
+	__u64 phys_addr;	/* physical address */
+	__u64 iova;		/* IO virtual address used for DMA */
+	unsigned long size;	/* size of region */
+	int direction;		/* dma data direction */
+};
+
+struct tpacket_dev_qpair_map_region_info {
+	unsigned int tp_dev_bar_sz;		/* size of BAR */
+	unsigned int tp_dev_sysm_sz;		/* size of system memory */
+	/* number of contiguous memory regions on BAR mapped to user space */
+	unsigned int tp_num_map_regions;
+	/* number of contiguous system memory regions mapped to user space */
+	unsigned int tp_num_sysm_map_regions;
+	struct map_page_region {
+		unsigned page_offset;	/* offset to start of region */
+		unsigned page_sz;	/* size of page */
+		unsigned page_cnt;	/* number of pages */
+	} tp_regions[MAX_MAP_MEMORY_REGIONS];
+};
+
+struct tpacket_dev_qpairs_info {
+	unsigned int tp_qpairs_start_from;	/* qpairs index to start from */
+	unsigned int tp_qpairs_num;		/* number of qpairs */
+};
+
+enum tpack_desc_byte_order {
+	BO_NATIVE = 0,
+	BO_NETWORK,
+	BO_BIG_ENDIAN,
+	BO_LITTLE_ENDIAN,
+};
+
+struct tpacket_nic_desc_fld {
+	__u8 seqn;	/* Sequence index of descriptor field */
+	__u8 offset;	/* Offset to start */
+	__u8 width;	/* Width of field */
+	__u8 align;	/* Alignment in bits */
+	enum tpack_desc_byte_order byte_order;	/* Endian flag */
+};
+
+struct tpacket_nic_desc_expr {
+	__u8 version;		/* Version number */
+	__u8 size;		/* Descriptor size in bytes */
+	enum tpack_desc_byte_order byte_order;		/* Endian flag */
+	__u8 num_of_fld;	/* Number of valid fields */
+	/* List of each descriptor field */
+	struct tpacket_nic_desc_fld fields[PACKET_MAX_NUM_DESC_FIELDS];
+};
+
+struct tpacket_dev_info {
+	__u16	tp_device_id;
+	__u16	tp_vendor_id;
+	__u16	tp_subsystem_device_id;
+	__u16	tp_subsystem_vendor_id;
+	__u32	tp_numa_node;
+	__u32	tp_revision_id;
+	__u32	tp_num_total_qpairs;
+	__u32	tp_num_inuse_qpairs;
+	__u32	tp_num_rx_desc_fmt;
+	__u32	tp_num_tx_desc_fmt;
+	struct tpacket_nic_desc_expr tp_rx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
+	struct tpacket_nic_desc_expr tp_tx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
+};
+
 struct tpacket_stats {
 	unsigned int	tp_packets;
 	unsigned int	tp_drops;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 6880f34..8cd17da 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -214,6 +214,9 @@ static void prb_clear_rxhash(struct tpacket_kbdq_core *,
 static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
 		struct tpacket3_hdr *);
 static void packet_flush_mclist(struct sock *sk);
+static int umem_release(struct net_device *dev, struct packet_sock *po);
+static int get_umem_pages(struct tpacket_dma_mem_region *region,
+			  struct packet_umem_region *umem);
 
 struct packet_skb_cb {
 	unsigned int origlen;
@@ -2633,6 +2636,16 @@ static int packet_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	preempt_enable();
 
+	if (po->tp_owns_queue_pairs) {
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (dev) {
+			dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+			umem_release(dev, po);
+		}
+	}
+
 	spin_lock(&po->bind_lock);
 	unregister_prot_hook(sk, false);
 	packet_cached_dev_reset(po);
@@ -2829,6 +2842,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	po->num = proto;
 	po->xmit = dev_queue_xmit;
 
+	INIT_LIST_HEAD(&po->umem_list);
+
 	err = packet_alloc_pending(po);
 	if (err)
 		goto out2;
@@ -3226,6 +3241,88 @@ static void packet_flush_mclist(struct sock *sk)
 }
 
 static int
+get_umem_pages(struct tpacket_dma_mem_region *region,
+	       struct packet_umem_region *umem)
+{
+	struct page **page_list;
+	unsigned long npages;
+	unsigned long offset;
+	unsigned long base;
+	unsigned long i;
+	int ret;
+	dma_addr_t phys_base;
+
+	phys_base = (region->phys_addr) & PAGE_MASK;
+	base = ((unsigned long)region->addr) & PAGE_MASK;
+	offset = ((unsigned long)region->addr) & (~PAGE_MASK);
+	npages = PAGE_ALIGN(region->size + offset) >> PAGE_SHIFT;
+
+	npages = min_t(unsigned long, npages, umem->nents);
+	sg_init_table(umem->sglist, npages);
+
+	umem->nmap = 0;
+	page_list = (struct page **)__get_free_page(GFP_KERNEL);
+	if (!page_list)
+		return -ENOMEM;
+
+	while (npages) {
+		unsigned long min = min_t(unsigned long, npages,
+					  PAGE_SIZE / sizeof(struct page *));
+
+		ret = get_user_pages(current, current->mm, base, min,
+				     1, 0, page_list, NULL);
+		if (ret < 0)
+			break;
+
+		base += ret * PAGE_SIZE;
+		npages -= ret;
+
+		/* validate that the memory region is physically contiguous */
+		for (i = 0; i < ret; i++) {
+			unsigned int page_index =
+				(page_to_phys(page_list[i]) - phys_base) /
+				PAGE_SIZE;
+
+			if (page_index != umem->nmap + i) {
+				int j;
+
+				for (j = 0; j < (umem->nmap + i); j++)
+					put_page(sg_page(&umem->sglist[j]));
+
+				free_page((unsigned long)page_list);
+				return -EFAULT;
+			}
+
+			sg_set_page(&umem->sglist[umem->nmap + i],
+				    page_list[i], PAGE_SIZE, 0);
+		}
+
+		umem->nmap += ret;
+	}
+
+	free_page((unsigned long)page_list);
+	return 0;
+}
+
+static int
+umem_release(struct net_device *dev, struct packet_sock *po)
+{
+	struct packet_umem_region *umem, *tmp;
+	int i;
+
+	list_for_each_entry_safe(umem, tmp, &po->umem_list, list) {
+		dma_unmap_sg(dev->dev.parent, umem->sglist,
+			     umem->nmap, umem->direction);
+		for (i = 0; i < umem->nmap; i++)
+			put_page(sg_page(&umem->sglist[i]));
+
+		vfree(umem);
+	}
+
+	return 0;
+}
+
+static int
 packet_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen)
 {
 	struct sock *sk = sock->sk;
@@ -3428,6 +3525,167 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
 		return 0;
 	}
+	case PACKET_RXTX_QPAIRS_SPLIT:
+	{
+		struct tpacket_dev_qpairs_info qpairs;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (optlen != sizeof(qpairs))
+			return -EINVAL;
+		if (copy_from_user(&qpairs, optval, sizeof(qpairs)))
+			return -EFAULT;
+
+		/* Only allow one set of queues to be owned by userspace */
+		if (po->tp_owns_queue_pairs)
+			return -EBUSY;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		ops = dev->netdev_ops;
+		if (!ops->ndo_split_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err =  ops->ndo_split_queue_pairs(dev,
+						  qpairs.tp_qpairs_start_from,
+						  qpairs.tp_qpairs_num, sk);
+		if (!err)
+			po->tp_owns_queue_pairs = true;
+
+		return err;
+	}
+	case PACKET_RXTX_QPAIRS_RETURN:
+	{
+		struct tpacket_dev_qpairs_info qpairs_info;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (optlen != sizeof(qpairs_info))
+			return -EINVAL;
+		if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+			return -EFAULT;
+
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		ops = dev->netdev_ops;
+		if (!ops->ndo_return_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+		if (!err)
+			po->tp_owns_queue_pairs = false;
+
+		return err;
+	}
+	case PACKET_DMA_MEM_REGION_MAP:
+	{
+		struct tpacket_dma_mem_region region;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		struct packet_umem_region *umem;
+		unsigned long npages;
+		unsigned long offset;
+		unsigned long i;
+		int err;
+
+		if (optlen != sizeof(region))
+			return -EINVAL;
+		if (copy_from_user(&region, optval, sizeof(region)))
+			return -EFAULT;
+		if ((region.direction != DMA_BIDIRECTIONAL) &&
+		    (region.direction != DMA_TO_DEVICE) &&
+		    (region.direction != DMA_FROM_DEVICE))
+			return -EFAULT;
+
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		offset = ((unsigned long)region.addr) & (~PAGE_MASK);
+		npages = PAGE_ALIGN(region.size + offset) >> PAGE_SHIFT;
+
+		umem = vzalloc(sizeof(*umem) +
+			       sizeof(struct scatterlist) * npages);
+		if (!umem)
+			return -ENOMEM;
+
+		umem->nents = npages;
+		umem->direction = region.direction;
+
+		down_write(&current->mm->mmap_sem);
+		if (get_umem_pages(&region, umem) < 0) {
+			ret = -EFAULT;
+			goto exit;
+		}
+
+		if ((umem->nmap == npages) &&
+		    (0 != dma_map_sg(dev->dev.parent, umem->sglist,
+				     umem->nmap, region.direction))) {
+			region.iova = sg_dma_address(umem->sglist) + offset;
+
+			ops = dev->netdev_ops;
+			if (!ops->ndo_validate_dma_mem_region_map) {
+				ret = -EOPNOTSUPP;
+				goto unmap;
+			}
+
+			/* use driver to validate mapping of dma memory */
+			err = ops->ndo_validate_dma_mem_region_map(dev,
+								   &region,
+								   sk);
+			if (!err) {
+				list_add_tail(&umem->list, &po->umem_list);
+				ret = 0;
+				goto exit;
+			}
+		}
+
+unmap:
+		dma_unmap_sg(dev->dev.parent, umem->sglist,
+			     umem->nmap, umem->direction);
+		for (i = 0; i < umem->nmap; i++)
+			put_page(sg_page(&umem->sglist[i]));
+
+		vfree(umem);
+exit:
+		up_write(&current->mm->mmap_sem);
+
+		return ret;
+	}
+	case PACKET_DMA_MEM_REGION_RELEASE:
+	{
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		down_write(&current->mm->mmap_sem);
+		ret = umem_release(dev, po);
+		up_write(&current->mm->mmap_sem);
+
+		return ret;
+	}
+
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3523,6 +3781,129 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_QDISC_BYPASS:
 		val = packet_use_direct_xmit(po);
 		break;
+	case PACKET_RXTX_QPAIRS_SPLIT:
+	{
+		struct net_device *dev;
+		struct tpacket_dev_qpairs_info qpairs_info;
+		int err;
+
+		if (len != sizeof(qpairs_info))
+			return -EINVAL;
+		if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+			return -EFAULT;
+
+		/* This call only works after a successful queue pairs split-off
+		 * operation via setsockopt()
+		 */
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_get_split_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err =  dev->netdev_ops->ndo_get_split_queue_pairs(dev,
+					&qpairs_info.tp_qpairs_start_from,
+					&qpairs_info.tp_qpairs_num, sk);
+
+		lv = sizeof(qpairs_info);
+		data = &qpairs_info;
+		break;
+	}
+	case PACKET_DEV_QPAIR_MAP_REGION_INFO:
+	{
+		struct tpacket_dev_qpair_map_region_info info;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		ops = dev->netdev_ops;
+		if (!ops->ndo_get_device_qpair_map_region_info)
+			return -EOPNOTSUPP;
+
+		err = ops->ndo_get_device_qpair_map_region_info(dev, &info);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dev_qpair_map_region_info);
+		data = &info;
+		break;
+	}
+	case PACKET_DEV_DESC_INFO:
+	{
+		struct net_device *dev;
+		struct tpacket_dev_info info;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_get_device_desc_info)
+			return -EOPNOTSUPP;
+
+		err =  dev->netdev_ops->ndo_get_device_desc_info(dev, &info);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dev_info);
+		data = &info;
+		break;
+	}
+	case PACKET_DMA_MEM_REGION_MAP:
+	{
+		struct tpacket_dma_mem_region info;
+		struct net_device *dev;
+		int err;
+
+		if (len != sizeof(info))
+				return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+				return -EFAULT;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		if (!dev->netdev_ops->ndo_get_dma_region_info)
+			return -EOPNOTSUPP;
+
+		err =  dev->netdev_ops->ndo_get_dma_region_info(dev, &info, sk);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dma_mem_region);
+		data = &info;
+		break;
+	}
+
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3536,7 +3917,6 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	return 0;
 }
 
-
 static int packet_notifier(struct notifier_block *this,
 			   unsigned long msg, void *ptr)
 {
@@ -3920,6 +4300,8 @@ static int packet_mmap(struct file *file, struct socket *sock,
 	struct packet_sock *po = pkt_sk(sk);
 	unsigned long size, expected_size;
 	struct packet_ring_buffer *rb;
+	const struct net_device_ops *ops;
+	struct net_device *dev;
 	unsigned long start;
 	int err = -EINVAL;
 	int i;
@@ -3927,8 +4309,20 @@ static int packet_mmap(struct file *file, struct socket *sock,
 	if (vma->vm_pgoff)
 		return -EINVAL;
 
+	dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+	if (!dev)
+		return -EINVAL;
+
 	mutex_lock(&po->pg_vec_lock);
 
+	if (po->tp_owns_queue_pairs) {
+		ops = dev->netdev_ops;
+		err = ops->ndo_direct_qpair_page_map(vma, dev);
+		if (err)
+			goto out;
+		goto done;
+	}
+
 	expected_size = 0;
 	for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
 		if (rb->pg_vec) {
@@ -3966,6 +4360,7 @@ static int packet_mmap(struct file *file, struct socket *sock,
 		}
 	}
 
+done:
 	atomic_inc(&po->mapped);
 	vma->vm_ops = &packet_mmap_ops;
 	err = 0;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index cdddf6a..55d2fce 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -90,6 +90,14 @@ struct packet_fanout {
 	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
 };
 
+struct packet_umem_region {
+	struct list_head	list;
+	int			nents;
+	int			nmap;
+	int			direction;
+	struct scatterlist	sglist[0];
+};
+
 struct packet_sock {
 	/* struct sock has to be the first member of packet_sock */
 	struct sock		sk;
@@ -97,6 +105,7 @@ struct packet_sock {
 	union  tpacket_stats_u	stats;
 	struct packet_ring_buffer	rx_ring;
 	struct packet_ring_buffer	tx_ring;
+	struct list_head        umem_list;
 	int			copy_thresh;
 	spinlock_t		bind_lock;
 	struct mutex		pg_vec_lock;
@@ -113,6 +122,7 @@ struct packet_sock {
 	unsigned int		tp_reserve;
 	unsigned int		tp_loss:1;
 	unsigned int		tp_tx_has_off:1;
+	unsigned int		tp_owns_queue_pairs:1;
 	unsigned int		tp_tstamp;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);

^ permalink raw reply related

* [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
From: John Fastabend @ 2015-01-13  4:35 UTC (permalink / raw)
  To: netdev; +Cc: danny.zhou, nhorman, dborkman, john.ronciak, hannes, brouer
In-Reply-To: <20150113043509.29985.33515.stgit@nitbit.x32>

This allows driver queues to be split off and mapped into user
space using af_packet.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h         |   17 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   23 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |  407 ++++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h    |    1 
 4 files changed, 440 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 38fc64c..aa4960e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -204,6 +204,20 @@ struct ixgbe_tx_queue_stats {
 	u64 tx_done_old;
 };
 
+#define MAX_USER_DMA_REGIONS_PER_SOCKET  16
+
+struct ixgbe_user_dma_region {
+	dma_addr_t dma_region_iova;
+	unsigned long dma_region_size;
+	int direction;
+};
+
+struct ixgbe_user_queue_info {
+	struct sock *sk_handle;
+	struct ixgbe_user_dma_region regions[MAX_USER_DMA_REGIONS_PER_SOCKET];
+	int num_of_regions;
+};
+
 struct ixgbe_rx_queue_stats {
 	u64 rsc_count;
 	u64 rsc_flush;
@@ -673,6 +687,9 @@ struct ixgbe_adapter {
 
 	struct ixgbe_q_vector *q_vector[MAX_Q_VECTORS];
 
+	/* Direct User Space Queues */
+	struct ixgbe_user_queue_info user_queue_info[MAX_RX_QUEUES];
+
 	/* DCB parameters */
 	struct ieee_pfc *ixgbe_ieee_pfc;
 	struct ieee_ets *ixgbe_ieee_ets;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index e5be0dd..f180a58 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2598,12 +2598,17 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
 	if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
 		return -EOPNOTSUPP;
 
+	if (fsp->ring_cookie > MAX_RX_QUEUES)
+		return -EINVAL;
+
 	/*
 	 * Don't allow programming if the action is a queue greater than
-	 * the number of online Rx queues.
+	 * the number of online Rx queues unless it is a user space
+	 * queue.
 	 */
 	if ((fsp->ring_cookie != RX_CLS_FLOW_DISC) &&
-	    (fsp->ring_cookie >= adapter->num_rx_queues))
+	    (fsp->ring_cookie >= adapter->num_rx_queues) &&
+	    !adapter->user_queue_info[fsp->ring_cookie].sk_handle)
 		return -EINVAL;
 
 	/* Don't allow indexes to exist outside of available space */
@@ -2680,12 +2685,18 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
 	/* apply mask and compute/store hash */
 	ixgbe_atr_compute_perfect_hash_82599(&input->filter, &mask);
 
+	/* Set input action to reg_idx for driver owned queues otherwise
+	 * use the absolute index for user space queues.
+	 */
+	if (fsp->ring_cookie < adapter->num_rx_queues &&
+	    fsp->ring_cookie != IXGBE_FDIR_DROP_QUEUE)
+		input->action = adapter->rx_ring[input->action]->reg_idx;
+
 	/* program filters to filter memory */
 	err = ixgbe_fdir_write_perfect_filter_82599(hw,
-				&input->filter, input->sw_idx,
-				(input->action == IXGBE_FDIR_DROP_QUEUE) ?
-				IXGBE_FDIR_DROP_QUEUE :
-				adapter->rx_ring[input->action]->reg_idx);
+						    &input->filter,
+						    input->sw_idx,
+						    input->action);
 	if (err)
 		goto err_out_w_lock;
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2ed2c7d..be5bde86 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -50,6 +50,9 @@
 #include <linux/if_bridge.h>
 #include <linux/prefetch.h>
 #include <scsi/fc/fc_fcoe.h>
+#include <linux/mm.h>
+#include <linux/if_packet.h>
+#include <linux/iommu.h>
 
 #ifdef CONFIG_OF
 #include <linux/of_net.h>
@@ -80,6 +83,12 @@ const char ixgbe_driver_version[] = DRV_VERSION;
 static const char ixgbe_copyright[] =
 				"Copyright (c) 1999-2014 Intel Corporation.";
 
+static unsigned int *dummy_page_buf;
+
+#ifndef CONFIG_DMA_MEMORY_PROTECTION
+#define CONFIG_DMA_MEMORY_PROTECTION
+#endif
+
 static const struct ixgbe_info *ixgbe_info_tbl[] = {
 	[board_82598]		= &ixgbe_82598_info,
 	[board_82599]		= &ixgbe_82599_info,
@@ -167,6 +176,76 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit PCI Express Network Driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_VERSION);
 
+enum ixgbe_legacy_rx_enum {
+	IXGBE_LEGACY_RX_FIELD_PKT_ADDR = 0,	/* Packet buffer address */
+	IXGBE_LEGACY_RX_FIELD_LENGTH,		/* Packet length */
+	IXGBE_LEGACY_RX_FIELD_CSUM,		/* Fragment checksum */
+	IXGBE_LEGACY_RX_FIELD_STATUS,		/* Descriptors status */
+	IXGBE_LEGACY_RX_FIELD_ERRORS,		/* Receive errors */
+	IXGBE_LEGACY_RX_FIELD_VLAN,		/* VLAN tag */
+};
+
+enum ixgbe_legacy_tx_enum {
+	IXGBE_LEGACY_TX_FIELD_PKT_ADDR = 0,	/* Packet buffer address */
+	IXGBE_LEGACY_TX_FIELD_LENGTH,		/* Packet length */
+	IXGBE_LEGACY_TX_FIELD_CSO,		/* Checksum offset */
+	IXGBE_LEGACY_TX_FIELD_CMD,		/* Descriptor control */
+	IXGBE_LEGACY_TX_FIELD_STATUS,		/* Descriptor status */
+	IXGBE_LEGACY_TX_FIELD_RSVD,		/* Reserved */
+	IXGBE_LEGACY_TX_FIELD_CSS,		/* Checksum start */
+	IXGBE_LEGACY_TX_FIELD_VLAN_TAG,		/* VLAN tag */
+};
+
+/* IXGBE Receive Descriptor - Legacy */
+static const struct tpacket_nic_desc_fld ixgbe_legacy_rx_desc[] = {
+	/* Packet buffer address */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_PKT_ADDR,
+				0,  64, 64,  BO_NATIVE)},
+	/* Packet length */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_LENGTH,
+				64, 16, 8,  BO_NATIVE)},
+	/* Fragment checksum */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_CSUM,
+				80, 16, 8,  BO_NATIVE)},
+	/* Descriptors status */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_STATUS,
+				96, 8, 8,  BO_NATIVE)},
+	/* Receive errors */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_ERRORS,
+				104, 8, 8,  BO_NATIVE)},
+	/* VLAN tag */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_VLAN,
+				112, 16, 8,  BO_NATIVE)},
+};
+
+/* IXGBE Transmit Descriptor - Legacy */
+static const struct tpacket_nic_desc_fld ixgbe_legacy_tx_desc[] = {
+	/* Packet buffer address */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_PKT_ADDR,
+				0,   64, 64,  BO_NATIVE)},
+	/* Data buffer length */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_LENGTH,
+				64,  16, 8,  BO_NATIVE)},
+	/* Checksum offset */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_CSO,
+				80,  8, 8,  BO_NATIVE)},
+	/* Command byte */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_CMD,
+				88,  8, 8,  BO_NATIVE)},
+	/* Transmitted status */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_STATUS,
+				96,  4, 1,  BO_NATIVE)},
+	/* Reserved */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_RSVD,
+				100, 4, 1,  BO_NATIVE)},
+	/* Checksum start */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_CSS,
+				104, 8, 8,  BO_NATIVE)},
+	/* VLAN tag */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_VLAN_TAG,
+				112, 16, 8,  BO_NATIVE)},
+};
+
 static bool ixgbe_check_cfg_remove(struct ixgbe_hw *hw, struct pci_dev *pdev);
 
 static int ixgbe_read_pci_cfg_word_parent(struct ixgbe_adapter *adapter,
@@ -3137,6 +3216,17 @@ static void ixgbe_enable_rx_drop(struct ixgbe_adapter *adapter,
 	IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(reg_idx), srrctl);
 }
 
+static bool ixgbe_have_user_queues(struct ixgbe_adapter *adapter)
+{
+	int i;
+
+	for (i = 0; i < MAX_RX_QUEUES; i++) {
+		if (adapter->user_queue_info[i].sk_handle)
+			return true;
+	}
+	return false;
+}
+
 static void ixgbe_disable_rx_drop(struct ixgbe_adapter *adapter,
 				  struct ixgbe_ring *ring)
 {
@@ -3171,7 +3261,8 @@ static void ixgbe_set_rx_drop_en(struct ixgbe_adapter *adapter)
 	 *  and performance reasons.
 	 */
 	if (adapter->num_vfs || (adapter->num_rx_queues > 1 &&
-	    !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en)) {
+	    !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en) ||
+	    ixgbe_have_user_queues(adapter)) {
 		for (i = 0; i < adapter->num_rx_queues; i++)
 			ixgbe_enable_rx_drop(adapter, adapter->rx_ring[i]);
 	} else {
@@ -7938,6 +8029,306 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 	kfree(fwd_adapter);
 }
 
+static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
+				       unsigned int start_from,
+				       unsigned int qpairs_num,
+				       struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index;
+
+	/* allocate whatever qpairs are available */
+	if (start_from == -1) {
+		unsigned int count = 0;
+
+		for (qpair_index = adapter->num_rx_queues;
+		     qpair_index < MAX_RX_QUEUES;
+		     qpair_index++) {
+			if (!adapter->user_queue_info[qpair_index].sk_handle) {
+				count++;
+				if (count == qpairs_num) {
+					start_from = qpair_index - count + 1;
+					break;
+				}
+			} else {
+				count = 0;
+			}
+		}
+	}
+
+	/* otherwise the caller specified exact queues */
+	if ((start_from > MAX_TX_QUEUES) ||
+	    (start_from > MAX_RX_QUEUES) ||
+	    (start_from + qpairs_num > MAX_TX_QUEUES) ||
+	    (start_from + qpairs_num > MAX_RX_QUEUES))
+		return -EINVAL;
+
+	/* If the qpairs are being used by the driver do not let user space
+	 * consume the queues. Also if the queue has already been allocated
+	 * to a socket, fail the request.
+	 */
+	for (qpair_index = start_from;
+	     qpair_index < start_from + qpairs_num;
+	     qpair_index++) {
+		if ((qpair_index < adapter->num_tx_queues) ||
+		    (qpair_index < adapter->num_rx_queues))
+			return -EINVAL;
+
+		if (adapter->user_queue_info[qpair_index].sk_handle)
+			return -EBUSY;
+	}
+
+	/* remember the sk handle for each queue pair */
+	for (qpair_index = start_from;
+	     qpair_index < start_from + qpairs_num;
+	     qpair_index++) {
+		adapter->user_queue_info[qpair_index].sk_handle = sk;
+		adapter->user_queue_info[qpair_index].num_of_regions = 0;
+	}
+
+	return 0;
+}
+
+static int ixgbe_ndo_get_split_queue_pairs(struct net_device *dev,
+					   unsigned int *start_from,
+					   unsigned int *qpairs_num,
+					   struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index;
+	*qpairs_num = 0;
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_RX_QUEUES;
+	     qpair_index++) {
+		if (adapter->user_queue_info[qpair_index].sk_handle == sk) {
+			if (*qpairs_num == 0)
+				*start_from = qpair_index;
+			*qpairs_num = *qpairs_num + 1;
+		}
+	}
+
+	return 0;
+}
+
+static int ixgbe_ndo_return_queue_pairs(struct net_device *dev, struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	struct ixgbe_user_queue_info *info;
+	unsigned int qpair_index;
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_RX_QUEUES;
+	     qpair_index++) {
+		info = &adapter->user_queue_info[qpair_index];
+
+		if (info->sk_handle == sk) {
+			info->sk_handle = NULL;
+			info->num_of_regions = 0;
+		}
+	}
+
+	return 0;
+}
+
+/* Rx descriptors start at 0x1000 and Tx descriptors start at 0x6000;
+ * both use 4K pages.
+ */
+#define RX_DESC_ADDR_OFFSET		0x1000
+#define TX_DESC_ADDR_OFFSET		0x6000
+#define PAGE_SIZE_4K			4096
+
+static int
+ixgbe_ndo_qpair_map_region(struct net_device *dev,
+			   struct tpacket_dev_qpair_map_region_info *info)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+	/* no need to map system memory to userspace for ixgbe */
+	info->tp_dev_sysm_sz = 0;
+	info->tp_num_sysm_map_regions = 0;
+
+	info->tp_dev_bar_sz = pci_resource_len(adapter->pdev, 0);
+	info->tp_num_map_regions = 2;
+
+	info->tp_regions[0].page_offset = RX_DESC_ADDR_OFFSET;
+	info->tp_regions[0].page_sz = PAGE_SIZE;
+	info->tp_regions[0].page_cnt = 1;
+	info->tp_regions[1].page_offset = TX_DESC_ADDR_OFFSET;
+	info->tp_regions[1].page_sz = PAGE_SIZE;
+	info->tp_regions[1].page_cnt = 1;
+
+	return 0;
+}
+
+static int ixgbe_ndo_get_device_desc_info(struct net_device *dev,
+					  struct tpacket_dev_info *dev_info)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	int max_queues;
+	int i;
+	__u8 flds_rx = sizeof(ixgbe_legacy_rx_desc) /
+		       sizeof(struct tpacket_nic_desc_fld);
+	__u8 flds_tx = sizeof(ixgbe_legacy_tx_desc) /
+		       sizeof(struct tpacket_nic_desc_fld);
+
+	max_queues = max(adapter->num_rx_queues, adapter->num_tx_queues);
+
+	dev_info->tp_device_id = adapter->hw.device_id;
+	dev_info->tp_vendor_id = adapter->hw.vendor_id;
+	dev_info->tp_subsystem_device_id = adapter->hw.subsystem_device_id;
+	dev_info->tp_subsystem_vendor_id = adapter->hw.subsystem_vendor_id;
+	dev_info->tp_revision_id = adapter->hw.revision_id;
+	dev_info->tp_numa_node = dev_to_node(&dev->dev);
+
+	dev_info->tp_num_total_qpairs = min(MAX_RX_QUEUES, MAX_TX_QUEUES);
+	dev_info->tp_num_inuse_qpairs = max_queues;
+
+	dev_info->tp_num_rx_desc_fmt = 1;
+	dev_info->tp_num_tx_desc_fmt = 1;
+
+	dev_info->tp_rx_dexpr[0].version = 1;
+	dev_info->tp_rx_dexpr[0].size = sizeof(union ixgbe_adv_rx_desc);
+	dev_info->tp_rx_dexpr[0].byte_order = BO_NATIVE;
+	dev_info->tp_rx_dexpr[0].num_of_fld = flds_rx;
+	for (i = 0; i < dev_info->tp_rx_dexpr[0].num_of_fld; i++)
+		memcpy(&dev_info->tp_rx_dexpr[0].fields[i],
+		       &ixgbe_legacy_rx_desc[i],
+		       sizeof(struct tpacket_nic_desc_fld));
+
+	dev_info->tp_tx_dexpr[0].version = 1;
+	dev_info->tp_tx_dexpr[0].size = sizeof(union ixgbe_adv_tx_desc);
+	dev_info->tp_tx_dexpr[0].byte_order = BO_NATIVE;
+	dev_info->tp_tx_dexpr[0].num_of_fld = flds_tx;
+	for (i = 0; i < dev_info->tp_tx_dexpr[0].num_of_fld; i++)
+		memcpy(&dev_info->tp_tx_dexpr[0].fields[i],
+		       &ixgbe_legacy_tx_desc[i],
+		       sizeof(struct tpacket_nic_desc_fld));
+
+	return 0;
+}
+
+static int
+ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
+	unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
+	unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
+	unsigned long dummy_page_phy;
+	pgprot_t pre_vm_page_prot;
+	unsigned long start;
+	unsigned int i;
+	int err;
+
+	if (!dummy_page_buf) {
+		dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
+		if (!dummy_page_buf)
+			return -ENOMEM;
+
+		for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
+			dummy_page_buf[i] = 0xdeadbeef;
+	}
+
+	dummy_page_phy = virt_to_phys(dummy_page_buf);
+	pre_vm_page_prot = vma->vm_page_prot;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	/* assume the vm_start is 4K aligned address */
+	for (start = vma->vm_start;
+	     start < vma->vm_end;
+	     start += PAGE_SIZE_4K) {
+		if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
+			err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
+					      vma->vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		} else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
+			err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
+					      vma->vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		} else {
+			unsigned long addr = dummy_page_phy >> PAGE_SHIFT;
+
+			err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
+					      pre_vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		}
+	}
+	return 0;
+}
+
+static int
+ixgbe_ndo_val_dma_mem_region_map(struct net_device *dev,
+				 struct tpacket_dma_mem_region *region,
+				 struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index, i;
+	struct ixgbe_user_queue_info *info;
+
+#ifdef CONFIG_DMA_MEMORY_PROTECTION
+	/* IOVA not equal to physical address means IOMMU takes effect */
+	if (region->phys_addr == region->iova)
+		return -EFAULT;
+#endif
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_RX_QUEUES;
+	     qpair_index++) {
+		info = &adapter->user_queue_info[qpair_index];
+		i = info->num_of_regions;
+
+		if (info->sk_handle != sk)
+			continue;
+
+		if (info->num_of_regions >= MAX_USER_DMA_REGIONS_PER_SOCKET)
+			return -EFAULT;
+
+		info->regions[i].dma_region_size = region->size;
+		info->regions[i].direction = region->direction;
+		info->regions[i].dma_region_iova = region->iova;
+		info->num_of_regions++;
+	}
+
+	return 0;
+}
+
+static int
+ixgbe_get_dma_region_info(struct net_device *dev,
+			  struct tpacket_dma_mem_region *region,
+			  struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	struct ixgbe_user_queue_info *info;
+	unsigned int qpair_index;
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_RX_QUEUES;
+	     qpair_index++) {
+		int i;
+
+		info = &adapter->user_queue_info[qpair_index];
+		if (info->sk_handle != sk)
+			continue;
+
+		for (i = 0; i < info->num_of_regions; i++) {
+			struct ixgbe_user_dma_region *r;
+
+			r = &info->regions[i];
+			if ((r->dma_region_size == region->size) &&
+			    (r->direction == region->direction)) {
+				region->iova = r->dma_region_iova;
+				return 0;
+			}
+		}
+	}
+
+	return -1;
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
@@ -7982,6 +8373,15 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_bridge_getlink	= ixgbe_ndo_bridge_getlink,
 	.ndo_dfwd_add_station	= ixgbe_fwd_add,
 	.ndo_dfwd_del_station	= ixgbe_fwd_del,
+
+	.ndo_split_queue_pairs	= ixgbe_ndo_split_queue_pairs,
+	.ndo_get_split_queue_pairs = ixgbe_ndo_get_split_queue_pairs,
+	.ndo_return_queue_pairs	   = ixgbe_ndo_return_queue_pairs,
+	.ndo_get_device_desc_info  = ixgbe_ndo_get_device_desc_info,
+	.ndo_direct_qpair_page_map = ixgbe_ndo_qpair_page_map,
+	.ndo_get_dma_region_info   = ixgbe_get_dma_region_info,
+	.ndo_get_device_qpair_map_region_info = ixgbe_ndo_qpair_map_region,
+	.ndo_validate_dma_mem_region_map = ixgbe_ndo_val_dma_mem_region_map,
 };
 
 /**
@@ -8203,7 +8603,9 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	hw->back = adapter;
 	adapter->msg_enable = netif_msg_init(debug, DEFAULT_MSG_ENABLE);
 
-	hw->hw_addr = ioremap(pci_resource_start(pdev, 0),
+	hw->pci_hw_addr = pci_resource_start(pdev, 0);
+
+	hw->hw_addr = ioremap(hw->pci_hw_addr,
 			      pci_resource_len(pdev, 0));
 	adapter->io_addr = hw->hw_addr;
 	if (!hw->hw_addr) {
@@ -8875,6 +9277,7 @@ module_init(ixgbe_init_module);
  **/
 static void __exit ixgbe_exit_module(void)
 {
+	kfree(dummy_page_buf);
 #ifdef CONFIG_IXGBE_DCA
 	dca_unregister_notify(&dca_notifier);
 #endif
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index d101b25..4034d31 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -3180,6 +3180,7 @@ struct ixgbe_mbx_info {
 
 struct ixgbe_hw {
 	u8 __iomem			*hw_addr;
+	phys_addr_t			pci_hw_addr;
 	void				*back;
 	struct ixgbe_mac_info		mac;
 	struct ixgbe_addr_filter_info	addr_ctrl;
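
For readers following the "allocate whatever is available" branch of ixgbe_ndo_split_queue_pairs() above: it is a first-fit search for a run of consecutive free queue pairs. A minimal user-space sketch of that search, assuming a plain boolean array stands in for the user_queue_info[].sk_handle checks; find_free_run() and its parameters are hypothetical names, not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return the first index of a run of `want`
 * consecutive free slots in in_use[first..total), or -1 if there is
 * no such run.
 */
static int find_free_run(const bool *in_use, size_t first, size_t total,
			 size_t want)
{
	size_t count = 0;
	size_t i;

	if (want == 0)
		return -1;

	for (i = first; i < total; i++) {
		if (in_use[i]) {
			count = 0;	/* run broken, start counting again */
			continue;
		}
		if (++count == want)
			return (int)(i - want + 1);
	}
	return -1;
}
```

Unlike the patch, the sketch reports a failed search explicitly. In the patch a failed search leaves start_from at its initial value, and the request is then rejected by the range check that follows.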

^ permalink raw reply related

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: John Fastabend @ 2015-01-13  4:42 UTC (permalink / raw)
  To: netdev
  Cc: danny.zhou, nhorman, dborkman, john.ronciak, hannes, brouer,
	Or Gerlitz
In-Reply-To: <20150113043509.29985.33515.stgit@nitbit.x32>

On 01/12/2015 08:35 PM, John Fastabend wrote:
> This patch adds net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.
>

+cc: Or Gerlitz

[...]

> +
> +struct tpacket_dev_info {
> +	__u16	tp_device_id;
> +	__u16	tp_vendor_id;
> +	__u16	tp_subsystem_device_id;
> +	__u16	tp_subsystem_vendor_id;
> +	__u32	tp_numa_node;
> +	__u32	tp_revision_id;
> +	__u32	tp_num_total_qpairs;
> +	__u32	tp_num_inuse_qpairs;
> +	__u32	tp_num_rx_desc_fmt;
> +	__u32	tp_num_tx_desc_fmt;
> +	struct tpacket_nic_desc_expr tp_rx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
> +	struct tpacket_nic_desc_expr tp_tx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];

At least one reason this is still an RFC is that this needs to be
cleaned up.

net/packet/af_packet.c: In function ‘packet_getsockopt’:
net/packet/af_packet.c:3918:1: warning: the frame size of 9264 bytes is larger than 2048 bytes [-Wframe-larger-than=]

but I wanted to see if there was any feedback.

Thanks,
John

-- 
John Fastabend         Intel Corporation
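
A frame-size warning like the one quoted above is commonly addressed by keeping the large structure off the stack. A minimal user-space sketch of that idea, assuming a hypothetical struct big_info of roughly the reported size; in kernel code the allocation would typically be kzalloc() and the copy-out copy_to_user(), and fill_info() is an illustrative name rather than anything in the patch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a large reply structure (about 9 KB). */
struct big_info {
	unsigned int num_fmts;
	unsigned char payload[9000];
};

/* Fill a caller-supplied buffer of `len` bytes; returns 0 on success
 * and -1 on failure. The large object lives on the heap, so this
 * function's own stack frame stays small.
 */
static int fill_info(void *out, size_t len)
{
	struct big_info *info;

	if (len < sizeof(*info))
		return -1;

	info = calloc(1, sizeof(*info));
	if (!info)
		return -1;

	info->num_fmts = 1;
	memcpy(out, info, sizeof(*info));	/* stands in for the copy-out */
	free(info);
	return 0;
}
```

Whether this is the cleanup intended for the RFC is not stated in the thread; it is only one possible approach.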

^ permalink raw reply

* why are IPv6 addresses removed on link down
From: David Ahern @ 2015-01-13  5:06 UTC (permalink / raw)
  To: netdev@vger.kernel.org

We noticed that IPv6 addresses are removed on a link down. e.g.,
   ip link set dev eth1 down


Looking at the code it appears to be this code path in addrconf.c:

         case NETDEV_DOWN:
         case NETDEV_UNREGISTER:
                 /*
                  *      Remove all addresses from this interface.
                  */
                 addrconf_ifdown(dev, event != NETDEV_DOWN);
                 break;

IPv4 addresses are NOT removed on a link down. Is there a particular 
reason IPv6 addresses are?

Thanks,
David
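
In the snippet quoted above, both events reach the same addrconf_ifdown() call and differ only in the second argument, which is 0 for NETDEV_DOWN and 1 for NETDEV_UNREGISTER. A minimal sketch of that dispatch, assuming hypothetical event codes (the real constants are defined by the kernel) and a hypothetical helper name:

```c
/* Hypothetical event codes for illustration only. */
enum net_event { EV_DOWN = 1, EV_UNREGISTER = 2, EV_UP = 3 };

/* Returns -1 when the event does not reach the quoted call, otherwise
 * the value the second argument (event != NETDEV_DOWN) would take.
 */
static int ifdown_second_arg(enum net_event event)
{
	switch (event) {
	case EV_DOWN:
	case EV_UNREGISTER:
		return event != EV_DOWN;
	default:
		return -1;
	}
}
```

So a plain link down takes the same path as an unregister, just with the flag cleared.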

^ permalink raw reply

* Re: [PATCH net-next] rhashtable: Lower/upper bucket may map to same lock while shrinking
From: David Miller @ 2015-01-13  5:25 UTC (permalink / raw)
  To: tgraf; +Cc: fengguang.wu, lkp, linux-kernel, netfilter-devel, coreteam,
	netdev
In-Reply-To: <20150112235821.GB16617@casper.infradead.org>

From: Thomas Graf <tgraf@suug.ch>
Date: Mon, 12 Jan 2015 23:58:21 +0000

> Each per bucket lock covers a configurable number of buckets. While
> shrinking, two buckets in the old table contain entries for a single
> bucket in the new table. We need to lock down both while linking.
> Check if they are protected by different locks to avoid a recursive
> lock.
> 
> Fixes: 97defe1e ("rhashtable: Per bucket locks & deferred expansion/shrinking")
> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Applied.
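
The fix described in the quoted commit message can be illustrated with a small user-space sketch: when two old-table buckets merge into one, take the second lock only if it is a different lock from the first. Everything here is an assumption for illustration: the toy_lock type, the NLOCKS value, and the bucket-to-lock mapping (bucket index masked by the number of locks) are hypothetical and not taken from the rhashtable source.

```c
struct toy_lock { int held; };	/* hypothetical non-recursive lock */

#define NLOCKS 4		/* assumed lock count, a power of two */

static struct toy_lock locks[NLOCKS];

/* Several buckets share one lock: mask the bucket index. */
static struct toy_lock *bucket_lock(unsigned int bucket)
{
	return &locks[bucket & (NLOCKS - 1)];
}

/* Returns 0 on success, -1 if already held (a real non-recursive
 * lock would deadlock here instead).
 */
static int toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return -1;
	l->held = 1;
	return 0;
}

static void toy_lock_release(struct toy_lock *l) { l->held = 0; }

/* Lock both buckets; returns the number of distinct locks taken,
 * or -1 on error.
 */
static int lock_bucket_pair(unsigned int b1, unsigned int b2)
{
	struct toy_lock *l1 = bucket_lock(b1);
	struct toy_lock *l2 = bucket_lock(b2);

	if (toy_lock_acquire(l1))
		return -1;
	if (l1 == l2)
		return 1;	/* same lock covers both buckets */
	if (toy_lock_acquire(l2)) {
		toy_lock_release(l1);
		return -1;
	}
	return 2;
}

static void unlock_bucket_pair(unsigned int b1, unsigned int b2)
{
	struct toy_lock *l1 = bucket_lock(b1);
	struct toy_lock *l2 = bucket_lock(b2);

	if (l1 != l2)
		toy_lock_release(l2);
	toy_lock_release(l1);
}
```

With four locks, buckets 1 and 5 share a lock, so lock_bucket_pair(1, 5) takes it once; buckets 1 and 2 use different locks and both are taken.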

^ permalink raw reply

* Re: [PATCH net-next] rhashtable: Add MAINTAINERS entry
From: David Miller @ 2015-01-13  5:25 UTC (permalink / raw)
  To: tgraf; +Cc: netdev
In-Reply-To: <cf5552c116d0fe998a3b660d5850b7c2efd814b5.1421107185.git.tgraf@suug.ch>

From: Thomas Graf <tgraf@suug.ch>
Date: Tue, 13 Jan 2015 01:01:24 +0100

> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Applied.

^ permalink raw reply
