Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH: net-next] core: make GRO methods static.
From: David Miller @ 2012-11-29 18:18 UTC (permalink / raw)
  To: ramirose; +Cc: netdev
In-Reply-To: <1354175725-23382-1-git-send-email-ramirose@gmail.com>

From: Rami Rosen <ramirose@gmail.com>
Date: Thu, 29 Nov 2012 09:55:25 +0200

> This patch changes three methods to be static and removes their
> EXPORT_SYMBOLs in core/dev.c and their external declaration in
> netdevice.h. The methods, dev_gro_receive(), napi_frags_finish() and
> napi_skb_finish(), which are in the GRO rx path, are not used
> outside core/dev.c.
> 
> Signed-off-by: Rami Rosen <ramirose@gmail.com>

Applied, thanks Rami.

^ permalink raw reply

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
From: Eric Dumazet @ 2012-11-29 18:24 UTC (permalink / raw)
  To: David Miller
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert
In-Reply-To: <20121129.130506.329791401604974668.davem@davemloft.net>

On Thu, 2012-11-29 at 13:05 -0500, David Miller wrote:

> Replace 1024 in your formula with X and the limit is therefore
> controlled by X.
> 
> So it seems the high_thresh can be replaced with an appropriate
> determination of X to size the hash.
> 
> If X is 256, that limits us to ~130MB per cpu.
> 

per host, as the table would be shared by all cpus.

Lets say the default mem limit would be 4MB, I believe the
percpu_counter is cheap enough that we still allow a 1024 buckets table,
and allow receiving full size IP packets (If not under frag attack)

(If we divided 4MB by 64, we would have a 64KB limit per bucket, which
is too small because of truesize overhead)

^ permalink raw reply

* Re: [PATCH] bonding: rlb mode of bond should not alter ARP originating via bridge
From: Jay Vosburgh @ 2012-11-29 18:28 UTC (permalink / raw)
  To: Zheng Li; +Cc: netdev, andy, linux-kernel, davem, joe.jin
In-Reply-To: <1354096624-22380-1-git-send-email-zheng.x.li@oracle.com>

Zheng Li <zheng.x.li@oracle.com> wrote:

>Do not modify or load balance ARP packets passing through balance-alb
>mode (wherein the ARP did not originate locally, and arrived via a bridge).
>
>Modifying pass-through ARP replies causes an incorrect MAC address 
>to be placed into the ARP packet, rendering peers unable to communicate 
>with the actual destination from which the ARP reply originated.
>
>Load balancing pass-through ARP requests causes an entry to be
>created for the peer in the rlb table, and bond_alb_monitor will
>occasionally issue ARP updates to all peers in the table instrucing them
>as to which MAC address they should communicate with; this occurs when
>some event sets rx_ntt.  In the bridged case, however, the MAC address
>used for the update would be the MAC of the slave, not the actual source
>MAC of the originating destination.  This would render peers unable to
>communicate with the destinations beyond the bridge.
>
>Signed-off-by: Zheng Li <zheng.x.li@oracle.com>
>Cc: Jay Vosburgh <fubar@us.ibm.com>
>Cc: Andy Gospodarek <andy@greyhouse.net>
>Cc: "David S. Miller" <davem@davemloft.net>

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>

>---
> drivers/net/bonding/bond_alb.c |    6 ++++++
> drivers/net/bonding/bonding.h  |   13 +++++++++++++
> 2 files changed, 19 insertions(+), 0 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index e15cc11..6fecb52 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -694,6 +694,12 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond)
> 	struct arp_pkt *arp = arp_pkt(skb);
> 	struct slave *tx_slave = NULL;
>
>+	/* Don't modify or load balance ARPs that do not originate locally
>+	 * (e.g.,arrive via a bridge).
>+	 */
>+	if (!bond_slave_has_mac(bond, arp->mac_src))
>+		return NULL;
>+
> 	if (arp->op_code == htons(ARPOP_REPLY)) {
> 		/* the arp must be sent on the selected
> 		* rx channel
>diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>index f8af2fc..6dded56 100644
>--- a/drivers/net/bonding/bonding.h
>+++ b/drivers/net/bonding/bonding.h
>@@ -22,6 +22,7 @@
> #include <linux/in6.h>
> #include <linux/netpoll.h>
> #include <linux/inetdevice.h>
>+#include <linux/etherdevice.h>
> #include "bond_3ad.h"
> #include "bond_alb.h"
>
>@@ -450,6 +451,18 @@ static inline void bond_destroy_proc_dir(struct bond_net *bn)
> }
> #endif
>
>+static inline struct slave *bond_slave_has_mac(struct bonding *bond,
>+					       const u8 *mac)
>+{
>+	int i = 0;
>+	struct slave *tmp;
>+
>+	bond_for_each_slave(bond, tmp, i)
>+		if (ether_addr_equal_64bits(mac, tmp->dev->dev_addr))
>+			return tmp;
>+
>+	return NULL;
>+}
>
> /* exported from bond_main.c */
> extern int bond_net_id;
>-- 
>1.7.6.5
>

^ permalink raw reply

* Re: [PATCH v2 3/3] pppoatm: protect against freeing of vcc
From: chas williams - CONTRACTOR @ 2012-11-29 18:29 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Krzysztof Mazur, David Laight, davem, netdev, linux-kernel,
	nathan
In-Reply-To: <1354212708.21562.205.camel@shinybook.infradead.org>

On Thu, 29 Nov 2012 18:11:48 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> We do see the 'packet received for unknown VCC' complaint, after we
> reboot the host without resetting the card. And as shown in the commit I
> just referenced, we rely on the !ATM_VF_READY check in order to prevent
> a use-after-free when the tasklet is sending packets up a VCC that's
> just been closed.

well that behavior is just crap.  why is it delivering cells for a
vpi/vci pair that is not open?  regardless, i pointed out the behavior
of find_vcc() just in case you need to do some clean up on data that is
coming back while you are attempting to finish operations during your
driver's close() but this cleanup might not be happening since you
arent able to get a reference to the vcc so you can pop() the data or
whatever you might need to do.

> > again part of this is poor synchronization.  the detach protocol (i.e.
> > push of a NULL skb) should be flushing any pending transmits and
> > shutting down whatever in the protocol is doing any sending and
> > receiving.  however, i release this might be difficult to do since the
> > detach protocol is invoked in such a strange way.
> 
> You mean that (e.g.) pppoatm_push(vcc, NULL) should be waiting for any
> pending TX skb (that has already been passed off to the driver) to
> *complete*? How would it even do that?

i think the order of the vcc_destroy_socket() operations is a bit
wrong.  it should call detach protocol (i.e. push a NULL).  this should
cause the attached protocol to stop any future sends and receives, and
it CAN sleep in this context (and only this context -- generally
sending cannot sleep which is why this might seem confusing) to do
whatever it needs to do to wait for the attached protocol to clean up
queues, flush data etc.

then the driver close() should be called.  this takes care of cleaning
up any pending tx or rx that is in the hardware.  and of course,
close() can sleep since it will be in a interrutible context.

the pop() might be screwed up here though.  you might have skb's in the
driver that should be pop()'ed with the formerly attached protocol.

you could wait for pending tx's sent to the driver.  you know that your
attached protocol's pop() will be called.  keep count of the
outstanding transmit skb's.  you cant do this though if you close() the
vcc before you detach the protocol.

^ permalink raw reply

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
From: David Miller @ 2012-11-29 18:31 UTC (permalink / raw)
  To: eric.dumazet
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert
In-Reply-To: <1354213492.3299.22.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Nov 2012 10:24:52 -0800

> On Thu, 2012-11-29 at 13:05 -0500, David Miller wrote:
> 
>> Replace 1024 in your formula with X and the limit is therefore
>> controlled by X.
>> 
>> So it seems the high_thresh can be replaced with an appropriate
>> determination of X to size the hash.
>> 
>> If X is 256, that limits us to ~130MB per cpu.
>> 
> 
> per host, as the table would be shared by all cpus.

I think a per-cpu hash might make more sense.

This would scale our limits to the size of the system.

I'm willing to be convinced otherwise, but it seems the most
sensible thing to do.

^ permalink raw reply

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
From: Eric Dumazet @ 2012-11-29 18:33 UTC (permalink / raw)
  To: David Miller
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert
In-Reply-To: <20121129.133108.427624036846294750.davem@davemloft.net>

On Thu, 2012-11-29 at 13:31 -0500, David Miller wrote:

> I think a per-cpu hash might make more sense.
> 
> This would scale our limits to the size of the system.
> 
> I'm willing to be convinced otherwise, but it seems the most
> sensible thing to do.

It would break in many cases, when frags are spreaded on different cpus.

^ permalink raw reply

* [PATCH net-next 4/7] openvswitch: add ipv6 'set' action
From: Jesse Gross @ 2012-11-29 18:35 UTC (permalink / raw)
  To: David Miller; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1354214149-33651-1-git-send-email-jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>

From: Ansis Atteka <aatteka-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>

This patch adds ipv6 set action functionality. It allows to change
traffic class, flow label, hop-limit, ipv6 source and destination
address fields.

Signed-off-by: Ansis Atteka <aatteka-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
---
 net/openvswitch/actions.c  |   93 ++++++++++++++++++++++++++++++++++++++++++++
 net/openvswitch/datapath.c |   20 ++++++++++
 2 files changed, 113 insertions(+)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 0811447..a58ed27 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -28,6 +28,7 @@
 #include <linux/if_arp.h>
 #include <linux/if_vlan.h>
 #include <net/ip.h>
+#include <net/ipv6.h>
 #include <net/checksum.h>
 #include <net/dsfield.h>
 
@@ -162,6 +163,53 @@ static void set_ip_addr(struct sk_buff *skb, struct iphdr *nh,
 	*addr = new_addr;
 }
 
+static void update_ipv6_checksum(struct sk_buff *skb, u8 l4_proto,
+				 __be32 addr[4], const __be32 new_addr[4])
+{
+	int transport_len = skb->len - skb_transport_offset(skb);
+
+	if (l4_proto == IPPROTO_TCP) {
+		if (likely(transport_len >= sizeof(struct tcphdr)))
+			inet_proto_csum_replace16(&tcp_hdr(skb)->check, skb,
+						  addr, new_addr, 1);
+	} else if (l4_proto == IPPROTO_UDP) {
+		if (likely(transport_len >= sizeof(struct udphdr))) {
+			struct udphdr *uh = udp_hdr(skb);
+
+			if (uh->check || skb->ip_summed == CHECKSUM_PARTIAL) {
+				inet_proto_csum_replace16(&uh->check, skb,
+							  addr, new_addr, 1);
+				if (!uh->check)
+					uh->check = CSUM_MANGLED_0;
+			}
+		}
+	}
+}
+
+static void set_ipv6_addr(struct sk_buff *skb, u8 l4_proto,
+			  __be32 addr[4], const __be32 new_addr[4],
+			  bool recalculate_csum)
+{
+	if (recalculate_csum)
+		update_ipv6_checksum(skb, l4_proto, addr, new_addr);
+
+	skb->rxhash = 0;
+	memcpy(addr, new_addr, sizeof(__be32[4]));
+}
+
+static void set_ipv6_tc(struct ipv6hdr *nh, u8 tc)
+{
+	nh->priority = tc >> 4;
+	nh->flow_lbl[0] = (nh->flow_lbl[0] & 0x0F) | ((tc & 0x0F) << 4);
+}
+
+static void set_ipv6_fl(struct ipv6hdr *nh, u32 fl)
+{
+	nh->flow_lbl[0] = (nh->flow_lbl[0] & 0xF0) | (fl & 0x000F0000) >> 16;
+	nh->flow_lbl[1] = (fl & 0x0000FF00) >> 8;
+	nh->flow_lbl[2] = fl & 0x000000FF;
+}
+
 static void set_ip_ttl(struct sk_buff *skb, struct iphdr *nh, u8 new_ttl)
 {
 	csum_replace2(&nh->check, htons(nh->ttl << 8), htons(new_ttl << 8));
@@ -195,6 +243,47 @@ static int set_ipv4(struct sk_buff *skb, const struct ovs_key_ipv4 *ipv4_key)
 	return 0;
 }
 
+static int set_ipv6(struct sk_buff *skb, const struct ovs_key_ipv6 *ipv6_key)
+{
+	struct ipv6hdr *nh;
+	int err;
+	__be32 *saddr;
+	__be32 *daddr;
+
+	err = make_writable(skb, skb_network_offset(skb) +
+			    sizeof(struct ipv6hdr));
+	if (unlikely(err))
+		return err;
+
+	nh = ipv6_hdr(skb);
+	saddr = (__be32 *)&nh->saddr;
+	daddr = (__be32 *)&nh->daddr;
+
+	if (memcmp(ipv6_key->ipv6_src, saddr, sizeof(ipv6_key->ipv6_src)))
+		set_ipv6_addr(skb, ipv6_key->ipv6_proto, saddr,
+			      ipv6_key->ipv6_src, true);
+
+	if (memcmp(ipv6_key->ipv6_dst, daddr, sizeof(ipv6_key->ipv6_dst))) {
+		unsigned int offset = 0;
+		int flags = IP6_FH_F_SKIP_RH;
+		bool recalc_csum = true;
+
+		if (ipv6_ext_hdr(nh->nexthdr))
+			recalc_csum = ipv6_find_hdr(skb, &offset,
+						    NEXTHDR_ROUTING, NULL,
+						    &flags) != NEXTHDR_ROUTING;
+
+		set_ipv6_addr(skb, ipv6_key->ipv6_proto, daddr,
+			      ipv6_key->ipv6_dst, recalc_csum);
+	}
+
+	set_ipv6_tc(nh, ipv6_key->ipv6_tclass);
+	set_ipv6_fl(nh, ntohl(ipv6_key->ipv6_label));
+	nh->hop_limit = ipv6_key->ipv6_hlimit;
+
+	return 0;
+}
+
 /* Must follow make_writable() since that can move the skb data. */
 static void set_tp_port(struct sk_buff *skb, __be16 *port,
 			 __be16 new_port, __sum16 *check)
@@ -347,6 +436,10 @@ static int execute_set_action(struct sk_buff *skb,
 		err = set_ipv4(skb, nla_data(nested_attr));
 		break;
 
+	case OVS_KEY_ATTR_IPV6:
+		err = set_ipv6(skb, nla_data(nested_attr));
+		break;
+
 	case OVS_KEY_ATTR_TCP:
 		err = set_tcp(skb, nla_data(nested_attr));
 		break;
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 4c4b62c..fd4a6a4 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -479,6 +479,7 @@ static int validate_set(const struct nlattr *a,
 
 	switch (key_type) {
 	const struct ovs_key_ipv4 *ipv4_key;
+	const struct ovs_key_ipv6 *ipv6_key;
 
 	case OVS_KEY_ATTR_PRIORITY:
 	case OVS_KEY_ATTR_ETHERNET:
@@ -500,6 +501,25 @@ static int validate_set(const struct nlattr *a,
 
 		break;
 
+	case OVS_KEY_ATTR_IPV6:
+		if (flow_key->eth.type != htons(ETH_P_IPV6))
+			return -EINVAL;
+
+		if (!flow_key->ip.proto)
+			return -EINVAL;
+
+		ipv6_key = nla_data(ovs_key);
+		if (ipv6_key->ipv6_proto != flow_key->ip.proto)
+			return -EINVAL;
+
+		if (ipv6_key->ipv6_frag != flow_key->ip.frag)
+			return -EINVAL;
+
+		if (ntohl(ipv6_key->ipv6_label) & 0xFFF00000)
+			return -EINVAL;
+
+		break;
+
 	case OVS_KEY_ATTR_TCP:
 		if (flow_key->ip.proto != IPPROTO_TCP)
 			return -EINVAL;
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 5/7] net: openvswitch: use this_cpu_ptr per-cpu helper
From: Jesse Gross @ 2012-11-29 18:35 UTC (permalink / raw)
  To: David Miller
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	Shan Wei
In-Reply-To: <1354214149-33651-1-git-send-email-jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>

From: Shan Wei <davidshan-1Nz4purKYjRBDgjK7y7TUQ@public.gmane.org>

just use more faster this_cpu_ptr instead of per_cpu_ptr(p, smp_processor_id());

Signed-off-by: Shan Wei <davidshan-1Nz4purKYjRBDgjK7y7TUQ@public.gmane.org>
Reviewed-by: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
---
 net/openvswitch/datapath.c |    4 ++--
 net/openvswitch/vport.c    |    5 ++---
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index fd4a6a4..7b1d6d2 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -208,7 +208,7 @@ void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb)
 	int error;
 	int key_len;
 
-	stats = per_cpu_ptr(dp->stats_percpu, smp_processor_id());
+	stats = this_cpu_ptr(dp->stats_percpu);
 
 	/* Extract flow from 'skb' into 'key'. */
 	error = ovs_flow_extract(skb, p->port_no, &key, &key_len);
@@ -282,7 +282,7 @@ int ovs_dp_upcall(struct datapath *dp, struct sk_buff *skb,
 	return 0;
 
 err:
-	stats = per_cpu_ptr(dp->stats_percpu, smp_processor_id());
+	stats = this_cpu_ptr(dp->stats_percpu);
 
 	u64_stats_update_begin(&stats->sync);
 	stats->n_lost++;
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 03779e8..70af0be 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -333,8 +333,7 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb)
 {
 	struct vport_percpu_stats *stats;
 
-	stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id());
-
+	stats = this_cpu_ptr(vport->percpu_stats);
 	u64_stats_update_begin(&stats->sync);
 	stats->rx_packets++;
 	stats->rx_bytes += skb->len;
@@ -359,7 +358,7 @@ int ovs_vport_send(struct vport *vport, struct sk_buff *skb)
 	if (likely(sent)) {
 		struct vport_percpu_stats *stats;
 
-		stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id());
+		stats = this_cpu_ptr(vport->percpu_stats);
 
 		u64_stats_update_begin(&stats->sync);
 		stats->tx_packets++;
-- 
1.7.9.5

^ permalink raw reply related

* [GIT net-next] Open vSwitch
From: Jesse Gross @ 2012-11-29 18:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, dev

This series of improvements for 3.8/net-next contains four components:
 * Support for modifying IPv6 headers
 * Support for matching and setting skb->mark for better integration with
   things like iptables
 * Ability to recognize the EtherType for RARP packets
 * Two small performance enhancements

The movement of ipv6_find_hdr() into exthdrs_core.c causes two small merge
conflicts.  I left it as is but can do the merge if you want.  The conflicts
are:
 * ipv6_find_hdr() and ipv6_find_tlv() were both moved to the bottom of
   exthdrs_core.c.  Both should stay.
 * A new use of ipv6_find_hdr() was added to net/netfilter/ipvs/ip_vs_core.c
   after this patch.  The IPVS user has two instances of the old constant
   name IP6T_FH_F_FRAG which has been renamed to IP6_FH_F_FRAG.

The following changes since commit d04d382980c86bdee9960c3eb157a73f8ed230cc:

  openvswitch: Store flow key len if ARP opcode is not request or reply. (2012-10-30 17:17:09 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch.git master

for you to fetch changes up to 92eb1d477145b2e7780b5002e856f70b8c3d74da:

  openvswitch: Use RCU callback when detaching netdevices. (2012-11-28 14:04:34 -0800)

----------------------------------------------------------------
Ansis Atteka (3):
      ipv6: improve ipv6_find_hdr() to skip empty routing headers
      openvswitch: add ipv6 'set' action
      openvswitch: add skb mark matching and set action

Jesse Gross (2):
      ipv6: Move ipv6_find_hdr() out of Netfilter code.
      openvswitch: Use RCU callback when detaching netdevices.

Mehak Mahajan (1):
      openvswitch: Process RARP packets with ethertype 0x8035 similar to ARP packets.

Shan Wei (1):
      net: openvswitch: use this_cpu_ptr per-cpu helper

 include/linux/netfilter_ipv6/ip6_tables.h |    9 ---
 include/linux/openvswitch.h               |    1 +
 include/net/ipv6.h                        |   10 +++
 net/ipv6/exthdrs_core.c                   |  123 +++++++++++++++++++++++++++++
 net/ipv6/netfilter/ip6_tables.c           |  103 ------------------------
 net/netfilter/xt_HMARK.c                  |    8 +-
 net/openvswitch/actions.c                 |   97 +++++++++++++++++++++++
 net/openvswitch/datapath.c                |   27 ++++++-
 net/openvswitch/flow.c                    |   28 ++++++-
 net/openvswitch/flow.h                    |    8 +-
 net/openvswitch/vport-netdev.c            |   14 +++-
 net/openvswitch/vport-netdev.h            |    3 +
 net/openvswitch/vport.c                   |    5 +-
 13 files changed, 304 insertions(+), 132 deletions(-)

^ permalink raw reply

* [PATCH net-next 1/7] openvswitch: Process RARP packets with ethertype 0x8035 similar to ARP packets.
From: Jesse Gross @ 2012-11-29 18:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, dev, Mehak Mahajan
In-Reply-To: <1354214149-33651-1-git-send-email-jesse@nicira.com>

From: Mehak Mahajan <mmahajan@nicira.com>

With this commit, OVS will match the data in the RARP packets having
ethertype 0x8035, in the same way as the data in the ARP packets.

Signed-off-by: Mehak Mahajan <mmahajan@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
---
 net/openvswitch/flow.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 733cbf4..e6ce902 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -689,7 +689,8 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 			}
 		}
 
-	} else if (key->eth.type == htons(ETH_P_ARP) && arphdr_ok(skb)) {
+	} else if ((key->eth.type == htons(ETH_P_ARP) ||
+		   key->eth.type == htons(ETH_P_RARP)) && arphdr_ok(skb)) {
 		struct arp_eth_header *arp;
 
 		arp = (struct arp_eth_header *)skb_network_header(skb);
@@ -1086,7 +1087,8 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
 			if (err)
 				return err;
 		}
-	} else if (swkey->eth.type == htons(ETH_P_ARP)) {
+	} else if (swkey->eth.type == htons(ETH_P_ARP) ||
+		   swkey->eth.type == htons(ETH_P_RARP)) {
 		const struct ovs_key_arp *arp_key;
 
 		if (!(attrs & (1 << OVS_KEY_ATTR_ARP)))
@@ -1222,7 +1224,8 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb)
 		ipv6_key->ipv6_tclass = swkey->ip.tos;
 		ipv6_key->ipv6_hlimit = swkey->ip.ttl;
 		ipv6_key->ipv6_frag = swkey->ip.frag;
-	} else if (swkey->eth.type == htons(ETH_P_ARP)) {
+	} else if (swkey->eth.type == htons(ETH_P_ARP) ||
+		   swkey->eth.type == htons(ETH_P_RARP)) {
 		struct ovs_key_arp *arp_key;
 
 		nla = nla_reserve(skb, OVS_KEY_ATTR_ARP, sizeof(*arp_key));
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 2/7] ipv6: Move ipv6_find_hdr() out of Netfilter code.
From: Jesse Gross @ 2012-11-29 18:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, dev
In-Reply-To: <1354214149-33651-1-git-send-email-jesse@nicira.com>

Open vSwitch will soon also use ipv6_find_hdr() so this moves it
out of Netfilter-specific code into a more common location.

Signed-off-by: Jesse Gross <jesse@nicira.com>
---
 include/linux/netfilter_ipv6/ip6_tables.h |    9 ---
 include/net/ipv6.h                        |    9 +++
 net/ipv6/exthdrs_core.c                   |  103 +++++++++++++++++++++++++++++
 net/ipv6/netfilter/ip6_tables.c           |  103 -----------------------------
 net/netfilter/xt_HMARK.c                  |    8 +--
 5 files changed, 116 insertions(+), 116 deletions(-)

diff --git a/include/linux/netfilter_ipv6/ip6_tables.h b/include/linux/netfilter_ipv6/ip6_tables.h
index 5f84c62..610208b 100644
--- a/include/linux/netfilter_ipv6/ip6_tables.h
+++ b/include/linux/netfilter_ipv6/ip6_tables.h
@@ -47,15 +47,6 @@ ip6t_ext_hdr(u8 nexthdr)
 	       (nexthdr == IPPROTO_DSTOPTS);
 }
 
-enum {
-	IP6T_FH_F_FRAG	= (1 << 0),
-	IP6T_FH_F_AUTH	= (1 << 1),
-};
-
-/* find specified header and get offset to it */
-extern int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
-			 int target, unsigned short *fragoff, int *fragflg);
-
 #ifdef CONFIG_COMPAT
 #include <net/compat.h>
 
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 979bf6c..b2f0cfb 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -630,6 +630,15 @@ extern int			ipv6_skip_exthdr(const struct sk_buff *, int start,
 
 extern bool			ipv6_ext_hdr(u8 nexthdr);
 
+enum {
+	IP6_FH_F_FRAG	= (1 << 0),
+	IP6_FH_F_AUTH	= (1 << 1),
+};
+
+/* find specified header and get offset to it */
+extern int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
+			 int target, unsigned short *fragoff, int *fragflg);
+
 extern int ipv6_find_tlv(struct sk_buff *skb, int offset, int type);
 
 extern struct in6_addr *fl6_update_dst(struct flowi6 *fl6,
diff --git a/net/ipv6/exthdrs_core.c b/net/ipv6/exthdrs_core.c
index f73d59a..8ea253a 100644
--- a/net/ipv6/exthdrs_core.c
+++ b/net/ipv6/exthdrs_core.c
@@ -111,3 +111,106 @@ int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
 	return start;
 }
 EXPORT_SYMBOL(ipv6_skip_exthdr);
+
+/*
+ * find the offset to specified header or the protocol number of last header
+ * if target < 0. "last header" is transport protocol header, ESP, or
+ * "No next header".
+ *
+ * Note that *offset is used as input/output parameter. an if it is not zero,
+ * then it must be a valid offset to an inner IPv6 header. This can be used
+ * to explore inner IPv6 header, eg. ICMPv6 error messages.
+ *
+ * If target header is found, its offset is set in *offset and return protocol
+ * number. Otherwise, return -1.
+ *
+ * If the first fragment doesn't contain the final protocol header or
+ * NEXTHDR_NONE it is considered invalid.
+ *
+ * Note that non-1st fragment is special case that "the protocol number
+ * of last header" is "next header" field in Fragment header. In this case,
+ * *offset is meaningless and fragment offset is stored in *fragoff if fragoff
+ * isn't NULL.
+ *
+ * if flags is not NULL and it's a fragment, then the frag flag IP6_FH_F_FRAG
+ * will be set. If it's an AH header, the IP6_FH_F_AUTH flag is set and
+ * target < 0, then this function will stop at the AH header.
+ */
+int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
+		  int target, unsigned short *fragoff, int *flags)
+{
+	unsigned int start = skb_network_offset(skb) + sizeof(struct ipv6hdr);
+	u8 nexthdr = ipv6_hdr(skb)->nexthdr;
+	unsigned int len;
+
+	if (fragoff)
+		*fragoff = 0;
+
+	if (*offset) {
+		struct ipv6hdr _ip6, *ip6;
+
+		ip6 = skb_header_pointer(skb, *offset, sizeof(_ip6), &_ip6);
+		if (!ip6 || (ip6->version != 6)) {
+			printk(KERN_ERR "IPv6 header not found\n");
+			return -EBADMSG;
+		}
+		start = *offset + sizeof(struct ipv6hdr);
+		nexthdr = ip6->nexthdr;
+	}
+	len = skb->len - start;
+
+	while (nexthdr != target) {
+		struct ipv6_opt_hdr _hdr, *hp;
+		unsigned int hdrlen;
+
+		if ((!ipv6_ext_hdr(nexthdr)) || nexthdr == NEXTHDR_NONE) {
+			if (target < 0)
+				break;
+			return -ENOENT;
+		}
+
+		hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
+		if (hp == NULL)
+			return -EBADMSG;
+		if (nexthdr == NEXTHDR_FRAGMENT) {
+			unsigned short _frag_off;
+			__be16 *fp;
+
+			if (flags)	/* Indicate that this is a fragment */
+				*flags |= IP6_FH_F_FRAG;
+			fp = skb_header_pointer(skb,
+						start+offsetof(struct frag_hdr,
+							       frag_off),
+						sizeof(_frag_off),
+						&_frag_off);
+			if (fp == NULL)
+				return -EBADMSG;
+
+			_frag_off = ntohs(*fp) & ~0x7;
+			if (_frag_off) {
+				if (target < 0 &&
+				    ((!ipv6_ext_hdr(hp->nexthdr)) ||
+				     hp->nexthdr == NEXTHDR_NONE)) {
+					if (fragoff)
+						*fragoff = _frag_off;
+					return hp->nexthdr;
+				}
+				return -ENOENT;
+			}
+			hdrlen = 8;
+		} else if (nexthdr == NEXTHDR_AUTH) {
+			if (flags && (*flags & IP6_FH_F_AUTH) && (target < 0))
+				break;
+			hdrlen = (hp->hdrlen + 2) << 2;
+		} else
+			hdrlen = ipv6_optlen(hp);
+
+		nexthdr = hp->nexthdr;
+		len -= hdrlen;
+		start += hdrlen;
+	}
+
+	*offset = start;
+	return nexthdr;
+}
+EXPORT_SYMBOL(ipv6_find_hdr);
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index d7cb045..1ce4f15 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -2273,112 +2273,9 @@ static void __exit ip6_tables_fini(void)
 	unregister_pernet_subsys(&ip6_tables_net_ops);
 }
 
-/*
- * find the offset to specified header or the protocol number of last header
- * if target < 0. "last header" is transport protocol header, ESP, or
- * "No next header".
- *
- * Note that *offset is used as input/output parameter. an if it is not zero,
- * then it must be a valid offset to an inner IPv6 header. This can be used
- * to explore inner IPv6 header, eg. ICMPv6 error messages.
- *
- * If target header is found, its offset is set in *offset and return protocol
- * number. Otherwise, return -1.
- *
- * If the first fragment doesn't contain the final protocol header or
- * NEXTHDR_NONE it is considered invalid.
- *
- * Note that non-1st fragment is special case that "the protocol number
- * of last header" is "next header" field in Fragment header. In this case,
- * *offset is meaningless and fragment offset is stored in *fragoff if fragoff
- * isn't NULL.
- *
- * if flags is not NULL and it's a fragment, then the frag flag IP6T_FH_F_FRAG
- * will be set. If it's an AH header, the IP6T_FH_F_AUTH flag is set and
- * target < 0, then this function will stop at the AH header.
- */
-int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
-		  int target, unsigned short *fragoff, int *flags)
-{
-	unsigned int start = skb_network_offset(skb) + sizeof(struct ipv6hdr);
-	u8 nexthdr = ipv6_hdr(skb)->nexthdr;
-	unsigned int len;
-
-	if (fragoff)
-		*fragoff = 0;
-
-	if (*offset) {
-		struct ipv6hdr _ip6, *ip6;
-
-		ip6 = skb_header_pointer(skb, *offset, sizeof(_ip6), &_ip6);
-		if (!ip6 || (ip6->version != 6)) {
-			printk(KERN_ERR "IPv6 header not found\n");
-			return -EBADMSG;
-		}
-		start = *offset + sizeof(struct ipv6hdr);
-		nexthdr = ip6->nexthdr;
-	}
-	len = skb->len - start;
-
-	while (nexthdr != target) {
-		struct ipv6_opt_hdr _hdr, *hp;
-		unsigned int hdrlen;
-
-		if ((!ipv6_ext_hdr(nexthdr)) || nexthdr == NEXTHDR_NONE) {
-			if (target < 0)
-				break;
-			return -ENOENT;
-		}
-
-		hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
-		if (hp == NULL)
-			return -EBADMSG;
-		if (nexthdr == NEXTHDR_FRAGMENT) {
-			unsigned short _frag_off;
-			__be16 *fp;
-
-			if (flags)	/* Indicate that this is a fragment */
-				*flags |= IP6T_FH_F_FRAG;
-			fp = skb_header_pointer(skb,
-						start+offsetof(struct frag_hdr,
-							       frag_off),
-						sizeof(_frag_off),
-						&_frag_off);
-			if (fp == NULL)
-				return -EBADMSG;
-
-			_frag_off = ntohs(*fp) & ~0x7;
-			if (_frag_off) {
-				if (target < 0 &&
-				    ((!ipv6_ext_hdr(hp->nexthdr)) ||
-				     hp->nexthdr == NEXTHDR_NONE)) {
-					if (fragoff)
-						*fragoff = _frag_off;
-					return hp->nexthdr;
-				}
-				return -ENOENT;
-			}
-			hdrlen = 8;
-		} else if (nexthdr == NEXTHDR_AUTH) {
-			if (flags && (*flags & IP6T_FH_F_AUTH) && (target < 0))
-				break;
-			hdrlen = (hp->hdrlen + 2) << 2;
-		} else
-			hdrlen = ipv6_optlen(hp);
-
-		nexthdr = hp->nexthdr;
-		len -= hdrlen;
-		start += hdrlen;
-	}
-
-	*offset = start;
-	return nexthdr;
-}
-
 EXPORT_SYMBOL(ip6t_register_table);
 EXPORT_SYMBOL(ip6t_unregister_table);
 EXPORT_SYMBOL(ip6t_do_table);
-EXPORT_SYMBOL(ipv6_find_hdr);
 
 module_init(ip6_tables_init);
 module_exit(ip6_tables_fini);
diff --git a/net/netfilter/xt_HMARK.c b/net/netfilter/xt_HMARK.c
index 1686ca1..73b73f6 100644
--- a/net/netfilter/xt_HMARK.c
+++ b/net/netfilter/xt_HMARK.c
@@ -167,7 +167,7 @@ hmark_pkt_set_htuple_ipv6(const struct sk_buff *skb, struct hmark_tuple *t,
 			  const struct xt_hmark_info *info)
 {
 	struct ipv6hdr *ip6, _ip6;
-	int flag = IP6T_FH_F_AUTH;
+	int flag = IP6_FH_F_AUTH;
 	unsigned int nhoff = 0;
 	u16 fragoff = 0;
 	int nexthdr;
@@ -177,7 +177,7 @@ hmark_pkt_set_htuple_ipv6(const struct sk_buff *skb, struct hmark_tuple *t,
 	if (nexthdr < 0)
 		return 0;
 	/* No need to check for icmp errors on fragments */
-	if ((flag & IP6T_FH_F_FRAG) || (nexthdr != IPPROTO_ICMPV6))
+	if ((flag & IP6_FH_F_FRAG) || (nexthdr != IPPROTO_ICMPV6))
 		goto noicmp;
 	/* Use inner header in case of ICMP errors */
 	if (get_inner6_hdr(skb, &nhoff)) {
@@ -185,7 +185,7 @@ hmark_pkt_set_htuple_ipv6(const struct sk_buff *skb, struct hmark_tuple *t,
 		if (ip6 == NULL)
 			return -1;
 		/* If AH present, use SPI like in ESP. */
-		flag = IP6T_FH_F_AUTH;
+		flag = IP6_FH_F_AUTH;
 		nexthdr = ipv6_find_hdr(skb, &nhoff, -1, &fragoff, &flag);
 		if (nexthdr < 0)
 			return -1;
@@ -201,7 +201,7 @@ noicmp:
 	if (t->proto == IPPROTO_ICMPV6)
 		return 0;
 
-	if (flag & IP6T_FH_F_FRAG)
+	if (flag & IP6_FH_F_FRAG)
 		return 0;
 
 	hmark_set_tuple_ports(skb, nhoff, t, info);
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 3/7] ipv6: improve ipv6_find_hdr() to skip empty routing headers
From: Jesse Gross @ 2012-11-29 18:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, dev, Ansis Atteka
In-Reply-To: <1354214149-33651-1-git-send-email-jesse@nicira.com>

From: Ansis Atteka <aatteka@nicira.com>

This patch prepares ipv6_find_hdr() function so that it could be
able to skip routing headers, where segements_left is 0. This is
required to handle multiple routing header case correctly when
changing IPv6 addresses.

Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
---
 include/net/ipv6.h      |    5 +++--
 net/ipv6/exthdrs_core.c |   36 ++++++++++++++++++++++++++++--------
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index b2f0cfb..acbd8e0 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -631,8 +631,9 @@ extern int			ipv6_skip_exthdr(const struct sk_buff *, int start,
 extern bool			ipv6_ext_hdr(u8 nexthdr);
 
 enum {
-	IP6_FH_F_FRAG	= (1 << 0),
-	IP6_FH_F_AUTH	= (1 << 1),
+	IP6_FH_F_FRAG		= (1 << 0),
+	IP6_FH_F_AUTH		= (1 << 1),
+	IP6_FH_F_SKIP_RH	= (1 << 2),
 };
 
 /* find specified header and get offset to it */
diff --git a/net/ipv6/exthdrs_core.c b/net/ipv6/exthdrs_core.c
index 8ea253a..11b4e29 100644
--- a/net/ipv6/exthdrs_core.c
+++ b/net/ipv6/exthdrs_core.c
@@ -132,9 +132,11 @@ EXPORT_SYMBOL(ipv6_skip_exthdr);
  * *offset is meaningless and fragment offset is stored in *fragoff if fragoff
  * isn't NULL.
  *
- * if flags is not NULL and it's a fragment, then the frag flag IP6_FH_F_FRAG
- * will be set. If it's an AH header, the IP6_FH_F_AUTH flag is set and
- * target < 0, then this function will stop at the AH header.
+ * if flags is not NULL and it's a fragment, then the frag flag
+ * IP6_FH_F_FRAG will be set. If it's an AH header, the
+ * IP6_FH_F_AUTH flag is set and target < 0, then this function will
+ * stop at the AH header. If IP6_FH_F_SKIP_RH flag was passed, then this
+ * function will skip all those routing headers, where segements_left was 0.
  */
 int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
 		  int target, unsigned short *fragoff, int *flags)
@@ -142,6 +144,7 @@ int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
 	unsigned int start = skb_network_offset(skb) + sizeof(struct ipv6hdr);
 	u8 nexthdr = ipv6_hdr(skb)->nexthdr;
 	unsigned int len;
+	bool found;
 
 	if (fragoff)
 		*fragoff = 0;
@@ -159,9 +162,10 @@ int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
 	}
 	len = skb->len - start;
 
-	while (nexthdr != target) {
+	do {
 		struct ipv6_opt_hdr _hdr, *hp;
 		unsigned int hdrlen;
+		found = (nexthdr == target);
 
 		if ((!ipv6_ext_hdr(nexthdr)) || nexthdr == NEXTHDR_NONE) {
 			if (target < 0)
@@ -172,6 +176,20 @@ int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
 		hp = skb_header_pointer(skb, start, sizeof(_hdr), &_hdr);
 		if (hp == NULL)
 			return -EBADMSG;
+
+		if (nexthdr == NEXTHDR_ROUTING) {
+			struct ipv6_rt_hdr _rh, *rh;
+
+			rh = skb_header_pointer(skb, start, sizeof(_rh),
+						&_rh);
+			if (rh == NULL)
+				return -EBADMSG;
+
+			if (flags && (*flags & IP6_FH_F_SKIP_RH) &&
+			    rh->segments_left == 0)
+				found = false;
+		}
+
 		if (nexthdr == NEXTHDR_FRAGMENT) {
 			unsigned short _frag_off;
 			__be16 *fp;
@@ -205,10 +223,12 @@ int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
 		} else
 			hdrlen = ipv6_optlen(hp);
 
-		nexthdr = hp->nexthdr;
-		len -= hdrlen;
-		start += hdrlen;
-	}
+		if (!found) {
+			nexthdr = hp->nexthdr;
+			len -= hdrlen;
+			start += hdrlen;
+		}
+	} while (!found);
 
 	*offset = start;
 	return nexthdr;
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 6/7] openvswitch: add skb mark matching and set action
From: Jesse Gross @ 2012-11-29 18:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, dev, Ansis Atteka
In-Reply-To: <1354214149-33651-1-git-send-email-jesse@nicira.com>

From: Ansis Atteka <aatteka@nicira.com>

This patch adds support for skb mark matching and set action.

Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
---
 include/linux/openvswitch.h |    1 +
 net/openvswitch/actions.c   |    4 ++++
 net/openvswitch/datapath.c  |    3 +++
 net/openvswitch/flow.c      |   19 ++++++++++++++++++-
 net/openvswitch/flow.h      |    8 +++++---
 5 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
index eb1efa5..d42e174 100644
--- a/include/linux/openvswitch.h
+++ b/include/linux/openvswitch.h
@@ -243,6 +243,7 @@ enum ovs_key_attr {
 	OVS_KEY_ATTR_ICMPV6,    /* struct ovs_key_icmpv6 */
 	OVS_KEY_ATTR_ARP,       /* struct ovs_key_arp */
 	OVS_KEY_ATTR_ND,        /* struct ovs_key_nd */
+	OVS_KEY_ATTR_SKB_MARK,  /* u32 skb mark */
 	__OVS_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index a58ed27..ac2defe 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -428,6 +428,10 @@ static int execute_set_action(struct sk_buff *skb,
 		skb->priority = nla_get_u32(nested_attr);
 		break;
 
+	case OVS_KEY_ATTR_SKB_MARK:
+		skb->mark = nla_get_u32(nested_attr);
+		break;
+
 	case OVS_KEY_ATTR_ETHERNET:
 		err = set_eth_addr(skb, nla_data(nested_attr));
 		break;
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 7b1d6d2..f996db3 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -482,6 +482,7 @@ static int validate_set(const struct nlattr *a,
 	const struct ovs_key_ipv6 *ipv6_key;
 
 	case OVS_KEY_ATTR_PRIORITY:
+	case OVS_KEY_ATTR_SKB_MARK:
 	case OVS_KEY_ATTR_ETHERNET:
 		break;
 
@@ -695,6 +696,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 		goto err_flow_free;
 
 	err = ovs_flow_metadata_from_nlattrs(&flow->key.phy.priority,
+					     &flow->key.phy.skb_mark,
 					     &flow->key.phy.in_port,
 					     a[OVS_PACKET_ATTR_KEY]);
 	if (err)
@@ -714,6 +716,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 
 	OVS_CB(packet)->flow = flow;
 	packet->priority = flow->key.phy.priority;
+	packet->mark = flow->key.phy.skb_mark;
 
 	rcu_read_lock();
 	dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index e6ce902..c3294ce 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -604,6 +604,7 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 
 	key->phy.priority = skb->priority;
 	key->phy.in_port = in_port;
+	key->phy.skb_mark = skb->mark;
 
 	skb_reset_mac_header(skb);
 
@@ -803,6 +804,7 @@ const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 	[OVS_KEY_ATTR_ENCAP] = -1,
 	[OVS_KEY_ATTR_PRIORITY] = sizeof(u32),
 	[OVS_KEY_ATTR_IN_PORT] = sizeof(u32),
+	[OVS_KEY_ATTR_SKB_MARK] = sizeof(u32),
 	[OVS_KEY_ATTR_ETHERNET] = sizeof(struct ovs_key_ethernet),
 	[OVS_KEY_ATTR_VLAN] = sizeof(__be16),
 	[OVS_KEY_ATTR_ETHERTYPE] = sizeof(__be16),
@@ -988,6 +990,10 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
 	} else {
 		swkey->phy.in_port = DP_MAX_PORTS;
 	}
+	if (attrs & (1 << OVS_KEY_ATTR_SKB_MARK)) {
+		swkey->phy.skb_mark = nla_get_u32(a[OVS_KEY_ATTR_SKB_MARK]);
+		attrs &= ~(1 << OVS_KEY_ATTR_SKB_MARK);
+	}
 
 	/* Data attributes. */
 	if (!(attrs & (1 << OVS_KEY_ATTR_ETHERNET)))
@@ -1115,6 +1121,8 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
 
 /**
  * ovs_flow_metadata_from_nlattrs - parses Netlink attributes into a flow key.
+ * @priority: receives the skb priority
+ * @mark: receives the skb mark
  * @in_port: receives the extracted input port.
  * @key: Netlink attribute holding nested %OVS_KEY_ATTR_* Netlink attribute
  * sequence.
@@ -1124,7 +1132,7 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
  * get the metadata, that is, the parts of the flow key that cannot be
  * extracted from the packet itself.
  */
-int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
+int ovs_flow_metadata_from_nlattrs(u32 *priority, u32 *mark, u16 *in_port,
 			       const struct nlattr *attr)
 {
 	const struct nlattr *nla;
@@ -1132,6 +1140,7 @@ int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
 
 	*in_port = DP_MAX_PORTS;
 	*priority = 0;
+	*mark = 0;
 
 	nla_for_each_nested(nla, attr, rem) {
 		int type = nla_type(nla);
@@ -1150,6 +1159,10 @@ int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
 					return -EINVAL;
 				*in_port = nla_get_u32(nla);
 				break;
+
+			case OVS_KEY_ATTR_SKB_MARK:
+				*mark = nla_get_u32(nla);
+				break;
 			}
 		}
 	}
@@ -1171,6 +1184,10 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb)
 	    nla_put_u32(skb, OVS_KEY_ATTR_IN_PORT, swkey->phy.in_port))
 		goto nla_put_failure;
 
+	if (swkey->phy.skb_mark &&
+	    nla_put_u32(skb, OVS_KEY_ATTR_SKB_MARK, swkey->phy.skb_mark))
+		goto nla_put_failure;
+
 	nla = nla_reserve(skb, OVS_KEY_ATTR_ETHERNET, sizeof(*eth_key));
 	if (!nla)
 		goto nla_put_failure;
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 14a324e..a7bb60f 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -43,6 +43,7 @@ struct sw_flow_actions {
 struct sw_flow_key {
 	struct {
 		u32	priority;	/* Packet QoS priority. */
+		u32	skb_mark;	/* SKB mark. */
 		u16	in_port;	/* Input switch port (or DP_MAX_PORTS). */
 	} phy;
 	struct {
@@ -144,6 +145,7 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies);
  *                         ------  ---  ------  -----
  *  OVS_KEY_ATTR_PRIORITY      4    --     4      8
  *  OVS_KEY_ATTR_IN_PORT       4    --     4      8
+ *  OVS_KEY_ATTR_SKB_MARK      4    --     4      8
  *  OVS_KEY_ATTR_ETHERNET     12    --     4     16
  *  OVS_KEY_ATTR_ETHERTYPE     2     2     4      8  (outer VLAN ethertype)
  *  OVS_KEY_ATTR_8021Q         4    --     4      8
@@ -153,14 +155,14 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies);
  *  OVS_KEY_ATTR_ICMPV6        2     2     4      8
  *  OVS_KEY_ATTR_ND           28    --     4     32
  *  -------------------------------------------------
- *  total                                       144
+ *  total                                       152
  */
-#define FLOW_BUFSIZE 144
+#define FLOW_BUFSIZE 152
 
 int ovs_flow_to_nlattrs(const struct sw_flow_key *, struct sk_buff *);
 int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
 		      const struct nlattr *);
-int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
+int ovs_flow_metadata_from_nlattrs(u32 *priority, u32 *mark, u16 *in_port,
 			       const struct nlattr *);
 
 #define MAX_ACTIONS_BUFSIZE    (16 * 1024)
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 7/7] openvswitch: Use RCU callback when detaching netdevices.
From: Jesse Gross @ 2012-11-29 18:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, dev, Justin Pettit
In-Reply-To: <1354214149-33651-1-git-send-email-jesse@nicira.com>

Currently, each time a device is detached from an OVS datapath
we call synchronize RCU before freeing associated data structures.
However, if a bridge is deleted (which detaches all ports) when
many devices are connected then there can be a long delay.  This
switches to use call_rcu() to group the cost together.

Reported-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
---
 net/openvswitch/vport-netdev.c |   14 ++++++++++----
 net/openvswitch/vport-netdev.h |    3 +++
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index a903348..a9327e2 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -114,6 +114,15 @@ error:
 	return ERR_PTR(err);
 }
 
+static void free_port_rcu(struct rcu_head *rcu)
+{
+	struct netdev_vport *netdev_vport = container_of(rcu,
+					struct netdev_vport, rcu);
+
+	dev_put(netdev_vport->dev);
+	ovs_vport_free(vport_from_priv(netdev_vport));
+}
+
 static void netdev_destroy(struct vport *vport)
 {
 	struct netdev_vport *netdev_vport = netdev_vport_priv(vport);
@@ -122,10 +131,7 @@ static void netdev_destroy(struct vport *vport)
 	netdev_rx_handler_unregister(netdev_vport->dev);
 	dev_set_promiscuity(netdev_vport->dev, -1);
 
-	synchronize_rcu();
-
-	dev_put(netdev_vport->dev);
-	ovs_vport_free(vport);
+	call_rcu(&netdev_vport->rcu, free_port_rcu);
 }
 
 const char *ovs_netdev_get_name(const struct vport *vport)
diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h
index f7072a2..6478079 100644
--- a/net/openvswitch/vport-netdev.h
+++ b/net/openvswitch/vport-netdev.h
@@ -20,12 +20,15 @@
 #define VPORT_NETDEV_H 1
 
 #include <linux/netdevice.h>
+#include <linux/rcupdate.h>
 
 #include "vport.h"
 
 struct vport *ovs_netdev_get_vport(struct net_device *dev);
 
 struct netdev_vport {
+	struct rcu_head rcu;
+
 	struct net_device *dev;
 };
 
-- 
1.7.9.5

^ permalink raw reply related

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
From: David Miller @ 2012-11-29 18:36 UTC (permalink / raw)
  To: eric.dumazet
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert
In-Reply-To: <1354213993.3299.23.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Nov 2012 10:33:13 -0800

> On Thu, 2012-11-29 at 13:31 -0500, David Miller wrote:
> 
>> I think a per-cpu hash might make more sense.
>> 
>> This would scale our limits to the size of the system.
>> 
>> I'm willing to be convinced otherwise, but it seems the most
>> sensible thing to do.
> 
> It would break in many cases, when frags are spreaded on different cpus.

Indeed, ignore my stupid idea.

^ permalink raw reply

* Re: [PATCH net-next] doc: make the description of how tcp_ecn works more explicit and clear
From: Eric Dumazet @ 2012-11-29 18:42 UTC (permalink / raw)
  To: David Miller; +Cc: raj, netdev
In-Reply-To: <20121129.131627.987805831377793894.davem@davemloft.net>

On Thu, 2012-11-29 at 13:16 -0500, David Miller wrote:
> From: raj@tardy.usa.hp.com (Rick Jones)
> Date: Wed, 28 Nov 2012 11:53:10 -0800 (PST)
> 
> > From: Rick Jones <rick.jones2@hp.com>
> > 
> > Make the description of how tcp_ecn works a bit more explicit and clear.
> > 
> > Signed-off-by: Rick Jones <rick.jones2@hp.com>
> 
> Applied, thanks Rick.
> 
> I think we should change the default to one, to be honest.  I thought
> that's what we were doing by now...
> 
> '2' made sense 10 years ago, but it doesn't really today.

With 1 setting, I for example was enable to connect to a HP device,
when I was still working for SFR.

(It was an HTTP/HTTPS based administrative software, to manage HP c7000
enclosures)

I would suggest making a large scale experiment before doing this 2->1
move.

^ permalink raw reply

* Re: [PATCH net-next 1/3] MAINTAINERS: Add Mellanox ethernet driver - mlx4_en
From: Amir Vadai @ 2012-11-29 18:55 UTC (permalink / raw)
  To: Joe Perches
  Cc: David S. Miller, Or Gerlitz, Oren Duer, netdev, Yevgeny Petrilin
In-Reply-To: <1354210184.9874.8.camel@joe-AO722>

On Thu, Nov 29, 2012 at 7:29 PM, Joe Perches <joe@perches.com> wrote:
> On Thu, 2012-11-29 at 19:15 +0200, Amir Vadai wrote:
>> Set mlx4_en maintainer to Amir Vadai instead of Yevgeny Petrilin.
>>
>> Signed-off-by: Amir Vadai <amirv@mellanox.com>
>> Cc: Yevgeny Petrilin <yevgenyp@mellanox.com>
>> ---
>>  MAINTAINERS |    7 +++++++
>>  1 files changed, 7 insertions(+), 0 deletions(-)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 5d72dd5..67fd3de 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -4822,6 +4822,13 @@ F:     Documentation/scsi/megaraid.txt
>>  F:   drivers/scsi/megaraid.*
>>  F:   drivers/scsi/megaraid/
>>
>> +MELLANOX ETHERNET DRIVER (mlx4_en)
>> +M:   Amir Vadai <amirv@mellanox.com>
>> +L:   netdev@vger.kernel.org
>> +S:   Supported
>> +W:   http://www.mellanox.com
>> +Q:   http://patchwork.ozlabs.org/project/netdev/list/
>
> What about a file pattern?  Maybe
> F:      drivers/net/ethernet/mellanox/mlx/
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Right - will fix it

Thanks,
Amir

^ permalink raw reply

* Re: [Qemu-devel] tap devices not receiving packets from a bridge
From: Peter Lieven @ 2012-11-29 18:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel@nongnu.org, netdev, mst
In-Reply-To: <20121123070211.GC22787@stefanha-thinkpad.hitronhub.home>

Am 23.11.2012 08:02, schrieb Stefan Hajnoczi:
> On Thu, Nov 22, 2012 at 03:29:52PM +0100, Peter Lieven wrote:
>> is anyone aware of a problem with the linux network bridge that in very rare circumstances stops
>> a bridge from sending pakets to a tap device?
>>
>> My problem occurs in conjunction with vanilla qemu-kvm-1.2.0 and Ubuntu Kernel 3.2.0-34.53
>> which is based on Linux 3.2.33.
>>
>> I was not yet able to reproduce the issue, it happens in really rare cases. The symptom is that
>> the tap does not have any TX packets. RX is working fine. I see the packets coming in at
>> the physical interface on the host, but they are not forwarded to the tap interface.
>> The bridge itself has learnt the mac address of the vServer that is connected to the tap interface.
>> It does not help to toggle the bridge link status,  the tap interface status or the interface in the vServer.
>> It seems that problem occurs if a tap interface that has previously been used, but set to nonpersistent
>> is set persistent again and then is by chance assigned to the same vServer (=same mac address on same
>> bridge) again. Unfortunately it seems not to be reproducible.
> 
> Not sure but this patch from Michael Tsirkin may help - it solves an
> issue with persistent tap devices:
> 
> http://patchwork.ozlabs.org/patch/198598/

I have tried that patch (even if I do not use persistant taps), but it doesn't help.

What I can say now is that if a VM is not working with a tap - lets say tap10 then
it will not work with tap10, even if the vm is shut down. tap10 is set to non-persistant.
then the vm is started again, assigned occasionally again tap10 and is not working again.

BUT, if I use qemu-kvm-1.0.1 in the second case it is working. I have seen that there is
a lot of changes between 1.0.1 and 1.2.0 in the tap code. Maybe there is a bug in the
inititialization since then.

What also seem to have changed is that vlans have been removed. I was not aware of that,
so I still use vlans which are now emulated via hubs. Maybe this is related.

Peter

> 
> Stefan
> 

^ permalink raw reply

* Re: [PATCH net-next] doc: make the description of how tcp_ecn works more explicit and clear
From: Rick Jones @ 2012-11-29 18:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1354214521.3299.30.camel@edumazet-glaptop>

On 11/29/2012 10:42 AM, Eric Dumazet wrote:
> On Thu, 2012-11-29 at 13:16 -0500, David Miller wrote:
>> From: raj@tardy.usa.hp.com (Rick Jones)
>> Date: Wed, 28 Nov 2012 11:53:10 -0800 (PST)
>>
>>> From: Rick Jones <rick.jones2@hp.com>
>>>
>>> Make the description of how tcp_ecn works a bit more explicit and clear.
>>>
>>> Signed-off-by: Rick Jones <rick.jones2@hp.com>
>>
>> Applied, thanks Rick.

Am I correct in assuming that the documentation is supposed to word-wrap 
somewhere around 72 columns?  If so, as I have time for floor sweeping I 
can try to go through more of it.

>> I think we should change the default to one, to be honest.  I thought
>> that's what we were doing by now...

You weren't the only one - what triggered my looking at that description 
in the first place was an assertion in the tcpm mailing list that Linux 
defaulted to ecn enabled.

>> '2' made sense 10 years ago, but it doesn't really today.
>
> With 1 setting, I for example was enable to connect to a HP device,
> when I was still working for SFR.
>
> (It was an HTTP/HTTPS based administrative software, to manage HP c7000
> enclosures)

If you have some of the particulars, feel free to send them to me 
offline.  Being one of the cobbler's children I cannot make promises but 
I can try to see if whatever it was has evolved since then.

> I would suggest making a large scale experiment before doing this 2->1
> move.

Perhaps one or more of the "development oriented" (term?) distros can 
ship with a sysctl.conf file that sets it to one?  Or some companies 
with rather large Internet presence.

At the time of the tcpm message I went ahead and set it to one on 
netperf.org but that is far from a large scale experiment.  It has been 
a couple weeks and I've captured almost 250000 SYN segments (netperf.org 
isn't all that busy).  My recollection is that at least one search 
engine provider's bots were negotiating ECN and one noteable one was 
not.  I'd think a search engine provider's crawlers would be a large 
scale experiment.

rick jones

^ permalink raw reply

* Re: [PATCH resend net-next 2/3] myri10ge: Add vlan rx for better GRO perf.
From: Andrew Gallatin @ 2012-11-29 19:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20121129.131738.1067398537384405805.davem@davemloft.net>

On 11/29/12 13:17, David Miller wrote:
> From: Andrew Gallatin <gallatin@myri.com>
> Date: Wed, 28 Nov 2012 16:20:56 -0500
>
>> + if ((dev->features & (NETIF_F_HW_VLAN_RX)) == NETIF_F_HW_VLAN_RX &&
>> +	    (veh->h_vlan_proto == ntohs(ETH_P_8021Q))) {
>> +		/* fixup csum if needed */
>> +		if (skb->ip_summed == CHECKSUM_COMPLETE)
>> +			skb->csum = csum_sub(skb->csum,
>> + csum_partial(va + ETH_HLEN,
>> + VLAN_HLEN, 0));
>
> This indentation looks like spaghetti, verify that this kind of error
> doesn't exist in the rest of your patches, and resend the series.
>

Sorry.  Emacs victim...  I'll clean it up & re-submit.  I'd stupidly
assumed that checkpatch would verify indentation.  :(

Drew

^ permalink raw reply

* [PATCH net-next v1 3/3] net/mlx4_en: Set number of rx/tx channels using ethtool
From: Amir Vadai @ 2012-11-29 19:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: Amir Vadai, Or Gerlitz, Oren Duer, netdev
In-Reply-To: <1354216903-830-1-git-send-email-amirv@mellanox.com>

Add support to changing number of rx/tx channels using
ethtool ('ethtool -[lL]'). Where the number of tx channels specified in ethtool
is the number of rings per user priority - not total number of tx rings.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |   69 +++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_main.c    |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |   26 +++++----
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      |    2 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |    8 ++-
 5 files changed, 93 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index dc8ccb4..681bc1b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -999,6 +999,73 @@ static int mlx4_en_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd)
 	return err;
 }
 
+static void mlx4_en_get_channels(struct net_device *dev,
+		struct ethtool_channels *channel)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	memset(channel, 0, sizeof(*channel));
+
+	channel->max_rx = MAX_RX_RINGS;
+	channel->max_tx = MLX4_EN_MAX_TX_RING_P_UP;
+
+	channel->rx_count = priv->rx_ring_num;
+	channel->tx_count = priv->tx_ring_num / MLX4_EN_NUM_UP;
+}
+
+static int mlx4_en_set_channels(struct net_device *dev,
+		struct ethtool_channels *channel)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_dev *mdev = priv->mdev;
+	int port_up;
+	int err = 0;
+
+	if (channel->other_count || channel->combined_count ||
+	    channel->tx_count > channel->max_tx ||
+	    channel->rx_count > channel->max_rx ||
+	    !channel->tx_count || !channel->rx_count)
+		return -EINVAL;
+
+	mutex_lock(&mdev->state_lock);
+	if (priv->port_up) {
+		port_up = 1;
+		mlx4_en_stop_port(dev);
+	}
+
+	mlx4_en_free_resources(priv);
+
+	priv->num_tx_rings_p_up = channel->tx_count;
+	priv->tx_ring_num = channel->tx_count * MLX4_EN_NUM_UP;
+	priv->rx_ring_num = channel->rx_count;
+
+	err = mlx4_en_alloc_resources(priv);
+	if (err) {
+		en_err(priv, "Failed reallocating port resources\n");
+		goto out;
+	}
+
+	netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
+	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
+
+	mlx4_en_setup_tc(dev, MLX4_EN_NUM_UP);
+
+	en_warn(priv, "Using %d TX rings\n", priv->tx_ring_num);
+	en_warn(priv, "Using %d RX rings\n", priv->rx_ring_num);
+
+	if (port_up) {
+		err = mlx4_en_start_port(dev);
+		if (err)
+			en_err(priv, "Failed starting port\n");
+	}
+
+	err = mlx4_en_moderation_update(priv);
+
+out:
+	mutex_unlock(&mdev->state_lock);
+	return err;
+}
+
 const struct ethtool_ops mlx4_en_ethtool_ops = {
 	.get_drvinfo = mlx4_en_get_drvinfo,
 	.get_settings = mlx4_en_get_settings,
@@ -1023,6 +1090,8 @@ const struct ethtool_ops mlx4_en_ethtool_ops = {
 	.get_rxfh_indir_size = mlx4_en_get_rxfh_indir_size,
 	.get_rxfh_indir = mlx4_en_get_rxfh_indir,
 	.set_rxfh_indir = mlx4_en_set_rxfh_indir,
+	.get_channels = mlx4_en_get_channels,
+	.set_channels = mlx4_en_set_channels,
 };
 
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index a52922e..3a2b8c6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -250,7 +250,7 @@ static void *mlx4_en_add(struct mlx4_dev *dev)
 				rounddown_pow_of_two(max_t(int, MIN_RX_RINGS,
 							   min_t(int,
 								 dev->caps.num_comp_vectors,
-								 MAX_RX_RINGS)));
+								 DEF_RX_RINGS)));
 		} else {
 			mdev->profile.prof[i].rx_ring_num = rounddown_pow_of_two(
 				min_t(int, dev->caps.comp_pool/
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 2b23ca2..7d1287f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -47,11 +47,11 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
-static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
+int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int i;
-	unsigned int q, offset = 0;
+	unsigned int offset = 0;
 
 	if (up && up != MLX4_EN_NUM_UP)
 		return -EINVAL;
@@ -59,10 +59,9 @@ static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 	netdev_set_num_tc(dev, up);
 
 	/* Partition Tx queues evenly amongst UP's */
-	q = priv->tx_ring_num / up;
 	for (i = 0; i < up; i++) {
-		netdev_set_tc_queue(dev, i, q, offset);
-		offset += q;
+		netdev_set_tc_queue(dev, i, priv->num_tx_rings_p_up, offset);
+		offset += priv->num_tx_rings_p_up;
 	}
 
 	return 0;
@@ -1114,7 +1113,7 @@ int mlx4_en_start_port(struct net_device *dev)
 		/* Configure ring */
 		tx_ring = &priv->tx_ring[i];
 		err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn,
-			i / priv->mdev->profile.num_tx_rings_p_up);
+			i / priv->num_tx_rings_p_up);
 		if (err) {
 			en_err(priv, "Failed allocating Tx ring\n");
 			mlx4_en_deactivate_cq(priv, cq);
@@ -1564,10 +1563,13 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	int err;
 
 	dev = alloc_etherdev_mqs(sizeof(struct mlx4_en_priv),
-	    prof->tx_ring_num, prof->rx_ring_num);
+				 MAX_TX_RINGS, MAX_RX_RINGS);
 	if (dev == NULL)
 		return -ENOMEM;
 
+	netif_set_real_num_tx_queues(dev, prof->tx_ring_num);
+	netif_set_real_num_rx_queues(dev, prof->rx_ring_num);
+
 	SET_NETDEV_DEV(dev, &mdev->dev->pdev->dev);
 	dev->dev_id =  port - 1;
 
@@ -1586,15 +1588,17 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	priv->flags = prof->flags;
 	priv->ctrl_flags = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE |
 			MLX4_WQE_CTRL_SOLICITED);
+	priv->num_tx_rings_p_up = mdev->profile.num_tx_rings_p_up;
 	priv->tx_ring_num = prof->tx_ring_num;
-	priv->tx_ring = kzalloc(sizeof(struct mlx4_en_tx_ring) *
-			priv->tx_ring_num, GFP_KERNEL);
+
+	priv->tx_ring = kzalloc(sizeof(struct mlx4_en_tx_ring) * MAX_TX_RINGS,
+				GFP_KERNEL);
 	if (!priv->tx_ring) {
 		err = -ENOMEM;
 		goto out;
 	}
-	priv->tx_cq = kzalloc(sizeof(struct mlx4_en_cq) * priv->tx_ring_num,
-			GFP_KERNEL);
+	priv->tx_cq = kzalloc(sizeof(struct mlx4_en_cq) * MAX_RX_RINGS,
+			      GFP_KERNEL);
 	if (!priv->tx_cq) {
 		err = -ENOMEM;
 		goto out;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index b35094c..1f571d0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -523,7 +523,7 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	u16 rings_p_up = priv->mdev->profile.num_tx_rings_p_up;
+	u16 rings_p_up = priv->num_tx_rings_p_up;
 	u8 up = 0;
 
 	if (dev->num_tc)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d3eba8b..334ec48 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -67,7 +67,8 @@
 
 #define MLX4_EN_PAGE_SHIFT	12
 #define MLX4_EN_PAGE_SIZE	(1 << MLX4_EN_PAGE_SHIFT)
-#define MAX_RX_RINGS		16
+#define DEF_RX_RINGS		16
+#define MAX_RX_RINGS		128
 #define MIN_RX_RINGS		4
 #define TXBB_SIZE		64
 #define HEADROOM		(2048 / TXBB_SIZE + 1)
@@ -118,6 +119,8 @@ enum {
 #define MLX4_EN_NUM_UP			8
 #define MLX4_EN_DEF_TX_RING_SIZE	512
 #define MLX4_EN_DEF_RX_RING_SIZE  	1024
+#define MAX_TX_RINGS			(MLX4_EN_MAX_TX_RING_P_UP * \
+					 MLX4_EN_NUM_UP)
 
 /* Target number of packets to coalesce with interrupt moderation */
 #define MLX4_EN_RX_COAL_TARGET	44
@@ -476,6 +479,7 @@ struct mlx4_en_priv {
 	u32 flags;
 #define MLX4_EN_FLAG_PROMISC	0x1
 #define MLX4_EN_FLAG_MC_PROMISC	0x2
+	u8 num_tx_rings_p_up;
 	u32 tx_ring_num;
 	u32 rx_ring_num;
 	u32 rx_skb_size;
@@ -596,6 +600,8 @@ int mlx4_en_QUERY_PORT(struct mlx4_en_dev *mdev, u8 port);
 extern const struct dcbnl_rtnl_ops mlx4_en_dcbnl_ops;
 #endif
 
+int mlx4_en_setup_tc(struct net_device *dev, u8 up);
+
 #ifdef CONFIG_RFS_ACCEL
 void mlx4_en_cleanup_filters(struct mlx4_en_priv *priv,
 			     struct mlx4_en_rx_ring *rx_ring);
-- 
1.7.8.2

^ permalink raw reply related

* [PATCH net-next v1 1/3] MAINTAINERS: Add Mellanox ethernet driver - mlx4_en
From: Amir Vadai @ 2012-11-29 19:21 UTC (permalink / raw)
  To: David S. Miller
  Cc: Amir Vadai, Or Gerlitz, Oren Duer, netdev, Yevgeny Petrilin
In-Reply-To: <1354216903-830-1-git-send-email-amirv@mellanox.com>

Set mlx4_en maintainer to Amir Vadai instead of Yevgeny Petrilin.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Cc: Yevgeny Petrilin <yevgenyp@mellanox.com>
---
 MAINTAINERS |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 5d72dd5..3c6e8cd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4822,6 +4822,14 @@ F:	Documentation/scsi/megaraid.txt
 F:	drivers/scsi/megaraid.*
 F:	drivers/scsi/megaraid/

+MELLANOX ETHERNET DRIVER (mlx4_en)
+M:	Amir Vadai <amirv@mellanox.com>
+L: 	netdev@vger.kernel.org
+S:	Supported
+W:	http://www.mellanox.com
+Q:	http://patchwork.ozlabs.org/project/netdev/list/
+F:	drivers/net/ethernet/mellanox/mlx4/en_*
+
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
 W:	http://www.linux-mm.org
-- 
1.7.8.2

^ permalink raw reply related

* [PATCH net-next v1 0/3] mlx4_en: set number of rx/tx channels using ethtool
From: Amir Vadai @ 2012-11-29 19:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: Amir Vadai, Or Gerlitz, Oren Duer, netdev

1. Added a record in the MAINTAINERS file for the mlx4_en driver
2. Fix set_ringparam not to forget tx moderation info + remove code duplication
3. Add support to changing number of rx/tx channels using ethtool

---

Changes from V0:
- Added file pattern to MAINAINERS file

Amir Vadai (3):
  MAINTAINERS: Add Mellanox ethernet driver - mlx4_en
  net/mlx4_en: Fix TX moderation info loss after set_ringparam is
    called
  net/mlx4_en: Set number of rx/tx channels using ethtool

 MAINTAINERS                                     |    8 ++
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  128 +++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx4/en_main.c    |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |   26 +++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      |    2 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |    8 ++-
 6 files changed, 131 insertions(+), 43 deletions(-)

-- 
1.7.8.2

^ permalink raw reply

* [PATCH net-next v1 2/3] net/mlx4_en: Fix TX moderation info loss after set_ringparam is called
From: Amir Vadai @ 2012-11-29 19:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: Amir Vadai, Or Gerlitz, Oren Duer, netdev
In-Reply-To: <1354216903-830-1-git-send-email-amirv@mellanox.com>

We need to re-set tx moderation information after calling set_ringparam
else default tx moderation will be used.
Also avoid related code duplication, by putting it in a utility function.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |   59 ++++++++++++-----------
 1 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 9d0b88e..dc8ccb4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -43,6 +43,34 @@
 #define EN_ETHTOOL_SHORT_MASK cpu_to_be16(0xffff)
 #define EN_ETHTOOL_WORD_MASK  cpu_to_be32(0xffffffff)
 
+static int mlx4_en_moderation_update(struct mlx4_en_priv *priv)
+{
+	int i;
+	int err = 0;
+
+	for (i = 0; i < priv->tx_ring_num; i++) {
+		priv->tx_cq[i].moder_cnt = priv->tx_frames;
+		priv->tx_cq[i].moder_time = priv->tx_usecs;
+		err = mlx4_en_set_cq_moder(priv, &priv->tx_cq[i]);
+		if (err)
+			return err;
+	}
+
+	if (priv->adaptive_rx_coal)
+		return 0;
+
+	for (i = 0; i < priv->rx_ring_num; i++) {
+		priv->rx_cq[i].moder_cnt = priv->rx_frames;
+		priv->rx_cq[i].moder_time = priv->rx_usecs;
+		priv->last_moder_time[i] = MLX4_EN_AUTO_CONF;
+		err = mlx4_en_set_cq_moder(priv, &priv->rx_cq[i]);
+		if (err)
+			return err;
+	}
+
+	return err;
+}
+
 static void
 mlx4_en_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *drvinfo)
 {
@@ -381,7 +409,6 @@ static int mlx4_en_set_coalesce(struct net_device *dev,
 			      struct ethtool_coalesce *coal)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	int err, i;
 
 	priv->rx_frames = (coal->rx_max_coalesced_frames ==
 			   MLX4_EN_AUTO_CONF) ?
@@ -397,14 +424,6 @@ static int mlx4_en_set_coalesce(struct net_device *dev,
 	    coal->tx_max_coalesced_frames != priv->tx_frames) {
 		priv->tx_usecs = coal->tx_coalesce_usecs;
 		priv->tx_frames = coal->tx_max_coalesced_frames;
-		for (i = 0; i < priv->tx_ring_num; i++) {
-			priv->tx_cq[i].moder_cnt = priv->tx_frames;
-			priv->tx_cq[i].moder_time = priv->tx_usecs;
-			if (mlx4_en_set_cq_moder(priv, &priv->tx_cq[i])) {
-				en_warn(priv, "Failed changing moderation "
-					      "for TX cq %d\n", i);
-			}
-		}
 	}
 
 	/* Set adaptive coalescing params */
@@ -414,18 +433,8 @@ static int mlx4_en_set_coalesce(struct net_device *dev,
 	priv->rx_usecs_high = coal->rx_coalesce_usecs_high;
 	priv->sample_interval = coal->rate_sample_interval;
 	priv->adaptive_rx_coal = coal->use_adaptive_rx_coalesce;
-	if (priv->adaptive_rx_coal)
-		return 0;
 
-	for (i = 0; i < priv->rx_ring_num; i++) {
-		priv->rx_cq[i].moder_cnt = priv->rx_frames;
-		priv->rx_cq[i].moder_time = priv->rx_usecs;
-		priv->last_moder_time[i] = MLX4_EN_AUTO_CONF;
-		err = mlx4_en_set_cq_moder(priv, &priv->rx_cq[i]);
-		if (err)
-			return err;
-	}
-	return 0;
+	return mlx4_en_moderation_update(priv);
 }
 
 static int mlx4_en_set_pauseparam(struct net_device *dev,
@@ -466,7 +475,6 @@ static int mlx4_en_set_ringparam(struct net_device *dev,
 	u32 rx_size, tx_size;
 	int port_up = 0;
 	int err = 0;
-	int i;
 
 	if (param->rx_jumbo_pending || param->rx_mini_pending)
 		return -EINVAL;
@@ -505,14 +513,7 @@ static int mlx4_en_set_ringparam(struct net_device *dev,
 			en_err(priv, "Failed starting port\n");
 	}
 
-	for (i = 0; i < priv->rx_ring_num; i++) {
-		priv->rx_cq[i].moder_cnt = priv->rx_frames;
-		priv->rx_cq[i].moder_time = priv->rx_usecs;
-		priv->last_moder_time[i] = MLX4_EN_AUTO_CONF;
-		err = mlx4_en_set_cq_moder(priv, &priv->rx_cq[i]);
-		if (err)
-			goto out;
-	}
+	err = mlx4_en_moderation_update(priv);
 
 out:
 	mutex_unlock(&mdev->state_lock);
-- 
1.7.8.2

^ permalink raw reply related

* Re: RDS: sendto() with very large buffer triggering WARNING: at mm/page_alloc.c:2403
From: Venkat Venkatsubra @ 2012-11-29 19:38 UTC (permalink / raw)
  To: Tommi Rantala; +Cc: netdev, rds-devel, Dave Jones
In-Reply-To: <CA+ydwtrVyun4LhMufd44ip4eXSwXp9Tw34+5yEdm3UdTVb9hQA@mail.gmail.com>

On 11/29/2012 2:39 AM, Tommi Rantala wrote:
> Hello,
>
> Is RDS supposed to cap the sendto() buffer size? Saw the WARNING while
> fuzzing with Trinity.
>
> #include<string.h>
> #include<arpa/inet.h>
> #include<sys/socket.h>
>
> static const char buf[1234000567];
>
> int main(void)
> {
>          int fd;
>          struct sockaddr_in sa;
>
>          fd = socket(21 /* AF_RDS */, SOCK_SEQPACKET, 0);
>          if (fd<  0)
>                  return 1;
>
>          memset(&sa, 0, sizeof(sa));
>          sa.sin_family = AF_INET;
>          sa.sin_addr.s_addr = inet_addr("127.0.0.1");
>          sa.sin_port = htons(11111);
>
>          bind(fd, (struct sockaddr *)&sa, sizeof(sa));
>
>          sendto(fd, buf, sizeof(buf), 0, (struct sockaddr *)&sa, sizeof(sa));
>
>          return 0;
> }
>
> $ strace -e sendto ./rds-sendto
> sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 1234000567, 0, {sa_family=AF_INET, sin_port=htons(11111),
> sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ENOMEM (Cannot allocate
> memory)
>
> [ 7421.592595] ------------[ cut here ]------------
> [ 7421.592621] WARNING: at mm/page_alloc.c:2403
> __alloc_pages_nodemask+0x2c0/0x9f0()
> [ 7421.592628] Hardware name: EB1012
> [ 7421.592633] Modules linked in:
> [ 7421.592645] Pid: 3082, comm: rds-sendto Not tainted 3.7.0-rc7+ #58
> [ 7421.592650] Call Trace:
> [ 7421.592667]  [<ffffffff810a197b>] warn_slowpath_common+0x7b/0xc0
> [ 7421.592678]  [<ffffffff810a19d5>] warn_slowpath_null+0x15/0x20
> [ 7421.592689]  [<ffffffff81171900>] __alloc_pages_nodemask+0x2c0/0x9f0
> [ 7421.592700]  [<ffffffff81107c90>] ? __lock_acquire+0x3a0/0x9f0
> [ 7421.592711]  [<ffffffff81107c90>] ? __lock_acquire+0x3a0/0x9f0
> [ 7421.592724]  [<ffffffff811ac6bf>] alloc_pages_current+0x7f/0xf0
> [ 7421.592735]  [<ffffffff8116cc19>] __get_free_pages+0x9/0x40
> [ 7421.592746]  [<ffffffff811b3d8a>] kmalloc_order_trace+0x3a/0x190
> [ 7421.592755]  [<ffffffff81107c90>] ? __lock_acquire+0x3a0/0x9f0
> [ 7421.592765]  [<ffffffff811b4f59>] __kmalloc+0x229/0x240
> [ 7421.592778]  [<ffffffff81d1b06e>] rds_message_alloc+0x1e/0xa0
> [ 7421.592789]  [<ffffffff81d1db66>] rds_sendmsg+0x196/0x720
> [ 7421.592802]  [<ffffffff81a72a80>] ? sock_update_classid+0xf0/0x2b0
> [ 7421.592813]  [<ffffffff81a6abec>] sock_sendmsg+0xdc/0xf0
> [ 7421.592828]  [<ffffffff8118e9e5>] ? might_fault+0x85/0x90
> [ 7421.592838]  [<ffffffff8118e99c>] ? might_fault+0x3c/0x90
> [ 7421.592848]  [<ffffffff81a6e0fa>] sys_sendto+0xfa/0x130
> [ 7421.592859]  [<ffffffff8110953d>] ? trace_hardirqs_on_caller+0x10d/0x1a0
> [ 7421.592868]  [<ffffffff811095dd>] ? trace_hardirqs_on+0xd/0x10
> [ 7421.592881]  [<ffffffff81e94c9d>] ? _raw_spin_unlock_irq+0x3d/0x70
> [ 7421.592892]  [<ffffffff810bab44>] ? ptrace_notify+0x74/0x90
> [ 7421.592904]  [<ffffffff81e96250>] tracesys+0xdd/0xe2
> [ 7421.592911] ---[ end trace d9d681d0d60abf69 ]---
>
> Tommi
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
The man page of rds says this:
        The  default values for the send and receive buffer size are 
controlled
        by the A given  RDS  socket  has  limited  transmit  buffer  
space.  It
        defaults  to  the  system  wide  socket  send  buffer  size  set 
in the
        wmem_default and rmem_default sysctls, respectively. They can 
be  tuned
        by  the application through the SO_SNDBUF and SO_RCVBUF socket 
options.

rds_sendmsg (net/rds/send.c) checks this limit a bit later (after 
rds_message_alloc()):
         while (!rds_send_queue_rm(rs, conn, rm, rs->rs_bound_port,
                                   dport, &queued)) {
                 rds_stats_inc(s_send_queue_full);
                 /* XXX make sure this is reasonable */
                 if (payload_len > rds_sk_sndbuf(rs)) {
                         ret = -EMSGSIZE;
                         goto out;
                 }
                 ....
Venkat

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox