Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH net-next v3 1/4] netns: add genl cmd to add and get peer netns ids
From: Nicolas Dichtel @ 2014-10-03 12:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	luto-kltTT9wpgjJwATOyAt5JVQ, cwang-xCSkyg8dI+0RB7SZvlqPiA
In-Reply-To: <87tx3mmflp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Le 02/10/2014 21:33, Eric W. Biederman a écrit :
> Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> writes:
>
>> With this patch, a user can define an id for a peer netns by providing a FD or a
>> PID. These ids are local to netns (ie valid only into one netns).
>>
>> This will be useful for netlink messages when a x-netns interface is
>> dumped.
>
> You have a "id -> struct net *" table but you don't have a
> "struct net * -> id" table which looks like it will impact the
> performance of peernet2id at scale.
It is indirectly stores in 'struct idr'. It can be optimized later, with a
proper algorithm to find quickly this 'struct net *' (hash table? something
else?). A basic algorithm will not be more scalable than the current
idr_for_each().

^ permalink raw reply

* Re: [RFC PATCH net-next v2 0/5] netns: allow to identify peer netns
From: Nicolas Dichtel @ 2014-10-03 12:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Network Development, Linux Containers,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	David S. Miller, Stephen Hemminger, Andrew Morton, Cong Wang
In-Reply-To: <8761g2nurx.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Le 02/10/2014 21:20, Eric W. Biederman a écrit :
> Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> writes:
>
>> Le 29/09/2014 20:43, Eric W. Biederman a écrit :
>>> Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> writes:
>>>
>>>> Le 26/09/2014 20:57, Eric W. Biederman a écrit :
>>>>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>>>>
>>>>>> On Fri, Sep 26, 2014 at 11:10 AM, Eric W. Biederman
>>>>>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>>>> I see two ways to go with this.
>>>>>>>
>>>>>>> - A per network namespace table to that you can store ids for ``peer''
>>>>>>>      network namespaces.  The table would need to be populated manually by
>>>>>>>      the likes of ip netns add.
>>>>>>>
>>>>>>>      That flips the order of assignment and makes this idea solid.
>>>> I have a preference for this solution, because it allows to have a full
>>>> broadcast messages. When you have a lot of network interfaces (> 10k),
>>>> it saves a lot of time to avoid another request to get all informations.
>>>
>>> My practical question is how often does it happen that we care?
>> In fact, I don't think that scenarii with a lot of netns have a full mesh of
>> x-netns interfaces. It will be more one "link" netns with the physical
>> interface and all other with one interface with the link part in this "link"
>> netns. Hence, only one nsid is needing in each netns.
>
> I will buy that a full mesh is unlikely.
>
> For people doing simulations anything physical has a limited number of
> links.
>
> For people wanting all to all connectivity setting up an internal
> macvlan (or the equivalent) is likely much simpler and more efficient
> that a full mesh.
>
> So the question in my mind is how do we create these identifiers at need
> (when we create the cross network namespace links) instead of at network
> namespace creation time.  I don't see an answer to that in your patches,
> and perhaps it obvious.
For me, it is the responsability of the user who creates the netns. He should
know what will be done with this new netns, hence he may or may not define an
id. It's also possible to delegate this to the user who will create the tunnel.
In other words, it's part of the configuration.

^ permalink raw reply

* Re: [PATCH 1/1] bridge: Fix NAT66ed IPv6 packets not being bridged correctly
From: Pablo Neira Ayuso @ 2014-10-03 11:21 UTC (permalink / raw)
  To: bernhard.thaler
  Cc: David Miller, kuznet, jmorris, yoshfuji, kaber, stephen, netdev,
	bridge, linux-kernel, netfilter-devel, sven
In-Reply-To: <20140915.174009.1151039353244879675.davem@davemloft.net>

Hi Bernhard,

Sorry for taking a bit to get back to you with feedback. We've been
discussing recently some changes in br_netfilter. Basically, to
modularize it [1] and this has taken a while.

Regarding your change. Sven Eckelmann (CC'ed in this email) sent a RFC
out of the merge window that have remain tagged in my patchwork, you
can find it here.

http://patchwork.ozlabs.org/patch/381027/

I would like to see a submission that covers all NAT66 scenarios that
we currenty support.

Thanks.

[1] http://comments.gmane.org/gmane.comp.security.firewalls.netfilter.devel/54234

On Mon, Sep 15, 2014 at 05:40:09PM -0400, David Miller wrote:
> From: Bernhard Thaler <bernhard.thaler@wvnet.at>
> Date: Mon, 15 Sep 2014 23:27:28 +0200
> 
> CC:'ing netfilter-devel and the netfilter maintainer, which is
> probably the primary place this patch should have been submitted.
> 
> > Ethernet frames are not bridged to correct interface when packets have
> > been NAT66ed; compared to IPv4 logic in code, IPv6 code does not store
> > original address (before NAT) on transmit to determine if packet was
> > NAT66ed on receive to swap addresses back.
> > 
> > Changes added in br_netfilter.c to store original address, compare
> > against it and determine correct output interface. Changes needed in
> > netfilter_bridge.h to store IPv6 address in pre-existing union.
> > Export of ip6_route_input needed to use it in br_netfilter.c.
> > 
> > Problem may only affect systems doing NAT66 and ethernet bridging at
> > the same time. Tested in NAT66 setup on base of an ethernet bridge.
> > 
> > Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
> > ---
> >  include/linux/netfilter_bridge.h |    2 +
> >  net/bridge/br_netfilter.c        |  136 ++++++++++++++++++++++++++++----------
> >  net/ipv6/route.c                 |    1 +
> >  3 files changed, 105 insertions(+), 34 deletions(-)
> > 
> > diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
> > index 8ab1c27..3a9cdcd 100644
> > --- a/include/linux/netfilter_bridge.h
> > +++ b/include/linux/netfilter_bridge.h
> > @@ -2,6 +2,7 @@
> >  #define __LINUX_BRIDGE_NETFILTER_H
> >  
> >  #include <uapi/linux/netfilter_bridge.h>
> > +#include <uapi/linux/in6.h>
> >  
> >  
> >  enum nf_br_hook_priorities {
> > @@ -79,6 +80,7 @@ static inline unsigned int nf_bridge_pad(const struct sk_buff *skb)
> >  struct bridge_skb_cb {
> >  	union {
> >  		__be32 ipv4;
> > +		struct in6_addr ipv6;
> >  	} daddr;
> >  };
> >  
> > diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
> > index a615264..2ae3888 100644
> > --- a/net/bridge/br_netfilter.c
> > +++ b/net/bridge/br_netfilter.c
> > @@ -35,6 +35,9 @@
> >  #include <net/ip.h>
> >  #include <net/ipv6.h>
> >  #include <net/route.h>
> > +#include <net/ip6_route.h>
> > +#include <net/flow.h>
> > +#include <net/dst.h>
> >  
> >  #include <asm/uaccess.h>
> >  #include "br_private.h"
> > @@ -42,10 +45,18 @@
> >  #include <linux/sysctl.h>
> >  #endif
> >  
> > -#define skb_origaddr(skb)	 (((struct bridge_skb_cb *) \
> > -				 (skb->nf_bridge->data))->daddr.ipv4)
> > -#define store_orig_dstaddr(skb)	 (skb_origaddr(skb) = ip_hdr(skb)->daddr)
> > -#define dnat_took_place(skb)	 (skb_origaddr(skb) != ip_hdr(skb)->daddr)
> > +#define skb_origaddr(skb)		(((struct bridge_skb_cb *)\
> > +					(skb->nf_bridge->data))->daddr.ipv4)
> > +#define skb_origaddr_ipv6(skb)		(((struct bridge_skb_cb *)\
> > +					(skb->nf_bridge->data))->daddr.ipv6)
> > +#define store_orig_dstaddr(skb)		(skb_origaddr(skb) = ip_hdr(skb)->daddr)
> > +#define store_orig_dstaddr_ipv6(skb)	(skb_origaddr_ipv6(skb) = \
> > +					ipv6_hdr(skb)->daddr)
> > +#define dnat_took_place(skb)		(skb_origaddr(skb) != \
> > +					ip_hdr(skb)->daddr)
> > +#define dnat_took_place_ipv6(skb)	(memcmp(&skb_origaddr_ipv6(skb), \
> > +					&(ipv6_hdr(skb)->daddr), \
> > +					sizeof(struct in6_addr)) != 0)
> >  
> >  #ifdef CONFIG_SYSCTL
> >  static struct ctl_table_header *brnf_sysctl_header;
> > @@ -340,36 +351,6 @@ int nf_bridge_copy_header(struct sk_buff *skb)
> >  	return 0;
> >  }
> >  
> > -/* PF_BRIDGE/PRE_ROUTING *********************************************/
> > -/* Undo the changes made for ip6tables PREROUTING and continue the
> > - * bridge PRE_ROUTING hook. */
> > -static int br_nf_pre_routing_finish_ipv6(struct sk_buff *skb)
> > -{
> > -	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
> > -	struct rtable *rt;
> > -
> > -	if (nf_bridge->mask & BRNF_PKT_TYPE) {
> > -		skb->pkt_type = PACKET_OTHERHOST;
> > -		nf_bridge->mask ^= BRNF_PKT_TYPE;
> > -	}
> > -	nf_bridge->mask ^= BRNF_NF_BRIDGE_PREROUTING;
> > -
> > -	rt = bridge_parent_rtable(nf_bridge->physindev);
> > -	if (!rt) {
> > -		kfree_skb(skb);
> > -		return 0;
> > -	}
> > -	skb_dst_set_noref(skb, &rt->dst);
> > -
> > -	skb->dev = nf_bridge->physindev;
> > -	nf_bridge_update_protocol(skb);
> > -	nf_bridge_push_encap_header(skb);
> > -	NF_HOOK_THRESH(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, skb, skb->dev, NULL,
> > -		       br_handle_frame_finish, 1);
> > -
> > -	return 0;
> > -}
> > -
> >  /* Obtain the correct destination MAC address, while preserving the original
> >   * source MAC address. If we already know this address, we just copy it. If we
> >   * don't, we use the neighbour framework to find out. In both cases, we make
> > @@ -527,6 +508,92 @@ bridged_dnat:
> >  	return 0;
> >  }
> >  
> > +/* PF_BRIDGE/PRE_ROUTING *********************************************
> > + * Undo the changes made for ip6tables PREROUTING and continue the
> > + * bridge PRE_ROUTING hook.
> > + */
> > +static int br_nf_pre_routing_finish_ipv6(struct sk_buff *skb)
> > +{
> > +	struct net_device *dev = skb->dev;
> > +	struct ipv6hdr *iph = ipv6_hdr(skb);
> > +	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
> > +	struct rtable *rt;
> > +	struct dst_entry *dst;
> > +	struct flowi6 fl6 = {
> > +		.flowi6_iif = skb->dev->ifindex,
> > +		.daddr = iph->daddr,
> > +		.saddr = iph->saddr,
> > +		.flowlabel = ip6_flowinfo(iph),
> > +		.flowi6_mark = skb->mark,
> > +		.flowi6_proto = iph->nexthdr,
> > +	};
> > +
> > +	if (nf_bridge->mask & BRNF_PKT_TYPE) {
> > +		skb->pkt_type = PACKET_OTHERHOST;
> > +		nf_bridge->mask ^= BRNF_PKT_TYPE;
> > +	}
> > +	nf_bridge->mask ^= BRNF_NF_BRIDGE_PREROUTING;
> > +
> > +	if (dnat_took_place_ipv6(skb)) {
> > +		ip6_route_input(skb);
> > +		/* ip6_route_input is void function,
> > +		 * no int returned as in ip4_route_input
> > +		 * changes value of skb->_skb_refdst) on success
> > +		 */
> > +		if (skb->_skb_refdst == 0) {
> > +			struct in_device *in_dev = __in_dev_get_rcu(dev);
> > +
> > +			if (!in_dev || IN_DEV_FORWARD(in_dev))
> > +				goto free_skb;
> > +
> > +			dst = ip6_route_output(dev_net(dev), skb->sk, &fl6);
> > +			if (!IS_ERR(dst)) {
> > +				/* - Bridged-and-DNAT'ed traffic doesn't
> > +				 *   require ip_forwarding.
> > +				 */
> > +				if (dst->dev == dev) {
> > +					skb_dst_set(skb, dst);
> > +					goto bridged_dnat;
> > +				}
> > +				dst_release(dst);
> > +			}
> > +free_skb:
> > +			kfree_skb(skb);
> > +			return 0;
> > +		} else {
> > +			if (skb_dst(skb)->dev == dev) {
> > +bridged_dnat:
> > +				skb->dev = nf_bridge->physindev;
> > +				nf_bridge_update_protocol(skb);
> > +				nf_bridge_push_encap_header(skb);
> > +				NF_HOOK_THRESH(NFPROTO_BRIDGE,
> > +					       NF_BR_PRE_ROUTING,
> > +					       skb, skb->dev, NULL,
> > +					       br_nf_pre_routing_finish_bridge,
> > +					       1);
> > +				return 0;
> > +			}
> > +			memcpy(eth_hdr(skb)->h_dest, dev->dev_addr, ETH_ALEN);
> > +			skb->pkt_type = PACKET_HOST;
> > +		}
> > +	} else {
> > +		rt = bridge_parent_rtable(nf_bridge->physindev);
> > +		if (!rt) {
> > +			kfree_skb(skb);
> > +			return 0;
> > +		}
> > +		skb_dst_set_noref(skb, &rt->dst);
> > +	}
> > +
> > +	skb->dev = nf_bridge->physindev;
> > +	nf_bridge_update_protocol(skb);
> > +	nf_bridge_push_encap_header(skb);
> > +	NF_HOOK_THRESH(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, skb, skb->dev, NULL,
> > +		       br_handle_frame_finish, 1);
> > +
> > +	return 0;
> > +}
> > +
> >  static struct net_device *brnf_get_logical_dev(struct sk_buff *skb, const struct net_device *dev)
> >  {
> >  	struct net_device *vlan, *br;
> > @@ -658,6 +725,7 @@ static unsigned int br_nf_pre_routing_ipv6(const struct nf_hook_ops *ops,
> >  	if (!setup_pre_routing(skb))
> >  		return NF_DROP;
> >  
> > +	store_orig_dstaddr_ipv6(skb);
> >  	skb->protocol = htons(ETH_P_IPV6);
> >  	NF_HOOK(NFPROTO_IPV6, NF_INET_PRE_ROUTING, skb, skb->dev, NULL,
> >  		br_nf_pre_routing_finish_ipv6);
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index f23fbd2..e328905 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -1017,6 +1017,7 @@ void ip6_route_input(struct sk_buff *skb)
> >  
> >  	skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags));
> >  }
> > +EXPORT_SYMBOL(ip6_route_input);
> >  
> >  static struct rt6_info *ip6_pol_route_output(struct net *net, struct fib6_table *table,
> >  					     struct flowi6 *fl6, int flags)
> > -- 
> > 1.7.10.4
> > 

^ permalink raw reply

* Re: [RFC PATCH linux 2/2] fs/proc: use a hash table for the directory entries
From: Alexey Dobriyan @ 2014-10-03 10:55 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: netdev, Linux Kernel, David Miller, Eric W. Biederman,
	Andrew Morton, rui.xiang, Al Viro, Oleg Nesterov, Cyrill Gorcunov,
	kirill.shutemov, Grant Likely, Theodore Ts'o,
	Thierry Herbelot
In-Reply-To: <1412263501-6572-3-git-send-email-nicolas.dichtel@6wind.com>

On Thu, Oct 2, 2014 at 6:25 PM, Nicolas Dichtel
<nicolas.dichtel@6wind.com> wrote:
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -81,10 +81,13 @@ static int __xlate_proc_name(const char *name, struct proc_dir_entry **ret,

> +       if (!S_ISDIR(de->mode))
> +               return -EINVAL;

There are way too many S_ISDIR checks.
In lookup and readdir, it is guaranteed that PDE is directory.

I'd say all of them aren't needed because non-directories have
->subdir = NULL and
directories have ->subdir != NULL which transforms into hashtable or
rbtree or whatever,
so you only need to guarantee only directories appear where they are
expected and
fearlessly use your new data structure traversing directories.

    Alexey

^ permalink raw reply

* [net-next PATCH] veth: don't assign a qdisc to veth
From: Jesper Dangaard Brouer @ 2014-10-03 10:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller; +Cc: Jiri Pirko, mpatel

The veth driver is a virtual device, and should not have assigned
the default qdisc.  Verified (ndo_start_xmit) veth_xmit can only
return NETDEV_TX_OK, thus this should be safe to bypass qdisc.

Not assigning a qdisc is subtly done by setting tx_queue_len to zero.

Reported-by: Mrunal Patel <mpatel@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 drivers/net/veth.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 8ad5965..3c32fcf 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -287,6 +287,7 @@ static const struct net_device_ops veth_netdev_ops = {
 static void veth_setup(struct net_device *dev)
 {
 	ether_setup(dev);
+	dev->tx_queue_len = 0;

 	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
 	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;

^ permalink raw reply related

* [patch] checkpatch: remove the ether_addr_copy warning
From: Dan Carpenter @ 2014-10-03  9:35 UTC (permalink / raw)
  To: Andy Whitcroft, Andrew Morton
  Cc: Joe Perches, netdev, kernel-janitors, gregkh

Most people sending checkpatch.pl fixes don't know how to verify the
alignment.  This checkpatch warning just encourages newbies to try
introduce bugs.  Patch submitters tell us that they just sed the code
and it's the job for the maintainer to check that it's correct.

If you want to work on these then you can get the same information by
typing `grep -nw memcpy drivers/net/ -R | grep ETH_ALEN`

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 52a223e..d540dd4 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -4698,16 +4698,6 @@ sub process {
 			}
 		}
 
-# Check for memcpy(foo, bar, ETH_ALEN) that could be ether_addr_copy(foo, bar)
-		if ($^V && $^V ge 5.10.0 &&
-		    $line =~ /^\+(?:.*?)\bmemcpy\s*\(\s*$FuncArg\s*,\s*$FuncArg\s*\,\s*ETH_ALEN\s*\)/s) {
-			if (WARN("PREFER_ETHER_ADDR_COPY",
-				 "Prefer ether_addr_copy() over memcpy() if the Ethernet addresses are __aligned(2)\n" . $herecurr) &&
-			    $fix) {
-				$fixed[$fixlinenr] =~ s/\bmemcpy\s*\(\s*$FuncArg\s*,\s*$FuncArg\s*\,\s*ETH_ALEN\s*\)/ether_addr_copy($2, $7)/;
-			}
-		}
-
 # typecasts on min/max could be min_t/max_t
 		if ($^V && $^V ge 5.10.0 &&
 		    defined $stat &&

^ permalink raw reply related

* [PATCH v2 net-next] r8169:add support for RTL8168EP
From: Chun-Hao Lin @ 2014-10-03  9:20 UTC (permalink / raw)
  To: netdev; +Cc: nic_swsd, linux-kernel, Chun-Hao Lin

RTL8168EP is Realtek PCIe Gigabit Ethernet controller.
It is a successor chip of RTL8168DP.

For RTL8168EP, the read/write ocp register is via eri channel type 2,
so I modify ocp_read() ocp_write() and move related functions under
rtl_eri_read() rtl_eri_write().

The way of checking if dash is enabled and setting driver start/stop
are also different with RTL8168DP.

Right now, RTL8168EP phy mcu did not need patch firmware code, so I
did not add firmware code for it.

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
---
 drivers/net/ethernet/realtek/r8169.c | 512 ++++++++++++++++++++++++++++++++---
 1 file changed, 467 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 54476ba..3efdf4d 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -155,6 +155,9 @@ enum mac_version {
 	RTL_GIGA_MAC_VER_46,
 	RTL_GIGA_MAC_VER_47,
 	RTL_GIGA_MAC_VER_48,
+	RTL_GIGA_MAC_VER_49,
+	RTL_GIGA_MAC_VER_50,
+	RTL_GIGA_MAC_VER_51,
 	RTL_GIGA_MAC_NONE   = 0xff,
 };
 
@@ -302,6 +305,15 @@ static const struct {
 	[RTL_GIGA_MAC_VER_48] =
 		_R("RTL8107e",		RTL_TD_1, FIRMWARE_8107E_2,
 							JUMBO_1K, false),
+	[RTL_GIGA_MAC_VER_49] =
+		_R("RTL8168ep/8111ep",	RTL_TD_1, NULL,
+							JUMBO_9K, false),
+	[RTL_GIGA_MAC_VER_50] =
+		_R("RTL8168ep/8111ep",	RTL_TD_1, NULL,
+							JUMBO_9K, false),
+	[RTL_GIGA_MAC_VER_51] =
+		_R("RTL8168ep/8111ep",	RTL_TD_1, NULL,
+							JUMBO_9K, false),
 };
 #undef _R
 
@@ -400,6 +412,10 @@ enum rtl_registers {
 	FuncEvent	= 0xf0,
 	FuncEventMask	= 0xf4,
 	FuncPresetState	= 0xf8,
+	IBCR0           = 0xf8,
+	IBCR2           = 0xf9,
+	IBIMR0          = 0xfa,
+	IBISR0          = 0xfb,
 	FuncForceEvent	= 0xfc,
 };
 
@@ -467,6 +483,7 @@ enum rtl8168_registers {
 #define ERIAR_EXGMAC			(0x00 << ERIAR_TYPE_SHIFT)
 #define ERIAR_MSIX			(0x01 << ERIAR_TYPE_SHIFT)
 #define ERIAR_ASF			(0x02 << ERIAR_TYPE_SHIFT)
+#define ERIAR_OOB			(0x02 << ERIAR_TYPE_SHIFT)
 #define ERIAR_MASK_SHIFT		12
 #define ERIAR_MASK_0001			(0x1 << ERIAR_MASK_SHIFT)
 #define ERIAR_MASK_0011			(0x3 << ERIAR_MASK_SHIFT)
@@ -935,40 +952,6 @@ static const struct rtl_cond name = {			\
 							\
 static bool name ## _check(struct rtl8169_private *tp)
 
-DECLARE_RTL_COND(rtl_ocpar_cond)
-{
-	void __iomem *ioaddr = tp->mmio_addr;
-
-	return RTL_R32(OCPAR) & OCPAR_FLAG;
-}
-
-static u32 ocp_read(struct rtl8169_private *tp, u8 mask, u16 reg)
-{
-	void __iomem *ioaddr = tp->mmio_addr;
-
-	RTL_W32(OCPAR, ((u32)mask & 0x0f) << 12 | (reg & 0x0fff));
-
-	return rtl_udelay_loop_wait_high(tp, &rtl_ocpar_cond, 100, 20) ?
-		RTL_R32(OCPDR) : ~0;
-}
-
-static void ocp_write(struct rtl8169_private *tp, u8 mask, u16 reg, u32 data)
-{
-	void __iomem *ioaddr = tp->mmio_addr;
-
-	RTL_W32(OCPDR, data);
-	RTL_W32(OCPAR, OCPAR_FLAG | ((u32)mask & 0x0f) << 12 | (reg & 0x0fff));
-
-	rtl_udelay_loop_wait_low(tp, &rtl_ocpar_cond, 100, 20);
-}
-
-DECLARE_RTL_COND(rtl_eriar_cond)
-{
-	void __iomem *ioaddr = tp->mmio_addr;
-
-	return RTL_R32(ERIAR) & ERIAR_FLAG;
-}
-
 static bool rtl_ocp_reg_failure(struct rtl8169_private *tp, u32 reg)
 {
 	if (reg & 0xffff0001) {
@@ -1110,6 +1093,13 @@ static int r8169_mdio_read(struct rtl8169_private *tp, int reg)
 	return value;
 }
 
+DECLARE_RTL_COND(rtl_ocpar_cond)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+
+	return RTL_R32(OCPAR) & OCPAR_FLAG;
+}
+
 static void r8168dp_1_mdio_access(struct rtl8169_private *tp, int reg, u32 data)
 {
 	void __iomem *ioaddr = tp->mmio_addr;
@@ -1245,6 +1235,13 @@ static u16 rtl_ephy_read(struct rtl8169_private *tp, int reg_addr)
 		RTL_R32(EPHYAR) & EPHYAR_DATA_MASK : ~0;
 }
 
+DECLARE_RTL_COND(rtl_eriar_cond)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+
+	return RTL_R32(ERIAR) & ERIAR_FLAG;
+}
+
 static void rtl_eri_write(struct rtl8169_private *tp, int addr, u32 mask,
 			  u32 val, int type)
 {
@@ -1276,6 +1273,52 @@ static void rtl_w0w1_eri(struct rtl8169_private *tp, int addr, u32 mask, u32 p,
 	rtl_eri_write(tp, addr, mask, (val & ~m) | p, type);
 }
 
+static u32 ocp_read(struct rtl8169_private *tp, u8 mask, u16 reg)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_27:
+	case RTL_GIGA_MAC_VER_28:
+	case RTL_GIGA_MAC_VER_31:
+		RTL_W32(OCPAR, ((u32)mask & 0x0f) << 12 | (reg & 0x0fff));
+		return rtl_udelay_loop_wait_high(tp, &rtl_ocpar_cond, 100, 20)
+			? RTL_R32(OCPDR) : ~0;
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
+		return rtl_eri_read(tp, reg, ERIAR_OOB);
+	default:
+		BUG();
+		return ~0;
+	}
+}
+
+static void ocp_write(struct rtl8169_private *tp, u8 mask, u16 reg, u32 data)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_27:
+	case RTL_GIGA_MAC_VER_28:
+	case RTL_GIGA_MAC_VER_31:
+		RTL_W32(OCPDR, data);
+		RTL_W32(OCPAR, OCPAR_FLAG | ((u32)mask & 0x0f) << 12 |
+			(reg & 0x0fff));
+		rtl_udelay_loop_wait_low(tp, &rtl_ocpar_cond, 100, 20);
+		break;
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
+		rtl_eri_write(tp, reg, ((u32)mask & 0x0f) << ERIAR_MASK_SHIFT,
+			      data, ERIAR_OOB);
+		break;
+	default:
+		BUG();
+		break;
+	}
+}
+
 static void rtl8168_oob_notify(struct rtl8169_private *tp, u8 cmd)
 {
 	rtl_eri_write(tp, 0xe8, ERIAR_MASK_0001, cmd, ERIAR_EXGMAC);
@@ -1301,25 +1344,84 @@ DECLARE_RTL_COND(rtl_ocp_read_cond)
 	return ocp_read(tp, 0x0f, reg) & 0x00000800;
 }
 
-static void rtl8168_driver_start(struct rtl8169_private *tp)
+DECLARE_RTL_COND(rtl_ep_ocp_read_cond)
 {
-	rtl8168_oob_notify(tp, OOB_CMD_DRIVER_START);
+	return ocp_read(tp, 0x0f, 0x124) & 0x00000001;
+}
+
+DECLARE_RTL_COND(rtl_ocp_tx_cond)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
 
-	rtl_msleep_loop_wait_high(tp, &rtl_ocp_read_cond, 10, 10);
+	return RTL_R8(IBISR0) & 0x02;
+}
+
+static void rtl8168_driver_start(struct rtl8169_private *tp)
+{
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_27:
+	case RTL_GIGA_MAC_VER_28:
+	case RTL_GIGA_MAC_VER_31:
+		rtl8168_oob_notify(tp, OOB_CMD_DRIVER_START);
+		rtl_msleep_loop_wait_high(tp, &rtl_ocp_read_cond, 10, 10);
+		break;
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
+		ocp_write(tp, 0x01, 0x180, OOB_CMD_DRIVER_START);
+		ocp_write(tp, 0x01, 0x30, ocp_read(tp, 0x01, 0x30) | 0x01);
+		rtl_msleep_loop_wait_high(tp, &rtl_ep_ocp_read_cond, 10, 10);
+		break;
+	default:
+		BUG();
+		break;
+	}
 }
 
 static void rtl8168_driver_stop(struct rtl8169_private *tp)
 {
-	rtl8168_oob_notify(tp, OOB_CMD_DRIVER_STOP);
+	void __iomem *ioaddr = tp->mmio_addr;
 
-	rtl_msleep_loop_wait_low(tp, &rtl_ocp_read_cond, 10, 10);
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_27:
+	case RTL_GIGA_MAC_VER_28:
+	case RTL_GIGA_MAC_VER_31:
+		rtl8168_oob_notify(tp, OOB_CMD_DRIVER_STOP);
+		rtl_msleep_loop_wait_low(tp, &rtl_ocp_read_cond, 10, 10);
+		break;
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
+		RTL_W8(IBCR2, RTL_R8(IBCR2) & ~0x01);
+		rtl_msleep_loop_wait_low(tp, &rtl_ocp_tx_cond, 50, 2000);
+		RTL_W8(IBISR0, RTL_R8(IBISR0) | 0x20);
+		RTL_W8(IBCR0, RTL_R8(IBCR0) & ~0x01);
+		ocp_write(tp, 0x01, 0x180, OOB_CMD_DRIVER_STOP);
+		ocp_write(tp, 0x01, 0x30, ocp_read(tp, 0x01, 0x30) | 0x01);
+		rtl_msleep_loop_wait_low(tp, &rtl_ep_ocp_read_cond, 10, 10);
+		break;
+	default:
+		BUG();
+		break;
+	}
 }
 
 static int r8168_check_dash(struct rtl8169_private *tp)
 {
 	u16 reg = rtl8168_get_ocp_reg(tp);
 
-	return (ocp_read(tp, 0x0f, reg) & 0x00008000) ? 1 : 0;
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_27:
+	case RTL_GIGA_MAC_VER_28:
+	case RTL_GIGA_MAC_VER_31:
+		return (ocp_read(tp, 0x0f, reg) & 0x00008000) ? 1 : 0;
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
+		return (ocp_read(tp, 0x0f, 0x128) & 0x00000001) ? 1 : 0;
+	default:
+		return 0;
+	}
 }
 
 struct exgmac_reg {
@@ -1553,6 +1655,9 @@ static u32 __rtl8169_get_wol(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_46:
 	case RTL_GIGA_MAC_VER_47:
 	case RTL_GIGA_MAC_VER_48:
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 		if (rtl_eri_read(tp, 0xdc, ERIAR_EXGMAC) & MagicPacket_v2)
 			wolopts |= WAKE_MAGIC;
 		break;
@@ -1620,6 +1725,9 @@ static void __rtl8169_set_wol(struct rtl8169_private *tp, u32 wolopts)
 	case RTL_GIGA_MAC_VER_46:
 	case RTL_GIGA_MAC_VER_47:
 	case RTL_GIGA_MAC_VER_48:
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 		tmp = ARRAY_SIZE(cfg) - 1;
 		if (wolopts & WAKE_MAGIC)
 			rtl_w0w1_eri(tp,
@@ -2126,6 +2234,11 @@ static void rtl8169_get_mac_version(struct rtl8169_private *tp,
 		u32 val;
 		int mac_version;
 	} mac_info[] = {
+		/* 8168EP family. */
+		{ 0x7cf00000, 0x50200000,	RTL_GIGA_MAC_VER_51 },
+		{ 0x7cf00000, 0x50100000,	RTL_GIGA_MAC_VER_50 },
+		{ 0x7cf00000, 0x50000000,	RTL_GIGA_MAC_VER_49 },
+
 		/* 8168H family. */
 		{ 0x7cf00000, 0x54100000,	RTL_GIGA_MAC_VER_46 },
 		{ 0x7cf00000, 0x54000000,	RTL_GIGA_MAC_VER_45 },
@@ -3741,6 +3854,139 @@ static void rtl8168h_2_hw_phy_config(struct rtl8169_private *tp)
 	rtl_writephy(tp, 0x1f, 0x0000);
 }
 
+static void rtl8168ep_1_hw_phy_config(struct rtl8169_private *tp)
+{
+	/* Enable PHY auto speed down */
+	rtl_writephy(tp, 0x1f, 0x0a44);
+	rtl_w0w1_phy(tp, 0x11, 0x000c, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* patch 10M & ALDPS */
+	rtl_writephy(tp, 0x1f, 0x0bcc);
+	rtl_w0w1_phy(tp, 0x14, 0x0000, 0x0100);
+	rtl_writephy(tp, 0x1f, 0x0a44);
+	rtl_w0w1_phy(tp, 0x11, 0x00c0, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0a43);
+	rtl_writephy(tp, 0x13, 0x8084);
+	rtl_w0w1_phy(tp, 0x14, 0x0000, 0x6000);
+	rtl_w0w1_phy(tp, 0x10, 0x1003, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* Enable EEE auto-fallback function */
+	rtl_writephy(tp, 0x1f, 0x0a4b);
+	rtl_w0w1_phy(tp, 0x11, 0x0004, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* Enable UC LPF tune function */
+	rtl_writephy(tp, 0x1f, 0x0a43);
+	rtl_writephy(tp, 0x13, 0x8012);
+	rtl_w0w1_phy(tp, 0x14, 0x8000, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* set rg_sel_sdm_rate */
+	rtl_writephy(tp, 0x1f, 0x0c42);
+	rtl_w0w1_phy(tp, 0x11, 0x4000, 0x2000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* Check ALDPS bit, disable it if enabled */
+	rtl_writephy(tp, 0x1f, 0x0a43);
+	if (rtl_readphy(tp, 0x10) & 0x0004)
+		rtl_w0w1_phy(tp, 0x10, 0x0000, 0x0004);
+
+	rtl_writephy(tp, 0x1f, 0x0000);
+}
+
+static void rtl8168ep_2_hw_phy_config(struct rtl8169_private *tp)
+{
+	/* patch 10M & ALDPS */
+	rtl_writephy(tp, 0x1f, 0x0bcc);
+	rtl_w0w1_phy(tp, 0x14, 0x0000, 0x0100);
+	rtl_writephy(tp, 0x1f, 0x0a44);
+	rtl_w0w1_phy(tp, 0x11, 0x00c0, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0a43);
+	rtl_writephy(tp, 0x13, 0x8084);
+	rtl_w0w1_phy(tp, 0x14, 0x0000, 0x6000);
+	rtl_w0w1_phy(tp, 0x10, 0x1003, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* Enable UC LPF tune function */
+	rtl_writephy(tp, 0x1f, 0x0a43);
+	rtl_writephy(tp, 0x13, 0x8012);
+	rtl_w0w1_phy(tp, 0x14, 0x8000, 0x0000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* Set rg_sel_sdm_rate */
+	rtl_writephy(tp, 0x1f, 0x0c42);
+	rtl_w0w1_phy(tp, 0x11, 0x4000, 0x2000);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* Channel estimation parameters */
+	rtl_writephy(tp, 0x1f, 0x0a43);
+	rtl_writephy(tp, 0x13, 0x80f3);
+	rtl_w0w1_phy(tp, 0x14, 0x8b00, ~0x8bff);
+	rtl_writephy(tp, 0x13, 0x80f0);
+	rtl_w0w1_phy(tp, 0x14, 0x3a00, ~0x3aff);
+	rtl_writephy(tp, 0x13, 0x80ef);
+	rtl_w0w1_phy(tp, 0x14, 0x0500, ~0x05ff);
+	rtl_writephy(tp, 0x13, 0x80f6);
+	rtl_w0w1_phy(tp, 0x14, 0x6e00, ~0x6eff);
+	rtl_writephy(tp, 0x13, 0x80ec);
+	rtl_w0w1_phy(tp, 0x14, 0x6800, ~0x68ff);
+	rtl_writephy(tp, 0x13, 0x80ed);
+	rtl_w0w1_phy(tp, 0x14, 0x7c00, ~0x7cff);
+	rtl_writephy(tp, 0x13, 0x80f2);
+	rtl_w0w1_phy(tp, 0x14, 0xf400, ~0xf4ff);
+	rtl_writephy(tp, 0x13, 0x80f4);
+	rtl_w0w1_phy(tp, 0x14, 0x8500, ~0x85ff);
+	rtl_writephy(tp, 0x1f, 0x0a43);
+	rtl_writephy(tp, 0x13, 0x8110);
+	rtl_w0w1_phy(tp, 0x14, 0xa800, ~0xa8ff);
+	rtl_writephy(tp, 0x13, 0x810f);
+	rtl_w0w1_phy(tp, 0x14, 0x1d00, ~0x1dff);
+	rtl_writephy(tp, 0x13, 0x8111);
+	rtl_w0w1_phy(tp, 0x14, 0xf500, ~0xf5ff);
+	rtl_writephy(tp, 0x13, 0x8113);
+	rtl_w0w1_phy(tp, 0x14, 0x6100, ~0x61ff);
+	rtl_writephy(tp, 0x13, 0x8115);
+	rtl_w0w1_phy(tp, 0x14, 0x9200, ~0x92ff);
+	rtl_writephy(tp, 0x13, 0x810e);
+	rtl_w0w1_phy(tp, 0x14, 0x0400, ~0x04ff);
+	rtl_writephy(tp, 0x13, 0x810c);
+	rtl_w0w1_phy(tp, 0x14, 0x7C00, ~0x7Cff);
+	rtl_writephy(tp, 0x13, 0x810b);
+	rtl_w0w1_phy(tp, 0x14, 0x5a00, ~0x5aff);
+	rtl_writephy(tp, 0x1f, 0x0a43);
+	rtl_writephy(tp, 0x13, 0x80d1);
+	rtl_w0w1_phy(tp, 0x14, 0xff00, ~0xffff);
+	rtl_writephy(tp, 0x13, 0x80cd);
+	rtl_w0w1_phy(tp, 0x14, 0x9e00, ~0x9eff);
+	rtl_writephy(tp, 0x13, 0x80d3);
+	rtl_w0w1_phy(tp, 0x14, 0x0e00, ~0x0eff);
+	rtl_writephy(tp, 0x13, 0x80d5);
+	rtl_w0w1_phy(tp, 0x14, 0xca00, ~0xcaff);
+	rtl_writephy(tp, 0x13, 0x80d7);
+	rtl_w0w1_phy(tp, 0x14, 0x8400, ~0x84ff);
+
+	/* Force PWM-mode */
+	rtl_writephy(tp, 0x1f, 0x0bcd);
+	rtl_writephy(tp, 0x14, 0x5065);
+	rtl_writephy(tp, 0x14, 0xd065);
+	rtl_writephy(tp, 0x1f, 0x0bC8);
+	rtl_writephy(tp, 0x12, 0x00ed);
+	rtl_writephy(tp, 0x1f, 0x0bcd);
+	rtl_writephy(tp, 0x14, 0x1065);
+	rtl_writephy(tp, 0x14, 0x9065);
+	rtl_writephy(tp, 0x14, 0x1065);
+	rtl_writephy(tp, 0x1f, 0x0000);
+
+	/* Check ALDPS bit, disable it if enabled */
+	rtl_writephy(tp, 0x1f, 0x0a43);
+	if (rtl_readphy(tp, 0x10) & 0x0004)
+		rtl_w0w1_phy(tp, 0x10, 0x0000, 0x0004);
+
+	rtl_writephy(tp, 0x1f, 0x0000);
+}
+
 static void rtl8102e_hw_phy_config(struct rtl8169_private *tp)
 {
 	static const struct phy_reg phy_reg_init[] = {
@@ -3940,6 +4186,14 @@ static void rtl_hw_phy_config(struct net_device *dev)
 		rtl8168h_2_hw_phy_config(tp);
 		break;
 
+	case RTL_GIGA_MAC_VER_49:
+		rtl8168ep_1_hw_phy_config(tp);
+		break;
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
+		rtl8168ep_2_hw_phy_config(tp);
+		break;
+
 	case RTL_GIGA_MAC_VER_41:
 	default:
 		break;
@@ -4154,6 +4408,9 @@ static void rtl_init_mdio_ops(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_46:
 	case RTL_GIGA_MAC_VER_47:
 	case RTL_GIGA_MAC_VER_48:
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 		ops->write	= r8168g_mdio_write;
 		ops->read	= r8168g_mdio_read;
 		break;
@@ -4212,6 +4469,9 @@ static void rtl_wol_suspend_quirk(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_46:
 	case RTL_GIGA_MAC_VER_47:
 	case RTL_GIGA_MAC_VER_48:
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 		RTL_W32(RxConfig, RTL_R32(RxConfig) |
 			AcceptBroadcast | AcceptMulticast | AcceptMyPhys);
 		break;
@@ -4356,7 +4616,10 @@ static void r8168_pll_power_down(struct rtl8169_private *tp)
 
 	if ((tp->mac_version == RTL_GIGA_MAC_VER_27 ||
 	     tp->mac_version == RTL_GIGA_MAC_VER_28 ||
-	     tp->mac_version == RTL_GIGA_MAC_VER_31) &&
+	     tp->mac_version == RTL_GIGA_MAC_VER_31 ||
+	     tp->mac_version == RTL_GIGA_MAC_VER_49 ||
+	     tp->mac_version == RTL_GIGA_MAC_VER_50 ||
+	     tp->mac_version == RTL_GIGA_MAC_VER_51) &&
 	    r8168_check_dash(tp)) {
 		return;
 	}
@@ -4387,10 +4650,13 @@ static void r8168_pll_power_down(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_44:
 	case RTL_GIGA_MAC_VER_45:
 	case RTL_GIGA_MAC_VER_46:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 		RTL_W8(PMCH, RTL_R8(PMCH) & ~0x80);
 		break;
 	case RTL_GIGA_MAC_VER_40:
 	case RTL_GIGA_MAC_VER_41:
+	case RTL_GIGA_MAC_VER_49:
 		rtl_w0w1_eri(tp, 0x1a8, ERIAR_MASK_1111, 0x00000000,
 			     0xfc000000, ERIAR_EXGMAC);
 		RTL_W8(PMCH, RTL_R8(PMCH) & ~0x80);
@@ -4415,10 +4681,13 @@ static void r8168_pll_power_up(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_44:
 	case RTL_GIGA_MAC_VER_45:
 	case RTL_GIGA_MAC_VER_46:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 		RTL_W8(PMCH, RTL_R8(PMCH) | 0xc0);
 		break;
 	case RTL_GIGA_MAC_VER_40:
 	case RTL_GIGA_MAC_VER_41:
+	case RTL_GIGA_MAC_VER_49:
 		RTL_W8(PMCH, RTL_R8(PMCH) | 0xc0);
 		rtl_w0w1_eri(tp, 0x1a8, ERIAR_MASK_1111, 0xfc000000,
 			     0x00000000, ERIAR_EXGMAC);
@@ -4493,6 +4762,9 @@ static void rtl_init_pll_power_ops(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_44:
 	case RTL_GIGA_MAC_VER_45:
 	case RTL_GIGA_MAC_VER_46:
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 		ops->down	= r8168_pll_power_down;
 		ops->up		= r8168_pll_power_up;
 		break;
@@ -4547,6 +4819,9 @@ static void rtl_init_rxcfg(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_46:
 	case RTL_GIGA_MAC_VER_47:
 	case RTL_GIGA_MAC_VER_48:
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 		RTL_W32(RxConfig, RX128_INT_EN | RX_DMA_BURST | RX_EARLY_OFF);
 		break;
 	default:
@@ -4712,6 +4987,9 @@ static void rtl_init_jumbo_ops(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_46:
 	case RTL_GIGA_MAC_VER_47:
 	case RTL_GIGA_MAC_VER_48:
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 	default:
 		ops->disable	= NULL;
 		ops->enable	= NULL;
@@ -4828,7 +5106,10 @@ static void rtl8169_hw_reset(struct rtl8169_private *tp)
 		   tp->mac_version == RTL_GIGA_MAC_VER_45 ||
 		   tp->mac_version == RTL_GIGA_MAC_VER_46 ||
 		   tp->mac_version == RTL_GIGA_MAC_VER_47 ||
-		   tp->mac_version == RTL_GIGA_MAC_VER_48) {
+		   tp->mac_version == RTL_GIGA_MAC_VER_48 ||
+		   tp->mac_version == RTL_GIGA_MAC_VER_49 ||
+		   tp->mac_version == RTL_GIGA_MAC_VER_50 ||
+		   tp->mac_version == RTL_GIGA_MAC_VER_51) {
 		RTL_W8(ChipCmd, RTL_R8(ChipCmd) | StopReq);
 		rtl_udelay_loop_wait_high(tp, &rtl_txcfg_empty_cond, 100, 666);
 	} else {
@@ -5754,6 +6035,120 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
 	r8168_mac_ocp_write(tp, 0xc09e, 0x0000);
 }
 
+static void rtl_hw_start_8168ep(struct rtl8169_private *tp)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+	struct pci_dev *pdev = tp->pci_dev;
+
+	RTL_W32(TxConfig, RTL_R32(TxConfig) | TXCFG_AUTO_FIFO);
+
+	rtl_eri_write(tp, 0xc8, ERIAR_MASK_0101, 0x00080002, ERIAR_EXGMAC);
+	rtl_eri_write(tp, 0xcc, ERIAR_MASK_0001, 0x2f, ERIAR_EXGMAC);
+	rtl_eri_write(tp, 0xd0, ERIAR_MASK_0001, 0x5f, ERIAR_EXGMAC);
+	rtl_eri_write(tp, 0xe8, ERIAR_MASK_1111, 0x00100006, ERIAR_EXGMAC);
+
+	rtl_csi_access_enable_1(tp);
+
+	rtl_tx_performance_tweak(pdev, 0x5 << MAX_READ_REQUEST_SHIFT);
+
+	rtl_w0w1_eri(tp, 0xdc, ERIAR_MASK_0001, 0x00, 0x01, ERIAR_EXGMAC);
+	rtl_w0w1_eri(tp, 0xdc, ERIAR_MASK_0001, 0x01, 0x00, ERIAR_EXGMAC);
+
+	rtl_w0w1_eri(tp, 0xd4, ERIAR_MASK_1111, 0x1f80, 0x00, ERIAR_EXGMAC);
+
+	rtl_eri_write(tp, 0x5f0, ERIAR_MASK_0011, 0x4f87, ERIAR_EXGMAC);
+
+	RTL_W8(ChipCmd, CmdTxEnb | CmdRxEnb);
+	RTL_W32(MISC, RTL_R32(MISC) & ~RXDV_GATED_EN);
+	RTL_W8(MaxTxPacketSize, EarlySize);
+
+	rtl_eri_write(tp, 0xc0, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
+	rtl_eri_write(tp, 0xb8, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
+
+	/* Adjust EEE LED frequency */
+	RTL_W8(EEE_LED, RTL_R8(EEE_LED) & ~0x07);
+
+	rtl_w0w1_eri(tp, 0x2fc, ERIAR_MASK_0001, 0x01, 0x06, ERIAR_EXGMAC);
+
+	RTL_W8(DLLPR, RTL_R8(DLLPR) & ~TX_10M_PS_EN);
+
+	rtl_pcie_state_l2l3_enable(tp, false);
+}
+
+static void rtl_hw_start_8168ep_1(struct rtl8169_private *tp)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+	static const struct ephy_info e_info_8168ep_1[] = {
+		{ 0x00, 0xffff,	0x10ab },
+		{ 0x06, 0xffff,	0xf030 },
+		{ 0x08, 0xffff,	0x2006 },
+		{ 0x0d, 0xffff,	0x1666 },
+		{ 0x0c, 0x3ff0,	0x0000 }
+	};
+
+	/* disable aspm and clock request before access ephy */
+	RTL_W8(Config2, RTL_R8(Config2) & ~ClkReqEn);
+	RTL_W8(Config5, RTL_R8(Config5) & ~ASPM_en);
+	rtl_ephy_init(tp, e_info_8168ep_1, ARRAY_SIZE(e_info_8168ep_1));
+
+	rtl_hw_start_8168ep(tp);
+}
+
+static void rtl_hw_start_8168ep_2(struct rtl8169_private *tp)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+	static const struct ephy_info e_info_8168ep_2[] = {
+		{ 0x00, 0xffff,	0x10a3 },
+		{ 0x19, 0xffff,	0xfc00 },
+		{ 0x1e, 0xffff,	0x20ea }
+	};
+
+	/* disable aspm and clock request before access ephy */
+	RTL_W8(Config2, RTL_R8(Config2) & ~ClkReqEn);
+	RTL_W8(Config5, RTL_R8(Config5) & ~ASPM_en);
+	rtl_ephy_init(tp, e_info_8168ep_2, ARRAY_SIZE(e_info_8168ep_2));
+
+	rtl_hw_start_8168ep(tp);
+
+	RTL_W8(DLLPR, RTL_R8(DLLPR) & ~PFM_EN);
+	RTL_W8(DLLPR, RTL_R8(MISC_1) & ~PFM_D3COLD_EN);
+}
+
+static void rtl_hw_start_8168ep_3(struct rtl8169_private *tp)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+	u32 data;
+	static const struct ephy_info e_info_8168ep_3[] = {
+		{ 0x00, 0xffff,	0x10a3 },
+		{ 0x19, 0xffff,	0x7c00 },
+		{ 0x1e, 0xffff,	0x20eb },
+		{ 0x0d, 0xffff,	0x1666 }
+	};
+
+	/* disable aspm and clock request before access ephy */
+	RTL_W8(Config2, RTL_R8(Config2) & ~ClkReqEn);
+	RTL_W8(Config5, RTL_R8(Config5) & ~ASPM_en);
+	rtl_ephy_init(tp, e_info_8168ep_3, ARRAY_SIZE(e_info_8168ep_3));
+
+	rtl_hw_start_8168ep(tp);
+
+	RTL_W8(DLLPR, RTL_R8(DLLPR) & ~PFM_EN);
+	RTL_W8(DLLPR, RTL_R8(MISC_1) & ~PFM_D3COLD_EN);
+
+	data = r8168_mac_ocp_read(tp, 0xd3e2);
+	data &= 0xf000;
+	data |= 0x0271;
+	r8168_mac_ocp_write(tp, 0xd3e2, data);
+
+	data = r8168_mac_ocp_read(tp, 0xd3e4);
+	data &= 0xff00;
+	r8168_mac_ocp_write(tp, 0xd3e4, data);
+
+	data = r8168_mac_ocp_read(tp, 0xe860);
+	data |= 0x0080;
+	r8168_mac_ocp_write(tp, 0xe860, data);
+}
+
 static void rtl_hw_start_8168(struct net_device *dev)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
@@ -5869,6 +6264,18 @@ static void rtl_hw_start_8168(struct net_device *dev)
 		rtl_hw_start_8168h_1(tp);
 		break;
 
+	case RTL_GIGA_MAC_VER_49:
+		rtl_hw_start_8168ep_1(tp);
+		break;
+
+	case RTL_GIGA_MAC_VER_50:
+		rtl_hw_start_8168ep_2(tp);
+		break;
+
+	case RTL_GIGA_MAC_VER_51:
+		rtl_hw_start_8168ep_3(tp);
+		break;
+
 	default:
 		printk(KERN_ERR PFX "%s: unknown chipset (mac_version = %d).\n",
 			dev->name, tp->mac_version);
@@ -7399,7 +7806,10 @@ static void rtl_remove_one(struct pci_dev *pdev)
 
 	if ((tp->mac_version == RTL_GIGA_MAC_VER_27 ||
 	     tp->mac_version == RTL_GIGA_MAC_VER_28 ||
-	     tp->mac_version == RTL_GIGA_MAC_VER_31) &&
+	     tp->mac_version == RTL_GIGA_MAC_VER_31 ||
+	     tp->mac_version == RTL_GIGA_MAC_VER_49 ||
+	     tp->mac_version == RTL_GIGA_MAC_VER_50 ||
+	     tp->mac_version == RTL_GIGA_MAC_VER_51) &&
 	    r8168_check_dash(tp)) {
 		rtl8168_driver_stop(tp);
 	}
@@ -7556,6 +7966,9 @@ static void rtl_hw_initialize(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_46:
 	case RTL_GIGA_MAC_VER_47:
 	case RTL_GIGA_MAC_VER_48:
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 		rtl_hw_init_8168g(tp);
 		break;
 
@@ -7708,6 +8121,9 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	case RTL_GIGA_MAC_VER_46:
 	case RTL_GIGA_MAC_VER_47:
 	case RTL_GIGA_MAC_VER_48:
+	case RTL_GIGA_MAC_VER_49:
+	case RTL_GIGA_MAC_VER_50:
+	case RTL_GIGA_MAC_VER_51:
 		if (rtl_eri_read(tp, 0xdc, ERIAR_EXGMAC) & MagicPacket_v2)
 			tp->features |= RTL_FEATURE_WOL;
 		if ((RTL_R8(Config3) & LinkUp) != 0)
@@ -7756,7 +8172,10 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	    tp->mac_version == RTL_GIGA_MAC_VER_45 ||
 	    tp->mac_version == RTL_GIGA_MAC_VER_46 ||
 	    tp->mac_version == RTL_GIGA_MAC_VER_47 ||
-	    tp->mac_version == RTL_GIGA_MAC_VER_48) {
+	    tp->mac_version == RTL_GIGA_MAC_VER_48 ||
+	    tp->mac_version == RTL_GIGA_MAC_VER_49 ||
+	    tp->mac_version == RTL_GIGA_MAC_VER_50 ||
+	    tp->mac_version == RTL_GIGA_MAC_VER_51) {
 		u16 mac_addr[3];
 
 		*(u32 *)&mac_addr[0] = rtl_eri_read(tp, 0xe0, ERIAR_EXGMAC);
@@ -7835,7 +8254,10 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	if ((tp->mac_version == RTL_GIGA_MAC_VER_27 ||
 	     tp->mac_version == RTL_GIGA_MAC_VER_28 ||
-	     tp->mac_version == RTL_GIGA_MAC_VER_31) &&
+	     tp->mac_version == RTL_GIGA_MAC_VER_31 ||
+	     tp->mac_version == RTL_GIGA_MAC_VER_49 ||
+	     tp->mac_version == RTL_GIGA_MAC_VER_50 ||
+	     tp->mac_version == RTL_GIGA_MAC_VER_51) &&
 	    r8168_check_dash(tp)) {
 		rtl8168_driver_start(tp);
 	}
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH v1 0/4] Enable FEC pps ouput
From: Richard Cochran @ 2014-10-03  8:23 UTC (permalink / raw)
  To: Luwei Zhou
  Cc: davem, netdev, shawn.guo, bhutchings, R49496, b38611, b20596,
	stephen
In-Reply-To: <1411632621-17429-1-git-send-email-b45643@freescale.com>

On Thu, Sep 25, 2014 at 04:10:17PM +0800, Luwei Zhou wrote:
> This patch does:
> 	- Replace 32-bit free-running PTP timer with 31-bit.
> 	- Implement hardware PTP timestamp adjustment.
> 	- Enable PPS output based on hardware adjustment.
> 
> 
> Luwei Zhou (4):
>   net: fec: ptp: Use the 31-bit ptp timer.
>   net: fec: ptp: Use hardware algorithm to adjust PTP counter.
>   net: fec: ptp: Enalbe PPS ouput based on ptp clock
                   ^^^^^^     ^^^^^
                   Enable     output

Also, it looks like you have implemented the wrong feature.

The "pps" capability means that the clock provides a PPS event
(interrupt) to the kernel.

IOW, if .pps==1, then you must call ptp_clock_event() with event type
PTP_CLOCK_PPS.

Your patch set seems to be creating a periodic output and not a PPS
kernel callback.

You should clarify what you are trying to do, and then implement the
appropriate interface in your driver.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH v1 5/5] driver-core: add driver asynchronous probe support
From: Tom Gundersen @ 2014-10-03  8:23 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: One Thousand Gnomes, Takashi Iwai, Kay Sievers, pmladek, LKML,
	Michal Hocko, Praveen Krishnamoorthy, hare, Nagalakshmi Nandigama,
	werner, Tetsuo Handa, mpt-fusionlinux.pdl, Tim Gardner,
	Benjamin Poirier, Santosh Rastapur, Casey Leedom, Hariprasad S,
	Pierre Fersing, dbueso, Sreekanth Reddy, Arjan van de Ven,
	Abhijit Mahajan, systemd 
In-Reply-To: <20141002200612.GQ14081@wotan.suse.de>

On Thu, Oct 2, 2014 at 10:06 PM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
> On Thu, Oct 02, 2014 at 08:12:37AM +0200, Tom Gundersen wrote:
>> Making kmod a special case is of course possible. However, as long as
>> there is no fundamental reason why kmod should get this special
>> treatment, this just looks like a work-around to me.
>
> I've mentioned a series of five reasons why its a bad idea right now to
> sigkill modules [0], we're reviewed them each and still at least
> items 2-4 remain particularly valid fundamental reasons to avoid it

So items 2-4 basically say "there currently are drivers that cannot
deal with sigkill after a three minute timeout".

In the short-term we already have the solution: increase the timeout.
In the long-term, we have two choices, either permanently add some
heuristic to udev to deal with drivers taking a very long time to be
inserted, or fix the drivers not to take such a long time. A priori,
it makes no sense to me that drivers spend unbounded amounts of time
to get inserted, so fixing the drivers seems like the most reasonable
approach to me. That said, I'm of course open to be proven wrong if
there are some drivers that fundamentally _must_ take a long time to
insert (but we should then discuss why that is and how we can best
deal with the situation, rather than adding some hack up-front when we
don't even know if it is needed).

Your patch series should go a long way towards fixing the drivers (and
I imagine there being a lot of low-hanging fruit that can easily be
fixed once your series has landed), and the fact that we have now
increased the udev timeout from 30 to 180 seconds should also greatly
reduce the problem.

Cheers,

Tom

^ permalink raw reply

* Re: netlink NETLINK_ROUTE  failure & Can the kernel really handle IPv6 properly
From: Ulf Samuelsson @ 2014-10-03  7:59 UTC (permalink / raw)
  To: Dan Williams, Ulf samuelsson; +Cc: Hannes Frederic Sowa, Linux Netdev List
In-Reply-To: <1412264648.7227.9.camel@dcbw.local>


On 10/02/2014 05:44 PM, Dan Williams wrote:
> On Thu, 2014-10-02 at 14:38 +0200, Ulf samuelsson wrote:
>> This is the significant code, and it is wrong.
>>
>> static struct notifier_block my_ipv6_address_notifier =
>>   {
>>     my_ipv6_address_notifier_cb,
>>     NULL,
>>     0
>>   };
>>
>> register_inet6addr_notifier (&my_ipv6_address_notifier );
>>
>> int
>> my_ipv6_address_notifier_cb (struct notifier_block *self,
>>                                  unsigned long event, void *val)
>> {
>>   struct inet6_ifaddr *ifaddr = (struct inet6_ifaddr *)val;
>>
>>
>>   /* We are only interested in address add/delete events */
>>   /* IPv6 address add comes as NETDEV_UP and delete comes as
>>    * NETDEV_DOWN
>>    */
>>   if ((event != NETDEV_UP) && (event != NETDEV_DOWN))
>>     return ret;
>>
>>   if (ifaddr == NULL)
>>     return ret;
>>   /* Now that we are sure that it is a IPv6 address being added deleted,
>>    * verify that it is a link local address.
>>    */
>>   if (!IPV6_IS_ADDR_LINKLOCAL (&ifaddr->addr))
>>     {
>>       return ret;
>>     }
>>   ...
>>   send_message_to_app(LINK_LOCAL_UP, ip);
>>   ...
>>   return ret;
>> }
>>
>>
>> Application tries to send message to "ip" and fails, because the link-local adress is still
>> in "tentative state"
> It seems to me that a better architecture would be to have the app
> itself listen for RTM_NEWADDR netlink event and look for lack of
> IFA_F_TENTATIVE in the IFA_FLAGS attribute.  Using a kernel module to do
> the same thing seems pretty wrong.
>
> Dan

OK, got more information. This is part of a fairly large and complex system.

The kernel module is a control module which collects information
both from the kernel and from H/W, and talks to a userspace interface 
manager.

There is a proprietary management application which is used to configure 
the stack.
This talks to the interface manager to handle different operations for IPv6.

The kernel module needs to know when interfaces are ready to use,
I.E: know when it exits "tentative" mode to do its job properly,
so the kernel module has to listen for RTM_NEWADDR.

In a simpler system, your approach would be OK.

Best Regards,
Ulf Samuelsson


>> Best Regards
>> Ulf Samuelsson
>> ulf@emagii.com
>> +46  (722) 427 437
>>
>>
>>> 1 okt 2014 kl. 23:30 skrev Hannes Frederic Sowa <hannes@stressinduktion.org>:
>>>
>>> Hello,
>>>
>>>> On Wed, Oct 1, 2014, at 22:28, Ulf Samuelsson wrote:
>>>> BTW, the problem I am trying to solve is how to connect to an I/F with
>>>> an IPv6 link-local address.
>>>>
>>>> An existing kernel module waits for a NETDEV_UP event, and then tries to
>>>> communicate
>>>> with the link-local address.
>>>>
>>>> This will fail, because (according to a colleague) the I/F enters a
>>>> "tentative" state,
>>>> where it is trying to decide if it is unique or not.
>>>> It will remain in that state for 1-2 seconds, and only afterwards is the
>>>> link-local address
>>>> available for normal use.
>>>>
>>>> The guys writing the module, claim that the kernel is using NETDEV_UP.
>>>> There is very little code in the kernel using NETLINK_ROUTE, even in
>>>> latest stable.
>>>> It is using NETDEV_UP.
>>>>
>>>> If my colleague is right, the kernel really cannot handle IPv6
>>>> link-local addresses properly.
>>> Sorry, I cannot really follow you, can you send example code or be a bit
>>> more precise?
>>>
>>> Thanks,
>>> Hannes
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* xen netback
From: Tomas Mozes @ 2014-10-03  8:00 UTC (permalink / raw)
  To: netdev

On some of our xen-dom0 machines running 3.14.x we have a problem of 
domU's freezing. This has been fixed in:
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/drivers/net/xen-netback/netback.c?id=59ae9fc67007da8b5aea7b0a31c3607745cfbfee

and the patch has been included in 3.16.x, but I cannot find it in the 
current longterm releases. We have an open gentoo bug report about it:
https://bugs.gentoo.org/show_bug.cgi?id=517852

This bug doesn't happen on all xen-dom0 machines, but I can confirm that 
on the problematic machine after applying the aforemention patch, the 
problem gone away and the system has been running for 60 days without 
problems.

Please consider applying the patch to the longterm kernels.

Thank you

^ permalink raw reply

* [PATCH net] ematch: Fix the matching of inverted containers (again).
From: Ignacy Gawędzki @ 2014-10-03  8:06 UTC (permalink / raw)
  To: netdev

The result of a negated container has to be inverted before checking for
early ending.

Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
---
 net/sched/ematch.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/sched/ematch.c b/net/sched/ematch.c
index ad57f44..300ecf6 100644
--- a/net/sched/ematch.c
+++ b/net/sched/ematch.c
@@ -526,11 +526,12 @@ pop_stack:
 		match_idx = stack[--stackp];
 		cur_match = tcf_em_get_match(tree, match_idx);
 
-		if (tcf_em_early_end(cur_match, res)) {
-			if (tcf_em_is_inverted(cur_match))
-				res = !res;
+		if (tcf_em_is_inverted(cur_match))
+			res = !res;
+
+		if (tcf_em_early_end(cur_match, res))
 			goto pop_stack;
-		} else {
+		else {
 			match_idx++;
 			goto proceed;
 		}
-- 
1.9.1

^ permalink raw reply related

* Re: bug: race in team_{notify_peers,mcast_rejoin} scheduling
From: Jiri Pirko @ 2014-10-03  8:04 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: netdev
In-Reply-To: <20141002145028.2f62838c@jlaw-desktop.mno.stratus.com>

Thu, Oct 02, 2014 at 08:50:28PM CEST, joe.lawrence@stratus.com wrote:
>Hello Jiri,
>
>Occasionally on boot I noticed that team_notify_peers_work would get
>*very* busy.
>
>With the following debugging added to team_notify_peers:
>
>        netdev_info(team->dev, "%s(%p)\n", __func__, team);
>        dump_stack();
>
>I saw the following:
>
>% dmesg | grep -e 'team[0-9]: team_notify_peers' -e 'port_enable' -e 'port_disable'
>[   68.340861] team0: team_notify_peers(ffff88104ffa4de0)
>[   68.743264]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
>[   69.622395] team0: team_notify_peers(ffff88104ffa4de0)
>[   69.966758]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
>[   71.099263] team0: team_notify_peers(ffff88104ffa4de0)
>[   71.466243]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
>[   72.383788] team0: team_notify_peers(ffff88104ffa4de0)
>[   72.744778]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
>[   73.476190] team0: team_notify_peers(ffff88104ffa4de0)
>[   73.830592]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
>[   74.796738] team1: team_notify_peers(ffff88104f5df080)
>[   75.165577]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
>[   75.694968] team1: team_notify_peers(ffff88104f5df080)
>[   75.694984]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
>[   77.316488] team1: team_notify_peers(ffff88104f5df080)
>[   77.663122]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
>[   78.470488] team1: team_notify_peers(ffff88104f5df080)
>[   78.814722]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
>[   82.690765] team2: team_notify_peers(ffff88083d24df40)
>[   83.083540]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
>[   83.942458] team2: team_notify_peers(ffff88083d24df40)
>[   84.286446]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
>[   86.089955] team3: team_notify_peers(ffff88083fd14de0)
>[   86.453495]  [<ffffffffa034fd38>] team_port_enable.part.40+0x78/0x90 [team]
>[   87.267773] team3: team_notify_peers(ffff88083fd14de0)
>[   87.610203]  [<ffffffffa034ef63>] team_port_disable+0x123/0x160 [team]
>
>which shows team_port_enable/disable getting invoked in short
>succession.  When looking at one of the team's
>notify_peers.count_pending value, I saw that it was negative and slowly
>counting down from 0xffff...ffff!
>
>This lead me believe that there is a race condition present in
>the .count_pending pattern that both team_notify_peers and
>team_mcast_rejoin employ.
>
>Can you comment on the following patch/workaround?

Well, I can't see a better solution. Adding is fine here I believe.

Acked-by: Jiri Pirko <jiri@resnulli.us>

Please send to patch with my ack and with:

Fixes: fc423ff00df3a19554414ee ("team: add peer notification")

So it can be pushed to stable.

Thanks!

>
>Thanks,
>
>-- Joe
>
>-->8-- -->8-- -->8--
>
>From b11d7dcd051a2f141c1eec0a43c4a4ddf0361d10 Mon Sep 17 00:00:00 2001
>From: Joe Lawrence <joe.lawrence@stratus.com>
>Date: Thu, 2 Oct 2014 14:24:26 -0400
>Subject: [PATCH] team: avoid race condition in scheduling delayed work
>
>When team_notify_peers and team_mcast_rejoin are called, they both reset
>their respective .count_pending atomic variable. Then when the actual
>worker function is executed, the variable is atomically decremented.
>This pattern introduces a potential race condition where the
>.count_pending rolls over and the worker function keeps rescheduling
>until .count_pending decrements to zero again:
>
>THREAD 1                           THREAD 2
>========                           ========
>team_notify_peers(teamX)
>  atomic_set count_pending = 1
>  schedule_delayed_work
>                                   team_notify_peers(teamX)
>                                   atomic_set count_pending = 1
>team_notify_peers_work
>  atomic_dec_and_test
>    count_pending = 0
>  (return)
>                                   schedule_delayed_work
>                                   team_notify_peers_work
>                                   atomic_dec_and_test
>                                     count_pending = -1
>                                   schedule_delayed_work
>                                   (repeat until count_pending = 0)
>
>Instead of assigning a new value to .count_pending, use atomic_add to
>tack-on the additional desired worker function invocations.
>
>Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
>---
> drivers/net/team/team.c |    4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index d46df38..2b87e3f 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -647,7 +647,7 @@ static void team_notify_peers(struct team *team)
> {
> 	if (!team->notify_peers.count || !netif_running(team->dev))
> 		return;
>-	atomic_set(&team->notify_peers.count_pending, team->notify_peers.count);
>+	atomic_add(team->notify_peers.count, &team->notify_peers.count_pending);
> 	schedule_delayed_work(&team->notify_peers.dw, 0);
> }
> 
>@@ -687,7 +687,7 @@ static void team_mcast_rejoin(struct team *team)
> {
> 	if (!team->mcast_rejoin.count || !netif_running(team->dev))
> 		return;
>-	atomic_set(&team->mcast_rejoin.count_pending, team->mcast_rejoin.count);
>+	atomic_add(team->mcast_rejoin.count, &team->mcast_rejoin.count_pending);
> 	schedule_delayed_work(&team->mcast_rejoin.dw, 0);
> }
> 
>-- 
>1.7.10.4
>
>
>

^ permalink raw reply

* Re: HW bridging support using notifiers?
From: Jiri Pirko @ 2014-10-03  7:53 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Florian Fainelli, netdev, davem, stephen, andy, tgraf, nbd,
	john.r.fastabend, edumazet, vyasevic, buytenh
In-Reply-To: <A97C2083-CFD9-4238-9C43-97EF43200124@cumulusnetworks.com>

Fri, Oct 03, 2014 at 07:13:35AM CEST, sfeldma@cumulusnetworks.com wrote:
>
>On Oct 2, 2014, at 6:48 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>> Hi all,
>> 
>> I am taking a look at adding HW bridging support to DSA, in way that's
>> usable outside of DSA.
>> 
>> Lennert's approach in 2008 [1] looks conceptually good to me,as he
>> noted, it uses a bunch of new ndo's which is not only limiting to one
>> ndo implementer per struct net_device, but also is mostly consuming the
>> information from the bridge layer, while the ndo is an action
>> 
>> So here's what I am up to now:
>> 
>> - use the NETDEV_JOIN notifier to discover when a bridge port is added
>> - use the NETDEV_LEAVE notifier, still need to verify this does not
>> break netconsole as indicated in net/bridge/br_if.c
>
>I can’t find a NETDEV_LEAVE...is this new?
>
>For this, rocker is using netdev_notifier and watching for NETDEV_CHANGEUPPER.  On CHANGEUPPER, use netdev_master_upper_dev_get(dev) to get master.  If master and master rtnl_link_ops->kind is “bridge”, then dev port is in bridge; otherwise port is not in bridge.  Of course, master could be “openvswitch”, if the driver swings both ways.

I agree that NETDEV_CHANGEUPPER should be used for this.

>
>Would this approach work for DSA?
>
>> - use the NETDEV_CHANGEINFODATA notifier to notify about STP state changes
>> 
>> Now, this raises a bunch of questions:
>> 
>> - we would need a getter to return the stp state of a given network
>> device when called with NETDEV_CHANGEINFODATA, is that acceptable? This
>> would be the first function exported by the bridge layer to expose
>> internal data
>> 
>> NB: this also raises the question of the race condition and locking
>> within br_set_stp_state() and when the network devices notifier callback
>> runs
>> 
>> - or do we need a new network device notifier accepting an opaque
>> pointer which could provide us with the data we what, something like
>> this: call_netdevices_notifier_data(NETDEV_CHANGEINFODATA, dev, info),
>> where info would be a structure/union telling what's this data about
>> 
>
>We need STP state change notification for rocker also, so we can install/remove STP/ARP filters on port state change.  Netdev_notifier would work.  I was also thinking about using Jiri’s ndo_swdev_flow_install/remove to install/remove the STP filters on the port, rather than using netdev_notifier.  In other words, does the HW bridge driver need to know the STP state or can it be dumb (stateless) and told when to accept STP BPDUs, or not, using swdev_flow construct:
>
>  dst_mac 01:80:c2:00:00:00 lasp 0x4242 in_port <port ifindex> actions output <br ifindex>
>
>and later when reaching LEARNING state:
>
>  dst_mac ff:ff:ff:ff:ff:ff eth_type ARP in_port <port ifindex> actions output <br ifindex>
>
>and finally when reaching FORWARDING state, the learned/static bridge fdbs:
>
>  dst_mac <neigh_mac> in_port <br ifindex> actions output <port ifindex>
>
>So driver doesn’t really know what STP is; it’s just installs/removes port filter when told to, using the common ndo_swdev_flow API.  The smarts stay in the bridge driver.
>
>
>> Let me know what you think, thanks!
>> 
>> [1]: http://patchwork.ozlabs.org/patch/16578/
>> --
>> Florian
>
>
>-scott
>
>
>

^ permalink raw reply

* Re: [RFC PATCH linux 2/2] fs/proc: use a hash table for the directory entries
From: Nicolas Dichtel @ 2014-10-03  7:28 UTC (permalink / raw)
  To: Eric W. Biederman, Alexey Dobriyan
  Cc: netdev, linux-kernel, davem, akpm, rui.xiang, viro, oleg,
	gorcunov, kirill.shutemov, grant.likely, tytso, Thierry Herbelot
In-Reply-To: <87h9zmji3q.fsf@x220.int.ebiederm.org>

Le 02/10/2014 23:07, Eric W. Biederman a écrit :
> Alexey Dobriyan <adobriyan@gmail.com> writes:
>
>> On Thu, Oct 02, 2014 at 11:01:50AM -0700, Eric W. Biederman wrote:
>>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>>>
>>>> From: Thierry Herbelot <thierry.herbelot@6wind.com>
>>>>
>>>> The current implementation for the directories in /proc is using a single
>>>> linked list. This is slow when handling directories with large numbers of
>>>> entries (eg netdevice-related entries when lots of tunnels are opened).
>>>>
>>>> This patch enables multiple linked lists. A hash based on the entry name is
>>>> used to select the linked list for one given entry.
>>>>
>>>> The speed creation of netdevices is faster as shorter linked lists must be
>>>> scanned when adding a new netdevice.
>>>
>>> Is the directory of primary concern /proc/net/dev/snmp6 ?
>>>
>>> Unless I have configured my networking stack weird by mistake that
>>> is the only directory under /proc/net that grows when we add an
>>> interface.
>>>
>>> I just want to make certain I am seeing the same things that you are
>>> seeing.
>>>
>>> I feel silly for overlooking this directory when the rest of the
>>> scalability work was done.
>>
>> Slowdown comes from "duplicate name" check:
>>
>>          for (tmp = dir->subdir; tmp; tmp = tmp->next)
>>                  if (strcmp(tmp->name, dp->name) == 0) {
>>                          WARN(1, "proc_dir_entry '%s/%s' already registered\n",
>>                                  dir->name, dp->name);
>>                          break;
>>                  }
>>
>> Removal can be made O(1) after switching to doubly-linked list.
>
> Yes.  There is the however unfortunate fact that proc directories exist
> to be used.  If we don't switch to a better data structure than a linked
> list the actual use will then opening of the files under
> /proc/net/dev/snmp6/ will become O(N^2).  Which doesn't help much
> (assuming those files are good for something).
>
> If those files aren't actually useful we should just make registering
> them a config option.  Deprecate them strongly and let only people who
> need extreme backwards compatibility enable them.
>
> Alexey do you know that those files aren't useful?  Unless we know
> otherwise we should make those files useful.
This was proposed and nacked in a first version:
http://thread.gmane.org/gmane.linux.network/285840/


Regards,
Nicolas

^ permalink raw reply

* <Reply ASAP>
From: Mr. James @ 2014-10-02 20:54 UTC (permalink / raw)
  To: Recipients

Sequel to your non-response to my previous email, I am re-sending this to you again thus; A deceased client of mine who died of a heart-related ailment about 3 years ago left behind some funds which I want you to  assist in retriving and distributing. Reply so I can give you details on what is needed to be done.

Regards,

James.

^ permalink raw reply

* Re: netlink NETLINK_ROUTE  failure & Can the kernel really handle IPv6 properly
From: Ulf Samuelsson @ 2014-10-03  5:59 UTC (permalink / raw)
  To: Dan Williams, Ulf samuelsson; +Cc: Hannes Frederic Sowa, Linux Netdev List
In-Reply-To: <1412264648.7227.9.camel@dcbw.local>


On 10/02/2014 05:44 PM, Dan Williams wrote:
> On Thu, 2014-10-02 at 14:38 +0200, Ulf samuelsson wrote:
>> This is the significant code, and it is wrong.
>>
>> static struct notifier_block my_ipv6_address_notifier =
>>   {
>>     my_ipv6_address_notifier_cb,
>>     NULL,
>>     0
>>   };
>>
>> register_inet6addr_notifier (&my_ipv6_address_notifier );
>>
>> int
>> my_ipv6_address_notifier_cb (struct notifier_block *self,
>>                                  unsigned long event, void *val)
>> {
>>   struct inet6_ifaddr *ifaddr = (struct inet6_ifaddr *)val;
>>
>>
>>   /* We are only interested in address add/delete events */
>>   /* IPv6 address add comes as NETDEV_UP and delete comes as
>>    * NETDEV_DOWN
>>    */
>>   if ((event != NETDEV_UP) && (event != NETDEV_DOWN))
>>     return ret;
>>
>>   if (ifaddr == NULL)
>>     return ret;
>>   /* Now that we are sure that it is a IPv6 address being added deleted,
>>    * verify that it is a link local address.
>>    */
>>   if (!IPV6_IS_ADDR_LINKLOCAL (&ifaddr->addr))
>>     {
>>       return ret;
>>     }
>>   ...
>>   send_message_to_app(LINK_LOCAL_UP, ip);
>>   ...
>>   return ret;
>> }
>>
>>
>> Application tries to send message to "ip" and fails, because the link-local adress is still
>> in "tentative state"
> It seems to me that a better architecture would be to have the app
> itself listen for RTM_NEWADDR netlink event and look for lack of
> IFA_F_TENTATIVE in the IFA_FLAGS attribute.  Using a kernel module to do
> the same thing seems pretty wrong.
>
> Dan

Thanks, Maybe you are right but I do not have all the details.

The kernel module is part of a driver for a Broadcom router chip.
The guys writing the driver might want to know as well.

I will see what they say.

Best Regards,
Ulf Samuelsson



>> Best Regards
>> Ulf Samuelsson
>> ulf@emagii.com
>> +46  (722) 427 437
>>
>>
>>> 1 okt 2014 kl. 23:30 skrev Hannes Frederic Sowa <hannes@stressinduktion.org>:
>>>
>>> Hello,
>>>
>>>> On Wed, Oct 1, 2014, at 22:28, Ulf Samuelsson wrote:
>>>> BTW, the problem I am trying to solve is how to connect to an I/F with
>>>> an IPv6 link-local address.
>>>>
>>>> An existing kernel module waits for a NETDEV_UP event, and then tries to
>>>> communicate
>>>> with the link-local address.
>>>>
>>>> This will fail, because (according to a colleague) the I/F enters a
>>>> "tentative" state,
>>>> where it is trying to decide if it is unique or not.
>>>> It will remain in that state for 1-2 seconds, and only afterwards is the
>>>> link-local address
>>>> available for normal use.
>>>>
>>>> The guys writing the module, claim that the kernel is using NETDEV_UP.
>>>> There is very little code in the kernel using NETLINK_ROUTE, even in
>>>> latest stable.
>>>> It is using NETDEV_UP.
>>>>
>>>> If my colleague is right, the kernel really cannot handle IPv6
>>>> link-local addresses properly.
>>> Sorry, I cannot really follow you, can you send example code or be a bit
>>> more precise?
>>>
>>> Thanks,
>>> Hannes
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v1 2/2] net: sched: replace ematch calls to use struct net
From: John Fastabend @ 2014-10-03  5:46 UTC (permalink / raw)
  To: xiyou.wangcong, davem; +Cc: netdev, jhs, eric.dumazet
In-Reply-To: <20141003054558.20925.67091.stgit@nitbit.x32>

We can not use tcf_proto tp in all cases from RCU callbacks
so this patch simplifies the ematch code paths to take a
'struct net' and then use it directly in em_ipset the only
user.

Previously, em_ipset took the 'net' attribute from the
qdisc embedded in the tcp_proto struct.

This resolves casis where tcf_em_destroy() was being used
in callbacks and referencing the possibly invalid tcf_proto
field.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/pkt_cls.h  |   13 ++++++-------
 net/sched/cls_basic.c  |    9 ++++-----
 net/sched/cls_cgroup.c |   10 +++++-----
 net/sched/cls_flow.c   |   12 ++++++------
 net/sched/em_canid.c   |    6 +++---
 net/sched/em_ipset.c   |    7 +++----
 net/sched/em_meta.c    |    4 ++--
 net/sched/em_nbyte.c   |    2 +-
 net/sched/em_text.c    |    4 ++--
 net/sched/ematch.c     |   18 +++++++++---------
 10 files changed, 41 insertions(+), 44 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index ef44ad9..f6ebcf3 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -229,12 +229,11 @@ struct tcf_ematch_tree {
 struct tcf_ematch_ops {
 	int			kind;
 	int			datalen;
-	int			(*change)(struct tcf_proto *, void *,
+	int			(*change)(struct net *, void *,
 					  int, struct tcf_ematch *);
 	int			(*match)(struct sk_buff *, struct tcf_ematch *,
 					 struct tcf_pkt_info *);
-	void			(*destroy)(struct tcf_proto *,
-					   struct tcf_ematch *);
+	void			(*destroy)(struct net *, struct tcf_ematch *);
 	int			(*dump)(struct sk_buff *, struct tcf_ematch *);
 	struct module		*owner;
 	struct list_head	link;
@@ -242,9 +241,9 @@ struct tcf_ematch_ops {
 
 int tcf_em_register(struct tcf_ematch_ops *);
 void tcf_em_unregister(struct tcf_ematch_ops *);
-int tcf_em_tree_validate(struct tcf_proto *, struct nlattr *,
+int tcf_em_tree_validate(struct net *, struct nlattr *,
 			 struct tcf_ematch_tree *);
-void tcf_em_tree_destroy(struct tcf_proto *, struct tcf_ematch_tree *);
+void tcf_em_tree_destroy(struct net *, struct tcf_ematch_tree *);
 int tcf_em_tree_dump(struct sk_buff *, struct tcf_ematch_tree *, int);
 int __tcf_em_tree_match(struct sk_buff *, struct tcf_ematch_tree *,
 			struct tcf_pkt_info *);
@@ -300,8 +299,8 @@ static inline int tcf_em_tree_match(struct sk_buff *skb,
 struct tcf_ematch_tree {
 };
 
-#define tcf_em_tree_validate(tp, tb, t) ((void)(t), 0)
-#define tcf_em_tree_destroy(tp, t) do { (void)(t); } while(0)
+#define tcf_em_tree_validate(net, tb, t) ((void)(t), 0)
+#define tcf_em_tree_destroy(net, t) do { (void)(t); } while(0)
 #define tcf_em_tree_dump(skb, t, tlv) (0)
 #define tcf_em_tree_change(tp, dst, src) do { } while(0)
 #define tcf_em_tree_match(skb, t, info) ((void)(info), 1)
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index 81ddfa6..f37e4fb 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -32,7 +32,7 @@ struct basic_filter {
 	struct tcf_exts		exts;
 	struct tcf_ematch_tree	ematches;
 	struct tcf_result	res;
-	struct tcf_proto	*tp;
+	struct net		*net;
 	struct list_head	link;
 	struct rcu_head		rcu;
 };
@@ -91,10 +91,9 @@ static int basic_init(struct tcf_proto *tp)
 static void basic_delete_filter(struct rcu_head *head)
 {
 	struct basic_filter *f = container_of(head, struct basic_filter, rcu);
-	struct tcf_proto *tp = f->tp;
 
 	tcf_exts_destroy(&f->exts);
-	tcf_em_tree_destroy(tp, &f->ematches);
+	tcf_em_tree_destroy(f->net, &f->ematches);
 	kfree(f);
 }
 
@@ -147,7 +146,7 @@ static int basic_set_parms(struct net *net, struct tcf_proto *tp,
 	if (err < 0)
 		return err;
 
-	err = tcf_em_tree_validate(tp, tb[TCA_BASIC_EMATCHES], &t);
+	err = tcf_em_tree_validate(net, tb[TCA_BASIC_EMATCHES], &t);
 	if (err < 0)
 		goto errout;
 
@@ -158,7 +157,7 @@ static int basic_set_parms(struct net *net, struct tcf_proto *tp,
 
 	tcf_exts_change(tp, &f->exts, &e);
 	tcf_em_tree_change(tp, &f->ematches, &t);
-	f->tp = tp;
+	f->net = net;
 
 	return 0;
 errout:
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 3409f16..d4fef3a 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -22,7 +22,7 @@ struct cls_cgroup_head {
 	u32			handle;
 	struct tcf_exts		exts;
 	struct tcf_ematch_tree	ematches;
-	struct tcf_proto	*tp;
+	struct net		*net;
 	struct rcu_head		rcu;
 };
 
@@ -87,7 +87,7 @@ static void cls_cgroup_destroy_rcu(struct rcu_head *root)
 						    rcu);
 
 	tcf_exts_destroy(&head->exts);
-	tcf_em_tree_destroy(head->tp, &head->ematches);
+	tcf_em_tree_destroy(head->net, &head->ematches);
 	kfree(head);
 }
 
@@ -122,7 +122,7 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 	else
 		new->handle = handle;
 
-	new->tp = tp;
+	new->net = net;
 	err = nla_parse_nested(tb, TCA_CGROUP_MAX, tca[TCA_OPTIONS],
 			       cgroup_policy);
 	if (err < 0)
@@ -133,7 +133,7 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 	if (err < 0)
 		goto errout;
 
-	err = tcf_em_tree_validate(tp, tb[TCA_CGROUP_EMATCHES], &t);
+	err = tcf_em_tree_validate(net, tb[TCA_CGROUP_EMATCHES], &t);
 	if (err < 0) {
 		tcf_exts_destroy(&e);
 		goto errout;
@@ -157,7 +157,7 @@ static void cls_cgroup_destroy(struct tcf_proto *tp)
 
 	if (head) {
 		tcf_exts_destroy(&head->exts);
-		tcf_em_tree_destroy(tp, &head->ematches);
+		tcf_em_tree_destroy(head->net, &head->ematches);
 		RCU_INIT_POINTER(tp->root, NULL);
 		kfree_rcu(head, rcu);
 	}
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index f18d27f7..5413810 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -41,7 +41,7 @@ struct flow_filter {
 	struct list_head	list;
 	struct tcf_exts		exts;
 	struct tcf_ematch_tree	ematches;
-	struct tcf_proto	*tp;
+	struct net		*net;
 	struct timer_list	perturb_timer;
 	u32			perturb_period;
 	u32			handle;
@@ -355,7 +355,7 @@ static void flow_destroy_filter(struct rcu_head *head)
 
 	del_timer_sync(&f->perturb_timer);
 	tcf_exts_destroy(&f->exts);
-	tcf_em_tree_destroy(f->tp, &f->ematches);
+	tcf_em_tree_destroy(f->net, &f->ematches);
 	kfree(f);
 }
 
@@ -410,7 +410,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 	if (err < 0)
 		return err;
 
-	err = tcf_em_tree_validate(tp, tb[TCA_FLOW_EMATCHES], &t);
+	err = tcf_em_tree_validate(net, tb[TCA_FLOW_EMATCHES], &t);
 	if (err < 0)
 		goto err1;
 
@@ -428,7 +428,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 		/* Copy fold into fnew */
 		fnew->handle = fold->handle;
 		fnew->keymask = fold->keymask;
-		fnew->tp = fold->tp;
+		fnew->net = fold->net;
 
 		fnew->handle = fold->handle;
 		fnew->nkeys = fold->nkeys;
@@ -481,7 +481,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 
 		fnew->handle = handle;
 		fnew->mask  = ~0U;
-		fnew->tp = tp;
+		fnew->net = net;
 		get_random_bytes(&fnew->hashrnd, 4);
 		tcf_exts_init(&fnew->exts, TCA_FLOW_ACT, TCA_FLOW_POLICE);
 	}
@@ -530,7 +530,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 	return 0;
 
 err2:
-	tcf_em_tree_destroy(tp, &t);
+	tcf_em_tree_destroy(net, &t);
 	kfree(fnew);
 err1:
 	tcf_exts_destroy(&e);
diff --git a/net/sched/em_canid.c b/net/sched/em_canid.c
index 7c292d4..65570b3 100644
--- a/net/sched/em_canid.c
+++ b/net/sched/em_canid.c
@@ -120,8 +120,8 @@ static int em_canid_match(struct sk_buff *skb, struct tcf_ematch *m,
 	return match;
 }
 
-static int em_canid_change(struct tcf_proto *tp, void *data, int len,
-			  struct tcf_ematch *m)
+static int em_canid_change(struct net *net, void *data, int len,
+			   struct tcf_ematch *m)
 {
 	struct can_filter *conf = data; /* Array with rules */
 	struct canid_match *cm;
@@ -183,7 +183,7 @@ static int em_canid_change(struct tcf_proto *tp, void *data, int len,
 	return 0;
 }
 
-static void em_canid_destroy(struct tcf_proto *tp, struct tcf_ematch *m)
+static void em_canid_destroy(struct net *net, struct tcf_ematch *m)
 {
 	struct canid_match *cm = em_canid_priv(m);
 
diff --git a/net/sched/em_ipset.c b/net/sched/em_ipset.c
index 527aeb7..10f31df 100644
--- a/net/sched/em_ipset.c
+++ b/net/sched/em_ipset.c
@@ -19,12 +19,11 @@
 #include <net/ip.h>
 #include <net/pkt_cls.h>
 
-static int em_ipset_change(struct tcf_proto *tp, void *data, int data_len,
+static int em_ipset_change(struct net *net, void *data, int data_len,
 			   struct tcf_ematch *em)
 {
 	struct xt_set_info *set = data;
 	ip_set_id_t index;
-	struct net *net = dev_net(qdisc_dev(tp->q));
 
 	if (data_len != sizeof(*set))
 		return -EINVAL;
@@ -42,11 +41,11 @@ static int em_ipset_change(struct tcf_proto *tp, void *data, int data_len,
 	return -ENOMEM;
 }
 
-static void em_ipset_destroy(struct tcf_proto *p, struct tcf_ematch *em)
+static void em_ipset_destroy(struct net *net, struct tcf_ematch *em)
 {
 	const struct xt_set_info *set = (const void *) em->data;
 	if (set) {
-		ip_set_nfnl_put(dev_net(qdisc_dev(p->q)), set->index);
+		ip_set_nfnl_put(net, set->index);
 		kfree((void *) em->data);
 	}
 }
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 9b8c0b0..7ce92a2 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -856,7 +856,7 @@ static const struct nla_policy meta_policy[TCA_EM_META_MAX + 1] = {
 	[TCA_EM_META_HDR]	= { .len = sizeof(struct tcf_meta_hdr) },
 };
 
-static int em_meta_change(struct tcf_proto *tp, void *data, int len,
+static int em_meta_change(struct net *net, void *data, int len,
 			  struct tcf_ematch *m)
 {
 	int err;
@@ -908,7 +908,7 @@ errout:
 	return err;
 }
 
-static void em_meta_destroy(struct tcf_proto *tp, struct tcf_ematch *m)
+static void em_meta_destroy(struct net *net, struct tcf_ematch *m)
 {
 	if (m)
 		meta_delete((struct meta_match *) m->data);
diff --git a/net/sched/em_nbyte.c b/net/sched/em_nbyte.c
index a3bed07..df3110d 100644
--- a/net/sched/em_nbyte.c
+++ b/net/sched/em_nbyte.c
@@ -23,7 +23,7 @@ struct nbyte_data {
 	char			pattern[0];
 };
 
-static int em_nbyte_change(struct tcf_proto *tp, void *data, int data_len,
+static int em_nbyte_change(struct net *net, void *data, int data_len,
 			   struct tcf_ematch *em)
 {
 	struct tcf_em_nbyte *nbyte = data;
diff --git a/net/sched/em_text.c b/net/sched/em_text.c
index 15d353d..45d53d5 100644
--- a/net/sched/em_text.c
+++ b/net/sched/em_text.c
@@ -45,7 +45,7 @@ static int em_text_match(struct sk_buff *skb, struct tcf_ematch *m,
 	return skb_find_text(skb, from, to, tm->config, &state) != UINT_MAX;
 }
 
-static int em_text_change(struct tcf_proto *tp, void *data, int len,
+static int em_text_change(struct net *net, void *data, int len,
 			  struct tcf_ematch *m)
 {
 	struct text_match *tm;
@@ -100,7 +100,7 @@ retry:
 	return 0;
 }
 
-static void em_text_destroy(struct tcf_proto *tp, struct tcf_ematch *m)
+static void em_text_destroy(struct net *net, struct tcf_ematch *m)
 {
 	if (EM_TEXT_PRIV(m) && EM_TEXT_PRIV(m)->config)
 		textsearch_destroy(EM_TEXT_PRIV(m)->config);
diff --git a/net/sched/ematch.c b/net/sched/ematch.c
index 3a633de..b3b90ef 100644
--- a/net/sched/ematch.c
+++ b/net/sched/ematch.c
@@ -170,7 +170,7 @@ static inline struct tcf_ematch *tcf_em_get_match(struct tcf_ematch_tree *tree,
 }
 
 
-static int tcf_em_validate(struct tcf_proto *tp,
+static int tcf_em_validate(struct net *net,
 			   struct tcf_ematch_tree_hdr *tree_hdr,
 			   struct tcf_ematch *em, struct nlattr *nla, int idx)
 {
@@ -240,7 +240,7 @@ static int tcf_em_validate(struct tcf_proto *tp,
 			goto errout;
 
 		if (em->ops->change) {
-			err = em->ops->change(tp, data, data_len, em);
+			err = em->ops->change(net, data, data_len, em);
 			if (err < 0)
 				goto errout;
 		} else if (data_len > 0) {
@@ -285,7 +285,7 @@ static const struct nla_policy em_policy[TCA_EMATCH_TREE_MAX + 1] = {
 /**
  * tcf_em_tree_validate - validate ematch config TLV and build ematch tree
  *
- * @tp: classifier kind handle
+ * @net: callers net reference
  * @nla: ematch tree configuration TLV
  * @tree: destination ematch tree variable to store the resulting
  *        ematch tree.
@@ -298,7 +298,7 @@ static const struct nla_policy em_policy[TCA_EMATCH_TREE_MAX + 1] = {
  *
  * Returns a negative error code if the configuration TLV contains errors.
  */
-int tcf_em_tree_validate(struct tcf_proto *tp, struct nlattr *nla,
+int tcf_em_tree_validate(struct net *net, struct nlattr *nla,
 			 struct tcf_ematch_tree *tree)
 {
 	int idx, list_len, matches_len, err;
@@ -356,7 +356,7 @@ int tcf_em_tree_validate(struct tcf_proto *tp, struct nlattr *nla,
 
 		em = tcf_em_get_match(tree, idx);
 
-		err = tcf_em_validate(tp, tree_hdr, em, rt_match, idx);
+		err = tcf_em_validate(net, tree_hdr, em, rt_match, idx);
 		if (err < 0)
 			goto errout_abort;
 
@@ -378,7 +378,7 @@ errout:
 	return err;
 
 errout_abort:
-	tcf_em_tree_destroy(tp, tree);
+	tcf_em_tree_destroy(net, tree);
 	return err;
 }
 EXPORT_SYMBOL(tcf_em_tree_validate);
@@ -386,14 +386,14 @@ EXPORT_SYMBOL(tcf_em_tree_validate);
 /**
  * tcf_em_tree_destroy - destroy an ematch tree
  *
- * @tp: classifier kind handle
+ * @net: callers net reference
  * @tree: ematch tree to be deleted
  *
  * This functions destroys an ematch tree previously created by
  * tcf_em_tree_validate()/tcf_em_tree_change(). You must ensure that
  * the ematch tree is not in use before calling this function.
  */
-void tcf_em_tree_destroy(struct tcf_proto *tp, struct tcf_ematch_tree *tree)
+void tcf_em_tree_destroy(struct net *net, struct tcf_ematch_tree *tree)
 {
 	int i;
 
@@ -405,7 +405,7 @@ void tcf_em_tree_destroy(struct tcf_proto *tp, struct tcf_ematch_tree *tree)
 
 		if (em->ops) {
 			if (em->ops->destroy)
-				em->ops->destroy(tp, em);
+				em->ops->destroy(net, em);
 			else if (!tcf_em_is_simple(em))
 				kfree((void *) em->data);
 			module_put(em->ops->owner);

^ permalink raw reply related

* [PATCH v1 1/2] net: sched: do not use tcf_proto 'tp' argument from call_rcu
From: John Fastabend @ 2014-10-03  5:45 UTC (permalink / raw)
  To: xiyou.wangcong, davem; +Cc: netdev, jhs, eric.dumazet

Using the tcf_proto pointer 'tp' from inside the classifiers callback
is not valid because it may have been cleaned up by another call_rcu
occuring on another CPU.

'tp' is currently being used by tcf_unbind_filter() in this patch we
move instances of tcf_unbind_filter outside of the call_rcu() context.
This is safe to do because any running schedulers will either read the
valid class field or it will be zeroed.

And all schedulers today when the class is 0 do a lookup using the
same call used by the tcf_exts_bind(). So even if we have a running
classifier hit the null class pointer it will do a lookup and get
to the same result. This is particularly fragile at the moment because
the only way to verify this is to audit the schedulers call sites.

Reported-by: Cong Wang <xiyou.wangconf@gmail.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/cls_basic.c |    4 +++-
 net/sched/cls_bpf.c   |    4 +++-
 net/sched/cls_fw.c    |    5 +++--
 net/sched/cls_route.c |    8 +++++---
 4 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index fe20826..81ddfa6 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -93,7 +93,6 @@ static void basic_delete_filter(struct rcu_head *head)
 	struct basic_filter *f = container_of(head, struct basic_filter, rcu);
 	struct tcf_proto *tp = f->tp;
 
-	tcf_unbind_filter(tp, &f->res);
 	tcf_exts_destroy(&f->exts);
 	tcf_em_tree_destroy(tp, &f->ematches);
 	kfree(f);
@@ -106,6 +105,7 @@ static void basic_destroy(struct tcf_proto *tp)
 
 	list_for_each_entry_safe(f, n, &head->flist, link) {
 		list_del_rcu(&f->link);
+		tcf_unbind_filter(tp, &f->res);
 		call_rcu(&f->rcu, basic_delete_filter);
 	}
 	RCU_INIT_POINTER(tp->root, NULL);
@@ -120,6 +120,7 @@ static int basic_delete(struct tcf_proto *tp, unsigned long arg)
 	list_for_each_entry(t, &head->flist, link)
 		if (t == f) {
 			list_del_rcu(&t->link);
+			tcf_unbind_filter(tp, &t->res);
 			call_rcu(&t->rcu, basic_delete_filter);
 			return 0;
 		}
@@ -222,6 +223,7 @@ static int basic_change(struct net *net, struct sk_buff *in_skb,
 
 	if (fold) {
 		list_replace_rcu(&fold->link, &fnew->link);
+		tcf_unbind_filter(tp, &fold->res);
 		call_rcu(&fold->rcu, basic_delete_filter);
 	} else {
 		list_add_rcu(&fnew->link, &head->flist);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 4318d06..eed49d1 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -92,7 +92,6 @@ static int cls_bpf_init(struct tcf_proto *tp)
 
 static void cls_bpf_delete_prog(struct tcf_proto *tp, struct cls_bpf_prog *prog)
 {
-	tcf_unbind_filter(tp, &prog->res);
 	tcf_exts_destroy(&prog->exts);
 
 	bpf_prog_destroy(prog->filter);
@@ -116,6 +115,7 @@ static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg)
 	list_for_each_entry(prog, &head->plist, link) {
 		if (prog == todel) {
 			list_del_rcu(&prog->link);
+			tcf_unbind_filter(tp, &prog->res);
 			call_rcu(&prog->rcu, __cls_bpf_delete_prog);
 			return 0;
 		}
@@ -131,6 +131,7 @@ static void cls_bpf_destroy(struct tcf_proto *tp)
 
 	list_for_each_entry_safe(prog, tmp, &head->plist, link) {
 		list_del_rcu(&prog->link);
+		tcf_unbind_filter(tp, &prog->res);
 		call_rcu(&prog->rcu, __cls_bpf_delete_prog);
 	}
 
@@ -282,6 +283,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 
 	if (oldprog) {
 		list_replace_rcu(&prog->link, &oldprog->link);
+		tcf_unbind_filter(tp, &oldprog->res);
 		call_rcu(&oldprog->rcu, __cls_bpf_delete_prog);
 	} else {
 		list_add_rcu(&prog->link, &head->plist);
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index da805ae..dbfdfd1 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -123,9 +123,7 @@ static int fw_init(struct tcf_proto *tp)
 static void fw_delete_filter(struct rcu_head *head)
 {
 	struct fw_filter *f = container_of(head, struct fw_filter, rcu);
-	struct tcf_proto *tp = f->tp;
 
-	tcf_unbind_filter(tp, &f->res);
 	tcf_exts_destroy(&f->exts);
 	kfree(f);
 }
@@ -143,6 +141,7 @@ static void fw_destroy(struct tcf_proto *tp)
 		while ((f = rtnl_dereference(head->ht[h])) != NULL) {
 			RCU_INIT_POINTER(head->ht[h],
 					 rtnl_dereference(f->next));
+			tcf_unbind_filter(tp, &f->res);
 			call_rcu(&f->rcu, fw_delete_filter);
 		}
 	}
@@ -166,6 +165,7 @@ static int fw_delete(struct tcf_proto *tp, unsigned long arg)
 	     fp = &pfp->next, pfp = rtnl_dereference(*fp)) {
 		if (pfp == f) {
 			RCU_INIT_POINTER(*fp, rtnl_dereference(f->next));
+			tcf_unbind_filter(tp, &f->res);
 			call_rcu(&f->rcu, fw_delete_filter);
 			return 0;
 		}
@@ -280,6 +280,7 @@ static int fw_change(struct net *net, struct sk_buff *in_skb,
 
 		RCU_INIT_POINTER(fnew->next, rtnl_dereference(pfp->next));
 		rcu_assign_pointer(*fp, fnew);
+		tcf_unbind_filter(tp, &f->res);
 		call_rcu(&f->rcu, fw_delete_filter);
 
 		*arg = (unsigned long)fnew;
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index b665aee..6f22baa 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -269,9 +269,7 @@ static void
 route4_delete_filter(struct rcu_head *head)
 {
 	struct route4_filter *f = container_of(head, struct route4_filter, rcu);
-	struct tcf_proto *tp = f->tp;
 
-	tcf_unbind_filter(tp, &f->res);
 	tcf_exts_destroy(&f->exts);
 	kfree(f);
 }
@@ -297,6 +295,7 @@ static void route4_destroy(struct tcf_proto *tp)
 
 					next = rtnl_dereference(f->next);
 					RCU_INIT_POINTER(b->ht[h2], next);
+					tcf_unbind_filter(tp, &f->res);
 					call_rcu(&f->rcu, route4_delete_filter);
 				}
 			}
@@ -338,6 +337,7 @@ static int route4_delete(struct tcf_proto *tp, unsigned long arg)
 			route4_reset_fastmap(head);
 
 			/* Delete it */
+			tcf_unbind_filter(tp, &f->res);
 			call_rcu(&f->rcu, route4_delete_filter);
 
 			/* Strip RTNL protected tree */
@@ -545,8 +545,10 @@ static int route4_change(struct net *net, struct sk_buff *in_skb,
 
 	route4_reset_fastmap(head);
 	*arg = (unsigned long)f;
-	if (fold)
+	if (fold) {
+		tcf_unbind_filter(tp, &fold->res);
 		call_rcu(&fold->rcu, route4_delete_filter);
+	}
 	return 0;
 
 errout:

^ permalink raw reply related

* [PATCH] net: sched: suspicious RCU usage in qdisc_watchdog
From: John Fastabend @ 2014-10-03  5:43 UTC (permalink / raw)
  To: davem; +Cc: xiyou.wangcong, eric.dumazet, netdev

Suspicious RCU usage in qdisc_watchdog call needs to be done inside
rcu_read_lock/rcu_read_unlock. And then Qdisc destroy operations
need to ensure timer is cancelled before removing qdisc structure.

[ 3992.191339] ===============================
[ 3992.191340] [ INFO: suspicious RCU usage. ]
[ 3992.191343] 3.17.0-rc6net-next+ #72 Not tainted
[ 3992.191345] -------------------------------
[ 3992.191347] include/net/sch_generic.h:272 suspicious rcu_dereference_check() usage!
[ 3992.191348]
[ 3992.191348] other info that might help us debug this:
[ 3992.191348]
[ 3992.191351]
[ 3992.191351] rcu_scheduler_active = 1, debug_locks = 1
[ 3992.191353] no locks held by swapper/1/0.
[ 3992.191355]
[ 3992.191355] stack backtrace:
[ 3992.191358] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.17.0-rc6net-next+ #72
[ 3992.191360] Hardware name:                  /DZ77RE-75K, BIOS GAZ7711H.86A.0060.2012.1115.1750 11/15/2012
[ 3992.191362]  0000000000000001 ffff880235803e48 ffffffff8178f92c 0000000000000000
[ 3992.191366]  ffff8802322224a0 ffff880235803e78 ffffffff810c9966 ffff8800a5fe3000
[ 3992.191370]  ffff880235803f30 ffff8802359cd768 ffff8802359cd6e0 ffff880235803e98
[ 3992.191374] Call Trace:
[ 3992.191376]  <IRQ>  [<ffffffff8178f92c>] dump_stack+0x4e/0x68
[ 3992.191387]  [<ffffffff810c9966>] lockdep_rcu_suspicious+0xe6/0x130
[ 3992.191392]  [<ffffffff8167213a>] qdisc_watchdog+0x8a/0xb0
[ 3992.191396]  [<ffffffff810f93f2>] __run_hrtimer+0x72/0x420
[ 3992.191399]  [<ffffffff810f9bcd>] ? hrtimer_interrupt+0x7d/0x240
[ 3992.191403]  [<ffffffff816720b0>] ? tc_classify+0xc0/0xc0
[ 3992.191406]  [<ffffffff810f9c4f>] hrtimer_interrupt+0xff/0x240
[ 3992.191410]  [<ffffffff8109e4a5>] ? __atomic_notifier_call_chain+0x5/0x140
[ 3992.191415]  [<ffffffff8103577b>] local_apic_timer_interrupt+0x3b/0x60
[ 3992.191419]  [<ffffffff8179c2b5>] smp_apic_timer_interrupt+0x45/0x60
[ 3992.191422]  [<ffffffff8179a6bf>] apic_timer_interrupt+0x6f/0x80
[ 3992.191424]  <EOI>  [<ffffffff815ed233>] ? cpuidle_enter_state+0x73/0x2e0
[ 3992.191432]  [<ffffffff815ed22e>] ? cpuidle_enter_state+0x6e/0x2e0
[ 3992.191437]  [<ffffffff815ed567>] cpuidle_enter+0x17/0x20
[ 3992.191441]  [<ffffffff810c0741>] cpu_startup_entry+0x3d1/0x4a0
[ 3992.191445]  [<ffffffff81106fc6>] ? clockevents_config_and_register+0x26/0x30
[ 3992.191448]  [<ffffffff81033c16>] start_secondary+0x1b6/0x260

Fixes: b26b0d1e8b1 ("net: qdisc: use rcu prefix and silence sparse warnings")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_api.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index aa83295..c79a226 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -578,8 +578,10 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
 	struct qdisc_watchdog *wd = container_of(timer, struct qdisc_watchdog,
 						 timer);
 
+	rcu_read_lock();
 	qdisc_unthrottled(wd->qdisc);
 	__netif_schedule(qdisc_root(wd->qdisc));
+	rcu_read_unlock();
 
 	return HRTIMER_NORESTART;
 }

^ permalink raw reply related

* Re: HW bridging support using notifiers?
From: Scott Feldman @ 2014-10-03  5:13 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, davem, jiri, stephen, andy, tgraf, nbd, john.r.fastabend,
	edumazet, vyasevic, buytenh
In-Reply-To: <542E0089.5050005@gmail.com>

On Oct 2, 2014, at 6:48 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:

> Hi all,
> 
> I am taking a look at adding HW bridging support to DSA, in way that's
> usable outside of DSA.
> 
> Lennert's approach in 2008 [1] looks conceptually good to me,as he
> noted, it uses a bunch of new ndo's which is not only limiting to one
> ndo implementer per struct net_device, but also is mostly consuming the
> information from the bridge layer, while the ndo is an action
> 
> So here's what I am up to now:
> 
> - use the NETDEV_JOIN notifier to discover when a bridge port is added
> - use the NETDEV_LEAVE notifier, still need to verify this does not
> break netconsole as indicated in net/bridge/br_if.c

I can’t find a NETDEV_LEAVE...is this new?

For this, rocker is using netdev_notifier and watching for NETDEV_CHANGEUPPER.  On CHANGEUPPER, use netdev_master_upper_dev_get(dev) to get master.  If master and master rtnl_link_ops->kind is “bridge”, then dev port is in bridge; otherwise port is not in bridge.  Of course, master could be “openvswitch”, if the driver swings both ways.

Would this approach work for DSA?

> - use the NETDEV_CHANGEINFODATA notifier to notify about STP state changes
> 
> Now, this raises a bunch of questions:
> 
> - we would need a getter to return the stp state of a given network
> device when called with NETDEV_CHANGEINFODATA, is that acceptable? This
> would be the first function exported by the bridge layer to expose
> internal data
> 
> NB: this also raises the question of the race condition and locking
> within br_set_stp_state() and when the network devices notifier callback
> runs
> 
> - or do we need a new network device notifier accepting an opaque
> pointer which could provide us with the data we what, something like
> this: call_netdevices_notifier_data(NETDEV_CHANGEINFODATA, dev, info),
> where info would be a structure/union telling what's this data about
> 

We need STP state change notification for rocker also, so we can install/remove STP/ARP filters on port state change.  Netdev_notifier would work.  I was also thinking about using Jiri’s ndo_swdev_flow_install/remove to install/remove the STP filters on the port, rather than using netdev_notifier.  In other words, does the HW bridge driver need to know the STP state or can it be dumb (stateless) and told when to accept STP BPDUs, or not, using swdev_flow construct:

  dst_mac 01:80:c2:00:00:00 lasp 0x4242 in_port <port ifindex> actions output <br ifindex>

and later when reaching LEARNING state:

  dst_mac ff:ff:ff:ff:ff:ff eth_type ARP in_port <port ifindex> actions output <br ifindex>

and finally when reaching FORWARDING state, the learned/static bridge fdbs:

  dst_mac <neigh_mac> in_port <br ifindex> actions output <port ifindex>

So driver doesn’t really know what STP is; it’s just installs/removes port filter when told to, using the common ndo_swdev_flow API.  The smarts stay in the bridge driver.

> Let me know what you think, thanks!
> 
> [1]: http://patchwork.ozlabs.org/patch/16578/
> --
> Florian

-scott

^ permalink raw reply

* Re: [PATCH v3 net-next 3/3] bridge: Add filtering support for default_pvid
From: Cong Wang @ 2014-10-03  4:41 UTC (permalink / raw)
  To: Vladislav Yasevich
  Cc: Stephen Hemminger, netdev, bridge@lists.linux-foundation.org,
	Vladislav Yasevich
In-Reply-To: <1412294070-11930-4-git-send-email-vyasevic@redhat.com>

On Thu, Oct 2, 2014 at 4:54 PM, Vladislav Yasevich <vyasevich@gmail.com> wrote:
> +static void br_vlan_disable_default_pvid(struct net_bridge *br)
> +{
> +       struct net_bridge_port *p;
> +       u16 pvid = br->default_pvid;
> +
> +       /* Disable default_pvid on all ports where it is still
> +        * configured.
> +        */
> +

This empty line is not necessary.

> +       if (vlan_default_pvid(br_get_vlan_info(br), pvid))
> +               br_vlan_delete(br, pvid);
> +
> +       list_for_each_entry(p, &br->port_list, list) {
> +               if (vlan_default_pvid(nbp_get_vlan_info(p), pvid))
> +                       nbp_vlan_delete(p, pvid);
> +       }
> +
> +       br->default_pvid = 0;
> +}
> +
> +static int __br_vlan_set_default_pvid(struct net_bridge *br, u16 pvid)
> +{
> +       struct net_bridge_port *p;
> +       u16 old_pvid;
> +       int err;
> +       DECLARE_BITMAP(changed, BR_MAX_PORTS);


This bitmap will use 128 bytes on stack, why not using heap?

> +
> +       bitmap_zero(changed, BR_MAX_PORTS);
> +
> +       /* This function runs with filtering turned off so we can
> +        * remove the old pvid configuration and add the new one after
> +        * without impacting traffic.
> +        */
> +
> +       old_pvid = br->default_pvid;


Remove the empty line.

[...]

> +int nbp_vlan_init(struct net_bridge_port *p)
> +{
> +       int rc = 0;
> +
> +       if (p->br->default_pvid) {
> +               rc = nbp_vlan_add(p, p->br->default_pvid,
> +                                 BRIDGE_VLAN_INFO_PVID |
> +                                 BRIDGE_VLAN_INFO_UNTAGGED);
> +       }
> +
> +       return rc;
> +}

'rc' can be removed.

^ permalink raw reply

* [PATCH net-next] net: dsa: do not call phy_start_aneg
From: Florian Fainelli @ 2014-10-03  1:56 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli
In-Reply-To: <1412301363-8478-1-git-send-email-f.fainelli@gmail.com>

Commit f7f1de51edbd ("net: dsa: start and stop the PHY state machine")
add calls to phy_start() in dsa_slave_open() respectively phy_stop() in
dsa_slave_close().

We also call phy_start_aneg() in dsa_slave_create(), and this call is
messing up with the PHY state machine, since we basically start the
auto-negotiation, and later on restart it when calling phy_start().
phy_start() does not currently handle the PHY_FORCING or PHY_AN states
properly, but such a fix would be too invasive for this window.

Fixes: f7f1de51edbd ("net: dsa: start and stop the PHY state machine")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 net/dsa/slave.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 36953c84ff2d..8030489d9cbe 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -608,7 +608,6 @@ dsa_slave_create(struct dsa_switch *ds, struct device *parent,
 		p->phy->speed = 0;
 		p->phy->duplex = 0;
 		p->phy->advertising = p->phy->supported | ADVERTISED_Autoneg;
-		phy_start_aneg(p->phy);
 	}

 	return slave_dev;
-- 
1.9.1

^ permalink raw reply related

* [PATCH net-next] net: dsa: PHY state machine usage fix
From: Florian Fainelli @ 2014-10-03  1:56 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli

Hi David,

This is a one-liner fixing long auto-negotiation with DSA devices, as I
mention in the commit message, the PHY state machine will need to be
fixed at some point to handle a phy_start_aneg() then phy_start() call,
but that looks too invasive for this merge window.

Thanks!

Florian Fainelli (1):
  net: dsa: do not call phy_start_aneg

 net/dsa/slave.c | 1 -
 1 file changed, 1 deletion(-)

-- 
1.9.1

^ permalink raw reply

* Re: [net-next 6/6] openvswitch: Add support for Geneve tunneling.
From: Jesse Gross @ 2014-10-03  1:48 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Andy Zhou, David Miller, Linux Netdev List
In-Reply-To: <CA+mtBx_gswubpaBF1bX-dRN580Xv2F5aJAJaKRp56pFeWzRWEA@mail.gmail.com>

On Thu, Oct 2, 2014 at 4:04 PM, Tom Herbert <therbert@google.com> wrote:
> On Thu, Oct 2, 2014 at 1:04 AM, Andy Zhou <azhou@nicira.com> wrote:
>> From: Jesse Gross <jesse@nicira.com>
>>
>> The Openvswitch implementation is completely agnostic to the options
>> that are in use and can handle newly defined options without
>> further work. It does this by simply matching on a byte array
>> of options and allowing userspace to setup flows on this array.
>>
>> Signed-off-by: Jesse Gross <jesse@nicira.com>
>> Signed-off-by: Andy Zhou <azhou@nicira.com>
>> ---
>>  include/net/ip_tunnels.h         |   21 ++--
>>  include/uapi/linux/openvswitch.h |    2 +
>>  net/openvswitch/Kconfig          |   11 ++
>>  net/openvswitch/Makefile         |    4 +
>>  net/openvswitch/datapath.c       |    5 +-
>>  net/openvswitch/flow.c           |   20 +++-
>>  net/openvswitch/flow.h           |   20 +++-
>>  net/openvswitch/flow_netlink.c   |  176 +++++++++++++++++++++++-----
>>  net/openvswitch/vport-geneve.c   |  236 ++++++++++++++++++++++++++++++++++++++
>>  net/openvswitch/vport-gre.c      |    2 +-
>>  net/openvswitch/vport-vxlan.c    |    2 +-
>>  net/openvswitch/vport.c          |    3 +
>>  net/openvswitch/vport.h          |    1 +
>>  13 files changed, 461 insertions(+), 42 deletions(-)
>>  create mode 100644 net/openvswitch/vport-geneve.c
>>
>> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
>> index a9ce155..5bc6ede 100644
>> --- a/include/net/ip_tunnels.h
>> +++ b/include/net/ip_tunnels.h
>> @@ -86,17 +86,18 @@ struct ip_tunnel {
>>         struct gro_cells        gro_cells;
>>  };
>>
>> -#define TUNNEL_CSUM    __cpu_to_be16(0x01)
>> -#define TUNNEL_ROUTING __cpu_to_be16(0x02)
>> -#define TUNNEL_KEY     __cpu_to_be16(0x04)
>> -#define TUNNEL_SEQ     __cpu_to_be16(0x08)
>> -#define TUNNEL_STRICT  __cpu_to_be16(0x10)
>> -#define TUNNEL_REC     __cpu_to_be16(0x20)
>> -#define TUNNEL_VERSION __cpu_to_be16(0x40)
>> -#define TUNNEL_NO_KEY  __cpu_to_be16(0x80)
>> +#define TUNNEL_CSUM            __cpu_to_be16(0x01)
>> +#define TUNNEL_ROUTING         __cpu_to_be16(0x02)
>> +#define TUNNEL_KEY             __cpu_to_be16(0x04)
>> +#define TUNNEL_SEQ             __cpu_to_be16(0x08)
>> +#define TUNNEL_STRICT          __cpu_to_be16(0x10)
>> +#define TUNNEL_REC             __cpu_to_be16(0x20)
>
> Just changing whitespace in these?

Yeah, it's just reindenting to match the new values.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox