Netdev List

* Re: [PATCH net-next v2 0/6] net: dsa: mv88e6131: HW bridging support for 6185
From: Andrew Lunn @ 2016-04-04  2:13 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: netdev, linux-kernel, kernel, David S. Miller
In-Reply-To: <1459457626-30082-1-git-send-email-vivien.didelot@savoirfairelinux.com>

On Thu, Mar 31, 2016 at 04:53:40PM -0400, Vivien Didelot wrote:
> All packets passing through a switch of the 6185 family are currently
> directed to the CPU port. This means that port bridging is software driven.
> 
> To enable hardware bridging for this switch family, we need to implement the
> port mapping operations, the FDB operations, and optionally the VLAN operations
> (for 802.1Q and VLAN filtering aware systems).
> 
> However this family only has 256 FDBs indexed by 8-bit identifiers, as opposed
> to 4096 FDBs with 12-bit identifiers for other families such as 6352. It also
> doesn't have dedicated FID registers for ATU and VTU operations.
> 
> This patchset handles these differences and enables hardware bridging for 6185.

Hi Vivien

I added a test for in-chip 6185 bridging, and it worked as expected.

Tested-by: Andrew Lunn <andrew@lunn.ch>

	   Andrew
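
As a small illustration of the identifier widths mentioned in the cover
letter (8-bit FIDs on 6185, 12-bit on families such as 6352), the sketch
below computes how many FDBs each width can address and checks whether a
FID fits. The helper names are hypothetical and are not taken from the
mv88e6xxx driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of distinct FDB identifiers for a given identifier width.
 * Hypothetical helper, valid for widths below 32 bits. */
static uint32_t fid_count(unsigned int fid_bits)
{
	return 1u << fid_bits;
}

/* True if 'fid' can be encoded in 'fid_bits' bits. */
static bool fid_fits(uint32_t fid, unsigned int fid_bits)
{
	return fid < fid_count(fid_bits);
}
```

With these helpers, fid_count(8) gives the 256 FDBs of the 6185 family
and fid_count(12) the 4096 FDBs of the other families.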

^ permalink raw reply

* [PATCH net] cxgb4: Add pci device id for chelsio t520-cr adapter
From: Hariprasad Shenai @ 2016-04-04  4:24 UTC (permalink / raw)
  To: davem; +Cc: netdev, leedom, nirranjan, Hariprasad Shenai

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
index 06bc2d2e7a73..a2cdfc1261dc 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
@@ -166,6 +166,7 @@ CH_PCI_DEVICE_ID_TABLE_DEFINE_BEGIN
 	CH_PCI_ID_TABLE_FENTRY(0x5099),	/* Custom 2x40G QSFP */
 	CH_PCI_ID_TABLE_FENTRY(0x509a),	/* Custom T520-CR */
 	CH_PCI_ID_TABLE_FENTRY(0x509b),	/* Custom T540-CR LOM */
+	CH_PCI_ID_TABLE_FENTRY(0x509c),	/* Custom T520-CR*/
 
 	/* T6 adapters:
 	 */
-- 
2.3.4

^ permalink raw reply related

* [PATCH net-next] cxgb4/cxgb4vf:  Deprecate module parameter dflt_msg_enable
From: Hariprasad Shenai @ 2016-04-04  4:53 UTC (permalink / raw)
  To: davem; +Cc: netdev, leedom, nirranjan, Hariprasad Shenai

Message level can be set through ethtool, so deprecate module parameter
which is used to set the same.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c     | 3 ++-
 drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index d1e3f0997d6b..acefa35b7250 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -168,7 +168,8 @@ MODULE_PARM_DESC(force_init, "Forcibly become Master PF and initialize adapter,"
 static int dflt_msg_enable = DFLT_MSG_ENABLE;
 
 module_param(dflt_msg_enable, int, 0644);
-MODULE_PARM_DESC(dflt_msg_enable, "Chelsio T4 default message enable bitmap");
+MODULE_PARM_DESC(dflt_msg_enable, "Chelsio T4 default message enable bitmap, "
+		 "deprecated parameter");
 
 /*
  * The driver uses the best interrupt scheme available on a platform in the
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index 1cc8a7a69457..730fec73d5a6 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -74,7 +74,8 @@ static int dflt_msg_enable = DFLT_MSG_ENABLE;
 
 module_param(dflt_msg_enable, int, 0644);
 MODULE_PARM_DESC(dflt_msg_enable,
-		 "default adapter ethtool message level bitmap");
+		 "default adapter ethtool message level bitmap, "
+		 "deprecated parameter");
 
 /*
  * The driver uses the best interrupt scheme available on a platform in the
-- 
2.3.4

^ permalink raw reply related

* Re: [v7, 3/5] dt: move guts devicetree doc out of powerpc directory
From: Rob Herring @ 2016-04-04  5:15 UTC (permalink / raw)
  To: Yangbo Lu
  Cc: devicetree-u79uwXL29TY76Z2rM5mHXA,
	ulf.hansson-QSEj5FYQhm4dnm+yROfE0A, Zhao Qiang, Russell King,
	Claudiu Manoil, Bhupesh Sharma, netdev-u79uwXL29TY76Z2rM5mHXA,
	Santosh Shilimkar, linux-mmc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, scott.wood-3arQi8VN3Tc,
	xiaobo.xie-3arQi8VN3Tc,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-i2c-u79uwXL29TY76Z2rM5mHXA, Jochen Friedrich, Kumar Gala,
	leoyang.li-3arQi8VN3Tc, linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-clk-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <1459480051-3701-4-git-send-email-yangbo.lu-3arQi8VN3Tc@public.gmane.org>

On Fri, Apr 01, 2016 at 11:07:29AM +0800, Yangbo Lu wrote:
> Move guts devicetree doc to Documentation/devicetree/bindings/soc/fsl/
> since it's used by not only PowerPC but also ARM. And add a specification
> for 'little-endian' property.
> 
> Signed-off-by: Yangbo Lu <yangbo.lu-3arQi8VN3Tc@public.gmane.org>
> ---
> Changes for v2:
> 	- None
> Changes for v3:
> 	- None
> Changes for v4:
> 	- Added this patch
> Changes for v5:
> 	- Modified the description for little-endian property
> Changes for v6:
> 	- None
> Changes for v7:
> 	- None
> ---
>  Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt | 3 +++
>  1 file changed, 3 insertions(+)
>  rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)

Acked-by: Rob Herring <robh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

^ permalink raw reply

* Re: [PATCH v5 net-next] net: ipv4: Consider failed nexthops in multipath routes
From: Julian Anastasov @ 2016-04-04  6:29 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev
In-Reply-To: <1459728547-36371-1-git-send-email-dsa@cumulusnetworks.com>


	Hello,

On Sun, 3 Apr 2016, David Ahern wrote:

> Multipath route lookups should consider knowledge about next hops and not
> select a hop that is known to be failed.
> 
> Example:
> 
>                      [h2]                   [h3]   15.0.0.5
>                       |                      |
>                      3|                     3|
>                     [SP1]                  [SP2]--+
>                      1  2                   1     2
>                      |  |     /-------------+     |
>                      |   \   /                    |
>                      |     X                      |
>                      |    / \                     |
>                      |   /   \---------------\    |
>                      1  2                     1   2
>          12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
>                      4                         4
>                       \                       /
>                         \                    /
>                          \                  /
>                           -------|   |-----/
>                                  1   2
>                                 [TOR3]
>                                   3|
>                                    |
>                                   [h1]  12.0.0.1
> 
> host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
> 
>     root@h1:~# ip ro ls
>     ...
>     12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
>     15.0.0.0/16
>             nexthop via 12.0.0.2  dev swp1 weight 1
>             nexthop via 12.0.0.3  dev swp1 weight 1
>     ...
> 
> If the link between tor3 and tor1 is down and the link between tor1
> and tor2 is down too, then tor1 is effectively cut off from h1. Yet the route lookups
> in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
> ssh 15.0.0.5 gets the other. Connections that attempt to use the
> 12.0.0.2 nexthop fail since that neighbor is not reachable:
> 
>     root@h1:~# ip neigh show
>     ...
>     12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
>     12.0.0.2 dev swp1  FAILED
>     ...
> 
> The failed path can be avoided by considering known neighbor information
> when selecting next hops. If the neighbor lookup fails we have no
> knowledge about the nexthop, so give it a shot. If there is an entry
> then only select the nexthop if the state is sane. This is similar to
> what fib_detect_death does.
> 
> To maintain backward compatibility use of the neighbor information is
> based on a new sysctl, fib_multipath_use_neigh.
> 
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>

Reviewed-by: Julian Anastasov <ja@ssi.bg>

	With one comment: the fallback strategy is simplified,
we do not fall back to all possible reachable nexthops.
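
To make the fallback comment concrete, here is a minimal userspace
sketch of the selection loop from the quoted patch. It is not the kernel
code: the struct, the sk_ prefixed names and the "0 means no neighbour
entry" convention are assumptions made for illustration, and the
constants are local copies of the kernel's NUD_* values. Returning 0
when no nexthop matches the hash is also an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local copies of the kernel's neighbour state bits (assumed values). */
enum {
	SK_NUD_REACHABLE = 0x02,
	SK_NUD_FAILED    = 0x20,
	SK_NUD_VALID     = 0xde, /* PERMANENT|NOARP|REACHABLE|PROBE|STALE|DELAY */
};

/* Hypothetical stand-in for struct fib_nh. */
struct sk_nexthop {
	int upper_bound; /* hash values up to this bound map to this hop */
	int nud_state;   /* 0 means "no neighbour entry" in this sketch */
};

/* No neighbour entry means no knowledge, so the hop gets a chance. */
static bool sk_good_nh(const struct sk_nexthop *nh)
{
	int state = nh->nud_state ? nh->nud_state : SK_NUD_REACHABLE;

	return (state & SK_NUD_VALID) != 0;
}

/* Returns the index of the selected nexthop. */
static int sk_select_multipath(const struct sk_nexthop *nhs, size_t n,
			       int hash, bool use_neigh)
{
	int sel = 0;
	bool first = false;

	for (size_t i = 0; i < n; i++) {
		if (hash > nhs[i].upper_bound)
			continue;
		if (!use_neigh || sk_good_nh(&nhs[i]))
			return (int)i;
		if (!first) {
			/* remember the first hash match as the fallback */
			sel = (int)i;
			first = true;
		}
	}
	return sel;
}
```

Note that the loop only walks forward from the first hop whose bound
covers the hash. If that hop and all later ones are failed, the failed
hop is still chosen, even when an earlier hop is reachable.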

> ---
> v5
> - returned comma that got lost in the ether and removed resetting of
>   nhsel at end of loop - again comments from Julian
> 
> v4
> - remove NULL initializer and logic for fallback per Julian's comment
> 
> v3
> - Julian comments: changed use of dead in documentation to failed,
>   init state to NUD_REACHABLE which simplifies fib_good_nh, use of
>   nh_dev for neighbor lookup, fallback to first entry which is what
>   current logic does
> 
> v2
> - use rcu locking to avoid refcnts per Eric's suggestion
> - only consider neighbor info for nh_scope == RT_SCOPE_LINK per Julian's
>   comment
> - drop the 'state == NUD_REACHABLE' from the state check since it is
>   part of NUD_VALID (comment from Julian)
> - wrapped the use of the neigh in a sysctl
> 
>  Documentation/networking/ip-sysctl.txt | 10 ++++++++++
>  include/net/netns/ipv4.h               |  3 +++
>  net/ipv4/fib_semantics.c               | 34 +++++++++++++++++++++++++++++-----
>  net/ipv4/sysctl_net_ipv4.c             | 11 +++++++++++
>  4 files changed, 53 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index b183e2b606c8..6c7f365b1515 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -63,6 +63,16 @@ fwmark_reflect - BOOLEAN
>  	fwmark of the packet they are replying to.
>  	Default: 0
>  
> +fib_multipath_use_neigh - BOOLEAN
> +	Use status of existing neighbor entry when determining nexthop for
> +	multipath routes. If disabled, neighbor information is not used and
> +	packets could be directed to a failed nexthop. Only valid for kernels
> +	built with CONFIG_IP_ROUTE_MULTIPATH enabled.
> +	Default: 0 (disabled)
> +	Possible values:
> +	0 - disabled
> +	1 - enabled
> +
>  route/max_size - INTEGER
>  	Maximum number of routes allowed in the kernel.  Increase
>  	this when using large numbers of interfaces and/or routes.
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index a69cde3ce460..d061ffeb1e71 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -133,6 +133,9 @@ struct netns_ipv4 {
>  	struct fib_rules_ops	*mr_rules_ops;
>  #endif
>  #endif
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> +	int sysctl_fib_multipath_use_neigh;
> +#endif
>  	atomic_t	rt_genid;
>  };
>  #endif
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index d97268e8ff10..5016676c9186 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -1559,21 +1559,45 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>  }
>  
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> +static bool fib_good_nh(const struct fib_nh *nh)
> +{
> +	int state = NUD_REACHABLE;
> +
> +	if (nh->nh_scope == RT_SCOPE_LINK) {
> +		struct neighbour *n;
> +
> +		rcu_read_lock_bh();
> +
> +		n = __neigh_lookup_noref(&arp_tbl, &nh->nh_gw, nh->nh_dev);
> +		if (n)
> +			state = n->nud_state;
> +
> +		rcu_read_unlock_bh();
> +	}
> +
> +	return !!(state & NUD_VALID);
> +}
>  
>  void fib_select_multipath(struct fib_result *res, int hash)
>  {
>  	struct fib_info *fi = res->fi;
> +	struct net *net = fi->fib_net;
> +	bool first = false;
>  
>  	for_nexthops(fi) {
>  		if (hash > atomic_read(&nh->nh_upper_bound))
>  			continue;
>  
> -		res->nh_sel = nhsel;
> -		return;
> +		if (!net->ipv4.sysctl_fib_multipath_use_neigh ||
> +		    fib_good_nh(nh)) {
> +			res->nh_sel = nhsel;
> +			return;
> +		}
> +		if (!first) {
> +			res->nh_sel = nhsel;
> +			first = true;
> +		}
>  	} endfor_nexthops(fi);
> -
> -	/* Race condition: route has just become dead. */
> -	res->nh_sel = 0;
>  }
>  #endif
>  
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 1e1fe6086dd9..bb0419582b8d 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -960,6 +960,17 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> +	{
> +		.procname	= "fib_multipath_use_neigh",
> +		.data		= &init_net.ipv4.sysctl_fib_multipath_use_neigh,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &one,
> +	},
> +#endif
>  	{ }
>  };
>  
> -- 
> 1.9.1

Regards

^ permalink raw reply

* Re: Section 4 No. 9,10 Failed was occurred by IPv6 Ready Logo Conformance Test
From: Yuki Machida @ 2016-04-04  6:43 UTC (permalink / raw)
  To: Rongqing Li, netdev
In-Reply-To: <56FE2A90.8080300@jp.fujitsu.com>

Hi Roy,

On 2016-04-01 17:00, Yuki Machida wrote:
> Hi Roy,
> 
> Thank you for your advice.
> I am very glad.
> 
> Further comments below.
> 
> On 2016-04-01 16:43, Rongqing Li wrote:
>>
>>
>> On 2016-04-01 15:31, Yuki Machida wrote:
>>> Hi all,
>>>
>>> I tested 4.6-rc1 by IPv6 Ready Logo Core Conformance Test.
>>> 4.6-rc1 has some FAILs in Section 4 (RFC 1981: Path MTU Discovery for IP version 6).
>>> I confirmed that it PASSed in 3.14.28 and FAILed in 4.1.17.
>>> I will look for the responsible patch between 3.14 and 4.1.
>>>
>>> IPv6 Ready Logo
>>> https://www.ipv6ready.org/
>>> TAHI Project
>>> http://www.tahi.org/
>>>
>>> I ran the IPv6 Ready Logo Core Conformance Test on Intel D510MO (Atom D510).
>>> The userland was built with the Yocto Project.
>>>
>>> Test Environment
>>> Test Specification          : 4.0.6
>>> Tool Version                : REL_3_3_2
>>> Test Program Version        : V6LC_5_0_0
>>> Target Device               : Intel D510MO (Atom D510)
>>>
>>> List of FAILs
>>>
>>> Section 4: RFC 1981 - Path MTU Discovery for IPv6
>>> - Test v6LC.4.1.6: Receiving MTU Below IPv6 Minimum Link MTU
>>>      - No. 9 Part A: MTU equal to 56
>>>      - No.10 Part B: MTU equal to 1279
>>>
>>
>> apply this one
>>
>> commit 8013d1d7eafb0589ca766db6b74026f76b7f5cb4
>> Author: Hangbin Liu <liuhangbin@gmail.com>
>> Date:   Thu Jul 30 14:28:42 2015 +0800
>>
>>       net/ipv6: add sysctl option accept_ra_min_hop_limit
>>
>>       Commit 6fd99094de2b ("ipv6: Don't reduce hop limit for an interface")
>>       disabled accept hop limit from RA if it is smaller than the current hop
>>       limit for security stuff. But this behavior kind of break the RFC
>>       definition.
>>
>>       RFC 4861, 6.3.4.  Processing Received Router Advertisements
>>          A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
>>          and Retrans Timer) may contain a value denoting that it is
>>          unspecified.  In such cases, the parameter should be ignored and the
>>          host should continue using whatever value it is already using.
>>
>>          If the received Cur Hop Limit value is non-zero, the host SHOULD set
>>          its CurHopLimit variable to the received value.
>>
>>       So add sysctl option accept_ra_min_hop_limit to let user choose the
>>       minimum hop limit value they can accept from RA. And set default to 1
>>       to meet RFC standards.
>>
>>       Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
>>       Acked-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
>>       Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> I confirmed that the above patch has been applied since v4.3 in linux.git.
> 
> % git tag --contains=8013d1d7eafb0589ca766db6b74026f76b7f5cb4 | head
> v4.3
> v4.3-rc1
> v4.3-rc2
> v4.3-rc3
> v4.3-rc4
> v4.3-rc5
> v4.3-rc6
> v4.3-rc7
> v4.4
> v4.4-rc1
> 
>>
>>
>>
>>
>>
>> and revert the below one, the TAHI should be updated
>>
>> commit 9d289715eb5c252ae15bd547cb252ca547a3c4f2
>> Author: Hagen Paul Pfeifer <hagen@jauu.net>
>> Date: Thu Jan 15 22:34:25 2015 +0100
>>
>>       ipv6: stop sending PTB packets for MTU < 1280
>>
>>       Reduce the attack vector and stop generating IPv6 Fragment Header for
>>       paths with an MTU smaller than the minimum required IPv6 MTU
>>       size (1280 byte) - called atomic fragments.
>>
>>       See IETF I-D "Deprecating the Generation of IPv6 Atomic Fragments" [1]
>>       for more information and how this "feature" can be misused.
>>
>>       [1] https://tools.ietf.org/html/draft-ietf-6man-deprecate-atomfrag-generation-00
>>
>>       Signed-off-by: Fernando Gont <fgont@si6networks.com>
>>       Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
>>       Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>>       Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> I will try.

I confirmed that v4.1.20 with the above patch reverted passes the Section 4
No. 9 and 10 test cases of the IPv6 Ready Logo Conformance Test.
I can't simply revert the above patch on v4.6-rc1 because the implementation
has changed since then.

I am considering how to fix this problem for mainline.
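
For reference, the two failing parts report an MTU of 56 and of 1279,
both below the IPv6 minimum link MTU of 1280. As I read RFC 1981, a node
receiving such a Packet Too Big message is not required to shrink its
packets below 1280 but must add a Fragment header to them, which is the
"atomic fragment" behavior the reverted commit removed. A rough sketch of
that rule follows; the type and function names are hypothetical and do
not correspond to kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IPV6_MIN_MTU 1280u

/* Hypothetical result type: what the sender does after a PTB message. */
struct pmtu_decision {
	uint32_t path_mtu;        /* path MTU to use from now on */
	bool add_fragment_header; /* send "atomic fragments" on this path */
};

/* Apply the RFC 1981 rule to a Packet Too Big message that reports
 * 'reported_mtu' while the current path MTU is 'current_pmtu'. */
static struct pmtu_decision handle_packet_too_big(uint32_t current_pmtu,
						  uint32_t reported_mtu)
{
	struct pmtu_decision d = { current_pmtu, false };

	if (reported_mtu >= current_pmtu)
		return d; /* a PTB message never increases the path MTU */

	if (reported_mtu < SKETCH_IPV6_MIN_MTU) {
		/* Do not go below 1280, but mark the path for
		 * Fragment headers. */
		d.path_mtu = SKETCH_IPV6_MIN_MTU;
		d.add_fragment_header = true;
	} else {
		d.path_mtu = reported_mtu;
	}
	return d;
}
```

For the two failing cases, a reported MTU of 56 or 1279 on a 1500-byte
path would leave the path MTU at 1280 with the Fragment header flag set.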

> 
>>
>>
>>
>> -Roy
>>
>>
>>
>>
>>> Regards,
>>> Yuki Machida
>>>
>>

^ permalink raw reply

* [PATCH] net: socket: return a proper error code when source address becomes nonlocal
From: Liping Zhang @ 2016-04-04  7:09 UTC (permalink / raw)
  To: davem; +Cc: netdev, Liping Zhang

From: Liping Zhang <liping.zhang@spreadtrum.com>

1. A socket can bind to a local IP address directly via bind() or
   indirectly via connect(). If the network later goes down and the source
   address becomes nonlocal, send() fails with EINVAL. This error code is
   confusing, because we did not actually pass any invalid arguments.
   Furthermore, send() may have succeeded at first and fails now only
   because of a temporary network problem, i.e. once the network recovers,
   send() will succeed again. Returning EADDRNOTAVAIL instead of EINVAL
   is better in this situation.
2. We can use IPV6_PKTINFO to specify the IPv6 source address when calling
   sendmsg() to send a packet, but if the address is not available, the call
   fails with EINVAL. This error code is not very appropriate, since the
   failure may be caused by a temporary network problem. Also, RFC 3542,
   section 6.6 gives an example that returns EADDRNOTAVAIL:
   "ipi6_ifindex specifies an interface but the address ipi6_addr is not
   available for use on that interface.". So return EADDRNOTAVAIL instead
   of EINVAL here.
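
From the application side, the two points above mean a caller can tell a
transient address problem apart from a real argument error. The sketch
below shows one way a userspace program might act on that; the enum and
the helper name are hypothetical, and treating ENETDOWN and ENETUNREACH
as retryable is an assumption of the sketch, not part of this patch.

```c
#include <errno.h>

/* Hypothetical verdict for a failed send()/sendmsg() call. */
enum send_verdict {
	SEND_FATAL,       /* caller bug or permanent error: give up */
	SEND_RETRY_LATER, /* likely transient: retry once the network is back */
};

/* Map the errno from a failed send to a verdict. */
static enum send_verdict classify_send_errno(int err)
{
	switch (err) {
	case EADDRNOTAVAIL: /* source address is currently not local */
	case ENETDOWN:
	case ENETUNREACH:
		return SEND_RETRY_LATER;
	case EINVAL:        /* genuinely invalid arguments */
	default:
		return SEND_FATAL;
	}
}
```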

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
---
 net/ipv4/route.c    |    6 ++++--
 net/ipv6/datagram.c |    2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 02c6229..857f7b3 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2149,11 +2149,13 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,

 	rcu_read_lock();
 	if (fl4->saddr) {
-		rth = ERR_PTR(-EINVAL);
+		rth = ERR_PTR(-EADDRNOTAVAIL);
 		if (ipv4_is_multicast(fl4->saddr) ||
 		    ipv4_is_lbcast(fl4->saddr) ||
-		    ipv4_is_zeronet(fl4->saddr))
+		    ipv4_is_zeronet(fl4->saddr)) {
+			rth = ERR_PTR(-EINVAL);
 			goto out;
+		}

 		/* I removed check for oif == dev_out->oif here.
 		   It was wrong for two reasons:
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 4281621..04d62e8 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -746,7 +746,7 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 						   strict ? dev : NULL, 0) &&
 				    !ipv6_chk_acast_addr_src(net, dev,
 							     &src_info->ipi6_addr))
-					err = -EINVAL;
+					err = -EADDRNOTAVAIL;
 				else
 					fl6->saddr = src_info->ipi6_addr;
 			}
-- 
1.7.9.5

^ permalink raw reply related

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Johannes Berg @ 2016-04-04  7:35 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	john.fastabend, brouer
In-Reply-To: <20160403063834.GE21980@gmail.com>

On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
> 
> Having a common check makes sense. The tricky thing is that the type can
> only be checked after taking the reference, and I wanted to keep the
> scope of the prog brief in the case of errors. I would have to move the
> bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
> ndo instead. Would that API look fine to you?

I can't really comment, I wasn't planning on using the API right now :)

However, what else is there that the driver could possibly do with the
FD, other than getting the bpf_prog?

> A possible extension of this is just to keep the bpf_prog * in the
> netdev itself and expose a feature flag from the driver rather than
> an ndo. But that would mean another 8 bytes in the netdev.

That also misses the signal to the driver when the program is
set/removed, so I don't think that works. I'd argue it's not really
desirable anyway though since I wouldn't expect a majority of drivers
to start supporting this.

johannes

^ permalink raw reply

* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
From: Johannes Berg @ 2016-04-04  7:37 UTC (permalink / raw)
  To: Lorenzo Colitti, Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer
In-Reply-To: <CAKD1Yr351bEXwBOj8e8Hq=_u7J4Zi2-r=w3k9Z3XFe0AP4m5aw@mail.gmail.com>

On Sun, 2016-04-03 at 11:28 +0900, Lorenzo Colitti wrote:

> That said, getting BPF to the driver is part of the picture. On the
> chipsets we're targeting for APF, we're only seeing 2k-4k of memory
> (that's 256-512 BPF instructions) available for filtering code, which
> means that BPF might be too large.

That's true, but I think that as far as the userspace API is concerned
that shouldn't really be an issue. I think we can compile the BPF into
APF, similar to how BPF can be compiled into machine code today.
Additionally, I'm not sure we can realistically expect all devices to
really implement APF "natively", I think there's a good chance but
there's also a possibility of compiling to the native firmware
environment, for example.
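
The 256-512 instruction figure quoted above follows from the 8-byte BPF
instruction encoding. A small sketch of that arithmetic is below; the
struct is a local mirror of the classic BPF instruction layout (struct
sock_filter in <linux/filter.h>), redefined under a hypothetical name so
the example builds without kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the classic BPF instruction layout. */
struct sketch_bpf_insn {
	uint16_t code; /* opcode */
	uint8_t  jt;   /* jump offset if true */
	uint8_t  jf;   /* jump offset if false */
	uint32_t k;    /* generic immediate field */
};

_Static_assert(sizeof(struct sketch_bpf_insn) == 8,
	       "a BPF instruction is expected to be 8 bytes");

/* How many instructions fit into a filter memory of the given size. */
static size_t max_bpf_insns(size_t filter_mem_bytes)
{
	return filter_mem_bytes / sizeof(struct sketch_bpf_insn);
}
```

So max_bpf_insns(2048) yields 256 and max_bpf_insns(4096) yields 512,
matching the numbers in the quoted mail.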

johannes

^ permalink raw reply

* Re: [PATCH v3 00/16] add Intel X722 iWARP driver
From: Christoph Hellwig @ 2016-04-04  7:39 UTC (permalink / raw)
  To: Faisal Latif
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	e1000-rdma-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <1453318816-21672-1-git-send-email-faisal.latif-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On Wed, Jan 20, 2016 at 01:40:00PM -0600, Faisal Latif wrote:
> This driver provides iWARP RDMA functionality for the Intel(R) X722 Ethernet
> controller for PCI Physical Functions. It is in early product cycle
> and having the driver in the kernel will allow users to have hardware support
> when available for purchase.

Just curious: how is this driver supposed to work?  It doesn't seem to
support FRWRs despite the iWarp spec requiring support for them.  It also
sets IB_DEVICE_MEM_MGT_EXTENSIONS despite lacking these methods,
which will lead to instant crashes when using any of the usual drivers.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Has the net-next tree been open now?
From: Dexuan Cui @ 2016-04-04  7:44 UTC (permalink / raw)
  To: David Miller, netdev@vger.kernel.org

Hi David,
I saw the v4.6-rc1 tag had been in net-next.git and a bunch of stmmac patches
appeared on the tree's master branch yesterday.

Thanks,
-- Dexuan

^ permalink raw reply

* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
From: Jesper Dangaard Brouer @ 2016-04-04  7:48 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: Tom Herbert, David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	brouer, Alexander Duyck
In-Reply-To: <20160403054103.GB21980@gmail.com>

On Sat, 2 Apr 2016 22:41:04 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
>
> > Very nice! Do you think this hook will be sufficient to implement a
> > fast forward patch also?

(DMA experts please verify and correct me!)

One of the gotchas is how DMA sync/unmap works.  For forwarding you
need to modify the headers.  The DMA sync API (DMA_FROM_DEVICE) specifies
that the data is to be _considered_ read-only.  AFAIK you can write into
the data, BUT on DMA_unmap the API/DMA-engine is allowed to overwrite
data... note on most archs the DMA_unmap does not overwrite.

This DMA issue should not block the work on a hook for early packet drop.
Maybe we should add a flag option that tells the hook whether the
packet is read-only? (e.g. if the driver uses page-fragments and DMA_sync)

We should have another track/thread on how to solve the DMA issue:
I see two solutions.

Solution 1: Simply use a "full" page per packet and do the DMA_unmap.
This results in a slowdown on archs with expensive DMA-map/unmap.  And
we stress the page allocator more (can be solved with a page-pool-cache).
Eric will not like this due to memory usage, but we can just add a
"copy-break" step for normal stack hand-off.

Solution 2: (Due credit to Alex Duyck, this idea came up while
discussing the issue with him).  Remember DMA_sync'ed data is only
considered read-only, because the DMA_unmap can be destructive.  In many
cases DMA_unmap is not.  Thus, we could take advantage of this, and
allow modifying DMA sync'ed data on those DMA setups.

> That is the goal, but more work needs to be done of course. It won't be
> possible with just a single pseudo skb, the driver will need a fast
> way to get batches of pseudo skbs (per core?) through from rx to tx.
> In mlx4 for instance, either the skb needs to be much more complete
> to be handled from the start of mlx4_en_xmit(), or that function
> would need to be split so that the fast tx could start midway through.
> 
> Or, skb allocation just gets much faster. Then it should be pretty
> straightforward.

With the bulking SLUB API, we can reduce the bare kmem_cache_alloc+free
cost per SKB from 90 cycles to 27 cycles.  It is good, but for really
fast forwarding it would be good to avoid allocating any extra data
structures.  We just want to move a RX packet-page to a TX ring queue.

Maybe the 27 cycles kmem_cache/slab cost is considered "fast-enough",
for what we gain in ease of implementation.  The real expensive part of
the SKB process is memset/clearing the SKB.  Which the fast forward
use-case could avoid.  Splitting the SKB alloc and clearing part would
be a needed first step.
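
To put the 90 vs. 27 cycle figures in perspective, the sketch below
computes the saving and a per-packet cycle budget. The link speed
(10 Gbit/s), the minimum frame size (64 bytes plus 20 bytes of preamble
and inter-frame gap on the wire) and the 3 GHz clock are illustrative
assumptions that do not come from this thread; the function names are
hypothetical.

```c
#include <stdint.h>

/* Cycles saved per SKB by the cheaper allocation path. */
static uint64_t cycles_saved(uint64_t before, uint64_t after)
{
	return before - after;
}

/* Packets per second at wire speed; each Ethernet frame occupies its
 * own length plus 20 bytes (preamble, SFD, inter-frame gap) on the wire. */
static uint64_t wire_rate_pps(uint64_t link_bps, uint64_t frame_bytes)
{
	return link_bps / ((frame_bytes + 20) * 8);
}

/* CPU cycles available per packet on a single core (rounded down). */
static uint64_t cycles_per_packet(uint64_t cpu_hz, uint64_t pps)
{
	return cpu_hz / pps;
}
```

Under those assumptions the budget is about 200 cycles per packet, so a
saving of 63 cycles per SKB is a large share of it, which is why the
remaining 27 cycles may or may not count as "fast-enough".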

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] ip6_tunnel: set rtnl_link_ops before calling register_netdevice
From: Nicolas Dichtel @ 2016-04-04  7:51 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo, netdev
In-Reply-To: <1459541870-26938-1-git-send-email-cascardo@redhat.com>

On 01/04/2016 22:17, Thadeu Lima de Souza Cascardo wrote:
> When creating an ip6tnl tunnel with ip tunnel, rtnl_link_ops is not set
> before ip6_tnl_create2 is called. When register_netdevice is called, there
> is no linkinfo attribute in the NEWLINK message because of that.
>
> Setting rtnl_link_ops before calling register_netdevice fixes that.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Fixes: 0b112457229d ("ip6tnl: add support of link creation via rtnl")
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

^ permalink raw reply

* System hangs (unable to handle kernel paging request)
From: Oleksii Berezhniak @ 2016-04-04  7:59 UTC (permalink / raw)
  To: netdev

Good day.

We have PPPoE server with CentOS 7 (kernel 3.10.0-327.10.1.el7.dsip.x86_64)

We applied some PPPoE related patches to this kernel:

ppp: don't override sk->sk_state in pppoe_flush_dev()
ppp: fix pppoe_dev deletion condition in pppoe_release()
pppoe: fix memory corruption in padt work structure
pppoe: fix reference counting in PPPoE proxy

Also we built latest version of ixgbe driver from Intel.

Now we have crashes after approx. one week of uptime:

[545444.673270] BUG: unable to handle kernel paging request at ffff88a005040200
[545444.673306] IP: [<ffffffff811c0e95>] kmem_cache_alloc+0x75/0x1d0
[545444.673335] PGD 0
[545444.673348] Oops: 0000 [#1] SMP
[545444.673367] Modules linked in: arc4 ppp_mppe act_police cls_u32
sch_ingress sch_tbf pptp gre pppoe pppox ppp_generic slhc 8021q garp
stp mrp llc iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
nf_nat iptable_filter xt_TCPMSS iptable_mangle xt_CT nf_conntrack
iptable_raw w83793 hwmon_vid snd_hda_codec_realtek
snd_hda_codec_generic snd_hda_intel snd_hda_codec coretemp snd_hda_core
iTCO_wdt kvm iTCO_vendor_support snd_hwdep snd_seq snd_seq_device
ipmi_ssif ppdev lpc_ich snd_pcm pcspkr mfd_core sg ipmi_si snd_timer snd
i2c_i801 ipmi_msghandler ioatdma parport_pc parport shpchp soundcore
i7core_edac tpm_infineon edac_core ip_tables ext4 mbcache jbd2 sd_mod
crct10dif_generic crc_t10dif crct10dif_common syscopyarea sysfillrect
firewire_ohci sysimgblt i2c_algo_bit drm_kms_helper ata_generic
pata_acpi
[545444.674383]  ttm firewire_core crc_itu_t serio_raw drm ata_piix
libata crc32c_intel i2c_core ixgbe(OE) vxlan e1000e ip6_udp_tunnel
udp_tunnel aacraid dca ptp pps_core
[545444.674783] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G           OE
------------   3.10.0-327.10.1.el7.dsip.x86_64 #1
[545444.675032] Hardware name: empty empty/S7010, BIOS 'V2.06  ' 03/31/2010
[545444.675162] task: ffff880139c55c00 ti: ffff880139c84000 task.ti:
ffff880139c84000
[545444.675400] RIP: 0010:[<ffffffff811c0e95>]  [<ffffffff811c0e95>]
kmem_cache_alloc+0x75/0x1d0
[545444.675641] RSP: 0018:ffff88023fc23ce8  EFLAGS: 00010286
[545444.675766] RAX: 0000000000000000 RBX: ffff8802302eab00 RCX:
000000010eb8edbe
[545444.676002] RDX: 000000010eb8edbd RSI: 0000000000000020 RDI:
ffff88013b803700
[545444.676237] RBP: ffff88023fc23d18 R08: 00000000000175a0 R09:
ffffffff81517e70
[545444.676472] R10: 000000000000006b R11: 0000000000000000 R12:
ffff88a005040200
[545444.676706] R13: 0000000000000020 R14: ffff88013b803700 R15:
ffff88013b803700
[545444.676942] FS:  0000000000000000(0000) GS:ffff88023fc20000(0000)
knlGS:0000000000000000
[545444.677180] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[545444.677307] CR2: ffff88a005040200 CR3: 0000000237e63000 CR4:
00000000000007e0
[545444.677543] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[545444.677779] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[545444.678014] Stack:
[545444.678127]  ffff880237ea2040 ffff8802302eab00 0000000000000280
0000000000000280
[545444.678370]  0000000000000006 ffff880236bb1b60 ffff88023fc23d40
ffffffff81517e70
[545444.678614]  0000000000000280 ffff8802302eab00 0000000000000000
ffff88023fc23d60
[545444.678857] Call Trace:
[545444.678973]  <IRQ>

[545444.678982]
[545444.679100]  [<ffffffff81517e70>] build_skb+0x30/0x1d0
[545444.679222]  [<ffffffff8151a973>] __alloc_rx_skb+0x63/0xb0
[545444.679349]  [<ffffffff8151a9db>] __netdev_alloc_skb+0x1b/0x40
[545444.679492]  [<ffffffffa0104d8e>] ixgbe_clean_rx_irq+0xee/0xa50 [ixgbe]
[545444.679624]  [<ffffffff8152862f>] ? __napi_complete+0x1f/0x30
[545444.679756]  [<ffffffffa0106738>] ixgbe_poll+0x2d8/0x6d0 [ixgbe]
[545444.679886]  [<ffffffff8152b092>] net_rx_action+0x152/0x240
[545444.680015]  [<ffffffff81084aef>] __do_softirq+0xef/0x280
[545444.680144]  [<ffffffff8164735c>] call_softirq+0x1c/0x30
[545444.680277]  [<ffffffff81016fc5>] do_softirq+0x65/0xa0
[545444.680402]  [<ffffffff81084e85>] irq_exit+0x115/0x120
[545444.680529]  [<ffffffff81647ef8>] do_IRQ+0x58/0xf0
[545444.680660]  [<ffffffff8163d1ad>] common_interrupt+0x6d/0x6d
[545444.680786]  <EOI>
[545444.680794]
[545444.680914]  [<ffffffff81058e96>] ? native_safe_halt+0x6/0x10
[545444.681041]  [<ffffffff8101dbcf>] default_idle+0x1f/0xc0
[545444.681168]  [<ffffffff8101e4d6>] arch_cpu_idle+0x26/0x30
[545444.681297]  [<ffffffff810d62c5>] cpu_startup_entry+0x245/0x290
[545444.681427]  [<ffffffff810475fa>] start_secondary+0x1ba/0x230
[545444.681554] Code: ce 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85
e4 0f 84 1f 01 00 00 48 85 c0 0f 84 16 01 00 00 49 63 46 20 48 8d 4a
01 4d 8b 06 <49> 8b 1c 04 4c
89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63
[545444.682056] RIP  [<ffffffff811c0e95>] kmem_cache_alloc+0x75/0x1d0
[545444.682186]  RSP <ffff88023fc23ce8>
[545444.682305] CR2: ffff88a005040200


The description and call stack are the same every time.

What could be the cause of these crashes?

Thanks.

-- 
WBR

^ permalink raw reply

* Re: net: memory leak due to CLONE_NEWNET
From: Dmitry Vyukov @ 2016-04-04  8:13 UTC (permalink / raw)
  To: Cong Wang
  Cc: David S. Miller, Nicolas Dichtel, Thomas Graf, netdev, LKML,
	Eric Dumazet, syzkaller, Kostya Serebryany, Alexander Potapenko,
	Sasha Levin
In-Reply-To: <CAM_iQpU2_7dFR22xqHm3R3Vh7jbbe1=j3CBBnjhQj3G3AnYvYg@mail.gmail.com>

On Sun, Apr 3, 2016 at 12:31 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Sat, Apr 2, 2016 at 6:55 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> Hello,
>>
>> The following program leads to memory leaks in:
>>
>> unreferenced object 0xffff88005c10d208 (size 96):
>>   comm "a.out", pid 10753, jiffies 4296778619 (age 43.118s)
>>   hex dump (first 32 bytes):
>>     e8 31 85 2d 00 88 ff ff 0f 00 00 00 00 00 00 00  .1.-............
>>     00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
>>   backtrace:
>>     [<ffffffff8679bb23>] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
>>     [<     inline     >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
>>     [<     inline     >] slab_post_alloc_hook mm/slab.h:406
>>     [<     inline     >] slab_alloc_node mm/slub.c:2602
>>     [<     inline     >] slab_alloc mm/slub.c:2610
>>     [<ffffffff8179b4c0>] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
>>     [<     inline     >] kmalloc include/linux/slab.h:478
>>     [<     inline     >] tc_action_net_init include/net/act_api.h:122
>>     [<ffffffff8574e62e>] csum_init_net+0x15e/0x450 net/sched/act_csum.c:593
>>     [<ffffffff8564ffc9>] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
>>     [<ffffffff85650474>] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
>>     [<ffffffff85651a56>] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
>>     [<ffffffff813d01bf>] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
>>     [<ffffffff813d0b69>] unshare_nsproxy_namespaces+0xa9/0x1d0
>
> The following patch should fix it.
>
> diff --git a/include/net/act_api.h b/include/net/act_api.h
> index 2a19fe1..03e322b 100644
> --- a/include/net/act_api.h
> +++ b/include/net/act_api.h
> @@ -135,6 +135,7 @@ void tcf_hashinfo_destroy(const struct tc_action_ops *ops,
>  static inline void tc_action_net_exit(struct tc_action_net *tn)
>  {
>         tcf_hashinfo_destroy(tn->ops, tn->hinfo);
> +       kfree(tn->hinfo);
>  }
>
>  int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb,


Fixes the leak for me.

Tested-by: Dmitry Vyukov <dvyukov@google.com>

Thanks

^ permalink raw reply

* [RFC] ipv6: allow bypassing cross-intf routing limits
From: Michal Kazior @ 2016-04-04  8:15 UTC (permalink / raw)
  To: netdev; +Cc: Michal Kazior
In-Reply-To: <CA+BoTQnC-OKZ8eRohBYetfyW6-xo31kJtS8Lh+svxC=fkVsrXw@mail.gmail.com>

There are some use-cases to allow link-local
routing for bridging purposes.

One of these is allowing transparent 802.11
bridging. Due to 802.11 framing limitations many
Access Points make it impossible to create bridges
on Client endpoints because they can't maintain
Destination/Source/Transmitter/Receiver address
distinction with only 3 addresses in frame header.

The default behavior, i.e. link-local traffic
being non-routable, remains. The user has to
explicitly enable the bypass when defining a given
route.

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
---
For more background see:

  http://www.spinics.net/lists/netdev/msg371022.html



 include/uapi/linux/rtnetlink.h |  8 ++++++--
 net/ipv6/ip6_output.c          | 11 +++++++++--
 net/ipv6/route.c               |  4 ++++
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index ca764b5da86d..a577eec0e56e 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -424,9 +424,13 @@ enum {
 #define RTAX_FEATURE_SACK	(1 << 1)
 #define RTAX_FEATURE_TIMESTAMP	(1 << 2)
 #define RTAX_FEATURE_ALLFRAG	(1 << 3)
+#define RTAX_FEATURE_XFACE	(1 << 4)
 
-#define RTAX_FEATURE_MASK	(RTAX_FEATURE_ECN | RTAX_FEATURE_SACK | \
-				 RTAX_FEATURE_TIMESTAMP | RTAX_FEATURE_ALLFRAG)
+#define RTAX_FEATURE_MASK	(RTAX_FEATURE_ECN | \
+				 RTAX_FEATURE_SACK | \
+				 RTAX_FEATURE_TIMESTAMP | \
+				 RTAX_FEATURE_ALLFRAG | \
+				 RTAX_FEATURE_XFACE)
 
 struct rta_session {
 	__u8	proto;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 9428345d3a07..9abb42acb6ad 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -283,6 +283,7 @@ static int ip6_forward_proxy_check(struct sk_buff *skb)
 	u8 nexthdr = hdr->nexthdr;
 	__be16 frag_off;
 	int offset;
+	int feat = dst_metric_raw(skb_dst(skb), RTAX_FEATURES);
 
 	if (ipv6_ext_hdr(nexthdr)) {
 		offset = ipv6_skip_exthdr(skb, sizeof(*hdr), &nexthdr, &frag_off);
@@ -320,8 +321,11 @@ static int ip6_forward_proxy_check(struct sk_buff *skb)
 	 * The proxying router can't forward traffic sent to a link-local
 	 * address, so signal the sender and discard the packet. This
 	 * behavior is clarified by the MIPv6 specification.
+	 *
+	 * It's useful to allow an override for transparent traffic relay.
 	 */
-	if (ipv6_addr_type(&hdr->daddr) & IPV6_ADDR_LINKLOCAL) {
+	if ((ipv6_addr_type(&hdr->daddr) & IPV6_ADDR_LINKLOCAL) &&
+	    !(feat & RTAX_FEATURE_XFACE)) {
 		dst_link_failure(skb);
 		return -1;
 	}
@@ -485,12 +489,15 @@ int ip6_forward(struct sk_buff *skb)
 			inet_putpeer(peer);
 	} else {
 		int addrtype = ipv6_addr_type(&hdr->saddr);
+		int feat = dst_metric_raw(dst, RTAX_FEATURES);
 
 		/* This check is security critical. */
 		if (addrtype == IPV6_ADDR_ANY ||
 		    addrtype & (IPV6_ADDR_MULTICAST | IPV6_ADDR_LOOPBACK))
 			goto error;
-		if (addrtype & IPV6_ADDR_LINKLOCAL) {
+
+		if ((addrtype & IPV6_ADDR_LINKLOCAL) &&
+		    !(feat & RTAX_FEATURE_XFACE)) {
 			icmpv6_send(skb, ICMPV6_DEST_UNREACH,
 				    ICMPV6_NOT_NEIGHBOUR, 0);
 			goto error;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ed446639219c..560c99853907 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -629,8 +629,12 @@ static inline enum rt6_nud_state rt6_check_neigh(struct rt6_info *rt)
 static int rt6_score_route(struct rt6_info *rt, int oif,
 			   int strict)
 {
+	int feat = dst_metric_raw(&rt->dst, RTAX_FEATURES);
 	int m;
 
+	if (feat & RTAX_FEATURE_XFACE)
+		strict &= ~RT6_LOOKUP_F_IFACE;
+
 	m = rt6_check_dev(rt, oif);
 	if (!m && (strict & RT6_LOOKUP_F_IFACE))
 		return RT6_NUD_FAIL_HARD;
-- 
2.1.4
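
As a rough illustration of the check this patch changes (not kernel code),
here is a userspace C sketch. FEATURE_XFACE mirrors the proposed
RTAX_FEATURE_XFACE bit (1 << 4); ADDR_LINKLOCAL and may_forward() are
hypothetical stand-ins for IPV6_ADDR_LINKLOCAL and the test in ip6_forward().

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the bit proposed in the patch; everything else is hypothetical. */
#define FEATURE_XFACE   (1u << 4)
/* Stand-in for IPV6_ADDR_LINKLOCAL; the value is illustrative only. */
#define ADDR_LINKLOCAL  (1u << 0)

/* Returns true when a packet with the given source address type may be
 * forwarded over a route carrying the given feature bits. */
static bool may_forward(uint32_t addrtype, uint32_t route_features)
{
	if ((addrtype & ADDR_LINKLOCAL) && !(route_features & FEATURE_XFACE))
		return false;	/* default: link-local traffic is not routed */
	return true;
}
```

Without the feature bit the behavior is unchanged; only a route that
explicitly carries it lets link-local traffic through.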

^ permalink raw reply related

* davinci-mdio: failing to connect to PHY
From: Petr Kulhavy @ 2016-04-04  8:18 UTC (permalink / raw)
  To: netdev

Hi,

I'm experiencing a peculiar problem with PHY communication in the 
current davinci-mdio.c driver.
After upgrading from kernel 3.17 to 4.5 my DT based AM1808 board started 
having issues with the PHY communication.
The MAC is detected, the MDIO is detected, and the PHY is detected
(twice?!?!). However, no data is sent or received, and after issuing
"ifdown -a" the MDIO starts spitting out messages that it cannot
connect to the PHY:

net eth0: could not connect to phy davinci_mdio.0:00
davinci_mdio davinci_mdio.0: resetting idled controller

I'm using a single Micrel KSZ8081 PHY connected via RMII using the 
default PHY address 0x01.
Here is the dmesg excerpt related to mdio:

davinci_mdio davinci_mdio.0: Runtime PM disabled, clock forced on.
davinci_mdio davinci_mdio.0: davinci mdio revision 1.5
davinci_mdio davinci_mdio.0: detected phy mask fffffffc
libphy: davinci_mdio.0: probed
davinci_mdio davinci_mdio.0: phy[0]: device davinci_mdio.0:00, driver 
Micrel KSZ8081 or KSZ8091
davinci_mdio davinci_mdio.0: phy[1]: device davinci_mdio.0:01, driver 
Micrel KSZ8081 or KSZ8091
davinci_mdio davinci_mdio.0: resetting idled controller
Micrel KSZ8081 or KSZ8091 davinci_mdio.0:00: failed to disable NAND tree 
mode
Micrel KSZ8081 or KSZ8091 davinci_mdio.0:00: attached PHY driver [Micrel 
KSZ8081 or KSZ8091] (mii_bus:phy_addr=davinci_mdio.0:00, irq=-1)

After a soft-reboot the MDIO uses a different PHY mask fffffffd, detects 
correctly only one PHY at address 1 (this is the default address) and 
the networking works:

davinci_mdio davinci_mdio.0: Runtime PM disabled, clock forced on.
davinci_mdio davinci_mdio.0: davinci mdio revision 1.5
davinci_mdio davinci_mdio.0: detected phy mask fffffffd
libphy: davinci_mdio.0: probed
davinci_mdio davinci_mdio.0: phy[1]: device davinci_mdio.0:01, driver 
Micrel KSZ8081 or KSZ8091
davinci_mdio davinci_mdio.0: resetting idled controller
Micrel KSZ8081 or KSZ8091 davinci_mdio.0:01: attached PHY driver [Micrel 
KSZ8081 or KSZ8091] (mii_bus:phy_addr=davinci_mdio.0:01, irq=-1)

I'm wondering what the problem is and why the PHY mask is different 
after power-up and after a soft reboot.
Also it's not clear to me why this set-up worked with kernel 3.17 even
though it detected the PHY twice in exactly the same way.
How does the mask relate to the PHY address and how is it calculated?

Thanks
Petr
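
The two logs above suggest how to read the mask: each cleared bit seems to
correspond to an MDIO address that answered the scan, so fffffffc leaves
addresses 0 and 1, and fffffffd leaves only address 1. A minimal C sketch of
that decoding, assuming the printed value is an inverted "alive" bitmap (the
helper name phy_addrs_from_mask() is hypothetical, not a driver function):

```c
#include <stddef.h>
#include <stdint.h>

/* Collects the MDIO addresses (0..31) whose bit is clear in phy_mask.
 * Returns the number of addresses written to out[]. */
static size_t phy_addrs_from_mask(uint32_t phy_mask, int out[32])
{
	size_t n = 0;

	for (int addr = 0; addr < 32; addr++) {
		if (!(phy_mask & (UINT32_C(1) << addr)))
			out[n++] = addr;
	}
	return n;
}
```

Applied to the power-up mask, this yields addresses 0 and 1, which matches
the phy[0]/phy[1] lines and the attach attempt on davinci_mdio.0:00.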

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Jesper Dangaard Brouer @ 2016-04-04  8:33 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, Or Gerlitz, daniel,
	john.fastabend, brouer
In-Reply-To: <1459560118-5582-5-git-send-email-bblanco@plumgrid.com>

On Fri,  1 Apr 2016 18:21:57 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>  			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>  
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +
> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {
> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +

For future API, I can imagine more return codes being needed.

For forwarding I could imagine returning "STOLEN", which should not
increment rx_dropped.

One could also imagine supporting tcpdump/af_packet-like facilities at
this packet-page level (e.g. af_packet queues packets into an RX ring
buffer, to be processed/read asynchronously later). It could return
"SHARED", bumping the refcnt on the page and indicating that the page is
now read-only, thus affecting the driver's processing flow.
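
One way to picture these return codes is a small verdict enum with
per-verdict accounting. This is only a sketch: the RX_* names,
struct rx_stats, and account_verdict() are hypothetical, not an existing or
proposed kernel API.

```c
#include <stdbool.h>

enum rx_verdict {
	RX_PASS,	/* hand the packet to the stack */
	RX_DROP,	/* discard; counts as rx_dropped */
	RX_STOLEN,	/* consumed elsewhere (e.g. forwarded); not a drop */
	RX_SHARED,	/* page also held by a reader; now read-only */
};

struct rx_stats {
	unsigned long rx_dropped;
};

/* Updates counters for one packet; returns true if the driver should
 * continue its normal receive processing for that packet. */
static bool account_verdict(enum rx_verdict v, struct rx_stats *st)
{
	switch (v) {
	case RX_DROP:
		st->rx_dropped++;
		return false;
	case RX_STOLEN:
		return false;	/* gone, but deliberately not counted as a drop */
	case RX_SHARED:		/* driver must not modify the page from here on */
	case RX_PASS:
		return true;
	}
	return false;		/* unknown verdict: be conservative */
}
```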

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] Marvell phy: add fiber status check for some components
From: Charles-Antoine Couret @ 2016-04-04  8:45 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20160401170838.GA21633@lunn.ch>

Hi,

> Shouldn't you return to page 0, i.e. MII_M1111_COPPER, under all
> conditions?

I return the result of marvell_read_status(), which returns 0 if no error occurred during the process.
Under the right conditions, my function returns 0 for the COPPER part (and for the FIBER part too).

It doesn't change the returned value or the behavior.

Charles-Antoine

^ permalink raw reply

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Daniel Borkmann @ 2016-04-04  8:49 UTC (permalink / raw)
  To: Brenden Blanco, davem
  Cc: netdev, tom, alexei.starovoitov, gerlitz, john.fastabend, brouer
In-Reply-To: <1459560118-5582-2-git-send-email-bblanco@plumgrid.com>

On 04/02/2016 03:21 AM, Brenden Blanco wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a new
> context type, struct xdp_metadata, is exposed to userspace. So far only
> expose the readable packet length, and only in read mode.
>
> The PHYS_DEV name is chosen to represent that the program is meant only
> for physical adapters, rather than all netdevs.
>
> While the user visible struct is new, the underlying context must be
> implemented as a minimal skb in order for the packet load_* instructions
> to work. The skb filled in by the driver must have skb->len, skb->head,
> and skb->data set, and skb->data_len == 0.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>   include/uapi/linux/bpf.h |  5 ++++
>   kernel/bpf/verifier.c    |  1 +
>   net/core/filter.c        | 68 ++++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 74 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 924f537..b8a4ef2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -92,6 +92,7 @@ enum bpf_prog_type {
>   	BPF_PROG_TYPE_KPROBE,
>   	BPF_PROG_TYPE_SCHED_CLS,
>   	BPF_PROG_TYPE_SCHED_ACT,
> +	BPF_PROG_TYPE_PHYS_DEV,
>   };
>
>   #define BPF_PSEUDO_MAP_FD	1
> @@ -367,6 +368,10 @@ struct __sk_buff {
>   	__u32 tc_classid;
>   };
>
> +struct xdp_metadata {
> +	__u32 len;
> +};

Should this consistently be called 'xdp' or rather 'phys dev',
because currently it's a mixture of both everywhere?

>   struct bpf_tunnel_key {
>   	__u32 tunnel_id;
>   	union {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 2e08f8e..804ca70 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1340,6 +1340,7 @@ static bool may_access_skb(enum bpf_prog_type type)
>   	case BPF_PROG_TYPE_SOCKET_FILTER:
>   	case BPF_PROG_TYPE_SCHED_CLS:
>   	case BPF_PROG_TYPE_SCHED_ACT:
> +	case BPF_PROG_TYPE_PHYS_DEV:
>   		return true;
>   	default:
>   		return false;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index b7177d0..c417db6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2018,6 +2018,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
>   	}
>   }
>
> +static const struct bpf_func_proto *
> +phys_dev_func_proto(enum bpf_func_id func_id)
> +{
> +	return sk_filter_func_proto(func_id);

Do you plan to support bpf_skb_load_bytes() as well? I like using
this API especially when dealing with larger chunks (>4 bytes) to
load into stack memory, plus content is kept in network byte order.
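
To illustrate the kind of helper meant here, the following userspace C
sketch copies a chunk of packet bytes into a caller buffer with a bounds
check and leaves the bytes in network order. pkt_load_bytes() and struct pkt
are hypothetical stand-ins; this is not the kernel's bpf_skb_load_bytes()
implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct pkt {
	const uint8_t *data;	/* start of linear packet data */
	size_t len;		/* readable length in bytes */
};

/* Copies len bytes starting at offset into to[]. Bytes are copied
 * verbatim, so multi-byte fields stay in network byte order.
 * Returns 0 on success, -EFAULT if the range is out of bounds. */
static int pkt_load_bytes(const struct pkt *p, size_t offset,
			  void *to, size_t len)
{
	if (offset > p->len || len > p->len - offset)
		return -EFAULT;
	memcpy(to, p->data + offset, len);
	return 0;
}
```

A single call can then pull, say, a whole address field onto the stack
instead of several 1/2/4-byte loads.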

What about other helpers such as bpf_skb_store_bytes() et al that
work on skbs? Do you intend to reuse them as is and thus populate
the per cpu skb with needed fields (faking linear data), or do you
see larger obstacles that prevent this?

Thanks,
Daniel

^ permalink raw reply

* Re: Backport patch from 4.2 to 3.18
From: Andrei Sharaev @ 2016-04-04  9:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: Sasha Levin, stable, netdev@vger.kernel.org, LKML
In-Reply-To: <56E6EA59.5020302@oracle.com>

Hi David,

Could you help with this problem?

-- 
Best regards,
Andrei Sharaev
BYAS-RIPE
ISP Atlant Telecom
aosharaev@telecom.by

On 14.03.2016 19:44, Sasha Levin wrote:
> On 03/04/2016 04:26 PM, Sasha Levin wrote:
>> On 03/04/2016 03:40 PM, Andrei Sharaev wrote:
>>>> Hi Sasha,
>>>>
>>>> Can you backport this patch for "inet-frag-fixes" to linux kernel 3.18 LTS?
>>>> http://kernel.suse.com/cgit/kernel/commit/?h=v4.2-rc5&id=64b892ad2326348a5b8314167590d240e3bcc69e
>>>>
>>>> I get 1-5 kernel panics per month with Linux kernels 3.18.24-3.18.26 on my NAT server, which carries heavy IPv4 traffic (10-15 Gbps).
>>>> My kernel panics have similar symptoms:
>>>>>> <82>general protection fault: 0000 [#1] SMP
>>>>>> <82>Modules linked in: bonding ipt_NETFLOW(O) xt_recent configfs x86_pkg_temp_thermal ixgbe(O)
>>>>>> <86>CPU: 13 PID: 29908 Comm: kworker/13:2 Tainted: G          IO   3.18.26 #1
>>>>>> <86>Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0005.101720141054 10/17/2014
>>>>>> <82>Workqueue: events inet_frag_worker
>>>>>> <86>task: ffff88046cdba9a0 ti: ffff880454928000 task.ti: ffff880454928000
>>>>>> <82>RIP: 0010:[<ffffffff815b91d9>]  [<ffffffff815b91d9>] inet_evict_bucket+0x109/0x160
>>>>>> <86>RSP: 0018:ffff88045492bd38  EFLAGS: 00010286
>>>>>> <86>RAX: ffff880441d0e001 RBX: dead0000001000c0 RCX: 000000018030002e
>>>>>> <86>RDX: 000000018030002f RSI: ffff880441d0e000 RDI: dead0000001000c0
>>>>>> <86>RBP: ffff88045492bd88 R08: 0000000000000000 R09: ffff88086cc88500
>>>>>> <86>R10: ffff88046fdb5c50 R11: ffffea0011074380 R12: 0000000000000002
>>>>>> <86>R13: ffffffff81e02200 R14: 0000000000000000 R15: ffff88083f0942a0
>>>>>> <86>FS:  0000000000000000(0000) GS:ffff88046fda0000(0000) knlGS:0000000000000000
>>>>>> <86>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> <86>CR2: 00007fab8f466000 CR3: 000000085f66d000 CR4: 00000000001407e0
>>>>>> <86>Stack:
>>>>>> <82> ffffffff81e05a78 ffffffff81e05a70 ffff88046cdba9a0 ffff88083f0942e0
>>>>>> <82> ffff88046f808c00 0000000000000079 ffffffff81e02200 ffffffff81e06200
>>>>>> <82> 0000000000000388 0000000000000007 ffff88045492bdf8 ffffffff815b928a
>>>>>> <86>Call Trace:
>>>>>> <82> [<ffffffff815b928a>] inet_frag_worker+0x5a/0x230
>>>>>> <82> [<ffffffff8105808d>] process_one_work+0x12d/0x330
>>>>>> <82> [<ffffffff8105892b>] worker_thread+0x4b/0x450
>>>>>> <82> [<ffffffff810588e0>] ? cancel_delayed_work_sync+0x10/0x10
>>>>>> <82> [<ffffffff8105cd34>] kthread+0xc4/0xe0
>>>>>> <82> [<ffffffff81060c59>] ? finish_task_switch+0x49/0xc0
>>>>>> <82> [<ffffffff8105cc70>] ? kthread_create_on_node+0x170/0x170
>>>>>> <82> [<ffffffff81603d88>] ret_from_fork+0x58/0x90
>>>>>> <82> [<ffffffff8105cc70>] ? kthread_create_on_node+0x170/0x170
>>>>>> <82>Code: f6 0f 85 73 ff ff ff 48 8b 45 b8 80 40 08 01 48 8b 7d c8 48 85 ff 74 23 48 83 ef 40 75 0d eb 1b 66 90 48 83 eb 40 48 89 df 74 10 <48> 8b 5f 40 41 ff 95 70 40 00 00 48 85 db 75 e7 48 83 c4 28 44
>>>>>> <22>RIP  [<ffffffff815b91d9>] inet_evict_bucket+0x109/0x160
>>>>>> <82> RSP <ffff88045492bd38>
>> Hey Andrei,
>>
>> Thanks for the report.
>>
>> Usually David Miller (Cc'ed) handles backporting network commits. In this case, I see
>> that he has elected not to backport it into 4.1 or 3.18, so I don't want to do it without
>> getting an ack from him first.
>>
>> David, is it ok to backport these commits back to 3.18 (and probably 4.1)?
> Ping?

^ permalink raw reply

* [PATCH] net: mvneta: Remove superfluous SMP function call
From: Anna-Maria Gleixner @ 2016-04-04  9:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: rt, Anna-Maria Gleixner, Thomas Petazzoni, netdev

Since commit 1cf4f629d9d2 ("cpu/hotplug: Move online calls to
hotplugged cpu") it is ensured that callbacks of CPU_ONLINE and
CPU_DOWN_PREPARE are processed on the hotplugged cpu. Due to this SMP
function calls are no longer required.

Replace smp_call_function_single() with a direct call to
mvneta_percpu_enable() or mvneta_percpu_disable(). The functions do
not need to be called with interrupts disabled, so the
smp_call_function_single() calling convention is not preserved.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
---
 drivers/net/ethernet/marvell/mvneta.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -3354,8 +3354,7 @@ static int mvneta_percpu_notifier(struct
 		/* Enable per-CPU interrupts on the CPU that is
 		 * brought up.
 		 */
-		smp_call_function_single(cpu, mvneta_percpu_enable,
-					 pp, true);
+		mvneta_percpu_enable(pp);
 
 		/* Enable per-CPU interrupt on the one CPU we care
 		 * about.
@@ -3387,8 +3386,7 @@ static int mvneta_percpu_notifier(struct
 		/* Disable per-CPU interrupts on the CPU that is
 		 * brought down.
 		 */
-		smp_call_function_single(cpu, mvneta_percpu_disable,
-					 pp, true);
+		mvneta_percpu_disable(pp);
 
 		break;
 	case CPU_DEAD:

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Daniel Borkmann @ 2016-04-04  9:22 UTC (permalink / raw)
  To: Brenden Blanco, davem
  Cc: netdev, tom, alexei.starovoitov, gerlitz, john.fastabend, brouer
In-Reply-To: <1459560118-5582-5-git-send-email-bblanco@plumgrid.com>

On 04/02/2016 03:21 AM, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
>
> Since mlx4 is so far the only user of this pseudo skb, the build
> function is defined locally.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 61 ++++++++++++++++++++++++++
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 18 ++++++++
>   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  2 +
>   3 files changed, 81 insertions(+)
>
[...]
>
> +static DEFINE_PER_CPU(struct sk_buff, percpu_pseudo_skb);
> +
> +static void build_pseudo_skb_for_bpf(struct sk_buff *skb, void *data,
> +				     unsigned int length)
> +{
> +	/* data_len is intentionally not set here so that skb_is_nonlinear()
> +	 * returns false
> +	 */
> +
> +	skb->len = length;
> +	skb->head = data;
> +	skb->data = data;
> +}
> +
> +int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length)
> +{
> +	struct sk_buff *skb = this_cpu_ptr(&percpu_pseudo_skb);
> +	int ret;
> +
> +	build_pseudo_skb_for_bpf(skb, data, length);
> +
> +	rcu_read_lock();
> +	ret = BPF_PROG_RUN(prog, (void *)skb);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}

Couldn't this diff rather live in filter.c? It doesn't seem mlx4 specific. When
placed there, the API would also make the requirements clear for every driver
wanting to implement xdp wrt the metadata that needs to be passed, and it would
make the code easier to review (as drivers just call a few core helpers rather
than needing to re-implement the pseudo skb et al).

> +static int mlx4_bpf_set(struct net_device *dev, int fd)
> +{
> +	struct mlx4_en_priv *priv = netdev_priv(dev);
> +	struct bpf_prog *prog = NULL, *old_prog;
> +
> +	if (fd >= 0) {
> +		prog = bpf_prog_get(fd);
> +		if (IS_ERR(prog))
> +			return PTR_ERR(prog);
> +
> +		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> +			bpf_prog_put(prog);
> +			return -EINVAL;
> +		}

This block could just be a generic helper that mlx4_bpf_set() calls from here.

> +	}
> +
> +	old_prog = xchg(&priv->prog, prog);
> +	if (old_prog) {
> +		synchronize_net();
> +		bpf_prog_put(old_prog);
> +	}
> +
> +	priv->dev->bpf_valid = !!prog;

Could the 'bpf_valid' addition to the net_device be avoided altogether?

The API could probably just be named .ndo_bpf() and, depending on how you invoke
it, it either sets/deletes the program or tells (via the return code) whether a
program is currently attached.

> +	return 0;
> +}
> +
>   static const struct net_device_ops mlx4_netdev_ops = {
>   	.ndo_open		= mlx4_en_open,
>   	.ndo_stop		= mlx4_en_close,
> @@ -2486,6 +2545,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
>   	.ndo_features_check	= mlx4_en_features_check,
>   #endif
>   	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
> +	.ndo_bpf_set		= mlx4_bpf_set,
>   };
>
>   static const struct net_device_ops mlx4_netdev_ops_master = {
> @@ -2524,6 +2584,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
>   	.ndo_features_check	= mlx4_en_features_check,
>   #endif
>   	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
> +	.ndo_bpf_set		= mlx4_bpf_set,
>   };
>
>   struct mlx4_en_bond {
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 86bcfe5..03fe005 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -748,6 +748,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
>   	struct mlx4_en_rx_alloc *frags;
>   	struct mlx4_en_rx_desc *rx_desc;
> +	struct bpf_prog *prog;
>   	struct sk_buff *skb;
>   	int index;
>   	int nr;
> @@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	if (budget <= 0)
>   		return polled;
>
> +	prog = READ_ONCE(priv->prog);
> +
>   	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>   	 * descriptor offset can be deduced from the CQE index instead of
>   	 * reading 'cqe->index' */
> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>   			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +
> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {

Since such a program will be ABI, the return code might get some more additions
in the future (e.g. forwarding, etc), so it needs to be thought through so that
we don't burn ourselves later.

Maybe reuse tc opcodes, or define own ones?

We currently would have:

  0    - Drop.
  1    - Pass to stack.
  rest - Reserved for future use.
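
As a sketch of how a driver might decode such a return value, assuming the
numbering listed above: the PHYS_DEV_* names and decode_action() are
hypothetical, and mapping reserved values to a drop is an assumption of this
example rather than anything decided in this thread.

```c
#include <stdint.h>

/* Values taken from the list above; the names are hypothetical. */
enum phys_dev_action {
	PHYS_DEV_DROP = 0,
	PHYS_DEV_PASS = 1,
};

/* Maps a raw program return value to an action. Treating reserved
 * values as a drop is an assumption of this sketch. */
static enum phys_dev_action decode_action(uint32_t ret)
{
	return ret == PHYS_DEV_PASS ? PHYS_DEV_PASS : PHYS_DEV_DROP;
}
```

Keeping the decoding in one place would let later codes be added without
touching each call site.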

> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +
>   		if (likely(dev->features & NETIF_F_RXCSUM)) {
>   			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
>   						      MLX4_CQE_STATUS_UDP)) {
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index d12ab6a..3d0fc89 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -568,6 +568,7 @@ struct mlx4_en_priv {
>   	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
>   	struct hwtstamp_config hwtstamp_config;
>   	u32 counter_index;
> +	struct bpf_prog *prog;
>
>   #ifdef CONFIG_MLX4_EN_DCB
>   	struct ieee_ets ets;
> @@ -682,6 +683,7 @@ int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv);
>   void mlx4_en_destroy_drop_qp(struct mlx4_en_priv *priv);
>   int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring);
>   void mlx4_en_rx_irq(struct mlx4_cq *mcq);
> +int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length);
>
>   int mlx4_SET_MCAST_FLTR(struct mlx4_dev *dev, u8 port, u64 mac, u64 clear, u8 mode);
>   int mlx4_SET_VLAN_FLTR(struct mlx4_dev *dev, struct mlx4_en_priv *priv);
>

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Daniel Borkmann @ 2016-04-04  9:57 UTC (permalink / raw)
  To: Johannes Berg, Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, john.fastabend,
	brouer
In-Reply-To: <1459755310.18188.13.camel@sipsolutions.net>

On 04/04/2016 09:35 AM, Johannes Berg wrote:
> On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
>>
>> Having a common check makes sense. The tricky thing is that the type can
>> only be checked after taking the reference, and I wanted to keep the
>> scope of the prog brief in the case of errors. I would have to move the
>> bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
>> ndo instead. Would that API look fine to you?
>
> I can't really comment, I wasn't planning on using the API right now :)
>
> However, what else is there that the driver could possibly do with the
> FD, other than getting the bpf_prog?
>
>> A possible extension of this is just to keep the bpf_prog * in the
>> netdev itself and expose a feature flag from the driver rather than
>> an ndo. But that would mean another 8 bytes in the netdev.
>
> That also misses the signal to the driver when the program is
> set/removed, so I don't think that works. I'd argue it's not really
> desirable anyway though since I wouldn't expect a majority of drivers
> to start supporting this.

I think ndo is probably fine for this purpose, see also my other mail. I
think currently, the only truly driver-specific code would be storing
the prog pointer somewhere and passing the needed metadata to populate the
fake skb.

Maybe mid-term drivers might want to reuse this hook/signal for offloading
as well, not yet sure ... how would that relate to offloading of cls_bpf?
Should these be considered two different things (although from an offloading
perspective they are not really). _Conceptually_, XDP could also be seen
as a software offload for the facilities we support with cls_bpf et al.

Thanks,
Daniel

^ permalink raw reply
