Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] tg3: Avoid NULL pointer dereference in tg3_get_nstats()
From: Michael Chan @ 2017-01-05 21:53 UTC (permalink / raw)
  To: David Miller
  Cc: wangyufen, Siva Reddy Kallam, prashant.sreedharan@broadcom.com,
	michael.chan@broadcom.com, Netdev
In-Reply-To: <20170105.151705.2187713228201334532.davem@davemloft.net>

On Thu, Jan 5, 2017 at 12:17 PM, David Miller <davem@davemloft.net> wrote:
> From: Michael Chan <michael.chan@broadcom.com>
> Date: Thu, 5 Jan 2017 12:04:13 -0800
>
>> But it looks like ndo_get_stats() can be called without rtnl lock from
>> net-procfs.c.  So it is possible that we'll read tp->hw_stats after it
>> has been freed.  For example, if we are reading /proc/net/dev and
>> closing tg3 at the same time.  David, is not taking rtnl_lock in
>> net-procfs.c by design?
>
> Probably not, that dev_get_stats() call probably should be surrounded
> by RTNL protection.
>
> Doing a quick grep on dev_get_stats() shows other call sites, most of
> which are using it to fetch slave device statistics from the get stats
> method of the parent.  Which should be ok.
>
> It appears that the vlan procfs code in net/8021q/vlanproc.c has a
> similar bug as net/core/net-procfs.c
>
> Maybe net/core/net-sysfs.c has the same issue as well, and perhaps also
> net/openvswitch/vport.c:ovs_vport_get_stats().
>

OK.  I will send a patch later today to add rtnl_lock to these callers.

^ permalink raw reply

* Re: [Intel-wired-lan] [net-next PATCH v2 2/6] i40e: Introduce VF Port Representator(VFPR) netdevs.
From: Jeff Kirsher @ 2017-01-05 21:46 UTC (permalink / raw)
  To: Sridhar Samudrala, alexander.h.duyck, john.r.fastabend,
	anjali.singhai, jakub.kicinski, davem, intel-wired-lan, netdev
In-Reply-To: <1483466874-2962-3-git-send-email-sridhar.samudrala@intel.com>

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

On Tue, 2017-01-03 at 10:07 -0800, Sridhar Samudrala wrote:
> VF Port Representator netdevs are created for each VF if the switch mode
> is set to 'switchdev'. These netdevs can be used to control and configure
> VFs from PFs namespace. They enable exposing VF statistics, configure and
> monitor link state, mtu, filters, fdb/vlan entries etc. of VFs.
> Broadcast filters are not enabled in switchdev mode.
> 
> Sample script to create VF port representors
> # rmmod i40e; modprobe i40e
> # devlink dev eswitch set pci/0000:05:00.0 mode switchdev
> # echo 2 > /sys/class/net/enp5s0f0/device/sriov_numvfs
> # ip l show
> 297: enp5s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop portid
> 6805ca2e7268 state DOWN mode DEFAULT group default qlen 1000
>      link/ether 68:05:ca:2e:72:68 brd ff:ff:ff:ff:ff:ff
>      vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto,
> trust off
>      vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto,
> trust off
> 299: enp5s0f0-vf0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> mode DEFAULT group default qlen 1000
>      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
> 300: enp5s0f0-vf1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> mode DEFAULT group default qlen 1000
>      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c        |  21 ++-
>  drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 154
> ++++++++++++++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |  14 ++
>  3 files changed, 182 insertions(+), 7 deletions(-)

This does not apply cleanly because it is based on an older version of
i40e_virtchnl_pf.c file.  It appears that i40e has been updated to use
"i40e_add_filter()" yet your patch still uses "i40e_add_mac_filter()".

We need to clarify what the "right way" is to add filters and use the
correct function.

Dropping this series and will await v3, please address the other feedback
from Or Gerlitz and Jiri Pirko as well in your updated series.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* [RFC PATCH] tcp: accept RST for rcv_nxt - 1 after receiving a FIN
From: Jason Baron @ 2017-01-05 21:33 UTC (permalink / raw)
  To: netdev

Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab'), are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then does not reply with any RST as would be
expected on a closed socket.

This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.

I also found a posting that the iOS client behaves in a similar manner here
(it will send a FIN followed by a RST for rcv_nxt - 1):
https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/

A packetdrill test demonstrating the behavior.

// testing mac osx rst behavior

// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4

// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768

// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768

// make sure we are in TCP_CLOSE
0.400 %{
assert tcpi_state == 7
}%

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 net/ipv4/tcp_input.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ec6d84363024..373bea05c93b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5249,6 +5249,24 @@ static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
 	return err;
 }
 
+/* Accept RST for rcv_nxt - 1 after a FIN.
+ * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
+ * FIN is sent followed by a RST packet. The RST is sent with the same
+ * sequence number as the FIN, and thus according to RFC 5961 a challenge
+ * ACK should be sent. However, Mac OSX does not reply to the challenge ACK
+ * with a RST on the closed socket, hence accept this class of RSTs.
+ */
+static bool tcp_reset_check(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	return unlikely((TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1)) &&
+			(TCP_SKB_CB(skb)->end_seq == (tp->rcv_nxt - 1))	&&
+			(sk->sk_state == TCP_CLOSE_WAIT ||
+			 sk->sk_state == TCP_LAST_ACK ||
+			 sk->sk_state == TCP_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5287,6 +5305,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 						  LINUX_MIB_TCPACKSKIPPEDSEQ,
 						  &tp->last_oow_ack_time))
 				tcp_send_dupack(sk, skb);
+		} else if (tcp_reset_check(sk, skb)) {
+			tcp_reset(sk);
 		}
 		goto discard;
 	}
@@ -5300,7 +5320,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		 * else
 		 *     Send a challenge ACK
 		 */
-		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+		    tcp_reset_check(sk, skb)) {
 			rst_seq_match = true;
 		} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
 			struct tcp_sack_block *sp = &tp->selective_acks[0];
-- 
2.6.1

^ permalink raw reply related

* TCP using IPv4-mapped IPv6 address as source
From: Jonathan T. Leighton @ 2017-01-05 21:25 UTC (permalink / raw)
  To: netdev

I've observed TCP using an IPv4-mapped IPv6 address as the source 
address, which I believe contradicts 
https://tools.ietf.org/html/rfc6890#page-14 (BCP 153). This occurs when 
an IPv6 TCP socket, bound to a local IPv4-mapped IPv6 address, attempts 
to connect to a remote IPv6 address. Presumable connect() should return 
EAFNOSUPPORT in this case. Please advise me if this is not to 
appropriate list to report this.

$ uname -a
Linux ubuntu 4.4.0-57-generic #78-Ubuntu SMP Fri Dec 9 23:50:32 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux

Best regards,
Jon

^ permalink raw reply

* Re: [PATCH 2/2] isdn: i4l: move active-isdn drivers to staging
From: Greg Kroah-Hartman @ 2017-01-05 21:19 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: devel, Karsten Keil, linux-doc, netdev, Jonathan Corbet,
	linux-kernel, David S. Miller, linux-arm-kernel
In-Reply-To: <3286019.I2UkVCJq41@wuerfel>

On Tue, Jan 03, 2017 at 10:19:29PM +0100, Arnd Bergmann wrote:
> On Tuesday, January 3, 2017 4:24:36 PM CET Greg Kroah-Hartman wrote:
> > On Wed, Mar 02, 2016 at 08:06:46PM +0100, Arnd Bergmann wrote:
> > > The icn, act2000 and pcbit drivers are all for very old hardware,
> > > and it is highly unlikely that anyone is actually still using them
> > > on modern kernels, if at all.
> > > 
> > > All three drivers apparently are for hardware that predates PCI
> > > being the common connector, as they are ISA-only and active
> > > PCI ISDN cards were widely available in the 1990s.
> > > 
> > > Looking through the git logs, it I cannot find any indication of a
> > > patch to any of these drivers that has been tested on real hardware,
> > > only cleanups or global API changes.
> > > 
> > > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > > Acked-by: Karsten Keil <isdn@linux-pingi.de>
> > 
> > This patch got added in the 4.6 kernel release.  As I am now taking
> > patches for 4.11-rc1, I figure it is time to just delete the
> > drivers/staging/i4l/ directory now, given that no one has really done
> > anything with it.  If people show up that wish to maintain it, I'll be
> > glad to revert it, or if someone really screams in the next week.
> > Otherwise it's time to just move on 
> 
> Sounds good to me.

Ok, now deleted!

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] net: stmmac: fix maxmtu assignment to be within valid range
From: Andy Shevchenko @ 2017-01-05 21:06 UTC (permalink / raw)
  To: Kweh, Hock Leong
  Cc: David S. Miller, Joao Pinto, Giuseppe CAVALLARO,
	seraphin.bonnaffe, Jarod Wilson, Alexandre TORGUE,
	Joachim Eastwood, Niklas Cassel, Johan Hovold, Pavel Machek,
	lars.persson, netdev, LKML
In-Reply-To: <1483613277-6005-1-git-send-email-hock.leong.kweh@intel.com>

On Thu, Jan 5, 2017 at 12:47 PM, Kweh, Hock Leong
<hock.leong.kweh@intel.com> wrote:
> From: "Kweh, Hock Leong" <hock.leong.kweh@intel.com>
>
> There is no checking valid value of maxmtu when getting it from devicetree.

'Device Tree' or 'device tree' ?

> This resolution added the checking condition to ensure the assignment is
> made within a valid range.

> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 39eb7a6..683d59f 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -3319,7 +3319,8 @@ int stmmac_dvr_probe(struct device *device,
>                 ndev->max_mtu = JUMBO_LEN;
>         else
>                 ndev->max_mtu = SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN);
> -       if (priv->plat->maxmtu < ndev->max_mtu)
> +       if ((priv->plat->maxmtu < ndev->max_mtu) &&
> +           (priv->plat->maxmtu >= ndev->min_mtu))
>                 ndev->max_mtu = priv->plat->maxmtu;

Perhaps add a warning message on else branch?

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* [PATCH] net: phy: dp83867: fix irq generation
From: Grygorii Strashko @ 2017-01-05 20:48 UTC (permalink / raw)
  To: Florian Fainelli, netdev, Dan Murphy, Mugunthan V N
  Cc: Grygorii Strashko, linux-omap, Sekhar Nori, linux-kernel,
	linux-arm-kernel

For proper IRQ generation by DP83867 phy the INT/PWDN pin has to be
programmed as an interrupt output instead of a Powerdown input in
Configuration Register 3 (CFG3), Address 0x001E, bit 7 INT_OE = 1. The
current driver doesn't do this and as result IRQs will not be generated by
DP83867 phy even if they are properly configured in DT.

Hence, fix IRQ generation by properly configuring CFG3.INT_OE bit and
ensure that Link Status Change (LINK_STATUS_CHNG_INT) and Auto-Negotiation
Complete (AUTONEG_COMP_INT) interrupt are enabled. After this the DP83867
driver will work properly in interrupt enabled mode.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 drivers/net/phy/dp83867.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
index 1b63924..e84ae08 100644
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@@ -29,6 +29,7 @@
 #define MII_DP83867_MICR	0x12
 #define MII_DP83867_ISR		0x13
 #define DP83867_CTRL		0x1f
+#define DP83867_CFG3		0x1e
 
 /* Extended Registers */
 #define DP83867_RGMIICTL	0x0032
@@ -98,6 +99,8 @@ static int dp83867_config_intr(struct phy_device *phydev)
 		micr_status |=
 			(MII_DP83867_MICR_AN_ERR_INT_EN |
 			MII_DP83867_MICR_SPEED_CHNG_INT_EN |
+			MII_DP83867_MICR_AUTONEG_COMP_INT_EN |
+			MII_DP83867_MICR_LINK_STS_CHNG_INT_EN |
 			MII_DP83867_MICR_DUP_MODE_CHNG_INT_EN |
 			MII_DP83867_MICR_SLEEP_MODE_CHNG_INT_EN);
 
@@ -214,6 +217,13 @@ static int dp83867_config_init(struct phy_device *phydev)
 		}
 	}
 
+	/* Enable Interrupt output INT_OE in CFG3 register */
+	if (phy_interrupt_is_valid(phydev)) {
+		val = phy_read(phydev, DP83867_CFG3);
+		val |= BIT(7);
+		phy_write(phydev, DP83867_CFG3, val);
+	}
+
 	return 0;
 }
 
-- 
2.10.1.dirty

^ permalink raw reply related

* Re: [PATCH net-next 01/10] net: netcp: ethss: add support of subsystem register region regmap
From: Murali Karicheri @ 2017-01-05 20:42 UTC (permalink / raw)
  To: Rob Herring
  Cc: netdev, linux-omap, grygorii.strashko, mugunthanvnm, linux-kernel,
	arnd, davem, devicetree, mark.rutland
In-Reply-To: <20161222212409.arhurpl4bpqf2yw6@rob-hp-laptop>

Rob,

On 12/22/2016 04:24 PM, Rob Herring wrote:
> On Tue, Dec 20, 2016 at 05:09:44PM -0500, Murali Karicheri wrote:
>> From: WingMan Kwok <w-kwok2@ti.com>
>>
>> 10gbe phy driver needs to access the 10gbe subsystem control
>> register during phy initialization. To facilitate the shared
>> access of the subsystem register region between the 10gbe Ethernet
>> driver and the phy driver, this patch adds support of the
>> subsystem register region defined by a syscon node in the dts.
>>
>> Although there is no shared access to the gbe subsystem register
>> region, using syscon for that is for the sake of consistency.
>>
>> This change is backward compatible with previously released gbe
>> devicetree bindings.
>>
>> Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
>> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
>> Signed-off-by: Sekhar Nori <nsekhar@ti.com>
>> ---
>>  .../devicetree/bindings/net/keystone-netcp.txt     |  16 ++-
>>  drivers/net/ethernet/ti/netcp_ethss.c              | 140 +++++++++++++++++----
>>  2 files changed, 127 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/devicetree/bindings/net/keystone-netcp.txt b/Documentation/devicetree/bindings/net/keystone-netcp.txt
>> index 04ba1dc..0854a73 100644
>> --- a/Documentation/devicetree/bindings/net/keystone-netcp.txt
>> +++ b/Documentation/devicetree/bindings/net/keystone-netcp.txt
>> @@ -72,20 +72,24 @@ Required properties:
>>  		"ti,netcp-gbe-2" for 1GbE N NetCP 1.5 (N=2)
>>  		"ti,netcp-xgbe" for 10 GbE
>>  
>> +- syscon-subsys:	phandle to syscon node of the switch
>> +			subsystem registers.
>> +
>>  - reg:		register location and the size for the following register
>>  		regions in the specified order.
>>  		- switch subsystem registers
>> +		- sgmii module registers
> 
> This needs to go on the end of the list. Otherwise, it is not backwards 
> compatible.

Thanks for your review! I assumed backward compatibility means new kernel
should work with old DTB. The driver code is adjusted to work with both
DTBs. Isn't that enough?

Murali

> 
>>  		- sgmii port3/4 module registers (only for NetCP 1.4)
>>  		- switch module registers
>>  		- serdes registers (only for 10G)
>>  
>>  		NetCP 1.4 ethss, here is the order
>> -			index #0 - switch subsystem registers
>> +			index #0 - sgmii module registers
>>  			index #1 - sgmii port3/4 module registers
>>  			index #2 - switch module registers
>>  
>>  		NetCP 1.5 ethss 9 port, 5 port and 2 port
>> -			index #0 - switch subsystem registers
>> +			index #0 - sgmii module registers
>>  			index #1 - switch module registers
>>  			index #2 - serdes registers
>>  
>> @@ -145,6 +149,11 @@ Optional properties:
>>  
>>  Example binding:
>>  
>> +gbe_subsys: subsys@2090000 {
>> +	compatible = "syscon";
>> +	reg = <0x02090000 0x100>;
>> +};
>> +
>>  netcp: netcp@2000000 {
>>  	reg = <0x2620110 0x8>;
>>  	reg-names = "efuse";
>> @@ -163,7 +172,8 @@ netcp: netcp@2000000 {
>>  		ranges;
>>  		gbe@90000 {
>>  			label = "netcp-gbe";
>> -			reg = <0x90000 0x300>, <0x90400 0x400>, <0x90800 0x700>;
>> +			syscon-subsys = <&gbe_subsys>;
>> +			reg = <0x90100 0x200>, <0x90400 0x200>, <0x90800 0x700>;
>>  			/* enable-ale; */
>>  			tx-queue = <648>;
>>  			tx-channel = <8>;
> 


-- 
Murali Karicheri
Linux Kernel, Keystone

^ permalink raw reply

* RE: [PATCH net] hyper-v: Add myself as additional MAINTAINER
From: KY Srinivasan @ 2017-01-05 20:34 UTC (permalink / raw)
  To: gregkh@linuxfoundation.org
  Cc: Stephen Hemminger, davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Stephen Hemminger
In-Reply-To: <20170105200911.GA6579@kroah.com>



> -----Original Message-----
> From: gregkh@linuxfoundation.org [mailto:gregkh@linuxfoundation.org]
> Sent: Thursday, January 5, 2017 12:09 PM
> To: KY Srinivasan <kys@microsoft.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>;
> davem@davemloft.net; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org; Stephen Hemminger <sthemmin@microsoft.com>
> Subject: Re: [PATCH net] hyper-v: Add myself as additional MAINTAINER
> 
> On Thu, Jan 05, 2017 at 07:08:23PM +0000, KY Srinivasan wrote:
> >
> >
> > > -----Original Message-----
> > > From: gregkh@linuxfoundation.org [mailto:gregkh@linuxfoundation.org]
> > > Sent: Thursday, January 5, 2017 10:29 AM
> > > To: KY Srinivasan <kys@microsoft.com>
> > > Cc: Stephen Hemminger <stephen@networkplumber.org>;
> > > davem@davemloft.net; netdev@vger.kernel.org; linux-
> > > kernel@vger.kernel.org; Stephen Hemminger
> <sthemmin@microsoft.com>
> > > Subject: Re: [PATCH net] hyper-v: Add myself as additional MAINTAINER
> > >
> > > On Thu, Jan 05, 2017 at 05:43:04PM +0000, KY Srinivasan wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > > Sent: Thursday, January 5, 2017 9:36 AM
> > > > > To: davem@davemloft.net; KY Srinivasan <kys@microsoft.com>
> > > > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > gregkh@linuxfoundation.org; Stephen Hemminger
> > > > > <sthemmin@microsoft.com>
> > > > > Subject: [PATCH net] hyper-v: Add myself as additional MAINTAINER
> > > > >
> > > > > Update the Hyper-V MAINTAINERS to include myself.
> > > > >
> > > > > Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> > > >
> > > > Acked-by: K. Y. Srinivasan <kys@microsoft.com>
> > >
> > > Thanks, will go queue this up now.
> >
> > Thanks Greg. On a different note, there are a bunch of Hyper-V specific
> > patches that have been submitted over the last month or so that have not
> > been committed. Should I resend them.
> 
> Nope, they are still in my mbox, I'm just going through stuff that has
> to be in 4.10-final at the moment, give me another week or so to catch
> up on all the new stuff for 4.11-rc1.

Thanks Greg.

K. Y
> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation
From: Daniel Borkmann @ 2017-01-05 20:20 UTC (permalink / raw)
  To: Daniel Mack, ast; +Cc: dh.herrmann, netdev, davem
In-Reply-To: <91fbf150-4e97-7400-90cf-dbe7ea3e6597@zonque.org>

Hi Daniel,

On 01/05/2017 09:04 PM, Daniel Mack wrote:
> On 01/05/2017 05:25 PM, Daniel Borkmann wrote:
>> On 12/29/2016 06:28 PM, Daniel Mack wrote:
>
>>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>>> new file mode 100644
>>> index 0000000..8b6a61d
>>> --- /dev/null
>>> +++ b/kernel/bpf/lpm_trie.c
>
> [..]
>
>>> +static struct bpf_map *trie_alloc(union bpf_attr *attr)
>>> +{
>>> +	struct lpm_trie *trie;
>>> +
>>> +	/* check sanity of attributes */
>>> +	if (attr->max_entries == 0 || attr->map_flags ||
>>> +	    attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
>>> +	    attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
>>> +	    attr->value_size != sizeof(u64))
>>> +		return ERR_PTR(-EINVAL);
>>
>> The correct attr->map_flags test here would need to be ...
>>
>>     attr->map_flags != BPF_F_NO_PREALLOC
>>
>> ... since in this case we don't have any prealloc pool, and
>> should that come one day that test could be relaxed again.
>>
>>> +	trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN);
>>> +	if (!trie)
>>> +		return NULL;
>>> +
>>> +	/* copy mandatory map attributes */
>>> +	trie->map.map_type = attr->map_type;
>>> +	trie->map.key_size = attr->key_size;
>>> +	trie->map.value_size = attr->value_size;
>>> +	trie->map.max_entries = attr->max_entries;
>>
>> You also need to fill in trie->map.pages as that is eventually
>> used to charge memory against in bpf_map_charge_memlock(), right
>> now that would remain as 0 meaning the map is not accounted for.
>
> Hmm, okay. The nodes are, however, allocated dynamically at runtime in
> this case. That means that we have trie->map.pages on each allocation,
> right?

The current scheme (f.e. htab_map_alloc() has some details, although
probably not too obvious) that was done charges worst-case cost up front,
so it would be in trie_alloc() where you fill map.pages and map_create()
will later account for them.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] tg3: Avoid NULL pointer dereference in tg3_get_nstats()
From: David Miller @ 2017-01-05 20:17 UTC (permalink / raw)
  To: michael.chan; +Cc: wangyufen, siva.kallam, prashant, mchan, netdev
In-Reply-To: <CACKFLimPkD6hxECA+ZhH+7BVmVSoJ1GfAMZyJ7S50CbhiuC0mA@mail.gmail.com>

From: Michael Chan <michael.chan@broadcom.com>
Date: Thu, 5 Jan 2017 12:04:13 -0800

> But it looks like ndo_get_stats() can be called without rtnl lock from
> net-procfs.c.  So it is possible that we'll read tp->hw_stats after it
> has been freed.  For example, if we are reading /proc/net/dev and
> closing tg3 at the same time.  David, is not taking rtnl_lock in
> net-procfs.c by design?

Probably not, that dev_get_stats() call probably should be surrounded
by RTNL protection.

Doing a quick grep on dev_get_stats() shows other call sites, most of
which are using it to fetch slave device statistics from the get stats
method of the parent.  Which should be ok.

It appears that the vlan procfs code in net/8021q/vlanproc.c has a
similar bug as net/core/net-procfs.c

Maybe net/core/net-sysfs.c has the same issue as well, and perhaps also
net/openvswitch/vport.c:ovs_vport_get_stats().

^ permalink raw reply

* Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation
From: Daniel Mack @ 2017-01-05 20:14 UTC (permalink / raw)
  To: Daniel Borkmann, ast; +Cc: dh.herrmann, netdev, davem
In-Reply-To: <586EA627.6020404@iogearbox.net>

Hi,

On 01/05/2017 09:01 PM, Daniel Borkmann wrote:
> On 01/05/2017 05:25 PM, Daniel Borkmann wrote:
>> On 12/29/2016 06:28 PM, Daniel Mack wrote:

> [...]
>>> +static struct bpf_map *trie_alloc(union bpf_attr *attr)
>>> +{
>>> +    struct lpm_trie *trie;
>>> +
>>> +    /* check sanity of attributes */
>>> +    if (attr->max_entries == 0 || attr->map_flags ||
>>> +        attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
>>> +        attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
>>> +        attr->value_size != sizeof(u64))
>>> +        return ERR_PTR(-EINVAL);
> 
> One more question on this regarding value size as u64 (perhaps I
> missed it along the way): reason this was chosen was because for
> keeping stats? Why not making user choose a size as in other maps,
> so also custom structs could be stored there?

In my use case, the actual value of a node is in fact ignored, all that
matters is whether a node exists in a trie or not. The test code uses
u64 for its tests.

I can change it around so that the value size can be defined by
userspace, but ideally it would also support 0-byte lengths then. The
bpf map syscall handler should handle the latter just fine if I read the
code correctly?


Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net] hyper-v: Add myself as additional MAINTAINER
From: gregkh @ 2017-01-05 20:09 UTC (permalink / raw)
  To: KY Srinivasan
  Cc: Stephen Hemminger, davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Stephen Hemminger
In-Reply-To: <DM5PR03MB24902386946C55BCE617941CA0600@DM5PR03MB2490.namprd03.prod.outlook.com>

On Thu, Jan 05, 2017 at 07:08:23PM +0000, KY Srinivasan wrote:
> 
> 
> > -----Original Message-----
> > From: gregkh@linuxfoundation.org [mailto:gregkh@linuxfoundation.org]
> > Sent: Thursday, January 5, 2017 10:29 AM
> > To: KY Srinivasan <kys@microsoft.com>
> > Cc: Stephen Hemminger <stephen@networkplumber.org>;
> > davem@davemloft.net; netdev@vger.kernel.org; linux-
> > kernel@vger.kernel.org; Stephen Hemminger <sthemmin@microsoft.com>
> > Subject: Re: [PATCH net] hyper-v: Add myself as additional MAINTAINER
> > 
> > On Thu, Jan 05, 2017 at 05:43:04PM +0000, KY Srinivasan wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > Sent: Thursday, January 5, 2017 9:36 AM
> > > > To: davem@davemloft.net; KY Srinivasan <kys@microsoft.com>
> > > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > gregkh@linuxfoundation.org; Stephen Hemminger
> > > > <sthemmin@microsoft.com>
> > > > Subject: [PATCH net] hyper-v: Add myself as additional MAINTAINER
> > > >
> > > > Update the Hyper-V MAINTAINERS to include myself.
> > > >
> > > > Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> > >
> > > Acked-by: K. Y. Srinivasan <kys@microsoft.com>
> > 
> > Thanks, will go queue this up now.
> 
> Thanks Greg. On a different note, there are a bunch of Hyper-V specific
> patches that have been submitted over the last month or so that have not
> been committed. Should I resend them.

Nope, they are still in my mbox, I'm just going through stuff that has
to be in 4.10-final at the moment, give me another week or so to catch
up on all the new stuff for 4.11-rc1.

thanks,

greg k-h

^ permalink raw reply

* Re: [for-next 07/10] IB/mlx5: Use blue flame register allocator in mlx5_ib
From: David Miller @ 2017-01-05 20:07 UTC (permalink / raw)
  To: eli-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb
  Cc: saeedm-VPRAkNaXOzVWk0Htik3J/w, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	leonro-VPRAkNaXOzVWk0Htik3J/w, eli-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, leon-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <CAL3tnx4H2HsGMH=caiobivvvs0AU9yrzCmVONOdUQ_j_JMfufA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

From: Eli Cohen <eli-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
Date: Thu, 5 Jan 2017 14:03:18 -0600

> If necessary I can make sure it builds on 32 bits as well.

Please do.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v4 net-next] tools: psock_tpacket: block Rx until socket filter has been added and socket has been bound to loopback.
From: David Miller @ 2017-01-05 20:04 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: netdev, daniel, willemb
In-Reply-To: <6b77972e99e49096676e04854067a44226e926aa.1483642577.git.sowmini.varadhan@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Thu,  5 Jan 2017 11:06:22 -0800

> Packets from any/all interfaces may be queued up on the PF_PACKET socket
> before it is bound to the loopback interface by psock_tpacket, and
> when these are passed up by the kernel, they could interfere
> with the Rx tests.
> 
> Avoid interference from spurious packet by blocking Rx until the
> socket filter has been set up, and the packet has been bound to the
> desired (lo) interface. The effective sequence is
> 	socket(PF_PACKET, SOCK_RAW, 0);
> 	set up ring
> 	Invoke SO_ATTACH_FILTER
> 	bind to sll_protocol set to ETH_P_ALL, sll_ifindex for lo
> After this sequence, the only packets that will be passed up are
> those received on loopback that pass the attached filter.
> 
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation
From: Daniel Mack @ 2017-01-05 20:04 UTC (permalink / raw)
  To: Daniel Borkmann, ast; +Cc: dh.herrmann, netdev, davem
In-Reply-To: <586E7366.1010708@iogearbox.net>

Hi Daniel,

Thanks for your feedback! I agree on all points. Two questions below.

On 01/05/2017 05:25 PM, Daniel Borkmann wrote:
> On 12/29/2016 06:28 PM, Daniel Mack wrote:

>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> new file mode 100644
>> index 0000000..8b6a61d
>> --- /dev/null
>> +++ b/kernel/bpf/lpm_trie.c

[..]

>> +static struct bpf_map *trie_alloc(union bpf_attr *attr)
>> +{
>> +	struct lpm_trie *trie;
>> +
>> +	/* check sanity of attributes */
>> +	if (attr->max_entries == 0 || attr->map_flags ||
>> +	    attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
>> +	    attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
>> +	    attr->value_size != sizeof(u64))
>> +		return ERR_PTR(-EINVAL);
> 
> The correct attr->map_flags test here would need to be ...
> 
>    attr->map_flags != BPF_F_NO_PREALLOC
> 
> ... since in this case we don't have any prealloc pool, and
> should that come one day that test could be relaxed again.
> 
>> +	trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN);
>> +	if (!trie)
>> +		return NULL;
>> +
>> +	/* copy mandatory map attributes */
>> +	trie->map.map_type = attr->map_type;
>> +	trie->map.key_size = attr->key_size;
>> +	trie->map.value_size = attr->value_size;
>> +	trie->map.max_entries = attr->max_entries;
> 
> You also need to fill in trie->map.pages as that is eventually
> used to charge memory against in bpf_map_charge_memlock(), right
> now that would remain as 0 meaning the map is not accounted for.

Hmm, okay. The nodes are, however, allocated dynamically at runtime in
this case. That means that we have trie->map.pages on each allocation,
right?

>> +static void trie_free(struct bpf_map *map)
>> +{
>> +	struct lpm_trie_node __rcu **slot;
>> +	struct lpm_trie_node *node;
>> +	struct lpm_trie *trie =
>> +		container_of(map, struct lpm_trie, map);
>> +
>> +	spin_lock(&trie->lock);
>> +
>> +	/*
>> +	 * Always start at the root and walk down to a node that has no
>> +	 * children. Then free that node, nullify its pointer in the parent,
>> +	 * then start over.
>> +	 */
>> +
>> +	for (;;) {
>> +		slot = &trie->root;
>> +
>> +		for (;;) {
>> +			node = rcu_dereference_protected(*slot,
>> +					lockdep_is_held(&trie->lock));
>> +			if (!node)
>> +				goto out;
>> +
>> +			if (node->child[0]) {
> 
> rcu_access_pointer(node->child[0]) (at least to keep sparse happy?)

Done, but sparse does not actually complain here.



Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] tg3: Avoid NULL pointer dereference in tg3_get_nstats()
From: Michael Chan @ 2017-01-05 20:04 UTC (permalink / raw)
  To: David Miller
  Cc: wangyufen, Siva Reddy Kallam, prashant.sreedharan@broadcom.com,
	michael.chan@broadcom.com, Netdev
In-Reply-To: <20170105.123337.2237827308340782208.davem@davemloft.net>

On Thu, Jan 5, 2017 at 9:33 AM, David Miller <davem@davemloft.net> wrote:
> From: Wang Yufen <wangyufen@huawei.com>
> Date: Thu, 5 Jan 2017 22:13:21 +0800
>
>> From: Yufen Wang <wangyufen@huawei.com>
>>
>> A possible NULL pointer dereference in tg3_get_stats64 while doing
>> tg3_free_consistent.
>  ...
>> This patch avoids the NULL pointer dereference by using !tg3_flag(tp, INIT_COMPLETE)
>> instate of !tp->hw_stats.
>>
>> Signed-off-by: Yufen Wang <wangyufen@huawei.com>
>> ---
>>  drivers/net/ethernet/broadcom/tg3.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
>> index 185e9e0..012f18d 100644
>> --- a/drivers/net/ethernet/broadcom/tg3.c
>> +++ b/drivers/net/ethernet/broadcom/tg3.c
>> @@ -14148,7 +14148,7 @@ static struct rtnl_link_stats64 *tg3_get_stats64(struct net_device *dev,
>>       struct tg3 *tp = netdev_priv(dev);
>>
>>       spin_lock_bh(&tp->lock);
>> -     if (!tp->hw_stats) {
>> +     if (!tg3_flag(tp, INIT_COMPLETE)) {
>
> The real issue is the manner and order in which the driver performs
> initialization actions relative to netif_device_{attach,detach}().
>
> That is what needs to be fixed here, instead of adding more and more
> ad-hoc tests to the various methods which can be invoked once the
> netif_device_attach() occurs.

Normally, ndo_get_stats64() should be under rtnl lock in the netlink
code path and we should be safe. We only free tp->hw_stats under rtnl
lock in the close path or ethtool path.

But it looks like ndo_get_stats() can be called without rtnl lock from
net-procfs.c.  So it is possible that we'll read tp->hw_stats after it
has been freed.  For example, if we are reading /proc/net/dev and
closing tg3 at the same time.  David, is not taking rtnl_lock in
net-procfs.c by design?

^ permalink raw reply

* Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation
From: Daniel Borkmann @ 2017-01-05 20:01 UTC (permalink / raw)
  To: Daniel Mack, ast; +Cc: dh.herrmann, netdev, davem
In-Reply-To: <586E7366.1010708@iogearbox.net>

On 01/05/2017 05:25 PM, Daniel Borkmann wrote:
> On 12/29/2016 06:28 PM, Daniel Mack wrote:
>> This trie implements a longest prefix match algorithm that can be used
>> to match IP addresses to a stored set of ranges.
>>
>> Internally, data is stored in an unbalanced trie of nodes that has a
>> maximum height of n, where n is the prefixlen the trie was created
>> with.
>>
>> Tries may be created with prefix lengths that are multiples of 8, in
>> the range from 8 to 2048. The key used for lookup and update operations
>> is a struct bpf_lpm_trie_key, and the value is a uint64_t.
>>
>> The code carries more information about the internal implementation.
>>
>> Signed-off-by: Daniel Mack <daniel@zonque.org>
>> Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
>
> Thanks for working on it, and sorry for late reply. In addition to
> Alexei's earlier comments on the cover letter, a few comments inline:
>
[...]
>> +static struct bpf_map *trie_alloc(union bpf_attr *attr)
>> +{
>> +    struct lpm_trie *trie;
>> +
>> +    /* check sanity of attributes */
>> +    if (attr->max_entries == 0 || attr->map_flags ||
>> +        attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
>> +        attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
>> +        attr->value_size != sizeof(u64))
>> +        return ERR_PTR(-EINVAL);

One more question on this regarding value size as u64 (perhaps I
missed it along the way): reason this was chosen was because for
keeping stats? Why not making user choose a size as in other maps,
so also custom structs could be stored there?

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net-next v2] tcp: provide timestamps for partial writes
From: David Miller @ 2017-01-05 19:56 UTC (permalink / raw)
  To: soheil.kdev; +Cc: netdev, soheil, willemb, edumazet, ncardwell, kafai
In-Reply-To: <20170104161934.26849-1-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Date: Wed,  4 Jan 2017 11:19:34 -0500

> From: Soheil Hassas Yeganeh <soheil@google.com>
> 
> For TCP sockets, TX timestamps are only captured when the user data
> is successfully and fully written to the socket. In many cases,
> however, TCP writes can be partial for which no timestamp is
> collected.
> 
> Collect timestamps whenever any user data is (fully or partially)
> copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp
> instead of the local skb pointer since it can be set to NULL on
> the error path.
> 
> Note that tcp_write_queue_tail can be NULL, even if bytes have been
> copied to the socket. This is because acknowledgements are being
> processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is
> called tcp_write_queue_tail can be NULL. For such cases, this patch
> does not collect any timestamps (i.e., it is best-effort).
> 
> This patch is written with suggestions from Willem de Bruijn and
> Eric Dumazet.
> 
> Change-log V1 -> V2:
> 	- Use sockc.tsflags instead of sk->sk_tsflags.
> 	- Use the same code path for normal writes and errors.
> 
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>

Applied, thanks.

^ permalink raw reply

* Re: [for-next 07/10] IB/mlx5: Use blue flame register allocator in mlx5_ib
From: David Miller @ 2017-01-05 19:51 UTC (permalink / raw)
  To: saeedm-VPRAkNaXOzVWk0Htik3J/w
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leonro-VPRAkNaXOzVWk0Htik3J/w,
	eli-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	leon-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <1483480528-22622-8-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

From: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date: Tue,  3 Jan 2017 23:55:25 +0200

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> index ddb4ca4..39505ac 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> @@ -5,7 +5,7 @@
>  config MLX5_CORE
>  	tristate "Mellanox Technologies ConnectX-4 and Connect-IB core driver"
>  	depends on MAY_USE_DEVLINK
> -	depends on PCI
> +	depends on PCI && 64BIT
>  	default n
>  	---help---
>  	  Core driver for low level functionality of the ConnectX-4 and

This is a regression, I'm not applying this.

I don't care how hard it is, you have to keep the driver building properly
in non-64bit builds.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] phy state machine: failsafe leave invalid RUNNING state
From: Florian Fainelli @ 2017-01-05 19:39 UTC (permalink / raw)
  To: Zefir Kurtisi, netdev; +Cc: andrew
In-Reply-To: <327be2cf-8b89-3823-2951-b31a44697a2f@neratec.com>

On 01/05/2017 01:23 AM, Zefir Kurtisi wrote:
> On 01/04/2017 10:44 PM, Florian Fainelli wrote:
>> On 01/04/2017 08:10 AM, Zefir Kurtisi wrote:
>>> On 01/04/2017 04:30 PM, Florian Fainelli wrote:
>>>>
>>>>
>>>> On 01/04/2017 07:27 AM, Zefir Kurtisi wrote:
>>>>> On 01/04/2017 04:13 PM, Florian Fainelli wrote:
>>>>>>
>>>>>>
>>>>>> On 01/04/2017 07:04 AM, Zefir Kurtisi wrote:
>>>>>>> While in RUNNING state, phy_state_machine() checks for link changes by
>>>>>>> comparing phydev->link before and after calling phy_read_status().
>>>>>>> This works as long as it is guaranteed that phydev->link is never
>>>>>>> changed outside the phy_state_machine().
>>>>>>>
>>>>>>> If in some setups this happens, it causes the state machine to miss
>>>>>>> a link loss and remain RUNNING despite phydev->link being 0.
>>>>>>>
>>>>>>> This has been observed running a dsa setup with a process continuously
>>>>>>> polling the link states over ethtool each second (SNMPD RFC-1213
>>>>>>> agent). Disconnecting the link on a phy followed by a ETHTOOL_GSET
>>>>>>> causes dsa_slave_get_settings() / dsa_slave_get_link_ksettings() to
>>>>>>> call phy_read_status() and with that modify the link status - and
>>>>>>> with that bricking the phy state machine.
>>>>>>
>>>>>> That's the interesting part of the analysis, how does this brick the PHY
>>>>>> state machine? Is the PHY driver changing the link status in the
>>>>>> read_status callback that it implements?
>>>>>>
>>>>> phydev->read_status points to genphy_read_status(), where the first call goes to
>>>>> genphy_update_link() which updates the link status.
>>>>>
>>>>> Thereafter phy_state_machine():RUNNING won't be able to detect the link loss
>>>>> anymore unless the link state changes again.
>>>>>
>>>>>
>>>>> I was trying to figure out if there is a rule that forbids changing phydev->link
>>>>> from outside the state machine, but found several places where it happens (either
>>>>> directly, or over genphy_read_status() or over genphy_update_link()).
>>>>>
>>>>> Curious how this did not show up before, since within the dsa setup it is very
>>>>> easy to trigger:
>>>>> a) physically disconnect link
>>>>> b) within one second run ethtool ethX
>>>>
>>>> You need to be more specific here about what "the dsa setup" is, drivers
>>>> involved, which ports of the switch you are seeing this with (user
>>>> facing, CPU port, DSA port?) etc.
>>>>
>>> I am working on top of LEDE and with that at kernel 4.4.21 - alas I checked the
>>> related source files and believe the effect should be reproducible with HEAD.
>>>
>>> The setup is as follows:
>>> mv88e6321:
>>> * ports 0+1 connected to fibre-optics transceivers at fixed 100 Mbps
>>> * port 4 is CPU port
>>> * custom phy driver (replacement for marvell.ko) only populated with
>>>   * .config_init to
>>>     * set fixed speed for ports 0+1 (when in FO mode)
>>>     * run genphy_config_init() for all other modes (here: CPU port)
>>>   * .config_aneg=genphy_config_aneg, .read_status=genphy_read_status
>>>
>>>
>>> To my understanding, the exact setup is irrelevant - to reproduce the issue it is
>>> enough to have a means of running genphy_update_link() (as done in e.g.
>>> mediatek/mtk_eth_soc.c, dsa/slave.c), or genphy_read_status() (as done in e.g.
>>> hisilicon/hns/hns_enet.c) or phy_read_status() (as done in e.g.
>>> ethernet/ti/netcp_ethss.c, ethernet/aeroflex/greth.c, etc.). In the observed
>>> drivers it is mostly implemented in the ETHTOOL_GSET execution path.
>>>
>>> Once you get the link state updated outside the phy state machine, it remains in
>>> invalid RUNNING. To prevent that invalid state, to my understanding upper layer
>>> drivers (Ethernet, dsa) must not modify link-states in any case (including calling
>>> the functions noted above), or we need the proposed fail-safe mechanism to prevent
>>> getting stuck.
>>
>> OK, I see the code path involved now, sorry -ENOCOFFEE when I initially
>> responded. Yes, clearly, we should not be mangling the PHY device's link
>> by calling genphy_read_status(). At first glance, none of the users
>> below should be doing what they are doing, but let's kick a separate
>> patch series to collect feedback from the driver writes.
>>
>> Thanks!
>>
> Ok, thanks for taking time.
> 
> The kbuild test robot error is due to 'struct device dev' been removed from
> phy_device struct since 4.4.21. Does it make sense to provide a v2 fixing that, or
> do you expect that this fail-safe mechanism is not needed once all Ethernet/dsa
> drivers are fixed?

I think there is value in identifying wrong behaving drivers while we
fix them one after the other.

> 
> I think it won't hurt to add the check simply to ensure that it got fixed and the
> issue is not popping up thereafter.

Agreed, can you resubmit against the latest net-next/master tree?

Thanks!
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next] packet: fix panic in __packet_set_timestamp on tpacket_v3 in tx mode
From: Daniel Borkmann @ 2017-01-05 19:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, sowmini.varadhan, willemb, netdev
In-Reply-To: <1483640872.9712.1.camel@edumazet-glaptop3.roam.corp.google.com>

On 01/05/2017 07:27 PM, Eric Dumazet wrote:
> On Thu, 2017-01-05 at 02:34 +0100, Daniel Borkmann wrote:
[...]
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 7e39087..ddbda25 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -481,6 +481,9 @@ static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
>>   		h.h2->tp_nsec = ts.tv_nsec;
>>   		break;
>>   	case TPACKET_V3:
>> +		h.h3->tp_sec = ts.tv_sec;
>> +		h.h3->tp_nsec = ts.tv_nsec;
>> +		break;
>>   	default:
>>   		WARN(1, "TPACKET version not supported.\n");
>>   		BUG();
>
> Gosh. Can we also replace this BUG() into something less aggressive ?

There are currently 5 of these WARN() + BUG() constructs and 1 BUG()-only
for the 'default' TPACKET version spread all over af_packet, so probably
makes sense to rather make all of them less aggressive.

> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index b9e1a13b4ba36a0bc7edf6a8c2c116c7d48c970c..0c0d268544787dcbef6601c5014e7d3836d16f96 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -476,9 +476,11 @@ static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
>   		h.h2->tp_nsec = ts.tv_nsec;
>   		break;
>   	case TPACKET_V3:
> +		h.h3->tp_sec = ts.tv_sec;
> +		h.h3->tp_nsec = ts.tv_nsec;
> +		break;
>   	default:
> -		WARN(1, "TPACKET version not supported.\n");
> -		BUG();
> +		pr_err_once("TPACKET version %u not supported.\n", po->tp_version);
>   	}
>
>   	/* one flush is safe, as both fields always lie on the same cacheline */
>
>

^ permalink raw reply

* [net-next PATCH v6 2/3] net: rtnetlink: Use a local dev_num_vf() implementation
From: Phil Sutter @ 2017-01-05 19:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20170105190913.29986-1-phil@nwl.cc>

Promote dev_num_vf() to be no longer PCI device specific but use
ndo_get_vf_count() if implemented and only fall back to pci_num_vf()
like the old dev_num_vf() did.

Since this implementation no longer requires a parent device to be
present, don't pass the parent but the actual device to it and have it
check for parent existence only in the fallback case. This in turn
allows to eliminate parent existence checks in callers.

Signed-off-by: Phil Sutter <phil@nwl.cc>
---
Changes since v5:
- Introduced this patch.
---
 include/linux/pci.h  |  2 --
 net/core/rtnetlink.c | 37 ++++++++++++++++++++++++-------------
 2 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index e2d1a124216a9..adbc859fe7c4c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -885,7 +885,6 @@ void pcibios_setup_bridge(struct pci_bus *bus, unsigned long type);
 void pci_sort_breadthfirst(void);
 #define dev_is_pci(d) ((d)->bus == &pci_bus_type)
 #define dev_is_pf(d) ((dev_is_pci(d) ? to_pci_dev(d)->is_physfn : false))
-#define dev_num_vf(d) ((dev_is_pci(d) ? pci_num_vf(to_pci_dev(d)) : 0))
 
 /* Generic PCI functions exported to card drivers */
 
@@ -1630,7 +1629,6 @@ static inline int pci_get_new_domain_nr(void) { return -ENOSYS; }
 
 #define dev_is_pci(d) (false)
 #define dev_is_pf(d) (false)
-#define dev_num_vf(d) (0)
 #endif /* CONFIG_PCI */
 
 /* Include architecture-dependent settings and functions */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 18b5aae99becf..84294593e0306 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -833,13 +833,24 @@ static void copy_rtnl_link_stats(struct rtnl_link_stats *a,
 	a->rx_nohandler = b->rx_nohandler;
 }
 
+static int dev_num_vf(const struct net_device *dev)
+{
+	if (dev->netdev_ops->ndo_get_vf_count)
+		return dev->netdev_ops->ndo_get_vf_count(dev);
+#ifdef CONFIG_PCI
+	if (dev->dev.parent && dev_is_pci(dev->dev.parent))
+		return pci_num_vf(to_pci_dev(dev->dev.parent));
+#endif
+	return 0;
+}
+
 /* All VF info */
 static inline int rtnl_vfinfo_size(const struct net_device *dev,
 				   u32 ext_filter_mask)
 {
-	if (dev->dev.parent && dev_is_pci(dev->dev.parent) &&
-	    (ext_filter_mask & RTEXT_FILTER_VF)) {
-		int num_vfs = dev_num_vf(dev->dev.parent);
+	int num_vfs = dev_num_vf(dev);
+
+	if (num_vfs && (ext_filter_mask & RTEXT_FILTER_VF)) {
 		size_t size = nla_total_size(0);
 		size += num_vfs *
 			(nla_total_size(0) +
@@ -889,12 +900,12 @@ static size_t rtnl_port_size(const struct net_device *dev,
 	size_t port_self_size = nla_total_size(sizeof(struct nlattr))
 		+ port_size;
 
-	if (!dev->netdev_ops->ndo_get_vf_port || !dev->dev.parent ||
+	if (!dev->netdev_ops->ndo_get_vf_port ||
 	    !(ext_filter_mask & RTEXT_FILTER_VF))
 		return 0;
-	if (dev_num_vf(dev->dev.parent))
+	if (dev_num_vf(dev))
 		return port_self_size + vf_ports_size +
-			vf_port_size * dev_num_vf(dev->dev.parent);
+			vf_port_size * dev_num_vf(dev);
 	else
 		return port_self_size;
 }
@@ -962,7 +973,7 @@ static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
 	if (!vf_ports)
 		return -EMSGSIZE;
 
-	for (vf = 0; vf < dev_num_vf(dev->dev.parent); vf++) {
+	for (vf = 0; vf < dev_num_vf(dev); vf++) {
 		vf_port = nla_nest_start(skb, IFLA_VF_PORT);
 		if (!vf_port)
 			goto nla_put_failure;
@@ -1012,7 +1023,7 @@ static int rtnl_port_fill(struct sk_buff *skb, struct net_device *dev,
 {
 	int err;
 
-	if (!dev->netdev_ops->ndo_get_vf_port || !dev->dev.parent ||
+	if (!dev->netdev_ops->ndo_get_vf_port ||
 	    !(ext_filter_mask & RTEXT_FILTER_VF))
 		return 0;
 
@@ -1020,7 +1031,7 @@ static int rtnl_port_fill(struct sk_buff *skb, struct net_device *dev,
 	if (err)
 		return err;
 
-	if (dev_num_vf(dev->dev.parent)) {
+	if (dev_num_vf(dev)) {
 		err = rtnl_vf_ports_fill(skb, dev);
 		if (err)
 			return err;
@@ -1351,15 +1362,15 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	if (rtnl_fill_stats(skb, dev))
 		goto nla_put_failure;
 
-	if (dev->dev.parent && (ext_filter_mask & RTEXT_FILTER_VF) &&
-	    nla_put_u32(skb, IFLA_NUM_VF, dev_num_vf(dev->dev.parent)))
+	if (ext_filter_mask & RTEXT_FILTER_VF &&
+	    nla_put_u32(skb, IFLA_NUM_VF, dev_num_vf(dev)))
 		goto nla_put_failure;
 
-	if (dev->netdev_ops->ndo_get_vf_config && dev->dev.parent &&
+	if (dev->netdev_ops->ndo_get_vf_config &&
 	    ext_filter_mask & RTEXT_FILTER_VF) {
 		int i;
 		struct nlattr *vfinfo;
-		int num_vfs = dev_num_vf(dev->dev.parent);
+		int num_vfs = dev_num_vf(dev);
 
 		vfinfo = nla_nest_start(skb, IFLA_VFINFO_LIST);
 		if (!vfinfo)
-- 
2.11.0

^ permalink raw reply related

* [net-next PATCH v6 0/3] net: dummy: Introduce dummy virtual functions
From: Phil Sutter @ 2017-01-05 19:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

This series adds VF support to dummy device driver after adding the
necessary infrastructure changes:

Patch 1 adds a netdevice callback for device-specific VF count
retrieval. Patch 2 then changes dev_num_vf() implementation to make use
of that new callback (if implemented), falling back to the old
behaviour. Patch 3 then implements VF support in dummy, without the fake
PCI parent device hack from v5.

Phil Sutter (3):
  net: net_device_ops: Introduce ndo_get_vf_count
  net: rtnetlink: Use a local dev_num_vf() implementation
  net: dummy: Introduce dummy virtual functions

 drivers/net/dummy.c       | 178 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/netdevice.h |   5 ++
 include/linux/pci.h       |   2 -
 net/core/rtnetlink.c      |  37 ++++++----
 4 files changed, 205 insertions(+), 17 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [net-next PATCH v6 1/3] net: net_device_ops: Introduce ndo_get_vf_count
From: Phil Sutter @ 2017-01-05 19:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20170105190913.29986-1-phil@nwl.cc>

The idea is to allow drivers to implement this callback in order to
provide a custom way to return the number of virtual functions present
on the device.

Signed-off-by: Phil Sutter <phil@nwl.cc>
---
Changes since v5:
- Introduced this patch.
---
 include/linux/netdevice.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ecd78b3c9abad..a04a693f55065 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -964,6 +964,10 @@ struct netdev_xdp {
  *      with PF and querying it may introduce a theoretical security risk.
  * int (*ndo_set_vf_rss_query_en)(struct net_device *dev, int vf, bool setting);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ * int (*ndo_get_vf_count)(const struct net_device *dev);
+ *	Return the number of VFs present on this device instead of having
+ *	rtnetlink use pci_num_vf() on the PCI parent device.
+ *
  * int (*ndo_setup_tc)(struct net_device *dev, u8 tc)
  * 	Called to setup 'tc' number of traffic classes in the net device. This
  * 	is always called from the stack with the rtnl lock held and netif tx
@@ -1218,6 +1222,7 @@ struct net_device_ops {
 	int			(*ndo_set_vf_rss_query_en)(
 						   struct net_device *dev,
 						   int vf, bool setting);
+	int			(*ndo_get_vf_count)(const struct net_device *dev);
 	int			(*ndo_setup_tc)(struct net_device *dev,
 						u32 handle,
 						__be16 protocol,
-- 
2.11.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox