Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] tg3: Avoid NULL pointer dereference in tg3_get_nstats()
From: David Miller @ 2017-01-05 20:17 UTC (permalink / raw)
  To: michael.chan; +Cc: wangyufen, siva.kallam, prashant, mchan, netdev
In-Reply-To: <CACKFLimPkD6hxECA+ZhH+7BVmVSoJ1GfAMZyJ7S50CbhiuC0mA@mail.gmail.com>

From: Michael Chan <michael.chan@broadcom.com>
Date: Thu, 5 Jan 2017 12:04:13 -0800

> But it looks like ndo_get_stats() can be called without rtnl lock from
> net-procfs.c.  So it is possible that we'll read tp->hw_stats after it
> has been freed.  For example, if we are reading /proc/net/dev and
> closing tg3 at the same time.  David, is not taking rtnl_lock in
> net-procfs.c by design?

Probably not, that dev_get_stats() call probably should be surrounded
by RTNL protection.

Doing a quick grep on dev_get_stats() shows other call sites, most of
which are using it to fetch slave device statistics from the get stats
method of the parent.  Which should be ok.

It appears that the vlan procfs code in net/8021q/vlanproc.c has a
similar bug as net/core/net-procfs.c

Maybe net/core/net-sysfs.c has the same issue as well, and perhaps also
net/openvswitch/vport.c:ovs_vport_get_stats().

^ permalink raw reply

* Re: [PATCH v1 1/2] bpf: add a longest prefix match trie map implementation
From: Daniel Borkmann @ 2017-01-05 20:20 UTC (permalink / raw)
  To: Daniel Mack, ast; +Cc: dh.herrmann, netdev, davem
In-Reply-To: <91fbf150-4e97-7400-90cf-dbe7ea3e6597@zonque.org>

Hi Daniel,

On 01/05/2017 09:04 PM, Daniel Mack wrote:
> On 01/05/2017 05:25 PM, Daniel Borkmann wrote:
>> On 12/29/2016 06:28 PM, Daniel Mack wrote:
>
>>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>>> new file mode 100644
>>> index 0000000..8b6a61d
>>> --- /dev/null
>>> +++ b/kernel/bpf/lpm_trie.c
>
> [..]
>
>>> +static struct bpf_map *trie_alloc(union bpf_attr *attr)
>>> +{
>>> +	struct lpm_trie *trie;
>>> +
>>> +	/* check sanity of attributes */
>>> +	if (attr->max_entries == 0 || attr->map_flags ||
>>> +	    attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1   ||
>>> +	    attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 ||
>>> +	    attr->value_size != sizeof(u64))
>>> +		return ERR_PTR(-EINVAL);
>>
>> The correct attr->map_flags test here would need to be ...
>>
>>     attr->map_flags != BPF_F_NO_PREALLOC
>>
>> ... since in this case we don't have any prealloc pool, and
>> should that come one day that test could be relaxed again.
>>
>>> +	trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN);
>>> +	if (!trie)
>>> +		return NULL;
>>> +
>>> +	/* copy mandatory map attributes */
>>> +	trie->map.map_type = attr->map_type;
>>> +	trie->map.key_size = attr->key_size;
>>> +	trie->map.value_size = attr->value_size;
>>> +	trie->map.max_entries = attr->max_entries;
>>
>> You also need to fill in trie->map.pages as that is eventually
>> used to charge memory against in bpf_map_charge_memlock(), right
>> now that would remain as 0 meaning the map is not accounted for.
>
> Hmm, okay. The nodes are, however, allocated dynamically at runtime in
> this case. That means that we have trie->map.pages on each allocation,
> right?

The current scheme (f.e. htab_map_alloc() has some details, although
probably not too obvious) that was done charges worst-case cost up front,
so it would be in trie_alloc() where you fill map.pages and map_create()
will later account for them.

Thanks,
Daniel

^ permalink raw reply

* RE: [PATCH net] hyper-v: Add myself as additional MAINTAINER
From: KY Srinivasan @ 2017-01-05 20:34 UTC (permalink / raw)
  To: gregkh@linuxfoundation.org
  Cc: Stephen Hemminger, davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Stephen Hemminger
In-Reply-To: <20170105200911.GA6579@kroah.com>



> -----Original Message-----
> From: gregkh@linuxfoundation.org [mailto:gregkh@linuxfoundation.org]
> Sent: Thursday, January 5, 2017 12:09 PM
> To: KY Srinivasan <kys@microsoft.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>;
> davem@davemloft.net; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org; Stephen Hemminger <sthemmin@microsoft.com>
> Subject: Re: [PATCH net] hyper-v: Add myself as additional MAINTAINER
> 
> On Thu, Jan 05, 2017 at 07:08:23PM +0000, KY Srinivasan wrote:
> >
> >
> > > -----Original Message-----
> > > From: gregkh@linuxfoundation.org [mailto:gregkh@linuxfoundation.org]
> > > Sent: Thursday, January 5, 2017 10:29 AM
> > > To: KY Srinivasan <kys@microsoft.com>
> > > Cc: Stephen Hemminger <stephen@networkplumber.org>;
> > > davem@davemloft.net; netdev@vger.kernel.org; linux-
> > > kernel@vger.kernel.org; Stephen Hemminger
> <sthemmin@microsoft.com>
> > > Subject: Re: [PATCH net] hyper-v: Add myself as additional MAINTAINER
> > >
> > > On Thu, Jan 05, 2017 at 05:43:04PM +0000, KY Srinivasan wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > > Sent: Thursday, January 5, 2017 9:36 AM
> > > > > To: davem@davemloft.net; KY Srinivasan <kys@microsoft.com>
> > > > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > gregkh@linuxfoundation.org; Stephen Hemminger
> > > > > <sthemmin@microsoft.com>
> > > > > Subject: [PATCH net] hyper-v: Add myself as additional MAINTAINER
> > > > >
> > > > > Update the Hyper-V MAINTAINERS to include myself.
> > > > >
> > > > > Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> > > >
> > > > Acked-by: K. Y. Srinivasan <kys@microsoft.com>
> > >
> > > Thanks, will go queue this up now.
> >
> > Thanks Greg. On a different note, there are a bunch of Hyper-V specific
> > patches that have been submitted over the last month or so that have not
> > been committed. Should I resend them.
> 
> Nope, they are still in my mbox, I'm just going through stuff that has
> to be in 4.10-final at the moment, give me another week or so to catch
> up on all the new stuff for 4.11-rc1.

Thanks Greg.

K. Y
> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH net-next 01/10] net: netcp: ethss: add support of subsystem register region regmap
From: Murali Karicheri @ 2017-01-05 20:42 UTC (permalink / raw)
  To: Rob Herring
  Cc: netdev, linux-omap, grygorii.strashko, mugunthanvnm, linux-kernel,
	arnd, davem, devicetree, mark.rutland
In-Reply-To: <20161222212409.arhurpl4bpqf2yw6@rob-hp-laptop>

Rob,

On 12/22/2016 04:24 PM, Rob Herring wrote:
> On Tue, Dec 20, 2016 at 05:09:44PM -0500, Murali Karicheri wrote:
>> From: WingMan Kwok <w-kwok2@ti.com>
>>
>> 10gbe phy driver needs to access the 10gbe subsystem control
>> register during phy initialization. To facilitate the shared
>> access of the subsystem register region between the 10gbe Ethernet
>> driver and the phy driver, this patch adds support of the
>> subsystem register region defined by a syscon node in the dts.
>>
>> Although there is no shared access to the gbe subsystem register
>> region, using syscon for that is for the sake of consistency.
>>
>> This change is backward compatible with previously released gbe
>> devicetree bindings.
>>
>> Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
>> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
>> Signed-off-by: Sekhar Nori <nsekhar@ti.com>
>> ---
>>  .../devicetree/bindings/net/keystone-netcp.txt     |  16 ++-
>>  drivers/net/ethernet/ti/netcp_ethss.c              | 140 +++++++++++++++++----
>>  2 files changed, 127 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/devicetree/bindings/net/keystone-netcp.txt b/Documentation/devicetree/bindings/net/keystone-netcp.txt
>> index 04ba1dc..0854a73 100644
>> --- a/Documentation/devicetree/bindings/net/keystone-netcp.txt
>> +++ b/Documentation/devicetree/bindings/net/keystone-netcp.txt
>> @@ -72,20 +72,24 @@ Required properties:
>>  		"ti,netcp-gbe-2" for 1GbE N NetCP 1.5 (N=2)
>>  		"ti,netcp-xgbe" for 10 GbE
>>  
>> +- syscon-subsys:	phandle to syscon node of the switch
>> +			subsystem registers.
>> +
>>  - reg:		register location and the size for the following register
>>  		regions in the specified order.
>>  		- switch subsystem registers
>> +		- sgmii module registers
> 
> This needs to go on the end of the list. Otherwise, it is not backwards 
> compatible.

Thanks for your review! I assumed backward compatibility means new kernel
should work with old DTB. The driver code is adjusted to work with both
DTBs. Isn't that enough?

Murali

> 
>>  		- sgmii port3/4 module registers (only for NetCP 1.4)
>>  		- switch module registers
>>  		- serdes registers (only for 10G)
>>  
>>  		NetCP 1.4 ethss, here is the order
>> -			index #0 - switch subsystem registers
>> +			index #0 - sgmii module registers
>>  			index #1 - sgmii port3/4 module registers
>>  			index #2 - switch module registers
>>  
>>  		NetCP 1.5 ethss 9 port, 5 port and 2 port
>> -			index #0 - switch subsystem registers
>> +			index #0 - sgmii module registers
>>  			index #1 - switch module registers
>>  			index #2 - serdes registers
>>  
>> @@ -145,6 +149,11 @@ Optional properties:
>>  
>>  Example binding:
>>  
>> +gbe_subsys: subsys@2090000 {
>> +	compatible = "syscon";
>> +	reg = <0x02090000 0x100>;
>> +};
>> +
>>  netcp: netcp@2000000 {
>>  	reg = <0x2620110 0x8>;
>>  	reg-names = "efuse";
>> @@ -163,7 +172,8 @@ netcp: netcp@2000000 {
>>  		ranges;
>>  		gbe@90000 {
>>  			label = "netcp-gbe";
>> -			reg = <0x90000 0x300>, <0x90400 0x400>, <0x90800 0x700>;
>> +			syscon-subsys = <&gbe_subsys>;
>> +			reg = <0x90100 0x200>, <0x90400 0x200>, <0x90800 0x700>;
>>  			/* enable-ale; */
>>  			tx-queue = <648>;
>>  			tx-channel = <8>;
> 


-- 
Murali Karicheri
Linux Kernel, Keystone

^ permalink raw reply

* [PATCH] net: phy: dp83867: fix irq generation
From: Grygorii Strashko @ 2017-01-05 20:48 UTC (permalink / raw)
  To: Florian Fainelli, netdev, Dan Murphy, Mugunthan V N
  Cc: Grygorii Strashko, linux-omap, Sekhar Nori, linux-kernel,
	linux-arm-kernel

For proper IRQ generation by DP83867 phy the INT/PWDN pin has to be
programmed as an interrupt output instead of a Powerdown input in
Configuration Register 3 (CFG3), Address 0x001E, bit 7 INT_OE = 1. The
current driver doesn't do this and as result IRQs will not be generated by
DP83867 phy even if they are properly configured in DT.

Hence, fix IRQ generation by properly configuring CFG3.INT_OE bit and
ensure that Link Status Change (LINK_STATUS_CHNG_INT) and Auto-Negotiation
Complete (AUTONEG_COMP_INT) interrupt are enabled. After this the DP83867
driver will work properly in interrupt enabled mode.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 drivers/net/phy/dp83867.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
index 1b63924..e84ae08 100644
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@@ -29,6 +29,7 @@
 #define MII_DP83867_MICR	0x12
 #define MII_DP83867_ISR		0x13
 #define DP83867_CTRL		0x1f
+#define DP83867_CFG3		0x1e
 
 /* Extended Registers */
 #define DP83867_RGMIICTL	0x0032
@@ -98,6 +99,8 @@ static int dp83867_config_intr(struct phy_device *phydev)
 		micr_status |=
 			(MII_DP83867_MICR_AN_ERR_INT_EN |
 			MII_DP83867_MICR_SPEED_CHNG_INT_EN |
+			MII_DP83867_MICR_AUTONEG_COMP_INT_EN |
+			MII_DP83867_MICR_LINK_STS_CHNG_INT_EN |
 			MII_DP83867_MICR_DUP_MODE_CHNG_INT_EN |
 			MII_DP83867_MICR_SLEEP_MODE_CHNG_INT_EN);
 
@@ -214,6 +217,13 @@ static int dp83867_config_init(struct phy_device *phydev)
 		}
 	}
 
+	/* Enable Interrupt output INT_OE in CFG3 register */
+	if (phy_interrupt_is_valid(phydev)) {
+		val = phy_read(phydev, DP83867_CFG3);
+		val |= BIT(7);
+		phy_write(phydev, DP83867_CFG3, val);
+	}
+
 	return 0;
 }
 
-- 
2.10.1.dirty

^ permalink raw reply related

* Re: [PATCH] net: stmmac: fix maxmtu assignment to be within valid range
From: Andy Shevchenko @ 2017-01-05 21:06 UTC (permalink / raw)
  To: Kweh, Hock Leong
  Cc: David S. Miller, Joao Pinto, Giuseppe CAVALLARO,
	seraphin.bonnaffe, Jarod Wilson, Alexandre TORGUE,
	Joachim Eastwood, Niklas Cassel, Johan Hovold, Pavel Machek,
	lars.persson, netdev, LKML
In-Reply-To: <1483613277-6005-1-git-send-email-hock.leong.kweh@intel.com>

On Thu, Jan 5, 2017 at 12:47 PM, Kweh, Hock Leong
<hock.leong.kweh@intel.com> wrote:
> From: "Kweh, Hock Leong" <hock.leong.kweh@intel.com>
>
> There is no checking valid value of maxmtu when getting it from devicetree.

'Device Tree' or 'device tree' ?

> This resolution added the checking condition to ensure the assignment is
> made within a valid range.

> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 39eb7a6..683d59f 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -3319,7 +3319,8 @@ int stmmac_dvr_probe(struct device *device,
>                 ndev->max_mtu = JUMBO_LEN;
>         else
>                 ndev->max_mtu = SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN);
> -       if (priv->plat->maxmtu < ndev->max_mtu)
> +       if ((priv->plat->maxmtu < ndev->max_mtu) &&
> +           (priv->plat->maxmtu >= ndev->min_mtu))
>                 ndev->max_mtu = priv->plat->maxmtu;

Perhaps add a warning message on else branch?

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* Re: [PATCH 2/2] isdn: i4l: move active-isdn drivers to staging
From: Greg Kroah-Hartman @ 2017-01-05 21:19 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: devel, Karsten Keil, linux-doc, netdev, Jonathan Corbet,
	linux-kernel, David S. Miller, linux-arm-kernel
In-Reply-To: <3286019.I2UkVCJq41@wuerfel>

On Tue, Jan 03, 2017 at 10:19:29PM +0100, Arnd Bergmann wrote:
> On Tuesday, January 3, 2017 4:24:36 PM CET Greg Kroah-Hartman wrote:
> > On Wed, Mar 02, 2016 at 08:06:46PM +0100, Arnd Bergmann wrote:
> > > The icn, act2000 and pcbit drivers are all for very old hardware,
> > > and it is highly unlikely that anyone is actually still using them
> > > on modern kernels, if at all.
> > > 
> > > All three drivers apparently are for hardware that predates PCI
> > > being the common connector, as they are ISA-only and active
> > > PCI ISDN cards were widely available in the 1990s.
> > > 
> > > Looking through the git logs, it I cannot find any indication of a
> > > patch to any of these drivers that has been tested on real hardware,
> > > only cleanups or global API changes.
> > > 
> > > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > > Acked-by: Karsten Keil <isdn@linux-pingi.de>
> > 
> > This patch got added in the 4.6 kernel release.  As I am now taking
> > patches for 4.11-rc1, I figure it is time to just delete the
> > drivers/staging/i4l/ directory now, given that no one has really done
> > anything with it.  If people show up that wish to maintain it, I'll be
> > glad to revert it, or if someone really screams in the next week.
> > Otherwise it's time to just move on 
> 
> Sounds good to me.

Ok, now deleted!

thanks,

greg k-h

^ permalink raw reply

* TCP using IPv4-mapped IPv6 address as source
From: Jonathan T. Leighton @ 2017-01-05 21:25 UTC (permalink / raw)
  To: netdev

I've observed TCP using an IPv4-mapped IPv6 address as the source 
address, which I believe contradicts 
https://tools.ietf.org/html/rfc6890#page-14 (BCP 153). This occurs when 
an IPv6 TCP socket, bound to a local IPv4-mapped IPv6 address, attempts 
to connect to a remote IPv6 address. Presumable connect() should return 
EAFNOSUPPORT in this case. Please advise me if this is not to 
appropriate list to report this.

$ uname -a
Linux ubuntu 4.4.0-57-generic #78-Ubuntu SMP Fri Dec 9 23:50:32 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux

Best regards,
Jon

^ permalink raw reply

* [RFC PATCH] tcp: accept RST for rcv_nxt - 1 after receiving a FIN
From: Jason Baron @ 2017-01-05 21:33 UTC (permalink / raw)
  To: netdev

Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab'), are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then does not reply with any RST as would be
expected on a closed socket.

This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.

I also found a posting that the iOS client behaves in a similar manner here
(it will send a FIN followed by a RST for rcv_nxt - 1):
https://www.snellman.net/blog/archive/2016-02-01-tcp-rst/

A packetdrill test demonstrating the behavior.

// testing mac osx rst behavior

// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4

// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768

// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768

// make sure we are in TCP_CLOSE
0.400 %{
assert tcpi_state == 7
}%

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 net/ipv4/tcp_input.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ec6d84363024..373bea05c93b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5249,6 +5249,24 @@ static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
 	return err;
 }
 
+/* Accept RST for rcv_nxt - 1 after a FIN.
+ * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
+ * FIN is sent followed by a RST packet. The RST is sent with the same
+ * sequence number as the FIN, and thus according to RFC 5961 a challenge
+ * ACK should be sent. However, Mac OSX does not reply to the challenge ACK
+ * with a RST on the closed socket, hence accept this class of RSTs.
+ */
+static bool tcp_reset_check(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	return unlikely((TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1)) &&
+			(TCP_SKB_CB(skb)->end_seq == (tp->rcv_nxt - 1))	&&
+			(sk->sk_state == TCP_CLOSE_WAIT ||
+			 sk->sk_state == TCP_LAST_ACK ||
+			 sk->sk_state == TCP_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5287,6 +5305,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 						  LINUX_MIB_TCPACKSKIPPEDSEQ,
 						  &tp->last_oow_ack_time))
 				tcp_send_dupack(sk, skb);
+		} else if (tcp_reset_check(sk, skb)) {
+			tcp_reset(sk);
 		}
 		goto discard;
 	}
@@ -5300,7 +5320,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		 * else
 		 *     Send a challenge ACK
 		 */
-		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+		    tcp_reset_check(sk, skb)) {
 			rst_seq_match = true;
 		} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
 			struct tcp_sack_block *sp = &tp->selective_acks[0];
-- 
2.6.1

^ permalink raw reply related

* Re: [Intel-wired-lan] [net-next PATCH v2 2/6] i40e: Introduce VF Port Representator(VFPR) netdevs.
From: Jeff Kirsher @ 2017-01-05 21:46 UTC (permalink / raw)
  To: Sridhar Samudrala, alexander.h.duyck, john.r.fastabend,
	anjali.singhai, jakub.kicinski, davem, intel-wired-lan, netdev
In-Reply-To: <1483466874-2962-3-git-send-email-sridhar.samudrala@intel.com>

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

On Tue, 2017-01-03 at 10:07 -0800, Sridhar Samudrala wrote:
> VF Port Representator netdevs are created for each VF if the switch mode
> is set to 'switchdev'. These netdevs can be used to control and configure
> VFs from PFs namespace. They enable exposing VF statistics, configure and
> monitor link state, mtu, filters, fdb/vlan entries etc. of VFs.
> Broadcast filters are not enabled in switchdev mode.
> 
> Sample script to create VF port representors
> # rmmod i40e; modprobe i40e
> # devlink dev eswitch set pci/0000:05:00.0 mode switchdev
> # echo 2 > /sys/class/net/enp5s0f0/device/sriov_numvfs
> # ip l show
> 297: enp5s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop portid
> 6805ca2e7268 state DOWN mode DEFAULT group default qlen 1000
>      link/ether 68:05:ca:2e:72:68 brd ff:ff:ff:ff:ff:ff
>      vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto,
> trust off
>      vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto,
> trust off
> 299: enp5s0f0-vf0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> mode DEFAULT group default qlen 1000
>      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
> 300: enp5s0f0-vf1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> mode DEFAULT group default qlen 1000
>      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c        |  21 ++-
>  drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 154
> ++++++++++++++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |  14 ++
>  3 files changed, 182 insertions(+), 7 deletions(-)

This does not apply cleanly because it is based on an older version of
i40e_virtchnl_pf.c file.  It appears that i40e has been updated to use
"i40e_add_filter()" yet your patch still uses "i40e_add_mac_filter()".

We need to clarify what the "right way" is to add filters and use the
correct function.

Dropping this series and will await v3, please address the other feedback
from Or Gerlitz and Jiri Pirko as well in your updated series.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH] tg3: Avoid NULL pointer dereference in tg3_get_nstats()
From: Michael Chan @ 2017-01-05 21:53 UTC (permalink / raw)
  To: David Miller
  Cc: wangyufen, Siva Reddy Kallam, prashant.sreedharan@broadcom.com,
	michael.chan@broadcom.com, Netdev
In-Reply-To: <20170105.151705.2187713228201334532.davem@davemloft.net>

On Thu, Jan 5, 2017 at 12:17 PM, David Miller <davem@davemloft.net> wrote:
> From: Michael Chan <michael.chan@broadcom.com>
> Date: Thu, 5 Jan 2017 12:04:13 -0800
>
>> But it looks like ndo_get_stats() can be called without rtnl lock from
>> net-procfs.c.  So it is possible that we'll read tp->hw_stats after it
>> has been freed.  For example, if we are reading /proc/net/dev and
>> closing tg3 at the same time.  David, is not taking rtnl_lock in
>> net-procfs.c by design?
>
> Probably not, that dev_get_stats() call probably should be surrounded
> by RTNL protection.
>
> Doing a quick grep on dev_get_stats() shows other call sites, most of
> which are using it to fetch slave device statistics from the get stats
> method of the parent.  Which should be ok.
>
> It appears that the vlan procfs code in net/8021q/vlanproc.c has a
> similar bug as net/core/net-procfs.c
>
> Maybe net/core/net-sysfs.c has the same issue as well, and perhaps also
> net/openvswitch/vport.c:ovs_vport_get_stats().
>

OK.  I will send a patch later today to add rtnl_lock to these callers.

^ permalink raw reply

* Re: [PATCH v4] net: ethernet: faraday: To support device tree usage.
From: Arnd Bergmann @ 2017-01-05 21:55 UTC (permalink / raw)
  To: Greentime Hu
  Cc: f.fainelli, netdev, devicetree, andrew, linux-kernel, jiri,
	jonas.jensen, davem
In-Reply-To: <20170105102345.GA25774@app09>

On Thursday, January 5, 2017 6:23:53 PM CET Greentime Hu wrote:
> Signed-off-by: Greentime Hu <green.hu@gmail.com>
> ---
> Changes in v4:
>   - Use the same binding document to describe the same faraday ethernet controller and add faraday to vendor-prefixes.txt.
> Changes in v3:
>   - Nothing changed in this patch but I have committed andestech to vendor-prefixes.txt.
> Changes in v2:
>   - Change atmac100_of_ids to ftmac100_of_ids
> 

The patch looks good to me now, but please add a proper commit log
before your Signed-off-by tag.

	Arnd

^ permalink raw reply

* Re: [PATCH net-next 01/10] net: netcp: ethss: add support of subsystem register region regmap
From: Murali Karicheri @ 2017-01-05 22:08 UTC (permalink / raw)
  To: Rob Herring
  Cc: netdev, linux-omap, grygorii.strashko, mugunthanvnm, linux-kernel,
	arnd, davem, devicetree, mark.rutland
In-Reply-To: <586EAFBE.1000500@ti.com>

On 01/05/2017 03:42 PM, Murali Karicheri wrote:
> Rob,
> 
> On 12/22/2016 04:24 PM, Rob Herring wrote:
>> On Tue, Dec 20, 2016 at 05:09:44PM -0500, Murali Karicheri wrote:
>>> From: WingMan Kwok <w-kwok2@ti.com>
>>>
>>> 10gbe phy driver needs to access the 10gbe subsystem control
>>> register during phy initialization. To facilitate the shared
>>> access of the subsystem register region between the 10gbe Ethernet
>>> driver and the phy driver, this patch adds support of the
>>> subsystem register region defined by a syscon node in the dts.
>>>
>>> Although there is no shared access to the gbe subsystem register
>>> region, using syscon for that is for the sake of consistency.
>>>
>>> This change is backward compatible with previously released gbe
>>> devicetree bindings.
>>>
>>> Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
>>> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
>>> Signed-off-by: Sekhar Nori <nsekhar@ti.com>
>>> ---
>>>  .../devicetree/bindings/net/keystone-netcp.txt     |  16 ++-
>>>  drivers/net/ethernet/ti/netcp_ethss.c              | 140 +++++++++++++++++----
>>>  2 files changed, 127 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/Documentation/devicetree/bindings/net/keystone-netcp.txt b/Documentation/devicetree/bindings/net/keystone-netcp.txt
>>> index 04ba1dc..0854a73 100644
>>> --- a/Documentation/devicetree/bindings/net/keystone-netcp.txt
>>> +++ b/Documentation/devicetree/bindings/net/keystone-netcp.txt
>>> @@ -72,20 +72,24 @@ Required properties:
>>>  		"ti,netcp-gbe-2" for 1GbE N NetCP 1.5 (N=2)
>>>  		"ti,netcp-xgbe" for 10 GbE
>>>  
>>> +- syscon-subsys:	phandle to syscon node of the switch
>>> +			subsystem registers.
>>> +
>>>  - reg:		register location and the size for the following register
>>>  		regions in the specified order.
>>>  		- switch subsystem registers
>>> +		- sgmii module registers
>>
>> This needs to go on the end of the list. Otherwise, it is not backwards 
>> compatible.
> 
> Thanks for your review! I assumed backward compatibility means new kernel
> should work with old DTB. The driver code is adjusted to work with both
> DTBs. Isn't that enough?
Rob,

I will pull out 1/10 and 2/10 from the series as it needs more work and
re-submit rest of the patches in my v1 spin. However please reply to my
above backward compatibility question.

Thanks and Regards,

Murali
> 
> Murali
> 
>>
>>>  		- sgmii port3/4 module registers (only for NetCP 1.4)
>>>  		- switch module registers
>>>  		- serdes registers (only for 10G)
>>>  
>>>  		NetCP 1.4 ethss, here is the order
>>> -			index #0 - switch subsystem registers
>>> +			index #0 - sgmii module registers
>>>  			index #1 - sgmii port3/4 module registers
>>>  			index #2 - switch module registers
>>>  
>>>  		NetCP 1.5 ethss 9 port, 5 port and 2 port
>>> -			index #0 - switch subsystem registers
>>> +			index #0 - sgmii module registers
>>>  			index #1 - switch module registers
>>>  			index #2 - serdes registers
>>>  
>>> @@ -145,6 +149,11 @@ Optional properties:
>>>  
>>>  Example binding:
>>>  
>>> +gbe_subsys: subsys@2090000 {
>>> +	compatible = "syscon";
>>> +	reg = <0x02090000 0x100>;
>>> +};
>>> +
>>>  netcp: netcp@2000000 {
>>>  	reg = <0x2620110 0x8>;
>>>  	reg-names = "efuse";
>>> @@ -163,7 +172,8 @@ netcp: netcp@2000000 {
>>>  		ranges;
>>>  		gbe@90000 {
>>>  			label = "netcp-gbe";
>>> -			reg = <0x90000 0x300>, <0x90400 0x400>, <0x90800 0x700>;
>>> +			syscon-subsys = <&gbe_subsys>;
>>> +			reg = <0x90100 0x200>, <0x90400 0x200>, <0x90800 0x700>;
>>>  			/* enable-ale; */
>>>  			tx-queue = <648>;
>>>  			tx-channel = <8>;
>>
> 
> 


-- 
Murali Karicheri
Linux Kernel, Keystone

^ permalink raw reply

* Re: [PATCH] net: phy: dp83867: fix irq generation
From: Florian Fainelli @ 2017-01-05 22:10 UTC (permalink / raw)
  To: Grygorii Strashko, netdev, Dan Murphy, Mugunthan V N
  Cc: linux-omap, Sekhar Nori, linux-kernel, linux-arm-kernel
In-Reply-To: <20170105204807.25990-1-grygorii.strashko@ti.com>

On 01/05/2017 12:48 PM, Grygorii Strashko wrote:
> For proper IRQ generation by DP83867 phy the INT/PWDN pin has to be
> programmed as an interrupt output instead of a Powerdown input in
> Configuration Register 3 (CFG3), Address 0x001E, bit 7 INT_OE = 1. The
> current driver doesn't do this and as result IRQs will not be generated by
> DP83867 phy even if they are properly configured in DT.
> 
> Hence, fix IRQ generation by properly configuring CFG3.INT_OE bit and
> ensure that Link Status Change (LINK_STATUS_CHNG_INT) and Auto-Negotiation
> Complete (AUTONEG_COMP_INT) interrupt are enabled. After this the DP83867
> driver will work properly in interrupt enabled mode.
> 
> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
> ---
>  drivers/net/phy/dp83867.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
> index 1b63924..e84ae08 100644
> --- a/drivers/net/phy/dp83867.c
> +++ b/drivers/net/phy/dp83867.c
> @@ -29,6 +29,7 @@
>  #define MII_DP83867_MICR	0x12
>  #define MII_DP83867_ISR		0x13
>  #define DP83867_CTRL		0x1f
> +#define DP83867_CFG3		0x1e
>  
>  /* Extended Registers */
>  #define DP83867_RGMIICTL	0x0032
> @@ -98,6 +99,8 @@ static int dp83867_config_intr(struct phy_device *phydev)
>  		micr_status |=
>  			(MII_DP83867_MICR_AN_ERR_INT_EN |
>  			MII_DP83867_MICR_SPEED_CHNG_INT_EN |
> +			MII_DP83867_MICR_AUTONEG_COMP_INT_EN |
> +			MII_DP83867_MICR_LINK_STS_CHNG_INT_EN |
>  			MII_DP83867_MICR_DUP_MODE_CHNG_INT_EN |
>  			MII_DP83867_MICR_SLEEP_MODE_CHNG_INT_EN);
>  
> @@ -214,6 +217,13 @@ static int dp83867_config_init(struct phy_device *phydev)
>  		}
>  	}
>  
> +	/* Enable Interrupt output INT_OE in CFG3 register */
> +	if (phy_interrupt_is_valid(phydev)) {
> +		val = phy_read(phydev, DP83867_CFG3);
> +		val |= BIT(7);
> +		phy_write(phydev, DP83867_CFG3, val);
> +	}

Don't you need to clear that bit in the case phy_interrupt_is_valid()
returns false?

Other than that:

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [net-next PATCH 5/6] i40e: Add TX and RX support in switchdev mode.
From: Samudrala, Sridhar @ 2017-01-05 22:32 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexander Duyck, John Fastabend, Anjali Singhai Jain,
	jakub.kicinski, intel-wired-lan, Linux Netdev List
In-Reply-To: <CAJ3xEMgAaXNbvaP0=GcuTsVOOGwPHuSMQf-SeLKpxB8WrbEvng@mail.gmail.com>



On 1/5/2017 3:50 AM, Or Gerlitz wrote:
> On Thu, Jan 5, 2017 at 12:46 AM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>>
>> On 1/3/2017 3:03 PM, Or Gerlitz wrote:
>>> On Fri, Dec 30, 2016 at 7:04 PM, Samudrala, Sridhar
>>> <sridhar.samudrala@intel.com> wrote:
>>>> On 12/30/2016 7:31 AM, Or Gerlitz wrote:
>>>>> Are you exposing switchdev ops for the representators? didn't see that
>>>>> or maybe it's in the 4th patch which didn't make it to the list?
>>>> Not at this time. In the future patches when we offload fdb/vlan
>>>> functionality, we could use switchdev ops.
>>> but wait, this is the switchdev mode... even before doing any
>>> offloading, you want (need) your representor netdevices to have the
>>> same HW ID marking they are all ports of the same ASIC, this you can
>>> do with the switchdev parent ID attribute.
>> OK. I will add switchdev_port_attr_get() with PORT_PARENT_ID support in v3.
> Good, I made this comment, b/c we want to create a well defined user-experience
> to be taken into account also by upper virtualization layers.
>
> Another piece there to add is have your VF reps implement the
> get_phys_port_name ndo,

It looks like you are returning the VF port number as phys_port_name() 
for a VF rep in en_rep.c.
Is this correct?

By default i am creating VFPR netdev with name as <pf_name>_VF<vf_num>
For ex; if enp5s0f0 is the pf name, VFPR netdev for VF0 will be enp5s0f0_vf0

If we want udev to follow this syntax should i return '_vf0'  as 
get_phys_port_name() for VF rep 0?

> where as we explain in commit cb67b832921cfa20ad79bafdc51f1745339d0557 is used
> as follows:
>
>      Port phys name (ndo_get_phys_port_name) is implemented to allow exporting
>      to user-space the VF vport number and along with the switchdev port parent
>      id (phys_switch_id) enable a udev base consistent naming scheme:
>
>      SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
>              ATTR{phys_port_name}!="", NAME="$PF_NIC$attr{phys_port_name}"
>
>      where phys_switch_id is exposed by the PF (and VF reps) and $PF_NIC is
>      the name of the PF netdevice.
>
> Or.

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: XDP support for adjust_head
From: John Fastabend @ 2017-01-05 22:57 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: john.r.fastabend, netdev, alexei.starovoitov, daniel
In-Reply-To: <20170104001332-mutt-send-email-mst@kernel.org>

On 17-01-03 02:16 PM, Michael S. Tsirkin wrote:
> On Tue, Jan 03, 2017 at 02:01:27PM +0800, Jason Wang wrote:
>>
>>
>> On 2017年01月03日 03:44, John Fastabend wrote:
>>> Add support for XDP adjust head by allocating a 256B header region
>>> that XDP programs can grow into. This is only enabled when a XDP
>>> program is loaded.
>>>
>>> In order to ensure that we do not have to unwind queue headroom push
>>> queue setup below bpf_prog_add. It reads better to do a prog ref
>>> unwind vs another queue setup call.
>>>
>>> : There is a problem with this patch as is. When xdp prog is loaded
>>>    the old buffers without the 256B headers need to be flushed so that
>>>    the bpf prog has the necessary headroom. This patch does this by
>>>    calling the virtqueue_detach_unused_buf() and followed by the
>>>    virtnet_set_queues() call to reinitialize the buffers. However I
>>>    don't believe this is safe per comment in virtio_ring this API
>>>    is not valid on an active queue and the only thing we have done
>>>    here is napi_disable/napi_enable wrappers which doesn't do anything
>>>    to the emulation layer.
>>>
>>>    So the RFC is really to find the best solution to this problem.
>>>    A couple things come to mind, (a) always allocate the necessary
>>>    headroom but this is a bit of a waste (b) add some bit somewhere
>>>    to check if the buffer has headroom but this would mean XDP programs
>>>    would be broke for a cycle through the ring, (c) figure out how
>>>    to deactivate a queue, free the buffers and finally reallocate.
>>>    I think (c) is the best choice for now but I'm not seeing the
>>>    API to do this so virtio/qemu experts anyone know off-hand
>>>    how to make this work? I started looking into the PCI callbacks
>>>    reset() and virtio_device_ready() or possibly hitting the right
>>>    set of bits with vp_set_status() but my first attempt just hung
>>>    the device.
>>
>> Hi John:
>>
>> AFAIK, disabling a specific queue was supported only by virtio 1.0 through
>> queue_enable field in pci common cfg.
> 
> In fact 1.0 only allows enabling queues selectively.
> We can add disabling by a spec enhancement but
> for now reset is the only way.
> 
> 
>> But unfortunately, qemu does not
>> emulate this at all and legacy device does not even support this. So the
>> safe way is probably reset the device and redo the initialization here.
> 
> You will also have to re-apply rx filtering if you do this.
> Probably sending notification uplink.
> 

The following seems to hang the device on the next virtnet_send_command()
I expected this to meet the reset requirements from the spec because I
believe its the same flow coming out of restore(). For a real patch we
don't actually need to kfree all the structs and reallocate them but
I was expecting the below to work. Any ideas/hints?

static int virtnet_xdp_reset(struct virtnet_info *vi)
{
        int i, ret;

        netif_device_detach(vi->dev);
        cancel_delayed_work_sync(&vi->refill);
        if (netif_running(vi->dev)) {
                for (i = 0; i < vi->max_queue_pairs; i++)
                        napi_disable(&vi->rq[i].napi);
        }

        remove_vq_common(vi, false);
        ret = init_vqs(vi);
        if (ret)
                return ret;
        virtio_device_ready(vi->vdev);

        if (netif_running(vi->dev)) {
                for (i = 0; i < vi->curr_queue_pairs; i++)
                        if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
                                schedule_delayed_work(&vi->refill, 0);

                for (i = 0; i < vi->max_queue_pairs; i++)
                        virtnet_napi_enable(&vi->rq[i]);
        }
        netif_device_attach(vi->dev);
        return 0;
}

^ permalink raw reply

* Re: [PATCH net-next v4 0/4] Fix OdroidC2 Gigabit Tx link issue
From: Russell King - ARM Linux @ 2017-01-05 23:25 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: devicetree, Alexandre TORGUE, Neil Armstrong, Martin Blumenstingl,
	netdev, Giuseppe Cavallaro, linux-kernel, Yegor Yefremov,
	Julia Lawall, Andre Roth, Andrew Lunn, Kevin Hilman, Carlo Caione,
	linux-amlogic, Andreas Färber, linux-arm-kernel,
	Jerome Brunet
In-Reply-To: <049b1efc-3bad-92e0-45ef-0563dc5d81de@gmail.com>

On Mon, Nov 28, 2016 at 09:54:28AM -0800, Florian Fainelli wrote:
> If we start supporting generic "enable", "disable" type of properties
> with values that map directly to register definitions of the HW, we
> leave too much room for these properties to be utilized to implement a
> specific policy, and this is not acceptable.

Another concern with this patch is that the existing phylib "set_eee"
code is horribly buggy - it just translates the modes from userspace
into the register value and writes them directly to the register with
no validation.  So it's possible to set modes in the register that the
hardware doesn't support, and have them advertised to the link partner.

I have a patch which fixes that, restricting (as we do elsewhere) the
advert according to the EEE supported capabilities retrieved from the
PCS - maybe the problem here is that the PCS doesn't support support
EEE in 1000baseT mode?

Out of interest, which PHY is used on this platform?

On the SolidRun boards, they're using AR8035, and have suffered this
occasional link drop problem.  What has been found is that it seems to
be to do with the timing parameters, and it seemed to only be 1000bT
that was affected.  I don't remember off hand exactly which or what
the change was they made to stabilise it though, but I can probabily
find out tomorrow.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply

* Re: [net-next PATCH v2 5/6] i40e: Add TX and RX support in switchdev mode.
From: Samudrala, Sridhar @ 2017-01-06  0:27 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, john.r.fastabend, anjali.singhai,
	jakub.kicinski, davem, intel-wired-lan, netdev
In-Reply-To: <20170105125656.GB2211@nanopsycho>



On 1/5/2017 4:56 AM, Jiri Pirko wrote:
> Tue, Jan 03, 2017 at 07:07:53PM CET, sridhar.samudrala@intel.com wrote:
>> In switchdev mode, broadcast filter is not enabled on VFs. The broadcasts and
>> unknown frames from VFs are received by the PF and passed to corresponding VF
>> port representator netdev.
>> A host based switching entity like a linux bridge or OVS redirects these frames
>> to the right VFs via VFPR netdevs. Any frames sent via VFPR netdevs are sent as
>> directed transmits to the corresponding VFs. To enable directed transmit, skb
>> metadata dst is used to pass the VF id and the frame is requeued to call the PFs
>> transmit routine.
>>
>> Small script to demonstrate inter VF pings in switchdev mode.
>> PF: enp5s0f0, VFs: enp5s2,enp5s2f1 VFPRs:enp5s0f0-vf0, enp5s0f0-vf1
>>
>> # rmmod i40e; modprobe i40e
>> # devlink dev eswitch set pci/0000:05:00.0 mode switchdev
>> # echo 2 > /sys/class/net/enp5s0f0/device/sriov_numvfs
>> # ip link set enp5s0f0 vf 0 mac 00:11:22:33:44:55
>> # ip link set enp5s0f0 vf 1 mac 00:11:22:33:44:56
>> # rmmod i40evf; modprobe i40evf
>>
>> /* Create 2 namespaces and move the VFs to the corresponding ns. */
>> # ip netns add ns0
>> # ip link set enp5s2 netns ns0
>> # ip netns exec ns0 ip addr add 192.168.1.10/24 dev enp5s2
>> # ip netns exec ns0 ip link set enp5s2 up
>> # ip netns add ns1
>> # ip link set enp5s2f1 netns ns1
>> # ip netns exec ns1 ip addr add 192.168.1.11/24 dev enp5s2f1
>> # ip netns exec ns1 ip link set enp5s2f1 up
>>
>> /* bring up pf and vfpr netdevs */
>> # ip link set enp5s0f0 up
>> # ip link set enp5s0f0-vf0 up
>> # ip link set enp5s0f0-vf1 up
>>
>> /* Create a linux bridge and add vfpr netdevs to it. */
>> # ip link add vfpr-br type bridge
>> # ip link set enp5s0f0-vf0 master vfpr-br
>> # ip link set enp5s0f0-vf1 master vfpr-br
>> # ip addr add 192.168.1.1/24 dev vfpr-br
>> # ip link set vfpr-br up
>>
>> # ip netns exec ns0 ping -c3 192.168.1.11
>> # ip netns exec ns1 ping -c3 192.168.1.10
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> [...]
>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> index 352cf7c..b46ddaa 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
>> @@ -1176,16 +1176,37 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
>>   * @rx_ring:  rx ring in play
>>   * @skb: packet to send up
>>   * @vlan_tag: vlan tag for packet
>> + * @lpbk: is it a loopback frame?
>>   **/
>> static void i40e_receive_skb(struct i40e_ring *rx_ring,
>> -			     struct sk_buff *skb, u16 vlan_tag)
>> +			     struct sk_buff *skb, u16 vlan_tag, bool lpbk)
>> {
>> 	struct i40e_q_vector *q_vector = rx_ring->q_vector;
>> +	struct i40e_pf *pf = rx_ring->vsi->back;
>> +	struct i40e_vf *vf;
>> +	struct ethhdr *eth;
>> +	int vf_id;
>>
>> 	if ((rx_ring->netdev->features & NETIF_F_HW_VLAN_CTAG_RX) &&
>> 	    (vlan_tag & VLAN_VID_MASK))
>> 		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tag);
>>
>> +	if ((pf->eswitch_mode == DEVLINK_ESWITCH_MODE_LEGACY) || !lpbk)
>> +		goto gro_receive;
>> +
>> +	/* If a loopback packet is received from a VF in switchdev mode, pass the
>> +	 * frame to the corresponding VFPR netdev based on the source MAC in the frame.
>> +	 */
>> +	eth = (struct ethhdr *)skb_mac_header(skb);
>> +	for (vf_id = 0; vf_id < pf->num_alloc_vfs; vf_id++) {
>> +		vf = &pf->vf[vf_id];
>> +		if (ether_addr_equal(eth->h_source, vf->default_lan_addr.addr)) {
>> +			skb->dev = vf->vfpr_netdev;
> This sucks :( Is't there any identification coming from rx ring that
> would tell you what vf this is? To match vrpr according to a single MAC
> address seems a bit awkward. What if there is a macvlan configured
> on the VF?
Unfortunately, with the current HW, RX descriptor only indicates if it 
is a loopback packet from a VF,
but not the specific id of the VF.
Multiple macs on VF are not supported with the current patchset.
At this point we are not making switchdev as the default mode because of 
these limitations.

>
>
>
>> +			break;
>> +		}
>> +	}
>> +
>> +gro_receive:
>> 	napi_gro_receive(&q_vector->napi, skb);
>> }
>>
> [...]
>
>> @@ -2998,3 +3064,19 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
>>
>> 	return i40e_xmit_frame_ring(skb, tx_ring);
>> }
>> +
>> +netdev_tx_t i40e_vfpr_netdev_start_xmit(struct sk_buff *skb,
>> +				        struct net_device *dev)
>> +{
>> +	struct i40e_vfpr_netdev_priv *priv = netdev_priv(dev);
>> +	struct i40e_vf *vf = priv->vf;
>> +	struct i40e_pf *pf = vf->pf;
>> +	struct i40e_vsi *vsi = pf->vsi[pf->lan_vsi];
>> +
>> +	skb_dst_drop(skb);
>> +	dst_hold(&priv->vfpr_dst->dst);
>> +	skb_dst_set(skb, &priv->vfpr_dst->dst);
>> +	skb->dev = vsi->netdev;
> This dst dance seems a bit odd to me. Why don't you just call
> i40e_xmit_frame_ring with an extra arg holding the needed metadata?

We don't have TX/RX queues associated with VFPR netdevs, so we need to 
set the dev to PF netdev and requeue the skb.

>
>
>> +
>> +	return dev_queue_xmit(skb);
>> +}
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> index e065321..850723f 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>> @@ -366,6 +366,7 @@ struct i40e_ring_container {
>>
>> bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
>> netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
>> +netdev_tx_t i40e_vfpr_netdev_start_xmit(struct sk_buff *skb, struct net_device *netdev);
>> void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
>> void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
>> int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring);
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h
>> index edc0abd..c136cc9 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_type.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
>> @@ -728,6 +728,9 @@ enum i40e_rx_desc_status_bits {
>> #define I40E_RXD_QW1_STATUS_TSYNVALID_SHIFT  I40E_RX_DESC_STATUS_TSYNVALID_SHIFT
>> #define I40E_RXD_QW1_STATUS_TSYNVALID_MASK \
>> 				    BIT_ULL(I40E_RXD_QW1_STATUS_TSYNVALID_SHIFT)
>> +#define I40E_RXD_QW1_STATUS_LPBK_SHIFT  I40E_RX_DESC_STATUS_LPBK_SHIFT
>> +#define I40E_RXD_QW1_STATUS_LPBK_MASK \
>> +				BIT_ULL(I40E_RXD_QW1_STATUS_LPBK_SHIFT)
>>
>> enum i40e_rx_desc_fltstat_values {
>> 	I40E_RX_DESC_FLTSTAT_NO_DATA	= 0,
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> index 0c8687d..f0860ef 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
>> @@ -1062,6 +1062,7 @@ static int i40e_vfpr_netdev_stop(struct net_device *dev)
>> static const struct net_device_ops i40e_vfpr_netdev_ops = {
>> 	.ndo_open       	= i40e_vfpr_netdev_open,
>> 	.ndo_stop       	= i40e_vfpr_netdev_stop,
>> +	.ndo_start_xmit		= i40e_vfpr_netdev_start_xmit,
>> };
>>
>> /**
>> @@ -1121,16 +1122,22 @@ int i40e_alloc_vfpr_netdev(struct i40e_vf *vf, u16 vf_num)
>>
>> 	priv = netdev_priv(vfpr_netdev);
>> 	priv->vf = &(pf->vf[vf_num]);
>> +	priv->vfpr_dst = metadata_dst_alloc(0, METADATA_HW_PORT_MUX, GFP_KERNEL);
> I'm missing a patch here. In net-next, metadata_dst_alloc has 2 args.

Somehow that patch didn't make it to netdev. You can find it here on 
intel-wired-lan archives.
http://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20170102/007723.html
I will CC you when i submit V3.

^ permalink raw reply

* [PATCH iproute2 1/3] ip vrf: Fix run-on error message on mkdir failure
From: David Ahern @ 2017-01-06  0:22 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1483662143-15242-1-git-send-email-dsa@cumulusnetworks.com>

Andy reported a missing newline if a non-root user attempts to run
'ip vrf exec':

$ ./ip/ip vrf exec default /bin/echo asdf
mkdir failed for /var/run/cgroup2: Permission deniedFailed to setup vrf cgroup2 directory

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 lib/fs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/fs.c b/lib/fs.c
index 39cc96dccca9..644bb486ae8e 100644
--- a/lib/fs.c
+++ b/lib/fs.c
@@ -121,7 +121,7 @@ int make_path(const char *path, mode_t mode)
 
 			if (mkdir(dir, mode) != 0) {
 				fprintf(stderr,
-					"mkdir failed for %s: %s",
+					"mkdir failed for %s: %s\n",
 					dir, strerror(errno));
 				goto out;
 			}
-- 
2.1.4

^ permalink raw reply related

* [PATCH iproute2 3/3] ip vrf: Improve bpf error messages
From: David Ahern @ 2017-01-06  0:22 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1483662143-15242-1-git-send-email-dsa@cumulusnetworks.com>

Next up a non-root user gets various bpf related error messages:

$ ip vrf exec mgmt bash
Failed to load BPF prog: 'Operation not permitted'
Kernel compiled with CGROUP_BPF enabled?

Catch the EPERM error and do not show the kernel config option.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 ip/ipvrf.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/ip/ipvrf.c b/ip/ipvrf.c
index dc8364a43a57..8bd99d6251f2 100644
--- a/ip/ipvrf.c
+++ b/ip/ipvrf.c
@@ -181,7 +181,11 @@ static int vrf_configure_cgroup(const char *path, int ifindex)
 	if (prog_fd < 0) {
 		fprintf(stderr, "Failed to load BPF prog: '%s'\n",
 			strerror(errno));
-		fprintf(stderr, "Kernel compiled with CGROUP_BPF enabled?\n");
+
+		if (errno != EPERM) {
+			fprintf(stderr,
+				"Kernel compiled with CGROUP_BPF enabled?\n");
+		}
 		goto out;
 	}
 
-- 
2.1.4

^ permalink raw reply related

* [PATCH iproute2 0/3] ip vrf: minor error message cleanups
From: David Ahern @ 2017-01-06  0:22 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern

David Ahern (3):
  ip vrf: Fix error message when running exec as non-root user
  ip vrf: Improve error message for non-root user
  ip vrf: Clean up bpf related error messages

 ip/ipvrf.c |  6 +++++-
 lib/fs.c   | 16 ++++++++++++----
 2 files changed, 17 insertions(+), 5 deletions(-)

-- 
2.1.4

^ permalink raw reply

* [PATCH iproute2 2/3] ip vrf: Improve cgroup2 error messages
From: David Ahern @ 2017-01-06  0:22 UTC (permalink / raw)
  To: netdev, stephen; +Cc: David Ahern
In-Reply-To: <1483662143-15242-1-git-send-email-dsa@cumulusnetworks.com>

Currently, if a non-root user attempts to run ip vrf exec a non-helpful
error is returned:

$ ip vrf exec mgmt bash
Failed to mount cgroup2. Are CGROUPS enabled in your kernel?

Only show the CGROUPS kernel hint for the ENODEV error and for the
rest show the strerror for the errno. So now:

$ ip/ip vrf exec mgmt bash
Failed to mount cgroup2: Operation not permitted

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 lib/fs.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/lib/fs.c b/lib/fs.c
index 644bb486ae8e..12a4657a0bc9 100644
--- a/lib/fs.c
+++ b/lib/fs.c
@@ -80,13 +80,21 @@ char *find_cgroup2_mount(void)
 
 	if (mount("none", mnt, CGROUP2_FS_NAME, 0, NULL)) {
 		/* EBUSY means already mounted */
-		if (errno != EBUSY) {
+		if (errno == EBUSY)
+			goto out;
+
+		if (errno == ENODEV) {
 			fprintf(stderr,
 				"Failed to mount cgroup2. Are CGROUPS enabled in your kernel?\n");
-			free(mnt);
-			return NULL;
+		} else {
+			fprintf(stderr,
+				"Failed to mount cgroup2: %s\n",
+				strerror(errno));
 		}
+		free(mnt);
+		return NULL;
 	}
+out:
 	return mnt;
 }
 
-- 
2.1.4

^ permalink raw reply related

* Re: [RFC PATCH] virtio_net: XDP support for adjust_head
From: Michael S. Tsirkin @ 2017-01-06  0:39 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jason Wang, john.r.fastabend, netdev, alexei.starovoitov, daniel
In-Reply-To: <586ECF53.1070700@gmail.com>

On Thu, Jan 05, 2017 at 02:57:23PM -0800, John Fastabend wrote:
> On 17-01-03 02:16 PM, Michael S. Tsirkin wrote:
> > On Tue, Jan 03, 2017 at 02:01:27PM +0800, Jason Wang wrote:
> >>
> >>
> >> On 2017年01月03日 03:44, John Fastabend wrote:
> >>> Add support for XDP adjust head by allocating a 256B header region
> >>> that XDP programs can grow into. This is only enabled when a XDP
> >>> program is loaded.
> >>>
> >>> In order to ensure that we do not have to unwind queue headroom push
> >>> queue setup below bpf_prog_add. It reads better to do a prog ref
> >>> unwind vs another queue setup call.
> >>>
> >>> : There is a problem with this patch as is. When xdp prog is loaded
> >>>    the old buffers without the 256B headers need to be flushed so that
> >>>    the bpf prog has the necessary headroom. This patch does this by
> >>>    calling the virtqueue_detach_unused_buf() and followed by the
> >>>    virtnet_set_queues() call to reinitialize the buffers. However I
> >>>    don't believe this is safe per comment in virtio_ring this API
> >>>    is not valid on an active queue and the only thing we have done
> >>>    here is napi_disable/napi_enable wrappers which doesn't do anything
> >>>    to the emulation layer.
> >>>
> >>>    So the RFC is really to find the best solution to this problem.
> >>>    A couple things come to mind, (a) always allocate the necessary
> >>>    headroom but this is a bit of a waste (b) add some bit somewhere
> >>>    to check if the buffer has headroom but this would mean XDP programs
> >>>    would be broke for a cycle through the ring, (c) figure out how
> >>>    to deactivate a queue, free the buffers and finally reallocate.
> >>>    I think (c) is the best choice for now but I'm not seeing the
> >>>    API to do this so virtio/qemu experts anyone know off-hand
> >>>    how to make this work? I started looking into the PCI callbacks
> >>>    reset() and virtio_device_ready() or possibly hitting the right
> >>>    set of bits with vp_set_status() but my first attempt just hung
> >>>    the device.
> >>
> >> Hi John:
> >>
> >> AFAIK, disabling a specific queue was supported only by virtio 1.0 through
> >> queue_enable field in pci common cfg.
> > 
> > In fact 1.0 only allows enabling queues selectively.
> > We can add disabling by a spec enhancement but
> > for now reset is the only way.
> > 
> > 
> >> But unfortunately, qemu does not
> >> emulate this at all and legacy device does not even support this. So the
> >> safe way is probably reset the device and redo the initialization here.
> > 
> > You will also have to re-apply rx filtering if you do this.
> > Probably sending notification uplink.
> > 
> 
> The following seems to hang the device on the next virtnet_send_command()
> I expected this to meet the reset requirements from the spec because I
> believe its the same flow coming out of restore(). For a real patch we
> don't actually need to kfree all the structs and reallocate them but
> I was expecting the below to work. Any ideas/hints?

Restore assumes device was previously reset.
You want to combine freeze+restore.

> static int virtnet_xdp_reset(struct virtnet_info *vi)
> {
>         int i, ret;
> 
>         netif_device_detach(vi->dev);
>         cancel_delayed_work_sync(&vi->refill);
>         if (netif_running(vi->dev)) {
>                 for (i = 0; i < vi->max_queue_pairs; i++)
>                         napi_disable(&vi->rq[i].napi);
>         }
> 
>         remove_vq_common(vi, false);
>         ret = init_vqs(vi);
>         if (ret)
>                 return ret;
>         virtio_device_ready(vi->vdev);
> 
>         if (netif_running(vi->dev)) {
>                 for (i = 0; i < vi->curr_queue_pairs; i++)
>                         if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
>                                 schedule_delayed_work(&vi->refill, 0);
> 
>                 for (i = 0; i < vi->max_queue_pairs; i++)
>                         virtnet_napi_enable(&vi->rq[i]);
>         }
>         netif_device_attach(vi->dev);
>         return 0;
> }

^ permalink raw reply

* Re: [Intel-wired-lan] [net-next PATCH v2 2/6] i40e: Introduce VF Port Representator(VFPR) netdevs.
From: Samudrala, Sridhar @ 2017-01-06  0:42 UTC (permalink / raw)
  To: Jeff Kirsher, alexander.h.duyck, john.r.fastabend, anjali.singhai,
	jakub.kicinski, davem, intel-wired-lan, netdev
In-Reply-To: <1483652771.25700.41.camel@intel.com>



On 1/5/2017 1:46 PM, Jeff Kirsher wrote:
> On Tue, 2017-01-03 at 10:07 -0800, Sridhar Samudrala wrote:
>> VF Port Representator netdevs are created for each VF if the switch mode
>> is set to 'switchdev'. These netdevs can be used to control and configure
>> VFs from PFs namespace. They enable exposing VF statistics, configure and
>> monitor link state, mtu, filters, fdb/vlan entries etc. of VFs.
>> Broadcast filters are not enabled in switchdev mode.
>>
>> Sample script to create VF port representors
>> # rmmod i40e; modprobe i40e
>> # devlink dev eswitch set pci/0000:05:00.0 mode switchdev
>> # echo 2 > /sys/class/net/enp5s0f0/device/sriov_numvfs
>> # ip l show
>> 297: enp5s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop portid
>> 6805ca2e7268 state DOWN mode DEFAULT group default qlen 1000
>>       link/ether 68:05:ca:2e:72:68 brd ff:ff:ff:ff:ff:ff
>>       vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto,
>> trust off
>>       vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto,
>> trust off
>> 299: enp5s0f0-vf0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
>> mode DEFAULT group default qlen 1000
>>       link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
>> 300: enp5s0f0-vf1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
>> mode DEFAULT group default qlen 1000
>>       link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>>   drivers/net/ethernet/intel/i40e/i40e_main.c        |  21 ++-
>>   drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 154
>> ++++++++++++++++++++-
>>   drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |  14 ++
>>   3 files changed, 182 insertions(+), 7 deletions(-)
> This does not apply cleanly because it is based on an older version of
> i40e_virtchnl_pf.c file.  It appears that i40e has been updated to use
> "i40e_add_filter()" yet your patch still uses "i40e_add_mac_filter()".
I am not using i40e_add_mac_filter() in my patches. I only i40e_add_filter()
These patches are against davem's net-next kernel

>
> We need to clarify what the "right way" is to add filters and use the
> correct function.
>
> Dropping this series and will await v3, please address the other feedback
> from Or Gerlitz and Jiri Pirko as well in your updated series.

Sure. I will be submitting a v3 soon addressing the review comments.

^ permalink raw reply

* RE: [PATCH 2/2] PCI: lock each enable/disable num_vfs operation in sysfs
From: Tantilov, Emil S @ 2017-01-06  0:55 UTC (permalink / raw)
  To: Gavin Shan
  Cc: linux-pci@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	Duyck, Alexander H, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20170104231139.GA3726@gwshan>

>-----Original Message-----
>From: Gavin Shan [mailto:gwshan@linux.vnet.ibm.com]
>Sent: Wednesday, January 04, 2017 3:12 PM
>To: Tantilov, Emil S <emil.s.tantilov@intel.com>
>Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>; linux-pci@vger.kernel.org;
>intel-wired-lan@lists.osuosl.org; Duyck, Alexander H
><alexander.h.duyck@intel.com>; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: [PATCH 2/2] PCI: lock each enable/disable num_vfs operation in
>sysfs
>
>On Wed, Jan 04, 2017 at 04:00:20PM +0000, Tantilov, Emil S wrote:
>>>On Tue, Jan 03, 2017 at 04:48:31PM -0800, Emil Tantilov wrote:
>>>>Enabling/disabling SRIOV via sysfs by echo-ing multiple values
>>>>simultaneously:
>>>>
>>>>echo 63 > /sys/class/net/ethX/device/sriov_numvfs&
>>>>echo 63 > /sys/class/net/ethX/device/sriov_numvfs
>>>>
>>>>sleep 5
>>>>
>>>>echo 0 > /sys/class/net/ethX/device/sriov_numvfs&
>>>>echo 0 > /sys/class/net/ethX/device/sriov_numvfs
>>>>
>>>>Results in the following bug:
>>>>
>>>>kernel BUG at drivers/pci/iov.c:495!
>>>>invalid opcode: 0000 [#1] SMP
>>>>CPU: 1 PID: 8050 Comm: bash Tainted: G   W   4.9.0-rc7-net-next #2092
>>>>RIP: 0010:[<ffffffff813b1647>]
>>>>	  [<ffffffff813b1647>] pci_iov_release+0x57/0x60
>>>>
>>>>Call Trace:
>>>> [<ffffffff81391726>] pci_release_dev+0x26/0x70
>>>> [<ffffffff8155be6e>] device_release+0x3e/0xb0
>>>> [<ffffffff81365ee7>] kobject_cleanup+0x67/0x180
>>>> [<ffffffff81365d9d>] kobject_put+0x2d/0x60
>>>> [<ffffffff8155bc27>] put_device+0x17/0x20
>>>> [<ffffffff8139c08a>] pci_dev_put+0x1a/0x20
>>>> [<ffffffff8139cb6b>] pci_get_dev_by_id+0x5b/0x90
>>>> [<ffffffff8139cca5>] pci_get_subsys+0x35/0x40
>>>> [<ffffffff8139ccc8>] pci_get_device+0x18/0x20
>>>> [<ffffffff8139ccfb>] pci_get_domain_bus_and_slot+0x2b/0x60
>>>> [<ffffffff813b09e7>] pci_iov_remove_virtfn+0x57/0x180
>>>> [<ffffffff813b0b95>] pci_disable_sriov+0x65/0x140
>>>> [<ffffffffa00a1af7>] ixgbe_disable_sriov+0xc7/0x1d0 [ixgbe]
>>>> [<ffffffffa00a1e9d>] ixgbe_pci_sriov_configure+0x3d/0x170 [ixgbe]
>>>> [<ffffffff8139d28c>] sriov_numvfs_store+0xdc/0x130
>>>>...
>>>>RIP  [<ffffffff813b1647>] pci_iov_release+0x57/0x60
>>>>
>>>>Use the existing mutex lock to protect each enable/disable operation.
>>>>
>>>>CC: Alexander Duyck <alexander.h.duyck@intel.com>
>>>>Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
>>>
>>>Emil, It's going to change semantics of pci_enable_sriov() and
>pci_disable_sriov().
>>>They can be invoked when writing to the sysfs entry, or loading PF's
>>>driver. With the change applied, the lock (pf->sriov->lock) isn't acquired and released
>>>in the PF's driver loading path.
>>
>>The enablement of SRIOV on driver load is done via deprecated module parameter.
>>Perhaps we can just remove it, although there are probably still people that use it
>>and may not be happy if we get rid of it.
>>
>
>Yeah, some drivers are still using the interface. So we cannot affect it
>until it can be droped.
>
>>>I think the reasonable way would be adding a flag in "struct sriov", to
>>>indicate someone is accessing the IOV capability through sysfs file. With this, the
>>>code returns with "-EBUSY" immediately for contenders. With it, nothing is going
>>>to be changed in PF's driver loading path.
>>
>>Flag is what I initially had in mind, but did not want to add extra locking if we
>>can make use of the existing.
>>
>
>The problem is sriov->lock wasn't introduced to protect the whole IOV capability.
>Instead, it protects the allocation of virtual bus (if needed). In your patch,
>it will be used to protect the whole IOV capability, ensure accessing the
>IOV capability exclusively. So the usage of this lock is changed.
>
>     code extracted from pci.h:
>
>     struct pci_sriov {
>            :
>            struct mutex lock;      /* lock for VF bus */
>            :
>     }
>
>The lock was introduced by commit d1b054da8 ("PCI: initialize and release
>SR-IOV capability"). If I'm correct enough, I don't think this lock is needed when
>pci_enable_sriov() or pci_disable_sriov() are called in driver because of
>module
>parameters. I don't see the usage case calling pci_disable_sriov() while
>previous pci_enable_sriov() isn't finished yet. Also, it's not needed in EEH
>subsystem.
>So I think the lock can be dropped, then it can be used to protect sysfs path.

That's pretty much what this patch does, except I kept the locking for EEH since 
it is the only driver that calls pci_iov_add/remove_virtfn() directly.

I'll write it up and run some tests, although I have no way to test EEH.
 
>>>Also, there are some minor comments as below and I guess most of them won't
>>>be applied if you take my suggestion eventually. However, I'm trying to make
>>>the comments complete.
>>
>>Thanks a lot for reviewing!
>>
>>>
>>>>---
>>>> drivers/pci/pci-sysfs.c |   24 +++++++++++++++++-------
>>>> 1 file changed, 17 insertions(+), 7 deletions(-)
>>>>
>>>>diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>>>>index 0666287..5b54cf5 100644
>>>>--- a/drivers/pci/pci-sysfs.c
>>>>+++ b/drivers/pci/pci-sysfs.c
>>>>@@ -472,7 +472,9 @@ static ssize_t sriov_numvfs_store(struct device
>*dev,
>>>> 				  const char *buf, size_t count)
>>>> {
>>>> 	struct pci_dev *pdev = to_pci_dev(dev);
>>>>+	struct pci_sriov *iov = pdev->sriov;
>>>> 	int ret;
>>>>+
>>>
>>>Unnecessary change.
>>>
>>>> 	u16 num_vfs;
>>>>
>>>> 	ret = kstrtou16(buf, 0, &num_vfs);
>>>>@@ -482,38 +484,46 @@ static ssize_t sriov_numvfs_store(struct device
>>>*dev,
>>>> 	if (num_vfs > pci_sriov_get_totalvfs(pdev))
>>>> 		return -ERANGE;
>>>>
>>>>+	mutex_lock(&iov->dev->sriov->lock);
>>>>+
>>>> 	if (num_vfs == pdev->sriov->num_VFs)
>>>>-		return count;		/* no change */
>>>>+		goto exit;
>>>>
>>>> 	/* is PF driver loaded w/callback */
>>>> 	if (!pdev->driver || !pdev->driver->sriov_configure) {
>>>> 		dev_info(&pdev->dev, "Driver doesn't support SRIOV
>>>configuration via sysfs\n");
>>>>-		return -ENOSYS;
>>>>+		ret = -EINVAL;
>>>>+		goto exit;
>>>
>>>Why we need change the error code here?
>>
>>checkpatch was complaining about the use of the ENOSYS error code being specific
>>and even though it was not my patch introducing it I had to change it to shut it up.
>>
>
>Right, it's reserved for attempt to call nonexisting syscall, but I think
>ENOENT might be more indicative than EINVAL in this specific case?

ENOENT is for a missing file, but if we got this far in the code then there must've been 
a sysfs file. This is pretty straightforward "not supported" error, which is why I picked
EINVAL.

Thanks,
Emil 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox