Netdev List
 help / color / mirror / Atom feed
* Re: RTL8723BE performance regression
From: Larry Finger @ 2018-04-04  2:51 UTC (permalink / raw)
  To: João Paulo Rechi Vita
  Cc: Yan-Hsuan Chuang, Ping-Ke Shih, Birming Chiu, Shaofu, Steven Ting,
	Chaoming Li, Kalle Valo, linux-wireless, Network Development,
	LKML, Daniel Drake, João Paulo Rechi Vita, linux
In-Reply-To: <CA+A7VXWjWaSSUiX3h_rpFXrSwtUkynKR9tpvnsSe=DZrC+mF-Q@mail.gmail.com>

On 04/03/2018 09:37 PM, João Paulo Rechi Vita wrote:
> On Tue, Apr 3, 2018 at 7:28 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
> 
> (...)
> 
>> As the antenna selection code changes affected your first bisection, do you
>> have one of those HP laptops with only one antenna and the incorrect coding
>> in the FUSE?
> 
> Yes, that is why I've been passing ant_sel=1 during my tests -- this
> was needed to achieve a good performance in the past, before this
> regression. I've also opened the laptop chassis and confirmed the
> antenna cable is plugged to the connector labeled with "1" on the
> card.
> 
>> If so, please make sure that you still have the same signal
>> strength for good and bad cases. I have tried to keep the driver and the
>> btcoex code in sync, but there may be some combinations of antenna
>> configuration and FUSE contents that cause the code to fail.
>>
> 
> What is the recommended way to monitor the signal strength?

The btcoex code is developed for multiple platforms by a different group than 
the Linux driver. I think they made a change that caused ant_sel to switch from 
1 to 2. At least numerous comments at github.com/lwfinger/rtlwifi_new claimed 
they needed to make that change.

Mhy recommended method is to verify the wifi device name with "iw dev". Then 
using that device

sudo iw dev <dev_name> scan | egrep "SSID|signal"

Larry

^ permalink raw reply

* Re: RTL8723BE performance regression
From: João Paulo Rechi Vita @ 2018-04-04  2:37 UTC (permalink / raw)
  To: Larry Finger
  Cc: Yan-Hsuan Chuang, Ping-Ke Shih, Birming Chiu, Shaofu, Steven Ting,
	Chaoming Li, Kalle Valo, linux-wireless, Network Development,
	LKML, Daniel Drake, João Paulo Rechi Vita, linux
In-Reply-To: <afdf0498-c0a9-936c-0c9b-1b6da6eb5957@lwfinger.net>

On Tue, Apr 3, 2018 at 7:28 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:

(...)

> As the antenna selection code changes affected your first bisection, do you
> have one of those HP laptops with only one antenna and the incorrect coding
> in the FUSE?

Yes, that is why I've been passing ant_sel=1 during my tests -- this
was needed to achieve a good performance in the past, before this
regression. I've also opened the laptop chassis and confirmed the
antenna cable is plugged to the connector labeled with "1" on the
card.

> If so, please make sure that you still have the same signal
> strength for good and bad cases. I have tried to keep the driver and the
> btcoex code in sync, but there may be some combinations of antenna
> configuration and FUSE contents that cause the code to fail.
>

What is the recommended way to monitor the signal strength?

Thanks for such a quick reply,

--
João Paulo Rechi Vita
http://about.me/jprvita

^ permalink raw reply

* Re: RTL8723BE performance regression
From: Larry Finger @ 2018-04-04  2:28 UTC (permalink / raw)
  To: João Paulo Rechi Vita, Yan-Hsuan Chuang, Ping-Ke Shih,
	Birming Chiu, Shaofu, Steven Ting, Chaoming Li, Kalle Valo
  Cc: linux-wireless, Network Development, LKML, Daniel Drake,
	João Paulo Rechi Vita, linux
In-Reply-To: <CA+A7VXX+z8FrDhzW2BQ6AGMJ=_j_53wkUL_YNgkreEbWSh+jCw@mail.gmail.com>

On 04/03/2018 08:51 PM, João Paulo Rechi Vita wrote:
> Hello,
> 
> I've been trying to track a performance regression on the RTL8723BE
> WiFi adapter, which mainly affects the upload bandwidth (although we
> can see a decreased download performance as well, the effect on upload
> is more drastic). This was first reported by users after upgrading
> from our 4.11-based kernel to our 4.13-based kernel, but also
> confirmed to affect our development branch (4.15-based kernel) and
> wireless-drivers-next at the
> wireless-drivers-next-for-davem-2018-03-29 tag. This is happening on
> an HP laptop that needs rtl8723be.ant_sel=1 (and all the following
> tests have been made with that param).
> 
> My first bisect attempt pointed me to the following commit:
> 
> bcd37f4a0831 rtlwifi: btcoex: 23b 2ant: let bt transmit when hw
> initialisation done
> 
> Which I later found to be already fixed by
> 
> a33fcba6ec01 rtlwifi: btcoexist: Fix breakage of ant_sel for rtl8723be.
> 
> That fix is already included in v4.15 though (and our dev branch as
> well), so I did a second bisect, now cherry-picking a33fcba6ec01 at
> every step, and it pointed me to the following commit:
> 
> 7937f02d1953 rtlwifi: btcoex: hook external functions for newer chips
> 
> Reverting that commit on top of our development branch fixes the
> problem, but on top of v4.15 I get mixed results: a few times getting
> a good upload performance (~5-6Mbps) but most of the time just getting
> ~1-1.5Mpbs (which is still better than the 0.0 then test failure I've
> gotten on most bad points of the bisect).
> 
> Bisecting the downstream patches we carry on top of v4.15 (we base our
> kernel on Ubuntu's, so there are quite a few downstream changes) did
> not bring any clarity, as at all bisect points (plus reverting
> 7937f02d1953) the performance was good, so probably there was some
> other difference in the resulting kernels from my initial revert of
> that patch on top of v4.15 and each step during the bisect. I've
> experimented a bit with fwlps=0, but it did not bring any conclusive
> results either. I'll try to look at other things that may have changed
> (configuration perhaps?), but I don't have a clear plan yet.
> 
> Have you seen anything similar, or have any other ideas or suggestions
> to track this problem? Even without crystal clear results, it looks
> like 7937f02d1953 is having a negative impact on the RTL8723BE
> performance, so perhaps it is worth reverting it and reworking it a
> later point?
> 
> This are the results (testing with speedtest.net) I got at some key points:
> 
> Version        Commit        Ping        Down        Up
> 
> v4.11            a351e9b        12        25.44        5.99
> v4.11            a351e9b        131      17.02        5.89
> 
> v4.13            569dbb8        174      14.08        0.00
> v4.13            569dbb8        261      8.41          0.00
> 
> v4.15+revert d8a5b80        19        23.86        1.41
> v4.15+revert d8a5b80        189      18.69        1.39
> 

As the antenna selection code changes affected your first bisection, do you have 
one of those HP laptops with only one antenna and the incorrect coding in the 
FUSE? If so, please make sure that you still have the same signal strength for 
good and bad cases. I have tried to keep the driver and the btcoex code in sync, 
but there may be some combinations of antenna configuration and FUSE contents 
that cause the code to fail.

Larry

^ permalink raw reply

* Re: [PATCH] vhost-net: add limitation of sent packets for tx polling(Internet mail)
From: haibinzhang(张海斌) @ 2018-04-04  1:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, kvm@vger.kernel.org,
	virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	lidongchen(陈立东),
	yunfangtai(台运方)


>On Tue, Apr 03, 2018 at 12:29:47PM +0000, haibinzhang wrote:
>> >On Tue, Apr 03, 2018 at 08:08:26AM +0000, haibinzhang wrote:
>> >> handle_tx will delay rx for a long time when tx busy polling udp packets
>> >> with small length(e.g. 1byte udp payload), because setting VHOST_NET_WEIGHT
>> >> takes into account only sent-bytes but no single packet length.
>> >> 
>> >> Tests were done between two Virtual Machines using netperf(UDP_STREAM, len=1),
>> >> then another machine pinged the client. Result shows as follow:
>> >> 
>> >> Packet#       Ping-Latency(ms)
>> >>               min     avg     max
>> >> Origin      3.319  18.489  57.503
>> >> 64          1.643   2.021   2.552
>> >> 128         1.825   2.600   3.224
>> >> 256         1.997   2.710   4.295
>> >> 512*        1.860   3.171   4.631
>> >> 1024        2.002   4.173   9.056
>> >> 2048        2.257   5.650   9.688
>> >> 4096        2.093   8.508  15.943
>> >> 
>> >> 512 is selected, which is multi-VRING_SIZE
>> >
>> >There's no guarantee vring size is 256.
>> >
>> >Could you pls try with a different tx ring size?
>> >
>> >I suspect we want:
>> >
>> >#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
>> >
>> >
>> >> and close to VHOST_NET_WEIGHT/MTU.
>> >
>> >Puzzled by this part.  Does tweaking MTU change anything?
>> 
>> The MTU of ethernet is 1500, so VHOST_NET_WEIGHT/MTU equals 0x80000/1500=350.
>
>We should include the 12 byte header so it's a bit lower.
>
>> Then sent-bytes cannot reach VHOST_NET_WEIGHT in one handle_tx even with 1500-bytes 
>> frame if packet# is less than 350. So packet# must be bigger than 350.
>> 512 meets this condition
>
>What you seem to say is this:
>
>	imagine MTU sized buffers. With these we stop after 350
>	packets. Thus adding another limit > 350 will not
>	slow us down.
>
>	Fair enough but won't apply with smaller packet
>	sizes, will it?

Right.

>
>	I still think a simpler argument carries more weight:
>
>ring size is a hint from device about a burst size
>it can tolerate. Based on benchmarks, we tweak
>the limit to 2 * vq size as that seems to
>perform a bit better, and is still safer
>than no limit on # of packets as is done now.
>
>	but this needs testing with another ring size.
>	Could you try that please?

Get it. 
We will test with another ring size and submit again

>
>	
>> and is also DEFAULT VRING_SIZE aligned.
>
>Neither Linux nor virtio have a default vring size. It's a historical
>construct that exists in qemu for qemu compatibility
>reasons.
>
>> >
>> >> To evaluate this change, another tests were done using netperf(RR, TX) between
>> >> two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. Result as follow
>> >> does not show obvious changes:
>> >> 
>> >> TCP_RR
>> >> 
>> >> size/sessions/+thu%/+normalize%
>> >>    1/       1/  -7%/        -2%
>> >>    1/       4/  +1%/         0%
>> >>    1/       8/  +1%/        -2%
>> >>   64/       1/  -6%/         0%
>> >>   64/       4/   0%/        +2%
>> >>   64/       8/   0%/         0%
>> >>  256/       1/  -3%/        -4%
>> >>  256/       4/  +3%/        +4%
>> >>  256/       8/  +2%/         0%
>> >> 
>> >> UDP_RR
>> >> 
>> >> size/sessions/+thu%/+normalize%
>> >>    1/       1/  -5%/        +1%
>> >>    1/       4/  +4%/        +1%
>> >>    1/       8/  -1%/        -1%
>> >>   64/       1/  -2%/        -3%
>> >>   64/       4/  -5%/        -1%
>> >>   64/       8/   0%/        -1%
>> >>  256/       1/  +7%/        +1%
>> >>  256/       4/  +1%/        +1%
>> >>  256/       8/  +2%/        +2%
>> >> 
>> >> TCP_STREAM
>> >> 
>> >> size/sessions/+thu%/+normalize%
>> >>   64/       1/   0%/        -3%
>> >>   64/       4/  +3%/        -1%
>> >>   64/       8/  +9%/        -4%
>> >>  256/       1/  +1%/        -4%
>> >>  256/       4/  -1%/        -1%
>> >>  256/       8/  +7%/        +5%
>> >>  512/       1/  +1%/         0%
>> >>  512/       4/  +1%/        -1%
>> >>  512/       8/  +7%/        -5%
>> >> 1024/       1/   0%/        -1%
>> >> 1024/       4/  +3%/         0%
>> >> 1024/       8/  +8%/        +5%
>> >> 2048/       1/  +2%/        +2%
>> >> 2048/       4/  +1%/         0%
>> >> 2048/       8/  -2%/         0%
>> >> 4096/       1/  -2%/         0%
>> >> 4096/       4/  +2%/         0%
>> >> 4096/       8/  +9%/        -2%
>> >> 
>> >> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
>> >> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
>> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>> >> ---
>> >>  drivers/vhost/net.c | 8 +++++++-
>> >>  1 file changed, 7 insertions(+), 1 deletion(-)
>> >> 
>> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> >> index 8139bc70ad7d..13a23f3f3ea4 100644
>> >> --- a/drivers/vhost/net.c
>> >> +++ b/drivers/vhost/net.c
>> >> @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
>> >>   * Using this limit prevents one virtqueue from starving others. */
>> >>  #define VHOST_NET_WEIGHT 0x80000
>> >>  
>> >> +/* Max number of packets transferred before requeueing the job.
>> >> + * Using this limit prevents one virtqueue from starving rx. */
>> >> +#define VHOST_NET_PKT_WEIGHT 512
>> >> +
>> >>  /* MAX number of TX used buffers for outstanding zerocopy */
>> >>  #define VHOST_MAX_PEND 128
>> >>  #define VHOST_GOODCOPY_LEN 256
>> >> @@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
>> >>  	struct socket *sock;
>> >>  	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>> >>  	bool zcopy, zcopy_used;
>> >> +	int sent_pkts = 0;
>> >>  
>> >>  	mutex_lock(&vq->mutex);
>> >>  	sock = vq->private_data;
>> >> @@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
>> >>  		else
>> >>  			vhost_zerocopy_signal_used(net, vq);
>> >>  		vhost_net_tx_packet(net);
>> >> -		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>> >> +		if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
>> >> +		    unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
>> >>  			vhost_poll_queue(&vq->poll);
>> >>  			break;
>> >>  		}
>> >> -- 
>> >> 2.12.3
>> >> 
>> 


^ permalink raw reply

* RTL8723BE performance regression
From: João Paulo Rechi Vita @ 2018-04-04  1:51 UTC (permalink / raw)
  To: Larry Finger, Yan-Hsuan Chuang, Ping-Ke Shih, Birming Chiu,
	Shaofu, Steven Ting, Chaoming Li, Kalle Valo
  Cc: linux-wireless, Network Development, LKML, Daniel Drake,
	João Paulo Rechi Vita, linux

Hello,

I've been trying to track a performance regression on the RTL8723BE
WiFi adapter, which mainly affects the upload bandwidth (although we
can see a decreased download performance as well, the effect on upload
is more drastic). This was first reported by users after upgrading
from our 4.11-based kernel to our 4.13-based kernel, but also
confirmed to affect our development branch (4.15-based kernel) and
wireless-drivers-next at the
wireless-drivers-next-for-davem-2018-03-29 tag. This is happening on
an HP laptop that needs rtl8723be.ant_sel=1 (and all the following
tests have been made with that param).

My first bisect attempt pointed me to the following commit:

bcd37f4a0831 rtlwifi: btcoex: 23b 2ant: let bt transmit when hw
initialisation done

Which I later found to be already fixed by

a33fcba6ec01 rtlwifi: btcoexist: Fix breakage of ant_sel for rtl8723be.

That fix is already included in v4.15 though (and our dev branch as
well), so I did a second bisect, now cherry-picking a33fcba6ec01 at
every step, and it pointed me to the following commit:

7937f02d1953 rtlwifi: btcoex: hook external functions for newer chips

Reverting that commit on top of our development branch fixes the
problem, but on top of v4.15 I get mixed results: a few times getting
a good upload performance (~5-6Mbps) but most of the time just getting
~1-1.5Mpbs (which is still better than the 0.0 then test failure I've
gotten on most bad points of the bisect).

Bisecting the downstream patches we carry on top of v4.15 (we base our
kernel on Ubuntu's, so there are quite a few downstream changes) did
not bring any clarity, as at all bisect points (plus reverting
7937f02d1953) the performance was good, so probably there was some
other difference in the resulting kernels from my initial revert of
that patch on top of v4.15 and each step during the bisect. I've
experimented a bit with fwlps=0, but it did not bring any conclusive
results either. I'll try to look at other things that may have changed
(configuration perhaps?), but I don't have a clear plan yet.

Have you seen anything similar, or have any other ideas or suggestions
to track this problem? Even without crystal clear results, it looks
like 7937f02d1953 is having a negative impact on the RTL8723BE
performance, so perhaps it is worth reverting it and reworking it a
later point?

This are the results (testing with speedtest.net) I got at some key points:

Version        Commit        Ping        Down        Up

v4.11            a351e9b        12        25.44        5.99
v4.11            a351e9b        131      17.02        5.89

v4.13            569dbb8        174      14.08        0.00
v4.13            569dbb8        261      8.41          0.00

v4.15+revert d8a5b80        19        23.86        1.41
v4.15+revert d8a5b80        189      18.69        1.39

Best regards,

--
João Paulo Rechi Vita
http://about.me/jprvita

^ permalink raw reply

* Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
From: David Ahern @ 2018-04-04  1:16 UTC (permalink / raw)
  To: Md. Islam, netdev, David Miller, stephen, agaceph,
	Pavel Emelyanov, Eric Dumazet, alexei.starovoitov, brouer
In-Reply-To: <CAFgPn1DX9cOpDRGj=wFwvZq_bpq6VFnEOzR1YbMuC0+=DFEWxA@mail.gmail.com>

On 4/1/18 6:47 PM, Md. Islam wrote:
> This patch implements IPv4 forwarding on xdp_buff. I added a new
> config option XDP_ROUTER. Kernel would forward packets through fast
> path when this option is enabled. But it would require driver support.
> Currently it only works with veth. Here I have modified veth such that
> it outputs xdp_buff. I created a testbed in Mininet. The Mininet
> script (topology.py) is attached. Here the topology is:
> 
> h1 -----r1-----h2 (r1 acts as a router)
> 
> This patch improves the throughput from 53.8Gb/s to 60Gb/s on my
> machine. Median RTT also improved from around .055 ms to around .035
> ms.
> 
> Then I disabled hyperthreading and cpu frequency scaling in order to
> utilize CPU cache (DPDK also utilizes CPU cache to improve
> forwarding). This further improves per-packet forwarding latency from
> around 400ns to 200 ns. More specifically, header parsing and fib
> lookup only takes around 82 ns. This shows that this could be used to
> implement linerate packet forwarding in kernel.
> 
> The patch has been generated on 4.15.0+. Please let me know your
> feedback and suggestions. Please feel free to let me know if this
> approach make sense.

This patch is not really using eBPF and XDP but rather trying to
shortcircuit forwarding through a veth pair.

Have you looked at the loss in performance with this config enabled if
there is no r1? i.e., h1 {veth1}  <---> {veth2} / h2. You are adding a
lookup per-packet to the Tx path.

Have you looked at what I would consider a more interesting use case of
packets into a node and delivered to a namespace via veth?

   +--------------------------+---------------
   | Host                     | container
   |                          |
   |        +-------{ veth1 }-|-{veth2}----
   |       |                  |
   +----{ eth1 }------------------

Can xdp / bpf on eth1 be used to speed up delivery to the container?

^ permalink raw reply

* Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
From: David Ahern @ 2018-04-04  1:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Md. Islam, netdev, David Miller, stephen, agaceph,
	Pavel Emelyanov, Eric Dumazet, brouer
In-Reply-To: <20180403173701.t35u2p4qkgyqkr32@ast-mbp.dhcp.thefacebook.com>

On 4/3/18 11:37 AM, Alexei Starovoitov wrote:
> On Tue, Apr 03, 2018 at 11:14:00AM -0600, David Ahern wrote:
>> On 4/3/18 11:06 AM, Alexei Starovoitov wrote:
>>>> For 3 and 4 above I was referring to the route lookup part of it; sorry
>>>> for not being clear.
>>>>
>>>> For example, eth1 is enslaved to bond1 which is in VRF red. The lookup
>>>> needs to go to the table associated with the VRF. That is not known by
>>>> just looking at eth1. The code exists to walk the upper layers and do
>>>> the effective translations, just need to cover those cases.
>>>>
>>>> The VLAN part of it is a bit more difficult - ingress device for the
>>>> lookup should be eth1.100 for example not eth1, and then if eth1.100 is
>>>> enslaved to a VRF, ...
>>>>
>>>> None of it is that complex, just need to walk through the various use
>>>> cases and make sure bpf_ipv4_fwd_lookup and bpf_ipv6_fwd_lookup can do
>>>> the right thing for these common use cases.
>>> I'm a bit lost here. Why this is a concern?
>>> 'index' as argument that bpf prog is passing into the helper.
>>> The clsbpf program may choose to pass ifindex of the netdev
>>> it's attached to or some other one.
>>> In your patch you have:
>>> +BPF_CALL_3(bpf_ipv4_fwd_lookup, int, index, const struct iphdr *, iph,
>>> +	   struct ethhdr *, eth)
>>> +{
>>> +	struct flowi4 fl4 = {
>>> +		.daddr = iph->daddr,
>>> +		.saddr = iph->saddr,
>>> +		.flowi4_iif = index,
>>> +		.flowi4_tos = iph->tos & IPTOS_RT_MASK,
>>> +		.flowi4_scope = RT_SCOPE_UNIVERSE,
>>> +	};
>>>
>>> As you saying there is concern with .flowi4_iif = index line ?
>>
>> yes. BPF / XDP programs are installed on the bottom device ... e.g.,
>> eth1. The L3 lookup is not necessarily done on that device index.
> 
> right, but I still don't see any problem with this helper and vlans.
> If xdp program passes incorrect ifindex, it's program's mistake.
> If clsbpf attached to vlan passed good ifindex, the lookup will
> happen in the correct scope, but even in this case the prog
> can pass whatever ifindex it wants.
> 

I'll find some time update the bpf forwarding helpers and look at these
other cases in the next few weeks.

^ permalink raw reply

* Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
From: Andrew Lunn @ 2018-04-04  1:05 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Arnd Bergmann, Ioana Ciornei, gregkh, Laurentiu Tudor,
	Linux Kernel Mailing List, Ruxandra Ioana Ciocoi Radulescu,
	Razvan Stefanescu, Roy Pledge, Networking
In-Reply-To: <CAEac7tZ4we46EEa2DuAbmMr+hW6Q4BoBKJFFf1k_k57r2oJpng@mail.gmail.com>

> Suppose you want to create and assign a network interface to a KVM
> virtual machine, you would do something like the following using
> a user space tool like restool:
>    -create a new (empty) dprc object
>    -create a new dpni and assign it to the dprc
>    -create a new dpio and assign it to the dprc
>    -create a new dpbp and assign it to the dprc
>    -create a new dpmcp and assign it to the dprc
>    -create a new dpmac and assign it to the dprc
>    -connect the dpni to the dpmac

Hi Stuart

It this connecting to a physical port at the bottom?

If so, i would expect that when you probe the device you just create
all these for each physical port. You then just need to map one of
them into the KVM, in the same way you map one PCI device into a KVM.

If these are virtual devices, VF devices you would normally do

echo 4 > /sys/class/net/<device name>/device/sriov_numvfs

on the physical device to create virtual devices.

> The fsl-mc bus and DPAA2 is very NXP-specific, so there doesn't
> seem to be anything that can be made generic here to provide
> more common benefit.

Which is why you should try to avoid all of this.  The user knows how
to use standard linux commands and concepts. They don't want to have
to learn the inside plumbing of your hardware.

       Andrew

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Ahern @ 2018-04-04  1:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Si-Wei Liu, mst, stephen, alexander.h.duyck, davem,
	jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
	virtualization, virtio-dev
In-Reply-To: <20180403154210.GK3313@nanopsycho>

On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>
>> There are other use cases that want to hide a device from userspace. I
> 
> What usecases do you have in mind?

As mentioned in a previous response some kernel drivers create control
netdevs. Just as in this case users should not be mucking with it, and
S/W like lldpd should ignore it.

> 
>> would prefer a better solution than playing games with name prefixes and
>> one that includes an API for users to list all devices -- even ones
>> hidden by default.
> 
> Netdevice hiding feels a bit scarry for me. This smells like a workaround
> for userspace issues. Why can't the netdevice be visible always and
> userspace would know what is it and what should it do with it?
> 
> Once we start with hiding, there are other things related to that which
> appear. Like who can see what, levels of visibility etc...
> 

I would not advocate for any API that does not allow users to have full
introspection. The intent is to hide the netdev by default but have an
option to see it.

^ permalink raw reply

* Re: [PATCH net-next 09/11] devlink: convert occ_get op to separate registration
From: David Ahern @ 2018-04-04  0:47 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Ido Schimmel, netdev, davem, jiri, petrm, mlxsw
In-Reply-To: <20180403153020.GJ3313@nanopsycho>

On 4/3/18 9:30 AM, Jiri Pirko wrote:
> Tue, Apr 03, 2018 at 04:33:11PM CEST, dsahern@gmail.com wrote:
>> On 4/3/18 1:32 AM, Jiri Pirko wrote:
>>> Fri, Mar 30, 2018 at 04:45:50PM CEST, dsahern@gmail.com wrote:
>>>> On 3/29/18 2:33 PM, Ido Schimmel wrote:
>>>>> From: Jiri Pirko <jiri@mellanox.com>
>>>>>
>>>>> This resolves race during initialization where the resources with
>>>>> ops are registered before driver and the structures used by occ_get
>>>>> op is initialized. So keep occ_get callbacks registered only when
>>>>> all structs are initialized.
>>>>
>>>> Why can't the occ_get handler look at some flag in an mlxsw struct to
>>>> know if the system has initialized?
>>>>
>>>> Separate registration here is awkward. You register a resource and then
>>>> register its op later.
>>>
>>> The separation is exactly why this patch is made. Note that devlink
>>> resouce is registered by core way before the initialization is done and
>>> the driver is actually able to perform the op. Also consider "reload"
>>
>> That's how you have chose to code it. I hit this problem adding devlink
>> to netdevsim; the solution was to fix the init order.
> 
> This is not about init order, at all. On reaload netdevs and internal
> driver structures disappear and appear again. And in between currently,
> the op is working with memory which is freed. That's the reason for this
> patch.
> 
> 
>>
>>> case, when the resource is still registered and the driver unloads and
>>> loads again. For that makes perfect sense to have that separated.
>>> Flag would just make things odd. Also, the priv could not be used in
>>> that case.
>>>
>>
>> I am not aware of any other API where you invoked the register function
>> at point A and then later add the operations at point B. In every API
>> that comes to mind the ops are part of the register.
> 
> I think that you just don't see any similar API.
> 
> 
>>
>> I am sure there are options for you to fix the init order of mlxsw
>> without making the devlink API awkward.
> 
> Again, not about init order, at all. I have no clue why you think so...
> 

Jiri, I am not aware of any other API where a driver registers with it
yet doesn't want the handler to be called so either waits to register
the handler later or unregisters the handler. That is an awkward design.
There are other options to have a sane API and handle the conditions you
need.

^ permalink raw reply

* [PATCH net] nfp: use full 40 bits of the NSP buffer address
From: Jakub Kicinski @ 2018-04-04  0:24 UTC (permalink / raw)
  To: davem; +Cc: oss-drivers, netdev, Dirk van der Merwe

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

The NSP default buffer is a piece of NFP memory where additional
command data can be placed.  Its format has been copied from
host buffer, but the PCIe selection bits do not make sense in
this case.  If those get masked out from a NFP address - writes
to random place in the chip memory may be issued and crash the
device.

Even in the general NSP buffer case, it doesn't make sense to have the
PCIe selection bits there anymore. These are unused at the moment, and
when it becomes necessary, the PCIe selection bits should rather be
moved to another register to utilise more bits for the buffer address.

This has never been an issue because the buffer used to be
allocated in memory with less-than-38-bit-long address but that
is about to change.

Fixes: 1a64821c6af7 ("nfp: add support for service processor access")
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
index 39abac678b71..99bb679a9801 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
@@ -71,10 +71,11 @@
 /* CPP address to retrieve the data from */
 #define NSP_BUFFER		0x10
 #define   NSP_BUFFER_CPP	GENMASK_ULL(63, 40)
-#define   NSP_BUFFER_PCIE	GENMASK_ULL(39, 38)
-#define   NSP_BUFFER_ADDRESS	GENMASK_ULL(37, 0)
+#define   NSP_BUFFER_ADDRESS	GENMASK_ULL(39, 0)
 
 #define NSP_DFLT_BUFFER		0x18
+#define   NSP_DFLT_BUFFER_CPP	GENMASK_ULL(63, 40)
+#define   NSP_DFLT_BUFFER_ADDRESS	GENMASK_ULL(39, 0)
 
 #define NSP_DFLT_BUFFER_CONFIG	0x20
 #define   NSP_DFLT_BUFFER_SIZE_MB	GENMASK_ULL(7, 0)
@@ -427,8 +428,8 @@ __nfp_nsp_command_buf(struct nfp_nsp *nsp, u16 code, u32 option,
 	if (err < 0)
 		return err;
 
-	cpp_id = FIELD_GET(NSP_BUFFER_CPP, reg) << 8;
-	cpp_buf = FIELD_GET(NSP_BUFFER_ADDRESS, reg);
+	cpp_id = FIELD_GET(NSP_DFLT_BUFFER_CPP, reg) << 8;
+	cpp_buf = FIELD_GET(NSP_DFLT_BUFFER_ADDRESS, reg);
 
 	if (in_buf && in_size) {
 		err = nfp_cpp_write(cpp, cpp_id, cpp_buf, in_buf, in_size);
-- 
2.16.2

^ permalink raw reply related

* Re: [RFC] net: bump the default number of RSS queues
From: Eric Dumazet @ 2018-04-04  0:20 UTC (permalink / raw)
  To: Jakub Kicinski, davem, netdev; +Cc: brouer, alexander.duyck, oss-drivers
In-Reply-To: <20180404001422.6080-1-jakub.kicinski@netronome.com>



On 04/03/2018 05:14 PM, Jakub Kicinski wrote:
> Some popular NIC vendors are not adhering to
> netif_get_num_default_rss_queues() which leads to users being
> surprised and filing bugs :)  Bump the number of default RX
> queues to something more reasonable for modern machines.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
> I'm mostly wondering what's the policy on this default?  When
> should it be applied?  Why was 8 chosen as the default?  We
> can abandon using netif_get_num_default_rss_queues() for the
> nfp but I wonder what's the correct course of action here...
> Should new drivers use netif_get_num_default_rss_queues() for
> example?
> 
>  include/linux/netdevice.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 2a2d9cf50aa2..26fe145ada2a 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3260,7 +3260,7 @@ static inline unsigned int get_netdev_rx_queue_index(
>  }
>  #endif
>  
> -#define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
> +#define DEFAULT_MAX_NUM_RSS_QUEUES	(64)
>  int netif_get_num_default_rss_queues(void);

There is no evidence having so many queues is beneficial.

Too many queues -> lots of overhead in many cases.

So I would rather not touch this, unless you can present good numbers ;)

^ permalink raw reply

* [RFC] net: bump the default number of RSS queues
From: Jakub Kicinski @ 2018-04-04  0:14 UTC (permalink / raw)
  To: davem, netdev; +Cc: brouer, alexander.duyck, oss-drivers, Jakub Kicinski

Some popular NIC vendors are not adhering to
netif_get_num_default_rss_queues() which leads to users being
surprised and filing bugs :)  Bump the number of default RX
queues to something more reasonable for modern machines.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
I'm mostly wondering what's the policy on this default?  When
should it be applied?  Why was 8 chosen as the default?  We
can abandon using netif_get_num_default_rss_queues() for the
nfp but I wonder what's the correct course of action here...
Should new drivers use netif_get_num_default_rss_queues() for
example?

 include/linux/netdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2a2d9cf50aa2..26fe145ada2a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3260,7 +3260,7 @@ static inline unsigned int get_netdev_rx_queue_index(
 }
 #endif
 
-#define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
+#define DEFAULT_MAX_NUM_RSS_QUEUES	(64)
 int netif_get_num_default_rss_queues(void);
 
 enum skb_free_reason {
-- 
2.16.2

^ permalink raw reply related

* Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
From: Stuart Yoder @ 2018-04-03 23:57 UTC (permalink / raw)
  To: Arnd Bergmann, Andrew Lunn
  Cc: Ioana Ciornei, gregkh, Laurentiu Tudor, Linux Kernel Mailing List,
	Ruxandra Ioana Ciocoi Radulescu, Razvan Stefanescu, Roy Pledge,
	Networking
In-Reply-To: <CAK8P3a1aQCm_r95Ka4x+j8SJLCLFvLuX3A4twK1WJcJs66YwKg@mail.gmail.com>

On Wed, Mar 28, 2018 at 10:43 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wed, Mar 28, 2018 at 4:27 PM, Ioana Ciornei <ioana.ciornei@nxp.com> wrote:
>> Hi,
>>
>>>
>>> Hi Ioana,
>>>
>>> So this driver is a direct passthrough to your hardware for passing fixed-
>>> length command/response pairs. Have you considered using a higher-level
>>> interface instead?
>>>
>>> Can you list some of the commands that are passed here as clarification, and
>>> explain what the tradeoffs are that have led to adopting a low-level interface
>>> instead of a high-level interface?
>>>
>>> The main downside of the direct passthrough obviously is that you tie your
>>> user space to a particular hardware implementation, while a high-level
>>> abstraction could in principle work across a wider range of hardware revisions
>>> or even across multiple vendors implementing the same concept by different
>>> means.
>>
>> If by "higher-level" you mean an implementation where commands are created by the kernel at userspace's request, then I believe this approach is not really viable because of the sheer number of possible commands that would bloat the driver.
>>
>> For example, a DPNI (Data Path Network Interface) can be created using a command that has the following structure:
>>
>> struct dpni_cmd_create {
>>         uint32_t options;
>>         uint8_t num_queues;
>>         uint8_t num_tcs;
>>         uint8_t mac_filter_entries;
>>         uint8_t pad1;
>>         uint8_t vlan_filter_entries;
>>         uint8_t pad2;
>>         uint8_t qos_entries;
>>         uint8_t pad3;
>>         uint16_t fs_entries;
>> };
>>
>> In the above structure, each field has a meaning that the end-user might want to be able to change according to their particular use-case (not much is left at its default value).
>> The same level of complexity is encountered for all the commands that interact with Data Path objects such as DPBP(buffer pools), DPRC(Resource Container) etc.
>> You can find more examples of commands in restool's repo: https://github.com/qoriq-open-source/restool/tree/integration/mc_v10
>>
>> In my opinion, an in-kernel implementation that is equivalent in terms of flexibility will turn
>> into a giant ioctl parser, all while also exposing an userspace API that is not as simple/easy to use.
>
> (adding the netdev list)
>
> The command you list there seems to be networking related, so instead of
> an ioctl based interface, a high-lever interface would likely use netlink
> for consistency with other drivers. Are all commands for networking
> or are there some that are talking to the device to do something unrelated?
>
> Obviously creating a high-level interface would be a lot of work in the kernel,
> and it only pays off if there are multiple independent users, we wouldn't do
> that for just one driver.
>
> I'm still not convinced either way (high-level or low-level
> interface), but I think
> this needs to be discussed with the networking maintainers. Given the examples
> on the github page you linked to, the high-level user space commands
> based on these ioctls
>
>    ls-addni   # adds a network interface
>    ls-addmux  # adds a dpdmux
>    ls-addsw   # adds an l2switch
>    ls-listmac # lists MACs and their connections
>    ls-listni  # lists network interfaces and their connections
>
> and I see that you also support the switchdev interface in
> drivers/staging/fsl-dpaa2, which I think does some of the same
> things, presumably by implementing the switchdev API using
> fsl_mc_command low-level interfaces in the kernel.
>
> Is that a correct interpretation? If yes, could we extend switchdev
> or other networking interfaces to fill in whatever those don't handle
> yet?

The wrapper scripts you referenced are not sufficient to show the scope
of what the proposed user space interface is for.  The command list is
not just about networking related objects, as there are quite a few
other types of  objects as well:
    dprc - container object representing an fsl-mc bus instance...i.e. other
           objects are attached to this bus
    dpio - used for queuing operations towards any accelerator or network
           interface
    dpbp - buffer pool object
    dpmcp - command portal interface
    dpdmai - DMA engine
    dpseci - crypto accelerator
    dpdcei - compression/decompression accelerator
    dpni - network interface
    dprtc - real time counter
    dpaiop - heterogenous core complex for packet processing offload
    dpmac - represents an Ethernet MAC
    dpsw - network switch
    dpcon - network concentrator
    dpci - communication interface

The proposed ioctl interface is about:
   A)  creating and destroying all those object types
   B)  managing the dprc containers they live in, including moving
       objects between containers
   C)  where applicable establishing connections between different
       objects

No _operational_ aspects of these object/devices is being handled
through this interface.  The network interface should be managed
through ethtool, for example.  The proposed ioctl interface is about
bringing the devices into existence and getting them "wired" up.

Suppose you want to create and assign a network interface to a KVM
virtual machine, you would do something like the following using
a user space tool like restool:
   -create a new (empty) dprc object
   -create a new dpni and assign it to the dprc
   -create a new dpio and assign it to the dprc
   -create a new dpbp and assign it to the dprc
   -create a new dpmcp and assign it to the dprc
   -create a new dpmac and assign it to the dprc
   -connect the dpni to the dpmac

Now, at this point you have a functional set of objects that
can function as a network interface.

That dprc can now be assigned to a KVM VM using vfio and the
guest will see a dprc that it can probe and enumerate using the
fsl-mc bus infrastructure that is now upstream.

There is no existing kernel <--> user space mechanism that will
work to do all that, so something new was needed.

As far as low-level vs high-level...we did consider a higher level
interface that would expose operations on individual object such
as "create dpbp", but the user space API gets complex and
fragile for no obvious value.  Every object needs commands to
create/destroy and get attributes.  There is a sizeable dprc command
set.  Every time an object is enhanced (with a corresponding major
or minor version rev) you have to change the ioctl interface.

Having a simple command passthrough interface reduces complexity
in the kernel and provides an interface that should be very stable.

The fsl-mc bus and DPAA2 is very NXP-specific, so there doesn't
seem to be anything that can be made generic here to provide
more common benefit.

Thanks,
Stuart

^ permalink raw reply

* Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications
From: Alexei Starovoitov @ 2018-04-03 23:37 UTC (permalink / raw)
  To: Edward Cree; +Cc: Daniel Borkmann, netdev
In-Reply-To: <554dc916-2eed-dabc-1776-eca6d8bf019b@solarflare.com>

On Tue, Apr 03, 2018 at 02:39:11PM +0100, Edward Cree wrote:
> On 03/04/18 02:08, Alexei Starovoitov wrote:
> > I like patch 3 and going to play with it.
> > How did you test it ?
> Just test_verifier and test_progs (the latter has a failure
>  "test_bpf_obj_id:FAIL:get-prog-info(fd) err 0 errno 2 i 0 type 1(1) info_len 80(40) jit_enabled 0 jited_prog_len 0 xlated_prog_len 72"
>  but that was already there before my patch).

hmm. that doesn't fail for me and any other bots didn't complain.
Are you sure you're running the latest kernel and tests?

You may see the issue with buildid test:
test_stacktrace_build_id:FAIL:get build_id with readelf err -1 errno 2
that's due to actually missing buildid in the binary.

> > Do you have processed_insn numbers for
> > cilium or selftests programs before and after?
> > There should be no difference, right?
> That's a good check, I'll do that.
> 
> > As far as patch 1 it was very difficult to review, since several logically
> > different things clamped together. So breaking it apart:
> > - converting two arrays of subprog_starts and subprog_stack_depth into
> >   single array of struct bpf_subprog_info is a good thing
> > - tsort is interesting, but not sure it's correct. more below
> > - but main change of combining subprog logic with do_check is no good.
> <snip>
> > There will be no 'do_check() across whole program' walk.
> > Combining subprog pass with do_check is going into opposite direction
> > of this long term work. Divide and conquer. Combining more things into
> > do_check is the opposite of this programming principle.
> The main object of my change here was to change the data structures, to
>  store a subprogno in each insn aux rather than using bsearch() on the
>  subprog_starts.  I have now figured out the algorithm to do this in its
>  own pass (previously I thought this needed a recursive walk which is why
>  I wanted to roll it into do_check() - doing more than one whole-program
>  recursive walk seems like a bad idea.)

hmm. what's wrong with bsearch? It's trivial and fast.

> > My short term plan is to split basic instruction correctness checks
> > out of do_check() loop into separate pass and run it early on.
> I agree with that short term plan, sounds like a good idea.
> I'm still not sure I understand the long-term plan, though; since most
>  insns' input registers will still need to be checked (I'm assuming
>  majority of most real ebpf programs consists of computing and
>  dereferencing pointers), the data flow analysis will have to end up
>  doing all the same register updates current do_check() does (though
>  potentially in a different order), e.g. if a function is called three
>  times it will have to analyse it with three sets of input registers.
> Unless you have some way of specifying function preconditions, I don't
>  see how it works.  In particular something like
>     char *f(char *p)
>     {
>         *p++ = 0;
>         return p;
>     }
>     int main(void)
>     {
>         char *p = "abc"; /* represents getting a ptr from ctx or map */
>         p = f(p);
>         p = f(p);
>         p = f(p);
>         return 0;
>     }
>  seems as though it would be difficult to analyse in any way more
>  scalable than the current full recursive walk.  Unless you somehow
>  propagate register state _backwards_ as constraints when analysing a
>  function...?  In any case it seems like there are likely to be things
>  which current verifier accepts which require 'whole-program' analysis
>  to determine the dataflow (e.g. what if there were some conditional
>  branches in f(), and the preconditions on p depended on the value of
>  some other arg in r2?)

The main point that we have to make verifier scale. That's my #1 priority.
Even if we don't see the solution today we have to work towards it.
I believe adopting compiler style analysis is the right way forward.
The independent passes to determine and _keep_ info about subprograms,
basic blocks, cfg, dom graph, liveness, dataflow are necessary steps
not only for the main purpose of the verifier (which is safety check),
but for additional analysis that JITs, like NFP, have to do.
"walk all instruction and apply single transformation" is a _good thing_.
llvm has tens, if not hundreds, loops like:
for_each_bb_in_func(bb, func)
  for_each_insn_in_bb(insn, bb)
    do stuff with insn
Compiler designers could have combined multiple of such passes into
fewer ones, but it's not done, because it increases complexity and
causes tough bugs where pass is partially complete. cfg and/or dataflow
sometimes is recomputed independently from transformation
instead of doing during the pass for the same reasons.

As far as scaling the verifier the number to shot for is 1M bpf instructions.
For certain program types, like XDP, we probably will always keep 4k insn limit,
but in some cases, where program runs in the user contex, we can run it longer.
There 64k insns would be fine. Verifier already capable to inject code
into the program. It could easily inject cond_resched afer every 4k or so.
We'd need to make progs preemtable and be smarter with rcu-based map lookups.
It's all solvable and easy to do as long as core of the verifier scales.
The prime example where more than 4k instructions and loops are mandatory
is user space stack analysis inside the program. Like walking python stack
requires non-trival pointer chasing. With 'pragma unroll' the stack depth
limit today is ~18. That's not really usable. Looping through 100 python
frames would require about 16k bpf assembler instructions. That's 16k JITed
cpu instructions. It's plenty fast and safe in user contex, but if we increase
bpf max_insn limit to 16k, I'm afraid, the verifier will start falling apart
due to insn_processed counter.
Hence do_check approach must go. The rough idea is to compute per basic
block a set of INs (registers and stack) that basic block needs
to see to be safe and corresponding set of OUTs.
Then propagate this knowledge across cfg edges.
Once we have such set per bpf function, it will essentially answer the question
'what arguments this function needs to see to be safe and what it returns'
To make bpf libraries scale we'd need to keep such information
around after the verification, so dynamic linking and indirect calls
are fast to verify.
It's very high level obviously. There are many gotchas to resolve.

> > As far as tsort approach for determining max stack depth...
> > It's an interesting idea, but implementation is suffering from the same
> > 'combine everything into one loop' coding issue.
> > I think updating total_stack_depth math should be separate from sorting,
> > since the same function can be part of different stacks with different max.
> > I don't see how updating global subprog_info[i].total_stack_depth can
> > be correct. It has to be different for every stack and the longest
> > stack is not necessarily the deepest. May be I'm missing something,
> > but combined sort and stack_depth math didn't make it easy to review.
> > I find existing check_max_stack_depth() algo much easier to understand.
> The sort _is_ the way to compute total stack depth.  The sort order isn't
>  being stored anywhere; it's being done just so that each subprog gets
>  looked at after all its callers have been considered.  So when it gets
>  selected as a 'frontier node', its maximum stack depth is known, and can
>  thus be used to update its callees (note that we do a max_t() with each
>  callee's existing total_stack_depth, thus getting the deepest stack of
>  all call chains to the function).
> It may help to imagine drawing the call graph and labelling each node with
>  a stack depth as it is visited; sadly that's difficult to show in an email
>  (or a code comment).  But I can try to explain it a bit better than
>  "/* Update callee total stack depth */".

Please do, since that's my concern with tsort.
The verifier is the key piece of bpf infra and to be effective maintainers
we need to thoroughly understand the verifier code.
We cannot just take the patch based on the cover letter. The author may
disappear tomorrow and what we're going to do with the code?
Hence today, with information I have, I prefer to keep the current
check_max_stack_depth() as-is.
In other words it looks to me that tsort complexity is not justified.

^ permalink raw reply

* Re: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
From: Stephen Hemminger @ 2018-04-03 23:16 UTC (permalink / raw)
  To: Jon Rosen
  Cc: David S. Miller, Willem de Bruijn, Eric Dumazet, Kees Cook,
	David Windsor, Rosen, Rami, Reshetova, Elena, Mike Maloney,
	Benjamin Poirier, open list:NETWORKING [GENERAL], open list
In-Reply-To: <20180403215526.15934-1-jrosen@cisco.com>

On Tue,  3 Apr 2018 17:55:25 -0400
Jon Rosen <jrosen@cisco.com> wrote:

> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> casues the ring to get corrupted by allowing multiple kernel threads
> to claim ownership of the same ring entry, Mark the ring entry as
> already being used within the spin_lock to prevent other kernel
> threads from reusing the same entry before it's fully filled in,
> passed to user space, and then eventually passed back to the kernel
> for use with a new packet.
> 
> Note that the proposed change may modify the semantics of the
> interface between kernel space and user space in a way which may cause
> some applications to no longer work properly. More discussion on this
> change can be found in the additional comments section titled
> "3. Discussion on packet_mmap ownership semantics:".
> 
> Signed-off-by: Jon Rosen <jrosen@cisco.com>
> ---
> 
> Additional Comments Section
> ---------------------------
> 
> 1. Description of the diffs:
> ----------------------------
> 
>    TPACKET_V1 and TPACKET_V2 format rings:
>    -------------------------------------------
>    Mark each entry as TP_STATUS_IN_PROGRESS after allocating to
>    prevent other kernel threads from re-using the same entry.
> 
>    This is necessary because there may be a delay from the time the
>    spin_lock is released to the time that the packet is completed and
>    the corresponding ring entry is marked as owned by user space.  If
>    during this time other kernel threads enqueue more packets to the
>    ring than the size of the ring then it will cause mutliple kernel
>    threads to operate on the same entry at the same time, corrupting
>    packets and the ring state.
> 
>    By marking the entry as allocated (IN_PROGRESS) we prevent other
>    kernel threads from incorrectly re-using an entry that is still in
>    the progress of being filled in before it is passed to user space.
> 
>    This forces each entry through the following states:
> 
>    +-> 1. (tp_status == TP_STATUS_KERNEL)
>    |      Free: For use by any kernel thread to store a new packet
>    |
>    |   2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER)
>    |      Allocated: In use by a *specific* kernel thread
>    |
>    |   3. (tp_status & TP_STATUS_USER)
>    |      Available: Packet available for user space to process
>    |
>    +-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL
> 
> 
>    No impact on TPACKET_V3 format rings:
>    -------------------------------------
>    Packet entry ownership is already protected from other kernel
>    threads potentially re-using the same entry. This is done inside
>    packet_current_rx_frame() where storage is allocated for the
>    current packet. Since this is done within the spin_lock no
>    additional status updates for this entry are required.
> 
> 
>    Defining TP_STATUS_IN_PROGRESS:
>    -------------------------------
>    Rather than defining a new-bit we re-use an existing bit for this
>    intermediate state.  Status will eventually be overwritten with the
>    actual true status when passed to user space.  Any bit used to pass
>    information to user space other than the one that passes ownership
>    is suitable (can't use TP_STATUS_USER).  Alternatively a new bit
>    could be defined.
> 
> 
> 2. More detailed discussion:
> ----------------------------
>    Ring entries basically have 2 states, owned by the kernel or owned by
>    user space. For single producer/single consumer this works fine. For
>    multiple producers there is a window between the call to spin_unlock
>    [F] and the call to __packet_set_status [J] where if there are enough
>    packets added to the ring by other kernel threads then the ring can
>    wrap and multiple threads will end up using the same ring entries.
> 
>    This occurs because the ring entry alocated at [C] did not modify the
>    state of the entry so it continues to appear as owned by the kernel
>    and available for use for new packets even though it has already been
>    allocated.
> 
>    A simple fix is to temporarily mark the ring entries within the spin
>    lock such that user space will still think it?s owned by the kernel
>    and other kernel threads will not see it as available to be used for
>    new packets. If a kernel thread gets delayed between [F] and [J] for
>    an extended period of time and the ring wraps back to the same point
>    then subsiquent kernel threads attempts to allocate will fail and be
>    treated as the ring being full.
> 
>    The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit
>    to prevent other kernel threads from re-using the same entry. Note that
>    any existing bit other than TP_STATUS_USER could have been used.
> 
>    af_packet.c:tpacket_rcv()
>       ... code removed for brevity ...
> 
>       // Acquire spin lock
> A:    spin_lock(&sk->sk_receive_queue.lock);
> 
>             // Preemption is disabled
> 
>             // Get current ring entry
> B:          h.raw = packet_current_rx_frame(
>                         po, skb, TP_STATUS_KERNEL, (macoff+snaplen));
> 
>             // Get out if ring is full
>             // Code not show but it will also release lock
>             if (!h.raw) goto ring_is_full;
> 
>             // Allocate ring entry
> C:          packet_increment_rx_head(po, &po->rx_ring);
> 
>             // Protect against allocation overrun (the simple fix)
>             // by changing TP_STATUS_KERNEL to TP_STATUS_IN_PROGRESS
> D:          if (po->tp_version <= TPACKET_V2) 
>                __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> 
>             // Update counter
> E:          po->stats.stats1.tp_packets++;
> 
>       // Release spin lock
> F:    spin_unlock(&sk->sk_receive_queue.lock);
> 
>       // Copy packet payload
> G:    skb_copy_bits(skb, 0, h.raw + macoff, snaplen);
> 
>       // Get current timestamp
> H:    if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp))) 
>          getnstimeofday(&ts);
>       status |= ts_status;
> 
>       // Fill in header portions of ring entry (removed a bunch of
>       // writes for brevity)
>       ...
>       h.h1->tp_sec = ts.tv_sec;
>       h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC;
>       sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
>       ...
> 
>       // Memory Barrier to make sure all data is written before
>       // passing ownership to user space
> I:    smp_mb();
> 
>       // Update Status, passing ownership to user space
> J:    __packet_set_status(po, h.raw, status | TP_STATUS_USER);
> 
> 
> 3. Discussion on packet_mmap ownership semantics:
> -------------------------------------------------
>    One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
>    is that the documentation of the tp_status field is somewhat
>    inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
>    meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
>    meaning the entry is owned by user space.  In other places ownership
>    by user space is defined by the TP_STATUS_USER(1) bit being set.
> 
>    Relevant section of packet_mmap documentation from
>    https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
>    follow but a summary of the semantics in question are described as:
> 
>    - At the beginning of each frame there is an status field
>    - If this field is 0 means that the frame is ready to be
>      used for the kernel
>    - If not, there is a frame the user can read
> 
>    This would indicate that 0 means owned by the kernel and non-0 is
>    owned by the user.  The simple fix above is to set the status to
>    something that neither the kernel nor the user thinks they own.  That
>    means that 0 vs. non-0 would be insufficient.
> 
>    The description and examples in packet_mmap.txt actually talk about
>    using a specific bit, the TP_STATUS_USER bit, to indicate something is
>    owned by the kernel vs something is owned by the user.  The relevant
>    references from packet_mmap.txt to this bit are:
> 
>    - The kernel initializes all frames to TP_STATUS_KERNEL(0)
>    - when the kernel receives a packet it puts in the buffer and
>      updates the status with at least the TP_STATUS_USER(1) flag
> 
>    This description contradicts the previous description which implied
>    that non-0 means owned by the user. In the above statement it implies
>    that there needs to be more than just non-0 and that the
>    TP_STATUS_USER bit must be set for it to be owned by user space.
> 
>    If the above proposed fix of using a new TP_STATUS_IN_PROGRESS bit (or
>    any existing bit other than TP_STATUS_USER) were to be used and an
>    application were written assuming !TP_STATUS_KERNEL implies owned by
>    user space then the application would incorrectly assume there was a
>    valid packet entry to be processed when the entry is not ready.  If an
>    application were instead written to look specificaly for
>    TP_STATUS_USER then the above proposed fix would work correctly.
> 
>    A more complex solution might be to create a shadow data structure
>    which is private to the kernel and keeps status bits for each ring
>    entry to indicate the intermediate state between owned by kernel and
>    owned by user space.
> 
> 
> 4. Snipet from packet_mmap.txt
> ------------------------------
>    At the beginning of each frame there is an status field (see 
>    struct tpacket_hdr). If this field is 0 means that the frame is ready
>    to be used for the kernel, If not, there is a frame the user can read 
>    and the following flags apply:
> 
>    +++ Capture process:
>    from include/linux/if_packet.h
> 
>    #define TP_STATUS_COPY          (1 << 1)
>    #define TP_STATUS_LOSING        (1 << 2)
>    #define TP_STATUS_CSUMNOTREADY  (1 << 3)
>    #define TP_STATUS_CSUM_VALID    (1 << 7)
> 
>    TP_STATUS_COPY        : This flag indicates that the frame (and associated
>    meta information) has been truncated because it's 
>    larger than tp_frame_size. This packet can be 
>    read entirely with recvfrom().
> 
>    In order to make this work it must to be
>    enabled previously with setsockopt() and 
>    the PACKET_COPY_THRESH option. 
> 
>    The number of frames that can be buffered to
>    be read with recvfrom is limited like a normal socket.
>    See the SO_RCVBUF option in the socket (7) man page.
> 
>    TP_STATUS_LOSING      : indicates there were packet drops from last time 
>    statistics where checked with getsockopt() and
>    the PACKET_STATISTICS option.
> 
>    TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which 
>    its checksum will be done in hardware. So while
>    reading the packet we should not try to check the 
>    checksum. 
> 
>    TP_STATUS_CSUM_VALID  : This flag indicates that at least the transport
>    header checksum of the packet has been already
>    validated on the kernel side. If the flag is not set
>    then we are free to check the checksum by ourselves
>    provided that TP_STATUS_CSUMNOTREADY is also not set.
> 
>    for convenience there are also the following defines:
> 
>    #define TP_STATUS_KERNEL        0
>    #define TP_STATUS_USER          1
> 
>    The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel
>    receives a packet it puts in the buffer and updates the status with
>    at least the TP_STATUS_USER flag. Then the user can read the packet,
>    once the packet is read the user must zero the status field, so the kernel 
>    can use again that frame buffer.
> 
>    The user can use poll (any other variant should apply too) to check if new
>    packets are in the ring:
> 
>    struct pollfd pfd;
> 
>    pfd.fd = fd;
>    pfd.revents = 0;
>    pfd.events = POLLIN|POLLRDNORM|POLLERR;
> 
>    if (status == TP_STATUS_KERNEL)
>    retval = poll(&pfd, 1, timeout);
> 
>    It doesn't incur in a race condition to first check the status value and 
>    then poll for frames.
> 
> 
> 5. Snipet from man pages for packet(7):
> ---------------------------------------
> 
>    See portion marked with "***" for reference to use of TP_STATUS_USER bit.
> 
>        PACKET_RX_RING
> 
>               Create a memory-mapped ring buffer for asynchronous
>               packet reception.  The packet socket reserves a
>               contiguous region of application address space, lays it
>               out into an array of packet slots and copies packets (up
>               to tp_snaplen) into subsequent slots.  Each packet is
>               preceded by a metadata structure similar to
>               tpacket_auxdata.  The protocol fields encode the offset
>               to the data from the start of the metadata header.
>               tp_net stores the offset to the network layer.  If the
>               packet socket is of type SOCK_DGRAM, then tp_mac is the
>               same.  If it is of type SOCK_RAW, then that field stores
>               the offset to the link-layer frame.  Packet socket and
>               application communi? cate the head and tail of the ring
>               through the tp_status field.  The packet socket owns all
>               slots with tp_status equal to TP_STATUS_KERNEL.  After
>               filling a slot, it changes the status of the slot to
>        ***    transfer ownership to the application.  During normal
>        ***    operation, the new tp_status value has at least the
>        ***    TP_STATUS_USER bit set to signal that a received packet
>        ***    has been stored.  When the application has finished
>               processing a packet, it transfers ownership of the slot
>               back to the socket by setting tp_status equal to
>               TP_STATUS_KERNEL.
> 
>               Packet sockets implement multiple variants of the packet
>               ring.  The implementation details are described in
>               Documentation/networking/packet_mmap.txt in the Linux
>               kernel source tree.
> 
> 
> 6. Relevant files:
> ------------------
>   net/packet/af_packet.c
>   include/uapi/linux/if_packet.h
>   Documentation/networking/packet_mmap.txt
> 
> Jon Rosen (1):
>   packet: mark ring entry as in-use inside spin_lock to prevent RX ring
>     overrun
> 
>  net/packet/af_packet.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index e0f3f4a..264d7b2 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>  		if (po->stats.stats1.tp_drops)
>  			status |= TP_STATUS_LOSING;
>  	}
> +
> +        /*
> +         * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> +         * kernel threads from re-using this same entry.
> +         */
> +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> +	if (po->tp_version <= TPACKET_V2)
> +            __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> +
>  	po->stats.stats1.tp_packets++;
>  	if (copy_skb) {
>  		status |= TP_STATUS_COPY;

This patch looks correct. Please resend it with proper signed-off-by
and with a kernel code indenting style (tabs).  Is this bug present
since the beginning of af_packet and multiqueue devices or did it get
introduced in some previous kernel?

^ permalink raw reply

* [PATCH v2] lan78xx: Connect phy early
From: Alexander Graf @ 2018-04-03 22:19 UTC (permalink / raw)
  To: netdev
  Cc: linux-usb, linux-kernel, Nisar.Sayed, Woojung.Huh, UNGLinuxDriver,
	tbogendoerfer, phil, Andrew Lunn, RaghuramChary.Jallipalli

When using wicked with a lan78xx device attached to the system, we
end up with ethtool commands issued on the device before an ifup
got issued. That lead to the following crash:

    Unable to handle kernel NULL pointer dereference at virtual address 0000039c
    pgd = ffff800035b30000
    [0000039c] *pgd=0000000000000000
    Internal error: Oops: 96000004 [#1] SMP
    Modules linked in: [...]
    Supported: Yes
    CPU: 3 PID: 638 Comm: wickedd Tainted: G            E      4.12.14-0-default #1
    Hardware name: raspberrypi rpi/rpi, BIOS 2018.03-rc2 02/21/2018
    task: ffff800035e74180 task.stack: ffff800036718000
    PC is at phy_ethtool_ksettings_get+0x20/0x98
    LR is at lan78xx_get_link_ksettings+0x44/0x60 [lan78xx]
    pc : [<ffff0000086f7f30>] lr : [<ffff000000dcca84>] pstate: 20000005
    sp : ffff80003671bb20
    x29: ffff80003671bb20 x28: ffff800035e74180
    x27: ffff000008912000 x26: 000000000000001d
    x25: 0000000000000124 x24: ffff000008f74d00
    x23: 0000004000114809 x22: 0000000000000000
    x21: ffff80003671bbd0 x20: 0000000000000000
    x19: ffff80003671bbd0 x18: 000000000000040d
    x17: 0000000000000001 x16: 0000000000000000
    x15: 0000000000000000 x14: ffffffffffffffff
    x13: 0000000000000000 x12: 0000000000000020
    x11: 0101010101010101 x10: fefefefefefefeff
    x9 : 7f7f7f7f7f7f7f7f x8 : fefefeff31677364
    x7 : 0000000080808080 x6 : ffff80003671bc9c
    x5 : ffff80003671b9f8 x4 : ffff80002c296190
    x3 : 0000000000000000 x2 : 0000000000000000
    x1 : ffff80003671bbd0 x0 : ffff80003671bc00
    Process wickedd (pid: 638, stack limit = 0xffff800036718000)
    Call trace:
    Exception stack(0xffff80003671b9e0 to 0xffff80003671bb20)
    b9e0: ffff80003671bc00 ffff80003671bbd0 0000000000000000 0000000000000000
    ba00: ffff80002c296190 ffff80003671b9f8 ffff80003671bc9c 0000000080808080
    ba20: fefefeff31677364 7f7f7f7f7f7f7f7f fefefefefefefeff 0101010101010101
    ba40: 0000000000000020 0000000000000000 ffffffffffffffff 0000000000000000
    ba60: 0000000000000000 0000000000000001 000000000000040d ffff80003671bbd0
    ba80: 0000000000000000 ffff80003671bbd0 0000000000000000 0000004000114809
    baa0: ffff000008f74d00 0000000000000124 000000000000001d ffff000008912000
    bac0: ffff800035e74180 ffff80003671bb20 ffff000000dcca84 ffff80003671bb20
    bae0: ffff0000086f7f30 0000000020000005 ffff80002c296000 ffff800035223900
    bb00: 0000ffffffffffff 0000000000000000 ffff80003671bb20 ffff0000086f7f30
    [<ffff0000086f7f30>] phy_ethtool_ksettings_get+0x20/0x98
    [<ffff000000dcca84>] lan78xx_get_link_ksettings+0x44/0x60 [lan78xx]
    [<ffff0000087cbc40>] ethtool_get_settings+0x68/0x210
    [<ffff0000087cc0d4>] dev_ethtool+0x214/0x2180
    [<ffff0000087e5008>] dev_ioctl+0x400/0x630
    [<ffff00000879dd00>] sock_do_ioctl+0x70/0x88
    [<ffff00000879f5f8>] sock_ioctl+0x208/0x368
    [<ffff0000082cde10>] do_vfs_ioctl+0xb0/0x848
    [<ffff0000082ce634>] SyS_ioctl+0x8c/0xa8
    Exception stack(0xffff80003671bec0 to 0xffff80003671c000)
    bec0: 0000000000000009 0000000000008946 0000fffff4e841d0 0000aa0032687465
    bee0: 0000aaaafa2319d4 0000fffff4e841d4 0000000032687465 0000000032687465
    bf00: 000000000000001d 7f7fff7f7f7f7f7f 72606b622e71ff4c 7f7f7f7f7f7f7f7f
    bf20: 0101010101010101 0000000000000020 ffffffffffffffff 0000ffff7f510c68
    bf40: 0000ffff7f6a9d18 0000ffff7f44ce30 000000000000040d 0000ffff7f6f98f0
    bf60: 0000fffff4e842c0 0000000000000001 0000aaaafa2c2e00 0000ffff7f6ab000
    bf80: 0000fffff4e842c0 0000ffff7f62a000 0000aaaafa2b9f20 0000aaaafa2c2e00
    bfa0: 0000fffff4e84818 0000fffff4e841a0 0000ffff7f5ad0cc 0000fffff4e841a0
    bfc0: 0000ffff7f44ce3c 0000000080000000 0000000000000009 000000000000001d
    bfe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000

The culprit is quite simple: The driver tries to access the phy left and right,
but only actually has a working reference to it when the device is up.

The fix thus is quite simple too: Get a reference to the phy on probe already
and keep it even when the device is going down.

With this patch applied, I can successfully run wicked on my system and bring
the interface up and down as many times as I want, without getting NULL pointer
dereferences in between.

Signed-off-by: Alexander Graf <agraf@suse.de>

---

v1 -> v2:

  - move debug message (Andrew Lunn)
  - check for phydev != NULL before phy_stop (Sayed Nisar)
  - move phy_disconnect before unregister_netdev (Sayed Nisar)
  - properly unroll failing probe phy_init (Sayed Nisar)
  - change phy_init to phy_start in reset_resume (Sayed Nisar)
---
 drivers/net/usb/lan78xx.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 60a604cc7647..6986f53fe967 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2082,10 +2082,6 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 
 	dev->fc_autoneg = phydev->autoneg;
 
-	phy_start(phydev);
-
-	netif_dbg(dev, ifup, dev->net, "phy initialised successfully");
-
 	return 0;
 
 error:
@@ -2512,9 +2508,9 @@ static int lan78xx_open(struct net_device *net)
 	if (ret < 0)
 		goto done;
 
-	ret = lan78xx_phy_init(dev);
-	if (ret < 0)
-		goto done;
+	phy_start(net->phydev);
+
+	netif_dbg(dev, ifup, dev->net, "phy initialised successfully");
 
 	/* for Link Check */
 	if (dev->urb_intr) {
@@ -2575,13 +2571,8 @@ static int lan78xx_stop(struct net_device *net)
 	if (timer_pending(&dev->stat_monitor))
 		del_timer_sync(&dev->stat_monitor);
 
-	phy_unregister_fixup_for_uid(PHY_KSZ9031RNX, 0xfffffff0);
-	phy_unregister_fixup_for_uid(PHY_LAN8835, 0xfffffff0);
-
-	phy_stop(net->phydev);
-	phy_disconnect(net->phydev);
-
-	net->phydev = NULL;
+	if (net->phydev)
+		phy_stop(net->phydev);
 
 	clear_bit(EVENT_DEV_OPEN, &dev->flags);
 	netif_stop_queue(net);
@@ -3477,8 +3468,13 @@ static void lan78xx_disconnect(struct usb_interface *intf)
 		return;
 
 	udev = interface_to_usbdev(intf);
-
 	net = dev->net;
+
+	phy_unregister_fixup_for_uid(PHY_KSZ9031RNX, 0xfffffff0);
+	phy_unregister_fixup_for_uid(PHY_LAN8835, 0xfffffff0);
+
+	phy_disconnect(net->phydev);
+
 	unregister_netdev(net);
 
 	cancel_delayed_work_sync(&dev->wq);
@@ -3634,8 +3630,14 @@ static int lan78xx_probe(struct usb_interface *intf,
 	pm_runtime_set_autosuspend_delay(&udev->dev,
 					 DEFAULT_AUTOSUSPEND_DELAY);
 
+	ret = lan78xx_phy_init(dev);
+	if (ret < 0)
+		goto out4;
+
 	return 0;
 
+out4:
+	unregister_netdev(netdev);
 out3:
 	lan78xx_unbind(dev, intf);
 out2:
@@ -3983,7 +3985,7 @@ static int lan78xx_reset_resume(struct usb_interface *intf)
 
 	lan78xx_reset(dev);
 
-	lan78xx_phy_init(dev);
+	phy_start(dev->net->phydev);
 
 	return lan78xx_resume(intf);
 }
-- 
2.12.3

^ permalink raw reply related

* [PATCH v2] net: thunderx: nicvf_main: Fix potential NULL pointer dereferences
From: Gustavo A. R. Silva @ 2018-04-03 22:04 UTC (permalink / raw)
  To: Eric Dumazet, Sunil Goutham, Robert Richter
  Cc: linux-arm-kernel, netdev, linux-kernel, Gustavo A. R. Silva

Add null check on kmalloc() return value in order to prevent
a null pointer dereference.

Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
Changes in v2:
 - Add a null check on a second kmalloc a few lines below. Thanks to
   Eric Dumazet for pointing this out.

 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31f..f7b5ca5 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1999,10 +1999,14 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
 				struct xcast_addr *xaddr;
 
 				mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
+				if (unlikely(!mc_list))
+					return;
 				INIT_LIST_HEAD(&mc_list->list);
 				netdev_hw_addr_list_for_each(ha, &netdev->mc) {
 					xaddr = kmalloc(sizeof(*xaddr),
 							GFP_ATOMIC);
+					if (unlikely(!xaddr))
+						return;
 					xaddr->addr =
 						ether_addr_to_u64(ha->addr);
 					list_add_tail(&xaddr->list,
-- 
2.7.4

^ permalink raw reply related

* [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
From: Jon Rosen @ 2018-04-03 21:55 UTC (permalink / raw)
  To: David S. Miller
  Cc: Jon Rosen, Willem de Bruijn, Eric Dumazet, Kees Cook,
	David Windsor, Rosen, Rami, Reshetova, Elena, Mike Maloney,
	Benjamin Poirier, open list:NETWORKING [GENERAL], open list

Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
casues the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry, Mark the ring entry as
already being used within the spin_lock to prevent other kernel
threads from reusing the same entry before it's fully filled in,
passed to user space, and then eventually passed back to the kernel
for use with a new packet.

Note that the proposed change may modify the semantics of the
interface between kernel space and user space in a way which may cause
some applications to no longer work properly. More discussion on this
change can be found in the additional comments section titled
"3. Discussion on packet_mmap ownership semantics:".

Signed-off-by: Jon Rosen <jrosen@cisco.com>
---

Additional Comments Section
---------------------------

1. Description of the diffs:
----------------------------

   TPACKET_V1 and TPACKET_V2 format rings:
   -------------------------------------------
   Mark each entry as TP_STATUS_IN_PROGRESS after allocating to
   prevent other kernel threads from re-using the same entry.

   This is necessary because there may be a delay from the time the
   spin_lock is released to the time that the packet is completed and
   the corresponding ring entry is marked as owned by user space.  If
   during this time other kernel threads enqueue more packets to the
   ring than the size of the ring then it will cause mutliple kernel
   threads to operate on the same entry at the same time, corrupting
   packets and the ring state.

   By marking the entry as allocated (IN_PROGRESS) we prevent other
   kernel threads from incorrectly re-using an entry that is still in
   the progress of being filled in before it is passed to user space.

   This forces each entry through the following states:

   +-> 1. (tp_status == TP_STATUS_KERNEL)
   |      Free: For use by any kernel thread to store a new packet
   |
   |   2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER)
   |      Allocated: In use by a *specific* kernel thread
   |
   |   3. (tp_status & TP_STATUS_USER)
   |      Available: Packet available for user space to process
   |
   +-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL


   No impact on TPACKET_V3 format rings:
   -------------------------------------
   Packet entry ownership is already protected from other kernel
   threads potentially re-using the same entry. This is done inside
   packet_current_rx_frame() where storage is allocated for the
   current packet. Since this is done within the spin_lock no
   additional status updates for this entry are required.


   Defining TP_STATUS_IN_PROGRESS:
   -------------------------------
   Rather than defining a new-bit we re-use an existing bit for this
   intermediate state.  Status will eventually be overwritten with the
   actual true status when passed to user space.  Any bit used to pass
   information to user space other than the one that passes ownership
   is suitable (can't use TP_STATUS_USER).  Alternatively a new bit
   could be defined.


2. More detailed discussion:
----------------------------
   Ring entries basically have 2 states, owned by the kernel or owned by
   user space. For single producer/single consumer this works fine. For
   multiple producers there is a window between the call to spin_unlock
   [F] and the call to __packet_set_status [J] where if there are enough
   packets added to the ring by other kernel threads then the ring can
   wrap and multiple threads will end up using the same ring entries.

   This occurs because the ring entry alocated at [C] did not modify the
   state of the entry so it continues to appear as owned by the kernel
   and available for use for new packets even though it has already been
   allocated.

   A simple fix is to temporarily mark the ring entries within the spin
   lock such that user space will still think it?s owned by the kernel
   and other kernel threads will not see it as available to be used for
   new packets. If a kernel thread gets delayed between [F] and [J] for
   an extended period of time and the ring wraps back to the same point
   then subsiquent kernel threads attempts to allocate will fail and be
   treated as the ring being full.

   The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit
   to prevent other kernel threads from re-using the same entry. Note that
   any existing bit other than TP_STATUS_USER could have been used.

   af_packet.c:tpacket_rcv()
      ... code removed for brevity ...

      // Acquire spin lock
A:    spin_lock(&sk->sk_receive_queue.lock);

            // Preemption is disabled

            // Get current ring entry
B:          h.raw = packet_current_rx_frame(
                        po, skb, TP_STATUS_KERNEL, (macoff+snaplen));

            // Get out if ring is full
            // Code not show but it will also release lock
            if (!h.raw) goto ring_is_full;

            // Allocate ring entry
C:          packet_increment_rx_head(po, &po->rx_ring);

            // Protect against allocation overrun (the simple fix)
            // by changing TP_STATUS_KERNEL to TP_STATUS_IN_PROGRESS
D:          if (po->tp_version <= TPACKET_V2) 
               __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);

            // Update counter
E:          po->stats.stats1.tp_packets++;

      // Release spin lock
F:    spin_unlock(&sk->sk_receive_queue.lock);

      // Copy packet payload
G:    skb_copy_bits(skb, 0, h.raw + macoff, snaplen);

      // Get current timestamp
H:    if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp))) 
         getnstimeofday(&ts);
      status |= ts_status;

      // Fill in header portions of ring entry (removed a bunch of
      // writes for brevity)
      ...
      h.h1->tp_sec = ts.tv_sec;
      h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC;
      sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
      ...

      // Memory Barrier to make sure all data is written before
      // passing ownership to user space
I:    smp_mb();

      // Update Status, passing ownership to user space
J:    __packet_set_status(po, h.raw, status | TP_STATUS_USER);


3. Discussion on packet_mmap ownership semantics:
-------------------------------------------------
   One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
   is that the documentation of the tp_status field is somewhat
   inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
   meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
   meaning the entry is owned by user space.  In other places ownership
   by user space is defined by the TP_STATUS_USER(1) bit being set.

   Relevant section of packet_mmap documentation from
   https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
   follow but a summary of the semantics in question are described as:

   - At the beginning of each frame there is an status field
   - If this field is 0 means that the frame is ready to be
     used for the kernel
   - If not, there is a frame the user can read

   This would indicate that 0 means owned by the kernel and non-0 is
   owned by the user.  The simple fix above is to set the status to
   something that neither the kernel nor the user thinks they own.  That
   means that 0 vs. non-0 would be insufficient.

   The description and examples in packet_mmap.txt actually talk about
   using a specific bit, the TP_STATUS_USER bit, to indicate something is
   owned by the kernel vs something is owned by the user.  The relevant
   references from packet_mmap.txt to this bit are:

   - The kernel initializes all frames to TP_STATUS_KERNEL(0)
   - when the kernel receives a packet it puts in the buffer and
     updates the status with at least the TP_STATUS_USER(1) flag

   This description contradicts the previous description which implied
   that non-0 means owned by the user. In the above statement it implies
   that there needs to be more than just non-0 and that the
   TP_STATUS_USER bit must be set for it to be owned by user space.

   If the above proposed fix of using a new TP_STATUS_IN_PROGRESS bit (or
   any existing bit other than TP_STATUS_USER) were to be used and an
   application were written assuming !TP_STATUS_KERNEL implies owned by
   user space then the application would incorrectly assume there was a
   valid packet entry to be processed when the entry is not ready.  If an
   application were instead written to look specificaly for
   TP_STATUS_USER then the above proposed fix would work correctly.

   A more complex solution might be to create a shadow data structure
   which is private to the kernel and keeps status bits for each ring
   entry to indicate the intermediate state between owned by kernel and
   owned by user space.


4. Snipet from packet_mmap.txt
------------------------------
   At the beginning of each frame there is an status field (see 
   struct tpacket_hdr). If this field is 0 means that the frame is ready
   to be used for the kernel, If not, there is a frame the user can read 
   and the following flags apply:

   +++ Capture process:
   from include/linux/if_packet.h

   #define TP_STATUS_COPY          (1 << 1)
   #define TP_STATUS_LOSING        (1 << 2)
   #define TP_STATUS_CSUMNOTREADY  (1 << 3)
   #define TP_STATUS_CSUM_VALID    (1 << 7)

   TP_STATUS_COPY        : This flag indicates that the frame (and associated
   meta information) has been truncated because it's 
   larger than tp_frame_size. This packet can be 
   read entirely with recvfrom().

   In order to make this work it must to be
   enabled previously with setsockopt() and 
   the PACKET_COPY_THRESH option. 

   The number of frames that can be buffered to
   be read with recvfrom is limited like a normal socket.
   See the SO_RCVBUF option in the socket (7) man page.

   TP_STATUS_LOSING      : indicates there were packet drops from last time 
   statistics where checked with getsockopt() and
   the PACKET_STATISTICS option.

   TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which 
   its checksum will be done in hardware. So while
   reading the packet we should not try to check the 
   checksum. 

   TP_STATUS_CSUM_VALID  : This flag indicates that at least the transport
   header checksum of the packet has been already
   validated on the kernel side. If the flag is not set
   then we are free to check the checksum by ourselves
   provided that TP_STATUS_CSUMNOTREADY is also not set.

   for convenience there are also the following defines:

   #define TP_STATUS_KERNEL        0
   #define TP_STATUS_USER          1

   The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel
   receives a packet it puts in the buffer and updates the status with
   at least the TP_STATUS_USER flag. Then the user can read the packet,
   once the packet is read the user must zero the status field, so the kernel 
   can use again that frame buffer.

   The user can use poll (any other variant should apply too) to check if new
   packets are in the ring:

   struct pollfd pfd;

   pfd.fd = fd;
   pfd.revents = 0;
   pfd.events = POLLIN|POLLRDNORM|POLLERR;

   if (status == TP_STATUS_KERNEL)
   retval = poll(&pfd, 1, timeout);

   It doesn't incur in a race condition to first check the status value and 
   then poll for frames.


5. Snipet from man pages for packet(7):
---------------------------------------

   See portion marked with "***" for reference to use of TP_STATUS_USER bit.

       PACKET_RX_RING

              Create a memory-mapped ring buffer for asynchronous
              packet reception.  The packet socket reserves a
              contiguous region of application address space, lays it
              out into an array of packet slots and copies packets (up
              to tp_snaplen) into subsequent slots.  Each packet is
              preceded by a metadata structure similar to
              tpacket_auxdata.  The protocol fields encode the offset
              to the data from the start of the metadata header.
              tp_net stores the offset to the network layer.  If the
              packet socket is of type SOCK_DGRAM, then tp_mac is the
              same.  If it is of type SOCK_RAW, then that field stores
              the offset to the link-layer frame.  Packet socket and
              application communi? cate the head and tail of the ring
              through the tp_status field.  The packet socket owns all
              slots with tp_status equal to TP_STATUS_KERNEL.  After
              filling a slot, it changes the status of the slot to
       ***    transfer ownership to the application.  During normal
       ***    operation, the new tp_status value has at least the
       ***    TP_STATUS_USER bit set to signal that a received packet
       ***    has been stored.  When the application has finished
              processing a packet, it transfers ownership of the slot
              back to the socket by setting tp_status equal to
              TP_STATUS_KERNEL.

              Packet sockets implement multiple variants of the packet
              ring.  The implementation details are described in
              Documentation/networking/packet_mmap.txt in the Linux
              kernel source tree.


6. Relevant files:
------------------
  net/packet/af_packet.c
  include/uapi/linux/if_packet.h
  Documentation/networking/packet_mmap.txt

Jon Rosen (1):
  packet: mark ring entry as in-use inside spin_lock to prevent RX ring
    overrun

 net/packet/af_packet.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e0f3f4a..264d7b2 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		if (po->stats.stats1.tp_drops)
 			status |= TP_STATUS_LOSING;
 	}
+
+        /*
+         * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
+         * kernel threads from re-using this same entry.
+         */
+#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
+	if (po->tp_version <= TPACKET_V2)
+            __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
+
 	po->stats.stats1.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
-- 
2.10.3.dirty

^ permalink raw reply related

* Re: [PATCH] net: thunderx: nicvf_main: Fix potential NULL pointer dereference
From: Gustavo A. R. Silva @ 2018-04-03 21:52 UTC (permalink / raw)
  To: Eric Dumazet, Sunil Goutham, Robert Richter
  Cc: linux-arm-kernel, netdev, linux-kernel
In-Reply-To: <f0deebb5-06e8-d788-0e68-0c15c979028c@gmail.com>



On 04/03/2018 04:47 PM, Eric Dumazet wrote:
> 
> 
> On 04/03/2018 02:29 PM, Gustavo A. R. Silva wrote:
>> Add null check on kmalloc() return value in order to prevent
>> a null pointer dereference.
>>
>> Addresses-Coverity-ID: 1467429 ("Dereference null return value")
>> Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
>> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
>> ---
>>   drivers/net/ethernet/cavium/thunder/nicvf_main.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
>> index 1e9a31f..468321a 100644
>> --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
>> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
>> @@ -1999,6 +1999,8 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
>>   				struct xcast_addr *xaddr;
>>   
>>   				mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
>> +				if (unlikely(!mc_list))
>> +					return;
>>   				INIT_LIST_HEAD(&mc_list->list);
>>   				netdev_hw_addr_list_for_each(ha, &netdev->mc) {
>>   					xaddr = kmalloc(sizeof(*xaddr),
>>
> 
> What about the second kmalloc() right there ?
> 

Oops. I thought I had it covered.

I'll send v2 with that change shortly.

Thanks for the feedback, Eric.
--
Gustavo

^ permalink raw reply

* Re: [PATCH] net: thunderx: nicvf_main: Fix potential NULL pointer dereference
From: Eric Dumazet @ 2018-04-03 21:47 UTC (permalink / raw)
  To: Gustavo A. R. Silva, Sunil Goutham, Robert Richter
  Cc: linux-arm-kernel, netdev, linux-kernel
In-Reply-To: <20180403212925.GA31093@embeddedor.com>



On 04/03/2018 02:29 PM, Gustavo A. R. Silva wrote:
> Add null check on kmalloc() return value in order to prevent
> a null pointer dereference.
> 
> Addresses-Coverity-ID: 1467429 ("Dereference null return value")
> Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> ---
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> index 1e9a31f..468321a 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> @@ -1999,6 +1999,8 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
>  				struct xcast_addr *xaddr;
>  
>  				mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
> +				if (unlikely(!mc_list))
> +					return;
>  				INIT_LIST_HEAD(&mc_list->list);
>  				netdev_hw_addr_list_for_each(ha, &netdev->mc) {
>  					xaddr = kmalloc(sizeof(*xaddr),
> 

What about the second kmalloc() right there ?

^ permalink raw reply

* [PATCH] net: thunderx: nicvf_main: Fix potential NULL pointer dereference
From: Gustavo A. R. Silva @ 2018-04-03 21:29 UTC (permalink / raw)
  To: Sunil Goutham, Robert Richter
  Cc: linux-arm-kernel, netdev, linux-kernel, Gustavo A. R. Silva

Add null check on kmalloc() return value in order to prevent
a null pointer dereference.

Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31f..468321a 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1999,6 +1999,8 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
 				struct xcast_addr *xaddr;
 
 				mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
+				if (unlikely(!mc_list))
+					return;
 				INIT_LIST_HEAD(&mc_list->list);
 				netdev_hw_addr_list_for_each(ha, &netdev->mc) {
 					xaddr = kmalloc(sizeof(*xaddr),
-- 
2.7.4

^ permalink raw reply related

* [iwl next-queue PATCH 10/10] ixgbe: Avoid performing unnecessary resets for macvlan offload
From: Alexander Duyck @ 2018-04-03 21:16 UTC (permalink / raw)
  To: intel-wired-lan, jeffrey.t.kirsher; +Cc: netdev
In-Reply-To: <20180403211519.7880.70243.stgit@ahduyck-green-test.jf.intel.com>

The original implementation for macvlan offload has us performing a full
port reset every time we added a new macvlan. This shouldn't be necessary
and can be avoided with a few behavior changes.

This patches updates the logic for the queues so that we have essentially 3
possible configurations for macvlan offload. They consist of 15 macvlans
with 4 queues per macvlan, 31 macvlans with 2 queues per macvlan, and 63
macvlans with 1 queue per macvlan. As macvlans are added you will encounter
up to 3 total resets if you add all the way up to 63, and after that the
device will stay in the mode supporting up to 63 macvlans until the L2FW
flag is cleared.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  192 +++++++++++++++++-------
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    5 -
 2 files changed, 135 insertions(+), 62 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 56772d6..01c95bf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5336,15 +5336,11 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 	struct net_device *vdev = accel->netdev;
 	int i, baseq, err;
 
-	if (!test_bit(accel->pool, adapter->fwd_bitmask))
-		return 0;
-
 	baseq = accel->pool * adapter->num_rx_queues_per_pool;
 	netdev_dbg(vdev, "pool %i:%i queues %i:%i\n",
 		   accel->pool, adapter->num_rx_pools,
 		   baseq, baseq + adapter->num_rx_queues_per_pool);
 
-	accel->netdev = vdev;
 	accel->rx_base_queue = baseq;
 	accel->tx_base_queue = baseq;
 
@@ -5364,9 +5360,17 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 	if (err >= 0)
 		return 0;
 
+	/* if we cannot add the MAC rule then disable the offload */
+	macvlan_release_l2fw_offload(vdev);
+
 	for (i = 0; i < adapter->num_rx_queues_per_pool; i++)
 		adapter->rx_ring[baseq + i]->netdev = NULL;
 
+	netdev_err(vdev, "L2FW offload disabled due to L2 filter error\n");
+
+	clear_bit(accel->pool, adapter->fwd_bitmask);
+	kfree(accel);
+
 	return err;
 }
 
@@ -8783,6 +8787,49 @@ static void ixgbe_set_prio_tc_map(struct ixgbe_adapter *adapter)
 }
 
 #endif /* CONFIG_IXGBE_DCB */
+static int ixgbe_reassign_macvlan_pool(struct net_device *vdev, void *data)
+{
+	struct ixgbe_adapter *adapter = data;
+	struct ixgbe_fwd_adapter *accel;
+	int pool;
+
+	/* we only care about macvlans... */
+	if (!netif_is_macvlan(vdev))
+		return 0;
+
+	/* that have hardware offload enabled... */
+	accel = macvlan_accel_priv(vdev);
+	if (!accel)
+		return 0;
+
+	/* If we can relocate to a different bit do so */
+	pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
+	if (pool < adapter->num_rx_pools) {
+		set_bit(pool, adapter->fwd_bitmask);
+		accel->pool = pool;
+		return 0;
+	}
+
+	/* if we cannot find a free pool then disable the offload */
+	netdev_err(vdev, "L2FW offload disabled due to lack of queue resources\n");
+	macvlan_release_l2fw_offload(vdev);
+	kfree(accel);
+
+	return 0;
+}
+
+static void ixgbe_defrag_macvlan_pools(struct net_device *dev)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+	/* flush any stale bits out of the fwd bitmask */
+	bitmap_clear(adapter->fwd_bitmask, 1, 63);
+
+	/* walk through upper devices reassigning pools */
+	netdev_walk_all_upper_dev_rcu(dev, ixgbe_reassign_macvlan_pool,
+				      adapter);
+}
+
 /**
  * ixgbe_setup_tc - configure net_device for multiple traffic classes
  *
@@ -8850,6 +8897,8 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc)
 #endif /* CONFIG_IXGBE_DCB */
 	ixgbe_init_interrupt_scheme(adapter);
 
+	ixgbe_defrag_macvlan_pools(dev);
+
 	if (netif_running(dev))
 		return ixgbe_open(dev);
 
@@ -9399,6 +9448,22 @@ static netdev_features_t ixgbe_fix_features(struct net_device *netdev,
 	return features;
 }
 
+static void ixgbe_reset_l2fw_offload(struct ixgbe_adapter *adapter)
+{
+	int rss = min_t(int, ixgbe_max_rss_indices(adapter),
+			num_online_cpus());
+
+	/* go back to full RSS if we're not running SR-IOV */
+	if (!adapter->ring_feature[RING_F_VMDQ].offset)
+		adapter->flags &= ~(IXGBE_FLAG_VMDQ_ENABLED |
+				    IXGBE_FLAG_SRIOV_ENABLED);
+
+	adapter->ring_feature[RING_F_RSS].limit = rss;
+	adapter->ring_feature[RING_F_VMDQ].limit = 1;
+
+	ixgbe_setup_tc(adapter->netdev, adapter->hw_tcs);
+}
+
 static int ixgbe_set_features(struct net_device *netdev,
 			      netdev_features_t features)
 {
@@ -9479,7 +9544,9 @@ static int ixgbe_set_features(struct net_device *netdev,
 		}
 	}
 
-	if (need_reset)
+	if ((changed & NETIF_F_HW_L2FW_DOFFLOAD) && adapter->num_rx_pools > 1)
+		ixgbe_reset_l2fw_offload(adapter);
+	else if (need_reset)
 		ixgbe_do_reset(netdev);
 	else if (changed & (NETIF_F_HW_VLAN_CTAG_RX |
 			    NETIF_F_HW_VLAN_CTAG_FILTER))
@@ -9742,11 +9809,9 @@ static int ixgbe_ndo_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
 
 static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 {
-	struct ixgbe_fwd_adapter *fwd_adapter = NULL;
 	struct ixgbe_adapter *adapter = netdev_priv(pdev);
-	int used_pools = adapter->num_vfs + adapter->num_rx_pools;
+	struct ixgbe_fwd_adapter *accel;
 	int tcs = adapter->hw_tcs ? : 1;
-	unsigned int limit;
 	int pool, err;
 
 	/* The hardware supported by ixgbe only filters on the destination MAC
@@ -9756,47 +9821,73 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 	if (!macvlan_supports_dest_filter(vdev))
 		return ERR_PTR(-EMEDIUMTYPE);
 
-	/* Hardware has a limited number of available pools. Each VF, and the
-	 * PF require a pool. Check to ensure we don't attempt to use more
-	 * then the available number of pools.
-	 */
-	if (used_pools >= IXGBE_MAX_VF_FUNCTIONS)
-		return ERR_PTR(-EINVAL);
+	pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
+	if (pool == adapter->num_rx_pools) {
+		u16 used_pools = adapter->num_vfs + adapter->num_rx_pools;
+		u16 reserved_pools;
+
+		if (((adapter->flags & IXGBE_FLAG_DCB_ENABLED) &&
+		     adapter->num_rx_pools >= (MAX_TX_QUEUES / tcs)) ||
+		    adapter->num_rx_pools > IXGBE_MAX_MACVLANS)
+			return ERR_PTR(-EBUSY);
+
+		/* Hardware has a limited number of available pools. Each VF,
+		 * and the PF require a pool. Check to ensure we don't
+		 * attempt to use more then the available number of pools.
+		 */
+		if (used_pools >= IXGBE_MAX_VF_FUNCTIONS)
+			return ERR_PTR(-EBUSY);
 
-	if (((adapter->flags & IXGBE_FLAG_DCB_ENABLED) &&
-	      adapter->num_rx_pools >= (MAX_TX_QUEUES / tcs)) ||
-	    (adapter->num_rx_pools > IXGBE_MAX_MACVLANS))
-		return ERR_PTR(-EBUSY);
+		/* Enable VMDq flag so device will be set in VM mode */
+		adapter->flags |= IXGBE_FLAG_VMDQ_ENABLED |
+				  IXGBE_FLAG_SRIOV_ENABLED;
 
-	fwd_adapter = kzalloc(sizeof(*fwd_adapter), GFP_KERNEL);
-	if (!fwd_adapter)
-		return ERR_PTR(-ENOMEM);
+		/* Try to reserve as many queues per pool as possible,
+		 * we start with the configurations that support 4 queues
+		 * per pools, followed by 2, and then by just 1 per pool.
+		 */
+		if (used_pools < 32 && adapter->num_rx_pools < 16)
+			reserved_pools = min_t(u16,
+					       32 - used_pools,
+					       16 - adapter->num_rx_pools);
+		else if (adapter->num_rx_pools < 32)
+			reserved_pools = min_t(u16,
+					       64 - used_pools,
+					       32 - adapter->num_rx_pools);
+		else
+			reserved_pools = 64 - used_pools;
 
-	pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
-	set_bit(pool, adapter->fwd_bitmask);
-	limit = find_last_bit(adapter->fwd_bitmask, adapter->num_rx_pools + 1);
 
-	/* Enable VMDq flag so device will be set in VM mode */
-	adapter->flags |= IXGBE_FLAG_VMDQ_ENABLED | IXGBE_FLAG_SRIOV_ENABLED;
-	adapter->ring_feature[RING_F_VMDQ].limit = limit + 1;
+		if (!reserved_pools)
+			return ERR_PTR(-EBUSY);
 
-	fwd_adapter->pool = pool;
+		adapter->ring_feature[RING_F_VMDQ].limit += reserved_pools;
 
-	/* Force reinit of ring allocation with VMDQ enabled */
-	err = ixgbe_setup_tc(pdev, adapter->hw_tcs);
+		/* Force reinit of ring allocation with VMDQ enabled */
+		err = ixgbe_setup_tc(pdev, adapter->hw_tcs);
+		if (err)
+			return ERR_PTR(err);
 
-	if (!err && netif_running(pdev))
-		err = ixgbe_fwd_ring_up(adapter, fwd_adapter);
+		if (pool >= adapter->num_rx_pools)
+			return ERR_PTR(-ENOMEM);
+	}
 
-	if (!err)
-		return fwd_adapter;
+	accel = kzalloc(sizeof(*accel), GFP_KERNEL);
+	if (!accel)
+		return ERR_PTR(-ENOMEM);
+
+	set_bit(pool, adapter->fwd_bitmask);
+	accel->pool = pool;
+	accel->netdev = vdev;
 
-	/* unwind counter and free adapter struct */
-	netdev_info(pdev,
-		    "%s: dfwd hardware acceleration failed\n", vdev->name);
-	clear_bit(pool, adapter->fwd_bitmask);
-	kfree(fwd_adapter);
-	return ERR_PTR(err);
+	if (!netif_running(pdev))
+		return accel;
+
+	err = ixgbe_fwd_ring_up(adapter, accel);
+	if (err)
+		return ERR_PTR(err);
+
+	return accel;
 }
 
 static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
@@ -9804,7 +9895,7 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 	struct ixgbe_fwd_adapter *accel = priv;
 	struct ixgbe_adapter *adapter = netdev_priv(pdev);
 	unsigned int rxbase = accel->rx_base_queue;
-	unsigned int limit, i;
+	unsigned int i;
 
 	/* delete unicast filter associated with offloaded interface */
 	ixgbe_del_mac_filter(adapter, accel->netdev->dev_addr,
@@ -9828,25 +9919,6 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 	}
 
 	clear_bit(accel->pool, adapter->fwd_bitmask);
-	limit = find_last_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
-	adapter->ring_feature[RING_F_VMDQ].limit = limit + 1;
-
-	/* go back to full RSS if we're done with our VMQs */
-	if (adapter->ring_feature[RING_F_VMDQ].limit == 1) {
-		int rss = min_t(int, ixgbe_max_rss_indices(adapter),
-				num_online_cpus());
-
-		adapter->flags &= ~IXGBE_FLAG_VMDQ_ENABLED;
-		adapter->flags &= ~IXGBE_FLAG_SRIOV_ENABLED;
-		adapter->ring_feature[RING_F_RSS].limit = rss;
-	}
-
-	ixgbe_setup_tc(pdev, adapter->hw_tcs);
-	netdev_dbg(pdev, "pool %i:%i queues %i:%i\n",
-		   accel->pool, adapter->num_rx_pools,
-		   accel->rx_base_queue,
-		   accel->rx_base_queue +
-		   adapter->num_rx_queues_per_pool);
 	kfree(accel);
 }
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 008aa07..bfc4171 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -266,7 +266,7 @@ int ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
 #endif
 
 	/* Disable VMDq flag so device will be set in VM mode */
-	if (adapter->ring_feature[RING_F_VMDQ].limit == 1) {
+	if (bitmap_weight(adapter->fwd_bitmask, adapter->num_rx_pools) == 1) {
 		adapter->flags &= ~IXGBE_FLAG_VMDQ_ENABLED;
 		adapter->flags &= ~IXGBE_FLAG_SRIOV_ENABLED;
 		rss = min_t(int, ixgbe_max_rss_indices(adapter),
@@ -312,7 +312,8 @@ static int ixgbe_pci_sriov_enable(struct pci_dev *dev, int num_vfs)
 	 * other values out of range.
 	 */
 	num_tc = adapter->hw_tcs;
-	num_rx_pools = adapter->num_rx_pools;
+	num_rx_pools = bitmap_weight(adapter->fwd_bitmask,
+				     adapter->num_rx_pools);
 	limit = (num_tc > 4) ? IXGBE_MAX_VFS_8TC :
 		(num_tc > 1) ? IXGBE_MAX_VFS_4TC : IXGBE_MAX_VFS_1TC;
 

^ permalink raw reply related

* [iwl next-queue PATCH 09/10] ixgbe: Drop real_adapter from l2 fwd acceleration structure
From: Alexander Duyck @ 2018-04-03 21:16 UTC (permalink / raw)
  To: intel-wired-lan, jeffrey.t.kirsher; +Cc: netdev
In-Reply-To: <20180403211519.7880.70243.stgit@ahduyck-green-test.jf.intel.com>

This patch drops the real_adapter member from the fwd_adapter structure.
The general idea behind the change is that the real_adapter is carrying
unnecessary data since we could always just grab the adapter structure
from netdev_priv(macvlan->lowerdev) if we really needed to get at it.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    1 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   28 ++++++++++++++-----------
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 4f08c71..238137c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -306,7 +306,6 @@ enum ixgbe_ring_state_t {
 struct ixgbe_fwd_adapter {
 	unsigned long active_vlans[BITS_TO_LONGS(VLAN_N_VID)];
 	struct net_device *netdev;
-	struct ixgbe_adapter *real_adapter;
 	unsigned int tx_base_queue;
 	unsigned int rx_base_queue;
 	int pool;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7c0f310..56772d6 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5330,10 +5330,10 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
 	rx_ring->next_to_use = 0;
 }
 
-static int ixgbe_fwd_ring_up(struct net_device *vdev,
+static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 			     struct ixgbe_fwd_adapter *accel)
 {
-	struct ixgbe_adapter *adapter = accel->real_adapter;
+	struct net_device *vdev = accel->netdev;
 	int i, baseq, err;
 
 	if (!test_bit(accel->pool, adapter->fwd_bitmask))
@@ -5370,14 +5370,19 @@ static int ixgbe_fwd_ring_up(struct net_device *vdev,
 	return err;
 }
 
-static int ixgbe_upper_dev_walk(struct net_device *upper, void *data)
+static int ixgbe_macvlan_up(struct net_device *vdev, void *data)
 {
-	if (netif_is_macvlan(upper)) {
-		struct ixgbe_fwd_adapter *vadapter = macvlan_accel_priv(upper);
+	struct ixgbe_adapter *adapter = data;
+	struct ixgbe_fwd_adapter *accel;
 
-		if (vadapter)
-			ixgbe_fwd_ring_up(upper, vadapter);
-	}
+	if (!netif_is_macvlan(vdev))
+		return 0;
+
+	accel = macvlan_accel_priv(vdev);
+	if (!accel)
+		return 0;
+
+	ixgbe_fwd_ring_up(adapter, accel);
 
 	return 0;
 }
@@ -5385,7 +5390,7 @@ static int ixgbe_upper_dev_walk(struct net_device *upper, void *data)
 static void ixgbe_configure_dfwd(struct ixgbe_adapter *adapter)
 {
 	netdev_walk_all_upper_dev_rcu(adapter->netdev,
-				      ixgbe_upper_dev_walk, NULL);
+				      ixgbe_macvlan_up, adapter);
 }
 
 static void ixgbe_configure(struct ixgbe_adapter *adapter)
@@ -9776,13 +9781,12 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 	adapter->ring_feature[RING_F_VMDQ].limit = limit + 1;
 
 	fwd_adapter->pool = pool;
-	fwd_adapter->real_adapter = adapter;
 
 	/* Force reinit of ring allocation with VMDQ enabled */
 	err = ixgbe_setup_tc(pdev, adapter->hw_tcs);
 
 	if (!err && netif_running(pdev))
-		err = ixgbe_fwd_ring_up(vdev, fwd_adapter);
+		err = ixgbe_fwd_ring_up(adapter, fwd_adapter);
 
 	if (!err)
 		return fwd_adapter;
@@ -9798,7 +9802,7 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 {
 	struct ixgbe_fwd_adapter *accel = priv;
-	struct ixgbe_adapter *adapter = accel->real_adapter;
+	struct ixgbe_adapter *adapter = netdev_priv(pdev);
 	unsigned int rxbase = accel->rx_base_queue;
 	unsigned int limit, i;
 

^ permalink raw reply related

* [iwl next-queue PATCH 08/10] ixgbe/fm10k: Only support macvlan offload for types that support destination filtering
From: Alexander Duyck @ 2018-04-03 21:16 UTC (permalink / raw)
  To: intel-wired-lan, jeffrey.t.kirsher; +Cc: netdev
In-Reply-To: <20180403211519.7880.70243.stgit@ahduyck-green-test.jf.intel.com>

Both the ixgbe and fm10k drivers support destination filtering.

Instead of adding a ton of complexity to support either source or passthru
mode we can instead just avoid offloading them for now. Doing this we avoid
leaking packets into interfaces that aren't meant to receive them.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c |    8 ++++++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |    7 +++++++
 2 files changed, 15 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index ee645ba..26e7497 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -22,6 +22,7 @@
 #include "fm10k.h"
 #include <linux/vmalloc.h>
 #include <net/udp_tunnel.h>
+#include <linux/if_macvlan.h>
 
 /**
  * fm10k_setup_tx_resources - allocate Tx resources (Descriptors)
@@ -1449,6 +1450,13 @@ static void *fm10k_dfwd_add_station(struct net_device *dev,
 	int size = 0, i;
 	u16 glort;
 
+	/* The hardware supported by fm10k only filters on the destination MAC
+	 * address. In order to avoid issues we only support offloading modes
+	 * where the hardware can actually provide the functionality.
+	 */
+	if (!macvlan_supports_dest_filter(sdev))
+		return ERR_PTR(-EMEDIUMTYPE);
+
 	/* allocate l2 accel structure if it is not available */
 	if (!l2_accel) {
 		/* verify there is enough free GLORTs to support l2_accel */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2c5100e..7c0f310 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9744,6 +9744,13 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 	unsigned int limit;
 	int pool, err;
 
+	/* The hardware supported by ixgbe only filters on the destination MAC
+	 * address. In order to avoid issues we only support offloading modes
+	 * where the hardware can actually provide the functionality.
+	 */
+	if (!macvlan_supports_dest_filter(vdev))
+		return ERR_PTR(-EMEDIUMTYPE);
+
 	/* Hardware has a limited number of available pools. Each VF, and the
 	 * PF require a pool. Check to ensure we don't attempt to use more
 	 * then the available number of pools.

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox