Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: WARNING in rds_cmsg_rdma_args
From: Santosh Shilimkar @ 2018-01-03 18:43 UTC (permalink / raw)
  To: syzbot, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	rds-devel-N0ozoZBvEnrZJqsBc5GL+g,
	syzkaller-bugs-/JYPxA39Uh5TLH3MbocFFw, Mohamed Ghannam
In-Reply-To: <001a1144d1c8a6b0cd0561db6999-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

On 1/3/2018 12:58 AM, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on 
> ad036b63ee57df9ab802a4eb20cbbbec66aa4520
> git://git.cmpxchg.org/linux-mmots.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See 

> 
> for information about syzkaller reproducers
> 
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+ef175b5825285531eabf-Pl5Pbv+GP7P466ipTTIvnc23WoclnBCfAL8bYrjMMd8@public.gmane.org
> It will help syzbot understand when the bug is fixed. See footer for 
> details.
> If you forward the report, please keep this part and the footer.
> 
> audit: type=1400 audit(1514947982.760:7): avc:  denied  { map } for  
> pid=3468 comm="syzkaller284818" path="/root/syzkaller284818499" 
> dev="sda1" ino=16481 
> scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 
> tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
> WARNING: CPU: 1 PID: 3468 at net/rds/rdma.c:617 
> rds_cmsg_rdma_args+0xe96/0x1360 net/rds/rdma.c:617
> Kernel panic - not syncing: panic_on_warn set ...
> 
+Mohamed Ghannam, who was discussing similar issue and was going
to post fix for it. Checking args->nr_local against 0 upfront would
address this issue as well.

Regards,
Santosh
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next v2 0/7] Resolve races in phy accessors
From: David Miller @ 2018-01-03 18:38 UTC (permalink / raw)
  To: linux; +Cc: andrew, f.fainelli, netdev
In-Reply-To: <20180103173255.GH28752@n2100.armlinux.org.uk>

From: Russell King - ARM Linux <linux@armlinux.org.uk>
Date: Wed, 3 Jan 2018 17:32:55 +0000

> On Wed, Jan 03, 2018 at 11:04:31AM -0500, David Miller wrote:
>> From: Russell King - ARM Linux <linux@armlinux.org.uk>
>> Date: Tue, 2 Jan 2018 10:52:18 +0000
>> 
>> > This series resolves races with various accesses to PHY registers.
>> > The first five patches are necessary before we add phylink support
>> > to mvneta, the remaining three are merely cleanups for unobserved
>> > races, and hence are less critical.
>> 
>> Series applied.
> 
> aieee, this should have been applied before the mvneta patch set (as
> mentioned in the covering email to that set, as well as the cover
> message you quoted above) because the mvneta patch set makes these
> races visible.  Well, I guess folk are going to have to put up with
> bisect issues if they hit a commit between the two merges in your tree.
> 
> Well, I guess what's done is done, I did my best to avoid it being
> visible.

Sorry for not catching this, my bad.

^ permalink raw reply

* [PATCH v2] rtnetlink: give a user socket to get_target_net()
From: Andrei Vagin @ 2018-01-03 18:37 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: David Ahern, Florian Westphal, Andrei Vagin, Jiri Benc
In-Reply-To: <20180103072733.9097-1-avagin@openvz.org>

This function is used from two places: rtnl_dump_ifinfo and
rtnl_getlink. In rtnl_getlink(), we give a request skb into
get_target_net(), but in rtnl_dump_ifinfo, we give a response skb
into get_target_net().
The problem here is that NETLINK_CB() isn't initialized for the response
skb. In both cases we can get a user socket and give it instead of skb
into get_target_net().

This bug was found by syzkaller with this call-trace:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 3149 Comm: syzkaller140561 Not tainted 4.15.0-rc4-mm1+ #47
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__netlink_ns_capable+0x8b/0x120 net/netlink/af_netlink.c:868
RSP: 0018:ffff8801c880f348 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8443f900
RDX: 000000000000007b RSI: ffffffff86510f40 RDI: 00000000000003d8
RBP: ffff8801c880f360 R08: 0000000000000000 R09: 1ffff10039101e4f
R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff86510f40
R13: 000000000000000c R14: 0000000000000004 R15: 0000000000000011
FS:  0000000001a1a880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020151000 CR3: 00000001c9511005 CR4: 00000000001606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  netlink_ns_capable+0x26/0x30 net/netlink/af_netlink.c:886
  get_target_net+0x9d/0x120 net/core/rtnetlink.c:1765
  rtnl_dump_ifinfo+0x2e5/0xee0 net/core/rtnetlink.c:1806
  netlink_dump+0x48c/0xce0 net/netlink/af_netlink.c:2222
  __netlink_dump_start+0x4f0/0x6d0 net/netlink/af_netlink.c:2319
  netlink_dump_start include/linux/netlink.h:214 [inline]
  rtnetlink_rcv_msg+0x7f0/0xb10 net/core/rtnetlink.c:4485
  netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2441
  rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4540
  netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline]
  netlink_unicast+0x4be/0x6a0 net/netlink/af_netlink.c:1334
  netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897

Cc: Jiri Benc <jbenc@redhat.com>
Fixes: 79e1ad148c84 ("rtnetlink: use netnsid to query interface")
Reported-by: syzbot+e432865c29eb4c48c142@syzkaller.appspotmail.com
Signed-off-by: Andrei Vagin <avagin@openvz.org>
---

v2: add the Reported-by tag

 net/core/rtnetlink.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dabba2a91fc8..778d7f03404a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1681,18 +1681,18 @@ static bool link_dump_filtered(struct net_device *dev,
 	return false;
 }
 
-static struct net *get_target_net(struct sk_buff *skb, int netnsid)
+static struct net *get_target_net(struct sock *sk, int netnsid)
 {
 	struct net *net;
 
-	net = get_net_ns_by_id(sock_net(skb->sk), netnsid);
+	net = get_net_ns_by_id(sock_net(sk), netnsid);
 	if (!net)
 		return ERR_PTR(-EINVAL);
 
 	/* For now, the caller is required to have CAP_NET_ADMIN in
 	 * the user namespace owning the target net ns.
 	 */
-	if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) {
+	if (!sk_ns_capable(sk, net->user_ns, CAP_NET_ADMIN)) {
 		put_net(net);
 		return ERR_PTR(-EACCES);
 	}
@@ -1733,7 +1733,7 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 			ifla_policy, NULL) >= 0) {
 		if (tb[IFLA_IF_NETNSID]) {
 			netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
-			tgt_net = get_target_net(skb, netnsid);
+			tgt_net = get_target_net(skb->sk, netnsid);
 			if (IS_ERR(tgt_net)) {
 				tgt_net = net;
 				netnsid = -1;
@@ -2883,7 +2883,7 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	if (tb[IFLA_IF_NETNSID]) {
 		netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
-		tgt_net = get_target_net(skb, netnsid);
+		tgt_net = get_target_net(NETLINK_CB(skb).sk, netnsid);
 		if (IS_ERR(tgt_net))
 			return PTR_ERR(tgt_net);
 	}
-- 
2.13.6

^ permalink raw reply related

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: Jiri Pirko @ 2018-01-03 18:36 UTC (permalink / raw)
  To: David Ahern
  Cc: Arkadi Sharshevsky, netdev, roopa, davem, mlxsw, andrew,
	vivien.didelot, f.fainelli, michael.chan, ganeshgr, saeedm,
	matanb, leonro, idosch, jakub.kicinski, ast, daniel, simon.horman,
	pieter.jansenvanvuuren, john.hurley, alexander.h.duyck, linville,
	gospo, steven.lin1, yuvalm, ogerlitz
In-Reply-To: <49c72225-6437-54d5-a046-96fff5b65ce9@cumulusnetworks.com>

Wed, Jan 03, 2018 at 07:29:46PM CET, dsa@cumulusnetworks.com wrote:
>On 1/3/18 11:17 AM, Jiri Pirko wrote:
>> Wed, Jan 03, 2018 at 07:14:16PM CET, dsa@cumulusnetworks.com wrote:
>>> On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote:
>>>> As I stated this is a user-space bug which I fixed, and updated my repo
>>>> so please pull. Devlink uses mnl,and currently mnl does not support
>>>> extended ack. I added support for this in my local ver of libmnl:
>>>>
>>>> https://github.com/arkadis/libmnl.git
>>>>
>>>> On branch master, so you can check it out. Besides this bugs, which were
>>>> userspace, can please specify what are the pending problems from your
>>>> point of view? Thanks!
>>>
>>> devlink is in iproute2 package and it has extack support. See 'git log
>>> lib/libnetlink.c'
>> 
>> Dave, devlink uses libmnl.
>> 
>
>Now I remember. You wrote it independently and but needed iproute2 be a
>delivery vehicle. It uses none of the common infrastructure from
>iproute2. Could we make this more difficult ....

Feel free to rewrite it to use lib/libnetlink.c. Should not be that
hard. Note that at the time I was pushing devlink userspace, tipc also
used libmnl as a part of iproute2, so devlink was not the first one.
That is why I decided not to rewrite.

As of the rest of the "common infrastructure", what exactly do you
have in mind?

>
>Sometime in the next day I will jump through the hoops to get a proper
>devlink command.

^ permalink raw reply

* Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can overrun array bounds
From: Michael S. Tsirkin @ 2018-01-03 18:34 UTC (permalink / raw)
  To: John Fastabend; +Cc: David Miller, jakub.kicinski, xiyou.wangcong, jiri, netdev
In-Reply-To: <127b09e0-2acb-f220-d525-e61ad34da794@gmail.com>

On Wed, Jan 03, 2018 at 09:46:15AM -0800, John Fastabend wrote:
> On 01/03/2018 07:50 AM, Michael S. Tsirkin wrote:
> > On Tue, Jan 02, 2018 at 04:25:03PM -0800, John Fastabend wrote:
> >>>
> >>> More generally, what makes this usage safe?
> >>> Is there a way to formalize it at the API level?
> >>>
> >>
> >> Right I think these are good questions. I think the ptr_ring API should
> >> allow a peek operation to be used without a lock. The user has to ensure
> >> they only use it as a hint and if its dereferenced the user needs to
> >> ensure the object is not free'd from some other codepath while it is
> >> being dereferenced. The existing API seems to match this.
> >>
> >> This is how I used it in pfifo_fast expecting the above to be true. The
> >> API allows for false negatives which _should_ be OK if the user is expecting
> >> this. Alternatively, we could make it false positives if you want and
> >> that would also work for me considering this case is hit very rarely.
> > 
> > By now I'm confused by which are positives and which are negatives :)
> > OK so the guarantees we want would be:
> > 
> > - empty can return false if ring is empty.
> >   caller must re-check with consumer lock taken
> 
> Correct.
> 
> > - if multiple threads call it, only one thread
> 
> What is "it" here, __skb_array_empty() presumably.
> 
> >   is guaranteed that empty will not return true
> >   if ring is non empty.
> >   after detecting that ring is not empty,
> >   this thread shall ....
> > 
> The guarantee is even weaker.
> 
>   - if __skb_array_empty() returns true, either the
>     ring is empty OR a consumer operation is in
>     progress.

Well I'm not sure that's guaranteed actually, in that
in progress isn't all that well defined on multi-core
systems.

E.g. write of NULL could bypass index write and
empty will return true on another CPU even though
both completed on our CPU.

I still think our use is ok, I'm just still having
trouble formulating the exact rules.

> 
> For pfifo_fast this is OK because some thread is making
> progress at this point.
>
>
> Note, even if multiple threads
> are calling __skb_array_empty() there is no guarantee
> any of them will return false even if the ring has
> elements.
>
> > can you help me fill in the blank please?
> > 
> 
> Did that help?
> 
> > 
> > 
> > 
> > 
> >>>>> John, others, could you pls confirm it's not too bad performance-wise?
> >>>>> I'll then split it up properly and re-post.
> >>>>>
> >>>>
> >>>> I haven't benchmarked it but in the dequeue case taking a lock for
> >>>> every priority even when its empty seems unneeded.
> >>>
> >>> Well it does seem to make the code more comprehensible and more robust.
> >>>
> >>
> >> Its a trade-off between performance and robustness.
> >>
> >>> But OK -  I was looking at fixing the unlocked empty API to make sure it
> >>> actually does what it's supposed to. I posted a draft earlier in this
> >>> thread, it needs to be looked at in depth to figure out whether it can
> >>> ever give false negatives or positives, and document the results.
> >>>
> >>>
> >>
> >> I'll look at it. But I still think keeping a lockless version makes sense
> >> for many use cases.
> > 
> > Fine. Just let's try to document what are the guarantees.
> > 
> 
> Yep. Guarantees should be (on ptr_ring implementation and similar
> on skb_array implementation)
> 
>   - if __ptr_ring_empty returns true, the ring may be empty
>     user must use operation with consumer lock to guarantee the
>     ring is empty. Note, when run with concurrent consumer operations
>     it is possible to return true when ring has elements. This
>     occurs when ring consumer head rolls over ring size. Also,
>     producer operations may put object in the ring concurrently
>     making empty check incorrect. To avoid this use locked
>     ptr_ring_empty() API.
> 
>   - if __ptr_ring_empty returns false, the ring may have elements.
>     Possibly, scenario is consumer operation and __ptr_ring_empty
>     operation run concurrently causing false to be returned when
>     ring is empty. To avoid this use locked ptr_ring_empty() API.
> 
> >>>
> >>>>> -->
> >>>>>
> >>>>> net: don't misuse ptr_ring_peek
> >>>>>
> >>>>> ptr_ring_peek only claims to be safe if the result is never
> >>>>> dereferenced, which isn't the case for its use in sch_generic.
> >>>>> Add locked API variants and use the bh one here.
> >>>>>
> >>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>
> >>>>> ---
> >>>>>
> >>>>
> >>>> [...]
> >>>>
> >>>>> --- a/net/sched/sch_generic.c
> >>>>> +++ b/net/sched/sch_generic.c
> >>>>> @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
> >>>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> >>>>>  		struct skb_array *q = band2list(priv, band);
> >>>>>  
> >>>>> -		skb = __skb_array_peek(q);
> >>>>> +		skb = skb_array_peek_bh(q);
> >>>>
> >>>> Ah I should have added a comment here. For now peek() is only used from
> >>>> locking qdiscs. So peek and consume/produce operations will never happen
> >>>> in parallel. In this case we should never hit the false negative case with
> >>>> my patch or the out of bounds reference without my patch.
> > 
> > OK so this is the part I missed. Can you add a comment please?
> > 
> > 
> 
> Yes, working on a documentation patch set to address misleading and missing
> comments/documentation in qdisc layer now.
> 
> >>>> Doing a peek() op without qdisc lock is a bit problematic anyways. With
> >>>> current code another cpu could consume the skb and free it. Either we
> >>>> can ensure a single consumer runs at a time on an array (not the same as
> >>>> qdisc maybe) or just avoid peek operations in this case. My current plan
> >>>> was to just avoid peek() ops altogether, they seem unnecessary with the
> >>>> types of qdiscs I want to be build.
> >>>>
> >>>> Thanks,
> >>>> John
> >>>
> >>> For the lockless qdisc, for net-next, do we need the patch above until you fix that though?
> >>>
> >>
> >> No, I think after this patch (net: ptr_ring: otherwise safe empty checks...) is
> >> applied we do not need any additional fixes in net-next. Future work will
> >> require the above patch (the one you provided) though so its useful work.
> > 
> > The one that avoids allocating extra memory?
> > 
> 
> Let me try summarize again.
> 
> We need the extra memory slot patch if we expose the
> __ptr_ring_empty/__skb_array_empty API. Otherwise we hit the out
> of bounds issues. And for qdisc layer the __skb_array_empty() API
> call is useful so we should not remove it.
> 
> In addition to the qdisc layer using the __skb_array_empty() API if
> we need a peek() API that works without the qdisc lock then your patch,
> the one adding the locked peek API, will be needed. It is not needed
> though until we have a qdisc that would use it without the lock.
> 
> In summary we keep the extra allocation to make the __skb_array_empty() API
> usable. And hold off exposing the skb_array_peek()/ptr_ring_peek() until we
> have a user for it. We can roll your proposed patch into any series we come
> up with for a new/updated qdisc.
> 
> > 
> >> I'll do another review of the false positive case though to be sure the
> >> current code is OK wrt handling false positives and any potential stalls.
> > 
> > 
> > Thanks!
> > 
> >>>>>  	}
> >>>>>  
> >>>>>  	return skb;
> >>>>>

^ permalink raw reply

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: David Ahern @ 2018-01-03 18:29 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Arkadi Sharshevsky, netdev, roopa, davem, mlxsw, andrew,
	vivien.didelot, f.fainelli, michael.chan, ganeshgr, saeedm,
	matanb, leonro, idosch, jakub.kicinski, ast, daniel, simon.horman,
	pieter.jansenvanvuuren, john.hurley, alexander.h.duyck, linville,
	gospo, steven.lin1, yuvalm, ogerlitz
In-Reply-To: <20180103181744.GE2067@nanopsycho.orion>

On 1/3/18 11:17 AM, Jiri Pirko wrote:
> Wed, Jan 03, 2018 at 07:14:16PM CET, dsa@cumulusnetworks.com wrote:
>> On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote:
>>> As I stated this is a user-space bug which I fixed, and updated my repo
>>> so please pull. Devlink uses mnl,and currently mnl does not support
>>> extended ack. I added support for this in my local ver of libmnl:
>>>
>>> https://github.com/arkadis/libmnl.git
>>>
>>> On branch master, so you can check it out. Besides this bugs, which were
>>> userspace, can please specify what are the pending problems from your
>>> point of view? Thanks!
>>
>> devlink is in iproute2 package and it has extack support. See 'git log
>> lib/libnetlink.c'
> 
> Dave, devlink uses libmnl.
> 

Now I remember. You wrote it independently and but needed iproute2 be a
delivery vehicle. It uses none of the common infrastructure from
iproute2. Could we make this more difficult ....

Sometime in the next day I will jump through the hoops to get a proper
devlink command.

^ permalink raw reply

* Re: [PATCH net-next 0/2] Enable virtio to act as a master for a passthru device
From: Alexander Duyck @ 2018-01-03 18:28 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Jakub Kicinski, Brandeburg, Jesse, Michael S. Tsirkin,
	Stephen Hemminger, Netdev, virtualization, virtio-dev,
	Alexander Duyck
In-Reply-To: <843c77a1-98bb-e667-3b4d-cc8bf7cd1668@intel.com>

On Wed, Jan 3, 2018 at 10:14 AM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
> On 1/3/2018 8:59 AM, Alexander Duyck wrote:
>>
>> On Tue, Jan 2, 2018 at 6:16 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
>>>
>>> On Tue,  2 Jan 2018 16:35:36 -0800, Sridhar Samudrala wrote:
>>>>
>>>> This patch series enables virtio to switch over to a VF datapath when a
>>>> VF
>>>> netdev is present with the same MAC address. It allows live migration of
>>>> a VM
>>>> with a direct attached VF without the need to setup a bond/team between
>>>> a
>>>> VF and virtio net device in the guest.
>>>>
>>>> The hypervisor needs to unplug the VF device from the guest on the
>>>> source
>>>> host and reset the MAC filter of the VF to initiate failover of datapath
>>>> to
>>>> virtio before starting the migration. After the migration is completed,
>>>> the
>>>> destination hypervisor sets the MAC filter on the VF and plugs it back
>>>> to
>>>> the guest to switch over to VF datapath.
>>>>
>>>> It is based on netvsc implementation and it may be possible to make this
>>>> code
>>>> generic and move it to a common location that can be shared by netvsc
>>>> and virtio.
>>>>
>>>> This patch series is based on the discussion initiated by Jesse on this
>>>> thread.
>>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>> How does the notion of a device which is both a bond and a leg of a
>>> bond fit with Alex's recent discussions about feature propagation?
>>> Which propagation rules will apply to VirtIO master?  Meaning of the
>>> flags on a software upper device may be different.  Why muddy the
>>> architecture like this and not introduce a synthetic bond device?
>>
>> It doesn't really fit with the notion I had. I think there may have
>> been a bit of a disconnect as I have been out for the last week or so
>> for the holidays.
>>
>> My thought on this was that the feature bit should be spawning a new
>> para-virtual bond device and that bond should have the virto and the
>> VF as slaves. Also I thought there was some discussion about trying to
>> reuse as much of the netvsc code as possible for this so that we could
>> avoid duplication of effort and have the two drivers use the same
>> approach. It seems like it should be pretty straight forward since you
>> would have the feature bit in the case of virto, and netvsc just does
>> this sort of thing by default if I am not mistaken.
>
> This patch is mostly based on netvsc implementation. The only change is
> avoiding the
> explicit dev_open() call of the VF netdev after a delay. I am assuming that
> the guest userspace
> will bring up the VF netdev and the hypervisor will update the MAC filters
> to switch to
> the right data path.
> We could commonize the code and make it shared between netvsc and virtio. Do
> we want
> to do this right away or later? If so, what would be a good location for
> these shared functions?
> Is it net/core/dev.c?

No, I would think about starting a new driver file in "/drivers/net/".
The idea is this driver would be utilized to create a bond
automatically and set the appropriate registration hooks. If nothing
else you could probably just call it something generic like virt-bond
or vbond or whatever.

> Also, if we want to go with a solution that creates a bond device, do we
> want virtio_net/netvsc
> drivers to create a upper device?  Such a solution is already possible via
> config scripts that can
> create a bond with virtio and a VF net device as slaves.  netvsc and this
> patch series is trying to
> make it as simple as possible for the VM to use directly attached devices
> and support live migration
> by switching to virtio datapath as a backup during the migration process
> when the VF device
> is unplugged.

We all understand that. But you are making the solution very virtio
specific. We want to see this be usable for other interfaces such as
netsc and whatever other virtual interfaces are floating around out
there.

Also I haven't seen us address what happens as far as how we will
handle this on the host. My thought was we should have a paired
interface. Something like veth, but made up of a bond on each end. So
in the host we should have one bond that has a tap/vhost interface and
a VF port representor, and on the other we would be looking at the
virtio interface and the VF. Attaching the tap/vhost to the bond could
be a way of triggering the feature bit to be set in the virtio. That
way communication between the guest and the host won't get too
confusing as you will see all traffic from the bonded MAC address
always show up on the host side bond instead of potentially showing up
on two unrelated interfaces. It would also make for a good way to
resolve the east/west traffic problem on hosts since you could just
send the broadcast/multicast traffic via the tap/vhost/virtio channel
instead of having to send it back through the port representor and eat
up all that PCIe bus traffic.

^ permalink raw reply

* RE: [EXT] Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
From: Stefan Chulski @ 2018-01-03 18:26 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Thomas Petazzoni, Andrew Lunn, Florian Fainelli, Yan Markman,
	Jason Cooper, netdev, Antoine Tenart,
	linux-kernel@vger.kernel.org, kishon@ti.com, Nadav Haklai,
	Miquèl Raynal, Gregory Clément, Marcin Wojtas,
	David S. Miller, linux-arm-kernel@lists.infradead.org,
	Sebastian Hesselbarth
In-Reply-To: <20180103175446.GI28752@n2100.armlinux.org.uk>

> Sorry, I find this very confusing.
> 
> It seems we have some people telling me that when there's no PHY described
> in DT, we use this link interrupt, and have a functional network interface
> (presumably at whatever speed.)

I give you couple of examples when we have functional network interface with
link interrupt:

1) 10G XLG mode without PHY - since XLG doesn't have auto negation mechanism
2) SGMII with in-band  auto negation
3) RGMII with auto negation done previously by boot loader or by change default fixed
    speed same as speed on cable.

Of course correct way to do it in RGMII mode is use values provided by phylink/phylib.

Stefan.

> 
> I can't see this working from what you describe - what you describe basically
> tells me that when in-band autonegotiation is disabled, and we have no PHY in
> the kernel, then effectively we are in fixed-link mode - since we need to know
> what speed, duplex and flow control settings to use.
> 
> So, this means that mvpp2 should be enforcing the presence of a fixed-link
> description in DT if there is no PHY node at the moment, but it doesn't.
> 
> Instead, it looks to me like the speed and duplex settings are inherited from the
> boot loader or whatever was running before - I can't find anything that
> configures MVPP2_GMAC_AUTONEG_CONFIG in this case.  That seems quite a
> mess.
> 
> Maybe I'm missing something, but I don't see how mvpp2 can be converted to
> phylink given this without causing some kind of regression, unless someone can
> be much clearer about exactly what kinds of link mvpp2 supports and how they
> work (which is basically what I asked back in
> October.)
> 
> --
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
> According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply

* Re: [PATCH net-next 1/2] virtio_net: Introduce VIRTIO_NET_F_MASTER feature bit
From: Samudrala, Sridhar @ 2018-01-03 18:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: stephen, netdev, virtualization, virtio-dev, jesse.brandeburg
In-Reply-To: <20180103191101-mutt-send-email-mst@kernel.org>



On 1/3/2018 9:43 AM, Michael S. Tsirkin wrote:
> On Tue, Jan 02, 2018 at 04:35:37PM -0800, Sridhar Samudrala wrote:
>> This feature bit can be used by hypervisor to indicate virtio_net device to
>> act as a master for a directly attached passthru device with the same MAC
>> address.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>>   drivers/net/virtio_net.c        | 3 ++-
>>   include/uapi/linux/virtio_net.h | 1 +
>>   2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 6fb7b658a6cc..46844a1d9a62 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2796,7 +2796,8 @@ static struct virtio_device_id id_table[] = {
>>   	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
>>   	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>>   	VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> -	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
>> +	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>> +	VIRTIO_NET_F_MASTER
>>   
>>   static unsigned int features[] = {
>>   	VIRTNET_FEATURES,
>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>> index fc353b518288..a9b4e0836786 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -56,6 +56,7 @@
>>   #define VIRTIO_NET_F_MQ	22	/* Device supports Receive Flow
>>   					 * Steering */
>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>> +#define VIRTIO_NET_F_MASTER	62	/* act as master for a VF device */
> Well the virtio hardware doesn't really know whether it's a master or
> the slave of a bond. In fact quite possibly down the road we'll add a
> virtual device on top as others seem to want to do - that other one will
> be the master then.
>
> And we do nothing do check it's a VF.
>
> So maybe a better name would be
>
> #define VIRTIO_NET_F_BACKUP	62	/* Act as backup for another device with the same MAC */
Yes. BACKUP  does make more sense as virtio is only used as a BACKUP 
datapath when the other
device is unplugged.


>
>>   
>>   #ifndef VIRTIO_NET_NO_LEGACY
>>   #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
>> -- 
>> 2.14.3

^ permalink raw reply

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: Jiri Pirko @ 2018-01-03 18:17 UTC (permalink / raw)
  To: David Ahern
  Cc: Arkadi Sharshevsky, netdev, roopa, davem, mlxsw, andrew,
	vivien.didelot, f.fainelli, michael.chan, ganeshgr, saeedm,
	matanb, leonro, idosch, jakub.kicinski, ast, daniel, simon.horman,
	pieter.jansenvanvuuren, john.hurley, alexander.h.duyck, linville,
	gospo, steven.lin1, yuvalm, ogerlitz
In-Reply-To: <2f0f6007-64a1-0d0d-eff2-2e5ba6fdb701@cumulusnetworks.com>

Wed, Jan 03, 2018 at 07:14:16PM CET, dsa@cumulusnetworks.com wrote:
>On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote:
>> As I stated this is a user-space bug which I fixed, and updated my repo
>> so please pull. Devlink uses mnl,and currently mnl does not support
>> extended ack. I added support for this in my local ver of libmnl:
>> 
>> https://github.com/arkadis/libmnl.git
>> 
>> On branch master, so you can check it out. Besides this bugs, which were
>> userspace, can please specify what are the pending problems from your
>> point of view? Thanks!
>
>devlink is in iproute2 package and it has extack support. See 'git log
>lib/libnetlink.c'

Dave, devlink uses libmnl.

^ permalink raw reply

* Re: [EXT] Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
From: Marcin Wojtas @ 2018-01-03 18:17 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Stefan Chulski, Thomas Petazzoni, Andrew Lunn, Florian Fainelli,
	Yan Markman, Jason Cooper, netdev, Antoine Tenart,
	linux-kernel@vger.kernel.org, kishon@ti.com, Nadav Haklai,
	Miquèl Raynal, Gregory Clément, David S. Miller,
	linux-arm-kernel@lists.infradead.org, Sebastian Hesselbarth
In-Reply-To: <20180103175446.GI28752@n2100.armlinux.org.uk>

Russell,

2018-01-03 18:54 GMT+01:00 Russell King - ARM Linux <linux@armlinux.org.uk>:
> On Wed, Jan 03, 2018 at 05:00:47PM +0000, Stefan Chulski wrote:
>> > > > -----Original Message-----
>> > > > Hi Russell,
>> > > >
>> > > > Indeed. RGMII MAC behaves same way, although it shouldn't be named
>> > > > as 'in- band' to be on par with the specifications. Anyway - this
>> > > > one is rather a stub for being able to work with ACPI, so once the
>> > > > MDIO bus works there, this will be out of any concerns.
>> > >
>> > > Hi Marcin,
>> > >
>> > > This is correct.
>> > > "in-band" supported only for SGMII mode.
>> > > IRQ link interrupt depend on "in-band"' auto negation only if "in-band"'
>> > enabled.
>> > > But IRQ link interrupt could be triggered with "in-band", "out-band" or with
>> > specific fixed speed/duplex/flow_contol.
>> >
>> > Hi Stefan,
>> >
>> > How does this work in RGMII mode - is this handled by the PP2 polling the PHY
>> > to get the speed, duplex and flow control settings?
>> >
>>
>> IRQ interrupt doesn't handled speed, duplex and flow control settings.
>> It's just raised if number of criterions met:
>> 1) Physical signal detected by MAC
>> 2) MAC auto negotiation succeeded(valid only auto negotiation enabled in MAC and "in-band" bypass disabled)
>>
>> So if auto negotiation mechanism disabled in MAC or bypassed, link status would changes to up and IRQ interrupt be triggered.
>>
>> In case of RGMII mode obviously we don't have "in-band" auto negotiation and "out-band" cannot be used in Kernel(due to missed locks).
>> So auto negotiation should be disabled on MAC level and speed/duplex/flow_contol would be negotiate by PHY.
>> phylink/phylib infrastructure should provide speed/duplex/flow_contol(that were agreed between PHY's) to ppv2 driver.
>
> Sorry, I find this very confusing.
>
> It seems we have some people telling me that when there's no PHY
> described in DT, we use this link interrupt, and have a functional
> network interface (presumably at whatever speed.)
>
> I can't see this working from what you describe - what you describe
> basically tells me that when in-band autonegotiation is disabled, and we
> have no PHY in the kernel, then effectively we are in fixed-link mode -
> since we need to know what speed, duplex and flow control settings to
> use.
>
> So, this means that mvpp2 should be enforcing the presence of a
> fixed-link description in DT if there is no PHY node at the moment, but
> it doesn't.
>
> Instead, it looks to me like the speed and duplex settings are inherited
> from the boot loader or whatever was running before - I can't find
> anything that configures MVPP2_GMAC_AUTONEG_CONFIG in this case.  That
> seems quite a mess.

Just 3 cents from me about the RGMII + link IRQ, which is only a
temporary solution for ACPI. It works basing on the results of UEFI HW
PHY polling and inherited GMAC settings. Once this MDIO bus / PHY
handling lands, this configuration will be out of any question and
usage in the pp2 driver.

Marcin

>
> Maybe I'm missing something, but I don't see how mvpp2 can be converted
> to phylink given this without causing some kind of regression, unless
> someone can be much clearer about exactly what kinds of link mvpp2
> supports and how they work (which is basically what I asked back in
> October.)
>
> --
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
> According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: David Ahern @ 2018-01-03 18:14 UTC (permalink / raw)
  To: Arkadi Sharshevsky, Jiri Pirko, netdev, roopa
  Cc: davem, mlxsw, andrew, vivien.didelot, f.fainelli, michael.chan,
	ganeshgr, saeedm, matanb, leonro, idosch, jakub.kicinski, ast,
	daniel, simon.horman, pieter.jansenvanvuuren, john.hurley,
	alexander.h.duyck, linville, gospo, steven.lin1, yuvalm, ogerlitz
In-Reply-To: <0f861e90-63d3-2666-ef2d-0fc91beae957@mellanox.com>

On 1/3/18 11:05 AM, Arkadi Sharshevsky wrote:
> As I stated this is a user-space bug which I fixed, and updated my repo
> so please pull. Devlink uses mnl,and currently mnl does not support
> extended ack. I added support for this in my local ver of libmnl:
> 
> https://github.com/arkadis/libmnl.git
> 
> On branch master, so you can check it out. Besides this bugs, which were
> userspace, can please specify what are the pending problems from your
> point of view? Thanks!

devlink is in iproute2 package and it has extack support. See 'git log
lib/libnetlink.c'

You do not need to modify libmnl to get extack output in devlink. Can
you update the command accordingly? Once it works I will come back to
this topic and take another look.

^ permalink raw reply

* Re: [PATCH net-next 0/2] Enable virtio to act as a master for a passthru device
From: Samudrala, Sridhar @ 2018-01-03 18:14 UTC (permalink / raw)
  To: Alexander Duyck, Jakub Kicinski
  Cc: Brandeburg, Jesse, Michael S. Tsirkin, Stephen Hemminger, Netdev,
	virtualization, virtio-dev, Alexander Duyck
In-Reply-To: <CAKgT0Ue7E3ubPOP-d8femP1Yf=XK0j8MJRC7Wg8L7N59rugL4A@mail.gmail.com>



On 1/3/2018 8:59 AM, Alexander Duyck wrote:
> On Tue, Jan 2, 2018 at 6:16 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
>> On Tue,  2 Jan 2018 16:35:36 -0800, Sridhar Samudrala wrote:
>>> This patch series enables virtio to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration of a VM
>>> with a direct attached VF without the need to setup a bond/team between a
>>> VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to unplug the VF device from the guest on the source
>>> host and reset the MAC filter of the VF to initiate failover of datapath to
>>> virtio before starting the migration. After the migration is completed, the
>>> destination hypervisor sets the MAC filter on the VF and plugs it back to
>>> the guest to switch over to VF datapath.
>>>
>>> It is based on netvsc implementation and it may be possible to make this code
>>> generic and move it to a common location that can be shared by netvsc and virtio.
>>>
>>> This patch series is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>> How does the notion of a device which is both a bond and a leg of a
>> bond fit with Alex's recent discussions about feature propagation?
>> Which propagation rules will apply to VirtIO master?  Meaning of the
>> flags on a software upper device may be different.  Why muddy the
>> architecture like this and not introduce a synthetic bond device?
> It doesn't really fit with the notion I had. I think there may have
> been a bit of a disconnect as I have been out for the last week or so
> for the holidays.
>
> My thought on this was that the feature bit should be spawning a new
> para-virtual bond device and that bond should have the virto and the
> VF as slaves. Also I thought there was some discussion about trying to
> reuse as much of the netvsc code as possible for this so that we could
> avoid duplication of effort and have the two drivers use the same
> approach. It seems like it should be pretty straight forward since you
> would have the feature bit in the case of virto, and netvsc just does
> this sort of thing by default if I am not mistaken.
This patch is mostly based on netvsc implementation. The only change is 
avoiding the
explicit dev_open() call of the VF netdev after a delay. I am assuming 
that the guest userspace
will bring up the VF netdev and the hypervisor will update the MAC 
filters to switch to
the right data path.
We could commonize the code and make it shared between netvsc and 
virtio. Do we want
to do this right away or later? If so, what would be a good location for 
these shared functions?
Is it net/core/dev.c?

Also, if we want to go with a solution that creates a bond device, do we 
want virtio_net/netvsc
drivers to create a upper device?  Such a solution is already possible 
via config scripts that can
create a bond with virtio and a VF net device as slaves.  netvsc and 
this patch series is trying to
make it as simple as possible for the VM to use directly attached 
devices and support live migration
by switching to virtio datapath as a backup during the migration process 
when the VF device
is unplugged.

Thanks
Sridhar

^ permalink raw reply

* [PATCHv3] 3c59x: fix missing dma_mapping_error check and bad ring refill logic
From: Neil Horman @ 2018-01-03 18:09 UTC (permalink / raw)
  To: netdev
  Cc: tedheadster, Neil Horman, Neil Horman, Steffen Klassert,
	David S. Miller
In-Reply-To: <CAP8WD_aJSCCCYWFdNV3xH_NqfrgB9XtzBGp-nnua8YSaZfFf3w@mail.gmail.com>

A few spots in 3c59x missed calls to dma_mapping_error checks, casuing
WARN_ONS to trigger.  Clean those up.  While we're at it, refactor the
refill code a bit so that if skb allocation or dma mapping fails, we
recycle the existing buffer.  This prevents holes in the rx ring, and
makes for much simpler logic

Note: This is compile only tested.  Ted, if you could run this and
confirm that it continues to work properly, I would appreciate it, as I
currently don't have access to this hardware

Signed-off-by: Neil Horman <nhorman@redhat.com>
CC: Steffen Klassert <klassert@mathematik.tu-chemnitz.de>
CC: "David S. Miller" <davem@davemloft.net>
Reported-by: tedheadster@gmail.com

---
Change notes:

v2)
* Fixed tx path to free skb on mapping error
* Refactored rx path to recycle skbs on allocation/mapping error
* Used refactoring to remove oom timer and dirty_rx index

v3)
* Removed unused variable that was causing a warning
---
 drivers/net/ethernet/3com/3c59x.c | 90 +++++++++++++++++----------------------
 1 file changed, 38 insertions(+), 52 deletions(-)

diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c
index f4e13a7014bd..36c8950dbd2d 100644
--- a/drivers/net/ethernet/3com/3c59x.c
+++ b/drivers/net/ethernet/3com/3c59x.c
@@ -602,7 +602,7 @@ struct vortex_private {
 	struct sk_buff* rx_skbuff[RX_RING_SIZE];
 	struct sk_buff* tx_skbuff[TX_RING_SIZE];
 	unsigned int cur_rx, cur_tx;		/* The next free ring entry */
-	unsigned int dirty_rx, dirty_tx;	/* The ring entries to be free()ed. */
+	unsigned int dirty_tx;	/* The ring entries to be free()ed. */
 	struct vortex_extra_stats xstats;	/* NIC-specific extra stats */
 	struct sk_buff *tx_skb;				/* Packet being eaten by bus master ctrl.  */
 	dma_addr_t tx_skb_dma;				/* Allocated DMA address for bus master ctrl DMA.   */
@@ -618,7 +618,6 @@ struct vortex_private {
 
 	/* The remainder are related to chip state, mostly media selection. */
 	struct timer_list timer;			/* Media selection timer. */
-	struct timer_list rx_oom_timer;		/* Rx skb allocation retry timer */
 	int options;						/* User-settable misc. driver options. */
 	unsigned int media_override:4, 		/* Passed-in media type. */
 		default_media:4,				/* Read from the EEPROM/Wn3_Config. */
@@ -760,7 +759,6 @@ static void mdio_sync(struct vortex_private *vp, int bits);
 static int mdio_read(struct net_device *dev, int phy_id, int location);
 static void mdio_write(struct net_device *vp, int phy_id, int location, int value);
 static void vortex_timer(struct timer_list *t);
-static void rx_oom_timer(struct timer_list *t);
 static netdev_tx_t vortex_start_xmit(struct sk_buff *skb,
 				     struct net_device *dev);
 static netdev_tx_t boomerang_start_xmit(struct sk_buff *skb,
@@ -1601,7 +1599,6 @@ vortex_up(struct net_device *dev)
 
 	timer_setup(&vp->timer, vortex_timer, 0);
 	mod_timer(&vp->timer, RUN_AT(media_tbl[dev->if_port].wait));
-	timer_setup(&vp->rx_oom_timer, rx_oom_timer, 0);
 
 	if (vortex_debug > 1)
 		pr_debug("%s: Initial media type %s.\n",
@@ -1676,7 +1673,7 @@ vortex_up(struct net_device *dev)
 	window_write16(vp, 0x0040, 4, Wn4_NetDiag);
 
 	if (vp->full_bus_master_rx) { /* Boomerang bus master. */
-		vp->cur_rx = vp->dirty_rx = 0;
+		vp->cur_rx = 0;
 		/* Initialize the RxEarly register as recommended. */
 		iowrite16(SetRxThreshold + (1536>>2), ioaddr + EL3_CMD);
 		iowrite32(0x0020, ioaddr + PktStatus);
@@ -1729,6 +1726,7 @@ vortex_open(struct net_device *dev)
 	struct vortex_private *vp = netdev_priv(dev);
 	int i;
 	int retval;
+	dma_addr_t dma;
 
 	/* Use the now-standard shared IRQ implementation. */
 	if ((retval = request_irq(dev->irq, vp->full_bus_master_rx ?
@@ -1753,7 +1751,11 @@ vortex_open(struct net_device *dev)
 				break;			/* Bad news!  */
 
 			skb_reserve(skb, NET_IP_ALIGN);	/* Align IP on 16 byte boundaries */
-			vp->rx_ring[i].addr = cpu_to_le32(pci_map_single(VORTEX_PCI(vp), skb->data, PKT_BUF_SZ, PCI_DMA_FROMDEVICE));
+			dma = pci_map_single(VORTEX_PCI(vp), skb->data,
+					     PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
+			if (dma_mapping_error(&VORTEX_PCI(vp)->dev, dma))
+				break;
+			vp->rx_ring[i].addr = cpu_to_le32(dma);
 		}
 		if (i != RX_RING_SIZE) {
 			pr_emerg("%s: no memory for rx ring\n", dev->name);
@@ -2067,6 +2069,12 @@ vortex_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		int len = (skb->len + 3) & ~3;
 		vp->tx_skb_dma = pci_map_single(VORTEX_PCI(vp), skb->data, len,
 						PCI_DMA_TODEVICE);
+		if (dma_mapping_error(&VORTEX_PCI(vp)->dev, vp->tx_skb_dma)) {
+			dev_kfree_skb_any(skb);
+			dev->stats.tx_dropped++;
+			return NETDEV_TX_OK;
+		}
+
 		spin_lock_irq(&vp->window_lock);
 		window_set(vp, 7);
 		iowrite32(vp->tx_skb_dma, ioaddr + Wn7_MasterAddr);
@@ -2593,7 +2601,7 @@ boomerang_rx(struct net_device *dev)
 	int entry = vp->cur_rx % RX_RING_SIZE;
 	void __iomem *ioaddr = vp->ioaddr;
 	int rx_status;
-	int rx_work_limit = vp->dirty_rx + RX_RING_SIZE - vp->cur_rx;
+	int rx_work_limit = RX_RING_SIZE;
 
 	if (vortex_debug > 5)
 		pr_debug("boomerang_rx(): status %4.4x\n", ioread16(ioaddr+EL3_STATUS));
@@ -2614,7 +2622,8 @@ boomerang_rx(struct net_device *dev)
 		} else {
 			/* The packet length: up to 4.5K!. */
 			int pkt_len = rx_status & 0x1fff;
-			struct sk_buff *skb;
+			struct sk_buff *skb, *newskb;
+			dma_addr_t newdma;
 			dma_addr_t dma = le32_to_cpu(vp->rx_ring[entry].addr);
 
 			if (vortex_debug > 4)
@@ -2633,9 +2642,27 @@ boomerang_rx(struct net_device *dev)
 				pci_dma_sync_single_for_device(VORTEX_PCI(vp), dma, PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
 				vp->rx_copy++;
 			} else {
+				/* Pre-allocate the replacement skb.  If it or its
+				 * mapping fails then recycle the buffer thats already
+				 * in place
+				 */
+				newskb = netdev_alloc_skb_ip_align(dev, PKT_BUF_SZ);
+				if (!newskb) {
+					dev->stats.rx_dropped++;
+					goto clear_complete;
+				}
+				newdma = pci_map_single(VORTEX_PCI(vp), newskb->data,
+							PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
+				if (dma_mapping_error(&VORTEX_PCI(vp)->dev, newdma)) {
+					dev->stats.rx_dropped++;
+					consume_skb(newskb);
+					goto clear_complete;
+				}
+
 				/* Pass up the skbuff already on the Rx ring. */
 				skb = vp->rx_skbuff[entry];
-				vp->rx_skbuff[entry] = NULL;
+				vp->rx_skbuff[entry] = newskb;
+				vp->rx_ring[entry].addr = cpu_to_le32(newdma);
 				skb_put(skb, pkt_len);
 				pci_unmap_single(VORTEX_PCI(vp), dma, PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
 				vp->rx_nocopy++;
@@ -2653,55 +2680,15 @@ boomerang_rx(struct net_device *dev)
 			netif_rx(skb);
 			dev->stats.rx_packets++;
 		}
-		entry = (++vp->cur_rx) % RX_RING_SIZE;
-	}
-	/* Refill the Rx ring buffers. */
-	for (; vp->cur_rx - vp->dirty_rx > 0; vp->dirty_rx++) {
-		struct sk_buff *skb;
-		entry = vp->dirty_rx % RX_RING_SIZE;
-		if (vp->rx_skbuff[entry] == NULL) {
-			skb = netdev_alloc_skb_ip_align(dev, PKT_BUF_SZ);
-			if (skb == NULL) {
-				static unsigned long last_jif;
-				if (time_after(jiffies, last_jif + 10 * HZ)) {
-					pr_warn("%s: memory shortage\n",
-						dev->name);
-					last_jif = jiffies;
-				}
-				if ((vp->cur_rx - vp->dirty_rx) == RX_RING_SIZE)
-					mod_timer(&vp->rx_oom_timer, RUN_AT(HZ * 1));
-				break;			/* Bad news!  */
-			}
 
-			vp->rx_ring[entry].addr = cpu_to_le32(pci_map_single(VORTEX_PCI(vp), skb->data, PKT_BUF_SZ, PCI_DMA_FROMDEVICE));
-			vp->rx_skbuff[entry] = skb;
-		}
+clear_complete:
 		vp->rx_ring[entry].status = 0;	/* Clear complete bit. */
 		iowrite16(UpUnstall, ioaddr + EL3_CMD);
+		entry = (++vp->cur_rx) % RX_RING_SIZE;
 	}
 	return 0;
 }
 
-/*
- * If we've hit a total OOM refilling the Rx ring we poll once a second
- * for some memory.  Otherwise there is no way to restart the rx process.
- */
-static void
-rx_oom_timer(struct timer_list *t)
-{
-	struct vortex_private *vp = from_timer(vp, t, rx_oom_timer);
-	struct net_device *dev = vp->mii.dev;
-
-	spin_lock_irq(&vp->lock);
-	if ((vp->cur_rx - vp->dirty_rx) == RX_RING_SIZE)	/* This test is redundant, but makes me feel good */
-		boomerang_rx(dev);
-	if (vortex_debug > 1) {
-		pr_debug("%s: rx_oom_timer %s\n", dev->name,
-			((vp->cur_rx - vp->dirty_rx) != RX_RING_SIZE) ? "succeeded" : "retrying");
-	}
-	spin_unlock_irq(&vp->lock);
-}
-
 static void
 vortex_down(struct net_device *dev, int final_down)
 {
@@ -2711,7 +2698,6 @@ vortex_down(struct net_device *dev, int final_down)
 	netdev_reset_queue(dev);
 	netif_stop_queue(dev);
 
-	del_timer_sync(&vp->rx_oom_timer);
 	del_timer_sync(&vp->timer);
 
 	/* Turn off statistics ASAP.  We update dev->stats below. */
-- 
2.14.3

^ permalink raw reply related

* Re: "lockless" qdisc breaks tx_queue_len change too?
From: John Fastabend @ 2018-01-03 18:09 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAM_iQpU+VmMAvEwsP=i6AqATNxqqUo7kaATocyH8Y=jbyP9s3g@mail.gmail.com>

On 01/02/2018 08:41 PM, Cong Wang wrote:
> Hi, John
> 
> While reviewing your ptr_ring fix again today, it looks like your
> "lockless" qdisc patchset breaks dev->tx_queue_len behavior.
> 
> Before your patchset, dev->tx_queue_len is merely an integer to read,
> after your patchset, the skb array has to be resized when
> dev->tx_queue_len changes, but I don't see any qdisc code handles
> this...
> 
> Also, because of that, I doubt __skb_array_empty() in
> pfifo_fast_dequeue() can be safe any more even with your ptr_ring fix.
> 
> What am I missing?
> 

I dropped support for tx_queue_len changes after qdisc has been
created. The only check is at init time when building the qdisc.

Before this series teql and pfifo_fast were the only qdiscs that
used tx_queue_len other qdiscs used other mechanisms or copied
tx_queue_len at init time. So the API is inconsistent.

OK, but arguably its kAPI now and needs to be supported on live
qdiscs. So couple options drop the __skb_array_empty() check,
stop supporting changes on running qdiscs, or do a qdisc swap
with the new array.

I'm tempted to make the qdisc swap work, still need benchmarks
I guess without the empty check. Either way to get it working
we need a callback from tx_queue_len code paths.

Unfortunately, I guess someone somewhere probably uses pfifo_fast
and changes there queue length with a script after creating the
qdisc and expects it to work.

> Thanks.
> 

^ permalink raw reply

* Re: [PATCH net-next v2 4/4] net: mvpp2: 2500baseX support
From: Antoine Tenart @ 2018-01-03 18:08 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, kishon, gregory.clement, linux, mw,
	stefanc, ymarkman, thomas.petazzoni, miquel.raynal, nadavh,
	netdev, linux-kernel
In-Reply-To: <20180103155311.GD3401@lunn.ch>

Hi Andrew,

On Wed, Jan 03, 2018 at 04:53:11PM +0100, Andrew Lunn wrote:
> On Wed, Jan 03, 2018 at 04:32:27PM +0100, Antoine Tenart wrote:
> > On Wed, Jan 03, 2018 at 04:20:36PM +0100, Andrew Lunn wrote:
> > > > @@ -4612,6 +4616,9 @@ static int mvpp22_comphy_init(struct mvpp2_port *port)
> > > >  	case PHY_INTERFACE_MODE_1000BASEX:
> > > >  		mode = PHY_MODE_SGMII;
> > > >  		break;
> > > > +	case PHY_INTERFACE_MODE_2500BASEX:
> > > > +		mode = PHY_MODE_2500SGMII;
> > > > +		break;
> > > 
> > > I think this is the source of confusion with linux/phy.h and
> > > linux/phy/phy.h.
> > > 
> > > What would PHY_INTERFACE_MODE_2500SGMII use?
> > > 
> > > Where is this all getting confused? Should the caller to
> > > mvpp22_comphy_init() actually be passing PHY_INTERFACE_MODE_2500SGMII?
> > > What is the MAC actually doing at this point? 2500BASEX or 2500SGMII?
> > 
> > PHY_INTERFACE_MODE_2500BASEX is the PHY mode whereas PHY_MODE_2500SGMII
> > is the mode used by the common PHY driver (i.e. the one configuring the
> > serdes lanes).
> 
> > There's no PHY_INTERFACE_MODE_2500SGMII mode.
> 
> At the moment there is no PHY_INTERFACE_MODE_2500SGMII. However,
> there are some devices which can do 2.5G SGMII. So it will appear
> sometime. This piece of code then looks even stranger.

True. As said by Stefan the Marvell common PHY configures the serdes
lanes to a given mode, regardless of the actual use of the lanes by the
physical layer. So these modes are valid:

PPv2 (SGMII) - COMPHY (SGMII)
PPv2 (1000BaseX) - COMPHY (SGMII)
PPv2 (2500BaseX) - COMPHY (2500SGMII)

> > Sure, I can add a comment to state this function is a translation
> > between the net PHY mode and the generic PHY mode (it's a n-to-1
> > translation).
> 
> I think from an API design point of view, passing PHY_MODE_2500BASEX
> to comphy makes more sense. That is what the MAC wants to do. How the
> comphy achieves that should be internal to the comphy.

I though about that. The reason I did not is I wanted to use the
existing generic PHY framework helpers. They take a generic PHY mode in
input. I also wanted to avoid having a custom callback between the PPv2
driver and the COMPHY driver as phy_set_mode() seems to perfectly fit
the need.

The issue really is there are two PHY frameworks, one for network
devices and one for the other ones. The COMPHY is used by network
controllers, but also by SATA or USB ones.

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: Arkadi Sharshevsky @ 2018-01-03 18:05 UTC (permalink / raw)
  To: David Ahern, Jiri Pirko, netdev, roopa
  Cc: davem, mlxsw, andrew, vivien.didelot, f.fainelli, michael.chan,
	ganeshgr, saeedm, matanb, leonro, idosch, jakub.kicinski, ast,
	daniel, simon.horman, pieter.jansenvanvuuren, john.hurley,
	alexander.h.duyck, linville, gospo, steven.lin1, yuvalm, ogerlitz
In-Reply-To: <96389ae0-e038-8a26-84ea-0cf1b9fa0a05@cumulusnetworks.com>



On 01/02/2018 08:05 PM, David Ahern wrote:
> On 1/1/18 7:58 AM, Arkadi Sharshevsky wrote:
>>
>> Just to summarize the current fixes required:
>>
>> 1. ERIF dpipe table size is reporting wrong size. More precisely the
>>    ERIF table does not take rifs, so it should not be linked to the rif
>>    bank resource (is not part of this patchset, future extension).
>> 2. Extended ACK user-space bug.
>> 3. ABI documentation- Not sure we agreed upon it, Jiri?
>>
>> If I missed something please respond. Nothing of the fixes mentioned
>> above is relevant for this patchset actually.
>>
> 
> Can you fix the userspace command and then we come back to what else is
> needed? Right now, it is hard to tell what is a user space bug and what
> is a kernel space bug.
> 
> For example:
> $ devlink resource set pci/0000:03:00.0 path /kvd/linear size 10000
> $ devlink resource show pci/0000:03:00.0
> pci/0000:03:00.0:
>   name kvd size 245760 size_valid true
>   resources:
>     name linear size 98304 occ 0
>     name hash_double size 60416
>     name hash_single size 87040
> 
> The set command did not fail, yet there is no size_new arg in the output
> like there is for this change:
> 
> $ devlink resource set pci/0000:03:00.0 path /kvd/linear size 0
> $ devlink resource show pci/0000:03:00.0
> pci/0000:03:00.0:
>   name kvd size 245760 size_valid true
>   resources:
>     name linear size 98304 size_new 0 occ 0
>     name hash_double size 60416
>     name hash_single size 87040
> 

As I stated this is a user-space bug which I fixed, and updated my repo
so please pull. Devlink uses mnl,and currently mnl does not support
extended ack. I added support for this in my local ver of libmnl:

https://github.com/arkadis/libmnl.git

On branch master, so you can check it out. Besides this bugs, which were
userspace, can please specify what are the pending problems from your
point of view? Thanks!

^ permalink raw reply

* Re: [PATCH net-next v2 4/4] net: mvpp2: 2500baseX support
From: Russell King - ARM Linux @ 2018-01-03 18:05 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, kishon, gregory.clement, mw, stefanc,
	ymarkman, thomas.petazzoni, miquel.raynal, nadavh, netdev,
	linux-kernel
In-Reply-To: <20180103152036.GC3401@lunn.ch>

On Wed, Jan 03, 2018 at 04:20:36PM +0100, Andrew Lunn wrote:
> > @@ -4612,6 +4616,9 @@ static int mvpp22_comphy_init(struct mvpp2_port *port)
> >  	case PHY_INTERFACE_MODE_1000BASEX:
> >  		mode = PHY_MODE_SGMII;
> >  		break;
> > +	case PHY_INTERFACE_MODE_2500BASEX:
> > +		mode = PHY_MODE_2500SGMII;
> > +		break;
> 
> I think this is the source of confusion with linux/phy.h and
> linux/phy/phy.h.
> 
> What would PHY_INTERFACE_MODE_2500SGMII use?
> 
> Where is this all getting confused? Should the caller to
> mvpp22_comphy_init() actually be passing PHY_INTERFACE_MODE_2500SGMII?
> What is the MAC actually doing at this point? 2500BASEX or 2500SGMII?
> 
> At minimum there needs to be a comment that this is not a typ0,
> otherwise you are going to get patches submitted to 'fix' this.

Andrew,

I think the confusion comes from Marvell using "SGMII" as a generic name
for the serial gigabit ethernet interface, which includes both Cisco
SGMII and 802.3z BASE-X.

Remember, the only difference between these two modes is the contents of
the 16-bit control word and how that 16-bit control word is handled -
there's differences at the PCS level, but that's a level up from the
Serdes USB/PCIe/Ethernet PHY.

So, PHY_MODE_SGMII gets used to describe both Cisco SGMII and 1000BASE-X
modes, and PHY_MODE_2500SGMII is used to describe 2500BASE-X, and
possibly a fixed-speed 2.5G SGMII (the 2.5G SGMII information I've found
say that the speed bits not unused.)

It's confusing, but I'd say that the use of the "SGMII" is generally
confusing throughout this space, and I'd suggest people use "Cisco SGMII"
when they mean Cisco SGMII configuration word format and 802.3z BASE-X
when they mean the ethernet standard version, leaving "SGMII" available
as meaning the serdes stream of either.  Not ideal, but it seems that's
the way everyone else uses the "SGMII" term. :(

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply

* Re: [PATCH net-next 0/5] Export SERDES stats via ethtool -S
From: Russell King - ARM Linux @ 2018-01-03 17:58 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: David Miller, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <1514988562-20079-1-git-send-email-andrew@lunn.ch>

On Wed, Jan 03, 2018 at 03:09:17PM +0100, Andrew Lunn wrote:
> The mv88e6352 family has a SERDES interface which can be used for
> example to connect to SFF/SFP modules. This interface has a couple of
> statistics counters. Add support for including these counters in the
> output of ethtool -S.

Hi Andrew,

Isn't the serdes interface just a normal PHY - the register set in
Viviens debugfs patch certainly looks like an 88e1545 in one of the
switches I've here (I think on the dev rev B).  Is there a reason why
this PHY isn't visible as a normal PHY, just like we export the other
internal PHYs?

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply

* Re: [EXT] Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
From: Russell King - ARM Linux @ 2018-01-03 17:54 UTC (permalink / raw)
  To: Stefan Chulski
  Cc: Marcin Wojtas, Thomas Petazzoni, Andrew Lunn, Florian Fainelli,
	Yan Markman, Jason Cooper, netdev, Antoine Tenart,
	linux-kernel@vger.kernel.org, kishon@ti.com, Nadav Haklai,
	Miquèl Raynal, Gregory Clément, David S. Miller,
	linux-arm-kernel@lists.infradead.org, Sebastian Hesselbarth
In-Reply-To: <0e95e9b20e2d42bc95e04034bd88e781@IL-EXCH01.marvell.com>

On Wed, Jan 03, 2018 at 05:00:47PM +0000, Stefan Chulski wrote:
> > > > -----Original Message-----
> > > > Hi Russell,
> > > >
> > > > Indeed. RGMII MAC behaves same way, although it shouldn't be named
> > > > as 'in- band' to be on par with the specifications. Anyway - this
> > > > one is rather a stub for being able to work with ACPI, so once the
> > > > MDIO bus works there, this will be out of any concerns.
> > >
> > > Hi Marcin,
> > >
> > > This is correct.
> > > "in-band" supported only for SGMII mode.
> > > IRQ link interrupt depend on "in-band"' auto negation only if "in-band"'
> > enabled.
> > > But IRQ link interrupt could be triggered with "in-band", "out-band" or with
> > specific fixed speed/duplex/flow_contol.
> > 
> > Hi Stefan,
> > 
> > How does this work in RGMII mode - is this handled by the PP2 polling the PHY
> > to get the speed, duplex and flow control settings?
> > 
> 
> IRQ interrupt doesn't handled speed, duplex and flow control settings.
> It's just raised if number of criterions met:
> 1) Physical signal detected by MAC
> 2) MAC auto negotiation succeeded(valid only auto negotiation enabled in MAC and "in-band" bypass disabled)
> 
> So if auto negotiation mechanism disabled in MAC or bypassed, link status would changes to up and IRQ interrupt be triggered.
> 
> In case of RGMII mode obviously we don't have "in-band" auto negotiation and "out-band" cannot be used in Kernel(due to missed locks).
> So auto negotiation should be disabled on MAC level and speed/duplex/flow_contol would be negotiate by PHY. 
> phylink/phylib infrastructure should provide speed/duplex/flow_contol(that were agreed between PHY's) to ppv2 driver.

Sorry, I find this very confusing.

It seems we have some people telling me that when there's no PHY
described in DT, we use this link interrupt, and have a functional
network interface (presumably at whatever speed.)

I can't see this working from what you describe - what you describe
basically tells me that when in-band autonegotiation is disabled, and we
have no PHY in the kernel, then effectively we are in fixed-link mode -
since we need to know what speed, duplex and flow control settings to
use.

So, this means that mvpp2 should be enforcing the presence of a
fixed-link description in DT if there is no PHY node at the moment, but
it doesn't.

Instead, it looks to me like the speed and duplex settings are inherited
from the boot loader or whatever was running before - I can't find
anything that configures MVPP2_GMAC_AUTONEG_CONFIG in this case.  That
seems quite a mess.

Maybe I'm missing something, but I don't see how mvpp2 can be converted
to phylink given this without causing some kind of regression, unless
someone can be much clearer about exactly what kinds of link mvpp2
supports and how they work (which is basically what I asked back in
October.)

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply

* Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can overrun array bounds
From: John Fastabend @ 2018-01-03 17:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, jakub.kicinski, xiyou.wangcong, jiri, netdev
In-Reply-To: <20180103165249-mutt-send-email-mst@kernel.org>

On 01/03/2018 07:50 AM, Michael S. Tsirkin wrote:
> On Tue, Jan 02, 2018 at 04:25:03PM -0800, John Fastabend wrote:
>>>
>>> More generally, what makes this usage safe?
>>> Is there a way to formalize it at the API level?
>>>
>>
>> Right I think these are good questions. I think the ptr_ring API should
>> allow a peek operation to be used without a lock. The user has to ensure
>> they only use it as a hint and if its dereferenced the user needs to
>> ensure the object is not free'd from some other codepath while it is
>> being dereferenced. The existing API seems to match this.
>>
>> This is how I used it in pfifo_fast expecting the above to be true. The
>> API allows for false negatives which _should_ be OK if the user is expecting
>> this. Alternatively, we could make it false positives if you want and
>> that would also work for me considering this case is hit very rarely.
> 
> By now I'm confused by which are positives and which are negatives :)
> OK so the guarantees we want would be:
> 
> - empty can return false if ring is empty.
>   caller must re-check with consumer lock taken

Correct.

> - if multiple threads call it, only one thread

What is "it" here, __skb_array_empty() presumably.

>   is guaranteed that empty will not return true
>   if ring is non empty.
>   after detecting that ring is not empty,
>   this thread shall ....
> 
The guarantee is even weaker.

  - if __skb_array_empty() returns true, either the
    ring is empty OR a consumer operation is in
    progress.

For pfifo_fast this is OK because some thread is making
progress at this point. Note, even if multiple threads
are calling __skb_array_empty() there is no guarantee
any of them will return false even if the ring has
elements.

> can you help me fill in the blank please?
> 

Did that help?

> 
> 
> 
> 
>>>>> John, others, could you pls confirm it's not too bad performance-wise?
>>>>> I'll then split it up properly and re-post.
>>>>>
>>>>
>>>> I haven't benchmarked it but in the dequeue case taking a lock for
>>>> every priority even when its empty seems unneeded.
>>>
>>> Well it does seem to make the code more comprehensible and more robust.
>>>
>>
>> Its a trade-off between performance and robustness.
>>
>>> But OK -  I was looking at fixing the unlocked empty API to make sure it
>>> actually does what it's supposed to. I posted a draft earlier in this
>>> thread, it needs to be looked at in depth to figure out whether it can
>>> ever give false negatives or positives, and document the results.
>>>
>>>
>>
>> I'll look at it. But I still think keeping a lockless version makes sense
>> for many use cases.
> 
> Fine. Just let's try to document what are the guarantees.
> 

Yep. Guarantees should be (on ptr_ring implementation and similar
on skb_array implementation)

  - if __ptr_ring_empty returns true, the ring may be empty
    user must use operation with consumer lock to guarantee the
    ring is empty. Note, when run with concurrent consumer operations
    it is possible to return true when ring has elements. This
    occurs when ring consumer head rolls over ring size. Also,
    producer operations may put object in the ring concurrently
    making empty check incorrect. To avoid this use locked
    ptr_ring_empty() API.

  - if __ptr_ring_empty returns false, the ring may have elements.
    Possibly, scenario is consumer operation and __ptr_ring_empty
    operation run concurrently causing false to be returned when
    ring is empty. To avoid this use locked ptr_ring_empty() API.

>>>
>>>>> -->
>>>>>
>>>>> net: don't misuse ptr_ring_peek
>>>>>
>>>>> ptr_ring_peek only claims to be safe if the result is never
>>>>> dereferenced, which isn't the case for its use in sch_generic.
>>>>> Add locked API variants and use the bh one here.
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>
>>>>> ---
>>>>>
>>>>
>>>> [...]
>>>>
>>>>> --- a/net/sched/sch_generic.c
>>>>> +++ b/net/sched/sch_generic.c
>>>>> @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
>>>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>>>  		struct skb_array *q = band2list(priv, band);
>>>>>  
>>>>> -		skb = __skb_array_peek(q);
>>>>> +		skb = skb_array_peek_bh(q);
>>>>
>>>> Ah I should have added a comment here. For now peek() is only used from
>>>> locking qdiscs. So peek and consume/produce operations will never happen
>>>> in parallel. In this case we should never hit the false negative case with
>>>> my patch or the out of bounds reference without my patch.
> 
> OK so this is the part I missed. Can you add a comment please?
> 
> 

Yes, working on a documentation patch set to address misleading and missing
comments/documentation in qdisc layer now.

>>>> Doing a peek() op without qdisc lock is a bit problematic anyways. With
>>>> current code another cpu could consume the skb and free it. Either we
>>>> can ensure a single consumer runs at a time on an array (not the same as
>>>> qdisc maybe) or just avoid peek operations in this case. My current plan
>>>> was to just avoid peek() ops altogether, they seem unnecessary with the
>>>> types of qdiscs I want to be build.
>>>>
>>>> Thanks,
>>>> John
>>>
>>> For the lockless qdisc, for net-next, do we need the patch above until you fix that though?
>>>
>>
>> No, I think after this patch (net: ptr_ring: otherwise safe empty checks...) is
>> applied we do not need any additional fixes in net-next. Future work will
>> require the above patch (the one you provided) though so its useful work.
> 
> The one that avoids allocating extra memory?
> 

Let me try summarize again.

We need the extra memory slot patch if we expose the
__ptr_ring_empty/__skb_array_empty API. Otherwise we hit the out
of bounds issues. And for qdisc layer the __skb_array_empty() API
call is useful so we should not remove it.

In addition to the qdisc layer using the __skb_array_empty() API if
we need a peek() API that works without the qdisc lock then your patch,
the one adding the locked peek API, will be needed. It is not needed
though until we have a qdisc that would use it without the lock.

In summary we keep the extra allocation to make the __skb_array_empty() API
usable. And hold off exposing the skb_array_peek()/ptr_ring_peek() until we
have a user for it. We can roll your proposed patch into any series we come
up with for a new/updated qdisc.

> 
>> I'll do another review of the false positive case though to be sure the
>> current code is OK wrt handling false positives and any potential stalls.
> 
> 
> Thanks!
> 
>>>>>  	}
>>>>>  
>>>>>  	return skb;
>>>>>

^ permalink raw reply

* Re: [PATCH net-next 1/2] virtio_net: Introduce VIRTIO_NET_F_MASTER feature bit
From: Michael S. Tsirkin @ 2018-01-03 17:43 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: stephen, netdev, virtualization, virtio-dev, jesse.brandeburg
In-Reply-To: <1514939738-22423-2-git-send-email-sridhar.samudrala@intel.com>

On Tue, Jan 02, 2018 at 04:35:37PM -0800, Sridhar Samudrala wrote:
> This feature bit can be used by hypervisor to indicate virtio_net device to
> act as a master for a directly attached passthru device with the same MAC
> address.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---
>  drivers/net/virtio_net.c        | 3 ++-
>  include/uapi/linux/virtio_net.h | 1 +
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 6fb7b658a6cc..46844a1d9a62 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2796,7 +2796,8 @@ static struct virtio_device_id id_table[] = {
>  	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
>  	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>  	VIRTIO_NET_F_CTRL_MAC_ADDR, \
> -	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> +	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> +	VIRTIO_NET_F_MASTER
>  
>  static unsigned int features[] = {
>  	VIRTNET_FEATURES,
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index fc353b518288..a9b4e0836786 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -56,6 +56,7 @@
>  #define VIRTIO_NET_F_MQ	22	/* Device supports Receive Flow
>  					 * Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> +#define VIRTIO_NET_F_MASTER	62	/* act as master for a VF device */

Well the virtio hardware doesn't really know whether it's a master or
the slave of a bond. In fact quite possibly down the road we'll add a
virtual device on top as others seem to want to do - that other one will
be the master then.

And we do nothing do check it's a VF.

So maybe a better name would be

#define VIRTIO_NET_F_BACKUP	62	/* Act as backup for another device with the same MAC */

>  
>  #ifndef VIRTIO_NET_NO_LEGACY
>  #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
> -- 
> 2.14.3

^ permalink raw reply

* Re: [RFC PATCH net-next 03/19] ipv6: Clear nexthop flags upon netdev up
From: Ido Schimmel @ 2018-01-03 17:40 UTC (permalink / raw)
  To: David Ahern; +Cc: Ido Schimmel, netdev, davem, roopa, nicolas.dichtel, mlxsw
In-Reply-To: <2a815a43-a9e9-10bd-7b4b-a3996b6952d2@gmail.com>

On Wed, Jan 03, 2018 at 09:56:02AM -0700, David Ahern wrote:
> On 1/3/18 9:43 AM, Ido Schimmel wrote:
> > On Wed, Jan 03, 2018 at 08:32:51AM -0700, David Ahern wrote:
> >> On 1/3/18 12:44 AM, Ido Schimmel wrote:
> >>>>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> >>>>> index ed06b1190f05..b6405568ed7b 100644
> >>>>> --- a/net/ipv6/addrconf.c
> >>>>> +++ b/net/ipv6/addrconf.c
> >>>>> @@ -3484,6 +3484,9 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
> >>>>>  			if (run_pending)
> >>>>>  				addrconf_dad_run(idev);
> >>>>>  
> >>>>> +			/* Device has an address by now */
> >>>>> +			rt6_sync_up(dev, RTNH_F_DEAD);
> >>>>> +
> >>>>
> >>>> Seems like this should be in the NETDEV_UP section, say after
> >>>> addrconf_permanent_addr.
> >>>
> >>> Unless the `keep_addr_on_down` sysctl is set, then at this stage the
> >>> netdev doesn't have an IP address and we shouldn't clear the dead flag
> >>> just yet.
> >>>
> >>> This is consistent with IPv4 that clears the dead flag from nexthops in
> >>> a multipath route only if the nexthop device has an IP address. When the
> >>> last IPv4 address is removed from a netdev all the routes using it are
> >>> flushed and there's nothing to clear upon NETDEV_UP.
> >>
> >> I have a bug about that IPv4 handling from the FRR team:
> >>
> >> $ ip link add dummy1 type dummy
> >> $ ip li set dummy1 up
> >> $ ip route add 1.1.1.0/24 dev dummy1
> >>
> >> $ ip addr add dev dummy1 2.2.2.1/24
> >> $ ip ro ls | grep dummy1
> >> 1.1.1.0/24 dev dummy1 scope link
> >> 2.2.2.0/24 dev dummy1 proto kernel scope link src 2.2.2.1
> >>
> >> $ ip addr del dev dummy1 2.2.2.1/24
> >> $ ip ro ls | grep dummy1
> >> <no outpu>
> >>
> >> The 1.1.1.0/24 route was removed as well the 2.2.2.0 connected route.
> > 
> > If you're going to skip the flushing in this case, at least mark the
> > nexthops as dead.
> 
> On a down event, yes. If the device is still up then a route such as:
> $ ip route add 1.1.1.0/24 dev dummy1
> should still be usable even without an address on it.

mlxsw will trap all the packets hitting the route until you assign an IP
address to dummy1.

> > And this is my second reason to have rt6_sync_up() where I put it. I'm
> > preparing another set which sends FIB_EVENT_NH_ADD events from
> > rt6_sync_up() similar to what we've in fib_sync_up(). When mlxsw (others
> 
> On a tangent here, but I have been meaning to ask why you have
> FIB_EVENT_NH_ADD events as opposed to handling netdev events. What does
> a FIB_EVENT_NH_ADD provide that you can't do from a netdev event handler?

It'll make switch drivers more complex than they already are. Why every
driver needs to duplicate the logic in call_fib_nh_notifiers()?

> > in the future) processes the event it needs to add the nexthop back to
> > the forwarding plane. To do that, it needs to have a RIF for the
> > nexthop device. For the nexthop device to have a RIF, it needs at least
> > one IP address configured on the netdev.
> 
> Why is that?
> $ ip addr sh dev swp1s0.51
> 44: swp1s0.51@swp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> noqueue master vrf1101 state UP group default qlen 1000
>     link/ether 7c:fe:90:e8:3a:7d brd ff:ff:ff:ff:ff:ff
> 
> $ ip ro add vrf vrf1101 1.1.1.0/24 dev swp1s0.51
> 
> $ ip ro ls vrf vrf1101
> unreachable default metric 8192
> 1.1.1.0/24 dev swp1s0.51 scope link offload
> 
> In this case, I take it mlxsw allocates a rif because of the vlan. The
> above does not work on just swp1s0 -- ie., that route is not offloaded:
> 
> $ # ip ro ls
> ...
> 1.1.1.0/24 dev swp1s0 scope link
> ...
> 
> Interesting.

It allocates the RIF because of the enslavement to a VRF, which is an
explicit indication the user wants to use the interface for L3
forwarding.

David, can we please get back to the issue at hand? What's the problem
with the location of the call to rt6_sync_up()?

^ permalink raw reply

* [net 1/4] i40e: Remove UDP support for big buffer
From: Jeff Kirsher @ 2018-01-03 17:33 UTC (permalink / raw)
  To: davem; +Cc: Amritha Nambiar, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <20180103173317.72430-1-jeffrey.t.kirsher@intel.com>

From: Amritha Nambiar <amritha.nambiar@intel.com>

Since UDP based filters are not supported via big buffer cloud
filters, remove UDP support.  Also change a few return types to
indicate unsupported vs invalid configuration.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 321d8be80871..fffd4868defb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6038,8 +6038,8 @@ static int i40e_validate_and_set_switch_mode(struct i40e_vsi *vsi)
 	/* Set Bit 7 to be valid */
 	mode = I40E_AQ_SET_SWITCH_BIT7_VALID;
 
-	/* Set L4type to both TCP and UDP support */
-	mode |= I40E_AQ_SET_SWITCH_L4_TYPE_BOTH;
+	/* Set L4type for TCP support */
+	mode |= I40E_AQ_SET_SWITCH_L4_TYPE_TCP;
 
 	/* Set cloud filter mode */
 	mode |= I40E_AQ_SET_SWITCH_MODE_NON_TUNNEL;
@@ -6969,18 +6969,18 @@ static int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
 	     is_valid_ether_addr(filter->src_mac)) ||
 	    (is_multicast_ether_addr(filter->dst_mac) &&
 	     is_multicast_ether_addr(filter->src_mac)))
-		return -EINVAL;
+		return -EOPNOTSUPP;
 
-	/* Make sure port is specified, otherwise bail out, for channel
-	 * specific cloud filter needs 'L4 port' to be non-zero
+	/* Big buffer cloud filter needs 'L4 port' to be non-zero. Also, UDP
+	 * ports are not supported via big buffer now.
 	 */
-	if (!filter->dst_port)
-		return -EINVAL;
+	if (!filter->dst_port || filter->ip_proto == IPPROTO_UDP)
+		return -EOPNOTSUPP;
 
 	/* adding filter using src_port/src_ip is not supported at this stage */
 	if (filter->src_port || filter->src_ipv4 ||
 	    !ipv6_addr_any(&filter->ip.v6.src_ip6))
-		return -EINVAL;
+		return -EOPNOTSUPP;
 
 	/* copy element needed to add cloud filter from filter */
 	i40e_set_cld_element(filter, &cld_filter.element);
@@ -6991,7 +6991,7 @@ static int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
 	    is_multicast_ether_addr(filter->src_mac)) {
 		/* MAC + IP : unsupported mode */
 		if (filter->dst_ipv4)
-			return -EINVAL;
+			return -EOPNOTSUPP;
 
 		/* since we validated that L4 port must be valid before
 		 * we get here, start with respective "flags" value
-- 
2.15.1

^ permalink raw reply related

* [net 4/4] i40e: flower: Fix return value for unsupported offload
From: Jeff Kirsher @ 2018-01-03 17:33 UTC (permalink / raw)
  To: davem; +Cc: Jiri Pirko, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20180103173317.72430-1-jeffrey.t.kirsher@intel.com>

From: Jiri Pirko <jiri@mellanox.com>

When filter configuration is not supported, drivers should return
-EOPNOTSUPP so the core can react correctly.

Fixes: 2f4b411a3d67 ("i40e: Enable cloud filters via tc-flower")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 9e4b78e447f8..42dcaefc4c19 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7371,7 +7371,7 @@ static int i40e_configure_clsflower(struct i40e_vsi *vsi,
 
 	if (tc < 0) {
 		dev_err(&vsi->back->pdev->dev, "Invalid traffic class\n");
-		return -EINVAL;
+		return -EOPNOTSUPP;
 	}
 
 	if (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state) ||
-- 
2.15.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox