Netdev List
 help / color / mirror / Atom feed
* Re: [RFC bpf-next 0/5] Convert iproute2 to use libbpf (WIP)
From: Toke Høiland-Jørgensen @ 2019-08-23 11:29 UTC (permalink / raw)
  To: Andrii Nakryiko, Daniel Borkmann
  Cc: Stephen Hemminger, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Yonghong Song, David Miller, Jesper Dangaard Brouer, Networking,
	bpf
In-Reply-To: <CAEf4Bzab_w0AXy5P9mG14mcyJVgUCzuuNda5FpU5wSEwUciGfg@mail.gmail.com>

[ ... snip ...]

> E.g., today's API is essentially three steps:
>
> 1. open and parse ELF: collect relos, programs, map definitions
> 2. load: create maps from collected defs, do program/global data/CO-RE
> relocs, load and verify BPF programs
> 3. attach programs one by one.
>
> Between step 1 and 2 user has flexibility to create more maps, set up
> map-in-map, etc. Between 2 and 3 you can fill in global data, fill in
> tail call maps, etc. That's already pretty flexible. But we can tune
> and break apart those steps even further, if necessary.

Today, steps 1 and 2 can be collapsed into a single call to
bpf_prog_load_xattr(). As Jesper's mail explains, for XDP we don't
generally want to do all the fancy rewriting stuff, we just want a
simple way to load a program and get reusable pinning of maps.
Preferably in a way that is compatible with the iproute2 loader.

So I really think we need two things:

(1) a flexible API that splits up all the various steps in a way that
    allows programs to inject their own map definitions before
    relocations and loading

(2) a simple convenience wrapper that loads an object file, does
    something sensible with pinning and map-in-map definitions, and loads
    everything into the kernel.

I'd go so far as to say that (2) should even support system-wide
configuration, similar to the /etc/iproute2/bpf_pinning file. E.g., an
/etc/libbpf/pinning.conf file that sets the default pinning directory,
and makes it possible to set up pin-value-to-subdir mappings like what
iproute2 does today.

Having (2) makes it more likely that all the different custom loaders
will be compatible with each other, while still allowing people to do
their own custom thing with (1). And of course, (2) could be implemented
in terms of (1) internally in libbpf.

In my ideal world, (2) would just use the definition format already in
iproute2 (this is basically what I implemented already), but if you guys
don't want to put this into libbpf, I can probably live with the default
format being BTF-based instead. Which would mean that iproute2 I would
end up with a flow like this:

- When given an elf file, try to run it through the "standard loader"
  (2). If this works, great, proceed to program attach.

- If using (2) fails because it doesn't understand the map definition,
  fall back to a compatibility loader that parses the legacy iproute2
  map definition format and uses (1) to load that.


Does the above make sense? :)

-Toke

^ permalink raw reply

* [PATCH net-next] net: ipv6: fix listify ip6_rcv_finish in case of forwarding
From: Xin Long @ 2019-08-23 11:33 UTC (permalink / raw)
  To: network dev, linux-sctp
  Cc: Marcelo Ricardo Leitner, Neil Horman, davem,
	Jesper Dangaard Brouer, Edward Cree, Dmitry Vyukov,
	syzkaller-bugs

We need a similar fix for ipv6 as Commit 0761680d5215 ("net: ipv4: fix
listify ip_rcv_finish in case of forwarding") does for ipv4.

This issue can be reprocuded by syzbot since Commit 323ebb61e32b ("net:
use listified RX for handling GRO_NORMAL skbs") on net-next. The call
trace was:

  kernel BUG at include/linux/skbuff.h:2225!
  RIP: 0010:__skb_pull include/linux/skbuff.h:2225 [inline]
  RIP: 0010:skb_pull+0xea/0x110 net/core/skbuff.c:1902
  Call Trace:
    sctp_inq_pop+0x2f1/0xd80 net/sctp/inqueue.c:202
    sctp_endpoint_bh_rcv+0x184/0x8d0 net/sctp/endpointola.c:385
    sctp_inq_push+0x1e4/0x280 net/sctp/inqueue.c:80
    sctp_rcv+0x2807/0x3590 net/sctp/input.c:256
    sctp6_rcv+0x17/0x30 net/sctp/ipv6.c:1049
    ip6_protocol_deliver_rcu+0x2fe/0x1660 net/ipv6/ip6_input.c:397
    ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:438
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip6_input+0xe4/0x3f0 net/ipv6/ip6_input.c:447
    dst_input include/net/dst.h:442 [inline]
    ip6_sublist_rcv_finish+0x98/0x1e0 net/ipv6/ip6_input.c:84
    ip6_list_rcv_finish net/ipv6/ip6_input.c:118 [inline]
    ip6_sublist_rcv+0x80c/0xcf0 net/ipv6/ip6_input.c:282
    ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:316
    __netif_receive_skb_list_ptype net/core/dev.c:5049 [inline]
    __netif_receive_skb_list_core+0x5fc/0x9d0 net/core/dev.c:5097
    __netif_receive_skb_list net/core/dev.c:5149 [inline]
    netif_receive_skb_list_internal+0x7eb/0xe60 net/core/dev.c:5244
    gro_normal_list.part.0+0x1e/0xb0 net/core/dev.c:5757
    gro_normal_list net/core/dev.c:5755 [inline]
    gro_normal_one net/core/dev.c:5769 [inline]
    napi_frags_finish net/core/dev.c:5782 [inline]
    napi_gro_frags+0xa6a/0xea0 net/core/dev.c:5855
    tun_get_user+0x2e98/0x3fa0 drivers/net/tun.c:1974
    tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2020

Fixes: d8269e2cbf90 ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
Fixes: 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL skbs")
Reported-by: syzbot+eb349eeee854e389c36d@syzkaller.appspotmail.com
Reported-by: syzbot+4a0643a653ac375612d1@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/ipv6/ip6_input.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index fa014d5..d432d00 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -80,8 +80,10 @@ static void ip6_sublist_rcv_finish(struct list_head *head)
 {
 	struct sk_buff *skb, *next;
 
-	list_for_each_entry_safe(skb, next, head, list)
+	list_for_each_entry_safe(skb, next, head, list) {
+		skb_list_del_init(skb);
 		dst_input(skb);
+	}
 }
 
 static void ip6_list_rcv_finish(struct net *net, struct sock *sk,
-- 
2.1.0


^ permalink raw reply related

* Re: [PATCH 1/2] rtnetlink: gate MAC address with an LSM hook
From: Jeffrey Vander Stoep @ 2019-08-23 11:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, LSM List, selinux
In-Reply-To: <20190822.161913.326746900077543343.davem@davemloft.net>

On Fri, Aug 23, 2019 at 1:19 AM David Miller <davem@davemloft.net> wrote:
>
> From: Jeff Vander Stoep <jeffv@google.com>
> Date: Wed, 21 Aug 2019 15:45:47 +0200
>
> > MAC addresses are often considered sensitive because they are
> > usually unique and can be used to identify/track a device or
> > user [1].
> >
> > The MAC address is accessible via the RTM_NEWLINK message type of a
> > netlink route socket[2]. Ideally we could grant/deny access to the
> > MAC address on a case-by-case basis without blocking the entire
> > RTM_NEWLINK message type which contains a lot of other useful
> > information. This can be achieved using a new LSM hook on the netlink
> > message receive path. Using this new hook, individual LSMs can select
> > which processes are allowed access to the real MAC, otherwise a
> > default value of zeros is returned. Offloading access control
> > decisions like this to an LSM is convenient because it preserves the
> > status quo for most Linux users while giving the various LSMs
> > flexibility to make finer grained decisions on access to sensitive
> > data based on policy.
> >
> > [1] https://adamdrake.com/mac-addresses-udids-and-privacy.html
> > [2] Other access vectors like ioctl(SIOCGIFHWADDR) are already covered
> > by existing LSM hooks.
> >
> > Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
>
> I'm sure the MAC address will escape into userspace via other means,
> dumping pieces of networking config in other contexts, etc.  I mean,
> if I can get a link dump, I can dump the neighbor table as well.

These are already gated by existing LSM hooks and capability checks.
They are not allowed on mandatory access control systems unless explicitly
granted.

>
> I kinda think this is all very silly whack-a-mole kind of stuff, to
> be quite honest.

We evaluated mechanisms for the MAC to reach unprivileged apps.
A number of researchers have published on this as well such as:
https://www.usenix.org/conference/usenixsecurity19/presentation/reardon

Three "leaks" were identified, two have already been fixed.
-ioctl(SIOCGIFHWADDR). Fixed using finer grained LSM checks
on socket ioctls (similar to this change).
-IPv6 IP addresses. Fixed by no longer including the MAC as part
of the IP address.
-RTM_NEWLINK netlink route messages. The last mole to be whacked.

>
> And like others have said, tomorrow you'll be like "oh crap, we should
> block X too" and we'll get another hook, another config knob, another
> rulset update, etc.

This seems like an issue inherent with permissions/capabilities. I don’t
think we should abandon the concept of permissions because someone
can forget to add a check.  Likewise, if someone adds new code to the
kernel and omits a capable(CAP_NET_*) check, I would expect it to be
fixed like any other bug without the idea of capability checks being tossed
out.

We need to do something because this information is being abused. Any
recommendations? This seemed like the simplest approach, but I can
definitely appreciate that it has downsides.

I could make this really generic by adding a single hook to the end of
sock_msgrecv() which would allow an LSM to modify the message to omit
the MAC address and any other information that we deem as sensitive in the
future. Basically what Casey was suggesting. Thoughts on that approach?

Thanks for your help on this.

^ permalink raw reply

* Re: [PATCH v2 2/5] spi: spi-fsl-dspi: Exit the ISR with IRQ_NONE when it's not ours
From: Vladimir Oltean @ 2019-08-23 12:06 UTC (permalink / raw)
  To: Mark Brown; +Cc: linux-spi, lkml, devicetree, netdev
In-Reply-To: <20190823105949.GQ23391@sirena.co.uk>

Hi Mark,

On Fri, 23 Aug 2019 at 13:59, Mark Brown <broonie@kernel.org> wrote:
>
> On Fri, Aug 23, 2019 at 11:50:44AM +0100, Mark Brown wrote:
> > On Fri, Aug 23, 2019 at 01:30:27PM +0300, Vladimir Oltean wrote:
>
> > > Did you see this?
> > > https://lkml.org/lkml/2019/8/22/1542
>
> > I'm not online enough to readily follow that link right now, I
> > did apply another patch for a similar issue though.  If that's
> > a different version of the same change please don't do that,
> > sending multiple conflicting versions of the same thing creates
> > conflicts and makes everything harder to work with.
>
> Oh, I guess this was due to there being an existing refactoring
> in -next that meant the fix wouldn't apply directly.  I sorted
> that out now I think, but in general the same thing applies -
> it's better to put fixes before anything else in the series,
> it'll flag up when reviewing.

I try to require as little attention span as possible from you and I
apologize that I'm sending patches like a noob, but I'm not used to
this sort of interaction with a maintainer. It's taking me some time
to adjust expectations.
- You left change requests in the initial patchset I submitted, but
you partially applied the series anyway. You didn't give me a chance
to respin the whole series and put the shared IRQ fix on top, so it
applies on old trees as well. No problem, I sent two versions of the
patch.
- On my previous series you left this comment:

> Please don't include all the extra tags you've got in your subject
> lines.  In my inbox this series looks like:
>
>    790 N T 08/18 Vladimir Oltean ( 16K) ├─>[PATCH spi for-5.4 01/14] spi: spi-f
>
> so I can't tell what it's actually about just from looking at it.  Just
> [PATCH 01/14] would be enough, putting a target version in or versioning
> the patch series can be OK but you usually don't use a target version
> for -next and adding spi in there is redundant given that it's also in
> the changelog.

So I didn't put any target version in the patch titles this time,
although arguably it would have been clearer to you that there's a
patch for-5.4 and another version of it for-4.20 (which i *think* is
how I should submit a fix, I don't see any branch for inclusion in
stable trees per se).
No problem, I explained in the cover letters that one patchset is for
-next and the other is for inclusion in stable trees. Maintainers do
read cover letters, right?
Message from the -next cover letter:

> The series also contains a bug fix for the shared IRQ of the DSPI
> driver. I am going to respin a version of it as a separate patch for
> inclusion in stable trees, independent of this patchset.

Message from the fix's cover letter:

> This patch is taken out of the "Poll mode for NXP DSPI driver" series
> and respun against the "for-4.20" branch.
> $(git describe --tags 13aed2392741) shows:
> v4.20-rc1-18-g13aed2392741

Yes, I did send a cover letter for a single patch. I thought it's
harder to miss than a note hidden under patch 2/5 of one series, and
in the note section of the other's. I think you could have also made
an argument about me not doing it the other way around. In the end,
you can not read a note just as well as not read a cover letter, and
there's little I can do.

No problem, you missed the link between the two. I sent you a link to
the lkml archive. You said "I'm not online enough to readily follow
that link right now". Please teach me - I really don't know - how can
I make links between patchsets easier for you to follow, if you don't
read cover letters and you can't access lkml? I promise I'll use that
method next time.

> I do frequently catch up on my mail on flights or while otherwise
> travelling so this is even more pressing for me than just being about
> making things a bit easier to read.

Maybe you simply should do something else while traveling, just saying.

Regards,
-Vladimir

^ permalink raw reply

* Re: [PATCH net-next] net: ipv6: fix listify ip6_rcv_finish in case of forwarding
From: Edward Cree @ 2019-08-23 12:07 UTC (permalink / raw)
  To: Xin Long, network dev, linux-sctp
  Cc: Marcelo Ricardo Leitner, Neil Horman, davem,
	Jesper Dangaard Brouer, Dmitry Vyukov, syzkaller-bugs
In-Reply-To: <e355527b374f6ce70fcc286457f87592cd8f3dcc.1566559983.git.lucien.xin@gmail.com>

On 23/08/2019 12:33, Xin Long wrote:
> We need a similar fix for ipv6 as Commit 0761680d5215 ("net: ipv4: fix
> listify ip_rcv_finish in case of forwarding") does for ipv4.
>
> This issue can be reprocuded by syzbot since Commit 323ebb61e32b ("net:
> use listified RX for handling GRO_NORMAL skbs") on net-next. The call
> trace was:
>
>   kernel BUG at include/linux/skbuff.h:2225!
>   RIP: 0010:__skb_pull include/linux/skbuff.h:2225 [inline]
>   RIP: 0010:skb_pull+0xea/0x110 net/core/skbuff.c:1902
>   Call Trace:
>     sctp_inq_pop+0x2f1/0xd80 net/sctp/inqueue.c:202
>     sctp_endpoint_bh_rcv+0x184/0x8d0 net/sctp/endpointola.c:385
>     sctp_inq_push+0x1e4/0x280 net/sctp/inqueue.c:80
>     sctp_rcv+0x2807/0x3590 net/sctp/input.c:256
>     sctp6_rcv+0x17/0x30 net/sctp/ipv6.c:1049
>     ip6_protocol_deliver_rcu+0x2fe/0x1660 net/ipv6/ip6_input.c:397
>     ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:438
>     NF_HOOK include/linux/netfilter.h:305 [inline]
>     NF_HOOK include/linux/netfilter.h:299 [inline]
>     ip6_input+0xe4/0x3f0 net/ipv6/ip6_input.c:447
>     dst_input include/net/dst.h:442 [inline]
>     ip6_sublist_rcv_finish+0x98/0x1e0 net/ipv6/ip6_input.c:84
>     ip6_list_rcv_finish net/ipv6/ip6_input.c:118 [inline]
>     ip6_sublist_rcv+0x80c/0xcf0 net/ipv6/ip6_input.c:282
>     ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:316
>     __netif_receive_skb_list_ptype net/core/dev.c:5049 [inline]
>     __netif_receive_skb_list_core+0x5fc/0x9d0 net/core/dev.c:5097
>     __netif_receive_skb_list net/core/dev.c:5149 [inline]
>     netif_receive_skb_list_internal+0x7eb/0xe60 net/core/dev.c:5244
>     gro_normal_list.part.0+0x1e/0xb0 net/core/dev.c:5757
>     gro_normal_list net/core/dev.c:5755 [inline]
>     gro_normal_one net/core/dev.c:5769 [inline]
>     napi_frags_finish net/core/dev.c:5782 [inline]
>     napi_gro_frags+0xa6a/0xea0 net/core/dev.c:5855
>     tun_get_user+0x2e98/0x3fa0 drivers/net/tun.c:1974
>     tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2020
>
> Fixes: d8269e2cbf90 ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
> Fixes: 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL skbs")
> Reported-by: syzbot+eb349eeee854e389c36d@syzkaller.appspotmail.com
> Reported-by: syzbot+4a0643a653ac375612d1@syzkaller.appspotmail.com
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Thanks for catching this.
> ---
>  net/ipv6/ip6_input.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index fa014d5..d432d00 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -80,8 +80,10 @@ static void ip6_sublist_rcv_finish(struct list_head *head)
>  {
>  	struct sk_buff *skb, *next;
>  
> -	list_for_each_entry_safe(skb, next, head, list)
> +	list_for_each_entry_safe(skb, next, head, list) {
> +		skb_list_del_init(skb);
>  		dst_input(skb);
> +	}
>  }
>  
>  static void ip6_list_rcv_finish(struct net *net, struct sock *sk,


^ permalink raw reply

* Re: [PATCH net v3] ixgbe: fix double clean of tx descriptors with xdp
From: Eelco Chaudron @ 2019-08-23 12:10 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: netdev, linux-kernel, bpf, David S. Miller, Björn Töpel,
	Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jeff Kirsher, intel-wired-lan, William Tu,
	Alexander Duyck
In-Reply-To: <20190822171237.20798-1-i.maximets@samsung.com>



On 22 Aug 2019, at 19:12, Ilya Maximets wrote:

> Tx code doesn't clear the descriptors' status after cleaning.
> So, if the budget is larger than number of used elems in a ring, some
> descriptors will be accounted twice and xsk_umem_complete_tx will move
> prod_tail far beyond the prod_head breaking the completion queue ring.
>
> Fix that by limiting the number of descriptors to clean by the number
> of used descriptors in the tx ring.
>
> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
> 'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
> 'next_to_clean' and 'next_to_use' indexes.
>
> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> ---
>
> Version 3:
>   * Reverted some refactoring made for v2.
>   * Eliminated 'budget' for tx clean.
>   * prefetch returned.
>
> Version 2:
>   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
>     'ixgbe_xsk_clean_tx_ring()'.
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 29 
> ++++++++------------
>  1 file changed, 11 insertions(+), 18 deletions(-)
>

Did some test with and without the fix applied. For PVP the results are 
a little different depending on the packet size (note this is a single 
run, no deviation).
For the same physical port in and out it’s faster! Note this was OVS 
AF_XDP using a XENA tester at 10G wire speed.


+--------------------------------------------------------------------------------+
| Physical to Virtual to Physical test, L3 flows[port redirect]          
         |
+-----------------+--------------------------------------------------------------+
|                 | Packet size                                          
         |
+-----------------+--------+--------+--------+--------+--------+--------+--------+
| Number of flows |   64   |  128   |  256   |  512   |  768   |  1024  
|  1514  |
+-----------------+--------+--------+--------+--------+--------+--------+--------+
| [NO FIX]   1000 | 739161 | 700091 | 690034 | 659894 | 618128 | 594223 
| 537504 |
+-----------------+--------+--------+--------+--------+--------+--------+--------+
| [FIX]      1000 | 742317 | 708391 | 689952 | 658034 | 626056 | 587653 
| 530885 |
+-----------------+--------+--------+--------+--------+--------+--------+--------+

+--------------------------------------------------------------------------------------+
| Physical loopback test, L3 flows[port redirect]                        
               |
+-----------------+--------------------------------------------------------------------+
|                 | Packet size                                          
               |
+-----------------+---------+---------+---------+---------+---------+---------+--------+
| Number of flows |   64    |   128   |   256   |   512   |   768   |  
1024   |  1514  |
+-----------------+---------+---------+---------+---------+---------+---------+--------+
| [NO FIX]   1000 | 2573298 | 2227578 | 2514318 | 2298204 | 1081861 | 
1015173 | 788081 |
+-----------------+---------+---------+---------+---------+---------+---------+--------+
| [FIX]      1000 | 3343188 | 3234993 | 3151833 | 2349597 | 1586276 | 
1197304 | 814854 |
+-----------------+---------+---------+---------+---------+---------+---------+--------+


Tested-by: Eelco Chaudron <echaudro@redhat.com>

^ permalink raw reply

* Re: [PATCH 0/3] Add NETIF_F_HW_BRIDGE feature
From: Horatiu Vultur @ 2019-08-23 12:26 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: roopa, davem, UNGLinuxDriver, alexandre.belloni, allan.nielsen,
	netdev, linux-kernel, bridge
In-Reply-To: <b2c52206-82d1-ef28-aeec-a5dcdbe9df6c@cumulusnetworks.com>

The 08/23/2019 01:26, Nikolay Aleksandrov wrote:
> External E-Mail
> 
> 
> On 8/23/19 1:09 AM, Nikolay Aleksandrov wrote:
> > On 22/08/2019 22:07, Horatiu Vultur wrote:
> >> Current implementation of the SW bridge is setting the interfaces in
> >> promisc mode when they are added to bridge if learning of the frames is
> >> enabled.
> >> In case of Ocelot which has HW capabilities to switch frames, it is not
> >> needed to set the ports in promisc mode because the HW already capable of
> >> doing that. Therefore add NETIF_F_HW_BRIDGE feature to indicate that the
> >> HW has bridge capabilities. Therefore the SW bridge doesn't need to set
> >> the ports in promisc mode to do the switching.
> >> This optimization takes places only if all the interfaces that are part
> >> of the bridge have this flag and have the same network driver.
> >>
> >> If the bridge interfaces is added in promisc mode then also the ports part
> >> of the bridge are set in promisc mode.
> >>
> >> Horatiu Vultur (3):
> >>   net: Add HW_BRIDGE offload feature
> >>   net: mscc: Use NETIF_F_HW_BRIDGE
> >>   net: mscc: Implement promisc mode.
> >>
> >>  drivers/net/ethernet/mscc/ocelot.c | 26 ++++++++++++++++++++++++--
> >>  include/linux/netdev_features.h    |  3 +++
> >>  net/bridge/br_if.c                 | 29 ++++++++++++++++++++++++++++-
> >>  net/core/ethtool.c                 |  1 +
> >>  4 files changed, 56 insertions(+), 3 deletions(-)
> >>
> > 
> 

Hi Nikolay,

> Just to clarify:
> > IMO the name is misleading.
> - that's not mandatory or anything, just saying people might get confused when they see it

Naming is hard, I properly need to go back and find a better name once
the patch is more mature and the problem/purpose is better understood.

But you are right, this is a bad name, I will find a better one.
> 
> > Why do the devices have to be from the same driver ?
After seeing yours and Andrews comments I realize that I try to do two
things, but I have only explained one of them.

Here is what I was trying to do:
A. Prevent ports from going into promisc mode, if it is not needed.
B. Prevent adding the CPU to the flood-mask (in Ocelot we have a
flood-mask controlling who should be included when flooding due to
destination unknown).

We have been thinking of these two as the same thing, but they are in
fact quite different. It is because of "B" we had to require all the
devices to be controlled by the same Switch.

For item "A" I do not think we need this restriction. In this patch we
will continue only focusing on item "A".

Sorry for the confusion. I will do a new patch that does not have this
restriction (which will also make it simpler).

It just needs to check for the flag NETIF_F_HW_BRIDGE and if it is not
set then set the port in promisc mode.

To solve item "B", the network driver needs to detect if there is a
foreign interfaces added to the bridge. If that is the case then to add
the CPU port to the flooding mask otherwise no. But this part will be in
a different patch series.

> > This is too specific targeting some devices.
Maybe I was wrong to mention specific HW in the commit message. The
purpose of the patch was to add an optimization (not to copy all the
frames to the CPU) for HW that is capable of learning and flooding the
frames.

I would expect that most switching HW would benefit from this.

> > The bridge should not care what's the port device, it should be the other way
Not sure I understand how that could be done. Are you suggesting that
the flag belongs to another structure? If you have something specific in
mind, then please let us know.

> That was mostly a rhetorical question, it's obvious why but please add an explanation
> at least in the commit message and please fix the typos in the comment. Also HW
> is capable of doing switching, this needs some clarification that the whole process
> stays in HW IIUC. More details here would be great.

Yes, I will add more details in the commit message and in code.
> > around, so adding device-specific code to the bridge is not ok. Isn't there a solution
> > where you can use NETDEV_JOIN and handle it all from your driver ?
> > Would all HW-learned entries be hidden from user-space in this case ?

Yes, they would. But if the network driver will call
'call_switchdev_notifiers' with SWITCHDEV_FDB_ADD_TO_BRIDGE and
SWITCHDEV_FDB_DEL_TO_BRIDGE then the SW will see these entries.
> > 
> I.e. isn't there a way to do this without introducing a new feature flag ?
I do not have any better ideas.
>  
> 

-- 
/Horatiu

^ permalink raw reply

* [PATCH net-next] net/mlx5: Fix return code in case of hyperv wrong size read
From: Eran Ben Elisha @ 2019-08-23 12:34 UTC (permalink / raw)
  To: netdev, David S. Miller; +Cc: Saeed Mahameed, Haiyang Zhang, Eran Ben Elisha

Return code value could be non deterministic in case of wrong size read.
With this patch, if such error occurs, set rc to be -EIO.

In addition, mlx5_hv_config_common() supports reading of
HV_CONFIG_BLOCK_SIZE_MAX bytes only, fix to early return error with
bad input.

Fixes: 913d14e86657 ("net/mlx5: Add wrappers for HyperV PCIe operations")
Reported-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c
index cf08d02703fb..583dc7e2aca8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c
@@ -12,7 +12,7 @@ static int mlx5_hv_config_common(struct mlx5_core_dev *dev, void *buf, int len,
 	int bytes_returned;
 	int block_id;
 
-	if (offset % HV_CONFIG_BLOCK_SIZE_MAX || len % HV_CONFIG_BLOCK_SIZE_MAX)
+	if (offset % HV_CONFIG_BLOCK_SIZE_MAX || len != HV_CONFIG_BLOCK_SIZE_MAX)
 		return -EINVAL;
 
 	block_id = offset / HV_CONFIG_BLOCK_SIZE_MAX;
@@ -25,8 +25,8 @@ static int mlx5_hv_config_common(struct mlx5_core_dev *dev, void *buf, int len,
 				  HV_CONFIG_BLOCK_SIZE_MAX, block_id);
 
 	/* Make sure len bytes were read successfully  */
-	if (read)
-		rc |= !(len == bytes_returned);
+	if (read && !rc && len != bytes_returned)
+		rc = -EIO;
 
 	if (rc) {
 		mlx5_core_err(dev, "Failed to %s hv config, err = %d, len = %d, offset = %d\n",
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH 1/3] net: Add HW_BRIDGE offload feature
From: Horatiu Vultur @ 2019-08-23 12:39 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: roopa, nikolay, davem, UNGLinuxDriver, alexandre.belloni,
	allan.nielsen, netdev, linux-kernel, bridge
In-Reply-To: <20190822200817.GD21295@lunn.ch>

The 08/22/2019 22:08, Andrew Lunn wrote:
> External E-Mail
> 
> 
> > +/* Determin if the SW bridge can be offloaded to HW. Return true if all
> > + * the interfaces of the bridge have the feature NETIF_F_HW_SWITCHDEV set
> > + * and have the same netdev_ops.
> > + */
> 
> Hi Horatiu
> 
> Why do you need these restrictions. The HW bridge should be able to
> learn that a destination MAC address can be reached via the SW
> bridge. The software bridge can then forward it out the correct
> interface.
> 
> Or are you saying your hardware cannot learn from frames which come
> from the CPU?
> 
> 	Andrew
> 
Hi Andrew,

I do not believe that our HW can learn from frames which comes from the
CPU, at least not in the way they are injected today. But in case of Ocelot
(and the next chip we are working on), we have other issues in mixing with
foreign interfaces which is why we have the check in
ocelot_netdevice_dev_check.

More important, as we responded to Nikolay, we properly introduced this
restriction for the wrong reasons.

In SW bridge I will remove all these restrictions and only set ports in
promisc mode only if NETIF_F_HW_BRIDGE is not set.
Then in the network driver I can see if a foreign interface is added to
the bridge, and when that happens I can set the port in promisc mode.
Then the frames will be flooded to the SW bridge which eventually will
send to the foreign interface.
-- 
/Horatiu

^ permalink raw reply

* Re: [PATCHv2 net] ipv6/addrconf: allow adding multicast addr if IFA_F_MCAUTOJOIN is set
From: David Ahern @ 2019-08-23 12:41 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, Madhu Challa, David S . Miller, Jianlin Shi
In-Reply-To: <20190822082012.GE18865@dhcp-12-139.nay.redhat.com>

On 8/22/19 4:20 AM, Hangbin Liu wrote:
> On Mon, Aug 19, 2019 at 10:33:58PM -0400, David Ahern wrote:
>> On 8/19/19 10:19 PM, Hangbin Liu wrote:
>>> But in ipv6_add_addr() it will check the address type and reject multicast
>>> address directly. So this feature is never worked for IPv6.
>>
>> If true, that is really disappointing.
>>
>> We need to get a functional test script started for various address cases.
> 
> Do you mean an `ip addr add` testing for all kinds of address types?
> 
yes.


^ permalink raw reply

* Re: IPv6 routes inserted into the kernel with 'route' end up with invalid type
From: David Ahern @ 2019-08-23 12:54 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, David S. Miller
  Cc: netdev, bird-users, Tom Bird, Maria Jan Matejka, Ondrej Zajicek
In-Reply-To: <87mug0m4rp.fsf@toke.dk>

On 8/23/19 8:43 AM, Toke Høiland-Jørgensen wrote:
> Hi David
> 
> Tom noticed[0] that on newer kernels, the Bird routing daemon rejects IPv6
> routes received from the kernel if those routes were inserted with the
> old 'route' utility (i.e., when they're inserted through the ioctl
> interface).
> 
> We tracked it down to the routes having an rtm_type of RTN_UNKNOWN, and
> a bit of git archaeology points suggestively at this commit:
> 
> e8478e80e5a ("net/ipv6: Save route type in rt6_info")
> 
> The same setup works with older kernels, so this seems like it's a
> regression, the age of 'route' notwithstanding. Any good ideas for the
> proper way to fix this?
> 

Should be fixed by:

commit c7036d97acd2527cef145b5ef9ad1a37ed21bbe6
Author: David Ahern <dsahern@gmail.com>
Date:   Wed Jun 19 10:50:24 2019 -0700

    ipv6: Default fib6_type to RTN_UNICAST when not set

    A user reported that routes are getting installed with type 0
(RTN_UNSPEC)
    where before the routes were RTN_UNICAST. One example is from accel-ppp
    which apparently still uses the ioctl interface and does not set
    rtmsg_type. Another is the netlink interface where ipv6 does not require
    rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the
    ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior.

    Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info")
    Signed-off-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: IPv6 routes inserted into the kernel with 'route' end up with invalid type
From: Toke Høiland-Jørgensen @ 2019-08-23 12:59 UTC (permalink / raw)
  To: David Ahern, David S. Miller
  Cc: netdev, bird-users, Tom Bird, Maria Jan Matejka, Ondrej Zajicek
In-Reply-To: <423380ec-b999-d620-9bd6-78c2dabfde99@gmail.com>

David Ahern <dsahern@gmail.com> writes:

> On 8/23/19 8:43 AM, Toke Høiland-Jørgensen wrote:
>> Hi David
>> 
>> Tom noticed[0] that on newer kernels, the Bird routing daemon rejects IPv6
>> routes received from the kernel if those routes were inserted with the
>> old 'route' utility (i.e., when they're inserted through the ioctl
>> interface).
>> 
>> We tracked it down to the routes having an rtm_type of RTN_UNKNOWN, and
>> a bit of git archaeology points suggestively at this commit:
>> 
>> e8478e80e5a ("net/ipv6: Save route type in rt6_info")
>> 
>> The same setup works with older kernels, so this seems like it's a
>> regression, the age of 'route' notwithstanding. Any good ideas for the
>> proper way to fix this?
>> 
>
> Should be fixed by:
>
> commit c7036d97acd2527cef145b5ef9ad1a37ed21bbe6
> Author: David Ahern <dsahern@gmail.com>
> Date:   Wed Jun 19 10:50:24 2019 -0700
>
>     ipv6: Default fib6_type to RTN_UNICAST when not set
>
>     A user reported that routes are getting installed with type 0
> (RTN_UNSPEC)
>     where before the routes were RTN_UNICAST. One example is from accel-ppp
>     which apparently still uses the ioctl interface and does not set
>     rtmsg_type. Another is the netlink interface where ipv6 does not require
>     rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the
>     ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior.
>
>     Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info")
>     Signed-off-by: David Ahern <dsahern@gmail.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>


Ah, great! Guess that hasn't made its way to the stable and distribution
kernels yet. Thanks for the pointer! :)

-Toke

^ permalink raw reply

* Re: [PATCH net-next v5] sched: Add dualpi2 qdisc
From: Tilmans, Olivier (Nokia - BE/Antwerp) @ 2019-08-23 12:59 UTC (permalink / raw)
  To: Dave Taht
  Cc: Eric Dumazet, Stephen Hemminger, Olga Albisser,
	De Schepper, Koen (Nokia - BE/Antwerp), Bob Briscoe, Henrik Steen,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S. Miller,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <CAA93jw5_LN_-zhHh=zZA8r6Zvv1CvA_AikT_rCgWyT8ytQM_rg@mail.gmail.com>

> 1) Since we're still duking it out over the meaning of the bits - not
> just the SCE thing, but as best as I can
> tell (but could be wrong) the NQB idea wants to put something into the
> l4s fast queue? Or is NQB supposed to
> be a third queue?

We can add support for NQB in the future, by expanding the
dualpi2_skb_classify() function. This is however out of scope at the
moment as NQB is not yet adopted by the TSV WG. I'd guess we may want more
than just the NQB DSCP codepoint in the L queue, which then warrant
another way to classify traffic, e.g., using tc filter hints.

> In those cases, the ecn_mask should just be mask.

That is actually what it is at the moment: a mask on the two ecn bits.

> 2) Is the intent to make the drop probability 0 by default? (10 in the
> pie rfc, not mentioned in the l4s rfc as yet)

I assume you are referring to §5.1 of the PIE RFC, i.e., the switch to
pure drop once the computed marking probability is >10%?

The default for dualpi2 is also to enter a pure-drop mode on overload.
More precisely, we define overload as reaching a marking probability of
100% in the L queue, meaning an internal PI probability of 50% (as it
gets mutiplied by the coupling factor which defaults to 2).
This is equivalent to a PIE probability of 25% (as the classic queue gets a 
squared probability).
This drop mode means that packets in both queues will be subject to
random drops with a PI^2 probability. Additionally, all packets not
dropped in the L queue are CE marked.

We used to have a parameter to configure this overload threshold (IIRC
it was still in the pre-v4 patches), but found no real use for lowering
it, hence its removal.

Note that the drop on overload can be disabled, resulting in increased
latencies in both queues, 100% CE marking in the L queue, and eventually
a taildrop behaviour once the packet limit is reached.

> 3) has this been tested on a hw mq system as yet? (10gigE is typically
> 64 queues)

Yes, in a setup where 1/32/64/128VMs were behind an Intel X540-*, which indeed
has 64 internal queues. The VMs use a mix of long/short cubic/DCTCP connections
towards another server. I could not think about another use-case where a 10G
software switch would prove to be a bottleneck, i.e., where a queue would
happen.
The qdisc is however not optimized for mq systems, could it cause performance 
degradation if the server was severely resource constrained?

Also, ensuring it was able to saturate 10G meant gro was required on the 
hypervisor, thus that the step threshold of dualpi2 has to be increased to 
compensate for those large bursts. Maybe that is where being mq-aware would 
help, i.e., by instantiating one dualpi2 instance per HW queue?
The AQM scheme itself is CPU friendly (lighter than PIE)--i.e., computing the
probability takes <10 arithmetic ops and 5 comparisons once every 16ms, while
enqueue/dequeue can involve ~10 comparisons and at most 2 rng calls)--so
should not increase the load too much issues if it was duplicated.


Best,
Olivier

^ permalink raw reply

* Re: [PATCH net v3] ixgbe: fix double clean of tx descriptors with xdp
From: William Tu @ 2019-08-23 13:19 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexander Duyck, Ilya Maximets, Netdev, LKML, bpf,
	David S. Miller, Magnus Karlsson, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jeff Kirsher,
	intel-wired-lan, Eelco Chaudron
In-Reply-To: <217e90f5-206a-bb80-6699-f6dd750b57d9@intel.com>

On Thu, Aug 22, 2019 at 11:10 PM Björn Töpel <bjorn.topel@intel.com> wrote:
>
> On 2019-08-22 19:32, William Tu wrote:
> > On Thu, Aug 22, 2019 at 10:21 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> >>
> >> On Thu, Aug 22, 2019 at 10:12 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >>>
> >>> Tx code doesn't clear the descriptors' status after cleaning.
> >>> So, if the budget is larger than number of used elems in a ring, some
> >>> descriptors will be accounted twice and xsk_umem_complete_tx will move
> >>> prod_tail far beyond the prod_head breaking the completion queue ring.
> >>>
> >>> Fix that by limiting the number of descriptors to clean by the number
> >>> of used descriptors in the tx ring.
> >>>
> >>> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
> >>> 'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
> >>> 'next_to_clean' and 'next_to_use' indexes.
> >>>
> >>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> >>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> >>> ---
> >>>
> >>> Version 3:
> >>>    * Reverted some refactoring made for v2.
> >>>    * Eliminated 'budget' for tx clean.
> >>>    * prefetch returned.
> >>>
> >>> Version 2:
> >>>    * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
> >>>      'ixgbe_xsk_clean_tx_ring()'.
> >>>
> >>>   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 29 ++++++++------------
> >>>   1 file changed, 11 insertions(+), 18 deletions(-)
> >>
> >> Thanks for addressing my concerns.
> >>
> >> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >
> > Thanks.
> >
> > Tested-by: William Tu <u9012063@gmail.com>
> >
>
> Will, thanks for testing! For this patch, did you notice any performance
> degradation?
>

No noticeable performance degradation seen this time.
Thanks,
William

^ permalink raw reply

* Re: [PATCH 0/3] Add NETIF_F_HW_BRIDGE feature
From: Andrew Lunn @ 2019-08-23 13:25 UTC (permalink / raw)
  To: Horatiu Vultur
  Cc: Nikolay Aleksandrov, roopa, davem, UNGLinuxDriver,
	alexandre.belloni, allan.nielsen, netdev, linux-kernel, bridge
In-Reply-To: <20190823122657.njk2tcgur2zu74i7@soft-dev3.microsemi.net>

> > > Why do the devices have to be from the same driver ?
> After seeing yours and Andrews comments I realize that I try to do two
> things, but I have only explained one of them.
> 
> Here is what I was trying to do:
> A. Prevent ports from going into promisc mode, if it is not needed.

The switch definition is promisc is a bit odd. You really need to
split it into two use cases.

The Linux interface is not a member of a bridge. In this case, promisc
mode would mean all frames ingressing the port should be forwarded to
the CPU. Without promisc, you can program the hardware to just accept
frames with the interfaces MAC address. So this is just the usual
behaviour of an interface.

When the interface is part of the bridge, then you can turn on all the
learning and not forward frames to the CPU, unless the CPU asks for
them. But you need to watch out for various flags. By default, you
should flood to the CPU, unknown destinations to the CPU etc. But some
of these can be turned off by flags.

> B. Prevent adding the CPU to the flood-mask (in Ocelot we have a
> flood-mask controlling who should be included when flooding due to
> destination unknown).

So destination unknown should be flooded to the CPU. The CPU might
know where to send the frame.

> To solve item "B", the network driver needs to detect if there is a
> foreign interfaces added to the bridge. If that is the case then to add
> the CPU port to the flooding mask otherwise no.

It is not just a foreign interface. What about the MAC address on the
bridge interface?

> > > This is too specific targeting some devices.
> Maybe I was wrong to mention specific HW in the commit message. The
> purpose of the patch was to add an optimization (not to copy all the
> frames to the CPU) for HW that is capable of learning and flooding the
> frames.

To some extent, this is also tied to your hardware not learning MAC
addresses from frames passed from the CPU. You should also consider
fixing this. The SW bridge does send out notifications when it
adds/removes MAC addresses to its tables. You probably want to receive
this modifications, and use them to program your hardware to forward
frames to the CPU when needed.

       Andrew

^ permalink raw reply

* Re: [PATCH bpf-next v5 03/11] xsk: add support to allow unaligned chunk placement
From: Laatz, Kevin @ 2019-08-23 13:35 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <3AEEC88E-8D45-41C5-AFBF-51512826B1A7@gmail.com>

On 22/08/2019 19:43, Jonathan Lemon wrote:
> On 21 Aug 2019, at 18:44, Kevin Laatz wrote:
>> Currently, addresses are chunk size aligned. This means, we are very
>> restricted in terms of where we can place chunk within the umem. For
>> example, if we have a chunk size of 2k, then our chunks can only be 
>> placed
>> at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
>>
>> This patch introduces the ability to use unaligned chunks. With these
>> changes, we are no longer bound to having to place chunks at a 2k (or
>> whatever your chunk size is) interval. Since we are no longer dealing 
>> with
>> aligned chunks, they can now cross page boundaries. Checks for page
>> contiguity have been added in order to keep track of which pages are
>> followed by a physically contiguous page.
>>
>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>>
>> ---
>> v2:
>>   - Add checks for the flags coming from userspace
>>   - Fix how we get chunk_size in xsk_diag.c
>>   - Add defines for masking the new descriptor format
>>   - Modified the rx functions to use new descriptor format
>>   - Modified the tx functions to use new descriptor format
>>
>> v3:
>>   - Add helper function to do address/offset masking/addition
>>
>> v4:
>>   - fixed page_start calculation in __xsk_rcv_memcpy().
>>   - move offset handling to the xdp_umem_get_* functions
>>   - modified the len field in xdp_umem_reg struct. We now use 16 bits 
>> from
>>     this for the flags field.
>>   - removed next_pg_contig field from xdp_umem_page struct. Using low 12
>>     bits of addr to store flags instead.
>>   - other minor changes based on review comments
>>
>> v5:
>>   - Added accessors for getting addr and offset
>>   - Added helper function to add offset to addr
>>   - Fixed offset handling in xsk_rcv
>>   - Removed bitfields from xdp_umem_reg
>>   - Added struct size checking for xdp_umem_reg in xsk_setsockopt to 
>> handle
>>     different versions of the struct.
>>   - fix conflicts after 'bpf-af-xdp-wakeup' was merged.
>> ---
>>  include/net/xdp_sock.h      | 75 +++++++++++++++++++++++++++--
>>  include/uapi/linux/if_xdp.h |  9 ++++
>>  net/xdp/xdp_umem.c          | 19 ++++++--
>>  net/xdp/xsk.c               | 96 +++++++++++++++++++++++++++++--------
>>  net/xdp/xsk_diag.c          |  2 +-
>>  net/xdp/xsk_queue.h         | 68 ++++++++++++++++++++++----
>>  6 files changed, 232 insertions(+), 37 deletions(-)
>>
>>
[...]

>> @@ -196,17 +221,17 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct 
>> xdp_buff *xdp)
>>          goto out_unlock;
>>      }
>>
>> -    if (!xskq_peek_addr(xs->umem->fq, &addr) ||
>> +    if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>>          len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>>          err = -ENOSPC;
>>          goto out_drop;
>>      }
>>
>> -    addr += xs->umem->headroom;
>> -
>> -    buffer = xdp_umem_get_data(xs->umem, addr);
>> +    buffer = xdp_umem_get_data(xs->umem, addr + offset);
>>      memcpy(buffer, xdp->data_meta, len + metalen);
>> -    addr += metalen;
>> +    offset += metalen;
>> +
>> +    addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
>>      err = xskq_produce_batch_desc(xs->rx, addr, len);
>>      if (err)
>>          goto out_drop;
>
> Can't just add address and offset any longer.  This should read:
>
>     addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
>     buffer = xdp_umem_get_data(xs->umem, addr);
>
>     addr = xsk_umem_adjust_offset(xs->umem, addr, metalen);
>
>
> so that offset and then metalen are added.  (or preserve the
> address across the calls like memcpy_addr earlier).


Will fix this, thanks!

-Kevin


^ permalink raw reply

* [PATCH net] ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idev
From: Sabrina Dubroca @ 2019-08-23 13:44 UTC (permalink / raw)
  To: netdev; +Cc: Sabrina Dubroca

Currently, ipv6_find_idev returns NULL when ipv6_add_dev fails,
ignoring the specific error value. This results in addrconf_add_dev
returning ENOBUFS in all cases, which is unfortunate in cases such as:

    # ip link add dummyX type dummy
    # ip link set dummyX mtu 1200 up
    # ip addr add 2000::/64 dev dummyX
    RTNETLINK answers: No buffer space available

Commit a317a2f19da7 ("ipv6: fail early when creating netdev named all
or default") introduced error returns in ipv6_add_dev. Before that,
that function would simply return NULL for all failures.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
 net/ipv6/addrconf.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index ced995f3fec4..6a576ff92c39 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -478,7 +478,7 @@ static struct inet6_dev *ipv6_find_idev(struct net_device *dev)
 	if (!idev) {
 		idev = ipv6_add_dev(dev);
 		if (IS_ERR(idev))
-			return NULL;
+			return idev;
 	}
 
 	if (dev->flags&IFF_UP)
@@ -2466,8 +2466,8 @@ static struct inet6_dev *addrconf_add_dev(struct net_device *dev)
 	ASSERT_RTNL();
 
 	idev = ipv6_find_idev(dev);
-	if (!idev)
-		return ERR_PTR(-ENOBUFS);
+	if (IS_ERR(idev))
+		return idev;
 
 	if (idev->cnf.disable_ipv6)
 		return ERR_PTR(-EACCES);
@@ -3159,7 +3159,7 @@ static void init_loopback(struct net_device *dev)
 	ASSERT_RTNL();
 
 	idev = ipv6_find_idev(dev);
-	if (!idev) {
+	if (IS_ERR(idev)) {
 		pr_debug("%s: add_dev failed\n", __func__);
 		return;
 	}
@@ -3374,7 +3374,7 @@ static void addrconf_sit_config(struct net_device *dev)
 	 */
 
 	idev = ipv6_find_idev(dev);
-	if (!idev) {
+	if (IS_ERR(idev)) {
 		pr_debug("%s: add_dev failed\n", __func__);
 		return;
 	}
@@ -3399,7 +3399,7 @@ static void addrconf_gre_config(struct net_device *dev)
 	ASSERT_RTNL();
 
 	idev = ipv6_find_idev(dev);
-	if (!idev) {
+	if (IS_ERR(idev)) {
 		pr_debug("%s: add_dev failed\n", __func__);
 		return;
 	}
@@ -4773,8 +4773,8 @@ inet6_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh,
 			 IFA_F_MCAUTOJOIN | IFA_F_OPTIMISTIC;
 
 	idev = ipv6_find_idev(dev);
-	if (!idev)
-		return -ENOBUFS;
+	if (IS_ERR(idev))
+		return PTR_ERR(idev);
 
 	if (!ipv6_allow_optimistic_dad(net, idev))
 		cfg.ifa_flags &= ~IFA_F_OPTIMISTIC;
-- 
2.22.0


^ permalink raw reply related

* [PATCH net-next v3 3/3] dt-bindings: net: ethernet: Update mt7622 docs and dts to reflect the new phylink API
From: René van Dorst @ 2019-08-23 13:45 UTC (permalink / raw)
  To: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
	Matthias Brugger
  Cc: netdev, linux-arm-kernel, linux-mediatek, linux-mips,
	Russell King, Frank Wunderlich, Stefan Roese, René van Dorst
In-Reply-To: <20190823134516.27559-1-opensource@vdorst.com>

This patch the removes the recently added mediatek,physpeed property.
Use the fixed-link property speed = <2500> to set the phy in 2.5Gbit.
See mt7622-bananapi-bpi-r64.dts for a working example.

Signed-off-by: René van Dorst <opensource@vdorst.com>
--
v2->v3:
* no change
v1->v2:
* SGMII port only support BASE-X at 2.5Gbit.
---
 .../arm/mediatek/mediatek,sgmiisys.txt        |  2 --
 .../dts/mediatek/mt7622-bananapi-bpi-r64.dts  | 28 +++++++++++++------
 arch/arm64/boot/dts/mediatek/mt7622.dtsi      |  1 -
 3 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt
index f5518f26a914..30cb645c0e54 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt
@@ -9,8 +9,6 @@ Required Properties:
 	- "mediatek,mt7622-sgmiisys", "syscon"
 	- "mediatek,mt7629-sgmiisys", "syscon"
 - #clock-cells: Must be 1
-- mediatek,physpeed: Should be one of "auto", "1000" or "2500" to match up
-		     the capability of the target PHY.
 
 The SGMIISYS controller uses the common clk binding from
 Documentation/devicetree/bindings/clock/clock-bindings.txt
diff --git a/arch/arm64/boot/dts/mediatek/mt7622-bananapi-bpi-r64.dts b/arch/arm64/boot/dts/mediatek/mt7622-bananapi-bpi-r64.dts
index 710c5c3d87d3..83e10591e0e5 100644
--- a/arch/arm64/boot/dts/mediatek/mt7622-bananapi-bpi-r64.dts
+++ b/arch/arm64/boot/dts/mediatek/mt7622-bananapi-bpi-r64.dts
@@ -115,24 +115,34 @@
 };
 
 &eth {
-	pinctrl-names = "default";
-	pinctrl-0 = <&eth_pins>;
 	status = "okay";
+	gmac0: mac@0 {
+		compatible = "mediatek,eth-mac";
+		reg = <0>;
+		phy-mode = "2500base-x";
+
+		fixed-link {
+			speed = <2500>;
+			full-duplex;
+			pause;
+		};
+	};
 
 	gmac1: mac@1 {
 		compatible = "mediatek,eth-mac";
 		reg = <1>;
-		phy-handle = <&phy5>;
+		phy-mode = "rgmii";
+
+		fixed-link {
+			speed = <1000>;
+			full-duplex;
+			pause;
+		};
 	};
 
-	mdio-bus {
+	mdio: mdio-bus {
 		#address-cells = <1>;
 		#size-cells = <0>;
-
-		phy5: ethernet-phy@5 {
-			reg = <5>;
-			phy-mode = "sgmii";
-		};
 	};
 };
 
diff --git a/arch/arm64/boot/dts/mediatek/mt7622.dtsi b/arch/arm64/boot/dts/mediatek/mt7622.dtsi
index d1e13d340e26..dac51e98204c 100644
--- a/arch/arm64/boot/dts/mediatek/mt7622.dtsi
+++ b/arch/arm64/boot/dts/mediatek/mt7622.dtsi
@@ -931,6 +931,5 @@
 			     "syscon";
 		reg = <0 0x1b128000 0 0x3000>;
 		#clock-cells = <1>;
-		mediatek,physpeed = "2500";
 	};
 };
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v3 0/3] net: ethernet: mediatek: convert to PHYLINK
From: René van Dorst @ 2019-08-23 13:45 UTC (permalink / raw)
  To: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
	Matthias Brugger
  Cc: netdev, linux-arm-kernel, linux-mediatek, linux-mips,
	Russell King, Frank Wunderlich, Stefan Roese, René van Dorst

These patches converts mediatek driver to PHYLINK API.

v2->v3:
* Phylink improvements and clean-ups after review
v1->v2:
* Rebase for mt76x8 changes
* Phylink improvements and clean-ups after review
* SGMII port doesn't support 2.5Gbit in SGMII mode only in BASE-X mode.
  Refactor the code.

René van Dorst (3):
  net: ethernet: mediatek: Add basic PHYLINK support
  net: ethernet: mediatek: Re-add support SGMII
  dt-bindings: net: ethernet: Update mt7622 docs and dts to reflect the
    new phylink API

 .../arm/mediatek/mediatek,sgmiisys.txt        |   2 -
 .../dts/mediatek/mt7622-bananapi-bpi-r64.dts  |  28 +-
 arch/arm64/boot/dts/mediatek/mt7622.dtsi      |   1 -
 drivers/net/ethernet/mediatek/Kconfig         |   2 +-
 drivers/net/ethernet/mediatek/mtk_eth_path.c  |  75 +--
 drivers/net/ethernet/mediatek/mtk_eth_soc.c   | 529 ++++++++++++------
 drivers/net/ethernet/mediatek/mtk_eth_soc.h   |  68 ++-
 drivers/net/ethernet/mediatek/mtk_sgmii.c     |  65 ++-
 8 files changed, 477 insertions(+), 293 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH net-next v3 1/3] net: ethernet: mediatek: Add basic PHYLINK support
From: René van Dorst @ 2019-08-23 13:45 UTC (permalink / raw)
  To: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
	Matthias Brugger
  Cc: netdev, linux-arm-kernel, linux-mediatek, linux-mips,
	Russell King, Frank Wunderlich, Stefan Roese, René van Dorst
In-Reply-To: <20190823134516.27559-1-opensource@vdorst.com>

This convert the basics to PHYLINK API.
SGMII support is not in this patch.

Signed-off-by: René van Dorst <opensource@vdorst.com>
--
v2->v3:
* Make link_down() similar as link_up() suggested by Russell King
v1->v2:
* Also report 1000Base-X support suggested by Russell King
* Reverse christmas on many places suggested by David Miller
* Rebase too pickup the mt76x8 changes.
---
 drivers/net/ethernet/mediatek/Kconfig       |   2 +-
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 424 +++++++++++---------
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  31 +-
 3 files changed, 265 insertions(+), 192 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/Kconfig b/drivers/net/ethernet/mediatek/Kconfig
index b76cf2e1c9dc..4968352ba188 100644
--- a/drivers/net/ethernet/mediatek/Kconfig
+++ b/drivers/net/ethernet/mediatek/Kconfig
@@ -9,7 +9,7 @@ if NET_VENDOR_MEDIATEK
 
 config NET_MEDIATEK_SOC
 	tristate "MediaTek SoC Gigabit Ethernet support"
-	select PHYLIB
+	select PHYLINK
 	---help---
 	  This driver supports the gigabit ethernet MACs in the
 	  MediaTek SoC family.
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 8ddbb8dcf032..a04baad6337c 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -18,6 +18,7 @@
 #include <linux/tcp.h>
 #include <linux/interrupt.h>
 #include <linux/pinctrl/devinfo.h>
+#include <linux/phylink.h>
 
 #include "mtk_eth_soc.h"
 
@@ -186,168 +187,224 @@ static void mtk_gmac0_rgmii_adjust(struct mtk_eth *eth, int speed)
 	mtk_w32(eth, val, TRGMII_TCK_CTRL);
 }
 
-static void mtk_phy_link_adjust(struct net_device *dev)
+static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
+			   const struct phylink_link_state *state)
 {
-	struct mtk_mac *mac = netdev_priv(dev);
-	u16 lcl_adv = 0, rmt_adv = 0;
-	u8 flowctrl;
-	u32 mcr = MAC_MCR_MAX_RX_1536 | MAC_MCR_IPG_CFG |
-		  MAC_MCR_FORCE_MODE | MAC_MCR_TX_EN |
-		  MAC_MCR_RX_EN | MAC_MCR_BACKOFF_EN |
-		  MAC_MCR_BACKPR_EN;
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+	struct mtk_eth *eth = mac->hw;
+	u32 mcr_cur, mcr_new;
+	int val, ge_mode = 0;
+
+	/* MT76x8 has no hardware settings between for the MAC */
+	if (!MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628) &&
+	    mac->interface != state->interface) {
+		/* Setup soc pin functions */
+		switch (state->interface) {
+		case PHY_INTERFACE_MODE_TRGMII:
+			if (mac->id)
+				goto err_phy;
+			if (!MTK_HAS_CAPS(mac->hw->soc->caps,
+					  MTK_GMAC1_TRGMII))
+				goto err_phy;
+			/* fall through */
+		case PHY_INTERFACE_MODE_GMII:
+		case PHY_INTERFACE_MODE_RGMII_TXID:
+		case PHY_INTERFACE_MODE_RGMII_RXID:
+		case PHY_INTERFACE_MODE_RGMII_ID:
+		case PHY_INTERFACE_MODE_RGMII:
+			break;
+		case PHY_INTERFACE_MODE_MII:
+			ge_mode = 1;
+			break;
+		case PHY_INTERFACE_MODE_REVMII:
+			ge_mode = 2;
+			break;
+		case PHY_INTERFACE_MODE_RMII:
+			if (mac->id)
+				goto err_phy;
+			ge_mode = 3;
+			break;
+		default:
+			goto err_phy;
+		}
 
-	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
-		return;
+		/* Setup clock for 1st gmac */
+		if (!mac->id &&
+		    MTK_HAS_CAPS(mac->hw->soc->caps, MTK_GMAC1_TRGMII)) {
+			if (MTK_HAS_CAPS(mac->hw->soc->caps,
+					 MTK_TRGMII_MT7621_CLK)) {
+				if (mt7621_gmac0_rgmii_adjust(mac->hw,
+							      state->interface))
+					goto err_phy;
+			} else {
+				if (state->interface !=
+				    PHY_INTERFACE_MODE_TRGMII)
+					mtk_gmac0_rgmii_adjust(mac->hw,
+							       state->speed);
+			}
+		}
 
-	switch (dev->phydev->speed) {
+		/* put the gmac into the right mode */
+		regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
+		val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, mac->id);
+		val |= SYSCFG0_GE_MODE(ge_mode, mac->id);
+		regmap_write(eth->ethsys, ETHSYS_SYSCFG0, val);
+
+		mac->interface = state->interface;
+	}
+
+	/* Setup gmac */
+	mcr_cur = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
+	mcr_new = mcr_cur;
+	mcr_new &= ~(MAC_MCR_SPEED_100 | MAC_MCR_SPEED_1000 |
+		     MAC_MCR_FORCE_DPX | MAC_MCR_FORCE_TX_FC |
+		     MAC_MCR_FORCE_RX_FC);
+	mcr_new |= MAC_MCR_MAX_RX_1536 | MAC_MCR_IPG_CFG | MAC_MCR_FORCE_MODE |
+		   MAC_MCR_BACKOFF_EN | MAC_MCR_BACKPR_EN | MAC_MCR_FORCE_LINK;
+
+	switch (state->speed) {
 	case SPEED_1000:
-		mcr |= MAC_MCR_SPEED_1000;
+		mcr_new |= MAC_MCR_SPEED_1000;
 		break;
 	case SPEED_100:
-		mcr |= MAC_MCR_SPEED_100;
+		mcr_new |= MAC_MCR_SPEED_100;
 		break;
 	}
-
-	if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_GMAC1_TRGMII) && !mac->id) {
-		if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_TRGMII_MT7621_CLK)) {
-			if (mt7621_gmac0_rgmii_adjust(mac->hw,
-						      dev->phydev->interface))
-				return;
-		} else {
-			if (!mac->trgmii)
-				mtk_gmac0_rgmii_adjust(mac->hw,
-						       dev->phydev->speed);
-		}
+	if (state->duplex == DUPLEX_FULL) {
+		mcr_new |= MAC_MCR_FORCE_DPX;
+		if (state->pause & MLO_PAUSE_TX)
+			mcr_new |= MAC_MCR_FORCE_TX_FC;
+		if (state->pause & MLO_PAUSE_RX)
+			mcr_new |= MAC_MCR_FORCE_RX_FC;
 	}
 
-	if (dev->phydev->link)
-		mcr |= MAC_MCR_FORCE_LINK;
+	/* Only update control register when needed! */
+	if (mcr_new != mcr_cur)
+		mtk_w32(mac->hw, mcr_new, MTK_MAC_MCR(mac->id));
 
-	if (dev->phydev->duplex) {
-		mcr |= MAC_MCR_FORCE_DPX;
+	return;
 
-		if (dev->phydev->pause)
-			rmt_adv = LPA_PAUSE_CAP;
-		if (dev->phydev->asym_pause)
-			rmt_adv |= LPA_PAUSE_ASYM;
+err_phy:
+	dev_err(eth->dev, "%s: GMAC%d mode %s not supported!\n", __func__,
+		mac->id, phy_modes(state->interface));
+}
 
-		lcl_adv = linkmode_adv_to_lcl_adv_t(dev->phydev->advertising);
-		flowctrl = mii_resolve_flowctrl_fdx(lcl_adv, rmt_adv);
+static int mtk_mac_link_state(struct phylink_config *config,
+			      struct phylink_link_state *state)
+{
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+	u32 pmsr = mtk_r32(mac->hw, MTK_MAC_MSR(mac->id));
 
-		if (flowctrl & FLOW_CTRL_TX)
-			mcr |= MAC_MCR_FORCE_TX_FC;
-		if (flowctrl & FLOW_CTRL_RX)
-			mcr |= MAC_MCR_FORCE_RX_FC;
+	state->link = (pmsr & MAC_MSR_LINK);
+	state->duplex = (pmsr & MAC_MSR_DPX) >> 1;
 
-		netif_dbg(mac->hw, link, dev, "rx pause %s, tx pause %s\n",
-			  flowctrl & FLOW_CTRL_RX ? "enabled" : "disabled",
-			  flowctrl & FLOW_CTRL_TX ? "enabled" : "disabled");
+	switch (pmsr & (MAC_MSR_SPEED_1000 | MAC_MSR_SPEED_100)) {
+	case 0:
+		state->speed = SPEED_10;
+		break;
+	case MAC_MSR_SPEED_100:
+		state->speed = SPEED_100;
+		break;
+	case MAC_MSR_SPEED_1000:
+		state->speed = SPEED_1000;
+		break;
+	default:
+		state->speed = SPEED_UNKNOWN;
+		break;
 	}
 
-	mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
+	state->pause &= (MLO_PAUSE_RX | MLO_PAUSE_TX);
+	if (pmsr & MAC_MSR_RX_FC)
+		state->pause |= MLO_PAUSE_RX;
+	if (pmsr & MAC_MSR_TX_FC)
+		state->pause |= MLO_PAUSE_TX;
 
-	if (!of_phy_is_fixed_link(mac->of_node))
-		phy_print_status(dev->phydev);
+	return 1;
 }
 
-static int mtk_phy_connect_node(struct mtk_eth *eth, struct mtk_mac *mac,
-				struct device_node *phy_node)
+static void mtk_mac_an_restart(struct phylink_config *config)
 {
-	struct phy_device *phydev;
-	int phy_mode;
-
-	phy_mode = of_get_phy_mode(phy_node);
-	if (phy_mode < 0) {
-		dev_err(eth->dev, "incorrect phy-mode %d\n", phy_mode);
-		return -EINVAL;
-	}
-
-	phydev = of_phy_connect(eth->netdev[mac->id], phy_node,
-				mtk_phy_link_adjust, 0, phy_mode);
-	if (!phydev) {
-		dev_err(eth->dev, "could not connect to PHY\n");
-		return -ENODEV;
-	}
+	/* Do nothing */
+}
 
-	dev_info(eth->dev,
-		 "connected mac %d to PHY at %s [uid=%08x, driver=%s]\n",
-		 mac->id, phydev_name(phydev), phydev->phy_id,
-		 phydev->drv->name);
+static void mtk_mac_link_down(struct phylink_config *config, unsigned int mode,
+			      phy_interface_t interface)
+{
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+	u32 mcr = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
 
-	return 0;
+	mcr &= (MAC_MCR_TX_EN | MAC_MCR_RX_EN);
+	mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
 }
 
-static int mtk_phy_connect(struct net_device *dev)
+static void mtk_mac_link_up(struct phylink_config *config, unsigned int mode,
+			    phy_interface_t interface,
+			    struct phy_device *phy)
 {
-	struct mtk_mac *mac = netdev_priv(dev);
-	struct mtk_eth *eth;
-	struct device_node *np;
-	u32 val;
-	int err;
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+	u32 mcr = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
 
-	eth = mac->hw;
-	np = of_parse_phandle(mac->of_node, "phy-handle", 0);
-	if (!np && of_phy_is_fixed_link(mac->of_node))
-		if (!of_phy_register_fixed_link(mac->of_node))
-			np = of_node_get(mac->of_node);
-	if (!np)
-		return -ENODEV;
+	mcr |= MAC_MCR_TX_EN | MAC_MCR_RX_EN;
+	mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
+}
 
-	err = mtk_setup_hw_path(eth, mac->id, of_get_phy_mode(np));
-	if (err)
-		goto err_phy;
-
-	mac->ge_mode = 0;
-	switch (of_get_phy_mode(np)) {
-	case PHY_INTERFACE_MODE_TRGMII:
-		mac->trgmii = true;
-	case PHY_INTERFACE_MODE_RGMII_TXID:
-	case PHY_INTERFACE_MODE_RGMII_RXID:
-	case PHY_INTERFACE_MODE_RGMII_ID:
-	case PHY_INTERFACE_MODE_RGMII:
-	case PHY_INTERFACE_MODE_SGMII:
-		break;
-	case PHY_INTERFACE_MODE_MII:
-	case PHY_INTERFACE_MODE_GMII:
-		mac->ge_mode = 1;
-		break;
-	case PHY_INTERFACE_MODE_REVMII:
-		mac->ge_mode = 2;
-		break;
-	case PHY_INTERFACE_MODE_RMII:
-		if (!mac->id)
-			goto err_phy;
-		mac->ge_mode = 3;
-		break;
-	default:
-		goto err_phy;
-	}
+static void mtk_validate(struct phylink_config *config,
+			 unsigned long *supported,
+			 struct phylink_link_state *state)
+{
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+	__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
 
-	/* No MT7628/88 support for now */
-	if (!MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628)) {
-		/* put the gmac into the right mode */
-		regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
-		val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, mac->id);
-		val |= SYSCFG0_GE_MODE(mac->ge_mode, mac->id);
-		regmap_write(eth->ethsys, ETHSYS_SYSCFG0, val);
+	if (state->interface != PHY_INTERFACE_MODE_NA &&
+	    state->interface != PHY_INTERFACE_MODE_MII &&
+	    state->interface != PHY_INTERFACE_MODE_GMII &&
+	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_RGMII) &&
+	      phy_interface_mode_is_rgmii(state->interface)) &&
+	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_TRGMII) &&
+	      !mac->id && state->interface == PHY_INTERFACE_MODE_TRGMII)) {
+		linkmode_zero(supported);
+		return;
 	}
 
-	/* couple phydev to net_device */
-	if (mtk_phy_connect_node(eth, mac, np))
-		goto err_phy;
+	phylink_set_port_modes(mask);
+	phylink_set(mask, Autoneg);
 
-	of_node_put(np);
+	if (state->interface == PHY_INTERFACE_MODE_TRGMII) {
+		phylink_set(mask, 1000baseT_Full);
+	} else {
+		phylink_set(mask, 10baseT_Half);
+		phylink_set(mask, 10baseT_Full);
+		phylink_set(mask, 100baseT_Half);
+		phylink_set(mask, 100baseT_Full);
+
+		if (state->interface != PHY_INTERFACE_MODE_MII) {
+			phylink_set(mask, 1000baseT_Half);
+			phylink_set(mask, 1000baseT_Full);
+			phylink_set(mask, 1000baseX_Full);
+		}
+	}
 
-	return 0;
+	phylink_set(mask, Pause);
+	phylink_set(mask, Asym_Pause);
 
-err_phy:
-	if (of_phy_is_fixed_link(mac->of_node))
-		of_phy_deregister_fixed_link(mac->of_node);
-	of_node_put(np);
-	dev_err(eth->dev, "%s: invalid phy\n", __func__);
-	return -EINVAL;
+	linkmode_and(supported, supported, mask);
+	linkmode_and(state->advertising, state->advertising, mask);
 }
 
+static const struct phylink_mac_ops mtk_phylink_ops = {
+	.validate = mtk_validate,
+	.mac_link_state = mtk_mac_link_state,
+	.mac_an_restart = mtk_mac_an_restart,
+	.mac_config = mtk_mac_config,
+	.mac_link_down = mtk_mac_link_down,
+	.mac_link_up = mtk_mac_link_up,
+};
+
 static int mtk_mdio_init(struct mtk_eth *eth)
 {
 	struct device_node *mii_np;
@@ -2013,6 +2070,14 @@ static int mtk_open(struct net_device *dev)
 {
 	struct mtk_mac *mac = netdev_priv(dev);
 	struct mtk_eth *eth = mac->hw;
+	int err;
+
+	err = phylink_of_phy_connect(mac->phylink, mac->of_node, 0);
+	if (err) {
+		netdev_err(dev, "%s: could not attach PHY: %d\n", __func__,
+			   err);
+		return err;
+	}
 
 	/* we run 2 netdevs on the same dma ring so we only bring it up once */
 	if (!refcount_read(&eth->dma_refcnt)) {
@@ -2030,7 +2095,7 @@ static int mtk_open(struct net_device *dev)
 	else
 		refcount_inc(&eth->dma_refcnt);
 
-	phy_start(dev->phydev);
+	phylink_start(mac->phylink);
 	netif_start_queue(dev);
 	return 0;
 }
@@ -2063,8 +2128,11 @@ static int mtk_stop(struct net_device *dev)
 	struct mtk_mac *mac = netdev_priv(dev);
 	struct mtk_eth *eth = mac->hw;
 
+	phylink_stop(mac->phylink);
+
 	netif_tx_disable(dev);
-	phy_stop(dev->phydev);
+
+	phylink_disconnect_phy(mac->phylink);
 
 	/* only shutdown DMA if this is the last user */
 	if (!refcount_dec_and_test(&eth->dma_refcnt))
@@ -2159,15 +2227,6 @@ static int mtk_hw_init(struct mtk_eth *eth)
 	ethsys_reset(eth, RSTCTRL_FE);
 	ethsys_reset(eth, RSTCTRL_PPE);
 
-	regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
-	for (i = 0; i < MTK_MAC_COUNT; i++) {
-		if (!eth->mac[i])
-			continue;
-		val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, eth->mac[i]->id);
-		val |= SYSCFG0_GE_MODE(eth->mac[i]->ge_mode, eth->mac[i]->id);
-	}
-	regmap_write(eth->ethsys, ETHSYS_SYSCFG0, val);
-
 	if (eth->pctl) {
 		/* Set GE2 driving and slew rate */
 		regmap_write(eth->pctl, GPIO_DRV_SEL10, 0xa00);
@@ -2180,11 +2239,11 @@ static int mtk_hw_init(struct mtk_eth *eth)
 	}
 
 	/* Set linkdown as the default for each GMAC. Its own MCR would be set
-	 * up with the more appropriate value when mtk_phy_link_adjust call is
-	 * being invoked.
+	 * up with the more appropriate value when mtk_mac_config call is being
+	 * invoked.
 	 */
 	for (i = 0; i < MTK_MAC_COUNT; i++)
-		mtk_w32(eth, 0, MTK_MAC_MCR(i));
+		mtk_w32(eth, MAC_MCR_FORCE_LINK_DOWN, MTK_MAC_MCR(i));
 
 	/* Indicates CDM to parse the MTK special tag from CPU
 	 * which also is working out for untag packets.
@@ -2212,7 +2271,7 @@ static int mtk_hw_init(struct mtk_eth *eth)
 	mtk_w32(eth, MTK_RX_DONE_INT, MTK_QDMA_INT_GRP2);
 	mtk_w32(eth, 0x21021000, MTK_FE_INT_GRP);
 
-	for (i = 0; i < 2; i++) {
+	for (i = 0; i < MTK_MAC_COUNT; i++) {
 		u32 val = mtk_r32(eth, MTK_GDMA_FWD_CFG(i));
 
 		/* setup the forward port to send frame to PDMA */
@@ -2264,7 +2323,7 @@ static int __init mtk_init(struct net_device *dev)
 			dev->dev_addr);
 	}
 
-	return mtk_phy_connect(dev);
+	return 0;
 }
 
 static void mtk_uninit(struct net_device *dev)
@@ -2272,20 +2331,20 @@ static void mtk_uninit(struct net_device *dev)
 	struct mtk_mac *mac = netdev_priv(dev);
 	struct mtk_eth *eth = mac->hw;
 
-	phy_disconnect(dev->phydev);
-	if (of_phy_is_fixed_link(mac->of_node))
-		of_phy_deregister_fixed_link(mac->of_node);
+	phylink_disconnect_phy(mac->phylink);
 	mtk_tx_irq_disable(eth, ~0);
 	mtk_rx_irq_disable(eth, ~0);
 }
 
 static int mtk_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 {
+	struct mtk_mac *mac = netdev_priv(dev);
+
 	switch (cmd) {
 	case SIOCGMIIPHY:
 	case SIOCGMIIREG:
 	case SIOCSMIIREG:
-		return phy_mii_ioctl(dev->phydev, ifr, cmd);
+		return phylink_mii_ioctl(mac->phylink, ifr, cmd);
 	default:
 		break;
 	}
@@ -2326,16 +2385,6 @@ static void mtk_pending_work(struct work_struct *work)
 				     eth->dev->pins->default_state);
 	mtk_hw_init(eth);
 
-	for (i = 0; i < MTK_MAC_COUNT; i++) {
-		if (!eth->mac[i] ||
-		    of_phy_is_fixed_link(eth->mac[i]->of_node))
-			continue;
-		err = phy_init_hw(eth->netdev[i]->phydev);
-		if (err)
-			dev_err(eth->dev, "%s: PHY init failed.\n",
-				eth->netdev[i]->name);
-	}
-
 	/* restart DMA and enable IRQs */
 	for (i = 0; i < MTK_MAC_COUNT; i++) {
 		if (!test_bit(i, &restart))
@@ -2398,9 +2447,7 @@ static int mtk_get_link_ksettings(struct net_device *ndev,
 	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
 		return -EBUSY;
 
-	phy_ethtool_ksettings_get(ndev->phydev, cmd);
-
-	return 0;
+	return phylink_ethtool_ksettings_get(mac->phylink, cmd);
 }
 
 static int mtk_set_link_ksettings(struct net_device *ndev,
@@ -2411,7 +2458,7 @@ static int mtk_set_link_ksettings(struct net_device *ndev,
 	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
 		return -EBUSY;
 
-	return phy_ethtool_ksettings_set(ndev->phydev, cmd);
+	return phylink_ethtool_ksettings_set(mac->phylink, cmd);
 }
 
 static void mtk_get_drvinfo(struct net_device *dev,
@@ -2445,22 +2492,10 @@ static int mtk_nway_reset(struct net_device *dev)
 	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
 		return -EBUSY;
 
-	return genphy_restart_aneg(dev->phydev);
-}
+	if (!mac->phylink)
+		return -ENOTSUPP;
 
-static u32 mtk_get_link(struct net_device *dev)
-{
-	struct mtk_mac *mac = netdev_priv(dev);
-	int err;
-
-	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
-		return -EBUSY;
-
-	err = genphy_update_link(dev->phydev);
-	if (err)
-		return ethtool_op_get_link(dev);
-
-	return dev->phydev->link;
+	return phylink_ethtool_nway_reset(mac->phylink);
 }
 
 static void mtk_get_strings(struct net_device *dev, u32 stringset, u8 *data)
@@ -2580,7 +2615,7 @@ static const struct ethtool_ops mtk_ethtool_ops = {
 	.get_msglevel		= mtk_get_msglevel,
 	.set_msglevel		= mtk_set_msglevel,
 	.nway_reset		= mtk_nway_reset,
-	.get_link		= mtk_get_link,
+	.get_link		= ethtool_op_get_link,
 	.get_strings		= mtk_get_strings,
 	.get_sset_count		= mtk_get_sset_count,
 	.get_ethtool_stats	= mtk_get_ethtool_stats,
@@ -2608,9 +2643,10 @@ static const struct net_device_ops mtk_netdev_ops = {
 
 static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
 {
-	struct mtk_mac *mac;
 	const __be32 *_id = of_get_property(np, "reg", NULL);
-	int id, err;
+	struct phylink *phylink;
+	int phy_mode, id, err;
+	struct mtk_mac *mac;
 
 	if (!_id) {
 		dev_err(eth->dev, "missing mac id\n");
@@ -2654,6 +2690,32 @@ static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
 	u64_stats_init(&mac->hw_stats->syncp);
 	mac->hw_stats->reg_offset = id * MTK_STAT_OFFSET;
 
+	/* phylink create */
+	phy_mode = of_get_phy_mode(np);
+	if (phy_mode < 0) {
+		dev_err(eth->dev, "incorrect phy-mode\n");
+		err = -EINVAL;
+		goto free_netdev;
+	}
+
+	/* mac config is not set */
+	mac->interface = PHY_INTERFACE_MODE_NA;
+	mac->mode = MLO_AN_PHY;
+	mac->speed = SPEED_UNKNOWN;
+
+	mac->phylink_config.dev = &eth->netdev[id]->dev;
+	mac->phylink_config.type = PHYLINK_NETDEV;
+
+	phylink = phylink_create(&mac->phylink_config,
+				 of_fwnode_handle(mac->of_node),
+				 phy_mode, &mtk_phylink_ops);
+	if (IS_ERR(phylink)) {
+		err = PTR_ERR(phylink);
+		goto free_netdev;
+	}
+
+	mac->phylink = phylink;
+
 	SET_NETDEV_DEV(eth->netdev[id], eth->dev);
 	eth->netdev[id]->watchdog_timeo = 5 * HZ;
 	eth->netdev[id]->netdev_ops = &mtk_netdev_ops;
@@ -2682,8 +2744,7 @@ static int mtk_probe(struct platform_device *pdev)
 {
 	struct device_node *mac_np;
 	struct mtk_eth *eth;
-	int err;
-	int i;
+	int err, i;
 
 	eth = devm_kzalloc(&pdev->dev, sizeof(*eth), GFP_KERNEL);
 	if (!eth)
@@ -2869,6 +2930,7 @@ static int mtk_probe(struct platform_device *pdev)
 static int mtk_remove(struct platform_device *pdev)
 {
 	struct mtk_eth *eth = platform_get_drvdata(pdev);
+	struct mtk_mac *mac;
 	int i;
 
 	/* stop all devices to make sure that dma is properly shut down */
@@ -2876,6 +2938,8 @@ static int mtk_remove(struct platform_device *pdev)
 		if (!eth->netdev[i])
 			continue;
 		mtk_stop(eth->netdev[i]);
+		mac = netdev_priv(eth->netdev[i]);
+		phylink_disconnect_phy(mac->phylink);
 	}
 
 	mtk_hw_deinit(eth);
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index cc1466ae0926..7f5f541daad7 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -14,6 +14,7 @@
 #include <linux/of_net.h>
 #include <linux/u64_stats_sync.h>
 #include <linux/refcount.h>
+#include <linux/phylink.h>
 
 #define MTK_QDMA_PAGE_SIZE	2048
 #define	MTK_MAX_RX_LENGTH	1536
@@ -330,12 +331,19 @@
 #define MAC_MCR_SPEED_100	BIT(2)
 #define MAC_MCR_FORCE_DPX	BIT(1)
 #define MAC_MCR_FORCE_LINK	BIT(0)
-#define MAC_MCR_FIXED_LINK	(MAC_MCR_MAX_RX_1536 | MAC_MCR_IPG_CFG | \
-				 MAC_MCR_FORCE_MODE | MAC_MCR_TX_EN | \
-				 MAC_MCR_RX_EN | MAC_MCR_BACKOFF_EN | \
-				 MAC_MCR_BACKPR_EN | MAC_MCR_FORCE_RX_FC | \
-				 MAC_MCR_FORCE_TX_FC | MAC_MCR_SPEED_1000 | \
-				 MAC_MCR_FORCE_DPX | MAC_MCR_FORCE_LINK)
+#define MAC_MCR_FORCE_LINK_DOWN	(MAC_MCR_FORCE_MODE)
+
+/* Mac status registers */
+#define MTK_MAC_MSR(x)		(0x10108 + (x * 0x100))
+#define MAC_MSR_EEE1G		BIT(7)
+#define MAC_MSR_EEE100M		BIT(6)
+#define MAC_MSR_RX_FC		BIT(5)
+#define MAC_MSR_TX_FC		BIT(4)
+#define MAC_MSR_SPEED_1000	BIT(3)
+#define MAC_MSR_SPEED_100	BIT(2)
+#define MAC_MSR_SPEED_MASK	(MAC_MSR_SPEED_1000 | MAC_MSR_SPEED_100)
+#define MAC_MSR_DPX		BIT(1)
+#define MAC_MSR_LINK		BIT(0)
 
 /* TRGMII RXC control register */
 #define TRGMII_RCK_CTRL		0x10300
@@ -858,22 +866,23 @@ struct mtk_eth {
 /* struct mtk_mac -	the structure that holds the info about the MACs of the
  *			SoC
  * @id:			The number of the MAC
- * @ge_mode:            Interface mode kept for setup restoring
+ * @interface:		Interface mode kept for detecting change in hw settings
  * @of_node:		Our devicetree node
  * @hw:			Backpointer to our main datastruture
  * @hw_stats:		Packet statistics counter
- * @trgmii		Indicate if the MAC uses TRGMII connected to internal
-			switch
  */
 struct mtk_mac {
 	int				id;
-	int				ge_mode;
+	phy_interface_t			interface;
+	unsigned int			mode;
+	int				speed;
 	struct device_node		*of_node;
+	struct phylink			*phylink;
+	struct phylink_config		phylink_config;
 	struct mtk_eth			*hw;
 	struct mtk_hw_stats		*hw_stats;
 	__be32				hwlro_ip[MTK_MAX_LRO_IP_CNT];
 	int				hwlro_ip_cnt;
-	bool				trgmii;
 };
 
 /* the struct describing the SoC. these are declared in the soc_xyz.c files */
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v3 2/3] net: ethernet: mediatek: Re-add support SGMII
From: René van Dorst @ 2019-08-23 13:45 UTC (permalink / raw)
  To: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
	Matthias Brugger
  Cc: netdev, linux-arm-kernel, linux-mediatek, linux-mips,
	Russell King, Frank Wunderlich, Stefan Roese, René van Dorst
In-Reply-To: <20190823134516.27559-1-opensource@vdorst.com>

* Re-add SGMII support but now with PHYLINK API support
  So the SGMII changes are more clear
* Move SGMII block setup from mtk_gmac_sgmii_path_setup() to
  mtk_mac_config()
* Merge mtk_setup_hw_path() into mtk_mac_config()
* Remove mediatek,physpeed property, fixed-link supports now any speed so
  speed = <2500>; is now valid with PHYLINK.
* Demagic SGMII register values
* Use phylink state to setup fixed-link mode

Signed-off-by: René van Dorst <opensource@vdorst.com>
--
v2->v3:
* Redo validate(), it was totally wrong. Noticed by Russell King.
v1->v2:
* SGMII port only support SGMII at 1Gbit, 1000BASE-X and 2500BASE-X.
  Also SGMII mode only does auto-negotiation.
* Change validate() to support mt76x8 changes.
---
 drivers/net/ethernet/mediatek/mtk_eth_path.c |  75 +--------
 drivers/net/ethernet/mediatek/mtk_eth_soc.c  | 157 ++++++++++++++++---
 drivers/net/ethernet/mediatek/mtk_eth_soc.h  |  37 ++++-
 drivers/net/ethernet/mediatek/mtk_sgmii.c    |  65 +++++---
 4 files changed, 219 insertions(+), 115 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_path.c b/drivers/net/ethernet/mediatek/mtk_eth_path.c
index 28960e4c4e43..ef11cf3d1ccc 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_path.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_path.c
@@ -239,10 +239,9 @@ static int mtk_eth_mux_setup(struct mtk_eth *eth, int path)
 	return err;
 }
 
-static int mtk_gmac_sgmii_path_setup(struct mtk_eth *eth, int mac_id)
+int mtk_gmac_sgmii_path_setup(struct mtk_eth *eth, int mac_id)
 {
-	unsigned int val = 0;
-	int sid, err, path;
+	int err, path;
 
 	path = (mac_id == 0) ?  MTK_ETH_PATH_GMAC1_SGMII :
 				MTK_ETH_PATH_GMAC2_SGMII;
@@ -252,33 +251,10 @@ static int mtk_gmac_sgmii_path_setup(struct mtk_eth *eth, int mac_id)
 	if (err)
 		return err;
 
-	/* The path GMAC to SGMII will be enabled once the SGMIISYS is being
-	 * setup done.
-	 */
-	regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
-
-	regmap_update_bits(eth->ethsys, ETHSYS_SYSCFG0,
-			   SYSCFG0_SGMII_MASK, ~(u32)SYSCFG0_SGMII_MASK);
-
-	/* Decide how GMAC and SGMIISYS be mapped */
-	sid = (MTK_HAS_CAPS(eth->soc->caps, MTK_SHARED_SGMII)) ? 0 : mac_id;
-
-	/* Setup SGMIISYS with the determined property */
-	if (MTK_HAS_FLAGS(eth->sgmii->flags[sid], MTK_SGMII_PHYSPEED_AN))
-		err = mtk_sgmii_setup_mode_an(eth->sgmii, sid);
-	else
-		err = mtk_sgmii_setup_mode_force(eth->sgmii, sid);
-
-	if (err)
-		return err;
-
-	regmap_update_bits(eth->ethsys, ETHSYS_SYSCFG0,
-			   SYSCFG0_SGMII_MASK, val);
-
 	return 0;
 }
 
-static int mtk_gmac_gephy_path_setup(struct mtk_eth *eth, int mac_id)
+int mtk_gmac_gephy_path_setup(struct mtk_eth *eth, int mac_id)
 {
 	int err, path = 0;
 
@@ -296,7 +272,7 @@ static int mtk_gmac_gephy_path_setup(struct mtk_eth *eth, int mac_id)
 	return 0;
 }
 
-static int mtk_gmac_rgmii_path_setup(struct mtk_eth *eth, int mac_id)
+int mtk_gmac_rgmii_path_setup(struct mtk_eth *eth, int mac_id)
 {
 	int err, path;
 
@@ -311,46 +287,3 @@ static int mtk_gmac_rgmii_path_setup(struct mtk_eth *eth, int mac_id)
 	return 0;
 }
 
-int mtk_setup_hw_path(struct mtk_eth *eth, int mac_id, int phymode)
-{
-	int err;
-
-	/* No mux'ing for MT7628/88 */
-	if (MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628))
-		return 0;
-
-	switch (phymode) {
-	case PHY_INTERFACE_MODE_TRGMII:
-	case PHY_INTERFACE_MODE_RGMII_TXID:
-	case PHY_INTERFACE_MODE_RGMII_RXID:
-	case PHY_INTERFACE_MODE_RGMII_ID:
-	case PHY_INTERFACE_MODE_RGMII:
-	case PHY_INTERFACE_MODE_MII:
-	case PHY_INTERFACE_MODE_REVMII:
-	case PHY_INTERFACE_MODE_RMII:
-		if (MTK_HAS_CAPS(eth->soc->caps, MTK_RGMII)) {
-			err = mtk_gmac_rgmii_path_setup(eth, mac_id);
-			if (err)
-				return err;
-		}
-		break;
-	case PHY_INTERFACE_MODE_SGMII:
-		if (MTK_HAS_CAPS(eth->soc->caps, MTK_SGMII)) {
-			err = mtk_gmac_sgmii_path_setup(eth, mac_id);
-			if (err)
-				return err;
-		}
-		break;
-	case PHY_INTERFACE_MODE_GMII:
-		if (MTK_HAS_CAPS(eth->soc->caps, MTK_GEPHY)) {
-			err = mtk_gmac_gephy_path_setup(eth, mac_id);
-			if (err)
-				return err;
-		}
-		break;
-	default:
-		break;
-	}
-
-	return 0;
-}
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index a04baad6337c..d471a8d16aa5 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -193,8 +193,8 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 	struct mtk_mac *mac = container_of(config, struct mtk_mac,
 					   phylink_config);
 	struct mtk_eth *eth = mac->hw;
-	u32 mcr_cur, mcr_new;
-	int val, ge_mode = 0;
+	u32 mcr_cur, mcr_new, sid;
+	int val, ge_mode, err;
 
 	/* MT76x8 has no hardware settings between for the MAC */
 	if (!MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628) &&
@@ -208,29 +208,42 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 					  MTK_GMAC1_TRGMII))
 				goto err_phy;
 			/* fall through */
-		case PHY_INTERFACE_MODE_GMII:
 		case PHY_INTERFACE_MODE_RGMII_TXID:
 		case PHY_INTERFACE_MODE_RGMII_RXID:
 		case PHY_INTERFACE_MODE_RGMII_ID:
 		case PHY_INTERFACE_MODE_RGMII:
-			break;
 		case PHY_INTERFACE_MODE_MII:
-			ge_mode = 1;
-			break;
 		case PHY_INTERFACE_MODE_REVMII:
-			ge_mode = 2;
-			break;
 		case PHY_INTERFACE_MODE_RMII:
-			if (mac->id)
-				goto err_phy;
-			ge_mode = 3;
+			if (MTK_HAS_CAPS(eth->soc->caps, MTK_RGMII)) {
+				err = mtk_gmac_rgmii_path_setup(eth, mac->id);
+				if (err)
+					goto init_err;
+			}
+			break;
+		case PHY_INTERFACE_MODE_1000BASEX:
+		case PHY_INTERFACE_MODE_2500BASEX:
+		case PHY_INTERFACE_MODE_SGMII:
+			if (MTK_HAS_CAPS(eth->soc->caps, MTK_SGMII)) {
+				err = mtk_gmac_sgmii_path_setup(eth, mac->id);
+				if (err)
+					goto init_err;
+			}
+			break;
+		case PHY_INTERFACE_MODE_GMII:
+			if (MTK_HAS_CAPS(eth->soc->caps, MTK_GEPHY)) {
+				err = mtk_gmac_gephy_path_setup(eth, mac->id);
+				if (err)
+					goto init_err;
+			}
 			break;
 		default:
 			goto err_phy;
 		}
 
 		/* Setup clock for 1st gmac */
-		if (!mac->id &&
+		if (!mac->id && state->interface != PHY_INTERFACE_MODE_SGMII &&
+		    !phy_interface_mode_is_8023z(state->interface) &&
 		    MTK_HAS_CAPS(mac->hw->soc->caps, MTK_GMAC1_TRGMII)) {
 			if (MTK_HAS_CAPS(mac->hw->soc->caps,
 					 MTK_TRGMII_MT7621_CLK)) {
@@ -245,6 +258,23 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 			}
 		}
 
+		ge_mode = 0;
+		switch (state->interface) {
+		case PHY_INTERFACE_MODE_MII:
+			ge_mode = 1;
+			break;
+		case PHY_INTERFACE_MODE_REVMII:
+			ge_mode = 2;
+			break;
+		case PHY_INTERFACE_MODE_RMII:
+			if (mac->id)
+				goto err_phy;
+			ge_mode = 3;
+			break;
+		default:
+			break;
+		}
+
 		/* put the gmac into the right mode */
 		regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
 		val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, mac->id);
@@ -254,6 +284,40 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 		mac->interface = state->interface;
 	}
 
+	/* SGMII */
+	if (state->interface == PHY_INTERFACE_MODE_SGMII ||
+	    phy_interface_mode_is_8023z(state->interface)) {
+		/* The path GMAC to SGMII will be enabled once the SGMIISYS is
+		 * being setup done.
+		 */
+		regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
+
+		regmap_update_bits(eth->ethsys, ETHSYS_SYSCFG0,
+				   SYSCFG0_SGMII_MASK,
+				   ~(u32)SYSCFG0_SGMII_MASK);
+
+		/* Decide how GMAC and SGMIISYS be mapped */
+		sid = (MTK_HAS_CAPS(eth->soc->caps, MTK_SHARED_SGMII)) ?
+		       0 : mac->id;
+
+		/* Setup SGMIISYS with the determined property */
+		if (state->interface != PHY_INTERFACE_MODE_SGMII)
+			err = mtk_sgmii_setup_mode_force(eth->sgmii, sid,
+							 state);
+		else if (phylink_autoneg_inband(mode))
+			err = mtk_sgmii_setup_mode_an(eth->sgmii, sid);
+
+		if (err)
+			goto init_err;
+
+		regmap_update_bits(eth->ethsys, ETHSYS_SYSCFG0,
+				   SYSCFG0_SGMII_MASK, val);
+	} else if (phylink_autoneg_inband(mode)) {
+		dev_err(eth->dev,
+			"In-band mode not supported in non SGMII mode!\n");
+		return;
+	}
+
 	/* Setup gmac */
 	mcr_cur = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
 	mcr_new = mcr_cur;
@@ -264,6 +328,7 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 		   MAC_MCR_BACKOFF_EN | MAC_MCR_BACKPR_EN | MAC_MCR_FORCE_LINK;
 
 	switch (state->speed) {
+	case SPEED_2500:
 	case SPEED_1000:
 		mcr_new |= MAC_MCR_SPEED_1000;
 		break;
@@ -288,6 +353,11 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 err_phy:
 	dev_err(eth->dev, "%s: GMAC%d mode %s not supported!\n", __func__,
 		mac->id, phy_modes(state->interface));
+	return;
+
+init_err:
+	dev_err(eth->dev, "%s: GMAC%d mode %s err: %d!\n", __func__,
+		mac->id, phy_modes(state->interface), err);
 }
 
 static int mtk_mac_link_state(struct phylink_config *config,
@@ -326,7 +396,10 @@ static int mtk_mac_link_state(struct phylink_config *config,
 
 static void mtk_mac_an_restart(struct phylink_config *config)
 {
-	/* Do nothing */
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+
+	mtk_sgmii_restart_an(mac->hw, mac->id);
 }
 
 static void mtk_mac_link_down(struct phylink_config *config, unsigned int mode,
@@ -366,7 +439,10 @@ static void mtk_validate(struct phylink_config *config,
 	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_RGMII) &&
 	      phy_interface_mode_is_rgmii(state->interface)) &&
 	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_TRGMII) &&
-	      !mac->id && state->interface == PHY_INTERFACE_MODE_TRGMII)) {
+	      !mac->id && state->interface == PHY_INTERFACE_MODE_TRGMII) &&
+	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_SGMII) &&
+	      (state->interface == PHY_INTERFACE_MODE_SGMII ||
+	       phy_interface_mode_is_8023z(state->interface)))) {
 		linkmode_zero(supported);
 		return;
 	}
@@ -374,18 +450,58 @@ static void mtk_validate(struct phylink_config *config,
 	phylink_set_port_modes(mask);
 	phylink_set(mask, Autoneg);
 
-	if (state->interface == PHY_INTERFACE_MODE_TRGMII) {
+	switch (state->interface) {
+	case PHY_INTERFACE_MODE_SGMII:
+		phylink_set(mask, 10baseT_Half);
+		phylink_set(mask, 10baseT_Full);
+		phylink_set(mask, 100baseT_Half);
+		phylink_set(mask, 100baseT_Full);
+		/* fall through */
+	case PHY_INTERFACE_MODE_TRGMII:
 		phylink_set(mask, 1000baseT_Full);
-	} else {
+		break;
+	case PHY_INTERFACE_MODE_2500BASEX:
+		phylink_set(mask, 2500baseX_Full);
+		/* fall through */
+	case PHY_INTERFACE_MODE_1000BASEX:
+		phylink_set(mask, 1000baseX_Full);
+		break;
+	case PHY_INTERFACE_MODE_RGMII:
+	case PHY_INTERFACE_MODE_RGMII_ID:
+	case PHY_INTERFACE_MODE_RGMII_RXID:
+	case PHY_INTERFACE_MODE_RGMII_TXID:
+		phylink_set(mask, 1000baseX_Full);
+		/* fall through */
+	case PHY_INTERFACE_MODE_GMII:
+		phylink_set(mask, 1000baseT_Full);
+		phylink_set(mask, 1000baseT_Half);
+		/* fall through */
+	case PHY_INTERFACE_MODE_MII:
+	case PHY_INTERFACE_MODE_RMII:
+	case PHY_INTERFACE_MODE_REVMII:
+	case PHY_INTERFACE_MODE_NA:
+	default:
 		phylink_set(mask, 10baseT_Half);
 		phylink_set(mask, 10baseT_Full);
 		phylink_set(mask, 100baseT_Half);
 		phylink_set(mask, 100baseT_Full);
+		break;
+	}
 
-		if (state->interface != PHY_INTERFACE_MODE_MII) {
-			phylink_set(mask, 1000baseT_Half);
+	if (state->interface == PHY_INTERFACE_MODE_NA) {
+		if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_SGMII)) {
 			phylink_set(mask, 1000baseT_Full);
 			phylink_set(mask, 1000baseX_Full);
+			phylink_set(mask, 2500baseX_Full);
+		}
+		if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_RGMII)) {
+			phylink_set(mask, 1000baseT_Full);
+			phylink_set(mask, 1000baseT_Half);
+			phylink_set(mask, 1000baseX_Full);
+		}
+		if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_GEPHY)) {
+			phylink_set(mask, 1000baseT_Full);
+			phylink_set(mask, 1000baseT_Half);
 		}
 	}
 
@@ -394,6 +510,11 @@ static void mtk_validate(struct phylink_config *config,
 
 	linkmode_and(supported, supported, mask);
 	linkmode_and(state->advertising, state->advertising, mask);
+
+	/* We can only operate at 2500BaseX or 1000BaseX. If requested
+	 * to advertise both, only report advertising at 2500BaseX.
+	 */
+	phylink_helper_basex_speed(state);
 }
 
 static const struct phylink_mac_ops mtk_phylink_ops = {
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 7f5f541daad7..76bd12cb8150 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -412,14 +412,38 @@
 /* Register to auto-negotiation restart */
 #define SGMSYS_PCS_CONTROL_1	0x0
 #define SGMII_AN_RESTART	BIT(9)
+#define SGMII_ISOLATE		BIT(10)
+#define SGMII_AN_ENABLE		BIT(12)
+#define SGMII_LINK_STATYS	BIT(18)
+#define SGMII_AN_ABILITY	BIT(19)
+#define SGMII_AN_COMPLETE	BIT(21)
+#define SGMII_PCS_FAULT		BIT(23)
+#define SGMII_AN_EXPANSION_CLR	BIT(30)
 
 /* Register to programmable link timer, the unit in 2 * 8ns */
 #define SGMSYS_PCS_LINK_TIMER	0x18
 #define SGMII_LINK_TIMER_DEFAULT	(0x186a0 & GENMASK(19, 0))
 
 /* Register to control remote fault */
-#define SGMSYS_SGMII_MODE	0x20
-#define SGMII_REMOTE_FAULT_DIS	BIT(8)
+#define SGMSYS_SGMII_MODE		0x20
+#define SGMII_IF_MODE_BIT0		BIT(0)
+#define SGMII_SPEED_DUPLEX_AN		BIT(1)
+#define SGMII_SPEED_10			0x0
+#define SGMII_SPEED_100			BIT(2)
+#define SGMII_SPEED_1000		BIT(3)
+#define SGMII_DUPLEX_FULL		BIT(4)
+#define SGMII_IF_MODE_BIT5		BIT(5)
+#define SGMII_REMOTE_FAULT_DIS		BIT(8)
+#define SGMII_CODE_SYNC_SET_VAL		BIT(9)
+#define SGMII_CODE_SYNC_SET_EN		BIT(10)
+#define SGMII_SEND_AN_ERROR_EN		BIT(11)
+#define SGMII_IF_MODE_MASK		GENMASK(5, 1)
+
+/* Register to set SGMII speed, ANA RG_ Control Signals III*/
+#define SGMSYS_ANA_RG_CS3	0x2028
+#define RG_PHY_SPEED_MASK	(BIT(2) | BIT(3))
+#define RG_PHY_SPEED_1_25G	0x0
+#define RG_PHY_SPEED_3_125G	BIT(2)
 
 /* Register to power up QPHY */
 #define SGMSYS_QPHY_PWR_STATE_CTRL 0xe8
@@ -897,7 +921,12 @@ u32 mtk_r32(struct mtk_eth *eth, unsigned reg);
 int mtk_sgmii_init(struct mtk_sgmii *ss, struct device_node *np,
 		   u32 ana_rgc3);
 int mtk_sgmii_setup_mode_an(struct mtk_sgmii *ss, int id);
-int mtk_sgmii_setup_mode_force(struct mtk_sgmii *ss, int id);
-int mtk_setup_hw_path(struct mtk_eth *eth, int mac_id, int phymode);
+int mtk_sgmii_setup_mode_force(struct mtk_sgmii *ss, int id,
+			       const struct phylink_link_state *state);
+void mtk_sgmii_restart_an(struct mtk_eth *eth, int mac_id);
+
+int mtk_gmac_sgmii_path_setup(struct mtk_eth *eth, int mac_id);
+int mtk_gmac_gephy_path_setup(struct mtk_eth *eth, int mac_id);
+int mtk_gmac_rgmii_path_setup(struct mtk_eth *eth, int mac_id);
 
 #endif /* MTK_ETH_H */
diff --git a/drivers/net/ethernet/mediatek/mtk_sgmii.c b/drivers/net/ethernet/mediatek/mtk_sgmii.c
index ff509d42d818..4db27dfc7ec1 100644
--- a/drivers/net/ethernet/mediatek/mtk_sgmii.c
+++ b/drivers/net/ethernet/mediatek/mtk_sgmii.c
@@ -16,8 +16,7 @@
 int mtk_sgmii_init(struct mtk_sgmii *ss, struct device_node *r, u32 ana_rgc3)
 {
 	struct device_node *np;
-	const char *str;
-	int i, err;
+	int i;
 
 	ss->ana_rgc3 = ana_rgc3;
 
@@ -29,19 +28,6 @@ int mtk_sgmii_init(struct mtk_sgmii *ss, struct device_node *r, u32 ana_rgc3)
 		ss->regmap[i] = syscon_node_to_regmap(np);
 		if (IS_ERR(ss->regmap[i]))
 			return PTR_ERR(ss->regmap[i]);
-
-		err = of_property_read_string(np, "mediatek,physpeed", &str);
-		if (err)
-			return err;
-
-		if (!strcmp(str, "2500"))
-			ss->flags[i] |= MTK_SGMII_PHYSPEED_2500;
-		else if (!strcmp(str, "1000"))
-			ss->flags[i] |= MTK_SGMII_PHYSPEED_1000;
-		else if (!strcmp(str, "auto"))
-			ss->flags[i] |= MTK_SGMII_PHYSPEED_AN;
-		else
-			return -EINVAL;
 	}
 
 	return 0;
@@ -73,27 +59,45 @@ int mtk_sgmii_setup_mode_an(struct mtk_sgmii *ss, int id)
 	return 0;
 }
 
-int mtk_sgmii_setup_mode_force(struct mtk_sgmii *ss, int id)
+int mtk_sgmii_setup_mode_force(struct mtk_sgmii *ss, int id,
+			       const struct phylink_link_state *state)
 {
 	unsigned int val;
-	int mode;
 
 	if (!ss->regmap[id])
 		return -EINVAL;
 
 	regmap_read(ss->regmap[id], ss->ana_rgc3, &val);
-	val &= ~GENMASK(3, 2);
-	mode = ss->flags[id] & MTK_SGMII_PHYSPEED_MASK;
-	val |= (mode == MTK_SGMII_PHYSPEED_1000) ? 0 : BIT(2);
+	val &= ~RG_PHY_SPEED_MASK;
+	if (state->interface == PHY_INTERFACE_MODE_2500BASEX)
+		val |= RG_PHY_SPEED_3_125G;
 	regmap_write(ss->regmap[id], ss->ana_rgc3, val);
 
 	/* Disable SGMII AN */
 	regmap_read(ss->regmap[id], SGMSYS_PCS_CONTROL_1, &val);
-	val &= ~BIT(12);
+	val &= ~SGMII_AN_ENABLE;
 	regmap_write(ss->regmap[id], SGMSYS_PCS_CONTROL_1, val);
 
 	/* SGMII force mode setting */
-	val = 0x31120019;
+	regmap_read(ss->regmap[id], SGMSYS_SGMII_MODE, &val);
+	val &= ~SGMII_IF_MODE_MASK;
+
+	switch (state->speed) {
+	case SPEED_10:
+		val |= SGMII_SPEED_10;
+		break;
+	case SPEED_100:
+		val |= SGMII_SPEED_100;
+		break;
+	case SPEED_2500:
+	case SPEED_1000:
+		val |= SGMII_SPEED_1000;
+		break;
+	};
+
+	if (state->duplex == DUPLEX_FULL)
+		val |= SGMII_DUPLEX_FULL;
+
 	regmap_write(ss->regmap[id], SGMSYS_SGMII_MODE, val);
 
 	/* Release PHYA power down state */
@@ -103,3 +107,20 @@ int mtk_sgmii_setup_mode_force(struct mtk_sgmii *ss, int id)
 
 	return 0;
 }
+
+void mtk_sgmii_restart_an(struct mtk_eth *eth, int mac_id)
+{
+	struct mtk_sgmii *ss = eth->sgmii;
+	unsigned int val, sid;
+
+	/* Decide how GMAC and SGMIISYS be mapped */
+	sid = (MTK_HAS_CAPS(eth->soc->caps, MTK_SHARED_SGMII)) ?
+	       0 : mac_id;
+
+	if (!ss->regmap[sid])
+		return;
+
+	regmap_read(ss->regmap[sid], SGMSYS_PCS_CONTROL_1, &val);
+	val |= SGMII_AN_RESTART;
+	regmap_write(ss->regmap[sid], SGMSYS_PCS_CONTROL_1, val);
+}
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next] net/rds: Whitelist rdma_cookie and rx_tstamp for usercopy
From: Dag Moxnes @ 2019-08-23 14:03 UTC (permalink / raw)
  To: santosh.shilimkar, netdev, linux-rdma, rds-devel; +Cc: davem, dag.moxnes

Add the RDMA cookie and RX timestamp to the usercopy whitelist.

After the introduction of hardened usercopy whitelisting
(https://lwn.net/Articles/727322/), a warning is displayed when the
RDMA cookie or RX timestamp is copied to userspace:

kernel: WARNING: CPU: 3 PID: 5750 at
mm/usercopy.c:81 usercopy_warn+0x8e/0xa6
[...]
kernel: Call Trace:
kernel: __check_heap_object+0xb8/0x11b
kernel: __check_object_size+0xe3/0x1bc
kernel: put_cmsg+0x95/0x115
kernel: rds_recvmsg+0x43d/0x620 [rds]
kernel: sock_recvmsg+0x43/0x4a
kernel: ___sys_recvmsg+0xda/0x1e6
kernel: ? __handle_mm_fault+0xcae/0xf79
kernel: __sys_recvmsg+0x51/0x8a
kernel: SyS_recvmsg+0x12/0x1c
kernel: do_syscall_64+0x79/0x1ae

When the whitelisting feature was introduced, the memory for the RDMA
cookie and RX timestamp in RDS was not added to the whitelist, causing
the warning above.

Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
Tested-by: jenny.x.xu@oracle.com
---
 net/rds/ib_recv.c | 11 ++++++++---
 net/rds/rds.h     |  9 +++++++--
 net/rds/recv.c    | 22 ++++++++++++----------
 3 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 3cae88cbda..fecd0abdc7 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -1038,9 +1038,14 @@ int rds_ib_recv_init(void)
 	si_meminfo(&si);
 	rds_ib_sysctl_max_recv_allocation = si.totalram / 3 * PAGE_SIZE / RDS_FRAG_SIZE;
 
-	rds_ib_incoming_slab = kmem_cache_create("rds_ib_incoming",
-					sizeof(struct rds_ib_incoming),
-					0, SLAB_HWCACHE_ALIGN, NULL);
+	rds_ib_incoming_slab =
+		kmem_cache_create_usercopy("rds_ib_incoming",
+					   sizeof(struct rds_ib_incoming),
+					   0, SLAB_HWCACHE_ALIGN,
+					   offsetof(struct rds_ib_incoming,
+						    ii_inc.i_usercopy),
+					   sizeof(struct rds_inc_usercopy),
+					   NULL);
 	if (!rds_ib_incoming_slab)
 		goto out;
 
diff --git a/net/rds/rds.h b/net/rds/rds.h
index f0066d1684..e792a67dd5 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -271,6 +271,12 @@ struct rds_ext_header_rdma_dest {
 #define	RDS_MSG_RX_END		2
 #define	RDS_MSG_RX_CMSG		3
 
+/* The following values are whitelisted for usercopy */
+struct rds_inc_usercopy {
+	rds_rdma_cookie_t	rdma_cookie;
+	ktime_t			rx_tstamp;
+};
+
 struct rds_incoming {
 	refcount_t		i_refcount;
 	struct list_head	i_item;
@@ -280,8 +286,7 @@ struct rds_incoming {
 	unsigned long		i_rx_jiffies;
 	struct in6_addr		i_saddr;
 
-	rds_rdma_cookie_t	i_rdma_cookie;
-	ktime_t			i_rx_tstamp;
+	struct rds_inc_usercopy i_usercopy;
 	u64			i_rx_lat_trace[RDS_RX_MAX_TRACES];
 };
 
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 853de48760..7e451c8259 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -47,8 +47,8 @@ void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
 	INIT_LIST_HEAD(&inc->i_item);
 	inc->i_conn = conn;
 	inc->i_saddr = *saddr;
-	inc->i_rdma_cookie = 0;
-	inc->i_rx_tstamp = ktime_set(0, 0);
+	inc->i_usercopy.rdma_cookie = 0;
+	inc->i_usercopy.rx_tstamp = ktime_set(0, 0);
 
 	memset(inc->i_rx_lat_trace, 0, sizeof(inc->i_rx_lat_trace));
 }
@@ -62,8 +62,8 @@ void rds_inc_path_init(struct rds_incoming *inc, struct rds_conn_path *cp,
 	inc->i_conn = cp->cp_conn;
 	inc->i_conn_path = cp;
 	inc->i_saddr = *saddr;
-	inc->i_rdma_cookie = 0;
-	inc->i_rx_tstamp = ktime_set(0, 0);
+	inc->i_usercopy.rdma_cookie = 0;
+	inc->i_usercopy.rx_tstamp = ktime_set(0, 0);
 }
 EXPORT_SYMBOL_GPL(rds_inc_path_init);
 
@@ -186,7 +186,7 @@ static void rds_recv_incoming_exthdrs(struct rds_incoming *inc, struct rds_sock
 		case RDS_EXTHDR_RDMA_DEST:
 			/* We ignore the size for now. We could stash it
 			 * somewhere and use it for error checking. */
-			inc->i_rdma_cookie = rds_rdma_make_cookie(
+			inc->i_usercopy.rdma_cookie = rds_rdma_make_cookie(
 					be32_to_cpu(buffer.rdma_dest.h_rdma_rkey),
 					be32_to_cpu(buffer.rdma_dest.h_rdma_offset));
 
@@ -380,7 +380,7 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
 				      be32_to_cpu(inc->i_hdr.h_len),
 				      inc->i_hdr.h_dport);
 		if (sock_flag(sk, SOCK_RCVTSTAMP))
-			inc->i_rx_tstamp = ktime_get_real();
+			inc->i_usercopy.rx_tstamp = ktime_get_real();
 		rds_inc_addref(inc);
 		inc->i_rx_lat_trace[RDS_MSG_RX_END] = local_clock();
 		list_add_tail(&inc->i_item, &rs->rs_recv_queue);
@@ -540,16 +540,18 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
 {
 	int ret = 0;
 
-	if (inc->i_rdma_cookie) {
+	if (inc->i_usercopy.rdma_cookie) {
 		ret = put_cmsg(msg, SOL_RDS, RDS_CMSG_RDMA_DEST,
-				sizeof(inc->i_rdma_cookie), &inc->i_rdma_cookie);
+				sizeof(inc->i_usercopy.rdma_cookie),
+				&inc->i_usercopy.rdma_cookie);
 		if (ret)
 			goto out;
 	}
 
-	if ((inc->i_rx_tstamp != 0) &&
+	if ((inc->i_usercopy.rx_tstamp != 0) &&
 	    sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) {
-		struct __kernel_old_timeval tv = ns_to_kernel_old_timeval(inc->i_rx_tstamp);
+		struct __kernel_old_timeval tv =
+			ns_to_kernel_old_timeval(inc->i_usercopy.rx_tstamp);
 
 		if (!sock_flag(rds_rs_to_sk(rs), SOCK_TSTAMP_NEW)) {
 			ret = put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMP_OLD,
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] ethernet: Delete unnecessary checks before the macro call “dev_kfree_skb”
From: Christophe JAILLET @ 2019-08-23 14:08 UTC (permalink / raw)
  To: Markus Elfring, netdev, linux-arm-kernel, linux-stm32,
	intel-wired-lan, bcm-kernel-feedback-list, UNGLinuxDriver,
	Alexandre Torgue, Alexios Zavras, Allison Randal, Bryan Whitehead,
	Claudiu Manoil, David S. Miller, Doug Berger, Douglas Miller,
	Florian Fainelli, Giuseppe Cavallaro, Greg Kroah-Hartman,
	Jeff Kirsher, Jilayne Lovejoy, Jonathan Lemon, Jose Abreu,
	Kate Stewart
  Cc: kernel-janitors, LKML
In-Reply-To: <af1ae1cf-4a01-5e3a-edc2-058668487137@web.de>

Hi,

in this patch, there is one piece that looked better before. (see below)

Removing the 'if (skb)' is fine, but concatening everything in one 
statement just to save 2 variables and a few LOC is of no use, IMHO, and 
the code is less readable.

just my 2c.


CJ


diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index d3a0b614dbfa..8b19ddcdafaa 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -2515,19 +2515,14 @@ static int bcmgenet_dma_teardown(struct 
bcmgenet_priv *priv)
  static void bcmgenet_fini_dma(struct bcmgenet_priv *priv)
  {
      struct netdev_queue *txq;
-    struct sk_buff *skb;
-    struct enet_cb *cb;
      int i;

      bcmgenet_fini_rx_napi(priv);
      bcmgenet_fini_tx_napi(priv);

-    for (i = 0; i < priv->num_tx_bds; i++) {
-        cb = priv->tx_cbs + i;
-        skb = bcmgenet_free_tx_cb(&priv->pdev->dev, cb);
-        if (skb)
-            dev_kfree_skb(skb);
-    }
+    for (i = 0; i < priv->num_tx_bds; i++)
+ dev_kfree_skb(bcmgenet_free_tx_cb(&priv->pdev->dev,
+                          priv->tx_cbs + i));

      for (i = 0; i < priv->hw_params->tx_queues; i++) {
          txq = netdev_get_tx_queue(priv->dev, priv->tx_rings[i].queue);

^ permalink raw reply related

* Re: [PATCH net-next] bnxt_en: Fix allocation of zero statistics block size regression.
From: Jonathan Lemon @ 2019-08-23 14:24 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev, kernel-team
In-Reply-To: <1566539501-5884-1-git-send-email-michael.chan@broadcom.com>



On 22 Aug 2019, at 22:51, Michael Chan wrote:

> Recent commit added logic to determine the appropriate statistics 
> block
> size to allocate and the size is stored in bp->hw_ring_stats_size.  
> But
> if the firmware spec is older than 1.6.0, it is 0 and not initialized.
> This causes the allocation to fail with size 0 and bnxt_open() to
> abort.  Fix it by always initializing bp->hw_ring_stats_size to the
> legacy default size value.
>
> Fixes: 4e7485066373 ("bnxt_en: Allocate the larger per-ring statistics 
> block for 57500 chips.")
> Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>

Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

Thanks, Michael!
-- 
Jonathan

^ permalink raw reply

* Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Alex Williamson @ 2019-08-23 14:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jiri Pirko, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866DED407D6F1C653D5D560D1A40@AM0PR05MB4866.eurprd05.prod.outlook.com>

On Fri, 23 Aug 2019 08:14:39 +0000
Parav Pandit <parav@mellanox.com> wrote:

> Hi Alex,
> 
> 
> > -----Original Message-----
> > From: Jiri Pirko <jiri@resnulli.us>
> > Sent: Friday, August 23, 2019 1:42 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> > Wankhede <kwankhede@nvidia.com>; Cornelia Huck <cohuck@redhat.com>;
> > kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
> > netdev@vger.kernel.org
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > 
> > Thu, Aug 22, 2019 at 03:33:30PM CEST, parav@mellanox.com wrote:  
> > >
> > >  
> > >> -----Original Message-----
> > >> From: Jiri Pirko <jiri@resnulli.us>
> > >> Sent: Thursday, August 22, 2019 5:50 PM
> > >> To: Parav Pandit <parav@mellanox.com>
> > >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> > >> Wankhede <kwankhede@nvidia.com>; Cornelia Huck  
> > <cohuck@redhat.com>;  
> > >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >>
> > >> Thu, Aug 22, 2019 at 12:04:02PM CEST, parav@mellanox.com wrote:  
> > >> >
> > >> >  
> > >> >> -----Original Message-----
> > >> >> From: Jiri Pirko <jiri@resnulli.us>
> > >> >> Sent: Thursday, August 22, 2019 3:28 PM
> > >> >> To: Parav Pandit <parav@mellanox.com>
> > >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > >> >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> > >> >> Wankhede <kwankhede@nvidia.com>; Cornelia Huck  
> > >> <cohuck@redhat.com>;  
> > >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >> >>
> > >> >> Thu, Aug 22, 2019 at 11:42:13AM CEST, parav@mellanox.com wrote:  
> > >> >> >
> > >> >> >  
> > >> >> >> -----Original Message-----
> > >> >> >> From: Jiri Pirko <jiri@resnulli.us>
> > >> >> >> Sent: Thursday, August 22, 2019 2:59 PM
> > >> >> >> To: Parav Pandit <parav@mellanox.com>
> > >> >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > >> >> >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> > >> >> >> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck  
> > >> >> <cohuck@redhat.com>;  
> > >> >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > >> >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >> >> >>
> > >> >> >> Wed, Aug 21, 2019 at 08:23:17AM CEST, parav@mellanox.com wrote:  
> > >> >> >> >
> > >> >> >> >  
> > >> >> >> >> -----Original Message-----
> > >> >> >> >> From: Alex Williamson <alex.williamson@redhat.com>
> > >> >> >> >> Sent: Wednesday, August 21, 2019 10:56 AM
> > >> >> >> >> To: Parav Pandit <parav@mellanox.com>
> > >> >> >> >> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller
> > >> >> >> >> <davem@davemloft.net>; Kirti Wankhede
> > >> >> >> >> <kwankhede@nvidia.com>; Cornelia Huck <cohuck@redhat.com>;
> > >> >> >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > >> >> >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > >> >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev
> > >> >> >> >> core
> > >> >> >> >>  
> > >> >> >> >> > > > > Just an example of the alias, not proposing how it's set.
> > >> >> >> >> > > > > In fact, proposing that the user does not set it,
> > >> >> >> >> > > > > mdev-core provides one  
> > >> >> >> >> > > automatically.  
> > >> >> >> >> > > > >  
> > >> >> >> >> > > > > > > Since there seems to be some prefix overhead, as
> > >> >> >> >> > > > > > > I ask about above in how many characters we
> > >> >> >> >> > > > > > > actually have to work with in IFNAMESZ, maybe we
> > >> >> >> >> > > > > > > start with
> > >> >> >> >> > > > > > > 8 characters (matching your "index" namespace)
> > >> >> >> >> > > > > > > and expand as necessary for  
> > >> >> >> disambiguation.  
> > >> >> >> >> > > > > > > If we can eliminate overhead in IFNAMESZ, let's
> > >> >> >> >> > > > > > > start with  
> > >> 12.  
> > >> >> >> >> > > > > > > Thanks,
> > >> >> >> >> > > > > > >  
> > >> >> >> >> > > > > > If user is going to choose the alias, why does it
> > >> >> >> >> > > > > > have to be limited to  
> > >> >> >> >> sha1?  
> > >> >> >> >> > > > > > Or you just told it as an example?
> > >> >> >> >> > > > > >
> > >> >> >> >> > > > > > It can be an alpha-numeric string.  
> > >> >> >> >> > > > >
> > >> >> >> >> > > > > No, I'm proposing a different solution where
> > >> >> >> >> > > > > mdev-core creates an alias based on an abbreviated
> > >> >> >> >> > > > > sha1.  The user does not provide the  
> > >> >> >> >> alias.  
> > >> >> >> >> > > > >  
> > >> >> >> >> > > > > > Instead of mdev imposing number of characters on
> > >> >> >> >> > > > > > the alias, it should be best  
> > >> >> >> >> > > > > left to the user.  
> > >> >> >> >> > > > > > Because in future if netdev improves on the naming
> > >> >> >> >> > > > > > scheme, mdev will be  
> > >> >> >> >> > > > > limiting it, which is not right.  
> > >> >> >> >> > > > > > So not restricting alias size seems right to me.
> > >> >> >> >> > > > > > User configuring mdev for networking devices in a
> > >> >> >> >> > > > > > given kernel knows what  
> > >> >> >> >> > > > > user is doing.  
> > >> >> >> >> > > > > > So user can choose alias name size as it finds suitable.  
> > >> >> >> >> > > > >
> > >> >> >> >> > > > > That's not what I'm proposing, please read again.
> > >> >> >> >> > > > > Thanks,  
> > >> >> >> >> > > >
> > >> >> >> >> > > > I understood your point. But mdev doesn't know how
> > >> >> >> >> > > > user is going to use  
> > >> >> >> >> > > udev/systemd to name the netdev.  
> > >> >> >> >> > > > So even if mdev chose to pick 12 characters, it could
> > >> >> >> >> > > > result in  
> > >> >> collision.  
> > >> >> >> >> > > > Hence the proposal to provide the alias by the user,
> > >> >> >> >> > > > as user know the best  
> > >> >> >> >> > > policy for its use case in the environment its using.  
> > >> >> >> >> > > > So 12 character sha1 method will still work by user.  
> > >> >> >> >> > >
> > >> >> >> >> > > Haven't you already provided examples where certain
> > >> >> >> >> > > drivers or subsystems have unique netdev prefixes?  If
> > >> >> >> >> > > mdev provides a unique alias within the subsystem,
> > >> >> >> >> > > couldn't we simply define a netdev prefix for the mdev
> > >> >> >> >> > > subsystem and avoid all other collisions?  I'm not in
> > >> >> >> >> > > favor of the user providing both a uuid and an
> > >> >> >> >> > > alias/instance.  Thanks,
> > >> >> >> >> > >  
> > >> >> >> >> > For a given prefix, say ens2f0, can two UUID->sha1 first 9
> > >> >> >> >> > characters have  
> > >> >> >> >> collision?
> > >> >> >> >>
> > >> >> >> >> I think it would be a mistake to waste so many chars on a
> > >> >> >> >> prefix, but
> > >> >> >> >> 9 characters of sha1 likely wouldn't have a collision before
> > >> >> >> >> we have 10s of thousands of devices.  Thanks,
> > >> >> >> >>
> > >> >> >> >> Alex  
> > >> >> >> >
> > >> >> >> >Jiri, Dave,
> > >> >> >> >Are you ok with it for devlink/netdev part?
> > >> >> >> >Mdev core will create an alias from a UUID.
> > >> >> >> >
> > >> >> >> >This will be supplied during devlink port attr set such as,
> > >> >> >> >
> > >> >> >> >devlink_port_attrs_mdev_set(struct devlink_port *port, const
> > >> >> >> >char *mdev_alias);
> > >> >> >> >
> > >> >> >> >This alias is used to generate representor netdev's phys_port_name.
> > >> >> >> >This alias from the mdev device's sysfs will be used by the
> > >> >> >> >udev/systemd to  
> > >> >> >> generate predicable netdev's name.  
> > >> >> >> >Example: enm<mdev_alias_first_12_chars>  
> > >> >> >>
> > >> >> >> What happens in unlikely case of 2 UUIDs collide?
> > >> >> >>  
> > >> >> >Since users sees two devices with same phys_port_name, user
> > >> >> >should destroy  
> > >> >> recently created mdev and recreate mdev with different UUID?
> > >> >>
> > >> >> Driver should make sure phys port name wont collide,  
> > >> >So when mdev creation is initiated, mdev core calculates the alias
> > >> >and if there  
> > >> is any other mdev with same alias exist, it returns -EEXIST error
> > >> before progressing further.  
> > >> >This way user will get to know upfront in event of collision before
> > >> >the mdev  
> > >> device gets created.  
> > >> >How about that?  
> > >>
> > >> Sounds fine to me. Now the question is how many chars do we want to have.
> > >>  
> > >12 characters from Alex's suggestion similar to git?  
> > 
> > Ok.
> >   
> 
> Can you please confirm this scheme looks good now? I like to get patches started.

My only concern is your comment that in the event of an abbreviated
sha1 collision (as exceptionally rare as that might be at 12-chars),
we'd fail the device create, while my original suggestion was that
vfio-core would add an extra character to the alias.  For
non-networking devices, the sha1 is unnecessary, so the extension
behavior seems preferred.  The user is only responsible to provide a
unique uuid.  Perhaps the failure behavior could be applied based on
the mdev device_api.  A module option on mdev to specify the default
number of alias chars would also be useful for testing so that we can
set it low enough to validate the collision behavior.  Thanks,

Alex

> > >> >> in this case that it does
> > >> >> not provide 2 same attrs for 2 different ports.
> > >> >> Hmm, so the order of creation matters. That is not good.
> > >> >>  
> > >> >> >>  
> > >> >> >> >I took Ethernet mdev as an example.
> > >> >> >> >New prefix 'm' stands for mediated device.
> > >> >> >> >Remaining 12 characters are first 12 chars of the mdev alias.  
> > >> >> >>
> > >> >> >> Does this resolve the identification of devlink port representor?  
> > >> >> >Not sure if I understood your question correctly, attemping to
> > >> >> >answer  
> > >> below.  
> > >> >> >phys_port_name of devlink port is defined by the first 12
> > >> >> >characters of mdev  
> > >> >> alias.  
> > >> >> >> I assume you want to use the same 12(or so) chars, don't you?  
> > >> >> >Mdev's netdev will also use the same mdev alias from the sysfs to
> > >> >> >rename  
> > >> >> netdev name from ethX to enm<mdev_alias>, where en=Etherenet,  
> > >> m=mdev.  
> > >> >> >
> > >> >> >So yes, same 12 characters are use for mdev's netdev and mdev
> > >> >> >devlink port's  
> > >> >> phys_port_name.  
> > >> >> >
> > >> >> >Is that what are you asking?  
> > >> >>
> > >> >> Yes. Then you have 3 chars to handle the rest of the name (pci, pf)...  


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox