Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v6 04/15] ethtool: introduce ethtool netlink interface
From: Jiri Pirko @ 2019-07-09 13:42 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: netdev, David Miller, Jakub Kicinski, Andrew Lunn,
	Florian Fainelli, John Linville, Stephen Hemminger, Johannes Berg,
	linux-kernel
In-Reply-To: <20190708202219.GE24474@unicorn.suse.cz>

Mon, Jul 08, 2019 at 10:22:19PM CEST, mkubecek@suse.cz wrote:
>On Mon, Jul 08, 2019 at 09:26:29PM +0200, Jiri Pirko wrote:
>> Mon, Jul 08, 2019 at 07:27:29PM CEST, mkubecek@suse.cz wrote:
>> >
>> >There are two reasons for this design. First is to reduce the number of
>> >requests needed to get the information. This is not so much a problem of
>> >ethtool itself; the only existing commands that would result in multiple
>> >request messages would be "ethtool <dev>" and "ethtool -s <dev>". Maybe
>> >also "ethtool -x/-X <dev>" but even if the indirection table and hash
>> >key have different bits assigned now, they don't have to be split even
>> >if we split other commands. It may be bigger problem for daemons wanting
>> >to keep track of system configuration which would have to issue many
>> >requests whenever a new device appears.
>> >
>> >Second reason is that with 8-bit genetlink command/message id, the space
>> >is not as infinite as it might seem. I counted quickly, right now the
>> >full series uses 14 ids for kernel messages, with split you propose it
>> >would most likely grow to 44. For full implementation of all ethtool
>> >functionality, we could get to ~60 ids. It's still only 1/4 of the
>> >available space but it's not clear what the future development will look
>> >like. We would certainly need to be careful not to start allocating new
>> >commands for single parameters and try to be foreseeing about what can
>> >be grouped together. But we will need to do that in any case.
>> >
>> >On kernel side, splitting existing messages would make some things a bit
>> >easier. It would also reduce the number of scenarios where only part of
>> >requested information is available or only part of a SET request fails.
>> 
>> Okay, I got your point. So why don't we look at if from the other angle.
>> Why don't we have only single get/set command that would be in general
>> used to get/set ALL info from/to the kernel. Where we can have these
>> bits (perhaps rather varlen bitfield) to for user to indicate which data
>> is he interested in? This scales. The other commands would be
>> just for action.
>> 
>> Something like RTM_GETLINK/RTM_SETLINK. Makes sense?
>
>It's certainly an option but at the first glance it seems as just moving
>what I tried to avoid one level lower. It would work around the u8 issue
>(but as Johannes pointed out, we can handle it with genetlink when/if
>the time comes). We would almost certainly have to split the replies
>into multiple messages to keep the packet size reasonable. I'll have to
>think more about the consequences for both kernel and userspace.
>
>My gut feeling is that out of the two extreme options (one universal
>message type and message types corresponding to current infomask bits),
>the latter is more appealing. After all, ethtool has been gathering
>features that would need those ~60 message types for 20 years.

Yeah, but I think that we have to do one or another. Anything in between
makes the code complex and uapi confusing. Let's start clean :)

^ permalink raw reply

* Re: [PATCH net-next,v3 10/11] net: flow_offload: add flow_block_cb_is_busy() and use it
From: Jiri Pirko @ 2019-07-09 13:39 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netdev, davem, thomas.lendacky, f.fainelli, ariel.elior,
	michael.chan, madalin.bucur, yisen.zhuang, salil.mehta,
	jeffrey.t.kirsher, tariqt, saeedm, jiri, idosch, jakub.kicinski,
	peppe.cavallaro, grygorii.strashko, andrew, vivien.didelot,
	alexandre.torgue, joabreu, linux-net-drivers, ogerlitz,
	Manish.Chopra, marcelo.leitner, mkubecek, venkatkumar.duvvuru,
	maxime.chevallier, cphealy, netfilter-devel
In-Reply-To: <20190708160614.2226-11-pablo@netfilter.org>

Mon, Jul 08, 2019 at 06:06:12PM CEST, pablo@netfilter.org wrote:

[...]

>+bool flow_block_cb_is_busy(tc_setup_cb_t *cb, void *cb_ident,

There should be another patch before this one renaming tc_setup_cb_t and
ndo_setup_tc. This is not TC specific anymore now, it might confuse the
reader.

[...]

^ permalink raw reply

* Re: [PATCH nf-next 1/3] netfilter: nf_nat_proto: add nf_nat_bridge_ops support
From: wenxu @ 2019-07-09 13:38 UTC (permalink / raw)
  To: Florian Westphal; +Cc: pablo, netfilter-devel, netdev
In-Reply-To: <20190709104206.gy6l52rx2dat3743@breakpoint.cc>


在 2019/7/9 18:42, Florian Westphal 写道:
> wenxu <wenxu@ucloud.cn> wrote:
>>> For NAT on bridge, it should be possible already to push such packets
>>> up the stack by
>>>
>>> bridge input meta iif eth0 ip saddr 192.168.0.0/16 \
>>>        meta pkttype set unicast ether daddr set 00:11:22:33:44:55
>> yes, packet can be push up to IP stack to handle the nat through bridge device. 
>>
>> In my case dnat 2.2.1.7 to 10.0.0.7, It assume the mac address of the two address
>> is the same known by outer.
> I think that in general they will have different MAC addresses, so plain
> replacement of ip addresses won't work.
>
>> But in This case modify the packet dmac to bridge device, the packet push up through bridge device
>> Then do nat and route send back to bridge device.
> Are you saying that you can use the send-to-ip-layer approach?
>
> We might need/want a more convenient way to do this.
> There are two ways that I can see:
>
> 1. a redirect support for nftables bridge family.
>    The redirect expression would be same as "ether daddr set
>    <bridge_mac>", but there is no need to know the bridge mac address.
>
> 2. Support ebtables -t broute in nftables.
>    The route rework for ebtables has been completed already, so
>    this needs a new expression.  Packet that is brouted behaves
>    as if the bridge port was not part of the bridge.

This is my senario:

For a virtual machine example with address  10.0.0.7 and internet address 2.2.1.7  default router

10.0.0.1. There are both the east-west and south-north traffic. So the outer vnet0 connect to bridge

br0 which with address 10.0.0.1.   The bridge also add an flow-based/metadata_dst vxlan device vxlan0.


So there are three kinds traffic to handle:

1. 10.0.0.7 <-----> 10.0.0.8: both ingress and egress packet gothrough the bridge with vlanid to vni feature.

2. 10.0.0.7 <-----> 10.0.1.8: The egress packet push up to stack through br0 to do route. And the route send packet through

vxlan0 to peer with static mac(Maybe the route can send through br0); The ingress packet always gothrough the bridge to VM.

3. 10.0.0.7  <----> 1.1.1.7: The egress The egress packet push up to stack through br0 to do route and nat. And the route send

packet through vxlan0 to router. With this patche, The router assume is the same mac address for 10.0.0.7 and 2.2.1.7. so it can do

nat under bridge and send to VM.


I think the most big problem is that the only vxlan0 device is alyways attach on br0. For L3( do route) traffic the egress packet will push

up to stack do route through br0.  The ingress I hope only gothrough the bridge to VM for all the three kinds traffic above.


^ permalink raw reply

* Re: [PATCH net-next,v3 04/11] net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
From: Jiri Pirko @ 2019-07-09 13:37 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netdev, davem, thomas.lendacky, f.fainelli, ariel.elior,
	michael.chan, madalin.bucur, yisen.zhuang, salil.mehta,
	jeffrey.t.kirsher, tariqt, saeedm, jiri, idosch, jakub.kicinski,
	peppe.cavallaro, grygorii.strashko, andrew, vivien.didelot,
	alexandre.torgue, joabreu, linux-net-drivers, ogerlitz,
	Manish.Chopra, marcelo.leitner, mkubecek, venkatkumar.duvvuru,
	maxime.chevallier, cphealy, netfilter-devel
In-Reply-To: <20190708160614.2226-5-pablo@netfilter.org>

Mon, Jul 08, 2019 at 06:06:06PM CEST, pablo@netfilter.org wrote:

[...]


>+struct flow_block_cb *flow_block_cb_alloc(struct net *net, tc_setup_cb_t *cb,

You don't use net any longer.


>+					  void *cb_ident, void *cb_priv,
>+					  void (*release)(void *cb_priv))

[...]

^ permalink raw reply

* Re: IPv6 flow label reflection behave for RST packets
From: Eric Dumazet @ 2019-07-09 13:36 UTC (permalink / raw)
  To: Eric Dumazet, Marek Majkowski
  Cc: kuznet, yoshfuji, Jakub Sitnicki, netdev, kernel-team
In-Reply-To: <8e2fca44-6fe7-42fc-8684-2cdd52c67103@gmail.com>



On 7/9/19 3:22 PM, Eric Dumazet wrote:
> 
> 
> On 7/9/19 2:33 PM, Marek Majkowski wrote:
>> Ha, thanks. I missed that.
>>
>> There is a caveat though. I don't think it's working as intended...
> 
> 
> Note that my commit really took a look at a fraction of the cases ;)
> 
> commit 323a53c41292a0d7efc8748856c623324c8d7c21
> 
>     ipv6: tcp: enable flowlabel reflection in some RST packets
>     
>     When RST packets are sent because no socket could be found,
>     it makes sense to use flowlabel_reflect sysctl to decide
>     if a reflection of the flowlabel is requested.
>     
> 
> In your case, a socket is found, most probably, and np->repflow seems to be ignored.
> 
> I'll take a look, thanks.

I guess a possible fix would be :

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d56a9019a0feb5a34312ec353c555f44b8c09b3d..2a298835317c0f6b1d82fb118dc4ba9647a2a110 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -984,8 +984,13 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 
        if (sk) {
                oif = sk->sk_bound_dev_if;
-               if (sk_fullsock(sk))
+               if (sk_fullsock(sk)) {
+                       struct ipv6_pinfo *np = tcp_inet6_sk(sk);
+
                        trace_tcp_send_reset(sk, skb);
+                       if (np->repflow)
+                               label = ip6_flowlabel(ipv6h);
+               }
                if (sk->sk_state == TCP_TIME_WAIT)
                        label = cpu_to_be32(inet_twsk(sk)->tw_flowlabel);
        } else {


> 
>> Running my script:
>>
>> $ sysctl -w net.ipv6.flowlabel_reflect=3
>>
>> $ tail reflect.py
>> cd2.close()
>> cd.send(b"a")
>>
>> $ python3 reflect.py
>> IP6 (flowlabel 0xf2927, hlim 64) ::1.1235 > ::1.60246: Flags [F.]
>> IP6 (flowlabel 0xf2927, hlim 64) ::1.60246 > ::1.1235: Flags [P.]
>> IP6 (flowlabel 0x58ecd, hlim 64) ::1.1235 > ::1.60246: Flags [R]
>>
>> Note. The RST is opportunistic, depending on timing I sometimes get a
>> proper FIN, without RST.
>>
>> If I change the script to introduce some delay:
>>
>> $ tail reflect.py
>> cd2.close()
>> time.sleep(0.1)
>> cd.send(b"a")
>>
>> $ python3 reflect.py
>> IP6 (flowlabel 0x2f60c, hlim 64) ::1.60326 > ::1.1235: Flags [.]
>> IP6 (flowlabel 0x2f60c, hlim 64) ::1.60326 > ::1.1235: Flags [P.]
>> IP6 (flowlabel 0x2f60c, hlim 64) ::1.1235 > ::1.60326: Flags [R]
>>
>> Now it seem to work reliably. Tested on net-next under virtme.
>>
>> Marek
>>
>> On Tue, Jul 9, 2019 at 1:19 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>
>>>
>>>
>>> On 7/9/19 1:10 PM, Marek Majkowski wrote:
>>>> Morning,
>>>>
>>>> I'm experimenting with flow label reflection from a server point of
>>>> view. I'm able to get it working in both supported ways:
>>>>
>>>> (a) per-socket with flow manager IPV6_FL_F_REFLECT and flowlabel_consistency=0
>>>>
>>>> (b) with global flowlabel_reflect sysctl
>>>>
>>>> However, I was surprised to see that RST after the connection is torn
>>>> down, doesn't have the correct flow label value:
>>>>
>>>> IP6 (flowlabel 0x3ba3d) ::1.59276 > ::1.1235: Flags [S]
>>>> IP6 (flowlabel 0x3ba3d) ::1.1235 > ::1.59276: Flags [S.]
>>>> IP6 (flowlabel 0x3ba3d) ::1.59276 > ::1.1235: Flags [.]
>>>> IP6 (flowlabel 0x3ba3d) ::1.1235 > ::1.59276: Flags [F.]
>>>> IP6 (flowlabel 0x3ba3d) ::1.59276 > ::1.1235: Flags [P.]
>>>> IP6 (flowlabel 0xdfc46) ::1.1235 > ::1.59276: Flags [R]
>>>>
>>>> Notice, the last RST packet has inconsistent flow label. Perhaps we
>>>> can argue this behaviour might be acceptable for a per-socket
>>>> IPV6_FL_F_REFLECT option, but with global flowlabel_reflect, I would
>>>> expect the RST to preserve the reflected flow label value.
>>>>
>>>> I suspect the same behaviour is true for kernel-generated ICMPv6.
>>>>
>>>> Prepared test case:
>>>> https://gist.github.com/majek/139081b84f9b5b6187c8ccff802e3ab3
>>>>
>>>> This behaviour is not necessarily a bug, more of a surprise. Flow
>>>> label reflection is mostly useful in deployments where Linux servers
>>>> stand behind ECMP router, which uses flow-label to compute the hash.
>>>> Flow label reflection allows ICMP PTB message to be routed back to
>>>> correct server.
>>>>
>>>> It's hard to imagine a situation where generated RST or ICMP echo
>>>> response would trigger a ICMP PTB. Flow label reflection is explained
>>>> here:
>>>> https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
>>>> and:
>>>> https://tools.ietf.org/html/rfc7098
>>>> https://tools.ietf.org/html/rfc6438
>>>>
>>>> Cheers,
>>>>     Marek
>>>>
>>>>
>>>> (Note: the unrelated "fwmark_reflect" toggle is about something
>>>> different - flow marks, but also addresses RST and ICMP generated by
>>>> the server)
>>>>
>>>
>>> Please check the recent commits, scheduled for linux-5.3
>>>
>>> a346abe051bd2bd0d5d0140b2da9ec95639acad7 ipv6: icmp: allow flowlabel reflection in echo replies
>>> c67b85558ff20cb1ff20874461d12af456bee5d0 ipv6: tcp: send consistent autoflowlabel in TIME_WAIT state
>>> 392096736a06bc9d8f2b42fd4bb1a44b245b9fed ipv6: tcp: fix potential NULL deref in tcp_v6_send_reset()
>>> 50a8accf10627b343109a9c9d5c361751bf753b0 ipv6: tcp: send consistent flowlabel in TIME_WAIT state
>>> 323a53c41292a0d7efc8748856c623324c8d7c21 ipv6: tcp: enable flowlabel reflection in some RST packets
>>>

^ permalink raw reply related

* Re: [PATCH net-next,v3 10/11] net: flow_offload: add flow_block_cb_is_busy() and use it
From: Jiri Pirko @ 2019-07-09 13:36 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netdev, davem, thomas.lendacky, f.fainelli, ariel.elior,
	michael.chan, madalin.bucur, yisen.zhuang, salil.mehta,
	jeffrey.t.kirsher, tariqt, saeedm, jiri, idosch, jakub.kicinski,
	peppe.cavallaro, grygorii.strashko, andrew, vivien.didelot,
	alexandre.torgue, joabreu, linux-net-drivers, ogerlitz,
	Manish.Chopra, marcelo.leitner, mkubecek, venkatkumar.duvvuru,
	maxime.chevallier, cphealy, netfilter-devel
In-Reply-To: <20190708160614.2226-11-pablo@netfilter.org>

Mon, Jul 08, 2019 at 06:06:12PM CEST, pablo@netfilter.org wrote:
>This patch adds a function to check if flow block callback is already in
>use.  Call this new function from flow_block_cb_setup_simple() and from
>drivers.
>
>Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
>---
>v3: formerly known as "net: flow_offload: don't allow subsystem to reuse blocks"
>    add flow_block_cb_is_busy() helper. Call it per driver to make it easier
>    to remove this whenever the first driver client support for multiple
>    subsystem offloads.
>
> drivers/net/ethernet/mellanox/mlx5/core/en_rep.c    |  4 ++++
> drivers/net/ethernet/mellanox/mlxsw/spectrum.c      |  4 ++++
> drivers/net/ethernet/mscc/ocelot_tc.c               |  3 +++
> drivers/net/ethernet/netronome/nfp/flower/offload.c |  4 ++++
> include/net/flow_offload.h                          |  3 +++
> net/core/flow_offload.c                             | 18 ++++++++++++++++++
> net/dsa/slave.c                                     |  3 +++
> 7 files changed, 39 insertions(+)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
>index 19133b9e121a..e303149053e4 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
>@@ -721,6 +721,10 @@ mlx5e_rep_indr_setup_tc_block(struct net_device *netdev,
> 		if (indr_priv)
> 			return -EEXIST;
> 
>+		if (flow_block_cb_is_busy(mlx5e_rep_indr_setup_block_cb,
>+					  indr_priv, &mlx5e_block_cb_list))

As I already asked for in another patch in this set, it would be really
much much better to have some wrapping struct instead of plain list
head here. 

[...]

^ permalink raw reply

* Re: [PATCH net-next v2 4/4] bnxt_en: add page_pool support
From: Andy Gospodarek @ 2019-07-09 13:29 UTC (permalink / raw)
  To: Ilias Apalodimas; +Cc: Michael Chan, davem, netdev, hawk, ast
In-Reply-To: <20190709062746.GA621@apalos>

On Tue, Jul 09, 2019 at 09:27:46AM +0300, Ilias Apalodimas wrote:
> 
> Thanks and sorry for the inconvenience :(
> /Ilias

No worries.  I didn't know Ivan's patch was going to go in so quickly!



^ permalink raw reply

* RE: [PATCH] tipc: ensure skb->lock is initialised
From: Jon Maloy @ 2019-07-09 13:25 UTC (permalink / raw)
  To: Eric Dumazet, Chris Packham, ying.xue@windriver.com,
	davem@davemloft.net
  Cc: netdev@vger.kernel.org, tipc-discussion@lists.sourceforge.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <b862a74b-9f1e-fb64-0641-550a83b64664@gmail.com>



> -----Original Message-----
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Sent: 9-Jul-19 03:31
> To: Chris Packham <Chris.Packham@alliedtelesis.co.nz>; Eric Dumazet
> <eric.dumazet@gmail.com>; Jon Maloy <jon.maloy@ericsson.com>;
> ying.xue@windriver.com; davem@davemloft.net
> Cc: netdev@vger.kernel.org; tipc-discussion@lists.sourceforge.net; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH] tipc: ensure skb->lock is initialised
> 
> 
> 
> On 7/8/19 11:13 PM, Chris Packham wrote:
> > On 9/07/19 8:43 AM, Chris Packham wrote:
> >> On 8/07/19 8:18 PM, Eric Dumazet wrote:
> >>>
> >>>
> >>> On 7/8/19 12:53 AM, Chris Packham wrote:
> >>>> tipc_named_node_up() creates a skb list. It passes the list to
> >>>> tipc_node_xmit() which has some code paths that can call
> >>>> skb_queue_purge() which relies on the list->lock being initialised.
> >>>> Ensure tipc_named_node_up() uses skb_queue_head_init() so that the
> >>>> lock is explicitly initialised.
> >>>>
> >>>> Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
> >>>
> >>> I would rather change the faulty skb_queue_purge() to
> >>> __skb_queue_purge()
> >>>
> >>
> >> Makes sense. I'll look at that for v2.
> >>
> >
> > Actually maybe not. tipc_rcast_xmit(), tipc_node_xmit_skb(),
> > tipc_send_group_msg(), __tipc_sendmsg(), __tipc_sendstream(), and
> > tipc_sk_timeout() all use skb_queue_head_init(). So my original change
> > brings tipc_named_node_up() into line with them.
> >
> > I think it should be safe for tipc_node_xmit() to use
> > __skb_queue_purge() since all the callers seem to have exclusive
> > access to the list of skbs. It still seems that the callers should all
> > use
> > skb_queue_head_init() for consistency.

I agree with that.

> >
> 
> No, tipc does not use the list lock (it relies on the socket lock)  and therefore
> should consistently use __skb_queue_head_init() instead of
> skb_queue_head_init()

TIPC is using the list lock at message reception within the scope of tipc_sk_rcv()/tipc_skb_peek_port(), so it is fundamental that the lock always is correctly initialized.

> 
[...]
> 
> tipc_link_xmit() for example never acquires the spinlock, yet uses skb_peek()
> and __skb_dequeue()


You should look at tipc_node_xmit instead. Node local messages are sent directly to tipc_sk_rcv(), and never go through tipc_link_xmit()

Regards
///jon



^ permalink raw reply

* Re: IPv6 flow label reflection behave for RST packets
From: Eric Dumazet @ 2019-07-09 13:22 UTC (permalink / raw)
  To: Marek Majkowski, Eric Dumazet
  Cc: kuznet, yoshfuji, Jakub Sitnicki, netdev, kernel-team
In-Reply-To: <CAJPywTKXL=_8h3aoC=n-c8o_Uo7P6RnKOgm6CpvrNsPQuw4C9A@mail.gmail.com>



On 7/9/19 2:33 PM, Marek Majkowski wrote:
> Ha, thanks. I missed that.
> 
> There is a caveat though. I don't think it's working as intended...


Note that my commit really took a look at a fraction of the cases ;)

commit 323a53c41292a0d7efc8748856c623324c8d7c21

    ipv6: tcp: enable flowlabel reflection in some RST packets
    
    When RST packets are sent because no socket could be found,
    it makes sense to use flowlabel_reflect sysctl to decide
    if a reflection of the flowlabel is requested.
    

In your case, a socket is found, most probably, and np->repflow seems to be ignored.

I'll take a look, thanks.

> Running my script:
> 
> $ sysctl -w net.ipv6.flowlabel_reflect=3
> 
> $ tail reflect.py
> cd2.close()
> cd.send(b"a")
> 
> $ python3 reflect.py
> IP6 (flowlabel 0xf2927, hlim 64) ::1.1235 > ::1.60246: Flags [F.]
> IP6 (flowlabel 0xf2927, hlim 64) ::1.60246 > ::1.1235: Flags [P.]
> IP6 (flowlabel 0x58ecd, hlim 64) ::1.1235 > ::1.60246: Flags [R]
> 
> Note. The RST is opportunistic, depending on timing I sometimes get a
> proper FIN, without RST.
> 
> If I change the script to introduce some delay:
> 
> $ tail reflect.py
> cd2.close()
> time.sleep(0.1)
> cd.send(b"a")
> 
> $ python3 reflect.py
> IP6 (flowlabel 0x2f60c, hlim 64) ::1.60326 > ::1.1235: Flags [.]
> IP6 (flowlabel 0x2f60c, hlim 64) ::1.60326 > ::1.1235: Flags [P.]
> IP6 (flowlabel 0x2f60c, hlim 64) ::1.1235 > ::1.60326: Flags [R]
> 
> Now it seem to work reliably. Tested on net-next under virtme.
> 
> Marek
> 
> On Tue, Jul 9, 2019 at 1:19 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 7/9/19 1:10 PM, Marek Majkowski wrote:
>>> Morning,
>>>
>>> I'm experimenting with flow label reflection from a server point of
>>> view. I'm able to get it working in both supported ways:
>>>
>>> (a) per-socket with flow manager IPV6_FL_F_REFLECT and flowlabel_consistency=0
>>>
>>> (b) with global flowlabel_reflect sysctl
>>>
>>> However, I was surprised to see that RST after the connection is torn
>>> down, doesn't have the correct flow label value:
>>>
>>> IP6 (flowlabel 0x3ba3d) ::1.59276 > ::1.1235: Flags [S]
>>> IP6 (flowlabel 0x3ba3d) ::1.1235 > ::1.59276: Flags [S.]
>>> IP6 (flowlabel 0x3ba3d) ::1.59276 > ::1.1235: Flags [.]
>>> IP6 (flowlabel 0x3ba3d) ::1.1235 > ::1.59276: Flags [F.]
>>> IP6 (flowlabel 0x3ba3d) ::1.59276 > ::1.1235: Flags [P.]
>>> IP6 (flowlabel 0xdfc46) ::1.1235 > ::1.59276: Flags [R]
>>>
>>> Notice, the last RST packet has inconsistent flow label. Perhaps we
>>> can argue this behaviour might be acceptable for a per-socket
>>> IPV6_FL_F_REFLECT option, but with global flowlabel_reflect, I would
>>> expect the RST to preserve the reflected flow label value.
>>>
>>> I suspect the same behaviour is true for kernel-generated ICMPv6.
>>>
>>> Prepared test case:
>>> https://gist.github.com/majek/139081b84f9b5b6187c8ccff802e3ab3
>>>
>>> This behaviour is not necessarily a bug, more of a surprise. Flow
>>> label reflection is mostly useful in deployments where Linux servers
>>> stand behind ECMP router, which uses flow-label to compute the hash.
>>> Flow label reflection allows ICMP PTB message to be routed back to
>>> correct server.
>>>
>>> It's hard to imagine a situation where generated RST or ICMP echo
>>> response would trigger a ICMP PTB. Flow label reflection is explained
>>> here:
>>> https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
>>> and:
>>> https://tools.ietf.org/html/rfc7098
>>> https://tools.ietf.org/html/rfc6438
>>>
>>> Cheers,
>>>     Marek
>>>
>>>
>>> (Note: the unrelated "fwmark_reflect" toggle is about something
>>> different - flow marks, but also addresses RST and ICMP generated by
>>> the server)
>>>
>>
>> Please check the recent commits, scheduled for linux-5.3
>>
>> a346abe051bd2bd0d5d0140b2da9ec95639acad7 ipv6: icmp: allow flowlabel reflection in echo replies
>> c67b85558ff20cb1ff20874461d12af456bee5d0 ipv6: tcp: send consistent autoflowlabel in TIME_WAIT state
>> 392096736a06bc9d8f2b42fd4bb1a44b245b9fed ipv6: tcp: fix potential NULL deref in tcp_v6_send_reset()
>> 50a8accf10627b343109a9c9d5c361751bf753b0 ipv6: tcp: send consistent flowlabel in TIME_WAIT state
>> 323a53c41292a0d7efc8748856c623324c8d7c21 ipv6: tcp: enable flowlabel reflection in some RST packets
>>

^ permalink raw reply

* Re: i.mx6ul with DSA in multi chip addressing mode - no MDIO access
From: Benjamin Beckmeyer @ 2019-07-09 13:20 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev
In-Reply-To: <20190708145733.GA9027@lunn.ch>

>> Hi Andrew,
>> I got it working a little bit better. When I'm fast enough I can read
>> the registers I want but it isn't a solution.
> Why do you need to read registers?
>
> What you actually might be interested in is the debugfs patches in
> Viviens github tree.
>
>> Here is an output of the tracing even with my custom accesses.
>> mii -i 2 0 0x9b60; mii -i 2 1
>> phyid:2, reg:0x01 -> 0xc801
>>
>> Do you know how to delete EEInt bit? It is always one. And now all 
>> accesses coming from the kworker thread. Maybe this is your polling 
>> function?
> EEInt should clear on read for older chips. For 6390, it might be
> necessary to read global 2, register 0x13, index 03.
>  
>> I view the INT pin on an oscilloscope but it never changed. So maybe
>> this is the problem. We just soldered a pull-up to that pin but it 
>> still never changend. Maybe you have an idea?
> The EEInt bit is probably masked. So it will not generate in
> interrupt.
>
>> So what I think is, because of the EEInt bit is never set back to one 
>> i will poll it as fast as possible.
> Is it forever looping in mv88e6xxx_g1_irq_thread_work? Or is it the
> polling code, mv88e6xxx_irq_poll() firing every 100ms?
>
> 	Andrew

Hi Andrew,
good news first, it seems to be running ;-).

The interrupt GPIO pin was not correctly configured in the device tree.

For now we have around 68 accesses per second, I think this is okay 
because we even have indirect access, so the bus must be more busy.

What do you think about it?

Why we need access to the bus is because we have some software which was 
using the DSDT driver and now we want to switch to the UMSD driver.
But we hope that we can forget about all the UMSD driver stuff and the 
DSDT driver stuff as well and just use the DSA part from the kernel.
To be honest, so far I don't know what functions we need from the driver
which aren't supported by the DSA.

Thanks again for your help and patience.

Cheers,
Benny


^ permalink raw reply

* Re: [PATCH net-next] bnxt_en: Add page_pool_destroy() during RX ring cleanup.
From: Andy Gospodarek @ 2019-07-09 13:18 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev, Ilias Apalodimas
In-Reply-To: <1562658607-30048-1-git-send-email-michael.chan@broadcom.com>

On Tue, Jul 09, 2019 at 03:50:07AM -0400, Michael Chan wrote:
> Add page_pool_destroy() in bnxt_free_rx_rings() during normal RX ring
> cleanup, as Ilias has informed us that the following commit has been
> merged:
> 
> 1da4bbeffe41 ("net: core: page_pool: add user refcnt and reintroduce page_pool_destroy")
> 
> The special error handling code to call page_pool_free() can now be
> removed.  bnxt_free_rx_rings() will always be called during normal
> shutdown or any error paths.
> 
> Fixes: 322b87ca55f2 ("bnxt_en: add page_pool support")
> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Cc: Andy Gospodarek <gospo@broadcom.com>
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index e9d3bd8..2b5b0ab 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -2500,6 +2500,7 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
>  		if (xdp_rxq_info_is_reg(&rxr->xdp_rxq))
>  			xdp_rxq_info_unreg(&rxr->xdp_rxq);
>  
> +		page_pool_destroy(rxr->page_pool);
>  		rxr->page_pool = NULL;
>  
>  		kfree(rxr->rx_tpa);
> @@ -2560,19 +2561,14 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
>  			return rc;
>  
>  		rc = xdp_rxq_info_reg(&rxr->xdp_rxq, bp->dev, i);
> -		if (rc < 0) {
> -			page_pool_free(rxr->page_pool);
> -			rxr->page_pool = NULL;
> +		if (rc < 0)
>  			return rc;
> -		}
>  
>  		rc = xdp_rxq_info_reg_mem_model(&rxr->xdp_rxq,
>  						MEM_TYPE_PAGE_POOL,
>  						rxr->page_pool);
>  		if (rc) {
>  			xdp_rxq_info_unreg(&rxr->xdp_rxq);
> -			page_pool_free(rxr->page_pool);
> -			rxr->page_pool = NULL;

Rather than deleting these lines it would also be acceptable to do:

                if (rc) {
                        xdp_rxq_info_unreg(&rxr->xdp_rxq);
-                       page_pool_free(rxr->page_pool);
+                       page_pool_destroy(rxr->page_pool);
                        rxr->page_pool = NULL;
                        return rc;
                }

but anytime there is a failure to bnxt_alloc_rx_rings the driver will
immediately follow it up with a call to bnxt_free_rx_rings, so
page_pool_destroy will be called.

Thanks for pushing this out so quickly!

Acked-by: Andy Gospodarek <gospo@broadcom.com> 


^ permalink raw reply

* [PATCH iproute2 2/2] ip tunnel: warn when changing IPv6 tunnel without tunnel name
From: Andrea Claudi @ 2019-07-09 13:16 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern
In-Reply-To: <cover.1562667648.git.aclaudi@redhat.com>

Tunnel change fails if a tunnel name is not specified while using
'ip -6 tunnel change'. However, no warning message is printed and
no error code is returned.

$ ip -6 tunnel add ip6tnl1 mode ip6gre local fd::1 remote fd::2 tos inherit ttl 127 encaplimit none dev dummy0
$ ip -6 tunnel change dev dummy0 local 2001:1234::1 remote 2001:1234::2
$ ip -6 tunnel show ip6tnl1
ip6tnl1: gre/ipv6 remote fd::2 local fd::1 dev dummy0 encaplimit none hoplimit 127 tclass inherit flowlabel 0x00000 (flowinfo 0x00000000)

This commit checks if tunnel interface name is equal to an empty
string: in this case, it prints a warning message to the user.
It intentionally avoids to return an error to not break existing
script setup.

This is the output after this commit:
$ ip -6 tunnel add ip6tnl1 mode ip6gre local fd::1 remote fd::2 tos inherit ttl 127 encaplimit none dev dummy0
$ ip -6 tunnel change dev dummy0 local 2001:1234::1 remote 2001:1234::2
Tunnel interface name not specified
$ ip -6 tunnel show ip6tnl1
ip6tnl1: gre/ipv6 remote fd::2 local fd::1 dev dummy0 encaplimit none hoplimit 127 tclass inherit flowlabel 0x00000 (flowinfo 0x00000000)

Reviewed-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
---
 ip/ip6tunnel.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ip/ip6tunnel.c b/ip/ip6tunnel.c
index 999408ed801b1..e3da11eb4518e 100644
--- a/ip/ip6tunnel.c
+++ b/ip/ip6tunnel.c
@@ -386,6 +386,9 @@ static int do_add(int cmd, int argc, char **argv)
 	if (parse_args(argc, argv, cmd, &p) < 0)
 		return -1;
 
+	if (!*p.name)
+		fprintf(stderr, "Tunnel interface name not specified\n");
+
 	if (p.proto == IPPROTO_GRE)
 		basedev = "ip6gre0";
 	else if (p.i_flags & VTI_ISVTI)
-- 
2.20.1


^ permalink raw reply related

* [PATCH iproute2 0/2] Fix IPv6 tunnel add when dev param is used
From: Andrea Claudi @ 2019-07-09 13:16 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern

Commit ba126dcad20e6 ("ip6tunnel: fix 'ip -6 {show|change} dev
<name>' cmds") breaks IPv6 tunnel creation when dev parameter
is used.

This series revert the original commit, which mistakenly use
dev for tunnel name, while addressing a issue on tunnel change
when no interface name is specified.

Andrea Claudi (2):
  Revert "ip6tunnel: fix 'ip -6 {show|change} dev <name>' cmds"
  ip tunnel: warn when changing IPv6 tunnel without tunnel name

 ip/ip6tunnel.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH iproute2 1/2] Revert "ip6tunnel: fix 'ip -6 {show|change} dev <name>' cmds"
From: Andrea Claudi @ 2019-07-09 13:16 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern
In-Reply-To: <cover.1562667648.git.aclaudi@redhat.com>

This reverts commit ba126dcad20e6d0e472586541d78bdd1ac4f1123.
It breaks tunnel creation when using 'dev' parameter:

$ ip link add type dummy
$ ip -6 tunnel add ip6tnl1 mode ip6ip6 remote 2001:db8:ffff:100::2 local 2001:db8:ffff:100::1 hoplimit 1 tclass 0x0 dev dummy0
add tunnel "ip6tnl0" failed: File exists

dev parameter must be used to specify the device to which
the tunnel is binded, and not the tunnel itself.

Reported-by: Jianwen Ji <jiji@redhat.com>
Reviewed-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
---
 ip/ip6tunnel.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/ip/ip6tunnel.c b/ip/ip6tunnel.c
index 56fd3466ed062..999408ed801b1 100644
--- a/ip/ip6tunnel.c
+++ b/ip/ip6tunnel.c
@@ -298,8 +298,6 @@ static int parse_args(int argc, char **argv, int cmd, struct ip6_tnl_parm2 *p)
 		p->link = ll_name_to_index(medium);
 		if (!p->link)
 			return nodev(medium);
-		else
-			strlcpy(p->name, medium, sizeof(p->name));
 	}
 	return 0;
 }
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v2 4/4] net: mvmdio: defer probe of orion-mdio if a clock is not ready
From: Andrew Lunn @ 2019-07-09 13:15 UTC (permalink / raw)
  To: josua; +Cc: netdev, David S. Miller
In-Reply-To: <20190709130101.5160-5-josua@solid-run.com>

On Tue, Jul 09, 2019 at 03:01:01PM +0200, josua@solid-run.com wrote:
> From: Josua Mayer <josua@solid-run.com>
> 
> Defer probing of the orion-mdio interface when getting a clock returns
> EPROBE_DEFER. This avoids locking up the Armada 8k SoC when mdio is used
> before all clocks have been enabled.
> 
> Signed-off-by: Josua Mayer <josua@solid-run.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH v2 3/4] net: mvmdio: print warning when orion-mdio has too many clocks
From: Andrew Lunn @ 2019-07-09 13:14 UTC (permalink / raw)
  To: josua; +Cc: netdev, David S. Miller
In-Reply-To: <20190709130101.5160-4-josua@solid-run.com>

On Tue, Jul 09, 2019 at 03:01:00PM +0200, josua@solid-run.com wrote:
> From: Josua Mayer <josua@solid-run.com>
> 
> Print a warning when device tree specifies more than the maximum of four
> clocks supported by orion-mdio. Because reading from mdio can lock up
> the Armada 8k when a required clock is not initialized, it is important
> to notify the user when a specified clock is ignored.
> 
> Signed-off-by: Josua Mayer <josua@solid-run.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* [PATCH v2 3/4] net: mvmdio: print warning when orion-mdio has too many clocks
From: josua @ 2019-07-09 13:01 UTC (permalink / raw)
  To: netdev; +Cc: Josua Mayer, David S. Miller, Andrew Lunn
In-Reply-To: <20190709130101.5160-1-josua@solid-run.com>

From: Josua Mayer <josua@solid-run.com>

Print a warning when device tree specifies more than the maximum of four
clocks supported by orion-mdio. Because reading from mdio can lock up
the Armada 8k when a required clock is not initialized, it is important
to notify the user when a specified clock is ignored.

Signed-off-by: Josua Mayer <josua@solid-run.com>
---
 drivers/net/ethernet/marvell/mvmdio.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvmdio.c b/drivers/net/ethernet/marvell/mvmdio.c
index e17d563e97a6..eba18065a4da 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -326,6 +326,10 @@ static int orion_mdio_probe(struct platform_device *pdev)
 		clk_prepare_enable(dev->clk[i]);
 	}
 
+	if (!IS_ERR(of_clk_get(pdev->dev.of_node, ARRAY_SIZE(dev->clk))))
+		dev_warn(&pdev->dev, "unsupported number of clocks, limiting to the first "
+			 __stringify(ARRAY_SIZE(dev->clk)) "\n");
+
 	dev->err_interrupt = platform_get_irq(pdev, 0);
 	if (dev->err_interrupt > 0 &&
 	    resource_size(r) < MVMDIO_ERR_INT_MASK + 4) {
-- 
2.16.4


^ permalink raw reply related

* [PATCH v2 1/4] dt-bindings: allow up to four clocks for orion-mdio
From: josua @ 2019-07-09 13:00 UTC (permalink / raw)
  To: netdev
  Cc: Josua Mayer, stable, David S. Miller, Rob Herring, Mark Rutland,
	Andrew Lunn
In-Reply-To: <20190709130101.5160-1-josua@solid-run.com>

From: Josua Mayer <josua@solid-run.com>

Armada 8040 needs four clocks to be enabled for MDIO accesses to work.
Update the binding to allow the extra clock to be specified.

Cc: stable@vger.kernel.org
Fixes: 6d6a331f44a1 ("dt-bindings: allow up to three clocks for orion-mdio")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Josua Mayer <josua@solid-run.com>
---
 Documentation/devicetree/bindings/net/marvell-orion-mdio.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
index 42cd81090a2c..3f3cfc1d8d4d 100644
--- a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
+++ b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
@@ -16,7 +16,7 @@ Required properties:
 
 Optional properties:
 - interrupts: interrupt line number for the SMI error/done interrupt
-- clocks: phandle for up to three required clocks for the MDIO instance
+- clocks: phandle for up to four required clocks for the MDIO instance
 
 The child nodes of the MDIO driver are the individual PHY devices
 connected to this MDIO bus. They must have a "reg" property given the
-- 
2.16.4


^ permalink raw reply related

* [PATCH v2 2/4] net: mvmdio: allow up to four clocks to be specified for orion-mdio
From: josua @ 2019-07-09 13:00 UTC (permalink / raw)
  To: netdev; +Cc: Josua Mayer, stable, David S. Miller, Andrew Lunn
In-Reply-To: <20190709130101.5160-1-josua@solid-run.com>

From: Josua Mayer <josua@solid-run.com>

Allow up to four clocks to be specified and enabled for the orion-mdio
interface, which are required by the Armada 8k and defined in
armada-cp110.dtsi.

Fixes a hang in probing the mvmdio driver that was encountered on the
Clearfog GT 8K with all drivers built as modules, but also affects other
boards such as the MacchiatoBIN.

Cc: stable@vger.kernel.org
Fixes: 96cb43423822 ("net: mvmdio: allow up to three clocks to be specified for orion-mdio")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Josua Mayer <josua@solid-run.com>
---
 drivers/net/ethernet/marvell/mvmdio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvmdio.c b/drivers/net/ethernet/marvell/mvmdio.c
index c5dac6bd2be4..e17d563e97a6 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -64,7 +64,7 @@
 
 struct orion_mdio_dev {
 	void __iomem *regs;
-	struct clk *clk[3];
+	struct clk *clk[4];
 	/*
 	 * If we have access to the error interrupt pin (which is
 	 * somewhat misnamed as it not only reflects internal errors
-- 
2.16.4


^ permalink raw reply related

* [PATCH v2 4/4] net: mvmdio: defer probe of orion-mdio if a clock is not ready
From: josua @ 2019-07-09 13:01 UTC (permalink / raw)
  To: netdev; +Cc: Josua Mayer, David S. Miller, Andrew Lunn
In-Reply-To: <20190709130101.5160-1-josua@solid-run.com>

From: Josua Mayer <josua@solid-run.com>

Defer probing of the orion-mdio interface when getting a clock returns
EPROBE_DEFER. This avoids locking up the Armada 8k SoC when mdio is used
before all clocks have been enabled.

Signed-off-by: Josua Mayer <josua@solid-run.com>
---
 drivers/net/ethernet/marvell/mvmdio.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvmdio.c b/drivers/net/ethernet/marvell/mvmdio.c
index eba18065a4da..f660cc2b8258 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -321,6 +321,10 @@ static int orion_mdio_probe(struct platform_device *pdev)
 
 	for (i = 0; i < ARRAY_SIZE(dev->clk); i++) {
 		dev->clk[i] = of_clk_get(pdev->dev.of_node, i);
+		if (PTR_ERR(dev->clk[i]) == -EPROBE_DEFER) {
+			ret = -EPROBE_DEFER;
+			goto out_clk;
+		}
 		if (IS_ERR(dev->clk[i]))
 			break;
 		clk_prepare_enable(dev->clk[i]);
@@ -366,6 +370,7 @@ static int orion_mdio_probe(struct platform_device *pdev)
 	if (dev->err_interrupt > 0)
 		writel(0, dev->regs + MVMDIO_ERR_INT_MASK);
 
+out_clk:
 	for (i = 0; i < ARRAY_SIZE(dev->clk); i++) {
 		if (IS_ERR(dev->clk[i]))
 			break;
-- 
2.16.4


^ permalink raw reply related

* [PATCH v2 0/4] Fix hang of Armada 8040 SoC in orion-mdio
From: josua @ 2019-07-09 13:00 UTC (permalink / raw)
  To: netdev; +Cc: Josua Mayer
In-Reply-To: <20190706151900.14355-1-josua@solid-run.com>

From: Josua Mayer <josua.mayer@jm0.eu>

With a modular kernel as configured by Debian a hang was observed with
the Armada 8040 SoC in the Clearfog GT and Macchiatobin boards.

The 8040 SoC actually requires four clocks to be enabled for the mdio
interface to function. All 4 clocks are already specified in
armada-cp110.dtsi. It has however been missed that the orion-mdio driver
only supports enabling up to three clocks.

This patch-set allows the orion-mdio driver to handle four clocks and
adds a warning when more clocks are specified to prevent this particular
oversight in the future.

Changes since v1:
- fixed condition for priting the warning (Andrew Lunn)
- rephrased commit description for deferred probing (Andrew Lunn)
- fixed compiler warnings (kbuild test robot)

Josua Mayer (4):
  dt-bindings: allow up to four clocks for orion-mdio
  net: mvmdio: allow up to four clocks to be specified for orion-mdio
  net: mvmdio: print warning when orion-mdio has too many clocks
  net: mvmdio: defer probe of orion-mdio if a clock is not ready

 Documentation/devicetree/bindings/net/marvell-orion-mdio.txt |  2 +-
 drivers/net/ethernet/marvell/mvmdio.c                        | 11 ++++++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)

-- 
2.16.4


^ permalink raw reply

* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Ido Schimmel @ 2019-07-09 12:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, netdev, jiri, mlxsw, dsahern, roopa, nikolay, andy,
	pablo, pieter.jansenvanvuuren, andrew, f.fainelli, vivien.didelot,
	idosch, Alexei Starovoitov
In-Reply-To: <20190708155158.3f75b57c@cakuba.netronome.com>

On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
> On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:
> > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> > > From: Ido Schimmel <idosch@idosch.org>
> > > Date: Sun,  7 Jul 2019 10:58:17 +0300
> > >   
> > > > Users have several ways to debug the kernel and understand why a packet
> > > > was dropped. For example, using "drop monitor" and "perf". Both
> > > > utilities trace kfree_skb(), which is the function called when a packet
> > > > is freed as part of a failure. The information provided by these tools
> > > > is invaluable when trying to understand the cause of a packet loss.
> > > > 
> > > > In recent years, large portions of the kernel data path were offloaded
> > > > to capable devices. Today, it is possible to perform L2 and L3
> > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > > > Different TC classifiers and actions are also offloaded to capable
> > > > devices, at both ingress and egress.
> > > > 
> > > > However, when the data path is offloaded it is not possible to achieve
> > > > the same level of introspection as tools such "perf" and "drop monitor"
> > > > become irrelevant.
> > > > 
> > > > This patchset aims to solve this by allowing users to monitor packets
> > > > that the underlying device decided to drop along with relevant metadata
> > > > such as the drop reason and ingress port.  
> > > 
> > > We are now going to have 5 or so ways to capture packets passing through
> > > the system, this is nonsense.
> > > 
> > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> > > devlink thing.
> > > 
> > > This is insanity, too many ways to do the same thing and therefore the
> > > worst possible user experience.
> > > 
> > > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> > > XDP perf events, and these taps there too.
> > > 
> > > I mean really, think about it from the average user's perspective.  To
> > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> > > listen on devlink but configure a special tap thing beforehand and then
> > > if someone is using XDP I gotta setup another perf event buffer capture
> > > thing too.  
> > 
> > Let me try to explain again because I probably wasn't clear enough. The
> > devlink-trap mechanism is not doing the same thing as other solutions.
> > 
> > The packets we are capturing in this patchset are packets that the
> > kernel (the CPU) never saw up until now - they were silently dropped by
> > the underlying device performing the packet forwarding instead of the
> > CPU.

Jakub,

It seems to me that most of the criticism is about consolidation of
interfaces because you believe I'm doing something you can already do
today, but this is not the case.

Switch ASICs have dedicated traps for specific packets. Usually, these
packets are control packets (e.g., ARP, BGP) which are required for the
correct functioning of the control plane. You can see this in the SAI
interface, which is an abstraction layer over vendors' SDKs:

https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157

We need to be able to configure the hardware policers of these traps and
read their statistics to understand how many packets they dropped. We
currently do not have a way to do any of that and we rely on hardcoded
defaults in the driver which do not fit every use case (from
experience):

https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103

We plan to extend devlink-trap mechanism to cover all these use cases. I
hope you agree that this functionality belongs in devlink given it is a
device-specific configuration and not a netdev-specific one.

That being said, in its current form, this mechanism is focused on traps
that correlate to packets the device decided to drop as this is very
useful for debugging.

Given that the entire configuration is done via devlink and that devlink
stores all the information about these traps, it seems logical to also
report these packets and their metadata to user space as devlink events.

If this is not desirable, we can try to call into drop_monitor from
devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will
encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.

IMO, this is less desirable, as instead of having one tool (devlink) to
interact with this mechanism we will need two (devlink & dropwatch).

Below I tried to answer all your questions and refer to all the points
you brought up.

> When you say silently dropped do you mean that mlxsw as of today
> doesn't have any counters exposed for those events?

Some of these packets are counted, but not all of them.

> If we wanted to consolidate this into something existing we can either
>  (a) add similar traps in the kernel data path;
>  (b) make these traps extension of statistics.
> 
> My knee jerk reaction to seeing the patches was that it adds a new
> place where device statistics are reported.

Not at all. This would be a step back. We can already count discards due
to VLAN membership on ingress on a per-port basis. A software maintained
global counter does not buy us anything.

By also getting the dropped packet - coupled with the drop reason and
ingress port - you can understand exactly why and on which VLAN the
packet was dropped. I wrote a Wireshark dissector for these netlink
packets to make our life easier. You can see the details in my comment
to the cover letter:

https://marc.info/?l=linux-netdev&m=156248736710238&w=2

In case you do not care about individual packets, but still want more
fine-grained statistics for your monitoring application, you can use
eBPF. For example, one thing we did is attaching a kprobe to
devlink_trap_report() with an eBPF program that dissects the incoming
skbs and maintains a counter per-{5 tuple, drop reason}. With
ebpf_exporter you can export these statistics to Prometheus on which you
can run queries and visualize the results with Grafana. This is
especially useful for tail and early drops since it allows you to
understand which flows contribute to most of the drops.

> Users who want to know why things are dropped will not get detailed
> breakdown from ethtool -S which for better or worse is the one stop
> shop for device stats today.

I hope I managed to explain why counters are not enough, but I also want
to point out that ethtool statistics are not properly documented and
this hinders their effectiveness. I did my best to document the exposed
traps in order to avoid the same fate:

https://patchwork.ozlabs.org/patch/1128585/

In addition, there are selftests to show how each trap can be triggered
to reduce the ambiguity even further:

https://patchwork.ozlabs.org/patch/1128610/

And a note in the documentation to make sure future functionality is
tested as well:

https://patchwork.ozlabs.org/patch/1128608/

> Having thought about it some more, however, I think that having a
> forwarding "exception" object and hanging statistics off of it is a
> better design, even if we need to deal with some duplication to get
> there.
> 
> IOW having an way to "trap all packets which would increment a
> statistic" (option (b) above) is probably a bad design.
> 
> As for (a) I wonder how many of those events have a corresponding event
> in the kernel stack?

Generic packet drops all have a corresponding kfree_skb() calls in the
kernel, but that does not mean that every packet dropped by the hardware
would also be dropped by the kernel if it were to be injected to its Rx
path. In my reply to Dave I gave buffer drops as an example.

There are also situations in which packets can be dropped due to
device-specific exceptions and these do not have a corresponding drop
reason in the kernel. See example here:

https://patchwork.ozlabs.org/patch/1128587/

> If we could add corresponding trace points and just feed those from
> the device driver, that'd obviously be a holy grail.

Unlike tracepoints, netlink gives you a structured and extensible
interface. For example, in Spectrum-1 we cannot provide the Tx port for
early/tail drops, whereas for Spectrum-2 and later we can. With netlink,
we can just omit the DEVLINK_ATTR_TRAP_OUT_PORT attribute for
Spectrum-1. You also get a programmatic interface that you can query for
this information:

# devlink -v trap show netdevsim/netdevsim10 trap ingress_vlan_filter
netdevsim/netdevsim10:
  name ingress_vlan_filter type drop generic true report false action drop group l2_drops
    metadata:
       input_port

Thanks

^ permalink raw reply

* [PATCH] net: netsec: start using buffers if page_pool registration succeeded
From: Ilias Apalodimas @ 2019-07-09 12:35 UTC (permalink / raw)
  To: netdev, jaswinder.singh, davem; +Cc: Ilias Apalodimas

The current driver starts using page_pool buffers before calling
xdp_rxq_info_reg_mem_model(). Start using the buffers after the
registration succeeded, so we won't have to call
page_pool_request_shutdown() in case of failure

Fixes: 5c67bf0ec4d0 ("net: netsec: Use page_pool API")
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
---
 drivers/net/ethernet/socionext/netsec.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index d7307ab90d74..c3a4f86f56ee 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -1309,6 +1309,15 @@ static int netsec_setup_rx_dring(struct netsec_priv *priv)
 		goto err_out;
 	}
 
+	err = xdp_rxq_info_reg(&dring->xdp_rxq, priv->ndev, 0);
+	if (err)
+		goto err_out;
+
+	err = xdp_rxq_info_reg_mem_model(&dring->xdp_rxq, MEM_TYPE_PAGE_POOL,
+					 dring->page_pool);
+	if (err)
+		goto err_out;
+
 	for (i = 0; i < DESC_NUM; i++) {
 		struct netsec_desc *desc = &dring->desc[i];
 		dma_addr_t dma_handle;
@@ -1327,14 +1336,6 @@ static int netsec_setup_rx_dring(struct netsec_priv *priv)
 	}
 
 	netsec_rx_fill(priv, 0, DESC_NUM);
-	err = xdp_rxq_info_reg(&dring->xdp_rxq, priv->ndev, 0);
-	if (err)
-		goto err_out;
-
-	err = xdp_rxq_info_reg_mem_model(&dring->xdp_rxq, MEM_TYPE_PAGE_POOL,
-					 dring->page_pool);
-	if (err)
-		goto err_out;
 
 	return 0;
 
-- 
2.20.1


^ permalink raw reply related

* Re: IPv6 flow label reflection behave for RST packets
From: Marek Majkowski @ 2019-07-09 12:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: kuznet, yoshfuji, Jakub Sitnicki, netdev, kernel-team
In-Reply-To: <a854848f-9fb3-47b9-cb18-e76455e5e664@gmail.com>

Ha, thanks. I missed that.

There is a caveat though. I don't think it's working as intended...
Running my script:

$ sysctl -w net.ipv6.flowlabel_reflect=3

$ tail reflect.py
cd2.close()
cd.send(b"a")

$ python3 reflect.py
IP6 (flowlabel 0xf2927, hlim 64) ::1.1235 > ::1.60246: Flags [F.]
IP6 (flowlabel 0xf2927, hlim 64) ::1.60246 > ::1.1235: Flags [P.]
IP6 (flowlabel 0x58ecd, hlim 64) ::1.1235 > ::1.60246: Flags [R]

Note. The RST is opportunistic, depending on timing I sometimes get a
proper FIN, without RST.

If I change the script to introduce some delay:

$ tail reflect.py
cd2.close()
time.sleep(0.1)
cd.send(b"a")

$ python3 reflect.py
IP6 (flowlabel 0x2f60c, hlim 64) ::1.60326 > ::1.1235: Flags [.]
IP6 (flowlabel 0x2f60c, hlim 64) ::1.60326 > ::1.1235: Flags [P.]
IP6 (flowlabel 0x2f60c, hlim 64) ::1.1235 > ::1.60326: Flags [R]

Now it seem to work reliably. Tested on net-next under virtme.

Marek

On Tue, Jul 9, 2019 at 1:19 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 7/9/19 1:10 PM, Marek Majkowski wrote:
> > Morning,
> >
> > I'm experimenting with flow label reflection from a server point of
> > view. I'm able to get it working in both supported ways:
> >
> > (a) per-socket with flow manager IPV6_FL_F_REFLECT and flowlabel_consistency=0
> >
> > (b) with global flowlabel_reflect sysctl
> >
> > However, I was surprised to see that RST after the connection is torn
> > down, doesn't have the correct flow label value:
> >
> > IP6 (flowlabel 0x3ba3d) ::1.59276 > ::1.1235: Flags [S]
> > IP6 (flowlabel 0x3ba3d) ::1.1235 > ::1.59276: Flags [S.]
> > IP6 (flowlabel 0x3ba3d) ::1.59276 > ::1.1235: Flags [.]
> > IP6 (flowlabel 0x3ba3d) ::1.1235 > ::1.59276: Flags [F.]
> > IP6 (flowlabel 0x3ba3d) ::1.59276 > ::1.1235: Flags [P.]
> > IP6 (flowlabel 0xdfc46) ::1.1235 > ::1.59276: Flags [R]
> >
> > Notice, the last RST packet has inconsistent flow label. Perhaps we
> > can argue this behaviour might be acceptable for a per-socket
> > IPV6_FL_F_REFLECT option, but with global flowlabel_reflect, I would
> > expect the RST to preserve the reflected flow label value.
> >
> > I suspect the same behaviour is true for kernel-generated ICMPv6.
> >
> > Prepared test case:
> > https://gist.github.com/majek/139081b84f9b5b6187c8ccff802e3ab3
> >
> > This behaviour is not necessarily a bug, more of a surprise. Flow
> > label reflection is mostly useful in deployments where Linux servers
> > stand behind ECMP router, which uses flow-label to compute the hash.
> > Flow label reflection allows ICMP PTB message to be routed back to
> > correct server.
> >
> > It's hard to imagine a situation where generated RST or ICMP echo
> > response would trigger a ICMP PTB. Flow label reflection is explained
> > here:
> > https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
> > and:
> > https://tools.ietf.org/html/rfc7098
> > https://tools.ietf.org/html/rfc6438
> >
> > Cheers,
> >     Marek
> >
> >
> > (Note: the unrelated "fwmark_reflect" toggle is about something
> > different - flow marks, but also addresses RST and ICMP generated by
> > the server)
> >
>
> Please check the recent commits, scheduled for linux-5.3
>
> a346abe051bd2bd0d5d0140b2da9ec95639acad7 ipv6: icmp: allow flowlabel reflection in echo replies
> c67b85558ff20cb1ff20874461d12af456bee5d0 ipv6: tcp: send consistent autoflowlabel in TIME_WAIT state
> 392096736a06bc9d8f2b42fd4bb1a44b245b9fed ipv6: tcp: fix potential NULL deref in tcp_v6_send_reset()
> 50a8accf10627b343109a9c9d5c361751bf753b0 ipv6: tcp: send consistent flowlabel in TIME_WAIT state
> 323a53c41292a0d7efc8748856c623324c8d7c21 ipv6: tcp: enable flowlabel reflection in some RST packets
>

^ permalink raw reply

* KASAN: global-out-of-bounds Read in load_next_firmware_from_table
From: syzbot @ 2019-07-09 12:27 UTC (permalink / raw)
  To: andreyknvl, davem, gregkh, kvalo, libertas-dev, linux-kernel,
	linux-usb, linux-wireless, netdev, syzkaller-bugs, tglx

Hello,

syzbot found the following crash on:

HEAD commit:    7829a896 usb-fuzzer: main usb gadget fuzzer driver
git tree:       https://github.com/google/kasan.git usb-fuzzer
console output: https://syzkaller.appspot.com/x/log.txt?x=12fd0e9ba00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=f6d4561982f71f63
dashboard link: https://syzkaller.appspot.com/bug?extid=98156c174c5a2cad9f8f
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=125f669ba00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=146b806ba00000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+98156c174c5a2cad9f8f@syzkaller.appspotmail.com

usb 1-1: Direct firmware load for libertas/usb8388_v5.bin failed with error  
-2
usb 1-1: Direct firmware load for libertas/usb8388.bin failed with error -2
usb 1-1: Direct firmware load for usb8388.bin failed with error -2
==================================================================
BUG: KASAN: global-out-of-bounds in  
load_next_firmware_from_table+0x267/0x2d0  
drivers/net/wireless/marvell/libertas/firmware.c:99
Read of size 8 at addr ffffffff860942b8 by task kworker/1:1/21

CPU: 1 PID: 21 Comm: kworker/1:1 Not tainted 5.2.0-rc6+ #13
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events request_firmware_work_func
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0xca/0x13e lib/dump_stack.c:113
  print_address_description+0x67/0x231 mm/kasan/report.c:188
  __kasan_report.cold+0x1a/0x32 mm/kasan/report.c:317
  kasan_report+0xe/0x20 mm/kasan/common.c:614
  load_next_firmware_from_table+0x267/0x2d0  
drivers/net/wireless/marvell/libertas/firmware.c:99
  helper_firmware_cb+0xdc/0x100  
drivers/net/wireless/marvell/libertas/firmware.c:70
  request_firmware_work_func+0x126/0x242  
drivers/base/firmware_loader/main.c:785
  process_one_work+0x905/0x1570 kernel/workqueue.c:2269
  worker_thread+0x96/0xe20 kernel/workqueue.c:2415
  kthread+0x30b/0x410 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

The buggy address belongs to the variable:
  fw_table+0x98/0x5c0

Memory state around the buggy address:
  ffffffff86094180: fa fa fa fa 00 04 fa fa fa fa fa fa 00 00 05 fa
  ffffffff86094200: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
> ffffffff86094280: 00 00 00 00 00 00 fa fa fa fa fa fa 00 00 00 00
                                         ^
  ffffffff86094300: 00 00 00 01 fa fa fa fa 00 00 00 00 02 fa fa fa
  ffffffff86094380: fa fa fa fa 00 03 fa fa fa fa fa fa 00 00 00 00
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox