Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next v2 2/4] bpf, mlx5: fix various refcount issues in mlx5e_xdp_set
From: Saeed Mahameed @ 2016-11-16 12:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Alexei Starovoitov, Brenden Blanco, zhiyisun,
	Rana Shahout, Saeed Mahameed, Linux Netdev List
In-Reply-To: <0899adb6ace97b7e7f94b04dd11a0e55270d2b54.1479253024.git.daniel@iogearbox.net>

On Wed, Nov 16, 2016 at 2:04 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> There are multiple issues in mlx5e_xdp_set():
>
> 1) The batched bpf_prog_add() is currently not checked for errors! When
>    doing so, it should be done at an earlier point in time to makes sure
>    that we cannot fail anymore at the time we want to set the program for
>    each channel. This only means that we have to undo the bpf_prog_add()
>    in case we return due to required reset.
>
> 2) When swapping the priv->xdp_prog, then no extra reference count must be
>    taken since we got that from call path via dev_change_xdp_fd() already.
>    Otherwise, we'd never be able to free the program. Also, bpf_prog_add()
>    without checking the return code could fail.
>
> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 25 ++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index ab0c336..cf26672 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -3142,6 +3142,17 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
>                 goto unlock;
>         }
>
> +       if (prog) {
> +               /* num_channels is invariant here, so we can take the
> +                * batched reference right upfront.
> +                */
> +               prog = bpf_prog_add(prog, priv->params.num_channels);
> +               if (IS_ERR(prog)) {
> +                       err = PTR_ERR(prog);
> +                       goto unlock;
> +               }
> +       }
> +

With this approach you will end up taking a ref count twice per RQ! on
the first time xdp_set is called i.e (old_prog = NULL, prog != NULL).
One ref will be taken per RQ/Channel from the above code, and since
reset will be TRUE mlx5e_open_locked will be called and another ref
count will be taken on mlx5e_create_rq.

The issue here is that we have two places for ref count accounting,
xdp_set and  mlx5e_create_rq. Having ref-count updates in
mlx5e_create_rq is essential for num_channels configuration changes
(mlx5e_set_ringparam).

Our previous approach made sure that only one path will do the ref
counting (mlx5e_open_locked vs. mlx5e_xdp_set batch ref update when
reset == false).

We have two options:
1. remove ref count updates from mlx5e_create_rq and only do batch
updates in mlx5e_set_ringparam and mlx5e_xdp_set, the only two places
num_channels/priv->xdp_prog could change.
2. Keep the current approach and make sure to not update the ref count
twice, you can batch update only if (!reset && was_open) otherwise you
can rely on mlx5e_open_locked to take the num_channels ref count for
you.

Personally I prefer option 2, since we will keep the current logic
which will allow configuration updates such as (mlx5e_set_ringparam)
to not worry about ref counting since it will be done in the reset
flow.

Basically we want on xdp_set
   -bpf_prog_add(prog, 1) // one ref for the device.
   - if not going to reset bpf_prog_add(prog, num_channels) //batch update
   - old_prog = xchg(&priv->xdp_prog, prog);
   - if going to reset then reset . // reset will handle ref-counting
per channel.
   - bpf_prog_put(old_prog) // put one ref for the device.
   - if (!reset) bpf_prog_put(old_prog, num_channels)  // if we didn't
reset, we need to release ref count on old_prog ourselves

Saeed.

^ permalink raw reply

* Re: [PATCH net-next v2 3/4] bpf, mlx5: drop priv->xdp_prog reference on netdev cleanup
From: Saeed Mahameed @ 2016-11-16 12:51 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Alexei Starovoitov, Brenden Blanco, zhiyisun,
	Rana Shahout, Saeed Mahameed, Linux Netdev List
In-Reply-To: <4617cac95e3ca1b608f5d0a55709b0534c168943.1479253024.git.daniel@iogearbox.net>

On Wed, Nov 16, 2016 at 2:04 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> mlx5e_xdp_set() is currently the only place where we drop reference on the
> prog sitting in priv->xdp_prog when it's exchanged by a new one. We also
> need to make sure that we eventually release that reference, for example,
> in case the netdev is dismantled.
>
> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index cf26672..60fe54c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -3715,6 +3715,9 @@ static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
>
>         if (MLX5_CAP_GEN(mdev, vport_group_manager))
>                 mlx5_eswitch_unregister_vport_rep(esw, 0);
> +
> +       if (priv->xdp_prog)
> +               bpf_prog_put(priv->xdp_prog);
>  }
>

I thought that on unregister_netdev  ndo_xdp_set will be called with
NULL prog to cleanup. like any other resources (Vlans/mac_lists/
etc..), why xdp should be different ?
Anyway if this is the case, I am ok with this fix, you can even send
it to net (it looks like a serious leak).

Saeed.

^ permalink raw reply

* [PATCH 1/1] ixgbe: write flush vfta registers
From: zyjzyj2000 @ 2016-11-16 12:40 UTC (permalink / raw)
  To: e1000-devel, netdev, jeffrey.t.kirsher, intel-wired-lan; +Cc: Zhu Yanjun

From: Zhu Yanjun <zyjzyj2000@gmail.com>

Sometimes vfta registers can not be written successfully in dcb mode.
This is very occassional. When the ixgbe nic runs for a very long time,
sometimes this bug occurs. But after IXGBE_WRITE_FLUSH is executed,
this bug never occurs.

Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index bd93d82..1221cfb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -4138,8 +4138,10 @@ static void ixgbe_vlan_promisc_enable(struct ixgbe_adapter *adapter)
 	}
 
 	/* Set all bits in the VLAN filter table array */
-	for (i = hw->mac.vft_size; i--;)
+	for (i = hw->mac.vft_size; i--;) {
 		IXGBE_WRITE_REG(hw, IXGBE_VFTA(i), ~0U);
+		IXGBE_WRITE_FLUSH(hw);
+	}
 }
 
 #define VFTA_BLOCK_SIZE 8
@@ -4186,6 +4188,7 @@ static void ixgbe_scrub_vfta(struct ixgbe_adapter *adapter, u32 vfta_offset)
 		vfta[i] |= adapter->active_vlans[word] >> bits;
 
 		IXGBE_WRITE_REG(hw, IXGBE_VFTA(vfta_offset + i), vfta[i]);
+		IXGBE_WRITE_FLUSH(hw);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net 1/3] net: phy: realtek: add eee advertisement disable options
From: Andrew Lunn @ 2016-11-16 13:23 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: Florian Fainelli, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Carlo Caione, Kevin Hilman,
	Giuseppe Cavallaro, Alexandre TORGUE, Martin Blumenstingl,
	Andre Roth, Neil Armstrong,
	linux-amlogic-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1479290189.17538.25.camel-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>

> There two kind of PHYs supporting eee, the one advertising eee by
> default (like realtek) and the one not advertising it (like micrel).

I don't know too much about EEE. So maybe a dumb question. Does the
MAC need to be involved? Or is it just the PHY?

If the MAC needs to be involved, the PHY should not be advertising EEE
unless the MAC asks for it by calling phy_init_eee(). If this is true,
maybe we need to change the realtek driver, and others in that class.

      Andrew
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/2] can: holt_hi311x: document device tree bindings
From: Rob Herring @ 2016-11-16 13:34 UTC (permalink / raw)
  To: Akshay Bhat
  Cc: wg, mkl, mark.rutland, linux-can, netdev, devicetree,
	linux-kernel, Akshay Bhat
In-Reply-To: <1479146144-29143-1-git-send-email-akshay.bhat@timesys.com>

On Mon, Nov 14, 2016 at 12:55:43PM -0500, Akshay Bhat wrote:
> Document the HOLT HI-311x CAN device tree bindings.
> 
> Signed-off-by: Akshay Bhat <nodeax@gmail.com>
> ---
>  .../devicetree/bindings/net/can/holt_hi311x.txt    | 24 ++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/can/holt_hi311x.txt

Acked-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH net] sctp: use new rhlist interface on sctp transport rhashtable
From: Xin Long @ 2016-11-16 13:34 UTC (permalink / raw)
  To: Neil Horman
  Cc: network dev, linux-sctp, davem, Marcelo Ricardo Leitner,
	Vlad Yasevich, Herbert Xu, Phil Sutter
In-Reply-To: <20161115180427.GC11283@hmsreliant.think-freely.org>

On Wed, Nov 16, 2016 at 2:04 AM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Tue, Nov 15, 2016 at 11:23:11PM +0800, Xin Long wrote:
>> Now sctp transport rhashtable uses hash(lport, dport, daddr) as the key
>> to hash a node to one chain. If in one host thousands of assocs connect
>> to one server with the same lport and different laddrs (although it's
>> not a normal case), all the transports would be hashed into the same
>> chain.
>>
>> It may cause to keep returning -EBUSY when inserting a new node, as the
>> chain is too long and sctp inserts a transport node in a loop, which
>> could even lead to system hangs there.
>>
>> The new rhlist interface works for this case that there are many nodes
>> with the same key in one chain. It puts them into a list then makes this
>> list be as a node of the chain.
>>
>> This patch is to replace rhashtable_ interface with rhltable_ interface.
>> Since a chain would not be too long and it would not return -EBUSY with
>> this fix when inserting a node, the reinsert loop is also removed here.
>>
>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>
> Does this really buy us anything in this case though?  If the use case is that a
> majority of the associations map to the same key, then you might avoid EBUSY for
> the individual associaion that doesn't map there, but you still have to cope
> with a huge linear search for the majority of the keys.
>

This patch is NOT for improving performance, it is to reorganize
transports in rhashtable in another way to avoid EBUSY, rhlist is born
for this.

Before this patch, the transport insert codes are pretty bad, if it returns
EBUSY, it would retry in a loop. now this patch avoid this and even
removed that loop, it's a fix for this issue.

> Might be more reasonable to mix saddr into the hash function so that your use
> case gets spread through the hash table more evenly.

we can not do this:
1. it will increase rhashtable's size when using multihome, if a host has
    N addrs, the size for one assoc will be N times bigger than now.
2. the hash node is inside transport, if we mix saddrs, when using multihome
    one transport would be hashed many times with different saddrs, we
    would have to define new structure to link transport.
we do not need to do this:
1. as the changelog said, "it's not a normal case", in one host (client), it
shouldn't connect to the same server with different saddrs.
2. now as long as paddr+dport+lport are different, rhashtable can hash
it evenly.

^ permalink raw reply

* Re: [PATCH] net: ethernet: faraday: To support device tree usage.
From: Andrew Lunn @ 2016-11-16 13:47 UTC (permalink / raw)
  To: Greentime Hu; +Cc: netdev, linux-kernel
In-Reply-To: <1479285795-3105-1-git-send-email-green.hu@gmail.com>

On Wed, Nov 16, 2016 at 04:43:15PM +0800, Greentime Hu wrote:
> To support device tree usage for ftmac100.
> 
> Signed-off-by: Greentime Hu <green.hu@gmail.com>
> ---
>  drivers/net/ethernet/faraday/ftmac100.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/net/ethernet/faraday/ftmac100.c b/drivers/net/ethernet/faraday/ftmac100.c
> index dce5f7b..81dd9e1 100644
> --- a/drivers/net/ethernet/faraday/ftmac100.c
> +++ b/drivers/net/ethernet/faraday/ftmac100.c
> @@ -1172,11 +1172,17 @@ static int __exit ftmac100_remove(struct platform_device *pdev)
>  	return 0;
>  }
>  
> +static const struct of_device_id mac_of_ids[] = {
> +	{ .compatible = "andestech,atmac100" },
> +	{ }

andestech is not in
Documentation/devicetree/bindings/vendor-prefixes.txt Please provide a
separate patch adding it.

Humm, why andestech? Why not something based around faraday
technology?

      Andrew

^ permalink raw reply

* Re: [PATCH net] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
From: Hannes Frederic Sowa @ 2016-11-16 13:49 UTC (permalink / raw)
  To: Erik Nordmark, netdev
In-Reply-To: <99c143c3-9ec5-0f5f-44ed-6a4c60e723d8@sonic.net>

Hello,

On 16.11.2016 01:44, Erik Nordmark wrote:
> On 11/16/16 1:00 AM, Hannes Frederic Sowa wrote:
>> On 15.11.2016 08:57, Erik Nordmark wrote:
>>> Implemented RFC7527 Enhanced DAD.
>>> IPv6 duplicate address detection can fail if there is some temporary
>>> loopback of Ethernet frames. RFC7527 solves this by including a random
>>> nonce in the NS messages used for DAD, and if an NS is received with the
>>> same nonce it is assumed to be a looped back DAD probe and is ignored.
>>> RFC7527 is disabled by default. Can be enabled by setting either one of
>>> conf/{all,interface}/ipv6_rfc7527 to non-zero.
>>>
>>> Signed-off-by: Erik Nordmark <nordmark@arista.com>
>>>
>>> Index: linux-stable/Documentation/networking/ip-sysctl.txt
>>> ===================================================================
>>> --- linux-stable.orig/Documentation/networking/ip-sysctl.txt
>>> +++ linux-stable/Documentation/networking/ip-sysctl.txt
>>> @@ -1713,6 +1713,15 @@ drop_unsolicited_na - BOOLEAN
>>>
>>>       By default this is turned off.
>>>
>>> +ipv6_rfc7527 - BOOLEAN
>> Could you rename the sysctl to enhanced_dad. As it will anyway show up
>> in the ipv6/ directory hierarchy you don't need to prefix it with 'ipv6_'
> 
> Thanks for your careful review and in particular finding that
> uninitialized variable.
> 
> Yes, I'll rename the sysctl and all the other things to enhanced_dad.

Thanks!

>>>       DEVCONF_MAX
>>>   };
>>>
>>> Index: linux-stable/net/ipv6/addrconf.c
>>> ===================================================================
>>> --- linux-stable.orig/net/ipv6/addrconf.c
>>> +++ linux-stable/net/ipv6/addrconf.c
>>> @@ -217,6 +217,7 @@ static struct ipv6_devconf ipv6_devconf
>>>       .use_oif_addrs_only    = 0,
>>>       .ignore_routes_with_linkdown = 0,
>>>       .keep_addr_on_down    = 0,
>>> +    .ipv6_rfc7527           = 0,
>> What is your reason to not enable this by default? I haven't fully
>> studied the RFC, but it seems to be backwards compatible.
> 
> The concern is that while RFC4861 says that all unknown options must be
> silently ignored, RFC7527 isn't widely implemented yet. Hence if there
> is some broken implementation of RFC4861 it would fail to interoperate
> with Linux if we set this by default.
> 
> Perhaps I'm being too conservative?
> In any case such broken implementations would need to be fixed to ignore
> unknown options.

I thought about even removing the sysctl altogether and enable enhanced
DAD by default. ;)

I am in favor of enabling it by default.

But given that there could be broken implementations out there, we
should give users a choice and provide.

Could you always generate a nonce in the interface structure? You could
check the sysctl in the send and receive path to attach and check the
nonce. This has the advantage that you don't need to delete the
interface and recreate it to enable/disable enhanced dad on an interface
(also you can get away with the loop around get_random_bytes to make
sure its value is not zero as we don't depend on a non-zero nonce
variable to signal enaling of the feature, see below).

> 
>>
>>>   };
>>>
>>>   static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
>>> @@ -262,6 +263,7 @@ static struct ipv6_devconf ipv6_devconf_
>>>       .use_oif_addrs_only    = 0,
>>>       .ignore_routes_with_linkdown = 0,
>>>       .keep_addr_on_down    = 0,
>>> +    .ipv6_rfc7527           = 0,
>>>   };
>>>
>>>   /* Check if a valid qdisc is available */
>>> @@ -3722,12 +3724,18 @@ static void addrconf_dad_kick(struct ine
>>>   {
>>>       unsigned long rand_num;
>>>       struct inet6_dev *idev = ifp->idev;
>>> +    u64 nonce;
>>>
>>>       if (ifp->flags & IFA_F_OPTIMISTIC)
>>>           rand_num = 0;
>>>       else
>>>           rand_num = prandom_u32() % (idev->cnf.rtr_solicit_delay ? :
>>> 1);
>>>
>>> +    nonce = 0;
>>> +    if (ifp->idev->cnf.ipv6_rfc7527 ||
>>> + dev_net((ifp->idev)->dev)->ipv6.devconf_all->ipv6_rfc7527)
>> idev should already be in scope, so you can simplify this conditional.
> 
> Yes; I'll fix.
> 
>>
>>
>>> +        get_random_bytes(&nonce, 6);
>> Maybe:
>>
>> do
>>     get_random_bytes(&nonce, 6);
>> while (!nonce);
> 
> Is that because get_random_bytes() will not fill in anything if there is
> insufficient entropy available?

No, just because 0 is a possible return value from the random number
generator. ;)

>>> @@ -745,6 +756,7 @@ static void ndisc_recv_ns(struct sk_buff
>>>       int dad = ipv6_addr_any(saddr);
>>>       bool inc;
>>>       int is_router = -1;
>>> +    u64 nonce;
>>>
>>>       if (skb->len < sizeof(struct nd_msg)) {
>>>           ND_PRINTK(2, warn, "NS: packet too short\n");
>>> @@ -789,6 +801,8 @@ static void ndisc_recv_ns(struct sk_buff
>>>               return;
>>>           }
>>>       }
>>> +    if (ndopts.nd_opts_nonce)
>>> +        memcpy(&nonce, (u8 *)(ndopts.nd_opts_nonce + 1), 6);
>> You only initialize 6 bytes of the nonce, with the other 2 being not
>> initialized.
> Mea culpa. Will fix.
> 
>>
>>>       inc = ipv6_addr_is_multicast(daddr);
>>>
>>> @@ -797,6 +811,16 @@ static void ndisc_recv_ns(struct sk_buff
>>>   have_ifp:
>>>           if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
>>>               if (dad) {
>>> +                if (nonce != 0 && ifp->dad_nonce == nonce) {
>>> +                    /* Matching nonce if looped back */
>>> +                    if (net_ratelimit())
>>> +                        ND_PRINTK(2, notice,
>>> +                              "%s: IPv6 DAD loopback for address %pI6c
>>> nonce %llu ignored\n",
>>> +                               ifp->idev->dev->name,
>>> +                               &ifp->addr,
>>> +                               nonce);
>> If we print the nonce for debugging reasons, we should keep it in
>> correct endianess on the wire vs. in the debug output.
> How about printing it as colon-separated hex bytes since that is more
> clear than decimal?
> Would follow the network byte order in the packet.

I would be totally fine with it. It will be probably easier to switch to
a char[6] array for the nonce then.

Bye,
Hannes

^ permalink raw reply

* [PATCH net-next] net/mlx4_en: remove napi_hash_del() call
From: Eric Dumazet @ 2016-11-16 13:49 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tariq Toukan

From: Eric Dumazet <edumazet@google.com>

There is no need calling napi_hash_del()+synchronize_rcu() before
calling netif_napi_del()

netif_napi_del() does this already.

Using napi_hash_del() in a driver is useful only when dealing with
a batch of NAPI structures, so that a single synchronize_rcu() can
be used. mlx4_en_deactivate_cq() is deactivating a single NAPI.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |    4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 03f05c4d1f9897d851c4f36d1f0e37ad3398e76c..09dd3776db7632ae8818c9638507705176233ea4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -185,10 +185,6 @@ void mlx4_en_destroy_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq **pcq)
 void mlx4_en_deactivate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq)
 {
 	napi_disable(&cq->napi);
-	if (cq->type == RX) {
-		napi_hash_del(&cq->napi);
-		synchronize_rcu();
-	}
 	netif_napi_del(&cq->napi);
 
 	mlx4_cq_free(priv->mdev->dev, &cq->mcq);

^ permalink raw reply related

* Re: [PATCH v2] net/phy/vitesse: Configure RGMII skew on VSC8601, if needed
From: Andrew Lunn @ 2016-11-16 13:50 UTC (permalink / raw)
  To: Alexandru Gagniuc; +Cc: f.fainelli, netdev, linux-kernel, davem
In-Reply-To: <1479286953-11481-1-git-send-email-alex.g@adaptrum.com>

On Wed, Nov 16, 2016 at 01:02:33AM -0800, Alexandru Gagniuc wrote:
> With RGMII, we need a 1.5 to 2ns skew between clock and data lines. The
> VSC8601 can handle this internally. While the VSC8601 can set more
> fine-grained delays, the standard skew settings work out of the box.
> The same heuristic is used to determine when this skew should be enabled
> as in vsc824x_config_init().
> 
> Tested on custom board with AM3352 SOC and VSC801 PHY.
> 
> Signed-off-by: Alexandru Gagniuc <alex.g@adaptrum.com>
> ---
> Changes since v1:
>  * Added comment detailing applicability to different RGMII interfaces.
> 
>  drivers/net/phy/vitesse.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/vitesse.c b/drivers/net/phy/vitesse.c
> index 2e37eb3..24b4a09 100644
> --- a/drivers/net/phy/vitesse.c
> +++ b/drivers/net/phy/vitesse.c
> @@ -62,6 +62,10 @@
>  /* Vitesse Extended Page Access Register */
>  #define MII_VSC82X4_EXT_PAGE_ACCESS	0x1f
>  
> +/* Vitesse VSC8601 Extended PHY Control Register 1 */
> +#define MII_VSC8601_EPHY_CTL		0x17
> +#define MII_VSC8601_EPHY_CTL_RGMII_SKEW	(1 << 8)
> +
>  #define PHY_ID_VSC8234			0x000fc620
>  #define PHY_ID_VSC8244			0x000fc6c0
>  #define PHY_ID_VSC8514			0x00070670
> @@ -111,6 +115,34 @@ static int vsc824x_config_init(struct phy_device *phydev)
>  	return err;
>  }
>  
> +/* This adds a skew for both TX and RX clocks, so the skew should only be
> + * applied to "rgmii-id" interfaces. It may not work as expected
> + * on "rgmii-txid", "rgmii-rxid" or "rgmii" interfaces. */

Hi Alexandru

You should be able to make "rgmii" work as expected. If that is the
phy mode, disable the skew.

    Andrew

^ permalink raw reply

* Re: [PATCH v2] net: ethernet: faraday: To support device tree usage.
From: Andrew Lunn @ 2016-11-16 13:53 UTC (permalink / raw)
  To: Greentime Hu; +Cc: netdev, linux-kernel, jiri
In-Reply-To: <1479297826-4335-1-git-send-email-green.hu@gmail.com>

On Wed, Nov 16, 2016 at 08:03:46PM +0800, Greentime Hu wrote:
> Signed-off-by: Greentime Hu <green.hu@gmail.com>
> ---
>  drivers/net/ethernet/faraday/ftmac100.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/net/ethernet/faraday/ftmac100.c b/drivers/net/ethernet/faraday/ftmac100.c
> index dce5f7b..5d70ee9 100644
> --- a/drivers/net/ethernet/faraday/ftmac100.c
> +++ b/drivers/net/ethernet/faraday/ftmac100.c
> @@ -1172,11 +1172,17 @@ static int __exit ftmac100_remove(struct platform_device *pdev)
>  	return 0;
>  }
>  
> +static const struct of_device_id ftmac100_of_ids[] = {
> +	{ .compatible = "andestech,atmac100" },

Please see my comment to v1.

You are also supposed to add a binding document.

    Andrew

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] bpf, mlx5: fix various refcount issues in mlx5e_xdp_set
From: Daniel Borkmann @ 2016-11-16 13:54 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Alexei Starovoitov, Brenden Blanco, zhiyisun,
	Rana Shahout, Saeed Mahameed, Linux Netdev List
In-Reply-To: <CALzJLG_r9tujvSVysMXLSO8cRR6zHaHfmacmGq-hCqg5Ou7w_w@mail.gmail.com>

On 11/16/2016 01:25 PM, Saeed Mahameed wrote:
> On Wed, Nov 16, 2016 at 2:04 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> There are multiple issues in mlx5e_xdp_set():
>>
>> 1) The batched bpf_prog_add() is currently not checked for errors! When
>>     doing so, it should be done at an earlier point in time to makes sure
>>     that we cannot fail anymore at the time we want to set the program for
>>     each channel. This only means that we have to undo the bpf_prog_add()
>>     in case we return due to required reset.
>>
>> 2) When swapping the priv->xdp_prog, then no extra reference count must be
>>     taken since we got that from call path via dev_change_xdp_fd() already.
>>     Otherwise, we'd never be able to free the program. Also, bpf_prog_add()
>>     without checking the return code could fail.
>>
>> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---
>>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 25 ++++++++++++++++++-----
>>   1 file changed, 20 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index ab0c336..cf26672 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -3142,6 +3142,17 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
>>                  goto unlock;
>>          }
>>
>> +       if (prog) {
>> +               /* num_channels is invariant here, so we can take the
>> +                * batched reference right upfront.
>> +                */
>> +               prog = bpf_prog_add(prog, priv->params.num_channels);
>> +               if (IS_ERR(prog)) {
>> +                       err = PTR_ERR(prog);
>> +                       goto unlock;
>> +               }
>> +       }
>> +
>
> With this approach you will end up taking a ref count twice per RQ! on
> the first time xdp_set is called i.e (old_prog = NULL, prog != NULL).
> One ref will be taken per RQ/Channel from the above code, and since
> reset will be TRUE mlx5e_open_locked will be called and another ref
> count will be taken on mlx5e_create_rq.
>
> The issue here is that we have two places for ref count accounting,
> xdp_set and  mlx5e_create_rq. Having ref-count updates in
> mlx5e_create_rq is essential for num_channels configuration changes
> (mlx5e_set_ringparam).
>
> Our previous approach made sure that only one path will do the ref
> counting (mlx5e_open_locked vs. mlx5e_xdp_set batch ref update when
> reset == false).

That is correct, for a short time bpf_prog_add() was charged also when
we reset. When we reset, we will then jump to unlock_put and safely undo
this since we took ref from mlx5e_create_rq() already in that case, and
return successfully. That was intentional, so that eventually we end up
just taking a /single/ ref per RQ when we exit mlx5e_xdp_set(), but more
below ...

[...]
> 2. Keep the current approach and make sure to not update the ref count
> twice, you can batch update only if (!reset && was_open) otherwise you
> can rely on mlx5e_open_locked to take the num_channels ref count for
> you.
>
> Personally I prefer option 2, since we will keep the current logic
> which will allow configuration updates such as (mlx5e_set_ringparam)
> to not worry about ref counting since it will be done in the reset
> flow.

... agree on keeping current approach. I actually like the idea, so we'd
end up with this simpler version for the batched ref then.

Note, your "bpf_prog_add(prog, 1) // one ref for the device." is not needed
since this we already got that one through dev_change_xdp_fd() as mentioned.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ab0c336..09ac27c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3148,11 +3148,21 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)

  	if (was_opened && reset)
  		mlx5e_close_locked(netdev);
+	if (was_opened && !reset) {
+		/* num_channels is invariant here, so we can take the
+		 * batched reference right upfront.
+		 */
+		prog = bpf_prog_add(prog, priv->params.num_channels);
+		if (IS_ERR(prog)) {
+			err = PTR_ERR(prog);
+			goto unlock;
+		}
+	}

-	/* exchange programs */
+	/* exchange programs, extra prog reference we got from caller
+	 * as long as we don't fail from this point onwards.
+	 */
  	old_prog = xchg(&priv->xdp_prog, prog);
-	if (prog)
-		bpf_prog_add(prog, 1);
  	if (old_prog)
  		bpf_prog_put(old_prog);

@@ -3168,7 +3178,6 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
  	/* exchanging programs w/o reset, we update ref counts on behalf
  	 * of the channels RQs here.
  	 */
-	bpf_prog_add(prog, priv->params.num_channels);
  	for (i = 0; i < priv->params.num_channels; i++) {
  		struct mlx5e_channel *c = priv->channel[i];

-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next] net/mlx5e: remove napi_hash_del() calls
From: Eric Dumazet @ 2016-11-16 13:55 UTC (permalink / raw)
  To: David Miller; +Cc: Saeed Mahameed, netdev

From: Eric Dumazet <edumazet@google.com>

Calling napi_hash_del() after netif_napi_del() is pointless.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    4 ----
 1 file changed, 4 deletions(-)

^ permalink raw reply

* Re: [PATCH net] virtio-net: add a missing synchronize_net()
From: Michael S. Tsirkin @ 2016-11-16 13:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Jason Wang
In-Reply-To: <1479277452.8455.156.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, Nov 15, 2016 at 10:24:12PM -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> It seems many drivers do not respect napi_hash_del() contract.
> 
> When napi_hash_del() is used before netif_napi_del(), an RCU grace
> period is needed before freeing NAPI object.
> 
> Fixes: 91815639d880 ("virtio-net: rx busy polling support")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>


> ---
>  drivers/net/virtio_net.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index fd8b1e62301f..7276d5a95bd0 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1497,6 +1497,11 @@ static void virtnet_free_queues(struct virtnet_info *vi)
>  		netif_napi_del(&vi->rq[i].napi);
>  	}
>  
> +	/* We called napi_hash_del() before netif_napi_del(),
> +	 * we need to respect an RCU grace period before freeing vi->rq
> +	 */
> +	synchronize_net();
> +
>  	kfree(vi->rq);
>  	kfree(vi->sq);
>  }
> 

^ permalink raw reply

* [PATCH net-next] sfc: remove napi_hash_del() call
From: Eric Dumazet @ 2016-11-16 14:01 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Edward Cree, Bert Kenward

From: Eric Dumazet <edumazet@google.com>

Calling napi_hash_del() after netif_napi_del() is pointless.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Edward Cree <ecree@solarflare.com>
Cc: Bert Kenward <bkenward@solarflare.com>
---
 drivers/net/ethernet/sfc/efx.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 649bc508fa4b9ea42aa08331c50335e1a51f9259..52ca0404c49121f1c86126ab7d443b1144f20720 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -2113,10 +2113,9 @@ static void efx_init_napi(struct efx_nic *efx)
 
 static void efx_fini_napi_channel(struct efx_channel *channel)
 {
-	if (channel->napi_dev) {
+	if (channel->napi_dev)
 		netif_napi_del(&channel->napi_str);
-		napi_hash_del(&channel->napi_str);
-	}
+
 	channel->napi_dev = NULL;
 }
 

^ permalink raw reply related

* Re: [PATCH 1/2] ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC
From: Peter Robinson @ 2016-11-16 14:01 UTC (permalink / raw)
  To: David Miller; +Cc: mcoquelin.stm32, alexandre.torgue, peppe.cavallaro, netdev
In-Reply-To: <20161106.214128.668519501329683090.davem@davemloft.net>

On Mon, Nov 7, 2016 at 2:41 AM, David Miller <davem@davemloft.net> wrote:
> From: Peter Robinson <pbrobinson@gmail.com>
> Date: Sun,  6 Nov 2016 20:04:37 +0000
>
>> There's not much point, except compile test, enabling the stmmac
>> platform drivers unless the STM32 SoC is enabled. It's not
>> useful without it.
>>
>> Signed-off-by: Peter Robinson <pbrobinson@gmail.com>
>
> Please don't post just some of the patches in a patch series.
>
> Always post the complete series, with a proper "[PATCH 0/2] ..."
> posting at the beginning.

Sorry, it was a stm branch, I should have broke it our differently. I
can recompose and send this one again on it's own.

Peter

^ permalink raw reply

* Aw: [PATCH 1/1] ixgbe: write flush vfta registers
From: Lino Sanfilippo @ 2016-11-16 14:05 UTC (permalink / raw)
  To: zyjzyj2000
  Cc: e1000-devel, netdev, jeffrey.t.kirsher, intel-wired-lan,
	Zhu Yanjun
In-Reply-To: <1479300041-3436-1-git-send-email-zyjzyj2000@gmail.com>



Hi,

> 
> Sometimes vfta registers can not be written successfully in dcb mode.
> This is very occassional. When the ixgbe nic runs for a very long time,
> sometimes this bug occurs. But after IXGBE_WRITE_FLUSH is executed,
> this bug never occurs.
> 
> Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index bd93d82..1221cfb 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -4138,8 +4138,10 @@ static void ixgbe_vlan_promisc_enable(struct ixgbe_adapter *adapter)
>  	}
>  
>  	/* Set all bits in the VLAN filter table array */
> -	for (i = hw->mac.vft_size; i--;)
> +	for (i = hw->mac.vft_size; i--;) {
>  		IXGBE_WRITE_REG(hw, IXGBE_VFTA(i), ~0U);
> +		IXGBE_WRITE_FLUSH(hw);
> +	}

Should it not be sufficient to do the flush only once, at the end of the function?

Regards,
Lino

^ permalink raw reply

* [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables
From: Jiri Pirko @ 2016-11-16 14:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber

From: Jiri Pirko <jiri@mellanox.com>

Ido says:

In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
by a new FIB notification chain to which modules could register in order
to be notified about the addition and deletion of FIB entries. The
motivation for this change was that switchdev drivers need to be able to
reflect the entire FIB table and not only FIBs configured on top of the
port netdevs themselves. This is useful in case of in-band management.

The fundamental problem with this approach is that upon registration
listeners lose all the information previously sent in the chain and
thus have an incomplete view of the FIB tables, which can result in
packet loss. This patchset fixes that by introducing a new API to dump
the FIB tables.

The entire dump process is done under RCU and thus the FIB notification
chain is converted to be atomic. The listeners are modified accordingly.
This is done in the first five patches.

The sixth patch adds the dump callback itself and the next patches call
it from current listeners of the FIB notification chain.

Ido Schimmel (8):
  ipv4: fib: Export free_fib_info()
  mlxsw: spectrum_router: Implement FIB offload in delayed work
  rocker: Implement FIB offload in delayed work
  ipv4: fib: Convert FIB notification chain to be atomic
  net: ipv4: Send notifications only after removing FIB alias
  ipv4: fib: Add an API to request a FIB dump
  mlxsw: spectrum_router: Request a dump of FIB tables during init
  rocker: Request a dump of FIB tables during init

 drivers/net/ethernet/mellanox/mlxsw/core.c         |   6 +
 drivers/net/ethernet/mellanox/mlxsw/core.h         |   1 +
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  73 ++++++++++--
 drivers/net/ethernet/rocker/rocker_main.c          |  59 ++++++++--
 drivers/net/ethernet/rocker/rocker_ofdpa.c         |   1 +
 include/net/ip_fib.h                               |   1 +
 net/ipv4/fib_semantics.c                           |   1 +
 net/ipv4/fib_trie.c                                | 128 +++++++++++++++++++--
 8 files changed, 242 insertions(+), 28 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [patch net-next 1/8] ipv4: fib: Export free_fib_info()
From: Jiri Pirko @ 2016-11-16 14:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1479305343-13167-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

The FIB notification chain is going to be converted to an atomic chain,
which means switchdev drivers will have to offload FIB entries in
delayed work, as hardware operations entail sleeping.

However, while the work is queued fib info might be freed, so a
reference must be taken. To release the reference (and potentially free
the fib info) fib_info_put() will be called, which in turn calls
free_fib_info().

Export free_fib_info() so that modules will be able to invoke
fib_info_put().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 net/ipv4/fib_semantics.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 388d3e2..c1bc1e9 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -234,6 +234,7 @@ void free_fib_info(struct fib_info *fi)
 #endif
 	call_rcu(&fi->rcu, free_fib_info_rcu);
 }
+EXPORT_SYMBOL_GPL(free_fib_info);

 void fib_release_info(struct fib_info *fi)
 {
-- 
2.7.4

^ permalink raw reply related

* [patch net-next 2/8] mlxsw: spectrum_router: Implement FIB offload in delayed work
From: Jiri Pirko @ 2016-11-16 14:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1479305343-13167-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

FIB offload is currently done in process context with RTNL held, but
we're about to dump the FIB tables in RCU critical section, so we can no
longer sleep.

Instead, defer the operation to process context using delayed work. Make
sure fib info isn't freed while the work is queued by taking a reference
on it and releasing it after the operation is done.

Deferring the operation is valid because the upper layers always assume
the operation was successful. If it's not, then the driver-specific
abort mechanism is called and all routed traffic is directed to slow
path.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c         |  6 ++
 drivers/net/ethernet/mellanox/mlxsw/core.h         |  1 +
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 72 +++++++++++++++++++---
 3 files changed, 69 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 6004817..7874e30 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1839,6 +1839,12 @@ int mlxsw_core_schedule_dw(struct delayed_work *dwork, unsigned long delay)
 }
 EXPORT_SYMBOL(mlxsw_core_schedule_dw);
 
+void mlxsw_core_flush_wq(void)
+{
+	flush_workqueue(mlxsw_wq);
+}
+EXPORT_SYMBOL(mlxsw_core_flush_wq);
+
 static int __init mlxsw_core_module_init(void)
 {
 	int err;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h
index c0acc1b..e382ed0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -156,6 +156,7 @@ enum devlink_port_type mlxsw_core_port_type_get(struct mlxsw_core *mlxsw_core,
 						u8 local_port);
 
 int mlxsw_core_schedule_dw(struct delayed_work *dwork, unsigned long delay);
+void mlxsw_core_flush_wq(void);
 
 #define MLXSW_CONFIG_PROFILE_SWID_COUNT 8
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 683f045..a8011a5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -593,6 +593,14 @@ static void mlxsw_sp_router_fib_flush(struct mlxsw_sp *mlxsw_sp);
 
 static void mlxsw_sp_vrs_fini(struct mlxsw_sp *mlxsw_sp)
 {
+	/* At this stage we're guaranteed not to have new incoming
+	 * FIB notifications and the work queue is free from FIBs
+	 * sitting on top of mlxsw netdevs. However, we can still
+	 * have other FIBs queued. Flush the queue before flushing
+	 * the device's tables. No need for locks, as we're the only
+	 * writer.
+	 */
+	mlxsw_core_flush_wq();
 	mlxsw_sp_router_fib_flush(mlxsw_sp);
 	kfree(mlxsw_sp->router.vrs);
 }
@@ -1948,30 +1956,74 @@ static void __mlxsw_sp_router_fini(struct mlxsw_sp *mlxsw_sp)
 	kfree(mlxsw_sp->rifs);
 }
 
-static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
-				     unsigned long event, void *ptr)
+struct mlxsw_sp_fib_event_work {
+	struct fib_entry_notifier_info fen_info;
+	struct mlxsw_sp *mlxsw_sp;
+	struct delayed_work dw;
+	unsigned long event;
+};
+
+static void mlxsw_sp_router_fib_event_work(struct work_struct *work)
 {
-	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
-	struct fib_entry_notifier_info *fen_info = ptr;
+	struct mlxsw_sp_fib_event_work *fib_work =
+		container_of(work, struct mlxsw_sp_fib_event_work, dw.work);
+	struct mlxsw_sp *mlxsw_sp = fib_work->mlxsw_sp;
 	int err;
 
-	if (!net_eq(fen_info->info.net, &init_net))
-		return NOTIFY_DONE;
-
-	switch (event) {
+	/* Protect internal structures from changes */
+	rtnl_lock();
+	switch (fib_work->event) {
 	case FIB_EVENT_ENTRY_ADD:
-		err = mlxsw_sp_router_fib4_add(mlxsw_sp, fen_info);
+		err = mlxsw_sp_router_fib4_add(mlxsw_sp, &fib_work->fen_info);
 		if (err)
 			mlxsw_sp_router_fib4_abort(mlxsw_sp);
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_ENTRY_DEL:
-		mlxsw_sp_router_fib4_del(mlxsw_sp, fen_info);
+		mlxsw_sp_router_fib4_del(mlxsw_sp, &fib_work->fen_info);
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_RULE_ADD: /* fall through */
 	case FIB_EVENT_RULE_DEL:
 		mlxsw_sp_router_fib4_abort(mlxsw_sp);
 		break;
 	}
+	rtnl_unlock();
+	kfree(fib_work);
+}
+
+/* Called with rcu_read_lock() */
+static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
+				     unsigned long event, void *ptr)
+{
+	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
+	struct mlxsw_sp_fib_event_work *fib_work;
+	struct fib_notifier_info *info = ptr;
+
+	if (!net_eq(info->net, &init_net))
+		return NOTIFY_DONE;
+
+	fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
+	if (WARN_ON(!fib_work))
+		return NOTIFY_BAD;
+
+	INIT_DELAYED_WORK(&fib_work->dw, mlxsw_sp_router_fib_event_work);
+	fib_work->mlxsw_sp = mlxsw_sp;
+	fib_work->event = event;
+
+	switch (event) {
+	case FIB_EVENT_ENTRY_ADD: /* fall through */
+	case FIB_EVENT_ENTRY_DEL:
+		memcpy(&fib_work->fen_info, ptr, sizeof(fib_work->fen_info));
+		/* Take referece on fib_info to prevent it from being
+		 * freed while work is queued. Release it afterwards.
+		 */
+		atomic_inc(&fib_work->fen_info.fi->fib_clntref);
+		break;
+	}
+
+	mlxsw_core_schedule_dw(&fib_work->dw, 0);
+
 	return NOTIFY_DONE;
 }
 
-- 
2.7.4

^ permalink raw reply related

* [patch net-next 3/8] rocker: Implement FIB offload in delayed work
From: Jiri Pirko @ 2016-11-16 14:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1479305343-13167-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

Convert rocker to offload FIBs in delayed work in a similar fashion to
mlxsw, which was converted in the previous patch.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/rocker/rocker_main.c  | 58 +++++++++++++++++++++++++-----
 drivers/net/ethernet/rocker/rocker_ofdpa.c |  1 +
 2 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 67df4cf..b80ff12 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2165,28 +2165,70 @@ static const struct switchdev_ops rocker_port_switchdev_ops = {
 	.switchdev_port_obj_dump	= rocker_port_obj_dump,
 };
 
-static int rocker_router_fib_event(struct notifier_block *nb,
-				   unsigned long event, void *ptr)
+struct rocker_fib_event_work {
+	struct fib_entry_notifier_info fen_info;
+	struct rocker *rocker;
+	struct delayed_work dw;
+	unsigned long event;
+};
+
+static void rocker_router_fib_event_work(struct work_struct *work)
 {
-	struct rocker *rocker = container_of(nb, struct rocker, fib_nb);
-	struct fib_entry_notifier_info *fen_info = ptr;
+	struct rocker_fib_event_work *fib_work =
+		container_of(work, struct rocker_fib_event_work, dw.work);
+	struct rocker *rocker = fib_work->rocker;
 	int err;
 
-	switch (event) {
+	/* Protect internal structures from changes */
+	rtnl_lock();
+	switch (fib_work->event) {
 	case FIB_EVENT_ENTRY_ADD:
-		err = rocker_world_fib4_add(rocker, fen_info);
+		err = rocker_world_fib4_add(rocker, &fib_work->fen_info);
 		if (err)
 			rocker_world_fib4_abort(rocker);
-		else
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_ENTRY_DEL:
-		rocker_world_fib4_del(rocker, fen_info);
+		rocker_world_fib4_del(rocker, &fib_work->fen_info);
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_RULE_ADD: /* fall through */
 	case FIB_EVENT_RULE_DEL:
 		rocker_world_fib4_abort(rocker);
 		break;
 	}
+	rtnl_unlock();
+	kfree(fib_work);
+}
+
+/* Called with rcu_read_lock() */
+static int rocker_router_fib_event(struct notifier_block *nb,
+				   unsigned long event, void *ptr)
+{
+	struct rocker *rocker = container_of(nb, struct rocker, fib_nb);
+	struct rocker_fib_event_work *fib_work;
+
+	fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
+	if (WARN_ON(!fib_work))
+		return NOTIFY_BAD;
+
+	INIT_DELAYED_WORK(&fib_work->dw, rocker_router_fib_event_work);
+	fib_work->rocker = rocker;
+	fib_work->event = event;
+
+	switch (event) {
+	case FIB_EVENT_ENTRY_ADD: /* fall through */
+	case FIB_EVENT_ENTRY_DEL:
+		memcpy(&fib_work->fen_info, ptr, sizeof(fib_work->fen_info));
+		/* Take referece on fib_info to prevent it from being
+		 * freed while work is queued. Release it afterwards.
+		 */
+		atomic_inc(&fib_work->fen_info.fi->fib_clntref);
+		break;
+	}
+
+	schedule_work(&fib_work->dw.work);
+
 	return NOTIFY_DONE;
 }
 
diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 4ca4613..5d7c738 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -2516,6 +2516,7 @@ static void ofdpa_fini(struct rocker *rocker)
 	int bkt;
 
 	del_timer_sync(&ofdpa->fdb_cleanup_timer);
+	flush_scheduled_work();
 
 	spin_lock_irqsave(&ofdpa->flow_tbl_lock, flags);
 	hash_for_each_safe(ofdpa->flow_tbl, bkt, tmp, flow_entry, entry)
-- 
2.7.4

^ permalink raw reply related

* [patch net-next 4/8] ipv4: fib: Convert FIB notification chain to be atomic
From: Jiri Pirko @ 2016-11-16 14:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1479305343-13167-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

In order not to hold RTNL for long periods of time we're going to dump
the FIB tables using RCU.

Convert the FIB notification chain to be atomic, as we can't block in
RCU critical sections.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 net/ipv4/fib_trie.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 4cff74d..64a51ec 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -84,17 +84,17 @@
 #include <trace/events/fib.h>
 #include "fib_lookup.h"
 
-static BLOCKING_NOTIFIER_HEAD(fib_chain);
+static ATOMIC_NOTIFIER_HEAD(fib_chain);
 
 int register_fib_notifier(struct notifier_block *nb)
 {
-	return blocking_notifier_chain_register(&fib_chain, nb);
+	return atomic_notifier_chain_register(&fib_chain, nb);
 }
 EXPORT_SYMBOL(register_fib_notifier);
 
 int unregister_fib_notifier(struct notifier_block *nb)
 {
-	return blocking_notifier_chain_unregister(&fib_chain, nb);
+	return atomic_notifier_chain_unregister(&fib_chain, nb);
 }
 EXPORT_SYMBOL(unregister_fib_notifier);
 
@@ -102,7 +102,7 @@ int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
 		       struct fib_notifier_info *info)
 {
 	info->net = net;
-	return blocking_notifier_call_chain(&fib_chain, event_type, info);
+	return atomic_notifier_call_chain(&fib_chain, event_type, info);
 }
 
 static int call_fib_entry_notifiers(struct net *net,
-- 
2.7.4

^ permalink raw reply related

* [patch net-next 5/8] net: ipv4: Send notifications only after removing FIB alias
From: Jiri Pirko @ 2016-11-16 14:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1479305343-13167-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

When removing a FIB alias we should send a notification (both to user
space and in kernel) only after the fact, or otherwise we could end up
in problematic situations.

For example, assume we have two tasks:
a) Task A - Removing a FIB alias (fa1) following RTM_DELROUTE
b) Task B - Replaying FIB tables under RCU for listener (l) following
   fib_notifier_dump() (to be introduced in the next patch).

Timeline:
t0 - Task A calls FIB_EVENT_ENTRY_DEL for fa1. Will be ignored by
     listener l since it's not aware of fa1.
t1 - Task B reaches fa1 in the trie and replays it to listener l using
     FIB_EVENT_ENTRY_ADD.
t2 - Task A removes fa1 from the trie.

The above will result in listener l (f.e., some capable device) having
fa1 in its tables whereas fa1 isn't present in the kernel's table
anymore.

If we always send notifications after the fact, then we can avoid such
races. During insertion, if we traversed the trie when fa1 wasn't
present, then we'll eventually get the notification and process it.
During deletion, if we traversed the trie when fa1 was still present,
then the notification will let us know that it actually needs to be
removed.

The above is consistent with insertion of a FIB alias, where
notifications are sent only after the fact.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 net/ipv4/fib_trie.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 64a51ec..2195601 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1569,17 +1569,17 @@ int fib_table_delete(struct net *net, struct fib_table *tb,
 	if (!fa_to_delete)
 		return -ESRCH;

+	if (!plen)
+		tb->tb_num_default--;
+
+	fib_remove_alias(t, tp, l, fa_to_delete);
+
 	call_fib_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, key, plen,
 				 fa_to_delete->fa_info, tos, cfg->fc_type,
 				 tb->tb_id, 0);
 	rtmsg_fib(RTM_DELROUTE, htonl(key), fa_to_delete, plen, tb->tb_id,
 		  &cfg->fc_nlinfo, 0);

-	if (!plen)
-		tb->tb_num_default--;
-
-	fib_remove_alias(t, tp, l, fa_to_delete);
-
 	if (fa_to_delete->fa_state & FA_S_ACCESSED)
 		rt_cache_flush(cfg->fc_nlinfo.nl_net);

@@ -1810,12 +1810,12 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
 				continue;
 			}

+			hlist_del_rcu(&fa->fa_list);
 			call_fib_entry_notifiers(net, FIB_EVENT_ENTRY_DEL,
 						 n->key,
 						 KEYLENGTH - fa->fa_slen,
 						 fi, fa->fa_tos, fa->fa_type,
 						 tb->tb_id, 0);
-			hlist_del_rcu(&fa->fa_list);
 			fib_release_info(fa->fa_info);
 			alias_free_mem_rcu(fa);
 			found++;
-- 
2.7.4

^ permalink raw reply related

* [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Jiri Pirko @ 2016-11-16 14:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1479305343-13167-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (f.e., switchdev
drivers) about addition and deletion of routes.

However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.

Solve that by adding an API to request a FIB dump. The dump itself it
done using RCU in order not to starve consumers that need RTNL to make
progress.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/ip_fib.h |   1 +
 net/ipv4/fib_trie.c  | 108 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index b9314b4..095a374 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -221,6 +221,7 @@ enum fib_event_type {
 	FIB_EVENT_RULE_DEL,
 };
 
+void fib_notifier_dump(struct notifier_block *nb);
 int register_fib_notifier(struct notifier_block *nb);
 int unregister_fib_notifier(struct notifier_block *nb);
 int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 2195601..9670f3f 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -86,6 +86,58 @@
 
 static ATOMIC_NOTIFIER_HEAD(fib_chain);
 
+static int call_fib_notifier(struct notifier_block *nb, struct net *net,
+			     enum fib_event_type event_type,
+			     struct fib_notifier_info *info)
+{
+	info->net = net;
+	return nb->notifier_call(nb, event_type, info);
+}
+
+static void fib_rules_notify(struct net *net, struct notifier_block *nb,
+			     enum fib_event_type event_type)
+{
+#ifdef CONFIG_IP_MULTIPLE_TABLES
+	struct fib_notifier_info info;
+
+	if (net->ipv4.fib_has_custom_rules)
+		call_fib_notifier(nb, net, event_type, &info);
+#endif
+}
+
+static void fib_notify(struct net *net, struct notifier_block *nb,
+		       enum fib_event_type event_type);
+
+static int call_fib_entry_notifier(struct notifier_block *nb, struct net *net,
+				   enum fib_event_type event_type, u32 dst,
+				   int dst_len, struct fib_info *fi,
+				   u8 tos, u8 type, u32 tb_id, u32 nlflags)
+{
+	struct fib_entry_notifier_info info = {
+		.dst = dst,
+		.dst_len = dst_len,
+		.fi = fi,
+		.tos = tos,
+		.type = type,
+		.tb_id = tb_id,
+		.nlflags = nlflags,
+	};
+	return call_fib_notifier(nb, net, event_type, &info.info);
+}
+
+void fib_notifier_dump(struct notifier_block *nb)
+{
+	struct net *net;
+
+	rcu_read_lock();
+	for_each_net_rcu(net) {
+		fib_rules_notify(net, nb, FIB_EVENT_RULE_ADD);
+		fib_notify(net, nb, FIB_EVENT_ENTRY_ADD);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(fib_notifier_dump);
+
 int register_fib_notifier(struct notifier_block *nb)
 {
 	return atomic_notifier_chain_register(&fib_chain, nb);
@@ -1834,6 +1886,62 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
 	return found;
 }
 
+static void fib_leaf_notify(struct net *net, struct key_vector *l,
+			    struct fib_table *tb, struct notifier_block *nb,
+			    enum fib_event_type event_type)
+{
+	struct fib_alias *fa;
+
+	hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
+		struct fib_info *fi = fa->fa_info;
+
+		if (!fi)
+			continue;
+
+		/* local and main table can share the same trie,
+		 * so don't notify twice for the same entry.
+		 */
+		if (tb->tb_id != fa->tb_id)
+			continue;
+
+		call_fib_entry_notifier(nb, net, event_type, l->key,
+					KEYLENGTH - fa->fa_slen, fi, fa->fa_tos,
+					fa->fa_type, fa->tb_id, 0);
+	}
+}
+
+static void fib_table_notify(struct net *net, struct fib_table *tb,
+			     struct notifier_block *nb,
+			     enum fib_event_type event_type)
+{
+	struct trie *t = (struct trie *)tb->tb_data;
+	struct key_vector *l, *tp = t->kv;
+	t_key key = 0;
+
+	while ((l = leaf_walk_rcu(&tp, key)) != NULL) {
+		fib_leaf_notify(net, l, tb, nb, event_type);
+
+		key = l->key + 1;
+		/* stop in case of wrap around */
+		if (key < l->key)
+			break;
+	}
+}
+
+static void fib_notify(struct net *net, struct notifier_block *nb,
+		       enum fib_event_type event_type)
+{
+	unsigned int h;
+
+	for (h = 0; h < FIB_TABLE_HASHSZ; h++) {
+		struct hlist_head *head = &net->ipv4.fib_table_hash[h];
+		struct fib_table *tb;
+
+		hlist_for_each_entry_rcu(tb, head, tb_hlist)
+			fib_table_notify(net, tb, nb, event_type);
+	}
+}
+
 static void __trie_free_rcu(struct rcu_head *head)
 {
 	struct fib_table *tb = container_of(head, struct fib_table, rcu);
-- 
2.7.4

^ permalink raw reply related

* [patch net-next 7/8] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Jiri Pirko @ 2016-11-16 14:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1479305343-13167-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

Make sure the device has a complete view of the FIB tables by invoking
their dump during module init.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index a8011a5..46126d9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -2048,6 +2048,7 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
 
 	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
 	register_fib_notifier(&mlxsw_sp->fib_nb);
+	fib_notifier_dump(&mlxsw_sp->fib_nb);
 	return 0;
 
 err_neigh_init:
-- 
2.7.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox