Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Ido Schimmel @ 2016-11-17 13:10 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <56d2179d-00ff-bcc5-e365-845179cbe672@stressinduktion.org>

Hi Hannes,

On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote:
> On 16.11.2016 19:51, Ido Schimmel wrote:
> > On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
> >> On 16.11.2016 16:18, Ido Schimmel wrote:
> >>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
> >>>> On 16.11.2016 15:09, Jiri Pirko wrote:
> >>>>> From: Ido Schimmel <idosch@mellanox.com>
> >>>>>
> >>>>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> >>>>> introduced a new notification chain to notify listeners (f.e., switchdev
> >>>>> drivers) about addition and deletion of routes.
> >>>>>
> >>>>> However, upon registration to the chain the FIB tables can already be
> >>>>> populated, which means potential listeners will have an incomplete view
> >>>>> of the tables.
> >>>>>
> >>>>> Solve that by adding an API to request a FIB dump. The dump itself it
> >>>>> done using RCU in order not to starve consumers that need RTNL to make
> >>>>> progress.
> >>>>>
> >>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> >>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> >>>>
> >>>> Have you looked at potential inconsistencies resulting of RCU walking
> >>>> the table and having concurrent inserts?
> >>>
> >>> Yes. I did try to think about situations in which this approach will
> >>> fail, but I could only find problems with concurrent removals, which I
> >>> addressed in 5/8. In case of concurrent insertions, even if you missed
> >>> the node, you would still get the ENTRY_ADD event to your listener.
> >>
> >> Theoretically a node could still be installed while the deletion event
> >> fired before registering the notifier. E.g. a synchronize_net before
> >> dumping could help here?
> > 
> > If the deletion event fired for some fib alias, then by 5/8 we are
> > guaranteed that it was already unlinked from the fib alias list in the
> > leaf in which it was contained. So, while it's possible we didn't
> > register our listener in time for the deletion event, we won't traverse
> > this fib alias while dumping the trie anyway. Did I understand you
> > correctly?
> > 
> 
> Theoretically we can have the same problem for insertion:
> 
> You receive a delete event from the notifier that is queued up first but
> the dump will still see the entry in the fib due to being managed by RCU
> (the notifier running on another CPU).
> 
> The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is
> still not strongly ordered against the local fib dump trie walk.

OK, so I believe my analysis in 5/8 was wrong. Despite CPU A invoking
fib_remove_alias() in t0 for fa1, it's possible for CPU B doing the fib
dump to see fa1 in t1, which will lead to fa1 being permanently present
in the listener's table.

Given the above, I think Dave's suggestion is the only applicable
solution. Do you agree? Any other suggestions?

Thanks

^ permalink raw reply

* Re: [PATCH net-next v3 2/3] net: fsl: Allow most drivers to be built with COMPILE_TEST
From: David Miller @ 2016-11-17 16:27 UTC (permalink / raw)
  To: arnd; +Cc: lkp, f.fainelli, kbuild-all, netdev, mw, gregory.clement,
	Shaohui.Xie
In-Reply-To: <2379973.pinmoxev18@wuerfel>

From: Arnd Bergmann <arnd@arndb.de>
Date: Thu, 17 Nov 2016 11:47:02 +0100

> On Thursday, November 17, 2016 9:38:23 AM CET Fengguang Wu wrote:
>> include/asm-generic/io.h conditionally defines iounmap() to be an
>> empty inline function, which may explain the warning on sh4.
>> 
>> General speaking, it's a false warning. The solution could be to teach
>> the robot to ignore such 'unused variable' warnings in non-x86 archs.
> 
> I think we should change the iounmap stub macro into an inline function.

Agreed.

^ permalink raw reply

* Re: [PATCH v2 net-next] net/mlx5e: remove napi_hash_del() calls
From: David Miller @ 2016-11-17 17:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: saeedm, netdev
In-Reply-To: <1479306094.8455.181.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 16 Nov 2016 06:21:34 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> Calling napi_hash_del() after netif_napi_del() is pointless.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Eric Dumazet @ 2016-11-17 16:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev
In-Reply-To: <20161117155753.17b76f5a@redhat.com>

On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > I can see that qdisc layer does not activate xmit_more in this case.
> > >   
> > 
> > Sure. Not enough pressure from the sender(s).
> > 
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> > 
> > (BTW not all NIC have expensive doorbells)
> 
> I believe this NIC mlx5 (50G edition) does.
> 
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
> 
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> 
>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>    Overhead  Command        Shared Object        Symbol
>  +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
>    - _raw_spin_lock
>       + 90.78% __dev_queue_xmit
>       + 7.83% dev_queue_xmit
>       + 1.30% ___slab_alloc
>  +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w

Does TX completion happens on same cpu ?

>  +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
>  +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup

Why fib_table_lookup() is used with connected UDP flow ???

>  +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
>  +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash

Same here, this is suspect.

>  +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free
> 
> In this setup the spinlock in __dev_queue_xmit should be uncongested.
> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> 
> But 8.92% of the time is spend on it, which corresponds to a cost of 215
> cycles (2414*0.0892).  This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Well, with no pressure, doorbell is triggered for each packet.

Since we can not predict that your application is going to send yet
another packet one usec later, we can not avoid this.

Note that with the patches I am working on (busypolling extentions),
we could decide to let the busypoller doing the doorbells, say one every
10 usec. (or after ~16 packets were queued)

But mlx4 uses two different NAPI for TX and RX, maybe mlx5 has the same
strategy .

^ permalink raw reply

* [PATCH] VSOCK: add loopback to virtio_transport
From: Stefan Hajnoczi @ 2016-11-17 15:49 UTC (permalink / raw)
  To: netdev
  Cc: cavery, Claudio Imbrenda, Jorgen Hansen, David S . Miller,
	Stefan Hajnoczi

The VMware VMCI transport supports loopback inside virtual machines.
This patch implements loopback for virtio-vsock.

Flow control is handled by the virtio-vsock protocol as usual.  The
sending process stops transmitting on a connection when the peer's
receive buffer space is exhausted.

Cathy Avery <cavery@redhat.com> noticed this difference between VMCI and
virtio-vsock when a test case using loopback failed.  Although loopback
isn't the main point of AF_VSOCK, it is useful for testing and
virtio-vsock must match VMCI semantics so that userspace programs run
regardless of the underlying transport.

My understanding is that loopback is not supported on the host side with
VMCI.  Follow that by implementing it only in the guest driver, not the
vhost host driver.

Cc: Jorgen Hansen <jhansen@vmware.com>
Reported-by: Cathy Avery <cavery@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 net/vmw_vsock/virtio_transport.c | 57 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 936d7ee..f2c4071 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -44,6 +44,10 @@ struct virtio_vsock {
 	spinlock_t send_pkt_list_lock;
 	struct list_head send_pkt_list;
 
+	struct work_struct loopback_work;
+	spinlock_t loopback_list_lock;
+	struct list_head loopback_list;
+
 	atomic_t queued_replies;
 
 	/* The following fields are protected by rx_lock.  vqs[VSOCK_VQ_RX]
@@ -74,6 +78,42 @@ static u32 virtio_transport_get_local_cid(void)
 	return vsock->guest_cid;
 }
 
+static void virtio_transport_loopback_work(struct work_struct *work)
+{
+	struct virtio_vsock *vsock =
+		container_of(work, struct virtio_vsock, loopback_work);
+	LIST_HEAD(pkts);
+
+	spin_lock_bh(&vsock->loopback_list_lock);
+	list_splice_init(&vsock->loopback_list, &pkts);
+	spin_unlock_bh(&vsock->loopback_list_lock);
+
+	mutex_lock(&vsock->rx_lock);
+	while (!list_empty(&pkts)) {
+		struct virtio_vsock_pkt *pkt;
+
+		pkt = list_first_entry(&pkts, struct virtio_vsock_pkt, list);
+		list_del_init(&pkt->list);
+
+		virtio_transport_recv_pkt(pkt);
+	}
+	mutex_unlock(&vsock->rx_lock);
+}
+
+static int virtio_transport_send_pkt_loopback(struct virtio_vsock *vsock,
+					      struct virtio_vsock_pkt *pkt)
+{
+	int len = pkt->len;
+
+	spin_lock_bh(&vsock->loopback_list_lock);
+	list_add_tail(&pkt->list, &vsock->loopback_list);
+	spin_unlock_bh(&vsock->loopback_list_lock);
+
+	queue_work(virtio_vsock_workqueue, &vsock->loopback_work);
+
+	return len;
+}
+
 static void
 virtio_transport_send_pkt_work(struct work_struct *work)
 {
@@ -159,6 +199,10 @@ virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
 		return -ENODEV;
 	}
 
+	if (le32_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid) {
+		return virtio_transport_send_pkt_loopback(vsock, pkt);
+	}
+
 	if (pkt->reply)
 		atomic_inc(&vsock->queued_replies);
 
@@ -510,10 +554,13 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
 	mutex_init(&vsock->event_lock);
 	spin_lock_init(&vsock->send_pkt_list_lock);
 	INIT_LIST_HEAD(&vsock->send_pkt_list);
+	spin_lock_init(&vsock->loopback_list_lock);
+	INIT_LIST_HEAD(&vsock->loopback_list);
 	INIT_WORK(&vsock->rx_work, virtio_transport_rx_work);
 	INIT_WORK(&vsock->tx_work, virtio_transport_tx_work);
 	INIT_WORK(&vsock->event_work, virtio_transport_event_work);
 	INIT_WORK(&vsock->send_pkt_work, virtio_transport_send_pkt_work);
+	INIT_WORK(&vsock->loopback_work, virtio_transport_loopback_work);
 
 	mutex_lock(&vsock->rx_lock);
 	virtio_vsock_rx_fill(vsock);
@@ -539,6 +586,7 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
 	struct virtio_vsock *vsock = vdev->priv;
 	struct virtio_vsock_pkt *pkt;
 
+	flush_work(&vsock->loopback_work);
 	flush_work(&vsock->rx_work);
 	flush_work(&vsock->tx_work);
 	flush_work(&vsock->event_work);
@@ -565,6 +613,15 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
 	}
 	spin_unlock_bh(&vsock->send_pkt_list_lock);
 
+	spin_lock_bh(&vsock->loopback_list_lock);
+	while (!list_empty(&vsock->loopback_list)) {
+		pkt = list_first_entry(&vsock->loopback_list,
+				       struct virtio_vsock_pkt, list);
+		list_del(&pkt->list);
+		virtio_transport_free_pkt(pkt);
+	}
+	spin_unlock_bh(&vsock->loopback_list_lock);
+
 	mutex_lock(&the_virtio_vsock_mutex);
 	the_virtio_vsock = NULL;
 	vsock_core_exit();
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net] ip6_tunnel: disable caching when the traffic class is inherited
From: David Miller @ 2016-11-17 17:09 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, liam.mcbirnie, hannes
In-Reply-To: <67662ec875375c86e95666a7e47fc61c3a869d86.1479309929.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 16 Nov 2016 16:26:46 +0100

> If an ip6 tunnel is configured to inherit the traffic class from
> the inner header, the dst_cache must be disabled or it will foul
> the policy routing.
> 
> The issue is apprently there since at leat Linux-2.6.12-rc2.
> 
> Reported-by: Liam McBirnie <liam.mcbirnie@boeing.com>
> Cc: Liam McBirnie <liam.mcbirnie@boeing.com>
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Applied and queued up for -stable, thank you.

^ permalink raw reply

* [PATCH net 3/3] net: Check netdevice presence on dev_get_phys_port_name
From: Tariq Toukan @ 2016-11-17 15:40 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Eugenia Emantayev,
	Tariq Toukan, Jiri Pirko
In-Reply-To: <1479397251-6932-1-git-send-email-tariqt@mellanox.com>

From: Eugenia Emantayev <eugenia@mellanox.com>

Check presence of network device before calling device driver
ndo_get_phys_port_name callback. Otherwise callback may access
non-existing data structures and cause kernel panic.

Fixes: db24a9044ee1 ('net: add support for phys_port_name')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reported-by: Steve Wise <swise@opengridcomputing.com>
Cc: Jiri Pirko <jiri@mellanox.com>
---
 net/core/dev.c       | 2 ++
 net/core/rtnetlink.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7de0d000a9f8..b21d77405863 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6623,6 +6623,8 @@ int dev_get_phys_port_name(struct net_device *dev,
 
 	if (!ops->ndo_get_phys_port_name)
 		return -EOPNOTSUPP;
+	if (!netif_device_present(dev))
+		return -ENODEV;
 	return ops->ndo_get_phys_port_name(dev, name, len);
 }
 EXPORT_SYMBOL(dev_get_phys_port_name);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index b3dd81f82e70..40cd41376566 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1054,7 +1054,7 @@ static int rtnl_phys_port_name_fill(struct sk_buff *skb, struct net_device *dev)
 
 	err = dev_get_phys_port_name(dev, name, sizeof(name));
 	if (err) {
-		if (err == -EOPNOTSUPP)
+		if (err == -EOPNOTSUPP || err == -ENODEV)
 			return 0;
 		return err;
 	}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net 0/3] mlx4 fix for shutdown flow
From: Tariq Toukan @ 2016-11-17 15:40 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Eugenia Emantayev,
	Tariq Toukan

This patchset fixes an invalid reference to mdev in mlx4 shutdown flow.

In patch 1, we make sure netif_device_detach() is called from shutdown flow only,
since we want to keep it present during a simple configuration change.

In patches 2 and 3, we add checks that were missing in:
* dev_get_phys_port_id
* dev_get_phys_port_name
We check the presence of the network device before calling the driver's
callbacks. This already exists for all other ndo's.

Series generated against net commit:
e5f6f564fd19 bnxt: add a missing rcu synchronization

Thanks,
Tariq.

Eugenia Emantayev (3):
  net/mlx4_en: Remove netif_device_detach from stop port flow
  net: Check netdevice presence on dev_get_phys_port_id
  net: Check netdevice presence on dev_get_phys_port_name

 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  8 ++++----
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  | 22 ++++++++--------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  2 +-
 net/core/dev.c                                  |  4 ++++
 net/core/rtnetlink.c                            |  4 ++--
 5 files changed, 19 insertions(+), 21 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH net 2/3] net: Check netdevice presence on dev_get_phys_port_id
From: Tariq Toukan @ 2016-11-17 15:40 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Eugenia Emantayev,
	Tariq Toukan, Jiri Pirko
In-Reply-To: <1479397251-6932-1-git-send-email-tariqt@mellanox.com>

From: Eugenia Emantayev <eugenia@mellanox.com>

Check presence of network device before calling device driver
ndo_get_phys_port_id callback. Otherwise callback may access
non-existing data structures and cause kernel panic.

Fixes: 66b52b0dc82c ("net: add ndo to get id of physical port of the device")
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reported-by: Steve Wise <swise@opengridcomputing.com>
Cc: Jiri Pirko <jiri@mellanox.com>
---
 net/core/dev.c       | 2 ++
 net/core/rtnetlink.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 6666b28b6815..7de0d000a9f8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6602,6 +6602,8 @@ int dev_get_phys_port_id(struct net_device *dev,
 
 	if (!ops->ndo_get_phys_port_id)
 		return -EOPNOTSUPP;
+	if (!netif_device_present(dev))
+		return -ENODEV;
 	return ops->ndo_get_phys_port_id(dev, ppid);
 }
 EXPORT_SYMBOL(dev_get_phys_port_id);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a6529c55ffb7..b3dd81f82e70 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1036,7 +1036,7 @@ static int rtnl_phys_port_id_fill(struct sk_buff *skb, struct net_device *dev)
 
 	err = dev_get_phys_port_id(dev, &ppid);
 	if (err) {
-		if (err == -EOPNOTSUPP)
+		if (err == -EOPNOTSUPP || err == -ENODEV)
 			return 0;
 		return err;
 	}
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net 1/3] net/mlx4_en: Remove netif_device_detach from stop port flow
From: Tariq Toukan @ 2016-11-17 15:40 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Eugenia Emantayev,
	Tariq Toukan
In-Reply-To: <1479397251-6932-1-git-send-email-tariqt@mellanox.com>

From: Eugenia Emantayev <eugenia@mellanox.com>

netif_device_detach() should be called from shutdown flow only,
in any other scenario netif_device_detach is not needed and may
result in -ENODEV error in certain cases.
In order to prevent TX timeout issue during heavy CPU load
netif_carrier_off will be called.

Fixes: 3484aac16149 ("net/mlx4_en: Fix transmit timeout when driver restarts port")
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  8 ++++----
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  | 22 ++++++++--------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  2 +-
 3 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index bdda17d2ea0f..3bb30f66aa07 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -932,7 +932,7 @@ static __be32 speed_set_ptys_admin(struct mlx4_en_priv *priv, u32 speed,
 	mutex_lock(&priv->mdev->state_lock);
 	if (priv->port_up) {
 		en_warn(priv, "Port link mode changed, restarting port...\n");
-		mlx4_en_stop_port(dev, 1);
+		mlx4_en_stop_port(dev);
 		if (mlx4_en_start_port(dev))
 			en_err(priv, "Failed restarting port %d\n", priv->port);
 	}
@@ -1077,7 +1077,7 @@ static int mlx4_en_set_ringparam(struct net_device *dev,
 
 	if (priv->port_up) {
 		port_up = 1;
-		mlx4_en_stop_port(dev, 1);
+		mlx4_en_stop_port(dev);
 	}
 
 	mlx4_en_safe_replace_resources(priv, tmp);
@@ -1205,7 +1205,7 @@ static int mlx4_en_set_rxfh(struct net_device *dev, const u32 *ring_index,
 	mutex_lock(&mdev->state_lock);
 	if (priv->port_up) {
 		port_up = 1;
-		mlx4_en_stop_port(dev, 1);
+		mlx4_en_stop_port(dev);
 	}
 
 	if (ring_index)
@@ -1751,7 +1751,7 @@ static int mlx4_en_set_channels(struct net_device *dev,
 
 	if (priv->port_up) {
 		port_up = 1;
-		mlx4_en_stop_port(dev, 1);
+		mlx4_en_stop_port(dev);
 	}
 
 	mlx4_en_safe_replace_resources(priv, tmp);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 3a47e83d3e07..01a680b66177 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1741,8 +1741,7 @@ int mlx4_en_start_port(struct net_device *dev)
 		napi_schedule(&priv->rx_cq[i]->napi);
 
 	netif_tx_start_all_queues(dev);
-	netif_device_attach(dev);
-
+	netif_carrier_on(dev);
 	return 0;
 
 tx_err:
@@ -1767,7 +1766,7 @@ int mlx4_en_start_port(struct net_device *dev)
 }
 
 
-void mlx4_en_stop_port(struct net_device *dev, int detach)
+void mlx4_en_stop_port(struct net_device *dev)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_en_dev *mdev = priv->mdev;
@@ -1785,12 +1784,7 @@ void mlx4_en_stop_port(struct net_device *dev, int detach)
 	mlx4_CLOSE_PORT(mdev->dev, priv->port);
 
 	/* Synchronize with tx routine */
-	netif_tx_lock_bh(dev);
-	if (detach)
-		netif_device_detach(dev);
-	netif_tx_stop_all_queues(dev);
-	netif_tx_unlock_bh(dev);
-
+	netif_carrier_off(dev);
 	netif_tx_disable(dev);
 
 	/* Set port as not active */
@@ -1903,7 +1897,7 @@ static void mlx4_en_restart(struct work_struct *work)
 	rtnl_lock();
 	mutex_lock(&mdev->state_lock);
 	if (priv->port_up) {
-		mlx4_en_stop_port(dev, 1);
+		mlx4_en_stop_port(dev);
 		if (mlx4_en_start_port(dev))
 			en_err(priv, "Failed restarting port %d\n", priv->port);
 	}
@@ -1987,7 +1981,7 @@ static int mlx4_en_close(struct net_device *dev)
 
 	mutex_lock(&mdev->state_lock);
 
-	mlx4_en_stop_port(dev, 0);
+	mlx4_en_stop_port(dev);
 	netif_carrier_off(dev);
 
 	mutex_unlock(&mdev->state_lock);
@@ -2231,7 +2225,7 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 			 * the port */
 			en_dbg(DRV, priv, "Change MTU called with card down!?\n");
 		} else {
-			mlx4_en_stop_port(dev, 1);
+			mlx4_en_stop_port(dev);
 			err = mlx4_en_start_port(dev);
 			if (err) {
 				en_err(priv, "Failed restarting port:%d\n",
@@ -2687,7 +2681,7 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 	mutex_lock(&mdev->state_lock);
 	if (priv->port_up) {
 		port_up = 1;
-		mlx4_en_stop_port(dev, 1);
+		mlx4_en_stop_port(dev);
 	}
 
 	priv->xdp_ring_num = xdp_ring_num;
@@ -3406,7 +3400,7 @@ int mlx4_en_reset_config(struct net_device *dev,
 
 	if (priv->port_up) {
 		port_up = 1;
-		mlx4_en_stop_port(dev, 1);
+		mlx4_en_stop_port(dev);
 	}
 
 	en_warn(priv, "Changing device configuration rx filter(%x) rx vlan(%x)\n",
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index a3528dd1e72e..fb17acdfe528 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -659,7 +659,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 			struct mlx4_en_port_profile *prof);
 
 int mlx4_en_start_port(struct net_device *dev);
-void mlx4_en_stop_port(struct net_device *dev, int detach);
+void mlx4_en_stop_port(struct net_device *dev);
 
 void mlx4_en_set_stats_bitmap(struct mlx4_dev *dev,
 			      struct mlx4_en_stats_bitmap *stats_bitmap,
-- 
1.8.3.1

^ permalink raw reply related

* Re: [patch] amd-xgbe: Signedness bug in xgbe_phy_link_status()
From: Tom Lendacky @ 2016-11-17 14:40 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: netdev, kernel-janitors
In-Reply-To: <20161117105932.GA32143@mwanda>

On 11/17/2016 4:59 AM, Dan Carpenter wrote:
> "ret" needs to be signed for the error handling to work.
> 
> Fixes: abf0a1c2b26a ("amd-xgbe: Add support for SFP+ modules")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> 

Hi Dan,

This was already identified and patched:
8c5385cbb036 ("amd-xgbe: Fix up some coccinelle identified warnings")

Thanks,
Tom

> diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
> index 4ba4332..a2559c2 100644
> --- a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
> +++ b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
> @@ -2346,7 +2346,8 @@ static bool xgbe_phy_valid_speed(struct xgbe_prv_data *pdata, int speed)
>  static int xgbe_phy_link_status(struct xgbe_prv_data *pdata, int *an_restart)
>  {
>  	struct xgbe_phy_data *phy_data = pdata->phy_data;
> -	unsigned int ret, reg;
> +	unsigned int reg;
> +	int ret;
>  
>  	*an_restart = 0;
>  
> 

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Eric Dumazet @ 2016-11-17 13:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev
In-Reply-To: <20161117091638.5fab8494@redhat.com>

On Thu, 2016-11-17 at 09:16 +0100, Jesper Dangaard Brouer wrote:

> 
> I noticed there is a Send-Q, and the perf-top2 is _raw_spin_lock, which
> looks like it comes from __dev_queue_xmit(), but we know from
> experience that this stall is actually caused by writing the
> tailptr/doorbell in the HW.  Thus, this could benefit a lot from
> bulk/xmit_more into the qdisc layer.

The Send-Q is there because of TX-completions being delayed a bit,
because of IRQ mitigation.

(ethtool -c eth0)

It happens even if you do not have a qdisc in the first place.

And we do have xmit_more in the qdisc layer already.

^ permalink raw reply

* Re: [PATCH net v2] net: nsid cannot be allocated for a dead netns
From: Cong Wang @ 2016-11-17 16:50 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: David Miller, Andrey Wagin, Linux Kernel Network Developers
In-Reply-To: <0fb84877-d43d-736a-c5a8-1ca954f90925@6wind.com>

On Thu, Nov 17, 2016 at 12:41 AM, Nicolas Dichtel
<nicolas.dichtel@6wind.com> wrote:
> Le 17/11/2016 à 07:32, Cong Wang a écrit :
> [snip]
>> since Nicolas' patch doesn't even compile...
> It's always surprising how agressive you are, really :(
> Can you show me the compilation error of this patch (we are talking about the v2
> patch here)?

Sorry about that, I didn't event look at your v2, because your patch
(no matter v2 or v1) is already wrong to me, the idr_for_each() before
alloc_netid() is clearly a use-after-destroy.

Based on the same reason, my patch is not your v3, we are patching
different places.

^ permalink raw reply

* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Hannes Frederic Sowa @ 2016-11-17 14:36 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20161117131049.pevkla5os75h4zlc@splinter.mtl.com>

On 17.11.2016 14:10, Ido Schimmel wrote:
> Hi Hannes,
> 
> On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote:
>> On 16.11.2016 19:51, Ido Schimmel wrote:
>>> On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
>>>> On 16.11.2016 16:18, Ido Schimmel wrote:
>>>>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
>>>>>> On 16.11.2016 15:09, Jiri Pirko wrote:
>>>>>>> From: Ido Schimmel <idosch@mellanox.com>
>>>>>>>
>>>>>>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
>>>>>>> introduced a new notification chain to notify listeners (f.e., switchdev
>>>>>>> drivers) about addition and deletion of routes.
>>>>>>>
>>>>>>> However, upon registration to the chain the FIB tables can already be
>>>>>>> populated, which means potential listeners will have an incomplete view
>>>>>>> of the tables.
>>>>>>>
>>>>>>> Solve that by adding an API to request a FIB dump. The dump itself it
>>>>>>> done using RCU in order not to starve consumers that need RTNL to make
>>>>>>> progress.
>>>>>>>
>>>>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>>>
>>>>>> Have you looked at potential inconsistencies resulting of RCU walking
>>>>>> the table and having concurrent inserts?
>>>>>
>>>>> Yes. I did try to think about situations in which this approach will
>>>>> fail, but I could only find problems with concurrent removals, which I
>>>>> addressed in 5/8. In case of concurrent insertions, even if you missed
>>>>> the node, you would still get the ENTRY_ADD event to your listener.
>>>>
>>>> Theoretically a node could still be installed while the deletion event
>>>> fired before registering the notifier. E.g. a synchronize_net before
>>>> dumping could help here?
>>>
>>> If the deletion event fired for some fib alias, then by 5/8 we are
>>> guaranteed that it was already unlinked from the fib alias list in the
>>> leaf in which it was contained. So, while it's possible we didn't
>>> register our listener in time for the deletion event, we won't traverse
>>> this fib alias while dumping the trie anyway. Did I understand you
>>> correctly?
>>>
>>
>> Theoretically we can have the same problem for insertion:
>>
>> You receive a delete event from the notifier that is queued up first but
>> the dump will still see the entry in the fib due to being managed by RCU
>> (the notifier running on another CPU).
>>
>> The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is
>> still not strongly ordered against the local fib dump trie walk.
> 
> OK, so I believe my analysis in 5/8 was wrong. Despite CPU A invoking
> fib_remove_alias() in t0 for fa1, it's possible for CPU B doing the fib
> dump to see fa1 in t1, which will lead to fa1 being permanently present
> in the listener's table.

Yep. :(

Also the delayed work queue is only partial ordered (only on one CPU),
so you don't know about when an event is processed from the dump and the
notifier. I think you need to create your own workqueue for doing so.

> Given the above, I think Dave's suggestion is the only applicable
> solution. Do you agree? Any other suggestions?

As I wrote before the problem with seqcounter is, that with a quagga
running on top and pushing updates you end up never getting a stable
view, so maybe some logic to abort in case it cannot be loaded after a
few tries would be fine.

The other way is the journal idea I had, which uses an rb-tree with
timestamps as keys (can be lamport timestamps). You insert into the tree
until the dump is finished and use it as queue later to shuffle stuff
into the hardware.

Because you should be able to hold refs to the fa_info or other structs
directly, I don't expect gigantic memory overhead, just for a cell to
store timetamp and pointer (rb_node).

Bye,
Hannes

^ permalink raw reply

* [PATCH net-next] amd-xgbe: Update connection validation for backplane mode
From: Tom Lendacky @ 2016-11-17 14:43 UTC (permalink / raw)
  To: netdev; +Cc: David Miller

Update the connection type enumeration for backplane mode and return
an error when there is a mismatch between the mode and the connection
type.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
index 348cc8c..9d8c9530 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
@@ -164,6 +164,7 @@ enum xgbe_conn_type {
 	XGBE_CONN_TYPE_NONE = 0,
 	XGBE_CONN_TYPE_SFP,
 	XGBE_CONN_TYPE_MDIO,
+	XGBE_CONN_TYPE_RSVD1,
 	XGBE_CONN_TYPE_BACKPLANE,
 	XGBE_CONN_TYPE_MAX,
 };
@@ -2831,6 +2832,7 @@ static int xgbe_phy_init(struct xgbe_prv_data *pdata)
 	if (xgbe_phy_conn_type_mismatch(pdata)) {
 		dev_err(pdata->dev, "phy mode/connection mismatch (%#x/%#x)\n",
 			phy_data->port_mode, phy_data->conn_type);
+		return -EINVAL;
 	}
 
 	/* Validate the mode requested */

^ permalink raw reply related

* Re: Netperf UDP issue with connected sockets
From: Jesper Dangaard Brouer @ 2016-11-17 13:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, netdev, brouer
In-Reply-To: <1479388850.8455.240.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, 17 Nov 2016 05:20:50 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 09:16 +0100, Jesper Dangaard Brouer wrote:
> 
> > 
> > I noticed there is a Send-Q, and the perf-top2 is _raw_spin_lock, which
> > looks like it comes from __dev_queue_xmit(), but we know from
> > experience that this stall is actually caused by writing the
> > tailptr/doorbell in the HW.  Thus, this could benefit a lot from
> > bulk/xmit_more into the qdisc layer.  
> 
> The Send-Q is there because of TX-completions being delayed a bit,
> because of IRQ mitigation.
> 
> (ethtool -c eth0)
> 
> It happens even if you do not have a qdisc in the first place.
> 
> And we do have xmit_more in the qdisc layer already.

I can see that qdisc layer does not activate xmit_more in this case.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


$ ethtool -c mlx5p4
Coalesce parameters for mlx5p4:
Adaptive RX: on  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 3
rx-frames: 32
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 16
tx-frames: 32
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

^ permalink raw reply

* Re: [PATCH net-next v3 4/7] vxlan: improve vxlan route lookup checks.
From: Jiri Benc @ 2016-11-17 15:59 UTC (permalink / raw)
  To: David Laight; +Cc: Pravin B Shelar, netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0222355@AcuExch.aculab.com>

On Thu, 17 Nov 2016 10:17:01 +0000, David Laight wrote:
> Worse than arbitrary, it adds 4 bytes of pad on 64bit systems.

It does not, this is not a struct.

 Jiri

^ permalink raw reply

* Re: [PATCH net v2 7/7] net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
From: Johan Hovold @ 2016-11-17 17:19 UTC (permalink / raw)
  To: David Miller
  Cc: johan, mugunthanvnm, grygorii.strashko, linux-omap, netdev,
	linux-kernel
In-Reply-To: <20161117.120416.1443476818715961812.davem@davemloft.net>

On Thu, Nov 17, 2016 at 12:04:16PM -0500, David Miller wrote:
> From: Johan Hovold <johan@kernel.org>
> Date: Thu, 17 Nov 2016 17:40:04 +0100
> 
> > Make sure to propagate errors from of_phy_register_fixed_link() which
> > can fail with -EPROBE_DEFER.
> > 
> > Fixes: 1f71e8c96fc6 ("drivers: net: cpsw: Add support for fixed-link
> > PHY")
> > Signed-off-by: Johan Hovold <johan@kernel.org>
> 
> Johan, when you update a patch within a series you must post the
> entire series freshly to the lists, cover posting and all.

I'm quite sure that is exactly what I did. Did you only get this last
patch out of the seven?

Johan

^ permalink raw reply

* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Hannes Frederic Sowa @ 2016-11-17 17:20 UTC (permalink / raw)
  To: David Miller
  Cc: idosch, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20161117.114529.1851585222876647915.davem@davemloft.net>

Hi,

On 17.11.2016 17:45, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Thu, 17 Nov 2016 15:36:48 +0100
> 
>> The other way is the journal idea I had, which uses an rb-tree with
>> timestamps as keys (can be lamport timestamps). You insert into the tree
>> until the dump is finished and use it as queue later to shuffle stuff
>> into the hardware.
> 
> If you have this "place" where pending inserts are stored, you have
> a policy decision to make.
> 
> First of all what do other lookups see when there are pending entires?

I think this is a problem with the current approach already, as the
delayed work queue already postpones the insert for an undecidable
amount of time (and reorders depending on which CPU the entry was
inserted and the fib notifier was called).

For user space queries we would still query the in-kernel table.

> If, once inserted into the pending queue, you return success to the
> inserting entity, then you must make those pending entries visible
> to lookups.

I think this same problem is in this patch set already. The way I
understood it, is, that if a problem during insert emerges, the driver
calls abort and every packet will be send to user space, as the routing
table cannot be offloaded and it won't try it again, Ido?

Probably this is an artifact of the mellanox implementation and we can
implement this differently for other cards, but the schema to abort all
if the modification doesn't work, seems to be fundamental (I think we
require the all-or-nothing approach for now).

> If you block the inserting entity, well that doesn't make a lot of
> sense.  If blocking is a workable solution, then we can just block the
> entire insert while this FIB dump to the device driver is happening.

I don't think we should really block (as in kernel-block) at any time.

I was suggesting something like:

rtnl_lock();
synchronize_rcu_expedited(); // barrier, all rounting modifications are
stable and no new can be added due to rtnl_lock
register notifier(); // notifier adds entries also into journal();
rtnl_unlock();
walk_fib_tree_rcu_into_journal();
// walk finished
start_syncing_journal_to_hw(); // if new entries show up we sync them
asap after this point

The journal would need a spin lock to protect its internal state and
order events correctly.

> But I am pretty sure the idea of blocking modifications for so long
> was considered undesirable.

Yes, this is also still my opinion.

Bye,
Hannes

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Eric Dumazet @ 2016-11-17 14:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev
In-Reply-To: <20161117144248.23500001@redhat.com>

On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:

> I can see that qdisc layer does not activate xmit_more in this case.
> 

Sure. Not enough pressure from the sender(s).

The bottleneck is not the NIC or qdisc in your case, meaning that BQL
limit is kept at a small value.

(BTW not all NIC have expensive doorbells)

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Jesper Dangaard Brouer @ 2016-11-17 14:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, netdev, brouer
In-Reply-To: <1479392258.8455.249.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, 17 Nov 2016 06:17:38 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> 
> > I can see that qdisc layer does not activate xmit_more in this case.
> >   
> 
> Sure. Not enough pressure from the sender(s).
> 
> The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> limit is kept at a small value.
> 
> (BTW not all NIC have expensive doorbells)

I believe this NIC mlx5 (50G edition) does.

I'm seeing UDP TX of 1656017.55 pps, which is per packet:
2414 cycles(tsc) 603.86 ns

Perf top shows (with my own udp_flood, that avoids __ip_select_ident):

 Samples: 56K of event 'cycles', Event count (approx.): 51613832267
   Overhead  Command        Shared Object        Symbol
 +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
   - _raw_spin_lock
      + 90.78% __dev_queue_xmit
      + 7.83% dev_queue_xmit
      + 1.30% ___slab_alloc
 +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w
 +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
 +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup
 +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
 +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash
 +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free

In this setup the spinlock in __dev_queue_xmit should be uncongested.
An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.

But 8.92% of the time is spend on it, which corresponds to a cost of 215
cycles (2414*0.0892).  This cost is too high, thus something else is
going on... I claim this mysterious extra cost is the tailptr/doorbell.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] net: sky2: Fix shutdown crash
From: Sudeep Holla @ 2016-11-17 16:14 UTC (permalink / raw)
  To: Jeremy Linton, netdev; +Cc: Sudeep Holla, mlindner, stephen
In-Reply-To: <1479395665-27784-1-git-send-email-jeremy.linton@arm.com>



On 17/11/16 15:14, Jeremy Linton wrote:
> The sky2 frequently crashes during machine shutdown with:
>
> sky2_get_stats+0x60/0x3d8 [sky2]
> dev_get_stats+0x68/0xd8
> rtnl_fill_stats+0x54/0x140
> rtnl_fill_ifinfo+0x46c/0xc68
> rtmsg_ifinfo_build_skb+0x7c/0xf0
> rtmsg_ifinfo.part.22+0x3c/0x70
> rtmsg_ifinfo+0x50/0x5c
> netdev_state_change+0x4c/0x58
> linkwatch_do_dev+0x50/0x88
> __linkwatch_run_queue+0x104/0x1a4
> linkwatch_event+0x30/0x3c
> process_one_work+0x140/0x3e0
> worker_thread+0x60/0x44c
> kthread+0xdc/0xf0
> ret_from_fork+0x10/0x50
>
> This is caused by the sky2 being called after it has been shutdown.
> A previous thread about this can be found here:
>
> https://lkml.org/lkml/2016/4/12/410
>
> An alternative fix is to assure that IFF_UP gets cleared by
> calling dev_close() during shutdown. This is similar to what the
> bnx2/tg3/xgene and maybe others are doing to assure that the driver
> isn't being called following _shutdown().
>
> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>

Since this issue has been very random and on/off recently, it's quite
hard to test this and confirm. However I did around 20 reboot/shutdown
and could not reproduce the issue after applying this patch. So,

Tested-by: Sudeep Holla <sudeep.holla@arm.com>

-- 
Regards,
Sudeep

^ permalink raw reply

* RE: Netperf UDP issue with connected sockets
From: David Laight @ 2016-11-17 17:34 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer', Eric Dumazet
  Cc: Rick Jones, netdev@vger.kernel.org
In-Reply-To: <20161117155753.17b76f5a@redhat.com>

From: Jesper Dangaard Brouer
> Sent: 17 November 2016 14:58
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> >
> > > I can see that qdisc layer does not activate xmit_more in this case.
> > >
> >
> > Sure. Not enough pressure from the sender(s).
> >
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> >
> > (BTW not all NIC have expensive doorbells)
> 
> I believe this NIC mlx5 (50G edition) does.
> 
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
> 
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> 
>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>    Overhead  Command        Shared Object        Symbol
>  +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
>    - _raw_spin_lock
>       + 90.78% __dev_queue_xmit
>       + 7.83% dev_queue_xmit
>       + 1.30% ___slab_alloc
>  +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w
>  +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
>  +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup
>  +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
>  +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash
>  +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free
> 
> In this setup the spinlock in __dev_queue_xmit should be uncongested.
> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> 
> But 8.92% of the time is spend on it, which corresponds to a cost of 215
> cycles (2414*0.0892).  This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Try adding code to ring the doorbell twice.
If this doesn't slow things down then it isn't (likely to be) responsible
for the delay you are seeing.

	David

^ permalink raw reply

* [PATCH net-next v3 0/5] Adding PHY-Tunables and downshift support
From: Allan W. Nielsen @ 2016-11-17 12:07 UTC (permalink / raw)
  To: netdev; +Cc: andrew, f.fainelli, raju.lakkaraju, allan.nielsen

Hi All,

(This is a re-post of the v3 patch set with a new cover letter - I was not
aware that the cover letters was used a commit comments in merge commits).

This series add support for PHY tunables, and uses this facility to
configure downshifting. The downshifting mechanism is implemented for MSCC
phys.

This series tries to address the comments provided back in mid October when
this feature was posted along with fast-link-failure. Fast-link-failure has
been separated out, but we would like to pick continue on that if/when we
agree on how the phy-tunables and downshifting should be done.

The proposed generic interface is similar to
ETHTOOL_GTUNABLE/ETHTOOL_STUNABLE, it uses the same type
(ethtool_tunable/tunable_type_id) but a new enum (phy_tunable_id) is added
to reflect the PHY tunable.

The implementation just call the newly added function pointers in
get_tunable/set_tunable phy_device structure.

To configure downshifting, the ethtool_tunable structure is used. 'id' must
be set to 'ETHTOOL_PHY_DOWNSHIFT', 'type_id' must be set to
'ETHTOOL_TUNABLE_U8' and 'data' value configure the amount of downshift
re-tries.

If configured to DOWNSHIFT_DEV_DISABLE, then downshift is disabled If
configured to DOWNSHIFT_DEV_DEFAULT_COUNT, then it is up to the device to
choose a device-specific re-try count.

Tested on Beaglebone Black with VSC 8531 PHY.

Change set:
v0:

- Link Speed downshift and Fast Link failure-2 features coded by using
  Device tree.
v1: 
- Split the Downshift and FLF2 features in different set of patches.
- Removed DT access and implemented IOCTL access suggested by Andrew.
- Added function pointers in get_tunable/set_tunable phy_device structure
v2:
- Added trace message with a hist is printed when downshifting clould not
  be eanbled with the requested count
- (ethtool) Syntax is changed from "--set-phy-tunable downshift on|off|%d"
  to "--set-phy-tunable [downshift on|off [count N]]" - as requested by
  Andrew.
v3:
- Fixed Spelling in "net: phy: Add downshift get/set support in Microsemi
  PHYs driver"

Raju Lakkaraju (5):
  ethtool: (uapi) Add ETHTOOL_PHY_GTUNABLE and ETHTOOL_PHY_STUNABLE
  ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE
  ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables
  ethtool: Core impl for ETHTOOL_PHY_DOWNSHIFT tunable
  net: phy: Add downshift get/set support in Microsemi PHYs driver

 drivers/net/phy/mscc.c       | 100 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/phy.h          |   7 +++
 include/uapi/linux/ethtool.h |  18 +++++++-
 net/core/ethtool.c           |  93 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 217 insertions(+), 1 deletion(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH net-next v3 3/5] ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables
From: Allan W. Nielsen @ 2016-11-17 12:07 UTC (permalink / raw)
  To: netdev; +Cc: andrew, f.fainelli, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479384444-31122-1-git-send-email-allan.nielsen@microsemi.com>

From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>

For operation in cabling environments that are incompatible with
1000BASE-T, PHY device may provide an automatic link speed downshift
operation. When enabled, the device automatically changes its 1000BASE-T
auto-negotiation to the next slower speed after a configured number of
failed attempts at 1000BASE-T.  This feature is useful in setting up in
networks using older cable installations that include only pairs A and B,
and not pairs C and D.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 include/uapi/linux/ethtool.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 42f696f..f0db778 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -248,9 +248,12 @@ struct ethtool_tunable {
 	void	*data[0];
 };

+#define DOWNSHIFT_DEV_DEFAULT_COUNT	0xff
+#define DOWNSHIFT_DEV_DISABLE		0
+
 enum phy_tunable_id {
 	ETHTOOL_PHY_ID_UNSPEC,
-
+	ETHTOOL_PHY_DOWNSHIFT,
 	/*
 	 * Add your fresh new phy tunable attribute above and remember to update
 	 * phy_tunable_strings[] in net/core/ethtool.c
-- 
2.7.3

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox