Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: gianfar: add ethtool eee support
From: David Miller @ 2016-12-08 23:19 UTC (permalink / raw)
  To: Shaohui.Xie; +Cc: netdev
In-Reply-To: <1481196426-45576-1-git-send-email-Shaohui.Xie@nxp.com>

From: Shaohui Xie <Shaohui.Xie@nxp.com>
Date: Thu, 8 Dec 2016 19:27:06 +0800

> Gianfar does not support EEE, but it can connect to a PHY which supports
> EEE and the PHY advertises EEE by default, and its link partner also
> advertises EEE, so the PHY enters low power mode when traffic rate is low,
> which causes packet loss if an application's traffic rate is low. This
> patch provides .get_eee and .set_eee so that to disable the EEE
> advertisement via ethtool if needed, other EEE features are not supported.
> 
> Signed-off-by: Shaohui Xie <Shaohui.Xie@nxp.com>

This is not the way to fix this.

If the Gianfar MAC does not support EEE properly, then the gianfar
driver should not create a situation where the PHY advertises EEE
in the first place.

^ permalink raw reply

* Re: [PATCH] cxgb4/cxgb4vf: Assign netdev->dev_port with port ID
From: David Miller @ 2016-12-08 23:21 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, arjun, leedom, hariprasad
In-Reply-To: <1481200763-15799-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Thu,  8 Dec 2016 18:09:23 +0530

> From: Arjun V <arjun@chelsio.com>
> 
> Added missing dev_port assignment in cxgb4vf driver.
> Also made dev_port assignment of cxgb4 in sync with cxgb4vf driver.
> 
> Signed-off-by: Casey Leedom <leedom@chelsio.com>
> Signed-off-by: Arjun V <arjun@chelsio.com>
> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: socket: preferred __aligned(size) for control buffer
From: David Miller @ 2016-12-08 23:21 UTC (permalink / raw)
  To: kushwaha.a; +Cc: netdev, akkushwaha9896
In-Reply-To: <1481201513-20783-1-git-send-email-kushwaha.a@samsung.com>

From: kushwaha.a@samsung.com
Date: Thu, 08 Dec 2016 18:21:53 +0530

> From: Amit Kushwaha <kushwaha.a@samsung.com>
> 
> This patch cleanup checkpatch.pl warning
> WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
> 
> Signed-off-by: Amit Kushwaha <kushwaha.a@samsung.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH] net: gianfar: add ethtool eee support
From: Florian Fainelli @ 2016-12-08 23:22 UTC (permalink / raw)
  To: David Miller, Shaohui.Xie; +Cc: netdev
In-Reply-To: <20161208.181922.957507188752351548.davem@davemloft.net>

On 12/08/2016 03:19 PM, David Miller wrote:
> From: Shaohui Xie <Shaohui.Xie@nxp.com>
> Date: Thu, 8 Dec 2016 19:27:06 +0800
> 
>> Gianfar does not support EEE, but it can connect to a PHY which supports
>> EEE and the PHY advertises EEE by default, and its link partner also
>> advertises EEE, so the PHY enters low power mode when traffic rate is low,
>> which causes packet loss if an application's traffic rate is low. This
>> patch provides .get_eee and .set_eee so that to disable the EEE
>> advertisement via ethtool if needed, other EEE features are not supported.
>>
>> Signed-off-by: Shaohui Xie <Shaohui.Xie@nxp.com>
> 
> This is not the way to fix this.
> 
> If the Gianfar MAC does not support EEE properly, then the gianfar
> driver should not create a situation where the PHY advertises EEE
> in the first place.

Agreed, you should have gianfar mask out the EEE advertisement from the
PHY it connects to, in order to make sure that EEE is properly disabled.
This should be no different from e.g: connecting a 10/100/1000 PHY to a
10/100 only controller for instance.
-- 
Florian

^ permalink raw reply

* Re: pull-request: can 2016-12-08
From: David Miller @ 2016-12-08 23:23 UTC (permalink / raw)
  To: mkl; +Cc: netdev, linux-can, kernel
In-Reply-To: <20161208153552.18122-1-mkl@pengutronix.de>

From: Marc Kleine-Budde <mkl@pengutronix.de>
Date: Thu,  8 Dec 2016 16:35:51 +0100

> this is a pull request for one patch.
> 
> Jiho Chu found and fixed a use-after-free error in the cleanup path
> in the peak pcan USB CAN driver.

Pulled, thanks Marc.

^ permalink raw reply

* Remove private tx queue locks
From: Lino Sanfilippo @ 2016-12-08 23:55 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue
  Cc: romieu, pavel, davem, linux-kernel, netdev

Hi,

this patch series removes unnecessary private locks in the sxgbe and the
stmmac driver.

v2:
- adjust commit message

^ permalink raw reply

* [PATCH v2 1/2] net: ethernet: sxgbe: remove private tx queue lock
From: Lino Sanfilippo @ 2016-12-08 23:55 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue
  Cc: romieu, pavel, davem, linux-kernel, netdev, Lino Sanfilippo
In-Reply-To: <1481241343-18062-1-git-send-email-LinoSanfilippo@gmx.de>

The driver uses a private lock for synchronization of the xmit function and
the xmit completion handler, but since the NETIF_F_LLTX flag is not set,
the xmit function is also called with the xmit_lock held.

On the other hand the completion handler uses the reverse locking order by
first taking the private lock and (in case that the tx queue had been
stopped) then the xmit_lock.

Improve the locking by removing the private lock and using only the
xmit_lock for synchronization instead.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
---
 drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h |  1 -
 drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c   | 27 +++++------------------
 2 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h b/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h
index 5cb51b6..c61f260 100644
--- a/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h
+++ b/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h
@@ -384,7 +384,6 @@ struct sxgbe_tx_queue {
 	dma_addr_t *tx_skbuff_dma;
 	struct sk_buff **tx_skbuff;
 	struct timer_list txtimer;
-	spinlock_t tx_lock;	/* lock for tx queues */
 	unsigned int cur_tx;
 	unsigned int dirty_tx;
 	u32 tx_count_frames;
diff --git a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
index ea44a24..22d3b0b 100644
--- a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
+++ b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
@@ -426,9 +426,6 @@ static int init_tx_ring(struct device *dev, u8 queue_no,
 	tx_ring->dirty_tx = 0;
 	tx_ring->cur_tx = 0;
 
-	/* initialise TX queue lock */
-	spin_lock_init(&tx_ring->tx_lock);
-
 	return 0;
 
 dmamem_err:
@@ -743,7 +740,7 @@ static void sxgbe_tx_queue_clean(struct sxgbe_tx_queue *tqueue)
 
 	dev_txq = netdev_get_tx_queue(priv->dev, queue_no);
 
-	spin_lock(&tqueue->tx_lock);
+	__netif_tx_lock(dev_txq, smp_processor_id());
 
 	priv->xstats.tx_clean++;
 	while (tqueue->dirty_tx != tqueue->cur_tx) {
@@ -781,18 +778,13 @@ static void sxgbe_tx_queue_clean(struct sxgbe_tx_queue *tqueue)
 
 	/* wake up queue */
 	if (unlikely(netif_tx_queue_stopped(dev_txq) &&
-		     sxgbe_tx_avail(tqueue, tx_rsize) > SXGBE_TX_THRESH(priv))) {
-		netif_tx_lock(priv->dev);
-		if (netif_tx_queue_stopped(dev_txq) &&
-		    sxgbe_tx_avail(tqueue, tx_rsize) > SXGBE_TX_THRESH(priv)) {
-			if (netif_msg_tx_done(priv))
-				pr_debug("%s: restart transmit\n", __func__);
-			netif_tx_wake_queue(dev_txq);
-		}
-		netif_tx_unlock(priv->dev);
+	    sxgbe_tx_avail(tqueue, tx_rsize) > SXGBE_TX_THRESH(priv))) {
+		if (netif_msg_tx_done(priv))
+			pr_debug("%s: restart transmit\n", __func__);
+		netif_tx_wake_queue(dev_txq);
 	}
 
-	spin_unlock(&tqueue->tx_lock);
+	__netif_tx_unlock(dev_txq);
 }
 
 /**
@@ -1304,9 +1296,6 @@ static netdev_tx_t sxgbe_xmit(struct sk_buff *skb, struct net_device *dev)
 		      tqueue->hwts_tx_en)))
 		ctxt_desc_req = 1;
 
-	/* get the spinlock */
-	spin_lock(&tqueue->tx_lock);
-
 	if (priv->tx_path_in_lpi_mode)
 		sxgbe_disable_eee_mode(priv);
 
@@ -1316,8 +1305,6 @@ static netdev_tx_t sxgbe_xmit(struct sk_buff *skb, struct net_device *dev)
 			netdev_err(dev, "%s: Tx Ring is full when %d queue is awake\n",
 				   __func__, txq_index);
 		}
-		/* release the spin lock in case of BUSY */
-		spin_unlock(&tqueue->tx_lock);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -1436,8 +1423,6 @@ static netdev_tx_t sxgbe_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	priv->hw->dma->enable_dma_transmission(priv->ioaddr, txq_index);
 
-	spin_unlock(&tqueue->tx_lock);
-
 	return NETDEV_TX_OK;
 }
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH v2 2/2] net: ethernet: stmmac: remove private tx queue lock
From: Lino Sanfilippo @ 2016-12-08 23:55 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue
  Cc: romieu, pavel, davem, linux-kernel, netdev, Lino Sanfilippo
In-Reply-To: <1481241343-18062-1-git-send-email-LinoSanfilippo@gmx.de>

The driver uses a private lock for synchronization of the xmit function and
the xmit completion handler, but since the NETIF_F_LLTX flag is not set,
the xmit function is also called with the xmit_lock held.

On the other hand the completion handler uses the reverse locking order by
first taking the private lock and (in case that the tx queue had been
stopped) then the xmit_lock.

Improve the locking by removing the private lock and using only the
xmit_lock for synchronization instead.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |  1 -
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 28 +++++------------------
 2 files changed, 6 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 4d2a759..7e69b11 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -64,7 +64,6 @@ struct stmmac_priv {
 	dma_addr_t dma_tx_phy;
 	int tx_coalesce;
 	int hwts_tx_en;
-	spinlock_t tx_lock;
 	bool tx_path_in_lpi_mode;
 	struct timer_list txtimer;
 	bool tso;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index caf069a..db46ec4 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1307,7 +1307,7 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
 	unsigned int bytes_compl = 0, pkts_compl = 0;
 	unsigned int entry = priv->dirty_tx;
 
-	spin_lock(&priv->tx_lock);
+	netif_tx_lock(priv->dev);
 
 	priv->xstats.tx_clean++;
 
@@ -1378,22 +1378,17 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
 	netdev_completed_queue(priv->dev, pkts_compl, bytes_compl);
 
 	if (unlikely(netif_queue_stopped(priv->dev) &&
-		     stmmac_tx_avail(priv) > STMMAC_TX_THRESH)) {
-		netif_tx_lock(priv->dev);
-		if (netif_queue_stopped(priv->dev) &&
-		    stmmac_tx_avail(priv) > STMMAC_TX_THRESH) {
-			if (netif_msg_tx_done(priv))
-				pr_debug("%s: restart transmit\n", __func__);
-			netif_wake_queue(priv->dev);
-		}
-		netif_tx_unlock(priv->dev);
+	    stmmac_tx_avail(priv) > STMMAC_TX_THRESH)) {
+		if (netif_msg_tx_done(priv))
+			pr_debug("%s: restart transmit\n", __func__);
+		netif_wake_queue(priv->dev);
 	}
 
 	if ((priv->eee_enabled) && (!priv->tx_path_in_lpi_mode)) {
 		stmmac_enable_eee_mode(priv);
 		mod_timer(&priv->eee_ctrl_timer, STMMAC_LPI_T(eee_timer));
 	}
-	spin_unlock(&priv->tx_lock);
+	netif_tx_unlock(priv->dev);
 }
 
 static inline void stmmac_enable_dma_irq(struct stmmac_priv *priv)
@@ -1998,8 +1993,6 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	u8 proto_hdr_len;
 	int i;
 
-	spin_lock(&priv->tx_lock);
-
 	/* Compute header lengths */
 	proto_hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
 
@@ -2011,7 +2004,6 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 			/* This is a hard error, log it. */
 			pr_err("%s: Tx Ring full when queue awake\n", __func__);
 		}
-		spin_unlock(&priv->tx_lock);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -2146,11 +2138,9 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 				       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -2182,10 +2172,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 			return stmmac_tso_xmit(skb, dev);
 	}
 
-	spin_lock(&priv->tx_lock);
-
 	if (unlikely(stmmac_tx_avail(priv) < nfrags + 1)) {
-		spin_unlock(&priv->tx_lock);
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
 			/* This is a hard error, log it. */
@@ -2357,11 +2344,9 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 		priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 					       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -3347,7 +3332,6 @@ int stmmac_dvr_probe(struct device *device,
 	netif_napi_add(ndev, &priv->napi, stmmac_poll, 64);
 
 	spin_lock_init(&priv->lock);
-	spin_lock_init(&priv->tx_lock);
 
 	ret = register_netdev(ndev);
 	if (ret) {
-- 
1.9.1

^ permalink raw reply related

* Re: [net-next PATCH v5 1/6] net: virtio dynamically disable/enable LRO
From: John Fastabend @ 2016-12-09  0:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: daniel, shm, davem, tgraf, alexei.starovoitov, john.r.fastabend,
	netdev, brouer
In-Reply-To: <20161208231857-mutt-send-email-mst@kernel.org>

On 16-12-08 01:36 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 07, 2016 at 12:11:11PM -0800, John Fastabend wrote:
>> This adds support for dynamically setting the LRO feature flag. The
>> message to control guest features in the backend uses the
>> CTRL_GUEST_OFFLOADS msg type.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>  drivers/net/virtio_net.c |   40 +++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 39 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index a21d93a..a5c47b1 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1419,6 +1419,36 @@ static void virtnet_init_settings(struct net_device *dev)
>>  	.set_settings = virtnet_set_settings,
>>  };
>>  
>> +static int virtnet_set_features(struct net_device *netdev,
>> +				netdev_features_t features)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(netdev);
>> +	struct virtio_device *vdev = vi->vdev;
>> +	struct scatterlist sg;
>> +	u64 offloads = 0;
>> +
>> +	if (features & NETIF_F_LRO)
>> +		offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) |
>> +			    (1 << VIRTIO_NET_F_GUEST_TSO6);
>> +
>> +	if (features & NETIF_F_RXCSUM)
>> +		offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM);
>> +
>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
>> +		sg_init_one(&sg, &offloads, sizeof(uint64_t));
>> +		if (!virtnet_send_command(vi,
>> +					  VIRTIO_NET_CTRL_GUEST_OFFLOADS,
>> +					  VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
>> +					  &sg)) {
> 
> Hmm I just realised that this will slow down setups that bridge
> virtio net interfaces since bridge calls this if provided.
> See below.


Really? What code is trying to turn off GRO via the GUEST_OFFLOADS LRO
command. My qemu/Linux setup has a set of tap/vhost devices attached to
a bridge and all of them have LRO enabled even with this patch series.

I must missing a setup handler somewhere?

> 
>> +			dev_warn(&netdev->dev,
>> +				 "Failed to set guest offloads by virtnet command.\n");
>> +			return -EINVAL;
>> +		}
>> +	}
> 
> Hmm if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS is off, this fails
> silently. It might actually be a good idea to avoid
> breaking setups.
> 
>> +
>> +	return 0;
>> +}
>> +
>>  static const struct net_device_ops virtnet_netdev = {
>>  	.ndo_open            = virtnet_open,
>>  	.ndo_stop   	     = virtnet_close,
>> @@ -1435,6 +1465,7 @@ static void virtnet_init_settings(struct net_device *dev)
>>  #ifdef CONFIG_NET_RX_BUSY_POLL
>>  	.ndo_busy_poll		= virtnet_busy_poll,
>>  #endif
>> +	.ndo_set_features	= virtnet_set_features,
>>  };
>>  
>>  static void virtnet_config_changed_work(struct work_struct *work)
>> @@ -1815,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev)
>>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
>>  		dev->features |= NETIF_F_RXCSUM;
>>  
>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
>> +	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
>> +		dev->features |= NETIF_F_LRO;
>> +		dev->hw_features |= NETIF_F_LRO;
> 
> So the issue is I think that the virtio "LRO" isn't really
> LRO, it's typically just GRO forwarded to guests.
> So these are easily re-split along MTU boundaries,
> which makes it ok to forward these across bridges.
> 
> It's not nice that we don't document this in the spec,
> but it's the reality and people rely on this.
> 
> For now, how about doing a custom thing and just disable/enable
> it as XDP is attached/detached?

The annoying part about doing this is ethtool will say that it is fixed
yet it will be changed by seemingly unrelated operation. I'm not sure I
like the idea to start automatically configuring the link via xdp_set.

> 
>> +	}
>> +
>>  	dev->vlan_features = dev->features;
>>  
>>  	/* MTU range: 68 - 65535 */
>> @@ -2057,7 +2094,8 @@ static int virtnet_restore(struct virtio_device *vdev)
>>  	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
>>  	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>>  	VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> -	VIRTIO_NET_F_MTU
>> +	VIRTIO_NET_F_MTU, \
>> +	VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
>>  
>>  static unsigned int features[] = {
>>  	VIRTNET_FEATURES,

^ permalink raw reply

* Re: fs, net: deadlock between bind/splice on af_unix
From: Cong Wang @ 2016-12-09  0:08 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Al Viro, linux-fsdevel@vger.kernel.org, LKML, David Miller,
	Rainer Weikusat, Hannes Frederic Sowa, netdev, Eric Dumazet,
	syzkaller
In-Reply-To: <CACT4Y+a-fTzW95ViVP3knhNTa=h6XFB0bMddcyQVeeV+LbXKHw@mail.gmail.com>

On Thu, Dec 8, 2016 at 8:30 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> Chain exists of:
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(sb_writers#5);
>                                lock(&u->bindlock);
>                                lock(sb_writers#5);
>   lock(&pipe->mutex/1);

This looks false positive, probably just needs lockdep_set_class()
to set keys for pipe->mutex and unix->bindlock.

^ permalink raw reply

* Re: net: deadlock on genl_mutex
From: Cong Wang @ 2016-12-09  0:13 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzkaller, Eric Dumazet, David Miller, Matti Vaittinen,
	Tycho Andersen, Florian Westphal, stephen hemminger, Tom Herbert,
	netdev, LKML, Richard Guy Briggs, netdev-owner
In-Reply-To: <CACT4Y+YaUL8F2L0h1VG3QZP=gSC4NJ=NOTG_dH5Awz=FOd1EVg@mail.gmail.com>

On Thu, Dec 8, 2016 at 10:02 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> Chain exists of:
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(nlk->cb_mutex);
>                                lock(&table[i].mutex);
>                                lock(nlk->cb_mutex);
>   lock(genl_mutex);

Similar to the unix bindlock, this one looks false positive to me too.

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Eric Dumazet @ 2016-12-09  0:30 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hannes Frederic Sowa, Tom Herbert,
	Linux Kernel Network Developers
In-Reply-To: <1481233016.11849.1@smtp.office365.com>

On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:

> We can reproduce the problem at will, still trying to run down the 
> problem.  I'll try and find one of the boxes that dumped a core and get 
> a bt of everybody.  Thanks,

OK, sounds good.

I had a look and :
- could not spot a fix that came after 4.6. 
- could not spot an obvious bug.

Anything special in the program triggering the issue ?
SO_REUSEPORT and/or special socket options ?

Thanks.

^ permalink raw reply

* Re: net: deadlock on genl_mutex
From: Cong Wang @ 2016-12-09  0:32 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzkaller, Eric Dumazet, David Miller, Matti Vaittinen,
	Tycho Andersen, Florian Westphal, stephen hemminger, Tom Herbert,
	netdev, LKML, Richard Guy Briggs, netdev-owner
In-Reply-To: <CACT4Y+Zy82UAJ55VbPbVadUM92ZSx1VJCFPdhhcmj53uxZ5PXQ@mail.gmail.com>

On Thu, Dec 8, 2016 at 9:16 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> Chain exists of:
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(genl_mutex);
>                                lock(nlk->cb_mutex);
>                                lock(genl_mutex);
>   lock(rtnl_mutex);
>
>  *** DEADLOCK ***

This one looks legitimate, because nlk->cb_mutex could be rtnl_mutex.
Let me think about it.

^ permalink raw reply

* linux-next: build warning after merge of the net-next tree
From: Stephen Rothwell @ 2016-12-09  0:34 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: linux-next, linux-kernel, Jacob Keller, Jeff Kirsher

Hi all,

After merging the net-next tree, today's linux-next build (powerpc
ppc64_defconfig) produced this warning:

In file included from include/linux/byteorder/big_endian.h:4:0,
                 from arch/powerpc/include/uapi/asm/byteorder.h:13,
                 from include/asm-generic/bitops/le.h:5,
                 from arch/powerpc/include/asm/bitops.h:279,
                 from include/linux/bitops.h:36,
                 from include/linux/kernel.h:10,
                 from include/linux/skbuff.h:17,
                 from include/linux/if_ether.h:23,
                 from include/linux/etherdevice.h:25,
                 from drivers/net/ethernet/intel/i40e/i40e_main.c:27:
drivers/net/ethernet/intel/i40e/i40e_main.c: In function 'i40e_sync_vsi_filters':
include/uapi/linux/byteorder/big_endian.h:34:26: warning: large integer implicitly truncated to unsigned type [-Woverflow]
 #define __cpu_to_le16(x) ((__force __le16)__swab16((x)))
                          ^
include/linux/byteorder/generic.h:89:21: note: in expansion of macro '__cpu_to_le16'
 #define cpu_to_le16 __cpu_to_le16
                     ^
drivers/net/ethernet/intel/i40e/i40e_main.c:2200:5: note: in expansion of macro 'cpu_to_le16'
     cpu_to_le16((u16)I40E_AQC_MM_ERR_NO_RES);
     ^

Introduced by commit

  ac9e23901441 ("i40e: refactor i40e_update_filter_state to avoid passing aq_err")

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix VxLAN-gpe port can't be created in ovs compat mode
From: Yang, Yi @ 2016-12-09  0:27 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: netdev, dev, jbenc
In-Reply-To: <CAOrHB_CVr0PQGsNkhrrvT7L8cX1KZypV4UXakXm1uuCz7ZPYHw@mail.gmail.com>

On Thu, Dec 08, 2016 at 11:41:58AM -0800, Pravin Shelar wrote:
> On Thu, Dec 8, 2016 at 12:20 AM, Yi Yang <yi.y.yang@intel.com> wrote:
> >
> > Signed-off-by: Yi Yang <yi.y.yang@intel.com>
> > ---
> >  include/uapi/linux/openvswitch.h |  1 +
> >  net/openvswitch/vport-vxlan.c    | 15 +++++++++++++++
> >  2 files changed, 16 insertions(+)
> >
> There is no need for this patch in upstream kernel module. I am open
> to having such a patch in out of tree kernel if it simplifies feature
> compatibility code.

I'm very glad to hear this :-), the goal is to enable current ovs to create
vxlan-gpe port in compat mode without new kernel help, I'll post a patch
for ovs, thanks a lot.

^ permalink raw reply

* Re: [PATCH 37/50] netfilter: nf_tables: atomic dump and reset for stateful objects
From: Paul Gortmaker @ 2016-12-09  0:40 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, David Miller, netdev, linux-next@vger.kernel.org
In-Reply-To: <1481147576-5690-38-git-send-email-pablo@netfilter.org>

On Wed, Dec 7, 2016 at 4:52 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic
> dump-and-reset of the stateful object. This also comes with add support
> for atomic dump and reset for counter and quota objects.

This triggered a new build failure in linux-next on parisc-32, which a
hands-off bisect
run lists as resulting from this:

ERROR: "__cmpxchg_u64" [net/netfilter/nft_counter.ko] undefined!
make[2]: *** [__modpost] Error 1
make[1]: *** [modules] Error 2
make: *** [sub-make] Error 2
43da04a593d8b2626f1cf4b56efe9402f6b53652 is the first bad commit
commit 43da04a593d8b2626f1cf4b56efe9402f6b53652
Author: Pablo Neira Ayuso <pablo@netfilter.org>
Date:   Mon Nov 28 00:05:44 2016 +0100

    netfilter: nf_tables: atomic dump and reset for stateful objects

    This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic
    dump-and-reset of the stateful object. This also comes with add support
    for atomic dump and reset for counter and quota objects.

    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

:040000 040000 6cd4554f69247e5c837db52342f26888beda1623
5908aca93c89e7922336546c3753bfcf2aceefba M      include
:040000 040000 f25d5831eb30972436bd198c5bb237a0cb0b4856
4ee5751c8de02bb5a8dcaadb2a2df7986d90f8e9 M      net
bisect run success

Guessing this is more an issue with parisc than it is with netfilter, but I
figured I'd mention it anyway.

Paul.
--




>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
>  include/net/netfilter/nf_tables.h        |  3 +-
>  include/uapi/linux/netfilter/nf_tables.h |  2 ++
>  net/netfilter/nf_tables_api.c            | 29 ++++++++++++-----
>  net/netfilter/nft_counter.c              | 56 +++++++++++++++++++++++++++-----
>  net/netfilter/nft_quota.c                | 18 ++++++----
>  5 files changed, 85 insertions(+), 23 deletions(-)
>
> diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
> index 903cd618f50e..6f7d6a1dc09c 100644
> --- a/include/net/netfilter/nf_tables.h
> +++ b/include/net/netfilter/nf_tables.h
> @@ -997,7 +997,8 @@ struct nft_object_type {
>                                                 struct nft_object *obj);
>         void                            (*destroy)(struct nft_object *obj);
>         int                             (*dump)(struct sk_buff *skb,
> -                                               const struct nft_object *obj);
> +                                               struct nft_object *obj,
> +                                               bool reset);
>  };
>
>  int nft_register_obj(struct nft_object_type *obj_type);
> diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
> index 3d47582caa80..399eac1eee91 100644
> --- a/include/uapi/linux/netfilter/nf_tables.h
> +++ b/include/uapi/linux/netfilter/nf_tables.h
> @@ -89,6 +89,7 @@ enum nft_verdicts {
>   * @NFT_MSG_NEWOBJ: create a stateful object (enum nft_obj_attributes)
>   * @NFT_MSG_GETOBJ: get a stateful object (enum nft_obj_attributes)
>   * @NFT_MSG_DELOBJ: delete a stateful object (enum nft_obj_attributes)
> + * @NFT_MSG_GETOBJ_RESET: get and reset a stateful object (enum nft_obj_attributes)
>   */
>  enum nf_tables_msg_types {
>         NFT_MSG_NEWTABLE,
> @@ -112,6 +113,7 @@ enum nf_tables_msg_types {
>         NFT_MSG_NEWOBJ,
>         NFT_MSG_GETOBJ,
>         NFT_MSG_DELOBJ,
> +       NFT_MSG_GETOBJ_RESET,
>         NFT_MSG_MAX,
>  };
>
> diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> index 2ae717c5dcb8..bfc015af366a 100644
> --- a/net/netfilter/nf_tables_api.c
> +++ b/net/netfilter/nf_tables_api.c
> @@ -3972,14 +3972,14 @@ static struct nft_object *nft_obj_init(const struct nft_object_type *type,
>  }
>
>  static int nft_object_dump(struct sk_buff *skb, unsigned int attr,
> -                          const struct nft_object *obj)
> +                          struct nft_object *obj, bool reset)
>  {
>         struct nlattr *nest;
>
>         nest = nla_nest_start(skb, attr);
>         if (!nest)
>                 goto nla_put_failure;
> -       if (obj->type->dump(skb, obj) < 0)
> +       if (obj->type->dump(skb, obj, reset) < 0)
>                 goto nla_put_failure;
>         nla_nest_end(skb, nest);
>         return 0;
> @@ -4096,7 +4096,7 @@ static int nf_tables_newobj(struct net *net, struct sock *nlsk,
>  static int nf_tables_fill_obj_info(struct sk_buff *skb, struct net *net,
>                                    u32 portid, u32 seq, int event, u32 flags,
>                                    int family, const struct nft_table *table,
> -                                  const struct nft_object *obj)
> +                                  struct nft_object *obj, bool reset)
>  {
>         struct nfgenmsg *nfmsg;
>         struct nlmsghdr *nlh;
> @@ -4115,7 +4115,7 @@ static int nf_tables_fill_obj_info(struct sk_buff *skb, struct net *net,
>             nla_put_string(skb, NFTA_OBJ_NAME, obj->name) ||
>             nla_put_be32(skb, NFTA_OBJ_TYPE, htonl(obj->type->type)) ||
>             nla_put_be32(skb, NFTA_OBJ_USE, htonl(obj->use)) ||
> -           nft_object_dump(skb, NFTA_OBJ_DATA, obj))
> +           nft_object_dump(skb, NFTA_OBJ_DATA, obj, reset))
>                 goto nla_put_failure;
>
>         nlmsg_end(skb, nlh);
> @@ -4131,10 +4131,14 @@ static int nf_tables_dump_obj(struct sk_buff *skb, struct netlink_callback *cb)
>         const struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
>         const struct nft_af_info *afi;
>         const struct nft_table *table;
> -       const struct nft_object *obj;
>         unsigned int idx = 0, s_idx = cb->args[0];
>         struct net *net = sock_net(skb->sk);
>         int family = nfmsg->nfgen_family;
> +       struct nft_object *obj;
> +       bool reset = false;
> +
> +       if (NFNL_MSG_TYPE(cb->nlh->nlmsg_type) == NFT_MSG_GETOBJ_RESET)
> +               reset = true;
>
>         rcu_read_lock();
>         cb->seq = net->nft.base_seq;
> @@ -4156,7 +4160,7 @@ static int nf_tables_dump_obj(struct sk_buff *skb, struct netlink_callback *cb)
>                                                             cb->nlh->nlmsg_seq,
>                                                             NFT_MSG_NEWOBJ,
>                                                             NLM_F_MULTI | NLM_F_APPEND,
> -                                                           afi->family, table, obj) < 0)
> +                                                           afi->family, table, obj, reset) < 0)
>                                         goto done;
>
>                                 nl_dump_check_consistent(cb, nlmsg_hdr(skb));
> @@ -4183,6 +4187,7 @@ static int nf_tables_getobj(struct net *net, struct sock *nlsk,
>         const struct nft_table *table;
>         struct nft_object *obj;
>         struct sk_buff *skb2;
> +       bool reset = false;
>         u32 objtype;
>         int err;
>
> @@ -4214,9 +4219,12 @@ static int nf_tables_getobj(struct net *net, struct sock *nlsk,
>         if (!skb2)
>                 return -ENOMEM;
>
> +       if (NFNL_MSG_TYPE(nlh->nlmsg_type) == NFT_MSG_GETOBJ_RESET)
> +               reset = true;
> +
>         err = nf_tables_fill_obj_info(skb2, net, NETLINK_CB(skb).portid,
>                                       nlh->nlmsg_seq, NFT_MSG_NEWOBJ, 0,
> -                                     family, table, obj);
> +                                     family, table, obj, reset);
>         if (err < 0)
>                 goto err;
>
> @@ -4291,7 +4299,7 @@ static int nf_tables_obj_notify(const struct nft_ctx *ctx,
>
>         err = nf_tables_fill_obj_info(skb, ctx->net, ctx->portid, ctx->seq,
>                                       event, 0, ctx->afi->family, ctx->table,
> -                                     obj);
> +                                     obj, false);
>         if (err < 0) {
>                 kfree_skb(skb);
>                 goto err;
> @@ -4482,6 +4490,11 @@ static const struct nfnl_callback nf_tables_cb[NFT_MSG_MAX] = {
>                 .attr_count     = NFTA_OBJ_MAX,
>                 .policy         = nft_obj_policy,
>         },
> +       [NFT_MSG_GETOBJ_RESET] = {
> +               .call           = nf_tables_getobj,
> +               .attr_count     = NFTA_OBJ_MAX,
> +               .policy         = nft_obj_policy,
> +       },
>  };
>
>  static void nft_chain_commit_update(struct nft_trans *trans)
> diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
> index 6f3dd429f865..f6a02c5071c2 100644
> --- a/net/netfilter/nft_counter.c
> +++ b/net/netfilter/nft_counter.c
> @@ -100,10 +100,10 @@ static void nft_counter_obj_destroy(struct nft_object *obj)
>         nft_counter_do_destroy(priv);
>  }
>
> -static void nft_counter_fetch(const struct nft_counter_percpu __percpu *counter,
> +static void nft_counter_fetch(struct nft_counter_percpu __percpu *counter,
>                               struct nft_counter *total)
>  {
> -       const struct nft_counter_percpu *cpu_stats;
> +       struct nft_counter_percpu *cpu_stats;
>         u64 bytes, packets;
>         unsigned int seq;
>         int cpu;
> @@ -122,12 +122,52 @@ static void nft_counter_fetch(const struct nft_counter_percpu __percpu *counter,
>         }
>  }
>
> +static u64 __nft_counter_reset(u64 *counter)
> +{
> +       u64 ret, old;
> +
> +       do {
> +               old = *counter;
> +               ret = cmpxchg64(counter, old, 0);
> +       } while (ret != old);
> +
> +       return ret;
> +}
> +
> +static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
> +                             struct nft_counter *total)
> +{
> +       struct nft_counter_percpu *cpu_stats;
> +       u64 bytes, packets;
> +       unsigned int seq;
> +       int cpu;
> +
> +       memset(total, 0, sizeof(*total));
> +       for_each_possible_cpu(cpu) {
> +               bytes = packets = 0;
> +
> +               cpu_stats = per_cpu_ptr(counter, cpu);
> +               do {
> +                       seq     = u64_stats_fetch_begin_irq(&cpu_stats->syncp);
> +                       packets += __nft_counter_reset(&cpu_stats->counter.packets);
> +                       bytes   += __nft_counter_reset(&cpu_stats->counter.bytes);
> +               } while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
> +
> +               total->packets += packets;
> +               total->bytes += bytes;
> +       }
> +}
> +
>  static int nft_counter_do_dump(struct sk_buff *skb,
> -                              const struct nft_counter_percpu_priv *priv)
> +                              const struct nft_counter_percpu_priv *priv,
> +                              bool reset)
>  {
>         struct nft_counter total;
>
> -       nft_counter_fetch(priv->counter, &total);
> +       if (reset)
> +               nft_counter_reset(priv->counter, &total);
> +       else
> +               nft_counter_fetch(priv->counter, &total);
>
>         if (nla_put_be64(skb, NFTA_COUNTER_BYTES, cpu_to_be64(total.bytes),
>                          NFTA_COUNTER_PAD) ||
> @@ -141,11 +181,11 @@ static int nft_counter_do_dump(struct sk_buff *skb,
>  }
>
>  static int nft_counter_obj_dump(struct sk_buff *skb,
> -                               const struct nft_object *obj)
> +                               struct nft_object *obj, bool reset)
>  {
> -       const struct nft_counter_percpu_priv *priv = nft_obj_data(obj);
> +       struct nft_counter_percpu_priv *priv = nft_obj_data(obj);
>
> -       return nft_counter_do_dump(skb, priv);
> +       return nft_counter_do_dump(skb, priv, reset);
>  }
>
>  static const struct nla_policy nft_counter_policy[NFTA_COUNTER_MAX + 1] = {
> @@ -178,7 +218,7 @@ static int nft_counter_dump(struct sk_buff *skb, const struct nft_expr *expr)
>  {
>         const struct nft_counter_percpu_priv *priv = nft_expr_priv(expr);
>
> -       return nft_counter_do_dump(skb, priv);
> +       return nft_counter_do_dump(skb, priv, false);
>  }
>
>  static int nft_counter_init(const struct nft_ctx *ctx,
> diff --git a/net/netfilter/nft_quota.c b/net/netfilter/nft_quota.c
> index 0d344209803a..5d25f57497cb 100644
> --- a/net/netfilter/nft_quota.c
> +++ b/net/netfilter/nft_quota.c
> @@ -83,12 +83,17 @@ static int nft_quota_obj_init(const struct nlattr * const tb[],
>         return nft_quota_do_init(tb, priv);
>  }
>
> -static int nft_quota_do_dump(struct sk_buff *skb, const struct nft_quota *priv)
> +static int nft_quota_do_dump(struct sk_buff *skb, struct nft_quota *priv,
> +                            bool reset)
>  {
>         u32 flags = priv->invert ? NFT_QUOTA_F_INV : 0;
>         u64 consumed;
>
> -       consumed = atomic64_read(&priv->consumed);
> +       if (reset)
> +               consumed = atomic64_xchg(&priv->consumed, 0);
> +       else
> +               consumed = atomic64_read(&priv->consumed);
> +
>         /* Since we inconditionally increment consumed quota for each packet
>          * that we see, don't go over the quota boundary in what we send to
>          * userspace.
> @@ -108,11 +113,12 @@ static int nft_quota_do_dump(struct sk_buff *skb, const struct nft_quota *priv)
>         return -1;
>  }
>
> -static int nft_quota_obj_dump(struct sk_buff *skb, const struct nft_object *obj)
> +static int nft_quota_obj_dump(struct sk_buff *skb, struct nft_object *obj,
> +                             bool reset)
>  {
>         struct nft_quota *priv = nft_obj_data(obj);
>
> -       return nft_quota_do_dump(skb, priv);
> +       return nft_quota_do_dump(skb, priv, reset);
>  }
>
>  static struct nft_object_type nft_quota_obj __read_mostly = {
> @@ -146,9 +152,9 @@ static int nft_quota_init(const struct nft_ctx *ctx,
>
>  static int nft_quota_dump(struct sk_buff *skb, const struct nft_expr *expr)
>  {
> -       const struct nft_quota *priv = nft_expr_priv(expr);
> +       struct nft_quota *priv = nft_expr_priv(expr);
>
> -       return nft_quota_do_dump(skb, priv);
> +       return nft_quota_do_dump(skb, priv, false);
>  }
>
>  static struct nft_expr_type nft_quota_type;
> --
> 2.1.4
>

^ permalink raw reply

* Re: [Intel-wired-lan] [RFC PATCH] i40e: enable PCIe relax ordering for SPARC
From: tndave @ 2016-12-09  0:45 UTC (permalink / raw)
  To: Alexander Duyck, David Laight; +Cc: Jeff Kirsher, intel-wired-lan, Netdev
In-Reply-To: <CAKgT0Uf3aZBoziv7B0tFMovYe9JNWEYZAZhe9iof2XcuRJD+uw@mail.gmail.com>



On 12/08/2016 08:05 AM, Alexander Duyck wrote:
> On Thu, Dec 8, 2016 at 2:43 AM, David Laight
> <David.Laight@aculab.com> wrote:
>> From: Alexander Duyck
>>> Sent: 06 December 2016 17:10
>> ...
>>> I was thinking about it and I realized we can probably simplify
>>> this even further.  In the case of most other architectures the
>>> DMA_ATTR_WEAK_ORDERING has no effect anyway.  So from what I can
>>> tell there is probably no reason not to just always pass that
>>> attribute with the DMA mappings.  From what I can tell the only
>>> other architecture that uses this is the PowerPC Cell
>>> architecture.
>>
>> And I should have read all the thread :-(
>>
>>> Also I was wondering if you actually needed to enable this
>>> attribute for both Rx and Tx buffers or just Rx buffers?  The
>>> patch that enabled DMA_ATTR_WEAK_ORDERING for Sparc64 seems to
>>> call out writes, but I didn't see anything about reads.  I'm just
>>> wondering if changing the code for Tx has any effect?  If not you
>>> could probably drop those changes and just focus on Rx.
>>
>> 'Weak ordering' only applies to PCIe read transfers, so can only
>> have an effect on descriptor reads and transmit buffer reads.
>>
>> Basically PCIe is a comms protocol and an endpoint (or the host)
>> can have multiple outstanding read requests (each of which might
>> generate multiple response messages. The responses for each request
>> must arrive in order, but responses for different requests can be
>> interleaved. Setting 'not weak ordering' lets the host interwork
>> with broken endpoints. (Or, like we did, you fix the fpga's PCIe
>> implementation.)
>
> I get the basics of relaxed ordering.  The question is how does the
> Sparc64 IOMMU translate DMA_ATTR_WEAK_ORDERING into relaxed ordering
> messages, and at what level the ordering is relaxed.  Odds are the
> wording in the description where this attribute was added to Sparc
> is just awkward, but I was wanting to verify if this only applies to
> writes, or also read completions.
In Sparc64, passing DMA_ATTR_WEAK_ORDERING in dma map/unmap only affects 
PCIe root complex (Hostbridge). Using DMA_ATTR_WEAK_ORDERING, requested 
DMA transaction can be relaxed ordered within the PCIe root complex.

In Sparc64, memory writes can be held at PCIe root complex not letting
other memory writes to go through. By passing DMA_ATTR_WEAK_ORDERING in
dma map/unmap allows memory writes to bypass other memory writes in PCIe
root complex. (This applies to only PCIe root complex and does not 
affect at any other level of PCIe hierarchy e.g. PCIe bridges et al. 
Also the PCIe root complex when bypassing memory writes does follow PCIe 
relax ordering rules as per PCIe specification.

For reference [old but still relevant write-up]: PCI-Express Relaxed 
Ordering and the Sun SPARC Enterprise M-class Servers
https://blogs.oracle.com/olympus/entry/relaxed_ordering

>
>> In this case you need the reads of both transmit and receive rings
>> to 'overtake' reads of transmit data.
>
> Actually that isn't quite right.  With relaxed ordering completions
> and writes can pass each other if I recall correctly, but reads will
> always force all writes ahead of them to be completed before you can
> begin generating the read completions.
That is my understanding as well.

>
>> I'm not at all clear how this 'flag' can be set on dma_map(). It is
>> a property of the PCIe subsystem.
Because in Sparc64, passing DMA_ATTR_WEAK_ORDERING flag in DMA map/unmap 
adds an entry in IOMMU/ATU table so that an access to requested DMA 
address from PCIe root complex can be relaxed ordered.
>
> That was where my original question on this came in.  We can do a
> blanket enable of relaxed ordering for Tx and Rx data buffers, but
> if we only need it on Rx then there isn't any need for us to make
> unnecessary changes.
I ran some quick test and it is likely that we don't need
DMA_ATTR_WEAK_ORDERING for any TX dma buffer (because in case of TX dma
buffers, its all memory reads from device).

-Tushar
>
> - Alex
>

^ permalink raw reply

* Re: [PATCH 2/6] net: ethernet: ti: cpts: add support for ext rftclk selection
From: Stephen Boyd @ 2016-12-09  0:47 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: Richard Cochran, Murali Karicheri, David S. Miller, netdev,
	Mugunthan V N, Sekhar Nori, linux-kernel, linux-omap, Rob Herring,
	devicetree, Wingman Kwok, linux-clk
In-Reply-To: <11994fbc-3713-6ef7-8a44-8a2442106dfc@ti.com>

On 12/06, Grygorii Strashko wrote:
> Subject: [PATCH] cpts refclk sel
> 
> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
> ---
>  arch/arm/boot/dts/keystone-k2e-netcp.dtsi | 10 +++++-
>  drivers/net/ethernet/ti/cpts.c            | 52 ++++++++++++++++++++++++++++++-
>  2 files changed, 60 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/boot/dts/keystone-k2e-netcp.dtsi b/arch/arm/boot/dts/keystone-k2e-netcp.dtsi
> index 919e655..b27aa22 100644
> --- a/arch/arm/boot/dts/keystone-k2e-netcp.dtsi
> +++ b/arch/arm/boot/dts/keystone-k2e-netcp.dtsi
> @@ -138,7 +138,7 @@ netcp: netcp@24000000 {
>  	/* NetCP address range */
>  	ranges = <0 0x24000000 0x1000000>;
> 
> -	clocks = <&clkpa>, <&clkcpgmac>, <&chipclk12>;
> +	clocks = <&clkpa>, <&clkcpgmac>, <&cpts_mux>;
>  	clock-names = "pa_clk", "ethss_clk", "cpts";
>  	dma-coherent;
> 
> @@ -162,6 +162,14 @@ netcp: netcp@24000000 {
>  			cpts-ext-ts-inputs = <6>;
>  			cpts-ts-comp-length;
> 
> +			cpts_mux: cpts_refclk_mux {
> +				#clock-cells = <0>;
> +				clocks = <&chipclk12>, <&chipclk13>;
> +				cpts-mux-tbl = <0>, <1>;
> +				assigned-clocks = <&cpts_mux>;
> +				assigned-clock-parents = <&chipclk12>;

Is there a binding update? Why the subnode? Why not have it as
part of the netcp node? Does the cpts-mux-tbl property change?

> +			};
> +
>  			interfaces {
>  				gbe0: interface-0 {
>  					slave-port = <0>;
> diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
> index 938de22..ef94316 100644
> --- a/drivers/net/ethernet/ti/cpts.c
> +++ b/drivers/net/ethernet/ti/cpts.c
> @@ -17,6 +17,7 @@
>   * along with this program; if not, write to the Free Software
>   * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301  USA
>   */
> +#include <linux/clk-provider.h>
>  #include <linux/err.h>
>  #include <linux/if.h>
>  #include <linux/hrtimer.h>
> @@ -672,6 +673,7 @@ int cpts_register(struct cpts *cpts)
>  	cpts->phc_index = ptp_clock_index(cpts->clock);
> 
>  	schedule_delayed_work(&cpts->overflow_work, cpts->ov_check_period);
> +

Maybe in another patch.

>  	return 0;
> 
>  err_ptp:
> @@ -741,6 +743,54 @@ static void cpts_calc_mult_shift(struct cpts *cpts)
>  		 freq, cpts->cc_mult, cpts->cc.shift, (ns - NSEC_PER_SEC));
>  }
> 
> +static int cpts_of_mux_clk_setup(struct cpts *cpts, struct device_node *node)
> +{
> +	unsigned int num_parents;
> +	const char **parent_names;
> +	struct device_node *refclk_np;
> +	void __iomem *reg;
> +	struct clk *clk;
> +	u32 *mux_table;
> +	int ret;
> +
> +	refclk_np = of_get_child_by_name(node, "cpts_refclk_mux");
> +	if (!refclk_np)
> +		return -EINVAL;
> +
> +	num_parents = of_clk_get_parent_count(refclk_np);
> +	if (num_parents < 1) {
> +		dev_err(cpts->dev, "mux-clock %s must have parents\n",
> +			refclk_np->name);
> +		return -EINVAL;
> +	}
> +	parent_names = devm_kzalloc(cpts->dev, (sizeof(char *) * num_parents),
> +				    GFP_KERNEL);
> +	if (!parent_names)
> +		return -ENOMEM;
> +
> +	of_clk_parent_fill(refclk_np, parent_names, num_parents);
> +
> +	mux_table = devm_kzalloc(cpts->dev, sizeof(*mux_table) * (32 + 1),
> +				 GFP_KERNEL);
> +	if (!mux_table)
> +		return -ENOMEM;
> +
> +	ret = of_property_read_variable_u32_array(refclk_np, "cpts-mux-tbl",
> +						  mux_table, 1, 32);
> +	if (ret < 0)
> +		return ret;
> +
> +	reg = &cpts->reg->rftclk_sel;
> +
> +	clk = clk_register_mux_table(cpts->dev, refclk_np->name,
> +				     parent_names, num_parents,
> +				     0, reg, 0, 0x1F, 0, mux_table, NULL);
> +	if (IS_ERR(clk))
> +		return PTR_ERR(clk);
> +
> +	return of_clk_add_provider(refclk_np, of_clk_src_simple_get, clk);

Can you please use the clk_hw APIs instead?


-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-09  1:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Tom Herbert,
	Linux Kernel Network Developers
In-Reply-To: <1481243432.4930.145.camel@edumazet-glaptop3.roam.corp.google.com>

> On Dec 8, 2016, at 7:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:
>> 
>> We can reproduce the problem at will, still trying to run down the 
>> problem.  I'll try and find one of the boxes that dumped a core and get 
>> a bt of everybody.  Thanks,
> 
> OK, sounds good.
> 
> I had a look and :
> - could not spot a fix that came after 4.6. 
> - could not spot an obvious bug.
> 
> Anything special in the program triggering the issue ?
> SO_REUSEPORT and/or special socket options ?
> 

So they recently started using SO_REUSEPORT, that's what triggered it, if they don't use it then everything is fine.

I added some instrumentation for get_port to see if it was looping in there and none of my printk's triggered.  The softlockup messages are always on the inet_bind_bucket lock, sometimes in the process context in get_port or in the softirq context either through inet_put_port or inet_kill_twsk.  On the box that I have a coredump for there's only one processor in the inet code so I'm not sure what to make of that.  That was a box from last week so I'll look at a more recent core and see if it's different.  Thanks,

Josef

^ permalink raw reply

* Re: [Intel-wired-lan] [RFC PATCH] i40e: enable PCIe relax ordering for SPARC
From: tndave @ 2016-12-09  1:16 UTC (permalink / raw)
  To: Alexander Duyck, David Laight
  Cc: Jeff Kirsher, intel-wired-lan, Netdev, sparclinux
In-Reply-To: <fc3fc99a-ff53-d895-ec06-56f9cf8a0d13@oracle.com>



On 12/08/2016 04:45 PM, tndave wrote:
>
>
> On 12/08/2016 08:05 AM, Alexander Duyck wrote:
>> On Thu, Dec 8, 2016 at 2:43 AM, David Laight
>> <David.Laight@aculab.com> wrote:
>>> From: Alexander Duyck
>>>> Sent: 06 December 2016 17:10
>>> ...
>>>> I was thinking about it and I realized we can probably simplify
>>>> this even further.  In the case of most other architectures the
>>>> DMA_ATTR_WEAK_ORDERING has no effect anyway.  So from what I can
>>>> tell there is probably no reason not to just always pass that
>>>> attribute with the DMA mappings.  From what I can tell the only
>>>> other architecture that uses this is the PowerPC Cell
>>>> architecture.
>>>
>>> And I should have read all the thread :-(
>>>
>>>> Also I was wondering if you actually needed to enable this
>>>> attribute for both Rx and Tx buffers or just Rx buffers?  The
>>>> patch that enabled DMA_ATTR_WEAK_ORDERING for Sparc64 seems to
>>>> call out writes, but I didn't see anything about reads.  I'm just
>>>> wondering if changing the code for Tx has any effect?  If not you
>>>> could probably drop those changes and just focus on Rx.
>>>
>>> 'Weak ordering' only applies to PCIe read transfers, so can only
>>> have an effect on descriptor reads and transmit buffer reads.
>>>
>>> Basically PCIe is a comms protocol and an endpoint (or the host)
>>> can have multiple outstanding read requests (each of which might
>>> generate multiple response messages. The responses for each request
>>> must arrive in order, but responses for different requests can be
>>> interleaved. Setting 'not weak ordering' lets the host interwork
>>> with broken endpoints. (Or, like we did, you fix the fpga's PCIe
>>> implementation.)
>>
>> I get the basics of relaxed ordering.  The question is how does the
>> Sparc64 IOMMU translate DMA_ATTR_WEAK_ORDERING into relaxed ordering
>> messages, and at what level the ordering is relaxed.  Odds are the
>> wording in the description where this attribute was added to Sparc
>> is just awkward, but I was wanting to verify if this only applies to
>> writes, or also read completions.
> In Sparc64, passing DMA_ATTR_WEAK_ORDERING in dma map/unmap only affects
> PCIe root complex (Hostbridge). Using DMA_ATTR_WEAK_ORDERING, requested
> DMA transaction can be relaxed ordered within the PCIe root complex.
>
> In Sparc64, memory writes can be held at PCIe root complex not letting
> other memory writes to go through. By passing DMA_ATTR_WEAK_ORDERING in
> dma map/unmap allows memory writes to bypass other memory writes in PCIe
> root complex. (This applies to only PCIe root complex and does not
> affect at any other level of PCIe hierarchy e.g. PCIe bridges et al.
> Also the PCIe root complex when bypassing memory writes does follow PCIe
> relax ordering rules as per PCIe specification.
>
> For reference [old but still relevant write-up]: PCI-Express Relaxed
> Ordering and the Sun SPARC Enterprise M-class Servers
> https://blogs.oracle.com/olympus/entry/relaxed_ordering
>
>>
>>> In this case you need the reads of both transmit and receive rings
>>> to 'overtake' reads of transmit data.
>>
>> Actually that isn't quite right.  With relaxed ordering completions
>> and writes can pass each other if I recall correctly, but reads will
>> always force all writes ahead of them to be completed before you can
>> begin generating the read completions.
> That is my understanding as well.
>
>>
>>> I'm not at all clear how this 'flag' can be set on dma_map(). It is
>>> a property of the PCIe subsystem.
> Because in Sparc64, passing DMA_ATTR_WEAK_ORDERING flag in DMA map/unmap
> adds an entry in IOMMU/ATU table so that an access to requested DMA
> address from PCIe root complex can be relaxed ordered.
>>
>> That was where my original question on this came in.  We can do a
>> blanket enable of relaxed ordering for Tx and Rx data buffers, but
>> if we only need it on Rx then there isn't any need for us to make
>> unnecessary changes.
> I ran some quick test and it is likely that we don't need
> DMA_ATTR_WEAK_ORDERING for any TX dma buffer (because in case of TX dma
> buffers, its all memory reads from device).
in above line , s/from/by

+ cc sparclinux@vger.kernel.org

-Tushar
>
> -Tushar
>>
>> - Alex
>>
>

^ permalink raw reply

* Re: fs, net: deadlock between bind/splice on af_unix
From: Al Viro @ 2016-12-09  1:32 UTC (permalink / raw)
  To: Cong Wang
  Cc: Dmitry Vyukov, linux-fsdevel@vger.kernel.org, LKML, David Miller,
	Rainer Weikusat, Hannes Frederic Sowa, netdev, Eric Dumazet,
	syzkaller
In-Reply-To: <CAM_iQpXu+fyjmvrYRB9+VJCdSLS=7Jiet762hqWDANfsOM0XWw@mail.gmail.com>

On Thu, Dec 08, 2016 at 04:08:27PM -0800, Cong Wang wrote:
> On Thu, Dec 8, 2016 at 8:30 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> > Chain exists of:
> >  Possible unsafe locking scenario:
> >
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(sb_writers#5);
> >                                lock(&u->bindlock);
> >                                lock(sb_writers#5);
> >   lock(&pipe->mutex/1);
> 
> This looks false positive, probably just needs lockdep_set_class()
> to set keys for pipe->mutex and unix->bindlock.

I'm afraid that it's not a false positive at all.

Preparations:
	* create an AF_UNIX socket.
	* set SOCK_PASSCRED on it.
	* create a pipe.

Child 1: splice from pipe to socket; locks pipe and proceeds down towards
unix_dgram_sendmsg().

Child 2: splice from pipe to /mnt/foo/bar; requests write access to /mnt
and blocks on attempt to lock the pipe already locked by (1).

Child 3: freeze /mnt; blocks until (2) is done

Child 4: bind() the socket to /mnt/barf; grabs ->bindlock on the socket and
proceeds to create /mnt/barf, which blocks due to fairness of freezer (no
extra write accesses to something that is in process of being frozen).

_Now_ (1) gets around to unix_dgram_sendmsg().  We still have NULL u->addr,
since bind() has not gotten through yet.  We also have SOCK_PASSCRED set,
so we attempt autobind; it blocks on the ->bindlock, which won't be
released until bind() is done (at which point we'll see non-NULL u->addr
and bugger off from autobind), but bind() won't succeed until /mnt
goes through the freeze-thaw cycle, which won't happen until (2) finishes,
which won't happen until (1) unlocks the pipe.  Deadlock.

Granted, ->bindlock is taken interruptibly, so it's not that much of
a problem (you can kill the damn thing), but you would need to intervene
and kill it.

Why do we do autobind there, anyway, and why is it conditional on
SOCK_PASSCRED?  Note that e.g. for SOCK_STREAM we can bloody well get
to sending stuff without autobind ever done - just use socketpair()
to create that sucker and we won't be going through the connect()
at all.

^ permalink raw reply

* Re: [PATCH] [v4] net: phy: phy drivers should not set SUPPORTED_[Asym_]Pause
From: David Miller @ 2016-12-09  2:23 UTC (permalink / raw)
  To: timur; +Cc: f.fainelli, netdev, jon.mason, nks.gnu
In-Reply-To: <1481138451-28144-1-git-send-email-timur@codeaurora.org>


Florian, please review this patch.

Thanks.

^ permalink raw reply

* Re: [PATCH-RESEND] vhost-vsock: fix orphan connection reset
From: David Miller @ 2016-12-09  2:25 UTC (permalink / raw)
  To: bergwolf; +Cc: stefanha, kvm, virtualization, netdev
In-Reply-To: <1481217046-7058-1-git-send-email-bergwolf@gmail.com>

From: Peng Tao <bergwolf@gmail.com>
Date: Fri,  9 Dec 2016 01:10:46 +0800

> local_addr.svm_cid is host cid. We should check guest cid instead,
> which is remote_addr.svm_cid. Otherwise we end up resetting all
> connections to all guests.
> 
> Cc: stable@vger.kernel.org [4.8+]
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Peng Tao <bergwolf@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net 0/2] net: ethernet: Make sure we set dev->dev.parent
From: David Miller @ 2016-12-09  2:27 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, madalin.bucur, johan
In-Reply-To: <20161208194125.13264-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu,  8 Dec 2016 11:41:23 -0800

> This patch series builds atop:
> 
> ec988ad78ed6d184a7f4ca6b8e962b0e8f1de461 ("phy: Don't increment MDIO bus
> refcount unless it's a different owner")
> 
> FMAN is the one that potentially needs patching as well (call SET_NETDEV_DEV),
> but there appears to be no way that init_phy is called right now, or there is
> not such an in-tree user. Madalin, can you comment on that?

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 1/3] net/mlx5e: use %pad format string for dma_addr_t
From: David Miller @ 2016-12-09  2:31 UTC (permalink / raw)
  To: arnd-r2nGTMty4D4
  Cc: saeedm-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	leonro-VPRAkNaXOzVWk0Htik3J/w, danielj-VPRAkNaXOzVWk0Htik3J/w,
	tariqt-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161208215727.44841-1-arnd-r2nGTMty4D4@public.gmane.org>

From: Arnd Bergmann <arnd@arndb.de>
Date: Thu,  8 Dec 2016 22:57:03 +0100

> On 32-bit ARM with 64-bit dma_addr_t I get this warning about an
> incorrect format string:
> 
> In file included from /git/arm-soc/drivers/net/ethernet/mellanox/mlx5/core/alloc.c:42:0:
> drivers/net/ethernet/mellanox/mlx5/core/alloc.c: In function ‘mlx5_frag_buf_alloc_node’:
> drivers/net/ethernet/mellanox/mlx5/core/alloc.c:134:12: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
> 
> We have the special %pad format for printing dma_addr_t, so use that
> to print the correct address and avoid the warning.
> 
> Fixes: 1c1b522808a1 ("net/mlx5e: Implement Fragmented Work Queue (WQ)")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox