Netdev List
 help / color / mirror / Atom feed
* [PATCH] ibmveth: set correct gso_size and gso_type
From: Thomas Falcon @ 2016-12-08 22:40 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1477440555-21133-1-git-send-email-jmaxwell37@gmail.com>

This patch is based on an earlier one submitted
by Jon Maxwell with the following commit message:

"We recently encountered a bug where a few customers using ibmveth on the
same LPAR hit an issue where a TCP session hung when large receive was
enabled. Closer analysis revealed that the session was stuck because the
one side was advertising a zero window repeatedly.

We narrowed this down to the fact the ibmveth driver did not set gso_size
which is translated by TCP into the MSS later up the stack. The MSS is
used to calculate the TCP window size and as that was abnormally large,
it was calculating a zero window, even although the sockets receive buffer
was completely empty."

We rely on the Virtual I/O Server partition in a pseries
environment to provide the MSS through the TCP header checksum
field. The stipulation is that users should not disable checksum
offloading if rx packet aggregation is enabled through VIOS.

Some firmware offerings provide the MSS in the RX buffer.
This is signalled by a bit in the RX queue descriptor.

Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
Reviewed-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Jonathan Maxwell <jmaxwell37@gmail.com>
Reviewed-by: David Dai <zdai@us.ibm.com>
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>

---
 drivers/net/ethernet/ibm/ibmveth.c | 65 ++++++++++++++++++++++++++++++++++++--
 drivers/net/ethernet/ibm/ibmveth.h |  1 +
 2 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index ebe6071..a36022b 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -58,7 +58,7 @@
 
 static const char ibmveth_driver_name[] = "ibmveth";
 static const char ibmveth_driver_string[] = "IBM Power Virtual Ethernet Driver";
-#define ibmveth_driver_version "1.05"
+#define ibmveth_driver_version "1.06"
 
 MODULE_AUTHOR("Santiago Leon <santil@linux.vnet.ibm.com>");
 MODULE_DESCRIPTION("IBM Power Virtual Ethernet Driver");
@@ -137,6 +137,11 @@ static inline int ibmveth_rxq_frame_offset(struct ibmveth_adapter *adapter)
 	return ibmveth_rxq_flags(adapter) & IBMVETH_RXQ_OFF_MASK;
 }
 
+static inline int ibmveth_rxq_large_packet(struct ibmveth_adapter *adapter)
+{
+	return ibmveth_rxq_flags(adapter) & IBMVETH_RXQ_LRG_PKT;
+}
+
 static inline int ibmveth_rxq_frame_length(struct ibmveth_adapter *adapter)
 {
 	return be32_to_cpu(adapter->rx_queue.queue_addr[adapter->rx_queue.index].length);
@@ -1174,6 +1179,45 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
 	goto retry_bounce;
 }
 
+static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
+{
+	int offset = 0;
+
+	/* only TCP packets will be aggregated */
+	if (skb->protocol == htons(ETH_P_IP)) {
+		struct iphdr *iph = (struct iphdr *)skb->data;
+
+		if (iph->protocol == IPPROTO_TCP) {
+			offset = iph->ihl * 4;
+			skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+		} else {
+			return;
+		}
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		struct ipv6hdr *iph6 = (struct ipv6hdr *)skb->data;
+
+		if (iph6->nexthdr == IPPROTO_TCP) {
+			offset = sizeof(struct ipv6hdr);
+			skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
+		} else {
+			return;
+		}
+	} else {
+		return;
+	}
+	/* if mss is not set through Large Packet bit/mss in rx buffer,
+	 * expect that the mss will be written to the tcp header checksum.
+	 */
+	if (lrg_pkt) {
+		skb_shinfo(skb)->gso_size = mss;
+	} else if (offset) {
+		struct tcphdr *tcph = (struct tcphdr *)(skb->data + offset);
+
+		skb_shinfo(skb)->gso_size = ntohs(tcph->check);
+		tcph->check = 0;
+	}
+}
+
 static int ibmveth_poll(struct napi_struct *napi, int budget)
 {
 	struct ibmveth_adapter *adapter =
@@ -1182,6 +1226,7 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
 	int frames_processed = 0;
 	unsigned long lpar_rc;
 	struct iphdr *iph;
+	u16 mss = 0;
 
 restart_poll:
 	while (frames_processed < budget) {
@@ -1199,9 +1244,21 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
 			int length = ibmveth_rxq_frame_length(adapter);
 			int offset = ibmveth_rxq_frame_offset(adapter);
 			int csum_good = ibmveth_rxq_csum_good(adapter);
+			int lrg_pkt = ibmveth_rxq_large_packet(adapter);
 
 			skb = ibmveth_rxq_get_buffer(adapter);
 
+			/* if the large packet bit is set in the rx queue
+			 * descriptor, the mss will be written by PHYP eight
+			 * bytes from the start of the rx buffer, which is
+			 * skb->data at this stage
+			 */
+			if (lrg_pkt) {
+				__be64 *rxmss = (__be64 *)(skb->data + 8);
+
+				mss = (u16)be64_to_cpu(*rxmss);
+			}
+
 			new_skb = NULL;
 			if (length < rx_copybreak)
 				new_skb = netdev_alloc_skb(netdev, length);
@@ -1235,11 +1292,15 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
 					if (iph->check == 0xffff) {
 						iph->check = 0;
 						iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
-						adapter->rx_large_packets++;
 					}
 				}
 			}
 
+			if (length > netdev->mtu + ETH_HLEN) {
+				ibmveth_rx_mss_helper(skb, mss, lrg_pkt);
+				adapter->rx_large_packets++;
+			}
+
 			napi_gro_receive(napi, skb);	/* send it up */
 
 			netdev->stats.rx_packets++;
diff --git a/drivers/net/ethernet/ibm/ibmveth.h b/drivers/net/ethernet/ibm/ibmveth.h
index 4eade67..7acda04 100644
--- a/drivers/net/ethernet/ibm/ibmveth.h
+++ b/drivers/net/ethernet/ibm/ibmveth.h
@@ -209,6 +209,7 @@ struct ibmveth_rx_q_entry {
 #define IBMVETH_RXQ_TOGGLE		0x80000000
 #define IBMVETH_RXQ_TOGGLE_SHIFT	31
 #define IBMVETH_RXQ_VALID		0x40000000
+#define IBMVETH_RXQ_LRG_PKT		0x04000000
 #define IBMVETH_RXQ_NO_CSUM		0x02000000
 #define IBMVETH_RXQ_CSUM_GOOD		0x01000000
 #define IBMVETH_RXQ_OFF_MASK		0x0000FFFF
-- 
1.8.3.1

^ permalink raw reply related

* RE: [PATCH net-next 2/5] liquidio VF vxlan
From: Vatsavayi, Raghu @ 2016-12-08 22:42 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, Linux Netdev List, Chickles, Derek,
	Burla, Satananda, Manlunas, Felix
In-Reply-To: <CAJ3xEMgrV4kjMY_SVJtk9VR85K7oSsSZGNQCi4UShogCKOq=hQ@mail.gmail.com>



> -----Original Message-----
> From: Or Gerlitz [mailto:gerlitz.or@gmail.com]
> Sent: Thursday, December 08, 2016 1:08 PM
> To: Vatsavayi, Raghu
> Cc: David Miller; Linux Netdev List; Vatsavayi, Raghu; Chickles, Derek; Burla,
> Satananda; Manlunas, Felix
> Subject: Re: [PATCH net-next 2/5] liquidio VF vxlan
> 
> On Thu, Dec 8, 2016 at 11:00 PM, Raghu Vatsavayi
> <rvatsavayi@caviumnetworks.com> wrote:
> 
> > Adds VF vxlan offload support.
> 
> What's the use case for that? a VM running a VTEP, isn't that part needs to
> run @ the host?
> 
> Or.

Our HW can support offloads for VF which is required if we load it on Hypervisor.
Thanks.

^ permalink raw reply

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
From: Lino Sanfilippo @ 2016-12-08 22:45 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev
In-Reply-To: <20161208221830.GD12472@amd>

On 08.12.2016 23:18, Pavel Machek wrote:
> On Thu 2016-12-08 23:12:10, Lino Sanfilippo wrote:
>> Hi,
>> 
>> On 08.12.2016 22:54, Pavel Machek wrote:
>> > On Thu 2016-12-08 21:32:12, Lino Sanfilippo wrote:
>> >> Hi,
>> >> 
>> >> On 08.12.2016 00:15, Francois Romieu wrote:
>> >> > Lino Sanfilippo <LinoSanfilippo@gmx.de> :
>> >> >> The driver uses a private lock for synchronization between the xmit
>> >> >> function and the xmit completion handler, but since the NETIF_F_LLTX flag
>> >> >> is not set, the xmit function is also called with the xmit_lock held.
>> >> >> 
>> >> >> On the other hand the xmit completion handler first takes the private lock
>> >> >> and (in case that the tx queue has been stopped) the xmit_lock, leading
>> >> >> to a reverse locking order and the potential danger of a deadlock.
>> >> > 
>> >> > netif_tx_stop_queue is used by:
>> >> > 1. xmit function before releasing lock and returning.
>> >> > 2. sxgbe_restart_tx_queue()
>> >> >    <- sxgbe_tx_interrupt
>> >> >    <- sxgbe_reset_all_tx_queues()
>> >> >       <- sxgbe_tx_timeout()
>> >> > 
>> >> > Given xmit won't be called again until tx queue is enabled, it's not clear
>> >> > how a deadlock could happen due to #1.
>> >> > 
>> >> 
>> >> 
>> >> After spending more thoughts on this I tend to agree with you. Yes, we have the
>> >> different locking order for the xmit_lock and the private lock in two concurrent
>> >> threads. And one of the first things one learns about locking is that this is a
>> >> good way to create a deadlock sooner or later. But in our case the deadlock 
>> >> can only occur if the xmit function and the tx completion handler perceive different
>> >>  states for the tx queue, or to be more specific: 
>> >> the completion handler sees the tx queue in state "stopped" while the xmit handler 
>> >> sees it in state "running" at the same time. Only then both functions would try to
>> >> take both locks, which could lead to a deadlock.
>> >> 
>> >> OTOH Pavel said that he actually could produce a deadlock. Now I wonder if this is caused
>> >> by that locking scheme (in a way I have not figured out yet) or if it is a different issue.
>> > 
>> > Pavel has some problems, but that's on different hardware.. and it is
>> > possible that it is deadlock (or something else) somewhere else.
>> > 
>> 
>> Right, it is different hardware. But the locking situation in xmit function and tx completion handler
>> is very similar in both drivers. So if a deadlock is not possible in sxgbe it should 
>> also not be possible in stmmac (at least not due to the different locking order). 
>> So maybe there is no real issue that we could fix with removing the private lock and we should
>> keep it as it is.
> 
> Well.. the locking is pretty confused there. Having private lock that
> mirrors lock from network layer is confusing and ugly... that should
> be reason to fix it.
> 									Pavel
> 

Ok. Then I will resend the patches for both drivers with a different (less dramatic) commit message
in which the change is not longer described as a fix for deadlock but rather as a code 
cleanup/improvement, ok?

Regards,
Lino

^ permalink raw reply

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: David Miller @ 2016-12-08 23:14 UTC (permalink / raw)
  To: Jason; +Cc: netdev, wireguard, linux-kernel, linux-mips
In-Reply-To: <CAHmME9qGoPGEFyqe0jBaZD5R51wHTEgAYb9edj+nu9nNPWSYCA@mail.gmail.com>

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Thu, 8 Dec 2016 23:20:04 +0100

> Hi David,
> 
> On Thu, Dec 8, 2016 at 1:37 AM, David Miller <davem@davemloft.net> wrote:
>> You really have to land the IP header on a proper 4 byte boundary.
>>
>> I would suggest pushing 3 dummy garbage bytes of padding at the front
>> or the end of your header.
> 
> Are you sure 3 bytes to get 4 byte alignment is really the best? I was
> thinking that adding 1 byte to get 2 byte alignment might be better,
> since it would then ensure that the subsequent TCP header winds up
> being 4 byte aligned. Or is this in fact not the desired trade off,
> and so I should stick with the 3 bytes you suggested?

If the IP header is 4 byte aligned, the TCP header will be as well.

^ permalink raw reply

* Re: [PATCH net-next 1/3] net/mlx5e: use %pad format string for dma_addr_t
From: Saeed Mahameed @ 2016-12-08 23:16 UTC (permalink / raw)
  To: Arnd Bergmann, Matan Barak, Leon Romanovsky
  Cc: David S. Miller, Daniel Jurgens, Tariq Toukan,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161208215727.44841-1-arnd-r2nGTMty4D4@public.gmane.org>



On 12/08/2016 11:57 PM, Arnd Bergmann wrote:
> On 32-bit ARM with 64-bit dma_addr_t I get this warning about an
> incorrect format string:
> 
> In file included from /git/arm-soc/drivers/net/ethernet/mellanox/mlx5/core/alloc.c:42:0:
> drivers/net/ethernet/mellanox/mlx5/core/alloc.c: In function ‘mlx5_frag_buf_alloc_node’:
> drivers/net/ethernet/mellanox/mlx5/core/alloc.c:134:12: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
> 
> We have the special %pad format for printing dma_addr_t, so use that
> to print the correct address and avoid the warning.
> 
> Fixes: 1c1b522808a1 ("net/mlx5e: Implement Fragmented Work Queue (WQ)")
> Signed-off-by: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org>

Thank you Arnd !!

Acked-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH net-next 4/5] liquidio VF timestamp
From: Chickles, Derek @ 2016-12-08 22:41 UTC (permalink / raw)
  To: Or Gerlitz, Vatsavayi, Raghu
  Cc: David Miller, Linux Netdev List, Vatsavayi, Raghu,
	Burla, Satananda, Manlunas, Felix
In-Reply-To: <CAJ3xEMg3RC74PEF-6_mF96Wj-dHWW6simspOpxkiObgXBr67KA@mail.gmail.com>



> -----Original Message-----
> From: Or Gerlitz [mailto:gerlitz.or@gmail.com]
> Sent: Thursday, December 08, 2016 1:10 PM
> To: Vatsavayi, Raghu
> Cc: David Miller; Linux Netdev List; Vatsavayi, Raghu; Chickles, Derek; Burla,
> Satananda; Manlunas, Felix
> Subject: Re: [PATCH net-next 4/5] liquidio VF timestamp
> 
> On Thu, Dec 8, 2016 at 11:00 PM, Raghu Vatsavayi
> <rvatsavayi@caviumnetworks.com> wrote:
> > Adds support for VF timestamp.
> 
> same here, what's the use case? do you have per VF HW clocks to set?
> How it works if VF A does setup of X and VF B setup of Y

There's no support for adjusting per-VF clock skew and such in hardware,
but there is support for retrieving the hardware timestamp for Tx and Rx packets
on the VF. Applications can use this for their purposes.

^ permalink raw reply

* Re: [PATCH net-next 0/5] liquidio VF offloads and stats
From: David Miller @ 2016-12-08 23:17 UTC (permalink / raw)
  To: rvatsavayi; +Cc: netdev
In-Reply-To: <1481230848-2393-1-git-send-email-rvatsavayi@caviumnetworks.com>

From: Raghu Vatsavayi <rvatsavayi@caviumnetworks.com>
Date: Thu, 8 Dec 2016 13:00:43 -0800

> Following is final patch series in completing the liquidio
> VF driver support. These patches have minor changes related
> to offloads and stats.
> 
> Please apply patches in following order as some of them
> depend on earlier patches.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
From: Francois Romieu @ 2016-12-08 23:19 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue,
	pavel, davem, linux-kernel, netdev
In-Reply-To: <051e3043-8b58-0591-36e3-99e2267f67f4@gmx.de>

Lino Sanfilippo <LinoSanfilippo@gmx.de> :
[...]
> OTOH Pavel said that he actually could produce a deadlock. Now I wonder if
> this is caused by that locking scheme (in a way I have not figured out yet)
> or if it is a different issue.

stmmac_tx_err races with stmmac_xmit.

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH] net: gianfar: add ethtool eee support
From: David Miller @ 2016-12-08 23:19 UTC (permalink / raw)
  To: Shaohui.Xie; +Cc: netdev
In-Reply-To: <1481196426-45576-1-git-send-email-Shaohui.Xie@nxp.com>

From: Shaohui Xie <Shaohui.Xie@nxp.com>
Date: Thu, 8 Dec 2016 19:27:06 +0800

> Gianfar does not support EEE, but it can connect to a PHY which supports
> EEE and the PHY advertises EEE by default, and its link partner also
> advertises EEE, so the PHY enters low power mode when traffic rate is low,
> which causes packet loss if an application's traffic rate is low. This
> patch provides .get_eee and .set_eee so that to disable the EEE
> advertisement via ethtool if needed, other EEE features are not supported.
> 
> Signed-off-by: Shaohui Xie <Shaohui.Xie@nxp.com>

This is not the way to fix this.

If the Gianfar MAC does not support EEE properly, then the gianfar
driver should not create a situation where the PHY advertises EEE
in the first place.

^ permalink raw reply

* Re: [PATCH] cxgb4/cxgb4vf: Assign netdev->dev_port with port ID
From: David Miller @ 2016-12-08 23:21 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, arjun, leedom, hariprasad
In-Reply-To: <1481200763-15799-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Thu,  8 Dec 2016 18:09:23 +0530

> From: Arjun V <arjun@chelsio.com>
> 
> Added missing dev_port assignment in cxgb4vf driver.
> Also made dev_port assignment of cxgb4 in sync with cxgb4vf driver.
> 
> Signed-off-by: Casey Leedom <leedom@chelsio.com>
> Signed-off-by: Arjun V <arjun@chelsio.com>
> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: socket: preferred __aligned(size) for control buffer
From: David Miller @ 2016-12-08 23:21 UTC (permalink / raw)
  To: kushwaha.a; +Cc: netdev, akkushwaha9896
In-Reply-To: <1481201513-20783-1-git-send-email-kushwaha.a@samsung.com>

From: kushwaha.a@samsung.com
Date: Thu, 08 Dec 2016 18:21:53 +0530

> From: Amit Kushwaha <kushwaha.a@samsung.com>
> 
> This patch cleanup checkpatch.pl warning
> WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
> 
> Signed-off-by: Amit Kushwaha <kushwaha.a@samsung.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH] net: gianfar: add ethtool eee support
From: Florian Fainelli @ 2016-12-08 23:22 UTC (permalink / raw)
  To: David Miller, Shaohui.Xie; +Cc: netdev
In-Reply-To: <20161208.181922.957507188752351548.davem@davemloft.net>

On 12/08/2016 03:19 PM, David Miller wrote:
> From: Shaohui Xie <Shaohui.Xie@nxp.com>
> Date: Thu, 8 Dec 2016 19:27:06 +0800
> 
>> Gianfar does not support EEE, but it can connect to a PHY which supports
>> EEE and the PHY advertises EEE by default, and its link partner also
>> advertises EEE, so the PHY enters low power mode when traffic rate is low,
>> which causes packet loss if an application's traffic rate is low. This
>> patch provides .get_eee and .set_eee so that to disable the EEE
>> advertisement via ethtool if needed, other EEE features are not supported.
>>
>> Signed-off-by: Shaohui Xie <Shaohui.Xie@nxp.com>
> 
> This is not the way to fix this.
> 
> If the Gianfar MAC does not support EEE properly, then the gianfar
> driver should not create a situation where the PHY advertises EEE
> in the first place.

Agreed, you should have gianfar mask out the EEE advertisement from the
PHY it connects to, in order to make sure that EEE is properly disabled.
This should be no different from e.g: connecting a 10/100/1000 PHY to a
10/100 only controller for instance.
-- 
Florian

^ permalink raw reply

* Re: pull-request: can 2016-12-08
From: David Miller @ 2016-12-08 23:23 UTC (permalink / raw)
  To: mkl; +Cc: netdev, linux-can, kernel
In-Reply-To: <20161208153552.18122-1-mkl@pengutronix.de>

From: Marc Kleine-Budde <mkl@pengutronix.de>
Date: Thu,  8 Dec 2016 16:35:51 +0100

> this is a pull request for one patch.
> 
> Jiho Chu found and fixed a use-after-free error in the cleanup path
> in the peak pcan USB CAN driver.

Pulled, thanks Marc.

^ permalink raw reply

* Remove private tx queue locks
From: Lino Sanfilippo @ 2016-12-08 23:55 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue
  Cc: romieu, pavel, davem, linux-kernel, netdev

Hi,

this patch series removes unnecessary private locks in the sxgbe and the
stmmac driver.

v2:
- adjust commit message

^ permalink raw reply

* [PATCH v2 1/2] net: ethernet: sxgbe: remove private tx queue lock
From: Lino Sanfilippo @ 2016-12-08 23:55 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue
  Cc: romieu, pavel, davem, linux-kernel, netdev, Lino Sanfilippo
In-Reply-To: <1481241343-18062-1-git-send-email-LinoSanfilippo@gmx.de>

The driver uses a private lock for synchronization of the xmit function and
the xmit completion handler, but since the NETIF_F_LLTX flag is not set,
the xmit function is also called with the xmit_lock held.

On the other hand the completion handler uses the reverse locking order by
first taking the private lock and (in case that the tx queue had been
stopped) then the xmit_lock.

Improve the locking by removing the private lock and using only the
xmit_lock for synchronization instead.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
---
 drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h |  1 -
 drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c   | 27 +++++------------------
 2 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h b/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h
index 5cb51b6..c61f260 100644
--- a/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h
+++ b/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h
@@ -384,7 +384,6 @@ struct sxgbe_tx_queue {
 	dma_addr_t *tx_skbuff_dma;
 	struct sk_buff **tx_skbuff;
 	struct timer_list txtimer;
-	spinlock_t tx_lock;	/* lock for tx queues */
 	unsigned int cur_tx;
 	unsigned int dirty_tx;
 	u32 tx_count_frames;
diff --git a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
index ea44a24..22d3b0b 100644
--- a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
+++ b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
@@ -426,9 +426,6 @@ static int init_tx_ring(struct device *dev, u8 queue_no,
 	tx_ring->dirty_tx = 0;
 	tx_ring->cur_tx = 0;
 
-	/* initialise TX queue lock */
-	spin_lock_init(&tx_ring->tx_lock);
-
 	return 0;
 
 dmamem_err:
@@ -743,7 +740,7 @@ static void sxgbe_tx_queue_clean(struct sxgbe_tx_queue *tqueue)
 
 	dev_txq = netdev_get_tx_queue(priv->dev, queue_no);
 
-	spin_lock(&tqueue->tx_lock);
+	__netif_tx_lock(dev_txq, smp_processor_id());
 
 	priv->xstats.tx_clean++;
 	while (tqueue->dirty_tx != tqueue->cur_tx) {
@@ -781,18 +778,13 @@ static void sxgbe_tx_queue_clean(struct sxgbe_tx_queue *tqueue)
 
 	/* wake up queue */
 	if (unlikely(netif_tx_queue_stopped(dev_txq) &&
-		     sxgbe_tx_avail(tqueue, tx_rsize) > SXGBE_TX_THRESH(priv))) {
-		netif_tx_lock(priv->dev);
-		if (netif_tx_queue_stopped(dev_txq) &&
-		    sxgbe_tx_avail(tqueue, tx_rsize) > SXGBE_TX_THRESH(priv)) {
-			if (netif_msg_tx_done(priv))
-				pr_debug("%s: restart transmit\n", __func__);
-			netif_tx_wake_queue(dev_txq);
-		}
-		netif_tx_unlock(priv->dev);
+	    sxgbe_tx_avail(tqueue, tx_rsize) > SXGBE_TX_THRESH(priv))) {
+		if (netif_msg_tx_done(priv))
+			pr_debug("%s: restart transmit\n", __func__);
+		netif_tx_wake_queue(dev_txq);
 	}
 
-	spin_unlock(&tqueue->tx_lock);
+	__netif_tx_unlock(dev_txq);
 }
 
 /**
@@ -1304,9 +1296,6 @@ static netdev_tx_t sxgbe_xmit(struct sk_buff *skb, struct net_device *dev)
 		      tqueue->hwts_tx_en)))
 		ctxt_desc_req = 1;
 
-	/* get the spinlock */
-	spin_lock(&tqueue->tx_lock);
-
 	if (priv->tx_path_in_lpi_mode)
 		sxgbe_disable_eee_mode(priv);
 
@@ -1316,8 +1305,6 @@ static netdev_tx_t sxgbe_xmit(struct sk_buff *skb, struct net_device *dev)
 			netdev_err(dev, "%s: Tx Ring is full when %d queue is awake\n",
 				   __func__, txq_index);
 		}
-		/* release the spin lock in case of BUSY */
-		spin_unlock(&tqueue->tx_lock);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -1436,8 +1423,6 @@ static netdev_tx_t sxgbe_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	priv->hw->dma->enable_dma_transmission(priv->ioaddr, txq_index);
 
-	spin_unlock(&tqueue->tx_lock);
-
 	return NETDEV_TX_OK;
 }
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH v2 2/2] net: ethernet: stmmac: remove private tx queue lock
From: Lino Sanfilippo @ 2016-12-08 23:55 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue
  Cc: romieu, pavel, davem, linux-kernel, netdev, Lino Sanfilippo
In-Reply-To: <1481241343-18062-1-git-send-email-LinoSanfilippo@gmx.de>

The driver uses a private lock for synchronization of the xmit function and
the xmit completion handler, but since the NETIF_F_LLTX flag is not set,
the xmit function is also called with the xmit_lock held.

On the other hand the completion handler uses the reverse locking order by
first taking the private lock and (in case that the tx queue had been
stopped) then the xmit_lock.

Improve the locking by removing the private lock and using only the
xmit_lock for synchronization instead.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |  1 -
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 28 +++++------------------
 2 files changed, 6 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 4d2a759..7e69b11 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -64,7 +64,6 @@ struct stmmac_priv {
 	dma_addr_t dma_tx_phy;
 	int tx_coalesce;
 	int hwts_tx_en;
-	spinlock_t tx_lock;
 	bool tx_path_in_lpi_mode;
 	struct timer_list txtimer;
 	bool tso;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index caf069a..db46ec4 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1307,7 +1307,7 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
 	unsigned int bytes_compl = 0, pkts_compl = 0;
 	unsigned int entry = priv->dirty_tx;
 
-	spin_lock(&priv->tx_lock);
+	netif_tx_lock(priv->dev);
 
 	priv->xstats.tx_clean++;
 
@@ -1378,22 +1378,17 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
 	netdev_completed_queue(priv->dev, pkts_compl, bytes_compl);
 
 	if (unlikely(netif_queue_stopped(priv->dev) &&
-		     stmmac_tx_avail(priv) > STMMAC_TX_THRESH)) {
-		netif_tx_lock(priv->dev);
-		if (netif_queue_stopped(priv->dev) &&
-		    stmmac_tx_avail(priv) > STMMAC_TX_THRESH) {
-			if (netif_msg_tx_done(priv))
-				pr_debug("%s: restart transmit\n", __func__);
-			netif_wake_queue(priv->dev);
-		}
-		netif_tx_unlock(priv->dev);
+	    stmmac_tx_avail(priv) > STMMAC_TX_THRESH)) {
+		if (netif_msg_tx_done(priv))
+			pr_debug("%s: restart transmit\n", __func__);
+		netif_wake_queue(priv->dev);
 	}
 
 	if ((priv->eee_enabled) && (!priv->tx_path_in_lpi_mode)) {
 		stmmac_enable_eee_mode(priv);
 		mod_timer(&priv->eee_ctrl_timer, STMMAC_LPI_T(eee_timer));
 	}
-	spin_unlock(&priv->tx_lock);
+	netif_tx_unlock(priv->dev);
 }
 
 static inline void stmmac_enable_dma_irq(struct stmmac_priv *priv)
@@ -1998,8 +1993,6 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	u8 proto_hdr_len;
 	int i;
 
-	spin_lock(&priv->tx_lock);
-
 	/* Compute header lengths */
 	proto_hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
 
@@ -2011,7 +2004,6 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 			/* This is a hard error, log it. */
 			pr_err("%s: Tx Ring full when queue awake\n", __func__);
 		}
-		spin_unlock(&priv->tx_lock);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -2146,11 +2138,9 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 				       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -2182,10 +2172,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 			return stmmac_tso_xmit(skb, dev);
 	}
 
-	spin_lock(&priv->tx_lock);
-
 	if (unlikely(stmmac_tx_avail(priv) < nfrags + 1)) {
-		spin_unlock(&priv->tx_lock);
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
 			/* This is a hard error, log it. */
@@ -2357,11 +2344,9 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 		priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 					       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -3347,7 +3332,6 @@ int stmmac_dvr_probe(struct device *device,
 	netif_napi_add(ndev, &priv->napi, stmmac_poll, 64);
 
 	spin_lock_init(&priv->lock);
-	spin_lock_init(&priv->tx_lock);
 
 	ret = register_netdev(ndev);
 	if (ret) {
-- 
1.9.1

^ permalink raw reply related

* Re: [net-next PATCH v5 1/6] net: virtio dynamically disable/enable LRO
From: John Fastabend @ 2016-12-09  0:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: daniel, shm, davem, tgraf, alexei.starovoitov, john.r.fastabend,
	netdev, brouer
In-Reply-To: <20161208231857-mutt-send-email-mst@kernel.org>

On 16-12-08 01:36 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 07, 2016 at 12:11:11PM -0800, John Fastabend wrote:
>> This adds support for dynamically setting the LRO feature flag. The
>> message to control guest features in the backend uses the
>> CTRL_GUEST_OFFLOADS msg type.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>  drivers/net/virtio_net.c |   40 +++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 39 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index a21d93a..a5c47b1 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1419,6 +1419,36 @@ static void virtnet_init_settings(struct net_device *dev)
>>  	.set_settings = virtnet_set_settings,
>>  };
>>  
>> +static int virtnet_set_features(struct net_device *netdev,
>> +				netdev_features_t features)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(netdev);
>> +	struct virtio_device *vdev = vi->vdev;
>> +	struct scatterlist sg;
>> +	u64 offloads = 0;
>> +
>> +	if (features & NETIF_F_LRO)
>> +		offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) |
>> +			    (1 << VIRTIO_NET_F_GUEST_TSO6);
>> +
>> +	if (features & NETIF_F_RXCSUM)
>> +		offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM);
>> +
>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
>> +		sg_init_one(&sg, &offloads, sizeof(uint64_t));
>> +		if (!virtnet_send_command(vi,
>> +					  VIRTIO_NET_CTRL_GUEST_OFFLOADS,
>> +					  VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
>> +					  &sg)) {
> 
> Hmm I just realised that this will slow down setups that bridge
> virtio net interfaces since bridge calls this if provided.
> See below.


Really? What code is trying to turn off GRO via the GUEST_OFFLOADS LRO
command. My qemu/Linux setup has a set of tap/vhost devices attached to
a bridge and all of them have LRO enabled even with this patch series.

I must missing a setup handler somewhere?

> 
>> +			dev_warn(&netdev->dev,
>> +				 "Failed to set guest offloads by virtnet command.\n");
>> +			return -EINVAL;
>> +		}
>> +	}
> 
> Hmm if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS is off, this fails
> silently. It might actually be a good idea to avoid
> breaking setups.
> 
>> +
>> +	return 0;
>> +}
>> +
>>  static const struct net_device_ops virtnet_netdev = {
>>  	.ndo_open            = virtnet_open,
>>  	.ndo_stop   	     = virtnet_close,
>> @@ -1435,6 +1465,7 @@ static void virtnet_init_settings(struct net_device *dev)
>>  #ifdef CONFIG_NET_RX_BUSY_POLL
>>  	.ndo_busy_poll		= virtnet_busy_poll,
>>  #endif
>> +	.ndo_set_features	= virtnet_set_features,
>>  };
>>  
>>  static void virtnet_config_changed_work(struct work_struct *work)
>> @@ -1815,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev)
>>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
>>  		dev->features |= NETIF_F_RXCSUM;
>>  
>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
>> +	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
>> +		dev->features |= NETIF_F_LRO;
>> +		dev->hw_features |= NETIF_F_LRO;
> 
> So the issue is I think that the virtio "LRO" isn't really
> LRO, it's typically just GRO forwarded to guests.
> So these are easily re-split along MTU boundaries,
> which makes it ok to forward these across bridges.
> 
> It's not nice that we don't document this in the spec,
> but it's the reality and people rely on this.
> 
> For now, how about doing a custom thing and just disable/enable
> it as XDP is attached/detached?

The annoying part about doing this is ethtool will say that it is fixed
yet it will be changed by seemingly unrelated operation. I'm not sure I
like the idea to start automatically configuring the link via xdp_set.

> 
>> +	}
>> +
>>  	dev->vlan_features = dev->features;
>>  
>>  	/* MTU range: 68 - 65535 */
>> @@ -2057,7 +2094,8 @@ static int virtnet_restore(struct virtio_device *vdev)
>>  	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
>>  	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>>  	VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> -	VIRTIO_NET_F_MTU
>> +	VIRTIO_NET_F_MTU, \
>> +	VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
>>  
>>  static unsigned int features[] = {
>>  	VIRTNET_FEATURES,

^ permalink raw reply

* Re: fs, net: deadlock between bind/splice on af_unix
From: Cong Wang @ 2016-12-09  0:08 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Al Viro, linux-fsdevel@vger.kernel.org, LKML, David Miller,
	Rainer Weikusat, Hannes Frederic Sowa, netdev, Eric Dumazet,
	syzkaller
In-Reply-To: <CACT4Y+a-fTzW95ViVP3knhNTa=h6XFB0bMddcyQVeeV+LbXKHw@mail.gmail.com>

On Thu, Dec 8, 2016 at 8:30 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> Chain exists of:
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(sb_writers#5);
>                                lock(&u->bindlock);
>                                lock(sb_writers#5);
>   lock(&pipe->mutex/1);

This looks false positive, probably just needs lockdep_set_class()
to set keys for pipe->mutex and unix->bindlock.

^ permalink raw reply

* Re: net: deadlock on genl_mutex
From: Cong Wang @ 2016-12-09  0:13 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzkaller, Eric Dumazet, David Miller, Matti Vaittinen,
	Tycho Andersen, Florian Westphal, stephen hemminger, Tom Herbert,
	netdev, LKML, Richard Guy Briggs, netdev-owner
In-Reply-To: <CACT4Y+YaUL8F2L0h1VG3QZP=gSC4NJ=NOTG_dH5Awz=FOd1EVg@mail.gmail.com>

On Thu, Dec 8, 2016 at 10:02 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> Chain exists of:
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(nlk->cb_mutex);
>                                lock(&table[i].mutex);
>                                lock(nlk->cb_mutex);
>   lock(genl_mutex);

Similar to the unix bindlock, this one looks false positive to me too.

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Eric Dumazet @ 2016-12-09  0:30 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hannes Frederic Sowa, Tom Herbert,
	Linux Kernel Network Developers
In-Reply-To: <1481233016.11849.1@smtp.office365.com>

On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:

> We can reproduce the problem at will, still trying to run down the 
> problem.  I'll try and find one of the boxes that dumped a core and get 
> a bt of everybody.  Thanks,

OK, sounds good.

I had a look and :
- could not spot a fix that came after 4.6. 
- could not spot an obvious bug.

Anything special in the program triggering the issue ?
SO_REUSEPORT and/or special socket options ?

Thanks.

^ permalink raw reply

* Re: net: deadlock on genl_mutex
From: Cong Wang @ 2016-12-09  0:32 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzkaller, Eric Dumazet, David Miller, Matti Vaittinen,
	Tycho Andersen, Florian Westphal, stephen hemminger, Tom Herbert,
	netdev, LKML, Richard Guy Briggs, netdev-owner
In-Reply-To: <CACT4Y+Zy82UAJ55VbPbVadUM92ZSx1VJCFPdhhcmj53uxZ5PXQ@mail.gmail.com>

On Thu, Dec 8, 2016 at 9:16 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> Chain exists of:
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(genl_mutex);
>                                lock(nlk->cb_mutex);
>                                lock(genl_mutex);
>   lock(rtnl_mutex);
>
>  *** DEADLOCK ***

This one looks legitimate, because nlk->cb_mutex could be rtnl_mutex.
Let me think about it.

^ permalink raw reply

* linux-next: build warning after merge of the net-next tree
From: Stephen Rothwell @ 2016-12-09  0:34 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: linux-next, linux-kernel, Jacob Keller, Jeff Kirsher

Hi all,

After merging the net-next tree, today's linux-next build (powerpc
ppc64_defconfig) produced this warning:

In file included from include/linux/byteorder/big_endian.h:4:0,
                 from arch/powerpc/include/uapi/asm/byteorder.h:13,
                 from include/asm-generic/bitops/le.h:5,
                 from arch/powerpc/include/asm/bitops.h:279,
                 from include/linux/bitops.h:36,
                 from include/linux/kernel.h:10,
                 from include/linux/skbuff.h:17,
                 from include/linux/if_ether.h:23,
                 from include/linux/etherdevice.h:25,
                 from drivers/net/ethernet/intel/i40e/i40e_main.c:27:
drivers/net/ethernet/intel/i40e/i40e_main.c: In function 'i40e_sync_vsi_filters':
include/uapi/linux/byteorder/big_endian.h:34:26: warning: large integer implicitly truncated to unsigned type [-Woverflow]
 #define __cpu_to_le16(x) ((__force __le16)__swab16((x)))
                          ^
include/linux/byteorder/generic.h:89:21: note: in expansion of macro '__cpu_to_le16'
 #define cpu_to_le16 __cpu_to_le16
                     ^
drivers/net/ethernet/intel/i40e/i40e_main.c:2200:5: note: in expansion of macro 'cpu_to_le16'
     cpu_to_le16((u16)I40E_AQC_MM_ERR_NO_RES);
     ^

Introduced by commit

  ac9e23901441 ("i40e: refactor i40e_update_filter_state to avoid passing aq_err")

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: fix VxLAN-gpe port can't be created in ovs compat mode
From: Yang, Yi @ 2016-12-09  0:27 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: netdev, dev, jbenc
In-Reply-To: <CAOrHB_CVr0PQGsNkhrrvT7L8cX1KZypV4UXakXm1uuCz7ZPYHw@mail.gmail.com>

On Thu, Dec 08, 2016 at 11:41:58AM -0800, Pravin Shelar wrote:
> On Thu, Dec 8, 2016 at 12:20 AM, Yi Yang <yi.y.yang@intel.com> wrote:
> >
> > Signed-off-by: Yi Yang <yi.y.yang@intel.com>
> > ---
> >  include/uapi/linux/openvswitch.h |  1 +
> >  net/openvswitch/vport-vxlan.c    | 15 +++++++++++++++
> >  2 files changed, 16 insertions(+)
> >
> There is no need for this patch in upstream kernel module. I am open
> to having such a patch in out of tree kernel if it simplifies feature
> compatibility code.

I'm very glad to hear this :-), the goal is to enable current ovs to create
vxlan-gpe port in compat mode without new kernel help, I'll post a patch
for ovs, thanks a lot.

^ permalink raw reply

* Re: [PATCH 37/50] netfilter: nf_tables: atomic dump and reset for stateful objects
From: Paul Gortmaker @ 2016-12-09  0:40 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, David Miller, netdev, linux-next@vger.kernel.org
In-Reply-To: <1481147576-5690-38-git-send-email-pablo@netfilter.org>

On Wed, Dec 7, 2016 at 4:52 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic
> dump-and-reset of the stateful object. This also comes with add support
> for atomic dump and reset for counter and quota objects.

This triggered a new build failure in linux-next on parisc-32, which a
hands-off bisect
run lists as resulting from this:

ERROR: "__cmpxchg_u64" [net/netfilter/nft_counter.ko] undefined!
make[2]: *** [__modpost] Error 1
make[1]: *** [modules] Error 2
make: *** [sub-make] Error 2
43da04a593d8b2626f1cf4b56efe9402f6b53652 is the first bad commit
commit 43da04a593d8b2626f1cf4b56efe9402f6b53652
Author: Pablo Neira Ayuso <pablo@netfilter.org>
Date:   Mon Nov 28 00:05:44 2016 +0100

    netfilter: nf_tables: atomic dump and reset for stateful objects

    This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic
    dump-and-reset of the stateful object. This also comes with add support
    for atomic dump and reset for counter and quota objects.

    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

:040000 040000 6cd4554f69247e5c837db52342f26888beda1623
5908aca93c89e7922336546c3753bfcf2aceefba M      include
:040000 040000 f25d5831eb30972436bd198c5bb237a0cb0b4856
4ee5751c8de02bb5a8dcaadb2a2df7986d90f8e9 M      net
bisect run success

Guessing this is more an issue with parisc than it is with netfilter, but I
figured I'd mention it anyway.

Paul.
--




>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
>  include/net/netfilter/nf_tables.h        |  3 +-
>  include/uapi/linux/netfilter/nf_tables.h |  2 ++
>  net/netfilter/nf_tables_api.c            | 29 ++++++++++++-----
>  net/netfilter/nft_counter.c              | 56 +++++++++++++++++++++++++++-----
>  net/netfilter/nft_quota.c                | 18 ++++++----
>  5 files changed, 85 insertions(+), 23 deletions(-)
>
> diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
> index 903cd618f50e..6f7d6a1dc09c 100644
> --- a/include/net/netfilter/nf_tables.h
> +++ b/include/net/netfilter/nf_tables.h
> @@ -997,7 +997,8 @@ struct nft_object_type {
>                                                 struct nft_object *obj);
>         void                            (*destroy)(struct nft_object *obj);
>         int                             (*dump)(struct sk_buff *skb,
> -                                               const struct nft_object *obj);
> +                                               struct nft_object *obj,
> +                                               bool reset);
>  };
>
>  int nft_register_obj(struct nft_object_type *obj_type);
> diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
> index 3d47582caa80..399eac1eee91 100644
> --- a/include/uapi/linux/netfilter/nf_tables.h
> +++ b/include/uapi/linux/netfilter/nf_tables.h
> @@ -89,6 +89,7 @@ enum nft_verdicts {
>   * @NFT_MSG_NEWOBJ: create a stateful object (enum nft_obj_attributes)
>   * @NFT_MSG_GETOBJ: get a stateful object (enum nft_obj_attributes)
>   * @NFT_MSG_DELOBJ: delete a stateful object (enum nft_obj_attributes)
> + * @NFT_MSG_GETOBJ_RESET: get and reset a stateful object (enum nft_obj_attributes)
>   */
>  enum nf_tables_msg_types {
>         NFT_MSG_NEWTABLE,
> @@ -112,6 +113,7 @@ enum nf_tables_msg_types {
>         NFT_MSG_NEWOBJ,
>         NFT_MSG_GETOBJ,
>         NFT_MSG_DELOBJ,
> +       NFT_MSG_GETOBJ_RESET,
>         NFT_MSG_MAX,
>  };
>
> diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> index 2ae717c5dcb8..bfc015af366a 100644
> --- a/net/netfilter/nf_tables_api.c
> +++ b/net/netfilter/nf_tables_api.c
> @@ -3972,14 +3972,14 @@ static struct nft_object *nft_obj_init(const struct nft_object_type *type,
>  }
>
>  static int nft_object_dump(struct sk_buff *skb, unsigned int attr,
> -                          const struct nft_object *obj)
> +                          struct nft_object *obj, bool reset)
>  {
>         struct nlattr *nest;
>
>         nest = nla_nest_start(skb, attr);
>         if (!nest)
>                 goto nla_put_failure;
> -       if (obj->type->dump(skb, obj) < 0)
> +       if (obj->type->dump(skb, obj, reset) < 0)
>                 goto nla_put_failure;
>         nla_nest_end(skb, nest);
>         return 0;
> @@ -4096,7 +4096,7 @@ static int nf_tables_newobj(struct net *net, struct sock *nlsk,
>  static int nf_tables_fill_obj_info(struct sk_buff *skb, struct net *net,
>                                    u32 portid, u32 seq, int event, u32 flags,
>                                    int family, const struct nft_table *table,
> -                                  const struct nft_object *obj)
> +                                  struct nft_object *obj, bool reset)
>  {
>         struct nfgenmsg *nfmsg;
>         struct nlmsghdr *nlh;
> @@ -4115,7 +4115,7 @@ static int nf_tables_fill_obj_info(struct sk_buff *skb, struct net *net,
>             nla_put_string(skb, NFTA_OBJ_NAME, obj->name) ||
>             nla_put_be32(skb, NFTA_OBJ_TYPE, htonl(obj->type->type)) ||
>             nla_put_be32(skb, NFTA_OBJ_USE, htonl(obj->use)) ||
> -           nft_object_dump(skb, NFTA_OBJ_DATA, obj))
> +           nft_object_dump(skb, NFTA_OBJ_DATA, obj, reset))
>                 goto nla_put_failure;
>
>         nlmsg_end(skb, nlh);
> @@ -4131,10 +4131,14 @@ static int nf_tables_dump_obj(struct sk_buff *skb, struct netlink_callback *cb)
>         const struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
>         const struct nft_af_info *afi;
>         const struct nft_table *table;
> -       const struct nft_object *obj;
>         unsigned int idx = 0, s_idx = cb->args[0];
>         struct net *net = sock_net(skb->sk);
>         int family = nfmsg->nfgen_family;
> +       struct nft_object *obj;
> +       bool reset = false;
> +
> +       if (NFNL_MSG_TYPE(cb->nlh->nlmsg_type) == NFT_MSG_GETOBJ_RESET)
> +               reset = true;
>
>         rcu_read_lock();
>         cb->seq = net->nft.base_seq;
> @@ -4156,7 +4160,7 @@ static int nf_tables_dump_obj(struct sk_buff *skb, struct netlink_callback *cb)
>                                                             cb->nlh->nlmsg_seq,
>                                                             NFT_MSG_NEWOBJ,
>                                                             NLM_F_MULTI | NLM_F_APPEND,
> -                                                           afi->family, table, obj) < 0)
> +                                                           afi->family, table, obj, reset) < 0)
>                                         goto done;
>
>                                 nl_dump_check_consistent(cb, nlmsg_hdr(skb));
> @@ -4183,6 +4187,7 @@ static int nf_tables_getobj(struct net *net, struct sock *nlsk,
>         const struct nft_table *table;
>         struct nft_object *obj;
>         struct sk_buff *skb2;
> +       bool reset = false;
>         u32 objtype;
>         int err;
>
> @@ -4214,9 +4219,12 @@ static int nf_tables_getobj(struct net *net, struct sock *nlsk,
>         if (!skb2)
>                 return -ENOMEM;
>
> +       if (NFNL_MSG_TYPE(nlh->nlmsg_type) == NFT_MSG_GETOBJ_RESET)
> +               reset = true;
> +
>         err = nf_tables_fill_obj_info(skb2, net, NETLINK_CB(skb).portid,
>                                       nlh->nlmsg_seq, NFT_MSG_NEWOBJ, 0,
> -                                     family, table, obj);
> +                                     family, table, obj, reset);
>         if (err < 0)
>                 goto err;
>
> @@ -4291,7 +4299,7 @@ static int nf_tables_obj_notify(const struct nft_ctx *ctx,
>
>         err = nf_tables_fill_obj_info(skb, ctx->net, ctx->portid, ctx->seq,
>                                       event, 0, ctx->afi->family, ctx->table,
> -                                     obj);
> +                                     obj, false);
>         if (err < 0) {
>                 kfree_skb(skb);
>                 goto err;
> @@ -4482,6 +4490,11 @@ static const struct nfnl_callback nf_tables_cb[NFT_MSG_MAX] = {
>                 .attr_count     = NFTA_OBJ_MAX,
>                 .policy         = nft_obj_policy,
>         },
> +       [NFT_MSG_GETOBJ_RESET] = {
> +               .call           = nf_tables_getobj,
> +               .attr_count     = NFTA_OBJ_MAX,
> +               .policy         = nft_obj_policy,
> +       },
>  };
>
>  static void nft_chain_commit_update(struct nft_trans *trans)
> diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
> index 6f3dd429f865..f6a02c5071c2 100644
> --- a/net/netfilter/nft_counter.c
> +++ b/net/netfilter/nft_counter.c
> @@ -100,10 +100,10 @@ static void nft_counter_obj_destroy(struct nft_object *obj)
>         nft_counter_do_destroy(priv);
>  }
>
> -static void nft_counter_fetch(const struct nft_counter_percpu __percpu *counter,
> +static void nft_counter_fetch(struct nft_counter_percpu __percpu *counter,
>                               struct nft_counter *total)
>  {
> -       const struct nft_counter_percpu *cpu_stats;
> +       struct nft_counter_percpu *cpu_stats;
>         u64 bytes, packets;
>         unsigned int seq;
>         int cpu;
> @@ -122,12 +122,52 @@ static void nft_counter_fetch(const struct nft_counter_percpu __percpu *counter,
>         }
>  }
>
> +static u64 __nft_counter_reset(u64 *counter)
> +{
> +       u64 ret, old;
> +
> +       do {
> +               old = *counter;
> +               ret = cmpxchg64(counter, old, 0);
> +       } while (ret != old);
> +
> +       return ret;
> +}
> +
> +static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
> +                             struct nft_counter *total)
> +{
> +       struct nft_counter_percpu *cpu_stats;
> +       u64 bytes, packets;
> +       unsigned int seq;
> +       int cpu;
> +
> +       memset(total, 0, sizeof(*total));
> +       for_each_possible_cpu(cpu) {
> +               bytes = packets = 0;
> +
> +               cpu_stats = per_cpu_ptr(counter, cpu);
> +               do {
> +                       seq     = u64_stats_fetch_begin_irq(&cpu_stats->syncp);
> +                       packets += __nft_counter_reset(&cpu_stats->counter.packets);
> +                       bytes   += __nft_counter_reset(&cpu_stats->counter.bytes);
> +               } while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
> +
> +               total->packets += packets;
> +               total->bytes += bytes;
> +       }
> +}
> +
>  static int nft_counter_do_dump(struct sk_buff *skb,
> -                              const struct nft_counter_percpu_priv *priv)
> +                              const struct nft_counter_percpu_priv *priv,
> +                              bool reset)
>  {
>         struct nft_counter total;
>
> -       nft_counter_fetch(priv->counter, &total);
> +       if (reset)
> +               nft_counter_reset(priv->counter, &total);
> +       else
> +               nft_counter_fetch(priv->counter, &total);
>
>         if (nla_put_be64(skb, NFTA_COUNTER_BYTES, cpu_to_be64(total.bytes),
>                          NFTA_COUNTER_PAD) ||
> @@ -141,11 +181,11 @@ static int nft_counter_do_dump(struct sk_buff *skb,
>  }
>
>  static int nft_counter_obj_dump(struct sk_buff *skb,
> -                               const struct nft_object *obj)
> +                               struct nft_object *obj, bool reset)
>  {
> -       const struct nft_counter_percpu_priv *priv = nft_obj_data(obj);
> +       struct nft_counter_percpu_priv *priv = nft_obj_data(obj);
>
> -       return nft_counter_do_dump(skb, priv);
> +       return nft_counter_do_dump(skb, priv, reset);
>  }
>
>  static const struct nla_policy nft_counter_policy[NFTA_COUNTER_MAX + 1] = {
> @@ -178,7 +218,7 @@ static int nft_counter_dump(struct sk_buff *skb, const struct nft_expr *expr)
>  {
>         const struct nft_counter_percpu_priv *priv = nft_expr_priv(expr);
>
> -       return nft_counter_do_dump(skb, priv);
> +       return nft_counter_do_dump(skb, priv, false);
>  }
>
>  static int nft_counter_init(const struct nft_ctx *ctx,
> diff --git a/net/netfilter/nft_quota.c b/net/netfilter/nft_quota.c
> index 0d344209803a..5d25f57497cb 100644
> --- a/net/netfilter/nft_quota.c
> +++ b/net/netfilter/nft_quota.c
> @@ -83,12 +83,17 @@ static int nft_quota_obj_init(const struct nlattr * const tb[],
>         return nft_quota_do_init(tb, priv);
>  }
>
> -static int nft_quota_do_dump(struct sk_buff *skb, const struct nft_quota *priv)
> +static int nft_quota_do_dump(struct sk_buff *skb, struct nft_quota *priv,
> +                            bool reset)
>  {
>         u32 flags = priv->invert ? NFT_QUOTA_F_INV : 0;
>         u64 consumed;
>
> -       consumed = atomic64_read(&priv->consumed);
> +       if (reset)
> +               consumed = atomic64_xchg(&priv->consumed, 0);
> +       else
> +               consumed = atomic64_read(&priv->consumed);
> +
>         /* Since we inconditionally increment consumed quota for each packet
>          * that we see, don't go over the quota boundary in what we send to
>          * userspace.
> @@ -108,11 +113,12 @@ static int nft_quota_do_dump(struct sk_buff *skb, const struct nft_quota *priv)
>         return -1;
>  }
>
> -static int nft_quota_obj_dump(struct sk_buff *skb, const struct nft_object *obj)
> +static int nft_quota_obj_dump(struct sk_buff *skb, struct nft_object *obj,
> +                             bool reset)
>  {
>         struct nft_quota *priv = nft_obj_data(obj);
>
> -       return nft_quota_do_dump(skb, priv);
> +       return nft_quota_do_dump(skb, priv, reset);
>  }
>
>  static struct nft_object_type nft_quota_obj __read_mostly = {
> @@ -146,9 +152,9 @@ static int nft_quota_init(const struct nft_ctx *ctx,
>
>  static int nft_quota_dump(struct sk_buff *skb, const struct nft_expr *expr)
>  {
> -       const struct nft_quota *priv = nft_expr_priv(expr);
> +       struct nft_quota *priv = nft_expr_priv(expr);
>
> -       return nft_quota_do_dump(skb, priv);
> +       return nft_quota_do_dump(skb, priv, false);
>  }
>
>  static struct nft_expr_type nft_quota_type;
> --
> 2.1.4
>

^ permalink raw reply

* Re: [Intel-wired-lan] [RFC PATCH] i40e: enable PCIe relax ordering for SPARC
From: tndave @ 2016-12-09  0:45 UTC (permalink / raw)
  To: Alexander Duyck, David Laight; +Cc: Jeff Kirsher, intel-wired-lan, Netdev
In-Reply-To: <CAKgT0Uf3aZBoziv7B0tFMovYe9JNWEYZAZhe9iof2XcuRJD+uw@mail.gmail.com>



On 12/08/2016 08:05 AM, Alexander Duyck wrote:
> On Thu, Dec 8, 2016 at 2:43 AM, David Laight
> <David.Laight@aculab.com> wrote:
>> From: Alexander Duyck
>>> Sent: 06 December 2016 17:10
>> ...
>>> I was thinking about it and I realized we can probably simplify
>>> this even further.  In the case of most other architectures the
>>> DMA_ATTR_WEAK_ORDERING has no effect anyway.  So from what I can
>>> tell there is probably no reason not to just always pass that
>>> attribute with the DMA mappings.  From what I can tell the only
>>> other architecture that uses this is the PowerPC Cell
>>> architecture.
>>
>> And I should have read all the thread :-(
>>
>>> Also I was wondering if you actually needed to enable this
>>> attribute for both Rx and Tx buffers or just Rx buffers?  The
>>> patch that enabled DMA_ATTR_WEAK_ORDERING for Sparc64 seems to
>>> call out writes, but I didn't see anything about reads.  I'm just
>>> wondering if changing the code for Tx has any effect?  If not you
>>> could probably drop those changes and just focus on Rx.
>>
>> 'Weak ordering' only applies to PCIe read transfers, so can only
>> have an effect on descriptor reads and transmit buffer reads.
>>
>> Basically PCIe is a comms protocol and an endpoint (or the host)
>> can have multiple outstanding read requests (each of which might
>> generate multiple response messages. The responses for each request
>> must arrive in order, but responses for different requests can be
>> interleaved. Setting 'not weak ordering' lets the host interwork
>> with broken endpoints. (Or, like we did, you fix the fpga's PCIe
>> implementation.)
>
> I get the basics of relaxed ordering.  The question is how does the
> Sparc64 IOMMU translate DMA_ATTR_WEAK_ORDERING into relaxed ordering
> messages, and at what level the ordering is relaxed.  Odds are the
> wording in the description where this attribute was added to Sparc
> is just awkward, but I was wanting to verify if this only applies to
> writes, or also read completions.
In Sparc64, passing DMA_ATTR_WEAK_ORDERING in dma map/unmap only affects 
PCIe root complex (Hostbridge). Using DMA_ATTR_WEAK_ORDERING, requested 
DMA transaction can be relaxed ordered within the PCIe root complex.

In Sparc64, memory writes can be held at PCIe root complex not letting
other memory writes to go through. By passing DMA_ATTR_WEAK_ORDERING in
dma map/unmap allows memory writes to bypass other memory writes in PCIe
root complex. (This applies to only PCIe root complex and does not 
affect at any other level of PCIe hierarchy e.g. PCIe bridges et al. 
Also the PCIe root complex when bypassing memory writes does follow PCIe 
relax ordering rules as per PCIe specification.

For reference [old but still relevant write-up]: PCI-Express Relaxed 
Ordering and the Sun SPARC Enterprise M-class Servers
https://blogs.oracle.com/olympus/entry/relaxed_ordering

>
>> In this case you need the reads of both transmit and receive rings
>> to 'overtake' reads of transmit data.
>
> Actually that isn't quite right.  With relaxed ordering completions
> and writes can pass each other if I recall correctly, but reads will
> always force all writes ahead of them to be completed before you can
> begin generating the read completions.
That is my understanding as well.

>
>> I'm not at all clear how this 'flag' can be set on dma_map(). It is
>> a property of the PCIe subsystem.
Because in Sparc64, passing DMA_ATTR_WEAK_ORDERING flag in DMA map/unmap 
adds an entry in IOMMU/ATU table so that an access to requested DMA 
address from PCIe root complex can be relaxed ordered.
>
> That was where my original question on this came in.  We can do a
> blanket enable of relaxed ordering for Tx and Rx data buffers, but
> if we only need it on Rx then there isn't any need for us to make
> unnecessary changes.
I ran some quick test and it is likely that we don't need
DMA_ATTR_WEAK_ORDERING for any TX dma buffer (because in case of TX dma
buffers, its all memory reads from device).

-Tushar
>
> - Alex
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox