Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] net: skb_segment() should preserve backpressure
From: David Miller @ 2014-10-29 18:47 UTC (permalink / raw)
  To: eric.dumazet; +Cc: makita.toshiaki, netdev, herbert
In-Reply-To: <1414431051.16231.23.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Oct 2014 10:30:51 -0700

> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
> This patch generalizes commit d6a4a1041176 ("tcp: GSO should be TSQ
> friendly") to protocols using skb_set_owner_w()
> 
> TCP uses its own destructor (tcp_wfree) and needs a more complex scheme
> as explained in commit 6ff50cd55545 ("tcp: gso: do not generate out of
> order packets")
> 
> This allows UDP sockets using UFO to get proper backpressure,
> thus avoiding qdisc drops and excessive cpu usage.
> 
> Here are performance test results (macvlan on vlan):
 ...
> [edumazet] Rewrote patch and changelog.
> 
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks everyone.

^ permalink raw reply

* Re: [PATCH net] cxgb4vf: Replace repetitive pci device ID's with right ones
From: David Miller @ 2014-10-29 18:48 UTC (permalink / raw)
  To: hariprasad; +Cc: netdev, leedom, kumaras, nirranjan, santosh, anish
In-Reply-To: <1414432330-12130-1-git-send-email-hariprasad@chelsio.com>

From: Hariprasad Shenai <hariprasad@chelsio.com>
Date: Mon, 27 Oct 2014 23:22:10 +0530

> Replaced repetive Device ID's which got added in commit b961f9a48844ecf3
> ("cxgb4vf: Remove superfluous "idx" parameter of CH_DEVICE() macro")
> 
> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH] net: smc91x: Fix gpios for device tree based booting
From: David Miller @ 2014-10-29 18:51 UTC (permalink / raw)
  To: tony; +Cc: netdev, devicetree, linux-omap, khilman
In-Reply-To: <20141027202556.GT2560@atomide.com>

From: Tony Lindgren <tony@atomide.com>
Date: Mon, 27 Oct 2014 13:25:56 -0700

> +/**
> + * of_try_set_control_gpio - configure a gpio if it exists
> + */
> +static int try_toggle_control_gpio(struct device *dev,
> +				   struct gpio_desc **desc,
> +				   const char *name, int index,
> +				   int value, unsigned int nsdelay)

This needs to be under CONFIG_OF cpp protection just like the code that
calls it.

^ permalink raw reply

* Re: [PATCH net-next] neigh: optimize neigh_parms_release()
From: Eric Dumazet @ 2014-10-29 19:01 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, davem
In-Reply-To: <1414607371-4246-1-git-send-email-nicolas.dichtel@6wind.com>

On Wed, 2014-10-29 at 19:29 +0100, Nicolas Dichtel wrote:
> In neigh_parms_release() we loop over all entries to find the entry given in
> argument and being able to remove it from the list. By using a double linked
> list, we can avoid this loop.

Acked-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply

* Re: [RFC] ipv4: Do not cache routing failures due to disabled forwarding.
From: David Miller @ 2014-10-29 19:03 UTC (permalink / raw)
  To: nicolas.cavallari; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1410531260-13794-2-git-send-email-nicolas.cavallari@green-communications.fr>

From: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
Date: Fri, 12 Sep 2014 16:14:20 +0200

> If we cache them, the kernel will reuse them, independently of
> whether forwarding is enabled or not.  Which means that if forwarding is
> disabled on the input interface where the first routing request comes
> from, then that unreachable result will be cached and reused for
> other interfaces, even if forwarding is enabled on them.
> 
> This can be verified with two interfaces A and B and an output interface
> C, where B has forwarding enabled, but not A and trying
> ip route get $dst iif A from $src && ip route get $dst iif B from $src
> 
> Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
> ---
> based on net-next, but not really tested on top of it.

Sorry Nicolas, this seems to have fallen on the floor.  Could you please
resubmit your most uptodate version of this patch so I can apply it?

Thanks.

^ permalink raw reply

* Re: [PATCH v2] net: ethernet: realtek: atp: checkpatch errors and warnings corrected
From: David Miller @ 2014-10-29 19:05 UTC (permalink / raw)
  To: robertoxmed; +Cc: netdev, linux-kernel
In-Reply-To: <1414453916-27976-1-git-send-email-robertoxmed@gmail.com>

From: Roberto Medina <robertoxmed@gmail.com>
Date: Tue, 28 Oct 2014 00:51:56 +0100

> From: Roberto Medina <robertoxmed@gmail.com>
> 
> Several warnings and errors of coding style rules corrected.
> Compile tested.
> 
> Signed-off-by: Roberto Medina <robertoxmed@gmail.com>

Applied, thanks.

^ permalink raw reply

* PATCH: Add IFF_ALLMULTI support to cpsw (and fix IFF_PROMISC support too)
From: Lennart Sorensen @ 2014-10-29 19:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Len Sorensen, Mugunthan V N, David S. Miller, netdev

The cpsw driver did not support the IFF_ALLMULTI flag which makes dynamic
multicast routing rather hard to handle.  Related to this, when enabling
IFF_PROMISC in non dual_emac mode, all registered vlans are flushed,
and only broadcast and unicast are allowed which isn't what you would
want from IFF_PROMISC.

A new cpsw_ale_set_allmulti function now scans through the ALE entry
table and adds or removes the host port from the unregistered multicast
port mask depending on the state of IFF_ALLMULTI.  In promisc mode,
it is also called to enable reception of all multicast traffic.

With this change I am now able to run dynamic multicast routing and also
use tcpdump and actually see multicast traffic.

Signed-off-by: Len Sorensen <lsorense@csclub.uwaterloo.ca>

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 952e1e4..96a61d1 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -638,12 +638,16 @@ static void cpsw_ndo_set_rx_mode(struct net_device *ndev)
 	if (ndev->flags & IFF_PROMISC) {
 		/* Enable promiscuous mode */
 		cpsw_set_promiscious(ndev, true);
+		cpsw_ale_set_allmulti(priv->ale, IFF_ALLMULTI);
 		return;
 	} else {
 		/* Disable promiscuous mode */
 		cpsw_set_promiscious(ndev, false);
 	}
 
+	/* Restore allmulti on vlans if necessary */
+	cpsw_ale_set_allmulti(priv->ale, priv->ndev->flags & IFF_ALLMULTI);
+
 	/* Clear all mcast from ALE */
 	cpsw_ale_flush_multicast(priv->ale, ALE_ALL_PORTS << priv->host_port);
 
@@ -1149,6 +1153,7 @@ static inline void cpsw_add_default_vlan(struct cpsw_priv *priv)
 	const int port = priv->host_port;
 	u32 reg;
 	int i;
+	int unreg_mcast_mask;
 
 	reg = (priv->version == CPSW_VERSION_1) ? CPSW1_PORT_VLAN :
 	       CPSW2_PORT_VLAN;
@@ -1158,9 +1163,14 @@ static inline void cpsw_add_default_vlan(struct cpsw_priv *priv)
 	for (i = 0; i < priv->data.slaves; i++)
 		slave_write(priv->slaves + i, vlan, reg);
 
+	if (priv->ndev->flags & IFF_ALLMULTI)
+		unreg_mcast_mask = ALE_ALL_PORTS;
+	else
+		unreg_mcast_mask = ALE_PORT_1 | ALE_PORT_2;
+
 	cpsw_ale_add_vlan(priv->ale, vlan, ALE_ALL_PORTS << port,
 			  ALE_ALL_PORTS << port, ALE_ALL_PORTS << port,
-			  (ALE_PORT_1 | ALE_PORT_2) << port);
+			  unreg_mcast_mask << port);
 }
 
 static void cpsw_init_host_port(struct cpsw_priv *priv)
@@ -1620,11 +1630,17 @@ static inline int cpsw_add_vlan_ale_entry(struct cpsw_priv *priv,
 				unsigned short vid)
 {
 	int ret;
+	int unreg_mcast_mask;
+
+	if (priv->ndev->flags & IFF_ALLMULTI)
+		unreg_mcast_mask = ALE_ALL_PORTS;
+	else
+		unreg_mcast_mask = ALE_PORT_1 | ALE_PORT_2;
 
 	ret = cpsw_ale_add_vlan(priv->ale, vid,
 				ALE_ALL_PORTS << priv->host_port,
 				0, ALE_ALL_PORTS << priv->host_port,
-				(ALE_PORT_1 | ALE_PORT_2) << priv->host_port);
+				unreg_mcast_mask << priv->host_port);
 	if (ret != 0)
 		return ret;
 
diff --git a/drivers/net/ethernet/ti/cpsw_ale.c b/drivers/net/ethernet/ti/cpsw_ale.c
index 0579b22..3ae8387 100644
--- a/drivers/net/ethernet/ti/cpsw_ale.c
+++ b/drivers/net/ethernet/ti/cpsw_ale.c
@@ -443,6 +443,35 @@ int cpsw_ale_del_vlan(struct cpsw_ale *ale, u16 vid, int port_mask)
 	return 0;
 }
 
+void cpsw_ale_set_allmulti(struct cpsw_ale *ale, int allmulti)
+{
+	u32 ale_entry[ALE_ENTRY_WORDS];
+	int type, idx;
+	int unreg_mcast = 0;
+
+	/* Only bother doing the work if the setting is actually changing */
+	if (ale->allmulti == allmulti)
+		return;
+
+	/* Remember the new setting to check against next time */
+	ale->allmulti = allmulti;
+
+	for (idx = 0; idx < ale->params.ale_entries; idx++) {
+		cpsw_ale_read(ale, idx, ale_entry);
+		type = cpsw_ale_get_entry_type(ale_entry);
+		if (type != ALE_TYPE_VLAN)
+			continue;
+
+		unreg_mcast = cpsw_ale_get_vlan_unreg_mcast(ale_entry);
+		if (allmulti)
+			unreg_mcast |= 1;
+		else
+			unreg_mcast &= ~1;
+		cpsw_ale_set_vlan_unreg_mcast(ale_entry, unreg_mcast);
+		cpsw_ale_write(ale, idx, ale_entry);
+	}
+}
+
 struct ale_control_info {
 	const char	*name;
 	int		offset, port_offset;
diff --git a/drivers/net/ethernet/ti/cpsw_ale.h b/drivers/net/ethernet/ti/cpsw_ale.h
index 31cf43c..c0d4127 100644
--- a/drivers/net/ethernet/ti/cpsw_ale.h
+++ b/drivers/net/ethernet/ti/cpsw_ale.h
@@ -27,6 +27,7 @@ struct cpsw_ale {
 	struct cpsw_ale_params	params;
 	struct timer_list	timer;
 	unsigned long		ageout;
+	int			allmulti;
 };
 
 enum cpsw_ale_control {
@@ -103,6 +104,7 @@ int cpsw_ale_del_mcast(struct cpsw_ale *ale, u8 *addr, int port_mask,
 int cpsw_ale_add_vlan(struct cpsw_ale *ale, u16 vid, int port, int untag,
 			int reg_mcast, int unreg_mcast);
 int cpsw_ale_del_vlan(struct cpsw_ale *ale, u16 vid, int port);
+void cpsw_ale_set_allmulti(struct cpsw_ale *ale, int allmulti);
 
 int cpsw_ale_control_get(struct cpsw_ale *ale, int port, int control);
 int cpsw_ale_control_set(struct cpsw_ale *ale, int port,

-- 
Len Sorensen

^ permalink raw reply related

* Re: [PATCH net-next] tcp: allow for bigger reordering level
From: David Miller @ 2014-10-29 19:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, wygivan
In-Reply-To: <1414471524.13259.14.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Oct 2014 21:45:24 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> While testing upcoming Yaogong patch (converting out of order queue
> into an RB tree), I hit the max reordering level of linux TCP stack.
> 
> Reordering level was limited to 127 for no good reason, and some
> network setups [1] can easily reach this limit and get limited
> throughput.
> 
> Allow a new max limit of 300, and add a sysctl to allow admins to even
> allow bigger (or lower) values if needed.
> 
> [1] Aggregation of links, per packet load balancing, fabrics not doing
>  deep packet inspections, alternative TCP congestion modules...
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I'll apply this, thanks.

However in the longer term I'd say that this value, if it is to have a
limit, then such a limit should probably be scaled based upon the
window size.

^ permalink raw reply

* Re: [PATCH net 1/1] cnic: Update the rcu_access_pointer() usages
From: David Miller @ 2014-10-29 19:07 UTC (permalink / raw)
  To: nilesh.javali
  Cc: netdev, Dept-GELinuxNICDev, sudarsana.kalluru, vikas.chaudhary,
	giridhar.malavali, tej.parkash
In-Reply-To: <1414473495-24790-2-git-send-email-nilesh.javali@qlogic.com>

From: Nilesh Javali <nilesh.javali@qlogic.com>
Date: Tue, 28 Oct 2014 01:18:15 -0400

> From: Tej Parkash <tej.parkash@qlogic.com>
> 
> 1. Remove the rcu_read_lock/unlock around rcu_access_pointer
> 2. Replace the rcu_dereference with rcu_access_pointer
> 
> Signed-off-by: Tej Parkash <tej.parkash@qlogic.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 0/2] r8152: support nway_reset
From: David Miller @ 2014-10-29 19:10 UTC (permalink / raw)
  To: hayeswang; +Cc: netdev, nic_swsd, linux-kernel, linux-usb
In-Reply-To: <1394712342-15778-66-Taiwan-albertk@realtek.com>

From: Hayes Wang <hayeswang@realtek.com>
Date: Tue, 28 Oct 2014 14:05:50 +0800

> Fix the CHECK from checkpatch.pl and support nway_reset.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH v2 net-next] net: ipv6: Add a sysctl to make optimistic addresses useful candidates
From: David Miller @ 2014-10-29 19:12 UTC (permalink / raw)
  To: ek; +Cc: netdev, ben, lorenzo, hannes
In-Reply-To: <1414487474-18201-1-git-send-email-ek@google.com>

From: Erik Kline <ek@google.com>
Date: Tue, 28 Oct 2014 18:11:14 +0900

> Add a sysctl that causes an interface's optimistic addresses
> to be considered equivalent to other non-deprecated addresses
> for source address selection purposes.  Preferred addresses
> will still take precedence over optimistic addresses, subject
> to other ranking in the source address selection algorithm.
> 
> This is useful where different interfaces are connected to
> different networks from different ISPs (e.g., a cell network
> and a home wifi network).
> 
> The current behaviour complies with RFC 3484/6724, and it
> makes sense if the host has only one interface, or has
> multiple interfaces on the same network (same or cooperating
> administrative domain(s), but not in the multiple distinct
> networks case.
> 
> For example, if a mobile device has an IPv6 address on an LTE
> network and then connects to IPv6-enabled wifi, while the wifi
> IPv6 address is undergoing DAD, IPv6 connections will try use
> the wifi default route with the LTE IPv6 address, and will get
> stuck until they time out.
> 
> Also, because optimistic nodes can receive frames, issue
> an RTM_NEWADDR as soon as DAD starts (with the IFA_F_OPTIMSTIC
> flag appropriately set).  A second RTM_NEWADDR is sent if DAD
> completes (the address flags have changed), otherwise an
> RTM_DELADDR is sent.
> 
> Also: add an entry in ip-sysctl.txt for optimistic_dad.
> 
> Signed-off-by: Erik Kline <ek@google.com>

Applied, if the idev != NULL check proves to be unnecessary we can
kill it in a follow-on patch.

Thanks.

^ permalink raw reply

* Re: [net 1/2] sctp: add transport state in /proc/net/sctp/remaddr
From: David Miller @ 2014-10-29 19:18 UTC (permalink / raw)
  To: nhorman; +Cc: michele, linux-sctp, vyasevich, netdev, dborkman
In-Reply-To: <20141028102741.GA18365@hmsreliant.think-freely.org>

From: Neil Horman <nhorman@tuxdriver.com>
Date: Tue, 28 Oct 2014 06:27:41 -0400

> On Mon, Oct 27, 2014 at 06:55:45PM -0400, David Miller wrote:
>> From: Michele Baldessari <michele@acksyn.org>
>> Date: Thu, 23 Oct 2014 21:48:40 +0200
>> 
>> > It is often quite helpful to be able to know the state of a transport
>> > outside of the application itself (for troubleshooting purposes or for
>> > monitoring purposes). Add it under /proc/net/sctp/remaddr.
>> > 
>> > Signed-off-by: Michele Baldessari <michele@acksyn.org>
>> 
>> You can't change the layout of procfs files, applications parse
>> these files and any modification can potentially break such tools.
>> 
>> Secondly, even if this change were acceptable, targetting this
>> change at anything other than the net-next tree is not appropriate
>> because it is a new feature.
>> 
> 
> Agree on the net-next submission, though there is precident for extending this
> procfile, as we've done it a few times in the past to this, and other files in
> the sctp area (see commits f406c8b9693f2f71ef2caeb0b68521a7d22d00f0 and
> 58fbbed4fbc0094fc808a568fe99a915f85402ee)

Fair enough.

^ permalink raw reply

* Re: [PATCH 6/7] can: m_can: update to support CAN FD features
From: Oliver Hartkopp @ 2014-10-29 19:21 UTC (permalink / raw)
  To: Dong Aisheng, linux-can; +Cc: mkl, wg, varkabhadram, netdev, linux-arm-kernel
In-Reply-To: <1414579527-31100-6-git-send-email-b29396@freescale.com>

Hello Dong,

thanks for your update to support CAN FD with the 3.0.x M_CAN IP core.

AFAIK from the last CAN in Automation (CiA) Plugfest which took place in
Nuremberg yesterday, there are two more IP cores on the way:

v3.0.1 / v3.0.2 (the current spec from the Bosch website)

v3.1.0 which will support per-frame CAN/CANFD switching in the tx path
(the FDF/(former)EDL bit and the BRS bit appear in the TX buffer element at
the bit position you know from reading the RX FIFO element)

v3.2.0 which will support the final(?) ISO specification with a bitstuffing
counter before the CRC in the frame on the wire.

So first I would suggest to check the core release register (CREL) to be
version 3.0.x and quit the driver initialization if it doesn't fit this
version. I would suggest to provide a separate patch for that check. What
about printing the core release version into the kernel log at driver
initialization time?

Regarding the CAN FD support in this patch I have some remarks in the text ...

On 10/29/2014 11:45 AM, Dong Aisheng wrote:

>  /* Rx Buffer Element */
> +/* R0 */
>  #define RX_BUF_ESI		BIT(31)
>  #define RX_BUF_XTD		BIT(30)
>  #define RX_BUF_RTR		BIT(29)
> +/* R1 */
> +#define RX_BUF_ANMF		BIT(31)
> +#define RX_BUF_EDL		BIT(21)
> +#define RX_BUF_BRS		BIT(20)
>  
>  /* Tx Buffer Element */
> +/* R0 */
>  #define TX_BUF_XTD		BIT(30)
>  #define TX_BUF_RTR		BIT(29)
>  
> @@ -327,11 +357,12 @@ static inline void m_can_disable_all_interrupts(const struct m_can_priv *priv)
>  	m_can_write(priv, M_CAN_ILE, 0x0);
>  }
>  
> -static void m_can_read_fifo(const struct net_device *dev, struct can_frame *cf,
> +static void m_can_read_fifo(const struct net_device *dev, struct canfd_frame *cf,
>  			    u32 rxfs)
>  {
>  	struct m_can_priv *priv = netdev_priv(dev);
>  	u32 id, fgi;
> +	int i;
>  
>  	/* calculate the fifo get index for where to read data */
>  	fgi = (rxfs & RXFS_FGI_MASK) >> RXFS_FGI_OFF;
> @@ -341,15 +372,23 @@ static void m_can_read_fifo(const struct net_device *dev, struct can_frame *cf,
>  	else
>  		cf->can_id = (id >> 18) & CAN_SFF_MASK;
>  
> +	if (id & RX_BUF_ESI) {
> +		cf->flags |= CANFD_ESI;
> +		netdev_dbg(dev, "ESI Error\n");
> +	}
> +
>  	if (id & RX_BUF_RTR) {
>  		cf->can_id |= CAN_RTR_FLAG;

When RX_BUF_EDL is set you should not check for RX_BUF_RTR as RTR is not
allowed for CAN FD.

>  	} else {
>  		id = m_can_fifo_read(priv, fgi, M_CAN_FIFO_DLC);
> -		cf->can_dlc = get_can_dlc((id >> 16) & 0x0F);
> -		*(u32 *)(cf->data + 0) = m_can_fifo_read(priv, fgi,
> -							 M_CAN_FIFO_DATA(0));
> -		*(u32 *)(cf->data + 4) = m_can_fifo_read(priv, fgi,
> -							 M_CAN_FIFO_DATA(1));
> +		cf->len = can_dlc2len(get_canfd_dlc((id >> 16) & 0x0F));
> +
> +		if (id & RX_BUF_BRS)
> +			cf->flags |= CANFD_BRS;
> +
> +		for (i = 0; i < cf->len; i += 4)
> +			*(u32 *)(cf->data + i) =
> +				m_can_fifo_read(priv, fgi, M_CAN_FIFO_DATA(i / 4));
>  	}
>  
>  	/* acknowledge rx fifo 0 */
> @@ -361,7 +400,7 @@ static int m_can_do_rx_poll(struct net_device *dev, int quota)
>  	struct m_can_priv *priv = netdev_priv(dev);
>  	struct net_device_stats *stats = &dev->stats;
>  	struct sk_buff *skb;
> -	struct can_frame *frame;
> +	struct canfd_frame *frame;
>  	u32 pkts = 0;
>  	u32 rxfs;
>  
> @@ -375,7 +414,7 @@ static int m_can_do_rx_poll(struct net_device *dev, int quota)
>  		if (rxfs & RXFS_RFL)
>  			netdev_warn(dev, "Rx FIFO 0 Message Lost\n");
>  
> -		skb = alloc_can_skb(dev, &frame);
> +		skb = alloc_canfd_skb(dev, &frame);

You are *always* allocating CAN FD frames now?

Depending on RX_BUF_EDL in the RX FIFO message you should create a CAN or CAN
FD frame.

Of course you can use the struct canfd_frame in m_can_read_fifo() as the
layout of the struct can_frame is identical when filled with 'normal' CAN
frame content.

But you need to distinguish whether it is a CAN or CAN FD frame when
allocating the skb based on the RX_BUF_EDL value.

>  		if (!skb) {
>  			stats->rx_dropped++;
>  			return pkts;
> @@ -384,7 +423,7 @@ static int m_can_do_rx_poll(struct net_device *dev, int quota)
>  		m_can_read_fifo(dev, frame, rxfs);
>  
>  		stats->rx_packets++;
> -		stats->rx_bytes += frame->can_dlc;
> +		stats->rx_bytes += frame->len;
>  
>  		netif_receive_skb(skb);
>  

The rest of your entire patch set looks very good from my perspective.

Best regards,
Oliver



^ permalink raw reply

* Re: [PATCH net] inet: frags: fix a race between inet_evict_bucket and inet_frag_kill
From: David Miller @ 2014-10-29 19:21 UTC (permalink / raw)
  To: nikolay; +Cc: netdev, fw, eric.dumazet, chutzpah
In-Reply-To: <1414488634-28412-1-git-send-email-nikolay@redhat.com>

From: Nikolay Aleksandrov <nikolay@redhat.com>
Date: Tue, 28 Oct 2014 10:30:34 +0100

> When the evictor is running it adds some chosen frags to a local list to
> be evicted once the chain lock has been released but at the same time
> the *frag_queue can be running for some of the same queues and it
> may call inet_frag_kill which will wait on the chain lock and
> will then delete the queue from the wrong list since it was added in the
> eviction one. The fix is simple - check if the queue has the evict flag
> set under the chain lock before deleting it, this is safe because the
> evict flag is set only under that lock and having the flag set also means
> that the queue has been detached from the chain list, so no need to delete
> it again.
> An important note to make is that we're safe w.r.t refcnt because
> inet_frag_kill and inet_evict_bucket will sync on the del_timer operation
> where only one of the two can succeed (or if the timer is executing -
> none of them), the cases are:
> 1. inet_frag_kill succeeds in del_timer
>  - then the timer ref is removed, but inet_evict_bucket will not add
>    this queue to its expire list but will restart eviction in that chain
> 2. inet_evict_bucket succeeds in del_timer
>  - then the timer ref is kept until the evictor "expires" the queue, but
>    inet_frag_kill will remove the initial ref and will set
>    INET_FRAG_COMPLETE which will make the frag_expire fn just to remove
>    its ref.
> In the end all of the queue users will do an inet_frag_put and the one
> that reaches 0 will free it. The refcount balance should be okay.
> 
> CC: Florian Westphal <fw@strlen.de>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Patrick McLean <chutzpah@gentoo.org>
> 
> Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue")
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Reported-by: Patrick McLean <chutzpah@gentoo.org>
> Tested-by: Patrick McLean <chutzpah@gentoo.org>
> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
> ---
> A few more eyes to confirm all of this would be much appreciated.

I've applied this and tentatively scheduled it for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net] inet: frags: remove the WARN_ON from inet_evict_bucket
From: David Miller @ 2014-10-29 19:22 UTC (permalink / raw)
  To: nikolay; +Cc: netdev, fw, eric.dumazet, chutzpah
In-Reply-To: <1414489441-28578-1-git-send-email-nikolay@redhat.com>

From: Nikolay Aleksandrov <nikolay@redhat.com>
Date: Tue, 28 Oct 2014 10:44:01 +0100

> The WARN_ON in inet_evict_bucket can be triggered by a valid case:
> inet_frag_kill and inet_evict_bucket can be running in parallel on the
> same queue which means that there has been at least one more ref added
> by a previous inet_frag_find call, but inet_frag_kill can delete the
> timer before inet_evict_bucket which will cause the WARN_ON() there to
> trigger since we'll have refcnt!=1. Now, this case is valid because the
> queue is being "killed" for some reason (removed from the chain list and
> its timer deleted) so it will get destroyed in the end by one of the
> inet_frag_put() calls which reaches 0 i.e. refcnt is still valid.
> 
> CC: Florian Westphal <fw@strlen.de>
> CC: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Patrick McLean <chutzpah@gentoo.org>
> 
> Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue")
> Reported-by: Patrick McLean <chutzpah@gentoo.org>
> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
> ---
> I'm sending this as a separate patch so the race fix doesn't get blocked
> in case I'm wrong and also it's a different issue.

Also applied, thanks.

^ permalink raw reply

* [PATCH net-next 0/3] sunvnet: Use multiple Tx queues.
From: Sowmini Varadhan @ 2014-10-29 19:27 UTC (permalink / raw)
  To: davem, sowmini.varadhan; +Cc: netdev

The primary objective of this patch-set is to address the suggestion from
  http://marc.info/?l=linux-netdev&m=140790778931563&w=2
With the changes in Patch 3, every vnet_port will get  packets from 
a single tx-queue, and flow-control/head-of-line-blocking is 
confined to the vnet_ports that share that tx queue (as opposed to 
flow-controlling *all* peers).

Patch 1 is a minor correction to a comment that has a typo in 
the RFC number cited for the challenge-ack generation.

Patch 2 is an optimization that resets the DATA_READY bit when
we re-enable Rx interrupts.  This optimization lets us exit quickly 
from vnet_event_napi() when new data has not triggered an interrupt.

Sowmini Varadhan (3):
  Correction to RFC number in comment
  Reset LDC_EVENT_DATA_READY when napi completes.
  Use one Tx queue per vnet_port

 drivers/net/ethernet/sun/sunvnet.c | 95 +++++++++++++++++++++++++-------------
 drivers/net/ethernet/sun/sunvnet.h |  2 +
 net/ipv4/tcp_input.c               |  2 +-
 3 files changed, 67 insertions(+), 32 deletions(-)

-- 
1.8.4.2

^ permalink raw reply

* [PATCH net-next 1/3] tcp: Correction to RFC number in comment
From: Sowmini Varadhan @ 2014-10-29 19:27 UTC (permalink / raw)
  To: davem, sowmini.varadhan; +Cc: netdev


Challenge ACK is described in RFC 5961, fix typo.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a12b455..d285962 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5028,7 +5028,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 	/* step 3: check security and precedence [ignored] */
 
 	/* step 4: Check for a SYN
-	 * RFC 5691 4.2 : Send a challenge ack
+	 * RFC 5961 4.2 : Send a challenge ack
 	 */
 	if (th->syn) {
 syn_challenge:
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH net-next 2/3] sunvnet: Reset LDC_EVENT_DATA_READY when napi completes.
From: Sowmini Varadhan @ 2014-10-29 19:27 UTC (permalink / raw)
  To: davem, sowmini.varadhan; +Cc: netdev

When vnet_event_napi re-enables interrupts, it should
reset LDC_EVENT_DATA_READY as an optimization.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 drivers/net/ethernet/sun/sunvnet.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index c390a27..7ada479 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -760,6 +760,7 @@ static int vnet_poll(struct napi_struct *napi, int budget)

 	if (processed < budget) {
 		napi_complete(napi);
+		port->rx_event &= ~LDC_EVENT_DATA_READY;
 		vio_set_intr(vio->vdev->rx_ino, HV_INTR_ENABLED);
 	}
 	return processed;
-- 
1.8.4.2

^ permalink raw reply related

* Re: [PATCH net-next] tcp: allow for bigger reordering level
From: Eric Dumazet @ 2014-10-29 19:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, wygivan
In-Reply-To: <20141029.150624.878155212768758630.davem@davemloft.net>

On Wed, 2014-10-29 at 15:06 -0400, David Miller wrote:

> However in the longer term I'd say that this value, if it is to have a
> limit, then such a limit should probably be scaled based upon the
> window size.

Yuchung and othres are working on a new way to handle reorders (RACK),
and should present the concept in next IETF meeting.

A linux patch should follow shortly.

High level idea is :

Decide when and what to retransmit based on the timing, instead of
sequence, relationships. This covers both original or retransmitted
packets.

On dupacks, wait a fraction of RTT before the repair process to both
allow reordering and relieve the network

Thanks

^ permalink raw reply

* Re: [RFC] use smp_load_acquire()/smp_store_release()
From: Jeff Kirsher @ 2014-10-29 19:27 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Eric Dumazet, netdev
In-Reply-To: <545112E0.40106@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 2147 bytes --]

On Wed, 2014-10-29 at 09:16 -0700, Alexander Duyck wrote:
> On 10/29/2014 07:49 AM, Eric Dumazet wrote:
> > Hi Alexander
> >
> > The memory barriers added in commit
> > b37c0fbe3f6dfba1f8ad2aed47fb40578a254635
> > ("net: Add memory barriers to prevent possible race in byte queue
> > limits")
> >
> > have heavy cost.
> >
> > It seems we could use smp_load_acquire() and smp_store_release()
> > instead ?
> >
> > I'll post a patch later today. I would be interested if someone was able
> > to test it, as your commit apparently was tested and known to fix a
> > reproducible race.
> >
> > Thanks !

Eric- just CC me on the patch you post and I will see what I can do
about getting validation eyes on it.

> 
> Unfortunately Stephen left Intel before I did, so we will need to find 
> someone else in the validation team to test this if possible. I have 
> added Jeff to the CC so that he can give the appropriate validation 
> people a heads up that this patch might be coming.
> 
> As I recall what was seen was random Tx hangs on systems with the 
> original BQL code when interfaces were stressed.  It has been a while so 
> I don't recall the exact set-up for all of it.  Also some less 
> used/tested architectures such as PowerPC can be more susceptible to 
> synchronization issues such as these as the memory model is more weakly 
> ordered.
> 
> I'm wondering where you are seeing the barrier show up?  In 
> netdev_tx_send_queue you should only hit the barrier if you actually are 
> triggering the XOFF condition, and in netdev_tx_completed_queue the 
> barrier should be coalesced in amongst a number of frames reducing the cost.
> 
> My concern with this would be that we are actually syncronizing multiple 
> things, the __QUEUE_STATE_STACK_XOFF flag, dql->adj_limit, and 
> dql->num_queued, and we might be trading off reducing the cost on x86 to 
> result in it being increased on other architectures as we may have to 
> actually add additional synchronization as I suspect we would need to 
> use acquire/release on both adj_limit and num_queued.
> 
> Thanks,
> 
> Alex
> 



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* [PATCH net-next 3/3] sunvnet: Use one Tx queue per vnet_port
From: Sowmini Varadhan @ 2014-10-29 19:27 UTC (permalink / raw)
  To: davem, sowmini.varadhan; +Cc: netdev


Use multple Tx netdev queues for sunvnet by supporting a one-to-one
mapping between vnet_port and Tx queue. Provide a ndo_select_queue
indirection (vnet_select_queue()) which selects the queue based
on the peer that would be selected in vnet_start_xmit()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 drivers/net/ethernet/sun/sunvnet.c | 94 +++++++++++++++++++++++++-------------
 drivers/net/ethernet/sun/sunvnet.h |  2 +
 2 files changed, 65 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 7ada479..e7bb63b 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -40,6 +40,8 @@ MODULE_DESCRIPTION("Sun LDOM virtual network driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_MODULE_VERSION);
 
+#define	VNET_MAX_TXQS		16
+
 /* Heuristic for the number of times to exponentially backoff and
  * retry sending an LDC trigger when EAGAIN is encountered
  */
@@ -551,6 +553,8 @@ static int vnet_ack(struct vnet_port *port, void *msgbuf)
 	struct vnet *vp;
 	u32 end;
 	struct vio_net_desc *desc;
+	struct netdev_queue *txq;
+
 	if (unlikely(pkt->tag.stype_env != VIO_DRING_DATA))
 		return 0;
 
@@ -580,7 +584,8 @@ static int vnet_ack(struct vnet_port *port, void *msgbuf)
 	}
 	netif_tx_unlock(dev);
 
-	if (unlikely(netif_queue_stopped(dev) &&
+	txq = netdev_get_tx_queue(dev, port->q_index);
+	if (unlikely(netif_tx_queue_stopped(txq) &&
 		     vnet_tx_dring_avail(dr) >= VNET_TX_WAKEUP_THRESH(dr)))
 		return 1;
 
@@ -608,31 +613,23 @@ static int handle_mcast(struct vnet_port *port, void *msgbuf)
 	return 0;
 }
 
-static void maybe_tx_wakeup(struct vnet *vp)
+/* Got back a STOPPED LDC message on port. If the queue is stopped,
+ * wake it up so that we'll send out another START message at the
+ * next TX.
+ */
+static void maybe_tx_wakeup(struct vnet_port *port)
 {
-	struct net_device *dev = vp->dev;
+	struct netdev_queue *txq;
 
-	netif_tx_lock(dev);
-	if (likely(netif_queue_stopped(dev))) {
-		struct vnet_port *port;
-		int wake = 1;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(port, &vp->port_list, list) {
-			struct vio_dring_state *dr;
-
-			dr = &port->vio.drings[VIO_DRIVER_TX_RING];
-			if (vnet_tx_dring_avail(dr) <
-			    VNET_TX_WAKEUP_THRESH(dr)) {
-				wake = 0;
-				break;
-			}
-		}
-		rcu_read_unlock();
-		if (wake)
-			netif_wake_queue(dev);
+	txq = netdev_get_tx_queue(port->vp->dev, port->q_index);
+	__netif_tx_lock(txq, smp_processor_id());
+	if (likely(netif_tx_queue_stopped(txq))) {
+		struct vio_dring_state *dr;
+
+		dr = &port->vio.drings[VIO_DRIVER_TX_RING];
+			netif_tx_wake_queue(txq);
 	}
-	netif_tx_unlock(dev);
+	__netif_tx_unlock(txq);
 }
 
 static inline bool port_is_up(struct vnet_port *vnet)
@@ -748,7 +745,7 @@ napi_resume:
 			break;
 	}
 	if (unlikely(tx_wakeup && err != -ECONNRESET))
-		maybe_tx_wakeup(port->vp);
+		maybe_tx_wakeup(port);
 	return npkts;
 }
 
@@ -953,6 +950,16 @@ static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, void **pstart,
 	return skb;
 }
 
+static u16
+vnet_select_queue(struct net_device *dev, struct sk_buff *skb,
+		  void *accel_priv, select_queue_fallback_t fallback)
+{
+	struct vnet *vp = netdev_priv(dev);
+	struct vnet_port *port = __tx_port_find(vp, skb);
+
+	return port->q_index;
+}
+
 static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct vnet *vp = netdev_priv(dev);
@@ -965,6 +972,7 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	void *start = NULL;
 	int nlen = 0;
 	unsigned pending = 0;
+	struct netdev_queue *txq;
 
 	skb = vnet_skb_shape(skb, &start, &nlen);
 	if (unlikely(!skb))
@@ -1008,9 +1016,11 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	dr = &port->vio.drings[VIO_DRIVER_TX_RING];
+	i = skb_get_queue_mapping(skb);
+	txq = netdev_get_tx_queue(dev, i);
 	if (unlikely(vnet_tx_dring_avail(dr) < 1)) {
-		if (!netif_queue_stopped(dev)) {
-			netif_stop_queue(dev);
+		if (!netif_tx_queue_stopped(txq)) {
+			netif_tx_stop_queue(txq);
 
 			/* This is a hard error, log it. */
 			netdev_err(dev, "BUG! Tx Ring full when queue awake!\n");
@@ -1104,9 +1114,9 @@ ldc_start_done:
 
 	dr->prod = (dr->prod + 1) & (VNET_TX_RING_SIZE - 1);
 	if (unlikely(vnet_tx_dring_avail(dr) < 1)) {
-		netif_stop_queue(dev);
+		netif_tx_stop_queue(txq);
 		if (vnet_tx_dring_avail(dr) > VNET_TX_WAKEUP_THRESH(dr))
-			netif_wake_queue(dev);
+			netif_tx_wake_queue(txq);
 	}
 
 	(void)mod_timer(&port->clean_timer, jiffies + VNET_CLEAN_TIMEOUT);
@@ -1139,14 +1149,14 @@ static void vnet_tx_timeout(struct net_device *dev)
 static int vnet_open(struct net_device *dev)
 {
 	netif_carrier_on(dev);
-	netif_start_queue(dev);
+	netif_tx_start_all_queues(dev);
 
 	return 0;
 }
 
 static int vnet_close(struct net_device *dev)
 {
-	netif_stop_queue(dev);
+	netif_tx_stop_all_queues(dev);
 	netif_carrier_off(dev);
 
 	return 0;
@@ -1420,6 +1430,7 @@ static const struct net_device_ops vnet_ops = {
 	.ndo_tx_timeout		= vnet_tx_timeout,
 	.ndo_change_mtu		= vnet_change_mtu,
 	.ndo_start_xmit		= vnet_start_xmit,
+	.ndo_select_queue	= vnet_select_queue,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= vnet_poll_controller,
 #endif
@@ -1431,7 +1442,7 @@ static struct vnet *vnet_new(const u64 *local_mac)
 	struct vnet *vp;
 	int err, i;
 
-	dev = alloc_etherdev(sizeof(*vp));
+	dev = alloc_etherdev_mqs(sizeof(*vp), VNET_MAX_TXQS, 1);
 	if (!dev)
 		return ERR_PTR(-ENOMEM);
 	dev->needed_headroom = VNET_PACKET_SKIP + 8;
@@ -1556,6 +1567,25 @@ static void print_version(void)
 
 const char *remote_macaddr_prop = "remote-mac-address";
 
+static void
+vnet_port_add_txq(struct vnet_port *port)
+{
+	struct vnet *vp = port->vp;
+	int n;
+
+	n = vp->nports++;
+	n = n & (VNET_MAX_TXQS - 1);
+	port->q_index = n;
+	netif_tx_wake_queue(netdev_get_tx_queue(vp->dev, port->q_index));
+}
+
+static void
+vnet_port_rm_txq(struct vnet_port *port)
+{
+	port->vp->nports--;
+	netif_tx_stop_queue(netdev_get_tx_queue(port->vp->dev, port->q_index));
+}
+
 static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 {
 	struct mdesc_handle *hp;
@@ -1624,6 +1654,7 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 		list_add_tail_rcu(&port->list, &vp->port_list);
 	hlist_add_head_rcu(&port->hash,
 			   &vp->port_hash[vnet_hashfn(port->raddr)]);
+	vnet_port_add_txq(port);
 	spin_unlock_irqrestore(&vp->lock, flags);
 
 	dev_set_drvdata(&vdev->dev, port);
@@ -1668,6 +1699,7 @@ static int vnet_port_remove(struct vio_dev *vdev)
 
 		synchronize_rcu();
 		del_timer_sync(&port->clean_timer);
+		vnet_port_rm_txq(port);
 		netif_napi_del(&port->napi);
 		vnet_port_free_tx_bufs(port);
 		vio_ldc_free(&port->vio);
diff --git a/drivers/net/ethernet/sun/sunvnet.h b/drivers/net/ethernet/sun/sunvnet.h
index c8a862e..cd5d343 100644
--- a/drivers/net/ethernet/sun/sunvnet.h
+++ b/drivers/net/ethernet/sun/sunvnet.h
@@ -61,6 +61,7 @@ struct vnet_port {
 	u32			napi_stop_idx;
 	bool			napi_resume;
 	int			rx_event;
+	u16			q_index;
 };
 
 static inline struct vnet_port *to_vnet_port(struct vio_driver_state *vio)
@@ -102,6 +103,7 @@ struct vnet {
 	struct list_head	list;
 	u64			local_mac;
 
+	int			nports;
 };
 
 #endif /* _SUNVNET_H */
-- 
1.8.4.2

^ permalink raw reply related

* Re: net: fec: fix regression on i.MX28 introduced by rx_copybreak support
From: David Miller @ 2014-10-29 19:34 UTC (permalink / raw)
  To: LW
  Cc: netdev, rmk+kernel, Frank.Li, fabio.estevam, linux-kernel,
	linux-arm-kernel
In-Reply-To: <1414502584-10583-1-git-send-email-LW@KARO-electronics.de>

From: Lothar Waßmann <LW@KARO-electronics.de>
Date: Tue, 28 Oct 2014 14:22:55 +0100

> Changes wrt. v1:
> - added some cleanup patches
> - simplify handling of 'quirks' flags as suggested by Russell King.
> - remove DIV_ROUND_UP() from byte swapping loop as suggested by
>   Eric Dumazet
> 
> Changes wrt. v2:
> - rebased against next-20141028
> - added some more cleanups in fec.h
> - removed unused return value from swap_buffer()
> - fixed messed swab32s() call in swap_buffer2()
> - fixed messed up setup of fep->quirks
> 

It is not appropriate to mix cleanups and bonafide bug fixes.

I want to see only bug fixes targetted at 'net'.  You can later
submit the cleanups to 'net-next'.

Also, I don't thnk your DIV_ROUND_UP() eliminate for the loop
in swap_buffer() is valid.  The whole point is that the current
code handles buffers which have a length which is not a multiple
of 4 properly, after your change it will no longer do so.

^ permalink raw reply

* Re: e1000_netpoll(): disable_irq() triggers might_sleep() on linux-next
From: Peter Zijlstra @ 2014-10-29 19:36 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Sabrina Dubroca, netdev, linux-kernel, jeffrey.t.kirsher
In-Reply-To: <alpine.DEB.2.11.1410291918060.5308@nanos>

On Wed, Oct 29, 2014 at 07:33:00PM +0100, Thomas Gleixner wrote:
> Yuck. No. You are just papering over the problem.
> 
> What happens if you add 'threadirqs' to the kernel command line? Or if
> the interrupt line is shared with a real threaded interrupt user?
> 
> The proper solution is to have a poll_lock for e1000 which serializes
> the hardware interrupt against netpoll instead of using
> disable/enable_irq().
> 
> In fact that's less expensive than the disable/enable_irq() dance and
> the chance of contention is pretty low. If done right it will be a
> NOOP for the CONFIG_NET_POLL_CONTROLLER=n case.
> 

OK a little something like so then I suppose.. But I suspect most all
the network drivers will need this and maybe more, disable_irq() is a
popular little thing and we 'just' changed semantics on them.

---
 drivers/net/ethernet/intel/e1000/e1000.h      |  2 ++
 drivers/net/ethernet/intel/e1000/e1000_main.c | 22 +++++++++++++++++-----
 kernel/irq/manage.c                           |  2 +-
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index 69707108d23c..3f48609f2318 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -323,6 +323,8 @@ struct e1000_adapter {
 	struct delayed_work watchdog_task;
 	struct delayed_work fifo_stall_task;
 	struct delayed_work phy_info_task;
+
+	spinlock_t irq_lock;
 };
 
 enum e1000_state_t {
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 5f6aded512f5..d12cbffe2149 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -1310,6 +1310,7 @@ static int e1000_sw_init(struct e1000_adapter *adapter)
 	e1000_irq_disable(adapter);
 
 	spin_lock_init(&adapter->stats_lock);
+	spin_lock_init(&adapter->irq_lock);
 
 	set_bit(__E1000_DOWN, &adapter->flags);
 
@@ -3748,10 +3749,8 @@ void e1000_update_stats(struct e1000_adapter *adapter)
  * @irq: interrupt number
  * @data: pointer to a network interface device structure
  **/
-static irqreturn_t e1000_intr(int irq, void *data)
+static irqreturn_t __e1000_intr(int irq, struct e1000_adapter *adapter)
 {
-	struct net_device *netdev = data;
-	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
 	u32 icr = er32(ICR);
 
@@ -3793,6 +3792,19 @@ static irqreturn_t e1000_intr(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t e1000_intr(int irq, void *data)
+{
+	struct net_device *netdev = data;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	irqreturn_t ret;
+
+	spin_lock(&adapter->irq_lock);
+	ret = __e1000_intr(irq, adapter);
+	spin_unlock(&adapter->irq_lock);
+
+	return ret;
+}
+
 /**
  * e1000_clean - NAPI Rx polling callback
  * @adapter: board private structure
@@ -5217,9 +5229,9 @@ static void e1000_netpoll(struct net_device *netdev)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 
-	disable_irq(adapter->pdev->irq);
+	spin_lock(&adapter->irq_lock)
 	e1000_intr(adapter->pdev->irq, netdev);
-	enable_irq(adapter->pdev->irq);
+	spin_unlock(&adapter->irq_lock)
 }
 #endif
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 0a9104b4608b..b5a4a06bf2fd 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -427,7 +427,7 @@ EXPORT_SYMBOL(disable_irq_nosync);
  *	to complete before returning. If you use this function while
  *	holding a resource the IRQ handler may need you will deadlock.
  *
- *	This function may be called - with care - from IRQ context.
+ *	This function may _NOT_ be called from IRQ context.
  */
 void disable_irq(unsigned int irq)
 {

^ permalink raw reply related

* Re: e1000_netpoll(): disable_irq() triggers might_sleep() on linux-next
From: Jeff Kirsher @ 2014-10-29 19:40 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, Sabrina Dubroca, netdev, linux-kernel
In-Reply-To: <20141029193603.GS12706@worktop.programming.kicks-ass.net>

[-- Attachment #1: Type: text/plain, Size: 4206 bytes --]

On Wed, 2014-10-29 at 20:36 +0100, Peter Zijlstra wrote:
> On Wed, Oct 29, 2014 at 07:33:00PM +0100, Thomas Gleixner wrote:
> > Yuck. No. You are just papering over the problem.
> > 
> > What happens if you add 'threadirqs' to the kernel command line? Or if
> > the interrupt line is shared with a real threaded interrupt user?
> > 
> > The proper solution is to have a poll_lock for e1000 which serializes
> > the hardware interrupt against netpoll instead of using
> > disable/enable_irq().
> > 
> > In fact that's less expensive than the disable/enable_irq() dance and
> > the chance of contention is pretty low. If done right it will be a
> > NOOP for the CONFIG_NET_POLL_CONTROLLER=n case.
> > 
> 
> OK a little something like so then I suppose.. But I suspect most all
> the network drivers will need this and maybe more, disable_irq() is a
> popular little thing and we 'just' changed semantics on them.

Thomas- if you are fine with Peter's patch, I can get this under
testing.

> 
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h      |  2 ++
>  drivers/net/ethernet/intel/e1000/e1000_main.c | 22 +++++++++++++++++-----
>  kernel/irq/manage.c                           |  2 +-
>  3 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> index 69707108d23c..3f48609f2318 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -323,6 +323,8 @@ struct e1000_adapter {
>  	struct delayed_work watchdog_task;
>  	struct delayed_work fifo_stall_task;
>  	struct delayed_work phy_info_task;
> +
> +	spinlock_t irq_lock;
>  };
>  
>  enum e1000_state_t {
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 5f6aded512f5..d12cbffe2149 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -1310,6 +1310,7 @@ static int e1000_sw_init(struct e1000_adapter *adapter)
>  	e1000_irq_disable(adapter);
>  
>  	spin_lock_init(&adapter->stats_lock);
> +	spin_lock_init(&adapter->irq_lock);
>  
>  	set_bit(__E1000_DOWN, &adapter->flags);
>  
> @@ -3748,10 +3749,8 @@ void e1000_update_stats(struct e1000_adapter *adapter)
>   * @irq: interrupt number
>   * @data: pointer to a network interface device structure
>   **/
> -static irqreturn_t e1000_intr(int irq, void *data)
> +static irqreturn_t __e1000_intr(int irq, struct e1000_adapter *adapter)
>  {
> -	struct net_device *netdev = data;
> -	struct e1000_adapter *adapter = netdev_priv(netdev);
>  	struct e1000_hw *hw = &adapter->hw;
>  	u32 icr = er32(ICR);
>  
> @@ -3793,6 +3792,19 @@ static irqreturn_t e1000_intr(int irq, void *data)
>  	return IRQ_HANDLED;
>  }
>  
> +static irqreturn_t e1000_intr(int irq, void *data)
> +{
> +	struct net_device *netdev = data;
> +	struct e1000_adapter *adapter = netdev_priv(netdev);
> +	irqreturn_t ret;
> +
> +	spin_lock(&adapter->irq_lock);
> +	ret = __e1000_intr(irq, adapter);
> +	spin_unlock(&adapter->irq_lock);
> +
> +	return ret;
> +}
> +
>  /**
>   * e1000_clean - NAPI Rx polling callback
>   * @adapter: board private structure
> @@ -5217,9 +5229,9 @@ static void e1000_netpoll(struct net_device *netdev)
>  {
>  	struct e1000_adapter *adapter = netdev_priv(netdev);
>  
> -	disable_irq(adapter->pdev->irq);
> +	spin_lock(&adapter->irq_lock)
>  	e1000_intr(adapter->pdev->irq, netdev);
> -	enable_irq(adapter->pdev->irq);
> +	spin_unlock(&adapter->irq_lock)
>  }
>  #endif
>  
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 0a9104b4608b..b5a4a06bf2fd 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -427,7 +427,7 @@ EXPORT_SYMBOL(disable_irq_nosync);
>   *	to complete before returning. If you use this function while
>   *	holding a resource the IRQ handler may need you will deadlock.
>   *
> - *	This function may be called - with care - from IRQ context.
> + *	This function may _NOT_ be called from IRQ context.
>   */
>  void disable_irq(unsigned int irq)
>  {



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: e1000_netpoll(): disable_irq() triggers might_sleep() on linux-next
From: Thomas Gleixner @ 2014-10-29 19:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Sabrina Dubroca, netdev, linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20141029193603.GS12706@worktop.programming.kicks-ass.net>

On Wed, 29 Oct 2014, Peter Zijlstra wrote:

> On Wed, Oct 29, 2014 at 07:33:00PM +0100, Thomas Gleixner wrote:
> > Yuck. No. You are just papering over the problem.
> > 
> > What happens if you add 'threadirqs' to the kernel command line? Or if
> > the interrupt line is shared with a real threaded interrupt user?
> > 
> > The proper solution is to have a poll_lock for e1000 which serializes
> > the hardware interrupt against netpoll instead of using
> > disable/enable_irq().
> > 
> > In fact that's less expensive than the disable/enable_irq() dance and
> > the chance of contention is pretty low. If done right it will be a
> > NOOP for the CONFIG_NET_POLL_CONTROLLER=n case.
> > 
> 
> OK a little something like so then I suppose.. But I suspect most all
> the network drivers will need this and maybe more, disable_irq() is a
> popular little thing and we 'just' changed semantics on them.

We changed that almost 4 years ago :) What we 'just' did was to add a
prominent warning into the code.
 
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h      |  2 ++
>  drivers/net/ethernet/intel/e1000/e1000_main.c | 22 +++++++++++++++++-----
>  kernel/irq/manage.c                           |  2 +-
>  3 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> index 69707108d23c..3f48609f2318 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -323,6 +323,8 @@ struct e1000_adapter {
>  	struct delayed_work watchdog_task;
>  	struct delayed_work fifo_stall_task;
>  	struct delayed_work phy_info_task;
> +
> +	spinlock_t irq_lock;
>  };
>  
>  enum e1000_state_t {
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 5f6aded512f5..d12cbffe2149 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -1310,6 +1310,7 @@ static int e1000_sw_init(struct e1000_adapter *adapter)
>  	e1000_irq_disable(adapter);
>  
>  	spin_lock_init(&adapter->stats_lock);
> +	spin_lock_init(&adapter->irq_lock);
>  
>  	set_bit(__E1000_DOWN, &adapter->flags);
>  
> @@ -3748,10 +3749,8 @@ void e1000_update_stats(struct e1000_adapter *adapter)
>   * @irq: interrupt number
>   * @data: pointer to a network interface device structure
>   **/
> -static irqreturn_t e1000_intr(int irq, void *data)
> +static irqreturn_t __e1000_intr(int irq, struct e1000_adapter *adapter)
>  {
> -	struct net_device *netdev = data;
> -	struct e1000_adapter *adapter = netdev_priv(netdev);
>  	struct e1000_hw *hw = &adapter->hw;
>  	u32 icr = er32(ICR);
>  
> @@ -3793,6 +3792,19 @@ static irqreturn_t e1000_intr(int irq, void *data)
>  	return IRQ_HANDLED;
>  }
>  
> +static irqreturn_t e1000_intr(int irq, void *data)
> +{
> +	struct net_device *netdev = data;
> +	struct e1000_adapter *adapter = netdev_priv(netdev);
> +	irqreturn_t ret;
> +
> +	spin_lock(&adapter->irq_lock);
> +	ret = __e1000_intr(irq, adapter);
> +	spin_unlock(&adapter->irq_lock);
> +
> +	return ret;
> +}
> +
>  /**
>   * e1000_clean - NAPI Rx polling callback
>   * @adapter: board private structure
> @@ -5217,9 +5229,9 @@ static void e1000_netpoll(struct net_device *netdev)
>  {
>  	struct e1000_adapter *adapter = netdev_priv(netdev);
>  
> -	disable_irq(adapter->pdev->irq);
> +	spin_lock(&adapter->irq_lock)
>  	e1000_intr(adapter->pdev->irq, netdev);
> -	enable_irq(adapter->pdev->irq);
> +	spin_unlock(&adapter->irq_lock)
>  }
>  #endif
>  
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 0a9104b4608b..b5a4a06bf2fd 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -427,7 +427,7 @@ EXPORT_SYMBOL(disable_irq_nosync);
>   *	to complete before returning. If you use this function while
>   *	holding a resource the IRQ handler may need you will deadlock.
>   *
> - *	This function may be called - with care - from IRQ context.
> + *	This function may _NOT_ be called from IRQ context.

It can only be called from preemptible thread context.

Thanks,

	tglx

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox