Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Cong Wang @ 2014-10-22 16:56 UTC (permalink / raw)
  To: Kristian Evensen; +Cc: netdev
In-Reply-To: <1413992196-4891-1-git-send-email-kristian.evensen@gmail.com>

On Wed, Oct 22, 2014 at 8:36 AM, Kristian Evensen
<kristian.evensen@gmail.com> wrote:
> From: Kristian Evensen <kristian.evensen@gmail.com>
>
> This patch introduces support for Freeze-TCP [1].
>
> Devices that are mobile frequently experience temporary disconnects, for example
> due to signal fading or a technology change. These changes can last for a
> substantial amount of time (>10 seconds), potentially causing multiple RTOs to
> expire and the sender to enter slow start. Even though a device has
> reconnected, it can take a long time for the TCP connection to recover.
>
> Operators of mobile broadband networks mitigate this issue by placing TCP
> splitters at the edge of their networks. However, the splitters typically only
> operate on some ports (mostly only port 80) and violate the end-to-end
> principle. The operator's TCP splitter receives a notification when a temporary
> disconnect occurs and starts sending Zero Window Announcements (ZWA) to the
> remote part of the connection. When a devices regains connectivity, the window
> is reopened.

At least split TCP is transparent to applications, while your approach is not.
I don't understand why you said it typically operates on some ports, since
TCP is stateful.

BTW, AFAIK Linux doesn't support split TCP.

^ permalink raw reply

* Re: [PATCH] net: tso: fix unaligned access to crafted TCP header in helper API
From: David Miller @ 2014-10-22 16:53 UTC (permalink / raw)
  To: karl.beldan; +Cc: netdev, karl.beldan, ezequiel.garcia
In-Reply-To: <1413900365-12289-1-git-send-email-karl.beldan@gmail.com>

From: Karl Beldan <karl.beldan@gmail.com>
Date: Tue, 21 Oct 2014 16:06:05 +0200

> From: Karl Beldan <karl.beldan@rivierawaves.com>
> 
> The crafted header start address is from a driver supplied buffer, which
> one can reasonably expect to be aligned on a 4-bytes boundary.
> However ATM the TSO helper API is only used by ethernet drivers and
> the tcp header will then be aligned to a 2-bytes only boundary from the
> header start address.
> 
> Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net] sfc: remove incorrect EFX_BUG_ON_PARANOID check
From: David Miller @ 2014-10-22 16:51 UTC (permalink / raw)
  To: ecree; +Cc: netdev, sshah, jcooper, linux-net-drivers
In-Reply-To: <alpine.LFD.2.03.1410211353290.26215@solarflare.com>

From: Edward Cree <ecree@solarflare.com>
Date: Tue, 21 Oct 2014 14:50:29 +0100

> From: Jon Cooper <jcooper@solarflare.com>
> 
> write_count and insert_count can wrap around, making > check invalid.
> 
> Fixes: 70b33fb0ddec827cbbd14cdc664fc27b2ef4a6b6 ("sfc: add support for
>  skb->xmit_more").
> 
> Signed-off-by: Edward Cree <ecree@solarflare.com>

Applied, thanks.

^ permalink raw reply

* Re: [net-next 1/2] ip6_tunnel: put ip6tnl0 FB device into 'any' mode
From: David Miller @ 2014-10-22 16:47 UTC (permalink / raw)
  To: alan; +Cc: netdev, edumazet
In-Reply-To: <1413879088-13513-1-git-send-email-alan@al-an.info>

From: "Alexey Andriyanov" <alan@al-an.info>
Date: Tue, 21 Oct 2014 12:11:27 +0400

> The fallback device is in ipv6 mode by default.
> The mode can not be changed in runtime, so there
> is no way to decapsulate ip4in6 packets coming from
> various sources without creating the specific tunnel
> ifaces for each peer.
> 
> Cc: David Miller <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Alexey Andriyanov <alan@al-an.info>

I don't think you can legitimately change this after all
these years.  You'll break someone's setup somehow.

You're going to have to find a way to achieve what you
want whilst keeping the default unchanged.

Sorry.

^ permalink raw reply

* Re: [PATCH] netfilter: log: protect nf_log_register against double registering
From: Pablo Neira Ayuso @ 2014-10-22 16:44 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: netfilter-devel, netdev
In-Reply-To: <5447A429.3030608@redhat.com>

On Wed, Oct 22, 2014 at 10:33:45AM -0200, Marcelo Ricardo Leitner wrote:
> On 22-10-2014 10:02, Pablo Neira Ayuso wrote:
> >On Mon, Oct 20, 2014 at 07:58:03PM -0200, Marcelo Ricardo Leitner wrote:
> >>Currently, despite the comment right before the function,
> >>nf_log_register allows registering two loggers on with the same type and
> >>end up overwriting the previous register.
> >>
> >>Not a real issue today as current tree doesn't have two loggers for the
> >>same type but it's better to get this protected.
> >>
> >>Also make sure that all of its callers do error checking.
> >
> >No major objetions to this sanity check. Some comment below.
> >
> >>Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
> >>---
> >>
> >>Notes:
> >>     Please let me know if you have any issues with the identation on
> >>     nf_log_register. I just couldn't find a better one.
> >
> >You can split nf_log_register() in two functions to avoid this.
> 
> Sorry but I don't follow this one. You mean having the check on
> nf_log_register() and then calling a __nf_log_register() to actually
> register it?
> 
> Now I'm thinking on wrapping
>             rcu_dereference_protected(loggers[i][logger->type],
> +					lockdep_is_held(&nf_log_mutex))
> into a macro or something like that, because that's the issue in
> there and this construction is called several times. Something like:
> 
> #define logger_deref_protected(pf, type) \
>         rcu_dereference_protected(loggers[pf][type], \
>                                   lockdep_is_held(&nf_log_mutex));
> 
> WDYT?

Seems OK, I think this can be:

#define nft_log_dereference(logger)

So you can use this both from net->nf.nf_loggers[x] and
loggers[x][y] and we have one single macro and we avoid the indent
issues.

Thanks.

^ permalink raw reply

* [PATCH net v1 0/2] amd-xgbe: AMD XGBE driver fixes 2014-10-22
From: Tom Lendacky @ 2014-10-22 16:26 UTC (permalink / raw)
  To: netdev; +Cc: davem

The following series of patches includes fixes to the driver.

- Properly handle feature changes via ethtool by using correctly sized
  variables
- Perform proper napi packet counting and budget checking

This patch series is based on net.

---

Tom Lendacky (2):
      amd-xgbe: Properly handle feature changes via ethtool
      amd-xgbe: Fix napi Rx budget accounting


 drivers/net/ethernet/amd/xgbe/xgbe-drv.c |   22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

-- 
Tom Lendacky

^ permalink raw reply

* [PATCH net v1 2/2] amd-xgbe: Fix napi Rx budget accounting
From: Tom Lendacky @ 2014-10-22 16:26 UTC (permalink / raw)
  To: netdev; +Cc: davem
In-Reply-To: <20141022162605.31495.98889.stgit@tlendack-t1.amdoffice.net>

Currently the amd-xgbe driver increments the packets processed counter
each time a descriptor is processed.  Since a packet can be represented
by more than one descriptor incrementing the counter in this way is not
appropriate.  Also, since multiple descriptors cause the budget check
to be short circuited, sometimes the returned value from the poll
function would be larger than the budget value resulting in a WARN_ONCE
being triggered.

Update the polling logic to properly account for the number of packets
processed and exit when the budget value is reached.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index a480b23..2349ea9 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1598,7 +1598,8 @@ static int xgbe_rx_poll(struct xgbe_channel *channel, int budget)
 	struct skb_shared_hwtstamps *hwtstamps;
 	unsigned int incomplete, error, context_next, context;
 	unsigned int len, put_len, max_len;
-	int received = 0;
+	unsigned int received = 0;
+	int packet_count = 0;
 
 	DBGPR("-->xgbe_rx_poll: budget=%d\n", budget);
 
@@ -1608,7 +1609,7 @@ static int xgbe_rx_poll(struct xgbe_channel *channel, int budget)
 
 	rdata = XGBE_GET_DESC_DATA(ring, ring->cur);
 	packet = &ring->packet_data;
-	while (received < budget) {
+	while (packet_count < budget) {
 		DBGPR("  cur = %d\n", ring->cur);
 
 		/* First time in loop see if we need to restore state */
@@ -1662,7 +1663,7 @@ read_again:
 			if (packet->errors)
 				DBGPR("Error in received packet\n");
 			dev_kfree_skb(skb);
-			continue;
+			goto next_packet;
 		}
 
 		if (!context) {
@@ -1677,7 +1678,7 @@ read_again:
 					}
 
 					dev_kfree_skb(skb);
-					continue;
+					goto next_packet;
 				}
 				memcpy(skb_tail_pointer(skb), rdata->skb->data,
 				       put_len);
@@ -1694,7 +1695,7 @@ read_again:
 
 		/* Stray Context Descriptor? */
 		if (!skb)
-			continue;
+			goto next_packet;
 
 		/* Be sure we don't exceed the configured MTU */
 		max_len = netdev->mtu + ETH_HLEN;
@@ -1705,7 +1706,7 @@ read_again:
 		if (skb->len > max_len) {
 			DBGPR("packet length exceeds configured MTU\n");
 			dev_kfree_skb(skb);
-			continue;
+			goto next_packet;
 		}
 
 #ifdef XGMAC_ENABLE_RX_PKT_DUMP
@@ -1739,6 +1740,9 @@ read_again:
 
 		netdev->last_rx = jiffies;
 		napi_gro_receive(&pdata->napi, skb);
+
+next_packet:
+		packet_count++;
 	}
 
 	/* Check if we need to save state before leaving */
@@ -1752,9 +1756,9 @@ read_again:
 		rdata->state.error = error;
 	}
 
-	DBGPR("<--xgbe_rx_poll: received = %d\n", received);
+	DBGPR("<--xgbe_rx_poll: packet_count = %d\n", packet_count);
 
-	return received;
+	return packet_count;
 }
 
 static int xgbe_poll(struct napi_struct *napi, int budget)

^ permalink raw reply related

* [PATCH net v1 1/2] amd-xgbe: Properly handle feature changes via ethtool
From: Tom Lendacky @ 2014-10-22 16:26 UTC (permalink / raw)
  To: netdev; +Cc: davem
In-Reply-To: <20141022162605.31495.98889.stgit@tlendack-t1.amdoffice.net>

The ndo_set_features callback function was improperly using an unsigned
int to save the current feature value for features such as NETIF_F_RXCSUM.
Since that feature is in the upper 32 bits of a 64 bit variable the
result was always 0 making it not possible to actually turn off the
hardware RX checksum support.  Change the unsigned int type to the
netdev_features_t type in order to properly capture the current value
and perform the proper operation.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
index 2955499..a480b23 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1465,7 +1465,7 @@ static int xgbe_set_features(struct net_device *netdev,
 {
 	struct xgbe_prv_data *pdata = netdev_priv(netdev);
 	struct xgbe_hw_if *hw_if = &pdata->hw_if;
-	unsigned int rxcsum, rxvlan, rxvlan_filter;
+	netdev_features_t rxcsum, rxvlan, rxvlan_filter;
 
 	rxcsum = pdata->netdev_features & NETIF_F_RXCSUM;
 	rxvlan = pdata->netdev_features & NETIF_F_HW_VLAN_CTAG_RX;

^ permalink raw reply related

* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: David Miller @ 2014-10-22 16:14 UTC (permalink / raw)
  To: kristian.evensen; +Cc: netdev
In-Reply-To: <1413992196-4891-1-git-send-email-kristian.evensen@gmail.com>

From: Kristian Evensen <kristian.evensen@gmail.com>
Date: Wed, 22 Oct 2014 17:36:36 +0200

> This patch introduces support for Freeze-TCP [1].

By your description I would not expect the application to get involved
with the actual final zero window advertisement decision at all.

Instead, I would expect the device layer to trigger a notification
during a "technology change" or whatever you want to call losing
connectivity, whichi TCP can receive and use to start sending zero
windows over all TCP connections using that path.

So the socket option enables or disables the facility, but doesn't
actually trigger the zero window advertisement.  A real device based
event does that.

The application has no business watching for the loss of connectivity,
and I am certain you do not want that logice in every application in
order for it to take advantage of this.

And therefore there should be a global option that turns this on for
the entire system by default.

This requires a lot more work than you have done here, you need to
add all the notification handling, the logic in TCP to look at the
attached route on send and trigger zero window probes if the device
event has happened, etc.

^ permalink raw reply

* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Kristian Evensen @ 2014-10-22 16:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Network Development
In-Reply-To: <1413993272.9031.3.camel@edumazet-glaptop2.roam.corp.google.com>

Hi,

On Wed, Oct 22, 2014 at 5:54 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> This asymmetry looks strange
>
> Following sequence should be allowed :
>
> getsockopt(...  TCP_FREEZE, &val, ...)
> setsockopt(...  TCP_FREEZE, &val, ...)
>
> So setsockopt() should accept val = 0

Thanks for you comment and I agree. The reasoning behind my original
ordering was that I wanted the values to be in the order which made
most logical sense to me, which is Enable (1), Disable (2) and Disable
with TR-ACK (3). However, I see now that when using the option and
when combined with getsockopt(), this does not make much sense. I will
wait for some more feedback and send a revised version tomorrow with
the following ordering: Disable (0), Enable (1), Disable with TR-ACK
(2).

Thanks again,
Kristian

^ permalink raw reply

* [PATCH] net: fix saving TX flow hash in sock for outgoing connections
From: Sathya Perla @ 2014-10-22 16:12 UTC (permalink / raw)
  To: netdev; +Cc: therbert

The commit "net: Save TX flow hash in sock and set in skbuf on xmit"
introduced the inet_set_txhash() and ip6_set_txhash() routines to calculate
and record flow hash(sk_txhash) in the socket structure. sk_txhash is used
to set skb->hash which is used to spread flows across multiple TXQs.

But, the above routines are invoked before the source port of the connection
is created. Because of this all outgoing connections that just differ in the
source port get hashed into the same TXQ.

This patch fixes this problem for IPv4/6 by invoking the the above routines
after the source port is available for the socket.

Fixes: b73c3d0e4("net: Save TX flow hash in sock and set in skbuf on xmit")

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
---
 net/ipv4/tcp_ipv4.c |    4 ++--
 net/ipv6/tcp_ipv6.c |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 94d1a77..9c7d762 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -206,8 +206,6 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	inet->inet_dport = usin->sin_port;
 	inet->inet_daddr = daddr;
 
-	inet_set_txhash(sk);
-
 	inet_csk(sk)->icsk_ext_hdr_len = 0;
 	if (inet_opt)
 		inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
@@ -224,6 +222,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (err)
 		goto failure;
 
+	inet_set_txhash(sk);
+
 	rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
 			       inet->inet_sport, inet->inet_dport, sk);
 	if (IS_ERR(rt)) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 8314955..ace29b6 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -200,8 +200,6 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	sk->sk_v6_daddr = usin->sin6_addr;
 	np->flow_label = fl6.flowlabel;
 
-	ip6_set_txhash(sk);
-
 	/*
 	 *	TCP over IPv4
 	 */
@@ -297,6 +295,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	if (err)
 		goto late_failure;
 
+	ip6_set_txhash(sk);
+
 	if (!tp->write_seq && likely(!tp->repair))
 		tp->write_seq = secure_tcpv6_sequence_number(np->saddr.s6_addr32,
 							     sk->sk_v6_daddr.s6_addr32,
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Eric Dumazet @ 2014-10-22 15:54 UTC (permalink / raw)
  To: Kristian Evensen; +Cc: netdev
In-Reply-To: <1413992196-4891-1-git-send-email-kristian.evensen@gmail.com>

On Wed, 2014-10-22 at 17:36 +0200, Kristian Evensen wrote:
> From: Kristian Evensen <kristian.evensen@gmail.com>
> 
> This patch introduces support for Freeze-TCP [1].
> 
> Devices that are mobile frequently experience temporary disconnects, for example
> due to signal fading or a technology change. These changes can last for a
> substantial amount of time (>10 seconds), potentially causing multiple RTOs to
> expire and the sender to enter slow start. Even though a device has
> reconnected, it can take a long time for the TCP connection to recover.
> 
> Operators of mobile broadband networks mitigate this issue by placing TCP
> splitters at the edge of their networks. However, the splitters typically only
> operate on some ports (mostly only port 80) and violate the end-to-end
> principle. The operator's TCP splitter receives a notification when a temporary
> disconnect occurs and starts sending Zero Window Announcements (ZWA) to the
> remote part of the connection. When a devices regains connectivity, the window
> is reopened.
> 
> Freeze-TCP is a client-side only approach for enabling application developers to
> trigger sending ZWAs. It is implemented as a socket option and accepts three
> different values. If the value is set to one, the connection is frozen. A ZWA is
> sent and the window size set to 0 in any reply to additional packets arriving
> from remote party. If the value is set to two, the connection is unfrozen and a
> window update announcement is sent. If the value is set to three, two additional
> window update announcements are sent. This is referred to as TR-ACK in the paper
> and is used to increase probability that a window update announcement will be
> received.
> 
> When to trigger Freeze-TCP depends on the application requirements and
> underlaying network, is not the responsibility of the kernel. One approach is to
> have the application, or a daemon, analyze the meta data exported from a mobile
> broadband modem. A temporary disconnect can often be detected in advance by
> looking at different statistics.
> 
> [1] - T. Goff, J. Moronski, D. S. Phatak, and V. Gupta, "Freeze-TCP: a True
> End-to-end TCP Enhancement Mechanism for Mobile Environments," In Proceedings of
> IEEE INFOCOM 2000. URL: http://www.csee.umbc.edu/~phatak/publications/ftcp.pdf
> 
> Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
> ---
>  include/linux/tcp.h      |  3 ++-
>  include/uapi/linux/tcp.h |  1 +
>  net/ipv4/tcp.c           | 33 +++++++++++++++++++++++++++++++++
>  net/ipv4/tcp_output.c    |  8 +++++++-
>  4 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index c2dee7d..7ed26c1 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -187,7 +187,8 @@ struct tcp_sock {
>  		syn_data:1,	/* SYN includes data */
>  		syn_fastopen:1,	/* SYN includes Fast Open option */
>  		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
> -		is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */
> +		is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
> +		frozen:1;	/* Artifically deflate announced window to 0 */
>  	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */
>  
>  /* RTT measurement */
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index 3b97183..bc0684d 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -112,6 +112,7 @@ enum {
>  #define TCP_FASTOPEN		23	/* Enable FastOpen on listeners */
>  #define TCP_TIMESTAMP		24
>  #define TCP_NOTSENT_LOWAT	25	/* limit number of unsent bytes in write queue */
> +#define TCP_FREEZE		26	/* Freeze TCP connection by sending ZWA */
>  
>  struct tcp_repair_opt {
>  	__u32	opt_code;
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1bec4e7..5bf30d0 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2339,6 +2339,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  	struct inet_connection_sock *icsk = inet_csk(sk);
>  	int val;
>  	int err = 0;
> +	u8 itr = 0;
>  
>  	/* These are data/string values, all the others are ints */
>  	switch (optname) {
> @@ -2600,6 +2601,35 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  		tp->notsent_lowat = val;
>  		sk->sk_write_space(sk);
>  		break;
> +	case TCP_FREEZE:
> +		if (val < 1 || val > 3 ||
> +		    !((1 << sk->sk_state) & TCPF_ESTABLISHED)) {
> +			err = -EINVAL;
> +			break;
> +		}
> +
> +		if (val == 1) {
> +			tp->frozen = 1;
> +			tcp_send_ack(sk);
> +			break;
> +		} else if (!tp->frozen) {
> +			err = -EINVAL;
> +			break;
> +		}
> +
> +		tp->frozen = 0;
> +		tcp_send_ack(sk);
> +
> +		if (val == 2)
> +			break;
> +
> +		/* If val is three, send two additional reconnection ACKs to
> +		 * increase chance of a non-zero windows announcement arriving.
> +		 */
> +		for (itr = 0; itr < 2; itr++)
> +			tcp_send_ack(sk);
> +
> +		break;
>  	default:
>  		err = -ENOPROTOOPT;
>  		break;
> @@ -2832,6 +2862,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
>  	case TCP_NOTSENT_LOWAT:
>  		val = tp->notsent_lowat;
>  		break;
> +	case TCP_FREEZE:
> +		val = tp->frozen;
> +		break;
>  	default:
>  		return -ENOPROTOOPT;
>  	}


This asymmetry looks strange

Following sequence should be allowed :

getsockopt(...  TCP_FREEZE, &val, ...)
setsockopt(...  TCP_FREEZE, &val, ...)

So setsockopt() should accept val = 0

^ permalink raw reply

* [PATCH RFC v2 10/16] virtio_net: use v1.0 endian.
From: Michael S. Tsirkin @ 2014-10-22 15:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1413992894-22976-1-git-send-email-mst@redhat.com>

From: Rusty Russell <rusty@rustcorp.com.au>

[Cornelia Huck: converted some missed fields]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d75256bd..2e6561e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -347,13 +347,14 @@ err:
 }
 
 static struct sk_buff *receive_mergeable(struct net_device *dev,
+					 struct virtnet_info *vi,
 					 struct receive_queue *rq,
 					 unsigned long ctx,
 					 unsigned int len)
 {
 	void *buf = mergeable_ctx_to_buf_address(ctx);
 	struct skb_vnet_hdr *hdr = buf;
-	int num_buf = hdr->mhdr.num_buffers;
+	u16 num_buf = virtio16_to_cpu(rq->vq->vdev, hdr->mhdr.num_buffers);
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
 	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
@@ -369,7 +370,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
 		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers out of %d missing\n",
-				 dev->name, num_buf, hdr->mhdr.num_buffers);
+				 dev->name, num_buf,
+				 virtio16_to_cpu(rq->vq->vdev,
+						 hdr->mhdr.num_buffers));
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
@@ -454,7 +457,7 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	}
 
 	if (vi->mergeable_rx_bufs)
-		skb = receive_mergeable(dev, rq, (unsigned long)buf, len);
+		skb = receive_mergeable(dev, vi, rq, (unsigned long)buf, len);
 	else if (vi->big_packets)
 		skb = receive_big(dev, rq, buf, len);
 	else
@@ -473,8 +476,8 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (hdr->hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
 		pr_debug("Needs csum!\n");
 		if (!skb_partial_csum_set(skb,
-					  hdr->hdr.csum_start,
-					  hdr->hdr.csum_offset))
+			  virtio16_to_cpu(vi->vdev, hdr->hdr.csum_start),
+			  virtio16_to_cpu(vi->vdev, hdr->hdr.csum_offset)))
 			goto frame_err;
 	} else if (hdr->hdr.flags & VIRTIO_NET_HDR_F_DATA_VALID) {
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
@@ -505,7 +508,8 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 		if (hdr->hdr.gso_type & VIRTIO_NET_HDR_GSO_ECN)
 			skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
 
-		skb_shinfo(skb)->gso_size = hdr->hdr.gso_size;
+		skb_shinfo(skb)->gso_size = virtio16_to_cpu(vi->vdev,
+							    hdr->hdr.gso_size);
 		if (skb_shinfo(skb)->gso_size == 0) {
 			net_warn_ratelimited("%s: zero gso size.\n", dev->name);
 			goto frame_err;
@@ -867,16 +871,19 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		hdr->hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-		hdr->hdr.csum_start = skb_checksum_start_offset(skb);
-		hdr->hdr.csum_offset = skb->csum_offset;
+		hdr->hdr.csum_start = cpu_to_virtio16(vi->vdev,
+						skb_checksum_start_offset(skb));
+		hdr->hdr.csum_offset = cpu_to_virtio16(vi->vdev,
+							 skb->csum_offset);
 	} else {
 		hdr->hdr.flags = 0;
 		hdr->hdr.csum_offset = hdr->hdr.csum_start = 0;
 	}
 
 	if (skb_is_gso(skb)) {
-		hdr->hdr.hdr_len = skb_headlen(skb);
-		hdr->hdr.gso_size = skb_shinfo(skb)->gso_size;
+		hdr->hdr.hdr_len = cpu_to_virtio16(vi->vdev, skb_headlen(skb));
+		hdr->hdr.gso_size = cpu_to_virtio16(vi->vdev,
+						    skb_shinfo(skb)->gso_size);
 		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
 			hdr->hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
 		else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
@@ -1182,7 +1189,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	sg_init_table(sg, 2);
 
 	/* Store the unicast list and count in the front of the buffer */
-	mac_data->entries = uc_count;
+	mac_data->entries = cpu_to_virtio32(vi->vdev, uc_count);
 	i = 0;
 	netdev_for_each_uc_addr(ha, dev)
 		memcpy(&mac_data->macs[i++][0], ha->addr, ETH_ALEN);
@@ -1193,7 +1200,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	/* multicast list and count fill the end */
 	mac_data = (void *)&mac_data->macs[uc_count][0];
 
-	mac_data->entries = mc_count;
+	mac_data->entries = cpu_to_virtio32(vi->vdev, mc_count);
 	i = 0;
 	netdev_for_each_mc_addr(ha, dev)
 		memcpy(&mac_data->macs[i++][0], ha->addr, ETH_ALEN);
-- 
MST

^ permalink raw reply related

* Re: [PATCH] igb: don't reuse pages with pfmemalloc flag
From: Eric Dumazet @ 2014-10-22 15:45 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: alexander.h.duyck, netdev, sassmann, e1000-devel,
	peter.p.waskiewicz.jr, bruce.w.allan, jesse.brandeburg, davem,
	john.ronciak, gregkh, linux-kernel
In-Reply-To: <1413985819-9553-1-git-send-email-klamm@yandex-team.ru>

On Wed, 2014-10-22 at 17:50 +0400, Roman Gushchin wrote:
> Incoming packet is dropped silently by sk_filter(), if the skb was
> allocated from pfmemalloc reserves and the corresponding socket is
> not marked with the SOCK_MEMALLOC flag.
> 
> Igb driver allocates pages for DMA with __skb_alloc_page(), which
> calls alloc_pages_node() with the __GFP_MEMALLOC flag. So, in case
> of OOM condition, igb can get pages with pfmemalloc flag set.
> 
> If an incoming packet hits the pfmemalloc page and is large enough
> (small packets are copying into the memory, allocated with
> netdev_alloc_skb_ip_align(), so they are not affected), it will be
> dropped.
> 
> This behavior is ok under high memory pressure, but the problem is
> that the igb driver reuses these mapped pages. So, packets are still
> dropping even if all memory issues are gone and there is a plenty
> of free memory.
> 
> In my case, some TCP sessions hang on a small percentage (< 0.1%)
> of machines days after OOMs.
> 
> Fix this by avoiding reuse of such pages.
> 
> Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
> ---

Interesting...

It seems we also need to clear skb->pfmemalloc in napi_reuse_skb()




------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* [PATCH net-next] tcp: Add TCP_FREEZE socket option
From: Kristian Evensen @ 2014-10-22 15:36 UTC (permalink / raw)
  To: netdev; +Cc: Kristian Evensen

From: Kristian Evensen <kristian.evensen@gmail.com>

This patch introduces support for Freeze-TCP [1].

Devices that are mobile frequently experience temporary disconnects, for example
due to signal fading or a technology change. These changes can last for a
substantial amount of time (>10 seconds), potentially causing multiple RTOs to
expire and the sender to enter slow start. Even though a device has
reconnected, it can take a long time for the TCP connection to recover.

Operators of mobile broadband networks mitigate this issue by placing TCP
splitters at the edge of their networks. However, the splitters typically only
operate on some ports (mostly only port 80) and violate the end-to-end
principle. The operator's TCP splitter receives a notification when a temporary
disconnect occurs and starts sending Zero Window Announcements (ZWA) to the
remote part of the connection. When a devices regains connectivity, the window
is reopened.

Freeze-TCP is a client-side only approach for enabling application developers to
trigger sending ZWAs. It is implemented as a socket option and accepts three
different values. If the value is set to one, the connection is frozen. A ZWA is
sent and the window size set to 0 in any reply to additional packets arriving
from remote party. If the value is set to two, the connection is unfrozen and a
window update announcement is sent. If the value is set to three, two additional
window update announcements are sent. This is referred to as TR-ACK in the paper
and is used to increase probability that a window update announcement will be
received.

When to trigger Freeze-TCP depends on the application requirements and
underlaying network, is not the responsibility of the kernel. One approach is to
have the application, or a daemon, analyze the meta data exported from a mobile
broadband modem. A temporary disconnect can often be detected in advance by
looking at different statistics.

[1] - T. Goff, J. Moronski, D. S. Phatak, and V. Gupta, "Freeze-TCP: a True
End-to-end TCP Enhancement Mechanism for Mobile Environments," In Proceedings of
IEEE INFOCOM 2000. URL: http://www.csee.umbc.edu/~phatak/publications/ftcp.pdf

Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
---
 include/linux/tcp.h      |  3 ++-
 include/uapi/linux/tcp.h |  1 +
 net/ipv4/tcp.c           | 33 +++++++++++++++++++++++++++++++++
 net/ipv4/tcp_output.c    |  8 +++++++-
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c2dee7d..7ed26c1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -187,7 +187,8 @@ struct tcp_sock {
 		syn_data:1,	/* SYN includes data */
 		syn_fastopen:1,	/* SYN includes Fast Open option */
 		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
-		is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */
+		is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
+		frozen:1;	/* Artifically deflate announced window to 0 */
 	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */
 
 /* RTT measurement */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 3b97183..bc0684d 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -112,6 +112,7 @@ enum {
 #define TCP_FASTOPEN		23	/* Enable FastOpen on listeners */
 #define TCP_TIMESTAMP		24
 #define TCP_NOTSENT_LOWAT	25	/* limit number of unsent bytes in write queue */
+#define TCP_FREEZE		26	/* Freeze TCP connection by sending ZWA */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1bec4e7..5bf30d0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2339,6 +2339,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	int val;
 	int err = 0;
+	u8 itr = 0;
 
 	/* These are data/string values, all the others are ints */
 	switch (optname) {
@@ -2600,6 +2601,35 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		tp->notsent_lowat = val;
 		sk->sk_write_space(sk);
 		break;
+	case TCP_FREEZE:
+		if (val < 1 || val > 3 ||
+		    !((1 << sk->sk_state) & TCPF_ESTABLISHED)) {
+			err = -EINVAL;
+			break;
+		}
+
+		if (val == 1) {
+			tp->frozen = 1;
+			tcp_send_ack(sk);
+			break;
+		} else if (!tp->frozen) {
+			err = -EINVAL;
+			break;
+		}
+
+		tp->frozen = 0;
+		tcp_send_ack(sk);
+
+		if (val == 2)
+			break;
+
+		/* If val is three, send two additional reconnection ACKs to
+		 * increase chance of a non-zero windows announcement arriving.
+		 */
+		for (itr = 0; itr < 2; itr++)
+			tcp_send_ack(sk);
+
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -2832,6 +2862,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_NOTSENT_LOWAT:
 		val = tp->notsent_lowat;
 		break;
+	case TCP_FREEZE:
+		val = tp->frozen;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 3af2129..9c1429b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -958,7 +958,13 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 		 */
 		th->window	= htons(min(tp->rcv_wnd, 65535U));
 	} else {
-		th->window	= htons(tcp_select_window(sk));
+		/* Because window is only artifically deflated to zero, we
+		 * postpone updating tcp state until connection is unfrozen
+		 */
+		if (unlikely(tp->frozen))
+			th->window = 0;
+		else
+			th->window = htons(tcp_select_window(sk));
 	}
 	th->check		= 0;
 	th->urg_ptr		= 0;
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH] ovs: Turn vports with dependencies into separate modules
From: Thomas Graf @ 2014-10-22 15:29 UTC (permalink / raw)
  To: dev; +Cc: netdev

The internal and netdev vport remain part of openvswitch.ko. Encap
vports including vxlan, gre, and geneve can be built as separate
modules and are loaded on demand. Modules can be unloaded after use.
Datapath ports keep a reference to the vport module during their
lifetime.

Allows to remove the error prone maintenance of the global list
vport_ops_list.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/openvswitch/Kconfig              |  18 +++----
 net/openvswitch/Makefile             |  14 ++---
 net/openvswitch/datapath.c           |  16 +++++-
 net/openvswitch/vport-geneve.c       |  23 +++++++-
 net/openvswitch/vport-gre.c          |  23 +++++++-
 net/openvswitch/vport-internal_dev.c |  17 +++++-
 net/openvswitch/vport-netdev.c       |  14 ++++-
 net/openvswitch/vport-netdev.h       |   3 ++
 net/openvswitch/vport-vxlan.c        |  23 +++++++-
 net/openvswitch/vport.c              | 102 ++++++++++++++++++++++++-----------
 net/openvswitch/vport.h              |  14 +++--
 11 files changed, 199 insertions(+), 68 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index ba3bb82..2a9673e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -29,11 +29,11 @@ config OPENVSWITCH
 	  If unsure, say N.
 
 config OPENVSWITCH_GRE
-	bool "Open vSwitch GRE tunneling support"
+	tristate "Open vSwitch GRE tunneling support"
 	depends on INET
 	depends on OPENVSWITCH
-	depends on NET_IPGRE_DEMUX && !(OPENVSWITCH=y && NET_IPGRE_DEMUX=m)
-	default y
+	depends on NET_IPGRE_DEMUX
+	default OPENVSWITCH
 	---help---
 	  If you say Y here, then the Open vSwitch will be able create GRE
 	  vport.
@@ -43,11 +43,11 @@ config OPENVSWITCH_GRE
 	  If unsure, say Y.
 
 config OPENVSWITCH_VXLAN
-	bool "Open vSwitch VXLAN tunneling support"
+	tristate "Open vSwitch VXLAN tunneling support"
 	depends on INET
 	depends on OPENVSWITCH
-	depends on VXLAN && !(OPENVSWITCH=y && VXLAN=m)
-	default y
+	depends on VXLAN
+	default OPENVSWITCH
 	---help---
 	  If you say Y here, then the Open vSwitch will be able create vxlan vport.
 
@@ -56,11 +56,11 @@ config OPENVSWITCH_VXLAN
 	  If unsure, say Y.
 
 config OPENVSWITCH_GENEVE
-	bool "Open vSwitch Geneve tunneling support"
+	tristate "Open vSwitch Geneve tunneling support"
 	depends on INET
 	depends on OPENVSWITCH
-	depends on GENEVE && !(OPENVSWITCH=y && GENEVE=m)
-	default y
+	depends on GENEVE
+	default OPENVSWITCH
 	---help---
 	  If you say Y here, then the Open vSwitch will be able create geneve vport.
 
diff --git a/net/openvswitch/Makefile b/net/openvswitch/Makefile
index 9a33a27..91b9478 100644
--- a/net/openvswitch/Makefile
+++ b/net/openvswitch/Makefile
@@ -15,14 +15,6 @@ openvswitch-y := \
 	vport-internal_dev.o \
 	vport-netdev.o
 
-ifneq ($(CONFIG_OPENVSWITCH_GENEVE),)
-openvswitch-y += vport-geneve.o
-endif
-
-ifneq ($(CONFIG_OPENVSWITCH_VXLAN),)
-openvswitch-y += vport-vxlan.o
-endif
-
-ifneq ($(CONFIG_OPENVSWITCH_GRE),)
-openvswitch-y += vport-gre.o
-endif
+obj-$(CONFIG_OPENVSWITCH_GENEVE)+= vport-geneve.o
+obj-$(CONFIG_OPENVSWITCH_VXLAN)	+= vport-vxlan.o
+obj-$(CONFIG_OPENVSWITCH_GRE)	+= vport-gre.o
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index e6d7255..aecddb9 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -59,6 +59,7 @@
 #include "vport-netdev.h"
 
 int ovs_net_id __read_mostly;
+EXPORT_SYMBOL(ovs_net_id);
 
 static struct genl_family dp_packet_genl_family;
 static struct genl_family dp_flow_genl_family;
@@ -1764,6 +1765,7 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		return -ENOMEM;
 
 	ovs_lock();
+restart:
 	dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
 	err = -ENODEV;
 	if (!dp)
@@ -1795,8 +1797,11 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
 
 	vport = new_vport(&parms);
 	err = PTR_ERR(vport);
-	if (IS_ERR(vport))
+	if (IS_ERR(vport)) {
+		if (err == -EAGAIN)
+			goto restart;
 		goto exit_unlock_free;
+	}
 
 	err = ovs_vport_cmd_fill_info(vport, reply, info->snd_portid,
 				      info->snd_seq, 0, OVS_VPORT_CMD_NEW);
@@ -2112,12 +2117,18 @@ static int __init dp_init(void)
 	if (err)
 		goto error_netns_exit;
 
+	err = ovs_netdev_init();
+	if (err)
+		goto error_unreg_notifier;
+
 	err = dp_register_genl();
 	if (err < 0)
-		goto error_unreg_notifier;
+		goto error_unreg_netdev;
 
 	return 0;
 
+error_unreg_netdev:
+	ovs_netdev_exit();
 error_unreg_notifier:
 	unregister_netdevice_notifier(&ovs_dp_device_notifier);
 error_netns_exit:
@@ -2137,6 +2148,7 @@ error:
 static void dp_cleanup(void)
 {
 	dp_unregister_genl(ARRAY_SIZE(dp_genl_families));
+	ovs_netdev_exit();
 	unregister_netdevice_notifier(&ovs_dp_device_notifier);
 	unregister_pernet_device(&ovs_net_ops);
 	rcu_barrier();
diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
index 106a9d8..70c9765 100644
--- a/net/openvswitch/vport-geneve.c
+++ b/net/openvswitch/vport-geneve.c
@@ -17,6 +17,7 @@
 #include <linux/rculist.h>
 #include <linux/udp.h>
 #include <linux/if_vlan.h>
+#include <linux/module.h>
 
 #include <net/geneve.h>
 #include <net/icmp.h>
@@ -28,6 +29,8 @@
 #include "datapath.h"
 #include "vport.h"
 
+static struct vport_ops ovs_geneve_vport_ops;
+
 /**
  * struct geneve_port - Keeps track of open UDP ports
  * @gs: The socket created for this port number.
@@ -225,11 +228,29 @@ static const char *geneve_get_name(const struct vport *vport)
 	return geneve_port->name;
 }
 
-const struct vport_ops ovs_geneve_vport_ops = {
+static struct vport_ops ovs_geneve_vport_ops = {
 	.type		= OVS_VPORT_TYPE_GENEVE,
 	.create		= geneve_tnl_create,
 	.destroy	= geneve_tnl_destroy,
 	.get_name	= geneve_get_name,
 	.get_options	= geneve_get_options,
 	.send		= geneve_tnl_send,
+	.owner          = THIS_MODULE,
 };
+
+static int __init ovs_geneve_tnl_init(void)
+{
+	return ovs_vport_ops_register(&ovs_geneve_vport_ops);
+}
+
+static void __exit ovs_geneve_tnl_exit(void)
+{
+	ovs_vport_ops_unregister(&ovs_geneve_vport_ops);
+}
+
+module_init(ovs_geneve_tnl_init);
+module_exit(ovs_geneve_tnl_exit);
+
+MODULE_DESCRIPTION("OVS: Geneve swiching port");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("vport-type-5");
diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c
index 108b82d..00270b6 100644
--- a/net/openvswitch/vport-gre.c
+++ b/net/openvswitch/vport-gre.c
@@ -29,6 +29,7 @@
 #include <linux/jhash.h>
 #include <linux/list.h>
 #include <linux/kernel.h>
+#include <linux/module.h>
 #include <linux/workqueue.h>
 #include <linux/rculist.h>
 #include <net/route.h>
@@ -45,6 +46,8 @@
 #include "datapath.h"
 #include "vport.h"
 
+static struct vport_ops ovs_gre_vport_ops;
+
 /* Returns the least-significant 32 bits of a __be64. */
 static __be32 be64_get_low32(__be64 x)
 {
@@ -281,10 +284,28 @@ static void gre_tnl_destroy(struct vport *vport)
 	gre_exit();
 }
 
-const struct vport_ops ovs_gre_vport_ops = {
+static struct vport_ops ovs_gre_vport_ops = {
 	.type		= OVS_VPORT_TYPE_GRE,
 	.create		= gre_create,
 	.destroy	= gre_tnl_destroy,
 	.get_name	= gre_get_name,
 	.send		= gre_tnl_send,
+	.owner		= THIS_MODULE,
 };
+
+static int __init ovs_gre_tnl_init(void)
+{
+	return ovs_vport_ops_register(&ovs_gre_vport_ops);
+}
+
+static void __exit ovs_gre_tnl_exit(void)
+{
+	ovs_vport_ops_unregister(&ovs_gre_vport_ops);
+}
+
+module_init(ovs_gre_tnl_init);
+module_exit(ovs_gre_tnl_exit);
+
+MODULE_DESCRIPTION("OVS: GRE switching port");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("vport-type-3");
diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
index 8451612..10dc07e 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -36,6 +36,8 @@ struct internal_dev {
 	struct vport *vport;
 };
 
+static struct vport_ops ovs_internal_vport_ops;
+
 static struct internal_dev *internal_dev_priv(struct net_device *netdev)
 {
 	return netdev_priv(netdev);
@@ -238,7 +240,7 @@ static int internal_dev_recv(struct vport *vport, struct sk_buff *skb)
 	return len;
 }
 
-const struct vport_ops ovs_internal_vport_ops = {
+static struct vport_ops ovs_internal_vport_ops = {
 	.type		= OVS_VPORT_TYPE_INTERNAL,
 	.create		= internal_dev_create,
 	.destroy	= internal_dev_destroy,
@@ -261,10 +263,21 @@ struct vport *ovs_internal_dev_get_vport(struct net_device *netdev)
 
 int ovs_internal_dev_rtnl_link_register(void)
 {
-	return rtnl_link_register(&internal_dev_link_ops);
+	int err;
+
+	err = rtnl_link_register(&internal_dev_link_ops);
+	if (err < 0)
+		return err;
+
+	err = ovs_vport_ops_register(&ovs_internal_vport_ops);
+	if (err < 0)
+		rtnl_link_unregister(&internal_dev_link_ops);
+
+	return err;
 }
 
 void ovs_internal_dev_rtnl_link_unregister(void)
 {
+	ovs_vport_ops_unregister(&ovs_internal_vport_ops);
 	rtnl_link_unregister(&internal_dev_link_ops);
 }
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index d21f77d..877ee74 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -33,6 +33,8 @@
 #include "vport-internal_dev.h"
 #include "vport-netdev.h"
 
+static struct vport_ops ovs_netdev_vport_ops;
+
 /* Must be called with rcu_read_lock. */
 static void netdev_port_receive(struct vport *vport, struct sk_buff *skb)
 {
@@ -224,10 +226,20 @@ struct vport *ovs_netdev_get_vport(struct net_device *dev)
 		return NULL;
 }
 
-const struct vport_ops ovs_netdev_vport_ops = {
+static struct vport_ops ovs_netdev_vport_ops = {
 	.type		= OVS_VPORT_TYPE_NETDEV,
 	.create		= netdev_create,
 	.destroy	= netdev_destroy,
 	.get_name	= ovs_netdev_get_name,
 	.send		= netdev_send,
 };
+
+int __init ovs_netdev_init(void)
+{
+	return ovs_vport_ops_register(&ovs_netdev_vport_ops);
+}
+
+void ovs_netdev_exit(void)
+{
+	ovs_vport_ops_unregister(&ovs_netdev_vport_ops);
+}
diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h
index 8df01c11..6f7038e 100644
--- a/net/openvswitch/vport-netdev.h
+++ b/net/openvswitch/vport-netdev.h
@@ -41,4 +41,7 @@ netdev_vport_priv(const struct vport *vport)
 const char *ovs_netdev_get_name(const struct vport *);
 void ovs_netdev_detach_dev(struct vport *);
 
+int __init ovs_netdev_init(void);
+void ovs_netdev_exit(void);
+
 #endif /* vport_netdev.h */
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index 2735e01..965e750 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -24,6 +24,7 @@
 #include <linux/net.h>
 #include <linux/rculist.h>
 #include <linux/udp.h>
+#include <linux/module.h>
 
 #include <net/icmp.h>
 #include <net/ip.h>
@@ -50,6 +51,8 @@ struct vxlan_port {
 	char name[IFNAMSIZ];
 };
 
+static struct vport_ops ovs_vxlan_vport_ops;
+
 static inline struct vxlan_port *vxlan_vport(const struct vport *vport)
 {
 	return vport_priv(vport);
@@ -192,11 +195,29 @@ static const char *vxlan_get_name(const struct vport *vport)
 	return vxlan_port->name;
 }
 
-const struct vport_ops ovs_vxlan_vport_ops = {
+static struct vport_ops ovs_vxlan_vport_ops = {
 	.type		= OVS_VPORT_TYPE_VXLAN,
 	.create		= vxlan_tnl_create,
 	.destroy	= vxlan_tnl_destroy,
 	.get_name	= vxlan_get_name,
 	.get_options	= vxlan_get_options,
 	.send		= vxlan_tnl_send,
+	.owner		= THIS_MODULE,
 };
+
+static int __init ovs_vxlan_tnl_init(void)
+{
+	return ovs_vport_ops_register(&ovs_vxlan_vport_ops);
+}
+
+static void __exit ovs_vxlan_tnl_exit(void)
+{
+	ovs_vport_ops_unregister(&ovs_vxlan_vport_ops);
+}
+
+module_init(ovs_vxlan_tnl_init);
+module_exit(ovs_vxlan_tnl_exit);
+
+MODULE_DESCRIPTION("OVS: VXLAN switching port");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("vport-type-4");
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 6015802..8168ef0 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -28,6 +28,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/compat.h>
 #include <net/net_namespace.h>
+#include <linux/module.h>
 
 #include "datapath.h"
 #include "vport.h"
@@ -36,22 +37,7 @@
 static void ovs_vport_record_error(struct vport *,
 				   enum vport_err_type err_type);
 
-/* List of statically compiled vport implementations.  Don't forget to also
- * add yours to the list at the bottom of vport.h. */
-static const struct vport_ops *vport_ops_list[] = {
-	&ovs_netdev_vport_ops,
-	&ovs_internal_vport_ops,
-
-#ifdef CONFIG_OPENVSWITCH_GRE
-	&ovs_gre_vport_ops,
-#endif
-#ifdef CONFIG_OPENVSWITCH_VXLAN
-	&ovs_vxlan_vport_ops,
-#endif
-#ifdef CONFIG_OPENVSWITCH_GENEVE
-	&ovs_geneve_vport_ops,
-#endif
-};
+static LIST_HEAD(vport_ops_list);
 
 /* Protected by RCU read lock for reading, ovs_mutex for writing. */
 static struct hlist_head *dev_table;
@@ -88,6 +74,32 @@ static struct hlist_head *hash_bucket(struct net *net, const char *name)
 	return &dev_table[hash & (VPORT_HASH_BUCKETS - 1)];
 }
 
+int ovs_vport_ops_register(struct vport_ops *ops)
+{
+	int err = -EEXIST;
+	struct vport_ops *o;
+
+	ovs_lock();
+	list_for_each_entry(o, &vport_ops_list, list)
+		if (ops->type == o->type)
+			goto errout;
+
+	list_add_tail(&ops->list, &vport_ops_list);
+	err = 0;
+errout:
+	ovs_unlock();
+	return err;
+}
+EXPORT_SYMBOL(ovs_vport_ops_register);
+
+void ovs_vport_ops_unregister(struct vport_ops *ops)
+{
+	ovs_lock();
+	list_del(&ops->list);
+	ovs_unlock();
+}
+EXPORT_SYMBOL(ovs_vport_ops_unregister);
+
 /**
  *	ovs_vport_locate - find a port that has already been created
  *
@@ -153,6 +165,7 @@ struct vport *ovs_vport_alloc(int priv_size, const struct vport_ops *ops,
 
 	return vport;
 }
+EXPORT_SYMBOL(ovs_vport_alloc);
 
 /**
  *	ovs_vport_free - uninitialize and free vport
@@ -173,6 +186,18 @@ void ovs_vport_free(struct vport *vport)
 	free_percpu(vport->percpu_stats);
 	kfree(vport);
 }
+EXPORT_SYMBOL(ovs_vport_free);
+
+static struct vport_ops *ovs_vport_lookup(const struct vport_parms *parms)
+{
+	struct vport_ops *ops;
+
+	list_for_each_entry(ops, &vport_ops_list, list)
+		if (ops->type == parms->type)
+			return ops;
+
+	return NULL;
+}
 
 /**
  *	ovs_vport_add - add vport device (for kernel callers)
@@ -184,31 +209,40 @@ void ovs_vport_free(struct vport *vport)
  */
 struct vport *ovs_vport_add(const struct vport_parms *parms)
 {
+	struct vport_ops *ops;
 	struct vport *vport;
-	int err = 0;
-	int i;
 
-	for (i = 0; i < ARRAY_SIZE(vport_ops_list); i++) {
-		if (vport_ops_list[i]->type == parms->type) {
-			struct hlist_head *bucket;
+	ops = ovs_vport_lookup(parms);
+	if (ops) {
+		struct hlist_head *bucket;
 
-			vport = vport_ops_list[i]->create(parms);
-			if (IS_ERR(vport)) {
-				err = PTR_ERR(vport);
-				goto out;
-			}
+		if (!try_module_get(ops->owner))
+			return ERR_PTR(-EAFNOSUPPORT);
 
-			bucket = hash_bucket(ovs_dp_get_net(vport->dp),
-					     vport->ops->get_name(vport));
-			hlist_add_head_rcu(&vport->hash_node, bucket);
+		vport = ops->create(parms);
+		if (IS_ERR(vport)) {
+			module_put(ops->owner);
 			return vport;
 		}
+
+		bucket = hash_bucket(ovs_dp_get_net(vport->dp),
+				     vport->ops->get_name(vport));
+		hlist_add_head_rcu(&vport->hash_node, bucket);
+		return vport;
 	}
 
-	err = -EAFNOSUPPORT;
+	/* Unlock to attempt module load and return -EAGAIN if load
+	 * was successful as we need to restart the port addition
+	 * workflow.
+	 */
+	ovs_unlock();
+	request_module("vport-type-%d", parms->type);
+	ovs_lock();
 
-out:
-	return ERR_PTR(err);
+	if (!ovs_vport_lookup(parms))
+		return ERR_PTR(-EAFNOSUPPORT);
+	else
+		return ERR_PTR(-EAGAIN);
 }
 
 /**
@@ -242,6 +276,8 @@ void ovs_vport_del(struct vport *vport)
 	hlist_del_rcu(&vport->hash_node);
 
 	vport->ops->destroy(vport);
+
+	module_put(vport->ops->owner);
 }
 
 /**
@@ -457,6 +493,7 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
 	}
 	ovs_dp_process_packet(skb, &key);
 }
+EXPORT_SYMBOL(ovs_vport_receive);
 
 /**
  *	ovs_vport_send - send a packet on a device
@@ -535,3 +572,4 @@ void ovs_vport_deferred_free(struct vport *vport)
 
 	call_rcu(&vport->rcu, free_vport_rcu);
 }
+EXPORT_SYMBOL(ovs_vport_deferred_free);
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 8942125..e41c3fa 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -161,6 +161,9 @@ struct vport_ops {
 	const char *(*get_name)(const struct vport *);
 
 	int (*send)(struct vport *, struct sk_buff *);
+
+	struct module *owner;
+	struct list_head list;
 };
 
 enum vport_err_type {
@@ -209,14 +212,6 @@ static inline struct vport *vport_from_priv(void *priv)
 void ovs_vport_receive(struct vport *, struct sk_buff *,
 		       struct ovs_tunnel_info *);
 
-/* List of statically compiled vport implementations.  Don't forget to also
- * add yours to the list at the top of vport.c. */
-extern const struct vport_ops ovs_netdev_vport_ops;
-extern const struct vport_ops ovs_internal_vport_ops;
-extern const struct vport_ops ovs_gre_vport_ops;
-extern const struct vport_ops ovs_vxlan_vport_ops;
-extern const struct vport_ops ovs_geneve_vport_ops;
-
 static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb,
 				      const void *start, unsigned int len)
 {
@@ -224,4 +219,7 @@ static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb,
 		skb->csum = csum_add(skb->csum, csum_partial(start, len, 0));
 }
 
+int ovs_vport_ops_register(struct vport_ops *ops);
+void ovs_vport_ops_unregister(struct vport_ops *ops);
+
 #endif /* vport.h */
-- 
1.9.3

^ permalink raw reply related

* [PATCH] net: fec: ptp: fix NULL pointer dereference if ptp_clock is not set
From: Philipp Zabel @ 2014-10-22 14:34 UTC (permalink / raw)
  To: David S. Miller, Luwei Zhou; +Cc: netdev, Philipp Zabel

Since commit 278d24047891 (net: fec: ptp: Enable PPS output based on ptp clock)
fec_enet_interrupt calls fec_ptp_check_pps_event unconditionally, which calls
into ptp_clock_event. If fep->ptp_clock is NULL, ptp_clock_event tries to
dereference the NULL pointer.
Since on i.MX53 fep->bufdesc_ex is not set, fec_ptp_init is never called,
and fep->ptp_clock is NULL, which reliably causes a kernel panic.

This patch adds a check for fep->ptp_clock == NULL in fec_enet_interrupt.

Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
---
 drivers/net/ethernet/freescale/fec_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 81b96cf..50a851d 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1581,7 +1581,8 @@ fec_enet_interrupt(int irq, void *dev_id)
 		complete(&fep->mdio_done);
 	}
 
-	fec_ptp_check_pps_event(fep);
+	if (fep->ptp_clock)
+		fec_ptp_check_pps_event(fep);
 
 	return ret;
 }
-- 
2.1.1

^ permalink raw reply related

* Re: [PATCH] Documentation: ptp: Fix build failure on MIPS cross builds
From: Peter Foley @ 2014-10-22 14:09 UTC (permalink / raw)
  To: Richard Cochran
  Cc: David Daney, David Miller, markos.chandras, linux-mips, corbet,
	netdev, linux-doc@vger.kernel.org, LKML
In-Reply-To: <20141022080302.GA4037@localhost.localdomain>

On Wed, Oct 22, 2014 at 4:03 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> In the mean time, I would like to restore the testptp.mk that *does*
> cross compile, so that people may use the test program if they
> want. In fact I use this all the time, and so I am a bit annoyed that
> something working was deleted and replaced with something broken.

Sure, I didn't realize that anyone was actually using testptp.mk on a
regular basis.
Feel free to restore it.

^ permalink raw reply

* [PATCH] igb: don't reuse pages with pfmemalloc flag
From: Roman Gushchin @ 2014-10-22 13:50 UTC (permalink / raw)
  To: jeffrey.t.kirsher, jesse.brandeburg, bruce.w.allan,
	carolyn.wyborny, donald.c.skidmore, gregory.v.rose,
	peter.p.waskiewicz.jr, alexander.h.duyck, john.ronciak,
	tushar.n.dave, davem, sassmann, gregkh, e1000-devel
  Cc: netdev, linux-kernel, Roman Gushchin

Incoming packet is dropped silently by sk_filter(), if the skb was
allocated from pfmemalloc reserves and the corresponding socket is
not marked with the SOCK_MEMALLOC flag.

Igb driver allocates pages for DMA with __skb_alloc_page(), which
calls alloc_pages_node() with the __GFP_MEMALLOC flag. So, in case
of OOM condition, igb can get pages with pfmemalloc flag set.

If an incoming packet hits the pfmemalloc page and is large enough
(small packets are copying into the memory, allocated with
netdev_alloc_skb_ip_align(), so they are not affected), it will be
dropped.

This behavior is ok under high memory pressure, but the problem is
that the igb driver reuses these mapped pages. So, packets are still
dropping even if all memory issues are gone and there is a plenty
of free memory.

In my case, some TCP sessions hang on a small percentage (< 0.1%)
of machines days after OOMs.

Fix this by avoiding reuse of such pages.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 0d4c897..6586392 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6178,6 +6178,9 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
 	if (unlikely(page_to_nid(page) != numa_node_id()))
 		return false;
 
+	if (unlikely(page->pfmemalloc))
+		return false;
+
 #if (PAGE_SIZE < 8192)
 	/* if we are only owner of page we can reuse it */
 	if (unlikely(page_count(page) != 1))
@@ -6245,7 +6248,8 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
 		memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
 
 		/* we can reuse buffer as-is, just make sure it is local */
-		if (likely(page_to_nid(page) == numa_node_id()))
+		if (likely((page_to_nid(page) == numa_node_id()) &&
+			   !page->pfmemalloc))
 			return true;
 
 		/* this page cannot be reused so discard it */
-- 
1.9.3

^ permalink raw reply related

* [PATCH 1/3] xen-netback: make feature-rx-notify mandatory
From: David Vrabel @ 2014-10-22 13:08 UTC (permalink / raw)
  To: netdev; +Cc: David Vrabel, xen-devel, Ian Campbell, Wei Liu
In-Reply-To: <1413983335-8307-1-git-send-email-david.vrabel@citrix.com>

Frontends that do not provide feature-rx-notify may stall because
netback depends on the notification from frontend to wake the guest Rx
thread (even if can_queue is false).

This could be fixed but feature-rx-notify was introduced in 2006 and I
am not aware of any frontends that do not implement this.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 drivers/net/xen-netback/common.h    |    5 -----
 drivers/net/xen-netback/interface.c |   12 +-----------
 drivers/net/xen-netback/xenbus.c    |   13 ++++---------
 3 files changed, 5 insertions(+), 25 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index d4eb8d2..93ca77c 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -228,9 +228,6 @@ struct xenvif {
 	u8 ip_csum:1;
 	u8 ipv6_csum:1;
 
-	/* Internal feature information. */
-	u8 can_queue:1;	    /* can queue packets for receiver? */
-
 	/* Is this interface disabled? True when backend discovers
 	 * frontend is rogue.
 	 */
@@ -272,8 +269,6 @@ void xenvif_xenbus_fini(void);
 
 int xenvif_schedulable(struct xenvif *vif);
 
-int xenvif_must_stop_queue(struct xenvif_queue *queue);
-
 int xenvif_queue_stopped(struct xenvif_queue *queue);
 void xenvif_wake_queue(struct xenvif_queue *queue);
 
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index f379689..c6759b1 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -60,16 +60,6 @@ void xenvif_skb_zerocopy_complete(struct xenvif_queue *queue)
 	atomic_dec(&queue->inflight_packets);
 }
 
-static inline void xenvif_stop_queue(struct xenvif_queue *queue)
-{
-	struct net_device *dev = queue->vif->dev;
-
-	if (!queue->vif->can_queue)
-		return;
-
-	netif_tx_stop_queue(netdev_get_tx_queue(dev, queue->id));
-}
-
 int xenvif_schedulable(struct xenvif *vif)
 {
 	return netif_running(vif->dev) &&
@@ -209,7 +199,7 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (!xenvif_rx_ring_slots_available(queue, min_slots_needed)) {
 		queue->rx_stalled.function = xenvif_rx_stalled;
 		queue->rx_stalled.data = (unsigned long)queue;
-		xenvif_stop_queue(queue);
+		netif_tx_stop_queue(netdev_get_tx_queue(dev, queue->id));
 		mod_timer(&queue->rx_stalled,
 			  jiffies + rx_drain_timeout_jiffies);
 	}
diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 8079c31..9060857 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -873,15 +873,10 @@ static int read_xenbus_vif_flags(struct backend_info *be)
 	if (!rx_copy)
 		return -EOPNOTSUPP;
 
-	if (vif->dev->tx_queue_len != 0) {
-		if (xenbus_scanf(XBT_NIL, dev->otherend,
-				 "feature-rx-notify", "%d", &val) < 0)
-			val = 0;
-		if (val)
-			vif->can_queue = 1;
-		else
-			/* Must be non-zero for pfifo_fast to work. */
-			vif->dev->tx_queue_len = 1;
+	if (xenbus_scanf(XBT_NIL, dev->otherend,
+			 "feature-rx-notify", "%d", &val) < 0 || val == 0) {
+		xenbus_dev_fatal(dev, -EINVAL, "feature-rx-notify is mandatory");
+		return -EINVAL;
 	}
 
 	if (xenbus_scanf(XBT_NIL, dev->otherend, "feature-sg",
-- 
1.7.10.4

^ permalink raw reply related

* [PATCHv1 0/3 net-next] xen-netback: guest Rx queue drain and stall fixes
From: David Vrabel @ 2014-10-22 13:08 UTC (permalink / raw)
  To: netdev; +Cc: David Vrabel, xen-devel, Ian Campbell, Wei Liu

This series fixes two critical xen-netback bugs.

1. Netback may consume all of host memory by queuing an unlimited
   number of skb on the internal guest Rx queue.  This behaviour is
   guest triggerable.

2. Carrier flapping under high traffic rates which reduces
   performance.

The first patch is a prerequite.  Removing support for frontends with
feature-rx-notify makes it easier to reason about the correctness of
netback since it no longer has to support this outdated and broken
mode.

David

^ permalink raw reply

* [PATCH 2/3] xen-netback: fix unlimited guest Rx internal queue and carrier flapping
From: David Vrabel @ 2014-10-22 13:08 UTC (permalink / raw)
  To: netdev; +Cc: David Vrabel, xen-devel, Ian Campbell, Wei Liu
In-Reply-To: <1413983335-8307-1-git-send-email-david.vrabel@citrix.com>

Netback needs to discard old to-guest skb's (guest Rx queue drain) and
it needs detect guest Rx stalls (to disable the carrier so packets are
discarded earlier), but the current implementation is very broken.

1. The check in hard_start_xmit of the slot availability did not
   consider the number of packets that were already in the guest Rx
   queue.  This could allow the queue to grow without bound.

   The guest stops consuming packets and the ring was allowed to fill
   leaving S slot free.  Netback queues a packet requiring more than S
   slots (ensuring that the ring stays with S slots free).  Netback
   queue indefinately packets provided that then require S or fewer
   slots.

2. The Rx stall detection is not triggered in this case since the
   (host) Tx queue is not stopped.

3. If the Tx queue is stopped and a guest Rx interrupt occurs, netback
   will consider this an Rx purge event which may result in it taking
   the carrier down unnecessarily.  It also considers a queue with
   only 1 slot free as unstalled (even though the next packet might
   not fit in this).

The internal guest Rx queue is limited by a byte length (to 512 Kib,
enough for half the ring).  The (host) Tx queue is stopped and started
based on this limit.  This sets an upper bound on the amount of memory
used by packets on the internal queue.

This allows the estimatation of the number of slots for an skb to be
removed (it wasn't a very good estimate anyway).  Instead, the guest
Rx thread just waits for enough free slots for a maximum sized packet.

skbs queued on the internal queue have an 'expires' time (set to the
current time plus the drain timeout).  The guest Rx thread will detect
when the skb at the head of the queue has expired and discard expired
skbs.  This sets a clear upper bound on the length of time an skb can
be queued for.  For a guest being destroyed the maximum time needed to
wait for all the packets it sent to be dropped is still the drain
timeout (10 s) since it will not be sending new packets.

Rx stall detection is reintroduced in a later commit.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 drivers/net/xen-netback/common.h    |   29 +++--
 drivers/net/xen-netback/interface.c |   59 ++-------
 drivers/net/xen-netback/netback.c   |  243 ++++++++++++++++++-----------------
 drivers/net/xen-netback/xenbus.c    |    8 ++
 4 files changed, 161 insertions(+), 178 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 93ca77c..c264240 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -176,10 +176,9 @@ struct xenvif_queue { /* Per-queue data for xenvif */
 	char rx_irq_name[IRQ_NAME_SIZE]; /* DEVNAME-qN-rx */
 	struct xen_netif_rx_back_ring rx;
 	struct sk_buff_head rx_queue;
-	RING_IDX rx_last_skb_slots;
-	unsigned long status;
 
-	struct timer_list rx_stalled;
+	unsigned int rx_queue_max;
+	unsigned int rx_queue_len;
 
 	struct gnttab_copy grant_copy_op[MAX_GRANT_COPY_OPS];
 
@@ -199,18 +198,14 @@ struct xenvif_queue { /* Per-queue data for xenvif */
 	struct xenvif_stats stats;
 };
 
+/* Maximum number of Rx slots a to-guest packet may use, including the
+ * slot needed for GSO meta-data.
+ */
+#define XEN_NETBK_RX_SLOTS_MAX (MAX_SKB_FRAGS + 1)
+
 enum state_bit_shift {
 	/* This bit marks that the vif is connected */
 	VIF_STATUS_CONNECTED,
-	/* This bit signals the RX thread that queuing was stopped (in
-	 * start_xmit), and either the timer fired or an RX interrupt came
-	 */
-	QUEUE_STATUS_RX_PURGE_EVENT,
-	/* This bit tells the interrupt handler that this queue was the reason
-	 * for the carrier off, so it should kick the thread. Only queues which
-	 * brought it down can turn on the carrier.
-	 */
-	QUEUE_STATUS_RX_STALLED
 };
 
 struct xenvif {
@@ -246,6 +241,14 @@ struct xenvif {
 	struct net_device *dev;
 };
 
+struct xenvif_rx_cb {
+	unsigned long expires;
+	int meta_slots_used;
+	bool full_coalesce;
+};
+
+#define XENVIF_RX_CB(skb) ((struct xenvif_rx_cb *)(skb)->cb)
+
 static inline struct xenbus_device *xenvif_to_xenbus_device(struct xenvif *vif)
 {
 	return to_xenbus_device(vif->dev->dev.parent);
@@ -291,6 +294,8 @@ void xenvif_kick_thread(struct xenvif_queue *queue);
 
 int xenvif_dealloc_kthread(void *data);
 
+void xenvif_rx_queue_tail(struct xenvif_queue *queue, struct sk_buff *skb);
+
 /* Determine whether the needed number of slots (req) are available,
  * and set req_event if not.
  */
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index c6759b1..a134d52 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -43,6 +43,9 @@
 #define XENVIF_QUEUE_LENGTH 32
 #define XENVIF_NAPI_WEIGHT  64
 
+/* Number of bytes allowed on the internal guest Rx queue. */
+#define XENVIF_RX_QUEUE_BYTES (XEN_NETIF_RX_RING_SIZE/2 * PAGE_SIZE)
+
 /* This function is used to set SKBTX_DEV_ZEROCOPY as well as
  * increasing the inflight counter. We need to increase the inflight
  * counter because core driver calls into xenvif_zerocopy_callback
@@ -63,7 +66,8 @@ void xenvif_skb_zerocopy_complete(struct xenvif_queue *queue)
 int xenvif_schedulable(struct xenvif *vif)
 {
 	return netif_running(vif->dev) &&
-		test_bit(VIF_STATUS_CONNECTED, &vif->status);
+		test_bit(VIF_STATUS_CONNECTED, &vif->status) &&
+		!vif->disabled;
 }
 
 static irqreturn_t xenvif_tx_interrupt(int irq, void *dev_id)
@@ -104,16 +108,7 @@ int xenvif_poll(struct napi_struct *napi, int budget)
 static irqreturn_t xenvif_rx_interrupt(int irq, void *dev_id)
 {
 	struct xenvif_queue *queue = dev_id;
-	struct netdev_queue *net_queue =
-		netdev_get_tx_queue(queue->vif->dev, queue->id);
 
-	/* QUEUE_STATUS_RX_PURGE_EVENT is only set if either QDisc was off OR
-	 * the carrier went down and this queue was previously blocked
-	 */
-	if (unlikely(netif_tx_queue_stopped(net_queue) ||
-		     (!netif_carrier_ok(queue->vif->dev) &&
-		      test_bit(QUEUE_STATUS_RX_STALLED, &queue->status))))
-		set_bit(QUEUE_STATUS_RX_PURGE_EVENT, &queue->status);
 	xenvif_kick_thread(queue);
 
 	return IRQ_HANDLED;
@@ -141,24 +136,13 @@ void xenvif_wake_queue(struct xenvif_queue *queue)
 	netif_tx_wake_queue(netdev_get_tx_queue(dev, id));
 }
 
-/* Callback to wake the queue's thread and turn the carrier off on timeout */
-static void xenvif_rx_stalled(unsigned long data)
-{
-	struct xenvif_queue *queue = (struct xenvif_queue *)data;
-
-	if (xenvif_queue_stopped(queue)) {
-		set_bit(QUEUE_STATUS_RX_PURGE_EVENT, &queue->status);
-		xenvif_kick_thread(queue);
-	}
-}
-
 static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct xenvif *vif = netdev_priv(dev);
 	struct xenvif_queue *queue = NULL;
 	unsigned int num_queues = vif->num_queues;
 	u16 index;
-	int min_slots_needed;
+	struct xenvif_rx_cb *cb;
 
 	BUG_ON(skb->dev != dev);
 
@@ -181,30 +165,10 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	    !xenvif_schedulable(vif))
 		goto drop;
 
-	/* At best we'll need one slot for the header and one for each
-	 * frag.
-	 */
-	min_slots_needed = 1 + skb_shinfo(skb)->nr_frags;
+	cb = XENVIF_RX_CB(skb);
+	cb->expires = jiffies + rx_drain_timeout_jiffies;
 
-	/* If the skb is GSO then we'll also need an extra slot for the
-	 * metadata.
-	 */
-	if (skb_is_gso(skb))
-		min_slots_needed++;
-
-	/* If the skb can't possibly fit in the remaining slots
-	 * then turn off the queue to give the ring a chance to
-	 * drain.
-	 */
-	if (!xenvif_rx_ring_slots_available(queue, min_slots_needed)) {
-		queue->rx_stalled.function = xenvif_rx_stalled;
-		queue->rx_stalled.data = (unsigned long)queue;
-		netif_tx_stop_queue(netdev_get_tx_queue(dev, queue->id));
-		mod_timer(&queue->rx_stalled,
-			  jiffies + rx_drain_timeout_jiffies);
-	}
-
-	skb_queue_tail(&queue->rx_queue, skb);
+	xenvif_rx_queue_tail(queue, skb);
 	xenvif_kick_thread(queue);
 
 	return NETDEV_TX_OK;
@@ -498,6 +462,8 @@ int xenvif_init_queue(struct xenvif_queue *queue)
 	init_timer(&queue->credit_timeout);
 	queue->credit_window_start = get_jiffies_64();
 
+	queue->rx_queue_max = XENVIF_RX_QUEUE_BYTES;
+
 	skb_queue_head_init(&queue->rx_queue);
 	skb_queue_head_init(&queue->tx_queue);
 
@@ -529,8 +495,6 @@ int xenvif_init_queue(struct xenvif_queue *queue)
 		queue->grant_tx_handle[i] = NETBACK_INVALID_HANDLE;
 	}
 
-	init_timer(&queue->rx_stalled);
-
 	return 0;
 }
 
@@ -664,7 +628,6 @@ void xenvif_disconnect(struct xenvif *vif)
 		netif_napi_del(&queue->napi);
 
 		if (queue->task) {
-			del_timer_sync(&queue->rx_stalled);
 			kthread_stop(queue->task);
 			queue->task = NULL;
 		}
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 08f6599..15893dc 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -55,8 +55,8 @@
 bool separate_tx_rx_irq = 1;
 module_param(separate_tx_rx_irq, bool, 0644);
 
-/* When guest ring is filled up, qdisc queues the packets for us, but we have
- * to timeout them, otherwise other guests' packets can get stuck there
+/* The time that packets can stay on the guest Rx internal queue
+ * before they are dropped.
  */
 unsigned int rx_drain_timeout_msecs = 10000;
 module_param(rx_drain_timeout_msecs, uint, 0444);
@@ -83,7 +83,6 @@ static void make_tx_response(struct xenvif_queue *queue,
 			     s8       st);
 
 static inline int tx_work_todo(struct xenvif_queue *queue);
-static inline int rx_work_todo(struct xenvif_queue *queue);
 
 static struct xen_netif_rx_response *make_rx_response(struct xenvif_queue *queue,
 					     u16      id,
@@ -163,6 +162,69 @@ bool xenvif_rx_ring_slots_available(struct xenvif_queue *queue, int needed)
 	return false;
 }
 
+void xenvif_rx_queue_tail(struct xenvif_queue *queue, struct sk_buff *skb)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&queue->rx_queue.lock, flags);
+
+	__skb_queue_tail(&queue->rx_queue, skb);
+
+	queue->rx_queue_len += skb->len;
+	if (queue->rx_queue_len > queue->rx_queue_max)
+		netif_tx_stop_queue(netdev_get_tx_queue(queue->vif->dev, queue->id));
+
+	spin_unlock_irqrestore(&queue->rx_queue.lock, flags);
+}
+
+static struct sk_buff *xenvif_rx_dequeue(struct xenvif_queue *queue)
+{
+	struct sk_buff *skb;
+
+	spin_lock_irq(&queue->rx_queue.lock);
+
+	skb = __skb_dequeue(&queue->rx_queue);
+	if (skb)
+		queue->rx_queue_len -= skb->len;
+
+	spin_unlock_irq(&queue->rx_queue.lock);
+
+	return skb;
+}
+
+static void xenvif_rx_queue_maybe_wake(struct xenvif_queue *queue)
+{
+	spin_lock_irq(&queue->rx_queue.lock);
+
+	if (queue->rx_queue_len < queue->rx_queue_max)
+		netif_tx_wake_queue(netdev_get_tx_queue(queue->vif->dev, queue->id));
+
+	spin_unlock_irq(&queue->rx_queue.lock);
+}
+
+
+static void xenvif_rx_queue_purge(struct xenvif_queue *queue)
+{
+	struct sk_buff *skb;
+	while ((skb = xenvif_rx_dequeue(queue)) != NULL)
+		kfree_skb(skb);
+}
+
+static void xenvif_rx_queue_drop_expired(struct xenvif_queue *queue)
+{
+	struct sk_buff *skb;
+
+	for(;;) {
+		skb = skb_peek(&queue->rx_queue);
+		if (!skb)
+			break;
+		if (time_before(jiffies, XENVIF_RX_CB(skb)->expires))
+			break;
+		xenvif_rx_dequeue(queue);
+		kfree_skb(skb);
+	}
+}
+
 /*
  * Returns true if we should start a new receive buffer instead of
  * adding 'size' bytes to a buffer which currently contains 'offset'
@@ -237,13 +299,6 @@ static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif_queue *queue,
 	return meta;
 }
 
-struct xenvif_rx_cb {
-	int meta_slots_used;
-	bool full_coalesce;
-};
-
-#define XENVIF_RX_CB(skb) ((struct xenvif_rx_cb *)(skb)->cb)
-
 /*
  * Set up the grant operations for this fragment. If it's a flipping
  * interface, we also set up the unmap request from here.
@@ -587,7 +642,8 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
 
 	skb_queue_head_init(&rxq);
 
-	while ((skb = skb_dequeue(&queue->rx_queue)) != NULL) {
+	while (xenvif_rx_ring_slots_available(queue, XEN_NETBK_RX_SLOTS_MAX)
+	       && (skb = xenvif_rx_dequeue(queue)) != NULL) {
 		RING_IDX max_slots_needed;
 		RING_IDX old_req_cons;
 		RING_IDX ring_slots_used;
@@ -634,15 +690,6 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
 		    skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6))
 			max_slots_needed++;
 
-		/* If the skb may not fit then bail out now */
-		if (!xenvif_rx_ring_slots_available(queue, max_slots_needed)) {
-			skb_queue_head(&queue->rx_queue, skb);
-			need_to_notify = true;
-			queue->rx_last_skb_slots = max_slots_needed;
-			break;
-		} else
-			queue->rx_last_skb_slots = 0;
-
 		old_req_cons = queue->rx.req_cons;
 		XENVIF_RX_CB(skb)->meta_slots_used = xenvif_gop_skb(skb, &npo, queue);
 		ring_slots_used = queue->rx.req_cons - old_req_cons;
@@ -1869,12 +1916,6 @@ void xenvif_idx_unmap(struct xenvif_queue *queue, u16 pending_idx)
 	}
 }
 
-static inline int rx_work_todo(struct xenvif_queue *queue)
-{
-	return (!skb_queue_empty(&queue->rx_queue) &&
-	       xenvif_rx_ring_slots_available(queue, queue->rx_last_skb_slots));
-}
-
 static inline int tx_work_todo(struct xenvif_queue *queue)
 {
 	if (likely(RING_HAS_UNCONSUMED_REQUESTS(&queue->tx)))
@@ -1931,92 +1972,64 @@ err:
 	return err;
 }
 
-static void xenvif_start_queue(struct xenvif_queue *queue)
+static bool xenvif_have_rx_work(struct xenvif_queue *queue)
 {
-	if (xenvif_schedulable(queue->vif))
-		xenvif_wake_queue(queue);
+	return (!skb_queue_empty(&queue->rx_queue)
+		&& xenvif_rx_ring_slots_available(queue, XEN_NETBK_RX_SLOTS_MAX))
+		|| kthread_should_stop()
+		|| queue->vif->disabled;
 }
 
-/* Only called from the queue's thread, it handles the situation when the guest
- * doesn't post enough requests on the receiving ring.
- * First xenvif_start_xmit disables QDisc and start a timer, and then either the
- * timer fires, or the guest send an interrupt after posting new request. If it
- * is the timer, the carrier is turned off here.
- * */
-static void xenvif_rx_purge_event(struct xenvif_queue *queue)
+static long xenvif_rx_queue_timeout(struct xenvif_queue *queue)
 {
-	/* Either the last unsuccesful skb or at least 1 slot should fit */
-	int needed = queue->rx_last_skb_slots ?
-		     queue->rx_last_skb_slots : 1;
+	struct sk_buff *skb;
+	long timeout;
 
-	/* It is assumed that if the guest post new slots after this, the RX
-	 * interrupt will set the QUEUE_STATUS_RX_PURGE_EVENT bit and wake up
-	 * the thread again
-	 */
-	set_bit(QUEUE_STATUS_RX_STALLED, &queue->status);
-	if (!xenvif_rx_ring_slots_available(queue, needed)) {
-		rtnl_lock();
-		if (netif_carrier_ok(queue->vif->dev)) {
-			/* Timer fired and there are still no slots. Turn off
-			 * everything except the interrupts
-			 */
-			netif_carrier_off(queue->vif->dev);
-			skb_queue_purge(&queue->rx_queue);
-			queue->rx_last_skb_slots = 0;
-			if (net_ratelimit())
-				netdev_err(queue->vif->dev, "Carrier off due to lack of guest response on queue %d\n", queue->id);
-		} else {
-			/* Probably an another queue already turned the carrier
-			 * off, make sure nothing is stucked in the internal
-			 * queue of this queue
-			 */
-			skb_queue_purge(&queue->rx_queue);
-			queue->rx_last_skb_slots = 0;
-		}
-		rtnl_unlock();
-	} else if (!netif_carrier_ok(queue->vif->dev)) {
-		unsigned int num_queues = queue->vif->num_queues;
-		unsigned int i;
-		/* The carrier was down, but an interrupt kicked
-		 * the thread again after new requests were
-		 * posted
-		 */
-		clear_bit(QUEUE_STATUS_RX_STALLED,
-			  &queue->status);
-		rtnl_lock();
-		netif_carrier_on(queue->vif->dev);
-		netif_tx_wake_all_queues(queue->vif->dev);
-		rtnl_unlock();
+	skb = skb_peek(&queue->rx_queue);
+	if (!skb)
+		return MAX_SCHEDULE_TIMEOUT;
 
-		for (i = 0; i < num_queues; i++) {
-			struct xenvif_queue *temp = &queue->vif->queues[i];
+	timeout = XENVIF_RX_CB(skb)->expires - jiffies;
+	return timeout < 0 ? 0 : timeout;
+}
 
-			xenvif_napi_schedule_or_enable_events(temp);
-		}
-		if (net_ratelimit())
-			netdev_err(queue->vif->dev, "Carrier on again\n");
-	} else {
-		/* Queuing were stopped, but the guest posted
-		 * new requests and sent an interrupt
-		 */
-		clear_bit(QUEUE_STATUS_RX_STALLED,
-			  &queue->status);
-		del_timer_sync(&queue->rx_stalled);
-		xenvif_start_queue(queue);
+/* Wait until the guest Rx thread has work.
+ *
+ * The timeout needs to be adjusted based on the current head of the
+ * queue (and not just the head at the beginning).  In particular, if
+ * the queue is initially empty an infinite timeout is used and this
+ * needs to be reduced when a skb is queued.
+ * 
+ * This cannot be done with wait_event_timeout() because it only
+ * calculates the timeout once.
+ */
+static void xenvif_wait_for_rx_work(struct xenvif_queue *queue)
+{
+	DEFINE_WAIT(wait);
+
+	if (xenvif_have_rx_work(queue))
+		return;
+
+	for (;;) {
+		long ret;
+
+		prepare_to_wait(&queue->wq, &wait, TASK_INTERRUPTIBLE);
+		if (xenvif_have_rx_work(queue))
+			break;
+		ret = schedule_timeout(xenvif_rx_queue_timeout(queue));
+		if (!ret)
+			break;
 	}
+	finish_wait(&queue->wq, &wait);
 }
 
 int xenvif_kthread_guest_rx(void *data)
 {
 	struct xenvif_queue *queue = data;
-	struct sk_buff *skb;
+	struct xenvif *vif = queue->vif;
 
-	while (!kthread_should_stop()) {
-		wait_event_interruptible(queue->wq,
-					 rx_work_todo(queue) ||
-					 queue->vif->disabled ||
-					 test_bit(QUEUE_STATUS_RX_PURGE_EVENT, &queue->status) ||
-					 kthread_should_stop());
+	for (;;) {
+		xenvif_wait_for_rx_work(queue);
 
 		if (kthread_should_stop())
 			break;
@@ -2028,35 +2041,29 @@ int xenvif_kthread_guest_rx(void *data)
 		 * context so we defer it here, if this thread is
 		 * associated with queue 0.
 		 */
-		if (unlikely(queue->vif->disabled && queue->id == 0)) {
-			xenvif_carrier_off(queue->vif);
-		} else if (unlikely(queue->vif->disabled)) {
-			/* kthread_stop() would be called upon this thread soon,
-			 * be a bit proactive
-			 */
-			skb_queue_purge(&queue->rx_queue);
-			queue->rx_last_skb_slots = 0;
-		} else if (unlikely(test_and_clear_bit(QUEUE_STATUS_RX_PURGE_EVENT,
-						     &queue->status))) {
-			xenvif_rx_purge_event(queue);
-		} else if (!netif_carrier_ok(queue->vif->dev)) {
-			/* Another queue stalled and turned the carrier off, so
-			 * purge the internal queue of queues which were not
-			 * blocked
-			 */
-			skb_queue_purge(&queue->rx_queue);
-			queue->rx_last_skb_slots = 0;
+		if (unlikely(vif->disabled && queue->id == 0)) {
+			xenvif_carrier_off(vif);
+			xenvif_rx_queue_purge(queue);
+			continue;
 		}
 
 		if (!skb_queue_empty(&queue->rx_queue))
 			xenvif_rx_action(queue);
 
+		/* Queued packets may have foreign pages from other
+		 * domains.  These cannot be queued indefinitely as
+		 * this would starve guests of grant refs and transmit
+		 * slots.
+		 */
+		xenvif_rx_queue_drop_expired(queue);
+
+		xenvif_rx_queue_maybe_wake(queue);
+
 		cond_resched();
 	}
 
 	/* Bin any remaining skbs */
-	while ((skb = skb_dequeue(&queue->rx_queue)) != NULL)
-		dev_kfree_skb(skb);
+	xenvif_rx_queue_purge(queue);
 
 	return 0;
 }
diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 9060857..96a754d 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -52,6 +52,7 @@ static int xenvif_read_io_ring(struct seq_file *m, void *v)
 	struct xenvif_queue *queue = m->private;
 	struct xen_netif_tx_back_ring *tx_ring = &queue->tx;
 	struct xen_netif_rx_back_ring *rx_ring = &queue->rx;
+	struct netdev_queue *dev_queue;
 
 	if (tx_ring->sring) {
 		struct xen_netif_tx_sring *sring = tx_ring->sring;
@@ -112,6 +113,13 @@ static int xenvif_read_io_ring(struct seq_file *m, void *v)
 		   queue->credit_timeout.expires,
 		   jiffies);
 
+	dev_queue = netdev_get_tx_queue(queue->vif->dev, queue->id);
+
+	seq_printf(m, "\nRx internal queue: len %u max %u pkts %u %s\n",
+		   queue->rx_queue_len, queue->rx_queue_max,
+		   skb_queue_len(&queue->rx_queue),
+		   netif_tx_queue_stopped(dev_queue) ? "stopped" : "running");
+
 	return 0;
 }
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 3/3] xen-netback: reintroduce guest Rx stall detection
From: David Vrabel @ 2014-10-22 13:08 UTC (permalink / raw)
  To: netdev; +Cc: David Vrabel, xen-devel, Ian Campbell, Wei Liu
In-Reply-To: <1413983335-8307-1-git-send-email-david.vrabel@citrix.com>

If a frontend not receiving packets it is useful to detect this and
turn off the carrier so packets are dropped early instead of being
queued and drained when they expire.

A to-guest queue is stalled if it doesn't have enough free slots for a
an extended period of time (default 60 s).

If at least one queue is stalled, the carrier is turned off (in the
expectation that the other queues will soon stall as well).  The
carrier is only turned on once all queues are ready.

When the frontend connects, all the queues start in the stalled state
and only become ready once the frontend queues enough Rx requests.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 drivers/net/xen-netback/common.h    |    5 +++
 drivers/net/xen-netback/interface.c |    5 ++-
 drivers/net/xen-netback/netback.c   |   76 +++++++++++++++++++++++++++++++++++
 drivers/net/xen-netback/xenbus.c    |    1 +
 4 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index c264240..083ecc9 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -179,6 +179,8 @@ struct xenvif_queue { /* Per-queue data for xenvif */
 
 	unsigned int rx_queue_max;
 	unsigned int rx_queue_len;
+	unsigned long last_rx_time;
+	bool stalled;
 
 	struct gnttab_copy grant_copy_op[MAX_GRANT_COPY_OPS];
 
@@ -232,6 +234,9 @@ struct xenvif {
 	/* Queues */
 	struct xenvif_queue *queues;
 	unsigned int num_queues; /* active queues, resource allocated */
+	unsigned int stalled_queues;
+
+	spinlock_t lock;
 
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *xenvif_dbg_root;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index a134d52..895fe84 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -419,6 +419,8 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	vif->queues = NULL;
 	vif->num_queues = 0;
 
+	spin_lock_init(&vif->lock);
+
 	dev->netdev_ops	= &xenvif_netdev_ops;
 	dev->hw_features = NETIF_F_SG |
 		NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
@@ -505,7 +507,6 @@ void xenvif_carrier_on(struct xenvif *vif)
 		dev_set_mtu(vif->dev, ETH_DATA_LEN);
 	netdev_update_features(vif->dev);
 	set_bit(VIF_STATUS_CONNECTED, &vif->status);
-	netif_carrier_on(vif->dev);
 	if (netif_running(vif->dev))
 		xenvif_up(vif);
 	rtnl_unlock();
@@ -565,6 +566,8 @@ int xenvif_connect(struct xenvif_queue *queue, unsigned long tx_ring_ref,
 		disable_irq(queue->rx_irq);
 	}
 
+	queue->stalled = true;
+
 	task = kthread_create(xenvif_kthread_guest_rx,
 			      (void *)queue, "%s-guest-rx", queue->name);
 	if (IS_ERR(task)) {
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 15893dc..25f4c06 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -62,6 +62,13 @@ unsigned int rx_drain_timeout_msecs = 10000;
 module_param(rx_drain_timeout_msecs, uint, 0444);
 unsigned int rx_drain_timeout_jiffies;
 
+/* The length of time before the frontend is considered unresponsive
+ * because it isn't providing Rx slots.
+ */
+static unsigned int rx_stall_timeout_msecs = 60000;
+module_param(rx_stall_timeout_msecs, uint, 0444);
+static unsigned int rx_stall_timeout_jiffies;
+
 unsigned int xenvif_max_queues;
 module_param_named(max_queues, xenvif_max_queues, uint, 0644);
 MODULE_PARM_DESC(max_queues,
@@ -649,6 +656,8 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
 		RING_IDX ring_slots_used;
 		int i;
 
+		queue->last_rx_time = jiffies;
+
 		/* We need a cheap worse case estimate for the number of
 		 * slots we'll use.
 		 */
@@ -1972,10 +1981,67 @@ err:
 	return err;
 }
 
+static void xenvif_queue_carrier_off(struct xenvif_queue *queue)
+{
+	struct xenvif *vif = queue->vif;
+
+	queue->stalled = true;
+
+	/* At least one queue has stalled? Disable the carrier. */
+	spin_lock(&vif->lock);
+	if (vif->stalled_queues++ == 0) {
+		netdev_info(vif->dev, "Guest Rx stalled");
+		netif_carrier_off(vif->dev);
+	}
+	spin_unlock(&vif->lock);
+}
+
+static void xenvif_queue_carrier_on(struct xenvif_queue *queue)
+{
+	struct xenvif *vif = queue->vif;
+
+	queue->last_rx_time = jiffies; /* Reset Rx stall detection. */
+	queue->stalled = false;
+
+	/* All queues are ready? Enable the carrier. */
+	spin_lock(&vif->lock);
+	if (--vif->stalled_queues == 0) {
+		netdev_info(vif->dev, "Guest Rx ready");
+		netif_carrier_on(vif->dev);
+	}
+	spin_unlock(&vif->lock);
+}
+
+static bool xenvif_rx_queue_stalled(struct xenvif_queue *queue)
+{
+	RING_IDX prod, cons;
+
+	prod = queue->rx.sring->req_prod;
+	cons = queue->rx.req_cons;
+
+	return !queue->stalled
+		&& prod - cons < XEN_NETBK_RX_SLOTS_MAX
+		&& time_after(jiffies,
+			      queue->last_rx_time + rx_stall_timeout_jiffies);
+}
+ 
+static bool xenvif_rx_queue_ready(struct xenvif_queue *queue)
+{
+	RING_IDX prod, cons;
+
+	prod = queue->rx.sring->req_prod;
+	cons = queue->rx.req_cons;
+
+	return queue->stalled
+		&& prod - cons >= XEN_NETBK_RX_SLOTS_MAX;
+}
+
 static bool xenvif_have_rx_work(struct xenvif_queue *queue)
 {
 	return (!skb_queue_empty(&queue->rx_queue)
 		&& xenvif_rx_ring_slots_available(queue, XEN_NETBK_RX_SLOTS_MAX))
+		|| xenvif_rx_queue_stalled(queue)
+		|| xenvif_rx_queue_ready(queue)
 		|| kthread_should_stop()
 		|| queue->vif->disabled;
 }
@@ -2050,6 +2116,15 @@ int xenvif_kthread_guest_rx(void *data)
 		if (!skb_queue_empty(&queue->rx_queue))
 			xenvif_rx_action(queue);
 
+		/* If the guest hasn't provided any Rx slots for a
+		 * while it's probably not responsive, drop the
+		 * carrier so packets are dropped earlier.
+		 */
+		if (xenvif_rx_queue_stalled(queue))
+			xenvif_queue_carrier_off(queue);
+		else if (xenvif_rx_queue_ready(queue))
+			xenvif_queue_carrier_on(queue);
+
 		/* Queued packets may have foreign pages from other
 		 * domains.  These cannot be queued indefinitely as
 		 * this would starve guests of grant refs and transmit
@@ -2120,6 +2195,7 @@ static int __init netback_init(void)
 		goto failed_init;
 
 	rx_drain_timeout_jiffies = msecs_to_jiffies(rx_drain_timeout_msecs);
+	rx_stall_timeout_jiffies = msecs_to_jiffies(rx_stall_timeout_msecs);
 
 #ifdef CONFIG_DEBUG_FS
 	xen_netback_dbg_root = debugfs_create_dir("xen-netback", NULL);
diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index 96a754d..4e56a27 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -711,6 +711,7 @@ static void connect(struct backend_info *be)
 	be->vif->queues = vzalloc(requested_num_queues *
 				  sizeof(struct xenvif_queue));
 	be->vif->num_queues = requested_num_queues;
+	be->vif->stalled_queues = requested_num_queues;
 
 	for (queue_index = 0; queue_index < requested_num_queues; ++queue_index) {
 		queue = &be->vif->queues[queue_index];
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH net-next] sfc: add support for skb->xmit_more
From: Edward Cree @ 2014-10-22 12:36 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Robert Stonehouse, Daniel Borkmann, davem, nikolay, netdev,
	Shradha Shah, Jon Cooper, linux-net-drivers
In-Reply-To: <1413968323.5994.6.camel@decadent.org.uk>

On 22/10/14 09:58, Ben Hutchings wrote:
> On Fri, 2014-10-17 at 15:32 +0100, Edward Cree wrote:
> [...]
>> @@ -351,7 +343,7 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
>>  	unsigned short dma_flags;
>>  	int i = 0;
>>  
>> -	EFX_BUG_ON_PARANOID(tx_queue->write_count != tx_queue->insert_count);
>> +	EFX_BUG_ON_PARANOID(tx_queue->write_count > tx_queue->insert_count);
> [...]
>
> Doesn't this break after 2^32 descriptors?  It seems like you would need
> a similar comparison to time_after(); possibly:
>
> 	EFX_BUG_ON_PARANOID((int)(tx_queue->write_count - tx_queue->insert_count) > 0);
>
> Ben.
>
Yes, Jon already spotted that.  We just diked it out - see "[PATCH net]
sfc: remove incorrect EFX_BUG_ON_PARANOID check"
http://patchwork.ozlabs.org/patch/401503/
But your suggestion is a good one; feel free to post as a new patch on
that thread.
-Edward

^ permalink raw reply

* Re: [PATCH] netfilter: log: protect nf_log_register against double registering
From: Marcelo Ricardo Leitner @ 2014-10-22 12:33 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, netdev
In-Reply-To: <20141022120255.GA21821@salvia>

On 22-10-2014 10:02, Pablo Neira Ayuso wrote:
> On Mon, Oct 20, 2014 at 07:58:03PM -0200, Marcelo Ricardo Leitner wrote:
>> Currently, despite the comment right before the function,
>> nf_log_register allows registering two loggers on with the same type and
>> end up overwriting the previous register.
>>
>> Not a real issue today as current tree doesn't have two loggers for the
>> same type but it's better to get this protected.
>>
>> Also make sure that all of its callers do error checking.
>
> No major objetions to this sanity check. Some comment below.
>
>> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
>> ---
>>
>> Notes:
>>      Please let me know if you have any issues with the identation on
>>      nf_log_register. I just couldn't find a better one.
>
> You can split nf_log_register() in two functions to avoid this.

Sorry but I don't follow this one. You mean having the check on 
nf_log_register() and then calling a __nf_log_register() to actually register it?

Now I'm thinking on wrapping
             rcu_dereference_protected(loggers[i][logger->type],
+					lockdep_is_held(&nf_log_mutex))
into a macro or something like that, because that's the issue in there and 
this construction is called several times. Something like:

#define logger_deref_protected(pf, type) \
         rcu_dereference_protected(loggers[pf][type], \
                                   lockdep_is_held(&nf_log_mutex));

WDYT?

>>      Thanks
>>
>>   net/ipv4/netfilter/nf_log_arp.c  |  8 +++++++-
>>   net/ipv4/netfilter/nf_log_ipv4.c |  8 +++++++-
>>   net/ipv6/netfilter/nf_log_ipv6.c |  8 +++++++-
>>   net/netfilter/nf_log.c           | 13 ++++++++++++-
>>   4 files changed, 33 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/ipv4/netfilter/nf_log_arp.c b/net/ipv4/netfilter/nf_log_arp.c
>> index ccfc78db12ee8acae68faf451f2cf6bc5597f2c1..8b39174b7be390397a110ec9d3ed497bf8ce6d26 100644
>> --- a/net/ipv4/netfilter/nf_log_arp.c
>> +++ b/net/ipv4/netfilter/nf_log_arp.c
>> @@ -130,7 +130,13 @@ static int __init nf_log_arp_init(void)
>>   	if (ret < 0)
>>   		return ret;
>>
>> -	nf_log_register(NFPROTO_ARP, &nf_arp_logger);
>> +	ret = nf_log_register(NFPROTO_ARP, &nf_arp_logger);
>> +	if (ret < 0) {
>> +		pr_err("log: failed to register logger\n");
>
> I think you can use pr_fmt to avoid appending log, it's also going to
> append useful information to know which logger has indeed failed.

Sure. It was taken from nfnetlink_log_init() so I'll take the shot to update 
it too.

>> +		unregister_pernet_subsys(&nf_log_arp_net_ops);
>> +		return ret;
>
> Could you add a goto err1; instead? Just in case we need to extend
> this again later on, we'll skip some refactoring.

Yes. I'll send a v2

Thanks,
Marcelo


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox