Netdev List


* Re: [PATCH 1/2] net: Toeplitz library functions
From: Eric Dumazet @ 2013-09-24 15:29 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Laight, David Miller, Linux Netdev List, Brandeburg, Jesse
In-Reply-To: <CA+mtBx_y0hK69jcdY5e0MUzAa8hEPwExBDuK9Dn4Ceo5Hkn_iw@mail.gmail.com>

On Tue, 2013-09-24 at 08:22 -0700, Tom Herbert wrote:

> Assuming skb_rx_hash does symmetric calculation is currently
> incorrect.  For instance, looks like tun.c is trying to implement a
> sort of 'flow director' logic to pair TX queues and RX queues using
> skb_get_rxhash and expecting that the value is calculated
> symmetrically.  If HW is providing RX hash, this is broken and we'll
> never match the flows.  We could either recompute the hash in SW or
> try to match HW hash.

It's not incorrect, it's an implementation choice.

It's software in Linux, we do not have to care how it's done in
hardware.

This is done to reduce conntracking cost, in case RPS is used on a
router: the same CPU will process frames in both directions.

But conntracking does not 'rely' on rxhash being symmetric, that's an
optimization to have better data locality.
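
To make the "symmetric" property concrete, here is a minimal userspace
sketch, not the kernel's skb_get_rxhash(). The struct flow4 type, the
mix32() mixer and flow_hash_symmetric() are hypothetical names chosen
for illustration; the only point is that ordering the endpoints before
hashing gives both directions of a flow the same value.

```c
#include <stdint.h>

/* Hypothetical IPv4 flow key; the field names are illustrative only. */
struct flow4 {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Toy integer mixer standing in for a real hash such as jhash. */
static uint32_t mix32(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 0x9e3779b1u;
	return h ^ (h >> 15);
}

/* Put the endpoints in a canonical order, then hash, so that
 * A->B and B->A produce the same result.
 */
static uint32_t flow_hash_symmetric(const struct flow4 *f)
{
	uint32_t a_lo = f->saddr, a_hi = f->daddr;
	uint16_t p_lo = f->sport, p_hi = f->dport;
	uint32_t h = 0;

	if (a_lo > a_hi || (a_lo == a_hi && p_lo > p_hi)) {
		a_lo = f->daddr; a_hi = f->saddr;
		p_lo = f->dport; p_hi = f->sport;
	}

	h = mix32(h, a_lo);
	h = mix32(h, a_hi);
	h = mix32(h, ((uint32_t)p_lo << 16) | p_hi);
	return h;
}
```

A hash supplied by the NIC is computed over the fields in wire order,
so it generally lacks this property, which is the mismatch discussed
in this thread.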


* Re: [PATCH] net: net_secret should not depend on TCP
From: Hannes Frederic Sowa @ 2013-09-24 15:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, davem, netdev, jesse.brandeburg
In-Reply-To: <1380036147.3165.72.camel@edumazet-glaptop>

On Tue, Sep 24, 2013 at 08:22:27AM -0700, Eric Dumazet wrote:
> On Tue, 2013-09-24 at 17:13 +0200, Hannes Frederic Sowa wrote:
> > On Tue, Sep 24, 2013 at 06:19:57AM -0700, Eric Dumazet wrote:
> > > -void net_secret_init(void)
> > > +static u32 net_secret[NET_SECRET_SIZE] ____cacheline_aligned;
> > > +
> > > +static void net_secret_init(void)
> > >  {
> > > -	get_random_bytes(net_secret, sizeof(net_secret));
> > > +	u32 tmp;
> > > +	int i;
> > > +
> > > +	if (likely(net_secret[0]))
> > > +		return;
> > > +
> > > +	for (i = NET_SECRET_SIZE; i > 0;) {
> > > +		do {
> > > +			get_random_bytes(&tmp, sizeof(tmp));
> > > +		} while (!tmp);
> > 
> > I am afraid we can block here on embedded systems in an atomic section? Is
> > this actually an issue? It does get called in a spin_lock_h.
> 
> I do not see issues : get_random_bytes() is irq safe.

But couldn't it be that get_random_bytes() always returns 0 and we won't make
any progress here? Does the reseed happen from irq context or from softirqs? I
always thought it would be from a softirq (which could be blocked).

Thanks,

  Hannes


* Re: [PATCH net v3 1/1] xen-netback: Handle backend state transitions in a more robust way
From: Wei Liu @ 2013-09-24 15:26 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, netdev, Ian Campbell, Wei Liu, David Vrabel
In-Reply-To: <1380034282-11210-2-git-send-email-paul.durrant@citrix.com>

On Tue, Sep 24, 2013 at 03:51:22PM +0100, Paul Durrant wrote:
> When the frontend state changes metback now specifies its desired state to
                                  netback
> a new function, set_backend_state(), which transitions through any
[...]
> +/* Handle backend state transitions:
> + *
> + * The backend state starts in InitWait and the following transtions are
                                                             transitions
> + * allowed.
>  
[...]
> @@ -363,7 +448,9 @@ static void hotplug_status_changed(struct xenbus_watch *watch,
>  	if (IS_ERR(str))
>  		return;
>  	if (len == sizeof("connected")-1 && !memcmp(str, "connected", len)) {
> -		xenbus_switch_state(be->dev, XenbusStateConnected);
> +		/* Complete any pending state change */
> +		xenbus_switch_state(be->dev, be->state);
> +

The state transition takes place iff hotplug status is "connected", is
this desirable? What if hotplug fails?

If it cycles through connect again it looks like it will trigger that
BUG_ON in connect()?

Wei.


* Re: [PATCH 1/2] net: Toeplitz library functions
From: Tom Herbert @ 2013-09-24 15:22 UTC (permalink / raw)
  To: David Laight; +Cc: David Miller, Linux Netdev List, Brandeburg, Jesse
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B7355@saturn3.aculab.com>

On Tue, Sep 24, 2013 at 1:32 AM, David Laight <David.Laight@aculab.com> wrote:
>> +static inline unsigned int
>> +toeplitz_hash(const unsigned char *bytes,
>> +           struct toeplitz *toeplitz, int n)
>> +{
>> +     int i;
>> +     unsigned int result = 0;
>> +
>> +     for (i = 0; i < n; i++)
>> +             result ^= toeplitz->key_cache[i][bytes[i]];
>> +
>> +        return result;
>> +};
>
> That is a horrid hash function to be calculating in software.
>
> The code looks very much like a simple 32bit CRC.
> It isn't entirely clear exactly where the 'key' gets included,
> but I suspect it is just xored with the data bytes.
>
Please google Toeplitz hash function to learn about the algorithm.
http://msdn.microsoft.com/en-us/library/windows/hardware/ff570725(v=vs.85).aspx

> Using it in hardware is probably fine - the hardware can do
> it cheaply (in dedicated logic) as the frame arrives.
> The CRC polynomial probably collapses to a few XOR operations
> when done byte by byte (the hdlc crc16 collapses to 3 levels
> of xor).
>
> IIRC jhash() works on 32bit quantities - so has far fewer
> maths operations as well as not having all the random data
> accesses (cache misses and displacing other parts of the
> working set from the cache).
>
Yes, but pretty much every NIC vendor implements Toeplitz hash and
provides it in their receive descriptor.  We use this value for
steering, and could use it for other uses like connection lookup.

> I also thought the hash was arranged so that tx and rx packets
> for a single connection hash to the same value?
>
ehashfn hashes consistently based on local and remote sides.
skb_get_rxhash orders the addresses and ports to make a consistent
hash in both directions. Presumably, this allows skb_get_rxhash to be
called from TX side to get same value of RX side, but when we get
rxhash from device (e.g. Toeplitz) this property is broken anyway.
Instead of jumping through this hoop, it might be better to have a
separate function for calculating RX hash for reverse path on TX.

Assuming skb_rx_hash does symmetric calculation is currently
incorrect.  For instance, looks like tun.c is trying to implement a
sort of 'flow director' logic to pair TX queues and RX queues using
skb_get_rxhash and expecting that the value is calculated
symmetrically.  If HW is providing RX hash, this is broken and we'll
never match the flows.  We could either recompute the hash in SW or
try to match HW hash.

Tom

>         David
>
>
>
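
For readers unfamiliar with the algorithm, the following is a minimal
bit-by-bit sketch of a Toeplitz hash in userspace C. It is not the
code from the patch: toeplitz_hash_bitwise() is a hypothetical name,
and it assumes the usual definition where each set input bit (most
significant bit first) XORs in the 32-bit key window starting at that
bit position. The table lookup in the quoted toeplitz_hash()
presumably precomputes the same per-byte contributions.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise Toeplitz hash. The key must be at least data_len + 4 bytes
 * long; 0 is returned if it is too short.
 */
static uint32_t toeplitz_hash_bitwise(const uint8_t *key, size_t key_len,
				      const uint8_t *data, size_t data_len)
{
	uint32_t window, result = 0;
	size_t i;
	int b;

	if (key_len < data_len + 4)
		return 0;

	/* The window starts as the first 32 bits of the key. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
		 ((uint32_t)key[2] << 8) | key[3];

	for (i = 0; i < data_len; i++) {
		for (b = 7; b >= 0; b--) {
			if (data[i] & (1u << b))
				result ^= window;

			/* Slide the window left by one key bit. */
			window <<= 1;
			if (key[i + 4] & (1u << b))
				window |= 1;
		}
	}

	return result;
}
```

Because every step is an XOR of key-derived words, the hash is linear
over XOR: hash(a ^ b) == hash(a) ^ hash(b). That is why swapping source
and destination fields changes the result unless the key is chosen
specially.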


* Re: [PATCH] net: net_secret should not depend on TCP
From: Eric Dumazet @ 2013-09-24 15:22 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Tom Herbert, davem, netdev, jesse.brandeburg
In-Reply-To: <20130924151333.GA1527@order.stressinduktion.org>

On Tue, 2013-09-24 at 17:13 +0200, Hannes Frederic Sowa wrote:
> On Tue, Sep 24, 2013 at 06:19:57AM -0700, Eric Dumazet wrote:
> > -void net_secret_init(void)
> > +static u32 net_secret[NET_SECRET_SIZE] ____cacheline_aligned;
> > +
> > +static void net_secret_init(void)
> >  {
> > -	get_random_bytes(net_secret, sizeof(net_secret));
> > +	u32 tmp;
> > +	int i;
> > +
> > +	if (likely(net_secret[0]))
> > +		return;
> > +
> > +	for (i = NET_SECRET_SIZE; i > 0;) {
> > +		do {
> > +			get_random_bytes(&tmp, sizeof(tmp));
> > +		} while (!tmp);
> 
> I am afraid we can block here on embedded systems in an atomic section? Is
> this actually an issue? It does get called in a spin_lock_h.

I do not see issues : get_random_bytes() is irq safe.


* [PATCH v2 net-next] net: introduce SO_MAX_PACING_RATE
From: Eric Dumazet @ 2013-09-24 15:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Steinar H. Gunderson, Michael Kerrisk
In-Reply-To: <1379949014.3165.24.camel@edumazet-glaptop>

From: Eric Dumazet <edumazet@google.com>

As mentioned in commit afe4fd062416b ("pkt_sched: fq: Fair Queue packet
scheduler"), this patch adds a new socket option.

SO_MAX_PACING_RATE offers the application the ability to cap the
rate computed by transport layer. Value is in bytes per second.

u32 val = 1000000;
setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));

To be effectively paced, a flow must use FQ packet scheduler.

Note that a packet scheduler takes into account the headers for its
computations. The effective payload rate depends on MSS and retransmits
if any.

I chose to make this pacing rate a SOL_SOCKET option instead of a
TCP one because this can be used by other protocols.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steinar H. Gunderson <sesse@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
v2: cap sk->sk_pacing_rate in sock_setsockopt()

 arch/alpha/include/uapi/asm/socket.h   |    4 +++-
 arch/avr32/include/uapi/asm/socket.h   |    2 ++
 arch/cris/include/uapi/asm/socket.h    |    2 ++
 arch/frv/include/uapi/asm/socket.h     |    2 ++
 arch/h8300/include/uapi/asm/socket.h   |    2 ++
 arch/ia64/include/uapi/asm/socket.h    |    2 ++
 arch/m32r/include/uapi/asm/socket.h    |    2 ++
 arch/mips/include/uapi/asm/socket.h    |    2 ++
 arch/mn10300/include/uapi/asm/socket.h |    2 ++
 arch/parisc/include/uapi/asm/socket.h  |    2 ++
 arch/powerpc/include/uapi/asm/socket.h |    2 ++
 arch/s390/include/uapi/asm/socket.h    |    2 ++
 arch/sparc/include/uapi/asm/socket.h   |    2 ++
 arch/xtensa/include/uapi/asm/socket.h  |    2 ++
 include/net/sock.h                     |    1 +
 include/uapi/asm-generic/socket.h      |    2 ++
 net/core/sock.c                        |   12 ++++++++++++
 net/ipv4/tcp_input.c                   |    2 +-
 18 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 467de01..e3a1491 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -81,6 +81,8 @@
 
 #define SO_SELECT_ERR_QUEUE	45
 
-#define SO_BUSY_POLL			46
+#define SO_BUSY_POLL		46
+
+#define SO_MAX_PACING_RATE	47
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 11c4259..4399364 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -76,4 +76,6 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif /* __ASM_AVR32_SOCKET_H */
diff --git a/arch/cris/include/uapi/asm/socket.h b/arch/cris/include/uapi/asm/socket.h
index eb723e5..13829aa 100644
--- a/arch/cris/include/uapi/asm/socket.h
+++ b/arch/cris/include/uapi/asm/socket.h
@@ -78,6 +78,8 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif /* _ASM_SOCKET_H */
 
 
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index f0cb1c3..5d42997 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -76,5 +76,7 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/h8300/include/uapi/asm/socket.h b/arch/h8300/include/uapi/asm/socket.h
index 9490758..214ccaf 100644
--- a/arch/h8300/include/uapi/asm/socket.h
+++ b/arch/h8300/include/uapi/asm/socket.h
@@ -76,4 +76,6 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 556d070..c25302f 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -85,4 +85,6 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 24be7c8..5296665 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -76,4 +76,6 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 61c01f0..0df9787 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index e2a2b203..71dedca 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -76,4 +76,6 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 71700e6..7c614d0 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -75,6 +75,8 @@
 
 #define SO_BUSY_POLL		0x4027
 
+#define SO_MAX_PACING_RATE	0x4048
+
 /* O_NONBLOCK clashes with the bits used for socket types.  Therefore we
  * have to define SOCK_NONBLOCK to a different value here.
  */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index a6d7446..fa69832 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -83,4 +83,6 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif	/* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 9249449..c286c2e 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -82,4 +82,6 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 4e1d66c..0f21e9a 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -72,6 +72,8 @@
 
 #define SO_BUSY_POLL		0x0030
 
+#define SO_MAX_PACING_RATE	0x0031
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index c114483..7db5c22 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -87,4 +87,6 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 4625d2e..240aa3f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -363,6 +363,7 @@ struct sock {
 	int			sk_wmem_queued;
 	gfp_t			sk_allocation;
 	u32			sk_pacing_rate; /* bytes per second */
+	u32			sk_max_pacing_rate;
 	netdev_features_t	sk_route_caps;
 	netdev_features_t	sk_route_nocaps;
 	int			sk_gso_type;
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index f04b69b..38f14d0 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -78,4 +78,6 @@
 
 #define SO_BUSY_POLL		46
 
+#define SO_MAX_PACING_RATE	47
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 5b6beba..2bd9b3f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -914,6 +914,13 @@ set_rcvbuf:
 		}
 		break;
 #endif
+
+	case SO_MAX_PACING_RATE:
+		sk->sk_max_pacing_rate = val;
+		sk->sk_pacing_rate = min(sk->sk_pacing_rate,
+					 sk->sk_max_pacing_rate);
+		break;
+
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -1177,6 +1184,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		break;
 #endif
 
+	case SO_MAX_PACING_RATE:
+		v.val = sk->sk_max_pacing_rate;
+		break;
+
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -2319,6 +2330,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_ll_usec		=	sysctl_net_busy_read;
 #endif
 
+	sk->sk_max_pacing_rate = ~0U;
 	/*
 	 * Before updating sk_refcnt, we must commit prior changes to memory
 	 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 25a89ea..75372c0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -713,7 +713,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
 	if (tp->srtt > 8 + 2)
 		do_div(rate, tp->srtt);
 
-	sk->sk_pacing_rate = min_t(u64, rate, ~0U);
+	sk->sk_pacing_rate = min_t(u64, rate, sk->sk_max_pacing_rate);
 }
 
 /* Calculate rto without backoff.  This is the second half of Van Jacobson's


* Re: [PATCH v2] qlge: call ql_core_dump() only if dump memory was allocated.
From: David Miller @ 2013-09-24 15:20 UTC (permalink / raw)
  To: malahal; +Cc: netdev
In-Reply-To: <1379712077-31750-1-git-send-email-malahal@us.ibm.com>

From: Malahal Naineni <malahal@us.ibm.com>
Date: Fri, 20 Sep 2013 16:21:17 -0500

> Also changed a log message to indicate that memory was not allocated
> instead of memory not available!
> 
> Signed-off-by: Malahal Naineni <malahal@us.ibm.com>

Applied.


* Re: [PATCH net-next] tcp: fix dynamic right sizing
From: David Miller @ 2013-09-24 15:16 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ncardwell, ycheng, vanj
In-Reply-To: <1379710618.3431.5.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 20 Sep 2013 13:56:58 -0700

> Dynamic Right Sizing (DRS) is supposed to open TCP receive window
> automatically, but suffers from two bugs, presented by order
> of importance.
> 
> 1) tcp_rcv_space_adjust() fix :
> 
> Using twice the last received amount is very pessimistic,
> because it doesn't allow fast recovery or proper slow start
> ramp up, if sender wants to increase cwin by 100% every RTT.
> 
> copied = bytes received in previous RTT
> 
> 2*copied = bytes we expect to receive in next RTT
> 
> 4*copied = bytes we need to advertise in rwin at end of next RTT
> 
> DRS is one RTT late, it needs a 4x factor.
> 
> If sender is not using ABC, and increases cwin by 50% every rtt,
> then we needed 1.5*1.5 = 2.25 factor.
> This is probably why this bug was not really noticed.
> 
> 2) There is no window adjustment after first RTT. DRS triggers only
>   after the second RTT.
>   DRS needs two RTT to initialize, so tcp_fixup_rcvbuf() should setup
>   sk_rcvbuf to allow proper window grow for first two RTT.
> 
> This patch increases TCP efficiency particularly for large RTT flows
> when autotuning is used at the receiver, and more particularly
> in presence of packet losses.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Cc: Van Jacobson <vanj@google.com>

Looks good, applied, thanks Eric.
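
The factors quoted above can be checked with a small worked example.
This is only an arithmetic sketch, not code from the patch:
drs_window_needed() is a hypothetical helper, and the per-RTT growth
of the sender is passed as a fraction num/den. Since DRS is one RTT
late, the growth factor is applied twice.

```c
#include <stdint.h>

/* Bytes the receiver should be able to advertise at the end of the
 * next RTT, given the bytes copied in the previous RTT and a sender
 * growing its window by num/den per RTT.
 */
static uint64_t drs_window_needed(uint64_t copied, uint64_t num, uint64_t den)
{
	return copied * num * num / (den * den);
}
```

With copied = 100000 bytes, a sender doubling every RTT (2/1) needs
400000 bytes, the 4x factor, while a sender growing by 50% (3/2) needs
225000 bytes, the 2.25 factor.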


* Re: [REGRESSION][BISECTED] skge: add dma_mapping check
From: Joseph Salisbury @ 2013-09-24 15:16 UTC (permalink / raw)
  To: Igor Gnatenko, Francois Romieu, mpatocka
  Cc: Mirko Lindner, linux-kernel, Stephen Hemminger, netdev,
	member graysky, Greg KH
In-Reply-To: <1379581391.2403.5.camel@ThinkPad-X230.localdomain>

On 09/19/2013 05:03 AM, Igor Gnatenko wrote:
> Please, send patch.
>
The patch is in mainline as of 3.12-rc2 as commit:

Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Thu Sep 19 14:13:17 2013 -0400

    skge: fix broken driver

I don't see that the commit was Cc'd to stable.  Mikulas, we might need
to send a request directly to the stable maintainers and request that
the commit be pulled into stable, in case they didn't notice the request
in the commit message.


* Re: [PATCH net-next] net: introduce SO_MAX_PACING_RATE
From: Eric Dumazet @ 2013-09-24 15:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Steinar H. Gunderson, Michael Kerrisk
In-Reply-To: <1379949014.3165.24.camel@edumazet-glaptop>

On Mon, 2013-09-23 at 08:10 -0700, Eric Dumazet wrote:

> +
> +	case SO_MAX_PACING_RATE:
> +		sk->sk_max_pacing_rate = val;
> +		break;
> +

I'll send a v2, adding here :

sk->sk_pacing_rate = min(sk->sk_pacing_rate, sk->sk_max_pacing_rate);

to enforce current limit for non TCP protocols.
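
The intended semantics can be modelled in a few lines of userspace C.
This is a sketch under assumptions, not kernel code: struct
pacing_model and both helpers are hypothetical names that mirror the
two struct sock fields. Setting the cap clamps the current rate
immediately, and later rate updates from the transport are clamped to
the cap as well.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two struct sock fields involved. */
struct pacing_model {
	uint32_t pacing_rate;		/* bytes per second */
	uint32_t max_pacing_rate;	/* cap; UINT32_MAX means "no cap" */
};

/* Models the setsockopt() side: store the cap and apply it now. */
static void pacing_set_max(struct pacing_model *s, uint32_t val)
{
	s->max_pacing_rate = val;
	if (s->pacing_rate > s->max_pacing_rate)
		s->pacing_rate = s->max_pacing_rate;
}

/* Models the transport side: a freshly computed rate is capped. */
static void pacing_update_rate(struct pacing_model *s, uint64_t rate)
{
	s->pacing_rate = rate < s->max_pacing_rate ?
			 (uint32_t)rate : s->max_pacing_rate;
}
```

Without the clamp in the setsockopt() path, a protocol that never
recomputes its rate would keep pacing at the old value after the
application lowered the cap.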


* Re: [PATCH] net: net_secret should not depend on TCP
From: Hannes Frederic Sowa @ 2013-09-24 15:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, davem, netdev, jesse.brandeburg
In-Reply-To: <1380028797.3165.65.camel@edumazet-glaptop>

On Tue, Sep 24, 2013 at 06:19:57AM -0700, Eric Dumazet wrote:
> -void net_secret_init(void)
> +static u32 net_secret[NET_SECRET_SIZE] ____cacheline_aligned;
> +
> +static void net_secret_init(void)
>  {
> -	get_random_bytes(net_secret, sizeof(net_secret));
> +	u32 tmp;
> +	int i;
> +
> +	if (likely(net_secret[0]))
> +		return;
> +
> +	for (i = NET_SECRET_SIZE; i > 0;) {
> +		do {
> +			get_random_bytes(&tmp, sizeof(tmp));
> +		} while (!tmp);

I am afraid we can block here on embedded systems in an atomic section? Is
this actually an issue? It does get called in a spin_lock_h.


* [PATCH net v3 1/1] xen-netback: Handle backend state transitions in a more robust way
From: Paul Durrant @ 2013-09-24 14:51 UTC (permalink / raw)
  To: xen-devel, netdev; +Cc: Paul Durrant, Ian Campbell, Wei Liu, David Vrabel
In-Reply-To: <1380034282-11210-1-git-send-email-paul.durrant@citrix.com>

When the frontend state changes metback now specifies its desired state to
a new function, set_backend_state(), which transitions through any
necessary intermediate states.
This fixes an issue observed with some old Windows frontend drivers where
they failed to transition through the Closing state and netback would not
behave correctly.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 drivers/net/xen-netback/xenbus.c |  145 ++++++++++++++++++++++++++++++--------
 1 file changed, 114 insertions(+), 31 deletions(-)

diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index a53782e..716b167 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -24,6 +24,7 @@
 struct backend_info {
 	struct xenbus_device *dev;
 	struct xenvif *vif;
+	enum xenbus_state state;
 	enum xenbus_state frontend_state;
 	struct xenbus_watch hotplug_status_watch;
 	u8 have_hotplug_status_watch:1;
@@ -136,6 +137,8 @@ static int netback_probe(struct xenbus_device *dev,
 	if (err)
 		goto fail;
 
+	be->state = XenbusStateInitWait;
+
 	/* This kicks hotplug scripts, so do it immediately. */
 	backend_create_xenvif(be);
 
@@ -208,24 +211,113 @@ static void backend_create_xenvif(struct backend_info *be)
 	kobject_uevent(&dev->dev.kobj, KOBJ_ONLINE);
 }
 
-
-static void disconnect_backend(struct xenbus_device *dev)
+static void backend_disconnect(struct backend_info *be)
 {
-	struct backend_info *be = dev_get_drvdata(&dev->dev);
-
 	if (be->vif)
 		xenvif_disconnect(be->vif);
 }
 
-static void destroy_backend(struct xenbus_device *dev)
+static void backend_connect(struct backend_info *be)
 {
-	struct backend_info *be = dev_get_drvdata(&dev->dev);
+	if (be->vif)
+		connect(be);
+}
 
-	if (be->vif) {
-		kobject_uevent(&dev->dev.kobj, KOBJ_OFFLINE);
-		xenbus_rm(XBT_NIL, dev->nodename, "hotplug-status");
-		xenvif_free(be->vif);
-		be->vif = NULL;
+static inline void backend_switch_state(struct backend_info *be,
+					enum xenbus_state state)
+{
+	struct xenbus_device *dev = be->dev;
+
+	pr_debug("%s -> %s\n", dev->nodename, xenbus_strstate(state));
+	be->state = state;
+
+	/* If we are waiting for a hotplug script then defer the
+	 * actual xenbus state change.
+	 */
+	if (!be->have_hotplug_status_watch)
+		xenbus_switch_state(dev, state);
+}
+
+/* Handle backend state transitions:
+ *
+ * The backend state starts in InitWait and the following transtions are
+ * allowed.
+ *
+ * InitWait -> Connected
+ *
+ *    ^    \         |
+ *    |     \        |
+ *    |      \       |
+ *    |       \      |
+ *    |        \     |
+ *    |         \    |
+ *    |          V   V
+ *
+ *  Closed  <-> Closing
+ *
+ * The state argument specifies the eventual state of the backend and the
+ * function transitions to that state via the shortest path.
+ */
+static void set_backend_state(struct backend_info *be,
+			      enum xenbus_state state)
+{
+	while (be->state != state) {
+		switch (be->state) {
+		case XenbusStateClosed:
+			switch (state) {
+			case XenbusStateInitWait:
+			case XenbusStateConnected:
+				pr_info("%s: prepare for reconnect\n",
+					be->dev->nodename);
+				backend_switch_state(be, XenbusStateInitWait);
+				break;
+			case XenbusStateClosing:
+				backend_switch_state(be, XenbusStateClosing);
+				break;
+			default:
+				BUG();
+			}
+			break;
+		case XenbusStateInitWait:
+			switch (state) {
+			case XenbusStateConnected:
+				backend_connect(be);
+				backend_switch_state(be, XenbusStateConnected);
+				break;
+			case XenbusStateClosing:
+			case XenbusStateClosed:
+				backend_switch_state(be, XenbusStateClosing);
+				break;
+			default:
+				BUG();
+			}
+			break;
+		case XenbusStateConnected:
+			switch (state) {
+			case XenbusStateInitWait:
+			case XenbusStateClosing:
+			case XenbusStateClosed:
+				backend_disconnect(be);
+				backend_switch_state(be, XenbusStateClosing);
+				break;
+			default:
+				BUG();
+			}
+			break;
+		case XenbusStateClosing:
+			switch (state) {
+			case XenbusStateInitWait:
+			case XenbusStateConnected:
+			case XenbusStateClosed:
+				backend_switch_state(be, XenbusStateClosed);
+				break;
+			default:
+				BUG();
+			}
+			break;
+		default:
+			BUG();
+		}
 	}
 }
 
@@ -237,40 +329,33 @@ static void frontend_changed(struct xenbus_device *dev,
 {
 	struct backend_info *be = dev_get_drvdata(&dev->dev);
 
-	pr_debug("frontend state %s\n", xenbus_strstate(frontend_state));
+	pr_debug("%s -> %s\n", dev->otherend, xenbus_strstate(frontend_state));
 
 	be->frontend_state = frontend_state;
 
 	switch (frontend_state) {
 	case XenbusStateInitialising:
-		if (dev->state == XenbusStateClosed) {
-			pr_info("%s: prepare for reconnect\n", dev->nodename);
-			xenbus_switch_state(dev, XenbusStateInitWait);
-		}
+		set_backend_state(be, XenbusStateInitWait);
 		break;
 
 	case XenbusStateInitialised:
 		break;
 
 	case XenbusStateConnected:
-		if (dev->state == XenbusStateConnected)
-			break;
-		if (be->vif)
-			connect(be);
+		set_backend_state(be, XenbusStateConnected);
 		break;
 
 	case XenbusStateClosing:
-		disconnect_backend(dev);
-		xenbus_switch_state(dev, XenbusStateClosing);
+		set_backend_state(be, XenbusStateClosing);
 		break;
 
 	case XenbusStateClosed:
-		xenbus_switch_state(dev, XenbusStateClosed);
+		set_backend_state(be, XenbusStateClosed);
 		if (xenbus_dev_is_online(dev))
 			break;
-		destroy_backend(dev);
 		/* fall through if not online */
 	case XenbusStateUnknown:
+		set_backend_state(be, XenbusStateClosed);
 		device_unregister(&dev->dev);
 		break;
 
@@ -363,7 +448,9 @@ static void hotplug_status_changed(struct xenbus_watch *watch,
 	if (IS_ERR(str))
 		return;
 	if (len == sizeof("connected")-1 && !memcmp(str, "connected", len)) {
-		xenbus_switch_state(be->dev, XenbusStateConnected);
+		/* Complete any pending state change */
+		xenbus_switch_state(be->dev, be->state);
+
 		/* Not interested in this watch anymore. */
 		unregister_hotplug_status_watch(be);
 	}
@@ -389,16 +476,12 @@ static void connect(struct backend_info *be)
 			  &be->vif->credit_usec);
 	be->vif->remaining_credit = be->vif->credit_bytes;
 
-	unregister_hotplug_status_watch(be);
+	BUG_ON(be->have_hotplug_status_watch);
 	err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch,
 				   hotplug_status_changed,
 				   "%s/%s", dev->nodename, "hotplug-status");
-	if (err) {
-		/* Switch now, since we can't do a watch. */
-		xenbus_switch_state(dev, XenbusStateConnected);
-	} else {
+	if (!err)
 		be->have_hotplug_status_watch = 1;
-	}
 
 	netif_wake_queue(be->vif->dev);
 }
-- 
1.7.10.4


* [PATCH net v3 0/1] xen-netback: windows frontend compatibility fixes
From: Paul Durrant @ 2013-09-24 14:51 UTC (permalink / raw)
  To: xen-devel, netdev

The following patches fix a couple more issues found when testing with
Windows frontends.

v3:
- Collapse both v2 patches into a single patch that introduces a new
  function to handle backend state transitions. By doing this we ensure that
  we always transition through intermediate states and that we don't attempt
  repeated connects or disconnects.

v2:
- Add comment in 2/2 to note that state transitions from Connected to Closed
  are incorrect.
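
The "always transition through intermediate states" rule can be
illustrated with a standalone model of the transition table. This is a
simplified sketch, not the driver code: the be_state enum and the
be_next_hop() and be_walk() helpers are hypothetical names, there are
no side effects such as connect or disconnect, and invalid targets are
not trapped with BUG() as they are in the patch.

```c
/* Simplified stand-ins for the four xenbus states the backend uses. */
enum be_state { BE_INIT_WAIT, BE_CONNECTED, BE_CLOSING, BE_CLOSED };

/* One step from cur towards target, following the allowed edges:
 * InitWait -> Connected, InitWait -> Closing, Connected -> Closing,
 * Closing <-> Closed, Closed -> InitWait.
 */
static enum be_state be_next_hop(enum be_state cur, enum be_state target)
{
	switch (cur) {
	case BE_CLOSED:
		return target == BE_CLOSING ? BE_CLOSING : BE_INIT_WAIT;
	case BE_INIT_WAIT:
		return target == BE_CONNECTED ? BE_CONNECTED : BE_CLOSING;
	case BE_CONNECTED:
		return BE_CLOSING;
	case BE_CLOSING:
		return BE_CLOSED;
	}
	return cur;
}

/* Record the states visited on the way to target; returns the count. */
static int be_walk(enum be_state cur, enum be_state target,
		   enum be_state *path, int max)
{
	int n = 0;

	while (cur != target && n < max) {
		cur = be_next_hop(cur, target);
		path[n++] = cur;
	}
	return n;
}
```

For example, a request to go from Connected back to InitWait visits
Closing, then Closed, then InitWait, so a frontend that skips Closing
still sees the backend pass through every intermediate state.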


* Re: [PATCH v3 -next 2/2] tcp: syncookies: reduce mss table to four values
From: David Miller @ 2013-09-24 14:40 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <1379709176-1625-2-git-send-email-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Fri, 20 Sep 2013 22:32:56 +0200

> Halve mss table size to make blind cookie guessing more difficult.
> This is sad since the tables were already small, but there
> is little alternative except perhaps adding more precise mss information
> in the tcp timestamp.  Timestamps are unfortunately not ubiquitous.
> 
> Guessing all possible cookie values still has 8-in 2**32 chance.
> 
> Reported-by: Jakob Lell <jakob@jakoblell.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>

Applied.

Thanks for following up on all of my feedback.


* Re: [PATCH v3 -next 1/2] tcp: syncookies: reduce cookie lifetime to 128 seconds
From: David Miller @ 2013-09-24 14:40 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <1379709176-1625-1-git-send-email-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Fri, 20 Sep 2013 22:32:55 +0200

> We currently accept cookies that were created less than 4 minutes ago
> (ie, cookies with counter delta 0-3).  Combined with the 8 mss table
> values, this yields 32 possible values (out of 2**32) that will be valid.
> 
> Reducing the lifetime to < 2 minutes halves the guessing chance while
> still providing a large enough period.
> 
> While at it, get rid of jiffies value -- they overflow too quickly on
> 32 bit platforms.
> 
> getnstimeofday is used to create a counter that increments every 64s.
> perf shows getnstimeofday cost is negligible compared to sha_transform;
> normal tcp initial sequence number generation uses getnstimeofday, too.
> 
> Reported-by: Jakob Lell <jakob@jakoblell.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>

Applied.
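
The numbers in the two syncookie commit messages can be reproduced
with a short sketch. The helper names below are hypothetical and this
is not the kernel implementation; it assumes only what the text
states: a counter that increments every 64 seconds, a cookie accepted
while its counter delta is below a limit, and a guess space equal to
accepted counter values times MSS table entries.

```c
#include <stdint.h>

/* Counter that increments every 64 seconds. */
static uint32_t cookie_counter(uint64_t seconds)
{
	return (uint32_t)(seconds >> 6);
}

/* A cookie is fresh while fewer than max_age counter ticks have
 * passed; unsigned subtraction keeps this correct across wraparound.
 */
static int cookie_is_fresh(uint32_t now, uint32_t issued, uint32_t max_age)
{
	return (uint32_t)(now - issued) < max_age;
}

/* Number of 32-bit cookie values that would be accepted at once. */
static unsigned int cookie_valid_values(unsigned int counters,
					unsigned int mss_entries)
{
	return counters * mss_entries;
}
```

With four accepted counter values and eight MSS entries there are 32
valid cookies out of 2**32. Accepting two counter values (under 128
seconds) gives 16, and a four-entry MSS table brings that down to the
8 mentioned in patch 2/2.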


* Re: [net-next PATCH 0/4] cpsw: support for control module register
From: David Miller @ 2013-09-24 14:34 UTC (permalink / raw)
  To: mugunthanvnm
  Cc: netdev, zonque, bcousson, tony, devicetree, linux-omap,
	linux-arm-kernel
In-Reply-To: <1379704841-32693-1-git-send-email-mugunthanvnm@ti.com>

From: Mugunthan V N <mugunthanvnm@ti.com>
Date: Sat, 21 Sep 2013 00:50:37 +0530

> This patch series adds the support for configuring GMII_SEL register
> of control module to select the phy mode type and also to configure
> the clock source for RMII phy mode whether to use internal clock or
> the external clock from the phy itself.
> 
> Till now CPSW works as this configuration is done in U-Boot and carried
> over to the kernel. But during suspend/resume Control module tends to
> lose its configured value for GMII_SEL register in AM33xx PG1.0, so
> if CPSW is used in RMII or RGMII mode, on resume cpsw is not working
> as GMII_SEL register lost its configuration values.
> 
> The initial version of the patch is done by Daniel Mack but as per
> Tony's comment he wants it as a separate driver as it is done in USB
> control module. I have created a separate driver for the same.

Series applied, thanks.


* Re: [PATCH 01/10] can: Remove extern from function prototypes
From: Marc Kleine-Budde @ 2013-09-24 14:22 UTC (permalink / raw)
  To: David Miller; +Cc: joe, netdev, linux-kernel, wg, linux-can
In-Reply-To: <20130924.100209.1219862004963547693.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1094 bytes --]

On 09/24/2013 04:02 PM, David Miller wrote:
> From: Marc Kleine-Budde <mkl@pengutronix.de>
> Date: Tue, 24 Sep 2013 09:37:12 +0200
> 
>> On 09/24/2013 12:11 AM, Joe Perches wrote:
>>> There is a mix of function prototypes with and without extern
>>> in the kernel sources.  Standardize on not using extern for
>>> function prototypes.
>>>
>>> Function prototypes don't need to be written with extern.
>>> extern is assumed by the compiler.  Its use is as unnecessary as
>>> using auto to declare automatic/local variables in a block.
>>>
>>> Signed-off-by: Joe Perches <joe@perches.com>
>>
>> Thx, added to linux-can-next. The patch will be included in the next
>> pull request to David.
> 
> Marc, I'm trying to just quickly apply these all into my tree directly.

Fine with me.

Tnx, Marc


-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply

* Re: [PATCH] skge: fix invalid value passed to pci_unmap_sigle
From: David Miller @ 2013-09-24 14:18 UTC (permalink / raw)
  To: mpatocka; +Cc: netdev, romieu, i.gnatenko.brain, stephen
In-Reply-To: <alpine.LRH.2.02.1309201352010.1763@file01.intranet.prod.int.rdu2.redhat.com>

From: Mikulas Patocka <mpatocka@redhat.com>
Date: Fri, 20 Sep 2013 13:53:22 -0400 (EDT)

> In my patch c194992cbe71c20bb3623a566af8d11b0bfaa721 I didn't fix the skge

Always refer to commits, not just by SHA ID, but also with the commit
header line text in parenthesis and double quotes, for this you'd say:

c194992cbe71c20bb3623a566af8d11b0bfaa721 ("skge: fix broken driver")

Using just the SHA ID is completely ambiguous, because the SHA ID will
be entirely different if this commit is added to a different tree, such
as -stable.

> bug correctly. The value of the new mapping (not old) was passed to
> pci_unmap_single.
> 
> If we enable CONFIG_DMA_API_DEBUG, it results in this warning:
> WARNING: CPU: 0 PID: 0 at lib/dma-debug.c:986 check_sync+0x4c4/0x580()
> skge 0000:02:07.0: DMA-API: device driver tries to sync DMA memory it has
> not allocated [device address=0x000000023a0096c0] [size=1536 bytes]
> 
> This patch makes the skge driver pass the correct value to
> pci_unmap_single and fixes the warning. It copies the old descriptor to
> on-stack variable "ee" and unmaps it if mapping of the new descriptor
> succeeded.
> 
> This patch should be backported to 3.11-stable.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Reported-by: Francois Romieu <romieu@fr.zoreil.com>
> Tested-by: Mikulas Patocka <mpatocka@redhat.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH ] net: raw: do not report ICMP redirects to user space
From: David Miller @ 2013-09-24 14:17 UTC (permalink / raw)
  To: duanj.fnst; +Cc: netdev
In-Reply-To: <523C21A5.4050504@cn.fujitsu.com>

From: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Date: Fri, 20 Sep 2013 18:21:25 +0800

> From: Duan Jiong <duanj.fnst@cn.fujitsu.com>
> 
> Redirect isn't an error condition, it should leave
> the error handler without touching the socket.
> 
> Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>

Applied.

^ permalink raw reply

* Re: [PATCH ] net: udp: do not report ICMP redirects to user space
From: David Miller @ 2013-09-24 14:17 UTC (permalink / raw)
  To: duanj.fnst; +Cc: netdev
In-Reply-To: <523C216C.90707@cn.fujitsu.com>

From: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Date: Fri, 20 Sep 2013 18:20:28 +0800

> From: Duan Jiong <duanj.fnst@cn.fujitsu.com>
> 
> Redirect isn't an error condition, it should leave
> the error handler without touching the socket.
> 
> Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>

Applied.

^ permalink raw reply

* Re: [net 5/6] i40e: better return values
From: David Miller @ 2013-09-24 14:12 UTC (permalink / raw)
  To: joe; +Cc: jeffrey.t.kirsher, jesse.brandeburg, netdev, gospo, sassmann
In-Reply-To: <1380022486.3575.74.camel@joe-AO722>

From: Joe Perches <joe@perches.com>
Date: Tue, 24 Sep 2013 04:34:46 -0700

> On Tue, 2013-09-24 at 02:45 -0700, Jeff Kirsher wrote:
> 
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> []
>> @@ -3339,9 +3345,7 @@ static u8 i40e_dcb_get_num_tc(struct i40e_dcbx_config *dcbcfg)
>>  	/* Traffic class index starts from zero so
>>  	 * increment to return the actual count
>>  	 */
>> -	num_tc++;
>> -
>> -	return num_tc;
>> +	return num_tc++;
> 
> Ick.  post_increment problem.
> 
> 	return ++num_tc;
> 
> There's nothing wrong with the original code
> unless this is a bugfix which should be documented
> better than "better return values".

Agreed, this style of coding is asking for a bug.

If you want to return "num_tc PLUS ONE" just say that:

	return num_tc + 1;

Why even use pre/post increment when the variable has no other
use than as a return value?
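
To make the difference concrete, here is a minimal sketch of the three
forms discussed in this thread. The function names are hypothetical and
are not taken from the i40e driver; only the return expressions mirror
the ones quoted above.

```c
/* Post-increment: the value returned is the one held BEFORE the
 * increment, so the "+ 1" is silently lost. */
unsigned char count_post(unsigned char num_tc)
{
	return num_tc++;
}

/* Pre-increment: returns the incremented value, but still modifies a
 * variable that is never read again. */
unsigned char count_pre(unsigned char num_tc)
{
	return ++num_tc;
}

/* Plain expression: states the intent directly, no side effect. */
unsigned char count_plain(unsigned char num_tc)
{
	return num_tc + 1;
}
```

For an input of 3, count_post() returns 3 while the other two return 4.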

^ permalink raw reply

* Re: [PATCH 01/10] can: Remove extern from function prototypes
From: David Miller @ 2013-09-24 14:11 UTC (permalink / raw)
  To: joe; +Cc: netdev, linux-kernel, wg, mkl, linux-can
In-Reply-To: <5570169a078375fa8662adeb2a7f24c1ae718bfb.1379974101.git.joe@perches.com>


Series applied, thanks Joe.

^ permalink raw reply

* Re: [PATCH 01/10] can: Remove extern from function prototypes
From: David Miller @ 2013-09-24 14:02 UTC (permalink / raw)
  To: mkl; +Cc: joe, netdev, linux-kernel, wg, linux-can
In-Reply-To: <52414128.8030706@pengutronix.de>

From: Marc Kleine-Budde <mkl@pengutronix.de>
Date: Tue, 24 Sep 2013 09:37:12 +0200

> On 09/24/2013 12:11 AM, Joe Perches wrote:
>> There is a mix of function prototypes with and without extern
>> in the kernel sources.  Standardize on not using extern for
>> function prototypes.
>> 
>> Function prototypes don't need to be written with extern.
>> extern is assumed by the compiler.  Its use is as unnecessary as
>> using auto to declare automatic/local variables in a block.
>> 
>> Signed-off-by: Joe Perches <joe@perches.com>
> 
> Thx, added to linux-can-next. The patch will be included in the next
> pull request to David.

Marc, I'm trying to just quickly apply these all into my tree directly.

Thanks.

^ permalink raw reply

* [PATCH net-next v3 2/2] ipv4: processing ancillary IP_TOS or IP_TTL
From: Francesco Fusco @ 2013-09-24 13:43 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <cover.1379944641.git.ffusco@redhat.com>

If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
packets with the specified TTL or TOS overriding the socket values specified
with the traditional setsockopt().

The struct inet_cork stores the values of TOS, TTL and priority that are
passed through the struct ipcm_cookie. If there are user-specified TOS
(tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
used to override the per-socket values. In case of TOS also the priority
is changed accordingly.

Two helper functions get_rttos and get_rtconn_flags are defined to take
into account the presence of a user specified TOS value when computing
RT_TOS and RT_CONN_FLAGS.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
---
 v1->v2
  - reworked the entire patch
  - modified the ttl field in the struct inet_cork from __s16 to __u8:
    0 means that the TTL is not specified
  - the tos field in the struct inet_cork is still __s16: 
    -1 means that the tos is not set
  - modified the priority field in the struct inet_cork from __u32 to 
    char.
  - introduced the get_rttos and get_rtconn_flags functions
 v2->v3
  - no code changes, rebase to net-next

 include/net/inet_sock.h |  3 +++
 include/net/ip.h        | 11 +++++++++++
 include/net/route.h     |  1 +
 net/ipv4/icmp.c         |  5 +++++
 net/ipv4/ip_output.c    | 13 ++++++++++---
 net/ipv4/ping.c         |  4 +++-
 net/ipv4/raw.c          |  4 +++-
 net/ipv4/udp.c          |  4 +++-
 8 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 636d203..f314177 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -103,6 +103,9 @@ struct inet_cork {
 	int			length; /* Total length of all frames */
 	struct dst_entry	*dst;
 	u8			tx_flags;
+	__u8			ttl;
+	__s16			tos;
+	char			priority;
 };
 
 struct inet_cork_full {
diff --git a/include/net/ip.h b/include/net/ip.h
index 0135f38..77b4f9b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -28,6 +28,7 @@
 #include <linux/skbuff.h>
 
 #include <net/inet_sock.h>
+#include <net/route.h>
 #include <net/snmp.h>
 #include <net/flow.h>
 
@@ -140,6 +141,16 @@ static inline struct sk_buff *ip_finish_skb(struct sock *sk, struct flowi4 *fl4)
 	return __ip_make_skb(sk, fl4, &sk->sk_write_queue, &inet_sk(sk)->cork.base);
 }
 
+static inline __u8 get_rttos(struct ipcm_cookie* ipc, struct inet_sock *inet)
+{
+	return (ipc->tos != -1) ? RT_TOS(ipc->tos) : RT_TOS(inet->tos);
+}
+
+static inline __u8 get_rtconn_flags(struct ipcm_cookie* ipc, struct sock* sk)
+{
+	return (ipc->tos != -1) ? RT_CONN_FLAGS_TOS(sk, ipc->tos) : RT_CONN_FLAGS(sk);
+}
+
 /* datagram.c */
 int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 
diff --git a/include/net/route.h b/include/net/route.h
index 6f572ca..0ad8e01 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -39,6 +39,7 @@
 #define RTO_ONLINK	0x01
 
 #define RT_CONN_FLAGS(sk)   (RT_TOS(inet_sk(sk)->tos) | sock_flag(sk, SOCK_LOCALROUTE))
+#define RT_CONN_FLAGS_TOS(sk,tos)   (RT_TOS(tos) | sock_flag(sk, SOCK_LOCALROUTE))
 
 struct fib_nh;
 struct fib_info;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 5f7d11a..5c0e8bc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -353,6 +353,9 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	saddr = fib_compute_spec_dst(skb);
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
+	ipc.ttl = 0;
+	ipc.tos = -1;
+
 	if (icmp_param->replyopts.opt.opt.optlen) {
 		ipc.opt = &icmp_param->replyopts.opt;
 		if (ipc.opt->opt.srr)
@@ -608,6 +611,8 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	ipc.addr = iph->saddr;
 	ipc.opt = &icmp_param->replyopts.opt;
 	ipc.tx_flags = 0;
+	ipc.ttl = 0;
+	ipc.tos = -1;
 
 	rt = icmp_route_lookup(net, &fl4, skb_in, iph, saddr, tos,
 			       type, code, icmp_param);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index a04d872..7d8357b 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1060,6 +1060,9 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 			 rt->dst.dev->mtu : dst_mtu(&rt->dst);
 	cork->dst = &rt->dst;
 	cork->length = 0;
+	cork->ttl = ipc->ttl;
+	cork->tos = ipc->tos;
+	cork->priority = ipc->priority;
 	cork->tx_flags = ipc->tx_flags;
 
 	return 0;
@@ -1311,7 +1314,9 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 	if (cork->flags & IPCORK_OPT)
 		opt = cork->opt;
 
-	if (rt->rt_type == RTN_MULTICAST)
+	if (cork->ttl != 0)
+		ttl = cork->ttl;
+	else if (rt->rt_type == RTN_MULTICAST)
 		ttl = inet->mc_ttl;
 	else
 		ttl = ip_select_ttl(inet, &rt->dst);
@@ -1319,7 +1324,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 	iph = ip_hdr(skb);
 	iph->version = 4;
 	iph->ihl = 5;
-	iph->tos = inet->tos;
+	iph->tos = (cork->tos != -1) ? cork->tos : inet->tos;
 	iph->frag_off = df;
 	iph->ttl = ttl;
 	iph->protocol = sk->sk_protocol;
@@ -1331,7 +1336,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 		ip_options_build(skb, opt, cork->addr, rt, 0);
 	}
 
-	skb->priority = sk->sk_priority;
+	skb->priority = (cork->tos != -1) ? cork->priority: sk->sk_priority;
 	skb->mark = sk->sk_mark;
 	/*
 	 * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
@@ -1481,6 +1486,8 @@ void ip_send_unicast_reply(struct net *net, struct sk_buff *skb, __be32 daddr,
 	ipc.addr = daddr;
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
+	ipc.ttl = 0;
+	ipc.tos = -1;
 
 	if (replyopts.opt.opt.optlen) {
 		ipc.opt = &replyopts.opt;
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index d7d9882..706d108e 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -713,6 +713,8 @@ int ping_v4_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	ipc.opt = NULL;
 	ipc.oif = sk->sk_bound_dev_if;
 	ipc.tx_flags = 0;
+	ipc.ttl = 0;
+	ipc.tos = -1;
 
 	sock_tx_timestamp(sk, &ipc.tx_flags);
 
@@ -744,7 +746,7 @@ int ping_v4_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			return -EINVAL;
 		faddr = ipc.opt->opt.faddr;
 	}
-	tos = RT_TOS(inet->tos);
+	tos = get_rttos(&ipc, inet);
 	if (sock_flag(sk, SOCK_LOCALROUTE) ||
 	    (msg->msg_flags & MSG_DONTROUTE) ||
 	    (ipc.opt && ipc.opt->opt.is_strictroute)) {
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index bfec521..a3fe534 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -517,6 +517,8 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	ipc.addr = inet->inet_saddr;
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
+	ipc.ttl = 0;
+	ipc.tos = -1;
 	ipc.oif = sk->sk_bound_dev_if;
 
 	if (msg->msg_controllen) {
@@ -556,7 +558,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			daddr = ipc.opt->opt.faddr;
 		}
 	}
-	tos = RT_CONN_FLAGS(sk);
+	tos = get_rtconn_flags(&ipc, sk);
 	if (msg->msg_flags & MSG_DONTROUTE)
 		tos |= RTO_ONLINK;
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 74d2c95..22462d94 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -855,6 +855,8 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
+	ipc.ttl = 0;
+	ipc.tos = -1;
 
 	getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
 
@@ -938,7 +940,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		faddr = ipc.opt->opt.faddr;
 		connected = 0;
 	}
-	tos = RT_TOS(inet->tos);
+	tos = get_rttos(&ipc, inet);
 	if (sock_flag(sk, SOCK_LOCALROUTE) ||
 	    (msg->msg_flags & MSG_DONTROUTE) ||
 	    (ipc.opt && ipc.opt->opt.is_strictroute)) {
-- 
1.8.3.1
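
The override rule from the commit message (a per-message TOS is used
when tos != -1, a per-message TTL when ttl != 0, otherwise the socket
value applies) can be sketched in isolation as below. This is a
userspace illustration only: struct msg_override and both helper names
are hypothetical stand-ins, not kernel definitions.

```c
#include <stdint.h>

#define TOS_UNSET (-1)	/* sentinel from the patch: TOS not specified */
#define TTL_UNSET 0	/* sentinel from the patch: TTL not specified */

/* Hypothetical, trimmed-down stand-in for the per-message values. */
struct msg_override {
	uint8_t ttl;	/* 0 or 1..255 */
	int16_t tos;	/* -1 or 0..255; needs a signed type wider than 8 bits */
};

/* Use the per-message TOS when one was given, else the socket's. */
uint8_t effective_tos(const struct msg_override *m, uint8_t sock_tos)
{
	return (m->tos != TOS_UNSET) ? (uint8_t)m->tos : sock_tos;
}

/* Use the per-message TTL when one was given, else the socket's. */
uint8_t effective_ttl(const struct msg_override *m, uint8_t sock_ttl)
{
	return (m->ttl != TTL_UNSET) ? m->ttl : sock_ttl;
}
```

The asymmetry between the two sentinels follows from the valid ranges:
0 is a legal TOS, so "unset" has to live outside 0..255, whereas a TTL
of 0 is rejected on input and is therefore free to mean "unset".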

^ permalink raw reply related

* [PATCH net-next v3 1/2] ipv4: IP_TOS and IP_TTL can be specified as ancillary data
From: Francesco Fusco @ 2013-09-24 13:43 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <cover.1379944641.git.ffusco@redhat.com>

This patch enables the IP_TTL and IP_TOS values passed from userspace to
be stored in the ipcm_cookie struct. Three fields are added to the struct:

- the TTL, expressed as __u8.
  The allowed values are in the range [1,255].
  A value of 0 means that the TTL is not specified.

- the TOS, expressed as __s16.
  The allowed values are in the range [0,255].
  A value of -1 means that the TOS is not specified.

- the priority, expressed as a char and computed when
  handling the ancillary data.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
---
 v1->v2
  - changed the ipcm_cookie ttl field from __s16 to __u8.
    A value of 0 means that the TTL has not been specified
  - the tos field is still __s16. The user can specify
    values in the range 0-255 included, therefore I use
    a value of -1 as a flag saying that the value has
    not been specified
  - the priority is now a char (the return type of
    rt_tos2priority) instead of a __u32
  - improved commit message
 v2->v3
  - no code changes, rebase to net-next

 include/net/ip.h       |  3 +++
 net/ipv4/ip_sockglue.c | 20 +++++++++++++++++++-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index c1f192b..0135f38 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -56,6 +56,9 @@ struct ipcm_cookie {
 	int			oif;
 	struct ip_options_rcu	*opt;
 	__u8			tx_flags;
+	__u8			ttl;
+	__s16			tos;
+	char			priority;
 };
 
 #define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index d9c4f11..56e3445 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -189,7 +189,7 @@ EXPORT_SYMBOL(ip_cmsg_recv);
 
 int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc)
 {
-	int err;
+	int err, val;
 	struct cmsghdr *cmsg;
 
 	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
@@ -215,6 +215,24 @@ int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc)
 			ipc->addr = info->ipi_spec_dst.s_addr;
 			break;
 		}
+		case IP_TTL:
+			if (cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
+				return -EINVAL;
+			val = *(int *)CMSG_DATA(cmsg);
+			if (val < 1 || val > 255)
+				return -EINVAL;
+			ipc->ttl = val;
+			break;
+		case IP_TOS:
+			if (cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
+				return -EINVAL;
+			val = *(int *)CMSG_DATA(cmsg);
+			if (val < 0 || val > 255)
+				return -EINVAL;
+			ipc->tos = val;
+			ipc->priority = rt_tos2priority(ipc->tos);
+			break;
+
 		default:
 			return -EINVAL;
 		}
-- 
1.8.3.1

^ permalink raw reply related
