Netdev List


* Re: [PATCH net-next] ixgbe: simplify padding and length checks (v2)
From: Jeff Kirsher @ 2012-06-19 23:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Alexander Duyck, Bruce Allan, Carolyn Wyborny, Don Skidmore,
	Greg Rose, Peter P Waskiewicz Jr, David S. Miller, e1000-devel,
	netdev
In-Reply-To: <20120618163111.4e46493b@nehalam.linuxnetplumber.net>

On Mon, 2012-06-18 at 16:31 -0700, Stephen Hemminger wrote:
> The check for length <= 0 is bogus because length is unsigned, and the
> network stack never sends zero length packets (unless it is totally
> broken).
> 
> The check for really small packets can be optimized (using unlikely)
> and calling skb_pad directly.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> 

I just realized I had not responded to this updated patch.

Thanks Stephen, I have dropped your previous 2 patch series and added
this patch to my queue.
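
For readers following along, here is a minimal userspace sketch of the two
points in the quoted description. It is not the ixgbe code: the helper names
(toy_as_unsigned, toy_pad_needed) and the TOY_MIN_LEN value are assumptions
made up for illustration. It shows that an unsigned length can never be
negative, and that a short frame can be brought up to a minimum length in
one step.

```c
#include <assert.h>

/* Hypothetical minimum frame length, chosen only for illustration. */
#define TOY_MIN_LEN 17u

/* Converting a negative int to unsigned wraps to a large positive value,
 * so a "negative length" test on an unsigned variable can never fire.
 */
static unsigned int toy_as_unsigned(int v)
{
	return (unsigned int)v;
}

/* Number of padding bytes needed to bring len up to min_len. */
static unsigned int toy_pad_needed(unsigned int len, unsigned int min_len)
{
	assert(min_len > 0);
	if (len < min_len)	/* the rare case a driver would mark unlikely() */
		return min_len - len;
	return 0;
}
```

In a driver the result of the second helper would be the amount handed to a
padding routine such as skb_pad(); here it is only a number.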

^ permalink raw reply

* [PATCH v2] ipv4: Early TCP socket demux.
From: David Miller @ 2012-06-19 23:39 UTC (permalink / raw)
  To: netdev


Input packet processing for local sockets involves two major demuxes.
One for the route and one for the socket.

But we can optimize this down to one demux for certain kinds of local
sockets.

Currently we only do this for established TCP sockets, but it could
at least in theory be expanded to other kinds of connections.

If a TCP socket is established then its identity is fully specified.

This means that whatever input route was used during the three-way
handshake must work equally well for the rest of the connection since
the keys will not change.

Once we move to established state, we cache the receive packet's input
route to use later.

Like the existing cached route in sk->sk_dst_cache used for output
packets, we have to check for route invalidations using dst->obsolete
and dst->ops->check().

Early demux occurs outside of a socket locked section, so when a route
invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
actually inside of established state packet processing and thus have
the socket locked.

Signed-off-by: David S. Miller <davem@davemloft.net>
---

Changes since v1:

1) Remove unlikely() from __inet_lookup_skb()

2) Check for cached route invalidations.

3) Hook up RX dst when outgoing connection moved to established too,
   previously it was only handling incoming connections.

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 808fc5f..54be028 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -379,10 +379,10 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
 					     const __be16 sport,
 					     const __be16 dport)
 {
-	struct sock *sk;
+	struct sock *sk = skb_steal_sock(skb);
 	const struct iphdr *iph = ip_hdr(skb);
 
-	if (unlikely(sk = skb_steal_sock(skb)))
+	if (sk)
 		return sk;
 	else
 		return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo,
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 875f489..6c47bf8 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -34,6 +34,7 @@
 
 /* This is used to register protocols. */
 struct net_protocol {
+	int			(*early_demux)(struct sk_buff *skb);
 	int			(*handler)(struct sk_buff *skb);
 	void			(*err_handler)(struct sk_buff *skb, u32 info);
 	int			(*gso_send_check)(struct sk_buff *skb);
diff --git a/include/net/sock.h b/include/net/sock.h
index 4a45216..87b424a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -319,6 +319,7 @@ struct sock {
 	unsigned long 		sk_flags;
 	struct dst_entry	*sk_dst_cache;
 	spinlock_t		sk_dst_lock;
+	struct dst_entry	*sk_rx_dst;
 	atomic_t		sk_wmem_alloc;
 	atomic_t		sk_omem_alloc;
 	int			sk_sndbuf;
@@ -1426,6 +1427,7 @@ extern struct sk_buff		*sock_rmalloc(struct sock *sk,
 					      gfp_t priority);
 extern void			sock_wfree(struct sk_buff *skb);
 extern void			sock_rfree(struct sk_buff *skb);
+extern void			sock_edemux(struct sk_buff *skb);
 
 extern int			sock_setsockopt(struct socket *sock, int level,
 						int op, char __user *optval,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9332f34..6660ffc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -325,6 +325,7 @@ extern void tcp_v4_err(struct sk_buff *skb, u32);
 
 extern void tcp_shutdown (struct sock *sk, int how);
 
+extern int tcp_v4_early_demux(struct sk_buff *skb);
 extern int tcp_v4_rcv(struct sk_buff *skb);
 
 extern struct inet_peer *tcp_v4_get_peer(struct sock *sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index 9e5b71f..929bdcc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1465,6 +1465,11 @@ void sock_rfree(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_rfree);
 
+void sock_edemux(struct sk_buff *skb)
+{
+	sock_put(skb->sk);
+}
+EXPORT_SYMBOL(sock_edemux);
 
 int sock_i_uid(struct sock *sk)
 {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e4e8e00..a2bd2d2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -157,6 +157,7 @@ void inet_sock_destruct(struct sock *sk)
 
 	kfree(rcu_dereference_protected(inet->inet_opt, 1));
 	dst_release(rcu_dereference_check(sk->sk_dst_cache, 1));
+	dst_release(sk->sk_rx_dst);
 	sk_refcnt_debug_dec(sk);
 }
 EXPORT_SYMBOL(inet_sock_destruct);
@@ -1520,14 +1521,15 @@ static const struct net_protocol igmp_protocol = {
 #endif
 
 static const struct net_protocol tcp_protocol = {
-	.handler =	tcp_v4_rcv,
-	.err_handler =	tcp_v4_err,
-	.gso_send_check = tcp_v4_gso_send_check,
-	.gso_segment =	tcp_tso_segment,
-	.gro_receive =	tcp4_gro_receive,
-	.gro_complete =	tcp4_gro_complete,
-	.no_policy =	1,
-	.netns_ok =	1,
+	.early_demux	=	tcp_v4_early_demux,
+	.handler	=	tcp_v4_rcv,
+	.err_handler	=	tcp_v4_err,
+	.gso_send_check	=	tcp_v4_gso_send_check,
+	.gso_segment	=	tcp_tso_segment,
+	.gro_receive	=	tcp4_gro_receive,
+	.gro_complete	=	tcp4_gro_complete,
+	.no_policy	=	1,
+	.netns_ok	=	1,
 };
 
 static const struct net_protocol udp_protocol = {
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 8590144..cb883e1 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -324,19 +324,34 @@ static int ip_rcv_finish(struct sk_buff *skb)
 	 *	how the packet travels inside Linux networking.
 	 */
 	if (skb_dst(skb) == NULL) {
-		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
-					       iph->tos, skb->dev);
-		if (unlikely(err)) {
-			if (err == -EHOSTUNREACH)
-				IP_INC_STATS_BH(dev_net(skb->dev),
-						IPSTATS_MIB_INADDRERRORS);
-			else if (err == -ENETUNREACH)
-				IP_INC_STATS_BH(dev_net(skb->dev),
-						IPSTATS_MIB_INNOROUTES);
-			else if (err == -EXDEV)
-				NET_INC_STATS_BH(dev_net(skb->dev),
-						 LINUX_MIB_IPRPFILTER);
-			goto drop;
+		const struct net_protocol *ipprot;
+		int protocol = iph->protocol;
+		int hash, err;
+
+		hash = protocol & (MAX_INET_PROTOS - 1);
+
+		rcu_read_lock();
+		ipprot = rcu_dereference(inet_protos[hash]);
+		err = -ENOENT;
+		if (ipprot && ipprot->early_demux)
+			err = ipprot->early_demux(skb);
+		rcu_read_unlock();
+
+		if (err) {
+			err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+						   iph->tos, skb->dev);
+			if (unlikely(err)) {
+				if (err == -EHOSTUNREACH)
+					IP_INC_STATS_BH(dev_net(skb->dev),
+							IPSTATS_MIB_INADDRERRORS);
+				else if (err == -ENETUNREACH)
+					IP_INC_STATS_BH(dev_net(skb->dev),
+							IPSTATS_MIB_INNOROUTES);
+				else if (err == -EXDEV)
+					NET_INC_STATS_BH(dev_net(skb->dev),
+							 LINUX_MIB_IPRPFILTER);
+				goto drop;
+			}
 		}
 	}
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b224eb8..8416f8a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5518,6 +5518,18 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 	struct tcp_sock *tp = tcp_sk(sk);
 	int res;
 
+	if (sk->sk_rx_dst) {
+		struct dst_entry *dst = sk->sk_rx_dst;
+		if (unlikely(dst->obsolete)) {
+			if (dst->ops->check(dst, 0) == NULL) {
+				dst_release(dst);
+				sk->sk_rx_dst = NULL;
+			}
+		}
+	}
+	if (unlikely(sk->sk_rx_dst == NULL))
+		sk->sk_rx_dst = dst_clone(skb_dst(skb));
+
 	/*
 	 *	Header prediction.
 	 *	The code loosely follows the one in the famous
@@ -5729,8 +5741,10 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 
 	tcp_set_state(sk, TCP_ESTABLISHED);
 
-	if (skb != NULL)
+	if (skb != NULL) {
+		sk->sk_rx_dst = dst_clone(skb_dst(skb));
 		security_inet_conn_established(sk, skb);
+	}
 
 	/* Make sure socket is routed, for correct metrics.  */
 	icsk->icsk_af_ops->rebuild_header(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fda2ca1..13857df 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1671,6 +1671,52 @@ csum_err:
 }
 EXPORT_SYMBOL(tcp_v4_do_rcv);
 
+int tcp_v4_early_demux(struct sk_buff *skb)
+{
+	struct net *net = dev_net(skb->dev);
+	const struct iphdr *iph;
+	const struct tcphdr *th;
+	struct sock *sk;
+	int err;
+
+	err = -ENOENT;
+	if (skb->pkt_type != PACKET_HOST)
+		goto out_err;
+
+	if (!pskb_may_pull(skb, ip_hdrlen(skb) + sizeof(struct tcphdr)))
+		goto out_err;
+
+	iph = ip_hdr(skb);
+	th = (struct tcphdr *) ((char *)iph + ip_hdrlen(skb));
+
+	if (th->doff < sizeof(struct tcphdr) / 4)
+		goto out_err;
+
+	if (!pskb_may_pull(skb, ip_hdrlen(skb) + th->doff * 4))
+		goto out_err;
+
+	sk = __inet_lookup_established(net, &tcp_hashinfo,
+				       iph->saddr, th->source,
+				       iph->daddr, th->dest,
+				       skb->dev->ifindex);
+	if (sk) {
+		skb->sk = sk;
+		skb->destructor = sock_edemux;
+		if (sk->sk_state != TCP_TIME_WAIT) {
+			struct dst_entry *dst = sk->sk_rx_dst;
+			if (dst)
+				dst = dst_check(dst, 0);
+			if (dst) {
+				skb_dst_set_noref(skb, dst);
+				err = 0;
+			}
+		}
+	}
+
+out_err:
+	return err;
+}
+
 /*
  *	From tcp_input.c
  */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index cb01531..72b7c63 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -445,6 +445,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		struct tcp_sock *oldtp = tcp_sk(sk);
 		struct tcp_cookie_values *oldcvp = oldtp->cookie_values;
 
+		newsk->sk_rx_dst = dst_clone(skb_dst(skb));
+
 		/* TCP Cookie Transactions require space for the cookie pair,
 		 * as it differs for each connection.  There is no need to
 		 * copy any s_data_payload stored at the original socket.
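
The control flow of the patch above can be modelled in userspace roughly as
follows. This is a sketch under stated assumptions, not the kernel code: all
toy_* types and functions are hypothetical stand-ins, reference counting
(dst_clone, dst_release, sock_put) is left out, the TIME_WAIT test is reduced
to an "established" flag, and an obsolete route is simply treated as invalid
instead of being passed to dst->ops->check().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for struct dst_entry, struct sock and struct sk_buff. */
struct toy_dst  { int obsolete; };
struct toy_sock { int established; struct toy_dst *rx_dst; };
struct toy_skb  { struct toy_sock *sk; struct toy_dst *dst; };

/* Early demux: attach the socket, and its cached route if still usable.
 * "found" stands in for the result of the established-socket lookup.
 */
static int toy_early_demux(struct toy_skb *skb, struct toy_sock *found)
{
	if (found == NULL)
		return -ENOENT;
	skb->sk = found;
	if (found->established && found->rx_dst && !found->rx_dst->obsolete) {
		skb->dst = found->rx_dst;
		return 0;
	}
	return -ENOENT;
}

/* Receive path: do the full route lookup only if early demux failed. */
static void toy_rcv_finish(struct toy_skb *skb, struct toy_sock *found,
			   struct toy_dst *routed, int *route_lookups)
{
	assert(skb != NULL && route_lookups != NULL);
	if (skb->dst == NULL && toy_early_demux(skb, found) != 0) {
		(*route_lookups)++;	/* stands in for ip_route_input_noref() */
		skb->dst = routed;
	}
}

/* Established-state processing (socket locked in the real code): drop a
 * stale cached route, then cache the route the current packet arrived with.
 */
static void toy_rcv_established(struct toy_sock *sk, struct toy_skb *skb)
{
	if (sk->rx_dst && sk->rx_dst->obsolete)
		sk->rx_dst = NULL;
	if (sk->rx_dst == NULL)
		sk->rx_dst = skb->dst;
}
```

The first packet pays for a route lookup and seeds the cache; later packets
reuse the cached route until it is marked obsolete, at which point the slow
path runs once more and the cache is repaired under the socket lock.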

^ permalink raw reply related

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: Ben Hutchings @ 2012-06-20  0:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120619.163911.2094057156011157978.davem@davemloft.net>

On Tue, 2012-06-19 at 16:39 -0700, David Miller wrote:
> Input packet processing for local sockets involves two major demuxes.
> One for the route and one for the socket.
> 
> But we can optimize this down to one demux for certain kinds of local
> sockets.
[...]
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -324,19 +324,34 @@ static int ip_rcv_finish(struct sk_buff *skb)
>  	 *	how the packet travels inside Linux networking.
>  	 */
>  	if (skb_dst(skb) == NULL) {
> -		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
> -					       iph->tos, skb->dev);
> -		if (unlikely(err)) {
> -			if (err == -EHOSTUNREACH)
> -				IP_INC_STATS_BH(dev_net(skb->dev),
> -						IPSTATS_MIB_INADDRERRORS);
> -			else if (err == -ENETUNREACH)
> -				IP_INC_STATS_BH(dev_net(skb->dev),
> -						IPSTATS_MIB_INNOROUTES);
> -			else if (err == -EXDEV)
> -				NET_INC_STATS_BH(dev_net(skb->dev),
> -						 LINUX_MIB_IPRPFILTER);
> -			goto drop;
> +		const struct net_protocol *ipprot;
> +		int protocol = iph->protocol;
> +		int hash, err;
> +
> +		hash = protocol & (MAX_INET_PROTOS - 1);
[...]

This 'hashing' threw me when I read v1, because nowhere do we actually
check that the protocol (as opposed to hash) matches that for the
selected ipprot.  (And this also turns out to be true for the current
receive path.)

This works only because MAX_INET_PROTOS is defined as 256, so that hash
== protocol.  If we were ever to change MAX_INET_PROTOS then we would
need to add a whole lot of protocol checks, but this isn't particularly
obvious!  Perhaps it would be better to remove the 'hashing' altogether?
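
To make the point concrete, here is a small userspace sketch (the names
proto_slot and mask_is_identity are hypothetical, not kernel functions). It
checks that masking an 8-bit protocol number with a table size of 256 never
changes it, while any smaller power-of-two size makes distinct protocols
share a slot.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INET_PROTOS 256

/* Slot selected by the 'hash' for a given power-of-two table size. */
static unsigned int proto_slot(uint8_t protocol, unsigned int table_size)
{
	assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
	return protocol & (table_size - 1);
}

/* Returns 1 if every 8-bit protocol value maps to itself, 0 otherwise. */
static int mask_is_identity(unsigned int table_size)
{
	unsigned int p;

	for (p = 0; p < 256; p++)
		if (proto_slot((uint8_t)p, table_size) != p)
			return 0;
	return 1;
}
```

With a table size of 128, for example, protocols 6 and 134 land in the same
slot, which is exactly the case that would need an explicit protocol check.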

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
From: Alexander Duyck @ 2012-06-20  0:43 UTC (permalink / raw)
  To: netdev; +Cc: davem, jeffrey.t.kirsher

This patch is meant to improve system performance when
netdev_alloc_frag is used in scenarios in which buffers are short-lived.
This is accomplished by allowing the page offset to be reset in the event
that the page count is 1.  I also reversed the direction in which we give
out sections of the page so that we start at the end of the page and end at
the start.  The main motivation is that I preferred to have offset
represent the amount of page remaining to be used.

My primary test case was using ixgbe in combination with TCP.  With this
patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
thread of netperf receiving a TCP stream via ixgbe.

I also tested several scenarios in which the page reuse would not be
possible such as UDP flows and routing.  In both of these scenarios I saw
no noticeable performance degradation compared to the kernel without this
patch.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 net/core/skbuff.c |   15 +++++++++++----
 1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..eb3853c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
 	if (unlikely(!nc->page)) {
 refill:
 		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
-		nc->offset = 0;
 	}
 	if (likely(nc->page)) {
-		if (nc->offset + fragsz > PAGE_SIZE) {
+		unsigned int offset = PAGE_SIZE;
+
+		if (page_count(nc->page) != 1)
+			offset = nc->offset;
+
+		if (offset < fragsz) {
 			put_page(nc->page);
 			goto refill;
 		}
-		data = page_address(nc->page) + nc->offset;
-		nc->offset += fragsz;
+
+		offset -= fragsz;
+		nc->offset = offset;
+
+		data = page_address(nc->page) + offset;
 		get_page(nc->page);
 	}
 	local_irq_restore(flags);
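
The allocation policy in the patch can be sketched in userspace as below.
This is only a model under assumptions: toy_page, toy_cache, toy_alloc_frag
and toy_free_frag are hypothetical names, a plain counter replaces the page
reference count, and a return value of -1 stands in for the put_page() and
refill path.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u

struct toy_page  { unsigned int refcount; };
struct toy_cache { struct toy_page *page; unsigned int offset; };

/* Hand out fragsz bytes from the end of the page towards the start.
 * Returns the fragment's offset within the page, or -1 when the page
 * has no room left and would have to be replaced.
 */
static long toy_alloc_frag(struct toy_cache *nc, unsigned int fragsz)
{
	unsigned int offset = TOY_PAGE_SIZE;

	assert(nc != NULL && nc->page != NULL && fragsz <= TOY_PAGE_SIZE);

	/* Only the cache holds the page: every fragment was freed, so the
	 * whole page can be reused.  Otherwise continue from nc->offset.
	 */
	if (nc->page->refcount != 1)
		offset = nc->offset;

	if (offset < fragsz)
		return -1;

	offset -= fragsz;
	nc->offset = offset;	/* bytes of the page still unused */
	nc->page->refcount++;	/* the fragment holds a reference */
	return (long)offset;
}

/* Releasing a fragment drops its reference on the page. */
static void toy_free_frag(struct toy_page *page)
{
	page->refcount--;
}
```

With short-lived buffers the references are dropped before the page runs
out, so the same page keeps being handed out instead of a new one being
allocated.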

^ permalink raw reply related

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: David Miller @ 2012-06-20  0:54 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev
In-Reply-To: <1340151666.6871.81.camel@deadeye.wl.decadent.org.uk>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Wed, 20 Jun 2012 01:21:06 +0100

> This works only because MAX_INET_PROTOS is defined as 256, so that hash
> == protocol.  If we were ever to change MAX_INET_PROTOS then we would
> need to add a whole lot of protocol checks, but this isn't particularly
> obvious!  Perhaps it would be better to remove the 'hashing' altogether?

We never have changed the value and we never will.  The hash is
perfect, what's the big deal?

^ permalink raw reply

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: Ben Hutchings @ 2012-06-20  1:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120619.175435.1705108432673390966.davem@davemloft.net>

On Tue, 2012-06-19 at 17:54 -0700, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Wed, 20 Jun 2012 01:21:06 +0100
> 
> > This works only because MAX_INET_PROTOS is defined as 256, so that hash
> > == protocol.  If we were ever to change MAX_INET_PROTOS then we would
> > need to add a whole lot of protocol checks, but this isn't particularly
> > obvious!  Perhaps it would be better to remove the 'hashing' altogether?
> 
> We never have changed the value and we never will.

Well that's what I expected.

> The hash is perfect, what's the big deal?

It obscures what we're really doing and relying on.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: David Miller @ 2012-06-20  1:05 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev
In-Reply-To: <1340154206.6871.90.camel@deadeye.wl.decadent.org.uk>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Wed, 20 Jun 2012 02:03:26 +0100

> On Tue, 2012-06-19 at 17:54 -0700, David Miller wrote:
>> The hash is perfect, what's the big deal?
> 
> It obscures what we're really doing and relying on.

If it matters to you, patches are always welcome :-)

^ permalink raw reply

* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
From: Alexander Duyck @ 2012-06-20  1:49 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher, Eric Dumazet
In-Reply-To: <20120620004306.17814.58369.stgit@gitlad.jf.intel.com>

On 6/19/2012 5:43 PM, Alexander Duyck wrote:
> This patch is meant to help improve system performance when
> netdev_alloc_frag is used in scenarios in which buffers are short lived.
> This is accomplished by allowing the page offset to be reset in the event
> that the page count is 1.  I also reordered the direction in which we give
> out sections of the page so that we start at the end of the page and end at
> the start.  The main motivation being that I preferred to have offset
> represent the amount of page remaining to be used.
>
> My primary test case was using ixgbe in combination with TCP.  With this
> patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
> thread of netperf receiving a TCP stream via ixgbe.
>
> I also tested several scenarios in which the page reuse would not be
> possible such as UDP flows and routing.  In both of these scenarios I saw
> no noticeable performance degradation compared to the kernel without this
> patch.
>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>
>   net/core/skbuff.c |   15 +++++++++++----
>   1 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5b21522..eb3853c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
>   	if (unlikely(!nc->page)) {
>   refill:
>   		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> -		nc->offset = 0;
>   	}
>   	if (likely(nc->page)) {
> -		if (nc->offset + fragsz > PAGE_SIZE) {
> +		unsigned int offset = PAGE_SIZE;
> +
> +		if (page_count(nc->page) != 1)
> +			offset = nc->offset;
> +
> +		if (offset < fragsz) {
>   			put_page(nc->page);
>   			goto refill;
>   		}
> -		data = page_address(nc->page) + nc->offset;
> -		nc->offset += fragsz;
> +
> +		offset -= fragsz;
> +		nc->offset = offset;
> +
> +		data = page_address(nc->page) + offset;
>   		get_page(nc->page);
>   	}
>   	local_irq_restore(flags);
>

It looks like I forgot to add "--auto" to the command line when I sent
this out via stg mail so I am just adding Eric to the CC list on this 
reply.  Sorry for the extra noise.

Thanks,

Alex

^ permalink raw reply

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: David Miller @ 2012-06-20  2:02 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev
In-Reply-To: <20120619.180527.65196005824751590.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 19 Jun 2012 18:05:27 -0700 (PDT)

> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Wed, 20 Jun 2012 02:03:26 +0100
> 
>> On Tue, 2012-06-19 at 17:54 -0700, David Miller wrote:
>>> The hash is perfect, what's the big deal?
>> 
>> It obscures what we're really doing and relying on.
> 
> If it matters to you, patches are always welcome :-)

Never mind, I just committed the following to net-next:

--------------------
inet: Sanitize inet{,6} protocol demux.

Don't pretend that inet_protos[] and inet6_protos[] are hashes, they
are just straight arrays.  Remove all unnecessary hash masking.

Document MAX_INET_PROTOS.

Use RAW_HTABLE_SIZE when appropriate.

Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/protocol.h |    7 +++++--
 net/ipv4/af_inet.c     |   26 ++++++++++++--------------
 net/ipv4/icmp.c        |    9 ++++-----
 net/ipv4/ip_input.c    |    5 ++---
 net/ipv4/protocol.c    |    8 +++-----
 net/ipv6/icmp.c        |    7 ++-----
 net/ipv6/ip6_input.c   |    9 +++------
 net/ipv6/protocol.c    |    8 +++-----
 net/ipv6/raw.c         |    4 ++--
 9 files changed, 36 insertions(+), 47 deletions(-)

diff --git a/include/net/protocol.h b/include/net/protocol.h
index 875f489..a1b1b53 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -29,8 +29,11 @@
 #include <linux/ipv6.h>
 #endif
 
-#define MAX_INET_PROTOS	256		/* Must be a power of 2		*/
-
+/* This is one larger than the largest protocol value that can be
+ * found in an ipv4 or ipv6 header.  Since in both cases the protocol
+ * value is presented in a __u8, this is defined to be 256.
+ */
+#define MAX_INET_PROTOS		256
 
 /* This is used to register protocols. */
 struct net_protocol {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e4e8e00..85a3b17 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -242,20 +242,18 @@ void build_ehash_secret(void)
 }
 EXPORT_SYMBOL(build_ehash_secret);
 
-static inline int inet_netns_ok(struct net *net, int protocol)
+static inline int inet_netns_ok(struct net *net, __u8 protocol)
 {
-	int hash;
 	const struct net_protocol *ipprot;
 
 	if (net_eq(net, &init_net))
 		return 1;
 
-	hash = protocol & (MAX_INET_PROTOS - 1);
-	ipprot = rcu_dereference(inet_protos[hash]);
-
-	if (ipprot == NULL)
+	ipprot = rcu_dereference(inet_protos[protocol]);
+	if (ipprot == NULL) {
 		/* raw IP is OK */
 		return 1;
+	}
 	return ipprot->netns_ok;
 }
 
@@ -1216,8 +1214,8 @@ EXPORT_SYMBOL(inet_sk_rebuild_header);
 
 static int inet_gso_send_check(struct sk_buff *skb)
 {
-	const struct iphdr *iph;
 	const struct net_protocol *ops;
+	const struct iphdr *iph;
 	int proto;
 	int ihl;
 	int err = -EINVAL;
@@ -1236,7 +1234,7 @@ static int inet_gso_send_check(struct sk_buff *skb)
 	__skb_pull(skb, ihl);
 	skb_reset_transport_header(skb);
 	iph = ip_hdr(skb);
-	proto = iph->protocol & (MAX_INET_PROTOS - 1);
+	proto = iph->protocol;
 	err = -EPROTONOSUPPORT;
 
 	rcu_read_lock();
@@ -1253,8 +1251,8 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 	netdev_features_t features)
 {
 	struct sk_buff *segs = ERR_PTR(-EINVAL);
-	struct iphdr *iph;
 	const struct net_protocol *ops;
+	struct iphdr *iph;
 	int proto;
 	int ihl;
 	int id;
@@ -1286,7 +1284,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 	skb_reset_transport_header(skb);
 	iph = ip_hdr(skb);
 	id = ntohs(iph->id);
-	proto = iph->protocol & (MAX_INET_PROTOS - 1);
+	proto = iph->protocol;
 	segs = ERR_PTR(-EPROTONOSUPPORT);
 
 	rcu_read_lock();
@@ -1340,7 +1338,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 			goto out;
 	}
 
-	proto = iph->protocol & (MAX_INET_PROTOS - 1);
+	proto = iph->protocol;
 
 	rcu_read_lock();
 	ops = rcu_dereference(inet_protos[proto]);
@@ -1398,11 +1396,11 @@ out:
 
 static int inet_gro_complete(struct sk_buff *skb)
 {
-	const struct net_protocol *ops;
+	__be16 newlen = htons(skb->len - skb_network_offset(skb));
 	struct iphdr *iph = ip_hdr(skb);
-	int proto = iph->protocol & (MAX_INET_PROTOS - 1);
+	const struct net_protocol *ops;
+	int proto = iph->protocol;
 	int err = -ENOSYS;
-	__be16 newlen = htons(skb->len - skb_network_offset(skb));
 
 	csum_replace2(&iph->check, iph->tot_len, newlen);
 	iph->tot_len = newlen;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e1caa1a..49a74cc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -637,12 +637,12 @@ EXPORT_SYMBOL(icmp_send);
 
 static void icmp_unreach(struct sk_buff *skb)
 {
+	const struct net_protocol *ipprot;
 	const struct iphdr *iph;
 	struct icmphdr *icmph;
-	int hash, protocol;
-	const struct net_protocol *ipprot;
-	u32 info = 0;
 	struct net *net;
+	u32 info = 0;
+	int protocol;
 
 	net = dev_net(skb_dst(skb)->dev);
 
@@ -731,9 +731,8 @@ static void icmp_unreach(struct sk_buff *skb)
 	 */
 	raw_icmp_error(skb, protocol, info);
 
-	hash = protocol & (MAX_INET_PROTOS - 1);
 	rcu_read_lock();
-	ipprot = rcu_dereference(inet_protos[hash]);
+	ipprot = rcu_dereference(inet_protos[protocol]);
 	if (ipprot && ipprot->err_handler)
 		ipprot->err_handler(skb, info);
 	rcu_read_unlock();
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 8590144..c4fe1d2 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -198,14 +198,13 @@ static int ip_local_deliver_finish(struct sk_buff *skb)
 	rcu_read_lock();
 	{
 		int protocol = ip_hdr(skb)->protocol;
-		int hash, raw;
 		const struct net_protocol *ipprot;
+		int raw;
 
 	resubmit:
 		raw = raw_local_deliver(skb, protocol);
 
-		hash = protocol & (MAX_INET_PROTOS - 1);
-		ipprot = rcu_dereference(inet_protos[hash]);
+		ipprot = rcu_dereference(inet_protos[protocol]);
 		if (ipprot != NULL) {
 			int ret;
 
diff --git a/net/ipv4/protocol.c b/net/ipv4/protocol.c
index 9ae5c01..8918eff 100644
--- a/net/ipv4/protocol.c
+++ b/net/ipv4/protocol.c
@@ -36,9 +36,7 @@ const struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS] __read_mostly;
 
 int inet_add_protocol(const struct net_protocol *prot, unsigned char protocol)
 {
-	int hash = protocol & (MAX_INET_PROTOS - 1);
-
-	return !cmpxchg((const struct net_protocol **)&inet_protos[hash],
+	return !cmpxchg((const struct net_protocol **)&inet_protos[protocol],
 			NULL, prot) ? 0 : -1;
 }
 EXPORT_SYMBOL(inet_add_protocol);
@@ -49,9 +47,9 @@ EXPORT_SYMBOL(inet_add_protocol);
 
 int inet_del_protocol(const struct net_protocol *prot, unsigned char protocol)
 {
-	int ret, hash = protocol & (MAX_INET_PROTOS - 1);
+	int ret;
 
-	ret = (cmpxchg((const struct net_protocol **)&inet_protos[hash],
+	ret = (cmpxchg((const struct net_protocol **)&inet_protos[protocol],
 		       prot, NULL) == prot) ? 0 : -1;
 
 	synchronize_net();
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 5247d5c..c7da142 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -600,9 +600,8 @@ static void icmpv6_notify(struct sk_buff *skb, u8 type, u8 code, __be32 info)
 {
 	const struct inet6_protocol *ipprot;
 	int inner_offset;
-	int hash;
-	u8 nexthdr;
 	__be16 frag_off;
+	u8 nexthdr;
 
 	if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
 		return;
@@ -629,10 +628,8 @@ static void icmpv6_notify(struct sk_buff *skb, u8 type, u8 code, __be32 info)
 	   --ANK (980726)
 	 */
 
-	hash = nexthdr & (MAX_INET_PROTOS - 1);
-
 	rcu_read_lock();
-	ipprot = rcu_dereference(inet6_protos[hash]);
+	ipprot = rcu_dereference(inet6_protos[nexthdr]);
 	if (ipprot && ipprot->err_handler)
 		ipprot->err_handler(skb, NULL, type, code, inner_offset, info);
 	rcu_read_unlock();
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 21a15df..5ab923e 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -168,13 +168,12 @@ drop:
 
 static int ip6_input_finish(struct sk_buff *skb)
 {
+	struct net *net = dev_net(skb_dst(skb)->dev);
 	const struct inet6_protocol *ipprot;
+	struct inet6_dev *idev;
 	unsigned int nhoff;
 	int nexthdr;
 	bool raw;
-	u8 hash;
-	struct inet6_dev *idev;
-	struct net *net = dev_net(skb_dst(skb)->dev);
 
 	/*
 	 *	Parse extension headers
@@ -189,9 +188,7 @@ resubmit:
 	nexthdr = skb_network_header(skb)[nhoff];
 
 	raw = raw6_local_deliver(skb, nexthdr);
-
-	hash = nexthdr & (MAX_INET_PROTOS - 1);
-	if ((ipprot = rcu_dereference(inet6_protos[hash])) != NULL) {
+	if ((ipprot = rcu_dereference(inet6_protos[nexthdr])) != NULL) {
 		int ret;
 
 		if (ipprot->flags & INET6_PROTO_FINAL) {
diff --git a/net/ipv6/protocol.c b/net/ipv6/protocol.c
index 9a7978f..053082d 100644
--- a/net/ipv6/protocol.c
+++ b/net/ipv6/protocol.c
@@ -29,9 +29,7 @@ const struct inet6_protocol __rcu *inet6_protos[MAX_INET_PROTOS] __read_mostly;
 
 int inet6_add_protocol(const struct inet6_protocol *prot, unsigned char protocol)
 {
-	int hash = protocol & (MAX_INET_PROTOS - 1);
-
-	return !cmpxchg((const struct inet6_protocol **)&inet6_protos[hash],
+	return !cmpxchg((const struct inet6_protocol **)&inet6_protos[protocol],
 			NULL, prot) ? 0 : -1;
 }
 EXPORT_SYMBOL(inet6_add_protocol);
@@ -42,9 +40,9 @@ EXPORT_SYMBOL(inet6_add_protocol);
 
 int inet6_del_protocol(const struct inet6_protocol *prot, unsigned char protocol)
 {
-	int ret, hash = protocol & (MAX_INET_PROTOS - 1);
+	int ret;
 
-	ret = (cmpxchg((const struct inet6_protocol **)&inet6_protos[hash],
+	ret = (cmpxchg((const struct inet6_protocol **)&inet6_protos[protocol],
 		       prot, NULL) == prot) ? 0 : -1;
 
 	synchronize_net();
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 43b0042..b5c1dcb 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -165,7 +165,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 	saddr = &ipv6_hdr(skb)->saddr;
 	daddr = saddr + 1;
 
-	hash = nexthdr & (MAX_INET_PROTOS - 1);
+	hash = nexthdr & (RAW_HTABLE_SIZE - 1);
 
 	read_lock(&raw_v6_hashinfo.lock);
 	sk = sk_head(&raw_v6_hashinfo.ht[hash]);
@@ -229,7 +229,7 @@ bool raw6_local_deliver(struct sk_buff *skb, int nexthdr)
 {
 	struct sock *raw_sk;
 
-	raw_sk = sk_head(&raw_v6_hashinfo.ht[nexthdr & (MAX_INET_PROTOS - 1)]);
+	raw_sk = sk_head(&raw_v6_hashinfo.ht[nexthdr & (RAW_HTABLE_SIZE - 1)]);
 	if (raw_sk && !ipv6_raw_deliver(skb, nexthdr))
 		raw_sk = NULL;
 
-- 
1.7.10
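
As an illustration of what the cleanup leaves behind, the table can be
modelled as a plain array indexed by the 8-bit protocol number. The sketch
below is hypothetical userspace code (toy_* names are invented here) and is
deliberately not atomic: the kernel uses cmpxchg() and RCU, which this model
omits. It also assumes unsigned char is 8 bits wide.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PROTOS 256	/* one more than the largest 8-bit value */

struct toy_protocol { int (*handler)(void); };

/* Directly indexed by protocol number; no hashing or masking involved. */
static const struct toy_protocol *toy_protos[TOY_MAX_PROTOS];

/* Register prot for protocol; fails with -1 if the slot is taken. */
static int toy_add_protocol(const struct toy_protocol *prot,
			    unsigned char protocol)
{
	assert(prot != NULL);
	if (toy_protos[protocol] != NULL)
		return -1;
	toy_protos[protocol] = prot;
	return 0;
}

/* Unregister prot; fails with -1 if a different handler owns the slot. */
static int toy_del_protocol(const struct toy_protocol *prot,
			    unsigned char protocol)
{
	if (toy_protos[protocol] != prot)
		return -1;
	toy_protos[protocol] = NULL;
	return 0;
}

/* Lookup is a single array access. */
static const struct toy_protocol *toy_lookup(unsigned char protocol)
{
	return toy_protos[protocol];
}
```

Because the index type cannot exceed 255, no bounds check or mask is needed
as long as the array has 256 entries.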

^ permalink raw reply related

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: Stephen Hemminger @ 2012-06-20  2:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120619.163911.2094057156011157978.davem@davemloft.net>

On Tue, 19 Jun 2012 16:39:11 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> 
> Input packet processing for local sockets involves two major demuxes.
> One for the route and one for the socket.
> 
> But we can optimize this down to one demux for certain kinds of local
> sockets.
> 
> Currently we only do this for established TCP sockets, but it could
> at least in theory be expanded to other kinds of connections.
> 
> If a TCP socket is established then its identity is fully specified.
> 
> This means that whatever input route was used during the three-way
> handshake must work equally well for the rest of the connection since
> the keys will not change.
> 
> Once we move to established state, we cache the receive packet's input
> route to use later.
> 
> Like the existing cached route in sk->sk_dst_cache used for output
> packets, we have to check for route invalidations using dst->obsolete
> and dst->ops->check().
> 
> Early demux occurs outside of a socket locked section, so when a route
> invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
> actually inside of established state packet processing and thus have
> the socket locked.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> 
> Changes since v1:
> 
> 1) Remove unlikely() from __inet_lookup_skb()
> 
> 2) Check for cached route invalidations.
> 
> 3) Hook up RX dst when outgoing connection moved to established too,
>    previously it was only handling incoming connections.
> 
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 808fc5f..54be028 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -379,10 +379,10 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
>  					     const __be16 sport,
>  					     const __be16 dport)
>  {
> -	struct sock *sk;
> +	struct sock *sk = skb_steal_sock(skb);
>  	const struct iphdr *iph = ip_hdr(skb);
>  
> -	if (unlikely(sk = skb_steal_sock(skb)))
> +	if (sk)
>  		return sk;
>  	else
>  		return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo,
> diff --git a/include/net/protocol.h b/include/net/protocol.h
> index 875f489..6c47bf8 100644
> --- a/include/net/protocol.h
> +++ b/include/net/protocol.h
> @@ -34,6 +34,7 @@
>  
>  /* This is used to register protocols. */
>  struct net_protocol {
> +	int			(*early_demux)(struct sk_buff *skb);
>  	int			(*handler)(struct sk_buff *skb);
>  	void			(*err_handler)(struct sk_buff *skb, u32 info);
>  	int			(*gso_send_check)(struct sk_buff *skb);
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 4a45216..87b424a 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -319,6 +319,7 @@ struct sock {
>  	unsigned long 		sk_flags;
>  	struct dst_entry	*sk_dst_cache;
>  	spinlock_t		sk_dst_lock;
> +	struct dst_entry	*sk_rx_dst;
>  	atomic_t		sk_wmem_alloc;
>  	atomic_t		sk_omem_alloc;
>  	int			sk_sndbuf;
> @@ -1426,6 +1427,7 @@ extern struct sk_buff		*sock_rmalloc(struct sock *sk,
>  					      gfp_t priority);
>  extern void			sock_wfree(struct sk_buff *skb);
>  extern void			sock_rfree(struct sk_buff *skb);
> +extern void			sock_edemux(struct sk_buff *skb);
>  
>  extern int			sock_setsockopt(struct socket *sock, int level,
>  						int op, char __user *optval,
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 9332f34..6660ffc 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -325,6 +325,7 @@ extern void tcp_v4_err(struct sk_buff *skb, u32);
>  
>  extern void tcp_shutdown (struct sock *sk, int how);
>  
> +extern int tcp_v4_early_demux(struct sk_buff *skb);
>  extern int tcp_v4_rcv(struct sk_buff *skb);
>  
>  extern struct inet_peer *tcp_v4_get_peer(struct sock *sk);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 9e5b71f..929bdcc 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1465,6 +1465,11 @@ void sock_rfree(struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL(sock_rfree);
>  
> +void sock_edemux(struct sk_buff *skb)
> +{
> +	sock_put(skb->sk);
> +}
> +EXPORT_SYMBOL(sock_edemux);
>  
>  int sock_i_uid(struct sock *sk)
>  {
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index e4e8e00..a2bd2d2 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -157,6 +157,7 @@ void inet_sock_destruct(struct sock *sk)
>  
>  	kfree(rcu_dereference_protected(inet->inet_opt, 1));
>  	dst_release(rcu_dereference_check(sk->sk_dst_cache, 1));
> +	dst_release(sk->sk_rx_dst);
>  	sk_refcnt_debug_dec(sk);
>  }
>  EXPORT_SYMBOL(inet_sock_destruct);
> @@ -1520,14 +1521,15 @@ static const struct net_protocol igmp_protocol = {
>  #endif
>  
>  static const struct net_protocol tcp_protocol = {
> -	.handler =	tcp_v4_rcv,
> -	.err_handler =	tcp_v4_err,
> -	.gso_send_check = tcp_v4_gso_send_check,
> -	.gso_segment =	tcp_tso_segment,
> -	.gro_receive =	tcp4_gro_receive,
> -	.gro_complete =	tcp4_gro_complete,
> -	.no_policy =	1,
> -	.netns_ok =	1,
> +	.early_demux	=	tcp_v4_early_demux,
> +	.handler	=	tcp_v4_rcv,
> +	.err_handler	=	tcp_v4_err,
> +	.gso_send_check	=	tcp_v4_gso_send_check,
> +	.gso_segment	=	tcp_tso_segment,
> +	.gro_receive	=	tcp4_gro_receive,
> +	.gro_complete	=	tcp4_gro_complete,
> +	.no_policy	=	1,
> +	.netns_ok	=	1,
>  };
>  
>  static const struct net_protocol udp_protocol = {
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index 8590144..cb883e1 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -324,19 +324,34 @@ static int ip_rcv_finish(struct sk_buff *skb)
>  	 *	how the packet travels inside Linux networking.
>  	 */
>  	if (skb_dst(skb) == NULL) {
> -		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
> -					       iph->tos, skb->dev);
> -		if (unlikely(err)) {
> -			if (err == -EHOSTUNREACH)
> -				IP_INC_STATS_BH(dev_net(skb->dev),
> -						IPSTATS_MIB_INADDRERRORS);
> -			else if (err == -ENETUNREACH)
> -				IP_INC_STATS_BH(dev_net(skb->dev),
> -						IPSTATS_MIB_INNOROUTES);
> -			else if (err == -EXDEV)
> -				NET_INC_STATS_BH(dev_net(skb->dev),
> -						 LINUX_MIB_IPRPFILTER);
> -			goto drop;
> +		const struct net_protocol *ipprot;
> +		int protocol = iph->protocol;
> +		int hash, err;
> +
> +		hash = protocol & (MAX_INET_PROTOS - 1);
> +
> +		rcu_read_lock();
> +		ipprot = rcu_dereference(inet_protos[hash]);
> +		err = -ENOENT;
> +		if (ipprot && ipprot->early_demux)
> +			err = ipprot->early_demux(skb);
> +		rcu_read_unlock();
> +
> +		if (err) {
> +			err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
> +						   iph->tos, skb->dev);
> +			if (unlikely(err)) {
> +				if (err == -EHOSTUNREACH)
> +					IP_INC_STATS_BH(dev_net(skb->dev),
> +							IPSTATS_MIB_INADDRERRORS);
> +				else if (err == -ENETUNREACH)
> +					IP_INC_STATS_BH(dev_net(skb->dev),
> +							IPSTATS_MIB_INNOROUTES);
> +				else if (err == -EXDEV)
> +					NET_INC_STATS_BH(dev_net(skb->dev),
> +							 LINUX_MIB_IPRPFILTER);
> +				goto drop;
> +			}
>  		}
>  	}
>  
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index b224eb8..8416f8a 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5518,6 +5518,18 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	int res;
>  
> +	if (sk->sk_rx_dst) {
> +		struct dst_entry *dst = sk->sk_rx_dst;
> +		if (unlikely(dst->obsolete)) {
> +			if (dst->ops->check(dst, 0) == NULL) {
> +				dst_release(dst);
> +				sk->sk_rx_dst = NULL;
> +			}
> +		}
> +	}
> +	if (unlikely(sk->sk_rx_dst == NULL))
> +		sk->sk_rx_dst = dst_clone(skb_dst(skb));
> +
>  	/*
>  	 *	Header prediction.
>  	 *	The code loosely follows the one in the famous
> @@ -5729,8 +5741,10 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
>  
>  	tcp_set_state(sk, TCP_ESTABLISHED);
>  
> -	if (skb != NULL)
> +	if (skb != NULL) {
> +		sk->sk_rx_dst = dst_clone(skb_dst(skb));
>  		security_inet_conn_established(sk, skb);
> +	}
>  
>  	/* Make sure socket is routed, for correct metrics.  */
>  	icsk->icsk_af_ops->rebuild_header(sk);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fda2ca1..13857df 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1671,6 +1671,52 @@ csum_err:
>  }
>  EXPORT_SYMBOL(tcp_v4_do_rcv);
>  
> +int tcp_v4_early_demux(struct sk_buff *skb)
> +{
> +	struct net *net = dev_net(skb->dev);
> +	const struct iphdr *iph;
> +	const struct tcphdr *th;
> +	struct sock *sk;
> +	int err;
> +
> +	err = -ENOENT;
> +	if (skb->pkt_type != PACKET_HOST)
> +		goto out_err;
> +
> +	if (!pskb_may_pull(skb, ip_hdrlen(skb) + sizeof(struct tcphdr)))
> +		goto out_err;
> +
> +	iph = ip_hdr(skb);
> +	th = (struct tcphdr *) ((char *)iph + ip_hdrlen(skb));
> +
> +	if (th->doff < sizeof(struct tcphdr) / 4)
> +		goto out_err;
> +
> +	if (!pskb_may_pull(skb, ip_hdrlen(skb) + th->doff * 4))
> +		goto out_err;
> +
> +	sk = __inet_lookup_established(net, &tcp_hashinfo,
> +				       iph->saddr, th->source,
> +				       iph->daddr, th->dest,
> +				       skb->dev->ifindex);
> +	if (sk) {
> +		skb->sk = sk;
> +		skb->destructor = sock_edemux;
> +		if (sk->sk_state != TCP_TIME_WAIT) {
> +			struct dst_entry *dst = sk->sk_rx_dst;
> +			if (dst)
> +				dst = dst_check(dst, 0);
> +			if (dst) {
> +				skb_dst_set_noref(skb, dst);
> +				err = 0;
> +			}
> +		}
> +	}
> +
> +out_err:
> +	return err;
> +}
> +
>  /*
>   *	From tcp_input.c
>   */
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index cb01531..72b7c63 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -445,6 +445,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
>  		struct tcp_sock *oldtp = tcp_sk(sk);
>  		struct tcp_cookie_values *oldcvp = oldtp->cookie_values;
>  
> +		newsk->sk_rx_dst = dst_clone(skb_dst(skb));
> +
>  		/* TCP Cookie Transactions require space for the cookie pair,
>  		 * as it differs for each connection.  There is no need to
>  		 * copy any s_data_payload stored at the original socket.


Any benchmark numbers?

I think the number of ref count operations per packet is going to be
the next line in the sand.

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2012-06-20  3:18 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: linux-next, linux-kernel, Thomas Graf

Hi all,

Today's linux-next merge of the net-next tree got a conflict in
net/ipv6/route.c between commit d189634ecab9 ("ipv6: Move ipv6 proc file
registration to end of init order") from the net tree and commit
c3426b47190d ("inet: Initialize per-netns inetpeer roots in net/ipv
{4,6}/route.c") from the net-next tree.

I fixed it up (I think - see below) and can carry the fix as necessary.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc net/ipv6/route.c
index becb048,e649cd7..0000000
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@@ -3007,11 -2929,31 +2938,36 @@@ static struct pernet_operations ip6_rou
  	.exit = ip6_route_net_exit,
  };
  
 +static struct pernet_operations ip6_route_net_late_ops = {
 +	.init = ip6_route_net_init_late,
 +	.exit = ip6_route_net_exit_late,
 +};
 +
+ static int __net_init ipv6_inetpeer_init(struct net *net)
+ {
+ 	struct inet_peer_base *bp = kmalloc(sizeof(*bp), GFP_KERNEL);
+ 
+ 	if (!bp)
+ 		return -ENOMEM;
+ 	inet_peer_base_init(bp);
+ 	net->ipv6.peers = bp;
+ 	return 0;
+ }
+ 
+ static void __net_exit ipv6_inetpeer_exit(struct net *net)
+ {
+ 	struct inet_peer_base *bp = net->ipv6.peers;
+ 
+ 	net->ipv6.peers = NULL;
+ 	inetpeer_invalidate_tree(bp);
+ 	kfree(bp);
+ }
+ 
+ static struct pernet_operations ipv6_inetpeer_ops = {
+ 	.init	=	ipv6_inetpeer_init,
+ 	.exit	=	ipv6_inetpeer_exit,
+ };
+ 
  static struct notifier_block ip6_route_dev_notifier = {
  	.notifier_call = ip6_route_dev_notify,
  	.priority = 0,

^ permalink raw reply

* linux-next: build failure after merge of the net-next tree
From: Stephen Rothwell @ 2012-06-20  3:33 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Federico Vaga, Giancarlo Asnaghi,
	Wolfgang Grandegger, Bhupesh Sharma, Marc Kleine-Budde

Hi all,

After merging the net-next tree, today's linux-next build (x86_64
allmodconfig) failed like this:

ERROR: "clk_get_rate" [drivers/net/can/c_can/c_can_pci.ko] undefined!
ERROR: "clk_get" [drivers/net/can/c_can/c_can_pci.ko] undefined!
ERROR: "clk_put" [drivers/net/can/c_can/c_can_pci.ko] undefined!

Caused by commit 5b92da0443c2 ("c_can_pci: generic module for C_CAN/D_CAN
on PCI").  Maybe a missing Kconfig dependency/select?

I have used the net-next tree from next-20120619 for today.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

^ permalink raw reply

* Re: linux-next: build failure after merge of the net-next tree
From: David Miller @ 2012-06-20  4:24 UTC (permalink / raw)
  To: sfr
  Cc: netdev, linux-next, linux-kernel, federico.vaga,
	giancarlo.asnaghi, wg, bhupesh.sharma, mkl
In-Reply-To: <20120620133348.a9ff31a72ffd11818f574fe3@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 20 Jun 2012 13:33:48 +1000

> After merging the net-next tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> ERROR: "clk_get_rate" [drivers/net/can/c_can/c_can_pci.ko] undefined!
> ERROR: "clk_get" [drivers/net/can/c_can/c_can_pci.ko] undefined!
> ERROR: "clk_put" [drivers/net/can/c_can/c_can_pci.ko] undefined!
> 
> Caused by commit 5b92da0443c2 ("c_can_pci: generic module for C_CAN/D_CAN
> on PCI").  Maybe a missing Kconfig dependency/select?
> 
> I have used the net-next tree from next-20120619 for today.

Known problem:

http://marc.info/?l=linux-netdev&m=134014347620836&w=2

^ permalink raw reply

* RE: linux-next: build failure after merge of the net-next tree
From: Bhupesh SHARMA @ 2012-06-20  4:27 UTC (permalink / raw)
  To: Stephen Rothwell, David Miller, netdev@vger.kernel.org
  Cc: linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
	Federico Vaga, Giancarlo ASNAGHI, Wolfgang Grandegger,
	Marc Kleine-Budde
In-Reply-To: <20120620133348.a9ff31a72ffd11818f574fe3@canb.auug.org.au>

Hi,

> -----Original Message-----
> From: Stephen Rothwell [mailto:sfr@canb.auug.org.au]
> Sent: Wednesday, June 20, 2012 9:04 AM
> To: David Miller; netdev@vger.kernel.org
> Cc: linux-next@vger.kernel.org; linux-kernel@vger.kernel.org; Federico
> Vaga; Giancarlo ASNAGHI; Wolfgang Grandegger; Bhupesh SHARMA; Marc
> Kleine-Budde
> Subject: linux-next: build failure after merge of the net-next tree
> 
> Hi all,
> 
> After merging the net-next tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> ERROR: "clk_get_rate" [drivers/net/can/c_can/c_can_pci.ko] undefined!
> ERROR: "clk_get" [drivers/net/can/c_can/c_can_pci.ko] undefined!
> ERROR: "clk_put" [drivers/net/can/c_can/c_can_pci.ko] undefined!
> 
> Caused by commit 5b92da0443c2 ("c_can_pci: generic module for
> C_CAN/D_CAN on PCI").  Maybe a missing Kconfig dependency/select?
> 
> I have used the net-next tree from next-20120619 for today.
> --

clk_get/clk_put* variants are usually used by ARM platforms.
Protecting their calls under macro 'CONFIG_HAVE_CLK' should solve the problem.

See [1] for how it is done in c_can_platform.c

Could you possibly add these checks and send a patch for the same?
It should be fairly simple.

[1] http://lxr.linux.no/linux+v3.4.3/drivers/net/can/c_can/c_can_platform.c#L68

Regards,
Bhupesh

^ permalink raw reply

* Re: linux-next: build failure after merge of the net-next tree
From: David Miller @ 2012-06-20  4:37 UTC (permalink / raw)
  To: bhupesh.sharma
  Cc: sfr, netdev, linux-next, linux-kernel, federico.vaga,
	giancarlo.asnaghi, wg, mkl
In-Reply-To: <D5ECB3C7A6F99444980976A8C6D896384FAA275CA8@EAPEX1MAIL1.st.com>

From: Bhupesh SHARMA <bhupesh.sharma@st.com>
Date: Wed, 20 Jun 2012 12:27:11 +0800

> clk_get/clk_put* variants are usually used by ARM platforms.
> Protecting their calls under macro 'CONFIG_HAVE_CLK' should solve the problem.

No, we don't pepper foo.c files with crappy ifdefs.

^ permalink raw reply

* RE: linux-next: build failure after merge of the net-next tree
From: Bhupesh SHARMA @ 2012-06-20  4:45 UTC (permalink / raw)
  To: David Miller
  Cc: sfr@canb.auug.org.au, netdev@vger.kernel.org,
	linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
	federico.vaga@gmail.com, Giancarlo ASNAGHI, wg@grandegger.com,
	mkl@pengutronix.de
In-Reply-To: <20120619.213703.513466167861199217.davem@davemloft.net>

Hi David,

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, June 20, 2012 10:07 AM
> To: Bhupesh SHARMA
> Cc: sfr@canb.auug.org.au; netdev@vger.kernel.org; linux-
> next@vger.kernel.org; linux-kernel@vger.kernel.org;
> federico.vaga@gmail.com; Giancarlo ASNAGHI; wg@grandegger.com;
> mkl@pengutronix.de
> Subject: Re: linux-next: build failure after merge of the net-next tree
> 
> From: Bhupesh SHARMA <bhupesh.sharma@st.com>
> Date: Wed, 20 Jun 2012 12:27:11 +0800
> 
> > clk_get/clk_put* variants are usually used by ARM platforms.
> > Protecting their calls under macro 'CONFIG_HAVE_CLK' should solve the
> problem.
> 
> No, we don't pepper foo.c files with crappy ifdefs.

So, would adding a check in Kconfig for HAVE_CLK be a proper solution?
But that will limit the compilation of this driver to ARM-based platforms only.

One may need to support this driver on x86-like platforms also.

Regards,
Bhupesh

^ permalink raw reply

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: David Miller @ 2012-06-20  4:46 UTC (permalink / raw)
  To: shemminger; +Cc: netdev
In-Reply-To: <20120619193549.13bcffa7@s6510.linuxnetplumber.net>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 19 Jun 2012 19:35:49 -0700

> Any benchmark numbers?

Measuring the path from ip_rcv_finish() to where we lock the socket in
tcp_v4_rcv(), on a SPARC-T3, with a pre-warmed routing cache:

Both sk and RT lookup:	~4200 cycles
Optimized early demux:	~2800 cycles

These numbers can be decreased further, because since we're already
looking at the TCP header we can pre-cook the TCP control block in the
SKB and skip much of the stuff that tcp_v4_rcv() does since we've done
it already in the early demux code.

> I think the number of ref count operations per packet is going to be
> the next line in the sand.

There is only one, for the socket.  We haven't taken a reference on the
route for years.
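
The flow in the patch description (find the established socket first,
reuse its cached input route if it is still valid, otherwise fall back
to the normal route lookup, and repair the cache later under the socket
lock) can be sketched as a small userspace model.  This is only an
illustration: every model_* name is hypothetical, and the single
"obsolete" flag stands in for the real dst->obsolete plus
dst->ops->check() test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for dst_entry, sock and sk_buff. */
struct model_dst {
	int obsolete;			/* non-zero once the route was invalidated */
};

struct model_sock {
	int established;		/* 1 while the connection is established */
	struct model_dst *rx_dst;	/* cached input route, may be NULL */
};

struct model_skb {
	struct model_sock *sk;
	struct model_dst *dst;
};

/*
 * Early demux, run without the socket lock.  Returns 0 when the cached
 * route was attached to the packet, -1 when the caller still has to do
 * the full route lookup.  The cache itself is never modified here.
 */
static int model_early_demux(struct model_skb *skb, struct model_sock *sk)
{
	if (sk == NULL)
		return -1;

	skb->sk = sk;	/* the later socket lookup can reuse this */

	if (!sk->established || sk->rx_dst == NULL || sk->rx_dst->obsolete)
		return -1;

	skb->dst = sk->rx_dst;
	return 0;
}

/*
 * Deferred fixup, meant to run with the socket locked: drop an
 * invalidated cache entry and refill it from the route that the
 * fallback lookup produced for the current packet.
 */
static void model_fixup_rx_dst(struct model_sock *sk, struct model_dst *fresh)
{
	assert(fresh != NULL);

	if (sk->rx_dst && sk->rx_dst->obsolete)
		sk->rx_dst = NULL;
	if (sk->rx_dst == NULL)
		sk->rx_dst = fresh;
}
```

The model leaves out reference counting on both the socket and the
route; it only shows which step is allowed to touch the cache.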

^ permalink raw reply

* Re: linux-next: build failure after merge of the net-next tree
From: David Miller @ 2012-06-20  4:47 UTC (permalink / raw)
  To: bhupesh.sharma
  Cc: sfr, netdev, linux-next, linux-kernel, federico.vaga,
	giancarlo.asnaghi, wg, mkl
In-Reply-To: <D5ECB3C7A6F99444980976A8C6D896384FAA275CBE@EAPEX1MAIL1.st.com>

From: Bhupesh SHARMA <bhupesh.sharma@st.com>
Date: Wed, 20 Jun 2012 12:45:46 +0800

> So, would adding a check in Kconfig for HAVE_CLK be a proper
> solution?  But that will limit the compilation of this driver to
> ARM-based platforms only.
> 
> One may need to support this driver on x86-like platforms also.

Then x86 will need to provide clock operations, or there needs to
be dummy ones for such platforms.

This isn't rocket science.
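
The "dummy ones" approach can be sketched roughly as follows: a header
supplies real prototypes where a clock framework exists and no-op
inline stubs elsewhere, so the driver source carries no ifdefs.  This
is a userspace model under stated assumptions, not the kernel's clk
API: MODEL_HAVE_CLK and all model_* names are hypothetical, and the
stub returning NULL is a simplification.

```c
#include <assert.h>
#include <stddef.h>

struct model_clk;				/* opaque clock handle */
struct model_device { const char *name; };

#ifdef MODEL_HAVE_CLK
/* Platform with a clock framework: real implementations live elsewhere. */
struct model_clk *model_clk_get(struct model_device *dev, const char *id);
void model_clk_put(struct model_clk *clk);
unsigned long model_clk_get_rate(struct model_clk *clk);
#else
/* Platform without one: harmless stubs, all in the header. */
static inline struct model_clk *model_clk_get(struct model_device *dev,
					      const char *id)
{
	(void)dev;
	(void)id;
	return NULL;
}

static inline void model_clk_put(struct model_clk *clk)
{
	(void)clk;
}

static inline unsigned long model_clk_get_rate(struct model_clk *clk)
{
	(void)clk;
	return 0;	/* "rate unknown" */
}
#endif

/* Driver code: identical on every platform, no #ifdef needed. */
static unsigned long model_probe_freq(struct model_device *dev,
				      unsigned long fallback_hz)
{
	struct model_clk *clk;
	unsigned long rate;

	assert(dev != NULL);

	clk = model_clk_get(dev, NULL);
	rate = model_clk_get_rate(clk);
	model_clk_put(clk);

	/* With the stubs the rate is 0, so use the board-supplied value. */
	return rate ? rate : fallback_hz;
}
```

With the stubs in place the conditional lives in one header instead of
in each driver that wants a clock rate.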

^ permalink raw reply

* [PATCH v3] ipv4: Early TCP socket demux.
From: David Miller @ 2012-06-20  4:49 UTC (permalink / raw)
  To: netdev


Input packet processing for local sockets involves two major demuxes.
One for the route and one for the socket.

But we can optimize this down to one demux for certain kinds of local
sockets.

Currently we only do this for established TCP sockets, but it could
at least in theory be expanded to other kinds of connections.

If a TCP socket is established then its identity is fully specified.

This means that whatever input route was used during the three-way
handshake must work equally well for the rest of the connection since
the keys will not change.

Once we move to established state, we cache the receive packet's input
route to use later.

Like the existing cached route in sk->sk_dst_cache used for output
packets, we have to check for route invalidations using dst->obsolete
and dst->ops->check().

Early demux occurs outside of a socket locked section, so when a route
invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
actually inside of established state packet processing and thus have
the socket locked.

Signed-off-by: David S. Miller <davem@davemloft.net>
---

This is the final version I pushed into net-next, it's just respun
with the MAX_INET_PROTOS hash masking removed.

 include/net/inet_hashtables.h |    4 ++--
 include/net/protocol.h        |    1 +
 include/net/sock.h            |    2 ++
 include/net/tcp.h             |    1 +
 net/core/sock.c               |    5 +++++
 net/ipv4/af_inet.c            |   18 +++++++++-------
 net/ipv4/ip_input.c           |   39 ++++++++++++++++++++++------------
 net/ipv4/tcp_input.c          |   16 +++++++++++++-
 net/ipv4/tcp_ipv4.c           |   46 +++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_minisocks.c      |    2 ++
 10 files changed, 110 insertions(+), 24 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 808fc5f..54be028 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -379,10 +379,10 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
 					     const __be16 sport,
 					     const __be16 dport)
 {
-	struct sock *sk;
+	struct sock *sk = skb_steal_sock(skb);
 	const struct iphdr *iph = ip_hdr(skb);
 
-	if (unlikely(sk = skb_steal_sock(skb)))
+	if (sk)
 		return sk;
 	else
 		return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo,
diff --git a/include/net/protocol.h b/include/net/protocol.h
index a1b1b53..967b926 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -37,6 +37,7 @@
 
 /* This is used to register protocols. */
 struct net_protocol {
+	int			(*early_demux)(struct sk_buff *skb);
 	int			(*handler)(struct sk_buff *skb);
 	void			(*err_handler)(struct sk_buff *skb, u32 info);
 	int			(*gso_send_check)(struct sk_buff *skb);
diff --git a/include/net/sock.h b/include/net/sock.h
index 4a45216..87b424a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -319,6 +319,7 @@ struct sock {
 	unsigned long 		sk_flags;
 	struct dst_entry	*sk_dst_cache;
 	spinlock_t		sk_dst_lock;
+	struct dst_entry	*sk_rx_dst;
 	atomic_t		sk_wmem_alloc;
 	atomic_t		sk_omem_alloc;
 	int			sk_sndbuf;
@@ -1426,6 +1427,7 @@ extern struct sk_buff		*sock_rmalloc(struct sock *sk,
 					      gfp_t priority);
 extern void			sock_wfree(struct sk_buff *skb);
 extern void			sock_rfree(struct sk_buff *skb);
+extern void			sock_edemux(struct sk_buff *skb);
 
 extern int			sock_setsockopt(struct socket *sock, int level,
 						int op, char __user *optval,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9332f34..6660ffc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -325,6 +325,7 @@ extern void tcp_v4_err(struct sk_buff *skb, u32);
 
 extern void tcp_shutdown (struct sock *sk, int how);
 
+extern int tcp_v4_early_demux(struct sk_buff *skb);
 extern int tcp_v4_rcv(struct sk_buff *skb);
 
 extern struct inet_peer *tcp_v4_get_peer(struct sock *sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index 9e5b71f..929bdcc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1465,6 +1465,11 @@ void sock_rfree(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_rfree);
 
+void sock_edemux(struct sk_buff *skb)
+{
+	sock_put(skb->sk);
+}
+EXPORT_SYMBOL(sock_edemux);
 
 int sock_i_uid(struct sock *sk)
 {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 85a3b17..07a02f6 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -157,6 +157,7 @@ void inet_sock_destruct(struct sock *sk)
 
 	kfree(rcu_dereference_protected(inet->inet_opt, 1));
 	dst_release(rcu_dereference_check(sk->sk_dst_cache, 1));
+	dst_release(sk->sk_rx_dst);
 	sk_refcnt_debug_dec(sk);
 }
 EXPORT_SYMBOL(inet_sock_destruct);
@@ -1518,14 +1519,15 @@ static const struct net_protocol igmp_protocol = {
 #endif
 
 static const struct net_protocol tcp_protocol = {
-	.handler =	tcp_v4_rcv,
-	.err_handler =	tcp_v4_err,
-	.gso_send_check = tcp_v4_gso_send_check,
-	.gso_segment =	tcp_tso_segment,
-	.gro_receive =	tcp4_gro_receive,
-	.gro_complete =	tcp4_gro_complete,
-	.no_policy =	1,
-	.netns_ok =	1,
+	.early_demux	=	tcp_v4_early_demux,
+	.handler	=	tcp_v4_rcv,
+	.err_handler	=	tcp_v4_err,
+	.gso_send_check	=	tcp_v4_gso_send_check,
+	.gso_segment	=	tcp_tso_segment,
+	.gro_receive	=	tcp4_gro_receive,
+	.gro_complete	=	tcp4_gro_complete,
+	.no_policy	=	1,
+	.netns_ok	=	1,
 };
 
 static const struct net_protocol udp_protocol = {
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index c4fe1d2..93b092c 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -323,19 +323,32 @@ static int ip_rcv_finish(struct sk_buff *skb)
 	 *	how the packet travels inside Linux networking.
 	 */
 	if (skb_dst(skb) == NULL) {
-		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
-					       iph->tos, skb->dev);
-		if (unlikely(err)) {
-			if (err == -EHOSTUNREACH)
-				IP_INC_STATS_BH(dev_net(skb->dev),
-						IPSTATS_MIB_INADDRERRORS);
-			else if (err == -ENETUNREACH)
-				IP_INC_STATS_BH(dev_net(skb->dev),
-						IPSTATS_MIB_INNOROUTES);
-			else if (err == -EXDEV)
-				NET_INC_STATS_BH(dev_net(skb->dev),
-						 LINUX_MIB_IPRPFILTER);
-			goto drop;
+		const struct net_protocol *ipprot;
+		int protocol = iph->protocol;
+		int err;
+
+		rcu_read_lock();
+		ipprot = rcu_dereference(inet_protos[protocol]);
+		err = -ENOENT;
+		if (ipprot && ipprot->early_demux)
+			err = ipprot->early_demux(skb);
+		rcu_read_unlock();
+
+		if (err) {
+			err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+						   iph->tos, skb->dev);
+			if (unlikely(err)) {
+				if (err == -EHOSTUNREACH)
+					IP_INC_STATS_BH(dev_net(skb->dev),
+							IPSTATS_MIB_INADDRERRORS);
+				else if (err == -ENETUNREACH)
+					IP_INC_STATS_BH(dev_net(skb->dev),
+							IPSTATS_MIB_INNOROUTES);
+				else if (err == -EXDEV)
+					NET_INC_STATS_BH(dev_net(skb->dev),
+							 LINUX_MIB_IPRPFILTER);
+				goto drop;
+			}
 		}
 	}
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b224eb8..8416f8a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5518,6 +5518,18 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 	struct tcp_sock *tp = tcp_sk(sk);
 	int res;
 
+	if (sk->sk_rx_dst) {
+		struct dst_entry *dst = sk->sk_rx_dst;
+		if (unlikely(dst->obsolete)) {
+			if (dst->ops->check(dst, 0) == NULL) {
+				dst_release(dst);
+				sk->sk_rx_dst = NULL;
+			}
+		}
+	}
+	if (unlikely(sk->sk_rx_dst == NULL))
+		sk->sk_rx_dst = dst_clone(skb_dst(skb));
+
 	/*
 	 *	Header prediction.
 	 *	The code loosely follows the one in the famous
@@ -5729,8 +5741,10 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 
 	tcp_set_state(sk, TCP_ESTABLISHED);
 
-	if (skb != NULL)
+	if (skb != NULL) {
+		sk->sk_rx_dst = dst_clone(skb_dst(skb));
 		security_inet_conn_established(sk, skb);
+	}
 
 	/* Make sure socket is routed, for correct metrics.  */
 	icsk->icsk_af_ops->rebuild_header(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fda2ca1..13857df 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1671,6 +1671,52 @@ csum_err:
 }
 EXPORT_SYMBOL(tcp_v4_do_rcv);
 
+int tcp_v4_early_demux(struct sk_buff *skb)
+{
+	struct net *net = dev_net(skb->dev);
+	const struct iphdr *iph;
+	const struct tcphdr *th;
+	struct sock *sk;
+	int err;
+
+	err = -ENOENT;
+	if (skb->pkt_type != PACKET_HOST)
+		goto out_err;
+
+	if (!pskb_may_pull(skb, ip_hdrlen(skb) + sizeof(struct tcphdr)))
+		goto out_err;
+
+	iph = ip_hdr(skb);
+	th = (struct tcphdr *) ((char *)iph + ip_hdrlen(skb));
+
+	if (th->doff < sizeof(struct tcphdr) / 4)
+		goto out_err;
+
+	if (!pskb_may_pull(skb, ip_hdrlen(skb) + th->doff * 4))
+		goto out_err;
+
+	sk = __inet_lookup_established(net, &tcp_hashinfo,
+				       iph->saddr, th->source,
+				       iph->daddr, th->dest,
+				       skb->dev->ifindex);
+	if (sk) {
+		skb->sk = sk;
+		skb->destructor = sock_edemux;
+		if (sk->sk_state != TCP_TIME_WAIT) {
+			struct dst_entry *dst = sk->sk_rx_dst;
+			if (dst)
+				dst = dst_check(dst, 0);
+			if (dst) {
+				skb_dst_set_noref(skb, dst);
+				err = 0;
+			}
+		}
+	}
+
+out_err:
+	return err;
+}
+
 /*
  *	From tcp_input.c
  */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index cb01531..72b7c63 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -445,6 +445,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		struct tcp_sock *oldtp = tcp_sk(sk);
 		struct tcp_cookie_values *oldcvp = oldtp->cookie_values;
 
+		newsk->sk_rx_dst = dst_clone(skb_dst(skb));
+
 		/* TCP Cookie Transactions require space for the cookie pair,
 		 * as it differs for each connection.  There is no need to
 		 * copy any s_data_payload stored at the original socket.
-- 
1.7.10

^ permalink raw reply related

* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
From: Eric Dumazet @ 2012-06-20  5:36 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher
In-Reply-To: <20120620004306.17814.58369.stgit@gitlad.jf.intel.com>

On Tue, 2012-06-19 at 17:43 -0700, Alexander Duyck wrote:
> This patch is meant to help improve system performance when
> netdev_alloc_frag is used in scenarios in which buffers are short lived.
> This is accomplished by allowing the page offset to be reset in the event
> that the page count is 1.  I also reordered the direction in which we give
> out sections of the page so that we start at the end of the page and end at
> the start.  The main motivation being that I preferred to have offset
> represent the amount of page remaining to be used.
> 
> My primary test case was using ixgbe in combination with TCP.  With this
> patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
> thread of netperf receiving a TCP stream via ixgbe.
> 
> I also tested several scenarios in which the page reuse would not be
> possible such as UDP flows and routing.  In both of these scenarios I saw
> no noticeable performance degradation compared to the kernel without this
> patch.
> 
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
>  net/core/skbuff.c |   15 +++++++++++----
>  1 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5b21522..eb3853c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
>  	if (unlikely(!nc->page)) {
>  refill:
>  		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> -		nc->offset = 0;
>  	}
>  	if (likely(nc->page)) {
> -		if (nc->offset + fragsz > PAGE_SIZE) {
> +		unsigned int offset = PAGE_SIZE;
> +
> +		if (page_count(nc->page) != 1)
> +			offset = nc->offset;
> +
> +		if (offset < fragsz) {
>  			put_page(nc->page);
>  			goto refill;
>  		}
> -		data = page_address(nc->page) + nc->offset;
> -		nc->offset += fragsz;
> +
> +		offset -= fragsz;
> +		nc->offset = offset;
> +
> +		data = page_address(nc->page) + offset;
>  		get_page(nc->page);
>  	}
>  	local_irq_restore(flags);
> 

I tested this idea a month ago and did not get convincing results, because
the branch was taken half of the time.

The cases where the page can be reused are probably specific to ixgbe,
because it uses a different allocator for the frags themselves.
netdev_alloc_frag() is only used to allocate the skb head.

For typical NICs, we allocate frags to populate the RX ring _way_ before
a packet is received by the NIC.

Then, I played with using order-2 pages instead of order-0 ones if
PAGE_SIZE < 8192.

No clear win either, but you might try this too.
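
For reference, the reuse scheme under discussion can be modelled in
userspace roughly like this.  It is a sketch, not the kernel code: the
model_* names, the explicit count field and the fixed 4096-byte page
are assumptions made for illustration.  Fragments are handed out from
the end of the page towards the start, and the offset is reset whenever
the cache holds the only remaining reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u

struct model_page {
	unsigned int count;	/* 1 for the cache + 1 per live fragment */
	unsigned char data[MODEL_PAGE_SIZE];
};

struct model_frag_cache {
	struct model_page *page;
	unsigned int offset;	/* bytes of the page not yet handed out */
};

static struct model_page *model_page_alloc(void)
{
	struct model_page *p = calloc(1, sizeof(*p));

	if (p)
		p->count = 1;	/* reference held by the cache */
	return p;
}

/* Drop one reference; also used to "free" a fragment of the page. */
static void model_put_page(struct model_page *p)
{
	assert(p->count > 0);
	if (--p->count == 0)
		free(p);
}

static void *model_alloc_frag(struct model_frag_cache *nc, unsigned int fragsz)
{
	unsigned int offset;

	if (fragsz > MODEL_PAGE_SIZE)
		return NULL;	/* could never fit, do not loop on refill */

	for (;;) {
		if (!nc->page) {
			nc->page = model_page_alloc();
			if (!nc->page)
				return NULL;
		}
		/* Sole owner: every earlier fragment is gone, reuse it all. */
		offset = (nc->page->count == 1) ? MODEL_PAGE_SIZE : nc->offset;
		if (offset >= fragsz)
			break;
		model_put_page(nc->page);	/* page exhausted, refill */
		nc->page = NULL;
	}

	offset -= fragsz;
	nc->offset = offset;
	nc->page->count++;	/* reference owned by the new fragment */
	return nc->page->data + offset;
}
```

The reset only helps when fragments are released before the page runs
out, which is the short-lived-buffer case the patch description refers
to.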

^ permalink raw reply

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: Eric Dumazet @ 2012-06-20  5:49 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev
In-Reply-To: <20120619.214657.1345825814176201681.davem@davemloft.net>

On Tue, 2012-06-19 at 21:46 -0700, David Miller wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Tue, 19 Jun 2012 19:35:49 -0700
> 
> > Any benchmark numbers?
> 
> Measuring the path from ip_rcv_finish() to where we lock the socket in
> tcp_v4_rcv(), on a SPARC-T3, with a pre-warmed routing cache:
> 
> Both sk and RT lookup:	~4200 cycles
> Optimized early demux:	~2800 cycles
> 
> These numbers can be decreased further, because since we're already
> looking at the TCP header we can pre-cook the TCP control block in the
> SKB and skip much of the stuff that tcp_v4_rcv() does since we've done
> it already in the early demux code.
> 
> > I think the number of ref count operations per packet is going to be
> > the next line in the sand.
> 
> There is only one, for the socket.  We haven't taken a reference on the
> route for years.

Actually, this patch probably makes things slower for:

1) routers :

Each incoming tcp packet has to perform lookups 
(ESTABLISHED and TIMEWAIT), adding one cache miss

2) small lived tcp sessions

   input dst is now dirtied because of the additional
dst_clone()/dst_release()


1) can be solved using a knob as suggested by Changli, possibly using a
JUMP_LABEL shadowing ip_forward ?

^ permalink raw reply

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: Eric Dumazet @ 2012-06-20  5:51 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev
In-Reply-To: <1340171359.4604.794.camel@edumazet-glaptop>

On Wed, 2012-06-20 at 07:49 +0200, Eric Dumazet wrote:

> 2) small lived tcp sessions
> 
>    input dst is now dirtied because of the additional
> dst_clone()/dst_release()

Not really a concern, because we dirty the cache line anyway:

dst_use_noref()
{
	dst->__use++;
	dst->lastuse = time;
}

^ permalink raw reply

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: Eric Dumazet @ 2012-06-20  5:59 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev
In-Reply-To: <20120619.214657.1345825814176201681.davem@davemloft.net>

On Tue, 2012-06-19 at 21:46 -0700, David Miller wrote:

> These numbers can be decreased further, because since we're already
> looking at the TCP header we can pre-cook the TCP control block in the
> SKB and skip much of the stuff that tcp_v4_rcv() does since we've done
> it already in the early demux code.

It could be done at the GRO level, which would remove yet another demux.

As routers probably have no use for GRO, there is no need for an additional knob.

^ permalink raw reply

* [net-next.git 4/4 (v7)] phy: add the EEE support and the way to access to the MMD registers.
From: Giuseppe CAVALLARO @ 2012-06-20  6:12 UTC (permalink / raw)
  To: netdev
  Cc: eric.dumazet, bhutchings, rayagond, davem, yuvalmin,
	Giuseppe Cavallaro
In-Reply-To: <1340172774-27443-1-git-send-email-peppe.cavallaro@st.com>

This patch adds support for Energy-Efficient Ethernet (EEE)
to the Physical Abstraction Layer.
To support EEE we have to access the MMD registers 3.20 and
7.60/61, so two new functions have been added to read/write the MMD
registers (clause 45).
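
As an illustration of the indirect access sequence (select the device in
register 13, latch the register address through register 14, switch
register 13 to the data function, then read or write register 14), here
is a small user-space sketch against a mock PHY. The register numbers and
MII_MMD_CTRL_NOINCR come from this patch; struct mock_phy, mock_read(),
mock_write(), mmd_select(), mmd_read(), mmd_write() and the
MII_MMD_CTRL_FUNC_MASK name are hypothetical and exist only for this
example.

```c
#include <assert.h>
#include <stdint.h>

#define MII_MMD_CTRL		0x0d	/* MMD access control register */
#define MII_MMD_DATA		0x0e	/* MMD access address/data register */
#define MII_MMD_CTRL_DEVAD_MASK	0x1f	/* DEVAD field */
#define MII_MMD_CTRL_NOINCR	0x4000	/* data function, no post increment */
#define MII_MMD_CTRL_FUNC_MASK	0xc000	/* hypothetical name: function field */
#define MOCK_MMD_REGS		64	/* mock only models registers 0..63 */

/* Hypothetical mock of the PHY side of the clause 22 MMD window. */
struct mock_phy {
	uint16_t ctrl;				/* last write to register 13 */
	uint16_t addr;				/* latched MMD register address */
	uint16_t mmd[32][MOCK_MMD_REGS];	/* backing store: [devad][reg] */
};

static void mock_write(struct mock_phy *phy, int reg, uint16_t val)
{
	unsigned int devad = phy->ctrl & MII_MMD_CTRL_DEVAD_MASK;

	if (reg == MII_MMD_CTRL)
		phy->ctrl = val;
	else if (reg == MII_MMD_DATA && !(phy->ctrl & MII_MMD_CTRL_FUNC_MASK))
		phy->addr = val;	/* address function selected */
	else if (reg == MII_MMD_DATA)
		phy->mmd[devad][phy->addr % MOCK_MMD_REGS] = val;
}

static uint16_t mock_read(struct mock_phy *phy, int reg)
{
	unsigned int devad = phy->ctrl & MII_MMD_CTRL_DEVAD_MASK;

	if (reg != MII_MMD_DATA)
		return phy->ctrl;
	if (!(phy->ctrl & MII_MMD_CTRL_FUNC_MASK))
		return phy->addr;	/* address function selected */
	return phy->mmd[devad][phy->addr % MOCK_MMD_REGS];
}

/* Steps 1-3: select devad, latch the address, switch to data function. */
static void mmd_select(struct mock_phy *phy, int prtad, int devad)
{
	assert(devad >= 0 && devad < 32);
	assert(prtad >= 0 && prtad < MOCK_MMD_REGS);

	mock_write(phy, MII_MMD_CTRL, (uint16_t)devad);
	mock_write(phy, MII_MMD_DATA, (uint16_t)prtad);
	mock_write(phy, MII_MMD_CTRL, (uint16_t)(devad | MII_MMD_CTRL_NOINCR));
}

/* Step 4 (read): fetch the selected MMD register through register 14. */
static uint16_t mmd_read(struct mock_phy *phy, int prtad, int devad)
{
	mmd_select(phy, prtad, devad);
	return mock_read(phy, MII_MMD_DATA);
}

/* Step 4 (write): store into the selected MMD register via register 14. */
static void mmd_write(struct mock_phy *phy, int prtad, int devad, uint16_t data)
{
	mmd_select(phy, prtad, devad);
	mock_write(phy, MII_MMD_DATA, data);
}
```

For example, mmd_write(&phy, 60, 7, 0x0006) followed by
mmd_read(&phy, 60, 7) round-trips a value through the mock's 7.60
register.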

An Ethernet driver (I tested stmmac) can invoke phy_init_eee to
check whether EEE is supported by the PHYs; it can also set the clock
stop enable bit in the 3.0 register.
phy_get_eee_err can be used to report the number of times
the PHY failed to complete its normal wake sequence.
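
The support check can be pictured with the following user-space sketch.
It assumes, as the patch states, that registers 3.20, 7.60 and 7.61 share
the same bit masks, and it reduces the negotiated link mode to a plain
speed in Mb/s; eee_bit_for_speed() and eee_usable() are hypothetical
helpers, not functions added by this patch, and only the twisted-pair
bits are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit masks shared by registers 3.20, 7.60 and 7.61 (from this patch). */
#define MDIO_EEE_100TX	0x0002
#define MDIO_EEE_1000T	0x0004
#define MDIO_EEE_10GT	0x0008

/* Hypothetical helper: EEE bit matching a negotiated speed in Mb/s. */
static uint16_t eee_bit_for_speed(int speed)
{
	switch (speed) {
	case 100:
		return MDIO_EEE_100TX;
	case 1000:
		return MDIO_EEE_1000T;
	case 10000:
		return MDIO_EEE_10GT;
	default:
		return 0;
	}
}

/*
 * Hypothetical helper: EEE is usable when the link is full duplex, the
 * PHY reports some EEE capability, and both ends advertise EEE for the
 * link mode that was actually negotiated.
 */
static bool eee_usable(uint16_t cap, uint16_t adv, uint16_t lp,
		       int speed, bool full_duplex)
{
	uint16_t bit;

	assert(speed > 0);	/* caller passes the negotiated speed */

	bit = eee_bit_for_speed(speed);
	if (!full_duplex || !cap || !bit)
		return false;

	return (adv & lp & bit) != 0;
}
```

With a link partner that only advertises 1000T EEE (0x0004),
eee_usable(0x0006, 0x0006, 0x0004, 1000, true) is true, while the same
call with a negotiated speed of 100 is false.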

Finally, this patch also adds EEE ethtool support, implementing:
 o phy_ethtool_set_eee
 o phy_ethtool_get_eee

v1: initial patch
v2: fixed some errors especially on naming convention
v3: renamed the mmd read/write functions again thanks to Ben's feedback
v4: moved file to phy.c and added the ethtool support.
v5: fixed phy_adv_to_eee, phy_eee_to_supported, phy_eee_to_adv return
    values according to ethtool API (thanks to Ben's feedback).
    Renamed some macros to avoid too long names.
v6: fixed kernel-doc comments to be properly parsed.
    Fixed the phy_init_eee function: we need to check which link mode
    was autonegotiated and then the corresponding bits in 7.60 and 7.61
    registers.
v7: reviewed the way to get the negotiated settings.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 drivers/net/phy/phy.c |  281 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mdio.h  |   21 +++-
 include/linux/mii.h   |    9 ++
 include/linux/phy.h   |    5 +
 4 files changed, 312 insertions(+), 4 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 2e1c237..e13a30d 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -35,6 +35,7 @@
 #include <linux/phy.h>
 #include <linux/timer.h>
 #include <linux/workqueue.h>
+#include <linux/mdio.h>
 
 #include <linux/atomic.h>
 #include <asm/io.h>
@@ -967,3 +968,283 @@ void phy_state_machine(struct work_struct *work)
 
 	schedule_delayed_work(&phydev->state_queue, PHY_STATE_TIME * HZ);
 }
+
+static inline void mmd_phy_indirect(struct mii_bus *bus, int prtad, int devad,
+				    int addr)
+{
+	/* Write the desired MMD Devad */
+	bus->write(bus, addr, MII_MMD_CTRL, devad);
+
+	/* Write the desired MMD register address */
+	bus->write(bus, addr, MII_MMD_DATA, prtad);
+
+	/* Select the Function : DATA with no post increment */
+	bus->write(bus, addr, MII_MMD_CTRL, (devad | MII_MMD_CTRL_NOINCR));
+}
+
+/**
+ * phy_read_mmd_indirect - reads data from the MMD registers
+ * @bus: the target MII bus
+ * @prtad: MMD Address
+ * @devad: MMD DEVAD
+ * @addr: PHY address on the MII bus
+ *
+ * Description: it reads data from the MMD registers (using clause 22
+ * registers to access clause 45 ones) of the specified phy address.
+ * To read these registers we have to:
+ * 1) Write reg 13 // DEVAD
+ * 2) Write reg 14 // MMD Address
+ * 3) Write reg 13 // MMD Data Command for MMD DEVAD
+ * 4) Read  reg 14 // Read MMD data
+ */
+static int phy_read_mmd_indirect(struct mii_bus *bus, int prtad, int devad,
+				 int addr)
+{
+	int ret;
+
+	mmd_phy_indirect(bus, prtad, devad, addr);
+
+	/* Read the content of the MMD's selected register */
+	ret = bus->read(bus, addr, MII_MMD_DATA);
+
+	return ret;
+}
+
+/**
+ * phy_write_mmd_indirect - writes data to the MMD registers
+ * @bus: the target MII bus
+ * @prtad: MMD Address
+ * @devad: MMD DEVAD
+ * @addr: PHY address on the MII bus
+ * @data: data to write in the MMD register
+ *
+ * Description: Write data to the MMD registers of the specified
+ * phy address.
+ * To write these registers we have to:
+ * 1) Write reg 13 // DEVAD
+ * 2) Write reg 14 // MMD Address
+ * 3) Write reg 13 // MMD Data Command for MMD DEVAD
+ * 4) Write reg 14 // Write MMD data
+ */
+static void phy_write_mmd_indirect(struct mii_bus *bus, int prtad, int devad,
+				   int addr, u32 data)
+{
+	mmd_phy_indirect(bus, prtad, devad, addr);
+
+	/* Write the data into MMD's selected register */
+	bus->write(bus, addr, MII_MMD_DATA, data);
+}
+
+static u32 phy_eee_to_adv(u16 eee_adv)
+{
+	u32 adv = 0;
+
+	if (eee_adv & MDIO_EEE_100TX)
+		adv |= ADVERTISED_100baseT_Full;
+	if (eee_adv & MDIO_EEE_1000T)
+		adv |= ADVERTISED_1000baseT_Full;
+	if (eee_adv & MDIO_EEE_10GT)
+		adv |= ADVERTISED_10000baseT_Full;
+	if (eee_adv & MDIO_EEE_1000KX)
+		adv |= ADVERTISED_1000baseKX_Full;
+	if (eee_adv & MDIO_EEE_10GKX4)
+		adv |= ADVERTISED_10000baseKX4_Full;
+	if (eee_adv & MDIO_EEE_10GKR)
+		adv |= ADVERTISED_10000baseKR_Full;
+
+	return adv;
+}
+
+static u32 phy_eee_to_supported(u16 eee_caported)
+{
+	u32 supported = 0;
+
+	if (eee_caported & MDIO_EEE_100TX)
+		supported |= SUPPORTED_100baseT_Full;
+	if (eee_caported & MDIO_EEE_1000T)
+		supported |= SUPPORTED_1000baseT_Full;
+	if (eee_caported & MDIO_EEE_10GT)
+		supported |= SUPPORTED_10000baseT_Full;
+	if (eee_caported & MDIO_EEE_1000KX)
+		supported |= SUPPORTED_1000baseKX_Full;
+	if (eee_caported & MDIO_EEE_10GKX4)
+		supported |= SUPPORTED_10000baseKX4_Full;
+	if (eee_caported & MDIO_EEE_10GKR)
+		supported |= SUPPORTED_10000baseKR_Full;
+
+	return supported;
+}
+
+static u16 phy_adv_to_eee(u32 adv)
+{
+	u16 reg = 0;
+
+	if (adv & ADVERTISED_100baseT_Full)
+		reg |= MDIO_EEE_100TX;
+	if (adv & ADVERTISED_1000baseT_Full)
+		reg |= MDIO_EEE_1000T;
+	if (adv & ADVERTISED_10000baseT_Full)
+		reg |= MDIO_EEE_10GT;
+	if (adv & ADVERTISED_1000baseKX_Full)
+		reg |= MDIO_EEE_1000KX;
+	if (adv & ADVERTISED_10000baseKX4_Full)
+		reg |= MDIO_EEE_10GKX4;
+	if (adv & ADVERTISED_10000baseKR_Full)
+		reg |= MDIO_EEE_10GKR;
+
+	return reg;
+}
+
+/**
+ * phy_init_eee - init and check the EEE feature
+ * @phydev: target phy_device struct
+ * @clk_stop_enable: PHY may stop the clock during LPI
+ *
+ * Description: it checks if the Energy-Efficient Ethernet (EEE)
+ * is supported by looking at the MMD registers 3.20 and 7.60/61
+ * and it programs the MMD register 3.0 setting the "Clock stop enable"
+ * bit if required.
+ */
+int phy_init_eee(struct phy_device *phydev, bool clk_stop_enable)
+{
+	int ret = -EPROTONOSUPPORT;
+
+	/* According to 802.3az, EEE is supported only in full-duplex mode.
+	 * Also EEE feature is active when core is operating with MII, GMII
+	 * or RGMII.
+	 */
+	if ((phydev->duplex == DUPLEX_FULL) &&
+	    ((phydev->interface == PHY_INTERFACE_MODE_MII) ||
+	    (phydev->interface == PHY_INTERFACE_MODE_GMII) ||
+	    (phydev->interface == PHY_INTERFACE_MODE_RGMII))) {
+		int eee_lp, eee_cap, eee_adv;
+		u32 lp, cap, adv;
+		int idx, status;
+
+		/* Read phy status to properly get the right settings */
+		status = phy_read_status(phydev);
+		if (status)
+			return status;
+
+		/* First check if the EEE ability is supported */
+		eee_cap = phy_read_mmd_indirect(phydev->bus, MDIO_PCS_EEE_ABLE,
+						MDIO_MMD_PCS, phydev->addr);
+		if (eee_cap < 0)
+			return eee_cap;
+
+		cap = phy_eee_to_supported(eee_cap);
+		if (!cap)
+			goto eee_exit;
+
+		/* Check which link settings negotiated and verify it in
+		 * the EEE advertising registers.
+		 */
+		eee_lp = phy_read_mmd_indirect(phydev->bus, MDIO_AN_EEE_LPABLE,
+					       MDIO_MMD_AN, phydev->addr);
+		if (eee_lp < 0)
+			return eee_lp;
+
+		eee_adv = phy_read_mmd_indirect(phydev->bus, MDIO_AN_EEE_ADV,
+						MDIO_MMD_AN, phydev->addr);
+		if (eee_adv < 0)
+			return eee_adv;
+
+		adv = phy_eee_to_adv(eee_adv);
+		lp = phy_eee_to_adv(eee_lp);
+		idx = phy_find_setting(phydev->speed, phydev->duplex);
+		if (!(lp & adv & settings[idx].setting))
+			goto eee_exit;
+
+		if (clk_stop_enable) {
+			/* Configure the PHY to stop receiving xMII
+			 * clock while it is signaling LPI.
+			 */
+			int val = phy_read_mmd_indirect(phydev->bus, MDIO_CTRL1,
+							MDIO_MMD_PCS,
+							phydev->addr);
+			if (val < 0)
+				return val;
+
+			val |= MDIO_PCS_CTRL1_CLKSTOP_EN;
+			phy_write_mmd_indirect(phydev->bus, MDIO_CTRL1,
+					       MDIO_MMD_PCS, phydev->addr, val);
+		}
+
+		ret = 0; /* EEE supported */
+	}
+
+eee_exit:
+	return ret;
+}
+EXPORT_SYMBOL(phy_init_eee);
+
+/**
+ * phy_get_eee_err - report the EEE wake error count
+ * @phydev: target phy_device struct
+ *
+ * Description: it reports the number of times the PHY
+ * failed to complete its normal wake sequence.
+ */
+int phy_get_eee_err(struct phy_device *phydev)
+{
+	return phy_read_mmd_indirect(phydev->bus, MDIO_PCS_EEE_WK_ERR,
+				     MDIO_MMD_PCS, phydev->addr);
+
+}
+EXPORT_SYMBOL(phy_get_eee_err);
+
+/**
+ * phy_ethtool_get_eee - get EEE supported and status
+ * @phydev: target phy_device struct
+ * @data: ethtool_eee data
+ *
+ * Description: it reports the Supported/Advertisement/LP Advertisement
+ * capabilities.
+ */
+int phy_ethtool_get_eee(struct phy_device *phydev, struct ethtool_eee *data)
+{
+	int val;
+
+	/* Get Supported EEE */
+	val = phy_read_mmd_indirect(phydev->bus, MDIO_PCS_EEE_ABLE,
+				    MDIO_MMD_PCS, phydev->addr);
+	if (val < 0)
+		return val;
+	data->supported = phy_eee_to_supported(val);
+
+	/* Get advertisement EEE */
+	val = phy_read_mmd_indirect(phydev->bus, MDIO_AN_EEE_ADV,
+				    MDIO_MMD_AN, phydev->addr);
+	if (val < 0)
+		return val;
+	data->advertised = phy_eee_to_adv(val);
+
+	/* Get LP advertisement EEE */
+	val = phy_read_mmd_indirect(phydev->bus, MDIO_AN_EEE_LPABLE,
+				    MDIO_MMD_AN, phydev->addr);
+	if (val < 0)
+		return val;
+	data->lp_advertised = phy_eee_to_adv(val);
+
+	return 0;
+}
+EXPORT_SYMBOL(phy_ethtool_get_eee);
+
+/**
+ * phy_ethtool_set_eee - set EEE supported and status
+ * @phydev: target phy_device struct
+ * @data: ethtool_eee data
+ *
+ * Description: it is to program the Advertisement EEE register.
+ */
+int phy_ethtool_set_eee(struct phy_device *phydev, struct ethtool_eee *data)
+{
+	int val;
+
+	val = phy_adv_to_eee(data->advertised);
+	phy_write_mmd_indirect(phydev->bus, MDIO_AN_EEE_ADV, MDIO_MMD_AN,
+			       phydev->addr, val);
+
+	return 0;
+}
+EXPORT_SYMBOL(phy_ethtool_set_eee);
diff --git a/include/linux/mdio.h b/include/linux/mdio.h
index dfb9479..4ad8f0e 100644
--- a/include/linux/mdio.h
+++ b/include/linux/mdio.h
@@ -43,7 +43,11 @@
 #define MDIO_PKGID2		15
 #define MDIO_AN_ADVERTISE	16	/* AN advertising (base page) */
 #define MDIO_AN_LPA		19	/* AN LP abilities (base page) */
+#define MDIO_PCS_EEE_ABLE	20	/* EEE Capability register */
+#define MDIO_PCS_EEE_WK_ERR	22	/* EEE wake error counter */
 #define MDIO_PHYXS_LNSTAT	24	/* PHY XGXS lane state */
+#define MDIO_AN_EEE_ADV		60	/* EEE advertisement */
+#define MDIO_AN_EEE_LPABLE	61	/* EEE link partner ability */
 
 /* Media-dependent registers. */
 #define MDIO_PMA_10GBT_SWAPPOL	130	/* 10GBASE-T pair swap & polarity */
@@ -56,7 +60,6 @@
 #define MDIO_PCS_10GBRT_STAT2	33	/* 10GBASE-R/-T PCS status 2 */
 #define MDIO_AN_10GBT_CTRL	32	/* 10GBASE-T auto-negotiation control */
 #define MDIO_AN_10GBT_STAT	33	/* 10GBASE-T auto-negotiation status */
-#define MDIO_AN_EEE_ADV		60	/* EEE advertisement */
 
 /* LASI (Link Alarm Status Interrupt) registers, defined by XENPAK MSA. */
 #define MDIO_PMA_LASI_RXCTRL	0x9000	/* RX_ALARM control */
@@ -82,6 +85,7 @@
 #define MDIO_AN_CTRL1_RESTART		BMCR_ANRESTART
 #define MDIO_AN_CTRL1_ENABLE		BMCR_ANENABLE
 #define MDIO_AN_CTRL1_XNP		0x2000	/* Enable extended next page */
+#define MDIO_PCS_CTRL1_CLKSTOP_EN	0x400	/* Stop the clock during LPI */
 
 /* 10 Gb/s */
 #define MDIO_CTRL1_SPEED10G		(MDIO_CTRL1_SPEEDSELEXT | 0x00)
@@ -237,9 +241,18 @@
 #define MDIO_AN_10GBT_STAT_MS		0x4000	/* Master/slave config */
 #define MDIO_AN_10GBT_STAT_MSFLT	0x8000	/* Master/slave config fault */
 
-/* AN EEE Advertisement register. */
-#define MDIO_AN_EEE_ADV_100TX		0x0002	/* Advertise 100TX EEE cap */
-#define MDIO_AN_EEE_ADV_1000T		0x0004	/* Advertise 1000T EEE cap */
+/* EEE Supported/Advertisement/LP Advertisement registers.
+ *
+ * EEE capability Register (3.20), Advertisement (7.60) and
+ * Link partner ability (7.61) registers have and can use the same identical
+ * bit masks.
+ */
+#define MDIO_EEE_100TX			0x0002	/* 100TX EEE cap */
+#define MDIO_EEE_1000T			0x0004	/* 1000T EEE cap */
+#define MDIO_EEE_10GT			0x0008	/* 10GT EEE cap */
+#define MDIO_EEE_1000KX			0x0010	/* 1000KX EEE cap */
+#define MDIO_EEE_10GKX4			0x0020	/* 10G KX4 EEE cap */
+#define MDIO_EEE_10GKR			0x0040	/* 10G KR EEE cap */
 
 /* LASI RX_ALARM control/status registers. */
 #define MDIO_PMA_LASI_RX_PHYXSLFLT	0x0001	/* PHY XS RX local fault */
diff --git a/include/linux/mii.h b/include/linux/mii.h
index 2783eca..8ef3a7a 100644
--- a/include/linux/mii.h
+++ b/include/linux/mii.h
@@ -21,6 +21,8 @@
 #define MII_EXPANSION		0x06	/* Expansion register          */
 #define MII_CTRL1000		0x09	/* 1000BASE-T control          */
 #define MII_STAT1000		0x0a	/* 1000BASE-T status           */
+#define	MII_MMD_CTRL		0x0d	/* MMD Access Control Register */
+#define	MII_MMD_DATA		0x0e	/* MMD Access Data Register */
 #define MII_ESTATUS		0x0f	/* Extended Status             */
 #define MII_DCOUNTER		0x12	/* Disconnect counter          */
 #define MII_FCSCOUNTER		0x13	/* False carrier counter       */
@@ -141,6 +143,13 @@
 #define FLOW_CTRL_TX		0x01
 #define FLOW_CTRL_RX		0x02
 
+/* MMD Access Control register fields */
+#define MII_MMD_CTRL_DEVAD_MASK	0x1f	/* Mask MMD DEVAD*/
+#define MII_MMD_CTRL_ADDR	0x0000	/* Address */
+#define MII_MMD_CTRL_NOINCR	0x4000	/* no post increment */
+#define MII_MMD_CTRL_INCR_RDWT	0x8000	/* post increment on reads & writes */
+#define MII_MMD_CTRL_INCR_ON_WT	0xC000	/* post increment on writes only */
+
 /* This structure is used in all SIOCxMIIxxx ioctl calls */
 struct mii_ioctl_data {
 	__u16		phy_id;
diff --git a/include/linux/phy.h b/include/linux/phy.h
index c291cae..97fc4cf 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -532,6 +532,11 @@ int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask,
 		int (*run)(struct phy_device *));
 int phy_scan_fixups(struct phy_device *phydev);
 
+int phy_init_eee(struct phy_device *phydev, bool clk_stop_enable);
+int phy_get_eee_err(struct phy_device *phydev);
+int phy_ethtool_set_eee(struct phy_device *phydev, struct ethtool_eee *data);
+int phy_ethtool_get_eee(struct phy_device *phydev, struct ethtool_eee *data);
+
 int __init mdio_bus_init(void);
 void mdio_bus_exit(void);
 
-- 
1.7.4.4

^ permalink raw reply related

* [net-next.git 2/4] stmmac: update the driver Documentation and add EEE
From: Giuseppe CAVALLARO @ 2012-06-20  6:12 UTC (permalink / raw)
  To: netdev
  Cc: eric.dumazet, bhutchings, rayagond, davem, yuvalmin,
	Giuseppe Cavallaro
In-Reply-To: <1340172774-27443-1-git-send-email-peppe.cavallaro@st.com>

This patch updates the stmmac's documentation adding
some missing files in the section used to describe the
internal driver's structure.

Also the patch adds a new section to describe the EEE support.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 Documentation/networking/stmmac.txt |   36 +++++++++++++++++++++++++++++-----
 1 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
index 5cb9a19..c676b9c 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -257,9 +257,11 @@ reset procedure etc).
  o Makefile
  o stmmac_main.c: main network device driver;
  o stmmac_mdio.c: mdio functions;
+ o stmmac_pci.c: PCI driver;
+ o stmmac_platform.c: platform driver;
  o stmmac_ethtool.c: ethtool support;
  o stmmac_timer.[ch]: timer code used for mitigating the driver dma interrupts
-		      Only tested on ST40 platforms based.
+		      (only tested on ST40-based platforms);
  o stmmac.h: private driver structure;
  o common.h: common definitions and VFTs;
  o descs.h: descriptor structure definitions;
@@ -269,9 +271,11 @@ reset procedure etc).
  o dwmac100_core: MAC 100 core and dma code;
  o dwmac100_dma.c: dma funtions for the MAC chip;
  o dwmac1000.h: specific header file for the MAC;
- o dwmac_lib.c: generic DMA functions shared among chips
- o enh_desc.c: functions for handling enhanced descriptors
- o norm_desc.c: functions for handling normal descriptors
+ o dwmac_lib.c: generic DMA functions shared among chips;
+ o enh_desc.c: functions for handling enhanced descriptors;
+ o norm_desc.c: functions for handling normal descriptors;
+ o chain_mode.c/ring_mode.c: functions to manage RING/CHAINED modes;
+ o mmc_core.c/mmc.h: Management MAC Counters;
 
 5) Debug Information
 
@@ -304,7 +308,27 @@ All these are only useful during the developing stage
 and should never enabled inside the code for general usage.
 In fact, these can generate an huge amount of debug messages.
 
-6) TODO:
+6) Energy Efficient Ethernet
+
+Energy Efficient Ethernet (EEE) enables the IEEE 802.3 MAC sublayer, along
+with a family of physical layers, to operate in Low Power Idle (LPI)
+mode. The EEE mode supports IEEE 802.3 MAC operation at 100Mbps,
+1000Mbps and 10Gbps.
+
+The LPI mode saves power by switching off parts of the
+communication device functionality when there is no data to be
+transmitted or received. The systems on both sides of the link can
+disable some functionality and save power during periods of low link
+utilization. The MAC controls whether the system should enter or exit
+the LPI mode and communicates this to the PHY.
+
+As soon as the interface is opened, the driver verifies whether EEE can
+be supported. This is done by looking at both the DMA HW capability
+register and the PHY device's MMD registers.
+To enter Tx LPI mode, the driver needs a software timer
+that enables and disables the LPI mode when there is nothing to be
+transmitted.
+
+7) TODO:
  o XGMAC is not supported.
- o Add the EEE - Energy Efficient Ethernet
  o Add the PTP - precision time protocol
-- 
1.7.4.4

^ permalink raw reply related
