* [PATCH] ipv4: Add sysctl knob to control early socket demux
@ 2012-06-21 23:58 Alexander Duyck
2012-06-23 0:15 ` David Miller
0 siblings, 1 reply; 5+ messages in thread
From: Alexander Duyck @ 2012-06-21 23:58 UTC (permalink / raw)
To: netdev; +Cc: jeffrey.t.kirsher, David S. Miller, Eric Dumazet, Alexander Duyck
This change is meant to add a control for disabling early socket demux.
The main motivation behind this patch is to provide an option to disable
the feature as it adds an additional cost to routing that reduces overall
throughput by up to 5%. For example, one of my systems went from 12.1Mpps
to 11.6Mpps after the early socket demux was added. It looks like the reason
for the regression is that we are now having to perform two lookups, first
the one for an established socket, and then the one for the routing table.
By adding this patch and toggling the value for ip_early_demux to 0 I am
able to get back to the 12.1Mpps I was previously seeing.
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
I am open to any comments or suggestions on this patch. I had seen the
earlier discussions and saw mention of adding a control for disabling the
early demux feature so I figured I would just code it up real quick once I
ran into the regression. I am assuming it is okay to disable the early
demux code since I suspect there is other code in place that will still
handle the demux for the TCP sockets later. Also I wasn't sure about the
sysctl since I haven't set one up before.
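
For readers who want to flip the knob from a program rather than a shell,
here is a minimal userspace sketch. It assumes the entry shows up at
/proc/sys/net/ipv4/ip_early_demux, which is where an "ip_early_demux" entry
in ipv4_table would normally land. The helper names knob_read() and
knob_write() are made up for illustration and are not part of this patch.

```c
#include <stdio.h>

/* Assumed procfs location of the knob (ipv4_table entries live here). */
#define IP_EARLY_DEMUX_PATH "/proc/sys/net/ipv4/ip_early_demux"

/* Hypothetical helper: read an integer knob. Returns 0 on success, -1 on error. */
static int knob_read(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Hypothetical helper: write an integer knob. Returns 0 on success, -1 on error. */
static int knob_write(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%d\n", value) > 0);
	/* The buffered write is flushed on close, so check fclose() too. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Calling knob_write(IP_EARLY_DEMUX_PATH, 0) as root would disable early demux,
and knob_write(IP_EARLY_DEMUX_PATH, 1) would restore the default.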
include/linux/sysctl.h | 1 +
include/net/ip.h | 3 +++
kernel/sysctl_binary.c | 2 ++
net/ipv4/ip_input.c | 19 +++++++++++--------
net/ipv4/sysctl_net_ipv4.c | 7 +++++++
5 files changed, 24 insertions(+), 8 deletions(-)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index c34b4c8..20825e5 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -425,6 +425,7 @@ enum
NET_TCP_ALLOWED_CONG_CONTROL=123,
NET_TCP_MAX_SSTHRESH=124,
NET_TCP_FRTO_RESPONSE=125,
+ NET_IPV4_EARLY_DEMUX=126,
};
enum {
diff --git a/include/net/ip.h b/include/net/ip.h
index 83e0619..50841bd 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -210,6 +210,9 @@ extern int inet_peer_threshold;
extern int inet_peer_minttl;
extern int inet_peer_maxttl;
+/* From ip_input.c */
+extern int sysctl_ip_early_demux;
+
/* From ip_output.c */
extern int sysctl_ip_dynaddr;
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index a650694..6a3cf82 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -415,6 +415,8 @@ static const struct bin_table bin_net_ipv4_table[] = {
{ CTL_INT, NET_IPV4_IPFRAG_SECRET_INTERVAL, "ipfrag_secret_interval" },
/* NET_IPV4_IPFRAG_MAX_DIST "ipfrag_max_dist" no longer used */
+ { CTL_INT, NET_IPV4_EARLY_DEMUX, "ip_early_demux" },
+
{ CTL_INT, 2088 /* NET_IPQ_QMAX */, "ip_queue_maxlen" },
/* NET_TCP_DEFAULT_WIN_SCALE unused */
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 93b092c..07de38d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -313,6 +313,8 @@ drop:
return true;
}
+int sysctl_ip_early_demux __read_mostly = 1;
+
static int ip_rcv_finish(struct sk_buff *skb)
{
const struct iphdr *iph = ip_hdr(skb);
@@ -325,14 +327,15 @@ static int ip_rcv_finish(struct sk_buff *skb)
if (skb_dst(skb) == NULL) {
const struct net_protocol *ipprot;
int protocol = iph->protocol;
- int err;
-
- rcu_read_lock();
- ipprot = rcu_dereference(inet_protos[protocol]);
- err = -ENOENT;
- if (ipprot && ipprot->early_demux)
- err = ipprot->early_demux(skb);
- rcu_read_unlock();
+ int err = -ENOENT;
+
+ if (sysctl_ip_early_demux) {
+ rcu_read_lock();
+ ipprot = rcu_dereference(inet_protos[protocol]);
+ if (ipprot && ipprot->early_demux)
+ err = ipprot->early_demux(skb);
+ rcu_read_unlock();
+ }
if (err) {
err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ef32956..12aa0c5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -301,6 +301,13 @@ static struct ctl_table ipv4_table[] = {
.proc_handler = proc_dointvec
},
{
+ .procname = "ip_early_demux",
+ .data = &sysctl_ip_early_demux,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
+ {
.procname = "ip_dynaddr",
.data = &sysctl_ip_dynaddr,
.maxlen = sizeof(int),
* Re: [PATCH] ipv4: Add sysctl knob to control early socket demux
2012-06-21 23:58 [PATCH] ipv4: Add sysctl knob to control early socket demux Alexander Duyck
@ 2012-06-23 0:15 ` David Miller
2012-06-23 5:45 ` Eric Dumazet
0 siblings, 1 reply; 5+ messages in thread
From: David Miller @ 2012-06-23 0:15 UTC (permalink / raw)
To: alexander.h.duyck; +Cc: netdev, jeffrey.t.kirsher, edumazet
From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Thu, 21 Jun 2012 16:58:31 -0700
> This change is meant to add a control for disabling early socket demux.
> The main motivation behind this patch is to provide an option to disable
> the feature as it adds an additional cost to routing that reduces overall
> throughput by up to 5%. For example, one of my systems went from 12.1Mpps
> to 11.6Mpps after the early socket demux was added. It looks like the reason
> for the regression is that we are now having to perform two lookups, first
> the one for an established socket, and then the one for the routing table.
>
> By adding this patch and toggling the value for ip_early_demux to 0 I am
> able to get back to the 12.1Mpps I was previously seeing.
>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
I applied this for now, making a minor change to move the local
variables down into the new basic block you created.
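
The applied version is not shown in this thread. As a rough illustration only,
the following userspace model mirrors the control flow of ip_rcv_finish() with
the protocol lookup locals scoped inside the new block. All types and names
here (struct packet, struct proto_ops, protos[], knob_early_demux,
demux_miss(), route_input(), rcv_finish()) are hypothetical stand-ins, not the
kernel definitions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: just enough state to count lookups. */
struct packet {
	unsigned char protocol;
	int has_dst;		/* models skb_dst(skb) != NULL */
	int demux_lookups;
	int route_lookups;
};

/* Hypothetical stand-in for struct net_protocol. */
struct proto_ops {
	int (*early_demux)(struct packet *pkt);
};

/* Models an early demux that finds no established socket (the router case). */
static int demux_miss(struct packet *pkt)
{
	pkt->demux_lookups++;
	return -ENOENT;
}

static const struct proto_ops tcp_like = { .early_demux = demux_miss };

/* Models inet_protos[]; only protocol 6 (TCP) has an early demux hook. */
static const struct proto_ops *protos[256] = { [6] = &tcp_like };

/* Models sysctl_ip_early_demux; 1 keeps the feature enabled. */
static int knob_early_demux = 1;

/* Models ip_route_input_noref(): always succeeds and attaches a route. */
static int route_input(struct packet *pkt)
{
	pkt->route_lookups++;
	pkt->has_dst = 1;
	return 0;
}

static int rcv_finish(struct packet *pkt)
{
	if (!pkt->has_dst) {
		int err = -ENOENT;

		if (knob_early_demux) {
			/* Locals needed only for the demux live in this block. */
			const struct proto_ops *ops = protos[pkt->protocol];

			if (ops && ops->early_demux)
				err = ops->early_demux(pkt);
		}

		/* Fall back to the routing lookup when the demux did not match. */
		if (err)
			err = route_input(pkt);
		return err;
	}
	return 0;
}
```

In this model a forwarded TCP packet costs one demux lookup plus one route
lookup with the knob at 1, and only the route lookup with the knob at 0.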
There has got to be a way to make this really cheap. At the very
least we can have the GRO code store away the ports and therefore
allow us to just do a direct call to try and demux the socket. Thus,
we'd avoid all of pskb_may_pull() et al. packet validations, and
packet header pointer calculations.
Furthermore, we can reduce the overhead by making a special inet
established hash demux that doesn't check for time-wait sockets,
reducing the number of probes to 1 from 2.
* Re: [PATCH] ipv4: Add sysctl knob to control early socket demux
2012-06-23 0:15 ` David Miller
@ 2012-06-23 5:45 ` Eric Dumazet
2012-06-23 6:00 ` David Miller
2012-06-23 6:03 ` David Miller
0 siblings, 2 replies; 5+ messages in thread
From: Eric Dumazet @ 2012-06-23 5:45 UTC (permalink / raw)
To: David Miller; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, edumazet
On Fri, 2012-06-22 at 17:15 -0700, David Miller wrote:
> I applied this for now, making a minor change to move the local
> variables down into the new basic block you created.
>
Hmm, sorry to be late, but you left in the NET_IPV4_EARLY_DEMUX=126 binary
sysctl, which is the deprecated way...
> There has got to be a way to make this really cheap. At the very
> least we can have the GRO code store away the ports and therefore
> allow us to just do a direct call to try and demux the socket. Thus,
> we'd avoid all of pskb_may_pull() et al. packet validations, and
> packet header pointer calculations.
>
> Furthermore, we can reduce the overhead by making a special inet
> established hash demux that doesn't check for time-wait sockets,
> reducing the number of probes to 1 from 2.
The timewait hash chain is on the same cache line as the established one.
And on a router, both chains are empty with a 99.999 % probability.
* Re: [PATCH] ipv4: Add sysctl knob to control early socket demux
2012-06-23 5:45 ` Eric Dumazet
@ 2012-06-23 6:00 ` David Miller
2012-06-23 6:03 ` David Miller
1 sibling, 0 replies; 5+ messages in thread
From: David Miller @ 2012-06-23 6:00 UTC (permalink / raw)
To: eric.dumazet; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, edumazet
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 23 Jun 2012 07:45:26 +0200
> On Fri, 2012-06-22 at 17:15 -0700, David Miller wrote:
>
>> Furthermore, we can reduce the overhead by making a special inet
>> established hash demux that doesn't check for time-wait sockets,
>> reducing the number of probes to 1 from 2.
>
> The timewait hash chain is on the same cache line as the established one.
> And on a router, both chains are empty with a 99.999 % probability.
I understand this.
Probably a lot of the overhead has to do with the function calls
and, as I mentioned, the transport layer probing and validation.
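
To put the numbers from the original report in per-packet terms, here is a
small back-of-the-envelope sketch. It only converts the reported 12.1Mpps and
11.6Mpps figures into a time budget per packet; it does not measure anything,
and the function names are made up for illustration.

```c
/* Time available per packet, in nanoseconds, at a given rate in Mpps. */
static double ns_per_packet(double mpps)
{
	return 1000.0 / mpps;
}

/* Relative throughput loss as a percentage of the baseline rate. */
static double percent_drop(double before_mpps, double after_mpps)
{
	return 100.0 * (before_mpps - after_mpps) / before_mpps;
}
```

With 12.1 and 11.6 as inputs, percent_drop() gives roughly 4.1%, and the
difference between the two ns_per_packet() values is roughly 3.6ns, which
would be the added cost per packet under those figures.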
* Re: [PATCH] ipv4: Add sysctl knob to control early socket demux
2012-06-23 5:45 ` Eric Dumazet
2012-06-23 6:00 ` David Miller
@ 2012-06-23 6:03 ` David Miller
1 sibling, 0 replies; 5+ messages in thread
From: David Miller @ 2012-06-23 6:03 UTC (permalink / raw)
To: eric.dumazet; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, edumazet
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 23 Jun 2012 07:45:26 +0200
> On Fri, 2012-06-22 at 17:15 -0700, David Miller wrote:
>
>> I applied this for now, making a minor change to move the local
>> variables down into the new basic block you created.
>>
>
> Hmm, sorry to be late, but you left in the NET_IPV4_EARLY_DEMUX=126 binary
> sysctl, which is the deprecated way...
Thanks for catching this:
--------------------
[PATCH] ipv4: Don't add deprecated new binary sysctl value.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/linux/sysctl.h | 1 -
kernel/sysctl_binary.c | 2 --
2 files changed, 3 deletions(-)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 20825e5..c34b4c8 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -425,7 +425,6 @@ enum
NET_TCP_ALLOWED_CONG_CONTROL=123,
NET_TCP_MAX_SSTHRESH=124,
NET_TCP_FRTO_RESPONSE=125,
- NET_IPV4_EARLY_DEMUX=126,
};
enum {
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 6a3cf82..a650694 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -415,8 +415,6 @@ static const struct bin_table bin_net_ipv4_table[] = {
{ CTL_INT, NET_IPV4_IPFRAG_SECRET_INTERVAL, "ipfrag_secret_interval" },
/* NET_IPV4_IPFRAG_MAX_DIST "ipfrag_max_dist" no longer used */
- { CTL_INT, NET_IPV4_EARLY_DEMUX, "ip_early_demux" },
-
{ CTL_INT, 2088 /* NET_IPQ_QMAX */, "ip_queue_maxlen" },
/* NET_TCP_DEFAULT_WIN_SCALE unused */
--
1.7.10.2