From: Eric Dumazet <dada1@cosmosbay.com>
To: Rick Jones <rick.jones2@hp.com>
Cc: Linux Network Development list <netdev@vger.kernel.org>,
Netfilter Developers <netfilter-devel@vger.kernel.org>,
Stephen Hemminger <shemminger@vyatta.com>,
Patrick McHardy <kaber@trash.net>
Subject: Re: 32 core net-next stack/netfilter "scaling"
Date: Tue, 27 Jan 2009 10:10:28 +0100 [thread overview]
Message-ID: <497ECF84.1030308@cosmosbay.com> (raw)
In-Reply-To: <497E44F6.2010703@hp.com>
Rick Jones wrote:
>>
>> Hi Rick, nice hardware you have :)
>
> Every once in a while the cobbler's children get some shoes to wear :)
>
>>
>> Stephen had a patch to nuke read_lock() from iptables, using RCU and
>> seqlocks.
>> I hit this contention point even with low-cost hardware and a quite
>> standard application.
>>
>> I pinged him a few days ago to try to finish the job with him, but it
>> seems Stephen
>> is busy at the moment.
>>
>> Then conntrack (tcp sessions) is awful, since it uses a single
>> rwlock_t tcp_lock
>> that must be write_locked() for basically every handled tcp frame...
>>
>> How long is "not indefinitely" ?
>
> The system I am using is being borrowed under an open ended loan.
> However, my use of it can be thought of as being the "null process" in
> VMS - once anyone else wants it I have to get off of it.
>
> That said, I would guess that the chances of someone else trying to get
> that system are pretty small for the next four+ weeks. I had a similar
> system (PCIe I/O rather than PCI-X) for quite a few weeks before it got
> pulled out from under me.
>
As Stephen will probably send the iptables part, I can do the tcp conntrack
improvement.
Let me know how it helps your last test (connection tracking enabled).
Thanks
[PATCH] netfilter: Get rid of central rwlock in tcp conntracking
TCP connection tracking suffers from heavy contention on a global rwlock
used to protect tcp conntrack state.
Since each tcp conntrack state is independent of the others, we can
switch to fine-grained locking, using one spinlock per "struct ip_ct_tcp".
tcp_print_conntrack() doesn't need to take any lock to read
ct->proto.tcp.state, so this also speeds up /proc/net/ip_conntrack.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reported-by: Rick Jones <rick.jones2@hp.com>
---
diff --git a/include/linux/netfilter/nf_conntrack_tcp.h b/include/linux/netfilter/nf_conntrack_tcp.h
index a049df4..3dc72c6 100644
--- a/include/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/linux/netfilter/nf_conntrack_tcp.h
@@ -50,6 +50,7 @@ struct ip_ct_tcp_state {
struct ip_ct_tcp
{
+ spinlock_t lock;
struct ip_ct_tcp_state seen[2]; /* connection parameters per direction */
u_int8_t state; /* state of the connection (enum tcp_conntrack) */
/* For detecting stale connections */
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index a1edb9c..1bc8fc1 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -26,9 +26,6 @@
#include <net/netfilter/nf_conntrack_ecache.h>
#include <net/netfilter/nf_log.h>
-/* Protects ct->proto.tcp */
-static DEFINE_RWLOCK(tcp_lock);
-
/* "Be conservative in what you do,
be liberal in what you accept from others."
If it's non-zero, we mark only out of window RST segments as INVALID. */
@@ -297,9 +294,7 @@ static int tcp_print_conntrack(struct seq_file *s, const struct nf_conn *ct)
{
enum tcp_conntrack state;
- read_lock_bh(&tcp_lock);
state = ct->proto.tcp.state;
- read_unlock_bh(&tcp_lock);
return seq_printf(s, "%s ", tcp_conntrack_names[state]);
}
@@ -705,14 +700,14 @@ void nf_conntrack_tcp_update(const struct sk_buff *skb,
end = segment_seq_plus_len(ntohl(tcph->seq), skb->len, dataoff, tcph);
- write_lock_bh(&tcp_lock);
+ spin_lock_bh(&ct->proto.tcp.lock);
/*
* We have to worry for the ack in the reply packet only...
*/
if (after(end, ct->proto.tcp.seen[dir].td_end))
ct->proto.tcp.seen[dir].td_end = end;
ct->proto.tcp.last_end = end;
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
pr_debug("tcp_update: sender end=%u maxend=%u maxwin=%u scale=%i "
"receiver end=%u maxend=%u maxwin=%u scale=%i\n",
sender->td_end, sender->td_maxend, sender->td_maxwin,
@@ -821,7 +816,7 @@ static int tcp_packet(struct nf_conn *ct,
th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph);
BUG_ON(th == NULL);
- write_lock_bh(&tcp_lock);
+ spin_lock_bh(&ct->proto.tcp.lock);
old_state = ct->proto.tcp.state;
dir = CTINFO2DIR(ctinfo);
index = get_conntrack_index(th);
@@ -851,7 +846,7 @@ static int tcp_packet(struct nf_conn *ct,
&& ct->proto.tcp.last_index == TCP_RST_SET)) {
/* Attempt to reopen a closed/aborted connection.
* Delete this connection and look up again. */
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
/* Only repeat if we can actually remove the timer.
* Destruction may already be in progress in process
@@ -887,7 +882,7 @@ static int tcp_packet(struct nf_conn *ct,
* that the client cannot but retransmit its SYN and
* thus initiate a clean new session.
*/
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
if (LOG_INVALID(net, IPPROTO_TCP))
nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
"nf_ct_tcp: killing out of sync session ");
@@ -900,7 +895,7 @@ static int tcp_packet(struct nf_conn *ct,
ct->proto.tcp.last_end =
segment_seq_plus_len(ntohl(th->seq), skb->len, dataoff, th);
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
if (LOG_INVALID(net, IPPROTO_TCP))
nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
"nf_ct_tcp: invalid packet ignored ");
@@ -909,7 +904,7 @@ static int tcp_packet(struct nf_conn *ct,
/* Invalid packet */
pr_debug("nf_ct_tcp: Invalid dir=%i index=%u ostate=%u\n",
dir, get_conntrack_index(th), old_state);
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
if (LOG_INVALID(net, IPPROTO_TCP))
nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
"nf_ct_tcp: invalid state ");
@@ -940,7 +935,7 @@ static int tcp_packet(struct nf_conn *ct,
if (!tcp_in_window(ct, &ct->proto.tcp, dir, index,
skb, dataoff, th, pf)) {
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
return -NF_ACCEPT;
}
in_window:
@@ -969,7 +964,7 @@ static int tcp_packet(struct nf_conn *ct,
timeout = nf_ct_tcp_timeout_unacknowledged;
else
timeout = tcp_timeouts[new_state];
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
nf_conntrack_event_cache(IPCT_PROTOINFO_VOLATILE, ct);
if (new_state != old_state)
@@ -1022,6 +1017,7 @@ static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb,
pr_debug("nf_ct_tcp: invalid new deleting.\n");
return false;
}
+ spin_lock_init(&ct->proto.tcp.lock);
if (new_state == TCP_CONNTRACK_SYN_SENT) {
/* SYN packet */
@@ -1092,7 +1088,7 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
struct nlattr *nest_parms;
struct nf_ct_tcp_flags tmp = {};
- read_lock_bh(&tcp_lock);
+ spin_lock_bh(&ct->proto.tcp.lock);
nest_parms = nla_nest_start(skb, CTA_PROTOINFO_TCP | NLA_F_NESTED);
if (!nest_parms)
goto nla_put_failure;
@@ -1112,14 +1108,14 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
tmp.flags = ct->proto.tcp.seen[1].flags;
NLA_PUT(skb, CTA_PROTOINFO_TCP_FLAGS_REPLY,
sizeof(struct nf_ct_tcp_flags), &tmp);
- read_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
nla_nest_end(skb, nest_parms);
return 0;
nla_put_failure:
- read_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
return -1;
}
@@ -1150,7 +1146,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct)
nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]) >= TCP_CONNTRACK_MAX)
return -EINVAL;
- write_lock_bh(&tcp_lock);
+ spin_lock_bh(&ct->proto.tcp.lock);
if (tb[CTA_PROTOINFO_TCP_STATE])
ct->proto.tcp.state = nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]);
@@ -1177,7 +1173,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct)
ct->proto.tcp.seen[1].td_scale =
nla_get_u8(tb[CTA_PROTOINFO_TCP_WSCALE_REPLY]);
}
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html