From: Eric Dumazet <dada1@cosmosbay.com>
To: Rick Jones <rick.jones2@hp.com>
Cc: Linux Network Development list <netdev@vger.kernel.org>,
Netfilter Developers <netfilter-devel@vger.kernel.org>,
Stephen Hemminger <shemminger@vyatta.com>,
Patrick McHardy <kaber@trash.net>
Subject: Re: 32 core net-next stack/netfilter "scaling"
Date: Tue, 27 Jan 2009 10:10:28 +0100 [thread overview]
Message-ID: <497ECF84.1030308@cosmosbay.com> (raw)
In-Reply-To: <497E44F6.2010703@hp.com>
Rick Jones a écrit :
>>
>> Hi Rick, nice hardware you have :)
>
> Every once in a while the cobbler's children get some shoes to wear :)
>
>>
>> Stephen had a patch to nuke read_lock() from iptables, using RCU and
>> seqlocks.
>> I hit this contention point even with low cost hardware, and quite
>> standard application.
>>
>> I pinged him few days ago to try to finish the job with him, but it
>> seems Stephen
>> is busy at the moment.
>>
>> Then conntrack (tcp sessions) is awfull, since it uses a single
>> rwlock_t tcp_lock
>> that must be write_locked() for basically every handled tcp frame...
>>
>> How long is "not indefinitely" ?
>
> The system I am using is being borrowed under an open ended loan.
> However, my use of it can be thought of as being the "null process" in
> VMS - once anyone else wants it I have to get off of it.
>
> That said, I would guess that the chances of someone else trying to get
> that system are pretty small for the next four+ weeks. I had a similar
> system (PCIe I/O rather than PCI-X) for quite a few weeks before it got
> pulled-out from under.
>
As Stephen will probably send the iptable part, I can do the tcp conntrack
improvment.
Let me know how it helps your last test (connection tracking enabled)
Thanks
[PATCH] netfilter: Get rid of central rwlock in tcp conntracking
TCP connection tracking suffers of huge contention on a global rwlock,
used to protect tcp conntracking state.
As each tcp conntrack state have no relations between each others, we
can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp"
tcp_print_conntrack() dont need to lock anything to read ct->proto.tcp.state,
so speedup /proc/net/ip_conntrack as well.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reported-by: Rick Jones <rick.jones2@hp.com>
---
diff --git a/include/linux/netfilter/nf_conntrack_tcp.h b/include/linux/netfilter/nf_conntrack_tcp.h
index a049df4..3dc72c6 100644
--- a/include/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/linux/netfilter/nf_conntrack_tcp.h
@@ -50,6 +50,7 @@ struct ip_ct_tcp_state {
struct ip_ct_tcp
{
+ spinlock_t lock;
struct ip_ct_tcp_state seen[2]; /* connection parameters per direction */
u_int8_t state; /* state of the connection (enum tcp_conntrack) */
/* For detecting stale connections */
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index a1edb9c..1bc8fc1 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -26,9 +26,6 @@
#include <net/netfilter/nf_conntrack_ecache.h>
#include <net/netfilter/nf_log.h>
-/* Protects ct->proto.tcp */
-static DEFINE_RWLOCK(tcp_lock);
-
/* "Be conservative in what you do,
be liberal in what you accept from others."
If it's non-zero, we mark only out of window RST segments as INVALID. */
@@ -297,9 +294,7 @@ static int tcp_print_conntrack(struct seq_file *s, const struct nf_conn *ct)
{
enum tcp_conntrack state;
- read_lock_bh(&tcp_lock);
state = ct->proto.tcp.state;
- read_unlock_bh(&tcp_lock);
return seq_printf(s, "%s ", tcp_conntrack_names[state]);
}
@@ -705,14 +700,14 @@ void nf_conntrack_tcp_update(const struct sk_buff *skb,
end = segment_seq_plus_len(ntohl(tcph->seq), skb->len, dataoff, tcph);
- write_lock_bh(&tcp_lock);
+ spin_lock_bh(&ct->proto.lock);
/*
* We have to worry for the ack in the reply packet only...
*/
if (after(end, ct->proto.tcp.seen[dir].td_end))
ct->proto.tcp.seen[dir].td_end = end;
ct->proto.tcp.last_end = end;
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.lock);
pr_debug("tcp_update: sender end=%u maxend=%u maxwin=%u scale=%i "
"receiver end=%u maxend=%u maxwin=%u scale=%i\n",
sender->td_end, sender->td_maxend, sender->td_maxwin,
@@ -821,7 +816,7 @@ static int tcp_packet(struct nf_conn *ct,
th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph);
BUG_ON(th == NULL);
- write_lock_bh(&tcp_lock);
+ spin_lock_bh(&ct->proto.tcp.lock);
old_state = ct->proto.tcp.state;
dir = CTINFO2DIR(ctinfo);
index = get_conntrack_index(th);
@@ -851,7 +846,7 @@ static int tcp_packet(struct nf_conn *ct,
&& ct->proto.tcp.last_index == TCP_RST_SET)) {
/* Attempt to reopen a closed/aborted connection.
* Delete this connection and look up again. */
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
/* Only repeat if we can actually remove the timer.
* Destruction may already be in progress in process
@@ -887,7 +882,7 @@ static int tcp_packet(struct nf_conn *ct,
* that the client cannot but retransmit its SYN and
* thus initiate a clean new session.
*/
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
if (LOG_INVALID(net, IPPROTO_TCP))
nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
"nf_ct_tcp: killing out of sync session ");
@@ -900,7 +895,7 @@ static int tcp_packet(struct nf_conn *ct,
ct->proto.tcp.last_end =
segment_seq_plus_len(ntohl(th->seq), skb->len, dataoff, th);
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
if (LOG_INVALID(net, IPPROTO_TCP))
nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
"nf_ct_tcp: invalid packet ignored ");
@@ -909,7 +904,7 @@ static int tcp_packet(struct nf_conn *ct,
/* Invalid packet */
pr_debug("nf_ct_tcp: Invalid dir=%i index=%u ostate=%u\n",
dir, get_conntrack_index(th), old_state);
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
if (LOG_INVALID(net, IPPROTO_TCP))
nf_log_packet(pf, 0, skb, NULL, NULL, NULL,
"nf_ct_tcp: invalid state ");
@@ -940,7 +935,7 @@ static int tcp_packet(struct nf_conn *ct,
if (!tcp_in_window(ct, &ct->proto.tcp, dir, index,
skb, dataoff, th, pf)) {
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
return -NF_ACCEPT;
}
in_window:
@@ -969,7 +964,7 @@ static int tcp_packet(struct nf_conn *ct,
timeout = nf_ct_tcp_timeout_unacknowledged;
else
timeout = tcp_timeouts[new_state];
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
nf_conntrack_event_cache(IPCT_PROTOINFO_VOLATILE, ct);
if (new_state != old_state)
@@ -1022,6 +1017,7 @@ static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb,
pr_debug("nf_ct_tcp: invalid new deleting.\n");
return false;
}
+ spin_lock_init(&ct->proto.tcp.lock);
if (new_state == TCP_CONNTRACK_SYN_SENT) {
/* SYN packet */
@@ -1092,7 +1088,7 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
struct nlattr *nest_parms;
struct nf_ct_tcp_flags tmp = {};
- read_lock_bh(&tcp_lock);
+ spin_lock_bh(&ct->proto.tcp.lock);
nest_parms = nla_nest_start(skb, CTA_PROTOINFO_TCP | NLA_F_NESTED);
if (!nest_parms)
goto nla_put_failure;
@@ -1112,14 +1108,14 @@ static int tcp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
tmp.flags = ct->proto.tcp.seen[1].flags;
NLA_PUT(skb, CTA_PROTOINFO_TCP_FLAGS_REPLY,
sizeof(struct nf_ct_tcp_flags), &tmp);
- read_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
nla_nest_end(skb, nest_parms);
return 0;
nla_put_failure:
- read_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
return -1;
}
@@ -1150,7 +1146,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct)
nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]) >= TCP_CONNTRACK_MAX)
return -EINVAL;
- write_lock_bh(&tcp_lock);
+ spin_lock_bh(&ct->proto.tcp.lock);
if (tb[CTA_PROTOINFO_TCP_STATE])
ct->proto.tcp.state = nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]);
@@ -1177,7 +1173,7 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct)
ct->proto.tcp.seen[1].td_scale =
nla_get_u8(tb[CTA_PROTOINFO_TCP_WSCALE_REPLY]);
}
- write_unlock_bh(&tcp_lock);
+ spin_unlock_bh(&ct->proto.tcp.lock);
return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2009-01-27 9:10 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-26 22:15 32 core net-next stack/netfilter "scaling" Rick Jones
2009-01-26 23:10 ` Eric Dumazet
2009-01-26 23:14 ` Stephen Hemminger
2009-01-26 23:19 ` Rick Jones
2009-01-27 9:10 ` Eric Dumazet [this message]
2009-01-27 9:15 ` Patrick McHardy
2009-01-27 11:29 ` Eric Dumazet
2009-01-27 11:37 ` Patrick McHardy
2009-01-27 16:23 ` Eric Dumazet
2009-01-27 17:33 ` Patrick McHardy
2009-01-27 18:02 ` Rick Jones
2009-01-27 19:09 ` Rick Jones
2009-01-27 19:24 ` Rick Jones
2009-01-27 22:17 ` Eric Dumazet
2009-01-27 22:29 ` Rick Jones
2009-01-27 22:34 ` Eric Dumazet
2009-01-27 22:43 ` Rick Jones
2009-01-28 13:55 ` Eric Dumazet
2009-01-28 16:25 ` Patrick McHardy
2009-01-28 17:07 ` Eric Dumazet
2009-01-28 17:34 ` Eric Dumazet
2009-01-29 15:31 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet
2009-01-30 15:47 ` Andi Kleen
2009-01-30 16:54 ` Eric Dumazet
2009-01-30 17:27 ` Andi Kleen
2009-01-30 17:27 ` Eric Dumazet
2009-01-30 17:50 ` Andi Kleen
2009-02-09 13:41 ` Patrick McHardy
2009-02-18 15:10 ` Eric Dumazet
2009-02-18 15:21 ` Patrick McHardy
2009-02-18 16:33 ` Eric Dumazet
2009-02-18 16:52 ` Patrick McHardy
2009-02-18 17:36 ` [PATCH] netfilter: xt_physdev fixes Eric Dumazet
2009-02-18 18:14 ` Patrick McHardy
2009-02-19 8:00 ` [PATCH] netfilter: unfold two loops in physdev_mt() Eric Dumazet
2009-02-19 8:14 ` [PATCH] netfilter: unfold two loops in ip6_packet_match() Eric Dumazet
2009-02-19 10:19 ` Patrick McHardy
2009-02-19 10:17 ` [PATCH] netfilter: unfold two loops in physdev_mt() Patrick McHardy
2009-02-20 10:02 ` [PATCH] netfilter: unfold two critical loops in ip_packet_match() Eric Dumazet
2009-02-20 10:04 ` Patrick McHardy
2009-02-09 14:57 ` 32 core net-next stack/netfilter "scaling" Patrick McHardy
2009-02-10 18:44 ` Stephen Hemminger
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=497ECF84.1030308@cosmosbay.com \
--to=dada1@cosmosbay.com \
--cc=kaber@trash.net \
--cc=netdev@vger.kernel.org \
--cc=netfilter-devel@vger.kernel.org \
--cc=rick.jones2@hp.com \
--cc=shemminger@vyatta.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.