From mboxrd@z Thu Jan 1 00:00:00 1970
From: Patrick McHardy
Subject: Re: [PATCH 2/2] netfilter: conntrack: optional reliable conntrack
 event delivery
Date: Fri, 05 Jun 2009 16:37:43 +0200
Message-ID: <4A292DB7.4000000@trash.net>
References: <20090604110307.6702.10147.stgit@Decadence>
 <20090604110841.6702.76228.stgit@Decadence>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Cc: netfilter-devel@vger.kernel.org
To: Pablo Neira Ayuso
Return-path:
Received: from stinky.trash.net ([213.144.137.162]:59391 "EHLO stinky.trash.net"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752564AbZFEOhp
 (ORCPT ); Fri, 5 Jun 2009 10:37:45 -0400
In-Reply-To: <20090604110841.6702.76228.stgit@Decadence>
Sender: netfilter-devel-owner@vger.kernel.org
List-ID:

Pablo Neira Ayuso wrote:
> This patch improves ctnetlink event reliability if one broadcast
> listener has set the NETLINK_BROADCAST_ERROR socket option.
>
> The logic is the following: if the event delivery fails, we keep the
> undelivered events in the per-conntrack event cache. Thus, once
> the next packet arrives, we trigger another event delivery in
> nf_conntrack_in(). If things don't go well in this second try,
> we accumulate the pending events in the cache but we try to
> deliver the current state as soon as possible. Therefore, we may
> lose state transitions but the userspace process gets in sync at
> some point.
>
> In the worst case, if no events were delivered to userspace, we make
> sure that destroy events are successfully delivered. This happens
> because if ctnetlink fails to deliver the destroy event, we remove
> the conntrack entry from the hashes and insert it into the dying
> list, which contains inactive entries. Then, the conntrack timer
> is added with an extra grace timeout of random32() % 15 seconds
> to trigger the event again (this grace timeout is tunable via
> /proc).
> The use of a limited random timeout value allows
> distributing the "destroy" resends, thus avoiding the accumulation
> of lots of "destroy" events at the same time.
>
> The maximum number of conntrack entries (active or inactive) is
> still handled by nf_conntrack_max. Thus, we may start dropping
> packets at some point if we accumulate a lot of inactive conntrack
> entries waiting to deliver the destroy event to userspace.
>
> During my stress tests, consisting of setting a very small buffer
> of 2048 bytes for conntrackd, setting the NETLINK_BROADCAST_ERROR
> socket flag, and generating lots of very small connections, I noticed
> very few destroy entries on the fly waiting to be resent.

Conceptually, I think this all makes sense and is an improvement.

> @@ -240,6 +225,60 @@ static void death_by_timeout(unsigned long ul_conntrack)
> 	NF_CT_STAT_INC(net, delete_list);
> 	clean_from_lists(ct);
> 	spin_unlock_bh(&nf_conntrack_lock);
> +}
> +
> +static void death_by_event(unsigned long ul_conntrack)
> +{
> +	struct nf_conn *ct = (void *)ul_conntrack;
> +	struct net *net = nf_ct_net(ct);
> +
> +	if (nf_conntrack_event(IPCT_DESTROY, ct) < 0) {
> +		/* bad luck, let's retry again */
> +		ct->timeout.expires = jiffies +
> +			(random32() % net->ct.sysctl_events_retry_timeout);
> +		add_timer(&ct->timeout);
> +		return;
> +	}
> +	/* we've got the event delivered, now it's dying */
> +	set_bit(IPS_DYING_BIT, &ct->status);
> +	spin_lock_bh(&nf_conntrack_lock);

_bh is not needed here, the timer is already running in BH context.

> +	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);

Why is _rcu used? The lists are never used for a lookup.

> +	spin_unlock_bh(&nf_conntrack_lock);
> +	nf_ct_helper_destroy(ct);
> +	nf_ct_put(ct);
> +}
> +
> +void nf_ct_setup_event_timer(struct nf_conn *ct)
> +{
> +	struct net *net = nf_ct_net(ct);
> +
> +	nf_ct_delete_from_lists(ct);

This doesn't really belong here. I realize it's because of ctnetlink,
but I think I'd rather have an additional export.
> +	/* add this conntrack to the dying list */
> +	spin_lock_bh(&nf_conntrack_lock);
> +	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
> +				 &net->ct.dying);
> +	/* set a new timer to retry event delivery */
> +	setup_timer(&ct->timeout, death_by_event, (unsigned long)ct);
> +	ct->timeout.expires = jiffies +
> +		(random32() % net->ct.sysctl_events_retry_timeout);
> +	add_timer(&ct->timeout);
> +	spin_unlock_bh(&nf_conntrack_lock);

It's not necessary to hold the lock around the timer setup. There's
only this single reference left to the conntrack - at least it should
be that way.

> +}
> +EXPORT_SYMBOL_GPL(nf_ct_setup_event_timer);
> +
> +static void death_by_timeout(unsigned long ul_conntrack)
> +{
> +	struct nf_conn *ct = (void *)ul_conntrack;
> +
> +	if (!test_bit(IPS_DYING_BIT, &ct->status) &&
> +	    unlikely(nf_conntrack_event(IPCT_DESTROY, ct) < 0)) {
> +		/* destroy event was not delivered */
> +		nf_ct_setup_event_timer(ct);
> +		return;
> +	}
> +	set_bit(IPS_DYING_BIT, &ct->status);
> +	nf_ct_helper_destroy(ct);
> +	nf_ct_delete_from_lists(ct);

The helpers might keep global data themselves that's cleaned up by
->destroy() (the gre keymap for instance). The cleanup step might need
to be done immediately to avoid clashes with new data or incorrectly
returning data that's supposed to be gone.

> 	nf_ct_put(ct);
> }
>
> @@ -1030,6 +1069,22 @@ void nf_conntrack_flush_report(struct net *net, u32 pid, int report)
> }
> EXPORT_SYMBOL_GPL(nf_conntrack_flush_report);
>
> +static void nf_ct_release_dying_list(void)
> +{
> +	struct nf_conntrack_tuple_hash *h;
> +	struct nf_conn *ct;
> +	struct hlist_nulls_node *n;
> +
> +	spin_lock_bh(&nf_conntrack_lock);
> +	hlist_nulls_for_each_entry(h, n, &init_net.ct.dying, hnnode) {
> +		ct = nf_ct_tuplehash_to_ctrack(h);
> +		/* never fails to remove them, no listeners at this point */
> +		if (del_timer(&ct->timeout))
> +			ct->timeout.function((unsigned long)ct);

nf_ct_kill()?
> +	}
> +	spin_unlock_bh(&nf_conntrack_lock);
> +}
> diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
> index 0fa5a42..5fc1fe7 100644
> --- a/net/netfilter/nf_conntrack_helper.c
> +++ b/net/netfilter/nf_conntrack_helper.c
> @@ -136,6 +136,21 @@ static inline int unhelp(struct nf_conntrack_tuple_hash *i,
> 	return 0;
> }
>
> +void nf_ct_helper_destroy(struct nf_conn *ct)
> +{
> +	struct nf_conn_help *help = nfct_help(ct);
> +	struct nf_conntrack_helper *helper;
> +
> +	if (help) {
> +		rcu_read_lock();
> +		helper = rcu_dereference(help->helper);
> +		if (helper && helper->destroy)
> +			helper->destroy(ct);
> +		rcu_read_unlock();
> +	}
> +}
> +EXPORT_SYMBOL_GPL(nf_ct_helper_destroy);

The export looks unnecessary, it's only used in nf_conntrack_core.c
unless I've missed something.

> diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
> index 2dd2910..6695e4b 100644
> --- a/net/netfilter/nf_conntrack_netlink.c
> +++ b/net/netfilter/nf_conntrack_netlink.c
> @@ -558,7 +559,10 @@ ctnetlink_conntrack_event(unsigned int events, struct nf_ct_event *item)
> 	rcu_read_unlock();
>
> 	nlmsg_end(skb, nlh);
> -	nfnetlink_send(skb, item->pid, group, item->report, GFP_ATOMIC);
> +	err = nfnetlink_send(skb, item->pid, group, item->report, GFP_ATOMIC);
> +	if ((err == -ENOBUFS) || (err == -EAGAIN))

minor nit: unnecessary parens