From: Pablo Neira Ayuso <pablo@netfilter.org>
To: Patrick McHardy <kaber@trash.net>
Cc: David Miller <davem@davemloft.net>,
netdev@vger.kernel.org, netfilter-devel@vger.kernel.org
Subject: Re: [RFC] netlink broadcast return value
Date: Tue, 10 Feb 2009 00:58:47 +0100
Message-ID: <4990C337.3040704@netfilter.org>
In-Reply-To: <4990BADA.7040309@trash.net>
[-- Attachment #1: Type: text/plain, Size: 2960 bytes --]
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Patrick McHardy wrote:
>>> We have at least one case where the caller wants to know of
>>> any successful delivery. Keymanager queries done by xfrm_state
>>> want to know whether an acquire was delivered to any keymanager.
>>> So we need to continue to indicate this, maybe using a different
>>> errno code than -ENOBUFS. I don't have a suggestion which one to
>>> use though.
>>
>> Indeed, I have missed that spot. I'm not very familiar with that code,
>> however, I see that the creation of a state depends on the netlink
>> broadcast return value, but how useful is that? I think that the state
>> should be created even if the broadcast fails, the userspace daemon
>> should request a resync to the kernel as soon as it hits ENOBUFS, then
>> it would be in sync again with that state.
>
> The idea is that the kernel is performing an active query. I agree
> that there's nothing wrong with installing the SA and indicating the
> error to userspace. Userspace could dump the SADB and look for new
> larval states, however that's unlikely to be very useful since once
> an overflow occurs, you probably have a lot of states.
Other situations may trigger overflows too: a "slow" reader (for
example, one that spends time on other work instead of retrieving
messages) and a userspace process with too small a receive buffer.
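
To make the two overflow triggers above concrete, here is a minimal
userspace sketch of what a listener could do about them: ask for a
larger receive buffer, and treat ENOBUFS from recv() as "events were
lost, resync needed". The helper names listener_set_rcvbuf() and
listener_lost_events() are made up for illustration; only setsockopt()
with SO_RCVBUF and the ENOBUFS errno are real interfaces. This is not
taken from conntrackd.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: request a larger receive buffer on an event
 * socket. Returns 0 on success, -1 with errno set on failure. The
 * kernel may clamp the value to its configured maximum. */
int listener_set_rcvbuf(int fd, int bytes)
{
	assert(bytes > 0);
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}

/* Hypothetical helper: given the return value of recv() and the errno
 * observed right after it, report whether the socket overflowed, which
 * means at least one event was lost and the listener should resync. */
int listener_lost_events(ssize_t nread, int err)
{
	return nread < 0 && err == ENOBUFS;
}
```

A listener would call listener_set_rcvbuf() once after creating the
socket and check listener_lost_events(n, errno) after every recv().
A bigger buffer only makes overflows less likely; a slow reader can
still hit ENOBUFS eventually.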
> But unless I'm missing something, there's nothing wrong with this
> as long as the error is ignored. The fact that something was received
> by some listener doesn't have any meaning anyways, it might have
> been "ip monitor". Which somehow raises doubt about your proposed
> interface change though, I think anything that wants a reliable
> answer whether a packet was delivered to a process handling it
> appropriately should use unicast.
Don't get me wrong, I agree with you that all netlink_broadcast callers
in the kernel should ignore the return value...
... unless they have "some way" (like in Netfilter) to make event
delivery reliable. I have attached a patch that I haven't sent you yet
because I'm still reviewing and testing it. It adds an entry to /proc
that enables reliable event delivery over netlink by dropping packets
whose events were not delivered. You mentioned that possibility once
during one of our conversations ;).
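
As a rough illustration of the drop decision, here is a simplified
userspace model, not the kernel code. All model_* names are
hypothetical. It covers only the send-error path (-ENOBUFS or -EAGAIN
from the broadcast); the attached patch also reports a failure when the
event message cannot be built.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the netfilter verdicts. */
enum model_verdict { MODEL_DROP = 0, MODEL_ACCEPT = 1 };

/* Returns 1 if the event counts as "not delivered". With the reliable
 * knob off, delivery errors are ignored, as they are today. */
int model_event_failed(int reliable, int delivery_err)
{
	assert(delivery_err <= 0);	/* 0 or a negative errno */
	if (!reliable)
		return 0;
	return delivery_err == -ENOBUFS || delivery_err == -EAGAIN;
}

/* Events are only delivered for packets that were accepted at
 * confirmation; a failed delivery then turns the verdict into a drop. */
enum model_verdict model_confirm(enum model_verdict confirm_verdict,
				 int reliable, int delivery_err)
{
	if (confirm_verdict != MODEL_ACCEPT)
		return confirm_verdict;
	if (model_event_failed(reliable, delivery_err))
		return MODEL_DROP;
	return MODEL_ACCEPT;
}
```

For example, model_confirm(MODEL_ACCEPT, 1, -ENOBUFS) yields
MODEL_DROP, while the same error with the knob off leaves the packet
accepted. Other errors such as -ESRCH (no listeners) do not cause a
drop in this model.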
I'm aware that this option may be dangerous if used by a buggy
process that triggers frequent overflows, but that is the cost of
having reliable logging for ctnetlink (still, this behaviour is not
the default!).
And I need this option to make conntrackd synchronize state changes
appropriately under very heavy load. I've been testing the daemon with
these patches and it reliably synchronizes state changes (my system was
100% busy filtering traffic and fully synchronizing all TCP state
changes in near real time, with a noticeable performance drop of 30% in
terms of filtered connections).
--
"Los honestos son inadaptados sociales" -- Les Luthiers
[-- Attachment #2: ctnetlink-drop-under-stress.patch --]
[-- Type: text/x-diff, Size: 11254 bytes --]
ctnetlink: optional packet drop to make event delivery reliable
From: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a /proc entry to enable reliable ctnetlink event
delivery. The entry is located at:
/proc/sys/net/netfilter/nf_conntrack_netlink_broadcast_reliable
When this entry is != 0, ctnetlink drops the packet if the delivery of
an event over netlink fails. This patch is useful to provide reliable
state synchronization for conntrackd.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/linux/netfilter/nfnetlink.h | 4 +
include/net/netfilter/nf_conntrack_core.h | 6 +-
include/net/netfilter/nf_conntrack_ecache.h | 2 -
include/net/netns/conntrack.h | 2 +
net/netfilter/nf_conntrack_ecache.c | 18 +++--
net/netfilter/nf_conntrack_netlink.c | 108 ++++++++++++++++++++++++++-
net/netfilter/nfnetlink.c | 24 +++++-
7 files changed, 146 insertions(+), 18 deletions(-)
diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
index 7d8e045..b89d5f3 100644
--- a/include/linux/netfilter/nfnetlink.h
+++ b/include/linux/netfilter/nfnetlink.h
@@ -74,8 +74,8 @@ extern int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n);
extern int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n);
extern int nfnetlink_has_listeners(unsigned int group);
-extern int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group,
- int echo);
+extern int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group,
+ int echo);
extern int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags);
extern void nfnl_lock(void);
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index e78afe7..0c6826d 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -62,7 +62,11 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
if (ct) {
if (!nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct))
ret = __nf_conntrack_confirm(skb);
- nf_ct_deliver_cached_events(ct);
+ if (ret == NF_ACCEPT && nf_ct_deliver_cached_events(ct) < 0) {
+ struct net *net = nf_ct_net(ct);
+ NF_CT_STAT_INC_ATOMIC(net, drop);
+ return NF_DROP;
+ }
}
return ret;
}
diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index 0ff0dc6..6e9e1f7 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -28,7 +28,7 @@ extern struct atomic_notifier_head nf_conntrack_chain;
extern int nf_conntrack_register_notifier(struct notifier_block *nb);
extern int nf_conntrack_unregister_notifier(struct notifier_block *nb);
-extern void nf_ct_deliver_cached_events(const struct nf_conn *ct);
+extern int nf_ct_deliver_cached_events(const struct nf_conn *ct);
extern void __nf_ct_event_cache_init(struct nf_conn *ct);
extern void nf_ct_event_cache_flush(struct net *net);
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index f4498a6..1ff61dd 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -20,9 +20,11 @@ struct netns_ct {
int sysctl_acct;
int sysctl_checksum;
unsigned int sysctl_log_invalid; /* Log invalid packets */
+ int sysctl_ctnetlink_event_reliable;
#ifdef CONFIG_SYSCTL
struct ctl_table_header *sysctl_header;
struct ctl_table_header *acct_sysctl_header;
+ struct ctl_table_header *ctnetlink_sysctl_header;
#endif
int hash_vmalloc;
int expect_vmalloc;
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index dee4190..9c21269 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -31,9 +31,11 @@ EXPORT_SYMBOL_GPL(nf_ct_expect_chain);
/* deliver cached events and clear cache entry - must be called with locally
* disabled softirqs */
-static inline void
+static inline int
__nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache)
{
+ int ret = 0;
+
if (nf_ct_is_confirmed(ecache->ct) && !nf_ct_is_dying(ecache->ct)
&& ecache->events) {
struct nf_ct_event item = {
@@ -42,28 +44,32 @@ __nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache)
.report = 0
};
- atomic_notifier_call_chain(&nf_conntrack_chain,
- ecache->events,
- &item);
+ ret = atomic_notifier_call_chain(&nf_conntrack_chain,
+ ecache->events,
+ &item);
+ ret = notifier_to_errno(ret);
}
ecache->events = 0;
nf_ct_put(ecache->ct);
ecache->ct = NULL;
+ return ret;
}
/* Deliver all cached events for a particular conntrack. This is called
* by code prior to async packet handling for freeing the skb */
-void nf_ct_deliver_cached_events(const struct nf_conn *ct)
+int nf_ct_deliver_cached_events(const struct nf_conn *ct)
{
struct net *net = nf_ct_net(ct);
struct nf_conntrack_ecache *ecache;
+ int ret = 0;
local_bh_disable();
ecache = per_cpu_ptr(net->ct.ecache, raw_smp_processor_id());
if (ecache->ct == ct)
- __nf_ct_deliver_cached_events(ecache);
+ ret = __nf_ct_deliver_cached_events(ecache);
local_bh_enable();
+ return ret;
}
EXPORT_SYMBOL_GPL(nf_ct_deliver_cached_events);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 47c2f54..3e0ffb6 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -517,6 +517,8 @@ static int ctnetlink_conntrack_event(struct notifier_block *this,
unsigned int type;
sk_buff_data_t b;
unsigned int flags = 0, group;
+ struct net *net = nf_ct_net(ct);
+ int err;
/* ignore our fake conntrack entry */
if (ct == &nf_conntrack_untracked)
@@ -613,13 +615,20 @@ static int ctnetlink_conntrack_event(struct notifier_block *this,
rcu_read_unlock();
nlh->nlmsg_len = skb->tail - b;
- nfnetlink_send(skb, item->pid, group, item->report);
+ err = nfnetlink_notify(skb, item->pid, group, item->report);
+ if (net->ct.sysctl_ctnetlink_event_reliable &&
+ (err == -ENOBUFS || err == -EAGAIN))
+ return notifier_from_errno(err);
+
return NOTIFY_DONE;
nla_put_failure:
rcu_read_unlock();
nlmsg_failure:
kfree_skb(skb);
+ if (net->ct.sysctl_ctnetlink_event_reliable)
+ return notifier_from_errno(-ENOSPC);
+
return NOTIFY_DONE;
}
#endif /* CONFIG_NF_CONNTRACK_EVENTS */
@@ -1604,7 +1613,8 @@ static int ctnetlink_expect_event(struct notifier_block *this,
struct sk_buff *skb;
unsigned int type;
sk_buff_data_t b;
- int flags = 0;
+ int flags = 0, err;
+ struct net *net = nf_ct_exp_net(exp);
if (events & IPEXP_NEW) {
type = IPCTNL_MSG_EXP_NEW;
@@ -1637,13 +1647,21 @@ static int ctnetlink_expect_event(struct notifier_block *this,
rcu_read_unlock();
nlh->nlmsg_len = skb->tail - b;
- nfnetlink_send(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW, item->report);
+ err = nfnetlink_notify(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW,
+ item->report);
+ if (net->ct.sysctl_ctnetlink_event_reliable &&
+ (err == -ENOBUFS || err == -EAGAIN))
+ return notifier_from_errno(err);
+
return NOTIFY_DONE;
nla_put_failure:
rcu_read_unlock();
nlmsg_failure:
kfree_skb(skb);
+ if (net->ct.sysctl_ctnetlink_event_reliable)
+ return notifier_from_errno(-ENOSPC);
+
return NOTIFY_DONE;
}
#endif
@@ -2003,7 +2021,63 @@ MODULE_ALIAS("ip_conntrack_netlink");
MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK);
MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK_EXP);
-static int __init ctnetlink_init(void)
+#ifdef CONFIG_SYSCTL
+static struct ctl_table ctnetlink_sysctl_table[] = {
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "nf_conntrack_netlink_broadcast_reliable",
+ .data = &init_net.ct.sysctl_ctnetlink_event_reliable,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {}
+};
+
+static int ctnetlink_init_sysctl(struct net *net)
+{
+ struct ctl_table *table;
+
+ table = kmemdup(ctnetlink_sysctl_table, sizeof(ctnetlink_sysctl_table),
+ GFP_KERNEL);
+ if (!table)
+ goto out;
+
+ table[0].data = &net->ct.sysctl_ctnetlink_event_reliable;
+
+ net->ct.ctnetlink_sysctl_header = register_net_sysctl_table(net,
+ nf_net_netfilter_sysctl_path, table);
+ if (!net->ct.ctnetlink_sysctl_header)
+ goto out_register;
+
+ return 0;
+
+out_register:
+ kfree(table);
+out:
+ return -ENOMEM;
+}
+
+static void ctnetlink_fini_sysctl(struct net *net)
+{
+ struct ctl_table *table;
+
+ table = net->ct.ctnetlink_sysctl_header->ctl_table_arg;
+ unregister_net_sysctl_table(net->ct.ctnetlink_sysctl_header);
+ kfree(table);
+}
+#else
+static int ctnetlink_init_sysctl(struct net *net)
+{
+ return 0;
+}
+
+static void ctnetlink_fini_sysctl(struct net *net)
+{
+}
+#endif /* CONFIG_SYSCTL */
+
+static int ctnetlink_net_init(struct net *net)
{
int ret;
@@ -2033,10 +2107,18 @@ static int __init ctnetlink_init(void)
goto err_unreg_notifier;
}
#endif
+ ret = ctnetlink_init_sysctl(net);
+ if (ret < 0) {
+ printk("ctnetlink_init: cannot register sysctl.\n");
+ goto err_unreg_exp_notifier;
+ }
+ net->ct.sysctl_ctnetlink_event_reliable = 0;
return 0;
#ifdef CONFIG_NF_CONNTRACK_EVENTS
+err_unreg_exp_notifier:
+ nf_ct_expect_unregister_notifier(&ctnl_notifier_exp);
err_unreg_notifier:
nf_conntrack_unregister_notifier(&ctnl_notifier);
err_unreg_exp_subsys:
@@ -2048,7 +2130,7 @@ err_out:
return ret;
}
-static void __exit ctnetlink_exit(void)
+static void ctnetlink_net_exit(struct net *net)
{
printk("ctnetlink: unregistering from nfnetlink.\n");
@@ -2059,8 +2141,24 @@ static void __exit ctnetlink_exit(void)
nfnetlink_subsys_unregister(&ctnl_exp_subsys);
nfnetlink_subsys_unregister(&ctnl_subsys);
+ ctnetlink_fini_sysctl(net);
return;
}
+static struct pernet_operations ctnetlink_net_ops = {
+ .init = ctnetlink_net_init,
+ .exit = ctnetlink_net_exit,
+};
+
+static int __init ctnetlink_init(void)
+{
+ return register_pernet_subsys(&ctnetlink_net_ops);
+}
+
+static void __exit ctnetlink_exit(void)
+{
+ unregister_pernet_subsys(&ctnetlink_net_ops);
+}
+
module_init(ctnetlink_init);
module_exit(ctnetlink_exit);
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 9c0ba17..fd7bbf4 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -107,11 +107,29 @@ int nfnetlink_has_listeners(unsigned int group)
}
EXPORT_SYMBOL_GPL(nfnetlink_has_listeners);
-int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group, int echo)
+/* like nlmsg_notify, but we return the multicast error */
+int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group, int report)
{
- return nlmsg_notify(nfnl, skb, pid, group, echo, gfp_any());
+ int err = 0, mcast_err = 0;
+
+ if (group) {
+ int exclude_pid = 0;
+
+ if (report) {
+ atomic_inc(&skb->users);
+ exclude_pid = pid;
+ }
+
+ mcast_err = nlmsg_multicast(nfnl, skb, exclude_pid,
+ group, gfp_any());
+ }
+
+ if (report)
+ err = nlmsg_unicast(nfnl, skb, pid);
+
+ return mcast_err ? mcast_err : err;
}
-EXPORT_SYMBOL_GPL(nfnetlink_send);
+EXPORT_SYMBOL_GPL(nfnetlink_notify);
int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags)
{