From: Pablo Neira Ayuso <pablo@netfilter.org>
To: Patrick McHardy <kaber@trash.net>
Cc: David Miller <davem@davemloft.net>,
netdev@vger.kernel.org, netfilter-devel@vger.kernel.org
Subject: Re: [RFC] netlink broadcast return value
Date: Tue, 10 Feb 2009 00:58:47 +0100 [thread overview]
Message-ID: <4990C337.3040704@netfilter.org> (raw)
In-Reply-To: <4990BADA.7040309@trash.net>
[-- Attachment #1: Type: text/plain, Size: 2960 bytes --]
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Patrick McHardy wrote:
>>> We have at least one case where the caller wants to know of
>>> any successful delivery. Keymanager queries done by xfrm_state
>>> want to know whether an acquire was delivered to any keymanager.
>>> So we need to continue to indicate this, maybe using a different
>>> errno code than -ENOBUFS. I don't have a suggestion which one to
>>> use though.
>>
>> Indeed, I have missed that spot. I'm not very familiar with that code,
>> however, I see that the creation of a state depends on the netlink
>> broadcast return value, but how useful is that? I think that the state
>> should be created even if the broadcast fails, the userspace daemon
>> should request a resync to the kernel as soon as it hits ENOBUFS, then
>> it would be in sync again with that state.
>
> The idea is that the kernel is performing an active query. I agree
> that there's nothing wrong with installing the SA and indicating the
> error to userspace. Userspace could dump the SADB and look for new
> larval states, however thats unlikely to be very useful since once
> an overflow occurs, you probably have a lot of states.
More situations may trigger overflows: a "slow" reader (for example,
spending time on whatever while not retrieving messages) and a userspace
process with too small receive buffer.
> But unless I'm missing something, there's nothing wrong with this
> as long as the error is ignored. The fact that something was received
> by some listener doesn't have any meaning anyways, it might have
> been "ip monitor". Which somehow raises doubt about your proposed
> interface change though, I think anything that wants a reliable
> answer whether a packet was delivered to a process handling it
> appropriately should use unicast.
Don't get me wrong, I agree with you that all netlink_broadcast callers
in the kernel should ignore the return value...
... unless they have "some way" (like in Netfilter) to make event
delivery reliable: I have attached a patch that I didn't send you yet,
I'm still reviewing and testing it. It adds an entry to /proc to enable
reliable event delivery over netlink by dropping packets whose events
were not delivered, you mentioned that possibility once during one of
our conversations ;).
I'm aware of that this option may be dangerous if used by a buggy
process that trigger frequent overflows but it the cost of having
realible logging for ctnetlink (still, this behaviour is not the one by
default!).
And I need this option to make conntrackd synchronize state-changes
appropriately under very heavy load: I've testing the daemon with these
patches and it reliably synchronizes state-changes (my system were 100%
busy filtering traffic and fully synchronizing all TCP state-changes in
near real-time effort, with a noticeable performance drop of 30% in
terms of filtered connections).
--
"Los honestos son inadaptados sociales" -- Les Luthiers
[-- Attachment #2: ctnetlink-drop-under-stress.patch --]
[-- Type: text/x-diff, Size: 11254 bytes --]
ctnetlink: optional packet drop to make event delivery reliable
From: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds /proc entry to enable reliable ctnetlink event
delivery. The entry is located at:
/proc/sys/net/netfilter/nf_conntrack_netlink_broadcast_reliable
When this entry is != 0, ctnetlink drops the packet if the delivery of
an event over netlink fails. This patch is useful to provide reliable
state synchronization for conntrackd.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/linux/netfilter/nfnetlink.h | 4 +
include/net/netfilter/nf_conntrack_core.h | 6 +-
include/net/netfilter/nf_conntrack_ecache.h | 2 -
include/net/netns/conntrack.h | 2 +
net/netfilter/nf_conntrack_ecache.c | 18 +++--
net/netfilter/nf_conntrack_netlink.c | 108 ++++++++++++++++++++++++++-
net/netfilter/nfnetlink.c | 24 +++++-
7 files changed, 146 insertions(+), 18 deletions(-)
diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
index 7d8e045..b89d5f3 100644
--- a/include/linux/netfilter/nfnetlink.h
+++ b/include/linux/netfilter/nfnetlink.h
@@ -74,8 +74,8 @@ extern int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n);
extern int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n);
extern int nfnetlink_has_listeners(unsigned int group);
-extern int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group,
- int echo);
+extern int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group,
+ int echo);
extern int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags);
extern void nfnl_lock(void);
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index e78afe7..0c6826d 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -62,7 +62,11 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
if (ct) {
if (!nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct))
ret = __nf_conntrack_confirm(skb);
- nf_ct_deliver_cached_events(ct);
+ if (ret == NF_ACCEPT && nf_ct_deliver_cached_events(ct) < 0) {
+ struct net *net = nf_ct_net(ct);
+ NF_CT_STAT_INC_ATOMIC(net, drop);
+ return NF_DROP;
+ }
}
return ret;
}
diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index 0ff0dc6..6e9e1f7 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -28,7 +28,7 @@ extern struct atomic_notifier_head nf_conntrack_chain;
extern int nf_conntrack_register_notifier(struct notifier_block *nb);
extern int nf_conntrack_unregister_notifier(struct notifier_block *nb);
-extern void nf_ct_deliver_cached_events(const struct nf_conn *ct);
+extern int nf_ct_deliver_cached_events(const struct nf_conn *ct);
extern void __nf_ct_event_cache_init(struct nf_conn *ct);
extern void nf_ct_event_cache_flush(struct net *net);
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index f4498a6..1ff61dd 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -20,9 +20,11 @@ struct netns_ct {
int sysctl_acct;
int sysctl_checksum;
unsigned int sysctl_log_invalid; /* Log invalid packets */
+ int sysctl_ctnetlink_event_reliable;
#ifdef CONFIG_SYSCTL
struct ctl_table_header *sysctl_header;
struct ctl_table_header *acct_sysctl_header;
+ struct ctl_table_header *ctnetlink_sysctl_header;
#endif
int hash_vmalloc;
int expect_vmalloc;
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index dee4190..9c21269 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -31,9 +31,11 @@ EXPORT_SYMBOL_GPL(nf_ct_expect_chain);
/* deliver cached events and clear cache entry - must be called with locally
* disabled softirqs */
-static inline void
+static inline int
__nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache)
{
+ int ret = 0;
+
if (nf_ct_is_confirmed(ecache->ct) && !nf_ct_is_dying(ecache->ct)
&& ecache->events) {
struct nf_ct_event item = {
@@ -42,28 +44,32 @@ __nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache)
.report = 0
};
- atomic_notifier_call_chain(&nf_conntrack_chain,
- ecache->events,
- &item);
+ ret = atomic_notifier_call_chain(&nf_conntrack_chain,
+ ecache->events,
+ &item);
+ ret = notifier_to_errno(ret);
}
ecache->events = 0;
nf_ct_put(ecache->ct);
ecache->ct = NULL;
+ return ret;
}
/* Deliver all cached events for a particular conntrack. This is called
* by code prior to async packet handling for freeing the skb */
-void nf_ct_deliver_cached_events(const struct nf_conn *ct)
+int nf_ct_deliver_cached_events(const struct nf_conn *ct)
{
struct net *net = nf_ct_net(ct);
struct nf_conntrack_ecache *ecache;
+ int ret = 0;
local_bh_disable();
ecache = per_cpu_ptr(net->ct.ecache, raw_smp_processor_id());
if (ecache->ct == ct)
- __nf_ct_deliver_cached_events(ecache);
+ ret = __nf_ct_deliver_cached_events(ecache);
local_bh_enable();
+ return ret;
}
EXPORT_SYMBOL_GPL(nf_ct_deliver_cached_events);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 47c2f54..3e0ffb6 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -517,6 +517,8 @@ static int ctnetlink_conntrack_event(struct notifier_block *this,
unsigned int type;
sk_buff_data_t b;
unsigned int flags = 0, group;
+ struct net *net = nf_ct_net(ct);
+ int err;
/* ignore our fake conntrack entry */
if (ct == &nf_conntrack_untracked)
@@ -613,13 +615,20 @@ static int ctnetlink_conntrack_event(struct notifier_block *this,
rcu_read_unlock();
nlh->nlmsg_len = skb->tail - b;
- nfnetlink_send(skb, item->pid, group, item->report);
+ err = nfnetlink_notify(skb, item->pid, group, item->report);
+ if (net->ct.sysctl_ctnetlink_event_reliable &&
+ (err == -ENOBUFS || err == -EAGAIN))
+ return notifier_from_errno(err);
+
return NOTIFY_DONE;
nla_put_failure:
rcu_read_unlock();
nlmsg_failure:
kfree_skb(skb);
+ if (net->ct.sysctl_ctnetlink_event_reliable)
+ return notifier_from_errno(-ENOSPC);
+
return NOTIFY_DONE;
}
#endif /* CONFIG_NF_CONNTRACK_EVENTS */
@@ -1604,7 +1613,8 @@ static int ctnetlink_expect_event(struct notifier_block *this,
struct sk_buff *skb;
unsigned int type;
sk_buff_data_t b;
- int flags = 0;
+ int flags = 0, err;
+ struct net *net = nf_ct_exp_net(exp);
if (events & IPEXP_NEW) {
type = IPCTNL_MSG_EXP_NEW;
@@ -1637,13 +1647,21 @@ static int ctnetlink_expect_event(struct notifier_block *this,
rcu_read_unlock();
nlh->nlmsg_len = skb->tail - b;
- nfnetlink_send(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW, item->report);
+ err = nfnetlink_notify(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW,
+ item->report);
+ if (net->ct.sysctl_ctnetlink_event_reliable &&
+ (err == -ENOBUFS || err == -EAGAIN))
+ return notifier_from_errno(err);
+
return NOTIFY_DONE;
nla_put_failure:
rcu_read_unlock();
nlmsg_failure:
kfree_skb(skb);
+ if (net->ct.sysctl_ctnetlink_event_reliable)
+ return notifier_from_errno(-ENOSPC);
+
return NOTIFY_DONE;
}
#endif
@@ -2003,7 +2021,63 @@ MODULE_ALIAS("ip_conntrack_netlink");
MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK);
MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK_EXP);
-static int __init ctnetlink_init(void)
+#ifdef CONFIG_SYSCTL
+static struct ctl_table ctnetlink_sysctl_table[] = {
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "nf_conntrack_netlink_broadcast_reliable",
+ .data = &init_net.ct.sysctl_ctnetlink_event_reliable,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {}
+};
+
+static int ctnetlink_init_sysctl(struct net *net)
+{
+ struct ctl_table *table;
+
+ table = kmemdup(ctnetlink_sysctl_table, sizeof(ctnetlink_sysctl_table),
+ GFP_KERNEL);
+ if (!table)
+ goto out;
+
+ table[0].data = &net->ct.sysctl_ctnetlink_event_reliable;
+
+ net->ct.ctnetlink_sysctl_header = register_net_sysctl_table(net,
+ nf_net_netfilter_sysctl_path, table);
+ if (!net->ct.ctnetlink_sysctl_header)
+ goto out_register;
+
+ return 0;
+
+out_register:
+ kfree(table);
+out:
+ return -ENOMEM;
+}
+
+static void ctnetlink_fini_sysctl(struct net *net)
+{
+ struct ctl_table *table;
+
+ table = net->ct.ctnetlink_sysctl_header->ctl_table_arg;
+ unregister_net_sysctl_table(net->ct.ctnetlink_sysctl_header);
+ kfree(table);
+}
+#else
+static int ctnetlink_init_sysctl(struct net *net)
+{
+ return 0;
+}
+
+static void ctnetlink_fini_sysctl(struct net *net)
+{
+}
+#endif /* CONFIG_SYSCTL */
+
+static int ctnetlink_net_init(struct net *net)
{
int ret;
@@ -2033,10 +2107,18 @@ static int __init ctnetlink_init(void)
goto err_unreg_notifier;
}
#endif
+ ret = ctnetlink_init_sysctl(net);
+ if (ret < 0) {
+ printk("ctnetlink_init: cannot register sysctl.\n");
+ goto err_unreg_exp_notifier;
+ }
+ net->ct.sysctl_ctnetlink_event_reliable = 0;
return 0;
#ifdef CONFIG_NF_CONNTRACK_EVENTS
+err_unreg_exp_notifier:
+ nf_ct_expect_unregister_notifier(&ctnl_notifier_exp);
err_unreg_notifier:
nf_conntrack_unregister_notifier(&ctnl_notifier);
err_unreg_exp_subsys:
@@ -2048,7 +2130,7 @@ err_out:
return ret;
}
-static void __exit ctnetlink_exit(void)
+static void ctnetlink_net_exit(struct net *net)
{
printk("ctnetlink: unregistering from nfnetlink.\n");
@@ -2059,8 +2141,24 @@ static void __exit ctnetlink_exit(void)
nfnetlink_subsys_unregister(&ctnl_exp_subsys);
nfnetlink_subsys_unregister(&ctnl_subsys);
+ ctnetlink_fini_sysctl(net);
return;
}
+static struct pernet_operations ctnetlink_net_ops = {
+ .init = ctnetlink_net_init,
+ .exit = ctnetlink_net_exit,
+};
+
+static int __init ctnetlink_init(void)
+{
+ return register_pernet_subsys(&ctnetlink_net_ops);
+}
+
+static void __exit ctnetlink_exit(void)
+{
+ unregister_pernet_subsys(&ctnetlink_net_ops);
+}
+
module_init(ctnetlink_init);
module_exit(ctnetlink_exit);
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 9c0ba17..fd7bbf4 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -107,11 +107,29 @@ int nfnetlink_has_listeners(unsigned int group)
}
EXPORT_SYMBOL_GPL(nfnetlink_has_listeners);
-int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group, int echo)
+/* like nlmsg_notify, but we return the multicast error */
+int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group, int report)
{
- return nlmsg_notify(nfnl, skb, pid, group, echo, gfp_any());
+ int err = 0, mcast_err = 0;
+
+ if (group) {
+ int exclude_pid = 0;
+
+ if (report) {
+ atomic_inc(&skb->users);
+ exclude_pid = pid;
+ }
+
+ mcast_err = nlmsg_multicast(nfnl, skb, exclude_pid,
+ group, gfp_any());
+ }
+
+ if (report)
+ err = nlmsg_unicast(nfnl, skb, pid);
+
+ return mcast_err ? mcast_err : err;
}
-EXPORT_SYMBOL_GPL(nfnetlink_send);
+EXPORT_SYMBOL_GPL(nfnetlink_notify);
int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags)
{
next prev parent reply other threads:[~2009-02-09 23:59 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-02-01 13:33 [RFC] netlink broadcast return value Pablo Neira Ayuso
2009-02-02 22:05 ` David Miller
2009-02-09 14:17 ` Patrick McHardy
2009-02-09 22:51 ` Pablo Neira Ayuso
2009-02-09 23:23 ` Patrick McHardy
2009-02-09 23:58 ` Pablo Neira Ayuso [this message]
2009-02-10 13:50 ` Patrick McHardy
2009-02-10 18:51 ` Pablo Neira Ayuso
2009-02-11 12:44 ` Patrick McHardy
2009-02-11 16:39 ` Pablo Neira Ayuso
2009-02-11 16:54 ` Patrick McHardy
2009-02-11 21:01 ` Pablo Neira Ayuso
2009-02-12 5:07 ` Patrick McHardy
2009-02-12 12:36 ` Pablo Neira Ayuso
2009-02-12 12:41 ` Pablo Neira Ayuso
2009-02-12 12:48 ` Patrick McHardy
2009-02-12 13:20 ` Pablo Neira Ayuso
2009-02-12 13:25 ` Patrick McHardy
2009-02-12 12:45 ` Patrick McHardy
2009-02-02 22:35 ` Inaky Perez-Gonzalez
2009-02-03 10:07 ` Pablo Neira Ayuso
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4990C337.3040704@netfilter.org \
--to=pablo@netfilter.org \
--cc=davem@davemloft.net \
--cc=kaber@trash.net \
--cc=netdev@vger.kernel.org \
--cc=netfilter-devel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.