* [RFC] netlink broadcast return value
From: Pablo Neira Ayuso @ 2009-02-01 13:33 UTC
To: netdev; +Cc: Patrick McHardy, Netfilter Development Mailinglist
[-- Attachment #1: Type: text/plain, Size: 3439 bytes --]
Currently, and according to my reading of the source code,
netlink_broadcast() only reports an error to the caller if no
messages at all were delivered:
1) If at least one message has been delivered correctly, it returns 0.
2) Otherwise, if no messages at all were delivered due to an skb_clone()
failure, it returns -ENOBUFS.
3) Otherwise, if there are no listeners, it returns -ESRCH.
Instead, I need the caller to know whether delivery of any of the
messages to the listeners has failed:
1) If it fails to deliver any message (for whatever reason), return
-ENOBUFS.
2) If all messages were delivered OK, return 0.
3) If there are no listeners, return -ESRCH.
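To illustrate, with these semantics a caller could act upon delivery
failures along these lines (just a sketch, not part of the patch;
handle_delivery_failure() is a made-up placeholder for whatever
corrective action the subsystem takes):

        err = netlink_broadcast(ctnl, skb, 0, group, GFP_ATOMIC);
        if (err == -ESRCH)
                err = 0;        /* no listeners at all: nothing to do */
        else if (err == -ENOBUFS)
                /* at least one listener missed the message */
                handle_delivery_failure();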
In the current ctnetlink code, and in Netfilter in general, we could
then add reliable logging and connection-tracking event delivery by
dropping the packets whose events were not successfully delivered over
Netlink. Of course, this option would be settable via /proc, since this
approach reduces performance (in terms of connections filtered per
second by a stateful firewall) while providing reliable logging and
event delivery (for conntrackd) in return.
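For ctnetlink, the idea boils down to something like this sketch at
packet-confirmation time (illustrative only; reliable_event_delivery
stands for the /proc toggle and deliver_ct_events() for the event
delivery path):

        if (reliable_event_delivery && deliver_ct_events(ct) < 0)
                return NF_DROP; /* the retransmitted packet will trigger
                                 * the same event again later */
        return NF_ACCEPT;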
I have checked the whole kernel tree for current users of
netlink_broadcast() to see how they handle the errors reported and how
a change in the return value would affect them. A short summary
follows:
= current list of clients of netlink_broadcast() =
== netlink_broadcast() ==
Handling
drivers/scsi/scsi_transport_iscsi.c : printk error
drivers/connector/connector.c : cn_netlink_send() return value
include/net/netlink.h : nlmsg_multicast() return value
lib/kobject_uevent.c : ignores return value
net/core/rtnetlink.c : ignores return value
net/ipv4/netfilter/ipt_ULOG.c : ignores return value
net/bridge/netfilter/ebt_ulog.c : ignores return value
net/decnet/netfilter/dn_rtmsg.c : ignores return value
security/selinux/netlink.c : ignores return value
== cn_netlink_send (uses netlink_broadcast return value) ==
drivers/w1/w1_netlink.c : ignores return value
drivers/video/uvesafb.c : printk error (if err != -ESRCH)
== nlmsg_multicast (calls netlink_broadcast) ==
drivers/scsi/scsi_transport_fc.c : printk error (if err != -ESRCH)
include/net/genetlink.h : genlmsg_multicast() return value
net/xfrm/xfrm_user.c : xfrm_send_migrate() return value
xfrm_exp_state_notify() return value
xfrm_aevent_state_notify() return value
xfrm_notify_sa_flush() return value
xfrm_notify_sa() return value
xfrm_send_acquire() return value
xfrm_exp_policy_notify() return value
xfrm_notify_policy() return value
xfrm_notify_policy_flush() return value
xfrm_send_report() return value
xfrm_send_mapping() return value
...
later, they all ignore the return value
== genlmsg_multicast (calls nlmsg_multicast) ==
net/netlink/genetlink.c : ignores return value
drivers/acpi/event.c : printk error
fs/dquot.c : printk error (if err != -ESRCH)
net/wireless/nl80211.c : ignores return value
In short, I think that the change I'm proposing would also require
fixing some netlink_broadcast() clients so that they skip ENOBUFS
errors: these errors are not meaningful to them, since they assume that
Netlink is unreliable and so the return value does not provide any
useful information.
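The fix for those clients would mostly be a one-liner along these
lines (sketch):

        err = netlink_broadcast(sk, skb, 0, group, GFP_KERNEL);
        if (err == -ENOBUFS)
                err = 0;        /* delivery is best-effort: ignore overruns */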
Please, let me know how crazy this idea is ;).
--
"Los honestos son inadaptados sociales" -- Les Luthiers
[-- Attachment #2: netlink-broadcast-delivery-failure.patch --]
[-- Type: text/x-diff, Size: 1255 bytes --]
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 480184a..26e1a89 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -943,6 +943,7 @@ struct netlink_broadcast_data {
 	u32 pid;
 	u32 group;
 	int failure;
+	int delivery_failure;
 	int congested;
 	int delivered;
 	gfp_t allocation;
@@ -992,6 +993,7 @@ static inline int do_one_broadcast(struct sock *sk,
 		p->skb2 = NULL;
 	} else if ((val = netlink_broadcast_deliver(sk, p->skb2)) < 0) {
 		netlink_overrun(sk);
+		p->delivery_failure = 1;
 	} else {
 		p->congested |= val;
 		p->delivered = 1;
@@ -1018,6 +1020,7 @@ int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, u32 pid,
 	info.pid = pid;
 	info.group = group;
 	info.failure = 0;
+	info.delivery_failure = 0;
 	info.congested = 0;
 	info.delivered = 0;
 	info.allocation = allocation;
@@ -1038,13 +1041,14 @@ int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, u32 pid,
 	if (info.skb2)
 		kfree_skb(info.skb2);
 
+	if (info.delivery_failure || info.failure)
+		return -ENOBUFS;
+
 	if (info.delivered) {
 		if (info.congested && (allocation & __GFP_WAIT))
 			yield();
 		return 0;
 	}
-	if (info.failure)
-		return -ENOBUFS;
 	return -ESRCH;
 }
 EXPORT_SYMBOL(netlink_broadcast);
* Re: [RFC] netlink broadcast return value
From: David Miller @ 2009-02-02 22:05 UTC
To: pablo; +Cc: netdev, kaber, netfilter-devel
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Sun, 01 Feb 2009 14:33:57 +0100
> In short, I think that the change that I'm proposing would also require
> to fix some netlink_broadcast() clients to skip ENOBUFS errors: they are
> not meaningful for them since they assume that Netlink is unreliable and
> so the return value does not provide any useful information.
I think this analysis is accurate.
Please proceed to do the netlink_broadcast() client changes
and then submit those with this patch and I'll queue it all
up for net-next-2.6.
Thanks!
* Re: [RFC] netlink broadcast return value
From: Inaky Perez-Gonzalez @ 2009-02-02 22:35 UTC
To: Pablo Neira Ayuso
Cc: netdev, Patrick McHardy, Netfilter Development Mailinglist
On Sunday 01 February 2009, Pablo Neira Ayuso wrote:
> == genlmsg_multicast (calls nlmsg_multicast) ==
> net/netlink/genetlink.c : ignores return value
> drivers/acpi/event.c : printk error
> fs/dquot.c : printk error (if err != -ESRCH)
> net/wireless/nl80211.c : ignores return value
>
> In short, I think that the change that I'm proposing would also require
> to fix some netlink_broadcast() clients to skip ENOBUFS errors: they are
> not meaningful for them since they assume that Netlink is unreliable and
> so the return value does not provide any useful information.
You are missing a few callers in net/wimax. Which kernel version did you
do the analysis on?
--
Inaky
* Re: [RFC] netlink broadcast return value
From: Pablo Neira Ayuso @ 2009-02-03 10:07 UTC
To: Inaky Perez-Gonzalez
Cc: netdev, Patrick McHardy, Netfilter Development Mailinglist
Hi Iñaky,
Inaky Perez-Gonzalez wrote:
> On Sunday 01 February 2009, Pablo Neira Ayuso wrote:
>
>> == genlmsg_multicast (calls nlmsg_multicast) ==
>> net/netlink/genetlink.c : ignores return value
>> drivers/acpi/event.c : printk error
>> fs/dquot.c : printk error (if err != -ESRCH)
>> net/wireless/nl80211.c : ignores return value
>>
>> In short, I think that the change that I'm proposing would also require
>> to fix some netlink_broadcast() clients to skip ENOBUFS errors: they are
>> not meaningful for them since they assume that Netlink is unreliable and
>> so the return value does not provide any useful information.
>
> You are missing a few callers in net/wimax. Which kernel version did you
> do the analysis on?
I was using Patrick's tree (nf-2.6.git), which did not contain net/wimax
yet. I'm going to re-check against David's tree and let you know.
--
"Los honestos son inadaptados sociales" -- Les Luthiers
* Re: [RFC] netlink broadcast return value
From: Patrick McHardy @ 2009-02-09 14:17 UTC
To: David Miller; +Cc: pablo, netdev, netfilter-devel
David Miller wrote:
> From: Pablo Neira Ayuso <pablo@netfilter.org>
> Date: Sun, 01 Feb 2009 14:33:57 +0100
>
>> In short, I think that the change that I'm proposing would also require
>> to fix some netlink_broadcast() clients to skip ENOBUFS errors: they are
>> not meaningful for them since they assume that Netlink is unreliable and
>> so the return value does not provide any useful information.
>
> I think this analysis is accurate.
We have at least one case where the caller wants to know of
any successful delivery. Keymanager queries done by xfrm_state
want to know whether an acquire was delivered to any keymanager.
So we need to continue to indicate this, maybe using a different
errno code than -ENOBUFS. I don't have a suggestion for which one
to use, though.
* Re: [RFC] netlink broadcast return value
From: Pablo Neira Ayuso @ 2009-02-09 22:51 UTC
To: Patrick McHardy; +Cc: David Miller, netdev, netfilter-devel
Patrick McHardy wrote:
> David Miller wrote:
>> From: Pablo Neira Ayuso <pablo@netfilter.org>
>> Date: Sun, 01 Feb 2009 14:33:57 +0100
>>
>>> In short, I think that the change that I'm proposing would also require
>>> to fix some netlink_broadcast() clients to skip ENOBUFS errors: they are
>>> not meaningful for them since they assume that Netlink is unreliable and
>>> so the return value does not provide any useful information.
>>
>> I think this analysis is accurate.
>
> We have at least one case where the caller wants to know of
> any successful delivery. Keymanager queries done by xfrm_state
> want to know whether an acquire was delivered to any keymanager.
> So we need to continue to indicate this, maybe using a different
> errno code than -ENOBUFS. I don't have a suggestion which one to
> use though.
Indeed, I missed that spot. I'm not very familiar with that code;
however, I see that the creation of a state depends on the netlink
broadcast return value, but how useful is that? I think that the state
should be created even if the broadcast fails: the userspace daemon
should request a resync from the kernel as soon as it hits ENOBUFS, and
then it would be in sync with that state again.
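On the daemon side, that would look something like this (userspace
sketch; resync_with_kernel() is a placeholder for sending a netlink
dump request):

        if (recv(fd, buf, sizeof(buf), 0) < 0 && errno == ENOBUFS) {
                /* the kernel could not queue all events for us, our
                 * view is stale: request a full dump to resync */
                resync_with_kernel(fd);
        }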
--
"Los honestos son inadaptados sociales" -- Les Luthiers
* Re: [RFC] netlink broadcast return value
From: Patrick McHardy @ 2009-02-09 23:23 UTC
To: Pablo Neira Ayuso; +Cc: David Miller, netdev, netfilter-devel
Pablo Neira Ayuso wrote:
> Patrick McHardy wrote:
>> David Miller wrote:
>>> From: Pablo Neira Ayuso <pablo@netfilter.org>
>>> Date: Sun, 01 Feb 2009 14:33:57 +0100
>>>
>>>> In short, I think that the change that I'm proposing would also require
>>>> to fix some netlink_broadcast() clients to skip ENOBUFS errors: they are
>>>> not meaningful for them since they assume that Netlink is unreliable and
>>>> so the return value does not provide any useful information.
>>> I think this analysis is accurate.
>> We have at least one case where the caller wants to know of
>> any successful delivery. Keymanager queries done by xfrm_state
>> want to know whether an acquire was delivered to any keymanager.
>> So we need to continue to indicate this, maybe using a different
>> errno code than -ENOBUFS. I don't have a suggestion which one to
>> use though.
>
> Indeed, I have missed that spot. I'm not very familiar with that code,
> however, I see that the creation of a state depends on the netlink
> broadcast return value, but how useful is that? I think that the state
> should be created even if the broadcast fails, the userspace daemon
> should request a resync to the kernel as soon as it hits ENOBUFS, then
> it would be in sync again with that state.
The idea is that the kernel is performing an active query. I agree
that there's nothing wrong with installing the SA and indicating the
error to userspace. Userspace could dump the SADB and look for new
larval states; however, that's unlikely to be very useful, since once
an overflow occurs you probably have a lot of states.
But unless I'm missing something, there's nothing wrong with this
as long as the error is ignored. The fact that something was received
by some listener doesn't have any meaning anyway; it might have
been "ip monitor". Which somehow raises doubts about your proposed
interface change though: I think anything that wants a reliable
answer on whether a packet was delivered to a process handling it
appropriately should use unicast.
* Re: [RFC] netlink broadcast return value
From: Pablo Neira Ayuso @ 2009-02-09 23:58 UTC
To: Patrick McHardy; +Cc: David Miller, netdev, netfilter-devel
[-- Attachment #1: Type: text/plain, Size: 2960 bytes --]
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Patrick McHardy wrote:
>>> We have at least one case where the caller wants to know of
>>> any successful delivery. Keymanager queries done by xfrm_state
>>> want to know whether an acquire was delivered to any keymanager.
>>> So we need to continue to indicate this, maybe using a different
>>> errno code than -ENOBUFS. I don't have a suggestion which one to
>>> use though.
>>
>> Indeed, I have missed that spot. I'm not very familiar with that code,
>> however, I see that the creation of a state depends on the netlink
>> broadcast return value, but how useful is that? I think that the state
>> should be created even if the broadcast fails, the userspace daemon
>> should request a resync to the kernel as soon as it hits ENOBUFS, then
>> it would be in sync again with that state.
>
> The idea is that the kernel is performing an active query. I agree
> that there's nothing wrong with installing the SA and indicating the
> error to userspace. Userspace could dump the SADB and look for new
> larval states, however thats unlikely to be very useful since once
> an overflow occurs, you probably have a lot of states.
More situations may trigger overflows: a "slow" reader (for example,
one that spends time on other work while not retrieving messages) or a
userspace process with a too-small receive buffer.
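For reference, a listener can reduce the chance of overruns by
enlarging its receive buffer with the plain socket API (the 2 MB figure
is just an example):

        int sz = 2 * 1024 * 1024;
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));
        /* or SO_RCVBUFFORCE (needs CAP_NET_ADMIN) to exceed rmem_max */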
> But unless I'm missing something, there's nothing wrong with this
> as long as the error is ignored. The fact that something was received
> by some listener doesn't have any meaning anyways, it might have
> been "ip monitor". Which somehow raises doubt about your proposed
> interface change though, I think anything that wants a reliable
> answer whether a packet was delivered to a process handling it
> appropriately should use unicast.
Don't get me wrong, I agree with you that all netlink_broadcast callers
in the kernel should ignore the return value...
... unless they have "some way" (like in Netfilter) to make event
delivery reliable: I have attached a patch that I didn't send you yet;
I'm still reviewing and testing it. It adds an entry to /proc to enable
reliable event delivery over netlink by dropping packets whose events
were not delivered; you mentioned that possibility once during one of
our conversations ;).
I'm aware that this option may be dangerous if used by a buggy
process that triggers frequent overflows, but that is the cost of having
reliable logging for ctnetlink (still, this behaviour is not the
default!).
And I need this option to make conntrackd synchronize state-changes
appropriately under very heavy load: I've been testing the daemon with
these patches and it reliably synchronizes state-changes (my system was
100% busy filtering traffic while fully synchronizing all TCP
state-changes in near real-time, with a noticeable performance drop of
30% in terms of filtered connections).
--
"Los honestos son inadaptados sociales" -- Les Luthiers
[-- Attachment #2: ctnetlink-drop-under-stress.patch --]
[-- Type: text/x-diff, Size: 11254 bytes --]
ctnetlink: optional packet drop to make event delivery reliable
From: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a /proc entry to enable reliable ctnetlink event
delivery. The entry is located at:
/proc/sys/net/netfilter/nf_conntrack_netlink_broadcast_reliable
When this entry is != 0, ctnetlink drops the packet if the delivery of
an event over netlink fails. This patch is useful to provide reliable
state synchronization for conntrackd.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter/nfnetlink.h         |    4 +
 include/net/netfilter/nf_conntrack_core.h   |    6 +-
 include/net/netfilter/nf_conntrack_ecache.h |    2 -
 include/net/netns/conntrack.h               |    2 +
 net/netfilter/nf_conntrack_ecache.c         |   18 +++--
 net/netfilter/nf_conntrack_netlink.c        |  108 ++++++++++++++++++++++++++-
 net/netfilter/nfnetlink.c                   |   24 +++-
 7 files changed, 146 insertions(+), 18 deletions(-)

diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
index 7d8e045..b89d5f3 100644
--- a/include/linux/netfilter/nfnetlink.h
+++ b/include/linux/netfilter/nfnetlink.h
@@ -74,8 +74,8 @@ extern int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n);
 extern int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n);
 
 extern int nfnetlink_has_listeners(unsigned int group);
-extern int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group,
-			  int echo);
+extern int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group,
+			    int echo);
 extern int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags);
 
 extern void nfnl_lock(void);
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index e78afe7..0c6826d 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -62,7 +62,11 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
 	if (ct) {
 		if (!nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct))
 			ret = __nf_conntrack_confirm(skb);
-		nf_ct_deliver_cached_events(ct);
+		if (ret == NF_ACCEPT && nf_ct_deliver_cached_events(ct) < 0) {
+			struct net *net = nf_ct_net(ct);
+			NF_CT_STAT_INC_ATOMIC(net, drop);
+			return NF_DROP;
+		}
 	}
 	return ret;
 }
diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index 0ff0dc6..6e9e1f7 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -28,7 +28,7 @@ extern struct atomic_notifier_head nf_conntrack_chain;
 extern int nf_conntrack_register_notifier(struct notifier_block *nb);
 extern int nf_conntrack_unregister_notifier(struct notifier_block *nb);
 
-extern void nf_ct_deliver_cached_events(const struct nf_conn *ct);
+extern int nf_ct_deliver_cached_events(const struct nf_conn *ct);
 extern void __nf_ct_event_cache_init(struct nf_conn *ct);
 extern void nf_ct_event_cache_flush(struct net *net);
 
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index f4498a6..1ff61dd 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -20,9 +20,11 @@ struct netns_ct {
 	int sysctl_acct;
 	int sysctl_checksum;
 	unsigned int sysctl_log_invalid; /* Log invalid packets */
+	int sysctl_ctnetlink_event_reliable;
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_header *sysctl_header;
 	struct ctl_table_header *acct_sysctl_header;
+	struct ctl_table_header *ctnetlink_sysctl_header;
 #endif
 	int hash_vmalloc;
 	int expect_vmalloc;
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index dee4190..9c21269 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -31,9 +31,11 @@ EXPORT_SYMBOL_GPL(nf_ct_expect_chain);
 
 /* deliver cached events and clear cache entry - must be called with locally
  * disabled softirqs */
-static inline void
+static inline int
 __nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache)
 {
+	int ret = 0;
+
 	if (nf_ct_is_confirmed(ecache->ct) && !nf_ct_is_dying(ecache->ct)
 	    && ecache->events) {
 		struct nf_ct_event item = {
@@ -42,28 +44,32 @@ __nf_ct_deliver_cached_events(struct nf_conntrack_ecache *ecache)
 			.report	= 0
 		};
 
-		atomic_notifier_call_chain(&nf_conntrack_chain,
-					   ecache->events,
-					   &item);
+		ret = atomic_notifier_call_chain(&nf_conntrack_chain,
+						 ecache->events,
+						 &item);
+		ret = notifier_to_errno(ret);
 	}
 	ecache->events = 0;
 	nf_ct_put(ecache->ct);
 	ecache->ct = NULL;
+	return ret;
 }
 
 /* Deliver all cached events for a particular conntrack. This is called
  * by code prior to async packet handling for freeing the skb */
-void nf_ct_deliver_cached_events(const struct nf_conn *ct)
+int nf_ct_deliver_cached_events(const struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 	struct nf_conntrack_ecache *ecache;
+	int ret = 0;
 
 	local_bh_disable();
 	ecache = per_cpu_ptr(net->ct.ecache, raw_smp_processor_id());
 	if (ecache->ct == ct)
-		__nf_ct_deliver_cached_events(ecache);
+		ret = __nf_ct_deliver_cached_events(ecache);
 	local_bh_enable();
+	return ret;
 }
 EXPORT_SYMBOL_GPL(nf_ct_deliver_cached_events);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 47c2f54..3e0ffb6 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -517,6 +517,8 @@ static int ctnetlink_conntrack_event(struct notifier_block *this,
 	unsigned int type;
 	sk_buff_data_t b;
 	unsigned int flags = 0, group;
+	struct net *net = nf_ct_net(ct);
+	int err;
 
 	/* ignore our fake conntrack entry */
 	if (ct == &nf_conntrack_untracked)
@@ -613,13 +615,20 @@ static int ctnetlink_conntrack_event(struct notifier_block *this,
 	rcu_read_unlock();
 
 	nlh->nlmsg_len = skb->tail - b;
-	nfnetlink_send(skb, item->pid, group, item->report);
+	err = nfnetlink_notify(skb, item->pid, group, item->report);
+	if (net->ct.sysctl_ctnetlink_event_reliable &&
+	    (err == -ENOBUFS || err == -EAGAIN))
+		return notifier_from_errno(err);
+
 	return NOTIFY_DONE;
 
 nla_put_failure:
 	rcu_read_unlock();
 nlmsg_failure:
 	kfree_skb(skb);
+	if (net->ct.sysctl_ctnetlink_event_reliable)
+		return notifier_from_errno(-ENOSPC);
+
 	return NOTIFY_DONE;
 }
 #endif /* CONFIG_NF_CONNTRACK_EVENTS */
@@ -1604,7 +1613,8 @@ static int ctnetlink_expect_event(struct notifier_block *this,
 	struct sk_buff *skb;
 	unsigned int type;
 	sk_buff_data_t b;
-	int flags = 0;
+	int flags = 0, err;
+	struct net *net = nf_ct_exp_net(exp);
 
 	if (events & IPEXP_NEW) {
 		type = IPCTNL_MSG_EXP_NEW;
@@ -1637,13 +1647,21 @@ static int ctnetlink_expect_event(struct notifier_block *this,
 	rcu_read_unlock();
 
 	nlh->nlmsg_len = skb->tail - b;
-	nfnetlink_send(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW, item->report);
+	err = nfnetlink_notify(skb, item->pid, NFNLGRP_CONNTRACK_EXP_NEW,
+			       item->report);
+	if (net->ct.sysctl_ctnetlink_event_reliable &&
+	    (err == -ENOBUFS || err == -EAGAIN))
+		return notifier_from_errno(err);
+
 	return NOTIFY_DONE;
 
 nla_put_failure:
 	rcu_read_unlock();
 nlmsg_failure:
 	kfree_skb(skb);
+	if (net->ct.sysctl_ctnetlink_event_reliable)
+		return notifier_from_errno(-ENOSPC);
+
 	return NOTIFY_DONE;
 }
 #endif
@@ -2003,7 +2021,63 @@ MODULE_ALIAS("ip_conntrack_netlink");
 MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK);
 MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTNETLINK_EXP);
 
-static int __init ctnetlink_init(void)
+#ifdef CONFIG_SYSCTL
+static struct ctl_table ctnetlink_sysctl_table[] = {
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "nf_conntrack_netlink_broadcast_reliable",
+		.data		= &init_net.ct.sysctl_ctnetlink_event_reliable,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{}
+};
+
+static int ctnetlink_init_sysctl(struct net *net)
+{
+	struct ctl_table *table;
+
+	table = kmemdup(ctnetlink_sysctl_table, sizeof(ctnetlink_sysctl_table),
+			GFP_KERNEL);
+	if (!table)
+		goto out;
+
+	table[0].data = &net->ct.sysctl_ctnetlink_event_reliable;
+
+	net->ct.ctnetlink_sysctl_header = register_net_sysctl_table(net,
+			nf_net_netfilter_sysctl_path, table);
+	if (!net->ct.ctnetlink_sysctl_header)
+		goto out_register;
+
+	return 0;
+
+out_register:
+	kfree(table);
+out:
+	return -ENOMEM;
+}
+
+static void ctnetlink_fini_sysctl(struct net *net)
+{
+	struct ctl_table *table;
+
+	table = net->ct.ctnetlink_sysctl_header->ctl_table_arg;
+	unregister_net_sysctl_table(net->ct.ctnetlink_sysctl_header);
+	kfree(table);
+}
+#else
+static int ctnetlink_init_sysctl(struct net *net)
+{
+	return 0;
+}
+
+static void ctnetlink_fini_sysctl(struct net *net)
+{
+}
+#endif /* CONFIG_SYSCTL */
+
+static int ctnetlink_net_init(struct net *net)
 {
 	int ret;
 
@@ -2033,10 +2107,18 @@ static int __init ctnetlink_init(void)
 		goto err_unreg_notifier;
 	}
 #endif
+	ret = ctnetlink_init_sysctl(net);
+	if (ret < 0) {
+		printk("ctnetlink_init: cannot register sysctl.\n");
+		goto err_unreg_exp_notifier;
+	}
+	net->ct.sysctl_ctnetlink_event_reliable = 0;
 
 	return 0;
 
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
+err_unreg_exp_notifier:
+	nf_ct_expect_unregister_notifier(&ctnl_notifier_exp);
 err_unreg_notifier:
 	nf_conntrack_unregister_notifier(&ctnl_notifier);
 err_unreg_exp_subsys:
@@ -2048,7 +2130,7 @@ err_out:
 	return ret;
 }
 
-static void __exit ctnetlink_exit(void)
+static void ctnetlink_net_exit(struct net *net)
 {
 	printk("ctnetlink: unregistering from nfnetlink.\n");
 
@@ -2059,8 +2141,24 @@ static void __exit ctnetlink_exit(void)
 	nfnetlink_subsys_unregister(&ctnl_exp_subsys);
 	nfnetlink_subsys_unregister(&ctnl_subsys);
+	ctnetlink_fini_sysctl(net);
 	return;
 }
 
+static struct pernet_operations ctnetlink_net_ops = {
+	.init = ctnetlink_net_init,
+	.exit = ctnetlink_net_exit,
+};
+
+static int __init ctnetlink_init(void)
+{
+	return register_pernet_subsys(&ctnetlink_net_ops);
+}
+
+static void __exit ctnetlink_exit(void)
+{
+	unregister_pernet_subsys(&ctnetlink_net_ops);
+}
+
 module_init(ctnetlink_init);
 module_exit(ctnetlink_exit);
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 9c0ba17..fd7bbf4 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -107,11 +107,29 @@ int nfnetlink_has_listeners(unsigned int group)
 }
 EXPORT_SYMBOL_GPL(nfnetlink_has_listeners);
 
-int nfnetlink_send(struct sk_buff *skb, u32 pid, unsigned group, int echo)
+/* like nlmsg_notify, but we return the multicast error */
+int nfnetlink_notify(struct sk_buff *skb, u32 pid, unsigned group, int report)
 {
-	return nlmsg_notify(nfnl, skb, pid, group, echo, gfp_any());
+	int err = 0, mcast_err = 0;
+
+	if (group) {
+		int exclude_pid = 0;
+
+		if (report) {
+			atomic_inc(&skb->users);
+			exclude_pid = pid;
+		}
+
+		mcast_err = nlmsg_multicast(nfnl, skb, exclude_pid,
+					    group, gfp_any());
+	}
+
+	if (report)
+		err = nlmsg_unicast(nfnl, skb, pid);
+
+	return mcast_err ? mcast_err : err;
 }
-EXPORT_SYMBOL_GPL(nfnetlink_send);
+EXPORT_SYMBOL_GPL(nfnetlink_notify);
 
 int nfnetlink_unicast(struct sk_buff *skb, u_int32_t pid, int flags)
 {
* Re: [RFC] netlink broadcast return value
From: Patrick McHardy @ 2009-02-10 13:50 UTC
To: Pablo Neira Ayuso; +Cc: David Miller, netdev, netfilter-devel
Pablo Neira Ayuso wrote:
> Patrick McHardy wrote:
>> Pablo Neira Ayuso wrote:
>
>> But unless I'm missing something, there's nothing wrong with this
>> as long as the error is ignored. The fact that something was received
>> by some listener doesn't have any meaning anyways, it might have
>> been "ip monitor". Which somehow raises doubt about your proposed
>> interface change though, I think anything that wants a reliable
>> answer whether a packet was delivered to a process handling it
>> appropriately should use unicast.
>
> Don't get me wrong, I agree with you that all netlink_broadcast callers
> in the kernel should ignore the return value...
>
> ... unless they have "some way" (like in Netfilter) to make event
> delivery reliable: I have attached a patch that I didn't send you yet,
> I'm still reviewing and testing it. It adds an entry to /proc to enable
> reliable event delivery over netlink by dropping packets whose events
> were not delivered, you mentioned that possibility once during one of
> our conversations ;).
I know, but in the meantime I think it's wrong :) The delivery
isn't reliable, and what the admin is effectively expressing by
setting your sysctl is "I don't have any listeners besides the
synchronization daemon running". So it might as well use unicast.
> I'm aware of that this option may be dangerous if used by a buggy
> process that trigger frequent overflows but it the cost of having
> realible logging for ctnetlink (still, this behaviour is not the one by
> default!).
>
> And I need this option to make conntrackd synchronize state-changes
> appropriately under very heavy load: I've testing the daemon with these
> patches and it reliably synchronizes state-changes (my system were 100%
> busy filtering traffic and fully synchronizing all TCP state-changes in
> near real-time effort, with a noticeable performance drop of 30% in
> terms of filtered connections).
So you're dropping the packet if you can't manage to synchronize.
Doesn't that defeat the entire purpose of synchronizing, which is
*increasing* reliability? :)
* Re: [RFC] netlink broadcast return value
From: Pablo Neira Ayuso @ 2009-02-10 18:51 UTC
To: Patrick McHardy; +Cc: David Miller, netdev, netfilter-devel
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Patrick McHardy wrote:
>>> Pablo Neira Ayuso wrote:
>>
>>> But unless I'm missing something, there's nothing wrong with this
>>> as long as the error is ignored. The fact that something was received
>>> by some listener doesn't have any meaning anyways, it might have
>>> been "ip monitor". Which somehow raises doubt about your proposed
>>> interface change though, I think anything that wants a reliable
>>> answer whether a packet was delivered to a process handling it
>>> appropriately should use unicast.
>>
>> Don't get me wrong, I agree with you that all netlink_broadcast callers
>> in the kernel should ignore the return value...
>>
>> ... unless they have "some way" (like in Netfilter) to make event
>> delivery reliable: I have attached a patch that I didn't send you yet,
>> I'm still reviewing and testing it. It adds an entry to /proc to enable
>> reliable event delivery over netlink by dropping packets whose events
>> were not delivered, you mentioned that possibility once during one of
>> our conversations ;).
>
> I know, but in the mean time I think its wrong :) The delivery
> isn't reliable and what the admin is effectively expressing by
> setting your sysctl is "I don't have any listeners besides the
> synchronization daemon running". So it might as well use unicast.
No :), this setting means "state-changes over ctnetlink will be reliable
at the cost of dropping packets (if needed)"; it's an optional
trade-off. You may also have more listeners, like a logging daemon
(ulogd); similarly, this will be useful to ensure that ulogd doesn't
leak logging information, which may happen under very heavy load. This
option is *not* only oriented to state-synchronization.
Using unicast would not be any different from broadcast, as you may have
two listeners receiving state-changes from ctnetlink via unicast, so the
problem would be basically the same as above if you want reliable
state-change information at the cost of dropping packets.
BTW, the netlink_broadcast return value looked inconsistent to me before
the patch: it returned ENOBUFS if it could not clone the skb, but zero
when at least one message was delivered. How useful can this return
value be for the callers? I would expect behaviour similar to that of
netlink_unicast (which reports an EAGAIN error when it cannot deliver
the message), even if most callers should ignore the return value, as it
is of no help to them.
>> I'm aware of that this option may be dangerous if used by a buggy
>> process that trigger frequent overflows but it the cost of having
>> realible logging for ctnetlink (still, this behaviour is not the one by
>> default!).
>>
>> And I need this option to make conntrackd synchronize state-changes
>> appropriately under very heavy load: I've testing the daemon with these
>> patches and it reliably synchronizes state-changes (my system were 100%
>> busy filtering traffic and fully synchronizing all TCP state-changes in
>> near real-time effort, with a noticeable performance drop of 30% in
>> terms of filtered connections).
>
> So you're dropping the packet if you can't manage to synchronize.
> Doesn't that defeat the entire purpose of synchronizing, which is
> *increasing* reliability? :)
This reduces communication reliability a bit under very heavy load,
yes, because it may drop some packets, but it adds reliable flow-based
logging/accounting and state-synchronization in return. Both refer to
reliability in different contexts. In the end, it's a trade-off world.
There's some point at which you may want to choose which one you prefer:
reliable communications when the system is under heavy load, or reliable
logging (no leaks in the logging) / state-synchronization (the backup
firewall is able to follow the state-changes of the master under heavy
load).
In my experiments, reaching 100% CPU consumption, the number of dropped
packets was in fact very small indeed, but the harm to logging and
state-synchronization reliability is considerable in the long run, as
the backup starts getting unsynchronized (thus becoming useless for
increasing cluster reliability while still consuming resources) and, in
the case of logging, you also have to interpret the log information
without forgetting its margin of error.
BTW, I did not tell you, I can give you access to my testbed platform at
any time, of course ;).
--
"Los honestos son inadaptados sociales" -- Les Luthiers
* Re: [RFC] netlink broadcast return value
From: Patrick McHardy @ 2009-02-11 12:44 UTC
To: Pablo Neira Ayuso; +Cc: David Miller, netdev, netfilter-devel
Pablo Neira Ayuso wrote:
> Patrick McHardy wrote:
>> I know, but in the mean time I think its wrong :) The delivery
>> isn't reliable and what the admin is effectively expressing by
>> setting your sysctl is "I don't have any listeners besides the
>> synchronization daemon running". So it might as well use unicast.
>
> No :), this setting means "state-changes over ctnetlink will be reliable
> at the cost of dropping packets (if needed)", it's an optional
> trade-off. You may also have more listeners like a logging daemon
> (ulogd), similarly this will be useful to ensure that ulogd doesn't leak
> logging information which may happen under very heavy load. This option
> is *not* only oriented to state-synchronization.
I'm aware of that. But you're adding a policy knob to control the
behaviour of a one-to-many interface based on what a single listener
(or maybe even two) wants. It's no longer possible to just listen to
events for debugging, since that might even lock you out. You also
can't use ulogd and say that you *don't* care whether every last state
change was delivered to it.
This seems very wrong to me. And I don't even see a reason to do
this, since it's easy to use unicast and per-listener state.
> Using unicast would not do any different from broadcast as you may have
> two listeners receiving state-changes from ctnetlink via unicast, so the
> problem would be basically the same as above if you want reliable
> state-change information at the cost of dropping packets.
Only the processes that actually care can specify this behaviour.
They're likely to have more CPU time, better-adjusted receive
buffers, etc. than, for instance, the conntrack tool when dumping events.
> BTW, the netlink_broadcast return value looked to me inconsistent before
> the patch. It returned ENOBUFS if it could not clone the skb, but zero
> when at least one message was delivered. How useful can be this return
> value for the callers? I would expect to have a similar behaviour to the
> one of netlink_unicast (reporting EAGAIN error when it could not deliver
> the message), even if the return value for most callers should be
> ignored as it is not of any help.
It's useless, since you don't know who received it. It should return
void, IMO.
>> So you're dropping the packet if you can't manage to synchronize.
>> Doesn't that defeat the entire purpose of synchronizing, which is
>> *increasing* reliability? :)
>
> This reduces communications reliability a bit under very heavy load,
> yes, because it may drop some packets but it adds reliable flow-based
> logging accounting / state-synchronization in return. Both refers to
> reliability in different contexts. In the end, it's a trade-off world.
> There's some point at which you may want to choose which one you prefer,
> reliable communications if the system is under heavy load or reliable
> logging (no leaks in the logging) / state-synchronization (the backup
> firewall is able to follow state-changes of the master under heavy load).
Logging yes, but I can't see the point in perfect synchronization if
that leads to less throughput.
* Re: [RFC] netlink broadcast return value
From: Pablo Neira Ayuso @ 2009-02-11 16:39 UTC
To: Patrick McHardy; +Cc: David Miller, netdev, netfilter-devel
First of all, sorry, this email is probably too long.
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Patrick McHardy wrote:
>>> I know, but in the mean time I think its wrong :) The delivery
>>> isn't reliable and what the admin is effectively expressing by
>>> setting your sysctl is "I don't have any listeners besides the
>>> synchronization daemon running". So it might as well use unicast.
>>
>> No :), this setting means "state-changes over ctnetlink will be reliable
>> at the cost of dropping packets (if needed)", it's an optional
>> trade-off. You may also have more listeners like a logging daemon
>> (ulogd), similarly this will be useful to ensure that ulogd doesn't leak
>> logging information which may happen under very heavy load. This option
>> is *not* only oriented to state-synchronization.
>
> I'm aware of that. But you're adding a policy knob to control the
> behaviour of a one-to-many interface based on what a single listener
> (or maybe even two) want. Its not possible anymore to just listen to
> events for debugging, since that might even lock you out.
Can you think of one example where a ctnetlink listener would not find
reliable state-change reports useful? Still, this setting is optional
(it will be disabled by default) and, if turned on, you can disable it
for debugging purposes.
Thinking more about it, reliable logging and monitoring would even be
interesting in terms of security.
> You also
> can't use ulogd and say that you *don't* care whether every last state
> change was delivered to it.
>
> This seems very wrong to me. And I don't even see a reason to do
> this since its easy to use unicast and per-listener state.
Netlink unicast would not be of any help either if you want reliable
state-change reporting via ctnetlink. If one process receives the event
and the other does not, you would also need to drop the packet to
perform reliable logging.
>> Using unicast would not do any different from broadcast as you may have
>> two listeners receiving state-changes from ctnetlink via unicast, so the
>> problem would be basically the same as above if you want reliable
>> state-change information at the cost of dropping packets.
>
> Only the processes that actually care can specify this behaviour.
No, because this behaviour implies that the packet would be dropped if
the state-change is not delivered correctly to all of them. It has to be
an on/off behaviour for all listeners.
> They're likely to have more CPU time, better adjusted receive
> buffers etc than f.i. the conntrack tool when dumping events.
>
>> BTW, the netlink_broadcast return value looked to me inconsistent before
>> the patch. It returned ENOBUFS if it could not clone the skb, but zero
>> when at least one message was delivered. How useful can be this return
>> value for the callers? I would expect to have a similar behaviour to the
>> one of netlink_unicast (reporting EAGAIN error when it could not deliver
>> the message), even if the return value for most callers should be
>> ignored as it is not of any help.
>
> Its useless since you don't know how received it. It should return
> void IMO.
>
>>> So you're dropping the packet if you can't manage to synchronize.
>>> Doesn't that defeat the entire purpose of synchronizing, which is
>>> *increasing* reliability? :)
>>
>> This reduces communications reliability a bit under very heavy load,
>> yes, because it may drop some packets but it adds reliable flow-based
>> logging accounting / state-synchronization in return. Both refers to
>> reliability in different contexts. In the end, it's a trade-off world.
>> There's some point at which you may want to choose which one you prefer,
>> reliable communications if the system is under heavy load or reliable
>> logging (no leaks in the logging) / state-synchronization (the backup
>> firewall is able to follow state-changes of the master under heavy load).
>
> Logging yes, but I can't see the point in perfect synchronization if
> that leads to less throughput.
Indeed, (reactive) fault-tolerance forces you to trade off between the
degree of synchronization and performance. Conntrackd is far from doing
"perfect synchronization"; let me develop this idea a bit.
Perfect synchronization (or call it "synchronous" replication) indeed
implies *way* less performance. In the particular case of stateful
firewalls, synchronous replication means that each packet would have to
wait until the corresponding state-change has been propagated to all
backups.
Then, once the backups have confirmed that the state-change has been
propagated correctly, the packet continues its travel. Thus, packets
would be delayed and throughput would drop severely. This is what
fault-tolerance "erudite" people call a "correct fault-tolerant system",
since the status of the replication is known at any time and the backups
can successfully take over the stateful filtering at any time. However,
the cost in terms of performance is *crap*, of course :): think of the
delay in packet delivery through that stateful firewall; it's like
getting a coke from the moon just to be "correct".
It's clear that synchronous replication is not feasible for today's
Internet systems. So let's consider asynchronous replication: in the
case of stateful firewalls, this means that the packet is not held until
the state-change is delivered; instead, the packet continues its travel
and the state-change event is delivered to the backups in a "do your
best" approach. This is indeed a trade-off: we relax replication to
allow better performance and make fault-tolerant Internet systems
feasible. In return, the backups are only ready to recover a subset of
the state-changes, while others may not be recoverable (think of
long-standing established TCP connections versus very short TCP
connections: the first sort can be recovered, the latter may not).
Nevertheless, asynchronous replication works fine in practice.
But asynchronous replication may become useless for achieving
fault-tolerance if the rate of state-changes is high enough that the
backup nodes cannot follow the primary node. Going back to the problem:
if Netlink cannot deliver the state-change, the backup would not be able
to take over the filtering if the primary fails. At some point you have
to set a boundary after which you can ensure acceptable synchronization
and performance; if the boundary is crossed on one side, the other gets
harmed.
I would have to tell sysadmins that conntrackd becomes unreliable under
heavy load in full near real-time mode; that would be horrible! Instead,
with this option, I can tell them that selecting full near real-time
event-driven synchronization reduces performance.
BTW, conntrackd has one batch mode that relaxes synchronization *a lot*
(it sends to the backup nodes the states that have been living in the
kernel conntrack table for a given range of time, say [10-20) seconds);
this is also possible. But with the option that I'm proposing, we could
let the network designer choose the synchronization approach he/she
prefers according to the network requirements. That includes
understanding that he/she accepts a performance drop (which I have
measured at ~30% with full TCP state replication of very short
connections in event-driven near real-time fashion, which I think is
close to the worst case).
--
"Los honestos son inadaptados sociales" -- Les Luthiers
* Re: [RFC] netlink broadcast return value
From: Patrick McHardy @ 2009-02-11 16:54 UTC
To: Pablo Neira Ayuso; +Cc: David Miller, netdev, netfilter-devel
Pablo Neira Ayuso wrote:
> First of all, sorry, this email is probably too long.
Indeed, I'm doing some trimming :)
> Patrick McHardy wrote:
>> I'm aware of that. But you're adding a policy knob to control the
>> behaviour of a one-to-many interface based on what a single listener
>> (or maybe even two) want. Its not possible anymore to just listen to
>> events for debugging, since that might even lock you out.
>
> Can you think of one example where one ctnetlink listener may not find
> useful reliable state-change reports? Still, this setting is optional
> (it will be disabled by default) and, if turned on, you can disable it
> for debugging purposes.
As I already said: "conntrack -E" used for debugging. Nobody cares
whether it misses a few events, as opposed to causing dropped packets.
Whether it's on or not by default is secondary to whether it's the right
thing at all.
> Thinking more about it, reliable logging and monitoring would be even
> something interesting in terms of security.
I don't doubt that, I question the mechanism.
>> This seems very wrong to me. And I don't even see a reason to do
>> this since its easy to use unicast and per-listener state.
>
> Netlink unicast would not be of any help either if you want reliable
> state-change reporting via ctnetlink. If one process receives the event
> and the other does not, you would also need to drop the packet to
> perform reliable logging.
Yes, and you don't need to if you don't want "reliable" logging.
The point is that you can choose per socket. Only if a socket that
really wants this doesn't get its copy do you drop.
>>> Using unicast would not do any different from broadcast as you may have
>>> two listeners receiving state-changes from ctnetlink via unicast, so the
>>> problem would be basically the same as above if you want reliable
>>> state-change information at the cost of dropping packets.
No, it's not the same. ctsync sets big receive buffers and requests
"reliable" delivery; "conntrack -E" does nothing special and doesn't
care whether messages are dropped because its receive queue is too
small.
>> Only the processes that actually care can specify this behaviour.
>
> No, because this behaviour implies that the packet would be drop if the
> state-change is not delivered correctly to all. It has to be an on/off
> behaviour for all listeners.
You keep saying that, but it's only the case because the way you
implemented it requires this. Why would ctsync care whether
conntrack -E missed a packet?
>> [...]
> I would have to tell sysadmins that conntrackd becomes unreliable under
> heavy load in full near real-time mode, that would be horrible!.
> Instead, with this option, I can tell them that, if they have selected
> full near real-time event-driven synchronization, that reduces performance.
Again, I'm not arguing about the option but about making it a
sysctl or something that affects all (ctnetlink) sockets whether
they care or not. You could even make it a per-broadcast listener
option, but the sysctl is effectively converting broadcast
operation to reliable unicast semantics and that seems wrong.
* Re: [RFC] netlink broadcast return value
From: Pablo Neira Ayuso @ 2009-02-11 21:01 UTC
To: Patrick McHardy; +Cc: David Miller, netdev, netfilter-devel
I'm also trimming ;)
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Can you think of one example where one ctnetlink listener may not find
>> useful reliable state-change reports? Still, this setting is optional
>> (it will be disabled by default) and, if turned on, you can disable it
>> for debugging purposes.
>
> As I already said, "conntrack -E" used for debugging. Nobody cares
> whether it misses a few events instead of causing dropped packets.
> Whether its on or not by default is secondary to being the right
> thing at all.
In particular, conntrack -E reports an error message when it hits
ENOBUFS, so it's a bad example. Indeed, I think that other userspace
programs should do the same if they don't know what to do with ENOBUFS;
otherwise, they should increase the buffer up to a reasonable limit (set
by the user) and, once this limit has been reached, report that they
have become unreliable (or, depending on the sysctl value that I'm
proposing, that they may drop packets).
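In pseudo-C, the policy that I have in mind for such programs (a
sketch; rcvbuf and rcvbuf_max are the current and user-set maximum
buffer sizes):

        if (recv(fd, buf, sizeof(buf), 0) < 0 && errno == ENOBUFS) {
                if (rcvbuf < rcvbuf_max) {
                        rcvbuf *= 2;    /* grow and keep going */
                        setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                                   &rcvbuf, sizeof(rcvbuf));
                } else
                        warn("event listening has become unreliable");
        }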
And I think that there are tons of interfaces that userspace programs
can abuse to do the wrong thing.
>>>> Using unicast would not do any different from broadcast as you may have
>>>> two listeners receiving state-changes from ctnetlink via unicast, so
>>>> the
>>>> problem would be basically the same as above if you want reliable
>>>> state-change information at the cost of dropping packets.
>
> No, its not the same. ctsync sets big receive buffers and requests
> "reliable" delivery, "conntrack -E" does nothing special and doesn't
> care whether messages are dropped because its receive queue is too
> small.
conntrack -E is a bad example but I get the point. This sysctl has to be
for all ctnetlink listeners.
>> I would have to tell sysadmins that conntrackd becomes unreliable under
>> heavy load in full near real-time mode, that would be horrible!.
>> Instead, with this option, I can tell them that, if they have selected
>> full near real-time event-driven synchronization, that reduces
>> performance.
>
> Again, I'm not arguing about the option but about making it a
> sysctl or something that affects all (ctnetlink) sockets whether
> they care or not. You could even make it a per-broadcast listener
> option, but the sysctl is effectively converting broadcast
> operation to reliable unicast semantics and that seems wrong.
And again, you point out that this should be per-socket, but how can you
make this option per-socket? The only way that I see to make
state-change reporting reliable is to drop the packet, to force the peer
to retransmit it and trigger the same state-change, and that affects all
ctnetlink listeners.
--
"Los honestos son inadaptados sociales" -- Les Luthiers
* Re: [RFC] netlink broadcast return value
From: Patrick McHardy @ 2009-02-12 5:07 UTC
To: Pablo Neira Ayuso; +Cc: David Miller, netdev, netfilter-devel
Pablo Neira Ayuso wrote:
>>> Can you think of one example where one ctnetlink listener may not find
>>> useful reliable state-change reports? Still, this setting is optional
>>> (it will be disabled by default) and, if turned on, you can disable it
>>> for debugging purposes.
>> As I already said, "conntrack -E" used for debugging. Nobody cares
>> whether it misses a few events instead of causing dropped packets.
>> Whether its on or not by default is secondary to being the right
>> thing at all.
>
> In particular, conntrack -E returns an error message when it hits
> ENOBUFS, so it's a bad example.
You're proposing to drop packets; I don't think an error message
after the fact makes up for that :)
> Indeed, I think that other programs in
> userspace should do this if they don't know what to do with ENOBUFS,
> otherwise increase the buffer up to a reasonable limit (set by the
> user), and then report that this limit has been reached telling that
> they have become unreliable (or depending on the sysctl value that I'm
> proposing, tell that they may drop packets).
>
> And I think that there are tons of interfaces that userspace programs
> can abuse to do the wrong thing.
That's true; in this case, though, the userspace program doesn't need to
do anything wrong.
>>> I would have to tell sysadmins that conntrackd becomes unreliable under
>>> heavy load in full near real-time mode, that would be horrible!.
>>> Instead, with this option, I can tell them that, if they have selected
>>> full near real-time event-driven synchronization, that reduces
>>> performance.
>> Again, I'm not arguing about the option but about making it a
>> sysctl or something that affects all (ctnetlink) sockets whether
>> they care or not. You could even make it a per-broadcast listener
>> option, but the sysctl is effectively converting broadcast
>> operation to reliable unicast semantics and that seems wrong.
>
> And again, you point that this should be per-socket, but how can you
> make this option per-socket? The only way that I see to make
> state-change reporting reliable is to drop the packet to force the peer
> to retransmit the packet and trigger the same state-change, and that
> affect all ctnetlink listeners.
For unicast it's obviously simple; for broadcast you'd need something
like this:

	err = 0;
	for (all netlink sockets; sk && !err; ...) {
		skb = skb_clone(...);
		if (skb == NULL) {
			if (sk->flags & NETLINK_HIGHLY_RELIABLE)
				err = -ENOBUFS;
			continue;
		}
		...
	}

So you're returning an error when at least one of the "reliable"
sockets doesn't get its delivery.
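The listener side would then opt in with a setsockopt(), something like
this (NETLINK_HIGHLY_RELIABLE is just a placeholder name here, no such
option exists yet):

	int on = 1;
	setsockopt(fd, SOL_NETLINK, NETLINK_HIGHLY_RELIABLE,
		   &on, sizeof(on));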
* Re: [RFC] netlink broadcast return value
From: Pablo Neira Ayuso @ 2009-02-12 12:36 UTC
To: Patrick McHardy; +Cc: David Miller, netdev, netfilter-devel
[-- Attachment #1: Type: text/plain, Size: 1944 bytes --]
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> And again, you point that this should be per-socket, but how can you
>> make this option per-socket? The only way that I see to make
>> state-change reporting reliable is to drop the packet to force the peer
>> to retransmit the packet and trigger the same state-change, and that
>> affect all ctnetlink listeners.
>
> For unicast it's obviously simple; for broadcast you'd need something
> like this:
>
> err = 0;
> for (all netlink sockets; sk && !err; ...) {
> 	skb = skb_clone(...);
> 	if (skb == NULL) {
> 		if (sk->flags & NETLINK_HIGHLY_RELIABLE)
> 			err = -ENOBUFS;
> 		continue;
> 	}
> 	...
> }
>
> So you're returning an error when at least one of the "reliable"
> sockets doesn't get its delivery.
Patrick, I like it; I'm fine with this approach as long as it lets me add
the "reliable" ctnetlink state-change reporting. I can add the following
on top of the patch that David already applied:
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
[...]
@@ -999,6 +1000,7 @@ static inline int do_one_broadcast(struct sock *sk,
p->skb2 = NULL;
} else if ((val = netlink_broadcast_deliver(sk, p->skb2)) < 0) {
netlink_overrun(sk);
+ p->delivery_failure = 1;
^^^^^^^^^^^^^^^^^^^^^^^^
Replace this by:
+ if (nlk->flags & NETLINK_HIGHLY_RELIABLE)
+ p->delivery_failure = 1;
And include the flag definition and setsockopt() operations in the new
patch, of course.
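A sketch of what the flag definition and the netlink_setsockopt() hook
might look like, following the style of the existing NETLINK_PKTINFO
handling; the name is still this thread's placeholder, and real code
would probably separate the user-visible option from the internal
nlk->flags bit:

	case NETLINK_HIGHLY_RELIABLE:
		if (val)
			nlk->flags |= NETLINK_HIGHLY_RELIABLE;
		else
			nlk->flags &= ~NETLINK_HIGHLY_RELIABLE;
		err = 0;
		break;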
Please find the previous patch that was applied to the net-next tree
enclosed, to save you some time in case you don't know which patch I
was referring to. I think that the changes (to several drivers and such)
are still useful, as they should ignore the return value of
netlink_broadcast() since it's of no use to them (as we already
discussed, they printk the error, which is useless).
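For context, the ctnetlink side that this would enable might look
roughly like the sketch below; the nf_ct_events_reliable flag and the
surrounding code are assumptions for illustration, not part of any
applied patch:

	/* If reliable event reporting is enabled and broadcast delivery
	 * failed, drop the packet so the peer retransmits it and the
	 * same state change fires again. */
	err = netlink_broadcast(ctnl, skb, 0, group, GFP_ATOMIC);
	if (err == -ENOBUFS && nf_ct_events_reliable)
		return NF_DROP;
	return NF_ACCEPT;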
--
"Los honestos son inadaptados sociales" -- Les Luthiers
[-- Attachment #2: netlink-broadcast-delivery-failure.patch --]
[-- Type: text/x-diff, Size: 9619 bytes --]
netlink: change return-value logic of netlink_broadcast()
From: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, netlink_broadcast() only reports errors to the caller if no
messages at all were delivered:
1) If at least one message has been delivered correctly, return 0.
2) Otherwise, if no messages at all were delivered due to an skb_clone()
failure, return -ENOBUFS.
3) Otherwise, if there are no listeners, return -ESRCH.
With this patch, the caller knows whether the delivery of any of the
messages to the listeners has failed:
1) If it fails to deliver any message (for whatever reason), return
-ENOBUFS.
2) Otherwise, if all messages were delivered OK, return 0.
3) Otherwise, if there are no listeners, return -ESRCH.
In the current ctnetlink code, and in Netfilter in general, we can add
reliable logging and connection tracking event delivery by dropping the
packets whose events were not successfully delivered over Netlink. Of
course, this option would be settable via /proc, as this approach reduces
performance (in terms of filtered connections per second by a stateful
firewall) but provides reliable logging and event delivery (for
conntrackd) in return.
This patch also changes some clients of netlink_broadcast() that
report ENOBUFS errors via printk. This error handling is of no
help. Instead, the userspace daemons that are listening to
those netlink messages should resync themselves with the kernel
side if they hit ENOBUFS.
BTW, netlink_broadcast() clients include those that call
cn_netlink_send(), nlmsg_multicast() and genlmsg_multicast() since they
internally call netlink_broadcast() and return its error value.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
drivers/acpi/event.c | 6 +-----
drivers/scsi/scsi_transport_fc.c | 16 ++++------------
drivers/scsi/scsi_transport_iscsi.c | 12 ++----------
drivers/video/uvesafb.c | 5 ++++-
fs/dquot.c | 5 +----
lib/kobject_uevent.c | 3 +++
net/netlink/af_netlink.c | 8 ++++++--
net/wimax/op-msg.c | 9 +++------
net/wimax/stack.c | 12 ++++--------
9 files changed, 28 insertions(+), 48 deletions(-)
diff --git a/drivers/acpi/event.c b/drivers/acpi/event.c
index 0c24bd4..aeb7e5f 100644
--- a/drivers/acpi/event.c
+++ b/drivers/acpi/event.c
@@ -235,11 +235,7 @@ int acpi_bus_generate_netlink_event(const char *device_class,
return result;
}
- result =
- genlmsg_multicast(skb, 0, acpi_event_mcgrp.id, GFP_ATOMIC);
- if (result)
- ACPI_DEBUG_PRINT((ACPI_DB_INFO,
- "Failed to send a Genetlink message!\n"));
+ genlmsg_multicast(skb, 0, acpi_event_mcgrp.id, GFP_ATOMIC);
return 0;
}
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 5f77417..3ee4eb4 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -533,12 +533,8 @@ fc_host_post_event(struct Scsi_Host *shost, u32 event_number,
event->event_code = event_code;
event->event_data = event_data;
- err = nlmsg_multicast(scsi_nl_sock, skb, 0, SCSI_NL_GRP_FC_EVENTS,
- GFP_KERNEL);
- if (err && (err != -ESRCH)) /* filter no recipient errors */
- /* nlmsg_multicast already kfree_skb'd */
- goto send_fail;
-
+ nlmsg_multicast(scsi_nl_sock, skb, 0, SCSI_NL_GRP_FC_EVENTS,
+ GFP_KERNEL);
return;
send_fail_skb:
@@ -607,12 +603,8 @@ fc_host_post_vendor_event(struct Scsi_Host *shost, u32 event_number,
event->event_code = FCH_EVT_VENDOR_UNIQUE;
memcpy(&event->event_data, data_buf, data_len);
- err = nlmsg_multicast(scsi_nl_sock, skb, 0, SCSI_NL_GRP_FC_EVENTS,
- GFP_KERNEL);
- if (err && (err != -ESRCH)) /* filter no recipient errors */
- /* nlmsg_multicast already kfree_skb'd */
- goto send_vendor_fail;
-
+ nlmsg_multicast(scsi_nl_sock, skb, 0, SCSI_NL_GRP_FC_EVENTS,
+ GFP_KERNEL);
return;
send_vendor_fail_skb:
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index 75c9297..2adfab8 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -966,15 +966,7 @@ iscsi_if_transport_lookup(struct iscsi_transport *tt)
static int
iscsi_broadcast_skb(struct sk_buff *skb, gfp_t gfp)
{
- int rc;
-
- rc = netlink_broadcast(nls, skb, 0, 1, gfp);
- if (rc < 0) {
- printk(KERN_ERR "iscsi: can not broadcast skb (%d)\n", rc);
- return rc;
- }
-
- return 0;
+ return netlink_broadcast(nls, skb, 0, 1, gfp);
}
static int
@@ -1207,7 +1199,7 @@ int iscsi_session_event(struct iscsi_cls_session *session,
* the user and when the daemon is restarted it will handle it
*/
rc = iscsi_broadcast_skb(skb, GFP_KERNEL);
- if (rc < 0)
+ if (rc == -ESRCH)
iscsi_cls_session_printk(KERN_ERR, session,
"Cannot notify userspace of session "
"event %u. Check iscsi daemon\n",
diff --git a/drivers/video/uvesafb.c b/drivers/video/uvesafb.c
index 6c2d37f..74ae758 100644
--- a/drivers/video/uvesafb.c
+++ b/drivers/video/uvesafb.c
@@ -204,8 +204,11 @@ static int uvesafb_exec(struct uvesafb_ktask *task)
} else {
v86d_started = 1;
err = cn_netlink_send(m, 0, gfp_any());
+ if (err == -ENOBUFS)
+ err = 0;
}
- }
+ } else if (err == -ENOBUFS)
+ err = 0;
if (!err && !(task->t.flags & TF_EXIT))
err = !wait_for_completion_timeout(task->done,
diff --git a/fs/dquot.c b/fs/dquot.c
index bca3cac..d6add0b 100644
--- a/fs/dquot.c
+++ b/fs/dquot.c
@@ -1057,10 +1057,7 @@ static void send_warning(const struct dquot *dquot, const char warntype)
goto attr_err_out;
genlmsg_end(skb, msg_head);
- ret = genlmsg_multicast(skb, 0, quota_genl_family.id, GFP_NOFS);
- if (ret < 0 && ret != -ESRCH)
- printk(KERN_ERR
- "VFS: Failed to send notification message: %d\n", ret);
+ genlmsg_multicast(skb, 0, quota_genl_family.id, GFP_NOFS);
return;
attr_err_out:
printk(KERN_ERR "VFS: Not enough space to compose quota message!\n");
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 318328d..3813102 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -227,6 +227,9 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
NETLINK_CB(skb).dst_group = 1;
retval = netlink_broadcast(uevent_sock, skb, 0, 1,
GFP_KERNEL);
+ /* ENOBUFS should be handled in userspace */
+ if (retval == -ENOBUFS)
+ retval = 0;
} else
retval = -ENOMEM;
}
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 9eb895c..6ee69c2 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -950,6 +950,7 @@ struct netlink_broadcast_data {
u32 pid;
u32 group;
int failure;
+ int delivery_failure;
int congested;
int delivered;
gfp_t allocation;
@@ -999,6 +1000,7 @@ static inline int do_one_broadcast(struct sock *sk,
p->skb2 = NULL;
} else if ((val = netlink_broadcast_deliver(sk, p->skb2)) < 0) {
netlink_overrun(sk);
+ p->delivery_failure = 1;
} else {
p->congested |= val;
p->delivered = 1;
@@ -1025,6 +1027,7 @@ int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, u32 pid,
info.pid = pid;
info.group = group;
info.failure = 0;
+ info.delivery_failure = 0;
info.congested = 0;
info.delivered = 0;
info.allocation = allocation;
@@ -1045,13 +1048,14 @@ int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, u32 pid,
if (info.skb2)
kfree_skb(info.skb2);
+ if (info.delivery_failure || info.failure)
+ return -ENOBUFS;
+
if (info.delivered) {
if (info.congested && (allocation & __GFP_WAIT))
yield();
return 0;
}
- if (info.failure)
- return -ENOBUFS;
return -ESRCH;
}
EXPORT_SYMBOL(netlink_broadcast);
diff --git a/net/wimax/op-msg.c b/net/wimax/op-msg.c
index cb3b4ad..5d149c1 100644
--- a/net/wimax/op-msg.c
+++ b/net/wimax/op-msg.c
@@ -258,7 +258,6 @@ EXPORT_SYMBOL_GPL(wimax_msg_len);
*/
int wimax_msg_send(struct wimax_dev *wimax_dev, struct sk_buff *skb)
{
- int result;
struct device *dev = wimax_dev->net_dev->dev.parent;
void *msg = skb->data;
size_t size = skb->len;
@@ -266,11 +265,9 @@ int wimax_msg_send(struct wimax_dev *wimax_dev, struct sk_buff *skb)
d_printf(1, dev, "CTX: wimax msg, %zu bytes\n", size);
d_dump(2, dev, msg, size);
- result = genlmsg_multicast(skb, 0, wimax_gnl_mcg.id, GFP_KERNEL);
- d_printf(1, dev, "CTX: genl multicast result %d\n", result);
- if (result == -ESRCH) /* Nobody connected, ignore it */
- result = 0; /* btw, the skb is freed already */
- return result;
+ genlmsg_multicast(skb, 0, wimax_gnl_mcg.id, GFP_KERNEL);
+ d_printf(1, dev, "CTX: genl multicast done\n");
+ return 0;
}
EXPORT_SYMBOL_GPL(wimax_msg_send);
diff --git a/net/wimax/stack.c b/net/wimax/stack.c
index 3869c03..a0ee76b 100644
--- a/net/wimax/stack.c
+++ b/net/wimax/stack.c
@@ -163,16 +163,12 @@ int wimax_gnl_re_state_change_send(
struct device *dev = wimax_dev_to_dev(wimax_dev);
d_fnstart(3, dev, "(wimax_dev %p report_skb %p)\n",
wimax_dev, report_skb);
- if (report_skb == NULL)
+ if (report_skb == NULL) {
+ result = -ENOMEM;
goto out;
- genlmsg_end(report_skb, header);
- result = genlmsg_multicast(report_skb, 0, wimax_gnl_mcg.id, GFP_KERNEL);
- if (result == -ESRCH) /* Nobody connected, ignore it */
- result = 0; /* btw, the skb is freed already */
- if (result < 0) {
- dev_err(dev, "RE_STCH: Error sending: %d\n", result);
- nlmsg_free(report_skb);
}
+ genlmsg_end(report_skb, header);
+ genlmsg_multicast(report_skb, 0, wimax_gnl_mcg.id, GFP_KERNEL);
out:
d_fnend(3, dev, "(wimax_dev %p report_skb %p) = %d\n",
wimax_dev, report_skb, result);
* Re: [RFC] netlink broadcast return value
2009-02-12 12:36 ` Pablo Neira Ayuso
@ 2009-02-12 12:41 ` Pablo Neira Ayuso
2009-02-12 12:48 ` Patrick McHardy
2009-02-12 12:45 ` Patrick McHardy
1 sibling, 1 reply; 21+ messages in thread
From: Pablo Neira Ayuso @ 2009-02-12 12:41 UTC (permalink / raw)
To: Patrick McHardy; +Cc: David Miller, netdev, netfilter-devel
Pablo Neira Ayuso wrote:
> Patrick, I like it; I'm fine with this approach as long as it lets me add
> the "reliable" ctnetlink state-change reporting. I can add the following
> on top of the patch that David already applied:
>
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> [...]
> @@ -999,6 +1000,7 @@ static inline int do_one_broadcast(struct sock *sk,
> p->skb2 = NULL;
> } else if ((val = netlink_broadcast_deliver(sk, p->skb2)) < 0) {
> netlink_overrun(sk);
> + p->delivery_failure = 1;
> ^^^^^^^^^^^^^^^^^^^^^^^^
> Replace this by:
> + if (nlk->flags & NETLINK_HIGHLY_RELIABLE)
> + p->delivery_failure = 1;
>
> And include the flag definition and setsockopt() operations in the new
> patch, of course.
Oh, and also for this:
> if (p->skb2 == NULL) {
> netlink_overrun(sk);
> /* Clone failed. Notify ALL listeners. */
> p->failure = 1;
^^^^^^^^^^^^^^^
if (nlk->flags & NETLINK_HIGHLY_RELIABLE)
p->failure = 1;
--
"Los honestos son inadaptados sociales" -- Les Luthiers
* Re: [RFC] netlink broadcast return value
2009-02-12 12:36 ` Pablo Neira Ayuso
2009-02-12 12:41 ` Pablo Neira Ayuso
@ 2009-02-12 12:45 ` Patrick McHardy
1 sibling, 0 replies; 21+ messages in thread
From: Patrick McHardy @ 2009-02-12 12:45 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: David Miller, netdev, netfilter-devel
Pablo Neira Ayuso wrote:
> Patrick McHardy wrote:
>> So you're returning an error when at least one of the "reliable"
>> sockets doesn't get its delivery.
>
> Patrick, I like it; I'm fine with this approach as long as it lets me add
> the "reliable" ctnetlink state-change reporting. I can add the following
> on top of the patch that David already applied:
>
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> [...]
> @@ -999,6 +1000,7 @@ static inline int do_one_broadcast(struct sock *sk,
> p->skb2 = NULL;
> } else if ((val = netlink_broadcast_deliver(sk, p->skb2)) < 0) {
> netlink_overrun(sk);
> + p->delivery_failure = 1;
> ^^^^^^^^^^^^^^^^^^^^^^^^
> Replace this by:
> + if (nlk->flags & NETLINK_HIGHLY_RELIABLE)
> + p->delivery_failure = 1;
>
> And include the flag definition and setsockopt() operations in the new
> patch, of course.
Sounds good. Maybe a nicer name for the flag :)
> Please find the previous patch that was applied to the net-next tree
> enclosed, to save you some time in case you don't know which patch I
> was referring to. I think that the changes (to several drivers and such)
> are still useful, as they should ignore the return value of
> netlink_broadcast() since it's of no use to them (as we already
> discussed, they printk the error, which is useless).
Agreed. The remaining question would be what to do about
xfrm_state. I think it can stay as it is if you add this
flag; *swan could use it if desired.
* Re: [RFC] netlink broadcast return value
2009-02-12 12:41 ` Pablo Neira Ayuso
@ 2009-02-12 12:48 ` Patrick McHardy
2009-02-12 13:20 ` Pablo Neira Ayuso
0 siblings, 1 reply; 21+ messages in thread
From: Patrick McHardy @ 2009-02-12 12:48 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: David Miller, netdev, netfilter-devel
Pablo Neira Ayuso wrote:
> Oh, and also for this:
>
>> if (p->skb2 == NULL) {
>> netlink_overrun(sk);
>> /* Clone failed. Notify ALL listeners. */
>> p->failure = 1;
> ^^^^^^^^^^^^^^^
>
> if (nlk->flags & NETLINK_HIGHLY_RELIABLE)
> p->failure = 1;
I always wondered about the intention behind this. It wouldn't
hurt to just try the other allocations and see if they also fail.
* Re: [RFC] netlink broadcast return value
2009-02-12 12:48 ` Patrick McHardy
@ 2009-02-12 13:20 ` Pablo Neira Ayuso
2009-02-12 13:25 ` Patrick McHardy
0 siblings, 1 reply; 21+ messages in thread
From: Pablo Neira Ayuso @ 2009-02-12 13:20 UTC (permalink / raw)
To: Patrick McHardy; +Cc: David Miller, netdev, netfilter-devel
Patrick McHardy wrote:
> Pablo Neira Ayuso wrote:
>> Oh, and also for this:
>>
>>> if (p->skb2 == NULL) {
>>> netlink_overrun(sk);
>>> /* Clone failed. Notify ALL listeners. */
>>> p->failure = 1;
>> ^^^^^^^^^^^^^^^
>>
>> if (nlk->flags & NETLINK_HIGHLY_RELIABLE)
>> p->failure = 1;
>
> I always wondered about the intention behind this. It wouldn't
> hurt to just try the other allocations and see if they also fail.
Indeed. I don't see the point either of stopping the other sockets from
receiving the message because one clone failed, but that's a different
issue, I think.
BTW, the netlink_set_err() function (I found one call in rtnetlink.c)
also caught my attention, since it sets the EAGAIN error on all
listeners when nlmsg_multicast() fails, which happens if echoing is
enabled and the unicast fails. This made me wonder: what action would
the multicast userspace listeners take if they hit EAGAIN? Probably it
just lets them know that the request may be retried? I think there's
nothing they can do anyway.
--
"Los honestos son inadaptados sociales" -- Les Luthiers
* Re: [RFC] netlink broadcast return value
2009-02-12 13:20 ` Pablo Neira Ayuso
@ 2009-02-12 13:25 ` Patrick McHardy
0 siblings, 0 replies; 21+ messages in thread
From: Patrick McHardy @ 2009-02-12 13:25 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: David Miller, netdev, netfilter-devel
Pablo Neira Ayuso wrote:
> Patrick McHardy wrote:
>> Pablo Neira Ayuso wrote:
>>> Oh, and also for this:
>>>
>>>> if (p->skb2 == NULL) {
>>>> netlink_overrun(sk);
>>>> /* Clone failed. Notify ALL listeners. */
>>>> p->failure = 1;
>>> ^^^^^^^^^^^^^^^
>>>
>>> if (nlk->flags & NETLINK_HIGHLY_RELIABLE)
>>> p->failure = 1;
>> I always wondered about the intention behind this. It wouldn't
>> hurt to just try the other allocations and see if they also fail.
>
> Indeed. I don't see the point either of stopping the other sockets from
> receiving the message because one clone failed, but that's a different
> issue, I think.
Indeed.
> BTW, the netlink_set_err() function (I found one call in rtnetlink.c)
> also caught my attention, since it sets the EAGAIN error on all
> listeners when nlmsg_multicast() fails, which happens if echoing is
> enabled and the unicast fails. This made me wonder: what action would
> the multicast userspace listeners take if they hit EAGAIN? Probably it
> just lets them know that the request may be retried? I think there's
> nothing they can do anyway.
They can resync. I'm not sure where you're getting EAGAIN from; it's
usually used to deliver ENOBUFS to sockets when the kernel failed to
even allocate the first skb that is cloned for the sockets. Check out
rtmsg_ifa for an example.
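A condensed paraphrase of the rtmsg_ifa() pattern (net/ipv4/devinet.c)
for reference; details are abridged and the surrounding declarations
are omitted:

	/* If the kernel cannot even build the notification skb, push the
	 * error to every group member via rtnl_set_sk_err(), which ends
	 * up in netlink_set_err(); listeners then see the error on their
	 * socket and can resync with the kernel. */
	skb = nlmsg_new(inet_nlmsg_size(), GFP_KERNEL);
	if (skb == NULL) {
		err = -ENOBUFS;
		goto errout;
	}
	...
errout:
	if (err < 0)
		rtnl_set_sk_err(net, RTNLGRP_IPV4_IFADDR, err);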