[PATCH 1/1] netfilter: nat: work around shared nfct struct in bridge case

netfilter-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/1] netfilter: nat: work around shared nfct struct in bridge case
@ 2011-08-30 13:28 Florian Westphal
  2011-08-30 13:44 ` Patrick McHardy
  0 siblings, 1 reply; 6+ messages in thread
From: Florian Westphal @ 2011-08-30 13:28 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

When calling iptables code using bridge netfilter, the assumption
that a non-confirmed skb->nfct is never shared does no longer hold,
as bridge code clones skbs when e.g. forwarding packets to multiple
bridge ports.

When NFQUEUE is used, we can BUG because nf_nat_setup_info can be
invoked simultaneously for the same conntrack:

[ 3196.798768] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:300!
[..]
[ 3196.798768]  [<ffffffff8120d73e>] ? nf_hook_slow+0x21a/0x282
[ 3196.798768]  [<ffffffffa03207e4>] ? br_handle_frame_finish+0x0/0x13b [bridge]
[ 3196.798768]  [<ffffffffa02a61a5>] ? alloc_null_binding+0x47/0x4c [iptable_nat]
[ 3196.798768]  [<ffffffffa02a64eb>] ? nf_nat_fn+0x193/0x1fb [iptable_nat]
[ 3196.798768]  [<ffffffff8120d4c5>] ? nf_iterate+0x40/0x9f
[ 3196.798768]  [<ffffffff8120d73e>] ? nf_hook_slow+0x21a/0x282
[ 3196.798768]  [<ffffffff81213c94>] ? ip_local_deliver_finish+0x0/0x1f1
[ 3196.798768]  [<ffffffff81213c94>] ? ip_local_deliver_finish+0x0/0x1f1
[ 3196.798768]  [<ffffffff8120d73e>] ? nf_hook_slow+0x21a/0x282
[ 3196.798768]  [<ffffffff8121369c>] ? ip_rcv_finish+0x0/0x340
[ 3196.798768]  [<ffffffff81213ed7>] ? ip_local_deliver+0x52/0x6c
[ 3196.798768]  [<ffffffff812139c2>] ? ip_rcv_finish+0x326/0x340
[ 3196.798768]  [<ffffffff81213c4f>] ? ip_rcv+0x273/0x2b8
[ 3196.798768]  [<ffffffff811f1384>] ? process_backlog+0x8d/0xc6
[ 3196.798768]  [<ffffffff811f2f85>] ? net_rx_action+0xa2/0x1cf
[ 3196.798768]  [<ffffffff8103d3c2>] ? __do_softirq+0x8b/0x10b
[ 3196.798768]  [<ffffffff8100c9dc>] ? call_softirq+0x1c/0x28
[ 3196.798768]  [<ffffffff8100dd15>] ? do_softirq+0x31/0x66
[ 3196.798768]  [<ffffffff8103d267>] ? irq_exit+0x36/0x78
[ 3196.798768]  [<ffffffff8100d41a>] ? do_IRQ+0xa0/0xb6
[ 3196.798768]  [<ffffffff8100c253>] ? ret_from_intr+0x0/0xa
[..]
[ 3196.798768] Code: be 2b 01 00 00 48 c7 c7 e8 cd 29 a0 e8 e8 d7 d9 e0 45 85 ff 49 8b 45 78 75 06 48 c1 e8 07 eb 04 48 c1 e8 08 83 e0 01 85 c0 74 04 <0f> 0b eb fe 49 8d 75 50 48 8d bc 24 80 00 00 00 e8 83 38 f7 ff
[ 3196.798768] RIP  [<ffffffffa029b68f>] nf_nat_setup_info+0x8a/0x564 [nf_nat]
[ 3196.798768]  RSP <ffff880001603bf0>

Work around this by forcing serialization via ct->lock and
accepting the packet silently.

Signed-off-by: Florian Westphal <fw@strlen.de>

---
 net/ipv4/netfilter/nf_nat_core.c |   39 ++++++++++++++++++++++++++++++++++++-
 1 files changed, 37 insertions(+), 2 deletions(-)

 Alternate patch to the 'make nfct of cloned skbs untracked' solution (which
 introduced unwanted module dependeny...)

 Main problem is that we can still end up confirming an
 already-confirmed conntrack, but this should be harmless.

diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index 9c71b27..bd89744 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -265,6 +265,35 @@ out:
 	rcu_read_unlock();
 }
 
+/* bridge netfilter uses cloned skbs when forwarding to multiple bridge ports.
+ * when userspace queueing is involved, we might try to set up NAT bindings
+ * on the same conntrack simultaneoulsy.  Can happen e.g. when broadcast has
+ * to be forwarded by the bridge but is also passed up the stack.
+ *
+ * Thus, when bridge netfilter is enabled, we need to serialize and silently
+ * accept the packet in the collision case.
+ */
+static inline bool nf_nat_bridge_lock(struct nf_conn *ct, enum nf_nat_manip_type maniptype)
+{
+#ifdef CONFIG_BRIDGE_NETFILTER
+	spin_lock_bh(&ct->lock);
+
+	if (unlikely(nf_nat_initialized(ct, maniptype))) {
+		pr_debug("race with cloned skb? Not adding NAT extension\n");
+		spin_unlock_bh(&ct->lock);
+		return false;
+	}
+#endif
+	return true;
+}
+
+static inline void nf_nat_bridge_unlock(struct nf_conn *ct)
+{
+#ifdef CONFIG_BRIDGE_NETFILTER
+	spin_unlock_bh(&ct->lock);
+#endif
+}
+
 unsigned int
 nf_nat_setup_info(struct nf_conn *ct,
 		  const struct nf_nat_range *range,
@@ -274,18 +303,23 @@ nf_nat_setup_info(struct nf_conn *ct,
 	struct nf_conntrack_tuple curr_tuple, new_tuple;
 	struct nf_conn_nat *nat;
 
+	NF_CT_ASSERT(maniptype == IP_NAT_MANIP_SRC ||
+		     maniptype == IP_NAT_MANIP_DST);
+
+	if (!nf_nat_bridge_lock(ct, maniptype))
+		return NF_ACCEPT;
+
 	/* nat helper or nfctnetlink also setup binding */
 	nat = nfct_nat(ct);
 	if (!nat) {
 		nat = nf_ct_ext_add(ct, NF_CT_EXT_NAT, GFP_ATOMIC);
 		if (nat == NULL) {
+			nf_nat_bridge_unlock(ct);
 			pr_debug("failed to add NAT extension\n");
 			return NF_ACCEPT;
 		}
 	}
 
-	NF_CT_ASSERT(maniptype == IP_NAT_MANIP_SRC ||
-		     maniptype == IP_NAT_MANIP_DST);
 	BUG_ON(nf_nat_initialized(ct, maniptype));
 
 	/* What we've got will look like inverse of reply. Normally
@@ -332,6 +366,7 @@ nf_nat_setup_info(struct nf_conn *ct,
 	else
 		ct->status |= IPS_SRC_NAT_DONE;
 
+	nf_nat_bridge_unlock(ct);
 	return NF_ACCEPT;
 }
 EXPORT_SYMBOL(nf_nat_setup_info);
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/1] netfilter: nat: work around shared nfct struct in bridge case
  2011-08-30 13:28 [PATCH 1/1] netfilter: nat: work around shared nfct struct in bridge case Florian Westphal
@ 2011-08-30 13:44 ` Patrick McHardy
  2011-08-30 14:00   ` Florian Westphal
  0 siblings, 1 reply; 6+ messages in thread
From: Patrick McHardy @ 2011-08-30 13:44 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On 30.08.2011 15:28, Florian Westphal wrote:
> diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
> index 9c71b27..bd89744 100644
> --- a/net/ipv4/netfilter/nf_nat_core.c
> +++ b/net/ipv4/netfilter/nf_nat_core.c
> @@ -265,6 +265,35 @@ out:
>  	rcu_read_unlock();
>  }
>  
> +/* bridge netfilter uses cloned skbs when forwarding to multiple bridge ports.
> + * when userspace queueing is involved, we might try to set up NAT bindings
> + * on the same conntrack simultaneoulsy.  Can happen e.g. when broadcast has
> + * to be forwarded by the bridge but is also passed up the stack.
> + *
> + * Thus, when bridge netfilter is enabled, we need to serialize and silently
> + * accept the packet in the collision case.
> + */
> +static inline bool nf_nat_bridge_lock(struct nf_conn *ct, enum nf_nat_manip_type maniptype)
> +{
> +#ifdef CONFIG_BRIDGE_NETFILTER
> +	spin_lock_bh(&ct->lock);
> +
> +	if (unlikely(nf_nat_initialized(ct, maniptype))) {
> +		pr_debug("race with cloned skb? Not adding NAT extension\n");
> +		spin_unlock_bh(&ct->lock);
> +		return false;
> +	}
> +#endif
> +	return true;
> +}

Ugh, what beauty :) I can't see a much nicer way how to fix this right
now, but I really want to have another look for different possibilities
before applying this.

Unfortunately pushing this down to nf_nat_setup_info() could only fix
the BUG(), but we'd still have a possible memory leak when adding the
NAT extension simulaneously on multiple CPUs. I also fear this is not
going to be the only problem caused by breaking the "unconfirmed means
non-shared nfct" assumption.

> +
> +static inline void nf_nat_bridge_unlock(struct nf_conn *ct)
> +{
> +#ifdef CONFIG_BRIDGE_NETFILTER
> +	spin_unlock_bh(&ct->lock);
> +#endif
> +}
> +
>  unsigned int
>  nf_nat_setup_info(struct nf_conn *ct,
>  		  const struct nf_nat_range *range,
> @@ -274,18 +303,23 @@ nf_nat_setup_info(struct nf_conn *ct,
>  	struct nf_conntrack_tuple curr_tuple, new_tuple;
>  	struct nf_conn_nat *nat;
>  
> +	NF_CT_ASSERT(maniptype == IP_NAT_MANIP_SRC ||
> +		     maniptype == IP_NAT_MANIP_DST);
> +
> +	if (!nf_nat_bridge_lock(ct, maniptype))
> +		return NF_ACCEPT;
> +
>  	/* nat helper or nfctnetlink also setup binding */
>  	nat = nfct_nat(ct);
>  	if (!nat) {
>  		nat = nf_ct_ext_add(ct, NF_CT_EXT_NAT, GFP_ATOMIC);
>  		if (nat == NULL) {
> +			nf_nat_bridge_unlock(ct);
>  			pr_debug("failed to add NAT extension\n");
>  			return NF_ACCEPT;
>  		}
>  	}
>  
> -	NF_CT_ASSERT(maniptype == IP_NAT_MANIP_SRC ||
> -		     maniptype == IP_NAT_MANIP_DST);
>  	BUG_ON(nf_nat_initialized(ct, maniptype));
>  
>  	/* What we've got will look like inverse of reply. Normally
> @@ -332,6 +366,7 @@ nf_nat_setup_info(struct nf_conn *ct,
>  	else
>  		ct->status |= IPS_SRC_NAT_DONE;
>  
> +	nf_nat_bridge_unlock(ct);
>  	return NF_ACCEPT;
>  }
>  EXPORT_SYMBOL(nf_nat_setup_info);


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/1] netfilter: nat: work around shared nfct struct in bridge case
  2011-08-30 13:44 ` Patrick McHardy
@ 2011-08-30 14:00   ` Florian Westphal
  2011-08-30 14:07     ` Patrick McHardy
  0 siblings, 1 reply; 6+ messages in thread
From: Florian Westphal @ 2011-08-30 14:00 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Florian Westphal, netfilter-devel

Patrick McHardy <kaber@trash.net> wrote:
> On 30.08.2011 15:28, Florian Westphal wrote:
> > diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
> > index 9c71b27..bd89744 100644
> > --- a/net/ipv4/netfilter/nf_nat_core.c
> > +++ b/net/ipv4/netfilter/nf_nat_core.c
> > @@ -265,6 +265,35 @@ out:
> >  	rcu_read_unlock();
> >  }
> >  
> > +/* bridge netfilter uses cloned skbs when forwarding to multiple bridge ports.
> > + * when userspace queueing is involved, we might try to set up NAT bindings
> > + * on the same conntrack simultaneoulsy.  Can happen e.g. when broadcast has
> > + * to be forwarded by the bridge but is also passed up the stack.
> > + *
> > + * Thus, when bridge netfilter is enabled, we need to serialize and silently
> > + * accept the packet in the collision case.
> > + */
> > +static inline bool nf_nat_bridge_lock(struct nf_conn *ct, enum nf_nat_manip_type maniptype)
> > +{
> > +#ifdef CONFIG_BRIDGE_NETFILTER
> > +	spin_lock_bh(&ct->lock);
> > +
> > +	if (unlikely(nf_nat_initialized(ct, maniptype))) {
> > +		pr_debug("race with cloned skb? Not adding NAT extension\n");
> > +		spin_unlock_bh(&ct->lock);
> > +		return false;
> > +	}
> > +#endif
> > +	return true;
> > +}
> 
> Ugh, what beauty :) I can't see a much nicer way how to fix this right
> now, but I really want to have another look for different possibilities
> before applying this.

Sure.  In fact, I really hope that this patch isn't needed :)

> Unfortunately pushing this down to nf_nat_setup_info() could only fix
> the BUG(), but we'd still have a possible memory leak when adding the
> NAT extension simulaneously on multiple CPUs.

nf_ct_ext_add() is only invoked once due to the spinlock serialization,
so I don't see why we can memleak here.

In case nf_ct_ext_add() we already return NF_ACCEPT, so I think this
part is OK.

> I also fear this is not
> going to be the only problem caused by breaking the "unconfirmed means
> non-shared nfct" assumption.

Agreed. Perhaps we can solve the module dependeny issue of the "unshare"
approach.  In fact, if invalid state for the clones would be acceptable
then the dependency should go away; AFAICS nf_conntrack_untracked is the
only nf-related symbol required by br_netfilter.o not in netfilter/core.c.

Thanks for looking into this.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/1] netfilter: nat: work around shared nfct struct in bridge case
  2011-08-30 14:00   ` Florian Westphal
@ 2011-08-30 14:07     ` Patrick McHardy
  2011-08-30 15:27       ` Florian Westphal
  0 siblings, 1 reply; 6+ messages in thread
From: Patrick McHardy @ 2011-08-30 14:07 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On 30.08.2011 16:00, Florian Westphal wrote:
> Patrick McHardy <kaber@trash.net> wrote:
>> On 30.08.2011 15:28, Florian Westphal wrote:
>>> diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
>>> index 9c71b27..bd89744 100644
>>> --- a/net/ipv4/netfilter/nf_nat_core.c
>>> +++ b/net/ipv4/netfilter/nf_nat_core.c
>>> @@ -265,6 +265,35 @@ out:
>>>  	rcu_read_unlock();
>>>  }
>>>  
>>> +/* bridge netfilter uses cloned skbs when forwarding to multiple bridge ports.
>>> + * when userspace queueing is involved, we might try to set up NAT bindings
>>> + * on the same conntrack simultaneoulsy.  Can happen e.g. when broadcast has
>>> + * to be forwarded by the bridge but is also passed up the stack.
>>> + *
>>> + * Thus, when bridge netfilter is enabled, we need to serialize and silently
>>> + * accept the packet in the collision case.
>>> + */
>>> +static inline bool nf_nat_bridge_lock(struct nf_conn *ct, enum nf_nat_manip_type maniptype)
>>> +{
>>> +#ifdef CONFIG_BRIDGE_NETFILTER
>>> +	spin_lock_bh(&ct->lock);
>>> +
>>> +	if (unlikely(nf_nat_initialized(ct, maniptype))) {
>>> +		pr_debug("race with cloned skb? Not adding NAT extension\n");
>>> +		spin_unlock_bh(&ct->lock);
>>> +		return false;
>>> +	}
>>> +#endif
>>> +	return true;
>>> +}
>>
>> Ugh, what beauty :) I can't see a much nicer way how to fix this right
>> now, but I really want to have another look for different possibilities
>> before applying this.
> 
> Sure.  In fact, I really hope that this patch isn't needed :)
> 
>> Unfortunately pushing this down to nf_nat_setup_info() could only fix
>> the BUG(), but we'd still have a possible memory leak when adding the
>> NAT extension simulaneously on multiple CPUs.
> 
> nf_ct_ext_add() is only invoked once due to the spinlock serialization,
> so I don't see why we can memleak here.

Yes, when using your patch, otherwise (when handling this case in
nf_nat_setup_info() we might invoke it multiple times simultaneously
though.

> In case nf_ct_ext_add() we already return NF_ACCEPT, so I think this
> part is OK.
> 
>> I also fear this is not
>> going to be the only problem caused by breaking the "unconfirmed means
>> non-shared nfct" assumption.
> 
> Agreed. Perhaps we can solve the module dependeny issue of the "unshare"
> approach.  In fact, if invalid state for the clones would be acceptable
> then the dependency should go away; AFAICS nf_conntrack_untracked is the
> only nf-related symbol required by br_netfilter.o not in netfilter/core.c.

I don't think the clones should have invalid state, even untracked is
very questionable since all packets should have NAT applied to them in
the same way, connmarks might be used etc.

We probably need to restore the above mentioned assumption somehow. One
way would be to serialize reinjection of packets belonging to
unconfirmed conntracks in nf_reinject or the queueing modules. Conntrack
related stuff doesn't really belong there, but it seems like the easiest
and safest fix to me.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/1] netfilter: nat: work around shared nfct struct in bridge case
  2011-08-30 14:07     ` Patrick McHardy
@ 2011-08-30 15:27       ` Florian Westphal
  2011-08-31 10:05         ` Patrick McHardy
  0 siblings, 1 reply; 6+ messages in thread
From: Florian Westphal @ 2011-08-30 15:27 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Florian Westphal, netfilter-devel

Patrick McHardy <kaber@trash.net> wrote:
> Yes, when using your patch, otherwise (when handling this case in
> nf_nat_setup_info() we might invoke it multiple times simultaneously
> though.
> 
> > In case nf_ct_ext_add() we already return NF_ACCEPT, so I think this
> > part is OK.
> > 
> >> I also fear this is not
> >> going to be the only problem caused by breaking the "unconfirmed means
> >> non-shared nfct" assumption.
> > 
> > Agreed. Perhaps we can solve the module dependeny issue of the "unshare"
> > approach.  In fact, if invalid state for the clones would be acceptable
> > then the dependency should go away; AFAICS nf_conntrack_untracked is the
> > only nf-related symbol required by br_netfilter.o not in netfilter/core.c.
> 
> I don't think the clones should have invalid state, even untracked is
> very questionable since all packets should have NAT applied to them in
> the same way, connmarks might be used etc.

Right, but this is probably only going to be fixable in a "try to do the
best without crashing", because even without userspace queueing
there are cases where this is not deterministic:

-m physdev --physdev-out eth1 -j SNAT ...
-m physdev --physdev-out eth2 -j SNAT ...

... will match whatever bridge port the packet will be sent out on
first.

Also, before 87557c18ac36241b596984589a0889c5c4bf916c
forward ran after pass_frame_up() in which case post_routing is
not involved.

I am afraid we might first need to find out what should happen in
the "delivered locally and forwarded" case before we can figure
out what a sane fix might look like.

> We probably need to restore the above mentioned assumption somehow. One
> way would be to serialize reinjection of packets belonging to
> unconfirmed conntracks in nf_reinject or the queueing modules. Conntrack
> related stuff doesn't really belong there, but it seems like the easiest
> and safest fix to me.

Only serializing reinject may not be enough, since some packets might not be
queued (e.g. when queueing only in forward, or only when dealing with
a particular bridge port); in which case we'd still race.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/1] netfilter: nat: work around shared nfct struct in bridge case
  2011-08-30 15:27       ` Florian Westphal
@ 2011-08-31 10:05         ` Patrick McHardy
  0 siblings, 0 replies; 6+ messages in thread
From: Patrick McHardy @ 2011-08-31 10:05 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On 30.08.2011 17:27, Florian Westphal wrote:
> Patrick McHardy <kaber@trash.net> wrote:
>> Yes, when using your patch, otherwise (when handling this case in
>> nf_nat_setup_info() we might invoke it multiple times simultaneously
>> though.
>>
>>> In case nf_ct_ext_add() we already return NF_ACCEPT, so I think this
>>> part is OK.
>>>
>>>> I also fear this is not
>>>> going to be the only problem caused by breaking the "unconfirmed means
>>>> non-shared nfct" assumption.
>>>
>>> Agreed. Perhaps we can solve the module dependeny issue of the "unshare"
>>> approach.  In fact, if invalid state for the clones would be acceptable
>>> then the dependency should go away; AFAICS nf_conntrack_untracked is the
>>> only nf-related symbol required by br_netfilter.o not in netfilter/core.c.
>>
>> I don't think the clones should have invalid state, even untracked is
>> very questionable since all packets should have NAT applied to them in
>> the same way, connmarks might be used etc.
> 
> Right, but this is probably only going to be fixable in a "try to do the
> best without crashing", because even without userspace queueing
> there are cases where this is not deterministic:
> 
> -m physdev --physdev-out eth1 -j SNAT ...
> -m physdev --physdev-out eth2 -j SNAT ...
> 
> ... will match whatever bridge port the packet will be sent out on
> first.

Yes, but setting up the rules properly is responsibility of the
user. Usually you'd just have a regular NAT rule, in which case
you normally want flooded packets to be treated similar.

> Also, before 87557c18ac36241b596984589a0889c5c4bf916c
> forward ran after pass_frame_up() in which case post_routing is
> not involved.
> 
> I am afraid we might first need to find out what should happen in
> the "delivered locally and forwarded" case before we can figure
> out what a sane fix might look like.

I don't really see the problem, the user has to set up his rules
properly.

>> We probably need to restore the above mentioned assumption somehow. One
>> way would be to serialize reinjection of packets belonging to
>> unconfirmed conntracks in nf_reinject or the queueing modules. Conntrack
>> related stuff doesn't really belong there, but it seems like the easiest
>> and safest fix to me.
> 
> Only serializing reinject may not be enough, since some packets might not be
> queued (e.g. when queueing only in forward, or only when dealing with
> a particular bridge port); in which case we'd still race.

True, that case has also always been broken. I don't see a way
to properly fix this right now, need to think about it some more.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-08-31 10:05 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-08-30 13:28 [PATCH 1/1] netfilter: nat: work around shared nfct struct in bridge case Florian Westphal
2011-08-30 13:44 ` Patrick McHardy
2011-08-30 14:00   ` Florian Westphal
2011-08-30 14:07     ` Patrick McHardy
2011-08-30 15:27       ` Florian Westphal
2011-08-31 10:05         ` Patrick McHardy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).