[PATCH RFC 0/3] Allow postponed netfilter handling for socket matches

netfilter-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches
@ 2015-09-16 15:42 Daniel Mack
  2015-09-16 15:42 ` [PATCH RFC 1/3] netfilter: add socket to struct nft_pktinfo Daniel Mack
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Daniel Mack @ 2015-09-16 15:42 UTC (permalink / raw)
  To: pablo; +Cc: daniel, netfilter-devel, netdev, fw, balazs.scheidler,
	Daniel Mack

I'm re-addressing the issue of matching socket meta information for
non-established sockets that has been discussed a while ago:

  http://article.gmane.org/gmane.comp.security.firewalls.netfilter.devel/56877

Being able to reliably match on net_cls cgroup ids is crucial in
order to build a per-application or per-container firewall rules
which don't leak ingress packets. Such a feature would be very
useful to have.

A previous attempt to fix the currently existing issues was to call
out to the early demuxing helper functions from the meta matching
callbacks, but that doesn't suffice because it doesn't address the
case of multicast UDP and other, more complex lookup methods 
implemented in various protocol handlers.

This patch set outlines a different approach by adding a flag to
'struct sk_buff' called 'nf_postponed'. This flag is set by
nft_meta_get_eval() in case a decision cannot be made due to a missing
skb->sk. skbs flagged that way will then be ran through the netfilter
chain processor again after the protocol handlers did the real socket
lookup. A small addition to 'struct nft_pktinfo' is needed so that the
matching callbacks can access the socket that was passed into
nf_hook().

Note that the new flag does not actually bloat 'struct skb_buff',
because it still fits into the 'flags1' bitfield. Also, the extra
netfilter chain iteration will not be done by any subsequent packet in
the same stream, as for those, the early demux code will set skb->sk.

The patch set is obviously not yet finished, because a lot more
protocol handlers need to be patched. Right now, I only addressed
tcp_ipv4. Before I do that, I want to get some feedback on the
approach, so please let me know what you think.

Thanks,
Daniel

Daniel Mack (3):
  netfilter: add socket to struct nft_pktinfo
  netfilter: nft_meta: mark skbs for postponed filter processing
  net: tcp_ipv4: re-run netfilter chains for marked skbs

 include/linux/skbuff.h            |  3 ++-
 include/net/netfilter/nf_tables.h |  2 ++
 net/ipv4/tcp_ipv4.c               | 10 ++++++++++
 net/netfilter/nft_meta.c          |  9 ++++++---
 4 files changed, 20 insertions(+), 4 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH RFC 1/3] netfilter: add socket to struct nft_pktinfo
  2015-09-16 15:42 [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches Daniel Mack
@ 2015-09-16 15:42 ` Daniel Mack
  2015-09-16 15:42 ` [PATCH RFC 2/3] netfilter: nft_meta: mark skbs for postponed filter processing Daniel Mack
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Daniel Mack @ 2015-09-16 15:42 UTC (permalink / raw)
  To: pablo; +Cc: daniel, netfilter-devel, netdev, fw, balazs.scheidler,
	Daniel Mack

The high-level netfilter hook API already enables users to pass a socket,
but that information is lost when the chains are walked.

In order to let internal eval callbacks use the passed filter rather than
skb->sk, add a pointer of type 'struct sock' to 'struct nft_pktinfo' and
set that field via nft_set_pktinfo().

This allows us to run filter chains from situations where skb->sk is unset.
Fall back to skb->sk in case state->sk is NULL, so filter callbacks can be
written in a generic way.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/net/netfilter/nf_tables.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index aa8bee7..05e97ed 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -13,6 +13,7 @@
 #define NFT_JUMP_STACK_SIZE	16
 
 struct nft_pktinfo {
+	struct sock			*sk;
 	struct sk_buff			*skb;
 	const struct net_device		*in;
 	const struct net_device		*out;
@@ -29,6 +30,7 @@ static inline void nft_set_pktinfo(struct nft_pktinfo *pkt,
 				   struct sk_buff *skb,
 				   const struct nf_hook_state *state)
 {
+	pkt->sk = state->sk ?: skb->sk;
 	pkt->skb = skb;
 	pkt->in = pkt->xt.in = state->in;
 	pkt->out = pkt->xt.out = state->out;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH RFC 2/3] netfilter: nft_meta: mark skbs for postponed filter processing
  2015-09-16 15:42 [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches Daniel Mack
  2015-09-16 15:42 ` [PATCH RFC 1/3] netfilter: add socket to struct nft_pktinfo Daniel Mack
@ 2015-09-16 15:42 ` Daniel Mack
  2015-09-16 15:43 ` [PATCH RFC 3/3] net: tcp_ipv4: re-run netfilter chains for marked skbs Daniel Mack
  2015-09-16 21:21 ` [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches Florian Westphal
  3 siblings, 0 replies; 9+ messages in thread
From: Daniel Mack @ 2015-09-16 15:42 UTC (permalink / raw)
  To: pablo; +Cc: daniel, netfilter-devel, netdev, fw, balazs.scheidler,
	Daniel Mack

When the cgroup matching code in nft_meta is called without a socket to
look at, it currently bails out and lets the packet pass. This is bad,
because the reason for skb->sk being NULL is simply that the packet was
directed to a socket that hasn't been looked up yet by early demux.

This patch does two things:

 a) it uses the newly introduced pkt->sk pointer rather than skb->sk
    to check for the net class ID. This allows us to look at the socket
    the user passed into nf_hook().

 b) in case the socket can't be accessed, it marks the skb as
    'nf_postponed', so that later dispatchers have a chance to
    re-iterate the chain for such packets, after a full demux was
    conducted.

Note that the added flag in 'struct skb' does not increase the size
of the struct, as it fits in the 'flags1' bitfield.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/linux/skbuff.h   | 3 ++-
 net/netfilter/nft_meta.c | 9 ++++++---
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2738d35..3590101 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -584,7 +584,8 @@ struct sk_buff {
 				fclone:2,
 				peeked:1,
 				head_frag:1,
-				xmit_more:1;
+				xmit_more:1,
+				nf_postponed:1;
 	/* one bit hole */
 	kmemcheck_bitfield_end(flags1);
 
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index cb2f13e..33b8d23 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -29,8 +29,9 @@ void nft_meta_get_eval(const struct nft_expr *expr,
 		       const struct nft_pktinfo *pkt)
 {
 	const struct nft_meta *priv = nft_expr_priv(expr);
-	const struct sk_buff *skb = pkt->skb;
 	const struct net_device *in = pkt->in, *out = pkt->out;
+	struct sk_buff *skb = pkt->skb;
+	struct sock *sk = pkt->sk;
 	u32 *dest = &regs->data[priv->dreg];
 
 	switch (priv->key) {
@@ -168,9 +169,11 @@ void nft_meta_get_eval(const struct nft_expr *expr,
 		break;
 #ifdef CONFIG_CGROUP_NET_CLASSID
 	case NFT_META_CGROUP:
-		if (skb->sk == NULL || !sk_fullsock(skb->sk))
+		if (sk == NULL || !sk_fullsock(sk)) {
+			skb->nf_postponed = 1;
 			goto err;
-		*dest = skb->sk->sk_classid;
+		}
+		*dest = sk->sk_classid;
 		break;
 #endif
 	default:
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH RFC 3/3] net: tcp_ipv4: re-run netfilter chains for marked skbs
  2015-09-16 15:42 [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches Daniel Mack
  2015-09-16 15:42 ` [PATCH RFC 1/3] netfilter: add socket to struct nft_pktinfo Daniel Mack
  2015-09-16 15:42 ` [PATCH RFC 2/3] netfilter: nft_meta: mark skbs for postponed filter processing Daniel Mack
@ 2015-09-16 15:43 ` Daniel Mack
  2015-09-16 21:21 ` [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches Florian Westphal
  3 siblings, 0 replies; 9+ messages in thread
From: Daniel Mack @ 2015-09-16 15:43 UTC (permalink / raw)
  To: pablo; +Cc: daniel, netfilter-devel, netdev, fw, balazs.scheidler,
	Daniel Mack

When an skb has been marked for later re-iteration through netfilter,
do that after __inet_lookup_skb() has been called. This allows packets
sent to unconnected sockets to be filtered reliably.

Note that this will never happen for subsequent packets in the same
stream, as skb->sk will be set due to early demux, and hence
skb->nf_postponed will remain 0.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 net/ipv4/tcp_ipv4.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 93898e0..61e0cb4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -78,6 +78,7 @@
 
 #include <linux/inet.h>
 #include <linux/ipv6.h>
+#include <linux/netfilter.h>
 #include <linux/stddef.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
@@ -1594,6 +1595,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (unlikely(skb->nf_postponed)) {
+		ret = nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_IN, sk,
+			      skb, skb->dev, NULL, NULL);
+		if (ret != 1) {
+			sock_put(sk);
+			return 0;
+		}
+	}
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches
  2015-09-16 15:42 [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches Daniel Mack
                   ` (2 preceding siblings ...)
  2015-09-16 15:43 ` [PATCH RFC 3/3] net: tcp_ipv4: re-run netfilter chains for marked skbs Daniel Mack
@ 2015-09-16 21:21 ` Florian Westphal
  2015-09-17 10:04   ` Daniel Mack
  3 siblings, 1 reply; 9+ messages in thread
From: Florian Westphal @ 2015-09-16 21:21 UTC (permalink / raw)
  To: Daniel Mack; +Cc: pablo, daniel, netfilter-devel, netdev, fw, balazs.scheidler

Daniel Mack <daniel@zonque.org> wrote:
> I'm re-addressing the issue of matching socket meta information for
> non-established sockets that has been discussed a while ago:
> 
>   http://article.gmane.org/gmane.comp.security.firewalls.netfilter.devel/56877
> 
> Being able to reliably match on net_cls cgroup ids is crucial in
> order to build a per-application or per-container firewall rules
> which don't leak ingress packets. Such a feature would be very
> useful to have.

Could you clarify what 'which don't leak ingress packets' means?

> A previous attempt to fix the currently existing issues was to call
> out to the early demuxing helper functions from the meta matching
> callbacks, but that doesn't suffice because it doesn't address the
> case of multicast UDP and other, more complex lookup methods
> implemented in various protocol handlers.

Yes, but see below.

> This patch set outlines a different approach by adding a flag to
> 'struct sk_buff' called 'nf_postponed'. This flag is set by
> nft_meta_get_eval() in case a decision cannot be made due to a missing
> skb->sk. skbs flagged that way will then be ran through the netfilter
> chain processor again after the protocol handlers did the real socket
> lookup. A small addition to 'struct nft_pktinfo' is needed so that the
> matching callbacks can access the socket that was passed into
> nf_hook().
> 
> Note that the new flag does not actually bloat 'struct skb_buff',
> because it still fits into the 'flags1' bitfield. Also, the extra
> netfilter chain iteration will not be done by any subsequent packet in
> the same stream, as for those, the early demux code will set skb->sk.
> 
> The patch set is obviously not yet finished, because a lot more
> protocol handlers need to be patched. Right now, I only addressed
> tcp_ipv4. Before I do that, I want to get some feedback on the
> approach, so please let me know what you think.

I think there are several issues.

implementation problems:
- i'm not sure its legal to call the hook input with skb->sk locked,
  some matches might want to aquire it.
- what makes NFT_META_CGROUP special? (or was that just an example?)

design issues:
The assumption seems to be that a given skb can always be mapped to a
particular socket, and hence a cgroup.

Thats not necessarily the case, e.g. with broad-/multicasting or when
the socket is e.g. in timewait state.

Some skbs will now travel INPUT hooks twice.

And once you'd extend this so that we re-invoke nf hooks for mcast
packets, for each socket they've been received on, you change netfilter
behaviour again (one skb, one traversal -> n traversals of ruleset, one
for each sk).

I think that this makes it a non-starter, sorry.

I would much rather see nft_demux_{udp,tcp,sctp,dccp,...}.c which moves
early-demux-esque code into the nft ruleset.

Then you could do something like

nft add rule ip filter input meta l4proto tcp demux meta cgroup 42

The caveat being that even in this case we cannot guarantee
that skb->sk is set afterwards, or that a cgroup can be derived from it.

Iff you absolutely need this, I'd seriously entertain the idea of adding
NFPROTO_L4_TCP, etc, ... or, maybe better, allow to attach nft ruleset
as a socket filter.

But really, at that point, a much better question would be wheter net
cgroups are the answer to whatever the question was, or what problem we
are attempting to address here...

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches
  2015-09-16 21:21 ` [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches Florian Westphal
@ 2015-09-17 10:04   ` Daniel Mack
  2015-09-17 16:00     ` Florian Westphal
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Mack @ 2015-09-17 10:04 UTC (permalink / raw)
  To: Florian Westphal; +Cc: pablo, daniel, netfilter-devel, netdev, balazs.scheidler

Hi Florian,

On 09/16/2015 11:21 PM, Florian Westphal wrote:
> Daniel Mack <daniel@zonque.org> wrote:
>> I'm re-addressing the issue of matching socket meta information for
>> non-established sockets that has been discussed a while ago:
>>
>>   http://article.gmane.org/gmane.comp.security.firewalls.netfilter.devel/56877
>>
>> Being able to reliably match on net_cls cgroup ids is crucial in
>> order to build a per-application or per-container firewall rules
>> which don't leak ingress packets. Such a feature would be very
>> useful to have.
> 
> Could you clarify what 'which don't leak ingress packets' means?

Well, currently, the existing cgroups matches only filter packets that
are sent to an established socket. All other packets are ignored. So
when users install such matches as advertised by the documented
examples, and the chain policy is permissive, the firewall 'leaks'
packets, which is unexpected.

>> The patch set is obviously not yet finished, because a lot more
>> protocol handlers need to be patched. Right now, I only addressed
>> tcp_ipv4. Before I do that, I want to get some feedback on the
>> approach, so please let me know what you think.
> 
> I think there are several issues.
> 
> implementation problems:
> - i'm not sure its legal to call the hook input with skb->sk locked,
>   some matches might want to aquire it.

In the code as it stands after my patch set, I don't see where skb->sk
is locked? After all, skb->sk is NULL, even on the 2nd iteration, which
is why I patched the newly looked up socket to be available in the nf hook.

> - what makes NFT_META_CGROUP special? (or was that just an example?)

It's what I want to get working, but other 'meta' hooks can be made
working in a similar fashion.

> design issues:
> The assumption seems to be that a given skb can always be mapped to a
> particular socket, and hence a cgroup.
>
> Thats not necessarily the case, e.g. with broad-/multicasting or when
> the socket is e.g. in timewait state.

Yes, that's true. The idea for multicast would be to just drop the
cloned skb instead of delivering it to the final socket.

> Some skbs will now travel INPUT hooks twice.
>
> And once you'd extend this so that we re-invoke nf hooks for mcast
> packets, for each socket they've been received on, you change netfilter
> behaviour again (one skb, one traversal -> n traversals of ruleset, one
> for each sk).
> 
> I think that this makes it a non-starter, sorry.

Hmm, I see your point.

> I would much rather see nft_demux_{udp,tcp,sctp,dccp,...}.c which moves
> early-demux-esque code into the nft ruleset.
> 
> Then you could do something like
> 
> nft add rule ip filter input meta l4proto tcp demux meta cgroup 42

Ok, but how would that be different from the unconditional demuxing
patches we've kicked around earlier, especially when it comes to
multicast sockets? Could you explain what you have in mind here?

> The caveat being that even in this case we cannot guarantee
> that skb->sk is set afterwards, or that a cgroup can be derived from it.
> 
> Iff you absolutely need this, I'd seriously entertain the idea of adding
> NFPROTO_L4_TCP, etc, ... or, maybe better, allow to attach nft ruleset
> as a socket filter.

That would be a new netfilter hook then, something that is called after
LOCAL_IN, for ingress only? In a sense, it would be called from the
protocol handlers, just as my patches do right now, but instead of
conditionally re-iterating the same rules again, we would walk a
different chain?

> But really, at that point, a much better question would be wheter net
> cgroups are the answer to whatever the question was, or what problem we
> are attempting to address here...

The idea is simply to have a packet filter which is based on information
derived from the task that sends or will eventually handle the packet.
IOW: We want to be able to install netfilter rules that apply to all
packets received or sent by tasks that match a certain criteria, without
modifying the sources of those tasks.

As we already have net_cls hooked up in netfilter rules, it seems
easiest to just get this working. But with the multiple approaches we
already had, it appears the real fix needs more thinking.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches
  2015-09-17 10:04   ` Daniel Mack
@ 2015-09-17 16:00     ` Florian Westphal
  2015-09-21 16:52       ` Daniel Mack
  0 siblings, 1 reply; 9+ messages in thread
From: Florian Westphal @ 2015-09-17 16:00 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Florian Westphal, pablo, daniel, netfilter-devel, netdev,
	balazs.scheidler

Daniel Mack <daniel@zonque.org> wrote:
> Hi Florian,
> 
> On 09/16/2015 11:21 PM, Florian Westphal wrote:
> > Daniel Mack <daniel@zonque.org> wrote:
> >> I'm re-addressing the issue of matching socket meta information for
> >> non-established sockets that has been discussed a while ago:
> >>
> >>   http://article.gmane.org/gmane.comp.security.firewalls.netfilter.devel/56877
> >>
> >> Being able to reliably match on net_cls cgroup ids is crucial in
> >> order to build a per-application or per-container firewall rules
> >> which don't leak ingress packets. Such a feature would be very
> >> useful to have.
> > 
> > Could you clarify what 'which don't leak ingress packets' means?
> 
> Well, currently, the existing cgroups matches only filter packets that
> are sent to an established socket. All other packets are ignored. So
> when users install such matches as advertised by the documented
> examples, and the chain policy is permissive, the firewall 'leaks'
> packets, which is unexpected.

Then 'the documentation' needs fixing.
cgroup (and anything related to sk data, including uid, etc.) is
not guaranteed to work.

We can only match what is available in the packet payload, and
some extra info that the stack can make available to us (e.g.
VLAN id, or skb->sk in some cases on output) and conntrack state plus
whatever extra data conntrack allows to attach.

> >> The patch set is obviously not yet finished, because a lot more
> >> protocol handlers need to be patched. Right now, I only addressed
> >> tcp_ipv4. Before I do that, I want to get some feedback on the
> >> approach, so please let me know what you think.
> > 
> > I think there are several issues.
> > 
> > implementation problems:
> > - i'm not sure its legal to call the hook input with skb->sk locked,
> >   some matches might want to aquire it.
> 
> In the code as it stands after my patch set, I don't see where skb->sk
> is locked?

True.

[..]

> > design issues:
> > The assumption seems to be that a given skb can always be mapped to a
> > particular socket, and hence a cgroup.
> >
> > Thats not necessarily the case, e.g. with broad-/multicasting or when
> > the socket is e.g. in timewait state.
> 
> Yes, that's true. The idea for multicast would be to just drop the
> cloned skb instead of delivering it to the final socket.

-v please.

> > I would much rather see nft_demux_{udp,tcp,sctp,dccp,...}.c which moves
> > early-demux-esque code into the nft ruleset.
> > 
> > Then you could do something like
> > 
> > nft add rule ip filter input meta l4proto tcp demux meta cgroup 42
> 
> Ok, but how would that be different from the unconditional demuxing
> patches we've kicked around earlier, especially when it comes to
> multicast sockets? Could you explain what you have in mind here?

Two things:
- keep it out of core network stack
- make it explicit so we can document that 'demux' keyword is fishy and
will not work reliably.

F.e. I don't see how mcast could ever be made to work except by adding
an entirely new filtering mechanism/new hooks in core stack.

> > The caveat being that even in this case we cannot guarantee
> > that skb->sk is set afterwards, or that a cgroup can be derived from it.
> > 
> > Iff you absolutely need this, I'd seriously entertain the idea of adding
> > NFPROTO_L4_TCP, etc, ... or, maybe better, allow to attach nft ruleset
> > as a socket filter.
> 
> That would be a new netfilter hook then, something that is called after
> LOCAL_IN, for ingress only? In a sense, it would be called from the
> protocol handlers, just as my patches do right now, but instead of
> conditionally re-iterating the same rules again, we would walk a
> different chain?

Yes, something like that.  Obviously, you'll need to dru^W brib^W
convince a LOT of people before that could ever fly.

I think we should not do this and that this 'match on ingress sk
properties' is just bad[tm].

f.e. you'd also have to move all of the stuff you want into
sock_common ... 8-(

> > But really, at that point, a much better question would be wheter net
> > cgroups are the answer to whatever the question was, or what problem we
> > are attempting to address here...
> 
> The idea is simply to have a packet filter which is based on information
> derived from the task that sends or will eventually handle the packet.

Starts to smell like snet (https://lwn.net/Articles/441587/)

> IOW: We want to be able to install netfilter rules that apply to all
> packets received or sent by tasks that match a certain criteria, without
> modifying the sources of those tasks.

Sorry, I think netfilter is wrong tool for this, at least for ingress.
You could use conntrack to stash net_cls id in the connmark, though (for
inbound reply packets).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches
  2015-09-17 16:00     ` Florian Westphal
@ 2015-09-21 16:52       ` Daniel Mack
  2015-09-21 19:05         ` Florian Westphal
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Mack @ 2015-09-21 16:52 UTC (permalink / raw)
  To: Florian Westphal; +Cc: pablo, daniel, netfilter-devel, netdev, balazs.scheidler

[-- Attachment #1: Type: text/plain, Size: 1967 bytes --]

Hi,

Thanks for your feedback, Florian!

On 09/17/2015 06:00 PM, Florian Westphal wrote:
> Daniel Mack <daniel@zonque.org> wrote:

>> That would be a new netfilter hook then, something that is called after
>> LOCAL_IN, for ingress only? In a sense, it would be called from the
>> protocol handlers, just as my patches do right now, but instead of
>> conditionally re-iterating the same rules again, we would walk a
>> different chain?
> 
> Yes, something like that.  Obviously, you'll need to dru^W brib^W
> convince a LOT of people before that could ever fly.
> 
> I think we should not do this and that this 'match on ingress sk
> properties' is just bad[tm].
> 
> f.e. you'd also have to move all of the stuff you want into
> sock_common ... 8-(

Hmm, I'm not sure whether I understand which problems you see, or which
corner cases I am missing in my assessment. I did a quick test with the
attached 4 patches that

1) Allow hook callbacks to look at the socket passed to nf_hook(), so
   skb->sk does not have to be set

2) Make nft_meta look at pkt->sk rather that skb->sk (only for cgroups
   as proof of concept)

3) Introduce a new POST_DEMUX netfilter chain (the name is not
   perfect, admittedly)

4) Iterate POST_DEMUX chains for v4 TCP and UDP unicast+multicast
   sockets.


With some really trivial modifications to libnftnl/nftables (which just
map strings to the new enum value), this works fine in my tests.
Multicast receivers that match a netclass ID in the ruleset won't see
any packets, while others do.

Some more considerations: if we cannot determine a socket for a packet
and hence don't deliver it, it's IMO perfectly fine not to run the
netfilter rules for them. All we need to achieve with this chain is that
for packets that _are_ delivered to a socket, all the necessary rules
have been processed, at a time when we know who the final receiver of
the skb is.

I'm happy to discuss the side effects of such an approach.


Thanks,
Daniel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-netfilter-add-socket-to-struct-nft_pktinfo.patch --]
[-- Type: text/x-diff; name="0001-netfilter-add-socket-to-struct-nft_pktinfo.patch", Size: 1553 bytes --]

From 808dacb17308c7a1e00a2e6504cbe6468a25f0d1 Mon Sep 17 00:00:00 2001
From: Daniel Mack <daniel@zonque.org>
Date: Wed, 16 Sep 2015 14:40:26 +0200
Subject: [PATCH RFC 1/4] netfilter: add socket to struct nft_pktinfo

The high-level netfilter hook API already enables users to pass a socket,
but that information is lost when the chains are walked.

In order to let internal eval callbacks use the passed filter rather than
skb->sk, add a pointer of type 'struct sock' to 'struct nft_pktinfo' and
set that field via nft_set_pktinfo().

This allows us to run filter chains from situations where skb->sk is unset.
Fall back to skb->sk in case state->sk is NULL, so filter callbacks can be
written in a generic way.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/net/netfilter/nf_tables.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index aa8bee7..05e97ed 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -13,6 +13,7 @@
 #define NFT_JUMP_STACK_SIZE	16
 
 struct nft_pktinfo {
+	struct sock			*sk;
 	struct sk_buff			*skb;
 	const struct net_device		*in;
 	const struct net_device		*out;
@@ -29,6 +30,7 @@ static inline void nft_set_pktinfo(struct nft_pktinfo *pkt,
 				   struct sk_buff *skb,
 				   const struct nf_hook_state *state)
 {
+	pkt->sk = state->sk ?: skb->sk;
 	pkt->skb = skb;
 	pkt->in = pkt->xt.in = state->in;
 	pkt->out = pkt->xt.out = state->out;
-- 
2.5.0


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-netfilter-nft_meta-look-at-pkt-sk-rather-than-skb-sk.patch --]
[-- Type: text/x-diff; name="0002-netfilter-nft_meta-look-at-pkt-sk-rather-than-skb-sk.patch", Size: 1489 bytes --]

From 1ae058eaedc9d20abcfe7c84496bbf1acb979026 Mon Sep 17 00:00:00 2001
From: Daniel Mack <daniel@zonque.org>
Date: Fri, 18 Sep 2015 16:55:05 +0200
Subject: [PATCH RFC 2/4] netfilter: nft_meta: look at pkt->sk rather than
 skb->sk

pkt->sk is set to whatever was passed to nh_hook() by the caller,
and for post demux chains, this is the one that should be looked
at, as skb->sk is still NULL at this point in time.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 net/netfilter/nft_meta.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index cb2f13e..f195bee 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -29,8 +29,9 @@ void nft_meta_get_eval(const struct nft_expr *expr,
 		       const struct nft_pktinfo *pkt)
 {
 	const struct nft_meta *priv = nft_expr_priv(expr);
-	const struct sk_buff *skb = pkt->skb;
 	const struct net_device *in = pkt->in, *out = pkt->out;
+	struct sk_buff *skb = pkt->skb;
+	struct sock *sk = pkt->sk;
 	u32 *dest = &regs->data[priv->dreg];
 
 	switch (priv->key) {
@@ -168,9 +169,9 @@ void nft_meta_get_eval(const struct nft_expr *expr,
 		break;
 #ifdef CONFIG_CGROUP_NET_CLASSID
 	case NFT_META_CGROUP:
-		if (skb->sk == NULL || !sk_fullsock(skb->sk))
+		if (sk == NULL || !sk_fullsock(sk))
 			goto err;
-		*dest = skb->sk->sk_classid;
+		*dest = sk->sk_classid;
 		break;
 #endif
 	default:
-- 
2.5.0


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #4: 0003-netfilter-add-NF_INET_POST_DEMUX-chain-type.patch --]
[-- Type: text/x-diff; name="0003-netfilter-add-NF_INET_POST_DEMUX-chain-type.patch", Size: 3018 bytes --]

From 661e8f5c6e7f48bb03dbf8fcd7cbccb9ffb5cc5d Mon Sep 17 00:00:00 2001
From: Daniel Mack <daniel@zonque.org>
Date: Fri, 18 Sep 2015 16:39:16 +0200
Subject: [PATCH RFC 3/4] netfilter: add NF_INET_POST_DEMUX chain type

Add a new chain type NF_INET_POST_DEMUX which is ran after the input demux
is complete and the final destination socket (if any) has been determined.

This helps filtering packets based on information stored in the destination
socket, such as cgroup controller supplied net class IDs.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 include/uapi/linux/netfilter.h      | 1 +
 net/ipv4/netfilter/iptable_filter.c | 1 +
 net/ipv4/netfilter/nf_tables_ipv4.c | 4 +++-
 net/netfilter/nf_tables_inet.c      | 3 ++-
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/netfilter.h b/include/uapi/linux/netfilter.h
index d93f949..d402679 100644
--- a/include/uapi/linux/netfilter.h
+++ b/include/uapi/linux/netfilter.h
@@ -49,6 +49,7 @@ enum nf_inet_hooks {
 	NF_INET_FORWARD,
 	NF_INET_LOCAL_OUT,
 	NF_INET_POST_ROUTING,
+	NF_INET_POST_DEMUX,
 	NF_INET_NUMHOOKS
 };
 
diff --git a/net/ipv4/netfilter/iptable_filter.c b/net/ipv4/netfilter/iptable_filter.c
index a0f3bec..55b4795 100644
--- a/net/ipv4/netfilter/iptable_filter.c
+++ b/net/ipv4/netfilter/iptable_filter.c
@@ -21,6 +21,7 @@ MODULE_AUTHOR("Netfilter Core Team <coreteam@netfilter.org>");
 MODULE_DESCRIPTION("iptables filter table");
 
 #define FILTER_VALID_HOOKS ((1 << NF_INET_LOCAL_IN) | \
+			    (1 << NF_INET_POST_DEMUX) | \
 			    (1 << NF_INET_FORWARD) | \
 			    (1 << NF_INET_LOCAL_OUT))
 
diff --git a/net/ipv4/netfilter/nf_tables_ipv4.c b/net/ipv4/netfilter/nf_tables_ipv4.c
index aa180d3..993f302 100644
--- a/net/ipv4/netfilter/nf_tables_ipv4.c
+++ b/net/ipv4/netfilter/nf_tables_ipv4.c
@@ -55,6 +55,7 @@ struct nft_af_info nft_af_ipv4 __read_mostly = {
 		[NF_INET_FORWARD]	= nft_do_chain_ipv4,
 		[NF_INET_PRE_ROUTING]	= nft_do_chain_ipv4,
 		[NF_INET_POST_ROUTING]	= nft_do_chain_ipv4,
+		[NF_INET_POST_DEMUX]	= nft_do_chain_ipv4,
 	},
 };
 EXPORT_SYMBOL_GPL(nft_af_ipv4);
@@ -96,7 +97,8 @@ static const struct nf_chain_type filter_ipv4 = {
 			  (1 << NF_INET_LOCAL_OUT) |
 			  (1 << NF_INET_FORWARD) |
 			  (1 << NF_INET_PRE_ROUTING) |
-			  (1 << NF_INET_POST_ROUTING),
+			  (1 << NF_INET_POST_ROUTING) |
+			  (1 << NF_INET_POST_DEMUX),
 };
 
 static int __init nf_tables_ipv4_init(void)
diff --git a/net/netfilter/nf_tables_inet.c b/net/netfilter/nf_tables_inet.c
index 9dd2d21..c01db78 100644
--- a/net/netfilter/nf_tables_inet.c
+++ b/net/netfilter/nf_tables_inet.c
@@ -75,7 +75,8 @@ static const struct nf_chain_type filter_inet = {
 			  (1 << NF_INET_LOCAL_OUT) |
 			  (1 << NF_INET_FORWARD) |
 			  (1 << NF_INET_PRE_ROUTING) |
-			  (1 << NF_INET_POST_ROUTING),
+			  (1 << NF_INET_POST_ROUTING) |
+			  (1 << NF_INET_POST_DEMUX),
 };
 
 static int __init nf_tables_inet_init(void)
-- 
2.5.0


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #5: 0004-net-tcp_ipv4-udp_ipv4-hook-up-post-demux-netfilter-c.patch --]
[-- Type: text/x-diff; name="0004-net-tcp_ipv4-udp_ipv4-hook-up-post-demux-netfilter-c.patch", Size: 2345 bytes --]

From 1898df7d6a35967972bae412994623a8d7c262cd Mon Sep 17 00:00:00 2001
From: Daniel Mack <daniel@zonque.org>
Date: Wed, 16 Sep 2015 14:58:08 +0200
Subject: [PATCH RFC 4/4] net: tcp_ipv4, udp_ipv4: hook up post demux netfilter
 chains

Run the POST_DEMUX netfilter chain rules after the destination socket
has been looked up.

Signed-off-by: Daniel Mack <daniel@zonque.org>
---
 net/ipv4/tcp_ipv4.c |  8 ++++++++
 net/ipv4/udp.c      | 15 +++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 93898e0..33f968e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -78,6 +78,7 @@
 
 #include <linux/inet.h>
 #include <linux/ipv6.h>
+#include <linux/netfilter.h>
 #include <linux/stddef.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
@@ -1594,6 +1595,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (!sk)
 		goto no_tcp_socket;
 
+	ret = nf_hook(NFPROTO_IPV4, NF_INET_POST_DEMUX, sk,
+		      skb, skb->dev, NULL, NULL);
+	if (ret != 1) {
+		sock_put(sk);
+		return 0;
+	}
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c0a15e7..0056c20 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -97,6 +97,7 @@
 #include <linux/mm.h>
 #include <linux/inet.h>
 #include <linux/netdevice.h>
+#include <linux/netfilter.h>
 #include <linux/slab.h>
 #include <net/tcp_states.h>
 #include <linux/skbuff.h>
@@ -1632,7 +1633,14 @@ static void flush_stack(struct sock **stack, unsigned int count,
 	struct sock *sk;
 
 	for (i = 0; i < count; i++) {
+		int ret;
 		sk = stack[i];
+
+		ret = nf_hook(NFPROTO_IPV4, NF_INET_POST_DEMUX, sk,
+			      skb, skb->dev, NULL, NULL);
+		if (ret != 1)
+			continue;
+
 		if (likely(!skb1))
 			skb1 = (i == final) ? skb : skb_clone(skb, GFP_ATOMIC);
 
@@ -1819,6 +1827,13 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (sk) {
 		int ret;
 
+		ret = nf_hook(NFPROTO_IPV4, NF_INET_POST_DEMUX, sk,
+			      skb, skb->dev, NULL, NULL);
+		if (ret != 1) {
+			sock_put(sk);
+			return 0;
+		}
+
 		if (inet_get_convert_csum(sk) && uh->check && !IS_UDPLITE(sk))
 			skb_checksum_try_convert(skb, IPPROTO_UDP, uh->check,
 						 inet_compute_pseudo);
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches
  2015-09-21 16:52       ` Daniel Mack
@ 2015-09-21 19:05         ` Florian Westphal
  0 siblings, 0 replies; 9+ messages in thread
From: Florian Westphal @ 2015-09-21 19:05 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Florian Westphal, pablo, daniel, netfilter-devel, netdev,
	balazs.scheidler

Daniel Mack <daniel@zonque.org> wrote:
> >> LOCAL_IN, for ingress only? In a sense, it would be called from the
> >> protocol handlers, just as my patches do right now, but instead of
> >> conditionally re-iterating the same rules again, we would walk a
> >> different chain?
> > 
> > Yes, something like that.  Obviously, you'll need to dru^W brib^W
> > convince a LOT of people before that could ever fly.
> > 
> > I think we should not do this and that this 'match on ingress sk
> > properties' is just bad[tm].
> > 
> > f.e. you'd also have to move all of the stuff you want into
> > sock_common ... 8-(
> 
> Hmm, I'm not sure whether I understand which problems you see, or which
> corner cases I am missing in my assessment. I did a quick test with the
> attached 4 patches that
> 
> 1) Allow hook callbacks to look at the socket passed to nf_hook(), so
>    skb->sk does not have to be set
> 
> 2) Make nft_meta look at pkt->sk rather that skb->sk (only for cgroups
>    as proof of concept)
> 
> 3) Introduce a new POST_DEMUX netfilter chain (the name is not
>    perfect, admittedly)
> 
> 4) Iterate POST_DEMUX chains for v4 TCP and UDP unicast+multicast
>    sockets.
> 
> With some really trivial modifications to libnftnl/nftables (which just
> map strings to the new enum value), this works fine in my tests.
> Multicast receivers that match a netclass ID in the ruleset won't see
> any packets, while others do.
> 
> Some more considerations: if we cannot determine a socket for a packet
> and hence don't deliver it, it's IMO perfectly fine not to run the
> netfilter rules for them. All we need to achieve with this chain is that
> for packets that _are_ delivered to a socket, all the necessary rules
> have been processed, at a time when we know who the final receiver of
> the skb is.

Not sure if thats true.  What about Timewait sockets?

Its easy to imagine someone using this feature and then complaining
that it doesn't match some packets, at which point we'd have to grow
sock_common to accomondate all sk member we support matching for :-/

If we'd have kernel releases where we drop features this wouldn't be
much of an issue since we could back out in case it causes issues later.

But once we add your proposed feature we cannot go back...

I'm not sure; I dislike this feature proposal but I can't think of
any alternative [other than "don't do this"] :-(

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-09-21 19:05 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-16 15:42 [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches Daniel Mack
2015-09-16 15:42 ` [PATCH RFC 1/3] netfilter: add socket to struct nft_pktinfo Daniel Mack
2015-09-16 15:42 ` [PATCH RFC 2/3] netfilter: nft_meta: mark skbs for postponed filter processing Daniel Mack
2015-09-16 15:43 ` [PATCH RFC 3/3] net: tcp_ipv4: re-run netfilter chains for marked skbs Daniel Mack
2015-09-16 21:21 ` [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches Florian Westphal
2015-09-17 10:04   ` Daniel Mack
2015-09-17 16:00     ` Florian Westphal
2015-09-21 16:52       ` Daniel Mack
2015-09-21 19:05         ` Florian Westphal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).