* Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
[not found] <537621AC.1060409@davidnewall.com>
@ 2014-05-19 12:58 ` David Newall
2014-05-19 14:01 ` Florian Westphal
0 siblings, 1 reply; 7+ messages in thread
From: David Newall @ 2014-05-19 12:58 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Netdev, Linux Kernel Mailing List, bridge
Having received no feedback of substance from netdev, I now address my
previous email to a wider audience for discussion and in preparation for
submitting a patch based closely on that below.
This email is not addressed to Bandan Das <bandan.das@stratus.com>, who
is the author of the commit I propose reverting, as his email address is
no longer current. I believe I have otherwise addressed all appropriate
recipients and will circulate a formal patch to the same recipients if
no adverse comments are received. (That would surprise me.)
-------- Original Message --------
Subject: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge :
Sanitize skb before it enters the IP stack)
Date: Sat, 17 May 2014 00:03:16 +0930
From: David Newall <davidn@davidnewall.com>
To: Lukas Tribus <luky-37@hotmail.com>, Eric Dumazet
<eric.dumazet@gmail.com>, Netdev <netdev@vger.kernel.org>
CC: fw@strlen.de <fw@strlen.de>
We should revert commit 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge
: Sanitize skb before it enters the IP stack) because it corrupts IP
packets with RR or TS options set, only partially updating the IP header
and leaving an incorrect checksum.
The argument for introducing the change is at lkml.org/lkml/2010/8/30/391:
The bridge layer can overwrite the IPCB using the
BR_INPUT_SKB_CB macro. In br_nf_dev_queue_xmit,
if we recieve a packet greater in size than the bridge
device MTU, we call ip_fragment which in turn will lead to
icmp_send calling ip_options_echo if the DF flag is set.
ip_options_echo will then incorrectly try to parse the IPCB as
IP options resulting in a buffer overflow.
This change refills the CB area back with IP
options before ip_fragment calls icmp_send. If we fail parsing,
we zero out the IPCB area to guarantee that the stack does
not get corrupted.
A bridge should not fragment packets; an *ethernet* bridge should not
need to. Fragmenting packets is the job of higher level protocol.
--- br_netfilter.c 2014-01-20 13:10:07.000000000 +1030
+++ br_netfilter.c.prop 2014-05-16 23:07:57.975386905 +0930
@@ -253,73 +253,6 @@
skb->protocol = htons(ETH_P_PPP_SES);
}
-/* When handing a packet over to the IP layer
- * check whether we have a skb that is in the
- * expected format
- */
-
-static int br_parse_ip_options(struct sk_buff *skb)
-{
- struct ip_options *opt;
- const struct iphdr *iph;
- struct net_device *dev = skb->dev;
- u32 len;
-
- if (!pskb_may_pull(skb, sizeof(struct iphdr)))
- goto inhdr_error;
-
- iph = ip_hdr(skb);
- opt = &(IPCB(skb)->opt);
-
- /* Basic sanity checks */
- if (iph->ihl < 5 || iph->version != 4)
- goto inhdr_error;
-
- if (!pskb_may_pull(skb, iph->ihl*4))
- goto inhdr_error;
-
- iph = ip_hdr(skb);
- if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
- goto inhdr_error;
-
- len = ntohs(iph->tot_len);
- if (skb->len < len) {
- IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
- goto drop;
- } else if (len < (iph->ihl*4))
- goto inhdr_error;
-
- if (pskb_trim_rcsum(skb, len)) {
- IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
- goto drop;
- }
-
- memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
- if (iph->ihl == 5)
- return 0;
-
- opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
- if (ip_options_compile(dev_net(dev), opt, skb))
- goto inhdr_error;
-
- /* Check correct handling of SRR option */
- if (unlikely(opt->srr)) {
- struct in_device *in_dev = __in_dev_get_rcu(dev);
- if (in_dev && !IN_DEV_SOURCE_ROUTE(in_dev))
- goto drop;
-
- if (ip_options_rcv_srr(skb))
- goto drop;
- }
-
- return 0;
-
-inhdr_error:
- IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
-drop:
- return -1;
-}
-
/* Fill in the header for fragmented IP packets handled by
* the IPv4 connection tracking code.
*/
@@ -679,6 +612,7 @@
{
struct net_bridge_port *p;
struct net_bridge *br;
+ const struct iphdr *iph;
__u32 len = nf_bridge_encap_header_len(skb);
if (unlikely(!pskb_may_pull(skb, len)))
@@ -704,9 +638,29 @@
return NF_ACCEPT;
nf_bridge_pull_encap_header_rcsum(skb);
+
+ if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+ goto inhdr_error;
- if (br_parse_ip_options(skb))
- return NF_DROP;
+ iph = ip_hdr(skb);
+ if (iph->ihl < 5 || iph->version != 4)
+ goto inhdr_error;
+
+ if (!pskb_may_pull(skb, 4 * iph->ihl))
+ goto inhdr_error;
+
+ iph = ip_hdr(skb);
+ if (ip_fast_csum((__u8 *) iph, iph->ihl) != 0)
+ goto inhdr_error;
+
+ len = ntohs(iph->tot_len);
+ if (skb->len < len || len < 4 * iph->ihl)
+ goto inhdr_error;
+
+ pskb_trim_rcsum(skb, len);
+
+ /* BUG: Should really parse the IP options here. */
+ memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
nf_bridge_put(skb->nf_bridge);
if (!nf_bridge_alloc(skb))
@@ -720,6 +674,10 @@
br_nf_pre_routing_finish);
return NF_STOLEN;
+
+inhdr_error:
+// IP_INC_STATS_BH(IpInHdrErrors);
+ return NF_DROP;
}
@@ -806,9 +764,6 @@
nf_bridge->mask |= BRNF_PKT_TYPE;
}
- if (pf == NFPROTO_IPV4 && br_parse_ip_options(skb))
- return NF_DROP;
-
/* The physdev module checks on this */
nf_bridge->mask |= BRNF_BRIDGED;
nf_bridge->physoutdev = skb->dev;
@@ -862,19 +817,14 @@
#if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV4)
static int br_nf_dev_queue_xmit(struct sk_buff *skb)
{
- int ret;
-
if (skb->nfct != NULL && skb->protocol == htons(ETH_P_IP) &&
skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
!skb_is_gso(skb)) {
- if (br_parse_ip_options(skb))
- /* Drop invalid packet */
- return NF_DROP;
- ret = ip_fragment(skb, br_dev_queue_push_xmit);
+ /* BUG: Should really parse the IP options here. */
+ memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+ return ip_fragment(skb, br_dev_queue_push_xmit);
} else
- ret = br_dev_queue_push_xmit(skb);
-
- return ret;
+ return br_dev_queue_push_xmit(skb);
}
#else
static int br_nf_dev_queue_xmit(struct sk_buff *skb)
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
2014-05-19 12:58 ` Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack) David Newall
@ 2014-05-19 14:01 ` Florian Westphal
2014-05-19 14:19 ` David Newall
0 siblings, 1 reply; 7+ messages in thread
From: Florian Westphal @ 2014-05-19 14:01 UTC (permalink / raw)
To: David Newall; +Cc: Stephen Hemminger, Netdev, Linux Kernel Mailing List, bridge
David Newall <davidn@davidnewall.com> wrote:
> Having received no feedback of substance from netdev, I now address
> my previous email to a wider audience for discussion and in
> preparation for submitting a patch based closely on that below.
>
> This email is not addressed to Bandan Das <bandan.das@stratus.com>,
> who is the author of the commit I propose reverting, as his email
> address is no longer current. I believe I have otherwise addressed
> all appropriate recipients and will circulate a formal patch to the
> same recipients if no adverse comments are received. (That would
> surprise me.)
>
> -------- Original Message --------
> Subject: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge :
> Sanitize skb before it enters the IP stack)
> Date: Sat, 17 May 2014 00:03:16 +0930
> From: David Newall <davidn@davidnewall.com>
> To: Lukas Tribus <luky-37@hotmail.com>, Eric Dumazet
> <eric.dumazet@gmail.com>, Netdev <netdev@vger.kernel.org>
> CC: fw@strlen.de <fw@strlen.de>
>
>
>
> We should revert commit 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge
> : Sanitize skb before it enters the IP stack) because it corrupts IP
> packets with RR or TS options set, only partially updating the IP header
> and leaving an incorrect checksum.
>
> The argument for introducing the change is at lkml.org/lkml/2010/8/30/391:
>
> The bridge layer can overwrite the IPCB using the
> BR_INPUT_SKB_CB macro. In br_nf_dev_queue_xmit,
> if we recieve a packet greater in size than the bridge
> device MTU, we call ip_fragment which in turn will lead to
> icmp_send calling ip_options_echo if the DF flag is set.
> ip_options_echo will then incorrectly try to parse the IPCB as
> IP options resulting in a buffer overflow.
> This change refills the CB area back with IP
> options before ip_fragment calls icmp_send. If we fail parsing,
> we zero out the IPCB area to guarantee that the stack does
> not get corrupted.
>
> A bridge should not fragment packets; an *ethernet* bridge should not
> need to. Fragmenting packets is the job of higher level protocol.
Well, did you test what happens if we try to refrag a packet
containing ip options after the revert?
can happen e.g. when using netfilter conntrack on top of a bridge.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
2014-05-19 14:01 ` Florian Westphal
@ 2014-05-19 14:19 ` David Newall
2014-05-20 4:55 ` Valdis.Kletnieks
0 siblings, 1 reply; 7+ messages in thread
From: David Newall @ 2014-05-19 14:19 UTC (permalink / raw)
To: Florian Westphal
Cc: Stephen Hemminger, Netdev, Linux Kernel Mailing List, bridge
Thanks for the reply. I've been hanging out for it!
On 19/05/14 23:31, Florian Westphal wrote:
> Well, did you test what happens if we try to refrag a packet
> containing ip options after the revert?
>
> can happen e.g. when using netfilter conntrack on top of a bridge.
No. I expect it would panic, as was reported prior to the commit.
I tried to persevere with the commit: I recalculated checksum, which
left routes and times improperly updated in options. Then I tried
calling ip_forward_options, which looks like it would correctly update
RR and TS (not to mention checksum)m but that bombed because skb_rtable
returned NULL. I think calling skb_set_dst would answer that, but I
don't know how to get a valid dst. (I asked for help but no answer.)
I see three ways to progress:
1. Possibly call ip_forward_option, but that requires somebody who
understands this code to help;
2. Just recalculate the checksum, leaving crap in the options; or
3. Revert the commit.
Option 1 doesn't look like it's going to happen; option 2 is stupid;
leaving option 3, and I begin to think that's the right way to go if
bridge is supposed to be a bridge and not a router. The idea that
bridge is doing too much seems to have quite a lot of currency, so think
of reversion as chopping off a canker. Or we keep fixing bugs, adding
to bridge, until it replicates all of IP.
How does a packet get fragmented in this case? Does it only happen when
bridging to a device with smaller MTU? That scenario sounds quite
un-bridge-like. It also sounds like something that can be handled by
real routing.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
2014-05-19 14:19 ` David Newall
@ 2014-05-20 4:55 ` Valdis.Kletnieks
2014-05-20 16:05 ` Vlad Yasevich
2014-05-21 8:10 ` David Newall
0 siblings, 2 replies; 7+ messages in thread
From: Valdis.Kletnieks @ 2014-05-20 4:55 UTC (permalink / raw)
To: David Newall
Cc: Florian Westphal, Stephen Hemminger, Netdev,
Linux Kernel Mailing List, bridge
[-- Attachment #1: Type: text/plain, Size: 1206 bytes --]
On Mon, 19 May 2014 23:49:22 +0930, David Newall said:
> How does a packet get fragmented in this case? Does it only happen when
> bridging to a device with smaller MTU? That scenario sounds quite
> un-bridge-like. It also sounds like something that can be handled by
> real routing.
Which doesn't change the fact that you *will* get clowns who take a box that
has a 10G card on a jumbogram-enabled subnet that's running with an MTU of
9000, and a 1G at MTU 1500 on the other, and try to bridge rather than route.
(Did you know that you can actually mount an NFS filesystem across that? And
that ls and cat and friends will work *just fine*? Until you hit a file that's
more than 1.5 in size, that is. And when you do a traceroute to the wedged
client, it tells you it's on the 10G network, so you have no idea why you're
seeing an MTU issue. Don't ask how I know this - let's just say that
supporting HPC users is never boring. :)
So yes, we *do* need to do something sensible there - either frag the packet
on the way out, or something. It *would* be nice if we could drop the
packet and send an ICMP Frag Needed back - except it's unclear what IP
you use as the source address for the ICMP....
[-- Attachment #2: Type: application/pgp-signature, Size: 848 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
2014-05-20 4:55 ` Valdis.Kletnieks
@ 2014-05-20 16:05 ` Vlad Yasevich
2014-05-21 8:10 ` David Newall
1 sibling, 0 replies; 7+ messages in thread
From: Vlad Yasevich @ 2014-05-20 16:05 UTC (permalink / raw)
To: Valdis.Kletnieks, David Newall
Cc: Florian Westphal, Stephen Hemminger, Netdev,
Linux Kernel Mailing List, bridge
On 05/20/2014 12:55 AM, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 19 May 2014 23:49:22 +0930, David Newall said:
>
>> How does a packet get fragmented in this case? Does it only happen when
>> bridging to a device with smaller MTU? That scenario sounds quite
>> un-bridge-like. It also sounds like something that can be handled by
>> real routing.
>
> Which doesn't change the fact that you *will* get clowns who take a box that
> has a 10G card on a jumbogram-enabled subnet that's running with an MTU of
> 9000, and a 1G at MTU 1500 on the other, and try to bridge rather than route.
> (Did you know that you can actually mount an NFS filesystem across that? And
> that ls and cat and friends will work *just fine*? Until you hit a file that's
> more than 1.5 in size, that is. And when you do a traceroute to the wedged
> client, it tells you it's on the 10G network, so you have no idea why you're
> seeing an MTU issue. Don't ask how I know this - let's just say that
> supporting HPC users is never boring. :)
>
> So yes, we *do* need to do something sensible there - either frag the packet
> on the way out, or something. It *would* be nice if we could drop the
> packet and send an ICMP Frag Needed back - except it's unclear what IP
> you use as the source address for the ICMP....
>
If there is no netfilter, then the bridge will just drop the packet
(see br_dev_queue_push_xmit). It should probably also do that with
netfilter.
On the question of ICMP, I've also debated about sending ICMP Frag
Needed, but that's really beyond the scope of the bridge device.
Recording a stat might be sufficient to help troubleshoot these types
of issues.
-vlad
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
2014-05-20 4:55 ` Valdis.Kletnieks
2014-05-20 16:05 ` Vlad Yasevich
@ 2014-05-21 8:10 ` David Newall
2014-05-21 20:14 ` David Miller
1 sibling, 1 reply; 7+ messages in thread
From: David Newall @ 2014-05-21 8:10 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Florian Westphal, Stephen Hemminger, Netdev,
Linux Kernel Mailing List, bridge
On 20/05/14 14:25, Valdis.Kletnieks@vt.edu wrote:
> So yes, we*do* need to do something sensible there - either frag the packet
> on the way out, or something.
I think the problem is that a bridge cannot be used across incompatible
media. That's the job of a router.
A bridge should act like a bridge, not a router. Fragmenting the packet
is wrong; that's IP's job. Dropping the packet is also arguably wrong;
that's the real device-driver's job. What seems right to me is to act
like a bridge and forward packets by looking inside of them *no more
than is necessary*. Looking beyond MAC address is perhaps too much.
We can finish the job of processing IP options, or at least in this
scenario, but that seems wrong-headed and invites more work as more
problems are discovered; or we could remove the half-hearted attempt it
currently does and leave the bridge as a simple bridge.
This problem wouldn't occur if all devices in a bridge were required to
be compatible media; particularly identical MTU.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
2014-05-21 8:10 ` David Newall
@ 2014-05-21 20:14 ` David Miller
0 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2014-05-21 20:14 UTC (permalink / raw)
To: davidn; +Cc: Valdis.Kletnieks, fw, stephen, netdev, linux-kernel, bridge
From: David Newall <davidn@davidnewall.com>
Date: Wed, 21 May 2014 17:40:25 +0930
> On 20/05/14 14:25, Valdis.Kletnieks@vt.edu wrote:
>> So yes, we*do* need to do something sensible there - either frag the
>> packet
>> on the way out, or something.
>
> I think the problem is that a bridge cannot be used across
> incompatible media. That's the job of a router.
>
> A bridge should act like a bridge, not a router. Fragmenting the
> packet is wrong; that's IP's job. Dropping the packet is also
> arguably wrong; that's the real device-driver's job. What seems right
> to me is to act like a bridge and forward packets by looking inside of
> them *no more than is necessary*. Looking beyond MAC address is
> perhaps too much.
>
> We can finish the job of processing IP options, or at least in this
> scenario, but that seems wrong-headed and invites more work as more
> problems are discovered; or we could remove the half-hearted attempt
> it currently does and leave the bridge as a simple bridge.
>
> This problem wouldn't occur if all devices in a bridge were required
> to be compatible media; particularly identical MTU.
I completely agree with you.
I also just want to state for the record, and I know some people will
disagree with me, that I think the bridging netfilter layer should
never have been integrated into the tree.
And I've been saying this for more than a decade.
It takes layering violations to a whole new level, and it's why we see
problems like this.
Besides this IP options issue, it also creates fake ipv4 routes, so
every time someone tries to do anything non-trivial with the ipv4
routing code the bridging netfilter fake route code had to be adjusted
or else we'd get crashes.
It has also held back many potential improvements to iptables in
general over the years because it does so many things differently
than the rest of the iptables modules.
It stinks, we never should have added it, and now since we have people
have been perversely convinced that doing stuff like this is actually
sane. It's not.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2014-05-21 20:14 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <537621AC.1060409@davidnewall.com>
2014-05-19 12:58 ` Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack) David Newall
2014-05-19 14:01 ` Florian Westphal
2014-05-19 14:19 ` David Newall
2014-05-20 4:55 ` Valdis.Kletnieks
2014-05-20 16:05 ` Vlad Yasevich
2014-05-21 8:10 ` David Newall
2014-05-21 20:14 ` David Miller
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).