* [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
@ 2004-12-18 17:00 Thomas Graf
2004-12-18 21:00 ` Thomas Graf
2004-12-19 20:23 ` jamal
0 siblings, 2 replies; 11+ messages in thread
From: Thomas Graf @ 2004-12-18 17:00 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev
Dave,
Correctly handle shared and cloned skbs by copying them before writing
and dequeue unwriteable skbs unchanged. Assumes that IP/IPv6 header
is always linear so no pulling required.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
--- linux-2.6.10-rc3-bk12.orig/net/sched/sch_dsmark.c 2004-12-18 15:16:29.000000000 +0100
+++ linux-2.6.10-rc3-bk12/net/sched/sch_dsmark.c 2004-12-18 16:24:58.000000000 +0100
@@ -195,7 +195,8 @@
D2PRINTK("dsmark_enqueue(skb %p,sch %p,[qdisc %p])\n",skb,sch,p);
if (p->set_tc_index) {
- /* FIXME: Safe with non-linear skbs? --RR */
+ /* Safe with non-linear skbs? --RR
+ IP/IPv6 header is always linear --TGR */
switch (skb->protocol) {
case __constant_htons(ETH_P_IP):
skb->tc_index = ipv4_get_dsfield(skb->nh.iph);
@@ -250,6 +251,34 @@
return ret;
}
+static inline int
+dsmark_make_writeable(struct sk_buff **pskb, int offset)
+{
+ struct sk_buff *nskb;
+
+ if ((offset + (*pskb)->mac_len) > (*pskb)->len)
+ return 0;
+
+ if (skb_shared(*pskb) || skb_cloned(*pskb))
+ goto copy_skb;
+
+ /* IP/IPv6 header is always linear, no need to pull */
+ return 1;
+
+copy_skb:
+ nskb = skb_copy(*pskb, GFP_ATOMIC);
+ if (!nskb)
+ return 0;
+ BUG_ON(skb_is_nonlinear(nskb));
+
+ /* Rest of kernel will get very unhappy if we pass it a
+ suddenly-orphaned skbuff */
+ if ((*pskb)->sk)
+ skb_set_owner_w(nskb, (*pskb)->sk);
+ kfree_skb(*pskb);
+ *pskb = nskb;
+ return 1;
+}
static struct sk_buff *dsmark_dequeue(struct Qdisc *sch)
{
@@ -266,10 +295,14 @@
D2PRINTK("index %d->%d\n",skb->tc_index,index);
switch (skb->protocol) {
case __constant_htons(ETH_P_IP):
+ if (!dsmark_make_writeable(&skb, sizeof(struct iphdr)))
+ goto unwriteable;
ipv4_change_dsfield(skb->nh.iph,
p->mask[index],p->value[index]);
break;
case __constant_htons(ETH_P_IPV6):
+ if (!dsmark_make_writeable(&skb, sizeof(struct ipv6hdr)))
+ goto unwriteable;
ipv6_change_dsfield(skb->nh.ipv6h,
p->mask[index],p->value[index]);
break;
@@ -280,12 +313,17 @@
* and don't need yet another qdisc as a bypass.
*/
if (p->mask[index] != 0xff || p->value[index])
- printk(KERN_WARNING "dsmark_dequeue: "
- "unsupported protocol %d\n",
- htons(skb->protocol));
+ if (net_ratelimit())
+ printk(KERN_WARNING "dsmark_dequeue: "
+ "unsupported protocol %d\n",
+ htons(skb->protocol));
break;
};
return skb;
+unwriteable:
+ if (net_ratelimit())
+ printk(KERN_WARNING "dsmark_dequeue: skb not writable\n");
+ return skb;
}
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
2004-12-18 17:00 [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs Thomas Graf
@ 2004-12-18 21:00 ` Thomas Graf
2004-12-19 20:23 ` jamal
1 sibling, 0 replies; 11+ messages in thread
From: Thomas Graf @ 2004-12-18 21:00 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev
> Correctly handle shared and cloned skbs by copying them before writing
> and dequeue unwriteable skbs unchanged. Assumes that IP/IPv6 header
> is always linear so no pulling required.
Hmm.. OTOH, setting skb->tc_index in enqueue() might need a pskb_copy() in
some cases. I'm not aware of such a path though, does anyone know about
a path where it would be required? If so it would be better to make the
skb writeable in enqueue() to safe a copy.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
2004-12-18 17:00 [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs Thomas Graf
2004-12-18 21:00 ` Thomas Graf
@ 2004-12-19 20:23 ` jamal
2004-12-19 20:36 ` Thomas Graf
1 sibling, 1 reply; 11+ messages in thread
From: jamal @ 2004-12-19 20:23 UTC (permalink / raw)
To: Thomas Graf; +Cc: David S. Miller, netdev
If the qdisc at that level muddies the packet thats fair game - thats
what goes out on the wire. So we should leave the code as is.
cheers,
jamal
On Sat, 2004-12-18 at 12:00, Thomas Graf wrote:
> Dave,
>
> Correctly handle shared and cloned skbs by copying them before writing
> and dequeue unwriteable skbs unchanged. Assumes that IP/IPv6 header
> is always linear so no pulling required.
>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
>
> --- linux-2.6.10-rc3-bk12.orig/net/sched/sch_dsmark.c 2004-12-18 15:16:29.000000000 +0100
> +++ linux-2.6.10-rc3-bk12/net/sched/sch_dsmark.c 2004-12-18 16:24:58.000000000 +0100
> @@ -195,7 +195,8 @@
>
> D2PRINTK("dsmark_enqueue(skb %p,sch %p,[qdisc %p])\n",skb,sch,p);
> if (p->set_tc_index) {
> - /* FIXME: Safe with non-linear skbs? --RR */
> + /* Safe with non-linear skbs? --RR
> + IP/IPv6 header is always linear --TGR */
> switch (skb->protocol) {
> case __constant_htons(ETH_P_IP):
> skb->tc_index = ipv4_get_dsfield(skb->nh.iph);
> @@ -250,6 +251,34 @@
> return ret;
> }
>
> +static inline int
> +dsmark_make_writeable(struct sk_buff **pskb, int offset)
> +{
> + struct sk_buff *nskb;
> +
> + if ((offset + (*pskb)->mac_len) > (*pskb)->len)
> + return 0;
> +
> + if (skb_shared(*pskb) || skb_cloned(*pskb))
> + goto copy_skb;
> +
> + /* IP/IPv6 header is always linear, no need to pull */
> + return 1;
> +
> +copy_skb:
> + nskb = skb_copy(*pskb, GFP_ATOMIC);
> + if (!nskb)
> + return 0;
> + BUG_ON(skb_is_nonlinear(nskb));
> +
> + /* Rest of kernel will get very unhappy if we pass it a
> + suddenly-orphaned skbuff */
> + if ((*pskb)->sk)
> + skb_set_owner_w(nskb, (*pskb)->sk);
> + kfree_skb(*pskb);
> + *pskb = nskb;
> + return 1;
> +}
>
> static struct sk_buff *dsmark_dequeue(struct Qdisc *sch)
> {
> @@ -266,10 +295,14 @@
> D2PRINTK("index %d->%d\n",skb->tc_index,index);
> switch (skb->protocol) {
> case __constant_htons(ETH_P_IP):
> + if (!dsmark_make_writeable(&skb, sizeof(struct iphdr)))
> + goto unwriteable;
> ipv4_change_dsfield(skb->nh.iph,
> p->mask[index],p->value[index]);
> break;
> case __constant_htons(ETH_P_IPV6):
> + if (!dsmark_make_writeable(&skb, sizeof(struct ipv6hdr)))
> + goto unwriteable;
> ipv6_change_dsfield(skb->nh.ipv6h,
> p->mask[index],p->value[index]);
> break;
> @@ -280,12 +313,17 @@
> * and don't need yet another qdisc as a bypass.
> */
> if (p->mask[index] != 0xff || p->value[index])
> - printk(KERN_WARNING "dsmark_dequeue: "
> - "unsupported protocol %d\n",
> - htons(skb->protocol));
> + if (net_ratelimit())
> + printk(KERN_WARNING "dsmark_dequeue: "
> + "unsupported protocol %d\n",
> + htons(skb->protocol));
> break;
> };
> return skb;
> +unwriteable:
> + if (net_ratelimit())
> + printk(KERN_WARNING "dsmark_dequeue: skb not writable\n");
> + return skb;
> }
>
>
>
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
2004-12-19 20:23 ` jamal
@ 2004-12-19 20:36 ` Thomas Graf
2004-12-19 22:53 ` jamal
2004-12-20 8:06 ` Patrick McHardy
0 siblings, 2 replies; 11+ messages in thread
From: Thomas Graf @ 2004-12-19 20:36 UTC (permalink / raw)
To: jamal; +Cc: David S. Miller, netdev
* jamal <1103487827.1048.188.camel@jzny.localdomain> 2004-12-19 15:23
> If the qdisc at that level muddies the packet thats fair game - thats
> what goes out on the wire. So we should leave the code as is.
Agreed for egress but I think it is needed for stuff like IMQ. It's
debatable whether we should take care of IMQ and alike though.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
2004-12-19 20:36 ` Thomas Graf
@ 2004-12-19 22:53 ` jamal
2004-12-19 23:18 ` Thomas Graf
2004-12-20 8:06 ` Patrick McHardy
1 sibling, 1 reply; 11+ messages in thread
From: jamal @ 2004-12-19 22:53 UTC (permalink / raw)
To: Thomas Graf; +Cc: David S. Miller, netdev
On Sun, 2004-12-19 at 15:36, Thomas Graf wrote:
> * jamal <1103487827.1048.188.camel@jzny.localdomain> 2004-12-19 15:23
> > If the qdisc at that level muddies the packet thats fair game - thats
> > what goes out on the wire. So we should leave the code as is.
>
> Agreed for egress but I think it is needed for stuff like IMQ. It's
> debatable whether we should take care of IMQ and alike though.
You are right, it may cause an issue on an IMQ like device when we have
one in kernel. Hang on to the patch for now is my opinion; "schedulers"
should probably not be mucking with packets once they are queued in
(dsmark aint really a scheduler). We should talk to Werner to see if we
can move that functionality prequeueing ...
cheers,
jamal
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
2004-12-19 22:53 ` jamal
@ 2004-12-19 23:18 ` Thomas Graf
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Graf @ 2004-12-19 23:18 UTC (permalink / raw)
To: jamal; +Cc: David S. Miller, netdev
* jamal <1103496820.1049.204.camel@jzny.localdomain> 2004-12-19 17:53
> On Sun, 2004-12-19 at 15:36, Thomas Graf wrote:
> > * jamal <1103487827.1048.188.camel@jzny.localdomain> 2004-12-19 15:23
> > > If the qdisc at that level muddies the packet thats fair game - thats
> > > what goes out on the wire. So we should leave the code as is.
> >
> > Agreed for egress but I think it is needed for stuff like IMQ. It's
> > debatable whether we should take care of IMQ and alike though.
>
> You are right, it may cause an issue on an IMQ like device when we have
> one in kernel. Hang on to the patch for now is my opinion; "schedulers"
> should probably not be mucking with packets once they are queued in
> (dsmark aint really a scheduler). We should talk to Werner to see if we
> can move that functionality prequeueing ...
Fine with me, we could make the dscp mapping an action. I doubt that
there are many users out there since the ECN bits are still taken
into account which doesn't make much sense. Something like the
following patch would be required to bring back the old behaviour
which relied on the lower two bits being unused.
--- linux-2.6.10-rc3-bk12.orig/net/sched/sch_dsmark.c 2004-12-19 01:31:01.000000000 +0100
+++ linux-2.6.10-rc3-bk12/net/sched/sch_dsmark.c 2004-12-19 01:31:27.000000000 +0100
@@ -14,6 +14,7 @@
#include <linux/rtnetlink.h>
#include <net/pkt_sched.h>
#include <net/dsfield.h>
+#include <net/inet_ecn.h>
#include <asm/byteorder.h>
@@ -199,10 +200,12 @@
IP/IPv6 header is always linear --TGR */
switch (skb->protocol) {
case __constant_htons(ETH_P_IP):
- skb->tc_index = ipv4_get_dsfield(skb->nh.iph);
+ skb->tc_index = ipv4_get_dsfield(skb->nh.iph)
+ & ~INET_ECN_MASK;
break;
case __constant_htons(ETH_P_IPV6):
- skb->tc_index = ipv6_get_dsfield(skb->nh.ipv6h);
+ skb->tc_index = ipv6_get_dsfield(skb->nh.ipv6h)
+ & ~INET_ECN_MASK;
break;
default:
skb->tc_index = 0;
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
2004-12-19 20:36 ` Thomas Graf
2004-12-19 22:53 ` jamal
@ 2004-12-20 8:06 ` Patrick McHardy
2004-12-20 14:13 ` jamal
1 sibling, 1 reply; 11+ messages in thread
From: Patrick McHardy @ 2004-12-20 8:06 UTC (permalink / raw)
To: Thomas Graf; +Cc: jamal, David S. Miller, netdev
Thomas Graf wrote:
> * jamal <1103487827.1048.188.camel@jzny.localdomain> 2004-12-19 15:23
>
>>If the qdisc at that level muddies the packet thats fair game - thats
>>what goes out on the wire. So we should leave the code as is.
>
>
> Agreed for egress but I think it is needed for stuff like IMQ. It's
> debatable whether we should take care of IMQ and alike though.
>
You shouldn't care about IMQ, but we still need to copy the packet
before modifying it if the data is shared. Otherwise we have a race
on SMP with AF_PACKET sockets, depending on when the packet is read
it can be either modified or not.
Converting dsmark to an action sounds like the best long-term solution.
Regards
Patrick
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
2004-12-20 8:06 ` Patrick McHardy
@ 2004-12-20 14:13 ` jamal
2004-12-21 1:02 ` David S. Miller
0 siblings, 1 reply; 11+ messages in thread
From: jamal @ 2004-12-20 14:13 UTC (permalink / raw)
To: Patrick McHardy; +Cc: Thomas Graf, David S. Miller, netdev
On Mon, 2004-12-20 at 03:06, Patrick McHardy wrote:
> You shouldn't care about IMQ, but we still need to copy the packet
> before modifying it if the data is shared. Otherwise we have a race
> on SMP with AF_PACKET sockets, depending on when the packet is read
> it can be either modified or not.
Certainly not a big deal; shouldnt care if once in a while tcpdump
actually gets to see the real packet that went out the wire.
> Converting dsmark to an action sounds like the best long-term solution.
Indeed; hence my comment to talk to Werner.
cheers,
jamal
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
2004-12-20 14:13 ` jamal
@ 2004-12-21 1:02 ` David S. Miller
2004-12-22 13:14 ` jamal
0 siblings, 1 reply; 11+ messages in thread
From: David S. Miller @ 2004-12-21 1:02 UTC (permalink / raw)
To: hadi; +Cc: kaber, tgraf, netdev
On 20 Dec 2004 09:13:46 -0500
jamal <hadi@cyberus.ca> wrote:
> Certainly not a big deal; shouldnt care if once in a while tcpdump
> actually gets to see the real packet that went out the wire.
It's not just tcpdump. Any modification of a the packet data
for a shared SKB is illegal, no matter where it occurs.
This can corrupt TCP packets, which share the transmitted
packet with the socket retransmit queue.
We have a similar problem with TSO and some gigabit cards whose
drivers muck with the iphdr->tot_len field on transmit. I still
am not sure how I want to address that case yet. Since transmitted
TCP data packets are always shared/cloned, we'll have to do a data
copy on every TSO send on these cards which frankly nullifies much
of the performance gain TSO gives. If we end of fixing it via a copy
we'll probably need to seriously consider not doing TSO unless we
are doing sendfile.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
2004-12-21 1:02 ` David S. Miller
@ 2004-12-22 13:14 ` jamal
2004-12-22 13:50 ` Thomas Graf
0 siblings, 1 reply; 11+ messages in thread
From: jamal @ 2004-12-22 13:14 UTC (permalink / raw)
To: David S. Miller; +Cc: kaber, tgraf, netdev
On Mon, 2004-12-20 at 20:02, David S. Miller wrote:
> On 20 Dec 2004 09:13:46 -0500
> jamal <hadi@cyberus.ca> wrote:
>
> > Certainly not a big deal; shouldnt care if once in a while tcpdump
> > actually gets to see the real packet that went out the wire.
>
> It's not just tcpdump. Any modification of a the packet data
> for a shared SKB is illegal, no matter where it occurs.
> This can corrupt TCP packets, which share the transmitted
> packet with the socket retransmit queue.
>
> We have a similar problem with TSO and some gigabit cards whose
> drivers muck with the iphdr->tot_len field on transmit. I still
> am not sure how I want to address that case yet. Since transmitted
> TCP data packets are always shared/cloned, we'll have to do a data
> copy on every TSO send on these cards which frankly nullifies much
> of the performance gain TSO gives. If we end of fixing it via a copy
> we'll probably need to seriously consider not doing TSO unless we
> are doing sendfile.
Ok, makes sense. I can see how TSO may not be useful if you have to
copy. I suppose NICS with hardware based retransmits will behave
differently.
cheers,
jamal
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs
2004-12-22 13:14 ` jamal
@ 2004-12-22 13:50 ` Thomas Graf
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Graf @ 2004-12-22 13:50 UTC (permalink / raw)
To: jamal; +Cc: David S. Miller, kaber, netdev
* jamal <1103721246.1093.39.camel@jzny.localdomain> 2004-12-22 08:14
> On Mon, 2004-12-20 at 20:02, David S. Miller wrote:
> > We have a similar problem with TSO and some gigabit cards whose
> > drivers muck with the iphdr->tot_len field on transmit. I still
> > am not sure how I want to address that case yet. Since transmitted
> > TCP data packets are always shared/cloned, we'll have to do a data
> > copy on every TSO send on these cards which frankly nullifies much
> > of the performance gain TSO gives. If we end of fixing it via a copy
> > we'll probably need to seriously consider not doing TSO unless we
> > are doing sendfile.
>
> Ok, makes sense. I can see how TSO may not be useful if you have to
> copy. I suppose NICS with hardware based retransmits will behave
> differently.
I think we should move all packet manngling to be an action and warn
about the loss of performance. We might be able to add a fast path for
simple modifications such as dsmark dscp maping and other simple header
modifications by not copying but mangle the fragment itself assuming we
can accept some drawbacks with packet sockets. More advanced mangling
as done by pedit is less of a problem since it will likely be used
in combination with heavy filtering but we could of course do some
analysis of the edit request and go the fast path under some
circumstances.
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2004-12-22 13:50 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-12-18 17:00 [PATCH] PKT_SCHED: dsmark must take care of shared/cloned skbs Thomas Graf
2004-12-18 21:00 ` Thomas Graf
2004-12-19 20:23 ` jamal
2004-12-19 20:36 ` Thomas Graf
2004-12-19 22:53 ` jamal
2004-12-19 23:18 ` Thomas Graf
2004-12-20 8:06 ` Patrick McHardy
2004-12-20 14:13 ` jamal
2004-12-21 1:02 ` David S. Miller
2004-12-22 13:14 ` jamal
2004-12-22 13:50 ` Thomas Graf
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).