* Re: [PATCH] tbf scheduler: TSO support
[not found] <20070510.210556.05145104.taka@valinux.co.jp>
@ 2007-05-10 12:56 ` Patrick McHardy
2007-05-10 21:13 ` David Miller
0 siblings, 1 reply; 17+ messages in thread
From: Patrick McHardy @ 2007-05-10 12:56 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: linux-net, Linux Netdev List
Hirokazu Takahashi wrote:
> The TBF --- Simple Token Bucket Filter --- packet scheduler doesn't
> work correctly with TSO, in that it slows down sending packets out.
> TSO packets will be discarded since their size can be larger than
> the scheduler expects. But this won't cause serious problems,
> because the retransmitted packets can still be passed.
>
> So I made the scheduler allow TSO packets to pass:
>
> - tbf_enqueue() accepts packets of any size if the netdevice
> has TSO capability.
>
> - tbf_dequeue() can handle packets whose size is larger than
> the bucket that holds the tokens.
> Any packet, which may be a TSO packet, can be sent if the bucket is
> full of tokens. This may cause the number of tokens in
> the bucket to turn into a negative value, which is a kind of debt.
> But we don't have to mind it, because the bucket will be refilled
> with tokens in a short time and the count will turn positive
> again.
>
> I'm not sure if this approach is the best. I appreciate any comments.
I don't see why this is needed, the correct way to use TBF with TSO
is to specify a larger MTU value, in which case it won't drop TSO
packets.
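[For illustration only (device name and numbers are made up): a TBF whose burst and peakrate mtu are both large enough to cover a full-size TSO super-packet, so such packets are not dropped.]

        tc qdisc add dev eth0 root tbf rate 200mbit burst 128k latency 50ms \
                peakrate 220mbit mtu 65536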
* Re: [PATCH] tbf scheduler: TSO support
2007-05-10 12:56 ` [PATCH] tbf scheduler: TSO support Patrick McHardy
@ 2007-05-10 21:13 ` David Miller
2007-05-11 1:04 ` Patrick McHardy
0 siblings, 1 reply; 17+ messages in thread
From: David Miller @ 2007-05-10 21:13 UTC (permalink / raw)
To: kaber; +Cc: taka, linux-net, netdev
From: Patrick McHardy <kaber@trash.net>
Date: Thu, 10 May 2007 14:56:39 +0200
> Hirokazu Takahashi wrote:
> > The TBF --- Simple Token Bucket Filter --- packet scheduler doesn't
> > work correctly with TSO, in that it slows down sending packets out.
> > TSO packets will be discarded since their size can be larger than
> > the scheduler expects. But this won't cause serious problems,
> > because the retransmitted packets can still be passed.
> >
> > So I made the scheduler allow TSO packets to pass:
> >
> > - tbf_enqueue() accepts packets of any size if the netdevice
> > has TSO capability.
> >
> > - tbf_dequeue() can handle packets whose size is larger than
> > the bucket that holds the tokens.
> > Any packet, which may be a TSO packet, can be sent if the bucket is
> > full of tokens. This may cause the number of tokens in
> > the bucket to turn into a negative value, which is a kind of debt.
> > But we don't have to mind it, because the bucket will be refilled
> > with tokens in a short time and the count will turn positive
> > again.
> >
> > I'm not sure if this approach is the best. I appreciate any comments.
>
>
> I don't see why this is needed, the correct way to use TBF with TSO
> is to specify a larger MTU value, in which case it won't drop TSO
> packets.
Why should a user have to know anything in the world about TSO in
order to configure TBF properly? I don't think they should have
to at all.
* Re: [PATCH] tbf scheduler: TSO support
2007-05-10 21:13 ` David Miller
@ 2007-05-11 1:04 ` Patrick McHardy
2007-05-11 4:01 ` Hirokazu Takahashi
0 siblings, 1 reply; 17+ messages in thread
From: Patrick McHardy @ 2007-05-11 1:04 UTC (permalink / raw)
To: David Miller; +Cc: taka, linux-net, netdev
David Miller wrote:
> From: Patrick McHardy <kaber@trash.net>
> Date: Thu, 10 May 2007 14:56:39 +0200
>
>>I don't see why this is needed, the correct way to use TBF with TSO
>>is to specify a larger MTU value, in which case it won't drop TSO
>>packets.
>
>
> Why should a user have to know anything in the world about TSO in
> order to configure TBF properly? I don't think they should have
> to at all.
The user shouldn't necessarily, but userspace should.
The way I see it the MTU is a fundamental parameter for TBF
(the peakrate bucket size) and just because userspace picks
a bad default (2000) this is no reason to change the
implementation to something that is not really TBF anymore
and even affects non-TSO packets _and_ TSO packets even
when the MTU is chosen large enough (granted, the first
point is an implementation detail). The much better solution
would be to let userspace pick an appropriate default value
and still let the user override it.
* Re: [PATCH] tbf scheduler: TSO support
2007-05-11 1:04 ` Patrick McHardy
@ 2007-05-11 4:01 ` Hirokazu Takahashi
2007-05-11 18:13 ` Patrick McHardy
0 siblings, 1 reply; 17+ messages in thread
From: Hirokazu Takahashi @ 2007-05-11 4:01 UTC (permalink / raw)
To: kaber; +Cc: davem, linux-net, netdev
Hi,
> >>I don't see why this is needed, the correct way to use TBF with TSO
> >>is to specify a larger MTU value, in which case it won't drop TSO
> >>packets.
> >
> >
> > Why should a user have to know anything in the world about TSO in
> > order to configure TBF properly? I don't think they should have
> > to at all.
>
>
> The user shouldn't necessarily, but userspace should.
> The way I see it the MTU is a fundamental parameter for TBF
> (the peakrate bucket size) and just because userspace picks
> a bad default (2000) this is no reason to change the
> implementation to something that is not really TBF anymore
> and even affects non-TSO packets _and_ TSO packets even
> when the MTU is chosen large enough (granted, the first
> point is an implementation detail). The much better solution
> would be to let userspace pick an appropriate default value
> and still let the user override it.
I think the concept of TBF is quite good, but the userspace tools have
grown old and no longer fit a Gb Ethernet environment.
The tools should be updated to account for much faster networks and
GbE jumbo frames. I agree with you on this point.
On the other hand, handling TSO packets should be a kernel issue.
A TSO packet is logically just a multiply-segmented packet
containing several ordinary packets. This feature should be kept
invisible to userspace.
I also feel it's hard to determine how large a TSO packet
can/will be accepted, since that depends on the Ethernet controller.
It would also be tough to make the tools keep up with the
latest offloading features.
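[A minimal sketch, not part of the posted patch: how kernel code can tell that a device may hand the qdisc over-MTU super-packets. NETIF_F_GSO_MASK is the same feature test the patch later in this thread uses; the helper name itself is made up.]

        static inline int dev_may_send_super_packets(const struct net_device *dev)
        {
                /* any GSO/TSO capability means skbs can exceed dev->mtu */
                return (dev->features & NETIF_F_GSO_MASK) != 0;
        }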
Thank you,
Hirokazu Takahashi.
* Re: [PATCH] tbf scheduler: TSO support
2007-05-11 4:01 ` Hirokazu Takahashi
@ 2007-05-11 18:13 ` Patrick McHardy
2007-05-12 8:49 ` Hirokazu Takahashi
0 siblings, 1 reply; 17+ messages in thread
From: Patrick McHardy @ 2007-05-11 18:13 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: davem, linux-net, netdev
Hirokazu Takahashi wrote:
> I think the concept of TBF is quite good, but the userspace tools have
> grown old and no longer fit a Gb Ethernet environment.
> The tools should be updated to account for much faster networks and
> GbE jumbo frames. I agree with you on this point.
>
> On the other hand, handling TSO packets should be a kernel issue.
> A TSO packet is logically just a multiply-segmented packet
> containing several ordinary packets. This feature should be kept
> invisible to userspace.
Putting aside this question, this cannot work properly without userspace
since the rate table is only built for the given MTU. Your patch tries
to work around that by summing up the result of L2T in q->max_size
steps, which gives incorrect results with MPU or overhead settings.
I think we can only do two things without userspace support:
- disable TSO (not a good idea)
- split the skb into q->max_size chunks on dequeue
The latter would allow people to take full advantage of TSO with a properly
configured TBF, but it would at least still work with a too-small mtu
value.
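[A hedged sketch of the second option, using the real skb_gso_segment() helper; the function below and its error handling are invented for illustration and are not part of any patch in this thread. Note that skb_gso_segment() splits at gso_size rather than exactly q->max_size.]

        static struct sk_buff *tbf_segment_gso(struct sk_buff *skb)
        {
                struct sk_buff *segs;

                if (!skb_is_gso(skb))
                        return skb;

                /* software-segment the super-packet into mss sized skbs */
                segs = skb_gso_segment(skb, 0);
                if (IS_ERR(segs) || segs == NULL)
                        return NULL;            /* caller would drop the packet */

                kfree_skb(skb);                 /* original super-packet no longer needed */
                return segs;                    /* a list linked through segs->next */
        }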
* Re: [PATCH] tbf scheduler: TSO support
2007-05-11 18:13 ` Patrick McHardy
@ 2007-05-12 8:49 ` Hirokazu Takahashi
2007-05-12 9:09 ` Hirokazu Takahashi
2007-05-13 12:42 ` [PATCH] tbf scheduler: TSO support (updated) Hirokazu Takahashi
0 siblings, 2 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2007-05-12 8:49 UTC (permalink / raw)
To: kaber; +Cc: davem, linux-net, netdev
Hi,
> > I think the concept of TBF is quite good, but the userspace tools have
> > grown old and no longer fit a Gb Ethernet environment.
> > The tools should be updated to account for much faster networks and
> > GbE jumbo frames. I agree with you on this point.
> >
> > On the other hand, handling TSO packets should be a kernel issue.
> > A TSO packet is logically just a multiply-segmented packet
> > containing several ordinary packets. This feature should be kept
> > invisible to userspace.
>
>
> Putting aside this question, this cannot work properly without userspace
> since the rate table is only built for the given MTU. Your patch tries
> to work around that by summing up the result of L2T in q->max_size
> steps, which gives incorrect results with MPU or overhead settings.
You have a point. sch->dev->mtu should be used instead of q->max_size,
which gives precisely correct results. Every TSO packet will be split
into several packets with sch->dev->mtu length, which administrators
specify. The splitting is automatically done by TSO supporting controllers.
To minimize the overhead of the calculation I'd re-implement it like:
        segs = skb->len / sch->dev->mtu;
        rest = skb->len - sch->dev->mtu * segs;
        toks -= L2T(q, sch->dev->mtu) * segs;
        if (rest)
                toks -= L2T(q, rest);
I think the problem is that sch->dev->mtu might not be set correctly,
so some error checking is needed.
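[Sketch only, not from the posted patch: the per-MTU charging step together with the error check mentioned above. The helper name and the 1500-byte fallback are assumptions; L2T() and struct tbf_sched_data are the real sch_tbf.c names.]

        static inline long tbf_charge_gso(struct tbf_sched_data *q, struct Qdisc *sch,
                                          struct sk_buff *skb, long toks)
        {
                unsigned int mtu = sch->dev->mtu ? : 1500;      /* guard an unset mtu */
                unsigned int segs = skb->len / mtu;
                unsigned int rest = skb->len - mtu * segs;

                toks -= L2T(q, mtu) * segs;                     /* whole-mtu segments */
                if (rest)
                        toks -= L2T(q, rest);                   /* the final short segment */
                return toks;
        }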
> I think we can only do two things without userspace support:
>
> - disable TSO (not a good idea)
This is just what I'm doing now as a workaround.
> - split the skb into q->max_size chunks on dequeue
That is certainly difficult, because the splitting depends deeply on
the protocol stack that built the skb, and it would add some overhead,
which is exactly what TSO and the other offloading features were introduced to avoid.
I'm now thinking I can make it simply hold a TSO packet until q->tokens
reaches the size of the packet. I think that is a straightforward
implementation. I'll try it.
> The latter would allow people to take full advantage of TSO with a properly
> configured TBF, but it would at least still work with a too-small mtu
> value.
Thanks,
Hirokazu Takahashi.
* Re: [PATCH] tbf scheduler: TSO support
2007-05-12 8:49 ` Hirokazu Takahashi
@ 2007-05-12 9:09 ` Hirokazu Takahashi
2007-05-13 12:42 ` [PATCH] tbf scheduler: TSO support (updated) Hirokazu Takahashi
1 sibling, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2007-05-12 9:09 UTC (permalink / raw)
To: kaber; +Cc: davem, linux-net, netdev
Hi,
> > > I think the concept of TBF is quite good, but the userspace tools have
> > > grown old and no longer fit a Gb Ethernet environment.
> > > The tools should be updated to account for much faster networks and
> > > GbE jumbo frames. I agree with you on this point.
> > >
> > > On the other hand, handling TSO packets should be a kernel issue.
> > > A TSO packet is logically just a multiply-segmented packet
> > > containing several ordinary packets. This feature should be kept
> > > invisible to userspace.
> >
> >
> > Putting aside this question, this cannot work properly without userspace
> > since the rate table is only built for the given MTU. Your patch tries
> > to work around that by summing up the result of L2T in q->max_size
> > steps, which gives incorrect results with MPU or overhead settings.
>
> You have a point. sch->dev->mtu should be used instead of q->max_size,
> which gives precisely correct results.
Oops, this isn't entirely true, because TSO-capable controllers will
add headers to each of the split packets.
So I'd say "it will give almost correct results".
> Every TSO packet will be split
> into several packets with sch->dev->mtu length, which administrators
> specify. The splitting is automatically done by TSO supporting controllers.
>
> To minimize the overhead of the calculation I'd re-implement it like:
>
> segs = skb->len/sch->dev->mtu;
> rest = skb->len - sch->dev->mtu*segs;
> toks -= L2T(q, sch->dev->mtu) * segs;
> if (rest)
> toks -= L2T(q, rest);
>
> I think the problem is that sch->dev->mtu might not be set correctly,
> so some error checking is needed.
>
> > I think we can only do two things without userspace support:
> >
> > - disable TSO (not a good idea)
>
> This is just what I'm doing now as a workaround.
>
> > - split the skb into q->max_size chunks on dequeue
>
> That is certainly difficult, because the splitting depends deeply on
> the protocol stack that built the skb, and it would add some overhead,
> which is exactly what TSO and the other offloading features were introduced to avoid.
>
> I'm now thinking I can make it simply hold a TSO packet until q->tokens
> reaches the size of the packet. I think that is a straightforward
> implementation. I'll try it.
>
> > The latter would allow people to take full advantage of TSO with a properly
> > configured TBF, but it would at least still work with a too-small mtu
> > value.
>
> Thanks,
> Hirokazu Takahashi.
>
* [PATCH] tbf scheduler: TSO support (updated)
2007-05-12 8:49 ` Hirokazu Takahashi
2007-05-12 9:09 ` Hirokazu Takahashi
@ 2007-05-13 12:42 ` Hirokazu Takahashi
2007-05-13 15:52 ` Stephen Hemminger
1 sibling, 1 reply; 17+ messages in thread
From: Hirokazu Takahashi @ 2007-05-13 12:42 UTC (permalink / raw)
To: netdev; +Cc: kaber, davem, linux-net
Hi,
> I'm now thinking I can make it simply hold a TSO packet until q->tokens
> reaches the size of the packet. I think that is a straightforward
> implementation. I'll try it.
I re-implemented the patch, which is simpler than the previous one.
sch->dev->mtu is used to determine how many segments are included
in a TSO packet. The point is that the value of sch->dev->mtu has to
be treated as possibly broken, because it is set by administrators.
Thanks,
Hirokazu Takahashi.
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
--- linux-2.6.21/net/sched/sch_tbf.c.ORG 2007-05-08 20:59:28.000000000 +0900
+++ linux-2.6.21/net/sched/sch_tbf.c 2007-05-13 14:22:39.000000000 +0900
@@ -9,7 +9,8 @@
* Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
* Dmitry Torokhov <dtor@mail.ru> - allow attaching inner qdiscs -
* original idea by Martin Devera
- *
+ * Fixes:
+ * Hirokazu Takahashi <taka@valinux.co.jp> : TSO support
*/
#include <linux/module.h>
@@ -139,7 +140,7 @@ static int tbf_enqueue(struct sk_buff *s
struct tbf_sched_data *q = qdisc_priv(sch);
int ret;
- if (skb->len > q->max_size) {
+ if (skb->len > q->max_size && !(sch->dev->features & NETIF_F_GSO_MASK)) {
sch->qstats.drops++;
#ifdef CONFIG_NET_CLS_POLICE
if (sch->reshape_fail == NULL || sch->reshape_fail(skb, sch))
@@ -205,6 +206,16 @@ static struct sk_buff *tbf_dequeue(struc
long toks, delay;
long ptoks = 0;
unsigned int len = skb->len;
+ /*
+ * Note: TSO packets will be larger than its actual mtu.
+ * These packets should be treated as packets including
+ * several ordinary ones. In this case, tokens should
+ * be held until it reaches the length of them.
+ */
+ long max_toks = max(len, q->buffer);
+ unsigned int rmtu = sch->dev->mtu? min(q->max_size, sch->dev->mtu) : q->max_size;
+ unsigned int segs = len/rmtu;
+ unsigned int rest = len - rmtu*segs;
PSCHED_GET_TIME(now);
@@ -212,14 +223,18 @@ static struct sk_buff *tbf_dequeue(struc
if (q->P_tab) {
ptoks = toks + q->ptokens;
- if (ptoks > (long)q->mtu)
- ptoks = q->mtu;
- ptoks -= L2T_P(q, len);
+ if (ptoks > (long)(q->mtu * (segs + !!rest)))
+ ptoks = q->mtu * (segs + !!rest);
+ ptoks -= L2T_P(q, rmtu) * segs;
+ if (rest)
+ ptoks -= L2T_P(q, rest);
}
toks += q->tokens;
- if (toks > (long)q->buffer)
- toks = q->buffer;
- toks -= L2T(q, len);
+ if (toks > max_toks)
+ toks = max_toks;
+ toks -= L2T(q, rmtu) * segs;
+ if (rest)
+ toks -= L2T(q, rest);
if ((toks|ptoks) >= 0) {
q->t_c = now;
* Re: [PATCH] tbf scheduler: TSO support (updated)
2007-05-13 12:42 ` [PATCH] tbf scheduler: TSO support (updated) Hirokazu Takahashi
@ 2007-05-13 15:52 ` Stephen Hemminger
2007-05-13 16:21 ` Hirokazu Takahashi
0 siblings, 1 reply; 17+ messages in thread
From: Stephen Hemminger @ 2007-05-13 15:52 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: netdev, kaber, davem, linux-net
On Sun, 13 May 2007 21:42:36 +0900 (JST)
Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> Hi,
>
> > I'm now thinking I can make it simply hold a TSO packet until
> > q->tokens reaches the size of the packet. I think that is a
> > straightforward implementation. I'll try it.
>
> I re-implemented the patch, which is simpler than the previous one.
>
> sch->dev->mtu is used to determine how many segments are included
> in a TSO packet. The point is that the value of sch->dev->mtu has to
> be treated as possibly broken, because it is set by administrators.
>
There are cases where the TSO packet will be divided into chunks
smaller than the MTU. If path MTU discovery is being used, the chunk
size of the TSO skb will be smaller. To be accurate you need to look
at mss (gso_size) and compute actual packet size based on protocol
(IPV4/IPV6 and UDP/TCP).
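[Illustration only: one way to estimate the per-segment wire size from gso_size. HDR_LEN is a stand-in constant; as noted above, the real value depends on the IPv4/IPv6 and TCP/UDP headers and would have to be derived from the skb. The helper name is made up.]

        #define HDR_LEN 40      /* assumed: IPv4 (20) + TCP (20), no options */

        static void gso_wire_estimate(struct sk_buff *skb,
                                      unsigned int *segs, unsigned int *seg_len)
        {
                unsigned int mss = skb_shinfo(skb)->gso_size;

                if (!mss) {                     /* not a GSO packet */
                        *segs = 1;
                        *seg_len = skb->len;
                        return;
                }
                *segs = (skb->len + mss - 1) / mss;
                *seg_len = mss + HDR_LEN;       /* per-segment payload plus headers */
        }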
* Re: [PATCH] tbf scheduler: TSO support (updated)
2007-05-13 15:52 ` Stephen Hemminger
@ 2007-05-13 16:21 ` Hirokazu Takahashi
2007-05-14 6:04 ` Herbert Xu
0 siblings, 1 reply; 17+ messages in thread
From: Hirokazu Takahashi @ 2007-05-13 16:21 UTC (permalink / raw)
To: shemminger; +Cc: netdev, kaber, davem, linux-net
Hi,
> > > I'm now thinking I can make it simply hold a TSO packet until
> > > q->tokens reaches the size of the packet. I think that is a
> > > straightforward implementation. I'll try it.
> >
> > I re-implemented the patch, which is simpler than the previous one.
> >
> > sch->dev->mtu is used to determine how many segments are included
> > in a TSO packet. The point is that the value of sch->dev->mtu has to
> > be treated as possibly broken, because it is set by administrators.
> >
>
> There are cases where the TSO packet will be divided into chunks
> smaller than the MTU. If path MTU discovery is being used, the chunk
> size of the TSO skb will be smaller. To be accurate you need to look
> at mss (gso_size) and compute actual packet size based on protocol
> (IPV4/IPV6 and UDP/TCP).
Uhh, you are right.
skb_shinfo(skb)->gso_segs and skb_shinfo(skb)->gso_size should be used.
Thanks.
* Re: [PATCH] tbf scheduler: TSO support (updated)
2007-05-13 16:21 ` Hirokazu Takahashi
@ 2007-05-14 6:04 ` Herbert Xu
2007-05-15 11:39 ` [PATCH 1/2] tbf scheduler: TSO support (update 2) Hirokazu Takahashi
0 siblings, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2007-05-14 6:04 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: shemminger, netdev, kaber, davem, linux-net
Hirokazu Takahashi <taka@valinux.co.jp> wrote:
>
> Uhh, you are right.
> skb_shinfo(skb)->gso_segs and skb_shinfo(skb)->gso_size should be used.
Actually forget about gso_segs, it's only filled in for TCP.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* [PATCH 1/2] tbf scheduler: TSO support (update 2)
2007-05-14 6:04 ` Herbert Xu
@ 2007-05-15 11:39 ` Hirokazu Takahashi
2007-05-15 11:41 ` [PATCH 2/2] " Hirokazu Takahashi
0 siblings, 1 reply; 17+ messages in thread
From: Hirokazu Takahashi @ 2007-05-15 11:39 UTC (permalink / raw)
To: herbert; +Cc: shemminger, netdev, kaber, davem, linux-net
Hi,
> Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> >
> > Uhh, you are right.
> > skb_shinfo(skb)->gso_segs and skb_shinfo(skb)->gso_size should be used.
>
> Actually forget about gso_segs, it's only filled in for TCP.
I realized it is really hard to determine the actual size of each
packet that will be generated from a TSO packet, which is the size
that should be used to calculate truly accurate traffic.
There isn't enough information in the socket buffer to determine
the size of the headers: gso_size only gives the maximum length of
a segment without any headers, and the other members don't help
either.
                 split into
  TSO packet  ----------->  packets after being split
  +----------+              +----------+
  | headers  |              | headers  |
  +----------+              +----------+ ----
  | segment1 |              | segment1 |  A
  |          |              |          |  | gso_size
  |          |              |          |  V
  +----------+              +----------+ ----
  | segment2 |
  |          |              +----------+
  |          |              | headers  |
  +----------+              +----------+
  | segment3 |              | segment2 |
  |          |              |          |
  +----------+              |          |
                            +----------+
                            +----------+
                            | headers  |
                            +----------+
                            | segment3 |
                            |          |
                            +----------+
So I decided to keep the traffic calculation simple:
- assume every packet generated from the same TSO packet has
the same length.
- ignore the length of the additional headers which will be
applied automatically.
It seems to work pretty well for controlling bandwidth, as I expected,
but I'm not sure everybody will be satisfied with it.
Do you think this approximate calculation is enough?
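[For a rough sense of the error this approximation introduces, with made-up numbers: a 65160-byte TSO skb with gso_size 1448 and gso_segs 45 gives len = (65160 - 1)/45 + 1 = 1448, so 45 * L2T(q, 1448) worth of tokens is charged, while the 40-odd bytes of headers prepended to each of the 45 wire packets (roughly 2-3 KB in total, a few percent here) go unaccounted.]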
I also realized the CBQ scheduler has to be fixed to handle large
TSO packets, or it may cause an Oops. The next mail contains
the patch for CBQ.
--- linux-2.6.21/net/sched/sch_tbf.c.ORG 2007-05-08 20:59:28.000000000 +0900
+++ linux-2.6.21/net/sched/sch_tbf.c 2007-05-15 19:59:34.000000000 +0900
@@ -9,7 +9,8 @@
* Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
* Dmitry Torokhov <dtor@mail.ru> - allow attaching inner qdiscs -
* original idea by Martin Devera
- *
+ * Fixes:
+ * Hirokazu Takahashi <taka@valinux.co.jp> : TSO support
*/
#include <linux/module.h>
@@ -138,8 +139,12 @@ static int tbf_enqueue(struct sk_buff *s
{
struct tbf_sched_data *q = qdisc_priv(sch);
int ret;
+ //unsigned int segs = skb_shinfo(skb)->gso_segs ? : 1;
+ unsigned int segs = skb_shinfo(skb)->gso_segs ? :
+ skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+ unsigned int len = (skb->len - 1)/segs + 1;
- if (skb->len > q->max_size) {
+ if (len > q->max_size) {
sch->qstats.drops++;
#ifdef CONFIG_NET_CLS_POLICE
if (sch->reshape_fail == NULL || sch->reshape_fail(skb, sch))
@@ -204,22 +209,41 @@ static struct sk_buff *tbf_dequeue(struc
psched_time_t now;
long toks, delay;
long ptoks = 0;
- unsigned int len = skb->len;
+ /*
+ * Note: TSO packets will be larger than its actual mtu.
+ * These packets should be treated as packets including
+ * several ordinary ones. In this case, tokens should
+ * be held until it reaches the length of them.
+ *
+ * To simplify, we assume each segment in a TSO packet
+ * has the same length though it may probably not be true.
+ * And ignore the length of headers which will be applied
+ * to each segment when splitting TSO packets.
+ *
+ * The number of segments are calculated from the segment
+ * size of TSO packets temporarily if it isn't set.
+ */
+ unsigned int segs = skb_shinfo(skb)->gso_segs ? :
+ skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+ unsigned int len = (skb->len - 1)/segs + 1;
+ unsigned int expect = L2T(q, len) * segs;
+ long max_toks = max(expect, q->buffer);
+
PSCHED_GET_TIME(now);
- toks = PSCHED_TDIFF_SAFE(now, q->t_c, q->buffer);
+ toks = PSCHED_TDIFF_SAFE(now, q->t_c, max_toks);
if (q->P_tab) {
ptoks = toks + q->ptokens;
- if (ptoks > (long)q->mtu)
- ptoks = q->mtu;
- ptoks -= L2T_P(q, len);
+ if (ptoks > (long)(q->mtu * segs))
+ ptoks = q->mtu * segs;
+ ptoks -= L2T_P(q, len) * segs;
}
toks += q->tokens;
- if (toks > (long)q->buffer)
- toks = q->buffer;
- toks -= L2T(q, len);
+ if (toks > max_toks)
+ toks = max_toks;
+ toks -= expect;
if ((toks|ptoks) >= 0) {
q->t_c = now;
* [PATCH 2/2] tbf scheduler: TSO support (update 2)
2007-05-15 11:39 ` [PATCH 1/2] tbf scheduler: TSO support (update 2) Hirokazu Takahashi
@ 2007-05-15 11:41 ` Hirokazu Takahashi
2007-05-17 13:55 ` Herbert Xu
0 siblings, 1 reply; 17+ messages in thread
From: Hirokazu Takahashi @ 2007-05-15 11:41 UTC (permalink / raw)
To: herbert; +Cc: shemminger, netdev, kaber, davem, linux-net
Hi,
Without this patch, the CBQ scheduler may Oops on some TSO packets,
because the table passed to L2T() isn't big enough to handle such
huge packets.
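[To see why, with illustrative numbers: L2T() in sch_cbq.c indexes cl->R_tab->data[len >> cell_log], and data[] has only 256 entries, sized for ordinary frames. With cell_log 3, typical for a 1500-byte MTU, a 64 KB GSO packet produces an index of about 8191, far past the end of the table.]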
Thanks,
Hirokazu Takahashi.
--- linux-2.6.21/net/sched/sch_cbq.c.ORG 2007-05-14 20:53:06.000000000 +0900
+++ linux-2.6.21/net/sched/sch_cbq.c 2007-05-15 18:10:43.000000000 +0900
@@ -176,6 +176,7 @@ struct cbq_sched_data
struct cbq_class *tx_class;
struct cbq_class *tx_borrowed;
int tx_len;
+ unsigned int tx_segs;
psched_time_t now; /* Cached timestamp */
psched_time_t now_rt; /* Cached real time */
unsigned pmask;
@@ -753,6 +754,7 @@ cbq_update(struct cbq_sched_data *q)
struct cbq_class *this = q->tx_class;
struct cbq_class *cl = this;
int len = q->tx_len;
+ unsigned int segs = q->tx_segs;
q->tx_class = NULL;
@@ -761,7 +763,7 @@ cbq_update(struct cbq_sched_data *q)
long idle;
cl->bstats.packets++;
- cl->bstats.bytes += len;
+ cl->bstats.bytes += len*segs;
/*
(now - last) is total time between packet right edges.
@@ -774,7 +776,7 @@ cbq_update(struct cbq_sched_data *q)
if ((unsigned long)idle > 128*1024*1024) {
avgidle = cl->maxidle;
} else {
- idle -= L2T(cl, len);
+ idle -= L2T(cl, len) * segs;
/* true_avgidle := (1-W)*true_avgidle + W*idle,
where W=2^{-ewma_log}. But cl->avgidle is scaled:
@@ -811,8 +813,8 @@ cbq_update(struct cbq_sched_data *q)
to the moment of cbq_update)
*/
- idle -= L2T(&q->link, len);
- idle += L2T(cl, len);
+ idle -= L2T(&q->link, len) * segs;
+ idle += L2T(cl, len) * segs;
PSCHED_AUDIT_TDIFF(idle);
@@ -924,7 +926,9 @@ cbq_dequeue_prio(struct Qdisc *sch, int
cl->xstats.borrows += skb->len;
#endif
}
- q->tx_len = skb->len;
+ q->tx_segs = skb_shinfo(skb)->gso_segs ? :
+ skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+ q->tx_len = (skb->len - 1)/q->tx_segs + 1;
if (cl->deficit <= 0) {
q->active[prio] = cl;
@@ -1013,7 +1017,7 @@ cbq_dequeue(struct Qdisc *sch)
cbq_time = max(real_time, work);
*/
- incr2 = L2T(&q->link, q->tx_len);
+ incr2 = L2T(&q->link, q->tx_len) * q->tx_segs;
PSCHED_TADD(q->now, incr2);
cbq_update(q);
if ((incr -= incr2) < 0)
* Re: [PATCH 2/2] tbf scheduler: TSO support (update 2)
2007-05-15 11:41 ` [PATCH 2/2] " Hirokazu Takahashi
@ 2007-05-17 13:55 ` Herbert Xu
2007-05-18 2:35 ` Hirokazu Takahashi
0 siblings, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2007-05-17 13:55 UTC (permalink / raw)
To: Hirokazu Takahashi; +Cc: shemminger, netdev, kaber, davem, linux-net
On Tue, May 15, 2007 at 08:41:48PM +0900, Hirokazu Takahashi wrote:
>
> @@ -924,7 +926,9 @@ cbq_dequeue_prio(struct Qdisc *sch, int
> cl->xstats.borrows += skb->len;
> #endif
> }
> - q->tx_len = skb->len;
> + q->tx_segs = skb_shinfo(skb)->gso_segs ? :
> + skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
> + q->tx_len = (skb->len - 1)/q->tx_segs + 1;
This isn't safe for Xen (and potentially other virtualisation
environments) since qdisc code runs before dev_hard_start_xmit
which is where we verify the sanity of gso_segs. So you could
be using some arbitrary value from an untrusted source.
If you really want to use it, you should test for SKB_GSO_DODGY
on the packet which will be set if gso_segs can't be trusted.
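[A sketch of the check being described, not the final patch: trust gso_segs only when SKB_GSO_DODGY is not set, otherwise fall back to the gso_size estimate. The helper name is made up; the flags and fields are the real skbuff ones.]

        static unsigned int qdisc_gso_segs(struct sk_buff *skb)
        {
                if (!skb_is_gso(skb))
                        return 1;
                if (skb_shinfo(skb)->gso_segs &&
                    !(skb_shinfo(skb)->gso_type & SKB_GSO_DODGY))
                        return skb_shinfo(skb)->gso_segs;
                /* untrusted or unset gso_segs: estimate from gso_size */
                return skb->len / skb_shinfo(skb)->gso_size + 1;
        }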
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 2/2] tbf scheduler: TSO support (update 2)
2007-05-17 13:55 ` Herbert Xu
@ 2007-05-18 2:35 ` Hirokazu Takahashi
2007-05-23 11:47 ` [PATCH 1/2] tbf scheduler: TSO support (update 3) Hirokazu Takahashi
0 siblings, 1 reply; 17+ messages in thread
From: Hirokazu Takahashi @ 2007-05-18 2:35 UTC (permalink / raw)
To: herbert; +Cc: shemminger, netdev, kaber, davem, linux-net
Hi,
> > @@ -924,7 +926,9 @@ cbq_dequeue_prio(struct Qdisc *sch, int
> > cl->xstats.borrows += skb->len;
> > #endif
> > }
> > - q->tx_len = skb->len;
> > + q->tx_segs = skb_shinfo(skb)->gso_segs ? :
> > + skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
> > + q->tx_len = (skb->len - 1)/q->tx_segs + 1;
>
> This isn't safe for Xen (and potentially other virtualisation
> environments) since qdisc code runs before dev_hard_start_xmit
> which is where we verify the sanity of gso_segs. So you could
> be using some arbitrary value from an untrusted source.
>
> If you really want to use it, you should test for SKB_GSO_DODGY
> on the packet which will be set if gso_segs can't be trusted.
Yep, you have a point that some sanity check should be added.
I think a simple check would be enough to keep CBQ from crashing,
since the accurate checking will be done in dev_hard_start_xmit()
or in the device drivers.
Thanks,
Hirokazu Takahashi.
* [PATCH 1/2] tbf scheduler: TSO support (update 3)
2007-05-18 2:35 ` Hirokazu Takahashi
@ 2007-05-23 11:47 ` Hirokazu Takahashi
2007-05-23 11:49 ` [PATCH 2/2] " Hirokazu Takahashi
0 siblings, 1 reply; 17+ messages in thread
From: Hirokazu Takahashi @ 2007-05-23 11:47 UTC (permalink / raw)
To: herbert; +Cc: shemminger, netdev, kaber, davem, linux-net
Hi,
> > > @@ -924,7 +926,9 @@ cbq_dequeue_prio(struct Qdisc *sch, int
> > > cl->xstats.borrows += skb->len;
> > > #endif
> > > }
> > > - q->tx_len = skb->len;
> > > + q->tx_segs = skb_shinfo(skb)->gso_segs ? :
> > > + skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
> > > + q->tx_len = (skb->len - 1)/q->tx_segs + 1;
> >
> > This isn't safe for Xen (and potentially other virtualisation
> > environments) since qdisc code runs before dev_hard_start_xmit
> > which is where we verify the sanity of gso_segs. So you could
> > be using some arbitrary value from an untrusted source.
> >
> > If you really want to use it, you should test for SKB_GSO_DODGY
> > on the packet which will be set if gso_segs can't be trusted.
>
> Yep, you have a point that some sanity check should be added.
> I think a simple check would be enough to keep CBQ from crashing,
> since the accurate checking will be done in dev_hard_start_xmit()
> or in the device drivers.
I updated the patch so that a substitute index is used to calculate
the transmission time when the index derived from gso_size exceeds
the size of the R_tab->data table; see the definition of L2T().
It is intended only to avoid causing trouble in CBQ
with a broken gso_size, which guests on the Xen hypervisor or
elsewhere could possibly set.
I haven't come up with anything better than this. What do you think of it?
Thanks,
Hirokazu Takahashi.
--- linux-2.6.21/net/sched/sch_cbq.c.ORG 2007-05-14 20:53:06.000000000 +0900
+++ linux-2.6.21/net/sched/sch_cbq.c 2007-05-21 21:07:48.000000000 +0900
@@ -176,6 +176,7 @@ struct cbq_sched_data
struct cbq_class *tx_class;
struct cbq_class *tx_borrowed;
int tx_len;
+ unsigned int tx_segs;
psched_time_t now; /* Cached timestamp */
psched_time_t now_rt; /* Cached real time */
unsigned pmask;
@@ -191,7 +192,15 @@ struct cbq_sched_data
};
-#define L2T(cl,len) ((cl)->R_tab->data[(len)>>(cl)->R_tab->rate.cell_log])
+inline psched_tdiff_t
+L2T(struct cbq_class *cl, int len) {
+ int nent = sizeof(cl->R_tab->data)/sizeof(cl->R_tab->data[0]);
+ int index = len >> cl->R_tab->rate.cell_log;
+ if (index < nent)
+ return cl->R_tab->data[index];
+ else
+ return cl->R_tab->data[nent - 1] * (index/nent + 1);
+}
static __inline__ unsigned cbq_hash(u32 h)
@@ -753,6 +762,7 @@ cbq_update(struct cbq_sched_data *q)
struct cbq_class *this = q->tx_class;
struct cbq_class *cl = this;
int len = q->tx_len;
+ unsigned int segs = q->tx_segs;
q->tx_class = NULL;
@@ -761,7 +771,7 @@ cbq_update(struct cbq_sched_data *q)
long idle;
cl->bstats.packets++;
- cl->bstats.bytes += len;
+ cl->bstats.bytes += len*segs;
/*
(now - last) is total time between packet right edges.
@@ -774,7 +784,7 @@ cbq_update(struct cbq_sched_data *q)
if ((unsigned long)idle > 128*1024*1024) {
avgidle = cl->maxidle;
} else {
- idle -= L2T(cl, len);
+ idle -= L2T(cl, len) * segs;
/* true_avgidle := (1-W)*true_avgidle + W*idle,
where W=2^{-ewma_log}. But cl->avgidle is scaled:
@@ -811,8 +821,8 @@ cbq_update(struct cbq_sched_data *q)
to the moment of cbq_update)
*/
- idle -= L2T(&q->link, len);
- idle += L2T(cl, len);
+ idle -= L2T(&q->link, len) * segs;
+ idle += L2T(cl, len) * segs;
PSCHED_AUDIT_TDIFF(idle);
@@ -924,7 +934,9 @@ cbq_dequeue_prio(struct Qdisc *sch, int
cl->xstats.borrows += skb->len;
#endif
}
- q->tx_len = skb->len;
+ q->tx_segs = skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs ? :
+ skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+ q->tx_len = (skb->len - 1)/q->tx_segs + 1;
if (cl->deficit <= 0) {
q->active[prio] = cl;
@@ -1013,7 +1025,7 @@ cbq_dequeue(struct Qdisc *sch)
cbq_time = max(real_time, work);
*/
- incr2 = L2T(&q->link, q->tx_len);
+ incr2 = L2T(&q->link, q->tx_len) * q->tx_segs;
PSCHED_TADD(q->now, incr2);
cbq_update(q);
if ((incr -= incr2) < 0)
* [PATCH 2/2] tbf scheduler: TSO support (update 3)
2007-05-23 11:47 ` [PATCH 1/2] tbf scheduler: TSO support (update 3) Hirokazu Takahashi
@ 2007-05-23 11:49 ` Hirokazu Takahashi
0 siblings, 0 replies; 17+ messages in thread
From: Hirokazu Takahashi @ 2007-05-23 11:49 UTC (permalink / raw)
To: herbert; +Cc: shemminger, netdev, kaber, davem, linux-net
Hi,
This patch is the same as the previous one;
I just cleaned up some of the code.
Thanks,
Hirokazu Takahashi.
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
--- linux-2.6.21/net/sched/sch_tbf.c.ORG 2007-05-08 20:59:28.000000000 +0900
+++ linux-2.6.21/net/sched/sch_tbf.c 2007-05-21 21:05:18.000000000 +0900
@@ -9,7 +9,8 @@
* Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
* Dmitry Torokhov <dtor@mail.ru> - allow attaching inner qdiscs -
* original idea by Martin Devera
- *
+ * Fixes:
+ * Hirokazu Takahashi <taka@valinux.co.jp> : TSO support
*/
#include <linux/module.h>
@@ -138,8 +139,11 @@ static int tbf_enqueue(struct sk_buff *s
{
struct tbf_sched_data *q = qdisc_priv(sch);
int ret;
+ unsigned int segs = skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs ? :
+ skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+ unsigned int len = (skb->len - 1)/segs + 1;
- if (skb->len > q->max_size) {
+ if (len > q->max_size) {
sch->qstats.drops++;
#ifdef CONFIG_NET_CLS_POLICE
if (sch->reshape_fail == NULL || sch->reshape_fail(skb, sch))
@@ -204,22 +208,40 @@ static struct sk_buff *tbf_dequeue(struc
psched_time_t now;
long toks, delay;
long ptoks = 0;
- unsigned int len = skb->len;
+ /*
+ * Note: TSO packets will be larger than its actual mtu.
+ * These packets should be treated as packets including
+ * several ordinary ones. In this case, tokens should
+ * be held until it reaches the length of them.
+ *
+ * To simplify, we assume each segment in a TSO packet
+ * has the same length though it may probably not be true.
+ * And ignore the length of headers which will be applied
+ * to each segment when splitting TSO packets.
+ *
+ * The number of segments are calculated from the segment
+ * size of TSO packets temporarily if it isn't set.
+ */
+ unsigned int segs = skb_shinfo(skb)->gso_segs ? :
+ skb_is_gso(skb) ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+ unsigned int len = (skb->len - 1)/segs + 1;
+ unsigned int expect = L2T(q, len) * segs;
+ long max_toks = max(expect, q->buffer);
PSCHED_GET_TIME(now);
- toks = PSCHED_TDIFF_SAFE(now, q->t_c, q->buffer);
+ toks = PSCHED_TDIFF_SAFE(now, q->t_c, max_toks);
if (q->P_tab) {
ptoks = toks + q->ptokens;
- if (ptoks > (long)q->mtu)
- ptoks = q->mtu;
- ptoks -= L2T_P(q, len);
+ if (ptoks > (long)(q->mtu * segs))
+ ptoks = q->mtu * segs;
+ ptoks -= L2T_P(q, len) * segs;
}
toks += q->tokens;
- if (toks > (long)q->buffer)
- toks = q->buffer;
- toks -= L2T(q, len);
+ if (toks > max_toks)
+ toks = max_toks;
+ toks -= expect;
if ((toks|ptoks) >= 0) {
q->t_c = now;
end of thread
Thread overview: 17+ messages
[not found] <20070510.210556.05145104.taka@valinux.co.jp>
2007-05-10 12:56 ` [PATCH] tbf scheduler: TSO support Patrick McHardy
2007-05-10 21:13 ` David Miller
2007-05-11 1:04 ` Patrick McHardy
2007-05-11 4:01 ` Hirokazu Takahashi
2007-05-11 18:13 ` Patrick McHardy
2007-05-12 8:49 ` Hirokazu Takahashi
2007-05-12 9:09 ` Hirokazu Takahashi
2007-05-13 12:42 ` [PATCH] tbf scheduler: TSO support (updated) Hirokazu Takahashi
2007-05-13 15:52 ` Stephen Hemminger
2007-05-13 16:21 ` Hirokazu Takahashi
2007-05-14 6:04 ` Herbert Xu
2007-05-15 11:39 ` [PATCH 1/2] tbf scheduler: TSO support (update 2) Hirokazu Takahashi
2007-05-15 11:41 ` [PATCH 2/2] " Hirokazu Takahashi
2007-05-17 13:55 ` Herbert Xu
2007-05-18 2:35 ` Hirokazu Takahashi
2007-05-23 11:47 ` [PATCH 1/2] tbf scheduler: TSO support (update 3) Hirokazu Takahashi
2007-05-23 11:49 ` [PATCH 2/2] " Hirokazu Takahashi