* Re: [PATCH] tbf scheduler: TSO support
       [not found] <20070510.210556.05145104.taka@valinux.co.jp>
@ 2007-05-10 12:56 ` Patrick McHardy
  2007-05-10 21:13   ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Patrick McHardy @ 2007-05-10 12:56 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: linux-net, Linux Netdev List

Hirokazu Takahashi wrote:
> TBF --- Simple Token Bucket Filter --- packet scheduler doesn't
> work correctly with TSO on that it slows down to send out packets.
> TSO packets will be discarded since the size can be larger than
> the scheduler expects. But it won't cause serious problems
> because the retransmitted packets can be passed.
>
> So I made the scheduler allow to pass TSO packets:
>
> - tbf_enqueue() accepts packets with any size if the netdevice
>   has TSO ability.
>
> - tbf_dequeue() can handle the packets whose size is larger than
>   the bucket, which keeps tokens.
>   Any packet, which may be TSO packet, can be sent if the bucket is
>   full of tokens. this may lead that the number of tokens in
>   the bucket turns into negative value, which means kind of debt.
>   But we don't have to mind it because this will be filled up
>   with tokens in a short time and it will turns into positive value
>   again.
>
> I'm not sure if this approach is the best. I appreciate any comments.


I don't see why this is needed, the correct way to use TBF with TSO
is to specify a larger MTU value, in which case it won't drop TSO
packets.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH] tbf scheduler: TSO support 2007-05-10 12:56 ` [PATCH] tbf scheduler: TSO support Patrick McHardy @ 2007-05-10 21:13 ` David Miller 2007-05-11 1:04 ` Patrick McHardy 0 siblings, 1 reply; 19+ messages in thread From: David Miller @ 2007-05-10 21:13 UTC (permalink / raw) To: kaber; +Cc: taka, linux-net, netdev From: Patrick McHardy <kaber@trash.net> Date: Thu, 10 May 2007 14:56:39 +0200 > Hirokazu Takahashi wrote: > > TBF --- Simple Token Bucket Filter --- packet scheduler doesn't > > work correctly with TSO on that it slows down to send out packets. > > TSO packets will be discarded since the size can be larger than > > the scheduler expects. But it won't cause serious problems > > because the retransmitted packets can be passed. > > > > So I made the scheduler allow to pass TSO packets: > > > > - tbf_enqueue() accepts packets with any size if the netdevice > > has TSO ability. > > > > - tbf_dequeue() can handle the packets whose size is larger than > > the bucket, which keeps tokens. > > Any packet, which may be TSO packet, can be sent if the bucket is > > full of tokens. this may lead that the number of tokens in > > the bucket turns into negative value, which means kind of debt. > > But we don't have to mind it because this will be filled up > > with tokens in a short time and it will turns into positive value > > again. > > > > I'm not sure if this approach is the best. I appreciate any comments. > > > I don't see why this is needed, the correct way to use TBF with TSO > is to specify a larger MTU value, in which case it won't drop TSO > packets. Why should a user have to know anything in the world about TSO in order to configure TBF properly? I don't think they should have to at all. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] tbf scheduler: TSO support 2007-05-10 21:13 ` David Miller @ 2007-05-11 1:04 ` Patrick McHardy 2007-05-11 4:01 ` Hirokazu Takahashi 0 siblings, 1 reply; 19+ messages in thread From: Patrick McHardy @ 2007-05-11 1:04 UTC (permalink / raw) To: David Miller; +Cc: taka, linux-net, netdev David Miller wrote: > From: Patrick McHardy <kaber@trash.net> > Date: Thu, 10 May 2007 14:56:39 +0200 > >>I don't see why this is needed, the correct way to use TBF with TSO >>is to specify a larger MTU value, in which case it won't drop TSO >>packets. > > > Why should a user have to know anything in the world about TSO in > order to configure TBF properly? I don't think they should have > to at all. The user shouldn't necessarily, but userspace should. The way I see it the MTU is a fundamental parameter for TBF (the peakrate bucket size) and just because userspace picks a bad default (2000) this is no reason to change the implementation to something that is not really TBF anymore and even affects non-TSO packets _and_ TSO packets even when the MTU is chosen large enough (granted, the first point is an implementation detail). The much better solution would be to let userspace pick an appropriate default value and still let the user override it. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] tbf scheduler: TSO support
  2007-05-11  1:04     ` Patrick McHardy
@ 2007-05-11  4:01       ` Hirokazu Takahashi
  2007-05-11 18:13         ` Patrick McHardy
  0 siblings, 1 reply; 19+ messages in thread
From: Hirokazu Takahashi @ 2007-05-11  4:01 UTC (permalink / raw)
  To: kaber; +Cc: davem, linux-net, netdev

Hi,

> >>I don't see why this is needed, the correct way to use TBF with TSO
> >>is to specify a larger MTU value, in which case it won't drop TSO
> >>packets.
> >
> >
> > Why should a user have to know anything in the world about TSO in
> > order to configure TBF properly? I don't think they should have
> > to at all.
>
>
> The user shouldn't necessarily, but userspace should.
> The way I see it the MTU is a fundamental parameter for TBF
> (the peakrate bucket size) and just because userspace picks
> a bad default (2000) this is no reason to change the
> implementation to something that is not really TBF anymore
> and even affects non-TSO packets _and_ TSO packets even
> when the MTU is chosen large enough (granted, the first
> point is an implementation detail). The much better solution
> would be to let userspace pick an appropriate default value
> and still let the user override it.

I think the concept of TBF is quite good, but the userspace tools have
grown old and no longer fit a gigabit ethernet environment. The tools
should be updated to take much faster networks and GbE jumbo frames
into account. I agree with you on that point.

On the other hand, handling TSO packets should be a kernel issue.
A TSO packet is logically just a multi-segment packet containing
several ordinary packets, and this feature should stay invisible to
userspace. I also feel it is hard to predict how large the accepted
TSO packets can or will be, since that depends on the ethernet
controller. It would also be tough to keep the tools up to date with
the latest offloading features.

Thank you,
Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH] tbf scheduler: TSO support 2007-05-11 4:01 ` Hirokazu Takahashi @ 2007-05-11 18:13 ` Patrick McHardy 2007-05-12 8:49 ` Hirokazu Takahashi 0 siblings, 1 reply; 19+ messages in thread From: Patrick McHardy @ 2007-05-11 18:13 UTC (permalink / raw) To: Hirokazu Takahashi; +Cc: davem, linux-net, netdev Hirokazu Takahashi wrote: > I think the concept of TBF is quit good but the userspace tools have > become old that it doesn't fit to Gb ethernet environment. > The tools should be updated to care about much faster network and > GbE jumbo frames. I agree with you at this point. > > On the other hand, handling TSO packet should be a kernel issue. > A TSO packet is logically just a multiple segmented packet > including several ordinary packets. This feature should be kept > invisible from userspace. Putting aside this question, this cannot work properly without userspace since the rate table is only built for the given MTU. Your patch tries to work around that by summing up the result of L2T in q->max_size steps, which gives incorrect results with MPU or overhead settings. I think we can only do two things without userspace support: - disable TSO (not a good idea) - split the skb into q->max_size chunks on dequeue The later would allow people to take full advantage of TSO with properly configured TBF, but it would still at least work with a too small mtu value. ^ permalink raw reply [flat|nested] 19+ messages in thread
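
To make the point about MPU and overhead concrete, here is a small
standalone C model. It is illustrative code only, not the kernel's or
iproute2's data structures; the table size, rate, chunk size and
per-packet overhead are invented numbers. It shows how charging a large
packet as a sum of per-chunk rate-table lookups counts the per-packet
overhead once per chunk instead of once per packet, and also rounds each
chunk up to a table cell:

/* toy_rtab.c -- why summing per-chunk rate-table lookups over-charges
 * when a per-packet overhead is folded into the table. Simplified
 * stand-in for the tc/kernel rate table; all numbers are made up.
 */
#include <stdio.h>

#define CELL_LOG  3                    /* 8-byte granularity              */
#define SLOTS     256                  /* table covers 256 * 8 = 2048 B   */
#define OVERHEAD  24                   /* per-packet overhead in bytes    */
#define RATE      1000000.0            /* bytes per second                */

static double rtab[SLOTS];             /* usecs per size bucket           */

static double l2t(unsigned int len)    /* analogue of the L2T() lookup    */
{
	unsigned int slot = len >> CELL_LOG;

	if (slot >= SLOTS)
		slot = SLOTS - 1;      /* clamp so the model stays in bounds;
					  the real table simply isn't defined
					  past the configured MTU            */
	return rtab[slot];
}

int main(void)
{
	unsigned int i, step, chunk = 1514, tso_len = 65000;
	double summed = 0.0, exact;

	for (i = 0; i < SLOTS; i++)
		rtab[i] = (((i + 1) << CELL_LOG) + OVERHEAD) * 1e6 / RATE;

	/* charge the TSO packet chunk by chunk, as the patch does */
	for (i = tso_len; i > 0; i -= step) {
		step = i > chunk ? chunk : i;
		summed += l2t(step);
	}

	/* what one 65000-byte packet really costs: overhead counted once */
	exact = (tso_len + OVERHEAD) * 1e6 / RATE;

	printf("summed per-chunk charge: %.0f us\n", summed);
	printf("single-packet charge:    %.0f us\n", exact);
	return 0;
}

With a table built only for MTU-sized packets there is no slot that
describes a 64 KB super-packet at all, which is the deeper problem
Patrick points out above.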
* Re: [PATCH] tbf scheduler: TSO support 2007-05-11 18:13 ` Patrick McHardy @ 2007-05-12 8:49 ` Hirokazu Takahashi 2007-05-12 9:09 ` Hirokazu Takahashi 2007-05-13 12:42 ` [PATCH] tbf scheduler: TSO support (updated) Hirokazu Takahashi 0 siblings, 2 replies; 19+ messages in thread From: Hirokazu Takahashi @ 2007-05-12 8:49 UTC (permalink / raw) To: kaber; +Cc: davem, linux-net, netdev Hi, > > I think the concept of TBF is quit good but the userspace tools have > > become old that it doesn't fit to Gb ethernet environment. > > The tools should be updated to care about much faster network and > > GbE jumbo frames. I agree with you at this point. > > > > On the other hand, handling TSO packet should be a kernel issue. > > A TSO packet is logically just a multiple segmented packet > > including several ordinary packets. This feature should be kept > > invisible from userspace. > > > Putting aside this question, this cannot work properly without userspace > since the rate table is only built for the given MTU. Your patch tries > to work around that by summing up the result of L2T in q->max_size > steps, which gives incorrect results with MPU or overhead settings. You have a point. sch->dev->mtu should be used instead of q->max_size, which gives precisely correct results. Every TSO packet will be split into several packets with sch->dev->mtu length, which administrators specify. The splitting is automatically done by TSO supporting controllers. To minimize the overhead of the calculation I'd re-implement it like: segs = skb->len/sch->dev->mtu; rest = skb->len - sch->dev->mtu*segs; toks -= L2T(q, sch->dev->mtu) * segs; if (rest) toks -= L2T(q, rest); I think the problem is that sch->dev->mtu might not be set correctly, so some error checking is needed. > I think we can only do two things without userspace support: > > - disable TSO (not a good idea) This is just what I'm doing now as a workaround. > - split the skb into q->max_size chunks on dequeue It sure is difficult because the splitting deeply depends on the protocol stack which made the skb and it will make some overhead, which is why TSO and other offloading features are introduced. I'm now thinking I can make it just hold a TSO packet until p->tokens reaches the size of the packet. I think it is straightforward implementation. I'll try this. > The later would allow people to take full advantage of TSO with properly > configured TBF, but it would still at least work with a too small mtu > value. Thanks, Hirokazu Takahashi. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] tbf scheduler: TSO support 2007-05-12 8:49 ` Hirokazu Takahashi @ 2007-05-12 9:09 ` Hirokazu Takahashi 2007-05-13 12:42 ` [PATCH] tbf scheduler: TSO support (updated) Hirokazu Takahashi 1 sibling, 0 replies; 19+ messages in thread From: Hirokazu Takahashi @ 2007-05-12 9:09 UTC (permalink / raw) To: kaber; +Cc: davem, linux-net, netdev Hi, > > > I think the concept of TBF is quit good but the userspace tools have > > > become old that it doesn't fit to Gb ethernet environment. > > > The tools should be updated to care about much faster network and > > > GbE jumbo frames. I agree with you at this point. > > > > > > On the other hand, handling TSO packet should be a kernel issue. > > > A TSO packet is logically just a multiple segmented packet > > > including several ordinary packets. This feature should be kept > > > invisible from userspace. > > > > > > Putting aside this question, this cannot work properly without userspace > > since the rate table is only built for the given MTU. Your patch tries > > to work around that by summing up the result of L2T in q->max_size > > steps, which gives incorrect results with MPU or overhead settings. > > You have a point. sch->dev->mtu should be used instead of q->max_size, > which gives precisely correct results. Oops, this isn't really true because TSO supporting controllers will probably add some headers to the split packets. So I'd say "it will give almost correct results". > Every TSO packet will be split > into several packets with sch->dev->mtu length, which administrators > specify. The splitting is automatically done by TSO supporting controllers. > > To minimize the overhead of the calculation I'd re-implement it like: > > segs = skb->len/sch->dev->mtu; > rest = skb->len - sch->dev->mtu*segs; > toks -= L2T(q, sch->dev->mtu) * segs; > if (rest) > toks -= L2T(q, rest); > > I think the problem is that sch->dev->mtu might not be set correctly, > so some error checking is needed. > > > I think we can only do two things without userspace support: > > > > - disable TSO (not a good idea) > > This is just what I'm doing now as a workaround. > > > - split the skb into q->max_size chunks on dequeue > > It sure is difficult because the splitting deeply depends on the > protocol stack which made the skb and it will make some overhead, > which is why TSO and other offloading features are introduced. > > I'm now thinking I can make it just hold a TSO packet until p->tokens > reaches the size of the packet. I think it is straightforward > implementation. I'll try this. > > > The later would allow people to take full advantage of TSO with properly > > configured TBF, but it would still at least work with a too small mtu > > value. > > Thanks, > Hirokazu Takahashi. > > - > To unsubscribe from this list: send the line "unsubscribe linux-net" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH] tbf scheduler: TSO support (updated) 2007-05-12 8:49 ` Hirokazu Takahashi 2007-05-12 9:09 ` Hirokazu Takahashi @ 2007-05-13 12:42 ` Hirokazu Takahashi 2007-05-13 15:52 ` Stephen Hemminger 1 sibling, 1 reply; 19+ messages in thread From: Hirokazu Takahashi @ 2007-05-13 12:42 UTC (permalink / raw) To: netdev; +Cc: kaber, davem, linux-net Hi, > I'm now thinking I can make it just hold a TSO packet until p->tokens > reaches the size of the packet. I think it is straightforward > implementation. I'll try this. I re-implemented the patch, which is simpler than the previous one. sch->dev->mtu is used to determine how many segments are included in a TSO packet. The point is that the value of sch->dev->mtu has to be treated possibly broken because it is set by administrators. Thanks, Hirokazu Takahashi. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> --- linux-2.6.21/net/sched/sch_tbf.c.ORG 2007-05-08 20:59:28.000000000 +0900 +++ linux-2.6.21/net/sched/sch_tbf.c 2007-05-13 14:22:39.000000000 +0900 @@ -9,7 +9,8 @@ * Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru> * Dmitry Torokhov <dtor@mail.ru> - allow attaching inner qdiscs - * original idea by Martin Devera - * + * Fixes: + * Hirokazu Takahashi <taka@valinux.co.jp> : TSO support */ #include <linux/module.h> @@ -139,7 +140,7 @@ static int tbf_enqueue(struct sk_buff *s struct tbf_sched_data *q = qdisc_priv(sch); int ret; - if (skb->len > q->max_size) { + if (skb->len > q->max_size && !(sch->dev->features & NETIF_F_GSO_MASK)) { sch->qstats.drops++; #ifdef CONFIG_NET_CLS_POLICE if (sch->reshape_fail == NULL || sch->reshape_fail(skb, sch)) @@ -205,6 +206,16 @@ static struct sk_buff *tbf_dequeue(struc long toks, delay; long ptoks = 0; unsigned int len = skb->len; + /* + * Note: TSO packets will be larger than its actual mtu. + * These packets should be treated as packets including + * several ordinary ones. In this case, tokens should + * be held until it reaches the length of them. + */ + long max_toks = max(len, q->buffer); + unsigned int rmtu = sch->dev->mtu? min(q->max_size, sch->dev->mtu) : q->max_size; + unsigned int segs = len/rmtu; + unsigned int rest = len - rmtu*segs; PSCHED_GET_TIME(now); @@ -212,14 +223,18 @@ static struct sk_buff *tbf_dequeue(struc if (q->P_tab) { ptoks = toks + q->ptokens; - if (ptoks > (long)q->mtu) - ptoks = q->mtu; - ptoks -= L2T_P(q, len); + if (ptoks > (long)(q->mtu * (segs + !!rest))) + ptoks = q->mtu * (segs + !!rest); + ptoks -= L2T_P(q, rmtu) * segs; + if (rest) + ptoks -= L2T_P(q, rest); } toks += q->tokens; - if (toks > (long)q->buffer) - toks = q->buffer; - toks -= L2T(q, len); + if (toks > max_toks) + toks = max_toks; + toks -= L2T(q, rmtu) * segs; + if (rest) + toks -= L2T(q, rest); if ((toks|ptoks) >= 0) { q->t_c = now; ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] tbf scheduler: TSO support (updated) 2007-05-13 12:42 ` [PATCH] tbf scheduler: TSO support (updated) Hirokazu Takahashi @ 2007-05-13 15:52 ` Stephen Hemminger 2007-05-13 16:21 ` Hirokazu Takahashi 0 siblings, 1 reply; 19+ messages in thread From: Stephen Hemminger @ 2007-05-13 15:52 UTC (permalink / raw) To: Hirokazu Takahashi; +Cc: netdev, kaber, davem, linux-net On Sun, 13 May 2007 21:42:36 +0900 (JST) Hirokazu Takahashi <taka@valinux.co.jp> wrote: > Hi, > > > I'm now thinking I can make it just hold a TSO packet until > > p->tokens reaches the size of the packet. I think it is > > straightforward implementation. I'll try this. > > I re-implemented the patch, which is simpler than the previous one. > > sch->dev->mtu is used to determine how many segments are included > in a TSO packet. The point is that the value of sch->dev->mtu has to > be treated possibly broken because it is set by administrators. > There are cases where the TSO packet will be divided into chunks smaller than the MTU. If path MTU discovery is being used, the chunk size of the TSO skb will be smaller. To be accurate you need to look at mss (gso_size) and compute actual packet size based on protocol (IPV4/IPV6 and UDP/TCP). ^ permalink raw reply [flat|nested] 19+ messages in thread
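
A rough numeric illustration of this point, using typical IPv4/TCP
example values (the MSS, header size and payload length below are
assumptions for illustration, not values taken from the patch):

#include <stdio.h>

int main(void)
{
	unsigned int payload = 65160;  /* TSO skb payload bytes (example)   */
	unsigned int mtu     = 1500;   /* device MTU                        */
	unsigned int mss     = 1448;   /* gso_size: MSS with TCP timestamps */
	unsigned int hdrs    = 52;     /* IPv4 + TCP + timestamp option     */

	unsigned int segs_by_mtu = (payload + mtu - 1) / mtu;
	unsigned int segs_by_mss = (payload + mss - 1) / mss;
	unsigned int wire_bytes  = payload + segs_by_mss * hdrs;

	printf("segments assuming MTU-sized chunks: %u\n", segs_by_mtu);
	printf("segments from gso_size (MSS):       %u\n", segs_by_mss);
	printf("approx. bytes on the wire:          %u\n", wire_bytes);
	return 0;
}

Dividing by the device MTU both miscounts the number of segments and
ignores the per-segment headers, which is why gso_size is the better
basis for the estimate.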
* Re: [PATCH] tbf scheduler: TSO support (updated) 2007-05-13 15:52 ` Stephen Hemminger @ 2007-05-13 16:21 ` Hirokazu Takahashi 2007-05-14 6:04 ` Herbert Xu 0 siblings, 1 reply; 19+ messages in thread From: Hirokazu Takahashi @ 2007-05-13 16:21 UTC (permalink / raw) To: shemminger; +Cc: netdev, kaber, davem, linux-net Hi, > > > I'm now thinking I can make it just hold a TSO packet until > > > p->tokens reaches the size of the packet. I think it is > > > straightforward implementation. I'll try this. > > > > I re-implemented the patch, which is simpler than the previous one. > > > > sch->dev->mtu is used to determine how many segments are included > > in a TSO packet. The point is that the value of sch->dev->mtu has to > > be treated possibly broken because it is set by administrators. > > > > There are cases where the TSO packet will be divided into chunks > smaller than the MTU. If path MTU discovery is being used, the chunk > size of the TSO skb will be smaller. To be accurate you need to look > at mss (gso_size) and compute actual packet size based on protocol > (IPV4/IPV6 and UDP/TCP). Uhh, you are right. skb_shinfo(skb)->gso_segs and skb_shinfo(skb)->gso_size should be used. Thanks. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] tbf scheduler: TSO support (updated) 2007-05-13 16:21 ` Hirokazu Takahashi @ 2007-05-14 6:04 ` Herbert Xu 2007-05-15 11:39 ` [PATCH 1/2] tbf scheduler: TSO support (update 2) Hirokazu Takahashi 0 siblings, 1 reply; 19+ messages in thread From: Herbert Xu @ 2007-05-14 6:04 UTC (permalink / raw) To: Hirokazu Takahashi; +Cc: shemminger, netdev, kaber, davem, linux-net Hirokazu Takahashi <taka@valinux.co.jp> wrote: > > Uhh, you are right. > skb_shinfo(skb)->gso_segs and skb_shinfo(skb)->gso_size should be used. Actually forget about gso_segs, it's only filled in for TCP. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/2] tbf scheduler: TSO support (update 2)
  2007-05-14  6:04           ` Herbert Xu
@ 2007-05-15 11:39             ` Hirokazu Takahashi
  2007-05-15 11:41               ` [PATCH 2/2] " Hirokazu Takahashi
  0 siblings, 1 reply; 19+ messages in thread
From: Hirokazu Takahashi @ 2007-05-15 11:39 UTC (permalink / raw)
  To: herbert; +Cc: shemminger, netdev, kaber, davem, linux-net

Hi,

> Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> >
> > Uhh, you are right.
> > skb_shinfo(skb)->gso_segs and skb_shinfo(skb)->gso_size should be used.
>
> Actually forget about gso_segs, it's only filled in for TCP.

I realized it is really hard to determine the actual size of each
packet that will be generated from a TSO packet, which is the size
that should be used to account for the traffic accurately. There isn't
enough information in the socket buffer to determine the size of the
headers: gso_size only gives the maximum length of a segment without
any headers, and the other members don't help either.

                    split into
   TSO packet    ----------->    packets after being split

   +----------+               +----------+
   | headers  |               | headers  |
   +----------+               +----------+  ----
   | segment1 |               | segment1 |   A
   |          |               |          |   |  gso_size
   |          |               |          |   V
   +----------+               +----------+  ----
   | segment2 |
   |          |               +----------+
   |          |               | headers  |
   +----------+               +----------+
   | segment3 |               | segment2 |
   |          |               |          |
   +----------+               |          |
                              +----------+

                              +----------+
                              | headers  |
                              +----------+
                              | segment3 |
                              |          |
                              +----------+

So I decided to keep the traffic calculation simple:

 - assume all packets generated from the same TSO packet have the
   same length.
 - ignore the length of the additional headers which will be applied
   automatically.

It seems to work pretty well for controlling bandwidth, as I expected,
but I'm not sure everybody will be satisfied with it. Do you think
this approximate calculation is enough?

I also realized the CBQ scheduler has to be fixed to handle large TSO
packets, or it may possibly cause an Oops. The next mail contains the
patch for CBQ.

--- linux-2.6.21/net/sched/sch_tbf.c.ORG	2007-05-08 20:59:28.000000000 +0900
+++ linux-2.6.21/net/sched/sch_tbf.c	2007-05-15 19:59:34.000000000 +0900
@@ -9,7 +9,8 @@
  * Authors:	Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
  *		Dmitry Torokhov <dtor@mail.ru> - allow attaching inner qdiscs -
  *						 original idea by Martin Devera
- *
+ * Fixes:
+ * Hirokazu Takahashi <taka@valinux.co.jp> : TSO support
  */
 
 #include <linux/module.h>
@@ -138,8 +139,12 @@ static int tbf_enqueue(struct sk_buff *s
 {
 	struct tbf_sched_data *q = qdisc_priv(sch);
 	int ret;
+	//unsigned int segs = skb_shinfo(skb)->gso_segs ? : 1;
+	unsigned int segs = skb_shinfo(skb)->gso_segs ? :
+		skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+	unsigned int len = (skb->len - 1)/segs + 1;
 
-	if (skb->len > q->max_size) {
+	if (len > q->max_size) {
 		sch->qstats.drops++;
 #ifdef CONFIG_NET_CLS_POLICE
 		if (sch->reshape_fail == NULL || sch->reshape_fail(skb, sch))
@@ -204,22 +209,41 @@ static struct sk_buff *tbf_dequeue(struc
 		psched_time_t now;
 		long toks, delay;
 		long ptoks = 0;
-		unsigned int len = skb->len;
+		/*
+		 * Note: TSO packets will be larger than its actual mtu.
+		 * These packets should be treated as packets including
+		 * several ordinary ones. In this case, tokens should
+		 * be held until it reaches the length of them.
+		 *
+		 * To simplify, we assume each segment in a TSO packet
+		 * has the same length though it may probably not be true.
+		 * And ignore the length of headers which will be applied
+		 * to each segment when splitting TSO packets.
+		 *
+		 * The number of segments are calculated from the segment
+		 * size of TSO packets temporarily if it isn't set.
+		 */
+		unsigned int segs = skb_shinfo(skb)->gso_segs ? :
+			skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+		unsigned int len = (skb->len - 1)/segs + 1;
+		unsigned int expect = L2T(q, len) * segs;
+		long max_toks = max(expect, q->buffer);
+
 
 		PSCHED_GET_TIME(now);
 
-		toks = PSCHED_TDIFF_SAFE(now, q->t_c, q->buffer);
+		toks = PSCHED_TDIFF_SAFE(now, q->t_c, max_toks);
 
 		if (q->P_tab) {
 			ptoks = toks + q->ptokens;
-			if (ptoks > (long)q->mtu)
-				ptoks = q->mtu;
-			ptoks -= L2T_P(q, len);
+			if (ptoks > (long)(q->mtu * segs))
+				ptoks = q->mtu * segs;
+			ptoks -= L2T_P(q, len) * segs;
 		}
 		toks += q->tokens;
-		if (toks > (long)q->buffer)
-			toks = q->buffer;
-		toks -= L2T(q, len);
+		if (toks > max_toks)
+			toks = max_toks;
+		toks -= expect;
 
 		if ((toks|ptoks) >= 0) {
 			q->t_c = now;

^ permalink raw reply	[flat|nested] 19+ messages in thread
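
For readers following the thread, this is a self-contained sketch of
the approximation used in the patch above. The struct fields mirror the
skb_shinfo() members by name, but everything here, including the linear
l2t(), is a toy stand-in rather than kernel code:

#include <stdio.h>

struct fake_skb {
	unsigned int len;        /* total length of the unsegmented skb  */
	unsigned int gso_segs;   /* may be 0: only TCP fills it in       */
	unsigned int gso_size;   /* 0 for ordinary (non-GSO) packets     */
};

/* toy length-to-time table: 1 unit per byte (a real L2T() is a lookup) */
static long l2t(unsigned int len) { return (long)len; }

static long tso_charge(const struct fake_skb *skb)
{
	unsigned int segs = skb->gso_segs ? skb->gso_segs :
	                    skb->gso_size ? skb->len / skb->gso_size + 1 : 1;
	unsigned int len  = (skb->len - 1) / segs + 1;  /* per-segment size */

	/* Note: len/gso_size + 1 over-counts by one segment when len is an
	 * exact multiple of gso_size; the patch accepts that imprecision. */
	return l2t(len) * segs;
}

int main(void)
{
	struct fake_skb tso   = { 65160, 45, 1448 };
	struct fake_skb plain = { 1500, 0, 0 };

	printf("TSO skb charged as:   %ld token units\n", tso_charge(&tso));
	printf("plain skb charged as: %ld token units\n", tso_charge(&plain));
	return 0;
}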
* [PATCH 2/2] tbf scheduler: TSO support (update 2) 2007-05-15 11:39 ` [PATCH 1/2] tbf scheduler: TSO support (update 2) Hirokazu Takahashi @ 2007-05-15 11:41 ` Hirokazu Takahashi 2007-05-17 13:55 ` Herbert Xu 0 siblings, 1 reply; 19+ messages in thread From: Hirokazu Takahashi @ 2007-05-15 11:41 UTC (permalink / raw) To: herbert; +Cc: shemminger, netdev, kaber, davem, linux-net Hi, Without this patch, CBQ scheduler may cause Oops with some TSO packets because the tables passed to L2T() isn't big enough to handle such huge packets. Thanks, Hirokazu Takahashi. --- linux-2.6.21/net/sched/sch_cbq.c.ORG 2007-05-14 20:53:06.000000000 +0900 +++ linux-2.6.21/net/sched/sch_cbq.c 2007-05-15 18:10:43.000000000 +0900 @@ -176,6 +176,7 @@ struct cbq_sched_data struct cbq_class *tx_class; struct cbq_class *tx_borrowed; int tx_len; + unsigned int tx_segs; psched_time_t now; /* Cached timestamp */ psched_time_t now_rt; /* Cached real time */ unsigned pmask; @@ -753,6 +754,7 @@ cbq_update(struct cbq_sched_data *q) struct cbq_class *this = q->tx_class; struct cbq_class *cl = this; int len = q->tx_len; + unsigned int segs = q->tx_segs; q->tx_class = NULL; @@ -761,7 +763,7 @@ cbq_update(struct cbq_sched_data *q) long idle; cl->bstats.packets++; - cl->bstats.bytes += len; + cl->bstats.bytes += len*segs; /* (now - last) is total time between packet right edges. @@ -774,7 +776,7 @@ cbq_update(struct cbq_sched_data *q) if ((unsigned long)idle > 128*1024*1024) { avgidle = cl->maxidle; } else { - idle -= L2T(cl, len); + idle -= L2T(cl, len) * segs; /* true_avgidle := (1-W)*true_avgidle + W*idle, where W=2^{-ewma_log}. But cl->avgidle is scaled: @@ -811,8 +813,8 @@ cbq_update(struct cbq_sched_data *q) to the moment of cbq_update) */ - idle -= L2T(&q->link, len); - idle += L2T(cl, len); + idle -= L2T(&q->link, len) * segs; + idle += L2T(cl, len) * segs; PSCHED_AUDIT_TDIFF(idle); @@ -924,7 +926,9 @@ cbq_dequeue_prio(struct Qdisc *sch, int cl->xstats.borrows += skb->len; #endif } - q->tx_len = skb->len; + q->tx_segs = skb_shinfo(skb)->gso_segs ? : + skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1; + q->tx_len = (skb->len - 1)/q->tx_segs + 1; if (cl->deficit <= 0) { q->active[prio] = cl; @@ -1013,7 +1017,7 @@ cbq_dequeue(struct Qdisc *sch) cbq_time = max(real_time, work); */ - incr2 = L2T(&q->link, q->tx_len); + incr2 = L2T(&q->link, q->tx_len) * q->tx_segs; PSCHED_TADD(q->now, incr2); cbq_update(q); if ((incr -= incr2) < 0) ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] tbf scheduler: TSO support (update 2) 2007-05-15 11:41 ` [PATCH 2/2] " Hirokazu Takahashi @ 2007-05-17 13:55 ` Herbert Xu 2007-05-18 2:35 ` Hirokazu Takahashi 0 siblings, 1 reply; 19+ messages in thread From: Herbert Xu @ 2007-05-17 13:55 UTC (permalink / raw) To: Hirokazu Takahashi; +Cc: shemminger, netdev, kaber, davem, linux-net On Tue, May 15, 2007 at 08:41:48PM +0900, Hirokazu Takahashi wrote: > > @@ -924,7 +926,9 @@ cbq_dequeue_prio(struct Qdisc *sch, int > cl->xstats.borrows += skb->len; > #endif > } > - q->tx_len = skb->len; > + q->tx_segs = skb_shinfo(skb)->gso_segs ? : > + skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1; > + q->tx_len = (skb->len - 1)/q->tx_segs + 1; This isn't safe for Xen (and potentially other virtualisation environments) since qdisc code runs before dev_hard_start_xmit which is where we verify the sanity of gso_segs. So you could be using some arbitrary value from an untrusted source. If you really want to use it, you should test for SKB_GSO_DODGY on the packet which will be set if gso_segs can't be trusted. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] tbf scheduler: TSO support (update 2) 2007-05-17 13:55 ` Herbert Xu @ 2007-05-18 2:35 ` Hirokazu Takahashi 2007-05-23 11:47 ` [PATCH 1/2] tbf scheduler: TSO support (update 3) Hirokazu Takahashi 0 siblings, 1 reply; 19+ messages in thread From: Hirokazu Takahashi @ 2007-05-18 2:35 UTC (permalink / raw) To: herbert; +Cc: shemminger, netdev, kaber, davem, linux-net Hi, > > @@ -924,7 +926,9 @@ cbq_dequeue_prio(struct Qdisc *sch, int > > cl->xstats.borrows += skb->len; > > #endif > > } > > - q->tx_len = skb->len; > > + q->tx_segs = skb_shinfo(skb)->gso_segs ? : > > + skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1; > > + q->tx_len = (skb->len - 1)/q->tx_segs + 1; > > This isn't safe for Xen (and potentially other virtualisation > environments) since qdisc code runs before dev_hard_start_xmit > which is where we verify the sanity of gso_segs. So you could > be using some arbitrary value from an untrusted source. > > If you really want to use it, you should test for SKB_GSO_DODGY > on the packet which will be set if gso_segs can't be trusted. Yep, you have a point that some sanity check should be added. I think a simple check would be enough not to crash CBQ as the accurate checking will be done in dev_hard_start_xmit or device drivers. Thanks, Hirokazu Takahashi. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/2] tbf scheduler: TSO support (update 3)
  2007-05-18  2:35                 ` Hirokazu Takahashi
@ 2007-05-23 11:47                   ` Hirokazu Takahashi
  2007-05-23 11:49                     ` [PATCH 2/2] " Hirokazu Takahashi
  0 siblings, 1 reply; 19+ messages in thread
From: Hirokazu Takahashi @ 2007-05-23 11:47 UTC (permalink / raw)
  To: herbert; +Cc: shemminger, netdev, kaber, davem, linux-net

Hi,

> > > @@ -924,7 +926,9 @@ cbq_dequeue_prio(struct Qdisc *sch, int
> > > 		cl->xstats.borrows += skb->len;
> > > #endif
> > > 	}
> > > -	q->tx_len = skb->len;
> > > +	q->tx_segs = skb_shinfo(skb)->gso_segs ? :
> > > +		skb_shinfo(skb)->gso_size ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
> > > +	q->tx_len = (skb->len - 1)/q->tx_segs + 1;
> >
> > This isn't safe for Xen (and potentially other virtualisation
> > environments) since qdisc code runs before dev_hard_start_xmit
> > which is where we verify the sanity of gso_segs. So you could
> > be using some arbitrary value from an untrusted source.
> >
> > If you really want to use it, you should test for SKB_GSO_DODGY
> > on the packet which will be set if gso_segs can't be trusted.
>
> Yep, you have a point that some sanity check should be added.
> I think a simple check would be enough not to crash CBQ
> as the accurate checking will be done in dev_hard_start_xmit or
> device drivers.

I updated the patch so that a fallback index is used to calculate the
transmission time when the index derived from gso_size exceeds the
size of the R_tab->data table; see the definition of L2T(). This is
intended just to avoid causing any trouble in CBQ with a broken
gso_size, which guests on the Xen hypervisor or elsewhere can possibly
set. I didn't come up with any better idea than this. What do you
think of it?

Thanks,
Hirokazu Takahashi.

--- linux-2.6.21/net/sched/sch_cbq.c.ORG	2007-05-14 20:53:06.000000000 +0900
+++ linux-2.6.21/net/sched/sch_cbq.c	2007-05-21 21:07:48.000000000 +0900
@@ -176,6 +176,7 @@ struct cbq_sched_data
 	struct cbq_class	*tx_class;
 	struct cbq_class	*tx_borrowed;
 	int			tx_len;
+	unsigned int		tx_segs;
 	psched_time_t		now;		/* Cached timestamp */
 	psched_time_t		now_rt;		/* Cached real time */
 	unsigned		pmask;
@@ -191,7 +192,15 @@ struct cbq_sched_data
 };
 
 
-#define L2T(cl,len)	((cl)->R_tab->data[(len)>>(cl)->R_tab->rate.cell_log])
+inline psched_tdiff_t
+L2T(struct cbq_class *cl, int len) {
+	int nent = sizeof(cl->R_tab->data)/sizeof(cl->R_tab->data[0]);
+	int index = len >> cl->R_tab->rate.cell_log;
+	if (index < nent)
+		return cl->R_tab->data[index];
+	else
+		return cl->R_tab->data[nent - 1] * (index/nent + 1);
+}
 
 static __inline__ unsigned cbq_hash(u32 h)
@@ -753,6 +762,7 @@ cbq_update(struct cbq_sched_data *q)
 	struct cbq_class *this = q->tx_class;
 	struct cbq_class *cl = this;
 	int len = q->tx_len;
+	unsigned int segs = q->tx_segs;
 
 	q->tx_class = NULL;
 
@@ -761,7 +771,7 @@ cbq_update(struct cbq_sched_data *q)
 		long idle;
 
 		cl->bstats.packets++;
-		cl->bstats.bytes += len;
+		cl->bstats.bytes += len*segs;
 
 		/* (now - last) is total time between packet right edges.
@@ -774,7 +784,7 @@ cbq_update(struct cbq_sched_data *q)
 		if ((unsigned long)idle > 128*1024*1024) {
 			avgidle = cl->maxidle;
 		} else {
-			idle -= L2T(cl, len);
+			idle -= L2T(cl, len) * segs;
 
 		/* true_avgidle := (1-W)*true_avgidle + W*idle,
 		   where W=2^{-ewma_log}. But cl->avgidle is scaled:
@@ -811,8 +821,8 @@ cbq_update(struct cbq_sched_data *q)
 		   to the moment of cbq_update)
 		 */
 
-		idle -= L2T(&q->link, len);
-		idle += L2T(cl, len);
+		idle -= L2T(&q->link, len) * segs;
+		idle += L2T(cl, len) * segs;
 
 		PSCHED_AUDIT_TDIFF(idle);
 
@@ -924,7 +934,9 @@ cbq_dequeue_prio(struct Qdisc *sch, int
 			cl->xstats.borrows += skb->len;
 #endif
 		}
-		q->tx_len = skb->len;
+		q->tx_segs = skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs ? :
+			skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+		q->tx_len = (skb->len - 1)/q->tx_segs + 1;
 
 		if (cl->deficit <= 0) {
 			q->active[prio] = cl;
@@ -1013,7 +1025,7 @@ cbq_dequeue(struct Qdisc *sch)
 			   cbq_time = max(real_time, work);
 			*/
 
-			incr2 = L2T(&q->link, q->tx_len);
+			incr2 = L2T(&q->link, q->tx_len) * q->tx_segs;
 			PSCHED_TADD(q->now, incr2);
 			cbq_update(q);
 			if ((incr -= incr2) < 0)

^ permalink raw reply	[flat|nested] 19+ messages in thread
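
The essential trick in the new L2T() above is that an index derived
from an untrusted, possibly bogus gso value can no longer run past the
end of the 256-entry rate table. Here is a standalone sketch of that
bounded lookup; the table is a toy stand-in for the real
qdisc_rate_table, filled with made-up linear values:

#include <stdio.h>

#define CELL_LOG 3
#define SLOTS    256

static long data[SLOTS];   /* cost per size bucket (filled linearly) */

static long l2t_bounded(unsigned int len)
{
	unsigned int index = len >> CELL_LOG;

	if (index < SLOTS)
		return data[index];
	/* extrapolate: charge the last slot once per table-sized span */
	return data[SLOTS - 1] * (index / SLOTS + 1);
}

int main(void)
{
	unsigned int i;

	for (i = 0; i < SLOTS; i++)
		data[i] = (long)(i + 1) << CELL_LOG;   /* 1 unit per byte */

	printf("1500 B  -> %ld units\n", l2t_bounded(1500));
	printf("65000 B -> %ld units (no out-of-bounds read)\n",
	       l2t_bounded(65000));
	return 0;
}

Lengths beyond the table are charged by extrapolating from the last
slot, which is approximate but keeps the lookup safe.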
* [PATCH 2/2] tbf scheduler: TSO support (update 3)
  2007-05-23 11:47                   ` [PATCH 1/2] tbf scheduler: TSO support (update 3) Hirokazu Takahashi
@ 2007-05-23 11:49                     ` Hirokazu Takahashi
  0 siblings, 0 replies; 19+ messages in thread
From: Hirokazu Takahashi @ 2007-05-23 11:49 UTC (permalink / raw)
  To: herbert; +Cc: shemminger, netdev, kaber, davem, linux-net

Hi,

This patch is much the same as the previous one; I just cleaned up
some code.

Thanks,
Hirokazu Takahashi.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

--- linux-2.6.21/net/sched/sch_tbf.c.ORG	2007-05-08 20:59:28.000000000 +0900
+++ linux-2.6.21/net/sched/sch_tbf.c	2007-05-21 21:05:18.000000000 +0900
@@ -9,7 +9,8 @@
  * Authors:	Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
  *		Dmitry Torokhov <dtor@mail.ru> - allow attaching inner qdiscs -
  *						 original idea by Martin Devera
- *
+ * Fixes:
+ * Hirokazu Takahashi <taka@valinux.co.jp> : TSO support
  */
 
 #include <linux/module.h>
@@ -138,8 +139,11 @@ static int tbf_enqueue(struct sk_buff *s
 {
 	struct tbf_sched_data *q = qdisc_priv(sch);
 	int ret;
+	unsigned int segs = skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs ? :
+		skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+	unsigned int len = (skb->len - 1)/segs + 1;
 
-	if (skb->len > q->max_size) {
+	if (len > q->max_size) {
 		sch->qstats.drops++;
 #ifdef CONFIG_NET_CLS_POLICE
 		if (sch->reshape_fail == NULL || sch->reshape_fail(skb, sch))
@@ -204,22 +208,40 @@ static struct sk_buff *tbf_dequeue(struc
 		psched_time_t now;
 		long toks, delay;
 		long ptoks = 0;
-		unsigned int len = skb->len;
+		/*
+		 * Note: TSO packets will be larger than its actual mtu.
+		 * These packets should be treated as packets including
+		 * several ordinary ones. In this case, tokens should
+		 * be held until it reaches the length of them.
+		 *
+		 * To simplify, we assume each segment in a TSO packet
+		 * has the same length though it may probably not be true.
+		 * And ignore the length of headers which will be applied
+		 * to each segment when splitting TSO packets.
+		 *
+		 * The number of segments are calculated from the segment
+		 * size of TSO packets temporarily if it isn't set.
+		 */
+		unsigned int segs = skb_shinfo(skb)->gso_segs ? :
+			skb_is_gso(skb) ? skb->len/skb_shinfo(skb)->gso_size + 1 : 1;
+		unsigned int len = (skb->len - 1)/segs + 1;
+		unsigned int expect = L2T(q, len) * segs;
+		long max_toks = max(expect, q->buffer);
 
 		PSCHED_GET_TIME(now);
 
-		toks = PSCHED_TDIFF_SAFE(now, q->t_c, q->buffer);
+		toks = PSCHED_TDIFF_SAFE(now, q->t_c, max_toks);
 
 		if (q->P_tab) {
 			ptoks = toks + q->ptokens;
-			if (ptoks > (long)q->mtu)
-				ptoks = q->mtu;
-			ptoks -= L2T_P(q, len);
+			if (ptoks > (long)(q->mtu * segs))
+				ptoks = q->mtu * segs;
+			ptoks -= L2T_P(q, len) * segs;
 		}
 		toks += q->tokens;
-		if (toks > (long)q->buffer)
-			toks = q->buffer;
-		toks -= L2T(q, len);
+		if (toks > max_toks)
+			toks = max_toks;
+		toks -= expect;
 
 		if ((toks|ptoks) >= 0) {
 			q->t_c = now;

^ permalink raw reply	[flat|nested] 19+ messages in thread
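
As a rough illustration of what the max(expect, q->buffer) cap buys,
here is a standalone sketch with made-up numbers; token arithmetic is
shown in bytes rather than the scheduler's internal time units. A TSO
packet that costs more than the configured burst is released once
enough tokens have accumulated, instead of being stuck behind a cap it
can never reach:

#include <stdio.h>

int main(void)
{
	const long rate_bytes_per_ms = 1250;   /* roughly 10 Mbit/s        */
	const long buffer = 10000;             /* configured burst (bytes) */
	const long expect = 65000;             /* cost of one TSO packet   */
	const long cap = expect > buffer ? expect : buffer;  /* max_toks   */

	long tokens = 0, ms;

	for (ms = 1; ; ms++) {
		tokens += rate_bytes_per_ms;   /* tokens earned this tick  */
		if (tokens > cap)
			tokens = cap;  /* the pre-patch code capped at buffer */
		if (tokens >= expect)
			break;
	}
	printf("TSO packet released after %ld ms; tokens left: %ld\n",
	       ms, tokens - expect);
	return 0;
}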
* [PATCH] tbf scheduler: TSO support
@ 2007-05-10 16:00 Hirokazu Takahashi
2007-05-10 16:01 ` Patrick McHardy
0 siblings, 1 reply; 19+ messages in thread
From: Hirokazu Takahashi @ 2007-05-10 16:00 UTC (permalink / raw)
To: netdev
Hi,

The TBF --- Simple Token Bucket Filter --- packet scheduler doesn't
work correctly when TSO is enabled: it slows down packet transmission.
TSO packets will be discarded since their size can be larger than
the scheduler expects. This doesn't cause serious problems, though,
because the retransmitted packets can still be passed.

So I made the scheduler allow TSO packets to pass:

 - tbf_enqueue() accepts packets of any size if the netdevice
   has TSO ability.

 - tbf_dequeue() can handle packets whose size is larger than
   the bucket, which keeps the tokens.
   Any packet, which may be a TSO packet, can be sent if the bucket
   is full of tokens. This may lead to the number of tokens in
   the bucket turning into a negative value, which is a kind of debt.
   But we don't have to mind it, because the bucket will be filled up
   with tokens in a short time and the count will turn positive
   again.

I'm not sure if this approach is the best. I appreciate any comments.
Thank you.
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
--- linux-2.6.21/net/sched/sch_tbf.c.ORG 2007-05-08 20:59:28.000000000 +0900
+++ linux-2.6.21/net/sched/sch_tbf.c 2007-05-10 16:12:54.000000000 +0900
@@ -139,7 +139,7 @@ static int tbf_enqueue(struct sk_buff *s
struct tbf_sched_data *q = qdisc_priv(sch);
int ret;
- if (skb->len > q->max_size) {
+ if (skb->len > q->max_size && !(sch->dev->features & NETIF_F_GSO_MASK)) {
sch->qstats.drops++;
#ifdef CONFIG_NET_CLS_POLICE
if (sch->reshape_fail == NULL || sch->reshape_fail(skb, sch))
@@ -197,6 +197,7 @@ static struct sk_buff *tbf_dequeue(struc
{
struct tbf_sched_data *q = qdisc_priv(sch);
struct sk_buff *skb;
+ int full = 0;
skb = q->qdisc->dequeue(q->qdisc);
@@ -204,7 +205,8 @@ static struct sk_buff *tbf_dequeue(struc
psched_time_t now;
long toks, delay;
long ptoks = 0;
- unsigned int len = skb->len;
+ unsigned int len;
+ int n;
PSCHED_GET_TIME(now);
@@ -212,16 +214,26 @@ static struct sk_buff *tbf_dequeue(struc
if (q->P_tab) {
ptoks = toks + q->ptokens;
- if (ptoks > (long)q->mtu)
+ if (ptoks >= (long)q->mtu) {
ptoks = q->mtu;
- ptoks -= L2T_P(q, len);
+ full = 1;
+ }
+ for (len = skb->len; len > 0; len -= n) {
+ n = min(q->max_size, len);
+ ptoks -= L2T_P(q, n);
+ }
}
toks += q->tokens;
- if (toks > (long)q->buffer)
+ if (toks >= (long)q->buffer) {
toks = q->buffer;
- toks -= L2T(q, len);
+ full = 1;
+ }
+ for (len = skb->len; len > 0; len -= n) {
+ n = min(q->max_size, len);
+ toks -= L2T(q, n);
+ }
- if ((toks|ptoks) >= 0) {
+ if ((toks|ptoks) >= 0 || full) {
q->t_c = now;
q->tokens = toks;
q->ptokens = ptoks;
^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] tbf scheduler: TSO support
  2007-05-10 16:00 [PATCH] tbf scheduler: TSO support Hirokazu Takahashi
@ 2007-05-10 16:01 ` Patrick McHardy
  0 siblings, 0 replies; 19+ messages in thread
From: Patrick McHardy @ 2007-05-10 16:01 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: netdev

Hirokazu Takahashi wrote:
> TBF --- Simple Token Bucket Filter --- packet scheduler doesn't
> work correctly with TSO on that it slows down to send out packets.
> TSO packets will be discarded since the size can be larger than
> the scheduler expects. But it won't cause serious problems
> because the retransmitted packets can be passed.


I already CCed netdev in my reply, so please use that thread instead
of starting a new one.

^ permalink raw reply	[flat|nested] 19+ messages in thread