* txqueuelen has wrong units; should be time
@ 2011-02-27  5:44 Albert Cahalan
  2011-02-27  7:02 ` Mikael Abrahamsson
  0 siblings, 1 reply; 43+ messages in thread
From: Albert Cahalan @ 2011-02-27  5:44 UTC (permalink / raw)
  To: linux-kernel, netdev

(thinking about the bufferbloat problem here)

Setting txqueuelen to some fixed number of packets
seems pretty broken if:

1. a link can vary in speed (802.11 especially)

2. a packet can vary in size (9 KiB jumbograms, etc.)

3. there is other weirdness (PPP compression, etc.)

It really needs to be set to some amount of time,
with the OS accounting for packets in terms of the
time it will take to transmit them. This would need
to account for physical-layer packet headers and
minimum spacing requirements.

I think it could also account for estimated congestion
on the local link, because that affects the rate at which
the queue can empty. An OS can directly observe this
on some types of hardware.

Nanoseconds seems fine; it's unlikely you'd ever want
more than 4.2 seconds (32-bit unsigned) of queue.

I guess there are at least 2 queues of interest, with the
second one being under control of the hardware driver.
Having the kernel split the max time as appropriate for
the hardware seems nicest.

^ permalink raw reply	[flat|nested] 43+ messages in thread
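The accounting proposed above can be sketched in user-space C. This is only an illustration under stated assumptions, not kernel code: tx_time_ns() and queue_admit() are hypothetical names, the link rate is taken as a known constant, and overhead_bytes stands in for whatever physical-layer framing and spacing the link imposes (for classic Ethernet that is 38 bytes: preamble and SFD 8, MAC header 14, FCS 4, inter-packet gap 12).

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel API: nanoseconds needed to put one
 * packet on the wire at rate_bps bits per second.  The multiplication
 * fits in 64 bits for any realistic packet size (up to several GB).
 */
static uint64_t tx_time_ns(uint32_t payload_bytes, uint32_t overhead_bytes,
			   uint64_t rate_bps)
{
	uint64_t bits = ((uint64_t)payload_bytes + overhead_bytes) * 8;

	/* Round up so the queue is never credited with less time than it holds. */
	return (bits * 1000000000ULL + rate_bps - 1) / rate_bps;
}

/*
 * Time-based admission (hypothetical): accept a packet only while the
 * total transmit time already queued stays within limit_ns.
 * Returns 1 if the packet was accounted for, 0 if it must be refused.
 */
static int queue_admit(uint64_t *queued_ns, uint64_t limit_ns, uint64_t pkt_ns)
{
	if (*queued_ns + pkt_ns > limit_ns)
		return 0;
	*queued_ns += pkt_ns;
	return 1;
}
```

With these definitions a 1500-byte packet costs 12 ms of queue at 1 Mbit/s but only about 12.3 us at 1 Gbit/s (framing included), which is the reason a fixed packet count behaves so differently across link speeds.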
* Re: txqueuelen has wrong units; should be time
  2011-02-27  5:44 txqueuelen has wrong units; should be time Albert Cahalan
@ 2011-02-27  7:02 ` Mikael Abrahamsson
  2011-02-27  7:54   ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Mikael Abrahamsson @ 2011-02-27  7:02 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: linux-kernel, netdev

On Sun, 27 Feb 2011, Albert Cahalan wrote:

> Nanoseconds seems fine; it's unlikely you'd ever want
> more than 4.2 seconds (32-bit unsigned) of queue.

I think this is shortsighted and I'm sure someone will come up with a case
where 4.2 seconds isn't enough. Let's not build in those kinds of
limitations from start.

Why not make it 64bit and go to picoseconds from start?

If you need to make it 32bit unsigned, I'd suggest to start from
microseconds instead. It's less likely someone would want less than a
microsecond of queue, than someone wanting more than 4.2 seconds of queue.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 43+ messages in thread
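The ranges being compared in the two messages above are easy to check. The helper below is a hypothetical sketch (max_seconds() is not an existing function) that computes how many whole seconds an unsigned counter of a given width can hold for a given tick unit.

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest whole number of seconds representable by
 * an unsigned counter of `bits` width (1..64) when one second is
 * `units_per_sec` ticks.
 */
static uint64_t max_seconds(unsigned int bits, uint64_t units_per_sec)
{
	uint64_t max = (bits >= 64) ? UINT64_MAX : ((1ULL << bits) - 1);

	return max / units_per_sec;
}
```

A 32-bit nanosecond counter tops out just under 4.3 seconds, a 32-bit microsecond counter at 4294 seconds (about 71 minutes), and a 64-bit picosecond counter at 18446744 seconds (about 213 days).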
* Re: txqueuelen has wrong units; should be time
  2011-02-27  7:02 ` Mikael Abrahamsson
@ 2011-02-27  7:54   ` Eric Dumazet
  2011-02-27  8:27     ` Albert Cahalan
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2011-02-27  7:54 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Albert Cahalan, linux-kernel, netdev

On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>
> > Nanoseconds seems fine; it's unlikely you'd ever want
> > more than 4.2 seconds (32-bit unsigned) of queue.
>
> I think this is shortsighted and I'm sure someone will come up with a case
> where 4.2 seconds isn't enough. Let's not build in those kinds of
> limitations from start.
>
> Why not make it 64bit and go to picoseconds from start?
>
> If you need to make it 32bit unsigned, I'd suggest to start from
> microseconds instead. It's less likely someone would want less than a
> microsecond of queue, than someone wanting more than 4.2 seconds of queue.
>

32 or 64 bits doesn't matter a lot. At Qdisc stage we have up to 40 bytes
available in skb->cb[] for our usage.

Problem is some machines have slow High Resolution timing services.

_If_ we have a time limit, it will probably use the low resolution (aka
jiffies), unless high resolution services are cheap.

I was thinking not having an absolute hard limit, but an EWMA based one.

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time
  2011-02-27  7:54   ` Eric Dumazet
@ 2011-02-27  8:27     ` Albert Cahalan
  2011-02-27 10:55       ` Jussi Kivilinna
  0 siblings, 1 reply; 43+ messages in thread
From: Albert Cahalan @ 2011-02-27  8:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Mikael Abrahamsson, linux-kernel, netdev

On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>>
>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> > more than 4.2 seconds (32-bit unsigned) of queue.
...
> Problem is some machines have slow High Resolution timing services.
>
> _If_ we have a time limit, it will probably use the low resolution (aka
> jiffies), unless high resolution services are cheap.

As long as that is totally internal to the kernel and never
getting exposed by some API for setting the amount, sure.

> I was thinking not having an absolute hard limit, but an EWMA based one.

The whole point is to prevent stale packets, especially to prevent
them from messing with TCP, so I really don't think so. I suppose
you do get this to some extent via early drop.

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time
  2011-02-27  8:27     ` Albert Cahalan
@ 2011-02-27 10:55       ` Jussi Kivilinna
  2011-02-27 20:07         ` Eric Dumazet
  2011-02-27 23:33         ` Albert Cahalan
  0 siblings, 2 replies; 43+ messages in thread
From: Jussi Kivilinna @ 2011-02-27 10:55 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1423 bytes --]

Quoting Albert Cahalan <acahalan@gmail.com>:

> On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
>>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>>>
>>> > Nanoseconds seems fine; it's unlikely you'd ever want
>>> > more than 4.2 seconds (32-bit unsigned) of queue.
> ...
>> Problem is some machines have slow High Resolution timing services.
>>
>> _If_ we have a time limit, it will probably use the low resolution (aka
>> jiffies), unless high resolution services are cheap.
>
> As long as that is totally internal to the kernel and never
> getting exposed by some API for setting the amount, sure.
>
>> I was thinking not having an absolute hard limit, but an EWMA based one.
>
> The whole point is to prevent stale packets, especially to prevent
> them from messing with TCP, so I really don't think so. I suppose
> you do get this to some extent via early drop.

I made a simple hack on sch_fifo with per-packet time limits
(attachment) this weekend and have been doing limited testing on a
wireless link. I think a hard limit is fine: it's simple and does much
the same as a packet(-hard)limited buffer does, dropping packets
when the buffer is 'full'. My hack checks for timed-out packets on
enqueue, which might be the wrong approach (on the other hand, it might
allow some more burstiness).

-Jussi

[-- Attachment #2: sch_fifo_to.c --]
[-- Type: text/x-csrc, Size: 6138 bytes --]

/*
 * sch_fifo_timeout.c	Simple FIFO queue with per packet timeout.
 *
 *	This program is free software; you can redistribute it and/or modify it under
 *	the terms of the GNU General Public License as published by the Free Software
 *	Foundation; either version 2 of the License, or (at your option) any later
 *	version.
 *
 */

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/pkt_sched.h>
#include <net/inet_ecn.h>

#define DEFAULT_TIMEOUT_PKT_MS	10
#define DEFAULT_TIMEOUT_PKT	PSCHED_NS2TICKS((u64)NSEC_PER_SEC * \
						DEFAULT_TIMEOUT_PKT_MS / 1000)

struct tc_fifo_timeout_qopt {
	__u64	timeout;	/* Max time packet may stay in buffer */
	__u32	limit;		/* Queue length: bytes for bfifo, packets for pfifo */
};

/* Per-packet state kept in the qdisc part of skb->cb */
struct fifo_timeout_skb_cb {
	psched_time_t	time_queued;
};

struct fifo_timeout_sched_data {
	psched_tdiff_t	timeout;
	u32		limit;
};

static inline struct fifo_timeout_skb_cb *fifo_timeout_skb_cb(struct sk_buff *skb)
{
	BUILD_BUG_ON(sizeof(skb->cb) <
		sizeof(struct qdisc_skb_cb) + sizeof(struct fifo_timeout_skb_cb));
	return (struct fifo_timeout_skb_cb *)qdisc_skb_cb(skb)->data;
}

/* Drop packets from the head of the queue until the head is younger than the timeout */
static void pfifo_timeout_drop_timedout_packets(struct Qdisc *sch, psched_time_t now)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;

check_next:
	skb = qdisc_peek_head(sch);
	if (likely(!skb))
		return;

	if (likely(fifo_timeout_skb_cb(skb)->time_queued + q->timeout > now))
		return;

	__qdisc_queue_drop_head(sch, &sch->q);
	sch->qstats.drops++;

	goto check_next;
}

static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);

	if (likely(skb_queue_len(&sch->q) < q->limit))
		return qdisc_enqueue_tail(skb, sch);

	/* queue full, remove one skb to fulfill the limit */
	__qdisc_queue_drop_head(sch, &sch->q);
	sch->qstats.drops++;
	qdisc_enqueue_tail(skb, sch);

	return NET_XMIT_CN;
}

static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);

	if (likely(sch->qstats.backlog + qdisc_pkt_len(skb) <= q->limit))
		return qdisc_enqueue_tail(skb, sch);

	return qdisc_reshape_fail(skb, sch);
}

static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);

	if (likely(skb_queue_len(&sch->q) < q->limit))
		return qdisc_enqueue_tail(skb, sch);

	return qdisc_reshape_fail(skb, sch);
}

/* The three enqueue wrappers below timestamp the new packet and purge stale ones first */
static int pfifo_timeout_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	psched_time_t now = psched_get_time();

	fifo_timeout_skb_cb(skb)->time_queued = now;
	pfifo_timeout_drop_timedout_packets(sch, now);

	return pfifo_tail_enqueue(skb, sch);
}

static int bfifo_timeout_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	psched_time_t now = psched_get_time();

	fifo_timeout_skb_cb(skb)->time_queued = now;
	pfifo_timeout_drop_timedout_packets(sch, now);

	return bfifo_enqueue(skb, sch);
}

static int pfifo_timeout_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	psched_time_t now = psched_get_time();

	fifo_timeout_skb_cb(skb)->time_queued = now;
	pfifo_timeout_drop_timedout_packets(sch, now);

	return pfifo_enqueue(skb, sch);
}

static int fifo_timeout_init(struct Qdisc *sch, struct nlattr *opt)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);

	if (opt == NULL) {
		u32 limit = qdisc_dev(sch)->tx_queue_len ? : 1;

		q->limit = limit;
		q->timeout = DEFAULT_TIMEOUT_PKT;
	} else {
		struct tc_fifo_timeout_qopt *ctl = nla_data(opt);

		if (nla_len(opt) < sizeof(*ctl))
			return -EINVAL;

		q->limit = ctl->limit;
		q->timeout = ctl->timeout ? : DEFAULT_TIMEOUT_PKT;
	}

	return 0;
}

static int fifo_timeout_dump(struct Qdisc *sch, struct sk_buff *skb)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);
	struct tc_fifo_timeout_qopt opt = { .limit = q->limit, .timeout = q->timeout };

	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
	return skb->len;

nla_put_failure:
	return -1;
}

static struct Qdisc_ops pfifo_timeout_qdisc_ops __read_mostly = {
	.id		=	"pfifo_timeout",
	.priv_size	=	sizeof(struct fifo_timeout_sched_data),
	.enqueue	=	pfifo_timeout_enqueue,
	.dequeue	=	qdisc_dequeue_head,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop,
	.init		=	fifo_timeout_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_timeout_init,
	.dump		=	fifo_timeout_dump,
	.owner		=	THIS_MODULE,
};

static struct Qdisc_ops bfifo_timeout_qdisc_ops __read_mostly = {
	.id		=	"bfifo_timeout",
	.priv_size	=	sizeof(struct fifo_timeout_sched_data),
	.enqueue	=	bfifo_timeout_enqueue,
	.dequeue	=	qdisc_dequeue_head,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop,
	.init		=	fifo_timeout_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_timeout_init,
	.dump		=	fifo_timeout_dump,
	.owner		=	THIS_MODULE,
};

static struct Qdisc_ops pfifo_head_drop_timeout_qdisc_ops __read_mostly = {
	.id		=	"pfifo_hd_tout",
	.priv_size	=	sizeof(struct fifo_timeout_sched_data),
	.enqueue	=	pfifo_timeout_tail_enqueue,
	.dequeue	=	qdisc_dequeue_head,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop_head,
	.init		=	fifo_timeout_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_timeout_init,
	.dump		=	fifo_timeout_dump,
	.owner		=	THIS_MODULE,
};

static int __init fifo_timeout_module_init(void)
{
	int retval;

	retval = register_qdisc(&pfifo_timeout_qdisc_ops);
	if (retval)
		goto cleanup;
	retval = register_qdisc(&bfifo_timeout_qdisc_ops);
	if (retval)
		goto cleanup;
	retval = register_qdisc(&pfifo_head_drop_timeout_qdisc_ops);
	if (retval)
		goto cleanup;

	return 0;

cleanup:
	unregister_qdisc(&pfifo_timeout_qdisc_ops);
	unregister_qdisc(&bfifo_timeout_qdisc_ops);
	unregister_qdisc(&pfifo_head_drop_timeout_qdisc_ops);
	return retval;
}

static void __exit fifo_timeout_module_exit(void)
{
	unregister_qdisc(&pfifo_timeout_qdisc_ops);
	unregister_qdisc(&bfifo_timeout_qdisc_ops);
	unregister_qdisc(&pfifo_head_drop_timeout_qdisc_ops);
}

module_init(fifo_timeout_module_init)
module_exit(fifo_timeout_module_exit)
MODULE_LICENSE("GPL");

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-27 10:55 ` Jussi Kivilinna @ 2011-02-27 20:07 ` Eric Dumazet 2011-02-27 21:32 ` Jussi Kivilinna ` (2 more replies) 2011-02-27 23:33 ` Albert Cahalan 1 sibling, 3 replies; 43+ messages in thread From: Eric Dumazet @ 2011-02-27 20:07 UTC (permalink / raw) To: Jussi Kivilinna; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev Le dimanche 27 février 2011 à 12:55 +0200, Jussi Kivilinna a écrit : > Quoting Albert Cahalan <acahalan@gmail.com>: > > > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson a écrit : > >>> On Sun, 27 Feb 2011, Albert Cahalan wrote: > >>> > >>> > Nanoseconds seems fine; it's unlikely you'd ever want > >>> > more than 4.2 seconds (32-bit unsigned) of queue. > > ... > >> Problem is some machines have slow High Resolution timing services. > >> > >> _If_ we have a time limit, it will probably use the low resolution (aka > >> jiffies), unless high resolution services are cheap. > > > > As long as that is totally internal to the kernel and never > > getting exposed by some API for setting the amount, sure. > > > >> I was thinking not having an absolute hard limit, but an EWMA based one. > > > > The whole point is to prevent stale packets, especially to prevent > > them from messing with TCP, so I really don't think so. I suppose > > you do get this to some extent via early drop. > > I made simple hack on sch_fifo with per packet time limits > (attachment) this weekend and have been doing limited testing on > wireless link. I think hardlimit is fine, it's simple and does > somewhat same as what packet(-hard)limited buffer does, drops packets > when buffer is 'full'. My hack checks for timed out packets on > enqueue, might be wrong approach (on other hand might allow some more > burstiness). > Qdisc should return to caller a good indication packet is queued or dropped at enqueue() time... 
not later (aka : never)

Accepting a packet at t0, and dropping it later at t0+limit without
giving any indication to caller is a problem.

This is why I suggested using an EWMA plus a probabilistic drop or
congestion indication (NET_XMIT_CN) to caller at enqueue() time.

The absolute time limit you are trying to implement should be checked at
dequeue time, to cope with enqueue bursts or pauses on wire.

^ permalink raw reply	[flat|nested] 43+ messages in thread
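The split suggested in the message above (a smoothed delay estimate consulted at enqueue, a hard age limit enforced at dequeue) can be modelled in a few lines of user-space C. This is a sketch under assumptions, not the kernel implementation: struct tfifo, tfifo_enqueue() and tfifo_dequeue() are hypothetical names, the EWMA weight of 1/8 is an arbitrary choice, and only the deterministic congestion-indication variant is shown (no probabilistic drop).

```c
#include <stddef.h>
#include <stdint.h>

#define QCAP		64	/* hypothetical fixed ring size for this sketch */
#define EWMA_SHIFT	3	/* each new sample gets weight 1/8 (assumed) */

enum verdict { ACCEPT, ACCEPT_CN, DROP };	/* CN mirrors NET_XMIT_CN */

struct pkt {
	uint64_t enq_ns;	/* enqueue timestamp */
	int id;
};

struct tfifo {
	struct pkt ring[QCAP];
	size_t head, len;
	uint64_t ewma_ns;	/* smoothed time packets spent queued */
	uint64_t soft_ns;	/* above this, report congestion at enqueue */
	uint64_t hard_ns;	/* packets older than this are dropped at dequeue */
};

/* The caller learns the outcome immediately, at enqueue time. */
static enum verdict tfifo_enqueue(struct tfifo *q, int id, uint64_t now)
{
	if (q->len == QCAP)
		return DROP;
	q->ring[(q->head + q->len) % QCAP] = (struct pkt){ now, id };
	q->len++;
	return q->ewma_ns > q->soft_ns ? ACCEPT_CN : ACCEPT;
}

/*
 * Returns the id of the next fresh packet, or -1 if none is left.
 * Stale packets are discarded here, so bursts at enqueue and pauses on
 * the wire are both covered; every dequeued packet updates the EWMA.
 */
static int tfifo_dequeue(struct tfifo *q, uint64_t now, unsigned int *dropped)
{
	while (q->len) {
		struct pkt p = q->ring[q->head];
		uint64_t sojourn = now - p.enq_ns;

		q->head = (q->head + 1) % QCAP;
		q->len--;
		q->ewma_ns = q->ewma_ns - (q->ewma_ns >> EWMA_SHIFT)
			     + (sojourn >> EWMA_SHIFT);
		if (sojourn <= q->hard_ns)
			return p.id;
		(*dropped)++;
	}
	return -1;
}
```

A caller would initialise the structure with something like `struct tfifo q = { .soft_ns = 1000, .hard_ns = 10000 };` and pass a monotonic clock reading as `now`. Once the average delay exceeds soft_ns, later senders see ACCEPT_CN right away instead of discovering a silent drop afterwards.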
* Re: txqueuelen has wrong units; should be time 2011-02-27 20:07 ` Eric Dumazet @ 2011-02-27 21:32 ` Jussi Kivilinna 2011-02-28 11:43 ` Jussi Kivilinna 2011-02-28 16:11 ` John W. Linville 2 siblings, 0 replies; 43+ messages in thread From: Jussi Kivilinna @ 2011-02-27 21:32 UTC (permalink / raw) To: Eric Dumazet; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev Quoting Eric Dumazet <eric.dumazet@gmail.com>: > Le dimanche 27 février 2011 à 12:55 +0200, Jussi Kivilinna a écrit : >> Quoting Albert Cahalan <acahalan@gmail.com>: >> >> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet >> <eric.dumazet@gmail.com> wrote: >> >> Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson a écrit : >> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote: >> >>> >> >>> > Nanoseconds seems fine; it's unlikely you'd ever want >> >>> > more than 4.2 seconds (32-bit unsigned) of queue. >> > ... >> >> Problem is some machines have slow High Resolution timing services. >> >> >> >> _If_ we have a time limit, it will probably use the low resolution (aka >> >> jiffies), unless high resolution services are cheap. >> > >> > As long as that is totally internal to the kernel and never >> > getting exposed by some API for setting the amount, sure. >> > >> >> I was thinking not having an absolute hard limit, but an EWMA based one. >> > >> > The whole point is to prevent stale packets, especially to prevent >> > them from messing with TCP, so I really don't think so. I suppose >> > you do get this to some extent via early drop. >> >> I made simple hack on sch_fifo with per packet time limits >> (attachment) this weekend and have been doing limited testing on >> wireless link. I think hardlimit is fine, it's simple and does >> somewhat same as what packet(-hard)limited buffer does, drops packets >> when buffer is 'full'. My hack checks for timed out packets on >> enqueue, might be wrong approach (on other hand might allow some more >> burstiness). 
>>
>
>
> Qdisc should return to caller a good indication packet is queued or
> dropped at enqueue() time... not later (aka : never)

Ok, it is an ugly hack ;) I got the idea of dropping the head from pfifo_head_drop.

>
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.

Ok.

>
> This is why I suggested using an EWMA plus a probabilistic drop or
> congestion indication (NET_XMIT_CN) to caller at enqueue() time.
>
> The absolute time limit you are trying to implement should be checked at
> dequeue time, to cope with enqueue bursts or pauses on wire.
>

Ok.

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-27 20:07 ` Eric Dumazet 2011-02-27 21:32 ` Jussi Kivilinna @ 2011-02-28 11:43 ` Jussi Kivilinna 2011-02-28 13:10 ` Eric Dumazet 2011-02-28 16:11 ` John W. Linville 2 siblings, 1 reply; 43+ messages in thread From: Jussi Kivilinna @ 2011-02-28 11:43 UTC (permalink / raw) To: Eric Dumazet; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev Quoting Eric Dumazet <eric.dumazet@gmail.com>: > Le dimanche 27 février 2011 à 12:55 +0200, Jussi Kivilinna a écrit : >> Quoting Albert Cahalan <acahalan@gmail.com>: >> >> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet >> <eric.dumazet@gmail.com> wrote: >> >> Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson a écrit : >> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote: >> >>> >> >>> > Nanoseconds seems fine; it's unlikely you'd ever want >> >>> > more than 4.2 seconds (32-bit unsigned) of queue. >> > ... >> >> Problem is some machines have slow High Resolution timing services. >> >> >> >> _If_ we have a time limit, it will probably use the low resolution (aka >> >> jiffies), unless high resolution services are cheap. >> > >> > As long as that is totally internal to the kernel and never >> > getting exposed by some API for setting the amount, sure. >> > >> >> I was thinking not having an absolute hard limit, but an EWMA based one. >> > >> > The whole point is to prevent stale packets, especially to prevent >> > them from messing with TCP, so I really don't think so. I suppose >> > you do get this to some extent via early drop. >> >> I made simple hack on sch_fifo with per packet time limits >> (attachment) this weekend and have been doing limited testing on >> wireless link. I think hardlimit is fine, it's simple and does >> somewhat same as what packet(-hard)limited buffer does, drops packets >> when buffer is 'full'. My hack checks for timed out packets on >> enqueue, might be wrong approach (on other hand might allow some more >> burstiness). 
>>
>
>
> Qdisc should return to caller a good indication packet is queued or
> dropped at enqueue() time... not later (aka : never)
>
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.
>
> This is why I suggested using an EWMA plus a probabilistic drop or
> congestion indication (NET_XMIT_CN) to caller at enqueue() time.
>
> The absolute time limit you are trying to implement should be checked at
> dequeue time, to cope with enqueue bursts or pauses on wire.
>

Would it be better to implement this as a generic feature instead of
qdisc specific? Have qdisc_enqueue_root do the ewma check (pseudocode):

static inline int qdisc_enqueue_root(struct sk_buff *skb, struct Qdisc *sch)
{
	int status, ret;

	qdisc_skb_cb(skb)->pkt_len = skb->len;

	if (likely(!sch->use_timeout)) {
ewma_ok:
		return qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
	}

	status = qdisc_check_ewma_status();
	if (status == ok)
		goto ewma_ok;
	if (status == overlimits)
		...drop...
	if (status == congestion) {
		ret = qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
		return (ret == success) ? NET_XMIT_CN : ret;
	}
}

And add qdisc_dequeue_root:

static inline struct sk_buff *qdisc_dequeue_root(struct Qdisc *sch)
{
	struct sk_buff *skb;

	skb = sch->dequeue(sch);

	if (skb && unlikely(sch->use_timeout))
		qdisc_update_ewma(skb);

	return skb;
}

Then the user could specify with tc whether any given qdisc uses the
timeout or not. Maybe go even as far as having some default timeout for
the default qdisc(?)

-Jussi

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-28 11:43 ` Jussi Kivilinna @ 2011-02-28 13:10 ` Eric Dumazet 2011-02-28 18:31 ` Jussi Kivilinna 0 siblings, 1 reply; 43+ messages in thread From: Eric Dumazet @ 2011-02-28 13:10 UTC (permalink / raw) To: Jussi Kivilinna; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev Le lundi 28 février 2011 à 13:43 +0200, Jussi Kivilinna a écrit : > Quoting Eric Dumazet <eric.dumazet@gmail.com>: > > > Le dimanche 27 février 2011 à 12:55 +0200, Jussi Kivilinna a écrit : > >> Quoting Albert Cahalan <acahalan@gmail.com>: > >> > >> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet > >> <eric.dumazet@gmail.com> wrote: > >> >> Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson a écrit : > >> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote: > >> >>> > >> >>> > Nanoseconds seems fine; it's unlikely you'd ever want > >> >>> > more than 4.2 seconds (32-bit unsigned) of queue. > >> > ... > >> >> Problem is some machines have slow High Resolution timing services. > >> >> > >> >> _If_ we have a time limit, it will probably use the low resolution (aka > >> >> jiffies), unless high resolution services are cheap. > >> > > >> > As long as that is totally internal to the kernel and never > >> > getting exposed by some API for setting the amount, sure. > >> > > >> >> I was thinking not having an absolute hard limit, but an EWMA based one. > >> > > >> > The whole point is to prevent stale packets, especially to prevent > >> > them from messing with TCP, so I really don't think so. I suppose > >> > you do get this to some extent via early drop. > >> > >> I made simple hack on sch_fifo with per packet time limits > >> (attachment) this weekend and have been doing limited testing on > >> wireless link. I think hardlimit is fine, it's simple and does > >> somewhat same as what packet(-hard)limited buffer does, drops packets > >> when buffer is 'full'. 
My hack checks for timed out packets on
> >> enqueue, might be wrong approach (on other hand might allow some more
> >> burstiness).
> >>
> >
> >
> > Qdisc should return to caller a good indication packet is queued or
> > dropped at enqueue() time... not later (aka : never)
> >
> > Accepting a packet at t0, and dropping it later at t0+limit without
> > giving any indication to caller is a problem.
> >
> > This is why I suggested using an EWMA plus a probabilistic drop or
> > congestion indication (NET_XMIT_CN) to caller at enqueue() time.
> >
> > The absolute time limit you are trying to implement should be checked at
> > dequeue time, to cope with enqueue bursts or pauses on wire.
> >
>
> Would it be better to implement this as generic feature instead of
> qdisc specific? Have qdisc_enqueue_root do ewma check:

Problem is you can have several virtual queues in a qdisc.

For example, pfifo_fast has 3 bands. You could have a global ewma with
high values, but you still want to let a high priority packet go
through...

^ permalink raw reply	[flat|nested] 43+ messages in thread
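The objection above can be made concrete with a per-band delay estimate. The sketch below is hypothetical user-space C (struct band_delay, band_update() and band_congested() are invented names, and the 1/8 EWMA weight is an assumption); it only shows why one average per band keeps a backlog in the lowest-priority band from penalising the highest-priority one.

```c
#include <stdint.h>

#define NBANDS		3	/* pfifo_fast has three bands */
#define EWMA_SHIFT	3	/* each new sample gets weight 1/8 (assumed) */

struct band_delay {
	uint64_t ewma_ns[NBANDS];	/* smoothed queueing delay per band */
	uint64_t limit_ns;		/* congestion threshold for every band */
};

/* Fold one observed queueing delay into the estimate for its band. */
static void band_update(struct band_delay *d, unsigned int band,
			uint64_t sojourn_ns)
{
	d->ewma_ns[band] = d->ewma_ns[band]
			   - (d->ewma_ns[band] >> EWMA_SHIFT)
			   + (sojourn_ns >> EWMA_SHIFT);
}

/* Only the packet's own band decides whether it is treated as congested. */
static int band_congested(const struct band_delay *d, unsigned int band)
{
	return d->ewma_ns[band] > d->limit_ns;
}
```

With a single global average, a long delay observed in band 2 would also push band 0 packets over the limit; with per-band state, band 0 stays uncongested until its own packets start to wait.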
* Re: txqueuelen has wrong units; should be time 2011-02-28 13:10 ` Eric Dumazet @ 2011-02-28 18:31 ` Jussi Kivilinna 0 siblings, 0 replies; 43+ messages in thread From: Jussi Kivilinna @ 2011-02-28 18:31 UTC (permalink / raw) To: Eric Dumazet; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev Quoting Eric Dumazet <eric.dumazet@gmail.com>: > Le lundi 28 février 2011 à 13:43 +0200, Jussi Kivilinna a écrit : >> Quoting Eric Dumazet <eric.dumazet@gmail.com>: >> >> > Le dimanche 27 février 2011 à 12:55 +0200, Jussi Kivilinna a écrit : >> >> Quoting Albert Cahalan <acahalan@gmail.com>: >> >> >> >> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet >> >> <eric.dumazet@gmail.com> wrote: >> >> >> Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson >> a écrit : >> >> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote: >> >> >>> >> >> >>> > Nanoseconds seems fine; it's unlikely you'd ever want >> >> >>> > more than 4.2 seconds (32-bit unsigned) of queue. >> >> > ... >> >> >> Problem is some machines have slow High Resolution timing services. >> >> >> >> >> >> _If_ we have a time limit, it will probably use the low >> resolution (aka >> >> >> jiffies), unless high resolution services are cheap. >> >> > >> >> > As long as that is totally internal to the kernel and never >> >> > getting exposed by some API for setting the amount, sure. >> >> > >> >> >> I was thinking not having an absolute hard limit, but an EWMA >> based one. >> >> > >> >> > The whole point is to prevent stale packets, especially to prevent >> >> > them from messing with TCP, so I really don't think so. I suppose >> >> > you do get this to some extent via early drop. >> >> >> >> I made simple hack on sch_fifo with per packet time limits >> >> (attachment) this weekend and have been doing limited testing on >> >> wireless link. I think hardlimit is fine, it's simple and does >> >> somewhat same as what packet(-hard)limited buffer does, drops packets >> >> when buffer is 'full'. 
My hack checks for timed out packets on
>> >> enqueue, might be wrong approach (on other hand might allow some more
>> >> burstiness).
>> >>
>> >
>> >
>> > Qdisc should return to caller a good indication packet is queued or
>> > dropped at enqueue() time... not later (aka : never)
>> >
>> > Accepting a packet at t0, and dropping it later at t0+limit without
>> > giving any indication to caller is a problem.
>> >
>> > This is why I suggested using an EWMA plus a probabilistic drop or
>> > congestion indication (NET_XMIT_CN) to caller at enqueue() time.
>> >
>> > The absolute time limit you are trying to implement should be checked at
>> > dequeue time, to cope with enqueue bursts or pauses on wire.
>> >
>>
>> Would it be better to implement this as generic feature instead of
>> qdisc specific? Have qdisc_enqueue_root do ewma check:
>
> Problem is you can have several virtual queues in a qdisc.
>
> For example, pfifo_fast has 3 bands. You could have a global ewma with
> high values, but you still want to let a high priority packet going
> through...
>

Ok. It would be better to have the ewma/timelimit at the leaf qdisc. (Or
have an in-the-middle qdisc, sch_timelimit, handling ewma/timelimit for
the leaf qdisc.)

-Jussi

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time
  2011-02-27 20:07         ` Eric Dumazet
  2011-02-27 21:32           ` Jussi Kivilinna
  2011-02-28 11:43           ` Jussi Kivilinna
@ 2011-02-28 16:11           ` John W. Linville
  2011-02-28 16:48             ` Eric Dumazet
  2 siblings, 1 reply; 43+ messages in thread
From: John W. Linville @ 2011-02-28 16:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson,
	linux-kernel, netdev

On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:

> Qdisc should return to caller a good indication packet is queued or
> dropped at enqueue() time... not later (aka : never)
>
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.

Can you elaborate on what problem this causes? Is it any worse than
if the packet is dropped at some later hop?

Is there any API that could report the drop to the sender (at
least a local one) without having to wait for the ack timeout?
Should there be?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-28 16:11 ` John W. Linville @ 2011-02-28 16:48 ` Eric Dumazet 2011-02-28 16:55 ` John W. Linville 0 siblings, 1 reply; 43+ messages in thread From: Eric Dumazet @ 2011-02-28 16:48 UTC (permalink / raw) To: John W. Linville Cc: Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev Le lundi 28 février 2011 à 11:11 -0500, John W. Linville a écrit : > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote: > > > Qdisc should return to caller a good indication packet is queued or > > dropped at enqueue() time... not later (aka : never) > > > > Accepting a packet at t0, and dropping it later at t0+limit without > > giving any indication to caller is a problem. > > Can you elaborate on what problem this causes? Is it any worse than > if the packet is dropped at some later hop? > > Is there any API that could report the drop to the sender (at > least a local one) without having to wait for the ack timeout? > Should there be? > Not all protocols have ACKS ;) dev_queue_xmit() returns an error code, some callers use it. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time
  2011-02-28 16:48             ` Eric Dumazet
@ 2011-02-28 16:55               ` John W. Linville
  2011-02-28 17:18                 ` Eric Dumazet
  2011-02-28 21:45                 ` John Heffner
  0 siblings, 2 replies; 43+ messages in thread
From: John W. Linville @ 2011-02-28 16:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson,
	linux-kernel, netdev

On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote:
> On Monday 28 February 2011 at 11:11 -0500, John W. Linville wrote:
> > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:
> >
> > > Qdisc should return to caller a good indication packet is queued or
> > > dropped at enqueue() time... not later (aka : never)
> > >
> > > Accepting a packet at t0, and dropping it later at t0+limit without
> > > giving any indication to caller is a problem.
> >
> > Can you elaborate on what problem this causes? Is it any worse than
> > if the packet is dropped at some later hop?
> >
> > Is there any API that could report the drop to the sender (at
> > least a local one) without having to wait for the ack timeout?
> > Should there be?
> >
>
> Not all protocols have ACKS ;)
>
> dev_queue_xmit() returns an error code, some callers use it.

Well, OK -- I agree it is best if you can return the status at
enqueue time. The question becomes whether or not a dropped frame
is worse than living with high latency. The answer, of course, still
seems to be a bit subjective. But, if the admin has determined that
a link should be low latency...?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-28 16:55 ` John W. Linville @ 2011-02-28 17:18 ` Eric Dumazet 2011-02-28 21:45 ` John Heffner 1 sibling, 0 replies; 43+ messages in thread From: Eric Dumazet @ 2011-02-28 17:18 UTC (permalink / raw) To: John W. Linville Cc: Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev Le lundi 28 février 2011 à 11:55 -0500, John W. Linville a écrit : > On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote: > > Le lundi 28 février 2011 à 11:11 -0500, John W. Linville a écrit : > > > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote: > > > > > > > Qdisc should return to caller a good indication packet is queued or > > > > dropped at enqueue() time... not later (aka : never) > > > > > > > > Accepting a packet at t0, and dropping it later at t0+limit without > > > > giving any indication to caller is a problem. > > > > > > Can you elaborate on what problem this causes? Is it any worse than > > > if the packet is dropped at some later hop? > > > > > > Is there any API that could report the drop to the sender (at > > > least a local one) without having to wait for the ack timeout? > > > Should there be? > > > > > > > Not all protocols have ACKS ;) > > > > dev_queue_xmit() returns an error code, some callers use it. > > Well, OK -- I agree it is best if you can return the status at > enqueue time. The question becomes whether or not a dropped frame > is worse than living with high latency. The answer, of course, still > seems to be a bit subjective. But, if the admin has determined that > a link should be low latency...? > If the latency problem could be solved by an admin choice, it probably would be there already. Point is qdisc layer is able to immediately return an error code to caller, if qdisc handlers properly done. This can help applications to immediately react to congestion notifications. 
Some applications, even those running on a "low latency link", can afford a long delay for their packets. Should we introduce a socket API to give the upper bound for the limit, or share a global 'per qdisc' limit? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-28 16:55 ` John W. Linville 2011-02-28 17:18 ` Eric Dumazet @ 2011-02-28 21:45 ` John Heffner 2011-03-01 4:11 ` Albert Cahalan 1 sibling, 1 reply; 43+ messages in thread From: John Heffner @ 2011-02-28 21:45 UTC (permalink / raw) To: John W. Linville Cc: Eric Dumazet, Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev On Mon, Feb 28, 2011 at 11:55 AM, John W. Linville <linville@tuxdriver.com> wrote: > On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote: >> Le lundi 28 février 2011 à 11:11 -0500, John W. Linville a écrit : >> > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote: >> > >> > > Qdisc should return to caller a good indication packet is queued or >> > > dropped at enqueue() time... not later (aka : never) >> > > >> > > Accepting a packet at t0, and dropping it later at t0+limit without >> > > giving any indication to caller is a problem. >> > >> > Can you elaborate on what problem this causes? Is it any worse than >> > if the packet is dropped at some later hop? >> > >> > Is there any API that could report the drop to the sender (at >> > least a local one) without having to wait for the ack timeout? >> > Should there be? >> > >> >> Not all protocols have ACKS ;) >> >> dev_queue_xmit() returns an error code, some callers use it. > > Well, OK -- I agree it is best if you can return the status at > enqueue time. The question becomes whether or not a dropped frame > is worse than living with high latency. The answer, of course, still > seems to be a bit subjective. But, if the admin has determined that > a link should be low latency...? Notably, TCP is one caller that uses the error code. The error code is functionally equivalent to ECN, one of whose great advantages is reducing delay jitter. If TCP didn't get the error, that would effectively double the latency for a full window of data, since the dropped segment would not be retransmitted for an RTT. 
-John ^ permalink raw reply [flat|nested] 43+ messages in thread
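Editorial sketch (not part of the thread): the point made above, that a queue should report "queued" or "dropped" at enqueue time so the caller can react at once, can be illustrated with a toy userspace model. Everything here is hypothetical: `toy_queue`, `toy_enqueue`, `toy_send` and the status codes are illustrative names, not kernel APIs, and halving a window on a local drop is only an assumed stand-in for whatever a real sender would do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes; they only mimic the idea of an enqueue
 * result and are not the kernel's NET_XMIT_* values. */
enum toy_xmit_status { TOY_XMIT_SUCCESS = 0, TOY_XMIT_DROP = 1 };

struct toy_queue {
	size_t len;	/* packets currently queued */
	size_t limit;	/* maximum number of packets allowed */
};

/* Decide at enqueue time and tell the caller immediately. */
enum toy_xmit_status toy_enqueue(struct toy_queue *q)
{
	assert(q != NULL);
	if (q->len >= q->limit)
		return TOY_XMIT_DROP;
	q->len++;
	return TOY_XMIT_SUCCESS;
}

/* Toy sender: on a local drop it halves its window right away instead
 * of waiting a round trip to discover the loss (assumed reaction). */
unsigned int toy_send(struct toy_queue *q, unsigned int cwnd)
{
	if (toy_enqueue(q) == TOY_XMIT_DROP && cwnd > 1)
		return cwnd / 2;
	return cwnd;
}
```

With a limit of two packets, the third `toy_enqueue()` call returns `TOY_XMIT_DROP` synchronously, which is the behaviour being argued for; a queue that accepted the packet and discarded it later would give the sender nothing to act on.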
* Re: txqueuelen has wrong units; should be time 2011-02-28 21:45 ` John Heffner @ 2011-03-01 4:11 ` Albert Cahalan 2011-03-01 4:18 ` David Miller 2011-03-01 5:01 ` Eric Dumazet 0 siblings, 2 replies; 43+ messages in thread From: Albert Cahalan @ 2011-03-01 4:11 UTC (permalink / raw) To: John Heffner Cc: John W. Linville, Eric Dumazet, Jussi Kivilinna, Mikael Abrahamsson, linux-kernel, netdev On Mon, Feb 28, 2011 at 4:45 PM, John Heffner <johnwheffner@gmail.com> wrote: > On Mon, Feb 28, 2011 at 11:55 AM, John W. Linville > <linville@tuxdriver.com> wrote: >> On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote: >>> Le lundi 28 février 2011 à 11:11 -0500, John W. Linville a écrit : >>> > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote: >>> > > Qdisc should return to caller a good indication packet is queued or >>> > > dropped at enqueue() time... not later (aka : never) >>> > > >>> > > Accepting a packet at t0, and dropping it later at t0+limit without >>> > > giving any indication to caller is a problem. >>> > >>> > Can you elaborate on what problem this causes? Is it any worse than >>> > if the packet is dropped at some later hop? >>> > >>> > Is there any API that could report the drop to the sender (at >>> > least a local one) without having to wait for the ack timeout? >>> > Should there be? >>> >>> Not all protocols have ACKS ;) >>> >>> dev_queue_xmit() returns an error code, some callers use it. >> >> Well, OK -- I agree it is best if you can return the status at >> enqueue time. The question becomes whether or not a dropped frame >> is worse than living with high latency. The answer, of course, still >> seems to be a bit subjective. But, if the admin has determined that >> a link should be low latency...? > > Notably, TCP is one caller that uses the error code. The error code > is functionally equivalent to ECN, one of whose great advantages is > reducing delay jitter. 
If TCP didn't get the error, that would > effectively double the latency for a full window of data, since the > dropped segment would not be retransmitted for an RTT. It sounds like you need a callback or similar, so that TCP can be informed later that the drop has occurred. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-01 4:11 ` Albert Cahalan @ 2011-03-01 4:18 ` David Miller 2011-03-01 6:54 ` Albert Cahalan 2011-03-01 5:01 ` Eric Dumazet 1 sibling, 1 reply; 43+ messages in thread From: David Miller @ 2011-03-01 4:18 UTC (permalink / raw) To: acahalan Cc: johnwheffner, linville, eric.dumazet, jussi.kivilinna, swmike, linux-kernel, netdev From: Albert Cahalan <acahalan@gmail.com> Date: Mon, 28 Feb 2011 23:11:13 -0500 > It sounds like you need a callback or similar, so that TCP can be > informed later that the drop has occurred. By that point we could have already sent an entire RTT's worth of data, or more. It needs to be synchronous, otherwise performance suffers. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-01 4:18 ` David Miller @ 2011-03-01 6:54 ` Albert Cahalan 2011-03-01 7:25 ` David Miller 2011-03-01 7:26 ` Eric Dumazet 0 siblings, 2 replies; 43+ messages in thread From: Albert Cahalan @ 2011-03-01 6:54 UTC (permalink / raw) To: David Miller Cc: johnwheffner, linville, eric.dumazet, jussi.kivilinna, swmike, linux-kernel, netdev On Mon, Feb 28, 2011 at 11:18 PM, David Miller <davem@davemloft.net> wrote: > From: Albert Cahalan <acahalan@gmail.com> > Date: Mon, 28 Feb 2011 23:11:13 -0500 > >> It sounds like you need a callback or similar, so that TCP can be >> informed later that the drop has occurred. > > By that point we could have already sent an entire RTT's worth > of data, or more. > > It needs to be synchronous, otherwise performance suffers. Ouch. OTOH, the current situation: performance suffers. In case it makes you feel any better, consider two cases where synchronous feedback is already impossible. One is when you're routing packets that merely pass through. The other is when some other box is doing that to you. Either way, packets go bye-bye and nobody tells TCP. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-01 6:54 ` Albert Cahalan @ 2011-03-01 7:25 ` David Miller 2011-03-01 7:26 ` Eric Dumazet 1 sibling, 0 replies; 43+ messages in thread From: David Miller @ 2011-03-01 7:25 UTC (permalink / raw) To: acahalan Cc: johnwheffner, linville, eric.dumazet, jussi.kivilinna, swmike, linux-kernel, netdev From: Albert Cahalan <acahalan@gmail.com> Date: Tue, 1 Mar 2011 01:54:09 -0500 > In case it makes you feel any better, consider two cases > where synchronous feedback is already impossible. > One is when you're routing packets that merely pass through. > The other is when some other box is doing that to you. > Either way, packets go bye-bye and nobody tells TCP. I consider ECN quite synchronous, and routers will set ECN bits to propagate congestion information when they do or are about to drop packets. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-01 6:54 ` Albert Cahalan 2011-03-01 7:25 ` David Miller @ 2011-03-01 7:26 ` Eric Dumazet 2011-03-01 19:37 ` Albert Cahalan 2011-03-02 3:10 ` Mikael Abrahamsson 1 sibling, 2 replies; 43+ messages in thread From: Eric Dumazet @ 2011-03-01 7:26 UTC (permalink / raw) To: Albert Cahalan Cc: David Miller, johnwheffner, linville, jussi.kivilinna, swmike, linux-kernel, netdev On Tuesday, 01 March 2011 at 01:54 -0500, Albert Cahalan wrote: > On Mon, Feb 28, 2011 at 11:18 PM, David Miller <davem@davemloft.net> wrote: > > From: Albert Cahalan <acahalan@gmail.com> > > Date: Mon, 28 Feb 2011 23:11:13 -0500 > > > >> It sounds like you need a callback or similar, so that TCP can be > >> informed later that the drop has occurred. > > > > By that point we could have already sent an entire RTT's worth > > of data, or more. > > > > It needs to be synchronous, otherwise performance suffers. > > Ouch. OTOH, the current situation: performance suffers. > > In case it makes you feel any better, consider two cases > where synchronous feedback is already impossible. > One is when you're routing packets that merely pass through. > The other is when some other box is doing that to you. > Either way, packets go bye-bye and nobody tells TCP. So in a hurry we decide to drop packets blindly because the kernel took the CPU to perform an urgent task? Bufferbloat is a configuration/tuning problem, not an "everything must be redone" problem. We add new qdiscs (CHOKe, SFB, QFQ, ...) and let admins do their job. Problem is most admins are unaware of the problems, and only buy more bandwidth. And no, there is no "generic" solution, unless you have a lab with two machines back to back (private link) and a known workload. We might need some changes (including new APIs). ECN is a step forward. Blindly dropping packets before ever sending them is a step backward. We should allow some traffic spikes, or many applications will stop working. 
Unless all applications are fixed, we are stuck. Only if the queue stays loaded for a long time (yet another parameter) can we try to drop packets. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-01 7:26 ` Eric Dumazet @ 2011-03-01 19:37 ` Albert Cahalan 2011-03-01 20:14 ` Eric Dumazet 2011-03-02 3:10 ` Mikael Abrahamsson 1 sibling, 1 reply; 43+ messages in thread From: Albert Cahalan @ 2011-03-01 19:37 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, johnwheffner, linville, jussi.kivilinna, swmike, linux-kernel, netdev On Tue, Mar 1, 2011 at 2:26 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le mardi 01 mars 2011 à 01:54 -0500, Albert Cahalan a écrit : >> On Mon, Feb 28, 2011 at 11:18 PM, David Miller <davem@davemloft.net> wrote: >> > From: Albert Cahalan <acahalan@gmail.com> >> >> It sounds like you need a callback or similar, so that TCP can be >> >> informed later that the drop has occurred. >> > >> > By that point we could have already sent an entire RTT's worth >> > of data, or more. >> > >> > It needs to be synchronous, otherwise performance suffers. >> >> Ouch. OTOH, the current situation: performance suffers. >> >> In case it makes you feel any better, consider two cases >> where synchronous feedback is already impossible. >> One is when you're routing packets that merely pass through. >> The other is when some other box is doing that to you. >> Either way, packets go bye-bye and nobody tells TCP. > > So in a hurry we decide to drop packets blindly because kernel took the > cpu to perform an urgent task ? Yes. If the system can't handle the load, it needs to fess up. > Bufferbloat is a configuration/tuning problem, not a "everything must be > redone" problem. We add new qdiscs (CHOKe, SFB, QFQ, ...) and let admins > do their job. Problem is most admins are unaware of the problems, and > only buy more bandwidth. We could at least do as well as Windows. >:-) You can not expect some random Linux user to tune things every time the link changes speed or the app mix changes. 
What person NOT ON THIS MAILING LIST is going to mess with their qdisc when they connect to a new access point or switch from running Skype to running Netflix? Heck, how many have any awareness of what a qdisc even is? Linux networking needs to be excellent for people with no clue. > We might need some changes (including new APIs). If an app can't specify latency, adding the ability could be nice. Still, stuff needs to JUST WORK more of the time. > ECN is a forward step. Blindly dropping packets before ever sending them > is a step backward. Last I knew, ECN defaulted to a setting of "2" which means it is only used in response. Perhaps it's time to change that. It's been a while, with defective firewalls being replaced by faster hardware. > We should allow some trafic spikes, or many applications will stop > working. Unless all applications are fixed, we are stuck. Such applications would stop working... 1. across a switch 2. across an older router We certainly should allow some traffic spikes. 1 to 10 ms of traffic ought to do nicely. Hundreds or thousands of ms is getting way beyond "spike". ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-01 19:37 ` Albert Cahalan @ 2011-03-01 20:14 ` Eric Dumazet 2011-03-01 20:16 ` Eric Dumazet 0 siblings, 1 reply; 43+ messages in thread From: Eric Dumazet @ 2011-03-01 20:14 UTC (permalink / raw) To: Albert Cahalan Cc: David Miller, johnwheffner, linville, jussi.kivilinna, swmike, linux-kernel, netdev Le mardi 01 mars 2011 à 14:37 -0500, Albert Cahalan a écrit : > On Tue, Mar 1, 2011 at 2:26 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > Le mardi 01 mars 2011 à 01:54 -0500, Albert Cahalan a écrit : > >> On Mon, Feb 28, 2011 at 11:18 PM, David Miller <davem@davemloft.net> wrote: > >> > From: Albert Cahalan <acahalan@gmail.com> > > >> >> It sounds like you need a callback or similar, so that TCP can be > >> >> informed later that the drop has occurred. > >> > > >> > By that point we could have already sent an entire RTT's worth > >> > of data, or more. > >> > > >> > It needs to be synchronous, otherwise performance suffers. > >> > >> Ouch. OTOH, the current situation: performance suffers. > >> > >> In case it makes you feel any better, consider two cases > >> where synchronous feedback is already impossible. > >> One is when you're routing packets that merely pass through. > >> The other is when some other box is doing that to you. > >> Either way, packets go bye-bye and nobody tells TCP. > > > > So in a hurry we decide to drop packets blindly because kernel took the > > cpu to perform an urgent task ? > > Yes. If the system can't handle the load, it needs to fess up. > > > Bufferbloat is a configuration/tuning problem, not a "everything must be > > redone" problem. We add new qdiscs (CHOKe, SFB, QFQ, ...) and let admins > > do their job. Problem is most admins are unaware of the problems, and > > only buy more bandwidth. > > We could at least do as well as Windows. >:-) > > You can not expect some random Linux user to tune things > every time the link changes speed or the app mix changes. 
> What person NOT ON THIS MAILING LIST is going to mess > with their qdisc when they connect to a new access point > or switch from running Skype to running Netflix? Heck, how > many have any awareness of what a qdisk even is? Linux > networking needs to be excellent for people with no clue. > > > We might need some changes (including new APIs). > > If an app can't specify latency, adding the ability could > be nice. Still, stuff needs to JUST WORK more of the time. > > > ECN is a forward step. Blindly dropping packets before ever sending them > > is a step backward. > > Last I knew, ECN defaulted to a setting of "2" which means > it is only used in response. Perhaps it's time to change that. > It's been a while, with defective firewalls being replaced > by faster hardware. > > > We should allow some trafic spikes, or many applications will stop > > working. Unless all applications are fixed, we are stuck. > > Such applications would stop working... > > 1. across a switch > 2. across an older router > > We certainly should allow some traffic spikes. 1 to 10 ms of > traffic ought to do nicely. Hundreds or thousands of ms is > getting way beyond "spike". OK. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-01 20:14 ` Eric Dumazet @ 2011-03-01 20:16 ` Eric Dumazet 0 siblings, 0 replies; 43+ messages in thread From: Eric Dumazet @ 2011-03-01 20:16 UTC (permalink / raw) To: Albert Cahalan Cc: David Miller, johnwheffner, linville, jussi.kivilinna, swmike, linux-kernel, netdev On Tuesday, 01 March 2011 at 21:14 +0100, Eric Dumazet wrote: > On Tuesday, 01 March 2011 at 14:37 -0500, Albert Cahalan wrote: > > > > We certainly should allow some traffic spikes. 1 to 10 ms of > > traffic ought to do nicely. Hundreds or thousands of ms is > > getting way beyond "spike". > > OK. Hmm, user error, hit wrong button, sorry. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-01 7:26 ` Eric Dumazet 2011-03-01 19:37 ` Albert Cahalan @ 2011-03-02 3:10 ` Mikael Abrahamsson 2011-03-02 20:25 ` Chris Friesen 1 sibling, 1 reply; 43+ messages in thread From: Mikael Abrahamsson @ 2011-03-02 3:10 UTC (permalink / raw) To: Eric Dumazet Cc: Albert Cahalan, David Miller, johnwheffner, linville, jussi.kivilinna, linux-kernel, netdev On Tue, 1 Mar 2011, Eric Dumazet wrote: > We should allow some trafic spikes, or many applications will stop > working. Unless all applications are fixed, we are stuck. > > Only if the queue stay loaded a long time (yet another parameter) we can > try to drop packets. Are we talking forwarding of packets or originating them ourselves, or trying to use the same mechanism for both? In the case of routing a packet, I envision a WRED kind of behaviour is the most efficient. <http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/12stbwr.html> "QoS: Time-Based Thresholds for WRED and Queue Limit for the Cisco 12000 Series Router" You can set the drop probabilities in milliseconds. Unfortunately ECN isn't supported on this platform but on other platforms it can be configured and used instead of WRED dropping packets. For the case when we're ourselves originating the traffic (for instance to a wifi card with varying speed and jitter due to retransmits on the wifi layer), I think it's taking the too easy way out to use the same mechanisms (dropping packets or marking ECN for our own originated packets seems really weird), here we should be able to pushback information to the applications somehow and do prioritization between flows since we're sitting on all information ourselves including the application. 
For this case, I think there is something to be learnt from: <http://www.cisco.com/en/US/tech/tk39/tk824/technologies_tech_note09186a00800fbafc.shtml> Here you have the IP part and the ATM part, and you can limit the number of cells/packets sent to the ATM hardware at any given time (this queue is FIFO so no AQM when the packet has been sent here). We need the same here, to properly keep latency down and make AQM work, the hardware FIFO queue needs to be kept low. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 43+ messages in thread
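Editorial sketch (not part of the thread): the WRED-style behaviour with time-based thresholds mentioned above can be modelled as a drop probability that ramps linearly with the average queueing delay. This follows the textbook RED shape only; `toy_wred_drop_permille` and its parameters are hypothetical names, the use of an average delay in microseconds is an assumption, and nothing here reflects how any particular router implements it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Drop probability in per-mille (0..1000) for a given average delay:
 *   below min_us           -> never drop
 *   between min_us, max_us -> linear ramp from 0 up to max_p_permille
 *   at or above max_us     -> always drop
 */
unsigned int toy_wred_drop_permille(uint32_t avg_delay_us, uint32_t min_us,
				    uint32_t max_us,
				    unsigned int max_p_permille)
{
	assert(min_us < max_us);
	assert(max_p_permille <= 1000);

	if (avg_delay_us < min_us)
		return 0;
	if (avg_delay_us >= max_us)
		return 1000;

	/* 64-bit intermediate so the multiplication cannot overflow */
	return (unsigned int)((uint64_t)(avg_delay_us - min_us) *
			      max_p_permille / (max_us - min_us));
}
```

For example, with thresholds of 10 ms and 20 ms and a 10% ceiling, an average delay of 15 ms gives a 5% drop (or ECN mark) probability. Expressing the thresholds in time rather than packets is what keeps the behaviour stable when the link speed changes.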
* Re: txqueuelen has wrong units; should be time 2011-03-02 3:10 ` Mikael Abrahamsson @ 2011-03-02 20:25 ` Chris Friesen 0 siblings, 0 replies; 43+ messages in thread From: Chris Friesen @ 2011-03-02 20:25 UTC (permalink / raw) To: Mikael Abrahamsson Cc: Eric Dumazet, Albert Cahalan, David Miller, johnwheffner, linville, jussi.kivilinna, linux-kernel, netdev On 03/01/2011 09:10 PM, Mikael Abrahamsson wrote: > For the case when we're ourselves originating the traffic (for instance to > a wifi card with varying speed and jitter due to retransmits on the wifi > layer), I think it's taking the too easy way out to use the same > mechanisms (dropping packets or marking ECN for our own originated packets > seems really weird), here we should be able to pushback information to the > applications somehow and do prioritization between flows since we're > sitting on all information ourselves including the application. Doesn't the socket tx buffer give all the app pushback necessary? (Assuming it's set to a sane value.) We should certainly do prioritization between flows. Perhaps if no other information is available the scheduler priority could be used? Chris -- Chris Friesen Software Developer GENBAND chris.friesen@genband.com www.genband.com ^ permalink raw reply [flat|nested] 43+ messages in thread
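Editorial sketch (not part of the thread): the "socket tx buffer" pushback mentioned above is controlled per socket with the standard `SO_SNDBUF` option. The helper name `toy_limit_sndbuf` is hypothetical; `setsockopt()` and `getsockopt()` are the usual POSIX calls. The value read back may differ from the one requested (Linux, for instance, documents that it doubles the requested size for bookkeeping), so the sketch returns what the kernel reports rather than assuming a value.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Request a send buffer of 'bytes' on socket 'fd' and return the size
 * the kernel reports afterwards, or -1 on error.
 */
int toy_limit_sndbuf(int fd, int bytes)
{
	int effective = 0;
	socklen_t len = sizeof(effective);

	assert(bytes > 0);
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &effective, &len) < 0)
		return -1;
	return effective;
}
```

Once the buffer is full, a blocking `send()` stalls and a non-blocking one fails with `EAGAIN`, which is the application-level pushback in question. Whether that alone is enough depends on the buffer being small relative to the queues below it.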
* Re: txqueuelen has wrong units; should be time 2011-03-01 4:11 ` Albert Cahalan 2011-03-01 4:18 ` David Miller @ 2011-03-01 5:01 ` Eric Dumazet 2011-03-01 5:36 ` Eric Dumazet 1 sibling, 1 reply; 43+ messages in thread From: Eric Dumazet @ 2011-03-01 5:01 UTC (permalink / raw) To: Albert Cahalan Cc: John Heffner, John W. Linville, Jussi Kivilinna, Mikael Abrahamsson, linux-kernel, netdev On Monday, 28 February 2011 at 23:11 -0500, Albert Cahalan wrote: > It sounds like you need a callback or similar, so that TCP can be > informed later that the drop has occurred. There is the thing called skb destructor / skb_orphan() mess, that is not stackable... Might extend this to something more clever, and be able to call functions (into TCP stack for example) giving a status of skb : Sent, or dropped somewhere in the stack... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-01 5:01 ` Eric Dumazet @ 2011-03-01 5:36 ` Eric Dumazet 0 siblings, 0 replies; 43+ messages in thread From: Eric Dumazet @ 2011-03-01 5:36 UTC (permalink / raw) To: Albert Cahalan Cc: John Heffner, John W. Linville, Jussi Kivilinna, Mikael Abrahamsson, linux-kernel, netdev On Tuesday, 01 March 2011 at 06:01 +0100, Eric Dumazet wrote: > On Monday, 28 February 2011 at 23:11 -0500, Albert Cahalan wrote: > > > It sounds like you need a callback or similar, so that TCP can be > > informed later that the drop has occurred. > > There is the thing called skb destructor / skb_orphan() mess, that is > not stackable... Might extend this to something more clever, and be able > to call functions (into TCP stack for example) giving a status of skb : > Sent, or dropped somewhere in the stack... > One problem with such a scheme is the huge extra cost involved: extra locking, extra memory allocations, extra atomic operations... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-27 10:55 ` Jussi Kivilinna 2011-02-27 20:07 ` Eric Dumazet @ 2011-02-27 23:33 ` Albert Cahalan 2011-02-28 11:23 ` Jussi Kivilinna 2011-02-28 15:38 ` txqueuelen has wrong units; should be time Hagen Paul Pfeifer 1 sibling, 2 replies; 43+ messages in thread From: Albert Cahalan @ 2011-02-27 23:33 UTC (permalink / raw) To: Jussi Kivilinna; +Cc: Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev On Sun, Feb 27, 2011 at 5:55 AM, Jussi Kivilinna <jussi.kivilinna@mbnet.fi> wrote: > I made simple hack on sch_fifo with per packet time limits (attachment) this > weekend and have been doing limited testing on wireless link. I think > hardlimit is fine, it's simple and does somewhat same as what > packet(-hard)limited buffer does, drops packets when buffer is 'full'. My > hack checks for timed out packets on enqueue, might be wrong approach (on > other hand might allow some more burstiness). Thanks! I think the default is too high. 1 ms may even be a bit high. I suppose there is a need to allow at least 2 packets despite any time limits, so that it remains possible to use a traditional modem even if a huge packet takes several seconds to send. ^ permalink raw reply [flat|nested] 43+ messages in thread
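Editorial sketch (not part of the thread): the idea of a queue limit expressed in time, with a floor of two packets so that slow links still work, can be modelled in a few lines of userspace C. All names (`toy_timed_queue`, `toy_tx_time_ns`, `toy_timed_enqueue`) are hypothetical, the link rate is assumed to be known and constant, and physical-layer overhead is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MIN_PACKETS 2u	/* always allow this many, whatever the time */

struct toy_timed_queue {
	uint64_t queued_ns;	/* estimated time needed to drain the queue */
	uint64_t limit_ns;	/* configured time budget */
	unsigned int packets;	/* packets currently queued */
};

/* Serialization time of 'bytes' at 'rate_bps', in nanoseconds.
 * Fine for realistic packet sizes; huge sizes would overflow. */
uint64_t toy_tx_time_ns(uint32_t bytes, uint64_t rate_bps)
{
	assert(rate_bps > 0);
	return (uint64_t)bytes * 8u * 1000000000ull / rate_bps;
}

/* Returns 1 if the packet is accepted, 0 if it should be dropped. */
int toy_timed_enqueue(struct toy_timed_queue *q, uint32_t bytes,
		      uint64_t rate_bps)
{
	uint64_t t = toy_tx_time_ns(bytes, rate_bps);

	if (q->packets >= TOY_MIN_PACKETS && q->queued_ns + t > q->limit_ns)
		return 0;
	q->queued_ns += t;
	q->packets++;
	return 1;
}
```

With a 1 ms budget, a 1 Gbit/s link admits 83 packets of 1500 bytes (12 us each), while a 56 kbit/s modem link, where one such packet takes about 214 ms, still admits exactly two because of the floor.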
* Re: txqueuelen has wrong units; should be time 2011-02-27 23:33 ` Albert Cahalan @ 2011-02-28 11:23 ` Jussi Kivilinna 2011-03-02 21:54 ` [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency John W. Linville 2011-02-28 15:38 ` txqueuelen has wrong units; should be time Hagen Paul Pfeifer 1 sibling, 1 reply; 43+ messages in thread From: Jussi Kivilinna @ 2011-02-28 11:23 UTC (permalink / raw) To: Albert Cahalan; +Cc: Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev [-- Attachment #1: Type: text/plain, Size: 1170 bytes --] Quoting Albert Cahalan <acahalan@gmail.com>: > On Sun, Feb 27, 2011 at 5:55 AM, Jussi Kivilinna > <jussi.kivilinna@mbnet.fi> wrote: > >> I made simple hack on sch_fifo with per packet time limits (attachment) this >> weekend and have been doing limited testing on wireless link. I think >> hardlimit is fine, it's simple and does somewhat same as what >> packet(-hard)limited buffer does, drops packets when buffer is 'full'. My >> hack checks for timed out packets on enqueue, might be wrong approach (on >> other hand might allow some more burstiness). > > Thanks! > > I think the default is too high. 1 ms may even be a bit high. Well, with 10ms buffer timeout latency goes to 10-20ms on 54Mbit wifi link (zd1211rw driver) from >500ms (ping rtt when iperf running same time). So for that it's good enough. > > I suppose there is a need to allow at least 2 packets despite any > time limits, so that it remains possible to use a traditional modem > even if a huge packet takes several seconds to send. > I made EWMA version of my fifo hack (attached). I added minimum 2 packet queue limit and probabilistic 1% ECN marking/dropping for timeout/2. -Jussi [-- Attachment #2: sch_fifo_ewma.c --] [-- Type: text/x-csrc, Size: 7809 bytes --] /* * sch_fifo_ewma.c Simple FIFO EWMA timelimit queue. 
 *
 * This program is free software; you can redistribute it and/or modify it under
 * the terms of the GNU General Public License as published by the Free Software
 * Foundation; either version 2 of the License, or (at your option) any later
 * version.
 *
 */

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/pkt_sched.h>
#include <net/inet_ecn.h>
#include <linux/version.h>

#if LINUX_VERSION_CODE <= KERNEL_VERSION(2, 6, 37)
#include "average.h"
#else
#include <linux/average.h>
#endif

#define DEFAULT_PKT_TIMEOUT_MS	10
#define DEFAULT_PKT_TIMEOUT	PSCHED_NS2TICKS(NSEC_PER_MSEC * \
						DEFAULT_PKT_TIMEOUT_MS)
#define DEFAULT_PROB_HALF_DROP	10 /* 1% */
#define FIFO_EWMA_MIN_QDISC_LEN	2

struct tc_fifo_ewma_qopt {
	__u64 timeout;	/* Max time packet may stay in buffer */
	__u32 limit;	/* Queue length: bytes for bfifo, packets for pfifo */
};

struct fifo_ewma_skb_cb {
	psched_time_t time_queued;
};

struct fifo_ewma_sched_data {
	psched_tdiff_t timeout;
	u32 limit;
	struct ewma ewma;
};

static inline struct fifo_ewma_skb_cb *fifo_ewma_skb_cb(struct sk_buff *skb)
{
	BUILD_BUG_ON(sizeof(skb->cb) <
		sizeof(struct qdisc_skb_cb) + sizeof(struct fifo_ewma_skb_cb));
	return (struct fifo_ewma_skb_cb *)qdisc_skb_cb(skb)->data;
}

static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (likely(skb_queue_len(&sch->q) < q->limit))
		return qdisc_enqueue_tail(skb, sch);

	/* queue full, remove one skb to fulfill the limit */
	__qdisc_queue_drop_head(sch, &sch->q);
	sch->qstats.drops++;
	qdisc_enqueue_tail(skb, sch);

	return NET_XMIT_CN;
}

static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (likely(sch->qstats.backlog + qdisc_pkt_len(skb) <= q->limit))
		return qdisc_enqueue_tail(skb, sch);

	return qdisc_reshape_fail(skb, sch);
}

static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (likely(skb_queue_len(&sch->q) < q->limit))
		return qdisc_enqueue_tail(skb, sch);

	return qdisc_reshape_fail(skb, sch);
}

static inline int fifo_get_prob(void)
{
	return (net_random() & 0xffff) * 1000 / 0xffff;
}

static struct sk_buff *fifo_ewma_dequeue(struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;
	psched_tdiff_t tdiff;

	if (likely(!q->timeout))
		goto no_ewma;

	skb = qdisc_peek_head(sch);
	if (!skb)
		return NULL;

	/* update EWMA */
	tdiff = psched_get_time() - fifo_ewma_skb_cb(skb)->time_queued;
	ewma_add(&q->ewma, tdiff);

no_ewma:
	return qdisc_dequeue_head(sch);
}

#define FIFO_EWMA_OK	0
#define FIFO_EWMA_DROP	1
#define FIFO_EWMA_CN	2

static int fifo_check_ewma_drop(struct sk_buff *skb, struct Qdisc *sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	unsigned long fifo_latency_avg;
	int ret = FIFO_EWMA_OK;

	if (likely(!q->timeout))
		goto no_ewma;

	/* lower limit */
	if (skb_queue_len(&sch->q) <= FIFO_EWMA_MIN_QDISC_LEN)
		goto no_drop;

	fifo_latency_avg = ewma_read(&q->ewma);

	/* hard drop */
	if (fifo_latency_avg > q->timeout) {
		/*printk(KERN_WARNING "fifo_ewma: hard drop\n");*/
		return FIFO_EWMA_DROP;
	}

	/* probabilistic drop */
	if (fifo_latency_avg > q->timeout / 2 &&
	    fifo_get_prob() < DEFAULT_PROB_HALF_DROP) {
		if (!INET_ECN_set_ce(skb)) {
			/*printk(KERN_WARNING "fifo_ewma: prob drop\n");*/
			return FIFO_EWMA_DROP;
		}

		/*printk(KERN_WARNING "fifo_ewma: prob mark\n");*/
		ret = FIFO_EWMA_CN;
	}

no_drop:
	fifo_ewma_skb_cb(skb)->time_queued = psched_get_time();
no_ewma:
	return ret;
}

static int pfifo_ewma_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	int ewma_action, ret;

	ewma_action = fifo_check_ewma_drop(skb, sch);
	if (unlikely(ewma_action == FIFO_EWMA_DROP))
		return qdisc_drop(skb, sch);

	ret = pfifo_tail_enqueue(skb, sch);
	if (unlikely(ret != NET_XMIT_SUCCESS))
		return ret;

	return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}

static int bfifo_ewma_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	int ewma_action, ret;

	ewma_action = fifo_check_ewma_drop(skb, sch);
	if (unlikely(ewma_action == FIFO_EWMA_DROP))
		return qdisc_drop(skb, sch);

	ret = bfifo_enqueue(skb, sch);
	if (unlikely(ret != NET_XMIT_SUCCESS))
		return ret;

	return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}

static int pfifo_ewma_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	int ewma_action, ret;

	ewma_action = fifo_check_ewma_drop(skb, sch);
	if (unlikely(ewma_action == FIFO_EWMA_DROP))
		return qdisc_drop(skb, sch);

	ret = pfifo_enqueue(skb, sch);
	if (unlikely(ret != NET_XMIT_SUCCESS))
		return ret;

	return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}

static int fifo_ewma_init(struct Qdisc *sch, struct nlattr *opt)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (opt == NULL) {
		u32 limit = qdisc_dev(sch)->tx_queue_len ? : 1;

		q->limit = limit;
		q->timeout = DEFAULT_PKT_TIMEOUT;
	} else {
		struct tc_fifo_ewma_qopt *ctl = nla_data(opt);

		if (nla_len(opt) < sizeof(*ctl))
			return -EINVAL;

		q->limit = ctl->limit;
		q->timeout = ctl->timeout ? : DEFAULT_PKT_TIMEOUT;
	}

	ewma_init(&q->ewma, 1, 64);

	return 0;
}

static int fifo_ewma_dump(struct Qdisc *sch, struct sk_buff *skb)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	struct tc_fifo_ewma_qopt opt = {
		.limit = q->limit,
		.timeout = q->timeout
	};

	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
	return skb->len;

nla_put_failure:
	return -1;
}

static struct Qdisc_ops pfifo_ewma_qdisc_ops __read_mostly = {
	.id		= "pfifo_ewma",
	.priv_size	= sizeof(struct fifo_ewma_sched_data),
	.enqueue	= pfifo_ewma_enqueue,
	.dequeue	= fifo_ewma_dequeue,
	.peek		= qdisc_peek_head,
	.drop		= qdisc_queue_drop,
	.init		= fifo_ewma_init,
	.reset		= qdisc_reset_queue,
	.change		= fifo_ewma_init,
	.dump		= fifo_ewma_dump,
	.owner		= THIS_MODULE,
};

static struct Qdisc_ops bfifo_ewma_qdisc_ops __read_mostly = {
	.id		= "bfifo_ewma",
	.priv_size	= sizeof(struct fifo_ewma_sched_data),
	.enqueue	= bfifo_ewma_enqueue,
	.dequeue	= fifo_ewma_dequeue,
	.peek		= qdisc_peek_head,
	.drop		= qdisc_queue_drop,
	.init		= fifo_ewma_init,
	.reset		= qdisc_reset_queue,
	.change		= fifo_ewma_init,
	.dump		= fifo_ewma_dump,
	.owner		= THIS_MODULE,
};

static struct Qdisc_ops pfifo_head_drop_ewma_qdisc_ops __read_mostly = {
	.id		= "pfifo_hd_ewma",
	.priv_size	= sizeof(struct fifo_ewma_sched_data),
	.enqueue	= pfifo_ewma_tail_enqueue,
	.dequeue	= fifo_ewma_dequeue,
	.peek		= qdisc_peek_head,
	.drop		= qdisc_queue_drop_head,
	.init		= fifo_ewma_init,
	.reset		= qdisc_reset_queue,
	.change		= fifo_ewma_init,
	.dump		= fifo_ewma_dump,
	.owner		= THIS_MODULE,
};

static int __init fifo_ewma_module_init(void)
{
	int retval;

	retval = register_qdisc(&pfifo_ewma_qdisc_ops);
	if (retval)
		goto cleanup;
	retval = register_qdisc(&bfifo_ewma_qdisc_ops);
	if (retval)
		goto cleanup;
	retval = register_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
	if (retval)
		goto cleanup;

	return 0;

cleanup:
	unregister_qdisc(&pfifo_ewma_qdisc_ops);
	unregister_qdisc(&bfifo_ewma_qdisc_ops);
	unregister_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
	return retval;
}

static void __exit fifo_ewma_module_exit(void)
{
	unregister_qdisc(&pfifo_ewma_qdisc_ops);
	unregister_qdisc(&bfifo_ewma_qdisc_ops);
	unregister_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
}

module_init(fifo_ewma_module_init)
module_exit(fifo_ewma_module_exit)
MODULE_LICENSE("GPL");

#include <linux/version.h>
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2, 6, 37)
#include "average.c"
#endif

^ permalink raw reply [flat|nested] 43+ messages in thread
* [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency 2011-02-28 11:23 ` Jussi Kivilinna @ 2011-03-02 21:54 ` John W. Linville 2011-03-02 22:08 ` John W. Linville 2011-03-03 12:51 ` Eric Dumazet 0 siblings, 2 replies; 43+ messages in thread From: John W. Linville @ 2011-03-02 21:54 UTC (permalink / raw) To: netdev; +Cc: bloat-devel, John W. Linville This is a qdisc based on the existing pfifo_fast code. The difference is that this qdisc limits the dequeue rate based on estimates of how many packets can be in-flight at a given time while maintaining a target link latency. This work is based on the eBDP documented in Section IV of "Buffer Sizing for 802.11 Based Networks" by Tianji Li, et al. http://www.hamilton.ie/tianji_li/buffersizing.pdf This implementation timestamps an skb as it dequeues it, then computes the service time when the frame is freed by the driver. An exponentially weighted moving average of per fragment service times is used to restrict queueing delays in hopes of achieving a target fragment transmission latency. The skb->deconstructor mechanism is abused in order to obtain packet service time estimates. Signed-off-by: John W. Linville <linville@tuxdriver.com> --- I took a whack at reimplementing my eBDP patch at the qdisc level. Unfortunately, it doesn't seem to work very well and I'm at a loss as to why... :-( Comments welcome -- maybe I'm doing something really stupid in the math and just can't see it. The skb->deconstructor abuse includes adding a union member in the skb to record the qdisc->handle on the way out so that it can be used for accounting in the deconstructor -- thanks to Neil Horman for the suggestion! The reason I think this is an idea worth exploring is that existing qdisc code doesn't seem to account for the fact that the devices could be doing a lot of queueing behind them. 
Even Jussi's recent sch_fifo_ewma post doesn't seem to take into account how long the device holds-on to packets, which limits his ability to fight latency. Anyway, all comments appreciated! include/linux/skbuff.h | 2 + include/net/pkt_sched.h | 1 + net/sched/sch_api.c | 1 + net/sched/sch_fifo.c | 131 +++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 135 insertions(+), 0 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index bf221d6..d99861e 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -296,6 +296,7 @@ typedef unsigned char *sk_buff_data_t; * @end: End pointer * @destructor: Destruct function * @mark: Generic packet mark + * @qdhandle: handle of leaf qdisc that handled skb * @nfct: Associated connection, if any * @ipvs_property: skbuff is owned by ipvs * @peeked: this packet has been seen already, so stats have been @@ -407,6 +408,7 @@ struct sk_buff { union { __u32 mark; __u32 dropcount; + __u32 qdhandle; }; __u16 vlan_tci; diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h index d9549af..93189f6 100644 --- a/include/net/pkt_sched.h +++ b/include/net/pkt_sched.h @@ -72,6 +72,7 @@ extern void qdisc_watchdog_cancel(struct qdisc_watchdog *wd); extern struct Qdisc_ops pfifo_qdisc_ops; extern struct Qdisc_ops bfifo_qdisc_ops; extern struct Qdisc_ops pfifo_head_drop_qdisc_ops; +extern struct Qdisc_ops pfifo_lat_qdisc_ops; extern int fifo_set_limit(struct Qdisc *q, unsigned int limit); extern struct Qdisc *fifo_create_dflt(struct Qdisc *sch, struct Qdisc_ops *ops, diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index b22ca2d..9c9ba9a 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -1769,6 +1769,7 @@ static int __init pktsched_init(void) register_qdisc(&pfifo_qdisc_ops); register_qdisc(&bfifo_qdisc_ops); register_qdisc(&pfifo_head_drop_qdisc_ops); + register_qdisc(&pfifo_lat_qdisc_ops); register_qdisc(&mq_qdisc_ops); rtnl_register(PF_UNSPEC, RTM_NEWQDISC, tc_modify_qdisc, 
NULL); diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c index d468b47..0d2cb48 100644 --- a/net/sched/sch_fifo.c +++ b/net/sched/sch_fifo.c @@ -15,6 +15,7 @@ #include <linux/kernel.h> #include <linux/errno.h> #include <linux/skbuff.h> +#include <linux/average.h> #include <net/pkt_sched.h> /* 1 band FIFO pseudo-"scheduler" */ @@ -24,6 +25,20 @@ struct fifo_sched_data u32 limit; }; +/* + * Private data for a pfifo_lat scheduler containing: + * - embedded fifo private data + * - EWMA of average skb service time for each band + * - count of currently in-flight skbs for each band + * - maximum in-flight skbs for each band + */ +struct pfifo_lat_data { + struct fifo_sched_data q; + struct ewma tserv; + unsigned int inflight; + unsigned int inflight_max; +}; + static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch) { struct fifo_sched_data *q = qdisc_priv(sch); @@ -59,6 +74,86 @@ static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch) return NET_XMIT_CN; } +static int pfifo_lat_enqueue(struct sk_buff *skb, struct Qdisc* sch) +{ + struct pfifo_lat_data *priv = qdisc_priv(sch); + + /* include inflight count when checking queue length limit */ + if (skb_queue_len(&sch->q) + priv->inflight < priv->q.limit) + return qdisc_enqueue_tail(skb, sch); + + return qdisc_reshape_fail(skb, sch); +} + +static void pfifo_lat_skb_free(struct sk_buff *skb) +{ + struct Qdisc *qdisc = qdisc_lookup(skb->dev, skb->qdhandle); + struct pfifo_lat_data *priv = qdisc_priv(qdisc); + unsigned int tserv_ns, inflight_mult; + + /* + * grab timestamp info for buffer control estimates and factor + * that into service time estimate for this queue + */ + ewma_add(&priv->tserv, + ktime_to_ns(ktime_sub(ktime_get(), skb->tstamp))); + tserv_ns = ewma_read(&priv->tserv); + if (tserv_ns) { + /* calculate multiplier between tserv and target latency */ + inflight_mult = 2 * NSEC_PER_MSEC / tserv_ns; + + /* + * use current inflight number as proxy for number of + * packets inflight when 
this packet was sent to + * hardware queue + */ + priv->inflight_max = + max_t(int, 2, priv->inflight * inflight_mult); + } + + priv->inflight--; +} + +static struct sk_buff *pfifo_lat_dequeue(struct Qdisc *qdisc) +{ + struct pfifo_lat_data *priv = qdisc_priv(qdisc); + struct sk_buff *skb; + + if (priv->inflight >= priv->inflight_max) + return NULL; + + skb = qdisc_dequeue_head(qdisc); + if (!skb) + return NULL; + + priv->inflight++; + + /* take ownership of skb and timestamp it */ + skb_orphan(skb); + skb->qdhandle = qdisc->handle; + skb->destructor = pfifo_lat_skb_free; + skb->dev = qdisc_dev(qdisc); /* do I need to set this? */ + skb->tstamp = ktime_get(); + + return skb; +} + +static void pfifo_lat_reset(struct Qdisc* qdisc) +{ + struct pfifo_lat_data *priv = qdisc_priv(qdisc); + + /* + * since fifo_sched_data is embedded at head of pfifo_lat_data, + * this should be OK to do... + */ + qdisc_reset_queue(qdisc); + + /* need to reset priv->tserv somehow? */ + + priv->inflight = 0; + priv->inflight_max = (typeof(priv->inflight_max))-1; +} + static int fifo_init(struct Qdisc *sch, struct nlattr *opt) { struct fifo_sched_data *q = qdisc_priv(sch); @@ -82,6 +177,30 @@ static int fifo_init(struct Qdisc *sch, struct nlattr *opt) return 0; } +static int pfifo_lat_init(struct Qdisc *qdisc, struct nlattr *opt) +{ + struct pfifo_lat_data *priv = qdisc_priv(qdisc); + int rc; + + /* + * since fifo_sched_data is embedded at head of pfifo_lat_data, + * this should be OK to do... + */ + rc = fifo_init(qdisc, opt); + if (rc) + return rc; + + /* initialize service time estimate */ + ewma_init(&priv->tserv, 1, 64); + + priv->inflight = 0; /* necessary to set this explicitly? */ + + /* initial inflight_max should be ??? 
*/ + priv->inflight_max = (typeof(priv->inflight_max))-1; + + return 0; +} + static int fifo_dump(struct Qdisc *sch, struct sk_buff *skb) { struct fifo_sched_data *q = qdisc_priv(sch); @@ -138,6 +257,18 @@ struct Qdisc_ops pfifo_head_drop_qdisc_ops __read_mostly = { .owner = THIS_MODULE, }; +struct Qdisc_ops pfifo_lat_qdisc_ops __read_mostly = { + .id = "pfifo_lat", + .priv_size = sizeof(struct pfifo_lat_data), + .enqueue = pfifo_lat_enqueue, + .dequeue = pfifo_lat_dequeue, + .peek = qdisc_peek_head, + .init = pfifo_lat_init, + .reset = pfifo_lat_reset, + .dump = fifo_dump, + .owner = THIS_MODULE, +}; + /* Pass size change message down to embedded FIFO */ int fifo_set_limit(struct Qdisc *q, unsigned int limit) { -- 1.7.4 ^ permalink raw reply related [flat|nested] 43+ messages in thread
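For readers following the arithmetic in pfifo_lat_skb_free() above, here is a minimal user-space sketch of the same calculation. It is not the kernel code: sketch_ewma, sketch_ewma_add() and sketch_inflight_max() are hypothetical names, and the averaging is a simplified stand-in for the kernel's struct ewma with weight 64. The 2 ms target matches the 2 * NSEC_PER_MSEC constant in the patch.

```c
#include <limits.h>
#include <stdint.h>

#define TARGET_LATENCY_NS 2000000u /* 2 ms, as in the patch */
#define EWMA_WEIGHT       64u      /* assumed to mirror ewma_init(..., 1, 64) */

/* Simplified stand-in for the kernel's struct ewma (assumption). */
struct sketch_ewma {
	uint64_t avg; /* 0 means "no sample yet" */
};

/* Fold one service-time sample (in ns) into the running average. */
static void sketch_ewma_add(struct sketch_ewma *e, uint64_t sample_ns)
{
	if (e->avg == 0)
		e->avg = sample_ns;
	else
		e->avg = (e->avg * (EWMA_WEIGHT - 1) + sample_ns) / EWMA_WEIGHT;
}

/*
 * Scale the current in-flight count by target/tserv, with a floor of 2.
 * The division is an integer division, as in the patch.
 */
static unsigned int sketch_inflight_max(uint64_t tserv_ns, unsigned int inflight)
{
	uint64_t mult, limit;

	if (tserv_ns == 0)
		return UINT_MAX; /* no estimate yet: effectively unlimited */

	mult = TARGET_LATENCY_NS / tserv_ns; /* truncates to 0 once tserv > 2 ms */
	limit = (uint64_t)inflight * mult;
	if (limit < 2)
		return 2;
	return limit > UINT_MAX ? UINT_MAX : (unsigned int)limit;
}
```

One thing the sketch makes visible: because the multiplier is an integer quotient, any average service time above the 2 ms target gives a multiplier of 0 and pins the limit at the floor of 2, while anything between 1 ms and 2 ms gives a multiplier of exactly 1. Whether that contributes to the behaviour described in the patch notes is not established here.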
* Re: [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency 2011-03-02 21:54 ` [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency John W. Linville @ 2011-03-02 22:08 ` John W. Linville 2011-03-03 12:51 ` Eric Dumazet 1 sibling, 0 replies; 43+ messages in thread From: John W. Linville @ 2011-03-02 22:08 UTC (permalink / raw) To: netdev; +Cc: bloat-devel On Wed, Mar 02, 2011 at 04:54:10PM -0500, John W. Linville wrote: > This is a qdisc based on the existing pfifo_fast code. The difference Well, it started that way. This is obviously based on the pfifo code instead... John -- John W. Linville <linville@tuxdriver.com> Someday the world will need a hero, and you might be all we have. Be ready. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency 2011-03-02 21:54 ` [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency John W. Linville 2011-03-02 22:08 ` John W. Linville @ 2011-03-03 12:51 ` Eric Dumazet 1 sibling, 0 replies; 43+ messages in thread From: Eric Dumazet @ 2011-03-03 12:51 UTC (permalink / raw) To: John W. Linville; +Cc: netdev, bloat-devel On Wednesday, 02 March 2011 at 16:54 -0500, John W. Linville wrote: > This is a qdisc based on the existing pfifo_fast code. The difference > is that this qdisc limits the dequeue rate based on estimates of how > many packets can be in-flight at a given time while maintaining a target > link latency. > > This work is based on the eBDP documented in Section IV of "Buffer > Sizing for 802.11 Based Networks" by Tianji Li, et al. > > http://www.hamilton.ie/tianji_li/buffersizing.pdf > > This implementation timestamps an skb as it dequeues it, then > computes the service time when the frame is freed by the driver. > An exponentially weighted moving average of per fragment service times > is used to restrict queueing delays in hopes of achieving a target > fragment transmission latency. The skb->deconstructor mechanism is > abused in order to obtain packet service time estimates. > > Signed-off-by: John W. Linville <linville@tuxdriver.com> > --- > I took a whack at reimplementing my eBDP patch at the qdisc level. > Unfortunately, it doesn't seem to work very well and I'm at a loss > as to why... :-( Comments welcome -- maybe I'm doing something really > stupid in the math and just can't see it. > > The skb->deconstructor abuse includes adding a union member in the skb > to record the qdisc->handle on the way out so that it can be used for > accounting in the deconstructor -- thanks to Neil Horman for the > suggestion! 
> > The reason I think this is an idea worth exploring is that existing > qdisc code doesn't seem to account for the fact that the devices could > be doing a lot of queueing behind them. Even Jussi's recent > sch_fifo_ewma post doesn't seem to take into account how long the device > holds-on to packets, which limits his ability to fight latency. > > Anyway, all comments appreciated! > > Well, there are many issues in your patch. The skb destructor cannot be used like that (think about locking, and the various contexts in which drivers actually free skbs: from interrupt, from softirq, or even _before_ sending data on the wire). qdisc_lookup(skb->dev, skb->qdhandle), for example, is only safe if run with RTNL held. It's not meant to be used in the fast path at all, only in management code. Being able to get feedback on when an skb is freed (with a notification of whether it was delivered or dropped) is a recurring idea, so we might design a stackable infrastructure. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-27 23:33 ` Albert Cahalan 2011-02-28 11:23 ` Jussi Kivilinna @ 2011-02-28 15:38 ` Hagen Paul Pfeifer 2011-02-28 16:37 ` Albert Cahalan 2011-02-28 17:20 ` Bill Sommerfeld 1 sibling, 2 replies; 43+ messages in thread From: Hagen Paul Pfeifer @ 2011-02-28 15:38 UTC (permalink / raw) To: Albert Cahalan Cc: Jussi Kivilinna, Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote: > I suppose there is a need to allow at least 2 packets despite any > time limits, so that it remains possible to use a traditional modem > even if a huge packet takes several seconds to send. That is a good point! We talk as if we knew every use case of Linux, but that is not true at all. One of my customers, for example, operates the Linux network stack on top of a proprietary MAC/driver where the current packet queue characteristic is just fine. The time-drop approach is unsuitable there because the bandwidth can vary over a great range (from 0 up to the max. bandwidth) within a short time. Sufficient buffering proves superior in this environment (only IPv{4,6}/UDP). Hagen ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-28 15:38 ` txqueuelen has wrong units; should be time Hagen Paul Pfeifer @ 2011-02-28 16:37 ` Albert Cahalan 2011-02-28 17:45 ` John W. Linville 2011-02-28 17:20 ` Bill Sommerfeld 1 sibling, 1 reply; 43+ messages in thread From: Albert Cahalan @ 2011-02-28 16:37 UTC (permalink / raw) To: Hagen Paul Pfeifer Cc: Jussi Kivilinna, Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev On Mon, Feb 28, 2011 at 10:38 AM, Hagen Paul Pfeifer <hagen@jauu.net> wrote: > On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote: > >> I suppose there is a need to allow at least 2 packets despite any >> time limits, so that it remains possible to use a traditional modem >> even if a huge packet takes several seconds to send. > > That is a good point! We talk about as we may know every use case of > Linux. But this is not true at all. One of my customer for example operates > the Linux network stack functionality on top of a proprietary MAC/Driver > where the current packet queue characteristic is just fine. The > time-drop-approach is unsuitable because the bandwidth can vary in a small > amount of time over a great range (0 till max. bandwidth). A sufficient > buffering shows up superior in this environment (only IPv{4,6}/UDP). I don't think the current non-time queue is just fine for him. I can see that time-based discard-on-enqueue would not be fine either. He needs time-based discard-on-dequeue. Good for him is probably: On dequeue, discard all packets that are too old. On enqueue, assume max bandwidth and discard all packets that have no hope of surviving the dequeue check. (the enqueue check is only to prevent wasting RAM) Exception: always keep at least 2 packets. Better is something that would allow random drop. The trouble here is that bandwidth varies greatly. Some sort of undelete functionality is needed...? 
Assuming the difficulty with implementing random drop is solvable, I think this would work for the rest of us too. Keeping the timeout really low is important because it isn't OK to eat up all the latency tolerance in one hop. You have an end-to-end budget of 20 ms for usable GUI rubber banding. The budget for gaming is about 80 ms and for VoIP about 150 ms. ^ permalink raw reply [flat|nested] 43+ messages in thread
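The discard rules proposed in the message above (drop stale packets on dequeue, refuse hopeless packets on enqueue, always keep at least 2) can be sketched in user-space C roughly as follows. This is only an illustration under stated assumptions: struct tq and the tq_* functions are hypothetical names, the queue is a fixed-size ring rather than an skb list, timestamps are passed in by the caller, and "keep at least 2" is interpreted as "never age-drop or refuse while fewer than 2 packets would remain". Random drop is not covered.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TQ_CAPACITY 64u /* hypothetical ring size */
#define TQ_MIN_KEEP 2u  /* "always keep at least 2 packets" */

struct tq_pkt {
	uint64_t enq_ns; /* time the packet was queued */
	uint32_t len;    /* packet length in bytes */
};

struct tq {
	struct tq_pkt ring[TQ_CAPACITY];
	size_t head, count;
	uint64_t backlog_bytes;
	uint64_t timeout_ns; /* maximum tolerated queueing time */
	uint64_t max_bps;    /* best-case link rate, must be nonzero */
	uint64_t dropped;
};

static void tq_init(struct tq *q, uint64_t timeout_ns, uint64_t max_bps)
{
	q->head = q->count = 0;
	q->backlog_bytes = q->dropped = 0;
	q->timeout_ns = timeout_ns;
	q->max_bps = max_bps;
}

/* Remove the head packet and update the byte backlog. */
static void tq_pop(struct tq *q)
{
	q->backlog_bytes -= q->ring[q->head].len;
	q->head = (q->head + 1) % TQ_CAPACITY;
	q->count--;
}

/*
 * Enqueue check: assume the link runs at max_bps; if even then the
 * backlog ahead of this packet cannot drain within the timeout, the
 * packet would fail the dequeue check anyway, so refuse it now.
 * (The product assumes realistic backlogs; it is not overflow-proof.)
 */
static bool tq_enqueue(struct tq *q, uint32_t len, uint64_t now_ns)
{
	uint64_t best_wait_ns =
		q->backlog_bytes * 8u * 1000000000ull / q->max_bps;

	if (q->count == TQ_CAPACITY ||
	    (q->count >= TQ_MIN_KEEP && best_wait_ns > q->timeout_ns)) {
		q->dropped++;
		return false;
	}
	q->ring[(q->head + q->count) % TQ_CAPACITY] =
		(struct tq_pkt){ .enq_ns = now_ns, .len = len };
	q->count++;
	q->backlog_bytes += len;
	return true;
}

/* Dequeue check: discard packets that are too old, but keep the last 2. */
static bool tq_dequeue(struct tq *q, uint64_t now_ns, struct tq_pkt *out)
{
	while (q->count > TQ_MIN_KEEP &&
	       now_ns - q->ring[q->head].enq_ns > q->timeout_ns) {
		tq_pop(q);
		q->dropped++;
	}
	if (q->count == 0)
		return false;
	*out = q->ring[q->head];
	tq_pop(q);
	return true;
}
```

With a 10 ms timeout on a 1 Mbit/s best-case link, two queued 1000-byte packets already represent 16 ms of backlog, so a third is refused at enqueue; the two that remain are still delivered no matter how old they get, which is the slow-modem case the exception is meant to cover.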
* Re: txqueuelen has wrong units; should be time 2011-02-28 16:37 ` Albert Cahalan @ 2011-02-28 17:45 ` John W. Linville 0 siblings, 0 replies; 43+ messages in thread From: John W. Linville @ 2011-02-28 17:45 UTC (permalink / raw) To: Albert Cahalan Cc: Hagen Paul Pfeifer, Jussi Kivilinna, Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev On Mon, Feb 28, 2011 at 11:37:45AM -0500, Albert Cahalan wrote: > Keeping the timeout really low is important because it isn't > OK to eat up all the latency tolerance in one hop. You have > an end-to-end budget of 20 ms for usable GUI rubber banding. > The budget for gaming is about 80 and for VoIP is about 150. Oooh, numbers! :-) Where can I find estimates on average hop counts for internet connections? John -- John W. Linville <linville@tuxdriver.com> Someday the world will need a hero, and you might be all we have. Be ready. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-28 15:38 ` txqueuelen has wrong units; should be time Hagen Paul Pfeifer 2011-02-28 16:37 ` Albert Cahalan @ 2011-02-28 17:20 ` Bill Sommerfeld 2011-02-28 21:51 ` John Heffner 1 sibling, 1 reply; 43+ messages in thread From: Bill Sommerfeld @ 2011-02-28 17:20 UTC (permalink / raw) To: Hagen Paul Pfeifer Cc: Albert Cahalan, Jussi Kivilinna, Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev On Mon, Feb 28, 2011 at 07:38, Hagen Paul Pfeifer <hagen@jauu.net> wrote: > On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote: >> I suppose there is a need to allow at least 2 packets despite any >> time limits, so that it remains possible to use a traditional modem >> even if a huge packet takes several seconds to send. > > That is a good point! We talk about as we may know every use case of > Linux. But this is not true at all. One of my customer for example operates > the Linux network stack functionality on top of a proprietary MAC/Driver > where the current packet queue characteristic is just fine. The > time-drop-approach is unsuitable because the bandwidth can vary in a small > amount of time over a great range (0 till max. bandwidth). A sufficient > buffering shows up superior in this environment (only IPv{4,6}/UDP). The tension is between the average queue length and the maximum amount of buffering needed. Fixed-sized tail-drop queues -- either long, or short -- are not ideal. My understanding is that the best practice here is that you need (bandwidth * path delay) buffering to be available to absorb bursts and avoid drops, but you also need to use queue management algorithms with ECN or random drop to keep the *average* queue length short; unfortunately, researchers are still arguing about the details of the second part... ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-28 17:20 ` Bill Sommerfeld @ 2011-02-28 21:51 ` John Heffner 2011-03-01 0:46 ` Mikael Abrahamsson 0 siblings, 1 reply; 43+ messages in thread From: John Heffner @ 2011-02-28 21:51 UTC (permalink / raw) To: Bill Sommerfeld Cc: Hagen Paul Pfeifer, Albert Cahalan, Jussi Kivilinna, Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev Right... while I generally agree that a fixed-length drop-tail queue isn't optimal, isn't this problem what the various AQM schemes try to solve? -John On Mon, Feb 28, 2011 at 12:20 PM, Bill Sommerfeld <wsommerfeld@google.com> wrote: > On Mon, Feb 28, 2011 at 07:38, Hagen Paul Pfeifer <hagen@jauu.net> wrote: >> On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote: >>> I suppose there is a need to allow at least 2 packets despite any >>> time limits, so that it remains possible to use a traditional modem >>> even if a huge packet takes several seconds to send. >> >> That is a good point! We talk about as we may know every use case of >> Linux. But this is not true at all. One of my customer for example operates >> the Linux network stack functionality on top of a proprietary MAC/Driver >> where the current packet queue characteristic is just fine. The >> time-drop-approach is unsuitable because the bandwidth can vary in a small >> amount of time over a great range (0 till max. bandwidth). A sufficient >> buffering shows up superior in this environment (only IPv{4,6}/UDP). > > The tension is between the average queue length and the maximum amount > of buffering needed. Fixed-sized tail-drop queues -- either long, or > short -- are not ideal. 
> > My understanding is that the best practice here is that you need > (bandwidth * path delay) buffering to be available to absorb bursts > and avoid drops, but you also need to use queue management algorithms > with ECN or random drop to keep the *average* queue length short; > unfortunately, researchers are still arguing about the details of the > second part... > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-02-28 21:51 ` John Heffner @ 2011-03-01 0:46 ` Mikael Abrahamsson 2011-03-02 6:25 ` Stephen Hemminger 0 siblings, 1 reply; 43+ messages in thread From: Mikael Abrahamsson @ 2011-03-01 0:46 UTC (permalink / raw) To: John Heffner Cc: Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan, Jussi Kivilinna, Eric Dumazet, linux-kernel, netdev On Mon, 28 Feb 2011, John Heffner wrote: > Right... while I generally agree that a fixed-length drop-tail queue > isn't optimal, isn't this problem what the various AQM schemes try to > solve? I am not an expert on exactly how Linux does this, but on Cisco with, for instance, ATM interfaces, there are two stages of queuing. One is the "hardware queue", which is a FIFO queue going into the ATM framer. If one wants low CPU usage, then this needs to be large so multiple packets can be put there per interrupt. Since AQM is working before this, it also means the low-latency queue will have a higher latency, as its packets end up behind larger packets in the hw queue. So on what level does the AQM work in Linux? Does it work similarly, in that txqueuelen is a FIFO queue to the hardware that AQM feeds packets into? Also, when one uses WRED the thinking is generally to keep the average queue length down, but still allow for bursts by dynamically changing the drop probability and where it happens. When there is no queuing, allow for a big queue (so it can fill up if needed), but if the queue is large for several seconds, start to apply WRED to bring it down. There is generally no need at all to constantly buffer > 50 ms of data; at that point it's better to just start selectively dropping it. In times of burstiness (perhaps when re-routing traffic) there is a need to buffer 200-500 ms of data for perhaps 1-2 seconds before things stabilize. So one queuing scheme and one queue limit isn't going to solve this; there needs to be some dynamic behaviour built into the system for it to work well. 
AQM needs to feed into a relatively short hw queue, and AQM needs to exist on output also when the traffic is sourced from the box itself, not only routed. It would also help if the default were to reserve, let's say, 25% of the bandwidth for smaller packets (< 200 bytes or so), which generally are for interactive uses or are ACKs. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 43+ messages in thread
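The idea described above, keeping the average queue short while letting bursts through by making the drop probability depend on a smoothed queue length, is what textbook RED does. A minimal sketch follows, assuming the classic linear RED profile; it is not Cisco's WRED implementation nor the Linux red/gred qdisc, and struct red_profile, red_update_avg() and red_drop_prob() are hypothetical names. WRED would amount to choosing a different profile per traffic class.

```c
/* Per-class RED parameters (hypothetical layout; units are packets). */
struct red_profile {
	double min_th; /* below this average, never drop */
	double max_th; /* at or above this average, always drop */
	double max_p;  /* drop probability just below max_th */
};

/*
 * Exponentially weighted moving average of the queue length.
 * A small w makes the average slow to react, so short bursts
 * pass through without raising the drop probability much.
 */
static double red_update_avg(double avg, double cur_qlen, double w)
{
	return avg + w * (cur_qlen - avg);
}

/* Linear ramp from 0 at min_th to max_p at max_th, then forced drop. */
static double red_drop_prob(const struct red_profile *p, double avg)
{
	if (avg < p->min_th)
		return 0.0;
	if (avg >= p->max_th)
		return 1.0;
	return p->max_p * (avg - p->min_th) / (p->max_th - p->min_th);
}
```

The instantaneous queue may grow well past max_th during a burst; only a sustained backlog moves the average far enough to trigger drops, which matches the "large for several seconds" behaviour described in the message.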
* Re: txqueuelen has wrong units; should be time 2011-03-01 0:46 ` Mikael Abrahamsson @ 2011-03-02 6:25 ` Stephen Hemminger 2011-03-02 6:41 ` Mikael Abrahamsson 0 siblings, 1 reply; 43+ messages in thread From: Stephen Hemminger @ 2011-03-02 6:25 UTC (permalink / raw) To: Mikael Abrahamsson Cc: John Heffner, Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan, Jussi Kivilinna, Eric Dumazet, linux-kernel, netdev On Tue, 1 Mar 2011 01:46:51 +0100 (CET) Mikael Abrahamsson <swmike@swm.pp.se> wrote: > On Mon, 28 Feb 2011, John Heffner wrote: > > > Right... while I generally agree that a fixed-length drop-tail queue > > isn't optimal, isn't this problem what the various AQM schemes try to > > solve? > > I am not an expert on exactly how Linux does this, but for Cisco and for > instance ATM interfaces, there are two stages of queuing. One is the > "hardware queue", which is a FIFO queue going into the ATM framer. If one > wants low CPU usage, then this needs to be high so multiple packets can be > put there per interrupt. Since AQM is working before this, it also means > the low-latency-queue will have a higher latency as it ends up behind > larger packets in the hw queue. > > So on what level does the AQM work in Linux? Does it work similarily, that > txqueuelen is a FIFO queue to the hardware that AQM feeds packets into? > > Also, when one uses WRED the thinking is generally to keep the average > queue len down, but still allow for bursts by dynamically changing the > drop probability and where it happens. When there is no queuing, allow for > big queue (so it can fill up if needed), but if the queue is large for > several seconds, start to apply WRED to bring it down. > > There is generally no need at all to constantly buffer > 50 ms of data, > then it's better to just start selectively dropping it. In time of > burstyness (perhaps when re-routing traffic) there is need to buffer > 200-500ms of during perhaps 1-2 seconds before things stabilize. 
> > So one queuing scheme and one queue limit isn't going to solve this, there > need to be some dynamic built into the system for it to work well. > > AQM needs to feed into a relatively short hw queue and AQM needs to exist > on output also when the traffic is sourced from the box itself, no tonly > routed. It would also help if the default would be to use let's say 25% of > the bandwidth for smaller packets (< 200 bytes or so) which generally are > for interactive uses or are ACKs. > It is possible to build an equivalent to WRED out of the existing GRED queuing discipline, but it does require a lot of tc knowledge to get right. The inventor of RED (Van Jacobson) has issues with WRED because of the added complexity of queue selection. RED requires some parameters which the average user has no idea how to set. There are several problems with RED that prevent VJ from recommending it in the current form. http://gettys.wordpress.com/2010/12/17/red-in-a-different-light/ -- ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-02 6:25 ` Stephen Hemminger @ 2011-03-02 6:41 ` Mikael Abrahamsson 2011-03-02 7:07 ` Stephen Hemminger 0 siblings, 1 reply; 43+ messages in thread From: Mikael Abrahamsson @ 2011-03-02 6:41 UTC (permalink / raw) To: Stephen Hemminger Cc: John Heffner, Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan, Jussi Kivilinna, Eric Dumazet, linux-kernel, netdev On Tue, 1 Mar 2011, Stephen Hemminger wrote: > It is possible to build an equivalent to WRED out existing GRED queuing > discipline but it does require a lot of tc knowledge to get right. To me, having worked with Cisco routers for 10+ years and being used to the different variants Cisco uses, tc is just weird. It must come from a completely different school of thinking compared to what router people are used to, because I have tried and failed twice to do anything sensible with it. > The inventor of RED (Van Jacobsen) has issues with WRED because of the > added complexity of queue selection. RED requires some parameters which > the average user has no idea how to set. Of course there are issues, and some of them can be addressed by simply lowering the queue depth. Yes, that might bring down the performance of some sessions, but for most interactive traffic, never buffering more than 40 ms is a good thing. > There are several problems with RED that prevent prevent VJ from > recommending it in the current form. Ask if he prefers FIFO+tail drop to RED in its current form. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-02 6:41 ` Mikael Abrahamsson @ 2011-03-02 7:07 ` Stephen Hemminger 2011-03-02 16:41 ` Mikael Abrahamsson 0 siblings, 1 reply; 43+ messages in thread From: Stephen Hemminger @ 2011-03-02 7:07 UTC (permalink / raw) To: Mikael Abrahamsson Cc: John Heffner, Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan, Jussi Kivilinna, Eric Dumazet, linux-kernel, netdev On Wed, 2 Mar 2011 07:41:30 +0100 (CET) Mikael Abrahamsson <swmike@swm.pp.se> wrote: > On Tue, 1 Mar 2011, Stephen Hemminger wrote: > > > It is possible to build an equivalent to WRED out existing GRED queuing > > discipline but it does require a lot of tc knowledge to get right. > > To me who has worked with cisco routers for 10+ years and who is used to > the different variants Cisco use, tc is just weird. It must come from a > completely different school of thinking compared to what router people are > used to, because I have tried and failed twice to do anything sensible > with it. 
Vyatta has scripting that handles all that:

vyatta@napa:~$ configure
[edit]
vyatta@napa# set traffic-policy random-detect MyWFQ bandwidth 1gbps
[edit]
vyatta@napa# set interfaces ethernet eth0 traffic-policy out MyWFQ
[edit]
vyatta@napa# commit
[edit]
vyatta@napa# exit
vyatta@napa:~$ show queueing ethernet eth0
eth0 Queueing:
Class Policy Sent Rate Dropped Overlimit Backlog
root weighted-random 16550 0 0 0
vyatta@napa:~$ /sbin/tc qdisc show dev eth0
qdisc dsmark 1: root refcnt 2 indices 0x0008 set_tc_index
qdisc gred 2: parent 1:
 DP:0 (prio 8) Average Queue 0b Measured Queue 0b
  Packet drops: 0 (forced 0 early 0)
  Packet totals: 82 (bytes 9540) ewma 3 Plog 17 Scell_log 3
 DP:1 (prio 7) Average Queue 0b Measured Queue 0b
  Packet drops: 0 (forced 0 early 0)
  Packet totals: 0 (bytes 0) ewma 2 Plog 17 Scell_log 2
 DP:2 (prio 6) Average Queue 0b Measured Queue 0b
  Packet drops: 0 (forced 0 early 0)
  Packet totals: 0 (bytes 0) ewma 2 Plog 17 Scell_log 2
 DP:3 (prio 5) Average Queue 0b Measured Queue 0b
  Packet drops: 0 (forced 0 early 0)
  Packet totals: 0 (bytes 0) ewma 2 Plog 16 Scell_log 2
 DP:4 (prio 4) Average Queue 0b Measured Queue 0b
  Packet drops: 0 (forced 0 early 0)
  Packet totals: 0 (bytes 0) ewma 2 Plog 16 Scell_log 2
 DP:5 (prio 3) Average Queue 0b Measured Queue 0b
  Packet drops: 0 (forced 0 early 0)
  Packet totals: 0 (bytes 0) ewma 2 Plog 16 Scell_log 2
 DP:6 (prio 2) Average Queue 0b Measured Queue 0b
  Packet drops: 0 (forced 0 early 0)
  Packet totals: 0 (bytes 0) ewma 2 Plog 15 Scell_log 2
 DP:7 (prio 1) Average Queue 0b Measured Queue 0b
  Packet drops: 0 (forced 0 early 0)
  Packet totals: 0 (bytes 0) ewma 1 Plog 15 Scell_log 1

QoS on Cisco has different/other problems, mostly because various groups tried to fix the QoS problem over time and never got it quite right. Also WRED is not default on faster links because it can't be done fast enough. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-02 7:07 ` Stephen Hemminger @ 2011-03-02 16:41 ` Mikael Abrahamsson 2011-03-02 16:50 ` Eric Dumazet 0 siblings, 1 reply; 43+ messages in thread From: Mikael Abrahamsson @ 2011-03-02 16:41 UTC (permalink / raw) To: Stephen Hemminger Cc: John Heffner, Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan, Jussi Kivilinna, Eric Dumazet, linux-kernel, netdev On Tue, 1 Mar 2011, Stephen Hemminger wrote: > Also WRED is not default on faster links because it can't be done fast > enough. Before this propagates as some kind of truth: modern Cisco core routers have no problem doing WRED at wire speed, so the above statement is not true. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: txqueuelen has wrong units; should be time 2011-03-02 16:41 ` Mikael Abrahamsson @ 2011-03-02 16:50 ` Eric Dumazet 0 siblings, 0 replies; 43+ messages in thread From: Eric Dumazet @ 2011-03-02 16:50 UTC (permalink / raw) To: Mikael Abrahamsson Cc: Stephen Hemminger, John Heffner, Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan, Jussi Kivilinna, linux-kernel, netdev On Wednesday, 02 March 2011 at 17:41 +0100, Mikael Abrahamsson wrote: > On Tue, 1 Mar 2011, Stephen Hemminger wrote: > > > Also WRED is not default on faster links because it can't be done fast > > enough. > > Before this propagates as some kind of truth. Cisco modern core routers > have no problems doing WRED at wirespeed, the above statement is not true. > Looking at the Cisco docs you provided ( <http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/12stbwr.html> ), it seems the WRED time limits (instead of bytes/packets limits) are internally converted to bytes/packets limits. Quote: "When the queue limit threshold is specified in milliseconds, the number of milliseconds is internally converted to bytes using the bandwidth available for the class." So it seems it's only a convenience, and queues are still managed with bytes/packets limits... WRED is able to probabilistically drop a packet when this packet is enqueued. At time of enqueue, we don't yet know the time of dequeue, unless the bandwidth is known. ^ permalink raw reply [flat|nested] 43+ messages in thread
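The conversion quoted above, and the reason it only holds when the bandwidth is known, can be written down in a few lines. This is a sketch of the arithmetic only, not Cisco's or Linux's code; ms_to_bytes() and backlog_to_delay_ns() are hypothetical helper names, and the products are assumed to stay within 64 bits for realistic rates and limits.

```c
#include <stdint.h>

/* Convert a time limit into a byte limit for a given class bandwidth. */
static uint64_t ms_to_bytes(uint64_t limit_ms, uint64_t bandwidth_bps)
{
	/* bits/s * ms / 1000 gives bits; / 8 gives bytes */
	return bandwidth_bps * limit_ms / 8000u;
}

/*
 * The reverse direction: how long a byte backlog takes to drain.
 * This is the "time of dequeue" estimate, and it is only as good
 * as the bandwidth figure passed in.
 */
static uint64_t backlog_to_delay_ns(uint64_t backlog_bytes,
				    uint64_t bandwidth_bps)
{
	return backlog_bytes * 8u * 1000000000ull / bandwidth_bps;
}
```

For example, a 50 ms limit on a 1 Mbit/s class becomes 6250 bytes. If the link then actually runs at half that rate, those same 6250 bytes take 100 ms to drain, so a byte limit derived once from a nominal bandwidth no longer enforces the intended time limit.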