txqueuelen has wrong units; should be time

netdev.vger.kernel.org archive mirror

* txqueuelen has wrong units; should be time
@ 2011-02-27  5:44 Albert Cahalan
  2011-02-27  7:02 ` Mikael Abrahamsson
  0 siblings, 1 reply; 43+ messages in thread
From: Albert Cahalan @ 2011-02-27  5:44 UTC (permalink / raw)
  To: linux-kernel, netdev

(thinking about the bufferbloat problem here)

Setting txqueuelen to some fixed number of packets
seems pretty broken if:

1. a link can vary in speed (802.11 especially)

2. a packet can vary in size (9 KiB jumbograms, etc.)

3. there is other weirdness (PPP compression, etc.)

It really needs to be set to some amount of time,
with the OS accounting for packets in terms of the
time it will take to transmit them. This would need
to account for physical-layer packet headers and
minimum spacing requirements.

I think it could also account for estimated congestion
on the local link, because that affects the rate at which
the queue can empty. An OS can directly observe this
on some types of hardware.

Nanoseconds seems fine; it's unlikely you'd ever want
more than 4.2 seconds (32-bit unsigned) of queue.

I guess there are at least 2 queues of interest, with the
second one being under control of the hardware driver.
Having the kernel split the max time as appropriate for
the hardware seems nicest.
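
As a rough illustration of the accounting described above (a userspace sketch, not code from this thread), a queue limit expressed in time could charge each packet its estimated transmit time. The names tx_time_ns, time_queue and time_queue_admit are hypothetical, and the 20-byte per-packet overhead is an assumed Ethernet-style figure (preamble plus inter-frame gap), not something the post specifies.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-packet physical-layer overhead in bytes (illustrative:
 * 8-byte preamble plus 12-byte inter-frame gap). Not from the thread. */
#define PHY_OVERHEAD_BYTES 20u

/* Nanoseconds needed to put len bytes plus overhead on a link running
 * at rate_bps bits per second. Safe for ordinary packet sizes; very
 * large len values would overflow the 64-bit intermediate product. */
static uint64_t tx_time_ns(uint32_t len, uint64_t rate_bps)
{
	uint64_t bits;

	assert(rate_bps > 0);
	bits = ((uint64_t)len + PHY_OVERHEAD_BYTES) * 8;
	return bits * 1000000000ull / rate_bps;
}

/* Hypothetical queue state: the limit is a time budget, not a count. */
struct time_queue {
	uint64_t queued_ns;	/* transmit time of everything queued */
	uint64_t limit_ns;	/* "txqueuelen" expressed as time */
};

/* Returns 1 and charges the budget if the packet fits, 0 otherwise. */
static int time_queue_admit(struct time_queue *q, uint32_t len,
			    uint64_t rate_bps)
{
	uint64_t t = tx_time_ns(len, rate_bps);

	if (q->queued_ns + t > q->limit_ns)
		return 0;
	q->queued_ns += t;
	return 1;
}
```

With a 10 ms budget, a 1500-byte packet at 100 Mbit/s costs about 0.12 ms and is admitted, while a 9000-byte jumbogram at 1 Mbit/s costs about 72 ms and is refused. A packet-count limit cannot express that difference.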


* Re: txqueuelen has wrong units; should be time
  2011-02-27  5:44 txqueuelen has wrong units; should be time Albert Cahalan
@ 2011-02-27  7:02 ` Mikael Abrahamsson
  2011-02-27  7:54   ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Mikael Abrahamsson @ 2011-02-27  7:02 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: linux-kernel, netdev

On Sun, 27 Feb 2011, Albert Cahalan wrote:

> Nanoseconds seems fine; it's unlikely you'd ever want
> more than 4.2 seconds (32-bit unsigned) of queue.

I think this is shortsighted and I'm sure someone will come up with a case 
where 4.2 seconds isn't enough. Let's not build in those kinds of 
limitations from start.

Why not make it 64bit and go to picoseconds from start?

If you need to make it 32bit unsigned, I'd suggest to start from 
microseconds instead. It's less likely someone would want less than a 
microsecond of queue, than someone wanting more than 4.2 seconds of queue.
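
The ranges being compared here can be checked with a little arithmetic. The sketch below is only illustrative, and the helper names u32_range_seconds and u64_range_seconds are made up for this note.

```c
#include <assert.h>
#include <stdint.h>

/* Largest whole number of seconds a 32-bit unsigned counter can hold
 * when it counts ticks_per_sec ticks per second. */
static uint32_t u32_range_seconds(uint32_t ticks_per_sec)
{
	assert(ticks_per_sec > 0);
	return UINT32_MAX / ticks_per_sec;
}

/* Same question for a 64-bit unsigned counter. */
static uint64_t u64_range_seconds(uint64_t ticks_per_sec)
{
	assert(ticks_per_sec > 0);
	return UINT64_MAX / ticks_per_sec;
}
```

Called with 1000000000 (nanoseconds), u32_range_seconds gives 4 whole seconds, matching the 4.2 s figure. With 1000000 (microseconds) it gives 4294 seconds, a little over 71 minutes. u64_range_seconds with 1000000000000 (picoseconds) gives 18446744 seconds, roughly 213 days.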

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: txqueuelen has wrong units; should be time
  2011-02-27  7:02 ` Mikael Abrahamsson
@ 2011-02-27  7:54   ` Eric Dumazet
  2011-02-27  8:27     ` Albert Cahalan
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2011-02-27  7:54 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Albert Cahalan, linux-kernel, netdev

On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
> On Sun, 27 Feb 2011, Albert Cahalan wrote:
> 
> > Nanoseconds seems fine; it's unlikely you'd ever want
> > more than 4.2 seconds (32-bit unsigned) of queue.
> 
> I think this is shortsighted and I'm sure someone will come up with a case 
> where 4.2 seconds isn't enough. Let's not build in those kinds of 
> limitations from start.
> 
> Why not make it 64bit and go to picoseconds from start?
> 
> If you need to make it 32bit unsigned, I'd suggest to start from 
> microseconds instead. It's less likely someone would want less than a 
> microsecond of queue, than someone wanting more than 4.2 seconds of queue.
> 

32 or 64 bits doesn't matter a lot. At Qdisc stage we have up to 40 bytes
available in skb->cb[] for our usage.

Problem is some machines have slow High Resolution timing services.

_If_ we have a time limit, it will probably use the low resolution (aka
jiffies), unless high resolution services are cheap.

I was thinking not having an absolute hard limit, but an EWMA based one.
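
For readers unfamiliar with the term, an EWMA (exponentially weighted moving average) limit could look roughly like the userspace sketch below. It is an assumption about what such a limit might track (per-packet queueing delay), not the design proposed in this message; struct ewma, ewma_add and ewma_over_limit are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical smoothed-delay tracker using integer arithmetic only. */
struct ewma {
	uint64_t avg;	/* smoothed value, same unit as the samples */
	unsigned shift;	/* a new sample is weighted 1 / 2^shift */
	int primed;	/* 0 until the first sample has been seen */
};

/* Fold one sample in: avg += (sample - avg) / 2^shift, written so the
 * unsigned subtraction never wraps. */
static void ewma_add(struct ewma *e, uint64_t sample)
{
	assert(e->shift < 32);
	if (!e->primed) {
		e->avg = sample;
		e->primed = 1;
		return;
	}
	if (sample >= e->avg)
		e->avg += (sample - e->avg) >> e->shift;
	else
		e->avg -= (e->avg - sample) >> e->shift;
}

/* Soft limit: compare the smoothed value, not the latest sample. */
static int ewma_over_limit(const struct ewma *e, uint64_t limit)
{
	return e->primed && e->avg > limit;
}
```

Because only the smoothed value is compared against the limit, a single slow packet does not trip it, whereas a sustained rise in delay does.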





* Re: txqueuelen has wrong units; should be time
  2011-02-27  7:54   ` Eric Dumazet
@ 2011-02-27  8:27     ` Albert Cahalan
  2011-02-27 10:55       ` Jussi Kivilinna
  0 siblings, 1 reply; 43+ messages in thread
From: Albert Cahalan @ 2011-02-27  8:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Mikael Abrahamsson, linux-kernel, netdev

On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>>
>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> > more than 4.2 seconds (32-bit unsigned) of queue.
...
> Problem is some machines have slow High Resolution timing services.
>
> _If_ we have a time limit, it will probably use the low resolution (aka
> jiffies), unless high resolution services are cheap.

As long as that is totally internal to the kernel and never
getting exposed by some API for setting the amount, sure.

> I was thinking not having an absolute hard limit, but an EWMA based one.

The whole point is to prevent stale packets, especially to prevent
them from messing with TCP, so I really don't think so. I suppose
you do get this to some extent via early drop.


* Re: txqueuelen has wrong units; should be time
  2011-02-27  8:27     ` Albert Cahalan
@ 2011-02-27 10:55       ` Jussi Kivilinna
  2011-02-27 20:07         ` Eric Dumazet
  2011-02-27 23:33         ` Albert Cahalan
  0 siblings, 2 replies; 43+ messages in thread
From: Jussi Kivilinna @ 2011-02-27 10:55 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1423 bytes --]

Quoting Albert Cahalan <acahalan@gmail.com>:

> On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
>>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>>>
>>> > Nanoseconds seems fine; it's unlikely you'd ever want
>>> > more than 4.2 seconds (32-bit unsigned) of queue.
> ...
>> Problem is some machines have slow High Resolution timing services.
>>
>> _If_ we have a time limit, it will probably use the low resolution (aka
>> jiffies), unless high resolution services are cheap.
>
> As long as that is totally internal to the kernel and never
> getting exposed by some API for setting the amount, sure.
>
>> I was thinking not having an absolute hard limit, but an EWMA based one.
>
> The whole point is to prevent stale packets, especially to prevent
> them from messing with TCP, so I really don't think so. I suppose
> you do get this to some extent via early drop.

I made a simple hack on sch_fifo with per-packet time limits
(attachment) this weekend and have been doing limited testing on a
wireless link. I think a hard limit is fine: it's simple and does
much the same as a packet-limited (hard-limited) buffer does, dropping
packets when the buffer is 'full'. My hack checks for timed-out packets
on enqueue, which might be the wrong approach (on the other hand, it
might allow some more burstiness).

-Jussi

[-- Attachment #2: sch_fifo_to.c --]
[-- Type: text/x-csrc, Size: 6138 bytes --]

/*
 * sch_fifo_timeout.c	Simple FIFO queue with per packet timeout.
 *
 * This program is free software; you can redistribute it and/or modify it under
 * the terms of the GNU General Public License as published by the Free Software
 * Foundation; either version 2 of the License, or (at your option) any later
 * version.
 *
 */

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/pkt_sched.h>
#include <net/inet_ecn.h>

#define DEFAULT_TIMEOUT_PKT_MS 10
#define DEFAULT_TIMEOUT_PKT PSCHED_NS2TICKS((u64)NSEC_PER_SEC * \
						DEFAULT_TIMEOUT_PKT_MS / 1000)

struct tc_fifo_timeout_qopt {
	__u64	timeout;	/* Max time packet may stay in buffer */
	__u32   limit;		/* Queue length: bytes for bfifo, packets for pfifo */
};

struct fifo_timeout_skb_cb {
	psched_time_t	time_queued;
};

struct fifo_timeout_sched_data {
	psched_tdiff_t	timeout;
	u32		limit;
};

static inline
struct fifo_timeout_skb_cb *fifo_timeout_skb_cb(struct sk_buff *skb)
{
	BUILD_BUG_ON(sizeof(skb->cb) <
		sizeof(struct qdisc_skb_cb) +
			sizeof(struct fifo_timeout_skb_cb));
	return (struct fifo_timeout_skb_cb *)qdisc_skb_cb(skb)->data;
}

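/*
 * Drop packets from the head of the queue until the head is a packet
 * whose timeout has not yet expired, or the queue is empty.
 * Called from the enqueue paths below.
 */
static void pfifo_timeout_drop_timedout_packets(struct Qdisc *sch,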
						psched_time_t now)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;

check_next:
	skb = qdisc_peek_head(sch);
	if (likely(!skb))
		return;

	if (likely(fifo_timeout_skb_cb(skb)->time_queued + q->timeout > now))
		return;

	__qdisc_queue_drop_head(sch, &sch->q);
	sch->qstats.drops++;

	goto check_next;
}

static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);

	if (likely(skb_queue_len(&sch->q) < q->limit))
		return qdisc_enqueue_tail(skb, sch);

	/* queue full, remove one skb to fulfill the limit */
	__qdisc_queue_drop_head(sch, &sch->q);
	sch->qstats.drops++;
	qdisc_enqueue_tail(skb, sch);

	return NET_XMIT_CN;
}

static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);

	if (likely(sch->qstats.backlog + qdisc_pkt_len(skb) <= q->limit))
		return qdisc_enqueue_tail(skb, sch);

	return qdisc_reshape_fail(skb, sch);
}

static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);

	if (likely(skb_queue_len(&sch->q) < q->limit))
		return qdisc_enqueue_tail(skb, sch);

	return qdisc_reshape_fail(skb, sch);
}

static int pfifo_timeout_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	psched_time_t now = psched_get_time();

	fifo_timeout_skb_cb(skb)->time_queued = now;
	pfifo_timeout_drop_timedout_packets(sch, now);

	return pfifo_tail_enqueue(skb, sch);
}

static int bfifo_timeout_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	psched_time_t now = psched_get_time();

	fifo_timeout_skb_cb(skb)->time_queued = now;
	pfifo_timeout_drop_timedout_packets(sch, now);

	return bfifo_enqueue(skb, sch);
}

static int pfifo_timeout_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	psched_time_t now = psched_get_time();

	fifo_timeout_skb_cb(skb)->time_queued = now;
	pfifo_timeout_drop_timedout_packets(sch, now);

	return pfifo_enqueue(skb, sch);
}

static int fifo_timeout_init(struct Qdisc *sch, struct nlattr *opt)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);

	if (opt == NULL) {
		u32 limit = qdisc_dev(sch)->tx_queue_len ? : 1;

		q->limit = limit;
		q->timeout = DEFAULT_TIMEOUT_PKT;
	} else {
		struct tc_fifo_timeout_qopt *ctl = nla_data(opt);

		if (nla_len(opt) < sizeof(*ctl))
			return -EINVAL;

		q->limit = ctl->limit;
		q->timeout = ctl->timeout ? : DEFAULT_TIMEOUT_PKT;
	}

	return 0;
}

static int fifo_timeout_dump(struct Qdisc *sch, struct sk_buff *skb)
{
	struct fifo_timeout_sched_data *q = qdisc_priv(sch);
	struct tc_fifo_timeout_qopt opt = {
		.limit = q->limit,
		.timeout = q->timeout
	};

	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
	return skb->len;

nla_put_failure:
	return -1;
}

static struct Qdisc_ops pfifo_timeout_qdisc_ops __read_mostly = {
	.id		=	"pfifo_timeout",
	.priv_size	=	sizeof(struct fifo_timeout_sched_data),
	.enqueue	=	pfifo_timeout_enqueue,
	.dequeue	=	qdisc_dequeue_head,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop,
	.init		=	fifo_timeout_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_timeout_init,
	.dump		=	fifo_timeout_dump,
	.owner		=	THIS_MODULE,
};

static struct Qdisc_ops bfifo_timeout_qdisc_ops __read_mostly = {
	.id		=	"bfifo_timeout",
	.priv_size	=	sizeof(struct fifo_timeout_sched_data),
	.enqueue	=	bfifo_timeout_enqueue,
	.dequeue	=	qdisc_dequeue_head,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop,
	.init		=	fifo_timeout_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_timeout_init,
	.dump		=	fifo_timeout_dump,
	.owner		=	THIS_MODULE,
};

static struct Qdisc_ops pfifo_head_drop_timeout_qdisc_ops __read_mostly = {
	.id		=	"pfifo_hd_tout",
	.priv_size	=	sizeof(struct fifo_timeout_sched_data),
	.enqueue	=	pfifo_timeout_tail_enqueue,
	.dequeue	=	qdisc_dequeue_head,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop_head,
	.init		=	fifo_timeout_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_timeout_init,
	.dump		=	fifo_timeout_dump,
	.owner		=	THIS_MODULE,
};

static int __init fifo_timeout_module_init(void)
{
	int retval;

	retval = register_qdisc(&pfifo_timeout_qdisc_ops);
	if (retval)
		return retval;
	retval = register_qdisc(&bfifo_timeout_qdisc_ops);
	if (retval)
		goto unregister_pfifo;
	retval = register_qdisc(&pfifo_head_drop_timeout_qdisc_ops);
	if (retval)
		goto unregister_bfifo;

	return 0;

	/* Unwind only the registrations that succeeded. */
unregister_bfifo:
	unregister_qdisc(&bfifo_timeout_qdisc_ops);
unregister_pfifo:
	unregister_qdisc(&pfifo_timeout_qdisc_ops);
	return retval;
}

static void __exit fifo_timeout_module_exit(void)
{
	unregister_qdisc(&pfifo_timeout_qdisc_ops);
	unregister_qdisc(&bfifo_timeout_qdisc_ops);
	unregister_qdisc(&pfifo_head_drop_timeout_qdisc_ops);
}

module_init(fifo_timeout_module_init)
module_exit(fifo_timeout_module_exit)
MODULE_LICENSE("GPL");



* Re: txqueuelen has wrong units; should be time
  2011-02-27 10:55       ` Jussi Kivilinna
@ 2011-02-27 20:07         ` Eric Dumazet
  2011-02-27 21:32           ` Jussi Kivilinna
                             ` (2 more replies)
  2011-02-27 23:33         ` Albert Cahalan
  1 sibling, 3 replies; 43+ messages in thread
From: Eric Dumazet @ 2011-02-27 20:07 UTC (permalink / raw)
  To: Jussi Kivilinna; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev

On Sunday 27 February 2011 at 12:55 +0200, Jussi Kivilinna wrote:
> Quoting Albert Cahalan <acahalan@gmail.com>:
> 
> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
> >>>
> >>> > Nanoseconds seems fine; it's unlikely you'd ever want
> >>> > more than 4.2 seconds (32-bit unsigned) of queue.
> > ...
> >> Problem is some machines have slow High Resolution timing services.
> >>
> >> _If_ we have a time limit, it will probably use the low resolution (aka
> >> jiffies), unless high resolution services are cheap.
> >
> > As long as that is totally internal to the kernel and never
> > getting exposed by some API for setting the amount, sure.
> >
> >> I was thinking not having an absolute hard limit, but an EWMA based one.
> >
> > The whole point is to prevent stale packets, especially to prevent
> > them from messing with TCP, so I really don't think so. I suppose
> > you do get this to some extent via early drop.
> 
> I made simple hack on sch_fifo with per packet time limits  
> (attachment) this weekend and have been doing limited testing on  
> wireless link. I think hardlimit is fine, it's simple and does  
> somewhat same as what packet(-hard)limited buffer does, drops packets  
> when buffer is 'full'. My hack checks for timed out packets on  
> enqueue, might be wrong approach (on other hand might allow some more  
> burstiness).
> 


Qdisc should return to caller a good indication packet is queued or
dropped at enqueue() time... not later (aka : never)

Accepting a packet at t0, and dropping it later at t0+limit without
giving any indication to caller is a problem.

This is why I suggested using an EWMA plus a probabilistic drop or
congestion indication (NET_XMIT_CN) to caller at enqueue() time.

The absolute time limit you are trying to implement should be checked at
dequeue time, to cope with enqueue bursts or pauses on wire.
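
To make the dequeue-time check concrete, here is a small userspace model of a FIFO that stamps packets on enqueue and discards stale ones only when the link asks for the next packet. It is a sketch under assumptions, not kernel code: struct tfifo, tfifo_enqueue and tfifo_dequeue are hypothetical names, and packets are reduced to an integer id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QCAP 8	/* illustrative fixed capacity */

struct pkt {
	int id;
	uint64_t enq_ns;	/* time the packet entered the queue */
};

struct tfifo {
	struct pkt ring[QCAP];
	size_t head, len;
	uint64_t timeout_ns;	/* maximum time a packet may wait */
	unsigned drops;		/* packets discarded as stale */
};

/* Returns 0 if queued, -1 if refused; the caller learns this at once. */
static int tfifo_enqueue(struct tfifo *q, int id, uint64_t now_ns)
{
	assert(id >= 0);	/* -1 is reserved as "nothing" on dequeue */
	if (q->len == QCAP)
		return -1;
	q->ring[(q->head + q->len) % QCAP] = (struct pkt){ id, now_ns };
	q->len++;
	return 0;
}

/* Skips packets that waited timeout_ns or longer, then returns the id
 * of the first fresh packet, or -1 if none is left. */
static int tfifo_dequeue(struct tfifo *q, uint64_t now_ns)
{
	while (q->len) {
		struct pkt p = q->ring[q->head];

		q->head = (q->head + 1) % QCAP;
		q->len--;
		if (now_ns - p.enq_ns < q->timeout_ns)
			return p.id;
		q->drops++;
	}
	return -1;
}
```

Checking at dequeue means a pause on the wire is noticed when transmission resumes, whether or not any new packet arrives in the meantime. The enqueue-time signal is left to a separate mechanism such as the EWMA mentioned above.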


* Re: txqueuelen has wrong units; should be time
  2011-02-27 20:07         ` Eric Dumazet
@ 2011-02-27 21:32           ` Jussi Kivilinna
  2011-02-28 11:43           ` Jussi Kivilinna
  2011-02-28 16:11           ` John W. Linville
  2 siblings, 0 replies; 43+ messages in thread
From: Jussi Kivilinna @ 2011-02-27 21:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev

Quoting Eric Dumazet <eric.dumazet@gmail.com>:

> On Sunday 27 February 2011 at 12:55 +0200, Jussi Kivilinna wrote:
>> Quoting Albert Cahalan <acahalan@gmail.com>:
>>
>> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet  
>> <eric.dumazet@gmail.com> wrote:
>> >> On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
>> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>> >>>
>> >>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> >>> > more than 4.2 seconds (32-bit unsigned) of queue.
>> > ...
>> >> Problem is some machines have slow High Resolution timing services.
>> >>
>> >> _If_ we have a time limit, it will probably use the low resolution (aka
>> >> jiffies), unless high resolution services are cheap.
>> >
>> > As long as that is totally internal to the kernel and never
>> > getting exposed by some API for setting the amount, sure.
>> >
>> >> I was thinking not having an absolute hard limit, but an EWMA based one.
>> >
>> > The whole point is to prevent stale packets, especially to prevent
>> > them from messing with TCP, so I really don't think so. I suppose
>> > you do get this to some extent via early drop.
>>
>> I made simple hack on sch_fifo with per packet time limits
>> (attachment) this weekend and have been doing limited testing on
>> wireless link. I think hardlimit is fine, it's simple and does
>> somewhat same as what packet(-hard)limited buffer does, drops packets
>> when buffer is 'full'. My hack checks for timed out packets on
>> enqueue, might be wrong approach (on other hand might allow some more
>> burstiness).
>>
>
>
> Qdisc should return to caller a good indication packet is queued or
> dropped at enqueue() time... not later (aka : never)

Ok, it is an ugly hack ;) I got the idea of dropping the head from pfifo_head_drop.

>
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.

Ok.

>
> This is why I suggested using an EWMA plus a probabilistic drop or
> congestion indication (NET_XMIT_CN) to caller at enqueue() time.
>
> The absolute time limit you are trying to implement should be checked at
> dequeue time, to cope with enqueue bursts or pauses on wire.
>

Ok.




* Re: txqueuelen has wrong units; should be time
  2011-02-27 20:07         ` Eric Dumazet
  2011-02-27 21:32           ` Jussi Kivilinna
@ 2011-02-28 11:43           ` Jussi Kivilinna
  2011-02-28 13:10             ` Eric Dumazet
  2011-02-28 16:11           ` John W. Linville
  2 siblings, 1 reply; 43+ messages in thread
From: Jussi Kivilinna @ 2011-02-28 11:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev

Quoting Eric Dumazet <eric.dumazet@gmail.com>:

> On Sunday 27 February 2011 at 12:55 +0200, Jussi Kivilinna wrote:
>> Quoting Albert Cahalan <acahalan@gmail.com>:
>>
>> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet  
>> <eric.dumazet@gmail.com> wrote:
>> >> On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
>> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>> >>>
>> >>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> >>> > more than 4.2 seconds (32-bit unsigned) of queue.
>> > ...
>> >> Problem is some machines have slow High Resolution timing services.
>> >>
>> >> _If_ we have a time limit, it will probably use the low resolution (aka
>> >> jiffies), unless high resolution services are cheap.
>> >
>> > As long as that is totally internal to the kernel and never
>> > getting exposed by some API for setting the amount, sure.
>> >
>> >> I was thinking not having an absolute hard limit, but an EWMA based one.
>> >
>> > The whole point is to prevent stale packets, especially to prevent
>> > them from messing with TCP, so I really don't think so. I suppose
>> > you do get this to some extent via early drop.
>>
>> I made simple hack on sch_fifo with per packet time limits
>> (attachment) this weekend and have been doing limited testing on
>> wireless link. I think hardlimit is fine, it's simple and does
>> somewhat same as what packet(-hard)limited buffer does, drops packets
>> when buffer is 'full'. My hack checks for timed out packets on
>> enqueue, might be wrong approach (on other hand might allow some more
>> burstiness).
>>
>
>
> Qdisc should return to caller a good indication packet is queued or
> dropped at enqueue() time... not later (aka : never)
>
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.
>
> This is why I suggested using an EWMA plus a probabilistic drop or
> congestion indication (NET_XMIT_CN) to caller at enqueue() time.
>
> The absolute time limit you are trying to implement should be checked at
> dequeue time, to cope with enqueue bursts or pauses on wire.
>

Would it be better to implement this as generic feature instead of  
qdisc specific? Have qdisc_enqueue_root do ewma check:

static inline int qdisc_enqueue_root(struct sk_buff *skb, struct Qdisc *sch)
{
	int status, ret;

	qdisc_skb_cb(skb)->pkt_len = skb->len;
	if (likely(!sch->use_timeout)) {
ewma_ok:
		return qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
	}

	status = qdisc_check_ewma_status();
	if (status == ok)
		goto ewma_ok;

	if (status == overlimits)
		...drop...

	if (status == congestion) {
		ret = qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
		return (ret == success) ? NET_XMIT_CN : ret;
	}
}

And add qdisc_dequeue_root:

static inline struct sk_buff *qdisc_dequeue_root(struct Qdisc *sch)
{
	struct sk_buff *skb;

	skb = sch->dequeue(sch);

	if (skb && unlikely(sch->use_timeout))
		qdisc_update_ewma(skb);

	return skb;
}

Then the user could specify with tc whether any given qdisc uses a
timeout. Maybe even go as far as having some default timeout for the
default qdisc(?)

-Jussi


* Re: txqueuelen has wrong units; should be time
  2011-02-28 11:43           ` Jussi Kivilinna
@ 2011-02-28 13:10             ` Eric Dumazet
  2011-02-28 18:31               ` Jussi Kivilinna
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2011-02-28 13:10 UTC (permalink / raw)
  To: Jussi Kivilinna; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev

On Monday 28 February 2011 at 13:43 +0200, Jussi Kivilinna wrote:
> Quoting Eric Dumazet <eric.dumazet@gmail.com>:
> 
> > On Sunday 27 February 2011 at 12:55 +0200, Jussi Kivilinna wrote:
> >> Quoting Albert Cahalan <acahalan@gmail.com>:
> >>
> >> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet  
> >> <eric.dumazet@gmail.com> wrote:
> >> >> On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
> >> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
> >> >>>
> >> >>> > Nanoseconds seems fine; it's unlikely you'd ever want
> >> >>> > more than 4.2 seconds (32-bit unsigned) of queue.
> >> > ...
> >> >> Problem is some machines have slow High Resolution timing services.
> >> >>
> >> >> _If_ we have a time limit, it will probably use the low resolution (aka
> >> >> jiffies), unless high resolution services are cheap.
> >> >
> >> > As long as that is totally internal to the kernel and never
> >> > getting exposed by some API for setting the amount, sure.
> >> >
> >> >> I was thinking not having an absolute hard limit, but an EWMA based one.
> >> >
> >> > The whole point is to prevent stale packets, especially to prevent
> >> > them from messing with TCP, so I really don't think so. I suppose
> >> > you do get this to some extent via early drop.
> >>
> >> I made simple hack on sch_fifo with per packet time limits
> >> (attachment) this weekend and have been doing limited testing on
> >> wireless link. I think hardlimit is fine, it's simple and does
> >> somewhat same as what packet(-hard)limited buffer does, drops packets
> >> when buffer is 'full'. My hack checks for timed out packets on
> >> enqueue, might be wrong approach (on other hand might allow some more
> >> burstiness).
> >>
> >
> >
> > Qdisc should return to caller a good indication packet is queued or
> > dropped at enqueue() time... not later (aka : never)
> >
> > Accepting a packet at t0, and dropping it later at t0+limit without
> > giving any indication to caller is a problem.
> >
> > This is why I suggested using an EWMA plus a probabilistic drop or
> > congestion indication (NET_XMIT_CN) to caller at enqueue() time.
> >
> > The absolute time limit you are trying to implement should be checked at
> > dequeue time, to cope with enqueue bursts or pauses on wire.
> >
> 
> Would it be better to implement this as generic feature instead of  
> qdisc specific? Have qdisc_enqueue_root do ewma check:

Problem is you can have several virtual queues in a qdisc.

For example, pfifo_fast has 3 bands. You could have a global ewma with
high values, but you still want to let a high priority packet going
through...
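
The objection can be illustrated with a toy model in which each band keeps its own smoothed delay. This is a simplified userspace sketch, not how pfifo_fast is implemented: struct band_stat, global_avg_delay_ns and band_admit are hypothetical, and the "global" figure is taken as the plain mean of the per-band averages.

```c
#include <assert.h>
#include <stdint.h>

#define NBANDS 3	/* three bands; band 0 is the highest priority */

struct band_stat {
	uint64_t avg_delay_ns;	/* smoothed queueing delay of this band */
	uint64_t limit_ns;	/* delay this band is allowed to reach */
};

/* One figure for the whole qdisc, as a root-level check would see it
 * (simplified here to the mean of the per-band averages). */
static uint64_t global_avg_delay_ns(const struct band_stat *b)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < NBANDS; i++)
		sum += b[i].avg_delay_ns;
	return sum / NBANDS;
}

/* Per-band decision: only the packet's own band is consulted. */
static int band_admit(const struct band_stat *b, unsigned band)
{
	assert(band < NBANDS);
	return b[band].avg_delay_ns <= b[band].limit_ns;
}
```

If band 2 is backed up to 40 ms while band 0 sits at 1 us, the global average is far above a 5 ms limit, yet a band 0 packet would leave almost immediately. A check made once at the root cannot tell these cases apart.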


* Re: txqueuelen has wrong units; should be time
  2011-02-28 13:10             ` Eric Dumazet
@ 2011-02-28 18:31               ` Jussi Kivilinna
  0 siblings, 0 replies; 43+ messages in thread
From: Jussi Kivilinna @ 2011-02-28 18:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev

Quoting Eric Dumazet <eric.dumazet@gmail.com>:

> On Monday 28 February 2011 at 13:43 +0200, Jussi Kivilinna wrote:
>> Quoting Eric Dumazet <eric.dumazet@gmail.com>:
>>
>> > On Sunday 27 February 2011 at 12:55 +0200, Jussi Kivilinna wrote:
>> >> Quoting Albert Cahalan <acahalan@gmail.com>:
>> >>
>> >> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet
>> >> <eric.dumazet@gmail.com> wrote:
>> >> >> On Sunday 27 February 2011 at 08:02 +0100, Mikael Abrahamsson  
>> wrote:
>> >> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>> >> >>>
>> >> >>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> >> >>> > more than 4.2 seconds (32-bit unsigned) of queue.
>> >> > ...
>> >> >> Problem is some machines have slow High Resolution timing services.
>> >> >>
>> >> >> _If_ we have a time limit, it will probably use the low  
>> resolution (aka
>> >> >> jiffies), unless high resolution services are cheap.
>> >> >
>> >> > As long as that is totally internal to the kernel and never
>> >> > getting exposed by some API for setting the amount, sure.
>> >> >
>> >> >> I was thinking not having an absolute hard limit, but an EWMA  
>> based one.
>> >> >
>> >> > The whole point is to prevent stale packets, especially to prevent
>> >> > them from messing with TCP, so I really don't think so. I suppose
>> >> > you do get this to some extent via early drop.
>> >>
>> >> I made simple hack on sch_fifo with per packet time limits
>> >> (attachment) this weekend and have been doing limited testing on
>> >> wireless link. I think hardlimit is fine, it's simple and does
>> >> somewhat same as what packet(-hard)limited buffer does, drops packets
>> >> when buffer is 'full'. My hack checks for timed out packets on
>> >> enqueue, might be wrong approach (on other hand might allow some more
>> >> burstiness).
>> >>
>> >
>> >
>> > Qdisc should return to caller a good indication packet is queued or
>> > dropped at enqueue() time... not later (aka : never)
>> >
>> > Accepting a packet at t0, and dropping it later at t0+limit without
>> > giving any indication to caller is a problem.
>> >
>> > This is why I suggested using an EWMA plus a probabilistic drop or
>> > congestion indication (NET_XMIT_CN) to caller at enqueue() time.
>> >
>> > The absolute time limit you are trying to implement should be checked at
>> > dequeue time, to cope with enqueue bursts or pauses on wire.
>> >
>>
>> Would it be better to implement this as generic feature instead of
>> qdisc specific? Have qdisc_enqueue_root do ewma check:
>
> Problem is you can have several virtual queues in a qdisc.
>
> For example, pfifo_fast has 3 bands. You could have a global ewma with
> high values, but you still want to let a high priority packet going
> through...
>

Ok. It would be better to have ewma/timelimit at leaf qdisc.
(Or have in-middle-qdisc handling ewma/timelimit for leaf qdisc,  
sch_timelimit)

-Jussi


* Re: txqueuelen has wrong units; should be time
  2011-02-27 20:07         ` Eric Dumazet
  2011-02-27 21:32           ` Jussi Kivilinna
  2011-02-28 11:43           ` Jussi Kivilinna
@ 2011-02-28 16:11           ` John W. Linville
  2011-02-28 16:48             ` Eric Dumazet
  2 siblings, 1 reply; 43+ messages in thread
From: John W. Linville @ 2011-02-28 16:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson, linux-kernel,
	netdev

On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:

> Qdisc should return to caller a good indication packet is queued or
> dropped at enqueue() time... not later (aka : never)
> 
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.

Can you elaborate on what problem this causes?  Is it any worse than
if the packet is dropped at some later hop?

Is there any API that could report the drop to the sender (at
least a local one) without having to wait for the ack timeout?
Should there be?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: txqueuelen has wrong units; should be time
  2011-02-28 16:11           ` John W. Linville
@ 2011-02-28 16:48             ` Eric Dumazet
  2011-02-28 16:55               ` John W. Linville
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2011-02-28 16:48 UTC (permalink / raw)
  To: John W. Linville
  Cc: Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson, linux-kernel,
	netdev

On Monday 28 February 2011 at 11:11 -0500, John W. Linville wrote:
> On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:
> 
> > Qdisc should return to caller a good indication packet is queued or
> > dropped at enqueue() time... not later (aka : never)
> > 
> > Accepting a packet at t0, and dropping it later at t0+limit without
> > giving any indication to caller is a problem.
> 
> Can you elaborate on what problem this causes?  Is it any worse than
> if the packet is dropped at some later hop?
> 
> Is there any API that could report the drop to the sender (at
> least a local one) without having to wait for the ack timeout?
> Should there be?
> 

Not all protocols have ACKS ;)

dev_queue_xmit() returns an error code, some callers use it.
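
As a hedged illustration of how a local sender might use that return value, the userspace sketch below maps the NET_XMIT_* codes to an action. The three constants are copied here for illustration and should be checked against include/linux/netdevice.h in your tree; enum sender_action and react_to_xmit are hypothetical and do not correspond to any kernel API.

```c
#include <assert.h>

/* Userspace copies for illustration only; verify against the kernel's
 * include/linux/netdevice.h before relying on the values. */
#define NET_XMIT_SUCCESS	0x00
#define NET_XMIT_DROP		0x01	/* packet was dropped */
#define NET_XMIT_CN		0x02	/* congestion notification */

/* Hypothetical choices open to a local sender. */
enum sender_action { SEND_MORE, SLOW_DOWN, RETRY_LATER };

/* Decide what to do from the code returned at enqueue time. This
 * sketch covers only the NET_XMIT_* codes; negative errno values are
 * out of scope. */
static enum sender_action react_to_xmit(int rc)
{
	assert(rc >= 0);
	switch (rc) {
	case NET_XMIT_SUCCESS:
		return SEND_MORE;
	case NET_XMIT_CN:
		return SLOW_DOWN;	/* queued, but the queue is congested */
	default:
		return RETRY_LATER;	/* dropped locally; no ACK timeout needed */
	}
}
```

A packet dropped silently after it has been accepted gives the sender none of this information, which is the concern raised earlier in the thread.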





* Re: txqueuelen has wrong units; should be time
  2011-02-28 16:48             ` Eric Dumazet
@ 2011-02-28 16:55               ` John W. Linville
  2011-02-28 17:18                 ` Eric Dumazet
  2011-02-28 21:45                 ` John Heffner
  0 siblings, 2 replies; 43+ messages in thread
From: John W. Linville @ 2011-02-28 16:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson, linux-kernel,
	netdev

On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote:
> On Monday 28 February 2011 at 11:11 -0500, John W. Linville wrote:
> > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:
> > 
> > > Qdisc should return to caller a good indication packet is queued or
> > > dropped at enqueue() time... not later (aka : never)
> > > 
> > > Accepting a packet at t0, and dropping it later at t0+limit without
> > > giving any indication to caller is a problem.
> > 
> > Can you elaborate on what problem this causes?  Is it any worse than
> > if the packet is dropped at some later hop?
> > 
> > Is there any API that could report the drop to the sender (at
> > least a local one) without having to wait for the ack timeout?
> > Should there be?
> > 
> 
> Not all protocols have ACKS ;)
> 
> dev_queue_xmit() returns an error code, some callers use it.

Well, OK -- I agree it is best if you can return the status at
enqueue time.  The question becomes whether or not a dropped frame
is worse than living with high latency.  The answer, of course, still
seems to be a bit subjective.  But, if the admin has determined that
a link should be low latency...?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-28 16:55               ` John W. Linville
@ 2011-02-28 17:18                 ` Eric Dumazet
  2011-02-28 21:45                 ` John Heffner
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2011-02-28 17:18 UTC (permalink / raw)
  To: John W. Linville
  Cc: Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson, linux-kernel,
	netdev

On Monday, 28 February 2011 at 11:55 -0500, John W. Linville wrote:
> On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote:
> > On Monday, 28 February 2011 at 11:11 -0500, John W. Linville wrote:
> > > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:
> > > 
> > > > Qdisc should return to caller a good indication packet is queued or
> > > > dropped at enqueue() time... not later (aka : never)
> > > > 
> > > > Accepting a packet at t0, and dropping it later at t0+limit without
> > > > giving any indication to caller is a problem.
> > > 
> > > Can you elaborate on what problem this causes?  Is it any worse than
> > > if the packet is dropped at some later hop?
> > > 
> > > Is there any API that could report the drop to the sender (at
> > > least a local one) without having to wait for the ack timeout?
> > > Should there be?
> > > 
> > 
> > Not all protocols have ACKS ;)
> > 
> > dev_queue_xmit() returns an error code, some callers use it.
> 
> Well, OK -- I agree it is best if you can return the status at
> enqueue time.  The question becomes whether or not a dropped frame
> is worse than living with high latency.  The answer, of course, still
> seems to be a bit subjective.  But, if the admin has determined that
> a link should be low latency...?
> 

If the latency problem could be solved by an admin choice, it probably
would be there already.

The point is that the qdisc layer is able to immediately return an error
code to the caller, if the qdisc handlers are properly done. This can help
applications to react immediately to congestion notifications.

Some applications, even running on a "low latency link" can afford a
long delay for their packets. Should we introduce a socket API to give
the upper bound for the limit, or share a global 'per qdisc' limit ?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-28 16:55               ` John W. Linville
  2011-02-28 17:18                 ` Eric Dumazet
@ 2011-02-28 21:45                 ` John Heffner
  2011-03-01  4:11                   ` Albert Cahalan
  1 sibling, 1 reply; 43+ messages in thread
From: John Heffner @ 2011-02-28 21:45 UTC (permalink / raw)
  To: John W. Linville
  Cc: Eric Dumazet, Jussi Kivilinna, Albert Cahalan, Mikael Abrahamsson,
	linux-kernel, netdev

On Mon, Feb 28, 2011 at 11:55 AM, John W. Linville
<linville@tuxdriver.com> wrote:
> On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote:
>> On Monday, 28 February 2011 at 11:11 -0500, John W. Linville wrote:
>> > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:
>> >
>> > > Qdisc should return to caller a good indication packet is queued or
>> > > dropped at enqueue() time... not later (aka : never)
>> > >
>> > > Accepting a packet at t0, and dropping it later at t0+limit without
>> > > giving any indication to caller is a problem.
>> >
>> > Can you elaborate on what problem this causes?  Is it any worse than
>> > if the packet is dropped at some later hop?
>> >
>> > Is there any API that could report the drop to the sender (at
>> > least a local one) without having to wait for the ack timeout?
>> > Should there be?
>> >
>>
>> Not all protocols have ACKS ;)
>>
>> dev_queue_xmit() returns an error code, some callers use it.
>
> Well, OK -- I agree it is best if you can return the status at
> enqueue time.  The question becomes whether or not a dropped frame
> is worse than living with high latency.  The answer, of course, still
> seems to be a bit subjective.  But, if the admin has determined that
> a link should be low latency...?

Notably, TCP is one caller that uses the error code.  The error code
is functionally equivalent to ECN, one of whose great advantages is
reducing delay jitter.  If TCP didn't get the error, that would
effectively double the latency for a full window of data, since the
dropped segment would not be retransmitted for an RTT.

  -John

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-28 21:45                 ` John Heffner
@ 2011-03-01  4:11                   ` Albert Cahalan
  2011-03-01  4:18                     ` David Miller
  2011-03-01  5:01                     ` Eric Dumazet
  0 siblings, 2 replies; 43+ messages in thread
From: Albert Cahalan @ 2011-03-01  4:11 UTC (permalink / raw)
  To: John Heffner
  Cc: John W. Linville, Eric Dumazet, Jussi Kivilinna,
	Mikael Abrahamsson, linux-kernel, netdev

On Mon, Feb 28, 2011 at 4:45 PM, John Heffner <johnwheffner@gmail.com> wrote:
> On Mon, Feb 28, 2011 at 11:55 AM, John W. Linville
> <linville@tuxdriver.com> wrote:
>> On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote:
>>> On Monday, 28 February 2011 at 11:11 -0500, John W. Linville wrote:
>>> > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:

>>> > > Qdisc should return to caller a good indication packet is queued or
>>> > > dropped at enqueue() time... not later (aka : never)
>>> > >
>>> > > Accepting a packet at t0, and dropping it later at t0+limit without
>>> > > giving any indication to caller is a problem.
>>> >
>>> > Can you elaborate on what problem this causes?  Is it any worse than
>>> > if the packet is dropped at some later hop?
>>> >
>>> > Is there any API that could report the drop to the sender (at
>>> > least a local one) without having to wait for the ack timeout?
>>> > Should there be?
>>>
>>> Not all protocols have ACKS ;)
>>>
>>> dev_queue_xmit() returns an error code, some callers use it.
>>
>> Well, OK -- I agree it is best if you can return the status at
>> enqueue time.  The question becomes whether or not a dropped frame
>> is worse than living with high latency.  The answer, of course, still
>> seems to be a bit subjective.  But, if the admin has determined that
>> a link should be low latency...?
>
> Notably, TCP is one caller that uses the error code.  The error code
> is functionally equivalent to ECN, one of whose great advantages is
> reducing delay jitter.  If TCP didn't get the error, that would
> effectively double the latency for a full window of data, since the
> dropped segment would not be retransmitted for an RTT.

It sounds like you need a callback or similar, so that TCP can be
informed later that the drop has occurred.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01  4:11                   ` Albert Cahalan
@ 2011-03-01  4:18                     ` David Miller
  2011-03-01  6:54                       ` Albert Cahalan
  2011-03-01  5:01                     ` Eric Dumazet
  1 sibling, 1 reply; 43+ messages in thread
From: David Miller @ 2011-03-01  4:18 UTC (permalink / raw)
  To: acahalan
  Cc: johnwheffner, linville, eric.dumazet, jussi.kivilinna, swmike,
	linux-kernel, netdev

From: Albert Cahalan <acahalan@gmail.com>
Date: Mon, 28 Feb 2011 23:11:13 -0500

> It sounds like you need a callback or similar, so that TCP can be
> informed later that the drop has occurred.

By that point we could have already sent an entire RTT's worth
of data, or more.

It needs to be synchronous, otherwise performance suffers.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01  4:18                     ` David Miller
@ 2011-03-01  6:54                       ` Albert Cahalan
  2011-03-01  7:25                         ` David Miller
  2011-03-01  7:26                         ` Eric Dumazet
  0 siblings, 2 replies; 43+ messages in thread
From: Albert Cahalan @ 2011-03-01  6:54 UTC (permalink / raw)
  To: David Miller
  Cc: johnwheffner, linville, eric.dumazet, jussi.kivilinna, swmike,
	linux-kernel, netdev

On Mon, Feb 28, 2011 at 11:18 PM, David Miller <davem@davemloft.net> wrote:
> From: Albert Cahalan <acahalan@gmail.com>
> Date: Mon, 28 Feb 2011 23:11:13 -0500
>
>> It sounds like you need a callback or similar, so that TCP can be
>> informed later that the drop has occurred.
>
> By that point we could have already sent an entire RTT's worth
> of data, or more.
>
> It needs to be synchronous, otherwise performance suffers.

Ouch. OTOH, the current situation: performance suffers.

In case it makes you feel any better, consider two cases
where synchronous feedback is already impossible.
One is when you're routing packets that merely pass through.
The other is when some other box is doing that to you.
Either way, packets go bye-bye and nobody tells TCP.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01  6:54                       ` Albert Cahalan
@ 2011-03-01  7:25                         ` David Miller
  2011-03-01  7:26                         ` Eric Dumazet
  1 sibling, 0 replies; 43+ messages in thread
From: David Miller @ 2011-03-01  7:25 UTC (permalink / raw)
  To: acahalan
  Cc: johnwheffner, linville, eric.dumazet, jussi.kivilinna, swmike,
	linux-kernel, netdev

From: Albert Cahalan <acahalan@gmail.com>
Date: Tue, 1 Mar 2011 01:54:09 -0500

> In case it makes you feel any better, consider two cases
> where synchronous feedback is already impossible.
> One is when you're routing packets that merely pass through.
> The other is when some other box is doing that to you.
> Either way, packets go bye-bye and nobody tells TCP.

I consider ECN quite synchronous, and routers will set ECN bits to
propagate congestion information when they do or are about to drop
packets.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01  6:54                       ` Albert Cahalan
  2011-03-01  7:25                         ` David Miller
@ 2011-03-01  7:26                         ` Eric Dumazet
  2011-03-01 19:37                           ` Albert Cahalan
  2011-03-02  3:10                           ` Mikael Abrahamsson
  1 sibling, 2 replies; 43+ messages in thread
From: Eric Dumazet @ 2011-03-01  7:26 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: David Miller, johnwheffner, linville, jussi.kivilinna, swmike,
	linux-kernel, netdev

On Tuesday, 01 March 2011 at 01:54 -0500, Albert Cahalan wrote:
> On Mon, Feb 28, 2011 at 11:18 PM, David Miller <davem@davemloft.net> wrote:
> > From: Albert Cahalan <acahalan@gmail.com>
> > Date: Mon, 28 Feb 2011 23:11:13 -0500
> >
> >> It sounds like you need a callback or similar, so that TCP can be
> >> informed later that the drop has occurred.
> >
> > By that point we could have already sent an entire RTT's worth
> > of data, or more.
> >
> > It needs to be synchronous, otherwise performance suffers.
> 
> Ouch. OTOH, the current situation: performance suffers.
> 
> In case it makes you feel any better, consider two cases
> where synchronous feedback is already impossible.
> One is when you're routing packets that merely pass through.
> The other is when some other box is doing that to you.
> Either way, packets go bye-bye and nobody tells TCP.

So, in a hurry, we decide to drop packets blindly because the kernel took
the CPU to perform an urgent task?

Bufferbloat is a configuration/tuning problem, not an "everything must be
redone" problem. We add new qdiscs (CHOKe, SFB, QFQ, ...) and let admins
do their job. The problem is that most admins are unaware of the problems
and only buy more bandwidth.

And no, there is no "generic" solution, unless you have a lab with two
machines back to back (private link) and a known workload.

We might need some changes (including new APIs).

ECN is a forward step. Blindly dropping packets before ever sending them
is a step backward.

We should allow some traffic spikes, or many applications will stop
working. Unless all applications are fixed, we are stuck.

Only if the queue stays loaded for a long time (yet another parameter) can
we try to drop packets.
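
The "tolerate spikes, drop only on sustained load" idea could be sketched
roughly as follows. This is an illustrative userspace model under stated
assumptions, not an existing qdisc: the overload_state type, its fields
and the two parameters (a delay limit and a grace period) are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue state. */
struct overload_state {
	uint64_t delay_limit_us;	/* queueing delay considered too high */
	uint64_t grace_us;		/* how long it may stay high before dropping */
	uint64_t above_since_us;	/* when the delay first exceeded the limit */
	bool above;			/* is the delay currently above the limit? */
};

/* Return true when the caller should start dropping (or ECN-marking).
 * A short spike resets itself as soon as the delay falls back under
 * the limit; only a delay that stays high for grace_us triggers. */
bool overload_should_drop(struct overload_state *s, uint64_t now_us,
			  uint64_t delay_us)
{
	assert(s != NULL);
	if (delay_us <= s->delay_limit_us) {
		s->above = false;	/* spike ended: forgive it */
		return false;
	}
	if (!s->above) {
		s->above = true;	/* spike starts: remember when */
		s->above_since_us = now_us;
		return false;
	}
	return now_us - s->above_since_us >= s->grace_us;
}
```

With a 10 ms limit and a 100 ms grace period, a burst that clears within
100 ms is never penalized, while a standing queue starts seeing drops.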

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01  7:26                         ` Eric Dumazet
@ 2011-03-01 19:37                           ` Albert Cahalan
  2011-03-01 20:14                             ` Eric Dumazet
  2011-03-02  3:10                           ` Mikael Abrahamsson
  1 sibling, 1 reply; 43+ messages in thread
From: Albert Cahalan @ 2011-03-01 19:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, johnwheffner, linville, jussi.kivilinna, swmike,
	linux-kernel, netdev

On Tue, Mar 1, 2011 at 2:26 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday, 01 March 2011 at 01:54 -0500, Albert Cahalan wrote:
>> On Mon, Feb 28, 2011 at 11:18 PM, David Miller <davem@davemloft.net> wrote:
>> > From: Albert Cahalan <acahalan@gmail.com>

>> >> It sounds like you need a callback or similar, so that TCP can be
>> >> informed later that the drop has occurred.
>> >
>> > By that point we could have already sent an entire RTT's worth
>> > of data, or more.
>> >
>> > It needs to be synchronous, otherwise performance suffers.
>>
>> Ouch. OTOH, the current situation: performance suffers.
>>
>> In case it makes you feel any better, consider two cases
>> where synchronous feedback is already impossible.
>> One is when you're routing packets that merely pass through.
>> The other is when some other box is doing that to you.
>> Either way, packets go bye-bye and nobody tells TCP.
>
> So in a hurry we decide to drop packets blindly because kernel took the
> cpu to perform an urgent task ?

Yes. If the system can't handle the load, it needs to fess up.

> Bufferbloat is a configuration/tuning problem, not a "everything must be
> redone" problem. We add new qdiscs (CHOKe, SFB, QFQ, ...) and let admins
> do their job. Problem is most admins are unaware of the problems, and
> only buy more bandwidth.

We could at least do as well as Windows. >:-)

You can not expect some random Linux user to tune things
every time the link changes speed or the app mix changes.
What person NOT ON THIS MAILING LIST is going to mess
with their qdisc when they connect to a new access point
or switch from running Skype to running Netflix? Heck, how
many have any awareness of what a qdisc even is? Linux
networking needs to be excellent for people with no clue.

> We might need some changes (including new APIs).

If an app can't specify latency, adding the ability could
be nice. Still, stuff needs to JUST WORK more of the time.

> ECN is a forward step. Blindly dropping packets before ever sending them
> is a step backward.

Last I knew, ECN defaulted to a setting of "2" which means
it is only used in response. Perhaps it's time to change that.
It's been a while, with defective firewalls being replaced
by faster hardware.

> We should allow some trafic spikes, or many applications will stop
> working. Unless all applications are fixed, we are stuck.

Such applications would stop working...

1. across a switch
2. across an older router

We certainly should allow some traffic spikes. 1 to 10 ms of
traffic ought to do nicely. Hundreds or thousands of ms is
getting way beyond "spike".
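
As a rough worked example of why a packet count is a poor proxy for time,
the helper below converts a queue length into drain time and back. It is
a sketch under simplifying assumptions: it ignores physical-layer
framing, retransmissions and rate changes, and the function names are
made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds needed to drain `packets` packets of `bytes_per_packet`
 * bytes each over a link running at `rate_bps` bits per second. */
uint64_t queue_drain_us(uint64_t packets, uint64_t bytes_per_packet,
			uint64_t rate_bps)
{
	assert(rate_bps > 0);
	return packets * bytes_per_packet * 8u * 1000000u / rate_bps;
}

/* Whole packets of `bytes_per_packet` bytes that fit into a time budget
 * of `budget_us` microseconds at `rate_bps` bits per second. */
uint64_t packets_for_budget(uint64_t budget_us, uint64_t bytes_per_packet,
			    uint64_t rate_bps)
{
	assert(bytes_per_packet > 0);
	return budget_us * rate_bps / (bytes_per_packet * 8u * 1000000u);
}
```

For a queue of 1000 packets of 1500 bytes (values assumed for
illustration), queue_drain_us() gives 12 ms at 1 Gbit/s, about 222 ms at
54 Mbit/s and 12 s at 1 Mbit/s. Going the other way, a 10 ms budget at
54 Mbit/s holds 45 such packets.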

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01 19:37                           ` Albert Cahalan
@ 2011-03-01 20:14                             ` Eric Dumazet
  2011-03-01 20:16                               ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2011-03-01 20:14 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: David Miller, johnwheffner, linville, jussi.kivilinna, swmike,
	linux-kernel, netdev

On Tuesday, 01 March 2011 at 14:37 -0500, Albert Cahalan wrote:
> On Tue, Mar 1, 2011 at 2:26 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tuesday, 01 March 2011 at 01:54 -0500, Albert Cahalan wrote:
> >> On Mon, Feb 28, 2011 at 11:18 PM, David Miller <davem@davemloft.net> wrote:
> >> > From: Albert Cahalan <acahalan@gmail.com>
> 
> >> >> It sounds like you need a callback or similar, so that TCP can be
> >> >> informed later that the drop has occurred.
> >> >
> >> > By that point we could have already sent an entire RTT's worth
> >> > of data, or more.
> >> >
> >> > It needs to be synchronous, otherwise performance suffers.
> >>
> >> Ouch. OTOH, the current situation: performance suffers.
> >>
> >> In case it makes you feel any better, consider two cases
> >> where synchronous feedback is already impossible.
> >> One is when you're routing packets that merely pass through.
> >> The other is when some other box is doing that to you.
> >> Either way, packets go bye-bye and nobody tells TCP.
> >
> > So in a hurry we decide to drop packets blindly because kernel took the
> > cpu to perform an urgent task ?
> 
> Yes. If the system can't handle the load, it needs to fess up.
> 
> > Bufferbloat is a configuration/tuning problem, not a "everything must be
> > redone" problem. We add new qdiscs (CHOKe, SFB, QFQ, ...) and let admins
> > do their job. Problem is most admins are unaware of the problems, and
> > only buy more bandwidth.
> 
> We could at least do as well as Windows. >:-)
> 
> You can not expect some random Linux user to tune things
> every time the link changes speed or the app mix changes.
> What person NOT ON THIS MAILING LIST is going to mess
> with their qdisc when they connect to a new access point
> or switch from running Skype to running Netflix? Heck, how
> many have any awareness of what a qdisk even is? Linux
> networking needs to be excellent for people with no clue.
> 
> > We might need some changes (including new APIs).
> 
> If an app can't specify latency, adding the ability could
> be nice. Still, stuff needs to JUST WORK more of the time.
> 
> > ECN is a forward step. Blindly dropping packets before ever sending them
> > is a step backward.
> 
> Last I knew, ECN defaulted to a setting of "2" which means
> it is only used in response. Perhaps it's time to change that.
> It's been a while, with defective firewalls being replaced
> by faster hardware.
> 
> > We should allow some trafic spikes, or many applications will stop
> > working. Unless all applications are fixed, we are stuck.
> 
> Such applications would stop working...
> 
> 1. across a switch
> 2. across an older router
> 
> We certainly should allow some traffic spikes. 1 to 10 ms of
> traffic ought to do nicely. Hundreds or thousands of ms is
> getting way beyond "spike".

OK.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01 20:14                             ` Eric Dumazet
@ 2011-03-01 20:16                               ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2011-03-01 20:16 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: David Miller, johnwheffner, linville, jussi.kivilinna, swmike,
	linux-kernel, netdev

On Tuesday, 01 March 2011 at 21:14 +0100, Eric Dumazet wrote:
> On Tuesday, 01 March 2011 at 14:37 -0500, Albert Cahalan wrote:
> > 
> > We certainly should allow some traffic spikes. 1 to 10 ms of
> > traffic ought to do nicely. Hundreds or thousands of ms is
> > getting way beyond "spike".
> 
> OK.

Hmm, user error, hit wrong button, sorry.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01  7:26                         ` Eric Dumazet
  2011-03-01 19:37                           ` Albert Cahalan
@ 2011-03-02  3:10                           ` Mikael Abrahamsson
  2011-03-02 20:25                             ` Chris Friesen
  1 sibling, 1 reply; 43+ messages in thread
From: Mikael Abrahamsson @ 2011-03-02  3:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Albert Cahalan, David Miller, johnwheffner, linville,
	jussi.kivilinna, linux-kernel, netdev

On Tue, 1 Mar 2011, Eric Dumazet wrote:

> We should allow some trafic spikes, or many applications will stop
> working. Unless all applications are fixed, we are stuck.
>
> Only if the queue stay loaded a long time (yet another parameter) we can
> try to drop packets.

Are we talking forwarding of packets or originating them ourselves, or 
trying to use the same mechanism for both?

In the case of routing a packet, I envision a WRED kind of behaviour is 
the most efficient.

<http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/12stbwr.html>

"QoS: Time-Based Thresholds for WRED and Queue Limit for the Cisco 12000
Series Router" You can set the drop probabilities in milliseconds.
Unfortunately ECN isn't supported on this platform, but on other platforms
it can be configured and used instead of WRED dropping packets.

For the case when we're originating the traffic ourselves (for instance to
a wifi card with varying speed and jitter due to retransmits on the wifi
layer), I think using the same mechanisms is taking the too-easy way out
(dropping packets or marking ECN on our own originated packets seems
really weird). Here we should be able to push information back to the
applications somehow and prioritize between flows, since we're sitting
on all the information ourselves, including the application.

For this case, I think there is something to be learnt from:

<http://www.cisco.com/en/US/tech/tk39/tk824/technologies_tech_note09186a00800fbafc.shtml>

Here you have the IP part and the ATM part, and you can limit the number
of cells/packets sent to the ATM hardware at any given time (this queue is
FIFO, so there is no AQM once the packet has been sent here). We need the
same here: to properly keep latency down and make AQM work, the hardware
FIFO queue needs to be kept short.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-02  3:10                           ` Mikael Abrahamsson
@ 2011-03-02 20:25                             ` Chris Friesen
  0 siblings, 0 replies; 43+ messages in thread
From: Chris Friesen @ 2011-03-02 20:25 UTC (permalink / raw)
  To: Mikael Abrahamsson
  Cc: Eric Dumazet, Albert Cahalan, David Miller, johnwheffner,
	linville, jussi.kivilinna, linux-kernel, netdev

On 03/01/2011 09:10 PM, Mikael Abrahamsson wrote:

> For the case when we're ourselves originating the traffic (for instance to 
> a wifi card with varying speed and jitter due to retransmits on the wifi 
> layer), I think it's taking the too easy way out to use the same 
> mechanisms (dropping packets or marking ECN for our own originated packets 
> seems really weird), here we should be able to pushback information to the 
> applications somehow and do prioritization between flows since we're 
> sitting on all information ourselves including the application.

Doesn't the socket tx buffer give all the app pushback necessary?
(Assuming it's set to a sane value.)
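
For reference, the socket send buffer mentioned here is set per socket
with SO_SNDBUF. The helper below is a small sketch (the function name is
hypothetical) that requests a size and reads back what the kernel
actually granted; on Linux the granted value is typically double the
request, because the kernel reserves room for bookkeeping overhead, and
it is capped by the wmem_max sysctl.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Request a send-buffer size on a fresh UDP socket and return the size
 * the kernel actually granted, or -1 on error. */
int sndbuf_after_request(int requested_bytes)
{
	int granted = 0;
	socklen_t len = sizeof(granted);
	int fd;

	assert(requested_bytes > 0);
	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
		       &requested_bytes, sizeof(requested_bytes)) < 0 ||
	    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &granted, &len) < 0)
		granted = -1;

	close(fd);
	return granted;
}
```

Once the buffer is full, a blocking send() sleeps and a non-blocking one
fails with EAGAIN, which is the pushback being referred to.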

We should certainly do prioritization between flows.  Perhaps if no
other information is available the scheduler priority could be used?

Chris

-- 
Chris Friesen
Software Developer
GENBAND
chris.friesen@genband.com
www.genband.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01  4:11                   ` Albert Cahalan
  2011-03-01  4:18                     ` David Miller
@ 2011-03-01  5:01                     ` Eric Dumazet
  2011-03-01  5:36                       ` Eric Dumazet
  1 sibling, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2011-03-01  5:01 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: John Heffner, John W. Linville, Jussi Kivilinna,
	Mikael Abrahamsson, linux-kernel, netdev

On Monday, 28 February 2011 at 23:11 -0500, Albert Cahalan wrote:

> It sounds like you need a callback or similar, so that TCP can be
> informed later that the drop has occurred.

There is the thing called skb destructor / skb_orphan() mess, that is
not stackable... Might extend this to something more clever, and be able
to call functions (into TCP stack for example) giving a status of skb :
Sent, or dropped somewhere in the stack...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01  5:01                     ` Eric Dumazet
@ 2011-03-01  5:36                       ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2011-03-01  5:36 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: John Heffner, John W. Linville, Jussi Kivilinna,
	Mikael Abrahamsson, linux-kernel, netdev

On Tuesday, 01 March 2011 at 06:01 +0100, Eric Dumazet wrote:
> On Monday, 28 February 2011 at 23:11 -0500, Albert Cahalan wrote:
> 
> > It sounds like you need a callback or similar, so that TCP can be
> > informed later that the drop has occurred.
> 
> There is the thing called skb destructor / skb_orphan() mess, that is
> not stackable... Might extend this to something more clever, and be able
> to call functions (into TCP stack for example) giving a status of skb :
> Sent, or dropped somewhere in the stack...
> 

One problem with such a scheme is the huge extra cost involved: extra
locking, extra memory allocations, extra atomic operations...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-27 10:55       ` Jussi Kivilinna
  2011-02-27 20:07         ` Eric Dumazet
@ 2011-02-27 23:33         ` Albert Cahalan
  2011-02-28 11:23           ` Jussi Kivilinna
  2011-02-28 15:38           ` txqueuelen has wrong units; should be time Hagen Paul Pfeifer
  1 sibling, 2 replies; 43+ messages in thread
From: Albert Cahalan @ 2011-02-27 23:33 UTC (permalink / raw)
  To: Jussi Kivilinna; +Cc: Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev

On Sun, Feb 27, 2011 at 5:55 AM, Jussi Kivilinna
<jussi.kivilinna@mbnet.fi> wrote:

> I made simple hack on sch_fifo with per packet time limits (attachment) this
> weekend and have been doing limited testing on wireless link. I think
> hardlimit is fine, it's simple and does somewhat same as what
> packet(-hard)limited buffer does, drops packets when buffer is 'full'. My
> hack checks for timed out packets on enqueue, might be wrong approach (on
> other hand might allow some more burstiness).

Thanks!

I think the default is too high. 1 ms may even be a bit high.

I suppose there is a need to allow at least 2 packets despite any
time limits, so that it remains possible to use a traditional modem
even if a huge packet takes several seconds to send.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-27 23:33         ` Albert Cahalan
@ 2011-02-28 11:23           ` Jussi Kivilinna
  2011-03-02 21:54             ` [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency John W. Linville
  2011-02-28 15:38           ` txqueuelen has wrong units; should be time Hagen Paul Pfeifer
  1 sibling, 1 reply; 43+ messages in thread
From: Jussi Kivilinna @ 2011-02-28 11:23 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1170 bytes --]

Quoting Albert Cahalan <acahalan@gmail.com>:

> On Sun, Feb 27, 2011 at 5:55 AM, Jussi Kivilinna
> <jussi.kivilinna@mbnet.fi> wrote:
>
>> I made simple hack on sch_fifo with per packet time limits (attachment) this
>> weekend and have been doing limited testing on wireless link. I think
>> hardlimit is fine, it's simple and does somewhat same as what
>> packet(-hard)limited buffer does, drops packets when buffer is 'full'. My
>> hack checks for timed out packets on enqueue, might be wrong approach (on
>> other hand might allow some more burstiness).
>
> Thanks!
>
> I think the default is too high. 1 ms may even be a bit high.

Well, with a 10 ms buffer timeout, latency drops from >500 ms to 10-20 ms
on a 54 Mbit wifi link (zd1211rw driver), measured as ping RTT while
iperf is running at the same time. So for that it's good enough.

>
> I suppose there is a need to allow at least 2 packets despite any
> time limits, so that it remains possible to use a traditional modem
> even if a huge packet takes several seconds to send.
>

I made an EWMA version of my fifo hack (attached). I added a minimum
queue limit of 2 packets and probabilistic (1%) ECN marking/dropping
once the average latency exceeds timeout/2.
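
For readers unfamiliar with EWMA, the averaging step can be illustrated
with the userspace sketch below. It is not the kernel's
<linux/average.h> implementation; the toy_ewma type and the integer
weighting are assumptions chosen to keep the arithmetic easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical integer EWMA: each new sample contributes 1/weight. */
struct toy_ewma {
	uint64_t avg;		/* current average */
	unsigned int weight;	/* larger weight means slower reaction */
	bool primed;		/* false until the first sample arrives */
};

void toy_ewma_add(struct toy_ewma *e, uint64_t sample)
{
	assert(e != NULL && e->weight > 0);
	if (!e->primed) {
		/* The first sample initializes the average directly. */
		e->avg = sample;
		e->primed = true;
		return;
	}
	e->avg = (e->avg * (e->weight - 1) + sample) / e->weight;
}
```

With weight 8, a single slow dequeue moves the average by only one
eighth of the difference, so one delayed packet does not trigger the
drop logic by itself.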

-Jussi


[-- Attachment #2: sch_fifo_ewma.c --]
[-- Type: text/x-csrc, Size: 7809 bytes --]

/*
 * sch_fifo_ewma.c	Simple FIFO EWMA timelimit queue.
 *
 * This program is free software; you can redistribute it and/or modify it under
 * the terms of the GNU General Public License as published by the Free Software
 * Foundation; either version 2 of the License, or (at your option) any later
 * version.
 *
 */

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/pkt_sched.h>
#include <net/inet_ecn.h>

#include <linux/version.h>
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2, 6, 37)
#include "average.h"
#else
#include <linux/average.h>
#endif

#define DEFAULT_PKT_TIMEOUT_MS		10
#define DEFAULT_PKT_TIMEOUT		PSCHED_NS2TICKS(NSEC_PER_MSEC * \
							DEFAULT_PKT_TIMEOUT_MS)
#define DEFAULT_PROB_HALF_DROP		10	/* 1% */

#define FIFO_EWMA_MIN_QDISC_LEN		2

struct tc_fifo_ewma_qopt {
	__u64	timeout;	/* Max time packet may stay in buffer */
	__u32   limit;		/* Queue length: bytes for bfifo, packets for pfifo */
};

struct fifo_ewma_skb_cb {
	psched_time_t	time_queued;
};

struct fifo_ewma_sched_data {
	psched_tdiff_t	timeout;
	u32		limit;
	struct ewma	ewma;
};

static inline
struct fifo_ewma_skb_cb *fifo_ewma_skb_cb(struct sk_buff *skb)
{
	BUILD_BUG_ON(sizeof(skb->cb) <
		sizeof(struct qdisc_skb_cb) +
			sizeof(struct fifo_ewma_skb_cb));
	return (struct fifo_ewma_skb_cb *)qdisc_skb_cb(skb)->data;
}

static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (likely(skb_queue_len(&sch->q) < q->limit))
		return qdisc_enqueue_tail(skb, sch);

	/* queue full, remove one skb to fulfill the limit */
	__qdisc_queue_drop_head(sch, &sch->q);
	sch->qstats.drops++;
	qdisc_enqueue_tail(skb, sch);

	return NET_XMIT_CN;
}

static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (likely(sch->qstats.backlog + qdisc_pkt_len(skb) <= q->limit))
		return qdisc_enqueue_tail(skb, sch);

	return qdisc_reshape_fail(skb, sch);
}

static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (likely(skb_queue_len(&sch->q) < q->limit))
		return qdisc_enqueue_tail(skb, sch);

	return qdisc_reshape_fail(skb, sch);
}

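/* Return a pseudo-random value in [0, 1000], compared against per-mille
 * thresholds such as DEFAULT_PROB_HALF_DROP. */
static inline int fifo_get_prob(void)
{
	return (net_random() & 0xffff) * 1000 / 0xffff;
}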

static struct sk_buff *fifo_ewma_dequeue(struct Qdisc* sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;
	psched_tdiff_t tdiff;

	if (likely(!q->timeout))
		goto no_ewma;

	skb = qdisc_peek_head(sch);
	if (!skb)
		return NULL;

	/* update EWMA */
	tdiff = psched_get_time() - fifo_ewma_skb_cb(skb)->time_queued;
	ewma_add(&q->ewma, tdiff);

no_ewma:
	return qdisc_dequeue_head(sch);
}

#define FIFO_EWMA_OK	0
#define FIFO_EWMA_DROP	1
#define FIFO_EWMA_CN	2

static int fifo_check_ewma_drop(struct sk_buff *skb, struct Qdisc *sch)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	unsigned long fifo_latency_avg;
	int ret = FIFO_EWMA_OK;

	if (likely(!q->timeout))
		goto no_ewma;

	/* lower limit */
	if (skb_queue_len(&sch->q) <= FIFO_EWMA_MIN_QDISC_LEN)
		goto no_drop;

	fifo_latency_avg = ewma_read(&q->ewma);

	/* hard drop */
	if (fifo_latency_avg > q->timeout) {
		/*printk(KERN_WARNING "fifo_ewma: hard drop\n");*/
		return FIFO_EWMA_DROP;
	}

	/* probabilistic drop */
	if (fifo_latency_avg > q->timeout / 2 &&
				fifo_get_prob() < DEFAULT_PROB_HALF_DROP) {
		if (!INET_ECN_set_ce(skb)) {
			/*printk(KERN_WARNING "fifo_ewma: prob drop\n");*/
			return FIFO_EWMA_DROP;
		}

		/*printk(KERN_WARNING "fifo_ewma: prob mark\n");*/
		ret = FIFO_EWMA_CN;
	}

no_drop:
	fifo_ewma_skb_cb(skb)->time_queued = psched_get_time();
no_ewma:
	return ret;
}

static int pfifo_ewma_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	int ewma_action, ret;

	ewma_action = fifo_check_ewma_drop(skb, sch);
	if (unlikely(ewma_action == FIFO_EWMA_DROP))
		return qdisc_drop(skb, sch);

	ret = pfifo_tail_enqueue(skb, sch);
	if (unlikely(ret != NET_XMIT_SUCCESS))
		return ret;

	return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}

static int bfifo_ewma_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	int ewma_action, ret;

	ewma_action = fifo_check_ewma_drop(skb, sch);
	if (unlikely(ewma_action == FIFO_EWMA_DROP))
		return qdisc_drop(skb, sch);

	ret = bfifo_enqueue(skb, sch);
	if (unlikely(ret != NET_XMIT_SUCCESS))
		return ret;

	return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}

static int pfifo_ewma_enqueue(struct sk_buff *skb, struct Qdisc* sch)
{
	int ewma_action, ret;

	ewma_action = fifo_check_ewma_drop(skb, sch);
	if (unlikely(ewma_action == FIFO_EWMA_DROP))
		return qdisc_drop(skb, sch);

	ret = pfifo_enqueue(skb, sch);
	if (unlikely(ret != NET_XMIT_SUCCESS))
		return ret;

	return unlikely(ewma_action == FIFO_EWMA_CN) ? NET_XMIT_CN : ret;
}

static int fifo_ewma_init(struct Qdisc *sch, struct nlattr *opt)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);

	if (opt == NULL) {
		u32 limit = qdisc_dev(sch)->tx_queue_len ? : 1;

		q->limit = limit;
		q->timeout = DEFAULT_PKT_TIMEOUT;
	} else {
		struct tc_fifo_ewma_qopt *ctl = nla_data(opt);

		if (nla_len(opt) < sizeof(*ctl))
			return -EINVAL;

		q->limit = ctl->limit;
		q->timeout = ctl->timeout ? : DEFAULT_PKT_TIMEOUT;
	}

	ewma_init(&q->ewma, 1, 64);

	return 0;
}

static int fifo_ewma_dump(struct Qdisc *sch, struct sk_buff *skb)
{
	struct fifo_ewma_sched_data *q = qdisc_priv(sch);
	struct tc_fifo_ewma_qopt opt = {
		.limit = q->limit,
		.timeout = q->timeout
	};

	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
	return skb->len;

nla_put_failure:
	return -1;
}

static struct Qdisc_ops pfifo_ewma_qdisc_ops __read_mostly = {
	.id		=	"pfifo_ewma",
	.priv_size	=	sizeof(struct fifo_ewma_sched_data),
	.enqueue	=	pfifo_ewma_enqueue,
	.dequeue	=	fifo_ewma_dequeue,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop,
	.init		=	fifo_ewma_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_ewma_init,
	.dump		=	fifo_ewma_dump,
	.owner		=	THIS_MODULE,
};

static struct Qdisc_ops bfifo_ewma_qdisc_ops __read_mostly = {
	.id		=	"bfifo_ewma",
	.priv_size	=	sizeof(struct fifo_ewma_sched_data),
	.enqueue	=	bfifo_ewma_enqueue,
	.dequeue	=	fifo_ewma_dequeue,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop,
	.init		=	fifo_ewma_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_ewma_init,
	.dump		=	fifo_ewma_dump,
	.owner		=	THIS_MODULE,
};

static struct Qdisc_ops pfifo_head_drop_ewma_qdisc_ops __read_mostly = {
	.id		=	"pfifo_hd_ewma",
	.priv_size	=	sizeof(struct fifo_ewma_sched_data),
	.enqueue	=	pfifo_ewma_tail_enqueue,
	.dequeue	=	fifo_ewma_dequeue,
	.peek		=	qdisc_peek_head,
	.drop		=	qdisc_queue_drop_head,
	.init		=	fifo_ewma_init,
	.reset		=	qdisc_reset_queue,
	.change		=	fifo_ewma_init,
	.dump		=	fifo_ewma_dump,
	.owner		=	THIS_MODULE,
};

static int __init fifo_ewma_module_init(void)
{
	int retval;

	retval = register_qdisc(&pfifo_ewma_qdisc_ops);
	if (retval)
		return retval;
	retval = register_qdisc(&bfifo_ewma_qdisc_ops);
	if (retval)
		goto err_bfifo;
	retval = register_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
	if (retval)
		goto err_head_drop;

	return 0;

	/* unwind only the registrations that succeeded */
err_head_drop:
	unregister_qdisc(&bfifo_ewma_qdisc_ops);
err_bfifo:
	unregister_qdisc(&pfifo_ewma_qdisc_ops);
	return retval;
}

static void __exit fifo_ewma_module_exit(void)
{
	unregister_qdisc(&pfifo_ewma_qdisc_ops);
	unregister_qdisc(&bfifo_ewma_qdisc_ops);
	unregister_qdisc(&pfifo_head_drop_ewma_qdisc_ops);
}

module_init(fifo_ewma_module_init)
module_exit(fifo_ewma_module_exit)
MODULE_LICENSE("GPL");

#include <linux/version.h>
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2, 6, 37)
#include "average.c"
#endif
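
The drop decision in fifo_check_ewma_drop() can be modelled in user space
as a rough sketch. Everything below is hypothetical illustration code, not
part of the module: struct ewma_model and its helpers stand in for the
kernel's struct ewma, and the averaging formula is an assumption (a plain
weighted average with integer division); the real ewma_add() may differ in
detail. The 1% random draw is left out so that the result is deterministic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the kernel's struct ewma. */
struct ewma_model {
	uint64_t internal;	/* scaled running average */
	uint64_t factor;	/* scaling factor */
	uint64_t weight;	/* larger weight means slower reaction */
};

static void ewma_model_init(struct ewma_model *e, uint64_t factor,
			    uint64_t weight)
{
	assert(factor > 0 && weight > 0);
	e->internal = 0;
	e->factor = factor;
	e->weight = weight;
}

/* Assumed formula: the first sample seeds the average, later ones blend in. */
static void ewma_model_add(struct ewma_model *e, uint64_t val)
{
	if (e->internal == 0)
		e->internal = val * e->factor;
	else
		e->internal = (e->internal * (e->weight - 1) +
			       val * e->factor) / e->weight;
}

static uint64_t ewma_model_read(const struct ewma_model *e)
{
	return e->internal / e->factor;
}

enum verdict { VERDICT_OK, VERDICT_MAYBE_DROP, VERDICT_DROP };

/* Same thresholds as the module: hard drop above timeout, and a
 * candidate for probabilistic drop or ECN mark above timeout / 2. */
static enum verdict classify(const struct ewma_model *e, uint64_t timeout)
{
	uint64_t avg = ewma_model_read(e);

	if (avg > timeout)
		return VERDICT_DROP;
	if (avg > timeout / 2)
		return VERDICT_MAYBE_DROP;
	return VERDICT_OK;
}
```

With a weight of 64, as passed to ewma_init() above, a single slow packet
barely moves the average in this model, so it takes a sustained run of high
queueing delays before the thresholds are crossed.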


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency
  2011-02-28 11:23           ` Jussi Kivilinna
@ 2011-03-02 21:54             ` John W. Linville
  2011-03-02 22:08               ` John W. Linville
  2011-03-03 12:51               ` Eric Dumazet
  0 siblings, 2 replies; 43+ messages in thread
From: John W. Linville @ 2011-03-02 21:54 UTC (permalink / raw)
  To: netdev; +Cc: bloat-devel, John W. Linville

This is a qdisc based on the existing pfifo_fast code.  The difference
is that this qdisc limits the dequeue rate based on estimates of how
many packets can be in-flight at a given time while maintaining a target
link latency.

This work is based on the eBDP documented in Section IV of "Buffer
Sizing for 802.11 Based Networks" by Tianji Li, et al.

	http://www.hamilton.ie/tianji_li/buffersizing.pdf

This implementation timestamps an skb as it dequeues it, then
computes the service time when the frame is freed by the driver.
An exponentially weighted moving average of per fragment service times
is used to restrict queueing delays in hopes of achieving a target
fragment transmission latency.  The skb->destructor mechanism is
abused in order to obtain packet service time estimates.

Signed-off-by: John W. Linville <linville@tuxdriver.com>
---
I took a whack at reimplementing my eBDP patch at the qdisc level.
Unfortunately, it doesn't seem to work very well and I'm at a loss
as to why... :-( Comments welcome -- maybe I'm doing something really
stupid in the math and just can't see it.

The skb->destructor abuse includes adding a union member in the skb
to record the qdisc->handle on the way out so that it can be used for
accounting in the destructor -- thanks to Neil Horman for the
suggestion!

The reason I think this is an idea worth exploring is that existing
qdisc code doesn't seem to account for the fact that the devices could
be doing a lot of queueing behind them.  Even Jussi's recent
sch_fifo_ewma post doesn't seem to take into account how long the device
holds on to packets, which limits his ability to fight latency.

Anyway, all comments appreciated!
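
One property of the in-flight limit arithmetic can be checked outside the
kernel. The sketch below is hypothetical user-space code that mirrors the
calculation in pfifo_lat_skb_free() with a 2 ms target, plus an alternative
ordering that multiplies before dividing. The function names are invented
for illustration and the signed max_t(int, ...) cast in the patch is
simplified to unsigned math. It only demonstrates integer truncation; it is
not offered as a diagnosis of why the qdisc misbehaves.

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_LATENCY_NS	(2u * 1000u * 1000u)	/* 2 ms, as in the patch */
#define INFLIGHT_FLOOR		2u

/* Ordering used in the patch: divide first, then multiply. */
static unsigned int inflight_max_divide_first(unsigned int inflight,
					      unsigned int tserv_ns)
{
	unsigned int mult, limit;

	assert(tserv_ns > 0);
	mult = TARGET_LATENCY_NS / tserv_ns;	/* 0 whenever tserv_ns > 2 ms */
	limit = inflight * mult;
	return limit > INFLIGHT_FLOOR ? limit : INFLIGHT_FLOOR;
}

/* Hypothetical alternative: multiply first, in 64 bits, then divide. */
static unsigned int inflight_max_multiply_first(unsigned int inflight,
						unsigned int tserv_ns)
{
	uint64_t limit;

	assert(tserv_ns > 0);
	limit = (uint64_t)inflight * TARGET_LATENCY_NS / tserv_ns;
	if (limit > UINT32_MAX)
		limit = UINT32_MAX;
	return limit > INFLIGHT_FLOOR ? (unsigned int)limit : INFLIGHT_FLOOR;
}
```

With 10 packets in flight and a 3 ms average service time, the first
ordering collapses to the floor of 2 while the second gives 6; the two agree
whenever the target is an exact multiple of the service time.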

 include/linux/skbuff.h  |    2 +
 include/net/pkt_sched.h |    1 +
 net/sched/sch_api.c     |    1 +
 net/sched/sch_fifo.c    |  131 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 135 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bf221d6..d99861e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -296,6 +296,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@end: End pointer
  *	@destructor: Destruct function
  *	@mark: Generic packet mark
+ *	@qdhandle: handle of leaf qdisc that handled skb
  *	@nfct: Associated connection, if any
  *	@ipvs_property: skbuff is owned by ipvs
  *	@peeked: this packet has been seen already, so stats have been
@@ -407,6 +408,7 @@ struct sk_buff {
 	union {
 		__u32		mark;
 		__u32		dropcount;
+		__u32		qdhandle;
 	};
 
 	__u16			vlan_tci;
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index d9549af..93189f6 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -72,6 +72,7 @@ extern void qdisc_watchdog_cancel(struct qdisc_watchdog *wd);
 extern struct Qdisc_ops pfifo_qdisc_ops;
 extern struct Qdisc_ops bfifo_qdisc_ops;
 extern struct Qdisc_ops pfifo_head_drop_qdisc_ops;
+extern struct Qdisc_ops pfifo_lat_qdisc_ops;
 
 extern int fifo_set_limit(struct Qdisc *q, unsigned int limit);
 extern struct Qdisc *fifo_create_dflt(struct Qdisc *sch, struct Qdisc_ops *ops,
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index b22ca2d..9c9ba9a 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1769,6 +1769,7 @@ static int __init pktsched_init(void)
 	register_qdisc(&pfifo_qdisc_ops);
 	register_qdisc(&bfifo_qdisc_ops);
 	register_qdisc(&pfifo_head_drop_qdisc_ops);
+	register_qdisc(&pfifo_lat_qdisc_ops);
 	register_qdisc(&mq_qdisc_ops);
 
 	rtnl_register(PF_UNSPEC, RTM_NEWQDISC, tc_modify_qdisc, NULL);
diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c
index d468b47..0d2cb48 100644
--- a/net/sched/sch_fifo.c
+++ b/net/sched/sch_fifo.c
@@ -15,6 +15,7 @@
 #include <linux/kernel.h>
 #include <linux/errno.h>
 #include <linux/skbuff.h>
+#include <linux/average.h>
 #include <net/pkt_sched.h>
 
 /* 1 band FIFO pseudo-"scheduler" */
@@ -24,6 +25,20 @@ struct fifo_sched_data
 	u32 limit;
 };
 
+/*
+ * Private data for a pfifo_lat scheduler containing:
+ *	- embedded fifo private data
+ *	- EWMA of skb service time
+ *	- count of currently in-flight skbs
+ *	- maximum number of in-flight skbs
+ */
+struct pfifo_lat_data {
+	struct fifo_sched_data q;
+	struct ewma tserv;
+	unsigned int inflight;
+	unsigned int inflight_max;
+};
+
 static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc* sch)
 {
 	struct fifo_sched_data *q = qdisc_priv(sch);
@@ -59,6 +74,86 @@ static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
 	return NET_XMIT_CN;
 }
 
+static int pfifo_lat_enqueue(struct sk_buff *skb, struct Qdisc* sch)
+{
+	struct pfifo_lat_data *priv = qdisc_priv(sch);
+
+	/* include inflight count when checking queue length limit */
+	if (skb_queue_len(&sch->q) + priv->inflight < priv->q.limit)
+		return qdisc_enqueue_tail(skb, sch);
+
+	return qdisc_reshape_fail(skb, sch);
+}
+
+static void pfifo_lat_skb_free(struct sk_buff *skb)
+{
+	struct Qdisc *qdisc = qdisc_lookup(skb->dev, skb->qdhandle);
+	struct pfifo_lat_data *priv = qdisc_priv(qdisc);
+	unsigned int tserv_ns, inflight_mult;
+
+	/*
+	 * grab timestamp info for buffer control estimates and factor
+	 * that into service time estimate for this queue
+	 */
+	ewma_add(&priv->tserv,
+		 ktime_to_ns(ktime_sub(ktime_get(), skb->tstamp)));
+	tserv_ns = ewma_read(&priv->tserv);
+	if (tserv_ns) {
+		/* calculate multiplier between tserv and target latency */
+		inflight_mult = 2 * NSEC_PER_MSEC / tserv_ns;
+
+		/*
+		 * use current inflight number as proxy for number of
+		 * packets inflight when this packet was sent to
+		 * hardware queue
+		 */
+		priv->inflight_max =
+			max_t(int, 2, priv->inflight * inflight_mult);
+	}
+
+	priv->inflight--;
+}
+
+static struct sk_buff *pfifo_lat_dequeue(struct Qdisc *qdisc)
+{
+	struct pfifo_lat_data *priv = qdisc_priv(qdisc);
+	struct sk_buff *skb;
+
+	if (priv->inflight >= priv->inflight_max)
+		return NULL;
+
+	skb = qdisc_dequeue_head(qdisc);
+	if (!skb)
+		return NULL;
+
+	priv->inflight++;
+
+	/* take ownership of skb and timestamp it */
+	skb_orphan(skb);
+	skb->qdhandle = qdisc->handle;
+	skb->destructor = pfifo_lat_skb_free;
+	skb->dev = qdisc_dev(qdisc); /* do I need to set this?  */
+	skb->tstamp = ktime_get();
+
+	return skb;
+}
+
+static void pfifo_lat_reset(struct Qdisc* qdisc)
+{
+	struct pfifo_lat_data *priv = qdisc_priv(qdisc);
+
+	/*
+	 * since fifo_sched_data is embedded at head of pfifo_lat_data,
+	 * this should be OK to do...
+	 */
+	qdisc_reset_queue(qdisc);
+
+	/* need to reset priv->tserv somehow? */
+
+	priv->inflight = 0;
+	priv->inflight_max = (typeof(priv->inflight_max))-1;
+}
+
 static int fifo_init(struct Qdisc *sch, struct nlattr *opt)
 {
 	struct fifo_sched_data *q = qdisc_priv(sch);
@@ -82,6 +177,30 @@ static int fifo_init(struct Qdisc *sch, struct nlattr *opt)
 	return 0;
 }
 
+static int pfifo_lat_init(struct Qdisc *qdisc, struct nlattr *opt)
+{
+	struct pfifo_lat_data *priv = qdisc_priv(qdisc);
+	int rc;
+
+	/*
+	 * since fifo_sched_data is embedded at head of pfifo_lat_data,
+	 * this should be OK to do...
+	 */
+	rc = fifo_init(qdisc, opt);
+	if (rc)
+		return rc;
+
+	/* initialize service time estimate */
+	ewma_init(&priv->tserv, 1, 64);
+
+	priv->inflight = 0; /* necessary to set this explicitly? */
+
+	/* initial inflight_max should be ??? */
+	priv->inflight_max = (typeof(priv->inflight_max))-1;
+
+	return 0;
+}
+
 static int fifo_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct fifo_sched_data *q = qdisc_priv(sch);
@@ -138,6 +257,18 @@ struct Qdisc_ops pfifo_head_drop_qdisc_ops __read_mostly = {
 	.owner		=	THIS_MODULE,
 };
 
+struct Qdisc_ops pfifo_lat_qdisc_ops __read_mostly = {
+	.id		=	"pfifo_lat",
+	.priv_size	=	sizeof(struct pfifo_lat_data),
+	.enqueue	=	pfifo_lat_enqueue,
+	.dequeue	=	pfifo_lat_dequeue,
+	.peek		=	qdisc_peek_head,
+	.init		=	pfifo_lat_init,
+	.reset		=	pfifo_lat_reset,
+	.dump		=	fifo_dump,
+	.owner		=	THIS_MODULE,
+};
+
 /* Pass size change message down to embedded FIFO */
 int fifo_set_limit(struct Qdisc *q, unsigned int limit)
 {
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency
  2011-03-02 21:54             ` [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency John W. Linville
@ 2011-03-02 22:08               ` John W. Linville
  2011-03-03 12:51               ` Eric Dumazet
  1 sibling, 0 replies; 43+ messages in thread
From: John W. Linville @ 2011-03-02 22:08 UTC (permalink / raw)
  To: netdev; +Cc: bloat-devel

On Wed, Mar 02, 2011 at 04:54:10PM -0500, John W. Linville wrote:
> This is a qdisc based on the existing pfifo_fast code.  The difference

Well, it started that way.  This is obviously based on the pfifo
code instead...

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency
  2011-03-02 21:54             ` [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency John W. Linville
  2011-03-02 22:08               ` John W. Linville
@ 2011-03-03 12:51               ` Eric Dumazet
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2011-03-03 12:51 UTC (permalink / raw)
  To: John W. Linville; +Cc: netdev, bloat-devel

Le mercredi 02 mars 2011 à 16:54 -0500, John W. Linville a écrit :
> This is a qdisc based on the existing pfifo_fast code.  The difference
> is that this qdisc limits the dequeue rate based on estimates of how
> many packets can be in-flight at a given time while maintaining a target
> link latency.
> 
> This work is based on the eBDP documented in Section IV of "Buffer
> Sizing for 802.11 Based Networks" by Tianji Li, et al.
> 
> 	http://www.hamilton.ie/tianji_li/buffersizing.pdf
> 
> This implementation timestamps an skb as it dequeues it, then
> computes the service time when the frame is freed by the driver.
> An exponentially weighted moving average of per fragment service times
> is used to restrict queueing delays in hopes of achieving a target
> fragment transmission latency.  The skb->destructor mechanism is
> abused in order to obtain packet service time estimates.
> 
> Signed-off-by: John W. Linville <linville@tuxdriver.com>
> ---
> I took a whack at reimplementing my eBDP patch at the qdisc level.
> Unfortunately, it doesn't seem to work very well and I'm at a loss
> as to why... :-( Comments welcome -- maybe I'm doing something really
> stupid in the math and just can't see it.
> 
> The skb->destructor abuse includes adding a union member in the skb
> to record the qdisc->handle on the way out so that it can be used for
> accounting in the destructor -- thanks to Neil Horman for the
> suggestion!
> 
> The reason I think this is an idea worth exploring is that existing
> qdisc code doesn't seem to account for the fact that the devices could
> be doing a lot of queueing behind them.  Even Jussi's recent
> sch_fifo_ewma post doesn't seem to take into account how long the device
> holds on to packets, which limits his ability to fight latency.
> 
> Anyway, all comments appreciated!
> 
>  

Well, there are many issues in your patch.

The skb destructor cannot be used like that: think about locking, and
the various contexts in which drivers actually free skbs (from interrupt,
from softirq, or even _before_ sending data on the wire).

qdisc_lookup(skb->dev, skb->qdhandle), for example, is only safe if run
with RTNL held. It's not meant to be used in the fast path at all, only
in management code.

Getting feedback when an skb is freed (with a notification of whether
it was delivered or dropped) is a recurring idea, so we might design a
stackable infrastructure.




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-27 23:33         ` Albert Cahalan
  2011-02-28 11:23           ` Jussi Kivilinna
@ 2011-02-28 15:38           ` Hagen Paul Pfeifer
  2011-02-28 16:37             ` Albert Cahalan
  2011-02-28 17:20             ` Bill Sommerfeld
  1 sibling, 2 replies; 43+ messages in thread
From: Hagen Paul Pfeifer @ 2011-02-28 15:38 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: Jussi Kivilinna, Eric Dumazet, Mikael Abrahamsson, linux-kernel,
	netdev


On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote:

> I suppose there is a need to allow at least 2 packets despite any
> time limits, so that it remains possible to use a traditional modem
> even if a huge packet takes several seconds to send.

That is a good point! We talk as if we knew every use case of
Linux, but that is not true at all. One of my customers, for example,
operates the Linux network stack on top of a proprietary MAC/driver
where the current packet queue characteristic is just fine. The
time-drop approach is unsuitable because the bandwidth can vary over a
wide range (0 to max. bandwidth) in a short amount of time. Sufficient
buffering proves superior in this environment (only IPv{4,6}/UDP).

Hagen

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-28 15:38           ` txqueuelen has wrong units; should be time Hagen Paul Pfeifer
@ 2011-02-28 16:37             ` Albert Cahalan
  2011-02-28 17:45               ` John W. Linville
  2011-02-28 17:20             ` Bill Sommerfeld
  1 sibling, 1 reply; 43+ messages in thread
From: Albert Cahalan @ 2011-02-28 16:37 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Jussi Kivilinna, Eric Dumazet, Mikael Abrahamsson, linux-kernel,
	netdev

On Mon, Feb 28, 2011 at 10:38 AM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote:
>
>> I suppose there is a need to allow at least 2 packets despite any
>> time limits, so that it remains possible to use a traditional modem
>> even if a huge packet takes several seconds to send.
>
> That is a good point! We talk as if we knew every use case of
> Linux, but that is not true at all. One of my customers, for example,
> operates the Linux network stack on top of a proprietary MAC/driver
> where the current packet queue characteristic is just fine. The
> time-drop approach is unsuitable because the bandwidth can vary over a
> wide range (0 to max. bandwidth) in a short amount of time. Sufficient
> buffering proves superior in this environment (only IPv{4,6}/UDP).

I don't think the current non-time queue is just fine for him.
I can see that time-based discard-on-enqueue would not be
fine either. He needs time-based discard-on-dequeue.
What would probably work for him is:

On dequeue, discard all packets that are too old.
On enqueue, assume max bandwidth and discard all
packets that have no hope of surviving the dequeue check.
(the enqueue check is only to prevent wasting RAM)
Exception: always keep at least 2 packets.
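
The dequeue-side rule can be sketched in user space as follows. This is a
hypothetical model, not kernel code: struct tq, TQ_CAP and the helper names
are invented for illustration, timestamps are passed in by the caller, and
the enqueue-side "no hope of surviving" check is left out. Stale packets
are discarded from the head only while more than 2 packets remain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TQ_CAP		16	/* hypothetical fixed capacity */
#define TQ_MIN_KEEP	2	/* "always keep at least 2 packets" */

struct tq_pkt {
	uint64_t enqueued_ns;
	int id;
};

struct tq {
	struct tq_pkt slot[TQ_CAP];	/* ring buffer of queued packets */
	size_t head;
	size_t len;
	uint64_t max_age_ns;		/* the queue limit, expressed as time */
	unsigned long drops;
};

static void tq_init(struct tq *q, uint64_t max_age_ns)
{
	assert(q != NULL);
	q->head = 0;
	q->len = 0;
	q->max_age_ns = max_age_ns;
	q->drops = 0;
}

/* Returns 0 on success, -1 if the ring is full. */
static int tq_enqueue(struct tq *q, int id, uint64_t now_ns)
{
	struct tq_pkt *p;

	if (q->len == TQ_CAP)
		return -1;
	p = &q->slot[(q->head + q->len) % TQ_CAP];
	p->enqueued_ns = now_ns;
	p->id = id;
	q->len++;
	return 0;
}

/* Returns the id of the packet to transmit, or -1 if the queue is empty. */
static int tq_dequeue(struct tq *q, uint64_t now_ns)
{
	int id;

	/* discard packets that are too old, but never go below the floor */
	while (q->len > TQ_MIN_KEEP &&
	       now_ns - q->slot[q->head].enqueued_ns > q->max_age_ns) {
		q->head = (q->head + 1) % TQ_CAP;
		q->len--;
		q->drops++;
	}
	if (q->len == 0)
		return -1;
	id = q->slot[q->head].id;
	q->head = (q->head + 1) % TQ_CAP;
	q->len--;
	return id;
}
```

Because the age check happens at dequeue time, a link whose bandwidth
recovers quickly loses nothing, while a link that stalls sheds its backlog
down to the last two packets.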

Better is something that would allow random drop.
The trouble here is that bandwidth varies greatly.
Some sort of undelete functionality is needed...?

Assuming the difficulty with implementing random drop
is solvable, I think this would work for the rest of us too.

Keeping the timeout really low is important because it isn't
OK to eat up all the latency tolerance in one hop. You have
an end-to-end budget of 20 ms for usable GUI rubber banding.
The budget for gaming is about 80 ms and for VoIP is about 150 ms.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-28 16:37             ` Albert Cahalan
@ 2011-02-28 17:45               ` John W. Linville
  0 siblings, 0 replies; 43+ messages in thread
From: John W. Linville @ 2011-02-28 17:45 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: Hagen Paul Pfeifer, Jussi Kivilinna, Eric Dumazet,
	Mikael Abrahamsson, linux-kernel, netdev

On Mon, Feb 28, 2011 at 11:37:45AM -0500, Albert Cahalan wrote:

> Keeping the timeout really low is important because it isn't
> OK to eat up all the latency tolerance in one hop. You have
> an end-to-end budget of 20 ms for usable GUI rubber banding.
> The budget for gaming is about 80 ms and for VoIP is about 150 ms.

Oooh, numbers! :-)

Where can I find estimates on average hop counts for internet
connections?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-28 15:38           ` txqueuelen has wrong units; should be time Hagen Paul Pfeifer
  2011-02-28 16:37             ` Albert Cahalan
@ 2011-02-28 17:20             ` Bill Sommerfeld
  2011-02-28 21:51               ` John Heffner
  1 sibling, 1 reply; 43+ messages in thread
From: Bill Sommerfeld @ 2011-02-28 17:20 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Albert Cahalan, Jussi Kivilinna, Eric Dumazet, Mikael Abrahamsson,
	linux-kernel, netdev

On Mon, Feb 28, 2011 at 07:38, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote:
>> I suppose there is a need to allow at least 2 packets despite any
>> time limits, so that it remains possible to use a traditional modem
>> even if a huge packet takes several seconds to send.
>
> That is a good point! We talk as if we knew every use case of
> Linux, but that is not true at all. One of my customers, for example,
> operates the Linux network stack on top of a proprietary MAC/driver
> where the current packet queue characteristic is just fine. The
> time-drop approach is unsuitable because the bandwidth can vary over a
> wide range (0 to max. bandwidth) in a short amount of time. Sufficient
> buffering proves superior in this environment (only IPv{4,6}/UDP).

The tension is between the average queue length and the maximum amount
of buffering needed.  Fixed-sized tail-drop queues -- either long, or
short -- are not ideal.

My understanding is that the best practice here is that you need
(bandwidth * path delay) buffering to be available to absorb bursts
and avoid drops, but you also need to use queue management algorithms
with ECN or random drop to keep the *average* queue length short;
unfortunately, researchers are still arguing about the details of the
second part...
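
As a worked example of the (bandwidth * path delay) figure, here is a small
sketch of the arithmetic. The function names and the sample numbers
(100 Mbit/s and 1 Mbit/s links, 100 ms path delay, 1500-byte packets) are
illustrative assumptions, not measurements from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth-delay product in bytes: bits/s * ms / 8000. */
static uint64_t bdp_bytes(uint64_t bandwidth_bps, uint64_t delay_ms)
{
	/* refuse inputs whose product would overflow 64 bits */
	assert(delay_ms == 0 || bandwidth_bps <= UINT64_MAX / delay_ms);
	return bandwidth_bps * delay_ms / 8000;
}

/* The same amount of buffering expressed in packets, rounded up. */
static uint64_t bdp_packets(uint64_t bandwidth_bps, uint64_t delay_ms,
			    uint64_t pkt_bytes)
{
	assert(pkt_bytes > 0);
	return (bdp_bytes(bandwidth_bps, delay_ms) + pkt_bytes - 1) / pkt_bytes;
}
```

Under these assumptions, 100 ms of buffering is 1,250,000 bytes (834
full-size packets) at 100 Mbit/s but only 12,500 bytes (9 packets) at
1 Mbit/s, which is why a single fixed packet count cannot suit both links.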

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-28 17:20             ` Bill Sommerfeld
@ 2011-02-28 21:51               ` John Heffner
  2011-03-01  0:46                 ` Mikael Abrahamsson
  0 siblings, 1 reply; 43+ messages in thread
From: John Heffner @ 2011-02-28 21:51 UTC (permalink / raw)
  To: Bill Sommerfeld
  Cc: Hagen Paul Pfeifer, Albert Cahalan, Jussi Kivilinna, Eric Dumazet,
	Mikael Abrahamsson, linux-kernel, netdev

Right... while I generally agree that a fixed-length drop-tail queue
isn't optimal, isn't this problem what the various AQM schemes try to
solve?

  -John


On Mon, Feb 28, 2011 at 12:20 PM, Bill Sommerfeld
<wsommerfeld@google.com> wrote:
> On Mon, Feb 28, 2011 at 07:38, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
>> On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote:
>>> I suppose there is a need to allow at least 2 packets despite any
>>> time limits, so that it remains possible to use a traditional modem
>>> even if a huge packet takes several seconds to send.
>>
>> That is a good point! We talk as if we knew every use case of
>> Linux, but that is not true at all. One of my customers, for example,
>> operates the Linux network stack on top of a proprietary MAC/driver
>> where the current packet queue characteristic is just fine. The
>> time-drop approach is unsuitable because the bandwidth can vary over a
>> wide range (0 to max. bandwidth) in a short amount of time. Sufficient
>> buffering proves superior in this environment (only IPv{4,6}/UDP).
>
> The tension is between the average queue length and the maximum amount
> of buffering needed.  Fixed-sized tail-drop queues -- either long, or
> short -- are not ideal.
>
> My understanding is that the best practice here is that you need
> (bandwidth * path delay) buffering to be available to absorb bursts
> and avoid drops, but you also need to use queue management algorithms
> with ECN or random drop to keep the *average* queue length short;
> unfortunately, researchers are still arguing about the details of the
> second part...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-02-28 21:51               ` John Heffner
@ 2011-03-01  0:46                 ` Mikael Abrahamsson
  2011-03-02  6:25                   ` Stephen Hemminger
  0 siblings, 1 reply; 43+ messages in thread
From: Mikael Abrahamsson @ 2011-03-01  0:46 UTC (permalink / raw)
  To: John Heffner
  Cc: Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan,
	Jussi Kivilinna, Eric Dumazet, linux-kernel, netdev

On Mon, 28 Feb 2011, John Heffner wrote:

> Right... while I generally agree that a fixed-length drop-tail queue 
> isn't optimal, isn't this problem what the various AQM schemes try to 
> solve?

I am not an expert on exactly how Linux does this, but for Cisco and for 
instance ATM interfaces, there are two stages of queuing. One is the 
"hardware queue", which is a FIFO queue going into the ATM framer. If one 
wants low CPU usage, then this needs to be high so multiple packets can be 
put there per interrupt. Since AQM is working before this, it also means 
the low-latency-queue will have a higher latency as it ends up behind 
larger packets in the hw queue.

So at what level does the AQM work in Linux? Does it work similarly, in
that txqueuelen is a FIFO queue to the hardware that AQM feeds packets into?

Also, when one uses WRED the thinking is generally to keep the average 
queue len down, but still allow for bursts by dynamically changing the 
drop probability and where it happens. When there is no queuing, allow for 
big queue (so it can fill up if needed), but if the queue is large for 
several seconds, start to apply WRED to bring it down.
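
The "dynamically changing drop probability" part can be illustrated with
the textbook RED curve. This is a minimal sketch of the classic formula
only, not Cisco's WRED implementation and not the Linux sch_red code; the
function name and the per-mille scaling are choices made for illustration.
The avg argument is the averaged queue length, in whatever unit the
thresholds use.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Drop probability in per-mille (0..1000) for an averaged queue length:
 * zero below min_th, certain at or above max_th, and rising linearly
 * from 0 to max_p_permille in between.
 */
static unsigned int red_drop_permille(unsigned int avg,
				      unsigned int min_th,
				      unsigned int max_th,
				      unsigned int max_p_permille)
{
	assert(min_th < max_th);
	assert(max_p_permille <= 1000);

	if (avg < min_th)
		return 0;
	if (avg >= max_th)
		return 1000;
	return (unsigned int)((uint64_t)max_p_permille * (avg - min_th) /
			      (max_th - min_th));
}
```

A weighted variant would call the same function with a different
(min_th, max_th, max_p) triple per traffic class, so that lower-priority
traffic starts being dropped at a shorter average queue.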

There is generally no need at all to constantly buffer > 50 ms of data;
at that point it's better to just start selectively dropping it. In times
of burstiness (perhaps when re-routing traffic) there is a need to buffer
200-500 ms of data for perhaps 1-2 seconds before things stabilize.

So one queuing scheme and one queue limit isn't going to solve this; there
needs to be some dynamic behavior built into the system for it to work well.

AQM needs to feed into a relatively short hw queue, and AQM needs to exist
on output also when the traffic is sourced from the box itself, not only
routed. It would also help if the default were to reserve, let's say, 25% of
the bandwidth for smaller packets (< 200 bytes or so), which generally are
for interactive uses or are ACKs.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: txqueuelen has wrong units; should be time
  2011-03-01  0:46                 ` Mikael Abrahamsson
@ 2011-03-02  6:25                   ` Stephen Hemminger
  2011-03-02  6:41                     ` Mikael Abrahamsson
  0 siblings, 1 reply; 43+ messages in thread
From: Stephen Hemminger @ 2011-03-02  6:25 UTC (permalink / raw)
  To: Mikael Abrahamsson
  Cc: John Heffner, Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan,
	Jussi Kivilinna, Eric Dumazet, linux-kernel, netdev

On Tue, 1 Mar 2011 01:46:51 +0100 (CET)
Mikael Abrahamsson <swmike@swm.pp.se> wrote:

> On Mon, 28 Feb 2011, John Heffner wrote:
> 
> > Right... while I generally agree that a fixed-length drop-tail queue 
> > isn't optimal, isn't this problem what the various AQM schemes try to 
> > solve?
> 
> I am not an expert on exactly how Linux does this, but for Cisco and for 
> instance ATM interfaces, there are two stages of queuing. One is the 
> "hardware queue", which is a FIFO queue going into the ATM framer. If one 
> wants low CPU usage, then this needs to be high so multiple packets can be 
> put there per interrupt. Since AQM is working before this, it also means 
> the low-latency-queue will have a higher latency as it ends up behind 
> larger packets in the hw queue.
> 
> So at what level does the AQM work in Linux? Does it work similarly, in
> that txqueuelen is a FIFO queue to the hardware that AQM feeds packets into?
> 
> Also, when one uses WRED the thinking is generally to keep the average 
> queue len down, but still allow for bursts by dynamically changing the 
> drop probability and where it happens. When there is no queuing, allow for 
> big queue (so it can fill up if needed), but if the queue is large for 
> several seconds, start to apply WRED to bring it down.
> 
> There is generally no need at all to constantly buffer > 50 ms of data;
> at that point it's better to just start selectively dropping it. In times
> of burstiness (perhaps when re-routing traffic) there is a need to buffer
> 200-500 ms of data for perhaps 1-2 seconds before things stabilize.
> 
> So one queuing scheme and one queue limit isn't going to solve this; there
> needs to be some dynamic behavior built into the system for it to work well.
> 
> AQM needs to feed into a relatively short hw queue, and AQM needs to exist
> on output also when the traffic is sourced from the box itself, not only
> routed. It would also help if the default were to reserve, let's say, 25% of
> the bandwidth for smaller packets (< 200 bytes or so), which generally are
> for interactive uses or are ACKs.
> 

It is possible to build an equivalent to WRED out of the existing GRED queuing
discipline but it does require a lot of tc knowledge to get right.
The inventor of RED (Van Jacobson) has issues with WRED because of
the added complexity of queue selection. RED requires some parameters
which the average user has no idea how to set.

There are several problems with RED that prevent VJ from
recommending it in the current form.

  http://gettys.wordpress.com/2010/12/17/red-in-a-different-light/
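
For readers unfamiliar with the parameters in question, here is a minimal
sketch of the textbook RED drop decision. It is not the Linux RED or GRED
code and not Cisco's WRED: the class name, method names and the simplified
logic (no idle-time handling, no spacing of drops) are assumptions made for
illustration only. It shows the four values a user is asked to choose: a
minimum threshold, a maximum threshold, a maximum drop probability, and the
weight of the moving average.

```python
import random


class RedQueue:
    """Simplified textbook RED drop decision (illustrative names only)."""

    def __init__(self, min_th, max_th, max_p, weight):
        if not 0 <= min_th < max_th:
            raise ValueError("need 0 <= min_th < max_th")
        self.min_th = min_th  # below this average queue length, never drop
        self.max_th = max_th  # at or above this average, always drop
        self.max_p = max_p    # drop probability reached just below max_th
        self.weight = weight  # EWMA weight given to the newest sample
        self.avg = 0.0        # exponentially weighted average queue length

    def update_average(self, queue_len):
        # A small weight lets short bursts pass without raising the average.
        self.avg = (1.0 - self.weight) * self.avg + self.weight * queue_len
        return self.avg

    def drop_probability(self):
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        # Linear ramp from 0 to max_p between the two thresholds.
        span = self.max_th - self.min_th
        return self.max_p * (self.avg - self.min_th) / span

    def should_drop(self, queue_len, rng=random.random):
        # Called at enqueue time with the instantaneous queue length.
        self.update_average(queue_len)
        return rng() < self.drop_probability()
```

All four constructor arguments interact with link speed and traffic mix,
which is one way to see why picking them by hand is hard.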


-- 


* Re: txqueuelen has wrong units; should be time
  2011-03-02  6:25                   ` Stephen Hemminger
@ 2011-03-02  6:41                     ` Mikael Abrahamsson
  2011-03-02  7:07                       ` Stephen Hemminger
  0 siblings, 1 reply; 43+ messages in thread
From: Mikael Abrahamsson @ 2011-03-02  6:41 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: John Heffner, Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan,
	Jussi Kivilinna, Eric Dumazet, linux-kernel, netdev

On Tue, 1 Mar 2011, Stephen Hemminger wrote:

> It is possible to build an equivalent to WRED out of the existing GRED queuing
> discipline but it does require a lot of tc knowledge to get right.

To me, who has worked with Cisco routers for 10+ years and who is used to
the different variants Cisco uses, tc is just weird. It must come from a
completely different school of thinking compared to what router people are 
used to, because I have tried and failed twice to do anything sensible 
with it.

> The inventor of RED (Van Jacobson) has issues with WRED because of the
> added complexity of queue selection. RED requires some parameters which 
> the average user has no idea how to set.

Of course there are issues and some of them can be addressed by simply
lowering the queue depth. Yes, that might bring down the performance of 
some sessions, but for most of the interactive traffic, never buffering 
more than 40ms is a good thing.

> There are several problems with RED that prevent VJ from
> recommending it in the current form.

Ask if he prefers FIFO+tail drop to RED in current form.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: txqueuelen has wrong units; should be time
  2011-03-02  6:41                     ` Mikael Abrahamsson
@ 2011-03-02  7:07                       ` Stephen Hemminger
  2011-03-02 16:41                         ` Mikael Abrahamsson
  0 siblings, 1 reply; 43+ messages in thread
From: Stephen Hemminger @ 2011-03-02  7:07 UTC (permalink / raw)
  To: Mikael Abrahamsson
  Cc: John Heffner, Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan,
	Jussi Kivilinna, Eric Dumazet, linux-kernel, netdev

On Wed, 2 Mar 2011 07:41:30 +0100 (CET)
Mikael Abrahamsson <swmike@swm.pp.se> wrote:

> On Tue, 1 Mar 2011, Stephen Hemminger wrote:
> 
> > It is possible to build an equivalent to WRED out of the existing GRED queuing
> > discipline but it does require a lot of tc knowledge to get right.  
> 
> To me, who has worked with Cisco routers for 10+ years and who is used to
> the different variants Cisco uses, tc is just weird. It must come from a
> completely different school of thinking compared to what router people are 
> used to, because I have tried and failed twice to do anything sensible 
> with it.

Vyatta has scripting that handles all that:

vyatta@napa:~$ configure
[edit]
vyatta@napa# set traffic-policy random-detect MyWFQ bandwidth 1gbps
[edit]
vyatta@napa# set interfaces ethernet eth0 traffic-policy out MyWFQ 
[edit]
vyatta@napa# commit
[edit]
vyatta@napa# exit
vyatta@napa:~$ show queueing ethernet eth0

eth0 Queueing:
Class      Policy               Sent     Rate  Dropped Overlimit  Backlog
root       weighted-random     16550                 0        0        0

vyatta@napa:~$ /sbin/tc qdisc show dev eth0
qdisc dsmark 1: root refcnt 2 indices 0x0008 set_tc_index 
qdisc gred 2: parent 1: 
 DP:0 (prio 8) Average Queue 0b Measured Queue 0b  
	 Packet drops: 0 (forced 0 early 0)  
	 Packet totals: 82 (bytes 9540)  ewma 3 Plog 17 Scell_log 3
 DP:1 (prio 7) Average Queue 0b Measured Queue 0b  
	 Packet drops: 0 (forced 0 early 0)  
	 Packet totals: 0 (bytes 0)  ewma 2 Plog 17 Scell_log 2
 DP:2 (prio 6) Average Queue 0b Measured Queue 0b  
	 Packet drops: 0 (forced 0 early 0)  
	 Packet totals: 0 (bytes 0)  ewma 2 Plog 17 Scell_log 2
 DP:3 (prio 5) Average Queue 0b Measured Queue 0b  
	 Packet drops: 0 (forced 0 early 0)  
	 Packet totals: 0 (bytes 0)  ewma 2 Plog 16 Scell_log 2
 DP:4 (prio 4) Average Queue 0b Measured Queue 0b  
	 Packet drops: 0 (forced 0 early 0)  
	 Packet totals: 0 (bytes 0)  ewma 2 Plog 16 Scell_log 2
 DP:5 (prio 3) Average Queue 0b Measured Queue 0b  
	 Packet drops: 0 (forced 0 early 0)  
	 Packet totals: 0 (bytes 0)  ewma 2 Plog 16 Scell_log 2
 DP:6 (prio 2) Average Queue 0b Measured Queue 0b  
	 Packet drops: 0 (forced 0 early 0)  
	 Packet totals: 0 (bytes 0)  ewma 2 Plog 15 Scell_log 2
 DP:7 (prio 1) Average Queue 0b Measured Queue 0b  
	 Packet drops: 0 (forced 0 early 0)  
	 Packet totals: 0 (bytes 0)  ewma 1 Plog 15 Scell_log 1



QoS on Cisco has different/other problems mostly because various groups
tried to fix the QoS problem over time and never got it quite right.
Also WRED is not default on faster links because it can't be done
fast enough.



* Re: txqueuelen has wrong units; should be time
  2011-03-02  7:07                       ` Stephen Hemminger
@ 2011-03-02 16:41                         ` Mikael Abrahamsson
  2011-03-02 16:50                           ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Mikael Abrahamsson @ 2011-03-02 16:41 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: John Heffner, Bill Sommerfeld, Hagen Paul Pfeifer, Albert Cahalan,
	Jussi Kivilinna, Eric Dumazet, linux-kernel, netdev

On Tue, 1 Mar 2011, Stephen Hemminger wrote:

> Also WRED is not default on faster links because it can't be done fast 
> enough.

Before this propagates as some kind of truth: modern Cisco core routers
have no problem doing WRED at wire speed; the above statement is not true.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: txqueuelen has wrong units; should be time
  2011-03-02 16:41                         ` Mikael Abrahamsson
@ 2011-03-02 16:50                           ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2011-03-02 16:50 UTC (permalink / raw)
  To: Mikael Abrahamsson
  Cc: Stephen Hemminger, John Heffner, Bill Sommerfeld,
	Hagen Paul Pfeifer, Albert Cahalan, Jussi Kivilinna, linux-kernel,
	netdev

On Wednesday, 02 March 2011 at 17:41 +0100, Mikael Abrahamsson wrote:
> On Tue, 1 Mar 2011, Stephen Hemminger wrote:
> 
> > Also WRED is not default on faster links because it can't be done fast 
> > enough.
> 
> Before this propagates as some kind of truth: modern Cisco core routers
> have no problem doing WRED at wire speed; the above statement is not true.
> 

Looking at the Cisco docs you provided
( <http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/12stbwr.html> ),
it seems the WRED time limits (instead of bytes/packets limits) are
internally converted to bytes/packets limits.


Quote:

When the queue limit threshold is specified in milliseconds, the number
of milliseconds is internally converted to bytes using the bandwidth
available for the class. 


So it seems it's only a convenience for configuration, and queues are
still managed with bytes/packets limits...

WRED is able to probabilistically drop a packet when this packet is
enqueued. At the time of enqueue, we don't yet know the time of dequeue,
unless the bandwidth is known.
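
As a rough illustration of the conversion described in the quoted
documentation, the sketch below turns a limit in milliseconds into bytes at
an assumed bandwidth, and back again. The function names and the example
figures are hypothetical; this is plain arithmetic, not Cisco's actual
implementation.

```python
def ms_to_bytes(limit_ms, bandwidth_bps):
    """Bytes that drain in limit_ms at bandwidth_bps (bits per second)."""
    # bits = bandwidth_bps * limit_ms / 1000; bytes = bits / 8
    return bandwidth_bps * limit_ms // 8000


def bytes_to_ms(limit_bytes, bandwidth_bps):
    """Time in milliseconds needed to drain limit_bytes at bandwidth_bps."""
    return limit_bytes * 8 * 1000 / bandwidth_bps


# A 40 ms limit configured for an assumed 100 Mbit/s class:
limit = ms_to_bytes(40, 100_000_000)

# The same byte limit if the real rate turns out to be 10 Mbit/s:
actual_ms = bytes_to_ms(limit, 10_000_000)
```

With these figures the byte limit is fixed once at configuration time, so
if the real rate is a tenth of the assumed one, the same queue holds ten
times as much delay. The time-based limit only holds while the bandwidth
assumption does.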


Thread overview: 43+ messages
2011-02-27  5:44 txqueuelen has wrong units; should be time Albert Cahalan
2011-02-27  7:02 ` Mikael Abrahamsson
2011-02-27  7:54   ` Eric Dumazet
2011-02-27  8:27     ` Albert Cahalan
2011-02-27 10:55       ` Jussi Kivilinna
2011-02-27 20:07         ` Eric Dumazet
2011-02-27 21:32           ` Jussi Kivilinna
2011-02-28 11:43           ` Jussi Kivilinna
2011-02-28 13:10             ` Eric Dumazet
2011-02-28 18:31               ` Jussi Kivilinna
2011-02-28 16:11           ` John W. Linville
2011-02-28 16:48             ` Eric Dumazet
2011-02-28 16:55               ` John W. Linville
2011-02-28 17:18                 ` Eric Dumazet
2011-02-28 21:45                 ` John Heffner
2011-03-01  4:11                   ` Albert Cahalan
2011-03-01  4:18                     ` David Miller
2011-03-01  6:54                       ` Albert Cahalan
2011-03-01  7:25                         ` David Miller
2011-03-01  7:26                         ` Eric Dumazet
2011-03-01 19:37                           ` Albert Cahalan
2011-03-01 20:14                             ` Eric Dumazet
2011-03-01 20:16                               ` Eric Dumazet
2011-03-02  3:10                           ` Mikael Abrahamsson
2011-03-02 20:25                             ` Chris Friesen
2011-03-01  5:01                     ` Eric Dumazet
2011-03-01  5:36                       ` Eric Dumazet
2011-02-27 23:33         ` Albert Cahalan
2011-02-28 11:23           ` Jussi Kivilinna
2011-03-02 21:54             ` [RFC LOL OMG] pfifo_lat: qdisc that limits dequeueing based on estimated link latency John W. Linville
2011-03-02 22:08               ` John W. Linville
2011-03-03 12:51               ` Eric Dumazet
2011-02-28 15:38           ` txqueuelen has wrong units; should be time Hagen Paul Pfeifer
2011-02-28 16:37             ` Albert Cahalan
2011-02-28 17:45               ` John W. Linville
2011-02-28 17:20             ` Bill Sommerfeld
2011-02-28 21:51               ` John Heffner
2011-03-01  0:46                 ` Mikael Abrahamsson
2011-03-02  6:25                   ` Stephen Hemminger
2011-03-02  6:41                     ` Mikael Abrahamsson
2011-03-02  7:07                       ` Stephen Hemminger
2011-03-02 16:41                         ` Mikael Abrahamsson
2011-03-02 16:50                           ` Eric Dumazet
