Netdev List
 help / color / mirror / Atom feed
* ASSISTANCE CONEIDETIAL!!!
From: Dr. Smith Lee. @ 2011-01-05 17:11 UTC (permalink / raw)


Hello,

I have a proposition for you, this however is not mandatory nor will I 
in any manner compel you to honor against your will.Let me start by 
introducing myself. I am Dr.Smith  Lee, Director of Operations 
of the Hang Seng Bank Ltd,Sai Wan Ho Branch. I have a mutual beneficial 
business suggestion for you.

1. Can you handle this project?
2. Can I give you this trust ?

Absolute confidentiality is required from you.Besides,I will use my connection 
to get some documents to back up the fund so that the fund can not be 
question by any authority.

More information await you in my next response to your email message.


Treat as very urgent.

Yours Faithfully,

Dr. Smith Lee.


smith@hangsengbankltd-hk.com

^ permalink raw reply

* [PATCH] net_sched: pfifo_head_drop problem
From: Eric Dumazet @ 2011-01-05 20:35 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Florian Westphal, Patrick McHardy, Hagen Paul Pfeifer,
	Stephen Hemminger, Jarek Poplawski
In-Reply-To: <1294246850.2775.244.camel@edumazet-laptop>

commit 57dbb2d83d100ea (sched: add head drop fifo queue)
introduced pfifo_head_drop, and broke the invariant that
sch->bstats.bytes and sch->bstats.packets are COUNTER (increasing
counters only)

This can break estimators because est_timer() handles unsigned deltas
only. A decreasing counter can then give a huge unsigned delta.

My mid term suggestion would be to change things so that
sch->bstats.bytes and sch->bstats.packets are incremented in dequeue()
only, not at enqueue() time. We also could add drop_bytes/drop_packets
and provide estimations of drop rates.

It would be more sensible anyway for very low speeds, and big bursts.
Right now, if we drop packets, they still are accounted in byte/packets
abolute counters and rate estimators.

Before this mid term change, this patch makes pfifo_head_drop behavior
similar to other qdiscs in case of drops :
Dont decrement sch->bstats.bytes and sch->bstats.packets

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/sched/sch_fifo.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c
index 4dfecb0..aa4d633 100644
--- a/net/sched/sch_fifo.c
+++ b/net/sched/sch_fifo.c
@@ -54,8 +54,6 @@ static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc* sch)
 
 	/* queue full, remove one skb to fulfill the limit */
 	skb_head = qdisc_dequeue_head(sch);
-	sch->bstats.bytes -= qdisc_pkt_len(skb_head);
-	sch->bstats.packets--;
 	sch->qstats.drops++;
 	kfree_skb(skb_head);
 



^ permalink raw reply related

* Re: [RFC] sched: CHOKe packet scheduler (v0.2)
From: Stephen Hemminger @ 2011-01-05 20:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1294257967.2723.9.camel@edumazet-laptop>

On Wed, 05 Jan 2011 21:06:07 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mercredi 05 janvier 2011 à 11:21 -0800, Stephen Hemminger a écrit :
> > This implements the CHOKe packet scheduler based on the existing
> > Linux RED scheduler based on the algorithm described in the paper.
> > 
> 
> > +
> > +static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
> > +{
> > +	struct choke_sched_data *q = qdisc_priv(sch);
> > +	struct red_parms *p = &q->parms;
> > +
> > +	p->qavg = red_calc_qavg(p, skb_queue_len(&sch->q));
> > +	if (red_is_idling(p))
> > +		red_end_of_idle_period(p);
> > +
> > +	if (p->qavg <= p->qth_min)
> > +		p->qcount = -1;
> > +	else {
> > +		struct sk_buff *oskb;
> > +
> > +		/* Draw a packet at random from queue */
> > +		oskb = skb_peek_random(&sch->q);
> > +
> > +		/* Both packets from same flow?
> > +		 * Assumes skb_get_rxhash already set hash on oskb->rxhash
> > +		 * prior to queuing
> 
> but this is not true... if at that time, p->qavg was <= p->qth_min.
> Packet was directly enqueued.
> 
> Just use :
> 
> if (skb_get_rxhash(oskb) == skb_get_rxhash(skb))
> 
> Since skb_get_rxhash(skb) doesnt recompute rxhash if already set.
> 
> Hmm... I am now wondering if this actually works on egress at all
> (can we use rxhash here I mean)
> 
> 
> 
> > +		 */
> > +		if (oskb->rxhash == skb_get_rxhash(skb)) {
> > +			/* Drop both packets */
> > +			__skb_unlink(oskb, &sch->q);
> > +			qdisc_drop(oskb, sch);
> > +			goto congestion_drop;
> > +		}
> 

The code computes a value, whether it is correct or not is another question.
Also a little concerned that different NIC's compute different flow hash
values which could cause false positives.

-- 

^ permalink raw reply

* Re: [PATCH v3 08/10] ARM: mxs: add ocotp read function
From: Russell King - ARM Linux @ 2011-01-05 20:15 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Jamie Iles, gerg, B32542, netdev, s.hauer, bryan.wu, baruch,
	w.sang, r64343, Shawn Guo, eric, Uwe Kleine-König, davem,
	linux-arm-kernel, lw
In-Reply-To: <20110105194418.GE12222@shareable.org>

On Wed, Jan 05, 2011 at 07:44:18PM +0000, Jamie Lokier wrote:
> 'git show 534be1d5' explains how it works: cpu_relax() flushes buffered
> writes from _this_ CPU, so that other CPUs which are polling can make
> progress, which avoids this CPU getting stuck if there is an indirect
> dependency (no matter how convoluted) between what it's polling and which
> it wrote just before.
> 
> So cpu_relax() is *essential* in some polling loops, not a hint.
> 
> In principle that could happen for I/O polling, if (a) buffered memory
> writes are delayed by I/O read transactions, and (b) the device state we're
> waiting on depends on I/O yet to be done on another CPU, which could be
> polling memory first (e.g. a spinlock).
> 
> I doubt (a) in practice - but what about buses that block during I/O read?
> (I have a chip like that here, but it's ARMv4T.)

Let's be clear - ARMv5 and below generally are well ordered architectures
within the limits of caching.  There are cases where the write buffer
allows two writes to pass each other.  However, for IO we generally map
these - especially for ARMv4 and below - as 'uncacheable unbufferable'.
So on these, if the program says "read this location" the pipeline will
stall until the read has been issued - and if you use the result in the
next instruction, it will stall until the data is available.  So really,
it's not a problem here.

ARMv6 and above have a weakly ordered memory model with speculative
prefetching, so memory reads/writes can be completely unordered.  Device
accesses can pass memory accesses, but device accesses are always visible
in program order with respect to each other.

So, if you're spinning in a loop reading an IO device, all previous IO
accesses will be completed (in all ARM architectures) before the result
of your read is evaluated.


(But, let's make you squirm some more - mb() on ARMv6 and above may
equate to a CPU memory barrier _plus_ a few IO accesses to the external
L2 cache controller - which will be ordered wrt other IO accesses of
course.)

^ permalink raw reply

* Re: [RFC] sched: CHOKe packet scheduler (v0.2)
From: Eric Dumazet @ 2011-01-05 20:06 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20110105112104.64ad3c86@nehalam>

Le mercredi 05 janvier 2011 à 11:21 -0800, Stephen Hemminger a écrit :
> This implements the CHOKe packet scheduler based on the existing
> Linux RED scheduler based on the algorithm described in the paper.
> 

> +
> +static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
> +{
> +	struct choke_sched_data *q = qdisc_priv(sch);
> +	struct red_parms *p = &q->parms;
> +
> +	p->qavg = red_calc_qavg(p, skb_queue_len(&sch->q));
> +	if (red_is_idling(p))
> +		red_end_of_idle_period(p);
> +
> +	if (p->qavg <= p->qth_min)
> +		p->qcount = -1;
> +	else {
> +		struct sk_buff *oskb;
> +
> +		/* Draw a packet at random from queue */
> +		oskb = skb_peek_random(&sch->q);
> +
> +		/* Both packets from same flow?
> +		 * Assumes skb_get_rxhash already set hash on oskb->rxhash
> +		 * prior to queuing

but this is not true... if at that time, p->qavg was <= p->qth_min.
Packet was directly enqueued.

Just use :

if (skb_get_rxhash(oskb) == skb_get_rxhash(skb))

Since skb_get_rxhash(skb) doesnt recompute rxhash if already set.

Hmm... I am now wondering if this actually works on egress at all
(can we use rxhash here I mean)



> +		 */
> +		if (oskb->rxhash == skb_get_rxhash(skb)) {
> +			/* Drop both packets */
> +			__skb_unlink(oskb, &sch->q);
> +			qdisc_drop(oskb, sch);
> +			goto congestion_drop;
> +		}



^ permalink raw reply

* Re: [PATCH v3 08/10] ARM: mxs: add ocotp read function
From: Jamie Lokier @ 2011-01-05 19:44 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Jamie Iles, gerg, B32542, netdev, s.hauer, bryan.wu, baruch,
	w.sang, r64343, Shawn Guo, eric, Uwe Kleine-König, davem,
	linux-arm-kernel, lw
In-Reply-To: <20110105183509.GH8638@n2100.arm.linux.org.uk>

Russell King - ARM Linux wrote:
> > By the way, I see ARM defines cpu_relax as smp_mb() on arch >= 6.  Is
> > that correct and useful?  On other architectures*, barrier() is enough
> > of a barrier, but it's conceivable that smp_mb() would have some
> > ARM-specific fairness or bus activity benefit - in which case it
> > should probably be mb().
> 
> See a discussion last year with Linus.  It's there to ensure that one CPU
> spinning on a variable can see a write by another CPU to that same
> variable.  Without the barrier, the visibility effects are unbounded on
> ARMv6 - and it's only like that for ARMv6, not >= ARMv6.

Ah, thanks.

It might be relevant to this thread; I'm not sure.

'git show 534be1d5' explains how it works: cpu_relax() flushes buffered
writes from _this_ CPU, so that other CPUs which are polling can make
progress, which avoids this CPU getting stuck if there is an indirect
dependency (no matter how convoluted) between what it's polling and which
it wrote just before.

So cpu_relax() is *essential* in some polling loops, not a hint.

In principle that could happen for I/O polling, if (a) buffered memory
writes are delayed by I/O read transactions, and (b) the device state we're
waiting on depends on I/O yet to be done on another CPU, which could be
polling memory first (e.g. a spinlock).

I doubt (a) in practice - but what about buses that block during I/O read?
(I have a chip like that here, but it's ARMv4T.)

If so, readl() doesn't have a barrier that addresses the issue.

(b) is conceivable, and if the memory is a spinlock, spin_unlock() does
_not_ deal with this on ARMv6, as it has no barrier after the write.
('git show c5113b61' doesn't really explain why.)

It's convoluted and unlikely, but (imho) suggests cpu_relax() is good
practice in polling loops, even if not needed.  It doesn't cost anything.

-- Jamie

^ permalink raw reply

* Re: [net-next 00/10][pull-request] Intel Wired LAN Driver Updates
From: David Miller @ 2011-01-05 19:41 UTC (permalink / raw)
  To: donald.c.skidmore; +Cc: netdev
In-Reply-To: <20110105175920.GB23050@gitlad.jf.intel.com>

From: Don Skidmore <donald.c.skidmore@intel.com>
Date: Wed, 5 Jan 2011 09:59:20 -0800

> The following series contains: 
>  - ixgbe flow director improvement patches 
>  - general e1000 cleanup
>  - switches ixgbe to new vlan interface
> 
> The following are changes since commit:
> 
> commit 2316aa2aee254c126e688b53a3a105b82bc3f368
> Author: Greg Rose <gregory.v.rose@intel.com>
>     ixgbevf: Add X540 VF device support to the ixgbevf driver
> 
> and are available in the git repository at:
> 
>      master.kernel.org:/pub/scm/linux/kernel/git/jkirsher/net-next-2.6.git master

There's nothing there.

davem@sunset:~/src/GIT/net-next-2.6$ git pull master.kernel.org:/pub/scm/linux/kernel/git/jkirsher/net-next-2.6.git master
davem@master.kernel.org's password: 
Permission denied, please try again.
davem@master.kernel.org's password: 
>From master.kernel.org:/pub/scm/linux/kernel/git/jkirsher/net-next-2.6
 * branch            master     -> FETCH_HEAD
Already up-to-date.
davem@sunset:~/src/GIT/net-next-2.6$ 

Also all of your patch emails went out with the actual SMTP
"From: " changed, so it looks as if the patch author actually
sent those patch emails.

That's not what it should look like, the top level From: should
be you, and at the top of the commit message body there should
be another "From: " line demarking the patch author.

^ permalink raw reply

* [RFC] sched: CHOKe packet scheduler (v0.2)
From: Stephen Hemminger @ 2011-01-05 19:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1294248332.10633.25.camel@edumazet-laptop>

This implements the CHOKe packet scheduler based on the existing
Linux RED scheduler based on the algorithm described in the paper.

The core idea is:
  For every packet arrival:
  	Calculate Qave
	if (Qave < minth) 
	     Queue the new packet
	else 
	     Select randomly a packet from the queue 
	     if (both packets from same flow)
	     then Drop both the packets
	     else if (Qave > maxth)
	          Drop packet
	     else
	       	  Admit packet with probability p (same as RED)

See also:
  Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
   queue management scheme for approximating fair bandwidth allocation", 
  Proceeding of INFOCOM'2000, March 2000. 

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
New in this version:
	- use skb hash values for flow matching
	- reciprocal_divide optimization
	- optional ECN support (same as RED)

 net/sched/Kconfig     |   11 +
 net/sched/Makefile    |    1 
 net/sched/sch_choke.c |  364 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 376 insertions(+)

--- a/net/sched/Kconfig	2011-01-04 16:25:18.000000000 -0800
+++ b/net/sched/Kconfig	2011-01-05 09:01:33.280032462 -0800
@@ -205,6 +205,17 @@ config NET_SCH_DRR
 
 	  If unsure, say N.
 
+config NET_SCH_CHOKE
+	tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
+	help
+	  Say Y here if you want to use the CHOKe packet scheduler (CHOose
+	  and Keep for responsive flows, CHOose and Kill for unresponsive
+	  flows). This is a variation of RED which trys to penalize flows
+	  that monopolize the queue.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called sch_choke.
+
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
--- a/net/sched/Makefile	2011-01-04 16:25:18.000000000 -0800
+++ b/net/sched/Makefile	2011-01-05 09:01:33.284032598 -0800
@@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)	+= sch_mult
 obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
+obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
 obj-$(CONFIG_NET_CLS_FW)	+= cls_fw.o
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/net/sched/sch_choke.c	2011-01-05 11:18:14.422320985 -0800
@@ -0,0 +1,307 @@
+/*
+ * net/sched/sch_choke.c	CHOKE scheduler
+ *
+ * Copyright (c) 2011 Stephen Hemminger <shemminger@vyatta.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/reciprocal_div.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+#include <net/red.h>
+
+/*	CHOKe stateless AQM for fair bandwidth allocation
+        =================================================
+
+   CHOKe (CHOose and Keep for responsive flows, CHOose and Kill for
+   unresponsive flows) is a variant of RED that penalizes misbehaving flows but
+   maintains no flow state. The difference from RED is an additional step
+   during the enqueuing process. If average queue size is over the
+   low threshold (qmin), a packet is chosen at random from the queue.
+   If both the new and chosen packet are from the same flow, both
+   are dropped. Unlike RED, CHOKe is not a "classful" qdisc because it
+   needs to access packets in queue randomly.
+
+   Source:
+   R. Pan, B. Prabhakar, and K. Psounis, "CHOKe, A Stateless
+   Active Queue Management Scheme for Approximating Fair Bandwidth Allocation",
+   IEEE INFOCOM, 2000.
+
+   A. Tang, J. Wang, S. Low, "Understanding CHOKe: Throughput and Spatial
+   Characteristics", IEEE/ACM Transactions on Networking, 2004
+
+ */
+
+struct choke_sched_data
+{
+	u32		  limit;
+	unsigned char	  flags;
+
+	struct red_parms  parms;
+	struct red_stats  stats;
+};
+
+
+/* deliver a random number between 0 and N - 1 */
+static inline u32 random_N(unsigned int N)
+{
+	return reciprocal_divide(random(), N);
+}
+
+/* Select a packet at random from the list.
+ * Same caveats as skb_peek.
+ */
+static struct sk_buff *skb_peek_random(struct sk_buff_head *list)
+{
+	struct sk_buff *skb = list->next;
+	unsigned int idx = random_N(list->qlen);
+
+	while (skb && idx-- > 0)
+		skb = skb->next;
+
+	return skb;
+}
+
+
+static inline int use_ecn(const struct choke_sched_data *q)
+{
+	return q->flags & TC_RED_ECN;
+}
+
+static inline int use_harddrop(const struct choke_sched_data *q)
+{
+	return q->flags & TC_RED_HARDDROP;
+}
+
+static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct red_parms *p = &q->parms;
+
+	p->qavg = red_calc_qavg(p, skb_queue_len(&sch->q));
+	if (red_is_idling(p))
+		red_end_of_idle_period(p);
+
+	if (p->qavg <= p->qth_min)
+		p->qcount = -1;
+	else {
+		struct sk_buff *oskb;
+
+		/* Draw a packet at random from queue */
+		oskb = skb_peek_random(&sch->q);
+
+		/* Both packets from same flow?
+		 * Assumes skb_get_rxhash already set hash on oskb->rxhash
+		 * prior to queuing
+		 */
+		if (oskb->rxhash == skb_get_rxhash(skb)) {
+			/* Drop both packets */
+			__skb_unlink(oskb, &sch->q);
+			qdisc_drop(oskb, sch);
+			goto congestion_drop;
+		}
+
+		if (p->qavg > p->qth_max) {
+			p->qcount = -1;
+
+			sch->qstats.overlimits++;
+			if (use_harddrop(q) || !use_ecn(q) ||
+			    !INET_ECN_set_ce(skb)) {
+				q->stats.forced_drop++;
+				goto congestion_drop;
+			}
+
+			q->stats.forced_mark++;
+		}
+
+		if (++p->qcount) {
+			if (red_mark_probability(p, p->qavg)) {
+				p->qcount = 0;
+				p->qR = red_random(p);
+
+				sch->qstats.overlimits++;
+				if (!use_ecn(q) || !INET_ECN_set_ce(skb)) {
+					q->stats.prob_drop++;
+					goto congestion_drop;
+				}
+
+				q->stats.prob_mark++;
+			}
+		} else
+			p->qR = red_random(p);
+	}
+
+	/* Admit new packet */
+	if (likely(skb_queue_len(&sch->q) < q->limit))
+		return qdisc_enqueue_tail(skb, sch);
+
+	q->stats.pdrop++;
+	sch->qstats.drops++;
+	kfree_skb(skb);
+	return NET_XMIT_DROP;
+
+ congestion_drop:
+	qdisc_drop(skb, sch);
+	return NET_XMIT_CN;
+}
+
+static struct sk_buff *choke_dequeue(struct Qdisc* sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+
+	skb = qdisc_dequeue_head(sch);
+	if (!skb) {
+		if (!red_is_idling(&q->parms))
+			red_start_of_idle_period(&q->parms);
+	}
+
+	return skb;
+}
+
+static unsigned int choke_drop(struct Qdisc* sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	unsigned int len;
+
+	len = qdisc_queue_drop(sch);
+
+	if (len > 0)
+		q->stats.other++;
+	else {
+		if (!red_is_idling(&q->parms))
+			red_start_of_idle_period(&q->parms);
+	}
+
+	return len;
+}
+
+static void choke_reset(struct Qdisc* sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	red_restart(&q->parms);
+}
+
+static const struct nla_policy choke_policy[TCA_RED_MAX + 1] = {
+	[TCA_RED_PARMS]	= { .len = sizeof(struct tc_red_qopt) },
+	[TCA_RED_STAB]	= { .len = RED_STAB_SIZE },
+};
+
+static int choke_change(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_RED_MAX + 1];
+	struct tc_red_qopt *ctl;
+	int err;
+
+	if (opt == NULL)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_RED_MAX, opt, choke_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_RED_PARMS] == NULL ||
+	    tb[TCA_RED_STAB] == NULL)
+		return -EINVAL;
+
+	ctl = nla_data(tb[TCA_RED_PARMS]);
+
+	sch_tree_lock(sch);
+	q->flags = ctl->flags;
+	q->limit = ctl->limit;
+
+	red_set_parms(&q->parms, ctl->qth_min, ctl->qth_max, ctl->Wlog,
+		      ctl->Plog, ctl->Scell_log,
+		      nla_data(tb[TCA_RED_STAB]));
+
+	if (skb_queue_empty(&sch->q))
+		red_end_of_idle_period(&q->parms);
+
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static int choke_init(struct Qdisc* sch, struct nlattr *opt)
+{
+	return choke_change(sch, opt);
+}
+
+static int choke_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts = NULL;
+	struct tc_red_qopt opt = {
+		.limit		= q->limit,
+		.flags		= q->flags,
+		.qth_min	= q->parms.qth_min >> q->parms.Wlog,
+		.qth_max	= q->parms.qth_max >> q->parms.Wlog,
+		.Wlog		= q->parms.Wlog,
+		.Plog		= q->parms.Plog,
+		.Scell_log	= q->parms.Scell_log,
+	};
+
+	opts = nla_nest_start(skb, TCA_OPTIONS);
+	if (opts == NULL)
+		goto nla_put_failure;
+
+	NLA_PUT(skb, TCA_RED_PARMS, sizeof(opt), &opt);
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -EMSGSIZE;
+}
+
+static int choke_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct tc_red_xstats st = {
+		.early	= q->stats.prob_drop + q->stats.forced_drop,
+		.pdrop	= q->stats.pdrop,
+		.other	= q->stats.other,
+		.marked	= q->stats.prob_mark + q->stats.forced_mark,
+	};
+
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static struct Qdisc_ops choke_qdisc_ops __read_mostly = {
+	.id		=	"choke",
+	.priv_size	=	sizeof(struct choke_sched_data),
+
+	.enqueue	=	choke_enqueue,
+	.dequeue	=	choke_dequeue,
+	.peek		=	qdisc_peek_head,
+	.drop		=	choke_drop,
+	.init		=	choke_init,
+	.reset		=	choke_reset,
+	.change		=	choke_change,
+	.dump		=	choke_dump,
+	.dump_stats	=	choke_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init choke_module_init(void)
+{
+	return register_qdisc(&choke_qdisc_ops);
+}
+
+static void __exit choke_module_exit(void)
+{
+	unregister_qdisc(&choke_qdisc_ops);
+}
+
+module_init(choke_module_init)
+module_exit(choke_module_exit)
+
+MODULE_LICENSE("GPL");

^ permalink raw reply

* Re: [PATCH v3 08/10] ARM: mxs: add ocotp read function
From: Russell King - ARM Linux @ 2011-01-05 18:35 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Jamie Iles, gerg, B32542, netdev, s.hauer, bryan.wu, baruch,
	w.sang, r64343, Shawn Guo, eric, Uwe Kleine-König, davem,
	linux-arm-kernel, lw
In-Reply-To: <20110105175617.GD12222@shareable.org>

On Wed, Jan 05, 2011 at 05:56:17PM +0000, Jamie Lokier wrote:
> cpu_relax() is a hint to the CPU to, for example, save power or be
> less aggressive on the memory bus (to save power or be fairer).
> 
> Currently these architectures do more than just a barrier in cpu_relax():
> x86, IA64, PowerPC, Tile and S390.
> 
> Although it's just a hint on ARM at the moment, it might change in
> future - especially with power mattering on so many ARM systems.
> (Even now, just changing it to a very short udelay might save power
> on existing ARMs without breaking drivers.)

I think that's a matter for what the loop is doing.  If it's polling
a memory location then it probably has no effect what so ever.  If
the loop is spinning on a device, then the CPU will have to wait for
the read to complete which will slow it down.

It's something that would need very careful evaluation, and is probably
something that's very platform and loop specific.

> By the way, I see ARM defines cpu_relax as smp_mb() on arch >= 6.  Is
> that correct and useful?  On other architectures*, barrier() is enough
> of a barrier, but it's conceivable that smp_mb() would have some
> ARM-specific fairness or bus activity benefit - in which case it
> should probably be mb().

See a discussion last year with Linus.  It's there to ensure that one CPU
spinning on a variable can see a write by another CPU to that same
variable.  Without the barrier, the visibility effects are unbounded on
ARMv6 - and it's only like that for ARMv6, not >= ARMv6.

^ permalink raw reply

* Re: [PATCH v2] netfilter: fix the race when initializing nf_ct_expect_hash_rnd
From: David Miller @ 2011-01-05 18:32 UTC (permalink / raw)
  To: eric.dumazet; +Cc: xiaosuo, kaber, netfilter-devel, netdev
In-Reply-To: <1294250598.10633.77.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 05 Jan 2011 19:03:18 +0100

> Le mercredi 05 janvier 2011 à 22:23 +0800, Changli Gao a écrit :
>> Since nf_ct_expect_dst_hash() may be called without nf_conntrack_lock
>> locked, nf_ct_expect_hash_rnd should be initialized in the atomic way.
>> 
>> In this patch, we use nf_conntrack_hash_rnd instead of
>> nf_ct_expect_hash_rnd.
>> 
>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>> ---
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Problem is Patrick seems not responsive these days ;)

As a result, I'll integrate this fix later today.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* ethtool 2.6.37 released
From: Ben Hutchings @ 2011-01-05 18:12 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 535 bytes --]

ethtool version 2.6.37 has been released.

Home page: https://ftp.kernel.org/pub/software/network/ethtool/
Download link:
https://ftp.kernel.org/pub/software/network/ethtool/ethtool-2.6.37.tar.gz

Release notes:

	* Fix: Build fix for distributions with kernel headers from
	  Linux 2.6.9 or earlier

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply

* [RFC v2 PATCH] m68knommu: added dm9000 support
From: Angelo Dureghello @ 2011-01-05 18:03 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Randy Dunlap

This patch allows to use the dm9000 network chip with a m68knommu 
big-endian cpu. From the HW point of view, the cpu data bus connected to 
the dm9000 chip should be hardware-byte-swapped, crossing the bytes 
wires (D0:7 to D24:31, etc.). In anyway, has been also added an option 
to swap the bytes in the driver, if some cpu has been wired straight 
D0:D31 to dm9000.

Signed-off-by: Angelo Dureghello <angelo70@gmail.com>
---

--- linux/drivers/net/Kconfig.orig	2011-01-05 17:11:37.992376124 +0100
+++ linux/drivers/net/Kconfig	2011-01-04 22:33:14.132301872 +0100
@@ -960,7 +960,7 @@ config TI_DAVINCI_EMAC
 
 config DM9000
 	tristate "DM9000 support"
-	depends on ARM || BLACKFIN || MIPS
+	depends on COLDFIRE || ARM || BLACKFIN || MIPS
 	select CRC32
 	select MII
 	---help---
@@ -986,6 +986,14 @@ config DM9000_FORCE_SIMPLE_PHY_POLL
 	  costly MII PHY reads. Note, this will not work if the chip is
 	  operating with an external PHY.
 
+config DM9000_32BIT_SW_SWAP
+	bool "Software byte swap for 32 bit data bus"
+	depends on DM9000 && COLDFIRE
+	---help---
+	  This configuration allows to swap data bytes from the dm9000
+	  driver itself, when the big endian cpu is wired straight to
+	  the dm9000 32 bit data bus.
+
 config ENC28J60
 	tristate "ENC28J60 support"
 	depends on EXPERIMENTAL && SPI && NET_ETHERNET
@@ -3347,4 +3355,3 @@ config VMXNET3
          module will be called vmxnet3.
 
 endif # NETDEVICES
-

--- linux/drivers/net/dm9000.c.orig	2010-12-30 23:19:39.747836070 +0100
+++ linux/drivers/net/dm9000.c	2011-01-05 16:30:48.636116500 +0100
@@ -158,9 +158,17 @@ dm9000_reset(board_info_t * db)
 	dev_dbg(db->dev, "resetting device\n");
 
 	/* RESET device */
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writel(DM9000_NCR, db->io_addr);
+#else
 	writeb(DM9000_NCR, db->io_addr);
+#endif
 	udelay(200);
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writel(NCR_RST, db->io_data);
+#else
 	writeb(NCR_RST, db->io_data);
+#endif
 	udelay(200);
 }
 
@@ -170,8 +178,13 @@ dm9000_reset(board_info_t * db)
 static u8
 ior(board_info_t * db, int reg)
 {
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writel(reg, db->io_addr);
+	return (u8)readl(db->io_data);
+#else
 	writeb(reg, db->io_addr);
 	return readb(db->io_data);
+#endif
 }
 
 /*
@@ -181,43 +194,72 @@ ior(board_info_t * db, int reg)
 static void
 iow(board_info_t * db, int reg, int value)
 {
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writel(reg, db->io_addr);
+	writel(value, db->io_data);
+#else
 	writeb(reg, db->io_addr);
 	writeb(value, db->io_data);
+#endif
 }
 
 /* routines for sending block to chip */
 
 static void dm9000_outblk_8bit(void __iomem *reg, void *data, int count)
 {
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writesbsw(reg, data, count);
+#else
 	writesb(reg, data, count);
+#endif
 }
 
 static void dm9000_outblk_16bit(void __iomem *reg, void *data, int count)
 {
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writeswsw(reg, data, (count+1) >> 1);
+#else
 	writesw(reg, data, (count+1) >> 1);
+#endif
 }
 
 static void dm9000_outblk_32bit(void __iomem *reg, void *data, int count)
 {
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writeslsw(reg, data, (count+3) >> 2);
+#else
 	writesl(reg, data, (count+3) >> 2);
+#endif
 }
 
 /* input block from chip to memory */
 
 static void dm9000_inblk_8bit(void __iomem *reg, void *data, int count)
 {
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	readsbsw(reg, data, count);
+#else
 	readsb(reg, data, count);
+#endif
 }
 
 
 static void dm9000_inblk_16bit(void __iomem *reg, void *data, int count)
 {
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	readswsw(reg, data, (count+1) >> 1);
+#else
 	readsw(reg, data, (count+1) >> 1);
+#endif
 }
 
 static void dm9000_inblk_32bit(void __iomem *reg, void *data, int count)
 {
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	readslsw(reg, data, (count+3) >> 2);
+#else
 	readsl(reg, data, (count+3) >> 2);
+#endif
 }
 
 /* dump block from chip to null */
@@ -863,8 +905,13 @@ static void dm9000_timeout(struct net_de
 	netif_wake_queue(dev);
 
 	/* Restore previous register address */
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writel(reg_save, db->io_addr);
+#else
 	writeb(reg_save, db->io_addr);
-	spin_unlock_irqrestore(&db->lock, flags);
+#endif
+
+	spin_unlock_irqrestore(&db->lock,flags);
 }
 
 static void dm9000_send_packet(struct net_device *dev,
@@ -908,7 +955,11 @@ dm9000_start_xmit(struct sk_buff *skb, s
 	spin_lock_irqsave(&db->lock, flags);
 
 	/* Move data to DM9000 TX RAM */
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+   writel(DM9000_MWCMD, db->io_addr);
+#else
 	writeb(DM9000_MWCMD, db->io_addr);
+#endif	
 
 	(db->outblk)(db->io_data, skb->data, skb->len);
 	dev->stats.tx_bytes += skb->len;
@@ -981,7 +1032,11 @@ dm9000_rx(struct net_device *dev)
 		ior(db, DM9000_MRCMDX);	/* Dummy read */
 
 		/* Get most updated data */
-		rxbyte = readb(db->io_data);
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+      rxbyte = (u8)readl(db->io_data);
+#else
+      rxbyte = readb(db->io_data);
+#endif
 
 		/* Status check: this byte must be 0 or 1 */
 		if (rxbyte & DM9000_PKT_ERR) {
@@ -996,8 +1051,13 @@ dm9000_rx(struct net_device *dev)
 
 		/* A packet ready now  & Get status/length */
 		GoodPacket = true;
-		writeb(DM9000_MRCMD, db->io_addr);
 
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+		writel(DM9000_MRCMD, db->io_addr);
+#else
+      writeb(DM9000_MRCMD, db->io_addr);
+#endif		
+		
 		(db->inblk)(db->io_data, &rxhdr, sizeof(rxhdr));
 
 		RxLen = le16_to_cpu(rxhdr.RxLen);
@@ -1077,7 +1137,7 @@ static irqreturn_t dm9000_interrupt(int
 	unsigned long flags;
 	u8 reg_save;
 
-	dm9000_dbg(db, 3, "entering %s\n", __func__);
+	//dm9000_dbg(db, 3, "entering %s\n", __func__);
 
 	/* A real interrupt coming */
 
@@ -1085,7 +1145,11 @@ static irqreturn_t dm9000_interrupt(int
 	spin_lock_irqsave(&db->lock, flags);
 
 	/* Save previous register address */
-	reg_save = readb(db->io_addr);
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+   reg_save = (u8)readl(db->io_addr);
+#else
+   reg_save = readb(db->io_addr);
+#endif
 
 	/* Disable all interrupts */
 	iow(db, DM9000_IMR, IMR_PAR);
@@ -1100,7 +1164,7 @@ static irqreturn_t dm9000_interrupt(int
 	/* Received the coming packet */
 	if (int_status & ISR_PRS)
 		dm9000_rx(dev);
-
+		
 	/* Trnasmit Interrupt check */
 	if (int_status & ISR_PTS)
 		dm9000_tx_done(dev, db);
@@ -1116,8 +1180,12 @@ static irqreturn_t dm9000_interrupt(int
 	iow(db, DM9000_IMR, db->imr_all);
 
 	/* Restore previous register address */
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writel(reg_save, db->io_addr);
+#else
 	writeb(reg_save, db->io_addr);
-
+#endif
+	
 	spin_unlock_irqrestore(&db->lock, flags);
 
 	return IRQ_HANDLED;
@@ -1233,11 +1301,15 @@ dm9000_phy_read(struct net_device *dev,
 	int ret;
 
 	mutex_lock(&db->addr_lock);
-
+	
 	spin_lock_irqsave(&db->lock,flags);
 
 	/* Save previous register address */
-	reg_save = readb(db->io_addr);
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+   reg_save = (u8)readl(db->io_addr);
+#else
+   reg_save = readb(db->io_addr);
+#endif
 
 	/* Fill the phyxcer register into REG_0C */
 	iow(db, DM9000_EPAR, DM9000_PHY | reg);
@@ -1250,7 +1322,11 @@ dm9000_phy_read(struct net_device *dev,
 	dm9000_msleep(db, 1);		/* Wait read complete */
 
 	spin_lock_irqsave(&db->lock,flags);
-	reg_save = readb(db->io_addr);
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+   reg_save = (u8)readl(db->io_addr);
+#else
+   reg_save = readb(db->io_addr);
+#endif
 
 	iow(db, DM9000_EPCR, 0x0);	/* Clear phyxcer read command */
 
@@ -1258,9 +1334,14 @@ dm9000_phy_read(struct net_device *dev,
 	ret = (ior(db, DM9000_EPDRH) << 8) | ior(db, DM9000_EPDRL);
 
 	/* restore the previous address */
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writel(reg_save, db->io_addr);
+#else
 	writeb(reg_save, db->io_addr);
-	spin_unlock_irqrestore(&db->lock,flags);
+#endif
 
+	spin_unlock_irqrestore(&db->lock,flags);
+	
 	mutex_unlock(&db->addr_lock);
 
 	dm9000_dbg(db, 5, "phy_read[%02x] -> %04x\n", reg, ret);
@@ -1284,7 +1365,11 @@ dm9000_phy_write(struct net_device *dev,
 	spin_lock_irqsave(&db->lock,flags);
 
 	/* Save previous register address */
-	reg_save = readb(db->io_addr);
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+   reg_save = (u8)readl(db->io_addr);
+#else
+   reg_save = readb(db->io_addr);
+#endif
 
 	/* Fill the phyxcer register into REG_0C */
 	iow(db, DM9000_EPAR, DM9000_PHY | reg);
@@ -1295,18 +1380,31 @@ dm9000_phy_write(struct net_device *dev,
 
 	iow(db, DM9000_EPCR, EPCR_EPOS | EPCR_ERPRW);	/* Issue phyxcer write command */
 
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writel(reg_save, db->io_addr);
+#else
 	writeb(reg_save, db->io_addr);
+#endif
+
 	spin_unlock_irqrestore(&db->lock, flags);
 
 	dm9000_msleep(db, 1);		/* Wait write complete */
 
 	spin_lock_irqsave(&db->lock,flags);
-	reg_save = readb(db->io_addr);
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+   reg_save = (u8)readl(db->io_addr);
+#else
+   reg_save = readb(db->io_addr);
+#endif
 
 	iow(db, DM9000_EPCR, 0x0);	/* Clear phyxcer write command */
 
 	/* restore the previous address */
+#ifdef CONFIG_DM9000_32BIT_SW_SWAP
+	writel(reg_save, db->io_addr);
+#else
 	writeb(reg_save, db->io_addr);
+#endif
 
 	spin_unlock_irqrestore(&db->lock, flags);
 	mutex_unlock(&db->addr_lock);
@@ -1713,4 +1811,3 @@ MODULE_AUTHOR("Sascha Hauer, Ben Dooks")
 MODULE_DESCRIPTION("Davicom DM9000 network driver");
 MODULE_LICENSE("GPL");
 MODULE_ALIAS("platform:dm9000");
-


--- linux/arch/m68k/include/asm/io_no.h.orig	2011-01-05 16:53:55.904905038 +0100
+++ linux/arch/m68k/include/asm/io_no.h	2011-01-04 23:45:08.893049554 +0100
@@ -47,6 +47,91 @@ static inline unsigned int _swapl(volati
 #define writew(b,addr) (void)((*(volatile unsigned short *) (addr)) = (b))
 #define writel(b,addr) (void)((*(volatile unsigned int *) (addr)) = (b))
 
+static inline void writesb (void __iomem *reg, void *data, int count)
+{
+	unsigned char *p = (unsigned char*) data;
+
+	while (count--) writeb(*p++, reg);
+}
+
+static inline void writesbsw (void __iomem *reg, void *data, int count)
+{
+	unsigned char *p = (unsigned char *) data;
+
+	while (count--) writel((int)(*p++), reg);
+}
+
+static inline void writesw (void __iomem *reg, void *data, int count)
+{
+   unsigned short *p = (unsigned short*) data;
+
+   while (count--) writew(*p++, reg);
+}
+
+static inline void writeswsw (void __iomem *reg, void *data, int count)
+{
+   unsigned short *p = (unsigned short *) data;
+
+   while (count--) writel((int)(_swapw(*p++)), reg);
+}
+
+static inline void writesl (void __iomem *reg, void *data, int count)
+{
+   unsigned long *p = (unsigned long*) data;
+
+   while (count--) writel(*p++, reg);
+}
+
+static inline void writeslsw (void __iomem *reg, void *data, int count)
+{
+   unsigned long *p = (unsigned long *) data;
+
+   while (count--) writel((int)(_swapl(*p++)), reg);
+}
+
+static inline void readsb (void __iomem *reg, void *data, int count)
+{
+   unsigned char *p = (unsigned char *) data;
+
+   while (count--) *p++ = readb(reg);
+}
+
+static inline void readsbsw (void __iomem *reg, void *data, int count)
+{
+   unsigned char *p = (unsigned char *) data;
+
+   while (count--) *p++ = (unsigned char)readl(reg);
+}
+
+static inline void readsw (void __iomem *reg, void *data, int count)
+{
+   unsigned short *p = (unsigned short *) data;
+
+   while (count--) *p++ = readb(reg);
+}
+
+static inline void readswsw (void __iomem *reg, void *data, int count)
+{
+   unsigned short *p = (unsigned short *) data;
+
+   while (count--) *p++ = _swapw((unsigned short)readw(reg));
+}
+
+static inline void readsl (void __iomem *reg, void *data, int count)
+{
+   unsigned long *p = (unsigned long *) data;
+
+   while (count--) *p++ = readb(reg);
+}
+
+static inline void readslsw (void __iomem *reg, void *data, int count)
+{
+   unsigned long *p = (unsigned long *) data;
+
+   while (count--) *p++ = _swapl(readl(reg));
+}
+
+
 #define __raw_readb readb
 #define __raw_readw readw
 #define __raw_readl readl
@@ -180,4 +265,3 @@ extern void iounmap(void *addr);
 #endif /* __KERNEL__ */
 
 #endif /* _M68KNOMMU_IO_H */
-

^ permalink raw reply

* Re: [PATCH v2] netfilter: fix the race when initializing nf_ct_expect_hash_rnd
From: Eric Dumazet @ 2011-01-05 18:03 UTC (permalink / raw)
  To: Changli Gao; +Cc: Patrick McHardy, netfilter-devel, David S. Miller, netdev
In-Reply-To: <1294237403-15616-1-git-send-email-xiaosuo@gmail.com>

Le mercredi 05 janvier 2011 à 22:23 +0800, Changli Gao a écrit :
> Since nf_ct_expect_dst_hash() may be called without nf_conntrack_lock
> locked, nf_ct_expect_hash_rnd should be initialized in the atomic way.
> 
> In this patch, we use nf_conntrack_hash_rnd instead of
> nf_ct_expect_hash_rnd.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ---

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Problem is Patrick seems not responsive these days ;)



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 08/10] ARM: mxs: add ocotp read function
From: Jamie Lokier @ 2011-01-05 17:56 UTC (permalink / raw)
  To: Jamie Iles
  Cc: gerg, B32542, netdev, s.hauer, bryan.wu, baruch, w.sang, r64343,
	Shawn Guo, eric, Uwe Kleine-König, davem, linux-arm-kernel,
	lw
In-Reply-To: <20110105172501.GB2112@gallagher>

Jamie Iles wrote:
> On Wed, Jan 05, 2011 at 05:44:09PM +0100, Uwe Kleine-König wrote:
> > Hello Jamie,
> > On Wed, Jan 05, 2011 at 04:16:46PM +0000, Jamie Iles wrote:
> > > On Wed, Jan 05, 2011 at 10:07:35PM +0800, Shawn Guo wrote:
> > > > +	/* check both BUSY and ERROR cleared */
> > > > +	while ((__raw_readl(ocotp_base) &
> > > > +		(BM_OCOTP_CTRL_BUSY | BM_OCOTP_CTRL_ERROR)) && --timeout)
> > > > +		/* nothing */;
> > > 
> > > Is it worth using cpu_relax() in these polling loops?
> > I don't know what cpu_relax does for other platforms, but on ARM it's
> > just a memory barrier which AFAICT doesn't help here at all (which
> > doesn't need to be correct).  Why do you think it would be better?
> 
> Well I don't see that there's anything broken without cpu_relax() but 
> using cpu_relax() seems to be the most common way of doing busy polling 
> loops that I've seen. It's also a bit easier to read than a comment and 
> semi-colon. Perhaps it's just personal preference.

cpu_relax() is a hint to the CPU to, for example, save power or be
less aggressive on the memory bus (to save power or be fairer).

Currently these architectures do more than just a barrier in cpu_relax():
x86, IA64, PowerPC, Tile and S390.

Although it's just a hint on ARM at the moment, it might change in
future - especially with power mattering on so many ARM systems.
(Even now, just changing it to a very short udelay might save power
on existing ARMs without breaking drivers.)


By the way, I see ARM defines cpu_relax as smp_mb() on arch >= 6.  Is
that correct and useful?  On other architectures*, barrier() is enough
of a barrier, but it's conceivable that smp_mb() would have some
ARM-specific fairness or bus activity benefit - in which case it
should probably be mb().

* - except Blackfin, which historically derived a lot from ARM headers.

-- Jamie

^ permalink raw reply

* Re: [PATCH] ipv4: IP defragmentation must be ECN aware
From: Eric Dumazet @ 2011-01-05 17:52 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20110105094830.63a68230@nehalam>

Le mercredi 05 janvier 2011 à 09:48 -0800, Stephen Hemminger a écrit :

> At least make it unsigned int?
> 

Let's keep it simple and use u8 as you suggested ;)

[PATCH v2] ipv4: IP defragmentation must be ECN aware

RFC3168 (The Addition of Explicit Congestion Notification to IP)
states :

5.3.  Fragmentation

   ECN-capable packets MAY have the DF (Don't Fragment) bit set.
   Reassembly of a fragmented packet MUST NOT lose indications of
   congestion.  In other words, if any fragment of an IP packet to be
   reassembled has the CE codepoint set, then one of two actions MUST be
   taken:

      * Set the CE codepoint on the reassembled packet.  However, this
        MUST NOT occur if any of the other fragments contributing to
        this reassembly carries the Not-ECT codepoint.

      * The packet is dropped, instead of being reassembled, for any
        other reason.

This patch implements this requirement for IPv4, choosing the first
action : 

If one fragment had NO-ECT codepoint
        reassembled frame has NO-ECT
ElIf one fragment had CE codepoint
        reassembled frame has CE

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
Use u8 instead of int in ip4_frag_ecn()
 net/ipv4/ip_fragment.c |   34 ++++++++++++++++++++++++++++++++++
 1 files changed, 34 insertions(+)


diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index e6215bd..e6b53a7 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -45,6 +45,7 @@
 #include <linux/udp.h>
 #include <linux/inet.h>
 #include <linux/netfilter_ipv4.h>
+#include <net/inet_ecn.h>
 
 /* NOTE. Logic of IP defragmentation is parallel to corresponding IPv6
  * code now. If you change something here, _PLEASE_ update ipv6/reassembly.c
@@ -70,11 +71,28 @@ struct ipq {
 	__be32		daddr;
 	__be16		id;
 	u8		protocol;
+	u8		ecn; /* RFC3168 support */
 	int             iif;
 	unsigned int    rid;
 	struct inet_peer *peer;
 };
 
+#define IPFRAG_ECN_CLEAR  0x01 /* one frag had INET_ECN_NOT_ECT */
+#define IPFRAG_ECN_SET_CE 0x04 /* one frag had INET_ECN_CE */
+
+static inline u8 ip4_frag_ecn(u8 tos)
+{
+	tos = (tos & INET_ECN_MASK) + 1;
+	/*
+	 * After the last operation we have (in binary):
+	 * INET_ECN_NOT_ECT => 001
+	 * INET_ECN_ECT_1   => 010
+	 * INET_ECN_ECT_0   => 011
+	 * INET_ECN_CE      => 100
+	 */
+	return (tos & 2) ? 0 : tos;
+}
+
 static struct inet_frags ip4_frags;
 
 int ip_frag_nqueues(struct net *net)
@@ -137,6 +155,7 @@ static void ip4_frag_init(struct inet_frag_queue *q, void *a)
 
 	qp->protocol = arg->iph->protocol;
 	qp->id = arg->iph->id;
+	qp->ecn = ip4_frag_ecn(arg->iph->tos);
 	qp->saddr = arg->iph->saddr;
 	qp->daddr = arg->iph->daddr;
 	qp->user = arg->user;
@@ -316,6 +335,7 @@ static int ip_frag_reinit(struct ipq *qp)
 	qp->q.fragments = NULL;
 	qp->q.fragments_tail = NULL;
 	qp->iif = 0;
+	qp->ecn = 0;
 
 	return 0;
 }
@@ -328,6 +348,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	int flags, offset;
 	int ihl, end;
 	int err = -ENOENT;
+	u8 ecn;
 
 	if (qp->q.last_in & INET_FRAG_COMPLETE)
 		goto err;
@@ -339,6 +360,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		goto err;
 	}
 
+	ecn = ip4_frag_ecn(ip_hdr(skb)->tos);
 	offset = ntohs(ip_hdr(skb)->frag_off);
 	flags = offset & ~IP_OFFSET;
 	offset &= IP_OFFSET;
@@ -472,6 +494,7 @@ found:
 	}
 	qp->q.stamp = skb->tstamp;
 	qp->q.meat += skb->len;
+	qp->ecn |= ecn;
 	atomic_add(skb->truesize, &qp->q.net->mem);
 	if (offset == 0)
 		qp->q.last_in |= INET_FRAG_FIRST_IN;
@@ -583,6 +606,17 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	iph = ip_hdr(head);
 	iph->frag_off = 0;
 	iph->tot_len = htons(len);
+	/* RFC3168 5.3 Fragmentation support
+	 * If one fragment had INET_ECN_NOT_ECT,
+	 *	reassembled frame also has INET_ECN_NOT_ECT
+	 * Elif one fragment had INET_ECN_CE
+	 *	reassembled frame also has INET_ECN_CE
+	 */
+	if (qp->ecn & IPFRAG_ECN_CLEAR)
+		iph->tos &= ~INET_ECN_MASK;
+	else if (qp->ecn & IPFRAG_ECN_SET_CE)
+		iph->tos |= INET_ECN_CE;
+
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
 	qp->q.fragments_tail = NULL;



^ permalink raw reply related

* Re: [PATCH] ipv4: IP defragmentation must be ECN aware
From: Stephen Hemminger @ 2011-01-05 17:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1294249261.10633.54.camel@edumazet-laptop>

On Wed, 05 Jan 2011 18:41:01 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mercredi 05 janvier 2011 à 09:13 -0800, Stephen Hemminger a écrit :
> > On Wed, 05 Jan 2011 14:59:02 +0100
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > +
> > > +static inline int ip4_frag_ecn(int tos)
> > 
> > Since tos is only a byte, this should be:
> > 
> > static inline u8 ip4_frag_ecn(u8 tos)
> > 
> > 
> 
> In fact, generated code is the same on x86, but some arches have faster
> arithmetic on WORD units.
> 
> And I added the 'inline' because on x86_64 gcc, compiler chose _not_ to
> inline this 6 instruction sequence ! Code was much larger.
> 
> 31 d2                   xor    %edx,%edx
> 0f b6 40 01             movzbl 0x1(%rax),%eax
> 83 e0 03                and    $0x3,%eax
> ff c0                   inc    %eax
> a8 02                   test   $0x2,%al
> 0f 44 d0                cmove  %eax,%edx
> 
> 
> We do roughly the same (working on WORD arith) in 
>
At least make it unsigned int?

-- 

^ permalink raw reply

* Re: [net-next-2.6 PATCH v5 2/2] net_sched: implement a root container qdisc sch_mqprio
From: Stephen Hemminger @ 2011-01-05 17:47 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jarek Poplawski, davem@davemloft.net, hadi@cyberus.ca,
	tgraf@infradead.org, eric.dumazet@gmail.com,
	bhutchings@solarflare.com, nhorman@tuxdriver.com,
	netdev@vger.kernel.org
In-Reply-To: <4D24ACA4.3000301@intel.com>

On Wed, 05 Jan 2011 09:38:44 -0800
John Fastabend <john.r.fastabend@intel.com> wrote:

> On 1/4/2011 2:59 PM, Jarek Poplawski wrote:
> > On Tue, Jan 04, 2011 at 10:56:46AM -0800, John Fastabend wrote:
> >> This implements a mqprio queueing discipline that by default creates
> >> a pfifo_fast qdisc per tx queue and provides the needed configuration
> >> interface.
> >>
> >> Using the mqprio qdisc the number of tcs currently in use along
> >> with the range of queues alloted to each class can be configured. By
> >> default skbs are mapped to traffic classes using the skb priority.
> >> This mapping is configurable.
> >>
> >> Configurable parameters,
> >>
> >> struct tc_mqprio_qopt {
> >>         __u8    num_tc;
> >>         __u8    prio_tc_map[TC_BITMASK + 1];
> >>         __u8    hw;
> >>         __u16   count[TC_MAX_QUEUE];
> >>         __u16   offset[TC_MAX_QUEUE];
> >> };
> >>
> >> Here the count/offset pairing give the queue alignment and the
> >> prio_tc_map gives the mapping from skb->priority to tc.
> >>
> >> The hw bit determines if the hardware should configure the count
> >> and offset values. If the hardware bit is set then the operation
> >> will fail if the hardware does not implement the ndo_setup_tc
> >> operation. This is to avoid undetermined states where the hardware
> >> may or may not control the queue mapping. Also minimal bounds
> >> checking is done on the count/offset to verify a queue does not
> >> exceed num_tx_queues and that queue ranges do not overlap. Otherwise
> >> it is left to user policy or hardware configuration to create
> >> useful mappings.
> >>
> >> It is expected that hardware QOS schemes can be implemented by
> >> creating appropriate mappings of queues in ndo_tc_setup().
> >>
> >> One expected use case is drivers will use the ndo_setup_tc to map
> >> queue ranges onto 802.1Q traffic classes. This provides a generic
> >> mechanism to map network traffic onto these traffic classes and
> >> removes the need for lower layer drivers to know specifics about
> >> traffic types.
> >>
> >> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> >> ---
> >>
> >>  include/linux/netdevice.h |    3
> >>  include/linux/pkt_sched.h |   10 +
> >>  net/sched/Kconfig         |   12 +
> >>  net/sched/Makefile        |    1
> >>  net/sched/sch_generic.c   |    4
> >>  net/sched/sch_mqprio.c    |  413 +++++++++++++++++++++++++++++++++++++++++++++
> >>  6 files changed, 443 insertions(+), 0 deletions(-)
> >>  create mode 100644 net/sched/sch_mqprio.c
> >>
> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> index ae51323..19a855b 100644
> >> --- a/include/linux/netdevice.h
> >> +++ b/include/linux/netdevice.h
> >> @@ -764,6 +764,8 @@ struct netdev_tc_txq {
> >>   * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
> >>   *                     struct nlattr *port[]);
> >>   * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
> >> + *
> >> + * int (*ndo_setup_tc)(struct net_device *dev, int tc);
> >
> >  * int (*ndo_setup_tc)(struct net_device *dev, u8 tc);
> >
> >>   */
> >>  #define HAVE_NET_DEVICE_OPS
> >>  struct net_device_ops {
> >> @@ -822,6 +824,7 @@ struct net_device_ops {
> >>                                                  struct nlattr *port[]);
> >>       int                     (*ndo_get_vf_port)(struct net_device *dev,
> >>                                                  int vf, struct sk_buff *skb);
> >> +     int                     (*ndo_setup_tc)(struct net_device *dev, u8 tc);
> >>  #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
> >>       int                     (*ndo_fcoe_enable)(struct net_device *dev);
> >>       int                     (*ndo_fcoe_disable)(struct net_device *dev);
> >> diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
> >> index 2cfa4bc..1c5310a 100644
> >> --- a/include/linux/pkt_sched.h
> >> +++ b/include/linux/pkt_sched.h
> >> @@ -2,6 +2,7 @@
> >>  #define __LINUX_PKT_SCHED_H
> >>
> >>  #include <linux/types.h>
> >> +#include <linux/netdevice.h>
> >

This won't be acceptable.

All the TC api needs to be in linux/pkt_sched.h.

I regularly take the sanitized headers from kernel version an put them
in iproute2 source.



-- 

^ permalink raw reply

* Re: [PATCH] net: ixp4xx_eth: Return proper error for eth_init_one
From: Krzysztof Halasa @ 2011-01-05 17:43 UTC (permalink / raw)
  To: Axel Lin; +Cc: linux-kernel, David S. Miller, netdev
In-Reply-To: <1294205046.8294.2.camel@mola>

Axel Lin <axel.lin@gmail.com> writes:

> Return PTR_ERR(port->phydev) instead of 1 if phy_connect failed.
>
> Signed-off-by: Axel Lin <axel.lin@gmail.com>

> --- a/drivers/net/arm/ixp4xx_eth.c
> +++ b/drivers/net/arm/ixp4xx_eth.c
> @@ -1229,8 +1229,10 @@ static int __devinit eth_init_one(struct platform_device *pdev)
>  	snprintf(phy_id, MII_BUS_ID_SIZE + 3, PHY_ID_FMT, "0", plat->phy);
>  	port->phydev = phy_connect(dev, phy_id, &ixp4xx_adjust_link, 0,
>  				   PHY_INTERFACE_MODE_MII);
> -	if ((err = IS_ERR(port->phydev)))
> +	if (IS_ERR(port->phydev)) {
> +		err = PTR_ERR(port->phydev);
>  		goto err_free_mem;
> +	}
>  
>  	port->phydev->irq = PHY_POLL;

Right. Thanks.

Acked-by: Krzysztof Halasa <khc@pm.waw.pl>
-- 
Krzysztof Halasa

^ permalink raw reply

* Re: [PATCH] ipv4: IP defragmentation must be ECN aware
From: Eric Dumazet @ 2011-01-05 17:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20110105091340.3f8833ef@nehalam>

Le mercredi 05 janvier 2011 à 09:13 -0800, Stephen Hemminger a écrit :
> On Wed, 05 Jan 2011 14:59:02 +0100
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > +
> > +static inline int ip4_frag_ecn(int tos)
> 
> Since tos is only a byte, this should be:
> 
> static inline u8 ip4_frag_ecn(u8 tos)
> 
> 

In fact, generated code is the same on x86, but some arches have faster
arithmetic on WORD units.

And I added the 'inline' because on x86_64 gcc, compiler chose _not_ to
inline this 6 instruction sequence ! Code was much larger.

31 d2                   xor    %edx,%edx
0f b6 40 01             movzbl 0x1(%rax),%eax
83 e0 03                and    $0x3,%eax
ff c0                   inc    %eax
a8 02                   test   $0x2,%al
0f 44 d0                cmove  %eax,%edx


We do roughly the same (working on WORD arith) in 

static inline int IP_ECN_set_ce(struct iphdr *iph)
{
u32 ecn = (iph->tos + 1) & INET_ECN_MASK;
...
}

What others think ? I have no real strong opinion.



^ permalink raw reply

* Re: [net-next-2.6 PATCH v5 2/2] net_sched: implement a root container qdisc sch_mqprio
From: John Fastabend @ 2011-01-05 17:38 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: davem@davemloft.net, hadi@cyberus.ca, shemminger@vyatta.com,
	tgraf@infradead.org, eric.dumazet@gmail.com,
	bhutchings@solarflare.com, nhorman@tuxdriver.com,
	netdev@vger.kernel.org
In-Reply-To: <20110104225936.GA2030@del.dom.local>

On 1/4/2011 2:59 PM, Jarek Poplawski wrote:
> On Tue, Jan 04, 2011 at 10:56:46AM -0800, John Fastabend wrote:
>> This implements a mqprio queueing discipline that by default creates
>> a pfifo_fast qdisc per tx queue and provides the needed configuration
>> interface.
>>
>> Using the mqprio qdisc the number of tcs currently in use along
>> with the range of queues alloted to each class can be configured. By
>> default skbs are mapped to traffic classes using the skb priority.
>> This mapping is configurable.
>>
>> Configurable parameters,
>>
>> struct tc_mqprio_qopt {
>>         __u8    num_tc;
>>         __u8    prio_tc_map[TC_BITMASK + 1];
>>         __u8    hw;
>>         __u16   count[TC_MAX_QUEUE];
>>         __u16   offset[TC_MAX_QUEUE];
>> };
>>
>> Here the count/offset pairing give the queue alignment and the
>> prio_tc_map gives the mapping from skb->priority to tc.
>>
>> The hw bit determines if the hardware should configure the count
>> and offset values. If the hardware bit is set then the operation
>> will fail if the hardware does not implement the ndo_setup_tc
>> operation. This is to avoid undetermined states where the hardware
>> may or may not control the queue mapping. Also minimal bounds
>> checking is done on the count/offset to verify a queue does not
>> exceed num_tx_queues and that queue ranges do not overlap. Otherwise
>> it is left to user policy or hardware configuration to create
>> useful mappings.
>>
>> It is expected that hardware QOS schemes can be implemented by
>> creating appropriate mappings of queues in ndo_tc_setup().
>>
>> One expected use case is drivers will use the ndo_setup_tc to map
>> queue ranges onto 802.1Q traffic classes. This provides a generic
>> mechanism to map network traffic onto these traffic classes and
>> removes the need for lower layer drivers to know specifics about
>> traffic types.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>
>>  include/linux/netdevice.h |    3
>>  include/linux/pkt_sched.h |   10 +
>>  net/sched/Kconfig         |   12 +
>>  net/sched/Makefile        |    1
>>  net/sched/sch_generic.c   |    4
>>  net/sched/sch_mqprio.c    |  413 +++++++++++++++++++++++++++++++++++++++++++++
>>  6 files changed, 443 insertions(+), 0 deletions(-)
>>  create mode 100644 net/sched/sch_mqprio.c
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index ae51323..19a855b 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -764,6 +764,8 @@ struct netdev_tc_txq {
>>   * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
>>   *                     struct nlattr *port[]);
>>   * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
>> + *
>> + * int (*ndo_setup_tc)(struct net_device *dev, int tc);
>
>  * int (*ndo_setup_tc)(struct net_device *dev, u8 tc);
>
>>   */
>>  #define HAVE_NET_DEVICE_OPS
>>  struct net_device_ops {
>> @@ -822,6 +824,7 @@ struct net_device_ops {
>>                                                  struct nlattr *port[]);
>>       int                     (*ndo_get_vf_port)(struct net_device *dev,
>>                                                  int vf, struct sk_buff *skb);
>> +     int                     (*ndo_setup_tc)(struct net_device *dev, u8 tc);
>>  #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
>>       int                     (*ndo_fcoe_enable)(struct net_device *dev);
>>       int                     (*ndo_fcoe_disable)(struct net_device *dev);
>> diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
>> index 2cfa4bc..1c5310a 100644
>> --- a/include/linux/pkt_sched.h
>> +++ b/include/linux/pkt_sched.h
>> @@ -2,6 +2,7 @@
>>  #define __LINUX_PKT_SCHED_H
>>
>>  #include <linux/types.h>
>> +#include <linux/netdevice.h>
>
> This should better be consulted with Stephen wrt. iproute patch.

OK. Stephen is there a better way to do this? Possibly push the TC_xxx defines into a linux/if_* header? But that doesn't seem right either. I'll poke around some to see if this can be avoided.

>
>>
>>  /* Logical priority bands not depending on specific packet scheduler.
>>     Every scheduler will map them to real traffic classes, if it has
>> @@ -481,4 +482,13 @@ struct tc_drr_stats {
>>       __u32   deficit;
>>  };
>>
>> +/* MQPRIO */
>> +struct tc_mqprio_qopt {
>> +     __u8    num_tc;
>> +     __u8    prio_tc_map[TC_BITMASK + 1];
>> +     __u8    hw;
>> +     __u16   count[TC_MAX_QUEUE];
>> +     __u16   offset[TC_MAX_QUEUE];
>> +};
>> +
>>  #endif
>> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
>> index a36270a..f52f5eb 100644
>> --- a/net/sched/Kconfig
>> +++ b/net/sched/Kconfig
>> @@ -205,6 +205,18 @@ config NET_SCH_DRR
>>
>>         If unsure, say N.
>>
>> +config NET_SCH_MQPRIO
>> +     tristate "Multi-queue priority scheduler (MQPRIO)"
>> +     help
>> +       Say Y here if you want to use the Multi-queue Priority scheduler.
>> +       This scheduler allows QOS to be offloaded on NICs that have support
>> +       for offloading QOS schedulers.
>> +
>> +       To compile this driver as a module, choose M here: the module will
>> +       be called sch_mqprio.
>> +
>> +       If unsure, say N.
>> +
>>  config NET_SCH_INGRESS
>>       tristate "Ingress Qdisc"
>>       depends on NET_CLS_ACT
>> diff --git a/net/sched/Makefile b/net/sched/Makefile
>> index 960f5db..26ce681 100644
>> --- a/net/sched/Makefile
>> +++ b/net/sched/Makefile
>> @@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)        += sch_multiq.o
>>  obj-$(CONFIG_NET_SCH_ATM)    += sch_atm.o
>>  obj-$(CONFIG_NET_SCH_NETEM)  += sch_netem.o
>>  obj-$(CONFIG_NET_SCH_DRR)    += sch_drr.o
>> +obj-$(CONFIG_NET_SCH_MQPRIO) += sch_mqprio.o
>>  obj-$(CONFIG_NET_CLS_U32)    += cls_u32.o
>>  obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
>>  obj-$(CONFIG_NET_CLS_FW)     += cls_fw.o
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 34dc598..723b278 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -540,6 +540,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
>>       .dump           =       pfifo_fast_dump,
>>       .owner          =       THIS_MODULE,
>>  };
>> +EXPORT_SYMBOL(pfifo_fast_ops);
>>
>>  struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
>>                         struct Qdisc_ops *ops)
>> @@ -674,6 +675,7 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>>
>>       return oqdisc;
>>  }
>> +EXPORT_SYMBOL(dev_graft_qdisc);
>>
>>  static void attach_one_default_qdisc(struct net_device *dev,
>>                                    struct netdev_queue *dev_queue,
>> @@ -761,6 +763,7 @@ void dev_activate(struct net_device *dev)
>>               dev_watchdog_up(dev);
>>       }
>>  }
>> +EXPORT_SYMBOL(dev_activate);
>>
>>  static void dev_deactivate_queue(struct net_device *dev,
>>                                struct netdev_queue *dev_queue,
>> @@ -840,6 +843,7 @@ void dev_deactivate(struct net_device *dev)
>>       list_add(&dev->unreg_list, &single);
>>       dev_deactivate_many(&single);
>>  }
>> +EXPORT_SYMBOL(dev_deactivate);
>>
>>  static void dev_init_scheduler_queue(struct net_device *dev,
>>                                    struct netdev_queue *dev_queue,
>> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
>> new file mode 100644
>> index 0000000..b16dc2c
>> --- /dev/null
>> +++ b/net/sched/sch_mqprio.c
>> @@ -0,0 +1,413 @@
>> +/*
>> + * net/sched/sch_mqprio.c
>> + *
>> + * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com>
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License
>> + * version 2 as published by the Free Software Foundation.
>> + */
>> +
>> +#include <linux/types.h>
>> +#include <linux/slab.h>
>> +#include <linux/kernel.h>
>> +#include <linux/string.h>
>> +#include <linux/errno.h>
>> +#include <linux/skbuff.h>
>> +#include <net/netlink.h>
>> +#include <net/pkt_sched.h>
>> +#include <net/sch_generic.h>
>> +
>> +struct mqprio_sched {
>> +     struct Qdisc            **qdiscs;
>> +     int hw_owned;
>> +};
>> +
>> +static void mqprio_destroy(struct Qdisc *sch)
>> +{
>> +     struct net_device *dev = qdisc_dev(sch);
>> +     struct mqprio_sched *priv = qdisc_priv(sch);
>> +     unsigned int ntx;
>> +
>> +     if (!priv->qdiscs)
>> +             return;
>> +
>> +     for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
>> +             qdisc_destroy(priv->qdiscs[ntx]);
>> +
>> +     if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
>> +             dev->netdev_ops->ndo_setup_tc(dev, 0);
>> +     else
>> +             netdev_set_num_tc(dev, 0);
>> +
>> +     kfree(priv->qdiscs);
>> +}
>> +
>> +static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
>> +{
>> +     int i, j;
>> +
>> +     /* Verify num_tc is not out of max range */
>> +     if (qopt->num_tc > TC_MAX_QUEUE)
>> +             return -EINVAL;
>> +
>> +     for (i = 0; i < qopt->num_tc; i++) {
>> +             unsigned int last = qopt->offset[i] + qopt->count[i];
>
> (empty line after declarations)
>

fixed

>> +             /* Verify the queue offset is in the num tx range */
>> +             if (qopt->offset[i] >= dev->num_tx_queues)
>> +                     return -EINVAL;
>> +             /* Verify the queue count is in tx range being equal to the
>> +              * num_tx_queues indicates the last queue is in use.
>> +              */
>> +             else if (last > dev->num_tx_queues)
>> +                     return -EINVAL;
>> +
>> +             /* Verify that the offset and counts do not overlap */
>> +             for (j = i + 1; j < qopt->num_tc; j++) {
>> +                     if (last > qopt->offset[j])
>> +                             return -EINVAL;
>> +             }
>> +     }
>> +
>> +     return 0;
>> +}
>> +
>> +static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
>> +{
>> +     struct net_device *dev = qdisc_dev(sch);
>> +     struct mqprio_sched *priv = qdisc_priv(sch);
>> +     struct netdev_queue *dev_queue;
>> +     struct Qdisc *qdisc;
>> +     int i, err = -EOPNOTSUPP;
>> +     struct tc_mqprio_qopt *qopt = NULL;
>> +
>> +     /* Unwind attributes on failure */
>> +     u8 unwnd_tc = dev->num_tc;
>> +     u8 unwnd_map[TC_BITMASK + 1];
>> +     struct netdev_tc_txq unwnd_txq[TC_MAX_QUEUE];
>> +
>> +     if (sch->parent != TC_H_ROOT)
>> +             return -EOPNOTSUPP;
>> +
>> +     if (!netif_is_multiqueue(dev))
>> +             return -EOPNOTSUPP;
>> +
>> +     if (nla_len(opt) < sizeof(*qopt))
>> +             return -EINVAL;
>> +     qopt = nla_data(opt);
>> +
>> +     memcpy(unwnd_map, dev->prio_tc_map, sizeof(unwnd_map));
>> +     memcpy(unwnd_txq, dev->tc_to_txq, sizeof(unwnd_txq));
>> +
>> +     /* If the mqprio options indicate that hardware should own
>> +      * the queue mapping then run ndo_setup_tc if this can not
>> +      * be done fail immediately.
>> +      */
>> +     if (qopt->hw && dev->netdev_ops->ndo_setup_tc) {
>> +             priv->hw_owned = 1;
>> +             err = dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc);
>> +             if (err)
>> +                     return err;
>> +     } else if (!qopt->hw) {
>> +             if (mqprio_parse_opt(dev, qopt))
>> +                     return -EINVAL;
>> +
>> +             if (netdev_set_num_tc(dev, qopt->num_tc))
>> +                     return -EINVAL;
>> +
>> +             for (i = 0; i < qopt->num_tc; i++)
>> +                     netdev_set_tc_queue(dev, i,
>> +                                         qopt->count[i], qopt->offset[i]);
>> +     } else {
>> +             return -EINVAL;
>> +     }
>> +
>> +     /* Always use supplied priority mappings */
>> +     for (i = 0; i < TC_BITMASK + 1; i++) {
>> +             if (netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i])) {
>> +                     err = -EINVAL;
>
> This would probably trigger if we try qopt->num_tc == 0. Is it expected?

netdev_set_prio_tc_map() returns 0 on sucess. This if(..) is a bit strange though.

        err = netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i])
        if (err)
                ...

Is cleaner IMHO.

>
>> +                     goto tc_err;
>> +             }
>> +     }
>> +
>> +     /* pre-allocate qdisc, attachment can't fail */
>> +     priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
>> +                            GFP_KERNEL);
>> +     if (priv->qdiscs == NULL) {
>> +             err = -ENOMEM;
>> +             goto tc_err;
>> +     }
>> +
>> +     for (i = 0; i < dev->num_tx_queues; i++) {
>> +             dev_queue = netdev_get_tx_queue(dev, i);
>> +             qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops,
>> +                                       TC_H_MAKE(TC_H_MAJ(sch->handle),
>> +                                                 TC_H_MIN(i + 1)));
>> +             if (qdisc == NULL) {
>> +                     err = -ENOMEM;
>> +                     goto err;
>> +             }
>> +             qdisc->flags |= TCQ_F_CAN_BYPASS;
>> +             priv->qdiscs[i] = qdisc;
>> +     }
>> +
>> +     sch->flags |= TCQ_F_MQROOT;
>> +     return 0;
>> +
>> +err:
>> +     mqprio_destroy(sch);
>> +tc_err:
>> +     if (priv->hw_owned)
>> +             dev->netdev_ops->ndo_setup_tc(dev, unwnd_tc);
>
> Setting here (again) to unwind a bit later looks strange.
> Why not this 'else' only?

The entire unwind stuff is a bit awkward. With a bit more work up front parsing the parameters the unwinding can be avoided all together.

>
>> +     else
>> +             netdev_set_num_tc(dev, unwnd_tc);
>> +
>> +     memcpy(dev->prio_tc_map, unwnd_map, sizeof(unwnd_map));
>> +     memcpy(dev->tc_to_txq, unwnd_txq, sizeof(unwnd_txq));
>> +
>> +     return err;
>> +}
>> +
>> +static void mqprio_attach(struct Qdisc *sch)
>> +{
>> +     struct net_device *dev = qdisc_dev(sch);
>> +     struct mqprio_sched *priv = qdisc_priv(sch);
>> +     struct Qdisc *qdisc;
>> +     unsigned int ntx;
>> +
>> +     /* Attach underlying qdisc */
>> +     for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
>> +             qdisc = priv->qdiscs[ntx];
>> +             qdisc = dev_graft_qdisc(qdisc->dev_queue, qdisc);
>> +             if (qdisc)
>> +                     qdisc_destroy(qdisc);
>> +     }
>> +     kfree(priv->qdiscs);
>> +     priv->qdiscs = NULL;
>> +}
>> +
>> +static struct netdev_queue *mqprio_queue_get(struct Qdisc *sch,
>> +                                          unsigned long cl)
>> +{
>> +     struct net_device *dev = qdisc_dev(sch);
>> +     unsigned long ntx = cl - 1 - netdev_get_num_tc(dev);
>> +
>> +     if (ntx >= dev->num_tx_queues)
>> +             return NULL;
>> +     return netdev_get_tx_queue(dev, ntx);
>> +}
>> +
>> +static int mqprio_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
>> +                 struct Qdisc **old)
>> +{
>> +     struct net_device *dev = qdisc_dev(sch);
>> +     struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
>> +
>> +     if (dev->flags & IFF_UP)
>> +             dev_deactivate(dev);
>> +
>> +     *old = dev_graft_qdisc(dev_queue, new);
>> +
>> +     if (dev->flags & IFF_UP)
>> +             dev_activate(dev);
>> +
>> +     return 0;
>> +}
>> +
>> +static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
>> +{
>> +     struct net_device *dev = qdisc_dev(sch);
>> +     struct mqprio_sched *priv = qdisc_priv(sch);
>> +     unsigned char *b = skb_tail_pointer(skb);
>> +     struct tc_mqprio_qopt opt;
>> +     struct Qdisc *qdisc;
>> +     unsigned int i;
>> +
>> +     sch->q.qlen = 0;
>> +     memset(&sch->bstats, 0, sizeof(sch->bstats));
>> +     memset(&sch->qstats, 0, sizeof(sch->qstats));
>> +
>> +     for (i = 0; i < dev->num_tx_queues; i++) {
>> +             qdisc = netdev_get_tx_queue(dev, i)->qdisc;
>> +             spin_lock_bh(qdisc_lock(qdisc));
>> +             sch->q.qlen             += qdisc->q.qlen;
>> +             sch->bstats.bytes       += qdisc->bstats.bytes;
>> +             sch->bstats.packets     += qdisc->bstats.packets;
>> +             sch->qstats.qlen        += qdisc->qstats.qlen;
>> +             sch->qstats.backlog     += qdisc->qstats.backlog;
>> +             sch->qstats.drops       += qdisc->qstats.drops;
>> +             sch->qstats.requeues    += qdisc->qstats.requeues;
>> +             sch->qstats.overlimits  += qdisc->qstats.overlimits;
>> +             spin_unlock_bh(qdisc_lock(qdisc));
>> +     }
>> +
>> +     opt.num_tc = dev->num_tc;
>> +     memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
>> +     opt.hw = priv->hw_owned;
>> +
>> +     for (i = 0; i < dev->num_tc; i++) {
>> +             opt.count[i] = dev->tc_to_txq[i].count;
>> +             opt.offset[i] = dev->tc_to_txq[i].offset;
>> +     }
>> +
>> +     NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
>> +
>> +     return skb->len;
>> +nla_put_failure:
>> +     nlmsg_trim(skb, b);
>> +     return -1;
>> +}
>> +
>> +static struct Qdisc *mqprio_leaf(struct Qdisc *sch, unsigned long cl)
>> +{
>> +     struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
>> +
>> +     return dev_queue->qdisc_sleeping;
>> +}
>> +
>> +static unsigned long mqprio_get(struct Qdisc *sch, u32 classid)
>> +{
>> +     unsigned int ntx = TC_H_MIN(classid);
>
> We need to 'get' tc classes too, eg for individual dumps. Then we
> should omit them in .leaf, .graft etc.
>

OK missed this. Looks like iproute2 always sets NLM_F_DUMP which works because it uses cl_ops->walk

# tc -s class show dev eth3 classid 800b:1
class mqprio 800b:1 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0


>> +
>> +     if (!mqprio_queue_get(sch, ntx))
>> +             return 0;
>> +     return ntx;
>> +}
>> +
>> +static void mqprio_put(struct Qdisc *sch, unsigned long cl)
>> +{
>> +}
>> +
>> +static int mqprio_dump_class(struct Qdisc *sch, unsigned long cl,
>> +                      struct sk_buff *skb, struct tcmsg *tcm)
>> +{
>> +     struct net_device *dev = qdisc_dev(sch);
>> +
>> +     if (cl <= dev->num_tc) {
>> +             tcm->tcm_parent = TC_H_ROOT;
>> +             tcm->tcm_info = 0;
>> +     } else {
>> +             int i;
>> +             struct netdev_queue *dev_queue;
>> +             dev_queue = mqprio_queue_get(sch, cl);
>> +
>> +             tcm->tcm_parent = 0;
>> +             for (i = 0; i < netdev_get_num_tc(dev); i++) {
>
>
> Why dev->num_tc above, netdev_get_num_tc(dev) here, and dev->num_tc
> below?

No reason just inconsistant I will use dev->num_tc.

>
>> +                     struct netdev_tc_txq tc = dev->tc_to_txq[i];
>> +                     int q_idx = cl - dev->num_tc;
>
> (empty line after declarations)
>

fixed

>> +                     if (q_idx >= tc.offset &&
>> +                         q_idx < tc.offset + tc.count) {
>
> cl == 17, tc.offset == 0, tc.count == 1, num_tc = 16, q_idx = 1,
> !(1 < 0 + 1), doesn't belong to the parent #1?
>

Should be

	if (q_idx > tc.offset &&
	    q_idx <= tc.offset + tc.count) 

Now for cl == 17, tc.offset == , tc.count == 1, num_tc = 16, q_idx = 1,
(1 <= 0 + 1) belongs to the parent #1.

>> +                             tcm->tcm_parent =
>> +                                     TC_H_MAKE(TC_H_MAJ(sch->handle),
>> +                                               TC_H_MIN(i + 1));
>> +                             break;
>> +                     }
>> +             }
>> +             tcm->tcm_info = dev_queue->qdisc_sleeping->handle;
>> +     }
>> +     tcm->tcm_handle |= TC_H_MIN(cl);
>> +     return 0;
>> +}
>> +
>> +static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
>> +                            struct gnet_dump *d)
>> +{
>> +     struct net_device *dev = qdisc_dev(sch);
>> +
>> +     if (cl <= netdev_get_num_tc(dev)) {
>> +             int i;
>> +             struct Qdisc *qdisc;
>> +             struct gnet_stats_queue qstats = {0};
>> +             struct gnet_stats_basic_packed bstats = {0};
>> +             struct netdev_tc_txq tc = dev->tc_to_txq[cl - 1];
>> +
>> +             /* Drop lock here it will be reclaimed before touching
>> +              * statistics this is required because the d->lock we
>> +              * hold here is the look on dev_queue->qdisc_sleeping
>> +              * also acquired below.
>> +              */
>> +             spin_unlock_bh(d->lock);
>> +
>> +             for (i = tc.offset; i < tc.offset + tc.count; i++) {
>> +                     qdisc = netdev_get_tx_queue(dev, i)->qdisc;
>> +                     spin_lock_bh(qdisc_lock(qdisc));
>> +                     bstats.bytes      += qdisc->bstats.bytes;
>> +                     bstats.packets    += qdisc->bstats.packets;
>> +                     qstats.qlen       += qdisc->qstats.qlen;
>> +                     qstats.backlog    += qdisc->qstats.backlog;
>> +                     qstats.drops      += qdisc->qstats.drops;
>> +                     qstats.requeues   += qdisc->qstats.requeues;
>> +                     qstats.overlimits += qdisc->qstats.overlimits;
>> +                     spin_unlock_bh(qdisc_lock(qdisc));
>> +             }
>> +             /* Reclaim root sleeping lock before completing stats */
>> +             spin_lock_bh(d->lock);
>> +             if (gnet_stats_copy_basic(d, &bstats) < 0 ||
>> +                 gnet_stats_copy_queue(d, &qstats) < 0)
>> +                     return -1;
>> +     } else {
>> +             struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
>
> (empty line after declarations)
>

fixed.

>> +             sch = dev_queue->qdisc_sleeping;
>> +             sch->qstats.qlen = sch->q.qlen;
>> +             if (gnet_stats_copy_basic(d, &sch->bstats) < 0 ||
>> +                 gnet_stats_copy_queue(d, &sch->qstats) < 0)
>> +                     return -1;
>> +     }
>> +     return 0;
>> +}
>> +
>> +static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
>> +{
>> +     struct net_device *dev = qdisc_dev(sch);
>> +     unsigned long ntx;
>> +     u8 num_tc = netdev_get_num_tc(dev);
>> +
>> +     if (arg->stop)
>> +             return;
>> +
>> +     /* Walk hierarchy with a virtual class per tc */
>> +     arg->count = arg->skip;
>> +     for (ntx = arg->skip; ntx < dev->num_tx_queues + num_tc; ntx++) {
>
> Should we report possibly unused/unconfigured tx_queues?

I think it may be OK select_queue() could push skbs onto these queues and we may still want to see the statistics in this case. Although (real_num_tx_queues + num_tc) may make sense I see no reason to show queues above real_num_tx_queues.

Thanks,
John.

^ permalink raw reply

* Re: [net-next-2.6 PATCH] ethtool: update get_rx_ntuple to correctly interpret string count
From: Ben Hutchings @ 2011-01-05 17:28 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <4D24A2DD.2040603@intel.com>

On Wed, 2011-01-05 at 08:57 -0800, Alexander Duyck wrote:
[...]
> I'm fine with us replacing the ETHTOOL_GRXNTUPLE interface, but I would 
> prefer to do it after the merge windows for 2.6.39 has opened.  For now 
> I would like to get this patch accepted as my main concern is getting a 
> minor fix in versus rewriting the entire interface.

So long as there are no in-tree implementations of
ethtool_ops::get_rx_ntuple then it's a valid candidate for removal.
Since you now want to implement it, I think you should submit the
implementation along with the fix for the calling code.

> While we're at it how would you feel about us inverting the masks for 
> setting up an ntuple by making them an inclusion mask instead of an 
> exclusion one?  The reason why I ask is because I have to perform an and 
> operation over all the input anyway before I can use it to compute the 
> hashes and as such I am having to invert almost all of the mask bits, 
> and it appears you are having to do this as well for many of the masks 
> in sfc.

We can't change the userland interface but we could potentially invert
the masks in the ethtool core.  I'm really not convinced that this is
worth the trouble though.  (And it would be a massive pain for the OOT
versions of our drivers.)

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [RFC] sched: CHOKe packet scheduler
From: Eric Dumazet @ 2011-01-05 17:25 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20110105091718.02f8a00f@nehalam>

Le mercredi 05 janvier 2011 à 09:17 -0800, Stephen Hemminger a écrit :
> On Wed, 05 Jan 2011 07:19:35 +0100
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > Le mardi 04 janvier 2011 à 16:29 -0800, Stephen Hemminger a écrit :
> > > +static struct sk_buff *skb_peek_random(struct sk_buff_head *list)
> > > +{
> > > +	struct sk_buff *skb = list->next;
> > > +	unsigned int idx = net_random() % list->qlen;
> > > +
> > > +	while (skb && idx-- > 0)
> > > +		skb = skb->next;
> > > +
> > > +	return skb;
> > > +}
> > 
> > You could avoid the divide op :
> > 
> > unsigned int idx = reciprocal_divide(random32(), list->qlen);
> 
> How would this work, it is a mod not a divide??
> 

It works, because random32() provides a 32bit 'random' number.
between 0 and 0xFFFFFFFF

We multiply it by X, get a 64bit number between 0 and 0xFFFFFFFF * X
then we right shift it by 32, get a number between 0 and X - 1 

We dont need to get the modulus, just a random number between 0 and X -
1

Dont worry, we should add a helper function to do that, since it might
be used in many places.

/* deliver a random number between 0 and N - 1 */
u32 random_N(unsigned int N)
{
	reciprocal_divide(random32(), N);
}





^ permalink raw reply

* Re: [PATCH v3 08/10] ARM: mxs: add ocotp read function
From: Jamie Iles @ 2011-01-05 17:25 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Jamie Iles, Shawn Guo, gerg, B32542, netdev, s.hauer, baruch,
	w.sang, r64343, eric, bryan.wu, davem, linux-arm-kernel, lw
In-Reply-To: <20110105164409.GV25121@pengutronix.de>

On Wed, Jan 05, 2011 at 05:44:09PM +0100, Uwe Kleine-König wrote:
> Hello Jamie,
> On Wed, Jan 05, 2011 at 04:16:46PM +0000, Jamie Iles wrote:
> > On Wed, Jan 05, 2011 at 10:07:35PM +0800, Shawn Guo wrote:
> > > +	/* check both BUSY and ERROR cleared */
> > > +	while ((__raw_readl(ocotp_base) &
> > > +		(BM_OCOTP_CTRL_BUSY | BM_OCOTP_CTRL_ERROR)) && --timeout)
> > > +		/* nothing */;
> > 
> > Is it worth using cpu_relax() in these polling loops?
> I don't know what cpu_relax does for other platforms, but on ARM it's
> just a memory barrier which AFAICT doesn't help here at all (which
> doesn't need to be correct).  Why do you think it would be better?

Well I don't see that there's anything broken without cpu_relax() but 
using cpu_relax() seems to be the most common way of doing busy polling 
loops that I've seen. It's also a bit easier to read than a comment and 
semi-colon. Perhaps it's just personal preference.

Jamie

^ permalink raw reply

* Re: [RFC] sched: CHOKe packet scheduler
From: Stephen Hemminger @ 2011-01-05 17:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1294208375.3420.46.camel@edumazet-laptop>

On Wed, 05 Jan 2011 07:19:35 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 04 janvier 2011 à 16:29 -0800, Stephen Hemminger a écrit :
> > +static struct sk_buff *skb_peek_random(struct sk_buff_head *list)
> > +{
> > +	struct sk_buff *skb = list->next;
> > +	unsigned int idx = net_random() % list->qlen;
> > +
> > +	while (skb && idx-- > 0)
> > +		skb = skb->next;
> > +
> > +	return skb;
> > +}
> 
> You could avoid the divide op :
> 
> unsigned int idx = reciprocal_divide(random32(), list->qlen);

How would this work, it is a mod not a divide??

-- 

^ permalink raw reply

* Re: [BUG] net_sched: pfifo_head_drop problem
From: Stephen Hemminger @ 2011-01-05 17:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Florian Westphal, Patrick McHardy,
	Hagen Paul Pfeifer, Jarek Poplawski
In-Reply-To: <1294246850.2775.244.camel@edumazet-laptop>

On Wed, 05 Jan 2011 18:00:50 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> While reviewing CHOKe stuff, I found following problem :
> 
> commit 57dbb2d83d100ea (sched: add head drop fifo queue)
> introduced pfifo_head_drop, and broke the invariant that
> sch->bstats.bytes and sch->bstats.packets are COUNTER (increasing
> counters only)
> 
> This can break estimators because est_timer() handle unsigned deltas
> only. A decreasing counter can then give a huge unsigned delta.
> 
> My suggestion would be to change things so that sch->bstats.bytes and
> sch->bstats.packets are incremented in dequeue() only, not at enqueue()
> time.
> 
> It would be more sensible anyway for very low speeds, and big bursts.
> Right now, if we drop packets, they still are accounted in estimators.
> 
> Or maybe my understanding of estimators is wrong, and only apply to
> enqueue rate, not dequeue rate ?
> 
> If so, we should remove the 
> 
> sch->bstats.bytes -= qdisc_pkt_len(skb_head);
> sch->bstats.packets--;
> 
> done in pfifo_tail_enqueue() in case we drop the head skb.
> 
> 
> My preference would be to add dropped pack/byte rates to estimators...
> It might be good for tuning.

Agreed counters should reflect dequeued packets not enqueued packets.


-- 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox