Netdev List
* Re: [PATCH net-next-2.6] sch_sfq: allow big packets and be fair
From: Jarek Poplawski @ 2010-12-21 10:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <1292928269.2720.0.camel@edumazet-laptop>

On Tue, Dec 21, 2010 at 11:44:29AM +0100, Eric Dumazet wrote:
> Le mardi 21 décembre 2010 à 10:30 +0000, Jarek Poplawski a écrit :
> > On Tue, Dec 21, 2010 at 10:15:06AM +0000, Jarek Poplawski wrote:
> > > The change of allotment limit looks OK [...]
> > 
> > Hmm... but maybe s/ALLOT_ZERO/SFQ_ALLOT_ZERO/? ;-)
> 
> It's a local symbol; it's not as if it were in an include file ;)

Sure, but why does this one have to be different from the others in this file?
(Plus, if you don't remember all this code it's a good hint.)

Jarek P.

^ permalink raw reply

* Re: [PATCH net-next-2.6] sch_sfq: allow big packets and be fair
From: Eric Dumazet @ 2010-12-21 10:57 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <20101221101506.GA8149@ff.dom.local>

Le mardi 21 décembre 2010 à 10:15 +0000, Jarek Poplawski a écrit :
> On 2010-12-21 00:16, Eric Dumazet wrote:
> > SFQ is currently 'limited' to small packets, because it uses a 16bit
> > allotment number per flow. Switch it to 18bit, and use appropriate
> > handling to make sure this allotment is in [1 .. quantum] range before a
> > new packet is dequeued, so that fairness is respected.
> 
> Well, such two important changes should be in separate patches.
> 
> The change of allotment limit looks OK (but I would try scaling, e.g.
> in 16-byte chunks, btw).
> 

Hmm, we could scale by 2 or 3 and keep 16bit allot/hash (faster than
18/14 bit bitfields on x86). Not sure it's worth it (it adds two shifts
per packet).


> The change in fair treatment looks dubious. A flow which uses exactly
> its quantum in one round will be skipped in the next round. A flow
> which uses a bit more than its quantum in one round, will be skipped
> too, while we should only give it less this time to keep the sum up to
> 2 quantums. (The usual algorithm is to check if a flow has enough
> "tickets" for sending its next packet.)

Hmm... 

A flow which uses exactly its quantum in one round won't be skipped in
the next round.

I only put the "I pass my round to the next slot in the chain" step in one
place instead of two; maybe you missed the removal at the end of
sfq_dequeue()?

-	} else if ((slot->allot -= qdisc_pkt_len(skb)) <= 0) {
-		q->tail = slot;
-		slot->allot += q->quantum;
+	} else {
+		slot->allot -= qdisc_pkt_len(skb);
	}

Now the check is performed at the beginning of sfq_dequeue(), so that a
previously sent 'big packet' can be charged over multiple rounds (a faulty
flow won't send a packet before passing xx rounds).

I believe I just did the right thing. The "allot" is incremented when the
current flow "passes its round to the next slot", and decremented when a
packet is dequeued from this slot. Before being allowed to dequeue a
packet, "allot" must be 'positive'.
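
The accounting described above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel code: `toy_slot` and `toy_dequeue` are hypothetical names, allot is kept in plain bytes (no scaling), and an empty slot is simply stepped over instead of being unlinked from the chain as sch_sfq does.

```c
#include <stddef.h>

/* Hypothetical, simplified model of one flow: a FIFO of packet lengths. */
struct toy_slot {
	int allot;         /* remaining credit in bytes, may go negative */
	const int *pkts;   /* lengths of the packets queued on this flow */
	size_t npkts;      /* number of queued packets */
	size_t head;       /* index of the next packet to send */
};

/*
 * Dequeue one packet, round-robin over nslots slots.
 * *cur is the slot whose turn it is. Returns the index of the slot
 * that sent a packet, or -1 if nothing is queued.
 */
int toy_dequeue(struct toy_slot *slots, size_t nslots, size_t *cur, int quantum)
{
	size_t i, pending = 0;

	for (i = 0; i < nslots; i++)
		pending += slots[i].npkts - slots[i].head;
	if (pending == 0 || quantum <= 0)
		return -1;

	for (;;) {
		struct toy_slot *s = &slots[*cur];

		if (s->head == s->npkts) {
			/* Empty flow: the real qdisc would have unlinked it. */
			*cur = (*cur + 1) % nslots;
			continue;
		}
		if (s->allot <= 0) {
			/* Pass the round to the next slot and refill the credit. */
			s->allot += quantum;
			*cur = (*cur + 1) % nslots;
			continue;
		}
		/* Positive credit: send one packet and charge its full length. */
		s->allot -= s->pkts[s->head++];
		return (int)*cur;
	}
}
```

With a quantum of 1500, a flow that sends one 6000-byte packet ends up at -4500 and has to pass three rounds before it is served again, while a flow of 1500-byte packets is served once per round. If every packet is no larger than the quantum, one refill is always enough to make allot positive again, so such a flow is never passed over twice in a row.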




^ permalink raw reply

* Re: [PATCH net-next-2.6] sch_sfq: allow big packets and be fair
From: Jarek Poplawski @ 2010-12-21 11:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <1292929037.2720.12.camel@edumazet-laptop>

On Tue, Dec 21, 2010 at 11:57:17AM +0100, Eric Dumazet wrote:
> Le mardi 21 décembre 2010 à 10:15 +0000, Jarek Poplawski a écrit :
> > On 2010-12-21 00:16, Eric Dumazet wrote:
> > > SFQ is currently 'limited' to small packets, because it uses a 16bit
> > > allotment number per flow. Switch it to 18bit, and use appropriate
> > > handling to make sure this allotment is in [1 .. quantum] range before a
> > > new packet is dequeued, so that fairness is respected.
> > 
> > Well, such two important changes should be in separate patches.
> > 
> > The change of allotment limit looks OK (but I would try scaling, e.g.
> > in 16-byte chunks, btw).
> > 
> 
> Hmm, we could scale by 2 or 3 and keep 16bit allot/hash (faster than
> 18/14 bit bitfields on x86). Not sure it's worth it (it adds two shifts
> per packet).

I'm OK with any of those methods.

> > The change in fair treatment looks dubious. A flow which uses exactly
> > its quantum in one round will be skipped in the next round. A flow
> > which uses a bit more than its quantum in one round, will be skipped
> > too, while we should only give it less this time to keep the sum up to
> > 2 quantums. (The usual algorithm is to check if a flow has enough
> > "tickets" for sending its next packet.)
> 
> Hmm... 
> 
> A flow which uses exactly its quantum in one round wont be skipped in
> the next round.
> 
> I only made the "I pass my round to next slot in chain" in one place
> instead of two, maybe you missed the removal at the end of
> sfq_dequeue() ?
> 
> -	} else if ((slot->allot -= qdisc_pkt_len(skb)) <= 0) {
> -		q->tail = slot;
> -		slot->allot += q->quantum;
> +	} else {
> +		slot->allot -= qdisc_pkt_len(skb);
> 	}
> 
> Now the check is performed at the beginning of sfq_dequeue(), to be able
> to charge a previously sent 'big packet' multiple times (faulty flow
> wont send a packet before passing xx rounds)
> 
> I believe I just did the right thing. The "allot" is incremented when
> current flow "pass its round to next slot", and decremented when a
> packet is dequeued from this slot. Before being allowed to dequeue a
> packet, "allot" must be 'positive'.

Simply try to check my examples before and after. There is no skipping
of a round now. It's a serious change. Somebody tried to avoid it
altogether in the current implementation. You should also think about
fairness for normal (but differently sized) packets.

Jarek P.

^ permalink raw reply

* [PATCH 1/2 -next] sundance: Wrap up access to ASICCtrl high word with a macro
From: Denis Kirjanov @ 2010-12-21 12:01 UTC (permalink / raw)
  To: davem; +Cc: netdev

Wrap up access to ASICCtrl high word with a macro

Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
---
 drivers/net/sundance.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/sundance.c b/drivers/net/sundance.c
index 3ed2a67..d82944c 100644
--- a/drivers/net/sundance.c
+++ b/drivers/net/sundance.c
@@ -294,6 +294,9 @@ enum alta_offsets {
 	/* Aliased and bogus values! */
 	RxStatus = 0x0c,
 };
+
+#define ASIC_HI_WORD(x)	((x) + 2)
+
 enum ASICCtrl_HiWord_bit {
 	GlobalReset = 0x0001,
 	RxReset = 0x0002,
@@ -1772,10 +1775,10 @@ static int netdev_close(struct net_device *dev)
     	}
 
     	iowrite16(GlobalReset | DMAReset | FIFOReset | NetworkReset,
-			ioaddr +ASICCtrl + 2);
+			ioaddr + ASIC_HI_WORD(ASICCtrl));
 
     	for (i = 2000; i > 0; i--) {
- 		if ((ioread16(ioaddr + ASICCtrl +2) & ResetBusy) == 0)
+ 		if ((ioread16(ioaddr + ASIC_HI_WORD(ASICCtrl)) & ResetBusy) == 0)
 			break;
 		mdelay(1);
     	}
-- 
1.7.2.2


^ permalink raw reply related

* [PATCH 2/2 -next] sundance: Program station address into HW
From: Denis Kirjanov @ 2010-12-21 12:02 UTC (permalink / raw)
  To: davem; +Cc: netdev

Program adapter's StationAddress register when changing device MAC address

Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
---
 drivers/net/sundance.c |   16 +++++++++++++++-
 1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sundance.c b/drivers/net/sundance.c
index d82944c..826fe34 100644
--- a/drivers/net/sundance.c
+++ b/drivers/net/sundance.c
@@ -434,6 +434,7 @@ static void netdev_error(struct net_device *dev, int intr_status);
 static void netdev_error(struct net_device *dev, int intr_status);
 static void set_rx_mode(struct net_device *dev);
 static int __set_mac_addr(struct net_device *dev);
+static int sundance_set_mac_addr(struct net_device *dev, void *data);
 static struct net_device_stats *get_stats(struct net_device *dev);
 static int netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
 static int  netdev_close(struct net_device *dev);
@@ -467,7 +468,7 @@ static const struct net_device_ops netdev_ops = {
 	.ndo_do_ioctl 		= netdev_ioctl,
 	.ndo_tx_timeout		= tx_timeout,
 	.ndo_change_mtu		= change_mtu,
-	.ndo_set_mac_address 	= eth_mac_addr,
+	.ndo_set_mac_address 	= sundance_set_mac_addr,
 	.ndo_validate_addr	= eth_validate_addr,
 };
 
@@ -1595,6 +1596,19 @@ static int __set_mac_addr(struct net_device *dev)
 	return 0;
 }
 
+/* Invoked with rtnl_lock held */
+static int sundance_set_mac_addr(struct net_device *dev, void *data)
+{
+	const struct sockaddr *addr = data;
+
+	if (!is_valid_ether_addr(addr->sa_data))
+		return -EINVAL;
+	memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
+	__set_mac_addr(dev);
+
+	return 0;
+}
+
 static const struct {
 	const char name[ETH_GSTRING_LEN];
 } sundance_stats[] = {
-- 
1.7.2.2
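
The flow of the new handler (validate, copy into the software address, then program the hardware) can be sketched in userspace as follows. This is a minimal model under stated assumptions: `toy_dev`, `toy_is_valid_ether_addr` and `toy_set_mac_addr` are hypothetical names, the register write is replaced by a second array, and the validity test mirrors the usual rule that a station address must be neither all-zero nor multicast.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN 6

/* Hypothetical device model: software copy plus a modelled HW register. */
struct toy_dev {
	uint8_t dev_addr[TOY_ETH_ALEN];
	uint8_t hw_station_addr[TOY_ETH_ALEN];
};

/* A unicast station address is non-zero and has the group bit clear. */
bool toy_is_valid_ether_addr(const uint8_t *addr)
{
	uint8_t any = 0;
	int i;

	for (i = 0; i < TOY_ETH_ALEN; i++)
		any |= addr[i];
	/* Bit 0 of the first octet marks a multicast (group) address. */
	return any != 0 && (addr[0] & 0x01) == 0;
}

/* Reject bad addresses, otherwise update both software and "hardware". */
int toy_set_mac_addr(struct toy_dev *dev, const uint8_t *new_addr)
{
	if (!toy_is_valid_ether_addr(new_addr))
		return -EINVAL;
	memcpy(dev->dev_addr, new_addr, TOY_ETH_ALEN);
	/* Stand-in for the register writes done by __set_mac_addr(). */
	memcpy(dev->hw_station_addr, dev->dev_addr, TOY_ETH_ALEN);
	return 0;
}
```

The point of the ordering is that an invalid address leaves both copies untouched, so the software view and the station address register cannot drift apart.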


^ permalink raw reply related

* Re: [PATCH net-next-2.6] sch_sfq: allow big packets and be fair
From: Jarek Poplawski @ 2010-12-21 12:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <20101221113920.GB8813@ff.dom.local>

On Tue, Dec 21, 2010 at 11:39:20AM +0000, Jarek Poplawski wrote:
> On Tue, Dec 21, 2010 at 11:57:17AM +0100, Eric Dumazet wrote:
> > Now the check is performed at the beginning of sfq_dequeue(), to be able
> > to charge a previously sent 'big packet' multiple times (faulty flow
> > wont send a packet before passing xx rounds)
> > 
> > I believe I just did the right thing. The "allot" is incremented when
> > current flow "pass its round to next slot", and decremented when a
> > packet is dequeued from this slot. Before being allowed to dequeue a
> > packet, "allot" must be 'positive'.
> 
> Simply try to check my examples before and after. There is no skipping
> of a round now. It's a serious change. Somebody tried to avoid it at
> all in the current implementation. You should also think about fairness
> of normal (but different) size packets.

Oops! You're right yet ;-) This skipping shouldn't happen with quantum
bigger than max packet size, so this patch is OK.

Sorry,
Jarek P.

^ permalink raw reply

* [PATCH v2 net-next-2.6] sch_sfq: allow big packets and be fair
From: Eric Dumazet @ 2010-12-21 13:04 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <20101221121706.GC8813@ff.dom.local>

Le mardi 21 décembre 2010 à 12:17 +0000, Jarek Poplawski a écrit :

> Oops! You're right yet ;-) This skipping shouldn't happen with quantum
> bigger than max packet size, so this patch is OK.

Thanks Jarek, here is a v2 with the scale you suggested.

[PATCH v2 net-next-2.6] sch_sfq: allow big packets and be fair

SFQ is currently 'limited' to small packets, because it uses a 15bit
allotment number per flow. Introduce a scale by 8, so that we can handle
full size TSO/GRO packets.

Use appropriate handling to make sure allot is positive before a new
packet is dequeued, so that fairness is respected.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jarek Poplawski <jarkao2@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
---
v2: Use a scale of 8 as Jarek suggested, instead of 18bit fields

 net/sched/sch_sfq.c |   28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index c474b4b..f3a9fd7 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -67,7 +67,7 @@
 
 	IMPLEMENTATION:
 	This implementation limits maximal queue length to 128;
-	maximal mtu to 2^15-1; max 128 flows, number of hash buckets to 1024.
+	max mtu to 2^18-1; max 128 flows, number of hash buckets to 1024.
 	The only goal of this restrictions was that all data
 	fit into one 4K page on 32bit arches.
 
@@ -77,6 +77,11 @@
 #define SFQ_SLOTS		128 /* max number of flows */
 #define SFQ_EMPTY_SLOT		255
 #define SFQ_HASH_DIVISOR	1024
+/* We use 15+1 bits to store allot, and want to handle packets up to 64K
+ * Scale allot by 8 (1<<3) so that no overflow occurs.
+ */
+#define SFQ_ALLOT_SHIFT		3
+#define SFQ_ALLOT_SIZE(X)	DIV_ROUND_UP(X, 1 << SFQ_ALLOT_SHIFT)
 
 /* This type should contain at least SFQ_DEPTH + SFQ_SLOTS values */
 typedef unsigned char sfq_index;
@@ -115,7 +120,7 @@ struct sfq_sched_data
 	struct timer_list perturb_timer;
 	u32		perturbation;
 	sfq_index	cur_depth;	/* depth of longest slot */
-
+	unsigned short  scaled_quantum; /* SFQ_ALLOT_SIZE(quantum) */
 	struct sfq_slot *tail;		/* current slot in round */
 	sfq_index	ht[SFQ_HASH_DIVISOR];	/* Hash table */
 	struct sfq_slot	slots[SFQ_SLOTS];
@@ -394,7 +399,7 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 			q->tail->next = x;
 		}
 		q->tail = slot;
-		slot->allot = q->quantum;
+		slot->allot = q->scaled_quantum;
 	}
 	if (++sch->q.qlen <= q->limit) {
 		sch->bstats.bytes += qdisc_pkt_len(skb);
@@ -430,8 +435,14 @@ sfq_dequeue(struct Qdisc *sch)
 	if (q->tail == NULL)
 		return NULL;
 
+next_slot:
 	a = q->tail->next;
 	slot = &q->slots[a];
+	if (slot->allot <= 0) {
+		q->tail = slot;
+		slot->allot += q->scaled_quantum;
+		goto next_slot;
+	}
 	skb = slot_dequeue_head(slot);
 	sfq_dec(q, a);
 	sch->q.qlen--;
@@ -446,9 +457,8 @@ sfq_dequeue(struct Qdisc *sch)
 			return skb;
 		}
 		q->tail->next = next_a;
-	} else if ((slot->allot -= qdisc_pkt_len(skb)) <= 0) {
-		q->tail = slot;
-		slot->allot += q->quantum;
+	} else {
+		slot->allot -= SFQ_ALLOT_SIZE(qdisc_pkt_len(skb));
 	}
 	return skb;
 }
@@ -484,6 +494,7 @@ static int sfq_change(struct Qdisc *sch, struct nlattr *opt)
 
 	sch_tree_lock(sch);
 	q->quantum = ctl->quantum ? : psched_mtu(qdisc_dev(sch));
+	q->scaled_quantum = SFQ_ALLOT_SIZE(q->quantum);
 	q->perturb_period = ctl->perturb_period * HZ;
 	if (ctl->limit)
 		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH - 1);
@@ -524,6 +535,7 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
 	q->tail = NULL;
 	if (opt == NULL) {
 		q->quantum = psched_mtu(qdisc_dev(sch));
+		q->scaled_quantum = SFQ_ALLOT_SIZE(q->quantum);
 		q->perturb_period = 0;
 		q->perturbation = net_random();
 	} else {
@@ -610,7 +622,9 @@ static int sfq_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 	struct sfq_sched_data *q = qdisc_priv(sch);
 	const struct sfq_slot *slot = &q->slots[q->ht[cl - 1]];
 	struct gnet_stats_queue qs = { .qlen = slot->qlen };
-	struct tc_sfq_xstats xstats = { .allot = slot->allot };
+	struct tc_sfq_xstats xstats = {
+		.allot = slot->allot << SFQ_ALLOT_SHIFT
+	};
 	struct sk_buff *skb;
 
 	slot_queue_walk(slot, skb)
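
The scaling arithmetic in this patch can be checked with a small userspace sketch. It is an illustration only: the `TOY_` and `toy_` names are hypothetical stand-ins, the round-up is written out by hand instead of using DIV_ROUND_UP, and the allot is assumed to live in a signed 16-bit field.

```c
#include <stdint.h>

#define TOY_ALLOT_SHIFT 3
/* Round up so a packet is never charged less than its real length. */
#define TOY_ALLOT_SIZE(x) \
	(((x) + (1 << TOY_ALLOT_SHIFT) - 1) >> TOY_ALLOT_SHIFT)

/* Charge one packet against a signed 16-bit allot kept in 8-byte units. */
int16_t toy_charge(int16_t allot, unsigned int pkt_len)
{
	return (int16_t)(allot - (int)TOY_ALLOT_SIZE(pkt_len));
}

/* Convert a scaled allot back to bytes, e.g. for statistics. */
int toy_allot_bytes(int16_t allot)
{
	/* Multiply instead of shifting: allot may be negative. */
	return allot * (1 << TOY_ALLOT_SHIFT);
}
```

A packet of 65535 bytes costs 8192 units, so even a flow that dequeues it with the smallest positive allot (1) lands at -8191, well inside the range of a signed 16-bit value. A 1500-byte quantum scales to 188 units.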



^ permalink raw reply related

* Re: [PATCH 5/5 v4] net: add old_queue_mapping into skb->cb
From: jamal @ 2010-12-21 13:07 UTC (permalink / raw)
  To: Changli Gao
  Cc: David S. Miller, Stephen Hemminger, Eric Dumazet, Tom Herbert,
	Jiri Pirko, netdev, netem
In-Reply-To: <AANLkTi=Eq1HuPwMMGQjk9x0QXp3_5djmAY6JeZyFkk0k@mail.gmail.com>

On Fri, 2010-12-17 at 21:41 +0800, Changli Gao wrote:

Sorry for the latency - I am a little swamped.

> I doubt it can work.

It should work - this used to be part of my regression tests.
If it doesn't work, something is broken.

In any case, I shouldn't have used this example because
it distracted from the point I was trying to make:
You are restoring the old qmap when the point is we could change
it to a new one. A simpler example illustrating how
a qmap could be changed:

----
tc filter add dev ifb0 parent 1:0 protocol ip prio 10 u32 \
 match u32 0 0 flowid 1:2 \
 action skbedit queue_mapping 4
----

cheers,
jamal


^ permalink raw reply

* Re: [PATCH] iproute2: ip: add wildcard support for device matching
From: jamal @ 2010-12-21 13:14 UTC (permalink / raw)
  To: Vlad Dogaru
  Cc: Octavian Purdila, Eric Dumazet, Stephen Hemminger, netdev,
	Lucian Adrian Grijincu
In-Reply-To: <4D0FAD26.8050908@rosedu.org>

On Mon, 2010-12-20 at 11:23 -0800, Vlad Dogaru wrote:

> I'll try to implement this approach in the next few days.

Excellent ;-> Remember, this is a general-purpose tag, sort of
like socket/route/skb->mark. It is up to the administrator to
define its use via policy.

cheers,
jamal



^ permalink raw reply

* Re: [PATCH v2 net-next-2.6] sch_sfq: allow big packets and be fair
From: Jarek Poplawski @ 2010-12-21 13:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <1292936699.2720.23.camel@edumazet-laptop>

On Tue, Dec 21, 2010 at 02:04:59PM +0100, Eric Dumazet wrote:
> Le mardi 21 décembre 2010 à 12:17 +0000, Jarek Poplawski a écrit :
> 
> > Oops! You're right yet ;-) This skipping shouldn't happen with quantum
> > bigger than max packet size, so this patch is OK.
> 
> Thanks Jarek, here is a v2 with the scale you suggested.

Very nice! Thanks as well,
Jarek P.

> 
> [PATCH v2 net-next-2.6] sch_sfq: allow big packets and be fair
> 
> SFQ is currently 'limited' to small packets, because it uses a 15bit
> allotment number per flow. Introduce a scale by 8, so that we can handle
> full size TSO/GRO packets.
> 
> Use appropriate handling to make sure allot is positive before a new
> packet is dequeued, so that fairness is respected.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Jarek Poplawski <jarkao2@gmail.com>
> Cc: Patrick McHardy <kaber@trash.net>
> ---
> v2: Use a scale of 8 as Jarek suggested, instead of 18bit fields
> 
>  net/sched/sch_sfq.c |   28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
> index c474b4b..f3a9fd7 100644
> --- a/net/sched/sch_sfq.c
> +++ b/net/sched/sch_sfq.c
> @@ -67,7 +67,7 @@
>  
>  	IMPLEMENTATION:
>  	This implementation limits maximal queue length to 128;
> -	maximal mtu to 2^15-1; max 128 flows, number of hash buckets to 1024.
> +	max mtu to 2^18-1; max 128 flows, number of hash buckets to 1024.
>  	The only goal of this restrictions was that all data
>  	fit into one 4K page on 32bit arches.
>  
> @@ -77,6 +77,11 @@
>  #define SFQ_SLOTS		128 /* max number of flows */
>  #define SFQ_EMPTY_SLOT		255
>  #define SFQ_HASH_DIVISOR	1024
> +/* We use 15+1 bits to store allot, and want to handle packets up to 64K
> + * Scale allot by 8 (1<<3) so that no overflow occurs.
> + */
> +#define SFQ_ALLOT_SHIFT		3
> +#define SFQ_ALLOT_SIZE(X)	DIV_ROUND_UP(X, 1 << SFQ_ALLOT_SHIFT)
>  
>  /* This type should contain at least SFQ_DEPTH + SFQ_SLOTS values */
>  typedef unsigned char sfq_index;
> @@ -115,7 +120,7 @@ struct sfq_sched_data
>  	struct timer_list perturb_timer;
>  	u32		perturbation;
>  	sfq_index	cur_depth;	/* depth of longest slot */
> -
> +	unsigned short  scaled_quantum; /* SFQ_ALLOT_SIZE(quantum) */
>  	struct sfq_slot *tail;		/* current slot in round */
>  	sfq_index	ht[SFQ_HASH_DIVISOR];	/* Hash table */
>  	struct sfq_slot	slots[SFQ_SLOTS];
> @@ -394,7 +399,7 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
>  			q->tail->next = x;
>  		}
>  		q->tail = slot;
> -		slot->allot = q->quantum;
> +		slot->allot = q->scaled_quantum;
>  	}
>  	if (++sch->q.qlen <= q->limit) {
>  		sch->bstats.bytes += qdisc_pkt_len(skb);
> @@ -430,8 +435,14 @@ sfq_dequeue(struct Qdisc *sch)
>  	if (q->tail == NULL)
>  		return NULL;
>  
> +next_slot:
>  	a = q->tail->next;
>  	slot = &q->slots[a];
> +	if (slot->allot <= 0) {
> +		q->tail = slot;
> +		slot->allot += q->scaled_quantum;
> +		goto next_slot;
> +	}
>  	skb = slot_dequeue_head(slot);
>  	sfq_dec(q, a);
>  	sch->q.qlen--;
> @@ -446,9 +457,8 @@ sfq_dequeue(struct Qdisc *sch)
>  			return skb;
>  		}
>  		q->tail->next = next_a;
> -	} else if ((slot->allot -= qdisc_pkt_len(skb)) <= 0) {
> -		q->tail = slot;
> -		slot->allot += q->quantum;
> +	} else {
> +		slot->allot -= SFQ_ALLOT_SIZE(qdisc_pkt_len(skb));
>  	}
>  	return skb;
>  }
> @@ -484,6 +494,7 @@ static int sfq_change(struct Qdisc *sch, struct nlattr *opt)
>  
>  	sch_tree_lock(sch);
>  	q->quantum = ctl->quantum ? : psched_mtu(qdisc_dev(sch));
> +	q->scaled_quantum = SFQ_ALLOT_SIZE(q->quantum);
>  	q->perturb_period = ctl->perturb_period * HZ;
>  	if (ctl->limit)
>  		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH - 1);
> @@ -524,6 +535,7 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
>  	q->tail = NULL;
>  	if (opt == NULL) {
>  		q->quantum = psched_mtu(qdisc_dev(sch));
> +		q->scaled_quantum = SFQ_ALLOT_SIZE(q->quantum);
>  		q->perturb_period = 0;
>  		q->perturbation = net_random();
>  	} else {
> @@ -610,7 +622,9 @@ static int sfq_dump_class_stats(struct Qdisc *sch, unsigned long cl,
>  	struct sfq_sched_data *q = qdisc_priv(sch);
>  	const struct sfq_slot *slot = &q->slots[q->ht[cl - 1]];
>  	struct gnet_stats_queue qs = { .qlen = slot->qlen };
> -	struct tc_sfq_xstats xstats = { .allot = slot->allot };
> +	struct tc_sfq_xstats xstats = {
> +		.allot = slot->allot << SFQ_ALLOT_SHIFT
> +	};
>  	struct sk_buff *skb;
>  
>  	slot_queue_walk(slot, skb)
> 
> 

^ permalink raw reply

* Re: [patch -next] bnx2x: remove bogus check
From: Eilon Greenstein @ 2010-12-21 13:48 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: netdev@vger.kernel.org, kernel-janitors@vger.kernel.org
In-Reply-To: <20101221070401.GF1936@bicker>

On Mon, 2010-12-20 at 23:04 -0800, Dan Carpenter wrote:
> We dereferenced params on the line before so it's too late to check if
> params is NULL.  In fact, params can never be NULL and strict_cos is
> either 0 or 1 so that part of the check is bogus too.  Let's remove it.
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>

Thanks Dan!

Acked-by: Eilon Greenstein <eilong@broadcom.com>

> 
> diff --git a/drivers/net/bnx2x/bnx2x_link.c b/drivers/net/bnx2x/bnx2x_link.c
> index 97cbee2..43b0de2 100644
> --- a/drivers/net/bnx2x/bnx2x_link.c
> +++ b/drivers/net/bnx2x/bnx2x_link.c
> @@ -354,9 +354,6 @@ u8 bnx2x_ets_strict(const struct link_params *params, const u8 strict_cos)
>  	struct bnx2x *bp = params->bp;
>  	u32 val	= 0;
>  
> -	if ((1 < strict_cos) && (NULL == params))
> -		return -EINVAL;
> -
>  	DP(NETIF_MSG_LINK, "ETS enabled strict configuration\n");
>  	/**
>  	 * Bitmap of 5bits length. Each bit specifies whether the entry behaves
> 





^ permalink raw reply

* Re: [PATCH net-2.6] net_sched: always clone skbs
From: jamal @ 2010-12-21 13:52 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Eric Dumazet, Changli Gao, David S. Miller, netdev,
	Pawel Staszewski
In-Reply-To: <20101220235209.GA1865@del.dom.local>

On Tue, 2010-12-21 at 00:52 +0100, Jarek Poplawski wrote:


> 
> host1 (kernel 2.6.36.2)
> netperf client -> eth3 (82598EB 10-Gigabit AT CX4) - directly connected to eth2 of host2
> ethtool -k eth3
> Offload parameters for eth3:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp-segmentation-offload: on
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: on 

It used to be that we couldn't get the ifb+mirred combo to work
with TSO even without this change - I will have to dig through old emails
to remember the details. So we are making progress ;-> I did write a set
of rules in: Documentation/networking/tc-actions-env-rules.txt
Changli optimized rule #1. When I looked at his patch it seemed
not to harm that case. Sometimes dumb is a good principle ;->

cheers,
jamal


^ permalink raw reply

* Re: [PATCH 5/5 v4] net: add old_queue_mapping into skb->cb
From: Changli Gao @ 2010-12-21 14:03 UTC (permalink / raw)
  To: hadi
  Cc: David S. Miller, Stephen Hemminger, Eric Dumazet, Tom Herbert,
	Jiri Pirko, netdev, netem
In-Reply-To: <1292936837.6535.8.camel@mojatatu>

On Tue, Dec 21, 2010 at 9:07 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-12-17 at 21:41 +0800, Changli Gao wrote:
>
> Sorry for the latency - I am a little swamped.
>
>> I doubt it can work.
>
> It should work - this used to be part of my regression tests.
> If it doesnt work something is broken.

When I tested it, my OS got frozen.

>
> In any case, I shouldnt have used this example because
> it distracted from the point i was trying to make:
> You are restoring the old qmap when the point is we could change
> it to a new one. A simpler example illustrating how
> a qmap could be changed:
>
> ----
> tc filter add dev ifb0 parent 1:0 protocol ip prio 10 u32 \
>  match u32 0 0 flowid 1:2 \
>  action skbedit queue_mapping 4
> ----

Currently, you can only change the rx queue mapping, because for tx,
dev_pick_tx() doesn't use skb->queue_mapping to choose the tx queue.

However, I don't think changing the rx queue mapping is a good idea.
When the skbs returned from ifb enter netif_receive_skb() again,
get_rps_cpu() may warn about the wrong rx queue, and this patch of mine
is meant to solve that problem. Even if the rx queue is legal, a
different rps_cpus setting will be used, and the skbs may be
redirected to different CPUs. Is that expected?


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH net-2.6] net_sched: always clone skbs
From: Changli Gao @ 2010-12-21 14:17 UTC (permalink / raw)
  To: hadi
  Cc: Jarek Poplawski, Eric Dumazet, David S. Miller, netdev,
	Pawel Staszewski
In-Reply-To: <1292939574.6535.27.camel@mojatatu>

On Tue, Dec 21, 2010 at 9:52 PM, jamal <hadi@cyberus.ca> wrote:
>
> Use to be we couldnt get the ifb+mirred combo to work
> with TSO even without this change - i will have to dig old emails
> to remember details. So we are making progress;-> I did write a set
> of rules in: Documentation/networking/tc-actions-env-rules.txt
> Changli optimized rule #1. When i looked at his patch it seems
> to not harm that case. Sometimes dumb is a good principle ;->
>

In order to make my trick work, we need to ensure that dev_queue_xmit()
and dev_hard_start_xmit() accept shared skbs. As Eric pointed out, pktgen
also needs dev->netdev_ops->ndo_start_xmit() to accept shared skbs. We need
to fix every ndo_start_xmit() one by one, then dev_hard_start_xmit(), and
when dev_queue_xmit() is also fixed, my trick can be added back.
:)

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* pull request: sfc-next-2.6 2010-12-21
From: Ben Hutchings @ 2010-12-21 14:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, sf-linux-drivers

The following changes since commit cf78f8ee3de7d8d5b47d371c95716d0e4facf1c4:

  Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next-2.6 (2010-12-10 10:20:43 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next-2.6.git for-davem

Some overdue cleanup of the TX path.

Ben.

Ben Hutchings (2):
      sfc: Remove unused field and comment on a previously removed field
      sfc: Remove ancient support for nesting of TX stop

 drivers/net/sfc/efx.c        |   24 +++++----
 drivers/net/sfc/efx.h        |    2 -
 drivers/net/sfc/net_driver.h |   13 +----
 drivers/net/sfc/tx.c         |  111 +++++++----------------------------------
 4 files changed, 35 insertions(+), 115 deletions(-)

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



^ permalink raw reply

* [PATCH net-next-2.6 1/2] sfc: Remove unused field and comment on a previously removed field
From: Ben Hutchings @ 2010-12-21 14:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1292942817.3256.2.camel@bwh-desktop>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/net_driver.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index 76f2fb1..294379f 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -179,7 +179,6 @@ struct efx_tx_queue {
 	struct efx_nic *efx ____cacheline_aligned_in_smp;
 	unsigned queue;
 	struct efx_channel *channel;
-	struct efx_nic *nic;
 	struct efx_tx_buffer *buffer;
 	struct efx_special_buffer txd;
 	unsigned int ptr_mask;
@@ -321,7 +320,6 @@ enum efx_rx_alloc_method {
  * @irq_moderation: IRQ moderation value (in hardware ticks)
  * @napi_dev: Net device used with NAPI
  * @napi_str: NAPI control structure
- * @reset_work: Scheduled reset work thread
  * @work_pending: Is work pending via NAPI?
  * @eventq: Event queue buffer
  * @eventq_mask: Event queue pointer mask
-- 
1.7.3.2



-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH net-next-2.6 2/2] sfc: Remove ancient support for nesting of TX stop
From: Ben Hutchings @ 2010-12-21 14:49 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1292942817.3256.2.camel@bwh-desktop>

Long before this driver went into mainline, it had support for
multiple TX queues per port, with lockless TX enabled.  Since Linux
did not know anything of this, filling up any hardware TX queue would
stop the core TX queue and multiple hardware TX queues could fill up
before the scheduler reacted.  Thus it was necessary to keep a count
of how many TX queues were stopped and to wake the core TX queue only
when all had free space again.

The driver also previously (ab)used the per-hardware-queue stopped
flag as a counter to deal with various things that can inhibit TX, but
it no longer does that.

Remove the per-channel tx_stop_count, tx_stop_lock and
per-hardware-queue stopped count and just use the networking core
queue state directly.
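
The nesting scheme described above (count how many hardware queues are full, wake the single core queue only when none are) can be modelled roughly as below. This is a hypothetical sketch, not the removed driver code: `toy_port`, `toy_hw_queue_full` and `toy_hw_queue_space` are made-up names, and the locking and atomics the driver needed are left out.

```c
#include <stdbool.h>

/* Hypothetical model: several hardware TX queues share one core queue. */
struct toy_port {
	int stop_count;     /* number of hardware queues currently full */
	bool core_stopped;  /* state of the single core TX queue */
};

/* A hardware queue filled up: the first such event stops the core queue. */
void toy_hw_queue_full(struct toy_port *p)
{
	if (p->stop_count++ == 0)
		p->core_stopped = true;
}

/* A hardware queue has space again: wake only when none remain full. */
void toy_hw_queue_space(struct toy_port *p)
{
	if (p->stop_count > 0 && --p->stop_count == 0)
		p->core_stopped = false;
}
```

Once each hardware queue maps to its own core TX queue, there is nothing left to nest, which is why the count can be dropped in favour of the core queue state.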

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/efx.c        |   24 +++++----
 drivers/net/sfc/efx.h        |    2 -
 drivers/net/sfc/net_driver.h |   11 +---
 drivers/net/sfc/tx.c         |  111 +++++++----------------------------------
 4 files changed, 35 insertions(+), 113 deletions(-)

diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index 2166c1d..711449c 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -461,9 +461,6 @@ efx_alloc_channel(struct efx_nic *efx, int i, struct efx_channel *old_channel)
 		}
 	}
 
-	spin_lock_init(&channel->tx_stop_lock);
-	atomic_set(&channel->tx_stop_count, 1);
-
 	rx_queue = &channel->rx_queue;
 	rx_queue->efx = efx;
 	setup_timer(&rx_queue->slow_fill, efx_rx_slow_fill,
@@ -1406,11 +1403,11 @@ static void efx_start_all(struct efx_nic *efx)
 	 * restart the transmit interface early so the watchdog timer stops */
 	efx_start_port(efx);
 
-	efx_for_each_channel(channel, efx) {
-		if (efx_dev_registered(efx))
-			efx_wake_queue(channel);
+	if (efx_dev_registered(efx))
+		netif_tx_wake_all_queues(efx->net_dev);
+
+	efx_for_each_channel(channel, efx)
 		efx_start_channel(channel);
-	}
 
 	if (efx->legacy_irq)
 		efx->legacy_irq_enabled = true;
@@ -1498,9 +1495,7 @@ static void efx_stop_all(struct efx_nic *efx)
 	/* Stop the kernel transmit interface late, so the watchdog
 	 * timer isn't ticking over the flush */
 	if (efx_dev_registered(efx)) {
-		struct efx_channel *channel;
-		efx_for_each_channel(channel, efx)
-			efx_stop_queue(channel);
+		netif_tx_stop_all_queues(efx->net_dev);
 		netif_tx_lock_bh(efx->net_dev);
 		netif_tx_unlock_bh(efx->net_dev);
 	}
@@ -1896,6 +1891,7 @@ static DEVICE_ATTR(phy_type, 0644, show_phy_type, NULL);
 static int efx_register_netdev(struct efx_nic *efx)
 {
 	struct net_device *net_dev = efx->net_dev;
+	struct efx_channel *channel;
 	int rc;
 
 	net_dev->watchdog_timeo = 5 * HZ;
@@ -1918,6 +1914,14 @@ static int efx_register_netdev(struct efx_nic *efx)
 	if (rc)
 		goto fail_locked;
 
+	efx_for_each_channel(channel, efx) {
+		struct efx_tx_queue *tx_queue;
+		efx_for_each_channel_tx_queue(tx_queue, channel) {
+			tx_queue->core_txq = netdev_get_tx_queue(
+				efx->net_dev, tx_queue->queue / EFX_TXQ_TYPES);
+		}
+	}
+
 	/* Always start with carrier off; PHY events will detect the link */
 	netif_carrier_off(efx->net_dev);
 
diff --git a/drivers/net/sfc/efx.h b/drivers/net/sfc/efx.h
index 003fdb3..d43a7e5 100644
--- a/drivers/net/sfc/efx.h
+++ b/drivers/net/sfc/efx.h
@@ -36,8 +36,6 @@ efx_hard_start_xmit(struct sk_buff *skb, struct net_device *net_dev);
 extern netdev_tx_t
 efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb);
 extern void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index);
-extern void efx_stop_queue(struct efx_channel *channel);
-extern void efx_wake_queue(struct efx_channel *channel);
 
 /* RX */
 extern int efx_probe_rx_queue(struct efx_rx_queue *rx_queue);
diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index 294379f..bdce66d 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -136,6 +136,7 @@ struct efx_tx_buffer {
  * @efx: The associated Efx NIC
  * @queue: DMA queue number
  * @channel: The associated channel
+ * @core_txq: The networking core TX queue structure
  * @buffer: The software buffer ring
  * @txd: The hardware descriptor ring
  * @ptr_mask: The size of the ring minus 1.
@@ -148,8 +149,6 @@ struct efx_tx_buffer {
  *	variable indicates that the queue is empty.  This is to
  *	avoid cache-line ping-pong between the xmit path and the
  *	completion path.
- * @stopped: Stopped count.
- *	Set if this TX queue is currently stopping its port.
  * @insert_count: Current insert pointer
  *	This is the number of buffers that have been added to the
  *	software ring.
@@ -179,6 +178,7 @@ struct efx_tx_queue {
 	struct efx_nic *efx ____cacheline_aligned_in_smp;
 	unsigned queue;
 	struct efx_channel *channel;
+	struct netdev_queue *core_txq;
 	struct efx_tx_buffer *buffer;
 	struct efx_special_buffer txd;
 	unsigned int ptr_mask;
@@ -187,7 +187,6 @@ struct efx_tx_queue {
 	/* Members used mainly on the completion path */
 	unsigned int read_count ____cacheline_aligned_in_smp;
 	unsigned int old_write_count;
-	int stopped;
 
 	/* Members used only on the xmit path */
 	unsigned int insert_count ____cacheline_aligned_in_smp;
@@ -340,8 +339,6 @@ enum efx_rx_alloc_method {
  * @n_rx_overlength: Count of RX_OVERLENGTH errors
  * @n_skbuff_leaks: Count of skbuffs leaked due to RX overrun
  * @rx_queue: RX queue for this channel
- * @tx_stop_count: Core TX queue stop count
- * @tx_stop_lock: Core TX queue stop lock
  * @tx_queue: TX queues for this channel
  */
 struct efx_channel {
@@ -380,10 +377,6 @@ struct efx_channel {
 	bool rx_pkt_csummed;
 
 	struct efx_rx_queue rx_queue;
-
-	atomic_t tx_stop_count;
-	spinlock_t tx_stop_lock;
-
 	struct efx_tx_queue tx_queue[2];
 };
 
diff --git a/drivers/net/sfc/tx.c b/drivers/net/sfc/tx.c
index bdb92b4..2f5e9da 100644
--- a/drivers/net/sfc/tx.c
+++ b/drivers/net/sfc/tx.c
@@ -30,50 +30,6 @@
  */
 #define EFX_TXQ_THRESHOLD(_efx) ((_efx)->txq_entries / 2u)
 
-/* We need to be able to nest calls to netif_tx_stop_queue(), partly
- * because of the 2 hardware queues associated with each core queue,
- * but also so that we can inhibit TX for reasons other than a full
- * hardware queue. */
-void efx_stop_queue(struct efx_channel *channel)
-{
-	struct efx_nic *efx = channel->efx;
-	struct efx_tx_queue *tx_queue = efx_channel_get_tx_queue(channel, 0);
-
-	if (!tx_queue)
-		return;
-
-	spin_lock_bh(&channel->tx_stop_lock);
-	netif_vdbg(efx, tx_queued, efx->net_dev, "stop TX queue\n");
-
-	atomic_inc(&channel->tx_stop_count);
-	netif_tx_stop_queue(
-		netdev_get_tx_queue(efx->net_dev,
-				    tx_queue->queue / EFX_TXQ_TYPES));
-
-	spin_unlock_bh(&channel->tx_stop_lock);
-}
-
-/* Decrement core TX queue stop count and wake it if the count is 0 */
-void efx_wake_queue(struct efx_channel *channel)
-{
-	struct efx_nic *efx = channel->efx;
-	struct efx_tx_queue *tx_queue = efx_channel_get_tx_queue(channel, 0);
-
-	if (!tx_queue)
-		return;
-
-	local_bh_disable();
-	if (atomic_dec_and_lock(&channel->tx_stop_count,
-				&channel->tx_stop_lock)) {
-		netif_vdbg(efx, tx_queued, efx->net_dev, "waking TX queue\n");
-		netif_tx_wake_queue(
-			netdev_get_tx_queue(efx->net_dev,
-					    tx_queue->queue / EFX_TXQ_TYPES));
-		spin_unlock(&channel->tx_stop_lock);
-	}
-	local_bh_enable();
-}
-
 static void efx_dequeue_buffer(struct efx_tx_queue *tx_queue,
 			       struct efx_tx_buffer *buffer)
 {
@@ -234,9 +190,9 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 				 * checked.  Update the xmit path's
 				 * copy of read_count.
 				 */
-				++tx_queue->stopped;
+				netif_tx_stop_queue(tx_queue->core_txq);
 				/* This memory barrier protects the
-				 * change of stopped from the access
+				 * change of queue state from the access
 				 * of read_count. */
 				smp_mb();
 				tx_queue->old_read_count =
@@ -244,10 +200,12 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 				fill_level = (tx_queue->insert_count
 					      - tx_queue->old_read_count);
 				q_space = efx->txq_entries - 1 - fill_level;
-				if (unlikely(q_space-- <= 0))
-					goto stop;
+				if (unlikely(q_space-- <= 0)) {
+					rc = NETDEV_TX_BUSY;
+					goto unwind;
+				}
 				smp_mb();
-				--tx_queue->stopped;
+				netif_tx_start_queue(tx_queue->core_txq);
 			}
 
 			insert_ptr = tx_queue->insert_count & tx_queue->ptr_mask;
@@ -307,13 +265,6 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 
 	/* Mark the packet as transmitted, and free the SKB ourselves */
 	dev_kfree_skb_any(skb);
-	goto unwind;
-
- stop:
-	rc = NETDEV_TX_BUSY;
-
-	if (tx_queue->stopped == 1)
-		efx_stop_queue(tx_queue->channel);
 
  unwind:
 	/* Work backwards until we hit the original insert pointer value */
@@ -400,32 +351,21 @@ void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index)
 {
 	unsigned fill_level;
 	struct efx_nic *efx = tx_queue->efx;
-	struct netdev_queue *queue;
 
 	EFX_BUG_ON_PARANOID(index > tx_queue->ptr_mask);
 
 	efx_dequeue_buffers(tx_queue, index);
 
 	/* See if we need to restart the netif queue.  This barrier
-	 * separates the update of read_count from the test of
-	 * stopped. */
+	 * separates the update of read_count from the test of the
+	 * queue state. */
 	smp_mb();
-	if (unlikely(tx_queue->stopped) && likely(efx->port_enabled)) {
+	if (unlikely(netif_tx_queue_stopped(tx_queue->core_txq)) &&
+	    likely(efx->port_enabled)) {
 		fill_level = tx_queue->insert_count - tx_queue->read_count;
 		if (fill_level < EFX_TXQ_THRESHOLD(efx)) {
 			EFX_BUG_ON_PARANOID(!efx_dev_registered(efx));
-
-			/* Do this under netif_tx_lock(), to avoid racing
-			 * with efx_xmit(). */
-			queue = netdev_get_tx_queue(
-				efx->net_dev,
-				tx_queue->queue / EFX_TXQ_TYPES);
-			__netif_tx_lock(queue, smp_processor_id());
-			if (tx_queue->stopped) {
-				tx_queue->stopped = 0;
-				efx_wake_queue(tx_queue->channel);
-			}
-			__netif_tx_unlock(queue);
+			netif_tx_wake_queue(tx_queue->core_txq);
 		}
 	}
 
@@ -487,7 +427,6 @@ void efx_init_tx_queue(struct efx_tx_queue *tx_queue)
 	tx_queue->read_count = 0;
 	tx_queue->old_read_count = 0;
 	tx_queue->empty_read_count = 0 | EFX_EMPTY_COUNT_VALID;
-	BUG_ON(tx_queue->stopped);
 
 	/* Set up TX descriptor ring */
 	efx_nic_init_tx(tx_queue);
@@ -523,12 +462,6 @@ void efx_fini_tx_queue(struct efx_tx_queue *tx_queue)
 
 	/* Free up TSO header cache */
 	efx_fini_tso(tx_queue);
-
-	/* Release queue's stop on port, if any */
-	if (tx_queue->stopped) {
-		tx_queue->stopped = 0;
-		efx_wake_queue(tx_queue->channel);
-	}
 }
 
 void efx_remove_tx_queue(struct efx_tx_queue *tx_queue)
@@ -770,9 +703,9 @@ static int efx_tx_queue_insert(struct efx_tx_queue *tx_queue,
 			 * since the xmit path last checked.  Update
 			 * the xmit path's copy of read_count.
 			 */
-			++tx_queue->stopped;
+			netif_tx_stop_queue(tx_queue->core_txq);
 			/* This memory barrier protects the change of
-			 * stopped from the access of read_count. */
+			 * queue state from the access of read_count. */
 			smp_mb();
 			tx_queue->old_read_count =
 				ACCESS_ONCE(tx_queue->read_count);
@@ -784,7 +717,7 @@ static int efx_tx_queue_insert(struct efx_tx_queue *tx_queue,
 				return 1;
 			}
 			smp_mb();
-			--tx_queue->stopped;
+			netif_tx_start_queue(tx_queue->core_txq);
 		}
 
 		insert_ptr = tx_queue->insert_count & tx_queue->ptr_mask;
@@ -1124,8 +1057,10 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
 
 	while (1) {
 		rc = tso_fill_packet_with_fragment(tx_queue, skb, &state);
-		if (unlikely(rc))
-			goto stop;
+		if (unlikely(rc)) {
+			rc2 = NETDEV_TX_BUSY;
+			goto unwind;
+		}
 
 		/* Move onto the next fragment? */
 		if (state.in_len == 0) {
@@ -1154,14 +1089,6 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
 	netif_err(efx, tx_err, efx->net_dev,
 		  "Out of memory for TSO headers, or PCI mapping error\n");
 	dev_kfree_skb_any(skb);
-	goto unwind;
-
- stop:
-	rc2 = NETDEV_TX_BUSY;
-
-	/* Stop the queue if it wasn't stopped before. */
-	if (tx_queue->stopped == 1)
-		efx_stop_queue(tx_queue->channel);
 
  unwind:
 	/* Free the DMA mapping we were in the process of writing out */
-- 
1.7.3.2


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Re: [PATCH 5/5 v4] net: add old_queue_mapping into skb->cb
From: Eric Dumazet @ 2010-12-21 15:24 UTC (permalink / raw)
  To: Changli Gao
  Cc: hadi, David S. Miller, Stephen Hemminger, Tom Herbert, Jiri Pirko,
	netdev, netem
In-Reply-To: <AANLkTimqemuhxCKq-PJu+FD-MDgKaHnYKnP_2ch30wxE@mail.gmail.com>

Le mardi 21 décembre 2010 à 22:03 +0800, Changli Gao a écrit :
> However, I don't think change the rx queue mapping is a good idea.
> When the skbs returned from ifb enter netif_receive_skb() again,
> get_rps_cpu() may warn about the wrong rx queue, and my this patch is
> used to solve this problem. Even though the rx queue is legal, a
> different rps_cpus settings will be used, and the skbs may be
> redirected to different CPUs. Is it expected?
> 
> 

Do we really want a multi queue ifb at all ?

Why not use percpu data and LLTX, like we did for other virtual devices
(loopback, tunnels, vlans, ...)

I guess most ifb uses need to finally deliver packets in a monoqueue
anyway, optimizing ifb might raise lock contention on this resource.

See what we did in commit 79640a4ca6955e3e (net: add additional lock to
qdisc to increase throughput) : Adding one spinlock actually helped a
lot ;)




^ permalink raw reply

* Re: IPTV buffering
From: Jesper Dangaard Brouer @ 2010-12-21 16:24 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List, netdev
In-Reply-To: <20101216111843.88E16F0C32AB5@borg.medozas.de>



On Thu, 16 Dec 2010, Jan Engelhardt wrote:
> On Thursday 2010-12-16 10:57, Jesper Dangaard Brouer wrote:
>
>> [...] NetConf 2010, see:
>>
>> http://vger.kernel.org/netconf2010.html
>
> I just went over a few slide sets, and noticed Dave's Netfilter summary
> about your IPTV talk, enlisting the point
>
> * Ethernet switches buffer too small
>
> ("too small".. "too few"?) Given the recent uproar about bufferbloat in
> routing devices (see LWN coverage about Gettys's articles), wanting
> larger buffers seems to almost contradict what Gettys would like.

Always wanting small buffers doesn't make sense.  It seems that he is not 
considering that network equipment can be used for other things than 
TCP/IP.

What I want is a *smooth* IPTV multicast signal (which thus consumes 
minimal buffer space), but because the streamers are bursting packets, I 
want large enough buffers in the switch, to handle these bursts.

What I recommend (in the backbone) is to increase the buffer size in the 
QoS queue, which is used for e.g. IPTV/multicast.  And have another queue 
for the normal Internet traffic (because too large buffers can cause 
issues).


> Though TV is usually delivered via UDP rather than TCP, some of the
> protocols may also implement some sort of congestion recognition or
> even avoidance technique - IIRC realplayer had something that
> adapted video quality based upon transfer rate.

Our TV streamers send out a MULTICAST signal, thus there is NOT any
congestion feedback...


> Wanting more buffers vs. wanting less buffering seems to be quite
> contradictory. Jesper, what is your take on this?

Skimming through Gettys's blog post, I think Gettys has actually missed what 
is happening.  He should read my masters thesis[1]... The real problem is 
that TCP/IP is clocked by the ACK packets, and on asymmetric links (like 
ADSL and DOCSIS), the ACK packets are simply coming downstream too fast 
on the larger downstream link, resulting in bursts and high latency on the 
upstream link.

With the ADSL-optimizer I actually solved Gettys's problem, but I guess the 
real solution would be to implement a TCP algorithm which handles this 
asymmetry, and e.g. isn't based on the ACK feedback...

[1] http://www.adsl-optimizer.dk/thesis/

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply

* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters
From: Jesper Dangaard Brouer @ 2010-12-21 16:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Patrick McHardy, netfilter-devel, netdev, Stephen Hemminger
In-Reply-To: <1292856346.2800.54.camel@edumazet-laptop>

On Mon, 2010-12-20 at 15:45 +0100, Eric Dumazet wrote:
...
> > There are no packet overruns/drops if I run "iptables -vnL >
> > /dev/null" without tracing enabled and only 1Gbit/s pktgen at 512
> > bytes packets.  If I enable tracing while calling iptables I see
> > packet drops/overruns.  So I guess this is caused by the tracing
> > overhead.
> 
> yes, probably :)
> 
> > 
> > I'll try to rerun my test without all the lock debugging options
> > enabled.

Results are much better without the kernel debugging options enabled.
I took the .config from production and enabled tracer "function_graph".
And applied your patches (plus vzalloc) on top of 2.6.36-stable tree.

I can now hit the system with a pktgen at 128 bytes, and see no
drops/overruns while running iptables.  (This packet load at 128bytes
is 822 kpps and 840Mbit/s) (iptables ruleset is the big chains: 20929
rules: 81239).
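
A quick sanity check of those two figures, as a sketch only: assuming the
128 bytes are counted per packet without preamble or inter-frame gap (the
mail does not say which bytes are counted), the bit rate is packets per
second times bytes per packet times 8.  The helper name bits_per_sec is
made up for illustration.

```c
#include <stdint.h>

/* Bit rate for a given packet rate and per-packet byte count.
 * Which bytes are counted (with or without FCS, preamble, gap)
 * is an assumption left to the caller. */
uint64_t bits_per_sec(uint64_t pps, uint64_t bytes_per_pkt)
{
	return pps * bytes_per_pkt * 8;
}
```

bits_per_sec(822000, 128) gives 841728000, i.e. roughly 840 Mbit/s, which
agrees with the numbers quoted above.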

If I reduce the ftrace filter to only track get_counters, I can even
run a trace without any drops.

 echo get_counters >  /sys/kernel/debug/tracing/set_ftrace_filter

Some funny trace stats on get_counters(), under the packet storm.
When running iptables on a CPU not processing packets (via taskset),
the execution time is increased to 124ms.  If I force iptables to run
on a CPU processing packets, the execution time is increased to
1308ms, which is large but the expected behavior.

Acked-by: Jesper Dangaard Brouer <hawk@comx.dk>

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply

* RE: e1000e crash with 82574L  2.6.37-0.rc5
From: Wyborny, Carolyn @ 2010-12-21 18:13 UTC (permalink / raw)
  To: Brian Neu, netdev@vger.kernel.org; +Cc: e1000-devel@lists.sourceforge.net
In-Reply-To: <912683.73513.qm@web39308.mail.mud.yahoo.com>



>-----Original Message-----
>From: Brian Neu [mailto:proclivity76@yahoo.com]
>Sent: Monday, December 20, 2010 5:20 PM
>To: Wyborny, Carolyn; netdev@vger.kernel.org
>Cc: e1000-devel@lists.sourceforge.net
>Subject: Re: e1000e crash with 82574L 2.6.37-0.rc5
>
>> Hello,
>
>
>Thanks for responding.  I really wasn't sure where to report this.
>
>> We have some  known issues with 82574L but most have been resolved in
>the
>>latest versions, so  I'll need some more information.
>>
>> What version of e1000e are you using  exactly?  Are you able to
>download and
>>test the latest version of the  driver from SourceForge?
>
>
>This is the fedora build of kernel 2.6.37.0.rc5 which I downloaded from
>koji.fedoraproject.org.  I'm not sure which version of e1000e is
>included, but
>if I need to build a new module, just let me know.
>
>I had opened a bugzilla report with redhat which also has more
>backtraces:
> https://bugzilla.redhat.com/show_bug.cgi?id=625776
>
>> Please open an issue at SourceForge.net for easier  tracking of
>debugging
>>information.
>>
>> Please provide an output of lspci  -vvv.
>
>
>Attached.
>
>> What hw platform is this happening on?
>
>It's a Supermicro MB for AMD Socket G34
>
>> How often does it  happen and how long does it take to happen after
>reset or
>>reboot?
>
>It usually doesn't happen unless the system has been running for hours
>or days.
>
>>Is ASPM  enabled or disabled on your system.  Its possible to disable
>this in
>>the  BIOS, but not all BIOS provide the option.  If its enabled for
>some reason,
>>please disable it and try the driver  again.
>
>
>I'm going to check on this very soon and reply to the e1000 list only.

Hello Brian,

With this being a variant of Fedora, filing a Bugzilla was the right thing to do.  I missed that in your first email.

I checked your lspci output and ASPM *is* disabled as it should be.

When the adapter crashes/resets (per the log) does it come back up or stay down?

Since you're downloading off the Fedora build site, you could be encountering any number of things as these are live untested builds at this point.

Can you download a vanilla upstream build from kernel.org complete and see if the problem is happening there?  Or, you can download the latest released Fedora 14 from fedoraproject.org and see if the problem is happening with that build.  

Let me know how it goes,

Thanks,

Carolyn

Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation

  

^ permalink raw reply

* [RFC] ipv4: add ICMP socket kind
From: Vasiliy Kulikov @ 2010-12-21 18:18 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, Pavel Kankovsky, Solar Designer

Hi,

This patch adds IPPROTO_ICMP socket kind.  It makes it possible to send
ICMP_ECHO messages and receive corresponding ICMP_ECHOREPLY messages
without any special privileges.  In other words, the patch makes it
possible to implement setuid-less /bin/ping.

A new ping socket is created with

    socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)

Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.
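
From userspace, the interface described above might be used roughly as in
the following sketch.  It is not taken from the patched iputils; the helper
names open_ping_socket and build_echo_request are hypothetical.  The id and
checksum fields are left zero on the assumption, per the description in
this mail, that the kernel fills them in.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define ECHO_HDR_LEN 8	/* fixed size of an ICMP echo header */

/* Open a ping socket.  Returns -1 with errno set if the kernel lacks
 * the feature or refuses permission. */
int open_ping_socket(void)
{
	return socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP);
}

/* Build an ICMP Echo request in buf.  Octets 2-3 (checksum) and 4-5
 * (id) stay zero; octets 6-7 carry the sequence number in network
 * order.  Returns the message length, or 0 if buf is too small. */
size_t build_echo_request(uint8_t *buf, size_t buflen, uint16_t seq,
			  const void *payload, size_t payload_len)
{
	if (buflen < ECHO_HDR_LEN || payload_len > buflen - ECHO_HDR_LEN)
		return 0;

	memset(buf, 0, ECHO_HDR_LEN);
	buf[0] = 8;			/* type: ICMP_ECHO */
	buf[1] = 0;			/* code: must be zero */
	buf[6] = (uint8_t)(seq >> 8);
	buf[7] = (uint8_t)(seq & 0xff);
	if (payload_len)
		memcpy(buf + ECHO_HDR_LEN, payload, payload_len);

	return ECHO_HDR_LEN + payload_len;
}
```

The resulting buffer would then be passed to sendto() on the descriptor
from open_ping_socket(), with the destination in a struct sockaddr_in
whose sin_port is ignored.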

Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport header values like sequence numbers by other means.
2) Make it easier to port existing programs using raw sockets.

ICMP headers given to send() are checked and sanitized. The type
must be ICMP_ECHO and the code must be zero (future extensions might relax
this, see below). The id is set to the number (local port) of the socket,
the checksum is always recomputed.
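
For readers unfamiliar with what "recomputed" involves: the ICMP checksum
is the standard Internet checksum (RFC 1071) over the whole ICMP message.
The following is a plain userspace illustration of that algorithm, not the
code the patch uses in the kernel; the function name inet_checksum is made
up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of big-endian 16-bit words, folded to 16 bits
 * and complemented.  The checksum field must be zero while summing.
 * The result is to be stored big-endian in octets 2-3.  Intended for
 * messages up to 64 KiB, so the 32-bit accumulator cannot overflow. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(data != NULL || len == 0);

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)			/* odd trailing byte, zero padded */
		sum += (uint32_t)data[len - 1] << 8;

	while (sum >> 16)		/* fold the carries back in */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

A receiver can validate a message by running the same sum over it with the
checksum field filled in; a correct message yields 0.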

ICMP reply packets received from the network are demultiplexed according
to their id's and returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).
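
Since recv() hands back the ICMP header unmodified, a program has to pick
the id and sequence number out of the reply itself.  A minimal sketch of
that step follows; struct echo_reply and parse_echo_reply are hypothetical
names, and rejecting a non-zero code is a choice made here, not something
the patch mandates.

```c
#include <stddef.h>
#include <stdint.h>

struct echo_reply {
	uint16_t id;		/* octets 4-5, the socket's local "port" */
	uint16_t seq;		/* octets 6-7 */
	const uint8_t *payload;	/* points into the caller's buffer */
	size_t payload_len;
};

/* Parse a buffer returned by recv() on a ping socket.
 * Returns 0 on success, -1 if it is not a well-formed Echo Reply. */
int parse_echo_reply(const uint8_t *buf, size_t len, struct echo_reply *out)
{
	if (len < 8 || buf[0] != 0 /* ICMP_ECHOREPLY */ || buf[1] != 0)
		return -1;

	out->id = (uint16_t)((buf[4] << 8) | buf[5]);
	out->seq = (uint16_t)((buf[6] << 8) | buf[7]);
	out->payload = buf + 8;
	out->payload_len = len - 8;
	return 0;
}
```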

The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).


Userspace ping util & patch for it:
ftp://mirrors.kernel.org/openwall/Owl/current/sources/Owl/packages/iputils/iputils-ss020927.tar.gz
ftp://ftp.intelib.org/pub/segoon/iputils-ss020927-pingsock.diff


Similar functionality is implemented in Mac OS X: 
http://www.manpagez.com/man/4/icmp/


TODO: implement ICMPv6 sockets.


All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.

Tested with 2K "ping -i0.2" running on Core i7 (8 cores), Load average = 80,
and with 20 simultaneous "ping -c1", each for half an hour.


Initially this functionality was written by Pavel Kankovsky for linux
2.4.32, but unfortunately it was never made public.


All comments are appreciated, especially about:

1) locking (get_port, lookup, unhash) - IMO ping sockets are too rarely
  used to optimise it.
2) necessity of ICMP_TIMESTAMP, ICMP_INFO_REQUEST, ICMP_ADDRESS - does
  anybody use it nowadays?
3) whether it's better to stay in net/ipv4/ping.c or move to net/ipv4/icmp.c

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Pavel Kankovsky <peak@argo.troja.mff.cuni.cz>
--
diff --git a/include/net/ping.h b/include/net/ping.h
new file mode 100644
index 0000000..96ba9e8
--- /dev/null
+++ b/include/net/ping.h
@@ -0,0 +1,68 @@
+/*
+ * INET		An implementation of the TCP/IP protocol suite for the LINUX
+ *		operating system.  INET is implemented using the  BSD Socket
+ *		interface as the means of communication with the user level.
+ *
+ *		Definitions for the "ping" module.
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ */
+#ifndef _PING_H
+#define _PING_H
+
+#include <net/netns/hash.h>
+
+#ifdef CONFIG_IP_PING_DEBUG
+#define ping_debug(fmt, x...) printk(KERN_INFO fmt, ## x)
+#else
+#define ping_debug(fmt, x...) do {} while (0)
+#endif
+
+/* PING_HTABLE_SIZE must be power of 2 */
+#define PING_HTABLE_SIZE 	64
+#define PING_HTABLE_MASK 	(PING_HTABLE_SIZE-1)
+
+#define ping_portaddr_for_each_entry(__sk, node, list) \
+	hlist_nulls_for_each_entry(__sk, node, list, sk_nulls_node)
+
+#define MAX_PING_IDENT 	65536
+
+
+struct ping_hslot {
+	struct hlist_nulls_head	head;
+} __attribute__((aligned(2 * sizeof(long))));
+
+struct ping_table {
+	struct ping_hslot	hash[PING_HTABLE_SIZE];
+	rwlock_t		lock;
+};
+
+struct ping_iter_state {
+	struct seq_net_private  p;
+	int			bucket;
+};
+
+extern struct proto ping_prot;
+
+
+#ifdef CONFIG_IP_PING
+#define icmp_echoreply ping_rcv
+#else
+#define icmp_echoreply icmp_discard
+#endif
+
+extern void ping_rcv(struct sk_buff *);
+extern void ping_err(struct sk_buff *, u32 info);
+
+#ifdef CONFIG_PROC_FS
+extern int __init ping_proc_init(void);
+extern void ping_proc_exit(void);
+#endif
+
+void __init ping_init(void);
+
+
+#endif /* _PING_H */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 9e95d7f..5cb13a3 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -14,6 +14,27 @@ config IP_MULTICAST
 	  <file:Documentation/networking/multicast.txt>. For most people, it's
 	  safe to say N.
 
+config IP_PING
+	bool "IP: ping socket"
+	depends on EXPERIMENTAL
+	help
+	  This option introduces a new kind of sockets - "ping sockets".
+
+	  A ping socket makes it possible to send ICMP Echo messages and receive
+	  corresponding ICMP Echo Reply messages without any special privileges.
+	  In other words, it makes it possible to implement setuid-less /bin/ping.
+
+	  A new ping socket is created with socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP).
+
+config IP_PING_DEBUG
+	bool "IP: ping socket debug output"
+	depends on IP_PING
+	default n
+	help
+	  Enable the inclusion of debug code in the ICMP ping sockets.
+	  Be aware that doing this will impact performance.
+	  If unsure say N.
+
 config IP_ADVANCED_ROUTER
 	bool "IP: advanced router"
 	---help---
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 4978d22..3a37479 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_IP_FIB_TRIE) += fib_trie.o
 obj-$(CONFIG_PROC_FS) += proc.o
 obj-$(CONFIG_IP_MULTIPLE_TABLES) += fib_rules.o
 obj-$(CONFIG_IP_MROUTE) += ipmr.o
+obj-$(CONFIG_IP_PING) += ping.o
 obj-$(CONFIG_NET_IPIP) += ipip.o
 obj-$(CONFIG_NET_IPGRE_DEMUX) += gre.o
 obj-$(CONFIG_NET_IPGRE) += ip_gre.o
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f581f77..bbe5eb3 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -105,6 +105,7 @@
 #include <net/tcp.h>
 #include <net/udp.h>
 #include <net/udplite.h>
+#include <net/ping.h>
 #include <linux/skbuff.h>
 #include <net/sock.h>
 #include <net/raw.h>
@@ -992,6 +993,16 @@ static struct inet_protosw inetsw_array[] =
 		.flags =      INET_PROTOSW_PERMANENT,
        },
 
+#ifdef CONFIG_IP_PING
+       {
+		.type =       SOCK_DGRAM,
+		.protocol =   IPPROTO_ICMP,
+		.prot =       &ping_prot,
+		.ops =        &inet_dgram_ops,
+		.no_check =   UDP_CSUM_DEFAULT,
+		.flags =      INET_PROTOSW_REUSE,
+       },
+#endif
 
        {
 	       .type =       SOCK_RAW,
@@ -1520,6 +1531,9 @@ static const struct net_protocol udp_protocol = {
 
 static const struct net_protocol icmp_protocol = {
 	.handler =	icmp_rcv,
+#ifdef CONFIG_IP_PING
+	.err_handler =	ping_err,
+#endif
 	.no_policy =	1,
 	.netns_ok =	1,
 };
@@ -1635,6 +1649,12 @@ static int __init inet_init(void)
 	if (rc)
 		goto out_unregister_udp_proto;
 
+#ifdef CONFIG_IP_PING
+	rc = proto_register(&ping_prot, 1);
+	if (rc)
+		goto out_unregister_raw_proto;
+#endif
+
 	/*
 	 *	Tell SOCKET that we are alive...
 	 */
@@ -1690,6 +1710,10 @@ static int __init inet_init(void)
 	/* Add UDP-Lite (RFC 3828) */
 	udplite4_register();
 
+#ifdef CONFIG_IP_PING
+	ping_init();
+#endif
+
 	/*
 	 *	Set the ICMP layer up
 	 */
@@ -1720,6 +1744,10 @@ static int __init inet_init(void)
 	rc = 0;
 out:
 	return rc;
+#ifdef CONFIG_IP_PING
+out_unregister_raw_proto:
+	proto_unregister(&raw_prot);
+#endif
 out_unregister_udp_proto:
 	proto_unregister(&udp_prot);
 out_unregister_tcp_proto:
@@ -1744,11 +1772,19 @@ static int __init ipv4_proc_init(void)
 		goto out_tcp;
 	if (udp4_proc_init())
 		goto out_udp;
+#ifdef CONFIG_IP_PING
+	if (ping_proc_init())
+		goto out_ping;
+#endif
 	if (ip_misc_proc_init())
 		goto out_misc;
 out:
 	return rc;
 out_misc:
+#ifdef CONFIG_IP_PING
+	ping_proc_exit();
+out_ping:
+#endif
 	udp4_proc_exit();
 out_udp:
 	tcp4_proc_exit();
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e5d1a44..83232e2 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -83,6 +83,7 @@
 #include <net/tcp.h>
 #include <net/udp.h>
 #include <net/raw.h>
+#include <net/ping.h>
 #include <linux/skbuff.h>
 #include <net/sock.h>
 #include <linux/errno.h>
@@ -808,6 +809,17 @@ static void icmp_redirect(struct sk_buff *skb)
 			       iph->saddr, skb->dev);
 		break;
 	}
+
+#ifdef CONFIG_IP_PING
+	/* Ping wants to see redirects.
+         * Let's pretend they are errors of sorts... */
+	if (iph->protocol == IPPROTO_ICMP &&
+	    iph->ihl >= 5 &&
+	    pskb_may_pull(skb, (iph->ihl<<2)+8)) {
+		ping_err(skb, icmp_hdr(skb)->un.gateway);
+	}
+#endif
+
 out:
 	return;
 out_err:
@@ -1068,7 +1080,7 @@ error:
  */
 static const struct icmp_control icmp_pointers[NR_ICMP_TYPES + 1] = {
 	[ICMP_ECHOREPLY] = {
-		.handler = icmp_discard,
+		.handler = icmp_echoreply,
 	},
 	[1] = {
 		.handler = icmp_discard,
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
new file mode 100644
index 0000000..8a769ce
--- /dev/null
+++ b/net/ipv4/ping.c
@@ -0,0 +1,894 @@
+/*
+ * INET		An implementation of the TCP/IP protocol suite for the LINUX
+ *		operating system.  INET is implemented using the  BSD Socket
+ *		interface as the means of communication with the user level.
+ *
+ *		"Ping" sockets
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Based on ipv4/udp.c code.
+ *
+ * Authors:	Vasiliy Kulikov,
+ *		Pavel Kankovsky (for 2.4.32)
+ *
+ */
+
+#include <asm/system.h>
+#include <linux/uaccess.h>
+#include <asm/ioctls.h>
+#include <linux/types.h>
+#include <linux/fcntl.h>
+#include <linux/socket.h>
+#include <linux/sockios.h>
+#include <linux/in.h>
+#include <linux/errno.h>
+#include <linux/timer.h>
+#include <linux/mm.h>
+#include <linux/inet.h>
+#include <linux/netdevice.h>
+#include <net/snmp.h>
+#include <net/ip.h>
+#include <net/ipv6.h>
+#include <net/icmp.h>
+#include <net/protocol.h>
+#include <linux/skbuff.h>
+#include <linux/proc_fs.h>
+#include <net/sock.h>
+#include <net/ping.h>
+#include <net/icmp.h>
+#include <net/udp.h>
+#include <net/route.h>
+#include <net/inet_common.h>
+#include <net/checksum.h>
+
+
+struct ping_table ping_table __read_mostly;
+
+u16 ping_port_rover;
+
+static inline int ping_hashfn(struct net *net, unsigned num, unsigned mask)
+{
+	int res = (num + net_hash_mix(net)) & mask;
+	ping_debug("hash(%d) = %d\n", num, res);
+	return res;
+}
+
+static inline struct ping_hslot *ping_hashslot(struct ping_table *table,
+					     struct net *net, unsigned num)
+{
+	return &table->hash[ping_hashfn(net, num, PING_HTABLE_MASK)];
+}
+
+static int ping_v4_get_port(struct sock *sk, unsigned short ident)
+{
+	struct hlist_nulls_node *node;
+	struct ping_hslot *hlist;
+	struct inet_sock *isk, *isk2;
+	struct sock *sk2 = NULL;
+
+	isk = inet_sk(sk);
+	write_lock_bh(&ping_table.lock);
+	if (ident == 0) {
+		u32 i;
+		u16 result = ping_port_rover + 1;
+
+		for (i = 0; i < (1L << 16); i++, result++) {
+			if (!result)
+				result++; /* avoid zero */
+			hlist = ping_hashslot(&ping_table, sock_net(sk),
+					    result);
+			ping_portaddr_for_each_entry(sk2, node, &hlist->head) {
+				isk2 = inet_sk(sk2);
+
+				if (isk2->inet_num == result)
+					goto next_port;
+			}
+
+			/* found */
+			ping_port_rover = ident = result;
+			break;
+next_port:
+			;
+		}
+		if (i >= (1L << 16))
+			goto fail;
+	} else {
+		hlist = ping_hashslot(&ping_table, sock_net(sk), ident);
+		ping_portaddr_for_each_entry(sk2, node, &hlist->head) {
+			isk2 = inet_sk(sk2);
+
+			if ((isk2->inet_num == ident) &&
+			    (sk2 != sk) &&
+			    (!sk2->sk_reuse || !sk->sk_reuse))
+				goto fail;
+		}
+	}
+
+	ping_debug("found port/ident = %d\n", ident);
+	isk->inet_num = ident;
+	if (sk_unhashed(sk)) {
+		ping_debug("was not hashed\n");
+		sock_hold(sk);
+		hlist_nulls_add_head(&sk->sk_nulls_node, &hlist->head);
+		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
+	}
+	write_unlock_bh(&ping_table.lock);
+	return 0;
+
+fail:
+	write_unlock_bh(&ping_table.lock);
+	return 1;
+}
+
+static void ping_v4_hash(struct sock *sk)
+{
+	ping_debug("ping_v4_hash(sk->port=%u)\n", inet_sk(sk)->inet_num);
+	BUG(); /* "Please do not press this button again." */
+}
+
+static void ping_v4_unhash(struct sock *sk)
+{
+	struct inet_sock *isk = inet_sk(sk);
+	ping_debug("ping_v4_unhash(isk=%p,isk->num=%u)\n", isk, isk->inet_num);
+	if (sk_hashed(sk)) {
+		struct ping_hslot *hslot;
+
+		hslot = ping_hashslot(&ping_table, sock_net(sk), isk->inet_num);
+		write_lock_bh(&ping_table.lock);
+		hlist_nulls_del(&sk->sk_nulls_node);
+		sock_put(sk);
+		isk->inet_num = isk->inet_sport = 0;
+		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+		write_unlock_bh(&ping_table.lock);
+	}
+}
+
+struct sock *ping_v4_lookup(struct net *net, u32 saddr, u32 daddr,
+	 u16 ident, int dif)
+{
+	struct ping_hslot *hslot = ping_hashslot(&ping_table, net, ident);
+	struct sock *sk = NULL;
+	struct inet_sock *isk;
+	struct hlist_nulls_node *hnode;
+
+	ping_debug("try to find: num = %d, daddr = %ld, dif = %d\n",
+			 (int)ident, (unsigned long)daddr, dif);
+	read_lock_bh(&ping_table.lock);
+
+	ping_portaddr_for_each_entry(sk, hnode, &hslot->head) {
+		isk = inet_sk(sk);
+
+		ping_debug("found: %p: num = %d, daddr = %ld, dif = %d\n", sk,
+			 (int)isk->inet_num, (unsigned long)isk->inet_rcv_saddr,
+			 sk->sk_bound_dev_if);
+
+		ping_debug("iterate\n");
+		if (isk->inet_num != ident)
+			continue;
+		if (isk->inet_rcv_saddr && isk->inet_rcv_saddr != daddr)
+			continue;
+		if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif)
+			continue;
+
+		sock_hold(sk);
+		goto exit;
+	}
+
+	sk = NULL;
+exit:
+	read_unlock_bh(&ping_table.lock);
+
+	return sk;
+}
+
+static void ping_close(struct sock *sk, long timeout)
+{
+	ping_debug("ping_close(sk=%p,sk->num=%u)\n",
+		inet_sk(sk), inet_sk(sk)->inet_num);
+	ping_debug("isk->refcnt = %d\n", sk->sk_refcnt.counter);
+
+	sk_common_release(sk);
+}
+
+/*
+ * We need our own bind because there are no privileged id's == local ports.
+ * Moreover, we don't allow binding to multi- and broadcast addresses.
+ */
+
+static int ping_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+{
+	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
+	struct inet_sock *isk = inet_sk(sk);
+	unsigned short snum;
+	int chk_addr_ret;
+	int err;
+
+	if (addr_len < sizeof(struct sockaddr_in))
+		return -EINVAL;
+
+	ping_debug("ping_v4_bind(sk=%p,sa_addr=%08x,sa_port=%d)\n",
+		sk, addr->sin_addr.s_addr, ntohs(addr->sin_port));
+
+	chk_addr_ret = inet_addr_type(sock_net(sk), addr->sin_addr.s_addr);
+	if (addr->sin_addr.s_addr == INADDR_ANY)
+		chk_addr_ret = RTN_LOCAL;
+
+	if ((sysctl_ip_nonlocal_bind == 0 &&
+	    isk->freebind == 0 && isk->transparent == 0 &&
+	     chk_addr_ret != RTN_LOCAL) ||
+	    chk_addr_ret == RTN_MULTICAST ||
+	    chk_addr_ret == RTN_BROADCAST)
+		return -EADDRNOTAVAIL;
+
+	lock_sock(sk);
+
+	err = -EINVAL;
+	if (isk->inet_num != 0)
+		goto out;
+
+	err = -EADDRINUSE;
+	isk->inet_rcv_saddr = isk->inet_saddr = addr->sin_addr.s_addr;
+	snum = ntohs(addr->sin_port);
+	if (ping_v4_get_port(sk, snum) != 0) {
+		isk->inet_saddr = isk->inet_rcv_saddr = 0;
+		goto out;
+	}
+
+	ping_debug("after bind(): num = %d, daddr = %lu, dif = %d\n",
+		(int)isk->inet_num,
+		(unsigned long) isk->inet_rcv_saddr,
+		(int)sk->sk_bound_dev_if);
+
+	err = 0;
+	if (isk->inet_rcv_saddr)
+		sk->sk_userlocks |= SOCK_BINDADDR_LOCK;
+	if (snum)
+		sk->sk_userlocks |= SOCK_BINDPORT_LOCK;
+	isk->inet_sport = htons(isk->inet_num);
+	isk->inet_daddr = 0;
+	isk->inet_dport = 0;
+	sk_dst_reset(sk);
+out:
+	release_sock(sk);
+	ping_debug("ping_v4_bind -> %d\n", err);
+	return err;
+}
+
+/*
+ * Is this a supported type of ICMP message?
+ */
+
+static inline int ping_supported(int type, int code)
+{
+	if (type == ICMP_ECHO && code == 0)
+		return 1;
+	return 0;
+}
+
+/*
+ * This routine is called by the ICMP module when it gets some
+ * sort of error condition.
+ */
+
+static int ping_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+
+void ping_err(struct sk_buff *skb, u32 info)
+{
+	struct iphdr *iph = (struct iphdr *)skb->data;
+	struct icmphdr *icmph = (struct icmphdr *)(skb->data+(iph->ihl<<2));
+	struct inet_sock *inet_sock;
+	int type = icmph->type;
+	int code = icmph->code;
+	struct net *net = dev_net(skb->dev);
+	struct sock *sk;
+	int harderr;
+	int err;
+
+	/* We assume the packet has already been checked by icmp_unreach */
+
+	if (!ping_supported(icmph->type, icmph->code))
+		return;
+
+	ping_debug("ping_err(type=%04x,code=%04x,id=%04x,seq=%04x)\n", type,
+		code, ntohs(icmph->un.echo.id), ntohs(icmph->un.echo.sequence));
+
+	sk = ping_v4_lookup(net, iph->daddr, iph->saddr,
+			    ntohs(icmph->un.echo.id), skb->dev->ifindex);
+	if (sk == NULL) {
+		ICMP_INC_STATS_BH(net, ICMP_MIB_INERRORS);
+		ping_debug("no socket, dropping\n");
+		return;	/* No socket for error */
+	}
+	ping_debug("err on socket %p\n", sk);
+
+	err = 0;
+	harderr = 0;
+	inet_sock = inet_sk(sk);
+
+	switch (type) {
+	default:
+	case ICMP_TIME_EXCEEDED:
+		err = EHOSTUNREACH;
+		break;
+	case ICMP_SOURCE_QUENCH:
+		/* This is not a real error but ping wants to see it.
+		 * Report it with some fake errno. */
+		err = EREMOTEIO;
+		break;
+	case ICMP_PARAMETERPROB:
+		err = EPROTO;
+		harderr = 1;
+		break;
+	case ICMP_DEST_UNREACH:
+		if (code == ICMP_FRAG_NEEDED) { /* Path MTU discovery */
+			if (inet_sock->pmtudisc != IP_PMTUDISC_DONT) {
+				err = EMSGSIZE;
+				harderr = 1;
+				break;
+			}
+			goto out;
+		}
+		err = EHOSTUNREACH;
+		if (code <= NR_ICMP_UNREACH) {
+			harderr = icmp_err_convert[code].fatal;
+			err = icmp_err_convert[code].errno;
+		}
+		break;
+	case ICMP_REDIRECT:
+		/* See ICMP_SOURCE_QUENCH */
+		err = EREMOTEIO;
+		break;
+	}
+
+	/*
+	 *      RFC1122: OK.  Passes ICMP errors back to application, as per
+	 *	4.1.3.3.
+	 */
+	if (!inet_sock->recverr) {
+		if (!harderr || sk->sk_state != TCP_ESTABLISHED)
+			goto out;
+	} else {
+		ip_icmp_error(sk, skb, err, 0 /* no remote port */,
+			 info, (u8 *)icmph);
+	}
+	sk->sk_err = err;
+	sk->sk_error_report(sk);
+out:
+	sock_put(sk);
+}
+
+/*
+ *	Copy and checksum an ICMP Echo packet from user space into a buffer.
+ */
+
+struct pingfakehdr {
+	struct icmphdr icmph;
+	struct iovec *iov;
+	u32 wcheck;
+};
+
+static int ping_getfrag(void *from, char * to,
+			int offset, int fraglen, int odd, struct sk_buff *skb)
+{
+	struct pingfakehdr *pfh = (struct pingfakehdr *)from;
+
+	if (offset == 0) {
+		if (fraglen < sizeof(struct icmphdr))
+			BUG();
+		if (csum_partial_copy_fromiovecend(to + sizeof(struct icmphdr),
+			    pfh->iov, 0, fraglen - sizeof(struct icmphdr),
+			    &pfh->wcheck))
+			return -EFAULT;
+
+		pfh->wcheck = csum_partial((char *)&pfh->icmph,
+			sizeof(struct icmphdr), pfh->wcheck);
+		pfh->icmph.checksum = csum_fold(pfh->wcheck);
+		memcpy(to, &pfh->icmph, sizeof(struct icmphdr));
+		return 0;
+	}
+	if (offset < sizeof(struct icmphdr))
+		BUG();
+	if (csum_partial_copy_fromiovecend
+			(to, pfh->iov, offset - sizeof(struct icmphdr),
+			 fraglen, &pfh->wcheck))
+		return -EFAULT;
+	return 0;
+}
+
+int ping_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+		 size_t len)
+{
+	struct inet_sock *isk = inet_sk(sk);
+	struct ipcm_cookie ipc;
+	struct icmphdr user_icmph;
+	struct pingfakehdr pfh;
+	struct rtable *rt = NULL;
+	int free = 0;
+	u32 saddr, daddr;
+	u8  tos;
+	int err;
+
+	ping_debug("ping_sendmsg(sk=%p,sk->num=%u)\n", isk, isk->inet_num);
+
+
+	if (len > 0xFFFF)
+		return -EMSGSIZE;
+
+	/*
+	 *	Check the flags.
+	 */
+
+	/* Mirror BSD error message compatibility */
+	if (msg->msg_flags & MSG_OOB)
+		return -EOPNOTSUPP;
+
+	/*
+	 *	Fetch the ICMP header provided by the userland.
+	 *	iovec is modified!
+	 */
+
+	if (memcpy_fromiovec((u8 *)&user_icmph, msg->msg_iov,
+			     sizeof(struct icmphdr)))
+		return -EFAULT;
+	if (!ping_supported(user_icmph.type, user_icmph.code))
+		return -EINVAL;
+
+	/*
+	 *	Get and verify the address.
+	 */
+
+	if (msg->msg_name) {
+		struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name;
+		if (msg->msg_namelen < sizeof(*usin))
+			return -EINVAL;
+		if (usin->sin_family != AF_INET)
+			return -EINVAL;
+		daddr = usin->sin_addr.s_addr;
+		/* no remote port */
+	} else {
+		if (sk->sk_state != TCP_ESTABLISHED)
+			return -EDESTADDRREQ;
+		daddr = isk->inet_daddr;
+		/* no remote port */
+	}
+
+	ipc.addr = isk->inet_saddr;
+	ipc.opt = NULL;
+	ipc.oif = sk->sk_bound_dev_if;
+
+	if (msg->msg_controllen) {
+		err = ip_cmsg_send(sock_net(sk), msg, &ipc);
+		if (err)
+			return err;
+		if (ipc.opt)
+			free = 1;
+	}
+	if (!ipc.opt)
+		ipc.opt = isk->opt;
+
+	saddr = ipc.addr;
+	ipc.addr = daddr;
+
+	if (ipc.opt && ipc.opt->srr) {
+		if (!daddr)
+			return -EINVAL;
+		daddr = ipc.opt->faddr;
+	}
+	tos = RT_TOS(isk->tos);
+	if (sock_flag(sk, SOCK_LOCALROUTE) ||
+	    (msg->msg_flags&MSG_DONTROUTE) ||
+	    (ipc.opt && ipc.opt->is_strictroute)) {
+		tos |= RTO_ONLINK;
+	}
+
+	if (ipv4_is_multicast(daddr)) {
+		if (!ipc.oif)
+			ipc.oif = isk->mc_index;
+		if (!saddr)
+			saddr = isk->mc_addr;
+	}
+
+	{
+		struct flowi fl = { .oif = ipc.oif,
+				    .mark = sk->sk_mark,
+				    .nl_u = { .ip4_u = {
+						.daddr = daddr,
+						.saddr = saddr,
+						.tos = tos } },
+				    .proto = IPPROTO_ICMP,
+				    .flags = inet_sk_flowi_flags(sk),
+		};
+
+		struct net *net = sock_net(sk);
+
+		security_sk_classify_flow(sk, &fl);
+		err = ip_route_output_flow(net, &rt, &fl, sk, 1);
+		if (err) {
+			if (err == -ENETUNREACH)
+				IP_INC_STATS_BH(net, IPSTATS_MIB_OUTNOROUTES);
+			goto out;
+		}
+
+		err = -EACCES;
+		if ((rt->rt_flags & RTCF_BROADCAST) &&
+		    !sock_flag(sk, SOCK_BROADCAST))
+			goto out;
+	}
+
+	if (msg->msg_flags & MSG_CONFIRM)
+		goto do_confirm;
+back_from_confirm:
+
+	if (!ipc.addr)
+		ipc.addr = rt->rt_dst;
+
+	lock_sock(sk);
+
+	pfh.icmph.type = user_icmph.type; /* already checked */
+	pfh.icmph.code = user_icmph.code; /* ditto */
+	pfh.icmph.checksum = 0;
+	pfh.icmph.un.echo.id = isk->inet_sport;
+	pfh.icmph.un.echo.sequence = user_icmph.un.echo.sequence;
+	pfh.iov = msg->msg_iov;
+	pfh.wcheck = 0;
+
+	err = ip_append_data(sk, ping_getfrag, &pfh, len,
+			0, &ipc, &rt,
+			msg->msg_flags);
+	if (err)
+		ip_flush_pending_frames(sk);
+	else
+		err = ip_push_pending_frames(sk);
+	release_sock(sk);
+
+out:
+	ip_rt_put(rt);
+	if (free)
+		kfree(ipc.opt);
+	if (!err) {
+		icmp_out_count(sock_net(sk), user_icmph.type);
+		return len;
+	}
+	return err;
+
+do_confirm:
+	dst_confirm(&rt->dst);
+	if (!(msg->msg_flags & MSG_PROBE) || len)
+		goto back_from_confirm;
+	err = 0;
+	goto out;
+}
+
+/*
+ *	IOCTL requests applicable to the UDP^H^H^HICMP protocol
+ */
+
+int ping_ioctl(struct sock *sk, int cmd, unsigned long arg)
+{
+	ping_debug("ping_ioctl(sk=%p,sk->num=%u,cmd=%d,arg=%lu)\n",
+		inet_sk(sk), inet_sk(sk)->inet_num, cmd, arg);
+	switch (cmd) {
+	case SIOCOUTQ:
+	case SIOCINQ:
+		return udp_ioctl(sk, cmd, arg);
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
+int ping_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+		 size_t len, int noblock, int flags, int *addr_len)
+{
+	struct inet_sock *isk = inet_sk(sk);
+	struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;
+	struct sk_buff *skb;
+	int copied, err;
+
+	ping_debug("ping_recvmsg(sk=%p,sk->num=%u)\n", isk, isk->inet_num);
+
+	/* err must be set before the first "goto out", which returns it */
+	err = -EOPNOTSUPP;
+	if (flags & MSG_OOB)
+		goto out;
+
+	if (addr_len)
+		*addr_len = sizeof(*sin);
+
+	if (flags & MSG_ERRQUEUE)
+		return ip_recv_error(sk, msg, len);
+
+	skb = skb_recv_datagram(sk, flags, noblock, &err);
+	if (!skb)
+		goto out;
+
+	copied = skb->len;
+	if (copied > len) {
+		msg->msg_flags |= MSG_TRUNC;
+		copied = len;
+	}
+
+	/* Don't bother checking the checksum */
+	err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+	if (err)
+		goto done;
+
+	sock_recv_timestamp(msg, sk, skb);
+
+	/* Copy the address. */
+	if (sin) {
+		sin->sin_family = AF_INET;
+		sin->sin_port = 0 /* skb->h.uh->source */;
+		sin->sin_addr.s_addr = ip_hdr(skb)->saddr;
+		memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
+	}
+	if (isk->cmsg_flags)
+		ip_cmsg_recv(msg, skb);
+	err = copied;
+
+done:
+	skb_free_datagram(sk, skb);
+out:
+	ping_debug("ping_recvmsg -> %d\n", err);
+	return err;
+}
+
+static int ping_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+{
+	ping_debug("ping_queue_rcv_skb(sk=%p,sk->num=%d,skb=%p)\n",
+		inet_sk(sk), inet_sk(sk)->inet_num, skb);
+	if (sock_queue_rcv_skb(sk, skb) < 0) {
+		ICMP_INC_STATS_BH(sock_net(sk), ICMP_MIB_INERRORS);
+		kfree_skb(skb);
+		ping_debug("ping_queue_rcv_skb -> failed\n");
+		return -1;
+	}
+	return 0;
+}
+
+
+/*
+ *	All we need to do is get the socket.
+ */
+
+void ping_rcv(struct sk_buff *skb)
+{
+	struct sock *sk;
+	struct net *net = dev_net(skb->dev);
+	struct iphdr *iph = ip_hdr(skb);
+	struct icmphdr *icmph = icmp_hdr(skb);
+	u32 saddr = iph->saddr;
+	u32 daddr = iph->daddr;
+
+	/* We assume the packet has already been checked by icmp_rcv */
+
+	ping_debug("ping_rcv(skb=%p,id=%04x,seq=%04x)\n",
+		skb, ntohs(icmph->un.echo.id), ntohs(icmph->un.echo.sequence));
+
+	/* Push ICMP header back */
+	skb_push(skb, skb->data - (u8 *)icmph);
+
+	sk = ping_v4_lookup(net, saddr, daddr, ntohs(icmph->un.echo.id),
+			    skb->dev->ifindex);
+	if (sk != NULL) {
+		ping_debug("rcv on socket %p\n", sk);
+		ping_queue_rcv_skb(sk, skb_get(skb));
+		sock_put(sk);
+		return;
+	}
+	ping_debug("no socket, dropping\n");
+
+	/* We're called from icmp_rcv(). kfree_skb() is done there. */
+}
+
+struct proto ping_prot = {
+	.name =		"PING",
+	.owner =	THIS_MODULE,
+	.close =	ping_close,
+	.connect =	ip4_datagram_connect,
+	.disconnect =	udp_disconnect,
+	.ioctl =	ping_ioctl,
+	.setsockopt =	ip_setsockopt,
+	.getsockopt =	ip_getsockopt,
+	.sendmsg =	ping_sendmsg,
+	.recvmsg =	ping_recvmsg,
+	.bind =		ping_bind,
+	.backlog_rcv =	ping_queue_rcv_skb,
+	.hash =		ping_v4_hash,
+	.unhash =	ping_v4_unhash,
+	.get_port =	ping_v4_get_port,
+	.obj_size =	sizeof(struct inet_sock),
+};
+EXPORT_SYMBOL(ping_prot);
+
+#ifdef CONFIG_PROC_FS
+
+static struct sock *ping_get_first(struct seq_file *seq, int start)
+{
+	struct sock *sk;
+	struct ping_iter_state *state = seq->private;
+	struct net *net = seq_file_net(seq);
+
+	for (state->bucket = start; state->bucket < PING_HTABLE_SIZE;
+	     ++state->bucket) {
+		struct hlist_nulls_node *node;
+		struct ping_hslot *hslot = &ping_table.hash[state->bucket];
+
+		if (hlist_nulls_empty(&hslot->head))
+			continue;
+
+		sk_nulls_for_each(sk, node, &hslot->head) {
+			if (net_eq(sock_net(sk), net))
+				goto found;
+		}
+	}
+	sk = NULL;
+found:
+	return sk;
+}
+
+static struct sock *ping_get_next(struct seq_file *seq, struct sock *sk)
+{
+	struct ping_iter_state *state = seq->private;
+	struct net *net = seq_file_net(seq);
+
+	do {
+		sk = sk_nulls_next(sk);
+	} while (sk && (!net_eq(sock_net(sk), net)));
+
+	if (!sk)
+		return ping_get_first(seq, state->bucket + 1);
+	return sk;
+}
+
+static struct sock *ping_get_idx(struct seq_file *seq, loff_t pos)
+{
+	struct sock *sk = ping_get_first(seq, 0);
+
+	if (sk)
+		while (pos && (sk = ping_get_next(seq, sk)) != NULL)
+			--pos;
+	return pos ? NULL : sk;
+}
+
+static void *ping_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct ping_iter_state *state = seq->private;
+	state->bucket = 0;
+
+	read_lock_bh(&ping_table.lock);
+
+	return *pos ? ping_get_idx(seq, *pos-1) : SEQ_START_TOKEN;
+}
+
+static void *ping_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct sock *sk;
+
+	if (v == SEQ_START_TOKEN)
+		sk = ping_get_idx(seq, 0);
+	else
+		sk = ping_get_next(seq, v);
+
+	++*pos;
+	return sk;
+}
+
+static void ping_seq_stop(struct seq_file *seq, void *v)
+{
+	read_unlock_bh(&ping_table.lock);
+}
+
+static void ping_format_sock(struct sock *sp, struct seq_file *f,
+		int bucket, int *len)
+{
+	struct inet_sock *inet = inet_sk(sp);
+	__be32 dest = inet->inet_daddr;
+	__be32 src = inet->inet_rcv_saddr;
+	__u16 destp = ntohs(inet->inet_dport);
+	__u16 srcp = ntohs(inet->inet_sport);
+
+	seq_printf(f, "%5d: %08X:%04X %08X:%04X"
+		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d%n",
+		bucket, src, srcp, dest, destp, sp->sk_state,
+		sk_wmem_alloc_get(sp),
+		sk_rmem_alloc_get(sp),
+		0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
+		atomic_read(&sp->sk_refcnt), sp,
+		atomic_read(&sp->sk_drops), len);
+}
+
+static int ping_seq_show(struct seq_file *seq, void *v)
+{
+	if (v == SEQ_START_TOKEN)
+		seq_printf(seq, "%-127s\n",
+			   "  sl  local_address rem_address   st tx_queue "
+			   "rx_queue tr tm->when retrnsmt   uid  timeout "
+			   "inode ref pointer drops");
+	else {
+		struct ping_iter_state *state = seq->private;
+		int len;
+
+		ping_format_sock(v, seq, state->bucket, &len);
+		seq_printf(seq, "%*s\n", 127 - len, "");
+	}
+	return 0;
+}
+
+static struct seq_operations ping_seq_ops = {
+	.show		= ping_seq_show,
+	.start		= ping_seq_start,
+	.next		= ping_seq_next,
+	.stop		= ping_seq_stop,
+};
+
+static int ping_seq_open(struct inode *inode, struct file *file)
+{
+	return seq_open_net(inode, file, &ping_seq_ops,
+			   sizeof(struct ping_iter_state));
+}
+
+static struct file_operations ping_seq_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ping_seq_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release_net,
+};
+
+static const char ping_proc_name[] = "icmp";
+
+static int ping_proc_register(struct net *net)
+{
+	struct proc_dir_entry *p;
+	int rc = 0;
+
+	p = proc_create_data(ping_proc_name, S_IRUGO, net->proc_net,
+			     &ping_seq_fops, NULL);
+	if (!p)
+		rc = -ENOMEM;
+	return rc;
+}
+
+static void ping_proc_unregister(struct net *net)
+{
+	proc_net_remove(net, ping_proc_name);
+}
+
+
+static int __net_init ping_proc_init_net(struct net *net)
+{
+	return ping_proc_register(net);
+}
+
+static void __net_exit ping_proc_exit_net(struct net *net)
+{
+	ping_proc_unregister(net);
+}
+
+static struct pernet_operations ping_net_ops = {
+	.init = ping_proc_init_net,
+	.exit = ping_proc_exit_net,
+};
+
+int __init ping_proc_init(void)
+{
+	return register_pernet_subsys(&ping_net_ops);
+}
+
+void ping_proc_exit(void)
+{
+	unregister_pernet_subsys(&ping_net_ops);
+}
+
+#endif
+
+void __init ping_init(void)
+{
+	int i;
+
+	for (i = 0; i < PING_HTABLE_SIZE; i++)
+		INIT_HLIST_NULLS_HEAD(&ping_table.hash[i].head, i);
+	rwlock_init(&ping_table.lock);
+}
-- 
Vasiliy
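
For readers following ping_getfrag() in the patch above: it accumulates a partial checksum over the user payload, folds in the ICMP header, and stores the folded result in the header's checksum field. Below is a minimal userspace sketch of the same one's-complement Internet checksum (RFC 1071). It is not the kernel's csum_partial()/csum_fold() implementation, and `inet_checksum` is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace helper (not part of the patch): RFC 1071
 * one's-complement checksum over a byte buffer, treating the data as
 * big-endian 16-bit words. Intended for buffers up to 64 KiB. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(data != NULL || len == 0);

	/* Sum all complete 16-bit words. */
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];

	/* A trailing odd byte is padded with a zero low byte. */
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

A receiver can verify a packet by running the same function over the header with the checksum field filled in; a valid packet yields zero.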


* Re: Using ethernet device as efficient small packet generator
From: Stephen Hemminger @ 2010-12-21 18:22 UTC (permalink / raw)
  To: juice; +Cc: netdev
In-Reply-To: <3dbea0c095731ef843058388c29df8c1.squirrel@www.liukuma.net>

On Tue, 21 Dec 2010 11:56:42 +0200
"juice" <juice@swagman.org> wrote:

> 
> Hi net-devvers.
> 
> I am involved in telecom equipment R&D, and I need to do some network
> performance benchmarking. We need to generate streams of Ethernet/IP/UDP
> traffic that consists of different sized payloads ranging from smallest
> AMR payload to ethernet MTU.
> 
> We have various tools including for example Spirent traffic generators
> as well as in-house made software generating 3GPP specified protocol
> streams. Now, the problem with the off-the-shelf generators is that they
> are too inflexible for our needs and are not available to R&D personnel
> at any given time.
> 
> For larger packet sizes our Linux-based generator is quite sufficient,
> as I can use it to fully saturate a GE link with packet sizes around 1kB.
> However, as packet sizes get smaller, Ethernet performance suffers.
> 
> I did some benchmarking using pktgen with 64B packets against an AX4000 and
> confirmed that the maximum throughput is only around 25% of GE capacity.
> I managed to get about the same speeds using my own custom module that writes
> skbuffs directly to the kernel *xmit of the netdev.
> 
> Now, it is evident that something is not optimized to the maximum here
> as PCI bus allows for way higher transfer speeds. If large packets can
> fully saturate the ethernet link same should apply for minimum sized
> packets too, unless there is some overhead I am unaware of.
> 
> I have a couple of questions here:
> 
> 1.) Is it possible to enhance the "normal" behaving network driver so
>     that the device would still work as an ethernet device (ethxx)?
> 
>     Currently the test stream is generated in userland process that
>     writes to RAW_SOCK, but it is OK for me if I need to write the
>     packet generating part as a kernel module that is configured
>     from the userland part to send the prepared stream out.
> 
> 2.) If it is not possible to get the needed performance from normal
>     network architecture, is it possible to make a "generate only"
>     ethernet device that I can use to replace the network card driver?
> 
>     For example, RX is not really needed at all by my application, so
>     just optimizing the driver to send out packets from memory as fast
>     as possible is enough.
> 
>     Are there notable differences between Ethernet chipsets/cards
>     regarding the raw output speed they are capable of?
>     I have benchmarked e1000, r8169 and tg3 based cards, and with all
>     of those I get about the same throughput of 64-byte Ethernet frames.
> 
>     For my purpose, it would be OK, for example, to remove the normal
>     r8169 driver and replace it with a custom TX-only driver, and use
>     some other normal driver tied to another card to access the box.
> 
> I appreciate your comments and any pointers to existing projects that
> have an implementation similar to what I require.
> 
> Yours, Jussi Ohenoja

I regularly get full 1G line rate with 64-byte packets using an old Opteron
box and pktgen. It does require some tuning of IRQs and interrupt mitigation,
but no patches. Did you remember to do the basic stuff, like setting IRQ
affinity and not enabling debugging or tracing in the kernel? This is on sky2,
but also with e1000 and tg3. Others have reported 7M packets per second over
10G cards. The r8169 is low-end consumer hardware and doesn't work as well.

It is possible to get close to 1G line rate forwarding with a single core on
current-generation processors. The actual rate depends on hardware and
configuration (size of the route table, firewalling, etc.). Performance is
much better with multi-queue hardware, which spreads the load over multiple
cores.
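
For reference, the "line rate" figure discussed here can be checked with a little arithmetic. On the wire, every Ethernet frame also costs 8 bytes of preamble and start-of-frame delimiter plus a 12-byte minimum inter-frame gap, so a 64-byte frame occupies 84 byte times. A rough sketch, with `max_frames_per_sec` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame wire overhead on Ethernet: 7-byte preamble + 1-byte SFD,
 * plus the 12-byte minimum inter-frame gap. */
#define ETH_WIRE_OVERHEAD 20

/* Hypothetical helper: upper bound on frames per second for a link
 * speed in bits/s and a frame size in bytes (including the FCS). */
static uint64_t max_frames_per_sec(uint64_t link_bps, unsigned int frame_bytes)
{
	uint64_t bits_per_frame;

	assert(frame_bytes > 0);
	bits_per_frame = ((uint64_t)frame_bytes + ETH_WIRE_OVERHEAD) * 8;

	return link_bps / bits_per_frame;
}
```

With 64-byte frames on a 1G link this gives 1,488,095 frames per second, so the 25% figure reported in the original mail corresponds to roughly 372,000 frames per second.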



* Re: [PATCH 1/1] TCP: increase default initial receive window.
From: John Heffner @ 2010-12-21 18:27 UTC (permalink / raw)
  To: Nandita Dukkipati
  Cc: David S. Miller, netdev, Tom Herbert, Laurent Chavey,
	Yuchung Cheng
In-Reply-To: <1292642451-892-1-git-send-email-nanditad@google.com>

I know this has already been applied, but one thing to think about:
Linux announces a small initial window to prevent overflowing the
receive buffer when receiving segments smaller than the link MTU.
Increasing this even to 10 segments might have some negative
consequences.  I recall, for instance, some drivers when configured
with a 9000 byte MTU, have a single pool of receive buffers all 16k
(the next highest power of 2).  So each received segment will get 16k
of allocated memory accounted to it, even if the incoming segments are
<=1460 bytes long.  The default initial rcvbuf of 87380 bytes is less
than the 160k of memory that the initial window might consume, so
we're going to start hitting the very slow path of coalescing segments
to get back under memory bounds.
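
The arithmetic behind this concern can be sketched as follows. The figures (16k charged per segment, a 10-segment initial window, an 87380-byte default rcvbuf) are the ones quoted above; the helper name is hypothetical, and the sketch ignores any per-skb bookkeeping overhead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: would an initial window of `segments` packets,
 * each charged `buf_bytes` of receive memory, exceed `rcvbuf` bytes? */
static bool initial_window_overflows(size_t segments, size_t buf_bytes,
				     size_t rcvbuf)
{
	assert(rcvbuf > 0);
	return segments * buf_bytes > rcvbuf;
}
```

With 16384-byte buffers, ten segments are charged 163840 bytes against an 87380-byte rcvbuf; only five such buffers (81920 bytes) fit before the limit is crossed.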

Some drivers are smarter about having multiple pools of receive
buffers with different sizes, so it might not be so easy to hit this
condition.  I haven't looked at any of them for a while.  Is this
still a real concern?

Thanks,
  -John


On Fri, Dec 17, 2010 at 10:20 PM, Nandita Dukkipati <nanditad@google.com> wrote:
> This patch changes the default initial receive window to 10 mss
> (defined constant). The default window is limited to the maximum
> of 10*1460 and 2*mss (when mss > 1460).
>
> Signed-off-by: Nandita Dukkipati <nanditad@google.com>
> ---
>  include/net/tcp.h     |    3 +++
>  net/ipv4/tcp_output.c |   11 ++++++++---
>  2 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 2ab6c9c..6c25ba8 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -60,6 +60,9 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
>  */
>  #define MAX_TCP_WINDOW         32767U
>
> +/* Offer an initial receive window of 10 mss. */
> +#define TCP_DEFAULT_INIT_RCVWND        10
> +
>  /* Minimal accepted MSS. It is (60+60+8) - (20+20). */
>  #define TCP_MIN_MSS            88U
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 2d39066..dc7c096 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -228,10 +228,15 @@ void tcp_select_initial_window(int __space, __u32 mss,
>                }
>        }
>
> -       /* Set initial window to value enough for senders, following RFC5681. */
> +       /* Set initial window to a value enough for senders starting with
> +        * initial congestion window of TCP_DEFAULT_INIT_RCVWND. Place
> +        * a limit on the initial window when mss is larger than 1460.
> +        */
>        if (mss > (1 << *rcv_wscale)) {
> -               int init_cwnd = rfc3390_bytes_to_packets(mss);
> -
> +               int init_cwnd = TCP_DEFAULT_INIT_RCVWND;
> +               if (mss > 1460)
> +                       init_cwnd =
> +                       max_t(u32, (1460 * TCP_DEFAULT_INIT_RCVWND) / mss, 2);
>                /* when initializing use the value from init_rcv_wnd
>                 * rather than the default from above
>                 */
> --
> 1.7.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
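
The segment-count selection in the quoted hunk can be restated as a standalone function. This is a sketch covering only the lines visible in the diff: `init_rcvwnd_segments` is a hypothetical name, and the surrounding `rcv_wscale` guard and the later use of the value are left out.

```c
#include <assert.h>
#include <stdint.h>

#define TCP_DEFAULT_INIT_RCVWND 10

/* Hypothetical restatement of the hunk: how many segments the initial
 * receive window is sized for, given the mss. */
static uint32_t init_rcvwnd_segments(uint32_t mss)
{
	uint32_t segs = TCP_DEFAULT_INIT_RCVWND;

	assert(mss > 0);

	/* For large mss, cap the window near 10 * 1460 bytes,
	 * but never go below two segments. */
	if (mss > 1460) {
		segs = (1460 * TCP_DEFAULT_INIT_RCVWND) / mss;
		if (segs < 2)
			segs = 2;
	}
	return segs;
}
```

For an mss of 1460 or less the result is 10 segments; for an mss of 8960 the division yields 1 and the floor of 2 applies.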


* Re: [RFC] ipv4: add ICMP socket kind
From: Colin Walters @ 2010-12-21 18:46 UTC (permalink / raw)
  To: Vasiliy Kulikov; +Cc: netdev, linux-kernel, Pavel Kankovsky, Solar Designer
In-Reply-To: <20101221181800.GA8166@albatros>

On Tue, Dec 21, 2010 at 1:18 PM, Vasiliy Kulikov <segooon@gmail.com> wrote:
> Hi,
>
> This patch adds IPPROTO_ICMP socket kind.  It makes it possible to send
> ICMP_ECHO messages and receive corresponding ICMP_ECHOREPLY messages
> without any special privileges.  In other words, the patch makes it
> possible to implement setuid-less /bin/ping.
>
> A new ping socket is created with
>
>    socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP)

And the default is to allow any uid to do this (modulo LSM)?

If you really have a burning desire to get rid of setuid /bin/ping,
why not just do it in userspace via message passing to/from a
privileged process, and avoid a lot of code in the kernel?  It's much
more flexible.  You could, for example, limit it to once a second by
default, allow only one process doing this per uid, etc.
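
For context, the socket() call quoted above would be used from an unprivileged client roughly as follows. This is a sketch that assumes the semantics of the RFC patch: ping_sendmsg() overwrites the echo id with the socket's local port and computes the checksum, so userspace fills in only type, code and sequence. `fill_echo_request` and `ping_once` are hypothetical names, and error handling is kept minimal.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <netinet/ip_icmp.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Fill an ICMP echo request header. The id and checksum stay zero
 * because, per the patch, the kernel sets both on transmit. */
static void fill_echo_request(struct icmphdr *hdr, unsigned short seq)
{
	assert(hdr != NULL);
	memset(hdr, 0, sizeof(*hdr));
	hdr->type = ICMP_ECHO;
	hdr->code = 0;
	hdr->un.echo.sequence = htons(seq);
}

/* Send one echo request to dst (dotted quad) and wait up to one second
 * for a reply. Returns 0 on an echo reply, -1 on any failure, including
 * a kernel or configuration that does not permit this socket kind. */
static int ping_once(const char *dst, unsigned short seq)
{
	struct sockaddr_in addr;
	struct icmphdr req, rep;
	struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
	int fd, ret = -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	if (inet_pton(AF_INET, dst, &addr.sin_addr) != 1)
		return -1;

	fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_ICMP);
	if (fd < 0)
		return -1;
	setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

	fill_echo_request(&req, seq);
	if (sendto(fd, &req, sizeof(req), 0,
		   (struct sockaddr *)&addr, sizeof(addr)) == (ssize_t)sizeof(req) &&
	    recv(fd, &rep, sizeof(rep), 0) >= (ssize_t)sizeof(rep) &&
	    rep.type == ICMP_ECHOREPLY)
		ret = 0;

	close(fd);
	return ret;
}
```

A caller would invoke something like `ping_once("192.0.2.1", 1)`; whether the socket() call succeeds for an ordinary uid is exactly the policy question raised above.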


