Netdev List
 help / color / mirror / Atom feed
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 @ 2007-08-09  3:19 UTC (permalink / raw)
  To: Herbert Xu
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, mcarlson,
	peter.p.waskiewicz.jr, hadi, kaber, netdev, general, mchan, tgraf,
	sri, shemminger, David Miller, rdreier
In-Reply-To: <20070808134247.GA9942@gondor.apana.org.au>

Herbert Xu <herbert@gondor.apana.org.au> wrote on 08/08/2007 07:12:47 PM:

> On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> >
> > Not because I think it obviates your work, but rather because I'm
> > curious, could you test a TSO-in-hardware driver converted to
> > batching and see how TSO alone compares to batching for a pure
> > TCP workload?
>
> You could even lower the bar by disabling TSO and enabling
> software GSO.

I will try with E1000 (though I didn't see improvement when I tested a long
time back). The difference I expect is that TSO would help with large
packets and not necessarily small/medium packets and not definitely in
the case of multiple different skbs (as opposed to single large skb)
getting
queue'd. I think these are two different workloads.

> > I personally don't think it will help for that case at all as
> > TSO likely does better job of coalescing the work _and_ reducing
> > bus traffic as well as work in the TCP stack.
>
> I agree.  I suspect the bulk of the effort is in getting
> these skb's created and processed by the stack so that by
> the time that they're exiting the qdisc there's not much
> to be saved anymore.

However, I am getting a large improvement for IPoIB specifically for this
same case. The reason - batching will help only when queue gets full and
stopped (and to a lesser extent if tx lock was not got, which results
in fewer amount of batching that can be done).

thanks,

- KK

^ permalink raw reply

* [NET]: Share correct feature code between bridging and bonding
From: Herbert Xu @ 2007-08-09  3:36 UTC (permalink / raw)
  To: David S. Miller, netdev

Hi Dave:

[NET]: Share correct feature code between bridging and bonding

http://bugzilla.kernel.org/show_bug.cgi?id=8797 shows that the
bonding driver may produce bogus combinations of the checksum
flags and SG/TSO.

For example, if you bond devices with NETIF_F_HW_CSUM and
NETIF_F_IP_CSUM you'll end up with a bonding device that
has neither flag set.  If both have TSO then this produces
an illegal combination.

The bridge device on the other hand has the correct code to
deal with this.

In fact, the same code can be used for both.  So this patch
moves that logic into net/core/dev.c and uses it for both
bonding and bridging.

In the process I've made small adjustments such as only
setting GSO_ROBUST if at least one constituent device
supports it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 070b78d..1afda32 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1202,43 +1202,35 @@ static int bond_sethwaddr(struct net_device *bond_dev,
 	return 0;
 }
 
-#define BOND_INTERSECT_FEATURES \
-	(NETIF_F_SG | NETIF_F_ALL_CSUM | NETIF_F_TSO | NETIF_F_UFO)
+#define BOND_VLAN_FEATURES \
+	(NETIF_F_VLAN_CHALLENGED | NETIF_F_HW_VLAN_RX | NETIF_F_HW_VLAN_TX | \
+	 NETIF_F_HW_VLAN_FILTER)
 
 /* 
  * Compute the common dev->feature set available to all slaves.  Some
- * feature bits are managed elsewhere, so preserve feature bits set on
- * master device that are not part of the examined set.
+ * feature bits are managed elsewhere, so preserve those feature bits
+ * on the master device.
  */
 static int bond_compute_features(struct bonding *bond)
 {
-	unsigned long features = BOND_INTERSECT_FEATURES;
 	struct slave *slave;
 	struct net_device *bond_dev = bond->dev;
+	unsigned long features = bond_dev->features;
 	unsigned short max_hard_header_len = ETH_HLEN;
 	int i;
 
+	features &= ~(NETIF_F_ALL_CSUM | BOND_VLAN_FEATURES);
+	features |= NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HIGHDMA |
+		    NETIF_F_GSO_MASK | NETIF_F_NO_CSUM;
+
 	bond_for_each_slave(bond, slave, i) {
-		features &= (slave->dev->features & BOND_INTERSECT_FEATURES);
+		features = netdev_compute_features(features,
+						   slave->dev->features);
 		if (slave->dev->hard_header_len > max_hard_header_len)
 			max_hard_header_len = slave->dev->hard_header_len;
 	}
 
-	if ((features & NETIF_F_SG) && 
-	    !(features & NETIF_F_ALL_CSUM))
-		features &= ~NETIF_F_SG;
-
-	/* 
-	 * features will include NETIF_F_TSO (NETIF_F_UFO) iff all 
-	 * slave devices support NETIF_F_TSO (NETIF_F_UFO), which 
-	 * implies that all slaves also support scatter-gather 
-	 * (NETIF_F_SG), which implies that features also includes 
-	 * NETIF_F_SG. So no need to check whether we have an  
-	 * illegal combination of NETIF_F_{TSO,UFO} and 
-	 * !NETIF_F_SG 
-	 */
-
-	features |= (bond_dev->features & ~BOND_INTERSECT_FEATURES);
+	features |= (bond_dev->features & BOND_VLAN_FEATURES);
 	bond_dev->features = features;
 	bond_dev->hard_header_len = max_hard_header_len;
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4a616d7..e679b27 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1131,6 +1131,8 @@ extern void dev_seq_stop(struct seq_file *seq, void *v);
 
 extern void linkwatch_run_queue(void);
 
+extern int netdev_compute_features(unsigned long all, unsigned long one);
+
 static inline int net_gso_ok(int features, int gso_type)
 {
 	int feature = gso_type << NETIF_F_GSO_SHIFT;
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 5e1892d..0eded17 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -179,5 +179,5 @@ void br_dev_setup(struct net_device *dev)
 	dev->priv_flags = IFF_EBRIDGE;
 
 	dev->features = NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HIGHDMA |
-			NETIF_F_TSO | NETIF_F_NO_CSUM | NETIF_F_GSO_ROBUST;
+			NETIF_F_GSO_MASK | NETIF_F_NO_CSUM | NETIF_F_LLTX;
 }
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index b40dada..749f0e8 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -349,43 +349,15 @@ int br_min_mtu(const struct net_bridge *br)
 void br_features_recompute(struct net_bridge *br)
 {
 	struct net_bridge_port *p;
-	unsigned long features, checksum;
+	unsigned long features;
 
-	checksum = br->feature_mask & NETIF_F_ALL_CSUM ? NETIF_F_NO_CSUM : 0;
-	features = br->feature_mask & ~NETIF_F_ALL_CSUM;
+	features = br->feature_mask;
 
 	list_for_each_entry(p, &br->port_list, list) {
-		unsigned long feature = p->dev->features;
-
-		/* if device needs checksumming, downgrade to hw checksumming */
-		if (checksum & NETIF_F_NO_CSUM && !(feature & NETIF_F_NO_CSUM))
-			checksum ^= NETIF_F_NO_CSUM | NETIF_F_HW_CSUM;
-
-		/* if device can't do all checksum, downgrade to ipv4/ipv6 */
-		if (checksum & NETIF_F_HW_CSUM && !(feature & NETIF_F_HW_CSUM))
-			checksum ^= NETIF_F_HW_CSUM
-				| NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
-
-		if (checksum & NETIF_F_IPV6_CSUM && !(feature & NETIF_F_IPV6_CSUM))
-			checksum &= ~NETIF_F_IPV6_CSUM;
-
-		if (!(feature & NETIF_F_IP_CSUM))
-			checksum = 0;
-
-		if (feature & NETIF_F_GSO)
-			feature |= NETIF_F_GSO_SOFTWARE;
-		feature |= NETIF_F_GSO;
-
-		features &= feature;
+		features = netdev_compute_features(features, p->dev->features);
 	}
 
-	if (!(checksum & NETIF_F_ALL_CSUM))
-		features &= ~NETIF_F_SG;
-	if (!(features & NETIF_F_SG))
-		features &= ~NETIF_F_GSO_MASK;
-
-	br->dev->features = features | checksum | NETIF_F_LLTX |
-			    NETIF_F_GSO_ROBUST;
+	br->dev->features = features;
 }
 
 /* called with RTNL */
diff --git a/net/core/dev.c b/net/core/dev.c
index 6cc8a70..a76021c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3993,6 +3993,45 @@ static int __init netdev_dma_register(void)
 static int __init netdev_dma_register(void) { return -ENODEV; }
 #endif /* CONFIG_NET_DMA */
 
+/**
+ *	netdev_compute_feature - compute conjunction of two feature sets
+ *	@all: first feature set
+ *	@one: second feature set
+ *
+ *	Computes a new feature set after adding a device with feature set
+ *	@one to the master device with current feature set @all.  Returns
+ *	the new feature set.
+ */
+int netdev_compute_features(unsigned long all, unsigned long one)
+{
+	/* if device needs checksumming, downgrade to hw checksumming */
+	if (all & NETIF_F_NO_CSUM && !(one & NETIF_F_NO_CSUM))
+		all ^= NETIF_F_NO_CSUM | NETIF_F_HW_CSUM;
+
+	/* if device can't do all checksum, downgrade to ipv4/ipv6 */
+	if (all & NETIF_F_HW_CSUM && !(one & NETIF_F_HW_CSUM))
+		all ^= NETIF_F_HW_CSUM
+			| NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
+
+	if (one & NETIF_F_GSO)
+		one |= NETIF_F_GSO_SOFTWARE;
+	one |= NETIF_F_GSO;
+
+	/* If even one device supports robust GSO, enable it for all. */
+	if (one & NETIF_F_GSO_ROBUST)
+		all |= NETIF_F_GSO_ROBUST;
+
+	all &= one | NETIF_F_LLTX;
+
+	if (!(all & NETIF_F_ALL_CSUM))
+		all &= ~NETIF_F_SG;
+	if (!(all & NETIF_F_SG))
+		all &= ~NETIF_F_GSO_MASK;
+
+	return all;
+}
+EXPORT_SYMBOL(netdev_compute_features);
+
 /*
  *	Initialize the DEV module. At boot time this walks the device list and
  *	unhooks any devices that fail to initialise (normally hardware not

^ permalink raw reply related

* Re: [PATCH] make atomic_t volatile on all architectures
From: Paul E. McKenney @ 2007-08-09  3:47 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, csnook, akpm, torvalds, ak, heiko.carstens, linux-kernel,
	netdev, schwidefsky, wensong, horms, wjiang, cfriesen, zlynx
In-Reply-To: <20070808.184824.133910636.davem@davemloft.net>

On Wed, Aug 08, 2007 at 06:48:24PM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Thu, 09 Aug 2007 09:03:27 +0800
> 
> > Such loops should always use something like cpu_relax() which comes
> > with a barrier.
> 
> This is an excellent point.
> 
> And it needs to be weighed with the error prone'ness Andrew mentioned.
> There probably is a middle ground somewhere.

OK...  I'll bite.  ACCESS_ONCE(), see http://lkml.org/lkml/2007/7/11/664.

This would allow ACCESS_ONCE(atomic_read(&x)) to be used where refetching
would be problematic, but allow the compiler free rein in cases where
refetching is OK.

The ACCESS_ONCE() primitive of course has its limitations as well, but
you did ask for a middle ground.  ;-)

							Thanx, Paul

^ permalink raw reply

* [ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
From: Krishna Kumar2 @ 2007-08-09  4:06 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, shemminger, davem, sri
In-Reply-To: <46B9CD9D.4020506@trash.net>

Hi Patrick,

Patrick McHardy <kaber@trash.net> wrote on 08/08/2007 07:35:17 PM:

> Krishna Kumar wrote:
<snip>
> > +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> > +           struct sk_buff_head *blist, struct sk_buff **skbp)
> > +{
> > +   if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <=
1))) {
> > +      return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> > +   } else {
> > +      int max = dev->tx_queue_len - skb_queue_len(blist);
> > +      struct sk_buff *skb;
> > +
> > +      while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> > +         max -= dev_add_skb_to_blist(skb, dev);
> > +
> > +      *skbp = NULL;
> > +      return 1;   /* we have atleast one skb in blist */
> > +   }
> > +}
>
> I think you missed my previous reply to this in the flood of
> responses (or I missed your reply), so I'm copying it below:

Sorry, but I didn't get your post on this point earlier (thanks for
posting again).

> The entire idea of a single queue after qdiscs that is refilled
> independantly of transmissions times etc. make be worry a bit.
> By changing the timing you're effectively changing the qdiscs
> behaviour, at least in some cases. SFQ is a good example, but I
> believe it affects most work-conserving qdiscs. Think of this
> situation:
>
> 100 packets of flow 1 arrive
> 50 packets of flow 1 are sent
> 100 packets for flow 2 arrive
> remaining packets are sent
>
> On the wire you'll first see 50 packets of flow 1, than 100 packets
> alternate of flow 1 and 2, then 50 packets flow 2.
>
> With your additional queue all packets of flow 1 are pulled out of
> the qdisc immediately and put in the fifo. When the 100 packets of
> the second flow arrive they will also get pulled out immediately
> and are put in the fifo behind the remaining 50 packets of flow 1.
> So what you get on the wire is:
>
> 100 packets of flow 1
> 100 packets of flow 1

In normal case (qdisc run from xmit and not from tx_action), the code
executing is the same as regular code without any difference in wire
behavior due to the check:
      if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)))
{
            return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
(I always pass blist as NULL).

With SFQ for the above scenario, my code will first send out 50 F1 skbs
iteratively (blist==NULL), get blocked so that 50 skbs accumulate on F1
and 100 on F2, then when it is woken up, it will batch but it will pick
up 50 F1 and 50 F2 skbs in alternate order and put in the queue, and
finally pick up the remaining 50 F2 skbs and put those too in the queue.
Since I am calling dev_dequeue_skb, I am assured of RR when using SFQ.
The only time I would have 100 skbs from F1 in my queue (and hence sent
out first) is if there are NO F2 skbs.

> So SFQ is without any effect. This is not completely avoidable of
> course, but you can and should limit the damage by only pulling
> out as much packets as the driver can take and have the driver
> stop the queue afterwards.

I feel this cannot happen. Please correct me if I am wrong.

Thanks,

- KK

^ permalink raw reply

* Re: [RFC NET_SCHED 00/02]: Flexible SFQ flow classification
From: Paul E. McKenney @ 2007-08-09  4:12 UTC (permalink / raw)
  To: jamal; +Cc: Patrick McHardy, netdev
In-Reply-To: <1180536987.4109.19.camel@localhost>

On Wed, May 30, 2007 at 10:56:27AM -0400, jamal wrote:
> 
> On Wed, 2007-30-05 at 11:40 +0200, Patrick McHardy wrote:
> > One good thing about ESFQ is the more flexible flow classification, but
> > I don't like the concept of having a set of selectable hash functions
> > very much.
> > 
> 
> In the spirit of SFQ it is probably ok to do that; 
> i.e iirc,  the idea for justification behind sfq was to provide a
> "computationaly feasible/reasonable" way to provide fair queueing with a
> gazillion queues. 
> Classification was considered damn expensive in those days (when MC68K
> was sort of state of the art); it still is but maybe a much much lesser
> issue now for what "gazillion" was in those days. So a simple hash on a
> few fields with pertubation (the part that makes allows the "stochastic"
> part in the name) was deemed the way to go and in order to provide
> fairness (while avoiding packet reordering).
> So if you want to keep that spirit it is ok to do what ESFQ does;
> I think the assumptions will still be valid if you have a gazillion
> queues in todays terms. A number like say 128K may make sense.
> (Hey Paul M is still around, just too focused on the wrong things like
> RCU;-> so we can ask his opinion)

Not only that, he is -really- slow about getting to some of his email
sometimes.  :-/  In my defense, if you CC me, I will spot it more quickly.

Yeah, the 1980s were indeed over a -long- time ago, and architectures
certainly have changed.  I wrote my first parallel kernel code about
a year after the SFQ paper was published, and not too many years after
that began my longstanding RCU habit.  And not only did I inhale at
the time, I am still inhaling deeply.  ;-)

In any case, if you can feasibly do an exact classification, then doing so
would most likely be better than using SFQ.  And with today's hardware,
exact classifications are much reasonable than they were back on my old
handful-of-MHz 68K-based Sun workstation.  If an exact classification is
unworkable, but you absolutely have to maintain perfect ordering, then
one approach is to let the SFQ drain before perturbing the hash function.

One way to do this is to have a pair of hash tables, and switch back and
forth between them on each perturbation.  Perturbation then goes as follows:

1.	Select a new hash perturbation value, and associate it with the
	currently-unused hash table (call it B).

2.	Swing pointers so that subsequent incoming packets go to B.

3.	Continue outputting from A until it is drained.  If the
	perturbation is lockless, you might need to ensure that a
	grace period passes during this drain operation.  (Yep, still
	inhaling!!!)

4.	Start outputting packets from B.

But there might be better approaches.

> > These patches change SFQ to allow attaching external classifiers and add
> > a new "flow" classifier that allows to classify flows based on an arbitary
> > combination of pre-defined keys. Its probably not the fastest classifier
> > when used with multiple keys, but frankly, I don't think speed is very
> > important in most situations where the current SFQ implementation is used.
> 
> The only one thing i noticed that changes the behavior is the use of
> skb->prio as a selector. I think if you removed that it should be fine.
> Another alternative is to create a brand new FQ qdisc and leave the
> classification to the classifiers.
> 
> > It currently does not support perturbation, I didn't want to move this into
> > the classifier, so I need to think about a way to handle it within SFQ.
> 
> It is kind of hard to put it back into the current approach because the
> basic assumptions of ensuring no re-ordering and a "fast" classifier are
> gone.
> 
> I am almost tempted to say go back and write a qdisc called FQ.

And this might well be a better approach.  ;-)

							Thanx, Paul

> cheers,
> jamal
> 
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] make atomic_t volatile on all architectures
From: Linus Torvalds @ 2007-08-09  4:18 UTC (permalink / raw)
  To: Chris Snook
  Cc: akpm, ak, heiko.carstens, davem, linux-kernel, netdev,
	schwidefsky, wensong, horms, wjiang, cfriesen, zlynx
In-Reply-To: <20070808230733.GA17270@shell.boston.redhat.com>



On Wed, 8 Aug 2007, Chris Snook wrote:
> 
> Some architectures currently do not declare the contents of an atomic_t to be
> volatile.  This causes confusion since atomic_read() might not actually read
> anything if an optimizing compiler re-uses a value stored in a register, which
> can break code that loops until something external changes the value of an
> atomic_t.

I'd be *much* happier with "atomic_read()" doing the "volatile" instead.

The fact is, volatile on data structures is a bug. It's a wart in the C 
language. It shouldn't be used. 

Volatile accesses in *code* can be ok, and if we have "atomic_read()" 
expand to a "*(volatile int *)&(x)->value", then I'd be ok with that.

But marking data structures volatile just makes the compiler screw up 
totally, and makes code for initialization sequences etc much worse.

		Linus

^ permalink raw reply

* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 @ 2007-08-09  4:19 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
	tgraf, sri, shemminger, kaber, herbert
In-Reply-To: <20070808.150137.50597736.davem@davemloft.net>

Hi Dave,

David Miller <davem@davemloft.net> wrote on 08/09/2007 03:31:37 AM:

> > What do you generally think of the patch/implementation ? :)
>
> We have two driver implementation paths on recieve and now
> we'll have two on send, and that's not a good trend.

Correct.

> In an ideal world all the drivers would be NAPI and netif_rx()
> would only be used by tunneling drivers and similar in the
> protocol layers.  And likewise all sends would go through
> ->hard_start_xmit().
>
> If you can come up with a long term strategy that gets rid of
> the special transmit method, that'd be great.
>
> We should make Linux network drivers easy to write, not more difficult
> by constantly adding most interfaces than we consolidate.

I think that is a good top level view, and I agree with that.

Patrick had suggested calling dev_hard_start_xmit() instead of
conditionally calling the new API and to remove the new API
entirely. The driver determines whether batching is required or
not depending on (skb==NULL) or not. Would that approach be fine
with this "single interface" goal ?

Thanks,

- KK

^ permalink raw reply

* Re: [PATCH RFC]: napi_struct V5
From: David Miller @ 2007-08-09  4:23 UTC (permalink / raw)
  To: hadi; +Cc: xma, jgarzik, netdev, netdev-owner, rdreier, rusty, shemminger
In-Reply-To: <1186587154.5155.43.camel@localhost>

From: jamal <hadi@cyberus.ca>
Date: Wed, 08 Aug 2007 11:32:34 -0400

> Think of a box where you have other network interfaces, the way you
> are implementing currently implies you are going to be very unfair to 
> the other interfaces on the box. 

This was the point I was trying to make the other day.

What's good for the goose (ipoib) is not necessarily good for the
gander and NAPI exists for the gander as much as for the goose.

^ permalink raw reply

* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: David Miller @ 2007-08-09  4:27 UTC (permalink / raw)
  To: krkumar2
  Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
In-Reply-To: <OF000FAFC5.C9245715-ON65257332.0016D216-65257332.0017CCB2@in.ibm.com>

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Thu, 9 Aug 2007 09:49:57 +0530

> Patrick had suggested calling dev_hard_start_xmit() instead of
> conditionally calling the new API and to remove the new API
> entirely. The driver determines whether batching is required or
> not depending on (skb==NULL) or not. Would that approach be fine
> with this "single interface" goal ?

It is a valid posibility.

Note that this is similar to how we handle TSO, the driver
sets the feature bit and in its ->hard_start_xmit() it checks
the SKB for the given offload property.

^ permalink raw reply

* Re: [PATCH RFC]: napi_struct V5
From: David Miller @ 2007-08-09  4:35 UTC (permalink / raw)
  To: hadi; +Cc: netdev, shemminger, jgarzik, rusty
In-Reply-To: <1186575059.5171.23.camel@localhost>

From: jamal <hadi@cyberus.ca>
Date: Wed, 08 Aug 2007 08:10:59 -0400

> Its overdue - just hasnt been anybody who has raised their hands
> to do it. Stephen did at one point. 
> Theres a lot of nice insights in that doc that will be nice to inherit.
> Theres also a nice state machine diagram in 
> http://www.kernel.org/pub/linux/kernel/people/hadi/docs/UKUUG2005.pdf
> that will be very useful to capture in whatever new doc.

Stephen said he would find time to work on this, so what I'll
do is yank the documentation file in my napi_struct changeset
and he can crib content from the old file and your UKUUG2005
slides.

^ permalink raw reply

* Re: [PATCH RFC]: napi_struct V6
From: David Miller @ 2007-08-09  4:36 UTC (permalink / raw)
  To: shemminger; +Cc: netdev, jgarzik, hadi, rusty
In-Reply-To: <20070808070628.284d805c@oldman>

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Wed, 8 Aug 2007 07:06:28 -0400

> Documentation of NAPI still needs more work. I'll take a start at getting
> net_device docbook format cleaned up, then start on a redo of the API
> documentation.  

Thanks for stepping up to do this Stephen.

^ permalink raw reply

* Re: TCP's initial cwnd setting correct?...
From: David Miller @ 2007-08-09  4:40 UTC (permalink / raw)
  To: jheffner; +Cc: ilpo.jarvinen, netdev
In-Reply-To: <46B9DF25.3000800@psc.edu>

From: John Heffner <jheffner@psc.edu>
Date: Wed, 08 Aug 2007 11:20:05 -0400

> I believe the current calculation is correct.  The RFC specifies a 
> window of no more than 4380 bytes unless 2*MSS > 4380.  If you change 
> the code in this way, then MSS=1461 will give you an initial window of 
> 3*MSS == 4383, violating the spec.  Reading the pseudocode in the RFC 
> 3390 is a bit misleading because they use a clamp at 4380 bytes rather 
> than use a multiplier in the relevant range.

Thanks for this excellent clarification John.

Because this has tripped up enough people, not once but on
multiple occaisions, I'm going to add some comments to
tcp_init_cwnd() to save the next person who is confused
by what seems to be an incorrect implementation of the RFC :-)

^ permalink raw reply

* Re: [PATCH] make atomic_t volatile on all architectures
From: Jerry Jiang @ 2007-08-09  4:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Snook, akpm, ak, heiko.carstens, davem, linux-kernel,
	netdev, schwidefsky, wensong, horms, cfriesen, zlynx
In-Reply-To: <alpine.LFD.0.999.0708082116060.25146@woody.linux-foundation.org>

On Wed, 8 Aug 2007 21:18:25 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Wed, 8 Aug 2007, Chris Snook wrote:
> > 
> > Some architectures currently do not declare the contents of an atomic_t to be
> > volatile.  This causes confusion since atomic_read() might not actually read
> > anything if an optimizing compiler re-uses a value stored in a register, which
> > can break code that loops until something external changes the value of an
> > atomic_t.
> 
> I'd be *much* happier with "atomic_read()" doing the "volatile" instead.
> 
> The fact is, volatile on data structures is a bug. It's a wart in the C 
> language. It shouldn't be used. 

Why? It's a wart! Is it due to unclear C standard on volatile related point?

Why the *volatile-accesses-in-code* is acceptable, does C standard make it clear?

-- Jerry

> 
> Volatile accesses in *code* can be ok, and if we have "atomic_read()" 
> expand to a "*(volatile int *)&(x)->value", then I'd be ok with that.
> 
> But marking data structures volatile just makes the compiler screw up 
> totally, and makes code for initialization sequences etc much worse.
> 
> 		Linus

^ permalink raw reply

* Re: [PATCH 00/23] per device dirty throttling -v8
From: david @ 2007-08-09  5:11 UTC (permalink / raw)
  To: Ray Lee
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, linux-mm,
	linux-kernel, miklos, akpm, neilb, dgc, tomoki.sekiyama.qu,
	nikita, trond.myklebust, yingchao.zhou, richard, netdev
In-Reply-To: <2c0942db0708040901x7ada0fe2mf71f37ecba51005b@mail.gmail.com>

On Sat, 4 Aug 2007, Ray Lee wrote:

> On 8/4/07, david@lang.hm <david@lang.hm> wrote:
>> On Sat, 4 Aug 2007, Ingo Molnar wrote:
>>
> At least on a surface level, your report has some similarities to
> http://lkml.org/lkml/2007/5/21/84 . In that message, John Miller
> mentions several things he tried without effect:
>
> < - I increased the max allowed receive buffer through
> < proc/sys/net/core/rmem_max and the application calls the right
> < syscall. "netstat -su" does not show any "packet receive errors".

mercury1:/proc/sys/net/core# cat rmem_*
124928
131071
mercury1:/proc/sys/net/core# netstat -su
Udp:
     697853177 packets received
     10025642 packets to unknown port received.
     191726680 packet receive errors
     63194 packets sent
     RcvbufErrors: 191726680
UdpLite:
mercury1:/proc/sys/net/core# echo "512000" >rmem_max

> < - After getting "kernel: swapper: page allocation failure.
> < order:0, mode:0x20", I increased /proc/sys/vm/min_free_kbytes

I have not seen any similar errors

> < - ixgb.txt in kernel network documentation suggests to increase
> < net.core.netdev_max_backlog to 300000. This did not help.

mercury1:/proc/sys/net/core# cat netdev_*
300
1000
mercury1:/proc/sys/net/core# echo "300000" >netdev_max_backlog

> < - I also had to increase net.core.optmem_max, because the default
> < value was too small for 700 multicast groups.

I'm not running multicast.

> As they're all pretty simple to test, it may be worthwhile to give
> them a shot just to rule things out.

unfortunantly the load is not high enough right now to see a real 
difference (it's only doing ~1400 logs/sec) I'll catch it at a higher load 
point to see if these make any difference.

David Lang

> Ray
>

^ permalink raw reply

* Re: [PATCH] Virtual ethernet device (v.4)
From: David Miller @ 2007-08-09  5:18 UTC (permalink / raw)
  To: xemul; +Cc: kaber, netdev, devel
In-Reply-To: <46B99E2A.60505@openvz.org>

From: Pavel Emelyanov <xemul@openvz.org>
Date: Wed, 08 Aug 2007 14:42:50 +0400

> What are your plans about this driver?

I've added it to my net-2.6.24 which I'm building to night.

^ permalink raw reply

* Re: [PATCH RFC]: napi_struct V5
From: Jeff Garzik @ 2007-08-09  5:32 UTC (permalink / raw)
  To: David Miller, rdreier; +Cc: hadi, xma, netdev, netdev-owner, rusty, shemminger
In-Reply-To: <20070808.212353.105403708.davem@davemloft.net>

David Miller wrote:
> From: jamal <hadi@cyberus.ca>
> Date: Wed, 08 Aug 2007 11:32:34 -0400
> 
>> Think of a box where you have other network interfaces, the way you
>> are implementing currently implies you are going to be very unfair to 
>> the other interfaces on the box. 
> 
> This was the point I was trying to make the other day.

Agreed.  That's one of the big selling points of NAPI when I talk to 
people -- the entire system works towards a single equilibrium, when 
multiple interfaces are under load.

And conversely, without NAPI, the driver has no knowledge of conditions 
outside its limited view, and resource imbalances inevitably ensue. 
Particularly so for infiniband, 10gb, etc. where the natural motivation 
of the driver writer appears to trend towards "performance even at the 
expense of other system entities."

	Jeff




^ permalink raw reply

* Re: [PATCH 0/1] lro: Generic Large Receive Offload for TCP traffic
From: David Miller @ 2007-08-09  6:11 UTC (permalink / raw)
  To: ossthema
  Cc: raisch, themann, linux-kernel, linuxppc-dev, meder, tklein,
	gallatin, jeff, stefan.roscher, netdev
In-Reply-To: <200708031441.14780.ossthema@de.ibm.com>

From: Jan-Bernd Themann <ossthema@de.ibm.com>
Date: Fri, 3 Aug 2007 14:41:14 +0200

> I think this patch could be the final version for now. It has been tested
> on two platforms (power and x86_64) and works very well.

I checked in the LRO patch and the two sample driver ports
to net-2.6.24, thanks!

^ permalink raw reply

* Re: [NEIGH]: Combine neighbour cleanup and release
From: David Miller @ 2007-08-09  6:13 UTC (permalink / raw)
  To: tgraf; +Cc: netdev
In-Reply-To: <20070731211206.GH9285@postel.suug.ch>

From: Thomas Graf <tgraf@suug.ch>
Date: Tue, 31 Jul 2007 23:12:06 +0200

> Introduces neigh_cleanup_and_release() to be used after a
> neighbour has been removed from its neighbour table. Serves
> as preparation to add event notifications.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Applied to net-2.6.24, thanks Thomas.

^ permalink raw reply

* Re: [NEIGH]: Netlink notifications
From: David Miller @ 2007-08-09  6:13 UTC (permalink / raw)
  To: tgraf; +Cc: netdev
In-Reply-To: <20070731211235.GI9285@postel.suug.ch>

From: Thomas Graf <tgraf@suug.ch>
Date: Tue, 31 Jul 2007 23:12:35 +0200

> Currently neighbour event notifications are limited to update
> notifications and only sent if the ARP daemon is enabled. This
> patch extends the existing notification code by also reporting
> neighbours being removed due to gc or administratively and
> removes the dependency on the ARP daemon. This allows to keep
> track of neighbour states without periodically fetching the
> complete neighbour table.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Also applied to net-2.6.24, thanks!

^ permalink raw reply

* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 @ 2007-08-09  6:26 UTC (permalink / raw)
  To: David Miller
  Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
In-Reply-To: <20070808.212727.60561701.davem@davemloft.net>

David Miller <davem@davemloft.net> wrote on 08/09/2007 09:57:27 AM:
>
> > Patrick had suggested calling dev_hard_start_xmit() instead of
> > conditionally calling the new API and to remove the new API
> > entirely. The driver determines whether batching is required or
> > not depending on (skb==NULL) or not. Would that approach be fine
> > with this "single interface" goal ?
>
> It is a valid posibility.
>
> Note that this is similar to how we handle TSO, the driver
> sets the feature bit and in its ->hard_start_xmit() it checks
> the SKB for the given offload property.

Great, I will try to get rid of two paths entirely, and see how to
re-arrange the code cleanly.

thanks,

- KK


^ permalink raw reply

* Re: [PATCH 0/1] lro: Generic Large Receive Offload for TCP traffic
From: Jeff Garzik @ 2007-08-09  6:31 UTC (permalink / raw)
  To: David Miller
  Cc: ossthema, raisch, themann, linux-kernel, linuxppc-dev, meder,
	tklein, gallatin, stefan.roscher, netdev
In-Reply-To: <20070808.231144.22645124.davem@davemloft.net>

David Miller wrote:
> From: Jan-Bernd Themann <ossthema@de.ibm.com>
> Date: Fri, 3 Aug 2007 14:41:14 +0200
> 
>> I think this patch could be the final version for now. It has been tested
>> on two platforms (power and x86_64) and works very well.
> 
> I checked in the LRO patch and the two sample driver ports
> to net-2.6.24, thanks!

Good to hear, thanks all.

I'll ponder whether I want to rebase netdev-2.6.git#upstream (2.6.24 
queue) on top of net-2.6.24.  I might put it off until the pain 
threshold rises, or I might go ahead and do it.  Undecided.

Either way, I'll want you to push to Linus before I do, when the next 
merge window opens.

	Jeff




^ permalink raw reply

* [PATCH 0/1] myri10ge update for 2.6.23
From: Brice Goglin @ 2007-08-09  7:01 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: netdev

Hi Jeff,

Aside from the LRO patch that DaveM queued for 2.6.24, here's a small
fix for myri10ge in 2.6.23:

1. Use the pause counter to avoid a needless device reset

Please apply, thanks,
Brice


^ permalink raw reply

* [PATCH 1/1] Use the pause counter to avoid a needless device reset
From: Brice Goglin @ 2007-08-09  7:02 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: netdev
In-Reply-To: <46BABBCA.2060001@myri.com>

Use the pause counter to avoid a needless device reset, and
print a message telling the admin that our link partner is
flow controlling us down to 0 pkts/sec.

Signed-off-by: Brice Goglin <brice@myri.com>
---
 drivers/net/myri10ge/myri10ge.c |   25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

Index: linux-2.6.22/drivers/net/myri10ge/myri10ge.c
===================================================================
--- linux-2.6.22.orig/drivers/net/myri10ge/myri10ge.c	2007-08-09 08:53:40.000000000 +0200
+++ linux-2.6.22/drivers/net/myri10ge/myri10ge.c	2007-08-09 08:53:53.000000000 +0200
@@ -191,6 +191,7 @@
 	struct timer_list watchdog_timer;
 	int watchdog_tx_done;
 	int watchdog_tx_req;
+	int watchdog_pause;
 	int watchdog_resets;
 	int tx_linearized;
 	int pause;
@@ -2800,6 +2801,7 @@
 static void myri10ge_watchdog_timer(unsigned long arg)
 {
 	struct myri10ge_priv *mgp;
+	u32 rx_pause_cnt;
 
 	mgp = (struct myri10ge_priv *)arg;
 
@@ -2816,19 +2818,28 @@
 		    myri10ge_fill_thresh)
 			mgp->rx_big.watchdog_needed = 0;
 	}
+	rx_pause_cnt = ntohl(mgp->fw_stats->dropped_pause);
 
 	if (mgp->tx.req != mgp->tx.done &&
 	    mgp->tx.done == mgp->watchdog_tx_done &&
-	    mgp->watchdog_tx_req != mgp->watchdog_tx_done)
+	    mgp->watchdog_tx_req != mgp->watchdog_tx_done) {
 		/* nic seems like it might be stuck.. */
-		schedule_work(&mgp->watchdog_work);
-	else
-		/* rearm timer */
-		mod_timer(&mgp->watchdog_timer,
-			  jiffies + myri10ge_watchdog_timeout * HZ);
-
+		if (rx_pause_cnt != mgp->watchdog_pause) {
+			if (net_ratelimit())
+				printk(KERN_WARNING "myri10ge %s:"
+				       "TX paused, check link partner\n",
+				       mgp->dev->name);
+		} else {
+			schedule_work(&mgp->watchdog_work);
+			return;
+		}
+	}
+	/* rearm timer */
+	mod_timer(&mgp->watchdog_timer,
+		  jiffies + myri10ge_watchdog_timeout * HZ);
 	mgp->watchdog_tx_done = mgp->tx.done;
 	mgp->watchdog_tx_req = mgp->tx.req;
+	mgp->watchdog_pause = rx_pause_cnt;
 }
 
 static int myri10ge_probe(struct pci_dev *pdev, const struct pci_device_id *ent)



^ permalink raw reply

* Re: [PATCH 0/1] lro: Generic Large Receive Offload for TCP traffic
From: David Miller @ 2007-08-09  7:03 UTC (permalink / raw)
  To: jeff
  Cc: ossthema, raisch, themann, linux-kernel, linuxppc-dev, meder,
	tklein, gallatin, stefan.roscher, netdev
In-Reply-To: <46BAB4AF.7010001@garzik.org>

From: Jeff Garzik <jeff@garzik.org>
Date: Thu, 09 Aug 2007 02:31:11 -0400

> Either way, I'll want you to push to Linus before I do, when the next 
> merge window opens.

No problem.

^ permalink raw reply

* myri10ge net-2.6.24 build fix
From: David Miller @ 2007-08-09  7:11 UTC (permalink / raw)
  To: netdev; +Cc: gallatin


I had to add the following patch to fix the build after
the LRO changes, I have no idea how you could have compile
tested that patch let alone done any real testing on it :-/

diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 4cc5e6f..0ac0610 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -93,6 +93,7 @@ MODULE_LICENSE("Dual BSD/GPL");
 #define MYRI10GE_EEPROM_STRINGS_SIZE 256
 #define MYRI10GE_MAX_SEND_DESC_TSO ((65536 / 2048) * 2)
 #define MYRI10GE_MAX_LRO_DESCRIPTORS 8
+#define MYRI10GE_LRO_MAX_PKTS 64
 
 #define MYRI10GE_NO_CONFIRM_DATA htonl(0xffffffff)
 #define MYRI10GE_NO_RESPONSE_RESULT 0xffffffff

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox