Netdev List
 help / color / mirror / Atom feed
* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-23 17:24 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <4FBD1740.1020304@intel.com>

On Wed, 2012-05-23 at 09:58 -0700, Alexander Duyck wrote:

> Right, but the problem is that in order to make this work the we are
> dropping the padding for head and hoping to have room for shared info. 
> This is going to kill performance for things like routing workloads
> since the entire head is going to have to be copied over to make space
> for NET_SKB_PAD. 

Hey I said that one of the point I have to add to my patch. Please read
it again ;)

By the way, we can also add code doing the ksb->head upgrade to fragment
again in case we need to add a tunnel header, instead of full copy.

So maybe the NET_SKB_PAD is not really needed anymore.

Anyway, a router host could use a different allocation strategy (going
back to current one)

>  Also this assumes no RSC being enabled.  RSC is
> normally enabled by default.  If it is turned on we are going to start
> receiving full 2K buffers which will cause even more issues since there
> wouldn't be any room for shared info in the 2K frame.
> 

Hey his is one of the point I have to address, also mentioned.

Its almost trivial to check len (if we have room for shared info, take
it, if not allocate the head as before)


> The way the driver is currently written probably provides the optimal
> setup for truesize given the circumstances.

It unfortunate the hardware has 1KB granularity.


>   In order to support
> receiving at least 1 full 1500 byte frame per fragment, and supporting
> RSC I have to support receiving up to 2K of data.  If we try to make it
> all part of one paged receive we would then have to either reduce the
> receive buffer size to 1K in hardware and span multiple fragments for a
> 1.5K frame or allocate a 3K buffer so we would have room to add
> NET_SKB_PAD and the shared info on the end.  At which point we are back
> to the extra 1K again, only in that case we cannot trim it off later via
> skb_try_coalesce.  In the 3K buffer case we would be over a 1/2 page
> which means we can only get one buffer per page instead of 2 in which
> case we might as well just round it up to 4K and be honest.
> 
> The reason I am confused is that I thought the skb_try_coalesce function
> was supposed to be what addressed these types of issues.  If these
> packets go through that function they should be stripping the sk_buff
> and possibly even the skb->head if we used the fragment since the only
> thing that is going to end up in the head would be the TCP header which
> should have been pulled prior to trying to coalesce.
> 
> I will need to investigate this further to understand what is going on. 
> I realize that dealing with 3K of memory for buffer storage is not
> ideal, but all of the alternatives lean more toward 4K when fully
> implemented.  I'll try and see what alternative solutions we might have
> available.

Problem is skb_try_coalesce() is not used when we store packets in
socket backlog, and only used for TCP at this moment.

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: David Miller @ 2012-05-23 17:34 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kmansley, bhutchings, netdev
In-Reply-To: <1337766246.3361.2447.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 23 May 2012 11:44:06 +0200

> Locking the socket for the whole operation (including copyout to user)
> is not very good. It was good enough years ago with small receive
> window.
> 
> With a potentially huge backlog, it means user process has to process
> it, regardless of its latency constraints. CPU caches are also
> completely destroyed because of huge amount of data included in thousand
> of skbs.

But it is the only way we can have TCP processing scheduled and
accounted to user processes.  That does have value when you have lots
of flows active.

The scheduler's ability to give the process cpu time influences
TCP's behavier, and under load if the process can't get enough
cpu time then TCP will back off.  We want that.

^ permalink raw reply

* Re: [PATCH 1/1] net: add dev_loopback_xmit() to avoid duplicate code
From: David Miller @ 2012-05-23 17:41 UTC (permalink / raw)
  To: michel
  Cc: netdev, linux-kernel, kuznet, jmorris, yoshfuji, kaber, edumazet,
	jpirko, mirq-linux, bhutchings
In-Reply-To: <1337782819.2779.20.camel@Thor>


I'm getting really tired of saying this.

As I announced several days ago, it is absolutely not appropriate
to submit patches other than bug fixes at this time because we are
in the merge window and the net-next tree is frozen.

And even once I do announce here that the net-next tree is open
once more, and patches like this one are appropriate, you must
indicate in the subject line which of the 'net' or 'net-next'
tree you are targetting your patch at.

Please pay attention to what's going on, and what state the networking
development trees are in, before submitting changes.

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-23 17:46 UTC (permalink / raw)
  To: David Miller; +Cc: kmansley, bhutchings, netdev
In-Reply-To: <20120523.133401.915684077769386834.davem@davemloft.net>

On Wed, 2012-05-23 at 13:34 -0400, David Miller wrote:

> But it is the only way we can have TCP processing scheduled and
> accounted to user processes.  That does have value when you have lots
> of flows active.
> 
> The scheduler's ability to give the process cpu time influences
> TCP's behavier, and under load if the process can't get enough
> cpu time then TCP will back off.  We want that.

But TCP already backs off if user process is not blocked on socket
input.

Modern applications uses select()/poll()/epoll() on many sockets in //.

Only old ones stil block on recv().

^ permalink raw reply

* Re: [PATCH net] ipx: restore token ring define to include/linux/ipx.h
From: David Miller @ 2012-05-23 17:49 UTC (permalink / raw)
  To: paul.gortmaker; +Cc: netdev, shemminger
In-Reply-To: <1337784225-30641-1-git-send-email-paul.gortmaker@windriver.com>

From: Paul Gortmaker <paul.gortmaker@windriver.com>
Date: Wed, 23 May 2012 10:43:45 -0400

> Commit 211ed865108e24697b44bee5daac502ee6bdd4a4
> 
>     "net: delete all instances of special processing for token ring"
> 
> removed the define for IPX_FRAME_TR_8022.
> 
> While it is unlikely, we can't be 100% sure that there aren't
> random userspace consumers of this value, so restore it.
> 
> The only instance I could find was in ncpfs-2.2.6, and it was
> safe as-is, since it used #ifdef IPX_FRAME_TR_8022 around the
> two use cases it had, but there may be other userspace packages
> without similar ifdefs.
> 
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

Applied, thanks Paul.

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: David Miller @ 2012-05-23 17:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kmansley, bhutchings, netdev
In-Reply-To: <1337795210.3361.3118.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 23 May 2012 19:46:50 +0200

> But TCP already backs off if user process is not blocked on socket
> input.
> 
> Modern applications uses select()/poll()/epoll() on many sockets in //.
> 
> Only old ones stil block on recv().

These arguments seem circular.

Those modern applications still trigger enough TCP work during their
recv() calls that it's significant enough for scheduling purposes, and
to me being able to account that TCP work as process time is still
extremely beneficial.

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Alexander Duyck @ 2012-05-23 17:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <1337793866.3361.3090.camel@edumazet-glaptop>

On 05/23/2012 10:24 AM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 09:58 -0700, Alexander Duyck wrote:
>
>> Right, but the problem is that in order to make this work the we are
>> dropping the padding for head and hoping to have room for shared info. 
>> This is going to kill performance for things like routing workloads
>> since the entire head is going to have to be copied over to make space
>> for NET_SKB_PAD. 
> Hey I said that one of the point I have to add to my patch. Please read
> it again ;)
I'm aware of that, but still it seems like we are getting ahead of
ourselves.  This fix is so specific to just the socket backlog case that
I think we are missing the fact that it is going to have huge
performance repercussions elsewhere.

> By the way, we can also add code doing the ksb->head upgrade to fragment
> again in case we need to add a tunnel header, instead of full copy.
>
> So maybe the NET_SKB_PAD is not really needed anymore.
>
> Anyway, a router host could use a different allocation strategy (going
> back to current one)
The thing I don't like is that we are adding extra memcpy calls to all
of these paths.  We cannot change the head without having to copy the
shared info and there is going to be a cost for that.  I would prefer to
only generate the shared info once and not have to relocate it every
time I want to run a router or tunnel.

>>  Also this assumes no RSC being enabled.  RSC is
>> normally enabled by default.  If it is turned on we are going to start
>> receiving full 2K buffers which will cause even more issues since there
>> wouldn't be any room for shared info in the 2K frame.
>>
> Hey his is one of the point I have to address, also mentioned.
>
> Its almost trivial to check len (if we have room for shared info, take
> it, if not allocate the head as before)
I know you mentioned this as well.  The thing is I would prefer not to
add code where we are branching in so many different directions on what
the header actually looks like.  It tends to open a lot of opportunities
for bugs when someone makes a change and doesn't take one of the
possible head and fragment combinations into account.

>> The way the driver is currently written probably provides the optimal
>> setup for truesize given the circumstances.
> It unfortunate the hardware has 1KB granularity.
Agreed, I would have preferred 512B granularity, but we are locked in at
1K for now..  :-/

> Problem is skb_try_coalesce() is not used when we store packets in
> socket backlog, and only used for TCP at this moment.
One piece of low hanging fruit that is available to help with some of
this is to drop the Rx header size for ixgbe to 256.  That should at
least cut the total truesize for the buffer and sk_buff to 960 or so
which is at least a step in the right direction.

Thanks,

Alex

^ permalink raw reply

* Re: [net] stmmac: fix driver Kconfig when built as module
From: David Miller @ 2012-05-23 18:01 UTC (permalink / raw)
  To: peppe.cavallaro; +Cc: netdev, bhutchings, lliubbo, rayagond
In-Reply-To: <1337753142-13762-1-git-send-email-peppe.cavallaro@st.com>

From: Giuseppe CAVALLARO <peppe.cavallaro@st.com>
Date: Wed, 23 May 2012 08:05:42 +0200

> This patches fixes the driver when built as dyn module.
> In fact the platform part cannot be built and the probe fails
> (thanks to Bob Liu that reported this bug).
> The patch also makes the selection of Platform and PCI parts
> mutually exclusive.
> 
> Reported-by: Bob Liu <lliubbo@gmail.com>
> Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>

We have drivers which support both OF (which is implemented as
platform bus) and PCI at the same time.  For example,
drivers/net/ethernet/sun/niu.c

I do not see why stmmac cannot support both at the same time as well.

I absolutely do not want such segregation unless it is absolutely
necessary.  Because it means that no matter what is choosen, a piece
of code is disabled and therefore not getting build and/or runtime
validation.

^ permalink raw reply

* Re: WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
From: Sergio Correia @ 2012-05-23 18:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1337791018.3361.3024.camel@edumazet-glaptop>

On Wed, May 23, 2012 at 1:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2012-05-23 at 12:56 -0300, Sergio Correia wrote:
>> On Wed, May 23, 2012 at 12:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Tue, 2012-05-22 at 11:47 -0300, Sergio Correia wrote:
>> >> Hi Eric,
>> > ...
>> >> Yes, it's an Atheros AR9285 adapter.
>> >> This morning I did a make mrproper before rebuilding the kernel
>> >> (should I always do that?), but the warning has just appeared again.
>> >
>> > OK, I am taking a look at this problem, thanks.
>> >
>>
>> Thanks. Let me know if you need additional info. As of now, my dmesg
>> basically shows only those warnings.
>
> I believe I found the bug and am testing a fix right now.
>
> By the way, we might have the same problem in tcp collapses.
>
> TCP coalescing (introduced in linux-3.5) triggers the problem faster.
>
> Please test following patch :
>

I reverted back to 471368557a734c6c486ee757952c902b36e7fd01 and it
took almost one hour to trigger the warning. Now I have applied your
patch and will report back how it went after a few hours of testing.

^ permalink raw reply

* Re: WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
From: Eric Dumazet @ 2012-05-23 18:37 UTC (permalink / raw)
  To: Sergio Correia; +Cc: netdev
In-Reply-To: <CAJyhjX1E_yzT1Vte+3iqc-PnfLiEEdFEyM9GcOCs-6TrUm86XQ@mail.gmail.com>

On Wed, 2012-05-23 at 15:30 -0300, Sergio Correia wrote:
> On Wed, May 23, 2012 at 1:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2012-05-23 at 12:56 -0300, Sergio Correia wrote:
> >> On Wed, May 23, 2012 at 12:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> > On Tue, 2012-05-22 at 11:47 -0300, Sergio Correia wrote:
> >> >> Hi Eric,
> >> > ...
> >> >> Yes, it's an Atheros AR9285 adapter.
> >> >> This morning I did a make mrproper before rebuilding the kernel
> >> >> (should I always do that?), but the warning has just appeared again.
> >> >
> >> > OK, I am taking a look at this problem, thanks.
> >> >
> >>
> >> Thanks. Let me know if you need additional info. As of now, my dmesg
> >> basically shows only those warnings.
> >
> > I believe I found the bug and am testing a fix right now.
> >
> > By the way, we might have the same problem in tcp collapses.
> >
> > TCP coalescing (introduced in linux-3.5) triggers the problem faster.
> >
> > Please test following patch :
> >
> 
> I reverted back to 471368557a734c6c486ee757952c902b36e7fd01 and it
> took almost one hour to trigger the warning. Now I have applied your
> patch and will report back how it went after a few hours of testing.

Thanks

I triggered it very fast in my lab using following setup

Sender machine :

# tc qdisc add dev eth0 root netem delay 1ms 3ms 20 reorder 10 20
for i in `seq 1 8`
do
  netperf -t OMNI  -C -c -H 172.30.42.8 -l 60 &
done
wait
# tc -s -d qd
qdisc netem 8002: dev eth0 root refcnt 2 limit 1000 delay 1.0ms  3.0ms
20% reorder 10% 20% gap 1
 Sent 66030032010 bytes 43992779 pkt (dropped 13846, overlimits 0
requeues 2712184) 
 backlog 0b 0p requeues 2712184 

receiver machine runs a netserver and triggers the bug in few seconds.

(receiver being a slow machine, with r8169 NIC)

^ permalink raw reply

* GET BACK TO ME ASAP.
From: Mrs Anna Kennedy @ 2012-05-23 18:59 UTC (permalink / raw)


Good day my beloved friend,

How are you and your lovely family doing today,i hope all is well?if so glory 
be to God,i have an urgent purposal for you, if interested kindly contact me on 
this e-mail (anna_kennedy_hood@hotmail.co.uk)

^ permalink raw reply

* Re: [PATCH v6 2/2] decrement static keys on real destroy time
From: Andrew Morton @ 2012-05-23 20:33 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, devel, kamezawa.hiroyu, netdev, Tejun Heo,
	Li Zefan, Johannes Weiner, Michal Hocko, David Miller
In-Reply-To: <4FBCAAF4.4030803@parallels.com>

On Wed, 23 May 2012 13:16:36 +0400
Glauber Costa <glommer@parallels.com> wrote:

> On 05/23/2012 02:46 AM, Andrew Morton wrote:
> > Here, we're open-coding kinda-test_bit().  Why do that?  These flags are
> > modified with set_bit() and friends, so we should read them with the
> > matching test_bit()?
> 
> My reasoning was to be as cheap as possible, as you noted yourself two
> paragraphs below.

These aren't on any fast path, are they?

Plus: you failed in that objective!  The C compiler's internal
scalar->bool conversion makes these functions no more efficient than
test_bit().

> > So here are suggested changes from*some*  of the above discussion.
> > Please consider, incorporate, retest and send us a v7?
> 
> How do you want me to do it? Should I add your patch ontop of mine,
> and then another one that tweaks whatever else is left, or should I just
> merge those changes into the patches I have?

A brand new patch, I guess.  I can sort out the what-did-he-change view
at this end.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH v7] tilegx network driver: initial support
From: Chris Metcalf @ 2012-05-23 20:42 UTC (permalink / raw)
  To: bhutchings, arnd, David Miller, linux-kernel, netdev
In-Reply-To: <20120520.165546.1211013675964130504.davem@davemloft.net>

This change adds support for the tilegx network driver based on the
GXIO IORPC support in the tilegx software stack, using the on-chip
mPIPE packet processing engine.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
---
I worked with the original author of this driver to refactor the
code and conform more closely to conventional Linux coding style.
I'd appreciate any additional feedback - thanks!

 drivers/net/ethernet/tile/Kconfig  |    1 +
 drivers/net/ethernet/tile/Makefile |    4 +-
 drivers/net/ethernet/tile/tilegx.c | 1798 ++++++++++++++++++++++++++++++++++++
 3 files changed, 1801 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/tile/tilegx.c

diff --git a/drivers/net/ethernet/tile/Kconfig b/drivers/net/ethernet/tile/Kconfig
index 2d9218f..9184b61 100644
--- a/drivers/net/ethernet/tile/Kconfig
+++ b/drivers/net/ethernet/tile/Kconfig
@@ -7,6 +7,7 @@ config TILE_NET
 	depends on TILE
 	default y
 	select CRC32
+	select TILE_GXIO_MPIPE if TILEGX
 	---help---
 	  This is a standard Linux network device driver for the
 	  on-chip Tilera Gigabit Ethernet and XAUI interfaces.
diff --git a/drivers/net/ethernet/tile/Makefile b/drivers/net/ethernet/tile/Makefile
index f634f14..0ef9eef 100644
--- a/drivers/net/ethernet/tile/Makefile
+++ b/drivers/net/ethernet/tile/Makefile
@@ -4,7 +4,7 @@
 
 obj-$(CONFIG_TILE_NET) += tile_net.o
 ifdef CONFIG_TILEGX
-tile_net-objs := tilegx.o mpipe.o iorpc_mpipe.o dma_queue.o
+tile_net-y := tilegx.o
 else
-tile_net-objs := tilepro.o
+tile_net-y := tilepro.o
 endif
diff --git a/drivers/net/ethernet/tile/tilegx.c b/drivers/net/ethernet/tile/tilegx.c
new file mode 100644
index 0000000..abfff7f
--- /dev/null
+++ b/drivers/net/ethernet/tile/tilegx.c
@@ -0,0 +1,1798 @@
+/*
+ * Copyright 2012 Tilera Corporation. All Rights Reserved.
+ *
+ *   This program is free software; you can redistribute it and/or
+ *   modify it under the terms of the GNU General Public License
+ *   as published by the Free Software Foundation, version 2.
+ *
+ *   This program is distributed in the hope that it will be useful, but
+ *   WITHOUT ANY WARRANTY; without even the implied warranty of
+ *   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ *   NON INFRINGEMENT.  See the GNU General Public License for
+ *   more details.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/sched.h>
+#include <linux/kernel.h>      /* printk() */
+#include <linux/slab.h>        /* kmalloc() */
+#include <linux/errno.h>       /* error codes */
+#include <linux/types.h>       /* size_t */
+#include <linux/interrupt.h>
+#include <linux/in.h>
+#include <linux/irq.h>
+#include <linux/netdevice.h>   /* struct device, and other headers */
+#include <linux/etherdevice.h> /* eth_type_trans */
+#include <linux/skbuff.h>
+#include <linux/ioctl.h>
+#include <linux/cdev.h>
+#include <linux/hugetlb.h>
+#include <linux/in6.h>
+#include <linux/timer.h>
+#include <linux/io.h>
+#include <linux/ctype.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+
+#include <asm/checksum.h>
+#include <asm/homecache.h>
+#include <gxio/mpipe.h>
+#include <arch/sim.h>
+
+/* Default transmit lockup timeout period, in jiffies. */
+#define TILE_NET_TIMEOUT (5 * HZ)
+
+/* The maximum number of distinct channels (idesc.channel is 5 bits). */
+#define TILE_NET_CHANNELS 32
+
+/* Maximum number of idescs to handle per "poll". */
+#define TILE_NET_BATCH 128
+
+/* Maximum number of packets to handle per "poll". */
+#define TILE_NET_WEIGHT 64
+
+/* Number of entries in each iqueue. */
+#define IQUEUE_ENTRIES 512
+
+/* Number of entries in each equeue. */
+#define EQUEUE_ENTRIES 2048
+
+/* Total header bytes per equeue slot.  Must be big enough for 2 bytes
+ * of NET_IP_ALIGN alignment, plus 14 bytes (?) of L2 header, plus up to
+ * 60 bytes of actual TCP header.  We round up to align to cache lines.
+ */
+#define HEADER_BYTES 128
+
+/* Maximum completions per cpu per device (must be a power of two).
+ * ISSUE: What is the right number here?  If this is too small, then
+ * egress might block waiting for free space in a completions array.
+ * ISSUE: At the least, allocate these only for initialized echannels.
+ */
+#define TILE_NET_MAX_COMPS 64
+
+#define MAX_FRAGS (MAX_SKB_FRAGS + 1)
+
+/* Size of completions data to allocate.
+ * ISSUE: Probably more than needed since we don't use all the channels.
+ */
+#define COMPS_SIZE (TILE_NET_CHANNELS * sizeof(struct tile_net_comps))
+
+/* Size of NotifRing data to allocate. */
+#define NOTIF_RING_SIZE (IQUEUE_ENTRIES * sizeof(gxio_mpipe_idesc_t))
+
+MODULE_AUTHOR("Tilera Corporation");
+MODULE_LICENSE("GPL");
+
+/* A "packet fragment" (a chunk of memory). */
+struct frag {
+	void *buf;
+	size_t length;
+};
+
+/* A single completion. */
+struct tile_net_comp {
+	/* The "complete_count" when the completion will be complete. */
+	s64 when;
+	/* The buffer to be freed when the completion is complete. */
+	struct sk_buff *skb;
+};
+
+/* The completions for a given cpu and device. */
+struct tile_net_comps {
+	/* The completions. */
+	struct tile_net_comp comp_queue[TILE_NET_MAX_COMPS];
+	/* The number of completions used. */
+	unsigned long comp_next;
+	/* The number of completions freed. */
+	unsigned long comp_last;
+};
+
+/* Info for a specific cpu. */
+struct tile_net_info {
+	/* The NAPI struct. */
+	struct napi_struct napi;
+	/* Packet queue. */
+	gxio_mpipe_iqueue_t iqueue;
+	/* Our cpu. */
+	int my_cpu;
+	/* True if iqueue is valid. */
+	bool has_iqueue;
+	/* NAPI flags. */
+	bool napi_added;
+	bool napi_enabled;
+	/* Number of small sk_buffs which must still be provided. */
+	unsigned int num_needed_small_buffers;
+	/* Number of large sk_buffs which must still be provided. */
+	unsigned int num_needed_large_buffers;
+	/* A timer for handling egress completions. */
+	struct timer_list egress_timer;
+	/* True if "egress_timer" is scheduled. */
+	bool egress_timer_scheduled;
+	/* Comps for each egress channel. */
+	struct tile_net_comps *comps_for_echannel[TILE_NET_CHANNELS];
+};
+
+/* Info for egress on a particular egress channel. */
+struct tile_net_egress {
+	/* The "equeue". */
+	gxio_mpipe_equeue_t *equeue;
+	/* The headers for TSO. */
+	unsigned char *headers;
+};
+
+/* Info for a specific device. */
+struct tile_net_priv {
+	/* Our network device. */
+	struct net_device *dev;
+	/* The primary link. */
+	gxio_mpipe_link_t link;
+	/* The primary channel, if open, else -1. */
+	int channel;
+	/* The "loopify" egress link, if needed. */
+	gxio_mpipe_link_t loopify_link;
+	/* The "loopify" egress channel, if open, else -1. */
+	int loopify_channel;
+	/* The egress channel (channel or loopify_channel). */
+	int echannel;
+	/* Total stats. */
+	struct net_device_stats stats;
+};
+
+/* Egress info, indexed by "priv->echannel" (lazily created as needed). */
+static struct tile_net_egress egress_for_echannel[TILE_NET_CHANNELS];
+
+/* Devices currently associated with each channel.
+ * NOTE: The array entry can become NULL after ifconfig down, but
+ * we do not free the underlying net_device structures, so it is
+ * safe to use a pointer after reading it from this array.
+ */
+static struct net_device *tile_net_devs_for_channel[TILE_NET_CHANNELS];
+
+/* A mutex for "tile_net_devs_for_channel". */
+static DEFINE_MUTEX(tile_net_devs_for_channel_mutex);
+
+/* The per-cpu info. */
+static DEFINE_PER_CPU(struct tile_net_info, per_cpu_info);
+
+/* The "context" for all devices. */
+static gxio_mpipe_context_t context;
+
+/* The small/large "buffer stacks". */
+static int small_buffer_stack = -1;
+static int large_buffer_stack = -1;
+
+/* Amount of memory allocated for each buffer stack. */
+static size_t buffer_stack_size;
+
+/* The actual memory allocated for the buffer stacks. */
+static void *small_buffer_stack_va;
+static void *large_buffer_stack_va;
+
+/* The buckets. */
+static int first_bucket = -1;
+static int num_buckets = 1;
+
+/* The ingress irq. */
+static int ingress_irq = -1;
+
+/* Text value of tile_net.cpus if passed as a module parameter. */
+static char *network_cpus_string;
+
+/* The actual cpus in "network_cpus". */
+static struct cpumask network_cpus_map;
+
+/* If "loopify=LINK" was specified, this is "LINK". */
+static char *loopify_link_name;
+
+/* If "tile_net.custom" was specified, this is non-NULL. */
+static char *custom_str;
+
+/* The "tile_net.cpus" argument specifies the cpus that are dedicated
+ * to handle ingress packets.
+ *
+ * The parameter should be in the form "tile_net.cpus=m-n[,x-y]", where
+ * m, n, x, y are integer numbers that represent the cpus that can be
+ * neither a dedicated cpu nor a dataplane cpu.
+ */
+static bool network_cpus_init(void)
+{
+	char buf[1024];
+	int rc;
+
+	if (network_cpus_string == NULL)
+		return false;
+
+	rc = cpulist_parse_crop(network_cpus_string, &network_cpus_map);
+	if (rc != 0) {
+		pr_warn("tile_net.cpus=%s: malformed cpu list\n",
+			network_cpus_string);
+		return false;
+	}
+
+	/* Remove dedicated cpus. */
+	cpumask_and(&network_cpus_map, &network_cpus_map, cpu_possible_mask);
+
+	if (cpumask_empty(&network_cpus_map)) {
+		pr_warn("Ignoring empty tile_net.cpus='%s'.\n",
+			network_cpus_string);
+		return false;
+	}
+
+	cpulist_scnprintf(buf, sizeof(buf), &network_cpus_map);
+	pr_info("Linux network CPUs: %s\n", buf);
+	return true;
+}
+
+module_param_named(cpus, network_cpus_string, charp, 0444);
+MODULE_PARM_DESC(cpus, "cpulist of cores that handle network interrupts");
+
+/* The "tile_net.loopify=LINK" argument causes the named device to
+ * actually use "loop0" for ingress, and "loop1" for egress.  This
+ * allows an app to sit between the actual link and linux, passing
+ * (some) packets along to linux, and forwarding (some) packets sent
+ * out by linux.
+ */
+module_param_named(loopify, loopify_link_name, charp, 0444);
+MODULE_PARM_DESC(loopify, "name the device to use loop0/1 for ingress/egress");
+
+/* The "tile_net.custom" argument causes us to ignore the "conventional"
+ * classifier metadata, in particular, the "l2_offset".
+ */
+module_param_named(custom, custom_str, charp, 0444);
+MODULE_PARM_DESC(custom, "indicates a (heavily) customized classifier");
+
+/* Atomically update a statistics field.
+ * Note that on TILE-Gx, this operation is fire-and-forget on the
+ * issuing core (single-cycle dispatch) and takes only a few cycles
+ * longer than a regular store when the request reaches the home cache.
+ * No expensive bus management overhead is required.
+ */
+static void tile_net_stats_add(unsigned long value, unsigned long *field)
+{
+	BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(unsigned long));
+	atomic_long_add(value, (atomic_long_t *)field);
+}
+
+/* Allocate and push a buffer. */
+static bool tile_net_provide_buffer(bool small)
+{
+	int stack = small ? small_buffer_stack : large_buffer_stack;
+	const unsigned long buffer_alignment = 128;
+	struct sk_buff *skb;
+	int len;
+
+	len = sizeof(struct sk_buff **) + buffer_alignment;
+	len += (small ? 128 : 1664);
+	skb = dev_alloc_skb(len);
+	if (skb == NULL)
+		return false;
+
+	/* Make room for a back-pointer to 'skb' and guarantee alignment. */
+	skb_reserve(skb, sizeof(struct sk_buff **));
+	skb_reserve(skb, -(long)skb->data & (buffer_alignment - 1));
+
+	/* Save a back-pointer to 'skb'. */
+	*(struct sk_buff **)(skb->data - sizeof(struct sk_buff **)) = skb;
+
+	/* Make sure "skb" and the back-pointer have been flushed. */
+	wmb();
+
+	gxio_mpipe_push_buffer(&context, stack,
+			       (void *)va_to_tile_io_addr(skb->data));
+
+	return true;
+}
+
+static void tile_net_pop_all_buffers(int stack)
+{
+	void *va;
+	while ((va = gxio_mpipe_pop_buffer(&context, stack)) != NULL) {
+		struct sk_buff **skb_ptr = va - sizeof(*skb_ptr);
+		struct sk_buff *skb = *skb_ptr;
+		dev_kfree_skb_irq(skb);
+	}
+}
+
+/* Provide linux buffers to mPIPE. */
+static void tile_net_provide_needed_buffers(struct tile_net_info *info)
+{
+	while (info->num_needed_small_buffers != 0) {
+		if (!tile_net_provide_buffer(true))
+			goto oops;
+		info->num_needed_small_buffers--;
+	}
+
+	while (info->num_needed_large_buffers != 0) {
+		if (!tile_net_provide_buffer(false))
+			goto oops;
+		info->num_needed_large_buffers--;
+	}
+
+	return;
+
+oops:
+	/* Add a description to the page allocation failure dump. */
+	pr_notice("Tile %d still needs some buffers\n", info->my_cpu);
+}
+
+static inline bool filter_packet(struct net_device *dev, void *buf)
+{
+	/* Filter packets received before we're up. */
+	if (dev == NULL || !(dev->flags & IFF_UP))
+		return true;
+
+	/* Filter out packets that aren't for us. */
+	if (!(dev->flags & IFF_PROMISC) &&
+	    !is_multicast_ether_addr(buf) &&
+	    compare_ether_addr(dev->dev_addr, buf) != 0)
+		return true;
+
+	return false;
+}
+
+/* Convert a raw mpipe buffer to its matching skb pointer. */
+static struct sk_buff *mpipe_buf_to_skb(void *va)
+{
+	/* Acquire the associated "skb". */
+	struct sk_buff **skb_ptr = va - sizeof(*skb_ptr);
+	struct sk_buff *skb = *skb_ptr;
+
+	/* Paranoia. */
+	if (skb->data != va) {
+		/* Panic here since there's a reasonable chance
+		 * that corrupt buffers means generic memory
+		 * corruption, with unpredictable system effects.
+		 */
+		panic("Corrupt linux buffer! "
+		      "va=%p, skb=%p, skb->data=%p",
+		      va, skb, skb->data);
+	}
+
+	return skb;
+}
+
+static void tile_net_receive_skb(struct net_device *dev, struct sk_buff *skb,
+				 struct tile_net_info *info,
+				 gxio_mpipe_idesc_t *idesc, unsigned long len)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+
+	/* Encode the actual packet length. */
+	skb_put(skb, len);
+
+	skb->protocol = eth_type_trans(skb, dev);
+
+	/* Acknowledge "good" hardware checksums. */
+	if (idesc->cs && idesc->csum_seed_val == 0xFFFF)
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
+
+	netif_receive_skb(skb);
+
+	/* Update stats. */
+	tile_net_stats_add(1, &priv->stats.rx_packets);
+	tile_net_stats_add(len, &priv->stats.rx_bytes);
+
+	/* Need a new buffer. */
+	if (idesc->size == GXIO_MPIPE_BUFFER_SIZE_128)
+		info->num_needed_small_buffers++;
+	else
+		info->num_needed_large_buffers++;
+}
+
+/* Handle a packet.  Return true if "processed", false if "filtered". */
+static bool tile_net_handle_packet(struct tile_net_info *info,
+				   gxio_mpipe_idesc_t *idesc)
+{
+	struct net_device *dev = tile_net_devs_for_channel[idesc->channel];
+	uint8_t l2_offset;
+	void *va;
+	void *buf;
+	unsigned long len;
+	bool filter;
+
+	/* Drop packets for which no buffer was available.
+	 * NOTE: This happens under heavy load.
+	 */
+	if (idesc->be) {
+		gxio_mpipe_iqueue_consume(&info->iqueue, idesc);
+		if (net_ratelimit())
+			pr_info("Dropping packet (insufficient buffers).\n");
+		return false;
+	}
+
+	/* Get the "l2_offset", if allowed. */
+	l2_offset = custom_str ? 0 : gxio_mpipe_idesc_get_l2_offset(idesc);
+
+	/* Get the raw buffer VA (includes "headroom"). */
+	va = tile_io_addr_to_va((unsigned long)(long)idesc->va);
+
+	/* Get the actual packet start/length. */
+	buf = va + l2_offset;
+	len = idesc->l2_size - l2_offset;
+
+	/* Point "va" at the raw buffer. */
+	va -= NET_IP_ALIGN;
+
+	filter = filter_packet(dev, buf);
+	if (filter) {
+		/* FIXME: Update "drop" statistics. */
+		gxio_mpipe_iqueue_drop(&info->iqueue, idesc);
+	} else {
+		struct sk_buff *skb = mpipe_buf_to_skb(va);
+
+		/* Skip headroom, and any custom header. */
+		skb_reserve(skb, NET_IP_ALIGN + l2_offset);
+
+		tile_net_receive_skb(dev, skb, info, idesc, len);
+	}
+
+	gxio_mpipe_iqueue_consume(&info->iqueue, idesc);
+	return !filter;
+}
+
+/* Handle some packets for the current CPU.
+ *
+ * This function handles up to TILE_NET_BATCH idescs per call.
+ *
+ * ISSUE: Since we do not provide new buffers until this function is
+ * complete, we must initially provide enough buffers for each network
+ * cpu to fill its iqueue and also its batched idescs.
+ *
+ * ISSUE: The "rotting packet" race condition occurs if a packet
+ * arrives after the queue appears to be empty, and before the
+ * hypervisor interrupt is re-enabled.
+ */
+static int tile_net_poll(struct napi_struct *napi, int budget)
+{
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+	unsigned int work = 0;
+	gxio_mpipe_idesc_t *idesc;
+	int i, n;
+
+	/* Process packets. */
+	while ((n = gxio_mpipe_iqueue_try_peek(&info->iqueue, &idesc)) > 0) {
+		for (i = 0; i < n; i++) {
+			if (i == TILE_NET_BATCH)
+				goto done;
+			if (tile_net_handle_packet(info, idesc + i)) {
+				if (++work >= budget)
+					goto done;
+			}
+		}
+	}
+
+	/* There are no packets left. */
+	napi_complete(&info->napi);
+
+	/* Re-enable hypervisor interrupts. */
+	gxio_mpipe_enable_notif_ring_interrupt(&context, info->iqueue.ring);
+
+	/* HACK: Avoid the "rotting packet" problem. */
+	if (gxio_mpipe_iqueue_try_peek(&info->iqueue, &idesc) > 0)
+		napi_schedule(&info->napi);
+
+	/* ISSUE: Handle completions? */
+
+done:
+	tile_net_provide_needed_buffers(info);
+
+	return work;
+}
+
+/* Handle an ingress interrupt on the current cpu. */
+static irqreturn_t tile_net_handle_ingress_irq(int irq, void *unused)
+{
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+	napi_schedule(&info->napi);
+	return IRQ_HANDLED;
+}
+
+/* Free some completions.  This must be called with interrupts blocked. */
+static void tile_net_free_comps(gxio_mpipe_equeue_t *equeue,
+				struct tile_net_comps *comps,
+				int limit, bool force_update)
+{
+	int n = 0;
+	while (comps->comp_last < comps->comp_next) {
+		unsigned int cid = comps->comp_last % TILE_NET_MAX_COMPS;
+		struct tile_net_comp *comp = &comps->comp_queue[cid];
+		if (!gxio_mpipe_equeue_is_complete(equeue, comp->when,
+						   force_update || n == 0))
+			return;
+		dev_kfree_skb_irq(comp->skb);
+		comps->comp_last++;
+		if (++n == limit)
+			return;
+	}
+}
+
+/* Add a completion.  This must be called with interrupts blocked.
+ *
+ * FIXME: We should probably have stopped the queue earlier rather
+ * than having to wait here.
+ */
+static void add_comp(gxio_mpipe_equeue_t *equeue,
+		     struct tile_net_comps *comps,
+		     uint64_t when, struct sk_buff *skb)
+{
+	int cid;
+
+	/* Wait for a free completion entry, if needed. */
+	while (comps->comp_next - comps->comp_last >= TILE_NET_MAX_COMPS - 1)
+		tile_net_free_comps(equeue, comps, 32, false);
+
+	/* Update the completions array. */
+	cid = comps->comp_next % TILE_NET_MAX_COMPS;
+	comps->comp_queue[cid].when = when;
+	comps->comp_queue[cid].skb = skb;
+	comps->comp_next++;
+}
+
+/* Make sure the egress timer is scheduled.
+ *
+ * Note that we use "schedule if not scheduled" logic instead of the more
+ * obvious "reschedule" logic, because "reschedule" is fairly expensive.
+ */
+static void tile_net_schedule_egress_timer(struct tile_net_info *info)
+{
+	if (!info->egress_timer_scheduled) {
+		mod_timer_pinned(&info->egress_timer, jiffies + 1);
+		info->egress_timer_scheduled = true;
+	}
+}
+
+/* The "function" for "info->egress_timer".
+ *
+ * This timer will reschedule itself as long as there are any pending
+ * completions expected for this tile.
+ */
+static void tile_net_handle_egress_timer(unsigned long arg)
+{
+	struct tile_net_info *info = (struct tile_net_info *)arg;
+	unsigned long irqflags;
+	bool pending = false;
+	int i;
+
+	local_irq_save(irqflags);
+
+	/* The timer is no longer scheduled. */
+	info->egress_timer_scheduled = false;
+
+	/* Free all possible comps for this tile. */
+	for (i = 0; i < TILE_NET_CHANNELS; i++) {
+		struct tile_net_egress *egress = &egress_for_echannel[i];
+		struct tile_net_comps *comps = info->comps_for_echannel[i];
+		if (comps->comp_last >= comps->comp_next)
+			continue;
+		tile_net_free_comps(egress->equeue, comps, -1, true);
+		pending = pending || (comps->comp_last < comps->comp_next);
+	}
+
+	/* Reschedule timer if needed. */
+	if (pending)
+		tile_net_schedule_egress_timer(info);
+
+	local_irq_restore(irqflags);
+}
+
+/* Helper function for "tile_net_update()".
+ * "dev" (i.e. arg) is the device being brought up or down,
+ * or NULL if all devices are now down.
+ */
+static void tile_net_update_cpu(void *arg)
+{
+	struct net_device *dev = arg;
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+
+	if (!info->has_iqueue)
+		return;
+
+	if (dev != NULL) {
+		if (!info->napi_added) {
+			netif_napi_add(dev, &info->napi,
+				       tile_net_poll, TILE_NET_WEIGHT);
+			info->napi_added = true;
+		}
+		if (!info->napi_enabled) {
+			napi_enable(&info->napi);
+			info->napi_enabled = true;
+		}
+		enable_percpu_irq(ingress_irq, 0);
+	} else {
+		disable_percpu_irq(ingress_irq);
+		if (info->napi_enabled) {
+			napi_disable(&info->napi);
+			info->napi_enabled = false;
+		}
+		/* FIXME: Drain the iqueue. */
+	}
+}
+
+/* Helper function for tile_net_open() and tile_net_stop().
+ * Always called under tile_net_devs_for_channel_mutex.
+ */
+static int tile_net_update(struct net_device *dev)
+{
+	static gxio_mpipe_rules_t rules;  /* too big to fit on the stack */
+	bool saw_channel = false;
+	int channel;
+	int rc;
+	int cpu;
+
+	gxio_mpipe_rules_init(&rules, &context);
+
+	for (channel = 0; channel < TILE_NET_CHANNELS; channel++) {
+		if (tile_net_devs_for_channel[channel] == NULL)
+			continue;
+		if (!saw_channel) {
+			saw_channel = true;
+			gxio_mpipe_rules_begin(&rules, first_bucket,
+					       num_buckets, NULL);
+			gxio_mpipe_rules_set_headroom(&rules, NET_IP_ALIGN);
+		}
+		gxio_mpipe_rules_add_channel(&rules, channel);
+	}
+
+	/* NOTE: This can fail if there is no classifier.
+	 * ISSUE: Can anything else cause it to fail?
+	 */
+	rc = gxio_mpipe_rules_commit(&rules);
+	if (rc != 0) {
+		netdev_warn(dev, "gxio_mpipe_rules_commit failed: %d\n", rc);
+		return -EIO;
+	}
+
+	/* Update all cpus, sequentially (to protect "netif_napi_add()"). */
+	for_each_online_cpu(cpu)
+		smp_call_function_single(cpu, tile_net_update_cpu,
+					 (saw_channel ? dev : NULL), 1);
+
+	/* HACK: Allow packets to flow in the simulator. */
+	if (saw_channel)
+		sim_enable_mpipe_links(0, -1);
+
+	return 0;
+}
+
+/* Allocate and initialize mpipe buffer stacks, and register them in
+ * the mPIPE TLBs, for both small and large packet sizes.
+ * This routine supports tile_net_init_mpipe(), below.
+ */
+static int init_buffer_stacks(struct net_device *dev, int num_buffers)
+{
+	pte_t hash_pte = pte_set_home((pte_t) { 0 }, PAGE_HOME_HASH);
+	int rc;
+
+	/* Compute stack bytes; we round up to 64KB and then use
+	 * alloc_pages() so we get the required 64KB alignment as well.
+	 */
+	buffer_stack_size =
+		ALIGN(gxio_mpipe_calc_buffer_stack_bytes(num_buffers),
+		      64 * 1024);
+
+	/* Allocate two buffer stack indices. */
+	rc = gxio_mpipe_alloc_buffer_stacks(&context, 2, 0, 0);
+	if (rc < 0) {
+		netdev_err(dev, "gxio_mpipe_alloc_buffer_stacks failed: %d\n",
+			   rc);
+		return rc;
+	}
+	small_buffer_stack = rc;
+	large_buffer_stack = rc + 1;
+
+	/* Allocate the small memory stack. */
+	small_buffer_stack_va =
+		alloc_pages_exact(buffer_stack_size, GFP_KERNEL);
+	if (small_buffer_stack_va == NULL) {
+		netdev_err(dev,
+			   "Could not alloc %zd bytes for buffer stacks\n",
+			   buffer_stack_size);
+		return -ENOMEM;
+	}
+	rc = gxio_mpipe_init_buffer_stack(&context, small_buffer_stack,
+					  GXIO_MPIPE_BUFFER_SIZE_128,
+					  small_buffer_stack_va,
+					  buffer_stack_size, 0);
+	if (rc != 0) {
+		netdev_err(dev, "gxio_mpipe_init_buffer_stack: %d\n", rc);
+		return rc;
+	}
+	rc = gxio_mpipe_register_client_memory(&context, small_buffer_stack,
+					       hash_pte, 0);
+	if (rc != 0) {
+		netdev_err(dev,
+			   "gxio_mpipe_register_buffer_memory failed: %d\n",
+			   rc);
+		return rc;
+	}
+
+	/* Allocate the large buffer stack. */
+	large_buffer_stack_va =
+		alloc_pages_exact(buffer_stack_size, GFP_KERNEL);
+	if (large_buffer_stack_va == NULL) {
+		netdev_err(dev,
+			   "Could not alloc %zd bytes for buffer stacks\n",
+			   buffer_stack_size);
+		return -ENOMEM;
+	}
+	rc = gxio_mpipe_init_buffer_stack(&context, large_buffer_stack,
+					  GXIO_MPIPE_BUFFER_SIZE_1664,
+					  large_buffer_stack_va,
+					  buffer_stack_size, 0);
+	if (rc != 0) {
+		netdev_err(dev, "gxio_mpipe_init_buffer_stack failed: %d\n",
+			   rc);
+		return rc;
+	}
+	rc = gxio_mpipe_register_client_memory(&context, large_buffer_stack,
+					       hash_pte, 0);
+	if (rc != 0) {
+		netdev_err(dev,
+			   "gxio_mpipe_register_buffer_memory failed: %d\n",
+			   rc);
+		return rc;
+	}
+
+	return 0;
+}
+
+/* Allocate per-cpu resources (memory for completions and idescs).
+ * This routine supports tile_net_init_mpipe(), below.
+ */
+static int alloc_percpu_mpipe_resources(struct net_device *dev,
+					int cpu, int ring)
+{
+	struct tile_net_info *info = &per_cpu(per_cpu_info, cpu);
+	int order, i, rc;
+	struct page *page;
+	void *addr;
+
+	/* Allocate the "comps". */
+	order = get_order(COMPS_SIZE);
+	page = homecache_alloc_pages(GFP_KERNEL, order, cpu);
+	if (page == NULL) {
+		netdev_err(dev, "Failed to alloc %zd bytes comps memory\n",
+			   COMPS_SIZE);
+		return -ENOMEM;
+	}
+	addr = pfn_to_kaddr(page_to_pfn(page));
+	memset(addr, 0, COMPS_SIZE);
+	for (i = 0; i < TILE_NET_CHANNELS; i++)
+		info->comps_for_echannel[i] =
+			addr + i * sizeof(struct tile_net_comps);
+
+	/* If this is a network cpu, create an iqueue. */
+	if (cpu_isset(cpu, network_cpus_map)) {
+		order = get_order(NOTIF_RING_SIZE);
+		page = homecache_alloc_pages(GFP_KERNEL, order, cpu);
+		if (page == NULL) {
+			netdev_err(dev,
+				   "Failed to alloc %zd bytes iqueue memory\n",
+				   NOTIF_RING_SIZE);
+			return -ENOMEM;
+		}
+		addr = pfn_to_kaddr(page_to_pfn(page));
+		rc = gxio_mpipe_iqueue_init(&info->iqueue, &context, ring,
+					    addr, NOTIF_RING_SIZE, 0);
+		if (rc != 0) {
+			netdev_err(dev,
+				   "gxio_mpipe_iqueue_init failed: %d\n", rc);
+			return rc;
+		}
+		info->has_iqueue = true;
+	}
+
+	return 0;
+}
+
+/* Initialize NotifGroup and buckets.
+ * This routine supports tile_net_init_mpipe(), below.
+ */
+static int init_notif_group_and_buckets(struct net_device *dev,
+					int ring, int network_cpus_count)
+{
+	int group, rc;
+
+	/* Allocate one NotifGroup. */
+	rc = gxio_mpipe_alloc_notif_groups(&context, 1, 0, 0);
+	if (rc < 0) {
+		netdev_err(dev, "gxio_mpipe_alloc_notif_groups failed: %d\n",
+			   rc);
+		return rc;
+	}
+	group = rc;
+
+	/* Initialize global num_buckets value. */
+	if (network_cpus_count > 4)
+		num_buckets = 256;
+	else if (network_cpus_count > 1)
+		num_buckets = 16;
+
+	/* Allocate some buckets, and set global first_bucket value. */
+	rc = gxio_mpipe_alloc_buckets(&context, num_buckets, 0, 0);
+	if (rc < 0) {
+		netdev_err(dev, "gxio_mpipe_alloc_buckets failed: %d\n", rc);
+		return rc;
+	}
+	first_bucket = rc;
+
+	/* Init group and buckets. */
+	rc = gxio_mpipe_init_notif_group_and_buckets(
+		&context, group, ring, network_cpus_count,
+		first_bucket, num_buckets,
+		GXIO_MPIPE_BUCKET_STICKY_FLOW_LOCALITY);
+	if (rc != 0) {
+		netdev_err(
+			dev,
+			"gxio_mpipe_init_notif_group_and_buckets failed: %d\n",
+			rc);
+		return rc;
+	}
+
+	return 0;
+}
+
+/* Create an irq and register it, then activate the irq and request
+ * interrupts on all cores.  Note that "ingress_irq" being initialized
+ * is how we know not to call tile_net_init_mpipe() again.
+ * This routine supports tile_net_init_mpipe(), below.
+ */
+static int tile_net_setup_interrupts(struct net_device *dev)
+{
+	int cpu, rc;
+
+	rc = create_irq();
+	if (rc < 0) {
+		netdev_err(dev, "create_irq failed: %d\n", rc);
+		return rc;
+	}
+	ingress_irq = rc;
+	tile_irq_activate(ingress_irq, TILE_IRQ_PERCPU);
+	rc = request_irq(ingress_irq, tile_net_handle_ingress_irq,
+			 0, NULL, NULL);
+	if (rc != 0) {
+		netdev_err(dev, "request_irq failed: %d\n", rc);
+		destroy_irq(ingress_irq);
+		ingress_irq = -1;
+		return rc;
+	}
+
+	for_each_online_cpu(cpu) {
+		struct tile_net_info *info = &per_cpu(per_cpu_info, cpu);
+		if (info->has_iqueue) {
+			gxio_mpipe_request_notif_ring_interrupt(
+				&context, cpu_x(cpu), cpu_y(cpu),
+				1, ingress_irq, info->iqueue.ring);
+		}
+	}
+
+	return 0;
+}
+
+/* Undo any state set up partially by a failed call to tile_net_init_mpipe. */
+static void tile_net_init_mpipe_fail(void)
+{
+	int cpu;
+
+	/* Do cleanups that require the mpipe context first. */
+	if (small_buffer_stack >= 0)
+		tile_net_pop_all_buffers(small_buffer_stack);
+	if (large_buffer_stack >= 0)
+		tile_net_pop_all_buffers(large_buffer_stack);
+
+	/* Destroy mpipe context so the hardware no longer owns any memory. */
+	gxio_mpipe_destroy(&context);
+
+	for_each_online_cpu(cpu) {
+		struct tile_net_info *info = &per_cpu(per_cpu_info, cpu);
+		free_pages((unsigned long)(info->comps_for_echannel[0]),
+			   get_order(COMPS_SIZE));
+		info->comps_for_echannel[0] = NULL;
+		free_pages((unsigned long)(info->iqueue.idescs),
+			   get_order(NOTIF_RING_SIZE));
+		info->iqueue.idescs = NULL;
+	}
+
+	if (small_buffer_stack_va)
+		free_pages_exact(small_buffer_stack_va, buffer_stack_size);
+	if (large_buffer_stack_va)
+		free_pages_exact(large_buffer_stack_va, buffer_stack_size);
+
+	small_buffer_stack_va = NULL;
+	large_buffer_stack_va = NULL;
+	large_buffer_stack = -1;
+	small_buffer_stack = -1;
+	first_bucket = -1;
+}
+
+/* The first time any tilegx network device is opened, we initialize
+ * the global mpipe state.  If this step fails, we fail to open the
+ * device, but if it succeeds, we never need to do it again, and since
+ * tile_net can't be unloaded, we never undo it.
+ *
+ * Note that some resources in this path (buffer stack indices,
+ * bindings from init_buffer_stack, etc.) are hypervisor resources
+ * that are freed implicitly by gxio_mpipe_destroy().
+ */
+static int tile_net_init_mpipe(struct net_device *dev)
+{
+	int i, num_buffers, rc;
+	int cpu;
+	int first_ring, ring;
+	int network_cpus_count = cpus_weight(network_cpus_map);
+
+	if (!hash_default) {
+		netdev_err(dev, "Networking requires hash_default!\n");
+		return -EIO;
+	}
+
+	rc = gxio_mpipe_init(&context, 0);
+	if (rc != 0) {
+		netdev_err(dev, "gxio_mpipe_init failed: %d\n", rc);
+		return -EIO;
+	}
+
+	/* Set up the buffer stacks. */
+	num_buffers =
+		network_cpus_count * (IQUEUE_ENTRIES + TILE_NET_BATCH);
+	rc = init_buffer_stacks(dev, num_buffers);
+	if (rc != 0)
+		goto fail;
+
+	/* Provide initial buffers. */
+	rc = -ENOMEM;
+	for (i = 0; i < num_buffers; i++) {
+		if (!tile_net_provide_buffer(true)) {
+			netdev_err(dev, "Cannot allocate initial sk_bufs!\n");
+			goto fail;
+		}
+	}
+	for (i = 0; i < num_buffers; i++) {
+		if (!tile_net_provide_buffer(false)) {
+			netdev_err(dev, "Cannot allocate initial sk_bufs!\n");
+			goto fail;
+		}
+	}
+
+	/* Allocate one NotifRing for each network cpu. */
+	rc = gxio_mpipe_alloc_notif_rings(&context, network_cpus_count, 0, 0);
+	if (rc < 0) {
+		netdev_err(dev, "gxio_mpipe_alloc_notif_rings failed %d\n",
+			   rc);
+		goto fail;
+	}
+
+	/* Init NotifRings per-cpu. */
+	first_ring = rc;
+	ring = first_ring;
+	for_each_online_cpu(cpu) {
+		rc = alloc_percpu_mpipe_resources(dev, cpu, ring++);
+		if (rc != 0)
+			goto fail;
+	}
+
+	/* Initialize NotifGroup and buckets. */
+	rc = init_notif_group_and_buckets(dev, first_ring, network_cpus_count);
+	if (rc != 0)
+		goto fail;
+
+	/* Create and enable interrupts. */
+	rc = tile_net_setup_interrupts(dev);
+	if (rc != 0)
+		goto fail;
+
+	return 0;
+
+fail:
+	tile_net_init_mpipe_fail();
+	return rc;
+}
+
+/* Create persistent egress info for a given egress channel.
+ * Note that this may be shared between, say, "gbe0" and "xgbe0".
+ * ISSUE: Defer header allocation until TSO is actually needed?
+ */
+static int tile_net_init_egress(struct net_device *dev, int echannel)
+{
+	struct page *headers_page, *edescs_page, *equeue_page;
+	gxio_mpipe_edesc_t *edescs;
+	gxio_mpipe_equeue_t *equeue;
+	unsigned char *headers;
+	int headers_order, edescs_order, equeue_order;
+	size_t edescs_size;
+	int edma;
+	int rc = -ENOMEM;
+
+	/* Only initialize once. */
+	if (egress_for_echannel[echannel].equeue != NULL)
+		return 0;
+
+	/* Allocate memory for the "headers". */
+	headers_order = get_order(EQUEUE_ENTRIES * HEADER_BYTES);
+	headers_page = alloc_pages(GFP_KERNEL, headers_order);
+	if (headers_page == NULL) {
+		netdev_warn(dev,
+			    "Could not alloc %zd bytes for TSO headers.\n",
+			    PAGE_SIZE << headers_order);
+		goto fail;
+	}
+	headers = pfn_to_kaddr(page_to_pfn(headers_page));
+
+	/* Allocate memory for the "edescs". */
+	edescs_size = EQUEUE_ENTRIES * sizeof(*edescs);
+	edescs_order = get_order(edescs_size);
+	edescs_page = alloc_pages(GFP_KERNEL, edescs_order);
+	if (edescs_page == NULL) {
+		netdev_warn(dev,
+			    "Could not alloc %zd bytes for eDMA ring.\n",
+			    edescs_size);
+		goto fail_headers;
+	}
+	edescs = pfn_to_kaddr(page_to_pfn(edescs_page));
+
+	/* Allocate memory for the "equeue". */
+	equeue_order = get_order(sizeof(*equeue));
+	equeue_page = alloc_pages(GFP_KERNEL, equeue_order);
+	if (equeue_page == NULL) {
+		netdev_warn(dev,
+			    "Could not alloc %zd bytes for equeue info.\n",
+			    PAGE_SIZE << equeue_order);
+		goto fail_edescs;
+	}
+	equeue = pfn_to_kaddr(page_to_pfn(equeue_page));
+
+	/* Allocate an edma ring.  Note that in practice this can't
+	 * fail, which is good, because we will leak an edma ring if so.
+	 */
+	rc = gxio_mpipe_alloc_edma_rings(&context, 1, 0, 0);
+	if (rc < 0) {
+		netdev_warn(dev, "gxio_mpipe_alloc_edma_rings failed: %d\n",
+			    rc);
+		goto fail_equeue;
+	}
+	edma = rc;
+
+	/* Initialize the equeue. */
+	rc = gxio_mpipe_equeue_init(equeue, &context, edma, echannel,
+				    edescs, edescs_size, 0);
+	if (rc != 0) {
+		netdev_err(dev, "gxio_mpipe_equeue_init failed: %d\n", rc);
+		goto fail_equeue;
+	}
+
+	/* Done. */
+	egress_for_echannel[echannel].equeue = equeue;
+	egress_for_echannel[echannel].headers = headers;
+	return 0;
+
+fail_equeue:
+	__free_pages(equeue_page, equeue_order);
+
+fail_edescs:
+	__free_pages(edescs_page, edescs_order);
+
+fail_headers:
+	__free_pages(headers_page, headers_order);
+
+fail:
+	return rc;
+}
+
+/* Return channel number for a newly-opened link. */
+static int tile_net_link_open(struct net_device *dev, gxio_mpipe_link_t *link,
+			      const char *link_name)
+{
+	int rc = gxio_mpipe_link_open(link, &context, link_name, 0);
+	if (rc < 0) {
+		netdev_err(dev, "Failed to open '%s'\n", link_name);
+		return rc;
+	}
+	rc = gxio_mpipe_link_channel(link);
+	if (rc < 0 || rc >= TILE_NET_CHANNELS) {
+		netdev_err(dev, "gxio_mpipe_link_channel bad value: %d\n", rc);
+		gxio_mpipe_link_close(link);
+		return -EINVAL;
+	}
+	return rc;
+}
+
+/* Help the kernel activate the given network interface. */
+static int tile_net_open(struct net_device *dev)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+	int rc;
+
+	mutex_lock(&tile_net_devs_for_channel_mutex);
+
+	/* Do one-time initialization the first time any device is opened. */
+	if (ingress_irq < 0) {
+		rc = tile_net_init_mpipe(dev);
+		if (rc != 0)
+			goto fail;
+	}
+
+	/* Determine if this is the "loopify" device. */
+	if (unlikely((loopify_link_name != NULL) &&
+		     !strcmp(dev->name, loopify_link_name))) {
+		rc = tile_net_link_open(dev, &priv->link, "loop0");
+		if (rc < 0)
+			goto fail;
+		priv->channel = rc;
+		rc = tile_net_link_open(dev, &priv->loopify_link, "loop1");
+		if (rc < 0)
+			goto fail;
+		priv->loopify_channel = rc;
+		priv->echannel = rc;
+	} else {
+		rc = tile_net_link_open(dev, &priv->link, dev->name);
+		if (rc < 0)
+			goto fail;
+		priv->channel = rc;
+		priv->echannel = rc;
+	}
+
+	/* Initialize egress info (if needed).  Once ever, per echannel. */
+	rc = tile_net_init_egress(dev, priv->echannel);
+	if (rc != 0)
+		goto fail;
+
+	tile_net_devs_for_channel[priv->channel] = dev;
+
+	rc = tile_net_update(dev);
+	if (rc != 0)
+		goto fail;
+
+	mutex_unlock(&tile_net_devs_for_channel_mutex);
+
+	netif_start_queue(dev);
+	netif_carrier_on(dev);
+	return 0;
+
+fail:
+	if (priv->loopify_channel >= 0) {
+		if (gxio_mpipe_link_close(&priv->loopify_link) != 0)
+			netdev_warn(dev, "Failed to close loopify link!\n");
+		priv->loopify_channel = -1;
+	}
+	if (priv->channel >= 0) {
+		if (gxio_mpipe_link_close(&priv->link) != 0)
+			netdev_warn(dev, "Failed to close link!\n");
+		priv->channel = -1;
+	}
+	priv->echannel = -1;
+	tile_net_devs_for_channel[priv->channel] = NULL;
+	mutex_unlock(&tile_net_devs_for_channel_mutex);
+
+	/* Don't return raw gxio error codes to generic Linux. */
+	return (rc > -512) ? rc : -EIO;
+}
+
+/* Help the kernel deactivate the given network interface. */
+static int tile_net_stop(struct net_device *dev)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+
+	netif_stop_queue(dev);
+
+	mutex_lock(&tile_net_devs_for_channel_mutex);
+	tile_net_devs_for_channel[priv->channel] = NULL;
+	(void)tile_net_update(dev);
+	if (priv->loopify_channel >= 0) {
+		if (gxio_mpipe_link_close(&priv->loopify_link) != 0)
+			netdev_warn(dev, "Failed to close loopify link!\n");
+		priv->loopify_channel = -1;
+	}
+	if (priv->channel >= 0) {
+		if (gxio_mpipe_link_close(&priv->link) != 0)
+			netdev_warn(dev, "Failed to close link!\n");
+		priv->channel = -1;
+	}
+	priv->echannel = -1;
+	mutex_unlock(&tile_net_devs_for_channel_mutex);
+
+	return 0;
+}
+
+/* Determine the VA for a fragment. */
+static inline void *tile_net_frag_buf(skb_frag_t *f)
+{
+	unsigned long pfn = page_to_pfn(skb_frag_page(f));
+	return pfn_to_kaddr(pfn) + f->page_offset;
+}
+
+/* Determine how many edesc's are needed for TSO.
+ *
+ * Sometimes, if "sendfile()" requires copying, we will be called with
+ * "data" containing the header and payload, with "frags" being empty.
+ * Sometimes, for example when using NFS over TCP, a single segment can
+ * span 3 fragments.  This requires special care.
+ */
+static int tso_count_edescs(struct sk_buff *skb)
+{
+	struct skb_shared_info *sh = skb_shinfo(skb);
+	unsigned int len = skb->len;
+	unsigned int p_len = sh->gso_size;
+	long f_id = -1;    /* id of the current fragment */
+	long f_size = -1;  /* size of the current fragment */
+	long f_used = -1;  /* bytes used from the current fragment */
+	long n;            /* size of the current piece of payload */
+	int num_edescs = 0;
+	int segment;
+
+	for (segment = 0; segment < sh->gso_segs; segment++) {
+
+		unsigned int p_used = 0;
+
+		/* The last segment may be less than gso_size. */
+		len -= p_len;
+		if (len < p_len)
+			p_len = len;
+
+		/* One edesc for header and for each piece of the payload. */
+		for (num_edescs++; p_used < p_len; num_edescs++) {
+
+			/* Advance as needed. */
+			while (f_used >= f_size) {
+				f_id++;
+				f_size = sh->frags[f_id].size;
+				f_used = 0;
+			}
+
+			/* Use bytes from the current fragment. */
+			n = p_len - p_used;
+			if (n > f_size - f_used)
+				n = f_size - f_used;
+			f_used += n;
+			p_used += n;
+		}
+	}
+
+	return num_edescs;
+}
+
+/* Prepare modified copies of the skbuff headers.
+ * FIXME (bug 11489): add support for IPv6.
+ */
+static void tso_headers_prepare(struct sk_buff *skb, unsigned char *headers,
+				s64 slot)
+{
+	struct skb_shared_info *sh = skb_shinfo(skb);
+	struct iphdr *ih;
+	struct tcphdr *th;
+	unsigned int len = skb->len;
+	unsigned char *data = skb->data;
+	unsigned int ih_off, th_off, sh_len, total_len, p_len;
+	unsigned int isum_start, tsum_start, id, seq;
+	long f_id = -1;    /* id of the current fragment */
+	long f_size = -1;  /* size of the current fragment */
+	long f_used = -1;  /* bytes used from the current fragment */
+	long n;            /* size of the current piece of payload */
+	int segment;
+
+	/* Locate original headers and compute various lengths. */
+	ih = ip_hdr(skb);
+	th = tcp_hdr(skb);
+	ih_off = (unsigned char *)ih - data;
+	th_off = (unsigned char *)th - data;
+	sh_len = th_off + tcp_hdrlen(skb);
+	p_len = sh->gso_size;
+	total_len = p_len + sh_len;
+
+	/* Set up seed values for IP and TCP csum and initialize id and seq. */
+	isum_start = ((0xFFFF - ih->check) +
+		      (0xFFFF - ih->tot_len) +
+		      (0xFFFF - ih->id));
+	tsum_start = th->check + (0xFFFF ^ htons(len));
+	id = ntohs(ih->id);
+	seq = ntohl(th->seq);
+
+	/* Prepare all the headers. */
+	for (segment = 0; segment < sh->gso_segs; segment++) {
+		unsigned char *buf;
+		unsigned int p_used = 0;
+
+		/* The last segment may be less than gso_size. */
+		len -= p_len;
+		if (len < p_len) {
+			p_len = len;
+			total_len = p_len + sh_len;
+		}
+
+		/* Copy to the header memory for this segment. */
+		buf = headers + (slot % EQUEUE_ENTRIES) * HEADER_BYTES +
+			NET_IP_ALIGN;
+		memcpy(buf, data, sh_len);
+
+		/* Update copied ip header. */
+		ih = (struct iphdr *)(buf + ih_off);
+		ih->tot_len = htons(total_len - ih_off);
+		ih->id = htons(id);
+		ih->check = csum_long(isum_start + htons(total_len - ih_off) +
+				      htons(id)) ^ 0xffff;
+
+		/* Update copied tcp header. */
+		th = (struct tcphdr *)(buf + th_off);
+		th->seq = htonl(seq);
+		th->check = csum_long(tsum_start + htons(total_len));
+		if (segment != sh->gso_segs - 1) {
+			th->fin = 0;
+			th->psh = 0;
+		}
+
+		/* Skip past the header. */
+		slot++;
+
+		/* Skip past the payload. */
+		while (p_used < p_len) {
+
+			/* Advance as needed. */
+			while (f_used >= f_size) {
+				f_id++;
+				f_size = sh->frags[f_id].size;
+				f_used = 0;
+			}
+
+			/* Use bytes from the current fragment. */
+			n = p_len - p_used;
+			if (n > f_size - f_used)
+				n = f_size - f_used;
+			f_used += n;
+			p_used += n;
+
+			slot++;
+		}
+
+		id++;
+		seq += p_len;
+	}
+
+	/* Flush the headers so they are ready for hardware DMA. */
+	wmb();
+}
+
+/* Pass all the data to mpipe for egress. */
+static void tso_egress(struct net_device *dev, gxio_mpipe_equeue_t *equeue,
+		       struct sk_buff *skb, unsigned char *headers, s64 slot)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+	struct skb_shared_info *sh = skb_shinfo(skb);
+	unsigned int len = skb->len;
+	unsigned int p_len = sh->gso_size;
+	gxio_mpipe_edesc_t edesc_head = { { 0 } };
+	gxio_mpipe_edesc_t edesc_body = { { 0 } };
+	long f_id = -1;    /* id of the current fragment */
+	long f_size = -1;  /* size of the current fragment */
+	long f_used = -1;  /* bytes used from the current fragment */
+	long n;            /* size of the current piece of payload */
+	unsigned long tx_packets = 0, tx_bytes = 0;
+	unsigned int csum_start, sh_len;
+	int segment;
+	
+	/* Prepare to egress the headers: set up header edesc. */
+	csum_start = skb_checksum_start_offset(skb);
+	sh_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	edesc_head.csum = 1;
+	edesc_head.csum_start = csum_start;
+	edesc_head.csum_dest = csum_start + skb->csum_offset;
+	edesc_head.xfer_size = sh_len;
+
+	/* This is only used to specify the TLB. */
+	edesc_head.stack_idx = large_buffer_stack;
+	edesc_body.stack_idx = large_buffer_stack;
+
+	/* Egress all the edescs. */
+	for (segment = 0; segment < sh->gso_segs; segment++) {
+		void *va;
+		unsigned char *buf;
+		unsigned int p_used = 0;
+
+		/* The last segment may be less than gso_size. */
+		len -= p_len;
+		if (len < p_len)
+			p_len = len;
+
+		/* Egress the header. */
+		buf = headers + (slot % EQUEUE_ENTRIES) * HEADER_BYTES +
+			NET_IP_ALIGN;
+		edesc_head.va = va_to_tile_io_addr(buf);
+		gxio_mpipe_equeue_put_at(equeue, edesc_head, slot);
+		slot++;
+
+		/* Egress the payload. */
+		while (p_used < p_len) {
+
+			/* Advance as needed. */
+			while (f_used >= f_size) {
+				f_id++;
+				f_size = sh->frags[f_id].size;
+				f_used = 0;
+			}
+
+			va = tile_net_frag_buf(&sh->frags[f_id]) + f_used;
+
+			/* Use bytes from the current fragment. */
+			n = p_len - p_used;
+			if (n > f_size - f_used)
+				n = f_size - f_used;
+			f_used += n;
+			p_used += n;
+
+			/* Egress a piece of the payload. */
+			edesc_body.va = va_to_tile_io_addr(va);
+			edesc_body.xfer_size = n;
+			edesc_body.bound = !(p_used < p_len);
+			gxio_mpipe_equeue_put_at(equeue, edesc_body, slot);
+			slot++;
+		}
+
+		tx_packets++;
+		tx_bytes += sh_len + p_len;
+	}
+
+	/* Update stats. */
+	tile_net_stats_add(tx_packets, &priv->stats.tx_packets);
+	tile_net_stats_add(tx_bytes, &priv->stats.tx_bytes);
+}
+
+/* Do TSO handling for egress. */
+static int tile_net_tx_tso(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+	int channel = priv->echannel;
+	struct tile_net_egress *egress = &egress_for_echannel[channel];
+	gxio_mpipe_equeue_t *equeue = egress->equeue;
+	unsigned int num_edescs;
+	unsigned long irqflags;
+	s64 slot;
+
+	/* Determine how many mpipe edesc's are needed. */
+	num_edescs = tso_count_edescs(skb);
+
+	local_irq_save(irqflags);
+
+	/* Set first reserved egress slot; see comment in tile_net_tx(). */
+	slot = gxio_mpipe_equeue_try_reserve(equeue, num_edescs);
+
+	/* Set up copies of header data properly. */
+	tso_headers_prepare(skb, egress->headers, slot);
+
+	/* Actually pass the data to the network hardware. */
+	tso_egress(dev, equeue, skb, egress->headers, slot);
+
+	/* Add a completion record. */
+	add_comp(equeue, info->comps_for_echannel[channel],
+		 slot + num_edescs - 1, skb);
+
+	local_irq_restore(irqflags);
+
+	/* Make sure the egress timer is scheduled. */
+	tile_net_schedule_egress_timer(info);
+
+	return NETDEV_TX_OK;
+}
+
+/* Analyze the body and frags for a transmit request. */
+static unsigned int tile_net_tx_frags(struct frag *frags,
+				       struct sk_buff *skb,
+				       void *b_data, unsigned int b_len)
+{
+	unsigned int i, n = 0;
+
+	struct skb_shared_info *sh = skb_shinfo(skb);
+
+	if (b_len != 0) {
+		frags[n].buf = b_data;
+		frags[n++].length = b_len;
+	}
+
+	for (i = 0; i < sh->nr_frags; i++) {
+		skb_frag_t *f = &sh->frags[i];
+		frags[n].buf = tile_net_frag_buf(f);
+		frags[n++].length = skb_frag_size(f);
+	}
+
+	return n;
+}
+
+/* Help the kernel transmit a packet. */
+static int tile_net_tx(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+	struct tile_net_egress *egress = &egress_for_echannel[priv->echannel];
+	gxio_mpipe_equeue_t *equeue = egress->equeue;
+	struct tile_net_comps *comps =
+		info->comps_for_echannel[priv->echannel];
+	unsigned int len = skb->len;
+	unsigned char *data = skb->data;
+	unsigned int num_frags;
+	struct frag frags[MAX_FRAGS];
+	gxio_mpipe_edesc_t edescs[MAX_FRAGS];
+	unsigned long irqflags;
+	gxio_mpipe_edesc_t edesc = { { 0 } };
+	unsigned int i;
+	s64 slot;
+
+	/* Save the timestamp. */
+	dev->trans_start = jiffies;
+
+	if (skb_is_gso(skb))
+		return tile_net_tx_tso(skb, dev);
+
+	/* NOTE: This is usually 2, sometimes 3, for big writes. */
+	num_frags = tile_net_tx_frags(frags, skb, data, skb_headlen(skb));
+
+	/* This is only used to specify the TLB. */
+	edesc.stack_idx = large_buffer_stack;
+
+	/* Prepare the edescs. */
+	for (i = 0; i < num_frags; i++) {
+		edesc.xfer_size = frags[i].length;
+		edesc.va = va_to_tile_io_addr(frags[i].buf);
+		edescs[i] = edesc;
+	}
+
+	/* Mark the final edesc. */
+	edescs[num_frags - 1].bound = 1;
+
+	/* Add checksum info to the initial edesc, if needed. */
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		unsigned int csum_start = skb_checksum_start_offset(skb);
+		edescs[0].csum = 1;
+		edescs[0].csum_start = csum_start;
+		edescs[0].csum_dest = csum_start + skb->csum_offset;
+	}
+
+	local_irq_save(irqflags);
+
+	/* Try to reserve slots for egress.  If we fail due to the
+	 * queue being full, we return NETDEV_TX_BUSY.  This may lead
+	 * to "Virtual device xxx asks to queue packet" warnings.
+	 *
+	 * We might consider retrying briefly here since we expect in
+	 * principle that egress slots become available quickly as the
+	 * hardware engine drains packets into the network.
+	 *
+	 * FIXME (bug# 11479): We should stop queues when they're full.
+	 * We may want to consider making tile_net be multiqueue with
+	 * one TX queue per CPU and ndo_select_queue defined
+	 * accordingly.  Initially we saw bad things happen when
+	 * stopping the queue, so we are continuing to work on this
+	 * for a future fix.
+	 */
+	slot = gxio_mpipe_equeue_try_reserve(equeue, num_frags);
+	if (slot < 0) {
+		local_irq_restore(irqflags);
+		return NETDEV_TX_BUSY;
+	}
+
+	for (i = 0; i < num_frags; i++)
+		gxio_mpipe_equeue_put_at(equeue, edescs[i], slot++);
+
+	/* Add a completion record. */
+	add_comp(equeue, comps, slot - 1, skb);
+
+	/* NOTE: Use ETH_ZLEN for short packets (e.g. 42 < 60). */
+	tile_net_stats_add(1, &priv->stats.tx_packets);
+	tile_net_stats_add(max(len, (unsigned int)ETH_ZLEN),
+			   &priv->stats.tx_bytes);
+
+	local_irq_restore(irqflags);
+
+	/* Make sure the egress timer is scheduled. */
+	tile_net_schedule_egress_timer(info);
+
+	return NETDEV_TX_OK;
+}
+
+/* Deal with a transmit timeout. */
+static void tile_net_tx_timeout(struct net_device *dev)
+{
+	netif_wake_queue(dev);
+}
+
+/* Ioctl commands. */
+static int tile_net_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
+{
+	return -EOPNOTSUPP;
+}
+
+/* Get system network statistics for device. */
+static struct net_device_stats *tile_net_get_stats(struct net_device *dev)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+	return &priv->stats;
+}
+
+/* Change the MTU. */
+static int tile_net_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if ((new_mtu < 68) || (new_mtu > 1500))
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+/* Change the Ethernet address of the NIC.
+ *
+ * The hypervisor driver does not support changing MAC address.  However,
+ * the hardware does not do anything with the MAC address, so the address
+ * which gets used on outgoing packets, and which is accepted on incoming
+ * packets, is completely up to us.
+ *
+ * Returns 0 on success, negative on failure.
+ */
+static int tile_net_set_mac_address(struct net_device *dev, void *p)
+{
+	struct sockaddr *addr = p;
+
+	if (!is_valid_ether_addr(addr->sa_data))
+		return -EINVAL;
+	memcpy(dev->dev_addr, addr->sa_data, dev->addr_len);
+	return 0;
+}
+
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/* Polling 'interrupt' - used by things like netconsole to send skbs
+ * without having to re-enable interrupts. It's not called while
+ * the interrupt routine is executing.
+ */
+static void tile_net_netpoll(struct net_device *dev)
+{
+	disable_percpu_irq(ingress_irq);
+	tile_net_handle_ingress_irq(ingress_irq, NULL);
+	enable_percpu_irq(ingress_irq, 0);
+}
+#endif
+
+static const struct net_device_ops tile_net_ops = {
+	.ndo_open = tile_net_open,
+	.ndo_stop = tile_net_stop,
+	.ndo_start_xmit = tile_net_tx,
+	.ndo_do_ioctl = tile_net_ioctl,
+	.ndo_get_stats = tile_net_get_stats,
+	.ndo_change_mtu = tile_net_change_mtu,
+	.ndo_tx_timeout = tile_net_tx_timeout,
+	.ndo_set_mac_address = tile_net_set_mac_address,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller = tile_net_netpoll,
+#endif
+};
+
+/* The setup function.
+ *
+ * This uses ether_setup() to assign various fields in dev, including
+ * setting IFF_BROADCAST and IFF_MULTICAST, then sets some extra fields.
+ */
+static void tile_net_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+	dev->netdev_ops = &tile_net_ops;
+	dev->watchdog_timeo = TILE_NET_TIMEOUT;
+	dev->features |= NETIF_F_LLTX;
+	dev->features |= NETIF_F_HW_CSUM;
+	dev->features |= NETIF_F_SG;
+	dev->features |= NETIF_F_TSO;
+	dev->tx_queue_len = 0;
+	dev->mtu = 1500;
+}
+
+/* Allocate the device structure, register the device, and obtain the
+ * MAC address from the hypervisor.
+ */
+static void tile_net_dev_init(const char *name, const uint8_t *mac)
+{
+	int ret;
+	int i;
+	int nz_addr = 0;
+	struct net_device *dev;
+	struct tile_net_priv *priv;
+
+	/* HACK: Ignore "loop" links. */
+	if (strncmp(name, "loop", 4) == 0)
+		return;
+
+	/* Allocate the device structure.  Normally, "name" is a
+	 * template, instantiated by register_netdev(), but not for us.
+	 */
+	dev = alloc_netdev(sizeof(*priv), name, tile_net_setup);
+	if (!dev) {
+		pr_err("alloc_netdev(%s) failed\n", name);
+		return;
+	}
+
+	/* Initialize "priv". */
+	priv = netdev_priv(dev);
+	memset(priv, 0, sizeof(*priv));
+	priv->dev = dev;
+	priv->channel = -1;
+	priv->loopify_channel = -1;
+	priv->echannel = -1;
+
+	/* Get the MAC address and set it in the device struct; this must
+	 * be done before the device is opened.  If the MAC is all zeroes,
+	 * we use a random address, since we're probably on the simulator.
+	 */
+	for (i = 0; i < 6; i++)
+		nz_addr |= mac[i];
+
+	if (nz_addr) {
+		memcpy(dev->dev_addr, mac, 6);
+		dev->addr_len = 6;
+	} else {
+		random_ether_addr(dev->dev_addr);
+	}
+
+	/* Register the network device. */
+	ret = register_netdev(dev);
+	if (ret) {
+		netdev_err(dev, "register_netdev failed %d\n", ret);
+		free_netdev(dev);
+		return;
+	}
+}
+
+/* Per-cpu module initialization. */
+static void tile_net_init_module_percpu(void *unused)
+{
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+	int my_cpu = smp_processor_id();
+
+	info->has_iqueue = false;
+
+	info->my_cpu = my_cpu;
+
+	/* Initialize the egress timer. */
+	init_timer(&info->egress_timer);
+	info->egress_timer.data = (long)info;
+	info->egress_timer.function = tile_net_handle_egress_timer;
+}
+
+/* Module initialization. */
+static int __init tile_net_init_module(void)
+{
+	int i;
+	char name[GXIO_MPIPE_LINK_NAME_LEN];
+	uint8_t mac[6];
+
+	pr_info("Tilera Network Driver\n");
+
+	mutex_init(&tile_net_devs_for_channel_mutex);
+
+	/* Initialize each CPU. */
+	on_each_cpu(tile_net_init_module_percpu, NULL, 1);
+
+	/* Find out what devices we have, and initialize them. */
+	for (i = 0; gxio_mpipe_link_enumerate_mac(i, name, mac) >= 0; i++)
+		tile_net_dev_init(name, mac);
+
+	if (!network_cpus_init())
+		network_cpus_map = *cpu_online_mask;
+
+	return 0;
+}
+
+module_init(tile_net_init_module);
-- 
1.6.5.2

^ permalink raw reply related

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Alexander Duyck @ 2012-05-23 21:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <4FBD1A0C.4070606@intel.com>

On 05/23/2012 10:10 AM, Alexander Duyck wrote:
> On 05/23/2012 09:39 AM, Eric Dumazet wrote:
>> On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:
>>
>>> With current driver, a MTU=1500 frame uses :
>>>
>>> sk_buff (256 bytes)
>>> skb->head : 1024 bytes  (or more exaclty now : 512 + 384)
>> By the way, NET_SKB_PAD adds 64 bytes so its 64 + 512 + 384 = 960
> Actually pahole seems to be indicating to me the size of skb_shared_info
> is 320, unless something has changed in the last few days.
>
> When I get a chance I will try to remember to reduce the ixgbe header
> size to 256 which should also help.  The only reason it is set to 512
> was to deal with the fact that the old alloc_skb code wasn't aligning
> the shared info with the end of whatever size was allocated and so the
> 512 was an approximation to make better use of the 1K slab allocation
> back when we still were using hardware packet split.  That should help
> to improve the page utilization for the headers since that would
> increase the uses of a page from 4 to 6 for the skb head frag, and it
> would drop truesize by another 256 bytes.
>
> Thanks,
>
> Alex
Here is the patch for review.  I have submitted the official patch to Jeff 
so that it can go through his tree for testing, validation, and submission 
once Dave's tree opens back up.

---

The recent changes to netdev_alloc_skb actually make it so that the size of
the buffer now actually has a more direct input on the truesize.  So in
order to make best use of the piece of a page we are allocated I am
reducing the IXGBE_RX_HDR_SIZE to 256 so that our truesize will be reduced
by 256 bytes as well.

This should result in performance improvements since the number of uses per
page should increase from 4 to 6 in the case of a 4K page.  In addition we
should see socket performance improvements due to the truesize dropping
to less than 1K for buffers less than 256 bytes.

Not-Yet-Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   15 ++++++++-------
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 ++--
 2 files changed, 10 insertions(+), 9 deletions(-)


diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 402dd66..468e4ab 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -77,17 +77,18 @@
 #define IXGBE_MAX_FCPAUSE		 0xFFFF
 
 /* Supported Rx Buffer Sizes */
-#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
+#define IXGBE_RXBUFFER_256    256  /* Used for skb receive header */
 #define IXGBE_MAX_RXBUFFER  16384  /* largest size for a single descriptor */
 
 /*
- * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN mans we
- * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
- * this adds up to 512 bytes of extra data meaning the smallest allocation
- * we could have is 1K.
- * i.e. RXBUFFER_512 --> size-1024 slab
+ * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
+ * reserve 64 more, and skb_shared_info adds an additional 320 bytes more,
+ * this adds up to 448 bytes of extra data.
+ *
+ * Since netdev_alloc_skb now allocates a page fragment we can use a value
+ * of 256 and the resultant skb will have a truesize of 960 or less.
  */
-#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512
+#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_256
 
 #define MAXIMUM_ETHERNET_VLAN_SIZE (ETH_FRAME_LEN + ETH_FCS_LEN + VLAN_HLEN)
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7f92e40..f92b31a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1520,8 +1520,8 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
 	pull_len = skb_frag_size(frag);
-	if (pull_len > 256)
-		pull_len = ixgbe_get_headlen(va, pull_len);
+	if (pull_len > IXGBE_RX_HDR_SIZE)
+		pull_len = ixgbe_get_headlen(va, IXGBE_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));

^ permalink raw reply related

* [PATCH] fec: Add support for Coldfire M5441x enet-mac.
From: Steven King @ 2012-05-23 21:35 UTC (permalink / raw)
  To: netdev; +Cc: uClinux development list, gerg

Add support for the Freescale Coldfire M5441x; as these parts have an enet-mac,
add a quirk to distinguish them from the other Coldfire parts so we can use 
the existing enet-mac support.

Signed-off-by: Steven King <sfking@fdwdc.com>
---
 drivers/net/ethernet/freescale/Kconfig |    5 ++---
 drivers/net/ethernet/freescale/fec.c   |    6 +++++-
 drivers/net/ethernet/freescale/fec.h   |    3 ++-
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/Kconfig b/drivers/net/ethernet/freescale/Kconfig
index 3574e14..f9aa244 100644
--- a/drivers/net/ethernet/freescale/Kconfig
+++ b/drivers/net/ethernet/freescale/Kconfig
@@ -7,7 +7,7 @@ config NET_VENDOR_FREESCALE
 	default y
 	depends on FSL_SOC || QUICC_ENGINE || CPM1 || CPM2 || PPC_MPC512x || \
 		   M523x || M527x || M5272 || M528x || M520x || M532x || \
-		   ARCH_MXC || ARCH_MXS || (PPC_MPC52xx && PPC_BESTCOMM)
+		   M5441x || ARCH_MXC || ARCH_MXS || (PPC_MPC52xx && PPC_BESTCOMM)
 	---help---
 	  If you have a network (Ethernet) card belonging to this class, say Y
 	  and read the Ethernet-HOWTO, available from
@@ -22,8 +22,7 @@ if NET_VENDOR_FREESCALE
 
 config FEC
 	tristate "FEC ethernet controller (of ColdFire and some i.MX CPUs)"
-	depends on (M523x || M527x || M5272 || M528x || M520x || M532x || \
-		   ARCH_MXC || SOC_IMX28)
+	depends on (M523x || M527x || M5272 || M528x || M520x || M532x || M5441x || ARCH_MXC || SOC_IMX28)
 	default ARCH_MXC || SOC_IMX28 if ARM
 	select PHYLIB
 	---help---
diff --git a/drivers/net/ethernet/freescale/fec.c b/drivers/net/ethernet/freescale/fec.c
index a12b3f5..4cb1c90 100644
--- a/drivers/net/ethernet/freescale/fec.c
+++ b/drivers/net/ethernet/freescale/fec.c
@@ -93,6 +93,9 @@ static struct platform_device_id fec_devtype[] = {
 		.name = "imx6q-fec",
 		.driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT,
 	}, {
+		.name = "enet-fec",
+		.driver_data = FEC_QUIRK_ENET_MAC,
+	}, {
 		/* sentinel */
 	}
 };
@@ -186,7 +189,8 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
  * account when setting it.
  */
 #if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
-    defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM)
+    defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM) || \
+    defined(CONFIG_M5441x)
 #define	OPT_FRAME_SIZE	(PKT_MAXBUF_SIZE << 16)
 #else
 #define	OPT_FRAME_SIZE	0
diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 8408c62..298cfb7 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -15,7 +15,8 @@
 
 #if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
     defined(CONFIG_M520x) || defined(CONFIG_M532x) || \
-    defined(CONFIG_ARCH_MXC) || defined(CONFIG_SOC_IMX28)
+    defined(CONFIG_ARCH_MXC) || defined(CONFIG_SOC_IMX28) || \
+    defined(CONFIG_M5441x)
 /*
  *	Just figures, Motorola would have to change the offsets for
  *	registers in the same peripheral device on different models

^ permalink raw reply related

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-23 21:37 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <4FBD546A.1030504@intel.com>

On Wed, 2012-05-23 at 14:19 -0700, Alexander Duyck wrote:
> On 05/23/2012 10:10 AM, Alexander Duyck wrote:
> > On 05/23/2012 09:39 AM, Eric Dumazet wrote:
> >> On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:
> >>
> >>> With current driver, a MTU=1500 frame uses :
> >>>
> >>> sk_buff (256 bytes)
> >>> skb->head : 1024 bytes  (or more exaclty now : 512 + 384)
> >> By the way, NET_SKB_PAD adds 64 bytes so its 64 + 512 + 384 = 960
> > Actually pahole seems to be indicating to me the size of skb_shared_info
> > is 320, unless something has changed in the last few days.
> >
> > When I get a chance I will try to remember to reduce the ixgbe header
> > size to 256 which should also help.  The only reason it is set to 512
> > was to deal with the fact that the old alloc_skb code wasn't aligning
> > the shared info with the end of whatever size was allocated and so the
> > 512 was an approximation to make better use of the 1K slab allocation
> > back when we still were using hardware packet split.  That should help
> > to improve the page utilization for the headers since that would
> > increase the uses of a page from 4 to 6 for the skb head frag, and it
> > would drop truesize by another 256 bytes.
> >
> > Thanks,
> >
> > Alex
> Here is the patch for review.  I have submitted the official patch to Jeff 
> so that it can go through his tree for testing, validation, and submission 
> once Dave's tree opens back up.
> 
> ---
> 
> The recent changes to netdev_alloc_skb actually make it so that the size of
> the buffer now actually has a more direct input on the truesize.  So in
> order to make best use of the piece of a page we are allocated I am
> reducing the IXGBE_RX_HDR_SIZE to 256 so that our truesize will be reduced
> by 256 bytes as well.
> 
> This should result in performance improvements since the number of uses per
> page should increase from 4 to 6 in the case of a 4K page.  In addition we
> should see socket performance improvements due to the truesize dropping
> to less than 1K for buffers less than 256 bytes.
> 
> Not-Yet-Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   15 ++++++++-------
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 ++--
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index 402dd66..468e4ab 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -77,17 +77,18 @@
>  #define IXGBE_MAX_FCPAUSE		 0xFFFF
>  
>  /* Supported Rx Buffer Sizes */
> -#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
> +#define IXGBE_RXBUFFER_256    256  /* Used for skb receive header */
>  #define IXGBE_MAX_RXBUFFER  16384  /* largest size for a single descriptor */
>  
>  /*
> - * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN mans we
> - * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
> - * this adds up to 512 bytes of extra data meaning the smallest allocation
> - * we could have is 1K.
> - * i.e. RXBUFFER_512 --> size-1024 slab
> + * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
> + * reserve 64 more, and skb_shared_info adds an additional 320 bytes more,
> + * this adds up to 448 bytes of extra data.
> + *
> + * Since netdev_alloc_skb now allocates a page fragment we can use a value
> + * of 256 and the resultant skb will have a truesize of 960 or less.
>   */
> -#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512
> +#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_256
>  
>  #define MAXIMUM_ETHERNET_VLAN_SIZE (ETH_FRAME_LEN + ETH_FCS_LEN + VLAN_HLEN)
>  
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 7f92e40..f92b31a 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1520,8 +1520,8 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
>  	 * 60 bytes if the skb->len is less than 60 for skb_pad.
>  	 */
>  	pull_len = skb_frag_size(frag);
> -	if (pull_len > 256)
> -		pull_len = ixgbe_get_headlen(va, pull_len);
> +	if (pull_len > IXGBE_RX_HDR_SIZE)
> +		pull_len = ixgbe_get_headlen(va, IXGBE_RX_HDR_SIZE);
>  
>  	/* align pull length to size of long to optimize memcpy performance */
>  	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
> 


By the way you should reword the comment about NET_IP_ALIGN

On x86 NET_IP_ALIGN is 0, so we dont 'reserve 64 bytes more'


-> 896 bytes

Also, are you sure :

srrctl |= (IXGBE_RX_HDR_SIZE << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &  
	IXGBE_SRRCTL_BSIZEHDR_MASK; 


is still needed in ixgbe_configure_srrctl() , since it uses
IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF (non packet split)

^ permalink raw reply

* Re: [PATCH 06/15] batman-adv: Distributed ARP Table - add snooping functions for ARP messages
From: Simon Wunderlich @ 2012-05-23 21:48 UTC (permalink / raw)
  To: David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r, Marek Lindner
In-Reply-To: <201205171953.54891.lindner_marek-LWAfsSFWpa4@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 2781 bytes --]

Hello David, 

we are a little bit in a pinch here - the DAT feature sent with this
patchset was developed for a long time, and we need your decision to move on
as more and more patches depend on it:

 * should we rewrite DAT to use our own ARP table/backend or
 * can we use the ARP neighbor table in another way, maybe after your changes?

We thought that re-using existing infrastructure would be smarter, but if
you disagree, please tell us so - we would like to get this feature finally
upstream and need your input to make the neccesary changes.

Thanks
	Simon


On Thu, May 17, 2012 at 07:53:54PM +0800, Marek Lindner wrote:
> 
> David,
> 
> > On Tuesday, May 01, 2012 08:59:04 David Miller wrote:
> > > From: Antonio Quartulli <ordex-GaUfNO9RBHfsrOwW+9ziJQ@public.gmane.org>
> > > Date: Tue, 1 May 2012 00:22:30 +0200
> > > 
> > > > However this patch also contains a procedure which queries the neigh
> > > > table in order to understand whether a given host is known or not.
> > > > Would it be possible to do that in another way (Without manually
> > > > touching the table)?
> > > > 
> > > > Instead, in the next patch (patch 06/15) batman-adv manually increase
> > > > the neigh timeouts. Do you think we should avoid doing that as well?
> > > > If we are allowed to do that, how can we perform the same operation in
> > > > a cleaner way?
> > > > 
> > > > Last question: why can't other modules use exported functions? Are you
> > > > going to change them as well?
> > > 
> > > I really don't have time to discuss your neigh issues right now as I'm
> > > busy speaking at conferences and dealing with the backlog of other
> > > patches.
> > > 
> > > You'll need to find someone else to discuss it with you, sorry.
> > 
> > I hope now is a good moment to bring the questions back onto the table. We
> > still are not sure how to proceed because we have no clear picture of what
> > is going to come and how the exported functions are supposed to be used.
> > 
> > David, if you don't have the time to discuss the ARP handling with us could
> > you name someone who knows your plans and the code equally well ? So far,
> > nobody has stepped up.
> 
> let me add another piece of information: The distributed ARP table does not 
> really depend on the kernel's ARP table. We can easily write our own backend 
> to be totally independent of the kernel's ARP table. Initially, we thought it 
> might be considered a smart move if the code made use of existing kernel 
> infrastructure instead of writing our own storage / user space API / etc, 
> hence duplicating what is already there. But if you feel this is the better 
> way forward we certainly will make the necessary changes.
> 
> Regards,
> Marek
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Alexander Duyck @ 2012-05-23 22:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <1337809034.3361.3487.camel@edumazet-glaptop>

On 05/23/2012 02:37 PM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 14:19 -0700, Alexander Duyck wrote:
>> On 05/23/2012 10:10 AM, Alexander Duyck wrote:
>>> On 05/23/2012 09:39 AM, Eric Dumazet wrote:
>>>> On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:
>>>>
>>>>> With current driver, a MTU=1500 frame uses :
>>>>>
>>>>> sk_buff (256 bytes)
>>>>> skb->head : 1024 bytes  (or more exaclty now : 512 + 384)
>>>> By the way, NET_SKB_PAD adds 64 bytes so its 64 + 512 + 384 = 960
>>> Actually pahole seems to be indicating to me the size of skb_shared_info
>>> is 320, unless something has changed in the last few days.
>>>
>>> When I get a chance I will try to remember to reduce the ixgbe header
>>> size to 256 which should also help.  The only reason it is set to 512
>>> was to deal with the fact that the old alloc_skb code wasn't aligning
>>> the shared info with the end of whatever size was allocated and so the
>>> 512 was an approximation to make better use of the 1K slab allocation
>>> back when we still were using hardware packet split.  That should help
>>> to improve the page utilization for the headers since that would
>>> increase the uses of a page from 4 to 6 for the skb head frag, and it
>>> would drop truesize by another 256 bytes.
>>>
>>> Thanks,
>>>
>>> Alex
>> Here is the patch for review.  I have submitted the official patch to Jeff 
>> so that it can go through his tree for testing, validation, and submission 
>> once Dave's tree opens back up.
>>
>> ---
>>
>> The recent changes to netdev_alloc_skb actually make it so that the size of
>> the buffer now actually has a more direct input on the truesize.  So in
>> order to make best use of the piece of a page we are allocated I am
>> reducing the IXGBE_RX_HDR_SIZE to 256 so that our truesize will be reduced
>> by 256 bytes as well.
>>
>> This should result in performance improvements since the number of uses per
>> page should increase from 4 to 6 in the case of a 4K page.  In addition we
>> should see socket performance improvements due to the truesize dropping
>> to less than 1K for buffers less than 256 bytes.
>>
>> Not-Yet-Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
>>
>>  drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   15 ++++++++-------
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 ++--
>>  2 files changed, 10 insertions(+), 9 deletions(-)
>>
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> index 402dd66..468e4ab 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> @@ -77,17 +77,18 @@
>>  #define IXGBE_MAX_FCPAUSE		 0xFFFF
>>  
>>  /* Supported Rx Buffer Sizes */
>> -#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
>> +#define IXGBE_RXBUFFER_256    256  /* Used for skb receive header */
>>  #define IXGBE_MAX_RXBUFFER  16384  /* largest size for a single descriptor */
>>  
>>  /*
>> - * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN mans we
>> - * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
>> - * this adds up to 512 bytes of extra data meaning the smallest allocation
>> - * we could have is 1K.
>> - * i.e. RXBUFFER_512 --> size-1024 slab
>> + * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
>> + * reserve 64 more, and skb_shared_info adds an additional 320 bytes more,
>> + * this adds up to 448 bytes of extra data.
>> + *
>> + * Since netdev_alloc_skb now allocates a page fragment we can use a value
>> + * of 256 and the resultant skb will have a truesize of 960 or less.
>>   */
>> -#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512
>> +#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_256
>>  
>>  #define MAXIMUM_ETHERNET_VLAN_SIZE (ETH_FRAME_LEN + ETH_FCS_LEN + VLAN_HLEN)
>>  
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index 7f92e40..f92b31a 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -1520,8 +1520,8 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
>>  	 * 60 bytes if the skb->len is less than 60 for skb_pad.
>>  	 */
>>  	pull_len = skb_frag_size(frag);
>> -	if (pull_len > 256)
>> -		pull_len = ixgbe_get_headlen(va, pull_len);
>> +	if (pull_len > IXGBE_RX_HDR_SIZE)
>> +		pull_len = ixgbe_get_headlen(va, IXGBE_RX_HDR_SIZE);
>>  
>>  	/* align pull length to size of long to optimize memcpy performance */
>>  	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
>>
>
> By the way you should reword the comment about NET_IP_ALIGN
>
> On x86 NET_IP_ALIGN is 0, so we dont 'reserve 64 bytes more'
On x86 yes, on other platforms that don't clear the value it will add 2
more bytes, and after cache alignment it will likely add another 64.  I
am really just stating the worst case scenario here.  This is why I end
the comment with "960 or less".

> -> 896 bytes
.. or 832 bytes if you really tweak settings and get the sk_buff down to
192 bytes.  :-)

> Also, are you sure :
>
> srrctl |= (IXGBE_RX_HDR_SIZE << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &  
> 	IXGBE_SRRCTL_BSIZEHDR_MASK; 
>
>
> is still needed in ixgbe_configure_srrctl() , since it uses
> IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF (non packet split)
That bit is needed for RSC.  Basically we have to specify the maximum
size of a header, even if we are not using packet split.  The value has
to be at least 128 or more.  Since we are using the value of
IXGBE_RX_HDR_SIZE to limit how much header we pull anyway I figure it is
probably a good idea just to leave this here.

Thanks,

Alex

^ permalink raw reply

* Re: NETDEV WATCHDOG: %s (%s): transmit queue %u timed out
From: Francois Romieu @ 2012-05-23 22:32 UTC (permalink / raw)
  To: George Spelvin; +Cc: davej, kernel-team, netdev
In-Reply-To: <20120523015312.14731.qmail@science.horizon.com>

[-- Attachment #1: Type: text/plain, Size: 517 bytes --]

George Spelvin <linux@horizon.com> :
[...]
> Unfortunately, the last reboot was on Feb. 26, and the kernel logs don't
> go back that far, so I'm not sure if 3.3-rc5 reported this priblem or not.

You may try the attached patches on top of current -git. A complete dmesg
will be welcome. So will an 'ethtool -d eth0' if the device stops working.

You did not label the problem as a serious one. Does it means that the
driver automatically recovers ?

I'll add some ring and registers debug stuff tomorrow.

-- 
Ueimor

[-- Attachment #2: 0001-r8169-avoid-clearing-the-end-of-Tx-descriptor-ring-m.patch --]
[-- Type: text/plain, Size: 1040 bytes --]

>From 4fafb314defc9e0cc5da1b58e84bba87686ea2ba Mon Sep 17 00:00:00 2001
Message-Id: <4fafb314defc9e0cc5da1b58e84bba87686ea2ba.1337810811.git.romieu@fr.zoreil.com>
From: Francois Romieu <romieu@fr.zoreil.com>
Date: Wed, 23 May 2012 22:18:35 +0200
Subject: [PATCH 1/2] r8169: avoid clearing the end of Tx descriptor ring
 marker bit.
X-Organisation: Land of Sunshine Inc.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/realtek/r8169.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 00b4f56..01f3367 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5345,7 +5345,7 @@ static void rtl8169_unmap_tx_skb(struct device *d, struct ring_info *tx_skb,
 
 	dma_unmap_single(d, le64_to_cpu(desc->addr), len, DMA_TO_DEVICE);
 
-	desc->opts1 = 0x00;
+	desc->opts1 &= cpu_to_le32(RingEnd);
 	desc->opts2 = 0x00;
 	desc->addr = 0x00;
 	tx_skb->len = 0;
-- 
1.7.7.6


[-- Attachment #3: 0002-r8169-TxPoll-hack-rework.patch --]
[-- Type: text/plain, Size: 4285 bytes --]

>From 4cadfce6ba362b5d8d5c193c4896443be575b2b4 Mon Sep 17 00:00:00 2001
Message-Id: <4cadfce6ba362b5d8d5c193c4896443be575b2b4.1337810811.git.romieu@fr.zoreil.com>
In-Reply-To: <4fafb314defc9e0cc5da1b58e84bba87686ea2ba.1337810811.git.romieu@fr.zoreil.com>
References: <4fafb314defc9e0cc5da1b58e84bba87686ea2ba.1337810811.git.romieu@fr.zoreil.com>
From: Francois Romieu <romieu@fr.zoreil.com>
Date: Wed, 23 May 2012 23:21:13 +0200
Subject: [PATCH 2/2] r8169: TxPoll hack rework.
X-Organisation: Land of Sunshine Inc.

I don't want to try and convince myself that it is completely safe to
issue a TxPoll request from the NAPI handler right in the middle of
a start_xmit, whence the tx_lock.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/realtek/r8169.c |   59 +++++++++++++++++++--------------
 1 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 01f3367..655b293 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5544,19 +5544,20 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
 	txd->opts1 = cpu_to_le32(status);
 
+	smp_wmb();
+
 	tp->cur_tx += frags + 1;
 
-	wmb();
+	smp_wmb();
 
 	RTL_W8(TxPoll, NPQ);
 
 	mmiowb();
 
 	if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
-		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
-		 * not miss a ring update when it notices a stopped queue.
-		 */
-		smp_wmb();
+		/* rtl_tx thread must not miss a ring update when it notices
+		 * a stopped queue. The TxPoll hack requires the smp_wmb
+		 * above so we can go ahead. */
 		netif_stop_queue(dev);
 		/* Sync with rtl_tx:
 		 * - publish queue status and cur_tx ring index (write barrier)
@@ -5640,22 +5641,36 @@ struct rtl_txc {
 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 {
 	struct rtl8169_stats *tx_stats = &tp->tx_stats;
-	unsigned int dirty_tx, tx_left;
+	unsigned int dirty_tx, cur_tx;
 	struct rtl_txc txc = { 0, 0 };
 
 	dirty_tx = tp->dirty_tx;
-	smp_rmb();
-	tx_left = tp->cur_tx - dirty_tx;
-
-	while (tx_left > 0) {
+xmit_race:
+	for (cur_tx = tp->cur_tx; dirty_tx != cur_tx; dirty_tx++) {
 		unsigned int entry = dirty_tx % NUM_TX_DESC;
 		struct ring_info *tx_skb = tp->tx_skb + entry;
 		u32 status;
 
-		rmb();
 		status = le32_to_cpu(tp->TxDescArray[entry].opts1);
-		if (status & DescOwn)
+
+		/* 8168 (only ?) hack: TxPoll requests are lost when the Tx
+		 * packets are too close. Let's kick an extra TxPoll request
+		 * when a burst of start_xmit activity is detected (if it is
+		 * not detected, it is slow enough).
+		 * The NPQ bit is cleared automatically by the chipset.
+		 * The code assumes that the chipset is sane enough to clear
+		 * it at a sensible time.*/
+		if (unlikely(status & DescOwn)) {
+			void __iomem *ioaddr = tp->mmio_addr;
+
+			if (!(RTL_R8(TxPoll) & NPQ)) {
+				netif_tx_lock(dev);
+				RTL_W8(TxPoll, NPQ);
+				netif_tx_unlock(dev);
+				goto done;
+			}
 			break;
+		}
 
 		rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
 				     tp->TxDescArray + entry);
@@ -5667,10 +5682,15 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 			dev_kfree_skb(skb);
 			tx_skb->skb = NULL;
 		}
-		dirty_tx++;
-		tx_left--;
 	}
 
+	/* Rationale: if chipset stopped DMAing, enforce TxPoll write either
+	 * here or in start_xmit. If chipset is still DMAing, this code
+	 * will be run later anyway. */
+	smp_mb();
+	if (cur_tx != tp->cur_tx)
+		goto xmit_race;
+done:
 	u64_stats_update_begin(&tx_stats->syncp);
 	tx_stats->packets += txc.packets;
 	tx_stats->bytes += txc.bytes;
@@ -5692,17 +5712,6 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 		    TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
 			netif_wake_queue(dev);
 		}
-		/*
-		 * 8168 hack: TxPoll requests are lost when the Tx packets are
-		 * too close. Let's kick an extra TxPoll request when a burst
-		 * of start_xmit activity is detected (if it is not detected,
-		 * it is slow enough). -- FR
-		 */
-		if (tp->cur_tx != dirty_tx) {
-			void __iomem *ioaddr = tp->mmio_addr;
-
-			RTL_W8(TxPoll, NPQ);
-		}
 	}
 }
 
-- 
1.7.7.6


^ permalink raw reply related

* Re: [PATCH 06/15] batman-adv: Distributed ARP Table - add snooping functions for ARP messages
From: David Miller @ 2012-05-23 23:01 UTC (permalink / raw)
  To: simon.wunderlich-Y4E02TeZ33kaBlGTGt4zH4SGEyLTKazZ
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r,
	lindner_marek-LWAfsSFWpa4
In-Reply-To: <20120523214802.GA12222@pandem0nium>


It can't be all on me to answer your question, I cannot be
the choke point.

You must lean on the entire networking developer community
for help, otherwise it simply will not scale.

^ permalink raw reply

* Re: [PATCH] fec: Add support for Coldfire M5441x enet-mac.
From: David Miller @ 2012-05-23 23:02 UTC (permalink / raw)
  To: sfking; +Cc: netdev, uclinux-dev, gerg
In-Reply-To: <201205231435.50857.sfking@fdwdc.com>


Please resubmit this when the net-next tree is open again.

Now is not an appropriate time to submit patches that are
not bug fixes.

^ permalink raw reply

* Re: [PATCH v2] mm: add a low limit to alloc_large_system_hash
From: Tim Bird @ 2012-05-23 23:33 UTC (permalink / raw)
  To: eric.dumazet, Paul Gortmaker, David Miller, linux kernel, netdev

This patch seems to have fallen in the cracks:

https://lkml.org/lkml/2012/2/27/20

The last message on the thread was an ACK by David Miller, and
the question "Who wants to take this?"
  -- Tim

[message and patch follow]

UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

On some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocs a bigger memory area.

As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.

Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
---
V2: no 16 minimum value for pid hash
 fs/dcache.c             |    2 ++
 fs/inode.c              |    2 ++
 include/linux/bootmem.h |    3 ++-
 kernel/pid.c            |    3 ++-
 mm/page_alloc.c         |    7 +++++--
 net/ipv4/route.c        |    1 +
 net/ipv4/tcp.c          |    2 ++
 net/ipv4/udp.c          |   30 ++++++++++--------------------
 8 files changed, 26 insertions(+), 24 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index fe19ac1..ef5e72e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2984,6 +2984,7 @@ static void __init dcache_init_early(void)
 					HASH_EARLY,
 					&d_hash_shift,
 					&d_hash_mask,
+					0,
 					0);

 	for (loop = 0; loop < (1U << d_hash_shift); loop++)
@@ -3014,6 +3015,7 @@ static void __init dcache_init(void)
 					0,
 					&d_hash_shift,
 					&d_hash_mask,
+					0,
 					0);

 	for (loop = 0; loop < (1U << d_hash_shift); loop++)
diff --git a/fs/inode.c b/fs/inode.c
index d3ebdbe..7acee4c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1667,6 +1667,7 @@ void __init inode_init_early(void)
 					HASH_EARLY,
 					&i_hash_shift,
 					&i_hash_mask,
+					0,
 					0);

 	for (loop = 0; loop < (1U << i_hash_shift); loop++)
@@ -1697,6 +1698,7 @@ void __init inode_init(void)
 					0,
 					&i_hash_shift,
 					&i_hash_mask,
+					0,
 					0);

 	for (loop = 0; loop < (1U << i_hash_shift); loop++)
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 66d3e95..1a0cd27 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -154,7 +154,8 @@ extern void *alloc_large_system_hash(const char *tablename,
 				     int flags,
 				     unsigned int *_hash_shift,
 				     unsigned int *_hash_mask,
-				     unsigned long limit);
+				     unsigned long low_limit,
+				     unsigned long high_limit);

 #define HASH_EARLY	0x00000001	/* Allocating during early boot? */
 #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
diff --git a/kernel/pid.c b/kernel/pid.c
index 9f08dfa..e86b291 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -547,7 +547,8 @@ void __init pidhash_init(void)

 	pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
 					   HASH_EARLY | HASH_SMALL,
-					   &pidhash_shift, NULL, 4096);
+					   &pidhash_shift, NULL,
+					   0, 4096);
 	pidhash_size = 1U << pidhash_shift;

 	for (i = 0; i < pidhash_size; i++)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a13ded1..b9afccb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5198,9 +5198,10 @@ void *__init alloc_large_system_hash(const char *tablename,
 				     int flags,
 				     unsigned int *_hash_shift,
 				     unsigned int *_hash_mask,
-				     unsigned long limit)
+				     unsigned long low_limit,
+				     unsigned long high_limit)
 {
-	unsigned long long max = limit;
+	unsigned long long max = high_limit;
 	unsigned long log2qty, size;
 	void *table = NULL;

@@ -5238,6 +5239,8 @@ void *__init alloc_large_system_hash(const char *tablename,
 	}
 	max = min(max, 0x80000000ULL);

+	if (numentries < low_limit)
+		numentries = low_limit;
 	if (numentries > max)
 		numentries = max;

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index bcacf54..0a41e38 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -3475,6 +3475,7 @@ int __init ip_rt_init(void)
 					0,
 					&rt_hash_log,
 					&rt_hash_mask,
+					0,
 					rhash_entries ? 0 : 512 * 1024);
 	memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
 	rt_hash_lock_init();
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 22ef5f9..e61a498 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3267,6 +3267,7 @@ void __init tcp_init(void)
 					0,
 					NULL,
 					&tcp_hashinfo.ehash_mask,
+					0,
 					thash_entries ? 0 : 512 * 1024);
 	for (i = 0; i <= tcp_hashinfo.ehash_mask; i++) {
 		INIT_HLIST_NULLS_HEAD(&tcp_hashinfo.ehash[i].chain, i);
@@ -3283,6 +3284,7 @@ void __init tcp_init(void)
 					0,
 					&tcp_hashinfo.bhash_size,
 					NULL,
+					0,
 					64 * 1024);
 	tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
 	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5d075b5..dc68ed2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2182,26 +2182,16 @@ void __init udp_table_init(struct udp_table *table, const char *name)
 {
 	unsigned int i;

-	if (!CONFIG_BASE_SMALL)
-		table->hash = alloc_large_system_hash(name,
-			2 * sizeof(struct udp_hslot),
-			uhash_entries,
-			21, /* one slot per 2 MB */
-			0,
-			&table->log,
-			&table->mask,
-			64 * 1024);
-	/*
-	 * Make sure hash table has the minimum size
-	 */
-	if (CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1) {
-		table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
-				      2 * sizeof(struct udp_hslot), GFP_KERNEL);
-		if (!table->hash)
-			panic(name);
-		table->log = ilog2(UDP_HTABLE_SIZE_MIN);
-		table->mask = UDP_HTABLE_SIZE_MIN - 1;
-	}
+	table->hash = alloc_large_system_hash(name,
+					      2 * sizeof(struct udp_hslot),
+					      uhash_entries,
+					      21, /* one slot per 2 MB */
+					      0,
+					      &table->log,
+					      &table->mask,
+					      UDP_HTABLE_SIZE_MIN,
+					      64 * 1024);
+
 	table->hash2 = table->hash + (table->mask + 1);
 	for (i = 0; i <= table->mask; i++) {
 		INIT_HLIST_NULLS_HEAD(&table->hash[i].head, i);

^ permalink raw reply related

* Re: [PATCH 15/17] netfilter: cleanup sysctl for l4proto and l3proto
From: Gao feng @ 2012-05-24  0:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <20120523103810.GE2836@1984>

Hi pablo:

于 2012年05月23日 18:38, Pablo Neira Ayuso 写道:
> On Mon, May 14, 2012 at 04:52:25PM +0800, Gao feng wrote:
>> delete no useless sysctl data for l4proto and l3proto.
>>
>> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>> ---
>>  include/net/netfilter/nf_conntrack_l3proto.h   |    2 --
>>  include/net/netfilter/nf_conntrack_l4proto.h   |   10 ----------
>>  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |    1 -
>>  net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |    8 --------
>>  net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |    5 -----
>>  net/netfilter/nf_conntrack_proto_generic.c     |    8 --------
>>  net/netfilter/nf_conntrack_proto_sctp.c        |   15 ---------------
>>  net/netfilter/nf_conntrack_proto_tcp.c         |   15 ---------------
>>  net/netfilter/nf_conntrack_proto_udp.c         |   15 ---------------
>>  net/netfilter/nf_conntrack_proto_udplite.c     |   12 ------------
>>  10 files changed, 0 insertions(+), 91 deletions(-)
>>
>> diff --git a/include/net/netfilter/nf_conntrack_l3proto.h b/include/net/netfilter/nf_conntrack_l3proto.h
>> index d6df8c7..6f7c13f 100644
>> --- a/include/net/netfilter/nf_conntrack_l3proto.h
>> +++ b/include/net/netfilter/nf_conntrack_l3proto.h
>> @@ -64,9 +64,7 @@ struct nf_conntrack_l3proto {
>>  	size_t nla_size;
>>  
>>  #ifdef CONFIG_SYSCTL
>> -	struct ctl_table_header	*ctl_table_header;
>>  	const char		*ctl_table_path;
>> -	struct ctl_table	*ctl_table;
>>  #endif /* CONFIG_SYSCTL */
>>  
>>  	/* Init l3proto pernet data */
>> diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
>> index 0d329b9..4881df34 100644
>> --- a/include/net/netfilter/nf_conntrack_l4proto.h
>> +++ b/include/net/netfilter/nf_conntrack_l4proto.h
>> @@ -95,16 +95,6 @@ struct nf_conntrack_l4proto {
>>  		const struct nla_policy *nla_policy;
>>  	} ctnl_timeout;
>>  #endif
>> -
>> -#ifdef CONFIG_SYSCTL
>> -	struct ctl_table_header	**ctl_table_header;
>> -	struct ctl_table	*ctl_table;
>> -	unsigned int		*ctl_table_users;
>> -#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
>> -	struct ctl_table_header	*ctl_compat_table_header;
>> -	struct ctl_table	*ctl_compat_table;
>> -#endif
>> -#endif
> 
> Interesting. This structure is added in patch 1/17, then it's remove
> in patch 15/17.
> 
> Probably I'm missing anything, but why are you doing it like that?

This structure means ctl_table_header,ctl_table and so on?

I add this structure to struct nf_proto_net in patch 1/17,so those fields in
struct nf_conntrack_l4proto are useless,this patch is just some cleanup.

the same with nf_conntrack_l3proto.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 16/17] netfilter: add namespace support for cttimeout
From: Gao feng @ 2012-05-24  1:04 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <20120523104110.GF2836@1984>

于 2012年05月23日 18:41, Pablo Neira Ayuso 写道:
> On Mon, May 14, 2012 at 04:52:26PM +0800, Gao feng wrote:
>> add struct net as a param of ctnl_timeout.nlattr_to_obj,
>>
>> modify ctnl_timeout_parse_policy and cttimeout_new_timeout
>> to transmit struct net to nlattr_to_obj.
> 
> Please, merge your patch 16 and 17 into one single patch.
> 
>>  	unsigned int *timeouts = data;
>>  
>> diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
>> index 291cef4..a28f3c4 100644
>> --- a/net/netfilter/nf_conntrack_proto_sctp.c
>> +++ b/net/netfilter/nf_conntrack_proto_sctp.c
>> @@ -562,7 +562,8 @@ static int sctp_nlattr_size(void)
>>  #include <linux/netfilter/nfnetlink.h>
>>  #include <linux/netfilter/nfnetlink_cttimeout.h>
>>  
>> -static int sctp_timeout_nlattr_to_obj(struct nlattr *tb[], void *data)
>> +static int sctp_timeout_nlattr_to_obj(struct nlattr *tb[],
>> +				      struct net *net, void *data)
> 
> The interface modification and the use of the new *net parameter
> should go together, ie. merge patch 16 and 17 :-).

got it,thanks ;)

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 04/17] netfilter: add namespace support for l4proto_generic
From: Gao feng @ 2012-05-24  1:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <20120523103232.GD2836@1984>

于 2012年05月23日 18:32, Pablo Neira Ayuso 写道:
> On Mon, May 14, 2012 at 04:52:14PM +0800, Gao feng wrote:
>> implement and export nf_conntrack_proto_generic_[init,fini],
>> nf_conntrack_[init,cleanup]_net call them to register or unregister
>> the sysctl of generic proto.
>>
>> implement generic_net_init,it's used to initial the pernet
>> data for generic proto.
>>
>> and use nf_generic_net.timeout to replace nf_ct_generic_timeout in
>> get_timeouts function.
>>
>> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>> ---
>>  include/net/netfilter/nf_conntrack_l4proto.h |    2 +
>>  include/net/netns/conntrack.h                |    6 +++
>>  net/netfilter/nf_conntrack_core.c            |    8 +++-
>>  net/netfilter/nf_conntrack_proto.c           |   21 +++++-----
>>  net/netfilter/nf_conntrack_proto_generic.c   |   55 ++++++++++++++++++++++++-
>>  5 files changed, 76 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
>> index a93dcd5..0d329b9 100644
>> --- a/include/net/netfilter/nf_conntrack_l4proto.h
>> +++ b/include/net/netfilter/nf_conntrack_l4proto.h
>> @@ -118,6 +118,8 @@ struct nf_conntrack_l4proto {
>>  
>>  /* Existing built-in generic protocol */
>>  extern struct nf_conntrack_l4proto nf_conntrack_l4proto_generic;
>> +extern int nf_conntrack_proto_generic_init(struct net *net);
>> +extern void nf_conntrack_proto_generic_fini(struct net *net);
>>  
>>  #define MAX_NF_CT_PROTO 256
>>  
>> diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
>> index 94992e9..3381b80 100644
>> --- a/include/net/netns/conntrack.h
>> +++ b/include/net/netns/conntrack.h
>> @@ -20,7 +20,13 @@ struct nf_proto_net {
>>  	unsigned int		users;
>>  };
>>  
>> +struct nf_generic_net {
>> +	struct nf_proto_net pn;
>> +	unsigned int timeout;
>> +};
>> +
>>  struct nf_ip_net {
>> +	struct nf_generic_net   generic;
>>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
>>  	struct ctl_table_header *ctl_table_header;
>>  	struct ctl_table	*ctl_table;
>> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
>> index 32c5909..fd33e91 100644
>> --- a/net/netfilter/nf_conntrack_core.c
>> +++ b/net/netfilter/nf_conntrack_core.c
>> @@ -1353,6 +1353,7 @@ static void nf_conntrack_cleanup_net(struct net *net)
>>  	}
>>  
>>  	nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size);
>> +	nf_conntrack_proto_generic_fini(net);
>>  	nf_conntrack_helper_fini(net);
>>  	nf_conntrack_timeout_fini(net);
>>  	nf_conntrack_ecache_fini(net);
>> @@ -1586,9 +1587,12 @@ static int nf_conntrack_init_net(struct net *net)
>>  	ret = nf_conntrack_helper_init(net);
>>  	if (ret < 0)
>>  		goto err_helper;
>> -
>> +	ret = nf_conntrack_proto_generic_init(net);
>> +	if (ret < 0)
>> +		goto err_generic;
>>  	return 0;
>> -
>> +err_generic:
>> +	nf_conntrack_helper_fini(net);
>>  err_helper:
>>  	nf_conntrack_timeout_fini(net);
>>  err_timeout:
>> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
>> index 7ee6653..9b4bf6d 100644
>> --- a/net/netfilter/nf_conntrack_proto.c
>> +++ b/net/netfilter/nf_conntrack_proto.c
>> @@ -287,10 +287,16 @@ EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_unregister);
>>  static struct nf_proto_net *nf_ct_l4proto_net(struct net *net,
>>  					      struct nf_conntrack_l4proto *l4proto)
>>  {
>> -	if (l4proto->net_id)
>> -		return net_generic(net, *l4proto->net_id);
>> -	else
>> -		return NULL;
>> +	switch (l4proto->l4proto) {
>> +	case 255: /* l4proto_generic */
>> +		return (struct nf_proto_net *)&net->ct.proto.generic;
>> +	default:
>> +		if (l4proto->net_id)
>> +			return net_generic(net, *l4proto->net_id);
>> +		else
>> +			return NULL;
>> +	}
>> +	return NULL;
>>  }
>>  
>>  int nf_ct_l4proto_register_sysctl(struct net *net,
>> @@ -457,11 +463,6 @@ EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_unregister);
>>  int nf_conntrack_proto_init(void)
>>  {
>>  	unsigned int i;
>> -	int err;
>> -
>> -	err = nf_ct_l4proto_register_sysctl(&init_net, &nf_conntrack_l4proto_generic);
>> -	if (err < 0)
>> -		return err;
> 
> I like that all protocols sysctl are registered by
> nf_conntrack_proto_init. Can you keep using that?

you mean per-net's generic_proto sysctl are registered by
nf_conntrack_proto_init?

such as

int nf_conntrack_proto_init(struct net *net)
{
	...
	err = nf_ct_l4proto_register_sysctl(net, &nf_conntrack_l4proto_generic);
	...
}

if my understanding is right,my answer is yes we can ;)

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox