Netdev List
 help / color / mirror / Atom feed
* net/can/mscan: Fix buggy listen only mode setting
From: Wolfgang Grandegger @ 2011-11-14 13:34 UTC (permalink / raw)
  To: Netdev; +Cc: linux-can, Socketcan-core, Marc Kleine-Budde

This patch fixes an issue introduced recently with commit
452448f9283e1939408b397e87974a418825b0a8.

CC: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Wolfgang Grandegger <wg@grandegger.com>
---
 drivers/net/can/mscan/mscan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/can/mscan/mscan.c b/drivers/net/can/mscan/mscan.c
index 74f3b18..1c82dd8 100644
--- a/drivers/net/can/mscan/mscan.c
+++ b/drivers/net/can/mscan/mscan.c
@@ -581,7 +581,7 @@ static int mscan_open(struct net_device *dev)
 
 	priv->open_time = jiffies;
 
-	if (ctrlmode.flags & CAN_CTRLMODE_LISTENONLY)
+	if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY)
 		setbits8(&regs->canctl1, MSCAN_LISTEN);
 	else
 		clrbits8(&regs->canctl1, MSCAN_LISTEN);
-- 
1.7.6.4

^ permalink raw reply related

* [PATCH] NET: MIPS: lantiq: fix etop compile error
From: John Crispin @ 2011-11-14 15:01 UTC (permalink / raw)
  To: netdev; +Cc: John Crispin

The Lantiq ETOP ethernet driver fails to build in 3.2-rc1 due to 2 missing
header files.

Signed-off-by: John Crispin <blogic@openwrt.org>
---
 drivers/net/ethernet/lantiq_etop.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/lantiq_etop.c b/drivers/net/ethernet/lantiq_etop.c
index 6bb2b95..0b3567a 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -34,6 +34,8 @@
 #include <linux/init.h>
 #include <linux/delay.h>
 #include <linux/io.h>
+#include <linux/dma-mapping.h>
+#include <linux/module.h>
 
 #include <asm/checksum.h>
 
-- 
1.7.7.1

^ permalink raw reply related

* Re: [PATCH] ip6_tunnel: copy parms.name after register_netdevice
From: Josh Boyer @ 2011-11-14 14:02 UTC (permalink / raw)
  To: David Miller; +Cc: jpirko, netdev, linux-kernel, stable
In-Reply-To: <20111114.002456.1590067179487443754.davem@davemloft.net>

On Mon, Nov 14, 2011 at 12:24:56AM -0500, David Miller wrote:
> From: Josh Boyer <jwboyer@redhat.com>
> Date: Thu, 10 Nov 2011 20:10:23 -0500
> 
> > @@ -288,6 +288,8 @@ static struct ip6_tnl *ip6_tnl_create(struct net *net, struct ip6_tnl_parm *p)
> >  
> >  	if ((err = register_netdevice(dev)) < 0)
> >  		goto failed_free;
> > +	
> > +	strcpy(t->parms.name, dev->name);
> 
> Trailing whitespace on that blank line you're adding.

Ugh.  Forgot to run checkpatch.  Apologies.

> I fixed this up, applied, and queued up for -stable, thanks.

Thank you.

josh

^ permalink raw reply

* Re: [PATCH] net, bridge: print log message after state changed
From: Joe Perches @ 2011-11-14 14:27 UTC (permalink / raw)
  To: David Miller; +Cc: Wolfgang.Fritz, holger.brunck, netdev, shemminger, bridge
In-Reply-To: <20111114.020749.1176229013272872343.davem@davemloft.net>

On Mon, 2011-11-14 at 02:07 -0500, David Miller wrote:
> I would never expect an "entering" message to print out after the event
> happens, and in fact I'd _always_ want to see it beforehand so that if
> the state change caused a crash or similar it'd be that much easier
> to pinpoint the proper location.
> 
> I'm still not applying this.  If the other log messages behave
> differently, they are broken, so go fix those instead.

Perhaps use "entered" after the transition or "entering" before.

^ permalink raw reply

* Re: net: Add network priority cgroup
From: Neil Horman @ 2011-11-14 14:43 UTC (permalink / raw)
  To: Dave Taht; +Cc: netdev, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <CAA93jw7tQtBGFMuDiK_pRUqAdfBVqrwbMegR8gh-8KgVb19PWg@mail.gmail.com>

On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> >> Data Center Bridging environments are currently somewhat limited in their
> >> ability to provide a general mechanism for controlling traffic priority.
> >> Specifically they are unable to administratively control the priority at which
> >> various types of network traffic are sent.
> >>
> >> Currently, the only ways to set the priority of a network buffer are:
> >>
> >> 1) Through the use of the SO_PRIORITY socket option
> >> 2) By using low level hooks, like a tc action
> >>
> >> (1) is difficult from an administrative perspective because it requires that the
> >> application to be coded to not just assume the default priority is sufficient,
> >> and must expose an administrative interface to allow priority adjustment.  Such
> >> a solution is not scalable in a DCB environment
> >>
> >> (2) is also difficult, as it requires constant administrative oversight of
> >> applications so as to build appropriate rules to match traffic belonging to
> >> various classes, so that priority can be appropriately set. It is further
> >> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
> >> only run after a root qdisc has been selected (DCB enabled hardware may reserve
> >> hw queues for various traffic classes and needs the priority to be set prior to
> >> selecting the root qdisc)
> >>
> >>
> >> I've discussed various solutions with John Fastabend, and we saw a cgroup as
> >> being a good general solution to this problem.  The network priority cgroup
> >> allows for a per-interface priority map to be built per cgroup.  Any traffic
> >> originating from an application in a cgroup, that does not explicitly set its
> >> priority with SO_PRIORITY will have its priority assigned to the value
> >> designated for that group on that interface.  This allows a user space daemon,
> >> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> >> based on the APP_TLV value received and administratively assign applications to
> >> that priority using the existing cgroup utility infrastructure.
> >>
> >> Tested by John and myself, with good results
> >>
> >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >> CC: John Fastabend <john.r.fastabend@intel.com>
> >> CC: Robert Love <robert.w.love@intel.com>
> >> CC: "David S. Miller" <davem@davemloft.net>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >
> > Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
> > the use of skb->priority, but IMO they really seem orthogonal to the purpose of
> > this change.  Any other reviews would be welcome.
> 
> Well, in part I've been playing catchup in the hope that lldp and
> openlldp and/or this dcb netlink layer that I don't know anything
> about (pointers please?) could help somehow to resolve the semantic
> mess skb->priority has become in the first place.
> 
> I liked what was described here.
> 
> "What if we did at least carve out the DCB functionality away from
> skb->priority?  Since, AIUI, we're only concerning ourselves with
> locally generated traffic here, we're talking
> about skbs that have a socket attached to them.  We could, instead of indexing
> the prio_tc_map with skb->priority, we could index it with
> skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The cgroup
> then could be, instead of a strict priority cgroup, a queue_selector cgroup (or
> something more appropriately named), and we don't have to touch skb->priority at
> all.  I'd really rather not start down that road until I got more opinions and
> consensus on that, but it seems like a pretty good solution, one that would
> allow hardware queue selection in systems that use things like DCB to co-exist
> with software queueing features."
> 
I was initially ok with this, but the more I think about it, the more I think
its just not needed (see further down in this email for my reasoning).  John,
Rob, do you have any thoughts here?

> The piece that still kind of bothered me about the original proposal
> (and perhaps this one) was that setting SO_PRIORITY in an app means
> 'give my packets more mojo'.
> 
> Taking something that took unprioritized packets and assigned them and
> *them only* to a hardware queue struck me as possibly deprioritizing
> the 'more mojo wanted' packets in the app(s), as they would end up in
> some other, possibly overloaded, hardware queue.
> 
I don't really see what you mean by this at all.  Taking packets with no
priority and assigning them a priority doesn't really have an effect on
pre-prioritized packets.  Or rather it shouldn't.  You can certainly create a
problem by having apps prioritized according to conflicting semantic rules, but
that strikes me as administrative error.  Garbage in...Garbage out.

> So a cgroup that moves all of the packets from an application into a
> given hardware queue, and then gets scheduled normally according to
> skb->priority and friends (software queue, default of pfifo_fast,
> etc), seems to make some sense to me. (I wouldn't mind if we had
> abstractions for software queues, too, like, I need a software queue
> with these properties, find me a place for it on the hardware - but
> I'm dreaming)
> 
> One open question is where do packets generated from other subsystems
> end up, if you are using a cgroup for the app? arp, dns, etc?
> 
The overriding rule is the association of an skb to a socket.  If a transmitted
frame has skb->sk set in dev_queue_xmit, then we interrogate its priority index
as set when we passed through the sendmsg code at the top of the stack.
Otherwise its behavior is unchanged from its current standpoint.

> So to rephrase your original description from this:
> 
> >> Any traffic originating from an application in a cgroup, that does not explicitly set its
> >> priority with SO_PRIORITY will have its priority assigned to the value
> >> designated for that group on that interface.  This allows a user space daemon,
> >> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> >> based on the APP_TLV value received and administratively assign applications to
> >> that priority using the existing cgroup utility infrastructure.
> > John, Robert, if you're supportive of these changes, some Acks would be
> > appreciated.
> 
> To this:
> 
> "Any traffic originating from an application in a cgroup,  will have
> its hardware queue  assigned to the value designated for that group on
> that interface.  This allows a user space daemon, when conducting LLDP
> negotiation with a DCB enabled peer to create a cgroup based on the
> APP_TLV value received and administratively assign applications to
> that hardware queue using the existing cgroup utility infrastructure."
> 
As above, I'm split brained about this.  I'm ok with the idea of making this a
queue selection cgroup, and separating it from priority, but at the same time,
in the context of DCB, we really are assigning priority here, so it seems a bit
false to do something that is not priority.  I also like the fact that it
provides administrative control in a way that netfilter and tc don't really
enable. 

> Assuming we're on the same page here, what the heck is APP_TLV?
> 
LLDP does layer 2 discovery with peer networking devices. It does so using sets
of Type/length/value tuples.  The types carry various bits of information, such
as which priority groups are available on the network.  The APP tlv conveys
application or feature specific information.  for instance, There is an ISCSI
app tlv that tells the host that  "on the interface you received this tlv, iscsi
traffic must be sent at priority X".  The idea being that, on receipt of this
tlv, the DCB daemon can create an ISCSI network priority cgroup instance, and
augment the cgroup rules file such that, when the user space iscsi daemon is
started, its traffic automatically transmits at the appropriate priority.

Actually, as I sit here thinking about it, I'm less supportive of separating the
assignment of hw queue / root qdisc from priority.  I say that because its
really just not necessecary, at least not in a DCB environment.  In a dcb
environment, we use priority exclusively to select the root qdisc and the
associated hardware queue.  Any software queue policing/grooming that takes
place after that by definition can't make any decisions based on skb->priority,
as the priority will be the same for each frame in the given parent queue (since
we used skb->priority to assign the parent queue).  You can still use RED or
some other queue discipline to manage queue length, but you have to understand
that priority won't be a packet-differentiating factor there.  If you made this
a separate queue selection variable, you could manipulate priority independenty,
but it would be non-sensical, because the intent would still be to send all
packets on that queue at the same priority, since thats what that hardware queue
is configured for.

In the alternate, if you don't use DCB, then this cgroup becomes somewhat
worthless anyway, because no-one has a need to select a specific hardware queue.
If you want to categorize traffic into classes you can already use the net_cls
cgroup, which works quite well for bandwidth allocation, and that still frees up
the skb->prioirty assignment to use as you see fit.

In the end, I think its just plain old more useful to assign priorty here than
some new thing.

Neil
 
> > John, Robert, if you're supportive of these changes, some Acks would be
> > appreciated.
> 
> 
> >
> >
> > Regards
> > Neil
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
> 
> 
> -- 
> Dave Täht
> SKYPE: davetaht
> US Tel: 1-239-829-5608
> FR Tel: 0638645374
> http://www.bufferbloat.net
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH V2 2/6] net/ethernet/atl1e: Disable ASPM
From: Matthew Garrett @ 2011-11-14 15:44 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, netdev, jcliburn, chris.snook
In-Reply-To: <20111114.003322.1579175313793306922.davem@davemloft.net>

Sorry, I screwed up there - it looks like I tested atl1c twice. Still 
inexcusable, and apologies for the wasted time.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply

* Re: [PATCH] net/packet: remove dead code and unneeded variable from prb_setup_retire_blk_timer()
From: chetan loke @ 2011-11-14 15:54 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: netdev, linux-kernel, David S. Miller, Eric Dumazet, Changli Gao,
	Ben Greear, waltje, gw4pts, waltje, ross.biro, alan
In-Reply-To: <alpine.LNX.2.00.1111132247440.3368@swampdragon.chaosbits.net>

On Sun, Nov 13, 2011 at 4:55 PM, Jesper Juhl <jj@chaosbits.net> wrote:
> We test for 'tx_ring' being != zero and BUG() if that's the case. So after
> that check there is no way that 'tx_ring' could be anything _but_ zero, so
> testing it again is just dead code. Once that dead code is removed, the
> 'pkc' local variable becomes entirely redundant, so remove that as well.

It was there so that it would be a no-brainer thing for the next
person who would want to enable tx_ring support. And given that this
check was called just once during the initial setup, it didn't seem
like a big deal to me. And the BUG() would let the next-person know
what different paths need to be plugged.


Chetan Loke

^ permalink raw reply

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
From: Eric Dumazet @ 2011-11-14 15:57 UTC (permalink / raw)
  To: David Miller
  Cc: eilong, bhutchings, pstaszewski, netdev, Thomas Graf, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger
In-Reply-To: <1321251945.17837.55.camel@edumazet-laptop>

Le lundi 14 novembre 2011 à 07:25 +0100, Eric Dumazet a écrit :
> Le lundi 14 novembre 2011 à 00:08 -0500, David Miller a écrit :
> 
> > I fully support bringing this thing back to life :-)
> 
> I'll make extensive tests today and provide two patches when ready, with
> all performance results.
> 
> Some prefetch() calls will be removed, since build_skb() provides
> already cache hot skb.

Impressive results :

before : 720.000 pps
after :  820.000 pps

[ One mono threaded application receiving UDP messages on a single
socket, asking IP_PKTINFO ancillary info ]

Latencies are also a bit improved : softirq handler dirties about 320
bytes less per skb.

Definitely worth the pain.

I am sending two patches. Other drivers probably can benefit from
build_skb() as well.

[PATCH net-next 1/2] net: introduce build_skb()
[PATCH net-next 2/2] bnx2x: uses build_skb() in receive path

^ permalink raw reply

* [PATCH net-next 1/2] net: introduce build_skb()
From: Eric Dumazet @ 2011-11-14 16:03 UTC (permalink / raw)
  To: David Miller
  Cc: eilong, pstaszewski, netdev, Ben Hutchings, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger, Thomas Graf, Herbert Xu,
	Jeff Kirsher

One of the thing we discussed during netdev 2011 conference was the idea
to change some network drivers to allocate/populate their skb at RX
completion time, right before feeding the skb to network stack.

In old days, we allocated skbs when populating the RX ring.

This means bringing into cpu cache sk_buff and skb_shared_info cache
lines (since we clear/initialize them), then 'queue' skb->data to NIC.

By the time NIC fills a frame in skb->data buffer and host can process
it, cpu probably threw away the cache lines from its caches, because lot
of things happened between the allocation and final use.

So the deal would be to allocate only the data buffer for the NIC to
populate its RX ring buffer. And use build_skb() at RX completion to
attach a data buffer (now filled with an ethernet frame) to a new skb,
initialize the skb_shared_info portion, and give the hot skb to network
stack.

build_skb() is the function to allocate an skb, caller providing the
data buffer that should be attached to it. Drivers are expected to call 
skb_reserve() right after build_skb() to adjust skb->data to the
Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
drivers might add a hardware provided alignment)

Data provided to build_skb() MUST have been allocated by a prior
kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
skb_shared_info)) bytes at the end of the data without corrupting
incoming frame.

data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
               SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
	       GFP_ATOMIC);
...
skb = build_skb(data);
if (!skb) {
	recycle_data(data);
} else {
	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
	...
}

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Eilon Greenstein <eilong@broadcom.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Tom Herbert <therbert@google.com>
CC: Jamal Hadi Salim <hadi@mojatatu.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Thomas Graf <tgraf@infradead.org>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 include/linux/skbuff.h |    1 
 net/core/skbuff.c      |   49 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fe86488..abad8a0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -540,6 +540,7 @@ extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   gfp_t priority, int fclone, int node);
+extern struct sk_buff *build_skb(void *data);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 18a3ceb..8d2c5b3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -245,6 +245,55 @@ nodata:
 EXPORT_SYMBOL(__alloc_skb);
 
 /**
+ * build_skb - build a network buffer
+ * @data: data buffer provided by caller
+ *
+ * Allocate a new &sk_buff. Caller provides space holding head and
+ * skb_shared_info. @data must have been allocated by kmalloc()
+ * The return is the new skb buffer.
+ * On a failure the return is %NULL, and @data is not freed.
+ * Notes :
+ *  Before IO, driver allocates only data buffer where NIC put incoming frame
+ *  Driver should add room at head (NET_SKB_PAD) and
+ *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
+ *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
+ *  before giving packet to stack.
+ *  RX rings only contains data buffers, not full skbs.
+ */
+struct sk_buff *build_skb(void *data)
+{
+	struct skb_shared_info *shinfo;
+	struct sk_buff *skb;
+	unsigned int size;
+
+	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	if (!skb)
+		return NULL;
+
+	size = ksize(data) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->truesize = SKB_TRUESIZE(size);
+	atomic_set(&skb->users, 1);
+	skb->head = data;
+	skb->data = data;
+	skb_reset_tail_pointer(skb);
+	skb->end = skb->tail + size;
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+	skb->mac_header = ~0U;
+#endif
+
+	/* make sure we initialize shinfo sequentially */
+	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+	kmemcheck_annotate_variable(shinfo->destructor_arg);
+
+	return skb;
+}
+EXPORT_SYMBOL(build_skb);
+
+/**
  *	__netdev_alloc_skb - allocate an skbuff for rx on a specific device
  *	@dev: network device to receive on
  *	@length: length to allocate

^ permalink raw reply related

* [PATCH net-next 2/2] bnx2x: uses build_skb() in receive path
From: Eric Dumazet @ 2011-11-14 16:05 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Ben Hutchings, Tom Herbert, Jamal Hadi Salim,
	Stephen Hemminger, Thomas Graf, Herbert Xu, Jeff Kirsher,
	pstaszewski, eilong

bnx2x uses following formula to compute its rx_buf_sz :

dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2

Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
skb_shared_info))

Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.

Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
false sharing because of mem_reclaim in UDP stack.

One possible way to half truesize is to reduce the need by 64 bytes
(2112 -> 2048 bytes)

Instead of allocating a full cache line at the end of packet for
alignment, we can use the fact that skb_shared_info sits at the end of
skb->head, and we can use this room, if we convert bnx2x to new
build_skb() infrastructure.

skb_shared_info will be initialized after hardware finished its
transfert, so we can eventually overwrite the final padding.

Using build_skb() also reduces cache line misses in the driver, since we
use cache hot skb instead of cold ones. Number of in-flight sk_buff
structures is lower, they are recycled while still hot.

Performance results :

(820.000 pps on a rx UDP monothread benchmark, instead of 720.000 pps)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Eilon Greenstein <eilong@broadcom.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Tom Herbert <therbert@google.com>
CC: Jamal Hadi Salim <hadi@mojatatu.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Thomas Graf <tgraf@infradead.org>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h         |   30 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c     |  265 ++++------
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h     |   33 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |    6 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c    |    4 
 5 files changed, 170 insertions(+), 168 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 4b90b51..0f7b7a4 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -293,8 +293,13 @@ enum {
 #define FCOE_TXQ_IDX(bp)	(MAX_ETH_TXQ_IDX(bp))
 
 /* fast path */
+/*
+ * This driver uses new build_skb() API :
+ * RX ring buffer contains pointer to kmalloc() data only,
+ * skb are built only after Hardware filled the frame.
+ */
 struct sw_rx_bd {
-	struct sk_buff	*skb;
+	u8		*data;
 	DEFINE_DMA_UNMAP_ADDR(mapping);
 };
 
@@ -424,8 +429,8 @@ union host_hc_status_block {
 
 struct bnx2x_agg_info {
 	/*
-	 * First aggregation buffer is an skb, the following - are pages.
-	 * We will preallocate the skbs for each aggregation when
+	 * First aggregation buffer is a data buffer, the following - are pages.
+	 * We will preallocate the data buffer for each aggregation when
 	 * we open the interface and will replace the BD at the consumer
 	 * with this one when we receive the TPA_START CQE in order to
 	 * keep the Rx BD ring consistent.
@@ -439,6 +444,7 @@ struct bnx2x_agg_info {
 	u16			parsing_flags;
 	u16			vlan_tag;
 	u16			len_on_bd;
+	u32			rxhash;
 };
 
 #define Q_STATS_OFFSET32(stat_name) \
@@ -1187,10 +1193,20 @@ struct bnx2x {
 #define ETH_MAX_JUMBO_PACKET_SIZE	9600
 
 	/* Max supported alignment is 256 (8 shift) */
-#define BNX2X_RX_ALIGN_SHIFT		((L1_CACHE_SHIFT < 8) ? \
-					 L1_CACHE_SHIFT : 8)
-	/* FW use 2 Cache lines Alignment for start packet and size  */
-#define BNX2X_FW_RX_ALIGN		(2 << BNX2X_RX_ALIGN_SHIFT)
+#define BNX2X_RX_ALIGN_SHIFT		min(8, L1_CACHE_SHIFT)
+
+	/* FW uses 2 Cache lines Alignment for start packet and size
+	 *
+	 * We assume skb_build() uses sizeof(struct skb_shared_info) bytes
+	 * at the end of skb->data, to avoid wasting a full cache line.
+	 * This reduces memory use (skb->truesize).
+	 */
+#define BNX2X_FW_RX_ALIGN_START	(1UL << BNX2X_RX_ALIGN_SHIFT)
+
+#define BNX2X_FW_RX_ALIGN_END					\
+	max(1UL << BNX2X_RX_ALIGN_SHIFT, 			\
+	    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
 #define BNX2X_PXP_DRAM_ALIGN		(BNX2X_RX_ALIGN_SHIFT - 5)
 
 	struct host_sp_status_block *def_status_blk;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 13dad92..0d60b9e 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -294,8 +294,21 @@ static void bnx2x_update_sge_prod(struct bnx2x_fastpath *fp,
 	   fp->last_max_sge, fp->rx_sge_prod);
 }
 
+/* Set Toeplitz hash value in the skb using the value from the
+ * CQE (calculated by HW).
+ */
+static u32 bnx2x_get_rxhash(const struct bnx2x *bp,
+			    const struct eth_fast_path_rx_cqe *cqe)
+{
+	/* Set Toeplitz hash from CQE */
+	if ((bp->dev->features & NETIF_F_RXHASH) &&
+	    (cqe->status_flags & ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
+		return le32_to_cpu(cqe->rss_hash_result);
+	return 0;
+}
+
 static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
-			    struct sk_buff *skb, u16 cons, u16 prod,
+			    u16 cons, u16 prod,
 			    struct eth_fast_path_rx_cqe *cqe)
 {
 	struct bnx2x *bp = fp->bp;
@@ -310,9 +323,9 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
 	if (tpa_info->tpa_state != BNX2X_TPA_STOP)
 		BNX2X_ERR("start of bin not in stop [%d]\n", queue);
 
-	/* Try to map an empty skb from the aggregation info  */
+	/* Try to map an empty data buffer from the aggregation info  */
 	mapping = dma_map_single(&bp->pdev->dev,
-				 first_buf->skb->data,
+				 first_buf->data + NET_SKB_PAD,
 				 fp->rx_buf_size, DMA_FROM_DEVICE);
 	/*
 	 *  ...if it fails - move the skb from the consumer to the producer
@@ -322,15 +335,15 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
 
 	if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
 		/* Move the BD from the consumer to the producer */
-		bnx2x_reuse_rx_skb(fp, cons, prod);
+		bnx2x_reuse_rx_data(fp, cons, prod);
 		tpa_info->tpa_state = BNX2X_TPA_ERROR;
 		return;
 	}
 
-	/* move empty skb from pool to prod */
-	prod_rx_buf->skb = first_buf->skb;
+	/* move empty data from pool to prod */
+	prod_rx_buf->data = first_buf->data;
 	dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
-	/* point prod_bd to new skb */
+	/* point prod_bd to new data */
 	prod_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
 	prod_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
 
@@ -344,6 +357,7 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
 	tpa_info->tpa_state = BNX2X_TPA_START;
 	tpa_info->len_on_bd = le16_to_cpu(cqe->len_on_bd);
 	tpa_info->placement_offset = cqe->placement_offset;
+	tpa_info->rxhash = bnx2x_get_rxhash(bp, cqe);
 
 #ifdef BNX2X_STOP_ON_ERROR
 	fp->tpa_queue_used |= (1 << queue);
@@ -471,11 +485,12 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 {
 	struct bnx2x_agg_info *tpa_info = &fp->tpa_info[queue];
 	struct sw_rx_bd *rx_buf = &tpa_info->first_buf;
-	u8 pad = tpa_info->placement_offset;
+	u32 pad = tpa_info->placement_offset;
 	u16 len = tpa_info->len_on_bd;
-	struct sk_buff *skb = rx_buf->skb;
+	struct sk_buff *skb = NULL;
+	u8 *data = rx_buf->data;
 	/* alloc new skb */
-	struct sk_buff *new_skb;
+	u8 *new_data;
 	u8 old_tpa_state = tpa_info->tpa_state;
 
 	tpa_info->tpa_state = BNX2X_TPA_STOP;
@@ -486,18 +501,18 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 	if (old_tpa_state == BNX2X_TPA_ERROR)
 		goto drop;
 
-	/* Try to allocate the new skb */
-	new_skb = netdev_alloc_skb(bp->dev, fp->rx_buf_size);
+	/* Try to allocate the new data */
+	new_data = kmalloc(fp->rx_buf_size + NET_SKB_PAD, GFP_ATOMIC);
 
 	/* Unmap skb in the pool anyway, as we are going to change
 	   pool entry status to BNX2X_TPA_STOP even if new skb allocation
 	   fails. */
 	dma_unmap_single(&bp->pdev->dev, dma_unmap_addr(rx_buf, mapping),
 			 fp->rx_buf_size, DMA_FROM_DEVICE);
+	if (likely(new_data))
+		skb = build_skb(data);
 
-	if (likely(new_skb)) {
-		prefetch(skb);
-		prefetch(((char *)(skb)) + L1_CACHE_BYTES);
+	if (likely(skb)) {
 
 #ifdef BNX2X_STOP_ON_ERROR
 		if (pad + len > fp->rx_buf_size) {
@@ -509,8 +524,9 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 		}
 #endif
 
-		skb_reserve(skb, pad);
+		skb_reserve(skb, pad + NET_SKB_PAD);
 		skb_put(skb, len);
+		skb->rxhash = tpa_info->rxhash;
 
 		skb->protocol = eth_type_trans(skb, bp->dev);
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
@@ -526,8 +542,8 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
 		}
 
 
-		/* put new skb in bin */
-		rx_buf->skb = new_skb;
+		/* put new data in bin */
+		rx_buf->data = new_data;
 
 		return;
 	}
@@ -539,19 +555,6 @@ drop:
 	fp->eth_q_stats.rx_skb_alloc_failed++;
 }
 
-/* Set Toeplitz hash value in the skb using the value from the
- * CQE (calculated by HW).
- */
-static inline void bnx2x_set_skb_rxhash(struct bnx2x *bp, union eth_rx_cqe *cqe,
-					struct sk_buff *skb)
-{
-	/* Set Toeplitz hash from CQE */
-	if ((bp->dev->features & NETIF_F_RXHASH) &&
-	    (cqe->fast_path_cqe.status_flags &
-	     ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
-		skb->rxhash =
-		le32_to_cpu(cqe->fast_path_cqe.rss_hash_result);
-}
 
 int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 {
@@ -594,6 +597,7 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 		u8 cqe_fp_flags;
 		enum eth_rx_cqe_type cqe_fp_type;
 		u16 len, pad;
+		u8 *data;
 
 #ifdef BNX2X_STOP_ON_ERROR
 		if (unlikely(bp->panic))
@@ -604,13 +608,6 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 		bd_prod = RX_BD(bd_prod);
 		bd_cons = RX_BD(bd_cons);
 
-		/* Prefetch the page containing the BD descriptor
-		   at producer's index. It will be needed when new skb is
-		   allocated */
-		prefetch((void *)(PAGE_ALIGN((unsigned long)
-					     (&fp->rx_desc_ring[bd_prod])) -
-				  PAGE_SIZE + 1));
-
 		cqe = &fp->rx_comp_ring[comp_ring_cons];
 		cqe_fp = &cqe->fast_path_cqe;
 		cqe_fp_flags = cqe_fp->type_error_flags;
@@ -626,125 +623,110 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
 		if (unlikely(CQE_TYPE_SLOW(cqe_fp_type))) {
 			bnx2x_sp_event(fp, cqe);
 			goto next_cqe;
+		}
+		rx_buf = &fp->rx_buf_ring[bd_cons];
+		data = rx_buf->data;
 
-		/* this is an rx packet */
-		} else {
-			rx_buf = &fp->rx_buf_ring[bd_cons];
-			skb = rx_buf->skb;
-			prefetch(skb);
-
-			if (!CQE_TYPE_FAST(cqe_fp_type)) {
+		if (!CQE_TYPE_FAST(cqe_fp_type)) {
 #ifdef BNX2X_STOP_ON_ERROR
-				/* sanity check */
-				if (fp->disable_tpa &&
-				    (CQE_TYPE_START(cqe_fp_type) ||
-				     CQE_TYPE_STOP(cqe_fp_type)))
-					BNX2X_ERR("START/STOP packet while "
-						  "disable_tpa type %x\n",
-						  CQE_TYPE(cqe_fp_type));
+			/* sanity check */
+			if (fp->disable_tpa &&
+			    (CQE_TYPE_START(cqe_fp_type) ||
+			     CQE_TYPE_STOP(cqe_fp_type)))
+				BNX2X_ERR("START/STOP packet while "
+					  "disable_tpa type %x\n",
+					  CQE_TYPE(cqe_fp_type));
 #endif
 
-				if (CQE_TYPE_START(cqe_fp_type)) {
-					u16 queue = cqe_fp->queue_index;
-					DP(NETIF_MSG_RX_STATUS,
-					   "calling tpa_start on queue %d\n",
-					   queue);
-
-					bnx2x_tpa_start(fp, queue, skb,
-							bd_cons, bd_prod,
-							cqe_fp);
-
-					/* Set Toeplitz hash for LRO skb */
-					bnx2x_set_skb_rxhash(bp, cqe, skb);
-
-					goto next_rx;
+			if (CQE_TYPE_START(cqe_fp_type)) {
+				u16 queue = cqe_fp->queue_index;
+				DP(NETIF_MSG_RX_STATUS,
+				   "calling tpa_start on queue %d\n",
+				   queue);
 
-				} else {
-					u16 queue =
-						cqe->end_agg_cqe.queue_index;
-					DP(NETIF_MSG_RX_STATUS,
-					   "calling tpa_stop on queue %d\n",
-					   queue);
+				bnx2x_tpa_start(fp, queue,
+						bd_cons, bd_prod,
+						cqe_fp);
+				goto next_rx;
+			} else {
+				u16 queue =
+					cqe->end_agg_cqe.queue_index;
+				DP(NETIF_MSG_RX_STATUS,
+				   "calling tpa_stop on queue %d\n",
+				   queue);
 
-					bnx2x_tpa_stop(bp, fp, queue,
-						       &cqe->end_agg_cqe,
-						       comp_ring_cons);
+				bnx2x_tpa_stop(bp, fp, queue,
+					       &cqe->end_agg_cqe,
+					       comp_ring_cons);
 #ifdef BNX2X_STOP_ON_ERROR
-					if (bp->panic)
-						return 0;
+				if (bp->panic)
+					return 0;
 #endif
 
-					bnx2x_update_sge_prod(fp, cqe_fp);
-					goto next_cqe;
-				}
+				bnx2x_update_sge_prod(fp, cqe_fp);
+				goto next_cqe;
 			}
-			/* non TPA */
-			len = le16_to_cpu(cqe_fp->pkt_len);
-			pad = cqe_fp->placement_offset;
-			dma_sync_single_for_cpu(&bp->pdev->dev,
+		}
+		/* non TPA */
+		len = le16_to_cpu(cqe_fp->pkt_len);
+		pad = cqe_fp->placement_offset;
+		dma_sync_single_for_cpu(&bp->pdev->dev,
 					dma_unmap_addr(rx_buf, mapping),
-						       pad + RX_COPY_THRESH,
-						       DMA_FROM_DEVICE);
-			prefetch(((char *)(skb)) + L1_CACHE_BYTES);
+					pad + RX_COPY_THRESH,
+					DMA_FROM_DEVICE);
+		pad += NET_SKB_PAD;
+		prefetch(data + pad); /* speedup eth_type_trans() */
+		/* is this an error packet? */
+		if (unlikely(cqe_fp_flags & ETH_RX_ERROR_FALGS)) {
+			DP(NETIF_MSG_RX_ERR,
+			   "ERROR  flags %x  rx packet %u\n",
+			   cqe_fp_flags, sw_comp_cons);
+			fp->eth_q_stats.rx_err_discard_pkt++;
+			goto reuse_rx;
+		}
 
-			/* is this an error packet? */
-			if (unlikely(cqe_fp_flags & ETH_RX_ERROR_FALGS)) {
+		/* Since we don't have a jumbo ring
+		 * copy small packets if mtu > 1500
+		 */
+		if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
+		    (len <= RX_COPY_THRESH)) {
+			skb = netdev_alloc_skb_ip_align(bp->dev, len);
+			if (skb == NULL) {
 				DP(NETIF_MSG_RX_ERR,
-				   "ERROR  flags %x  rx packet %u\n",
-				   cqe_fp_flags, sw_comp_cons);
-				fp->eth_q_stats.rx_err_discard_pkt++;
+				   "ERROR  packet dropped because of alloc failure\n");
+				fp->eth_q_stats.rx_skb_alloc_failed++;
 				goto reuse_rx;
 			}
-
-			/* Since we don't have a jumbo ring
-			 * copy small packets if mtu > 1500
-			 */
-			if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
-			    (len <= RX_COPY_THRESH)) {
-				struct sk_buff *new_skb;
-
-				new_skb = netdev_alloc_skb(bp->dev, len + pad);
-				if (new_skb == NULL) {
-					DP(NETIF_MSG_RX_ERR,
-					   "ERROR  packet dropped "
-					   "because of alloc failure\n");
-					fp->eth_q_stats.rx_skb_alloc_failed++;
-					goto reuse_rx;
-				}
-
-				/* aligned copy */
-				skb_copy_from_linear_data_offset(skb, pad,
-						    new_skb->data + pad, len);
-				skb_reserve(new_skb, pad);
-				skb_put(new_skb, len);
-
-				bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
-
-				skb = new_skb;
-
-			} else
-			if (likely(bnx2x_alloc_rx_skb(bp, fp, bd_prod) == 0)) {
+			memcpy(skb->data, data + pad, len);
+			bnx2x_reuse_rx_data(fp, bd_cons, bd_prod);
+		} else {
+			if (likely(bnx2x_alloc_rx_data(bp, fp, bd_prod) == 0)) {
 				dma_unmap_single(&bp->pdev->dev,
-					dma_unmap_addr(rx_buf, mapping),
+						 dma_unmap_addr(rx_buf, mapping),
 						 fp->rx_buf_size,
 						 DMA_FROM_DEVICE);
+				skb = build_skb(data);
+				if (unlikely(!skb)) {
+					kfree(data);
+					fp->eth_q_stats.rx_skb_alloc_failed++;
+					goto next_rx;
+				}
 				skb_reserve(skb, pad);
-				skb_put(skb, len);
-
 			} else {
 				DP(NETIF_MSG_RX_ERR,
 				   "ERROR  packet dropped because "
 				   "of alloc failure\n");
 				fp->eth_q_stats.rx_skb_alloc_failed++;
 reuse_rx:
-				bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
+				bnx2x_reuse_rx_data(fp, bd_cons, bd_prod);
 				goto next_rx;
 			}
 
+			skb_put(skb, len);
 			skb->protocol = eth_type_trans(skb, bp->dev);
 
 			/* Set Toeplitz hash for a none-LRO skb */
-			bnx2x_set_skb_rxhash(bp, cqe, skb);
+			skb->rxhash = bnx2x_get_rxhash(bp, cqe_fp);
 
 			skb_checksum_none_assert(skb);
 
@@ -767,7 +749,7 @@ reuse_rx:
 
 
 next_rx:
-		rx_buf->skb = NULL;
+		rx_buf->data = NULL;
 
 		bd_cons = NEXT_RX_IDX(bd_cons);
 		bd_prod = NEXT_RX_IDX(bd_prod);
@@ -1013,9 +995,9 @@ void bnx2x_init_rx_rings(struct bnx2x *bp)
 				struct sw_rx_bd *first_buf =
 					&tpa_info->first_buf;
 
-				first_buf->skb = netdev_alloc_skb(bp->dev,
-						       fp->rx_buf_size);
-				if (!first_buf->skb) {
+				first_buf->data = kmalloc(fp->rx_buf_size + NET_SKB_PAD,
+							  GFP_ATOMIC);
+				if (!first_buf->data) {
 					BNX2X_ERR("Failed to allocate TPA "
 						  "skb pool for queue[%d] - "
 						  "disabling TPA on this "
@@ -1118,16 +1100,16 @@ static void bnx2x_free_rx_bds(struct bnx2x_fastpath *fp)
 
 	for (i = 0; i < NUM_RX_BD; i++) {
 		struct sw_rx_bd *rx_buf = &fp->rx_buf_ring[i];
-		struct sk_buff *skb = rx_buf->skb;
+		u8 *data = rx_buf->data;
 
-		if (skb == NULL)
+		if (data == NULL)
 			continue;
 		dma_unmap_single(&bp->pdev->dev,
 				 dma_unmap_addr(rx_buf, mapping),
 				 fp->rx_buf_size, DMA_FROM_DEVICE);
 
-		rx_buf->skb = NULL;
-		dev_kfree_skb(skb);
+		rx_buf->data = NULL;
+		kfree(data);
 	}
 }
 
@@ -1509,6 +1491,7 @@ static inline void bnx2x_set_rx_buf_size(struct bnx2x *bp)
 
 	for_each_queue(bp, i) {
 		struct bnx2x_fastpath *fp = &bp->fp[i];
+		u32 mtu;
 
 		/* Always use a mini-jumbo MTU for the FCoE L2 ring */
 		if (IS_FCOE_IDX(i))
@@ -1518,13 +1501,15 @@ static inline void bnx2x_set_rx_buf_size(struct bnx2x *bp)
 			 * IP_HEADER_ALIGNMENT_PADDING to prevent a buffer
 			 * overrun attack.
 			 */
-			fp->rx_buf_size =
-				BNX2X_FCOE_MINI_JUMBO_MTU + ETH_OVREHEAD +
-				BNX2X_FW_RX_ALIGN + IP_HEADER_ALIGNMENT_PADDING;
+			mtu = BNX2X_FCOE_MINI_JUMBO_MTU;
 		else
-			fp->rx_buf_size =
-				bp->dev->mtu + ETH_OVREHEAD +
-				BNX2X_FW_RX_ALIGN + IP_HEADER_ALIGNMENT_PADDING;
+			mtu = bp->dev->mtu;
+		fp->rx_buf_size = BNX2X_FW_RX_ALIGN_START +
+				  IP_HEADER_ALIGNMENT_PADDING +
+				  ETH_OVREHEAD +
+				  mtu +
+				  BNX2X_FW_RX_ALIGN_END;
+		/* Note : rx_buf_size doesnt take into account NET_SKB_PAD */
 	}
 }
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index 260226d..41eb17e 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -910,26 +910,27 @@ static inline int bnx2x_alloc_rx_sge(struct bnx2x *bp,
 	return 0;
 }
 
-static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
-				     struct bnx2x_fastpath *fp, u16 index)
+static inline int bnx2x_alloc_rx_data(struct bnx2x *bp,
+				      struct bnx2x_fastpath *fp, u16 index)
 {
-	struct sk_buff *skb;
+	u8 *data;
 	struct sw_rx_bd *rx_buf = &fp->rx_buf_ring[index];
 	struct eth_rx_bd *rx_bd = &fp->rx_desc_ring[index];
 	dma_addr_t mapping;
 
-	skb = netdev_alloc_skb(bp->dev, fp->rx_buf_size);
-	if (unlikely(skb == NULL))
+	data = kmalloc(fp->rx_buf_size + NET_SKB_PAD, GFP_ATOMIC);
+	if (unlikely(data == NULL))
 		return -ENOMEM;
 
-	mapping = dma_map_single(&bp->pdev->dev, skb->data, fp->rx_buf_size,
+	mapping = dma_map_single(&bp->pdev->dev, data + NET_SKB_PAD,
+				 fp->rx_buf_size,
 				 DMA_FROM_DEVICE);
 	if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
-		dev_kfree_skb_any(skb);
+		kfree(data);
 		return -ENOMEM;
 	}
 
-	rx_buf->skb = skb;
+	rx_buf->data = data;
 	dma_unmap_addr_set(rx_buf, mapping, mapping);
 
 	rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
@@ -938,12 +939,12 @@ static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
 	return 0;
 }
 
-/* note that we are not allocating a new skb,
+/* note that we are not allocating a new buffer,
  * we are just moving one from cons to prod
  * we are not creating a new mapping,
  * so there is no need to check for dma_mapping_error().
  */
-static inline void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
+static inline void bnx2x_reuse_rx_data(struct bnx2x_fastpath *fp,
 				      u16 cons, u16 prod)
 {
 	struct sw_rx_bd *cons_rx_buf = &fp->rx_buf_ring[cons];
@@ -953,7 +954,7 @@ static inline void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
 
 	dma_unmap_addr_set(prod_rx_buf, mapping,
 			   dma_unmap_addr(cons_rx_buf, mapping));
-	prod_rx_buf->skb = cons_rx_buf->skb;
+	prod_rx_buf->data = cons_rx_buf->data;
 	*prod_bd = *cons_bd;
 }
 
@@ -1029,9 +1030,9 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
 	for (i = 0; i < last; i++) {
 		struct bnx2x_agg_info *tpa_info = &fp->tpa_info[i];
 		struct sw_rx_bd *first_buf = &tpa_info->first_buf;
-		struct sk_buff *skb = first_buf->skb;
+		u8 *data = first_buf->data;
 
-		if (skb == NULL) {
+		if (data == NULL) {
 			DP(NETIF_MSG_IFDOWN, "tpa bin %d empty on free\n", i);
 			continue;
 		}
@@ -1039,8 +1040,8 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
 			dma_unmap_single(&bp->pdev->dev,
 					 dma_unmap_addr(first_buf, mapping),
 					 fp->rx_buf_size, DMA_FROM_DEVICE);
-		dev_kfree_skb(skb);
-		first_buf->skb = NULL;
+		kfree(data);
+		first_buf->data = NULL;
 	}
 }
 
@@ -1148,7 +1149,7 @@ static inline int bnx2x_alloc_rx_bds(struct bnx2x_fastpath *fp,
 	 * fp->eth_q_stats.rx_skb_alloc_failed = 0
 	 */
 	for (i = 0; i < rx_ring_size; i++) {
-		if (bnx2x_alloc_rx_skb(bp, fp, ring_prod) < 0) {
+		if (bnx2x_alloc_rx_data(bp, fp, ring_prod) < 0) {
 			fp->eth_q_stats.rx_skb_alloc_failed++;
 			continue;
 		}
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index f6402fa..ec31871 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -1740,6 +1740,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
 	struct sw_rx_bd *rx_buf;
 	u16 len;
 	int rc = -ENODEV;
+	u8 *data;
 
 	/* check the loopback mode */
 	switch (loopback_mode) {
@@ -1865,10 +1866,9 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
 	dma_sync_single_for_cpu(&bp->pdev->dev,
 				   dma_unmap_addr(rx_buf, mapping),
 				   fp_rx->rx_buf_size, DMA_FROM_DEVICE);
-	skb = rx_buf->skb;
-	skb_reserve(skb, cqe->fast_path_cqe.placement_offset);
+	data = rx_buf->data + NET_SKB_PAD + cqe->fast_path_cqe.placement_offset;
 	for (i = ETH_HLEN; i < pkt_size; i++)
-		if (*(skb->data + i) != (unsigned char) (i & 0xff))
+		if (*(data + i) != (unsigned char) (i & 0xff))
 			goto test_loopback_rx_exit;
 
 	rc = 0;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 33ff60d..80fb9b5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -2789,8 +2789,8 @@ static void bnx2x_pf_rx_q_prep(struct bnx2x *bp,
 	/* This should be a maximum number of data bytes that may be
 	 * placed on the BD (not including paddings).
 	 */
-	rxq_init->buf_sz = fp->rx_buf_size - BNX2X_FW_RX_ALIGN -
-		IP_HEADER_ALIGNMENT_PADDING;
+	rxq_init->buf_sz = fp->rx_buf_size - BNX2X_FW_RX_ALIGN_START -
+		BNX2X_FW_RX_ALIGN_END -	IP_HEADER_ALIGNMENT_PADDING;
 
 	rxq_init->cl_qzone_id = fp->cl_qzone_id;
 	rxq_init->tpa_agg_sz = tpa_agg_size;

^ permalink raw reply related

* Re: [PATCH net-next] ipv6: reduce percpu needs for icmpv6msg mibs
From: Stephen Hemminger @ 2011-11-14 16:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1321183444.17837.21.camel@edumazet-laptop>

On Sun, 13 Nov 2011 12:24:04 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Reading /proc/net/snmp6 on a machine with a lot of cpus is very
> expensive (can be ~88000 us).
> 
> This is because ICMPV6MSG MIB uses 4096 bytes per cpu, and folding
> values for all possible cpus can read 16 Mbytes of memory (32MBytes on
> non x86 arches)
> 
> ICMP messages are not considered as fast path on a typical server, and
> eventually few cpus handle them anyway. We can afford an atomic
> operation instead of using percpu data.
> 
> This saves 4096 bytes per cpu and per network namespace.

Probably should do this for ICMP for IPv4 and all the other
non hot counters. The whole per-cpu SNMP MIB stuff for all counters
is massive overkill.

^ permalink raw reply

* Re: tc: deleting single entry from filter hashtable
From: Eric Dumazet @ 2011-11-14 16:15 UTC (permalink / raw)
  To: Miroslav Kratochvil; +Cc: netdev
In-Reply-To: <CAO0uZ+9N9zwt07EdQzz6KjGcpRT7rdmpaBvHoE0TLDXe-0kxMg@mail.gmail.com>

Le dimanche 13 novembre 2011 à 22:19 +0100, Miroslav Kratochvil a
écrit :
> Hello,
> 
> I have following problem: For performance reasons I've been using a
> setup known from LARTC as 'hashing filters', just like here:
> 
> http://lartc.org/howto/lartc.adv-filter.hashing.html
> 
> For the same performance reasons, I'd like to be able to delete or
> change a _single_ entry from the hash table, so I don't need to refill
> the whole table on every change I need to do (which can get pretty
> slow for amount of entries over 10k, even when using a C program that
> pushes the commands right into 'tc -batch')
> 
> Question: Is the single deletion/modification possible? If not --
> judging from the lack of the documentation on given subject I kindof
> expect that it's not possible -- is there any other possibility to
> delete at least some reasonably small subset of the hash table that
> could be modified and recreated quickly?
> 
> Also, here's my (failed) approach: I tried to assign different prio's
> to all hashtable entries so they could be deleted by reference to that
> prio, but after passing through TC they got created with the same prio
> anyway..
> 
> Thanks for any help with this problem,

Not sure I understand the exact problem, but it is possible to delete a
single filter, and add a new one.

Be careful, since adding a new one really adds a new filter : duplicates
are not detected.

$ tc filter add dev eth0 parent 8001:0 protocol ip prio 100 u32 match ip src  1.2.0.0 classid 1:3
$ tc -s -d filter show dev eth0
filter parent 8001: protocol ip pref 100 u32 
filter parent 8001: protocol ip pref 100 u32 fh 800: ht divisor 1 
filter parent 8001: protocol ip pref 100 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:3  (rule hit 4 success 0)
  match 01020000/ffffffff at 12 (success 0 ) 
$ tc filter add dev eth0 parent 8001:0 protocol ip prio 100 u32 match ip src  1.2.0.0 classid 1:3
$ tc -s -d filter show dev eth0
filter parent 8001: protocol ip pref 100 u32 
filter parent 8001: protocol ip pref 100 u32 fh 800: ht divisor 1 
filter parent 8001: protocol ip pref 100 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:3  (rule hit 4 success 0)
  match 01020000/ffffffff at 12 (success 0 ) 
filter parent 8001: protocol ip pref 100 u32 fh 800::801 order 2049 key ht 800 bkt 0 flowid 1:3  (rule hit 0 success 0)
  match 01020000/ffffffff at 12 (success 0 ) 

^ permalink raw reply

* Re: [PATCH RFC] ndo: ndo_queue_xmit/ndo_flush_xmit (was Re: [RFC] [ver3 PATCH 0/6] Implement multiqueue virtio-net)
From: Michael S. Tsirkin @ 2011-11-14 16:21 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: netdev, davem, kvm, virtualization
In-Reply-To: <20111113174827.GA23310@redhat.com>

On Sun, Nov 13, 2011 at 07:48:28PM +0200, Michael S. Tsirkin wrote:
> @@ -863,6 +872,9 @@ struct net_device_ops {
>  	int			(*ndo_stop)(struct net_device *dev);
>  	netdev_tx_t		(*ndo_start_xmit) (struct sk_buff *skb,
>  						   struct net_device *dev);
> +	netdev_tx_t		(*ndo_queue_xmit)(struct sk_buff *skb,
> +						  struct net_device *dev);
> +	void			(*ndo_flush_xmit)(struct net_device *dev);
>  	u16			(*ndo_select_queue)(struct net_device *dev,
>  						    struct sk_buff *skb);
>  	void			(*ndo_change_rx_flags)(struct net_device *dev,
> @@ -2099,6 +2111,10 @@ extern int		dev_set_mac_address(struct net_device *,

An alternative I considered was to add a boolean flag to ndo_start_xmit
'bool queue' or something like this, plus ndo_flush_xmit.
This will lead to cleaner code I think but will require all drivers
to be changed, so for a proof of concept I decided to go for one
that is less work.

Let me know what looks more palatable ... 

-- 
MST

^ permalink raw reply

* RE: [PATCH net-next 2/2] bnx2x: uses build_skb() in receive path
From: David Laight @ 2011-11-14 16:21 UTC (permalink / raw)
  To: Eric Dumazet, David Miller
  Cc: netdev, Ben Hutchings, Tom Herbert, Jamal Hadi Salim,
	Stephen Hemminger, Thomas Graf, Herbert Xu, Jeff Kirsher,
	pstaszewski, eilong
In-Reply-To: <1321286734.2272.48.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

 
> bnx2x uses following formula to compute its rx_buf_sz :
> 
> dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2
> 
> Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info))
> 
> Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
> MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.
> 
> Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
> false sharing because of mem_reclaim in UDP stack.
> 
> One possible way to half truesize is to reduce the need by 64 bytes
> (2112 -> 2048 bytes)

Would seem to be worth trying to reduce the size for systems
where the L1 caches lines are 128 bytes - I think that includes
some of the large ppc systems.

If the alignment of the malloced memory can't be assumed/forced
maybe one of the sub-structures could be dynamically placed
before the data buffer?

	David

^ permalink raw reply

* Re: [PATCH net-next] ipv6: reduce percpu needs for icmpv6msg mibs
From: Eric Dumazet @ 2011-11-14 16:26 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20111114081253.226e2b66@nehalam.linuxnetplumber.net>

Le lundi 14 novembre 2011 à 08:12 -0800, Stephen Hemminger a écrit :

> Probably should do this for ICMP for IPv4 and all the other
> non hot counters. The whole per-cpu SNMP MIB stuff for all counters
> is massive overkill.

Well, this (ICMP for IPv4) was already done some days ago, I did the
IPv6 part after IPv4, once David agreed it was a useful patch.


Thanks !

^ permalink raw reply

* RE: [PATCH net-next 2/2] bnx2x: uses build_skb() in receive path
From: Eric Dumazet @ 2011-11-14 16:36 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, netdev, Ben Hutchings, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger, Thomas Graf, Herbert Xu,
	Jeff Kirsher, pstaszewski, eilong
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AEC2@saturn3.aculab.com>

Le lundi 14 novembre 2011 à 16:21 +0000, David Laight a écrit :
> > bnx2x uses following formula to compute its rx_buf_sz :
> > 
> > dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2
> > 
> > Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info))
> > 
> > Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
> > MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.
> > 
> > Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
> > false sharing because of mem_reclaim in UDP stack.
> > 
> > One possible way to half truesize is to reduce the need by 64 bytes
> > (2112 -> 2048 bytes)
> 
> Would seem to be worth trying to reduce the size for systems
> where the L1 caches lines are 128 bytes - I think that includes
> some of the large ppc systems.
> 
> If the alignment of the malloced memory can't be assumed/forced
> maybe one of the sub-structures could be dynamically placed
> before the data buffer?
> 

What do you call sub-structure ?

Assuming you speak of skb_shared_info, moving at the start wont change
overall head size.

large ppc systems have 64Kb pages and SK_MEM_QUANTUM=64Kb, so it's less
a problem.

One other way would be to have a shinfo->nr_max_frags field, to reduce
max number of fragments, instead of assumin MAX_SKB_FRAGS, for 'legacy'
drivers.

^ permalink raw reply

* Re: net: Add network priority cgroup
From: John Fastabend @ 2011-11-14 16:38 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Taht, netdev@vger.kernel.org, Love, Robert W,
	David S. Miller
In-Reply-To: <20111114144358.GB27284@hmsreliant.think-freely.org>

On 11/14/2011 6:43 AM, Neil Horman wrote:
> On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
>> On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
>>> On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
>>>> Data Center Bridging environments are currently somewhat limited in their
>>>> ability to provide a general mechanism for controlling traffic priority.
>>>> Specifically they are unable to administratively control the priority at which
>>>> various types of network traffic are sent.
>>>>
>>>> Currently, the only ways to set the priority of a network buffer are:
>>>>
>>>> 1) Through the use of the SO_PRIORITY socket option
>>>> 2) By using low level hooks, like a tc action
>>>>
>>>> (1) is difficult from an administrative perspective because it requires that the
>>>> application to be coded to not just assume the default priority is sufficient,
>>>> and must expose an administrative interface to allow priority adjustment.  Such
>>>> a solution is not scalable in a DCB environment
>>>>
>>>> (2) is also difficult, as it requires constant administrative oversight of
>>>> applications so as to build appropriate rules to match traffic belonging to
>>>> various classes, so that priority can be appropriately set. It is further
>>>> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
>>>> only run after a root qdisc has been selected (DCB enabled hardware may reserve
>>>> hw queues for various traffic classes and needs the priority to be set prior to
>>>> selecting the root qdisc)
>>>>
>>>>
>>>> I've discussed various solutions with John Fastabend, and we saw a cgroup as
>>>> being a good general solution to this problem.  The network priority cgroup
>>>> allows for a per-interface priority map to be built per cgroup.  Any traffic
>>>> originating from an application in a cgroup, that does not explicitly set its
>>>> priority with SO_PRIORITY will have its priority assigned to the value
>>>> designated for that group on that interface.  This allows a user space daemon,
>>>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>>>> based on the APP_TLV value received and administratively assign applications to
>>>> that priority using the existing cgroup utility infrastructure.
>>>>
>>>> Tested by John and myself, with good results
>>>>
>>>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>>>> CC: John Fastabend <john.r.fastabend@intel.com>
>>>> CC: Robert Love <robert.w.love@intel.com>
>>>> CC: "David S. Miller" <davem@davemloft.net>
>>>> --

Acked-by: John Fastabend <john.r.fastabend@intel.com>

>>>
>>> Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
>>> the use of skb->priority, but IMO they really seem orthogonal to the purpose of
>>> this change.  Any other reviews would be welcome.
>>
>> Well, in part I've been playing catchup in the hope that lldp and
>> openlldp and/or this dcb netlink layer that I don't know anything
>> about (pointers please?) could help somehow to resolve the semantic
>> mess skb->priority has become in the first place.
>>
>> I liked what was described here.
>>
>> "What if we did at least carve out the DCB functionality away from
>> skb->priority?  Since, AIUI, we're only concerning ourselves with
>> locally generated traffic here, we're talking
>> about skbs that have a socket attached to them.  We could, instead of indexing
>> the prio_tc_map with skb->priority, we could index it with
>> skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The cgroup
>> then could be, instead of a strict priority cgroup, a queue_selector cgroup (or
>> something more appropriately named), and we don't have to touch skb->priority at
>> all.  I'd really rather not start down that road until I got more opinions and
>> consensus on that, but it seems like a pretty good solution, one that would
>> allow hardware queue selection in systems that use things like DCB to co-exist
>> with software queueing features."
>>
> I was initially ok with this, but the more I think about it, the more I think
> its just not needed (see further down in this email for my reasoning).  John,
> Rob, do you have any thoughts here?
> 

I agree the original mechanism seems sufficient. skb->priority already has
all the qdisc and netfilter infrastructure in place to be used and using
it to prioritize "steer" packets at queues seems reasonable to me. Using
the skb->priority this way is not new pfifo_fast uses it to pick a band
and sch_prio does some similar prioritization, mqprio is a multiqueue
variant.

>> The piece that still kind of bothered me about the original proposal
>> (and perhaps this one) was that setting SO_PRIORITY in an app means
>> 'give my packets more mojo'.
>>
>> Taking something that took unprioritized packets and assigned them and
>> *them only* to a hardware queue struck me as possibly deprioritizing
>> the 'more mojo wanted' packets in the app(s), as they would end up in
>> some other, possibly overloaded, hardware queue.
>>
> I don't really see what you mean by this at all.  Taking packets with no
> priority and assigning them a priority doesn't really have an effect on
> pre-prioritized packets.  Or rather it shouldn't.  You can certainly create a
> problem by having apps prioritized according to conflicting semantic rules, but
> that strikes me as administrative error.  Garbage in...Garbage out.
> 
>> So a cgroup that moves all of the packets from an application into a
>> given hardware queue, and then gets scheduled normally according to
>> skb->priority and friends (software queue, default of pfifo_fast,
>> etc), seems to make some sense to me. (I wouldn't mind if we had
>> abstractions for software queues, too, like, I need a software queue
>> with these properties, find me a place for it on the hardware - but
>> I'm dreaming)
>>
>> One open question is where do packets generated from other subsystems
>> end up, if you are using a cgroup for the app? arp, dns, etc?
>>
> The overriding rule is the association of an skb to a socket.  If a transmitted
> frame has skb->sk set in dev_queue_xmit, then we interrogate its priority index
> as set when we passed through the sendmsg code at the top of the stack.
> Otherwise its behavior is unchanged from its current standpoint.
> 

Having a queue selection (skb->queue_mapping?) cgroup also would defeat
any hashing across multiple queues. With mqprio we can assign many hardware
queues to a skb->priority.

w.r.t. software queue abstractions don't we already have this? mq and
mqprio enumerate a software qdisc per hardware queue. You can attach
your favorite qdisc to these. This is likely off-topic for this thread though.

[...]

> In the end, I think its just plain old more useful to assign priorty here than
> some new thing.

I agree.

Thanks,
John


> 
> Neil
>  
>>> John, Robert, if you're supportive of these changes, some Acks would be
>>> appreciated.
>>
>>
>>>
>>>
>>> Regards
>>> Neil
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
>>
>> -- 
>> Dave Täht
>> SKYPE: davetaht
>> US Tel: 1-239-829-5608
>> FR Tel: 0638645374
>> http://www.bufferbloat.net
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

^ permalink raw reply

* RE: net: Add network priority cgroup
From: Shyam_Iyer @ 2011-11-14 16:43 UTC (permalink / raw)
  To: nhorman, dave.taht; +Cc: netdev, john.r.fastabend, robert.w.love, davem
In-Reply-To: <20111114144358.GB27284@hmsreliant.think-freely.org>



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Neil Horman
> Sent: Monday, November 14, 2011 9:44 AM
> To: Dave Taht
> Cc: netdev@vger.kernel.org; John Fastabend; Robert Love; David S.
> Miller
> Subject: Re: net: Add network priority cgroup
> 
> On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> > On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com>
> wrote:
> > > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> > >> Data Center Bridging environments are currently somewhat limited
> in their
> > >> ability to provide a general mechanism for controlling traffic
> priority.
> > >> Specifically they are unable to administratively control the
> priority at which
> > >> various types of network traffic are sent.
> > >>
> > >> Currently, the only ways to set the priority of a network buffer
> are:
> > >>
> > >> 1) Through the use of the SO_PRIORITY socket option
> > >> 2) By using low level hooks, like a tc action
> > >>
> > >> (1) is difficult from an administrative perspective because it
> requires that the
> > >> application to be coded to not just assume the default priority is
> sufficient,
> > >> and must expose an administrative interface to allow priority
> adjustment.  Such
> > >> a solution is not scalable in a DCB environment
> > >>
> > >> (2) is also difficult, as it requires constant administrative
> oversight of
> > >> applications so as to build appropriate rules to match traffic
> belonging to
> > >> various classes, so that priority can be appropriately set. It is
> further
> > >> limiting when DCB enabled hardware is in use, due to the fact that
> tc rules are
> > >> only run after a root qdisc has been selected (DCB enabled
> hardware may reserve
> > >> hw queues for various traffic classes and needs the priority to be
> set prior to
> > >> selecting the root qdisc)
> > >>
> > >>
> > >> I've discussed various solutions with John Fastabend, and we saw a
> cgroup as
> > >> being a good general solution to this problem.  The network
> priority cgroup
> > >> allows for a per-interface priority map to be built per cgroup.
>  Any traffic
> > >> originating from an application in a cgroup, that does not
> explicitly set its
> > >> priority with SO_PRIORITY will have its priority assigned to the
> value
> > >> designated for that group on that interface.  This allows a user
> space daemon,
> > >> when conducting LLDP negotiation with a DCB enabled peer to create
> a cgroup
> > >> based on the APP_TLV value received and administratively assign
> applications to
> > >> that priority using the existing cgroup utility infrastructure.
> > >>
> > >> Tested by John and myself, with good results
> > >>
> > >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > >> CC: John Fastabend <john.r.fastabend@intel.com>
> > >> CC: Robert Love <robert.w.love@intel.com>
> > >> CC: "David S. Miller" <davem@davemloft.net>
> > >> --
> > >> To unsubscribe from this list: send the line "unsubscribe netdev"
> in
> > >> the body of a message to majordomo@vger.kernel.org
> > >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >>
> > >
> > > Bump, any other thoughts here?  Dave T. has some reasonable
> thoughts regarding
> > > the use of skb->priority, but IMO they really seem orthogonal to
> the purpose of
> > > this change.  Any other reviews would be welcome.
> >
> > Well, in part I've been playing catchup in the hope that lldp and
> > openlldp and/or this dcb netlink layer that I don't know anything
> > about (pointers please?) could help somehow to resolve the semantic
> > mess skb->priority has become in the first place.
> >
> > I liked what was described here.
> >
> > "What if we did at least carve out the DCB functionality away from
> > skb->priority?  Since, AIUI, we're only concerning ourselves with
> > locally generated traffic here, we're talking
> > about skbs that have a socket attached to them.  We could, instead of
> indexing
> > the prio_tc_map with skb->priority, we could index it with
> > skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The
> cgroup
> > then could be, instead of a strict priority cgroup, a queue_selector
> cgroup (or
> > something more appropriately named), and we don't have to touch skb-
> >priority at
> > all.  I'd really rather not start down that road until I got more
> opinions and
> > consensus on that, but it seems like a pretty good solution, one that
> would
> > allow hardware queue selection in systems that use things like DCB to
> co-exist
> > with software queueing features."
> >
> I was initially ok with this, but the more I think about it, the more I
> think
> its just not needed (see further down in this email for my reasoning).
> John,
> Rob, do you have any thoughts here?
> 
> > The piece that still kind of bothered me about the original proposal
> > (and perhaps this one) was that setting SO_PRIORITY in an app means
> > 'give my packets more mojo'.
> >
> > Taking something that took unprioritized packets and assigned them
> and
> > *them only* to a hardware queue struck me as possibly deprioritizing
> > the 'more mojo wanted' packets in the app(s), as they would end up in
> > some other, possibly overloaded, hardware queue.
> >
> I don't really see what you mean by this at all.  Taking packets with
> no
> priority and assigning them a priority doesn't really have an effect on
> pre-prioritized packets.  Or rather it shouldn't.  You can certainly
> create a
> problem by having apps prioritized according to conflicting semantic
> rules, but
> that strikes me as administrative error.  Garbage in...Garbage out.
> 
> > So a cgroup that moves all of the packets from an application into a
> > given hardware queue, and then gets scheduled normally according to
> > skb->priority and friends (software queue, default of pfifo_fast,
> > etc), seems to make some sense to me. (I wouldn't mind if we had
> > abstractions for software queues, too, like, I need a software queue
> > with these properties, find me a place for it on the hardware - but
> > I'm dreaming)
> >
> > One open question is where do packets generated from other subsystems
> > end up, if you are using a cgroup for the app? arp, dns, etc?
> >
> The overriding rule is the association of an skb to a socket.  If a
> transmitted
> frame has skb->sk set in dev_queue_xmit, then we interrogate its
> priority index
> as set when we passed through the sendmsg code at the top of the stack.
> Otherwise its behavior is unchanged from its current standpoint.
> 
> > So to rephrase your original description from this:
> >
> > >> Any traffic originating from an application in a cgroup, that does
> not explicitly set its
> > >> priority with SO_PRIORITY will have its priority assigned to the
> value
> > >> designated for that group on that interface.  This allows a user
> space daemon,
> > >> when conducting LLDP negotiation with a DCB enabled peer to create
> a cgroup
> > >> based on the APP_TLV value received and administratively assign
> applications to
> > >> that priority using the existing cgroup utility infrastructure.
> > > John, Robert, if you're supportive of these changes, some Acks
> would be
> > > appreciated.
> >
> > To this:
> >
> > "Any traffic originating from an application in a cgroup,  will have
> > its hardware queue  assigned to the value designated for that group
> on
> > that interface.  This allows a user space daemon, when conducting
> LLDP
> > negotiation with a DCB enabled peer to create a cgroup based on the
> > APP_TLV value received and administratively assign applications to
> > that hardware queue using the existing cgroup utility
> infrastructure."
> >
> As above, I'm split brained about this.  I'm ok with the idea of making
> this a
> queue selection cgroup, and separating it from priority, but at the
> same time,
> in the context of DCB, we really are assigning priority here, so it
> seems a bit
> false to do something that is not priority.  I also like the fact that
> it
> provides administrative control in a way that netfilter and tc don't
> really
> enable.
> 
> > Assuming we're on the same page here, what the heck is APP_TLV?
> >
> LLDP does layer 2 discovery with peer networking devices. It does so
> using sets
> of Type/length/value tuples.  The types carry various bits of
> information, such
> as which priority groups are available on the network.  The APP tlv
> conveys
> application or feature specific information.  for instance, There is an
> ISCSI
> app tlv that tells the host that  "on the interface you received this
> tlv, iscsi
> traffic must be sent at priority X".  The idea being that, on receipt
> of this
> tlv, the DCB daemon can create an ISCSI network priority cgroup
> instance, and
> augment the cgroup rules file such that, when the user space iscsi
> daemon is
> started, its traffic automatically transmits at the appropriate
> priority.

Love this !

I guess if this is integrated to libvirt via libcgroups VMs could be assigned a network priority..

http://linuxplumbersconf.org/2010/ocw/proposals/843

^ permalink raw reply

* Re: net: Add network priority cgroup
From: David Täht @ 2011-11-14 16:44 UTC (permalink / raw)
  To: John Fastabend
  Cc: Neil Horman, netdev@vger.kernel.org, Love, Robert W,
	David S. Miller
In-Reply-To: <4EC143FC.5060704@intel.com>

[-- Attachment #1: Type: text/plain, Size: 304 bytes --]

On 11/14/2011 05:38 PM, John Fastabend wrote:
> Acked-by: John Fastabend <john.r.fastabend@intel.com> 

Alright. Do as you will.

I would comfort me to know if 802.1Qau (QCN) was going to be enabled by 
default.

http://blog.ioshints.info/2010/11/data-center-bridging-dcb-congestion.html

-- 
Dave Täht


[-- Attachment #2: dave_taht.vcf --]
[-- Type: text/x-vcard, Size: 204 bytes --]

begin:vcard
fn;quoted-printable:Dave T=C3=A4ht
n;quoted-printable:T=C3=A4ht;Dave
email;internet:dave.taht@gmail.com
tel;home:1-239-829-5608
tel;cell:0638645374
x-mozilla-html:FALSE
version:2.1
end:vcard


^ permalink raw reply

* Re: [PATCH net-next 2/2] bnx2x: uses build_skb() in receive path
From: Eilon Greenstein @ 2011-11-14 16:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev@vger.kernel.org, Ben Hutchings, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger, Thomas Graf, Herbert Xu,
	Jeff Kirsher, pstaszewski@itcare.pl
In-Reply-To: <1321286734.2272.48.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Mon, 2011-11-14 at 08:05 -0800, Eric Dumazet wrote:
> bnx2x uses following formula to compute its rx_buf_sz :
> 
> dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2
> 
> Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info))
> 
> Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
> MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.
> 
> Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
> false sharing because of mem_reclaim in UDP stack.
> 
> One possible way to half truesize is to reduce the need by 64 bytes
> (2112 -> 2048 bytes)
> 
> Instead of allocating a full cache line at the end of packet for
> alignment, we can use the fact that skb_shared_info sits at the end of
> skb->head, and we can use this room, if we convert bnx2x to new
> build_skb() infrastructure.
> 
> skb_shared_info will be initialized after hardware finished its
> transfert, so we can eventually overwrite the final padding.
> 
> Using build_skb() also reduces cache line misses in the driver, since we
> use cache hot skb instead of cold ones. Number of in-flight sk_buff
> structures is lower, they are recycled while still hot.
> 
> Performance results :
> 
> (820.000 pps on a rx UDP monothread benchmark, instead of 720.000 pps)

Wow! This is very impressive. The only problem will be to back port this
driver now... But it is definitely worth the effort :)

> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Eilon Greenstein <eilong@broadcom.com>

> CC: Eilon Greenstein <eilong@broadcom.com>
> CC: Ben Hutchings <bhutchings@solarflare.com>
> CC: Tom Herbert <therbert@google.com>
> CC: Jamal Hadi Salim <hadi@mojatatu.com>
> CC: Stephen Hemminger <shemminger@vyatta.com>
> CC: Thomas Graf <tgraf@infradead.org>
> CC: Herbert Xu <herbert@gondor.apana.org.au>
> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x.h         |   30 -
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c     |  265 ++++------
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h     |   33 -
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |    6
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c    |    4
>  5 files changed, 170 insertions(+), 168 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> index 4b90b51..0f7b7a4 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> @@ -293,8 +293,13 @@ enum {
>  #define FCOE_TXQ_IDX(bp)       (MAX_ETH_TXQ_IDX(bp))
> 
>  /* fast path */
> +/*
> + * This driver uses new build_skb() API :
> + * RX ring buffer contains pointer to kmalloc() data only,
> + * skb are built only after Hardware filled the frame.
> + */
>  struct sw_rx_bd {
> -       struct sk_buff  *skb;
> +       u8              *data;
>         DEFINE_DMA_UNMAP_ADDR(mapping);
>  };
> 
> @@ -424,8 +429,8 @@ union host_hc_status_block {
> 
>  struct bnx2x_agg_info {
>         /*
> -        * First aggregation buffer is an skb, the following - are pages.
> -        * We will preallocate the skbs for each aggregation when
> +        * First aggregation buffer is a data buffer, the following - are pages.
> +        * We will preallocate the data buffer for each aggregation when
>          * we open the interface and will replace the BD at the consumer
>          * with this one when we receive the TPA_START CQE in order to
>          * keep the Rx BD ring consistent.
> @@ -439,6 +444,7 @@ struct bnx2x_agg_info {
>         u16                     parsing_flags;
>         u16                     vlan_tag;
>         u16                     len_on_bd;
> +       u32                     rxhash;
>  };
> 
>  #define Q_STATS_OFFSET32(stat_name) \
> @@ -1187,10 +1193,20 @@ struct bnx2x {
>  #define ETH_MAX_JUMBO_PACKET_SIZE      9600
> 
>         /* Max supported alignment is 256 (8 shift) */
> -#define BNX2X_RX_ALIGN_SHIFT           ((L1_CACHE_SHIFT < 8) ? \
> -                                        L1_CACHE_SHIFT : 8)
> -       /* FW use 2 Cache lines Alignment for start packet and size  */
> -#define BNX2X_FW_RX_ALIGN              (2 << BNX2X_RX_ALIGN_SHIFT)
> +#define BNX2X_RX_ALIGN_SHIFT           min(8, L1_CACHE_SHIFT)
> +
> +       /* FW uses 2 Cache lines Alignment for start packet and size
> +        *
> +        * We assume skb_build() uses sizeof(struct skb_shared_info) bytes
> +        * at the end of skb->data, to avoid wasting a full cache line.
> +        * This reduces memory use (skb->truesize).
> +        */
> +#define BNX2X_FW_RX_ALIGN_START        (1UL << BNX2X_RX_ALIGN_SHIFT)
> +
> +#define BNX2X_FW_RX_ALIGN_END                                  \
> +       max(1UL << BNX2X_RX_ALIGN_SHIFT,                        \
> +           SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> +
>  #define BNX2X_PXP_DRAM_ALIGN           (BNX2X_RX_ALIGN_SHIFT - 5)
> 
>         struct host_sp_status_block *def_status_blk;
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> index 13dad92..0d60b9e 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -294,8 +294,21 @@ static void bnx2x_update_sge_prod(struct bnx2x_fastpath *fp,
>            fp->last_max_sge, fp->rx_sge_prod);
>  }
> 
> +/* Set Toeplitz hash value in the skb using the value from the
> + * CQE (calculated by HW).
> + */
> +static u32 bnx2x_get_rxhash(const struct bnx2x *bp,
> +                           const struct eth_fast_path_rx_cqe *cqe)
> +{
> +       /* Set Toeplitz hash from CQE */
> +       if ((bp->dev->features & NETIF_F_RXHASH) &&
> +           (cqe->status_flags & ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
> +               return le32_to_cpu(cqe->rss_hash_result);
> +       return 0;
> +}
> +
>  static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
> -                           struct sk_buff *skb, u16 cons, u16 prod,
> +                           u16 cons, u16 prod,
>                             struct eth_fast_path_rx_cqe *cqe)
>  {
>         struct bnx2x *bp = fp->bp;
> @@ -310,9 +323,9 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
>         if (tpa_info->tpa_state != BNX2X_TPA_STOP)
>                 BNX2X_ERR("start of bin not in stop [%d]\n", queue);
> 
> -       /* Try to map an empty skb from the aggregation info  */
> +       /* Try to map an empty data buffer from the aggregation info  */
>         mapping = dma_map_single(&bp->pdev->dev,
> -                                first_buf->skb->data,
> +                                first_buf->data + NET_SKB_PAD,
>                                  fp->rx_buf_size, DMA_FROM_DEVICE);
>         /*
>          *  ...if it fails - move the skb from the consumer to the producer
> @@ -322,15 +335,15 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
> 
>         if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
>                 /* Move the BD from the consumer to the producer */
> -               bnx2x_reuse_rx_skb(fp, cons, prod);
> +               bnx2x_reuse_rx_data(fp, cons, prod);
>                 tpa_info->tpa_state = BNX2X_TPA_ERROR;
>                 return;
>         }
> 
> -       /* move empty skb from pool to prod */
> -       prod_rx_buf->skb = first_buf->skb;
> +       /* move empty data from pool to prod */
> +       prod_rx_buf->data = first_buf->data;
>         dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
> -       /* point prod_bd to new skb */
> +       /* point prod_bd to new data */
>         prod_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>         prod_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> 
> @@ -344,6 +357,7 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
>         tpa_info->tpa_state = BNX2X_TPA_START;
>         tpa_info->len_on_bd = le16_to_cpu(cqe->len_on_bd);
>         tpa_info->placement_offset = cqe->placement_offset;
> +       tpa_info->rxhash = bnx2x_get_rxhash(bp, cqe);
> 
>  #ifdef BNX2X_STOP_ON_ERROR
>         fp->tpa_queue_used |= (1 << queue);
> @@ -471,11 +485,12 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>  {
>         struct bnx2x_agg_info *tpa_info = &fp->tpa_info[queue];
>         struct sw_rx_bd *rx_buf = &tpa_info->first_buf;
> -       u8 pad = tpa_info->placement_offset;
> +       u32 pad = tpa_info->placement_offset;
>         u16 len = tpa_info->len_on_bd;
> -       struct sk_buff *skb = rx_buf->skb;
> +       struct sk_buff *skb = NULL;
> +       u8 *data = rx_buf->data;
>         /* alloc new skb */
> -       struct sk_buff *new_skb;
> +       u8 *new_data;
>         u8 old_tpa_state = tpa_info->tpa_state;
> 
>         tpa_info->tpa_state = BNX2X_TPA_STOP;
> @@ -486,18 +501,18 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>         if (old_tpa_state == BNX2X_TPA_ERROR)
>                 goto drop;
> 
> -       /* Try to allocate the new skb */
> -       new_skb = netdev_alloc_skb(bp->dev, fp->rx_buf_size);
> +       /* Try to allocate the new data */
> +       new_data = kmalloc(fp->rx_buf_size + NET_SKB_PAD, GFP_ATOMIC);
> 
>         /* Unmap skb in the pool anyway, as we are going to change
>            pool entry status to BNX2X_TPA_STOP even if new skb allocation
>            fails. */
>         dma_unmap_single(&bp->pdev->dev, dma_unmap_addr(rx_buf, mapping),
>                          fp->rx_buf_size, DMA_FROM_DEVICE);
> +       if (likely(new_data))
> +               skb = build_skb(data);
> 
> -       if (likely(new_skb)) {
> -               prefetch(skb);
> -               prefetch(((char *)(skb)) + L1_CACHE_BYTES);
> +       if (likely(skb)) {
> 
>  #ifdef BNX2X_STOP_ON_ERROR
>                 if (pad + len > fp->rx_buf_size) {
> @@ -509,8 +524,9 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>                 }
>  #endif
> 
> -               skb_reserve(skb, pad);
> +               skb_reserve(skb, pad + NET_SKB_PAD);
>                 skb_put(skb, len);
> +               skb->rxhash = tpa_info->rxhash;
> 
>                 skb->protocol = eth_type_trans(skb, bp->dev);
>                 skb->ip_summed = CHECKSUM_UNNECESSARY;
> @@ -526,8 +542,8 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>                 }
> 
> 
> -               /* put new skb in bin */
> -               rx_buf->skb = new_skb;
> +               /* put new data in bin */
> +               rx_buf->data = new_data;
> 
>                 return;
>         }
> @@ -539,19 +555,6 @@ drop:
>         fp->eth_q_stats.rx_skb_alloc_failed++;
>  }
> 
> -/* Set Toeplitz hash value in the skb using the value from the
> - * CQE (calculated by HW).
> - */
> -static inline void bnx2x_set_skb_rxhash(struct bnx2x *bp, union eth_rx_cqe *cqe,
> -                                       struct sk_buff *skb)
> -{
> -       /* Set Toeplitz hash from CQE */
> -       if ((bp->dev->features & NETIF_F_RXHASH) &&
> -           (cqe->fast_path_cqe.status_flags &
> -            ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
> -               skb->rxhash =
> -               le32_to_cpu(cqe->fast_path_cqe.rss_hash_result);
> -}
> 
>  int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>  {
> @@ -594,6 +597,7 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                 u8 cqe_fp_flags;
>                 enum eth_rx_cqe_type cqe_fp_type;
>                 u16 len, pad;
> +               u8 *data;
> 
>  #ifdef BNX2X_STOP_ON_ERROR
>                 if (unlikely(bp->panic))
> @@ -604,13 +608,6 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                 bd_prod = RX_BD(bd_prod);
>                 bd_cons = RX_BD(bd_cons);
> 
> -               /* Prefetch the page containing the BD descriptor
> -                  at producer's index. It will be needed when new skb is
> -                  allocated */
> -               prefetch((void *)(PAGE_ALIGN((unsigned long)
> -                                            (&fp->rx_desc_ring[bd_prod])) -
> -                                 PAGE_SIZE + 1));
> -
>                 cqe = &fp->rx_comp_ring[comp_ring_cons];
>                 cqe_fp = &cqe->fast_path_cqe;
>                 cqe_fp_flags = cqe_fp->type_error_flags;
> @@ -626,125 +623,110 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                 if (unlikely(CQE_TYPE_SLOW(cqe_fp_type))) {
>                         bnx2x_sp_event(fp, cqe);
>                         goto next_cqe;
> +               }
> +               rx_buf = &fp->rx_buf_ring[bd_cons];
> +               data = rx_buf->data;
> 
> -               /* this is an rx packet */
> -               } else {
> -                       rx_buf = &fp->rx_buf_ring[bd_cons];
> -                       skb = rx_buf->skb;
> -                       prefetch(skb);
> -
> -                       if (!CQE_TYPE_FAST(cqe_fp_type)) {
> +               if (!CQE_TYPE_FAST(cqe_fp_type)) {
>  #ifdef BNX2X_STOP_ON_ERROR
> -                               /* sanity check */
> -                               if (fp->disable_tpa &&
> -                                   (CQE_TYPE_START(cqe_fp_type) ||
> -                                    CQE_TYPE_STOP(cqe_fp_type)))
> -                                       BNX2X_ERR("START/STOP packet while "
> -                                                 "disable_tpa type %x\n",
> -                                                 CQE_TYPE(cqe_fp_type));
> +                       /* sanity check */
> +                       if (fp->disable_tpa &&
> +                           (CQE_TYPE_START(cqe_fp_type) ||
> +                            CQE_TYPE_STOP(cqe_fp_type)))
> +                               BNX2X_ERR("START/STOP packet while "
> +                                         "disable_tpa type %x\n",
> +                                         CQE_TYPE(cqe_fp_type));
>  #endif
> 
> -                               if (CQE_TYPE_START(cqe_fp_type)) {
> -                                       u16 queue = cqe_fp->queue_index;
> -                                       DP(NETIF_MSG_RX_STATUS,
> -                                          "calling tpa_start on queue %d\n",
> -                                          queue);
> -
> -                                       bnx2x_tpa_start(fp, queue, skb,
> -                                                       bd_cons, bd_prod,
> -                                                       cqe_fp);
> -
> -                                       /* Set Toeplitz hash for LRO skb */
> -                                       bnx2x_set_skb_rxhash(bp, cqe, skb);
> -
> -                                       goto next_rx;
> +                       if (CQE_TYPE_START(cqe_fp_type)) {
> +                               u16 queue = cqe_fp->queue_index;
> +                               DP(NETIF_MSG_RX_STATUS,
> +                                  "calling tpa_start on queue %d\n",
> +                                  queue);
> 
> -                               } else {
> -                                       u16 queue =
> -                                               cqe->end_agg_cqe.queue_index;
> -                                       DP(NETIF_MSG_RX_STATUS,
> -                                          "calling tpa_stop on queue %d\n",
> -                                          queue);
> +                               bnx2x_tpa_start(fp, queue,
> +                                               bd_cons, bd_prod,
> +                                               cqe_fp);
> +                               goto next_rx;
> +                       } else {
> +                               u16 queue =
> +                                       cqe->end_agg_cqe.queue_index;
> +                               DP(NETIF_MSG_RX_STATUS,
> +                                  "calling tpa_stop on queue %d\n",
> +                                  queue);
> 
> -                                       bnx2x_tpa_stop(bp, fp, queue,
> -                                                      &cqe->end_agg_cqe,
> -                                                      comp_ring_cons);
> +                               bnx2x_tpa_stop(bp, fp, queue,
> +                                              &cqe->end_agg_cqe,
> +                                              comp_ring_cons);
>  #ifdef BNX2X_STOP_ON_ERROR
> -                                       if (bp->panic)
> -                                               return 0;
> +                               if (bp->panic)
> +                                       return 0;
>  #endif
> 
> -                                       bnx2x_update_sge_prod(fp, cqe_fp);
> -                                       goto next_cqe;
> -                               }
> +                               bnx2x_update_sge_prod(fp, cqe_fp);
> +                               goto next_cqe;
>                         }
> -                       /* non TPA */
> -                       len = le16_to_cpu(cqe_fp->pkt_len);
> -                       pad = cqe_fp->placement_offset;
> -                       dma_sync_single_for_cpu(&bp->pdev->dev,
> +               }
> +               /* non TPA */
> +               len = le16_to_cpu(cqe_fp->pkt_len);
> +               pad = cqe_fp->placement_offset;
> +               dma_sync_single_for_cpu(&bp->pdev->dev,
>                                         dma_unmap_addr(rx_buf, mapping),
> -                                                      pad + RX_COPY_THRESH,
> -                                                      DMA_FROM_DEVICE);
> -                       prefetch(((char *)(skb)) + L1_CACHE_BYTES);
> +                                       pad + RX_COPY_THRESH,
> +                                       DMA_FROM_DEVICE);
> +               pad += NET_SKB_PAD;
> +               prefetch(data + pad); /* speedup eth_type_trans() */
> +               /* is this an error packet? */
> +               if (unlikely(cqe_fp_flags & ETH_RX_ERROR_FALGS)) {
> +                       DP(NETIF_MSG_RX_ERR,
> +                          "ERROR  flags %x  rx packet %u\n",
> +                          cqe_fp_flags, sw_comp_cons);
> +                       fp->eth_q_stats.rx_err_discard_pkt++;
> +                       goto reuse_rx;
> +               }
> 
> -                       /* is this an error packet? */
> -                       if (unlikely(cqe_fp_flags & ETH_RX_ERROR_FALGS)) {
> +               /* Since we don't have a jumbo ring
> +                * copy small packets if mtu > 1500
> +                */
> +               if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
> +                   (len <= RX_COPY_THRESH)) {
> +                       skb = netdev_alloc_skb_ip_align(bp->dev, len);
> +                       if (skb == NULL) {
>                                 DP(NETIF_MSG_RX_ERR,
> -                                  "ERROR  flags %x  rx packet %u\n",
> -                                  cqe_fp_flags, sw_comp_cons);
> -                               fp->eth_q_stats.rx_err_discard_pkt++;
> +                                  "ERROR  packet dropped because of alloc failure\n");
> +                               fp->eth_q_stats.rx_skb_alloc_failed++;
>                                 goto reuse_rx;
>                         }
> -
> -                       /* Since we don't have a jumbo ring
> -                        * copy small packets if mtu > 1500
> -                        */
> -                       if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
> -                           (len <= RX_COPY_THRESH)) {
> -                               struct sk_buff *new_skb;
> -
> -                               new_skb = netdev_alloc_skb(bp->dev, len + pad);
> -                               if (new_skb == NULL) {
> -                                       DP(NETIF_MSG_RX_ERR,
> -                                          "ERROR  packet dropped "
> -                                          "because of alloc failure\n");
> -                                       fp->eth_q_stats.rx_skb_alloc_failed++;
> -                                       goto reuse_rx;
> -                               }
> -
> -                               /* aligned copy */
> -                               skb_copy_from_linear_data_offset(skb, pad,
> -                                                   new_skb->data + pad, len);
> -                               skb_reserve(new_skb, pad);
> -                               skb_put(new_skb, len);
> -
> -                               bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
> -
> -                               skb = new_skb;
> -
> -                       } else
> -                       if (likely(bnx2x_alloc_rx_skb(bp, fp, bd_prod) == 0)) {
> +                       memcpy(skb->data, data + pad, len);
> +                       bnx2x_reuse_rx_data(fp, bd_cons, bd_prod);
> +               } else {
> +                       if (likely(bnx2x_alloc_rx_data(bp, fp, bd_prod) == 0)) {
>                                 dma_unmap_single(&bp->pdev->dev,
> -                                       dma_unmap_addr(rx_buf, mapping),
> +                                                dma_unmap_addr(rx_buf, mapping),
>                                                  fp->rx_buf_size,
>                                                  DMA_FROM_DEVICE);
> +                               skb = build_skb(data);
> +                               if (unlikely(!skb)) {
> +                                       kfree(data);
> +                                       fp->eth_q_stats.rx_skb_alloc_failed++;
> +                                       goto next_rx;
> +                               }
>                                 skb_reserve(skb, pad);
> -                               skb_put(skb, len);
> -
>                         } else {
>                                 DP(NETIF_MSG_RX_ERR,
>                                    "ERROR  packet dropped because "
>                                    "of alloc failure\n");
>                                 fp->eth_q_stats.rx_skb_alloc_failed++;
>  reuse_rx:
> -                               bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
> +                               bnx2x_reuse_rx_data(fp, bd_cons, bd_prod);
>                                 goto next_rx;
>                         }
> 
> +                       skb_put(skb, len);
>                         skb->protocol = eth_type_trans(skb, bp->dev);
> 
>                         /* Set Toeplitz hash for a none-LRO skb */
> -                       bnx2x_set_skb_rxhash(bp, cqe, skb);
> +                       skb->rxhash = bnx2x_get_rxhash(bp, cqe_fp);
> 
>                         skb_checksum_none_assert(skb);
> 
> @@ -767,7 +749,7 @@ reuse_rx:
> 
> 
>  next_rx:
> -               rx_buf->skb = NULL;
> +               rx_buf->data = NULL;
> 
>                 bd_cons = NEXT_RX_IDX(bd_cons);
>                 bd_prod = NEXT_RX_IDX(bd_prod);
> @@ -1013,9 +995,9 @@ void bnx2x_init_rx_rings(struct bnx2x *bp)
>                                 struct sw_rx_bd *first_buf =
>                                         &tpa_info->first_buf;
> 
> -                               first_buf->skb = netdev_alloc_skb(bp->dev,
> -                                                      fp->rx_buf_size);
> -                               if (!first_buf->skb) {
> +                               first_buf->data = kmalloc(fp->rx_buf_size + NET_SKB_PAD,
> +                                                         GFP_ATOMIC);
> +                               if (!first_buf->data) {
>                                         BNX2X_ERR("Failed to allocate TPA "
>                                                   "skb pool for queue[%d] - "
>                                                   "disabling TPA on this "
> @@ -1118,16 +1100,16 @@ static void bnx2x_free_rx_bds(struct bnx2x_fastpath *fp)
> 
>         for (i = 0; i < NUM_RX_BD; i++) {
>                 struct sw_rx_bd *rx_buf = &fp->rx_buf_ring[i];
> -               struct sk_buff *skb = rx_buf->skb;
> +               u8 *data = rx_buf->data;
> 
> -               if (skb == NULL)
> +               if (data == NULL)
>                         continue;
>                 dma_unmap_single(&bp->pdev->dev,
>                                  dma_unmap_addr(rx_buf, mapping),
>                                  fp->rx_buf_size, DMA_FROM_DEVICE);
> 
> -               rx_buf->skb = NULL;
> -               dev_kfree_skb(skb);
> +               rx_buf->data = NULL;
> +               kfree(data);
>         }
>  }
> 
> @@ -1509,6 +1491,7 @@ static inline void bnx2x_set_rx_buf_size(struct bnx2x *bp)
> 
>         for_each_queue(bp, i) {
>                 struct bnx2x_fastpath *fp = &bp->fp[i];
> +               u32 mtu;
> 
>                 /* Always use a mini-jumbo MTU for the FCoE L2 ring */
>                 if (IS_FCOE_IDX(i))
> @@ -1518,13 +1501,15 @@ static inline void bnx2x_set_rx_buf_size(struct bnx2x *bp)
>                          * IP_HEADER_ALIGNMENT_PADDING to prevent a buffer
>                          * overrun attack.
>                          */
> -                       fp->rx_buf_size =
> -                               BNX2X_FCOE_MINI_JUMBO_MTU + ETH_OVREHEAD +
> -                               BNX2X_FW_RX_ALIGN + IP_HEADER_ALIGNMENT_PADDING;
> +                       mtu = BNX2X_FCOE_MINI_JUMBO_MTU;
>                 else
> -                       fp->rx_buf_size =
> -                               bp->dev->mtu + ETH_OVREHEAD +
> -                               BNX2X_FW_RX_ALIGN + IP_HEADER_ALIGNMENT_PADDING;
> +                       mtu = bp->dev->mtu;
> +               fp->rx_buf_size = BNX2X_FW_RX_ALIGN_START +
> +                                 IP_HEADER_ALIGNMENT_PADDING +
> +                                 ETH_OVREHEAD +
> +                                 mtu +
> +                                 BNX2X_FW_RX_ALIGN_END;
> +               /* Note : rx_buf_size doesnt take into account NET_SKB_PAD */
>         }
>  }
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> index 260226d..41eb17e 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> @@ -910,26 +910,27 @@ static inline int bnx2x_alloc_rx_sge(struct bnx2x *bp,
>         return 0;
>  }
> 
> -static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
> -                                    struct bnx2x_fastpath *fp, u16 index)
> +static inline int bnx2x_alloc_rx_data(struct bnx2x *bp,
> +                                     struct bnx2x_fastpath *fp, u16 index)
>  {
> -       struct sk_buff *skb;
> +       u8 *data;
>         struct sw_rx_bd *rx_buf = &fp->rx_buf_ring[index];
>         struct eth_rx_bd *rx_bd = &fp->rx_desc_ring[index];
>         dma_addr_t mapping;
> 
> -       skb = netdev_alloc_skb(bp->dev, fp->rx_buf_size);
> -       if (unlikely(skb == NULL))
> +       data = kmalloc(fp->rx_buf_size + NET_SKB_PAD, GFP_ATOMIC);
> +       if (unlikely(data == NULL))
>                 return -ENOMEM;
> 
> -       mapping = dma_map_single(&bp->pdev->dev, skb->data, fp->rx_buf_size,
> +       mapping = dma_map_single(&bp->pdev->dev, data + NET_SKB_PAD,
> +                                fp->rx_buf_size,
>                                  DMA_FROM_DEVICE);
>         if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
> -               dev_kfree_skb_any(skb);
> +               kfree(data);
>                 return -ENOMEM;
>         }
> 
> -       rx_buf->skb = skb;
> +       rx_buf->data = data;
>         dma_unmap_addr_set(rx_buf, mapping, mapping);
> 
>         rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
> @@ -938,12 +939,12 @@ static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
>         return 0;
>  }
> 
> -/* note that we are not allocating a new skb,
> +/* note that we are not allocating a new buffer,
>   * we are just moving one from cons to prod
>   * we are not creating a new mapping,
>   * so there is no need to check for dma_mapping_error().
>   */
> -static inline void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
> +static inline void bnx2x_reuse_rx_data(struct bnx2x_fastpath *fp,
>                                       u16 cons, u16 prod)
>  {
>         struct sw_rx_bd *cons_rx_buf = &fp->rx_buf_ring[cons];
> @@ -953,7 +954,7 @@ static inline void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
> 
>         dma_unmap_addr_set(prod_rx_buf, mapping,
>                            dma_unmap_addr(cons_rx_buf, mapping));
> -       prod_rx_buf->skb = cons_rx_buf->skb;
> +       prod_rx_buf->data = cons_rx_buf->data;
>         *prod_bd = *cons_bd;
>  }
> 
> @@ -1029,9 +1030,9 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
>         for (i = 0; i < last; i++) {
>                 struct bnx2x_agg_info *tpa_info = &fp->tpa_info[i];
>                 struct sw_rx_bd *first_buf = &tpa_info->first_buf;
> -               struct sk_buff *skb = first_buf->skb;
> +               u8 *data = first_buf->data;
> 
> -               if (skb == NULL) {
> +               if (data == NULL) {
>                         DP(NETIF_MSG_IFDOWN, "tpa bin %d empty on free\n", i);
>                         continue;
>                 }
> @@ -1039,8 +1040,8 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
>                         dma_unmap_single(&bp->pdev->dev,
>                                          dma_unmap_addr(first_buf, mapping),
>                                          fp->rx_buf_size, DMA_FROM_DEVICE);
> -               dev_kfree_skb(skb);
> -               first_buf->skb = NULL;
> +               kfree(data);
> +               first_buf->data = NULL;
>         }
>  }
> 
> @@ -1148,7 +1149,7 @@ static inline int bnx2x_alloc_rx_bds(struct bnx2x_fastpath *fp,
>          * fp->eth_q_stats.rx_skb_alloc_failed = 0
>          */
>         for (i = 0; i < rx_ring_size; i++) {
> -               if (bnx2x_alloc_rx_skb(bp, fp, ring_prod) < 0) {
> +               if (bnx2x_alloc_rx_data(bp, fp, ring_prod) < 0) {
>                         fp->eth_q_stats.rx_skb_alloc_failed++;
>                         continue;
>                 }
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
> index f6402fa..ec31871 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
> @@ -1740,6 +1740,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
>         struct sw_rx_bd *rx_buf;
>         u16 len;
>         int rc = -ENODEV;
> +       u8 *data;
> 
>         /* check the loopback mode */
>         switch (loopback_mode) {
> @@ -1865,10 +1866,9 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
>         dma_sync_single_for_cpu(&bp->pdev->dev,
>                                    dma_unmap_addr(rx_buf, mapping),
>                                    fp_rx->rx_buf_size, DMA_FROM_DEVICE);
> -       skb = rx_buf->skb;
> -       skb_reserve(skb, cqe->fast_path_cqe.placement_offset);
> +       data = rx_buf->data + NET_SKB_PAD + cqe->fast_path_cqe.placement_offset;
>         for (i = ETH_HLEN; i < pkt_size; i++)
> -               if (*(skb->data + i) != (unsigned char) (i & 0xff))
> +               if (*(data + i) != (unsigned char) (i & 0xff))
>                         goto test_loopback_rx_exit;
> 
>         rc = 0;
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> index 33ff60d..80fb9b5 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> @@ -2789,8 +2789,8 @@ static void bnx2x_pf_rx_q_prep(struct bnx2x *bp,
>         /* This should be a maximum number of data bytes that may be
>          * placed on the BD (not including paddings).
>          */
> -       rxq_init->buf_sz = fp->rx_buf_size - BNX2X_FW_RX_ALIGN -
> -               IP_HEADER_ALIGNMENT_PADDING;
> +       rxq_init->buf_sz = fp->rx_buf_size - BNX2X_FW_RX_ALIGN_START -
> +               BNX2X_FW_RX_ALIGN_END - IP_HEADER_ALIGNMENT_PADDING;
> 
>         rxq_init->cl_qzone_id = fp->cl_qzone_id;
>         rxq_init->tpa_agg_sz = tpa_agg_size;
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: introduce build_skb()
From: Ben Hutchings @ 2011-11-14 17:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, eilong, pstaszewski, netdev, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger, Thomas Graf, Herbert Xu,
	Jeff Kirsher
In-Reply-To: <1321286614.2272.47.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Mon, 2011-11-14 at 17:03 +0100, Eric Dumazet wrote:
> One of the thing we discussed during netdev 2011 conference was the idea
> to change some network drivers to allocate/populate their skb at RX
> completion time, right before feeding the skb to network stack.
> 
> In old days, we allocated skbs when populating the RX ring.
> 
> This means bringing into cpu cache sk_buff and skb_shared_info cache
> lines (since we clear/initialize them), then 'queue' skb->data to NIC.
> 
> By the time NIC fills a frame in skb->data buffer and host can process
> it, cpu probably threw away the cache lines from its caches, because lot
> of things happened between the allocation and final use.

This is definitely a win so long as skbs are getting merged by GRO.
However, those cache misses take less time than skb allocation, so
preallocation of skbs normally reduces latency.  This is why sfc has an
adaptive buffer allocation behaviour.

> So the deal would be to allocate only the data buffer for the NIC to
> populate its RX ring buffer. And use build_skb() at RX completion to
> attach a data buffer (now filled with an ethernet frame) to a new skb,
> initialize the skb_shared_info portion, and give the hot skb to network
> stack.
> 
> build_skb() is the function to allocate an skb, caller providing the
> data buffer that should be attached to it. Drivers are expected to call 
> skb_reserve() right after build_skb() to adjust skb->data to the
> Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
> drivers might add a hardware provided alignment)
[...]

The option to attach a heap-allocated buffer rather than allocating a
header buffer and attaching a page might shift the balance.  I'll have
to test this (but don't hold your breath).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] netdev/phy: Use mdiobus_read() so that proper locks are taken.
From: Timur Tabi @ 2011-11-14 17:10 UTC (permalink / raw)
  To: Andy Fleming; +Cc: David Daney, netdev@vger.kernel.org, davem@davemloft.net
In-Reply-To: <CAKWjMd4WjDNHH8dOr9GPxQTC69SGgjh+euWh_DSjhtRxquB8qA@mail.gmail.com>

Andy Fleming wrote:

> Yes, that is correct. I have a patch, I just have to resend it. Will do
> next time I come to a break in what I'm working on.

Can you point me to that patch, so that I can try it out now?

I fixed some lockdep-related code in my audio driver.  The driver was not initializing a sysfs attr structure, so I added a call to sysfs_attr_init().  But when I do that, I get the following bug report.  I don't understand the connection.

=============================================                                   
[ INFO: possible recursive locking detected ]                                   
3.2.0-10b-00093-gebea711-dirty #10                                              
---------------------------------------------                                   
kworker/1:1/271 is trying to acquire lock:                                      
 (&(&(priv->tx_queue[i]->txlock))->rlock){......}, at: [<c02b2e28>] lock_tx_qs+0
x34/0x54                                                                        
                                                                                
but task is already holding lock:                                               
 (&(&(priv->tx_queue[i]->txlock))->rlock){......}, at: [<c02b2e28>] lock_tx_qs+0
x34/0x54                                                                        
                                                                                
other info that might help us debug this:                                       
 Possible unsafe locking scenario:                                              
                                                                                
       CPU0                                                                     
       ----                                                                     
  lock(&(&(priv->tx_queue[i]->txlock))->rlock);                                 
  lock(&(&(priv->tx_queue[i]->txlock))->rlock);                                 
                                                                                
 *** DEADLOCK ***                                                               
                                                                                
 May be due to missing lock nesting notation                                    
                                                                                
4 locks held by kworker/1:1/271:                                                
 #0:  (events){.+.+.+}, at: [<c005b7c8>] process_one_work+0x138/0x490           
 #1:  ((&(&dev->state_queue)->work)){+.+...}, at: [<c005b7c8>] process_one_work+
0x138/0x490                                                                     
 #2:  (&dev->lock){+.+...}, at: [<c02ab3cc>] phy_state_machine+0x30/0x580       
 #3:  (&(&(priv->tx_queue[i]->txlock))->rlock){......}, at: [<c02b2e28>] lock_tx
_qs+0x34/0x54                                                                   
                                                                                
stack backtrace:                                                                
Call Trace:                                                                     
[e69d5d80] [c0008c7c] show_stack+0x44/0x154 (unreliable)                        
[e69d5dc0] [c007bafc] __lock_acquire+0xefc/0x1824                               
[e69d5e70] [c007c870] lock_acquire+0x4c/0x68                                    
[e69d5e90] [c04575d8] _raw_spin_lock+0x44/0x60                                  
[e69d5ea0] [c02b2e28] lock_tx_qs+0x34/0x54                                      
[e69d5ec0] [c02b2f24] adjust_link+0x34/0x1d8                                    
[e69d5ef0] [c02ab73c] phy_state_machine+0x3a0/0x580                             
[e69d5f10] [c005b83c] process_one_work+0x1ac/0x490                              
[e69d5f50] [c005e4c0] worker_thread+0x18c/0x35c                                 
[e69d5f90] [c00631d4] kthread+0x7c/0x80                                         
[e69d5ff0] [c000e588] kernel_thread+0x4c/0x68                                   

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: [patch net-next V8] net: introduce ethernet teaming device
From: Andy Gospodarek @ 2011-11-14 17:18 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, eric.dumazet, bhutchings, shemminger, fubar, andy,
	tgraf, ebiederm, mirqus, kaber, greearb, jesse, fbl,
	benjamin.poirier, jzupka, ivecera
In-Reply-To: <1321085808-6871-1-git-send-email-jpirko@redhat.com>

On Sat, Nov 12, 2011 at 09:16:48AM +0100, Jiri Pirko wrote:
> This patch introduces new network device called team. It supposes to be
> very fast, simple, userspace-driven alternative to existing bonding
> driver.
> 
> Userspace library called libteam with couple of demo apps is available
> here:
> https://github.com/jpirko/libteam
> Note it's still in its dipers atm.
> 
> team<->libteam use generic netlink for communication. That and rtnl
> suppose to be the only way to configure team device, no sysfs etc.
> 
> Python binding of libteam was recently introduced.
> Daemon providing arpmon/miimon active-backup functionality will be
> introduced shortly. All what's necessary is already implemented in
> kernel team driver.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> 
> v7->v8:
> 	- check ndo_ndo_vlan_rx_[add/kill]_vid functions before calling
> 	  them.
> 	- use dev_kfree_skb_any() instead of dev_kfree_skb()
> 
> v6->v7:
> 	- transmit and receive functions are not checked in hot paths.
> 	  That also resolves memory leak on transmit when no port is
> 	  present
> 
> v5->v6:
> 	- changed couple of _rcu calls to non _rcu ones in non-readers
> 
> v4->v5:
> 	- team_change_mtu() uses team->lock while travesing though port
> 	  list
> 	- mac address changes are moved completely to jurisdiction of
> 	  userspace daemon. This way the daemon can do FOM1, FOM2 and
> 	  possibly other weird things with mac addresses.
> 	  Only round-robin mode sets up all ports to bond's address then
> 	  enslaved.
> 	- Extended Kconfig text
> 
> v3->v4:
> 	- remove redundant synchronize_rcu from __team_change_mode()
> 	- revert "set and clear of mode_ops happens per pointer, not per
> 	  byte"
> 	- extend comment of function __team_change_mode()
> 
> v2->v3:
> 	- team_change_mtu() uses rcu version of list traversal to unwind
> 	- set and clear of mode_ops happens per pointer, not per byte
> 	- port hashlist changed to be embedded into team structure
> 	- error branch in team_port_enter() does cleanup now
> 	- fixed rtln->rtnl
> 
> v1->v2:
> 	- modes are made as modules. Makes team more modular and
> 	  extendable.
> 	- several commenters' nitpicks found on v1 were fixed
> 	- several other bugs were fixed.
> 	- note I ignored Eric's comment about roundrobin port selector
> 	  as Eric's way may be easily implemented as another mode (mode
> 	  "random") in future.

You better get ready for v9.

Running the command:

# team_manual_control team0 set mode roundrobin

on a system with team0 running in roundrobin mode produces this:

[ 2127.785321] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 2127.788079] IP: [<ffffffffa0196edd>] team_nl_fill_options_get_changed+0xc5/0x240 [team]
[ 2127.790847] PGD 13eecf067 PUD 13f758067 PMD 0 
[ 2127.793603] Oops: 0000 [#1] SMP 
[ 2127.796352] CPU 7 
[ 2127.796370] Modules linked in: team_mode_roundrobin(O) team(O) fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_state nf_conntrack snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device i2c_i801 joydev microcode shpchp snd_pcm snd_timer snd soundcore snd_page_alloc bnx2 iTCO_wdt iTCO_vendor_support e1000e uinput firewire_ohci firewire_core crc_itu_t i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: nf_defrag_ipv4]
[ 2127.808223] 
[ 2127.811261] Pid: 7085, comm: team_manual_con Tainted: G           O 3.2.0-rc1+ #1 Intel Corporation 2012 Client Platform/LosLunas CRB
[ 2127.814421] RIP: 0010:[<ffffffffa0196edd>]  [<ffffffffa0196edd>] team_nl_fill_options_get_changed+0xc5/0x240 [team]
[ 2127.817597] RSP: 0018:ffff88012ec3d968  EFLAGS: 00010286
[ 2127.820758] RAX: 0000000000000000 RBX: ffff8801397bb600 RCX: ffffffffffffffff
[ 2127.823947] RDX: ffff88013f4ba048 RSI: 0000000000000000 RDI: 0000000000000000
[ 2127.827154] RBP: ffff88012ec3d9c8 R08: ffff88013f4ba048 R09: 0000000000000004
[ 2127.830365] R10: 0000000000001bad R11: 0000000000000000 R12: ffff880143a8b740
[ 2127.833599] R13: ffff880143aca7e8 R14: ffff88013f4ba014 R15: ffff88013f4ba048
[ 2127.836838] FS:  00007fd65cdc8700(0000) GS:ffff88014e2e0000(0000) knlGS:0000000000000000
[ 2127.840102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2127.843386] CR2: 0000000000000000 CR3: 0000000128531000 CR4: 00000000001406e0
[ 2127.846688] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2127.849987] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2127.853278] Process team_manual_con (pid: 7085, threadinfo ffff88012ec3c000, task ffff88013e842e40)
[ 2127.856605] Stack:
[ 2127.859898]  0000000000000000 ffff880143a8b7e8 0000000000000000 ffff88013f4ba01c
[ 2127.863261]  ffffffffa0019140 0000000500000000 0000000000000000 ffff8801397bb600
[ 2127.866623]  ffff88012ec3da58 ffffffffa0197058 ffff880143a8b740 00000000fffffff4
[ 2127.869993] Call Trace:
[ 2127.873344]  [<ffffffffa0197058>] ? team_nl_fill_options_get_changed+0x240/0x240 [team]
[ 2127.876750]  [<ffffffffa0197078>] team_nl_fill_options_get+0x20/0x22 [team]
[ 2127.880152]  [<ffffffffa019763c>] team_nl_send_generic+0x41/0x85 [team]
[ 2127.880156]  [<ffffffffa01976f5>] team_nl_cmd_options_get+0x36/0x3f [team]
[ 2127.880162]  [<ffffffff813fbdce>] genl_rcv_msg+0x1d8/0x203
[ 2127.880165]  [<ffffffff813fbbf6>] ? genl_rcv+0x2d/0x2d
[ 2127.880169]  [<ffffffff813fb7b2>] netlink_rcv_skb+0x42/0x8d
[ 2127.880172]  [<ffffffff813fbbef>] genl_rcv+0x26/0x2d
[ 2127.880174]  [<ffffffff813fb341>] netlink_unicast+0xec/0x156
[ 2127.880178]  [<ffffffff813fb5a6>] netlink_sendmsg+0x1fb/0x233
[ 2127.880182]  [<ffffffff813ca577>] sock_sendmsg+0xe6/0x109
[ 2127.880188]  [<ffffffff81120c76>] ? __mem_cgroup_commit_charge+0x9d/0xa9
[ 2127.880192]  [<ffffffff8112333b>] ? mem_cgroup_charge_common+0xb1/0xc3
[ 2127.880197]  [<ffffffff81043ff5>] ? should_resched+0xe/0x2d
[ 2127.880203]  [<ffffffff814b6208>] ? _cond_resched+0xe/0x22
[ 2127.880206]  [<ffffffff81043ff5>] ? should_resched+0xe/0x2d
[ 2127.880209]  [<ffffffff813d4433>] ? copy_from_user+0x2f/0x31
[ 2127.880212]  [<ffffffff813d481e>] ? verify_iovec+0x52/0xa4
[ 2127.880215]  [<ffffffff813ca85c>] __sys_sendmsg+0x213/0x2ba
[ 2127.880220]  [<ffffffff810fb1eb>] ? handle_mm_fault+0x1c8/0x1db
[ 2127.880224]  [<ffffffff814bab67>] ? do_page_fault+0x30c/0x37e
[ 2127.880228]  [<ffffffff814b7914>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[ 2127.880232]  [<ffffffff81045079>] ? __wake_up+0x44/0x4d
[ 2127.880235]  [<ffffffff813cc459>] sys_sendmsg+0x42/0x60
[ 2127.880239]  [<ffffffff814be042>] system_call_fastpath+0x16/0x1b
[ 2127.880241] Code: e9 24 01 00 00 be 01 00 00 00 48 89 df e8 aa f3 ff ff 48 85 c0 49 89 c7 0f 84 4b 01 00 00 49 8b 75 10 31 c0 48 83 c9 ff 48 89 f7 <f2> ae 48 89 df 89 ca 48 89 f1 be 01 00 00 00 f7 d2 e8 2f 4c 0a 
[ 2127.880263] RIP  [<ffffffffa0196edd>] team_nl_fill_options_get_changed+0xc5/0x240 [team]
[ 2127.880268]  RSP <ffff88012ec3d968>
[ 2127.880269] CR2: 0000000000000000
[ 2127.880287] ---[ end trace 3e104c6acd231d26 ]---

Can you provide a detailed report of the testing you have done on the
team device?  It seems proper testing would have found something like
this.

^ permalink raw reply

* Re: net: Add network priority cgroup
From: Neil Horman @ 2011-11-14 17:24 UTC (permalink / raw)
  To: Shyam_Iyer; +Cc: dave.taht, netdev, john.r.fastabend, robert.w.love, davem
In-Reply-To: <DBFB1B45AF80394ABD1C807E9F28D157077D7D8D5F@BLRX7MCDC203.AMER.DELL.COM>

On Mon, Nov 14, 2011 at 10:13:37PM +0530, Shyam_Iyer@Dell.com wrote:
> 
> 
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org [mailto:netdev-
> > owner@vger.kernel.org] On Behalf Of Neil Horman
> > Sent: Monday, November 14, 2011 9:44 AM
> > To: Dave Taht
> > Cc: netdev@vger.kernel.org; John Fastabend; Robert Love; David S.
> > Miller
> > Subject: Re: net: Add network priority cgroup
> > 
> > On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> > > On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com>
> > wrote:
> > > > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> > > >> Data Center Bridging environments are currently somewhat limited
> > in their
> > > >> ability to provide a general mechanism for controlling traffic
> > priority.
> > > >> Specifically they are unable to administratively control the
> > priority at which
> > > >> various types of network traffic are sent.
> > > >>
> > > >> Currently, the only ways to set the priority of a network buffer
> > are:
> > > >>
> > > >> 1) Through the use of the SO_PRIORITY socket option
> > > >> 2) By using low level hooks, like a tc action
> > > >>
> > > >> (1) is difficult from an administrative perspective because it
> > requires that the
> > > >> application to be coded to not just assume the default priority is
> > sufficient,
> > > >> and must expose an administrative interface to allow priority
> > adjustment.  Such
> > > >> a solution is not scalable in a DCB environment
> > > >>
> > > >> (2) is also difficult, as it requires constant administrative
> > oversight of
> > > >> applications so as to build appropriate rules to match traffic
> > belonging to
> > > >> various classes, so that priority can be appropriately set. It is
> > further
> > > >> limiting when DCB enabled hardware is in use, due to the fact that
> > tc rules are
> > > >> only run after a root qdisc has been selected (DCB enabled
> > hardware may reserve
> > > >> hw queues for various traffic classes and needs the priority to be
> > set prior to
> > > >> selecting the root qdisc)
> > > >>
> > > >>
> > > >> I've discussed various solutions with John Fastabend, and we saw a
> > cgroup as
> > > >> being a good general solution to this problem.  The network
> > priority cgroup
> > > >> allows for a per-interface priority map to be built per cgroup.
> >  Any traffic
> > > >> originating from an application in a cgroup, that does not
> > explicitly set its
> > > >> priority with SO_PRIORITY will have its priority assigned to the
> > value
> > > >> designated for that group on that interface.  This allows a user
> > space daemon,
> > > >> when conducting LLDP negotiation with a DCB enabled peer to create
> > a cgroup
> > > >> based on the APP_TLV value received and administratively assign
> > applications to
> > > >> that priority using the existing cgroup utility infrastructure.
> > > >>
> > > >> Tested by John and myself, with good results
> > > >>
> > > >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > >> CC: John Fastabend <john.r.fastabend@intel.com>
> > > >> CC: Robert Love <robert.w.love@intel.com>
> > > >> CC: "David S. Miller" <davem@davemloft.net>
> > > >> --
> > > >> To unsubscribe from this list: send the line "unsubscribe netdev"
> > in
> > > >> the body of a message to majordomo@vger.kernel.org
> > > >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > >>
> > > >
> > > > Bump, any other thoughts here?  Dave T. has some reasonable
> > thoughts regarding
> > > > the use of skb->priority, but IMO they really seem orthogonal to
> > the purpose of
> > > > this change.  Any other reviews would be welcome.
> > >
> > > Well, in part I've been playing catchup in the hope that lldp and
> > > openlldp and/or this dcb netlink layer that I don't know anything
> > > about (pointers please?) could help somehow to resolve the semantic
> > > mess skb->priority has become in the first place.
> > >
> > > I liked what was described here.
> > >
> > > "What if we did at least carve out the DCB functionality away from
> > > skb->priority?  Since, AIUI, we're only concerning ourselves with
> > > locally generated traffic here, we're talking
> > > about skbs that have a socket attached to them.  We could, instead of
> > indexing
> > > the prio_tc_map with skb->priority, we could index it with
> > > skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The
> > cgroup
> > > then could be, instead of a strict priority cgroup, a queue_selector
> > cgroup (or
> > > something more appropriately named), and we don't have to touch skb-
> > >priority at
> > > all.  I'd really rather not start down that road until I got more
> > opinions and
> > > consensus on that, but it seems like a pretty good solution, one that
> > would
> > > allow hardware queue selection in systems that use things like DCB to
> > co-exist
> > > with software queueing features."
> > >
> > I was initially ok with this, but the more I think about it, the more I
> > think
> > its just not needed (see further down in this email for my reasoning).
> > John,
> > Rob, do you have any thoughts here?
> > 
> > > The piece that still kind of bothered me about the original proposal
> > > (and perhaps this one) was that setting SO_PRIORITY in an app means
> > > 'give my packets more mojo'.
> > >
> > > Taking something that took unprioritized packets and assigned them
> > and
> > > *them only* to a hardware queue struck me as possibly deprioritizing
> > > the 'more mojo wanted' packets in the app(s), as they would end up in
> > > some other, possibly overloaded, hardware queue.
> > >
> > I don't really see what you mean by this at all.  Taking packets with
> > no
> > priority and assigning them a priority doesn't really have an effect on
> > pre-prioritized packets.  Or rather it shouldn't.  You can certainly
> > create a
> > problem by having apps prioritized according to conflicting semantic
> > rules, but
> > that strikes me as administrative error.  Garbage in...Garbage out.
> > 
> > > So a cgroup that moves all of the packets from an application into a
> > > given hardware queue, and then gets scheduled normally according to
> > > skb->priority and friends (software queue, default of pfifo_fast,
> > > etc), seems to make some sense to me. (I wouldn't mind if we had
> > > abstractions for software queues, too, like, I need a software queue
> > > with these properties, find me a place for it on the hardware - but
> > > I'm dreaming)
> > >
> > > One open question is where do packets generated from other subsystems
> > > end up, if you are using a cgroup for the app? arp, dns, etc?
> > >
> > The overriding rule is the association of an skb to a socket.  If a
> > transmitted
> > frame has skb->sk set in dev_queue_xmit, then we interrogate its
> > priority index
> > as set when we passed through the sendmsg code at the top of the stack.
> > Otherwise its behavior is unchanged from its current standpoint.
> > 
> > > So to rephrase your original description from this:
> > >
> > > >> Any traffic originating from an application in a cgroup, that does
> > not explicitly set its
> > > >> priority with SO_PRIORITY will have its priority assigned to the
> > value
> > > >> designated for that group on that interface.  This allows a user
> > space daemon,
> > > >> when conducting LLDP negotiation with a DCB enabled peer to create
> > a cgroup
> > > >> based on the APP_TLV value received and administratively assign
> > applications to
> > > >> that priority using the existing cgroup utility infrastructure.
> > > > John, Robert, if you're supportive of these changes, some Acks
> > would be
> > > > appreciated.
> > >
> > > To this:
> > >
> > > "Any traffic originating from an application in a cgroup,  will have
> > > its hardware queue  assigned to the value designated for that group
> > on
> > > that interface.  This allows a user space daemon, when conducting
> > LLDP
> > > negotiation with a DCB enabled peer to create a cgroup based on the
> > > APP_TLV value received and administratively assign applications to
> > > that hardware queue using the existing cgroup utility
> > infrastructure."
> > >
> > As above, I'm split brained about this.  I'm ok with the idea of making
> > this a
> > queue selection cgroup, and separating it from priority, but at the
> > same time,
> > in the context of DCB, we really are assigning priority here, so it
> > seems a bit
> > false to do something that is not priority.  I also like the fact that
> > it
> > provides administrative control in a way that netfilter and tc don't
> > really
> > enable.
> > 
> > > Assuming we're on the same page here, what the heck is APP_TLV?
> > >
> > LLDP does layer 2 discovery with peer networking devices. It does so
> > using sets
> > of Type/length/value tuples.  The types carry various bits of
> > information, such
> > as which priority groups are available on the network.  The APP tlv
> > conveys
> > application or feature specific information.  for instance, There is an
> > ISCSI
> > app tlv that tells the host that  "on the interface you received this
> > tlv, iscsi
> > traffic must be sent at priority X".  The idea being that, on receipt
> > of this
> > tlv, the DCB daemon can create an ISCSI network priority cgroup
> > instance, and
> > augment the cgroup rules file such that, when the user space iscsi
> > daemon is
> > started, its traffic automatically transmits at the appropriate
> > priority.
> 
> Love this !
> 
> I guess if this is integrated to libvirt via libcgroups VMs could be assigned a network priority..
> 
As the patch stand currently, absolutely.  Just drop a qemu process into the
approriate cgroup, and (assuming you're using a tun/tap type device), all the
traffic from that vm will get assigned the corresponding priority.  You can do
the same thing with classification using net_cls already.  I did a video of it
here:
http://www.youtube.com/watch?v=KX5QV4LId_c
Neil

> http://linuxplumbersconf.org/2010/ocw/proposals/843
> 
> 
> 

^ permalink raw reply

* Re: [patch net-next V8] net: introduce ethernet teaming device
From: Andy Gospodarek @ 2011-11-14 17:31 UTC (permalink / raw)
  To: David Miller
  Cc: jpirko, netdev, eric.dumazet, bhutchings, shemminger, fubar, andy,
	tgraf, ebiederm, mirqus, kaber, greearb, jesse, fbl,
	benjamin.poirier, jzupka, ivecera
In-Reply-To: <20111113.160951.1905657808657009542.davem@davemloft.net>

On Sun, Nov 13, 2011 at 04:09:51PM -0500, David Miller wrote:
> From: Jiri Pirko <jpirko@redhat.com>
> Date: Sat, 12 Nov 2011 09:16:48 +0100
> 
> > This patch introduces new network device called team. It supposes to be
> > very fast, simple, userspace-driven alternative to existing bonding
> > driver.
> > 
> > Userspace library called libteam with couple of demo apps is available
> > here:
> > https://github.com/jpirko/libteam
> > Note it's still in its dipers atm.
> > 
> > team<->libteam use generic netlink for communication. That and rtnl
> > suppose to be the only way to configure team device, no sysfs etc.
> > 
> > Python binding of libteam was recently introduced.
> > Daemon providing arpmon/miimon active-backup functionality will be
> > introduced shortly. All what's necessary is already implemented in
> > kernel team driver.
> > 
> > Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> 
> Applied, thanks for all of your hard work.

I'm a bit surprised by this.  Not only is this new function currently
difficult to setup (it took me over an hour to go from a a fresh F16
install to one that had all the necessary libraries and tools to even
configure a team device for the first time), but I was able to cause an
Oops with v8 in only a few minutes of testing.

I have no problem with this functionality as an add-on and possibly
future replacement to some of what currently exists with bonding, but it
seems like what is included in the initial support should should:

1. Not panic easily.
2. Have userspace bits in place to actually test all the proposed kernel
code.  (Jiri admits there is no way to verify the active-backup code).
3. Have some known, published test results.

I hope Jiri will reconsider having a separate team tree for the next few
weeks or months until these issues are worked out.  I think the hard
work will pay off and it is close to being ready; it just doesn't seem
like it is right now.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox