Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] Don't allow sharing of tx skbs on xen-netfront
From: Neil Horman @ 2011-11-14 19:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jeremy.fitzhardinge, konrad.wilk, xen-devel
In-Reply-To: <20111114.142716.615003966880410697.davem@davemloft.net>

On Mon, Nov 14, 2011 at 02:27:16PM -0500, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Mon, 14 Nov 2011 14:22:24 -0500
> 
> > It was pointed out to me recently that the xen-netfront driver can't safely
> > support shared skbs on transmit, since, while it doesn't maintain skb state
> > directly, it does pass a pointer to the skb to the hypervisor via a list, and
> > the hypervisor may expect the contents of the skb to remain stable.  Clearing
> > the IFF_TX_SKB_SHARING flag after the call to alloc_etherdev to make it safe.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
> Please put an appropriate prefix into the subject lines of your patch
> submissions.  In this case "[PATCH] xen-netfront: ..." would be appropriate.
> 
> I've been letting you get away with this for the past few weeks and I've
> decided that it's your turn to start getting this right :-)
Jeez Dave, I got it right on the cgroups post, you want consistency now
too?  :).

Apologies, I need to consult a checklist for myself prior to sending stuff.

Best
Neil



> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: introduce build_skb()
From: Eric Dumazet @ 2011-11-14 19:31 UTC (permalink / raw)
  To: David Miller
  Cc: eilong, pstaszewski, netdev, bhutchings, therbert, hadi,
	shemminger, tgraf, herbert, jeffrey.t.kirsher
In-Reply-To: <20111114.142916.2005300491189237669.davem@davemloft.net>

Le lundi 14 novembre 2011 à 14:29 -0500, David Miller a écrit :

> Oh I see, this is for drivers which splat the packet into a linear
> buffer to begin with.

Exact.. We have plenty of them of course :)

^ permalink raw reply

* Re: net/can/mscan: Fix buggy listen only mode setting
From: David Miller @ 2011-11-14 19:30 UTC (permalink / raw)
  To: wg-5Yr1BZd7O62+XT7JhA+gdA
  Cc: Socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	Netdev-u79uwXL29TY76Z2rM5mHXA, mkl-bIcnvbaLZ9MEGnE8C9+IrQ,
	linux-can-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4EC11900.7030005-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>

From: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
Date: Mon, 14 Nov 2011 14:34:56 +0100

> This patch fixes an issue introduced recently with commit
> 452448f9283e1939408b397e87974a418825b0a8.
> 
> CC: Marc Kleine-Budde <mkl-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
> Signed-off-by: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: introduce build_skb()
From: David Miller @ 2011-11-14 19:29 UTC (permalink / raw)
  To: eric.dumazet
  Cc: eilong, pstaszewski, netdev, bhutchings, therbert, hadi,
	shemminger, tgraf, herbert, jeffrey.t.kirsher
In-Reply-To: <1321298766.2719.11.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 14 Nov 2011 20:26:06 +0100

> Le lundi 14 novembre 2011 à 14:21 -0500, David Miller a écrit :
> 
>> Applied, I'll work on converting NIU over to this if I find some time.
> 
> I dont believe NIU needs it, since its a frag enabled driver.
> 
> You already allocate a fresh skb (and attach frags to it) right before
> giving it to upper stack.

Oh I see, this is for drivers which splat the packet into a linear
buffer to begin with.

^ permalink raw reply

* Re: [PATCH] Don't allow sharing of tx skbs on xen-netfront
From: David Miller @ 2011-11-14 19:27 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, jeremy.fitzhardinge, konrad.wilk, xen-devel
In-Reply-To: <1321298544-16434-1-git-send-email-nhorman@tuxdriver.com>

From: Neil Horman <nhorman@tuxdriver.com>
Date: Mon, 14 Nov 2011 14:22:24 -0500

> It was pointed out to me recently that the xen-netfront driver can't safely
> support shared skbs on transmit, since, while it doesn't maintain skb state
> directly, it does pass a pointer to the skb to the hypervisor via a list, and
> the hypervisor may expect the contents of the skb to remain stable.  Clearing
> the IFF_TX_SKB_SHARING flag after the call to alloc_etherdev to make it safe.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

Please put an appropriate prefix into the subject lines of your patch
submissions.  In this case "[PATCH] xen-netfront: ..." would be appropriate.

I've been letting you get away with this for the past few weeks and I've
decided that it's your turn to start getting this right :-)

^ permalink raw reply

* Re: [PATCH Fix V2] mlx4_en: Remove FCS bytes from packet length.
From: David Miller @ 2011-11-14 19:26 UTC (permalink / raw)
  To: yevgenyp; +Cc: netdev
In-Reply-To: <4EC1446D.4020209@mellanox.co.il>

From: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Date: Mon, 14 Nov 2011 18:40:13 +0200

> When HW doesn't remove FCS bytes they are reported in the completion
> byte count, we don't need to take them to skb.
> 
> Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: introduce build_skb()
From: Eric Dumazet @ 2011-11-14 19:26 UTC (permalink / raw)
  To: David Miller
  Cc: eilong, pstaszewski, netdev, bhutchings, therbert, hadi,
	shemminger, tgraf, herbert, jeffrey.t.kirsher
In-Reply-To: <20111114.142153.804127798975712263.davem@davemloft.net>

Le lundi 14 novembre 2011 à 14:21 -0500, David Miller a écrit :

> Applied, I'll work on converting NIU over to this if I find some time.

I dont believe NIU needs it, since its a frag enabled driver.

You already allocate a fresh skb (and attach frags to it) right before
giving it to upper stack.

^ permalink raw reply

* Re: [PATCH net-next 2/2] bnx2x: uses build_skb() in receive path
From: David Miller @ 2011-11-14 19:22 UTC (permalink / raw)
  To: eilong
  Cc: eric.dumazet, netdev, bhutchings, therbert, hadi, shemminger,
	tgraf, herbert, jeffrey.t.kirsher, pstaszewski
In-Reply-To: <1321289329.3101.7.camel@lb-tlvb-eilong.il.broadcom.com>

From: "Eilon Greenstein" <eilong@broadcom.com>
Date: Mon, 14 Nov 2011 18:48:49 +0200

> On Mon, 2011-11-14 at 08:05 -0800, Eric Dumazet wrote:
>> bnx2x uses following formula to compute its rx_buf_sz :
>> 
>> dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2
>> 
>> Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
>> skb_shared_info))
>> 
>> Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
>> MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.
>> 
>> Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
>> false sharing because of mem_reclaim in UDP stack.
>> 
>> One possible way to half truesize is to reduce the need by 64 bytes
>> (2112 -> 2048 bytes)
>> 
>> Instead of allocating a full cache line at the end of packet for
>> alignment, we can use the fact that skb_shared_info sits at the end of
>> skb->head, and we can use this room, if we convert bnx2x to new
>> build_skb() infrastructure.
>> 
>> skb_shared_info will be initialized after hardware finished its
>> transfert, so we can eventually overwrite the final padding.
>> 
>> Using build_skb() also reduces cache line misses in the driver, since we
>> use cache hot skb instead of cold ones. Number of in-flight sk_buff
>> structures is lower, they are recycled while still hot.
>> 
>> Performance results :
>> 
>> (820.000 pps on a rx UDP monothread benchmark, instead of 720.000 pps)
> 
> Wow! This is very impressive. The only problem will be to back port this
> driver now... But it is definitely worth the effort :)

Applied.

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: introduce build_skb()
From: David Miller @ 2011-11-14 19:21 UTC (permalink / raw)
  To: eric.dumazet
  Cc: eilong, pstaszewski, netdev, bhutchings, therbert, hadi,
	shemminger, tgraf, herbert, jeffrey.t.kirsher
In-Reply-To: <1321286614.2272.47.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 14 Nov 2011 17:03:34 +0100

 ...
> So the deal would be to allocate only the data buffer for the NIC to
> populate its RX ring buffer. And use build_skb() at RX completion to
> attach a data buffer (now filled with an ethernet frame) to a new skb,
> initialize the skb_shared_info portion, and give the hot skb to network
> stack.
> 
> build_skb() is the function to allocate an skb, caller providing the
> data buffer that should be attached to it. Drivers are expected to call 
> skb_reserve() right after build_skb() to adjust skb->data to the
> Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
> drivers might add a hardware provided alignment)
> 
> Data provided to build_skb() MUST have been allocated by a prior
> kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info)) bytes at the end of the data without corrupting
> incoming frame.

Applied, I'll work on converting NIU over to this if I find some time.

^ permalink raw reply

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
From: David Miller @ 2011-11-14 19:21 UTC (permalink / raw)
  To: eric.dumazet
  Cc: eilong, bhutchings, pstaszewski, netdev, tgraf, therbert, hadi,
	shemminger
In-Reply-To: <1321286265.2272.46.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 14 Nov 2011 16:57:45 +0100

> Le lundi 14 novembre 2011 à 07:25 +0100, Eric Dumazet a écrit :
>> Le lundi 14 novembre 2011 à 00:08 -0500, David Miller a écrit :
>> 
>> > I fully support bringing this thing back to life :-)
>> 
>> I'll make extensive tests today and provide two patches when ready, with
>> all performance results.
>> 
>> Some prefetch() calls will be removed, since build_skb() provides
>> already cache hot skb.
> 
> Impressive results :
> 
> before : 720.000 pps
> after :  820.000 pps
> 
> [ One mono threaded application receiving UDP messages on a single
> socket, asking IP_PKTINFO ancillary info ]

Sweeeeeet.

^ permalink raw reply

* [PATCH] Don't allow sharing of tx skbs on xen-netfront
From: Neil Horman @ 2011-11-14 19:22 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, David S. Miller, Jeremy Fitzhardinge,
	Konrad Rzeszutek Wilk, xen-devel

It was pointed out to me recently that the xen-netfront driver can't safely
support shared skbs on transmit, since, while it doesn't maintain skb state
directly, it does pass a pointer to the skb to the hypervisor via a list, and
the hypervisor may expect the contents of the skb to remain stable.  Clearing
the IFF_TX_SKB_SHARING flag after the call to alloc_etherdev to make it safe.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: xen-devel@lists.xensource.com
---
 drivers/net/xen-netfront.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 226faab..fb1077b 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1252,6 +1252,12 @@ static struct net_device * __devinit xennet_create_dev(struct xenbus_device *dev
 		return ERR_PTR(-ENOMEM);
 	}
 
+	/*
+	 * Since frames remain on a queue after a return from xennet_start_xmit,
+	 * we can't support tx shared skbs
+	 */
+	netdev->priv_flags &= ~IFF_TX_SKB_SHARING;
+
 	np                   = netdev_priv(netdev);
 	np->xbdev            = dev;
 
-- 
1.7.6.4

^ permalink raw reply related

* Re: [PATCH net-next] Sweep the last of the active .get_drvinfo floors under ethernet/
From: David Miller @ 2011-11-14 19:22 UTC (permalink / raw)
  To: raj; +Cc: dwang2, netdev, roprabhu, e1000-devel, benve
In-Reply-To: <20111114181325.217C72900307@tardy>

From: raj@tardy.cup.hp.com (Rick Jones)
Date: Mon, 14 Nov 2011 10:13:25 -0800 (PST)

> From: Rick Jones <rick.jones2@hp.com>
> 
> This round of floor sweeping converts strncpy calls in various .get_drvinfo
> routines to the preferred strlcpy.  It also does a modicum of other
> cleaning in those routines.
> 
> Signed-off-by: Rick Jones <rick.jones2@hp.com>

Applied, thanks for taking care of this.

------------------------------------------------------------------------------
RSA(R) Conference 2012
Save $700 by Nov 18
Register now
http://p.sf.net/sfu/rsa-sfdev2dev1
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* [PATCH Fix V2] mlx4_en: Remove FCS bytes from packet length.
From: Yevgeny Petrilin @ 2011-11-14 16:40 UTC (permalink / raw)
  To: davem; +Cc: netdev, yevgenyp


When HW doesn't remove FCS bytes they are reported in the completion
byte count, we don't need to take them to skb.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    6 +++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    1 +
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index b89c36d..c2df6c3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -581,6 +581,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		 * Packet is OK - process it.
 		 */
 		length = be32_to_cpu(cqe->byte_cnt);
+		length -= ring->fcs_del;
 		ring->bytes += length;
 		ring->packets++;
 
@@ -813,8 +814,11 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv *priv, int qpn,
 	context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma);
 
 	/* Cancel FCS removal if FW allows */
-	if (mdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_FCS_KEEP)
+	if (mdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_FCS_KEEP) {
 		context->param3 |= cpu_to_be32(1 << 29);
+		ring->fcs_del = ETH_FCS_LEN;
+	} else
+		ring->fcs_del = 0;
 
 	err = mlx4_qp_to_ready(mdev->dev, &ring->wqres.mtt, context, qp, state);
 	if (err) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 8fda331c..207b5ad 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -272,6 +272,7 @@ struct mlx4_en_rx_ring {
 	u32 prod;
 	u32 cons;
 	u32 buf_size;
+	u8  fcs_del;
 	void *buf;
 	void *rx_info;
 	unsigned long bytes;
-- 
1.7.7

^ permalink raw reply related

* qfq getting "stuck"
From: Denys Fedoryshchenko @ 2011-11-14 19:01 UTC (permalink / raw)
  To: netdev, shemminger

 Hi

 Just did setup for QFQ (also testing SFB)
 i686/kernel 3.1.0
 tc utility, iproute2-ss111010

 SFB works fine, while QFQ are "freezing" after while.

 I cannot apply much patches there, since it is "production" customer, 
 but i can try few.
 Please let me know if you need additional info.

 64 bytes from 10.0.1.102: icmp_req=23 ttl=128 time=46.2 ms
 64 bytes from 10.0.1.102: icmp_req=24 ttl=128 time=30.6 ms
 ping: sendmsg: No buffer space available
 ping: sendmsg: No buffer space available
 ping: sendmsg: No buffer space available
 ping: sendmsg: No buffer space available

 Script:
 DEV="tun0"

 ethtool -K eth1.732 tso off gso off gro off
 ethtool -K eth1 tso off gso off gro off
 ethtool -K tun0 tso off gso off gro off
 ethtool -K ifb0 tso off gso off gro off

 tc qdisc del dev $DEV root
 echo add root
 tc qdisc add dev $DEV root handle 1 htb default 7
 echo add class
 tc class add dev $DEV parent 1: classid 1:1 htb rate 3800Kbit ceil 
 3800Kbit prio 3

 tc class add dev $DEV parent 1:1 classid 1:7 htb rate 3800Kbit ceil 
 3800Kbit prio 3
 echo add qdisc
 tc qdisc add dev $DEV parent 1:7 handle 11: est 0.5sec 2sec qfq

 tc filter add dev $DEV protocol ip parent 11: handle 3 \
     flow hash keys dst divisor 64

 for i in `seq 1 64`
 do
   classid=11:$(printf %x $i)
   tc class add dev $DEV classid $classid qfq.
   tc qdisc add dev $DEV parent $classid pfifo limit 10
 done

 tc filter add dev $DEV protocol ip parent 1: prio 4 u32 match ip dst 
 0.0.0.0/0 flowid 1:7

 Stats when it is freezed:
 class htb 1:1 root rate 3800Kbit ceil 3800Kbit burst 1599b/8 mpu 0b 
 overhead 0b cburst 1599b/8 mpu 0b overhead 0b level 7
  Sent 14289365 bytes 13366 pkt (dropped 0, overlimits 0 requeues 0)
  rate 9672bit 1pps backlog 0b 0p requeues 0
  lended: 0 borrowed: 0 giants: 0
  tokens: -46804 ctokens: -46804

 class htb 1:7 parent 1:1 leaf 11: prio 3 quantum 47500 rate 3800Kbit 
 ceil 3800Kbit burst 1599b/8 mpu 0b overhead 0b cburst 1599b/8 mpu 0b 
 overhead 0b level 0
  Sent 14348426 bytes 13535 pkt (dropped 3258, overlimits 0 requeues 0)
  rate 9856bit 1pps backlog 0b 169p requeues 0
  lended: 13366 borrowed: 0 giants: 0
  tokens: -46804 ctokens: -46804

 class qfq 11:11 root leaf 841b: weight 1 maxpkt 2048
  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
 class qfq 11:10 root leaf 841a: weight 1 maxpkt 2048
  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
 class qfq 11:13 root leaf 841d: weight 1 maxpkt 2048
  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
 class qfq 11:12 root leaf 841c: weight 1 maxpkt 2048
  Sent 5712 bytes 39 pkt (dropped 169, overlimits 0 requeues 0)
  backlog 1720b 10p requeues 0
 class qfq 11:15 root leaf 841f: weight 1 maxpkt 2048
  Sent 58 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 29b 1p requeues 0
 class qfq 11:14 root leaf 841e: weight 1 maxpkt 2048
  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
 class qfq 11:17 root leaf 8421: weight 1 maxpkt 2048
  Sent 27473 bytes 91 pkt (dropped 131, overlimits 0 requeues 0)
  backlog 520b 10p requeues 0
 class qfq 11:16 root leaf 8420: weight 1 maxpkt 2048
  Sent 4037380 bytes 2743 pkt (dropped 261, overlimits 0 requeues 0)
  backlog 13584b 10p requeues 0
 class qfq 11:19 root leaf 8423: weight 1 maxpkt 2048
  Sent 5132 bytes 37 pkt (dropped 131, overlimits 0 requeues 0)
  backlog 755b 10p requeues 0

 ...
 class qfq 11:2f root leaf 8439: weight 1 maxpkt 2048
  Sent 89 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
 class qfq 11:2c root leaf 8436: weight 1 maxpkt 2048
  Sent 3977870 bytes 3239 pkt (dropped 989, overlimits 0 requeues 0)
  backlog 13521b 10p requeues 0
 class qfq 11:2d root leaf 8437: weight 1 maxpkt 2048
  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0
 class qfq 11:40 root leaf 844a: weight 1 maxpkt 2048
  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
  backlog 0b 0p requeues 0


 ---
 System administrator
 Denys Fedoryshchenko
 Virtual ISP S.A.L.

^ permalink raw reply

* Re: [patch net-next V8] net: introduce ethernet teaming device
From: Rick Jones @ 2011-11-14 18:40 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, eric.dumazet, bhutchings, shemminger, fubar, andy,
	tgraf, ebiederm, mirqus, kaber, greearb, jesse, fbl,
	benjamin.poirier, jzupka, ivecera
In-Reply-To: <1321085808-6871-1-git-send-email-jpirko@redhat.com>

On 11/12/2011 12:16 AM, Jiri Pirko wrote:
> This patch introduces new network device called team. It supposes to be
> very fast, simple, userspace-driven alternative to existing bonding
> driver.

What is the definition of "very" here - relative to bonding I presume? 
Are there actual performance figures available at this point?  Something 
along the lines of test through both on the same hardware.  I don't have 
HW on which to run myself but would be quite happy to help with say 
netperf command selection to demonstrate the difference between the two.

happy benchmrking,

rick jones

^ permalink raw reply

* Re: [patch net-next V8] net: introduce ethernet teaming device
From: Eric Dumazet @ 2011-11-14 18:17 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: David Miller, jpirko, netdev, bhutchings, shemminger, fubar,
	tgraf, ebiederm, mirqus, kaber, greearb, jesse, fbl,
	benjamin.poirier, jzupka, ivecera
In-Reply-To: <20111114173157.GE20605@gospo.rdu.redhat.com>

Le lundi 14 novembre 2011 à 12:31 -0500, Andy Gospodarek a écrit :

> I'm a bit surprised by this.  Not only is this new function currently
> difficult to setup (it took me over an hour to go from a a fresh F16
> install to one that had all the necessary libraries and tools to even
> configure a team device for the first time), but I was able to cause an
> Oops with v8 in only a few minutes of testing.
> 
> I have no problem with this functionality as an add-on and possibly
> future replacement to some of what currently exists with bonding, but it
> seems like what is included in the initial support should should:
> 
> 1. Not panic easily.
> 2. Have userspace bits in place to actually test all the proposed kernel
> code.  (Jiri admits there is no way to verify the active-backup code).
> 3. Have some known, published test results.
> 
> I hope Jiri will reconsider having a separate team tree for the next few
> weeks or months until these issues are worked out.  I think the hard
> work will pay off and it is close to being ready; it just doesn't seem
> like it is right now.
> 

Its marked EXPERIMENTAL, so probably some changes are expected before
production mode :)

^ permalink raw reply

* Re: [PATCH 0/5] ks8851 MAC and eeprom updates
From: Stephen Boyd @ 2011-11-14 18:17 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, Sebastien Jan, Ben Dooks
In-Reply-To: <1319681870-11972-1-git-send-email-sboyd@codeaurora.org>

On 10/26/11 19:17, Stephen Boyd wrote:
> I pulled this patch series off the netdev list[1] when the eeprom
> reading didn't work for my ks8851 and I wanted the mac address to be
> read out at boot. It all seems to work well, so I'm posting it here
> again, slightly cleaned up and simplified, for inclusion.
[snip]
>
>  drivers/misc/eeprom/eeprom_93cx6.c   |   88 ++++++
>  drivers/net/ethernet/micrel/Kconfig  |    2 +
>  drivers/net/ethernet/micrel/ks8851.c |  513 ++++++++++++----------------------
>  drivers/net/ethernet/micrel/ks8851.h |   15 +-
>  include/linux/eeprom_93cx6.h         |    8 +
>  5 files changed, 278 insertions(+), 348 deletions(-)

Any comments on this series?

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

^ permalink raw reply

* Re: [patch 0/8 2.6.32] CVE-2010-4251: packet backlog can get too large
From: Greg KH @ 2011-11-14 18:11 UTC (permalink / raw)
  To: David Miller; +Cc: ben, dan.carpenter, stable, netdev, yi.zhu, eric.dumazet
In-Reply-To: <20111113.222410.315752972470589717.davem@davemloft.net>

On Sun, Nov 13, 2011 at 10:24:10PM -0500, David Miller wrote:
> From: Ben Hutchings <ben@decadent.org.uk>
> Date: Sun, 13 Nov 2011 23:29:09 +0000
> 
> > But you've previously said that you are not submitting networking
> > patches to the longterm series.  Did you change your mind?
> 
> No, I in fact haven't.
> 
> But I will say that if distributions want to apply this thing, that's
> fine, but it doesn't automatically make it a good idea for -stable
> to take it too.

Thanks for letting me know, I'll drop these from my to-apply mbox.

greg k-h

^ permalink raw reply

* [PATCH net-next] Sweep the last of the active .get_drvinfo floors under ethernet/
From: Rick Jones @ 2011-11-14 18:13 UTC (permalink / raw)
  To: netdev, Christian Benvenuti, Roopa Prabhu, David Wang,
	e1000-devel

From: Rick Jones <rick.jones2@hp.com>

This round of floor sweeping converts strncpy calls in various .get_drvinfo
routines to the preferred strlcpy.  It also does a modicum of other
cleaning in those routines.

Signed-off-by: Rick Jones <rick.jones2@hp.com>

---

Compile tested only.

 drivers/net/ethernet/cisco/enic/enic_main.c      |    8 ++++----
 drivers/net/ethernet/intel/e1000/e1000_ethtool.c |   13 +++++++------
 drivers/net/ethernet/intel/e1000e/ethtool.c      |   16 +++++++---------
 drivers/net/ethernet/intel/igb/igb_ethtool.c     |   15 ++++++---------
 drivers/net/ethernet/intel/igbvf/ethtool.c       |   12 +++++++-----
 drivers/net/ethernet/intel/ixgb/ixgb_ethtool.c   |   11 +++++++----
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   16 ++++++----------
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c  |   11 +++++++----
 drivers/net/ethernet/sun/cassini.c               |    7 +++----
 9 files changed, 54 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index c3786fd..1fe5df0 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -217,11 +217,11 @@ static void enic_get_drvinfo(struct net_device *netdev,
 
 	enic_dev_fw_info(enic, &fw_info);
 
-	strncpy(drvinfo->driver, DRV_NAME, sizeof(drvinfo->driver));
-	strncpy(drvinfo->version, DRV_VERSION, sizeof(drvinfo->version));
-	strncpy(drvinfo->fw_version, fw_info->fw_version,
+	strlcpy(drvinfo->driver, DRV_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, DRV_VERSION, sizeof(drvinfo->version));
+	strlcpy(drvinfo->fw_version, fw_info->fw_version,
 		sizeof(drvinfo->fw_version));
-	strncpy(drvinfo->bus_info, pci_name(enic->pdev),
+	strlcpy(drvinfo->bus_info, pci_name(enic->pdev),
 		sizeof(drvinfo->bus_info));
 }
 
diff --git a/drivers/net/ethernet/intel/e1000/e1000_ethtool.c b/drivers/net/ethernet/intel/e1000/e1000_ethtool.c
index 2b223ac..63faec6 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_ethtool.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_ethtool.c
@@ -515,14 +515,15 @@ static void e1000_get_drvinfo(struct net_device *netdev,
 			      struct ethtool_drvinfo *drvinfo)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
-	char firmware_version[32];
 
-	strncpy(drvinfo->driver,  e1000_driver_name, 32);
-	strncpy(drvinfo->version, e1000_driver_version, 32);
+	strlcpy(drvinfo->driver,  e1000_driver_name,
+		sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, e1000_driver_version,
+		sizeof(drvinfo->version));
 
-	sprintf(firmware_version, "N/A");
-	strncpy(drvinfo->fw_version, firmware_version, 32);
-	strncpy(drvinfo->bus_info, pci_name(adapter->pdev), 32);
+	strlcpy(drvinfo->fw_version, "N/A", sizeof(drvinfo->fw_version));
+	strlcpy(drvinfo->bus_info, pci_name(adapter->pdev),
+		sizeof(drvinfo->bus_info));
 	drvinfo->regdump_len = e1000_get_regs_len(netdev);
 	drvinfo->eedump_len = e1000_get_eeprom_len(netdev);
 }
diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 69c9d21..6d8f0ed 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -579,26 +579,24 @@ static void e1000_get_drvinfo(struct net_device *netdev,
 			      struct ethtool_drvinfo *drvinfo)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
-	char firmware_version[32];
 
-	strncpy(drvinfo->driver,  e1000e_driver_name,
-		sizeof(drvinfo->driver) - 1);
+	strlcpy(drvinfo->driver,  e1000e_driver_name,
+		sizeof(drvinfo->driver));
 	strncpy(drvinfo->version, e1000e_driver_version,
-		sizeof(drvinfo->version) - 1);
+		sizeof(drvinfo->version));
 
 	/*
 	 * EEPROM image version # is reported as firmware version # for
 	 * PCI-E controllers
 	 */
-	snprintf(firmware_version, sizeof(firmware_version), "%d.%d-%d",
+	snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version),
+		"%d.%d-%d",
 		(adapter->eeprom_vers & 0xF000) >> 12,
 		(adapter->eeprom_vers & 0x0FF0) >> 4,
 		(adapter->eeprom_vers & 0x000F));
 
-	strncpy(drvinfo->fw_version, firmware_version,
-		sizeof(drvinfo->fw_version) - 1);
-	strncpy(drvinfo->bus_info, pci_name(adapter->pdev),
-		sizeof(drvinfo->bus_info) - 1);
+	strlcpy(drvinfo->bus_info, pci_name(adapter->pdev),
+		sizeof(drvinfo->bus_info));
 	drvinfo->regdump_len = e1000_get_regs_len(netdev);
 	drvinfo->eedump_len = e1000_get_eeprom_len(netdev);
 }
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 43873eb..e9335ef 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -673,25 +673,22 @@ static void igb_get_drvinfo(struct net_device *netdev,
 			    struct ethtool_drvinfo *drvinfo)
 {
 	struct igb_adapter *adapter = netdev_priv(netdev);
-	char firmware_version[32];
 	u16 eeprom_data;
 
-	strncpy(drvinfo->driver,  igb_driver_name, sizeof(drvinfo->driver) - 1);
-	strncpy(drvinfo->version, igb_driver_version,
-		sizeof(drvinfo->version) - 1);
+	strlcpy(drvinfo->driver,  igb_driver_name, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, igb_driver_version, sizeof(drvinfo->version));
 
 	/* EEPROM image version # is reported as firmware version # for
 	 * 82575 controllers */
 	adapter->hw.nvm.ops.read(&adapter->hw, 5, 1, &eeprom_data);
-	sprintf(firmware_version, "%d.%d-%d",
+	snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version),
+		"%d.%d-%d",
 		(eeprom_data & 0xF000) >> 12,
 		(eeprom_data & 0x0FF0) >> 4,
 		eeprom_data & 0x000F);
 
-	strncpy(drvinfo->fw_version, firmware_version,
-		sizeof(drvinfo->fw_version) - 1);
-	strncpy(drvinfo->bus_info, pci_name(adapter->pdev),
-		sizeof(drvinfo->bus_info) - 1);
+	strlcpy(drvinfo->bus_info, pci_name(adapter->pdev),
+		sizeof(drvinfo->bus_info));
 	drvinfo->n_stats = IGB_STATS_LEN;
 	drvinfo->testinfo_len = IGB_TEST_LEN;
 	drvinfo->regdump_len = igb_get_regs_len(netdev);
diff --git a/drivers/net/ethernet/intel/igbvf/ethtool.c b/drivers/net/ethernet/intel/igbvf/ethtool.c
index 2c25858..e60f1c6 100644
--- a/drivers/net/ethernet/intel/igbvf/ethtool.c
+++ b/drivers/net/ethernet/intel/igbvf/ethtool.c
@@ -191,12 +191,14 @@ static void igbvf_get_drvinfo(struct net_device *netdev,
                               struct ethtool_drvinfo *drvinfo)
 {
 	struct igbvf_adapter *adapter = netdev_priv(netdev);
-	char firmware_version[32] = "N/A";
 
-	strncpy(drvinfo->driver,  igbvf_driver_name, 32);
-	strncpy(drvinfo->version, igbvf_driver_version, 32);
-	strncpy(drvinfo->fw_version, firmware_version, 32);
-	strncpy(drvinfo->bus_info, pci_name(adapter->pdev), 32);
+	strlcpy(drvinfo->driver,  igbvf_driver_name, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, igbvf_driver_version,
+		sizeof(drvinfo->version));
+	strlcpy(drvinfo->fw_version, "N/A",
+		sizeof(drvinfo->fw_version));
+	strlcpy(drvinfo->bus_info, pci_name(adapter->pdev),
+		sizeof(drvinfo->bus_info));
 	drvinfo->regdump_len = igbvf_get_regs_len(netdev);
 	drvinfo->eedump_len = igbvf_get_eeprom_len(netdev);
 }
diff --git a/drivers/net/ethernet/intel/ixgb/ixgb_ethtool.c b/drivers/net/ethernet/intel/ixgb/ixgb_ethtool.c
index 9dfce7d..96fcb0e 100644
--- a/drivers/net/ethernet/intel/ixgb/ixgb_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgb/ixgb_ethtool.c
@@ -473,10 +473,13 @@ ixgb_get_drvinfo(struct net_device *netdev,
 {
 	struct ixgb_adapter *adapter = netdev_priv(netdev);
 
-	strncpy(drvinfo->driver,  ixgb_driver_name, 32);
-	strncpy(drvinfo->version, ixgb_driver_version, 32);
-	strncpy(drvinfo->fw_version, "N/A", 32);
-	strncpy(drvinfo->bus_info, pci_name(adapter->pdev), 32);
+	strlcpy(drvinfo->driver,  ixgb_driver_name,
+		sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, ixgb_driver_version,
+		sizeof(drvinfo->version));
+	strlcpy(drvinfo->fw_version, "N/A", sizeof(drvinfo->fw_version));
+	strlcpy(drvinfo->bus_info, pci_name(adapter->pdev),
+		sizeof(drvinfo->bus_info));
 	drvinfo->n_stats = IXGB_STATS_LEN;
 	drvinfo->regdump_len = ixgb_get_regs_len(netdev);
 	drvinfo->eedump_len = ixgb_get_eeprom_len(netdev);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 70d58c3..91f871b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -888,23 +888,19 @@ static void ixgbe_get_drvinfo(struct net_device *netdev,
                               struct ethtool_drvinfo *drvinfo)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
-	char firmware_version[32];
 	u32 nvm_track_id;
 
-	strncpy(drvinfo->driver, ixgbe_driver_name,
-	        sizeof(drvinfo->driver) - 1);
-	strncpy(drvinfo->version, ixgbe_driver_version,
-	        sizeof(drvinfo->version) - 1);
+	strlcpy(drvinfo->driver, ixgbe_driver_name, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, ixgbe_driver_version,
+		sizeof(drvinfo->version));
 
 	nvm_track_id = (adapter->eeprom_verh << 16) |
 			adapter->eeprom_verl;
-	snprintf(firmware_version, sizeof(firmware_version), "0x%08x",
+	snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version), "0x%08x",
 		 nvm_track_id);
 
-	strncpy(drvinfo->fw_version, firmware_version,
-		sizeof(drvinfo->fw_version) - 1);
-	strncpy(drvinfo->bus_info, pci_name(adapter->pdev),
-		sizeof(drvinfo->bus_info) - 1);
+	strlcpy(drvinfo->bus_info, pci_name(adapter->pdev),
+		sizeof(drvinfo->bus_info));
 	drvinfo->n_stats = IXGBE_STATS_LEN;
 	drvinfo->testinfo_len = IXGBE_TEST_LEN;
 	drvinfo->regdump_len = ixgbe_get_regs_len(netdev);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 74e2a2a..ee637a2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -45,13 +45,16 @@ mlx4_en_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *drvinfo)
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_en_dev *mdev = priv->mdev;
 
-	strncpy(drvinfo->driver, DRV_NAME, 32);
-	strncpy(drvinfo->version, DRV_VERSION " (" DRV_RELDATE ")", 32);
-	sprintf(drvinfo->fw_version, "%d.%d.%d",
+	strlcpy(drvinfo->driver, DRV_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, DRV_VERSION " (" DRV_RELDATE ")",
+		sizeof(drvinfo->version));
+	snprintf(drvinfo->fw_version, sizeof(drvinfo->fw_version),
+		"%d.%d.%d",
 		(u16) (mdev->dev->caps.fw_ver >> 32),
 		(u16) ((mdev->dev->caps.fw_ver >> 16) & 0xffff),
 		(u16) (mdev->dev->caps.fw_ver & 0xffff));
-	strncpy(drvinfo->bus_info, pci_name(mdev->dev->pdev), 32);
+	strlcpy(drvinfo->bus_info, pci_name(mdev->dev->pdev),
+		sizeof(drvinfo->bus_info));
 	drvinfo->n_stats = 0;
 	drvinfo->regdump_len = 0;
 	drvinfo->eedump_len = 0;
diff --git a/drivers/net/ethernet/sun/cassini.c b/drivers/net/ethernet/sun/cassini.c
index fd40988..f10665f 100644
--- a/drivers/net/ethernet/sun/cassini.c
+++ b/drivers/net/ethernet/sun/cassini.c
@@ -4532,10 +4532,9 @@ static void cas_set_multicast(struct net_device *dev)
 static void cas_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
 {
 	struct cas *cp = netdev_priv(dev);
-	strncpy(info->driver, DRV_MODULE_NAME, ETHTOOL_BUSINFO_LEN);
-	strncpy(info->version, DRV_MODULE_VERSION, ETHTOOL_BUSINFO_LEN);
-	info->fw_version[0] = '\0';
-	strncpy(info->bus_info, pci_name(cp->pdev), ETHTOOL_BUSINFO_LEN);
+	strlcpy(info->driver, DRV_MODULE_NAME, sizeof(info->driver));
+	strlcpy(info->version, DRV_MODULE_VERSION, sizeof(info->version));
+	strlcpy(info->bus_info, pci_name(cp->pdev), sizeof(info->bus_info));
 	info->regdump_len = cp->casreg_len < CAS_MAX_REGS ?
 		cp->casreg_len : CAS_MAX_REGS;
 	info->n_stats = CAS_NUM_STAT_KEYS;

^ permalink raw reply related

* Re: [patch net-next V8] net: introduce ethernet teaming device
From: Andy Gospodarek @ 2011-11-14 17:31 UTC (permalink / raw)
  To: David Miller
  Cc: jpirko, netdev, eric.dumazet, bhutchings, shemminger, fubar, andy,
	tgraf, ebiederm, mirqus, kaber, greearb, jesse, fbl,
	benjamin.poirier, jzupka, ivecera
In-Reply-To: <20111113.160951.1905657808657009542.davem@davemloft.net>

On Sun, Nov 13, 2011 at 04:09:51PM -0500, David Miller wrote:
> From: Jiri Pirko <jpirko@redhat.com>
> Date: Sat, 12 Nov 2011 09:16:48 +0100
> 
> > This patch introduces new network device called team. It supposes to be
> > very fast, simple, userspace-driven alternative to existing bonding
> > driver.
> > 
> > Userspace library called libteam with couple of demo apps is available
> > here:
> > https://github.com/jpirko/libteam
> > Note it's still in its dipers atm.
> > 
> > team<->libteam use generic netlink for communication. That and rtnl
> > suppose to be the only way to configure team device, no sysfs etc.
> > 
> > Python binding of libteam was recently introduced.
> > Daemon providing arpmon/miimon active-backup functionality will be
> > introduced shortly. All what's necessary is already implemented in
> > kernel team driver.
> > 
> > Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> 
> Applied, thanks for all of your hard work.

I'm a bit surprised by this.  Not only is this new function currently
difficult to setup (it took me over an hour to go from a a fresh F16
install to one that had all the necessary libraries and tools to even
configure a team device for the first time), but I was able to cause an
Oops with v8 in only a few minutes of testing.

I have no problem with this functionality as an add-on and possibly
future replacement to some of what currently exists with bonding, but it
seems like what is included in the initial support should should:

1. Not panic easily.
2. Have userspace bits in place to actually test all the proposed kernel
code.  (Jiri admits there is no way to verify the active-backup code).
3. Have some known, published test results.

I hope Jiri will reconsider having a separate team tree for the next few
weeks or months until these issues are worked out.  I think the hard
work will pay off and it is close to being ready; it just doesn't seem
like it is right now.

^ permalink raw reply

* Re: net: Add network priority cgroup
From: Neil Horman @ 2011-11-14 17:24 UTC (permalink / raw)
  To: Shyam_Iyer; +Cc: dave.taht, netdev, john.r.fastabend, robert.w.love, davem
In-Reply-To: <DBFB1B45AF80394ABD1C807E9F28D157077D7D8D5F@BLRX7MCDC203.AMER.DELL.COM>

On Mon, Nov 14, 2011 at 10:13:37PM +0530, Shyam_Iyer@Dell.com wrote:
> 
> 
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org [mailto:netdev-
> > owner@vger.kernel.org] On Behalf Of Neil Horman
> > Sent: Monday, November 14, 2011 9:44 AM
> > To: Dave Taht
> > Cc: netdev@vger.kernel.org; John Fastabend; Robert Love; David S.
> > Miller
> > Subject: Re: net: Add network priority cgroup
> > 
> > On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> > > On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com>
> > wrote:
> > > > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> > > >> Data Center Bridging environments are currently somewhat limited
> > in their
> > > >> ability to provide a general mechanism for controlling traffic
> > priority.
> > > >> Specifically they are unable to administratively control the
> > priority at which
> > > >> various types of network traffic are sent.
> > > >>
> > > >> Currently, the only ways to set the priority of a network buffer
> > are:
> > > >>
> > > >> 1) Through the use of the SO_PRIORITY socket option
> > > >> 2) By using low level hooks, like a tc action
> > > >>
> > > >> (1) is difficult from an administrative perspective because it
> > requires that the
> > > >> application to be coded to not just assume the default priority is
> > sufficient,
> > > >> and must expose an administrative interface to allow priority
> > adjustment.  Such
> > > >> a solution is not scalable in a DCB environment
> > > >>
> > > >> (2) is also difficult, as it requires constant administrative
> > oversight of
> > > >> applications so as to build appropriate rules to match traffic
> > belonging to
> > > >> various classes, so that priority can be appropriately set. It is
> > further
> > > >> limiting when DCB enabled hardware is in use, due to the fact that
> > tc rules are
> > > >> only run after a root qdisc has been selected (DCB enabled
> > hardware may reserve
> > > >> hw queues for various traffic classes and needs the priority to be
> > set prior to
> > > >> selecting the root qdisc)
> > > >>
> > > >>
> > > >> I've discussed various solutions with John Fastabend, and we saw a
> > cgroup as
> > > >> being a good general solution to this problem.  The network
> > priority cgroup
> > > >> allows for a per-interface priority map to be built per cgroup.
> >  Any traffic
> > > >> originating from an application in a cgroup, that does not
> > explicitly set its
> > > >> priority with SO_PRIORITY will have its priority assigned to the
> > value
> > > >> designated for that group on that interface.  This allows a user
> > space daemon,
> > > >> when conducting LLDP negotiation with a DCB enabled peer to create
> > a cgroup
> > > >> based on the APP_TLV value received and administratively assign
> > applications to
> > > >> that priority using the existing cgroup utility infrastructure.
> > > >>
> > > >> Tested by John and myself, with good results
> > > >>
> > > >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > >> CC: John Fastabend <john.r.fastabend@intel.com>
> > > >> CC: Robert Love <robert.w.love@intel.com>
> > > >> CC: "David S. Miller" <davem@davemloft.net>
> > > >> --
> > > >> To unsubscribe from this list: send the line "unsubscribe netdev"
> > in
> > > >> the body of a message to majordomo@vger.kernel.org
> > > >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > >>
> > > >
> > > > Bump, any other thoughts here?  Dave T. has some reasonable
> > thoughts regarding
> > > > the use of skb->priority, but IMO they really seem orthogonal to
> > the purpose of
> > > > this change.  Any other reviews would be welcome.
> > >
> > > Well, in part I've been playing catchup in the hope that lldp and
> > > openlldp and/or this dcb netlink layer that I don't know anything
> > > about (pointers please?) could help somehow to resolve the semantic
> > > mess skb->priority has become in the first place.
> > >
> > > I liked what was described here.
> > >
> > > "What if we did at least carve out the DCB functionality away from
> > > skb->priority?  Since, AIUI, we're only concerning ourselves with
> > > locally generated traffic here, we're talking
> > > about skbs that have a socket attached to them.  We could, instead of
> > indexing
> > > the prio_tc_map with skb->priority, we could index it with
> > > skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The
> > cgroup
> > > then could be, instead of a strict priority cgroup, a queue_selector
> > cgroup (or
> > > something more appropriately named), and we don't have to touch skb-
> > >priority at
> > > all.  I'd really rather not start down that road until I got more
> > opinions and
> > > consensus on that, but it seems like a pretty good solution, one that
> > would
> > > allow hardware queue selection in systems that use things like DCB to
> > co-exist
> > > with software queueing features."
> > >
> > I was initially ok with this, but the more I think about it, the more I
> > think
> > its just not needed (see further down in this email for my reasoning).
> > John,
> > Rob, do you have any thoughts here?
> > 
> > > The piece that still kind of bothered me about the original proposal
> > > (and perhaps this one) was that setting SO_PRIORITY in an app means
> > > 'give my packets more mojo'.
> > >
> > > Taking something that took unprioritized packets and assigned them
> > and
> > > *them only* to a hardware queue struck me as possibly deprioritizing
> > > the 'more mojo wanted' packets in the app(s), as they would end up in
> > > some other, possibly overloaded, hardware queue.
> > >
> > I don't really see what you mean by this at all.  Taking packets with
> > no
> > priority and assigning them a priority doesn't really have an effect on
> > pre-prioritized packets.  Or rather it shouldn't.  You can certainly
> > create a
> > problem by having apps prioritized according to conflicting semantic
> > rules, but
> > that strikes me as administrative error.  Garbage in...Garbage out.
> > 
> > > So a cgroup that moves all of the packets from an application into a
> > > given hardware queue, and then gets scheduled normally according to
> > > skb->priority and friends (software queue, default of pfifo_fast,
> > > etc), seems to make some sense to me. (I wouldn't mind if we had
> > > abstractions for software queues, too, like, I need a software queue
> > > with these properties, find me a place for it on the hardware - but
> > > I'm dreaming)
> > >
> > > One open question is where do packets generated from other subsystems
> > > end up, if you are using a cgroup for the app? arp, dns, etc?
> > >
> > The overriding rule is the association of an skb to a socket.  If a
> > transmitted
> > frame has skb->sk set in dev_queue_xmit, then we interrogate its
> > priority index
> > as set when we passed through the sendmsg code at the top of the stack.
> > Otherwise its behavior is unchanged from its current standpoint.
> > 
> > > So to rephrase your original description from this:
> > >
> > > >> Any traffic originating from an application in a cgroup, that does
> > not explicitly set its
> > > >> priority with SO_PRIORITY will have its priority assigned to the
> > value
> > > >> designated for that group on that interface.  This allows a user
> > space daemon,
> > > >> when conducting LLDP negotiation with a DCB enabled peer to create
> > a cgroup
> > > >> based on the APP_TLV value received and administratively assign
> > applications to
> > > >> that priority using the existing cgroup utility infrastructure.
> > > > John, Robert, if you're supportive of these changes, some Acks
> > would be
> > > > appreciated.
> > >
> > > To this:
> > >
> > > "Any traffic originating from an application in a cgroup,  will have
> > > its hardware queue  assigned to the value designated for that group
> > on
> > > that interface.  This allows a user space daemon, when conducting
> > LLDP
> > > negotiation with a DCB enabled peer to create a cgroup based on the
> > > APP_TLV value received and administratively assign applications to
> > > that hardware queue using the existing cgroup utility
> > infrastructure."
> > >
> > As above, I'm split brained about this.  I'm ok with the idea of making
> > this a
> > queue selection cgroup, and separating it from priority, but at the
> > same time,
> > in the context of DCB, we really are assigning priority here, so it
> > seems a bit
> > false to do something that is not priority.  I also like the fact that
> > it
> > provides administrative control in a way that netfilter and tc don't
> > really
> > enable.
> > 
> > > Assuming we're on the same page here, what the heck is APP_TLV?
> > >
> > LLDP does layer 2 discovery with peer networking devices. It does so
> > using sets
> > of Type/length/value tuples.  The types carry various bits of
> > information, such
> > as which priority groups are available on the network.  The APP tlv
> > conveys
> > application or feature specific information.  for instance, There is an
> > ISCSI
> > app tlv that tells the host that  "on the interface you received this
> > tlv, iscsi
> > traffic must be sent at priority X".  The idea being that, on receipt
> > of this
> > tlv, the DCB daemon can create an ISCSI network priority cgroup
> > instance, and
> > augment the cgroup rules file such that, when the user space iscsi
> > daemon is
> > started, its traffic automatically transmits at the appropriate
> > priority.
> 
> Love this !
> 
> I guess if this is integrated to libvirt via libcgroups VMs could be assigned a network priority..
> 
As the patch stand currently, absolutely.  Just drop a qemu process into the
approriate cgroup, and (assuming you're using a tun/tap type device), all the
traffic from that vm will get assigned the corresponding priority.  You can do
the same thing with classification using net_cls already.  I did a video of it
here:
http://www.youtube.com/watch?v=KX5QV4LId_c
Neil

> http://linuxplumbersconf.org/2010/ocw/proposals/843
> 
> 
> 

^ permalink raw reply

* Re: [patch net-next V8] net: introduce ethernet teaming device
From: Andy Gospodarek @ 2011-11-14 17:18 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, eric.dumazet, bhutchings, shemminger, fubar, andy,
	tgraf, ebiederm, mirqus, kaber, greearb, jesse, fbl,
	benjamin.poirier, jzupka, ivecera
In-Reply-To: <1321085808-6871-1-git-send-email-jpirko@redhat.com>

On Sat, Nov 12, 2011 at 09:16:48AM +0100, Jiri Pirko wrote:
> This patch introduces new network device called team. It supposes to be
> very fast, simple, userspace-driven alternative to existing bonding
> driver.
> 
> Userspace library called libteam with couple of demo apps is available
> here:
> https://github.com/jpirko/libteam
> Note it's still in its dipers atm.
> 
> team<->libteam use generic netlink for communication. That and rtnl
> suppose to be the only way to configure team device, no sysfs etc.
> 
> Python binding of libteam was recently introduced.
> Daemon providing arpmon/miimon active-backup functionality will be
> introduced shortly. All what's necessary is already implemented in
> kernel team driver.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> 
> v7->v8:
> 	- check ndo_ndo_vlan_rx_[add/kill]_vid functions before calling
> 	  them.
> 	- use dev_kfree_skb_any() instead of dev_kfree_skb()
> 
> v6->v7:
> 	- transmit and receive functions are not checked in hot paths.
> 	  That also resolves memory leak on transmit when no port is
> 	  present
> 
> v5->v6:
> 	- changed couple of _rcu calls to non _rcu ones in non-readers
> 
> v4->v5:
> 	- team_change_mtu() uses team->lock while travesing though port
> 	  list
> 	- mac address changes are moved completely to jurisdiction of
> 	  userspace daemon. This way the daemon can do FOM1, FOM2 and
> 	  possibly other weird things with mac addresses.
> 	  Only round-robin mode sets up all ports to bond's address then
> 	  enslaved.
> 	- Extended Kconfig text
> 
> v3->v4:
> 	- remove redundant synchronize_rcu from __team_change_mode()
> 	- revert "set and clear of mode_ops happens per pointer, not per
> 	  byte"
> 	- extend comment of function __team_change_mode()
> 
> v2->v3:
> 	- team_change_mtu() uses rcu version of list traversal to unwind
> 	- set and clear of mode_ops happens per pointer, not per byte
> 	- port hashlist changed to be embedded into team structure
> 	- error branch in team_port_enter() does cleanup now
> 	- fixed rtln->rtnl
> 
> v1->v2:
> 	- modes are made as modules. Makes team more modular and
> 	  extendable.
> 	- several commenters' nitpicks found on v1 were fixed
> 	- several other bugs were fixed.
> 	- note I ignored Eric's comment about roundrobin port selector
> 	  as Eric's way may be easily implemented as another mode (mode
> 	  "random") in future.

You better get ready for v9.

Running the command:

# team_manual_control team0 set mode roundrobin

on a system with team0 running in roundrobin mode produces this:

[ 2127.785321] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 2127.788079] IP: [<ffffffffa0196edd>] team_nl_fill_options_get_changed+0xc5/0x240 [team]
[ 2127.790847] PGD 13eecf067 PUD 13f758067 PMD 0 
[ 2127.793603] Oops: 0000 [#1] SMP 
[ 2127.796352] CPU 7 
[ 2127.796370] Modules linked in: team_mode_roundrobin(O) team(O) fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_state nf_conntrack snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device i2c_i801 joydev microcode shpchp snd_pcm snd_timer snd soundcore snd_page_alloc bnx2 iTCO_wdt iTCO_vendor_support e1000e uinput firewire_ohci firewire_core crc_itu_t i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: nf_defrag_ipv4]
[ 2127.808223] 
[ 2127.811261] Pid: 7085, comm: team_manual_con Tainted: G           O 3.2.0-rc1+ #1 Intel Corporation 2012 Client Platform/LosLunas CRB
[ 2127.814421] RIP: 0010:[<ffffffffa0196edd>]  [<ffffffffa0196edd>] team_nl_fill_options_get_changed+0xc5/0x240 [team]
[ 2127.817597] RSP: 0018:ffff88012ec3d968  EFLAGS: 00010286
[ 2127.820758] RAX: 0000000000000000 RBX: ffff8801397bb600 RCX: ffffffffffffffff
[ 2127.823947] RDX: ffff88013f4ba048 RSI: 0000000000000000 RDI: 0000000000000000
[ 2127.827154] RBP: ffff88012ec3d9c8 R08: ffff88013f4ba048 R09: 0000000000000004
[ 2127.830365] R10: 0000000000001bad R11: 0000000000000000 R12: ffff880143a8b740
[ 2127.833599] R13: ffff880143aca7e8 R14: ffff88013f4ba014 R15: ffff88013f4ba048
[ 2127.836838] FS:  00007fd65cdc8700(0000) GS:ffff88014e2e0000(0000) knlGS:0000000000000000
[ 2127.840102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2127.843386] CR2: 0000000000000000 CR3: 0000000128531000 CR4: 00000000001406e0
[ 2127.846688] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2127.849987] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2127.853278] Process team_manual_con (pid: 7085, threadinfo ffff88012ec3c000, task ffff88013e842e40)
[ 2127.856605] Stack:
[ 2127.859898]  0000000000000000 ffff880143a8b7e8 0000000000000000 ffff88013f4ba01c
[ 2127.863261]  ffffffffa0019140 0000000500000000 0000000000000000 ffff8801397bb600
[ 2127.866623]  ffff88012ec3da58 ffffffffa0197058 ffff880143a8b740 00000000fffffff4
[ 2127.869993] Call Trace:
[ 2127.873344]  [<ffffffffa0197058>] ? team_nl_fill_options_get_changed+0x240/0x240 [team]
[ 2127.876750]  [<ffffffffa0197078>] team_nl_fill_options_get+0x20/0x22 [team]
[ 2127.880152]  [<ffffffffa019763c>] team_nl_send_generic+0x41/0x85 [team]
[ 2127.880156]  [<ffffffffa01976f5>] team_nl_cmd_options_get+0x36/0x3f [team]
[ 2127.880162]  [<ffffffff813fbdce>] genl_rcv_msg+0x1d8/0x203
[ 2127.880165]  [<ffffffff813fbbf6>] ? genl_rcv+0x2d/0x2d
[ 2127.880169]  [<ffffffff813fb7b2>] netlink_rcv_skb+0x42/0x8d
[ 2127.880172]  [<ffffffff813fbbef>] genl_rcv+0x26/0x2d
[ 2127.880174]  [<ffffffff813fb341>] netlink_unicast+0xec/0x156
[ 2127.880178]  [<ffffffff813fb5a6>] netlink_sendmsg+0x1fb/0x233
[ 2127.880182]  [<ffffffff813ca577>] sock_sendmsg+0xe6/0x109
[ 2127.880188]  [<ffffffff81120c76>] ? __mem_cgroup_commit_charge+0x9d/0xa9
[ 2127.880192]  [<ffffffff8112333b>] ? mem_cgroup_charge_common+0xb1/0xc3
[ 2127.880197]  [<ffffffff81043ff5>] ? should_resched+0xe/0x2d
[ 2127.880203]  [<ffffffff814b6208>] ? _cond_resched+0xe/0x22
[ 2127.880206]  [<ffffffff81043ff5>] ? should_resched+0xe/0x2d
[ 2127.880209]  [<ffffffff813d4433>] ? copy_from_user+0x2f/0x31
[ 2127.880212]  [<ffffffff813d481e>] ? verify_iovec+0x52/0xa4
[ 2127.880215]  [<ffffffff813ca85c>] __sys_sendmsg+0x213/0x2ba
[ 2127.880220]  [<ffffffff810fb1eb>] ? handle_mm_fault+0x1c8/0x1db
[ 2127.880224]  [<ffffffff814bab67>] ? do_page_fault+0x30c/0x37e
[ 2127.880228]  [<ffffffff814b7914>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[ 2127.880232]  [<ffffffff81045079>] ? __wake_up+0x44/0x4d
[ 2127.880235]  [<ffffffff813cc459>] sys_sendmsg+0x42/0x60
[ 2127.880239]  [<ffffffff814be042>] system_call_fastpath+0x16/0x1b
[ 2127.880241] Code: e9 24 01 00 00 be 01 00 00 00 48 89 df e8 aa f3 ff ff 48 85 c0 49 89 c7 0f 84 4b 01 00 00 49 8b 75 10 31 c0 48 83 c9 ff 48 89 f7 <f2> ae 48 89 df 89 ca 48 89 f1 be 01 00 00 00 f7 d2 e8 2f 4c 0a 
[ 2127.880263] RIP  [<ffffffffa0196edd>] team_nl_fill_options_get_changed+0xc5/0x240 [team]
[ 2127.880268]  RSP <ffff88012ec3d968>
[ 2127.880269] CR2: 0000000000000000
[ 2127.880287] ---[ end trace 3e104c6acd231d26 ]---

Can you provide a detailed report of the testing you have done on the
team device?  It seems proper testing would have found something like
this.

^ permalink raw reply

* Re: [PATCH] netdev/phy: Use mdiobus_read() so that proper locks are taken.
From: Timur Tabi @ 2011-11-14 17:10 UTC (permalink / raw)
  To: Andy Fleming; +Cc: David Daney, netdev@vger.kernel.org, davem@davemloft.net
In-Reply-To: <CAKWjMd4WjDNHH8dOr9GPxQTC69SGgjh+euWh_DSjhtRxquB8qA@mail.gmail.com>

Andy Fleming wrote:

> Yes, that is correct. I have a patch, I just have to resend it. Will do
> next time I come to a break in what I'm working on.

Can you point me to that patch, so that I can try it out now?

I fixed some lockdep-related code in my audio driver.  The driver was not initializing a sysfs attr structure, so I added a call to sysfs_attr_init().  But when I do that, I get the following bug report.  I don't understand the connection.

=============================================                                   
[ INFO: possible recursive locking detected ]                                   
3.2.0-10b-00093-gebea711-dirty #10                                              
---------------------------------------------                                   
kworker/1:1/271 is trying to acquire lock:                                      
 (&(&(priv->tx_queue[i]->txlock))->rlock){......}, at: [<c02b2e28>] lock_tx_qs+0
x34/0x54                                                                        
                                                                                
but task is already holding lock:                                               
 (&(&(priv->tx_queue[i]->txlock))->rlock){......}, at: [<c02b2e28>] lock_tx_qs+0
x34/0x54                                                                        
                                                                                
other info that might help us debug this:                                       
 Possible unsafe locking scenario:                                              
                                                                                
       CPU0                                                                     
       ----                                                                     
  lock(&(&(priv->tx_queue[i]->txlock))->rlock);                                 
  lock(&(&(priv->tx_queue[i]->txlock))->rlock);                                 
                                                                                
 *** DEADLOCK ***                                                               
                                                                                
 May be due to missing lock nesting notation                                    
                                                                                
4 locks held by kworker/1:1/271:                                                
 #0:  (events){.+.+.+}, at: [<c005b7c8>] process_one_work+0x138/0x490           
 #1:  ((&(&dev->state_queue)->work)){+.+...}, at: [<c005b7c8>] process_one_work+
0x138/0x490                                                                     
 #2:  (&dev->lock){+.+...}, at: [<c02ab3cc>] phy_state_machine+0x30/0x580       
 #3:  (&(&(priv->tx_queue[i]->txlock))->rlock){......}, at: [<c02b2e28>] lock_tx
_qs+0x34/0x54                                                                   
                                                                                
stack backtrace:                                                                
Call Trace:                                                                     
[e69d5d80] [c0008c7c] show_stack+0x44/0x154 (unreliable)                        
[e69d5dc0] [c007bafc] __lock_acquire+0xefc/0x1824                               
[e69d5e70] [c007c870] lock_acquire+0x4c/0x68                                    
[e69d5e90] [c04575d8] _raw_spin_lock+0x44/0x60                                  
[e69d5ea0] [c02b2e28] lock_tx_qs+0x34/0x54                                      
[e69d5ec0] [c02b2f24] adjust_link+0x34/0x1d8                                    
[e69d5ef0] [c02ab73c] phy_state_machine+0x3a0/0x580                             
[e69d5f10] [c005b83c] process_one_work+0x1ac/0x490                              
[e69d5f50] [c005e4c0] worker_thread+0x18c/0x35c                                 
[e69d5f90] [c00631d4] kthread+0x7c/0x80                                         
[e69d5ff0] [c000e588] kernel_thread+0x4c/0x68                                   

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: introduce build_skb()
From: Ben Hutchings @ 2011-11-14 17:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, eilong, pstaszewski, netdev, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger, Thomas Graf, Herbert Xu,
	Jeff Kirsher
In-Reply-To: <1321286614.2272.47.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Mon, 2011-11-14 at 17:03 +0100, Eric Dumazet wrote:
> One of the thing we discussed during netdev 2011 conference was the idea
> to change some network drivers to allocate/populate their skb at RX
> completion time, right before feeding the skb to network stack.
> 
> In old days, we allocated skbs when populating the RX ring.
> 
> This means bringing into cpu cache sk_buff and skb_shared_info cache
> lines (since we clear/initialize them), then 'queue' skb->data to NIC.
> 
> By the time NIC fills a frame in skb->data buffer and host can process
> it, cpu probably threw away the cache lines from its caches, because lot
> of things happened between the allocation and final use.

This is definitely a win so long as skbs are getting merged by GRO.
However, those cache misses take less time than skb allocation, so
preallocation of skbs normally reduces latency.  This is why sfc has an
adaptive buffer allocation behaviour.

> So the deal would be to allocate only the data buffer for the NIC to
> populate its RX ring buffer. And use build_skb() at RX completion to
> attach a data buffer (now filled with an ethernet frame) to a new skb,
> initialize the skb_shared_info portion, and give the hot skb to network
> stack.
> 
> build_skb() is the function to allocate an skb, caller providing the
> data buffer that should be attached to it. Drivers are expected to call 
> skb_reserve() right after build_skb() to adjust skb->data to the
> Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
> drivers might add a hardware provided alignment)
[...]

The option to attach a heap-allocated buffer rather than allocating a
header buffer and attaching a page might shift the balance.  I'll have
to test this (but don't hold your breath).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH net-next 2/2] bnx2x: uses build_skb() in receive path
From: Eilon Greenstein @ 2011-11-14 16:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev@vger.kernel.org, Ben Hutchings, Tom Herbert,
	Jamal Hadi Salim, Stephen Hemminger, Thomas Graf, Herbert Xu,
	Jeff Kirsher, pstaszewski@itcare.pl
In-Reply-To: <1321286734.2272.48.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Mon, 2011-11-14 at 08:05 -0800, Eric Dumazet wrote:
> bnx2x uses following formula to compute its rx_buf_sz :
> 
> dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2
> 
> Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info))
> 
> Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
> MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.
> 
> Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
> false sharing because of mem_reclaim in UDP stack.
> 
> One possible way to half truesize is to reduce the need by 64 bytes
> (2112 -> 2048 bytes)
> 
> Instead of allocating a full cache line at the end of packet for
> alignment, we can use the fact that skb_shared_info sits at the end of
> skb->head, and we can use this room, if we convert bnx2x to new
> build_skb() infrastructure.
> 
> skb_shared_info will be initialized after hardware finished its
> transfert, so we can eventually overwrite the final padding.
> 
> Using build_skb() also reduces cache line misses in the driver, since we
> use cache hot skb instead of cold ones. Number of in-flight sk_buff
> structures is lower, they are recycled while still hot.
> 
> Performance results :
> 
> (820.000 pps on a rx UDP monothread benchmark, instead of 720.000 pps)

Wow! This is very impressive. The only problem will be to back port this
driver now... But it is definitely worth the effort :)

> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Eilon Greenstein <eilong@broadcom.com>

> CC: Eilon Greenstein <eilong@broadcom.com>
> CC: Ben Hutchings <bhutchings@solarflare.com>
> CC: Tom Herbert <therbert@google.com>
> CC: Jamal Hadi Salim <hadi@mojatatu.com>
> CC: Stephen Hemminger <shemminger@vyatta.com>
> CC: Thomas Graf <tgraf@infradead.org>
> CC: Herbert Xu <herbert@gondor.apana.org.au>
> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x.h         |   30 -
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c     |  265 ++++------
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h     |   33 -
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |    6
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c    |    4
>  5 files changed, 170 insertions(+), 168 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> index 4b90b51..0f7b7a4 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> @@ -293,8 +293,13 @@ enum {
>  #define FCOE_TXQ_IDX(bp)       (MAX_ETH_TXQ_IDX(bp))
> 
>  /* fast path */
> +/*
> + * This driver uses new build_skb() API :
> + * RX ring buffer contains pointer to kmalloc() data only,
> + * skb are built only after Hardware filled the frame.
> + */
>  struct sw_rx_bd {
> -       struct sk_buff  *skb;
> +       u8              *data;
>         DEFINE_DMA_UNMAP_ADDR(mapping);
>  };
> 
> @@ -424,8 +429,8 @@ union host_hc_status_block {
> 
>  struct bnx2x_agg_info {
>         /*
> -        * First aggregation buffer is an skb, the following - are pages.
> -        * We will preallocate the skbs for each aggregation when
> +        * First aggregation buffer is a data buffer, the following - are pages.
> +        * We will preallocate the data buffer for each aggregation when
>          * we open the interface and will replace the BD at the consumer
>          * with this one when we receive the TPA_START CQE in order to
>          * keep the Rx BD ring consistent.
> @@ -439,6 +444,7 @@ struct bnx2x_agg_info {
>         u16                     parsing_flags;
>         u16                     vlan_tag;
>         u16                     len_on_bd;
> +       u32                     rxhash;
>  };
> 
>  #define Q_STATS_OFFSET32(stat_name) \
> @@ -1187,10 +1193,20 @@ struct bnx2x {
>  #define ETH_MAX_JUMBO_PACKET_SIZE      9600
> 
>         /* Max supported alignment is 256 (8 shift) */
> -#define BNX2X_RX_ALIGN_SHIFT           ((L1_CACHE_SHIFT < 8) ? \
> -                                        L1_CACHE_SHIFT : 8)
> -       /* FW use 2 Cache lines Alignment for start packet and size  */
> -#define BNX2X_FW_RX_ALIGN              (2 << BNX2X_RX_ALIGN_SHIFT)
> +#define BNX2X_RX_ALIGN_SHIFT           min(8, L1_CACHE_SHIFT)
> +
> +       /* FW uses 2 Cache lines Alignment for start packet and size
> +        *
> +        * We assume skb_build() uses sizeof(struct skb_shared_info) bytes
> +        * at the end of skb->data, to avoid wasting a full cache line.
> +        * This reduces memory use (skb->truesize).
> +        */
> +#define BNX2X_FW_RX_ALIGN_START        (1UL << BNX2X_RX_ALIGN_SHIFT)
> +
> +#define BNX2X_FW_RX_ALIGN_END                                  \
> +       max(1UL << BNX2X_RX_ALIGN_SHIFT,                        \
> +           SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> +
>  #define BNX2X_PXP_DRAM_ALIGN           (BNX2X_RX_ALIGN_SHIFT - 5)
> 
>         struct host_sp_status_block *def_status_blk;
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> index 13dad92..0d60b9e 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -294,8 +294,21 @@ static void bnx2x_update_sge_prod(struct bnx2x_fastpath *fp,
>            fp->last_max_sge, fp->rx_sge_prod);
>  }
> 
> +/* Set Toeplitz hash value in the skb using the value from the
> + * CQE (calculated by HW).
> + */
> +static u32 bnx2x_get_rxhash(const struct bnx2x *bp,
> +                           const struct eth_fast_path_rx_cqe *cqe)
> +{
> +       /* Set Toeplitz hash from CQE */
> +       if ((bp->dev->features & NETIF_F_RXHASH) &&
> +           (cqe->status_flags & ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
> +               return le32_to_cpu(cqe->rss_hash_result);
> +       return 0;
> +}
> +
>  static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
> -                           struct sk_buff *skb, u16 cons, u16 prod,
> +                           u16 cons, u16 prod,
>                             struct eth_fast_path_rx_cqe *cqe)
>  {
>         struct bnx2x *bp = fp->bp;
> @@ -310,9 +323,9 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
>         if (tpa_info->tpa_state != BNX2X_TPA_STOP)
>                 BNX2X_ERR("start of bin not in stop [%d]\n", queue);
> 
> -       /* Try to map an empty skb from the aggregation info  */
> +       /* Try to map an empty data buffer from the aggregation info  */
>         mapping = dma_map_single(&bp->pdev->dev,
> -                                first_buf->skb->data,
> +                                first_buf->data + NET_SKB_PAD,
>                                  fp->rx_buf_size, DMA_FROM_DEVICE);
>         /*
>          *  ...if it fails - move the skb from the consumer to the producer
> @@ -322,15 +335,15 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
> 
>         if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
>                 /* Move the BD from the consumer to the producer */
> -               bnx2x_reuse_rx_skb(fp, cons, prod);
> +               bnx2x_reuse_rx_data(fp, cons, prod);
>                 tpa_info->tpa_state = BNX2X_TPA_ERROR;
>                 return;
>         }
> 
> -       /* move empty skb from pool to prod */
> -       prod_rx_buf->skb = first_buf->skb;
> +       /* move empty data from pool to prod */
> +       prod_rx_buf->data = first_buf->data;
>         dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
> -       /* point prod_bd to new skb */
> +       /* point prod_bd to new data */
>         prod_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>         prod_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> 
> @@ -344,6 +357,7 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
>         tpa_info->tpa_state = BNX2X_TPA_START;
>         tpa_info->len_on_bd = le16_to_cpu(cqe->len_on_bd);
>         tpa_info->placement_offset = cqe->placement_offset;
> +       tpa_info->rxhash = bnx2x_get_rxhash(bp, cqe);
> 
>  #ifdef BNX2X_STOP_ON_ERROR
>         fp->tpa_queue_used |= (1 << queue);
> @@ -471,11 +485,12 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>  {
>         struct bnx2x_agg_info *tpa_info = &fp->tpa_info[queue];
>         struct sw_rx_bd *rx_buf = &tpa_info->first_buf;
> -       u8 pad = tpa_info->placement_offset;
> +       u32 pad = tpa_info->placement_offset;
>         u16 len = tpa_info->len_on_bd;
> -       struct sk_buff *skb = rx_buf->skb;
> +       struct sk_buff *skb = NULL;
> +       u8 *data = rx_buf->data;
>         /* alloc new skb */
> -       struct sk_buff *new_skb;
> +       u8 *new_data;
>         u8 old_tpa_state = tpa_info->tpa_state;
> 
>         tpa_info->tpa_state = BNX2X_TPA_STOP;
> @@ -486,18 +501,18 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>         if (old_tpa_state == BNX2X_TPA_ERROR)
>                 goto drop;
> 
> -       /* Try to allocate the new skb */
> -       new_skb = netdev_alloc_skb(bp->dev, fp->rx_buf_size);
> +       /* Try to allocate the new data */
> +       new_data = kmalloc(fp->rx_buf_size + NET_SKB_PAD, GFP_ATOMIC);
> 
>         /* Unmap skb in the pool anyway, as we are going to change
>            pool entry status to BNX2X_TPA_STOP even if new skb allocation
>            fails. */
>         dma_unmap_single(&bp->pdev->dev, dma_unmap_addr(rx_buf, mapping),
>                          fp->rx_buf_size, DMA_FROM_DEVICE);
> +       if (likely(new_data))
> +               skb = build_skb(data);
> 
> -       if (likely(new_skb)) {
> -               prefetch(skb);
> -               prefetch(((char *)(skb)) + L1_CACHE_BYTES);
> +       if (likely(skb)) {
> 
>  #ifdef BNX2X_STOP_ON_ERROR
>                 if (pad + len > fp->rx_buf_size) {
> @@ -509,8 +524,9 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>                 }
>  #endif
> 
> -               skb_reserve(skb, pad);
> +               skb_reserve(skb, pad + NET_SKB_PAD);
>                 skb_put(skb, len);
> +               skb->rxhash = tpa_info->rxhash;
> 
>                 skb->protocol = eth_type_trans(skb, bp->dev);
>                 skb->ip_summed = CHECKSUM_UNNECESSARY;
> @@ -526,8 +542,8 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>                 }
> 
> 
> -               /* put new skb in bin */
> -               rx_buf->skb = new_skb;
> +               /* put new data in bin */
> +               rx_buf->data = new_data;
> 
>                 return;
>         }
> @@ -539,19 +555,6 @@ drop:
>         fp->eth_q_stats.rx_skb_alloc_failed++;
>  }
> 
> -/* Set Toeplitz hash value in the skb using the value from the
> - * CQE (calculated by HW).
> - */
> -static inline void bnx2x_set_skb_rxhash(struct bnx2x *bp, union eth_rx_cqe *cqe,
> -                                       struct sk_buff *skb)
> -{
> -       /* Set Toeplitz hash from CQE */
> -       if ((bp->dev->features & NETIF_F_RXHASH) &&
> -           (cqe->fast_path_cqe.status_flags &
> -            ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
> -               skb->rxhash =
> -               le32_to_cpu(cqe->fast_path_cqe.rss_hash_result);
> -}
> 
>  int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>  {
> @@ -594,6 +597,7 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                 u8 cqe_fp_flags;
>                 enum eth_rx_cqe_type cqe_fp_type;
>                 u16 len, pad;
> +               u8 *data;
> 
>  #ifdef BNX2X_STOP_ON_ERROR
>                 if (unlikely(bp->panic))
> @@ -604,13 +608,6 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                 bd_prod = RX_BD(bd_prod);
>                 bd_cons = RX_BD(bd_cons);
> 
> -               /* Prefetch the page containing the BD descriptor
> -                  at producer's index. It will be needed when new skb is
> -                  allocated */
> -               prefetch((void *)(PAGE_ALIGN((unsigned long)
> -                                            (&fp->rx_desc_ring[bd_prod])) -
> -                                 PAGE_SIZE + 1));
> -
>                 cqe = &fp->rx_comp_ring[comp_ring_cons];
>                 cqe_fp = &cqe->fast_path_cqe;
>                 cqe_fp_flags = cqe_fp->type_error_flags;
> @@ -626,125 +623,110 @@ int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                 if (unlikely(CQE_TYPE_SLOW(cqe_fp_type))) {
>                         bnx2x_sp_event(fp, cqe);
>                         goto next_cqe;
> +               }
> +               rx_buf = &fp->rx_buf_ring[bd_cons];
> +               data = rx_buf->data;
> 
> -               /* this is an rx packet */
> -               } else {
> -                       rx_buf = &fp->rx_buf_ring[bd_cons];
> -                       skb = rx_buf->skb;
> -                       prefetch(skb);
> -
> -                       if (!CQE_TYPE_FAST(cqe_fp_type)) {
> +               if (!CQE_TYPE_FAST(cqe_fp_type)) {
>  #ifdef BNX2X_STOP_ON_ERROR
> -                               /* sanity check */
> -                               if (fp->disable_tpa &&
> -                                   (CQE_TYPE_START(cqe_fp_type) ||
> -                                    CQE_TYPE_STOP(cqe_fp_type)))
> -                                       BNX2X_ERR("START/STOP packet while "
> -                                                 "disable_tpa type %x\n",
> -                                                 CQE_TYPE(cqe_fp_type));
> +                       /* sanity check */
> +                       if (fp->disable_tpa &&
> +                           (CQE_TYPE_START(cqe_fp_type) ||
> +                            CQE_TYPE_STOP(cqe_fp_type)))
> +                               BNX2X_ERR("START/STOP packet while "
> +                                         "disable_tpa type %x\n",
> +                                         CQE_TYPE(cqe_fp_type));
>  #endif
> 
> -                               if (CQE_TYPE_START(cqe_fp_type)) {
> -                                       u16 queue = cqe_fp->queue_index;
> -                                       DP(NETIF_MSG_RX_STATUS,
> -                                          "calling tpa_start on queue %d\n",
> -                                          queue);
> -
> -                                       bnx2x_tpa_start(fp, queue, skb,
> -                                                       bd_cons, bd_prod,
> -                                                       cqe_fp);
> -
> -                                       /* Set Toeplitz hash for LRO skb */
> -                                       bnx2x_set_skb_rxhash(bp, cqe, skb);
> -
> -                                       goto next_rx;
> +                       if (CQE_TYPE_START(cqe_fp_type)) {
> +                               u16 queue = cqe_fp->queue_index;
> +                               DP(NETIF_MSG_RX_STATUS,
> +                                  "calling tpa_start on queue %d\n",
> +                                  queue);
> 
> -                               } else {
> -                                       u16 queue =
> -                                               cqe->end_agg_cqe.queue_index;
> -                                       DP(NETIF_MSG_RX_STATUS,
> -                                          "calling tpa_stop on queue %d\n",
> -                                          queue);
> +                               bnx2x_tpa_start(fp, queue,
> +                                               bd_cons, bd_prod,
> +                                               cqe_fp);
> +                               goto next_rx;
> +                       } else {
> +                               u16 queue =
> +                                       cqe->end_agg_cqe.queue_index;
> +                               DP(NETIF_MSG_RX_STATUS,
> +                                  "calling tpa_stop on queue %d\n",
> +                                  queue);
> 
> -                                       bnx2x_tpa_stop(bp, fp, queue,
> -                                                      &cqe->end_agg_cqe,
> -                                                      comp_ring_cons);
> +                               bnx2x_tpa_stop(bp, fp, queue,
> +                                              &cqe->end_agg_cqe,
> +                                              comp_ring_cons);
>  #ifdef BNX2X_STOP_ON_ERROR
> -                                       if (bp->panic)
> -                                               return 0;
> +                               if (bp->panic)
> +                                       return 0;
>  #endif
> 
> -                                       bnx2x_update_sge_prod(fp, cqe_fp);
> -                                       goto next_cqe;
> -                               }
> +                               bnx2x_update_sge_prod(fp, cqe_fp);
> +                               goto next_cqe;
>                         }
> -                       /* non TPA */
> -                       len = le16_to_cpu(cqe_fp->pkt_len);
> -                       pad = cqe_fp->placement_offset;
> -                       dma_sync_single_for_cpu(&bp->pdev->dev,
> +               }
> +               /* non TPA */
> +               len = le16_to_cpu(cqe_fp->pkt_len);
> +               pad = cqe_fp->placement_offset;
> +               dma_sync_single_for_cpu(&bp->pdev->dev,
>                                         dma_unmap_addr(rx_buf, mapping),
> -                                                      pad + RX_COPY_THRESH,
> -                                                      DMA_FROM_DEVICE);
> -                       prefetch(((char *)(skb)) + L1_CACHE_BYTES);
> +                                       pad + RX_COPY_THRESH,
> +                                       DMA_FROM_DEVICE);
> +               pad += NET_SKB_PAD;
> +               prefetch(data + pad); /* speedup eth_type_trans() */
> +               /* is this an error packet? */
> +               if (unlikely(cqe_fp_flags & ETH_RX_ERROR_FALGS)) {
> +                       DP(NETIF_MSG_RX_ERR,
> +                          "ERROR  flags %x  rx packet %u\n",
> +                          cqe_fp_flags, sw_comp_cons);
> +                       fp->eth_q_stats.rx_err_discard_pkt++;
> +                       goto reuse_rx;
> +               }
> 
> -                       /* is this an error packet? */
> -                       if (unlikely(cqe_fp_flags & ETH_RX_ERROR_FALGS)) {
> +               /* Since we don't have a jumbo ring
> +                * copy small packets if mtu > 1500
> +                */
> +               if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
> +                   (len <= RX_COPY_THRESH)) {
> +                       skb = netdev_alloc_skb_ip_align(bp->dev, len);
> +                       if (skb == NULL) {
>                                 DP(NETIF_MSG_RX_ERR,
> -                                  "ERROR  flags %x  rx packet %u\n",
> -                                  cqe_fp_flags, sw_comp_cons);
> -                               fp->eth_q_stats.rx_err_discard_pkt++;
> +                                  "ERROR  packet dropped because of alloc failure\n");
> +                               fp->eth_q_stats.rx_skb_alloc_failed++;
>                                 goto reuse_rx;
>                         }
> -
> -                       /* Since we don't have a jumbo ring
> -                        * copy small packets if mtu > 1500
> -                        */
> -                       if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
> -                           (len <= RX_COPY_THRESH)) {
> -                               struct sk_buff *new_skb;
> -
> -                               new_skb = netdev_alloc_skb(bp->dev, len + pad);
> -                               if (new_skb == NULL) {
> -                                       DP(NETIF_MSG_RX_ERR,
> -                                          "ERROR  packet dropped "
> -                                          "because of alloc failure\n");
> -                                       fp->eth_q_stats.rx_skb_alloc_failed++;
> -                                       goto reuse_rx;
> -                               }
> -
> -                               /* aligned copy */
> -                               skb_copy_from_linear_data_offset(skb, pad,
> -                                                   new_skb->data + pad, len);
> -                               skb_reserve(new_skb, pad);
> -                               skb_put(new_skb, len);
> -
> -                               bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
> -
> -                               skb = new_skb;
> -
> -                       } else
> -                       if (likely(bnx2x_alloc_rx_skb(bp, fp, bd_prod) == 0)) {
> +                       memcpy(skb->data, data + pad, len);
> +                       bnx2x_reuse_rx_data(fp, bd_cons, bd_prod);
> +               } else {
> +                       if (likely(bnx2x_alloc_rx_data(bp, fp, bd_prod) == 0)) {
>                                 dma_unmap_single(&bp->pdev->dev,
> -                                       dma_unmap_addr(rx_buf, mapping),
> +                                                dma_unmap_addr(rx_buf, mapping),
>                                                  fp->rx_buf_size,
>                                                  DMA_FROM_DEVICE);
> +                               skb = build_skb(data);
> +                               if (unlikely(!skb)) {
> +                                       kfree(data);
> +                                       fp->eth_q_stats.rx_skb_alloc_failed++;
> +                                       goto next_rx;
> +                               }
>                                 skb_reserve(skb, pad);
> -                               skb_put(skb, len);
> -
>                         } else {
>                                 DP(NETIF_MSG_RX_ERR,
>                                    "ERROR  packet dropped because "
>                                    "of alloc failure\n");
>                                 fp->eth_q_stats.rx_skb_alloc_failed++;
>  reuse_rx:
> -                               bnx2x_reuse_rx_skb(fp, bd_cons, bd_prod);
> +                               bnx2x_reuse_rx_data(fp, bd_cons, bd_prod);
>                                 goto next_rx;
>                         }
> 
> +                       skb_put(skb, len);
>                         skb->protocol = eth_type_trans(skb, bp->dev);
> 
>                         /* Set Toeplitz hash for a none-LRO skb */
> -                       bnx2x_set_skb_rxhash(bp, cqe, skb);
> +                       skb->rxhash = bnx2x_get_rxhash(bp, cqe_fp);
> 
>                         skb_checksum_none_assert(skb);
> 
> @@ -767,7 +749,7 @@ reuse_rx:
> 
> 
>  next_rx:
> -               rx_buf->skb = NULL;
> +               rx_buf->data = NULL;
> 
>                 bd_cons = NEXT_RX_IDX(bd_cons);
>                 bd_prod = NEXT_RX_IDX(bd_prod);
> @@ -1013,9 +995,9 @@ void bnx2x_init_rx_rings(struct bnx2x *bp)
>                                 struct sw_rx_bd *first_buf =
>                                         &tpa_info->first_buf;
> 
> -                               first_buf->skb = netdev_alloc_skb(bp->dev,
> -                                                      fp->rx_buf_size);
> -                               if (!first_buf->skb) {
> +                               first_buf->data = kmalloc(fp->rx_buf_size + NET_SKB_PAD,
> +                                                         GFP_ATOMIC);
> +                               if (!first_buf->data) {
>                                         BNX2X_ERR("Failed to allocate TPA "
>                                                   "skb pool for queue[%d] - "
>                                                   "disabling TPA on this "
> @@ -1118,16 +1100,16 @@ static void bnx2x_free_rx_bds(struct bnx2x_fastpath *fp)
> 
>         for (i = 0; i < NUM_RX_BD; i++) {
>                 struct sw_rx_bd *rx_buf = &fp->rx_buf_ring[i];
> -               struct sk_buff *skb = rx_buf->skb;
> +               u8 *data = rx_buf->data;
> 
> -               if (skb == NULL)
> +               if (data == NULL)
>                         continue;
>                 dma_unmap_single(&bp->pdev->dev,
>                                  dma_unmap_addr(rx_buf, mapping),
>                                  fp->rx_buf_size, DMA_FROM_DEVICE);
> 
> -               rx_buf->skb = NULL;
> -               dev_kfree_skb(skb);
> +               rx_buf->data = NULL;
> +               kfree(data);
>         }
>  }
> 
> @@ -1509,6 +1491,7 @@ static inline void bnx2x_set_rx_buf_size(struct bnx2x *bp)
> 
>         for_each_queue(bp, i) {
>                 struct bnx2x_fastpath *fp = &bp->fp[i];
> +               u32 mtu;
> 
>                 /* Always use a mini-jumbo MTU for the FCoE L2 ring */
>                 if (IS_FCOE_IDX(i))
> @@ -1518,13 +1501,15 @@ static inline void bnx2x_set_rx_buf_size(struct bnx2x *bp)
>                          * IP_HEADER_ALIGNMENT_PADDING to prevent a buffer
>                          * overrun attack.
>                          */
> -                       fp->rx_buf_size =
> -                               BNX2X_FCOE_MINI_JUMBO_MTU + ETH_OVREHEAD +
> -                               BNX2X_FW_RX_ALIGN + IP_HEADER_ALIGNMENT_PADDING;
> +                       mtu = BNX2X_FCOE_MINI_JUMBO_MTU;
>                 else
> -                       fp->rx_buf_size =
> -                               bp->dev->mtu + ETH_OVREHEAD +
> -                               BNX2X_FW_RX_ALIGN + IP_HEADER_ALIGNMENT_PADDING;
> +                       mtu = bp->dev->mtu;
> +               fp->rx_buf_size = BNX2X_FW_RX_ALIGN_START +
> +                                 IP_HEADER_ALIGNMENT_PADDING +
> +                                 ETH_OVREHEAD +
> +                                 mtu +
> +                                 BNX2X_FW_RX_ALIGN_END;
> +               /* Note : rx_buf_size doesnt take into account NET_SKB_PAD */
>         }
>  }
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> index 260226d..41eb17e 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
> @@ -910,26 +910,27 @@ static inline int bnx2x_alloc_rx_sge(struct bnx2x *bp,
>         return 0;
>  }
> 
> -static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
> -                                    struct bnx2x_fastpath *fp, u16 index)
> +static inline int bnx2x_alloc_rx_data(struct bnx2x *bp,
> +                                     struct bnx2x_fastpath *fp, u16 index)
>  {
> -       struct sk_buff *skb;
> +       u8 *data;
>         struct sw_rx_bd *rx_buf = &fp->rx_buf_ring[index];
>         struct eth_rx_bd *rx_bd = &fp->rx_desc_ring[index];
>         dma_addr_t mapping;
> 
> -       skb = netdev_alloc_skb(bp->dev, fp->rx_buf_size);
> -       if (unlikely(skb == NULL))
> +       data = kmalloc(fp->rx_buf_size + NET_SKB_PAD, GFP_ATOMIC);
> +       if (unlikely(data == NULL))
>                 return -ENOMEM;
> 
> -       mapping = dma_map_single(&bp->pdev->dev, skb->data, fp->rx_buf_size,
> +       mapping = dma_map_single(&bp->pdev->dev, data + NET_SKB_PAD,
> +                                fp->rx_buf_size,
>                                  DMA_FROM_DEVICE);
>         if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
> -               dev_kfree_skb_any(skb);
> +               kfree(data);
>                 return -ENOMEM;
>         }
> 
> -       rx_buf->skb = skb;
> +       rx_buf->data = data;
>         dma_unmap_addr_set(rx_buf, mapping, mapping);
> 
>         rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
> @@ -938,12 +939,12 @@ static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
>         return 0;
>  }
> 
> -/* note that we are not allocating a new skb,
> +/* note that we are not allocating a new buffer,
>   * we are just moving one from cons to prod
>   * we are not creating a new mapping,
>   * so there is no need to check for dma_mapping_error().
>   */
> -static inline void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
> +static inline void bnx2x_reuse_rx_data(struct bnx2x_fastpath *fp,
>                                       u16 cons, u16 prod)
>  {
>         struct sw_rx_bd *cons_rx_buf = &fp->rx_buf_ring[cons];
> @@ -953,7 +954,7 @@ static inline void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
> 
>         dma_unmap_addr_set(prod_rx_buf, mapping,
>                            dma_unmap_addr(cons_rx_buf, mapping));
> -       prod_rx_buf->skb = cons_rx_buf->skb;
> +       prod_rx_buf->data = cons_rx_buf->data;
>         *prod_bd = *cons_bd;
>  }
> 
> @@ -1029,9 +1030,9 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
>         for (i = 0; i < last; i++) {
>                 struct bnx2x_agg_info *tpa_info = &fp->tpa_info[i];
>                 struct sw_rx_bd *first_buf = &tpa_info->first_buf;
> -               struct sk_buff *skb = first_buf->skb;
> +               u8 *data = first_buf->data;
> 
> -               if (skb == NULL) {
> +               if (data == NULL) {
>                         DP(NETIF_MSG_IFDOWN, "tpa bin %d empty on free\n", i);
>                         continue;
>                 }
> @@ -1039,8 +1040,8 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
>                         dma_unmap_single(&bp->pdev->dev,
>                                          dma_unmap_addr(first_buf, mapping),
>                                          fp->rx_buf_size, DMA_FROM_DEVICE);
> -               dev_kfree_skb(skb);
> -               first_buf->skb = NULL;
> +               kfree(data);
> +               first_buf->data = NULL;
>         }
>  }
> 
> @@ -1148,7 +1149,7 @@ static inline int bnx2x_alloc_rx_bds(struct bnx2x_fastpath *fp,
>          * fp->eth_q_stats.rx_skb_alloc_failed = 0
>          */
>         for (i = 0; i < rx_ring_size; i++) {
> -               if (bnx2x_alloc_rx_skb(bp, fp, ring_prod) < 0) {
> +               if (bnx2x_alloc_rx_data(bp, fp, ring_prod) < 0) {
>                         fp->eth_q_stats.rx_skb_alloc_failed++;
>                         continue;
>                 }
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
> index f6402fa..ec31871 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
> @@ -1740,6 +1740,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
>         struct sw_rx_bd *rx_buf;
>         u16 len;
>         int rc = -ENODEV;
> +       u8 *data;
> 
>         /* check the loopback mode */
>         switch (loopback_mode) {
> @@ -1865,10 +1866,9 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
>         dma_sync_single_for_cpu(&bp->pdev->dev,
>                                    dma_unmap_addr(rx_buf, mapping),
>                                    fp_rx->rx_buf_size, DMA_FROM_DEVICE);
> -       skb = rx_buf->skb;
> -       skb_reserve(skb, cqe->fast_path_cqe.placement_offset);
> +       data = rx_buf->data + NET_SKB_PAD + cqe->fast_path_cqe.placement_offset;
>         for (i = ETH_HLEN; i < pkt_size; i++)
> -               if (*(skb->data + i) != (unsigned char) (i & 0xff))
> +               if (*(data + i) != (unsigned char) (i & 0xff))
>                         goto test_loopback_rx_exit;
> 
>         rc = 0;
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> index 33ff60d..80fb9b5 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> @@ -2789,8 +2789,8 @@ static void bnx2x_pf_rx_q_prep(struct bnx2x *bp,
>         /* This should be a maximum number of data bytes that may be
>          * placed on the BD (not including paddings).
>          */
> -       rxq_init->buf_sz = fp->rx_buf_size - BNX2X_FW_RX_ALIGN -
> -               IP_HEADER_ALIGNMENT_PADDING;
> +       rxq_init->buf_sz = fp->rx_buf_size - BNX2X_FW_RX_ALIGN_START -
> +               BNX2X_FW_RX_ALIGN_END - IP_HEADER_ALIGNMENT_PADDING;
> 
>         rxq_init->cl_qzone_id = fp->cl_qzone_id;
>         rxq_init->tpa_agg_sz = tpa_agg_size;
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox