Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
From: Greg Rose @ 2011-11-18  0:32 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Roopa Prabhu, netdev@vger.kernel.org, davem@davemloft.net,
	chrisw@redhat.com, sri@us.ibm.com, dragos.tatulea@gmail.com,
	kvm@vger.kernel.org, arnd@arndb.de, mst@redhat.com,
	mchan@broadcom.com, dwang2@cisco.com, shemminger@vyatta.com,
	eric.dumazet@gmail.com, kaber@trash.net, benve@cisco.com
In-Reply-To: <1321575301.2749.51.camel@bwh-desktop>


On 11/17/2011 4:15 PM, Ben Hutchings wrote:
> Sorry to come to this rather late.
>
> On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
> [...]
>> v2 ->  v3
>> - Moved set and get filter ops from rtnl_link_ops to netdev_ops
>> - Support for SRIOV VFs.
>>          [Note: The get filters msg (in the way current get rtnetlink handles
>>          it) might get too big for SRIOV vfs. This patch follows existing sriov
>>          vf get code and tries to accomodate filters for all VF's in a PF.
>>          And for the SRIOV case I have only tested the fact that the VF
>>          arguments are getting delivered to rtnetlink correctly. The code
>>          follows existing sriov vf handling code so rest of it should work fine]
> [...]
>
> This is already broken for large numbers of VFs, and increasing the
> amount of information per VF is going to make the situation worse.  I am
> no netlink expert but I think that the current approach of bundling all
> information about an interface in a single message may not be
> sustainable.
>
> Also, I'm unclear on why this interface is to be used to set filtering
> for the (PF) net device as well as for related VFs.  Doesn't that
> duplicate the functionality of ndo_set_rx_mode and
> ndo_vlan_rx_{add,kill}_vid?

Functionally yes but contextually no.  This allows the PF driver to know 
that it is setting these filters in the context of the existence of VFs, 
allowing it to take appropriate action.  The other two functions may be 
called without the presence of SR-IOV enablement and the existence of VFs.

Anyway, that's why I asked Roopa to add that capability.

- Greg

>
> Ben.
>

^ permalink raw reply

* Re: [PATCH] bonding: Don't allow mode change via sysfs with slaves present
From: David Miller @ 2011-11-18  0:32 UTC (permalink / raw)
  To: nicolas.2p.debian; +Cc: vfalico, netdev, andy, fubar
In-Reply-To: <4EC58C89.7020906@gmail.com>

From: Nicolas de Pesloüan <nicolas.2p.debian@gmail.com>
Date: Thu, 17 Nov 2011 23:36:57 +0100

> Le 17/11/2011 22:04, David Miller a écrit :
>> From: Veaceslav Falico<vfalico@redhat.com>
>> Date: Tue, 15 Nov 2011 17:44:42 +0100
>>
>>> When changing mode via bonding's sysfs, the slaves are not initialized
>>> correctly. Forbid to change modes with slaves present to ensure that
>>> every
>>> slave is initialized correctly via bond_enslave().
>>>
>>> Signed-off-by: Veaceslav Falico<vfalico@redhat.com>
>>
>> Lots of discussion... what is the final verdict on this patch bonding
>> folks?
> 
> Now looks good to me, and as all others participants agreed at the
> beginning, I think it's good for everyone.

Great, applied, thanks everyone.

^ permalink raw reply

* Re: root_lock vs. device's TX lock
From: Tom Herbert @ 2011-11-18  0:35 UTC (permalink / raw)
  To: Andy Fleming; +Cc: Dave Taht, Eric Dumazet, Linux Netdev List
In-Reply-To: <CAKWjMd7xvepSb2sMf4X4U4cjRYizq=tP2TTHhYaozuyBeSW4AQ@mail.gmail.com>

> Actually, I'm interested in circumventing *both* locks. Our SoC has
> some quite-versatile queueing infrastructure, such that (for many
> queueing setups) we can do all of the queueing in hardware, using
> per-cpu access portals. By hacking around the qdisc lock, and using a
> tx queue per core, we were able to achieve a significant speedup.
>
This was actually one of the motivations for my question.  If we have
a one TX queue per core, and and use a trivial mq aware qdisc for
instance, the locking becomes mostly overhead.  I don't mind taking a
lock once per TX, but right now were taking three! (root lock twice,
and device lock once).

Even without one queue per TX, I think the overhead savings may still
be present.  Eric, I realize that a point of dropping the root lock in
sch_direct_xmit is to possibly allow queuing to to qdisc and device
xmit in parallel, but if you're using a trivial qdisc then the time in
qdisc may be << time in device xmit, so the overhead of locking could
mitigate the gains in parallelism.  At the very least, this benefit is
hugely variable depending on the qdisc used.

Tom

^ permalink raw reply

* [PATCH net-next] sky2: fix hang in napi_disable
From: Stephen Hemminger @ 2011-11-18  0:37 UTC (permalink / raw)
  To: Sven Joachim; +Cc: davem, netdev
In-Reply-To: <87r517w2z0.fsf@turtle.gmx.de>

If IRQ was never initialized, then calling napi_disable() would hang.
Add more bookkeeping to track whether IRQ was ever initialized.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/drivers/net/ethernet/marvell/sky2.c	2011-11-17 11:10:35.938076778 -0800
+++ b/drivers/net/ethernet/marvell/sky2.c	2011-11-17 11:15:49.527882932 -0800
@@ -1723,6 +1723,8 @@ static int sky2_setup_irq(struct sky2_hw
 	if (err)
 		dev_err(&pdev->dev, "cannot assign irq %d\n", pdev->irq);
 	else {
+		hw->flags |= SKY2_HW_IRQ_SETUP;
+
 		napi_enable(&hw->napi);
 		sky2_write32(hw, B0_IMSK, Y2_IS_BASE);
 		sky2_read32(hw, B0_IMSK);
@@ -2120,6 +2122,7 @@ static int sky2_close(struct net_device
 
 		napi_disable(&hw->napi);
 		free_irq(hw->pdev->irq, hw);
+		hw->flags &= ~SKY2_HW_IRQ_SETUP;
 	} else {
 		u32 imask;
 
@@ -3423,12 +3426,13 @@ static void sky2_all_down(struct sky2_hw
 {
 	int i;
 
-	sky2_read32(hw, B0_IMSK);
-	sky2_write32(hw, B0_IMSK, 0);
+	if (hw->flags & SKY2_HW_IRQ_SETUP) {
+		sky2_read32(hw, B0_IMSK);
+		sky2_write32(hw, B0_IMSK, 0);
 
-	if (hw->ports > 1 || netif_running(hw->dev[0]))
 		synchronize_irq(hw->pdev->irq);
-	napi_disable(&hw->napi);
+		napi_disable(&hw->napi);
+	}
 
 	for (i = 0; i < hw->ports; i++) {
 		struct net_device *dev = hw->dev[i];
@@ -3445,7 +3449,7 @@ static void sky2_all_down(struct sky2_hw
 
 static void sky2_all_up(struct sky2_hw *hw)
 {
-	u32 imask = 0;
+	u32 imask = Y2_IS_BASE;
 	int i;
 
 	for (i = 0; i < hw->ports; i++) {
@@ -3461,8 +3465,7 @@ static void sky2_all_up(struct sky2_hw *
 		netif_wake_queue(dev);
 	}
 
-	if (imask || hw->ports > 1) {
-		imask |= Y2_IS_BASE;
+	if (hw->flags & SKY2_HW_IRQ_SETUP) {
 		sky2_write32(hw, B0_IMSK, imask);
 		sky2_read32(hw, B0_IMSK);
 		sky2_read32(hw, B0_Y2_SP_LISR);
--- a/drivers/net/ethernet/marvell/sky2.h	2011-11-16 15:15:40.964280527 -0800
+++ b/drivers/net/ethernet/marvell/sky2.h	2011-11-17 11:13:00.154858718 -0800
@@ -2287,6 +2287,7 @@ struct sky2_hw {
 #define SKY2_HW_RSS_BROKEN	0x00000100
 #define SKY2_HW_VLAN_BROKEN     0x00000200
 #define SKY2_HW_RSS_CHKSUM	0x00000400	/* RSS requires chksum */
+#define SKY2_HW_IRQ_SETUP	0x00000800
 
 	u8	     	     chip_id;
 	u8		     chip_rev;

^ permalink raw reply

* [PATCH net-next] sky2: enforce minimum ring size
From: Stephen Hemminger @ 2011-11-18  0:37 UTC (permalink / raw)
  To: Sven Joachim; +Cc: davem, netdev
In-Reply-To: <874ny2bgqa.fsf@turtle.gmx.de>

The hardware has a restriction that the minimum ring size possible
is 128. The number of elements used is controlled by tx_pending and
the overall number of elements in the ring tx_ring_size, therefore it
is okay to limit the number of elements in use to a small value (63)
but still provide a bigger ring.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/drivers/net/ethernet/marvell/sky2.c	2011-11-17 16:25:40.079898621 -0800
+++ b/drivers/net/ethernet/marvell/sky2.c	2011-11-17 16:33:19.088438709 -0800
@@ -4092,6 +4092,16 @@ static int sky2_set_coalesce(struct net_
 	return 0;
 }
 
+/*
+ * Hardware is limited to min of 128 and max of 2048 for ring size
+ * and  rounded up to next power of two
+ * to avoid division in modulus calclation
+ */
+static unsigned long roundup_ring_size(unsigned long pending)
+{
+	return max(128ul, roundup_pow_of_two(pending+1));
+}
+
 static void sky2_get_ringparam(struct net_device *dev,
 			       struct ethtool_ringparam *ering)
 {
@@ -4119,7 +4129,7 @@ static int sky2_set_ringparam(struct net
 
 	sky2->rx_pending = ering->rx_pending;
 	sky2->tx_pending = ering->tx_pending;
-	sky2->tx_ring_size = roundup_pow_of_two(sky2->tx_pending+1);
+	sky2->tx_ring_size = roundup_ring_size(sky2->tx_pending);
 
 	return sky2_reattach(dev);
 }
@@ -4714,7 +4724,7 @@ static __devinit struct net_device *sky2
 	spin_lock_init(&sky2->phy_lock);
 
 	sky2->tx_pending = TX_DEF_PENDING;
-	sky2->tx_ring_size = roundup_pow_of_two(TX_DEF_PENDING+1);
+	sky2->tx_ring_size = roundup_ring_size(TX_DEF_PENDING);
 	sky2->rx_pending = RX_DEF_PENDING;
 
 	hw->dev[port] = dev;

^ permalink raw reply

* Re: [net-next] iproute2: Add new command to IP link to enable/disable VF spoof check
From: Stephen Hemminger @ 2011-11-18  0:40 UTC (permalink / raw)
  To: Greg Rose; +Cc: Jeff Kirsher, davem, netdev, gospo
In-Reply-To: <4EC59B34.4080501@intel.com>

On Thu, 17 Nov 2011 15:39:32 -0800
Greg Rose <gregory.v.rose@intel.com> wrote:

> 
> 
> On 9/25/2011 1:23 AM, Jeff Kirsher wrote:
> > From: Greg Rose <gregory.v.rose@intel.com>
> >
> > Add IP link command parsing for VF spoof checking enable/disable
> >
> > Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

I am limiting patches at this point to those that will build against 3.0.
The 3.1 version of iproute2 has not been released because kernel.org tool
to do the release is still in beta (kup), and they are working on it.

After 3.1 version of iproute2 is released, any fixes that require stuff
from 3.2 will go in.

^ permalink raw reply

* Re: [net-next] iproute2: Add new command to IP link to enable/disable VF spoof check
From: Greg Rose @ 2011-11-18  0:43 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Kirsher, Jeffrey T, davem@davemloft.net, netdev@vger.kernel.org,
	gospo@redhat.com
In-Reply-To: <20111117164044.21b916dd@nehalam.linuxnetplumber.net>

On 11/17/2011 4:40 PM, Stephen Hemminger wrote:
> On Thu, 17 Nov 2011 15:39:32 -0800
> Greg Rose<gregory.v.rose@intel.com>  wrote:
>
>>
>>
>> On 9/25/2011 1:23 AM, Jeff Kirsher wrote:
>>> From: Greg Rose<gregory.v.rose@intel.com>
>>>
>>> Add IP link command parsing for VF spoof checking enable/disable
>>>
>>> Signed-off-by: Greg Rose<gregory.v.rose@intel.com>
>>> Signed-off-by: Jeff Kirsher<jeffrey.t.kirsher@intel.com>
>
> I am limiting patches at this point to those that will build against 3.0.
> The 3.1 version of iproute2 has not been released because kernel.org tool
> to do the release is still in beta (kup), and they are working on it.
>
> After 3.1 version of iproute2 is released, any fixes that require stuff
> from 3.2 will go in.

Great, sounds good.

Would you prefer that I Jeff (Kirsher) and I repost the patch at that time?

thanks,

- Greg

>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
From: Ben Hutchings @ 2011-11-18  0:44 UTC (permalink / raw)
  To: Greg Rose
  Cc: Roopa Prabhu, netdev@vger.kernel.org, davem@davemloft.net,
	chrisw@redhat.com, sri@us.ibm.com, dragos.tatulea@gmail.com,
	kvm@vger.kernel.org, arnd@arndb.de, mst@redhat.com,
	mchan@broadcom.com, dwang2@cisco.com, shemminger@vyatta.com,
	eric.dumazet@gmail.com, kaber@trash.net, benve@cisco.com
In-Reply-To: <4EC5A785.3060108@intel.com>

On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote:
> On 11/17/2011 4:15 PM, Ben Hutchings wrote:
> > Sorry to come to this rather late.
> >
> > On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
> > [...]
> >> v2 ->  v3
> >> - Moved set and get filter ops from rtnl_link_ops to netdev_ops
> >> - Support for SRIOV VFs.
> >>          [Note: The get filters msg (in the way current get rtnetlink handles
> >>          it) might get too big for SRIOV vfs. This patch follows existing sriov
> >>          vf get code and tries to accomodate filters for all VF's in a PF.
> >>          And for the SRIOV case I have only tested the fact that the VF
> >>          arguments are getting delivered to rtnetlink correctly. The code
> >>          follows existing sriov vf handling code so rest of it should work fine]
> > [...]
> >
> > This is already broken for large numbers of VFs, and increasing the
> > amount of information per VF is going to make the situation worse.  I am
> > no netlink expert but I think that the current approach of bundling all
> > information about an interface in a single message may not be
> > sustainable.
> >
> > Also, I'm unclear on why this interface is to be used to set filtering
> > for the (PF) net device as well as for related VFs.  Doesn't that
> > duplicate the functionality of ndo_set_rx_mode and
> > ndo_vlan_rx_{add,kill}_vid?
> 
> Functionally yes but contextually no.  This allows the PF driver to know 
> that it is setting these filters in the context of the existence of VFs, 
> allowing it to take appropriate action.  The other two functions may be 
> called without the presence of SR-IOV enablement and the existence of VFs.
> 
> Anyway, that's why I asked Roopa to add that capability.

I don't follow.  The PF driver already knows whether it has enabled VFs.

How do filters set this way interact with filters set through the
existing operations?  Should they override promiscuous mode?  None of
this has been specified.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [net-next] iproute2: Add new command to IP link to enable/disable VF spoof check
From: Stephen Hemminger @ 2011-11-18  0:45 UTC (permalink / raw)
  To: Greg Rose
  Cc: Kirsher, Jeffrey T, davem@davemloft.net, netdev@vger.kernel.org,
	gospo@redhat.com
In-Reply-To: <4EC5AA1B.4040009@intel.com>

On Thu, 17 Nov 2011 16:43:07 -0800
Greg Rose <gregory.v.rose@intel.com> wrote:

> On 11/17/2011 4:40 PM, Stephen Hemminger wrote:
> > On Thu, 17 Nov 2011 15:39:32 -0800
> > Greg Rose<gregory.v.rose@intel.com>  wrote:
> >
> >>
> >>
> >> On 9/25/2011 1:23 AM, Jeff Kirsher wrote:
> >>> From: Greg Rose<gregory.v.rose@intel.com>
> >>>
> >>> Add IP link command parsing for VF spoof checking enable/disable
> >>>
> >>> Signed-off-by: Greg Rose<gregory.v.rose@intel.com>
> >>> Signed-off-by: Jeff Kirsher<jeffrey.t.kirsher@intel.com>
> >
> > I am limiting patches at this point to those that will build against 3.0.
> > The 3.1 version of iproute2 has not been released because kernel.org tool
> > to do the release is still in beta (kup), and they are working on it.
> >
> > After 3.1 version of iproute2 is released, any fixes that require stuff
> > from 3.2 will go in.
> 
> Great, sounds good.
> 
> Would you prefer that I Jeff (Kirsher) and I repost the patch at that time?

not required, if they show up in patchwork

^ permalink raw reply

* Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
From: Ben Hutchings @ 2011-11-18  1:08 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: rusty, mst, netdev, kvm, davem, virtualization
In-Reply-To: <20111111130420.9878.19729.sendpatchset@krkumar2.in.ibm.com>

On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
> Changes for multiqueue virtio_net driver.
[...]
> @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int cpu;
> -	unsigned int start;
>  
>  	for_each_possible_cpu(cpu) {
> -		struct virtnet_stats __percpu *stats
> -			= per_cpu_ptr(vi->stats, cpu);
> -		u64 tpackets, tbytes, rpackets, rbytes;
> -
> -		do {
> -			start = u64_stats_fetch_begin(&stats->syncp);
> -			tpackets = stats->tx_packets;
> -			tbytes   = stats->tx_bytes;
> -			rpackets = stats->rx_packets;
> -			rbytes   = stats->rx_bytes;
> -		} while (u64_stats_fetch_retry(&stats->syncp, start));
> -
> -		tot->rx_packets += rpackets;
> -		tot->tx_packets += tpackets;
> -		tot->rx_bytes   += rbytes;
> -		tot->tx_bytes   += tbytes;
> +		int qpair;
> +
> +		for (qpair = 0; qpair < vi->num_queue_pairs; qpair++) {
> +			struct virtnet_send_stats __percpu *tx_stat;
> +			struct virtnet_recv_stats __percpu *rx_stat;

While you're at it, you can drop the per-CPU stats and make them only
per-queue.  There is unlikely to be any benefit in maintaining them
per-CPU while receive and transmit processing is serialised per-queue.

[...]
> +static int invoke_find_vqs(struct virtnet_info *vi)
> +{
> +	vq_callback_t **callbacks;
> +	struct virtqueue **vqs;
> +	int ret = -ENOMEM;
> +	int i, total_vqs;
> +	char **names;
> +
> +	/*
> +	 * We expect 1 RX virtqueue followed by 1 TX virtqueue, followed
> +	 * by the same 'vi->num_queue_pairs-1' more times, and optionally
> +	 * one control virtqueue.
> +	 */
> +	total_vqs = vi->num_queue_pairs * 2 +
> +		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
> +	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
> +	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);
> +	if (!vqs || !callbacks || !names)
> +		goto err;
> +
> +	/* Allocate/initialize parameters for recv virtqueues */
> +	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
> +		callbacks[i] = skb_recv_done;
> +		names[i] = kasprintf(GFP_KERNEL, "input.%d", i / 2);
> +		if (!names[i])
> +			goto err;
> +	}
> +
> +	/* Allocate/initialize parameters for send virtqueues */
> +	for (i = 1; i < vi->num_queue_pairs * 2; i += 2) {
> +		callbacks[i] = skb_xmit_done;
> +		names[i] = kasprintf(GFP_KERNEL, "output.%d", i / 2);
> +		if (!names[i])
> +			goto err;
> +	}
[...]

The RX and TX interrupt names for a multiqueue device should follow the
formats "<device>-rx-<index>" and "<device>-tx-<index>".

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* [PATCH] iwl-debug: Shrink object by using dev_err and deduplicating formats
From: Joe Perches @ 2011-11-18  1:46 UTC (permalink / raw)
  To: Wey-Yi Guy, Intel Linux Wireless
  Cc: John W. Linville, linux-wireless, netdev, linux-kernel

Using dev_err instead of dev_printk(KERN_ERR uses fewer
arguments and is a bit smaller.

Deduplicating formats used by IWL_DEBUG_QUIET_RFKILL also
makes the object a bit smaller.
 
Neatened the macros, used ##__VA_ARGS__.

$ size drivers/net/wireless/iwlwifi/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 462652	   8646	  92576	 563874	  89aa2	drivers/net/wireless/iwlwifi/built-in.o.new
 467557	   8646	  92592	 568795	  8addb	drivers/net/wireless/iwlwifi/built-in.o.old

Signed-off-by: Joe Perches <joe@perches.com>
---
 drivers/net/wireless/iwlwifi/iwl-debug.h |   37 +++++++++++++++++-------------
 1 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/drivers/net/wireless/iwlwifi/iwl-debug.h b/drivers/net/wireless/iwlwifi/iwl-debug.h
index 40ef97b..44a7bdd 100644
--- a/drivers/net/wireless/iwlwifi/iwl-debug.h
+++ b/drivers/net/wireless/iwlwifi/iwl-debug.h
@@ -47,20 +47,21 @@ do {									\
 } while (0)
 
 #ifdef CONFIG_IWLWIFI_DEBUG
-#define IWL_DEBUG(m, level, fmt, args...)				\
+#define IWL_DEBUG(m, level, fmt, ...)					\
 do {									\
 	if (iwl_get_debug_level((m)->shrd) & (level))			\
-		dev_printk(KERN_ERR, bus(m)->dev,			\
-			 "%c %s " fmt, in_interrupt() ? 'I' : 'U',	\
-			__func__ , ## args);				\
+		dev_err(bus(m)->dev, "%c %s " fmt,			\
+			in_interrupt() ? 'I' : 'U', __func__,		\
+			##__VA_ARGS__);					\
 } while (0)
 
-#define IWL_DEBUG_LIMIT(m, level, fmt, args...)				\
+#define IWL_DEBUG_LIMIT(m, level, fmt, ...)				\
 do {									\
-	if (iwl_get_debug_level((m)->shrd) & (level) && net_ratelimit())\
-		dev_printk(KERN_ERR, bus(m)->dev,			\
-			"%c %s " fmt, in_interrupt() ? 'I' : 'U',	\
-			 __func__ , ## args);				\
+	if (iwl_get_debug_level((m)->shrd) & (level) &&			\
+	    net_ratelimit())						\
+		dev_err(bus(m)->dev, "%c %s " fmt,			\
+			in_interrupt() ? 'I' : 'U', __func__,		\
+			##__VA_ARGS__);					\
 } while (0)
 
 #define iwl_print_hex_dump(m, level, p, len)				\
@@ -70,14 +71,18 @@ do {                                            			\
 			       DUMP_PREFIX_OFFSET, 16, 1, p, len, 1);	\
 } while (0)
 
-#define IWL_DEBUG_QUIET_RFKILL(p, fmt, args...)			\
+#define IWL_DEBUG_QUIET_RFKILL(p, fmt, ...)				\
 do {									\
-	if (!iwl_is_rfkill(p->shrd))				\
-		dev_printk(KERN_ERR, bus(p)->dev, "%c %s " fmt,	\
-		(in_interrupt() ? 'I' : 'U'), __func__ , ##args);	\
-	else if	(iwl_get_debug_level(p->shrd) & IWL_DL_RADIO)	\
-		dev_printk(KERN_ERR, bus(p)->dev, "(RFKILL) %c %s " fmt, \
-		(in_interrupt() ? 'I' : 'U'), __func__ , ##args);	\
+	if (!iwl_is_rfkill(p->shrd))					\
+		dev_err(bus(p)->dev, "%s%c %s " fmt,			\
+			"",						\
+			in_interrupt() ? 'I' : 'U', __func__,		\
+			##__VA_ARGS__);					\
+	else if	(iwl_get_debug_level(p->shrd) & IWL_DL_RADIO)		\
+		dev_err(bus(p)->dev, "%s%c %s " fmt,			\
+			"(RFKILL) ",					\
+			in_interrupt() ? 'I' : 'U', __func__,		\
+			##__VA_ARGS__);					\
 } while (0)
 
 #else

^ permalink raw reply related

* Re: [PATCH net-next] sky2: fix hang in napi_disable
From: David Miller @ 2011-11-18  1:52 UTC (permalink / raw)
  To: shemminger; +Cc: svenjoac, netdev
In-Reply-To: <20111117163735.0a16660b@nehalam.linuxnetplumber.net>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Thu, 17 Nov 2011 16:37:35 -0800

> If IRQ was never initialized, then calling napi_disable() would hang.
> Add more bookkeeping to track whether IRQ was ever initialized.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Stephen, your other 6 patches went into 'net' so you have to target
these fixes there too.

^ permalink raw reply

* Re: [PATCH net-next] sky2: fix hang in napi_disable
From: Stephen Hemminger @ 2011-11-18  2:10 UTC (permalink / raw)
  To: David Miller; +Cc: svenjoac, netdev
In-Reply-To: <20111117.205234.394141688162787655.davem@davemloft.net>


> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Thu, 17 Nov 2011 16:37:35 -0800
> 
> > If IRQ was never initialized, then calling napi_disable() would
> > hang.
> > Add more bookkeeping to track whether IRQ was ever initialized.
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> Stephen, your other 6 patches went into 'net' so you have to target
> these fixes there too.
> 

Do you need me to respin them?

^ permalink raw reply

* Re: [PATCH net-next] sky2: fix hang in napi_disable
From: David Miller @ 2011-11-18  2:15 UTC (permalink / raw)
  To: stephen.hemminger; +Cc: svenjoac, netdev
In-Reply-To: <f582e628-7770-43c5-aa90-51cf74141462@tahiti.vyatta.com>

From: Stephen Hemminger <stephen.hemminger@vyatta.com>
Date: Thu, 17 Nov 2011 18:10:49 -0800 (PST)

> 
>> From: Stephen Hemminger <shemminger@vyatta.com>
>> Date: Thu, 17 Nov 2011 16:37:35 -0800
>> 
>> > If IRQ was never initialized, then calling napi_disable() would
>> > hang.
>> > Add more bookkeeping to track whether IRQ was ever initialized.
>> > 
>> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>> 
>> Stephen, your other 6 patches went into 'net' so you have to target
>> these fixes there too.
>> 
> 
> Do you need me to respin them?

If they apply cleanly to 'net', no.  Do they?

^ permalink raw reply

* [PATCH net-next v2] r8169: Add 64bit statistics
From: Junchang Wang @ 2011-11-18  2:15 UTC (permalink / raw)
  To: nic_swsd, romieu, shemminger, eric.dumazet; +Cc: netdev


Switch to use ndo_get_stats64 to get 64bit statistics.
Two sync entries are used (one for Rx and one for Tx).


Signed-off-by: Junchang Wang <junchangwang@gmail.com>
---
 drivers/net/ethernet/realtek/r8169.c |   61 +++++++++++++++++++++++++++++-----
 1 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index cdf66d6..b9f8240 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -670,6 +670,12 @@ struct rtl8169_counters {
 	__le16	tx_underun;
 };
 
+struct rtl8169_stats {
+	struct u64_stats_sync	syncp;
+	u64			packets;
+	u64			bytes;
+};
+
 struct rtl8169_private {
 	void __iomem *mmio_addr;	/* memory map physical address */
 	struct pci_dev *pci_dev;
@@ -685,6 +691,8 @@ struct rtl8169_private {
 	u32 dirty_tx;
 	struct TxDesc *TxDescArray;	/* 256-aligned Tx descriptor ring */
 	struct RxDesc *RxDescArray;	/* 256-aligned Rx descriptor ring */
+	struct rtl8169_stats rx_stats;
+	struct rtl8169_stats tx_stats;
 	dma_addr_t TxPhyAddr;
 	dma_addr_t RxPhyAddr;
 	void *Rx_databuff[NUM_RX_DESC];	/* Rx data buffers */
@@ -766,7 +774,9 @@ static void rtl_hw_start(struct net_device *dev);
 static int rtl8169_close(struct net_device *dev);
 static void rtl_set_rx_mode(struct net_device *dev);
 static void rtl8169_tx_timeout(struct net_device *dev);
-static struct net_device_stats *rtl8169_get_stats(struct net_device *dev);
+static struct rtnl_link_stats64 *rtl8169_get_stats64(struct net_device *dev,
+						     struct rtnl_link_stats64
+						     *stats);
 static int rtl8169_rx_interrupt(struct net_device *, struct rtl8169_private *,
 				void __iomem *, u32 budget);
 static int rtl8169_change_mtu(struct net_device *dev, int new_mtu);
@@ -3454,7 +3464,7 @@ static void rtl_disable_msi(struct pci_dev *pdev, struct rtl8169_private *tp)
 static const struct net_device_ops rtl8169_netdev_ops = {
 	.ndo_open		= rtl8169_open,
 	.ndo_stop		= rtl8169_close,
-	.ndo_get_stats		= rtl8169_get_stats,
+	.ndo_get_stats64	= rtl8169_get_stats64,
 	.ndo_start_xmit		= rtl8169_start_xmit,
 	.ndo_tx_timeout		= rtl8169_tx_timeout,
 	.ndo_validate_addr	= eth_validate_addr,
@@ -5641,8 +5651,10 @@ static void rtl8169_tx_interrupt(struct net_device *dev,
 		rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
 				     tp->TxDescArray + entry);
 		if (status & LastFrag) {
-			dev->stats.tx_packets++;
-			dev->stats.tx_bytes += tx_skb->skb->len;
+			u64_stats_update_begin(&tp->tx_stats.syncp);
+			tp->tx_stats.packets++;
+			tp->tx_stats.bytes += tx_skb->skb->len;
+			u64_stats_update_end(&tp->tx_stats.syncp);
 			dev_kfree_skb(tx_skb->skb);
 			tx_skb->skb = NULL;
 		}
@@ -5771,8 +5783,10 @@ static int rtl8169_rx_interrupt(struct net_device *dev,
 
 			napi_gro_receive(&tp->napi, skb);
 
-			dev->stats.rx_bytes += pkt_size;
-			dev->stats.rx_packets++;
+			u64_stats_update_begin(&tp->rx_stats.syncp);
+			tp->rx_stats.packets++;
+			tp->rx_stats.bytes += pkt_size;
+			u64_stats_update_end(&tp->rx_stats.syncp);
 		}
 
 		/* Work around for AMD plateform. */
@@ -6034,16 +6048,19 @@ static void rtl_set_rx_mode(struct net_device *dev)
 }
 
 /**
- *  rtl8169_get_stats - Get rtl8169 read/write statistics
+ *  rtl8169_get_stats64 - Get rtl8169 read/write statistics
  *  @dev: The Ethernet Device to get statistics for
  *
  *  Get TX/RX statistics for rtl8169
  */
-static struct net_device_stats *rtl8169_get_stats(struct net_device *dev)
+static struct rtnl_link_stats64 *
+rtl8169_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	void __iomem *ioaddr = tp->mmio_addr;
+	u64 _packets, _bytes;
 	unsigned long flags;
+	unsigned int start;
 
 	if (netif_running(dev)) {
 		spin_lock_irqsave(&tp->lock, flags);
@@ -6051,7 +6068,33 @@ static struct net_device_stats *rtl8169_get_stats(struct net_device *dev)
 		spin_unlock_irqrestore(&tp->lock, flags);
 	}
 
-	return &dev->stats;
+	do {
+		start = u64_stats_fetch_begin_bh(&tp->rx_stats.syncp);
+		_packets = tp->rx_stats.packets;
+		_bytes = tp->rx_stats.bytes;
+	} while (u64_stats_fetch_retry_bh(&tp->rx_stats.syncp, start));
+
+	stats->rx_packets = _packets;
+	stats->rx_bytes	= _bytes;
+
+	do {
+		start = u64_stats_fetch_begin_bh(&tp->tx_stats.syncp);
+		_packets = tp->tx_stats.packets;
+		_bytes= tp->tx_stats.bytes;
+	} while (u64_stats_fetch_retry_bh(&tp->tx_stats.syncp, start));
+
+	stats->tx_packets = _packets;
+	stats->tx_bytes	= _bytes;
+
+	stats->rx_dropped	= dev->stats.rx_dropped;
+	stats->tx_dropped	= dev->stats.tx_dropped;
+	stats->rx_length_errors = dev->stats.rx_length_errors;
+	stats->rx_errors	= dev->stats.rx_errors;
+	stats->rx_crc_errors	= dev->stats.rx_crc_errors;
+	stats->rx_fifo_errors	= dev->stats.rx_fifo_errors;
+	stats->rx_missed_errors = dev->stats.rx_missed_errors;
+
+	return stats;
 }
 
 static void rtl8169_net_suspend(struct net_device *dev)
--

^ permalink raw reply related

* Re: [PATCH net-next] r8169: Add 64bit statistics
From: Junchang Wang @ 2011-11-18  2:24 UTC (permalink / raw)
  To: Francois Romieu; +Cc: nic_swsd, eric.dumazet, netdev
In-Reply-To: <20111117093635.GA9112@electric-eye.fr.zoreil.com>

> The 816x chipsets have hardware stats registers. The driver already
> use them. Please use them more.

Hi Francois,

I'm not sure which hardware stats registers are accurate. Besides, it
seem r8169.c are also designed for other chipsets (8129?).

I would like other people who are quite familiar with those chipsets
to do this job.

Is that OK? Thanks.

-- 
--Junchang

^ permalink raw reply

* Re: [PATCH net-next] sky2: fix hang in napi_disable
From: David Miller @ 2011-11-18  2:44 UTC (permalink / raw)
  To: shemminger; +Cc: svenjoac, netdev
In-Reply-To: <20111117163735.0a16660b@nehalam.linuxnetplumber.net>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Thu, 17 Nov 2011 16:37:35 -0800

> If IRQ was never initialized, then calling napi_disable() would hang.
> Add more bookkeeping to track whether IRQ was ever initialized.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied to 'net'.

^ permalink raw reply

* Re: [PATCH net-next] sky2: enforce minimum ring size
From: David Miller @ 2011-11-18  2:44 UTC (permalink / raw)
  To: shemminger; +Cc: svenjoac, netdev
In-Reply-To: <20111117163723.67964841@nehalam.linuxnetplumber.net>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Thu, 17 Nov 2011 16:37:23 -0800

> The hardware has a restriction that the minimum ring size possible
> is 128. The number of elements used is controlled by tx_pending and
> the overall number of elements in the ring tx_ring_size, therefore it
> is okay to limit the number of elements in use to a small value (63)
> but still provide a bigger ring.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied to 'net'.

^ permalink raw reply

* Re: [PATCH 1/2] net: add network priority cgroup infrastructure (v2)
From: John Fastabend @ 2011-11-18  3:17 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev@vger.kernel.org, Love, Robert W, David S. Miller
In-Reply-To: <1321566472-28969-2-git-send-email-nhorman@tuxdriver.com>

On 11/17/2011 1:47 PM, Neil Horman wrote:
> This patch adds in the infrastructure code to create the network priority
> cgroup.  The cgroup, in addition to the standard processes file creates two
> control files:
> 
> 1) prioidx - This is a read-only file that exports the index of this cgroup.
> This is a value that is both arbitrary and unique to a cgroup in this subsystem,
> and is used to index the per-device priority map
> 
> 2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
> <name:priority> where name is the name of a network interface and priority is
> indicates the priority assigned to frames egresessing on the named interface and
> originating from a pid in this cgroup
> 
> This cgroup allows for skb priority to be set prior to a root qdisc getting
> selected. This is benenficial for DCB enabled systems, in that it allows for any
> application to use dcb configured priorities so without application modification
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> CC: Robert Love <robert.w.love@intel.com>
> CC: "David S. Miller" <davem@davemloft.net>
> ---

one more nit... can we convert the rcu_dereference() into rtnl_dereference()
where it is relevant?

/**
 * rtnl_dereference - fetch RCU pointer when updates are prevented by RTNL
 * @p: The pointer to read, prior to dereferencing
 *
 * Return the value of the specified RCU-protected pointer, but omit
 * both the smp_read_barrier_depends() and the ACCESS_ONCE(), because
 * caller holds RTNL.
 */
#define rtnl_dereference(p)                                     \
        rcu_dereference_protected(p, lockdep_rtnl_is_held())

[...]

> +
> +static void extend_netdev_table(struct net_device *dev, u32 new_len)
> +{
> +       size_t new_size = sizeof(struct netprio_map) +
> +                          ((sizeof(u32) * new_len));
> +       struct netprio_map *new_priomap = kzalloc(new_size, GFP_KERNEL);
> +       struct netprio_map *old_priomap;
> +       int i;
> +
> +       old_priomap  = rcu_dereference(dev->priomap);
> +

This could be rtnl_dereference(dev->priomap) to annotate that we always
have the rtnl lock here.

> +       if (!new_priomap) {
> +               printk(KERN_WARNING "Unable to alloc new priomap!\n");
> +               return;
> +       }
> +
> +       for (i = 0;
> +            old_priomap && (i < old_priomap->priomap_len);
> +            i++)
> +               new_priomap->priomap[i] = old_priomap->priomap[i];
> +
> +       new_priomap->priomap_len = new_len;
> +
> +       rcu_assign_pointer(dev->priomap, new_priomap);
> +       if (old_priomap)
> +               kfree_rcu(old_priomap, rcu);
> +}
> +
> +static void update_netdev_tables(void)
> +{
> +       struct net_device *dev;
> +       u32 max_len = atomic_read(&max_prioidx);
> +       struct netprio_map *map;
> +
> +       rtnl_lock();
          ^^^^^^^^^^^
> +
> +
> +       for_each_netdev(&init_net, dev) {
> +               map = rcu_dereference(dev->priomap);

same here rtnl_dereference(dev->priomap);

> +               if ((!map) ||
> +                   (map->priomap_len < max_len))
> +                       extend_netdev_table(dev, max_len);
> +       }
> +
> +       rtnl_unlock();
> +}
> +
> +static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
> +                                                struct cgroup *cgrp)
> +{
> +       struct cgroup_netprio_state *cs;
> +       int ret;
> +
> +       cs = kzalloc(sizeof(*cs), GFP_KERNEL);
> +       if (!cs)
> +               return ERR_PTR(-ENOMEM);
> +
> +       if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
> +               kfree(cs);
> +               return ERR_PTR(-EINVAL);
> +       }
> +
> +       ret = get_prioidx(&cs->prioidx);
> +       if (ret != 0) {
> +               printk(KERN_WARNING "No space in priority index array\n");
> +               kfree(cs);
> +               return ERR_PTR(ret);
> +       }
> +
> +       return &cs->css;
> +}
> +
> +static void cgrp_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> +       struct cgroup_netprio_state *cs;
> +       struct net_device *dev;
> +       struct netprio_map *map;
> +
> +       cs = cgrp_netprio_state(cgrp);
> +       rtnl_lock();
> +       for_each_netdev(&init_net, dev) {
> +               map = rcu_dereference(dev->priomap);

map = rtnl_dereference(dev->priomap)

> +               if (map)
> +                       map->priomap[cs->prioidx] = 0;
> +       }
> +       rtnl_unlock();
> +       put_prioidx(cs->prioidx);
> +       kfree(cs);
> +}
> +
> +static u64 read_prioidx(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       return (u64)cgrp_netprio_state(cgrp)->prioidx;
> +}
> +
> +static int read_priomap(struct cgroup *cont, struct cftype *cft,
> +                       struct cgroup_map_cb *cb)
> +{
> +       struct net_device *dev;
> +       u32 prioidx = cgrp_netprio_state(cont)->prioidx;
> +       u32 priority;
> +       struct netprio_map *map;
> +
> +       rcu_read_lock();
> +
> +       for_each_netdev_rcu(&init_net, dev) {
> +               map = rcu_dereference(dev->priomap);
> +               priority = map ? map->priomap[prioidx] : 0;
> +               cb->fill(cb, dev->name, priority);
> +       }
> +       rcu_read_unlock();
> +       return 0;
> +}
> +

[...]

> +static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> +       return cgroup_add_files(cgrp, ss, ss_files, ARRAY_SIZE(ss_files));
> +}
> +
> +static int netprio_device_event(struct notifier_block *unused,
> +                               unsigned long event, void *ptr)
> +{
> +       struct net_device *dev = ptr;
> +       struct netprio_map *old;
> +       u32 max_len = atomic_read(&max_prioidx);
> +
> +       old = rcu_dereference_protected(dev->priomap, 1);

This is protected because of the rtnl lock so use,

old = rtnl_dereference(dev->priomap);

> +       /*
> +        * Note this is called with rtnl_lock held so we have update side
> +        * protection on our rcu assignments
> +        */
> +
> +       switch (event) {
> +
> +       case NETDEV_REGISTER:
> +               if (max_len)
> +                       extend_netdev_table(dev, max_len);
> +               break;
> +       case NETDEV_UNREGISTER:
> +               rcu_assign_pointer(dev->priomap, NULL);
> +               if (old)
> +                       kfree_rcu(old, rcu);
> +               break;
> +       }
> +       return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block netprio_device_notifier = {
> +       .notifier_call = netprio_device_event
> +};
> +

I can roll an update if you want, just let me know.

Thanks,
John

^ permalink raw reply

* ipv4 udplite broken in >=linux-3.0 ?
From: Jinxin Zheng @ 2011-11-18  4:09 UTC (permalink / raw)
  To: gerrit; +Cc: linux-kernel, netdev

Hi,

I recently migrated my work from udp to udplite. Thanks for your great effort.

But after I upgraded the kernel from 2.6.38 to 3.0, the ipv4 version
of udplite stops working.
I tested the applications downloaded from Gerrit's udplite wiki, and
found that the ipv6 version works well, but not ipv4.

I did the tests on a Gentoo. The kernels are downloaded from the
official web site.

I don't know how to debug udplite. Can any one of you give me a tip on
how to get more debug info, then I can do further test and provide
with it?

Thanks!

--
Zheng Jinxin

^ permalink raw reply

* Re: [PATCH] iwl-debug: Shrink object by using dev_err and deduplicating formats
From: Guy, Wey-Yi @ 2011-11-18  3:14 UTC (permalink / raw)
  To: Joe Perches
  Cc: Intel Linux Wireless, John W. Linville,
	linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1321580775.22117.8.camel@Joe-Laptop>

On Thu, 2011-11-17 at 17:46 -0800, Joe Perches wrote:
> Using dev_err instead of dev_printk(KERN_ERR uses fewer
> arguments and is a bit smaller.
> 
> Deduplicating formats used by IWL_DEBUG_QUIET_RFKILL also
> makes the object a bit smaller.
>  
> Neatened the macros, used ##__VA_ARGS__.
> 
> $ size drivers/net/wireless/iwlwifi/built-in.o*
>    text	   data	    bss	    dec	    hex	filename
>  462652	   8646	  92576	 563874	  89aa2	drivers/net/wireless/iwlwifi/built-in.o.new
>  467557	   8646	  92592	 568795	  8addb	drivers/net/wireless/iwlwifi/built-in.o.old
> 
> Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Wey-Yi Guy <wey-yi.w.guy@intel.com>
> ---
> \

very interesting. Thanks

Wey

^ permalink raw reply

* Re: [PATCH net-next] bnx2: switch to build_skb() infrastructure
From: Michael Chan @ 2011-11-18  4:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Eilon Greenstein
In-Reply-To: <1321378205.2856.24.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>


On Tue, 2011-11-15 at 09:30 -0800, Eric Dumazet wrote:
> This is very similar to bnx2x conversion, but bnx2 only requires 16bytes
> alignement at start of the received frame to store its l2_fhdr, so goal
> was not to reduce skb truesize (in fact it should not change after this
> patch)
> 
> Using build_skb() reduces cache line misses in the driver, since we
> use cache hot skb instead of cold ones. Number of in-flight sk_buff
> structures is lower, they are more likely recycled in SLUB caches
> while still hot.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Michael Chan <mchan@broadcom.com>
> CC: Eilon Greenstein <eilong@broadcom.com>

Looks good.  Thanks.

Reviewed-by: Michael Chan <mchan@broadcom.com>

^ permalink raw reply

* Re: root_lock vs. device's TX lock
From: Eric Dumazet @ 2011-11-18  5:02 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Andy Fleming, Dave Taht, Linux Netdev List
In-Reply-To: <CA+mtBx8KfVQYcZFCJrQF__LJh7e0cRSPqSk1nn537Z_ATMjD9Q@mail.gmail.com>

Le jeudi 17 novembre 2011 à 16:35 -0800, Tom Herbert a écrit :
> > Actually, I'm interested in circumventing *both* locks. Our SoC has
> > some quite-versatile queueing infrastructure, such that (for many
> > queueing setups) we can do all of the queueing in hardware, using
> > per-cpu access portals. By hacking around the qdisc lock, and using a
> > tx queue per core, we were able to achieve a significant speedup.

If packet reordering is not a concern (or taken into account in the
hardware)... A task sending tcp flow can migrate from cpu1 to cpu2...

> This was actually one of the motivations for my question.  If we have
> a one TX queue per core, and and use a trivial mq aware qdisc for
> instance, the locking becomes mostly overhead.  I don't mind taking a
> lock once per TX, but right now were taking three! (root lock twice,
> and device lock once).
> 

> Even without one queue per TX, I think the overhead savings may still
> be present.  Eric, I realize that a point of dropping the root lock in
> sch_direct_xmit is to possibly allow queuing to to qdisc and device
> xmit in parallel, but if you're using a trivial qdisc then the time in
> qdisc may be << time in device xmit, so the overhead of locking could
> mitigate the gains in parallelism.  At the very least, this benefit is
> hugely variable depending on the qdisc used.

Andy speaks more of a way to bypass qdisc (direct to device), while I
thought Tom wanted to optimize htb/cbq/..complex qdisc handling...

Note we have LLTX thing to avoid taking dev->lock for some devices.

We could have a qdisc->fast_enqueue() method for some (qdiscs/device)
combinations, able to short cut __dev_xmit_skb() and do their own stuff
(using rcu locking or other synch method, percpu bytes/counter stats...)

But if device is really that smart, why even use a qdisc in the first
place ?

If qdisc not needed : We take only one lock (dev lock) to transmit a
packet. If device is LLTX, no lock taken at all.

If qdisc is needed : We need to take qdisc lock to enqueue packet.
   Then, _if_ we own the __QDISC___STATE_RUNNING flag, we enter the loop
to dequeue and xmit packets (__qdisc_run() / sch_direct_xmit() doing the
insane lock flips)

Its a tough choice : Do we want to avoid false sharing in the device
itself or qdisc...

Current handling was designed so that one cpu was feeding the device and
other cpus feeding the qdisc.

^ permalink raw reply

* RE: [PATCH v7] Phonet: set the pipe handle using setsockopt
From: Dinesh Kumar SHARMA (STE) @ 2011-11-18  5:27 UTC (permalink / raw)
  To: Hemant-vilas RAMDASI, netdev-owner@vger.kernel.org,
	davem@davemloft.net
  Cc: netdev@vger.kernel.org, remi.denis-courmont@nokia.com,
	Hemant-vilas RAMDASI
In-Reply-To: <1321522140-10133-1-git-send-email-hemant.ramdasi@stericsson.com>

Hi,

Can this be applied now?

Thanks.

Regards,
-Dinesh

-----Original Message-----
From: Hemant Vilas RAMDASI [mailto:hemant.ramdasi@stericsson.com] 
Sent: Thursday, November 17, 2011 2:59 PM
To: netdev-owner@vger.kernel.org
Cc: netdev@vger.kernel.org; remi.denis-courmont@nokia.com; Dinesh Kumar SHARMA (STE); Hemant-vilas RAMDASI
Subject: [PATCH v7] Phonet: set the pipe handle using setsockopt

From: Dinesh Kumar Sharma <dinesh.sharma@stericsson.com>

This provides flexibility to set the pipe handle
using setsockopt. The pipe can be enabled (if disabled) later
using ioctl.

Signed-off-by: Hemant Ramdasi <hemant.ramdasi@stericsson.com>
Signed-off-by: Dinesh Kumar Sharma <dinesh.sharma@stericsson.com>
---
 include/linux/phonet.h |    2 +
 net/phonet/pep.c       |   96 ++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 91 insertions(+), 7 deletions(-)

diff --git a/include/linux/phonet.h b/include/linux/phonet.h
index 6fb1384..1847ef9 100644
--- a/include/linux/phonet.h
+++ b/include/linux/phonet.h
@@ -37,6 +37,7 @@
 #define PNPIPE_ENCAP		1
 #define PNPIPE_IFINDEX		2
 #define PNPIPE_HANDLE		3
+#define PNPIPE_INITSTATE	4
 
 #define PNADDR_ANY		0
 #define PNADDR_BROADCAST	0xFC
@@ -48,6 +49,7 @@
 
 /* ioctls */
 #define SIOCPNGETOBJECT		(SIOCPROTOPRIVATE + 0)
+#define SIOCPNENABLEPIPE	(SIOCPROTOPRIVATE + 13)
 #define SIOCPNADDRESOURCE	(SIOCPROTOPRIVATE + 14)
 #define SIOCPNDELRESOURCE	(SIOCPROTOPRIVATE + 15)
 
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index f17fd84..7acd262 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -533,6 +533,29 @@ static int pep_connresp_rcv(struct sock *sk, struct sk_buff *skb)
 	return pipe_handler_send_created_ind(sk);
 }
 
+static int pep_enableresp_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	struct pnpipehdr *hdr = pnp_hdr(skb);
+
+	if (hdr->error_code != PN_PIPE_NO_ERROR)
+		return -ECONNREFUSED;
+
+	return pep_indicate(sk, PNS_PIPE_ENABLED_IND, 0 /* sub-blocks */,
+		NULL, 0, GFP_ATOMIC);
+
+}
+
+static void pipe_start_flow_control(struct sock *sk)
+{
+	struct pep_sock *pn = pep_sk(sk);
+
+	if (!pn_flow_safe(pn->tx_fc)) {
+		atomic_set(&pn->tx_credits, 1);
+		sk->sk_write_space(sk);
+	}
+	pipe_grant_credits(sk, GFP_ATOMIC);
+}
+
 /* Queue an skb to an actively connected sock.
  * Socket lock must be held. */
 static int pipe_handler_do_rcv(struct sock *sk, struct sk_buff *skb)
@@ -578,13 +601,25 @@ static int pipe_handler_do_rcv(struct sock *sk, struct sk_buff *skb)
 			sk->sk_state = TCP_CLOSE_WAIT;
 			break;
 		}
+		if (pn->init_enable == PN_PIPE_DISABLE)
+			sk->sk_state = TCP_SYN_RECV;
+		else {
+			sk->sk_state = TCP_ESTABLISHED;
+			pipe_start_flow_control(sk);
+		}
+		break;
 
-		sk->sk_state = TCP_ESTABLISHED;
-		if (!pn_flow_safe(pn->tx_fc)) {
-			atomic_set(&pn->tx_credits, 1);
-			sk->sk_write_space(sk);
+	case PNS_PEP_ENABLE_RESP:
+		if (sk->sk_state != TCP_SYN_SENT)
+			break;
+
+		if (pep_enableresp_rcv(sk, skb)) {
+			sk->sk_state = TCP_CLOSE_WAIT;
+			break;
 		}
-		pipe_grant_credits(sk, GFP_ATOMIC);
+
+		sk->sk_state = TCP_ESTABLISHED;
+		pipe_start_flow_control(sk);
 		break;
 
 	case PNS_PEP_DISCONNECT_RESP:
@@ -863,14 +898,32 @@ static int pep_sock_connect(struct sock *sk, struct sockaddr *addr, int len)
 	int err;
 	u8 data[4] = { 0 /* sub-blocks */, PAD, PAD, PAD };
 
-	pn->pipe_handle = 1; /* anything but INVALID_HANDLE */
+	if (pn->pipe_handle == PN_PIPE_INVALID_HANDLE)
+		pn->pipe_handle = 1; /* anything but INVALID_HANDLE */
+
 	err = pipe_handler_request(sk, PNS_PEP_CONNECT_REQ,
-					PN_PIPE_ENABLE, data, 4);
+				pn->init_enable, data, 4);
 	if (err) {
 		pn->pipe_handle = PN_PIPE_INVALID_HANDLE;
 		return err;
 	}
+
 	sk->sk_state = TCP_SYN_SENT;
+
+	return 0;
+}
+
+static int pep_sock_enable(struct sock *sk, struct sockaddr *addr, int len)
+{
+	int err;
+
+	err = pipe_handler_request(sk, PNS_PEP_ENABLE_REQ, PAD,
+				NULL, 0);
+	if (err)
+		return err;
+
+	sk->sk_state = TCP_SYN_SENT;
+
 	return 0;
 }
 
@@ -894,6 +947,19 @@ static int pep_ioctl(struct sock *sk, int cmd, unsigned long arg)
 			answ = 0;
 		release_sock(sk);
 		return put_user(answ, (int __user *)arg);
+		break;
+
+	case SIOCPNENABLEPIPE:
+		lock_sock(sk);
+		if (sk->sk_state == TCP_SYN_SENT)
+			answ =  -EBUSY;
+		else if (sk->sk_state == TCP_ESTABLISHED)
+			answ = -EISCONN;
+		else
+			answ = pep_sock_enable(sk, NULL, 0);
+		release_sock(sk);
+		return answ;
+		break;
 	}
 
 	return -ENOIOCTLCMD;
@@ -959,6 +1025,18 @@ static int pep_setsockopt(struct sock *sk, int level, int optname,
 		}
 		goto out_norel;
 
+	case PNPIPE_HANDLE:
+		if ((sk->sk_state == TCP_CLOSE) &&
+			(val >= 0) && (val < PN_PIPE_INVALID_HANDLE))
+			pn->pipe_handle = val;
+		else
+			err = -EINVAL;
+		break;
+
+	case PNPIPE_INITSTATE:
+		pn->init_enable = !!val;
+		break;
+
 	default:
 		err = -ENOPROTOOPT;
 	}
@@ -994,6 +1072,10 @@ static int pep_getsockopt(struct sock *sk, int level, int optname,
 			return -EINVAL;
 		break;
 
+	case PNPIPE_INITSTATE:
+		val = pn->init_enable;
+		break;
+
 	default:
 		return -ENOPROTOOPT;
 	}
-- 
1.7.4.3

^ permalink raw reply related

* Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
From: Sasha Levin @ 2011-11-18  6:24 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Krishna Kumar, rusty, mst, netdev, kvm, davem, virtualization
In-Reply-To: <1321578488.2749.66.camel@bwh-desktop>

On Fri, 2011-11-18 at 01:08 +0000, Ben Hutchings wrote:
> On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
> > Changes for multiqueue virtio_net driver.
> [...]
> > @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
> >  {
> >  	struct virtnet_info *vi = netdev_priv(dev);
> >  	int cpu;
> > -	unsigned int start;
> >  
> >  	for_each_possible_cpu(cpu) {
> > -		struct virtnet_stats __percpu *stats
> > -			= per_cpu_ptr(vi->stats, cpu);
> > -		u64 tpackets, tbytes, rpackets, rbytes;
> > -
> > -		do {
> > -			start = u64_stats_fetch_begin(&stats->syncp);
> > -			tpackets = stats->tx_packets;
> > -			tbytes   = stats->tx_bytes;
> > -			rpackets = stats->rx_packets;
> > -			rbytes   = stats->rx_bytes;
> > -		} while (u64_stats_fetch_retry(&stats->syncp, start));
> > -
> > -		tot->rx_packets += rpackets;
> > -		tot->tx_packets += tpackets;
> > -		tot->rx_bytes   += rbytes;
> > -		tot->tx_bytes   += tbytes;
> > +		int qpair;
> > +
> > +		for (qpair = 0; qpair < vi->num_queue_pairs; qpair++) {
> > +			struct virtnet_send_stats __percpu *tx_stat;
> > +			struct virtnet_recv_stats __percpu *rx_stat;
> 
> While you're at it, you can drop the per-CPU stats and make them only
> per-queue.  There is unlikely to be any benefit in maintaining them
> per-CPU while receive and transmit processing is serialised per-queue.

It allows you to update stats without a lock.

Whats the benefit of having them per queue?

-- 

Sasha.


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox