Netdev List
 help / color / mirror / Atom feed
* Re: 82571EB: Detected Hardware Unit Hang
From: Dave, Tushar N @ 2012-07-12  2:52 UTC (permalink / raw)
  To: Joe Jin
  Cc: netdev@vger.kernel.org, e1000-devel@lists.sf.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <4FFE3505.8060407@oracle.com>

>-----Original Message-----
>From: Joe Jin [mailto:joe.jin@oracle.com]
>Sent: Wednesday, July 11, 2012 7:23 PM
>To: Dave, Tushar N
>Cc: e1000-devel@lists.sf.net; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/12/12 02:51, Dave, Tushar N wrote:
>>
>> Joe,
>>
>> I see couple of errors in lspci output.
>> Device capability status register shows UnCorrectable PCIe error. This
>means there is certainly something went wrong. The only way to recover
>from Uncorrectable errors is reset.
>>
>> 	DevSta:	CorrErr- *UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+
>TransPend-
>>
>> Also AER sections in lspci output shows PCIe completion timeout.
>>
>> 	Capabilities: [100 v1] Advanced Error Reporting
>> 		UESta:	DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt-
>RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
>>
>> I suggest you should load AER driver and check for any error messages in
>log. Also please check any error message reported by system in BIOS log.
>Are there any machine check errors?
>>
>> When did you notice this issue? have 82571 ever been working before on
>this server?
>>
>> One more thing, Cache line size 256 is little unusual( I never seen this
>value before, mostly it's 64). Does BIOS settings have been changed? Are
>you using default BIOS setting?
>>
>
>I checked BIOS's log found the fault from the device, I changed "PCI-E
>Payload Size"
>from 256(default) to 128, now the device works.
>
>I compared lspci output found Address for data of MSI Capabilities's be
>changed:
>
>Old:
>        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                Address: 00000000fee21000  Data: 40cb
>
>New:
>        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                Address: 00000000fee24000  Data: 405c
>
>Mostly like it's a BIOS bug? please comments.
>
>Thanks,
>Joe

What is the exact error messages in BIOS log?


------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: 82571EB: Detected Hardware Unit Hang
From: Joe Jin @ 2012-07-12  2:23 UTC (permalink / raw)
  To: Dave, Tushar N
  Cc: e1000-devel@lists.sf.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <061C8A8601E8EE4CA8D8FD6990CEA891274EF8AD@ORSMSX102.amr.corp.intel.com>

On 07/12/12 02:51, Dave, Tushar N wrote:
> 
> Joe,
> 
> I see couple of errors in lspci output.
> Device capability status register shows UnCorrectable PCIe error. This means there is certainly something went wrong. The only way to recover from Uncorrectable errors is reset.
>    
> 	DevSta:	CorrErr- *UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend-
> 
> Also AER sections in lspci output shows PCIe completion timeout.
> 	
> 	Capabilities: [100 v1] Advanced Error Reporting
> 		UESta:	DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
> 
> I suggest you should load AER driver and check for any error messages in log. Also please check any error message reported by system in BIOS log. Are there any machine check errors? 
> 
> When did you notice this issue? have 82571 ever been working before on this server?
> 
> One more thing, Cache line size 256 is little unusual( I never seen this value before, mostly it's 64). Does BIOS settings have been changed? Are you using default BIOS setting?
> 

I checked BIOS's log found the fault from the device, I changed "PCI-E Payload Size"
from 256(default) to 128, now the device works.

I compared lspci output found Address for data of MSI Capabilities's be changed:

Old:
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee21000  Data: 40cb

New:
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee24000  Data: 405c

Mostly like it's a BIOS bug? please comments.

Thanks,
Joe

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the infiniband tree
From: Stephen Rothwell @ 2012-07-12  2:13 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Hadar Hen Zion, Or Gerlitz,
	Jack Morgenstein, Roland Dreier, linux-rdma

[-- Attachment #1: Type: text/plain, Size: 3011 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in
include/linux/mlx4/device.h between commit 396f2feb05d7 ("mlx4_core:
Implement mechanism for reserved Q_Keys") from the infiniband tree and
commit 0ff1fb654bec ("{NET, IB}/mlx4: Add device managed flow steering
firmware API") from the net-next tree.

Just context changes.  I fixed it up (see below) and can carry the fix
as necessary.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc include/linux/mlx4/device.h
index 441caf1,6f0d133..0000000
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@@ -540,83 -542,10 +573,85 @@@ struct mlx4_dev 
  	u8			rev_id;
  	char			board_id[MLX4_BOARD_ID_LEN];
  	int			num_vfs;
+ 	u64			regid_promisc_array[MLX4_MAX_PORTS + 1];
+ 	u64			regid_allmulti_array[MLX4_MAX_PORTS + 1];
  };
  
 +struct mlx4_eqe {
 +	u8			reserved1;
 +	u8			type;
 +	u8			reserved2;
 +	u8			subtype;
 +	union {
 +		u32		raw[6];
 +		struct {
 +			__be32	cqn;
 +		} __packed comp;
 +		struct {
 +			u16	reserved1;
 +			__be16	token;
 +			u32	reserved2;
 +			u8	reserved3[3];
 +			u8	status;
 +			__be64	out_param;
 +		} __packed cmd;
 +		struct {
 +			__be32	qpn;
 +		} __packed qp;
 +		struct {
 +			__be32	srqn;
 +		} __packed srq;
 +		struct {
 +			__be32	cqn;
 +			u32	reserved1;
 +			u8	reserved2[3];
 +			u8	syndrome;
 +		} __packed cq_err;
 +		struct {
 +			u32	reserved1[2];
 +			__be32	port;
 +		} __packed port_change;
 +		struct {
 +			#define COMM_CHANNEL_BIT_ARRAY_SIZE	4
 +			u32 reserved;
 +			u32 bit_vec[COMM_CHANNEL_BIT_ARRAY_SIZE];
 +		} __packed comm_channel_arm;
 +		struct {
 +			u8	port;
 +			u8	reserved[3];
 +			__be64	mac;
 +		} __packed mac_update;
 +		struct {
 +			__be32	slave_id;
 +		} __packed flr_event;
 +		struct {
 +			__be16  current_temperature;
 +			__be16  warning_threshold;
 +		} __packed warming;
 +		struct {
 +			u8 reserved[3];
 +			u8 port;
 +			union {
 +				struct {
 +					__be16 mstr_sm_lid;
 +					__be16 port_lid;
 +					__be32 changed_attr;
 +					u8 reserved[3];
 +					u8 mstr_sm_sl;
 +					__be64 gid_prefix;
 +				} __packed port_info;
 +				struct {
 +					__be32 block_ptr;
 +					__be32 tbl_entries_mask;
 +				} __packed tbl_change_info;
 +			} params;
 +		} __packed port_mgmt_change;
 +	}			event;
 +	u8			slave_id;
 +	u8			reserved3[2];
 +	u8			owner;
 +} __packed;
 +
  struct mlx4_init_port_param {
  	int			set_guid0;
  	int			set_node_guid;
@@@ -783,6 -793,8 +908,10 @@@ int mlx4_wol_write(struct mlx4_dev *dev
  int mlx4_counter_alloc(struct mlx4_dev *dev, u32 *idx);
  void mlx4_counter_free(struct mlx4_dev *dev, u32 idx);
  
 +int mlx4_get_parav_qkey(struct mlx4_dev *dev, u32 qpn, u32 *qkey);
 +
+ int mlx4_flow_attach(struct mlx4_dev *dev,
+ 		     struct mlx4_net_trans_rule *rule, u64 *reg_id);
+ int mlx4_flow_detach(struct mlx4_dev *dev, u64 reg_id);
+ 
  #endif /* MLX4_DEVICE_H */

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the infiniband tree
From: Stephen Rothwell @ 2012-07-12  2:09 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Jack Morgenstein, Roland Dreier,
	linux-rdma, Hadar Hen Zion, Or Gerlitz

[-- Attachment #1: Type: text/plain, Size: 1561 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in
drivers/net/ethernet/mellanox/mlx4/main.c between commit 6634961c14d3
("mlx4: Put physical GID and P_Key table sizes in mlx4_phys_caps struct
and paravirtualize them") from the infiniband tree and commit
0ff1fb654bec ("{NET, IB}/mlx4: Add device managed flow steering firmware
API") from the net-next tree.

Just context changes (I think).  I have fixed it up (see below) and can
carry the fix as necessary.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc drivers/net/ethernet/mellanox/mlx4/main.c
index 5df3ac4,4264516..0000000
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@@ -1232,10 -1231,26 +1258,29 @@@ static int mlx4_init_hca(struct mlx4_de
  			goto err_stop_fw;
  		}
  
 +		if (mlx4_is_master(dev))
 +			mlx4_parav_master_pf_caps(dev);
 +
+ 		priv->fs_hash_mode = MLX4_FS_L2_HASH;
+ 
+ 		switch (priv->fs_hash_mode) {
+ 		case MLX4_FS_L2_HASH:
+ 			init_hca.fs_hash_enable_bits = 0;
+ 			break;
+ 
+ 		case MLX4_FS_L2_L3_L4_HASH:
+ 			/* Enable flow steering with
+ 			 * udp unicast and tcp unicast
+ 			 */
+ 			init_hca.fs_hash_enable_bits =
+ 				MLX4_FS_UDP_UC_EN | MLX4_FS_TCP_UC_EN;
+ 			break;
+ 		}
+ 
  		profile = default_profile;
+ 		if (dev->caps.steering_mode ==
+ 		    MLX4_STEERING_MODE_DEVICE_MANAGED)
+ 			profile.num_mcg = MLX4_FS_NUM_MCG;
  
  		icm_size = mlx4_make_profile(dev, &profile, &dev_cap,
  					     &init_hca);

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2] net: Update alloc frag to reduce get/put page usage and recycle pages
From: Alexander Duyck @ 2012-07-12  2:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher, Eric Dumazet
In-Reply-To: <1342052967.3265.8210.camel@edumazet-glaptop>

On 7/11/2012 5:29 PM, Eric Dumazet wrote:
> On Wed, 2012-07-11 at 17:18 -0700, Alexander Duyck wrote:
>> This patch does several things.
>>
>> First it reorders the netdev_alloc_frag code so that only one conditional
>> check is needed in most cases instead of 2.
>>
>> Second it incorporates the atomic_set and atomic_sub_return logic from an
>> earlier proposed patch by Eric Dumazet allowing for a reduction in the
>> get_page/put_page overhead when dealing with frags.
>>
>> Finally it also incorporates the page reuse code so that if the page count
>> is dropped to 0 we can just reinitialize the page and reuse it.
>>
>> Cc: Eric Dumazet<edumazet@google.com>
>> Signed-off-by: Alexander Duyck<alexander.h.duyck@intel.com>
>> ---
>
> Hmm, I was working on a version using order-3 pages if available.
>
> (or more exactly 32768 bytes chunks)
>
> I am not sure how your version can help with typical 1500 allocations
> (2 skbs per page)
>
>
The gain will be minimal if any with the 1500 byte allocations, however 
there shouldn't be a performance degradation.

I was thinking more of the ixgbe case where we are working with only 256 
byte allocations and can recycle pages in the case of GRO or TCP.  For 
ixgbe the advantages are significant since we drop a number of the 
get_page calls and get the advantage of the page recycling.  So for 
example with GRO enabled we should only have to allocate 1 page for 
headers every 16 buffers, and the 6 slots we use in that page have a 
good likelihood of being warm in the cache since we just keep looping on 
the same page.

Thanks,

Alex

^ permalink raw reply

* [PATCH] tc: man: change man page and comment to confirm to code's behavior.
From: Li Wei @ 2012-07-12  1:56 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20120711075129.4f81eea8@nehalam.linuxnetplumber.net>


Since the get_rate() code incorrectly interpreted bare number, the
behavior is not the same as man page and comment described.

We need to change the man page and comment for compatible with the
existing usage by scripts.
---
 man/man8/tc.8 |    7 +++++--
 tc/tc_util.c  |    2 +-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 958ab98..f0e5613 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -259,6 +259,9 @@ All parameters accept a floating point number, possibly followed by a unit.
 .P
 Bandwidths or rates can be specified in:
 .TP
+bps
+Bytes per second
+.TP
 kbps
 Kilobytes per second
 .TP
@@ -271,8 +274,8 @@ Kilobits per second
 mbit
 Megabits per second
 .TP
-bps or a bare number
-Bytes per second
+bit or a bare number
+Bits per second
 .P
 Amounts of data can be specified in:
 .TP
diff --git a/tc/tc_util.c b/tc/tc_util.c
index 926ed08..ccf8fa4 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -153,7 +153,7 @@ int get_rate(unsigned *rate, const char *str)
 		return -1;
 
 	if (*p == '\0') {
-		*rate = bps / 8.;	/* assume bytes/sec */
+		*rate = bps / 8.;	/* assume bits/sec */
 		return 0;
 	}
 
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH 2/2] bonding: debugfs and network namespaces are incompatible
From: Jay Vosburgh @ 2012-07-12  1:57 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <877gu9omrr.fsf@xmission.com>

Eric W. Biederman <ebiederm@xmission.com> wrote:

>Jay Vosburgh <fubar@us.ibm.com> writes:
>
>> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>>>I haven't run across any of those network devices, but if they create a
>>>debugfs entry that embeds the device name it will be a problem.
>>
>> 	A quick grep suggests that cxgb4, skge, sky2, stmmac, ipoib and
>> half a dozen of the wireless drivers all create files in debugfs.  I did
>> not check exhaustively, but at least some of them include the device
>> name.
>
>Yep.  It looks like imperfect habits are common.
>
>>>Last I looked any custom user space interface from network devices was
>>>rare and bonding using debugfs is the first instance of using debugfs
>>>from networking devices I have seen.
>>>
>>>I think the problem will be a little less severe for physical network
>>>devices as they all start in the initial network namespace and so start
>>>with distinct names.
>>>
>>>With bonding I can do "ip link add type bond" in any network namespace
>>>and get another bond0.  So name conflicts are very much expeted with all
>>>virtual networking devices.
>>
>> 	Fair enough, although it is trivial to rename any network device
>> such that a conflict would occur.
>
>Actually for userspace and administrative reasons frequently it isn't
>trivial to rename devices.

	Well, perhaps it's uncommon for users to do so, but "ip link set
dev eth0 name eth44" is pretty easy to do.

>>>But if you know of any other networking devices using debugsfs that
>>>code should probably get the same treatment as the bonding debugfs code.
>>
>> 	Is there no alternative than simply disabling debugfs whenever
>> network namespaces are enabled?  The information bonding displays via
>> debugfs is useful, and having it unavailable on all distro kernels seems
>> a bit harsh.
>
>
>I took a good hard look at debugfs while writing this reply and debufs
>scares me.  It is the kind of code that just about wants to me to throw
>in the towel seeing no hope of a good solid kernel. 
>
>I can definitely open a /sys/kernel/debug/bonding/bond0/rlb_hash_table
>and delete the bond and then read the file.  On a bad day that will oops
>the kernel, as there is nothing holding a reference to the network
>device.  I think only the BOND_MODE_ALB check makes keeps the kernel
>from oopsing in my quick tests.
>
>The fact that debugfs is enabled in distro kernels is actually apalling
>to me.  debugfs makes it easy to oops the kernel.

	I'm not so sure things are that bad.  I cannot unload the
bonding module while a program holds an open file descriptor on its
debugfs file (it appears to hold a reference to the module), so uses
that only remove the debugfs file on module unload shouldn't have a
problem.

	The /proc file that bonding removes when an interface is
dynamically removed does not have this problem, as subsequent reads on
that file descriptor will fail.  I suspect that's because
remove_proc_entry NULLs the proc_fops, whereas debugfs_remove does not
do the equivalent for its case.  It may not be that simple, though; I'm
just looking at the code and have not tested anything.

>There are lots of alternatives to debugfs on where to put information
>and the bonding driver already uses most of them.
>
>> 	Why is the logic already in the driver not sufficient?  If the
>> attempt to create the debugfs directory with the interface name fails,
>> then it merely prints a warning and continues without the debugfs for
>> that interface.
>
>All I know for certain is the existing logic will eventually cause
>someone doing something reasonable to send me a bug report.
>
>I can see where you are coming from in that the bonding driver debugfs
>code really was built to gracefully fail and ignore problems of instead
>of just hapharzardly and sloppily ignore problems.  At the same time
>I can oops the kernel if I try with your debugfs in the bonding driver.
>
>But it causes the code to fail and issue a warning.  So if I don't
>disable the code now, I expect I will get a bug report, and who
>knows how many sill files in bonding will have in debugfs by then.
>what silly things bonding may be doing in debugfs by then.

	Or perhaps we can fix the debugfs support to function correctly
even in the face of network namespaces.  For example, do namespaces have
a unique name or identifier than can go into the debugfs name?

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [PATCH 1/2] tcp: Fix out of bounds access to tcpm_vals
From: Alexander Duyck @ 2012-07-12  1:46 UTC (permalink / raw)
  To: David Miller; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher
In-Reply-To: <20120711.173249.1303803416502735349.davem@davemloft.net>

On 7/11/2012 5:32 PM, David Miller wrote:
> From: Alexander Duyck<alexander.h.duyck@intel.com>
> Date: Wed, 11 Jul 2012 17:18:04 -0700
>
>> The recent patch "tcp: Maintain dynamic metrics in local cache." introduced
>> an out of bounds access due to what appears to be a typo.   I believe this
>> change should resolve the issue by replacing the access to RTAX_CWND with
>> TCP_METRIC_CWND.
>>
>> Signed-off-by: Alexander Duyck<alexander.h.duyck@intel.com>
> Applied, thanks a lot.
>
> How did you spot this, did you get a compiler warning?
>
> I ask because while working on this, I at one point put the
> tcp timestamp members after the metrics array in the
> tcp_metrics_bucket struct.  And I got a warning from gcc about
> an array bounds violation that I could not figure out.
>
> I am pretty certain this bug here is what it was warning about.  And
> the problem is that if you put the array at the end gcc doesn't warn
> in order to handle things similar to what people use zero length
> arrays for.
It came up as a compiler warning.  I suspect it may have something to do 
with the optimizations I had turned on since it complained that the 
issue was in tcp_update_metrics but then reported it on the one line in 
tcp_metric_set.

Thanks,

Alex

^ permalink raw reply

* Re: [PATCH] tc_util: fix incorrect bare number process in get_rate.
From: Li Wei @ 2012-07-12  1:16 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20120711075129.4f81eea8@nehalam.linuxnetplumber.net>

于 2012-7-11 22:51, Stephen Hemminger 写道:
> On Wed, 11 Jul 2012 15:24:50 +0800
> Li Wei <lw@cn.fujitsu.com> wrote:
> 
>>
>> As the comment and manpage indicated that the bare number means
>> bytes per second, so the division is not needed.
>> ---
>>  tc/tc_util.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/tc/tc_util.c b/tc/tc_util.c
>> index 926ed08..e8d89c1 100644
>> --- a/tc/tc_util.c
>> +++ b/tc/tc_util.c
>> @@ -153,7 +153,7 @@ int get_rate(unsigned *rate, const char *str)
>>  		return -1;
>>  
>>  	if (*p == '\0') {
>> -		*rate = bps / 8.;	/* assume bytes/sec */
>> +		*rate = bps;	/* assume bytes/sec */
>>  		return 0;
>>  	}
>>  
> Thanks for finding this. The documentation, code and comment do
> all need to be the same!
> 
> But changing the code as you propose would break existing usage
> by scripts. Instead, the man page and comment need to change
> to match the reality of the existing application.
> 
> 
Well, I see, I'll send another patch to take care of this.

Thanks,

Wei

^ permalink raw reply

* Re: [PATCH v3 net-next] tcp: TCP Small Queues
From: David Miller @ 2012-07-12  1:14 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nanditad, netdev, codel, ncardwell, mattmathis
In-Reply-To: <1342021831.3265.8174.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 11 Jul 2012 17:50:31 +0200

> This introduce TSQ (TCP Small Queues)

Applied, fingers cross, thanks :-)

^ permalink raw reply

* Re: [PATCH 2/2] net: Update alloc frag to reduce get/put page usage and recycle pages
From: David Miller @ 2012-07-12  1:11 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, alexander.duyck,
	edumazet
In-Reply-To: <1342052967.3265.8210.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 12 Jul 2012 02:29:27 +0200

> On Wed, 2012-07-11 at 17:18 -0700, Alexander Duyck wrote:
>> This patch does several things.
>> 
>> First it reorders the netdev_alloc_frag code so that only one conditional
>> check is needed in most cases instead of 2.
>> 
>> Second it incorporates the atomic_set and atomic_sub_return logic from an
>> earlier proposed patch by Eric Dumazet allowing for a reduction in the
>> get_page/put_page overhead when dealing with frags.
>> 
>> Finally it also incorporates the page reuse code so that if the page count
>> is dropped to 0 we can just reinitialize the page and reuse it.
>> 
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
> 
> 
> Hmm, I was working on a version using order-3 pages if available.
> 
> (or more exactly 32768 bytes chunks)
> 
> I am not sure how your version can help with typical 1500 allocations
> (2 skbs per page)

I'd like you two to sort things out before I apply anything, thanks :)

^ permalink raw reply

* Re: [PATCH v3 net-next] tcp: TCP Small Queues
From: Nandita Dukkipati @ 2012-07-12  0:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, codel, ncardwell, David Miller, mattmathis
In-Reply-To: <1342051233.3265.8206.camel@edumazet-glaptop>

>> Considering these two points, why TSQ over the Codel feedback
>> approach?
>
> I dont think they compete. They are in fact complementary.
>
> If you use codel/fq_codel + TSQ, you have both per flow limitation in
> qdisc (TSQ) + sojourn time aware and multi flow aware feedback.

Makes sense. My conjecture is when using codel/fq_codel qdisc, the
need for TSQ will diminish. But as you said... good part of TSQ is it
limits per-flow queuing for any qdisc structure, even those not using
codel.

^ permalink raw reply

* Re: [PATCH 1/2] tcp: Fix out of bounds access to tcpm_vals
From: David Miller @ 2012-07-12  0:32 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: netdev, jeffrey.t.kirsher, alexander.duyck
In-Reply-To: <20120712001804.26542.2889.stgit@gitlad.jf.intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Wed, 11 Jul 2012 17:18:04 -0700

> The recent patch "tcp: Maintain dynamic metrics in local cache." introduced
> an out of bounds access due to what appears to be a typo.   I believe this
> change should resolve the issue by replacing the access to RTAX_CWND with
> TCP_METRIC_CWND.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Applied, thanks a lot.

How did you spot this, did you get a compiler warning?

I ask because while working on this, I at one point put the
tcp timestamp members after the metrics array in the
tcp_metrics_bucket struct.  And I got a warning from gcc about
an array bounds violation that I could not figure out.

I am pretty certain this bug here is what it was warning about.  And
the problem is that if you put the array at the end gcc doesn't warn
in order to handle things similar to what people use zero length
arrays for.

^ permalink raw reply

* Re: [PATCH 2/2] net: Update alloc frag to reduce get/put page usage and recycle pages
From: Eric Dumazet @ 2012-07-12  0:29 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, davem, jeffrey.t.kirsher, alexander.duyck, Eric Dumazet
In-Reply-To: <20120712001810.26542.61967.stgit@gitlad.jf.intel.com>

On Wed, 2012-07-11 at 17:18 -0700, Alexander Duyck wrote:
> This patch does several things.
> 
> First it reorders the netdev_alloc_frag code so that only one conditional
> check is needed in most cases instead of 2.
> 
> Second it incorporates the atomic_set and atomic_sub_return logic from an
> earlier proposed patch by Eric Dumazet allowing for a reduction in the
> get_page/put_page overhead when dealing with frags.
> 
> Finally it also incorporates the page reuse code so that if the page count
> is dropped to 0 we can just reinitialize the page and reuse it.
> 
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---


Hmm, I was working on a version using order-3 pages if available.

(or more exactly 32768 bytes chunks)

I am not sure how your version can help with typical 1500 allocations
(2 skbs per page)

^ permalink raw reply

* [RFC PATCH 2/2] ixgbe: Add functionality for delaying the MMIO write for Tx
From: Alexander Duyck @ 2012-07-12  0:26 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120712002103.27846.73812.stgit@gitlad.jf.intel.com>

This change makes it so that ixgbe can use the new framework for delaying
the MMIO writes in the transmit path.  With this change in place we see a
significant reduction in CPU utilization and increase in overall packets
per second throughput for bulk traffic tests.  In addition I have not seen
any increase in latency as a result of this patch.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9ec65ee..e9b71b8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -884,6 +884,8 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 		return true;
 	}
 
+	netif_tx_dispatch_queue(txring_txq(tx_ring));
+
 	netdev_tx_completed_queue(txring_txq(tx_ring),
 				  total_packets, total_bytes);
 
@@ -5825,6 +5827,22 @@ static void ixgbe_service_task(struct work_struct *work)
 	ixgbe_service_event_complete(adapter);
 }
 
+static void ixgbe_complete_xmit_frame(struct net_device *dev,
+				      unsigned int index)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	struct ixgbe_ring *tx_ring = adapter->tx_ring[index];
+
+	/* notify HW of packet */
+	writel(tx_ring->next_to_use, tx_ring->tail);
+
+	/*
+	 * we need this if more than one processor can write to our tail
+	 * at a time, it synchronizes IO on IA64/Altix systems
+	 */
+	mmiowb();
+}
+
 static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 		     struct ixgbe_tx_buffer *first,
 		     u8 *hdr_len)
@@ -6150,8 +6168,7 @@ static void ixgbe_tx_map(struct ixgbe_ring *tx_ring,
 
 	tx_ring->next_to_use = i;
 
-	/* notify HW of packet */
-	writel(i, tx_ring->tail);
+	netdev_complete_xmit(txring_txq(tx_ring));
 
 	return;
 dma_error:
@@ -6961,6 +6978,7 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
 	.ndo_start_xmit		= ixgbe_xmit_frame,
+	.ndo_complete_xmit	= ixgbe_complete_xmit_frame,
 #ifdef IXGBE_FCOE
 	.ndo_select_queue	= ixgbe_select_queue,
 #endif

^ permalink raw reply related

* [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching
From: Alexander Duyck @ 2012-07-12  0:26 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck
In-Reply-To: <20120712002103.27846.73812.stgit@gitlad.jf.intel.com>

This change adds capabilities to the driver for batching the MMIO write
involved with transmits.  Most of the logic is based off of the code for
the qdisc scheduling.

What I did is break the transmit path into two parts.  We already had the
ndo_start_xmit function which has been there all along.  The part I added
was ndo_complete_xmit which is meant to handle notifying the hardware that
frames are ready for delivery.

To control all of this I added a net sysfs value for the Tx queues called
dispatch_limit.  When 0 it indicates that all frames will notify hardware
immediately.  When 1 or more the netdev_complete_xmit call will queue up to
that number of packets, and when the value is exceeded it will notify the
hardware and reset the pending frame dispatch count.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 include/linux/netdevice.h |   57 ++++++++++++++++++++++++++++++++++++++
 net/core/dev.c            |   67 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-sysfs.c      |   36 ++++++++++++++++++++++++
 3 files changed, 160 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5a1a657..8d50fc4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -522,6 +522,8 @@ enum netdev_queue_state_t {
 	__QUEUE_STATE_DRV_XOFF,
 	__QUEUE_STATE_STACK_XOFF,
 	__QUEUE_STATE_FROZEN,
+	__QUEUE_STATE_DELAYED,
+	__QUEUE_STATE_DISPATCH,
 #define QUEUE_STATE_ANY_XOFF ((1 << __QUEUE_STATE_DRV_XOFF)		| \
 			      (1 << __QUEUE_STATE_STACK_XOFF))
 #define QUEUE_STATE_ANY_XOFF_OR_FROZEN (QUEUE_STATE_ANY_XOFF		| \
@@ -550,6 +552,7 @@ struct netdev_queue {
 #if defined(CONFIG_XPS) && defined(CONFIG_NUMA)
 	int			numa_node;
 #endif
+	unsigned int		dispatch_limit;
 /*
  * write mostly part
  */
@@ -561,6 +564,11 @@ struct netdev_queue {
 	unsigned long		trans_start;
 
 	/*
+	 * pointer to next Tx queue in dispatch_queue
+	 */
+	struct netdev_queue	*next_dispatch;
+
+	/*
 	 * Number of TX timeouts for this queue
 	 * (/sys/class/net/DEV/Q/trans_timeout)
 	 */
@@ -568,6 +576,8 @@ struct netdev_queue {
 
 	unsigned long		state;
 
+	unsigned int		dispatch_pending;
+
 #ifdef CONFIG_BQL
 	struct dql		dql;
 #endif
@@ -924,6 +934,8 @@ struct net_device_ops {
 	int			(*ndo_stop)(struct net_device *dev);
 	netdev_tx_t		(*ndo_start_xmit) (struct sk_buff *skb,
 						   struct net_device *dev);
+	void			(*ndo_complete_xmit) (struct net_device *dev,
+						      unsigned int queue);
 	u16			(*ndo_select_queue)(struct net_device *dev,
 						    struct sk_buff *skb);
 	void			(*ndo_change_rx_flags)(struct net_device *dev,
@@ -1760,6 +1772,9 @@ struct softnet_data {
 	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
+
+	struct netdev_queue	*dispatch_queue;
+	struct netdev_queue	**dispatch_queue_tailp;
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
@@ -1779,6 +1794,44 @@ static inline void input_queue_tail_incr_save(struct softnet_data *sd,
 
 DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 
+static inline void netif_tx_delay_queue(struct netdev_queue *txq)
+{
+	set_bit(__QUEUE_STATE_DELAYED, &txq->state);
+}
+
+extern void __netif_tx_dispatch_queue(struct netdev_queue *txq);
+
+static inline void netif_tx_dispatch_queue(struct netdev_queue *txq)
+{
+	if (test_and_clear_bit(__QUEUE_STATE_DELAYED, &txq->state))
+		__netif_tx_dispatch_queue(txq);
+}
+
+static inline bool netif_tx_queue_delayed(const struct netdev_queue *txq)
+{
+	return test_bit(__QUEUE_STATE_DELAYED, &txq->state);
+}
+
+static inline void netdev_complete_xmit(struct netdev_queue *txq)
+{
+	struct net_device *dev = txq->dev;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (txq->dispatch_pending < txq->dispatch_limit) {
+		if (netif_tx_queue_delayed(txq)) {
+			txq->dispatch_pending++;
+			return;
+		}
+
+		/* start of delayed write sequence */
+		netif_tx_delay_queue(txq);
+	}
+
+	txq->dispatch_pending = 0;
+
+	ops->ndo_complete_xmit(dev, txq - &dev->_tx[0]);
+}
+
 extern void __netif_schedule(struct Qdisc *q);
 
 static inline void netif_schedule_queue(struct netdev_queue *txq)
@@ -1973,6 +2026,7 @@ static inline void netdev_completed_queue(struct net_device *dev,
 
 static inline void netdev_tx_reset_queue(struct netdev_queue *q)
 {
+	clear_bit(__QUEUE_STATE_DELAYED, &q->state);
 #ifdef CONFIG_BQL
 	clear_bit(__QUEUE_STATE_STACK_XOFF, &q->state);
 	dql_reset(&q->dql);
@@ -2482,6 +2536,9 @@ static inline void netif_tx_unlock_bh(struct net_device *dev)
 	}						\
 }
 
+#define HARD_TX_TRYLOCK(dev, txq)			\
+	((dev->features & NETIF_F_LLTX) || __netif_tx_trylock(txq))
+
 static inline void netif_tx_disable(struct net_device *dev)
 {
 	unsigned int i;
diff --git a/net/core/dev.c b/net/core/dev.c
index 93af533..a72669a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2032,6 +2032,27 @@ int netif_get_num_default_rss_queues(void)
 }
 EXPORT_SYMBOL(netif_get_num_default_rss_queues);
 
+static inline void __netif_tx_redispatch_queue(struct netdev_queue *txq)
+{
+	struct softnet_data *sd;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sd = &__get_cpu_var(softnet_data);
+	txq->next_dispatch = NULL;
+	sd->dispatch_queue = txq;
+	sd->dispatch_queue_tailp = &txq->next_dispatch;
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	local_irq_restore(flags);
+}
+
+void __netif_tx_dispatch_queue(struct netdev_queue *txq)
+{
+	if (!test_and_set_bit(__QUEUE_STATE_DISPATCH, &txq->state))
+		__netif_tx_redispatch_queue(txq);
+}
+EXPORT_SYMBOL(__netif_tx_dispatch_queue);
+
 static inline void __netif_reschedule(struct Qdisc *q)
 {
 	struct softnet_data *sd;
@@ -3268,6 +3289,41 @@ static void net_tx_action(struct softirq_action *h)
 			}
 		}
 	}
+
+	if (sd->dispatch_queue) {
+		struct netdev_queue *head;
+
+		local_irq_disable();
+		head = sd->dispatch_queue;
+		sd->dispatch_queue = NULL;
+		sd->dispatch_queue_tailp = &sd->dispatch_queue;
+		local_irq_enable();
+
+		while (head) {
+			struct netdev_queue *txq = head;
+			struct net_device *dev = txq->dev;
+			const struct net_device_ops *ops = dev->netdev_ops;
+
+			head = head->next_dispatch;
+
+			if (!HARD_TX_TRYLOCK(dev, txq)) {
+				__netif_tx_redispatch_queue(txq);
+				continue;
+			}
+
+			smp_mb__before_clear_bit();
+			clear_bit(__QUEUE_STATE_DISPATCH, &txq->state);
+
+			if (txq->dispatch_pending &&
+			    !netif_tx_queue_delayed(txq)) {
+				int index = txq - &dev->_tx[0];
+
+				ops->ndo_complete_xmit(dev, index);
+			}
+
+			HARD_TX_UNLOCK(dev, txq);
+		}
+	}
 }
 
 #if (defined(CONFIG_BRIDGE) || defined(CONFIG_BRIDGE_MODULE)) && \
@@ -6485,6 +6541,15 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 		oldsd->output_queue = NULL;
 		oldsd->output_queue_tailp = &oldsd->output_queue;
 	}
+
+	/* Append delayed xmit queue from offline CPU */
+	if (oldsd->dispatch_queue) {
+		*sd->dispatch_queue_tailp = oldsd->dispatch_queue;
+		sd->dispatch_queue_tailp = oldsd->dispatch_queue_tailp;
+		oldsd->dispatch_queue = NULL;
+		oldsd->dispatch_queue_tailp = &oldsd->dispatch_queue;
+	}
+
 	/* Append NAPI poll list from offline CPU. */
 	if (!list_empty(&oldsd->poll_list)) {
 		list_splice_init(&oldsd->poll_list, &sd->poll_list);
@@ -6772,6 +6837,8 @@ static int __init net_dev_init(void)
 		INIT_LIST_HEAD(&sd->poll_list);
 		sd->output_queue = NULL;
 		sd->output_queue_tailp = &sd->output_queue;
+		sd->dispatch_queue = NULL;
+		sd->dispatch_queue_tailp = &sd->dispatch_queue;
 #ifdef CONFIG_RPS
 		sd->csd.func = rps_trigger_softirq;
 		sd->csd.info = sd;
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 42bb496..4f7eb58 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -997,11 +997,47 @@ static struct netdev_queue_attribute xps_cpus_attribute =
     __ATTR(xps_cpus, S_IRUGO | S_IWUSR, show_xps_map, store_xps_map);
 #endif /* CONFIG_XPS */
 
+static ssize_t show_dispatch_limit(struct netdev_queue *queue,
+				   struct netdev_queue_attribute *attribute,
+				   char *buf)
+{
+	unsigned int dispatch_limit;
+
+	spin_lock_irq(&queue->_xmit_lock);
+	dispatch_limit = queue->dispatch_limit;
+	spin_unlock_irq(&queue->_xmit_lock);
+
+	return sprintf(buf, "%u\n", dispatch_limit);
+}
+
+static ssize_t store_dispatch_limit(struct netdev_queue *queue,
+				    struct netdev_queue_attribute *attribute,
+				    const char *buf, size_t len)
+{
+	unsigned int dispatch_limit;
+	int err;
+
+	err = kstrtouint(buf, 10, &dispatch_limit);
+	if (err < 0)
+		return err;
+
+	spin_lock_irq(&queue->_xmit_lock);
+	queue->dispatch_limit = dispatch_limit;
+	spin_unlock_irq(&queue->_xmit_lock);
+
+	return len;
+}
+
+static struct netdev_queue_attribute dispatch_limit_attribute =
+	__ATTR(dispatch_limit, S_IRUGO | S_IWUSR,
+	       show_dispatch_limit, store_dispatch_limit);
+
 static struct attribute *netdev_queue_default_attrs[] = {
 	&queue_trans_timeout.attr,
 #ifdef CONFIG_XPS
 	&xps_cpus_attribute.attr,
 #endif
+	&dispatch_limit_attribute.attr,
 	NULL
 };
 

^ permalink raw reply related

* [RFC PATCH 0/2] Coalesce MMIO writes for transmits
From: Alexander Duyck @ 2012-07-12  0:25 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert,
	alexander.duyck

This patch set is meant to address recent issues I found with ixgbe
performance being bound by Tx tail writes.  With these changes in place
and the dispatch_limit set to 1 or more I see a significant increase in
performance.

In the case of one of my systems I saw the routing rate for 7 queues jump
from 10.5 to 11.7Mpps.  The overall increase I have seen on most systems is
something on the order of about 15%.  In the case of pktgen I have also
seen a noticeable increase as the previous limit for transmits was
~12.5Mpps, but with this patch set in place and the dispatch_limit enabled
the value increases to ~14.2Mpps.

I expected there to be an increase in latency, however so far I have not
ran into that.  I have tried running NPtcp tests for latency and seen no
difference in the coalesced and non-coalesced transaction times.  I welcome
any suggestions for tests I might run that might expose any latency issues
as a result of this patch.

---

Alexander Duyck (2):
      ixgbe: Add functionality for delaying the MMIO write for Tx
      net: Add new network device function to allow for MMIO batching


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   22 +++++++-
 include/linux/netdevice.h                     |   57 +++++++++++++++++++++
 net/core/dev.c                                |   67 +++++++++++++++++++++++++
 net/core/net-sysfs.c                          |   36 +++++++++++++
 4 files changed, 180 insertions(+), 2 deletions(-)

-- 
Thanks,

Alex

^ permalink raw reply

* Re: Add missing license tag to nci.
From: David Miller @ 2012-07-12  0:24 UTC (permalink / raw)
  To: davej; +Cc: netdev, ilane, linux-wireless
In-Reply-To: <20120711183157.GA19417@redhat.com>

From: Dave Jones <davej@redhat.com>
Date: Wed, 11 Jul 2012 14:31:57 -0400

> nci: module license 'unspecified' taints kernel.
> 
> Signed-off-by: Dave Jones <davej@redhat.com>

NFC patches go to linux-wireless, CC:'d

> diff --git a/net/nfc/nci/core.c b/net/nfc/nci/core.c
> index d560e6f..f18f207 100644
> --- a/net/nfc/nci/core.c
> +++ b/net/nfc/nci/core.c
> @@ -27,6 +27,7 @@
>  
>  #define pr_fmt(fmt) KBUILD_MODNAME ": %s: " fmt, __func__
>  
> +#include <linux/module.h>
>  #include <linux/types.h>
>  #include <linux/workqueue.h>
>  #include <linux/completion.h>
> @@ -878,3 +879,5 @@ static void nci_cmd_work(struct work_struct *work)
>  			  jiffies + msecs_to_jiffies(NCI_CMD_TIMEOUT));
>  	}
>  }
> +
> +MODULE_LICENSE("GPL");
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH 2/2] bonding: debugfs and network namespaces are incompatible
From: Eric W. Biederman @ 2012-07-12  0:18 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev
In-Reply-To: <16903.1341947581@death.nxdomain>

Jay Vosburgh <fubar@us.ibm.com> writes:

> Eric W. Biederman <ebiederm@xmission.com> wrote:

>>I haven't run across any of those network devices, but if they create a
>>debugfs entry that embeds the device name it will be a problem.
>
> 	A quick grep suggests that cxgb4, skge, sky2, stmmac, ipoib and
> half a dozen of the wireless drivers all create files in debugfs.  I did
> not check exhaustively, but at least some of them include the device
> name.

Yep.  It looks like imperfect habits are common.

>>Last I looked any custom user space interface from network devices was
>>rare and bonding using debugfs is the first instance of using debugfs
>>from networking devices I have seen.
>>
>>I think the problem will be a little less severe for physical network
>>devices as they all start in the initial network namespace and so start
>>with distinct names.
>>
>>With bonding I can do "ip link add type bond" in any network namespace
>>and get another bond0.  So name conflicts are very much expeted with all
>>virtual networking devices.
>
> 	Fair enough, although it is trivial to rename any network device
> such that a conflict would occur.

Actually for userspace and administrative reasons frequently it isn't
trivial to rename devices.

>>But if you know of any other networking devices using debugsfs that
>>code should probably get the same treatment as the bonding debugfs code.
>
> 	Is there no alternative than simply disabling debugfs whenever
> network namespaces are enabled?  The information bonding displays via
> debugfs is useful, and having it unavailable on all distro kernels seems
> a bit harsh.


I took a good hard look at debugfs while writing this reply and debufs
scares me.  It is the kind of code that just about wants to me to throw
in the towel seeing no hope of a good solid kernel. 

I can definitely open a /sys/kernel/debug/bonding/bond0/rlb_hash_table
and delete the bond and then read the file.  On a bad day that will oops
the kernel, as there is nothing holding a reference to the network
device.  I think only the BOND_MODE_ALB check makes keeps the kernel
from oopsing in my quick tests.

The fact that debugfs is enabled in distro kernels is actually apalling
to me.  debugfs makes it easy to oops the kernel.

There are lots of alternatives to debugfs on where to put information
and the bonding driver already uses most of them.

> 	Why is the logic already in the driver not sufficient?  If the
> attempt to create the debugfs directory with the interface name fails,
> then it merely prints a warning and continues without the debugfs for
> that interface.

All I know for certain is the existing logic will eventually cause
someone doing something reasonable to send me a bug report.

I can see where you are coming from in that the bonding driver debugfs
code really was built to gracefully fail and ignore problems of instead
of just hapharzardly and sloppily ignore problems.  At the same time
I can oops the kernel if I try with your debugfs in the bonding driver.

But it causes the code to fail and issue a warning.  So if I don't
disable the code now, I expect I will get a bug report, and who
knows how many sill files in bonding will have in debugfs by then.
what silly things bonding may be doing in debugfs by then.

Eric

^ permalink raw reply

* [PATCH 2/2] net: Update alloc frag to reduce get/put page usage and recycle pages
From: Alexander Duyck @ 2012-07-12  0:18 UTC (permalink / raw)
  To: netdev
  Cc: davem, jeffrey.t.kirsher, alexander.duyck, Eric Dumazet,
	Alexander Duyck
In-Reply-To: <20120712001804.26542.2889.stgit@gitlad.jf.intel.com>

This patch does several things.

First it reorders the netdev_alloc_frag code so that only one conditional
check is needed in most cases instead of 2.

Second it incorporates the atomic_set and atomic_sub_return logic from an
earlier proposed patch by Eric Dumazet allowing for a reduction in the
get_page/put_page overhead when dealing with frags.

Finally it also incorporates the page reuse code so that if the page count
is dropped to 0 we can just reinitialize the page and reuse it.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 net/core/skbuff.c |   37 +++++++++++++++++++++++++------------
 1 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 506f678..69f4add 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -296,9 +296,12 @@ EXPORT_SYMBOL(build_skb);
 struct netdev_alloc_cache {
 	struct page *page;
 	unsigned int offset;
+	unsigned int pagecnt_bias;
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
 
+#define NETDEV_PAGECNT_BIAS (PAGE_SIZE / SMP_CACHE_BYTES)
+
 /**
  * netdev_alloc_frag - allocate a page fragment
  * @fragsz: fragment size
@@ -311,23 +314,33 @@ void *netdev_alloc_frag(unsigned int fragsz)
 	struct netdev_alloc_cache *nc;
 	void *data = NULL;
 	unsigned long flags;
+	unsigned int offset;
 
 	local_irq_save(flags);
 	nc = &__get_cpu_var(netdev_alloc_cache);
-	if (unlikely(!nc->page)) {
-refill:
+	offset = nc->offset;
+	if (unlikely(offset < fragsz)) {
+		BUG_ON(PAGE_SIZE < fragsz);
+
+		if (likely(nc->page) &&
+		    atomic_sub_and_test(nc->pagecnt_bias, &nc->page->_count))
+			goto recycle;
+
 		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
-		nc->offset = 0;
-	}
-	if (likely(nc->page)) {
-		if (nc->offset + fragsz > PAGE_SIZE) {
-			put_page(nc->page);
-			goto refill;
+		if (unlikely(!nc->page)) {
+			offset = 0;
+			goto end;
 		}
-		data = page_address(nc->page) + nc->offset;
-		nc->offset += fragsz;
-		get_page(nc->page);
-	}
+recycle:
+		atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
+		nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
+		offset = PAGE_SIZE;
+	}
+	offset -= fragsz;
+	nc->pagecnt_bias--;
+	data = page_address(nc->page) + offset;
+end:
+	nc->offset = offset;
 	local_irq_restore(flags);
 	return data;
 }

^ permalink raw reply related

* [PATCH 1/2] tcp: Fix out of bounds access to tcpm_vals
From: Alexander Duyck @ 2012-07-12  0:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, jeffrey.t.kirsher, alexander.duyck, Alexander Duyck

The recent patch "tcp: Maintain dynamic metrics in local cache." introduced
an out of bounds access due to what appears to be a typo.   I believe this
change should resolve the issue by replacing the access to RTAX_CWND with
TCP_METRIC_CWND.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 net/ipv4/tcp_metrics.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 1fd83d3..5a38a2d 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -412,7 +412,7 @@ void tcp_update_metrics(struct sock *sk)
 				       max(tp->snd_cwnd >> 1, tp->snd_ssthresh));
 		if (!tcp_metric_locked(tm, TCP_METRIC_CWND)) {
 			val = tcp_metric_get(tm, TCP_METRIC_CWND);
-			tcp_metric_set(tm, RTAX_CWND, (val + tp->snd_cwnd) >> 1);
+			tcp_metric_set(tm, TCP_METRIC_CWND, (val + tp->snd_cwnd) >> 1);
 		}
 	} else {
 		/* Else slow start did not finish, cwnd is non-sense,

^ permalink raw reply related

* Re: [PATCH v3 net-next] tcp: TCP Small Queues
From: Eric Dumazet @ 2012-07-12  0:00 UTC (permalink / raw)
  To: Nandita Dukkipati; +Cc: netdev, codel, ncardwell, David Miller, mattmathis
In-Reply-To: <CAB_+Fg496FgQZkJE8uf_4RX_wW+fz1p5Rq8rY=BuA=c6KPakpA@mail.gmail.com>

Note : netdev mailing list doesnt accept HTML ;)

On Wed, 2012-07-11 at 12:29 -0700, Nandita Dukkipati wrote:
> I have a couple of high level questions on the two solutions for
> maintain small queues at qdisc: 1) using codel's feedback versus 2)
> the approach in TSQ to explicitly limit bytes per-flow
> to tcp_limit_output_bytes.
> 
> 
> 1. multiple flows case: codel feedback (be it drop/ECN or direct
> feedback like in your patch with fq_codel: report congestion
> notification at enqueue time) seems to work in maintaining small
> queues regardless of the number of connections. TSQ probably won't
> achieve this goal when number of flows is large. Is this correct?


Problem with codel/fq_codel is that the congestion is delayed.

And my patch (signaling congestion to local tcp senders) raised several
issues.

I had then TSQ idea to have a more general mechanism (not depending on a
particular qdisc setup)

TCP Small Queue is effective even with a large number of flows because
with standard pfifo_fast, we can without TCP Small Queue store 1000
packets in Qdisc, each being 64KB. Thats 64 MBytes...

With TCP Small Queue, you can reach this level of occupancy using 500
flows (2 packets per flow)

Without TCP Small Queue, you can reach this level using 16 flows.

(This of course assumes TSO is on, but modern NIC are TSO enabled)

> 2. cwnd adjustment: codel feedback directly and explicitly controls
> snd_cwnd which in turn will also adjust the send buffer size to
> appropriate value. That's a nice property, which TSQ probably won't
> have because it doesn't explicitly adjust snd_cwnd. 
> 
> Considering these two points, why TSQ over the Codel feedback
> approach?

I dont think they compete. They are in fact complementary.

If you use codel/fq_codel + TSQ, you have both per flow limitation in
qdisc (TSQ) + sojourn time aware and multi flow aware feedback.

^ permalink raw reply

* Re: [RFC PATCH v2] tcp: TCP Small Queues
From: Eric Dumazet @ 2012-07-11 23:49 UTC (permalink / raw)
  To: Rick Jones; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
In-Reply-To: <4FFDC985.6050805@hp.com>

On Wed, 2012-07-11 at 11:44 -0700, Rick Jones wrote:
> On 07/11/2012 08:11 AM, Eric Dumazet wrote:
> >
> > Some bench results about the choice of 128KB being the default value:
> 
> What were the starting/baseline figures?
> 
> >
> > Tests using a single TCP flow.
> >
> > Tests on 10Gbit links :
> >
> >
> > echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> > OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> > tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> > tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> > tcpi_reordering 53 tcpi_total_retrans 0
> > Local       Local       Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> > Send Socket Send Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> > Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> > Final       Final                                             %     Method %      Method
> > 392360      392360      16384  20.00   1389.53    10^6bits/s  0.52  S      4.30   S      0.737   1.014   usec/KB
> 
> By the way, that double reporting of the local socket send size is fixed in:
> 
> ------------------------------------------------------------------------
> r516 | raj | 2012-01-05 15:48:52 -0800 (Thu, 05 Jan 2012) | 1 line
> 
> report the rsr_size_end in an omni stream test rather than a copy of the 
> lss_size_end
> 
> of netperf and later.  Also, any idea why the local socket send size got 
> so much larger with 1GbE than 10 GbE at that setting of 
> tcp_limit_output_bytes?

The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
ubuntu kernel. They probably have very different TCP behavior.

^ permalink raw reply

* Re: [PATCH v3 net-next] tcp: TCP Small Queues
From: Eric Dumazet @ 2012-07-11 23:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: nanditad, netdev, codel, mattmathis, ncardwell, David Miller
In-Reply-To: <m2r4sirvmg.fsf@firstfloor.org>

On Wed, 2012-07-11 at 11:38 -0700, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> > +
> > +		if (!sock_owned_by_user(sk)) {
> > +			if ((1 << sk->sk_state) &
> > +			    (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 |
> > +			     TCPF_CLOSING | TCPF_CLOSE_WAIT))
> > +				tcp_write_xmit(sk,
> > +					       tcp_current_mss(sk),
> > +					       0, 0,
> > +					       GFP_ATOMIC);
> 
> Did you have any problems with the GFP_ATOMIC allocation failing here?
> I think you move some skb allocs from process context to ATOMIC, right?
> 
> It relies on the VM somehow catching up in another context.
> 
> May be interesting to try out under very high memory pressure.

AFAIK this patch has no effect on memory allocations, because :

At write() time, we use GFP_KERNEL allocations to build the socket write
queue.

And each skb is "precloned", that is allocated from skbuff_fclone_cache.

Then, when we select some skbs from write for transmission, we clone
them. And this cloning doesnt allocate memory, because we use the fast
cloning.

There are some pathological cases were we do allocate memory, but TSQ
should lower probabilities :

1) when we skb_clone(skb), we might need an allocation if the previous
clone is still in flight. This is a retransmit case, and TSQ only takes
care of transmits (of previously unsent skbs : their fast clone is
available)

2) When we need to split the skb because not enough cwnd is available.
   As TSQ tends to limit number of skbs in qdisc, we lower the
probabilities of such splits. Anyway a TSO split only need an sk_buff,
this is not a lot of memory.

3) On bulk sends, most transmits are triggered by incoming ACKS, in
softirq context, so already using GFP_ATOMIC "allocations", if
ever needed (see point 1)

Thanks

^ permalink raw reply

* Re: [RFC PATCH v2] tcp: TCP Small Queues
From: Eric Dumazet @ 2012-07-11 23:38 UTC (permalink / raw)
  To: Rick Jones; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
In-Reply-To: <4FFDC48D.2030606@hp.com>

On Wed, 2012-07-11 at 11:23 -0700, Rick Jones wrote:
> On 07/11/2012 08:11 AM, Eric Dumazet wrote:
> >
> >
> > Tests using a single TCP flow.
> >
> > Tests on 10Gbit links :
> >
> >
> > echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> > OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> > tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> > tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> > tcpi_reordering 53 tcpi_total_retrans 0
> 
> I take it you hacked your local copy of netperf to emit those?  Or did I 
> leave some cruft behind in something I committed to the repository?
> 
Yep, its a netperf-2.5.0 with a one line change to output these TCP_INFO
bits

> What was the ultimate limiter on throughput?  I notice it didn't achieve 
> link-rate on either 10 GbE nor 1 GbE.
> 

My lab has one fast machine (source in this 10Gb test), and one slow
machine (Intel Q6600 quad core), both with ixgbe cards.

On Gigabit test, the receiver is a laptop.


>  > Thats the plan : limiting numer of bytes in Qdisc, not number of bytes
>  > in socket write queue.
> 
> So the SO_SNDBUF can still grow rather larger than necessary?  It is 
> just that TCP will be nice to the other flows by not dumping all of it 
> into the qdisc at once.  Latency seen by the application itself is then 
> unchanged since there will still be (potentially) as much stuff queued 
> in the SO_SNDBUF as before right?

Of course SO_SNDBUF can grow if autotuning is enabled.

I think there is a bit of misunderstanding about this patch and what it
does.

It only makes sure the amount of packets (from socket write queue) are
cloned in qdisc/device queue in a limited way, not "as much as allowed"

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox