Netdev List
* Re: [PATCH 1/2] tcp: Fix out of bounds access to tcpm_vals
From: David Miller @ 2012-07-12  0:32 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: netdev, jeffrey.t.kirsher, alexander.duyck
In-Reply-To: <20120712001804.26542.2889.stgit@gitlad.jf.intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Wed, 11 Jul 2012 17:18:04 -0700

> The recent patch "tcp: Maintain dynamic metrics in local cache." introduced
> an out of bounds access due to what appears to be a typo.   I believe this
> change should resolve the issue by replacing the access to RTAX_CWND with
> TCP_METRIC_CWND.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Applied, thanks a lot.

How did you spot this, did you get a compiler warning?

I ask because while working on this, I at one point put the
tcp timestamp members after the metrics array in the
tcp_metrics_bucket struct.  And I got a warning from gcc about
an array bounds violation that I could not figure out.

I am pretty certain this bug here is what it was warning about.  And
the problem is that if you put the array at the end gcc doesn't warn
in order to handle things similar to what people use zero length
arrays for.
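
To illustrate the class of bug discussed here, a minimal sketch (not the kernel code): an array sized by one enum is indexed with a constant from a different, larger enum. The enum names and values below are only modeled on the RTAX_* and TCP_METRIC_* constants and are assumptions made for illustration; sketch_index_ok() and sketch_metric_set() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout, modeled on the route attribute indices (RTAX_*). */
enum sketch_rtax {
	SK_RTAX_UNSPEC,
	SK_RTAX_LOCK,
	SK_RTAX_MTU,
	SK_RTAX_WINDOW,
	SK_RTAX_RTT,
	SK_RTAX_RTTVAR,
	SK_RTAX_SSTHRESH,
	SK_RTAX_CWND,			/* 7 in this sketch */
};

/* Assumed layout, modeled on the compact TCP metrics indices. */
enum sketch_tcp_metric {
	SK_TCP_METRIC_RTT,
	SK_TCP_METRIC_RTTVAR,
	SK_TCP_METRIC_SSTHRESH,
	SK_TCP_METRIC_CWND,		/* 3 in this sketch */
	SK_TCP_METRIC_REORDERING,
	SK_TCP_METRIC_COUNT,		/* number of entries, always last */
};

struct sketch_bucket {
	unsigned int vals[SK_TCP_METRIC_COUNT];
};

/* Returns 1 if idx is a valid index into sketch_bucket.vals. */
static int sketch_index_ok(size_t idx)
{
	return idx < sizeof(((struct sketch_bucket *)0)->vals) /
		     sizeof(unsigned int);
}

/* Store a metric; the assert would trip if SK_RTAX_CWND were passed. */
static void sketch_metric_set(struct sketch_bucket *b, size_t idx,
			      unsigned int val)
{
	assert(sketch_index_ok(idx));
	b->vals[idx] = val;
}
```

Under these assumed values, indexing with the route-attribute constant lands past the end of the array, while the metrics constant stays inside it.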

^ permalink raw reply

* Re: [PATCH v3 net-next] tcp: TCP Small Queues
From: Nandita Dukkipati @ 2012-07-12  0:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, codel, ncardwell, David Miller, mattmathis
In-Reply-To: <1342051233.3265.8206.camel@edumazet-glaptop>

>> Considering these two points, why TSQ over the Codel feedback
>> approach?
>
> I don't think they compete. They are in fact complementary.
>
> If you use codel/fq_codel + TSQ, you have both per flow limitation in
> qdisc (TSQ) + sojourn time aware and multi flow aware feedback.

Makes sense. My conjecture is that when using a codel/fq_codel qdisc, the
need for TSQ will diminish. But as you said, the good part of TSQ is that it
limits per-flow queuing for any qdisc structure, even those not using
codel.

^ permalink raw reply

* Re: [PATCH 2/2] net: Update alloc frag to reduce get/put page usage and recycle pages
From: David Miller @ 2012-07-12  1:11 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher, alexander.duyck,
	edumazet
In-Reply-To: <1342052967.3265.8210.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 12 Jul 2012 02:29:27 +0200

> On Wed, 2012-07-11 at 17:18 -0700, Alexander Duyck wrote:
>> This patch does several things.
>> 
>> First it reorders the netdev_alloc_frag code so that only one conditional
>> check is needed in most cases instead of 2.
>> 
>> Second it incorporates the atomic_set and atomic_sub_return logic from an
>> earlier proposed patch by Eric Dumazet allowing for a reduction in the
>> get_page/put_page overhead when dealing with frags.
>> 
>> Finally it also incorporates the page reuse code so that if the page count
>> is dropped to 0 we can just reinitialize the page and reuse it.
>> 
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
> 
> 
> Hmm, I was working on a version using order-3 pages if available.
> 
> (or more exactly 32768 bytes chunks)
> 
> I am not sure how your version can help with typical 1500 allocations
> (2 skbs per page)

I'd like you two to sort things out before I apply anything, thanks :)

^ permalink raw reply

* Re: [PATCH v3 net-next] tcp: TCP Small Queues
From: David Miller @ 2012-07-12  1:14 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nanditad, netdev, codel, ncardwell, mattmathis
In-Reply-To: <1342021831.3265.8174.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 11 Jul 2012 17:50:31 +0200

> This introduces TSQ (TCP Small Queues)

Applied, fingers crossed, thanks :-)

^ permalink raw reply

* Re: [PATCH] tc_util: fix incorrect bare number process in get_rate.
From: Li Wei @ 2012-07-12  1:16 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20120711075129.4f81eea8@nehalam.linuxnetplumber.net>

On 2012-7-11 22:51, Stephen Hemminger wrote:
> On Wed, 11 Jul 2012 15:24:50 +0800
> Li Wei <lw@cn.fujitsu.com> wrote:
> 
>>
>> The comment and man page indicate that a bare number means
>> bytes per second, so the division is not needed.
>> ---
>>  tc/tc_util.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/tc/tc_util.c b/tc/tc_util.c
>> index 926ed08..e8d89c1 100644
>> --- a/tc/tc_util.c
>> +++ b/tc/tc_util.c
>> @@ -153,7 +153,7 @@ int get_rate(unsigned *rate, const char *str)
>>  		return -1;
>>  
>>  	if (*p == '\0') {
>> -		*rate = bps / 8.;	/* assume bytes/sec */
>> +		*rate = bps;	/* assume bytes/sec */
>>  		return 0;
>>  	}
>>  
> Thanks for finding this. The documentation, code and comment do
> all need to be the same!
> 
> But changing the code as you propose would break existing usage
> by scripts. Instead, the man page and comment need to change
> to match the reality of the existing application.
> 
> 
Well, I see, I'll send another patch to take care of this.

Thanks,

Wei

^ permalink raw reply

* Re: [PATCH 1/2] tcp: Fix out of bounds access to tcpm_vals
From: Alexander Duyck @ 2012-07-12  1:46 UTC (permalink / raw)
  To: David Miller; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher
In-Reply-To: <20120711.173249.1303803416502735349.davem@davemloft.net>

On 7/11/2012 5:32 PM, David Miller wrote:
> From: Alexander Duyck<alexander.h.duyck@intel.com>
> Date: Wed, 11 Jul 2012 17:18:04 -0700
>
>> The recent patch "tcp: Maintain dynamic metrics in local cache." introduced
>> an out of bounds access due to what appears to be a typo.   I believe this
>> change should resolve the issue by replacing the access to RTAX_CWND with
>> TCP_METRIC_CWND.
>>
>> Signed-off-by: Alexander Duyck<alexander.h.duyck@intel.com>
> Applied, thanks a lot.
>
> How did you spot this, did you get a compiler warning?
>
> I ask because while working on this, I at one point put the
> tcp timestamp members after the metrics array in the
> tcp_metrics_bucket struct.  And I got a warning from gcc about
> an array bounds violation that I could not figure out.
>
> I am pretty certain this bug here is what it was warning about.  And
> the problem is that if you put the array at the end gcc doesn't warn
> in order to handle things similar to what people use zero length
> arrays for.
It came up as a compiler warning.  I suspect it may have something to do 
with the optimizations I had turned on since it complained that the 
issue was in tcp_update_metrics but then reported it on the one line in 
tcp_metric_set.

Thanks,

Alex

^ permalink raw reply

* Re: [PATCH 2/2] bonding: debugfs and network namespaces are incompatible
From: Jay Vosburgh @ 2012-07-12  1:57 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <877gu9omrr.fsf@xmission.com>

Eric W. Biederman <ebiederm@xmission.com> wrote:

>Jay Vosburgh <fubar@us.ibm.com> writes:
>
>> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>>>I haven't run across any of those network devices, but if they create a
>>>debugfs entry that embeds the device name it will be a problem.
>>
>> 	A quick grep suggests that cxgb4, skge, sky2, stmmac, ipoib and
>> half a dozen of the wireless drivers all create files in debugfs.  I did
>> not check exhaustively, but at least some of them include the device
>> name.
>
>Yep.  It looks like imperfect habits are common.
>
>>>Last I looked any custom user space interface from network devices was
>>>rare and bonding using debugfs is the first instance of using debugfs
>>>from networking devices I have seen.
>>>
>>>I think the problem will be a little less severe for physical network
>>>devices as they all start in the initial network namespace and so start
>>>with distinct names.
>>>
>>>With bonding I can do "ip link add type bond" in any network namespace
>>>and get another bond0.  So name conflicts are very much expected with all
>>>virtual networking devices.
>>
>> 	Fair enough, although it is trivial to rename any network device
>> such that a conflict would occur.
>
>Actually for userspace and administrative reasons frequently it isn't
>trivial to rename devices.

	Well, perhaps it's uncommon for users to do so, but "ip link set
dev eth0 name eth44" is pretty easy to do.

>>>But if you know of any other networking devices using debugfs, that
>>>code should probably get the same treatment as the bonding debugfs code.
>>
>> 	Is there no alternative than simply disabling debugfs whenever
>> network namespaces are enabled?  The information bonding displays via
>> debugfs is useful, and having it unavailable on all distro kernels seems
>> a bit harsh.
>
>
>I took a good hard look at debugfs while writing this reply and debugfs
>scares me.  It is the kind of code that just about makes me want to throw
>in the towel, seeing no hope of a good solid kernel.
>
>I can definitely open a /sys/kernel/debug/bonding/bond0/rlb_hash_table
>and delete the bond and then read the file.  On a bad day that will oops
>the kernel, as there is nothing holding a reference to the network
>device.  I think only the BOND_MODE_ALB check keeps the kernel
>from oopsing in my quick tests.
>
>The fact that debugfs is enabled in distro kernels is actually appalling
>to me.  debugfs makes it easy to oops the kernel.

	I'm not so sure things are that bad.  I cannot unload the
bonding module while a program holds an open file descriptor on its
debugfs file (it appears to hold a reference to the module), so uses
that only remove the debugfs file on module unload shouldn't have a
problem.

	The /proc file that bonding removes when an interface is
dynamically removed does not have this problem, as subsequent reads on
that file descriptor will fail.  I suspect that's because
remove_proc_entry NULLs the proc_fops, whereas debugfs_remove does not
do the equivalent for its case.  It may not be that simple, though; I'm
just looking at the code and have not tested anything.

>There are lots of alternatives to debugfs on where to put information
>and the bonding driver already uses most of them.
>
>> 	Why is the logic already in the driver not sufficient?  If the
>> attempt to create the debugfs directory with the interface name fails,
>> then it merely prints a warning and continues without the debugfs for
>> that interface.
>
>All I know for certain is the existing logic will eventually cause
>someone doing something reasonable to send me a bug report.
>
>I can see where you are coming from, in that the bonding driver debugfs
>code really was built to fail gracefully and ignore problems, instead
>of just haphazardly and sloppily ignoring them.  At the same time,
>I can oops the kernel if I try with your debugfs in the bonding driver.
>
>But it causes the code to fail and issue a warning.  So if I don't
>disable the code now, I expect I will get a bug report, and who
>knows how many silly files bonding will have in debugfs by then.

	Or perhaps we can fix the debugfs support to function correctly
even in the face of network namespaces.  For example, do namespaces have
a unique name or identifier that can go into the debugfs name?

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
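
The idea raised in the last paragraph can be sketched in userspace as follows. This is only an illustration under a loud assumption: that some unique per-namespace identifier exists, which the thread leaves open. The helper sketch_debugfs_name(), the ns_id parameter and the "ns<id>-<ifname>" format are all hypothetical and are not part of the bonding driver or of debugfs.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: build a directory name that stays unique across
 * namespaces by combining an assumed per-namespace id with the interface
 * name. Returns 0 on success, -1 if the buffer is too small.
 */
static int sketch_debugfs_name(char *buf, size_t len,
			       unsigned int ns_id, const char *ifname)
{
	int n;

	assert(buf != NULL && ifname != NULL);
	n = snprintf(buf, len, "ns%u-%s", ns_id, ifname);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With such a scheme, two bond0 devices in different namespaces would map to different directory names, so creating the second one would not collide with the first.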

^ permalink raw reply

* [PATCH] tc: man: change man page and comment to conform to code's behavior.
From: Li Wei @ 2012-07-12  1:56 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20120711075129.4f81eea8@nehalam.linuxnetplumber.net>


get_rate() interprets a bare number as bits per second, which is not
what the man page and the code comment describe.

Change the man page and the comment to match the code, so that the
existing usage by scripts keeps working.
---
 man/man8/tc.8 |    7 +++++--
 tc/tc_util.c  |    2 +-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 958ab98..f0e5613 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -259,6 +259,9 @@ All parameters accept a floating point number, possibly followed by a unit.
 .P
 Bandwidths or rates can be specified in:
 .TP
+bps
+Bytes per second
+.TP
 kbps
 Kilobytes per second
 .TP
@@ -271,8 +274,8 @@ Kilobits per second
 mbit
 Megabits per second
 .TP
-bps or a bare number
-Bytes per second
+bit or a bare number
+Bits per second
 .P
 Amounts of data can be specified in:
 .TP
diff --git a/tc/tc_util.c b/tc/tc_util.c
index 926ed08..ccf8fa4 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -153,7 +153,7 @@ int get_rate(unsigned *rate, const char *str)
 		return -1;
 
 	if (*p == '\0') {
-		*rate = bps / 8.;	/* assume bytes/sec */
+		*rate = bps / 8.;	/* assume bits/sec */
 		return 0;
 	}
 
-- 
1.7.1
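
The documented behavior after this patch can be sketched as below. This is a simplified, hypothetical stand-in for get_rate(), not the tc source: sketch_get_rate() is an invented name, only the "bit" and "bps" suffixes are handled, and the result is assumed to be stored in bytes per second as in the hunk above.

```c
#include <assert.h>
#include <stdlib.h>
#include <strings.h>

/* Hypothetical, simplified rate parser; *rate is in bytes per second. */
static int sketch_get_rate(unsigned *rate, const char *str)
{
	char *p;
	double v;

	assert(rate != NULL && str != NULL);
	v = strtod(str, &p);
	if (p == str || v < 0)
		return -1;		/* no number parsed, or negative */

	if (*p == '\0' || strcasecmp(p, "bit") == 0) {
		*rate = v / 8.;		/* bare number or "bit": bits/sec */
		return 0;
	}
	if (strcasecmp(p, "bps") == 0) {
		*rate = v;		/* "bps": already bytes/sec */
		return 0;
	}
	return -1;			/* other units omitted in this sketch */
}
```

In this sketch "8000" and "8000bit" both yield 1000 bytes per second, while "8000bps" yields 8000.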

^ permalink raw reply related

* Re: [PATCH 2/2] net: Update alloc frag to reduce get/put page usage and recycle pages
From: Alexander Duyck @ 2012-07-12  2:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher, Eric Dumazet
In-Reply-To: <1342052967.3265.8210.camel@edumazet-glaptop>

On 7/11/2012 5:29 PM, Eric Dumazet wrote:
> On Wed, 2012-07-11 at 17:18 -0700, Alexander Duyck wrote:
>> This patch does several things.
>>
>> First it reorders the netdev_alloc_frag code so that only one conditional
>> check is needed in most cases instead of 2.
>>
>> Second it incorporates the atomic_set and atomic_sub_return logic from an
>> earlier proposed patch by Eric Dumazet allowing for a reduction in the
>> get_page/put_page overhead when dealing with frags.
>>
>> Finally it also incorporates the page reuse code so that if the page count
>> is dropped to 0 we can just reinitialize the page and reuse it.
>>
>> Cc: Eric Dumazet<edumazet@google.com>
>> Signed-off-by: Alexander Duyck<alexander.h.duyck@intel.com>
>> ---
>
> Hmm, I was working on a version using order-3 pages if available.
>
> (or more exactly 32768 bytes chunks)
>
> I am not sure how your version can help with typical 1500 allocations
> (2 skbs per page)
>
>
The gain will be minimal if any with the 1500 byte allocations, however 
there shouldn't be a performance degradation.

I was thinking more of the ixgbe case where we are working with only 256 
byte allocations and can recycle pages in the case of GRO or TCP.  For 
ixgbe the advantages are significant since we drop a number of the 
get_page calls and get the advantage of the page recycling.  So for 
example with GRO enabled we should only have to allocate 1 page for 
headers every 16 buffers, and the 6 slots we use in that page have a 
good likelihood of being warm in the cache since we just keep looping on 
the same page.

Thanks,

Alex
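
The reference-count batching described in this thread (one atomic_set up front, one atomic_sub_return when the allocator is done with the page) can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions, not the patch itself: struct sketch_page, SKETCH_BIAS and all sketch_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_PAGE_SIZE 4096u
/* Assumed bias; must be at least the maximum number of frags per page. */
#define SKETCH_BIAS ((int)SKETCH_PAGE_SIZE)

struct sketch_page {
	atomic_int refcnt;	/* shared with whoever frees fragments */
	int handed_out;		/* allocator-private count, no atomics */
	unsigned int offset;	/* next free byte in the page */
};

/* Take all references up front with a single atomic store. */
static void sketch_page_init(struct sketch_page *p)
{
	atomic_store(&p->refcnt, SKETCH_BIAS);
	p->handed_out = 0;
	p->offset = 0;
}

/* Hand out one fragment; returns its offset, or -1 if the page is full. */
static int sketch_frag_alloc(struct sketch_page *p, unsigned int size)
{
	unsigned int off = p->offset;

	assert(size > 0 && size <= SKETCH_PAGE_SIZE);
	if (off + size > SKETCH_PAGE_SIZE)
		return -1;
	p->offset = off + size;
	p->handed_out++;	/* the reference was already taken in init */
	return (int)off;
}

/* A consumer drops the reference held by one fragment. */
static void sketch_frag_free(struct sketch_page *p)
{
	atomic_fetch_sub(&p->refcnt, 1);
}

/*
 * The allocator is done with the page: give back the references that were
 * never handed out. Returns 1 if no fragment is still in use, so the page
 * can be reinitialized and reused instead of being freed.
 */
static int sketch_page_retire(struct sketch_page *p)
{
	int unused = SKETCH_BIAS - p->handed_out;

	return atomic_fetch_sub(&p->refcnt, unused) - unused == 0;
}
```

The point of the design is that handing out a fragment touches only allocator-private fields; the shared counter is modified once per page by the allocator and once per fragment by whoever frees it.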

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the infiniband tree
From: Stephen Rothwell @ 2012-07-12  2:09 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Jack Morgenstein, Roland Dreier,
	linux-rdma, Hadar Hen Zion, Or Gerlitz

[-- Attachment #1: Type: text/plain, Size: 1561 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in
drivers/net/ethernet/mellanox/mlx4/main.c between commit 6634961c14d3
("mlx4: Put physical GID and P_Key table sizes in mlx4_phys_caps struct
and paravirtualize them") from the infiniband tree and commit
0ff1fb654bec ("{NET, IB}/mlx4: Add device managed flow steering firmware
API") from the net-next tree.

Just context changes (I think).  I have fixed it up (see below) and can
carry the fix as necessary.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc drivers/net/ethernet/mellanox/mlx4/main.c
index 5df3ac4,4264516..0000000
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@@ -1232,10 -1231,26 +1258,29 @@@ static int mlx4_init_hca(struct mlx4_de
  			goto err_stop_fw;
  		}
  
 +		if (mlx4_is_master(dev))
 +			mlx4_parav_master_pf_caps(dev);
 +
+ 		priv->fs_hash_mode = MLX4_FS_L2_HASH;
+ 
+ 		switch (priv->fs_hash_mode) {
+ 		case MLX4_FS_L2_HASH:
+ 			init_hca.fs_hash_enable_bits = 0;
+ 			break;
+ 
+ 		case MLX4_FS_L2_L3_L4_HASH:
+ 			/* Enable flow steering with
+ 			 * udp unicast and tcp unicast
+ 			 */
+ 			init_hca.fs_hash_enable_bits =
+ 				MLX4_FS_UDP_UC_EN | MLX4_FS_TCP_UC_EN;
+ 			break;
+ 		}
+ 
  		profile = default_profile;
+ 		if (dev->caps.steering_mode ==
+ 		    MLX4_STEERING_MODE_DEVICE_MANAGED)
+ 			profile.num_mcg = MLX4_FS_NUM_MCG;
  
  		icm_size = mlx4_make_profile(dev, &profile, &dev_cap,
  					     &init_hca);

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the infiniband tree
From: Stephen Rothwell @ 2012-07-12  2:13 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Hadar Hen Zion, Or Gerlitz,
	Jack Morgenstein, Roland Dreier, linux-rdma

[-- Attachment #1: Type: text/plain, Size: 3011 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in
include/linux/mlx4/device.h between commit 396f2feb05d7 ("mlx4_core:
Implement mechanism for reserved Q_Keys") from the infiniband tree and
commit 0ff1fb654bec ("{NET, IB}/mlx4: Add device managed flow steering
firmware API") from the net-next tree.

Just context changes.  I fixed it up (see below) and can carry the fix
as necessary.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc include/linux/mlx4/device.h
index 441caf1,6f0d133..0000000
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@@ -540,83 -542,10 +573,85 @@@ struct mlx4_dev 
  	u8			rev_id;
  	char			board_id[MLX4_BOARD_ID_LEN];
  	int			num_vfs;
+ 	u64			regid_promisc_array[MLX4_MAX_PORTS + 1];
+ 	u64			regid_allmulti_array[MLX4_MAX_PORTS + 1];
  };
  
 +struct mlx4_eqe {
 +	u8			reserved1;
 +	u8			type;
 +	u8			reserved2;
 +	u8			subtype;
 +	union {
 +		u32		raw[6];
 +		struct {
 +			__be32	cqn;
 +		} __packed comp;
 +		struct {
 +			u16	reserved1;
 +			__be16	token;
 +			u32	reserved2;
 +			u8	reserved3[3];
 +			u8	status;
 +			__be64	out_param;
 +		} __packed cmd;
 +		struct {
 +			__be32	qpn;
 +		} __packed qp;
 +		struct {
 +			__be32	srqn;
 +		} __packed srq;
 +		struct {
 +			__be32	cqn;
 +			u32	reserved1;
 +			u8	reserved2[3];
 +			u8	syndrome;
 +		} __packed cq_err;
 +		struct {
 +			u32	reserved1[2];
 +			__be32	port;
 +		} __packed port_change;
 +		struct {
 +			#define COMM_CHANNEL_BIT_ARRAY_SIZE	4
 +			u32 reserved;
 +			u32 bit_vec[COMM_CHANNEL_BIT_ARRAY_SIZE];
 +		} __packed comm_channel_arm;
 +		struct {
 +			u8	port;
 +			u8	reserved[3];
 +			__be64	mac;
 +		} __packed mac_update;
 +		struct {
 +			__be32	slave_id;
 +		} __packed flr_event;
 +		struct {
 +			__be16  current_temperature;
 +			__be16  warning_threshold;
 +		} __packed warming;
 +		struct {
 +			u8 reserved[3];
 +			u8 port;
 +			union {
 +				struct {
 +					__be16 mstr_sm_lid;
 +					__be16 port_lid;
 +					__be32 changed_attr;
 +					u8 reserved[3];
 +					u8 mstr_sm_sl;
 +					__be64 gid_prefix;
 +				} __packed port_info;
 +				struct {
 +					__be32 block_ptr;
 +					__be32 tbl_entries_mask;
 +				} __packed tbl_change_info;
 +			} params;
 +		} __packed port_mgmt_change;
 +	}			event;
 +	u8			slave_id;
 +	u8			reserved3[2];
 +	u8			owner;
 +} __packed;
 +
  struct mlx4_init_port_param {
  	int			set_guid0;
  	int			set_node_guid;
@@@ -783,6 -793,8 +908,10 @@@ int mlx4_wol_write(struct mlx4_dev *dev
  int mlx4_counter_alloc(struct mlx4_dev *dev, u32 *idx);
  void mlx4_counter_free(struct mlx4_dev *dev, u32 idx);
  
 +int mlx4_get_parav_qkey(struct mlx4_dev *dev, u32 qpn, u32 *qkey);
 +
+ int mlx4_flow_attach(struct mlx4_dev *dev,
+ 		     struct mlx4_net_trans_rule *rule, u64 *reg_id);
+ int mlx4_flow_detach(struct mlx4_dev *dev, u64 reg_id);
+ 
  #endif /* MLX4_DEVICE_H */

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: 82571EB: Detected Hardware Unit Hang
From: Joe Jin @ 2012-07-12  2:23 UTC (permalink / raw)
  To: Dave, Tushar N
  Cc: e1000-devel@lists.sf.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <061C8A8601E8EE4CA8D8FD6990CEA891274EF8AD@ORSMSX102.amr.corp.intel.com>

On 07/12/12 02:51, Dave, Tushar N wrote:
> 
> Joe,
> 
> I see a couple of errors in the lspci output.
> The device capability status register shows an uncorrectable PCIe error. This means something has certainly gone wrong. The only way to recover from uncorrectable errors is a reset.
>    
> 	DevSta:	CorrErr- *UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend-
> 
> Also, the AER section in the lspci output shows a PCIe completion timeout.
> 	
> 	Capabilities: [100 v1] Advanced Error Reporting
> 		UESta:	DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
> 
> I suggest you load the AER driver and check for any error messages in the log. Also, please check for any error messages reported by the system in the BIOS log. Are there any machine check errors?
> 
> When did you notice this issue? Has the 82571 ever worked on this server before?
> 
> One more thing: a cache line size of 256 is a little unusual (I have never seen this value before; mostly it's 64). Have the BIOS settings been changed? Are you using the default BIOS settings?
> 

I checked the BIOS log and found a fault reported for the device. I changed "PCI-E
Payload Size" from 256 (default) to 128, and now the device works.

I compared the lspci output and found that the address and data of the MSI
capability have changed:

Old:
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee21000  Data: 40cb

New:
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee24000  Data: 405c

Most likely it's a BIOS bug? Please comment.

Thanks,
Joe
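
For reference, a small sketch of how a payload size such as 128 or 256 is encoded in the PCIe Device Control register (the Max_Payload_Size field, bits 7:5). It is an assumption here that the BIOS "PCI-E Payload Size" option ends up in this field; the sketch_* helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode the Max_Payload_Size field (bits 7:5) of the PCIe Device Control
 * register into bytes: encoding 0 is 128 bytes, 1 is 256, and so on.
 * Encodings 6 and 7 are reserved and yield 0 here.
 */
static unsigned int sketch_mps_bytes(uint16_t devctl)
{
	unsigned int enc = (devctl >> 5) & 0x7;

	return enc <= 5 ? 128u << enc : 0;
}

/* Return devctl with the Max_Payload_Size field replaced by enc. */
static uint16_t sketch_mps_set(uint16_t devctl, unsigned int enc)
{
	assert(enc <= 5);	/* 6 and 7 are reserved encodings */
	return (uint16_t)((devctl & ~(0x7u << 5)) | (enc << 5));
}
```

In this encoding, going from 256 to 128 bytes means changing the field from 1 to 0.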

^ permalink raw reply

* Re: 82571EB: Detected Hardware Unit Hang
From: Dave, Tushar N @ 2012-07-12  2:52 UTC (permalink / raw)
  To: Joe Jin
  Cc: netdev@vger.kernel.org, e1000-devel@lists.sf.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <4FFE3505.8060407@oracle.com>

>-----Original Message-----
>From: Joe Jin [mailto:joe.jin@oracle.com]
>Sent: Wednesday, July 11, 2012 7:23 PM
>To: Dave, Tushar N
>Cc: e1000-devel@lists.sf.net; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/12/12 02:51, Dave, Tushar N wrote:
>>
>> Joe,
>>
>> I see a couple of errors in the lspci output.
>> The device capability status register shows an uncorrectable PCIe error. This
>> means something has certainly gone wrong. The only way to recover
>> from uncorrectable errors is a reset.
>>
>> 	DevSta:	CorrErr- *UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend-
>>
>> Also, the AER section in the lspci output shows a PCIe completion timeout.
>>
>> 	Capabilities: [100 v1] Advanced Error Reporting
>> 		UESta:	DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
>>
>> I suggest you load the AER driver and check for any error messages in
>> the log. Also, please check for any error messages reported by the system
>> in the BIOS log. Are there any machine check errors?
>>
>> When did you notice this issue? Has the 82571 ever worked on
>> this server before?
>>
>> One more thing: a cache line size of 256 is a little unusual (I have never
>> seen this value before; mostly it's 64). Have the BIOS settings been
>> changed? Are you using the default BIOS settings?
>>
>
>I checked the BIOS log and found a fault reported for the device. I changed "PCI-E
>Payload Size" from 256 (default) to 128, and now the device works.
>
>I compared the lspci output and found that the address and data of the MSI
>capability have changed:
>
>Old:
>        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                Address: 00000000fee21000  Data: 40cb
>
>New:
>        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                Address: 00000000fee24000  Data: 405c
>
>Most likely it's a BIOS bug? Please comment.
>
>Thanks,
>Joe

What are the exact error messages in the BIOS log?


------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: 82571EB: Detected Hardware Unit Hang
From: Joe Jin @ 2012-07-12  2:57 UTC (permalink / raw)
  To: Dave, Tushar N
  Cc: netdev@vger.kernel.org, e1000-devel@lists.sf.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <061C8A8601E8EE4CA8D8FD6990CEA891274F018D@ORSMSX102.amr.corp.intel.com>

On 07/12/12 10:52, Dave, Tushar N wrote:
> What are the exact error messages in the BIOS log?

Error message from BIOS event log:
07/12/12 05:54:00
    PCI Express Non-Fatal Error

Thanks,
Joe

^ permalink raw reply

* RE: 82571EB: Detected Hardware Unit Hang
From: Dave, Tushar N @ 2012-07-12  3:07 UTC (permalink / raw)
  To: Joe Jin
  Cc: e1000-devel@lists.sf.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <4FFE3D27.7020500@oracle.com>

>-----Original Message-----
>From: Joe Jin [mailto:joe.jin@oracle.com]
>Sent: Wednesday, July 11, 2012 7:58 PM
>To: Dave, Tushar N
>Cc: e1000-devel@lists.sf.net; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/12/12 10:52, Dave, Tushar N wrote:
>> What are the exact error messages in the BIOS log?
>
>Error message from BIOS event log:
>07/12/12 05:54:00
>    PCI Express Non-Fatal Error
>
>Thanks,
>Joe

Thanks.  I will check with the team tomorrow whether this (max payload size) can be treated as a solution to this issue.
We can learn more about exactly which non-fatal error occurred if we capture a bus trace.
We should also check the EEPROM on this device to make sure it is up to date.
Send me the full EEPROM dump in a file and I will confirm with the team that it is up to date.
Thanks for your work.

-Tushar

^ permalink raw reply

* Re: 82571EB: Detected Hardware Unit Hang
From: Joe Jin @ 2012-07-12  3:12 UTC (permalink / raw)
  To: Dave, Tushar N
  Cc: netdev@vger.kernel.org, e1000-devel@lists.sf.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <061C8A8601E8EE4CA8D8FD6990CEA891274F01BE@ORSMSX102.amr.corp.intel.com>

[-- Attachment #1: Type: text/plain, Size: 1025 bytes --]

On 07/12/12 11:07, Dave, Tushar N wrote:
>> -----Original Message-----
>> From: Joe Jin [mailto:joe.jin@oracle.com]
>> Sent: Wednesday, July 11, 2012 7:58 PM
>> To: Dave, Tushar N
>> Cc: e1000-devel@lists.sf.net; netdev@vger.kernel.org; linux-
>> kernel@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> On 07/12/12 10:52, Dave, Tushar N wrote:
>>> What are the exact error messages in the BIOS log?
>>
>> Error message from BIOS event log:
>> 07/12/12 05:54:00
>>    PCI Express Non-Fatal Error
>>
>> Thanks,
>> Joe
> 
> Thanks.  I will check with the team tomorrow whether this (max payload size) can be treated as a solution to this issue.
> We can learn more about exactly which non-fatal error occurred if we capture a bus trace.
> We should also check the EEPROM on this device to make sure it is up to date.
> Send me the full EEPROM dump in a file and I will confirm with the team that it is up to date.
> Thanks for your work.
> 

Hi Tushar,

Please find the EEPROM dump in the attachment.

Thanks a lot for your help,
Joe

^ permalink raw reply

* Re: 3.5rc6 sctp panic
From: Wei Yongjun @ 2012-07-12  3:12 UTC (permalink / raw)
  To: davej; +Cc: netdev, vyasevich, sri

On 07/11/2012 08:08 AM, Dave Jones wrote:
> I just hit this while fuzz testing, and the box locked up immediately afterwards.
> The serial log was a little mangled, I did my best to clean it up..
>
>

Hi Dave,

Can you share your test program?

>

^ permalink raw reply

* Re: 3.5rc6 sctp panic
From: Dave Jones @ 2012-07-12  3:18 UTC (permalink / raw)
  To: Wei Yongjun; +Cc: netdev, vyasevich, sri
In-Reply-To: <CAPgLHd9OESfaKP9n6=S9ziXW6Zzyo54fZ2fPXUZY88Jp_e0=hA@mail.gmail.com>

On Thu, Jul 12, 2012 at 11:12:47AM +0800, Wei Yongjun wrote:
 > On 07/11/2012 08:08 AM, Dave Jones wrote:
 > > I just hit this while fuzz testing, and the box locked up immediately afterwards.
 > > The serial log was a little mangled, I did my best to clean it up..
 > 
 > Hi Dave,
 > 
 > Can you share your test program?

http://codemonkey.org.uk/projects/trinity/

	Dave

^ permalink raw reply

* Re: [PATCH 2/2] net: Update alloc frag to reduce get/put page usage and recycle pages
From: Eric Dumazet @ 2012-07-12  5:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher, Eric Dumazet
In-Reply-To: <4FFE303F.8070902@gmail.com>

On Wed, 2012-07-11 at 19:02 -0700, Alexander Duyck wrote:

> The gain will be minimal if any with the 1500 byte allocations, however 
> there shouldn't be a performance degradation.
> 
> I was thinking more of the ixgbe case where we are working with only 256 
> byte allocations and can recycle pages in the case of GRO or TCP.  For 
> ixgbe the advantages are significant since we drop a number of the 
> get_page calls and get the advantage of the page recycling.  So for 
> example with GRO enabled we should only have to allocate 1 page for 
> headers every 16 buffers, and the 6 slots we use in that page have a 
> good likelihood of being warm in the cache since we just keep looping on 
> the same page.
> 

It's not possible to get 16 buffers per 4096-byte page.

sizeof(struct skb_shared_info) = 0x140 (320)

Add 192 bytes (NET_SKB_PAD + 128)

That's a minimum of 512 bytes (but ixgbe uses more) per skb.

In practice for ixgbe, it is:

#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512 

skb = netdev_alloc_skb_ip_align(rx_ring->netdev, IXGBE_RX_HDR_SIZE)

So 4 buffers per PAGE

Maybe you plan to use IXGBE_RXBUFFER_256 or IXGBE_RXBUFFER_128 ?
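
The arithmetic above can be sketched as follows. The constants are assumptions for illustration: a 64-byte cache line, a 64-byte NET_SKB_PAD, NET_IP_ALIGN taken as 0, and the 320-byte skb_shared_info size quoted above; the sk_* names are hypothetical and only approximate how the per-buffer size is computed.

```c
#include <assert.h>

#define SK_PAGE_SIZE	4096u
#define SK_CACHE_LINE	64u	/* assumed cache line size */
#define SK_SKB_PAD	64u	/* assumed NET_SKB_PAD */
#define SK_SHINFO_SIZE	320u	/* sizeof(struct skb_shared_info) quoted above */

/* Round x up to a multiple of the assumed cache line size. */
static unsigned int sk_data_align(unsigned int x)
{
	return (x + SK_CACHE_LINE - 1) & ~(SK_CACHE_LINE - 1);
}

/* Bytes consumed per buffer: padded data area plus shared info. */
static unsigned int sk_frag_size(unsigned int hdr_len)
{
	assert(hdr_len > 0);
	return sk_data_align(hdr_len + SK_SKB_PAD) +
	       sk_data_align(SK_SHINFO_SIZE);
}

/* Whole buffers that fit in one page for a given header buffer length. */
static unsigned int sk_bufs_per_page(unsigned int hdr_len)
{
	return SK_PAGE_SIZE / sk_frag_size(hdr_len);
}
```

Under these assumptions a 512-byte header buffer costs 896 bytes, giving 4 buffers per page; 256 bytes gives 6 and 128 bytes gives 8, so 16 per page is out of reach.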

^ permalink raw reply

* RE: 82571EB: Detected Hardware Unit Hang
From: Dave, Tushar N @ 2012-07-12  5:57 UTC (permalink / raw)
  To: Joe Jin
  Cc: e1000-devel@lists.sf.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <4FFE40A9.4060807@oracle.com>

>-----Original Message-----
>From: Joe Jin [mailto:joe.jin@oracle.com]
>Sent: Wednesday, July 11, 2012 8:13 PM
>To: Dave, Tushar N
>Cc: e1000-devel@lists.sf.net; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/12/12 11:07, Dave, Tushar N wrote:
>>> -----Original Message-----
>>> From: Joe Jin [mailto:joe.jin@oracle.com]
>>> Sent: Wednesday, July 11, 2012 7:58 PM
>>> To: Dave, Tushar N
>>> Cc: e1000-devel@lists.sf.net; netdev@vger.kernel.org; linux-
>>> kernel@vger.kernel.org
>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>
>>> On 07/12/12 10:52, Dave, Tushar N wrote:
>>>> What is the exact error messages in BIOS log?
>>>
>>> Error message from BIOS event log:
>>> 07/12/12 05:54:00
>>>    PCI Express Non-Fatal Error
>>>
>>> Thanks,
>>> Joe
>Hi Tushar,
>
>Please find eeprom from attachment.

Do you have an lspci -vvv dump of the entire system from before and after the issue occurs? If so, can you send it to me?

^ permalink raw reply

* Re: 82571EB: Detected Hardware Unit Hang
From: Joe Jin @ 2012-07-12  6:16 UTC (permalink / raw)
  To: Dave, Tushar N
  Cc: netdev@vger.kernel.org, e1000-devel@lists.sf.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <061C8A8601E8EE4CA8D8FD6990CEA891274F02D7@ORSMSX102.amr.corp.intel.com>

On 07/12/12 13:57, Dave, Tushar N wrote:
>> -----Original Message-----
>> From: Joe Jin [mailto:joe.jin@oracle.com]
>> Sent: Wednesday, July 11, 2012 8:13 PM
>> To: Dave, Tushar N
>> Cc: e1000-devel@lists.sf.net; netdev@vger.kernel.org; linux-
>> kernel@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> On 07/12/12 11:07, Dave, Tushar N wrote:
>>>> -----Original Message-----
>>>> From: Joe Jin [mailto:joe.jin@oracle.com]
>>>> Sent: Wednesday, July 11, 2012 7:58 PM
>>>> To: Dave, Tushar N
>>>> Cc: e1000-devel@lists.sf.net; netdev@vger.kernel.org; linux-
>>>> kernel@vger.kernel.org
>>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>>
>>>> On 07/12/12 10:52, Dave, Tushar N wrote:
>>>>> What is the exact error messages in BIOS log?
>>>>
>>>> Error message from BIOS event log:
>>>> 07/12/12 05:54:00
>>>>    PCI Express Non-Fatal Error
>>>>
>>>> Thanks,
>>>> Joe
>> Hi Tushar,
>>
>> Please find eeprom from attachment.
> 
> Do you have lspci -vvv dump of entire system before and after issue occurs? If you have can you send it to me?
> 

Before:
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
	Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet UTP Low Profile Adapter
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 256 bytes
	Interrupt: pin B routed to IRQ 80
	Region 0: Memory at fbde0000 (32-bit, non-prefetchable) [size=128K]
	Region 1: Memory at fbdc0000 (32-bit, non-prefetchable) [size=128K]
	Region 2: I/O ports at dc00 [size=32]
	Expansion ROM at fbda0000 [disabled] [size=128K]
	Capabilities: [c8] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee33000  Data: 407c
	Capabilities: [e0] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap:	Port #2, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <4us, L1 <64us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap:	First Error Pointer: 12, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
	Kernel driver in use: e1000e
	Kernel modules: e1000e


After:
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
	Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet UTP Low Profile Adapter
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 256 bytes
	Interrupt: pin B routed to IRQ 80
	Region 0: Memory at fbde0000 (32-bit, non-prefetchable) [size=128K]
	Region 1: Memory at fbdc0000 (32-bit, non-prefetchable) [size=128K]
	Region 2: I/O ports at dc00 [size=32]
	Expansion ROM at fbda0000 [disabled] [size=128K]
	Capabilities: [c8] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee33000  Data: 407c
	Capabilities: [e0] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend+
		LnkCap:	Port #2, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <4us, L1 <64us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap:	First Error Pointer: 12, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
	Kernel driver in use: e1000e
	Kernel modules: e1000e

Differences between them:
Before:		DevSta:	CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
After:		DevSta:	CorrErr- UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend+
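
For readers who want to compare raw register values rather than lspci text, the same difference can be sketched in C. The bit positions follow the PCIe Device Status register layout. The mapping of each lspci label to a bit is an assumption based on the field order in the dumps above, and the macro and function names are illustrative only.

```c
#include <stdio.h>

/* PCIe Device Status bits; lspci labels in comments are assumed mappings. */
#define DEVSTA_CORR_ERR     0x01u  /* CorrErr   */
#define DEVSTA_NONFATAL_ERR 0x02u  /* UncorrErr */
#define DEVSTA_FATAL_ERR    0x04u  /* FatalErr  */
#define DEVSTA_UNSUPP_REQ   0x08u  /* UnsuppReq */
#define DEVSTA_AUX_PWR      0x10u  /* AuxPwr    */
#define DEVSTA_TRANS_PEND   0x20u  /* TransPend */

/* Values reconstructed from the "+" flags in the two dumps. */
#define DEVSTA_BEFORE (DEVSTA_NONFATAL_ERR | DEVSTA_UNSUPP_REQ | DEVSTA_AUX_PWR)
#define DEVSTA_AFTER  (DEVSTA_BEFORE | DEVSTA_FATAL_ERR | DEVSTA_TRANS_PEND)

/* Bits that differ between two snapshots of the register. */
static inline unsigned int devsta_changed(unsigned int before, unsigned int after)
{
	return before ^ after;
}

/* Print one line per flag whose state changed, in lspci's +/- notation. */
static inline void devsta_print_changes(unsigned int before, unsigned int after)
{
	static const struct { unsigned int bit; const char *name; } flags[] = {
		{ DEVSTA_CORR_ERR, "CorrErr" },     { DEVSTA_NONFATAL_ERR, "UncorrErr" },
		{ DEVSTA_FATAL_ERR, "FatalErr" },   { DEVSTA_UNSUPP_REQ, "UnsuppReq" },
		{ DEVSTA_AUX_PWR, "AuxPwr" },       { DEVSTA_TRANS_PEND, "TransPend" },
	};
	unsigned int changed = devsta_changed(before, after);

	for (size_t i = 0; i < sizeof(flags) / sizeof(flags[0]); i++)
		if (changed & flags[i].bit)
			printf("%s: %c -> %c\n", flags[i].name,
			       (before & flags[i].bit) ? '+' : '-',
			       (after & flags[i].bit) ? '+' : '-');
}
```

Calling devsta_print_changes(DEVSTA_BEFORE, DEVSTA_AFTER) reports only FatalErr and TransPend, matching the two lines above.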


Thanks,
Joe

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* RE: [PATCH net-next] netxen: fix link notification order
From: Rajesh Borundia @ 2012-07-12  6:27 UTC (permalink / raw)
  To: Flavio Leitner, netdev; +Cc: Sony Chacko
In-Reply-To: <1342033015-31442-1-git-send-email-fbl@redhat.com>

> -----Original Message-----
> From: Flavio Leitner [mailto:fbl@redhat.com]
> Sent: Thursday, July 12, 2012 12:27 AM
> To: netdev
> Cc: Sony Chacko; Rajesh Borundia; Flavio Leitner
> Subject: [PATCH net-next] netxen: fix link notification order
> 
> First update the adapter variables with the current speed and
> mode before fire the notification. Otherwise, the get_settings()
> may provide old values.
> 
> Signed-off-by: Flavio Leitner <fbl@redhat.com>
> ---
>  drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
> b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
> index b2c1b676..bc165f4 100644
> --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
> +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
> @@ -1437,8 +1437,6 @@ netxen_handle_linkevent(struct netxen_adapter
> *adapter, nx_fw_msg_t *msg)
>  				netdev->name, cable_len);
>  	}
> 
> -	netxen_advert_link_change(adapter, link_status);
> -
>  	/* update link parameters */
>  	if (duplex == LINKEVENT_FULL_DUPLEX)
>  		adapter->link_duplex = DUPLEX_FULL;
> @@ -1447,6 +1445,8 @@ netxen_handle_linkevent(struct netxen_adapter
> *adapter, nx_fw_msg_t *msg)
>  	adapter->module_type = module;
>  	adapter->link_autoneg = autoneg;
>  	adapter->link_speed = link_speed;
> +
> +	netxen_advert_link_change(adapter, link_status);
>  }
> 
>  static void
> --
> 1.7.10.4
> 
Acked-by: Rajesh Borundia <rajesh.borundia@qlogic.com>

^ permalink raw reply

* [PATCH 00/16] Swap-over-NBD without deadlocking V15
From: Mel Gorman @ 2012-07-12  6:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman

This is a rebase onto current linux-next due to a minor collision with
some NFS changes.

Changelog since V14
  o Rebase to linux-next 20120710

Changelog since V13
  o Rebase to linux-next 20120629

Changelog since V12
  o Rebase to linux-next-20120622
  o Do not alter coalesce handling in the input path		      (eric.dumazet)
  o Avoid unnecessary cast					      (sebastian)

Changelog since V11
  o Rebase to 3.5-rc3
  o Correct order of page flag free				      (sebastian)

Changelog since V10
  o Rebase to 3.4-rc5
  o Coding style fixups						      (davem)
  o API consistency						      (davem)
  o Rename sk_allocation to sk_gfp_atomic and use only when necessary (davem)
  o Use static branches for sk_memalloc_socks			      (davem)
  o Use static branch checks in fast paths			      (davem)
  o Document concerns about PF_MEMALLOC leaking flags		      (davem)
  o Locking fix in slab						      (mel)

Changelog since V9
  o Rebase to 3.4-rc5
  o Clarify comment on why PF_MEMALLOC is cleared in softirq handling (akpm)
  o Only set page->pfmemalloc if ALLOC_NO_WATERMARKS was required     (rientjes)

Changelog since V8
  o Rebase to 3.4-rc2
  o Use page flag instead of slab fields to keep structures the same size
  o Properly detect allocations from softirq context that use PF_MEMALLOC
  o Ensure kswapd does not sleep while processes are throttled
  o Do not accidentally throttle !__GFP_FS processes indefinitely

Changelog since V7
  o Rebase to 3.3-rc2
  o Take greater care propagating page->pfmemalloc to skb
  o Propagate pfmemalloc from netdev_alloc_page to skb where possible
  o Release RCU lock properly on preempt kernel

Changelog since V6
  o Rebase to 3.1-rc8
  o Use wake_up instead of wake_up_interruptible()
  o Do not throttle kernel threads
  o Avoid a potential race between kswapd going to sleep and processes being
    throttled

Changelog since V5
  o Rebase to 3.1-rc5

Changelog since V4
  o Update comment clarifying what protocols can be used		(Michal)
  o Rebase to 3.0-rc3

Changelog since V3
  o Propagate pfmemalloc from packet fragment pages to skb		(Neil)
  o Rebase to 3.0-rc2

Changelog since V2
  o Document that __GFP_NOMEMALLOC overrides __GFP_MEMALLOC		(Neil)
  o Use wait_event_interruptible					(Neil)
  o Use !! when casting to bool to avoid any possibility of type
    truncation								(Neil)
  o Nicer logic when using skb_pfmemalloc_protocol			(Neil)

Changelog since V1
  o Rebase on top of mmotm
  o Use atomic_t for memalloc_socks		(David Miller)
  o Remove use of sk_memalloc_socks in vmscan	(Neil Brown)
  o Check throttle within prepare_to_wait	(Neil Brown)
  o Add statistics on throttling instead of printk

When a user or administrator requires swap for their application, they
create a swap partition or file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are blade servers used as part of a
cluster, where the form factor or maintenance costs do not allow the use
of disks, and thin clients.

The Linux Terminal Server Project recommends the use of the
Network Block Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to set up swap over NBD
at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
The nbd-client also documents the use of NBD as swap. Despite this, the
fact is that a machine using NBD for swap can deadlock within minutes if
swap is used intensively. This patch series addresses the problem.

The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over NFS, which at least one distribution
is carrying within its kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.

Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
	preserve access to pages allocated under low memory situations
	to callers that are freeing memory.

Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks

Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
	reserves without setting PFMEMALLOC.

Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
	for later use by network packet processing.

Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patches 7-12 allow network processing to use PFMEMALLOC reserves when
	the socket has been marked as being used by the VM to clean pages. If
	packets are received and stored in pages that were allocated under
	low-memory situations and are unrelated to the VM, the packets
	are dropped.

	Patch 11 reintroduces __skb_alloc_page which the networking
	folk may object to but is needed in some cases to propagate
	pfmemalloc from a newly allocated page to an skb. If there is a
	strong objection, this patch can be dropped with the impact being
	that swap-over-network will be slower in some cases but it should
	not fail.

Patch 13 is a micro-optimisation to avoid a function call in the
	common case.

Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
	PFMEMALLOC if necessary.

Patch 15 notes that it is still possible for the PFMEMALLOC reserve
	to be depleted. To prevent this, direct reclaimers get throttled on
	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
	expected that kswapd and the direct reclaimers already running
	will clean enough pages for the low watermark to be reached and
	the throttled processes are woken up.

Patch 16 adds a statistic to track how often processes get throttled
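
As a rough illustration of the throttling condition described for Patch 15, the check could look something like the sketch below. It is not the code from the patch: the struct, its fields and the function name are hypothetical, and it assumes "50% depleted" means the free pages have dropped to half of the summed reserve or lower.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone snapshot; field names are illustrative only. */
struct zone_snapshot {
	unsigned long reserve_pages;  /* size of this zone's PFMEMALLOC reserve */
	unsigned long free_pages;     /* pages currently free in this zone */
};

/*
 * Return true while more than half of the combined reserve is still free.
 * A direct reclaimer would be throttled when this returns false.
 */
static inline bool pfmemalloc_reserves_ok(const struct zone_snapshot *zones,
					  size_t nr_zones)
{
	unsigned long reserve = 0, free_pages = 0;

	assert(zones != NULL || nr_zones == 0);
	for (size_t i = 0; i < nr_zones; i++) {
		reserve += zones[i].reserve_pages;
		free_pages += zones[i].free_pages;
	}
	return free_pages > reserve / 2;
}
```

In this sketch a throttled process would sleep on a waitqueue and re-run the check when woken, which is the behaviour the description of Patch 15 outlines.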

Some basic performance testing was run using kernel builds, netperf
on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
and sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant
performance variances.

For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.

Without the patches and using SLUB, the machine locks up within minutes;
with them applied, it runs to completion. With SLAB, the story is different
as an unpatched kernel ran to completion. However, the patched kernel
completed the test 45% faster.

MICRO
                                         3.5.0-rc2 3.5.0-rc2
					 vanilla     swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)             197.80    173.07
User+Sys Time Running Test (seconds)        206.96    182.03
Total Elapsed Time (seconds)               3240.70   1762.09
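
The 45% figure appears to correspond to the reduction in total elapsed time. A small sketch of that arithmetic, using the two elapsed times from the table (the macro and function names are illustrative):

```c
/* Elapsed times in seconds, copied from the table above. */
#define VANILLA_ELAPSED 3240.70
#define SWAPNBD_ELAPSED 1762.09

/* Fractional reduction in elapsed time relative to the baseline run. */
static inline double elapsed_reduction(double baseline, double patched)
{
	return (baseline - patched) / baseline;
}
```

With these inputs the reduction is about 0.456, so the patched kernel finished in roughly 45% less wall-clock time than the vanilla one.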

 drivers/block/nbd.c                               |    6 +-
 drivers/net/ethernet/chelsio/cxgb4/sge.c          |    2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c        |    2 +-
 drivers/net/ethernet/intel/igb/igb_main.c         |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    4 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    3 +-
 drivers/net/usb/cdc-phonet.c                      |    2 +-
 drivers/usb/gadget/f_phonet.c                     |    2 +-
 include/linux/gfp.h                               |   13 +-
 include/linux/mm_types.h                          |    9 +
 include/linux/mmzone.h                            |    1 +
 include/linux/page-flags.h                        |   28 +++
 include/linux/sched.h                             |    7 +
 include/linux/skbuff.h                            |   80 +++++++-
 include/linux/vm_event_item.h                     |    1 +
 include/net/sock.h                                |   28 +++
 include/trace/events/gfpflags.h                   |    1 +
 kernel/softirq.c                                  |    9 +
 mm/page_alloc.c                                   |   46 ++++-
 mm/slab.c                                         |  216 +++++++++++++++++++--
 mm/slub.c                                         |   30 ++-
 mm/vmscan.c                                       |  131 ++++++++++++-
 mm/vmstat.c                                       |    1 +
 net/core/dev.c                                    |   53 ++++-
 net/core/filter.c                                 |    8 +
 net/core/skbuff.c                                 |  124 +++++++++---
 net/core/sock.c                                   |   43 ++++
 net/ipv4/tcp_output.c                             |   12 +-
 net/ipv6/tcp_ipv6.c                               |    8 +-
 29 files changed, 782 insertions(+), 90 deletions(-)

-- 
1.7.9.2

^ permalink raw reply

* [PATCH 01/16] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Mel Gorman @ 2012-07-12  6:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1342075232-29267-1-git-send-email-mgorman@suse.de>

Allocations of pages below the min watermark run a risk of the
machine hanging due to a lack of memory. To prevent this, only
callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing
an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once
they are allocated to a slab though, nothing prevents other callers
consuming free objects within those slabs. This patch limits access
to slab pages that were allocated from the PFMEMALLOC reserves.

When this patch is applied, pages allocated from below the low watermark are
returned with page->pfmemalloc set and it is up to the caller to determine
how the page should be protected. SLAB restricts access to any page with
page->pfmemalloc set to callers which are known to be able to access the
PFMEMALLOC reserve. If one is not available, an attempt is made to allocate
a new page rather than use a reserve. SLUB is a bit more relaxed in that
it only records if the current per-CPU page was allocated from PFMEMALLOC
reserve and uses another partial slab if the caller does not have the
necessary GFP or process flags. This was found to be sufficient in tests
to avoid hangs due to SLUB generally maintaining smaller lists than SLAB.

In low-memory conditions it does mean that !PFMEMALLOC allocators
can fail a slab allocation even though free objects are available
because they are being preserved for callers that are freeing pages.
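
For SLAB, the patch below records the pfmemalloc state of an object by setting the low bit of its pointer in the array cache (SLAB_OBJ_PFMEMALLOC). The following userspace sketch shows the same low-bit tagging technique in isolation. It assumes objects are at least 2-byte aligned so bit 0 is always free, and the function names are illustrative rather than the ones used in mm/slab.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OBJ_TAG ((uintptr_t)1)  /* plays the role of SLAB_OBJ_PFMEMALLOC */

/* True if the pointer carries the tag in its low bit. */
static inline bool obj_is_tagged(const void *objp)
{
	return ((uintptr_t)objp & OBJ_TAG) != 0;
}

/* Return the pointer with the tag bit set; the input must be untagged. */
static inline void *obj_tag(void *objp)
{
	assert(!obj_is_tagged(objp));
	return (void *)((uintptr_t)objp | OBJ_TAG);
}

/* Return the original, dereferenceable pointer with the tag bit cleared. */
static inline void *obj_untag(void *objp)
{
	return (void *)((uintptr_t)objp & ~OBJ_TAG);
}
```

A tagged pointer must never be dereferenced directly, which is why the patch routes all array-cache accesses through ac_get_obj() and ac_put_obj().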

[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h   |    9 +++
 include/linux/page-flags.h |   28 +++++++
 mm/internal.h              |    3 +
 mm/page_alloc.c            |   27 +++++--
 mm/slab.c                  |  192 +++++++++++++++++++++++++++++++++++++++-----
 mm/slub.c                  |   29 ++++++-
 6 files changed, 263 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 27c741c..ad0ad6f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -54,6 +54,15 @@ struct page {
 		union {
 			pgoff_t index;		/* Our offset within mapping. */
 			void *freelist;		/* slub/slob first free object */
+			bool pfmemalloc;	/* If set by the page allocator,
+						 * ALLOC_PFMEMALLOC was set
+						 * and the low watermark was not
+						 * met implying that the system
+						 * is under some pressure. The
+						 * caller should try to ensure
+						 * this page is only used to
+						 * free other pages.
+						 */
 		};
 
 		union {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c88d2a9..e66eb0d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -453,6 +453,34 @@ static inline int PageTransTail(struct page *page)
 }
 #endif
 
+/*
+ * If network-based swap is enabled, sl*b must keep track of whether pages
+ * were allocated from pfmemalloc reserves.
+ */
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	return PageActive(page);
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	SetPageActive(page);
+}
+
+static inline void __ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	__ClearPageActive(page);
+}
+
+static inline void ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	ClearPageActive(page);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1 << PG_mlocked)
 #else
diff --git a/mm/internal.h b/mm/internal.h
index 0b72461..93ea85b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -275,6 +275,9 @@ static inline struct page *mem_map_next(struct page *iter,
 #define __paginginit __init
 #endif
 
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d236e8c..e4e2bb0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1508,6 +1508,7 @@ failed:
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC	0x80 /* Caller has PF_MEMALLOC set */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -2265,16 +2266,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((current->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+	if ((current->flags & PF_MEMALLOC) ||
+			unlikely(test_thread_flag(TIF_MEMDIE))) {
+		alloc_flags |= ALLOC_PFMEMALLOC;
+
+		if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
 	return alloc_flags;
 }
 
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2462,10 +2469,18 @@ nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 	return page;
 got_pg:
+	/*
+	 * page->pfmemalloc is set when the caller had PFMEMALLOC set or has
+	 * been OOM killed. The expectation is that the caller is taking
+	 * steps that will free more memory. The caller should avoid the
+	 * page being used for !PFMEMALLOC purposes.
+	 */
+	page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
 	if (kmemcheck_enabled)
 		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
-	return page;
 
+	return page;
 }
 
 /*
@@ -2516,6 +2531,8 @@ retry_cpuset:
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
+	else
+		page->pfmemalloc = false;
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
diff --git a/mm/slab.c b/mm/slab.c
index 4eea480..85e6743 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -123,6 +123,8 @@
 
 #include <trace/events/kmem.h>
 
+#include	"internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *		  0 for faster, smaller code (especially in the critical paths).
@@ -151,6 +153,12 @@
 #define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
 #endif
 
+/*
+ * true if a page was allocated from pfmemalloc reserves for network-based
+ * swap
+ */
+static bool pfmemalloc_active __read_mostly;
+
 /* Legal flag mask for kmem_cache_create(). */
 #if DEBUG
 # define CREATE_MASK	(SLAB_RED_ZONE | \
@@ -256,9 +264,30 @@ struct array_cache {
 			 * Must have this definition in here for the proper
 			 * alignment of array_cache. Also simplifies accessing
 			 * the entries.
+			 *
+			 * Entries should not be directly dereferenced as
+			 * entries belonging to slabs marked pfmemalloc will
+			 * have the lower bits set SLAB_OBJ_PFMEMALLOC
 			 */
 };
 
+#define SLAB_OBJ_PFMEMALLOC	1
+static inline bool is_obj_pfmemalloc(void *objp)
+{
+	return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC;
+}
+
+static inline void set_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
+	return;
+}
+
+static inline void clear_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC);
+}
+
 /*
  * bootstrap: The caches do not work without cpuarrays anymore, but the
  * cpuarrays are allocated from the generic caches...
@@ -925,6 +954,102 @@ static struct array_cache *alloc_arraycache(int node, int entries,
 	return nc;
 }
 
+static inline bool is_slab_pfmemalloc(struct slab *slabp)
+{
+	struct page *page = virt_to_page(slabp->s_mem);
+
+	return PageSlabPfmemalloc(page);
+}
+
+/* Clears pfmemalloc_active if no slabs have pfmemalloc set */
+static void recheck_pfmemalloc_active(struct kmem_cache *cachep,
+						struct array_cache *ac)
+{
+	struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
+	struct slab *slabp;
+	unsigned long flags;
+
+	if (!pfmemalloc_active)
+		return;
+
+	spin_lock_irqsave(&l3->list_lock, flags);
+	list_for_each_entry(slabp, &l3->slabs_full, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	list_for_each_entry(slabp, &l3->slabs_partial, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	list_for_each_entry(slabp, &l3->slabs_free, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	pfmemalloc_active = false;
+out:
+	spin_unlock_irqrestore(&l3->list_lock, flags);
+}
+
+static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+						gfp_t flags, bool force_refill)
+{
+	int i;
+	void *objp = ac->entry[--ac->avail];
+
+	/* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
+	if (unlikely(is_obj_pfmemalloc(objp))) {
+		struct kmem_list3 *l3;
+
+		if (gfp_pfmemalloc_allowed(flags)) {
+			clear_obj_pfmemalloc(&objp);
+			return objp;
+		}
+
+		/* The caller cannot use PFMEMALLOC objects, find another one */
+		for (i = 1; i < ac->avail; i++) {
+			/* If a !PFMEMALLOC object is found, swap them */
+			if (!is_obj_pfmemalloc(ac->entry[i])) {
+				objp = ac->entry[i];
+				ac->entry[i] = ac->entry[ac->avail];
+				ac->entry[ac->avail] = objp;
+				return objp;
+			}
+		}
+
+		/*
+		 * If there are empty slabs on the slabs_free list and we are
+		 * being forced to refill the cache, mark this one !pfmemalloc.
+		 */
+		l3 = cachep->nodelists[numa_mem_id()];
+		if (!list_empty(&l3->slabs_free) && force_refill) {
+			struct slab *slabp = virt_to_slab(objp);
+			ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
+			clear_obj_pfmemalloc(&objp);
+			recheck_pfmemalloc_active(cachep, ac);
+			return objp;
+		}
+
+		/* No !PFMEMALLOC objects available */
+		ac->avail++;
+		objp = NULL;
+	}
+
+	return objp;
+}
+
+static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+								void *objp)
+{
+	if (unlikely(pfmemalloc_active)) {
+		/* Some pfmemalloc slabs exist, check if this is one */
+		struct page *page = virt_to_page(objp);
+		if (PageSlabPfmemalloc(page))
+			set_obj_pfmemalloc(&objp);
+	}
+
+	ac->entry[ac->avail++] = objp;
+}
+
 /*
  * Transfer objects in one arraycache to another.
  * Locking must be handled by the caller.
@@ -1101,7 +1226,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 			STATS_INC_ACOVERFLOW(cachep);
 			__drain_alien_cache(cachep, alien, nodeid);
 		}
-		alien->entry[alien->avail++] = objp;
+		ac_put_obj(cachep, alien, objp);
 		spin_unlock(&alien->lock);
 	} else {
 		spin_lock(&(cachep->nodelists[nodeid])->list_lock);
@@ -1781,6 +1906,10 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 		return NULL;
 	}
 
+	/* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+	if (unlikely(page->pfmemalloc))
+		pfmemalloc_active = true;
+
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -1788,9 +1917,13 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	else
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_UNRECLAIMABLE, nr_pages);
-	for (i = 0; i < nr_pages; i++)
+	for (i = 0; i < nr_pages; i++) {
 		__SetPageSlab(page + i);
 
+		if (page->pfmemalloc)
+			SetPageSlabPfmemalloc(page + i);
+	}
+
 	if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
 		kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
 
@@ -1822,6 +1955,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
 				NR_SLAB_UNRECLAIMABLE, nr_freed);
 	while (i--) {
 		BUG_ON(!PageSlab(page));
+		__ClearPageSlabPfmemalloc(page);
 		__ClearPageSlab(page);
 		page++;
 	}
@@ -3093,16 +3227,19 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
+							bool force_refill)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
 	struct array_cache *ac;
 	int node;
 
-retry:
 	check_irq_off();
 	node = numa_mem_id();
+	if (unlikely(force_refill))
+		goto force_grow;
+retry:
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3152,8 +3289,8 @@ retry:
 			STATS_INC_ACTIVE(cachep);
 			STATS_SET_HIGH(cachep);
 
-			ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,
-							    node);
+			ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
+									node));
 		}
 		check_slabp(cachep, slabp);
 
@@ -3172,18 +3309,22 @@ alloc_done:
 
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || force_refill))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
 			goto retry;
 	}
 	ac->touched = 1;
-	return ac->entry[--ac->avail];
+
+	return ac_get_obj(cachep, ac, flags, force_refill);
 }
 
 static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
@@ -3265,23 +3406,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
 	void *objp;
 	struct array_cache *ac;
+	bool force_refill = false;
 
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
 	if (likely(ac->avail)) {
-		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
-		objp = ac->entry[--ac->avail];
-	} else {
-		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = ac_get_obj(cachep, ac, flags, false);
+
 		/*
-		 * the 'ac' may be updated by cache_alloc_refill(),
-		 * and kmemleak_erase() requires its correct value.
+		 * Allow for the possibility all avail objects are not allowed
+		 * by the current flags
 		 */
-		ac = cpu_cache_get(cachep);
+		if (objp) {
+			STATS_INC_ALLOCHIT(cachep);
+			goto out;
+		}
+		force_refill = true;
 	}
+
+	STATS_INC_ALLOCMISS(cachep);
+	objp = cache_alloc_refill(cachep, flags, force_refill);
+	/*
+	 * the 'ac' may be updated by cache_alloc_refill(),
+	 * and kmemleak_erase() requires its correct value.
+	 */
+	ac = cpu_cache_get(cachep);
+
+out:
 	/*
 	 * To avoid a false negative, if an object that is in one of the
 	 * per-CPU caches is leaked, we need to make sure kmemleak doesn't
@@ -3603,9 +3756,12 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
 	struct kmem_list3 *l3;
 
 	for (i = 0; i < nr_objects; i++) {
-		void *objp = objpp[i];
+		void *objp;
 		struct slab *slabp;
 
+		clear_obj_pfmemalloc(&objpp[i]);
+		objp = objpp[i];
+
 		slabp = virt_to_slab(objp);
 		l3 = cachep->nodelists[node];
 		list_del(&slabp->list);
@@ -3723,7 +3879,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
 		cache_flusharray(cachep, ac);
 	}
 
-	ac->entry[ac->avail++] = objp;
+	ac_put_obj(cachep, ac, objp);
 }
 
 /**
diff --git a/mm/slub.c b/mm/slub.c
index ef9bf01..98fecc2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -33,6 +33,8 @@
 
 #include <trace/events/kmem.h>
 
+#include "internal.h"
+
 /*
  * Lock order:
  *   1. slub_lock (Global Semaphore)
@@ -1370,6 +1372,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	__SetPageSlab(page);
+	if (page->pfmemalloc)
+		SetPageSlabPfmemalloc(page);
 
 	start = page_address(page);
 
@@ -1413,6 +1417,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
 		-pages);
 
+	__ClearPageSlabPfmemalloc(page);
 	__ClearPageSlab(page);
 	reset_page_mapcount(page);
 	if (current->reclaim_state)
@@ -2132,6 +2137,14 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 	return freelist;
 }
 
+static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
+{
+	if (unlikely(PageSlabPfmemalloc(page)))
+		return gfp_pfmemalloc_allowed(gfpflags);
+
+	return true;
+}
+
 /*
  * Check the page->freelist of a page and either transfer the freelist to the per cpu freelist
  * or deactivate the page.
@@ -2212,6 +2225,18 @@ redo:
 		goto new_slab;
 	}
 
+	/*
+	 * By rights, we should be searching for a slab page that was
+	 * PFMEMALLOC but right now, we are losing the pfmemalloc
+	 * information when the page leaves the per-cpu allocator
+	 */
+	if (unlikely(!pfmemalloc_match(page, gfpflags))) {
+		deactivate_slab(s, page, c->freelist);
+		c->page = NULL;
+		c->freelist = NULL;
+		goto new_slab;
+	}
+
 	/* must check again c->freelist in case of cpu migration or IRQ */
 	freelist = c->freelist;
 	if (freelist)
@@ -2318,8 +2343,8 @@ redo:
 
 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !node_match(page, node)))
-
+	if (unlikely(!object || !node_match(page, node) ||
+					!pfmemalloc_match(page, gfpflags)))
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-- 
1.7.9.2
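
For readers following the slab.c hunks above: ac_put_obj() and
ac_get_obj() rely on set_obj_pfmemalloc(), clear_obj_pfmemalloc() and
a matching test helper whose definitions fall outside this part of the
patch. The userspace sketch below assumes they tag the lowest bit of
the object pointer, which is free because slab objects are at least
2-byte aligned. Every name in it (OBJ_PFMEMALLOC_TAG, obj_*, toy_*) is
hypothetical, and toy_get() is only loosely modeled on the swap logic,
not a copy of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the tag lives in bit 0 of the pointer. */
#define OBJ_PFMEMALLOC_TAG ((uintptr_t)1)
#define TOY_CACHE_SLOTS 8

static inline bool obj_is_tagged(const void *objp)
{
	return ((uintptr_t)objp & OBJ_PFMEMALLOC_TAG) != 0;
}

static inline void obj_set_tag(void **objp)
{
	*objp = (void *)((uintptr_t)*objp | OBJ_PFMEMALLOC_TAG);
}

static inline void obj_clear_tag(void **objp)
{
	*objp = (void *)((uintptr_t)*objp & ~OBJ_PFMEMALLOC_TAG);
}

/* Hypothetical stand-in for struct array_cache: a small LIFO of pointers. */
struct toy_cache {
	size_t avail;
	void *entry[TOY_CACHE_SLOTS];
};

/* Store an object, tagging it if it came from a reserve-backed page. */
static void toy_put(struct toy_cache *ac, void *objp, bool from_reserve)
{
	assert(ac->avail < TOY_CACHE_SLOTS);
	assert(!obj_is_tagged(objp));	/* needs at least 2-byte alignment */

	if (from_reserve)
		obj_set_tag(&objp);
	ac->entry[ac->avail++] = objp;
}

/*
 * Pop an object. A caller allowed to use reserves gets the top entry
 * with its tag cleared. Any other caller gets an untagged entry if one
 * exists, or NULL with the cache left unchanged.
 */
static void *toy_get(struct toy_cache *ac, bool reserves_allowed)
{
	void *objp;
	size_t i;

	if (ac->avail == 0)
		return NULL;

	objp = ac->entry[--ac->avail];
	if (!obj_is_tagged(objp))
		return objp;

	if (reserves_allowed) {
		obj_clear_tag(&objp);
		return objp;
	}

	/* Look for an untagged entry and keep the tagged one in its slot. */
	for (i = 0; i < ac->avail; i++) {
		if (!obj_is_tagged(ac->entry[i])) {
			void *found = ac->entry[i];

			ac->entry[i] = objp;
			return found;
		}
	}

	/* Nothing usable for this caller: put the tagged object back. */
	ac->avail++;
	return NULL;
}
```

A NULL return from toy_get() on a non-empty cache corresponds to the
case where ____cache_alloc() in the patch sets force_refill and falls
through to cache_alloc_refill().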
