Netdev List
 help / color / mirror / Atom feed
* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Alexey Kuznetsov @ 2012-06-21 21:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Josh Hunt, davem@davemloft.net, kaber@trash.net,
	Debabrata Banerjee, netdev@vger.kernel.org,
	yoshfuji@linux-ipv6.org, jmorris@namei.org, pekkas@netcore.fi,
	linux-kernel@vger.kernel.org
In-Reply-To: <1340310469.4604.6702.camel@edumazet-glaptop>

On Thu, Jun 21, 2012 at 10:27:49PM +0200, Eric Dumazet wrote:
> Looking at this code, it lacks proper synchronization
> between tree updaters and tree walkers.
> 
> fib6_walker_lock rwlock is not enough to prevent races.

Hmm. As author of this weird code, I must say I honestly believed it was correct.
At least I tried. :-)


What's about 2bec5a336.., it does not look reasonable.
The idea was that when you change tree, you fixup sleeping walkers, moving
their location in tree to correct point. So, walkers must not have any stale pointers
at any times (except when you under table write lock) and no skips/counts are required.
I remember how damn difficult was it to make this right (well, sorry, if it is not yet :-)),
so that understand that if some update is forgotten or done incorrectly, it is not so easy to find,
but it is definitely worth of efforts.

Alexey

^ permalink raw reply

* [PATCH] tcp: Validate route interface in early demux.
From: David Miller @ 2012-06-21 22:03 UTC (permalink / raw)
  To: netdev


Otherwise we might violate reverse path filtering.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp_ipv4.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 13857df..21e22a0 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1676,6 +1676,7 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 	struct net *net = dev_net(skb->dev);
 	const struct iphdr *iph;
 	const struct tcphdr *th;
+	struct net_device *dev;
 	struct sock *sk;
 	int err;
 
@@ -1695,10 +1696,11 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 	if (!pskb_may_pull(skb, ip_hdrlen(skb) + th->doff * 4))
 		goto out_err;
 
+	dev = skb->dev;
 	sk = __inet_lookup_established(net, &tcp_hashinfo,
 				       iph->saddr, th->source,
 				       iph->daddr, th->dest,
-				       skb->dev->ifindex);
+				       dev->ifindex);
 	if (sk) {
 		skb->sk = sk;
 		skb->destructor = sock_edemux;
@@ -1707,8 +1709,12 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 			if (dst)
 				dst = dst_check(dst, 0);
 			if (dst) {
-				skb_dst_set_noref(skb, dst);
-				err = 0;
+				struct rtable *rt = (struct rtable *) dst;
+
+				if (rt->rt_iif == dev->ifindex) {
+					skb_dst_set_noref(skb, dst);
+					err = 0;
+				}
 			}
 		}
 	}
-- 
1.7.10

^ permalink raw reply related

* Re: [net] ixgbe: simplify padding and length checks
From: David Miller @ 2012-06-21 22:04 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: shemminger, netdev, gospo, sassmann
In-Reply-To: <1340314089.2033.16.camel@jtkirshe-mobl>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu, 21 Jun 2012 14:28:09 -0700

> On Thu, 2012-06-21 at 13:38 -0700, David Miller wrote:
>> From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> Date: Thu, 21 Jun 2012 05:15:10 -0700
>> 
>> > From: Stephen Hemminger <shemminger@vyatta.com>
>> > 
>> > The check for length <= 0 is bogus because length is unsigned, and network
>> > stack never sends zero length packets (unless it is totally broken).
>> > 
>> > The check for really small packets can be optimized (using unlikely)
>> > and calling skb_pad directly.
>> > 
>> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>> > Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
>> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> 
>> Not really fixing anything and more of a cleanup, so maybe 'net-next'
>> instead of 'net' for this guy instead?
> 
> Yeah, net-next is fine.  I just verified that the patch applies cleanly
> to net-next as well.

Great, applied, thanks.

^ permalink raw reply

* Re: [PATCH] net: dcb: fix small regression in __dcbnl_pg_setcfg()
From: David Miller @ 2012-06-21 22:06 UTC (permalink / raw)
  To: tgraf; +Cc: john.r.fastabend, netdev, lucy.liu, alexander.h.duyck
In-Reply-To: <20120621105245.GI27921@canuck.infradead.org>

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 21 Jun 2012 06:52:45 -0400

> On Wed, Jun 20, 2012 at 10:56:21PM -0700, John Fastabend wrote:
>> A small regression was introduced in the reply command of
>> dcbnl_pg_setcfg(). User space apps may be expecting the
>> DCB_ATTR_PG_CFG attribute to be returned with the patch
>> below TX or RX variants are returned.
>> 
>> commit 7be994138b188387691322921c08e19bddf6d3c5
>> Author: Thomas Graf <tgraf@suug.ch>
>> Date:   Wed Jun 13 02:54:55 2012 +0000
>> 
>>     dcbnl: Shorten all command handling functions
>> 
>> This patch reverts this behavior and returns DCB_ATTR_PG_CFG
>> 
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> 
> Acked-by: Thomas Graf <tgraf@suug.ch>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH][RESEND] bonding: delete migrated IP addresses from the rlb hash table
From: Jay Vosburgh @ 2012-06-21 23:05 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Andy Gospodarek, netdev
In-Reply-To: <20120620203731.GA4957@midget.suse.cz>

Jiri Bohac <jbohac@suse.cz> wrote:

>Hi, this is a resend of the patch discussed here:
>	http://thread.gmane.org/gmane.linux.network/228076
>It has been updated to apply to the lastest net-next.
[...]
>The hash table is hashed by ip_dst. To be able to do the above
>check efficiently (not walking the whole hash table), we need a
>reverse mapping (by ip_src).
>
>I added three new members in struct rlb_client_info:
>   rx_hashtbl[x].src_first will point to the start of a list of
>      entries for which hash(ip_src) == x.
>   The list is linked with src_next and src_prev.
>
>When an incoming ARP packet arrives at rlb_arp_recv()
>rlb_purge_src_ip() can quickly walk only the entries on the
>corresponding lists, i.e. the entries that are likely to contain
>the offending IP address.
>
>To avoid confusion, I renamed these existing fields of struct 
>rlb_client_info:
>	next -> used_next
>	prev -> used_prev
>	rx_hashtbl_head -> rx_hashtbl_used_head
>
>(The current linked list is _not_ a list of hash table
>entries with colliding ip_dst. It's a list of entries that are
>being used; its purpose is to avoid walking the whole hash table
>when looking for used entries.)

	Just a note that I'm doing some testing with this patch.  Seems
to be ok for the "direct" case (wherein the IP in question is assigned
to the local system); I haven't tried the "bridge" case yet.  I've
extended some of the debugfs stuff to dump the new information, and I'm
trying some of the corner cases (e.g., breaking the linkages in the
middle) to see if it all hangs together.

	I am thinking that the layout of the "hash"-ish table is now
sufficiently complicated that there should be a comment block somewhere
describing what's going on (because I didn't really quite get it until I
dumped the whole thing and looked at it).  With this patch, there is one
"used" linkage for all of the elements in use, plus some number of "src"
linkages, one for each active source hash.  The "src" linkages are also
notable in that they are separate from the "assigned" state.

>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index e15cc11..8505a24 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
[...]
>@@ -354,6 +357,17 @@ static int rlb_arp_recv(const struct sk_buff *skb, struct bonding *bond,
> 	if (!arp)
> 		goto out;
>
>+	/* We received an ARP from arp->ip_src.
>+	 * We might have used this IP address previously (on the bonding host
>+	 * itself or on a system that is bridged together with the bond).
>+	 * However, if arp->mac_src is different than what is stored in
>+	 * rx_hashtbl, some other host is now using the IP and we must prevent
>+	 * sending out client updates with this IP address and the old MAC address.
>+	 * Clean up all hash table entries that have this address as ip_src but
>+	 * have a dirrerent mac_src.

	Typo here; should be "different."

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* [PATCH] ipv4: Add sysctl knob to control early socket demux
From: Alexander Duyck @ 2012-06-21 23:58 UTC (permalink / raw)
  To: netdev; +Cc: jeffrey.t.kirsher, David S. Miller, Eric Dumazet, Alexander Duyck

This change is meant to add a control for disabling early socket demux.
The main motivation behind this patch is to provide an option to disable
the feature as it adds an additional cost to routing that reduces overall
throughput by up to 5%.  For example one of my systems went from 12.1Mpps
to 11.6 after the early socket demux was added.  It looks like the reason
for the regression is that we are now having to perform two lookups, first
the one for an established socket, and then the one for the routing table.

By adding this patch and toggling the value for ip_early_demux to 0 I am
able to get back to the 12.1Mpps I was previously seeing.

Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

I am open to any comments or suggestions on this patch.  I had seen the
earlier discussions and saw mention of adding a control for disabling the
early demux feature so I figured I would just code it up real quick once I
ran into the regression.  I am assuming it is okay to disable the early
demux code since I suspect there is other code in place that will still
handle the demux for the TCP sockets later.  Also I wasn't sure about the
sysctl since I haven't set one up before.

 include/linux/sysctl.h     |    1 +
 include/net/ip.h           |    3 +++
 kernel/sysctl_binary.c     |    2 ++
 net/ipv4/ip_input.c        |   19 +++++++++++--------
 net/ipv4/sysctl_net_ipv4.c |    7 +++++++
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index c34b4c8..20825e5 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -425,6 +425,7 @@ enum
 	NET_TCP_ALLOWED_CONG_CONTROL=123,
 	NET_TCP_MAX_SSTHRESH=124,
 	NET_TCP_FRTO_RESPONSE=125,
+	NET_IPV4_EARLY_DEMUX=126,
 };
 
 enum {
diff --git a/include/net/ip.h b/include/net/ip.h
index 83e0619..50841bd 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -210,6 +210,9 @@ extern int inet_peer_threshold;
 extern int inet_peer_minttl;
 extern int inet_peer_maxttl;
 
+/* From ip_input.c */
+extern int sysctl_ip_early_demux;
+
 /* From ip_output.c */
 extern int sysctl_ip_dynaddr;
 
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index a650694..6a3cf82 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -415,6 +415,8 @@ static const struct bin_table bin_net_ipv4_table[] = {
 	{ CTL_INT,	NET_IPV4_IPFRAG_SECRET_INTERVAL,	"ipfrag_secret_interval" },
 	/* NET_IPV4_IPFRAG_MAX_DIST "ipfrag_max_dist" no longer used */
 
+	{ CTL_INT,	NET_IPV4_EARLY_DEMUX,			"ip_early_demux" },
+
 	{ CTL_INT,	2088 /* NET_IPQ_QMAX */,		"ip_queue_maxlen" },
 
 	/* NET_TCP_DEFAULT_WIN_SCALE unused */
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 93b092c..07de38d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -313,6 +313,8 @@ drop:
 	return true;
 }
 
+int sysctl_ip_early_demux __read_mostly = 1;
+
 static int ip_rcv_finish(struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
@@ -325,14 +327,15 @@ static int ip_rcv_finish(struct sk_buff *skb)
 	if (skb_dst(skb) == NULL) {
 		const struct net_protocol *ipprot;
 		int protocol = iph->protocol;
-		int err;
-
-		rcu_read_lock();
-		ipprot = rcu_dereference(inet_protos[protocol]);
-		err = -ENOENT;
-		if (ipprot && ipprot->early_demux)
-			err = ipprot->early_demux(skb);
-		rcu_read_unlock();
+		int err = -ENOENT;
+
+		if (sysctl_ip_early_demux) {
+			rcu_read_lock();
+			ipprot = rcu_dereference(inet_protos[protocol]);
+			if (ipprot && ipprot->early_demux)
+				err = ipprot->early_demux(skb);
+			rcu_read_unlock();
+		}
 
 		if (err) {
 			err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ef32956..12aa0c5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -301,6 +301,13 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
+		.procname	= "ip_early_demux",
+		.data		= &sysctl_ip_early_demux,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "ip_dynaddr",
 		.data		= &sysctl_ip_dynaddr,
 		.maxlen		= sizeof(int),

^ permalink raw reply related

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Gao feng @ 2012-06-22  3:34 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Eric Dumazet, Josh Hunt, davem@davemloft.net, kaber@trash.net,
	Debabrata Banerjee, netdev@vger.kernel.org,
	yoshfuji@linux-ipv6.org, jmorris@namei.org, pekkas@netcore.fi,
	linux-kernel@vger.kernel.org
In-Reply-To: <20120621215056.GA24908@ms2.inr.ac.ru>

于 2012年06月22日 05:50, Alexey Kuznetsov 写道:
> On Thu, Jun 21, 2012 at 10:27:49PM +0200, Eric Dumazet wrote:
>> Looking at this code, it lacks proper synchronization
>> between tree updaters and tree walkers.
>>
>> fib6_walker_lock rwlock is not enough to prevent races.
> 
> Hmm. As author of this weird code, I must say I honestly believed it was correct.
> At least I tried. :-)
> 
> 
> What's about 2bec5a336.., it does not look reasonable.
> The idea was that when you change tree, you fixup sleeping walkers, moving
> their location in tree to correct point. So, walkers must not have any stale pointers
> at any times (except when you under table write lock) and no skips/counts are required.
> I remember how damn difficult was it to make this right (well, sorry, if it is not yet :-)),
> so that understand that if some update is forgotten or done incorrectly, it is not so easy to find,
> but it is definitely worth of efforts.


Actually, I spent two months to try to reproduce this crash four months ago,
But finally I give up, I don't think there was any stale pointers,
we already correct it when we change the tree.

^ permalink raw reply

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Josh Hunt @ 2012-06-22  6:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org
In-Reply-To: <1340310469.4604.6702.camel@edumazet-glaptop>

On 06/21/2012 03:27 PM, Eric Dumazet wrote:
> On Thu, 2012-06-21 at 14:35 -0500, Josh Hunt wrote:
> 
>> Can anyone provide details of the crash which was intended to be fixed
>> by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
>> doing concurrent adds/deletes and dumping the table via netlink causes
>> duplicate entries to be reported. Reverting this patch causes those
>> problems to go away. We can provide a more detailed test if that is
>> needed, but so far our testing has been unable to reproduce the crash
>> mentioned in the above commit with it reverted.
> 
> A mere revert wont be enough.
> 
> Looking at this code, it lacks proper synchronization
> between tree updaters and tree walkers.
> 
> fib6_walker_lock rwlock is not enough to prevent races.
> 
> Are you willing to fix this yourself ?
> 

Looking through the code a bit more it seems like we would need to have
a lock in fib6_walker_t to protect its contents. Mainly for when we
update the pointers in fib6_del_route and fib6_repair_tree. Right now
there is the fib6_walker_lock, but that appears to only be protecting
the elements of the list, not their contents. Is this what you had in
mind? I just coded up something along these lines and it works for the
most part, but I also got a message about unsafe lock ordering when I
stressed it so I am messing something up. If this sounds like it's on
the right track I can work out the kinks in the morning.

Josh

^ permalink raw reply

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Eric Dumazet @ 2012-06-22  8:29 UTC (permalink / raw)
  To: Josh Hunt
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org
In-Reply-To: <4FE41570.4090803@akamai.com>

On Fri, 2012-06-22 at 01:49 -0500, Josh Hunt wrote:
> On 06/21/2012 03:27 PM, Eric Dumazet wrote:
> > On Thu, 2012-06-21 at 14:35 -0500, Josh Hunt wrote:
> > 
> >> Can anyone provide details of the crash which was intended to be fixed
> >> by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
> >> doing concurrent adds/deletes and dumping the table via netlink causes
> >> duplicate entries to be reported. Reverting this patch causes those
> >> problems to go away. We can provide a more detailed test if that is
> >> needed, but so far our testing has been unable to reproduce the crash
> >> mentioned in the above commit with it reverted.
> > 
> > A mere revert wont be enough.
> > 
> > Looking at this code, it lacks proper synchronization
> > between tree updaters and tree walkers.
> > 
> > fib6_walker_lock rwlock is not enough to prevent races.
> > 
> > Are you willing to fix this yourself ?
> > 
> 
> Looking through the code a bit more it seems like we would need to have
> a lock in fib6_walker_t to protect its contents. Mainly for when we
> update the pointers in fib6_del_route and fib6_repair_tree. Right now
> there is the fib6_walker_lock, but that appears to only be protecting
> the elements of the list, not their contents. Is this what you had in
> mind? I just coded up something along these lines and it works for the
> most part, but I also got a message about unsafe lock ordering when I
> stressed it so I am messing something up. If this sounds like it's on
> the right track I can work out the kinks in the morning.

Hmm, it seems tb6_lock is held by a writer, so its safe :

a tree walker can run only holding a read_lock on tb6_lock

^ permalink raw reply

* [PATCH net] net: qmi_wwan: fix Oops while disconnecting
From: Bjørn Mork @ 2012-06-22  9:11 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-usb-u79uwXL29TY76Z2rM5mHXA, Marius Bjørnstad Kotsbak,
	Bjørn Mork

usbnet_disconnect() will set intfdata to NULL before calling
the minidriver unbind function.  The cdc_wdm subdriver cannot
know that it is disconnecting until the qmi_wwan unbind
function has called its disconnect function.  This means that
we must be able to support the cdc_wdm subdriver operating
normally while usbnet_disconnect() is running, and in
particular that intfdata may be NULL.

The only place this matters is in qmi_wwan_cdc_wdm_manage_power
which is called from cdc_wdm.  Simply testing for NULL
intfdata there is sufficient to allow it to continue working
at all times.

Fixes this Oops where a cdc-wdm device was closed while the
USB device was disconnecting, causing wdm_release to call
qmi_wwan_cdc_wdm_manage_power after intfdata was set to
NULL by usbnet_disconnect:

[41819.087460] BUG: unable to handle kernel NULL pointer dereference at 00000080
[41819.087815] IP: [<f8640458>] qmi_wwan_manage_power+0x68/0x90 [qmi_wwan]
[41819.088028] *pdpt = 000000000314f001 *pde = 0000000000000000
[41819.088028] Oops: 0002 [#1] SMP
[41819.088028] Modules linked in: qmi_wwan option usb_wwan usbserial usbnet
cdc_wdm nls_iso8859_1 nls_cp437 vfat fat usb_storage bnep rfcomm bluetooth
parport_pc ppdev binfmt_misc iptable_nat nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 iptable_mangle iptable_filter ip_tables
x_tables dm_crypt uvcvideo snd_hda_codec_realtek snd_hda_intel
videobuf2_core snd_hda_codec joydev videodev videobuf2_vmalloc
hid_multitouch snd_hwdep arc4 videobuf2_memops snd_pcm snd_seq_midi
snd_rawmidi snd_seq_midi_event ath9k mac80211 snd_seq ath9k_common ath9k_hw
ath snd_timer snd_seq_device sparse_keymap dm_multipath scsi_dh coretemp
mac_hid snd soundcore cfg80211 snd_page_alloc psmouse serio_raw microcode
lp parport dm_mirror dm_region_hash dm_log usbhid hid i915 drm_kms_helper
drm r8169 i2c_algo_bit wmi video [last unloaded: qmi_wwan]
[41819.088028]
[41819.088028] Pid: 23292, comm: qmicli Not tainted 3.4.0-5-generic #11-Ubuntu GIGABYTE T1005/T1005
[41819.088028] EIP: 0060:[<f8640458>] EFLAGS: 00010246 CPU: 1
[41819.088028] EIP is at qmi_wwan_manage_power+0x68/0x90 [qmi_wwan]
[41819.088028] EAX: 00000000 EBX: 00000000 ECX: 000000c3 EDX: 00000000
[41819.088028] ESI: c3b27658 EDI: 00000000 EBP: c298bea4 ESP: c298be98
[41819.088028]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[41819.088028] CR0: 8005003b CR2: 00000080 CR3: 3605e000 CR4: 000007f0
[41819.088028] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[41819.088028] DR6: ffff0ff0 DR7: 00000400
[41819.088028] Process qmicli (pid: 23292, ti=c298a000 task=f343b280 task.ti=c298a000)
[41819.088028] Stack:
[41819.088028]  00000000 c3b27658 e2a80d00 c298beb0 f864051a c3b27600 c298bec0 f9027099
[41819.088028]  c2fd6000 00000008 c298bef0 c1147f96 00000001 00000000 00000000 f4e54790
[41819.088028]  ecf43a00 ecf43a00 c2fd6008 c2fd6000 ebbd7600 ffffffb9 c298bf08 c1144474
[41819.088028] Call Trace:
[41819.088028]  [<f864051a>] qmi_wwan_cdc_wdm_manage_power+0x1a/0x20 [qmi_wwan]
[41819.088028]  [<f9027099>] wdm_release+0x69/0x70 [cdc_wdm]
[41819.088028]  [<c1147f96>] fput+0xe6/0x210
[41819.088028]  [<c1144474>] filp_close+0x54/0x80
[41819.088028]  [<c1046a65>] put_files_struct+0x75/0xc0
[41819.088028]  [<c1046b56>] exit_files+0x46/0x60
[41819.088028]  [<c1046f81>] do_exit+0x141/0x780
[41819.088028]  [<c107248f>] ? wake_up_state+0xf/0x20
[41819.088028]  [<c1053f48>] ? signal_wake_up+0x28/0x40
[41819.088028]  [<c1054f3b>] ? zap_other_threads+0x6b/0x80
[41819.088028]  [<c1047864>] do_group_exit+0x34/0xa0
[41819.088028]  [<c10478e8>] sys_exit_group+0x18/0x20
[41819.088028]  [<c15bb7df>] sysenter_do_call+0x12/0x28
[41819.088028] Code: 04 83 e7 01 c1 e7 03 0f b6 42 18 83 e0 f7 09 f8 88 42
18 8b 43 04 e8 48 9a dd c8 89 f0 8b 5d f4 8b 75 f8 8b 7d fc 89 ec 5d c3 90
<f0> ff 88 80 00 00 00 0f 94 c0 84 c0 75 b7 31 f6 8b 5d f4 89 f0
[41819.088028] EIP: [<f8640458>] qmi_wwan_manage_power+0x68/0x90 [qmi_wwan] SS:ESP 0068:c298be98
[41819.088028] CR2: 0000000000000080
[41819.149492] ---[ end trace 0944479ff8257f55 ]---

Reported-by: Marius Bjørnstad Kotsbak <marius.kotsbak-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # v3.4
Signed-off-by: Bjørn Mork <bjorn-yOkvZcmFvRU@public.gmane.org>
---
 drivers/net/usb/qmi_wwan.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 3767a12..b01960f 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -197,6 +197,10 @@ err:
 static int qmi_wwan_cdc_wdm_manage_power(struct usb_interface *intf, int on)
 {
 	struct usbnet *dev = usb_get_intfdata(intf);
+
+	/* can be called while disconnecting */
+	if (!dev)
+		return 0;
 	return qmi_wwan_manage_power(dev, on);
 }
 
-- 
1.7.10

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH 01/17] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Mel Gorman @ 2012-06-22  9:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior
In-Reply-To: <4FE39290.8020609@redhat.com>

On Thu, Jun 21, 2012 at 05:30:56PM -0400, Rik van Riel wrote:
> On 06/20/2012 07:43 AM, Mel Gorman wrote:
> 
> >+/* Clears ac->pfmemalloc if no slabs have pfmalloc set */
> >+static void check_ac_pfmemalloc(struct kmem_cache *cachep,
> >+						struct array_cache *ac)
> >+{
> 
> >+	pfmemalloc_active = false;
> >+out:
> >+	spin_unlock_irqrestore(&l3->list_lock, flags);
> >+}
> 
> The comment and the function do not seem to match.
> 

Well spotted. There used to be ac->pfmemalloc and obviously I failed to
update the comment when it was removed.

> Otherwise, the patch looks reasonable.

Thanks.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 10/17] netvm: Allow skb allocation to use PFMEMALLOC reserves
From: Mel Gorman @ 2012-06-22 10:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <20120621160902.GA6045@breakpoint.cc>

On Thu, Jun 21, 2012 at 06:09:02PM +0200, Sebastian Andrzej Siewior wrote:
> > <SNIP>
> >
> If merge this chunk
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 6510a5d..2acfec9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -510,7 +510,7 @@ struct sk_buff {
>  #define SKB_ALLOC_RX		0x02
>  
>  /* Returns true if the skb was allocated from PFMEMALLOC reserves */
> -static inline bool skb_pfmemalloc(struct sk_buff *skb)
> +static inline bool skb_pfmemalloc(const struct sk_buff *skb)
>  {
>  	return unlikely(skb->pfmemalloc);
>  }
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index c44ab68..6ce94b5 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -852,7 +852,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>  
>  static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
>  {
> -	if (skb_pfmemalloc((struct sk_buff *)skb))
> +	if (skb_pfmemalloc(skb))
>  		return SKB_ALLOC_RX;
>  	return 0;
>  }
> 
> 
> Then you should be able to drop the case in skb_alloc_rx_flag() without adding
> a warning.
> 

You're right. Thanks.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 10/17] netvm: Allow skb allocation to use PFMEMALLOC reserves
From: Mel Gorman @ 2012-06-22 10:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <20120621163029.GB6045@breakpoint.cc>

On Thu, Jun 21, 2012 at 06:30:29PM +0200, Sebastian Andrzej Siewior wrote:
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 1d6ecc8..9a58dcc 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -167,14 +206,19 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
> >   *	%GFP_ATOMIC.
> >   */
> >  struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > -			    int fclone, int node)
> > +			    int flags, int node)
> >  {
> >  	struct kmem_cache *cache;
> >  	struct skb_shared_info *shinfo;
> >  	struct sk_buff *skb;
> >  	u8 *data;
> > +	bool pfmemalloc;
> >  
> > -	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
> > +	cache = (flags & SKB_ALLOC_FCLONE)
> > +		? skbuff_fclone_cache : skbuff_head_cache;
> > +
> > +	if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
> > +		gfp_mask |= __GFP_MEMALLOC;
> >  
> >  	/* Get the HEAD */
> >  	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
> 
> This is mostly used by nic to refil their RX skb pool. You add the
> __GFP_MEMALLOC to the allocation to rise the change of a successfull refill
> for the swap case.
> A few drivers use build_skb() to create the skb. __netdev_alloc_skb()
> shouldn't be affected since the allocation happens with GFP_ATOMIC. Looking at
> TG3 it uses build_skb() and get_pages() / kmalloc(). Shouldn't this be some
> considered?
> 

While TG3 is not exactly as you describe after rebasing build_skb should
make a similar check to __alloc_skb. As it is always used for RX allocation
from the skbuff_head_cache cache the following should be suitable. Thanks.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9832001..063830c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -310,8 +310,12 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	unsigned int size = frag_size ? : ksize(data);
+	gfp_t gfp_mask = GFP_ATOMIC;
 
-	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	if (sk_memalloc_socks())
+		gfp_mask |= __GFP_MEMALLOC;
+
+	skb = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
 	if (!skb)
 		return NULL;
 

^ permalink raw reply related

* [PATCH] smsc911x.c: encapsulate enable irq calls
From: Matthias Brugger @ 2012-06-22 11:10 UTC (permalink / raw)
  To: Steve Glendinning; +Cc: netdev, Matthias Brugger

We encapsulate enbale irq functionality in a function call.
As on probe the interrupts will be disabled twice, we delete
one.

Signed-off-by: Matthias Brugger <mbrugger@iseebcn.com>
---
 drivers/net/ethernet/smsc/smsc911x.c |   17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smsc911x.c b/drivers/net/ethernet/smsc/smsc911x.c
index 1466e5d..54ca99d 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -1442,6 +1442,14 @@ smsc911x_set_hw_mac_address(struct smsc911x_data *pdata, u8 dev_addr[6])
 	smsc911x_mac_write(pdata, ADDRL, mac_low32);
 }
 
+static void smsc911x_disable_irq_chip(struct net_device *dev)
+{
+	struct smsc911x_data *pdata = netdev_priv(dev);
+
+	smsc911x_reg_write(pdata, INT_EN, 0);
+	smsc911x_reg_write(pdata, INT_STS, 0xFFFFFFFF);
+}
+
 static int smsc911x_open(struct net_device *dev)
 {
 	struct smsc911x_data *pdata = netdev_priv(dev);
@@ -1494,8 +1502,7 @@ static int smsc911x_open(struct net_device *dev)
 	spin_unlock_irq(&pdata->mac_lock);
 
 	/* Initialise irqs, but leave all sources disabled */
-	smsc911x_reg_write(pdata, INT_EN, 0);
-	smsc911x_reg_write(pdata, INT_STS, 0xFFFFFFFF);
+	smsc911x_disable_irq_chip(dev);
 
 	/* Set interrupt deassertion to 100uS */
 	intcfg = ((10 << 24) | INT_CFG_IRQ_EN_);
@@ -2215,9 +2222,6 @@ static int __devinit smsc911x_init(struct net_device *dev)
 	if (smsc911x_soft_reset(pdata))
 		return -ENODEV;
 
-	/* Disable all interrupt sources until we bring the device up */
-	smsc911x_reg_write(pdata, INT_EN, 0);
-
 	ether_setup(dev);
 	dev->flags |= IFF_MULTICAST;
 	netif_napi_add(dev, &pdata->napi, smsc911x_poll, SMSC_NAPI_WEIGHT);
@@ -2434,8 +2438,7 @@ static int __devinit smsc911x_drv_probe(struct platform_device *pdev)
 	smsc911x_reg_write(pdata, INT_CFG, intcfg);
 
 	/* Ensure interrupts are globally disabled before connecting ISR */
-	smsc911x_reg_write(pdata, INT_EN, 0);
-	smsc911x_reg_write(pdata, INT_STS, 0xFFFFFFFF);
+	smsc911x_disable_irq_chip(dev);
 
 	retval = request_irq(dev->irq, smsc911x_irqhandler,
 			     irq_flags | IRQF_SHARED, dev->name, dev);
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH 10/17] netvm: Allow skb allocation to use PFMEMALLOC reserves
From: Sebastian Andrzej Siewior @ 2012-06-22 11:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <20120622105451.GC8271@suse.de>

On Fri, Jun 22, 2012 at 11:54:51AM +0100, Mel Gorman wrote:
> > This is mostly used by nic to refil their RX skb pool. You add the
> > __GFP_MEMALLOC to the allocation to rise the change of a successfull refill
> > for the swap case.
> > A few drivers use build_skb() to create the skb. __netdev_alloc_skb()
> > shouldn't be affected since the allocation happens with GFP_ATOMIC. Looking at
> > TG3 it uses build_skb() and get_pages() / kmalloc(). Shouldn't this be some
> > considered?
> > 
> 
> While TG3 is not exactly as you describe after rebasing build_skb should
> make a similar check to __alloc_skb. As it is always used for RX allocation
> from the skbuff_head_cache cache the following should be suitable. Thanks.

As Eric pointed out you end up in netdev_alloc_frag() which is using
alloc_page(). This is also used by __netdev_alloc_skb().

Sebastian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH net-next] ipv4: tcp: dont cache output dst for syncookies
From: Jesper Dangaard Brouer @ 2012-06-22 11:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Hans Schillstrom
In-Reply-To: <1340204539.4604.1122.camel@edumazet-glaptop>

On Wed, 2012-06-20 at 17:02 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Don't cache output dst for syncookies, as this adds pressure on IP route
> cache and rcu subsystem for no gain.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>

Looks good to me.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

^ permalink raw reply

* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
From: Eric Dumazet @ 2012-06-22 12:33 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher
In-Reply-To: <1340255226.4604.3774.camel@edumazet-glaptop>

On Thu, 2012-06-21 at 07:07 +0200, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 21:07 -0700, Alexander Duyck wrote:
> > On 6/20/2012 6:21 AM, Eric Dumazet wrote:
> 
> > > +		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
> > > +				       (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
> > > +				       NETDEV_FRAG_ORDER);
> 
> > I was wondering if you needed the check for NETDEV_FRAG_ORDER here.  
> >  From what I can tell setting __GFP_COMP for an order 0 page has no 
> > effect since it only seems to get checked in prep_new_page and that is 
> > after a check to verify if the page is order 0 or not.
> 
> Good point, it seems some net drivers could be changed to remove
> useless tests.
> 
> I'll post some performance data as well.

Here is the patch I tested here.

Using 32768 bytes allocations is actually nice for MTU=9000 traffic,
since we can fit 3 frames per 32KB instead of only 2 frames (using
kmalloc-16384 slab))

Also, I prefill page->_count with a high bias value, to avoid the
get_page() we did for each allocated frag.

In my profiles, the get_page() cost was dominant, because of false
sharing with skb consumers (as they might run on different cpus)

This way, when 32768 bytes are filled, we perform a single
atomic_sub_return() and can recycle the page if we find we are the last
user (this is what you did in your patch, when testing page->_count
being 1)

Note : If I used max(PAGE_SIZE, 32678) for MAX_NETDEV_FRAGSIZE,
gcc was not able to optimise get_order(MAX_NETDEV_FRAGSIZE), strange...


diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..d31efa2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -296,9 +296,18 @@ EXPORT_SYMBOL(build_skb);
 struct netdev_alloc_cache {
 	struct page *page;
 	unsigned int offset;
+	unsigned int pagecnt_bias;
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
 
+#if PAGE_SIZE > 32768
+#define MAX_NETDEV_FRAGSIZE	PAGE_SIZE
+#else
+#define MAX_NETDEV_FRAGSIZE	32768
+#endif
+
+#define NETDEV_PAGECNT_BIAS	(MAX_NETDEV_FRAGSIZE /		\
+				 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
 /**
  * netdev_alloc_frag - allocate a page fragment
  * @fragsz: fragment size
@@ -316,18 +325,25 @@ void *netdev_alloc_frag(unsigned int fragsz)
 	nc = &__get_cpu_var(netdev_alloc_cache);
 	if (unlikely(!nc->page)) {
 refill:
-		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
+				       get_order(MAX_NETDEV_FRAGSIZE));
+		if (unlikely(!nc->page))
+			goto end;
+recycle:
+		atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
+		nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
 		nc->offset = 0;
 	}
-	if (likely(nc->page)) {
-		if (nc->offset + fragsz > PAGE_SIZE) {
-			put_page(nc->page);
-			goto refill;
-		}
-		data = page_address(nc->page) + nc->offset;
-		nc->offset += fragsz;
-		get_page(nc->page);
+	if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
+		if (!atomic_sub_return(nc->pagecnt_bias,
+				       &nc->page->_count))
+			goto recycle;
+		goto refill;
 	}
+	data = page_address(nc->page) + nc->offset;
+	nc->offset += fragsz;
+	nc->pagecnt_bias--; /* avoid get_page()/get_page() false sharing */
+end:
 	local_irq_restore(flags);
 	return data;
 }
@@ -353,7 +369,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
 			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+	if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
 		void *data = netdev_alloc_frag(fragsz);
 
 		if (likely(data)) {

^ permalink raw reply related

* Re: [PATCH net] net: qmi_wwan: fix Oops while disconnecting
From: Ming Lei @ 2012-06-22 12:45 UTC (permalink / raw)
  To: Bjørn Mork
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA,
	Marius Bjørnstad Kotsbak
In-Reply-To: <1340356279-3124-1-git-send-email-bjorn-yOkvZcmFvRU@public.gmane.org>

On Fri, Jun 22, 2012 at 5:11 PM, Bjørn Mork <bjorn-yOkvZcmFvRU@public.gmane.org> wrote:
> usbnet_disconnect() will set intfdata to NULL before calling
> the minidriver unbind function.  The cdc_wdm subdriver cannot
> know that it is disconnecting until the qmi_wwan unbind
> function has called its disconnect function.  This means that
> we must be able to support the cdc_wdm subdriver operating
> normally while usbnet_disconnect() is running, and in
> particular that intfdata may be NULL.
>
> The only place this matters is in qmi_wwan_cdc_wdm_manage_power
> which is called from cdc_wdm.  Simply testing for NULL
> intfdata there is sufficient to allow it to continue working
> at all times.
>
> Fixes this Oops where a cdc-wdm device was closed while the
> USB device was disconnecting, causing wdm_release to call
> qmi_wwan_cdc_wdm_manage_power after intfdata was set to
> NULL by usbnet_disconnect:

In fact, it should be a general problem in usbnet, there are races
between usbnet_disconnect and .open/.close. Considered that
unregister_netdev, which will sync with .open/.close,  is called in
usbnet_disconnect,  the simplest fix is to move usb_set_intfdata(NULL)
after unregister_netdev.

> Reported-by: Marius Bjørnstad Kotsbak <marius.kotsbak-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org> # v3.4
> Signed-off-by: Bjørn Mork <bjorn-yOkvZcmFvRU@public.gmane.org>
> ---
>  drivers/net/usb/qmi_wwan.c |    4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
> index 3767a12..b01960f 100644
> --- a/drivers/net/usb/qmi_wwan.c
> +++ b/drivers/net/usb/qmi_wwan.c
> @@ -197,6 +197,10 @@ err:
>  static int qmi_wwan_cdc_wdm_manage_power(struct usb_interface *intf, int on)
>  {
>        struct usbnet *dev = usb_get_intfdata(intf);
> +
> +       /* can be called while disconnecting */
> +       if (!dev)
> +               return 0;
>        return qmi_wwan_manage_power(dev, on);
>  }

Considered that usb_set_intfdata(NULL) will be called after
executing .disconnect in unbind path, it can be removed simply.

So how about the below?

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index e92c057..2eb9e1e 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1286,7 +1286,6 @@ void usbnet_disconnect (struct usb_interface *intf)
 	struct net_device	*net;

 	dev = usb_get_intfdata(intf);
-	usb_set_intfdata(intf, NULL);
 	if (!dev)
 		return;


Thanks,
-- 
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH net] net: qmi_wwan: fix Oops while disconnecting
From: Bjørn Mork @ 2012-06-22 13:42 UTC (permalink / raw)
  To: Ming Lei; +Cc: netdev, linux-usb, Marius Bjørnstad Kotsbak
In-Reply-To: <CACVXFVPkC8NtvOzTsOyNcftZy60fDjtEkV_Da1Z32iYpHFHcUw@mail.gmail.com>

Ming Lei <tom.leiming@gmail.com> writes:

> On Fri, Jun 22, 2012 at 5:11 PM, Bjørn Mork <bjorn@mork.no> wrote:
>> usbnet_disconnect() will set intfdata to NULL before calling
>> the minidriver unbind function.  The cdc_wdm subdriver cannot
>> know that it is disconnecting until the qmi_wwan unbind
>> function has called its disconnect function.  This means that
>> we must be able to support the cdc_wdm subdriver operating
>> normally while usbnet_disconnect() is running, and in
>> particular that intfdata may be NULL.
>>
>> The only place this matters is in qmi_wwan_cdc_wdm_manage_power
>> which is called from cdc_wdm.  Simply testing for NULL
>> intfdata there is sufficient to allow it to continue working
>> at all times.
>>
>> Fixes this Oops where a cdc-wdm device was closed while the
>> USB device was disconnecting, causing wdm_release to call
>> qmi_wwan_cdc_wdm_manage_power after intfdata was set to
>> NULL by usbnet_disconnect:
>
> In fact, it should be a general problem in usbnet, there are races
> between usbnet_disconnect and .open/.close. Considered that
> unregister_netdev, which will sync with .open/.close,  is called in
> usbnet_disconnect,  the simplest fix is to move usb_set_intfdata(NULL)
> after unregister_netdev.

Is there really a race there?  The usbnet .open/.close don't use the
intfdata, do they?  I looked briefly through usbnet for related
potentional problems while fixing this in qmi_wwan, but could only find
suspend/resume.  Which I believe are protected against running on
disconnect.

So I think usbnet in general is OK.

> Considered that usb_set_intfdata(NULL) will be called after
> executing .disconnect in unbind path, it can be removed simply.
>
> So how about the below?
>
> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> index e92c057..2eb9e1e 100644
> --- a/drivers/net/usb/usbnet.c
> +++ b/drivers/net/usb/usbnet.c
> @@ -1286,7 +1286,6 @@ void usbnet_disconnect (struct usb_interface *intf)
>  	struct net_device	*net;
>
>  	dev = usb_get_intfdata(intf);
> -	usb_set_intfdata(intf, NULL);
>  	if (!dev)
>  		return;

I believe that call is there to prevent disconnect from running twice
for the common two-interface CDC ethernet model, like e.g. cdc_ether.
So I don't think it can be removed.  Not without touching the
minidrivers depending on that behaviour, anyway.


Bjørn

^ permalink raw reply

* Re: Bug in net/ipv6/ip6_fib.c:fib6_dump_table()
From: Josh Hunt @ 2012-06-22 13:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem@davemloft.net, kaber@trash.net, Debabrata Banerjee,
	netdev@vger.kernel.org, yoshfuji@linux-ipv6.org,
	jmorris@namei.org, pekkas@netcore.fi, kuznet@ms2.inr.ac.ru,
	linux-kernel@vger.kernel.org
In-Reply-To: <1340353746.4604.9502.camel@edumazet-glaptop>

On 06/22/2012 03:29 AM, Eric Dumazet wrote:
> On Fri, 2012-06-22 at 01:49 -0500, Josh Hunt wrote:
>> On 06/21/2012 03:27 PM, Eric Dumazet wrote:
>>> On Thu, 2012-06-21 at 14:35 -0500, Josh Hunt wrote:
>>>
>>>> Can anyone provide details of the crash which was intended to be fixed
>>>> by 2bec5a369ee79576a3eea2c23863325089785a2c? With this patch in and
>>>> doing concurrent adds/deletes and dumping the table via netlink causes
>>>> duplicate entries to be reported. Reverting this patch causes those
>>>> problems to go away. We can provide a more detailed test if that is
>>>> needed, but so far our testing has been unable to reproduce the crash
>>>> mentioned in the above commit with it reverted.
>>>
>>> A mere revert wont be enough.
>>>
>>> Looking at this code, it lacks proper synchronization
>>> between tree updaters and tree walkers.
>>>
>>> fib6_walker_lock rwlock is not enough to prevent races.
>>>
>>> Are you willing to fix this yourself ?
>>>
>>
>> Looking through the code a bit more it seems like we would need to have
>> a lock in fib6_walker_t to protect its contents. Mainly for when we
>> update the pointers in fib6_del_route and fib6_repair_tree. Right now
>> there is the fib6_walker_lock, but that appears to only be protecting
>> the elements of the list, not their contents. Is this what you had in
>> mind? I just coded up something along these lines and it works for the
>> most part, but I also got a message about unsafe lock ordering when I
>> stressed it so I am messing something up. If this sounds like it's on
>> the right track I can work out the kinks in the morning.
> 
> Hmm, it seems tb6_lock is held by a writer, so its safe :
> 
> a tree walker can run only holding a read_lock on tb6_lock

Ahh. That makes sense and is what Alexey said before I just didn't put
it all together. So we are OK reverting this patch? I cannot find a path
where the walker's pointers are updated without the tb6_lock write_lock.

Josh

^ permalink raw reply

* Re: [PATCH] net: sh_eth: fix the rxdesc pointer when rx descriptor empty happens
From: Guennadi Liakhovetski @ 2012-06-22 14:16 UTC (permalink / raw)
  To: Shimoda, Yoshihiro; +Cc: netdev, SH-Linux
In-Reply-To: <4FE2784A.70208@renesas.com>

Hi Shimoda-san

On Thu, 21 Jun 2012, Shimoda, Yoshihiro wrote:

> Hello Guennadi-san,
> 
> 2012/06/20 22:10, Guennadi Liakhovetski wrote:
> > Hello Shimoda-san
> > 
> > On Tue, 29 May 2012, Shimoda, Yoshihiro wrote:
> > 
> >> When Receive Descriptor Empty happens, rxdesc pointer of the driver
> >> and actual next descriptor of the controller may be mismatch.
> >> This patch fixes it.
> > 
> > Unfortunately, this patch breaks networking on ecovec (sh7724). Booting 
> > with dhcp and NFS-root progresses very slowly with lots of "nfs: server 
> > not responding / Ok" messages and never completes. Reverting it in current 
> > Linus' tree fixes the problem.
> 
> Thank you very much for the report.
> The SH7724 doesn't set the RMCR register. So, the EDRRR will be clear after
> the controller receives a freme every time.
> So, the "fix the rxdesc pointer" had to check a condition.
> 
> I wrote a patch for the issue as the following.
> If possible, would you try the patch?

Yes, it does seem to fix the problem.

Tested-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>

Thanks
Guennadi

> 
> Best regards,
> Yoshihiro Shimoda
> ---
> Subject: [PATCH] net: sh_eth: fix the condition to fix the cur_tx/dirty_rx
> 
> The following commit couldn't work if the RMCR is not set to 1.
> 
> "net: sh_eth: fix the rxdesc pointer when rx descriptor empty happens"
> commit id 79fba9f51755c704c0a7d7b7f0df10874dc0a744
> 
> If RMCR is not set, the controller will clear the EDRRR after it received
> a frame. In this case, the driver doesn't need to fix the value of
> cur_rx/dirty_rx. The driver only needs it when the controll detects
> receive descriptors are empty.
> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> ---
>  drivers/net/ethernet/renesas/sh_eth.c |   12 +++++++-----
>  1 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
> index 667169b..79bf09b 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -1011,7 +1011,7 @@ static int sh_eth_txfree(struct net_device *ndev)
>  }
> 
>  /* Packet receive function */
> -static int sh_eth_rx(struct net_device *ndev)
> +static int sh_eth_rx(struct net_device *ndev, u32 intr_status)
>  {
>  	struct sh_eth_private *mdp = netdev_priv(ndev);
>  	struct sh_eth_rxdesc *rxdesc;
> @@ -1102,9 +1102,11 @@ static int sh_eth_rx(struct net_device *ndev)
>  	/* Restart Rx engine if stopped. */
>  	/* If we don't need to check status, don't. -KDU */
>  	if (!(sh_eth_read(ndev, EDRRR) & EDRRR_R)) {
> -		/* fix the values for the next receiving */
> -		mdp->cur_rx = mdp->dirty_rx = (sh_eth_read(ndev, RDFAR) -
> -					       sh_eth_read(ndev, RDLAR)) >> 4;
> +		/* fix the values for the next receiving if RDE is set */
> +		if (intr_status & EESR_RDE)
> +			mdp->cur_rx = mdp->dirty_rx =
> +				(sh_eth_read(ndev, RDFAR) -
> +				 sh_eth_read(ndev, RDLAR)) >> 4;
>  		sh_eth_write(ndev, EDRRR_R, EDRRR);
>  	}
> 
> @@ -1273,7 +1275,7 @@ static irqreturn_t sh_eth_interrupt(int irq, void *netdev)
>  			EESR_RTSF | /* short frame recv */
>  			EESR_PRE  | /* PHY-LSI recv error */
>  			EESR_CERF)){ /* recv frame CRC error */
> -		sh_eth_rx(ndev);
> +		sh_eth_rx(ndev, intr_status);
>  	}
> 
>  	/* Tx Check */
> -- 
> 1.7.1
> 

---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/

^ permalink raw reply

* re: net: qmi_wwan: bind to both control and data interface
From: Dan Carpenter @ 2012-06-22 14:25 UTC (permalink / raw)
  To: bjorn; +Cc: netdev

Hello Bjørn Mork,

The patch 230718bda1be: "net: qmi_wwan: bind to both control and data 
interface" from Jun 19, 2012, leads to the following Smatch warning:
drivers/net/usb/qmi_wwan.c:206 qmi_wwan_bind()
	 error: potential NULL dereference 'cdc_union'.

drivers/net/usb/qmi_wwan.c
   205          /* verify CDC Union */
   206          if (desc->bInterfaceNumber != cdc_union->bMasterInterface0) {
                                              ^^^^^^^^^

cdc_union is only non-NULL for USB_CDC_UNION_TYPE.  We used to check for
NULL here but your patch removes the check.  I just want to verify that
that was intended.

   207                  dev_err(&intf->dev, "bogus CDC Union: master=%u\n", cdc_union->bMasterInterface0);
   208                  goto err;
   209          }
   210  

regards,
dan carpenter

^ permalink raw reply

* [PATCH 00/17] Swap-over-NBD without deadlocking V13
From: Mel Gorman @ 2012-06-22 14:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman

Changelog since V12
  o Rebase to linux-next-20120622
  o Do not alter coalesce handling in the input path		      (eric.dumazet)
  o Avoid unnecessary cast					      (sebastian)

Changelog since V11
  o Rebase to 3.5-rc3
  o Correct order of page flag free				      (sebastian)

Changelog since V10
  o Rebase to 3.4-rc5
  o Coding style fixups						      (davem)
  o API consistency						      (davem)
  o Rename sk_allocation to sk_gfp_atomic and use only when necessary (davem)
  o Use static branches for sk_memalloc_socks			      (davem)
  o Use static branch checks in fast paths			      (davem)
  o Document concerns about PF_MEMALLOC leaking flags		      (davem)
  o Locking fix in slab						      (mel)

Changelog since V9
  o Rebase to 3.4-rc5
  o Clarify comment on why PF_MEMALLOC is cleared in softirq handling (akpm)
  o Only set page->pfmemalloc if ALLOC_NO_WATERMARKS was required     (rientjes)

Changelog since V8
  o Rebase to 3.4-rc2
  o Use page flag instead of slab fields to keep structures the same size
  o Properly detect allocations from softirq context that use PF_MEMALLOC
  o Ensure kswapd does not sleep while processes are throttled
  o Do not accidentally throttle !_GFP_FS processes indefinitely

Changelog since V7
  o Rebase to 3.3-rc2
  o Take greater care propagating page->pfmemalloc to skb
  o Propagate pfmemalloc from netdev_alloc_page to skb where possible
  o Release RCU lock properly on preempt kernel

Changelog since V6
  o Rebase to 3.1-rc8
  o Use wake_up instead of wake_up_interruptible()
  o Do not throttle kernel threads
  o Avoid a potential race between kswapd going to sleep and processes being
    throttled

Changelog since V5
  o Rebase to 3.1-rc5

Changelog since V4
  o Update comment clarifying what protocols can be used		(Michal)
  o Rebase to 3.0-rc3

Changelog since V3
  o Propogate pfmemalloc from packet fragment pages to skb		(Neil)
  o Rebase to 3.0-rc2

Changelog since V2
  o Document that __GFP_NOMEMALLOC overrides __GFP_MEMALLOC		(Neil)
  o Use wait_event_interruptible					(Neil)
  o Use !! when casting to bool to avoid any possibilitity of type
    truncation								(Neil)
  o Nicer logic when using skb_pfmemalloc_protocol			(Neil)

Changelog since V1
  o Rebase on top of mmotm
  o Use atomic_t for memalloc_socks		(David Miller)
  o Remove use of sk_memalloc_socks in vmscan	(Neil Brown)
  o Check throttle within prepare_to_wait	(Neil Brown)
  o Add statistics on throttling instead of printk

When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.

The Linux Terminal Server Project recommends the use of the
Network Block Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD
at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
The nbd-client also documents the use of NBD as swap. Despite this, the
fact is that a machine using NBD for swap can deadlock within minutes if
swap is used intensively. This patch series addresses the problem.

The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution
is carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.

Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
	preserve access to pages allocated under low memory situations
	to callers that are freeing memory.

Patch 2 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
	reserves without setting PFMEMALLOC.

Patch 3 opens the possibility for softirqs to use PFMEMALLOC reserves
	for later use by network packet processing.

Patch 4 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

Patches 6-13 allows network processing to use PFMEMALLOC reserves when
	the socket has been marked as being used by the VM to clean pages. If
	packets are received and stored in pages that were allocated under
	low-memory situations and are unrelated to the VM, the packets
	are dropped.

	Patch 11 reintroduces __skb_alloc_page which the networking
	folk may object to but is needed in some cases to propogate
	pfmemalloc from a newly allocated page to an skb. If there is a
	strong objection, this patch can be dropped with the impact being
	that swap-over-network will be slower in some cases but it should
	not fail.

Patch 14 is a micro-optimisation to avoid a function call in the
	common case.

Patch 15 tags NBD sockets as being SOCK_MEMALLOC so they can use
	PFMEMALLOC if necessary.

Patch 16 notes that it is still possible for the PFMEMALLOC reserve
	to be depleted. To prevent this, direct reclaimers get throttled on
	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
	expected that kswapd and the direct reclaimers already running
	will clean enough pages for the low watermark to be reached and
	the throttled processes are woken up.

Patch 17 adds a statistic to track how often processes get throttled

Some basic performance testing was run using kernel builds, netperf
on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
and sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant
performance variances.

For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.

Without the patches and using SLUB, the machine locks up within minutes and
runs to completion with them applied. With SLAB, the story is different
as an unpatched kernel run to completion. However, the patched kernel
completed the test 45% faster.

MICRO
                                         3.5.0-rc2 3.5.0-rc2
					 vanilla     swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)             197.80    173.07
User+Sys Time Running Test (seconds)        206.96    182.03
Total Elapsed Time (seconds)               3240.70   1762.09

 drivers/block/nbd.c                               |    6 +-
 drivers/net/ethernet/chelsio/cxgb4/sge.c          |    2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c        |    2 +-
 drivers/net/ethernet/intel/igb/igb_main.c         |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    4 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    3 +-
 drivers/net/usb/cdc-phonet.c                      |    2 +-
 drivers/usb/gadget/f_phonet.c                     |    2 +-
 include/linux/gfp.h                               |   13 +-
 include/linux/mm_types.h                          |    9 +
 include/linux/mmzone.h                            |    1 +
 include/linux/page-flags.h                        |   28 +++
 include/linux/sched.h                             |    7 +
 include/linux/skbuff.h                            |   80 +++++++-
 include/linux/vm_event_item.h                     |    1 +
 include/net/sock.h                                |   19 ++
 include/trace/events/gfpflags.h                   |    1 +
 kernel/softirq.c                                  |    9 +
 mm/page_alloc.c                                   |   46 ++++-
 mm/slab.c                                         |  216 +++++++++++++++++++--
 mm/slub.c                                         |   30 ++-
 mm/vmscan.c                                       |  131 ++++++++++++-
 mm/vmstat.c                                       |    1 +
 net/core/dev.c                                    |   53 ++++-
 net/core/filter.c                                 |    8 +
 net/core/skbuff.c                                 |  124 +++++++++---
 net/core/sock.c                                   |   43 ++++
 net/ipv4/tcp_output.c                             |   12 +-
 net/ipv6/tcp_ipv6.c                               |    8 +-
 29 files changed, 773 insertions(+), 90 deletions(-)

-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH 01/16] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Mel Gorman @ 2012-06-22 14:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340375443-22455-1-git-send-email-mgorman@suse.de>

Allocations of pages below the min watermark run a risk of the
machine hanging due to a lack of memory. To prevent this, only
callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing
an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once
they are allocated to a slab though, nothing prevents other callers
consuming free objects within those slabs. This patch limits access
to slab pages that were alloced from the PFMEMALLOC reserves.

When this patch is applied, pages allocated from below the low watermark are
returned with page->pfmemalloc set and it is up to the caller to determine
how the page should be protected. SLAB restricts access to any page with
page->pfmemalloc set to callers which are known to able to access the
PFMEMALLOC reserve. If one is not available, an attempt is made to allocate
a new page rather than use a reserve. SLUB is a bit more relaxed in that
it only records if the current per-CPU page was allocated from PFMEMALLOC
reserve and uses another partial slab if the caller does not have the
necessary GFP or process flags. This was found to be sufficient in tests
to avoid hangs due to SLUB generally maintaining smaller lists than SLAB.

In low-memory conditions it does mean that !PFMEMALLOC allocators
can fail a slab allocation even though free objects are available
because they are being preserved for callers that are freeing pages.

[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h   |    9 +++
 include/linux/page-flags.h |   28 +++++++
 mm/internal.h              |    3 +
 mm/page_alloc.c            |   27 +++++--
 mm/slab.c                  |  192 +++++++++++++++++++++++++++++++++++++++-----
 mm/slub.c                  |   29 ++++++-
 6 files changed, 263 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 27c741c..ad0ad6f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -54,6 +54,15 @@ struct page {
 		union {
 			pgoff_t index;		/* Our offset within mapping. */
 			void *freelist;		/* slub/slob first free object */
+			bool pfmemalloc;	/* If set by the page allocator,
+						 * ALLOC_PFMEMALLOC was set
+						 * and the low watermark was not
+						 * met implying that the system
+						 * is under some pressure. The
+						 * caller should try ensure
+						 * this page is only used to
+						 * free other pages.
+						 */
 		};
 
 		union {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c88d2a9..e66eb0d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -453,6 +453,34 @@ static inline int PageTransTail(struct page *page)
 }
 #endif
 
+/*
+ * If network-based swap is enabled, sl*b must keep track of whether pages
+ * were allocated from pfmemalloc reserves.
+ */
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	return PageActive(page);
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	SetPageActive(page);
+}
+
+static inline void __ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	__ClearPageActive(page);
+}
+
+static inline void ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	ClearPageActive(page);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1 << PG_mlocked)
 #else
diff --git a/mm/internal.h b/mm/internal.h
index 0e20156..5d4a634 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -282,6 +282,9 @@ static inline struct page *mem_map_next(struct page *iter,
 #define __paginginit __init
 #endif
 
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c29b1c..9c697e5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1505,6 +1505,7 @@ failed:
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC	0x80 /* Caller has PF_MEMALLOC set */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -2262,16 +2263,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((current->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+	if ((current->flags & PF_MEMALLOC) ||
+			unlikely(test_thread_flag(TIF_MEMDIE))) {
+		alloc_flags |= ALLOC_PFMEMALLOC;
+
+		if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
 	return alloc_flags;
 }
 
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2459,10 +2466,18 @@ nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 	return page;
 got_pg:
+	/*
+	 * page->pfmemalloc is set when the caller had PFMEMALLOC set or is
+	 * been OOM killed. The expectation is that the caller is taking
+	 * steps that will free more memory. The caller should avoid the
+	 * page being used for !PFMEMALLOC purposes.
+	 */
+	page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
 	if (kmemcheck_enabled)
 		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
-	return page;
 
+	return page;
 }
 
 /*
@@ -2513,6 +2528,8 @@ retry_cpuset:
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
+	else
+		page->pfmemalloc = false;
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
diff --git a/mm/slab.c b/mm/slab.c
index 64c3d03..c0d51f1 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -123,6 +123,8 @@
 
 #include <trace/events/kmem.h>
 
+#include	"internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *		  0 for faster, smaller code (especially in the critical paths).
@@ -151,6 +153,12 @@
 #define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
 #endif
 
+/*
+ * true if a page was allocated from pfmemalloc reserves for network-based
+ * swap
+ */
+static bool pfmemalloc_active __read_mostly;
+
 /* Legal flag mask for kmem_cache_create(). */
 #if DEBUG
 # define CREATE_MASK	(SLAB_RED_ZONE | \
@@ -256,9 +264,30 @@ struct array_cache {
 			 * Must have this definition in here for the proper
 			 * alignment of array_cache. Also simplifies accessing
 			 * the entries.
+			 *
+			 * Entries should not be directly dereferenced as
+			 * entries belonging to slabs marked pfmemalloc will
+			 * have the lower bits set SLAB_OBJ_PFMEMALLOC
 			 */
 };
 
+#define SLAB_OBJ_PFMEMALLOC	1
+static inline bool is_obj_pfmemalloc(void *objp)
+{
+	return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC;
+}
+
+static inline void set_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
+	return;
+}
+
+static inline void clear_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC);
+}
+
 /*
  * bootstrap: The caches do not work without cpuarrays anymore, but the
  * cpuarrays are allocated from the generic caches...
@@ -926,6 +955,102 @@ static struct array_cache *alloc_arraycache(int node, int entries,
 	return nc;
 }
 
+static inline bool is_slab_pfmemalloc(struct slab *slabp)
+{
+	struct page *page = virt_to_page(slabp->s_mem);
+
+	return PageSlabPfmemalloc(page);
+}
+
+/* Clears pfmemalloc_active if no slabs have pfmalloc set */
+static void recheck_pfmemalloc_active(struct kmem_cache *cachep,
+						struct array_cache *ac)
+{
+	struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
+	struct slab *slabp;
+	unsigned long flags;
+
+	if (!pfmemalloc_active)
+		return;
+
+	spin_lock_irqsave(&l3->list_lock, flags);
+	list_for_each_entry(slabp, &l3->slabs_full, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	list_for_each_entry(slabp, &l3->slabs_partial, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	list_for_each_entry(slabp, &l3->slabs_free, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	pfmemalloc_active = false;
+out:
+	spin_unlock_irqrestore(&l3->list_lock, flags);
+}
+
+static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+						gfp_t flags, bool force_refill)
+{
+	int i;
+	void *objp = ac->entry[--ac->avail];
+
+	/* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
+	if (unlikely(is_obj_pfmemalloc(objp))) {
+		struct kmem_list3 *l3;
+
+		if (gfp_pfmemalloc_allowed(flags)) {
+			clear_obj_pfmemalloc(&objp);
+			return objp;
+		}
+
+		/* The caller cannot use PFMEMALLOC objects, find another one */
+		for (i = 1; i < ac->avail; i++) {
+			/* If a !PFMEMALLOC object is found, swap them */
+			if (!is_obj_pfmemalloc(ac->entry[i])) {
+				objp = ac->entry[i];
+				ac->entry[i] = ac->entry[ac->avail];
+				ac->entry[ac->avail] = objp;
+				return objp;
+			}
+		}
+
+		/*
+		 * If there are empty slabs on the slabs_free list and we are
+		 * being forced to refill the cache, mark this one !pfmemalloc.
+		 */
+		l3 = cachep->nodelists[numa_mem_id()];
+		if (!list_empty(&l3->slabs_free) && force_refill) {
+			struct slab *slabp = virt_to_slab(objp);
+			ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
+			clear_obj_pfmemalloc(&objp);
+			recheck_pfmemalloc_active(cachep, ac);
+			return objp;
+		}
+
+		/* No !PFMEMALLOC objects available */
+		ac->avail++;
+		objp = NULL;
+	}
+
+	return objp;
+}
+
+static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+								void *objp)
+{
+	if (unlikely(pfmemalloc_active)) {
+		/* Some pfmemalloc slabs exist, check if this is one */
+		struct page *page = virt_to_page(objp);
+		if (PageSlabPfmemalloc(page))
+			set_obj_pfmemalloc(&objp);
+	}
+
+	ac->entry[ac->avail++] = objp;
+}
+
 /*
  * Transfer objects in one arraycache to another.
  * Locking must be handled by the caller.
@@ -1102,7 +1227,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 			STATS_INC_ACOVERFLOW(cachep);
 			__drain_alien_cache(cachep, alien, nodeid);
 		}
-		alien->entry[alien->avail++] = objp;
+		ac_put_obj(cachep, alien, objp);
 		spin_unlock(&alien->lock);
 	} else {
 		spin_lock(&(cachep->nodelists[nodeid])->list_lock);
@@ -1782,6 +1907,10 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 		return NULL;
 	}
 
+	/* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+	if (unlikely(page->pfmemalloc))
+		pfmemalloc_active = true;
+
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -1789,9 +1918,13 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	else
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_UNRECLAIMABLE, nr_pages);
-	for (i = 0; i < nr_pages; i++)
+	for (i = 0; i < nr_pages; i++) {
 		__SetPageSlab(page + i);
 
+		if (page->pfmemalloc)
+			SetPageSlabPfmemalloc(page + i);
+	}
+
 	if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
 		kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
 
@@ -1823,6 +1956,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
 				NR_SLAB_UNRECLAIMABLE, nr_freed);
 	while (i--) {
 		BUG_ON(!PageSlab(page));
+		__ClearPageSlabPfmemalloc(page);
 		__ClearPageSlab(page);
 		page++;
 	}
@@ -3094,16 +3228,19 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
+							bool force_refill)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
 	struct array_cache *ac;
 	int node;
 
-retry:
 	check_irq_off();
 	node = numa_mem_id();
+	if (unlikely(force_refill))
+		goto force_grow;
+retry:
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3153,8 +3290,8 @@ retry:
 			STATS_INC_ACTIVE(cachep);
 			STATS_SET_HIGH(cachep);
 
-			ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,
-							    node);
+			ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
+									node));
 		}
 		check_slabp(cachep, slabp);
 
@@ -3173,18 +3310,22 @@ alloc_done:
 
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || force_refill))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
 			goto retry;
 	}
 	ac->touched = 1;
-	return ac->entry[--ac->avail];
+
+	return ac_get_obj(cachep, ac, flags, force_refill);
 }
 
 static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
@@ -3266,23 +3407,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
 	void *objp;
 	struct array_cache *ac;
+	bool force_refill = false;
 
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
 	if (likely(ac->avail)) {
-		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
-		objp = ac->entry[--ac->avail];
-	} else {
-		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = ac_get_obj(cachep, ac, flags, false);
+
 		/*
-		 * the 'ac' may be updated by cache_alloc_refill(),
-		 * and kmemleak_erase() requires its correct value.
+		 * Allow for the possibility all avail objects are not allowed
+		 * by the current flags
 		 */
-		ac = cpu_cache_get(cachep);
+		if (objp) {
+			STATS_INC_ALLOCHIT(cachep);
+			goto out;
+		}
+		force_refill = true;
 	}
+
+	STATS_INC_ALLOCMISS(cachep);
+	objp = cache_alloc_refill(cachep, flags, force_refill);
+	/*
+	 * the 'ac' may be updated by cache_alloc_refill(),
+	 * and kmemleak_erase() requires its correct value.
+	 */
+	ac = cpu_cache_get(cachep);
+
+out:
 	/*
 	 * To avoid a false negative, if an object that is in one of the
 	 * per-CPU caches is leaked, we need to make sure kmemleak doesn't
@@ -3604,9 +3757,12 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
 	struct kmem_list3 *l3;
 
 	for (i = 0; i < nr_objects; i++) {
-		void *objp = objpp[i];
+		void *objp;
 		struct slab *slabp;
 
+		clear_obj_pfmemalloc(&objpp[i]);
+		objp = objpp[i];
+
 		slabp = virt_to_slab(objp);
 		l3 = cachep->nodelists[node];
 		list_del(&slabp->list);
@@ -3724,7 +3880,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
 		cache_flusharray(cachep, ac);
 	}
 
-	ac->entry[ac->avail++] = objp;
+	ac_put_obj(cachep, ac, objp);
 }
 
 /**
diff --git a/mm/slub.c b/mm/slub.c
index cc4ed03..05929df 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -33,6 +33,8 @@
 
 #include <trace/events/kmem.h>
 
+#include "internal.h"
+
 /*
  * Lock order:
  *   1. slub_lock (Global Semaphore)
@@ -1370,6 +1372,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	__SetPageSlab(page);
+	if (page->pfmemalloc)
+		SetPageSlabPfmemalloc(page);
 
 	start = page_address(page);
 
@@ -1413,6 +1417,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
 		-pages);
 
+	__ClearPageSlabPfmemalloc(page);
 	__ClearPageSlab(page);
 	reset_page_mapcount(page);
 	if (current->reclaim_state)
@@ -2132,6 +2137,14 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 	return freelist;
 }
 
+static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
+{
+	if (unlikely(PageSlabPfmemalloc(page)))
+		return gfp_pfmemalloc_allowed(gfpflags);
+
+	return true;
+}
+
 /*
  * Check the page->freelist of a page and either transfer the freelist to the per cpu freelist
  * or deactivate the page.
@@ -2212,6 +2225,18 @@ redo:
 		goto new_slab;
 	}
 
+	/*
+	 * By rights, we should be searching for a slab page that was
+	 * PFMEMALLOC but right now, we are losing the pfmemalloc
+	 * information when the page leaves the per-cpu allocator
+	 */
+	if (unlikely(!pfmemalloc_match(page, gfpflags))) {
+		deactivate_slab(s, page, c->freelist);
+		c->page = NULL;
+		c->freelist = NULL;
+		goto new_slab;
+	}
+
 	/* must check again c->freelist in case of cpu migration or IRQ */
 	freelist = c->freelist;
 	if (freelist)
@@ -2318,8 +2343,8 @@ redo:
 
 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !node_match(page, node)))
-
+	if (unlikely(!object || !node_match(page, node)
+					!pfmemalloc_match(page, gfpflags)))
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 02/16] mm: slub: Optimise the SLUB fast path to avoid pfmemalloc checks
From: Mel Gorman @ 2012-06-22 14:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340375443-22455-1-git-send-email-mgorman@suse.de>

From: Christoph Lameter <cl@linux.com>

This patch removes the check for pfmemalloc from the alloc hotpath and
puts the logic after the election of a new per cpu slab. For a pfmemalloc
page we do not use the fast path but force the use of the slow path which
is also used for the debug case.

This has the side-effect of weakening pfmemalloc processing in the
following way;

1. A process that is allocating for network swap calls __slab_alloc.
   pfmemalloc_match is true so the freelist is loaded and c->freelist is
   now pointing to a pfmemalloc page.

2. A process that is attempting normal allocations calls slab_alloc,
   finds the pfmemalloc page on the freelist and uses it because it did
   not check pfmemalloc_match()

The patch allows non-pfmemalloc allocations to use pfmemalloc pages with
the kmalloc slabs being the most vunerable caches on the grounds they
are most likely to have a mix of pfmemalloc and !pfmemalloc requests. A
later patch will still protect the system as processes will get throttled
if the pfmemalloc reserves get depleted but performance will not degrade
as smoothly.

[mgorman@suse.de: Expanded changelog]
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/slub.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 05929df..87d59ea 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2287,11 +2287,11 @@ new_slab:
 	}
 
 	page = c->page;
-	if (likely(!kmem_cache_debug(s)))
+	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
 		goto load_freelist;
 
 	/* Only entered in the debug case */
-	if (!alloc_debug_processing(s, page, freelist, addr))
+	if (kmem_cache_debug(s) && !alloc_debug_processing(s, page, freelist, addr))
 		goto new_slab;	/* Slab failed checks. Next slab needed */
 
 	deactivate_slab(s, page, get_freepointer(s, freelist));
@@ -2343,8 +2343,7 @@ redo:
 
 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !node_match(page, node)
-					!pfmemalloc_match(page, gfpflags)))
+	if (unlikely(!object || !node_match(page, node)))
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox