Netdev List


* Re: [Xen-devel] [PATCH 1/1] xen/netback: only non-freed SKB is queued into tx_queue
From: Ian Campbell @ 2012-06-29  7:23 UTC (permalink / raw)
  To: David Miller
  Cc: annie.li@oracle.com, xen-devel@lists.xensource.com,
	netdev@vger.kernel.org, konrad.wilk@oracle.com,
	kurt.hackel@oracle.com
In-Reply-To: <20120628.165550.1816352825092253548.davem@davemloft.net>

On Fri, 2012-06-29 at 00:55 +0100, David Miller wrote:
> From: annie.li@oracle.com
> Date: Wed, 27 Jun 2012 18:46:58 +0800
> 
> > From: Annie Li <Annie.li@oracle.com>
> > 
> > After an SKB is queued into tx_queue, it will be freed if request_gop is NULL.
> > However, no dequeue action is called in this situation, so it is likely that
> > tx_queue contains a freed SKB. This patch should fix this issue, and it is
> > based on 3.5.0-rc4+.
> > 
> > This issue was found through code inspection; no bug has been seen with it so far.
> > I ran netperf tests for several hours, and no network regression was found.
> > 
> > Signed-off-by: Annie Li <annie.li@oracle.com>
> 
> I lack the expertise necessary to properly review this, so I really
> need a Xen expert to look this over.

Sorry, I put it to one side waiting for the repost to netdev and then
forgot about it...

Yes, this change looks good to me:

Acked-by: Ian Campbell <ian.campbell@citrix.com>
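
The ordering problem in the quoted patch description can be illustrated
outside the kernel. What follows is a minimal user-space sketch, not the
netback code: struct buf, struct queue, queue_tail() and submit() are
hypothetical names, and the prepare_ok flag merely stands in for
"request_gop is non-NULL". The point is only that a buffer is put on the
queue after the step that may free it, so the queue can never hold a freed
buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical buffer and FIFO types, standing in for SKBs and tx_queue. */
struct buf {
	struct buf *next;
};

struct queue {
	struct buf *head;
	struct buf *tail;
	size_t len;
};

/* Append b to the tail of q. */
void queue_tail(struct queue *q, struct buf *b)
{
	b->next = NULL;
	if (q->tail)
		q->tail->next = b;
	else
		q->head = b;
	q->tail = b;
	q->len++;
}

/* Submit b for transmission. prepare_ok is an assumption of this sketch:
 * it models whether the preparation step succeeded. On failure the buffer
 * is freed and is never queued; only a live buffer reaches the queue. */
int submit(struct queue *q, struct buf *b, int prepare_ok)
{
	if (!prepare_ok) {
		free(b);
		return -1;
	}
	queue_tail(q, b);
	return 0;
}
```

Queuing first and freeing on failure would need a matching dequeue in the
error path; doing the check first avoids that bookkeeping altogether.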


* Re: [PATCH net-next] caif-hsi: Fix merge issues.
From: David Miller @ 2012-06-29  7:48 UTC (permalink / raw)
  To: sjur.brandeland; +Cc: netdev, sjurbren
In-Reply-To: <1340951780-27406-1-git-send-email-sjur.brandeland@stericsson.com>

From: sjur.brandeland@stericsson.com
Date: Fri, 29 Jun 2012 08:36:20 +0200

> From: Sjur Brændeland <sjur.brandeland@stericsson.com>
> 
> Fix the failing merge in net-next by reverting the last
> net-next merge for caif_hsi.c and then merge in the commit:
> "caif-hsi: Bugfix - Piggyback'ed embedded CAIF frame lost"
> from the net repository. 
> 
> The commit:"caif-hsi: Add missing return in error path" from
> net repository was dropped, as it changed code previously removed in the 
> net-next repository.
> 
> Signed-off-by: Sjur Brændeland <sjur.brandeland@stericsson.com>

Applied, thanks a lot.


* Re: [Xen-devel] [PATCH 1/1] xen/netback: only non-freed SKB is queued into tx_queue
From: David Miller @ 2012-06-29  7:50 UTC (permalink / raw)
  To: Ian.Campbell; +Cc: annie.li, xen-devel, netdev, konrad.wilk, kurt.hackel
In-Reply-To: <1340954589.5953.12.camel@dagon.hellion.org.uk>

From: Ian Campbell <Ian.Campbell@citrix.com>
Date: Fri, 29 Jun 2012 08:23:09 +0100

> On Fri, 2012-06-29 at 00:55 +0100, David Miller wrote:
>> From: annie.li@oracle.com
>> Date: Wed, 27 Jun 2012 18:46:58 +0800
>> 
>> > From: Annie Li <Annie.li@oracle.com>
>> > 
>> > After an SKB is queued into tx_queue, it will be freed if request_gop is NULL.
>> > However, no dequeue action is called in this situation, so it is likely that
>> > tx_queue contains a freed SKB. This patch should fix this issue, and it is
>> > based on 3.5.0-rc4+.
>> > 
>> > This issue was found through code inspection; no bug has been seen with it so far.
>> > I ran netperf tests for several hours, and no network regression was found.
>> > 
>> > Signed-off-by: Annie Li <annie.li@oracle.com>
>> 
>> I lack the expertise necessary to properly review this, so I really
>> need a Xen expert to look this over.
> 
> Sorry, I put it to one side waiting for the repost to netdev and then
> forgot about it...
> 
> Yes, this change looks good to me:
> 
> Acked-by: Ian Campbell <ian.campbell@citrix.com>

Thanks, applied to net-next.


* Re: [PATCH] ipv6_tunnel: Allow receiving packets on the fallback tunnel if they pass sanity checks
From: David Miller @ 2012-06-29  7:52 UTC (permalink / raw)
  To: phil; +Cc: netdev, phild, ville.nuorvala
In-Reply-To: <20120629041552.GA27362@ipom.com>

From: Phil Dibowitz <phil@ipom.com>
Date: Thu, 28 Jun 2012 21:15:52 -0700

> From: Ville Nuorvala <ville.nuorvala@gmail.com>
> 
> At Facebook, we do Layer-3 DSR via IP-in-IP tunneling. Our load balancers wrap
> an extra IP header on incoming packets so they can be routed to the backend.
> In the v4 tunnel driver, when these packets fall on the default tunl0 device,
> the behavior is to decapsulate them and drop them back on the stack. So our
> setup is that tunl0 has the VIP and eth0 has (obviously) the backend's real
> address.
> 
> In IPv6 we do the same thing, but the v6 tunnel driver didn't have this same
> behavior - if you didn't have an explicit tunnel setup, it would drop the
> packet.
> 
> This patch brings that v4 feature to the v6 driver.
> 
> The same IPv6 address checks are performed as with any normal tunnel,
> but as the fallback tunnel endpoint addresses are unspecified, the checks
> must be performed on a per-packet basis, rather than at tunnel
> configuration time.
> 
> [Patch description modified by phil@ipom.com]
> 
> Signed-off-by: Ville Nuorvala <ville.nuorvala@gmail.com>
> Tested-by: Phil Dibowitz <phil@ipom.com>

Applied to net-next


* Re: [PATCH net-next] fq_codel: report congestion notification at enqueue time
From: David Miller @ 2012-06-29  7:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nanditad, netdev, codel, ycheng, ncardwell, mattmathis
In-Reply-To: <1340949008.29822.73.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 29 Jun 2012 07:50:08 +0200

> Hmm, problem is the sender thinks the packet was queued for
> transmission.
> 
>         ret = macvlan_queue_xmit(skb, dev);
>         if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN)) {
>                 struct macvlan_pcpu_stats *pcpu_stats;
> 
>                 pcpu_stats = this_cpu_ptr(vlan->pcpu_stats);
>                 u64_stats_update_begin(&pcpu_stats->syncp);
>                 pcpu_stats->tx_packets++;
>                 pcpu_stats->tx_bytes += len;
>                 u64_stats_update_end(&pcpu_stats->syncp);
>         } else {
>                 this_cpu_inc(vlan->pcpu_stats->tx_dropped);
>         }
> 
> NET_XMIT_CN has a lazy semantic it seems.
> 
> I will just not rely on it.

I think we cannot just ignore this issue.  I will take a deeper look,
because we should have NET_XMIT_CN be very well defined and adjust any
mis-use.


* Re: [PATCH net-next] net: l2tp_eth: provide tx_dropped counter
From: David Miller @ 2012-06-29  7:54 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1340950513.29822.103.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 29 Jun 2012 08:15:13 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> Change l2tp_xmit_skb() to return NET_XMIT_DROP in case skb is dropped.
> 
> Use kfree_skb() instead of dev_kfree_skb() for drop_monitor pleasure.
> 
> Support tx_dropped counter for l2tp_eth
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.


* Re: [PATCH net-next 1/1] netxen_nic: restrict force firmware dump when dump is disabled.
From: David Miller @ 2012-06-29  7:54 UTC (permalink / raw)
  To: rajesh.borundia; +Cc: netdev, ameen.rahman, manish.chopra
In-Reply-To: <1340950341-27252-2-git-send-email-rajesh.borundia@qlogic.com>

From: Rajesh Borundia <rajesh.borundia@qlogic.com>
Date: Fri, 29 Jun 2012 02:12:21 -0400

> From: Manish chopra <manish.chopra@qlogic.com>
> 
> o Set the ethtool_dump flag (=ETH_FW_DUMP_DISABLE) when dump is disabled.
> o update driver version to 4.0.80
> 
> Signed-off-by: Manish chopra <manish.chopra@qlogic.com>
> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>

Applied.


* Re: [PATCH net-next] fq_codel: report congestion notification at enqueue time
From: David Miller @ 2012-06-29  8:04 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nanditad, netdev, codel, ycheng, ncardwell, mattmathis
In-Reply-To: <1340949008.29822.73.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 29 Jun 2012 07:50:08 +0200

> Hmm, problem is the sender thinks the packet was queued for
> transmission.
> 
>         ret = macvlan_queue_xmit(skb, dev);
>         if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN)) {
>                 struct macvlan_pcpu_stats *pcpu_stats;
> 
>                 pcpu_stats = this_cpu_ptr(vlan->pcpu_stats);
>                 u64_stats_update_begin(&pcpu_stats->syncp);
>                 pcpu_stats->tx_packets++;
>                 pcpu_stats->tx_bytes += len;
>                 u64_stats_update_end(&pcpu_stats->syncp);
>         } else {
>                 this_cpu_inc(vlan->pcpu_stats->tx_dropped);
>         }

Ok, that is the meaning this has taken on.  Same test exists in
vlan_dev.c and this test used to be present also in the ipip.h macros
some time ago.

Nobody really does anything special with this value, except to
translate it to zero when propagating back to sockets.

The only thing it guards is the selection of which statistic to
increment.

For all practical purposes it is treated as NET_XMIT_SUCCESS except in
one location, pktgen, where it causes the errors counter to increment.

Looking this over, I'd say we should just get rid of it.
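
The accounting behaviour discussed above can be sketched in stand-alone C.
This is an illustration under assumptions, not kernel code: the XMIT_*
constants are local stand-ins whose values are assumed to match the
kernel's NET_XMIT_SUCCESS, NET_XMIT_DROP and NET_XMIT_CN, and struct
tx_stats, account_xmit() and xmit_eval() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel return codes (values are an assumption). */
enum {
	XMIT_SUCCESS = 0x00,
	XMIT_DROP    = 0x01,
	XMIT_CN      = 0x02
};

/* Hypothetical counters; not the kernel's macvlan_pcpu_stats. */
struct tx_stats {
	uint64_t tx_packets;
	uint64_t tx_bytes;
	uint64_t tx_dropped;
};

/* Account a transmit result the way the quoted macvlan test does:
 * congestion notification is counted as a successful transmit, and
 * anything else is counted as a drop. */
void account_xmit(struct tx_stats *s, int ret, unsigned int len)
{
	if (ret == XMIT_SUCCESS || ret == XMIT_CN) {
		s->tx_packets++;
		s->tx_bytes += len;
	} else {
		s->tx_dropped++;
	}
}

/* Value handed back towards a socket: CN is folded into 0 (success),
 * every other code is passed through unchanged. */
int xmit_eval(int ret)
{
	return ret == XMIT_CN ? 0 : ret;
}
```

With both helpers treating CN like success, the only observable effect of
the value is which counter gets incremented, which is the point made above.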



* AW: RFC: replace packets already in queue
From: Erdt, Ralph @ 2012-06-29  8:46 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev@vger.kernel.org
In-Reply-To: <4FEC854E.8080603@hp.com>

Hello Rick Jones,

> You might want to try the recent "codel" additions to the stack.  They
> seek to keep the size of queues more manageable while still allowing
> the occasional burst.
Thank you for your hint. This is surely a useful solution in a normal network, but it didn't help us:
We are working with very heterogeneous networks:
Internal: 100MBit and more.
External: 9.6*K*Bit and LESS(*), and shared, and...
Some further details: wireless (higher packet loss rate), medium access time > 100ms, RTT (standard ping) with IDLE network: 1.5 *seconds*, RTT with network load: minutes(!), and so on. Just very shocking..

TCP isn't usable over such a link. So we are only sending UDP. Codel didn't help us, as codel addresses the flow speed. It drops packets "randomly" (I know it's not random at the lower level, but it's random from the application's perspective).

I'm addressing the amount of information: trying to reduce it intelligently by REPLACING old packets with new ones. Sure, the application must handle this. But in such a network an administrator has to configure the queues, and he knows the applications.
In a private mail someone guessed that we are doing VoIP. No, we just want to send status information (e.g. sensor information), which becomes obsolete when newer information is available.

I know this is a very special problem, which doesn't occur in normal or even abnormal situations. But I'm sure some other people have this problem, too. So I'm glad to share my solution.

(*you remember the good ol' times with modems over telephone lines? When the internet was called BBS? And how it suddenly felt when the BBS started using ANSI? That was comfortable compared to our problem..)

Greetings
Ralph Erdt
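
The replacement idea described above can be sketched in a few lines of
user-space C. This is not the proposed qdisc: struct status_pkt,
struct replace_queue, rq_enqueue() and rq_dequeue() are hypothetical names,
the key field (for example a sensor id) is an assumption about how "the
same information" would be recognised, and keeping the old packet's queue
position on replacement is a design choice of the sketch rather than
something stated in the mail.

```c
#include <assert.h>
#include <stddef.h>

#define QLEN 8	/* hypothetical queue capacity */

struct status_pkt {
	unsigned int key;	/* identifies the information source */
	int value;		/* payload stand-in */
};

struct replace_queue {
	struct status_pkt slot[QLEN];
	size_t count;
};

/* Enqueue pkt. If a packet with the same key is still waiting, overwrite
 * it in place so that it keeps its position in the queue.
 * Returns 1 if an old packet was replaced, 0 if pkt was appended,
 * -1 if the queue is full. */
int rq_enqueue(struct replace_queue *q, struct status_pkt pkt)
{
	size_t i;

	for (i = 0; i < q->count; i++) {
		if (q->slot[i].key == pkt.key) {
			q->slot[i] = pkt;
			return 1;
		}
	}
	if (q->count == QLEN)
		return -1;
	q->slot[q->count++] = pkt;
	return 0;
}

/* Remove the oldest packet into *out. Returns 0, or -1 if empty. */
int rq_dequeue(struct replace_queue *q, struct status_pkt *out)
{
	size_t i;

	if (q->count == 0)
		return -1;
	*out = q->slot[0];
	for (i = 1; i < q->count; i++)
		q->slot[i - 1] = q->slot[i];
	q->count--;
	return 0;
}
```

On a slow link this bounds the queue at one packet per key, so the receiver
never gets a status update that was already superseded while it was waiting.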


* Re: [RFC PATCH net-next] ipvs: add missing lock in ip_vs_ftp_init_conn()
From: Julian Anastasov @ 2012-06-29  9:04 UTC (permalink / raw)
  To: Xiaotian Feng
  Cc: netdev, lvs-devel, netfilter-devel, netfilter, coreteam,
	linux-kernel, Xiaotian Feng, Wensong Zhang, Simon Horman,
	Pablo Neira Ayuso, Patrick McHardy, David S. Miller
In-Reply-To: <CAJn8CcFy=K+Aizpi0pvnpXCOYXhgyq12oBgVaPvMthW_fwn4Pg@mail.gmail.com>


	Hello,

On Fri, 29 Jun 2012, Xiaotian Feng wrote:

> > On Thu, 28 Jun 2012, Xiaotian Feng wrote:
> >
> >> We met a kernel panic in 2.6.32.43 kernel:
> >>
> >> [2680191.848044] IPVS: ip_vs_conn_hash(): request for already hashed, called from run_timer_softirq+0x175/0x1d0
> >> <snip>
> >> [2680311.849009] general protection fault: 0000 [#1] SMP

	What we see here is 120 seconds between 2680191 and
2680311. It can mean 2 things:

- some state timeout, it depends on your forwarding method.
What is it? NAT? DR?

- 60 seconds for ip_vs_conn_expire retries

> >> After code review, the only chance that kernel change connection flag without protection is
> >> in ip_vs_ftp_init_conn().
> >
> >        Hm, ip_vs_ftp_init_conn is called before 1st hashing,
> > from ip_vs_bind_app() in ip_vs_conn_new() before
> > ip_vs_conn_hash(). It should be another problem with
> > the flags. How different is IPVS in 2.6.32.43 compared to
> > recent kernels? If commit aea9d711 is present, I'm not
> > aware of other similar problems.
> 
> ip_vs_bind_app() is also called by ip_vs_try_bind_dest(), which can be
> traced to ip_vs_proc_conn().
> I've checked the changes in upstream, but nothing helps since aea9d711
> has been taken into 2.6.32.28 kernel.

	OK, this fix should make it safe for master-backup
sync and it should be applied but I suspect you are not
using sync, right? And then this fix will not solve the oops.

	There are not many places that rehash conn:

ip_vs_conn_fill_cport
	- used for FTP

ip_vs_check_template:
	- do you have persistence configured?

	After you provide details for the used forwarding
method, persistence and sync we should think how such races
with rehashing can lead to double hlist_del. Maybe
you can modify the debug message in ip_vs_conn_hash, so
that we can see cp->flags and ntohs of cp->cport, cp->dport
and cp->vport when oops happens again.

Regards

--
Julian Anastasov <ja@ssi.bg>


* [PATCH] ipv4: Elide fib_validate_source() completely when possible.
From: David Miller @ 2012-06-29  9:05 UTC (permalink / raw)
  To: netdev


If rpfilter is off (or the SKB has an IPSEC path) and there are no
tclassid users, we don't have to do anything at all when
fib_validate_source() is invoked besides setting the itag to zero.

We monitor tclassid uses with a counter (modified only under RTNL and
marked __read_mostly) and we protect the fib_validate_source() real
work with a test against this counter and whether rpfilter is to be
done.

Having a way to know whether we need tclassid processing or not
also opens the door for future optimized rpfilter algorithms that do
not perform full FIB lookups.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/fib_rules.h  |    1 +
 include/net/ip_fib.h     |    5 +++++
 net/core/fib_rules.c     |    4 ++++
 net/ipv4/fib_frontend.c  |   32 ++++++++++++++++++++++++--------
 net/ipv4/fib_rules.c     |   16 +++++++++++++++-
 net/ipv4/fib_semantics.c |   10 ++++++++++
 6 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 075f1e3..e361f48 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -52,6 +52,7 @@ struct fib_rules_ops {
 					     struct sk_buff *,
 					     struct fib_rule_hdr *,
 					     struct nlattr **);
+	void			(*delete)(struct fib_rule *);
 	int			(*compare)(struct fib_rule *,
 					   struct fib_rule_hdr *,
 					   struct nlattr **);
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 619f68a..3dc7c96 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -235,6 +235,11 @@ extern int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 			       u8 tos, int oif, struct net_device *dev,
 			       struct in_device *idev, u32 *itag);
 extern void fib_select_default(struct fib_result *res);
+#ifdef CONFIG_IP_ROUTE_CLASSID
+extern int fib_num_tclassid_users;
+#else
+#define fib_num_tclassid_users 0
+#endif
 
 /* Exported by fib_semantics.c */
 extern int ip_fib_check_default(__be32 gw, struct net_device *dev);
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 72cceb7..ab7db83 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -151,6 +151,8 @@ static void fib_rules_cleanup_ops(struct fib_rules_ops *ops)
 
 	list_for_each_entry_safe(rule, tmp, &ops->rules_list, list) {
 		list_del_rcu(&rule->list);
+		if (ops->delete)
+			ops->delete(rule);
 		fib_rule_put(rule);
 	}
 }
@@ -499,6 +501,8 @@ static int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
 
 		notify_rule_change(RTM_DELRULE, rule, ops, nlh,
 				   NETLINK_CB(skb).pid);
+		if (ops->delete)
+			ops->delete(rule);
 		fib_rule_put(rule);
 		flush_route_cache(ops);
 		rules_ops_put(ops);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c84cff5..ae528d1 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -31,6 +31,7 @@
 #include <linux/if_addr.h>
 #include <linux/if_arp.h>
 #include <linux/skbuff.h>
+#include <linux/cache.h>
 #include <linux/init.h>
 #include <linux/list.h>
 #include <linux/slab.h>
@@ -217,6 +218,10 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
 	return inet_select_addr(dev, ip_hdr(skb)->saddr, scope);
 }
 
+#ifdef CONFIG_IP_ROUTE_CLASSID
+int fib_num_tclassid_users __read_mostly;
+#endif
+
 /* Given (packet source, input interface) and optional (dst, oif, tos):
  * - (main) check, that source is valid i.e. not broadcast or our local
  *   address.
@@ -225,11 +230,11 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
  * - check, that packet arrived from expected physical interface.
  * called with rcu_read_lock()
  */
-int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, u8 tos,
-			int oif, struct net_device *dev, struct in_device *idev,
-			u32 *itag)
+static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
+				 u8 tos, int oif, struct net_device *dev,
+				 int rpf, struct in_device *idev, u32 *itag)
 {
-	int ret, no_addr, rpf, accept_local;
+	int ret, no_addr, accept_local;
 	struct fib_result res;
 	struct flowi4 fl4;
 	struct net *net;
@@ -242,12 +247,9 @@ int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, u8 tos,
 	fl4.flowi4_tos = tos;
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
 
-	no_addr = rpf = accept_local = 0;
+	no_addr = accept_local = 0;
 	no_addr = idev->ifa_list == NULL;
 
-	/* Ignore rp_filter for packets protected by IPsec. */
-	rpf = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(idev);
-
 	accept_local = IN_DEV_ACCEPT_LOCAL(idev);
 	fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
 
@@ -303,6 +305,20 @@ e_rpf:
 	return -EXDEV;
 }
 
+/* Ignore rp_filter for packets protected by IPsec. */
+int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
+			u8 tos, int oif, struct net_device *dev,
+			struct in_device *idev, u32 *itag)
+{
+	int r = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(idev);
+
+	if (!r && !fib_num_tclassid_users) {
+		*itag = 0;
+		return 0;
+	}
+	return __fib_validate_source(skb, src, dst, tos, oif, dev, r, idev, itag);
+}
+
 static inline __be32 sk_extract_addr(struct sockaddr *addr)
 {
 	return ((struct sockaddr_in *) addr)->sin_addr.s_addr;
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 2d043f7..b23fd95 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -169,8 +169,11 @@ static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 		rule4->dst = nla_get_be32(tb[FRA_DST]);
 
 #ifdef CONFIG_IP_ROUTE_CLASSID
-	if (tb[FRA_FLOW])
+	if (tb[FRA_FLOW]) {
 		rule4->tclassid = nla_get_u32(tb[FRA_FLOW]);
+		if (rule4->tclassid)
+			fib_num_tclassid_users++;
+	}
 #endif
 
 	rule4->src_len = frh->src_len;
@@ -184,6 +187,16 @@ errout:
 	return err;
 }
 
+static void fib4_rule_delete(struct fib_rule *rule)
+{
+#ifdef CONFIG_IP_ROUTE_CLASSID
+	struct fib4_rule *rule4 = (struct fib4_rule *) rule;
+
+	if (rule4->tclassid)
+		fib_num_tclassid_users--;
+#endif
+}
+
 static int fib4_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
 			     struct nlattr **tb)
 {
@@ -256,6 +269,7 @@ static const struct fib_rules_ops __net_initdata fib4_rules_ops_template = {
 	.action		= fib4_rule_action,
 	.match		= fib4_rule_match,
 	.configure	= fib4_rule_configure,
+	.delete		= fib4_rule_delete,
 	.compare	= fib4_rule_compare,
 	.fill		= fib4_rule_fill,
 	.default_pref	= fib_default_rule_pref,
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 415f823..c46c20b 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -163,6 +163,12 @@ void free_fib_info(struct fib_info *fi)
 		return;
 	}
 	fib_info_cnt--;
+#ifdef CONFIG_IP_ROUTE_CLASSID
+	change_nexthops(fi) {
+		if (nexthop_nh->nh_tclassid)
+			fib_num_tclassid_users--;
+	} endfor_nexthops(fi);
+#endif
 	call_rcu(&fi->rcu, free_fib_info_rcu);
 }
 
@@ -421,6 +427,8 @@ static int fib_get_nhs(struct fib_info *fi, struct rtnexthop *rtnh,
 #ifdef CONFIG_IP_ROUTE_CLASSID
 			nla = nla_find(attrs, attrlen, RTA_FLOW);
 			nexthop_nh->nh_tclassid = nla ? nla_get_u32(nla) : 0;
+			if (nexthop_nh->nh_tclassid)
+				fib_num_tclassid_users++;
 #endif
 		}
 
@@ -815,6 +823,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 		nh->nh_flags = cfg->fc_flags;
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		nh->nh_tclassid = cfg->fc_flow;
+		if (nh->nh_tclassid)
+			fib_num_tclassid_users++;
 #endif
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 		nh->nh_weight = 1;
-- 
1.7.10
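
The fast path described in the commit message above can be exercised in
user space. The following is a minimal sketch under stated assumptions, not
the kernel function: num_tclassid_users stands in for
fib_num_tclassid_users, validate_source_slow() stands in for
__fib_validate_source(), and the has_secpath/rpfilter parameters and the
slow_path_calls counter are hypothetical, added only so the effect can be
observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for fib_num_tclassid_users (modified under RTNL in the patch). */
int num_tclassid_users;

/* Hypothetical instrumentation: how often the expensive path ran. */
int slow_path_calls;

/* Stand-in for __fib_validate_source(), i.e. the full FIB lookup. */
int validate_source_slow(int rpf, uint32_t *itag)
{
	(void)rpf;
	slow_path_calls++;
	*itag = 0;	/* a real lookup could produce a non-zero tag */
	return 0;
}

/* Same shape as the new fib_validate_source() wrapper: when reverse-path
 * filtering is off (or the packet has an IPsec path) and nothing uses a
 * tclassid, zero the tag and return without doing any lookup. */
int validate_source(bool has_secpath, int rpfilter, uint32_t *itag)
{
	int r = has_secpath ? 0 : rpfilter;

	if (!r && !num_tclassid_users) {
		*itag = 0;
		return 0;
	}
	return validate_source_slow(r, itag);
}
```

The counter is read on every packet but written only on configuration
changes, which is why the patch marks it __read_mostly.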


* Re: AW: RFC: replace packets already in queue
From: Eric Dumazet @ 2012-06-29  9:06 UTC (permalink / raw)
  To: Erdt, Ralph; +Cc: Rick Jones, netdev@vger.kernel.org
In-Reply-To: <FB112703C4930F4ABEBB5B763F96491139378E5A@MAILSERV2A.lorien.fkie.fgan.de>

On Fri, 2012-06-29 at 08:46 +0000, Erdt, Ralph wrote:
> Hello Rick Jones,
> 
> > You might want to try the recent "codel" additions to the stack.  They
> > seek to keep the size of queues more manageable while still allowing
> > the occasional burst.
> Thank you for your hint. This is surely a useful solution in a normal network, but it didn't help us:
> We are working with very heterogeneous networks:
> Internal: 100MBit and more.
> External: 9.6*K*Bit and LESS(*), and shared, and...
> Some further details: wireless (higher packet loss rate), medium access time > 100ms, RTT (standard ping) with IDLE network: 1.5 *seconds*, RTT with network load: minutes(!), and so on. Just very shocking..
> 
> TCP isn't usable over such a link. So we are only sending UDP. Codel didn't help us, as codel addresses the flow speed. It drops packets "randomly" (I know it's not random at the lower level, but it's random from the application's perspective).
> 
> I'm addressing the amount of information: trying to reduce it intelligently by REPLACING old packets with new ones. Sure, the application must handle this. But in such a network an administrator has to configure the queues, and he knows the applications.
> In a private mail someone guessed that we are doing VoIP. No, we just want to send status information (e.g. sensor information), which becomes obsolete when newer information is available.
> 
> I know this is a very special problem, which doesn't occur in normal or even abnormal situations. But I'm sure some other people have this problem, too. So I'm glad to share my solution.
> 
> (*you remember the good ol' times with modems over telephone lines? When the internet was called BBS? And how it suddenly felt when the BBS started using ANSI? That was comfortable compared to our problem..)

Problem is : with wireless, chances are high that the old packet is not
waiting in qdisc, but in wireless queues.

Anyway, adding a maxdelay to codel / fq_codel is really easy: this
would drop a packet if its sojourn time is above a given limit.

You could use codel with @target being greater than @maxdelay to remove
all probabilistic drops, and only keep the @maxdelay behavior.

If you want I can cook this patch.
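
The maxdelay check suggested above can be sketched on its own, without any
of the codel state machine. This is an assumption-laden illustration, not
the patch offered here: struct pkt, exceeds_maxdelay() and dequeue_fresh()
are hypothetical names, timestamps are plain microsecond counters assumed
to come from a monotonic clock, and the queue is a simple array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queued packet; only the enqueue timestamp matters here. */
struct pkt {
	uint64_t enqueue_time_us;
};

/* True if the packet has waited longer than maxdelay_us.
 * Assumes now_us >= enqueue_time_us (monotonic clock). */
bool exceeds_maxdelay(const struct pkt *p, uint64_t now_us,
		      uint64_t maxdelay_us)
{
	uint64_t sojourn_us = now_us - p->enqueue_time_us;

	return sojourn_us > maxdelay_us;
}

/* Walk a FIFO of n packets from the head and skip (drop) every packet
 * whose sojourn time is above the limit. Returns the index of the first
 * packet still worth sending, or n if all of them were stale. */
size_t dequeue_fresh(const struct pkt *q, size_t n, uint64_t now_us,
		     uint64_t maxdelay_us)
{
	size_t i = 0;

	while (i < n && exceeds_maxdelay(&q[i], now_us, maxdelay_us))
		i++;
	return i;
}
```

Because the drop decision depends only on how long a packet has waited, it
is deterministic from the application's point of view, unlike the
probabilistic drops that the mail above proposes to disable by choosing a
target larger than maxdelay.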


* [PATCH net-next 1/2] r8169: support RTL8106E
From: Hayes Wang @ 2012-06-29 10:34 UTC (permalink / raw)
  To: romieu; +Cc: netdev, linux-kernel, hayes, Hayes Wang

From: hayes <hayes@fc17.localdomain>

Support the new chip RTL8106E.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/ethernet/realtek/r8169.c |   56 ++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index d7a04e0..7afc593 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -46,6 +46,7 @@
 #define FIRMWARE_8105E_1	"rtl_nic/rtl8105e-1.fw"
 #define FIRMWARE_8402_1		"rtl_nic/rtl8402-1.fw"
 #define FIRMWARE_8411_1		"rtl_nic/rtl8411-1.fw"
+#define FIRMWARE_8106E_1	"rtl_nic/rtl8106e-1.fw"
 
 #ifdef RTL8169_DEBUG
 #define assert(expr) \
@@ -141,6 +142,7 @@ enum mac_version {
 	RTL_GIGA_MAC_VER_36,
 	RTL_GIGA_MAC_VER_37,
 	RTL_GIGA_MAC_VER_38,
+	RTL_GIGA_MAC_VER_39,
 	RTL_GIGA_MAC_NONE   = 0xff,
 };
 
@@ -259,6 +261,9 @@ static const struct {
 	[RTL_GIGA_MAC_VER_38] =
 		_R("RTL8411",		RTL_TD_1, FIRMWARE_8411_1,
 							JUMBO_9K, false),
+	[RTL_GIGA_MAC_VER_39] =
+		_R("RTL8106e",		RTL_TD_1, FIRMWARE_8106E_1,
+							JUMBO_1K, true),
 };
 #undef _R
 
@@ -431,7 +436,9 @@ enum rtl8168_registers {
 	RDSAR1			= 0xd0,	/* 8168c only. Undocumented on 8168dp */
 	MISC			= 0xf0,	/* 8168e only. */
 #define TXPLA_RST			(1 << 29)
+#define DISABLE_LAN_EN			(1 << 23) /* Enable GPIO pin */
 #define PWM_EN				(1 << 22)
+#define EARLY_TALLY_EN			(1 << 16)
 };
 
 enum rtl_register_content {
@@ -794,6 +801,7 @@ MODULE_FIRMWARE(FIRMWARE_8168F_1);
 MODULE_FIRMWARE(FIRMWARE_8168F_2);
 MODULE_FIRMWARE(FIRMWARE_8402_1);
 MODULE_FIRMWARE(FIRMWARE_8411_1);
+MODULE_FIRMWARE(FIRMWARE_8106E_1);
 
 static void rtl_lock_work(struct rtl8169_private *tp)
 {
@@ -1933,6 +1941,8 @@ static void rtl8169_get_mac_version(struct rtl8169_private *tp,
 		{ 0x7c800000, 0x30000000,	RTL_GIGA_MAC_VER_11 },
 
 		/* 8101 family. */
+		{ 0x7cf00000, 0x44900000,	RTL_GIGA_MAC_VER_39 },
+		{ 0x7c800000, 0x44800000,	RTL_GIGA_MAC_VER_39 },
 		{ 0x7c800000, 0x44000000,	RTL_GIGA_MAC_VER_37 },
 		{ 0x7cf00000, 0x40b00000,	RTL_GIGA_MAC_VER_30 },
 		{ 0x7cf00000, 0x40a00000,	RTL_GIGA_MAC_VER_30 },
@@ -3273,6 +3283,30 @@ static void rtl8402_hw_phy_config(struct rtl8169_private *tp)
 	rtl_writephy(tp, 0x1f, 0x0000);
 }
 
+static void rtl8106e_hw_phy_config(struct rtl8169_private *tp)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+
+	static const struct phy_reg phy_reg_init[] = {
+		{ 0x1f, 0x0004 },
+		{ 0x10, 0xc07f },
+		{ 0x19, 0x7030 },
+		{ 0x1f, 0x0000 }
+	};
+
+	/* Disable ALDPS before ram code */
+	rtl_writephy(tp, 0x1f, 0x0000);
+	rtl_writephy(tp, 0x18, 0x0310);
+	msleep(100);
+
+	rtl_apply_firmware(tp);
+
+	rtl_eri_write(ioaddr, 0x1b0, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
+	rtl_writephy_batch(tp, phy_reg_init, ARRAY_SIZE(phy_reg_init));
+
+	rtl_eri_write(ioaddr, 0x1d0, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
+}
+
 static void rtl_hw_phy_config(struct net_device *dev)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
@@ -3369,6 +3403,10 @@ static void rtl_hw_phy_config(struct net_device *dev)
 		rtl8411_hw_phy_config(tp);
 		break;
 
+	case RTL_GIGA_MAC_VER_39:
+		rtl8106e_hw_phy_config(tp);
+		break;
+
 	default:
 		break;
 	}
@@ -3608,6 +3646,7 @@ static void rtl_wol_suspend_quirk(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_34:
 	case RTL_GIGA_MAC_VER_37:
 	case RTL_GIGA_MAC_VER_38:
+	case RTL_GIGA_MAC_VER_39:
 		RTL_W32(RxConfig, RTL_R32(RxConfig) |
 			AcceptBroadcast | AcceptMulticast | AcceptMyPhys);
 		break;
@@ -3830,6 +3869,7 @@ static void __devinit rtl_init_pll_power_ops(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_29:
 	case RTL_GIGA_MAC_VER_30:
 	case RTL_GIGA_MAC_VER_37:
+	case RTL_GIGA_MAC_VER_39:
 		ops->down	= r810x_pll_power_down;
 		ops->up		= r810x_pll_power_up;
 		break;
@@ -5123,6 +5163,18 @@ static void rtl_hw_start_8402(struct rtl8169_private *tp)
 		     ERIAR_EXGMAC);
 }
 
+static void rtl_hw_start_8106(struct rtl8169_private *tp)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+
+	/* Force LAN exit from ASPM if Rx/Tx are not idle */
+	RTL_W32(FuncEvent, RTL_R32(FuncEvent) | 0x002800);
+
+	RTL_W32(MISC, (RTL_R32(MISC) | DISABLE_LAN_EN) & ~EARLY_TALLY_EN);
+	RTL_W8(MCU, RTL_R8(MCU) | EN_NDP | EN_OOB_RESET);
+	RTL_W8(DLLPR, RTL_R8(DLLPR) & ~PFM_EN);
+}
+
 static void rtl_hw_start_8101(struct net_device *dev)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
@@ -5167,6 +5219,10 @@ static void rtl_hw_start_8101(struct net_device *dev)
 	case RTL_GIGA_MAC_VER_37:
 		rtl_hw_start_8402(tp);
 		break;
+
+	case RTL_GIGA_MAC_VER_39:
+		rtl_hw_start_8106(tp);
+		break;
 	}
 
 	RTL_W8(Cfg9346, Cfg9346_Lock);
-- 
1.7.10.2


* [PATCH net-next 2/2] r8169: support RTL8168G
From: Hayes Wang @ 2012-06-29 10:34 UTC (permalink / raw)
  To: romieu; +Cc: netdev, linux-kernel, hayes, Hayes Wang
In-Reply-To: <1340966060-2749-1-git-send-email-hayeswang@realtek.com>

From: hayes <hayes@fc17.localdomain>

Support the new chip RTL8168G.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/ethernet/realtek/r8169.c |  344 +++++++++++++++++++++++++++++++++-
 1 file changed, 343 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 7afc593..fda4432 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -47,6 +47,7 @@
 #define FIRMWARE_8402_1		"rtl_nic/rtl8402-1.fw"
 #define FIRMWARE_8411_1		"rtl_nic/rtl8411-1.fw"
 #define FIRMWARE_8106E_1	"rtl_nic/rtl8106e-1.fw"
+#define FIRMWARE_8168G_1	"rtl_nic/rtl8168g-1.fw"
 
 #ifdef RTL8169_DEBUG
 #define assert(expr) \
@@ -143,6 +144,8 @@ enum mac_version {
 	RTL_GIGA_MAC_VER_37,
 	RTL_GIGA_MAC_VER_38,
 	RTL_GIGA_MAC_VER_39,
+	RTL_GIGA_MAC_VER_40,
+	RTL_GIGA_MAC_VER_41,
 	RTL_GIGA_MAC_NONE   = 0xff,
 };
 
@@ -264,6 +267,11 @@ static const struct {
 	[RTL_GIGA_MAC_VER_39] =
 		_R("RTL8106e",		RTL_TD_1, FIRMWARE_8106E_1,
 							JUMBO_1K, true),
+	[RTL_GIGA_MAC_VER_40] =
+		_R("RTL8168g/8111g",	RTL_TD_1, FIRMWARE_8168G_1,
+							JUMBO_9K, false),
+	[RTL_GIGA_MAC_VER_41] =
+		_R("RTL8168g/8111g",	RTL_TD_1, NULL, JUMBO_9K, false),
 };
 #undef _R
 
@@ -394,8 +402,11 @@ enum rtl8168_8101_registers {
 	TWSI			= 0xd2,
 	MCU			= 0xd3,
 #define	NOW_IS_OOB			(1 << 7)
+#define TX_EMPTY			(1 << 5)
+#define RX_EMPTY			(1 << 4)
 #define	EN_NDP				(1 << 3)
 #define	EN_OOB_RESET			(1 << 2)
+#define LINK_LIST_RDY			(1 << 1)
 	EFUSEAR			= 0xdc,
 #define	EFUSEAR_FLAG			0x80000000
 #define	EFUSEAR_WRITE_CMD		0x80000000
@@ -421,6 +432,7 @@ enum rtl8168_registers {
 #define ERIAR_MASK_SHIFT		12
 #define ERIAR_MASK_0001			(0x1 << ERIAR_MASK_SHIFT)
 #define ERIAR_MASK_0011			(0x3 << ERIAR_MASK_SHIFT)
+#define ERIAR_MASK_0101			(0x5 << ERIAR_MASK_SHIFT)
 #define ERIAR_MASK_1111			(0xf << ERIAR_MASK_SHIFT)
 	EPHY_RXER_NUM		= 0x7c,
 	OCPDR			= 0xb0,	/* OCP GPHY access */
@@ -433,11 +445,13 @@ enum rtl8168_registers {
 #define OCPAR_FLAG			0x80000000
 #define OCPAR_GPHY_WRITE_CMD		0x8000f060
 #define OCPAR_GPHY_READ_CMD		0x0000f060
+	GPHY_OCP		= 0xb8,
 	RDSAR1			= 0xd0,	/* 8168c only. Undocumented on 8168dp */
 	MISC			= 0xf0,	/* 8168e only. */
 #define TXPLA_RST			(1 << 29)
 #define DISABLE_LAN_EN			(1 << 23) /* Enable GPIO pin */
 #define PWM_EN				(1 << 22)
+#define RXDV_GATED_EN			(1 << 19)
 #define EARLY_TALLY_EN			(1 << 16)
 };
 
@@ -781,6 +795,8 @@ struct rtl8169_private {
 		} phy_action;
 	} *rtl_fw;
 #define RTL_FIRMWARE_UNKNOWN	ERR_PTR(-EAGAIN)
+
+	void (*write_fw)(struct rtl8169_private *, struct rtl_fw *);
 };
 
 MODULE_AUTHOR("Realtek and the Linux r8169 crew <netdev@vger.kernel.org>");
@@ -802,6 +818,7 @@ MODULE_FIRMWARE(FIRMWARE_8168F_2);
 MODULE_FIRMWARE(FIRMWARE_8402_1);
 MODULE_FIRMWARE(FIRMWARE_8411_1);
 MODULE_FIRMWARE(FIRMWARE_8106E_1);
+MODULE_FIRMWARE(FIRMWARE_8168G_1);
 
 static void rtl_lock_work(struct rtl8169_private *tp)
 {
@@ -919,6 +936,99 @@ static int r8168dp_check_dash(struct rtl8169_private *tp)
 	return (ocp_read(tp, 0x0f, reg) & 0x00008000) ? 1 : 0;
 }
 
+static void r8168_phy_ocp_write(void __iomem *ioaddr, u32 reg, u32 data)
+{
+	int i;
+
+	if (reg & 0xffff0001)
+		BUG();
+
+	RTL_W32(GPHY_OCP, OCPAR_FLAG | (reg << 15) | data);
+
+	for (i = 0; i < 10; i++) {
+		udelay(25);
+		if (!(RTL_R32(GPHY_OCP) & OCPAR_FLAG))
+			break;
+	}
+}
+
+static u16 r8168_phy_ocp_read(void __iomem *ioaddr, u32 reg)
+{
+	int i;
+	u32 data;
+
+	if (reg & 0xffff0001)
+		BUG();
+
+	RTL_W32(GPHY_OCP, (reg << 15));
+
+	for (i = 0; i < 10; i++) {
+		udelay(25);
+		data = RTL_R32(GPHY_OCP);
+		if (data & OCPAR_FLAG)
+			break;
+	}
+
+	return (u16)(data & 0xffff);
+}
+
+static void rtl_w1w0_phy_ocp(void __iomem *ioaddr, int reg_addr, int p, int m)
+{
+	int val;
+
+	val = r8168_phy_ocp_read(ioaddr, reg_addr);
+	r8168_phy_ocp_write(ioaddr, reg_addr, (val | p) & ~m);
+}
+
+static void r8168_mac_ocp_write(void __iomem *ioaddr, u32 reg, u32 data)
+{
+	int i;
+
+	if (reg & 0xffff0001)
+		BUG();
+
+	RTL_W32(OCPDR, OCPAR_FLAG | (reg << 15) | data);
+
+	for (i = 0; i < 10; i++) {
+		udelay(25);
+		if (!(RTL_R32(OCPDR) & OCPAR_FLAG))
+			break;
+	}
+}
+
+static u16 r8168_mac_ocp_read(void __iomem *ioaddr, u32 reg)
+{
+	int i;
+	u32 data;
+
+	if (reg & 0xffff0001)
+		BUG();
+
+	RTL_W32(OCPDR, (reg << 15));
+
+	for (i = 0; i < 10; i++) {
+		udelay(25);
+		data = RTL_R32(OCPDR);
+		if (data & OCPAR_FLAG)
+			break;
+	}
+
+	return (u16)(data & 0xffff);
+}
+
+static void r8168g_mdio_write(void __iomem *ioaddr, int reg_addr, int value)
+{
+	if (reg_addr == 0x1f)
+		return;
+
+	r8168_phy_ocp_write(ioaddr, 0xa400 + reg_addr * 2, value);
+}
+
+static int r8168g_mdio_read(void __iomem *ioaddr, int reg_addr)
+{
+	return r8168_phy_ocp_read(ioaddr, 0xa400 + reg_addr * 2);
+}
+
 static void r8169_mdio_write(void __iomem *ioaddr, int reg_addr, int value)
 {
 	int i;
@@ -1902,6 +2012,10 @@ static void rtl8169_get_mac_version(struct rtl8169_private *tp,
 		u32 val;
 		int mac_version;
 	} mac_info[] = {
+		/* 8168G family. */
+		{ 0x7cf00000, 0x4c100000,	RTL_GIGA_MAC_VER_41 },
+		{ 0x7cf00000, 0x4c000000,	RTL_GIGA_MAC_VER_40 },
+
 		/* 8168F family. */
 		{ 0x7c800000, 0x48800000,	RTL_GIGA_MAC_VER_38 },
 		{ 0x7cf00000, 0x48100000,	RTL_GIGA_MAC_VER_36 },
@@ -2241,6 +2355,92 @@ static void rtl_phy_write_fw(struct rtl8169_private *tp, struct rtl_fw *rtl_fw)
 	}
 }
 
+static void rtl_ocp_write_fw(struct rtl8169_private *tp, struct rtl_fw *rtl_fw)
+{
+	struct rtl_fw_phy_action *pa = &rtl_fw->phy_action;
+	void __iomem *ioaddr = tp->mmio_addr;
+	u32 predata, count;
+	u32 base_addr;
+	size_t index;
+
+	predata = count = 0;
+	base_addr = 0xa400;
+
+	for (index = 0; index < pa->size; ) {
+		u32 action = le32_to_cpu(pa->code[index]);
+		u32 data = action & 0x0000ffff;
+		u32 regno = (action & 0x0fff0000) >> 16;
+
+		if (!action)
+			break;
+
+		switch (action & 0xf0000000) {
+		case PHY_READ:
+			predata = r8168_phy_ocp_read(ioaddr,
+					base_addr + (regno - 16) * 2);
+			count++;
+			index++;
+			break;
+		case PHY_DATA_OR:
+			predata |= data;
+			index++;
+			break;
+		case PHY_DATA_AND:
+			predata &= data;
+			index++;
+			break;
+		case PHY_BJMPN:
+			index -= regno;
+			break;
+		case PHY_CLEAR_READCOUNT:
+			count = 0;
+			index++;
+			break;
+		case PHY_WRITE:
+			if (regno == 0x1f)
+				base_addr = data << 4;
+			else
+				r8168_phy_ocp_write(ioaddr,
+						base_addr + (regno - 0x10) * 2,
+						data);
+			index++;
+			break;
+		case PHY_READCOUNT_EQ_SKIP:
+			index += (count == data) ? 2 : 1;
+			break;
+		case PHY_COMP_EQ_SKIPN:
+			if (predata == data)
+				index += regno;
+			index++;
+			break;
+		case PHY_COMP_NEQ_SKIPN:
+			if (predata != data)
+				index += regno;
+			index++;
+			break;
+		case PHY_WRITE_PREVIOUS:
+			r8168_phy_ocp_write(ioaddr, base_addr + (regno - 16) * 2,
+					    predata);
+			index++;
+			break;
+		case PHY_SKIPN:
+			index += regno + 1;
+			break;
+		case PHY_DELAY_MS:
+			mdelay(data);
+			index++;
+			break;
+
+		case PHY_READ_MAC_BYTE:
+		case PHY_WRITE_MAC_BYTE:
+		case PHY_WRITE_ERI_WORD:
+		case PHY_READ_EFUSE:
+		default:
+			BUG();
+		}
+	}
+}
+
 static void rtl_release_firmware(struct rtl8169_private *tp)
 {
 	if (!IS_ERR_OR_NULL(tp->rtl_fw)) {
@@ -2256,7 +2456,7 @@ static void rtl_apply_firmware(struct rtl8169_private *tp)
 
 	/* TODO: release firmware once rtl_phy_write_fw signals failures. */
 	if (!IS_ERR_OR_NULL(rtl_fw))
-		rtl_phy_write_fw(tp, rtl_fw);
+		tp->write_fw(tp, rtl_fw);
 }
 
 static void rtl_apply_firmware_cond(struct rtl8169_private *tp, u8 reg, u16 val)
@@ -3221,6 +3421,56 @@ static void rtl8411_hw_phy_config(struct rtl8169_private *tp)
 	rtl_writephy(tp, 0x1f, 0x0000);
 }
 
+static void rtl8168g_1_hw_phy_config(struct rtl8169_private *tp)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+	u32 mac_ocp_addr, i;
+	static const u16 mac_ocp_patch[] = {
+		0xE008, 0xE01B, 0xE01D, 0xE01F,
+		0xE021, 0xE023, 0xE025, 0xE027,
+		0x49D2, 0xF10D, 0x766C, 0x49E2,
+		0xF00A, 0x1EC0, 0x8EE1, 0xC60A,
+		0x77C0, 0x4870, 0x9FC0, 0x1EA0,
+		0xC707, 0x8EE1, 0x9D6C, 0xC603,
+		0xBE00, 0xB416, 0x0076, 0xE86C,
+		0xC602, 0xBE00, 0x0000, 0xC602,
+		0xBE00, 0x0000, 0xC602, 0xBE00,
+		0x0000, 0xC602, 0xBE00, 0x0000,
+		0xC602, 0xBE00, 0x0000, 0xC602,
+		0xBE00, 0x0000, 0xC602, 0xBE00,
+		0x0000, 0x0000, 0x0000, 0x0000
+	};
+
+	/* patch code for GPHY reset */
+	mac_ocp_addr = 0xf800;
+	for (i = 0; mac_ocp_addr < 0xf868; i++) {
+		r8168_mac_ocp_write(ioaddr, mac_ocp_addr, mac_ocp_patch[i]);
+		mac_ocp_addr += 2;
+	}
+	r8168_mac_ocp_write(ioaddr, 0xfc26, 0x8000);
+	r8168_mac_ocp_write(ioaddr, 0xfc28, 0x0075);
+
+	rtl_apply_firmware(tp);
+
+	if (r8168_phy_ocp_read(ioaddr, 0xa460) & 0x0100)
+		rtl_w1w0_phy_ocp(ioaddr, 0xbcc4, 0x0000, 0x8000);
+	else
+		rtl_w1w0_phy_ocp(ioaddr, 0xbcc4, 0x8000, 0x0000);
+
+	if (r8168_phy_ocp_read(ioaddr, 0xa466) & 0x0100)
+		rtl_w1w0_phy_ocp(ioaddr, 0xc41a, 0x0002, 0x0000);
+	else
+		rtl_w1w0_phy_ocp(ioaddr, 0xbcc4, 0x0000, 0x0002);
+
+	rtl_w1w0_phy_ocp(ioaddr, 0xa442, 0x000c, 0x0000);
+	rtl_w1w0_phy_ocp(ioaddr, 0xa4b2, 0x0004, 0x0000);
+
+	r8168_phy_ocp_write(ioaddr, 0xa436, 0x8012);
+	rtl_w1w0_phy_ocp(ioaddr, 0xa438, 0x8000, 0x0000);
+
+	rtl_w1w0_phy_ocp(ioaddr, 0xc422, 0x4000, 0x2000);
+}
+
 static void rtl8102e_hw_phy_config(struct rtl8169_private *tp)
 {
 	static const struct phy_reg phy_reg_init[] = {
@@ -3407,6 +3657,13 @@ static void rtl_hw_phy_config(struct net_device *dev)
 		rtl8106e_hw_phy_config(tp);
 		break;
 
+	case RTL_GIGA_MAC_VER_40:
+		rtl8168g_1_hw_phy_config(tp);
+		break;
+
+	case RTL_GIGA_MAC_VER_41:
+		break;
+
 	default:
 		break;
 	}
@@ -3621,15 +3878,24 @@ static void __devinit rtl_init_mdio_ops(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_27:
 		ops->write	= r8168dp_1_mdio_write;
 		ops->read	= r8168dp_1_mdio_read;
+		tp->write_fw	= rtl_phy_write_fw;
 		break;
 	case RTL_GIGA_MAC_VER_28:
 	case RTL_GIGA_MAC_VER_31:
 		ops->write	= r8168dp_2_mdio_write;
 		ops->read	= r8168dp_2_mdio_read;
+		tp->write_fw	= rtl_phy_write_fw;
+		break;
+	case RTL_GIGA_MAC_VER_40:
+	case RTL_GIGA_MAC_VER_41:
+		ops->write	= r8168g_mdio_write;
+		ops->read	= r8168g_mdio_read;
+		tp->write_fw	= rtl_ocp_write_fw;
 		break;
 	default:
 		ops->write	= r8169_mdio_write;
 		ops->read	= r8169_mdio_read;
+		tp->write_fw	= rtl_phy_write_fw;
 		break;
 	}
 }
@@ -3647,6 +3913,8 @@ static void rtl_wol_suspend_quirk(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_37:
 	case RTL_GIGA_MAC_VER_38:
 	case RTL_GIGA_MAC_VER_39:
+	case RTL_GIGA_MAC_VER_40:
+	case RTL_GIGA_MAC_VER_41:
 		RTL_W32(RxConfig, RTL_R32(RxConfig) |
 			AcceptBroadcast | AcceptMulticast | AcceptMyPhys);
 		break;
@@ -3895,6 +4163,8 @@ static void __devinit rtl_init_pll_power_ops(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_35:
 	case RTL_GIGA_MAC_VER_36:
 	case RTL_GIGA_MAC_VER_38:
+	case RTL_GIGA_MAC_VER_40:
+	case RTL_GIGA_MAC_VER_41:
 		ops->down	= r8168_pll_power_down;
 		ops->up		= r8168_pll_power_up;
 		break;
@@ -4183,6 +4453,8 @@ static void rtl8169_hw_reset(struct rtl8169_private *tp)
 	           tp->mac_version == RTL_GIGA_MAC_VER_35 ||
 	           tp->mac_version == RTL_GIGA_MAC_VER_36 ||
 	           tp->mac_version == RTL_GIGA_MAC_VER_37 ||
+	           tp->mac_version == RTL_GIGA_MAC_VER_40 ||
+	           tp->mac_version == RTL_GIGA_MAC_VER_41 ||
 	           tp->mac_version == RTL_GIGA_MAC_VER_38) {
 		RTL_W8(ChipCmd, RTL_R8(ChipCmd) | StopReq);
 		while (!(RTL_R32(TxConfig) & TXCFG_EMPTY))
@@ -4921,6 +5193,28 @@ static void rtl_hw_start_8411(struct rtl8169_private *tp)
 		     ERIAR_EXGMAC);
 }
 
+static void rtl_hw_start_8168g_1(struct rtl8169_private *tp)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+	struct pci_dev *pdev = tp->pci_dev;
+
+	rtl_eri_write(ioaddr, 0xc8, ERIAR_MASK_0101, 0x080002, ERIAR_EXGMAC);
+	rtl_eri_write(ioaddr, 0xcc, ERIAR_MASK_0001, 0x38, ERIAR_EXGMAC);
+	rtl_eri_write(ioaddr, 0xd0, ERIAR_MASK_0001, 0x48, ERIAR_EXGMAC);
+	rtl_eri_write(ioaddr, 0xe8, ERIAR_MASK_1111, 0x00100006, ERIAR_EXGMAC);
+	rtl_csi_access_enable_1(tp);
+	rtl_tx_performance_tweak(pdev, 0x5 << MAX_READ_REQUEST_SHIFT);
+	rtl_w1w0_eri(ioaddr, 0xdc, ERIAR_MASK_0001, 0x00, 0x01, ERIAR_EXGMAC);
+	rtl_w1w0_eri(ioaddr, 0xdc, ERIAR_MASK_0001, 0x01, 0x00, ERIAR_EXGMAC);
+	RTL_W8(ChipCmd, CmdTxEnb | CmdRxEnb);
+	RTL_W32(MISC, RTL_R32(MISC) & ~RXDV_GATED_EN);
+	RTL_W8(MaxTxPacketSize, EarlySize);
+	rtl_eri_write(ioaddr, 0xc0, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
+	rtl_eri_write(ioaddr, 0xb8, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
+	RTL_W8(EEE_LED, RTL_R8(EEE_LED) & ~0x07);
+	rtl_w1w0_eri(ioaddr, 0x2fc, ERIAR_MASK_0001, 0x01, 0x02, ERIAR_EXGMAC);
+}
+
 static void rtl_hw_start_8168(struct net_device *dev)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
@@ -5022,6 +5316,11 @@ static void rtl_hw_start_8168(struct net_device *dev)
 		rtl_hw_start_8411(tp);
 		break;
 
+	case RTL_GIGA_MAC_VER_40:
+	case RTL_GIGA_MAC_VER_41:
+		rtl_hw_start_8168g_1(tp);
+		break;
+
 	default:
 		printk(KERN_ERR PFX "%s: unknown chipset (mac_version = %d).\n",
 			dev->name, tp->mac_version);
@@ -6491,6 +6790,47 @@ static unsigned rtl_try_msi(struct rtl8169_private *tp,
 	return msi;
 }
 
+static void __devinit rtl_hw_init_8168g(struct rtl8169_private *tp)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+	u32 tmp_data;
+
+	RTL_W32(MISC, RTL_R32(MISC) | RXDV_GATED_EN);
+	while (!(RTL_R32(TxConfig) & TXCFG_EMPTY))
+		udelay(100);
+
+	while ((RTL_R8(MCU) & (TX_EMPTY | RX_EMPTY)) != (TX_EMPTY | RX_EMPTY))
+		udelay(100);
+
+	RTL_W8(ChipCmd, RTL_R8(ChipCmd) & ~(CmdTxEnb | CmdRxEnb));
+	msleep(1);
+	RTL_W8(MCU, RTL_R8(MCU) & ~NOW_IS_OOB);
+
+	tmp_data = r8168_mac_ocp_read(ioaddr, 0xe8de);
+	tmp_data &= ~(1 << 14);
+	r8168_mac_ocp_write(ioaddr, 0xe8de, tmp_data);
+	while (!(RTL_R8(MCU) & LINK_LIST_RDY))
+		udelay(100);
+
+	tmp_data = r8168_mac_ocp_read(ioaddr, 0xe8de);
+	tmp_data |= (1 << 15);
+	r8168_mac_ocp_write(ioaddr, 0xe8de, tmp_data);
+	while (!(RTL_R8(MCU) & LINK_LIST_RDY))
+		udelay(100);
+}
+
+static void __devinit rtl_hw_initialize(struct rtl8169_private *tp)
+{
+	switch (tp->mac_version) {
+	case RTL_GIGA_MAC_VER_40:
+	case RTL_GIGA_MAC_VER_41:
+		rtl_hw_init_8168g(tp);
+		break;
+	default:
+		break;
+	}
+}
+
 static int __devinit
 rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
@@ -6600,6 +6940,8 @@ rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	rtl_irq_disable(tp);
 
+	rtl_hw_initialize(tp);
+
 	rtl_hw_reset(tp);
 
 	rtl_ack_events(tp, 0xffff);
-- 
1.7.10.2

^ permalink raw reply related

* [MMTests] Network performance
From: Mel Gorman @ 2012-06-29 11:22 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, netdev
In-Reply-To: <20120629111932.GA14154@suse.de>

Configuration:	global-dhp__network-performance
Benchmarks:	netperf-udp, netperf-tcp, tbench4

Summary
=======
Some tests look good but netperf-tcp tests show a number of problems.

Benchmark notes
===============

netperf used the TCP_STREAM or UDP_STREAM tests. Server and client were bound
to CPU 0 and 1 respectively. To improve the chances of getting an accurate
reading "-i 50,6 -I 99,1" was specified on the command line.  Personally I
tend to find netperf figures a bit unreliable and can vary depending on the
exact starting conditions. This might be due to the test being run against
localhost or because there is no other machine activity to smooth outliers
related to cache coloring. Suggestions on how to mitigate this are welcome.

tbench was from dbench 4 and ran for 3 minutes.

===========================================================
Machine:	arnold
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__network-performance/arnold/comparison.html
Arch:		x86
CPUs:		1 socket, 2 threads
Model:		Pentium 4
Disk:		Single Rotary Disk
Status:		Ok, but netperf-tcp has problems
===========================================================

netperf-udp
-----------
For the most part, this looks good. 2.6.34 and 3.2.9 were both bad
kernels for some reason but currently it looks fine. I tend to
find that netperf figures fluctuate easily and t

netperf-tcp
-----------
This is less healthy; it looks like there is a fairly consistent
regression of 2-5%.

tbench4
-------
Some of these tests failed to run and the logs are unclear as to
why, but this only happened on this machine. It's only now that I noticed.
While results are looking ok now, there were some regressions for
3.0 until 3.2 kernels that might be of concern to -stable users.

==========================================================
Machine:	hydra
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/hydra/comparison.html
Arch:		x86-64
CPUs:		1 socket, 4 threads
Model:		AMD Phenom II X4 940
Disk:		Single Rotary Disk
Status:		Ok, but netperf-tcp has problems
==========================================================

netperf-udp
-----------
This is looking great. There was a high in 3.1 that has been
lost since but it's still better overall in comparison to
2.6.32.

netperf-tcp
-----------
This is less healthy with a lot of regression. 3.4 has mostly
regressed to the tune of 2-13% versus 2.6.32.

tbench4
-------
For the most part, this is looking ok. 2 clients seems to be
particularly problematic for some reason but otherwise looks
good.

==========================================================
Machine:	sandy
Result:		http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-ext3/sandy/comparison.html
Arch:		x86-64
CPUs:		1 socket, 8 threads
Model:		Intel Core i7-2600
Disk:		Single Rotary Disk
Status:		Bad, tbench is ok just otherwise poor
==========================================================

netperf-udp
-----------
This is not a happy story. There was a big drop between 3.2 and 3.3
and the regression is still there in comparison to 2.6.32

netperf-tcp
-----------
This has consistently regressed since 2.6.34 with the regression very
roughly around the 10% mark.

tbench4
-------
Unlike the other tests, this is looking reasonably good with performance
gains until the number of clients gets really high. It was interesting
to note that 2.6.34 was a particularly good kernel for tbench and
while current kernels are better than 2.6.32, they are not as good as
2.6.34.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: 3.4.x regression: rtl8169: frequent resets
From: Stefan Lippers-Hollmann @ 2012-06-29 11:50 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Nix, netdev, linux-kernel
In-Reply-To: <201206290131.49150.s.L-H@gmx.de>

[-- Attachment #1: Type: Text/Plain, Size: 4427 bytes --]

Hi

On Friday 29 June 2012, Stefan Lippers-Hollmann wrote:
> On Thursday 28 June 2012, Francois Romieu wrote:
> > Nix <nix@esperi.org.uk> :
> > > I recently upgraded from 3.3.x to 3.4.4, and am now experiencing
> > > networking problems with my desktop box's r8169 card. The symptoms are
> > > that all traffic ceases for five to ten seconds, then the card appears
> > > to reset and everything is back to normal -- until it happens again. It
> > > can happen quite a lot:
> > 
> > Can you try and revert 036dafa28da1e2565a8529de2ae663c37b7a0060 ?
> > 
> > I would welcome a complete dmesg including the XID line from the
> > r8169 driver.

Full gzipped messages/kern.log attached (unfortunately he rebooted too
quickly for a regular dmesg).

[    0.573645] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[    0.573930] r8169 0000:04:00.0: eth0: RTL8168d/8111d at 0xffffc90000c72000, 00:24:1d:72:7c:75, XID 081000c0 IRQ 44
[    0.573933] r8169 0000:04:00.0: eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[    0.573953] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[    0.574093] ehci_hcd 0000:00:1a.7: irq 18, io mem 0xfbffe000
[    0.574213] r8169 0000:05:00.0: eth1: RTL8168d/8111d at 0xffffc90000c6e000, 00:24:1d:72:7c:77, XID 081000c0 IRQ 45
[    0.574217] r8169 0000:05:00.0: eth1: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[…]
[   20.872579] r8169 0000:04:00.0: eth0: link down
[   20.872594] r8169 0000:04:00.0: eth0: link down
[   20.873162] ADDRCONF(NETDEV_UP): eth0: link is not ready
[   20.945479] NET: Registered protocol family 17
[   22.516769] r8169 0000:04:00.0: eth0: link up
[   22.517670] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   25.996741] ip_tables: (C) 2000-2006 Netfilter Core Team
[   26.091554] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[…]
[14454.544994] ------------[ cut here ]------------
[14454.545004] WARNING: at /tmp/buildd/linux-aptosid-3.4/debian/build/source_amd64_none/net/sched/sch_generic.c:256 dev_watchdog+0xe9/0x15c()
[14454.545008] Hardware name: EX58-UD5
[14454.545010] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[14454.545013] Modules linked in: rfcomm bnep cpufreq_powersave cpufreq_stats cpufreq_conservative binfmt_misc xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables af_packet hfsplus fuse nls_utf8 nls_cp437 vfat fat jfs it87 dm_crypt dm_mod kvm_intel kvm adt7475 hwmon_vid nouveau snd_hda_codec_realtek coretemp video ttm drm_kms_helper drm snd_hda_intel power_supply snd_hda_codec snd_hwdep snd_pcm snd_page_alloc i2c_i801 i2c_algo_bit iTCO_wdt i7core_edac snd_seq iTCO_vendor_support microcode snd_seq_device edac_core i2c_core mxm_wmi btusb snd_timer snd bluetooth evdev pcspkr rfkill acpi_cpufreq soundcore mperf button processor wmi ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic usbhid pata_acpi hid sd_mod crc_t10dif crc32c_intel pata_jmicron uhci_hcd ahci libahci libata scsi_mod r8169 mii ehci_hcd usbcore usb_common [last unloaded: scsi_wait_scan]
[14454.545100] Pid: 4245, comm: iceape-bin Not tainted 3.4-4.slh.1-aptosid-amd64 #1
[14454.545103] Call Trace:
[14454.545105]  <IRQ>  [<ffffffff810332f6>] ? warn_slowpath_common+0x76/0x8a
[14454.545116]  [<ffffffff810333a2>] ? warn_slowpath_fmt+0x45/0x4a
[14454.545121]  [<ffffffff8127546a>] ? netif_tx_lock+0x67/0x7a
[14454.545127]  [<ffffffff812755b3>] ? dev_watchdog+0xe9/0x15c
[14454.545133]  [<ffffffff81020f2d>] ? __default_send_IPI_dest_field.constprop.0+0x38/0x4d
[14454.545139]  [<ffffffff8103c332>] ? run_timer_softirq+0x153/0x1e3
[14454.545145]  [<ffffffff8100f389>] ? paravirt_read_tsc+0x5/0x8
[14454.545150]  [<ffffffff81037f6b>] ? __do_softirq+0x92/0x126
[14454.545154]  [<ffffffff810202e2>] ? lapic_next_event+0xd/0x11
[14454.545160]  [<ffffffff813231dc>] ? call_softirq+0x1c/0x30
[14454.545164]  [<ffffffff8100ae23>] ? do_softirq+0x3a/0x77
[14454.545168]  [<ffffffff8103824b>] ? irq_exit+0x49/0xb1
[14454.545172]  [<ffffffff81020672>] ? smp_apic_timer_interrupt+0x74/0x82
[14454.545176]  [<ffffffff8132288a>] ? apic_timer_interrupt+0x6a/0x70
[14454.545179]  <EOI>  [<ffffffff81321df9>] ? system_call_fastpath+0x16/0x1b
[14454.545185] ---[ end trace a37b096a01814f14 ]---
[14454.549925] r8169 0000:04:00.0: eth0: link up
[14472.536356] r8169 0000:04:00.0: eth0: link up

Regards
	Stefan Lippers-Hollmann

[-- Attachment #2: messages.gz --]
[-- Type: application/x-gzip, Size: 14397 bytes --]

[-- Attachment #3: kern.log.gz --]
[-- Type: application/x-gzip, Size: 18217 bytes --]

^ permalink raw reply

* BUG: NULL pointer in ctnetlink_conntrack_event
From: Hans Schillstrom @ 2012-06-29 12:29 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netdev, netfilter-devel

Hello,

There is a "hard to find" problem in ctnetlink_conntrack_event(): when it
ends up calling netlink_has_listeners(), net->nfnl is NULL.

The rcu stuff seems to be right at first look but who knows...

The line below fixes the problem, but that is not the root cause.

 int nfnetlink_has_listeners(struct net *net, unsigned int group)
 {
-       return netlink_has_listeners(net->nfnl, group);
+       return net->nfnl ? netlink_has_listeners(net->nfnl, group) : 0 ;
 }
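
For illustration only, the guard in the one-line change above can be
modelled in user space. This is a minimal sketch, not kernel code:
struct mock_net, struct mock_sock and both mock_* functions are hypothetical
stand-ins, and the listener bitmask is an assumption made to keep the
example small. It only shows that a NULL nfnl is reported as "no listeners"
instead of being dereferenced; it says nothing about the root cause.

```c
#include <stddef.h>

/* Hypothetical stand-in for the netlink socket; not the kernel's struct sock. */
struct mock_sock {
	unsigned int listeners;	/* bit N set: group N has a listener (N < 32) */
};

/* Hypothetical stand-in for struct net; only the field of interest is kept. */
struct mock_net {
	struct mock_sock *nfnl;	/* NULL models a namespace without its socket */
};

/* Models the callee: it dereferences sk unconditionally. */
static int mock_netlink_has_listeners(const struct mock_sock *sk,
				      unsigned int group)
{
	return (sk->listeners >> group) & 1;
}

/* Models the guarded wrapper: a missing socket means "no listeners". */
static int mock_nfnetlink_has_listeners(const struct mock_net *net,
					unsigned int group)
{
	if (!net->nfnl)
		return 0;

	return mock_netlink_has_listeners(net->nfnl, group);
}
```

Calling mock_nfnetlink_has_listeners() on a mock_net whose nfnl is NULL
returns 0 for every group, which matches the behaviour of the workaround.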

Yes, it is a 3.0.26 kernel, but this patch is applied:
"netfilter: nf_conntrack: make event callback registration per-netns"

It happens when adding a number of containers, each of which does a "nfct_query(h, NFCT_Q_CREATE, ct);",
and most likely when one namespace shuts down.

Any idea why the timer is running at this point?


BUG: unable to handle kernel NULL pointer dereference at 000000000000027c
IP: [<ffffffff813615db>] netlink_has_listeners+0xb/0x60
PGD 0
Oops: 0000 [#3] PREEMPT SMP
CPU 0
Modules linked in: ip6table_raw(N) xt_NOTRACK(N) iptable_raw(N) ipt_REJECT(N) xt_sctp(N) xt_multiport(N) xt_connmark(N) xt_mark(N) xt_conntrack(N) ip6table_mangle(N) ip_vs(N) nf_conntrack_netlink(N) nfnetlink(N) ip6_tunnel(N) tunnel6(N) macvlan(N) xt_HMARK(N) ipv6_find_hdr(N) iptable_mangle(N) nf_conntrack_ipv6(N) nf_defrag_ipv6(N) ip6t_LOG(N) ip6table_filter(N) ip6_tables(N) nf_conntrack_ipv4(N) nf_defrag_ipv4(N) xt_state(N) xt_tcpudp(N) xt_u32(N) xt_comment(N) xt_length(N) xt_hashlimit(N) ipt_LOG(N) xt_limit(N) iptable_filter(N) ip_tables(N) x_tables(N) nf_conntrack_ftp(N) nf_conntrack_tftp(N) nf_conntrack(N) mptsas(N) mptscsih(N) mptbase(N) sg(N) scsi_transport_sas(N) i2c_i801(N) i2c_core(N) button(N) pcspkr(N) ahci(N) libahci(N) processor(N) serio_raw(N) thermal_sys(N) hwmon(N) iTCO_wdt(N) iTCO_vendor_support(N) libata(N) ioatdma(N) ixgbe(N) mdio(N) nfs(N) lockd(N) fscache(N) auth_rpcgss(N) nfs_acl(N) sunrpc(N) af_packet(N) ipv6(N) ipv6_lib(N) bonding(N) e1000e(N) igb(N) dca(N) mii(N) 8021q(N) garp(N) stp(N) llc(N) softdog(N) xfs(N) exportfs(N) sd_mod(N) crc_t10dif(N) usb_storage(N) scsi_mod(N) ehci_hcd(N) uhci_hcd(N) usbcore(N) usb_common(N)
Supported: Yes

Pid: 0, comm: swapper Tainted: G      D    N  3.0.26-0.2-default
RIP: 0010:[<ffffffff813615db>]  [<ffffffff813615db>] netlink_has_listeners+0xb/0x60
RSP: 0018:ffff88063f203da0  EFLAGS: 00010286
RAX: ffff88063f203e30 RBX: 0000000000000000 RCX: ffffffffa04c60f0
RDX: 0000000000000004 RSI: 0000000000000003 RDI: 0000000000000000
RBP: 0000000000000003 R08: 0000000000000000 R09: ffff88063f2114a0
R10: 0000000000000000 R11: ffffffff8101e760 R12: ffff8805e2a45788
R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000004
FS:  0000000000000000(0000) GS:ffff88063f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000000027c CR3: 0000000001a03000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a0b020)
Stack:
 0000000000000000 0000000000000000 ffff8805e2a45800 ffffffffa04c453e
 ffff88063f203e30 0000000400000001 ffff8805e24e6c80 0000000300000000
 0000000000000000 ffff880610044000 ffff880610044800 ffff8805e2a45788
Call Trace:
 [<ffffffffa04c453e>] ctnetlink_conntrack_event+0x51e/0x570 [nf_conntrack_netlink]
 [<ffffffffa042a27b>] death_by_timeout+0x12b/0x190 [nf_conntrack]
 [<ffffffff810608ec>] run_timer_softirq+0x14c/0x270
 [<ffffffff81059d25>] __do_softirq+0xa5/0x180
 [<ffffffff813ff43c>] call_softirq+0x1c/0x30
 [<ffffffff810043f5>] do_softirq+0x65/0xa0
 [<ffffffff81059b15>] irq_exit+0xc5/0x100
 [<ffffffff8101f5a9>] smp_apic_timer_interrupt+0x69/0xa0
 [<ffffffff813febf3>] apic_timer_interrupt+0x13/0x20
 [<ffffffffa0230806>] acpi_idle_enter_bm+0x255/0x28f [processor]
 [<ffffffff813179e2>] cpuidle_idle_call+0xd2/0x120
 [<ffffffff810019f3>] cpu_idle+0x63/0xd0
 [<ffffffff81bf0f65>] start_kernel+0x3e4/0x4bf
 [<ffffffff81bf03c3>] x86_64_start_kernel+0x114/0x12f
Code: ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 e9 cb c5 fc ff 66 66 2e 0f 1f 84 00 00 00 00 00 55 89 f5 53 48 89 fb 48 83 ec 08 <f6> 87 7c 02 00 00 01 74 41 e8 47 50 d5 ff 0f b6 83 21 01 00 00
RIP  [<ffffffff813615db>] netlink_has_listeners+0xb/0x60
 RSP <ffff88063f203da0>
CR2: 000000000000027c
---[ end trace a057af0b3004c67a ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G      D    N  3.0.26-0.2-default #1
Call Trace:
 [<ffffffff81004672>] dump_trace+0x82/0x380
 [<ffffffff813f4fa2>] dump_stack+0x69/0x6f
 [<ffffffff813f5050>] panic+0xa8/0x20c
 [<ffffffff813f9b21>] oops_end+0xe1/0xf0
 [<ffffffff81030e50>] no_context+0x100/0x270
 [<ffffffff81031135>] __bad_area_nosemaphore+0x175/0x220
 [<ffffffff813fbb36>] do_page_fault+0x3a6/0x590
 [<ffffffff813f8d15>] page_fault+0x25/0x30
 [<ffffffff813615db>] netlink_has_listeners+0xb/0x60
 [<ffffffffa04c453e>] ctnetlink_conntrack_event+0x51e/0x570 [nf_conntrack_netlink]
 [<ffffffffa042a27b>] death_by_timeout+0x12b/0x190 [nf_conntrack]
 [<ffffffff810608ec>] run_timer_softirq+0x14c/0x270
 [<ffffffff81059d25>] __do_softirq+0xa5/0x180
 [<ffffffff813ff43c>] call_softirq+0x1c/0x30
 [<ffffffff810043f5>] do_softirq+0x65/0xa0
 [<ffffffff81059b15>] irq_exit+0xc5/0x100
 [<ffffffff8101f5a9>] smp_apic_timer_interrupt+0x69/0xa0
 [<ffffffff813febf3>] apic_timer_interrupt+0x13/0x20
 [<ffffffffa0230806>] acpi_idle_enter_bm+0x255/0x28f [processor]
 [<ffffffff813179e2>] cpuidle_idle_call+0xd2/0x120
 [<ffffffff810019f3>] cpu_idle+0x63/0xd0
 [<ffffffff81bf0f65>] start_kernel+0x3e4/0x4bf
 [<ffffffff81bf03c3>] x86_64_start_kernel+0x114/0x12f
Rebooting in 1 seconds..
--
Regards 
Hans Schillstrom




^ permalink raw reply

* [PATCH 00/16] Swap-over-NBD without deadlocking V14
From: Mel Gorman @ 2012-06-29 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman

Andrew, any chance of these being picked up for linux-next? Many of the
recent changes have been related to rebasing issues instead of something
more fundamental.

Changelog since V13
  o Rebase to linux-next 20120629

Changelog since V12
  o Rebase to linux-next-20120622
  o Do not alter coalesce handling in the input path		      (eric.dumazet)
  o Avoid unnecessary cast					      (sebastian)

Changelog since V11
  o Rebase to 3.5-rc3
  o Correct order of page flag free				      (sebastian)

Changelog since V10
  o Rebase to 3.4-rc5
  o Coding style fixups						      (davem)
  o API consistency						      (davem)
  o Rename sk_allocation to sk_gfp_atomic and use only when necessary (davem)
  o Use static branches for sk_memalloc_socks			      (davem)
  o Use static branch checks in fast paths			      (davem)
  o Document concerns about PF_MEMALLOC leaking flags		      (davem)
  o Locking fix in slab						      (mel)

Changelog since V9
  o Rebase to 3.4-rc5
  o Clarify comment on why PF_MEMALLOC is cleared in softirq handling (akpm)
  o Only set page->pfmemalloc if ALLOC_NO_WATERMARKS was required     (rientjes)

Changelog since V8
  o Rebase to 3.4-rc2
  o Use page flag instead of slab fields to keep structures the same size
  o Properly detect allocations from softirq context that use PF_MEMALLOC
  o Ensure kswapd does not sleep while processes are throttled
  o Do not accidentally throttle !_GFP_FS processes indefinitely

Changelog since V7
  o Rebase to 3.3-rc2
  o Take greater care propagating page->pfmemalloc to skb
  o Propagate pfmemalloc from netdev_alloc_page to skb where possible
  o Release RCU lock properly on preempt kernel

Changelog since V6
  o Rebase to 3.1-rc8
  o Use wake_up instead of wake_up_interruptible()
  o Do not throttle kernel threads
  o Avoid a potential race between kswapd going to sleep and processes being
    throttled

Changelog since V5
  o Rebase to 3.1-rc5

Changelog since V4
  o Update comment clarifying what protocols can be used		(Michal)
  o Rebase to 3.0-rc3

Changelog since V3
  o Propagate pfmemalloc from packet fragment pages to skb		(Neil)
  o Rebase to 3.0-rc2

Changelog since V2
  o Document that __GFP_NOMEMALLOC overrides __GFP_MEMALLOC		(Neil)
  o Use wait_event_interruptible					(Neil)
  o Use !! when casting to bool to avoid any possibility of type
    truncation								(Neil)
  o Nicer logic when using skb_pfmemalloc_protocol			(Neil)

Changelog since V1
  o Rebase on top of mmotm
  o Use atomic_t for memalloc_socks		(David Miller)
  o Remove use of sk_memalloc_socks in vmscan	(Neil Brown)
  o Check throttle within prepare_to_wait	(Neil Brown)
  o Add statistics on throttling instead of printk

When a user or administrator requires swap for their application, they
create a swap partition or file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are blade servers used as part of a
cluster, where the form factor or maintenance costs do not allow the use
of disks, and thin clients.

The Linux Terminal Server Project recommends the use of the
Network Block Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There are also documentation and tutorials on how to set up swap over NBD
at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
The nbd-client also documents the use of NBD as swap. Despite this, the
fact is that a machine using NBD for swap can deadlock within minutes if
swap is used intensively. This patch series addresses the problem.

The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where it receives
packets from, it cannot reliably work out in advance how much memory
it might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over NFS, which at least one distribution
is carrying within its kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer, which is needed for both NBD and NFS.

Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
	preserve access to pages allocated under low memory situations
	to callers that are freeing memory.

Patch 2 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
	reserves without setting PFMEMALLOC.

Patch 3 opens the possibility for softirqs to use PFMEMALLOC reserves
	for later use by network packet processing.

Patch 4 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

Patches 6-13 allow network processing to use PFMEMALLOC reserves when
	the socket has been marked as being used by the VM to clean pages. If
	packets are received and stored in pages that were allocated under
	low-memory situations and are unrelated to the VM, the packets
	are dropped.

	Patch 11 reintroduces __skb_alloc_page which the networking
	folk may object to but is needed in some cases to propagate
	pfmemalloc from a newly allocated page to an skb. If there is a
	strong objection, this patch can be dropped with the impact being
	that swap-over-network will be slower in some cases but it should
	not fail.

Patch 14 is a micro-optimisation to avoid a function call in the
	common case.

Patch 15 tags NBD sockets as being SOCK_MEMALLOC so they can use
	PFMEMALLOC if necessary.

Patch 16 notes that it is still possible for the PFMEMALLOC reserve
	to be depleted. To prevent this, direct reclaimers get throttled on
	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
	expected that kswapd and the direct reclaimers already running
	will clean enough pages for the low watermark to be reached, at
	which point the throttled processes are woken up.

Patch 17 adds a statistic to track how often processes get throttled
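
As a rough illustration of the throttling condition described for Patch 16
(this is a userspace sketch, not code from the series), the check can be
modelled as "is more than half of the reserve still free?". The struct
toy_zone type, its fields and toy_reserves_ok() are hypothetical names, and
modelling the reserve as the sum of per-zone min watermarks is an assumption
made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a memory zone. */
struct toy_zone {
	unsigned long min_watermark_pages;	/* assumed size of the zone's reserve */
	unsigned long free_pages;		/* pages currently free in the zone */
};

/*
 * Returns true while more than half of the combined reserve is still
 * free, i.e. direct reclaimers would not need to be throttled yet.
 */
static bool toy_reserves_ok(const struct toy_zone *zones, size_t nr_zones)
{
	unsigned long reserve = 0, free_pages = 0;

	for (size_t i = 0; i < nr_zones; i++) {
		reserve += zones[i].min_watermark_pages;
		free_pages += zones[i].free_pages;
	}

	return free_pages > reserve / 2;
}
```

A caller that gets false back would sleep on a waitqueue until reclaim has
freed enough pages for the check to pass again.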

Some basic performance testing was run using kernel builds, netperf
on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
and sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant
performance variances.

For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.

Without the patches and using SLUB, the machine locks up within minutes;
with them applied, it runs to completion. With SLAB, the story is different
as an unpatched kernel runs to completion. However, the patched kernel
completed the test 45% faster (1762 versus 3241 seconds of elapsed time).

MICRO
                                         3.5.0-rc2 3.5.0-rc2
                                           vanilla   swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)             197.80    173.07
User+Sys Time Running Test (seconds)        206.96    182.03
Total Elapsed Time (seconds)               3240.70   1762.09

 drivers/block/nbd.c                               |    6 +-
 drivers/net/ethernet/chelsio/cxgb4/sge.c          |    2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c        |    2 +-
 drivers/net/ethernet/intel/igb/igb_main.c         |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    4 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    3 +-
 drivers/net/usb/cdc-phonet.c                      |    2 +-
 drivers/usb/gadget/f_phonet.c                     |    2 +-
 include/linux/gfp.h                               |   13 +-
 include/linux/mm_types.h                          |    9 +
 include/linux/mmzone.h                            |    1 +
 include/linux/page-flags.h                        |   28 +++
 include/linux/sched.h                             |    7 +
 include/linux/skbuff.h                            |   80 +++++++-
 include/linux/vm_event_item.h                     |    1 +
 include/net/sock.h                                |   28 +++
 include/trace/events/gfpflags.h                   |    1 +
 kernel/softirq.c                                  |    9 +
 mm/page_alloc.c                                   |   46 ++++-
 mm/slab.c                                         |  216 +++++++++++++++++++--
 mm/slub.c                                         |   30 ++-
 mm/vmscan.c                                       |  131 ++++++++++++-
 mm/vmstat.c                                       |    1 +
 net/core/dev.c                                    |   53 ++++-
 net/core/filter.c                                 |    8 +
 net/core/skbuff.c                                 |  124 +++++++++---
 net/core/sock.c                                   |   43 ++++
 net/ipv4/tcp_output.c                             |   12 +-
 net/ipv6/tcp_ipv6.c                               |    8 +-
 29 files changed, 782 insertions(+), 90 deletions(-)

-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH 01/16] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Mel Gorman @ 2012-06-29 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340976767-5737-1-git-send-email-mgorman@suse.de>

Allocations of pages below the min watermark run a risk of the
machine hanging due to a lack of memory. To prevent this, only
callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing
an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once
they are allocated to a slab though, nothing prevents other callers
consuming free objects within those slabs. This patch limits access
to slab pages that were allocated from the PFMEMALLOC reserves.

When this patch is applied, pages allocated from below the low watermark are
returned with page->pfmemalloc set and it is up to the caller to determine
how the page should be protected. SLAB restricts access to any page with
page->pfmemalloc set to callers which are known to be able to access the
PFMEMALLOC reserve. If one is not available, an attempt is made to allocate
a new page rather than use a reserve. SLUB is a bit more relaxed in that
it only records if the current per-CPU page was allocated from PFMEMALLOC
reserve and uses another partial slab if the caller does not have the
necessary GFP or process flags. This was found to be sufficient in tests
to avoid hangs due to SLUB generally maintaining smaller lists than SLAB.

In low-memory conditions it does mean that !PFMEMALLOC allocators
can fail a slab allocation even though free objects are available
because they are being preserved for callers that are freeing pages.
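
To make the SLAB side of this more concrete, here is a small userspace
sketch of the idea (not the patch itself): objects from reserve-backed slabs
are tagged in the low bit of their pointer while they sit in the per-CPU
array, and a caller without access to the reserves is handed an untagged
entry instead, or nothing. All toy_* and obj_* names are hypothetical, and
the sketch assumes objects are at least 2-byte aligned so the low bit is free.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OBJ_PFMEMALLOC	((uintptr_t)1)	/* low bit marks reserve-backed objects */

static bool obj_is_pfmemalloc(void *objp)
{
	return ((uintptr_t)objp & OBJ_PFMEMALLOC) != 0;
}

static void *obj_tag(void *objp)
{
	return (void *)((uintptr_t)objp | OBJ_PFMEMALLOC);
}

static void *obj_untag(void *objp)
{
	return (void *)((uintptr_t)objp & ~OBJ_PFMEMALLOC);
}

/* Hypothetical, fixed-size stand-in for the per-CPU array cache. */
struct toy_cache {
	size_t avail;
	void *entry[8];
};

/* Pop an object; callers without reserve access skip tagged entries. */
static void *toy_get_obj(struct toy_cache *ac, bool reserve_allowed)
{
	void *objp;

	if (ac->avail == 0)
		return NULL;

	objp = ac->entry[--ac->avail];
	if (!obj_is_pfmemalloc(objp))
		return objp;
	if (reserve_allowed)
		return obj_untag(objp);

	/* Look for an untagged entry and leave the tagged one in its slot. */
	for (size_t i = 0; i < ac->avail; i++) {
		if (!obj_is_pfmemalloc(ac->entry[i])) {
			void *found = ac->entry[i];

			ac->entry[i] = objp;
			return found;
		}
	}

	/* Only reserve-backed objects are left: put this one back. */
	ac->avail++;
	return NULL;
}
```

When NULL comes back, the real allocator would try to grow the cache with a
fresh page rather than dip into the reserve.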

[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h   |    9 +++
 include/linux/page-flags.h |   28 +++++++
 mm/internal.h              |    3 +
 mm/page_alloc.c            |   27 +++++--
 mm/slab.c                  |  192 +++++++++++++++++++++++++++++++++++++++-----
 mm/slub.c                  |   29 ++++++-
 6 files changed, 263 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 27c741c..ad0ad6f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -54,6 +54,15 @@ struct page {
 		union {
 			pgoff_t index;		/* Our offset within mapping. */
 			void *freelist;		/* slub/slob first free object */
+			bool pfmemalloc;	/* If set by the page allocator,
+						 * ALLOC_PFMEMALLOC was set
+						 * and the low watermark was not
+						 * met implying that the system
+						 * is under some pressure. The
+						 * caller should try ensure
+						 * this page is only used to
+						 * free other pages.
+						 */
 		};
 
 		union {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c88d2a9..e66eb0d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -453,6 +453,34 @@ static inline int PageTransTail(struct page *page)
 }
 #endif
 
+/*
+ * If network-based swap is enabled, sl*b must keep track of whether pages
+ * were allocated from pfmemalloc reserves.
+ */
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	return PageActive(page);
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	SetPageActive(page);
+}
+
+static inline void __ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	__ClearPageActive(page);
+}
+
+static inline void ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	ClearPageActive(page);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1 << PG_mlocked)
 #else
diff --git a/mm/internal.h b/mm/internal.h
index 0e20156..5d4a634 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -282,6 +282,9 @@ static inline struct page *mem_map_next(struct page *iter,
 #define __paginginit __init
 #endif
 
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c29b1c..9c697e5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1505,6 +1505,7 @@ failed:
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC	0x80 /* Caller has PF_MEMALLOC set */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -2262,16 +2263,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((current->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+	if ((current->flags & PF_MEMALLOC) ||
+			unlikely(test_thread_flag(TIF_MEMDIE))) {
+		alloc_flags |= ALLOC_PFMEMALLOC;
+
+		if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
 	return alloc_flags;
 }
 
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2459,10 +2466,18 @@ nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 	return page;
 got_pg:
+	/*
+	 * page->pfmemalloc is set when the caller had PFMEMALLOC set or is
+	 * been OOM killed. The expectation is that the caller is taking
+	 * steps that will free more memory. The caller should avoid the
+	 * page being used for !PFMEMALLOC purposes.
+	 */
+	page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
 	if (kmemcheck_enabled)
 		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
-	return page;
 
+	return page;
 }
 
 /*
@@ -2513,6 +2528,8 @@ retry_cpuset:
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
+	else
+		page->pfmemalloc = false;
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
diff --git a/mm/slab.c b/mm/slab.c
index 64c3d03..c0d51f1 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -123,6 +123,8 @@
 
 #include <trace/events/kmem.h>
 
+#include	"internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *		  0 for faster, smaller code (especially in the critical paths).
@@ -151,6 +153,12 @@
 #define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
 #endif
 
+/*
+ * true if a page was allocated from pfmemalloc reserves for network-based
+ * swap
+ */
+static bool pfmemalloc_active __read_mostly;
+
 /* Legal flag mask for kmem_cache_create(). */
 #if DEBUG
 # define CREATE_MASK	(SLAB_RED_ZONE | \
@@ -256,9 +264,30 @@ struct array_cache {
 			 * Must have this definition in here for the proper
 			 * alignment of array_cache. Also simplifies accessing
 			 * the entries.
+			 *
+			 * Entries should not be directly dereferenced as
+			 * entries belonging to slabs marked pfmemalloc will
+			 * have the lower bits set SLAB_OBJ_PFMEMALLOC
 			 */
 };
 
+#define SLAB_OBJ_PFMEMALLOC	1
+static inline bool is_obj_pfmemalloc(void *objp)
+{
+	return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC;
+}
+
+static inline void set_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
+	return;
+}
+
+static inline void clear_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC);
+}
+
 /*
  * bootstrap: The caches do not work without cpuarrays anymore, but the
  * cpuarrays are allocated from the generic caches...
@@ -926,6 +955,102 @@ static struct array_cache *alloc_arraycache(int node, int entries,
 	return nc;
 }
 
+static inline bool is_slab_pfmemalloc(struct slab *slabp)
+{
+	struct page *page = virt_to_page(slabp->s_mem);
+
+	return PageSlabPfmemalloc(page);
+}
+
+/* Clears pfmemalloc_active if no slabs have pfmemalloc set */
+static void recheck_pfmemalloc_active(struct kmem_cache *cachep,
+						struct array_cache *ac)
+{
+	struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
+	struct slab *slabp;
+	unsigned long flags;
+
+	if (!pfmemalloc_active)
+		return;
+
+	spin_lock_irqsave(&l3->list_lock, flags);
+	list_for_each_entry(slabp, &l3->slabs_full, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	list_for_each_entry(slabp, &l3->slabs_partial, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	list_for_each_entry(slabp, &l3->slabs_free, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	pfmemalloc_active = false;
+out:
+	spin_unlock_irqrestore(&l3->list_lock, flags);
+}
+
+static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+						gfp_t flags, bool force_refill)
+{
+	int i;
+	void *objp = ac->entry[--ac->avail];
+
+	/* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
+	if (unlikely(is_obj_pfmemalloc(objp))) {
+		struct kmem_list3 *l3;
+
+		if (gfp_pfmemalloc_allowed(flags)) {
+			clear_obj_pfmemalloc(&objp);
+			return objp;
+		}
+
+		/* The caller cannot use PFMEMALLOC objects, find another one */
+		for (i = 1; i < ac->avail; i++) {
+			/* If a !PFMEMALLOC object is found, swap them */
+			if (!is_obj_pfmemalloc(ac->entry[i])) {
+				objp = ac->entry[i];
+				ac->entry[i] = ac->entry[ac->avail];
+				ac->entry[ac->avail] = objp;
+				return objp;
+			}
+		}
+
+		/*
+		 * If there are empty slabs on the slabs_free list and we are
+		 * being forced to refill the cache, mark this one !pfmemalloc.
+		 */
+		l3 = cachep->nodelists[numa_mem_id()];
+		if (!list_empty(&l3->slabs_free) && force_refill) {
+			struct slab *slabp = virt_to_slab(objp);
+			ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
+			clear_obj_pfmemalloc(&objp);
+			recheck_pfmemalloc_active(cachep, ac);
+			return objp;
+		}
+
+		/* No !PFMEMALLOC objects available */
+		ac->avail++;
+		objp = NULL;
+	}
+
+	return objp;
+}
+
+static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+								void *objp)
+{
+	if (unlikely(pfmemalloc_active)) {
+		/* Some pfmemalloc slabs exist, check if this is one */
+		struct page *page = virt_to_page(objp);
+		if (PageSlabPfmemalloc(page))
+			set_obj_pfmemalloc(&objp);
+	}
+
+	ac->entry[ac->avail++] = objp;
+}
+
 /*
  * Transfer objects in one arraycache to another.
  * Locking must be handled by the caller.
@@ -1102,7 +1227,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 			STATS_INC_ACOVERFLOW(cachep);
 			__drain_alien_cache(cachep, alien, nodeid);
 		}
-		alien->entry[alien->avail++] = objp;
+		ac_put_obj(cachep, alien, objp);
 		spin_unlock(&alien->lock);
 	} else {
 		spin_lock(&(cachep->nodelists[nodeid])->list_lock);
@@ -1782,6 +1907,10 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 		return NULL;
 	}
 
+	/* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+	if (unlikely(page->pfmemalloc))
+		pfmemalloc_active = true;
+
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -1789,9 +1918,13 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	else
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_UNRECLAIMABLE, nr_pages);
-	for (i = 0; i < nr_pages; i++)
+	for (i = 0; i < nr_pages; i++) {
 		__SetPageSlab(page + i);
 
+		if (page->pfmemalloc)
+			SetPageSlabPfmemalloc(page + i);
+	}
+
 	if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
 		kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
 
@@ -1823,6 +1956,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
 				NR_SLAB_UNRECLAIMABLE, nr_freed);
 	while (i--) {
 		BUG_ON(!PageSlab(page));
+		__ClearPageSlabPfmemalloc(page);
 		__ClearPageSlab(page);
 		page++;
 	}
@@ -3094,16 +3228,19 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
+							bool force_refill)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
 	struct array_cache *ac;
 	int node;
 
-retry:
 	check_irq_off();
 	node = numa_mem_id();
+	if (unlikely(force_refill))
+		goto force_grow;
+retry:
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3153,8 +3290,8 @@ retry:
 			STATS_INC_ACTIVE(cachep);
 			STATS_SET_HIGH(cachep);
 
-			ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,
-							    node);
+			ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
+									node));
 		}
 		check_slabp(cachep, slabp);
 
@@ -3173,18 +3310,22 @@ alloc_done:
 
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || force_refill))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
 			goto retry;
 	}
 	ac->touched = 1;
-	return ac->entry[--ac->avail];
+
+	return ac_get_obj(cachep, ac, flags, force_refill);
 }
 
 static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
@@ -3266,23 +3407,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
 	void *objp;
 	struct array_cache *ac;
+	bool force_refill = false;
 
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
 	if (likely(ac->avail)) {
-		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
-		objp = ac->entry[--ac->avail];
-	} else {
-		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = ac_get_obj(cachep, ac, flags, false);
+
 		/*
-		 * the 'ac' may be updated by cache_alloc_refill(),
-		 * and kmemleak_erase() requires its correct value.
+		 * Allow for the possibility all avail objects are not allowed
+		 * by the current flags
 		 */
-		ac = cpu_cache_get(cachep);
+		if (objp) {
+			STATS_INC_ALLOCHIT(cachep);
+			goto out;
+		}
+		force_refill = true;
 	}
+
+	STATS_INC_ALLOCMISS(cachep);
+	objp = cache_alloc_refill(cachep, flags, force_refill);
+	/*
+	 * the 'ac' may be updated by cache_alloc_refill(),
+	 * and kmemleak_erase() requires its correct value.
+	 */
+	ac = cpu_cache_get(cachep);
+
+out:
 	/*
 	 * To avoid a false negative, if an object that is in one of the
 	 * per-CPU caches is leaked, we need to make sure kmemleak doesn't
@@ -3604,9 +3757,12 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
 	struct kmem_list3 *l3;
 
 	for (i = 0; i < nr_objects; i++) {
-		void *objp = objpp[i];
+		void *objp;
 		struct slab *slabp;
 
+		clear_obj_pfmemalloc(&objpp[i]);
+		objp = objpp[i];
+
 		slabp = virt_to_slab(objp);
 		l3 = cachep->nodelists[node];
 		list_del(&slabp->list);
@@ -3724,7 +3880,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
 		cache_flusharray(cachep, ac);
 	}
 
-	ac->entry[ac->avail++] = objp;
+	ac_put_obj(cachep, ac, objp);
 }
 
 /**
diff --git a/mm/slub.c b/mm/slub.c
index cc4ed03..05929df 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -33,6 +33,8 @@
 
 #include <trace/events/kmem.h>
 
+#include "internal.h"
+
 /*
  * Lock order:
  *   1. slub_lock (Global Semaphore)
@@ -1370,6 +1372,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	__SetPageSlab(page);
+	if (page->pfmemalloc)
+		SetPageSlabPfmemalloc(page);
 
 	start = page_address(page);
 
@@ -1413,6 +1417,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
 		-pages);
 
+	__ClearPageSlabPfmemalloc(page);
 	__ClearPageSlab(page);
 	reset_page_mapcount(page);
 	if (current->reclaim_state)
@@ -2132,6 +2137,14 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 	return freelist;
 }
 
+static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
+{
+	if (unlikely(PageSlabPfmemalloc(page)))
+		return gfp_pfmemalloc_allowed(gfpflags);
+
+	return true;
+}
+
 /*
  * Check the page->freelist of a page and either transfer the freelist to the per cpu freelist
  * or deactivate the page.
@@ -2212,6 +2225,18 @@ redo:
 		goto new_slab;
 	}
 
+	/*
+	 * By rights, we should be searching for a slab page that was
+	 * PFMEMALLOC but right now, we are losing the pfmemalloc
+	 * information when the page leaves the per-cpu allocator
+	 */
+	if (unlikely(!pfmemalloc_match(page, gfpflags))) {
+		deactivate_slab(s, page, c->freelist);
+		c->page = NULL;
+		c->freelist = NULL;
+		goto new_slab;
+	}
+
 	/* must check again c->freelist in case of cpu migration or IRQ */
 	freelist = c->freelist;
 	if (freelist)
@@ -2318,8 +2343,8 @@ redo:
 
 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !node_match(page, node)))
-
+	if (unlikely(!object || !node_match(page, node) ||
+					!pfmemalloc_match(page, gfpflags)))
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-- 
1.7.9.2

^ permalink raw reply related

* [PATCH 02/16] mm: slub: Optimise the SLUB fast path to avoid pfmemalloc checks
From: Mel Gorman @ 2012-06-29 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340976767-5737-1-git-send-email-mgorman@suse.de>

From: Christoph Lameter <cl@linux.com>

This patch removes the check for pfmemalloc from the alloc hotpath and
puts the logic after the selection of a new per cpu slab. For a pfmemalloc
page we do not use the fast path but force the use of the slow path which
is also used for the debug case.

This has the side-effect of weakening pfmemalloc processing in the
following way:

1. A process that is allocating for network swap calls __slab_alloc.
   pfmemalloc_match is true so the freelist is loaded and c->freelist is
   now pointing to a pfmemalloc page.

2. A process that is attempting normal allocations calls slab_alloc,
   finds the pfmemalloc page on the freelist and uses it because it did
   not check pfmemalloc_match().

The patch allows non-pfmemalloc allocations to use pfmemalloc pages with
the kmalloc slabs being the most vulnerable caches on the grounds they
are most likely to have a mix of pfmemalloc and !pfmemalloc requests. A
later patch will still protect the system as processes will get throttled
if the pfmemalloc reserves get depleted but performance will not degrade
as smoothly.
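
The two-step scenario above can be replayed with a toy model. This is only a
sketch of the reasoning, not SLUB code: struct toy_page, struct toy_cpu_slab
and the toy_alloc_* helpers are hypothetical, and the freelist is reduced to
a counter.

```c
#include <stdbool.h>

struct toy_page {
	bool pfmemalloc;	/* page came from the reserves */
	int objects;		/* free objects on the page */
};

struct toy_cpu_slab {
	struct toy_page *page;
	int free_objects;	/* stands in for c->freelist */
};

static bool toy_pfmemalloc_match(const struct toy_page *page,
				 bool may_use_reserves)
{
	return !page->pfmemalloc || may_use_reserves;
}

/* Slow path: checks the caller before loading the page's freelist. */
static bool toy_alloc_slow(struct toy_cpu_slab *c, struct toy_page *page,
			   bool may_use_reserves)
{
	if (page->objects == 0 || !toy_pfmemalloc_match(page, may_use_reserves))
		return false;

	c->page = page;
	c->free_objects = page->objects - 1;	/* one object goes to the caller */
	page->objects = 0;
	return true;
}

/* Fast path: takes from the loaded freelist with no pfmemalloc check. */
static bool toy_alloc_fast(struct toy_cpu_slab *c)
{
	if (c->free_objects == 0)
		return false;

	c->free_objects--;
	return true;
}
```

Once a privileged caller has loaded a reserve page through toy_alloc_slow(),
any later toy_alloc_fast() succeeds regardless of who is asking, which is the
weakening the changelog describes.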

[mgorman@suse.de: Expanded changelog]
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/slub.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 05929df..87d59ea 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2287,11 +2287,11 @@ new_slab:
 	}

 	page = c->page;
-	if (likely(!kmem_cache_debug(s)))
+	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
 		goto load_freelist;

 	/* Only entered in the debug case */
-	if (!alloc_debug_processing(s, page, freelist, addr))
+	if (kmem_cache_debug(s) && !alloc_debug_processing(s, page, freelist, addr))
 		goto new_slab;	/* Slab failed checks. Next slab needed */

 	deactivate_slab(s, page, get_freepointer(s, freelist));
@@ -2343,8 +2343,7 @@ redo:

 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !node_match(page, node) ||
-					!pfmemalloc_match(page, gfpflags)))
+	if (unlikely(!object || !node_match(page, node)))
 		object = __slab_alloc(s, gfpflags, node, addr, c);

 	else {
-- 
1.7.9.2

^ permalink raw reply related

* [PATCH 03/16] mm: Introduce __GFP_MEMALLOC to allow access to emergency reserves
From: Mel Gorman @ 2012-06-29 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340976767-5737-1-git-send-email-mgorman@suse.de>

__GFP_MEMALLOC will allow the allocation to disregard the watermarks,
much like PF_MEMALLOC. It allows one to pass along the memalloc state
in object-related allocation flags, such as sk->sk_allocation, as opposed
to task-related flags. This removes the need for ALLOC_PFMEMALLOC
as callers using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARKS flag,
which is now enough to identify allocations related to page reclaim.
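
As a sketch of the intended decision (a userspace model, not the hunk below
reproduced verbatim), the precedence between the two GFP flags and the task
state might be written as follows. The TOY_* constants, struct toy_ctx and
toy_may_ignore_watermarks() are hypothetical; the task-state branch follows
the check that existed before this series, which is an assumption about what
the final fallback is meant to do.

```c
#include <stdbool.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define TOY_GFP_MEMALLOC	0x1u
#define TOY_GFP_NOMEMALLOC	0x2u

struct toy_ctx {
	bool in_interrupt;	/* running in hard or soft IRQ context */
	bool task_memalloc;	/* stands in for PF_MEMALLOC */
	bool oom_victim;	/* stands in for TIF_MEMDIE */
};

/* Would this allocation be allowed to ignore the watermarks? */
static bool toy_may_ignore_watermarks(unsigned int gfp,
				      const struct toy_ctx *ctx)
{
	/* The "no reserves" flag wins if both flags are set. */
	if (gfp & TOY_GFP_NOMEMALLOC)
		return false;

	/* The per-allocation flag works in any context. */
	if (gfp & TOY_GFP_MEMALLOC)
		return true;

	/* Otherwise fall back to the state of the current task. */
	return !ctx->in_interrupt && (ctx->task_memalloc || ctx->oom_victim);
}
```

The point of the per-allocation flag is the second branch: a socket can carry
the permission in its allocation mask even when the allocating context has no
suitable task flags.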

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h             |   10 ++++++++--
 include/linux/mm_types.h        |    2 +-
 include/trace/events/gfpflags.h |    1 +
 mm/page_alloc.c                 |   22 ++++++++++------------
 mm/slab.c                       |    2 +-
 5 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1e49be4..cbd7400 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -23,6 +23,7 @@ struct vm_area_struct;
 #define ___GFP_REPEAT		0x400u
 #define ___GFP_NOFAIL		0x800u
 #define ___GFP_NORETRY		0x1000u
+#define ___GFP_MEMALLOC		0x2000u
 #define ___GFP_COMP		0x4000u
 #define ___GFP_ZERO		0x8000u
 #define ___GFP_NOMEMALLOC	0x10000u
@@ -76,9 +77,14 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)	/* See above */
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)	/* See above */
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY) /* See above */
+#define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)	/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)	/* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves */
+#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves.
+							 * This takes precedence over the
+							 * __GFP_MEMALLOC flag if both are
+							 * set
+							 */
 #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
@@ -129,7 +135,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_NOMEMALLOC)
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control slab gfp mask during early boot */
 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ad0ad6f..8120fdc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -55,7 +55,7 @@ struct page {
 			pgoff_t index;		/* Our offset within mapping. */
 			void *freelist;		/* slub/slob first free object */
 			bool pfmemalloc;	/* If set by the page allocator,
-						 * ALLOC_PFMEMALLOC was set
+						 * ALLOC_NO_WATERMARKS was set
 						 * and the low watermark was not
 						 * met implying that the system
 						 * is under some pressure. The
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 9fe3a366..d6fd8e5 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -30,6 +30,7 @@
 	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
 	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
 	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
+	{(unsigned long)__GFP_MEMALLOC,		"GFP_MEMALLOC"},	\
 	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
 	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
 	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9c697e5..b6c0727 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1505,7 +1505,6 @@ failed:
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-#define ALLOC_PFMEMALLOC	0x80 /* Caller has PF_MEMALLOC set */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -2263,11 +2262,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if ((current->flags & PF_MEMALLOC) ||
-			unlikely(test_thread_flag(TIF_MEMDIE))) {
-		alloc_flags |= ALLOC_PFMEMALLOC;
-
-		if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
@@ -2276,7 +2274,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
-	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
 static inline struct page *
@@ -2467,12 +2465,12 @@ nopage:
 	return page;
 got_pg:
 	/*
-	 * page->pfmemalloc is set when the caller had PFMEMALLOC set or is
-	 * been OOM killed. The expectation is that the caller is taking
-	 * steps that will free more memory. The caller should avoid the
-	 * page being used for !PFMEMALLOC purposes.
+	 * page->pfmemalloc is set when the caller had PFMEMALLOC set, is
+	 * been OOM killed or specified __GFP_MEMALLOC. The expectation is
+	 * that the caller is taking steps that will free more memory. The
+	 * caller should avoid the page being used for !PFMEMALLOC purposes.
 	 */
-	page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+	page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
 
 	if (kmemcheck_enabled)
 		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
diff --git a/mm/slab.c b/mm/slab.c
index c0d51f1..d9fe508 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1907,7 +1907,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 		return NULL;
 	}
 
-	/* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
 	if (unlikely(page->pfmemalloc))
 		pfmemalloc_active = true;
 
-- 
1.7.9.2

^ permalink raw reply related

* [PATCH 04/16] mm: allow PF_MEMALLOC from softirq context
From: Mel Gorman @ 2012-06-29 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340976767-5737-1-git-send-email-mgorman@suse.de>

This is needed to allow network softirq packet processing to make
use of PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC because it is not
associated with a task and therefore has no task flags to fiddle with.
As a result, the gfp to alloc flag mapping ignores the task flags when
in interrupt (hard or soft) context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. This patch borrows the task flags from whatever process happens
to be preempted by the softirq. It then modifies the gfp to alloc flags
mapping to not exclude task flags in softirq context, and modifies the
softirq code to save, clear and restore the PF_MEMALLOC flag.

The save and clear ensure the preempted task's PF_MEMALLOC flag
doesn't leak into the softirq. The restore ensures a softirq's
PF_MEMALLOC flag cannot leak back into the preempted process. This
should be safe for the following reasons:

Softirqs can run on multiple CPUs, but the same task should not be
	executing the same softirq code. Neither should the softirq
	handler be preempted by any other softirq handler, so the flags
	should not leak to an unrelated softirq.

Softirqs re-enable hardware interrupts in __do_softirq() and so can
	be preempted by hardware interrupts, in which case PF_MEMALLOC
	is inherited by the hard IRQ. However, this is similar to a
	process in reclaim being preempted by a hardirq. While
	PF_MEMALLOC is set, gfp_to_alloc_flags() distinguishes between
	hard and soft irqs and avoids giving a hardirq the
	ALLOC_NO_WATERMARKS flag.

If the softirq is deferred to ksoftirqd then its flags may be used
	instead of a normal task's, but as the softirq cannot be
	preempted, the PF_MEMALLOC flag does not leak to other code by
	accident.

[davem@davemloft.net: Document why PF_MEMALLOC is safe]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
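
Note for reviewers (not part of the changelog): the save, clear and
restore sequence described above can be sketched in userspace as
follows. This is only a model under stated assumptions. struct
task_model, PF_OTHER and run_softirq_model() are hypothetical names
made up for illustration, the flag values are stand-ins, and only the
body of tsk_restore_flags() mirrors the helper added by this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins; the real definitions live in <linux/sched.h>. */
#define PF_MEMALLOC 0x00000800UL
#define PF_OTHER    0x00000001UL	/* hypothetical unrelated flag */

/* Hypothetical stand-in for struct task_struct: only the flags word. */
struct task_model {
	unsigned long flags;
};

/* Same logic as the helper added to include/linux/sched.h. */
static void tsk_restore_flags(struct task_model *task,
			      unsigned long orig_flags, unsigned long flags)
{
	task->flags &= ~flags;			/* drop the current value */
	task->flags |= orig_flags & flags;	/* put the saved value back */
}

/*
 * Models the flag handling in __do_softirq(). Returns the PF_MEMALLOC
 * bit as the softirq handler would see it on entry.
 */
static unsigned long run_softirq_model(struct task_model *cur,
				       int handler_sets_memalloc)
{
	unsigned long old_flags, seen;

	assert(cur != NULL);
	old_flags = cur->flags;			/* save */
	cur->flags &= ~PF_MEMALLOC;		/* clear for the softirq */
	seen = cur->flags & PF_MEMALLOC;

	if (handler_sets_memalloc)		/* e.g. network RX for swap */
		cur->flags |= PF_MEMALLOC;

	tsk_restore_flags(cur, old_flags, PF_MEMALLOC);	/* restore */
	return seen;
}
```

In this model a task that had PF_MEMALLOC set gets it back unchanged
after the softirq, a task that did not have it never inherits it from
the handler, and the handler always starts with the bit clear.
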
 include/linux/sched.h |    7 +++++++
 kernel/softirq.c      |    9 +++++++++
 mm/page_alloc.c       |    6 +++++-
 3 files changed, 21 insertions(+), 1 deletion(-)
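
For review convenience, the decision made by gfp_to_alloc_flags()
after this patch can be read as the small truth function below. It is
a sketch, not kernel code: struct alloc_request, enum ctx_model and
may_ignore_watermarks() are hypothetical names, and the three contexts
are modelled as mutually exclusive, which is how the changelog
describes them rather than a statement about how preempt_count behaves
when a hardirq nests inside a softirq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the context the allocation runs in. */
enum ctx_model { CTX_PROCESS, CTX_SOFTIRQ, CTX_HARDIRQ };

/* Hypothetical bundle of the inputs gfp_to_alloc_flags() looks at. */
struct alloc_request {
	bool gfp_memalloc;	/* gfp_mask & __GFP_MEMALLOC */
	bool gfp_nomemalloc;	/* gfp_mask & __GFP_NOMEMALLOC */
	bool pf_memalloc;	/* current->flags & PF_MEMALLOC */
	bool tif_memdie;	/* test_thread_flag(TIF_MEMDIE) */
	enum ctx_model ctx;
};

/* True when the request would be granted ALLOC_NO_WATERMARKS. */
static bool may_ignore_watermarks(const struct alloc_request *r)
{
	assert(r != NULL);

	if (r->gfp_nomemalloc)
		return false;	/* caller opted out of the reserves */
	if (r->gfp_memalloc)
		return true;	/* explicit request, any context */
	if (r->ctx == CTX_SOFTIRQ && r->pf_memalloc)
		return true;	/* new: softirq honours PF_MEMALLOC */
	if (r->ctx == CTX_PROCESS && (r->pf_memalloc || r->tif_memdie))
		return true;	/* unchanged process-context behaviour */
	return false;		/* hardirq never gets it implicitly */
}
```

The only new case is the softirq branch. A hardirq still needs an
explicit __GFP_MEMALLOC to reach the reserves in this model.
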

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 08384db..706e405 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1913,6 +1913,13 @@ static inline void rcu_switch_from(struct task_struct *prev)
 
 #endif
 
+static inline void tsk_restore_flags(struct task_struct *task,
+				unsigned long orig_flags, unsigned long flags)
+{
+	task->flags &= ~flags;
+	task->flags |= orig_flags & flags;
+}
+
 #ifdef CONFIG_SMP
 extern void do_set_cpus_allowed(struct task_struct *p,
 			       const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 671f959..b73e681 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -210,6 +210,14 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long old_flags = current->flags;
+
+	/*
+	 * Mask out PF_MEMALLOC as current task context is borrowed for the
+	 * softirq. A softirq handler such as network RX might set PF_MEMALLOC
+	 * again if the socket is related to swap
+	 */
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -265,6 +273,7 @@ restart:
 
 	account_system_vtime(current);
 	__local_bh_enable(SOFTIRQ_OFFSET);
+	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6c0727..5c6d9c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2265,7 +2265,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
 		if (gfp_mask & __GFP_MEMALLOC)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				((current->flags & PF_MEMALLOC) ||
+				 unlikely(test_thread_flag(TIF_MEMDIE))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 05/16] mm: Only set page->pfmemalloc when ALLOC_NO_WATERMARKS was used
From: Mel Gorman @ 2012-06-29 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340976767-5737-1-git-send-email-mgorman@suse.de>

__alloc_pages_slowpath() is called when the number of free pages is below
the low watermark. If the caller is entitled to use ALLOC_NO_WATERMARKS
then the page will be marked page->pfmemalloc, even if the allocation
succeeded without ignoring the watermarks.  This protects more pages
than are strictly necessary as we only need to protect pages allocated
below the min watermark (the pfmemalloc reserves).

This patch only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was
required to allocate the page.

[rientjes@google.com: David noticed the problem during review]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
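
Note for reviewers (not part of the changelog): a rough userspace
model of the behaviour after this patch. The watermark-respecting
attempt is tried first and only a page that could not be obtained that
way is marked pfmemalloc. struct zone_model, struct page_model and the
functions below are hypothetical, and the watermark check is
simplified to a single zone and order-0 pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-zone model: free page count and min watermark. */
struct zone_model {
	long free_pages;
	long min_wmark;
};

/* Hypothetical result of one order-0 allocation attempt. */
struct page_model {
	bool allocated;
	bool pfmemalloc;
};

/* Succeeds only if the zone stays at or above the min watermark. */
static bool alloc_with_watermark(struct zone_model *z)
{
	if (z->free_pages - 1 < z->min_wmark)
		return false;
	z->free_pages--;
	return true;
}

/* Dips into the reserves; fails only when nothing is left at all. */
static bool alloc_no_watermark(struct zone_model *z)
{
	if (z->free_pages <= 0)
		return false;
	z->free_pages--;
	return true;
}

static struct page_model alloc_page_model(struct zone_model *z,
					  bool may_ignore_watermarks)
{
	struct page_model page = { false, false };

	assert(z != NULL);

	/* Normal attempt, i.e. alloc_flags & ~ALLOC_NO_WATERMARKS */
	if (alloc_with_watermark(z)) {
		page.allocated = true;
		return page;
	}

	/* Reserves were needed: this is the only path that marks the page */
	if (may_ignore_watermarks && alloc_no_watermark(z)) {
		page.allocated = true;
		page.pfmemalloc = true;
	}
	return page;
}
```

With the old behaviour the first branch would also have set pfmemalloc
for any caller entitled to ALLOC_NO_WATERMARKS. In this model it is
set only when the second branch is taken.
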
 mm/page_alloc.c |   27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c6d9c6..9883cf7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2085,8 +2085,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
-				alloc_flags, preferred_zone,
-				migratetype);
+				alloc_flags & ~ALLOC_NO_WATERMARKS,
+				preferred_zone, migratetype);
 		if (page) {
 			preferred_zone->compact_considered = 0;
 			preferred_zone->compact_defer_shift = 0;
@@ -2178,8 +2178,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
-					alloc_flags, preferred_zone,
-					migratetype);
+					alloc_flags & ~ALLOC_NO_WATERMARKS,
+					preferred_zone, migratetype);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -2350,8 +2350,17 @@ rebalance:
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
-		if (page)
+		if (page) {
+			/*
+			 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
+			 * necessary to allocate the page. The expectation is
+			 * that the caller is taking steps that will free more
+			 * memory. The caller should avoid the page being used
+			 * for !PFMEMALLOC purposes.
+			 */
+			page->pfmemalloc = true;
 			goto got_pg;
+		}
 	}
 
 	/* Atomic allocations - we can't balance anything */
@@ -2468,14 +2477,6 @@ nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 	return page;
 got_pg:
-	/*
-	 * page->pfmemalloc is set when the caller had PFMEMALLOC set, is
-	 * been OOM killed or specified __GFP_MEMALLOC. The expectation is
-	 * that the caller is taking steps that will free more memory. The
-	 * caller should avoid the page being used for !PFMEMALLOC purposes.
-	 */
-	page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
-
 	if (kmemcheck_enabled)
 		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
 
-- 
1.7.9.2

^ permalink raw reply related

* [PATCH 06/16] mm: Ignore mempolicies when using ALLOC_NO_WATERMARKS
From: Mel Gorman @ 2012-06-29 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson, Eric Dumazet,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340976767-5737-1-git-send-email-mgorman@suse.de>

The reserve is proportionally distributed over all !highmem zones
in the system, so an emergency allocation needs access to all zones.
To allow that, we need to break out of any mempolicy boundaries we
might have.

In my opinion that does not break mempolicies as those are user
oriented and not system oriented. That is, system allocations are
not guaranteed to be within mempolicy boundaries. For instance, IRQs
do not even have a mempolicy.

So breaking out of mempolicy boundaries for 'rare' emergency
allocations, which are always system allocations (as opposed to user
allocations), is ok.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9883cf7..981272d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2347,6 +2347,13 @@ rebalance:
 
 	/* Allocate without watermarks if the context allows */
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
+		/*
+		 * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds
+		 * the allocation is high priority and this type of
+		 * allocation is system rather than user oriented
+		 */
+		zonelist = node_zonelist(numa_node_id(), gfp_mask);
+
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
-- 
1.7.9.2

^ permalink raw reply related
