Netdev List


* Re: 10GBE performance drop with net.ipv4.tcp_timestamps=0
From: Eric Dumazet @ 2012-06-20 10:06 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Linux Netdev List
In-Reply-To: <4FE19CFC.8030408@profihost.ag>

On Wed, 2012-06-20 at 11:50 +0200, Stefan Priebe - Profihost AG wrote:
> On 20.06.2012 11:47, Eric Dumazet wrote:
> > On Wed, 2012-06-20 at 11:33 +0200, Stefan Priebe - Profihost AG wrote:
> >
> >> Sure. In that case I get 4Gbit/s in both variants. I also tried two
> >> other machines, with the same result.
> >>
> >
> > So 3.5 on the receiver is the problem, it seems?
> Yes.
> 
> > And you checked all the stuff about IRQ affinities, I presume, since a
> > lot of things might have changed between 2.6.32 and 3.5?
> 
> It is a single-core E5 Xeon - I've set the affinity like this:

Do you still see the retransmits in the "netstat -s" output?

It might be a firmware or PCI issue; I have the same cards but no problem here.

Can you check whether LRO is on?

ethtool -k eth2

^ permalink raw reply

* Re: [PATCH] can: c_can_pci: limit compilation to archs with clock support
From: Marc Kleine-Budde @ 2012-06-20 10:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-can, federico.vaga
In-Reply-To: <20120620.025452.2203668280120884694.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]

On 06/20/2012 11:54 AM, David Miller wrote:
> From: Marc Kleine-Budde <mkl@pengutronix.de>
> Date: Wed, 20 Jun 2012 11:48:08 +0200
> 
>> In commit:
>>
>>   5b92da0 c_can_pci: generic module for C_CAN/D_CAN on PCI
>>
>> the c_can_pci driver has been added. It uses clk_*() functions
>> unconditionally, resulting in a link error on archs without
>> clock support. This patch adds a "depends on HAVE_CLK" to the
>> Kconfig symbol.
> 
> This is an unreasonable change and I just explained why in my email to
> Federico, did you not see it?

I sent that mail before I received Federico's mail and yours.

> He says that this driver was only tested on an architecture that
> currently doesn't even have clock support in any existing tree, and
> therefore completely relies upon local changes they have to add clock
> support to that platform.
> 
> Which means your change is restricting compilation of this driver to
> platforms the driver was never, ever, tested on.
> 
> Can you see what a complete joke this is?

I think we finally can see the big picture now; I'm preparing a patch
which removes the clk_*() functions.

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply

* Re: [PATCH] can: c_can_pci: limit compilation to archs with clock support
From: David Miller @ 2012-06-20 10:12 UTC (permalink / raw)
  To: mkl; +Cc: netdev, linux-can, federico.vaga
In-Reply-To: <4FE1A1A1.3000105@pengutronix.de>

From: Marc Kleine-Budde <mkl@pengutronix.de>
Date: Wed, 20 Jun 2012 12:10:41 +0200

> I think we finally can see the big picture now; I'm preparing a patch
> which removes the clk_*() functions.

Thank you.

^ permalink raw reply

* Re: [PATCH] usbnet: Activate halt interrupt endpoint before re-submit URB
From: Ming Lei @ 2012-06-20 10:15 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Huajun Li, David Miller, stern, linux-usb, netdev
In-Reply-To: <201206201058.55519.oneukum-l3A5Bk7waGM@public.gmane.org>

On Wed, Jun 20, 2012 at 4:58 PM, Oliver Neukum <oneukum@suse.de> wrote:
> On Wednesday, 20 June 2012, 10:07:55 Ming Lei wrote:
>> BTW, maybe it is better to add the line below
>>
>>     usbnet_defer_kevent(dev, EVENT_STS_HALT);
>>
>> for -EPIPE returned from usb_submit_urb if the URB will be resubmitted.
>
> Why? If it failed once it'll probably also fail the next time.

-EPIPE just means the endpoint is halted, whether it comes from
usb_submit_urb or urb->status, so the halt should be cleared in that
situation.

> In that case we'd need to do something more intrusive
> like resetting the device, but that cannot be done well
> in the generic usbnet part.

IMO, resetting is not needed for -EPIPE, but may be needed for
-EPROTO failure.
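
As a rough illustration of the distinction drawn above, a status code could
be mapped to a recovery action as sketched below. This is plain userspace C,
not usbnet code: the enum, its values and toy_classify_status() are
hypothetical names, and treating -ENODEV as "give up" is an assumption added
only to round out the example.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical recovery actions; the names are illustrative only. */
enum toy_recovery {
	TOY_RESUBMIT,		/* nothing special, just submit again */
	TOY_CLEAR_HALT,		/* endpoint stalled: clear the halt first */
	TOY_MAYBE_RESET,	/* protocol-level failure: a reset may be needed */
	TOY_GIVE_UP		/* assumed: device is gone, stop resubmitting */
};

/* Map a kernel-style status (0 or negative errno) to an action. */
enum toy_recovery toy_classify_status(int status)
{
	assert(status <= 0);

	switch (status) {
	case -EPIPE:
		return TOY_CLEAR_HALT;
	case -EPROTO:
		return TOY_MAYBE_RESET;
	case -ENODEV:
		return TOY_GIVE_UP;
	default:
		return TOY_RESUBMIT;
	}
}
```

In this sketch only -EPIPE leads to a halt being cleared, which matches the
suggestion of deferring EVENT_STS_HALT for that one error.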

Thanks,
-- 
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: David Miller @ 2012-06-20 10:15 UTC (permalink / raw)
  To: eric.dumazet; +Cc: shemminger, netdev
In-Reply-To: <20120619.231412.1236237191660427779.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 19 Jun 2012 23:14:12 -0700 (PDT)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 20 Jun 2012 07:59:00 +0200
> 
>> On Tue, 2012-06-19 at 21:46 -0700, David Miller wrote:
>> 
>>> These numbers can be decreased further, because since we're already
>>> looking at the TCP header we can pre-cook the TCP control block in the
>>> SKB and skip much of the stuff that tcp_v4_rcv() does since we've done
>>> it already in the early demux code.
>> 
>> It could be done at the GRO level, which would remove another demux.
>> 
>> As routers probably have no use for GRO, there is no need for an
>> additional knob.
> 
> That's a great idea.

Here's what I have so far; we get the ipv6 implementation nearly for
free :-)

Initially I tried to use ->gro_complete() for this as it was more
natural, but we abort before we get there for a lot of cases where we
want to use the early demux and cached route (ACKs, FINs, sub-mss
sized packets, etc.)

diff --git a/include/net/protocol.h b/include/net/protocol.h
index 967b926..a1b1b53 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -37,7 +37,6 @@
 
 /* This is used to register protocols. */
 struct net_protocol {
-	int			(*early_demux)(struct sk_buff *skb);
 	int			(*handler)(struct sk_buff *skb);
 	void			(*err_handler)(struct sk_buff *skb, u32 info);
 	int			(*gso_send_check)(struct sk_buff *skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..c1b5626 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2956,6 +2956,12 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		return -ENOMEM;
 
 	__copy_skb_header(nskb, p);
+	if (p->sk) {
+		nskb->sk = p->sk;
+		nskb->destructor = p->destructor;
+		p->sk = NULL;
+		p->destructor = NULL;
+	}
 	nskb->mac_len = p->mac_len;
 
 	skb_reserve(nskb, headroom);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 07a02f6..0aabad7 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1519,7 +1519,6 @@ static const struct net_protocol igmp_protocol = {
 #endif
 
 static const struct net_protocol tcp_protocol = {
-	.early_demux	=	tcp_v4_early_demux,
 	.handler	=	tcp_v4_rcv,
 	.err_handler	=	tcp_v4_err,
 	.gso_send_check	=	tcp_v4_gso_send_check,
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 93b092c..c4fe1d2 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -323,32 +323,19 @@ static int ip_rcv_finish(struct sk_buff *skb)
 	 *	how the packet travels inside Linux networking.
 	 */
 	if (skb_dst(skb) == NULL) {
-		const struct net_protocol *ipprot;
-		int protocol = iph->protocol;
-		int err;
-
-		rcu_read_lock();
-		ipprot = rcu_dereference(inet_protos[protocol]);
-		err = -ENOENT;
-		if (ipprot && ipprot->early_demux)
-			err = ipprot->early_demux(skb);
-		rcu_read_unlock();
-
-		if (err) {
-			err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
-						   iph->tos, skb->dev);
-			if (unlikely(err)) {
-				if (err == -EHOSTUNREACH)
-					IP_INC_STATS_BH(dev_net(skb->dev),
-							IPSTATS_MIB_INADDRERRORS);
-				else if (err == -ENETUNREACH)
-					IP_INC_STATS_BH(dev_net(skb->dev),
-							IPSTATS_MIB_INNOROUTES);
-				else if (err == -EXDEV)
-					NET_INC_STATS_BH(dev_net(skb->dev),
-							 LINUX_MIB_IPRPFILTER);
-				goto drop;
-			}
+		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+					       iph->tos, skb->dev);
+		if (unlikely(err)) {
+			if (err == -EHOSTUNREACH)
+				IP_INC_STATS_BH(dev_net(skb->dev),
+						IPSTATS_MIB_INADDRERRORS);
+			else if (err == -ENETUNREACH)
+				IP_INC_STATS_BH(dev_net(skb->dev),
+						IPSTATS_MIB_INNOROUTES);
+			else if (err == -EXDEV)
+				NET_INC_STATS_BH(dev_net(skb->dev),
+						 LINUX_MIB_IPRPFILTER);
+			goto drop;
 		}
 	}
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 13857df..2a483ad 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1671,52 +1671,6 @@ csum_err:
 }
 EXPORT_SYMBOL(tcp_v4_do_rcv);
 
-int tcp_v4_early_demux(struct sk_buff *skb)
-{
-	struct net *net = dev_net(skb->dev);
-	const struct iphdr *iph;
-	const struct tcphdr *th;
-	struct sock *sk;
-	int err;
-
-	err = -ENOENT;
-	if (skb->pkt_type != PACKET_HOST)
-		goto out_err;
-
-	if (!pskb_may_pull(skb, ip_hdrlen(skb) + sizeof(struct tcphdr)))
-		goto out_err;
-
-	iph = ip_hdr(skb);
-	th = (struct tcphdr *) ((char *)iph + ip_hdrlen(skb));
-
-	if (th->doff < sizeof(struct tcphdr) / 4)
-		goto out_err;
-
-	if (!pskb_may_pull(skb, ip_hdrlen(skb) + th->doff * 4))
-		goto out_err;
-
-	sk = __inet_lookup_established(net, &tcp_hashinfo,
-				       iph->saddr, th->source,
-				       iph->daddr, th->dest,
-				       skb->dev->ifindex);
-	if (sk) {
-		skb->sk = sk;
-		skb->destructor = sock_edemux;
-		if (sk->sk_state != TCP_TIME_WAIT) {
-			struct dst_entry *dst = sk->sk_rx_dst;
-			if (dst)
-				dst = dst_check(dst, 0);
-			if (dst) {
-				skb_dst_set_noref(skb, dst);
-				err = 0;
-			}
-		}
-	}
-
-out_err:
-	return err;
-}
-
 /*
  *	From tcp_input.c
  */
@@ -2576,6 +2530,7 @@ void tcp4_proc_exit(void)
 struct sk_buff **tcp4_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 {
 	const struct iphdr *iph = skb_gro_network_header(skb);
+	struct sk_buff **pp;
 
 	switch (skb->ip_summed) {
 	case CHECKSUM_COMPLETE:
@@ -2591,7 +2546,36 @@ struct sk_buff **tcp4_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		return NULL;
 	}
 
-	return tcp_gro_receive(head, skb);
+	pp = tcp_gro_receive(head, skb);
+
+	if (!NAPI_GRO_CB(skb)->same_flow) {
+		const struct tcphdr *th = tcp_hdr(skb);
+		struct net_device *dev = skb->dev;
+		struct sock *sk;
+
+		sk = __inet_lookup_established(dev_net(dev), &tcp_hashinfo,
+					       iph->saddr, th->source,
+					       iph->daddr, th->dest,
+					       dev->ifindex);
+		if (sk) {
+			skb_orphan(skb);
+			skb->sk = sk;
+			skb->destructor = sock_edemux;
+			if (!skb_dst(skb) &&
+			    sk->sk_state != TCP_TIME_WAIT) {
+				struct dst_entry *dst = sk->sk_rx_dst;
+				if (dst)
+					dst = dst_check(dst, 0);
+				if (dst) {
+					struct rtable *rt = (struct rtable *) dst;
+
+					if (rt->rt_iif == dev->ifindex)
+						skb_dst_set_noref(skb, dst);
+				}
+			}
+		}
+	}
+	return pp;
 }
 
 int tcp4_gro_complete(struct sk_buff *skb)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 26a8862..b8ea463 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -797,6 +797,7 @@ static struct sk_buff **tcp6_gro_receive(struct sk_buff **head,
 					 struct sk_buff *skb)
 {
 	const struct ipv6hdr *iph = skb_gro_network_header(skb);
+	struct sk_buff **pp;
 
 	switch (skb->ip_summed) {
 	case CHECKSUM_COMPLETE:
@@ -812,7 +813,32 @@ static struct sk_buff **tcp6_gro_receive(struct sk_buff **head,
 		return NULL;
 	}
 
-	return tcp_gro_receive(head, skb);
+	pp = tcp_gro_receive(head, skb);
+
+	if (!NAPI_GRO_CB(skb)->same_flow) {
+		const struct tcphdr *th = tcp_hdr(skb);
+		struct net_device *dev = skb->dev;
+		struct sock *sk;
+
+		sk = __inet6_lookup_established(dev_net(dev), &tcp_hashinfo,
+						&iph->saddr, th->source,
+						&iph->daddr, th->dest,
+						dev->ifindex);
+		if (sk) {
+			skb_orphan(skb);
+			skb->sk = sk;
+			skb->destructor = sock_edemux;
+			if (!skb_dst(skb) &&
+			    sk->sk_state != TCP_TIME_WAIT) {
+				struct dst_entry *dst = sk->sk_rx_dst;
+				if (dst)
+					dst = dst_check(dst, 0);
+				if (dst)
+					skb_dst_set(skb, dst);
+			}
+		}
+	}
+	return pp;
 }
 
 static int tcp6_gro_complete(struct sk_buff *skb)
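
The idea behind the patch above (look up the established socket early, then
reuse the route cached on it if that route is still valid for the receiving
interface) can be modelled in a few lines of userspace C. This is only a toy
sketch: every toy_* type, field and function here is a hypothetical stand-in,
the linear search replaces the kernel's hash lookup, and the generation
counter is just one assumed way to model what dst_check() decides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for kernel objects; none of these are real kernel types. */
struct toy_route {
	int iif;		/* input interface the route was learned on */
	unsigned int gen;	/* generation when the route was cached */
};

struct toy_sock {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	int time_wait;			/* nonzero for TIME_WAIT-like sockets */
	struct toy_route *rx_route;	/* cached input route, may be NULL */
};

/* Bumping this counter invalidates every cached route. */
unsigned int toy_route_gen;

/* Analogue of dst_check(): return the route only if it is still current. */
struct toy_route *toy_route_check(struct toy_route *rt)
{
	return (rt && rt->gen == toy_route_gen) ? rt : NULL;
}

/*
 * Find the socket for a 4-tuple and, if possible, reuse its cached route.
 * *skp receives the socket (or NULL). The return value is the route to
 * attach to the packet, or NULL when a full route lookup is still needed.
 */
struct toy_route *toy_early_demux(struct toy_sock *tbl, size_t n,
				  uint32_t saddr, uint16_t sport,
				  uint32_t daddr, uint16_t dport,
				  int ifindex, struct toy_sock **skp)
{
	struct toy_route *rt;
	struct toy_sock *sk = NULL;

	assert(skp != NULL);
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].saddr == saddr && tbl[i].sport == sport &&
		    tbl[i].daddr == daddr && tbl[i].dport == dport) {
			sk = &tbl[i];
			break;
		}
	}
	*skp = sk;
	if (!sk || sk->time_wait)
		return NULL;
	rt = toy_route_check(sk->rx_route);
	if (rt && rt->iif != ifindex)
		return NULL;	/* cached route belongs to another interface */
	return rt;
}
```

As in the ipv4 hunk, the socket is attached even when the cached route cannot
be used; only the route lookup falls back to the slow path.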

^ permalink raw reply related

* Re: [PATCH] netxen: Error return off by one in 'netxen_nic_set_pauseparam()'.
From: santosh prasad nayak @ 2012-06-20 10:16 UTC (permalink / raw)
  To: Rajesh Borundia
  Cc: Dan Carpenter, Sony Chacko, netdev,
	kernel-janitors@vger.kernel.org
In-Reply-To: <13A253B3F9BEFE43B93C09CF75F63CAA81A886EF19@MNEXMB1.qlogic.org>

On Wed, Jun 20, 2012 at 3:21 PM, Rajesh Borundia
<rajesh.borundia@qlogic.com> wrote:
> _______________________________________
> From: santosh prasad nayak [santoshprasadnayak@gmail.com]
> Sent: Wednesday, June 20, 2012 1:29 PM
> To: Dan Carpenter; Rajesh Borundia
> Cc: Sony Chacko; netdev; kernel-janitors@vger.kernel.org
> Subject: Re: [PATCH] netxen: Error return off by one in 'netxen_nic_set_pauseparam()'.
>
> On Wed, Jun 20, 2012 at 1:14 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
>> On Wed, Jun 20, 2012 at 12:57:39PM +0530, santosh nayak wrote:
>>> From: Santosh Nayak <santoshprasadnayak@gmail.com>
>>>
>>> There are 'NETXEN_NIU_MAX_GBE_PORTS' GBE ports. Port indexing starts
>>> from zero.
>>> Hence we should also return an error for "port == NETXEN_NIU_MAX_GBE_PORTS"
>>>
>>
>> I don't know this code well enough to say if you are right or not,
>> but what about for port == NETXEN_NIU_MAX_XG_PORTS a few lines later
>> in both functions?
>
>
> I think an error should also be returned for "port == NETXEN_NIU_MAX_XG_PORTS".
>
>
> @Rajesh,
>
> Can you please comment on it?
>
>
> regards
> santosh
>
>>
>> regards,
>> dan carpenter
>>
>
> Yes, an error should be returned for both port == NETXEN_NIU_MAX_XG_PORTS and
> port == NETXEN_NIU_MAX_GBE_PORTS.


Ok.

The current patch is for the GBE ports.
I will send another patch for the XG ports.

regards
santosh


>
>
> Rajesh
>

^ permalink raw reply

* Re: linux-next: build failure after merge of the net-next tree
From: Federico Vaga @ 2012-06-20 10:17 UTC (permalink / raw)
  To: David Miller
  Cc: mkl, bhupesh.sharma, sfr, netdev, linux-next, linux-kernel,
	giancarlo.asnaghi, wg
In-Reply-To: <20120620.025837.721158723130014230.davem@davemloft.net>

> Why would you try to be generic by using an interface currently
> only available on certain platforms?

I know, I was wrong.

> That is how you make drivers non-portable, and not generic.

It is fixed in my mind now; I have learned the lesson.

-- 
Federico Vaga

^ permalink raw reply

* Re: [PATCH] can: c_can_pci: limit compilation to archs with clock support
From: Federico Vaga @ 2012-06-20 10:18 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: David Miller, netdev, linux-can
In-Reply-To: <4FE1A1A1.3000105@pengutronix.de>

> I think we finally can see the big picture now; I'm preparing a patch
> which removes the clk_*() functions.

Thank you, and sorry for all the trouble.

-- 
Federico Vaga

^ permalink raw reply

* Re: [PATCH] usbnet: Activate halt interrupt endpoint before re-submit URB
From: Oliver Neukum @ 2012-06-20 10:21 UTC (permalink / raw)
  To: Ming Lei; +Cc: Huajun Li, David Miller, stern, linux-usb, netdev
In-Reply-To: <CACVXFVMy6Hrqgw6rmAXv5YXD86sLF+FUPTRRcXKs40uwN6ioCg@mail.gmail.com>

On Wednesday, 20 June 2012, 12:15:25 Ming Lei wrote:
> On Wed, Jun 20, 2012 at 4:58 PM, Oliver Neukum <oneukum@suse.de> wrote:
> > On Wednesday, 20 June 2012, 10:07:55 Ming Lei wrote:
> >> BTW, maybe it is better to add the line below
> >>
> >>     usbnet_defer_kevent(dev, EVENT_STS_HALT);
> >>
> >> for -EPIPE returned from usb_submit_urb if the URB will be resubmitted.
> >
> > Why? If it failed once it'll probably also fail the next time.
> 
> -EPIPE just means the endpoint is halted, whether it comes from
> usb_submit_urb or urb->status, so the halt should be cleared in that
> situation.

It probably was halted and cleared. However, clearing a halt doesn't
mean that the reason for the stall went away.
So you must cope with an endpoint being halted again right after
it was cleared.

> > In that case we'd need to do something more intrusive
> > like resetting the device, but that cannot be done well
> > in the generic usbnet part.
> 
> IMO, resetting is not needed for -EPIPE, but may be needed for
> -EPROTO failure.

We don't need it for a single failure, but what else would we do
if we keep getting -EPIPE?

	Regards
		Oliver

^ permalink raw reply

* Re: linux-next: build failure after merge of the net-next tree
From: Stephen Rothwell @ 2012-06-20 10:26 UTC (permalink / raw)
  To: David Miller
  Cc: viresh.kumar2, bhupesh.sharma, netdev, linux-next, linux-kernel,
	federico.vaga, giancarlo.asnaghi, wg, mkl, spear-devel,
	Andrew Morton
In-Reply-To: <20120620.012037.783895812206310043.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1139 bytes --]

Hi all,

On Wed, 20 Jun 2012 01:20:37 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> From: viresh kumar <viresh.kumar2@arm.com>
> Date: Wed, 20 Jun 2012 09:08:34 +0100
> 
> > Please see following patchset from me, that got applied in linux-next
> > 
> > https://lkml.org/lkml/2012/4/24/154
> > 
> > Please check if this patchset is present in your build repo. I believe it should be
> > there. If it is, then you shouldn't get these errors.
> 
> Well, then Stephen shouldn't get those errors either.
> 
> But obviously he did.
> 
> But all of this talk about changes existing only in linux-next is
> entirely moot, because the damn thing MUST build independently inside
> of net-next, which doesn't have those clock layer changes.
> 
> Someone send me a clean fix for net-next now.

I get those errors because those patches are in the akpm tree which is
merged after everything else ...

One possibility is to put those changes in another (stable) tree and
merge that into the net-next tree (and any other tree that needs it).
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: linux-next: build failure after merge of the net-next tree
From: Marc Kleine-Budde @ 2012-06-20 10:33 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, viresh.kumar2, bhupesh.sharma, netdev, linux-next,
	linux-kernel, federico.vaga, giancarlo.asnaghi, wg, spear-devel,
	Andrew Morton
In-Reply-To: <20120620202604.407af721f746045ae00c8268@canb.auug.org.au>

[-- Attachment #1: Type: text/plain, Size: 1798 bytes --]

On 06/20/2012 12:26 PM, Stephen Rothwell wrote:
> Hi all,
> 
> On Wed, 20 Jun 2012 01:20:37 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>>
>> From: viresh kumar <viresh.kumar2@arm.com>
>> Date: Wed, 20 Jun 2012 09:08:34 +0100
>>
>>> Please see following patchset from me, that got applied in linux-next
>>>
>>> https://lkml.org/lkml/2012/4/24/154
>>>
>>> Please check if this patchset is present in your build repo. I believe it should be
>>> there. If it is, then you shouldn't get these errors.
>>
>> Well, then Stephen shouldn't get those errors either.
>>
>> But obviously he did.
>>
>> But all of this talk about changes existing only in linux-next is
>> entirely moot, because the damn thing MUST build independently inside
>> of net-next, which doesn't have those clock layer changes.
>>
>> Someone send me a clean fix for net-next now.
> 
> I get those errors because those patches are in the akpm tree which is
> merged after everything else ...
> 
> One possibility is to put those changes in another (stable) tree and
> merge that into the net-next tree (and any other tree that needs it).

We're about to remove the offending clk_*() functions from the driver,
as they are untested anyway. The hardware the driver was developed for
uses a clock rate hardcoded in the driver, as the rate cannot be
retrieved from the clock tree. As soon as hardware is available that
works with the clock tree, we can add those functions back.

Sorry for the noise,
Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply

* [PATCH] netxen : Error return off by one for XG port.
From: santosh nayak @ 2012-06-20 10:52 UTC (permalink / raw)
  To: sony.chacko, rajesh.borundia; +Cc: netdev, kernel-janitors, Santosh Nayak

From: Santosh Nayak <santoshprasadnayak@gmail.com>

There are NETXEN_NIU_MAX_XG_PORTS XG ports.
Port indexing starts from zero.
Hence we should also return an error for 'port == NETXEN_NIU_MAX_XG_PORTS'.

Signed-off-by: Santosh Nayak <santoshprasadnayak@gmail.com>
---
 .../ethernet/qlogic/netxen/netxen_nic_ethtool.c    |    4 ++--
 drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c
index d4f179f..9103e3e 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ethtool.c
@@ -511,7 +511,7 @@ netxen_nic_get_pauseparam(struct net_device *dev,
 				break;
 		}
 	} else if (adapter->ahw.port_type == NETXEN_NIC_XGBE) {
-		if ((port < 0) || (port > NETXEN_NIU_MAX_XG_PORTS))
+		if ((port < 0) || (port >= NETXEN_NIU_MAX_XG_PORTS))
 			return;
 		pause->rx_pause = 1;
 		val = NXRD32(adapter, NETXEN_NIU_XG_PAUSE_CTL);
@@ -577,7 +577,7 @@ netxen_nic_set_pauseparam(struct net_device *dev,
 		}
 		NXWR32(adapter, NETXEN_NIU_GB_PAUSE_CTL, val);
 	} else if (adapter->ahw.port_type == NETXEN_NIC_XGBE) {
-		if ((port < 0) || (port > NETXEN_NIU_MAX_XG_PORTS))
+		if ((port < 0) || (port >= NETXEN_NIU_MAX_XG_PORTS))
 			return -EIO;
 		val = NXRD32(adapter, NETXEN_NIU_XG_PAUSE_CTL);
 		if (port == 0) {
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
index de96a94..946160f 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
@@ -365,7 +365,7 @@ static int netxen_niu_disable_xg_port(struct netxen_adapter *adapter)
 	if (NX_IS_REVISION_P3(adapter->ahw.revision_id))
 		return 0;
 
-	if (port > NETXEN_NIU_MAX_XG_PORTS)
+	if (port >= NETXEN_NIU_MAX_XG_PORTS)
 		return -EINVAL;
 
 	mac_cfg = 0;
@@ -392,7 +392,7 @@ static int netxen_p2_nic_set_promisc(struct netxen_adapter *adapter, u32 mode)
 	u32 port = adapter->physical_port;
 	u16 board_type = adapter->ahw.board_type;
 
-	if (port > NETXEN_NIU_MAX_XG_PORTS)
+	if (port >= NETXEN_NIU_MAX_XG_PORTS)
 		return -EINVAL;
 
 	mac_cfg = NXRD32(adapter, NETXEN_NIU_XGE_CONFIG_0 + (0x10000 * port));
-- 
1.7.4.4
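
The off-by-one fixed in the patch above comes down to the usual rule for
zero-based indices: with N ports, the valid indices are 0 to N - 1, so the
rejecting test has to be "port >= N". A minimal userspace sketch follows;
toy_port_is_valid() and TOY_MAX_XG_PORTS are hypothetical names, and the value
2 is purely illustrative rather than the driver's actual constant.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_XG_PORTS 2	/* illustrative value, not the driver's */

/* Valid indices for max_ports entries are 0 .. max_ports - 1. */
bool toy_port_is_valid(int port, int max_ports)
{
	assert(max_ports > 0);
	return port >= 0 && port < max_ports;
}
```

With the old "port > max" test, an index equal to the port count would have
been accepted; here it is rejected along with negative values.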

^ permalink raw reply related

* Re: [PATCH] usbnet: Activate halt interrupt endpoint before re-submit URB
From: Ming Lei @ 2012-06-20 10:56 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Huajun Li, David Miller, stern, linux-usb, netdev
In-Reply-To: <201206201221.37211.oneukum@suse.de>

On Wed, Jun 20, 2012 at 6:21 PM, Oliver Neukum <oneukum@suse.de> wrote:

> It probably was halted and cleared. However, clearing a halt doesn't
> mean that the reason for the stall went away.
> So you must cope with an endpoint being halted again right after
> it was cleared.

I only suggested that we handle -EPIPE from usb_submit_urb on the
interrupt endpoint; it may only be a first step, but at least it
follows the USB spec.

Also, judging from USB gadget device implementations, ClearFeature(HALT)
generally just clears a halt-related flag in the endpoint
hardware.

It looks like the reason for the interrupt endpoint stalling is
invisible to the usbnet driver, so it is not easy to handle the
situation you described (halted and cleared repeatedly).

>
>> > In that case we'd need to do something more intrusive
>> > like resetting the device, but that cannot be done well
>> > in the generic usbnet part.
>>
>> IMO, resetting is not needed for -EPIPE, but may be needed for
>> -EPROTO failure.
>
> We don't need it for a single failure, but what else would we do
> if we keep getting -EPIPE?

Suppose that case does happen: what is the appropriate action for
usbnet to take on the failure? I am not sure a reset can deal
with it.

Also, is it an actual failure case or only a theoretical one?

Thanks,
-- 
Ming Lei

^ permalink raw reply

* Re: [PATCH -v1 3/3] usbnet: handle remote wakeup asap
From: Sergei Shtylyov @ 2012-06-20 11:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: David S. Miller, Greg Kroah-Hartman, Oliver Neukum,
	netdev, linux-usb
In-Reply-To: <1340176553-32225-4-git-send-email-ming.lei-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Hello.

On 20-06-2012 11:15, Ming Lei wrote:

> If usbnet is resumed by remote wakeup, generally there are
> some packets coming in to be handled, so allocate and submit
> rx URBs in usbnet_resume to avoid delays introduced by the tasklet.
> Otherwise, usbnet may have been runtime suspended before
> usbnet_bh is executed to schedule Rx URBs.

> Without the patch, usbnet can't receive any packets from the peer
> in the runtime suspend state if runtime PM is enabled and
> autosuspend_delay is set to zero.

> Signed-off-by: Ming Lei<ming.lei-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> ---
>   drivers/net/usb/usbnet.c |   42 ++++++++++++++++++++++++++----------------
>   1 file changed, 26 insertions(+), 16 deletions(-)

> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> index 9bfa775..a89d6c5 100644
> --- a/drivers/net/usb/usbnet.c
> +++ b/drivers/net/usb/usbnet.c
> @@ -1201,6 +1201,21 @@ deferred:
>   }
>   EXPORT_SYMBOL_GPL(usbnet_start_xmit);
>
> +static void rx_alloc_submit(struct usbnet *dev, gfp_t flags)
> +{
> +	struct urb	*urb;
> +	int		i;
> +
> +	/* don't refill the queue all at once */
> +	for (i = 0; i < 10 && dev->rxq.qlen < RX_QLEN(dev); i++) {
> +		urb = usb_alloc_urb(0, flags);
> +		if (urb != NULL) {
> +			if (rx_submit(dev, urb, flags) == -ENOLINK)

    The above 2 *if* statements can be collapsed into a single one.

> +				return;
> +		}
> +	}
> +}
> +
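
One way the two tests might be merged is shown below as a userspace sketch.
The toy_* names are hypothetical stand-ins: toy_alloc_urb() and
toy_rx_submit() only imitate the roles of usb_alloc_urb() and rx_submit(), and
the assumption that submission consumes the URB on failure is part of the toy
model, not a statement about usbnet.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define TOY_REFILL_BURST 10	/* mirrors the "10" in the quoted loop */

struct toy_dev {
	int qlen;	/* URBs currently queued */
	int qmax;	/* stand-in for RX_QLEN(dev) */
	int link_down;	/* when set, submission fails with -ENOLINK */
};

/* Stand-in for usb_alloc_urb(); returns NULL on allocation failure. */
void *toy_alloc_urb(void)
{
	return malloc(1);
}

/* Stand-in for rx_submit(): consumes the URB, then queues it or fails. */
int toy_rx_submit(struct toy_dev *dev, void *urb)
{
	free(urb);	/* the toy keeps only a count, not the buffer */
	if (dev->link_down)
		return -ENOLINK;
	dev->qlen++;
	return 0;
}

/* Refill in bounded bursts, with the two tests merged into one. */
void toy_rx_alloc_submit(struct toy_dev *dev)
{
	assert(dev != NULL);

	for (int i = 0; i < TOY_REFILL_BURST && dev->qlen < dev->qmax; i++) {
		void *urb = toy_alloc_urb();

		if (urb != NULL && toy_rx_submit(dev, urb) == -ENOLINK)
			return;
	}
}
```

Because && short-circuits, the submit call is only reached for a non-NULL
URB, so the behaviour matches the nested form.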

WBR, Sergei

^ permalink raw reply

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: Eric Dumazet @ 2012-06-20 11:03 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev
In-Reply-To: <20120620.031543.1511134879638711616.davem@davemloft.net>

On Wed, 2012-06-20 at 03:15 -0700, David Miller wrote:

> Here's what I have so far; we get the ipv6 implementation nearly for
> free :-)
> 
> Initially I tried to use ->gro_complete() for this as it was more
> natural, but we abort before we get there for a lot of cases where we
> want to use the early demux and cached route (ACKs, FINs, sub-mss
> sized packets, etc.)
> 

Looks very good; I only have one remark:


>  /*
>   *	From tcp_input.c
>   */
> @@ -2576,6 +2530,7 @@ void tcp4_proc_exit(void)
>  struct sk_buff **tcp4_gro_receive(struct sk_buff **head, struct sk_buff *skb)
>  {
>  	const struct iphdr *iph = skb_gro_network_header(skb);
> +	struct sk_buff **pp;
>  
>  	switch (skb->ip_summed) {
>  	case CHECKSUM_COMPLETE:
> @@ -2591,7 +2546,36 @@ struct sk_buff **tcp4_gro_receive(struct sk_buff **head, struct sk_buff *skb)
>  		return NULL;
>  	}
>  
> -	return tcp_gro_receive(head, skb);
> +	pp = tcp_gro_receive(head, skb);
> +
> +	if (!NAPI_GRO_CB(skb)->same_flow) {
> +		const struct tcphdr *th = tcp_hdr(skb);
> +		struct net_device *dev = skb->dev;
> +		struct sock *sk;
> +
> +		sk = __inet_lookup_established(dev_net(dev), &tcp_hashinfo,
> +					       iph->saddr, th->source,
> +					       iph->daddr, th->dest,
> +					       dev->ifindex);
> +		if (sk) {
> +			skb_orphan(skb);
> +			skb->sk = sk;
> +			skb->destructor = sock_edemux;
> +			if (!skb_dst(skb) &&

I am not sure we need the skb_dst(skb) test here; shouldn't it be NULL
anyway in the GRO layer? (The loopback device doesn't use GRO ;) )

> +			    sk->sk_state != TCP_TIME_WAIT) {
> +				struct dst_entry *dst = sk->sk_rx_dst;
> +				if (dst)
> +					dst = dst_check(dst, 0);
> +				if (dst) {
> +					struct rtable *rt = (struct rtable *) dst;
> +
> +					if (rt->rt_iif == dev->ifindex)
> +						skb_dst_set_noref(skb, dst);
> +				}
> +			}
> +		}
> +	}
> +	return pp;
>  }
>  
>  int tcp4_gro_complete(struct sk_buff *skb)

^ permalink raw reply

* Re: [PATCH 01/17] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Sebastian Andrzej Siewior @ 2012-06-20 11:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <1340184920-22288-2-git-send-email-mgorman@suse.de>

On Wed, Jun 20, 2012 at 10:35:04AM +0100, Mel Gorman wrote:
> [a.p.zijlstra@chello.nl: Original implementation]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
> diff --git a/mm/slab.c b/mm/slab.c
> index e901a36..b190cac 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1851,6 +1984,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
>  	while (i--) {
>  		BUG_ON(!PageSlab(page));
>  		__ClearPageSlab(page);
> +		__ClearPageSlabPfmemalloc(page);
>  		page++;
>  	}
>  	if (current->reclaim_state)
> @@ -3120,16 +3254,19 @@ bad:
> diff --git a/mm/slub.c b/mm/slub.c
> index 8c691fa..43738c9 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1414,6 +1418,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
>  		-pages);
>  
>  	__ClearPageSlab(page);
> +	__ClearPageSlabPfmemalloc(page);
>  	reset_page_mapcount(page);
>  	if (current->reclaim_state)
>  		current->reclaim_state->reclaimed_slab += pages;

So you mention a change here in v11's changelog but I don't see it.

Sebastian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: 10GBE performance drop with net.ipv4.tcp_timestamps=0
From: Eric Dumazet @ 2012-06-20 11:08 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Linux Netdev List
In-Reply-To: <1340186788.4604.857.camel@edumazet-glaptop>

On Wed, 2012-06-20 at 12:06 +0200, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 11:50 +0200, Stefan Priebe - Profihost AG wrote:
> > On 20.06.2012 11:47, Eric Dumazet wrote:
> > > On Wed, 2012-06-20 at 11:33 +0200, Stefan Priebe - Profihost AG wrote:
> > >
> > >> Sure. In that case I get 4Gbit/s in both variants. I also tried two
> > >> other machines, with the same result.
> > >>
> > >
> > > So 3.5 on the receiver is the problem, it seems?
> > Yes.
> > 
> > > And you checked all the stuff about IRQ affinities, I presume, since a
> > > lot of things might have changed between 2.6.32 and 3.5?
> > 
> > It is a single-core E5 Xeon - I've set the affinity like this:
> 
> Do you still see the retransmits in the "netstat -s" output?
> 
> It might be a firmware or PCI issue; I have the same cards but no problem here.
> 
> Can you check whether LRO is on?
> 
> ethtool -k eth2
> 


Ah, your ethtool -S output shows strange fdir_miss counts; maybe you
should ask the Intel guys for help...

^ permalink raw reply

* Re: [PATCH v2] ipv4: Early TCP socket demux.
From: David Miller @ 2012-06-20 11:09 UTC (permalink / raw)
  To: eric.dumazet; +Cc: shemminger, netdev
In-Reply-To: <1340190206.4604.862.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Jun 2012 13:03:26 +0200

> I am not sure we need the skb_dst(skb) test here, it should be NULL
> anyway in GRO layer ? (loopback device don't use GRO ;) )

Thanks, I was too lazy to check this and just assumed that a non-NULL
skb_dst(skb) was a very real possibility.

^ permalink raw reply

* Re: [PATCH -v1 3/3] usbnet: handle remote wakeup asap
From: Oliver Neukum @ 2012-06-20 11:24 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Ming Lei, David S. Miller, Greg Kroah-Hartman,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4FE1ADB4.4060302-Igf4POYTYCDQT0dZR+AlfA@public.gmane.org>

Am Mittwoch, 20. Juni 2012, 13:02:12 schrieb Sergei Shtylyov:
> > Without the patch, usbnet can't receive any packets from its peer
> > in runtime suspend state if runtime PM is enabled and
> > autosuspend_delay is set to zero.
> 
> > Signed-off-by: Ming Lei<ming.lei-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> > ---
> >   drivers/net/usb/usbnet.c |   42 ++++++++++++++++++++++++++----------------
> >   1 file changed, 26 insertions(+), 16 deletions(-)
> 
> > diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> > index 9bfa775..a89d6c5 100644
> > --- a/drivers/net/usb/usbnet.c
> > +++ b/drivers/net/usb/usbnet.c
> > @@ -1201,6 +1201,21 @@ deferred:
> >   }
> >   EXPORT_SYMBOL_GPL(usbnet_start_xmit);
> >
> > +static void rx_alloc_submit(struct usbnet *dev, gfp_t flags)
> > +{
> > +     struct urb      *urb;
> > +     int             i;
> > +
> > +     /* don't refill the queue all at once */
> > +     for (i = 0; i < 10 && dev->rxq.qlen < RX_QLEN(dev); i++) {
> > +             urb = usb_alloc_urb(0, flags);
> > +             if (urb != NULL) {
> > +                     if (rx_submit(dev, urb, flags) == -ENOLINK)
> 
>     The above 2 if statements can be collapsed into a single one.
> 

That would not improve readability.

	Regards
		Oliver
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: linux-next: build failure after merge of the net-next tree
From: Mark Brown @ 2012-06-20 11:35 UTC (permalink / raw)
  To: David Miller
  Cc: bhupesh.sharma, sfr, netdev, linux-next, linux-kernel,
	federico.vaga, giancarlo.asnaghi, wg, mkl
In-Reply-To: <20120619.214759.2265041580160751452.davem@davemloft.net>

On Tue, Jun 19, 2012 at 09:47:59PM -0700, David Miller wrote:
> From: Bhupesh SHARMA <bhupesh.sharma@st.com>

> > So, would adding a check in Kconfig for HAVE_CLK be a proper
> > solution?  But that will limit the compilation of this driver to
> > only platforms which are ARM based.

> > One may need to support this driver on x86 like platforms also..

> Then x86 will need to provide clock operations, or there needs to
> be dummy ones for such platforms.

Not directly germane but I've been sending patches for this to the x86
guys for a little while though they're doing a /dev/null impression.

> This isn't rocket science.

The other option is that the clock API stubs itself out when not enabled
which is going into mainline (not sure quite where it is at the minute).

These sort of per-arch APIs should be a legacy thing, hopefully we'll
manage to squash them at some point and we should certainly work to
avoid introducing new ones.

^ permalink raw reply

* [PATCH 00/17] Swap-over-NBD without deadlocking V12 (resend)
From: Mel Gorman @ 2012-06-20 11:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman

Sorry for the resend of this. I exported the branch missing Sebastian's
fix by accident. Not my finest day for sending out patch bombs :(

Changelog since V11
  o Rebase to 3.5-rc3
  o Correct order of page flag free				      (sebastian)

Changelog since V10
  o Rebase to 3.4-rc5
  o Coding style fixups						      (davem)
  o API consistency						      (davem)
  o Rename sk_allocation to sk_gfp_atomic and use only when necessary (davem)
  o Use static branches for sk_memalloc_socks			      (davem)
  o Use static branch checks in fast paths			      (davem)
  o Document concerns about PF_MEMALLOC leaking flags		      (davem)
  o Locking fix in slab						      (mel)

Changelog since V9
  o Rebase to 3.4-rc5
  o Clarify comment on why PF_MEMALLOC is cleared in softirq handling (akpm)
  o Only set page->pfmemalloc if ALLOC_NO_WATERMARKS was required     (rientjes)

Changelog since V8
  o Rebase to 3.4-rc2
  o Use page flag instead of slab fields to keep structures the same size
  o Properly detect allocations from softirq context that use PF_MEMALLOC
  o Ensure kswapd does not sleep while processes are throttled
  o Do not accidentally throttle !_GFP_FS processes indefinitely

Changelog since V7
  o Rebase to 3.3-rc2
  o Take greater care propagating page->pfmemalloc to skb
  o Propagate pfmemalloc from netdev_alloc_page to skb where possible
  o Release RCU lock properly on preempt kernel

Changelog since V6
  o Rebase to 3.1-rc8
  o Use wake_up instead of wake_up_interruptible()
  o Do not throttle kernel threads
  o Avoid a potential race between kswapd going to sleep and processes being
    throttled

Changelog since V5
  o Rebase to 3.1-rc5

Changelog since V4
  o Update comment clarifying what protocols can be used		(Michal)
  o Rebase to 3.0-rc3

Changelog since V3
  o Propagate pfmemalloc from packet fragment pages to skb		(Neil)
  o Rebase to 3.0-rc2

Changelog since V2
  o Document that __GFP_NOMEMALLOC overrides __GFP_MEMALLOC		(Neil)
  o Use wait_event_interruptible					(Neil)
  o Use !! when casting to bool to avoid any possibility of type
    truncation								(Neil)
  o Nicer logic when using skb_pfmemalloc_protocol			(Neil)

Changelog since V1
  o Rebase on top of mmotm
  o Use atomic_t for memalloc_socks		(David Miller)
  o Remove use of sk_memalloc_socks in vmscan	(Neil Brown)
  o Check throttle within prepare_to_wait	(Neil Brown)
  o Add statistics on throttling instead of printk

When a user or administrator requires swap for their application, they
create a swap partition or file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are blade servers used as part of a
cluster, where the form factor or maintenance costs do not allow the use
of disks, and thin clients.

The Linux Terminal Server Project recommends the use of the
Network Block Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There are also documentation and tutorials on how to set up swap over NBD
at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP
The nbd-client also documents the use of NBD as swap. Despite this, the
fact is that a machine using NBD for swap can deadlock within minutes if
swap is used intensively. This patch series addresses the problem.

The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where it receives
packets from, it cannot reliably work out in advance how much memory
it might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over NFS, which at least one distribution
is carrying within its kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer, which is needed for both NBD and NFS.

Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
	preserve access to pages allocated under low memory situations
	to callers that are freeing memory.

Patch 2 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
	reserves without setting PFMEMALLOC.

Patch 3 opens the possibility for softirqs to use PFMEMALLOC reserves
	for later use by network packet processing.

Patch 4 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

Patches 6-13 allow network processing to use PFMEMALLOC reserves when
	the socket has been marked as being used by the VM to clean pages. If
	packets are received and stored in pages that were allocated under
	low-memory situations and are unrelated to the VM, the packets
	are dropped.

	Patch 11 reintroduces __skb_alloc_page which the networking
	folk may object to but is needed in some cases to propagate
	pfmemalloc from a newly allocated page to an skb. If there is a
	strong objection, this patch can be dropped with the impact being
	that swap-over-network will be slower in some cases but it should
	not fail.

Patch 14 is a micro-optimisation to avoid a function call in the
	common case.

Patch 15 tags NBD sockets as being SOCK_MEMALLOC so they can use
	PFMEMALLOC if necessary.

Patch 16 notes that it is still possible for the PFMEMALLOC reserve
	to be depleted. To prevent this, direct reclaimers get throttled on
	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
	expected that kswapd and the direct reclaimers already running
	will clean enough pages for the low watermark to be reached and
	the throttled processes are woken up.

Patch 17 adds a statistic to track how often processes get throttled
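
The throttling condition described for Patch 16 can be sketched in plain C
as below. This is a user-space illustration only: struct zone_model,
reserve_ok() and the idea of summing per-zone counters are assumptions made
for the example, and the real check in the series may differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-zone counters the kernel tracks. */
struct zone_model {
	unsigned long min_wmark_pages;	/* size of the reserve in this zone */
	unsigned long free_pages;	/* pages currently free in this zone */
};

/*
 * Return true if direct reclaimers may proceed, false if they should be
 * throttled because at least half of the summed reserve is depleted.
 */
static bool reserve_ok(const struct zone_model *zones, size_t nr_zones)
{
	unsigned long reserve = 0, free_pages = 0;
	size_t i;

	for (i = 0; i < nr_zones; i++) {
		reserve += zones[i].min_wmark_pages;
		free_pages += zones[i].free_pages;
	}

	return free_pages > reserve / 2;
}
```

A caller that sees reserve_ok() return false would sleep on a waitqueue
until kswapd or an already running direct reclaimer has freed enough pages.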

Some basic performance testing was run using kernel builds, netperf
on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
and sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant
performance variances.

For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
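
A minimal sketch of what one such worker process might do, assuming a
Linux user-space program; this is not the actual test harness, and
touch_anon_mapping() is a name made up for the example. Each page is
written once before the read loop so that it is backed by real memory,
since reads of untouched anonymous memory are typically satisfied by a
shared zero page and would not generate swap traffic.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map 'bytes' of anonymous memory, populate it, then read it linearly
 * 'passes' times with one access per page. Returns the number of page
 * reads performed, or 0 if the mapping could not be created.
 */
static unsigned long touch_anon_mapping(size_t bytes, int passes)
{
	long page_size = sysconf(_SC_PAGESIZE);
	volatile unsigned char *buf;
	unsigned long reads = 0;
	void *map;
	size_t off;
	int p;

	if (page_size <= 0 || bytes == 0)
		return 0;

	map = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return 0;
	buf = map;

	/* Populate: write one byte per page so each page is really allocated. */
	for (off = 0; off < bytes; off += (size_t)page_size)
		buf[off] = 1;

	/* Read the mapping linearly, one access per page. */
	for (p = 0; p < passes; p++)
		for (off = 0; off < bytes; off += (size_t)page_size)
			reads += buf[off];

	munmap(map, bytes);
	return reads;
}
```

The test described above would start 8*NUM_CPU such workers (for example
with fork()) and size the mappings so that they total 4*PHYSICAL_MEMORY.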

Without the patches and using SLUB, the machine locks up within minutes;
with them applied, it runs to completion. With SLAB, the story is different
as an unpatched kernel runs to completion. However, the patched kernel
completed the test 45% faster.

MICRO
                                         3.5.0-rc2 3.5.0-rc2
                                           vanilla   swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)             197.80    173.07
User+Sys Time Running Test (seconds)        206.96    182.03
Total Elapsed Time (seconds)               3240.70   1762.09

 drivers/block/nbd.c                               |    6 +-
 drivers/net/ethernet/chelsio/cxgb4/sge.c          |    2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c        |    2 +-
 drivers/net/ethernet/intel/igb/igb_main.c         |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    3 +-
 drivers/net/usb/cdc-phonet.c                      |    2 +-
 drivers/usb/gadget/f_phonet.c                     |    2 +-
 include/linux/gfp.h                               |   13 +-
 include/linux/mm_types.h                          |    9 +
 include/linux/mmzone.h                            |    1 +
 include/linux/page-flags.h                        |   28 +++
 include/linux/sched.h                             |    7 +
 include/linux/skbuff.h                            |   80 +++++++-
 include/linux/vm_event_item.h                     |    1 +
 include/net/sock.h                                |   19 ++
 include/trace/events/gfpflags.h                   |    1 +
 kernel/softirq.c                                  |    9 +
 mm/page_alloc.c                                   |   46 ++++-
 mm/slab.c                                         |  216 +++++++++++++++++++--
 mm/slub.c                                         |   28 ++-
 mm/vmscan.c                                       |  131 ++++++++++++-
 mm/vmstat.c                                       |    1 +
 net/core/dev.c                                    |   53 ++++-
 net/core/filter.c                                 |    8 +
 net/core/skbuff.c                                 |  131 ++++++++++---
 net/core/sock.c                                   |   43 ++++
 net/ipv4/tcp_input.c                              |    8 +
 net/ipv4/tcp_output.c                             |    9 +-
 net/ipv6/tcp_ipv6.c                               |    8 +-
 30 files changed, 783 insertions(+), 88 deletions(-)

-- 
1.7.9.2

^ permalink raw reply

* [PATCH 01/17] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
From: Mel Gorman @ 2012-06-20 11:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

Allocations of pages below the min watermark run a risk of the
machine hanging due to a lack of memory. To prevent this, only
callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing
an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once
they are allocated to a slab though, nothing prevents other callers
consuming free objects within those slabs. This patch limits access
to slab pages that were allocated from the PFMEMALLOC reserves.

When this patch is applied, pages allocated from below the low watermark are
returned with page->pfmemalloc set and it is up to the caller to determine
how the page should be protected. SLAB restricts access to any page with
page->pfmemalloc set to callers which are known to be able to access the
PFMEMALLOC reserve. If one is not available, an attempt is made to allocate
a new page rather than use a reserve. SLUB is a bit more relaxed in that
it only records if the current per-CPU page was allocated from PFMEMALLOC
reserve and uses another partial slab if the caller does not have the
necessary GFP or process flags. This was found to be sufficient in tests
to avoid hangs due to SLUB generally maintaining smaller lists than SLAB.

In low-memory conditions it does mean that !PFMEMALLOC allocators
can fail a slab allocation even though free objects are available
because they are being preserved for callers that are freeing pages.
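
The SLAB side of this depends on marking entries in the per-CPU array with
a low pointer bit, as the diff below does with SLAB_OBJ_PFMEMALLOC. The
following user-space sketch is loosely modelled on that idea and is not the
patch's ac_get_obj(); OBJ_RESERVE_TAG, obj_tag(), obj_untag() and
stack_get() are hypothetical names, and it assumes objects are at least
2-byte aligned so that bit 0 of a valid pointer is always clear.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag bit, playing the role of SLAB_OBJ_PFMEMALLOC. */
#define OBJ_RESERVE_TAG	((uintptr_t)1)

static inline bool obj_is_tagged(void *objp)
{
	return ((uintptr_t)objp & OBJ_RESERVE_TAG) != 0;
}

static inline void *obj_tag(void *objp)
{
	return (void *)((uintptr_t)objp | OBJ_RESERVE_TAG);
}

static inline void *obj_untag(void *objp)
{
	return (void *)((uintptr_t)objp & ~OBJ_RESERVE_TAG);
}

/*
 * Pop an entry from a stack of object pointers. Tagged entries are only
 * handed out when 'may_use_reserve' is true; otherwise the first untagged
 * entry found is swapped with the top slot and returned. Returns NULL,
 * leaving the stack untouched, if nothing suitable is available.
 */
static void *stack_get(void **entry, size_t *avail, bool may_use_reserve)
{
	size_t top, i;
	void *objp;

	if (*avail == 0)
		return NULL;

	top = *avail - 1;
	objp = entry[top];

	if (obj_is_tagged(objp) && !may_use_reserve) {
		for (i = 0; i < top; i++) {
			if (!obj_is_tagged(entry[i])) {
				objp = entry[i];
				entry[i] = entry[top];
				break;
			}
		}
		if (i == top)
			return NULL;	/* only reserve objects remain */
	}

	*avail = top;
	return obj_untag(objp);
}
```

Stored entries must never be dereferenced directly; they are only valid
pointers after the tag has been cleared.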

[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h   |    9 +++
 include/linux/page-flags.h |   28 +++++++
 mm/internal.h              |    3 +
 mm/page_alloc.c            |   27 +++++--
 mm/slab.c                  |  192 +++++++++++++++++++++++++++++++++++++++-----
 mm/slub.c                  |   27 ++++++-
 6 files changed, 261 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index dad95bd..4bcc5b9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -54,6 +54,15 @@ struct page {
 		union {
 			pgoff_t index;		/* Our offset within mapping. */
 			void *freelist;		/* slub first free object */
+			bool pfmemalloc;	/* If set by the page allocator,
+						 * ALLOC_PFMEMALLOC was set
+						 * and the low watermark was not
+						 * met implying that the system
+						 * is under some pressure. The
+						 * caller should try ensure
+						 * this page is only used to
+						 * free other pages.
+						 */
 		};
 
 		union {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c88d2a9..e66eb0d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -453,6 +453,34 @@ static inline int PageTransTail(struct page *page)
 }
 #endif
 
+/*
+ * If network-based swap is enabled, sl*b must keep track of whether pages
+ * were allocated from pfmemalloc reserves.
+ */
+static inline int PageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	return PageActive(page);
+}
+
+static inline void SetPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	SetPageActive(page);
+}
+
+static inline void __ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	__ClearPageActive(page);
+}
+
+static inline void ClearPageSlabPfmemalloc(struct page *page)
+{
+	VM_BUG_ON(!PageSlab(page));
+	ClearPageActive(page);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1 << PG_mlocked)
 #else
diff --git a/mm/internal.h b/mm/internal.h
index 2ba87fb..2d0dd52 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -273,6 +273,9 @@ static inline struct page *mem_map_next(struct page *iter,
 #define __paginginit __init
 #endif
 
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
 /* Memory initialisation debug and verification */
 enum mminit_level {
 	MMINIT_WARNING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..a259384 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1507,6 +1507,7 @@ failed:
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#define ALLOC_PFMEMALLOC	0x80 /* Caller has PF_MEMALLOC set */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -2264,16 +2265,22 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((current->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+	if ((current->flags & PF_MEMALLOC) ||
+			unlikely(test_thread_flag(TIF_MEMDIE))) {
+		alloc_flags |= ALLOC_PFMEMALLOC;
+
+		if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
 	return alloc_flags;
 }
 
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+{
+	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2461,10 +2468,18 @@ nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 	return page;
 got_pg:
+	/*
+	 * page->pfmemalloc is set when the caller had PFMEMALLOC set or is
+	 * been OOM killed. The expectation is that the caller is taking
+	 * steps that will free more memory. The caller should avoid the
+	 * page being used for !PFMEMALLOC purposes.
+	 */
+	page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+
 	if (kmemcheck_enabled)
 		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
-	return page;
 
+	return page;
 }
 
 /*
@@ -2515,6 +2530,8 @@ retry_cpuset:
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
+	else
+		page->pfmemalloc = false;
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
diff --git a/mm/slab.c b/mm/slab.c
index e901a36..901e97c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -123,6 +123,8 @@
 
 #include <trace/events/kmem.h>
 
+#include	"internal.h"
+
 /*
  * DEBUG	- 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *		  0 for faster, smaller code (especially in the critical paths).
@@ -151,6 +153,12 @@
 #define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
 #endif
 
+/*
+ * true if a page was allocated from pfmemalloc reserves for network-based
+ * swap
+ */
+static bool pfmemalloc_active __read_mostly;
+
 /* Legal flag mask for kmem_cache_create(). */
 #if DEBUG
 # define CREATE_MASK	(SLAB_RED_ZONE | \
@@ -256,9 +264,30 @@ struct array_cache {
 			 * Must have this definition in here for the proper
 			 * alignment of array_cache. Also simplifies accessing
 			 * the entries.
+			 *
+			 * Entries should not be directly dereferenced as
+			 * entries belonging to slabs marked pfmemalloc will
+			 * have the lower bits set SLAB_OBJ_PFMEMALLOC
 			 */
 };
 
+#define SLAB_OBJ_PFMEMALLOC	1
+static inline bool is_obj_pfmemalloc(void *objp)
+{
+	return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC;
+}
+
+static inline void set_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
+	return;
+}
+
+static inline void clear_obj_pfmemalloc(void **objp)
+{
+	*objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC);
+}
+
 /*
  * bootstrap: The caches do not work without cpuarrays anymore, but the
  * cpuarrays are allocated from the generic caches...
@@ -951,6 +980,102 @@ static struct array_cache *alloc_arraycache(int node, int entries,
 	return nc;
 }
 
+static inline bool is_slab_pfmemalloc(struct slab *slabp)
+{
+	struct page *page = virt_to_page(slabp->s_mem);
+
+	return PageSlabPfmemalloc(page);
+}
+
+/* Clears ac->pfmemalloc if no slabs have pfmalloc set */
+static void check_ac_pfmemalloc(struct kmem_cache *cachep,
+						struct array_cache *ac)
+{
+	struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()];
+	struct slab *slabp;
+	unsigned long flags;
+
+	if (!pfmemalloc_active)
+		return;
+
+	spin_lock_irqsave(&l3->list_lock, flags);
+	list_for_each_entry(slabp, &l3->slabs_full, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	list_for_each_entry(slabp, &l3->slabs_partial, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	list_for_each_entry(slabp, &l3->slabs_free, list)
+		if (is_slab_pfmemalloc(slabp))
+			goto out;
+
+	pfmemalloc_active = false;
+out:
+	spin_unlock_irqrestore(&l3->list_lock, flags);
+}
+
+static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+						gfp_t flags, bool force_refill)
+{
+	int i;
+	void *objp = ac->entry[--ac->avail];
+
+	/* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
+	if (unlikely(is_obj_pfmemalloc(objp))) {
+		struct kmem_list3 *l3;
+
+		if (gfp_pfmemalloc_allowed(flags)) {
+			clear_obj_pfmemalloc(&objp);
+			return objp;
+		}
+
+		/* The caller cannot use PFMEMALLOC objects, find another one */
+		for (i = 1; i < ac->avail; i++) {
+			/* If a !PFMEMALLOC object is found, swap them */
+			if (!is_obj_pfmemalloc(ac->entry[i])) {
+				objp = ac->entry[i];
+				ac->entry[i] = ac->entry[ac->avail];
+				ac->entry[ac->avail] = objp;
+				return objp;
+			}
+		}
+
+		/*
+		 * If there are empty slabs on the slabs_free list and we are
+		 * being forced to refill the cache, mark this one !pfmemalloc.
+		 */
+		l3 = cachep->nodelists[numa_mem_id()];
+		if (!list_empty(&l3->slabs_free) && force_refill) {
+			struct slab *slabp = virt_to_slab(objp);
+			ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem));
+			clear_obj_pfmemalloc(&objp);
+			check_ac_pfmemalloc(cachep, ac);
+			return objp;
+		}
+
+		/* No !PFMEMALLOC objects available */
+		ac->avail++;
+		objp = NULL;
+	}
+
+	return objp;
+}
+
+static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+								void *objp)
+{
+	if (unlikely(pfmemalloc_active)) {
+		/* Some pfmemalloc slabs exist, check if this is one */
+		struct page *page = virt_to_page(objp);
+		if (PageSlabPfmemalloc(page))
+			set_obj_pfmemalloc(&objp);
+	}
+
+	ac->entry[ac->avail++] = objp;
+}
+
 /*
  * Transfer objects in one arraycache to another.
  * Locking must be handled by the caller.
@@ -1127,7 +1252,7 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 			STATS_INC_ACOVERFLOW(cachep);
 			__drain_alien_cache(cachep, alien, nodeid);
 		}
-		alien->entry[alien->avail++] = objp;
+		ac_put_obj(cachep, alien, objp);
 		spin_unlock(&alien->lock);
 	} else {
 		spin_lock(&(cachep->nodelists[nodeid])->list_lock);
@@ -1809,6 +1934,10 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 		return NULL;
 	}
 
+	/* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+	if (unlikely(page->pfmemalloc))
+		pfmemalloc_active = true;
+
 	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
@@ -1816,9 +1945,13 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	else
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_UNRECLAIMABLE, nr_pages);
-	for (i = 0; i < nr_pages; i++)
+	for (i = 0; i < nr_pages; i++) {
 		__SetPageSlab(page + i);
 
+		if (page->pfmemalloc)
+			SetPageSlabPfmemalloc(page + i);
+	}
+
 	if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
 		kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);
 
@@ -1850,6 +1983,7 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
 				NR_SLAB_UNRECLAIMABLE, nr_freed);
 	while (i--) {
 		BUG_ON(!PageSlab(page));
+		__ClearPageSlabPfmemalloc(page);
 		__ClearPageSlab(page);
 		page++;
 	}
@@ -3120,16 +3254,19 @@ bad:
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
+static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
+							bool force_refill)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
 	struct array_cache *ac;
 	int node;
 
-retry:
 	check_irq_off();
 	node = numa_mem_id();
+	if (unlikely(force_refill))
+		goto force_grow;
+retry:
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -3179,8 +3316,8 @@ retry:
 			STATS_INC_ACTIVE(cachep);
 			STATS_SET_HIGH(cachep);
 
-			ac->entry[ac->avail++] = slab_get_obj(cachep, slabp,
-							    node);
+			ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
+									node));
 		}
 		check_slabp(cachep, slabp);
 
@@ -3199,18 +3336,22 @@ alloc_done:
 
 	if (unlikely(!ac->avail)) {
 		int x;
+force_grow:
 		x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
-		if (!x && ac->avail == 0)	/* no objects in sight? abort */
+
+		/* no objects in sight? abort */
+		if (!x && (ac->avail == 0 || force_refill))
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
 			goto retry;
 	}
 	ac->touched = 1;
-	return ac->entry[--ac->avail];
+
+	return ac_get_obj(cachep, ac, flags, force_refill);
 }
 
 static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
@@ -3292,23 +3433,35 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
 	void *objp;
 	struct array_cache *ac;
+	bool force_refill = false;
 
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
 	if (likely(ac->avail)) {
-		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
-		objp = ac->entry[--ac->avail];
-	} else {
-		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = ac_get_obj(cachep, ac, flags, false);
+
 		/*
-		 * the 'ac' may be updated by cache_alloc_refill(),
-		 * and kmemleak_erase() requires its correct value.
+		 * Allow for the possibility all avail objects are not allowed
+		 * by the current flags
 		 */
-		ac = cpu_cache_get(cachep);
+		if (objp) {
+			STATS_INC_ALLOCHIT(cachep);
+			goto out;
+		}
+		force_refill = true;
 	}
+
+	STATS_INC_ALLOCMISS(cachep);
+	objp = cache_alloc_refill(cachep, flags, force_refill);
+	/*
+	 * the 'ac' may be updated by cache_alloc_refill(),
+	 * and kmemleak_erase() requires its correct value.
+	 */
+	ac = cpu_cache_get(cachep);
+
+out:
 	/*
 	 * To avoid a false negative, if an object that is in one of the
 	 * per-CPU caches is leaked, we need to make sure kmemleak doesn't
@@ -3630,9 +3783,12 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
 	struct kmem_list3 *l3;
 
 	for (i = 0; i < nr_objects; i++) {
-		void *objp = objpp[i];
+		void *objp;
 		struct slab *slabp;
 
+		clear_obj_pfmemalloc(&objpp[i]);
+		objp = objpp[i];
+
 		slabp = virt_to_slab(objp);
 		l3 = cachep->nodelists[node];
 		list_del(&slabp->list);
@@ -3750,7 +3906,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
 		cache_flusharray(cachep, ac);
 	}
 
-	ac->entry[ac->avail++] = objp;
+	ac_put_obj(cachep, ac, objp);
 }
 
 /**
diff --git a/mm/slub.c b/mm/slub.c
index 8c691fa..3cf24b4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -33,6 +33,8 @@
 
 #include <trace/events/kmem.h>
 
+#include "internal.h"
+
 /*
  * Lock order:
  *   1. slub_lock (Global Semaphore)
@@ -1370,6 +1372,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	__SetPageSlab(page);
+	if (page->pfmemalloc)
+		SetPageSlabPfmemalloc(page);
 
 	start = page_address(page);
 
@@ -1413,6 +1417,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
 		-pages);
 
+	__ClearPageSlabPfmemalloc(page);
 	__ClearPageSlab(page);
 	reset_page_mapcount(page);
 	if (current->reclaim_state)
@@ -2156,6 +2161,14 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 	return object;
 }
 
+static inline bool pfmemalloc_match(struct kmem_cache_cpu *c, gfp_t gfpflags)
+{
+	if (unlikely(PageSlabPfmemalloc(c->page)))
+		return gfp_pfmemalloc_allowed(gfpflags);
+
+	return true;
+}
+
 /*
  * Check the page->freelist of a page and either transfer the freelist to the per cpu freelist
  * or deactivate the page.
@@ -2228,6 +2241,16 @@ redo:
 		goto new_slab;
 	}
 
+	/*
+	 * By rights, we should be searching for a slab page that was
+	 * PFMEMALLOC but right now, we are losing the pfmemalloc
+	 * information when the page leaves the per-cpu allocator
+	 */
+	if (unlikely(!pfmemalloc_match(c, gfpflags))) {
+		deactivate_slab(s, c);
+		goto new_slab;
+	}
+
 	/* must check again c->freelist in case of cpu migration or IRQ */
 	object = c->freelist;
 	if (object)
@@ -2332,8 +2355,8 @@ redo:
 	barrier();
 
 	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node)))
-
+	if (unlikely(!object || !node_match(c, node) ||
+					!pfmemalloc_match(c, gfpflags)))
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-- 
1.7.9.2

^ permalink raw reply related

* [PATCH 02/17] mm: slub: Optimise the SLUB fast path to avoid pfmemalloc checks
From: Mel Gorman @ 2012-06-20 11:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

From: Christoph Lameter <cl@linux.com>

This patch removes the check for pfmemalloc from the alloc hotpath and
puts the logic after the election of a new per cpu slab. For a pfmemalloc
page we do not use the fast path but force the use of the slow path which
is also used for the debug case.

This has the side-effect of weakening pfmemalloc processing in the
following way:

1. A process that is allocating for network swap calls __slab_alloc.
   pfmemalloc_match is true so the freelist is loaded and c->freelist is
   now pointing to a pfmemalloc page.

2. A process that is attempting normal allocations calls slab_alloc,
   finds the pfmemalloc page on the freelist and uses it because it did
   not check pfmemalloc_match()

The patch allows non-pfmemalloc allocations to use pfmemalloc pages with
the kmalloc slabs being the most vulnerable caches on the grounds they
are most likely to have a mix of pfmemalloc and !pfmemalloc requests. A
later patch will still protect the system as processes will get throttled
if the pfmemalloc reserves get depleted but performance will not degrade
as smoothly.
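
The two steps above can be modelled in user space. The following is a
minimal sketch, not the mm/slub.c code: struct model_page, struct
model_cpu_slab, MODEL_OBJS_PER_SLAB and model_slab_alloc() are
hypothetical names, and the freelist is reduced to a counter. It assumes
only what the changelog states: the fast path no longer checks
pfmemalloc_match(), and the slow path checks it once a new per-cpu slab
has been chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model value: objects carved out of one slab page. */
#define MODEL_OBJS_PER_SLAB 4

/* Hypothetical stand-in for struct page; only the flag of interest. */
struct model_page {
	bool pfmemalloc;	/* page was taken from the emergency reserves */
};

/* Hypothetical stand-in for the per-cpu slab state. */
struct model_cpu_slab {
	const struct model_page *page;	/* current per-cpu slab page */
	int free_objects;		/* objects left on the freelist */
};

/* A reserve page only matches callers entitled to the reserves. */
static bool model_pfmemalloc_match(const struct model_page *page,
				   bool entitled)
{
	return !page->pfmemalloc || entitled;
}

/*
 * Allocate one object. Returns true if it was served by the fast path.
 * new_page is the slab the slow path would select if it is entered.
 */
static bool model_slab_alloc(struct model_cpu_slab *c,
			     const struct model_page *new_page,
			     bool entitled)
{
	/* Fast path: only looks at the freelist, no pfmemalloc check. */
	if (c->free_objects > 0) {
		c->free_objects--;
		return true;
	}

	/* Slow path: the check happens after the new slab is chosen. */
	assert(new_page != NULL);
	if (model_pfmemalloc_match(new_page, entitled)) {
		/* Load the freelist; one object goes to this caller. */
		c->page = new_page;
		c->free_objects = MODEL_OBJS_PER_SLAB - 1;
	} else {
		/* Hand out one object and drop the slab again. */
		c->page = NULL;
		c->free_objects = 0;
	}
	return false;
}
```

In this model, once an entitled caller has loaded a reserve page, a
caller that is not entitled is served from the fast path, which is the
weakening described above. If the first caller is not entitled, the
freelist stays empty and later allocations re-enter the slow path.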

[mgorman@suse.de: Expanded changelog]
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/slub.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 3cf24b4..dd13305 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2301,11 +2301,11 @@ new_slab:
 		}
 	}

-	if (likely(!kmem_cache_debug(s)))
+	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(c, gfpflags)))
 		goto load_freelist;

 	/* Only entered in the debug case */
-	if (!alloc_debug_processing(s, c->page, object, addr))
+	if (kmem_cache_debug(s) && !alloc_debug_processing(s, c->page, object, addr))
 		goto new_slab;	/* Slab failed checks. Next slab needed */

 	c->freelist = get_freepointer(s, object);
@@ -2355,8 +2355,7 @@ redo:
 	barrier();

 	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node) ||
-					!pfmemalloc_match(c, gfpflags)))
+	if (unlikely(!object || !node_match(c, node)))
 		object = __slab_alloc(s, gfpflags, node, addr, c);

 	else {
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 03/17] mm: Introduce __GFP_MEMALLOC to allow access to emergency reserves
From: Mel Gorman @ 2012-06-20 11:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson,
	Sebastian Andrzej Siewior, Mel Gorman
In-Reply-To: <1340192652-31658-1-git-send-email-mgorman@suse.de>

__GFP_MEMALLOC will allow the allocation to disregard the watermarks,
much like PF_MEMALLOC. It allows one to pass along the memalloc state
in object-related allocation flags, such as sk->sk_allocation, as opposed
to task-related flags. This removes the need for ALLOC_PFMEMALLOC,
as callers using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARKS flag,
which is now enough to identify allocations related to page reclaim.
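
The flag precedence described here can be sketched as a small user-space
model. This is an illustration under stated assumptions, not the
mm/page_alloc.c code: the MODEL_* names and both functions are
hypothetical, MODEL_ALLOC_NO_WATERMARKS is an arbitrary model value, and
the task-state branch is inferred from the changelog's comparison with
PF_MEMALLOC rather than transcribed from the hunk. The two GFP bit
values are the ones the patch defines.

```c
#include <stdbool.h>

/* Bit values as defined by this patch in include/linux/gfp.h. */
#define MODEL_GFP_MEMALLOC	0x2000u
#define MODEL_GFP_NOMEMALLOC	0x10000u

/* Arbitrary value chosen for this model only. */
#define MODEL_ALLOC_NO_WATERMARKS	0x04u

/*
 * Derive the model's allocation flags.
 * task_memalloc: the task carries PF_MEMALLOC-like state (assumption).
 * in_irq: the allocation is made from interrupt context.
 */
static unsigned int model_gfp_to_alloc_flags(unsigned int gfp_mask,
					     bool task_memalloc,
					     bool in_irq)
{
	unsigned int alloc_flags = 0;

	/* __GFP_NOMEMALLOC takes precedence over everything below. */
	if (!(gfp_mask & MODEL_GFP_NOMEMALLOC)) {
		if (gfp_mask & MODEL_GFP_MEMALLOC)
			/* Per-object state: valid in any context. */
			alloc_flags |= MODEL_ALLOC_NO_WATERMARKS;
		else if (task_memalloc && !in_irq)
			/* Per-task state: only in task context. */
			alloc_flags |= MODEL_ALLOC_NO_WATERMARKS;
	}

	return alloc_flags;
}

/* True if the caller may be given pages from the reserves. */
static bool model_gfp_pfmemalloc_allowed(unsigned int gfp_mask,
					 bool task_memalloc, bool in_irq)
{
	return !!(model_gfp_to_alloc_flags(gfp_mask, task_memalloc, in_irq)
		  & MODEL_ALLOC_NO_WATERMARKS);
}
```

The point of the per-object flag shows up in the interrupt case: a mask
carrying the MEMALLOC bit still reaches the reserves when in_irq is true,
whereas task state alone does not, and setting the NOMEMALLOC bit
disables both.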

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h             |   10 ++++++++--
 include/linux/mm_types.h        |    2 +-
 include/trace/events/gfpflags.h |    1 +
 mm/page_alloc.c                 |   22 ++++++++++------------
 mm/slab.c                       |    2 +-
 5 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1e49be4..cbd7400 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -23,6 +23,7 @@ struct vm_area_struct;
 #define ___GFP_REPEAT		0x400u
 #define ___GFP_NOFAIL		0x800u
 #define ___GFP_NORETRY		0x1000u
+#define ___GFP_MEMALLOC		0x2000u
 #define ___GFP_COMP		0x4000u
 #define ___GFP_ZERO		0x8000u
 #define ___GFP_NOMEMALLOC	0x10000u
@@ -76,9 +77,14 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)	/* See above */
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)	/* See above */
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY) /* See above */
+#define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)	/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)	/* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves */
+#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves.
+							 * This takes precedence over the
+							 * __GFP_MEMALLOC flag if both are
+							 * set
+							 */
 #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
@@ -129,7 +135,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_NOMEMALLOC)
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control slab gfp mask during early boot */
 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4bcc5b9..87431b2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -55,7 +55,7 @@ struct page {
 			pgoff_t index;		/* Our offset within mapping. */
 			void *freelist;		/* slub first free object */
 			bool pfmemalloc;	/* If set by the page allocator,
-						 * ALLOC_PFMEMALLOC was set
+						 * ALLOC_NO_WATERMARKS was set
 						 * and the low watermark was not
 						 * met implying that the system
 						 * is under some pressure. The
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 9fe3a366..d6fd8e5 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -30,6 +30,7 @@
 	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
 	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
 	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
+	{(unsigned long)__GFP_MEMALLOC,		"GFP_MEMALLOC"},	\
 	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
 	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
 	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a259384..ebffeaa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1507,7 +1507,6 @@ failed:
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-#define ALLOC_PFMEMALLOC	0x80 /* Caller has PF_MEMALLOC set */
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
@@ -2265,11 +2264,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if ((current->flags & PF_MEMALLOC) ||
-			unlikely(test_thread_flag(TIF_MEMDIE))) {
-		alloc_flags |= ALLOC_PFMEMALLOC;
-
-		if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
@@ -2278,7 +2276,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
-	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_PFMEMALLOC);
+	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
 static inline struct page *
@@ -2469,12 +2467,12 @@ nopage:
 	return page;
 got_pg:
 	/*
-	 * page->pfmemalloc is set when the caller had PFMEMALLOC set or is
-	 * been OOM killed. The expectation is that the caller is taking
-	 * steps that will free more memory. The caller should avoid the
-	 * page being used for !PFMEMALLOC purposes.
+	 * page->pfmemalloc is set when the caller had PFMEMALLOC set, has
+	 * been OOM killed or specified __GFP_MEMALLOC. The expectation is
+	 * that the caller is taking steps that will free more memory. The
+	 * caller should avoid the page being used for !PFMEMALLOC purposes.
 	 */
-	page->pfmemalloc = !!(alloc_flags & ALLOC_PFMEMALLOC);
+	page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
 
 	if (kmemcheck_enabled)
 		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
diff --git a/mm/slab.c b/mm/slab.c
index 901e97c..5268368 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1934,7 +1934,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 		return NULL;
 	}
 
-	/* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
+	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
 	if (unlikely(page->pfmemalloc))
 		pfmemalloc_active = true;
 
-- 
1.7.9.2

^ permalink raw reply related
