swapper: page allocation failure. order:3, mode:0x20

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* swapper: page allocation failure. order:3, mode:0x20
@ 2004-01-01  9:35 Joonas Kortesalmi
  2004-01-01 10:15 ` Anton Blanchard
  0 siblings, 1 reply; 11+ messages in thread
From: Joonas Kortesalmi @ 2004-01-01  9:35 UTC (permalink / raw)
  To: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

After running 2.6.0 on a server for a few days, I met an interesting and
annoying problem. I was playing with NFS over gigabit ethernet (e1000) and
it was a bit slow. I tried to find out why by running top and I saw syslog-ng
eating almost 10% of the 1,3GHz Duron. Looked at the log and there was a huge
flood of these messages:

swapper: page allocation failure. order:3, mode:0x20
irssi: page allocation failure. order:3, mode:0x20
swapper: page allocation failure. order:3, mode:0x20
vim: page allocation failure. order:3, mode:0x20
swapper: page allocation failure. order:3, mode:0x20

most of the messages starting with swapper.

After a quick google search it looks like this kind of a problem existed in
a bit older 2.5/2.6 serie kernels. It looks like it's still with us.

Linux hamsu.org 2.6.0 #1 Mon Dec 29 21:07:28 EET 2003 i686 AMD Duron(tm) processor AuthenticAMD GNU/Linux

Shortly about the hardware of the machine in case it would matter:
AMD Duron 1,3GHz CPU, 768MB of RAM, Intel PRO/1000MT Desktop and Realtek 8139C
NICs.

Pre-empt was disabled in the kernel config and the kernel didn't have any
extra patches. A pure vanilla kernel.

I'm not a subscriber so please CC me a copy of messages related to the subject.
I'm not sure if I can help much looking at the insides of the kernel, but I will
try my best answering any questions about this.
- -- 
Joonas  OH8GDV (ham call sign)   OpenPGP: 0x5F72BE43
( 0B37 05E0 0FB4 EB2E 161C  DCE6 7F7B C645 5F72 BE43 )
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQE/8+lzf3vGRV9yvkMRAgkIAJ4tjyWtfeDmXCaM0s6w9+ofYLKktACgt+I5
MhYxd+KeJS8QFOjPbtOmykk=
=75d0
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: swapper: page allocation failure. order:3, mode:0x20
  2004-01-01  9:35 swapper: page allocation failure. order:3, mode:0x20 Joonas Kortesalmi
@ 2004-01-01 10:15 ` Anton Blanchard
  2004-01-01 10:25   ` Andrew Morton
  2004-01-01 10:27   ` Anton Blanchard
  0 siblings, 2 replies; 11+ messages in thread
From: Anton Blanchard @ 2004-01-01 10:15 UTC (permalink / raw)
  To: Joonas Kortesalmi; +Cc: linux-kernel, akpm

Hi,

> After running 2.6.0 on a server for a few days, I met an interesting and
> annoying problem. I was playing with NFS over gigabit ethernet (e1000) and
> it was a bit slow. I tried to find out why by running top and I saw syslog-ng
> eating almost 10% of the 1,3GHz Duron. Looked at the log and there was a huge
> flood of these messages:
> 
> swapper: page allocation failure. order:3, mode:0x20
> irssi: page allocation failure. order:3, mode:0x20
> swapper: page allocation failure. order:3, mode:0x20
> vim: page allocation failure. order:3, mode:0x20
> swapper: page allocation failure. order:3, mode:0x20

Its sounds like you are using either a large MTU (9k?) or TSO. TSO
causes the networking stack to think it has a massive MTU and the e1000
card busts it up into proper MTU sized packets. The problem is that
it places much more stress on the allocator by asking for these large
chunks of memory in interrupt context.

Now e1000 uses TSO (and can regularly ask for 32kB+ kmallocs in
interrupt context) perhaps we should look moving the rx buffer refill code
into a context that can sleep. Then again its not like we can tolerate
much latency in this code path, your rx ring will run out quite quickly :)

BTW We have found increasing /proc/sys/vm/min_free_kbytes can help the
situation a bit. Bumping the slab limits for the larger kmallocs (via
echo X Y Z > /proc/slab) might be useful too.

We should probably rate limit that printk. Andrew: I was thinking of
stealing net_ratelimit and calling it core_ratelimit or whatever. Then
wrap these non critical things with it. Overkill?

Anton

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: swapper: page allocation failure. order:3, mode:0x20
  2004-01-01 10:15 ` Anton Blanchard
@ 2004-01-01 10:25   ` Andrew Morton
  2004-01-01 13:01     ` Anton Blanchard
  2004-01-01 10:27   ` Anton Blanchard
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2004-01-01 10:25 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: joneskoo, linux-kernel

Anton Blanchard <anton@samba.org> wrote:
>
> > swapper: page allocation failure. order:3, mode:0x20
> 
> ...
>  We should probably rate limit that printk. Andrew: I was thinking of
>  stealing net_ratelimit and calling it core_ratelimit or whatever. Then
>  wrap these non critical things with it. Overkill?

Actually my intent was just to remove it (and __GFP_NOWARN) - it's just
development-time debug.  But it is handy on occasion.

So sure, ratelimit it, make it KERN_INFO and maybe add a dump_stack()?

(printk_ratelimit() may be a suitable name)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: swapper: page allocation failure. order:3, mode:0x20
  2004-01-01 10:25   ` Andrew Morton
@ 2004-01-01 13:01     ` Anton Blanchard
  2004-01-01 20:45       ` Andrew Morton
  2004-01-01 22:08       ` Christophe Saout
  0 siblings, 2 replies; 11+ messages in thread
From: Anton Blanchard @ 2004-01-01 13:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: joneskoo, linux-kernel


> So sure, ratelimit it, make it KERN_INFO and maybe add a dump_stack()?

Sounds good, I always end up adding a dump_stack there when debugging
these problems anyway.

> (printk_ratelimit() may be a suitable name)

How does this look?

Anton

===== Documentation/sysctl/kernel.txt 1.7 vs edited =====
--- 1.7/Documentation/sysctl/kernel.txt	Wed Apr  9 18:01:38 2003
+++ edited/Documentation/sysctl/kernel.txt	Thu Jan  1 23:52:10 2004
@@ -251,6 +251,25 @@
 
 ==============================================================
 
+printk_ratelimit:
+
+Some warning messages are rate limited. printk_ratelimit specifies
+the minimum length of time between these messages, by default we
+allow one every 5 seconds.
+
+A value of 0 will disable rate limiting.
+
+==============================================================
+
+printk_ratelimit_burst:
+
+While long term we enforce one message per printk_ratelimit 
+seconds, we do allow a burst of messages to pass through.
+printk_ratelimit_burst specifies the number of messages we can
+send before ratelimiting kicks in.
+
+==============================================================
+
 reboot-cmd: (Sparc only)
 
 ??? This seems to be a way to give an argument to the Sparc
===== include/linux/kernel.h 1.43 vs edited =====
--- 1.43/include/linux/kernel.h	Tue Dec 30 08:37:26 2003
+++ edited/include/linux/kernel.h	Thu Jan  1 23:52:10 2004
@@ -89,6 +89,8 @@
 
 unsigned long int_sqrt(unsigned long);
 
+extern int printk_ratelimit(void);
+
 static inline void console_silent(void)
 {
 	console_loglevel = 0;
===== include/linux/sysctl.h 1.54 vs edited =====
--- 1.54/include/linux/sysctl.h	Thu Dec 25 14:32:23 2003
+++ edited/include/linux/sysctl.h	Thu Jan  1 23:53:11 2004
@@ -127,6 +127,8 @@
 	KERN_PANIC_ON_OOPS=57,  /* int: whether we will panic on an oops */
 	KERN_HPPA_PWRSW=58,	/* int: hppa soft-power enable */
 	KERN_HPPA_UNALIGNED=59,	/* int: hppa unaligned-trap enable */
+	KERN_PRINTK_RATELIMIT=60, /* int: tune printk ratelimiting */
+	KERN_PRINTK_RATELIMIT_BURST=61,	/* int: tune printk ratelimiting */
 };
 
 
===== kernel/printk.c 1.29 vs edited =====
--- 1.29/kernel/printk.c	Tue Dec 30 12:01:55 2003
+++ edited/kernel/printk.c	Thu Jan  1 23:53:51 2004
@@ -762,3 +762,45 @@
 		tty->driver->write(tty, 0, msg, strlen(msg));
 	return;
 }
+
+/* minimum time in jiffies between messages */
+int printk_ratelimit_jiffies = 5*HZ;
+
+/* number of messages we send before ratelimiting */
+int printk_ratelimit_burst = 10;
+
+/* 
+ * printk rate limiting, lifted from the networking subsystem.
+ *
+ * This enforces a rate limit: not more than one kernel message
+ * every printk_ratelimit_jiffies to make a denial-of-service 
+ * attack impossible.
+ */ 
+int printk_ratelimit(void)
+{
+	static spinlock_t ratelimit_lock = SPIN_LOCK_UNLOCKED;
+	static unsigned long toks = 10*5*HZ;
+	static unsigned long last_msg; 
+	static int missed;
+	unsigned long flags;
+	unsigned long now = jiffies;
+
+	spin_lock_irqsave(&ratelimit_lock, flags);
+	toks += now - last_msg;
+	last_msg = now;
+	if (toks > (printk_ratelimit_burst * printk_ratelimit_jiffies))
+		toks = printk_ratelimit_burst * printk_ratelimit_jiffies;
+	if (toks >= printk_ratelimit_jiffies) {
+		int lost = missed;
+		missed = 0;
+		toks -= printk_ratelimit_jiffies;
+		spin_unlock_irqrestore(&ratelimit_lock, flags);
+		if (lost)
+			printk(KERN_WARNING "printk: %d messages suppressed.\n", lost);
+		return 1;
+	}
+	missed++;
+	spin_unlock_irqrestore(&ratelimit_lock, flags);
+	return 0;
+}
+EXPORT_SYMBOL(printk_ratelimit);
===== kernel/sysctl.c 1.55 vs edited =====
--- 1.55/kernel/sysctl.c	Thu Oct  2 17:12:07 2003
+++ edited/kernel/sysctl.c	Thu Jan  1 23:52:11 2004
@@ -60,6 +60,8 @@
 extern int pid_max;
 extern int sysctl_lower_zone_protection;
 extern int min_free_kbytes;
+extern int printk_ratelimit_jiffies;
+extern int printk_ratelimit_burst;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -575,6 +577,22 @@
 		.ctl_name	= KERN_PANIC_ON_OOPS,
 		.procname	= "panic_on_oops",
 		.data		= &panic_on_oops,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= KERN_PRINTK_RATELIMIT,
+		.procname	= "printk_ratelimit",
+		.data		= &printk_ratelimit_jiffies,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_jiffies,
+	},
+	{
+		.ctl_name	= KERN_PRINTK_RATELIMIT_BURST,
+		.procname	= "printk_ratelimit_burst",
+		.data		= &printk_ratelimit_burst,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
===== mm/page_alloc.c 1.176 vs edited =====
--- 1.176/mm/page_alloc.c	Tue Dec 30 08:38:05 2003
+++ edited/mm/page_alloc.c	Thu Jan  1 23:52:11 2004
@@ -670,10 +670,11 @@
 	}
 
 nopage:
-	if (!(gfp_mask & __GFP_NOWARN)) {
-		printk("%s: page allocation failure."
+	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
+		printk(KERN_WARNING "%s: page allocation failure."
 			" order:%d, mode:0x%x\n",
 			p->comm, order, gfp_mask);
+		dump_stack();
 	}
 	return NULL;
 got_pg:

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: swapper: page allocation failure. order:3, mode:0x20
  2004-01-01 13:01     ` Anton Blanchard
@ 2004-01-01 20:45       ` Andrew Morton
  2004-01-01 22:43         ` Anton Blanchard
  2004-01-01 22:08       ` Christophe Saout
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2004-01-01 20:45 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: joneskoo, linux-kernel

Anton Blanchard <anton@samba.org> wrote:
>
> 
> > So sure, ratelimit it, make it KERN_INFO and maybe add a dump_stack()?
> 
> Sounds good, I always end up adding a dump_stack there when debugging
> these problems anyway.
> 
> > (printk_ratelimit() may be a suitable name)
> 
> How does this look?

Good.  I guess we need to make net_ratelimit() use this sometime.

> +/* 
> + * printk rate limiting, lifted from the networking subsystem.
> + *
> + * This enforces a rate limit: not more than one kernel message
> + * every printk_ratelimit_jiffies to make a denial-of-service 
> + * attack impossible.
> + */ 
> +int printk_ratelimit(void)
> +{
> +	static spinlock_t ratelimit_lock = SPIN_LOCK_UNLOCKED;
> +	static unsigned long toks = 10*5*HZ;
> +	static unsigned long last_msg; 
> +	static int missed;
> +	unsigned long flags;
> +	unsigned long now = jiffies;
> +
> +	spin_lock_irqsave(&ratelimit_lock, flags);
> +	toks += now - last_msg;
> +	last_msg = now;
> +	if (toks > (printk_ratelimit_burst * printk_ratelimit_jiffies))
> +		toks = printk_ratelimit_burst * printk_ratelimit_jiffies;
> +	if (toks >= printk_ratelimit_jiffies) {
> +		int lost = missed;
> +		missed = 0;
> +		toks -= printk_ratelimit_jiffies;
> +		spin_unlock_irqrestore(&ratelimit_lock, flags);
> +		if (lost)
> +			printk(KERN_WARNING "printk: %d messages suppressed.\n", lost);
> +		return 1;
> +	}
> +	missed++;
> +	spin_unlock_irqrestore(&ratelimit_lock, flags);
> +	return 0;
> +}

This seems a bit odd.  It means that the further apart the message bursts
are, the longer they are allowed to be.  Or something.

Wouldn't it be better to say "after each greater-than-five second window,
allow up to ten printk's as long as they happen in the next five
milliseconds"?


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: swapper: page allocation failure. order:3, mode:0x20
  2004-01-01 20:45       ` Andrew Morton
@ 2004-01-01 22:43         ` Anton Blanchard
  0 siblings, 0 replies; 11+ messages in thread
From: Anton Blanchard @ 2004-01-01 22:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: joneskoo, linux-kernel

> This seems a bit odd.  It means that the further apart the message bursts
> are, the longer they are allowed to be.  Or something.
>
> Wouldn't it be better to say "after each greater-than-five second window,
> allow up to ten printk's as long as they happen in the next five
> milliseconds"?

Its 100% copied from the networking code. I would have thought we really
want that behaviour, if Im on a serial console I can only tolerate a
decent sized backtrace every few seconds. If I allow 10 bursts every 5
seconds then im screwed.

Instead we allow the burst of 10 once but then ratelimit it to one
per 5 seconds until we have had a long enough period of silence. The
main thing is that over the long term we have an average of one message
per 5 seconds.

Anton

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: swapper: page allocation failure. order:3, mode:0x20
  2004-01-01 13:01     ` Anton Blanchard
  2004-01-01 20:45       ` Andrew Morton
@ 2004-01-01 22:08       ` Christophe Saout
  2004-01-01 22:45         ` Anton Blanchard
  1 sibling, 1 reply; 11+ messages in thread
From: Christophe Saout @ 2004-01-01 22:08 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Andrew Morton, joneskoo, linux-kernel

Am Do, den 01.01.2004 schrieb Anton Blanchard um 14:01:

> How does this look?
> [...]
> +	static unsigned long toks = 10*5*HZ;
> +	static unsigned long last_msg; 
> +	static int missed;

This would mean that all users of printk_ratelimit share this. If
printk_ratelimit is bombed by one user other perhaps important messages
are also suppressed.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: swapper: page allocation failure. order:3, mode:0x20
  2004-01-01 22:08       ` Christophe Saout
@ 2004-01-01 22:45         ` Anton Blanchard
  2004-01-01 23:13           ` Christophe Saout
  0 siblings, 1 reply; 11+ messages in thread
From: Anton Blanchard @ 2004-01-01 22:45 UTC (permalink / raw)
  To: Christophe Saout; +Cc: Andrew Morton, joneskoo, linux-kernel


> > How does this look?
> > [...]
> > +	static unsigned long toks = 10*5*HZ;
> > +	static unsigned long last_msg; 
> > +	static int missed;
> 
> This would mean that all users of printk_ratelimit share this. If
> printk_ratelimit is bombed by one user other perhaps important messages
> are also suppressed.

printk_ratelimit is only to be used for things which we can afford to 
lose (eg our VM debugging messages). Don't use it on anything important :)

Anton

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: swapper: page allocation failure. order:3, mode:0x20
  2004-01-01 22:45         ` Anton Blanchard
@ 2004-01-01 23:13           ` Christophe Saout
  0 siblings, 0 replies; 11+ messages in thread
From: Christophe Saout @ 2004-01-01 23:13 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Andrew Morton, joneskoo, linux-kernel

Am Do, den 01.01.2004 schrieb Anton Blanchard um 23:45:

> > This would mean that all users of printk_ratelimit share this. If
> > printk_ratelimit is bombed by one user other perhaps important messages
> > are also suppressed.
> 
> printk_ratelimit is only to be used for things which we can afford to 
> lose (eg our VM debugging messages). Don't use it on anything important :)

Other implementations would probably be overkill, so I'll agree with
you. :)

If printk_ratelimit is called too often there is probably something
wrong going anyway that is worth fixing.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: swapper: page allocation failure. order:3, mode:0x20
  2004-01-01 10:15 ` Anton Blanchard
  2004-01-01 10:25   ` Andrew Morton
@ 2004-01-01 10:27   ` Anton Blanchard
  2004-01-01 10:42     ` Arjan van de Ven
  1 sibling, 1 reply; 11+ messages in thread
From: Anton Blanchard @ 2004-01-01 10:27 UTC (permalink / raw)
  To: Joonas Kortesalmi; +Cc: linux-kernel, akpm

 
> Now e1000 uses TSO (and can regularly ask for 32kB+ kmallocs in
> interrupt context) perhaps we should look moving the rx buffer refill code
> into a context that can sleep. Then again its not like we can tolerate
> much latency in this code path, your rx ring will run out quite quickly :)

I hate to argue with myself, but thats crap. TSO only affects the TX
path and its buffers are created outside interrupt context. So it must be
a large MTU causing the failures, regardless it still makes sense to explore
rx skb refill outside of interrupt context idea.

Anton

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: swapper: page allocation failure. order:3, mode:0x20
  2004-01-01 10:27   ` Anton Blanchard
@ 2004-01-01 10:42     ` Arjan van de Ven
  0 siblings, 0 replies; 11+ messages in thread
From: Arjan van de Ven @ 2004-01-01 10:42 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Joonas Kortesalmi, linux-kernel, akpm

[-- Attachment #1: Type: text/plain, Size: 406 bytes --]

On Thu, 2004-01-01 at 11:27, Anton Blanchard wrote:
>  regardless it still makes sense to explore
> rx skb refill outside of interrupt context idea.

could be done in 2 steps; eg have a "normal path" that starts refilling
from process context if, say, half the skbs are used but an emergency
fallback to do it from irq context if for whatever reason the process
context didn't get around to it....

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2004-01-01 23:13 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-01-01  9:35 swapper: page allocation failure. order:3, mode:0x20 Joonas Kortesalmi
2004-01-01 10:15 ` Anton Blanchard
2004-01-01 10:25   ` Andrew Morton
2004-01-01 13:01     ` Anton Blanchard
2004-01-01 20:45       ` Andrew Morton
2004-01-01 22:43         ` Anton Blanchard
2004-01-01 22:08       ` Christophe Saout
2004-01-01 22:45         ` Anton Blanchard
2004-01-01 23:13           ` Christophe Saout
2004-01-01 10:27   ` Anton Blanchard
2004-01-01 10:42     ` Arjan van de Ven

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox