* swapper: page allocation failure. order:3, mode:0x20 @ 2004-01-01 9:35 Joonas Kortesalmi 2004-01-01 10:15 ` Anton Blanchard 0 siblings, 1 reply; 11+ messages in thread From: Joonas Kortesalmi @ 2004-01-01 9:35 UTC (permalink / raw) To: linux-kernel -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 After running 2.6.0 on a server for a few days, I met an interesting and annoying problem. I was playing with NFS over gigabit ethernet (e1000) and it was a bit slow. I tried to find out why by running top and I saw syslog-ng eating almost 10% of the 1,3GHz Duron. Looked at the log and there was a huge flood of these messages: swapper: page allocation failure. order:3, mode:0x20 irssi: page allocation failure. order:3, mode:0x20 swapper: page allocation failure. order:3, mode:0x20 vim: page allocation failure. order:3, mode:0x20 swapper: page allocation failure. order:3, mode:0x20 most of the messages starting with swapper. After a quick google search it looks like this kind of a problem existed in a bit older 2.5/2.6 serie kernels. It looks like it's still with us. Linux hamsu.org 2.6.0 #1 Mon Dec 29 21:07:28 EET 2003 i686 AMD Duron(tm) processor AuthenticAMD GNU/Linux Shortly about the hardware of the machine in case it would matter: AMD Duron 1,3GHz CPU, 768MB of RAM, Intel PRO/1000MT Desktop and Realtek 8139C NICs. Pre-empt was disabled in the kernel config and the kernel didn't have any extra patches. A pure vanilla kernel. I'm not a subscriber so please CC me a copy of messages related to the subject. I'm not sure if I can help much looking at the insides of the kernel, but I will try my best answering any questions about this. - -- Joonas OH8GDV (ham call sign) OpenPGP: 0x5F72BE43 ( 0B37 05E0 0FB4 EB2E 161C DCE6 7F7B C645 5F72 BE43 ) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) iD8DBQE/8+lzf3vGRV9yvkMRAgkIAJ4tjyWtfeDmXCaM0s6w9+ofYLKktACgt+I5 MhYxd+KeJS8QFOjPbtOmykk= =75d0 -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: swapper: page allocation failure. order:3, mode:0x20 2004-01-01 9:35 swapper: page allocation failure. order:3, mode:0x20 Joonas Kortesalmi @ 2004-01-01 10:15 ` Anton Blanchard 2004-01-01 10:25 ` Andrew Morton 2004-01-01 10:27 ` Anton Blanchard 0 siblings, 2 replies; 11+ messages in thread From: Anton Blanchard @ 2004-01-01 10:15 UTC (permalink / raw) To: Joonas Kortesalmi; +Cc: linux-kernel, akpm Hi, > After running 2.6.0 on a server for a few days, I met an interesting and > annoying problem. I was playing with NFS over gigabit ethernet (e1000) and > it was a bit slow. I tried to find out why by running top and I saw syslog-ng > eating almost 10% of the 1,3GHz Duron. Looked at the log and there was a huge > flood of these messages: > > swapper: page allocation failure. order:3, mode:0x20 > irssi: page allocation failure. order:3, mode:0x20 > swapper: page allocation failure. order:3, mode:0x20 > vim: page allocation failure. order:3, mode:0x20 > swapper: page allocation failure. order:3, mode:0x20 Its sounds like you are using either a large MTU (9k?) or TSO. TSO causes the networking stack to think it has a massive MTU and the e1000 card busts it up into proper MTU sized packets. The problem is that it places much more stress on the allocator by asking for these large chunks of memory in interrupt context. Now e1000 uses TSO (and can regularly ask for 32kB+ kmallocs in interrupt context) perhaps we should look moving the rx buffer refill code into a context that can sleep. Then again its not like we can tolerate much latency in this code path, your rx ring will run out quite quickly :) BTW We have found increasing /proc/sys/vm/min_free_kbytes can help the situation a bit. Bumping the slab limits for the larger kmallocs (via echo X Y Z > /proc/slab) might be useful too. We should probably rate limit that printk. Andrew: I was thinking of stealing net_ratelimit and calling it core_ratelimit or whatever. Then wrap these non critical things with it. Overkill? Anton ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: swapper: page allocation failure. order:3, mode:0x20 2004-01-01 10:15 ` Anton Blanchard @ 2004-01-01 10:25 ` Andrew Morton 2004-01-01 13:01 ` Anton Blanchard 2004-01-01 10:27 ` Anton Blanchard 1 sibling, 1 reply; 11+ messages in thread From: Andrew Morton @ 2004-01-01 10:25 UTC (permalink / raw) To: Anton Blanchard; +Cc: joneskoo, linux-kernel Anton Blanchard <anton@samba.org> wrote: > > > swapper: page allocation failure. order:3, mode:0x20 > > ... > We should probably rate limit that printk. Andrew: I was thinking of > stealing net_ratelimit and calling it core_ratelimit or whatever. Then > wrap these non critical things with it. Overkill? Actually my intent was just to remove it (and __GFP_NOWARN) - it's just development-time debug. But it is handy on occasion. So sure, ratelimit it, make it KERN_INFO and maybe add a dump_stack()? (printk_ratelimit() may be a suitable name) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: swapper: page allocation failure. order:3, mode:0x20 2004-01-01 10:25 ` Andrew Morton @ 2004-01-01 13:01 ` Anton Blanchard 2004-01-01 20:45 ` Andrew Morton 2004-01-01 22:08 ` Christophe Saout 0 siblings, 2 replies; 11+ messages in thread From: Anton Blanchard @ 2004-01-01 13:01 UTC (permalink / raw) To: Andrew Morton; +Cc: joneskoo, linux-kernel > So sure, ratelimit it, make it KERN_INFO and maybe add a dump_stack()? Sounds good, I always end up adding a dump_stack there when debugging these problems anyway. > (printk_ratelimit() may be a suitable name) How does this look? Anton ===== Documentation/sysctl/kernel.txt 1.7 vs edited ===== --- 1.7/Documentation/sysctl/kernel.txt Wed Apr 9 18:01:38 2003 +++ edited/Documentation/sysctl/kernel.txt Thu Jan 1 23:52:10 2004 @@ -251,6 +251,25 @@ ============================================================== +printk_ratelimit: + +Some warning messages are rate limited. printk_ratelimit specifies +the minimum length of time between these messages, by default we +allow one every 5 seconds. + +A value of 0 will disable rate limiting. + +============================================================== + +printk_ratelimit_burst: + +While long term we enforce one message per printk_ratelimit +seconds, we do allow a burst of messages to pass through. +printk_ratelimit_burst specifies the number of messages we can +send before ratelimiting kicks in. + +============================================================== + reboot-cmd: (Sparc only) ??? This seems to be a way to give an argument to the Sparc ===== include/linux/kernel.h 1.43 vs edited ===== --- 1.43/include/linux/kernel.h Tue Dec 30 08:37:26 2003 +++ edited/include/linux/kernel.h Thu Jan 1 23:52:10 2004 @@ -89,6 +89,8 @@ unsigned long int_sqrt(unsigned long); +extern int printk_ratelimit(void); + static inline void console_silent(void) { console_loglevel = 0; ===== include/linux/sysctl.h 1.54 vs edited ===== --- 1.54/include/linux/sysctl.h Thu Dec 25 14:32:23 2003 +++ edited/include/linux/sysctl.h Thu Jan 1 23:53:11 2004 @@ -127,6 +127,8 @@ KERN_PANIC_ON_OOPS=57, /* int: whether we will panic on an oops */ KERN_HPPA_PWRSW=58, /* int: hppa soft-power enable */ KERN_HPPA_UNALIGNED=59, /* int: hppa unaligned-trap enable */ + KERN_PRINTK_RATELIMIT=60, /* int: tune printk ratelimiting */ + KERN_PRINTK_RATELIMIT_BURST=61, /* int: tune printk ratelimiting */ }; ===== kernel/printk.c 1.29 vs edited ===== --- 1.29/kernel/printk.c Tue Dec 30 12:01:55 2003 +++ edited/kernel/printk.c Thu Jan 1 23:53:51 2004 @@ -762,3 +762,45 @@ tty->driver->write(tty, 0, msg, strlen(msg)); return; } + +/* minimum time in jiffies between messages */ +int printk_ratelimit_jiffies = 5*HZ; + +/* number of messages we send before ratelimiting */ +int printk_ratelimit_burst = 10; + +/* + * printk rate limiting, lifted from the networking subsystem. + * + * This enforces a rate limit: not more than one kernel message + * every printk_ratelimit_jiffies to make a denial-of-service + * attack impossible. + */ +int printk_ratelimit(void) +{ + static spinlock_t ratelimit_lock = SPIN_LOCK_UNLOCKED; + static unsigned long toks = 10*5*HZ; + static unsigned long last_msg; + static int missed; + unsigned long flags; + unsigned long now = jiffies; + + spin_lock_irqsave(&ratelimit_lock, flags); + toks += now - last_msg; + last_msg = now; + if (toks > (printk_ratelimit_burst * printk_ratelimit_jiffies)) + toks = printk_ratelimit_burst * printk_ratelimit_jiffies; + if (toks >= printk_ratelimit_jiffies) { + int lost = missed; + missed = 0; + toks -= printk_ratelimit_jiffies; + spin_unlock_irqrestore(&ratelimit_lock, flags); + if (lost) + printk(KERN_WARNING "printk: %d messages suppressed.\n", lost); + return 1; + } + missed++; + spin_unlock_irqrestore(&ratelimit_lock, flags); + return 0; +} +EXPORT_SYMBOL(printk_ratelimit); ===== kernel/sysctl.c 1.55 vs edited ===== --- 1.55/kernel/sysctl.c Thu Oct 2 17:12:07 2003 +++ edited/kernel/sysctl.c Thu Jan 1 23:52:11 2004 @@ -60,6 +60,8 @@ extern int pid_max; extern int sysctl_lower_zone_protection; extern int min_free_kbytes; +extern int printk_ratelimit_jiffies; +extern int printk_ratelimit_burst; /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ static int maxolduid = 65535; @@ -575,6 +577,22 @@ .ctl_name = KERN_PANIC_ON_OOPS, .procname = "panic_on_oops", .data = &panic_on_oops, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = KERN_PRINTK_RATELIMIT, + .procname = "printk_ratelimit", + .data = &printk_ratelimit_jiffies, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec_jiffies, + }, + { + .ctl_name = KERN_PRINTK_RATELIMIT_BURST, + .procname = "printk_ratelimit_burst", + .data = &printk_ratelimit_burst, .maxlen = sizeof(int), .mode = 0644, .proc_handler = &proc_dointvec, ===== mm/page_alloc.c 1.176 vs edited ===== --- 1.176/mm/page_alloc.c Tue Dec 30 08:38:05 2003 +++ edited/mm/page_alloc.c Thu Jan 1 23:52:11 2004 @@ -670,10 +670,11 @@ } nopage: - if (!(gfp_mask & __GFP_NOWARN)) { - printk("%s: page allocation failure." + if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) { + printk(KERN_WARNING "%s: page allocation failure." " order:%d, mode:0x%x\n", p->comm, order, gfp_mask); + dump_stack(); } return NULL; got_pg: ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: swapper: page allocation failure. order:3, mode:0x20 2004-01-01 13:01 ` Anton Blanchard @ 2004-01-01 20:45 ` Andrew Morton 2004-01-01 22:43 ` Anton Blanchard 2004-01-01 22:08 ` Christophe Saout 1 sibling, 1 reply; 11+ messages in thread From: Andrew Morton @ 2004-01-01 20:45 UTC (permalink / raw) To: Anton Blanchard; +Cc: joneskoo, linux-kernel Anton Blanchard <anton@samba.org> wrote: > > > > So sure, ratelimit it, make it KERN_INFO and maybe add a dump_stack()? > > Sounds good, I always end up adding a dump_stack there when debugging > these problems anyway. > > > (printk_ratelimit() may be a suitable name) > > How does this look? Good. I guess we need to make net_ratelimit() use this sometime. > +/* > + * printk rate limiting, lifted from the networking subsystem. > + * > + * This enforces a rate limit: not more than one kernel message > + * every printk_ratelimit_jiffies to make a denial-of-service > + * attack impossible. > + */ > +int printk_ratelimit(void) > +{ > + static spinlock_t ratelimit_lock = SPIN_LOCK_UNLOCKED; > + static unsigned long toks = 10*5*HZ; > + static unsigned long last_msg; > + static int missed; > + unsigned long flags; > + unsigned long now = jiffies; > + > + spin_lock_irqsave(&ratelimit_lock, flags); > + toks += now - last_msg; > + last_msg = now; > + if (toks > (printk_ratelimit_burst * printk_ratelimit_jiffies)) > + toks = printk_ratelimit_burst * printk_ratelimit_jiffies; > + if (toks >= printk_ratelimit_jiffies) { > + int lost = missed; > + missed = 0; > + toks -= printk_ratelimit_jiffies; > + spin_unlock_irqrestore(&ratelimit_lock, flags); > + if (lost) > + printk(KERN_WARNING "printk: %d messages suppressed.\n", lost); > + return 1; > + } > + missed++; > + spin_unlock_irqrestore(&ratelimit_lock, flags); > + return 0; > +} This seems a bit odd. It means that the further apart the message bursts are, the longer they are allowed to be. Or something. Wouldn't it be better to say "after each greater-than-five second window, allow up to ten printk's as long as they happen in the next five milliseconds"? ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: swapper: page allocation failure. order:3, mode:0x20 2004-01-01 20:45 ` Andrew Morton @ 2004-01-01 22:43 ` Anton Blanchard 0 siblings, 0 replies; 11+ messages in thread From: Anton Blanchard @ 2004-01-01 22:43 UTC (permalink / raw) To: Andrew Morton; +Cc: joneskoo, linux-kernel > This seems a bit odd. It means that the further apart the message bursts > are, the longer they are allowed to be. Or something. > > Wouldn't it be better to say "after each greater-than-five second window, > allow up to ten printk's as long as they happen in the next five > milliseconds"? Its 100% copied from the networking code. I would have thought we really want that behaviour, if Im on a serial console I can only tolerate a decent sized backtrace every few seconds. If I allow 10 bursts every 5 seconds then im screwed. Instead we allow the burst of 10 once but then ratelimit it to one per 5 seconds until we have had a long enough period of silence. The main thing is that over the long term we have an average of one message per 5 seconds. Anton ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: swapper: page allocation failure. order:3, mode:0x20 2004-01-01 13:01 ` Anton Blanchard 2004-01-01 20:45 ` Andrew Morton @ 2004-01-01 22:08 ` Christophe Saout 2004-01-01 22:45 ` Anton Blanchard 1 sibling, 1 reply; 11+ messages in thread From: Christophe Saout @ 2004-01-01 22:08 UTC (permalink / raw) To: Anton Blanchard; +Cc: Andrew Morton, joneskoo, linux-kernel Am Do, den 01.01.2004 schrieb Anton Blanchard um 14:01: > How does this look? > [...] > + static unsigned long toks = 10*5*HZ; > + static unsigned long last_msg; > + static int missed; This would mean that all users of printk_ratelimit share this. If printk_ratelimit is bombed by one user other perhaps important messages are also suppressed. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: swapper: page allocation failure. order:3, mode:0x20 2004-01-01 22:08 ` Christophe Saout @ 2004-01-01 22:45 ` Anton Blanchard 2004-01-01 23:13 ` Christophe Saout 0 siblings, 1 reply; 11+ messages in thread From: Anton Blanchard @ 2004-01-01 22:45 UTC (permalink / raw) To: Christophe Saout; +Cc: Andrew Morton, joneskoo, linux-kernel > > How does this look? > > [...] > > + static unsigned long toks = 10*5*HZ; > > + static unsigned long last_msg; > > + static int missed; > > This would mean that all users of printk_ratelimit share this. If > printk_ratelimit is bombed by one user other perhaps important messages > are also suppressed. printk_ratelimit is only to be used for things which we can afford to lose (eg our VM debugging messages). Don't use it on anything important :) Anton ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: swapper: page allocation failure. order:3, mode:0x20 2004-01-01 22:45 ` Anton Blanchard @ 2004-01-01 23:13 ` Christophe Saout 0 siblings, 0 replies; 11+ messages in thread From: Christophe Saout @ 2004-01-01 23:13 UTC (permalink / raw) To: Anton Blanchard; +Cc: Andrew Morton, joneskoo, linux-kernel Am Do, den 01.01.2004 schrieb Anton Blanchard um 23:45: > > This would mean that all users of printk_ratelimit share this. If > > printk_ratelimit is bombed by one user other perhaps important messages > > are also suppressed. > > printk_ratelimit is only to be used for things which we can afford to > lose (eg our VM debugging messages). Don't use it on anything important :) Other implementations would probably be overkill, so I'll agree with you. :) If printk_ratelimit is called too often there is probably something wrong going anyway that is worth fixing. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: swapper: page allocation failure. order:3, mode:0x20 2004-01-01 10:15 ` Anton Blanchard 2004-01-01 10:25 ` Andrew Morton @ 2004-01-01 10:27 ` Anton Blanchard 2004-01-01 10:42 ` Arjan van de Ven 1 sibling, 1 reply; 11+ messages in thread From: Anton Blanchard @ 2004-01-01 10:27 UTC (permalink / raw) To: Joonas Kortesalmi; +Cc: linux-kernel, akpm > Now e1000 uses TSO (and can regularly ask for 32kB+ kmallocs in > interrupt context) perhaps we should look moving the rx buffer refill code > into a context that can sleep. Then again its not like we can tolerate > much latency in this code path, your rx ring will run out quite quickly :) I hate to argue with myself, but thats crap. TSO only affects the TX path and its buffers are created outside interrupt context. So it must be a large MTU causing the failures, regardless it still makes sense to explore rx skb refill outside of interrupt context idea. Anton ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: swapper: page allocation failure. order:3, mode:0x20 2004-01-01 10:27 ` Anton Blanchard @ 2004-01-01 10:42 ` Arjan van de Ven 0 siblings, 0 replies; 11+ messages in thread From: Arjan van de Ven @ 2004-01-01 10:42 UTC (permalink / raw) To: Anton Blanchard; +Cc: Joonas Kortesalmi, linux-kernel, akpm [-- Attachment #1: Type: text/plain, Size: 406 bytes --] On Thu, 2004-01-01 at 11:27, Anton Blanchard wrote: > regardless it still makes sense to explore > rx skb refill outside of interrupt context idea. could be done in 2 steps; eg have a "normal path" that starts refilling from process context if, say, half the skbs are used but an emergency fallback to do it from irq context if for whatever reason the process context didn't get around to it.... [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2004-01-01 23:13 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2004-01-01 9:35 swapper: page allocation failure. order:3, mode:0x20 Joonas Kortesalmi 2004-01-01 10:15 ` Anton Blanchard 2004-01-01 10:25 ` Andrew Morton 2004-01-01 13:01 ` Anton Blanchard 2004-01-01 20:45 ` Andrew Morton 2004-01-01 22:43 ` Anton Blanchard 2004-01-01 22:08 ` Christophe Saout 2004-01-01 22:45 ` Anton Blanchard 2004-01-01 23:13 ` Christophe Saout 2004-01-01 10:27 ` Anton Blanchard 2004-01-01 10:42 ` Arjan van de Ven
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox