public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Brad Allen <Ulmo-Usenet@Usenet.Q.Net>
To: linux-kernel@VGER.KERNEL.ORG
Cc: Brad Allen <Ulmo-Usenet@Usenet.Q.Net>
Subject: allocation failures with CBQ bandwidth limiting & high net use (was Re: Filesystem kernel hangup, 2.6.3 (bad: scheduling while atomic!))
Date: Tue, 27 Apr 2004 21:39:19 -0700 (PDT)	[thread overview]
Message-ID: <20040427.213919.884032925.ulmo-u@Q.Net> (raw)
In-Reply-To: 1077543963.1246.20.camel@harrier.lucky.linux.kernel

> Just tried it without HyperThreading, with JFS, the filesystem
> hanged without any kernel oops after about 10 minutes of 1Gbit/s FTP
> input.
> 
> What I got was:   [...]
> mserv4 kernel: swapper: page allocation failure. order:0, mode:0x20   [...]
> mserv4 kernel:  [<c02ad599>] e1000_alloc_rx_buffers+0x59/0x100


I just corresponded a problem I originally thought yours might be
related to, so now I think it's even less likely they're related.

After a due-diligence check for latest version before post, I found
and upgraded to kernel version 2.6.6-rc3, and during my tests of that
new version inadvertently finally found the test that revealed a
correspondence with these errors: running CBQ network bandwidth
limiting, queuing and prioritization, *and* high network use (using
that bandwidth limiting).  The CBQ stuff is done at boot.  As soon as
I started the high network use, the problem started (see below
"full_kernel_output" file for an idea of what kernel messages I was
seeing en masse), and then during this high network use, I turned off
CBQ and everything was normal again.


This is what I wrote before the above discovery, which gives
background:


I'm also seeing "page allocation failure" errors in a 2.6 series
kernel with e1000 driver with lots of filesystem access and lots of
outbound network, which I think are our common themes.  I'm seeing the
call trace with e1000 driver routines in nearly every list.
Otherwise, much else is different:

* Linux kernel 2.6.6-rc2 used for this particular run
  as documented, but same allocation failure errors
  also seen in 2.6.6-rc3, 2.6.6-rc1, 2.6.5,
  2.6.5+kraxel263-1, 2.6.4+kraxel263-1 (+kraxel is
  drivers/media/video stuff ported in sooner than 2.6
  series got it).
* I was doing relatively heavy client NFSv3, not XFS
  or JFS like you: ~2MiB/s steady write, ~2MiB/s
  steady read.  (Test ran for ~2 hours.  Crashed
  with only 37s to go before filling up remote disk!
  Coincidence, because it had no idea the remote disk
  was getting full, I think ... but never did finish
  filling it, which was my planned test end point.)
* I am using experimental IVTV driver, which is
  itself having trouble and undergoing active
  development, to record and play TV using Hauppauge
  990 (WinTV PVR 350).  This card does all the heavy
  compute intensive stuff: MPEG2 encode & decode, as
  well as TV stuff.  The computer merely has to
  bitshuffle & stuff, which is *not* that hard.  The
  bandwidth of the system ought to allow more like four
  of these cards recording & at least one playing, if
  things were tuned right, I think.
* My IRQ is shared with that above TV card, e1000
  (Intel GbE) card, and video output card; the first
  two get used extensively (the last one -- the local
  console video -- almost never gets used, except for
  printk).
* I had network output going almost full-bore, as
  a stress test of the above, but I still get the
  errors when this isn't the case.  61,840B/s inside
  TCP/IP.
* I am using CBQ to limit network output speed to
  Intenret, and priority of packets.  62,464B/s including
  IP packets (headers, etc.).  Link measured max at a
  rough ~66,560B/s inside TCP/IP, so I stay away from it.
  So, this is basically definitely hitting output
  speed buffering on occasion.
* It's bridging between two ethernet cards (sender
  of through-IP traffic (~61,840B/s) in from the
  e1000 GbE card, being sent out an on-board 100MbE).
  NFSv3 (from & to server) isn't bridged, but also
  is going via that e1000 GbE card.
* The computer I'm seeing this on is relatively old
  and moderately low resource for today's memory &
  CPU use standards:
  1999 Dell Optiplex GX1 PIII 450MHz 128MiB ram

I'm not sure if this is related to your problem.  I searched
linux-kernel by using the two quoted phrases "page allocation failure"
"e1000_alloc_rx_buffers" in Google, and your above message is the only
one that came out.

I am worried that perhaps page allocation failures are very vague
diagnostic tools.  I take it this simply means I ran out of (perhaps a
certain kind of, e.g. and what I suspect, kernel) memory.

Here is the kernel message output (one sample excerpt below):

  ftp://ftp.sonic.net/pub/users/ulmo/ivtv/full_kernel_output.bz2

and .config options:

  ftp://ftp.sonic.net/pub/users/ulmo/SW/OS/linux/2.6.6-rc2.config.bz2

Here's my CBQ script:

  ftp://ftp.sonic.net/pub/users/ulmo/SW/OS/linux/tc.sh

Note that I *do* use LVM (everyone I talk to seems to -- although I
think it's called DM now; btw, I do use and like EVMS on more modern
hosts -- and I'm sure I'd like it on old ones, too, but just haven't
bothered to do it there), but since not much of that computer's access
is via local disk, I doubt that matters that much.

My MTU for GbE (e1000) is 9000, and NFS block size 8192 bytes.
That GbE is a consumer grade Intel model.

Here is a sample of the kernel output; note please that these errors
are not all exactly identical, so if you are reading them please check
the above full_kernel_output for any nuances and big differences (all
kernel messages for about two hours from boot to crash).

swapper: page allocation failure. order:3, mode:0x20
Call Trace:
 [__alloc_pages+696/784] __alloc_pages+0x2b8/0x310
 [__get_free_pages+34/80] __get_free_pages+0x22/0x50
 [cache_grow+165/656] cache_grow+0xa5/0x290
 [cache_alloc_refill+354/512] cache_alloc_refill+0x162/0x200
 [do_timer+224/240] do_timer+0xe0/0xf0
 [__kmalloc+140/176] __kmalloc+0x8c/0xb0
 [alloc_skb+72/240] alloc_skb+0x48/0xf0
 [e1000_alloc_rx_buffers+102/272] e1000_alloc_rx_buffers+0x66/0x110
 [e1000_clean_rx_irq+225/1056] e1000_clean_rx_irq+0xe1/0x420
 [scheduler_tick+31/1312] scheduler_tick+0x1f/0x520
 [e1000_intr+58/144] e1000_intr+0x3a/0x90
 [handle_IRQ_event+59/112] handle_IRQ_event+0x3b/0x70
 [do_IRQ+150/336] do_IRQ+0x96/0x150
 [common_interrupt+24/32] common_interrupt+0x18/0x20
 [default_idle+38/64] default_idle+0x26/0x40
 [cpu_idle+52/64] cpu_idle+0x34/0x40
 [start_kernel+415/480] start_kernel+0x19f/0x1e0
 [unknown_bootoption+0/288] unknown_bootoption+0x0/0x120

  parent reply	other threads:[~2004-04-28  4:39 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-02-22 15:49 Filesystem kernel hangup, 2.6.3 (bad: scheduling while atomic!) Mikael Wahlberg
2004-02-23 12:19 ` Christoph Hellwig
2004-02-23 13:08   ` Mikael Wahlberg
2004-02-23 13:13     ` Christoph Hellwig
2004-03-04  9:35       ` Mikael Wahlberg
2004-02-23 13:46     ` Seth Mos
2004-02-23 13:50       ` Mikael Wahlberg
2004-02-23 13:46   ` Mikael Wahlberg
     [not found]   ` <1077543963.1246.20.camel@harrier.lucky.linux.kernel>
2004-04-28  4:39     ` Brad Allen [this message]
2004-04-28  5:42       ` allocation failures with CBQ bandwidth limiting & high net use (was Re: Filesystem kernel hangup, 2.6.3 (bad: scheduling while atomic!)) Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20040427.213919.884032925.ulmo-u@Q.Net \
    --to=ulmo-usenet@usenet.q.net \
    --cc=linux-kernel@VGER.KERNEL.ORG \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox