public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [newbie:] Bonnie++2 hangs recent 2.6 kernels? Bash keeps looping in waitpid(), eating 100% CPU
@ 2007-09-19 14:58 Frantisek Rysanek
  0 siblings, 0 replies; 10+ messages in thread
From: Frantisek Rysanek @ 2007-09-19 14:58 UTC (permalink / raw)
  To: linux-kernel

Dear Mr. Piggin,

thanks for your response in the first place :-)

On 13 Sep 2007 at 2:30, Nick Piggin wrote:
>
> Can you see if it is looping in userspace or kernel? Can you kill -9
> the process?
> 
This is interesting. I can't run any classic system command. Any 
command hangs or coredumps. Any command except kill :-) Perhaps 
"kill" is an internal bash command, so that it needn't fork+exec 
(clone) to execute?

Anyway if I kill -9 the loopy bash process, the loopy console 
respawns, I get several segfaults from udevd and dircolors (called 
from .bashrc), and the new bash process on that console is no longer 
loopy. But I continue to get segfaults from any commands that I try 
to run...

> Are you able to test with the latest 2.6.23-rc kernel? If not (or if it
> still has the same problem), then can you get the output of sysrq+T
> and three sysrq+P calls, please? (this might help work out where in
> kernel it is spinning).
>
I've compiled 2.6.23-rc6, enabled serial console and captured 
the output of sysrq+P (on the affected virtual VGA console)
and sysrq+T. 

http://www.fccps.cz/download/adv/frr/bonnie/2.6.23-rc6.txt

The interesting bit of information, related to the erratic "bash" 
processes, is always a single line, such as:

bash          R running      0  2358      1

I've also taken a photo of `top` running
on another virtual console. I can't get any data out of the
affected box, as I can't run any shell commands...

http://www.fccps.cz/download/adv/frr/bonnie/top.jpg

Note that there are rather few processes running in the user space.
Can't say if that makes any difference from a full-blown distro.

Maybe I could set up the bootable CD for download somewhere 
(gzipped ISO of maybe 50 Megs).

In this scenario, Linux 2.6.16.18 once reported a soft lockup.
http://www.fccps.cz/download/adv/frr/bonnie/soft-lockup1.txt
Never again.

I also managed to catch the misbehavior in strace once, didn't
get a capture, but essentially it was stuck at a single open
syscall, I believe it was "waitpid(1, " . (Never managed that again, 
always got segfaults instead of the loopy bash when trying to watch 
bash by strace -p). 

Exactly where does the context switch from user to kernel take place?
I know that I can call ioctl() from user space, and I can write 
ioctl() handlers in kernel space as part of device drivers (the 
handlers take place entirely in kernel space). The waitpid()
thing is a syscall, being entered only once from user space
- and the bash process seems to keep looping inside it.
Does the single "running" line in Alt+SysRq+T mean that the
process is looping in user space?
Take a look at the CPU consumption % numbers though...

Note that there's no OOM killer. (Seen that one before, under 
different circumstances - when OCFS2 didn't like machines
with less than 1 GB RAM.)

My impression is that the erratic behavior could be a secondary 
symptom of a kernel-space memory leak taking place somewhere else 
than in the loopy code itself. Can't say if the leak takes place in 
memory management or EXT3 for instance...

Or maybe my problem lives in pure user space after all?

Frank Rysanek


^ permalink raw reply	[flat|nested] 10+ messages in thread
* [newbie:] Bonnie++2 hangs recent 2.6 kernels? Bash keeps looping in waitpid(), eating 100% CPU
@ 2007-09-13 14:46 Frantisek Rysanek
  2007-09-13  0:30 ` Nick Piggin
  2008-03-06 20:46 ` Frantisek Rysanek
  0 siblings, 2 replies; 10+ messages in thread
From: Frantisek Rysanek @ 2007-09-13 14:46 UTC (permalink / raw)
  To: linux-kernel

Dear everyone,

apologies in advance for a silly question...

I'm using a homebrew stripped-down mini-distro based on Fedora 5, 
with various newer kernels, on a live CD, to test hardware with. 
The live CD is composed by means of scripted binary copy of the key 
necessary components (libc, init, bash, /dev/, /etc/, you know the 
rest...) - it's almost like rolling your own MS-DOS boot floppy.
A minimum system is about 4-10 MB, a neat firewall takes up
about 22 MB.

Recently I've stubled over what seems like a lasting bug
in the Linux kernel. Excuse me for that accusation, which is 
admittedly based on rather vague data, dated versions
of the user-space software (libc, bash...), and a homebrew
hackey distro.

First impression:
looped execution of Bonnie++2 makes bash go berserk.
There are two possible flawed behaviors:

1) the bash process that's waiting for Bonnie++2 to return, 
   starts looping inside the last waitpid() call I believe, 
   eating 100% CPU.
   At least that's what 'top' + 'strace -p <bash PID>' would 
   suggest. The top and strace have to be running beforehand,
   as the same happens to the bash process on any other virtual
   console, if you try to run any further command. (The further 
   command doesn't seem to get executed anymore.)

2) the bash processes don't start eating 100% CPU, but any further 
   command that you try to execute returns immediately with
   a segfault.

I boot the CD with just bare bash on all 6 virtual consoles.
I mount a previously created EXT3 FS (several hundred GB
to over 1 TB) on a mountpoint, `cd` into the mountpoint
on one or two consoles, and run 

 while true; do bonnie++2 -u root -s 4096; done

Then I run 'iostat 2', 'top' and 'strace -p <bash PID>' on the 
remaining consoles. I try running some other command now and then, to 
make the paging and block IO subsystems load some more blocks from 
the CD.

I believe the `top` output suggests that the Bonnie processes don't 
eat all that much RAM, but the kernel-space buffering eats almost all 
of it. Only about 50 Megs remain truly "free", most of the RAM gets 
"cached". The system stabilizes at this balance, and a few minutes 
later it hangs in the aforementioned way.

This happens without a swap. If I mkswap+swapon some free hard drive, 
the symptoms seem somewhat more difficult to reproduce, but do occur 
after a somewhat longer period of time.

The symptoms are fairly easily reproduced on 2.6.16.18 through 
2.6.16.48, as well as 2.6.18.8. On 2.6.22.6 it seems to take a bit 
more time to reproduce the problem.

I've reproduced the problem on three different dual Xeon
boxes, all of them SuperMicro of different sizes/generations,
all of them upgraded to the latest BIOS (now showing no more
IRQ routing mischiefs).
The hardware setups are along the lines of 
- Intel 7501 chipset, dual Xeon Northwood, 1 GB RAM,
   Adaptec 79xx HBA, external RAID (~80 MBps),
   internal Adaptec 2120 RAID (~50 MBps)
- Intel 7520 chipset, dual Xeon Irwindale, 2 GB RAM,
   several internal U320 SCSI drives via Adaptec 79xx HBA,
   an external RAID (~80 MBps) via LSI 20320 HBA (Fusion MPT)
- Intel 7520 chipset, dual Xeon Nocona, 1 GB RAM,
   internal LSI MegaRAID SATA150-6 with 6 disk drives.

I've never seen this before I started using bonnie++2 as a load 
generator :-) Both my hardware systems and my Linux CD are otherwise 
perfectly stable, under sequential IO, cpuburn, older versions of 
Bonnie on Linux 2.4 / FreeBSD etc. 
I know what it looks like when there's a hardware problem and I know 
how to prove/deny a hardware problem by selective A/B-style hardware 
replacements, I'm fairly good at shielding away hardware unstability.

Should I start from compiling a fresh libc + bash + whatever else?
Any ideas are welcome :-)

Frank Rysanek


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2008-03-06 20:54 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-09-19 14:58 [newbie:] Bonnie++2 hangs recent 2.6 kernels? Bash keeps looping in waitpid(), eating 100% CPU Frantisek Rysanek
  -- strict thread matches above, loose matches on Subject: below --
2007-09-13 14:46 Frantisek Rysanek
2007-09-13  0:30 ` Nick Piggin
2007-09-14 15:00   ` Frantisek Rysanek
2007-09-19 22:53     ` Chuck Ebbert
2007-09-20  7:30       ` Frantisek Rysanek
2007-09-19  8:50   ` Frantisek Rysanek
2007-09-19 15:34     ` Randy Dunlap
2007-09-19 16:08       ` Frantisek Rysanek
2008-03-06 20:46 ` Frantisek Rysanek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox