Test Systems: x86 AMD Duron with PM3754 DPT controller, Two disks x86-64 2x Opteron with ASR-3010S Adaptec controller, Four disks Software: Gentoo 1.4.3.13 x86 with gcc-3.3.2 Fedora Core 2 x86-64 development with gcc-3.3.3-7 The i2o_block driver currently in kernel-2.6.5 has a problem when there is more than one logical block device on the I2O controller. Two or more devices causes kernel memory corruption,[1] sometimes accompanied by an oops, sometimes fatally in IRQ context and livelocks. In SMP mode this tended to cause panic=5 reboot to fail. http://lwn.net/Articles/58199/ Markus finally figured out what was going wrong when he read in this LWN article that in the 2.6 kernel, the generic block layer handles partition code and there is almost nothing the individual drivers have to do. The attached patch from Markus removes this unused partition logic and seemed to make things work extremely well. After some testing we began to realize why the kernel memory corruption that we saw with 2 or more block arrays was happening. The first array always had a valid partition table, while second and more arrays never did after they were split from the original single large array. The unused partition logic was copying the corrupted partition table from the 2nd and higher arrays into kernel memory, causing the trouble in kernel-2.6.5. Subsequent testing was going extremely well, with simultaneous heavy disk usage on two I2O block devices with bonnie++ going stable. We then chopped the controller into four I2O block devices, but unfortunately this ran us into trouble. bonnie++ on three devices simultaneously would panic. Below is a photo of the call trace from this panic. http://togami.com/~warren/archive/2004/i2o_cfq_quad_bonnie.jpg I noticed that the call trace contained "cfq_next_request", so I was curious what would happened if we changed to the deadline scheduler. Booted with the same kernel but with "elevator=deadline". To our surprise, bonnie++ ran simultaneously on all four I2O block devices without crashing the server. For another test we tried "elevator=as" and it too remained stable. Possible CFQ I/O scheduler problem? This research has been a joint venture of Markus Lidel and Warren Togami, with thanks to Mid-Pacific Institute for the loan of hardware. Warren Togami wtogami@redhat.com Markus Lidel Markus.Lidel@shadowconnect.com [1] With two block arrays, ls in the filesystem would sometimes have corrupted looking characters as the directories and files, making the system unusable. With four block arrays this happened, also with an accompanying kernel oops.