From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Tokarev
Subject: Re: Setting up md-raid5: observations, errors, questions
Date: Sun, 02 Mar 2008 21:33:51 +0300
Message-ID: <47CAF30F.7040502@msgid.tls.msk.ru>
References: <47CAC5BF.6060600@msgid.tls.msk.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To:
Sender: linux-ide-owner@vger.kernel.org
To: Christian Pernegger
Cc: linux-raid@vger.kernel.org, linux-ide@vger.kernel.org
List-Id: linux-raid.ids

Christian Pernegger wrote:
>> > OK. Back to the fs again, same command, different device. Still
>> > glacially slow (and still running), only now the whole box is at a
>> > standstill, too. cat /proc/cpuinfo takes about 3 minutes (!) to
>> > complete, I'm still waiting for top to launch (15min and counting).
>> > I'll leave mke2fs running for now ...
>>
>> What's the state of your array at this point - is it resyncing?
>
> Yes. Didn't think it would matter (much). Never did before.

It does.  If everything works ok, it should not, but that's not your
case ;)

>> o how about making filesystem(s) on individual disks first, to see
>>   how that will work out?  Maybe on each of them in parallel? :)
>
> Running. System is perfectly responsive during 4x mke2fs -j -q on raw devices.
>
> Done. Upper bound for duration is 8 minutes (probably much lower,
> forgot to let it beep on completion), which is much better than the 2
> hours with the syncing RAID.

Aha.  Excellent.

>  26:    1041479        267   IO-APIC-fasteoi   sata_promise
>  27:          0          0   IO-APIC-fasteoi   sata_promise
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd    free    buff  cache   si   so   bi     bo   in   cs us sy id wa
>  0  4      0  12864 1769688  10000    0    0    0 146822  539  809  0 26 23 51

Ok.  146 MB/sec.

> Cpu(s): 1.3%us, 8.1%sy, 0.0%ni, 41.6%id, 46.0%wa, 0.7%hi, 2.3%si, 0.0%st

46.0% waiting

> I hope you can interpret that :)

Some ;)

>> o try --assume-clean when creating the array
>
> mke2fs (same command as in first post) now running on fresh
> --assume-clean array w/o crypto. System is only marginally less
> responsive than under idle load, if at all.

So the responsiveness problem is solved here, right?  I mean, if
there's no resync going on (the case with --assume-clean), the rest
of the system works as expected, right?

> But inode table writing speed is only about 8-10/second. For the
> single disk case I couldn't read the numbers fast enough.

Note that mkfs now has to do 3x more work, too - since the device is
3x (for 4-drive raid5) larger.

> chris@jesus:~$ cat /proc/interrupts
>  26:    1211165        267   IO-APIC-fasteoi   sata_promise
>  27:          0          0   IO-APIC-fasteoi   sata_promise
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd    free    buff  cache   si   so   bi    bo   in   cs us sy id wa
>  0  1      0  11092 1813376  10804    0    0    0 13316  535 5201  0  9 51 40

That's 10 times slower than in the case of 4 individual disks.

> Cpu(s): 0.0%us, 10.1%sy, 0.0%ni, 55.6%id, 33.7%wa, 0.2%hi, 0.3%si, 0.0%st

and only 33.7% waiting, which is probably due to the lack of
parallelism.

> From vmstat I gather that total write throughput is an order of
> magnitude slower than on the 4 raw disks in parallel. Naturally the
> mke2fs on the raid isn't parallelized but it should still be
> sequential enough to get the max for a single disk (~40-60MB/s),
> right?

Well, not really.  Mkfs is doing many small writes all over the place,
so each is seek+write.  And it's synchronous - no next write gets
submitted till the current one completes.

Ok.
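(For completeness, the sort of commands I had in mind for the
--assume-clean test - an untested sketch, the device names /dev/md0
and /dev/sd[abcd]1 are just placeholders, substitute your own:

   # create the 4-drive raid5 without triggering the initial resync
   mdadm --create /dev/md0 --level=5 --raid-devices=4 \
         --assume-clean /dev/sd[abcd]1

   # or keep the resync but cap its speed (KB/sec per device) so it
   # doesn't starve everything else
   echo 5000 > /proc/sys/dev/raid/speed_limit_max

Keep in mind that with --assume-clean the parity isn't actually
correct until a repair/resync is run, so it's only good for testing.)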
For now I don't see a problem (other than that there IS a problem
somewhere - obviously).  Interrupts are ok.  System time (10.1%) in
the second case doesn't look right, but it was 8.1% before...

Only 2 guesses left.  And I really mean "guesses", because I can't
say definitely what's going on anyway.

First, try to disable bitmaps on the raid array, and see if it makes
any difference (a command sketch is in the P.S. below).  For some
reason I think it will... ;)

And second, the whole thing looks pretty much like a more general
problem discussed here and elsewhere in the last few days.  I mean
the handling of parallel reads and writes - when a single write may
stall reads for quite some time and vice versa.  I see it every day
on disks without NCQ/TCQ - the system is mostly single-tasking, sorta
like ol' good MS-DOG :)  Good TCQ-enabled drives survive very high
load while the system stays more-or-less responsive (and I forget
when I last saw a "bad" TCQ-enabled drive - even a 10 y/o 4Gb seagate
has excellent TCQ support ;).  And all modern SATA stuff works pretty
much like old IDE drives, which were designed "for personal use", or
"single-task only" -- even the ones that CLAIM to support NCQ in
reality do not...  But that's a long story, and your disks and/or
controllers (or the combination) don't even support NCQ...

/mjt
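P.S.  A rough sketch of the bitmap experiment, assuming the array is
/dev/md0 (substitute your own device):

   # drop the write-intent bitmap
   mdadm --grow /dev/md0 --bitmap=none

   # and to put an internal bitmap back afterwards, if wanted
   mdadm --grow /dev/md0 --bitmap=internal

And whether a drive is doing NCQ at all shows up in its queue depth,
e.g.:

   cat /sys/block/sda/device/queue_depth

a value of 1 there means no command queueing is in use.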