From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Tokarev
Subject: Re: Setting up md-raid5: observations, errors, questions
Date: Mon, 03 Mar 2008 05:58:19 +0300
Message-ID: <47CB694B.9000807@msgid.tls.msk.ru>
References: <47CAC5BF.6060600@msgid.tls.msk.ru>
 <47CAF30F.7040502@msgid.tls.msk.ru>
 <47CB228D.5020709@msgid.tls.msk.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from hobbit.corpit.ru ([81.13.94.6]:24977 "EHLO hobbit.corpit.ru"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1751911AbYCCC6a (ORCPT ); Sun, 2 Mar 2008 21:58:30 -0500
In-Reply-To:
Sender: linux-ide-owner@vger.kernel.org
List-Id: linux-ide@vger.kernel.org
To: Christian Pernegger
Cc: linux-raid@vger.kernel.org, linux-ide@vger.kernel.org

Christian Pernegger wrote:
>> I highly doubt chunk size makes any difference. Bitmap is the primary
>> suspect here.

I meant something else.  Sure thing, chunk size will have quite a
significant impact on write performance, see below.  What I meant is
that the PROBLEM you're facing is not due to chunk size but due to the
bitmap.

> Some tests:
>
> raid5, chunk size goes from 16k to 1m. arrays created with --assume-clean
>
> dd-tests
> ========
[]
> dd-write
> --------
> with bitmap: gets SLOWLY worse with inc. chunk size 30 -> 27 MB/s
> without bitmap: gets MUCH worse with inc chunk size 100 -> 59 MB/s

In other words, the bitmap makes a HUGE impact - when it's on,
everything is so damn slow that other factors aren't even noticeable.
When the bitmap is off, write speed drops as chunk size increases.

> Conclusion: needs explanation / tuning

You know how raid5 processes writes, right?  The read-modify-write or
similar technique, which involves READING as well as writing: reading
from other disks in order to calculate/update parity.  Unless you
write a complete stripe (all chunks).
So the bigger your chunk size is, the less chance you have of
performing a full-stripe write, simply because "large, properly
aligned" writes are much less frequent than "smaller, unaligned" ones -
at least for typical filesystem usage (special usage patterns exist
for sure).

Here, both the linux write-back cache (writes don't go directly to disk
but to kernel memory first, and the kernel does some reordering/batching
there) and the md stripe-cache make a huge difference, for obvious
reasons.

> even omitting the bitmap the writes just touch 100 MB/s, more like 80
> on any chunk size with nice reads.
> why would it get worse? Anything tunable here?

Yes.  See your read-test with small chunk size.  Here, you've got
better "initial" results.  Why is reading with a small chunk size so
slow?  Because of the small request size to the underlying device,
that's why - each drive is effectively reading 16 KB at a time,
spending much time rotating and communicating with the controller...

But unlike for reads, increasing chunk size does not help writes -
because of the above reasons (writing "half" stripes more often and
hence requiring a read-modify-write cycle more often etc).

You can get the best results when writing WITHOUT a filesystem
(directly to /dev/mdFOO) with dd and a blocksize equal to the total
stripe size (chunk size * number of data disks, or 16k * 3 ..... 1M * 3,
since in your raid 3 disks are data in each stripe), and trying direct
writes at the same time.  Yes, it will be a bit worse than read speed,
but it should still be pretty decent.

I said "without a filesystem" because alignment is very important
too, -- to the same level as full-stripe vs "half"-stripe writes, and
with a filesystem in place, you can't be sure anymore how your files
are aligned on disk (xfs tries to align files correctly, at least under
some conditions).
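
To make the arithmetic above concrete, here is a rough sketch in Python.
The function names (full_stripe_bytes, split_write) are made up for
illustration, and it models only the stripe geometry (3 data disks, as
in your array), not what md, its stripe-cache or the page cache really
do with a request.  It counts how many stripes a single write covers
completely and how many it only touches in part; the partial ones are
the candidates for read-modify-write.

```python
def full_stripe_bytes(chunk_bytes, data_disks):
    """Data held by one full stripe: chunk size * number of data disks."""
    return chunk_bytes * data_disks


def split_write(offset, length, chunk_bytes, data_disks):
    """Return (full_stripes, partial_stripes) touched by one write.

    A stripe counts as "full" only if the write covers all of its data.
    A stripe touched only in part would need parity updated from data
    read back from disk (read-modify-write or similar).
    Assumes length > 0.
    """
    stripe = full_stripe_bytes(chunk_bytes, data_disks)
    end = offset + length
    first = offset // stripe
    last = (end - 1) // stripe
    full = partial = 0
    for s in range(first, last + 1):
        s_start, s_end = s * stripe, (s + 1) * stripe
        if offset <= s_start and end >= s_end:
            full += 1
        else:
            partial += 1
    return full, partial


KIB = 1024
for chunk_kib in (16, 64, 256, 1024):
    chunk = chunk_kib * KIB
    stripe_kib = full_stripe_bytes(chunk, data_disks=3) // KIB
    # The same aligned 192 KiB write, against different chunk sizes.
    print(chunk_kib, stripe_kib, split_write(0, 192 * KIB, chunk, 3))
```

With 16k chunks the stripe is 48 KiB and the 192 KiB write is four full
stripes; with 256k or 1M chunks the very same write is a single partial
stripe.  Shift the offset by 4 KiB and even a stripe-sized write turns
into two partial ones, which is the alignment point.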
> the maximum total reached via parallel single-disk writes is 150 MB/s
>
>
> mke2fs-tests
> ============
>
> create ext3 fs with correct stride, get a 10-second vmstat average 10
> seconds in and abort the mke2fs
>
> with bitmap: goes down SLOWLY from 64k chunks 17 -> 13 MB/s
> without bitmap: gets MUCH worse with inc. chunk size 80 -> 34 MB/s
>
> Conclusion: needs explanation / tuning

The same thing.  The bitmap makes a HUGE impact, and w/o bitmap, write
speed drops as chunk size increases.

> Comments welcome.

I see one problem (the bitmap thing, finally discovered and confirmed),
and a need to tune your system beforehand -- it's the chunk size.  And
here, you're pretty much a wizard - it's your guess.  In any case,
unless your usage pattern will be special and optimized for such a
setup, don't choose a large chunk size for raid456.  Choosing a large
chunk size for raid0 and raid10 makes sense, but with raid456 it has
immediate downsides.

> Next step: smaller bitmap
> When the performance seems normal I'll revisit the responsiveness-issue.
>
>> Umm.. You mixed it all ;)
>> Bitmap is a place (stored somewhere... ;) where each equally-sized
>> block of the array has a single bit of information - namely, if that
>> block has been written recently (which means it was dirty) or not.
>> So for each block (which is in no way related to chunk size etc!)
>
> Aren't these blocks-represented-by-a-bit-in-the-bitmap called chunks,
> too? Sorry for the confusion.

Well yes, but I deliberately avoided using *this* "chunk" in my
sentence ;)  In any case, the chunk as in "usual-raid-chunk-size" and
the "chunk-represented-by-a-bit-in-the-bitmap" are different chunks;
the latter consists of one or more of the former.  Oh well.

>> This has nothing to do with window between first and second disk
>> failure. Once first disk fails, bitmap is of no use anymore,
>> because you will need a replacement disk, which has to be
>> resyncronized in whole,
>
> Yes, that makes sense.
> Still sounds useful, since a lot of my
> "failures" have been of the intermittent (SATA cables / connectors,
> port resets, slow bad-sector remap) variety.

Yes, you're right.  Annoying stuff, and bitmaps definitely help here.

>> If the bitmap is unaccessible, it's handled as there was no bitmap
>> at all - ie, if the array was dirty, it will be resynced as a whole;
>> if it was clean, nothing will be done.
>
> Ok, good to hear. In theory that's the sane mode of operation, in
> practice it might just have been that the array refuses to assemble
> without its bitmap.

I had the opposite here.  Due to various reasons, including operator
errors, bugs in mdadm and md kernel code, and probably phases of the
Moon, after creating a bitmap on a raid array and rebooting, I kept
discovering that the bitmap wasn't there anymore, -- it was gone.  It
all came down to bitmap data (information about bitmap presence and
location) not being passed to the kernel correctly - either because I
forgot to specify it in a config file, or because mdadm didn't pass
that info in several cases, or because with that superblock version
bitmaps were handled incorrectly....

>> Yes, external bitmaps are supported and working. It doesn't mean
>> they're faster however - I tried placing a bitmap into a tmpfs (just
>> for testing) - and discovered about 95% drop in speed
>
> Interesting ... what are external bitmaps good for, then?

I had no time to investigate, and now I don't have the hardware to test
again.  In theory it should work, but I guess only a few people are
using them, if at all - most are using internal bitmaps.

/mjt