Date: Sun, 1 May 2011 08:27:17 +1000
From: NeilBrown
Subject: Re: RAID6 r-m-w, op-journaled fs, SSDs
Message-ID: <20110501082717.5116e575@notabene.brown>
In-Reply-To: <19900.10868.583555.849181@tree.ty.sabi.co.UK>
To: Peter Grandi
Cc: Linux RAID, Linux fs JFS, Linux fs XFS

On Sat, 30 Apr 2011 16:27:48 +0100 pg_xf2@xf2.for.sabi.co.UK (Peter Grandi)
wrote:

> While I agree with BAARF.com arguments fully, I sometimes have
> to deal with legacy systems with wide RAID6 sets (for example 16
> drives, quite revolting) which have op-journaled filesystems on
> them like XFS or JFS (sometimes block-journaled ext[34], but I
> am not that interested in them for this).
>
> Sometimes (but fortunately not that recently) I have had to deal
> with small-file filesystems setup on wide-stripe RAID6 setup by
> morons who don't understand the difference between a database
> and a filesystem (and I have strong doubts that RAID6 is
> remotely appropriate to databases).
>
> So I'd like to figure out how much effort I should invest in
> undoing cases of the above, that is how bad they are likely to
> be and how they degrade over time (usually very badly).
>
> First a couple of questions purely about RAID, but indirectly
> relevant to op-journaled filesystems:
>
> * Can Linux MD do "abbreviated" read-modify-write RAID6
>   updates like for RAID5? That is where not the whole stripe
>   is read in, modified and written, but just the block to be
>   updated and the parity blocks.

No. (patches welcome).

>
> * When reading or writing part of a RAID[456] stripe for example
>   smaller than a sector, what is the minimum unit of transfer
>   with Linux MD? The full stripe, the chunk containing the
>   sector, or just the sector containing the bytes to be
>   written or updated (and potentially the parity sectors)? I
>   would expect reads to always read just the sector, but not
>   so sure about writing.

1 "PAGE" - normally 4K.

>
> * What about popular HW RAID host adapters (e.g. LSI, Adaptec,
>   Areca, 3ware), where is the documentation if any on how they
>   behave in these cases?
>
> Regardless, op-journaled file system designs like JFS and XFS
> write small records (way below a stripe set size, and usually
> way below a chunk size) to the journal when they queue
> operations, even if sometimes depending on design and options
> may "batch" the journal updates (potentially breaking safety
> semantics). Also they do small writes when they dequeue the
> operations from the journal to the actual metadata records
> involved.

The ideal config for a journalled filesystem is to put the journal on a
separate smaller lower-latency device, e.g. a small RAID1 pair.

In a previous work place I had good results with:
  RAID1 pair of small disks with root, swap, journal
  Large RAID5/6 array with bulk of filesystem.

I also did data journalling as it helps a lot with NFS.

>
> How bad can this be when the journal is say internal for a
> filesystem that is held on wide-stride RAID6 set?
> I suspect very very bad, with apocalyptic read-modify-write storms,
> eating IOPS.
>
> I suspect that this happens a lot with SSDs too, where the role
> of stripe set size is played by the erase block size (often in
> the hundreds of KBytes, and even more expensive).
>
> Where are studies or even just impressions or anecdotes on how
> bad this is?
>
> Are there instrumentation tools in JFS or XFS that may allow me
> to watch/inspect what is happening with the journal? For Linux
> MD to see what are the rates of stripe r-m-w cases?

Not that I am aware of.

NeilBrown
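
To put rough numbers on the RAID6 small-write question in this thread, here
is a minimal sketch that counts page-sized device I/Os for updating a single
data page in one stripe. It is a back-of-envelope model, not a description of
the MD code: the function name and parameters are made up for illustration,
and it assumes a cold stripe cache (nothing already in memory), one page
touched per write, and that without an "abbreviated" update path the array
reads the untouched data pages of the stripe and recomputes both parity
blocks.

```python
def small_write_io(total_drives, parity_drives=2, strategy="reconstruct"):
    """Return (reads, writes) in pages for updating one data page in a stripe.

    Hypothetical model, assuming a cold stripe cache:
      "reconstruct": read every other data page in the stripe, recompute
                     parity from scratch, write the new data and parity.
      "rmw":         read the old data page and the old parity pages, fold
                     the change into parity, write the new data and parity.
    """
    data_drives = total_drives - parity_drives
    if data_drives < 1:
        raise ValueError("need at least one data drive")

    if strategy == "reconstruct":
        reads = data_drives - 1      # all data pages except the one written
    elif strategy == "rmw":
        reads = 1 + parity_drives    # old data page plus old parity pages
    else:
        raise ValueError("unknown strategy: %r" % (strategy,))

    writes = 1 + parity_drives       # new data page plus new parity pages
    return reads, writes


if __name__ == "__main__":
    for strategy in ("reconstruct", "rmw"):
        reads, writes = small_write_io(16, strategy=strategy)
        print("%-11s reads=%d writes=%d" % (strategy, reads, writes))
```

Under these assumptions, the 16-drive RAID6 set mentioned in the question
costs 13 page reads and 3 page writes per small update with reconstruction,
against 3 reads and 3 writes if an abbreviated read-modify-write were
available. Real numbers will be better whenever the stripe cache already
holds some of those pages or several small writes land in the same stripe.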