Date: Thu, 2 Oct 2008 10:32:51 +1000
From: Dave Chinner
To: Peter Cordes
Cc: xfs@oss.sgi.com
Subject: Re: RAID5/6 writes
Message-ID: <20081002003251.GA30001@disturbed>
In-Reply-To: <20081001175237.GJ32037@cordes.ca>
References: <20081001175237.GJ32037@cordes.ca>
List-Id: xfs

On Wed, Oct 01, 2008 at 02:52:37PM -0300, Peter Cordes wrote:
> I just had an idea for speeding up writes to parity-based RAIDs
> (RAID4,5,6).[1] If XFS wants to write sectors 1,2,3, 5,6,7, but it
> knows that block 4 is free space, it might be better to write sector 4
> (with zeros, don't put uninitialized kernel memory on disk!).

How does XFS know that block 4 is free space? Or indeed that this is
a single-block-sized hole in a range of blocks mapped to different
inodes or filesystem metadata?

If you want something like this, the lower layer needs to discover
holes like this and, instead of immediately initiating a RMW cycle,
call back to the filesystem to determine whether the hole is free
space. That works for all filesystems, not just XFS.

> It's probably only useful to do this if XFS has data in memory to
> prove that the gap is not part of the filesystem.
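To make the RMW trade-off concrete, here is a rough sketch of RAID5 parity arithmetic (illustrative only — not md's implementation; block and stripe sizes are made up). It shows why a full-stripe write needs no reads while a partial-stripe write must read old data and old parity first, and why zero-filling a hole turns the former case into the latter:

```python
# Minimal RAID5 parity sketch (illustrative, not md's code).
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def full_stripe_parity(blocks):
    """Full-stripe write: parity comes from the new data alone - no reads."""
    p = bytes(len(blocks[0]))
    for b in blocks:
        p = xor_blocks(p, b)
    return p

def rmw_parity(old_parity, old_block, new_block):
    """Partial-stripe write: the old block and old parity must be read
    back first - the read half of the read-modify-write cycle."""
    return xor_blocks(xor_blocks(old_parity, old_block), new_block)

# Stripe of four data blocks; block 2 is the "hole" the proposal would
# zero-fill so that blocks 0-3 become a single full-stripe write.
old = [bytes([i] * 4) for i in range(4)]
p_old = full_stripe_parity(old)

new = [b"\xaa" * 4, b"\xbb" * 4, b"\x00" * 4, b"\xcc" * 4]
p_full = full_stripe_parity(new)   # computed with zero disk reads

# Reaching the same end state via RMW needs reads for every update:
p_rmw = p_old
for i in range(4):
    p_rmw = rmw_parity(p_rmw, old[i], new[i])
assert p_rmw == p_full
```

The XOR algebra is why both paths converge: each RMW step cancels the old block out of the parity and mixes the new one in.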
> Doing extra reads probably doesn't make sense except in very special
> cases. (e.g. repeated writes to the same location with the same hole,
> so just one read would let them all become full-block or even
> full-stripe writes.)

That's the sort of workload the stripe cache is supposed to optimise;
every subsequent sparse write to the same stripe line avoids the read
part of the RMW cycle. The filesystem is the wrong layer to optimise
this type of workload....

FWIW, XFS has its own problems with writeback triggering RMW cycles -
this sort of thing for data could be considered noise compared to the
RMW storm that can be caused by inode writeback under memory pressure,
as XFS has to do RMW cycles itself on the inode cluster buffers. See
the Inode Writeback section of this document:

http://oss.sgi.com/archives/xfs/2008-09/msg00289.html

This can only be fixed at the filesystem level because no amount of
tweaking the storage can improve the I/O patterns that XFS is issuing.
These RMW cycles in inode writeback can cause the inode flush rate to
drop to a few tens of inodes per second. When you have hundreds of
thousands of dirty inodes in a system, it can take *hours* to flush
the dirty inodes to disk....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
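[As a postscript on the stripe cache point above: a toy model — my own sketch, not md's actual stripe cache — of why repeated sparse writes to one stripe line amortise the RMW read cost. The first partial write pulls the missing blocks into the cache; later writes to the same stripe find them there and skip the read:]

```python
# Toy stripe-cache model (illustrative only; not the md implementation).
class StripeCache:
    def __init__(self):
        self.cached = set()   # stripe numbers currently held in the cache
        self.reads = 0        # RMW reads actually issued to disk

    def partial_write(self, stripe):
        if stripe not in self.cached:
            self.reads += 1   # read the rest of the stripe: the RMW read
            self.cached.add(stripe)
        # parity is then updated against the cached copy - no more reads

cache = StripeCache()
for _ in range(100):          # 100 sparse writes to the same stripe line
    cache.partial_write(stripe=7)
assert cache.reads == 1       # only the first write pays the read penalty
```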