From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from tartarus.angband.pl ([89.206.35.136]:37167 "EHLO
        tartarus.angband.pl" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1755922AbcJLT4H (ORCPT );
        Wed, 12 Oct 2016 15:56:07 -0400
Date: Wed, 12 Oct 2016 21:55:28 +0200
From: Adam Borowski
To: Zygo Blaxell
Cc: Qu Wenruo, "linux-btrfs@vger.kernel.org"
Subject: Re: RAID system with adaption to changed number of disks
Message-ID: <20161012195528.GB4800@angband.pl>
References: <20161011160601.GI7683@carfax.org.uk>
        <3da9a459-c63b-570c-5b42-c7186b3a74fd@cn.fujitsu.com>
        <20161012043718.GW21290@hungrycats.org>
        <37578baa-556b-d3f7-45bd-10843124dea1@cn.fujitsu.com>
        <20161012171936.GD26140@hungrycats.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <20161012171936.GD26140@hungrycats.org>
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On Wed, Oct 12, 2016 at 01:19:37PM -0400, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:
> > In fact, the _concept_ to solve such RMW behavior is quite simple:
> >
> > Make sector size equal to stripe length. (Or vice versa if you like)
> >
> > Although the implementation will be more complex, people like Chandan are
> > already working on sub page size sector size support.
>
> So...metadata blocks would be 256K on the 5-disk RAID5 example above,
> and any file smaller than 256K would be stored inline? Ouch. That would
> also imply the compressed extent size limit (currently 128K) has to become
> much larger.
>
> I had been thinking that we could inject "plug" extents to fill up
> RAID5 stripes. This lets us keep the 4K block size for allocations,
> but at commit (or delalloc) time we would fill up any gaps in new RAID
> stripes to prevent them from being modified. As the real data is deleted
> from the RAID stripes, it would be replaced by "plug" extents to keep any
> new data from being allocated in the stripe.
> When the stripe consists entirely of "plug" extents, the plug extent would
> be deleted, allowing the stripe to be allocated again. The "plug" data
> would be zero for the purposes of parity reconstruction, regardless of
> what's on the disk. Balance would just throw the plug extents away (no
> need to relocate them).

Your idea sounds good, but there's one problem: most real users don't
balance. Ever. Contrary to the tribal wisdom here, this actually works
fine, unless you had a pathological load skewed to either data or metadata
on the first write and then filled the disk to near-capacity with a load
skewed the other way.

Most usage patterns produce a mix of transient and persistent data (and at
write time you don't know which file is which), meaning that with time every
stripe will contain a smidge of cold data plus a fill of plug extents.

Thus, while the plug extents idea doesn't suffer from the problems of big
sectors you just mentioned, we'd need some kind of auto-balance.

-- 
A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
raspberries, 0.4kg sugar; put into a big jar for 1 month. Filter out and
throw away the fruits (can dump them into a cake, etc), let the drink age
at least 3-6 months.