From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ric Wheeler
Subject: Re: thin provisioned LUN support
Date: Fri, 07 Nov 2008 16:04:52 -0500
Message-ID: <4914AD74.30509@redhat.com>
References: <1226074002.8030.33.camel@localhost.localdomain>
 <1226074270.15281.50.camel@think.oraclecorp.com>
 <1226074710.8030.43.camel@localhost.localdomain>
 <1226078535.15281.63.camel@think.oraclecorp.com>
 <4914846C.5060103@redhat.com> <20081107183636.GB29717@mit.edu>
 <49148BDF.9050707@redhat.com> <20081107193503.GG29717@mit.edu>
 <20081107201913.GI29717@mit.edu> <20081107202149.GJ15439@parisc-linux.org>
 <4914A492.4080009@redhat.com> <1226090910.15281.84.camel@think.oraclecorp.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Matthew Wilcox, Theodore Tso, "Martin K. Petersen", James Bottomley,
 Jens Axboe, David Woodhouse, linux-scsi@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Black_David@emc.com, Tom Coughlan
To: Chris Mason
Return-path:
In-Reply-To: <1226090910.15281.84.camel@think.oraclecorp.com>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Chris Mason wrote:
> On Fri, 2008-11-07 at 15:26 -0500, Ric Wheeler wrote:
>
>> Matthew Wilcox wrote:
>>
>>> On Fri, Nov 07, 2008 at 03:19:13PM -0500, Theodore Tso wrote:
>>>
>>>> Let's be just a *little* bit fair here.  Suppose we wanted to
>>>> implement thin-provisioned disks using devicemapper and LVM; consider
>>>> that LVM uses a default PE size of 4M for some very good reasons.
>>>> Asking filesystems to be a little smarter about allocation policies so
>>>> that we allocate in existing 4M chunks before going onto the next, and
>>>> asking the block layer to pool trim requests into 4M chunks is not
>>>> totally unreasonable.
>>>>
>>>> Array vendors use chunk sizes larger than typical filesystem chunk
>>>> sizes for the same reason that LVM does.  So to say that this is due
>>>> purely to a "broken firmware architecture" is a little unfair.
>>>>
>>> I think we would have a full-throated discussion about whether the
>>> right thing to do was to put the tracking in the block layer or in LVM.
>>> Rather similar to what we're doing now, in fact.
>>>
>> You definitely could imagine having a device mapper target that could
>> track the discard commands and the subsequent writes which would
>> invalidate the previous discards.
>>
>> Actually, it would be kind of nice to move all of this away from the
>> file systems entirely.
>>
>
>  * Fast
>  * Crash safe
>  * Bounded ram usage
>  * Accurately deliver the trims
>
> Pick any three ;)  If we're dealing with large files, I can see it
> working well.  For files that are likely to be smaller than the physical
> extent size, you end up with either extra state bits on disk (and
> keeping them in sync) or a log structured lvm.
>
> I do agree that an offline tool to account for bytes used would be able
> to make up for this, and from a thin provisioning point of view, we
> might be better off if we don't accurately deliver all the trims all
> the time.
>

Given that best practice more or less requires users to set the high
water mark low enough to give storage admins time to react, I think a
tool like this would be very useful. Think of how nasty it would be to
run out of real blocks on a device that seems to have plenty of unused
capacity :-)

> People just use the space again soon anyway, I'd have to guess the
> filesystems end up in a steady state outside of special events.
>
> In another email Ted mentions that it makes sense for the FS allocator
> to notice we've just freed the last block in an aligned region of size
> X, and I'd agree with that.
>
> The trim command we send down when we free the block could just contain
> the entire range that is free (and easy for the FS to determine) every
> time.
>
> -chris
>

I think sending down the entire contiguous range of freed sectors would
work well with these boxes...

ric
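
A minimal userspace sketch of what "send down the entire contiguous
range" could look like, assuming a kernel and device that support the
BLKDISCARD ioctl (one discard per contiguous free range, in bytes);
/dev/sdX and the 64 MiB range are placeholders, not anything from the
thread:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* BLKDISCARD */

int main(void)
{
    /* One discard for an entire contiguous free range, instead of one
     * per freed block.  Offset and length are in bytes; both values
     * here are placeholders. */
    uint64_t range[2] = { 0, 64ULL * 1024 * 1024 };  /* offset, length */
    int fd = open("/dev/sdX", O_RDWR);               /* placeholder device */

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, BLKDISCARD, &range) < 0)
        perror("BLKDISCARD");
    else
        printf("discarded %llu bytes at offset %llu\n",
               (unsigned long long)range[1], (unsigned long long)range[0]);
    close(fd);
    return 0;
}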
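
And a rough sketch of the 4M-chunk pooling Ted describes above: before
a trim is forwarded to an array with a large allocation unit, the freed
range would have to be clamped inward to whole units, since the array
cannot reclaim a partial chunk.  ALLOC_UNIT and clamp_to_alloc_unit are
made-up names for illustration, and 4 MiB is just the LVM PE size
quoted earlier:

#include <stdint.h>
#include <stdio.h>

/* Assumed allocation unit of the thin-provisioned array (4 MiB here,
 * matching the LVM PE size mentioned above). */
#define ALLOC_UNIT (4ULL * 1024 * 1024)

/*
 * Shrink a freed byte range [start, start + len) inward so that it
 * covers only whole allocation units.  Returns 0 if no whole unit
 * remains, 1 otherwise.
 */
static int clamp_to_alloc_unit(uint64_t start, uint64_t len,
                               uint64_t *out_start, uint64_t *out_len)
{
    uint64_t first = (start + ALLOC_UNIT - 1) / ALLOC_UNIT * ALLOC_UNIT;
    uint64_t end   = (start + len) / ALLOC_UNIT * ALLOC_UNIT;

    if (end <= first)
        return 0;               /* range does not span a full unit */
    *out_start = first;
    *out_len   = end - first;
    return 1;
}

int main(void)
{
    uint64_t s, l;

    /* A freed range that only partially covers its first and last unit:
     * [3 MiB, 13 MiB) clamps to an 8 MiB discard starting at 4 MiB. */
    if (clamp_to_alloc_unit(3ULL * 1024 * 1024, 10ULL * 1024 * 1024, &s, &l))
        printf("discard %llu bytes at offset %llu\n",
               (unsigned long long)l, (unsigned long long)s);
    return 0;
}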