From: Ric Wheeler <rwheeler@redhat.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Woodhouse <dwmw2@infradead.org>,
	linux-scsi@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Black_David@emc.com,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Tom Coughlan <coughlan@redhat.com>,
	Matthew Wilcox <matthew@wil.cx>,
	Jens Axboe <jens.axboe@oracle.com>
Subject: Re: thin provisioned LUN support
Date: Thu, 06 Nov 2008 18:10:57 -0500
Message-ID: <49137981.7050308@redhat.com>
In-Reply-To: <1226012815.4703.122.camel@localhost.localdomain>

James Bottomley wrote:
> On Thu, 2008-11-06 at 17:55 -0500, Ric Wheeler wrote:
>   
>> Dave Chinner wrote:
>>     
>>> On Thu, Nov 06, 2008 at 09:43:23AM -0500, Ric Wheeler wrote:
>>>   
>>>       
>>>> After talking to some vendors, one issue that came up is that the arrays  
>>>> all have a different size that is used internally to track the SCSI  
>>>> equivalent of TRIM commands (POKE/unmap).
>>>>
>>>> What they would like is for us to coalesce these commands into aligned  
>>>> multiples of these chunks. If not, the target device will most likely  
>>>> ignore the bits at the beginning and end (and all small requests).
>>>>     
>>>>         
>>> There's lots of questions that need to be answered here. e.g:
>>>
>>> Where are these free spaces going to be aggregated before dispatch?
>>>
>>> What happens if they are re-allocated and re-written by the
>>> filesystem before they've been dispatched?
>>>
>>> How is the chunk size going to be passed to the aggregation layer?
>>>
>>> What about passing it to the filesystem so it can align all its
>>> allocations in a manner that simplifies the dispatch problem?
>>>
>>> What happens if a crash occurs before the aggregated free space is
>>> dispatched?
>>>
>>> Are there coherency problems with filesystem recovery after a crash?
>>>   
>>>       
>> The good thing about these "unmap" commands (SCSI speak this week for 
>> TRIM) is that we can drop them if we have to without data integrity 
>> concerns.
>>
>> The only thing that you cannot do is to send down an unmap for a block 
>> still in use (including ones that have not been committed in a transaction).
>>
>> In SCSI, they plan to zero those blocks so that you will always read a 
>> block of zeros back if you try to read an unmapped sector.
>>     
>
> Actually, they left this up to the array in the latest spec.  If the
> TPRZ bit is set in the Block Device Characteristics VPD then, yes, it
> will return zeros.  If not, the return is undefined.
>
>   

The RAID vendors were not happy with this & are in the process of 
changing it to be:

    (1) all zeros OR
    (2) all 1's OR
    (3) other - but always to be returned consistently until a future write
 
The concern is that RAID boxes would trip up over parity if the data 
returned for an unmapped block could change between reads.

>> I have no idea how we can pass the aggregation size up from the block 
>> layer since it is not currently exported in a uniform way from SCSI. 
>> Even if it is, we have struggled to get RAID stripe alignment handled so 
>> far.
>>     
>
> Well, this is identical to the erase block size (and array stripe size)
> problems we've been discussing.  I thought we'd more or less agreed on
> the generic attributes model.
>   

We could do it, but we need the vendors to put it in a standard place 
first. Today, it is exposed only in device-specific ways.
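
To make the alignment request concrete, here is a rough sketch (Python, 
purely illustrative) of trimming a freed range down to whole chunks before 
sending the unmap. The function name align_discard and the 64KB chunk in 
the usage lines are made up for the example, and it assumes the chunk size 
is already known, which is exactly the part we do not have today. It does 
not reflect what any array or the block layer actually does.

```python
def align_discard(start, length, chunk):
    """Shrink the range [start, start + length) to whole chunks.

    All three arguments use the same unit (bytes or sectors).
    Returns (aligned_start, aligned_length), or None when the range
    does not cover even one complete chunk and is best dropped.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    end = start + length
    first = -(-start // chunk) * chunk  # round the start up to a chunk boundary
    last = (end // chunk) * chunk       # round the end down to a chunk boundary
    if last <= first:
        return None                     # nothing the array would act on
    return first, last - first


# Hypothetical 64KB chunk size, offsets in bytes.
print(align_discard(4096, 1048576, 65536))  # (65536, 983040)
print(align_discard(4096, 8192, 65536))     # None
```

The head and tail that get cut off are the same bits the array would 
ignore anyway, and since an unmap can be dropped without data integrity 
concerns, returning None for small requests is safe.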

>   
>>>> I have been thinking about whether or not we can (and should) do  
>>>> anything more than our current best effort to send down large chunks  
>>>> (note that the "chunk" size can range from reasonable sizes like 8KB or  
>>>> so up to close to 1MB!).
>>>>     
>>>>         
>>> Any aggregation is only as good as the original allocation the
>>> filesystem did. Look at the mess ext3 creates when untarring a
>>> kernel tarball - blocks are written to all over the place. You'd
>>> need to fix that to have any hope of behaving nicely for a RAID
>>> that has a sub-optimal thin provisioning algorithm.
>>>
>>> The problem is not with the filesystem, the block layer or the OS.
>>> If the array vendors have optimised themselves into a corner,
>>> then they should be fixing their problem, not asking the rest of
>>> the world to expend large amounts of effort to work around the
>>> shortcomings of their products.....
>>>   
>>>       
>> I agree - I think that eventually vendors will end up having to cache 
>> the requests internally. The problem is with the customers who will be 
>> getting the first generation of gear and have had their expectations set 
>> already....
>>
>>     
>>>   
>>>       
>>>> One suggestion is that a modified defrag sweep could be used
>>>> periodically to update the device (a proposal I am not keen on).
>>>>     
>>>>         
>>> No thanks. That needs an implementation per filesystem, and it will
>>> need to be done with the filesystem on line which means it will
>>> still need substantial help from the kernel.
>>>
>>> Cheers,
>>>
>>> Dave.
>>>   
>>>       
>> It does seem to be a mess - especially since people have already gone to 
>> the trouble to put the hooks in to inform the storage in a consistent 
>> and timely way :-)
>>     
>
> I'm sure we can iterate to a conclusion ... even if it's that we won't
> actually do anything other than send down properly formed unmap commands
> and if the array chooses to ignore them, that's its lookout.
>
> James
>
>
>   

Eventually, we will get it (collectively) right...

ric


