From: Ric Wheeler <rwheeler@redhat.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Woodhouse <dwmw2@infradead.org>,
	linux-scsi@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Black_David@emc.com,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Tom Coughlan <coughlan@redhat.com>,
	Matthew Wilcox <matthew@wil.cx>,
	Jens Axboe <jens.axboe@oracle.com>
Subject: Re: thin provisioned LUN support
Date: Thu, 06 Nov 2008 18:10:57 -0500
Message-ID: <49137981.7050308@redhat.com>
In-Reply-To: <1226012815.4703.122.camel@localhost.localdomain>

James Bottomley wrote:
> On Thu, 2008-11-06 at 17:55 -0500, Ric Wheeler wrote:
>   
>> Dave Chinner wrote:
>>     
>>> On Thu, Nov 06, 2008 at 09:43:23AM -0500, Ric Wheeler wrote:
>>>   
>>>       
>>>> After talking to some vendors, one issue that came up is that the arrays  
>>>> all have a different size that is used internally to track the SCSI  
>>>> equivalent of TRIM commands (POKE/unmap).
>>>>
>>>> What they would like is for us to coalesce these commands into aligned  
>>>> multiples of these chunks. If not, the target device will most likely  
>>>> ignore the bits at the beginning and end (and all small requests).
>>>>     
>>>>         
>>> There's lots of questions that need to be answered here. e.g:
>>>
>>> Where are these free spaces going to be aggregated before dispatch?
>>>
>>> What happens if they are re-allocated and re-written by the
>>> filesystem before they've been dispatched?
>>>
>>> How is the chunk size going to be passed to the aggregation layer?
>>>
>>> What about passing it to the filesystem so it can align all its
>>> allocations in a manner that simplifies the dispatch problem?
>>>
>>> What happens if a crash occurs before the aggregated free space is
>>> dispatched?
>>>
>>> Are there coherency problems with filesystem recovery after a crash?
>>>   
>>>       
>> The good thing about these "unmap" commands (SCSI speak this week for 
>> TRIM) is that we can drop them if we have to without data integrity 
>> concerns.
>>
>> The only thing that you cannot do is to send down an unmap for a block 
>> still in use (including ones that have not been committed in a transaction).
>>
>> In SCSI, they plan to zero those blocks so that you will always read a 
>> block of zeros back if you try to read an unmapped sector.
>>     
>
> Actually, they left this up to the array in the latest spec.  If the
> TPRZ bit is set in the Block Device Characteristics VPD then, yes, it
> will return zeros.  If not, the return is undefined.
>
>   

The RAID vendors were not happy with this & are in the process of 
changing it to be:

    (1) all zeros OR
    (2) all 1's OR
    (3) other - but always to be returned consistently until a future write
 
The concern is that RAID boxes would trip up over parity (if it could 
change).
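
A toy illustration of the parity concern (Python, purely a sketch and not anything taken from the spec; the block contents and the helper name are invented for the example): with simple XOR parity across a stripe, the stored parity only keeps matching if an unmapped block reads back the same bytes every time.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR equal-sized blocks byte by byte, as simple RAID parity does."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Two data blocks plus one unmapped block in the same stripe (made-up data).
d0, d1 = b"\x11\x22", b"\x0f\xf0"
unmapped_first_read = b"\x00\x00"   # what the unmapped block returned once
parity = xor_parity([d0, d1, unmapped_first_read])

# Same bytes on the next read: the stored parity still checks out.
stable = xor_parity([d0, d1, b"\x00\x00"]) == parity

# Different bytes on the next read (the "undefined" case): mismatch.
unstable = xor_parity([d0, d1, b"\xff\xff"]) == parity
```

Any fixed pattern (zeros, ones, or something else) would keep `stable` true; only a value that changes between reads breaks the check.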

>> I have no idea how we can pass the aggregation size up from the block 
>> layer since it is not currently exported in a uniform way from SCSI. 
>> Even if it is, we have struggled to get RAID stripe alignment handled so 
>> far.
>>     
>
> Well, this is identical to the erase block size (and array stripe size)
> problems we've been discussing.  I thought we'd more or less agreed on
> the generic attributes model.
>   

We could do it, but need them to put it in a standard place first. 
Today, it is exposed only in device specific ways.
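
To make the alignment request concrete, here is a rough sketch (Python, illustrative only; `chunk_bytes` and `align_discard` are hypothetical names, not an existing kernel interface) of shrinking a freed byte range to the largest chunk-aligned range inside it. Whatever falls outside that range is the part the array would most likely ignore.

```python
def align_discard(start, length, chunk_bytes):
    """Return (start, length) of the largest chunk-aligned range inside
    [start, start + length), or None if no whole chunk fits.

    chunk_bytes is a hypothetical per-device value; nothing exports it
    in a standard place today.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    end = start + length
    # Round the start up and the end down to chunk boundaries.
    aligned_start = -(-start // chunk_bytes) * chunk_bytes
    aligned_end = (end // chunk_bytes) * chunk_bytes
    if aligned_end <= aligned_start:
        return None  # request too small: the array would drop it anyway
    return aligned_start, aligned_end - aligned_start
```

With a 1MB chunk, a 2MB free extent that starts 4KB into a chunk shrinks to a single aligned 1MB unmap, and an 8KB extent that straddles an 8KB boundary yields nothing at all.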

>   
>>>> I have been thinking about whether or not we can (and should) do  
>>>> anything more than our current best effort to send down large chunks  
>>>> (note that the "chunk" size can range from reasonable sizes like 8KB or  
>>>> so up to close to 1MB!).
>>>>     
>>>>         
>>> Any aggregation is only as good as the original allocation the
>>> filesystem did. Look at the mess ext3 creates when untarring a kernel
>>> tarball - blocks are written to all over the place. You'd
>>> need to fix that to have any hope of behaving nicely for a RAID
>>> that has a sub-optimal thin provisioning algorithm.
>>>
>>> The problem is not with the filesystem, the block layer or the OS.
>>> If the array vendors have optimised themselves into a corner,
>>> then they should be fixing their problem, not asking the rest of
>>> the world to expend large amounts of effort to work around the
>>> shortcomings of their products.....
>>>   
>>>       
>> I agree - I think that eventually vendors will end up having to cache 
>> the requests internally. The problem is with the customers who will be 
>> getting the first generation of gear and have had their expectations set 
>> already....
>>
>>     
>>>   
>>>       
>>>> One suggestion is that a modified defrag sweep could be used
>>>> periodically to update the device (a proposal I am not keen on).
>>>>     
>>>>         
>>> No thanks. That needs an implementation per filesystem, and it will
>>> need to be done with the filesystem on line which means it will
>>> still need substantial help from the kernel.
>>>
>>> Cheers,
>>>
>>> Dave.
>>>   
>>>       
>> It does seem to be a mess - especially since people have already gone to 
>> the trouble to put the hooks in to inform the storage in a consistent 
>> and timely way :-)
>>     
>
> I'm sure we can iterate to a conclusion ... even if it's that we won't
> actually do anything other than send down properly formed unmap commands
> and if the array chooses to ignore them, that's its lookout.
>
> James
>
>
>   

Eventually, we will get it (collectively) right...

ric


