From: Ric Wheeler <rwheeler@redhat.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Woodhouse <dwmw2@infradead.org>,
linux-scsi@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Black_David@emc.com,
"Martin K. Petersen" <martin.petersen@oracle.com>,
Tom Coughlan <coughlan@redhat.com>,
Matthew Wilcox <matthew@wil.cx>,
Jens Axboe <jens.axboe@oracle.com>
Subject: Re: thin provisioned LUN support
Date: Thu, 06 Nov 2008 18:10:57 -0500
Message-ID: <49137981.7050308@redhat.com>
In-Reply-To: <1226012815.4703.122.camel@localhost.localdomain>
James Bottomley wrote:
> On Thu, 2008-11-06 at 17:55 -0500, Ric Wheeler wrote:
>
>> Dave Chinner wrote:
>>
>>> On Thu, Nov 06, 2008 at 09:43:23AM -0500, Ric Wheeler wrote:
>>>
>>>
>>>> After talking to some vendors, one issue that came up is that the arrays
>>>> all have a different size that is used internally to track the SCSI
>>>> equivalent of TRIM commands (POKE/unmap).
>>>>
>>>> What they would like is for us to coalesce these commands into aligned
>>>> multiples of these chunks. If not, the target device will most likely
>>>> ignore the bits at the beginning and end (and all small requests).
>>>>
>>>>
>>> There's lots of questions that need to be answered here. e.g:
>>>
>>> Where are these free spaces going to be aggregated before dispatch?
>>>
>>> What happens if they are re-allocated and re-written by the
>>> filesystem before they've been dispatched?
>>>
>>> How is the chunk size going to be passed to the aggregation layer?
>>>
>>> What about passing it to the filesystem so it can align all its
>>> allocations in a manner that simplifies the dispatch problem?
>>>
>>> What happens if a crash occurs before the aggregated free space is
>>> dispatched?
>>>
>>> Are there coherency problems with filesystem recovery after a crash?
>>>
>>>
>> The good thing about these "unmap" commands (SCSI speak this week for
>> TRIM) is that we can drop them if we have to without data integrity
>> concerns.
>>
>> The only thing that you cannot do is to send down an unmap for a block
>> still in use (including ones that have not been committed in a transaction).
>>
>> In SCSI, they plan to zero those blocks so that you will always read a
>> block of zeros back if you try to read an unmapped sector.
>>
>
> Actually, they left this up to the array in the latest spec. If the
> TPRZ bit is set in the Block Device Characteristics VPD then, yes, it
> will return zeros. If not, the return is undefined.
>
>
The RAID vendors were not happy with this & are in the process of
changing it to be:
(1) all zeros, OR
(2) all 1's, OR
(3) other, but always returned consistently until a future write
The concern is that RAID boxes would trip up over parity (if the data
returned for an unmapped block could change between reads).
>> I have no idea how we can pass the aggregation size up from the block
>> layer since it is not currently exported in a uniform way from SCSI.
>> Even if it is, we have struggled to get RAID stripe alignment handled so
>> far.
>>
>
> Well, this is identical to the erase block size (and array stripe size)
> problems we've been discussing. I thought we'd more or less agreed on
> the generic attributes model.
>
We could do it, but need them to put it in a standard place first.
Today, it is exposed only in device specific ways.
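As a rough illustration of what "aligned multiples of these chunks" would
mean for a single freed range, here is a minimal sketch. It assumes the
chunk size were somehow known to the caller, which is exactly the piece
that is not exported uniformly today. The name align_discard and the
byte units are hypothetical, not an existing kernel or SCSI interface.

```python
def align_discard(start, length, chunk):
    """Shrink [start, start + length) to its largest chunk-aligned sub-range.

    All three arguments use the same unit (bytes here, by assumption).
    Returns (aligned_start, aligned_length), or None when the range does
    not cover even one whole chunk and would likely be ignored by the array.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    if length < 0:
        raise ValueError("length must not be negative")
    end = start + length
    first = -(-start // chunk) * chunk  # round the start up to a chunk boundary
    last = (end // chunk) * chunk       # round the end down to a chunk boundary
    if last <= first:
        return None                     # nothing chunk-sized left to unmap
    return first, last - first
```

With a 64KB chunk, a 1MB range starting at offset 4KB loses its unaligned
head and tail, and an 8KB range drops out entirely. Since unmaps can be
dropped without data integrity concerns, discarding the remainder is safe;
it only costs reclaimed space.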
>
>>>> I have been thinking about whether or not we can (and should) do
>>>> anything more than our current best effort to send down large chunks
>>>> (note that the "chunk" size can range from reasonable sizes like 8KB or
>>>> so up to close to 1MB!).
>>>>
>>>>
>>> Any aggregation is only as good as the original allocation the
>>> filesystem did. Look at the mess ext3 creates when untarring a kernel
>>> tarball - blocks are written all over the place. You'd
>>> need to fix that to have any hope of behaving nicely for a RAID
>>> that has a sub-optimal thin provisioning algorithm.
>>>
>>> The problem is not with the filesystem, the block layer or the OS.
>>> If the array vendors have optimised themselves into a corner,
>>> then they should be fixing their problem, not asking the rest of
>>> the world to expend large amounts of effort to work around the
>>> shortcomings of their products.....
>>>
>>>
>> I agree - I think that eventually vendors will end up having to cache
>> the requests internally. The problem is with the customers who will be
>> getting the first generation of gear and have had their expectations set
>> already....
>>
>>
>>>
>>>
>>>> One suggestion is that a modified defrag sweep could be used
>>>> periodically to update the device (a proposal I am not keen on).
>>>>
>>>>
>>> No thanks. That needs an implementation per filesystem, and it will
>>> need to be done with the filesystem on line which means it will
>>> still need substantial help from the kernel.
>>>
>>> Cheers,
>>>
>>> Dave.
>>>
>>>
>> It does seem to be a mess - especially since people have already gone to
>> the trouble to put the hooks in to inform the storage in a consistent
>> and timely way :-)
>>
>
> I'm sure we can iterate to a conclusion ... even if it's that we won't
> actually do anything other than send down properly formed unmap commands
> and if the array chooses to ignore them, that's its lookout.
>
> James
>
>
>
Eventually, we will get it (collectively) right...
ric