* Re: thin provisioned LUN support
2008-11-06 14:43 thin provisioned LUN support Ric Wheeler
@ 2008-11-06 15:17 ` James Bottomley
2008-11-06 15:24 ` David Woodhouse
2008-11-06 15:27 ` thin provisioned LUN support jim owens
` (6 subsequent siblings)
7 siblings, 1 reply; 105+ messages in thread
From: James Bottomley @ 2008-11-06 15:17 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, linux-scsi, linux-fsdevel, Black_David,
Martin K. Petersen, Tom Coughlan, Matthew Wilcox, Jens Axboe
On Thu, 2008-11-06 at 09:43 -0500, Ric Wheeler wrote:
> After talking to some vendors, one issue that came up is that the arrays
> all have a different size that is used internally to track the SCSI
> equivalent of TRIM commands (POKE/unmap).
>
> What they would like is for us to coalesce these commands into aligned
> multiples of these chunks. If not, the target device will most likely
> ignore the bits at the beginning and end (and all small requests).
>
> I have been thinking about whether or not we can (and should) do
> anything more than our current best effort to send down large chunks
> (note that the "chunk" size can range from reasonable sizes like 8KB or
> so up to close to 1MB!).
>
> One suggestion is that a modified defrag sweep could be used
> periodically to update the device (a proposal I am not keen on).
>
> Thoughts?
This one's a bit nasty. We can't just use elevator techniques (assuming
we wanted to) because a) the deletions are going to obey different
statistics and b) the elevator eventually releases the incorrectly
sized units, which then get ignored.
The way to do this properly would be to run a chequerboard of partials,
but this would effectively have trim region tracking done in the block
layer ... is this worth it?
By the way, the latest (from 2 days ago) version of the Thin
Provisioning proposal is here:
http://www.t10.org/ftp/t10/document.08/08-149r4.pdf
I skimmed it but don't see any update implying that trim might be
ineffective if we align wrongly ... where is this?
James
* Re: thin provisioned LUN support
2008-11-06 15:17 ` James Bottomley
@ 2008-11-06 15:24 ` David Woodhouse
2008-11-06 16:00 ` Ric Wheeler
` (2 more replies)
0 siblings, 3 replies; 105+ messages in thread
From: David Woodhouse @ 2008-11-06 15:24 UTC (permalink / raw)
To: James Bottomley
Cc: Ric Wheeler, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox,
Jens Axboe
On Thu, 6 Nov 2008, James Bottomley wrote:
> The way to do this properly would be to run a chequerboard of partials,
> but this would effectively have trim region tracking done in the block
> layer ... is this worth it?
>
> By the way, the latest (from 2 days ago) version of the Thin
> Provisioning proposal is here:
>
> http://www.t10.org/ftp/t10/document.08/08-149r4.pdf
>
> I skimmed it but don't see any update implying that trim might be
> ineffective if we align wrongly ... where is this?
I think we should be content to declare such devices 'broken'.
They have to keep track of individual sectors _anyway_, and dropping
information for small discard requests is just careless.
--
dwmw2
* Re: thin provisioned LUN support
2008-11-06 15:24 ` David Woodhouse
@ 2008-11-06 16:00 ` Ric Wheeler
2008-11-06 16:40 ` Martin K. Petersen
2008-11-06 17:15 ` Matthew Wilcox
2008-11-07 12:05 ` Jens Axboe
2 siblings, 1 reply; 105+ messages in thread
From: Ric Wheeler @ 2008-11-06 16:00 UTC (permalink / raw)
To: David Woodhouse
Cc: James Bottomley, Ric Wheeler, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox,
Jens Axboe
David Woodhouse wrote:
> On Thu, 6 Nov 2008, James Bottomley wrote:
>> The way to do this properly would be to run a chequerboard of partials,
>> but this would effectively have trim region tracking done in the block
>> layer ... is this worth it?
>>
>> By the way, the latest (from 2 days ago) version of the Thin
>> Provisioning proposal is here:
>>
>> http://www.t10.org/ftp/t10/document.08/08-149r4.pdf
>>
>> I skimmed it but don't see any update implying that trim might be
>> ineffective if we align wrongly ... where is this?
>
> I think we should be content to declare such devices 'broken'.
>
> They have to keep track of individual sectors _anyway_, and dropping
> information for small discard requests is just careless.
>
Big arrays have an internal "track" size that is much larger than 512
bytes (the Symm, for example, uses 64k); everything smaller than that is
a read-modify-write. Effectively, they have few to no bits left to track
this other bit of state at that small a granularity.
The thing that makes this even more twisted is that the erase/unmap
chunk size is a multiple of the internal size (which would already be an
issue :-))
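To make the alignment problem concrete, here is a minimal userspace
sketch (the chunk size is purely an illustrative parameter, not
something the device reports through any standard field today) of the
inward rounding such an array effectively applies, dropping the
misaligned head and tail of each discard:

#include <stdint.h>
#include <stdio.h>

/*
 * Round a discard range inward to the array's unmap chunk size.
 * Hypothetical illustration only: chunk stands in for whatever the
 * device uses internally (8KB up to ~1MB in the cases discussed
 * above).  Returns 0 if nothing survives the alignment.
 */
static int align_discard(uint64_t start, uint64_t len, uint64_t chunk,
			 uint64_t *out_start, uint64_t *out_len)
{
	uint64_t aligned_start = (start + chunk - 1) / chunk * chunk;
	uint64_t aligned_end = (start + len) / chunk * chunk;

	if (aligned_end <= aligned_start)
		return 0;	/* smaller than one chunk: simply ignored */

	*out_start = aligned_start;
	*out_len = aligned_end - aligned_start;
	return 1;
}

int main(void)
{
	uint64_t s, l;

	/* a 1MB discard starting 4KB into a device with 64KB chunks */
	if (align_discard(4096, 1 << 20, 65536, &s, &l))
		printf("array honours [%llu, +%llu)\n",
		       (unsigned long long)s, (unsigned long long)l);
	return 0;
}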
Ric
* Re: thin provisioned LUN support
2008-11-06 16:00 ` Ric Wheeler
@ 2008-11-06 16:40 ` Martin K. Petersen
2008-11-06 17:04 ` Ric Wheeler
0 siblings, 1 reply; 105+ messages in thread
From: Martin K. Petersen @ 2008-11-06 16:40 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox,
Jens Axboe
>>>>> "Ric" == Ric Wheeler <rwheeler@redhat.com> writes:
Ric> Big arrays have an internal "track" size that is much larger than
Ric> 512 bytes (Symm for example is 64k. Everything smaller than that
Ric> is a read-modify-write. Effectively, they have few to no bits
Ric> left to track this other bit of state at the smal level of
Ric> granularity.
My point still stands that this is an implementation problem in the
array firmware. It's not really our problem to solve. Especially
since drive vendors and mid-range storage vendors are getting it
right.
If EMC wants to provide thin provisioning on the Symmetrix they'll
have to overcome the inherent limitations in their own internal
architecture.
What's next? Requiring us to exclusively read and write in multiples
of block sizes that are artifacts of internal array implementation
details?
For thin provisioning to really work on the Symm, then, we're
effectively requiring the filesystem to use 64k filesystem blocks.
Lame!
I don't have a problem with honoring some bits akin to the block
device characteristics VPD in trying to do a best-effort scheduling of
the I/O. But effectively disabling thin provisioning for blocks
smaller than that is simply broken.
--
Martin K. Petersen Oracle Linux Engineering
* Re: thin provisioned LUN support
2008-11-06 16:40 ` Martin K. Petersen
@ 2008-11-06 17:04 ` Ric Wheeler
0 siblings, 0 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-06 17:04 UTC (permalink / raw)
To: Martin K. Petersen
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan, Matthew Wilcox, Jens Axboe
Martin K. Petersen wrote:
>>>>>> "Ric" == Ric Wheeler <rwheeler@redhat.com> writes:
>>>>>>
>
> Ric> Big arrays have an internal "track" size that is much larger than
> Ric> 512 bytes (Symm for example is 64k. Everything smaller than that
> Ric> is a read-modify-write. Effectively, they have few to no bits
> Ric> left to track this other bit of state at the smal level of
> Ric> granularity.
>
> My point still stands that this is an implementation problem in the
> array firmware. It's not really our problem to solve. Especially
> since drive vendors and mid-range storage vendors are getting it
> right.
>
> If EMC wants to provide thin provisioning on the Symmetrix they'll
> have to overcome the inherent limitations in their own internal
> architecture.
>
> What's next? Requiring us to exclusively read and write in multiples
> of block sizes that are artifacts of internal array implementation
> details?
>
> For thin provisioning to really work on the Symm, then, we're
> effectively requiring the filesystem to use 64k filesystem blocks.
> Lame!
>
> I don't have a problem with honoring some bits akin to the block
> device characteristics VPD in trying to do a best-effort scheduling of
> the I/O. But effectively disabling thin provisioning for blocks
> smaller than that is simply broken.
>
Actually, the big arrays have larger "unmap" chunks than the 64k, so it
is even more painful than that :-)
What is likely to happen is that "thin" will work for vendors with this
kind of limitation in a very limited way and they will need to have some
kind of user space tool to clean up the mess.
ric
* Re: thin provisioned LUN support
2008-11-06 15:24 ` David Woodhouse
2008-11-06 16:00 ` Ric Wheeler
@ 2008-11-06 17:15 ` Matthew Wilcox
2008-11-07 12:05 ` Jens Axboe
2 siblings, 0 replies; 105+ messages in thread
From: Matthew Wilcox @ 2008-11-06 17:15 UTC (permalink / raw)
To: David Woodhouse
Cc: James Bottomley, Ric Wheeler, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Jens Axboe
On Thu, Nov 06, 2008 at 03:24:05PM +0000, David Woodhouse wrote:
> I think we should be content to declare such devices 'broken'.
>
> They have to keep track of individual sectors _anyway_, and dropping
> information for small discard requests is just careless.
As an implementor of such a device, I say "ya, boo, sucks to you".
ata_ram simply ignores the bits of the trim which don't line up with the
page size chunks it's allocated. Sure, it'd be possible to add a bitmap
to indicate which 512-byte chunks of the block contain data and which
don't, but I haven't done that yet. I think there's even space in the
struct page that I can abuse to do that.
I think this really is a QoI thing. Vendors who don't track individual
sectors will gradually get less and less efficient. Hopefully users
will buy from vendors who don't cheat. We can even write a quick
program to allocate the entire drive then trim sectors in a chessboard
pattern. That'll let users see who's got a crap implementation and
who's got a good one.
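A rough sketch of what such a test program might look like, assuming
the Linux BLKDISCARD ioctl is available; the device path and chunk size
are placeholders, and the step of filling the drive beforehand is left
out:

/*
 * Discard every other chunk across a block device, so one can then
 * measure how well the firmware copes with the chessboard free map.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/sdX";  /* placeholder */
	uint64_t chunk = 4096;                              /* placeholder */
	uint64_t size, off;
	int fd = open(dev, O_RDWR);

	if (fd < 0 || ioctl(fd, BLKGETSIZE64, &size) < 0) {
		perror(dev);
		return 1;
	}

	/* discard chunks 0, 2, 4, ... leaving the odd ones mapped */
	for (off = 0; off + chunk <= size; off += 2 * chunk) {
		uint64_t range[2] = { off, chunk };

		if (ioctl(fd, BLKDISCARD, range) < 0) {
			perror("BLKDISCARD");
			break;
		}
	}
	close(fd);
	return 0;
}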
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: thin provisioned LUN support
2008-11-06 15:24 ` David Woodhouse
2008-11-06 16:00 ` Ric Wheeler
2008-11-06 17:15 ` Matthew Wilcox
@ 2008-11-07 12:05 ` Jens Axboe
2008-11-07 12:14 ` Ric Wheeler
2008-11-07 15:49 ` Chris Mason
2 siblings, 2 replies; 105+ messages in thread
From: Jens Axboe @ 2008-11-07 12:05 UTC (permalink / raw)
To: David Woodhouse
Cc: James Bottomley, Ric Wheeler, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
On Thu, Nov 06 2008, David Woodhouse wrote:
> On Thu, 6 Nov 2008, James Bottomley wrote:
> >The way to do this properly would be to run a chequerboard of partials,
> >but this would effectively have trim region tracking done in the block
> >layer ... is this worth it?
> >
> >By the way, the latest (from 2 days ago) version of the Thin
> >Provisioning proposal is here:
> >
> >http://www.t10.org/ftp/t10/document.08/08-149r4.pdf
> >
> >I skimmed it but don't see any update implying that trim might be
> >ineffective if we align wrongly ... where is this?
>
> I think we should be content to declare such devices 'broken'.
>
> They have to keep track of individual sectors _anyway_, and dropping
> information for small discard requests is just careless.
I agree, seems pretty pointless. Let's let evolution take care of this
issue. I have to say I'm surprised that it really IS an issue to begin
with; are array firmwares really that silly?
It's not that it would be hard to support (and it would eliminate the
need to do discard merging in the block layer), but it seems like one of
those things that will be of little use even in the near future.
Discard merging should be useful; I have no problem merging something
like that.
--
Jens Axboe
* Re: thin provisioned LUN support
2008-11-07 12:05 ` Jens Axboe
@ 2008-11-07 12:14 ` Ric Wheeler
2008-11-07 12:17 ` David Woodhouse
` (2 more replies)
2008-11-07 15:49 ` Chris Mason
1 sibling, 3 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 12:14 UTC (permalink / raw)
To: Jens Axboe
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
Jens Axboe wrote:
> On Thu, Nov 06 2008, David Woodhouse wrote:
>
>> On Thu, 6 Nov 2008, James Bottomley wrote:
>>
>>> The way to do this properly would be to run a chequerboard of partials,
>>> but this would effectively have trim region tracking done in the block
>>> layer ... is this worth it?
>>>
>>> By the way, the latest (from 2 days ago) version of the Thin
>>> Provisioning proposal is here:
>>>
>>> http://www.t10.org/ftp/t10/document.08/08-149r4.pdf
>>>
>>> I skimmed it but don't see any update implying that trim might be
>>> ineffective if we align wrongly ... where is this?
>>>
>> I think we should be content to declare such devices 'broken'.
>>
>> They have to keep track of individual sectors _anyway_, and dropping
>> information for small discard requests is just careless.
>>
>
> I agree, seems pretty pointless. Lets let evolution take care of this
> issue. I have to say I'm surprised that it really IS an issue to begin
> with, are array firmwares really that silly?
>
> It's not that it would be hard to support (and it would eliminate the
> need to do discard merging in the block layer), but it seems like one of
> those things that will be of little use in even in the near future.
> Discard merging should be useful, I have no problem merging something
> like that.
>
>
I think that discard merging would be helpful (especially for devices
with more reasonable sized unmap chunks).
Ric
* Re: thin provisioned LUN support
2008-11-07 12:14 ` Ric Wheeler
@ 2008-11-07 12:17 ` David Woodhouse
2008-11-07 12:19 ` Jens Axboe
2008-11-07 15:20 ` thin provisioned LUN support James Bottomley
2 siblings, 0 replies; 105+ messages in thread
From: David Woodhouse @ 2008-11-07 12:17 UTC (permalink / raw)
To: Ric Wheeler
Cc: Jens Axboe, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
On Fri, 2008-11-07 at 07:14 -0500, Ric Wheeler wrote:
> I think that discard merging would be helpful (especially for devices
> with more reasonable sized unmap chunks).
First we need generic fixes to the elevator code. It already notices
when you submit a request which _precisely_ matches an existing one in
both start and length, and will ensure that they happen in the right
order. But it _doesn't_ cope with requests that just happen to overlap.
It's on my TODO list; fairly near the top.
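For illustration only (a toy model, not the actual elevator code), the
difference between the two checks is just this:

#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a queued request: a [start, start + len) range. */
struct toy_req {
	uint64_t start;
	uint64_t len;
};

/* What gets caught today: identical ranges. */
bool exact_match(const struct toy_req *a, const struct toy_req *b)
{
	return a->start == b->start && a->len == b->len;
}

/* What would also need catching to order overlapping requests. */
bool overlaps(const struct toy_req *a, const struct toy_req *b)
{
	return a->start < b->start + b->len && b->start < a->start + a->len;
}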
--
David Woodhouse Open Source Technology Centre
David.Woodhouse@intel.com Intel Corporation
* Re: thin provisioned LUN support
2008-11-07 12:14 ` Ric Wheeler
2008-11-07 12:17 ` David Woodhouse
@ 2008-11-07 12:19 ` Jens Axboe
2008-11-07 14:26 ` thin provisioned LUN support & file system allocation policy Ric Wheeler
2008-11-07 15:20 ` thin provisioned LUN support James Bottomley
2 siblings, 1 reply; 105+ messages in thread
From: Jens Axboe @ 2008-11-07 12:19 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
On Fri, Nov 07 2008, Ric Wheeler wrote:
> Jens Axboe wrote:
> >On Thu, Nov 06 2008, David Woodhouse wrote:
> >
> >>On Thu, 6 Nov 2008, James Bottomley wrote:
> >>
> >>>The way to do this properly would be to run a chequerboard of partials,
> >>>but this would effectively have trim region tracking done in the block
> >>>layer ... is this worth it?
> >>>
> >>>By the way, the latest (from 2 days ago) version of the Thin
> >>>Provisioning proposal is here:
> >>>
> >>>http://www.t10.org/ftp/t10/document.08/08-149r4.pdf
> >>>
> >>>I skimmed it but don't see any update implying that trim might be
> >>>ineffective if we align wrongly ... where is this?
> >>>
> >>I think we should be content to declare such devices 'broken'.
> >>
> >>They have to keep track of individual sectors _anyway_, and dropping
> >>information for small discard requests is just careless.
> >>
> >
> >I agree, seems pretty pointless. Lets let evolution take care of this
> >issue. I have to say I'm surprised that it really IS an issue to begin
> >with, are array firmwares really that silly?
> >
> >It's not that it would be hard to support (and it would eliminate the
> >need to do discard merging in the block layer), but it seems like one of
> >those things that will be of little use in even in the near future.
> >Discard merging should be useful, I have no problem merging something
> >like that.
> >
> >
> I think that discard merging would be helpful (especially for devices
> with more reasonable sized unmap chunks).
Indeed, and it fits in well with what we do already. Dave has this
mostly done I think, so 2.6.29 should be a potential target provided
that it gets sent soon :-)
--
Jens Axboe
* thin provisioned LUN support & file system allocation policy
2008-11-07 12:19 ` Jens Axboe
@ 2008-11-07 14:26 ` Ric Wheeler
2008-11-07 14:34 ` Matthew Wilcox
2008-11-07 14:43 ` Theodore Tso
0 siblings, 2 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 14:26 UTC (permalink / raw)
To: Jens Axboe, Chris Mason, Theodore Tso, Dave Chinner
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
One more consideration that I should have mentioned is that we can also
make our file system allocation policies "thin provisioned LUN" friendly.
Basically, we need to try to re-allocate blocks instead of letting the
allocations happily progress across the entire block range. This might
be the inverse of an SSD friendly allocation policy, but would seem to
be fairly trivial to implement :-)
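As a toy illustration of that policy (entirely hypothetical, not any
existing filesystem's allocator), a lowest-free-block bitmap allocator
reuses freed blocks before the footprint spreads across the LUN:

#include <stdint.h>

#define NR_BLOCKS 4096	/* toy volume size, in blocks */

/* One bit per block: 1 = in use. */
static uint8_t bitmap[NR_BLOCKS / 8];

/* Always hand back the lowest free block, so freed blocks are reused
 * quickly and the "hot" low ranges stay hot. */
long alloc_block(void)
{
	for (long i = 0; i < NR_BLOCKS; i++) {
		if (!(bitmap[i / 8] & (1 << (i % 8)))) {
			bitmap[i / 8] |= 1 << (i % 8);
			return i;
		}
	}
	return -1;		/* volume full */
}

void free_block(long i)
{
	bitmap[i / 8] &= ~(1 << (i % 8));	/* next allocation candidate */
}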
ric
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 14:26 ` thin provisioned LUN support & file system allocation policy Ric Wheeler
@ 2008-11-07 14:34 ` Matthew Wilcox
2008-11-07 14:45 ` Jörn Engel
2008-11-07 14:43 ` Theodore Tso
1 sibling, 1 reply; 105+ messages in thread
From: Matthew Wilcox @ 2008-11-07 14:34 UTC (permalink / raw)
To: Ric Wheeler
Cc: Jens Axboe, Chris Mason, Theodore Tso, Dave Chinner,
David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan
On Fri, Nov 07, 2008 at 09:26:49AM -0500, Ric Wheeler wrote:
> One more consideration that I should have mentioned is that we can also
> make our file system allocation policies "thin provisioned LUN" friendly.
>
> Basically, we need to try to re-allocate blocks instead of letting the
> allocations happily progress across the entire block range. This might
> be the inverse of an SSD friendly allocation policy, but would seem to
> be fairly trivial to implement :-)
It's the opposite of a _flash_ friendly policy. But SSDs are not naive
flash implementations -- if you overwrite a block, it'll just write
elsewhere and update its internal mapping of LBAs to sectors. I
honestly think there's no difference in performance between overwriting
a block and writing elsewhere ... as long as you TRIM the LBAs you're no
longer using, of course ;-)
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 14:34 ` Matthew Wilcox
@ 2008-11-07 14:45 ` Jörn Engel
0 siblings, 0 replies; 105+ messages in thread
From: Jörn Engel @ 2008-11-07 14:45 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ric Wheeler, Jens Axboe, Chris Mason, Theodore Tso, Dave Chinner,
David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan
On Fri, 7 November 2008 07:34:26 -0700, Matthew Wilcox wrote:
>
> It's the opposite of a _flash_ friendly policy. But SSDs are not naive
> flash implementations -- if you overwrite a block, it'll just write
> elsewhere and update its internal mapping of LBAs to sectors. I
Note that for many of the *cough* cheaper implementations, "elsewhere"
is still very close to the specified address.
Jörn
--
When people work hard for you for a pat on the back, you've got
to give them that pat.
-- Robert Heinlein
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 14:26 ` thin provisioned LUN support & file system allocation policy Ric Wheeler
2008-11-07 14:34 ` Matthew Wilcox
@ 2008-11-07 14:43 ` Theodore Tso
2008-11-07 14:54 ` Ric Wheeler
2008-11-07 14:55 ` Matthew Wilcox
1 sibling, 2 replies; 105+ messages in thread
From: Theodore Tso @ 2008-11-07 14:43 UTC (permalink / raw)
To: Ric Wheeler
Cc: Jens Axboe, Chris Mason, Dave Chinner, David Woodhouse,
James Bottomley, linux-scsi, linux-fsdevel, Black_David,
Martin K. Petersen, Tom Coughlan, Matthew Wilcox
On Fri, Nov 07, 2008 at 09:26:49AM -0500, Ric Wheeler wrote:
>
> One more consideration that I should have mentioned is that we can also
> make our file system allocation policies "thin provisioned LUN" friendly.
>
> Basically, we need to try to re-allocate blocks instead of letting the
> allocations happily progress across the entire block range. This might
> be the inverse of an SSD friendly allocation policy, but would seem to
> be fairly trivial to implement :-)
I would think that most non log-structured filesystems do this by
default.
The one thing we might need for SSD-friendly allocation policies is to
tell the allocators to not try so hard to make sure allocations are
contiguous, but there are other reasons why you want contiguous
extents anyway (such as reducing the size of your extent tree and
reducing the number of block allocation data structures that need to
be updated). And, I think to some extent SSD's do care to some level
about contiguous extents, from the point of view of reducing scatter
gather operations if nothing else, right?
- Ted
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 14:43 ` Theodore Tso
@ 2008-11-07 14:54 ` Ric Wheeler
2008-11-07 15:26 ` jim owens
2008-11-07 14:55 ` Matthew Wilcox
1 sibling, 1 reply; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 14:54 UTC (permalink / raw)
To: Theodore Tso, Ric Wheeler, Jens Axboe, Chris Mason, Dave Chinner,
David
Theodore Tso wrote:
> On Fri, Nov 07, 2008 at 09:26:49AM -0500, Ric Wheeler wrote:
>
>> One more consideration that I should have mentioned is that we can also
>> make our file system allocation policies "thin provisioned LUN" friendly.
>>
>> Basically, we need to try to re-allocate blocks instead of letting the
>> allocations happily progress across the entire block range. This might
>> be the inverse of an SSD friendly allocation policy, but would seem to
>> be fairly trivial to implement :-)
>>
>
> I would think that most non log-structured filesystems do this by
> default.
>
I am not sure - it would be interesting to use blktrace to build a
visual map of how we allocate/free blocks as a file system ages.
> The one thing we might need for SSD-friendly allocation policies is to
> tell the allocators to not try so hard to make sure allocations are
> contiguous, but there are other reasons why you want contiguous
> extents anyway (such as reducing the size of your extent tree and
> reducing the number of block allocation data structures that need to
> be updated). And, I think to some extent SSD's do care to some level
> about contiguous extents, from the point of view of reducing scatter
> gather operations if nothing else, right?
>
> - Ted
>
I think that contiguous allocations are still important (especially
since the big arrays really like to have contiguous, large chunks of
space freed up at once so their unmap/TRIM support works better :-)).
For SSDs, streaming writes are still faster than scattered small block
writes, so I think contiguous allocation would help them as well.
The type of allocation that would help most is something that tries to
keep the lower block ranges "hot" for allocation; the second-best policy
would simply keep the allocated blocks in each block group hot and
re-allocate them.
One other interesting feature is that thin LUNs have a high water
mark which can be used to send an out-of-band (i.e., to some user space
app) notification when you hit a specified percentage of your physically
allocated blocks. The key is to set this so that a human can have time
to react by trying to expand the size of the physical pool (throw in
another disk).
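One way a user space agent could approximate that is to watch the pool
utilisation itself; a trivial sketch, where the utilisation source is
array-specific and only a placeholder function here:

#include <stdio.h>

/* Placeholder: in reality this would come from the array's management
 * interface; there is no standard call for it shown here. */
static double pool_utilisation(void) { return 0.87; }

int main(void)
{
	const double high_water = 0.85;	/* configured threshold */
	double used = pool_utilisation();

	if (used >= high_water)
		fprintf(stderr,
			"thin pool %.0f%% full (threshold %.0f%%): grow it\n",
			used * 100, high_water * 100);
	return 0;
}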
We could trigger some file system cleanup at this point as well if we
could try to repack our allocated blocks and then update the array. Of
course, this would only help when the array's concept of used data is
wildly out of sync with our concept of allocated blocks, which happens
when it drops the unmap commands or we don't send them.
Ric
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 14:54 ` Ric Wheeler
@ 2008-11-07 15:26 ` jim owens
2008-11-07 15:31 ` David Woodhouse
` (2 more replies)
0 siblings, 3 replies; 105+ messages in thread
From: jim owens @ 2008-11-07 15:26 UTC (permalink / raw)
To: Ric Wheeler
Cc: Theodore Tso, Jens Axboe, Chris Mason, Dave Chinner,
David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
Ric Wheeler wrote:
> The type of allocation that would help most is something that tries to
> keep the lower block ranges "hot" for allocation, second best policy
> would simply keep the allocated blocks in each block group hot and
> re-allocate them.
This block reuse policy ignores the issue of wear leveling...
as in most design things, trading one problem for another.
jim
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:26 ` jim owens
@ 2008-11-07 15:31 ` David Woodhouse
2008-11-07 15:35 ` jim owens
2008-11-07 15:36 ` James Bottomley
2008-11-07 15:36 ` Theodore Tso
2008-11-07 16:02 ` Ric Wheeler
2 siblings, 2 replies; 105+ messages in thread
From: David Woodhouse @ 2008-11-07 15:31 UTC (permalink / raw)
To: jim owens
Cc: Ric Wheeler, Theodore Tso, Jens Axboe, Chris Mason, Dave Chinner,
David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
On Fri, 7 Nov 2008, jim owens wrote:
> Ric Wheeler wrote:
>
>> The type of allocation that would help most is something that tries to keep
>> the lower block ranges "hot" for allocation, second best policy would
>> simply keep the allocated blocks in each block group hot and re-allocate
>> them.
>
> This block reuse policy ignores the issue of wear leveling...
> as in most design things, trading one problem for another.
For SSDs we're being told not to worry our pretty little heads about wear
levelling. That gets done for us, with varying degrees of competence,
within the black box. All we can do to improve that is pray...
and maybe sacrifice the occasional goat.
--
dwmw2
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:31 ` David Woodhouse
@ 2008-11-07 15:35 ` jim owens
2008-11-07 15:46 ` Theodore Tso
2008-11-07 15:36 ` James Bottomley
1 sibling, 1 reply; 105+ messages in thread
From: jim owens @ 2008-11-07 15:35 UTC (permalink / raw)
To: David Woodhouse
Cc: Ric Wheeler, Theodore Tso, Jens Axboe, Chris Mason, Dave Chinner,
James Bottomley, linux-scsi, linux-fsdevel, Black_David,
Martin K. Petersen, Tom Coughlan, Matthew Wilcox
David Woodhouse wrote:
> On Fri, 7 Nov 2008, jim owens wrote:
>
>> Ric Wheeler wrote:
>>
>>> The type of allocation that would help most is something that tries
>>> to keep the lower block ranges "hot" for allocation, second best
>>> policy would simply keep the allocated blocks in each block group hot
>>> and re-allocate them.
>>
>> This block reuse policy ignores the issue of wear leveling...
>> as in most design things, trading one problem for another.
>
> For SSDs we're being told not to worry our pretty little heads about
> wear levelling. That gets done for us, with varying degrees of
> competence, within the black box. All we can do to improve that is
> pray... and maybe sacrifice the occasional goat.
>
I'm talking DISK wear, not SSD. The array vendors who are causing
this problem are doing petabyte SAN devices, not SSDs.
Rewriting the same sectors causes more bad block remaps
until the drive eventually runs out of remap space.
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:35 ` jim owens
@ 2008-11-07 15:46 ` Theodore Tso
2008-11-07 15:51 ` Martin K. Petersen
2008-11-07 15:56 ` James Bottomley
0 siblings, 2 replies; 105+ messages in thread
From: Theodore Tso @ 2008-11-07 15:46 UTC (permalink / raw)
To: jim owens
Cc: David Woodhouse, Ric Wheeler, Jens Axboe, Chris Mason,
Dave Chinner, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
On Fri, Nov 07, 2008 at 10:35:13AM -0500, jim owens wrote:
>
> I'm talking DISK wear not SSD. The array vendors who are causing
> this problem are doing petabyte san devices, not SSDs.
>
> Rewriting the same sectors causes more bad block remaps
> until the drive eventually runs out of remap space.
How much of a disk wear factor is there with modern disk drives? The
heads aren't touching the disk, and we have plenty of sectors which
are constantly getting rewritten with traditional filesystems, with no
ill effects as far as I know. For example, FAT tables, superblocks,
and block allocation bitmaps are all constantly getting rewritten
today, and I haven't heard of disk manufacturers complaining that this
is a horrible thing.
- Ted
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:46 ` Theodore Tso
@ 2008-11-07 15:51 ` Martin K. Petersen
2008-11-07 16:06 ` Ric Wheeler
2008-11-07 15:56 ` James Bottomley
1 sibling, 1 reply; 105+ messages in thread
From: Martin K. Petersen @ 2008-11-07 15:51 UTC (permalink / raw)
To: Theodore Tso
Cc: jim owens, David Woodhouse, Ric Wheeler, Jens Axboe, Chris Mason,
Dave Chinner, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
>>>>> "Ted" == Theodore Tso <tytso@mit.edu> writes:
Ted> How much of a disk wear factor is there with modern disk drives?
Ted> The heads aren't touching the disk, and we have plenty of sectors
Ted> which are constantly getting rewritten with traditional
Ted> filesystems, with no ill effects as far as I know.
Modern disk firmware maintains a list of write hot spots and will
regularly rewrite adjacent sectors to prevent bleed.
--
Martin K. Petersen Oracle Linux Engineering
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:51 ` Martin K. Petersen
@ 2008-11-07 16:06 ` Ric Wheeler
0 siblings, 0 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 16:06 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Theodore Tso, jim owens, David Woodhouse, Jens Axboe, Chris Mason,
Dave Chinner, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan, Matthew Wilcox
Martin K. Petersen wrote:
>>>>>> "Ted" == Theodore Tso <tytso@mit.edu> writes:
>>>>>>
>
> Ted> How much of a disk wear factor is there with modern disk drives?
> Ted> The heads aren't touching the disk, and we have plenty of sectors
> Ted> which are constantly getting rewritten with traditional
> Ted> filesystems, with no ill effects as far as I know.
>
> Modern disk firmware maintains a list of write hot spots and will
> regularly rewrite adjacent sectors to prevent bleed.
>
>
This is my understanding (based on looking at lots of disks from my EMC
days). Again, this is not an issue for a Symm/Hitachi/Shark class array
since they all abstract away this kind of hot spotting.
Where it is an issue with local drives is when you constantly (like
every 20ms) try to update the same sector and that triggers the kind of
adjacent track erasure issues you mention here. Disk block allocation
policies that reuse blocks will update the same sectors orders of
magnitude less often than this (and much less often than we rewrite
block allocation bitmaps, etc.).
ric
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:46 ` Theodore Tso
2008-11-07 15:51 ` Martin K. Petersen
@ 2008-11-07 15:56 ` James Bottomley
1 sibling, 0 replies; 105+ messages in thread
From: James Bottomley @ 2008-11-07 15:56 UTC (permalink / raw)
To: Theodore Tso
Cc: jim owens, David Woodhouse, Ric Wheeler, Jens Axboe, Chris Mason,
Dave Chinner, linux-scsi, linux-fsdevel, Black_David,
Martin K. Petersen, Tom Coughlan, Matthew Wilcox
On Fri, 2008-11-07 at 10:46 -0500, Theodore Tso wrote:
> On Fri, Nov 07, 2008 at 10:35:13AM -0500, jim owens wrote:
> >
> > I'm talking DISK wear not SSD. The array vendors who are causing
> > this problem are doing petabyte san devices, not SSDs.
> >
> > Rewriting the same sectors causes more bad block remaps
> > until the drive eventually runs out of remap space.
>
> How much of a disk wear factor is there with modern disk drives? The
> heads aren't touching the disk, and we have plenty of sectors which
> are constantly getting rewritten with traditional filesystems, with no
> ill effects as far as I know. For example, FAT filesystems, the
> superblock, block allocation bitmaps all are constantly getting
> rewritten today, and I haven't heard of disk manufacturers complaining
> that this is a horrible thing.
All the evidence so far (the NetApp and Google et al. error analysis
papers) seems to imply that hot rewrite spots don't actually correlate
with failures. The suspicion is that remapping algorithms are good
enough to hide the problem and, even if that is true, it's not something
we need to worry about too much. The other thought is that wear on
spinning media is mechanical rather than electromagnetic, so it doesn't
matter how many times the sector is rewritten but how many times the
head flies over the area (which is something we'll never manage to
control).
James
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:31 ` David Woodhouse
2008-11-07 15:35 ` jim owens
@ 2008-11-07 15:36 ` James Bottomley
2008-11-07 15:48 ` David Woodhouse
1 sibling, 1 reply; 105+ messages in thread
From: James Bottomley @ 2008-11-07 15:36 UTC (permalink / raw)
To: David Woodhouse
Cc: jim owens, Ric Wheeler, Theodore Tso, Jens Axboe, Chris Mason,
Dave Chinner, linux-scsi, linux-fsdevel, Black_David,
Martin K. Petersen, Tom Coughlan, Matthew Wilcox
On Fri, 2008-11-07 at 15:31 +0000, David Woodhouse wrote:
> On Fri, 7 Nov 2008, jim owens wrote:
>
> > Ric Wheeler wrote:
> >
> >> The type of allocation that would help most is something that tries to keep
> >> the lower block ranges "hot" for allocation, second best policy would
> >> simply keep the allocated blocks in each block group hot and re-allocate
> >> them.
> >
> > This block reuse policy ignores the issue of wear leveling...
> > as in most design things, trading one problem for another.
>
> For SSDs we're being told not to worry our pretty little heads about wear
> levelling. That gets done for us, with varying degrees of competence,
> within the black box. All we can do to improve that is pray...
> and maybe sacrifice the occasional goat.
I think the rule is for SSDs that if they have a disk interface we
ignore wear levelling ... if the FTL is stupid, they're not going to be
reliable enough even for consumer use. Trying to second guess the FTL
would be a layering violation (and a disaster in the making).
If we're being shown native flash with no intervening disk interface
then, yes, we need to do wear levelling (although I suspect this will
really only occur in the embedded space).
James
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:36 ` James Bottomley
@ 2008-11-07 15:48 ` David Woodhouse
0 siblings, 0 replies; 105+ messages in thread
From: David Woodhouse @ 2008-11-07 15:48 UTC (permalink / raw)
To: James Bottomley
Cc: David Woodhouse, jim owens, Ric Wheeler, Theodore Tso, Jens Axboe,
Chris Mason, Dave Chinner, linux-scsi, linux-fsdevel, Black_David,
Martin K. Petersen, Tom Coughlan, Matthew Wilcox
On Fri, 7 Nov 2008, James Bottomley wrote:
> If we're being shown native flash with no intervening disk interface
> then, yes, we need to do wear levelling (although I suspect this will
> really only occur in the embedded space).
Native flash will need special handling anyway. In practice I expect that
when we make btrfs do native flash, it'll use UBI which will handle a lot
of that for us. Including wear levelling.
--
dwmw2
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:26 ` jim owens
2008-11-07 15:31 ` David Woodhouse
@ 2008-11-07 15:36 ` Theodore Tso
2008-11-07 15:45 ` Matthew Wilcox
2008-11-07 16:02 ` Ric Wheeler
2 siblings, 1 reply; 105+ messages in thread
From: Theodore Tso @ 2008-11-07 15:36 UTC (permalink / raw)
To: jim owens
Cc: Ric Wheeler, Jens Axboe, Chris Mason, Dave Chinner,
David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox,
Eyal Shani
On Fri, Nov 07, 2008 at 10:26:04AM -0500, jim owens wrote:
> Ric Wheeler wrote:
>
>> The type of allocation that would help most is something that tries to
>> keep the lower block ranges "hot" for allocation, second best policy
>> would simply keep the allocated blocks in each block group hot and
>> re-allocate them.
>
> This block reuse policy ignores the issue of wear leveling...
> as in most design things, trading one problem for another.
>
The discussion here has been around Intel-style SSD's, which
apparently have a log-structured filesystem in the device, such that
wear leveling is done automatically, and in fact it is *better* for
these devices if we reuse the same block since then the SSD
automatically knows that the contents at the old location are logically
"gone". (I don't believe, or at least don't see, why there would be
any benefit of reusing block ranges versus explicitly using a TRIM
command to tell the SSD that the old block was no longer being used;
it should have the same effect as far as the SSD is concerned.)
The one thing which I am somewhat concerned about is whether all SSD's
will be doing things the Intel way, or whether other SSD's might not
be willing to license some Intel patents (for example) and will end up
doing things some other way, where they aren't using a log-structured
filesystem under the covers and might be more susceptible to
wear-leveling concerns.
It would perhaps be unfortunate if we were to tune Linux filesystems
to be optimal for Intel-style SSD's, to the point where we can't
support other implementation strategies for SSD's where wear-leveling
might be more important.
OTOH, if Intel has lots of people engaging Linux, and helping to
provide code, benchmarking tools, etc., some bias towards SSD's as
designed and implemented by Intel is probably inevitable. (And it's
an incentive for other SSD vendors to do the same. :-)
- Ted
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:36 ` Theodore Tso
@ 2008-11-07 15:45 ` Matthew Wilcox
2008-11-07 16:07 ` jim owens
0 siblings, 1 reply; 105+ messages in thread
From: Matthew Wilcox @ 2008-11-07 15:45 UTC (permalink / raw)
To: Theodore Tso, jim owens, Ric Wheeler, Jens Axboe, Chris Mason,
Dave Chinner <da
On Fri, Nov 07, 2008 at 10:36:24AM -0500, Theodore Tso wrote:
> The discussion here has been around Intel-style SSD's, which
> apparently have a log-structured filesystem in the device, such that
> wear leveling is done automatically, and in fact it is *better* for
> these devices if we reuse the same block since then the SSD
> automatically knows that contents at the old location is logically
> "gone". (I don't believe, or at least don't see, why there would be
> any benefit of reusing block ranges versus explicitly using a TRIM
> command to tell the SSD that the old block was no longer being used;
> it should have the same effect as far as the SSD is concerned.)
>
> The one thing which I am somewhat concerned about is whether all SSD's
> will be doing things the Intel way, or whether other SSD's might not
Given that most of my information about how SSDs work comes from a
presentation given by Samsung at the FS/IO storage workshop, I feel
fairly confident all the manufacturers do something very similar with a
log-structured FS internally.
Of course, this probably doesn't apply to the $5 1Gb USB keys that you
get in the conference schwag, but if we start optimising for those,
we've probably already lost.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:45 ` Matthew Wilcox
@ 2008-11-07 16:07 ` jim owens
2008-11-07 16:12 ` James Bottomley
0 siblings, 1 reply; 105+ messages in thread
From: jim owens @ 2008-11-07 16:07 UTC (permalink / raw)
To: linux-fsdevel
Cc: Matthew Wilcox, Theodore Tso, Ric Wheeler, Jens Axboe,
Chris Mason, Dave Chinner, David Woodhouse, James Bottomley,
linux-scsi, Black_David, Martin K. Petersen, Tom Coughlan,
Eyal Shani
We have become confused and combined 2 independent issues...
UNMAP thin storage array provisioning is not SSD TRIM.
AFAIK the SSD trim will be handled just fine by sending the
small size (merged when appropriate) trim command.
The UNMAP is a big JBOD array problem and the solution(s)
are different. As I said in reply to one of Ric's earlier posts,
my opinion is:
- expensive array vendors need to do this themselves
because their whole thin provisioning thing is all
about sharing on different host types with existing
filesystems that use small blocks.
- linux filesystems that can do large allocation blocks
would be able to be tuned at mkfs to the array geometry
if the vendors give us the data... and we should only
optimize those filesystems and forget about trying to
fix it in the block layer or fix it with some kind of
defrag/scanner.
jim
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 16:07 ` jim owens
@ 2008-11-07 16:12 ` James Bottomley
2008-11-07 16:23 ` jim owens
0 siblings, 1 reply; 105+ messages in thread
From: James Bottomley @ 2008-11-07 16:12 UTC (permalink / raw)
To: jim owens
Cc: linux-fsdevel, Matthew Wilcox, Theodore Tso, Ric Wheeler,
Jens Axboe, Chris Mason, Dave Chinner, David Woodhouse,
linux-scsi, Black_David, Martin K. Petersen, Tom Coughlan,
Eyal Shani
On Fri, 2008-11-07 at 11:07 -0500, jim owens wrote:
> We have become confused and combined 2 independent issues...
>
> UNMAP thin storage array provisioning is not SSD TRIM.
Actually, currently they are.
The primary reason is that we handle current SATA devices through SCSI
via SAT, so we're going to have to do UNMAP for both arrays and SSDs
until such time as SATA is ejected from the SCSI mid-layer and it can do
trim on its own.
The other reason, of course, is that we're mapping the Block layer
discard statement for both as well. If that needs to change to two
separate instructions, I don't think that's on our current radar.
So, currently we're approaching this problem as one and I don't see any
way we could logically separate the two cases (except by distinguishing
between them with device parametrisation).
James
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 16:12 ` James Bottomley
@ 2008-11-07 16:23 ` jim owens
0 siblings, 0 replies; 105+ messages in thread
From: jim owens @ 2008-11-07 16:23 UTC (permalink / raw)
To: James Bottomley
Cc: linux-fsdevel, Matthew Wilcox, Theodore Tso, Ric Wheeler,
Jens Axboe, Chris Mason, Dave Chinner, David Woodhouse,
linux-scsi, Black_David, Martin K. Petersen, Tom Coughlan,
Eyal Shani
James Bottomley wrote:
> On Fri, 2008-11-07 at 11:07 -0500, jim owens wrote:
>> We have become confused and combined 2 independent issues...
>>
>> UNMAP thin storage array provisioning is not SSD TRIM.
>
> Actually, currently they are.
Sorry, I'm not being clear. It may be the same code but
the "what is the design goal" is different.
The goal for SSD is to handle devices that are reasonably
small in size and have (we hope) reasonably small trim
chunk sizes.
The key here is that the size of the SSD device would keep any
merge bitmap you did in the block layer reasonable.
A 512-byte-per-block merge map for a 1,000 TB SAN array
is much uglier.
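(For scale: at one bit per 512-byte block, a 1,000 TB array is roughly
2 trillion blocks, so the merge map alone would run to something on the
order of 250 GB.)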
I'm saying that ugliness belongs to the array vendor.
Keep a common linux block layer discard/trim/unmap
that is good for reasonable devices - screw the rest!
jim
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 15:26 ` jim owens
2008-11-07 15:31 ` David Woodhouse
2008-11-07 15:36 ` Theodore Tso
@ 2008-11-07 16:02 ` Ric Wheeler
2 siblings, 0 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 16:02 UTC (permalink / raw)
To: jim owens
Cc: Theodore Tso, Jens Axboe, Chris Mason, Dave Chinner,
David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
jim owens wrote:
> Ric Wheeler wrote:
>
>> The type of allocation that would help most is something that tries
>> to keep the lower block ranges "hot" for allocation, second best
>> policy would simply keep the allocated blocks in each block group hot
>> and re-allocate them.
>
> This block reuse policy ignores the issue of wear leveling...
> as in most design things, trading one problem for another.
>
> jim
Wear levelling is not a problem for all (most?) T10 unmap-capable
arrays since they remap pretty much everything internally.
Ric
* Re: thin provisioned LUN support & file system allocation policy
2008-11-07 14:43 ` Theodore Tso
2008-11-07 14:54 ` Ric Wheeler
@ 2008-11-07 14:55 ` Matthew Wilcox
1 sibling, 0 replies; 105+ messages in thread
From: Matthew Wilcox @ 2008-11-07 14:55 UTC (permalink / raw)
To: Theodore Tso, Ric Wheeler, Jens Axboe, Chris Mason, Dave Chinner,
David
On Fri, Nov 07, 2008 at 09:43:11AM -0500, Theodore Tso wrote:
> The one thing we might need for SSD-friendly allocation policies is to
> tell the allocators to not try so hard to make sure allocations are
> contiguous, but there are other reasons why you want contiguous
> extents anyway (such as reducing the size of your extent tree and
> reducing the number of block allocation data structures that need to
> be updated). And, I think to some extent SSD's do care to some level
> about contiguous extents, from the point of view of reducing scatter
> gather operations if nothing else, right?
It's not so much s-g operations as it is that you can only have 32
commands outstanding with the drive at any given time. Each read/write
command can specify only one extent. So if you can ask for one 256k
extent rather than have to ask for a 4k extent 64 times, you're going
to get your data faster.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: thin provisioned LUN support
2008-11-07 12:14 ` Ric Wheeler
2008-11-07 12:17 ` David Woodhouse
2008-11-07 12:19 ` Jens Axboe
@ 2008-11-07 15:20 ` James Bottomley
2008-11-09 23:08 ` Dave Chinner
2 siblings, 1 reply; 105+ messages in thread
From: James Bottomley @ 2008-11-07 15:20 UTC (permalink / raw)
To: Ric Wheeler
Cc: Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox
On Fri, 2008-11-07 at 07:14 -0500, Ric Wheeler wrote:
> Jens Axboe wrote:
> > On Thu, Nov 06 2008, David Woodhouse wrote:
> >
> >> On Thu, 6 Nov 2008, James Bottomley wrote:
> >>
> >>> The way to do this properly would be to run a chequerboard of partials,
> >>> but this would effectively have trim region tracking done in the block
> >>> layer ... is this worth it?
> >>>
> >>> By the way, the latest (from 2 days ago) version of the Thin
> >>> Provisioning proposal is here:
> >>>
> >>> http://www.t10.org/ftp/t10/document.08/08-149r4.pdf
> >>>
> >>> I skimmed it but don't see any update implying that trim might be
> >>> ineffective if we align wrongly ... where is this?
> >>>
> >> I think we should be content to declare such devices 'broken'.
> >>
> >> They have to keep track of individual sectors _anyway_, and dropping
> >> information for small discard requests is just careless.
> >>
> >
> > I agree, seems pretty pointless. Lets let evolution take care of this
> > issue. I have to say I'm surprised that it really IS an issue to begin
> > with, are array firmwares really that silly?
> >
> > It's not that it would be hard to support (and it would eliminate the
> > need to do discard merging in the block layer), but it seems like one of
> > those things that will be of little use in even in the near future.
> > Discard merging should be useful, I have no problem merging something
> > like that.
> >
> >
> I think that discard merging would be helpful (especially for devices
> with more reasonable sized unmap chunks).
One of the ways the unmap command is set up is with a disjoint
scatterlist, so we can send a large number of unmaps together. Whether
they're merged or not really doesn't matter.
The probable way a discard system would work if we wanted to endure the
complexity would be to have the discard system in the underlying device
driver (or possibly just above it in block, but different devices like
SCSI or ATA have different discard characteristics). It would just
accumulate block discard requests as ranges (and it would have to poke
holes in the ranges as it sees read/write requests) which it flushes
periodically.
The reason for doing it this way is that discards are "special": as
long as we don't discard a rewritten sector, the time at which they're
sent down is irrelevant to integrity, and thus we can potentially
accumulate over vastly different timescales than the regular block
merging. If
we're really going to respect this discard block size, we could
accumulate the irrelevant discards the array would ignore anyway for
virtually infinite time.
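As a toy sketch of that accumulation scheme (an illustration only, not
block layer code): keep the pending discards as a list of ranges and
punch holes in them whenever an overlapping write comes down, then
flush whatever is left periodically.

#include <stdint.h>
#include <stdlib.h>

/* Pending discard state: a list of [start, end) ranges awaiting the
 * periodic flush. */
struct pending {
	uint64_t start, end;
	struct pending *next;
};

static struct pending *pending_head;

/* Queue a discard; no merging attempted, to keep the sketch short. */
void queue_discard(uint64_t start, uint64_t len)
{
	struct pending *p = malloc(sizeof(*p));

	p->start = start;
	p->end = start + len;
	p->next = pending_head;
	pending_head = p;
}

/* A write to [start, start + len) invalidates any pending discard that
 * covers it: punch a hole, possibly splitting a range in two. */
void punch_hole(uint64_t start, uint64_t len)
{
	uint64_t end = start + len;

	for (struct pending *p = pending_head; p; p = p->next) {
		if (end <= p->start || start >= p->end)
			continue;		/* no overlap */
		if (start > p->start && end < p->end) {
			/* write lands in the middle: split the range */
			struct pending *tail = malloc(sizeof(*tail));

			tail->start = end;
			tail->end = p->end;
			tail->next = p->next;
			p->end = start;
			p->next = tail;
		} else if (start <= p->start && end >= p->end) {
			p->end = p->start;	/* fully covered: now empty */
		} else if (start <= p->start) {
			p->start = end;		/* trim the front */
		} else {
			p->end = start;		/* trim the tail */
		}
	}
}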
Note, I'm not saying we *should* do this ... I think something like
this would be much better done in the device ... but if we *are* going
to do it, then at least let's get it right.
James
* Re: thin provisioned LUN support
2008-11-07 15:20 ` thin provisioned LUN support James Bottomley
@ 2008-11-09 23:08 ` Dave Chinner
2008-11-09 23:37 ` James Bottomley
0 siblings, 1 reply; 105+ messages in thread
From: Dave Chinner @ 2008-11-09 23:08 UTC (permalink / raw)
To: James Bottomley
Cc: Ric Wheeler, Jens Axboe, David Woodhouse, linux-scsi,
linux-fsdevel, Black_David, Martin K. Petersen, Tom Coughlan,
Matthew Wilcox
On Fri, Nov 07, 2008 at 09:20:30AM -0600, James Bottomley wrote:
> On Fri, 2008-11-07 at 07:14 -0500, Ric Wheeler wrote:
> > Jens Axboe wrote:
> > I think that discard merging would be helpful (especially for devices
> > with more reasonable sized unmap chunks).
>
> One of the ways the unmap command is set up is with a disjoint
> scatterlist, so we can send a large number of unmaps together. Whether
> they're merged or not really doesn't matter.
>
> The probable way a discard system would work if we wanted to endure the
> complexity would be to have the discard system in the underlying device
> driver (or possibly just above it in block, but different devices like
> SCSI or ATA have different discard characteristics). It would just
> accumulate block discard requests as ranges (and it would have to poke
> holes in the ranges as it sees read/write requests) which it flushes
> periodically.
It appears to me that discard requests are only being considered
here at a block and device level, and nobody is thinking about
the system level effects of such aggregation of discard requests.
What happens on a system crash? We lose all the pending discard
requests, never to be sent again? If so, how do we tell the device
that certain ranges have actually been discarded after the crash?
Are you expecting them to get replayed by a filesystem during
recovery? What if it was a userspace discard from something like
mkfs that was lost? How does this interact with sync or other
such user-level filesystem synchronisation primitives? Does
sync_blockdev() flush out pending discard requests? Should fsync?
And if the filesystem has to wait for discard requests to complete
to guarantee that they are done or can be recovered and replayed
after a crash, most filesystems are going to need modification. e.g.
XFS would need to prevent the tail of the log moving forward until
the discard request associated with a given extent free transaction
has been completed. That means we need to be able to specifically
flush queued discard requests and we'd need I/O completions to
run when they are done to do the filesystem-level cleanup work....
Let's keep the OS level interactions simple - if the array vendors
want to keep long queues of requests around before acting on them
to aggregate them, then that is an optimisation for them to
implement. They already do this with small data writes to NVRAM, so I
don't see how this should be treated any differently...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: thin provisioned LUN support
2008-11-09 23:08 ` Dave Chinner
@ 2008-11-09 23:37 ` James Bottomley
2008-11-10 0:33 ` Dave Chinner
0 siblings, 1 reply; 105+ messages in thread
From: James Bottomley @ 2008-11-09 23:37 UTC (permalink / raw)
To: Dave Chinner
Cc: Ric Wheeler, Jens Axboe, David Woodhouse, linux-scsi,
linux-fsdevel, Black_David, Martin K. Petersen, Tom Coughlan,
Matthew Wilcox
On Mon, 2008-11-10 at 10:08 +1100, Dave Chinner wrote:
> On Fri, Nov 07, 2008 at 09:20:30AM -0600, James Bottomley wrote:
> > On Fri, 2008-11-07 at 07:14 -0500, Ric Wheeler wrote:
> > > Jens Axboe wrote:
> > > I think that discard merging would be helpful (especially for devices
> > > with more reasonable sized unmap chunks).
> >
> > One of the ways the unmap command is set up is with a disjoint
> > scatterlist, so we can send a large number of unmaps together. Whether
> > they're merged or not really doesn't matter.
> >
> > The probable way a discard system would work if we wanted to endure the
> > complexity would be to have the discard system in the underlying device
> > driver (or possibly just above it in block, but different devices like
> > SCSI or ATA have different discard characteristics). It would just
> > accumulate block discard requests as ranges (and it would have to poke
> > holes in the ranges as it sees read/write requests) which it flushes
> > periodically.
>
> It appears to me that discard requests are only being considered
> here at a block and device level, and nobody is thinking about
> the system level effects of such aggregation of discard requests.
>
> What happens on a system crash? We lose all the pending discard
> requests, never to be sent again?
Yes ... since this is for thin provisioning. Discard is best guess ...
it doesn't affect integrity if we lose one and from the point of view of
the array, 99% transmitted is far better than we do today. All that
happens for a lost discard is that the array keeps a block that the
filesystem isn't currently using. However, the chances are that it will
get reused, so it has a good probability of getting discarded again.
> If so, how do we tell the device
> that certain ranges have actually been discarded after the crash?
> Are you expecting them to get replayed by a filesystem during
> recovery? What if it was a userspace discard from something like
> mkfs that was lost? How does this interact with sync or other
> such user level filesystems synchronisation primitives? Does
> sync_blockdev() flush out pending discard requests? Should fsync?
No ... the syncs are all integrity based. Discard is simply opportunity
based.
> And if the filesystem has to wait for discard requests to complete
> to guarantee that they are done or can be recovered and replayed
> after a crash, most filesystems are going to need modification. e.g.
> XFS would need to prevent the tail of the log moving forward until
> the discard request associated with a given extent free transaction
> has been completed. That means we need to be able to specifically
> flush queued discard requests and we'd need I/O completions to
> run when they are done to do the filesytem level cleanup work....
OK, I really don't follow the logic here. Discards have no effect on
data integrity ... unless you're confusing them with secure deletion? A
discard merely tells the array that it doesn't need to back this block
with an actual storage location anymore (until the next write for that
region comes down).
The ordering worry can be coped with in the same way we do barriers ...
it's even safer for discards because if we know the block is going to be
rewritten, we simply discard the discard.
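As a rough illustration of "discarding the discard", here is a tiny
userspace sketch (the structure and names are invented, not kernel
code): a pending discard range, tracked in sectors, is shrunk or
dropped when an overlapping write shows up, so the write always wins.

/* Minimal model of "discarding the discard": a pending discard range
 * is tracked in sector units, and an overlapping write punches it out
 * before it is ever sent to the array.  Illustrative only.
 */
#include <stdio.h>

struct pending_discard {
	unsigned long long start;	/* first sector of the pending discard */
	unsigned long long len;		/* length in sectors, 0 == nothing pending */
};

/* Clip a pending discard so it no longer covers a written region.  For
 * brevity this only shrinks from the ends; a real tracker would also
 * split a range that is overwritten in the middle.
 */
static void punch_write(struct pending_discard *d,
			unsigned long long wstart, unsigned long long wlen)
{
	unsigned long long dend = d->start + d->len;
	unsigned long long wend = wstart + wlen;

	if (!d->len || wend <= d->start || wstart >= dend)
		return;				/* no overlap */

	if (wstart <= d->start && wend >= dend)
		d->len = 0;			/* fully rewritten: drop the discard */
	else if (wstart <= d->start) {
		d->len = dend - wend;		/* clip the front */
		d->start = wend;
	} else if (wend >= dend)
		d->len = wstart - d->start;	/* clip the tail */
}

int main(void)
{
	struct pending_discard d = { .start = 1000, .len = 512 };

	punch_write(&d, 1400, 200);		/* rewrite overlaps the tail */
	printf("pending discard now: start=%llu len=%llu\n", d.start, d.len);
	return 0;
}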
> Let's keep the OS level interactions simple - if the array vendors
> want to keep long queues of requests around before acting on them
> to aggregate them, then that is an optimisation for them to
> implement. They already do this with small data writes to NVRAM, so I
> don't see how this should be treated any differently...
Well, that's Chris' argument, and it has merit. I'm coming from the
point of view that discards are actually a fundamentally different
entity from anything else we process.
James
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-09 23:37 ` James Bottomley
@ 2008-11-10 0:33 ` Dave Chinner
2008-11-10 14:31 ` James Bottomley
0 siblings, 1 reply; 105+ messages in thread
From: Dave Chinner @ 2008-11-10 0:33 UTC (permalink / raw)
To: James Bottomley
Cc: Ric Wheeler, Jens Axboe, David Woodhouse, linux-scsi,
linux-fsdevel, Black_David, Martin K. Petersen, Tom Coughlan,
Matthew Wilcox
On Sun, Nov 09, 2008 at 05:37:39PM -0600, James Bottomley wrote:
> On Mon, 2008-11-10 at 10:08 +1100, Dave Chinner wrote:
> > On Fri, Nov 07, 2008 at 09:20:30AM -0600, James Bottomley wrote:
> > > On Fri, 2008-11-07 at 07:14 -0500, Ric Wheeler wrote:
> > > > Jens Axboe wrote:
> > > > I think that discard merging would be helpful (especially for devices
> > > > with more reasonable sized unmap chunks).
> > >
> > > One of the ways the unmap command is set up is with a disjoint
> > > scatterlist, so we can send a large number of unmaps together. Whether
> > > they're merged or not really doesn't matter.
> > >
> > > The probable way a discard system would work if we wanted to endure the
> > > complexity would be to have the discard system in the underlying device
> > > driver (or possibly just above it in block, but different devices like
> > > SCSI or ATA have different discard characteristics). It would just
> > > accumulate block discard requests as ranges (and it would have to poke
> > > holes in the ranges as it sees read/write requests) which it flushes
> > > periodically.
> >
> > It appears to me that discard requests are only being considered
> > here at a block and device level, and nobody is thinking about
> > the system level effects of such aggregation of discard requests.
> >
> > What happens on a system crash? We lose all the pending discard
> > requests, never to be sent again?
>
> Yes ... since this is for thin provisioning. Discard is best guess ...
> it doesn't affect integrity if we lose one and from the point of view of
> the array, 99% transmitted is far better than we do today. All that
> happens for a lost discard is that the array keeps a block that the
> filesystem isn't currently using. However, the chances are that it will
> get reused, so it shares a good probability of getting discarded again.
Ok. Given that a single extent free in XFS could span up to 2^37 bytes,
is it considered acceptable to lose the discard request that was
issued from this transaction? I don't think it is....
> > If so, how do we tell the device
> > that certain ranges have actually been discarded after the crash?
> > Are you expecting them to get replayed by a filesystem during
> > recovery? What if it was a userspace discard from something like
> > mkfs that was lost? How does this interact with sync or other
> > such user level filesystems synchronisation primitives? Does
> > sync_blockdev() flush out pending discard requests? Should fsync?
>
> No .. the syncs are all integrity based. Discard is simple opportunity
> based.
Given that discard requests modify the stable storage associated
with the filesystem, then shouldn't an integrity synchronisation
issue and complete all pending requests to the underlying storage
device?
If not, how do we guarantee them to all be flushed on remount-ro
or unmount-before-hot-unplug type of events?
> > And if the filesystem has to wait for discard requests to complete
> > to guarantee that they are done or can be recovered and replayed
> > after a crash, most filesystems are going to need modification. e.g.
> > XFS would need to prevent the tail of the log moving forward until
> > the discard request associated with a given extent free transaction
> > has been completed. That means we need to be able to specifically
> > flush queued discard requests and we'd need I/O completions to
> > run when they are done to do the filesytem level cleanup work....
>
> OK, I really don't follow the logic here. Discards have no effect on
> data integrity ... unless you're confusing them with secure deletion?
Not at all. I'm considering what is needed to allow the filesystem's
discard requests to be replayed during recovery. i.e. what is needed
to allow a filesystem to handle discard requests for thin
provisioning robustly.
If discard requests are not guaranteed to be issued to the storage
on a crash, then it is up to the filesystem to ensure that it
happens during recovery. That requires discard requests to behave
just like all other types of I/O and definitely requires a mechanism
to flush and wait for all discard requests to complete....
> A
> discard merely tells the array that it doesn't need to back this block
> with an actual storage location anymore (until the next write for that
> region comes down).
Right. But really, it's the filesystem that is saying this, not the
block layer, so if the filesystem wants to be robust, then the block
layer can't queue these forever - they have to be issued in a timely
fashion so the filesystem can keep track of which discards have
completed or not....
> The ordering worry can be coped with in the same way we do barriers ...
> it's even safer for discards because if we know the block is going to be
> rewritten, we simply discard the discard.
Ordering is determined by the filesystem - barriers are just a
mechanism the filesystem uses to guarantee I/O ordering. If the
filesystem is tracking discard completion status, then it won't
be issuing I/O over the top of that region as the free transaction
won't be complete until the discard is done....
> > Let's keep the OS level interactions simple - if the array vendors
> > want to keep long queues of requests around before acting on them
> > to aggregate them, then that is an optimisation for them to
> > implement. They already do this with small data writes to NVRAM, so I
> > don't see how this should be treated any differently...
>
> Well, that's Chris' argument, and it has merit. I'm coming from the
> point of view that discards are actually a fundamentally different
> entity from anything else we process.
From a filesystem perspective, they are no different to any other
metadata I/O. They need to be tracked to allow robust crash recovery
semantics to be implemented in the filesystem.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-10 0:33 ` Dave Chinner
@ 2008-11-10 14:31 ` James Bottomley
0 siblings, 0 replies; 105+ messages in thread
From: James Bottomley @ 2008-11-10 14:31 UTC (permalink / raw)
To: Dave Chinner
Cc: Ric Wheeler, Jens Axboe, David Woodhouse, linux-scsi,
linux-fsdevel, Black_David, Martin K. Petersen, Tom Coughlan,
Matthew Wilcox
On Mon, 2008-11-10 at 11:33 +1100, Dave Chinner wrote:
> > Yes ... since this is for thin provisioning. Discard is best guess ...
> > it doesn't affect integrity if we lose one and from the point of view of
> > the array, 99% transmitted is far better than we do today. All that
> > happens for a lost discard is that the array keeps a block that the
> > filesystem isn't currently using. However, the chances are that it will
> > get reused, so it shares a good probability of getting discarded again.
>
> Ok. Given that a single extent free in XFS could span up to 2^37 bytes,
> is it considered acceptible to lose the discard request that this
> issued from this transaction? I don't think it is....
Well, given the semantics we've been discussing, ~2^37 of that would go
down immediately and up to 2x the discard block size on either side may
be retained for misalignment reasons. I don't really see the problem.
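The arithmetic behind that can be sketched quickly; the chunk size
below is a made-up 768KB and the code is illustrative only.  Only the
chunk-aligned interior of a freed range is guaranteed to be acted on,
with at most one chunk's worth retained at each end:

/* Split a freed byte range into the aligned interior the array will
 * honour plus the misaligned head/tail it may ignore.  Numbers are
 * purely illustrative.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long start = 300ULL << 10;	/* freed range start, bytes */
	unsigned long long len   = 1ULL << 37;		/* a ~128GB extent free */
	unsigned long long chunk = 768ULL << 10;	/* hypothetical array chunk */

	unsigned long long end    = start + len;
	unsigned long long astart = ((start + chunk - 1) / chunk) * chunk;
	unsigned long long aend   = (end / chunk) * chunk;

	if (aend > astart)
		printf("aligned discard: %llu..%llu (%llu bytes)\n",
		       astart, aend, aend - astart);
	printf("head retained: %llu bytes, tail retained: %llu bytes\n",
	       astart - start, end - aend);
	return 0;
}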
> > > If so, how do we tell the device
> > > that certain ranges have actually been discarded after the crash?
> > > Are you expecting them to get replayed by a filesystem during
> > > recovery? What if it was a userspace discard from something like
> > > mkfs that was lost? How does this interact with sync or other
> > > such user level filesystems synchronisation primitives? Does
> > > sync_blockdev() flush out pending discard requests? Should fsync?
> >
> > No .. the syncs are all integrity based. Discard is simple opportunity
> > based.
>
> Given that discard requests modify the stable storage associated
> with the filesystem, then shouldn't an integrity synchronisation
> issue and complete all pending requests to the underlying storage
> device?
>
> If not, how do we guarantee them to all be flushed on remount-ro
> or unmount-before-hot-unplug type of events?
Discard is a guarantee from the filesystem but a hint to the storage.
All we need to ensure is that we don't discard a sector with data. To
do that, discard is already a barrier (no merging around it). If we
retain discards in sd or some other layer, all we have to do is drop the
region we see a rewrite for ... this isn't rocket science, it's similar
to what we do now for barrier transactions.
The reason for making them long lived is that we're keeping the pieces
the array would have ignored anyway. That's also why dropping them all
on the floor on a crash isn't a problem ... this is only best effort.
> > > And if the filesystem has to wait for discard requests to complete
> > > to guarantee that they are done or can be recovered and replayed
> > > after a crash, most filesystems are going to need modification. e.g.
> > > XFS would need to prevent the tail of the log moving forward until
> > > the discard request associated with a given extent free transaction
> > > has been completed. That means we need to be able to specifically
> > > flush queued discard requests and we'd need I/O completions to
> > > run when they are done to do the filesytem level cleanup work....
> >
> > OK, I really don't follow the logic here. Discards have no effect on
> > data integrity ... unless you're confusing them with secure deletion?
>
> Not at all. I'm considering what is needed to allow the filesystem's
> discard requests to be replayed during recovery. i.e. what is needed
> to allow a filesystem to handle discard requests for thin
> provisioning robustly.
>
> If discard requests are not guaranteed to be issued to the storage
> on a crash, then it is up to the filesystem to ensure that it
> happens during recovery. That requires discard requests to behave
> just like all other types of I/O and definitely requires a mechanism
> to flush and wait for all discard requests to complete....
Really, I think you're the one complicating the problem. It's really
simple. An array would like to know when a filesystem isn't using a
block. What the array does with that information is beyond the scope of
the filesystem to know. The guarantee is that it must perform
identically whether it acts on this knowledge or not. That makes it a
hint, so we don't need to go to extraordinary lengths to make sure we
get it exactly right ... we just have to be right for every hint we send
down.
> > A
> > discard merely tells the array that it doesn't need to back this block
> > with an actual storage location anymore (until the next write for that
> > region comes down).
>
> Right. But really, it's the filesystem that is saying this, not the
> block layer, so if the filesytem wants to be robust, then block
> layer can't queue these forever - they have to be issued in a timely
> fashion so the filesystem can keep track of which discards have
> completed or not....
>
> > The ordering worry can be coped with in the same way we do barriers ...
> > it's even safer for discards because if we know the block is going to be
> > rewritten, we simply discard the discard.
>
> Ordering is determined by the filesystem - barriers are just a
> mechanism the filesystem uses to guarantee I/O ordering. If the
> filesystem is tracking discard completion status, then it won't
> be issuing I/O over the top of that region as the free transaction
> won't be complete until the discard is done....
Not really ... ordering is determined by the barrier-containing block
stream ... that's why we can do this at the block level.
Look at it this way: if we had to rely on filesystem internals for
ordering information, fs agnostic block replicators would be impossible.
> > > Let's keep the OS level interactions simple - if the array vendors
> > > want to keep long queues of requests around before acting on them
> > > to aggregate them, then that is an optimisation for them to
> > > implement. They already do this with small data writes to NVRAM, so I
> > > don't see how this should be treated any differently...
> >
> > Well, that's Chris' argument, and it has merit. I'm coming from the
> > point of view that discards are actually a fundamentally different
> > entity from anything else we process.
>
> From a filesystem perspective, they are no different to any other
> metadata I/O. They need to be tracked to allow robust crash recovery
> semantics to be implemented in the filesystem.
I agree it could be. My point is that the hint doesn't need to be
robust (as in accurate and complete), merely accurate, which we can
ensure at the block level.
James
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 12:05 ` Jens Axboe
2008-11-07 12:14 ` Ric Wheeler
@ 2008-11-07 15:49 ` Chris Mason
2008-11-07 16:00 ` Martin K. Petersen
1 sibling, 1 reply; 105+ messages in thread
From: Chris Mason @ 2008-11-07 15:49 UTC (permalink / raw)
To: Jens Axboe
Cc: David Woodhouse, James Bottomley, Ric Wheeler, linux-scsi,
linux-fsdevel, Black_David, Martin K. Petersen, Tom Coughlan,
Matthew Wilcox
On Fri, 2008-11-07 at 13:05 +0100, Jens Axboe wrote:
> > >I skimmed it but don't see any update implying that trim might be
> > >ineffective if we align wrongly ... where is this?
> >
> > I think we should be content to declare such devices 'broken'.
> >
> > They have to keep track of individual sectors _anyway_, and dropping
> > information for small discard requests is just careless.
>
> I agree, seems pretty pointless. Lets let evolution take care of this
> issue. I have to say I'm surprised that it really IS an issue to begin
> with, are array firmwares really that silly?
>
> It's not that it would be hard to support (and it would eliminate the
> need to do discard merging in the block layer), but it seems like one of
> those things that will be of little use even in the near future.
> Discard merging should be useful, I have no problem merging something
> like that.
>
Hmmm, it's surprising to me that arrays who tell us please use the noop
elevator suddenly want us to merge discard requests. The array really
needs to be able to deal with this internally.
Not that discard merging is bad, but I agree that we need to push this
problem off on the array vendors.
-chris
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 15:49 ` Chris Mason
@ 2008-11-07 16:00 ` Martin K. Petersen
2008-11-07 16:06 ` James Bottomley
2008-11-09 23:36 ` Dave Chinner
0 siblings, 2 replies; 105+ messages in thread
From: Martin K. Petersen @ 2008-11-07 16:00 UTC (permalink / raw)
To: Chris Mason
Cc: Jens Axboe, David Woodhouse, James Bottomley, Ric Wheeler,
linux-scsi, linux-fsdevel, Black_David, Martin K. Petersen,
Tom Coughlan, Matthew Wilcox
>>>>> "Chris" == Chris Mason <chris.mason@oracle.com> writes:
Chris> Hmmm, it's surprising to me that arrays who tell us please use
Chris> the noop elevator suddenly want us to merge discard requests.
Chris> The array really needs to be able to deal with this internally.
Let's also not forget that we're talking about merging discard
requests for the purpose of making internal array housekeeping efficient.
That involves merging discards up to the internal array block sizes
which may be on the order of 512/768/1024 KB.
If we were talking about merging discards up to a 4/8/16 KB boundary
that might be something we'd have a chance to do within a reasonable
amount of time (bigger than normal read/write I/O but not hours).
But keeping discard state around for long enough to attempt to
aggregate 768KB (and 768KB-aligned) chunks is icky.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 16:00 ` Martin K. Petersen
@ 2008-11-07 16:06 ` James Bottomley
2008-11-07 16:11 ` Chris Mason
2008-11-09 23:36 ` Dave Chinner
1 sibling, 1 reply; 105+ messages in thread
From: James Bottomley @ 2008-11-07 16:06 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Chris Mason, Jens Axboe, David Woodhouse, Ric Wheeler, linux-scsi,
linux-fsdevel, Black_David, Tom Coughlan, Matthew Wilcox
On Fri, 2008-11-07 at 11:00 -0500, Martin K. Petersen wrote:
> >>>>> "Chris" == Chris Mason <chris.mason@oracle.com> writes:
>
> Chris> Hmmm, it's surprising to me that arrays who tell us please use
> Chris> the noop elevator suddenly want us to merge discard requests.
> Chris> The array really needs to be able to deal with this internally.
>
> Let's also not forget that we're talking about merging discard
> requests for the purpose making internal array housekeeping efficient.
> That involves merging discards up to the internal array block sizes
> which may be on the order of 512/768/1024 KB.
>
> If we were talking about merging discards up to a 4/8/16 KB boundary
> that might be something we'd have a chance to do within a reasonable
> amount of time (bigger than normal read/write I/O but not hours).
>
> But keeping discard state around for long enough to attempt to
> aggregate 768KB (and 768KB-aligned) chunks is icky.
Icky but possible. It's the same rb tree affair we use to keep vma
lists (with the same characteristics). The point is that technically we
can do this pretty easily ... all the way down to not losing any
potential discards that the array would ignore. However, procedurally
it would certainly be sending the wrong message to the array vendors
(the message being "sure the OS will sanitise any crap you care to
dump").
On the other hand, if we have to do it for flash and MMC anyway ...
James
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 16:06 ` James Bottomley
@ 2008-11-07 16:11 ` Chris Mason
2008-11-07 16:18 ` James Bottomley
0 siblings, 1 reply; 105+ messages in thread
From: Chris Mason @ 2008-11-07 16:11 UTC (permalink / raw)
To: James Bottomley
Cc: Martin K. Petersen, Jens Axboe, David Woodhouse, Ric Wheeler,
linux-scsi, linux-fsdevel, Black_David, Tom Coughlan,
Matthew Wilcox
On Fri, 2008-11-07 at 10:06 -0600, James Bottomley wrote:
> On Fri, 2008-11-07 at 11:00 -0500, Martin K. Petersen wrote:
> > >>>>> "Chris" == Chris Mason <chris.mason@oracle.com> writes:
> >
> > Chris> Hmmm, it's surprising to me that arrays who tell us please use
> > Chris> the noop elevator suddenly want us to merge discard requests.
> > Chris> The array really needs to be able to deal with this internally.
> >
> > Let's also not forget that we're talking about merging discard
> > requests for the purpose making internal array housekeeping efficient.
> > That involves merging discards up to the internal array block sizes
> > which may be on the order of 512/768/1024 KB.
> >
> > If we were talking about merging discards up to a 4/8/16 KB boundary
> > that might be something we'd have a chance to do within a reasonable
> > amount of time (bigger than normal read/write I/O but not hours).
> >
> > But keeping discard state around for long enough to attempt to
> > aggregate 768KB (and 768KB-aligned) chunks is icky.
>
> Icky but possible. It's the same rb tree affair we use to keep vma
> lists (with the same characteristics). The point is that technically we
> can do this pretty easily ... all the way down to not losing any
> potential discards that the array would ignore. However, procedurally
> it would certainly be sending the wrong message to the array vendors
> (the message being "sure the OS will sanitise any crap you care to
> dump").
>
> On the other hand, if we have to do it for flash and MMC anyway ...
It doesn't seem like a good idea to maintain a ton of code that gets
exercised so rarely, especially wrt filesystem crashes.
Just testing it would be a fairly large challenge, spread out across N
filesystems. I think we need to keep discard as simple as we possibly
can.
-chris
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 16:11 ` Chris Mason
@ 2008-11-07 16:18 ` James Bottomley
2008-11-07 16:22 ` Ric Wheeler
2008-11-07 17:22 ` Chris Mason
0 siblings, 2 replies; 105+ messages in thread
From: James Bottomley @ 2008-11-07 16:18 UTC (permalink / raw)
To: Chris Mason
Cc: Martin K. Petersen, Jens Axboe, David Woodhouse, Ric Wheeler,
linux-scsi, linux-fsdevel, Black_David, Tom Coughlan,
Matthew Wilcox
On Fri, 2008-11-07 at 11:11 -0500, Chris Mason wrote:
> On Fri, 2008-11-07 at 10:06 -0600, James Bottomley wrote:
> > On Fri, 2008-11-07 at 11:00 -0500, Martin K. Petersen wrote:
> > > >>>>> "Chris" == Chris Mason <chris.mason@oracle.com> writes:
> > >
> > > Chris> Hmmm, it's surprising to me that arrays who tell us please use
> > > Chris> the noop elevator suddenly want us to merge discard requests.
> > > Chris> The array really needs to be able to deal with this internally.
> > >
> > > Let's also not forget that we're talking about merging discard
> > > requests for the purpose making internal array housekeeping efficient.
> > > That involves merging discards up to the internal array block sizes
> > > which may be on the order of 512/768/1024 KB.
> > >
> > > If we were talking about merging discards up to a 4/8/16 KB boundary
> > > that might be something we'd have a chance to do within a reasonable
> > > amount of time (bigger than normal read/write I/O but not hours).
> > >
> > > But keeping discard state around for long enough to attempt to
> > > aggregate 768KB (and 768KB-aligned) chunks is icky.
> >
> > Icky but possible. It's the same rb tree affair we use to keep vma
> > lists (with the same characteristics). The point is that technically we
> > can do this pretty easily ... all the way down to not losing any
> > potential discards that the array would ignore. However, procedurally
> > it would certainly be sending the wrong message to the array vendors
> > (the message being "sure the OS will sanitise any crap you care to
> > dump").
> >
> > On the other hand, if we have to do it for flash and MMC anyway ...
>
> It doesn't seem like a good idea to maintain a ton of code that gets
> exercised so rarely, especially wrt filesystem crashes.
Heh, am I the only person here who deletes files on a regular basis
(principally to get my disk down from 99%)?
> Just testing it would be a fairly large challenge, spread out across N
> filesystems. I think we need to keep discard as simple as we possibly
> can.
I don't disagree with that ... I'm not saying we *should*, merely that we
*could*.
James
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 16:18 ` James Bottomley
@ 2008-11-07 16:22 ` Ric Wheeler
2008-11-07 16:27 ` James Bottomley
2008-11-07 16:28 ` David Woodhouse
2008-11-07 17:22 ` Chris Mason
1 sibling, 2 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 16:22 UTC (permalink / raw)
To: James Bottomley
Cc: Chris Mason, Martin K. Petersen, Jens Axboe, David Woodhouse,
linux-scsi, linux-fsdevel, Black_David, Tom Coughlan,
Matthew Wilcox
James Bottomley wrote:
> On Fri, 2008-11-07 at 11:11 -0500, Chris Mason wrote:
>
>> On Fri, 2008-11-07 at 10:06 -0600, James Bottomley wrote:
>>
>>> On Fri, 2008-11-07 at 11:00 -0500, Martin K. Petersen wrote:
>>>
>>>>>>>>> "Chris" == Chris Mason <chris.mason@oracle.com> writes:
>>>>>>>>>
>>>> Chris> Hmmm, it's surprising to me that arrays who tell us please use
>>>> Chris> the noop elevator suddenly want us to merge discard requests.
>>>> Chris> The array really needs to be able to deal with this internally.
>>>>
>>>> Let's also not forget that we're talking about merging discard
>>>> requests for the purpose making internal array housekeeping efficient.
>>>> That involves merging discards up to the internal array block sizes
>>>> which may be on the order of 512/768/1024 KB.
>>>>
>>>> If we were talking about merging discards up to a 4/8/16 KB boundary
>>>> that might be something we'd have a chance to do within a reasonable
>>>> amount of time (bigger than normal read/write I/O but not hours).
>>>>
>>>> But keeping discard state around for long enough to attempt to
>>>> aggregate 768KB (and 768KB-aligned) chunks is icky.
>>>>
>>> Icky but possible. It's the same rb tree affair we use to keep vma
>>> lists (with the same characteristics). The point is that technically we
>>> can do this pretty easily ... all the way down to not losing any
>>> potential discards that the array would ignore. However, procedurally
>>> it would certainly be sending the wrong message to the array vendors
>>> (the message being "sure the OS will sanitise any crap you care to
>>> dump").
>>>
>>> On the other hand, if we have to do it for flash and MMC anyway ...
>>>
>> It doesn't seem like a good idea to maintain a ton of code that gets
>> exercised so rarely, especially wrt filesystem crashes.
>>
>
> Heh, am I the only person here who deletes files on a regular basis
> (principally to get my disk down from 99%)?
>
>
>> Just testing it would be a fairly large challenge, spread out across N
>> filesystems. I think we need to keep discard as simple as we possibly
>> can.
>>
>
> I don't disagree with that ... I'm not saying we *should* merely that we
> *could*.
>
> James
>
>
I agree that simple and robust are key, but we will need to try and do
reasonable coalescing of the requests.
Depending on how vendors implement those unmap commands, sending down a
sequence of commands might cause a performance issue if done at too fine
a granularity. Easiest way to handle that is to make sure that we have a
way of disabling the unmap/discard support (mount option?).
Ric
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 16:22 ` Ric Wheeler
@ 2008-11-07 16:27 ` James Bottomley
2008-11-07 16:28 ` David Woodhouse
1 sibling, 0 replies; 105+ messages in thread
From: James Bottomley @ 2008-11-07 16:27 UTC (permalink / raw)
To: Ric Wheeler
Cc: Chris Mason, Martin K. Petersen, Jens Axboe, David Woodhouse,
linux-scsi, linux-fsdevel, Black_David, Tom Coughlan,
Matthew Wilcox
On Fri, 2008-11-07 at 11:22 -0500, Ric Wheeler wrote:
> James Bottomley wrote:
> >> Just testing it would be a fairly large challenge, spread out across N
> >> filesystems. I think we need to keep discard as simple as we possibly
> >> can.
> >>
> > I don't disagree with that ... I'm not saying we *should* merely that we
> > *could*.
> >
> I agree that simple and robust are key, but we will need to try and do
> reasonable coalescing of the requests.
>
> Depending on how vendors implement those unmap commands, sending down a
> sequence of commands might cause a performance issue if done at too fine
> a granularity. Easiest way to handle that is to make sure that we have a
> way of disabling the unmap/discard support (mount option?).
I'd really think not. The best way to handle this is through the block
options. We'd give an interface to allow the user to change the
defaults (i.e. turn off discard on a discard supporting device but not
vice versa). Providing every possible block option as a mount option is
asking for confused users.
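The shape of that is simple enough to sketch; the struct and field
names below are invented for illustration and are not the real block
layer API.  The point is only that the knob hangs off the queue,
defaulting from what the device advertises but overridable by the
admin rather than by every filesystem's mount options:

#include <stdbool.h>
#include <stdio.h>

struct queue_limits {
	bool device_supports_unmap;	/* what the target advertises */
	bool discard_enabled;		/* admin-visible per-queue knob */
};

static bool should_issue_discard(const struct queue_limits *q)
{
	return q->device_supports_unmap && q->discard_enabled;
}

int main(void)
{
	struct queue_limits q = {
		.device_supports_unmap = true,
		.discard_enabled = true,	/* admin could flip this off */
	};

	if (should_issue_discard(&q))
		printf("free extent -> issue discard/unmap\n");
	else
		printf("free extent -> skip discard, just update fs metadata\n");
	return 0;
}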
James
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 16:22 ` Ric Wheeler
2008-11-07 16:27 ` James Bottomley
@ 2008-11-07 16:28 ` David Woodhouse
1 sibling, 0 replies; 105+ messages in thread
From: David Woodhouse @ 2008-11-07 16:28 UTC (permalink / raw)
To: Ric Wheeler
Cc: James Bottomley, Chris Mason, Martin K. Petersen, Jens Axboe,
David Woodhouse, linux-scsi, linux-fsdevel, Black_David,
Tom Coughlan, Matthew Wilcox
On Fri, 7 Nov 2008, Ric Wheeler wrote:
> Depending on how vendors implement those unmap commands, sending down a
> sequence of commands might cause a performance issue if done at too fine a
> granularity. Easiest way to handle that is to make sure that we have a way of
> disabling the unmap/discard support (mount option?).
Or maybe a per-queue option, since it already is.
--
dwmw2
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 16:18 ` James Bottomley
2008-11-07 16:22 ` Ric Wheeler
@ 2008-11-07 17:22 ` Chris Mason
2008-11-07 18:09 ` Ric Wheeler
1 sibling, 1 reply; 105+ messages in thread
From: Chris Mason @ 2008-11-07 17:22 UTC (permalink / raw)
To: James Bottomley
Cc: Martin K. Petersen, Jens Axboe, David Woodhouse, Ric Wheeler,
linux-scsi, linux-fsdevel, Black_David, Tom Coughlan,
Matthew Wilcox
On Fri, 2008-11-07 at 10:18 -0600, James Bottomley wrote:
> On Fri, 2008-11-07 at 11:11 -0500, Chris Mason wrote:
[ complex trim management code ]
> >
> > It doesn't seem like a good idea to maintain a ton of code that gets
> > exercised so rarely, especially wrt filesystem crashes.
>
> Heh, am I the only person here who deletes files on a regular basis
> (principally to get my disk down from 99%)?
Using it is easy, but the failure case is that the storage forgets about a
block the FS cares about, so actually testing the code means we have to
test all the blocks allocated to the FS to make sure they have the
correct values. Given remapping and everything else we've talked about,
any block in the FS could become corrupt due to a trim bug on block X.
So, trim bugs will look like a dozen other bugs, and they may not get
reported for months after the actual bug was triggered (all things I
know you already know ;)
I think the bar for keeping it simple is much higher here than in other
parts of the storage stack.
-chris
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 17:22 ` Chris Mason
@ 2008-11-07 18:09 ` Ric Wheeler
2008-11-07 18:36 ` Theodore Tso
0 siblings, 1 reply; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 18:09 UTC (permalink / raw)
To: Chris Mason
Cc: James Bottomley, Martin K. Petersen, Jens Axboe, David Woodhouse,
linux-scsi, linux-fsdevel, Black_David, Tom Coughlan,
Matthew Wilcox
Chris Mason wrote:
> On Fri, 2008-11-07 at 10:18 -0600, James Bottomley wrote:
>
>> On Fri, 2008-11-07 at 11:11 -0500, Chris Mason wrote:
>>
>
> [ complex trim management code ]
>
>
>>> It doesn't seem like a good idea to maintain a ton of code that gets
>>> exercised so rarely, especially wrt filesystem crashes.
>>>
>> Heh, am I the only person here who deletes files on a regular basis
>> (principally to get my disk down from 99%)?
>>
>
> Using it is easy, but the failure case is the storage forgets about a
> block the FS cares about, so actually testing the code means we have to
> test all the blocks allocated to the FS to make sure they have the
> correct values. Given remapping and everything else we've talked about,
> any block in the FS could become corrupt due to a trim bug on block X.
>
I don't think that trim bugs should be that common - we just have to be
very careful never to send down a trim for any uncommitted block.
On write any unmapped (trimmed) sector becomes mapped again.
> So, trim bugs will look like a dozen other bugs, and they may not get
> reported for months after the actual bug was triggered (all things I
> know you already know ;)
>
> I think the bar for keeping it simple is much higher here than in other
> parts of the storage stack.
>
> -chris
>
Simple is always good, but I still think that the coalescing (even basic
coalescing) will be a critical performance feature.
ric
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 18:09 ` Ric Wheeler
@ 2008-11-07 18:36 ` Theodore Tso
2008-11-07 18:41 ` Ric Wheeler
` (2 more replies)
0 siblings, 3 replies; 105+ messages in thread
From: Theodore Tso @ 2008-11-07 18:36 UTC (permalink / raw)
To: Ric Wheeler
Cc: Chris Mason, James Bottomley, Martin K. Petersen, Jens Axboe,
David Woodhouse, linux-scsi, linux-fsdevel, Black_David,
Tom Coughlan, Matthew Wilcox
On Fri, Nov 07, 2008 at 01:09:48PM -0500, Ric Wheeler wrote:
>
> I don't think that trim bugs should be that common - we just have to be
> very careful never to send down a trim for any uncommitted block.
>
The trim code probably deserves a very aggressive unit test to make
sure it works correctly, but yeah, we should be able to control any
trim bugs.
> Simple is always good, but I still think that the coalescing (even basic
> coalescing) will be a critical performance feature.
Will we be able to query the device and find out its TRIM/UNMAP
alignment requirements? There is also a balance between performance
(at least if the concern is sending too many separate TRIM commands)
and giving the SSD more flexibility in its wear-leveling allocation
decisions by sending TRIM commands sooner rather than later.
- Ted
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-07 18:36 ` Theodore Tso
@ 2008-11-07 18:41 ` Ric Wheeler
[not found] ` <49148BDF.9050707@redhat.com>
2008-11-07 19:44 ` jim owens
2 siblings, 0 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 18:41 UTC (permalink / raw)
To: Theodore Tso, Ric Wheeler, Chris Mason, James Bottomley,
"Martin K. Petersen" <m
Theodore Tso wrote:
> On Fri, Nov 07, 2008 at 01:09:48PM -0500, Ric Wheeler wrote:
>
>> I don't think that trim bugs should be that common - we just have to be
>> very careful never to send down a trim for any uncommitted block.
>>
>>
>
> The trim code probably deserves a very aggressive unit test to make
> sure it works correctly, but yeah, we should be able to control any
> trim bugs.
>
>
>> Simple is always good, but I still think that the coalescing (even basic
>> coalescing) will be a critical performance feature.
>>
>
> Will we be able to query the device and find out its TRIM/UNMAP
> alignment requirements? There is also a balanace between performance
> (at least if the concern is sending too many separate TRIM commands)
> and giving the SSD more flexibility in its wear-leveling allocation
> decisions by sending TRIM commands sooner rather than later.
>
> - Ted
>
T10 is still working on the proposal for how to display unmap related
information for SCSI, so we don't even have a consistent way to find
this out today for this population.
Not sure what is possible for the ATA devices,
Ric
^ permalink raw reply [flat|nested] 105+ messages in thread[parent not found: <49148BDF.9050707@redhat.com>]
* Re: thin provisioned LUN support
[not found] ` <49148BDF.9050707@redhat.com>
@ 2008-11-07 19:35 ` Theodore Tso
2008-11-07 19:55 ` Martin K. Petersen
2008-11-10 2:36 ` Black_David
0 siblings, 2 replies; 105+ messages in thread
From: Theodore Tso @ 2008-11-07 19:35 UTC (permalink / raw)
To: Ric Wheeler
Cc: Chris Mason, James Bottomley, Martin K. Petersen, Jens Axboe,
David Woodhouse, linux-scsi, linux-fsdevel, Black_David,
Tom Coughlan, Matthew Wilcox
On Fri, Nov 07, 2008 at 01:41:35PM -0500, Ric Wheeler wrote:
>> Will we be able to query the device and find out its TRIM/UNMAP
>> alignment requirements? There is also a balanace between performance
>> (at least if the concern is sending too many separate TRIM commands)
>> and giving the SSD more flexibility in its wear-leveling allocation
>> decisions by sending TRIM commands sooner rather than later.
>>
>
> T10 is still working on the proposal for how to display unmap related
> information for SCSI, so we don't even have a consistent way to find
> this out today for this population.
Yeah, I know, the rhetorical question was mostly addressed at David
Black. :-)
> Not sure what is possible for the ATA devices,
I thought ATA didn't have any TRIM alignment requirements, and it's
T10 that wants to add it to the SCSI side?
- Ted
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-07 19:35 ` Theodore Tso
@ 2008-11-07 19:55 ` Martin K. Petersen
2008-11-07 20:19 ` Theodore Tso
2008-11-07 20:37 ` Ric Wheeler
2008-11-10 2:36 ` Black_David
1 sibling, 2 replies; 105+ messages in thread
From: Martin K. Petersen @ 2008-11-07 19:55 UTC (permalink / raw)
To: Theodore Tso
Cc: Ric Wheeler, Chris Mason, James Bottomley, Martin K. Petersen,
Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan, Matthew Wilcox
>>>>> "Ted" == Theodore Tso <tytso@mit.edu> writes:
Ted> I thought ATA didn't have any TRIM alignment requirements, and
Ted> it's T10 that wants to add it to the SCSI side?
The current UNMAP proposal in SCSI doesn't have requirements either.
Array vendors, suddenly realizing all the work they have to do to
support this, are now talking about imposing additional constraints
(orthogonal to the UNMAP command set) because of limitations in their
existing firmware architectures.
It is obviously much easier for the array vendors to export a Somebody
Else's Problem VPD page containing a constant than it is to fix
inherent limitations in their internal architecture.
We're trying to point out that that's an unacceptable cop out for
something that's clearly their problem to deal with.
My concern is that if we start doing the array people's homework at
the OS level they won't be inclined to fix their broken firmware
design. Ever.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 19:55 ` Martin K. Petersen
@ 2008-11-07 20:19 ` Theodore Tso
2008-11-07 20:21 ` Matthew Wilcox
` (2 more replies)
2008-11-07 20:37 ` Ric Wheeler
1 sibling, 3 replies; 105+ messages in thread
From: Theodore Tso @ 2008-11-07 20:19 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Ric Wheeler, Chris Mason, James Bottomley, Jens Axboe,
David Woodhouse, linux-scsi, linux-fsdevel, Black_David,
Tom Coughlan, Matthew Wilcox
On Fri, Nov 07, 2008 at 02:55:06PM -0500, Martin K. Petersen wrote:
>
> The current UNMAP proposal in SCSI doesn't have requirements either.
>
> Array vendors, suddenly realizing all the work they have to do to
> support this, are now talking about imposing additional constraints
> (orthogonal to the UNMAP command set) because of limitations in their
> existing firmware architectures.
Let's be just a *little* bit fair here. Suppose we wanted to
implement thin-provisioned disks using devicemapper and LVM; consider
that LVM uses a default PE size of 4M for some very good reasons.
Asking filesystems to be a little smarter about allocation policies so
that we allocate in existing 4M chunks before going onto the next, and
asking the block layer to pool trim requests to 4M chunks is not
totally unreasonable.
Array vendors use chunk sizes larger than typical filesystem chunk sizes
for the same reason that LVM does. So to say that this is due to
purely a "broken firmware architecture" is a little unfair.
Regards,
- Ted
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-07 20:19 ` Theodore Tso
@ 2008-11-07 20:21 ` Matthew Wilcox
[not found] ` <20081107202149.GJ15439@parisc-linux.org>
2008-11-07 21:06 ` Martin K. Petersen
2 siblings, 0 replies; 105+ messages in thread
From: Matthew Wilcox @ 2008-11-07 20:21 UTC (permalink / raw)
To: Theodore Tso, Martin K. Petersen, Ric Wheeler, Chris Mason,
James Bottomley <James.Botto
On Fri, Nov 07, 2008 at 03:19:13PM -0500, Theodore Tso wrote:
> Let's be just a *little* bit fair here. Suppose we wanted to
> implement thin-provisioned disks using devicemapper and LVM; consider
> that LVM uses a default PE size of 4M for some very good reasons.
> Asking filesystems to be a little smarter about allocation policies so
> that we allocate in existing 4M chunks before going onto the next, and
> asking the block layer to pool trim requests to 4M chunks is not
> totally unreasonable.
>
> Array vendors use chunk sizes > than typical filesystem chunk sizes
> for the same reason that LVM does. So to say that this is due to
> purely a "broken firmware architecture" is a little unfair.
I think we would have a full-throated discussion about whether the
right thing to do was to put the tracking in the block layer or in LVM.
Rather similar to what we're doing now, in fact.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 105+ messages in thread[parent not found: <20081107202149.GJ15439@parisc-linux.org>]
* Re: thin provisioned LUN support
[not found] ` <20081107202149.GJ15439@parisc-linux.org>
@ 2008-11-07 20:26 ` Ric Wheeler
2008-11-07 20:48 ` Chris Mason
2008-11-07 20:42 ` Theodore Tso
1 sibling, 1 reply; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 20:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Theodore Tso, Martin K. Petersen, Chris Mason, James Bottomley,
Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan
Matthew Wilcox wrote:
> On Fri, Nov 07, 2008 at 03:19:13PM -0500, Theodore Tso wrote:
>
>> Let's be just a *little* bit fair here. Suppose we wanted to
>> implement thin-provisioned disks using devicemapper and LVM; consider
>> that LVM uses a default PE size of 4M for some very good reasons.
>> Asking filesystems to be a little smarter about allocation policies so
>> that we allocate in existing 4M chunks before going onto the next, and
>> asking the block layer to pool trim requests to 4M chunks is not
>> totally unreasonable.
>>
>> Array vendors use chunk sizes > than typical filesystem chunk sizes
>> for the same reason that LVM does. So to say that this is due to
>> purely a "broken firmware architecture" is a little unfair.
>>
>
> I think we would have a full-throated discussion about whether the
> right thing to do was to put the tracking in the block layer or in LVM.
> Rather similar to what we're doing now, in fact.
>
You definitely could imagine having a device mapper target that could
track the discard commands and subsequent writes which would invalidate
the previous discards.
Actually, it would be kind of nice to move all of this away from the
file systems entirely.
Ric
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-07 20:26 ` Ric Wheeler
@ 2008-11-07 20:48 ` Chris Mason
2008-11-07 21:04 ` Ric Wheeler
2008-11-07 21:13 ` Theodore Tso
0 siblings, 2 replies; 105+ messages in thread
From: Chris Mason @ 2008-11-07 20:48 UTC (permalink / raw)
To: Ric Wheeler
Cc: Matthew Wilcox, Theodore Tso, Martin K. Petersen, James Bottomley,
Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan
On Fri, 2008-11-07 at 15:26 -0500, Ric Wheeler wrote:
> Matthew Wilcox wrote:
> > On Fri, Nov 07, 2008 at 03:19:13PM -0500, Theodore Tso wrote:
> >
> >> Let's be just a *little* bit fair here. Suppose we wanted to
> >> implement thin-provisioned disks using devicemapper and LVM; consider
> >> that LVM uses a default PE size of 4M for some very good reasons.
> >> Asking filesystems to be a little smarter about allocation policies so
> >> that we allocate in existing 4M chunks before going onto the next, and
> >> asking the block layer to pool trim requests to 4M chunks is not
> >> totally unreasonable.
> >>
> >> Array vendors use chunk sizes > than typical filesystem chunk sizes
> >> for the same reason that LVM does. So to say that this is due to
> >> purely a "broken firmware architecture" is a little unfair.
> >>
> >
> > I think we would have a full-throated discussion about whether the
> > right thing to do was to put the tracking in the block layer or in LVM.
> > Rather similar to what we're doing now, in fact.
> >
> You definitely could imagine having a device mapper target that could
> track the discards commands and subsequent writes which would invalidate
> the previous discards.
>
> Actually, it would be kind of nice to move all of this away from the
> file systems entirely.
* Fast
* Crash safe
* Bounded ram usage
* Accurately deliver the trims
Pick any three ;) If we're dealing with large files, I can see it
working well. For files that are likely to be smaller than the physical
extent size, you end up with either extra state bits on disk (and
keeping them in sync) or a log structured lvm.
I do agree that an offline tool to account for bytes used would be able
to make up for this, and from a thin provisioning point of view, we
might be better off if we don't accurately deliver all the trims all the
time.
People just use the space again soon anyway, I'd have to guess the
filesystems end up in a steady state outside of special events.
In another email Ted mentions that it makes sense for the FS allocator
to notice we've just freed the last block in an aligned region of size
X, and I'd agree with that.
The trim command we send down when we free the block could just contain
the entire range that is free (and easy for the FS to determine) every
time.
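A minimal sketch of that allocator-side check, with an invented
region size and bitmap layout: on each block free, test whether the
containing aligned region is now entirely free and, if so, report a
region-sized discard instead of a block-sized one.

#include <stdbool.h>
#include <stdio.h>

#define BLOCKS_PER_REGION 1024U	/* e.g. a 4MB region of 4KB blocks */
#define TOTAL_BLOCKS      8192U

static unsigned char in_use[TOTAL_BLOCKS];	/* 1 = block allocated */

/* Free one block; if its whole aligned region is now free, a single
 * discard covering the region could be issued. */
static void free_block(unsigned int blk)
{
	unsigned int base = (blk / BLOCKS_PER_REGION) * BLOCKS_PER_REGION;
	unsigned int i;
	bool region_free = true;

	in_use[blk] = 0;
	for (i = base; i < base + BLOCKS_PER_REGION; i++)
		if (in_use[i]) {
			region_free = false;
			break;
		}

	if (region_free)
		printf("blocks %u-%u all free: discard the whole region\n",
		       base, base + BLOCKS_PER_REGION - 1);
	else
		printf("block %u freed: region still partially in use\n", blk);
}

int main(void)
{
	in_use[10] = in_use[11] = 1;
	free_block(10);		/* region 0 still has block 11 in use */
	free_block(11);		/* now the whole first region is free */
	return 0;
}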
-chris
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 20:48 ` Chris Mason
@ 2008-11-07 21:04 ` Ric Wheeler
2008-11-07 21:13 ` Theodore Tso
1 sibling, 0 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 21:04 UTC (permalink / raw)
To: Chris Mason
Cc: Matthew Wilcox, Theodore Tso, Martin K. Petersen, James Bottomley,
Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan
Chris Mason wrote:
> On Fri, 2008-11-07 at 15:26 -0500, Ric Wheeler wrote:
>
>> Matthew Wilcox wrote:
>>
>>> On Fri, Nov 07, 2008 at 03:19:13PM -0500, Theodore Tso wrote:
>>>
>>>
>>>> Let's be just a *little* bit fair here. Suppose we wanted to
>>>> implement thin-provisioned disks using devicemapper and LVM; consider
>>>> that LVM uses a default PE size of 4M for some very good reasons.
>>>> Asking filesystems to be a little smarter about allocation policies so
>>>> that we allocate in existing 4M chunks before going onto the next, and
>>>> asking the block layer to pool trim requests to 4M chunks is not
>>>> totally unreasonable.
>>>>
>>>> Array vendors use chunk sizes > than typical filesystem chunk sizes
>>>> for the same reason that LVM does. So to say that this is due to
>>>> purely a "broken firmware architecture" is a little unfair.
>>>>
>>>>
>>> I think we would have a full-throated discussion about whether the
>>> right thing to do was to put the tracking in the block layer or in LVM.
>>> Rather similar to what we're doing now, in fact.
>>>
>>>
>> You definitely could imagine having a device mapper target that could
>> track the discards commands and subsequent writes which would invalidate
>> the previous discards.
>>
>> Actually, it would be kind of nice to move all of this away from the
>> file systems entirely.
>>
>
> * Fast
> * Crash safe
> * Bounded ram usage
> * Accurately deliver the trims
>
> Pick any three ;) If we're dealing with large files, I can see it
> working well. For files that are likely to be smaller than the physical
> extent size, you end up with either extra state bits on disk (and
> keeping them in sync) or a log structured lvm.
>
> I do agree that an offline tool to account for bytes used would be able
> to make up for this, and from a thin provisioning point of view, we
> might be better off if we don't accurately deliver all the trims all the
> time.
>
Given that best practice more or less states that users need to have set
the high water mark sufficiently low to allow storage admins to react, I
think a tool like this would be very useful.
Think of how nasty it would be to run out of real blocks on a device
that seems to have plenty of unused capacity :-)
> People just use the space again soon anyway, I'd have to guess the
> filesystems end up in a steady state outside of special events.
>
> In another email Ted mentions that it makes sense for the FS allocator
> to notice we've just freed the last block in an aligned region of size
> X, and I'd agree with that.
>
> The trim command we send down when we free the block could just contain
> the entire range that is free (and easy for the FS to determine) every
> time.
>
> -chris
>
I think sending down the entire contiguous range of freed sectors would work well with these boxes...
ric
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 20:48 ` Chris Mason
2008-11-07 21:04 ` Ric Wheeler
@ 2008-11-07 21:13 ` Theodore Tso
1 sibling, 0 replies; 105+ messages in thread
From: Theodore Tso @ 2008-11-07 21:13 UTC (permalink / raw)
To: Chris Mason
Cc: Ric Wheeler, Matthew Wilcox, Martin K. Petersen, James Bottomley,
Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan
On Fri, Nov 07, 2008 at 03:48:30PM -0500, Chris Mason wrote:
>
> * Fast
> * Crash safe
> * Bounded ram usage
> * Accurately deliver the trims
>
> Pick any three ;)
Actually, if you move this responsibility into the FS block allocator,
I think you can get all four. You might pay a slight CPU cost for
determining whether the aligned region is free, depending on how the
filesystem's block allocation data structures are structured, but the
more I think about it, the more I like it.
It only has the downside that it would have to be implemented in every
filesystem separately, but hey, it's the block array vendors who would
be getting the big bucks from having this feature, so they should have
no problem putting up some bucks to make sure the OS implements it so
their users can take advantage of *their* feature. :-)
- Ted
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
[not found] ` <20081107202149.GJ15439@parisc-linux.org>
2008-11-07 20:26 ` Ric Wheeler
@ 2008-11-07 20:42 ` Theodore Tso
1 sibling, 0 replies; 105+ messages in thread
From: Theodore Tso @ 2008-11-07 20:42 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Martin K. Petersen, Ric Wheeler, Chris Mason, James Bottomley,
Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan
On Fri, Nov 07, 2008 at 01:21:49PM -0700, Matthew Wilcox wrote:
>
> I think we would have a full-throated discussion about whether the
> right thing to do was to put the tracking in the block layer or in LVM.
> Rather similar to what we're doing now, in fact.
Agreed. I'm just saying that what the array vendors are pushing for
is not totally unreasonable. This problem can be separated into two
issues. One is whether or not trim requests have to be 4 meg (or some
other size substantially bigger than filesystem block size) aligned,
and the other is whether the provisioning chunk size is 4 meg.
The latter would still work best with filesystems which are aware of
this fact and try hard to allocate so as to keep as many 4 meg chunks
as possible completely unused, and to try very hard to allocate from
4 meg chunks that are already partially in use.
Where the trim request coalescing happens is a more interesting
question. You can either do it in the filesystem, in the block device
layer, or in the storage array device itself. One interesting thought
is that perhaps it may actually make more sense to do it in the
filesystem. Since the filesystem has block allocation data structures
that already tell it which blocks are in use or not, there's no point
replicating that in the data array --- and so the filesystem can
detect when the last 4k block in a 4 meg chunk has been freed, and
then issue a single 4 meg TRIM/UNMAP request to the
block array. One advantage of doing it in the filesystem is that the
block allocation data structures are already journaled, and so by
keying this off the filesystem's block allocation structures, we won't
lose any potential TRIM requests even across a reboot. (In contrast,
if the block device or the storage array is managing a list of trim
requests in hopes of merging enough pieces to cover a 4 meg
aligned TRIM request, the in-memory rbtree is transient and would be
lost if the machine reboots.)
Sure, no filesystems do this now, but it's just a Small Matter of
Programming --- and array vendors like EMC (cough, cough), could
easily pay for some filesystem hackers to implement this for some
popular Linux filesystem. It could even be a directed funding program
through the Linux Foundation if EMC doesn't feel it has sufficient
people who have expertise in the upstream kernel development process. :-)
- Ted
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 20:19 ` Theodore Tso
2008-11-07 20:21 ` Matthew Wilcox
[not found] ` <20081107202149.GJ15439@parisc-linux.org>
@ 2008-11-07 21:06 ` Martin K. Petersen
2 siblings, 0 replies; 105+ messages in thread
From: Martin K. Petersen @ 2008-11-07 21:06 UTC (permalink / raw)
To: Theodore Tso
Cc: Martin K. Petersen, Ric Wheeler, Chris Mason, James Bottomley,
Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan, Matthew Wilcox
>>>>> "Ted" == Theodore Tso <tytso@mit.edu> writes:
Ted> Let's be just a *little* bit fair here. Suppose we wanted to
Ted> implement thin-provisioned disks using devicemapper and LVM;
Ted> consider that LVM uses a default PE size of 4M for some very good
Ted> reasons. Asking filesystems to be a little smarter about
Ted> allocation policies so that we allocate in existing 4M chunks
Ted> before going onto the next, and asking the block layer to pool
Ted> trim requests to 4M chunks is not totally unreasonable.
It would also be much easier for the array folks if we never wrote
anything less than 768KB and always on a 768KB boundary.
Ted> Array vendors use chunk sizes > than typical filesystem chunk
Ted> sizes for the same reason that LVM does. So to say that this is
Ted> due to purely a "broken firmware architecture" is a little
Ted> unfair.
Why? What is the advantage of doing it in Linux as opposed to in the
array firmware?
The issue at hand here is that we'll be issuing discards/trims/unmaps
and if they don't end up being multiples of 768KB starting on a 768KB
boundary the array is just going to ignore the command.
They expect us to keep track of what's used and what's unused within
that single chunk and let them know when we've completely cleared it
out.
The alternative is to walk the fs metadata occasionally, look for
properly aligned, completely unused chunks and then submit UNMAPs to
the array. That really seems like 1980's defrag technology to me.
I don't have a problem with arrays using bigger chunk sizes internally.
That's fine. What I don't see is why we have to carry the burden of
keeping track of what's being used and what's not based upon some
quasi-random value. Especially given that the array is going to
silently ignore any UNMAP requests that it doesn't like.
Array folks already have to keep track of their internal virtual to
physical mapping. Why shouldn't they have to maintain a bitmap or an
extent list as part of their internal metadata? Why should we have to
carry that burden?
And why would we want to go through all this hassle when it's not a
problem for disks or (so far) for mid-range storage devices that use
exactly the same command set?
What I'm objecting to is not coalescing of discard requests. Or
laying out filesystems intelligently. That's fine and I think we
should do it (heck, I'm working on that). What I'm heavily against is
having Linux carry the burden of keeping state around for stuff that's
really internal to the array firmware.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 19:55 ` Martin K. Petersen
2008-11-07 20:19 ` Theodore Tso
@ 2008-11-07 20:37 ` Ric Wheeler
2008-11-10 2:44 ` Black_David
1 sibling, 1 reply; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 20:37 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Theodore Tso, Ric Wheeler, Chris Mason, James Bottomley,
Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan, Matthew Wilcox
Martin K. Petersen wrote:
>>>>>> "Ted" == Theodore Tso <tytso@mit.edu> writes:
>>>>>>
>
> Ted> I thought ATA didn't have any TRIM alignment requirements, and
> Ted> it's T10 that wants to add it to the SCSI side?
>
> The current UNMAP proposal in SCSI doesn't have requirements either.
>
I think that is being actively debated & due to go out in the next
update. Not sure that it will matter much since vendors will not have
implemented it in their boxes (at least, not yet)....
ric
> Array vendors, suddenly realizing all the work they have to do to
> support this, are now talking about imposing additional constraints
> (orthogonal to the UNMAP command set) because of limitations in their
> existing firmware architectures.
>
> It's obviously much easier for the array vendors to export a Somebody
> Else's Problem VPD page containing a constant than it is to fix
> inherent limitations in their internal architecture.
>
> We're trying to point out that that's an unacceptable cop out for
> something that's clearly their problem to deal with.
>
> My concern is that if we start doing the array people's homework at
> the OS level they won't be inclined to fix their broken firmware
> design. Ever.
>
>
^ permalink raw reply [flat|nested] 105+ messages in thread
* RE: thin provisioned LUN support
2008-11-07 20:37 ` Ric Wheeler
@ 2008-11-10 2:44 ` Black_David
0 siblings, 0 replies; 105+ messages in thread
From: Black_David @ 2008-11-10 2:44 UTC (permalink / raw)
To: rwheeler, martin.petersen
Cc: tytso, chris.mason, James.Bottomley, jens.axboe, dwmw2,
linux-scsi, linux-fsdevel, coughlan, matthew
> > Ted> I thought ATA didn't have any TRIM alignment requirements, and
> > Ted> it's T10 that wants to add it to the SCSI side?
> >
> > The current UNMAP proposal in SCSI doesn't have requirements either.
>
> I think that is being actively debated & due to go out in the next
> update. Not sure that it will matter much since vendors will not have
> implemented it yet in their boxes (at least, not yet)....
Not exactly. The UNMAP proposal will have no alignment requirements
but will also make it up to the device (array) to decide what to unmap.
I expect implementations to pick out the chunks that can be unmapped
from the ranges that are passed in, and ignore the partial chunks.
As noted in a previous message, the chunk size will be reported in a
VPD page.
Thanks,
--David
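(A minimal sketch of the clipping behaviour David describes - whether
done array-side or in the block layer. The 768 KB chunk size and the
example range are assumptions for illustration; in practice the chunk
size would come from the VPD page mentioned above. Only the middle,
fully covered chunks of a requested range would actually be unmapped.)

/* Sketch: given an UNMAP request [start, start + len), keep only the
 * whole, chunk-aligned pieces and ignore the partial head and tail. */
#include <stdio.h>

int main(void)
{
        unsigned long long chunk = 768 * 1024;           /* example chunk */
        unsigned long long start = 1000000, len = 5000000;
        unsigned long long end = start + len;

        /* round start up and end down to chunk boundaries */
        unsigned long long first = (start + chunk - 1) / chunk * chunk;
        unsigned long long last  = end / chunk * chunk;

        if (first >= last)
                printf("request entirely ignored (no full chunk covered)\n");
        else
                printf("unmap bytes [%llu, %llu): %llu full chunk(s)\n",
                       first, last, (last - first) / chunk);
        return 0;
}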
^ permalink raw reply [flat|nested] 105+ messages in thread
* RE: thin provisioned LUN support
2008-11-07 19:35 ` Theodore Tso
2008-11-07 19:55 ` Martin K. Petersen
@ 2008-11-10 2:36 ` Black_David
1 sibling, 0 replies; 105+ messages in thread
From: Black_David @ 2008-11-10 2:36 UTC (permalink / raw)
To: tytso, rwheeler
Cc: chris.mason, James.Bottomley, martin.petersen, jens.axboe, dwmw2,
linux-scsi, linux-fsdevel, coughlan, matthew, Black_David
Ted,
> On Fri, Nov 07, 2008 at 01:41:35PM -0500, Ric Wheeler wrote:
> >> Will we be able to query the device and find out its TRIM/UNMAP
> >> alignment requirements? There is also a balance between performance
> >> (at least if the concern is sending too many separate TRIM commands)
> >> and giving the SSD more flexibility in its wear-leveling allocation
> >> decisions by sending TRIM commands sooner rather than later.
> >>
> > T10 is still working on the proposal for how to display unmap related
> > information for SCSI, so we don't even have a consistent way to find
> > this out today for this population.
>
> Yeah, I know, the rhetorical question was mostly addressed at David
> Black. :-)
Well, here's a real answer ... that should be coming in T10/08-149r5 as
an addition to the block device limits VPD page. Sorry for the delay
- I probably ought to remind Ric not to start this sort of discussion
*during* the T10 meetings - in contrast to the kernel, my brain is far
less effective at multi-tasking ;-).
FYI,
--David
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 18:36 ` Theodore Tso
2008-11-07 18:41 ` Ric Wheeler
[not found] ` <49148BDF.9050707@redhat.com>
@ 2008-11-07 19:44 ` jim owens
2008-11-07 19:48 ` Matthew Wilcox
2008-11-07 19:50 ` Ric Wheeler
2 siblings, 2 replies; 105+ messages in thread
From: jim owens @ 2008-11-07 19:44 UTC (permalink / raw)
To: Theodore Tso
Cc: Ric Wheeler, Chris Mason, James Bottomley, Martin K. Petersen,
Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan, Matthew Wilcox
Theodore Tso wrote:
> On Fri, Nov 07, 2008 at 01:09:48PM -0500, Ric Wheeler wrote:
>> I don't think that trim bugs should be that common - we just have to be
>> very careful never to send down a trim for any uncommitted block.
>>
>
> The trim code probably deserves a very aggressive unit test to make
> sure it works correctly, but yeah, we should be able to control any
> trim bugs.
>
>> Simple is always good, but I still think that the coalescing (even basic
>> coalescing) will be a critical performance feature.
>
> Will we be able to query the device and find out its TRIM/UNMAP
> alignment requirements? There is also a balance between performance
> (at least if the concern is sending too many separate TRIM commands)
> and giving the SSD more flexibility in its wear-leveling allocation
> decisions by sending TRIM commands sooner rather than later.
This is all good if the design is bounded by the requirements
of trim for flash devices. Because AFAIK the use of trim for
flash ssd is a performance optimization. The ssd won't lose
functionality if the trim is less than the chunk size. It may
run slower and wear out faster, but that is all.
If I understand correctly, with thin provisioning, unmapping
less than the chunk will not release that chunk for other use.
So you have lost the thin provision feature of the array.
The concern that (I think) Chris and I have is that doing a design
to handle thin provision arrays *when chunk > fs_block_size*
that guarantees you will *always* release on chunk boundaries
is a lot more complicated.
To do that you kind of have to build a filesystem into the
block layer to persistently store "mapped/unmapped blocks
in chunk" and then do the "unmap-this-chunk" when a region
is all unmapped.
250 MB per 1TiB 512b sector disk for a simple 1-bit-per-sector
state. And that assumes you don't replicate it for safety.
That is what the array vendors are trying to avoid by pushing
it off to the OS.
Whoever supports thin provisioning better get their unmapping
correct because those big customers will be looking for who
to blame if they don't get all the features.
jim
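(For what it's worth, the arithmetic behind that figure comes out to
roughly 256 MiB; a quick back-of-the-envelope check, not part of the
original mail.)

/* 1 TiB of 512-byte sectors, one bit of state per sector. */
#include <stdio.h>

int main(void)
{
        unsigned long long sectors = (1ULL << 40) / 512;   /* 2^31 sectors */
        unsigned long long bytes   = sectors / 8;          /* 1 bit each   */
        printf("%llu sectors -> %llu MiB of state\n",
               sectors, bytes >> 20);                      /* ~256 MiB     */
        return 0;
}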
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 19:44 ` jim owens
@ 2008-11-07 19:48 ` Matthew Wilcox
2008-11-07 19:50 ` Ric Wheeler
1 sibling, 0 replies; 105+ messages in thread
From: Matthew Wilcox @ 2008-11-07 19:48 UTC (permalink / raw)
To: jim owens
Cc: Theodore Tso, Ric Wheeler, Chris Mason, James Bottomley,
Martin K. Petersen, Jens Axboe, David Woodhouse, linux-scsi,
linux-fsdevel, Black_David, Tom Coughlan
On Fri, Nov 07, 2008 at 02:44:08PM -0500, jim owens wrote:
> To do that you kind of have to build a filesystem into the
> block layer to persistently store "mapped/unmapped blocks
> in chunk" and then do the "unmap-this-chunk" when a region
> is all unmapped.
>
> 250 MB per 1TiB 512b sector disk for a simple 1-bit-per-sector
> state. And that assumes you don't replicate it for safety.
> That is what the array vendors are trying to avoid by pushing
> it off to the OS.
And it's what the OS people are trying to avoid having to incorporate
by pushing back on the array vendors.
> Whoever supports thin provisioning better get their unmapping
> correct because those big customers will be looking for who
> to blame if they don't get all the features.
It's fairly clear that it's the array vendors at fault here. They've
designed a shitty product and they're trying to get us to compensate.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 19:44 ` jim owens
2008-11-07 19:48 ` Matthew Wilcox
@ 2008-11-07 19:50 ` Ric Wheeler
1 sibling, 0 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-07 19:50 UTC (permalink / raw)
To: jim owens
Cc: Theodore Tso, Chris Mason, James Bottomley, Martin K. Petersen,
Jens Axboe, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan, Matthew Wilcox
jim owens wrote:
> Theodore Tso wrote:
>> On Fri, Nov 07, 2008 at 01:09:48PM -0500, Ric Wheeler wrote:
>>> I don't think that trim bugs should be that common - we just have to
>>> be very careful never to send down a trim for any uncommitted block.
>>>
>>
>> The trim code probably deserves a very aggressive unit test to make
>> sure it works correctly, but yeah, we should be able to control any
>> trim bugs.
>>
>>> Simple is always good, but I still think that the coalescing (even
>>> basic coalescing) will be a critical performance feature.
>>
>> Will we be able to query the device and find out its TRIM/UNMAP
>> alignment requirements? There is also a balance between performance
>> (at least if the concern is sending too many separate TRIM commands)
>> and giving the SSD more flexibility in its wear-leveling allocation
>> decisions by sending TRIM commands sooner rather than later.
>
> This is all good if the design is bounded by the requirements
> of trim for flash devices. Because AFAIK the use of trim for
> flash ssd is a performance optimization. The ssd won't lose
> functionality if the trim is less than the chunk size. It may
> run slower and wear out faster, but that is all.
>
> If I understand correctly, with thin provisioning, unmapping
> less than the chunk will not release that chunk for other use.
> So you have lost the thin provision feature of the array.
>
> The concern that (I think) Chris and I have is that doing a design
> to handle thin provision arrays *when chunk > fs_block_size*
> that guarantees you will *always* release on chunk boundaries
> is a lot more complicated.
>
> To do that you kind of have to build a filesystem into the
> block layer to persistently store "mapped/unmapped blocks
> in chunk" and then do the "unmap-this-chunk" when a region
> is all unmapped.
>
> 250 MB per 1TiB 512b sector disk for a simple 1-bit-per-sector
> state. And that assumes you don't replicate it for safety.
> That is what the array vendors are trying to avoid by pushing
> it off to the OS.
>
> Whoever supports thin provisioning better get their unmapping
> correct because those big customers will be looking for who
> to blame if they don't get all the features.
>
> jim
I do think that what we have today is a reasonable start, especially if
we can do some coalescing of the unmap commands just like we do for
normal IO.
It does not have to be perfect, but it will work well for devices with a
reasonable chunk size.
A vendor could always supply a clean up script to run when we get too
out of sync between what the fs & storage device think is really
allocated. (Bringing back that wonderful model of windows defrag your
disk :-))
ric
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-07 16:00 ` Martin K. Petersen
2008-11-07 16:06 ` James Bottomley
@ 2008-11-09 23:36 ` Dave Chinner
2008-11-10 3:40 ` Thin provisioning & arrays Black_David
1 sibling, 1 reply; 105+ messages in thread
From: Dave Chinner @ 2008-11-09 23:36 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Chris Mason, Jens Axboe, David Woodhouse, James Bottomley,
Ric Wheeler, linux-scsi, linux-fsdevel, Black_David, Tom Coughlan,
Matthew Wilcox
On Fri, Nov 07, 2008 at 11:00:38AM -0500, Martin K. Petersen wrote:
> >>>>> "Chris" == Chris Mason <chris.mason@oracle.com> writes:
>
> Chris> Hmmm, it's surprising to me that arrays who tell us please use
> Chris> the noop elevator suddenly want us to merge discard requests.
> Chris> The array really needs to be able to deal with this internally.
>
> Let's also not forget that we're talking about merging discard
> requests for the purpose of making internal array housekeeping efficient.
> That involves merging discards up to the internal array block sizes
> which may be on the order of 512/768/1024 KB.
>
> If we were talking about merging discards up to a 4/8/16 KB boundary
> that might be something we'd have a chance to do within a reasonable
> amount of time (bigger than normal read/write I/O but not hours).
>
> But keeping discard state around for long enough to attempt to
> aggregate 768KB (and 768KB-aligned) chunks is icky.
Agreed.
It also ignores the fact that as filesystems age they will have
fewer and fewer aligned free chunks as the free space fragments.
Over time, arrays using large allocation chunks are simply going to
be full of wasted space as filesystem allocation patterns degrade
if the array vendors ignore this problem.
And no matter what us filesystem developers do, there is always
going to be degradation in allocation patterns as the filesystems
fill up and age. While we can try to improve aging behaviour, it
doesn't solve the problem for array vendors - they need to be
smarter about their allocation and block mapping....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* Thin provisioning & arrays
2008-11-09 23:36 ` Dave Chinner
@ 2008-11-10 3:40 ` Black_David
2008-11-10 8:31 ` Dave Chinner
0 siblings, 1 reply; 105+ messages in thread
From: Black_David @ 2008-11-10 3:40 UTC (permalink / raw)
To: david, martin.petersen
Cc: chris.mason, jens.axboe, dwmw2, James.Bottomley, rwheeler,
linux-scsi, linux-fsdevel, coughlan, matthew, Black_David
Wow, this discussion can generate a lot of traffic .... I think
Dave Chinner's recent message is as good a place as any to start:
> It also ignores the fact that as filesystems age they will have
> fewer and fewer aligned free chunks as the free space fragments.
> Over time, arrays using large allocation chunks are simply going to
> be full of wasted space as filesystem allocation patterns degrade
> if the array vendors ignore this problem.
>
> And no matter what us filesystem developers do, there is always
> going to be degradation in allocation patterns as the filesystems
> fill up and age. While we can try to improve aging behaviour, it
> doesn't solve the problem for array vendors - they need to be
> smarter about their allocation and block mapping....
I can't argue with that - this sort of internal fragmentation is
a consequence of using a large thin provisioning chunk size. As
for why array vendors did this (and EMC is not the only vendor
that uses a large chunk size), the answer is based on the original
motivating customer situation, for example:
- The database admin insists that this new application needs 4TB,
so 4TB of storage is provisioned.
- 3 months later, the application is using 200GB, and not growing
much, if at all.
Even a 1GB chunk size makes a big difference for this example ...
As for arrays and block tracking, EMC arrays work with 4kB blocks
internally. A write of smaller than 4kB will often result in
reading the rest of the block from the disk into cache. The
track size that Ric mentioned (64k) is used to manage on-disk
capacity, but the array knows how to do partial track writes.
As for what to optimize for, the chunk size is going to vary widely
across different arrays (even EMC's CLARiiON won't use the same
chunk size as Symmetrix). Different array implementers will make
different decisions about how much state is reasonable to keep.
My take on this is that I agree with Ric's comment:
> In another email Ted mentions that it makes sense for the FS allocator
> to notice we've just freed the last block in an aligned region of size
> X, and I'd agree with that.
>
> The trim command we send down when we free the block could just contain
> the entire range that is free (and easy for the FS to determine) every
> time.
In other words, the filesystem ought to do a small amount of work to
send down the largest (reasonable) range that it knows is free - this
seems likely to be more effective than relying on the elevators to
make this happen.
There will be a chunk size value available in a VPD page that can be
used to determine minimum size/alignment. For openers, I see essentially
no point in a 512-byte UNMAP, even though it's allowed by the standard -
I suspect most arrays (and many SSDs) will ignore it, and ignoring
it is definitely within the spirit of the proposed T10 standard (hint:
I'm one of the people directly working on that proposal). OTOH, it
may not be possible to frequently assemble large chunks for arrays
that use them, and I agree with Dave Chinner's remarks on free space
fragmentation (quoted above) - there's no "free lunch" there.
Beyond this, I think there may be an underlying assumption that the
array and filesystem ought to be in sync (or close to it); I'd question
that, based on the overhead and diminishing marginal returns of trying
to get ever-closer sync. Elsewhere, mention has been made of having the
filesystem's free list behavior be LIFO-like, rather than constantly
allocating new blocks from previously-unused space. IMHO, that's a
good idea for thin provisioning.
Now, if the workload running on the filesystem causes the capacity
used to stay within a range, there will be a set of relatively "hot"
blocks on the free list that are being frequently freed and reallocated.
It's a performance win not to UNMAP those blocks (saves work in both
the kernel and on the array), and hence to have the filesystem and
array views of what's in use not line up.
Despite the derogatory comment about defrag, it's making a comeback.
I already know of two thin-provisioning-specific defrag utilities from
other storage vendors (neither is for Linux, AFAIK). While defrag is
not a wonderful solution, it does free up space in large contiguous
ranges, and most of what it frees will be "cold".
Thanks,
--David
p.s. Apologies in advance for slow responses - my Monday "crisis"
is already scheduled ...
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-10 3:40 ` Thin provisioning & arrays Black_David
@ 2008-11-10 8:31 ` Dave Chinner
2008-11-10 9:59 ` David Woodhouse
0 siblings, 1 reply; 105+ messages in thread
From: Dave Chinner @ 2008-11-10 8:31 UTC (permalink / raw)
To: Black_David
Cc: martin.petersen, chris.mason, jens.axboe, dwmw2, James.Bottomley,
rwheeler, linux-scsi, linux-fsdevel, coughlan, matthew
On Sun, Nov 09, 2008 at 10:40:24PM -0500, Black_David@emc.com wrote:
> Wow, this discussion can generate a lot of traffic .... I think
> Dave Chinner's recent message is as good a place as any to start:
>
> > It also ignores the fact that as filesystems age they will have
> > fewer and fewer aligned free chunks as the free space fragments.
> > Over time, arrays using large allocation chunks are simply going to
> > be full of wasted space as filesystem allocation patterns degrade
> > if the array vendors ignore this problem.
[snip a bunch of stuff I can't add anything to ;]
> My take on this is that I agree with Ric's comment:
>
> > In another email Ted mentions that it makes sense for the FS allocator
> > to notice we've just freed the last block in an aligned region of size
> > X, and I'd agree with that.
> >
> > The trim command we send down when we free the block could just contain
> > the entire range that is free (and easy for the FS to determine) every
> > time.
>
> In other words, the filesystem ought to do a small amount of work to
> send down the largest (reasonable) range that it knows is free - this
> seems likely to be more effective than relying on the elevators to
> make this happen.
>
> There will be a chunk size value available in a VPD page that can be
> used to determine minimum size/alignment. For openers, I see essentially
> no point in a 512-byte UNMAP, even though it's allowed by the standard -
> I suspect most arrays (and many SSDs) will ignore it, and ignoring
> it is definitely within the spirit of the proposed T10 standard (hint:
> I'm one of the people directly working on that proposal).
I think this is the crux of the issue. IMO, it's not much of a standard
when the spirit of the standard is to allow everyone to implement
different, non-deterministic behaviour....
> OTOH, it
> may not be possible to frequently assemble large chunks for arrays
> that use them, and I agree with Dave Chinner's remarks on free space
> fragmentation (quoted above) - there's no "free lunch" there.
> Beyond this, I think there may be an underlying assumption that the
> array and filesystem ought to be in sync (or close to it); I'd question
> that, based on the overhead and diminishing marginal returns of trying
> to get ever-closer sync.
I'm not sure I follow - it's possible to have perfect
synchronisation between the array and the filesystem, including
crash recovery.
> Elsewhere, mention has been made of having the
> filesystem's free list behavior be LIFO-like, rather than constantly
> allocating new blocks from previously-unused space. IMHO, that's a
> good idea for thin provisioning.
Which is at odds with preventing fragmentation and minimising the
effect of aging on the filesystem. This, in turn, is bad for thin
provisioning because file fragmentation leads to free space
fragmentation over time.
I think the overall goal for filesystems in thin provisioned
environments should be to minimise free space fragmentation - it's
when you fail to have large contiguous regions of free space in
the filesystem that thin provisioning becomes difficult. How this
is achieved will be different for every filesystem.
> Now, if the workload running on the filesystem causes the capacity
> used to stay within a range, there will be a set of relatively "hot"
> blocks on the free list that are being frequently freed and reallocated.
> It's a performance win not to UNMAP those blocks (saves work in both
> the kernel and on the array), and hence to have the filesystem and
> array views of what's in use not line up.
In that case, the filesystem tracks what it has not issued unmaps
on, so really there is no discrepancy between the filesystem and the
array in terms of free space. The filesystem simply has a "free but
not quite free" list of blocks that haven't been unmapped.
This is like the typical two-stage inode delete that most
journalling filesystems use - one stage to remove it from the
namespace and move it to a "to be freed" list, and then a second
stage to really free it. Inodes on the "to be freed" list can be
reused without being freed, and if a crash occurs they can be
really freed up during recovery. Issuing unmaps is conceptually
little different to this.....
> Despite the derogatory comment about defrag, it's making a comeback.
> I already know of two thin-provisioning-specific defrag utilities from
> other storage vendors (neither is for Linux, AFAIK). While defrag is
> not a wonderful solution, it does free up space in large contiguous
> ranges, and most of what it frees will be "cold".
The problem is that it is the wrong model to be using for thin
provisioning. It assumes that unmapping blocks as we free them
is fundamentally broken - if unmapping as we go works and is made
reliable, then there is no need for such a defrag tool. Unmapping
can and should be made reliable so that we don't have to waste
effort trying to fix up mismatches that shouldn't have occurred in
the first place...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
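(A toy sketch of the "free but not quite free" list Dave describes,
using invented structures rather than any real filesystem's: freed
blocks are parked on a pending list where they can be reallocated
cheaply, and whatever is still parked at commit time is unmapped and
then moved to the ordinary free pool - the discard analogue of the
two-stage inode delete mentioned above.)

/* Sketch of a deferred-unmap free list. Illustrative userspace code. */
#include <stdio.h>

#define PENDING_MAX 16

static unsigned long pending[PENDING_MAX];
static int npending;

static void fs_free_block(unsigned long blk)
{
        if (npending < PENDING_MAX)
                pending[npending++] = blk;   /* "free but not quite free" */
}

static int fs_alloc_block(unsigned long *blk)
{
        if (npending == 0)
                return -1;                   /* fall back to normal allocator */
        *blk = pending[--npending];          /* reuse without ever unmapping */
        return 0;
}

static void fs_commit(void)
{
        int i;

        for (i = 0; i < npending; i++)
                printf("UNMAP block %lu, then mark it truly free\n",
                       pending[i]);
        npending = 0;
}

int main(void)
{
        unsigned long blk;

        fs_free_block(100);
        fs_free_block(101);
        if (fs_alloc_block(&blk) == 0)       /* block 101 reused, no unmap */
                printf("reused block %lu without an unmap\n", blk);
        fs_commit();                         /* block 100 unmapped at commit */
        return 0;
}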
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-10 8:31 ` Dave Chinner
@ 2008-11-10 9:59 ` David Woodhouse
2008-11-10 13:30 ` Matthew Wilcox
` (2 more replies)
0 siblings, 3 replies; 105+ messages in thread
From: David Woodhouse @ 2008-11-10 9:59 UTC (permalink / raw)
To: Dave Chinner
Cc: Black_David, martin.petersen, chris.mason, jens.axboe,
James.Bottomley, rwheeler, linux-scsi, linux-fsdevel, coughlan,
matthew
On Mon, 2008-11-10 at 19:31 +1100, Dave Chinner wrote:
> On Sun, Nov 09, 2008 at 10:40:24PM -0500, Black_David@emc.com wrote:
> > There will be a chunk size value available in a VPD page that can be
> > used to determine minimum size/alignment. For openers, I see essentially
> > no point in a 512-byte UNMAP, even though it's allowed by the standard -
> > I suspect most arrays (and many SSDs) will ignore it, and ignoring
> > it is definitely within the spirit of the proposed T10 standard (hint:
> > I'm one of the people directly working on that proposal).
>
> I think this is the crux of the issue. IMO, it's not much of a standard
> when the spirit of the standard is to allow everyone to implement
> different, non-deterministic behaviour....
I disagree. The discard request is a _hint_ from the upper layers, and
the storage device can act on that hint as it sees fit. There's nothing
wrong with that; it doesn't make it "not much of a standard".
Storage devices are complex enough that they _already_ exhibit behaviour
which is fairly much non-deterministic in a number of ways. Especially
if we're talking about SSDs or large arrays, rather than just disks.
A standard needs to be clear about what _is_ guaranteed, and what is
_not_ guaranteed. If it is explicit that the storage device is permitted
to ignore the discard hint, and some storage devices do so under some
circumstances, then that is just fine.
> Unmapping can and should be made reliable so that we don't have to
> waste effort trying to fix up mismatches that shouldn't have occurred
> in the first place...
Perhaps so. But remember, this can only really be considered a
correctness issue on thin-provisioned arrays -- because they may run out
of space sooner than they should. But that kind of failure mode is
something that is explicitly accepted by those designing and using such
thin-provisioned arrays. It's not as if we're introducing any _new_ kind
of problem.
So I think it's perfectly acceptable for the operating system to treat
discard requests as a hint, with best-effort semantics. And any device
which _really_ cares will need to make sure for _itself_ that it handles
those hints reliably.
--
David Woodhouse Open Source Technology Centre
David.Woodhouse@intel.com Intel Corporation
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-10 9:59 ` David Woodhouse
@ 2008-11-10 13:30 ` Matthew Wilcox
2008-11-10 13:36 ` Jens Axboe
2008-11-10 17:05 ` UNMAP is a hint Black_David
2008-11-10 22:18 ` Thin provisioning & arrays Dave Chinner
2 siblings, 1 reply; 105+ messages in thread
From: Matthew Wilcox @ 2008-11-10 13:30 UTC (permalink / raw)
To: David Woodhouse
Cc: Dave Chinner, Black_David, martin.petersen, chris.mason,
jens.axboe, James.Bottomley, rwheeler, linux-scsi, linux-fsdevel,
coughlan
On Mon, Nov 10, 2008 at 10:59:49AM +0100, David Woodhouse wrote:
> Storage devices are complex enough that they _already_ exhibit behaviour
> which is fairly much non-deterministic in a number of ways. Especially
> if we're talking about SSDs or large arrays, rather than just disks.
If anything, SSDs are more deterministic than rotating storage.
Variable numbers of sectors per track, unpredictable sector remapping,
track re-reads due to errors during reads ... SSDs seem like a real
improvement.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-10 13:30 ` Matthew Wilcox
@ 2008-11-10 13:36 ` Jens Axboe
0 siblings, 0 replies; 105+ messages in thread
From: Jens Axboe @ 2008-11-10 13:36 UTC (permalink / raw)
To: Matthew Wilcox
Cc: David Woodhouse, Dave Chinner, Black_David, martin.petersen,
chris.mason, James.Bottomley, rwheeler, linux-scsi, linux-fsdevel,
coughlan
On Mon, Nov 10 2008, Matthew Wilcox wrote:
> On Mon, Nov 10, 2008 at 10:59:49AM +0100, David Woodhouse wrote:
> > Storage devices are complex enough that they _already_ exhibit behaviour
> > which is fairly much non-deterministic in a number of ways. Especially
> > if we're talking about SSDs or large arrays, rather than just disks.
>
> If anything, SSDs are more deterministic than rotating storage.
> Variable numbers of sectors per track, unpredictable sector remapping,
> track re-reads due to errors during reads ... SSDs seem like a real
> improvement.
This was also discussed at LPC, and I would tend to disagree. On
rotating storage, slight skews here and there are seen, but rarely
anything major. On SSD's, when the GC kicks in it can be fairly
detrimental to performance. And one run doesn't necessarily compare to
the next run, even if you start with the same seed.
--
Jens Axboe
^ permalink raw reply [flat|nested] 105+ messages in thread
* UNMAP is a hint
2008-11-10 9:59 ` David Woodhouse
2008-11-10 13:30 ` Matthew Wilcox
@ 2008-11-10 17:05 ` Black_David
2008-11-10 17:30 ` Matthew Wilcox
2008-11-10 22:18 ` Thin provisioning & arrays Dave Chinner
2 siblings, 1 reply; 105+ messages in thread
From: Black_David @ 2008-11-10 17:05 UTC (permalink / raw)
To: dwmw2, david
Cc: martin.petersen, chris.mason, jens.axboe, James.Bottomley,
rwheeler, linux-scsi, linux-fsdevel, coughlan, matthew,
Black_David
> On Mon, 2008-11-10 at 19:31 +1100, Dave Chinner wrote:
> > On Sun, Nov 09, 2008 at 10:40:24PM -0500, Black_David@emc.com wrote:
> > > There will be a chunk size value available in a VPD page that can be
> > > used to determine minimum size/alignment. For openers, I see essentially
> > > no point in a 512-byte UNMAP, even though it's allowed by the standard -
> > > I suspect most arrays (and many SSDs) will ignore it, and ignoring
> > > it is definitely within the spirit of the proposed T10 standard (hint:
> > > I'm one of the people directly working on that proposal).
> >
> > I think this is the crux of the issue. IMO, it's not much of a standard
> > when the spirit of the standard is to allow everyone to implement
> > different, non-deterministic behaviour....
>
> I disagree. The discard request is a _hint_ from the upper layers, and
> the storage device can act on that hint as it sees fit. There's nothing
> wrong with that; it doesn't make it "not much of a standard".
Bingo! That is exactly the spirit and thinking behind the UNMAP
proposal.
Besides, UNMAP is already inherently non-deterministic in that only the
device knows what value will result from reading an unmapped block
(unless the "unmapped blocks always read as zero" bit is set by the
device).
[... snip ...]
> So I think it's perfectly acceptable for the operating system to treat
> discard requests as a hint, with best-effort semantics. And any device
> which _really_ cares will need to make sure for _itself_ that it handles
> those hints reliably.
I agree, at least to the extent that the device makes sure that it
reliably handles the hints that it cares about.
Thanks,
--David
----------------------------------------------------
David L. Black, Distinguished Engineer
EMC Corporation, 176 South St., Hopkinton, MA 01748
+1 (508) 293-7953 FAX: +1 (508) 293-7786
black_david@emc.com Mobile: +1 (978) 394-7754
----------------------------------------------------
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: UNMAP is a hint
2008-11-10 17:05 ` UNMAP is a hint Black_David
@ 2008-11-10 17:30 ` Matthew Wilcox
2008-11-10 17:56 ` Ric Wheeler
0 siblings, 1 reply; 105+ messages in thread
From: Matthew Wilcox @ 2008-11-10 17:30 UTC (permalink / raw)
To: Black_David
Cc: dwmw2, david, martin.petersen, chris.mason, jens.axboe,
James.Bottomley, rwheeler, linux-scsi, linux-fsdevel, coughlan
On Mon, Nov 10, 2008 at 12:05:57PM -0500, Black_David@emc.com wrote:
> > On Mon, 2008-11-10 at 19:31 +1100, Dave Chinner wrote:
> > > I think this is the crux of the issue. IMO, it's not much of a standard
> > > when the spirit of the standard is to allow everyone to implement
> > > different, non-deterministic behaviour....
> >
> > I disagree. The discard request is a _hint_ from the upper layers, and
> > the storage device can act on that hint as it sees fit. There's nothing
> > wrong with that; it doesn't make it "not much of a standard".
>
> Bingo! That is exactly the spirit and thinking behind the UNMAP
> proposal.
While that may be, it's hardly the spirit that Ric (at least) has been
promoting with dire warnings about how 'Enterprise class' customers will
react if Linux does the wrong thing for EMC arrays with discard/trim/unmap.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: UNMAP is a hint
2008-11-10 17:30 ` Matthew Wilcox
@ 2008-11-10 17:56 ` Ric Wheeler
0 siblings, 0 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-10 17:56 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Black_David, dwmw2, david, martin.petersen, chris.mason,
jens.axboe, James.Bottomley, linux-scsi, linux-fsdevel, coughlan
Matthew Wilcox wrote:
> On Mon, Nov 10, 2008 at 12:05:57PM -0500, Black_David@emc.com wrote:
>
>>> On Mon, 2008-11-10 at 19:31 +1100, Dave Chinner wrote:
>>>
>>>> I think this is the crux of the issue. IMO, it's not much of a standard
>>>> when the spirit of the standard is to allow everyone to implement
>>>> different, non-deterministic behaviour....
>>>
>>> I disagree. The discard request is a _hint_ from the upper layers, and
>>> the storage device can act on that hint as it sees fit. There's nothing
>>> wrong with that; it doesn't make it "not much of a standard".
>>>
>> Bingo! That is exactly the spirit and thinking behind the UNMAP
>> proposal.
>>
>
> While that may be, it's hardly the spirit that Ric (at least) has been
> promoting with dire warnings about how 'Enterprise class' customers will
> react if Linux does the wrong thing for EMC arrays with discard/trim/unmap.
>
>
It would be nice to have arrays that can handle an OS that gives them
perfect information (as our current code should do) regardless of the
alignment and size of requests.
Whether something else would be good enough is a reasonable question,
but I fear lots of disgruntled customers ;-)
ric
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-10 9:59 ` David Woodhouse
2008-11-10 13:30 ` Matthew Wilcox
2008-11-10 17:05 ` UNMAP is a hint Black_David
@ 2008-11-10 22:18 ` Dave Chinner
2008-11-11 1:23 ` Black_David
2 siblings, 1 reply; 105+ messages in thread
From: Dave Chinner @ 2008-11-10 22:18 UTC (permalink / raw)
To: David Woodhouse
Cc: Black_David, martin.petersen, chris.mason, jens.axboe,
James.Bottomley, rwheeler, linux-scsi, linux-fsdevel, coughlan,
matthew
On Mon, Nov 10, 2008 at 10:59:49AM +0100, David Woodhouse wrote:
> On Mon, 2008-11-10 at 19:31 +1100, Dave Chinner wrote:
> > On Sun, Nov 09, 2008 at 10:40:24PM -0500, Black_David@emc.com wrote:
> > > There will be a chunk size value available in a VPD page that can be
> > > used to determine minimum size/alignment. For openers, I see essentially
> > > no point in a 512-byte UNMAP, even though it's allowed by the standard -
> > > I suspect most arrays (and many SSDs) will ignore it, and ignoring
> > > it is definitely within the spirit of the proposed T10 standard (hint:
> > > I'm one of the people directly working on that proposal).
> >
> > I think this is the crux of the issue. IMO, it's not much of a standard
> > when the spirit of the standard is to allow everyone to implement
> > different, non-deterministic behaviour....
>
> I disagree. The discard request is a _hint_ from the upper layers, and
> the storage device can act on that hint as it sees fit. There's nothing
> wrong with that; it doesn't make it "not much of a standard".
If it's not reliable, then it is effectively useless from a
design perspective. The fact that it is being treated as a hint
means that everyone is going to require "defrag" tools to clean
up the mess when the array runs out of space.
Treating it as a reliable command (i.e. it succeeds or returns
an error) means that we can implement filesystems that can do
unmapping in such a way that when the array reports that it is out
of space we *know* that there is no free space that can be unmapped.
i.e. no need for a "defrag" tool.
The defrag tool approach is a cop-out. It simply does not scale to
environments where you have hundreds of luns spread over hundreds of
machines, and each of them needs to be "defragged" individually to
find all the unmappable space in the array. It gets worse in the
virtualised space where you might have tens of virtual machines
using each lun.
This is why unmap as a hint is a fundamentally broken model from an
overall storage stack perspective, no matter how appealing it is to
array vendors....
> Storage devices are complex enough that they _already_ exhibit behaviour
> which is fairly much non-deterministic in a number of ways. Especially
> if we're talking about SSDs or large arrays, rather than just disks.
> A standard needs to be clear about what _is_ guaranteed, and what is
> _not_ guaranteed. If it is explicit that the storage device is permitted
> to ignore the discard hint, and some storage devices do so under some
> circumstances, then that is just fine.
Right, it's non-deterministic even within a single device. That
makes it impossible to implement something reliable because the
higher layers are not provided with any guarantee they can rely
on. A hint is useless from a design perspective - guarantees are
required for reliable operation and if we are not designing new
storage features with reliability as a primary concern then we
are wasting our time...
> > Unmapping can and should be made reliable so that we don't have to
> > waste effort trying to fix up mismatches that shouldn't have occurred
> > in the first place...
>
> Perhaps so. But remember, this can only really be considered a
> correctness issue on thin-provisioned arrays -- because they may run out
> of space sooner than they should. But that kind of failure mode is
> something that is explicitly accepted by those designing and using such
> thin-provisioned arrays. It's not as if we're introducing any _new_ kind
> of problem.
Very true. But this is not a justification for not providing a
reliable unmapping service. If anything it's justification for being
reliable; that when you finally run out of space, there really is no
more space available....
Defrag is not the answer here.
> So I think it's perfectly acceptable for the operating system to treat
> discard requests as a hint, with best-effort semantics. And any device
> which _really_ cares will need to make sure for _itself_ that it handles
> those hints reliably.
So how do you propose that a storage architect who is trying to
design a reliable thin provisioning storage stack finds out which
devices actually do reliable unmapping? Vendors are simply going
to say they support the unmap command, which currently means
anything from "ignore completely" to "always do the right thing".
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* RE: Thin provisioning & arrays
2008-11-10 22:18 ` Thin provisioning & arrays Dave Chinner
@ 2008-11-11 1:23 ` Black_David
2008-11-11 2:09 ` Keith Owens
2008-11-11 22:49 ` Dave Chinner
0 siblings, 2 replies; 105+ messages in thread
From: Black_David @ 2008-11-11 1:23 UTC (permalink / raw)
To: david, dwmw2
Cc: martin.petersen, chris.mason, jens.axboe, James.Bottomley,
rwheeler, linux-scsi, linux-fsdevel, coughlan, matthew,
Black_David
Dave,
> Treating it as a reliable command (i.e. it succeeds or returns
> an error) means that we can implement filesystems that can do
> unmapping in such a way that when the array reports that it is out
> of space we *know* that there is no free space that can be unmapped.
> i.e. no need for a "defrag" tool.
What if the filesystem block size and the array thin provisioning
chunk size don't match? It's still "defrag" time ...
> So how do you propose that a storage architect who is trying to
> design a reliable thin provisioning storage stack finds out which
> devices actually do reliable unmapping? Vendors are simply going
> to say they support the unmap command, which currently means
> anything from "ignore completely" to "always do the right thing".
The thin provisioning chunk size (coming) in the VPD page is a
possible place to start.
Do you want something that says "if an aligned multiple of the chunk
size is sent in UNMAP, then it will be unmapped?". That may be
plausible, but I don't want to hit an UNMAP that isn't an aligned
multiple with a CHECK CONDITION if there's something useful to do.
Thanks,
--David
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-11 1:23 ` Black_David
@ 2008-11-11 2:09 ` Keith Owens
2008-11-11 13:59 ` Ric Wheeler
2008-11-11 22:49 ` Dave Chinner
1 sibling, 1 reply; 105+ messages in thread
From: Keith Owens @ 2008-11-11 2:09 UTC (permalink / raw)
To: Black_David
Cc: david, dwmw2, martin.petersen, chris.mason, jens.axboe,
James.Bottomley, rwheeler, linux-scsi, linux-fsdevel, coughlan,
matthew
On Mon, 10 Nov 2008 20:23:17 -0500,
Black_David@emc.com wrote:
>Dave,
>The thin provisioning chunk size (coming) in the VPD page is a
>possible place to start.
VPD page for which device? Consider a filesystem that is striped
across devices from multiple arrays or even multiple vendors. How is
the filesystem supposed to "align" an unmap command when the underlying
disks all have different alignments?
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-11 2:09 ` Keith Owens
@ 2008-11-11 13:59 ` Ric Wheeler
2008-11-11 14:55 ` jim owens
0 siblings, 1 reply; 105+ messages in thread
From: Ric Wheeler @ 2008-11-11 13:59 UTC (permalink / raw)
To: Keith Owens
Cc: Black_David, david, dwmw2, martin.petersen, chris.mason,
jens.axboe, James.Bottomley, linux-scsi, linux-fsdevel, coughlan,
matthew
Keith Owens wrote:
> On Mon, 10 Nov 2008 20:23:17 -0500,
> Black_David@emc.com wrote:
>
>> Dave,
>> The thin provisioning chunk size (coming) in the VPD page is a
>> possible place to start.
>>
>
> VPD page for which device? Consider a filesystem that is striped
> across devices from multiple arrays or even multiple vendors. How is
> the filesystem supposed to "align" an unmap command when the underlying
> disks all have different alignments?
>
>
I think that we are losing focus on the core use case here. Big arrays
that implement thin luns also implement RAID in the box. If you are
building a clustered file system, it would be extremely unlikely to
build it with storage from different vendors.
You always have the option to disable thin luns or simply fully
provision LUN's for more complex situations.
This is being pitched to answer a very specific customer use case -
shared storage (mid to high end almost exclusively) with several
different users and applications....
ric
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-11 13:59 ` Ric Wheeler
@ 2008-11-11 14:55 ` jim owens
2008-11-11 15:38 ` Ric Wheeler
2008-11-11 23:08 ` Dave Chinner
0 siblings, 2 replies; 105+ messages in thread
From: jim owens @ 2008-11-11 14:55 UTC (permalink / raw)
To: Ric Wheeler
Cc: Keith Owens, Black_David, david, dwmw2, martin.petersen,
chris.mason, jens.axboe, James.Bottomley, linux-scsi,
linux-fsdevel, coughlan, matthew
Ric Wheeler wrote:
> This is being pitched to answer a very specific customer use case -
> shared storage (mid to high end almost exclusively) with several
> different users and applications....
And by "different users" these customers almost always mean
different operating systems. They are combining storage into
a central location for easier management.
So "exact unmapped tracking by the filesystem" is impossible
and not part of the requirement. Doesn't mean we can't make our
filesystems better, but forget about a perfect ability to know
just how much space we really have once we do an unmap.
We can't tell how much of our unmapped space the device has
given away to someone else and we cannot prevent the device
from failing a write to an unmapped block if all the space
is gone. It is just an IO error, and possibly fs-is-offline
if that block we failed was metadata!
It is up to the customer to manage their storage so it never
reaches the unable-to-write state.
jim
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-11 14:55 ` jim owens
@ 2008-11-11 15:38 ` Ric Wheeler
2008-11-11 15:59 ` jim owens
2008-11-11 23:08 ` Dave Chinner
1 sibling, 1 reply; 105+ messages in thread
From: Ric Wheeler @ 2008-11-11 15:38 UTC (permalink / raw)
To: jim owens
Cc: Keith Owens, Black_David, david, dwmw2, martin.petersen,
chris.mason, jens.axboe, James.Bottomley, linux-scsi,
linux-fsdevel, coughlan, matthew
jim owens wrote:
> Ric Wheeler wrote:
>> This is being pitched to answer a very specific customer use case -
>> shared storage (mid to high end almost exclusively) with several
>> different users and applications....
>
> And by "different users" these customers almost always mean
> different operating systems. They are combining storage into
> a central location for easier management.
When you have one specific LUN exported from an array, it is owned by
one OS. You can definitely have different LUN's used by different OS's,
but that seems to be irrelevant to our challenges here, right?
>
> So "exact unmapped tracking by the filesystem" is impossible
> and not part of the requirement. Doesn't mean we can't make our
> filesystems better, but forget about a perfect ability to know
> just how much space we really have once we do an unmap.
My understanding is that most of this kind of information (how much real
space is provisioned/utilized/etc) is handled out of band by a user
space app.
>
> We can't tell how much of our unmapped space the device has
> given away to someone else and we cannot prevent the device
> from failing a write to an unmapped block if all the space
> is gone. It is just an IO error, and possibly fs-is-offline
> if that block we failed was metadata!
This is where things really fall apart - odd IO errors on a device that
seems to us to have lots of space. If it becomes common in the field, I
suspect that users will flee thin luns :-) I also understand that other
os'es are similarly unable to react.
>
> It is up to the customer to manage their storage so it never
> reaches the unable-to-write state.
>
> jim
Agreed - the high water marks should be set to allow the sys admin
(storage admin?) to reallocate space....
ric
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-11 15:38 ` Ric Wheeler
@ 2008-11-11 15:59 ` jim owens
2008-11-11 16:25 ` Ric Wheeler
0 siblings, 1 reply; 105+ messages in thread
From: jim owens @ 2008-11-11 15:59 UTC (permalink / raw)
To: Ric Wheeler
Cc: Keith Owens, Black_David, david, dwmw2, martin.petersen,
chris.mason, jens.axboe, James.Bottomley, linux-scsi,
linux-fsdevel, coughlan, matthew
Ric Wheeler wrote:
> jim owens wrote:
>>
>> And by "different users" these customers almost always mean
>> different operating systems. They are combining storage into
>> a central location for easier management.
>
> When you have one specific LUN exported from an array, it is owned by
> one OS. You can definitely have different LUN's used by different OS's,
> but that seems to be irrelevant to our challenges here, right?
But the total thin storage pool is shared by multiple luns
and thus maybe multiple not-able-to-cooperate hosts.
I was only pointing this out because earlier threads seemed
to be "linux filesystems to be exact across multiple hosts"
(which is really a cluster design) and even if we did that
for linux it would not solve the customer need.
I just wanted to make it clear why trying to do a complicated
change to linux for exactness is pointless because the customer
requirement is for more than linux attached to the thin pool.
So the relevance is our design boundary.
jim
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-11 15:59 ` jim owens
@ 2008-11-11 16:25 ` Ric Wheeler
2008-11-11 16:53 ` jim owens
0 siblings, 1 reply; 105+ messages in thread
From: Ric Wheeler @ 2008-11-11 16:25 UTC (permalink / raw)
To: jim owens
Cc: Keith Owens, Black_David, david, dwmw2, martin.petersen,
chris.mason, jens.axboe, James.Bottomley, linux-scsi,
linux-fsdevel, coughlan, matthew
jim owens wrote:
> Ric Wheeler wrote:
>> jim owens wrote:
>>>
>>> And by "different users" these customers almost always mean
>>> different operating systems. They are combining storage into
>>> a central location for easier management.
>>
>> When you have one specific LUN exported from an array, it is owned by
>> one OS. You can definitely have different LUN's used by different
>> OS's, but that seems to be irrelevant to our challenges here, right?
>
> But the total thin storage pool is shared by multiple luns
> and thus maybe multiple not-able-to-cooperate hosts.
agreed...
>
> I was only pointing this out because earlier threads seemed
> to be "linux filesystems to be exact across multiple hosts"
> (which is really a cluster design) and even if we did that
> for linux it would not solve the customer need.
>
> I just wanted to make it clear why trying to do a complicated
> change to linux for exactness is pointless because the customer
> requirement is for more than linux attached to the thin pool.
>
> So the relevance is our design boundary.
>
> jim
I think that the concern is that the exact implementation is actually
already coded and relatively easy for us to do (i.e., send down unmap
commands at natural file system level units after a truncate/delete).
The irony is that the hard part is to try to approach that level of
exactness with the other techniques (coalescing unmaps, defrag, etc) :-)
ric
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-11 16:25 ` Ric Wheeler
@ 2008-11-11 16:53 ` jim owens
0 siblings, 0 replies; 105+ messages in thread
From: jim owens @ 2008-11-11 16:53 UTC (permalink / raw)
To: Ric Wheeler
Cc: Keith Owens, Black_David, david, dwmw2, martin.petersen,
chris.mason, jens.axboe, James.Bottomley, linux-scsi,
linux-fsdevel, coughlan, matthew
Ric Wheeler wrote:
>
> I think that the concern is that the exact implementation is actually
> already coded and relatively easy for us to do (i.e., send down unmap
> commands at natural file system level units after a truncate/delete).
>
> The irony is that the hard part is to try to approach that level of
> exactness with the other techniques (coalescing unmaps, defrag, etc) :-)
I was using "exact" in the second sense... our battle about
matching exactly with an array where the thin unmap chunk is
greater than the natural file system level unit.
I'm saying that in that case, extreme measures are not justified.
jim
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-11 14:55 ` jim owens
2008-11-11 15:38 ` Ric Wheeler
@ 2008-11-11 23:08 ` Dave Chinner
2008-11-11 23:52 ` jim owens
1 sibling, 1 reply; 105+ messages in thread
From: Dave Chinner @ 2008-11-11 23:08 UTC (permalink / raw)
To: jim owens
Cc: Ric Wheeler, Keith Owens, Black_David, dwmw2, martin.petersen,
chris.mason, jens.axboe, James.Bottomley, linux-scsi,
linux-fsdevel, coughlan, matthew
On Tue, Nov 11, 2008 at 09:55:59AM -0500, jim owens wrote:
> Ric Wheeler wrote:
>> This is being pitched to answer a very specific customer use case -
>> shared storage (mid to high end almost exclusively) with several
>> different users and applications....
...
> It is up to the customer to manage their storage so it never
> reaches the unable-to-write state.
Sure, but putting the entire management burden of obtaining and
running defrag tools in every one of their large set of OS's is the
wrong approach.
We can and should be designing new functionality for the data center
in such a manner that does not require large scale manual
intervention to maintain the systems. Your customers won't thank you
for solving the thin provisioning management problem by requiring
them to do extra hand-holding....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-11 23:08 ` Dave Chinner
@ 2008-11-11 23:52 ` jim owens
0 siblings, 0 replies; 105+ messages in thread
From: jim owens @ 2008-11-11 23:52 UTC (permalink / raw)
To: jim owens, Ric Wheeler, Keith Owens, Black_David, dwmw2,
martin.petersen, chris.mason
Dave Chinner wrote:
> On Tue, Nov 11, 2008 at 09:55:59AM -0500, jim owens wrote:
>> Ric Wheeler wrote:
>>> This is being pitched to answer a very specific customer use case -
>>> shared storage (mid to high end almost exclusively) with several
>>> different users and applications....
> ...
>> It is up to the customer to manage their storage so it never
>> reaches the unable-to-write state.
>
> Sure, but putting the entire management burden of obtaining and
> running defrag tools in every one of their large set of OS's is the
> wrong approach.
>
> We can and should be designing new functionality for the data center
> in such a manner that does not require large scale manual
> intervention to maintain the systems. Your customers won't thank you
> for solving the thin provisioning management problem by requiring
> them to do extra hand-holding....
I agree that it could be done better.
I just don't expect that to happen any time soon because
both multiple array vendors and multiple OS vendors must
agree and spend money when they don't see this as a large
amount of missed opportunity money. You only have to be
as good as your competition and the customers are used to
doing the extra hand-holding.
All we can do is support what the devices will do today
with minimal effort and wait for customer demand to force
everyone to make improvements.
jim
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Thin provisioning & arrays
2008-11-11 1:23 ` Black_David
2008-11-11 2:09 ` Keith Owens
@ 2008-11-11 22:49 ` Dave Chinner
1 sibling, 0 replies; 105+ messages in thread
From: Dave Chinner @ 2008-11-11 22:49 UTC (permalink / raw)
To: Black_David
Cc: dwmw2, martin.petersen, chris.mason, jens.axboe, James.Bottomley,
rwheeler, linux-scsi, linux-fsdevel, coughlan, matthew
On Mon, Nov 10, 2008 at 08:23:17PM -0500, Black_David@emc.com wrote:
> Dave,
>
> > Treating it as a reliable command (i.e. it succeeds or returns
> > an error) means that we can implement filesystems that can do
> > unmapping in such a way that when the array reports that it is out
> > of space we *know* that there is no free space that can be unmapped.
> > i.e. no need for a "defrag" tool.
>
> What if the filesystem block size and the array thin provisioning
> chunk size don't match? It's still "defrag" time ...
No, it's "fix the array implementation" time ;)
It seems the point that we've made that the higher layers can be
exact and robust is being ignored because it means work to make the
arrays exact and robust.
Following that, if the array is not robust (i.e. doesn't execute
unmap commands exactly as specified), then as a filesystem
developer I want to know that this occurred so that appropriate
warnings can be issued to inform the admin of what they need to
do when the array runs out of space....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-06 14:43 thin provisioned LUN support Ric Wheeler
2008-11-06 15:17 ` James Bottomley
@ 2008-11-06 15:27 ` jim owens
2008-11-06 15:57 ` jim owens
[not found] ` <yq1d4h8nao5.fsf@sermon.lab.mkp.net>
` (5 subsequent siblings)
7 siblings, 1 reply; 105+ messages in thread
From: jim owens @ 2008-11-06 15:27 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox,
Jens Axboe
Ric Wheeler wrote:
>
> After talking to some vendors, one issue that came up is that the arrays
> all have a different size that is used internally to track the SCSI
> equivalent of TRIM commands (POKE/unmap).
>
> What they would like is for us to coalesce these commands into aligned
> multiples of these chunks. If not, the target device will most likely
> ignore the bits at the beginning and end (and all small requests).
>
> I have been thinking about whether or not we can (and should) do
> anything more than our current best effort to send down large chunks
> (note that the "chunk" size can range from reasonable sizes like 8KB or
> so up to close to 1MB!).
The rational way to do this is to admit TRIM is only a feature
for filesystems that can set their allocation block size to
aligned multiples of the device "trim chunk size".
And the vendors need to provide the device trim chunk size in
a standard way (like scsi geometry) to the filesystem.
Devices with a trim chunk size of 512 bytes would work with
all filesystems. Devices with larger trim chunk sizes would
only receive trim commands from filesystems that can do large
allocation blocks (such as btrfs).
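Just to make the arithmetic concrete, this is roughly what a chunk-granular
target implies for any freed range that isn't a whole aligned multiple
(standalone sketch, not kernel code; the 2048-sector chunk and the helper
name are invented for illustration):

/*
 * Standalone sketch (not kernel code): clip a freed LBA range inward to
 * the device's trim chunk boundaries.  Whatever is left over at the
 * head or tail is exactly the part a chunk-granular array would ignore,
 * which is why small or misaligned frees get dropped entirely.
 */
#include <stdint.h>
#include <stdio.h>

static int clip_to_chunk(uint64_t start, uint64_t nblocks, uint64_t chunk,
                         uint64_t *out_start, uint64_t *out_len)
{
        uint64_t first = (start + chunk - 1) / chunk * chunk;   /* round up   */
        uint64_t last  = (start + nblocks) / chunk * chunk;     /* round down */

        if (last <= first)
                return 0;       /* too small or misaligned: nothing survives */
        *out_start = first;
        *out_len = last - first;
        return 1;
}

int main(void)
{
        uint64_t s, l;

        /* an 8KB free (16 sectors) against a 1MB chunk (2048 sectors) */
        if (!clip_to_chunk(10000, 16, 2048, &s, &l))
                printf("whole discard dropped\n");

        /* a large aligned free survives intact */
        if (clip_to_chunk(2048, 8192, 2048, &s, &l))
                printf("unmap %llu+%llu\n",
                       (unsigned long long)s, (unsigned long long)l);
        return 0;
}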
> One suggestion is that a modified defrag sweep could be used
> periodically to update the device (a proposal I am not keen on).
I don't like it either. But nothing prevents an array vendor
from building and shipping a tool to do this, to paper over their
big 1MB trim granularity with small-block filesystems.
jim
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-06 15:27 ` thin provisioned LUN support jim owens
@ 2008-11-06 15:57 ` jim owens
2008-11-06 16:21 ` James Bottomley
0 siblings, 1 reply; 105+ messages in thread
From: jim owens @ 2008-11-06 15:57 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox,
Jens Axboe
James Bottomley wrote:
> By the way, the latest (from 2 days ago) version of the Thin
> Provisioning proposal is here:
>
> http://www.t10.org/ftp/t10/document.08/08-149r4.pdf
If I understand the spec [ not likely ;) ] ...
jim owens wrote:
> And the vendors need to provide the device trim chunk size in
> a standard way (like scsi geometry) to the filesystem.
It may be that the READ CAPACITY (16) provides the trim chunk
size via the "logical blocks per physical block exponent".
But since this is just a T10 spec, I would want that
interpretation verified by the array vendors.
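For what it's worth, that exponent lives in the low nibble of byte 13 of
the READ CAPACITY (16) parameter data, so decoding it is just byte-picking.
A standalone sketch (offsets per the SBC draft layout; whether any array
reuses the field for its unmap chunk is exactly the unverified part):

/*
 * Standalone sketch: pull the "logical blocks per physical block
 * exponent" out of READ CAPACITY (16) parameter data.  Byte offsets
 * follow the SBC draft layout (bytes 0-7 last LBA, 8-11 block length,
 * byte 13 bits 3:0 the exponent); whether an array reuses this field
 * for its unmap chunk is exactly the part that needs verifying.
 */
#include <stdint.h>
#include <stdio.h>

struct read_cap16 {
        uint64_t last_lba;
        uint32_t block_len;
        uint8_t  lb_per_pb_exp;
};

static void decode_rc16(const uint8_t buf[32], struct read_cap16 *rc)
{
        int i;

        rc->last_lba = 0;
        for (i = 0; i < 8; i++)                 /* big-endian on the wire */
                rc->last_lba = (rc->last_lba << 8) | buf[i];
        rc->block_len = ((uint32_t)buf[8] << 24) | ((uint32_t)buf[9] << 16) |
                        ((uint32_t)buf[10] << 8) | buf[11];
        rc->lb_per_pb_exp = buf[13] & 0x0f;
}

int main(void)
{
        uint8_t buf[32] = { 0 };                /* fake response          */
        struct read_cap16 rc;

        buf[10] = 0x02;                         /* 512-byte logical block */
        buf[13] = 0x03;                         /* 2^3 logical per phys   */

        decode_rc16(buf, &rc);
        printf("physical block = %u bytes\n",
               (unsigned)(rc.block_len << rc.lb_per_pb_exp));
        return 0;
}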
jim
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-06 15:57 ` jim owens
@ 2008-11-06 16:21 ` James Bottomley
0 siblings, 0 replies; 105+ messages in thread
From: James Bottomley @ 2008-11-06 16:21 UTC (permalink / raw)
To: jim owens
Cc: Ric Wheeler, David Woodhouse, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox,
Jens Axboe
On Thu, 2008-11-06 at 10:57 -0500, jim owens wrote:
> James Bottomley wrote:
>
> > By the way, the latest (from 2 days ago) version of the Thin
> > Provisioning proposal is here:
> >
> > http://www.t10.org/ftp/t10/document.08/08-149r4.pdf
>
> If I understand the spec [ not likely ;) ] ...
>
> jim owens wrote:
>
> > And the vendors need to provide the device trim chunk size in
> > a standard way (like scsi geometry) to the filesystem.
>
> It may be that the READ CAPACITY (16) provides the trim chunk
> size via the "logical blocks per physical block exponent".
>
> But since this is just a T10 spec, I would want that
> interpretation verified by the array vendors.
This could be ... I think its original intention was to allow us to
figure out that we had a 4k sector disk emulating a 512b sector one.
However, it's also a useful way to parametrise the erase block size for
SSDs as well as the array track size.
James
^ permalink raw reply [flat|nested] 105+ messages in thread
[parent not found: <yq1d4h8nao5.fsf@sermon.lab.mkp.net>]
* Re: thin provisioned LUN support
[not found] ` <yq1d4h8nao5.fsf@sermon.lab.mkp.net>
@ 2008-11-06 15:42 ` Ric Wheeler
2008-11-06 15:57 ` David Woodhouse
0 siblings, 1 reply; 105+ messages in thread
From: Ric Wheeler @ 2008-11-06 15:42 UTC (permalink / raw)
To: Martin K. Petersen
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Tom Coughlan, Matthew Wilcox, Jens Axboe
Martin K. Petersen wrote:
>>>>>> "Ric" == Ric Wheeler <rwheeler@redhat.com> writes:
>>>>>>
>
> Ric> After talking to some vendors, one issue that came up is that the
> Ric> arrays all have a different size that is used internally to track
> Ric> the SCSI equivalent of TRIM commands (POKE/unmap).
>
> I haven't had time to completely digest the latest (Nov. 4th) UNMAP
> proposal yet. However, I don't recall seeing any notion of blocks
> bigger than the logical block length. And the command clearly takes
> (a list of) <start LBA, number of blocks>.
>
There is a proposal to expose this internal device size in a standard
way, but it has not been finalized.
>
> Ric> What they would like is for us to coalesce these commands into
> Ric> aligned multiples of these chunks.
>
> Ric> If not, the target device will most likely ignore the bits at the
> Ric> beginning and end (and all small requests).
>
> That really just sounds like a broken firmware implementation on their
> end. If they can not UNMAP a single logical block then they are
> clearly not within the spirit of the proposed standard. Their thin
> provisioning is going to suck and customers will hopefully
> complain/buy a competing product.
>
>
I tend to agree with this point of view, but unfortunately, I think that
all of the major arrays have this kind of limitation to one degree or
another. I suspect that this will be a big deal with high end customers.
Just not sure that we have any way to handle this better than what is
already in place (best effort, maybe a post/defrag-like user util that can
be run offline to clean up?)
ric
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-06 15:42 ` Ric Wheeler
@ 2008-11-06 15:57 ` David Woodhouse
0 siblings, 0 replies; 105+ messages in thread
From: David Woodhouse @ 2008-11-06 15:57 UTC (permalink / raw)
To: Ric Wheeler
Cc: Martin K. Petersen, David Woodhouse, James Bottomley, linux-scsi,
linux-fsdevel, Black_David, Tom Coughlan, Matthew Wilcox,
Jens Axboe
On Thu, 6 Nov 2008, Ric Wheeler wrote:
> Just not sure that we have any way to handle this better than what is
> already in place (best effort, maybe a post/defrag-like user util that can
> be run offline to clean up?)
Well, it will get a _little_ better, because we will fix it so that
discard requests can at least be merged by the elevators (and stop being
barriers)
--
dwmw2
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-06 14:43 thin provisioned LUN support Ric Wheeler
` (2 preceding siblings ...)
[not found] ` <yq1d4h8nao5.fsf@sermon.lab.mkp.net>
@ 2008-11-06 22:36 ` Dave Chinner
2008-11-06 22:55 ` Ric Wheeler
[not found] ` <491375E9.7020707@redhat.com>
2008-11-06 23:32 ` thin provisioned LUN support - T10 activity Black_David
` (3 subsequent siblings)
7 siblings, 2 replies; 105+ messages in thread
From: Dave Chinner @ 2008-11-06 22:36 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox,
Jens Axboe
On Thu, Nov 06, 2008 at 09:43:23AM -0500, Ric Wheeler wrote:
>
> After talking to some vendors, one issue that came up is that the arrays
> all have a different size that is used internally to track the SCSI
> equivalent of TRIM commands (POKE/unmap).
>
> What they would like is for us to coalesce these commands into aligned
> multiples of these chunks. If not, the target device will most likely
> ignore the bits at the beginning and end (and all small requests).
There's lots of questions that need to be answered here. e.g:
Where are these free spaces going to be aggregated before dispatch?
What happens if they are re-allocated and re-written by the
filesystem before they've been dispatched?
How is the chunk size going to be passed to the aggregation layer?
What about passing it to the filesystem so it can align all its
allocations in a manner that simplifies the dispatch problem?
What happens if a crash occurs before the aggregated free space is
dispatched?
Are there coherency problems with filesystem recovery after a crash?
> I have been thinking about whether or not we can (and should) do
> anything more than our current best effort to send down large chunks
> (note that the "chunk" size can range from reasonable sizes like 8KB or
> so up to close to 1MB!).
Any aggregation is only as good as the original allocation the
filesystem did. Look at the mess ext3 creates when untarring a kernel
tarball - blocks are written all over the place. You'd
need to fix that to have any hope of behaving nicely for a RAID
that has a sub-optimal thin provisioning algorithm.
The problem is not with the filesystem, the block layer or the OS.
If the array vendors have optimised themselves into a corner,
then they should be fixing their problem, not asking the rest of
the world to expend large amounts of effort to work around the
shortcomings of their products.....
> One suggestion is that a modified defrag sweep could be used
> periodically to update the device (a proposal I am not keen on).
No thanks. That needs an implementation per filesystem, and it will
need to be done with the filesystem online, which means it will
still need substantial help from the kernel.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-06 22:36 ` Dave Chinner
@ 2008-11-06 22:55 ` Ric Wheeler
[not found] ` <491375E9.7020707@redhat.com>
1 sibling, 0 replies; 105+ messages in thread
From: Ric Wheeler @ 2008-11-06 22:55 UTC (permalink / raw)
To: Ric Wheeler, David Woodhouse, James Bottomley, linux-scsi,
linux-fsdevel
Dave Chinner wrote:
> On Thu, Nov 06, 2008 at 09:43:23AM -0500, Ric Wheeler wrote:
>
>> After talking to some vendors, one issue that came up is that the arrays
>> all have a different size that is used internally to track the SCSI
>> equivalent of TRIM commands (POKE/unmap).
>>
>> What they would like is for us to coalesce these commands into aligned
>> multiples of these chunks. If not, the target device will most likely
>> ignore the bits at the beginning and end (and all small requests).
>>
>
> There's lots of questions that need to be answered here. e.g:
>
> Where are these free spaces going to be aggregated before dispatch?
>
> What happens if they are re-allocated and re-written by the
> filesystem before they've been dispatched?
>
> How is the chunk size going to be passed to the aggregation layer?
>
> What about passing it to the filesystem so it can align all its
> allocations in a manner that simplifies the dispatch problem?
>
> What happens if a crash occurs before the aggregated free space is
> dispatched?
>
> Are there coherency problems with filesystem recovery after a crash?
>
The good thing about these "unmap" commands (SCSI speak this week for
TRIM) is that we can drop them if we have to without data integrity
concerns.
The only thing that you cannot do is to send down an unmap for a block
still in use (including ones that have not been committed in a transaction).
In SCSI, they plan to zero those blocks so that you will always read a
block of zeros back if you try to read an unmapped sector.
I have no idea how we can pass the aggregation size up from the block
layer since it is not currently exported in a uniform way from SCSI.
Even if it is, we have struggled to get RAID stripe alignment handled so
far.
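If it ever does get exported, the plumbing could be as simple as one more
field next to the stripe hints. A purely hypothetical sketch; none of these
structure or function names exist today:

/*
 * Purely hypothetical interface sketch: none of these names exist in
 * the kernel today.  The idea is only that the device would carry an
 * "unmap granularity" alongside the other geometry hints, and the
 * filesystem would query it at mkfs/mount time the same way it would
 * query a stripe width.
 */
#include <stdint.h>
#include <stdio.h>

struct blk_topology {                   /* hypothetical */
        uint32_t logical_block_size;    /* bytes                          */
        uint32_t stripe_width;          /* bytes, 0 = unknown             */
        uint32_t unmap_granularity;     /* bytes, 0 = device doesn't care */
};

static uint32_t fs_pick_alloc_unit(const struct blk_topology *t,
                                   uint32_t fs_block_size)
{
        /* use the device's unmap chunk as the allocation unit when it is
         * bigger than the filesystem's natural block size */
        if (t->unmap_granularity > fs_block_size)
                return t->unmap_granularity;
        return fs_block_size;
}

int main(void)
{
        struct blk_topology t = { 512, 0, 64 * 1024 };

        printf("allocation unit: %u bytes\n",
               (unsigned)fs_pick_alloc_unit(&t, 4096));
        return 0;
}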
>
>> I have been thinking about whether or not we can (and should) do
>> anything more than our current best effort to send down large chunks
>> (note that the "chunk" size can range from reasonable sizes like 8KB or
>> so up to close to 1MB!).
>>
>
> Any aggregation is only as good as the original allocation the
> filesystem did. Look at the mess ext3 creates when untarring a kernel
> tarball - blocks are written all over the place. You'd
> need to fix that to have any hope of behaving nicely for a RAID
> that has a sub-optimal thin provisioning algorithm.
>
> The problem is not with the filesystem, the block layer or the OS.
> If the array vendors have optimised themselves into a corner,
> then they should be fixing their problem, not asking the rest of
> the world to expend large amounts of effort to work around the
> shortcomings of their products.....
>
I agree - I think that eventually vendors will end up having to cache
the requests internally. The problem is with the customers who will be
getting the first generation of gear and have had their expectations set
already....
>
>> One suggestion is that a modified defrag sweep could be used
>> periodically to update the device (a proposal I am not keen on).
>>
>
> No thanks. That needs an implementation per filesystem, and it will
> need to be done with the filesystem on line which means it will
> still need substantial help from the kernel.
>
> Cheers,
>
> Dave.
>
It does seem to be a mess - especially since people have already gone to
the trouble to put the hooks in to inform the storage in a consistent
and timely way :-)
Ric
^ permalink raw reply [flat|nested] 105+ messages in thread[parent not found: <491375E9.7020707@redhat.com>]
* Re: thin provisioned LUN support
[not found] ` <491375E9.7020707@redhat.com>
@ 2008-11-06 23:06 ` James Bottomley
2008-11-06 23:10 ` Ric Wheeler
0 siblings, 1 reply; 105+ messages in thread
From: James Bottomley @ 2008-11-06 23:06 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, linux-scsi, linux-fsdevel, Black_David,
Martin K. Petersen, Tom Coughlan, Matthew Wilcox, Jens Axboe
On Thu, 2008-11-06 at 17:55 -0500, Ric Wheeler wrote:
> Dave Chinner wrote:
> > On Thu, Nov 06, 2008 at 09:43:23AM -0500, Ric Wheeler wrote:
> >
> >> After talking to some vendors, one issue that came up is that the arrays
> >> all have a different size that is used internally to track the SCSI
> >> equivalent of TRIM commands (POKE/unmap).
> >>
> >> What they would like is for us to coalesce these commands into aligned
> >> multiples of these chunks. If not, the target device will most likely
> >> ignore the bits at the beginning and end (and all small requests).
> >>
> >
> > There's lots of questions that need to be answered here. e.g:
> >
> > Where are these free spaces going to be aggregated before dispatch?
> >
> > What happens if they are re-allocated and re-written by the
> > filesystem before they've been dispatched?
> >
> > How is the chunk size going to be passed to the aggregation layer?
> >
> > What about passing it to the filesystem so it can align all its
> > allocations in a manner that simplifies the dispatch problem?
> >
> > What happens if a crash occurs before the aggregated free space is
> > dispatched?
> >
> > Are there coherency problems with filesystem recovery after a crash?
> >
>
> The good thing about these "unmap" commands (SCSI speak this week for
> TRIM) is that we can drop them if we have to without data integrity
> concerns.
>
> The only thing that you cannot do is to send down an unmap for a block
> still in use (including ones that have not been committed in a transaction).
>
> In SCSI, they plan to zero those blocks so that you will always read a
> block of zeros back if you try to read an unmapped sector.
Actually, they left this up to the array in the latest spec. If the
TPRZ bit is set in the Block Device Characteristics VPD then, yes, it
will return zeros. If not, the return is undefined.
> I have no idea how we can pass the aggregation size up from the block
> layer since it is not currently exported in a uniform way from SCSI.
> Even if it is, we have struggled to get RAID stripe alignment handled so
> far.
Well, this is identical to the erase block size (and array stripe size)
problems we've been discussing. I thought we'd more or less agreed on
the generic attributes model.
> >> I have been thinking about whether or not we can (and should) do
> >> anything more than our current best effort to send down large chunks
> >> (note that the "chunk" size can range from reasonable sizes like 8KB or
> >> so up to close to 1MB!).
> >>
> >
> > Any aggregation is only as good as the original allocation the
> > filesystem did. Look at the mess ext3 creates when untarring a kernel
> > tarball - blocks are written all over the place. You'd
> > need to fix that to have any hope of behaving nicely for a RAID
> > that has a sub-optimal thin provisioning algorithm.
> >
> > The problem is not with the filesystem, the block layer or the OS.
> > If the array vendors have optimised themselves into a corner,
> > then they should be fixing their problem, not asking the rest of
> > the world to expend large amounts of effort to work around the
> > shortcomings of their products.....
> >
>
> I agree - I think that eventually vendors will end up having to cache
> the requests internally. The problem is with the customers who will be
> getting the first generation of gear and have had their expectations set
> already....
>
> >
> >> One suggestion is that a modified defrag sweep could be used
> >> periodically to update the device (a proposal I am not keen on).
> >>
> >
> > No thanks. That needs an implementation per filesystem, and it will
> > need to be done with the filesystem on line which means it will
> > still need substantial help from the kernel.
> >
> > Cheers,
> >
> > Dave.
> >
>
> It does seem to be a mess - especially since people have already gone to
> the trouble to put the hooks in to inform the storage in a consistent
> and timely way :-)
I'm sure we can iterate to a conclusion ... even if it's that we won't
actually do anything other than send down properly formed unmap commands
and if the array chooses to ignore them, that's its lookout.
James
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-06 23:06 ` James Bottomley
@ 2008-11-06 23:10 ` Ric Wheeler
2008-11-06 23:26 ` James Bottomley
0 siblings, 1 reply; 105+ messages in thread
From: Ric Wheeler @ 2008-11-06 23:10 UTC (permalink / raw)
To: James Bottomley
Cc: David Woodhouse, linux-scsi, linux-fsdevel, Black_David,
Martin K. Petersen, Tom Coughlan, Matthew Wilcox, Jens Axboe
James Bottomley wrote:
> On Thu, 2008-11-06 at 17:55 -0500, Ric Wheeler wrote:
>
>> Dave Chinner wrote:
>>
>>> On Thu, Nov 06, 2008 at 09:43:23AM -0500, Ric Wheeler wrote:
>>>
>>>
>>>> After talking to some vendors, one issue that came up is that the arrays
>>>> all have a different size that is used internally to track the SCSI
>>>> equivalent of TRIM commands (POKE/unmap).
>>>>
>>>> What they would like is for us to coalesce these commands into aligned
>>>> multiples of these chunks. If not, the target device will most likely
>>>> ignore the bits at the beginning and end (and all small requests).
>>>>
>>>>
>>> There's lots of questions that need to be answered here. e.g:
>>>
>>> Where are these free spaces going to be aggregated before dispatch?
>>>
>>> What happens if they are re-allocated and re-written by the
>>> filesystem before they've been dispatched?
>>>
>>> How is the chunk size going to be passed to the aggregation layer?
>>>
>>> What about passing it to the filesystem so it can align all its
>>> allocations in a manner that simplifies the dispatch problem?
>>>
>>> What happens if a crash occurs before the aggregated free space is
>>> dispatched?
>>>
>>> Are there coherency problems with filesystem recovery after a crash?
>>>
>>>
>> The good thing about these "unmap" commands (SCSI speak this week for
>> TRIM) is that we can drop them if we have to without data integrity
>> concerns.
>>
>> The only thing that you cannot do is to send down an unmap for a block
>> still in use (including ones that have not been committed in a transaction).
>>
>> In SCSI, they plan to zero those blocks so that you will always read a
>> block of zeros back if you try to read an unmapped sector.
>>
>
> Actually, they left this up to the array in the latest spec. If the
> TPRZ bit is set in the Block Device Characteristics VPD then, yes, it
> will return zeros. If not, the return is undefined.
>
>
The RAID vendors were not happy with this & are in the process of
changing it to be:
(1) all zeros OR
(2) all 1's
(3) other - but always to be returned consistently until a future write
The concern is that RAID boxes would trip up over parity (if it could
change).
>> I have no idea how we can pass the aggregation size up from the block
>> layer since it is not currently exported in a uniform way from SCSI.
>> Even if it is, we have struggled to get RAID stripe alignment handled so
>> far.
>>
>
> Well, this is identical to the erase block size (and array stripe size)
> problems we've been discussing. I thought we'd more or less agreed on
> the generic attributes model.
>
We could do it, but need them to put it in a standard place first.
Today, it is exposed only in device specific ways.
>
>>>> I have been thinking about whether or not we can (and should) do
>>>> anything more than our current best effort to send down large chunks
>>>> (note that the "chunk" size can range from reasonable sizes like 8KB or
>>>> so up to close to 1MB!).
>>>>
>>>>
>>> Any aggregation is only as good as the original allocation the
>>> filesystem did. Look at the mess ext3 creates when untarring a kernel
>>> tarball - blocks are written all over the place. You'd
>>> need to fix that to have any hope of behaving nicely for a RAID
>>> that has a sub-optimal thin provisioning algorithm.
>>>
>>> The problem is not with the filesystem, the block layer or the OS.
>>> If the array vendors have optimised themselves into a corner,
>>> then they should be fixing their problem, not asking the rest of
>>> the world to expend large amounts of effort to work around the
>>> shortcomings of their products.....
>>>
>>>
>> I agree - I think that eventually vendors will end up having to cache
>> the requests internally. The problem is with the customers who will be
>> getting the first generation of gear and have had their expectations set
>> already....
>>
>>
>>>
>>>
>>>> One suggestion is that a modified defrag sweep could be used
>>>> periodically to update the device (a proposal I am not keen on).
>>>>
>>>>
>>> No thanks. That needs an implementation per filesystem, and it will
>>> need to be done with the filesystem on line which means it will
>>> still need substantial help from the kernel.
>>>
>>> Cheers,
>>>
>>> Dave.
>>>
>>>
>> It does seem to be a mess - especially since people have already gone to
>> the trouble to put the hooks in to inform the storage in a consistent
>> and timely way :-)
>>
>
> I'm sure we can iterate to a conclusion ... even if it's that we won't
> actually do anything other than send down properly formed unmap commands
> and if the array chooses to ignore them, that's its lookout.
>
> James
>
>
>
Eventually, we will get it (collectively) right...
ric
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-06 23:10 ` Ric Wheeler
@ 2008-11-06 23:26 ` James Bottomley
0 siblings, 0 replies; 105+ messages in thread
From: James Bottomley @ 2008-11-06 23:26 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, linux-scsi, linux-fsdevel, Black_David,
Martin K. Petersen, Tom Coughlan, Matthew Wilcox, Jens Axboe
On Thu, 2008-11-06 at 18:10 -0500, Ric Wheeler wrote:
> James Bottomley wrote:
> > On Thu, 2008-11-06 at 17:55 -0500, Ric Wheeler wrote:
> >> I have no idea how we can pass the aggregation size up from the block
> >> layer since it is not currently exported in a uniform way from SCSI.
> >> Even if it is, we have struggled to get RAID stripe alignment handled so
> >> far.
> >>
> >
> > Well, this is identical to the erase block size (and array stripe size)
> > problems we've been discussing. I thought we'd more or less agreed on
> > the generic attributes model.
> >
>
> We could do it, but need them to put it in a standard place first.
> Today, it is exposed only in device specific ways.
Actually, I think it is standard. I think it's exposed in the READ CAPACITY
(16) logical blocks per physical block exponent. This also has an analogue in
SATA since word 106 of the IDENTIFY DEVICE also contains this. What I'm
not clear on is whether SSDs actually implement this for the erase block
size (I'm reasonably sure 4k sector devices do).
James
^ permalink raw reply [flat|nested] 105+ messages in thread
* thin provisioned LUN support - T10 activity
2008-11-06 14:43 thin provisioned LUN support Ric Wheeler
` (3 preceding siblings ...)
2008-11-06 22:36 ` Dave Chinner
@ 2008-11-06 23:32 ` Black_David
2008-11-07 11:59 ` thin provisioned LUN support Artem Bityutskiy
` (2 subsequent siblings)
7 siblings, 0 replies; 105+ messages in thread
From: Black_David @ 2008-11-06 23:32 UTC (permalink / raw)
To: rwheeler, dwmw2, James.Bottomley, linux-scsi, linux-fsdevel
Cc: martin.petersen, coughlan, matthew, jens.axboe, Black_David
Folks,
Ric didn't realize it, but he started this discussion on a day when
T10 was working on the thin provisioning support in SCSI. Having been
in that T10 meeting, I'll use this message to describe what's happened
in T10 and use a separate message to discuss array implementation
concerns.
So, working my way through the messages in this thread ...
James Bottomley writes:
> By the way, the latest (from 2 days ago) version of the Thin
> Provisioning proposal is here:
>
> http://www.t10.org/ftp/t10/document.08/08-149r4.pdf
Just in case it wasn't clear, this is a moving target. Expect
to see an r5 posted by the end of next week, and there are two
concalls between now and the T10 January meetings to work on it,
so it *will* change again.
> I skimmed it but don't see any update implying that trim might be
> ineffective if we align wrongly ... where is this?
The wording will be that an UNMAP command (f/k/a PUNCH, f/k/a TRIM)
requests an unmap operation, and the device can decide what if
anything to unmap. In r4, this was in these two sentences in 5.x
in the middle of p.20:
The UNMAP command requests alteration of the medium. The UNMAP
command (see table x.1) provides information to the device server
that may be used by the device server to transition specified
ranges of blocks to the unmapped state.
There will be a T10 discussion at some point about whether the UNMAP
command tells the device that it "may" unmap vs. "should" unmap.
Responding to Martin Petersen, Ric Wheeler writes:
>> I haven't had time to completely digest the latest (Nov. 4th) UNMAP
>> proposal yet. However, I don't recall seeing any notion of blocks
>> bigger than the logical block length. And the command clearly takes
>> (a list of) <start LBA, number of blocks>.
>
> There is a proposal to expose this internal device size in a standard
> way, but it has not been finalized.
Both Martin and Ric are correct, but the initial proposal to do this
isn't available yet. This is likely to be in a VPD (mode) page in a
future version of the 08-149 proposal, but it's not clear whether
this function will be in the block device characteristics VPD page
vs. a new page for thin provisioning.
jim owens writes:
> > And the vendors need to provide the device trim chunk size in
> > a standard way (like scsi geometry) to the filesystem.
>
> It may be that the READ CAPACITY (16) provides the trim chunk
> size via the "logical blocks per physical block exponent".
No, definitely not. As James subsequently indicated, that exponent
is part of the 4k sector size support. There is no intention that
I'm aware of to use it for thin provisioning.
James Bottomley writes:
>> In SCSI, they plan to zero those blocks so that you will always read a
>> block of zeros back if you try to read an unmapped sector.
>
> Actually, they left this up to the array in the latest spec. If the
> TPRZ bit is set in the Block Device Characteristics VPD then, yes, it
> will return zeros. If not, the return is undefined.
James is correct, and Ric's subsequent response is incorrect, in part
because I didn't update Ric on what's going on (mea culpa). Here's
the full story ...
There is a very strong desire to be able to map ATA functionality (or most
of it) into SCSI. The initial ATA specification of TRIM was seriously
flawed; for an explanation, see T10/08-347r1:
http://www.t10.org/ftp/t10/document.08/08-347r1.pdf
There has been significant effort made to do something about this, the
result of which is that T13 will be adding a Deterministic Read After
TRIM (DRAT !) bit to the ATA specification (T13/e08137r1):
http://www.t13.org/Documents/UploadedDocuments/docs2008/e08137r1-DRAT_-_Deterministic_Read_After_Trim.pdf
The crucial language in that proposal is the red text near the bottom
of p.4, which allows any value as long as it has deterministic read
behavior (the DRAT bit will be word 69 bit 14 of the IDENTIFY DEVICE
data). The SCSI standard will align to the ATA standard with the DRAT
bit set - that red language was apparently the most that T13 would
accept in the way of behavior requirements.
Thanks,
--David
----------------------------------------------------
David L. Black, Distinguished Engineer
EMC Corporation, 176 South St., Hopkinton, MA 01748
+1 (508) 293-7953 FAX: +1 (508) 293-7786
black_david@emc.com Mobile: +1 (978) 394-7754
----------------------------------------------------
> -----Original Message-----
> From: Ric Wheeler [mailto:rwheeler@redhat.com]
> Sent: Thursday, November 06, 2008 9:43 AM
> To: David Woodhouse; James Bottomley;
> linux-scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org
> Cc: Black, David; Martin K. Petersen; Tom Coughlan; Matthew
> Wilcox; Jens Axboe
> Subject: thin provisioned LUN support
>
>
> After talking to some vendors, one issue that came up is that
> the arrays
> all have a different size that is used internally to track the SCSI
> equivalent of TRIM commands (POKE/unmap).
>
> What they would like is for us to coalesce these commands
> into aligned
> multiples of these chunks. If not, the target device will most likely
> ignore the bits at the beginning and end (and all small requests).
>
> I have been thinking about whether or not we can (and should) do
> anything more than our current best effort to send down large chunks
> (note that the "chunk" size can range from reasonable sizes
> like 8KB or
> so up to close to 1MB!).
>
> One suggestion is that a modified defrag sweep could be used
> periodically to update the device (a proposal I am not keen on).
>
> Thoughts?
>
> Ric
>
>
>
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-06 14:43 thin provisioned LUN support Ric Wheeler
` (4 preceding siblings ...)
2008-11-06 23:32 ` thin provisioned LUN support - T10 activity Black_David
@ 2008-11-07 11:59 ` Artem Bityutskiy
2008-11-10 20:39 ` Aggregating discard requests in the filesystem Matthew Wilcox
2008-11-11 16:40 ` thin provisioned LUN support Christoph Hellwig
7 siblings, 0 replies; 105+ messages in thread
From: Artem Bityutskiy @ 2008-11-07 11:59 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox,
Jens Axboe
Ric Wheeler wrote:
> After talking to some vendors, one issue that came up is that the arrays
> all have a different size that is used internally to track the SCSI
> equivalent of TRIM commands (POKE/unmap).
>
> What they would like is for us to coalesce these commands into aligned
> multiples of these chunks. If not, the target device will most likely
> ignore the bits at the beginning and end (and all small requests).
>
> I have been thinking about whether or not we can (and should) do
> anything more than our current best effort to send down large chunks
> (note that the "chunk" size can range from reasonable sizes like 8KB or
> so up to close to 1MB!).
Note, this is relevant to MMC as well. They have an "erase" command,
which is equivalent to discarding a group of sectors, e.g., 128KiB. So
we would speed MMC up if we could make sure our discard requests are:
1. 128KiB aligned
2. of 128KiB in size
P.S. Group size is MMC-specific.
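A rough sketch of what that alignment rule means for a single discard range
(the 128KiB group size is just the example figure above; real cards report
their own, and the helper name is invented):

/*
 * Rough sketch for the MMC case: a discard only helps if it covers
 * whole erase groups, so turn a byte range into (first group, count of
 * whole groups).  The 128KiB figure is only the example value from
 * above; real cards report their own group size.
 */
#include <stdint.h>
#include <stdio.h>

#define ERASE_GROUP (128 * 1024ULL)     /* example value, card-specific */

static uint64_t whole_groups(uint64_t start, uint64_t len, uint64_t *first)
{
        uint64_t lo = (start + ERASE_GROUP - 1) / ERASE_GROUP;  /* round up   */
        uint64_t hi = (start + len) / ERASE_GROUP;              /* round down */

        *first = lo;
        return hi > lo ? hi - lo : 0;
}

int main(void)
{
        uint64_t first;

        /* a 200KiB discard straddling a group boundary erases one group */
        printf("%llu whole group(s)\n",
               (unsigned long long)whole_groups(100 * 1024, 200 * 1024, &first));
        return 0;
}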
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
^ permalink raw reply [flat|nested] 105+ messages in thread* Aggregating discard requests in the filesystem
2008-11-06 14:43 thin provisioned LUN support Ric Wheeler
` (5 preceding siblings ...)
2008-11-07 11:59 ` thin provisioned LUN support Artem Bityutskiy
@ 2008-11-10 20:39 ` Matthew Wilcox
2008-11-10 20:44 ` Chris Mason
2008-11-11 0:12 ` Brad Boyer
2008-11-11 16:40 ` thin provisioned LUN support Christoph Hellwig
7 siblings, 2 replies; 105+ messages in thread
From: Matthew Wilcox @ 2008-11-10 20:39 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Jens Axboe,
Chris Mason, Dave Chinner
On Thu, Nov 06, 2008 at 09:43:23AM -0500, Ric Wheeler wrote:
> I have been thinking about whether or not we can (and should) do
> anything more than our current best effort to send down large chunks
> (note that the "chunk" size can range from reasonable sizes like 8KB or
> so up to close to 1MB!).
One of the proposals in this thread (that has got buried somewhere) was
to expand any discard request sent down from the filesystem to encompass
all the adjacent free space. I've checked with our SSD people and
they're fine with this idea.
dwmw2 says "it isn't actually that hard in FAT" and then interjects some
personal opinion about this solution ;-)
Is it hard in XFS? btrfs? ext2? Does anyone have a problem with this
as a solution?
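In toy form, with a bitmap standing in for the real free-space index and
every name invented, the expansion is just a walk outward from the freed
blocks:

/*
 * Toy illustration of "expand the discard to the surrounding free
 * space": walk a free-space bitmap outward from the just-freed blocks.
 * A real filesystem would do this against its free-space btree or
 * bitmap under the proper locks; the bitmap and sizes here are made up.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS 64

static uint8_t free_map[NBLOCKS];       /* 1 = block is free */

static void expand_discard(uint64_t start, uint64_t len,
                           uint64_t *d_start, uint64_t *d_len)
{
        uint64_t lo = start, hi = start + len;

        while (lo > 0 && free_map[lo - 1])
                lo--;                   /* grow left over free blocks  */
        while (hi < NBLOCKS && free_map[hi])
                hi++;                   /* grow right over free blocks */
        *d_start = lo;
        *d_len = hi - lo;
}

int main(void)
{
        uint64_t s, l;

        /* blocks 10-19 were already free, and we just freed 20-21 */
        memset(&free_map[10], 1, 12);

        expand_discard(20, 2, &s, &l);
        printf("discard %llu+%llu\n",
               (unsigned long long)s, (unsigned long long)l);
        return 0;
}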
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: Aggregating discard requests in the filesystem
2008-11-10 20:39 ` Aggregating discard requests in the filesystem Matthew Wilcox
@ 2008-11-10 20:44 ` Chris Mason
2008-11-11 0:12 ` Brad Boyer
1 sibling, 0 replies; 105+ messages in thread
From: Chris Mason @ 2008-11-10 20:44 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ric Wheeler, David Woodhouse, James Bottomley, linux-scsi,
linux-fsdevel, Black_David, Martin K. Petersen, Tom Coughlan,
Jens Axboe, Dave Chinner
On Mon, 2008-11-10 at 13:39 -0700, Matthew Wilcox wrote:
> On Thu, Nov 06, 2008 at 09:43:23AM -0500, Ric Wheeler wrote:
> > I have been thinking about whether or not we can (and should) do
> > anything more than our current best effort to send down large chunks
> > (note that the "chunk" size can range from reasonable sizes like 8KB or
> > so up to close to 1MB!).
>
> One of the proposals in this thread (that has got buried somewhere) was
> to expand any discard request sent down from the filesystem to encompass
> all the adjacent free space. I've checked with our SSD people and
> they're fine with this idea.
>
> dwmw2 says "it isn't actually that hard in FAT" and then interjects some
> personal opinion about this solution ;-)
>
> Is it hard in XFS? btrfs? ext2? Does anyone have a problem with this
> as a solution?
>
Btrfs needs some extra checking to make sure the extents really are free
(and won't magically reappear after a crash), but it is at least
possible.
-chris
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Aggregating discard requests in the filesystem
2008-11-10 20:39 ` Aggregating discard requests in the filesystem Matthew Wilcox
2008-11-10 20:44 ` Chris Mason
@ 2008-11-11 0:12 ` Brad Boyer
2008-11-11 15:25 ` jim owens
1 sibling, 1 reply; 105+ messages in thread
From: Brad Boyer @ 2008-11-11 0:12 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ric Wheeler, David Woodhouse, James Bottomley, linux-scsi,
linux-fsdevel, Black_David, Martin K. Petersen, Tom Coughlan,
Jens Axboe, Chris Mason, Dave Chinner
On Mon, Nov 10, 2008 at 01:39:15PM -0700, Matthew Wilcox wrote:
> One of the proposals in this thread (that has got buried somewhere) was
> to expand any discard request sent down from the filesystem to encompass
> all the adjacent free space. I've checked with our SSD people and
> they're fine with this idea.
>
> dwmw2 says "it isn't actually that hard in FAT" and then interjects some
> personal opinion about this solution ;-)
>
> Is it hard in XFS? btrfs? ext2? Does anyone have a problem with this
> as a solution?
I suspect how hard it is depends somewhat on exactly what you mean by
"all the adjacent free space" and what is expected of the file system.
My concern would particularly be around cluster file systems, which may
not have perfect local knowledge of exactly what is happening. My guess
is that it would be easy to know about some given range within some
larger aggregation of blocks, but not always across the entire device.
The other question is how this translates through software raid or
other non-simple block layers.
Brad Boyer
flar@allandria.com
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: Aggregating discard requests in the filesystem
2008-11-11 0:12 ` Brad Boyer
@ 2008-11-11 15:25 ` jim owens
0 siblings, 0 replies; 105+ messages in thread
From: jim owens @ 2008-11-11 15:25 UTC (permalink / raw)
To: Brad Boyer
Cc: Matthew Wilcox, Ric Wheeler, David Woodhouse, James Bottomley,
linux-scsi, linux-fsdevel, Black_David, Martin K. Petersen,
Tom Coughlan, Jens Axboe, Chris Mason, Dave Chinner
Brad Boyer wrote:
> On Mon, Nov 10, 2008 at 01:39:15PM -0700, Matthew Wilcox wrote:
>> One of the proposals in this thread (that has got buried somewhere) was
>> to expand any discard request sent down from the filesystem to encompass
>> all the adjacent free space. I've checked with our SSD people and
>> they're fine with this idea.
>>
>> dwmw2 says "it isn't actually that hard in FAT" and then interjects some
>> personal opinion about this solution ;-)
>>
>> Is it hard in XFS? btrfs? ext2? Does anyone have a problem with this
>> as a solution?
>
> I suspect how hard it is depends somewhat on exactly what you mean by
> "all the adjacent free space" and what is expected of the file system.
My take on this is that an "expanded discard" is just saying that
any UNMAP that is sent to the device can contain blocks which were
already UNMAPed. In fact, all blocks in the command may already
be unmapped.
The filesystem can send individual block discards down to the block
layer and/or send a range of free blocks that surround the just-freed
block(s). It is easy as long as the trim/unmap is permitted on
an already unmapped block. Then repeating is not an issue.
If devices do not allow unmapping an already unmapped block then
nothing works without massive filesystem changes such as Dave C
said were needed for "reliable exact tracking".
All devices must allow unmapping an already unmapped block for
the "defrag tool as unmapper" idea to work reliably too.
jim
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-06 14:43 thin provisioned LUN support Ric Wheeler
` (6 preceding siblings ...)
2008-11-10 20:39 ` Aggregating discard requests in the filesystem Matthew Wilcox
@ 2008-11-11 16:40 ` Christoph Hellwig
2008-11-11 17:07 ` jim owens
7 siblings, 1 reply; 105+ messages in thread
From: Christoph Hellwig @ 2008-11-11 16:40 UTC (permalink / raw)
To: Ric Wheeler
Cc: David Woodhouse, James Bottomley, linux-scsi, linux-fsdevel,
Black_David, Martin K. Petersen, Tom Coughlan, Matthew Wilcox,
Jens Axboe
Sorry for being so late to the game, been on the road for a couple of
days.
Why do most people assume that sending unmap/trim commands for every
deleted extent ASAP is a good idea? What this means is that we
basically duplicate the space allocator in the array. Every time we
free something we don't just have to do a local btree insert in the
filesystem code but another one behind a couple of abstractions, and
similarly on each allocation the storage device would have to allocate
from its pool. Worst of all, the interesting preallocation
optimizations we've done in the filesystem, be those explicit
pre-allocations from the application or implicit ones in the allocator,
will be lost due to the abstraction boundary.
So I think not actually doing these on every alloc/free is a good idea.
Instead the filesystem would send the frees down once big enough regions appear,
which is something simple enough to do with most btree implementations.
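A toy version of that batching, with the pending list, the threshold and
issue_unmap() all invented for illustration:

/*
 * Toy version of "don't unmap on every free": park freed extents in a
 * small pending list, merge neighbours, and only issue the UNMAP once a
 * region grows past a threshold.  A real filesystem would track this in
 * its free-space btree and worry about crash consistency; the list, the
 * threshold and issue_unmap() are all invented for the example.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_PENDING     32
#define UNMAP_THRESHOLD 256             /* blocks, example value */

struct extent { uint64_t start, len; };

static struct extent pending[MAX_PENDING];
static int npending;

static void issue_unmap(uint64_t start, uint64_t len)
{
        printf("UNMAP %llu+%llu\n",
               (unsigned long long)start, (unsigned long long)len);
}

static void note_free(uint64_t start, uint64_t len)
{
        int i;

        /* merge with an adjacent pending extent if there is one */
        for (i = 0; i < npending; i++) {
                if (pending[i].start + pending[i].len == start) {
                        pending[i].len += len;
                } else if (start + len == pending[i].start) {
                        pending[i].start = start;
                        pending[i].len += len;
                } else {
                        continue;
                }
                if (pending[i].len >= UNMAP_THRESHOLD) {
                        issue_unmap(pending[i].start, pending[i].len);
                        pending[i] = pending[--npending];
                }
                return;
        }
        if (npending < MAX_PENDING)
                pending[npending++] = (struct extent){ start, len };
        /* else just drop it: the unmap is only advisory anyway */
}

int main(void)
{
        uint64_t b;

        /* free 300 blocks in dribs and drabs of 50; one UNMAP comes out */
        for (b = 1000; b < 1300; b += 50)
                note_free(b, 50);
        return 0;
}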
That is of course not an excuse for just having the UNMAP as a hint. I
think having the UNMAP be an exact operation that is either guaranteed
to release the underlying space or fail will make the whole storage
setup a lot more robust. And while odd "unmap block" sizes will make
it a lot harder for the filesystem I think we could find ways to deal
with them, even if it might be ugly in places.
^ permalink raw reply [flat|nested] 105+ messages in thread* Re: thin provisioned LUN support
2008-11-11 16:40 ` thin provisioned LUN support Christoph Hellwig
@ 2008-11-11 17:07 ` jim owens
2008-11-11 17:33 ` Christoph Hellwig
0 siblings, 1 reply; 105+ messages in thread
From: jim owens @ 2008-11-11 17:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ric Wheeler, David Woodhouse, James Bottomley, linux-scsi,
linux-fsdevel, Black_David, Martin K. Petersen, Tom Coughlan,
Matthew Wilcox, Jens Axboe
Christoph Hellwig wrote:
> Why do most people assume that sending unmap/trim commands for every
> deleted extent ASAP is a good idea?
I agree with you. Thus my earlier assertion:
- trim/unmap for SSD garbage collection has a different goal
than trim/unmap for thin provisioning.
In the SSD garbage collector mode, we want to send them as fast
as we can (per the Intel SSD architect). This allows them to
do their optimizations.
In the Thin Provision mode, we want to delay them as you said:
> So I think not actually doing these on every alloc/free is a good idea.
> Instead the filesystem would send the frees down once big enough regions appear,
to be filesystem friendly.
But this won't change the block layer. This is a per-filesystem
coding issue to decide when to send the discard.
jim
^ permalink raw reply [flat|nested] 105+ messages in thread
* Re: thin provisioned LUN support
2008-11-11 17:07 ` jim owens
@ 2008-11-11 17:33 ` Christoph Hellwig
0 siblings, 0 replies; 105+ messages in thread
From: Christoph Hellwig @ 2008-11-11 17:33 UTC (permalink / raw)
To: jim owens
Cc: Christoph Hellwig, Ric Wheeler, David Woodhouse, James Bottomley,
linux-scsi, linux-fsdevel, Black_David, Martin K. Petersen,
Tom Coughlan, Matthew Wilcox, Jens Axboe
On Tue, Nov 11, 2008 at 12:07:17PM -0500, jim owens wrote:
> I agree with you. Thus my earlier assertion:
>
> - trim/unmap for SSD garbage collection has a different goal
> than trim/unmap for thin provisioning.
Yes, I agree.
> In the Thin Provision mode, we want to delay them as you said:
>
>> So I think not actually doing these on every alloc/free is a good idea.
>> Instead the filesystem would send the frees down once big enough regions appear,
>
> to be filesystem friendly.
>
> But this won't change the block layer. This is a per-filesystem
> coding issue to decide when to send the discard.
Yes. But for this latter case large (or, to a lesser degree, odd) unmap sizes
aren't that bothersome. For the SSD use case they would be. Note that
filesystems will need some SSD-awareness anyway, e.g. I have a local
hack for XFS that never bothers to look for extents in the by-bno
index, and I'm currently prototyping a version that doesn't even update
it (could be converted back to a regular one using repair).
>
> jim
---end quoted text---
^ permalink raw reply [flat|nested] 105+ messages in thread