public inbox for linux-fsdevel@vger.kernel.org
* Re: Thin device provisioning
       [not found] ` <AC32D7C72530234288643DD5F1435D53A80EC3@RTPMVEXC1-PRD.hq.netapp.com>
@ 2008-08-09 16:45   ` Matthew Wilcox
  2008-08-09 17:12     ` Knight, Frederick
  2008-08-10  0:50     ` Jamie Lokier
  0 siblings, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2008-08-09 16:45 UTC (permalink / raw)
  To: Knight, Frederick
  Cc: David Woodhouse, ricwheeler, linux-fsdevel, Christoph Hellwig


I've spoken with a few Linux filesystem people.  They find it
significantly easier to send a single LBA/length pair at a time.
Modern filesystems try quite hard to keep fragmentation to a minimum, so
they don't expect a performance hit from sending multiple commands.
They're non-blocking writes, and the IO elevators can take care of
sending more important reads first.

On Fri, Aug 08, 2008 at 05:58:15PM -0400, Knight, Frederick wrote:
> Thank you for your input.  Yes, this was discussed.  Most filesystems do
> not create fully contiguous files, and therefore when a file is deleted,
> some number of discontiguous extents must be punched/discarded.  This is
> the reason for the list of LBA/length pairs.
> 
> The alternative is that the filesystem must create 1 unique request to
> the driver for each extent, and the driver must create 1 CDB for each
> extent, and the filesystem must then iterate through the entire file for
> each discontiguous range.  It was felt that this would be a problem for
> filesystems and drivers to do this type of operation, and it was
> preferred to simply supply a list.
> 
> Thanks again for the input,
> 
> 	Fred Knight
> 	SAN Standards Technologist
> 	NetApp
> 
> 
> 
> -----Original Message-----
> From: Matthew Wilcox [mailto:matthew@wil.cx] 
> Sent: Friday, August 08, 2008 9:15 AM
> To: Knight, Frederick
> Cc: David Woodhouse; ricwheeler@gmail.com
> Subject: Thin device provisioning
> 
> * From the T10 Reflector (t10@t10.org), posted by:
> * Matthew Wilcox <matthew@wil.cx>
> *
> Good morning Fred,
> 
> I've been looking at your 08-149r0.pdf with a view to using the 'PUNCH'
> command to implement the Linux 'DISCARD' command.  It's a little
> over-specified for what we need and this causes the implementation to be
> a little more complex than I would like.  The excess capability is the
> ability to do multiple punches in a single command.  Do you really need
> to be able to add/remove lots of ranges atomically, or could you use a
> command specified like this:
> 
> 0       0x9F
> 1       service action
> 2-9     LBA
> 10-13   length
> 14      reserved
> 15      control
> 
> and send one command for each range?
> 
> Apologies if this has already been covered in a T10 discussion; I'm not
> a member and though I've searched the archives, I may have missed a
> discussion.
> 
> --
> Intel are signing my paycheques ... these opinions are still mine "Bill,
> look, we understand that you're interested in selling us this operating
> system, but compare it to ours.  We can't possibly take such a
> retrograde step."
> 
> *
> * For T10 Reflector information, send a message with
> * 'info t10' (no quotes) in the message body to majordomo@t10.org

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
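[The 16-byte single-range CDB layout proposed in the quoted message can be sketched as a byte-packing helper. This is purely an illustration of the proposed layout, not code from any driver; the function name and the 5-bit masking of the service action field are assumptions.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack the proposed single-range command: opcode 0x9F in byte 0,
 * service action in byte 1, an 8-byte big-endian LBA in bytes 2-9,
 * a 4-byte big-endian length in bytes 10-13, byte 14 reserved,
 * and the control byte in byte 15. */
static void build_punch_cdb(uint8_t cdb[16], uint8_t service_action,
                            uint64_t lba, uint32_t length, uint8_t control)
{
    memset(cdb, 0, 16);
    cdb[0] = 0x9F;
    cdb[1] = service_action & 0x1F;   /* service actions are 5 bits wide */
    for (int i = 0; i < 8; i++)       /* SCSI fields are big-endian */
        cdb[2 + i] = (uint8_t)(lba >> (8 * (7 - i)));
    for (int i = 0; i < 4; i++)
        cdb[10 + i] = (uint8_t)(length >> (8 * (3 - i)));
    cdb[15] = control;
}
```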

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Thin device provisioning
  2008-08-09 16:45   ` Thin device provisioning Matthew Wilcox
@ 2008-08-09 17:12     ` Knight, Frederick
  2008-08-12 18:56       ` David Woodhouse
  2008-08-10  0:50     ` Jamie Lokier
  1 sibling, 1 reply; 9+ messages in thread
From: Knight, Frederick @ 2008-08-09 17:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Woodhouse, ricwheeler, linux-fsdevel, Christoph Hellwig

You are free to create a single pair API between your filesystems and
drivers and send SCSI commands one LBA/length pair at a time.  However,
I'm afraid the voting members of T10 (representing a number of other
operating systems and their multitude of filesystems) preferred the list
approach.

We would welcome your input at the T10 meetings.
http://www.t10.org/meeting.htm - based on the input from the last
meeting, a revised proposal will be on the September Agenda in Colorado
Springs.

	Fred

-----Original Message-----
From: Matthew Wilcox [mailto:matthew@wil.cx] 
Sent: Saturday, August 09, 2008 12:46 PM
To: Knight, Frederick
Cc: David Woodhouse; ricwheeler@gmail.com;
linux-fsdevel@vger.kernel.org; Christoph Hellwig
Subject: Re: Thin device provisioning


I've spoken with a few Linux filesystem people.  They find it
significantly easier to send a single LBA/length pair at a time.
Modern filesystems try quite hard to keep fragmentation to a minimum, so
they don't expect a performance hit from sending multiple commands.
They're non-blocking writes, and the IO elevators can take care of
sending more important reads first.

On Fri, Aug 08, 2008 at 05:58:15PM -0400, Knight, Frederick wrote:
> Thank you for your input.  Yes, this was discussed.  Most filesystems 
> do not create fully contiguous files, and therefore when a file is 
> deleted, some number of discontiguous extents must be 
> punched/discarded.  This is the reason for the list of LBA/length pairs.
> 
> The alternative is that the filesystem must create 1 unique request to
> the driver for each extent, and the driver must create 1 CDB for each 
> extent, and the filesystem must then iterate through the entire file 
> for each discontiguous range.  It was felt that this would be a 
> problem for filesystems and drivers to do this type of operation, and 
> it was preferred to simply supply a list.
> 
> Thanks again for the input,
> 
> 	Fred Knight
> 	SAN Standards Technologist
> 	NetApp
> 
> 
> 
> -----Original Message-----
> From: Matthew Wilcox [mailto:matthew@wil.cx]
> Sent: Friday, August 08, 2008 9:15 AM
> To: Knight, Frederick
> Cc: David Woodhouse; ricwheeler@gmail.com
> Subject: Thin device provisioning
> 
> * From the T10 Reflector (t10@t10.org), posted by:
> * Matthew Wilcox <matthew@wil.cx>
> *
> Good morning Fred,
> 
> I've been looking at your 08-149r0.pdf with a view to using the 'PUNCH'
> command to implement the Linux 'DISCARD' command.  It's a little 
> over-specified for what we need and this causes the implementation to 
> be a little more complex than I would like.  The excess capability is 
> the ability to do multiple punches in a single command.  Do you really
> need to be able to add/remove lots of ranges atomically, or could you 
> use a command specified like this:
> 
> 0       0x9F
> 1       service action
> 2-9     LBA
> 10-13   length
> 14      reserved
> 15      control
> 
> and send one command for each range?
> 
> Apologies if this has already been covered in a T10 discussion; I'm 
> not a member and though I've searched the archives, I may have missed 
> a discussion.
> 
> --
> Intel are signing my paycheques ... these opinions are still mine 
> "Bill, look, we understand that you're interested in selling us this 
> operating system, but compare it to ours.  We can't possibly take such
> a retrograde step."
> 
> *
> * For T10 Reflector information, send a message with
> * 'info t10' (no quotes) in the message body to majordomo@t10.org

--
Intel are signing my paycheques ... these opinions are still mine "Bill,
look, we understand that you're interested in selling us this operating
system, but compare it to ours.  We can't possibly take such a
retrograde step."

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Thin device provisioning
  2008-08-09 16:45   ` Thin device provisioning Matthew Wilcox
  2008-08-09 17:12     ` Knight, Frederick
@ 2008-08-10  0:50     ` Jamie Lokier
  2008-08-10  3:51       ` Matthew Wilcox
  1 sibling, 1 reply; 9+ messages in thread
From: Jamie Lokier @ 2008-08-10  0:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Knight, Frederick, David Woodhouse, ricwheeler, linux-fsdevel,
	Christoph Hellwig

Matthew Wilcox wrote:
> I've spoken with a few Linux filesystem people.  They find it
> significantly easier to send a single LBA/length pair at a time.
> Modern filesystems try quite hard to keep fragmentation to a minimum, so
> they don't expect a performance hit from sending multiple commands.
> They're non-blocking writes, and the IO elevators can take care of
> sending more important reads first.

Perhaps there are occasions when it's more efficient for the disk to
process several LBA/length pairs in a single operation?

I.e. you send the first pair, the disk starts working, then you send
the second, and that's not as efficient as doing both at the same
time, which might translate to a single commit on SSD.

The general solution to that would be a 'CORK' operation, though,
similar to TCP_CORK: this operation will be followed by others, you
may start it now, but don't rush...
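[The TCP_CORK-style batching suggested above could look something like this sketch: while the queue is corked, ranges accumulate and are handed to the device in a single submission on uncork. All names and the fixed batch size are hypothetical, for illustration only.]

```c
#include <stddef.h>

struct range { unsigned long long lba; unsigned int len; };

#define MAX_BATCH 16

/* Hypothetical corked discard queue: while corked, ranges accumulate;
 * on uncork (or when the batch fills) they are submitted together,
 * giving the device a chance to process them as one operation. */
struct discard_queue {
    struct range batch[MAX_BATCH];
    size_t count;
    int corked;
    void (*submit)(const struct range *ranges, size_t n); /* device hook */
};

static void dq_flush(struct discard_queue *q)
{
    if (q->count) {
        q->submit(q->batch, q->count);
        q->count = 0;
    }
}

static void dq_discard(struct discard_queue *q,
                       unsigned long long lba, unsigned int len)
{
    q->batch[q->count].lba = lba;
    q->batch[q->count].len = len;
    q->count++;
    if (!q->corked || q->count == MAX_BATCH)
        dq_flush(q);              /* uncorked requests go out at once */
}

static void dq_cork(struct discard_queue *q)   { q->corked = 1; }
static void dq_uncork(struct discard_queue *q) { q->corked = 0; dq_flush(q); }
```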

-- Jamie

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Thin device provisioning
  2008-08-10  0:50     ` Jamie Lokier
@ 2008-08-10  3:51       ` Matthew Wilcox
  0 siblings, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2008-08-10  3:51 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Knight, Frederick, David Woodhouse, ricwheeler, linux-fsdevel,
	Christoph Hellwig

On Sun, Aug 10, 2008 at 01:50:38AM +0100, Jamie Lokier wrote:
> Matthew Wilcox wrote:
> > I've spoken with a few Linux filesystem people.  They find it
> > significantly easier to send a single LBA/length pair at a time.
> > Modern filesystems try quite hard to keep fragmentation to a minimum, so
> > they don't expect a performance hit from sending multiple commands.
> > They're non-blocking writes, and the IO elevators can take care of
> > sending more important reads first.
> 
> Perhaps there are occasions when it's more efficient for the disk to
> process several LBA/length pairs in a single operation?
> 
> I.e. you send the first pair, the disk starts working, then you send
> the second, and that's not as efficient as doing both at the same
> time, which might translate to a single commit on SSD.
> 
> The general solution to that would be a 'CORK' operation, though,
> similar to TCP_CORK: this operation will be followed by others, you
> may start it now, but don't rush...

If you read what Fred Knight (of the T10 committee) wrote, he said that
they felt it would be easier for filesystems and drivers to use multiple
extents.  He didn't say anything about drives finding it more efficient.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Thin device provisioning
  2008-08-09 17:12     ` Knight, Frederick
@ 2008-08-12 18:56       ` David Woodhouse
  2008-08-12 20:38         ` Knight, Frederick
  0 siblings, 1 reply; 9+ messages in thread
From: David Woodhouse @ 2008-08-12 18:56 UTC (permalink / raw)
  To: Knight, Frederick
  Cc: Matthew Wilcox, ricwheeler, linux-fsdevel, Christoph Hellwig

On Sat, 2008-08-09 at 13:12 -0400, Knight, Frederick wrote:
> You are free to create a single pair API between your filesystems and
> drivers and send SCSI commands one LBA/length pair at a time.  However,
> I'm afraid the voting members of T10 (representing a number of other
> operating systems and their multitude of filesystems) preferred the list
> approach.

That seems like unneeded complexity to me. Admittedly the sample set of
file systems I've converted to use this is small so far -- only 2 -- but
it seems to me that it's always likely to be done in the 'free extent'
code path as we mark blocks free, and I can't see the file system really
taking advantage of the ability to pack multiple regions into the same
request.

It's hard for I/O elevators to merge these requests when they're not
contiguous, too -- it would require a lot of special-casing to make it
happen at that level.

It's also somewhat suboptimal that it doesn't match the T13 'TRIM'
proposal, which allows only a single range to be specified. That'll be
fun for anyone doing SCSI<->ATA conversion...

Has anyone actually claimed that they can and will use the list form,
and have they also told the T13 forum the same thing?

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation




^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Thin device provisioning
  2008-08-12 18:56       ` David Woodhouse
@ 2008-08-12 20:38         ` Knight, Frederick
  2008-08-12 23:21           ` Matthew Wilcox
  0 siblings, 1 reply; 9+ messages in thread
From: Knight, Frederick @ 2008-08-12 20:38 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Matthew Wilcox, ricwheeler, linux-fsdevel, Christoph Hellwig

Yes, we have hosts that want to use the list form.

I don't see how it fails to match the T13 TRIM command.  Both can do
single ranges.  In both cases, you can have 1 LBA and 1 length.  There
is nothing requiring > 1 range to be sent via the SCSI proposal.  In
both cases, you pass the same values to the H/W driver.  In one H/W
driver it will load a bunch of values (including the LBA/length) into a
set of registers (PATA) or a memory structure (SATA).  In the other H/W
driver, it will load a bunch of values into memory structures
(CDB/buffer), and then tweak the H/W to send the memory structures.

Most SCSI drivers I've seen that have tagged queuing enabled turn off
their elevator algorithms (since the drive itself is doing its own
optimizations).

There is no difference at the filesystem de-allocator level.  The only
difference is how the H/W sends the values to the other end of the wire,
and there will always be differences at that layer. 

	Fred Knight

-----Original Message-----
From: David Woodhouse [mailto:dwmw2@infradead.org] 
Sent: Tuesday, August 12, 2008 2:56 PM
To: Knight, Frederick
Cc: Matthew Wilcox; ricwheeler@gmail.com; linux-fsdevel@vger.kernel.org;
Christoph Hellwig
Subject: RE: Thin device provisioning

On Sat, 2008-08-09 at 13:12 -0400, Knight, Frederick wrote:
> You are free to create a single pair API between your filesystems and 
> drivers and send SCSI commands one LBA/length pair at a time.  
> However, I'm afraid the voting members of T10 (representing a number 
> of other operating systems and their multitude of filesystems) 
> preferred the list approach.

That seems like unneeded complexity to me. Admittedly the sample set of
file systems I've converted to use this is small so far -- only 2 -- but
it seems to me that it's always likely to be done in the 'free extent'
code path as we mark blocks free, and I can't see the file system really
taking advantage of the ability to pack multiple regions into the same
request.

It's hard for I/O elevators to merge these requests when they're not
contiguous, too -- it would require a lot of special-casing to make it
happen at that level.

It's also somewhat suboptimal that it doesn't match the T13 'TRIM'
proposal, which allows only a single range to be specified. That'll be
fun for anyone doing SCSI<->ATA conversion...

Has anyone actually claimed that they can and will use the list form,
and have they also told the T13 forum the same thing?

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Thin device provisioning
  2008-08-12 20:38         ` Knight, Frederick
@ 2008-08-12 23:21           ` Matthew Wilcox
  2008-08-13 16:50             ` Alan D. Brunelle
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2008-08-12 23:21 UTC (permalink / raw)
  To: Knight, Frederick
  Cc: David Woodhouse, ricwheeler, linux-fsdevel, Christoph Hellwig

On Tue, Aug 12, 2008 at 04:38:48PM -0400, Knight, Frederick wrote:
> I don't see how it fails to match the T13 TRIM command.  Both can do
> single ranges.  In both cases, you can have 1 LBA and 1 length.  There
> is nothing requiring > 1 range to be sent via the SCSI proposal.  In
> both cases, you pass the same values to the H/W driver.  In one H/W
> driver it will load a bunch of values (including the LBA/length) into
> a set of registers (PATA) or a memory structure (SATA).  In the other
> H/W driver, it will load a bunch of values into memory structures
> (CDB/buffer), and then tweak the H/W to send the memory structures.

If you consider a SATL implemented in an array device, it can receive a
PUNCH command with multiple ranges.  It must then send multiple TRIM
commands, one for each range.
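[The SATL behaviour described above could be sketched as a simple translation loop: a multi-range PUNCH arrives as a list of LBA/length pairs, and the translation layer must emit one single-range TRIM per pair. The names and the callback shape are illustrative, not from any real SATL.]

```c
#include <stddef.h>

struct lba_range { unsigned long long lba; unsigned int len; };

/* Sketch of a SCSI-to-ATA translation layer handling a multi-range
 * punch: since the T13 proposal under discussion takes one range per
 * command, each LBA/length pair becomes its own TRIM.  Returns the
 * number of ranges successfully translated. */
static size_t satl_punch_to_trim(const struct lba_range *ranges, size_t n,
                                 int (*send_trim)(unsigned long long lba,
                                                  unsigned int len))
{
    size_t sent = 0;
    for (size_t i = 0; i < n; i++) {
        if (send_trim(ranges[i].lba, ranges[i].len) != 0)
            break;              /* device error: stop translating */
        sent++;
    }
    return sent;
}
```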

The proposal is also suboptimal if the common case is just one range.
The SCSI driver has to allocate a 20-byte block and do a DATA OUT
command.

> Most SCSI drivers I've seen that have tagged queuing enabled turn off
> their elevator algorithms (since the drive itself is doing its own
> optimizations).

In Linux, we try not to have elevators in the device drivers themselves
(though I believe there are still a few which have their own).  Instead we
have an elevator in the block layer where typically we have much more
information about which IOs can be merged and which IOs cannot pass
each other, which OS process submitted the IO (and hence can do fair
scheduling between different users) and so on.

Each request queue (~= SCSI LUN) can choose which elevator controls its
behaviour, so if it works out better to have the drive do the scheduling,
it can be disabled by switching to the noop elevator.
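[As a concrete example of the per-queue switch described above (a config fragment; the device name is illustrative and requires root):]

```shell
# Show the available elevators for one queue; the active one is in brackets.
cat /sys/block/sda/queue/scheduler

# Hand scheduling off to the drive by selecting the noop elevator.
echo noop > /sys/block/sda/queue/scheduler
```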

> There is no difference at the filesystem de-allocator level.  The only
> difference is how the H/W sends the values to the other end of the wire,
> and there will always be differences at that layer. 

I think Dave's point is that batching all the discards together into one
list isn't a natural interface for a filesystem; they prefer an
interface which is a single extent.


I'll make a counter-proposal though ... we rename all the commands in
08-149r0 to PUNCH MULTI, ERASE MULTI, etc and add single-(LBA, length)
versions of them.  What do you think?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Thin device provisioning
  2008-08-12 23:21           ` Matthew Wilcox
@ 2008-08-13 16:50             ` Alan D. Brunelle
  2008-08-13 17:04               ` David Woodhouse
  0 siblings, 1 reply; 9+ messages in thread
From: Alan D. Brunelle @ 2008-08-13 16:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Knight, Frederick, David Woodhouse, ricwheeler, linux-fsdevel,
	Christoph Hellwig

Matthew Wilcox wrote:
> On Tue, Aug 12, 2008 at 04:38:48PM -0400, Knight, Frederick wrote:
>> I don't see how it fails to match the T13 TRIM command.  Both can do
>> single ranges.  In both cases, you can have 1 LBA and 1 length.  There
>> is nothing requiring > 1 range to be sent via the SCSI proposal.  In
>> both cases, you pass the same values to the H/W driver.  In one H/W
>> driver it will load a bunch of values (including the LBA/length) into
>> a set of registers (PATA) or a memory structure (SATA).  In the other
>> H/W driver, it will load a bunch of values into memory structures
>> (CDB/buffer), and then tweak the H/W to send the memory structures.
> 
> If you consider a SATL implemented in an array device, it can receive a
> PUNCH command with multiple ranges.  It must then send multiple TRIM
> commands, one for each range.
> 
> The proposal is also suboptimal if the common case is just one range.
> The SCSI driver has to allocate a 20-byte block and do a DATA OUT
> command.
> 
>> Most SCSI drivers I've seen that have tagged queuing enabled turn off
>> their elevator algorithms (since the drive itself is doing its own
>> optimizations).
> 
> In Linux, we try not to have elevators in the device drivers themselves
> (though I believe there are still a few which have their own).  Instead we
> have an elevator in the block layer where typically we have much more
> information about which IOs can be merged and which IOs cannot pass
> each other, which OS process submitted the IO (and hence can do fair
> scheduling between different users) and so on.
> 
> Each request queue (~= SCSI LUN) can choose which elevator controls its
> behaviour, so if it works out better to have the drive do the scheduling,
> it can be disabled by switching to the noop elevator.


This is not completely true: the generic elevator code still attempts
some merges, and even the NOOP I/O scheduler performs a primitive sort.
Recent kernels have added a "nomerges" tunable under /sys/block/*/queue
which can turn off the more complicated merge attempts (for any
scheduler).
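[For instance (a config fragment; the sysfs path follows the description above, and the device name is illustrative):]

```shell
# Turn off the more complicated merge attempts for this queue.
echo 1 > /sys/block/sda/queue/nomerges
```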


> 
>> There is no difference at the filesystem de-allocator level.  The only
>> difference is how the H/W sends the values to the other end of the wire,
>> and there will always be differences at that layer. 
> 
> I think Dave's point is that batching all the discards together into one
> list isn't a natural interface for a filesystem; they prefer an
> interface which is a single extent.


Is it expected that the file system code would emit PUNCH directives in
"specially marked" struct bio's through the block I/O storage system?
Then the I/O schedulers would be responsible for discriminating between
PUNCH bio's and "normal" read/write bio's when it performed merging (and
sorting?).

In either case, would the block I/O layer then build "specially marked"
PUNCH requests to the underlying physical drivers?

Alan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Thin device provisioning
  2008-08-13 16:50             ` Alan D. Brunelle
@ 2008-08-13 17:04               ` David Woodhouse
  0 siblings, 0 replies; 9+ messages in thread
From: David Woodhouse @ 2008-08-13 17:04 UTC (permalink / raw)
  To: Alan D. Brunelle
  Cc: Matthew Wilcox, Knight, Frederick, ricwheeler, linux-fsdevel,
	Christoph Hellwig

On Wed, 2008-08-13 at 12:50 -0400, Alan D. Brunelle wrote:
> Is it expected that the file system code would emit PUNCH directives in
> "specially marked" struct bio's through the block I/O storage system?

I have implemented a 'sb_issue_discard()' function which file systems
can use to issue such bios. 

This code is in the git tree at {git://, http://}
git.infradead.org/users/dwmw2/discard-2.6.git

> Then the I/O schedulers would be responsible for discriminating between
> PUNCH bio's and "normal" read/write bio's when it performed merging (and
> sorting?).

In the case of 'sb_issue_discard()', the request gets marked as a soft
barrier, which prevents the I/O schedulers from letting other requests
pass it in the queue, and from merging.

This is done to avoid problems for naïve callers when subsequent writes
are scheduled _before_ the discard request. (This can happen if the
blocks are reallocated immediately).

It's possible to issue such requests without the 'soft barrier' tag, by
manually submitting the bios (see the BLKDISCARD ioctl in the same git
tree for an example). But that leaves the submitter responsible for
ensuring that there is some form of barrier or flush before the affected
blocks are reallocated and subsequently rewritten.
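[A minimal userspace sketch of driving that BLKDISCARD ioctl, assuming the interface as it appeared in the discard tree and later mainline: the argument is a pair of 64-bit byte values, offset and length. The helper name is hypothetical, and actually issuing the discard requires root and a discard-capable block device.]

```c
#include <fcntl.h>
#include <linux/fs.h>     /* BLKDISCARD */
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Discard a byte range on a block device via the BLKDISCARD ioctl.
 * Returns 0 on success, -1 if the device cannot be opened or the
 * ioctl fails (e.g. the device does not support discard). */
static int do_blkdiscard(const char *dev, uint64_t offset, uint64_t len)
{
    uint64_t range[2] = { offset, len };   /* start and length, in bytes */
    int fd = open(dev, O_WRONLY);
    if (fd < 0)
        return -1;
    int ret = ioctl(fd, BLKDISCARD, range);
    close(fd);
    return ret < 0 ? -1 : 0;
}
```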

> In either case, would the block I/O layer then build "specially marked"
> PUNCH requests to the underlying physical drivers?

Yes, but only ever for one range of blocks at a time.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2008-08-13 17:04 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <200808081714.m78HEMkA026466@coles02.co.lsil.com>
     [not found] ` <AC32D7C72530234288643DD5F1435D53A80EC3@RTPMVEXC1-PRD.hq.netapp.com>
2008-08-09 16:45   ` Thin device provisioning Matthew Wilcox
2008-08-09 17:12     ` Knight, Frederick
2008-08-12 18:56       ` David Woodhouse
2008-08-12 20:38         ` Knight, Frederick
2008-08-12 23:21           ` Matthew Wilcox
2008-08-13 16:50             ` Alan D. Brunelle
2008-08-13 17:04               ` David Woodhouse
2008-08-10  0:50     ` Jamie Lokier
2008-08-10  3:51       ` Matthew Wilcox
