Re: [PATCH RFC] block: use discard if possible in blkdev_issue

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
       [not found]     ` <yq161odofis.fsf@sermon.lab.mkp.net>
@ 2014-02-17 19:19       ` Theodore Ts'o
  2014-02-18  1:31         ` Martin K. Petersen
  0 siblings, 1 reply; 6+ messages in thread
From: Theodore Ts'o @ 2014-02-17 19:19 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, Jens Axboe, linux-ext4

On Mon, Feb 17, 2014 at 11:44:27AM -0500, Martin K. Petersen wrote:
> Ted> Basically, who was practicing engineering malpractice?  The SSD
> Ted> vendors, or the T10/T13 spec authors?
> 
> I think it's important to emphasize that T10/T13 specs are mainly
> written by device vendors. And they have a very strong objection to
> complicating the device firmware, keeping internal state, etc. So the
> outcome is very rarely in the operating system's favor. I completely
> agree that these flags are broken by definition.

Sigh...

One of the reasons why this came up is if you are implementing a cloud
hosting service, where disk is emulated, and since you are trying to
do something cheap-cheap-cheap (for example, OpenShift from Red Hat
has a very generous free guests policy), it's likely that you're using
something like qcow2, or thinp, or something similar to emulate disks
to drive storage costs down.  So anything we can do to eliminate I/O
work at the Host OS layer is going to be really visible, and this
includes replacing zero-block writes with the equivalent of punch or
TRIM w/ ZRAT.

> The only discard approach that provides a guaranteed result is WRITE
> SAME with the UNMAP bit set (i.e. SCSI only).

So currently blkdev_issue_zeroout() will do the WRITE SAME, but it
doesn't set the UNMAP bit, correct?  I understand there will be
environments where performance is more important than cost, where it
may not be a good idea to set the UNMAP bit.  So it sounds like what
we should do is add a flags which controls whether or not to use TRIM
w/ ZRAT or WRITE SAME with the UNMAP bit is set.

We'll then also need to work with the KVM folks to make sure that
WRITE SAME w/ UNMAP gets plumbed through to the KVM userspace, which
can then do something like FL_PUNCH if it is using a raw sparse image,
or the equivalent in qcow2, etc.

(If the KVM folks want to be even more aggressive, if they know they
are using an underlying storage system where keeping the allocated
blocks isn't really going to help performance, even if the UNMAP bit
isn't set and the data block is all zero's, maybe they might want to
unmap the block(s) anyway.  Or we could leave this up to the Guest OS
userspace, and plumb a hint from the Host to the Guest that it should
really use WRITE SAME w/ UNMAP.  But I'm not convinced it's worth it.)

Does this sound like a reasonable way to go?

> The good news is that most devices that report DRAT/RZAT are doing the
> right thing due to server/RAID vendor pressure.   But SSD vendors are
> generally not willing to give such guarantees in the datasheets.

I imagine the reason why they aren't willing to give such guarantees
is that it would cost more to do the testing to assure this, and while
they know that a certain firmwar version shipped to $BIG_HDD_CUSTOMER
does the right thing, it might regress without their knowing about it
in some future firmware version.

On the other hand, if there was a white list kept somewhere, either in
the kernel, or in some more dynamically updated list (ala what
smartctl does to get the latest vendor-specific attributes), being on
the white list might be enough of a commercial advantage that drive
vendors would be incentivized to provide such a guarantee.  Especially
if, say, a major SSD vendor such as Intel could be induced make such a
public guarantee and we publicized this fact.

						- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-17 19:19       ` [PATCH RFC] block: use discard if possible in blkdev_issue_discard() Theodore Ts'o
@ 2014-02-18  1:31         ` Martin K. Petersen
  2014-02-18  2:17           ` Theodore Ts'o
  0 siblings, 1 reply; 6+ messages in thread
From: Martin K. Petersen @ 2014-02-18  1:31 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Martin K. Petersen, linux-fsdevel, Jens Axboe, linux-ext4

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted,

Ted> So currently blkdev_issue_zeroout() will do the WRITE SAME, but it
Ted> doesn't set the UNMAP bit, correct?  

Correct. Because it explicitly tells the device to write zeroes.

Ted> I understand there will be environments where performance is more
Ted> important than cost, where it may not be a good idea to set the
Ted> UNMAP bit.  So it sounds like what we should do is add a flags
Ted> which controls whether or not to use TRIM w/ ZRAT or WRITE SAME
Ted> with the UNMAP bit is set.

The rationale behind blkdev_issue_discard was to provide a facility to
mark a block range as unused by the filesystem. With the expectation
that those blocks would be deallocated/deprovisioned on the device.

The rationale behind blkdev_issue_zeroout was to provide a facility to
provide a cleared block range. With the expectation that those blocks
would be allocated/provisioned on the device.

Your variant seems to land somewhere in-between. You want a
blkdev_issue_clear() that zeroes a block range and it's then up to the
storage device to decide whether to provision or deprovision the space
as long as you are guaranteed to get zeroes back for each block in the
range on a subsequent read. Is that a correct interpretation?

I'm trying to pin down your exact use case because it can get very murky
between the SCSI and ATA variants. And the fact that the same knobs are
used for both over and under-provisioned devices. SCSI also has an
additional state: Blocks can be either mapped, anchored or deallocated.

Ted> On the other hand, if there was a white list kept somewhere, either
Ted> in the kernel, or in some more dynamically updated list (ala what
Ted> smartctl does to get the latest vendor-specific attributes), being
Ted> on the white list might be enough of a commercial advantage that
Ted> drive vendors would be incentivized to provide such a guarantee.
Ted> Especially if, say, a major SSD vendor such as Intel could be
Ted> induced make such a public guarantee and we publicized this fact.

I'm perfectly fine with maintaining a whitelist if we can get vendors to
commit.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  1:31         ` Martin K. Petersen
@ 2014-02-18  2:17           ` Theodore Ts'o
  2014-02-18  3:44             ` Martin K. Petersen
  0 siblings, 1 reply; 6+ messages in thread
From: Theodore Ts'o @ 2014-02-18  2:17 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, Jens Axboe, linux-ext4

On Mon, Feb 17, 2014 at 08:31:50PM -0500, Martin K. Petersen wrote:
> 
> Your variant seems to land somewhere in-between. You want a
> blkdev_issue_clear() that zeroes a block range and it's then up to the
> storage device to decide whether to provision or deprovision the space
> as long as you are guaranteed to get zeroes back for each block in the
> range on a subsequent read. Is that a correct interpretation?

Ultimately the storage device (or host OS) can always decide to
deprovision the space if it is given a write of all zeroes; there's
nothing in the specs that say anything at all about performance
considerations, or what the back-end storage decides to do in order to
handle a particular write command, so long as a subsequent read
returns the correct data.

So what I want is a way in which the guest (kernel, file system, etc)
should be able to request that the space be deprovisioned.  So it's a
hint insofar ultimately, the host can always decide whether or not
whether or not to honor the deprovision request (or the host could
decide to deprovision a block even if it's not explicitly requested).

However, I don't want it to be a hint insofar that a subsequent read
is *guaranteed* to return all zeros after this "blkdev_issue_clear()"
command has been successfully sent to the device.

Does that make sense?

> I'm trying to pin down your exact use case because it can get very murky
> between the SCSI and ATA variants. And the fact that the same knobs are
> used for both over and under-provisioned devices. SCSI also has an
> additional state: Blocks can be either mapped, anchored or deallocated.

What does "anchored" mean?  Does is it the equivalent of using
fallocate() to allocate the block, but it's marked uninitialized so
any attempt to read it returns zeroes?  If so, that certainly sounds
like useful functionality that would be great if KVM exposed via
virt-scsi.

						- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  2:17           ` Theodore Ts'o
@ 2014-02-18  3:44             ` Martin K. Petersen
  2014-02-18  5:47               ` Theodore Ts'o
  0 siblings, 1 reply; 6+ messages in thread
From: Martin K. Petersen @ 2014-02-18  3:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Martin K. Petersen, linux-fsdevel, Jens Axboe, linux-ext4

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted> Ultimately the storage device (or host OS) can always decide to
Ted> deprovision the space if it is given a write of all zeroes; there's
Ted> nothing in the specs that say anything at all about performance
Ted> considerations, or what the back-end storage decides to do in order
Ted> to handle a particular write command, so long as a subsequent read
Ted> returns the correct data.

Well, that has changed a bit with the logical block provisioning bits in
SCSI. That's why I brought up the allocation/deallocation assumptions in
the existing two blkdev_issue_foo() calls.

Ted> Does that make sense?

Yeah. So deprovision with guaranteed zero on read is what you're
after. I'll chew on that a bit tomorrow.

Ted> What does "anchored" mean?  Does is it the equivalent of using
Ted> fallocate() to allocate the block, but it's marked uninitialized so
Ted> any attempt to read it returns zeroes?

It means that any allocations internal to the storage device which are
required to subsequently perform a write to that block have been
made. So it's a way to reserve space without actually writing the
blocks. Whether reading an anchored block returns zeroes or something
else depends on the usual twisted maze of conflicting and confusing
flags.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  3:44             ` Martin K. Petersen
@ 2014-02-18  5:47               ` Theodore Ts'o
  2014-02-19  2:20                 ` Martin K. Petersen
  0 siblings, 1 reply; 6+ messages in thread
From: Theodore Ts'o @ 2014-02-18  5:47 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, Jens Axboe, linux-ext4

On Mon, Feb 17, 2014 at 10:44:47PM -0500, Martin K. Petersen wrote:
> 
> Well, that has changed a bit with the logical block provisioning bits in
> SCSI. That's why I brought up the allocation/deallocation assumptions in
> the existing two blkdev_issue_foo() calls.

Is it a fair assumption that the reason why T10 added these bits is
mainly so that clients of thin-provisioned storage devices can
guarantee that a subseq uent write won't fail?  Since historically the
spec writers have washed their hands of anything that might vaguely
smell of performance considerations....

> Yeah. So deprovision with guaranteed zero on read is what you're
> after. I'll chew on that a bit tomorrow.

Yes.  And also some way for the host OS (or some other underlying
storage device, more generally) to send hints to the guest OS about
the best way to tune filesystems at mkfs and/or mount time for best
performance, so we don't have to require the system administrator to
have to manually set mount options, mkfs options, and/or magic "echo"
commands to /proc or /sys files.

It would be great if we could get SATA and SCSI devices to also
deliver these hints to kernel, or to have our kernels make some
hueristics based on various SCSI mode pages, and then deliver it to
the file system or via some defined /sys files so that userspace
programs like mkfs can automatically DTRT.  I'm not sure if this is
going to require spec changes and hardware changes, or whether there
are some existing hints form the device drivers that we might be able
to leverage.

For example, right now I'm just manually using the discard mount
options on certain PCIe-attached flash where I know it's beneficial,
but it's a manual tuning based on knowledge of the underlying storage
device.  Figuring out when it's better to use fstrim, or doing it at
unlink time, etc., is something that's better done automatically
instead of manually, but this is I fear answering questions like this
in a reliable fashion is going to be a very hard problem --- and as
storage devices get more complex, such as hybrid drives with even more
varied and interesting performance characteristics, it's only going to
get harder!

						- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  5:47               ` Theodore Ts'o
@ 2014-02-19  2:20                 ` Martin K. Petersen
  0 siblings, 0 replies; 6+ messages in thread
From: Martin K. Petersen @ 2014-02-19  2:20 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Martin K. Petersen, linux-fsdevel, Jens Axboe, linux-ext4

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted,

Ted> Is it a fair assumption that the reason why T10 added these bits is
Ted> mainly so that clients of thin-provisioned storage devices can
Ted> guarantee that a subseq uent write won't fail? 

Yes. It's a way to lock down blocks for future use without having to
actually write them.

Ted> Figuring out when it's better to use fstrim, or doing it at unlink
Ted> time, etc., is something that's better done automatically instead
Ted> of manually, but this is I fear answering questions like this in a
Ted> reliable fashion is going to be a very hard problem

We'll quickly get back to the problem we had with the other I/O hints.
Namely the device vendor having to provide an honest indicator of their
performance characteristics.

Recent SCSI has a way to communicate maximum values for processing
various types of commands. Going forward we can use those values to tune
our request timeouts. But as far as the normal processing time is
concerned it's hard to see a vendor putting anything in there other than
"this one goes to 11".

Over time we may be able to switch to online discards by default. But at
this stage drives with queued trim support are just starting to roll
out. And some thin-provisioned arrays were evidently also designed with
a monthly Norton SpeedDisk run in mind.

However, I'm open to having a prefer_online_discards flag that dedicated
PCIe SSD drivers could set. It's not until we enter SCSI/ATA territory
that things get murky.

On the topic of reliable discards: Apparently there are some amendments
being discussed in T10/T13 right now that may help us. Still
investigating...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2014-02-19  2:20 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20140214043256.GA5145@thunk.org>
     [not found] ` <yq1ioshobul.fsf@sermon.lab.mkp.net>
     [not found]   ` <20140215012901.GA28307@thunk.org>
     [not found]     ` <yq161odofis.fsf@sermon.lab.mkp.net>
2014-02-17 19:19       ` [PATCH RFC] block: use discard if possible in blkdev_issue_discard() Theodore Ts'o
2014-02-18  1:31         ` Martin K. Petersen
2014-02-18  2:17           ` Theodore Ts'o
2014-02-18  3:44             ` Martin K. Petersen
2014-02-18  5:47               ` Theodore Ts'o
2014-02-19  2:20                 ` Martin K. Petersen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).