* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  [not found] ` <yq161odofis.fsf@sermon.lab.mkp.net>
@ 2014-02-17 19:19 ` Theodore Ts'o
  2014-02-18  1:31   ` Martin K. Petersen
  0 siblings, 1 reply; 6+ messages in thread

From: Theodore Ts'o @ 2014-02-17 19:19 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-fsdevel, Jens Axboe, linux-ext4

On Mon, Feb 17, 2014 at 11:44:27AM -0500, Martin K. Petersen wrote:
> Ted> Basically, who was practicing engineering malpractice?  The SSD
> Ted> vendors, or the T10/T13 spec authors?
>
> I think it's important to emphasize that T10/T13 specs are mainly
> written by device vendors.  And they have a very strong objection to
> complicating the device firmware, keeping internal state, etc.  So the
> outcome is very rarely in the operating system's favor.  I completely
> agree that these flags are broken by definition.

Sigh...

One of the reasons why this came up is that if you are implementing a
cloud hosting service, where disk is emulated, and you are trying to do
something cheap-cheap-cheap (for example, OpenShift from Red Hat has a
very generous free guests policy), it's likely that you're using
something like qcow2, or thinp, or something similar to emulate disks to
drive storage costs down.  So anything we can do to eliminate I/O work
at the Host OS layer is going to be really visible, and this includes
replacing zero-block writes with the equivalent of punch or TRIM w/
ZRAT.

> The only discard approach that provides a guaranteed result is WRITE
> SAME with the UNMAP bit set (i.e. SCSI only).

So currently blkdev_issue_zeroout() will do the WRITE SAME, but it
doesn't set the UNMAP bit, correct?

I understand there will be environments where performance is more
important than cost, where it may not be a good idea to set the UNMAP
bit.  So it sounds like what we should do is add a flag which controls
whether to use TRIM w/ ZRAT or WRITE SAME with the UNMAP bit set.
We'll then also need to work with the KVM folks to make sure that WRITE
SAME w/ UNMAP gets plumbed through to the KVM userspace, which can then
do something like FL_PUNCH if it is using a raw sparse image, or the
equivalent in qcow2, etc.

(If the KVM folks want to be even more aggressive, and if they know they
are using an underlying storage system where keeping the allocated
blocks isn't really going to help performance, then even if the UNMAP
bit isn't set and the data block is all zeros, maybe they might want to
unmap the block(s) anyway.  Or we could leave this up to the Guest OS
userspace, and plumb a hint from the Host to the Guest that it should
really use WRITE SAME w/ UNMAP.  But I'm not convinced it's worth it.)

Does this sound like a reasonable way to go?

> The good news is that most devices that report DRAT/RZAT are doing the
> right thing due to server/RAID vendor pressure.  But SSD vendors are
> generally not willing to give such guarantees in the datasheets.

I imagine the reason why they aren't willing to give such guarantees is
that it would cost more to do the testing to assure this, and while they
know that a certain firmware version shipped to $BIG_HDD_CUSTOMER does
the right thing, it might regress without their knowing about it in some
future firmware version.

On the other hand, if there was a white list kept somewhere, either in
the kernel, or in some more dynamically updated list (a la what smartctl
does to get the latest vendor-specific attributes), being on the white
list might be enough of a commercial advantage that drive vendors would
be incentivized to provide such a guarantee.  Especially if, say, a
major SSD vendor such as Intel could be induced to make such a public
guarantee and we publicized this fact.

					- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-17 19:19 ` [PATCH RFC] block: use discard if possible in blkdev_issue_discard() Theodore Ts'o
@ 2014-02-18  1:31 ` Martin K. Petersen
  2014-02-18  2:17   ` Theodore Ts'o
  0 siblings, 1 reply; 6+ messages in thread

From: Martin K. Petersen @ 2014-02-18  1:31 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Martin K. Petersen, linux-fsdevel, Jens Axboe, linux-ext4

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted,

Ted> So currently blkdev_issue_zeroout() will do the WRITE SAME, but it
Ted> doesn't set the UNMAP bit, correct?

Correct.  Because it explicitly tells the device to write zeroes.

Ted> I understand there will be environments where performance is more
Ted> important than cost, where it may not be a good idea to set the
Ted> UNMAP bit.  So it sounds like what we should do is add a flag
Ted> which controls whether to use TRIM w/ ZRAT or WRITE SAME with the
Ted> UNMAP bit set.

The rationale behind blkdev_issue_discard() was to provide a facility to
mark a block range as unused by the filesystem, with the expectation
that those blocks would be deallocated/deprovisioned on the device.

The rationale behind blkdev_issue_zeroout() was to provide a facility to
provide a cleared block range, with the expectation that those blocks
would be allocated/provisioned on the device.

Your variant seems to land somewhere in-between.  You want a
blkdev_issue_clear() that zeroes a block range, and it's then up to the
storage device to decide whether to provision or deprovision the space,
as long as you are guaranteed to get zeroes back for each block in the
range on a subsequent read.  Is that a correct interpretation?

I'm trying to pin down your exact use case because it can get very murky
between the SCSI and ATA variants, and the fact that the same knobs are
used for both over- and under-provisioned devices.  SCSI also has an
additional state: blocks can be either mapped, anchored or deallocated.
Ted> On the other hand, if there was a white list kept somewhere,
Ted> either in the kernel, or in some more dynamically updated list (a
Ted> la what smartctl does to get the latest vendor-specific
Ted> attributes), being on the white list might be enough of a
Ted> commercial advantage that drive vendors would be incentivized to
Ted> provide such a guarantee.  Especially if, say, a major SSD vendor
Ted> such as Intel could be induced to make such a public guarantee and
Ted> we publicized this fact.

I'm perfectly fine with maintaining a whitelist if we can get vendors to
commit.

-- 
Martin K. Petersen	Oracle Linux Engineering
* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  1:31 ` Martin K. Petersen
@ 2014-02-18  2:17 ` Theodore Ts'o
  2014-02-18  3:44   ` Martin K. Petersen
  0 siblings, 1 reply; 6+ messages in thread

From: Theodore Ts'o @ 2014-02-18  2:17 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-fsdevel, Jens Axboe, linux-ext4

On Mon, Feb 17, 2014 at 08:31:50PM -0500, Martin K. Petersen wrote:
> 
> Your variant seems to land somewhere in-between. You want a
> blkdev_issue_clear() that zeroes a block range and it's then up to the
> storage device to decide whether to provision or deprovision the space
> as long as you are guaranteed to get zeroes back for each block in the
> range on a subsequent read. Is that a correct interpretation?

Ultimately the storage device (or host OS) can always decide to
deprovision the space if it is given a write of all zeroes; there's
nothing in the specs that says anything at all about performance
considerations, or about what the back-end storage decides to do in
order to handle a particular write command, so long as a subsequent read
returns the correct data.

So what I want is a way in which the guest (kernel, file system, etc.)
can request that the space be deprovisioned.  It's a hint insofar as the
host can always decide whether or not to honor the deprovision request
(or the host could decide to deprovision a block even if it's not
explicitly requested).  However, the zeroing part is not a hint: a
subsequent read is *guaranteed* to return all zeros after this
"blkdev_issue_clear()" command has been successfully sent to the device.

Does that make sense?

> I'm trying to pin down your exact use case because it can get very murky
> between the SCSI and ATA variants. And the fact that the same knobs are
> used for both over and under-provisioned devices. SCSI also has an
> additional state: Blocks can be either mapped, anchored or deallocated.

What does "anchored" mean?
Is it the equivalent of using fallocate() to allocate the block, but
it's marked uninitialized so any attempt to read it returns zeroes?  If
so, that certainly sounds like useful functionality that would be great
for KVM to expose via virt-scsi.

					- Ted
* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  2:17 ` Theodore Ts'o
@ 2014-02-18  3:44 ` Martin K. Petersen
  2014-02-18  5:47   ` Theodore Ts'o
  0 siblings, 1 reply; 6+ messages in thread

From: Martin K. Petersen @ 2014-02-18  3:44 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Martin K. Petersen, linux-fsdevel, Jens Axboe, linux-ext4

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted> Ultimately the storage device (or host OS) can always decide to
Ted> deprovision the space if it is given a write of all zeroes;
Ted> there's nothing in the specs that says anything at all about
Ted> performance considerations, or about what the back-end storage
Ted> decides to do in order to handle a particular write command, so
Ted> long as a subsequent read returns the correct data.

Well, that has changed a bit with the logical block provisioning bits in
SCSI.  That's why I brought up the allocation/deallocation assumptions
in the existing two blkdev_issue_foo() calls.

Ted> Does that make sense?

Yeah.  So deprovision with guaranteed zero on read is what you're
after.  I'll chew on that a bit tomorrow.

Ted> What does "anchored" mean?  Is it the equivalent of using
Ted> fallocate() to allocate the block, but it's marked uninitialized
Ted> so any attempt to read it returns zeroes?

It means that any allocations internal to the storage device which are
required to subsequently perform a write to that block have been made.
So it's a way to reserve space without actually writing the blocks.

Whether reading an anchored block returns zeroes or something else
depends on the usual twisted maze of conflicting and confusing flags.

-- 
Martin K. Petersen	Oracle Linux Engineering
* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  3:44 ` Martin K. Petersen
@ 2014-02-18  5:47 ` Theodore Ts'o
  2014-02-19  2:20   ` Martin K. Petersen
  0 siblings, 1 reply; 6+ messages in thread

From: Theodore Ts'o @ 2014-02-18  5:47 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-fsdevel, Jens Axboe, linux-ext4

On Mon, Feb 17, 2014 at 10:44:47PM -0500, Martin K. Petersen wrote:
> 
> Well, that has changed a bit with the logical block provisioning bits in
> SCSI. That's why I brought up the allocation/deallocation assumptions in
> the existing two blkdev_issue_foo() calls.

Is it a fair assumption that the reason why T10 added these bits is
mainly so that clients of thin-provisioned storage devices can guarantee
that a subsequent write won't fail?  Since historically the spec writers
have washed their hands of anything that might vaguely smell of
performance considerations....

> Yeah. So deprovision with guaranteed zero on read is what you're
> after. I'll chew on that a bit tomorrow.

Yes.  And also some way for the host OS (or some other underlying
storage device, more generally) to send hints to the guest OS about the
best way to tune filesystems at mkfs and/or mount time for best
performance, so we don't have to require the system administrator to
manually set mount options, mkfs options, and/or magic "echo" commands
to /proc or /sys files.

It would be great if we could get SATA and SCSI devices to also deliver
these hints to the kernel, or to have our kernels apply some heuristics
based on various SCSI mode pages, and then deliver the result to the
file system or via some defined /sys files so that userspace programs
like mkfs can automatically DTRT.  I'm not sure if this is going to
require spec changes and hardware changes, or whether there are some
existing hints from the device drivers that we might be able to
leverage.
For example, right now I'm just manually using the discard mount options
on certain PCIe-attached flash where I know it's beneficial, but it's a
manual tuning based on knowledge of the underlying storage device.
Figuring out when it's better to use fstrim, or to discard at unlink
time, etc., is something that's better done automatically instead of
manually.  But I fear answering questions like this in a reliable
fashion is going to be a very hard problem --- and as storage devices
get more complex, such as hybrid drives with even more varied and
interesting performance characteristics, it's only going to get harder!

					- Ted
* Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
  2014-02-18  5:47 ` Theodore Ts'o
@ 2014-02-19  2:20 ` Martin K. Petersen
  0 siblings, 0 replies; 6+ messages in thread

From: Martin K. Petersen @ 2014-02-19  2:20 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Martin K. Petersen, linux-fsdevel, Jens Axboe, linux-ext4

>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:

Ted,

Ted> Is it a fair assumption that the reason why T10 added these bits
Ted> is mainly so that clients of thin-provisioned storage devices can
Ted> guarantee that a subsequent write won't fail?

Yes.  It's a way to lock down blocks for future use without having to
actually write them.

Ted> Figuring out when it's better to use fstrim, or to discard at
Ted> unlink time, etc., is something that's better done automatically
Ted> instead of manually.  But I fear answering questions like this in
Ted> a reliable fashion is going to be a very hard problem

We'll quickly get back to the problem we had with the other I/O hints,
namely the device vendor having to provide an honest indicator of their
performance characteristics.

Recent SCSI has a way to communicate maximum values for processing
various types of commands.  Going forward we can use those values to
tune our request timeouts.  But as far as the normal processing time is
concerned, it's hard to see a vendor putting anything in there other
than "this one goes to 11".

Over time we may be able to switch to online discards by default.  But
at this stage drives with queued trim support are just starting to roll
out.  And some thin-provisioned arrays were evidently also designed with
a monthly Norton SpeedDisk run in mind.  However, I'm open to having a
prefer_online_discards flag that dedicated PCIe SSD drivers could set.
It's not until we enter SCSI/ATA territory that things get murky.

On the topic of reliable discards: apparently there are some amendments
being discussed in T10/T13 right now that may help us.  Still
investigating...
-- 
Martin K. Petersen	Oracle Linux Engineering
end of thread, other threads: [~2014-02-19  2:20 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20140214043256.GA5145@thunk.org>
[not found] ` <yq1ioshobul.fsf@sermon.lab.mkp.net>
[not found] ` <20140215012901.GA28307@thunk.org>
[not found] ` <yq161odofis.fsf@sermon.lab.mkp.net>
2014-02-17 19:19 ` [PATCH RFC] block: use discard if possible in blkdev_issue_discard() Theodore Ts'o
2014-02-18 1:31 ` Martin K. Petersen
2014-02-18 2:17 ` Theodore Ts'o
2014-02-18 3:44 ` Martin K. Petersen
2014-02-18 5:47 ` Theodore Ts'o
2014-02-19 2:20 ` Martin K. Petersen