* LSF/MM Schedule and improving discard support @ 2016-04-07 15:39 Bart Van Assche 2016-04-07 15:51 ` James Bottomley 0 siblings, 1 reply; 11+ messages in thread From: Bart Van Assche @ 2016-04-07 15:39 UTC (permalink / raw) To: James Bottomley, Jens Axboe Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org Hello James, Some time ago I proposed to discuss how to improve discard support during the LSF/MM (http://thread.gmane.org/gmane.linux.scsi/110048). I would appreciate it if this would be added to the LSF/MM agenda since there has been no progress yet for the patch series I posted in December 2015. Thanks, Bart. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: LSF/MM Schedule and improving discard support 2016-04-07 15:39 LSF/MM Schedule and improving discard support Bart Van Assche @ 2016-04-07 15:51 ` James Bottomley 2016-04-13 15:57 ` Bart Van Assche 0 siblings, 1 reply; 11+ messages in thread From: James Bottomley @ 2016-04-07 15:51 UTC (permalink / raw) To: Bart Van Assche, Jens Axboe Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org On Thu, 2016-04-07 at 08:39 -0700, Bart Van Assche wrote: > Hello James, > > Some time ago I proposed to discuss how to improve discard support > during the LSF/MM (http://thread.gmane.org/gmane.linux.scsi/110048). > I would appreciate it if this would be added to the LSF/MM agenda > since there has been no progress yet for the patch series I posted in > December 2015. Well, adding a cc to the lsf@ list to interest the attendees might have been a good idea. The basic problem with this topic is that it didn't really garner any interest when you proposed it. It also really just looks like there's nothing to discuss: you just propose the unifying patch and people discuss and modify that and it either gets accepted or not depending on the level of the objections. James ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: LSF/MM Schedule and improving discard support 2016-04-07 15:51 ` James Bottomley @ 2016-04-13 15:57 ` Bart Van Assche 2016-04-13 16:21 ` Martin K. Petersen 0 siblings, 1 reply; 11+ messages in thread From: Bart Van Assche @ 2016-04-13 15:57 UTC (permalink / raw) To: James Bottomley, Bart Van Assche, Jens Axboe Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org, lsf@lists.linux-foundation.org (+ LSF mailing list) On 04/07/2016 08:51 AM, James Bottomley wrote: > On Thu, 2016-04-07 at 08:39 -0700, Bart Van Assche wrote: >> Some time ago I proposed to discuss how to improve discard support >> during the LSF/MM (http://thread.gmane.org/gmane.linux.scsi/110048). >> I would appreciate it if this would be added to the LSF/MM agenda >> since there has been no progress yet for the patch series I posted in >> December 2015. > > Well, adding a cc to the lsf@ list to interest the attendees might have > been a good idea. > > The basic problem with this topic is that it didn't really garner any > interest when you proposed it. It also really just looks like there's > nothing to discuss: you just propose the unifying patch and people > discuss and modify that and it either gets accepted or not depending on > the level of the objections. Hello James, There is something that should be discussed further, namely what the behavior of the BLKDISCARD and BLKSECDISCARD ioctls should be if the start and/or end sectors are not aligned on a discard boundary. Should such requests fail with an error code, should the non-aligned head and tail be zeroed or should the non-aligned parts be left unmodified? Thanks, Bart. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: LSF/MM Schedule and improving discard support 2016-04-13 15:57 ` Bart Van Assche @ 2016-04-13 16:21 ` Martin K. Petersen 2016-04-13 16:29 ` Bart Van Assche 0 siblings, 1 reply; 11+ messages in thread From: Martin K. Petersen @ 2016-04-13 16:21 UTC (permalink / raw) To: Bart Van Assche Cc: James Bottomley, Jens Axboe, linux-scsi@vger.kernel.org, linux-block@vger.kernel.org, lsf@lists.linux-foundation.org, Darrick J. Wong >>>>> "Bart" == Bart Van Assche <bart.vanassche@sandisk.com> writes: Bart, Bart> There is something that should be discussed further, namely what Bart> the behavior of the BLKDISCARD and BLKSECDISCARD ioctls should be Bart> if the start and/or end sectors are not aligned on a discard Bart> boundary. Should such requests fail with an error code, should the Bart> non-aligned head and tail be zeroed or should the non-aligned Bart> parts be left unmodified? I think the problem with your changes is that they mix upward and downward-facing behaviors. >From a filesystem/ioctl perspective, BLKDISCARD is a hint. We should not be rounding off or aligning anything. DM or the thin provisioned device may or may not ignore parts of the request. Their business. And since discard is there to alleviate write amplification, it absolutely should never cause any media writing to happen. BLKZEROOUT, on the other hand, needs to provide hard guarantees. It is up to the filesystem or ioctl caller to indicate whether allocation or deallocation is preferred in this case. And then that should get turned into a mix of WRITE, WRITE SAME or DISCARD commands based on alignment, granularity and discard_zeroes_data reported by the device. Darrick has been trying to clean all this up and I would like to see the explicit distinction between allocate(anchor), deallocate(discard/unmap) and zeroout (write same, unmap) in the upward-facing interfaces. And then have those interfaces use WRITE SAME, DISCARD and ANCHOR to implement their functionality based on the parameters reported by the underlying device. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: LSF/MM Schedule and improving discard support 2016-04-13 16:21 ` Martin K. Petersen @ 2016-04-13 16:29 ` Bart Van Assche 2016-04-13 16:43 ` [Lsf] " Martin K. Petersen 2016-04-13 16:51 ` James Bottomley 0 siblings, 2 replies; 11+ messages in thread From: Bart Van Assche @ 2016-04-13 16:29 UTC (permalink / raw) To: Martin K. Petersen Cc: James Bottomley, Jens Axboe, linux-scsi@vger.kernel.org, linux-block@vger.kernel.org, lsf@lists.linux-foundation.org, Darrick J. Wong On 04/13/2016 09:21 AM, Martin K. Petersen wrote: > From a filesystem/ioctl perspective, BLKDISCARD is a hint. We should not be > rounding off or aligning anything. Hello Martin, Today if a BLKDISCARD ioctl passes a non-aligned start and/or end sector to the kernel then the block layer will submit invalid (non-aligned) REQ_DISCARD requests to the block driver the ioctl applies to. This is not acceptable. Does the above mean that you are proposing to fail such BLKDISCARD ioctls with an error code? Thanks, Bart. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Lsf] LSF/MM Schedule and improving discard support 2016-04-13 16:29 ` Bart Van Assche @ 2016-04-13 16:43 ` Martin K. Petersen 2016-04-13 16:57 ` Bart Van Assche 2016-04-13 16:51 ` James Bottomley 1 sibling, 1 reply; 11+ messages in thread From: Martin K. Petersen @ 2016-04-13 16:43 UTC (permalink / raw) To: Bart Van Assche Cc: Martin K. Petersen, linux-block@vger.kernel.org, James Bottomley, linux-scsi@vger.kernel.org, Darrick J. Wong, Jens Axboe, lsf@lists.linux-foundation.org >>>>> "Bart" == Bart Van Assche <bart.vanassche@sandisk.com> writes: Bart> Today if a BLKDISCARD ioctl passes a non-aligned start and/or end Bart> sector to the kernel then the block layer will submit invalid Bart> (non-aligned) REQ_DISCARD requests to the block driver the ioctl Bart> applies to. I do not know what you mean by "invalid". A device that implements UNMAP can freely ignore any parts of a request block range that are not aligned to its internal allocation units. >From SBC: "An unmap request with a number of logical blocks that is not a multiple of this value (OPTIMAL UNMAP GRANULARITY) may result in unmap operations on fewer LBAs than requested." and "An unmap request with a starting LBA that is not optimal may result in unmap operations on fewer LBAs than requested." The same is true for SATA which has no way report granularity and alignment at all. Nowhere does it say that a request that is not an aligned multiple of any reported granularity should be considered "invalid" or rejected by the storage. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Lsf] LSF/MM Schedule and improving discard support 2016-04-13 16:43 ` [Lsf] " Martin K. Petersen @ 2016-04-13 16:57 ` Bart Van Assche 2016-04-13 17:13 ` Martin K. Petersen 0 siblings, 1 reply; 11+ messages in thread From: Bart Van Assche @ 2016-04-13 16:57 UTC (permalink / raw) To: Martin K. Petersen Cc: linux-block@vger.kernel.org, James Bottomley, linux-scsi@vger.kernel.org, Darrick J. Wong, Jens Axboe, lsf@lists.linux-foundation.org On 04/13/2016 09:43 AM, Martin K. Petersen wrote: >>>>>> "Bart" == Bart Van Assche <bart.vanassche@sandisk.com> writes: > > Bart> Today if a BLKDISCARD ioctl passes a non-aligned start and/or end > Bart> sector to the kernel then the block layer will submit invalid > Bart> (non-aligned) REQ_DISCARD requests to the block driver the ioctl > Bart> applies to. > > I do not know what you mean by "invalid". > > A device that implements UNMAP can freely ignore any parts of a request > block range that are not aligned to its internal allocation units. > > From SBC: "An unmap request with a number of logical blocks that is not > a multiple of this value (OPTIMAL UNMAP GRANULARITY) may result in unmap > operations on fewer LBAs than requested." and "An unmap request with a > starting LBA that is not optimal may result in unmap operations on fewer > LBAs than requested." > > The same is true for SATA which has no way report granularity and > alignment at all. > > Nowhere does it say that a request that is not an aligned multiple of > any reported granularity should be considered "invalid" or rejected by > the storage. Hello Martin, Sorry if I wasn't clear enough. Regarding the SCSI UNMAP command, consider e.g. the following code from drivers/scsi/sd.c: static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd) { [ ... ] sector >>= ilog2(sdp->sector_size) - 9; nr_sectors >>= ilog2(sdp->sector_size) - 9; [ ... ] case SD_LBP_UNMAP: [ ... ] cmd->cmd_len = 10; cmd->cmnd[0] = UNMAP; cmd->cmnd[8] = 24; put_unaligned_be16(6 + 16, &buf[0]); put_unaligned_be16(16, &buf[2]); put_unaligned_be64(sector, &buf[8]); put_unaligned_be32(nr_sectors, &buf[16]); [ ... ] } For sdp->sector_size > 512, should we allow the block layer to pass sector and nr_sector values to this function that are not a multiple of sdp->sector_size? And if so, how should this code behave and if sector and/or nr_sectors is not a multiple of sdp->sector_size? As one can see the above code rounds down sector and nr_sectors while converting from sectors to logical blocks. This means that if a non-aligned BLKDISCARD is submitted from user space that one or more sectors *before* the start of the range specified in the ioctl will be discarded. Isn't that a bug? Bart. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Lsf] LSF/MM Schedule and improving discard support 2016-04-13 16:57 ` Bart Van Assche @ 2016-04-13 17:13 ` Martin K. Petersen 0 siblings, 0 replies; 11+ messages in thread From: Martin K. Petersen @ 2016-04-13 17:13 UTC (permalink / raw) To: Bart Van Assche Cc: Martin K. Petersen, linux-block@vger.kernel.org, James Bottomley, linux-scsi@vger.kernel.org, Darrick J. Wong, Jens Axboe, lsf@lists.linux-foundation.org >>>>> "Bart" == Bart Van Assche <bart.vanassche@sandisk.com> writes: Bart> And if so, how should this code behave and if sector and/or Bart> nr_sectors is not a multiple of sdp->sector_size? As one can see Bart> the above code rounds down sector and nr_sectors while converting Bart> from sectors to logical blocks. This means that if a non-aligned Bart> BLKDISCARD is submitted from user space that one or more sectors Bart> *before* the start of the range specified in the ioctl will be Bart> discarded. Isn't that a bug? Yes. But Darrick already addressed that, I believe. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Lsf] LSF/MM Schedule and improving discard support 2016-04-13 16:29 ` Bart Van Assche 2016-04-13 16:43 ` [Lsf] " Martin K. Petersen @ 2016-04-13 16:51 ` James Bottomley 2016-04-13 17:30 ` Darrick J. Wong 1 sibling, 1 reply; 11+ messages in thread From: James Bottomley @ 2016-04-13 16:51 UTC (permalink / raw) To: Bart Van Assche, Martin K. Petersen Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org, Darrick J. Wong, Jens Axboe, lsf@lists.linux-foundation.org On Wed, 2016-04-13 at 09:29 -0700, Bart Van Assche wrote: > On 04/13/2016 09:21 AM, Martin K. Petersen wrote: > > From a filesystem/ioctl perspective, BLKDISCARD is a hint. We > > should not be > > rounding off or aligning anything. > > Hello Martin, > > Today if a BLKDISCARD ioctl passes a non-aligned start and/or end > sector to the kernel then the block layer will submit invalid (non > -aligned) REQ_DISCARD requests to the block driver the ioctl applies > to. This is not acceptable. Does the above mean that you are > proposing to fail such BLKDISCARD ioctls with an error code? The answer would be of course not. discard is a hint so malformed discard gets ignored by the device and success is returned because you can't oblige devices to obey hints (that's why they're called hints). However, the problem of needing a mandatory discard for scrubbing blocks is part of the fallocate discussion, I think. James ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Lsf] LSF/MM Schedule and improving discard support 2016-04-13 16:51 ` James Bottomley @ 2016-04-13 17:30 ` Darrick J. Wong 2016-04-13 22:04 ` Douglas Gilbert 0 siblings, 1 reply; 11+ messages in thread From: Darrick J. Wong @ 2016-04-13 17:30 UTC (permalink / raw) To: James Bottomley Cc: Bart Van Assche, Martin K. Petersen, linux-block@vger.kernel.org, linux-scsi@vger.kernel.org, Jens Axboe, lsf@lists.linux-foundation.org On Wed, Apr 13, 2016 at 09:51:04AM -0700, James Bottomley wrote: > On Wed, 2016-04-13 at 09:29 -0700, Bart Van Assche wrote: > > On 04/13/2016 09:21 AM, Martin K. Petersen wrote: > > > From a filesystem/ioctl perspective, BLKDISCARD is a hint. We > > > should not be > > > rounding off or aligning anything. > > > > Hello Martin, > > > > Today if a BLKDISCARD ioctl passes a non-aligned start and/or end > > sector to the kernel then the block layer will submit invalid (non > > -aligned) REQ_DISCARD requests to the block driver the ioctl applies > > to. This is not acceptable. Does the above mean that you are > > proposing to fail such BLKDISCARD ioctls with an error code? > > The answer would be of course not. discard is a hint so malformed > discard gets ignored by the device and success is returned because you > can't oblige devices to obey hints (that's why they're called hints). Agree. For blockdev FALLOC_FL_PUNCH_HOLE I think we can simply check for logical block size ("lbs") alignment and then pass the request to the device with the understanding that it can do as it pleases. We asked the device to try to deallocate blocks, and perhaps it cannot. Just to be clear, this only applies to zeroing discard; the "discard and who knows what you can now read back" thing that nobody likes has been temporarily wired up to FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE. :) > However, the problem of needing a mandatory discard for scrubbing > blocks is part of the fallocate discussion, I think. The third fallocate mode (FALLOC_FL_ZERO_RANGE) doesn't fit with the phrase "mandatory discard for scrubbing blocks", though if one removed "discard" from that phrase then it would. The only thing that ZERO_RANGE guarantees is that subsequent reads return zeroes. XFS punches the entire range and reallocates it with unwritten extents; ext4 fills the holes in the range with unwritten extents and converts real extents to unwritten. Both also write zeroes to any part of the range that doesn't align to an FS block. Yes, I think there are several questions to resolve here for mandatory zeroing with FALLOC_FL_ZERO_RANGE (summarizing the issues I've come up with so far): a) Should blockdev fallocate accept byte-granular offset/length arguments, even if it has to use the page cache to write zeroes to the device? This is what file fallocate does today. b) If blockdev fallocate does impose alignment requirements, should it return EINVAL to a request that isn't aligned to the logical block size? c) If a device really really prefers that its requests are aligned to min_io_size (which can be much larger than the logical block size), should it reject requests that aren't aligned to min_io? Or perhaps it should take care of the alignment problems on its own somehow? For allocate mode (the thing Mike Snitzer brought up in another thread yesterday), the alignment problems are much easier because we're allowed to round the start down and the end up to fit whatever alignment we require. Should we promote this to a storage track session at LSF next week? --D > > James > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Lsf] LSF/MM Schedule and improving discard support 2016-04-13 17:30 ` Darrick J. Wong @ 2016-04-13 22:04 ` Douglas Gilbert 0 siblings, 0 replies; 11+ messages in thread From: Douglas Gilbert @ 2016-04-13 22:04 UTC (permalink / raw) To: Darrick J. Wong, James Bottomley Cc: Bart Van Assche, Martin K. Petersen, linux-block@vger.kernel.org, linux-scsi@vger.kernel.org, Jens Axboe, lsf@lists.linux-foundation.org On 16-04-13 01:30 PM, Darrick J. Wong wrote: > On Wed, Apr 13, 2016 at 09:51:04AM -0700, James Bottomley wrote: >> On Wed, 2016-04-13 at 09:29 -0700, Bart Van Assche wrote: >>> On 04/13/2016 09:21 AM, Martin K. Petersen wrote: >>>> From a filesystem/ioctl perspective, BLKDISCARD is a hint. We >>>> should not be >>>> rounding off or aligning anything. >>> >>> Hello Martin, >>> >>> Today if a BLKDISCARD ioctl passes a non-aligned start and/or end >>> sector to the kernel then the block layer will submit invalid (non >>> -aligned) REQ_DISCARD requests to the block driver the ioctl applies >>> to. This is not acceptable. Does the above mean that you are >>> proposing to fail such BLKDISCARD ioctls with an error code? >> >> The answer would be of course not. discard is a hint so malformed >> discard gets ignored by the device and success is returned because you >> can't oblige devices to obey hints (that's why they're called hints). > > Agree. For blockdev FALLOC_FL_PUNCH_HOLE I think we can simply check for > logical block size ("lbs") alignment and then pass the request to the > device with the understanding that it can do as it pleases. We asked the > device to try to deallocate blocks, and perhaps it cannot. > > Just to be clear, this only applies to zeroing discard; the "discard and who > knows what you can now read back" thing that nobody likes has been temporarily > wired up to FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE. :) In May last year, T10 added another wrinkle when they expanded the LBPRZ field from 1 to 3 bits (in the LBP VPD page but _not_ in the READ CAPACITY(16) response). The expansion is to allow a new response when an unmapped logical block is read: return a "provisioning initialization pattern". That new piece of jargon is defined as a "non-zero pattern that is the length of one logical block". It seems that the "provisioning initialization pattern" is the same for every unmapped logical block and is chosen by the manufacturer. It can be read with the new REPORT PROVISIONING INITIALIZATION PATTERN command. If LBPRZ=2 and FORMAT UNIT is called with an "initialization pattern" equal to the disk's "provisioning initialization pattern" then all logical blocks are unmapped. Clear? Doug Gilbert >> However, the problem of needing a mandatory discard for scrubbing >> blocks is part of the fallocate discussion, I think. > > The third fallocate mode (FALLOC_FL_ZERO_RANGE) doesn't fit with the phrase > "mandatory discard for scrubbing blocks", though if one removed "discard" from > that phrase then it would. The only thing that ZERO_RANGE guarantees is that > subsequent reads return zeroes. XFS punches the entire range and reallocates > it with unwritten extents; ext4 fills the holes in the range with unwritten > extents and converts real extents to unwritten. Both also write zeroes to any > part of the range that doesn't align to an FS block. > > Yes, I think there are several questions to resolve here for mandatory zeroing > with FALLOC_FL_ZERO_RANGE (summarizing the issues I've come up with so far): > > a) Should blockdev fallocate accept byte-granular offset/length arguments, even > if it has to use the page cache to write zeroes to the device? This is what > file fallocate does today. > > b) If blockdev fallocate does impose alignment requirements, should it return > EINVAL to a request that isn't aligned to the logical block size? > > c) If a device really really prefers that its requests are aligned to > min_io_size (which can be much larger than the logical block size), should it > reject requests that aren't aligned to min_io? Or perhaps it should take care > of the alignment problems on its own somehow? > > For allocate mode (the thing Mike Snitzer brought up in another thread > yesterday), the alignment problems are much easier because we're allowed to > round the start down and the end up to fit whatever alignment we require. > > Should we promote this to a storage track session at LSF next week? > > --D > >> >> James >> >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2016-04-13 22:04 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-04-07 15:39 LSF/MM Schedule and improving discard support Bart Van Assche 2016-04-07 15:51 ` James Bottomley 2016-04-13 15:57 ` Bart Van Assche 2016-04-13 16:21 ` Martin K. Petersen 2016-04-13 16:29 ` Bart Van Assche 2016-04-13 16:43 ` [Lsf] " Martin K. Petersen 2016-04-13 16:57 ` Bart Van Assche 2016-04-13 17:13 ` Martin K. Petersen 2016-04-13 16:51 ` James Bottomley 2016-04-13 17:30 ` Darrick J. Wong 2016-04-13 22:04 ` Douglas Gilbert
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).