[PATCH 0/2] scsi discard support

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] scsi discard support
@ 2009-10-29 15:08 Christoph Hellwig
  2009-10-29 15:08 ` [PATCH 1/2] block: add support for discard limits Christoph Hellwig
  2009-10-29 15:08 ` [PATCH 2/2] sd: add support for WRITE SAME (16) with unmap bit Christoph Hellwig
  0 siblings, 2 replies; 14+ messages in thread
From: Christoph Hellwig @ 2009-10-29 15:08 UTC (permalink / raw)
  To: linux-scsi; +Cc: mkp, axboe, matthew

Time for proper scsi discard support.  The first patch is for the block layer
to implement the discard granularity limits properly.  The second is a SPC3r20
compliant implementation sending WRITE SAME with the unmap bit from sd.

ATA support will come later now that I finally obtained a TRIM capable OCZ SSD.

So far only tested using Martin's scsi_debug TP support and my experimental
qemu TP support, but I'll be able to test it on some real arrays soon.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] block: add support for discard limits
  2009-10-29 15:08 [PATCH 0/2] scsi discard support Christoph Hellwig
@ 2009-10-29 15:08 ` Christoph Hellwig
  2009-10-29 19:06   ` Jens Axboe
  2009-10-30  3:19   ` Martin K. Petersen
  2009-10-29 15:08 ` [PATCH 2/2] sd: add support for WRITE SAME (16) with unmap bit Christoph Hellwig
  1 sibling, 2 replies; 14+ messages in thread
From: Christoph Hellwig @ 2009-10-29 15:08 UTC (permalink / raw)
  To: linux-scsi; +Cc: mkp, axboe, matthew

[-- Attachment #1: discard-add-limits --]
[-- Type: text/plain, Size: 6999 bytes --]

To properly support discard on SCSI Arrays we need to take the discard
granularity and the first aligned discard LBA into account.  This patch
adds block limits for both of them, and trims down dicard requests to fit
into these limits in blkdev_issue_discard.  We also make sure the alignment
offset is properly adjust for partitions and add sysfs files to expose
the limits to userspaced.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/block/blk-sysfs.c
===================================================================
--- linux-2.6.orig/block/blk-sysfs.c	2009-10-29 15:46:06.567004210 +0100
+++ linux-2.6/block/blk-sysfs.c	2009-10-29 15:47:33.660022475 +0100
@@ -126,6 +126,18 @@ static ssize_t queue_io_opt_show(struct 
 	return queue_var_show(queue_io_opt(q), page);
 }
 
+static ssize_t queue_discard_granularity_show(struct request_queue *q,
+		char *page)
+{
+	return queue_var_show(q->limits.discard_granularity, page);
+}
+
+static ssize_t queue_discard_alignment_show(struct request_queue *q,
+		char *page)
+{
+	return queue_var_show(q->limits.discard_alignment, page);
+}
+
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -293,6 +305,16 @@ static struct queue_sysfs_entry queue_io
 	.show = queue_io_opt_show,
 };
 
+static struct queue_sysfs_entry queue_discard_granularity_entry = {
+	.attr = {.name = "discard_granularity", .mode = S_IRUGO },
+	.show = queue_discard_granularity_show,
+};
+
+static struct queue_sysfs_entry queue_discard_alignment_entry = {
+	.attr = {.name = "discard_alignment", .mode = S_IRUGO },
+	.show = queue_discard_alignment_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_nonrot_show,
@@ -328,6 +350,8 @@ static struct attribute *default_attrs[]
 	&queue_physical_block_size_entry.attr,
 	&queue_io_min_entry.attr,
 	&queue_io_opt_entry.attr,
+	&queue_discard_granularity_entry.attr,
+	&queue_discard_alignment_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2009-10-29 15:46:06.551004197 +0100
+++ linux-2.6/include/linux/blkdev.h	2009-10-29 15:47:33.661024838 +0100
@@ -312,6 +312,8 @@ struct queue_limits {
 	unsigned int		io_min;
 	unsigned int		io_opt;
 	unsigned int		max_discard_sectors;
+	unsigned int		discard_granularity;
+	unsigned int		discard_alignment;
 
 	unsigned short		logical_block_size;
 	unsigned short		max_hw_segments;
Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2009-10-29 15:46:06.573022714 +0100
+++ linux-2.6/block/blk-barrier.c	2009-10-29 15:47:33.663024324 +0100
@@ -355,6 +355,20 @@ static void blkdev_discard_end_io(struct
 	bio_put(bio);
 }
 
+/*
+ * Many implementation of block dicard are very limited.  Most implementations
+ * enforce a granularity limit for the discard requests and we have to trim
+ * down the request to match it.  In addition to that for some implementation
+ * the start of this granularity is misaligned vs block 0, so we need to take
+ * that into account aswell.
+ */
+static sector_t discard_offset(sector_t sector, unsigned int granularity,
+		unsigned int alignment)
+{
+	return ((sector - alignment + granularity - 1) &
+		~(granularity - 1)) - (sector - alignment);
+}
+
 /**
  * blkdev_issue_discard - queue a discard
  * @bdev:	blockdev to issue discard for
@@ -371,10 +385,12 @@ int blkdev_issue_discard(struct block_de
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *q = bdev_get_queue(bdev);
+	struct hd_struct *part = bdev->bd_part;
 	int type = flags & DISCARD_FL_BARRIER ?
 		DISCARD_BARRIER : DISCARD_NOBARRIER;
 	struct bio *bio;
 	struct page *page;
+	sector_t offset;
 	int ret = 0;
 
 	if (!q)
@@ -383,6 +399,18 @@ int blkdev_issue_discard(struct block_de
 	if (!blk_queue_discard(q))
 		return -EOPNOTSUPP;
 
+	/*
+	 * We need to respect the discard granularity that is supported by the
+	 * device.  Round up the start block to the nearest multiple and round
+	 * down the length to the nearest multiple of that granularity.
+	 */
+	offset = discard_offset(sector, q->limits.discard_granularity,
+				part->discard_alignment);
+
+	sector += offset;
+	nr_sects = (nr_sects - offset - 1) &
+		    ~(q->limits.discard_granularity - 1);
+
 	while (nr_sects && !ret) {
 		unsigned int sector_size = q->limits.logical_block_size;
 		unsigned int max_discard_sectors =
Index: linux-2.6/fs/partitions/check.c
===================================================================
--- linux-2.6.orig/fs/partitions/check.c	2009-10-29 15:46:06.580003701 +0100
+++ linux-2.6/fs/partitions/check.c	2009-10-29 15:47:33.667023647 +0100
@@ -226,6 +226,13 @@ ssize_t part_alignment_offset_show(struc
 	return sprintf(buf, "%llu\n", (unsigned long long)p->alignment_offset);
 }
 
+ssize_t part_discard_alignment_show(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	return sprintf(buf, "%u\n", p->discard_alignment);
+}
+
 ssize_t part_stat_show(struct device *dev,
 		       struct device_attribute *attr, char *buf)
 {
@@ -288,6 +295,8 @@ static DEVICE_ATTR(partition, S_IRUGO, p
 static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
 static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
 static DEVICE_ATTR(alignment_offset, S_IRUGO, part_alignment_offset_show, NULL);
+static DEVICE_ATTR(discard_alignment, S_IRUGO, part_discard_alignment_show,
+		   NULL);
 static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
@@ -300,6 +309,7 @@ static struct attribute *part_attrs[] = 
 	&dev_attr_start.attr,
 	&dev_attr_size.attr,
 	&dev_attr_alignment_offset.attr,
+	&dev_attr_discard_alignment.attr,
 	&dev_attr_stat.attr,
 	&dev_attr_inflight.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
@@ -403,6 +413,8 @@ struct hd_struct *add_partition(struct g
 
 	p->start_sect = start;
 	p->alignment_offset = queue_sector_alignment_offset(disk->queue, start);
+	p->discard_alignment = (disk->queue->limits.discard_alignment + start) &
+				(disk->queue->limits.discard_granularity - 1);
 	p->nr_sects = len;
 	p->partno = partno;
 	p->policy = get_disk_ro(disk);
Index: linux-2.6/include/linux/genhd.h
===================================================================
--- linux-2.6.orig/include/linux/genhd.h	2009-10-29 15:46:06.557028148 +0100
+++ linux-2.6/include/linux/genhd.h	2009-10-29 15:47:33.672256087 +0100
@@ -91,6 +91,7 @@ struct hd_struct {
 	sector_t start_sect;
 	sector_t nr_sects;
 	sector_t alignment_offset;
+	unsigned int discard_alignment;
 	struct device __dev;
 	struct kobject *holder_dir;
 	int policy, partno;


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/2] sd: add support for WRITE SAME (16) with unmap bit
  2009-10-29 15:08 [PATCH 0/2] scsi discard support Christoph Hellwig
  2009-10-29 15:08 ` [PATCH 1/2] block: add support for discard limits Christoph Hellwig
@ 2009-10-29 15:08 ` Christoph Hellwig
  1 sibling, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2009-10-29 15:08 UTC (permalink / raw)
  To: linux-scsi; +Cc: mkp, axboe, matthew

[-- Attachment #1: discard-add-scsi-write-same-support --]
[-- Type: text/plain, Size: 3973 bytes --]

Send  WRITE SAME request with the unmap bit set to the device if it advertises
thin provisioning support. 


Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/drivers/scsi/sd.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sd.c	2009-10-29 15:46:55.754006357 +0100
+++ linux-2.6/drivers/scsi/sd.c	2009-10-29 15:49:05.825279362 +0100
@@ -398,6 +398,35 @@ static void sd_prot_op(struct scsi_cmnd 
 	scsi_set_prot_type(scmd, dif);
 }
 
+static void sd_prepare_discard(struct request_queue *q, struct request *rq)
+{
+	struct bio *bio = rq->bio;
+
+	rq->cmd_type = REQ_TYPE_BLOCK_PC;
+	rq->timeout = SD_TIMEOUT;
+	rq->cmd[0] = WRITE_SAME_16;
+	rq->cmd[1] = 0x8; /* UNMAP bit */
+	rq->cmd[2] = sizeof(bio->bi_sector) > 4 ?
+			(unsigned char) (bio->bi_sector >> 56) & 0xff : 0;
+	rq->cmd[3] = sizeof(bio->bi_sector) > 4 ?
+			(unsigned char) (bio->bi_sector >> 48) & 0xff : 0;
+	rq->cmd[4] = sizeof(bio->bi_sector) > 4 ?
+			(unsigned char) (bio->bi_sector >> 40) & 0xff : 0;
+	rq->cmd[5] = sizeof(bio->bi_sector) > 4 ?
+			(unsigned char) (bio->bi_sector >> 32) & 0xff : 0;
+	rq->cmd[6] = (unsigned char) (bio->bi_sector >> 24) & 0xff;
+	rq->cmd[7] = (unsigned char) (bio->bi_sector >> 16) & 0xff;
+	rq->cmd[8] = (unsigned char) (bio->bi_sector >> 8) & 0xff;
+	rq->cmd[9] = (unsigned char) bio->bi_sector & 0xff;
+	rq->cmd[10] = (unsigned char) (bio_sectors(bio) >> 24) & 0xff;
+	rq->cmd[11] = (unsigned char) (bio_sectors(bio) >> 16) & 0xff;
+	rq->cmd[12] = (unsigned char) (bio_sectors(bio) >> 8) & 0xff;
+	rq->cmd[13] = (unsigned char) bio_sectors(bio) & 0xff;
+	rq->cmd[14] = 0;
+	rq->cmd[15] = 0;
+	rq->cmd_len = 16;
+}
+
 /**
  *	sd_init_command - build a scsi (read or write) command from
  *	information in the request structure.
@@ -418,6 +447,13 @@ static int sd_prep_fn(struct request_que
 	int ret, host_dif;
 	unsigned char protect;
 
+	/*
+	 * Discard request come in as REQ_TYPE_FS but we turn them into
+	 * block PC requests to make life easier.
+	 */
+	if (blk_discard_rq(rq))
+		sd_prepare_discard(q, rq);
+
 	if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
 		ret = scsi_setup_blk_pc_cmnd(sdp, rq);
 		goto out;
@@ -425,6 +461,7 @@ static int sd_prep_fn(struct request_que
 		ret = BLKPREP_KILL;
 		goto out;
 	}
+
 	ret = scsi_setup_fs_cmnd(sdp, rq);
 	if (ret != BLKPREP_OK)
 		goto out;
@@ -1432,6 +1469,9 @@ static int read_capacity_16(struct scsi_
 		sd_printk(KERN_NOTICE, sdkp,
 			  "physical block alignment offset: %u\n", alignment);
 
+	if (buffer[14] & 0x80)
+		sdkp->thin_provisioning = 1;
+
 	sdkp->capacity = lba + 1;
 	return sector_size;
 }
@@ -1877,6 +1917,17 @@ static void sd_read_block_limits(struct 
 	blk_queue_io_opt(sdkp->disk->queue,
 			 get_unaligned_be32(&buffer[12]) * sector_sz);
 
+	if (sdkp->thin_provisioning) {
+		sdkp->disk->queue->limits.discard_granularity =
+			get_unaligned_be32(&buffer[28]);
+
+		if (buffer[32] & 0x80) {
+			buffer[32] &= ~0x80;
+			sdkp->disk->queue->limits.discard_alignment =
+				get_unaligned_be32(&buffer[32]);
+		}
+	}
+
 	kfree(buffer);
 }
 
@@ -1979,6 +2030,11 @@ static int sd_revalidate_disk(struct gen
 
 	blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush);
 
+	if (sdkp->thin_provisioning) {
+		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, sdkp->disk->queue);
+		blk_queue_max_discard_sectors(sdp->request_queue, UINT_MAX);
+	}
+
 	set_capacity(disk, sdkp->capacity);
 	kfree(buffer);
 
Index: linux-2.6/drivers/scsi/sd.h
===================================================================
--- linux-2.6.orig/drivers/scsi/sd.h	2009-10-29 15:49:00.890254050 +0100
+++ linux-2.6/drivers/scsi/sd.h	2009-10-29 15:49:05.826278860 +0100
@@ -60,6 +60,7 @@ struct scsi_disk {
 	unsigned	RCD : 1;	/* state of disk RCD bit, unused */
 	unsigned	DPOFUA : 1;	/* state of disk DPOFUA bit */
 	unsigned	first_scan : 1;
+	unsigned	thin_provisioning : 1;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)
 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-10-29 15:08 ` [PATCH 1/2] block: add support for discard limits Christoph Hellwig
@ 2009-10-29 19:06   ` Jens Axboe
  2009-10-30  5:03     ` Christoph Hellwig
  2009-10-30  3:19   ` Martin K. Petersen
  1 sibling, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2009-10-29 19:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-scsi, mkp, matthew

On Thu, Oct 29 2009, Christoph Hellwig wrote:
> To properly support discard on SCSI Arrays we need to take the discard
> granularity and the first aligned discard LBA into account.  This patch
> adds block limits for both of them, and trims down dicard requests to fit
> into these limits in blkdev_issue_discard.  We also make sure the alignment
> offset is properly adjust for partitions and add sysfs files to expose
> the limits to userspaced.

Except for a few typos, this looks good to me. I'll add it for 2.6.33.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-10-29 15:08 ` [PATCH 1/2] block: add support for discard limits Christoph Hellwig
  2009-10-29 19:06   ` Jens Axboe
@ 2009-10-30  3:19   ` Martin K. Petersen
  2009-10-30  5:05     ` Christoph Hellwig
  1 sibling, 1 reply; 14+ messages in thread
From: Martin K. Petersen @ 2009-10-30  3:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-scsi, axboe, matthew

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

Christoph> To properly support discard on SCSI Arrays we need to take
Christoph> the discard granularity and the first aligned discard LBA
Christoph> into account.  This patch adds block limits for both of them,
Christoph> and trims down dicard requests to fit into these limits in
Christoph> blkdev_issue_discard.  We also make sure the alignment offset
Christoph> is properly adjust for partitions and add sysfs files to
Christoph> expose the limits to userspaced.

Do we know for sure that this is really needed?

The reason I'm asking is that in my first attempt I also exposed all
this up the stack.  However, it sucks to keep this in sync and would
require the same level of topology stacking magic as the rest of the
queue limits.  My brain melted at the thought of LVM/MD volumes striped
or mirrored between devices with different unmap granularity and
alignment.

So I talked to a variety of array folks.  And regardless of whether they
implement WRITE SAME or UNMAP they all appear to handle misaligned
requests just fine.  I.e. there may be some mapped residue at the
beginning and end of an unmap request but any full allocation units in
between will be correctly unmapped.  They all told me the alignment and
granularity parameters were purely informational and intended as hints
for laying out data.  Not as hard limits for issuing I/O.

Consequently I gutted all the alignment stuff from my thin provisioning
tree.  I also changed mt scsi_debug code to unmap allocations similar to
the way the arrays work (my first attempt completely ignored misaligned
requests).

ACS-2 doesn't currently have a notion of unmap granularity and all array
vendors I have talked to appear to happily process any unmap request we
throw at them.  So I'm leaning towards keeping things simple and just
sending things down verbatim...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-10-29 19:06   ` Jens Axboe
@ 2009-10-30  5:03     ` Christoph Hellwig
  2009-11-02 12:39       ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2009-10-30  5:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-scsi, mkp, matthew

On Thu, Oct 29, 2009 at 08:06:59PM +0100, Jens Axboe wrote:
> On Thu, Oct 29 2009, Christoph Hellwig wrote:
> > To properly support discard on SCSI Arrays we need to take the discard
> > granularity and the first aligned discard LBA into account.  This patch
> > adds block limits for both of them, and trims down dicard requests to fit
> > into these limits in blkdev_issue_discard.  We also make sure the alignment
> > offset is properly adjust for partitions and add sysfs files to expose
> > the limits to userspaced.
> 
> Except for a few typos, this looks good to me. I'll add it for 2.6.33.

It would be good if we could queue up this (or mkp's version) with the
sd bits through the scsi tree - otherwise getting testing for the whole
stack will be a pain.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-10-30  3:19   ` Martin K. Petersen
@ 2009-10-30  5:05     ` Christoph Hellwig
  2009-11-02 13:29       ` Martin K. Petersen
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2009-10-30  5:05 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Christoph Hellwig, linux-scsi, axboe, matthew

On Thu, Oct 29, 2009 at 11:19:43PM -0400, Martin K. Petersen wrote:
> ACS-2 doesn't currently have a notion of unmap granularity and all array
> vendors I have talked to appear to happily process any unmap request we
> throw at them.  So I'm leaning towards keeping things simple and just
> sending things down verbatim...

We have all the granularity and alignment information avilable and
keeping track of it is simple enough.  Yes md/dm will need to update
the first aligned unmap sector, but they'll also need to update the
first aligned LBA for I/O purposes and adding more more is simple
enough.

Given that these bits are in the standard vendors will use them sooner
or later and I'd rather be prepared if it's simple enough.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-10-30  5:03     ` Christoph Hellwig
@ 2009-11-02 12:39       ` Jens Axboe
  0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2009-11-02 12:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-scsi, mkp, matthew

On Fri, Oct 30 2009, Christoph Hellwig wrote:
> On Thu, Oct 29, 2009 at 08:06:59PM +0100, Jens Axboe wrote:
> > On Thu, Oct 29 2009, Christoph Hellwig wrote:
> > > To properly support discard on SCSI Arrays we need to take the discard
> > > granularity and the first aligned discard LBA into account.  This patch
> > > adds block limits for both of them, and trims down dicard requests to fit
> > > into these limits in blkdev_issue_discard.  We also make sure the alignment
> > > offset is properly adjust for partitions and add sysfs files to expose
> > > the limits to userspaced.
> > 
> > Except for a few typos, this looks good to me. I'll add it for 2.6.33.
> 
> It would be good if we could queue up this (or mkp's version) with the
> sd bits through the scsi tree - otherwise getting testing for the whole
> stack will be a pain.

Sure, fine with me. Feel free to add my acked-by.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-10-30  5:05     ` Christoph Hellwig
@ 2009-11-02 13:29       ` Martin K. Petersen
  2009-11-02 15:39         ` James Bottomley
                           ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Martin K. Petersen @ 2009-11-02 13:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Martin K. Petersen, linux-scsi, axboe, matthew

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

>> ACS-2 doesn't currently have a notion of unmap granularity and all
>> array vendors I have talked to appear to happily process any unmap
>> request we throw at them.  So I'm leaning towards keeping things
>> simple and just sending things down verbatim...

Christoph> We have all the granularity and alignment information
Christoph> avilable and keeping track of it is simple enough.

That's not the issue.

Nothing prevents you from submitting a misaligned read/write request to
the array.  If you do, the array may be slower in servicing the request.
But the entire request will still be serviced.

Similarly, nothing prevents you from submitting a misaligned unmap
request.  If you do, the array will just ignore the portions that do not
constitute entire allocation units.  Your code is taking what is a hint
and turning it into a hard limit.  Note that it's called OPTIMAL UNMAP
GRANULARITY, not REQUIRED UNMAP GRANULARITY.

Every vendor I have talked to have asked us to always unmap the *entire*
LBA range we're interested in freeing.  No exceptions.

We don't throw away the beginning/end of a read/write request because
it's not properly aligned either, do we?

Christoph> Yes md/dm will need to update the first aligned unmap sector,
Christoph> but they'll also need to update the first aligned LBA for I/O
Christoph> purposes and adding more more is simple enough.

But with md and dm you can have devices with different unmap alignment
and granularity.  With read/write alignment we can throw our hands in
the air and say that things will be slow.  But we'll still submit all
I/O.

With heterogeneous md and dm devices there may not be a meaningful value
to put in the discard granularity field.  Then what?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-11-02 13:29       ` Martin K. Petersen
@ 2009-11-02 15:39         ` James Bottomley
  2009-11-02 18:16           ` Martin K. Petersen
  2009-11-02 16:02         ` Boaz Harrosh
  2009-11-03 15:16         ` Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread
From: James Bottomley @ 2009-11-02 15:39 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Christoph Hellwig, linux-scsi, axboe, matthew

On Mon, 2009-11-02 at 08:29 -0500, Martin K. Petersen wrote:
> >>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:
> 
> >> ACS-2 doesn't currently have a notion of unmap granularity and all
> >> array vendors I have talked to appear to happily process any unmap
> >> request we throw at them.  So I'm leaning towards keeping things
> >> simple and just sending things down verbatim...
> 
> Christoph> We have all the granularity and alignment information
> Christoph> avilable and keeping track of it is simple enough.
> 
> That's not the issue.
> 
> Nothing prevents you from submitting a misaligned read/write request to
> the array.  If you do, the array may be slower in servicing the request.
> But the entire request will still be serviced.
> 
> Similarly, nothing prevents you from submitting a misaligned unmap
> request.  If you do, the array will just ignore the portions that do not
> constitute entire allocation units.  Your code is taking what is a hint
> and turning it into a hard limit.  Note that it's called OPTIMAL UNMAP
> GRANULARITY, not REQUIRED UNMAP GRANULARITY.
> 
> Every vendor I have talked to have asked us to always unmap the *entire*
> LBA range we're interested in freeing.  No exceptions.
> 
> We don't throw away the beginning/end of a read/write request because
> it's not properly aligned either, do we?

Agree entirely for this with UNMAP, since the array will just silently
discard what it doesn't want.

Don't necessarily agree with this for WRITE_SAME.  From the
implementations I've read, it sounds like the array will discard all
sectors within the correct boundary and alignment but it will have to
write zeros to all sectors outside of the aligned region ... this sounds
like a performance hit unless we do some judicious aligning of the
discards.

James



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-11-02 13:29       ` Martin K. Petersen
  2009-11-02 15:39         ` James Bottomley
@ 2009-11-02 16:02         ` Boaz Harrosh
  2009-11-02 18:18           ` Martin K. Petersen
  2009-11-03 15:16         ` Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread
From: Boaz Harrosh @ 2009-11-02 16:02 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Christoph Hellwig, linux-scsi, axboe, matthew

On 11/02/2009 03:29 PM, Martin K. Petersen wrote:
> 
> With heterogeneous md and dm devices there may not be a meaningful value
> to put in the discard granularity field.  Then what?
> 

These alignment masks may only be base-2, no?
If so then the max alignment of all devices will work for any device.

I think
Boaz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-11-02 15:39         ` James Bottomley
@ 2009-11-02 18:16           ` Martin K. Petersen
  0 siblings, 0 replies; 14+ messages in thread
From: Martin K. Petersen @ 2009-11-02 18:16 UTC (permalink / raw)
  To: James Bottomley
  Cc: Martin K. Petersen, Christoph Hellwig, linux-scsi, axboe, matthew

>>>>> "James" == James Bottomley <James.Bottomley@suse.de> writes:

James> Agree entirely for this with UNMAP, since the array will just
James> silently discard what it doesn't want.

James> Don't necessarily agree with this for WRITE_SAME.  From the
James> implementations I've read, it sounds like the array will discard
James> all sectors within the correct boundary and alignment but it will
James> have to write zeros to all sectors outside of the aligned region
James> ... this sounds like a performance hit unless we do some
James> judicious aligning of the discards.

Depends whether the array firmware tracks partial allocation units or
not.  Some do (and will report allocation units in the block limits VPD
because it's a hint to FS/DB allocators).

Since all vendors I've talked to treat unmaps (whether UNMAP or WRITE
SAME with the UNMAP bit set) asynchronously, I doubt it's a problem.
And they all seem to think that the overhead is minimal.  They say we
can worry about loss of a queue tag, but that we shouldn't worry about
processing time on the back end at all.

Should the problem arise I vote for doing the alignment correction in
the WRITE SAME codepath in sd.c.  Potentially backed by some heuristic.
That solves the problem of information being lost on the way down due to
stacking and whatnot...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-11-02 16:02         ` Boaz Harrosh
@ 2009-11-02 18:18           ` Martin K. Petersen
  0 siblings, 0 replies; 14+ messages in thread
From: Martin K. Petersen @ 2009-11-02 18:18 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Martin K. Petersen, Christoph Hellwig, linux-scsi, axboe, matthew

>>>>> "Boaz" == Boaz Harrosh <bharrosh@panasas.com> writes:

Boaz> On 11/02/2009 03:29 PM, Martin K. Petersen wrote:
>> 
>> With heterogeneous md and dm devices there may not be a meaningful
>> value to put in the discard granularity field.  Then what?
>> 

Boaz> These alignment masks may only be base-2, no?  

Nope.  Could be the good old sector 63 offset vs. naturally aligned.


Boaz> If so then the max alignment of all devices will work for any
Boaz> device.

But even then we'd waste space on the array with smaller granularity.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] block: add support for discard limits
  2009-11-02 13:29       ` Martin K. Petersen
  2009-11-02 15:39         ` James Bottomley
  2009-11-02 16:02         ` Boaz Harrosh
@ 2009-11-03 15:16         ` Christoph Hellwig
  2 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2009-11-03 15:16 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Christoph Hellwig, linux-scsi, axboe, matthew

On Mon, Nov 02, 2009 at 08:29:08AM -0500, Martin K. Petersen wrote:
> Similarly, nothing prevents you from submitting a misaligned unmap
> request.  If you do, the array will just ignore the portions that do not
> constitute entire allocation units.  Your code is taking what is a hint
> and turning it into a hard limit.  Note that it's called OPTIMAL UNMAP
> GRANULARITY, not REQUIRED UNMAP GRANULARITY.
> 
> Every vendor I have talked to have asked us to always unmap the *entire*
> LBA range we're interested in freeing.  No exceptions.
> 
> We don't throw away the beginning/end of a read/write request because
> it's not properly aligned either, do we?

read/write is different.  If we throw parts of it away we lose data, we
don't do for unmap requests.  I'll resend a patch series that moves the
trimming of the range into sd.c as suggested later and make it tunable.
That might not be what we want to keep long term, but it'll allow us
benchmarking both variants on the real life arrays.


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2009-11-03 15:16 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-10-29 15:08 [PATCH 0/2] scsi discard support Christoph Hellwig
2009-10-29 15:08 ` [PATCH 1/2] block: add support for discard limits Christoph Hellwig
2009-10-29 19:06   ` Jens Axboe
2009-10-30  5:03     ` Christoph Hellwig
2009-11-02 12:39       ` Jens Axboe
2009-10-30  3:19   ` Martin K. Petersen
2009-10-30  5:05     ` Christoph Hellwig
2009-11-02 13:29       ` Martin K. Petersen
2009-11-02 15:39         ` James Bottomley
2009-11-02 18:16           ` Martin K. Petersen
2009-11-02 16:02         ` Boaz Harrosh
2009-11-02 18:18           ` Martin K. Petersen
2009-11-03 15:16         ` Christoph Hellwig
2009-10-29 15:08 ` [PATCH 2/2] sd: add support for WRITE SAME (16) with unmap bit Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox