[PATCH 0 of 3] [RFC] I/O Hints

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0 of 3] [RFC] I/O Hints
@ 2008-06-05  5:22 Martin K. Petersen
  2008-06-05  5:22 ` [PATCH 1 of 3] block: Export I/O hints for block devices and partitions Martin K. Petersen
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-05  5:22 UTC (permalink / raw)
  To: linux-scsi, linux-fsdevel

This is just a proof of concept set of patches.  I'd like some
feedback before I spend more time on them.

At the Filesystem & Storage Workshop there was lots of discussion
about how to communicate I/O alignment, stripe width, etc. to the
filesystems so they could lay out things properly.

An addition to the up-and-coming version of the SCSI block protocol
features an inquiry page that hardware RAIDs can use to indicate
preferred I/O sizes for a given LUN.

This patch kit implements support for exporting those values in
/sys/block/.  I have implemented support for it in sd.c using the
Block Limits VPD and in MD using chunk size and stripe width.

The physical sector offset for the start of the "virtual" block device
is also exported.  This includes partitions so you can get the actual
physical start sector offset for - say - an MD device sitting on a
partitioned set of drives.

Comments and suggestions are welcome.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1 of 3] block: Export I/O hints for block devices and partitions
  2008-06-05  5:22 [PATCH 0 of 3] [RFC] I/O Hints Martin K. Petersen
@ 2008-06-05  5:22 ` Martin K. Petersen
  2008-06-05 14:42   ` James Bottomley
  2008-06-06 14:21   ` Jamie Lokier
  2008-06-05  5:22 ` [PATCH 2 of 3] md: Export preferred I/O sizes and physical alignment Martin K. Petersen
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-05  5:22 UTC (permalink / raw)
  To: linux-scsi, linux-fsdevel

3 files changed, 74 insertions(+)
block/genhd.c         |   39 +++++++++++++++++++++++++++++++++++++++
fs/partitions/check.c |   11 +++++++++++
include/linux/genhd.h |   24 ++++++++++++++++++++++++


Export physical start offset and preferred I/O sizes in sysfs.  This
allows filesystems to align data structures to RAID stripes.

`physical offset' indicates the physical offset in sectors to the
start of the block device.

`optimal_io_block' indicates the smallest request that can be
submitted without incurring a performance penalty (RAID
read-modify-write, 512-byte sector emulation).  It is recommended that
I/O requests are a multiple of this size.

`optimal-io-length' indicates the optimal I/O length for the device
(i.e. stripe size).

These values are largely modeled after the SCSI SBC-3 Block Limits
VPD.

Block device drivers may call disk_set_io_hints() to set these values
to their preference.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

---

diff -r 3e197642edae -r 3e5d8520d247 block/genhd.c
--- a/block/genhd.c	Mon Jun 02 15:30:44 2008 -0700
+++ b/block/genhd.c	Thu Jun 05 01:07:51 2008 -0400
@@ -222,6 +222,15 @@
 	return  kobj ? dev_to_disk(dev) : NULL;
 }
 
+void disk_set_io_hints(struct gendisk *disk, sector_t offset, uint block, uint len)
+{
+	printk(KERN_ERR "%s I/O hints: off %llu, gran %u, len %u\n",
+	       disk->disk_name, (unsigned long long)offset, block, len);
+	disk->phys_offset = offset;
+	disk->optimal_io_block = block;
+	disk->optimal_io_length = len;
+}
+
 /*
  * print a full list of all partitions - intended for places where the root
  * filesystem can't be mounted and thus to give the victim some idea of what
@@ -416,6 +425,30 @@
 	return sprintf(buf, "%x\n", disk->flags);
 }
 
+static ssize_t disk_phys_offset_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	return sprintf(buf, "%llu\n", (unsigned long long)disk->phys_offset);
+}
+
+static ssize_t disk_optimal_io_block_show(struct device *dev,
+					  struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	return sprintf(buf, "%u\n", disk->optimal_io_block);
+}
+
+static ssize_t disk_optimal_io_length_show(struct device *dev,
+					   struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	return sprintf(buf, "%u\n", disk->optimal_io_length);
+}
+
 static ssize_t disk_stat_show(struct device *dev,
 			      struct device_attribute *attr, char *buf)
 {
@@ -474,6 +507,9 @@
 static DEVICE_ATTR(removable, S_IRUGO, disk_removable_show, NULL);
 static DEVICE_ATTR(size, S_IRUGO, disk_size_show, NULL);
 static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
+static DEVICE_ATTR(phys_offset, S_IRUGO, disk_phys_offset_show, NULL);
+static DEVICE_ATTR(optimal_io_block, S_IRUGO, disk_optimal_io_block_show, NULL);
+static DEVICE_ATTR(optimal_io_length, S_IRUGO, disk_optimal_io_length_show, NULL);
 static DEVICE_ATTR(stat, S_IRUGO, disk_stat_show, NULL);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
@@ -485,6 +521,9 @@
 	&dev_attr_removable.attr,
 	&dev_attr_size.attr,
 	&dev_attr_capability.attr,
+	&dev_attr_phys_offset.attr,
+	&dev_attr_optimal_io_block.attr,
+	&dev_attr_optimal_io_length.attr,
 	&dev_attr_stat.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&dev_attr_fail.attr,
diff -r 3e197642edae -r 3e5d8520d247 fs/partitions/check.c
--- a/fs/partitions/check.c	Mon Jun 02 15:30:44 2008 -0700
+++ b/fs/partitions/check.c	Thu Jun 05 01:07:51 2008 -0400
@@ -204,6 +204,14 @@
 	return sprintf(buf, "%llu\n",(unsigned long long)p->start_sect);
 }
 
+static ssize_t part_phys_offset_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+
+	return sprintf(buf, "%llu\n",(unsigned long long)p->phys_offset);
+}
+
 static ssize_t part_size_show(struct device *dev,
 			      struct device_attribute *attr, char *buf)
 {
@@ -261,6 +269,7 @@
 #endif
 
 static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
+static DEVICE_ATTR(phys_offset, S_IRUGO, part_phys_offset_show, NULL);
 static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
 static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
@@ -271,6 +280,7 @@
 static struct attribute *part_attrs[] = {
 	&dev_attr_start.attr,
 	&dev_attr_size.attr,
+	&dev_attr_phys_offset.attr,
 	&dev_attr_stat.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&dev_attr_fail.attr,
@@ -358,6 +368,7 @@
 		return;
 	}
 	p->start_sect = start;
+	p->phys_offset = disk->phys_offset + start;
 	p->nr_sects = len;
 	p->partno = part;
 	p->policy = disk->policy;
diff -r 3e197642edae -r 3e5d8520d247 include/linux/genhd.h
--- a/include/linux/genhd.h	Mon Jun 02 15:30:44 2008 -0700
+++ b/include/linux/genhd.h	Thu Jun 05 01:07:51 2008 -0400
@@ -87,6 +87,7 @@
 struct hd_struct {
 	sector_t start_sect;
 	sector_t nr_sects;
+	sector_t phys_offset;
 	struct device dev;
 	struct kobject *holder_dir;
 	int policy, partno;
@@ -122,6 +123,10 @@
 	struct request_queue *queue;
 	void *private_data;
 	sector_t capacity;
+
+	sector_t phys_offset;
+	unsigned int optimal_io_block;
+	unsigned int optimal_io_length;
 
 	int flags;
 	struct device *driverfs_dev;  // FIXME: remove
@@ -357,6 +362,25 @@
 extern void del_gendisk(struct gendisk *gp);
 extern void unlink_gendisk(struct gendisk *gp);
 extern struct gendisk *get_gendisk(dev_t dev, int *part);
+extern void disk_set_io_hints(struct gendisk *, sector_t, unsigned int, unsigned int);
+
+static inline sector_t bdev_phys_offset(struct block_device *bdev)
+{
+	if (bdev != bdev->bd_contains)
+		return bdev->bd_part->phys_offset;
+
+	return bdev->bd_disk->phys_offset;
+}
+
+static inline unsigned int bdev_optimal_io_block(struct block_device *bdev)
+{
+	return bdev->bd_disk->optimal_io_block;
+}
+
+static inline unsigned int bdev_optimal_io_length(struct block_device *bdev)
+{
+	return bdev->bd_disk->optimal_io_length;
+}
 
 extern void set_device_ro(struct block_device *bdev, int flag);
 extern void set_disk_ro(struct gendisk *disk, int flag);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2 of 3] md: Export preferred I/O sizes and physical alignment
  2008-06-05  5:22 [PATCH 0 of 3] [RFC] I/O Hints Martin K. Petersen
  2008-06-05  5:22 ` [PATCH 1 of 3] block: Export I/O hints for block devices and partitions Martin K. Petersen
@ 2008-06-05  5:22 ` Martin K. Petersen
  2008-06-05  5:22 ` [PATCH 3 of 3] sd: Export preferred I/O sizes Martin K. Petersen
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-05  5:22 UTC (permalink / raw)
  To: linux-scsi, linux-fsdevel

5 files changed, 49 insertions(+)
drivers/md/linear.c |    8 ++++++++
drivers/md/raid0.c  |   11 +++++++++++
drivers/md/raid1.c  |    8 ++++++++
drivers/md/raid10.c |    9 +++++++++
drivers/md/raid5.c  |   13 +++++++++++++


For each of the RAID levels set the preferred I/O hints according to
chunk size and stripe width.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

---

diff -r 3e5d8520d247 -r c7132bad88f3 drivers/md/linear.c
--- a/drivers/md/linear.c	Thu Jun 05 01:07:51 2008 -0400
+++ b/drivers/md/linear.c	Thu Jun 05 01:07:51 2008 -0400
@@ -135,6 +135,14 @@
 
 		blk_queue_stack_limits(mddev->queue,
 				       rdev->bdev->bd_disk->queue);
+
+		if (rdev->raid_disk == 0)
+			disk_set_io_hints(mddev->gendisk,
+					  bdev_phys_offset(rdev->bdev)
+					  + rdev->data_offset,
+					  bdev_optimal_io_block(rdev->bdev),
+					  bdev_optimal_io_length(rdev->bdev));
+
 		/* as we don't honour merge_bvec_fn, we must never risk
 		 * violating it, so limit ->max_sector to one PAGE, as
 		 * a one page request is never in violation.
diff -r 3e5d8520d247 -r c7132bad88f3 drivers/md/raid0.c
--- a/drivers/md/raid0.c	Thu Jun 05 01:07:51 2008 -0400
+++ b/drivers/md/raid0.c	Thu Jun 05 01:07:51 2008 -0400
@@ -140,6 +140,17 @@
 
 		blk_queue_stack_limits(mddev->queue,
 				       rdev1->bdev->bd_disk->queue);
+
+		if (rdev1->raid_disk == 0) {
+			disk_set_io_hints(mddev->gendisk,
+					  bdev_phys_offset(rdev1->bdev)
+					  + rdev1->data_offset,
+					  mddev->chunk_size,
+					  mddev->chunk_size * mddev->raid_disks);
+			printk(KERN_ERR "raid0: chunk_size=%u, disks=%u\n",
+			       mddev->chunk_size, mddev->raid_disks);
+		}
+
 		/* as we don't honour merge_bvec_fn, we must never risk
 		 * violating it, so limit ->max_sector to one PAGE, as
 		 * a one page request is never in violation.
diff -r 3e5d8520d247 -r c7132bad88f3 drivers/md/raid1.c
--- a/drivers/md/raid1.c	Thu Jun 05 01:07:51 2008 -0400
+++ b/drivers/md/raid1.c	Thu Jun 05 01:07:51 2008 -0400
@@ -1970,6 +1970,14 @@
 
 		blk_queue_stack_limits(mddev->queue,
 				       rdev->bdev->bd_disk->queue);
+
+		if (rdev->raid_disk == 0)
+			disk_set_io_hints(mddev->gendisk,
+					  bdev_phys_offset(rdev->bdev)
+					  + rdev->data_offset,
+					  bdev_optimal_io_block(rdev->bdev),
+					  bdev_optimal_io_length(rdev->bdev));
+
 		/* as we don't honour merge_bvec_fn, we must never risk
 		 * violating it, so limit ->max_sector to one PAGE, as
 		 * a one page request is never in violation.
diff -r 3e5d8520d247 -r c7132bad88f3 drivers/md/raid10.c
--- a/drivers/md/raid10.c	Thu Jun 05 01:07:51 2008 -0400
+++ b/drivers/md/raid10.c	Thu Jun 05 01:07:51 2008 -0400
@@ -2106,6 +2106,15 @@
 
 		blk_queue_stack_limits(mddev->queue,
 				       rdev->bdev->bd_disk->queue);
+
+		if (rdev->raid_disk == 0)
+			disk_set_io_hints(mddev->gendisk,
+					  bdev_phys_offset(rdev->bdev)
+					  + rdev->data_offset,
+					  mddev->chunk_size,
+					  (mddev->chunk_size * mddev->raid_disks) 
+					  >> 1);
+
 		/* as we don't honour merge_bvec_fn, we must never risk
 		 * violating it, so limit ->max_sector to one PAGE, as
 		 * a one page request is never in violation.
diff -r 3e5d8520d247 -r c7132bad88f3 drivers/md/raid5.c
--- a/drivers/md/raid5.c	Thu Jun 05 01:07:51 2008 -0400
+++ b/drivers/md/raid5.c	Thu Jun 05 01:07:51 2008 -0400
@@ -4298,6 +4298,19 @@
 				raid_disk);
 			working_disks++;
 		}
+
+		if (rdev->raid_disk == 0) {
+			int disks = mddev->raid_disks - 1;
+
+			if (mddev->level == 6)
+				disks--;
+
+			disk_set_io_hints(mddev->gendisk,
+					  bdev_phys_offset(rdev->bdev)
+					  + rdev->data_offset,
+					  mddev->chunk_size,
+					  mddev->chunk_size * disks);
+		}
 	}
 
 	/*



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3 of 3] sd: Export preferred I/O sizes
  2008-06-05  5:22 [PATCH 0 of 3] [RFC] I/O Hints Martin K. Petersen
  2008-06-05  5:22 ` [PATCH 1 of 3] block: Export I/O hints for block devices and partitions Martin K. Petersen
  2008-06-05  5:22 ` [PATCH 2 of 3] md: Export preferred I/O sizes and physical alignment Martin K. Petersen
@ 2008-06-05  5:22 ` Martin K. Petersen
  2008-06-05 11:25   ` Boaz Harrosh
  2008-06-05  6:27 ` [PATCH 0 of 3] [RFC] I/O Hints Andreas Dilger
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-05  5:22 UTC (permalink / raw)
  To: linux-scsi, linux-fsdevel

1 file changed, 67 insertions(+)
drivers/scsi/sd.c |   67 +++++++++++++++++++++++++++++++++++++++++++++++++++++


Make storage devices that support the Block Limits VPD export the
optimal transfer length granularity and the optimal transfer length to
the block layer.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

---

diff -r c7132bad88f3 -r dba6ce5d97b8 drivers/scsi/sd.c
--- a/drivers/scsi/sd.c	Thu Jun 05 01:07:51 2008 -0400
+++ b/drivers/scsi/sd.c	Thu Jun 05 01:07:51 2008 -0400
@@ -1534,6 +1534,72 @@
 	sdkp->DPOFUA = 0;
 }
 
+static int sd_vpd_inquiry(struct scsi_disk *sdkp, unsigned char *buffer, u8 page, u8 len)
+{
+	int result;
+	unsigned char cmd[16];
+	struct scsi_sense_hdr sshdr;
+
+	memset(cmd, 0, 16);
+	cmd[0] = INQUIRY;
+	cmd[1] = 1;		/* EVPD */
+	cmd[2] = page;		/* VPD page */
+	cmd[3] = len;
+	
+	result = scsi_execute_req(sdkp->device, cmd, DMA_FROM_DEVICE, buffer,
+				  len, &sshdr, SD_TIMEOUT, SD_MAX_RETRIES);
+
+	if (media_not_present(sdkp, &sshdr))
+		return -EIO;
+
+	if (result) {
+		sd_printk(KERN_ERR, sdkp, "EVPD Inquiry failed\n");
+		return -EIO;
+	}
+
+	if (buffer[1] != page) {
+		sd_printk(KERN_ERR, sdkp, "Page code not %2x (%2x)\n", page,
+			  buffer[1]);
+		return -EIO;
+	}
+
+	return buffer[3];
+}
+
+/**
+ * sd_block_limits - Query disk device for preferred I/O sizes.
+ * @disk: disk to query
+ * @buffer: temporary buffer to store inquiry response in
+ */
+static void sd_block_limits(struct scsi_disk *sdkp, unsigned char *buffer)
+{
+	int result, i, bl_found;
+
+	bl_found = 0;
+	result = sd_vpd_inquiry(sdkp, buffer, 0x0, 32); /* Supported Pages list */
+
+	if (result < 0 || buffer[4] != 0)
+		return;
+
+	for (i = 0 ; i < buffer[3] ; i++)
+		if (buffer[4+i] == 0xb0)
+			bl_found = 1;
+
+	if (bl_found == 0)
+		return;
+
+	result = sd_vpd_inquiry(sdkp, buffer, 0xb0, 16); /* Block Limits VPD */
+
+	if (result < 0 || buffer[3] != 12)
+		return;
+
+	disk_set_io_hints(sdkp->disk, 0, ((buffer[6] << 8) | buffer[7])
+			  * sdkp->device->sector_size,
+			  ((buffer[12] << 24) | (buffer[13] << 16)
+			   | (buffer[14] << 8) | buffer[15])
+			  * sdkp->device->sector_size);
+}
+
 /**
  *	sd_revalidate_disk - called the first time a new disk is seen,
  *	performs disk spin up, read_capacity, etc.
@@ -1579,6 +1645,7 @@
 	 */
 	if (sdkp->media_present) {
 		sd_read_capacity(sdkp, buffer);
+		sd_block_limits(sdkp, buffer);
 		sd_read_write_protect_flag(sdkp, buffer);
 		sd_read_cache_type(sdkp, buffer);
 	}



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-05  5:22 [PATCH 0 of 3] [RFC] I/O Hints Martin K. Petersen
                   ` (2 preceding siblings ...)
  2008-06-05  5:22 ` [PATCH 3 of 3] sd: Export preferred I/O sizes Martin K. Petersen
@ 2008-06-05  6:27 ` Andreas Dilger
  2008-06-05 10:32   ` Jamie Lokier
                     ` (2 more replies)
  2008-06-05 10:40 ` Jamie Lokier
  2008-06-06 14:26 ` Jamie Lokier
  5 siblings, 3 replies; 28+ messages in thread
From: Andreas Dilger @ 2008-06-05  6:27 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-fsdevel

On Jun 05, 2008  01:22 -0400, Martin K. Petersen wrote:
> This is just a proof of concept set of patches.  I'd like some
> feedback before I spend more time on them.
> 
> At the Filesystem & Storage Workshop there was lots of discussion
> about how to communicate I/O alignment, stripe width, etc. to the
> filesystems so they could lay out things properly.

Thanks for looking into this Martin.  We could use this pretty
immediately in ext4 for helping the block allocator make good
decisions.

> An addition to the up-and-coming version of the SCSI block protocol
> features an inquiry page that hardware RAIDs can use to indicate
> preferred I/O sizes for a given LUN.
> 
> This patch kit implements support for exporting those values in
> /sys/block/.  I have implemented support for it in sd.c using the
> Block Limits VPD and in MD using chunk size and stripe width.
> 
> The physical sector offset for the start of the "virtual" block device
> is also exported.  This includes partitions so you can get the actual
> physical start sector offset for - say - an MD device sitting on a
> partitioned set of drives.

The kernel part of the code seems pretty reasonable (nicely stackable,
as your MD examples show) and useful for filesystems.  Having this
information available in the kernel removes much of the need to find
this information in userspace, but unfortunately not all of it (e.g.
some mkfs-time layout decisions need to be done before the filesystem
is mounted, even if the allocator can use the kernel-supplied hits).

To be honest, however, having the information exported only via sysfs
is a bit ugly IMHO.  I've had all sorts of grief with settings there
because there isn't always a match between the device that is being
specified by the user and what appears in sysfs (e.g. /dev/disk/by-id/foo
doesn't match /sys/block/sda) and hoops have to be jumped through to
find this mapping, before parsing a text value in C.

Having an ioctl() that can be called on the block device (getting
the right device regardless of its name) seems a lot more useful to
applications in my experience, unless you are using a script.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-05  6:27 ` [PATCH 0 of 3] [RFC] I/O Hints Andreas Dilger
@ 2008-06-05 10:32   ` Jamie Lokier
  2008-06-05 12:35   ` Matthew Wilcox
  2008-06-06  1:03   ` Martin K. Petersen
  2 siblings, 0 replies; 28+ messages in thread
From: Jamie Lokier @ 2008-06-05 10:32 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Martin K. Petersen, linux-scsi, linux-fsdevel

Andreas Dilger wrote:
> Having an ioctl() that can be called on the block device (getting
> the right device regardless of its name) seems a lot more useful to
> applications in my experience, unless you are using a script.

I agree with this - ioctl() on actual block devices is more direct,
and the quirks needed to get the right sysfs dir awkward.

It would be good to use the same ioctl() as other OSes if there are
any exporting the same information.

However, sysfs, being text based, is easier to extend with new values
that can be used without recompiling programs, e.g. from scripts, so
it's useful to have the same information there.

To link them, an ioctl() which returns an unambiguous reference to the
appropriate sysfs entry would be good too.  E.g. perhaps returning a
file descriptor of the appropriate directory.

-- Jamie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-05  5:22 [PATCH 0 of 3] [RFC] I/O Hints Martin K. Petersen
                   ` (3 preceding siblings ...)
  2008-06-05  6:27 ` [PATCH 0 of 3] [RFC] I/O Hints Andreas Dilger
@ 2008-06-05 10:40 ` Jamie Lokier
  2008-06-05 19:19   ` Andreas Dilger
  2008-06-06  1:16   ` Martin K. Petersen
  2008-06-06 14:26 ` Jamie Lokier
  5 siblings, 2 replies; 28+ messages in thread
From: Jamie Lokier @ 2008-06-05 10:40 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-fsdevel

Martin K. Petersen wrote:
> At the Filesystem & Storage Workshop there was lots of discussion
> about how to communicate I/O alignment, stripe width, etc. to the
> filesystems so they could lay out things properly.

That's really nice, thanks.

Does it handle devices with different properties in differeng offset
ranges?  E.g. a RAID setup where the first 100GB have one stripe
width, but the next 100GB have a different stripe width - as you can
get if you join two different hardware RAIDs with LVM, for example.

> The physical sector offset for the start of the "virtual" block device
> is also exported.  This includes partitions so you can get the actual
> physical start sector offset for - say - an MD device sitting on a
> partitioned set of drives.

If it's a set of drives, doesn't it need to return multiple offsets,
and drive identities?

> Comments and suggestions are welcome.

>From a database perspective, I'm thinking it would be very useful to
provide similar information for large files.  Database engines have
the same requirement to know stripe width etc., and filesystems
allocate large contiguous extents so it it makes sense to pass the
device characteristics up to the app, with appropriate offset.

(Same for loopback devices.)

This is an extreme example of different properties in different offset
ranges - it depends on file allocation.

Is there scope to extend FIEMAP to pass this sort of information?

-- Jamie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3 of 3] sd: Export preferred I/O sizes
  2008-06-05  5:22 ` [PATCH 3 of 3] sd: Export preferred I/O sizes Martin K. Petersen
@ 2008-06-05 11:25   ` Boaz Harrosh
  0 siblings, 0 replies; 28+ messages in thread
From: Boaz Harrosh @ 2008-06-05 11:25 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-fsdevel

Martin K. Petersen wrote:
> 1 file changed, 67 insertions(+)
> drivers/scsi/sd.c |   67 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> Make storage devices that support the Block Limits VPD export the
> optimal transfer length granularity and the optimal transfer length to
> the block layer.
> 
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> 
> ---
> 
> diff -r c7132bad88f3 -r dba6ce5d97b8 drivers/scsi/sd.c
> --- a/drivers/scsi/sd.c	Thu Jun 05 01:07:51 2008 -0400
> +++ b/drivers/scsi/sd.c	Thu Jun 05 01:07:51 2008 -0400
> @@ -1534,6 +1534,72 @@
>  	sdkp->DPOFUA = 0;
>  }
>  
> +static int sd_vpd_inquiry(struct scsi_disk *sdkp, unsigned char *buffer, u8 page, u8 len)
> +{
> +	int result;
> +	unsigned char cmd[16];
> +	struct scsi_sense_hdr sshdr;
> +
> +	memset(cmd, 0, 16);
> +	cmd[0] = INQUIRY;
> +	cmd[1] = 1;		/* EVPD */
> +	cmd[2] = page;		/* VPD page */
> +	cmd[3] = len;
> +	
> +	result = scsi_execute_req(sdkp->device, cmd, DMA_FROM_DEVICE, buffer,
> +				  len, &sshdr, SD_TIMEOUT, SD_MAX_RETRIES);
> +
> +	if (media_not_present(sdkp, &sshdr))
> +		return -EIO;
> +
> +	if (result) {
> +		sd_printk(KERN_ERR, sdkp, "EVPD Inquiry failed\n");
> +		return -EIO;
> +	}
> +
> +	if (buffer[1] != page) {
> +		sd_printk(KERN_ERR, sdkp, "Page code not %2x (%2x)\n", page,
> +			  buffer[1]);
> +		return -EIO;
> +	}
> +
> +	return buffer[3];
> +}
> +
> +/**
> + * sd_block_limits - Query disk device for preferred I/O sizes.
> + * @disk: disk to query
> + * @buffer: temporary buffer to store inquiry response in
> + */
> +static void sd_block_limits(struct scsi_disk *sdkp, unsigned char *buffer)
> +{
> +	int result, i, bl_found;
> +
> +	bl_found = 0;
> +	result = sd_vpd_inquiry(sdkp, buffer, 0x0, 32); /* Supported Pages list */
> +
> +	if (result < 0 || buffer[4] != 0)
> +		return;
> +
> +	for (i = 0 ; i < buffer[3] ; i++)
> +		if (buffer[4+i] == 0xb0)
> +			bl_found = 1;
> +
> +	if (bl_found == 0)
> +		return;
> +
> +	result = sd_vpd_inquiry(sdkp, buffer, 0xb0, 16); /* Block Limits VPD */
> +
> +	if (result < 0 || buffer[3] != 12)
> +		return;
> +
> +	disk_set_io_hints(sdkp->disk, 0, ((buffer[6] << 8) | buffer[7])
> +			  * sdkp->device->sector_size,
> +			  ((buffer[12] << 24) | (buffer[13] << 16)
> +			   | (buffer[14] << 8) | buffer[15])
> +			  * sdkp->device->sector_size);
> +}
> +

be16_to_cpup((__be16 *)&buffer[6]) and
be32_to_cpup((__be32 *)&buffer[12]) would be more readable for me.
(And defined to nothing on BE systems). Look how the standard aligned 
everything nicely for us.
 
>  /**
>   *	sd_revalidate_disk - called the first time a new disk is seen,
>   *	performs disk spin up, read_capacity, etc.
> @@ -1579,6 +1645,7 @@
>  	 */
>  	if (sdkp->media_present) {
>  		sd_read_capacity(sdkp, buffer);
> +		sd_block_limits(sdkp, buffer);
>  		sd_read_write_protect_flag(sdkp, buffer);
>  		sd_read_cache_type(sdkp, buffer);
>  	}
> 
> 
> --

Just my $0.017
Boaz

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-05  6:27 ` [PATCH 0 of 3] [RFC] I/O Hints Andreas Dilger
  2008-06-05 10:32   ` Jamie Lokier
@ 2008-06-05 12:35   ` Matthew Wilcox
  2008-06-05 17:02     ` Dan Williams
  2008-06-06  1:03   ` Martin K. Petersen
  2 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2008-06-05 12:35 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Martin K. Petersen, linux-scsi, linux-fsdevel

On Thu, Jun 05, 2008 at 12:27:54AM -0600, Andreas Dilger wrote:
> To be honest, however, having the information exported only via sysfs
> is a bit ugly IMHO.  I've had all sorts of grief with settings there
> because there isn't always a match between the device that is being
> specified by the user and what appears in sysfs (e.g. /dev/disk/by-id/foo
> doesn't match /sys/block/sda) and hoops have to be jumped through to
> find this mapping, before parsing a text value in C.

readlink() is too hard?

$ readlink /dev/disk/by-id/ata-FUJITSU_MHV2080AH_NT61T6325UGM
../../hda

> Having an ioctl() that can be called on the block device (getting
> the right device regardless of its name) seems a lot more useful to
> applications in my experience, unless you are using a script.

It's certainly easier.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1 of 3] block: Export I/O hints for block devices and partitions
  2008-06-05  5:22 ` [PATCH 1 of 3] block: Export I/O hints for block devices and partitions Martin K. Petersen
@ 2008-06-05 14:42   ` James Bottomley
  2008-06-06  1:18     ` Martin K. Petersen
  2008-06-06 14:21   ` Jamie Lokier
  1 sibling, 1 reply; 28+ messages in thread
From: James Bottomley @ 2008-06-05 14:42 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-fsdevel

On Thu, 2008-06-05 at 01:22 -0400, Martin K. Petersen wrote:
> 3 files changed, 74 insertions(+)
> block/genhd.c         |   39 +++++++++++++++++++++++++++++++++++++++
> fs/partitions/check.c |   11 +++++++++++
> include/linux/genhd.h |   24 ++++++++++++++++++++++++

This looks good.  However, the first observation I would have is that
the hints should be extensible without disrupting existing code.

Something like

enum disk_hint {
	DISK_HINT_OFFSET,
	DISK_HINT_BLOCK,
	DISK_HINT_LEN,
};

void disk_set_io_hints(struct gendisk *disk, enum disk_hint hint, u64 value) {
	switch (hint) {
	case DISK_HINT_OFFSET:
		disk->phys_offset = (sector_t)val;
		break;
...
etc.

Because I can just bet we will find other hints that people want adding.

I'm also not entirely sure that zero should be the no-hint value, but I
could be persuaded on that because I can't see a case where setting zero
to mean I'm telling you I definitely don't care should be different from
not setting it.

James



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-05 12:35   ` Matthew Wilcox
@ 2008-06-05 17:02     ` Dan Williams
  0 siblings, 0 replies; 28+ messages in thread
From: Dan Williams @ 2008-06-05 17:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andreas Dilger, Martin K. Petersen, linux-scsi, linux-fsdevel

On Thu, Jun 5, 2008 at 5:35 AM, Matthew Wilcox <matthew@wil.cx> wrote:
> On Thu, Jun 05, 2008 at 12:27:54AM -0600, Andreas Dilger wrote:
>> To be honest, however, having the information exported only via sysfs
>> is a bit ugly IMHO.  I've had all sorts of grief with settings there
>> because there isn't always a match between the device that is being
>> specified by the user and what appears in sysfs (e.g. /dev/disk/by-id/foo
>> doesn't match /sys/block/sda) and hoops have to be jumped through to
>> find this mapping, before parsing a text value in C.
>
> readlink() is too hard?
>
> $ readlink /dev/disk/by-id/ata-FUJITSU_MHV2080AH_NT61T6325UGM
> ../../hda
>
>> Having an ioctl() that can be called on the block device (getting
>> the right device regardless of its name) seems a lot more useful to
>> applications in my experience, unless you are using a script.
>
> It's certainly easier.
>

We've had this discussion before:

http://marc.info/?t=120641785300005&r=6&w=2

...and it resulted in this patch that is pending is Greg's tree:

http://git.kernel.org/?p=linux/kernel/git/gregkh/patches.git;a=blob;f=driver-core/sysfs-add-sys-dev-char-block-to-lookup-sysfs-path-by-major-minor.patch;h=ed9a41a564f73db74f7260f24552352ab86bc893;hb=HEAD

Regards,
Dan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-05 10:40 ` Jamie Lokier
@ 2008-06-05 19:19   ` Andreas Dilger
  2008-06-06 12:55     ` Jamie Lokier
  2008-06-06  1:16   ` Martin K. Petersen
  1 sibling, 1 reply; 28+ messages in thread
From: Andreas Dilger @ 2008-06-05 19:19 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Martin K. Petersen, linux-scsi, linux-fsdevel

On Jun 05, 2008  11:40 +0100, Jamie Lokier wrote:
> >From a database perspective, I'm thinking it would be very useful to
> provide similar information for large files.  Database engines have
> the same requirement to know stripe width etc., and filesystems
> allocate large contiguous extents so it it makes sense to pass the
> device characteristics up to the app, with appropriate offset.
> 
> (Same for loopback devices.)
> 
> This is an extreme example of different properties in different offset
> ranges - it depends on file allocation.
> 
> Is there scope to extend FIEMAP to pass this sort of information?

I have ideas in that direction, but have been gun-shy about presenting
them until the existing controversy dies down...  I think it is actually
fairly straight forward to enhance FIEMAP to allow arbitrary file layouts
to be returned to an application.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-05  6:27 ` [PATCH 0 of 3] [RFC] I/O Hints Andreas Dilger
  2008-06-05 10:32   ` Jamie Lokier
  2008-06-05 12:35   ` Matthew Wilcox
@ 2008-06-06  1:03   ` Martin K. Petersen
  2008-06-06 14:02     ` Jamie Lokier
  2 siblings, 1 reply; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-06  1:03 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Martin K. Petersen, linux-scsi, linux-fsdevel

>>>>> "Andreas" == Andreas Dilger <adilger@sun.com> writes:

Andreas> Having an ioctl() that can be called on the block device
Andreas> (getting the right device regardless of its name) seems a lot
Andreas> more useful to applications in my experience, unless you are
Andreas> using a script.

The reason I didn't do it that way is that I only exported the bare
bones required for userland to deal with the layout intelligently.

The problem of the ioctl is that you lose information about the
subdevices and in particular about their physical alignment.  The
phys_off value I export for the MD device is the offset for the first
device.  But that does not imply that the other devices in the stripe
have the same alignment.

I considered taking an approach similar to blk_queue_stack_limits()
where you clamp using the existing values as you add more devices.
The downside to that is that you really want to let the user know that
there's a potential problem.  And requiring the user to scrounge
through syslog to look for complaints isn't so happening.  I'd much
rather do that in libdisk where it's easy to print a message about
"Suboptimal layout, proceed with caution".

IOW, the idea is (for better and for worse) to let libdisk traverse
sysfs to ensure that all things line up nicely before mkfs actually
starts writing.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-05 10:40 ` Jamie Lokier
  2008-06-05 19:19   ` Andreas Dilger
@ 2008-06-06  1:16   ` Martin K. Petersen
  2008-06-06  4:51     ` Dave Chinner
  2008-06-06 12:52     ` Jamie Lokier
  1 sibling, 2 replies; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-06  1:16 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Martin K. Petersen, linux-scsi, linux-fsdevel

>>>>> "Jamie" == Jamie Lokier <jamie@shareable.org> writes:

Jamie> Does it handle devices with different properties in differeng
Jamie> offset ranges?  E.g. a RAID setup where the first 100GB have
Jamie> one stripe width, but the next 100GB have a different stripe
Jamie> width - as you can get if you join two different hardware RAIDs
Jamie> with LVM, for example.

I touched on this in my reply to Andreas.  The values exported in
sysfs are only part of the solution.  We'll still need some
intelligence (in libdisk or elsewhere) to traverse the stacked device.
And that's better done in user land where it's easier to notify the
operator or ask for confirmation.

Jamie> If it's a set of drives, doesn't it need to return multiple
Jamie> offsets, and drive identities?

Given the almost infinite amount of stacking and concatenation options
I think we'll quickly get into FIEMAP territory.  Add snapshots to the
mix and mapping out the characteristics quickly becomes unmanageable.

If we present the mkfs writers with a list of 200 regions with
different alignment criteria and stripe sizes I'm sure they'll get
very unhappy.

So instead of publishing all this information I'd much rather have
libdisk do a rudimentary check and make it a binary "looks good"
vs. "may have performance problems".

If some poor mkfs souls wantsto traverse the entire stack and actually
make the filesystem layout completely heterogeneous, my patch also
allows them to do that...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1 of 3] block: Export I/O hints for block devices and partitions
  2008-06-05 14:42   ` James Bottomley
@ 2008-06-06  1:18     ` Martin K. Petersen
  0 siblings, 0 replies; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-06  1:18 UTC (permalink / raw)
  To: James Bottomley; +Cc: Martin K. Petersen, linux-scsi, linux-fsdevel

>>>>> "James" == James Bottomley <James.Bottomley@HansenPartnership.com> writes:

James> Something like

James> enum disk_hint { DISK_HINT_OFFSET, DISK_HINT_BLOCK,
James> DISK_HINT_LEN, };

Fine by me.  I'll look into that.

James> I'm also not entirely sure that zero should be the no-hint
James> value, but I could be persuaded on that because I can't see a
James> case where setting zero to mean I'm telling you I definitely
James> don't care should be different from not setting it.

Yeah, I couldn't come up with a valid reason for any of these values
to be 0.  That's why I made them unsigned.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-06  1:16   ` Martin K. Petersen
@ 2008-06-06  4:51     ` Dave Chinner
  2008-06-06 16:53       ` Martin K. Petersen
  2008-06-06 12:52     ` Jamie Lokier
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2008-06-06  4:51 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Jamie Lokier, linux-scsi, linux-fsdevel

On Thu, Jun 05, 2008 at 09:16:24PM -0400, Martin K. Petersen wrote:
> >>>>> "Jamie" == Jamie Lokier <jamie@shareable.org> writes:
> 
> Jamie> Does it handle devices with different properties in differeng
> Jamie> offset ranges?  E.g. a RAID setup where the first 100GB have
> Jamie> one stripe width, but the next 100GB have a different stripe
> Jamie> width - as you can get if you join two different hardware RAIDs
> Jamie> with LVM, for example.
> 
> I touched on this in my reply to Andreas.  The values exported in
> sysfs are only part of the solution.  We'll still need some
> intelligence (in libdisk or elsewhere) to traverse the stacked device.
> And that's better done in user land where it's easier to notify the
> operator or ask for confirmation.

So is there going to be any obvious kernel API to access all this
info? i.e. if you do an online modification of a volume (e.g.
add new storage of a different geometry to the volume) and then
do an online grow of the filesystem, how does the filesystem get
that new geometry information for the expanded area?

FWIW, I don't mind if there isn't a kernel interface for the
filesystem to get this information - we're already planning to
push complex and dynamic allocation policy configuration out
to userspace because it's easier to determine mappings and
geometries there....

> Jamie> If it's a set of drives, doesn't it need to return multiple
> Jamie> offsets, and drive identities?
> 
> Given the almost infinite amount of stacking and concatenation options
> I think we'll quickly get into FIEMAP territory.  Add snapshots to the
> mix and mapping out the characteristics quickly becomes unmanageable.
> 
> If we present the mkfs writers with a list of 200 regions with
> different alignment criteria and stripe sizes I'm sure they'll get
> very unhappy.

Most filesystems can't handle existing geometry information, let alone
variable geometry.

> So instead of publishing all this information I'd much rather have
> libdisk do a rudimentary check and make it a binary "looks good"
> vs. "may have performance problems".
>
> If some poor mkfs souls wantsto traverse the entire stack and actually
> make the filesystem layout completely heterogeneous, my patch also
> allows them to do that...

Great ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-06  1:16   ` Martin K. Petersen
  2008-06-06  4:51     ` Dave Chinner
@ 2008-06-06 12:52     ` Jamie Lokier
  1 sibling, 0 replies; 28+ messages in thread
From: Jamie Lokier @ 2008-06-06 12:52 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-fsdevel

Martin K. Petersen wrote:
> I touched on this in my reply to Andreas.  The values exported in
> sysfs are only part of the solution.  We'll still need some
> intelligence (in libdisk or elsewhere) to traverse the stacked device.
> And that's better done in user land where it's easier to notify the
> operator or ask for confirmation.

Agree, except for one thing.  It would be good to able to get a
consistent atomic snapshot of the whole stack in one kernel call.  I
guess that's not feasible where network devices are involved, though,
without a rather complicated interface.

> Jamie> If it's a set of drives, doesn't it need to return multiple
> Jamie> offsets, and drive identities?
> 
> Given the almost infinite amount of stacking and concatenation options
> I think we'll quickly get into FIEMAP territory.  Add snapshots to the
> mix and mapping out the characteristics quickly becomes unmanageable.

Well, not unmanagable, but large and unsuitable for many programs.

> If we present the mkfs writers with a list of 200 regions with
> different alignment criteria and stripe sizes I'm sure they'll get
> very unhappy.

Probably right.  But as a database writer, I'd like the information if
I can get it.  Sometimes I'll do timing measurements, because they are
more "real" than the kernel's estimate.  But I can't measure storage
stability levels, that really needs data from the kernel (if the
kernel even has it).

> So instead of publishing all this information I'd much rather have
> libdisk do a rudimentary check and make it a binary "looks good"
> vs. "may have performance problems".

Sensible, I agree.

> If some poor mkfs souls wantsto traverse the entire stack and actually
> make the filesystem layout completely heterogeneous, my patch also
> allows them to do that...

Good, being possible is the main thing :-)

And preferably without a different interface for different kinds of
device.  So as long as there's a way to traverse the stack generically.

-- Jamie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-05 19:19   ` Andreas Dilger
@ 2008-06-06 12:55     ` Jamie Lokier
  0 siblings, 0 replies; 28+ messages in thread
From: Jamie Lokier @ 2008-06-06 12:55 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Martin K. Petersen, linux-scsi, linux-fsdevel

Andreas Dilger wrote:
> > Is there scope to extend FIEMAP to pass this sort of information?
> 
> I have ideas in that direction, but have been gun-shy about presenting
> them until the existing controversy dies down...  I think it is actually
> fairly straight forward to enhance FIEMAP to allow arbitrary file layouts
> to be returned to an application.

>From the pipe2(), dup3() saga: just remember to include room for a
"flags" argument somewhere :-)

-- Jamie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-06  1:03   ` Martin K. Petersen
@ 2008-06-06 14:02     ` Jamie Lokier
  2008-06-06 16:48       ` Martin K. Petersen
  0 siblings, 1 reply; 28+ messages in thread
From: Jamie Lokier @ 2008-06-06 14:02 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Andreas Dilger, linux-scsi, linux-fsdevel

Martin K. Petersen wrote:
> The phys_off value I export for the MD device is the offset for the
> first device.  But that does not imply that the other devices in the
> stripe have the same alignment.

There are setups where the offsets of subdevices are modulo-aligned
well for performance, and when they aren't.

When they conflict for poor performance, it's useful to be informed.

> I considered taking an approach similar to blk_queue_stack_limits()
> where you clamp using the existing values as you add more devices.

Makes some sense.

> The downside to that is that you really want to let the user know that
> there's a potential problem.

As long as the measurement presented is one which looks worse and
worse when you combine subdevices, they will see from the bad-looking
value.  A flag is a nice bonus, but the main thing is, e.g. "the
largest combined stripe size for the device is 8 sectors" due to
poor offset skew, instead of 1024 sectors, say.

> And requiring the user to scrounge
> through syslog to look for complaints isn't so happening.  I'd much
> rather do that in libdisk where it's easy to print a message about
> "Suboptimal layout, proceed with caution".

I agree, that is good.

Where it is due to specific values being not exactly right - like the
subdevice offset (or modulo-offset) in MD because it doesn't represent
all subdevices - it would be good for the info to have flags saying
which _specific_ values are not exactly right.

So that programs can choose their heuristics appropriately.  After
all, users will use suboptimal layouts, and still want the best
performance it can do.

-- Jamie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1 of 3] block: Export I/O hints for block devices and partitions
  2008-06-05  5:22 ` [PATCH 1 of 3] block: Export I/O hints for block devices and partitions Martin K. Petersen
  2008-06-05 14:42   ` James Bottomley
@ 2008-06-06 14:21   ` Jamie Lokier
  1 sibling, 0 replies; 28+ messages in thread
From: Jamie Lokier @ 2008-06-06 14:21 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-fsdevel

Martin K. Petersen wrote:
> +	sector_t phys_offset;
> +	unsigned int optimal_io_block;
> +	unsigned int optimal_io_length;

Wouldn't off_t or sector_t be more appropriate?

I can (vaguely) imagine devices with stripe sizes >= 2GiB in the not
too distant future.  Especially clusters with high speed interconnect
and RAM-backed devices.

There's no harm in an application doing smaller I/O.  It should always
choose appropriate values itself anyway.

But it seems cleaner to report the numbers derived from actual drive
topology.

-- Jamie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-05  5:22 [PATCH 0 of 3] [RFC] I/O Hints Martin K. Petersen
                   ` (4 preceding siblings ...)
  2008-06-05 10:40 ` Jamie Lokier
@ 2008-06-06 14:26 ` Jamie Lokier
  2008-06-06 16:56   ` Martin K. Petersen
  5 siblings, 1 reply; 28+ messages in thread
From: Jamie Lokier @ 2008-06-06 14:26 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-fsdevel

Martin K. Petersen wrote:
> Comments and suggestions are welcome.

Flash devices and SSDs (solid-state storage) in general are becoming
more common.

Also, backing a disk with flash.  And increasingly people will stripe,
parity, mirror or concatenate disks and flash just because it's
convenient.

--> So it would be nice if some of the flash alignment characteristics
could be reported.

Some aspects of flash don't fit into block devices - e.g. erase block
and write order requirements.  That's MTD stuff, not appropriate here.

But SSDs present a block device interface, and they still have
alignment characteristics for performance.  Even more important, when
combining multiple SSDs into a RAID.

-- Jamie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-06 14:02     ` Jamie Lokier
@ 2008-06-06 16:48       ` Martin K. Petersen
  2008-06-09 10:47         ` Jamie Lokier
  0 siblings, 1 reply; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-06 16:48 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Martin K. Petersen, Andreas Dilger, linux-scsi, linux-fsdevel

>>>>> "Jamie" == Jamie Lokier <jamie@shareable.org> writes:

Jamie> Where it is due to specific values being not exactly right -
Jamie> like the subdevice offset (or modulo-offset) in MD because it
Jamie> doesn't represent all subdevices - it would be good for the
Jamie> info to have flags saying which _specific_ values are not
Jamie> exactly right.

You get that traversing the sysfs tree.

Jamie> So that programs can choose their heuristics appropriately.
Jamie> After all, users will use suboptimal layouts, and still want
Jamie> the best performance it can do.

I'd like to keep a clear distinction between normal applications and
programs like mkfs.  I'm targeting the latter.

Given that filesystems can lay out files at will, I/O characteristics
for normal applications is really FIEMAP territory.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-06  4:51     ` Dave Chinner
@ 2008-06-06 16:53       ` Martin K. Petersen
  2008-06-07 20:54         ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-06 16:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-scsi, linux-fsdevel

>>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes:

Dave> So is there going to be any obvious kernel API to access all
Dave> this info? i.e. if you do an online modification of a volume
Dave> (e.g.  add new storage of a different geometry to the volume)
Dave> and then do an online grow of the filesystem, how does the
Dave> filesystem get that new geometry information for the expanded
Dave> area?

xfs_growfs would call libdisk to see whether the topology had changed
since the fs was created.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-06 14:26 ` Jamie Lokier
@ 2008-06-06 16:56   ` Martin K. Petersen
  0 siblings, 0 replies; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-06 16:56 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Martin K. Petersen, linux-scsi, linux-fsdevel

>>>>> "Jamie" == Jamie Lokier <jamie@shareable.org> writes:

Jamie> But SSDs present a block device interface, and they still have
Jamie> alignment characteristics for performance.  Even more
Jamie> important, when combining multiple SSDs into a RAID.

I'm not sure whether ATA has a way to get this information like SCSI
dues.  But nothing prevents the SATA driver from setting the
parameters accordingly.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-06 16:53       ` Martin K. Petersen
@ 2008-06-07 20:54         ` Dave Chinner
  2008-06-09 15:05           ` Martin K. Petersen
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2008-06-07 20:54 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-fsdevel

On Fri, Jun 06, 2008 at 12:53:39PM -0400, Martin K. Petersen wrote:
> >>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes:
> 
> Dave> So is there going to be any obvious kernel API to access all
> Dave> this info? i.e. if you do an online modification of a volume
> Dave> (e.g.  add new storage of a different geometry to the volume)
> Dave> and then do an online grow of the filesystem, how does the
> Dave> filesystem get that new geometry information for the expanded
> Dave> area?
> 
> xfs_growfs would call libdisk to see whether the topology had changed
> since the fs was created.

Sure it could, but xfs_growfs is currently just a wrapper around an
ioctl that tells the kernel to grow the filesystem by N bytes and
the kernel does all the changes transactionally. So changing
xfs_growfs to provide this information to the kernel will 
require a new userspace API for XFS to push this information
straight back down into the kernel....

Also, it would be good to verify that stored filesystem parameters
are still optimal at mount time by querying the block device(s)
when internal structures are first initialised from disk.
Hence even a simple kernel API to this info would be handy.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-06 16:48       ` Martin K. Petersen
@ 2008-06-09 10:47         ` Jamie Lokier
  2008-06-10  2:17           ` Martin K. Petersen
  0 siblings, 1 reply; 28+ messages in thread
From: Jamie Lokier @ 2008-06-09 10:47 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Andreas Dilger, linux-scsi, linux-fsdevel

Martin K. Petersen wrote:
> Jamie> Where it is due to specific values being not exactly right -
> Jamie> like the subdevice offset (or modulo-offset) in MD because it
> Jamie> doesn't represent all subdevices - it would be good for the
> Jamie> info to have flags saying which _specific_ values are not
> Jamie> exactly right.
> 
> You get that traversing the sysfs tree.

Ok.  So I should always traverse the sysfs tree, and only use the leaf
values, combining them at MD nodes as I see fit, and ignoring the
values you provide at MD nodes?

Is there enough information in sysfs for me to combine the values?
I.e. offsets of all child disks, etc.?

What about future compound disk types that my app doesn't know about?
Use your values, I guess.

> Jamie> So that programs can choose their heuristics appropriately.
> Jamie> After all, users will use suboptimal layouts, and still want
> Jamie> the best performance it can do.
> 
> I'd like to keep a clear distinction between normal applications and
> programs like mkfs.  I'm targeting the latter.
>
> Given that filesystems can lay out files at will, I/O characteristics
> for normal applications is really FIEMAP territory.

I'm thinking of database type applications directly on block devices.
They need similar performance guides as filesystems.

You're right that most apps are FIEMAP territory.

Vice versa: Filesystems on LVM may be FIEMAP territory, if there are
many non-contiguous LVM strips, spanning multiple devices, LVM
snapshots, etc.

-- Jamie

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-07 20:54         ` Dave Chinner
@ 2008-06-09 15:05           ` Martin K. Petersen
  0 siblings, 0 replies; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-09 15:05 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-fsdevel

>>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes:

Dave> Sure it could, but xfs_growfs is currently just a wrapper around
Dave> an ioctl that tells the kernel to grow the filesystem by N bytes
Dave> and the kernel does all the changes transactionally. So changing
Dave> xfs_growfs to provide this information to the kernel will
Dave> require a new userspace API for XFS to push this information
Dave> straight back down into the kernel....

Dave> Also, it would be good to verify that stored filesystem
Dave> parameters are still optimal at mount time by querying the block
Dave> device(s) when internal structures are first initialised from
Dave> disk.  Hence even a simple kernel API to this info would be
Dave> handy.

Ok, I'll see if I can come up with a reasonable interface for this.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0 of 3] [RFC] I/O Hints
  2008-06-09 10:47         ` Jamie Lokier
@ 2008-06-10  2:17           ` Martin K. Petersen
  0 siblings, 0 replies; 28+ messages in thread
From: Martin K. Petersen @ 2008-06-10  2:17 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Martin K. Petersen, Andreas Dilger, linux-scsi, linux-fsdevel

>>>>> "Jamie" == Jamie Lokier <jamie@shareable.org> writes:

Jamie> Ok.  So I should always traverse the sysfs tree, and only use
Jamie> the leaf values, combining them at MD nodes as I see fit, and
Jamie> ignoring the values you provide at MD nodes?

Yes.  The intent was that "sloppy" programs would use the values
exported in MD/DM, and that programs that really care (i.e. mkfs.xfs
via libdisk) would traverse the tree.

Sloppy in this case means "programs that aren't capable of dealing
with heterogeneous storage anyway".

Jamie> Is there enough information in sysfs for me to combine the
Jamie> values?  I.e. offsets of all child disks, etc.?

With MD I'd think so.  Neil generally exports everything on the
planet.  Haven't looked at DM yet.

But let me take a stab at coming up with a richer kernel interface
first...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2008-06-10  2:17 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-05  5:22 [PATCH 0 of 3] [RFC] I/O Hints Martin K. Petersen
2008-06-05  5:22 ` [PATCH 1 of 3] block: Export I/O hints for block devices and partitions Martin K. Petersen
2008-06-05 14:42   ` James Bottomley
2008-06-06  1:18     ` Martin K. Petersen
2008-06-06 14:21   ` Jamie Lokier
2008-06-05  5:22 ` [PATCH 2 of 3] md: Export preferred I/O sizes and physical alignment Martin K. Petersen
2008-06-05  5:22 ` [PATCH 3 of 3] sd: Export preferred I/O sizes Martin K. Petersen
2008-06-05 11:25   ` Boaz Harrosh
2008-06-05  6:27 ` [PATCH 0 of 3] [RFC] I/O Hints Andreas Dilger
2008-06-05 10:32   ` Jamie Lokier
2008-06-05 12:35   ` Matthew Wilcox
2008-06-05 17:02     ` Dan Williams
2008-06-06  1:03   ` Martin K. Petersen
2008-06-06 14:02     ` Jamie Lokier
2008-06-06 16:48       ` Martin K. Petersen
2008-06-09 10:47         ` Jamie Lokier
2008-06-10  2:17           ` Martin K. Petersen
2008-06-05 10:40 ` Jamie Lokier
2008-06-05 19:19   ` Andreas Dilger
2008-06-06 12:55     ` Jamie Lokier
2008-06-06  1:16   ` Martin K. Petersen
2008-06-06  4:51     ` Dave Chinner
2008-06-06 16:53       ` Martin K. Petersen
2008-06-07 20:54         ` Dave Chinner
2008-06-09 15:05           ` Martin K. Petersen
2008-06-06 12:52     ` Jamie Lokier
2008-06-06 14:26 ` Jamie Lokier
2008-06-06 16:56   ` Martin K. Petersen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).