* [DO NOT APPLY] sd take advantage of rotation speed
@ 2008-06-19 16:03 Matthew Wilcox
2008-06-19 17:12 ` Mike Anderson
` (3 more replies)
0 siblings, 4 replies; 34+ messages in thread
From: Matthew Wilcox @ 2008-06-19 16:03 UTC (permalink / raw)
To: linux-scsi
Use the noop elevator by default for drives that do not spin
[Not for applying]
SSDs do not benefit from the elevator. It just wastes precious CPU cycles.
By selecting the noop elevator by default, we can shave a few microseconds
off each IO.
I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
http://marc.info/?l=linux-scsi&m=121264354724277&w=2
No need to have two copies of that ... but this will conflict with his code.
On to the self-criticism:
I don't intend the final version of this patch to include a printk for
the RPM or even a printk to say we switched IO elevator. I think we're
too verbose in SCSI as it is.
I think there's an opportunity to improve sd_vpd_inquiry() to remove
some of the duplicate code between sd_set_elevator() and sd_block_limits,
but it's not terribly important.
The switching of the elevators isn't particularly nice. I assume that
elevator_init("noop") cannot fail, which isn't true. It would be nice
to use the #if 0 block instead, but that causes a null ptr dereference
inside sysfs -- I suspect something isn't set up correctly.
Not-signed-off-by: Matthew Wilcox <willy@linux.intel.com>
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 01cefbb..1c5a296 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1534,6 +1534,79 @@ defaults:
sdkp->DPOFUA = 0;
}
+static int sd_vpd_inquiry(struct scsi_disk *sdkp, unsigned char *buffer, u8 page, u8 len)
+{
+ int result;
+ unsigned char cmd[16];
+ struct scsi_sense_hdr sshdr;
+
+ memset(cmd, 0, 16);
+ cmd[0] = INQUIRY;
+ cmd[1] = 1; /* EVPD */
+ cmd[2] = page; /* VPD page */
+ cmd[3] = len;
+
+ result = scsi_execute_req(sdkp->device, cmd, DMA_FROM_DEVICE, buffer,
+ len, &sshdr, SD_TIMEOUT, SD_MAX_RETRIES);
+
+ if (media_not_present(sdkp, &sshdr))
+ return -EIO;
+
+ if (result) {
+ sd_printk(KERN_ERR, sdkp, "EVPD Inquiry failed\n");
+ return -EIO;
+ }
+
+ if (buffer[1] != page) {
+ sd_printk(KERN_ERR, sdkp, "Page code not %2x (%2x)\n", page,
+ buffer[1]);
+ return -EIO;
+ }
+
+ return buffer[3];
+}
+
+static void sd_set_elevator(struct scsi_disk *sdkp, unsigned char *buffer)
+{
+ struct scsi_device *sdp = sdkp->device;
+ int res, i, rotation;
+
+ res = sd_vpd_inquiry(sdkp, buffer, 0, 255);
+ if (res < 0)
+ return;
+
+ for (i = 0; i < buffer[3]; i++)
+ if (buffer[i + 4] == 0xb1)
+ goto found;
+ return;
+
+ found:
+ res = sd_vpd_inquiry(sdkp, buffer, 0xb1, 64);
+ if (res < 0)
+ return;
+
+ rotation = (buffer[4] << 8) | buffer[5];
+
+ if (rotation == 0)
+ return;
+
+ if (rotation == 0x0001) {
+#if 0
+ res = elv_iosched_store(sdp->request_queue, "noop", 5);
+ if (res >= 0)
+ sd_printk(KERN_INFO, sdkp,
+ "Switched to noop elevator\n");
+#else
+ elevator_exit(sdp->request_queue->elevator);
+ sdp->request_queue->elevator = NULL;
+ elevator_init(sdp->request_queue, "noop");
+ sd_printk(KERN_INFO, sdkp, "Switched to noop elevator\n");
+#endif
+ } else if (rotation > 0x400) {
+ sd_printk(KERN_INFO, sdkp, "%u RPM drive", rotation);
+ }
+}
+
/**
* sd_revalidate_disk - called the first time a new disk is seen,
* performs disk spin up, read_capacity, etc.
@@ -1581,6 +1654,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
sd_read_capacity(sdkp, buffer);
sd_read_write_protect_flag(sdkp, buffer);
sd_read_cache_type(sdkp, buffer);
+ sd_set_elevator(sdkp, buffer);
}
/*
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply related [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-19 16:03 [DO NOT APPLY] sd take advantage of rotation speed Matthew Wilcox
@ 2008-06-19 17:12 ` Mike Anderson
2008-06-19 18:10 ` Matthew Wilcox
2008-06-22 12:16 ` Boaz Harrosh
` (2 subsequent siblings)
3 siblings, 1 reply; 34+ messages in thread
From: Mike Anderson @ 2008-06-19 17:12 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-scsi
Matthew Wilcox <matthew@wil.cx> wrote:
> Use the noop elevator by default for drives that do not spin
>
> [Not for applying]
>
> SSDs do not benefit from the elevator. It just wastes precious CPU cycles.
> By selecting the noop elevator by default, we can shave a few microseconds
> off each IO.
>
> I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
>
> http://marc.info/?l=linux-scsi&m=121264354724277&w=2
>
> No need to have two copies of that ... but this will conflict with his code.
>
Why would we want to do this in the kernel instead of a udev rule
similar to how we set timeouts. Since there already exists a sysfs
interface to change the IO scheduler it would seem this would be more
flexible in case someone did not like the kernel policy selected.
You would need some tool to obtain the vpd data from user space that would
always be installed.
There was some past discussion on replacing scsi_id with a better tool
to support more options. url provided below.
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/5289/focus=5301
> On to the self-criticism:
>
> I don't intend the final version of this patch to include a printk for
> the RPM or even a printk to say we switched IO elevator. I think we're
> too verbose in SCSI as it is.
>
> I think there's an opportunity to improve sd_vpd_inquiry() to remove
> some of the duplicate code between sd_set_elevator() and sd_block_limits,
> but it's not terribly important.
>
> The switching of the elevators isn't particularly nice. I assume that
> elevator_init("noop") cannot fail, which isn't true. It would be nice
> to use the #if 0 block instead, but that causes a null ptr dereference
> inside sysfs -- I suspect something isn't set up correctly.
>
> Not-signed-off-by: Matthew Wilcox <willy@linux.intel.com>
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 01cefbb..1c5a296 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1534,6 +1534,79 @@ defaults:
> sdkp->DPOFUA = 0;
> }
>
> +static int sd_vpd_inquiry(struct scsi_disk *sdkp, unsigned char *buffer, u8 page, u8 len)
> +{
> + int result;
> + unsigned char cmd[16];
> + struct scsi_sense_hdr sshdr;
> +
> + memset(cmd, 0, 16);
> + cmd[0] = INQUIRY;
> + cmd[1] = 1; /* EVPD */
> + cmd[2] = page; /* VPD page */
> + cmd[3] = len;
> +
> + result = scsi_execute_req(sdkp->device, cmd, DMA_FROM_DEVICE, buffer,
> + len, &sshdr, SD_TIMEOUT, SD_MAX_RETRIES);
> +
> + if (media_not_present(sdkp, &sshdr))
> + return -EIO;
> +
> + if (result) {
> + sd_printk(KERN_ERR, sdkp, "EVPD Inquiry failed\n");
> + return -EIO;
> + }
> +
> + if (buffer[1] != page) {
> + sd_printk(KERN_ERR, sdkp, "Page code not %2x (%2x)\n", page,
> + buffer[1]);
> + return -EIO;
> + }
> +
> + return buffer[3];
> +}
> +
> +static void sd_set_elevator(struct scsi_disk *sdkp, unsigned char *buffer)
> +{
> + struct scsi_device *sdp = sdkp->device;
> + int res, i, rotation;
> +
> + res = sd_vpd_inquiry(sdkp, buffer, 0, 255);
> + if (res < 0)
> + return;
> +
> + for (i = 0; i < buffer[3]; i++)
> + if (buffer[i + 4] == 0xb1)
> + goto found;
> + return;
> +
> + found:
> + res = sd_vpd_inquiry(sdkp, buffer, 0xb1, 64);
> + if (res < 0)
> + return;
> +
> + rotation = (buffer[4] << 8) | buffer[5];
> +
> + if (rotation == 0)
> + return;
> +
> + if (rotation == 0x0001) {
> +#if 0
> + res = elv_iosched_store(sdp->request_queue, "noop", 5);
> + if (res >= 0)
> + sd_printk(KERN_INFO, sdkp,
> + "Switched to noop elevator\n");
> +#else
> + elevator_exit(sdp->request_queue->elevator);
> + sdp->request_queue->elevator = NULL;
> + elevator_init(sdp->request_queue, "noop");
> + sd_printk(KERN_INFO, sdkp, "Switched to noop elevator\n");
> +#endif
> + } else if (rotation > 0x400) {
> + sd_printk(KERN_INFO, sdkp, "%u RPM drive", rotation);
> + }
> +}
> +
> /**
> * sd_revalidate_disk - called the first time a new disk is seen,
> * performs disk spin up, read_capacity, etc.
> @@ -1581,6 +1654,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
> sd_read_capacity(sdkp, buffer);
> sd_read_write_protect_flag(sdkp, buffer);
> sd_read_cache_type(sdkp, buffer);
> + sd_set_elevator(sdkp, buffer);
> }
>
> /*
> --
> Intel are signing my paycheques ... these opinions are still mine
> "Bill, look, we understand that you're interested in selling us this
> operating system, but compare it to ours. We can't possibly take such
> a retrograde step."
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-19 17:12 ` Mike Anderson
@ 2008-06-19 18:10 ` Matthew Wilcox
0 siblings, 0 replies; 34+ messages in thread
From: Matthew Wilcox @ 2008-06-19 18:10 UTC (permalink / raw)
To: Mike Anderson; +Cc: linux-scsi
On Thu, Jun 19, 2008 at 10:12:58AM -0700, Mike Anderson wrote:
> Why would we want to do this in the kernel instead of a udev rule
> similar to how we set timeouts. Since there already exists a sysfs
> interface to change the IO scheduler it would seem this would be more
> flexible in case someone did not like the kernel policy selected.
You don't lose any flexibility either way; you can always change the
elevator default. I just thought this would be a nice little hack to
give people with SSDs a better default. I'm not against doing it with
udev, but it's not a lot of extra code either way.
> You would need some tool to obtain the vpd data from user space that would
> always be installed.
Yes. I've been using sg_inq from sg3_utils to play with it, but one
could write a specialised tool very quickly.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-19 16:03 [DO NOT APPLY] sd take advantage of rotation speed Matthew Wilcox
2008-06-19 17:12 ` Mike Anderson
@ 2008-06-22 12:16 ` Boaz Harrosh
2008-06-22 13:19 ` Matthew Wilcox
2008-06-22 13:38 ` James Bottomley
2008-06-25 13:47 ` Jens Axboe
3 siblings, 1 reply; 34+ messages in thread
From: Boaz Harrosh @ 2008-06-22 12:16 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-scsi
Matthew Wilcox wrote:
> Use the noop elevator by default for drives that do not spin
>
<snip>
> I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
>
> http://marc.info/?l=linux-scsi&m=121264354724277&w=2
>
> No need to have two copies of that ... but this will conflict with his code.
>
<snip>
> I think there's an opportunity to improve sd_vpd_inquiry() to remove
> some of the duplicate code between sd_set_elevator() and sd_block_limits,
> but it's not terribly important.
>
<snip>
> Not-signed-off-by: Matthew Wilcox <willy@linux.intel.com>
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 01cefbb..1c5a296 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1534,6 +1534,79 @@ defaults:
> sdkp->DPOFUA = 0;
> }
>
> +static int sd_vpd_inquiry(struct scsi_disk *sdkp, unsigned char *buffer, u8 page, u8 len)
> +{
> + int result;
> + unsigned char cmd[16];
> + struct scsi_sense_hdr sshdr;
> +
> + memset(cmd, 0, 16);
> + cmd[0] = INQUIRY;
> + cmd[1] = 1; /* EVPD */
> + cmd[2] = page; /* VPD page */
> + cmd[3] = len;
> +
> + result = scsi_execute_req(sdkp->device, cmd, DMA_FROM_DEVICE, buffer,
> + len, &sshdr, SD_TIMEOUT, SD_MAX_RETRIES);
> +
> + if (media_not_present(sdkp, &sshdr))
> + return -EIO;
> +
> + if (result) {
> + sd_printk(KERN_ERR, sdkp, "EVPD Inquiry failed\n");
> + return -EIO;
> + }
> +
> + if (buffer[1] != page) {
> + sd_printk(KERN_ERR, sdkp, "Page code not %2x (%2x)\n", page,
> + buffer[1]);
> + return -EIO;
> + }
> +
> + return buffer[3];
> +}
I was just copy/pasting the same code into my ULD. Perhaps the low level of above
could be made into a scsi_vpd_inquiry(...). From a quick search it looks like there
are 3 more places that issue _vpd_inquiry in the tree today.
<snip>
Some of the more interesting device information are in vpd's. we should have a generic
user-mode way to inquire them. sysfs structure, ioctl, I'm not sure, something good for
udev. There are tones of such little knobs that could be tuned according to vpd
information.
Thanks for doing this, I will investigate if current iscsi-targets will immediately benefit
from this too. From my testing I found that it is usually better to let the target side
elevator do the work, and set the Initiator elevator to noop. Do you know what the standard
say about the default value should be, in the case the page is not present. I would assume
it means "No spinning media" too, right?
Boaz
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-22 12:16 ` Boaz Harrosh
@ 2008-06-22 13:19 ` Matthew Wilcox
2008-06-22 13:27 ` Boaz Harrosh
0 siblings, 1 reply; 34+ messages in thread
From: Matthew Wilcox @ 2008-06-22 13:19 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: linux-scsi
On Sun, Jun 22, 2008 at 03:16:49PM +0300, Boaz Harrosh wrote:
> Matthew Wilcox wrote:
> > Use the noop elevator by default for drives that do not spin
> >
> <snip>
> > I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
> >
> > http://marc.info/?l=linux-scsi&m=121264354724277&w=2
> >
> > No need to have two copies of that ... but this will conflict with his code.
> >
> <snip>
> > I think there's an opportunity to improve sd_vpd_inquiry() to remove
> > some of the duplicate code between sd_set_elevator() and sd_block_limits,
> > but it's not terribly important.
> >
> > +static int sd_vpd_inquiry(struct scsi_disk *sdkp, unsigned char *buffer, u8 page, u8 len)
> > +{
> > + int result;
> > + unsigned char cmd[16];
> > + struct scsi_sense_hdr sshdr;
> > +
> > + memset(cmd, 0, 16);
> > + cmd[0] = INQUIRY;
> > + cmd[1] = 1; /* EVPD */
> > + cmd[2] = page; /* VPD page */
> > + cmd[3] = len;
> > +
> > + result = scsi_execute_req(sdkp->device, cmd, DMA_FROM_DEVICE, buffer,
> > + len, &sshdr, SD_TIMEOUT, SD_MAX_RETRIES);
> > +
> > + if (media_not_present(sdkp, &sshdr))
> > + return -EIO;
> > +
> > + if (result) {
> > + sd_printk(KERN_ERR, sdkp, "EVPD Inquiry failed\n");
> > + return -EIO;
> > + }
> > +
> > + if (buffer[1] != page) {
> > + sd_printk(KERN_ERR, sdkp, "Page code not %2x (%2x)\n", page,
> > + buffer[1]);
> > + return -EIO;
> > + }
> > +
> > + return buffer[3];
> > +}
>
> I was just copy/pasting the same code into my ULD. Perhaps the low level of above
> could be made into a scsi_vpd_inquiry(...). From a quick search it looks like there
> are 3 more places that issue _vpd_inquiry in the tree today.
Yes, there's nothing sd-specific about it. Let me just make that minor
improvement and I'll send out a proposal for scsi_vpd_inquiry().
> Some of the more interesting device information are in vpd's. we should have a generic
> user-mode way to inquire them. sysfs structure, ioctl, I'm not sure, something good for
> udev. There are tones of such little knobs that could be tuned according to vpd
> information.
We have sg_inq and sg_vpd from Doug's sg3-utils.
> Thanks for doing this, I will investigate if current iscsi-targets will immediately benefit
> from this too. From my testing I found that it is usually better to let the target side
> elevator do the work, and set the Initiator elevator to noop. Do you know what the standard
> say about the default value should be, in the case the page is not present. I would assume
> it means "No spinning media" too, right?
It means "Not reported". Which I translate from standard-ese as
"Could be anything".
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-22 13:19 ` Matthew Wilcox
@ 2008-06-22 13:27 ` Boaz Harrosh
0 siblings, 0 replies; 34+ messages in thread
From: Boaz Harrosh @ 2008-06-22 13:27 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-scsi
Matthew Wilcox wrote:
> On Sun, Jun 22, 2008 at 03:16:49PM +0300, Boaz Harrosh wrote:
>> Matthew Wilcox wrote:
>>> Use the noop elevator by default for drives that do not spin
>>>
>> <snip>
>>> I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
>>>
>>> http://marc.info/?l=linux-scsi&m=121264354724277&w=2
>>>
>>> No need to have two copies of that ... but this will conflict with his code.
>>>
>> <snip>
>>> I think there's an opportunity to improve sd_vpd_inquiry() to remove
>>> some of the duplicate code between sd_set_elevator() and sd_block_limits,
>>> but it's not terribly important.
>>>
>>> +static int sd_vpd_inquiry(struct scsi_disk *sdkp, unsigned char *buffer, u8 page, u8 len)
>>> +{
>>> + int result;
>>> + unsigned char cmd[16];
>>> + struct scsi_sense_hdr sshdr;
>>> +
>>> + memset(cmd, 0, 16);
>>> + cmd[0] = INQUIRY;
>>> + cmd[1] = 1; /* EVPD */
>>> + cmd[2] = page; /* VPD page */
>>> + cmd[3] = len;
>>> +
>>> + result = scsi_execute_req(sdkp->device, cmd, DMA_FROM_DEVICE, buffer,
>>> + len, &sshdr, SD_TIMEOUT, SD_MAX_RETRIES);
>>> +
>>> + if (media_not_present(sdkp, &sshdr))
>>> + return -EIO;
>>> +
>>> + if (result) {
>>> + sd_printk(KERN_ERR, sdkp, "EVPD Inquiry failed\n");
>>> + return -EIO;
>>> + }
>>> +
>>> + if (buffer[1] != page) {
>>> + sd_printk(KERN_ERR, sdkp, "Page code not %2x (%2x)\n", page,
>>> + buffer[1]);
>>> + return -EIO;
>>> + }
>>> +
>>> + return buffer[3];
>>> +}
>> I was just copy/pasting the same code into my ULD. Perhaps the low level of above
>> could be made into a scsi_vpd_inquiry(...). From a quick search it looks like there
>> are 3 more places that issue _vpd_inquiry in the tree today.
>
> Yes, there's nothing sd-specific about it. Let me just make that minor
> improvement and I'll send out a proposal for scsi_vpd_inquiry().
>
>> Some of the more interesting device information are in vpd's. we should have a generic
>> user-mode way to inquire them. sysfs structure, ioctl, I'm not sure, something good for
>> udev. There are tones of such little knobs that could be tuned according to vpd
>> information.
>
> We have sg_inq and sg_vpd from Doug's sg3-utils.
>
>> Thanks for doing this, I will investigate if current iscsi-targets will immediately benefit
>> from this too. From my testing I found that it is usually better to let the target side
>> elevator do the work, and set the Initiator elevator to noop. Do you know what the standard
>> say about the default value should be, in the case the page is not present. I would assume
>> it means "No spinning media" too, right?
>
> It means "Not reported". Which I translate from standard-ese as
> "Could be anything".
>
OK Thanks. I'll make sure all iscsi targets I care about return this page.
Am anticipating this patchset to be submitted, thanks again.
Boaz
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-19 16:03 [DO NOT APPLY] sd take advantage of rotation speed Matthew Wilcox
2008-06-19 17:12 ` Mike Anderson
2008-06-22 12:16 ` Boaz Harrosh
@ 2008-06-22 13:38 ` James Bottomley
2008-06-22 14:03 ` Matthew Wilcox
2008-06-25 13:47 ` Jens Axboe
3 siblings, 1 reply; 34+ messages in thread
From: James Bottomley @ 2008-06-22 13:38 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-scsi
On Thu, 2008-06-19 at 10:03 -0600, Matthew Wilcox wrote:
> Use the noop elevator by default for drives that do not spin
>
> [Not for applying]
>
> SSDs do not benefit from the elevator. It just wastes precious CPU cycles.
> By selecting the noop elevator by default, we can shave a few microseconds
> off each IO.
>
> I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
>
> http://marc.info/?l=linux-scsi&m=121264354724277&w=2
>
> No need to have two copies of that ... but this will conflict with his code.
>
> On to the self-criticism:
>
> I don't intend the final version of this patch to include a printk for
> the RPM or even a printk to say we switched IO elevator. I think we're
> too verbose in SCSI as it is.
>
> I think there's an opportunity to improve sd_vpd_inquiry() to remove
> some of the duplicate code between sd_set_elevator() and sd_block_limits,
> but it's not terribly important.
and actually ses.c as well.
> The switching of the elevators isn't particularly nice. I assume that
> elevator_init("noop") cannot fail, which isn't true. It would be nice
> to use the #if 0 block instead, but that causes a null ptr dereference
> inside sysfs -- I suspect something isn't set up correctly.
I'm really not very keen on this patch, since it's implementing elevator
policy from sd which is a bit of a layering violation (and a policy
should be in userspace one).
What's wrong with doing this entirely from udev (it already issues the
vpd's today, that's how it gets the id from page 0x83 ... it could
easily look at 0xb1 and set the elevator
using /sys/block/<>/queue/scheduler)?.
James
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-22 13:38 ` James Bottomley
@ 2008-06-22 14:03 ` Matthew Wilcox
2008-06-22 14:41 ` Martin K. Petersen
2008-06-22 17:26 ` James Bottomley
0 siblings, 2 replies; 34+ messages in thread
From: Matthew Wilcox @ 2008-06-22 14:03 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-scsi
On Sun, Jun 22, 2008 at 08:38:51AM -0500, James Bottomley wrote:
> > I think there's an opportunity to improve sd_vpd_inquiry() to remove
> > some of the duplicate code between sd_set_elevator() and sd_block_limits,
> > but it's not terribly important.
>
> and actually ses.c as well.
Ah yes. I'll rip it out of ses.c. Thanks.
A question of policy ... should we interrogate page 0 to find out if
page 0x83 exists? You don't currently, but ses.c is limited to devices
with an enclosure ... which is presumably only newer devices. Do we
have any idea if devices blow up on being asked for random VPD that they
might not have?
> > The switching of the elevators isn't particularly nice. I assume that
> > elevator_init("noop") cannot fail, which isn't true. It would be nice
> > to use the #if 0 block instead, but that causes a null ptr dereference
> > inside sysfs -- I suspect something isn't set up correctly.
>
> I'm really not very keen on this patch, since it's implementing elevator
> policy from sd which is a bit of a layering violation (and a policy
> should be in userspace one).
It's only choosing a default, not implementing policy.
> What's wrong with doing this entirely from udev (it already issues the
> vpd's today, that's how it gets the id from page 0x83 ... it could
> easily look at 0xb1 and set the elevator
> using /sys/block/<>/queue/scheduler)?.
I'll take a look at udev on Monday. It certainly seems less
controversial than this patch.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-22 14:03 ` Matthew Wilcox
@ 2008-06-22 14:41 ` Martin K. Petersen
2008-06-22 18:44 ` Matthew Wilcox
2008-06-22 17:26 ` James Bottomley
1 sibling, 1 reply; 34+ messages in thread
From: Martin K. Petersen @ 2008-06-22 14:41 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: James Bottomley, linux-scsi
>>>>> "Matthew" == Matthew Wilcox <matthew@wil.cx> writes:
Matthew> A question of policy ... should we interrogate page 0 to find
Matthew> out if page 0x83 exists? You don't currently, but ses.c is
Matthew> limited to devices with an enclosure ... which is presumably
Matthew> only newer devices. Do we have any idea if devices blow up
Matthew> on being asked for random VPD that they might not have?
We should check.
I actually changed my VPD code a bit on Friday. There's a function to
do the heavy lifting and sanity checking, and a helper that can be
called without worrying about the details. I moved the page list
checking from the caller to the helper function.
--
Martin K. Petersen Oracle Linux Engineering
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1534,6 +1538,109 @@ defaults:
sdkp->DPOFUA = 0;
}
+static int sd_vpd_inquiry(struct scsi_disk *sdkp, unsigned char *buffer, u8 page, u8 len)
+{
+ int result;
+ unsigned char cmd[16];
+ struct scsi_sense_hdr sshdr;
+
+ memset(cmd, 0, 16);
+ cmd[0] = INQUIRY;
+ cmd[1] = 1; /* EVPD */
+ cmd[2] = page; /* VPD page */
+ cmd[3] = len;
+
+ result = scsi_execute_req(sdkp->device, cmd, DMA_FROM_DEVICE, buffer,
+ len, &sshdr, SD_TIMEOUT, SD_MAX_RETRIES);
+
+ if (media_not_present(sdkp, &sshdr))
+ return -EIO;
+
+ if (result) {
+ sd_printk(KERN_ERR, sdkp, "EVPD Inquiry failed\n");
+ return -EIO;
+ }
+
+ if (buffer[1] != page) {
+ sd_printk(KERN_ERR, sdkp, "Page code not %2x (%2x)\n", page,
+ buffer[1]);
+ return -EIO;
+ }
+
+ return buffer[3];
+}
+
+static int sd_get_vpd_page(struct scsi_disk *sdkp, unsigned char *buffer, u8 page, u8 len)
+{
+ int i, result;
+
+ /* Get Supported Pages list */
+ if (sd_vpd_inquiry(sdkp, buffer, 0x0, 255) < 0)
+ return -1;
+
+ for (i = 0 ; i < buffer[3] ; i++)
+ if (buffer[4+i] == page)
+ goto found;
+
+ return -1;
+
+found:
+ result = sd_vpd_inquiry(sdkp, buffer, page, len);
+
+ if (result < 0)
+ return -1;
+
+ if (buffer[3] != len)
+ sd_printk(KERN_ERR, sdkp, "VPD pg %x, req. %u bytes, got %u\n",
+ page, len, buffer[3]);
+
+ return result;
+}
+
+/**
+ * sd_block_limits - Query disk device for preferred I/O sizes.
+ * @disk: disk to query
+ * @buffer: temporary buffer to store inquiry response in
+ */
+static void sd_block_limits(struct scsi_disk *sdkp, unsigned char *buffer)
+{
+ struct io_topology *iot = sdkp->disk->topology;
+
+ /* Block Limits VPD */
+ if (sd_get_vpd_page(sdkp, buffer, 0xb0, 16) < 0)
+ return;
+
+ iot->phys_off = 0;
+ iot->opt_block = be16_to_cpup((__be16 *)&buffer[6])
+ * sdkp->device->sector_size;
+ iot->max_length = be32_to_cpup((__be32 *)&buffer[8])
+ * sdkp->device->sector_size;
+ iot->opt_length = be32_to_cpup((__be32 *)&buffer[12])
+ * sdkp->device->sector_size;
+}
+
+/**
+ * sd_block_characteristics - Query block dev. characteristics
+ * @disk: disk to query
+ * @buffer: temporary buffer to store inquiry response in
+ */
+static void sd_block_characteristics(struct scsi_disk *sdkp, unsigned char *buffer)
+{
+ struct io_topology *iot = sdkp->disk->topology;
+ unsigned int rotation;
+
+ /* Block Device Characteristics VPD */
+ if (sd_get_vpd_page(sdkp, buffer, 0xb1, 64) < 0)
+ return;
+
+ rotation = be16_to_cpup((__be16 *)&buffer[4]);
+
+ if (rotation == 1)
+ iot->dev_type = IO_TYPE_SSD;
+ else if (rotation > 0x400)
+ iot->dev_type = IO_TYPE_DISK;
+}
+
/**
* sd_revalidate_disk - called the first time a new disk is seen,
* performs disk spin up, read_capacity, etc.
@@ -1579,6 +1686,8 @@ static int sd_revalidate_disk(struct gen
*/
if (sdkp->media_present) {
sd_read_capacity(sdkp, buffer);
+ sd_block_limits(sdkp, buffer);
+ sd_block_characteristics(sdkp, buffer);
sd_read_write_protect_flag(sdkp, buffer);
sd_read_cache_type(sdkp, buffer);
}
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-22 14:41 ` Martin K. Petersen
@ 2008-06-22 18:44 ` Matthew Wilcox
2008-06-25 2:06 ` Martin K. Petersen
0 siblings, 1 reply; 34+ messages in thread
From: Matthew Wilcox @ 2008-06-22 18:44 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: James Bottomley, linux-scsi
On Sun, Jun 22, 2008 at 10:41:35AM -0400, Martin K. Petersen wrote:
> + memset(cmd, 0, 16);
> + cmd[0] = INQUIRY;
> + cmd[1] = 1; /* EVPD */
> + cmd[2] = page; /* VPD page */
> + cmd[3] = len;
This looks like a bug to me.
The MSB of the length is supposed to be in byte 3 and the LSB in byte 4.
So you're asking for much more data than could fit in the buffer --
fortunately the device won't be able to return as much as you ask for,
so the buffer won't overflow.
> + result = scsi_execute_req(sdkp->device, cmd, DMA_FROM_DEVICE, buffer,
> + len, &sshdr, SD_TIMEOUT, SD_MAX_RETRIES);
> +
> + if (media_not_present(sdkp, &sshdr))
> + return -EIO;
> +
> + if (result) {
> + sd_printk(KERN_ERR, sdkp, "EVPD Inquiry failed\n");
> + return -EIO;
> + }
> +
> + if (buffer[1] != page) {
> + sd_printk(KERN_ERR, sdkp, "Page code not %2x (%2x)\n", page,
> + buffer[1]);
> + return -EIO;
> + }
> +
> + return buffer[3];
This isn't true for all pages. For example, 0x83 and 0x85 have a length in
buffer[2] and [3]. Now, all other pages I've looked at reserve byte 2,
and devices should be setting that to 0 ... but I wouldn't feel right
depending on it. For scsi_request_vpd(), I think we should return 0 for
success and -Error. The caller knows which page it requested and can
find out the length for itself.
> +static int sd_get_vpd_page(struct scsi_disk *sdkp, unsigned char *buffer, u8 page, u8 len)
> +{
> + int i, result;
> +
> + /* Get Supported Pages list */
> + if (sd_vpd_inquiry(sdkp, buffer, 0x0, 255) < 0)
> + return -1;
> +
> + for (i = 0 ; i < buffer[3] ; i++)
> + if (buffer[4+i] == page)
> + goto found;
> +
> + return -1;
> +
> +found:
> + result = sd_vpd_inquiry(sdkp, buffer, page, len);
> +
> + if (result < 0)
> + return -1;
> +
> + if (buffer[3] != len)
> + sd_printk(KERN_ERR, sdkp, "VPD pg %x, req. %u bytes, got %u\n",
> + page, len, buffer[3]);
> +
> + return result;
> +}
Yes, this refactoring is exactly what I had in mind.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-22 18:44 ` Matthew Wilcox
@ 2008-06-25 2:06 ` Martin K. Petersen
0 siblings, 0 replies; 34+ messages in thread
From: Martin K. Petersen @ 2008-06-25 2:06 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Martin K. Petersen, James Bottomley, linux-scsi
>>>>> "Matthew" == Matthew Wilcox <matthew@wil.cx> writes:
Matthew> On Sun, Jun 22, 2008 at 10:41:35AM -0400, Martin K. Petersen wrote:
>> + memset(cmd, 0, 16); + cmd[0] = INQUIRY; + cmd[1] = 1; /* EVPD */
>> + cmd[2] = page; /* VPD page */ + cmd[3] = len;
Matthew> This looks like a bug to me.
Yeah, what I meant was cmd[4] = len;
Matthew> This isn't true for all pages. For example, 0x83 and 0x85
Matthew> have a length in buffer[2] and [3].
Fair enough. I only looked at b0 and b1.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-22 14:03 ` Matthew Wilcox
2008-06-22 14:41 ` Martin K. Petersen
@ 2008-06-22 17:26 ` James Bottomley
1 sibling, 0 replies; 34+ messages in thread
From: James Bottomley @ 2008-06-22 17:26 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-scsi
On Sun, 2008-06-22 at 08:03 -0600, Matthew Wilcox wrote:
> On Sun, Jun 22, 2008 at 08:38:51AM -0500, James Bottomley wrote:
> > > I think there's an opportunity to improve sd_vpd_inquiry() to remove
> > > some of the duplicate code between sd_set_elevator() and sd_block_limits,
> > > but it's not terribly important.
> >
> > and actually ses.c as well.
>
> Ah yes. I'll rip it out of ses.c. Thanks.
>
> A question of policy ... should we interrogate page 0 to find out if
> page 0x83 exists? You don't currently, but ses.c is limited to devices
> with an enclosure ... which is presumably only newer devices. Do we
> have any idea if devices blow up on being asked for random VPD that they
> might not have?
It's wise to. For true SCSI devices (going all the way back to my 8"
HVD FAST-10 ones of ca 1989) the answer is that they respond correctly
(as in ILLEGAL REQUEST). For the ATA devices through SAT, the answer's
also that. Unfortunately, for devices which have their own eccentric
SCSI interpreters (glares at USB) the answer is that they can do
horrible things: we had a bug filed only this year where a VPD inquiry
was crashing a USB device. This is another reason for punting this to
userspace. The heuristics of what works are already embedded into udev.
I think the solution is eventually just to push all VPD stuff to
userspace (including ses), but that's long term.
> > > The switching of the elevators isn't particularly nice. I assume that
> > > elevator_init("noop") cannot fail, which isn't true. It would be nice
> > > to use the #if 0 block instead, but that causes a null ptr dereference
> > > inside sysfs -- I suspect something isn't set up correctly.
> >
> > I'm really not very keen on this patch, since it's implementing elevator
> > policy from sd which is a bit of a layering violation (and a policy
> > should be in userspace one).
>
> It's only choosing a default, not implementing policy.
>
> > What's wrong with doing this entirely from udev (it already issues the
> > vpd's today, that's how it gets the id from page 0x83 ... it could
> > easily look at 0xb1 and set the elevator
> > using /sys/block/<>/queue/scheduler)?.
>
> I'll take a look at udev on Monday. It certainly seems less
> controversial than this patch.
Thanks!
James
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-19 16:03 [DO NOT APPLY] sd take advantage of rotation speed Matthew Wilcox
` (2 preceding siblings ...)
2008-06-22 13:38 ` James Bottomley
@ 2008-06-25 13:47 ` Jens Axboe
2008-06-25 13:57 ` Jens Axboe
2008-06-25 14:24 ` Ric Wheeler
3 siblings, 2 replies; 34+ messages in thread
From: Jens Axboe @ 2008-06-25 13:47 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-scsi
On Thu, Jun 19 2008, Matthew Wilcox wrote:
> Use the noop elevator by default for drives that do not spin
>
> [Not for applying]
>
> SSDs do not benefit from the elevator. It just wastes precious CPU cycles.
> By selecting the noop elevator by default, we can shave a few microseconds
> off each IO.
>
> I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
>
> http://marc.info/?l=linux-scsi&m=121264354724277&w=2
>
> No need to have two copies of that ... but this will conflict with his code.
>
> On to the self-criticism:
>
> I don't intend the final version of this patch to include a printk for
> the RPM or even a printk to say we switched IO elevator. I think we're
> too verbose in SCSI as it is.
>
> I think there's an opportunity to improve sd_vpd_inquiry() to remove
> some of the duplicate code between sd_set_elevator() and sd_block_limits,
> but it's not terribly important.
>
> The switching of the elevators isn't particularly nice. I assume that
> elevator_init("noop") cannot fail, which isn't true. It would be nice
> to use the #if 0 block instead, but that causes a null ptr dereference
> inside sysfs -- I suspect something isn't set up correctly.
I disagree with this approach. For now, lets just add a queue flag that
says the device doesn't have a seek penalty and let the io schedulers do
what they need to avoid that (it'd be a one-liner change to cfq and as).
There's more to io scheduling than just seek reduction, so this is the
wrong direction to take imo.
--
Jens Axboe
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 13:47 ` Jens Axboe
@ 2008-06-25 13:57 ` Jens Axboe
2008-06-25 14:24 ` Ric Wheeler
1 sibling, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2008-06-25 13:57 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-scsi
On Wed, Jun 25 2008, Jens Axboe wrote:
> On Thu, Jun 19 2008, Matthew Wilcox wrote:
> > Use the noop elevator by default for drives that do not spin
> >
> > [Not for applying]
> >
> > SSDs do not benefit from the elevator. It just wastes precious CPU cycles.
> > By selecting the noop elevator by default, we can shave a few microseconds
> > off each IO.
> >
> > I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
> >
> > http://marc.info/?l=linux-scsi&m=121264354724277&w=2
> >
> > No need to have two copies of that ... but this will conflict with his code.
> >
> > On to the self-criticism:
> >
> > I don't intend the final version of this patch to include a printk for
> > the RPM or even a printk to say we switched IO elevator. I think we're
> > too verbose in SCSI as it is.
> >
> > I think there's an opportunity to improve sd_vpd_inquiry() to remove
> > some of the duplicate code between sd_set_elevator() and sd_block_limits,
> > but it's not terribly important.
> >
> > The switching of the elevators isn't particularly nice. I assume that
> > elevator_init("noop") cannot fail, which isn't true. It would be nice
> > to use the #if 0 block instead, but that causes a null ptr dereference
> > inside sysfs -- I suspect something isn't set up correctly.
>
> I disagree with this approach. For now, lets just add a queue flag that
> says the device doesn't have a seek penalty and let the io schedulers do
> what they need to avoid that (it'd be a one-liner change to cfq and as).
> There's more to io scheduling than just seek reduction, so this is the
> wrong direction to take imo.
OK, a two liner for CFQ. Something like this would be much better I
think - maintain the check in SCSI but flag the queue as seek-free if
SSD is detected, and let the lower layer worry about the details. Much
cleaner, doesn't allow dirty SCSI fingers where they don't belong.
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8dd8641..afbf6e0 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -410,6 +410,18 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
}
EXPORT_SYMBOL(blk_queue_update_dma_alignment);
+void blk_queue_seek_free(struct request_queue *q, int set)
+{
+ /*
+ * initialization stuff, don't worry about concurrency
+ */
+ if (set)
+ queue_flag_set_unlocked(QUEUE_FLAG_SEEKFREE, q);
+ else
+ queue_flag_clear_unlocked(QUEUE_FLAG_SEEKFREE, q);
+}
+EXPORT_SYMBOL(blk_queue_seek_free);
+
static int __init blk_settings_init(void)
{
blk_max_low_pfn = max_low_pfn - 1;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d01b411..3f654d9 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1423,7 +1423,8 @@ retry:
cfq_init_prio_data(cfqq, ioc);
if (is_sync) {
- if (!cfq_class_idle(cfqq))
+ if (!cfq_class_idle(cfqq) &&
+ !test_bit(QUEUE_FLAG_SEEKFREE, &cfqd->queue->queue_flags))
cfq_mark_cfqq_idle_window(cfqq);
cfq_mark_cfqq_sync(cfqq);
}
@@ -1680,7 +1681,8 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
/*
* Don't idle for async or idle io prio class
*/
- if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
+ if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq) ||
+ test_bit(QUEUE_FLAG_SEEKFREE, &cfqd->queue->queue_flags))
return;
enable_idle = cfq_cfqq_idle_window(cfqq);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d2a1b71..8e77cb5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -409,6 +409,7 @@ struct request_queue
#define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */
#define QUEUE_FLAG_BIDI 9 /* queue supports bidi requests */
#define QUEUE_FLAG_NOMERGES 10 /* disable merge attempts */
+#define QUEUE_FLAG_SEEKFREE 11 /* seeks are free */
static inline int queue_is_locked(struct request_queue *q)
{
@@ -757,6 +758,7 @@ extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
extern void blk_queue_merge_bvec(struct request_queue *, merge_bvec_fn *);
extern void blk_queue_dma_alignment(struct request_queue *, int);
extern void blk_queue_update_dma_alignment(struct request_queue *, int);
+extern void blk_queue_seek_free(struct request_queue *, int);
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
extern int blk_queue_ordered(struct request_queue *, unsigned, prepare_flush_fn *);
--
Jens Axboe
^ permalink raw reply related [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 13:47 ` Jens Axboe
2008-06-25 13:57 ` Jens Axboe
@ 2008-06-25 14:24 ` Ric Wheeler
2008-06-25 16:25 ` Boaz Harrosh
1 sibling, 1 reply; 34+ messages in thread
From: Ric Wheeler @ 2008-06-25 14:24 UTC (permalink / raw)
To: Jens Axboe; +Cc: Matthew Wilcox, linux-scsi
Jens Axboe wrote:
> On Thu, Jun 19 2008, Matthew Wilcox wrote:
>
>> Use the noop elevator by default for drives that do not spin
>>
>> [Not for applying]
>>
>> SSDs do not benefit from the elevator. It just wastes precious CPU cycles.
>> By selecting the noop elevator by default, we can shave a few microseconds
>> off each IO.
>>
>> I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
>>
>> http://marc.info/?l=linux-scsi&m=121264354724277&w=2
>>
>> No need to have two copies of that ... but this will conflict with his code.
>>
>> On to the self-criticism:
>>
>> I don't intend the final version of this patch to include a printk for
>> the RPM or even a printk to say we switched IO elevator. I think we're
>> too verbose in SCSI as it is.
>>
>> I think there's an opportunity to improve sd_vpd_inquiry() to remove
>> some of the duplicate code between sd_set_elevator() and sd_block_limits,
>> but it's not terribly important.
>>
>> The switching of the elevators isn't particularly nice. I assume that
>> elevator_init("noop") cannot fail, which isn't true. It would be nice
>> to use the #if 0 block instead, but that causes a null ptr dereference
>> inside sysfs -- I suspect something isn't set up correctly.
>>
>
> I disagree with this approach. For now, lets just add a queue flag that
> says the device doesn't have a seek penalty and let the io schedulers do
> what they need to avoid that (it'd be a one-liner change to cfq and as).
> There's more to io scheduling than just seek reduction, so this is the
> wrong direction to take imo.
>
>
Very true - you still will get a significant win by coalescing IO's (say
for example, to do larger, aligned writes to flash devices).
ric
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 14:24 ` Ric Wheeler
@ 2008-06-25 16:25 ` Boaz Harrosh
2008-06-25 16:57 ` Jens Axboe
0 siblings, 1 reply; 34+ messages in thread
From: Boaz Harrosh @ 2008-06-25 16:25 UTC (permalink / raw)
To: Ric Wheeler; +Cc: Jens Axboe, Matthew Wilcox, linux-scsi
Ric Wheeler wrote:
> Jens Axboe wrote:
>> On Thu, Jun 19 2008, Matthew Wilcox wrote:
>>
>>> Use the noop elevator by default for drives that do not spin
>>>
>>> [Not for applying]
>>>
>>> SSDs do not benefit from the elevator. It just wastes precious CPU cycles.
>>> By selecting the noop elevator by default, we can shave a few microseconds
>>> off each IO.
>>>
>>> I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
>>>
>>> http://marc.info/?l=linux-scsi&m=121264354724277&w=2
>>>
>>> No need to have two copies of that ... but this will conflict with his code.
>>>
>>> On to the self-criticism:
>>>
>>> I don't intend the final version of this patch to include a printk for
>>> the RPM or even a printk to say we switched IO elevator. I think we're
>>> too verbose in SCSI as it is.
>>>
>>> I think there's an opportunity to improve sd_vpd_inquiry() to remove
>>> some of the duplicate code between sd_set_elevator() and sd_block_limits,
>>> but it's not terribly important.
>>>
>>> The switching of the elevators isn't particularly nice. I assume that
>>> elevator_init("noop") cannot fail, which isn't true. It would be nice
>>> to use the #if 0 block instead, but that causes a null ptr dereference
>>> inside sysfs -- I suspect something isn't set up correctly.
>>>
>> I disagree with this approach. For now, lets just add a queue flag that
>> says the device doesn't have a seek penalty and let the io schedulers do
>> what they need to avoid that (it'd be a one-liner change to cfq and as).
>> There's more to io scheduling than just seek reduction, so this is the
>> wrong direction to take imo.
>>
>>
> Very true - you still will get a significant win by coalescing IO's (say
> for example, to do larger, aligned writes to flash devices).
>
> ric
>
And to not let HUGE writers hug the machine. A scheduler ...
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 16:25 ` Boaz Harrosh
@ 2008-06-25 16:57 ` Jens Axboe
2008-06-25 17:20 ` Matthew Wilcox
0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2008-06-25 16:57 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: Ric Wheeler, Matthew Wilcox, linux-scsi
On Wed, Jun 25 2008, Boaz Harrosh wrote:
> Ric Wheeler wrote:
> > Jens Axboe wrote:
> >> On Thu, Jun 19 2008, Matthew Wilcox wrote:
> >>
> >>> Use the noop elevator by default for drives that do not spin
> >>>
> >>> [Not for applying]
> >>>
> >>> SSDs do not benefit from the elevator. It just wastes precious CPU cycles.
> >>> By selecting the noop elevator by default, we can shave a few microseconds
> >>> off each IO.
> >>>
> >>> I've brazenly stolen sd_vpd_inquiry from mkp's patch here:
> >>>
> >>> http://marc.info/?l=linux-scsi&m=121264354724277&w=2
> >>>
> >>> No need to have two copies of that ... but this will conflict with his code.
> >>>
> >>> On to the self-criticism:
> >>>
> >>> I don't intend the final version of this patch to include a printk for
> >>> the RPM or even a printk to say we switched IO elevator. I think we're
> >>> too verbose in SCSI as it is.
> >>>
> >>> I think there's an opportunity to improve sd_vpd_inquiry() to remove
> >>> some of the duplicate code between sd_set_elevator() and sd_block_limits,
> >>> but it's not terribly important.
> >>>
> >>> The switching of the elevators isn't particularly nice. I assume that
> >>> elevator_init("noop") cannot fail, which isn't true. It would be nice
> >>> to use the #if 0 block instead, but that causes a null ptr dereference
> >>> inside sysfs -- I suspect something isn't set up correctly.
> >>>
> >> I disagree with this approach. For now, lets just add a queue flag that
> >> says the device doesn't have a seek penalty and let the io schedulers do
> >> what they need to avoid that (it'd be a one-liner change to cfq and as).
> >> There's more to io scheduling than just seek reduction, so this is the
> >> wrong direction to take imo.
> >>
> >>
> > Very true - you still will get a significant win by coalescing IO's (say
> > for example, to do larger, aligned writes to flash devices).
> >
> > ric
> >
> And to not let HUGE writers hug the machine. A scheduler ...
Precisely, merging and fairness is a big part of it. Plus, doing this in
sd is just a blatant layer violation.
--
Jens Axboe
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 16:57 ` Jens Axboe
@ 2008-06-25 17:20 ` Matthew Wilcox
2008-06-25 17:26 ` Jens Axboe
0 siblings, 1 reply; 34+ messages in thread
From: Matthew Wilcox @ 2008-06-25 17:20 UTC (permalink / raw)
To: Jens Axboe; +Cc: Boaz Harrosh, Ric Wheeler, linux-scsi
On Wed, Jun 25, 2008 at 06:57:59PM +0200, Jens Axboe wrote:
> Precisely, merging and fairness is a big part of it. Plus, doing this in
> sd is just a blatant layer violation.
I'm fine with tweaking the elevators to know about low-latency seeks.
But please stop describing it as "a blatant layering violation".
It's nothing of the sort. It's setting a default at a point where we
find out the information which would guide us.
I haven't looked into doing this in udev yet; I've got caught up in
another project. In any case, it seems like there's no role for udev
any more, so I'll probably not spend any more time on this unless there's
some need.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 17:20 ` Matthew Wilcox
@ 2008-06-25 17:26 ` Jens Axboe
2008-06-25 17:34 ` Matthew Wilcox
2008-06-25 18:06 ` Martin K. Petersen
0 siblings, 2 replies; 34+ messages in thread
From: Jens Axboe @ 2008-06-25 17:26 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Boaz Harrosh, Ric Wheeler, linux-scsi
On Wed, Jun 25 2008, Matthew Wilcox wrote:
> On Wed, Jun 25, 2008 at 06:57:59PM +0200, Jens Axboe wrote:
> > Precisely, merging and fairness is a big part of it. Plus, doing this in
> > sd is just a blatant layer violation.
>
> I'm fine with tweaking the elevators to know about low-latency seeks.
> But please stop describing it as "a blatant layering violation".
> It's nothing of the sort. It's setting a default at a point where we
> find out the information which would guide us.
Uhm, but it IS "a blatant layering violation", it's doing things from
the wrong side up :-)
> I haven't looked into doing this in udev yet; I've got caught up in
> another project. In any case, it seems like there's no role for udev
> any more, so I'll probably not spend any more time on this unless there's
> some need.
I don't think udev is a particularly good idea either. My plan was to
introduce device profiles for the queue, but it is probably a bad idea
to over-engineer this. So I'll keep it simple, stick to a 'zero cost
seek' flag instead and allow drivers to signal that. libata/ide needs to
check the ID page word to detect SSD drives as well, so they need a few
lines of change too.
I'll just stick the block bit in the 2.6.27 pending queue and let the
other patches go through Jeff/James/Bart.
--
Jens Axboe
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 17:26 ` Jens Axboe
@ 2008-06-25 17:34 ` Matthew Wilcox
2008-06-25 17:43 ` James Bottomley
2008-06-25 17:59 ` Jens Axboe
2008-06-25 18:06 ` Martin K. Petersen
1 sibling, 2 replies; 34+ messages in thread
From: Matthew Wilcox @ 2008-06-25 17:34 UTC (permalink / raw)
To: Jens Axboe; +Cc: Boaz Harrosh, Ric Wheeler, linux-scsi
On Wed, Jun 25, 2008 at 07:26:39PM +0200, Jens Axboe wrote:
> Uhm, but it IS "a blatant layering violation", it's doing things from
> the wrong side up :-)
That's just "doing things from a spot Jens doesn't approve of". A
layering violation would be SCSI knowing how elevators work.
> I don't think udev is a particularly good idea either. My plan was to
> introduce device profiles for the queue, but it is probably a bad idea
> to over-engineer this. So I'll keep it simple, stick to a 'zero cost
> seek' flag instead and allow drivers to signal that. libata/ide needs to
> check the ID page word to detect SSD drives as well, so they need a few
> lines of change too.
One of the other patches in this mess was the libata piece to translate
ATA8's rotation speed to SCSI's b1 VPD page, so ide is the only
remaining work left to do (can't it die already?)
> I'll just stick the block bit in the 2.6.27 pending queue and let the
> other patches go through Jeff/James/Bart.
Thanks.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 17:34 ` Matthew Wilcox
@ 2008-06-25 17:43 ` James Bottomley
2008-06-25 17:53 ` Matthew Wilcox
2008-06-25 17:59 ` Jens Axboe
1 sibling, 1 reply; 34+ messages in thread
From: James Bottomley @ 2008-06-25 17:43 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Jens Axboe, Boaz Harrosh, Ric Wheeler, linux-scsi
On Wed, 2008-06-25 at 11:34 -0600, Matthew Wilcox wrote:
> On Wed, Jun 25, 2008 at 07:26:39PM +0200, Jens Axboe wrote:
> > Uhm, but it IS "a blatant layering violation", it's doing things from
> > the wrong side up :-)
>
> That's just "doing things from a spot Jens doesn't approve of". A
> layering violation would be SCSI knowing how elevators work.
When it was setting particular elevators, it was a layering violation
(as I said at the time). If it's just setting a seek cost hint, that's
acceptable.
> > I don't think udev is a particularly good idea either. My plan was to
> > introduce device profiles for the queue, but it is probably a bad idea
> > to over-engineer this. So I'll keep it simple, stick to a 'zero cost
> > seek' flag instead and allow drivers to signal that. libata/ide needs to
> > check the ID page word to detect SSD drives as well, so they need a few
> > lines of change too.
>
> One of the other patches in this mess was the libata piece to translate
> ATA8's rotation speed to SCSI's b1 VPD page, so ide is the only
> remaining work left to do (can't it die already?)
I don't think there is any IDE work to do ... the last I heard from the
manufacturers, they were all not going to bother with PATA interfaces to
SSDs ... unless this has changed?
> > I'll just stick the block bit in the 2.6.27 pending queue and let the
> > other patches go through Jeff/James/Bart.
James
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 17:43 ` James Bottomley
@ 2008-06-25 17:53 ` Matthew Wilcox
2008-06-25 18:01 ` Jens Axboe
2008-06-25 18:06 ` James Bottomley
0 siblings, 2 replies; 34+ messages in thread
From: Matthew Wilcox @ 2008-06-25 17:53 UTC (permalink / raw)
To: James Bottomley; +Cc: Jens Axboe, Boaz Harrosh, Ric Wheeler, linux-scsi
On Wed, Jun 25, 2008 at 12:43:50PM -0500, James Bottomley wrote:
> On Wed, 2008-06-25 at 11:34 -0600, Matthew Wilcox wrote:
> > On Wed, Jun 25, 2008 at 07:26:39PM +0200, Jens Axboe wrote:
> > > Uhm, but it IS "a blatant layering violation", it's doing things from
> > > the wrong side up :-)
> >
> > That's just "doing things from a spot Jens doesn't approve of". A
> > layering violation would be SCSI knowing how elevators work.
>
> When it was setting particular elevators, it was a layering violation
> (as I said at the time). If it's just setting a seek cost hint, that's
> acceptable.
You're both being silly. Interacting with a subsystem via its exposed
interfaces is not a layering violation.
> I don't think there is any IDE work to do ... the last I heard from the
> manufacturers, they were all not going to bother with PATA interfaces to
> SSDs ... unless this has changed?
There are people with PATA->SATA adapters. I think there's one in your
P7120 (or is that one a PATA->SATA adapter?) In any case, I don't think
we need to do anything until someone complains.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 17:53 ` Matthew Wilcox
@ 2008-06-25 18:01 ` Jens Axboe
2008-06-25 18:06 ` James Bottomley
1 sibling, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2008-06-25 18:01 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: James Bottomley, Boaz Harrosh, Ric Wheeler, linux-scsi
On Wed, Jun 25 2008, Matthew Wilcox wrote:
> On Wed, Jun 25, 2008 at 12:43:50PM -0500, James Bottomley wrote:
> > On Wed, 2008-06-25 at 11:34 -0600, Matthew Wilcox wrote:
> > > On Wed, Jun 25, 2008 at 07:26:39PM +0200, Jens Axboe wrote:
> > > > Uhm, but it IS "a blatant layering violation", it's doing things from
> > > > the wrong side up :-)
> > >
> > > That's just "doing things from a spot Jens doesn't approve of". A
> > > layering violation would be SCSI knowing how elevators work.
> >
> > When it was setting particular elevators, it was a layering violation
> > (as I said at the time). If it's just setting a seek cost hint, that's
> > acceptable.
>
> You're both being silly. Interacting with a subsystem via its exposed
> interfaces is not a layering violation.
Dude, get a grip, your line is reasoning is getting silly. By that
logic, as long as something is exported you can use it from wherever you
want? Uhm right.
--
Jens Axboe
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 17:53 ` Matthew Wilcox
2008-06-25 18:01 ` Jens Axboe
@ 2008-06-25 18:06 ` James Bottomley
1 sibling, 0 replies; 34+ messages in thread
From: James Bottomley @ 2008-06-25 18:06 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Jens Axboe, Boaz Harrosh, Ric Wheeler, linux-scsi
On Wed, 2008-06-25 at 11:53 -0600, Matthew Wilcox wrote:
> On Wed, Jun 25, 2008 at 12:43:50PM -0500, James Bottomley wrote:
> > On Wed, 2008-06-25 at 11:34 -0600, Matthew Wilcox wrote:
> > > On Wed, Jun 25, 2008 at 07:26:39PM +0200, Jens Axboe wrote:
> > > > Uhm, but it IS "a blatant layering violation", it's doing things from
> > > > the wrong side up :-)
> > >
> > > That's just "doing things from a spot Jens doesn't approve of". A
> > > layering violation would be SCSI knowing how elevators work.
> >
> > When it was setting particular elevators, it was a layering violation
> > (as I said at the time). If it's just setting a seek cost hint, that's
> > acceptable.
>
> You're both being silly. Interacting with a subsystem via its exposed
> interfaces is not a layering violation.
It depends on the interface: sys_open is an exposed interface, but any
SCSI driver trying to use it will be caught and shot for egregious
layering violations.
The elevator setting interface is a policy interface designed for users,
not for lower layers of the I/O stack.
> > I don't think there is any IDE work to do ... the last I heard from the
> > manufacturers, they were all not going to bother with PATA interfaces to
> > SSDs ... unless this has changed?
>
> There are people with PATA->SATA adapters. I think there's one in your
> P7120 (or is that one a PATA->SATA adapter?)
No ... it's using an IHC6 with two interfaces, one SATA and one PATA.
> In any case, I don't think
> we need to do anything until someone complains.
Agreed.
James
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 17:34 ` Matthew Wilcox
2008-06-25 17:43 ` James Bottomley
@ 2008-06-25 17:59 ` Jens Axboe
1 sibling, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2008-06-25 17:59 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Boaz Harrosh, Ric Wheeler, linux-scsi
On Wed, Jun 25 2008, Matthew Wilcox wrote:
> On Wed, Jun 25, 2008 at 07:26:39PM +0200, Jens Axboe wrote:
> > Uhm, but it IS "a blatant layering violation", it's doing things from
> > the wrong side up :-)
>
> That's just "doing things from a spot Jens doesn't approve of". A
> layering violation would be SCSI knowing how elevators work.
And that's what the patch is doing, it knows that CFQ/AS idles and thus
chooses noop. So it's even doing it wrong, since that disables other
goodness as well. The whole sd_set_elevator() is a mess, I honestly
can't see how you can even start to defend it.
I don't really appreciate the personal bit either.
> > I don't think udev is a particularly good idea either. My plan was to
> > introduce device profiles for the queue, but it is probably a bad idea
> > to over-engineer this. So I'll keep it simple, stick to a 'zero cost
> > seek' flag instead and allow drivers to signal that. libata/ide needs to
> > check the ID page word to detect SSD drives as well, so they need a few
> > lines of change too.
>
> One of the other patches in this mess was the libata piece to translate
> ATA8's rotation speed to SCSI's b1 VPD page, so ide is the only
> remaining work left to do (can't it die already?)
IDE is definitely low priority :-). I had a pata ssd drive, but IDE is
only really interesting for the bits that can't be libata driven. So not
very interesting.
ATA8 seems to have changed the SSD detection scheme. I don't have an SSD
disk so I cannot check what the current models report.
--
Jens Axboe
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 17:26 ` Jens Axboe
2008-06-25 17:34 ` Matthew Wilcox
@ 2008-06-25 18:06 ` Martin K. Petersen
2008-06-25 18:12 ` Jens Axboe
2008-07-28 13:36 ` Ric Wheeler
1 sibling, 2 replies; 34+ messages in thread
From: Martin K. Petersen @ 2008-06-25 18:06 UTC (permalink / raw)
To: Jens Axboe; +Cc: Matthew Wilcox, Boaz Harrosh, Ric Wheeler, linux-scsi
>>>>> "Jens" == Jens Axboe <jens.axboe@oracle.com> writes:
Jens> Uhm, but it IS "a blatant layering violation", it's doing things
Jens> from the wrong side up :-)
I also disagree with mucking with the elevator directly in sd.c.
The whole point of my I/O hints work is to expose the I/O topology in
a generic fashion so that upper layers can take advantage of them.
I'll post a new batch of those patches shortly...
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 18:06 ` Martin K. Petersen
@ 2008-06-25 18:12 ` Jens Axboe
2008-07-28 13:36 ` Ric Wheeler
1 sibling, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2008-06-25 18:12 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: Matthew Wilcox, Boaz Harrosh, Ric Wheeler, linux-scsi
On Wed, Jun 25 2008, Martin K. Petersen wrote:
> >>>>> "Jens" == Jens Axboe <jens.axboe@oracle.com> writes:
>
> Jens> Uhm, but it IS "a blatant layering violation", it's doing things
> Jens> from the wrong side up :-)
>
> I also disagree with mucking with the elevator directly in sd.c.
>
> The whole point of my I/O hints work is to expose the I/O topology in
> a generic fashion so that upper layers can take advantage of them.
>
> I'll post a new batch of those patches shortly...
Good, that sounds useful! Sounds a lot like what I referred to as
'device profiles'.
--
Jens Axboe
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-06-25 18:06 ` Martin K. Petersen
2008-06-25 18:12 ` Jens Axboe
@ 2008-07-28 13:36 ` Ric Wheeler
2008-07-28 14:10 ` James Bottomley
2008-07-28 14:31 ` Martin K. Petersen
1 sibling, 2 replies; 34+ messages in thread
From: Ric Wheeler @ 2008-07-28 13:36 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: Jens Axboe, Matthew Wilcox, Boaz Harrosh, linux-scsi
Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <jens.axboe@oracle.com> writes:
>>>>>>
>
> Jens> Uhm, but it IS "a blatant layering violation", it's doing things
> Jens> from the wrong side up :-)
>
> I also disagree with mucking with the elevator directly in sd.c.
>
> The whole point of my I/O hints work is to expose the I/O topology in
> a generic fashion so that upper layers can take advantage of them.
>
> I'll post a new batch of those patches shortly...
>
>
One other thought - is there a way to give non-rotational devices also
some indication of latency? (FLASH is slower than enterprise SSD is
slower than DRAM ramdisk for example)?
ric
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-07-28 13:36 ` Ric Wheeler
@ 2008-07-28 14:10 ` James Bottomley
2008-07-28 14:31 ` Martin K. Petersen
1 sibling, 0 replies; 34+ messages in thread
From: James Bottomley @ 2008-07-28 14:10 UTC (permalink / raw)
To: rwheeler
Cc: Martin K. Petersen, Jens Axboe, Matthew Wilcox, Boaz Harrosh,
linux-scsi
On Mon, 2008-07-28 at 09:36 -0400, Ric Wheeler wrote:
> One other thought - is there a way to give non-rotational devices also
> some indication of latency? (FLASH is slower than enterprise SSD is
> slower than DRAM ramdisk for example)?
Not at the moment ... since libata uses SAT, we're bound by what the SBC
mode pages can do for us, so the VPD only reports rotational speed (or a
specific flag for non rotation) and form factor. There was once a
flexible disk mode page that showed some of this, but it's been listed
as obsolete since SBC-2.
There are, perhaps other VPD parameters we should be paying attention
to, like Block Limits (B0) which shows optimal transfer length and
granularity.
James
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-07-28 13:36 ` Ric Wheeler
2008-07-28 14:10 ` James Bottomley
@ 2008-07-28 14:31 ` Martin K. Petersen
2008-07-31 21:00 ` Grant Grundler
1 sibling, 1 reply; 34+ messages in thread
From: Martin K. Petersen @ 2008-07-28 14:31 UTC (permalink / raw)
To: rwheeler
Cc: Martin K. Petersen, Jens Axboe, Matthew Wilcox, Boaz Harrosh,
linux-scsi
>>>>> "Ric" == Ric Wheeler <rwheeler@redhat.com> writes:
Ric> One other thought - is there a way to give non-rotational devices
Ric> also some indication of latency? (FLASH is slower than enterprise
Ric> SSD is slower than DRAM ramdisk for example)?
The current SBC draft only distinguishes between rotating media
speeds. There is only one classification for non-rotating media in
the block device characteristics VPD.
For a mechanical disk drive the rpm isn't a terrible gauge for
performance. But for a solid state device I think it will be hard to
define a similar universal metric.
Ignoring SLC vs. MLC for a moment I think it's also safe to predict
that the enterprise drive of today will be the consumer drive of
tomorrow.
Maybe the ssd device could export the anticipated command response
time for a request that matches the Optimal Transfer Length field in
the block limits VPD?
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-07-28 14:31 ` Martin K. Petersen
@ 2008-07-31 21:00 ` Grant Grundler
2008-07-31 21:19 ` Andrew Patterson
2008-07-31 22:26 ` Ric Wheeler
0 siblings, 2 replies; 34+ messages in thread
From: Grant Grundler @ 2008-07-31 21:00 UTC (permalink / raw)
To: Martin K. Petersen
Cc: rwheeler, Jens Axboe, Matthew Wilcox, Boaz Harrosh, linux-scsi
On Mon, Jul 28, 2008 at 7:31 AM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Ric" == Ric Wheeler <rwheeler@redhat.com> writes:
>
> Ric> One other thought - is there a way to give non-rotational devices
> Ric> also some indication of latency? (FLASH is slower than enterprise
> Ric> SSD is slower than DRAM ramdisk for example)?
>
> The current SBC draft only distinguishes between rotating media
> speeds. There is only one classification for non-rotating media in
> the block device characteristics VPD.
>
> For a mechanical disk drive the rpm isn't a terrible gauge for
> performance. But for a solid state device I think it will be hard to
> define a similar universal metric.
rpm isn't a great gauge of performance either since the perf is a
function of rpm * bit density.
> Ignoring SLC vs. MLC for a moment I think it's also safe to predict
> that the enterprise drive of today will be the consumer drive of
> tomorrow.
>
> Maybe the ssd device could export the anticipated command response
> time for a request that matches the Optimal Transfer Length field in
> the block limits VPD?
erase and/or write times could be exported as well somehow for SSDs
if the FS (or other higher layer that wants to know) can't avoid
garbage collection and erase cycles. I was just told today that flash devices
have 10x higher write time than read time. erase is another order of
magnitude higher. This doesn't include any garbage collection overhead.
I think new file systems should be tuned to work with SSDs before we
worry so much about the differences between SSDs/flash technologies
and vendors. And then prescribe a different FS for different
storage technologies. This avoids the "layering violations" discussion
and helps keep the FSs (testing and developement) substantially simpler.
grant
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-07-31 21:00 ` Grant Grundler
@ 2008-07-31 21:19 ` Andrew Patterson
2008-07-31 22:26 ` Ric Wheeler
1 sibling, 0 replies; 34+ messages in thread
From: Andrew Patterson @ 2008-07-31 21:19 UTC (permalink / raw)
To: Grant Grundler
Cc: Martin K. Petersen, rwheeler, Jens Axboe, Matthew Wilcox,
Boaz Harrosh, linux-scsi
On Thu, 2008-07-31 at 14:00 -0700, Grant Grundler wrote:
> On Mon, Jul 28, 2008 at 7:31 AM, Martin K. Petersen
> <martin.petersen@oracle.com> wrote:
> >>>>>> "Ric" == Ric Wheeler <rwheeler@redhat.com> writes:
> >
> > Ric> One other thought - is there a way to give non-rotational devices
> > Ric> also some indication of latency? (FLASH is slower than enterprise
> > Ric> SSD is slower than DRAM ramdisk for example)?
> >
> > The current SBC draft only distinguishes between rotating media
> > speeds. There is only one classification for non-rotating media in
> > the block device characteristics VPD.
> >
> > For a mechanical disk drive the rpm isn't a terrible gauge for
> > performance. But for a solid state device I think it will be hard to
> > define a similar universal metric.
>
> rpm isn't a great gauge of performance either since the perf is a
> function of rpm * bit density.
>
> > Ignoring SLC vs. MLC for a moment I think it's also safe to predict
> > that the enterprise drive of today will be the consumer drive of
> > tomorrow.
> >
> > Maybe the ssd device could export the anticipated command response
> > time for a request that matches the Optimal Transfer Length field in
> > the block limits VPD?
>
> erase and/or write times could be exported as well somehow for SSDs
> if the FS (or other higher layer that wants to know) can't avoid
> garbage collection and erase cycles. I was just told today that flash devices
> have 10x higher write time than read time. erase is another order of
> magnitude higher. This doesn't include any garbage collection overhead.
>
> I think new file systems should be tuned to work with SSDs before we
> worry so much about the differences between SSDs/flash technologies
> and vendors. And then prescribe a different FS for different
> storage technologies. This avoids the "layering violations" discussion
> and helps keep the FSs (testing and developement) substantially simpler.
>
Could performance patterns be encoded in some sort of per-product data
structure ala the scsi_static_device_list?
Andrwe
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-07-31 21:00 ` Grant Grundler
2008-07-31 21:19 ` Andrew Patterson
@ 2008-07-31 22:26 ` Ric Wheeler
2008-07-31 23:44 ` Grant Grundler
1 sibling, 1 reply; 34+ messages in thread
From: Ric Wheeler @ 2008-07-31 22:26 UTC (permalink / raw)
To: Grant Grundler
Cc: Martin K. Petersen, Jens Axboe, Matthew Wilcox, Boaz Harrosh,
linux-scsi
Grant Grundler wrote:
> On Mon, Jul 28, 2008 at 7:31 AM, Martin K. Petersen
> <martin.petersen@oracle.com> wrote:
>
>>>>>>> "Ric" == Ric Wheeler <rwheeler@redhat.com> writes:
>>>>>>>
>> Ric> One other thought - is there a way to give non-rotational devices
>> Ric> also some indication of latency? (FLASH is slower than enterprise
>> Ric> SSD is slower than DRAM ramdisk for example)?
>>
>> The current SBC draft only distinguishes between rotating media
>> speeds. There is only one classification for non-rotating media in
>> the block device characteristics VPD.
>>
>> For a mechanical disk drive the rpm isn't a terrible gauge for
>> performance. But for a solid state device I think it will be hard to
>> define a similar universal metric.
>>
>
> rpm isn't a great gauge of performance either since the perf is a
> function of rpm * bit density.
>
Is this real rotational latency, or normalized? I think that the avg
seek time is usually a bit more predictive of how well we do with the
worst case load (fsck).
>
>> Ignoring SLC vs. MLC for a moment I think it's also safe to predict
>> that the enterprise drive of today will be the consumer drive of
>> tomorrow.
>>
>> Maybe the ssd device could export the anticipated command response
>> time for a request that matches the Optimal Transfer Length field in
>> the block limits VPD?
>>
>
> erase and/or write times could be exported as well somehow for SSDs
> if the FS (or other higher layer that wants to know) can't avoid
> garbage collection and erase cycles. I was just told today that flash devices
> have 10x higher write time than read time. erase is another order of
> magnitude higher. This doesn't include any garbage collection overhead.
>
This is changing - they have various ways of getting them much closer
together. On the other hand, a USB flash drive is much slower than a
high end SSD which can hit 20,000 IOPS.
> I think new file systems should be tuned to work with SSDs before we
> worry so much about the differences between SSDs/flash technologies
> and vendors. And then prescribe a different FS for different
> storage technologies. This avoids the "layering violations" discussion
> and helps keep the FSs (testing and developement) substantially simpler.
>
> grant
>
If we have access to the parts, we will try to get them to self tune
given whatever we can grab.
ric
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [DO NOT APPLY] sd take advantage of rotation speed
2008-07-31 22:26 ` Ric Wheeler
@ 2008-07-31 23:44 ` Grant Grundler
0 siblings, 0 replies; 34+ messages in thread
From: Grant Grundler @ 2008-07-31 23:44 UTC (permalink / raw)
To: rwheeler
Cc: Martin K. Petersen, Jens Axboe, Matthew Wilcox, Boaz Harrosh,
linux-scsi
On Thu, Jul 31, 2008 at 3:26 PM, Ric Wheeler <rwheeler@redhat.com> wrote:
> Grant Grundler wrote:
...
>> rpm isn't a great gauge of performance either since the perf is a
>> function of rpm * bit density.
>>
>
> Is this real rotational latency, or normalized?
Sorry... I was thinking throughput (MB/s) and not rotational latency.
mkp is right that rotaional latency (ie rpm) is a significant part of
total avg latency.
> I think that the avg seek
> time is usually a bit more predictive of how well we do with the worst case
> load (fsck).
*nod*
....
>> erase and/or write times could be exported as well somehow for SSDs
>> if the FS (or other higher layer that wants to know) can't avoid
>> garbage collection and erase cycles. I was just told today that flash
>> devices
>> have 10x higher write time than read time. erase is another order of
>> magnitude higher. This doesn't include any garbage collection overhead.
>>
>
> This is changing - they have various ways of getting them much closer
> together. On the other hand, a USB flash drive is much slower than a high
> end SSD which can hit 20,000 IOPS.
The IOPS isn't the issue. It's the difference between having a rotation/seek
latency and not. The efficiency (percent of theoretical throughput or IOPS)
of either USB stick or enterprise SSD shouldn't be that different if the
FS is aware of erase blocks and how it lays out the data. The USB stick
will always be slower - the question is efficiency not comparing it to
a faster SSD.
>>
>> I think new file systems should be tuned to work with SSDs before we
>> worry so much about the differences between SSDs/flash technologies
>> and vendors. And then prescribe a different FS for different
>> storage technologies. This avoids the "layering violations" discussion
>> and helps keep the FSs (testing and developement) substantially simpler.
>>
>> grant
>>
>
> If we have access to the parts, we will try to get them to self tune given
> whatever we can grab.
I'm just pointing out that "self tuning" is more complex, more work,
and requires
more testing. Ultimately it will take longer to deploy and I don't think we have
time for it. Just trying to defer this work until I see substantial
evidence of at
least one FS that works well with SSDs and can be deployed.
I'm not against "self tuning". But I believe sufficiently good results are
achievable by directly embedding the device attributes in the FS
as we've done for conventional rotational media in ext2.
Maybe I'm being too skeptical that developments can't be done in
parallel for new features.
thanks,
grants
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2008-07-31 23:44 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-19 16:03 [DO NOT APPLY] sd take advantage of rotation speed Matthew Wilcox
2008-06-19 17:12 ` Mike Anderson
2008-06-19 18:10 ` Matthew Wilcox
2008-06-22 12:16 ` Boaz Harrosh
2008-06-22 13:19 ` Matthew Wilcox
2008-06-22 13:27 ` Boaz Harrosh
2008-06-22 13:38 ` James Bottomley
2008-06-22 14:03 ` Matthew Wilcox
2008-06-22 14:41 ` Martin K. Petersen
2008-06-22 18:44 ` Matthew Wilcox
2008-06-25 2:06 ` Martin K. Petersen
2008-06-22 17:26 ` James Bottomley
2008-06-25 13:47 ` Jens Axboe
2008-06-25 13:57 ` Jens Axboe
2008-06-25 14:24 ` Ric Wheeler
2008-06-25 16:25 ` Boaz Harrosh
2008-06-25 16:57 ` Jens Axboe
2008-06-25 17:20 ` Matthew Wilcox
2008-06-25 17:26 ` Jens Axboe
2008-06-25 17:34 ` Matthew Wilcox
2008-06-25 17:43 ` James Bottomley
2008-06-25 17:53 ` Matthew Wilcox
2008-06-25 18:01 ` Jens Axboe
2008-06-25 18:06 ` James Bottomley
2008-06-25 17:59 ` Jens Axboe
2008-06-25 18:06 ` Martin K. Petersen
2008-06-25 18:12 ` Jens Axboe
2008-07-28 13:36 ` Ric Wheeler
2008-07-28 14:10 ` James Bottomley
2008-07-28 14:31 ` Martin K. Petersen
2008-07-31 21:00 ` Grant Grundler
2008-07-31 21:19 ` Andrew Patterson
2008-07-31 22:26 ` Ric Wheeler
2008-07-31 23:44 ` Grant Grundler
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).