* Re: [PATCH] SG_IO readcd and various bugs
From: Douglas Gilbert @ 2003-05-31 8:23 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-scsi
[-- Attachment #1: Type: text/plain, Size: 3289 bytes --]
Jens Axboe wrote:
> On Fri, May 30 2003, Markus Plail wrote:
> > On Fri, 30 May 2003, Markus Plail wrote:
> >
> > > The patch makes readcd work just fine here :-) Many thanks!
> >
> > Just realized that C2 scans don't yet work.
>
> Updated patch, please give that a shot. The sense_len wasn't
> being set correctly.
Jens,
At the end of this post is an incremental patch on
top of your most recent one.
Here are some timing and CPU utilization numbers for the
combined patches, reading 1 GB of data from the beginning
of a Fujitsu MAM3184MP SCSI disk whose streaming speed for
the outer zones is about 57 MB/sec (according to Storage
Review). Each atomic read is 64 KB (and "of=." is sg_dd
shorthand for "of=/dev/null").
Here is a normal read (i.e. treating /dev/sda as a normal file):
$ /usr/bin/time sg_dd if=/dev/sda of=. bs=512 time=1 count=2M
time to transfer data was 17.821534 secs, 57.46 MB/sec
0.00user 4.37system 0:17.82elapsed 24%CPU
The transfer time is a little fast due to cache hits. The
24% CPU utilization is the price paid for those cache hits.
Next using O_DIRECT:
$ /usr/bin/time sg_dd if=/dev/sda of=. bs=512 time=1 count=2M
odir=1
time to transfer data was 18.003662 secs, 56.88 MB/sec
0.00user 0.52system 0:18.00elapsed 2%CPU
The time to transfer is about right and the CPU utilization is
down to 2% (0.52 seconds).
Next using the block layer SG_IO command:
$ /usr/bin/time sg_dd if=/dev/sda of=. bs=512 time=1 count=2M
blk_sgio=1
time to transfer data was 18.780551 secs, 54.52 MB/sec
0.00user 6.40system 0:18.78elapsed 34%CPU
The throughput is worse and the CPU utilization is now
worse than for the normal (buffered) read.
Setting the "dio=1" flag in sg_dd page aligns its buffers
which causes bio_map_user() to succeed (in
drivers/block/scsi_ioctl.c:sg_io()). In other words it turns
on direct I/O:
$ /usr/bin/time sg_dd if=/dev/sda of=. bs=512 time=1 count=2M
blk_sgio=1 dio=1
time to transfer data was 17.899802 secs, 57.21 MB/sec
0.00user 0.31system 0:17.90elapsed 1%CPU
So this result is comparable with O_DIRECT on the normal
file. CPU utilization is down to 1% (0.31 seconds).
With the latest changes from Jens (mainly dropping the
artificial 64 KB per operation limit) the maximum
element size in the block layer SG_IO is:
- 128 KB when direct I/O is not used (i.e. the user
space buffer does not meet bio_map_user()'s
requirements). This seems to be the largest buffer
kmalloc() will allow (in my tests)
- (sg_tablesize * page_size) when direct I/O is used.
My test was with the sym53c8xx_2 driver in which
sg_tablesize==96 so my largest element size was 384 KB
Incremental patch (on top of Jens's 2nd patch in this
thread) changelog:
- change version number (effectively to 3.7.29) so
apps can distinguish it if they want (the current sg
driver version is 3.5.29). The main thing apps do is
check that the version is >= 3.0.0, as that implies
the SG_IO ioctl is there.
- reject requests for mmap()-ed I/O with a "not
supported" error
- if direct I/O is requested, send back via info
field whether it was done (or fell back to indirect
I/O). This is what happens in the sg driver.
- ultra-paranoid buffer zeroing (in padding at end)
Doug Gilbert
[-- Attachment #2: blk_sg_io2570ja_dg.diff --]
[-- Type: text/plain, Size: 2332 bytes --]
--- linux/drivers/block/scsi_ioctl.c 2003-05-31 13:50:04.000000000 +1000
+++ linux/drivers/block/scsi_ioctl.c2570dpg2 2003-05-31 14:13:15.000000000 +1000
@@ -82,7 +82,7 @@
static int sg_get_version(int *p)
{
- static int sg_version_num = 30527;
+ static int sg_version_num = 30729; /* version 3.7.* to distinguish */
return put_user(sg_version_num, p);
}
@@ -143,7 +143,7 @@
struct sg_io_hdr *uptr)
{
unsigned long start_time;
- int reading, writing;
+ int reading, writing, dio;
struct sg_io_hdr hdr;
struct request *rq;
struct bio *bio;
@@ -163,11 +163,14 @@
*/
if (hdr.iovec_count)
return -EOPNOTSUPP;
+ if (hdr.flags & SG_FLAG_MMAP_IO)
+ return -EOPNOTSUPP;
if (hdr.dxfer_len > (q->max_sectors << 9))
return -EIO;
reading = writing = 0;
+ dio = 0;
buffer = NULL;
bio = NULL;
if (hdr.dxfer_len) {
@@ -194,10 +197,12 @@
bio = bio_map_user(bdev, (unsigned long) hdr.dxferp,
hdr.dxfer_len, reading);
- /*
- * if bio setup failed, fall back to slow approach
- */
- if (!bio) {
+ if (bio)
+ dio = 1;
+ else {
+ /*
+ * if bio setup failed, fall back to slow approach
+ */
buffer = kmalloc(bytes, q->bounce_gfp | GFP_USER);
if (!buffer)
return -ENOMEM;
@@ -206,8 +211,11 @@
if (copy_from_user(buffer, hdr.dxferp,
hdr.dxfer_len))
goto out_buffer;
+ if (bytes > hdr.dxfer_len)
+ memset((char *)buffer + hdr.dxfer_len,
+ 0, bytes - hdr.dxfer_len);
} else
- memset(buffer, 0, hdr.dxfer_len);
+ memset(buffer, 0, bytes);
}
}
@@ -257,11 +265,13 @@
hdr.status = rq->errors;
hdr.masked_status = (hdr.status >> 1) & 0x1f;
hdr.msg_status = 0;
- hdr.host_status = 0;
+ hdr.host_status = 0; /* "lost nexus" error could go here */
hdr.driver_status = 0;
hdr.info = 0;
if (hdr.masked_status || hdr.host_status || hdr.driver_status)
hdr.info |= SG_INFO_CHECK;
+ if (dio && (hdr.flags & SG_FLAG_DIRECT_IO))
+ hdr.info |= SG_INFO_DIRECT_IO;
hdr.resid = rq->data_len;
hdr.duration = ((jiffies - start_time) * 1000) / HZ;
hdr.sb_len_wr = 0;
@@ -286,7 +296,7 @@
kfree(buffer);
}
- /* may not have succeeded, but output values written to control
+ /* may not have succeeded, but output status written to control
* structure (struct sg_io_hdr). */
return 0;
out_request:
* Re: [PATCH] SG_IO readcd and various bugs
From: Jens Axboe @ 2003-05-31 10:57 UTC (permalink / raw)
To: Douglas Gilbert; +Cc: linux-kernel, linux-scsi
On Sat, May 31 2003, Douglas Gilbert wrote:
> Jens Axboe wrote:
> > On Fri, May 30 2003, Markus Plail wrote:
> > > On Fri, 30 May 2003, Markus Plail wrote:
> > >
> > > > The patch makes readcd work just fine here :-) Many thanks!
> > >
> > > Just realized that C2 scans don't yet work.
> >
> > Updated patch, please give that a shot. The sense_len wasn't
> > being set correctly.
>
> Jens,
> At the end of this post is an incremental patch on
> top of your most recent one.
(Douglas, please don't send separate identical mails in private and to
the lists, it's silly having to reply twice to the same mail.)
> Here are some timing and CPU utilization numbers on the
> combined patches. Reading
> 1 GB data from the beginning of a Fujitsu MAM3184MP SCSI
> disk whose streaming speed for outer zones is about
> 57 MB/sec (according to Storage Review). Each atomic read
> is 64 KB (and "of=." is sg_dd shorthand for "of=/dev/null").
>
> Here is a normal read (i.e. treating /dev/sda as a normal file):
> $ /usr/bin/time sg_dd if=/dev/sda of=. bs=512 time=1 count=2M
> time to transfer data was 17.821534 secs, 57.46 MB/sec
> 0.00user 4.37system 0:17.82elapsed 24%CPU
>
> The transfer time is a little fast due to cache hits. The
> 24% CPU utilization is the price paid for those cache hits.
>
> Next using O_DIRECT:
> $ /usr/bin/time sg_dd if=/dev/sda of=. bs=512 time=1 count=2M
> odir=1
> time to transfer data was 18.003662 secs, 56.88 MB/sec
> 0.00user 0.52system 0:18.00elapsed 2%CPU
>
> The time to transfer is about right and the CPU utilization is
> down to 2% (0.52 seconds).
>
> Next using the block layer SG_IO command:
> $ /usr/bin/time sg_dd if=/dev/sda of=. bs=512 time=1 count=2M
> blk_sgio=1
> time to transfer data was 18.780551 secs, 54.52 MB/sec
> 0.00user 6.40system 0:18.78elapsed 34%CPU
>
> The throughput is worse and the CPU utilization is now
> worse than a normal file.
>
> Setting the "dio=1" flag in sg_dd page aligns its buffers
> which causes bio_map_user() to succeed (in
> drivers/block/scsi_ioctl.c:sg_io()). In other words it turns
> on direct I/O:
> $ /usr/bin/time sg_dd if=/dev/sda of=. bs=512 time=1 count=2M
> blk_sgio=1 dio=1
> time to transfer data was 17.899802 secs, 57.21 MB/sec
> 0.00user 0.31system 0:17.90elapsed 1%CPU
This looks fine, a performance critical app should align the buffers to
get zero copy operation.
> So this result is comparable with O_DIRECT on the normal
> file. CPU utilization is down to 1% (0.31 seconds).
A smidgen faster, it seems.
> With the latest changes from Jens (mainly dropping the
> artificial 64 KB per operation limit) the maximum
> element size in the block layer SG_IO is:
> - 128 KB when direct I/O is not used (i.e. the user
> space buffer does not meet bio_map_user()'s
> requirements). This seems to be the largest buffer
> kmalloc() will allow (in my tests)
Correct.
> - (sg_tablesize * page_size) when direct I/O is used.
> My test was with the sym53c8xx_2 driver in which
> sg_tablesize==96 so my largest element size was 384 KB
Or ->max_sectors << 9, whichever is smaller. Really, the limits are in
the queue. Don't confuse SCSI with block.
> Incremental patch (on top of Jens's 2nd patch in this
> thread) changelog:
> - change version number (effectively to 3.7.29) so
> apps can distinguish it if they want (the current sg
> driver version is 3.5.29). The main thing apps do is
> check that the version is >= 3.0.0, as that implies
> the SG_IO ioctl is there.
Agree
> - reject requests for mmap()-ed I/O with a "not
> supported" error
Agree
> - if direct I/O is requested, send back via info
> field whether it was done (or fell back to indirect
> I/O). This is what happens in the sg driver.
Agree
> - ultra-paranoid buffer zeroing (in padding at end)
Pointless, we won't be copying that data back. So it's a waste of cycles.
--
Jens Axboe
* Re: [PATCH] SG_IO readcd and various bugs
From: Douglas Gilbert @ 2003-06-01 7:39 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-scsi
Jens Axboe wrote:
> On Sat, May 31 2003, Douglas Gilbert wrote:
>>With the latest changes from Jens (mainly dropping the
>>artificial 64 KB per operation limit) the maximum
>>element size in the block layer SG_IO is:
>> - 128 KB when direct I/O is not used (i.e. the user
>> space buffer does not meet bio_map_user()'s
>> requirements). This seems to be the largest buffer
>> kmalloc() will allow (in my tests)
>
>
> Correct.
>
>
>> - (sg_tablesize * page_size) when direct I/O is used.
>> My test was with the sym53c8xx_2 driver in which
>> sg_tablesize==96 so my largest element size was 384 KB
>
>
> Or ->max_sectors << 9, whichever is smallest. Really, the limits are in
> the queue. Don't confuse SCSI with block.
The block layer SG_IO ioctl passes through the SCSI
command set to a device that understands it
(i.e. not necessarily a "SCSI" device in the traditional
sense). Other pass-throughs exist (or may be needed) for
ATA's task file interface and SAS's management protocol.
Even though my tests, shown earlier in this thread, indicated
that the SG_IO ioctl might be a shade faster than O_DIRECT,
the main reason for having it is to pass through "non-block"
commands to a device. Some examples:
- special writes (e.g. formatting a disk, writing a CD/DVD)
- uploading firmware
- reading the defect table from a disk
- reading and writing special areas on a disk
(e.g. application client log page)
The reason for choosing this list is that all these
operations potentially move large amounts of data in a
single operation. For such data transfers to be constrained
by max_sectors is questionable. Putting a block paradigm
bypass in the block layer is an interesting design :-)
With these patches in place cdrecord (in RH 9) works via
/dev/hdb (using ide-cd). Cdrecord fails on the same ATAPI
writer via /dev/scd0 because ide-scsi does not set max_sectors
high enough [and the SG_SET_RESERVED_SIZE ioctl reacts
differently in this case than the sg driver does].
In summary, I see no reason to constrain the SG_IO ioctl
by max_sectors. SG_IO is a "one shot" with no command
merging or retries. If the HBA driver doesn't like a command,
then it can reject it. [However, currently there is no
mechanism to convey host or HBA driver errors back through
the request queue, only the device (response) status.]
Doug Gilbert
* Re: [PATCH] SG_IO readcd and various bugs
From: Jens Axboe @ 2003-06-02 7:27 UTC (permalink / raw)
To: Douglas Gilbert; +Cc: linux-kernel, linux-scsi
On Sun, Jun 01 2003, Douglas Gilbert wrote:
> Jens Axboe wrote:
> >On Sat, May 31 2003, Douglas Gilbert wrote:
>
> >>With the latest changes from Jens (mainly dropping the
> >>artificial 64 KB per operation limit) the maximum
> >>element size in the block layer SG_IO is:
> >> - 128 KB when direct I/O is not used (i.e. the user
> >> space buffer does not meet bio_map_user()'s
> >> requirements). This seems to be the largest buffer
> >> kmalloc() will allow (in my tests)
> >
> >
> >Correct.
> >
> >
> >> - (sg_tablesize * page_size) when direct I/O is used.
> >> My test was with the sym53c8xx_2 driver in which
> >> sg_tablesize==96 so my largest element size was 384 KB
> >
> >
> >Or ->max_sectors << 9, whichever is smallest. Really, the limits are in
> >the queue. Don't confuse SCSI with block.
>
> The block layer SG_IO ioctl passes through the SCSI
> command set to a device that understands it
> (i.e. not necessarily a "SCSI" device in the traditional
> sense). Other pass throughs exist (or may be needed) for
> ATA's task file interface and SAS's management protocol.
>
> Even though my tests, shown earlier in this thread, indicated
> that the SG_IO ioctl might be a shade faster than O_DIRECT,
> the main reason for having it is to pass through "non-block"
> commands to a device. Some examples:
> - special writes (e.g. formatting a disk, writing a CD/DVD)
> - uploading firmware
> - reading the defect table from a disk
> - reading and writing special areas on a disk
> (e.g. application client log page)
>
> The reason for choosing this list is that all these
> operations potentially move large amounts of data in a
> single operation. For such data transfers to be constrained
> by max_sectors is questionable. Putting a block paradigm
> bypass in the block layer is an interesting design :-)
I think this is nonsense. The block layer will not accept commands
that it cannot handle in one go, what would the point of that be?
There's no way for us to break down a single command into pieces,
we have no idea how to do that. max_sectors _is_ the natural
constraint, it's the hardware limit not something I impose through
policy. For SCSI it could be bigger in some cases, that's up to the
lldd to set though.
> With these patches in place cdrecord (in RH 9) works via
> /dev/hdb (using ide-cd). Cdrecord fails on the same ATAPI
> writer via /dev/scd0 because ide-scsi does not set max_sectors
> high enough [and the SG_SET_RESERVED_SIZE ioctl reacts
> differently in this case to the sg driver].
ide-scsi bug.
> In summary, I see no reason to constrain the SG_IO ioctl
> by max_sectors. SG_IO is a "one shot" with no command
> merging or retries. If the HBA driver doesn't like a command,
> then it can reject it. [However, currently there is no
> mechanism to convey host or HBA driver errors back through
> the request queue, only the device (response) status.]
It breaks existing applications too, which is at least a very
good reason in my book. I think your position is naive. The
driver explicitly asked not to get requests bigger than a certain
size, and now you want to rely on the correct handling of that
case? Won't happen.
BTW, if you want me to read emails somewhat reliably, cc me.
Jens
* Re: [PATCH] SG_IO readcd and various bugs
From: Douglas Gilbert @ 2003-06-03 5:23 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, linux-scsi
Jens Axboe wrote:
> On Sun, Jun 01 2003, Douglas Gilbert wrote:
<snip>
>>The block layer SG_IO ioctl passes through the SCSI
>>command set to a device that understands it
>>(i.e. not necessarily a "SCSI" device in the traditional
>>sense). Other pass throughs exist (or may be needed) for
>>ATA's task file interface and SAS's management protocol.
>>
>>Even though my tests, shown earlier in this thread, indicated
>>that the SG_IO ioctl might be a shade faster than O_DIRECT,
>>the main reason for having it is to pass through "non-block"
>>commands to a device. Some examples:
>> - special writes (e.g. formatting a disk, writing a CD/DVD)
>> - uploading firmware
>> - reading the defect table from a disk
>> - reading and writing special areas on a disk
>> (e.g. application client log page)
>>
>>The reason for choosing this list is that all these
>>operations potentially move large amounts of data in a
>>single operation. For such data transfers to be constrained
>>by max_sectors is questionable. Putting a block paradigm
>>bypass in the block layer is an interesting design :-)
>
>
> I think this is nonsense. The block layer will not accept commands
> that it cannot handle in one go, what would the point of that be?
> There's no way for us to break down a single command into pieces,
> we have no idea how to do that. max_sectors _is_ the natural
> constraint, it's the hardware limit not something I impose through
> policy. For SCSI it could be bigger in some cases, that's up to the
> lldd to set though.
<snip>
Jens,
Reviewing the linux-scsi archives, max_sectors was
introduced around lk 2.4.7 and you were quite active
in its promotion. There are also posts about problems
with qlogic HBAs and their need for a limit on maximum
transfer length. So there is some hardware justification.
On 11th April 2002 Justin Gibbs posted this in a mail
about aic7xxx version 6.2.6:
"2) Set max_sectors to a sane value. The aic7xxx driver was not
updated when this value was added to the host template structure.
In more recent kernels, the default setting for this field, 255,
can limit our transaction size to 127K. This often causes the
scsi_merge routines to generate 127k followed by 1k I/Os to complete
a client transaction. The command overhead of such small transactions
can severely impact performance. The driver now sets max_sectors to
8192 which equates to the 16MB S/G element limit for these cards as
expressed in 2K sectors."
At the time, max_sectors defaulted to 255; later it was
bumped to 256, and it is now 1024 in lk 2.5. However, Justin's
post says the hardware limit for a data transfer
associated with a single SCSI command in the aic7xxx
driver is:
sg_tablesize * (2 ** 24) bytes == 2 GB
as the aic7xxx driver sets sg_tablesize to 128.
Taking into account the largest practical kmalloc of 128 KB
(which is not a hardware limitation) this number comes down
to 16 MB. The 8192 figure that Justin chose is still in place
in the aic7xxx driver in lk 2.5 and it limits maximum transfer
size to 4 MB since the unit of max_sectors is now 512 bytes.
Various projects have reported to me success in transferring
8 and 16 MB individual WRITE commands through the sg driver,
usually with LSI or Adaptec HBAs. The max_sectors==8192
set by the aic7xxx is the maximum of any driver in the
ide or the scsi subsystems (both in lk 2.4 and lk 2.5)
currently. Most drivers are picking up the default value.
The definition of "max_sectors" in drivers/scsi/hosts.h
states:
"if the host adapter has limitations beside segment count"
That could be taken to imply that if an LLD does not define
max_sectors then there is no limit.
In summary, from an HBA driver's point of view, "max_sectors"
is misnamed (since drivers transfer bytes) and not precise
enough to describe any limitations on data transfers they
may have.
Apologies in advance for propagating further nonsense.
Doug Gilbert
* Re: [PATCH] SG_IO readcd and various bugs
From: Nick Piggin @ 2003-06-03 5:39 UTC (permalink / raw)
To: Douglas Gilbert; +Cc: Jens Axboe, linux-kernel, linux-scsi
On Tue, Jun 03, 2003 at 03:23:19PM +1000, Douglas Gilbert wrote:
> Jens Axboe wrote:
> >On Sun, Jun 01 2003, Douglas Gilbert wrote:
> <snip>
> >>The block layer SG_IO ioctl passes through the SCSI
> >>command set to a device that understands it
> >>(i.e. not necessarily a "SCSI" device in the traditional
> >>sense). Other pass throughs exist (or may be needed) for
> >>ATA's task file interface and SAS's management protocol.
> >>
> >>Even though my tests, shown earlier in this thread, indicated
> >>that the SG_IO ioctl might be a shade faster than O_DIRECT,
> >>the main reason for having it is to pass through "non-block"
> >>commands to a device. Some examples:
> >> - special writes (e.g. formatting a disk, writing a CD/DVD)
> >> - uploading firmware
> >> - reading the defect table from a disk
> >> - reading and writing special areas on a disk
> >> (e.g. application client log page)
> >>
> >>The reason for choosing this list is that all these
> >>operations potentially move large amounts of data in a
> >>single operation. For such data transfers to be constrained
> >>by max_sectors is questionable. Putting a block paradigm
> >>bypass in the block layer is an interesting design :-)
> >
> >
> >I think this is nonsense. The block layer will not accept commands
> >that it cannot handle in one go, what would the point of that be?
> >There's no way for us to break down a single command into pieces,
> >we have no idea how to do that. max_sectors _is_ the natural
> >constraint, it's the hardware limit not something I impose through
> >policy. For SCSI it could be bigger in some cases, that's up to the
> >lldd to set though.
> <snip>
>
> Jens,
> Reviewing the linux-scsi archives, max_sectors was
> introduced around lk 2.4.7 and you were quite active
> in its promotion. There are also posts about problems
> with qlogic HBAs and their need for a limit to maximum
> transfer length. So there is some hardware justification.
>
> On 11th April 2002 Justin Gibbs posted this in a mail
> about aic7xxx version 6.2.6:
> "2) Set max_sectors to a sane value. The aic7xxx driver was not
> updated when this value was added to the host template structure.
> In more recent kernels, the default setting for this field, 255,
> can limit our transaction size to 127K. This often causes the
> scsi_merge routines to generate 127k followed by 1k I/Os to complete
> a client transaction. The command overhead of such small transactions
> can severely impact performance. The driver now sets max_sectors to
> 8192 which equates to the 16MB S/G element limit for these cards as
> expressed in 2K sectors."
>
> At the time max_sectors defaulted to 255, later it was
> bumped to 256 and is now 1024 in lk 2.5. However Justin's
> post is saying the hardware limit for a data transfer
> associated with a single SCSI command in the aic7xxx
> driver is:
> sg_tablesize * (2 ** 24) bytes == 2 GB
> as the aic7xxx driver sets sg_tablesize to 128.
> Taking into account the largest practical kmalloc of 128 KB
> (which is not a hardware limitation) this number comes down
> to 16 MB. The 8192 figure that Justin chose is still in place
> in the aic7xxx driver in lk 2.5 and it limits maximum transfer
> size to 4 MB since the unit of max_sectors is now 512 bytes.
>
> Various projects have reported to me success in transferring
> 8 and 16 MB individual WRITE commands through the sg driver,
> usually with LSI or Adaptec HBAs. The max_sectors==8192
> set by the aic7xxx is the maximum of any driver in the
> ide or the scsi subsystems (both in lk 2.4 and lk 2.5)
> currently.
I would just like to add my 2c here, and say that 16MB
requests are just a bit too big for a general purpose
installation, and using any of the available IO schedulers.
The granularity and disparity between large and small
requests is just too large. I know AS wouldn't cope well
with requests that large in a general-purpose situation
(general purpose being < 4000 disks :P ).
I would be really interested in seeing benchmarks which
showed a significant performance improvement when going
from say 128K requests to say 16MB. Even in the most
favourable conditions for big request sizes, I'd say
128K should be getting toward the point of diminishing
returns.
Nick
* Re: [PATCH] SG_IO readcd and various bugs
From: Jens Axboe @ 2003-06-03 8:29 UTC (permalink / raw)
To: Douglas Gilbert; +Cc: linux-kernel, linux-scsi
On Tue, Jun 03 2003, Douglas Gilbert wrote:
> Jens Axboe wrote:
> >On Sun, Jun 01 2003, Douglas Gilbert wrote:
> <snip>
> >>The block layer SG_IO ioctl passes through the SCSI
> >>command set to a device that understands it
> >>(i.e. not necessarily a "SCSI" device in the traditional
> >>sense). Other pass throughs exist (or may be needed) for
> >>ATA's task file interface and SAS's management protocol.
> >>
> >>Even though my tests, shown earlier in this thread, indicated
> >>that the SG_IO ioctl might be a shade faster than O_DIRECT,
> >>the main reason for having it is to pass through "non-block"
> >>commands to a device. Some examples:
> >> - special writes (e.g. formatting a disk, writing a CD/DVD)
> >> - uploading firmware
> >> - reading the defect table from a disk
> >> - reading and writing special areas on a disk
> >> (e.g. application client log page)
> >>
> >>The reason for choosing this list is that all these
> >>operations potentially move large amounts of data in a
> >>single operation. For such data transfers to be constrained
> >>by max_sectors is questionable. Putting a block paradigm
> >>bypass in the block layer is an interesting design :-)
> >
> >
> >I think this is nonsense. The block layer will not accept commands
> >that it cannot handle in one go, what would the point of that be?
> >There's no way for us to break down a single command into pieces,
> >we have no idea how to do that. max_sectors _is_ the natural
> >constraint, it's the hardware limit not something I impose through
> >policy. For SCSI it could be bigger in some cases, that's up to the
> >lldd to set though.
> <snip>
>
> Jens,
> Reviewing the linux-scsi archives, max_sectors was
> introduced around lk 2.4.7 and you were quite active
> in its promotion. There are also posts about problems
> with qlogic HBAs and their need for a limit to maximum
> transfer length. So there is some hardware justification.
>
> On 11th April 2002 Justin Gibbs posted this in a mail
> about aic7xxx version 6.2.6:
> "2) Set max_sectors to a sane value. The aic7xxx driver was not
> updated when this value was added to the host template structure.
> In more recent kernels, the default setting for this field, 255,
> can limit our transaction size to 127K. This often causes the
> scsi_merge routines to generate 127k followed by 1k I/Os to complete
> a client transaction. The command overhead of such small transactions
> can severely impact performance. The driver now sets max_sectors to
> 8192 which equates to the 16MB S/G element limit for these cards as
> expressed in 2K sectors."
>
> At the time max_sectors defaulted to 255, later it was
> bumped to 256 and is now 1024 in lk 2.5. However Justin's
> post is saying the hardware limit for a data transfer
> associated with a single SCSI command in the aic7xxx
> driver is:
> sg_tablesize * (2 ** 24) bytes == 2 GB
> as the aic7xxx driver sets sg_tablesize to 128.
> Taking into account the largest practical kmalloc of 128 KB
> (which is not a hardware limitation) this number comes down
> to 16 MB. The 8192 figure that Justin chose is still in place
> in the aic7xxx driver in lk 2.5 and it limits maximum transfer
> size to 4 MB since the unit of max_sectors is now 512 bytes.
>
> Various projects have reported to me success in transferring
> 8 and 16 MB individual WRITE commands through the sg driver,
> usually with LSI or Adaptec HBAs. The max_sectors==8192
> set by the aic7xxx is the maximum of any driver in the
> ide or the scsi subsystems (both in lk 2.4 and lk 2.5)
> currently. Most drivers are picking up the default value.
> The definition of "max_sectors" states in
> drivers/scsi/hosts.h:
> "if the host adapter has limitations beside segment count"
> That could be taken to imply if a LLD does not define
> max_sectors then there is no limit.
>
> In summary, from an HBA driver's point of view, "max_sectors"
> is misnamed (since drivers transfer bytes) and not precise
> enough to describe any limitations on data transfers they
> may have.
>
> Apologies in advance for propagating further nonsense.
Wow, that was a long email. I don't have good connectivity these days,
so mail collisions are bound to happen.
As I wrote in the last email to you, it might make sense to introduce a
hard upper limit and a preferred limit. I think you are missing the
point with the latency requirements. If you allow 16MB sg requests in
the queue, you will both have pinned down a _lot_ of memory (that's one
problem) and build up a huge latency queue. And that's not even
considering that it makes _no_ sense from a performance pov to go as
high as 16MB, zero.
The fact is that for some drivers, max_sectors is a hard limit. There's
no way that scsi_ioctl will pass down requests bigger than this, period.
If 512KB is too small for some operations (your firmware case, it makes
sense), then I'm all for fixing that up.
--
Jens Axboe