public inbox for linux-scsi@vger.kernel.org
* IO buffer alignment for block devices
@ 2004-09-13 12:48 Saeed Bishara
  2004-09-13 14:14 ` James Bottomley
  0 siblings, 1 reply; 11+ messages in thread
From: Saeed Bishara @ 2004-09-13 12:48 UTC (permalink / raw)
  To: linux-scsi

Hi,
    Is there any minimal alignment for the IO buffers (sg and single) 
sent to a SCSI LLD?
    I found that buffers that come from the buffer cache, page cache and sg 
are page aligned, and the raw driver sends sector-aligned buffers.
    So can I assume that the minimal alignment of the buffer is 512 bytes?
thanks
saeed


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: IO buffer alignment for block devices
  2004-09-13 12:48 IO buffer alignment for block devices Saeed Bishara
@ 2004-09-13 14:14 ` James Bottomley
  2004-09-13 15:16   ` Saeed Bishara
  0 siblings, 1 reply; 11+ messages in thread
From: James Bottomley @ 2004-09-13 14:14 UTC (permalink / raw)
  To: Saeed Bishara; +Cc: SCSI Mailing List

On Mon, 2004-09-13 at 08:48, Saeed Bishara wrote:
>     Is there any minimal alignment for the IO buffers (sg and single) 
> sent to a scsi LLD?
>     I found that buffers that come from buffer cache, page cache and sg 
> are page aligned, and the raw driver sends sector-aligned buffers.
>     so can I assume that the minimal alignment of the buffer is 512 bytes?

You can assume it will be whatever you set blk_queue_dma_alignment() to
in your slave configure routines.  If the LLD doesn't set this, then it
will be 8 bytes.  However, beware, setting this higher can cause user
generated commands to be errored if they fail the alignment checks.

James



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: IO buffer alignment for block devices
  2004-09-13 14:14 ` James Bottomley
@ 2004-09-13 15:16   ` Saeed Bishara
  2004-09-13 15:28     ` James Bottomley
  0 siblings, 1 reply; 11+ messages in thread
From: Saeed Bishara @ 2004-09-13 15:16 UTC (permalink / raw)
  To: James Bottomley; +Cc: SCSI Mailing List



James Bottomley wrote:

>On Mon, 2004-09-13 at 08:48, Saeed Bishara wrote:
>
>>    Is there any minimal alignment for the IO buffers (sg and single) 
>>sent to a scsi LLD?
>>    I found that buffers that come from buffer cache, page cache and sg 
>>are page aligned, and the raw driver sends sector-aligned buffers.
>>    so can I assume that the minimal alignment of the buffer is 512 bytes?
>
>You can assume it will be whatever you set blk_queue_dma_alignment() to
>in your slave configure routines.  If the LLD doesn't set this, then it
>will be 8 bytes.  However, beware, setting this higher can cause user
    Isn't the default 512? That is what I see in blk_queue_make_request.

<http://lxr.linux.no/ident?v=2.6.8.1;i=blk_queue_make_request>

>generated commands to be errored if they fail the alignment checks.
>
>James
>
Any ideas about lk 2.4?



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: IO buffer alignment for block devices
  2004-09-13 15:16   ` Saeed Bishara
@ 2004-09-13 15:28     ` James Bottomley
  2004-09-13 17:12       ` Kai Makisara
  0 siblings, 1 reply; 11+ messages in thread
From: James Bottomley @ 2004-09-13 15:28 UTC (permalink / raw)
  To: Saeed Bishara; +Cc: SCSI Mailing List

On Mon, 2004-09-13 at 11:16, Saeed Bishara wrote:
>     isn't the default 512, this is what I see in blk_queue_make_request.
> 
> <http://lxr.linux.no/ident?v=2.6.8.1;i=blk_queue_make_request>

In the block layer, yes.  In SCSI we tune that down to be 8 for backward
compatibility with 2.4, I think, but LLDs can tune it up again.

> any Ideas about lk 2.4?

Not really; since it's end of life, you could probably get away with not
backporting to 2.4.

James



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: IO buffer alignment for block devices
  2004-09-13 15:28     ` James Bottomley
@ 2004-09-13 17:12       ` Kai Makisara
  2004-09-13 19:02         ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Kai Makisara @ 2004-09-13 17:12 UTC (permalink / raw)
  To: James Bottomley; +Cc: Saeed Bishara, SCSI Mailing List

On Mon, 13 Sep 2004, James Bottomley wrote:

> On Mon, 2004-09-13 at 11:16, Saeed Bishara wrote:
> >     isn't the default 512, this is what I see in blk_queue_make_request.
> > 
> > <http://lxr.linux.no/ident?v=2.6.8.1;i=blk_queue_make_request>
> 
> In the block layer, yes.  In SCSI we tune that down to be 8 for backward
> compatibility with 2.4, I think, but LLDs can tune it up again.
> 
We don't. The alignment has been 512 bytes since this patch in 2.6.4:

<James.Bottomley@SteelEye.com>
        [PATCH] Undo SCSI 8-byte alignment relaxation
        
        This makes the default alignment requirements be 512 bytes for SCSI,
        the way it used to be.
        
        Jens will fix the SCSI layer problems, but low-level drivers might have
        other restrictions on alignment.

As you can guess, I have not been too happy with this situation ;-)

Any driver can change this restriction in slave_configure(). I have for 
some time had the following patch in my system to change the alignment for 
my tape drive to 8 bytes:

--- linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c~	2004-08-27 19:33:34.000000000 +0300
+++ linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c	2004-08-28 12:54:49.000000000 +0300
@@ -1097,6 +1097,8 @@
 
 	spi_dv_device(device);
 
+	blk_queue_dma_alignment(device->request_queue, 7);
+
 	return 0;
 }
 

-- 
Kai

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: IO buffer alignment for block devices
  2004-09-13 17:12       ` Kai Makisara
@ 2004-09-13 19:02         ` Jens Axboe
  2004-09-13 19:56           ` Kai Makisara
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2004-09-13 19:02 UTC (permalink / raw)
  To: Kai Makisara; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List

On Mon, Sep 13 2004, Kai Makisara wrote:
> On Mon, 13 Sep 2004, James Bottomley wrote:
> 
> > On Mon, 2004-09-13 at 11:16, Saeed Bishara wrote:
> > >     isn't the default 512, this is what I see in blk_queue_make_request.
> > > 
> > > <http://lxr.linux.no/ident?v=2.6.8.1;i=blk_queue_make_request>
> > 
> > In the block layer, yes.  In SCSI we tune that down to be 8 for backward
> > compatibility with 2.4, I think, but LLDs can tune it up again.
> > 
> We don't. The alignment has been 512 bytes after this patch in 2.6.4:
> 
> <James.Bottomley@SteelEye.com>
>         [PATCH] Undo SCSI 8-byte alignment relaxation
>         
>         This makes the default alignment requirements be 512 bytes for SCSI,
>         the way it used to be.
>         
>         Jens will fix the SCSI layer problems, but low-level drivers might have
>         other restrictions on alignment.
> 
> As you can guess, I have not been too happy with this situation ;-)
> 
> Any driver can change this restriction in slave_configure(). I have for 
> some time had the following patch in my system to change the alignment for 
> my tape drive to 8 bytes:

I'm not sure 8 is a safe alignment at all; at least for ide-cd there
were cases that simply didn't work well with 8-byte alignment.

> --- linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c~	2004-08-27 19:33:34.000000000 +0300
> +++ linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c	2004-08-28 12:54:49.000000000 +0300
> @@ -1097,6 +1097,8 @@
>  
>  	spi_dv_device(device);
>  
> +	blk_queue_dma_alignment(device->request_queue, 7);
> +
>  	return 0;
>  }

Why don't you just do this in st? Not that it matters, since it's just
your private hack - just wondering.

And finally (and most importantly), why do you care so much? You should
rip some of that manual data mapping out of st completely. Add a
scsi_rq_map_user or something that returns a struct scsi_request, and
uses blk_rq_map_user (which will do bio_map_user() or bio_copy_user())
to map the user data into your SCSI request. I seriously doubt that st
will see any performance difference between the two at all, and st
really should be using proper infrastructure for this. sg as well, but
since it's dying anyways...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: IO buffer alignment for block devices
  2004-09-13 19:02         ` Jens Axboe
@ 2004-09-13 19:56           ` Kai Makisara
  2004-09-13 20:23             ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Kai Makisara @ 2004-09-13 19:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List

On Mon, 13 Sep 2004, Jens Axboe wrote:

> On Mon, Sep 13 2004, Kai Makisara wrote:
...
> > Any driver can change this restriction in slave_configure(). I have for 
> > some time had the following patch in my system to change the alignment for 
> > my tape drive to 8 bytes:
> 
> I'm not sure 8 is a safe alignment at all, at least for ide-cd there
> were cases that simply didn't work well with 8-byte alignment.
> 
I agree. That is why I have not objected to the patch that changed the 
default back to 512. That is why I reviewed sym53c8xx_2 and, to me, it 
seemed safe to try 8 byte alignment with this driver. The proper way is 
to have a safe default and let the LLDDs relax the alignment. The only 
practical problem is how to get this done, but we probably have to live 
with this. Besides, we were talking about SCSI.

> > --- linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c~	2004-08-27 19:33:34.000000000 +0300
> > +++ linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c	2004-08-28 12:54:49.000000000 +0300
> > @@ -1097,6 +1097,8 @@
> >  
> >  	spi_dv_device(device);
> >  
> > +	blk_queue_dma_alignment(device->request_queue, 7);
> > +
> >  	return 0;
> >  }
> 
> Why don't you just do this in st? Not that it matters, since it's just
> your private hack - just wondering.
> 
Because the high-level driver should not guess what the low-level driver 
requirements are.

> And finally (and most importantly), why do you care so much?

I do care because the non-bounced transfers give a measurable decrease in 
cpu time and bus use. And I don't like the situation where portable 
software has to have special alignment code just for Linux when there is 
no real technical reason for it. In practice, this code often does not 
exist. Then the question is, why does writing to tape in Linux use, say, 
30 % CPU time when the same thing in xxx Unix uses 1 % CPU time.

>                                                               You should
> rip some of that manual data mapping out of st completely. Add a
> scsi_rq_map_user or something that returns a struct scsi_request, and
> uses blk_rq_map_user (which will do bio_map_user() or bio_copy_user())
> to map the user data into your SCSI request. I seriously doubt that st
> will see any performance difference between the two at all, and st
> really should be using proper infrastructure for this. sg as well, but
> since it's dying anyways...
> 
The manual data mapping exists in st and sg because when the code was 
made, nothing better was available. It was meant to be a proof of concept 
and it still is. However, it has at least one good property: it works.

I don't think you can get around the alignment by using bio_map_user(): it 
tests alignment for both the buffer start and the transfer length. So, 
converting the data mapping does not change the alignment handling which 
is what we were talking about.

I have sometimes looked at the bio code and thought about changing to use 
it. The problem has been that it is too limited for use in st. It has 
become better but it still is not good enough to replace both the data 
mapping and the driver's buffering. A quick look through the code suggests 
that it still can't handle big requests (up to 6 MB). Please tell me if I 
am wrong.

-- 
Kai

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: IO buffer alignment for block devices
  2004-09-13 19:56           ` Kai Makisara
@ 2004-09-13 20:23             ` Jens Axboe
  2004-09-13 22:08               ` Kai Makisara
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2004-09-13 20:23 UTC (permalink / raw)
  To: Kai Makisara; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List

On Mon, Sep 13 2004, Kai Makisara wrote:
> On Mon, 13 Sep 2004, Jens Axboe wrote:
> 
> > On Mon, Sep 13 2004, Kai Makisara wrote:
> ...
> > > Any driver can change this restriction in slave_configure(). I have for 
> > > some time had the following patch in my system to change the alignment for 
> > > my tape drive to 8 bytes:
> > 
> > I'm not sure 8 is a safe alignment at all, at least for ide-cd there
> > were cases that simply didn't work well with 8-byte alignment.
> > 
> I agree. That is why I have not objected to the patch that changed the 
> default back to 512. That is why I reviewed sym53c8xx_2 and, to me, it 
> seemed safe to try 8 byte alignment with this driver. The proper way is 
> to have safe default and let the LLDDs relax the alignment. The only 
> practical problem is how to get this done but we probably have to live 
> with this. Besides we were talking about SCSI.

This seemed to be a device problem, not an adapter problem.

I'm curious where the alignment requirements are really documented; it
seems to me we are stabbing in the dark.

> > > --- linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c~	2004-08-27 19:33:34.000000000 +0300
> > > +++ linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c	2004-08-28 12:54:49.000000000 +0300
> > > @@ -1097,6 +1097,8 @@
> > >  
> > >  	spi_dv_device(device);
> > >  
> > > +	blk_queue_dma_alignment(device->request_queue, 7);
> > > +
> > >  	return 0;
> > >  }
> > 
> > Why don't you just do this in st? Not that it matters, since it's just
> > your private hack - just wondering.
> > 
> Because the high-level driver should not guess what the low-level driver 
> requirements are.
> 
> > And finally (and most importantly), why do you care so much?
> 
> I do care because the non-bounced transfers give measurable decrease in 
> cpu time and bus use. And I don't like the situation where portable 
> software has to have special alignment code just for Linux when there is 
> no real technical reason for that. In practice, this code often does not 
> exist. Then the question is, why does writing to tape in Linux use, say, 
> 30 % CPU time when the same thing in xxx Unix uses 1 % CPU time.

Please share how you arrived at those numbers; if you get performance
that bad, you are doing something horribly wrong elsewhere. Doing a
50MiB/sec single transfer from a hard drive with either bio_map_user()
or bio_copy_user() backing shows at most a 1% sys time difference on a
dual P3-800MHz. How fast are these tape drives?

There's no questioning that bio_map_user() will be faster for larger
transfers (4kb and up, I'm guessing), but there's no way that you can
claim 30% sys time for a tape drive without backing that up. Where was
this time spent?

> > rip some of that manual data mapping out of st completely. Add a
> > scsi_rq_map_user or something that returns a struct scsi_request, and
> > uses blk_rq_map_user (which will do bio_map_user() or bio_copy_user())
> > to map the user data into your SCSI request. I seriously doubt that st
> > will see any performance difference between the two at all, and st
> > really should be using proper infrastructure for this. sg as well, but
> > since it's dying anyways...
> > 
> The manual data mapping exists in st and sg because when the code was 
> made, nothing better was available. It was meant to be a proof of concept 
> and it still is. However, it has at least one good property: it works.

Nothing as permanent as temporary driver code :)

> I don't think you can get around the alignment by using bio_map_user(): it 
> tests alignment for both the buffer start and the transfer length. So, 
> converting the data mapping does not change the alignment handling which 
> is what we were talking about.

Of course not; bio_map_user() data must be aligned, that's the whole
premise of it. But bio_copy_user() will align the data for you. Look up
blk_rq_map_user() like I described.

> I have sometimes looked at the bio code and thought about changing to use 
> it. The problem has been that it is too limited for use in st. It has 
> become better but it still is not good enough to replace both the data 
> mapping and the driver's buffering. A quick look through the code suggests 
> that it still can't handle big requests (up to 6 MB). Please tell me if I 
> am wrong.

I don't think you should. As I wrote, write a scsi mid layer helper that
will map the data for you into a scsi structure. I think it's a mistake
that st/sg attempt to handle this business themselves.

You can't map 6MB into a single bio, but you can string them together if
you want such a big request.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: IO buffer alignment for block devices
  2004-09-13 20:23             ` Jens Axboe
@ 2004-09-13 22:08               ` Kai Makisara
  2004-09-14  7:44                 ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Kai Makisara @ 2004-09-13 22:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List

On Mon, 13 Sep 2004, Jens Axboe wrote:

> On Mon, Sep 13 2004, Kai Makisara wrote:
> > On Mon, 13 Sep 2004, Jens Axboe wrote:
> > 
> > > On Mon, Sep 13 2004, Kai Makisara wrote:
...
> > I agree. That is why I have not objected to the patch that changed the 
> > default back to 512. That is why I reviewed sym53c8xx_2 and, to me, it 
> > seemed safe to to try 8 byte alignment with this driver. The proper way is 
> > to have safe default and let the LLDDs relax the alignment. The only 
> > practical problem is how to get this done but we probably have to live 
> > with this. Besides we were talking about SCSI.
> 
> This seemed to be a device problem, not an adapter problem.
> 
The tape drives at the other end of the SCSI cable don't have any 
restrictions. The restrictions are at the computer end of the cable.

> I'm curious where the alignment is really documented, seems to be we are
> stabbing in the dark.
> 
I have not seen any Linux documentation on the whole problem. The 
requirements have to be determined case by case considering the whole data 
path from the source to the target. The low-level driver (writer) should 
know the requirements for the adapter but this should be combined with the 
requirements coming from the i/o architecture of the computer. One might 
design code where the low-level driver suggests an alignment that 
architecture-dependent code then modifies if necessary.

...
> > > And finally (and most importantly), why do you care so much?
> > 
> > I do care because the non-bounced transfers give measurable decrease in 
> > cpu time and bus use. And I don't like the situation where portable 
> > software has to have special alignment code just for Linux when there is 
> > no real technical reason for that. In practice, this code often does not 
> > exist. Then the question is, why does writing to tape in Linux use, say, 
> > 30 % CPU time when the same thing in xxx Unix uses 1 % CPU time.
> 
> Please share how you arrived at those numbers, if you get that bad
> performance you are doing something horribly wrong elsewhere. Doing
> 50MiB/sec single transfer from a hard drive with either bio_map_user()
> or bio_copy_user() backing shows at most 1% sys time difference on a
> dual P3-800MHz. How fast are these tape drives?
> 
On 20 Nov 2003 I posted to linux-scsi and linux-usb-development the 
following results:
-----------------------8<--------------------------------------------
Before implementing direct i/o, I made some measurements with the hardware
available. The tape drive was HP DDS-4 and computer dual PIII 700 MHz. The
results with a test program using well-compressible data (zeroes):

Buffer length 32768 bytes, 32768 buffers written (1024.000 MB, 1073741824 
bytes)
Variable block mode (writing 32768 byte blocks).
dio:
write: wall  60.436 user   0.015 (  0.0 %) system   0.485 (  0.8 %) speed 
16.944 MB/s
read:  wall  64.580 user   0.014 (  0.0 %) system   0.500 (  0.8 %) speed 
15.856 MB/s
no dio:
write: wall  61.373 user   0.024 (  0.0 %) system   2.897 (  4.7 %) speed 
16.685 MB/s
read:  wall  66.435 user   0.055 (  0.1 %) system   6.347 (  9.6 %) speed 
15.413 MB/s

The common high-end drives at that time streamed at 30-40 MB/s assuming
2:1 compression. It meant that about 20 % of cpu was needed for extra
copies when reading. Other people reported even higher percentages with
other hardware.

I just repeated the tests using the same drive but a 2.6 GHz PIV on a
Intel D875 motherboard and dual channel memory. The results were:

dio:
write: wall  58.668 user   0.013 (  0.0 %) system   0.217 (  0.4 %) speed  
17.045 MB/s
read:  wall  63.425 user   0.011 (  0.0 %) system   0.245 (  0.4 %) speed  
15.767 MB/s
no dio:
write: wall  58.713 user   0.017 (  0.0 %) system   0.534 (  0.9 %) speed  
17.032 MB/s
read:  wall  63.265 user   0.020 (  0.0 %) system   0.805 (  1.3 %) speed  
15.807 MB/s

Current ordinary drives stream up to 70 MB/s (assuming 2:1 compression).
This would lead to over 5 % overhead per drive when not doing dio.
---------------------------8<---------------------------------------------

The tests were done using a simple program that does the writes and 
reads and reports the times from gettimeofday() and getrusage().

I just looked at the specs of a new LTO-3 drive. They claim sustained 
compressed rate of 490 GB/h which makes 136 MB/s. The recent big tape 
drives are fast and, since they are streamers, the system has to be able 
to provide this rate for hours...

> There's no questioning that bio_map_user() will be faster for larger
> transfers (4kb and up, I'm guessing), but there's no way that you can
> claim 30% sys time for a tape drive without backing that up. Where was
> this time spent?
> 
I have not profiled the code. The st driver uses the same loop for direct 
and buffered transfers. The only difference is that in the buffered case 
the data is copied to/from the internal buffer before/after the SCSI 
transactions.

As you can see from the results and project from them, the system time 
varies a lot depending on the hardware. The differences between the older 
system and the newer system are surprisingly big but, if you look at 
system architectures, the results are what you actually should expect.

I am not claiming that the cpu usage is in all cases significant with 
buffered transfers. Just that there exist cases where it is significant 
(big drive connected to a not very fast computer). This has motivated 
doing the code. (And the fact that other Unices do this, too).

...
> > I have sometimes looked at the bio code and thought about changing to use 
> > it. The problem has been that it is too limited for use in st. It has 
> > become better but it still is not good enough to replace both the data 
> > mapping and the driver's buffering. A quick look through the code suggests 
> > that it still can't handle big requests (up to 6 MB). Please tell me if I 
> > am wrong.
> 
> I don't think you should. As I wrote, write a scsi mid layer helper that
> will map the data for you into a scsi structure. I think it's a mistake
> that st/sg attempts to handle this business themselves.
> 
I think I again did not express myself clearly. By converting st to use 
bio I meant writing the helper you suggested and using it in st. I was 
thinking about the end result (st using bio).

> You can't map 6MB into a single bio, but you can string them together if
> you want such a big request.
> 
OK, I will look at that when I have time. The need for big requests comes 
from reading existing tapes with multimegabyte blocks. A block has to be 
read with a single SCSI read and this is why the request has to be so big 
in these cases.

-- 
Kai

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: IO buffer alignment for block devices
  2004-09-13 22:08               ` Kai Makisara
@ 2004-09-14  7:44                 ` Jens Axboe
  2004-09-14 17:39                   ` Kai Makisara
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2004-09-14  7:44 UTC (permalink / raw)
  To: Kai Makisara; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List

On Tue, Sep 14 2004, Kai Makisara wrote:
> On Mon, 13 Sep 2004, Jens Axboe wrote:
> 
> > On Mon, Sep 13 2004, Kai Makisara wrote:
> > > On Mon, 13 Sep 2004, Jens Axboe wrote:
> > > 
> > > > On Mon, Sep 13 2004, Kai Makisara wrote:
> ...
> > > I agree. That is why I have not objected to the patch that changed the 
> > > default back to 512. That is why I reviewed sym53c8xx_2 and, to me, it 
> > > seemed safe to try 8 byte alignment with this driver. The proper way is 
> > > to have safe default and let the LLDDs relax the alignment. The only 
> > > practical problem is how to get this done but we probably have to live 
> > > with this. Besides we were talking about SCSI.
> > 
> > This seemed to be a device problem, not an adapter problem.
> > 
> The tape drives at the other end of the SCSI cable don't have any 
> restrictions. The restrictions are at the computer end of the cable.

I'm talking about the specific ide-cd case.

> > I'm curious where the alignment is really documented, seems to be we are
> > stabbing in the dark.
> > 
> I have not seen any Linux documentation on the whole problem. The 
> requirements have to be determined case by case considering the whole data 
> path from the source to the target. The low-level driver (writer) should 
> know the requirements for the adapter but this should be combined with the 
> requirements coming from the i/o architecture of the computer. One might 
> design code where the low-level driver suggests an alignment that 
> architecture-dependent code then modifies if necessary.

I completely agree, and right now there are many bits and pieces missing
from that chain. Which is why we have to default to something a little
aggressive...

> > > > And finally (and most importantly), why do you care so much?
> > > 
> > > I do care because the non-bounced transfers give measurable decrease in 
> > > cpu time and bus use. And I don't like the situation where portable 
> > > software has to have special alignment code just for Linux when there is 
> > > no real technical reason for that. In practice, this code often does not 
> > > exist. Then the question is, why does writing to tape in Linux use, say, 
> > > 30 % CPU time when the same thing in xxx Unix uses 1 % CPU time.
> > 
> > Please share how you arrived at those numbers, if you get that bad
> > performance you are doing something horribly wrong elsewhere. Doing
> > 50MiB/sec single transfer from a hard drive with either bio_map_user()
> > or bio_copy_user() backing shows at most 1% sys time difference on a
> > dual P3-800MHz. How fast are these tape drives?
> > 
> On 20 Nov 2003 I posted to linux-scsi and linux-usb-development the 
> following results:
> -----------------------8<--------------------------------------------
> Before implementing direct i/o, I made some measurements with the hardware
> available. The tape drive was HP DDS-4 and computer dual PIII 700 MHz. The
> results with a test program using well-compressible data (zeroes):
> 
> Buffer length 32768 bytes, 32768 buffers written (1024.000 MB, 1073741824 
> bytes)
> Variable block mode (writing 32768 byte blocks).
> dio:
> write: wall  60.436 user   0.015 (  0.0 %) system   0.485 (  0.8 %) speed 
> 16.944 MB/s
> read:  wall  64.580 user   0.014 (  0.0 %) system   0.500 (  0.8 %) speed 
> 15.856 MB/s
> no dio:
> write: wall  61.373 user   0.024 (  0.0 %) system   2.897 (  4.7 %) speed 
> 16.685 MB/s
> read:  wall  66.435 user   0.055 (  0.1 %) system   6.347 (  9.6 %) speed 
> 15.413 MB/s
> 
> The common high-end drives at that time streamed at 30-40 MB/s assuming
> 2:1 compression. It meant that about 20 % of cpu was needed for extra
> copies when reading. Other people reported even higher percentages with
> other hardware.

Just tried 70MiB/sec on my test box, and it is going rather slowly with
non-direct io. So your numbers don't look all that bad. Pretty strange,
I don't remember it being this bad. This is a profile from a forced
unaligned 128KiB contiguous transfer from two SCSI disks, doing 40 and
30MiB/sec at the same time:

 89887 default_idle                             1404.4844
 24357 __copy_to_user_ll                        217.4732
   512 __generic_unplug_device                    8.0000
   279 __free_pages                               3.4875
   291 sym53c8xx_intr                             1.8188
   137 blk_rq_unmap_user                          1.4271
   260 scsi_end_request                           1.2500
   210 blk_execute_rq                             1.0938
    17 free_hot_page                              1.0625

Compared to direct io:

118502 default_idle                             1851.5938
   138 __generic_unplug_device                    2.1562
    50 unlock_page                                1.5625
   117 sym53c8xx_intr                             0.7312
    37 set_page_dirty_lock                        0.3854
    30 set_page_dirty                             0.3750
    73 scsi_end_request                           0.3510
    49 __bio_unmap_user                           0.3403

For 512-byte reads, the results are comparable (aggregate bandwidth):

non-dio:

 91038 default_idle                             1422.4688
  2843 __generic_unplug_device                   44.4219
  3014 sym53c8xx_intr                            18.8375
  1694 __copy_to_user_ll                         15.1250
  1739 scsi_end_request                           8.3606
   779 blk_rq_unmap_user                          8.1146
   571 blk_run_queue                              7.1375

dio:

 92280 default_idle                             1441.8750
  2824 __generic_unplug_device                   44.1250
  2926 sym53c8xx_intr                            18.2875
  1608 scsi_end_request                           7.7308
   716 blk_rq_unmap_user                          7.4583
   566 blk_run_queue                              7.0750
  1284 blk_execute_rq                             6.6875


> I just repeated the tests using the same drive but a 2.6 GHz PIV on a
> Intel D875 motherboard and dual channel memory. The results were:
> 
> dio:
> write: wall  58.668 user   0.013 (  0.0 %) system   0.217 (  0.4 %) speed  
> 17.045 MB/s
> read:  wall  63.425 user   0.011 (  0.0 %) system   0.245 (  0.4 %) speed  
> 15.767 MB/s
> no dio:
> write: wall  58.713 user   0.017 (  0.0 %) system   0.534 (  0.9 %) speed  
> 17.032 MB/s
> read:  wall  63.265 user   0.020 (  0.0 %) system   0.805 (  1.3 %) speed  
> 15.807 MB/s

That looks a lot closer to what I would expect - non-dio using a little
more sys time at the same transfer rate, still a little high though. You
must be doing something else differently as well. Are your transfer sizes
as big for non-dio as with dio? Are you using sg for non-dio?

> > There's no questioning that bio_map_user() will be faster for larger
> > transfers (4kb and up, I'm guessing), but there's no way that you can
> > claim 30% sys time for a tape drive without backing that up. Where was
> > this time spent?
> > 
> I have not profiled the code. The st driver uses the same loop for direct 
> and buffered transfers. The only difference is that in the buffered case 
> the data is copied to/from the internal buffer before/after the SCSI 
> transactions.

That's truly the only difference?

> As you can see from the results and project from them, the system time 
> varies a lot depending on the hardware. The differences between the older 
> system and the newer system are surprisingly big but, if you look at 
> system architectures, the results are what you actually should expect.

Agree.

> I am not claiming that the cpu usage is in all cases significant with 
> buffered transfers. Just that there exist cases where it is significant 
> (big drive connected to a not very fast computer). This has motivated 
> doing the code. (And the fact that other Unices do this, too).

Sure.

> > You can't map 6MB into a single bio, but you can string them together if
> > you want such a big request.
> > 
> OK, I will look at that when I have time. The need for big requests comes 
> from reading existing tapes with multimegabyte blocks. A block has to be 
> read with a single SCSI read and this is why the request has to be so big 
> in these cases.

I understand.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: IO buffer alignment for block devices
  2004-09-14  7:44                 ` Jens Axboe
@ 2004-09-14 17:39                   ` Kai Makisara
  0 siblings, 0 replies; 11+ messages in thread
From: Kai Makisara @ 2004-09-14 17:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List

On Tue, 14 Sep 2004, Jens Axboe wrote:

> On Tue, Sep 14 2004, Kai Makisara wrote:
> > On Mon, 13 Sep 2004, Jens Axboe wrote:
> > 
...
> > On 20 Nov 2003 I posted to linux-scsi and linux-usb-development the 
> > following results:
> > -----------------------8<--------------------------------------------
> > Before implementing direct i/o, I made some measurements with the hardware
> > available. The tape drive was HP DDS-4 and computer dual PIII 700 MHz. The
> > results with a test program using well-compressible data (zeroes):
> > 
> > Buffer length 32768 bytes, 32768 buffers written (1024.000 MB, 1073741824 
> > bytes)
> > Variable block mode (writing 32768 byte blocks).
> > dio:
> > write: wall  60.436 user   0.015 (  0.0 %) system   0.485 (  0.8 %) speed 
> > 16.944 MB/s
> > read:  wall  64.580 user   0.014 (  0.0 %) system   0.500 (  0.8 %) speed 
> > 15.856 MB/s
> > no dio:
> > write: wall  61.373 user   0.024 (  0.0 %) system   2.897 (  4.7 %) speed 
> > 16.685 MB/s
> > read:  wall  66.435 user   0.055 (  0.1 %) system   6.347 (  9.6 %) speed 
> > 15.413 MB/s
> > 
> > The common high-end drives at that time streamed at 30-40 MB/s assuming
> > 2:1 compression. It meant that about 20 % of cpu was needed for extra
> > copies when reading. Other people reported even higher percentages with
> > other hardware.
> 
> Just tried 70MiB/sec on my test box, and it is going rather slowly with
> non direct io. So your numbers don't look all that bad. Pretty strange,
> I don't remember it being this bad. This is a profile from a forced
> unaligned 128KiB contig transfer from two SCSI disks, doing 40 and
> 30MiB/sec at the same time:
> 
>  89887 default_idle                             1404.4844
>  24357 __copy_to_user_ll                        217.4732
...
> For 512-byte reads, the results are comparable (aggregate bandwidth):
> 
> non-dio:
> 
>  91038 default_idle                             1422.4688
>   2843 __generic_unplug_device                   44.4219
>   3014 sym53c8xx_intr                            18.8375
>   1694 __copy_to_user_ll                         15.1250

Interesting results. I also ran tests with other block sizes (8192 - 
131072 bytes) on that P3 system (VIA chipset). Looking at my notes now 
I can see the same trend: the system time percentage increases quite a lot 
as the block size gets larger!

I did similar tests with the P4 system. No trend was seen. The read/write 
system time balance changed a little but the averages were very close.

...
> > I just repeated the tests using the same drive but a 2.6 GHz PIV on a
> > Intel D875 motherboard and dual channel memory. The results were:
> > 
> > dio:
> > write: wall  58.668 user   0.013 (  0.0 %) system   0.217 (  0.4 %) speed  
> > 17.045 MB/s
> > read:  wall  63.425 user   0.011 (  0.0 %) system   0.245 (  0.4 %) speed  
> > 15.767 MB/s
> > no dio:
> > write: wall  58.713 user   0.017 (  0.0 %) system   0.534 (  0.9 %) speed  
> > 17.032 MB/s
> > read:  wall  63.265 user   0.020 (  0.0 %) system   0.805 (  1.3 %) speed  
> > 15.807 MB/s
> 
> That looks a lot closer to what I would expect - non-dio using a little
> more sys time at the same transfer rate, still a little high though. You
> must be doing something else differently as well. Are your transfer sizes
> as big for non-dio as with dio? Are you using sg for non-dio?
> 
The transfer sizes are the same. This is actually the only possibility 
with variable block mode. Sg is used for non-dio. However, the driver tries 
to allocate chunks of memory as large as it can get, and the buffer 
probably consisted of only one sg segment with this block size.

...
> > I have not profiled the code. The st driver uses the same loop for direct 
> > and buffered transfers. The only difference is that in the buffered case 
> > the data is copied to/from the internal buffer before/after the SCSI 
> > transactions.
> 
> That's truly the only difference?
> 
Yes. Most of the code is checking results and updating status. This does 
not depend on how the data is transferred.

-- 
Kai

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2004-09-14 17:39 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-09-13 12:48 IO buffer alignment for block devices Saeed Bishara
2004-09-13 14:14 ` James Bottomley
2004-09-13 15:16   ` Saeed Bishara
2004-09-13 15:28     ` James Bottomley
2004-09-13 17:12       ` Kai Makisara
2004-09-13 19:02         ` Jens Axboe
2004-09-13 19:56           ` Kai Makisara
2004-09-13 20:23             ` Jens Axboe
2004-09-13 22:08               ` Kai Makisara
2004-09-14  7:44                 ` Jens Axboe
2004-09-14 17:39                   ` Kai Makisara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox