* IO buffer alignment for block devices
@ 2004-09-13 12:48 Saeed Bishara
2004-09-13 14:14 ` James Bottomley
0 siblings, 1 reply; 11+ messages in thread
From: Saeed Bishara @ 2004-09-13 12:48 UTC (permalink / raw)
To: linux-scsi
Hi,
Is there any minimal alignment for the IO buffers (sg and single)
sent to a scsi LLD?
I found that buffers that come from the buffer cache, page cache and sg
are page aligned, and the raw driver sends sector-aligned buffers.
So can I assume that the minimal alignment of the buffer is 512 bytes?
thanks
saeed
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: IO buffer alignment for block devices
2004-09-13 12:48 IO buffer alignment for block devices Saeed Bishara
@ 2004-09-13 14:14 ` James Bottomley
2004-09-13 15:16 ` Saeed Bishara
From: James Bottomley @ 2004-09-13 14:14 UTC (permalink / raw)
To: Saeed Bishara; +Cc: SCSI Mailing List
On Mon, 2004-09-13 at 08:48, Saeed Bishara wrote:
> Is there any minimal alignment for the IO buffers (sg and single)
> sent to a scsi LLD?
> I found that buffers that come from buffer cache, page cache and sg
> are page aligned , and the raw driver send sector aligned buffers.
> so can I assume that the minimal alignment of the buffer is 512 bytes?
You can assume it will be whatever you set blk_queue_dma_alignment() to
in your slave_configure routine. If the LLD doesn't set this, then it
will be 8 bytes. However, beware: setting this higher can cause
user-generated commands to be errored if they fail the alignment checks.
James
* Re: IO buffer alignment for block devices
2004-09-13 14:14 ` James Bottomley
@ 2004-09-13 15:16 ` Saeed Bishara
2004-09-13 15:28 ` James Bottomley
From: Saeed Bishara @ 2004-09-13 15:16 UTC (permalink / raw)
To: James Bottomley; +Cc: SCSI Mailing List
James Bottomley wrote:
>On Mon, 2004-09-13 at 08:48, Saeed Bishara wrote:
>
>
>> Is there any minimal alignment for the IO buffers (sg and single)
>>sent to a scsi LLD?
>> I found that buffers that come from buffer cache, page cache and sg
>>are page aligned , and the raw driver send sector aligned buffers.
>> so can I assume that the minimal alignment of the buffer is 512 bytes?
>>
>>
>
>You can assume it will be whatever you set blk_queue_dma_alignment() to
>in your slave configure routines. If the LLD doesn't set this, then it
>will be 8 bytes. However, beware, setting this higher can cause user
>
>
Isn't the default 512? That is what I see in blk_queue_make_request:
<http://lxr.linux.no/ident?v=2.6.8.1;i=blk_queue_make_request>
>generated commands to be errored if they fail the alignment checks.
>
>James
>
>
>
Any ideas about lk 2.4?
* Re: IO buffer alignment for block devices
2004-09-13 15:16 ` Saeed Bishara
@ 2004-09-13 15:28 ` James Bottomley
2004-09-13 17:12 ` Kai Makisara
From: James Bottomley @ 2004-09-13 15:28 UTC (permalink / raw)
To: Saeed Bishara; +Cc: SCSI Mailing List
On Mon, 2004-09-13 at 11:16, Saeed Bishara wrote:
> isn't the default 512, this is what I see in blk_queue_make_request.
>
> <http://lxr.linux.no/ident?v=2.6.8.1;i=blk_queue_make_request>
In the block layer, yes. In SCSI we tune that down to be 8 for backward
compatibility with 2.4, I think, but LLDs can tune it up again.
> any Ideas about lk 2.4?
Not really; since it's end of life, you could probably get away with not
backporting to 2.4.
James
* Re: IO buffer alignment for block devices
2004-09-13 15:28 ` James Bottomley
@ 2004-09-13 17:12 ` Kai Makisara
2004-09-13 19:02 ` Jens Axboe
From: Kai Makisara @ 2004-09-13 17:12 UTC (permalink / raw)
To: James Bottomley; +Cc: Saeed Bishara, SCSI Mailing List
On Mon, 13 Sep 2004, James Bottomley wrote:
> On Mon, 2004-09-13 at 11:16, Saeed Bishara wrote:
> > isn't the default 512, this is what I see in blk_queue_make_request.
> >
> > <http://lxr.linux.no/ident?v=2.6.8.1;i=blk_queue_make_request>
>
> In the block layer, yes. In SCSI we tune that down to be 8 for backward
> compatibility with 2.4, I think, but LLDs can tune it up again.
>
We don't. The alignment has been 512 bytes since this patch in 2.6.4:
<James.Bottomley@SteelEye.com>
[PATCH] Undo SCSI 8-byte alignment relaxation
This makes the default alignment requirements be 512 bytes for SCSI,
the way it used to be.
Jens will fix the SCSI layer problems, but low-level drivers might have
other restrictions on alignment.
As you can guess, I have not been too happy with this situation ;-)
Any driver can change this restriction in slave_configure(). I have for
some time had the following patch in my system to change the alignment for
my tape drive to 8 bytes:
--- linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c~	2004-08-27 19:33:34.000000000 +0300
+++ linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c	2004-08-28 12:54:49.000000000 +0300
@@ -1097,6 +1097,8 @@
 	spi_dv_device(device);
 
+	blk_queue_dma_alignment(device->request_queue, 7);
+
 	return 0;
 }
--
Kai
* Re: IO buffer alignment for block devices
2004-09-13 17:12 ` Kai Makisara
@ 2004-09-13 19:02 ` Jens Axboe
2004-09-13 19:56 ` Kai Makisara
From: Jens Axboe @ 2004-09-13 19:02 UTC (permalink / raw)
To: Kai Makisara; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List
On Mon, Sep 13 2004, Kai Makisara wrote:
> On Mon, 13 Sep 2004, James Bottomley wrote:
>
> > On Mon, 2004-09-13 at 11:16, Saeed Bishara wrote:
> > > isn't the default 512, this is what I see in blk_queue_make_request.
> > >
> > > <http://lxr.linux.no/ident?v=2.6.8.1;i=blk_queue_make_request>
> >
> > In the block layer, yes. In SCSI we tune that down to be 8 for backward
> > compatibility with 2.4, I think, but LLDs can tune it up again.
> >
> We don't. The alignment has been 512 bytes after this patch in 2.6.4:
>
> <James.Bottomley@SteelEye.com>
> [PATCH] Undo SCSI 8-byte alignment relaxation
>
> This makes the default alignment requirements be 512 bytes for SCSI,
> the way it used to be.
>
> Jens will fix the SCSI layer problems, but low-level drivers might have
> other restrictions on alignment.
>
> As you can guess, I have not been too happy with this situation ;-)
>
> Any driver can change this restriction in slave_configure(). I have for
> some time had the following patch in my system to change the alignment for
> my tape drive to 8 bytes:
I'm not sure 8 is a safe alignment at all, at least for ide-cd there
were cases that simply didn't work well with 8-byte alignment.
> --- linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c~ 2004-08-27 19:33:34.000000000 +0300
> +++ linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c 2004-08-28 12:54:49.000000000 +0300
> @@ -1097,6 +1097,8 @@
>
> spi_dv_device(device);
>
> + blk_queue_dma_alignment(device->request_queue, 7);
> +
> return 0;
> }
Why don't you just do this in st? Not that it matters, since it's just
your private hack - just wondering.
And finally (and most importantly), why do you care so much? You should
rip some of that manual data mapping out of st completely. Add a
scsi_rq_map_user or something that returns a struct scsi_request, and
uses blk_rq_map_user (which will do bio_map_user() or bio_copy_user())
to map the user data into your SCSI request. I seriously doubt that st
will see any performance difference between the two at all, and st
really should be using proper infrastructure for this. sg as well, but
since it's dying anyways...
--
Jens Axboe
* Re: IO buffer alignment for block devices
2004-09-13 19:02 ` Jens Axboe
@ 2004-09-13 19:56 ` Kai Makisara
2004-09-13 20:23 ` Jens Axboe
From: Kai Makisara @ 2004-09-13 19:56 UTC (permalink / raw)
To: Jens Axboe; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List
On Mon, 13 Sep 2004, Jens Axboe wrote:
> On Mon, Sep 13 2004, Kai Makisara wrote:
...
> > Any driver can change this restriction in slave_configure(). I have for
> > some time had the following patch in my system to change the alignment for
> > my tape drive to 8 bytes:
>
> I'm not sure 8 is a safe alignment at all, at least for ide-cd there
> were cases that simply didn't work well with 8-byte alignment.
>
I agree. That is why I have not objected to the patch that changed the
default back to 512, and why I reviewed sym53c8xx_2: to me, it seemed
safe to try 8-byte alignment with this driver. The proper way is to have
a safe default and let the LLDDs relax the alignment. The only practical
problem is how to get this done, but we probably have to live with this.
Besides, we were talking about SCSI.
> > --- linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c~ 2004-08-27 19:33:34.000000000 +0300
> > +++ linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c 2004-08-28 12:54:49.000000000 +0300
> > @@ -1097,6 +1097,8 @@
> >
> > spi_dv_device(device);
> >
> > + blk_queue_dma_alignment(device->request_queue, 7);
> > +
> > return 0;
> > }
>
> Why don't you just do this in st? Not that it matters, since it's just
> your private hack - just wondering.
>
Because the high-level driver should not guess what the low-level driver
requirements are.
> And finally (and most importantly), why do you care so much?
I do care because the non-bounced transfers give a measurable decrease
in cpu time and bus use. And I don't like the situation where portable
software has to have special alignment code just for Linux when there is
no real technical reason for it. In practice, this code often does not
exist. Then the question is: why does writing to tape in Linux use, say,
30% CPU time when the same thing on xxx Unix uses 1% CPU time?
> You should
> rip some of that manual data mapping out of st completely. Add a
> scsi_rq_map_user or something that returns a struct scsi_request, and
> uses blk_rq_map_user (which will do bio_map_user() or bio_copy_user())
> to map the user data into your SCSI request. I seriously doubt that st
> will see any performance difference between the two at all, and st
> really should be using proper infrastructure for this. sg as well, but
> since it's dying anyways...
>
The manual data mapping exists in st and sg because when the code was
made, nothing better was available. It was meant to be a proof of concept
and it still is. However, it has at least one good property: it works.
I don't think you can get around the alignment by using bio_map_user(): it
tests alignment for both the buffer start and the transfer length. So,
converting the data mapping does not change the alignment handling which
is what we were talking about.
I have sometimes looked at the bio code and thought about changing to use
it. The problem has been that it is too limited for use in st. It has
become better but it still is not good enough to replace both the data
mapping and the driver's buffering. A quick look through the code suggests
that it still can't handle big requests (up to 6 MB). Please tell me if I
am wrong.
--
Kai
* Re: IO buffer alignment for block devices
2004-09-13 19:56 ` Kai Makisara
@ 2004-09-13 20:23 ` Jens Axboe
2004-09-13 22:08 ` Kai Makisara
From: Jens Axboe @ 2004-09-13 20:23 UTC (permalink / raw)
To: Kai Makisara; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List
On Mon, Sep 13 2004, Kai Makisara wrote:
> On Mon, 13 Sep 2004, Jens Axboe wrote:
>
> > On Mon, Sep 13 2004, Kai Makisara wrote:
> ...
> > > Any driver can change this restriction in slave_configure(). I have for
> > > some time had the following patch in my system to change the alignment for
> > > my tape drive to 8 bytes:
> >
> > I'm not sure 8 is a safe alignment at all, at least for ide-cd there
> > were cases that simply didn't work well with 8-byte alignment.
> >
> I agree. That is why I have not objected to the patch that changed the
> default back to 512. That is why I reviewed sym53c8xx_2 and, to me, it
> seemed safe to to try 8 byte alignment with this driver. The proper way is
> to have safe default and let the LLDDs relax the alignment. The only
> practical problem is how to get this done but we probably have to live
> with this. Besides we were talking about SCSI.
This seemed to be a device problem, not an adapter problem.
I'm curious where the alignment is really documented; it seems we are
stabbing in the dark.
> > > --- linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c~ 2004-08-27 19:33:34.000000000 +0300
> > > +++ linux-2.6.9-rc1-bk1/drivers/scsi/sym53c8xx_2/sym_glue.c 2004-08-28 12:54:49.000000000 +0300
> > > @@ -1097,6 +1097,8 @@
> > >
> > > spi_dv_device(device);
> > >
> > > + blk_queue_dma_alignment(device->request_queue, 7);
> > > +
> > > return 0;
> > > }
> >
> > Why don't you just do this in st? Not that it matters, since it's just
> > your private hack - just wondering.
> >
> Because the high-level driver should not guess what the low-level driver
> requirements are.
>
> > And finally (and most importantly), why do you care so much?
>
> I do care because the non-bounced transfers give measurable decrease in
> cpu time and bus use. And I don't like the situation where portable
> software has to have special alignment code just for Linux when there is
> no real technical reason for that. In practice, this code often does not
> exist. Then the question is, why does writing to tape in Linux use, say,
> 30 % CPU time when the same thing in xxx Unix uses 1 % CPU time.
Please share how you arrived at those numbers; if you get that bad
performance, you are doing something horribly wrong elsewhere. Doing a
50MiB/sec single transfer from a hard drive with either bio_map_user()
or bio_copy_user() backing shows at most 1% sys time difference on a
dual P3-800MHz. How fast are these tape drives?
There's no questioning that bio_map_user() will be faster for larger
transfers (4kb and up, I'm guessing), but there's no way that you can
claim 30% sys time for a tape drive without backing that up. Where was
this time spent?
> > rip some of that manual data mapping out of st completely. Add a
> > scsi_rq_map_user or something that returns a struct scsi_request, and
> > uses blk_rq_map_user (which will do bio_map_user() or bio_copy_user())
> > to map the user data into your SCSI request. I seriously doubt that st
> > will see any performance difference between the two at all, and st
> > really should be using proper infrastructure for this. sg as well, but
> > since it's dying anyways...
> >
> The manual data mapping exists in st and sg because when the code was
> made, nothing better was available. It was meant to be a proof of concept
> and it still is. However, it has at least one good property: it works.
Nothing as permanent as temporary driver code :)
> I don't think you can get around the alignment by using bio_map_user(): it
> tests alignment for both the buffer start and the transfer length. So,
> converting the data mapping does not change the alignment handling which
> is what we were talking about.
Of course not, bio_map_user() data must be aligned, that's the whole
premise of it. But bio_copy_user() will align the data for you. Look up
blk_rq_map_user() like I described.
> I have sometimes looked at the bio code and thought about changing to use
> it. The problem has been that it is too limited for use in st. It has
> become better but it still is not good enough to replace both the data
> mapping and the driver's buffering. A quick look through the code suggests
> that it still can't handle big requests (up to 6 MB). Please tell me if I
> am wrong.
I don't think you should. As I wrote, write a scsi mid layer helper that
will map the data for you into a scsi structure. I think it's a mistake
that st/sg attempts to handle this business themselves.
You can't map 6MB into a single bio, but you can string them together if
you want such a big request.
--
Jens Axboe
* Re: IO buffer alignment for block devices
2004-09-13 20:23 ` Jens Axboe
@ 2004-09-13 22:08 ` Kai Makisara
2004-09-14 7:44 ` Jens Axboe
From: Kai Makisara @ 2004-09-13 22:08 UTC (permalink / raw)
To: Jens Axboe; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List
On Mon, 13 Sep 2004, Jens Axboe wrote:
> On Mon, Sep 13 2004, Kai Makisara wrote:
> > On Mon, 13 Sep 2004, Jens Axboe wrote:
> >
> > > On Mon, Sep 13 2004, Kai Makisara wrote:
...
> > I agree. That is why I have not objected to the patch that changed the
> > default back to 512. That is why I reviewed sym53c8xx_2 and, to me, it
> > seemed safe to to try 8 byte alignment with this driver. The proper way is
> > to have safe default and let the LLDDs relax the alignment. The only
> > practical problem is how to get this done but we probably have to live
> > with this. Besides we were talking about SCSI.
>
> This seemed to be a device problem, not an adapter problem.
>
The tape drives at the other end of the SCSI cable don't have any
restrictions. The restrictions are at the computer end of the cable.
> I'm curious where the alignment is really documented, seems to be we are
> stabbing in the dark.
>
I have not seen any Linux documentation on the whole problem. The
requirements have to be determined case by case considering the whole data
path from the source to the target. The low-level driver (writer) should
know the requirements for the adapter but this should be combined with the
requirements coming from the i/o architecture of the computer. One might
design code where the low-level driver suggests an alignment that
architecture-dependent code then modifies if necessary.
...
> > > And finally (and most importantly), why do you care so much?
> >
> > I do care because the non-bounced transfers give measurable decrease in
> > cpu time and bus use. And I don't like the situation where portable
> > software has to have special alignment code just for Linux when there is
> > no real technical reason for that. In practice, this code often does not
> > exist. Then the question is, why does writing to tape in Linux use, say,
> > 30 % CPU time when the same thing in xxx Unix uses 1 % CPU time.
>
> Please share how you arrived at those numbers, if you get that bad
> performance you are doing something horribly wrong elsewhere. Doing
> 50MiB/sec single transfer from a hard drive with either bio_map_user()
> or bio_copy_user() backing shows at most 1% sys time difference on a
> dual P3-800MHz. How fast are these tape drives?
>
On 20 Nov 2003 I posted to linux-scsi and linux-usb-development the
following results:
-----------------------8<--------------------------------------------
Before implementing direct i/o, I made some measurements with the hardware
available. The tape drive was HP DDS-4 and computer dual PIII 700 MHz. The
results with a test program using well-compressible data (zeroes):
Buffer length 32768 bytes, 32768 buffers written (1024.000 MB, 1073741824 bytes)
Variable block mode (writing 32768 byte blocks).
dio:
write: wall 60.436 user 0.015 ( 0.0 %) system 0.485 ( 0.8 %) speed 16.944 MB/s
read: wall 64.580 user 0.014 ( 0.0 %) system 0.500 ( 0.8 %) speed 15.856 MB/s
no dio:
write: wall 61.373 user 0.024 ( 0.0 %) system 2.897 ( 4.7 %) speed 16.685 MB/s
read: wall 66.435 user 0.055 ( 0.1 %) system 6.347 ( 9.6 %) speed 15.413 MB/s
The common high-end drives at that time streamed at 30-40 MB/s assuming
2:1 compression. It meant that about 20 % of cpu was needed for extra
copies when reading. Other people reported even higher percentages with
other hardware.
I just repeated the tests using the same drive but a 2.6 GHz PIV on a
Intel D875 motherboard and dual channel memory. The results were:
dio:
write: wall 58.668 user 0.013 ( 0.0 %) system 0.217 ( 0.4 %) speed 17.045 MB/s
read: wall 63.425 user 0.011 ( 0.0 %) system 0.245 ( 0.4 %) speed 15.767 MB/s
no dio:
write: wall 58.713 user 0.017 ( 0.0 %) system 0.534 ( 0.9 %) speed 17.032 MB/s
read: wall 63.265 user 0.020 ( 0.0 %) system 0.805 ( 1.3 %) speed 15.807 MB/s
Current ordinary drives stream up to 70 MB/s (assuming 2:1 compression).
This would lead to over 5 % overhead per drive when not doing dio.
---------------------------8<---------------------------------------------
The tests were done using a simple program that does the writes and
reads and reports the times from gettimeofday() and getrusage().
I just looked at the specs of a new LTO-3 drive. They claim a sustained
compressed rate of 490 GB/h, which makes 136 MB/s. The recent big tape
drives are fast and, since they are streamers, the system has to be able
to provide this rate for hours...
> There's no questioning that bio_map_user() will be faster for larger
> transfers (4kb and up, I'm guessing), but there's no way that you can
> claim 30% sys time for a tape drive without backing that up. Where was
> this time spent?
>
I have not profiled the code. The st driver uses the same loop for direct
and buffered transfers. The only difference is that in the buffered case
the data is copied to/from the internal buffer before/after the SCSI
transactions.
As you can see from the results and project from them, the system time
varies a lot depending on the hardware. The differences between the older
system and the newer system are surprisingly big but, if you look at
system architectures, the results are what you actually should expect.
I am not claiming that the cpu usage is in all cases significant with
buffered transfers. Just that there exist cases where it is significant
(big drive connected to a not very fast computer). This has motivated
doing the code. (And the fact that other Unices do this, too).
...
> > I have sometimes looked at the bio code and thought about changing to use
> > it. The problem has been that it is too limited for use in st. It has
> > become better but it still is not good enough to replace both the data
> > mapping and the driver's buffering. A quick look through the code suggests
> > that it still can't handle big requests (up to 6 MB). Please tell me if I
> > am wrong.
>
> I don't think you should. As I wrote, write a scsi mid layer helper that
> will map the data for you into a scsi structure. I think it's a mistake
> that st/sg attempts to handle this business themselves.
>
I think I again did not express myself clearly. By converting st to use
bio I meant writing the helper you suggested and using it in st. I was
thinking about the end result (st using bio).
> You can't map 6MB into a single bio, but you can string them together if
> you want such a big request.
>
OK, I will look at that when I have time. The need for big requests comes
from reading existing tapes with multimegabyte blocks. A block has to be
read with a single SCSI read and this is why the request has to be so big
in these cases.
--
Kai
* Re: IO buffer alignment for block devices
2004-09-13 22:08 ` Kai Makisara
@ 2004-09-14 7:44 ` Jens Axboe
2004-09-14 17:39 ` Kai Makisara
From: Jens Axboe @ 2004-09-14 7:44 UTC (permalink / raw)
To: Kai Makisara; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List
On Tue, Sep 14 2004, Kai Makisara wrote:
> On Mon, 13 Sep 2004, Jens Axboe wrote:
>
> > On Mon, Sep 13 2004, Kai Makisara wrote:
> > > On Mon, 13 Sep 2004, Jens Axboe wrote:
> > >
> > > > On Mon, Sep 13 2004, Kai Makisara wrote:
> ...
> > > I agree. That is why I have not objected to the patch that changed the
> > > default back to 512. That is why I reviewed sym53c8xx_2 and, to me, it
> > > seemed safe to to try 8 byte alignment with this driver. The proper way is
> > > to have safe default and let the LLDDs relax the alignment. The only
> > > practical problem is how to get this done but we probably have to live
> > > with this. Besides we were talking about SCSI.
> >
> > This seemed to be a device problem, not an adapter problem.
> >
> The tape drives at the other end of the SCSI cable don't have any
> restrictions. The restrictions are at the computer end of the cable.
I'm talking about the specific ide-cd case.
> > I'm curious where the alignment is really documented, seems to be we are
> > stabbing in the dark.
> >
> I have not seen any Linux documentation on the whole problem. The
> requirements have to be determined case by case considering the whole data
> path from the source to the target. The low-level driver (writer) should
> know the requirements for the adapter but this should be combined with the
> requirements coming from the i/o architecture of the computer. One might
> design code where the low-level driver suggests an alignment that
> architecture-dependent code then modifies if necessary.
I completely agree, and right now there are many bits and pieces missing
from that chain. Which is why we have to default to something a little
aggressive...
> > > > And finally (and most importantly), why do you care so much?
> > >
> > > I do care because the non-bounced transfers give measurable decrease in
> > > cpu time and bus use. And I don't like the situation where portable
> > > software has to have special alignment code just for Linux when there is
> > > no real technical reason for that. In practice, this code often does not
> > > exist. Then the question is, why does writing to tape in Linux use, say,
> > > 30 % CPU time when the same thing in xxx Unix uses 1 % CPU time.
> >
> > Please share how you arrived at those numbers, if you get that bad
> > performance you are doing something horribly wrong elsewhere. Doing
> > 50MiB/sec single transfer from a hard drive with either bio_map_user()
> > or bio_copy_user() backing shows at most 1% sys time difference on a
> > dual P3-800MHz. How fast are these tape drives?
> >
> On 20 Nov 2003 I posted to linux-scsi and linux-usb-development the
> following results:
> -----------------------8<--------------------------------------------
> Before implementing direct i/o, I made some measurements with the hardware
> available. The tape drive was HP DDS-4 and computer dual PIII 700 MHz. The
> results with a test program using well-compressible data (zeroes):
>
> Buffer length 32768 bytes, 32768 buffers written (1024.000 MB, 1073741824
> bytes)
> Variable block mode (writing 32768 byte blocks).
> dio:
> write: wall 60.436 user 0.015 ( 0.0 %) system 0.485 ( 0.8 %) speed
> 16.944 MB/s
> read: wall 64.580 user 0.014 ( 0.0 %) system 0.500 ( 0.8 %) speed
> 15.856 MB/s
> no dio:
> write: wall 61.373 user 0.024 ( 0.0 %) system 2.897 ( 4.7 %) speed
> 16.685 MB/s
> read: wall 66.435 user 0.055 ( 0.1 %) system 6.347 ( 9.6 %) speed
> 15.413 MB/s
>
> The common high-end drives at that time streamed at 30-40 MB/s assuming
> 2:1 compression. It meant that about 20 % of cpu was needed for extra
> copies when reading. Other people reported even higher percentages with
> other hardware.
Just tried 70MiB/sec on my test box, and it is going rather slowly with
non direct io. So your numbers don't look all that bad. Pretty strange,
I don't remember it being this bad. This is a profile from a forced
unaligned 128KiB contig transfer from two SCSI disks, doing 40 and
30MiB/sec at the same time:
89887 default_idle 1404.4844
24357 __copy_to_user_ll 217.4732
512 __generic_unplug_device 8.0000
279 __free_pages 3.4875
291 sym53c8xx_intr 1.8188
137 blk_rq_unmap_user 1.4271
260 scsi_end_request 1.2500
210 blk_execute_rq 1.0938
17 free_hot_page 1.0625
Compared to direct io:
118502 default_idle 1851.5938
138 __generic_unplug_device 2.1562
50 unlock_page 1.5625
117 sym53c8xx_intr 0.7312
37 set_page_dirty_lock 0.3854
30 set_page_dirty 0.3750
73 scsi_end_request 0.3510
49 __bio_unmap_user 0.3403
For 512-byte reads, the results are comparable (aggregate bandwidth):
non-dio:
91038 default_idle 1422.4688
2843 __generic_unplug_device 44.4219
3014 sym53c8xx_intr 18.8375
1694 __copy_to_user_ll 15.1250
1739 scsi_end_request 8.3606
779 blk_rq_unmap_user 8.1146
571 blk_run_queue 7.1375
dio:
92280 default_idle 1441.8750
2824 __generic_unplug_device 44.1250
2926 sym53c8xx_intr 18.2875
1608 scsi_end_request 7.7308
716 blk_rq_unmap_user 7.4583
566 blk_run_queue 7.0750
1284 blk_execute_rq 6.6875
> I just repeated the tests using the same drive but a 2.6 GHz PIV on a
> Intel D875 motherboard and dual channel memory. The results were:
>
> dio:
> write: wall 58.668 user 0.013 ( 0.0 %) system 0.217 ( 0.4 %) speed
> 17.045 MB/s
> read: wall 63.425 user 0.011 ( 0.0 %) system 0.245 ( 0.4 %) speed
> 15.767 MB/s
> no dio:
> write: wall 58.713 user 0.017 ( 0.0 %) system 0.534 ( 0.9 %) speed
> 17.032 MB/s
> read: wall 63.265 user 0.020 ( 0.0 %) system 0.805 ( 1.3 %) speed
> 15.807 MB/s
That looks a lot closer to what I would expect - non-dio using a little
more sys time at the same transfer rate, still a little high though. You
must be doing something else differently as well. Are your transfer sizes
as big for non-dio as with dio? Are you using sg for non-dio?
> > There's no questioning that bio_map_user() will be faster for larger
> > transfers (4kb and up, I'm guessing), but there's no way that you can
> > claim 30% sys time for a tape drive without backing that up. Where was
> > this time spent?
> >
> I have not profiled the code. The st driver uses the same loop for direct
> and buffered transfers. The only difference is that in the buffered case
> the data is copied to/from the internal buffer before/after the SCSI
> transactions.
That's truly the only difference?
> As you can see from the results and project from them, the system time
> varies a lot depending on the hardware. The differences between the older
> system and the newer system are surprisingly big but, if you look at
> system architectures, the results are what you actually should expect.
Agree.
> I am not claiming that the cpu usage is in all cases significant with
> buffered transfers. Just that there exist cases where it is significant
> (big drive connected to a not very fast computer). This has motivated
> doing the code. (And the fact that other Unices do this, too).
Sure.
> > You can't map 6MB into a single bio, but you can string them together if
> > you want such a big request.
> >
> OK, I will look at that when I have time. The need for big requests comes
> from reading existing tapes with multimegabyte blocks. A block has to be
> read with a single SCSI read and this is why the request has to be so big
> in these cases.
I understand.
--
Jens Axboe
* Re: IO buffer alignment for block devices
2004-09-14 7:44 ` Jens Axboe
@ 2004-09-14 17:39 ` Kai Makisara
From: Kai Makisara @ 2004-09-14 17:39 UTC (permalink / raw)
To: Jens Axboe; +Cc: James Bottomley, Saeed Bishara, SCSI Mailing List
On Tue, 14 Sep 2004, Jens Axboe wrote:
> On Tue, Sep 14 2004, Kai Makisara wrote:
> > On Mon, 13 Sep 2004, Jens Axboe wrote:
> >
...
> > On 20 Nov 2003 I posted to linux-scsi and linux-usb-development the
> > following results:
> > -----------------------8<--------------------------------------------
> > Before implementing direct i/o, I made some measurements with the hardware
> > available. The tape drive was HP DDS-4 and computer dual PIII 700 MHz. The
> > results with a test program using well-compressible data (zeroes):
> >
> > Buffer length 32768 bytes, 32768 buffers written (1024.000 MB, 1073741824
> > bytes)
> > Variable block mode (writing 32768 byte blocks).
> > dio:
> > write: wall 60.436 user 0.015 ( 0.0 %) system 0.485 ( 0.8 %) speed
> > 16.944 MB/s
> > read: wall 64.580 user 0.014 ( 0.0 %) system 0.500 ( 0.8 %) speed
> > 15.856 MB/s
> > no dio:
> > write: wall 61.373 user 0.024 ( 0.0 %) system 2.897 ( 4.7 %) speed
> > 16.685 MB/s
> > read: wall 66.435 user 0.055 ( 0.1 %) system 6.347 ( 9.6 %) speed
> > 15.413 MB/s
> >
> > The common high-end drives at that time streamed at 30-40 MB/s assuming
> > 2:1 compression. It meant that about 20 % of cpu was needed for extra
> > copies when reading. Other people reported even higher percentages with
> > other hardware.
>
> Just tried 70MiB/sec on my test box, and it is going rather slowly with
> non direct io. So your numbers don't look all that bad. Pretty strange,
> I don't remember it being this bad. This is a profile from a forced
> unaligned 128KiB contig transfer from two SCSI disks, doing 40 and
> 30MiB/sec at the same time:
>
> 89887 default_idle 1404.4844
> 24357 __copy_to_user_ll 217.4732
...
> For 512-byte reads, the results are comparable (aggregate bandwidth):
>
> non-dio:
>
> 91038 default_idle 1422.4688
> 2843 __generic_unplug_device 44.4219
> 3014 sym53c8xx_intr 18.8375
> 1694 __copy_to_user_ll 15.1250
Interesting results. I did tests also with other block sizes (8192 -
131072 bytes) with that P3 system (VIA chipset). Looking at the notes now
I can see the same trend: the system time percentage increases quite a lot
when the block size gets larger!
I did similar tests with the P4 system. No trend was seen. The read/write
system time balance changed a little but the averages were very close.
...
> > I just repeated the tests using the same drive but a 2.6 GHz PIV on a
> > Intel D875 motherboard and dual channel memory. The results were:
> >
> > dio:
> > write: wall 58.668 user 0.013 ( 0.0 %) system 0.217 ( 0.4 %) speed
> > 17.045 MB/s
> > read: wall 63.425 user 0.011 ( 0.0 %) system 0.245 ( 0.4 %) speed
> > 15.767 MB/s
> > no dio:
> > write: wall 58.713 user 0.017 ( 0.0 %) system 0.534 ( 0.9 %) speed
> > 17.032 MB/s
> > read: wall 63.265 user 0.020 ( 0.0 %) system 0.805 ( 1.3 %) speed
> > 15.807 MB/s
>
> That looks a lot closer to what I would expect - non-dio using a little
> more sys time at the same transfer rate, still a little high though. You
> must be doing something else differently as well. Are you transfer sizes
> as big for non-dio as with dio? Are you using sg for non-dio?
>
The transfer sizes are the same; this is actually the only possibility
with variable block mode. Sg is used for non-dio. However, the driver
tries to allocate as large chunks of memory as it can get, and the buffer
probably consisted of only one sg segment with this block size.
...
> > I have not profiled the code. The st driver uses the same loop for direct
> > and buffered transfers. The only difference is that in the buffered case
> > the data is copied to/from the internal buffer before/after the SCSI
> > transactions.
>
> That's truly the only difference?
>
Yes. Most of the code is checking results and updating status. This does
not depend on how the data is transferred.
--
Kai