From: Jens Axboe <axboe@suse.de>
To: Kai Makisara <Kai.Makisara@kolumbus.fi>
Cc: James Bottomley <James.Bottomley@SteelEye.com>,
Saeed Bishara <Saeed.Bishara@il.marvell.com>,
SCSI Mailing List <linux-scsi@vger.kernel.org>
Subject: Re: IO buffer alignment for block devices
Date: Tue, 14 Sep 2004 09:44:49 +0200
Message-ID: <20040914074448.GK2336@suse.de>
In-Reply-To: <Pine.LNX.4.58.0409132351210.1972@kai.makisara.local>

On Tue, Sep 14 2004, Kai Makisara wrote:
> On Mon, 13 Sep 2004, Jens Axboe wrote:
>
> > On Mon, Sep 13 2004, Kai Makisara wrote:
> > > On Mon, 13 Sep 2004, Jens Axboe wrote:
> > >
> > > > On Mon, Sep 13 2004, Kai Makisara wrote:
> ...
> > > I agree. That is why I have not objected to the patch that changed the
> > > default back to 512. That is why I reviewed sym53c8xx_2 and, to me, it
> > > seemed safe to try 8-byte alignment with this driver. The proper way is
> > > to have a safe default and let the LLDDs relax the alignment. The only
> > > practical problem is how to get this done but we probably have to live
> > > with this. Besides we were talking about SCSI.
> >
> > This seemed to be a device problem, not an adapter problem.
> >
> The tape drives at the other end of the SCSI cable don't have any
> restrictions. The restrictions are at the computer end of the cable.
I'm talking about the specific ide-cd case.
> > I'm curious where the alignment is really documented, seems to me we are
> > stabbing in the dark.
> >
> I have not seen any Linux documentation on the whole problem. The
> requirements have to be determined case by case considering the whole data
> path from the source to the target. The low-level driver (writer) should
> know the requirements for the adapter but this should be combined with the
> requirements coming from the i/o architecture of the computer. One might
> design code where the low-level driver suggests an alignment that
> architecture-dependent code then modifies if necessary.
I completely agree, and right now there are many bits and pieces missing
from that chain, which is why we have to default to something a little
conservative...
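The scheme Kai describes (driver suggests an alignment, architecture code
adjusts it) maps naturally onto alignment masks; in 2.6 the driver-side
knob for this is blk_queue_dma_alignment(). Here is a minimal userspace
sketch of how two such requirements could be combined -- the function
names are hypothetical, only the mask convention comes from the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Alignment requirements expressed as masks (alignment - 1), the same
 * convention blk_queue_dma_alignment() uses.  OR-ing the driver's and
 * the architecture's masks keeps every bit either side cares about,
 * so the strictest requirement wins. */
static unsigned long combine_dma_align(unsigned long drv_mask,
                                       unsigned long arch_mask)
{
	return drv_mask | arch_mask;
}

/* A buffer satisfies the requirement when no masked bit is set. */
static int buf_is_aligned(const void *buf, unsigned long mask)
{
	return ((uintptr_t)buf & mask) == 0;
}
```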
> > > > And finally (and most importantly), why do you care so much?
> > >
> > > I do care because the non-bounced transfers give a measurable decrease in
> > > cpu time and bus use. And I don't like the situation where portable
> > > software has to have special alignment code just for Linux when there is
> > > no real technical reason for that. In practice, this code often does not
> > > exist. Then the question is, why does writing to tape in Linux use, say,
> > > 30 % CPU time when the same thing in xxx Unix uses 1 % CPU time.
> >
> > Please share how you arrived at those numbers, if you get that bad
> > performance you are doing something horribly wrong elsewhere. Doing
> > 50MiB/sec single transfer from a hard drive with either bio_map_user()
> > or bio_copy_user() backing shows at most 1% sys time difference on a
> > dual P3-800MHz. How fast are these tape drives?
> >
> On 20 Nov 2003 I posted to linux-scsi and linux-usb-development the
> following results:
> -----------------------8<--------------------------------------------
> Before implementing direct i/o, I made some measurements with the hardware
> available. The tape drive was HP DDS-4 and computer dual PIII 700 MHz. The
> results with a test program using well-compressible data (zeroes):
>
> Buffer length 32768 bytes, 32768 buffers written (1024.000 MB, 1073741824 bytes)
> Variable block mode (writing 32768 byte blocks).
> dio:
> write: wall 60.436 user 0.015 ( 0.0 %) system 0.485 ( 0.8 %) speed 16.944 MB/s
> read: wall 64.580 user 0.014 ( 0.0 %) system 0.500 ( 0.8 %) speed 15.856 MB/s
> no dio:
> write: wall 61.373 user 0.024 ( 0.0 %) system 2.897 ( 4.7 %) speed 16.685 MB/s
> read: wall 66.435 user 0.055 ( 0.1 %) system 6.347 ( 9.6 %) speed 15.413 MB/s
>
> The common high-end drives at that time streamed at 30-40 MB/s assuming
> 2:1 compression. It meant that about 20 % of cpu was needed for extra
> copies when reading. Other people reported even higher percentages with
> other hardware.
Just tried 70MiB/sec on my test box, and it is going rather slowly with
non direct io. So your numbers don't look all that bad. Pretty strange,
I don't remember it being this bad. This is a profile from a forced
unaligned 128KiB contig transfer from two SCSI disks, doing 40 and
30MiB/sec at the same time:
89887 default_idle 1404.4844
24357 __copy_to_user_ll 217.4732
512 __generic_unplug_device 8.0000
279 __free_pages 3.4875
291 sym53c8xx_intr 1.8188
137 blk_rq_unmap_user 1.4271
260 scsi_end_request 1.2500
210 blk_execute_rq 1.0938
17 free_hot_page 1.0625
Compared to direct io:
118502 default_idle 1851.5938
138 __generic_unplug_device 2.1562
50 unlock_page 1.5625
117 sym53c8xx_intr 0.7312
37 set_page_dirty_lock 0.3854
30 set_page_dirty 0.3750
73 scsi_end_request 0.3510
49 __bio_unmap_user 0.3403
For 512-byte reads, the results are comparable (aggregate bandwidth):
non-dio:
91038 default_idle 1422.4688
2843 __generic_unplug_device 44.4219
3014 sym53c8xx_intr 18.8375
1694 __copy_to_user_ll 15.1250
1739 scsi_end_request 8.3606
779 blk_rq_unmap_user 8.1146
571 blk_run_queue 7.1375
dio:
92280 default_idle 1441.8750
2824 __generic_unplug_device 44.1250
2926 sym53c8xx_intr 18.2875
1608 scsi_end_request 7.7308
716 blk_rq_unmap_user 7.4583
566 blk_run_queue 7.0750
1284 blk_execute_rq 6.6875
> I just repeated the tests using the same drive but a 2.6 GHz PIV on an
> Intel D875 motherboard and dual channel memory. The results were:
>
> dio:
> write: wall 58.668 user 0.013 ( 0.0 %) system 0.217 ( 0.4 %) speed 17.045 MB/s
> read: wall 63.425 user 0.011 ( 0.0 %) system 0.245 ( 0.4 %) speed 15.767 MB/s
> no dio:
> write: wall 58.713 user 0.017 ( 0.0 %) system 0.534 ( 0.9 %) speed 17.032 MB/s
> read: wall 63.265 user 0.020 ( 0.0 %) system 0.805 ( 1.3 %) speed 15.807 MB/s
That looks a lot closer to what I would expect - non-dio using a little
more sys time at the same transfer rate, still a little high though. You
must be doing something else differently as well. Are your transfer sizes
as big for non-dio as with dio? Are you using sg for non-dio?
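For reference, wall/user/system breakdowns like the ones quoted above can
be produced with gettimeofday() and getrusage(). This is a hypothetical
sketch of such a reporting helper, not Kai's actual test program:

```c
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Convert a struct timeval to seconds. */
static double tv_secs(struct timeval tv)
{
	return tv.tv_sec + tv.tv_usec / 1e6;
}

/* Throughput in MB/s for 'bytes' moved in 'wall' seconds. */
static double mb_per_sec(long long bytes, double wall)
{
	return bytes / wall / 1e6;
}

/* Print one line in the style of the figures quoted above.  The
 * caller samples gettimeofday() and getrusage(RUSAGE_SELF) before
 * and after the I/O loop. */
static void report(const char *op, struct timeval w0, struct timeval w1,
		   struct rusage r0, struct rusage r1, long long bytes)
{
	double wall = tv_secs(w1) - tv_secs(w0);
	double user = tv_secs(r1.ru_utime) - tv_secs(r0.ru_utime);
	double sys  = tv_secs(r1.ru_stime) - tv_secs(r0.ru_stime);

	printf("%s: wall %.3f user %.3f (%4.1f %%) system %.3f (%4.1f %%) "
	       "speed %.3f MB/s\n", op, wall, user, 100.0 * user / wall,
	       sys, 100.0 * sys / wall, mb_per_sec(bytes, wall));
}
```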
> > There's no questioning that bio_map_user() will be faster for larger
> > transfers (4kb and up, I'm guessing), but there's no way that you can
> > claim 30% sys time for a tape drive without backing that up. Where was
> > this time spent?
> >
> I have not profiled the code. The st driver uses the same loop for direct
> and buffered transfers. The only difference is that in the buffered case
> the data is copied to/from the internal buffer before/after the SCSI
> transactions.
That's truly the only difference?
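If that really is the only difference, it can be sketched in a few lines.
Everything here is hypothetical -- issue_scsi_write() merely stands in for
the actual command submission in st:

```c
#include <string.h>

/* Stand-in for the SCSI WRITE submission; pretends success. */
static int issue_scsi_write(const void *buf, size_t len)
{
	(void)buf;
	(void)len;
	return 0;
}

/* Sketch of the two st paths: in the buffered case the user data is
 * first copied into a driver-internal bounce buffer and the command
 * is issued from there; in the direct case the command uses the user
 * buffer itself and the extra copy disappears. */
static int st_write(const void *user_buf, size_t len,
		    void *bounce, int use_dio)
{
	if (use_dio)
		return issue_scsi_write(user_buf, len);

	memcpy(bounce, user_buf, len);	/* the copy dio avoids */
	return issue_scsi_write(bounce, len);
}
```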
> As you can see from the results and project from them, the system time
> varies a lot depending on the hardware. The differences between the older
> system and the newer system are surprisingly big but, if you look at
> system architectures, the results are what you actually should expect.
Agree.
> I am not claiming that the cpu usage is in all cases significant with
> buffered transfers. Just that there exist cases where it is significant
> (big drive connected to a not very fast computer). This has motivated
> doing the code. (And the fact that other Unices do this, too).
Sure.
> > You can't map 6MB into a single bio, but you can string them together if
> > you want such a big request.
> >
> OK, I will look at that when I have time. The need for big requests comes
> from reading existing tapes with multimegabyte blocks. A block has to be
> read with a single SCSI read and this is why the request has to be so big
> in these cases.
I understand.
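As a rough sketch of the sizing involved: assuming 4 KiB pages and the
2.6 BIO_MAX_PAGES limit of 256 pages per bio (both of these are
assumptions about the configuration), the number of bios that must be
strung together for one multimegabyte tape block works out as:

```c
/* Assumed 2.6-era limits: 4 KiB pages and BIO_MAX_PAGES == 256,
 * i.e. at most 1 MiB of data per bio. */
#define ASSUMED_PAGE_SIZE	4096UL
#define ASSUMED_BIO_MAX_PAGES	256UL

/* How many bios must be chained so that a tape block of 'block_bytes'
 * can still be issued as a single request, and hence read with a
 * single SCSI READ. */
static unsigned long bios_needed(unsigned long block_bytes)
{
	unsigned long per_bio = ASSUMED_PAGE_SIZE * ASSUMED_BIO_MAX_PAGES;

	return (block_bytes + per_bio - 1) / per_bio;
}
```

A 6 MB block, for example, would need six 1 MiB bios chained into one
request under these assumptions.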
--
Jens Axboe
Thread overview: 11+ messages
2004-09-13 12:48 IO buffer alignment for block devices Saeed Bishara
2004-09-13 14:14 ` James Bottomley
2004-09-13 15:16 ` Saeed Bishara
2004-09-13 15:28 ` James Bottomley
2004-09-13 17:12 ` Kai Makisara
2004-09-13 19:02 ` Jens Axboe
2004-09-13 19:56 ` Kai Makisara
2004-09-13 20:23 ` Jens Axboe
2004-09-13 22:08 ` Kai Makisara
2004-09-14 7:44 ` Jens Axboe [this message]
2004-09-14 17:39 ` Kai Makisara