From: Jens Axboe <axboe@suse.de>
To: Kai Makisara <Kai.Makisara@kolumbus.fi>
Cc: James Bottomley <James.Bottomley@SteelEye.com>,
Saeed Bishara <Saeed.Bishara@il.marvell.com>,
SCSI Mailing List <linux-scsi@vger.kernel.org>
Subject: Re: IO buffer alignment for block devices
Date: Tue, 14 Sep 2004 09:44:49 +0200
Message-ID: <20040914074448.GK2336@suse.de>
In-Reply-To: <Pine.LNX.4.58.0409132351210.1972@kai.makisara.local>

On Tue, Sep 14 2004, Kai Makisara wrote:
> On Mon, 13 Sep 2004, Jens Axboe wrote:
>
> > On Mon, Sep 13 2004, Kai Makisara wrote:
> > > On Mon, 13 Sep 2004, Jens Axboe wrote:
> > >
> > > > On Mon, Sep 13 2004, Kai Makisara wrote:
> ...
> > > I agree. That is why I have not objected to the patch that changed the
> > > default back to 512. That is why I reviewed sym53c8xx_2 and, to me, it
> > > seemed safe to try 8 byte alignment with this driver. The proper way is
> > > to have a safe default and let the LLDDs relax the alignment. The only
> > > practical problem is how to get this done but we probably have to live
> > > with this. Besides we were talking about SCSI.
> >
> > This seemed to be a device problem, not an adapter problem.
> >
> The tape drives at the other end of the SCSI cable don't have any
> restrictions. The restrictions are at the computer end of the cable.
I'm talking about the specific ide-cd case.
> > I'm curious where the alignment is really documented; it seems to me we
> > are stabbing in the dark.
> >
> I have not seen any Linux documentation on the whole problem. The
> requirements have to be determined case by case considering the whole data
> path from the source to the target. The low-level driver (writer) should
> know the requirements for the adapter but this should be combined with the
> requirements coming from the i/o architecture of the computer. One might
> design code where the low-level driver suggests an alignment that
> architecture-dependent code then modifies if necessary.
I completely agree, and right now there are many bits and pieces missing
from that chain. Which is why we have to default to something a little
aggressive...
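
To make the trade-off concrete, here is a minimal sketch of the decision being discussed: map the user buffer directly when it satisfies the alignment restriction, otherwise bounce it through a copy. This is a simplified model, not the kernel's code. The helper name `can_map_directly` and the example address are hypothetical; only the 512-byte default and the 8-byte candidate for sym53c8xx_2 come from this thread.

```python
def can_map_directly(addr: int, length: int, align: int) -> bool:
    """Return True if both the buffer start and its length are multiples of `align`.

    `align` must be a positive power of two (for example 512 or 8).
    Hypothetical helper: a model of the check, not a kernel function.
    """
    if align <= 0 or align & (align - 1):
        raise ValueError("align must be a positive power of two")
    mask = align - 1
    return (addr | length) & mask == 0


# Example: a 32768-byte buffer that starts 8 bytes past a page boundary
# (an assumed address, chosen only for illustration).
addr, length = 0x1000 + 8, 32768

print(can_map_directly(addr, length, 512))  # False: the transfer would be bounced
print(can_map_directly(addr, length, 8))    # True: the transfer could be mapped
```

With a 512-byte restriction this buffer falls back to the copying path, while an 8-byte restriction would let the same buffer go out without a copy, which is the case Kai is arguing for.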
> > > > And finally (and most importantly), why do you care so much?
> > >
> > > I do care because the non-bounced transfers give a measurable decrease in
> > > cpu time and bus use. And I don't like the situation where portable
> > > software has to have special alignment code just for Linux when there is
> > > no real technical reason for that. In practice, this code often does not
> > > exist. Then the question is, why does writing to tape in Linux use, say,
> > > 30 % CPU time when the same thing in xxx Unix uses 1 % CPU time.
> >
> > Please share how you arrived at those numbers, if you get that bad
> > performance you are doing something horribly wrong elsewhere. Doing
> > 50MiB/sec single transfer from a hard drive with either bio_map_user()
> > or bio_copy_user() backing shows at most 1% sys time difference on a
> > dual P3-800MHz. How fast are these tape drives?
> >
> On 20 Nov 2003 I posted to linux-scsi and linux-usb-development the
> following results:
> -----------------------8<--------------------------------------------
> Before implementing direct i/o, I made some measurements with the hardware
> available. The tape drive was HP DDS-4 and computer dual PIII 700 MHz. The
> results with a test program using well-compressible data (zeroes):
>
> Buffer length 32768 bytes, 32768 buffers written (1024.000 MB, 1073741824
> bytes)
> Variable block mode (writing 32768 byte blocks).
> dio:
> write: wall 60.436 user 0.015 ( 0.0 %) system 0.485 ( 0.8 %) speed
> 16.944 MB/s
> read: wall 64.580 user 0.014 ( 0.0 %) system 0.500 ( 0.8 %) speed
> 15.856 MB/s
> no dio:
> write: wall 61.373 user 0.024 ( 0.0 %) system 2.897 ( 4.7 %) speed
> 16.685 MB/s
> read: wall 66.435 user 0.055 ( 0.1 %) system 6.347 ( 9.6 %) speed
> 15.413 MB/s
>
> The common high-end drives at that time streamed at 30-40 MB/s assuming
> 2:1 compression. It meant that about 20 % of cpu was needed for extra
> copies when reading. Other people reported even higher percentages with
> other hardware.
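
The "about 20 %" projection above can be reproduced from the quoted read figures. The sketch below assumes the extra copy cost per megabyte stays constant as throughput rises (a linear-scaling assumption that the thread does not state explicitly), and the function name `extra_cpu_fraction` is made up for illustration.

```python
# Figures quoted from the 20 Nov 2003 measurements (dual PIII 700 MHz, HP DDS-4).
TOTAL_MB = 1024.0          # data moved per run
SYS_READ_DIO = 0.500       # system CPU seconds, direct i/o read
SYS_READ_BUFFERED = 6.347  # system CPU seconds, buffered ("no dio") read


def extra_cpu_fraction(rate_mb_s: float) -> float:
    """Projected share of one CPU spent on the extra copy at `rate_mb_s`.

    Assumption: the copy cost per megabyte does not change with throughput.
    """
    extra_seconds_per_mb = (SYS_READ_BUFFERED - SYS_READ_DIO) / TOTAL_MB
    return extra_seconds_per_mb * rate_mb_s


for rate in (30.0, 40.0):
    print(f"{rate:.0f} MB/s -> {extra_cpu_fraction(rate) * 100:.1f} % of one CPU")
```

Under that assumption the extra copy costs roughly 17 % to 23 % of one CPU at 30-40 MB/s, consistent with the 20 % estimate.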
Just tried 70MiB/sec on my test box, and it is going rather slowly with
non direct io. So your numbers don't look all that bad. Pretty strange,
I don't remember it being this bad. This is a profile from a forced
unaligned 128KiB contig transfer from two SCSI disks, doing 40 and
30MiB/sec at the same time:
 89887 default_idle             1404.4844
 24357 __copy_to_user_ll         217.4732
   512 __generic_unplug_device     8.0000
   279 __free_pages                3.4875
   291 sym53c8xx_intr              1.8188
   137 blk_rq_unmap_user           1.4271
   260 scsi_end_request            1.2500
   210 blk_execute_rq              1.0938
    17 free_hot_page               1.0625
Compared to direct io:
118502 default_idle             1851.5938
   138 __generic_unplug_device     2.1562
    50 unlock_page                 1.5625
   117 sym53c8xx_intr              0.7312
    37 set_page_dirty_lock         0.3854
    30 set_page_dirty              0.3750
    73 scsi_end_request            0.3510
    49 __bio_unmap_user            0.3403
For 512-byte reads, the results are comparable (aggregate bandwidth):
non-dio:
 91038 default_idle             1422.4688
  2843 __generic_unplug_device    44.4219
  3014 sym53c8xx_intr             18.8375
  1694 __copy_to_user_ll          15.1250
  1739 scsi_end_request            8.3606
   779 blk_rq_unmap_user           8.1146
   571 blk_run_queue               7.1375
dio:
 92280 default_idle             1441.8750
  2824 __generic_unplug_device    44.1250
  2926 sym53c8xx_intr             18.2875
  1608 scsi_end_request            7.7308
   716 blk_rq_unmap_user           7.4583
   566 blk_run_queue               7.0750
  1284 blk_execute_rq              6.6875
> I just repeated the tests using the same drive but a 2.6 GHz PIV on a
> Intel D875 motherboard and dual channel memory. The results were:
>
> dio:
> write: wall 58.668 user 0.013 ( 0.0 %) system 0.217 ( 0.4 %) speed
> 17.045 MB/s
> read: wall 63.425 user 0.011 ( 0.0 %) system 0.245 ( 0.4 %) speed
> 15.767 MB/s
> no dio:
> write: wall 58.713 user 0.017 ( 0.0 %) system 0.534 ( 0.9 %) speed
> 17.032 MB/s
> read: wall 63.265 user 0.020 ( 0.0 %) system 0.805 ( 1.3 %) speed
> 15.807 MB/s
That looks a lot closer to what I would expect - non-dio using a little
more sys time at the same transfer rate, still a little high though. You
must be doing something else differently as well. Are your transfer sizes
as big for non-dio as with dio? Are you using sg for non-dio?
> > There's no questioning that bio_map_user() will be faster for larger
> > transfers (4kb and up, I'm guessing), but there's no way that you can
> > claim 30% sys time for a tape drive without backing that up. Where was
> > this time spent?
> >
> I have not profiled the code. The st driver uses the same loop for direct
> and buffered transfers. The only difference is that in the buffered case
> the data is copied to/from the internal buffer before/after the SCSI
> transactions.
That's truly the only difference?
> As you can see from the results and project from them, the system time
> varies a lot depending on the hardware. The differences between the older
> system and the newer system are surprisingly big but, if you look at
> system architectures, the results are what you actually should expect.
Agree.
> I am not claiming that the cpu usage is in all cases significant with
> buffered transfers. Just that there exist cases where it is significant
> (big drive connected to a not very fast computer). This has motivated
> writing the code. (As has the fact that other Unices do this, too.)
Sure.
> > You can't map 6MB into a single bio, but you can string them together if
> > you want such a big request.
> >
> OK, I will look at that when I have time. The need for big requests comes
> from reading existing tapes with multimegabyte blocks. A block has to be
> read with a single SCSI read and this is why the request has to be so big
> in these cases.
I understand.
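
As a rough illustration of "stringing them together", the sketch below splits one large tape block into consecutive pieces that would each fit a per-mapping limit. It models only the arithmetic, not how bios are actually chained on a request. The function name `split_request` and the 1 MiB limit are assumptions for the example, not values taken from the kernel.

```python
from typing import Iterator, Tuple

MIB = 1024 * 1024


def split_request(total_bytes: int, max_chunk_bytes: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pieces that together cover `total_bytes`.

    Hypothetical helper: each piece stands in for one mapping that would be
    linked to the others to form a single large request.
    """
    if total_bytes < 0 or max_chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and max_chunk_bytes > 0")
    offset = 0
    while offset < total_bytes:
        length = min(max_chunk_bytes, total_bytes - offset)
        yield offset, length
        offset += length


# A 6 MiB block with an assumed 1 MiB limit per piece.
pieces = list(split_request(6 * MIB, 1 * MIB))
print(len(pieces))  # 6
```

The pieces are contiguous and in order, so the device still sees one SCSI read covering the whole block, which is what reading multimegabyte tape blocks requires.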
--
Jens Axboe