From: Wu Fengguang <wfg@linux.intel.com>
To: Vladislav Bolkhovitin <vst@vlnb.net>
Cc: Jens Axboe <jens.axboe@oracle.com>,
Jeff Moyer <jmoyer@redhat.com>,
"Vitaly V. Bursov" <vitalyb@telenet.dn.ua>,
linux-kernel@vger.kernel.org
Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases
Date: Tue, 25 Nov 2008 20:15:35 +0800
Message-ID: <20081125121534.GA16778@localhost>
In-Reply-To: <492BEAE8.9050809@vlnb.net>
On Tue, Nov 25, 2008 at 03:09:12PM +0300, Vladislav Bolkhovitin wrote:
> Vladislav Bolkhovitin wrote:
>> Wu Fengguang wrote:
>>> On Tue, Nov 25, 2008 at 02:41:47PM +0300, Vladislav Bolkhovitin wrote:
>>>> Wu Fengguang wrote:
>>>>> On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
>>>>>> Wu Fengguang wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> //Sorry for being late.
>>>>>>>
>>>>>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>>>>>> [...]
>>>>>>>> I already talked about this with Jeff on irc, but I guess I should post it
>>>>>>>> here as well.
>>>>>>>>
>>>>>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>>>>>> results), the original patch came about because dump(8) has a really
>>>>>>>> stupid design that offloads IO to a number of processes. This basically
>>>>>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>>>>>> its own io context. My feeling is that we should fix dump instead of
>>>>>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>>>>>> aware of any other good programs out there that would do something
>>>>>>>> similar, so I don't think there's a lot of merit to spending cycles on
>>>>>>>> detecting cooperating processes.
>>>>>>>>
>>>>>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>>>>>> him that santa will bring him something nice this year if he does (since
>>>>>>>> I'm sure it'll be painful on the eyes).
>>>>>>> This could also be fixed at the VFS readahead level.
>>>>>>>
>>>>>>> In fact I've seen many kinds of interleaved accesses:
>>>>>>> - concurrently reading 40 files that are in fact hard links of one single file
>>>>>>> - a backup tool that splits a big file into 8k chunks, and serves the
>>>>>>> {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>>>>> chunks in another one
>>>>>>> - a pool of NFSDs randomly serving some originally sequential
>>>>>>> read requests - now dump(8) seems to have some similar
>>>>>>> problem.
>>>>>>>
>>>>>>> In summary there have been all kinds of efforts on trying to
>>>>>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>>>>>> sequential pattern. It may not be easily fixable for many of them.
>>>>>>>
>>>>>>> It is however possible to detect most of these patterns at the
>>>>>>> readahead layer and restore sequential I/Os, before they propagate
>>>>>>> into the block layer and hurt performance.
>>>>>> I believe this would be the most effective way to go,
>>>>>> especially when the data delivery path to the original
>>>>>> client has its own latency that depends on the amount of
>>>>>> transferred data, as is the case with a remote NFS mount,
>>>>>> which does synchronous sequential reads. In this case it is
>>>>>> essential for performance to keep both links (local to the
>>>>>> storage and network to the client) always busy and
>>>>>> transferring data simultaneously. Since the reads are synchronous,
>>>>>> the only way to achieve that is to perform enough read ahead on
>>>>>> the server to cover the network link latency. Otherwise
>>>>>> you would end up with only half of the possible throughput.
>>>>>>
>>>>>> However, on the one hand, the server has to have a pool of
>>>>>> threads/processes to perform well, but, on the other hand, the
>>>>>> current read ahead code doesn't detect very well that those
>>>>>> threads/processes are doing a joint sequential read, so the read
>>>>>> ahead window gets smaller, and hence the overall read performance
>>>>>> gets considerably lower too.
>>>>>>
>>>>>>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>>>>>> I can test it with the SCST SCSI target subsystem
>>>>>> (http://scst.sf.net). SCST needs such a feature very much,
>>>>>> otherwise it can't get the full backing storage read speed. The
>>>>>> maximum I can see is about 80MB/s from a ~130MB/s 15K RPM disk
>>>>>> over a 1Gbps iSCSI link (maximum possible is ~110MB/s).
>>>>> Thank you very much!
>>>>>
>>>>> BTW, do you imply that the SCSI system (or its applications) has
>>>>> similar behaviors that the current readahead code cannot handle well?
>>>> No. The SCSI target subsystem is not the same as the SCSI initiator
>>>> subsystem, which is usually called simply the SCSI (sub)system. A SCSI
>>>> target is a SCSI server. It has as much in common with a
>>>> SCSI initiator as there is, e.g., between Apache (HTTP server) and
>>>> Firefox (HTTP client).
>>> Got it. So the SCSI server will split&spread sequential IO of one
>>> single file to cooperative threads?
>>
>> Yes. It has to do so, because Linux doesn't have asynchronous cached IO
>> and a client can queue several tens of commands at a time. Then, on
>> sequential IO with 1 command at a time, the CPU scheduler comes into play
>> and spreads those commands over those threads, so read ahead gets too
>> small to cover the external link latency and fill both links with data,
>> and that uncovered latency kills throughput.
>
> Additionally, if the uncovered external link latency is too large, one
> more factor becomes noticeable: storage rotation latency. If the next
> unread sector is not read in time, the server has to wait a full
> rotation to start receiving data for the next block, which decreases
> the resulting throughput even more.
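
For a rough sense of scale, the "both links busy" argument quoted above
can be put into a back-of-the-envelope model. The sketch below is not
taken from the thread: it assumes fixed link bandwidths, ignores seek,
rotation and protocol overhead, and the function names are hypothetical.
It only contrasts strictly serialized reads (read a chunk, then send it)
with fully overlapped transfers, using the ~130MB/s disk and ~110MB/s
link figures mentioned above.

```python
def serialized_throughput(disk_mbps, net_mbps):
    """Throughput when each chunk is first read from disk and only then
    sent over the network, so the two links are never busy together.

    Time per megabyte is 1/disk + 1/net; throughput is its inverse.
    """
    return 1.0 / (1.0 / disk_mbps + 1.0 / net_mbps)


def overlapped_throughput(disk_mbps, net_mbps):
    """Throughput when read ahead keeps the disk busy while the previous
    chunk is still on the wire; the slower link becomes the only limit.
    """
    return min(disk_mbps, net_mbps)


if __name__ == "__main__":
    disk, net = 130.0, 110.0  # figures quoted in the thread (MB/s)
    print("serialized: %.1f MB/s" % serialized_throughput(disk, net))
    print("overlapped: %.1f MB/s" % overlapped_throughput(disk, net))
```

With two links of equal speed the serialized case yields exactly half of
the overlapped one, which is the "only half of possible throughput"
effect described above. For 130MB/s and 110MB/s the model gives roughly
60MB/s serialized versus 110MB/s overlapped.
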
Thank you for the details. I've been working slowly on the idea, and
should be able to send you a patch in the next one or two days.
Thanks,
Fengguang
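
A toy illustration of the interleaved access patterns listed earlier in
this message may help. It is not the patch mentioned above and not the
kernel's readahead algorithm; it is a minimal sketch, with hypothetical
names (count_sequential, interleaved_requests), that assumes requests
arrive in offset order. It shows that two workers reading alternating 8k
chunks look non-sequential when tracked per process, yet fully
sequential when tracked per file.

```python
CHUNK = 8192  # 8k chunks, as in the backup tool example


def interleaved_requests(n_chunks, n_workers=2):
    """Build a request stream where chunk i is read by worker i % n_workers."""
    return [
        {"pid": i % n_workers, "file": "bigfile",
         "offset": i * CHUNK, "length": CHUNK}
        for i in range(n_chunks)
    ]


def count_sequential(requests, key):
    """Count requests that start exactly where the previous request with
    the same tracking key ended."""
    next_offset = {}
    hits = 0
    for req in requests:
        k = key(req)
        if next_offset.get(k) == req["offset"]:
            hits += 1
        next_offset[k] = req["offset"] + req["length"]
    return hits


if __name__ == "__main__":
    reqs = interleaved_requests(16)
    # Tracking state per (process, file): every worker skips a chunk.
    per_process = count_sequential(reqs, key=lambda r: (r["pid"], r["file"]))
    # Tracking state per file only: the combined stream is contiguous.
    per_file = count_sequential(reqs, key=lambda r: r["file"])
    print("sequential hits, per process:", per_process)
    print("sequential hits, per file:   ", per_file)
```

In a real server the arrival order can also be shuffled between
threads, so exact-offset matching as used here would be too strict; the
sketch only shows why the tracking key matters.
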