public inbox for linux-kernel@vger.kernel.org
From: Avi Kivity <avi@exanet.com>
To: Joel Becker <jlbec@evilplan.org>
Cc: Yasushi Saito <ysaito@hpl.hp.com>,
	linux-aio@kvack.org, linux-kernel@vger.kernel.org,
	suparna@in.ibm.com, Janet Morgan <janetmor@us.ibm.com>
Subject: Re: [PATCH 1/2]  aio: add vectored I/O support
Date: Sat, 16 Oct 2004 19:29:03 +0200	[thread overview]
Message-ID: <41715A5F.2060006@exanet.com> (raw)
In-Reply-To: <20041016162836.GG17142@parcelfarce.linux.theplanet.co.uk>

Joel Becker wrote:

>On Sat, Oct 16, 2004 at 10:43:04AM +0200, Avi Kivity wrote:
>>Using IO_CMD_READ for a vector entails
>>
>>- converting the userspace structure (which might well an iovec) to iocbs
>
>	Why create an iov if you don't need to?
>
If you aren't writing directly to the kernel API, an iovec is very 
convenient. It need not be an iovec, but surely you need _some_ vector.

>>- merging the iocbs
>
>	I don't see how this is different than merging iovs.  Whether an
>I/O range is represented by two segments of an iov or by two iocbs, the
>elevator is going to merge them.  If the userspace program had the
>knowledge to merge them up front, it should have submitted one larger
>segment.
No. An iovec is already merged: adjacent segments of an iovec are known 
to cover adjacent file offsets, so a single IO_CMD_READV iovec can 
generate a single bio without any merging.

The app did not submit a single large segment for the same reason 
non-aio readv is used: app memory is paged. In my case, a userspace 
filesystem has a paged cache; large, disk-contiguous reads go into many 
small, noncontiguous memory pages. Or it might be a database performing 
a sequential scan, reading a large block into multiple block buffers 
that are usually discontiguous.

>>- coalescing the multiple completions in userspace to a single completion
>
>	You generally have to do this anyway.  In fact, it is often far
>more efficient and performant to have a pattern of:
>
>	submit 10;
>	reap 3; submit 3 more;
>	reap 6; submit 6 more;
>	repeat until you are done;
>
>than to wait on all 10 before you can submit 10 again.
If the data is physically contiguous, it will (should) be merged, and 
thus completed in a single event anyway. All 10 completions will happen 
at the same time.

I might divide a 1M read into 4 iocbs to get the effect you mention, 
*if* I wanted to do anything with partial data. But I don't want to be 
forced into dividing it by virtual address into 256 4K iocbs.

>>Error handling is difficult as well. One would expect that a bad sector 
>>with multiple iocbs would fail only one of the requests. It seems 
>>non-trivial to implement this correctly.
>
>	I don't follow this.  If you mean that you want all io from
>later segments in an iov to fail if one segment has a bad sector, I
>don't know that we can enforce it without running one segment at a
>time.  That's terribly slow.
That's not what I meant. If you submit 16 iocbs which are merged by the 
kernel, and there is an error somewhere within the seventh iocb, I would 
expect to get 15 success completions and one error completion. So error 
information from the merged request must be demultiplexed back into the 
original iocbs.

If you have a single iocb, then any error simply fails that iocb.

>	Again, even if READV is a good idea, we need to fix whatever
>inefficiencies io_submit() has.  copying to/from userspace just can't be
>that slow.
The inefficiencies I referred to were disk inefficiencies, not processor 
inefficiencies.

I think what happened was that the iocbs submitted (64 iocbs of 4K each) 
were not merged because the device queue depth was very large; no 
queuing occurred because (I imagine) merging happens while a request is 
waiting for the disk to become ready.

Decreasing the queue depth is not an option, because I might want to do 
random reads of small iovecs later.

Of course, it is better to copy less than to copy more; so that is an 
additional win for PREADV.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



Thread overview: 10+ messages
2004-10-14 20:10 [PATCH 1/2] aio: add vectored I/O support Yasushi Saito
2004-10-16  3:13 ` Joel Becker
2004-10-16  5:18   ` Avi Kivity
2004-10-16  5:37     ` Joel Becker
2004-10-16  8:43       ` Avi Kivity
2004-10-16 16:28         ` Joel Becker
2004-10-16 17:29           ` Avi Kivity [this message]
2004-10-17  0:14             ` Joel Becker
2004-10-17  6:25               ` Avi Kivity
2004-10-16 12:05       ` William Lee Irwin III
