From: Avi Kivity <avi@exanet.com>
To: Joel Becker <jlbec@evilplan.org>
Cc: Yasushi Saito <ysaito@hpl.hp.com>,
	linux-aio@kvack.org, linux-kernel@vger.kernel.org,
	suparna@in.ibm.com, Janet Morgan <janetmor@us.ibm.com>
Subject: Re: [PATCH 1/2]  aio: add vectored I/O support
Date: Sat, 16 Oct 2004 19:29:03 +0200
Message-ID: <41715A5F.2060006@exanet.com>
In-Reply-To: <20041016162836.GG17142@parcelfarce.linux.theplanet.co.uk>

Joel Becker wrote:

>On Sat, Oct 16, 2004 at 10:43:04AM +0200, Avi Kivity wrote:
>  
>
>>Using IO_CMD_READ for a vector entails
>>
>>- converting the userspace structure (which might well be an iovec) to iocbs
>>    
>>
>
>	Why create an iov if you don't need to?
>
>  
>
If you aren't writing directly to the kernel API, an iovec is very 
convenient. It need not be an iovec, but surely you need _some_ vector.

>>- merging the iocbs
>>    
>>
>
>	I don't see how this is different than merging iovs.  Whether an
>I/O range is represented by two segments of an iov or by two iocbs, the
>elevator is going to merge them.  If the userspace program had the
>knowledge to merge them up front, it should have submitted one larger
>segment.
>  
>
No. An iovec is already merged: adjacent segments of an iovec are known
to have adjacent file offsets. A single IO_CMD_READV iovec can generate a
single bio without any merging.

The app did not submit a single large segment for the same reason
non-aio readv is used: because app memory is paged. In my case, a
userspace filesystem has a paged cache; large, disk-contiguous reads go
into many small noncontiguous memory pages. Or it might be a database
performing a sequential scan and reading a large block into multiple
block buffers, which are usually discontiguous.
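
To illustrate the paragraph above, here is a minimal sketch, assuming
Linux and Python 3.7 or later, where os.preadv is available. It is a
synchronous call, not aio, so it only shows the shape of the request: one
contiguous file range delivered into several separate buffers. The names
PAGE and read_contiguous_into_pages are invented for this example.

```python
import os
import tempfile

PAGE = 4096  # assumed page size, for illustration only


def read_contiguous_into_pages(fd, offset, npages):
    """Read npages * PAGE bytes starting at offset into separate buffers."""
    # Each bytearray stands in for one cache page. They are independent
    # objects, so nothing guarantees that they are adjacent in memory.
    pages = [bytearray(PAGE) for _ in range(npages)]
    # One vectored call: only the starting file offset is passed, and the
    # buffers are filled in order from a single contiguous file range.
    nread = os.preadv(fd, pages, offset)
    return pages, nread


# Usage: write four distinct pages to a scratch file, then read pages 1..3.
with tempfile.TemporaryFile() as f:
    f.write(b"".join(bytes([i]) * PAGE for i in range(4)))
    f.flush()
    pages, nread = read_contiguous_into_pages(f.fileno(), PAGE, 3)
```

Note that the caller supplies no per-buffer file offsets; they are implied
by the order and sizes of the buffers.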

>  
>
>>- coalescing the multiple completions in userspace to a single completion
>>    
>>
>
>	You generally have to do this anyway.  In fact, it is often far
>more efficient and performant to have a pattern of:
>
>	submit 10;
>	reap 3; submit 3 more;
>	reap 6; submit 6 more;
>	repeat until you are done;
>
>than to wait on all 10 before you can submit 10 again.
>  
>
If the data is physically contiguous, it will (should) be merged, and 
thus completed in a single event anyway. All 10 completions will happen 
at the same time.

I might divide a 1M read into 4 iocbs to get the effect you mention, and
then only *if* I wanted to do anything with partial data. I don't want to
be forced into dividing it by virtual address into 256 4K iocbs.
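
The arithmetic behind the 4-versus-256 comparison can be sketched as
follows. This is a hypothetical helper, not any kernel or libaio
interface; it assumes 4K pages and a length that is a multiple of the page
size, and it describes each request only as a (file offset, number of
page-sized segments) pair.

```python
PAGE = 4096  # assumed page size, for illustration only


def split_by_count(offset, length, nrequests):
    """Split [offset, offset + length) into up to nrequests requests.

    Returns a list of (file_offset, n_segments) tuples, where each segment
    is one page-sized buffer. Assumes length is a multiple of PAGE.
    """
    total_pages = length // PAGE
    base, extra = divmod(total_pages, nrequests)
    requests = []
    pos = offset
    for i in range(nrequests):
        # Spread any remainder over the first `extra` requests.
        npages = base + (1 if i < extra else 0)
        if npages == 0:
            continue
        requests.append((pos, npages))
        pos += npages * PAGE
    return requests


# With vectored requests the caller picks the split: 4 requests of 64 pages.
vectored = split_by_count(0, 1 << 20, 4)
# Without them, every page needs its own request: 256 requests of 1 page.
per_page = split_by_count(0, 1 << 20, 256)
```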

>>error handling is difficult as well. one would expect that a bad sector 
>>with multiple iocbs would only fail one of the requests. it seems to be 
>>non-trivial to implement this correctly.
>>    
>>
>
>	I don't follow this.  If you mean that you want all io from
>later segments in an iov to fail if one segment has a bad sector, I
>don't know that we can enforce it without running one segment at a
>time.  That's terribly slow.
>  
>
That's not what I meant. If you submit 16 iocbs which are merged by the
kernel, and there is an error somewhere within the seventh iocb, I would
expect to get 15 success completions and one error completion. So error
information from the merged iocb must be demultiplexed into the originals.

If you have a single iocb, then any error simply fails that iocb.
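
The demultiplexing described above can be sketched in a few lines. This is
a toy model, not kernel code: it assumes the failure is reported as a
single bad file offset, and it mimics the convention of returning the byte
count on success and a negative errno on failure. The function name
demux_completions is invented for this example.

```python
import errno


def demux_completions(requests, bad_offset):
    """Map one error in a merged range back onto the original requests.

    requests is a list of (file_offset, length) pairs that were merged;
    bad_offset is the file offset at which the merged I/O failed.
    """
    results = []
    for offset, length in requests:
        if offset <= bad_offset < offset + length:
            results.append(-errno.EIO)  # only the request covering the error fails
        else:
            results.append(length)      # every other request completes in full
    return results


# 16 requests of 4K each, with a bad sector inside the seventh (index 6).
requests = [(i * 4096, 4096) for i in range(16)]
results = demux_completions(requests, bad_offset=6 * 4096 + 512)
```

With a single vectored request there is nothing to map back: the one
request either succeeds or fails.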

>	Again, even if READV is a good idea, we need to fix whatever
>inefficiencies io_submit() has.  copying to/from userspace just can't be
>that slow.
>  
>
The inefficiencies I referred to were disk inefficiencies, not processor.

I think what happened was that the iocbs submitted (64 iocbs of 4K each)
did not merge because the device queue depth was very large; no queuing
occurred, and (I imagine) merging happens only while a request is waiting
for the disk to become ready.

Decreasing the queue depth is not an option, because I might want to do 
random reads of small iovecs later.

Of course, it is better to copy less than to copy more; so that is an 
additional win for PREADV.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


