From: Anthony Liguori <anthony@codemonkey.ws>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>,
qemu-devel@nongnu.org, kvm-devel <kvm@vger.kernel.org>
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Date: Fri, 12 Dec 2008 11:25:55 -0600
Message-ID: <49429EA3.8070008@codemonkey.ws>
In-Reply-To: <20081212170916.GO6809@random.random>
Andrea Arcangeli wrote:
> On Fri, Dec 12, 2008 at 10:49:45AM -0600, Anthony Liguori wrote:
>
>> I meant, if you wanted to pass a file descriptor as a raw device. So:
>>
>> qemu -hda raw:fd=4
>>
>> Or something like that. We don't support this today.
>>
>
> ah ok.
>
>
>> I think bouncing the iov and just using pread/pwrite may be our best bet.
>> It means memory allocation but we can cap it. Since we're using threads,
>>
>
> It's already capped. However currently it generates an iovec, but
> we've simply to check the iovcnt to be 1, if it's 1 we pread from
> iov.iov_base, iov.iov_len. The dma api will take care to enforce
> iovcnt to be 1 for the iovec if preadv/pwritev isn't detected at
> compile time.
>
Hrm, that's more complex than I was expecting. I was thinking the bdrv
aio infrastructure would always take an iovec. Any details about the
underlying host's ability to handle the iovec would be hidden behind it.
>> we just can force a thread to sleep until memory becomes available so it's
>> actually pretty straight forward.
>>
>
> There's no way to detect that and wait for memory,
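If we artificially cap at, say, 50MB, then you do something like:

    /* lock and more_memory: a mutex/condvar pair shared with bounce_free() */
    pthread_mutex_lock(&lock);
    while (buffer == NULL) {
        buffer = try_to_bounce(offset, iov, iovcnt, &size);
        if (buffer == NULL && errno == ENOMEM) {
            pthread_cond_wait(&more_memory, &lock);
        } else if (buffer == NULL) {
            break;  /* some other error; don't spin */
        }
    }
    pthread_mutex_unlock(&lock);
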
try_to_bounce allocs with malloc() but if you exceed 50MB, then you fail
with an error of ENOMEM. In your bounce_free() function, you do a
pthread_cond_broadcast() to wake up any threads potentially waiting to
allocate memory.
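
To make the accounting concrete, here is a rough sketch of what the capped allocator could look like. All of the names (BOUNCE_CAP, bounce_reserve, bounce_release, bounce_alloc, bounce_free) are made up for illustration and are not existing QEMU code. It waits only when the artificial cap is the reason for the failure; a genuine malloc() failure is returned to the caller, since no bounce_free() may ever come along to wake the thread up in that case.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cap, taken from the "say 50MB" figure above. */
#define BOUNCE_CAP ((size_t)50 * 1024 * 1024)

static pthread_mutex_t bounce_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t bounce_more = PTHREAD_COND_INITIALIZER;
static size_t bounce_used;      /* bytes handed out, protected by bounce_lock */

/* Reserve size bytes of the budget. With wait == 0, fail with ENOMEM
 * instead of sleeping until another thread releases memory. */
static int bounce_reserve(size_t size, int wait)
{
    if (size > BOUNCE_CAP) {    /* can never fit, so never wait for it */
        errno = ENOMEM;
        return -1;
    }
    pthread_mutex_lock(&bounce_lock);
    while (size > BOUNCE_CAP - bounce_used) {
        if (!wait) {
            pthread_mutex_unlock(&bounce_lock);
            errno = ENOMEM;
            return -1;
        }
        pthread_cond_wait(&bounce_more, &bounce_lock);
    }
    bounce_used += size;
    pthread_mutex_unlock(&bounce_lock);
    return 0;
}

/* Give size bytes back to the budget and wake every waiting thread. */
static void bounce_release(size_t size)
{
    pthread_mutex_lock(&bounce_lock);
    assert(bounce_used >= size);
    bounce_used -= size;
    pthread_cond_broadcast(&bounce_more);
    pthread_mutex_unlock(&bounce_lock);
}

/* Allocate a bounce buffer under the cap; NULL with errno == ENOMEM on failure. */
static void *bounce_alloc(size_t size, int wait)
{
    void *buf;

    if (bounce_reserve(size, wait) < 0)
        return NULL;
    buf = malloc(size);
    if (buf == NULL) {
        bounce_release(size);
        errno = ENOMEM;
    }
    return buf;
}

static void bounce_free(void *buf, size_t size)
{
    free(buf);
    bounce_release(size);
}
```

A worker thread would call bounce_alloc(size, 1) and simply sleep until enough outstanding requests complete. Passing wait == 0 gives the try_to_bounce behaviour, where the caller decides what to do on ENOMEM.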
This lets us expose a preadv/pwritev function that actually works. The
expectation is that bouncing will outperform just doing pread/pwrite of
each vector. Of course, you could get smart and, if try_to_bounce fails,
fall back to pread/pwrite each vector. Likewise, you can fast-path the
case of a single iovec to avoid bouncing entirely.
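
As a sketch of how the read side might fit together: the function names below (bounce_preadv, preadv_each) are hypothetical, and a plain malloc() stands in for the capped bounce allocator. It shows the single-iovec fast path, the bounced pread() scattered back into the vectors, and the per-vector fallback when no bounce buffer is available.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallback: one pread() per vector; stops at the first short read. */
static ssize_t preadv_each(int fd, const struct iovec *iov, int iovcnt,
                           off_t offset)
{
    ssize_t total = 0;
    int i;

    for (i = 0; i < iovcnt; i++) {
        ssize_t n = pread(fd, iov[i].iov_base, iov[i].iov_len, offset + total);
        if (n < 0)
            return total > 0 ? total : -1;
        total += n;
        if ((size_t)n < iov[i].iov_len)
            break;              /* short read or EOF */
    }
    return total;
}

/* Emulated preadv: bounce through one linear buffer when possible. */
static ssize_t bounce_preadv(int fd, const struct iovec *iov, int iovcnt,
                             off_t offset)
{
    size_t size = 0, copied = 0;
    ssize_t n;
    char *buf;
    int i;

    if (iovcnt < 0) {
        errno = EINVAL;
        return -1;
    }
    if (iovcnt == 1)            /* fast path: nothing to bounce */
        return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);

    for (i = 0; i < iovcnt; i++)
        size += iov[i].iov_len;

    buf = malloc(size);         /* stand-in for a capped bounce allocator */
    if (buf == NULL)
        return preadv_each(fd, iov, iovcnt, offset);

    n = pread(fd, buf, size, offset);

    /* Scatter whatever was actually read back into the caller's vectors. */
    for (i = 0; n > 0 && i < iovcnt && copied < (size_t)n; i++) {
        size_t chunk = (size_t)n - copied;
        if (chunk > iov[i].iov_len)
            chunk = iov[i].iov_len;
        memcpy(iov[i].iov_base, buf + copied, chunk);
        copied += chunk;
    }
    assert(n <= 0 || copied == (size_t)n);
    free(buf);
    return n;
}
```

The write side would be the mirror image: gather the vectors into the bounce buffer first, then issue a single pwrite().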
Regards,
Anthony Liguori
> it'd sigkill before
> you can check... at least with the default overcommit. The way the dma
> api works, is that it doesn't send a mega large writev, but send it in
> pieces capped by the max buffer size, with many iovecs with iovcnt = 1.
>
>
>> We can use libaio on older Linux's to simulate preadv/pwritev. Use the
>> proper syscalls on newer kernels, on BSDs, and bounce everything else.
>>
>
> Given READV/WRITEV aren't available in not very recent kernels and
> given that without O_DIRECT each iocb will become synchronous, we
> can't use the libaio. Also once they fix linux-aio, if we do that, the
> iocb logic would need to be largely refactored. So I'm not sure if it
> worth it as it can't handle 2.6.16-18 when O_DIRECT is disabled (when
> O_DIRECT is enabled we could just build an array of linear iocb).
>