* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <20070404212103.GA19026@elte.hu>
@ 2007-04-04 23:19 ` Rusty Russell
2007-04-05 7:17 ` Avi Kivity
[not found] ` <1175728768.12230.593.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
0 siblings, 2 replies; 36+ messages in thread
From: Rusty Russell @ 2007-04-04 23:19 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Anthony Liguori, kvm-devel, netdev
On Wed, 2007-04-04 at 23:21 +0200, Ingo Molnar wrote:
> * Anthony Liguori <aliguori@us.ibm.com> wrote:
>
> > But why is it a good thing to do PV drivers in the kernel? You lose
> > flexibility and functionality to gain performance. [...]
>
> in Linux a kernel-space network driver can still be tunneled over
> user-space code, and hence you can add arbitrary add-on functionality
> (and thus have flexibility), without slowing down the common case (which
> would be to tunnel the guest's network traffic into the firewall rules
> of the kernel. No need to touch user-space for any of that).
You didn't quote Anthony's point about "it's more about there not being
good enough userspace interfaces to do network IO."
It's easier to write a kernel-space network driver, but it's not
obviously the right thing to do until we can show that an efficient
packet-level userspace interface isn't possible. I don't think that's
been done, and it would be interesting to try.
Cheers,
Rusty.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-04 23:19 ` [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work Rusty Russell
@ 2007-04-05 7:17 ` Avi Kivity
2007-04-06 1:02 ` Rusty Russell
[not found] ` <1175728768.12230.593.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
1 sibling, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-05 7:17 UTC (permalink / raw)
To: Rusty Russell; +Cc: Ingo Molnar, kvm-devel, netdev
Rusty Russell wrote:
> You didn't quote Anthony's point about "it's more about there not being
> good enough userspace interfaces to do network IO."
>
> It's easier to write a kernel-space network driver, but it's not
> obviously the right thing to do until we can show that an efficient
> packet-level userspace interface isn't possible. I don't think that's
> been done, and it would be interesting to try.
>
In the case of networking, the copyful interfaces on receive are driven
by the hardware not knowing how to split the header from the data. On
transmit I agree, it could be made copyless from userspace (something
like sendfilev, only not file oriented).
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <1175728768.12230.593.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-04-05 9:30 ` Ingo Molnar
[not found] ` <20070405093033.GC25448-X9Un+BFzKDI@public.gmane.org>
2007-04-05 14:32 ` [kvm-devel] " Anthony Liguori
0 siblings, 2 replies; 36+ messages in thread
From: Ingo Molnar @ 2007-04-05 9:30 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
* Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org> wrote:
> It's easier to write a kernel-space network driver, but it's not
> obviously the right thing to do until we can show that an efficient
> packet-level userspace interface isn't possible. I don't think that's
> been done, and it would be interesting to try.
yes, i agree in theory, but IMO this is largely beside the point. What
matters most for developing a project is _the quality of the codebase_.
That attracts developers, developers improve the code, which then
attracts users, which attracts more developers, etc., etc. As long as
the quality of the codebase is maintained, this is a self-sustaining
process. You've seen that happen with Linux. [ And of course, the
crucial step #0 is: a sane, open-minded maintainer with good taste ;-) ]
qemu's code quality is not really suitable for that basic OSS model, in
my opinion. It has been a mostly one-man show for a long time with
various hostile forks, bin-only kernel module and other actions that
easily poison an OSS project.
the result is not surprising: important portions of qemu have grown into
a hard to hack, hard to maintain codebase with poor code quality, with
gems like:
#ifdef _WIN32
void CALLBACK host_alarm_handler(UINT uTimerID, UINT uMsg,
DWORD_PTR dwUser, DWORD_PTR dw1, DWORD_PTR dw2)
#else
static void host_alarm_handler(int host_signum)
#endif
{
#if 0
#define DISP_FREQ 1000
and that's not just some random driver - this is _the_ main central
timer code of qemu.
so right now the only option for a clean codebase is the KVM in-kernel
code. It's clean and sweet and integrates nicely into the rest of the
kernel. The kernel is also obviously the final place where most
virtualization technologies want to show up because it's the entity that
is the closest to the guest context: we _dont_ want to _force_ network
traffic (let alone interrupt handling) through a userspace context, only
if the functionality of the task absolutely requires it. (but in most
cases we'll try to come up with a maximally flexible scheme that can
just drive things straight via the kernel. netfilter/iptables isnt in
user-space either, partly for that reason.)
but architectural issues aside (ignoring that the kernel _is_ the best
place to do this particular stuff), this question is still mainly
dominated by the basic question of code quality. I'd rather move
something into the Linux kernel, enforce its code quality that way, and
_then_ add whatever clean infrastructure is needed to push it back into
user-space again (into a different codebase), than having to hack the
monolithic 200 KLOC+ qemu codebase that is shackled with support for
tons of arcane architectures nobody uses and tons of arcane OS variants
that no-one cares about. Now qemu is a very important enabler and
platform-reference-implementation for KVM to fall back to, but it's not
the place to put crucial new code into, at least currently.
Ingo
-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <20070405093033.GC25448-X9Un+BFzKDI@public.gmane.org>
@ 2007-04-05 9:58 ` Avi Kivity
2007-04-05 10:26 ` [kvm-devel] " Ingo Molnar
2007-04-05 10:55 ` Ingo Molnar
1 sibling, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-05 9:58 UTC (permalink / raw)
To: Ingo Molnar; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
Ingo Molnar wrote:
> so right now the only option for a clean codebase is the KVM in-kernel
> code.
I strongly disagree with this. Bad code in userspace is not an excuse
for shoving stuff into the kernel, where maintaining it is much more
expensive, and the result of a mistake can be system crashes and data
loss, affecting unrelated processes. If we move something into the
kernel, we'd better have a really good reason for it.
Qemu code _is_ crufty. We can do one of three things:
1. live with it
2. fork it and clean it up
3. clean it up incrementally and merge it upstream
Currently we're doing (1). You're suggesting a variant of (2), fork
plus move into the kernel. The right thing to do IMO is (3), but I
don't see anybody volunteering. Qemu picked up additional committers
recently and I believe they would be receptive to cleanups.
[In the *pic/pit case, we have other reasons to push things into the
kernel. But "this code is crap, let's rewrite it in the kernel" is not
a justification I'll accept. I'd be much happier if we could quantify
these other reasons.]
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-05 9:58 ` Avi Kivity
@ 2007-04-05 10:26 ` Ingo Molnar
2007-04-05 11:26 ` Avi Kivity
0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2007-04-05 10:26 UTC (permalink / raw)
To: Avi Kivity; +Cc: Rusty Russell, kvm-devel, netdev
* Avi Kivity <avi@qumranet.com> wrote:
> > so right now the only option for a clean codebase is the KVM
> > in-kernel code.
>
> I strongly disagree with this.
are you disagreeing with my statement that the KVM kernel-side code is
the only clean codebase here? To me this is a clear fact :)
I only pointed out that the only clean codebase at the moment is the KVM
in-kernel code - i did not make the argument (at all) that every new
piece of KVM code should be done in the kernel. That would be stupid -
do you think i'd advocate for example moving command line argument
parsing into the kernel?
and as i said in the mail: "the kernel _is_ the best place to do this
particular stuff".
Ingo
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <20070405093033.GC25448-X9Un+BFzKDI@public.gmane.org>
2007-04-05 9:58 ` Avi Kivity
@ 2007-04-05 10:55 ` Ingo Molnar
1 sibling, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2007-04-05 10:55 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
* Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:
> * Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org> wrote:
>
> > It's easier to write a kernel-space network driver, but it's not
> > obviously the right thing to do until we can show that an efficient
> > packet-level userspace interface isn't possible. I don't think
> > that's been done, and it would be interesting to try.
>
> yes, i agree in theory, [...]
let me explain my position a bit more verbosely:
i agree in terms of 'network driver' (and more generally in terms of
'device', which includes network, storage, console, etc. devices):
having a user-space driver option should still be possible and it should
be integrated well. Qemu is quite rich and flexible in these areas and
we dont want to throw away or isolate that body of code.
but i dont agree in terms of PIC code, which is the main argument in
this particular thread. There's little precedent for any add-ons for
PICs in user-space, nor any particular PIC handling richness in Qemu
that we'd like to preserve.
Ingo
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-05 10:26 ` [kvm-devel] " Ingo Molnar
@ 2007-04-05 11:26 ` Avi Kivity
[not found] ` <4614DCE1.70905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-05 11:26 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Rusty Russell, kvm-devel, netdev
Ingo Molnar wrote:
> * Avi Kivity <avi@qumranet.com> wrote:
>
>
>>> so right now the only option for a clean codebase is the KVM
>>> in-kernel code.
>>>
>> I strongly disagree with this.
>>
>
> are you disagreeing with my statement that the KVM kernel-side code is
> the only clean codebase here? To me this is a clear fact :)
>
No, I agree with that. I just disagree with choosing to put the *pic
code (or other code) into the kernel on *that* basis. The selection
should be on design/performance issues alone, *not* the state of
existing code.
> I only pointed out that the only clean codebase at the moment is the KVM
> in-kernel code - i did not make the argument (at all) that every new
> piece of KVM code should be done in the kernel. That would be stupid -
> do you think i'd advocate for example moving command line argument
> parsing into the kernel?
>
No. But the difference in cruftiness between kvm and qemu code should
not enter into the discussion of where to do things.
> and as i said in the mail: "the kernel _is_ the best place to do this
> particular stuff".
>
I agree with this, maybe for different reasons.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <4614DCE1.70905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-05 11:36 ` Ingo Molnar
2007-04-06 1:16 ` [kvm-devel] " Rusty Russell
0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2007-04-05 11:36 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
* Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
> [...] But the difference in cruftiness between kvm and qemu code
> should not enter into the discussion of where to do things.
i agree that it doesnt enter the discussion for the *PIC question, but
it very much enters the discussion for the question that i replied to:
> > > You didn't quote Anthony's point about "it's more about there not
> > > being good enough userspace interfaces to do network IO."
> > >
> > > It's easier to write a kernel-space network driver, but it's not
> > > obviously the right thing to do until we can show that an
> > > efficient packet-level userspace interface isn't possible. I
> > > don't think that's been done, and it would be interesting to try.
prototyping new kernel APIs to implement user-space network drivers, on
a crufty codebase is not something that should be done lightly. Any
negative result will not bring us any real conclusion. (was the failure
due to the concept, due to the API or due to the crufty codebase?)
(but ... this is really a side-track issue for the *PIC question at
hand. PICs are not network devices, they are essential platform
components and almost an extended part of the CPU.)
Ingo
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-05 9:30 ` Ingo Molnar
[not found] ` <20070405093033.GC25448-X9Un+BFzKDI@public.gmane.org>
@ 2007-04-05 14:32 ` Anthony Liguori
2007-04-06 10:37 ` Ingo Molnar
1 sibling, 1 reply; 36+ messages in thread
From: Anthony Liguori @ 2007-04-05 14:32 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Rusty Russell, kvm-devel, netdev
Ingo Molnar wrote:
> * Rusty Russell <rusty@rustcorp.com.au> wrote:
>
>
>> It's easier to write a kernel-space network driver, but it's not
>> obviously the right thing to do until we can show that an efficient
>> packet-level userspace interface isn't possible. I don't think that's
>> been done, and it would be interesting to try.
>>
>
> yes, i agree in theory, but IMO this is largely beside the point. What
> matters most for developing a project is _the quality of the codebase_.
> That attracts developers, developers improve the code, which then
> attracts users, which attracts more developers, etc., etc. As long as
> the quality of the codebase is maintained, this is a self-sustaining
> process. You've seen that happen with Linux. [ And of course, the
> crucial step #0 is: a sane, open-minded maintainer with good taste ;-) ]
>
> qemu's code quality is not really suitable for that basic OSS model, in
> my opinion.
I think you may want to step off your high horse there. QEMU's code may
not be Linux kernel quality but it's certainly not anywhere near the
worst that is out there. Linux is over a decade old. QEMU is only around
3 years old. Did Linux have extremely high quality code in 1994?
Instead of posting code snippets to LKML, it would be much more
constructive to post patches to qemu-devel. It's not like the QEMU
maintainers are actively ignoring your efforts to improve the code.
> but architectural issues aside (ignoring that the kernel _is_ the best
> place to do this particular stuff),
Right. We don't put things in the kernel just because we don't like the
way the userspace code is written. If that logic was valid, then Linus
would be working on moving all of Gnome into the kernel.
This discussion has two parts. The first is whether or not the kernel
is the right place for a paravirtual network driver backend. My current
belief is that we could not get enough performance from something like
tun to do it in userspace. I also believe that we could improve tun (or
create a replacement) so that we could implement a PV network driver
backend in userspace. Admittedly, I'm not an expert in networking
though so I could be wrong here.
The second part is whether the platform devices should go in the
kernel. I agree with you that having the PIT in the kernel is probably
a good idea. I also agree that we probably have no choice but to move
the APIC into the kernel (not for PV drivers, but for TPR performance
and SMP support).
Regards,
Anthony Liguori
> this question is still mainly
> dominated by the basic question of code quality. I'd rather move
> something into the Linux kernel, enforce its code quality that way, and
> _then_ add whatever clean infrastructure is needed to push it back into
> user-space again (into a different codebase), than having to hack the
> monolithic 200 KLOC+ qemu codebase that is shackled with support for
> tons of arcane architectures nobody uses and tons of arcane OS variants
> that no-one cares about. Now qemu is a very important enabler and
> platform-reference-implementation for KVM to fall back to, but it's not
> the place to put crucial new code into, at least currently.
>
> Ingo
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-05 7:17 ` Avi Kivity
@ 2007-04-06 1:02 ` Rusty Russell
2007-04-08 5:36 ` Avi Kivity
0 siblings, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2007-04-06 1:02 UTC (permalink / raw)
To: Avi Kivity; +Cc: Ingo Molnar, kvm-devel, netdev
On Thu, 2007-04-05 at 10:17 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > You didn't quote Anthony's point about "it's more about there not being
> > good enough userspace interfaces to do network IO."
> >
> > It's easier to write a kernel-space network driver, but it's not
> > obviously the right thing to do until we can show that an efficient
> > packet-level userspace interface isn't possible. I don't think that's
> > been done, and it would be interesting to try.
> >
>
> In the case of networking, the copyful interfaces on receive are driven
> by the hardware not knowing how to split the header from the data. On
> transmit I agree, it could be made copyless from userspace (something
> like sendfilev, only not file oriented).
Hi Avi,
I don't think you've thought about this very hard. The receive copy is
completely independent of whether the packet is going to the guest via
a kernel driver or via userspace, so not relevant.
And if all packets from the card are going to the guest, you can
deliver directly. Userspace or kernel, no difference.
And we have a "sendfilev not file oriented": it's called "writev" 8)
An in-kernel driver can avoid system call overhead and page references.
But a better tap device helps more than just KVM.
Rusty.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-05 11:36 ` Ingo Molnar
@ 2007-04-06 1:16 ` Rusty Russell
2007-04-06 18:59 ` Ingo Molnar
0 siblings, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2007-04-06 1:16 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Avi Kivity, kvm-devel, netdev
On Thu, 2007-04-05 at 13:36 +0200, Ingo Molnar wrote:
> prototyping new kernel APIs to implement user-space network drivers, on
> a crufty codebase is not something that should be done lightly.
I think you overestimate my radicalism. I was considering readv() and
writev() on the tap device.
Qemu's infrastructure may hurt kvm here, but lguest won't be able to use
that excuse.
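As a rough illustration of the readv()/writev() idea, here is a minimal
sketch that sends a frame as a separate header and payload with one
writev() call and scatters it back with one readv() call. It does not open
a real tap device (that needs /dev/net/tun and privileges); an AF_UNIX
datagram socketpair stands in for it because it also preserves packet
boundaries. The names send_frame, recv_frame, frame_roundtrip and
FRAME_HDR_LEN are made up for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define FRAME_HDR_LEN 14 /* assumption: Ethernet-sized header */

/* Send header and payload as ONE packet with a single system call. */
static ssize_t send_frame(int fd, const void *hdr, size_t hdr_len,
                          const void *payload, size_t payload_len)
{
    struct iovec iov[2];

    assert(fd >= 0);
    iov[0].iov_base = (void *)hdr;       /* writev() only reads from it */
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = (void *)payload;
    iov[1].iov_len = payload_len;
    return writev(fd, iov, 2);
}

/* Receive one packet, scattering it into a header and a payload buffer. */
static ssize_t recv_frame(int fd, void *hdr, size_t hdr_len,
                          void *payload, size_t payload_cap)
{
    struct iovec iov[2];

    assert(fd >= 0);
    iov[0].iov_base = hdr;
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = payload;
    iov[1].iov_len = payload_cap;
    return readv(fd, iov, 2);
}

/* Round-trip one frame over the stand-in device; returns 0 on success. */
static int frame_roundtrip(void)
{
    static const unsigned char hdr[FRAME_HDR_LEN] = { 0xff, 0xff, 0xff };
    static const char payload[] = "hello guest";
    unsigned char hdr_in[FRAME_HDR_LEN];
    char payload_in[64];
    ssize_t sent, got = -1;
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;
    sent = send_frame(sv[0], hdr, sizeof hdr, payload, sizeof payload);
    if (sent == (ssize_t)(sizeof hdr + sizeof payload))
        got = recv_frame(sv[1], hdr_in, sizeof hdr_in,
                         payload_in, sizeof payload_in);
    close(sv[0]);
    close(sv[1]);
    if (got != sent)
        return -1;
    if (memcmp(hdr, hdr_in, sizeof hdr) != 0 ||
        memcmp(payload, payload_in, sizeof payload) != 0)
        return -1;
    return 0;
}
```

On a real tap file descriptor the same two calls would apply, with each
writev() or readv() carrying one frame; whether that is fast enough is the
open question in this thread, not something this sketch settles.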
> track issue for the *PIC question at
> hand. PICs are not network devices, they are essential platform
> components and almost an extended part of the CPU.)
Definitely, I'm only interested in stealing^H^H^Hsharing KVM devices.
The subject is now deeply misleading 8(
Cheers,
Rusty.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-05 14:32 ` [kvm-devel] " Anthony Liguori
@ 2007-04-06 10:37 ` Ingo Molnar
2007-04-06 11:07 ` Ingo Molnar
0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2007-04-06 10:37 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Rusty Russell, kvm-devel, netdev
* Anthony Liguori <aliguori@us.ibm.com> wrote:
> [...] Did Linux have extremely high quality code in 1994?
yes! It was crucial to strive for extremely high quality code all the
time. That was the only way to grow Linux's codebase, which was ~300,000
lines of code in 1994, to the current 7.2+ million lines of code,
without losing maintainability. Code quality is more important than any
feature. 99% of feature patches sent to lkml get rejected in the first
review round on quality/design grounds, it always takes at least a
couple of iterations to make it nice and clean. Look at Apache, it's
evolving along the same lines. Or Samba. Or any of the really large and
important OSS projects. (even X, after years of struggle and stagnation,
seems to have gotten this point now.) In the past 10 years the OSS
community wrote more than 1 billion lines of new code (!), and all the
successful projects have a clean codebase. _It cannot be done any other
way_, because cleanliness and pride over good code is what keeps
developers and it is what attracts new developers.
now this doesnt mean that Linux's code quality is good in every spot -
it's an eternal fight. But the core subsystems are pretty damn clean.
When i prepare patches for the Linux kernel more than 50% of the work i
do is related to making the changes clean, or cleaning up some existing
aspect of the kernel that the new code triggers. Often it's 90% of the
work!
the 'get functionality now, clean up later' mentality is what leads to
throwaway, use-once codebases that the majority of closed-source
projects do. Once the cruft level reaches a certain threshold it's
cheaper to just throw away old code and just rewrite the whole thing
(users and costs be damned). Cleanups must not be an afterthought, code
cleanliness and gradual code evolution is _the_ most valuable property
of OSS codebases.
i guess my negative Qemu experience is dominated by my recent failure of
trying to untangle its timer code, so that qemu properly adapts to
changes in PIT/lapic programming and maps that correctly to OS timers.
(so that a dynticks/NO_HZ guest's reduced irq rate becomes visible on
the host too) I'll be a happy camper if that's fixed ;-)
Ingo
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-06 10:37 ` Ingo Molnar
@ 2007-04-06 11:07 ` Ingo Molnar
0 siblings, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2007-04-06 11:07 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Rusty Russell, kvm-devel, netdev
* Ingo Molnar <mingo@elte.hu> wrote:
> * Anthony Liguori <aliguori@us.ibm.com> wrote:
>
> > [...] Did Linux have extremely high quality code in 1994?
>
> yes! It was crucial to strive for extremely high quality code all the
> time. That was the only way to grow Linux's codebase, which was
> ~300,000 lines of code in 1994, to the current 7.2+ million lines of
> code, without losing maintainability. [...]
in fact Linux 1.0, released in early 1994, was only 170,000 LOC:
http://www.kernel.org/pub/linux/kernel/v1.0/linux-1.0.tar.gz
and i just looked at a few random files in it - it's pretty clean.
Ingo
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-06 1:16 ` [kvm-devel] " Rusty Russell
@ 2007-04-06 18:59 ` Ingo Molnar
0 siblings, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2007-04-06 18:59 UTC (permalink / raw)
To: Rusty Russell; +Cc: Avi Kivity, kvm-devel, netdev
* Rusty Russell <rusty@rustcorp.com.au> wrote:
> > prototyping new kernel APIs to implement user-space network drivers,
> > on a crufty codebase is not something that should be done lightly.
>
> I think you overestimate my radicalism. I was considering readv() and
> writev() on the tap device.
ok :-) How would packeting be handled, or would this be like a raw
socket in essence, but not in 'peek' but 'filter through' mode? I think
it's not quite trivial. (but maybe i'm way too radical again :)
Ingo
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-06 1:02 ` Rusty Russell
@ 2007-04-08 5:36 ` Avi Kivity
[not found] ` <46187F4E.1080807-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-08 5:36 UTC (permalink / raw)
To: Rusty Russell; +Cc: Ingo Molnar, kvm-devel, netdev
Rusty Russell wrote:
> On Thu, 2007-04-05 at 10:17 +0300, Avi Kivity wrote:
>
>> Rusty Russell wrote:
>>
>>> You didn't quote Anthony's point about "it's more about there not being
>>> good enough userspace interfaces to do network IO."
>>>
>>> It's easier to write a kernel-space network driver, but it's not
>>> obviously the right thing to do until we can show that an efficient
>>> packet-level userspace interface isn't possible. I don't think that's
>>> been done, and it would be interesting to try.
>>>
>>>
>> In the case of networking, the copyful interfaces on receive are driven
>> by the hardware not knowing how to split the header from the data. On
>> transmit I agree, it could be made copyless from userspace (something
>> like sendfilev, only not file oriented).
>>
>
> Hi Avi,
>
> I don't think you've thought about this very hard. The receive copy is
> completely independent of whether the packet is going to the guest via
> a kernel driver or via userspace, so not relevant.
>
A packet received in the kernel cannot be made available to userspace in
a safe manner without a copy, as it will not be aligned with page
boundaries, so userspace cannot examine the packet until after one copy
has occurred. After userspace has determined what to do with the packet,
another copy must take place to get it there.
There's a counterexample, mmapped sockets, but that works only when all
packets arriving on a card are exposed to the same process. This is
useful for tcpdump or for what you outline below but is hardly generic.
> And if all packets from the card are going to the guest, you can
> deliver directly. Userspace or kernel, no difference.
>
That is not the common case. Nor is it true when there is a mismatch
between the card's capabilities and guest expectations and constraints.
For example, guest memory is not physically contiguous so a NIC that
won't do scatter/gather will require bouncing (or an iommu, but that's
not here yet).
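To make the scatter/gather point concrete, here is a minimal sketch that
splits a buffer which is contiguous in guest-physical address space into
segments that never cross a page boundary. Each segment may live in a
different host page, so a NIC limited to a single DMA address would need a
bounce buffer as soon as more than one segment comes back. The 4 KiB page
size, struct sg_seg and split_by_page are assumptions made for this
example, not KVM or kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096u /* assumption: 4 KiB guest pages */

/* One piece of a guest buffer that lies entirely within one page. */
struct sg_seg {
    uint64_t gpa; /* guest-physical start address */
    size_t len;   /* length in bytes */
};

/*
 * Split [gpa, gpa + len) into per-page segments.
 * Returns the number of segments, or -1 if max_segs is too small.
 */
static int split_by_page(uint64_t gpa, size_t len,
                         struct sg_seg *segs, int max_segs)
{
    int n = 0;

    assert(segs != NULL);
    while (len > 0) {
        /* Bytes left before the next page boundary. */
        size_t room = GUEST_PAGE_SIZE - (size_t)(gpa % GUEST_PAGE_SIZE);
        size_t chunk = len < room ? len : room;

        if (n == max_segs)
            return -1;
        segs[n].gpa = gpa;
        segs[n].len = chunk;
        n++;
        gpa += chunk;
        len -= chunk;
    }
    return n;
}
```

For example, a 1500-byte packet starting at guest-physical 0x1F00 yields
two segments: 256 bytes up to the page boundary at 0x2000, then the
remaining 1244 bytes.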
> And we have a "sendfilev not file oriented": it's called "writev" 8)
>
writev() cannot be made copyless for networking. One needs an async
interface so the kernel can complete the write after the NIC acks the
dma transfer, or a kernel driver.
> An in-kernel driver can avoid system call overhead and page references.
> But a better tap device helps more than just KVM.
>
I'll believe it when I see it.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <46187F4E.1080807-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-08 9:04 ` Muli Ben-Yehuda
2007-04-09 2:50 ` Rusty Russell
1 sibling, 0 replies; 36+ messages in thread
From: Muli Ben-Yehuda @ 2007-04-08 9:04 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
On Sun, Apr 08, 2007 at 08:36:14AM +0300, Avi Kivity wrote:
> That is not the common case. Nor is it true when there is a
> mismatch between the card's capabilities and guest expectations and
> constraints. For example, guest memory is not physically contiguous
> so a NIC that won't do scatter/gather will require bouncing (or an
> iommu, but that's not here yet).
Actually, Allen Kay from Intel just posted the first VT-d patches to
xen-devel a couple of days ago. I wonder if anyone is working on kvm
support (which would require Linux support).
Cheers,
Muli
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <46187F4E.1080807-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-04-08 9:04 ` Muli Ben-Yehuda
@ 2007-04-09 2:50 ` Rusty Russell
[not found] ` <1176087018.11664.65.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
1 sibling, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2007-04-09 2:50 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
On Sun, 2007-04-08 at 08:36 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > Hi Avi,
> >
> > I don't think you've thought about this very hard. The receive copy is
> > completely independent of whether the packet is going to the guest via
> > a kernel driver or via userspace, so not relevant.
> >
>
> A packet received in the kernel cannot be made available to userspace in
> a safe manner without a copy, as it will not be aligned with page
> boundaries, so userspace cannot examine the packet until after one copy
> has occurred.
Hi Avi!
I'm a little puzzled by your response. Hmm...
lguest's userspace network frontend does exactly as many copies as
Ingo's in-host-kernel code. One from the Guest, one to the Guest.
Does that clarify?
Rusty.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <1176087018.11664.65.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-04-09 7:10 ` Avi Kivity
[not found] ` <4619E6DC.3010804-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-09 7:10 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
Rusty Russell wrote:
> On Sun, 2007-04-08 at 08:36 +0300, Avi Kivity wrote:
>
>> Rusty Russell wrote:
>>
>>> Hi Avi,
>>>
>>> I don't think you've thought about this very hard. The receive copy is
>>> completely independent of whether the packet is going to the guest via
>>> a kernel driver or via userspace, so not relevant.
>>>
>>>
>> A packet received in the kernel cannot be made available to userspace in
>> a safe manner without a copy, as it will not be aligned with page
>> boundaries, so userspace cannot examine the packet until after one copy
>> has occurred.
>>
>
> Hi Avi!
>
> I'm a little puzzled by your response. Hmm...
>
> lguest's userspace network frontend does exactly as many copies as
> Ingo's in-host-kernel code. One from the Guest, one to the Guest.
>
>
kvm pvnet is suboptimal now. The number of copies could be reduced by
two (to zero), by constructing an skb that points to guest memory.
Right now, this can only be done in-kernel.
With current userspace networking interfaces, one cannot build a network
device that has less than one copy on transmit, because sendmsg() *must*
copy the data (as there is no completion notification). sendfilev(),
even if it existed, cannot be used: it is copyless, but lacks completion
notification. It is useful only on unchanging data like read-only files.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <4619E6DC.3010804-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-09 9:46 ` Rusty Russell
[not found] ` <1176111984.11664.90.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2007-04-09 9:46 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
On Mon, 2007-04-09 at 10:10 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > I'm a little puzzled by your response. Hmm...
> >
> > lguest's userspace network frontend does exactly as many copies as
> > Ingo's in-host-kernel code. One from the Guest, one to the Guest.
>
> kvm pvnet is suboptimal now. The number of copies could be reduced by
> two (to zero), by constructing an skb that points to guest memory.
> Right now, this can only be done in-kernel.
Sorry, you lost me here. You mean both input and output copies can be
eliminated? Or are you talking about another two copies somewhere?
But I don't get this "we can enhance the kernel but not userspace" vibe
8(
> With current userspace networking interfaces, one cannot build a network
> device that has less than one copy on transmit, because sendmsg() *must*
> copy the data (as there is no completion notification).
Why are you talking about sendmsg()? Perhaps this is where we're
getting tangled up.
We're dealing with the tun/tap device here, not a socket.
> sendfilev(),
> even if it existed, cannot be used: it is copyless, but lacks completion
> notification. It is useful only on unchanging data like read-only files.
Again, sendfile is a *much* harder problem than sending a single packet
once, which is the question here.
Rusty.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <1176111984.11664.90.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-04-09 13:38 ` Avi Kivity
[not found] ` <461A41CA.9080201-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-04-11 3:53 ` [kvm-devel] " Rusty Russell
0 siblings, 2 replies; 36+ messages in thread
From: Avi Kivity @ 2007-04-09 13:38 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
Rusty Russell wrote:
> On Mon, 2007-04-09 at 10:10 +0300, Avi Kivity wrote:
>
>> Rusty Russell wrote:
>>
>>> I'm a little puzzled by your response. Hmm...
>>>
>>> lguest's userspace network frontend does exactly as many copies as
>>> Ingo's in-host-kernel code. One from the Guest, one to the Guest.
>>>
>> kvm pvnet is suboptimal now. The number of copies could be reduced by
>> two (to zero), by constructing an skb that points to guest memory.
>> Right now, this can only be done in-kernel.
>>
>
> Sorry, you lost me here. You mean both input and output copies can be
> eliminated? Or are you talking about another two copies somewhere?
>
On the transmit path, current kvm pvnet has two copies:
1. on the guest side, the driver copies the skb data into the shared ring
2. on the host side, the device copies the data from the ring into a
newly allocated skb
Both of these copies can be eliminated with a host-side kernel. With
current userspace interfaces, only one copy can be eliminated.
Similar logic applies to receive, except that one copy must remain.
> But I don't get this "we can enhance the kernel but not userspace" vibe
> 8(
>
I've been waiting for network aio since ~2003. If it arrives in the
next few days, I'm all for it; much more than kvm can use it
profitably. But I'm not going to write that interface myself.
Moreover, some things just don't lend themselves to a userspace
abstraction. If we want to expose tso (tcp segmentation offload), we
can easily do so with a kernel driver since the kernel interfaces are
all tso aware. Tacking on tso awareness to tun/tap is doable, but at
the very least weird.
>
>> With current userspace networking interfaces, one cannot build a network
>> device that has less than one copy on transmit, because sendmsg() *must*
>> copy the data (as there is no completion notification).
>>
>
> Why are you talking about sendmsg()? Perhaps this is where we're
> getting tangled up.
>
> We're dealing with the tun/tap device here, not a socket.
>
>
Hmm. tun actually has aio_write implemented, but it seems synchronous.
So does the read path.
If these are made truly asynchronous, and the write path is made in
addition copyless, then we might have something workable. I still
cringe at having a pagetable walk in order to deliver a 1500-byte packet.
>> sendfilev(),
>> even if it existed, cannot be used: it is copyless, but lacks completion
>> notification. It is useful only on unchanging data like read-only files.
>>
>
> Again, sendfile is a *much* harder problem than sending a single packet
> once, which is the question here.
>
sendfile() is a *different* problem. It doesn't need completion because
the data is assumed not to change under it.
Consider that the guest may be issuing a megabyte-sized sendfile() which
is broken into 17 tso frames. We need to preserve the large structures
as much as possible or we end up repeating the simple "single packet
once" path 700 times.
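The arithmetic behind those two figures can be sketched roughly as follows. This is only an illustration under stated assumptions: the 65535-byte upper bound on a TSO frame, the 40 bytes of IPv4+TCP headers, and the helper name frames_needed are not taken from the thread.

```python
import math

MIB = 1024 * 1024      # size of the hypothetical guest sendfile() request
MTU = 1500             # standard Ethernet MTU
HEADERS = 40           # assumed IPv4 (20) + TCP (20) header bytes, no options
MAX_IP_LEN = 65535     # assumed upper bound on one TSO frame (IPv4 total length)


def frames_needed(total_bytes: int, payload_per_frame: int) -> int:
    """Number of frames needed to carry total_bytes of payload."""
    return math.ceil(total_bytes / payload_per_frame)


# Large frames handed down intact (TSO) versus one packet at a time.
tso_frames = frames_needed(MIB, MAX_IP_LEN - HEADERS)
mtu_packets = frames_needed(MIB, MTU - HEADERS)
raw_1500_packets = frames_needed(MIB, MTU)

print(tso_frames, mtu_packets, raw_1500_packets)
```

Under these assumptions the request fits in 17 large frames, against roughly 700 trips through the per-packet path (700 if each packet carried a full 1500 bytes of data, 719 once headers are subtracted).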
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <461A41CA.9080201-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-10 8:07 ` Evgeniy Polyakov
2007-04-10 8:19 ` [kvm-devel] " Avi Kivity
0 siblings, 1 reply; 36+ messages in thread
From: Evgeniy Polyakov @ 2007-04-10 8:07 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
On Mon, Apr 09, 2007 at 04:38:18PM +0300, Avi Kivity (avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org) wrote:
> >But I don't get this "we can enhance the kernel but not userspace" vibe
> >8(
> >
>
> I've been waiting for network aio since ~2003. If it arrives in the
> next few days, I'm all for it; much more than kvm can use it
> profitably. But I'm not going to write that interface myself.
Hmm, you missed at least two implementations of network aio in the
previous year, and now with syslets we can have a third one.
But it looks from this discussion that this will not prevent
changing the in-kernel driver - place a hook into the skb allocation path and
allocate data from opposing memory - get pages from another side and put
them into fragments, then copy headers into skb->data.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-10 8:07 ` Evgeniy Polyakov
@ 2007-04-10 8:19 ` Avi Kivity
[not found] ` <461B48A8.1060904-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-10 8:19 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Rusty Russell, Ingo Molnar, kvm-devel, netdev
Evgeniy Polyakov wrote:
> On Mon, Apr 09, 2007 at 04:38:18PM +0300, Avi Kivity (avi@qumranet.com) wrote:
>
>>> But I don't get this "we can enhance the kernel but not userspace" vibe
>>> 8(
>>>
>>>
>> I've been waiting for network aio since ~2003. If it arrives in the
>> next few days, I'm all for it; much more than kvm can use it
>> profitably. But I'm not going to write that interface myself.
>>
>
> Hmm, you missed at least two implementations of network aio in the
> previous year, and now with syslets we can have third one.
>
I meant, network aio in the mainline kernel. I am aware of the various
out-of-tree implementations.
> But it looks from this discussion, that it will not prevent from
> changing in-kernel driver - place a hook into skb allocation path and
> allocate data from opposing memory - get pages from another side and put
> them into fragments, then copy headers into skb->data.
>
I don't understand this (opposing memory, another side?). Can you
elaborate?
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <461B48A8.1060904-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-10 8:58 ` Evgeniy Polyakov
2007-04-10 11:21 ` [kvm-devel] " Avi Kivity
0 siblings, 1 reply; 36+ messages in thread
From: Evgeniy Polyakov @ 2007-04-10 8:58 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
On Tue, Apr 10, 2007 at 11:19:52AM +0300, Avi Kivity (avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org) wrote:
> I meant, network aio in the mainline kernel. I am aware of the various
> out-of-tree implementations.
If potential users do not pay attention to the initial implementation, it is
quite hard for it to get in. But actually it does not matter to this
discussion.
> > But it looks from this discussion, that it will not prevent from
> > changing in-kernel driver - place a hook into skb allocation path and
> > allocate data from opposing memory - get pages from another side and put
> > them into fragments, then copy headers into skb->data.
> >
>
> I don't understand this (opposing memory, another side?). Can you
> elaborate?
You want to implement zero-copy network device between host and guest, if
I understood this thread correctly?
So, for the sending part, the device allocates pages from the receiver's
memory (or from shared memory); the receiver gets an 'interrupt' and gets
pages from its own memory, which are attached to a new skb and passed up
the network stack.
It can be extended to use shared ring of pages.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-10 8:58 ` Evgeniy Polyakov
@ 2007-04-10 11:21 ` Avi Kivity
[not found] ` <461B7334.8090807-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-10 11:21 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Rusty Russell, Ingo Molnar, kvm-devel, netdev
Evgeniy Polyakov wrote:
>>> But it looks from this discussion, that it will not prevent from
>>> changing in-kernel driver - place a hook into skb allocation path and
>>> allocate data from opposing memory - get pages from another side and put
>>> them into fragments, then copy headers into skb->data.
>>>
>>>
>> I don't understand this (opposing memory, another side?). Can you
>> elaborate?
>>
>
> You want to implement zero-copy network device between host and guest, if
> I understood this thread correctly?
> So, for sending part, device allocates pages from receiver's memory (or
> from shared memory), receiver gets an 'interrupt' and got pages from own
> memory, which are attached to new skb and transferred up to the network
> stack.
> It can be extended to use shared ring of pages.
>
This is what Xen does. It is actually less performant than copying, IIRC.
The problem with flipping pages around is that physical addresses are
cached both in the kvm mmu and in the on-chip tlbs, necessitating
expensive page table walks and tlb invalidation IPIs.
Note that sending from the guest to an external host can be done
copylessly, and on the receive side a dma engine (like I/OAT) can
reduce the cost of the copy.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <461B7334.8090807-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-10 11:50 ` Evgeniy Polyakov
2007-04-10 12:17 ` [kvm-devel] " Avi Kivity
0 siblings, 1 reply; 36+ messages in thread
From: Evgeniy Polyakov @ 2007-04-10 11:50 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
On Tue, Apr 10, 2007 at 02:21:24PM +0300, Avi Kivity (avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org) wrote:
> >You want to implement zero-copy network device between host and guest, if
> >I understood this thread correctly?
> >So, for sending part, device allocates pages from receiver's memory (or
> >from shared memory), receiver gets an 'interrupt' and got pages from own
> >memory, which are attached to new skb and transferred up to the network
> >stack.
> >It can be extended to use shared ring of pages.
> >
>
> This is what Xen does. It is actually less performant than copying, IIRC.
>
> The problem with flipping pages around is that physical addresses are
> cached both in the kvm mmu and in the on-chip tlbs, necessitating
> expensive page table walks and tlb invalidation IPIs.
Hmm, I'm not familiar with the Xen driver, but a similar technique was used
with a zero-copy network sniffer some time ago: substituting userspace
pages with pages containing skb data was about 25-50% faster than
copying 1500 bytes in general, and on the order of 10 times faster in some
cases.
Please check this link in case we are talking about different ideas:
http://marc.info/?l=linux-netdev&m=112262743505711&w=2
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-10 11:50 ` Evgeniy Polyakov
@ 2007-04-10 12:17 ` Avi Kivity
[not found] ` <461B8069.6070007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-10 12:17 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Rusty Russell, Ingo Molnar, kvm-devel, netdev
Evgeniy Polyakov wrote:
>> This is what Xen does. It is actually less performant than copying, IIRC.
>>
>> The problem with flipping pages around is that physical addresses are
>> cached both in the kvm mmu and in the on-chip tlbs, necessitating
>> expensive page table walks and tlb invalidation IPIs.
>>
>
> Hmm, I'm not familiar with Xen driver, but similar technique was used
> with zero-copy network sniffer some time ago, substituting userspace
> pages with pages containing skb data was about 25-50% faster than
> copying 1500 bytes in general, and in order of 10 times faster in some
> cases.
>
> Check a link please in case we are talking about different ideas:
> http://marc.info/?l=linux-netdev&m=112262743505711&w=2
>
>
I don't really understand what you're testing there. In particular, how
can the copying time change so dramatically depending on whether you've
just rebooted or not?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <461B8069.6070007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-10 12:30 ` Evgeniy Polyakov
[not found] ` <20070410123034.GA11493-9fLWQ3dKdXwox3rIn2DAYQ@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Evgeniy Polyakov @ 2007-04-10 12:30 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
On Tue, Apr 10, 2007 at 03:17:45PM +0300, Avi Kivity (avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org) wrote:
> >Check a link please in case we are talking about different ideas:
> >http://marc.info/?l=linux-netdev&m=112262743505711&w=2
> >
> >
>
> I don't really understand what you're testing there. in particular, how
> can the copying time change so dramatically depending on whether you've
> just rebooted or not?
I tested page remapping time - i.e. time to replace a page in two
different mappings - the same should be performed in host and guest
kernels if such design is going to be used for communication.
I can only explain after-reboot slow copy with empty caches - arbitrary
kernel pages were copied into buffer (not the same data as in posted
code).
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <20070410123034.GA11493-9fLWQ3dKdXwox3rIn2DAYQ@public.gmane.org>
@ 2007-04-10 12:49 ` Avi Kivity
0 siblings, 0 replies; 36+ messages in thread
From: Avi Kivity @ 2007-04-10 12:49 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
Evgeniy Polyakov wrote:
> On Tue, Apr 10, 2007 at 03:17:45PM +0300, Avi Kivity (avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org) wrote:
>
>>> Check a link please in case we are talking about different ideas:
>>> http://marc.info/?l=linux-netdev&m=112262743505711&w=2
>>>
>>>
>>>
>> I don't really understand what you're testing there. in particular, how
>> can the copying time change so dramatically depending on whether you've
>> just rebooted or not?
>>
>
> I tested page remapping time - i.e. time to replace a page in two
> different mappings - the same should be performed in host and guest
> kernels if such design is going to be used for communication.
>
> I can only explain after-reboot slow copy with empty caches - arbitrary
> kernel pages were copied into buffer (not the same data as in posted
> code).
>
Doing this in kvm would be significantly more complex, as we'd need to
use full reverse mapping to locate all guest mappings (we already
reverse map writable pages for other reasons), so the 25-50% difference
might be nullified or even turn into overhead.
Here are the Xen numbers for reference. Xen probably has more overhead
than kvm for such things, though, as it needs to do hypercalls from dom0
which is in-kernel for kvm.
http://lists.xensource.com/archives/html/xen-devel/2007-03/msg01218.html
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-09 13:38 ` Avi Kivity
[not found] ` <461A41CA.9080201-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-11 3:53 ` Rusty Russell
[not found] ` <1176263593.26372.84.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
1 sibling, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2007-04-11 3:53 UTC (permalink / raw)
To: Avi Kivity; +Cc: Ingo Molnar, kvm-devel, netdev
On Mon, 2007-04-09 at 16:38 +0300, Avi Kivity wrote:
> Moreover, some things just don't lend themselves to a userspace
> abstraction. If we want to expose tso (tcp segmentation offload), we
> can easily do so with a kernel driver since the kernel interfaces are
> all tso aware. Tacking on tso awareness to tun/tap is doable, but at
> the very least weird.
It is kinda weird, yes, but it certainly makes sense. All the arguments
for tso apply in triplicate to userspace packet sends...
> > We're dealing with the tun/tap device here, not a socket.
>
> Hmm. tun actually has aio_write implemented, but it seems synchronous.
> So does the read path.
>
> If these are made truly asynchronous, and the write path is made in
> addition copyless, then we might have something workable. I still
> cringe at having a pagetable walk in order to deliver a 1500-byte packet.
Right, now we're talking!
However, it's not clear to me why creating an skb which references a kvm
guest's memory doesn't need a pagetable walk, but a packet in (other)
userspace memory does?
My conviction which started this discussion is that if we can offer an
efficient interface for kvm, we should be able to offer an efficient
interface for any (other) userspace.
As to async, I'm not *so* worried about that for the moment, although it
would probably be nicer to fail than to block. Otherwise we could
simply set an skb destructor to wake us up.
> > Again, sendfile is a *much* harder problem than sending a single packet
> > once, which is the question here.
>
> sendfile() is a *different* problem. It doesn't need completion because
> the data is assumed not to change under it.
Well, let's not argue over that, it's irrelevant. Hopefully we can do
that over a beer or equivalent sometime.
I think the first step is to see how much worse a decent userspace net
driver is compared with the current in-kernel one.
Rusty.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <1176263593.26372.84.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-04-11 4:26 ` Avi Kivity
[not found] ` <461C6360.1060908-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-11 4:26 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
Rusty Russell wrote:
> On Mon, 2007-04-09 at 16:38 +0300, Avi Kivity wrote:
>
>> Moreover, some things just don't lend themselves to a userspace
>> abstraction. If we want to expose tso (tcp segmentation offload), we
>> can easily do so with a kernel driver since the kernel interfaces are
>> all tso aware. Tacking on tso awareness to tun/tap is doable, but at
>> the very least weird.
>>
>
> It is kinda weird, yes, but it certainly makes sense. All the arguments
> for tso apply in triplicate to userspace packet sends...
>
>
Well, write() with a large buffer is a sort of tso device. The problem
is tso breaks through several layers (like I'm advocating in the other
thread :), pushing tcp functionality into ethernet. Well, we've seen worse.
>>> We're dealing with the tun/tap device here, not a socket.
>>>
>> Hmm. tun actually has aio_write implemented, but it seems synchronous.
>> So does the read path.
>>
>> If these are made truly asynchronous, and the write path is made in
>> addition copyless, then we might have something workable. I still
>> cringe at having a pagetable walk in order to deliver a 1500-byte packet.
>>
>
> Right, now we're talking!
>
> However, it's not clear to me why creating an skb which references a kvm
> guest's memory doesn't need a pagetable walk, but a packet in (other)
> userspace memory does?
>
Currently guest pages are stashed in a kernel array, as well as being
mmap()ed into user space.
That's not a very strong argument though, as I'd like to be able to map
userspace memory into the guest, or map address_spaces to the guest, or
something, so accessing guest physical memory will become more expensive
in time.
> My conviction which started this discussion is that if we can offer an
> efficient interface for kvm, we should be able to offer an efficient
> interface for any (other) userspace.
>
Fully agreed. It's mostly a question of who and when. Designing and
implementing this interface is going to be difficult, require deep
knowledge of Linux networking, and consume a lot of time.
> As to async, I'm not *so* worried about that for the moment, although it
> would probably be nicer to fail than to block. Otherwise we could
> simply set an skb destructor to wake us up.
>
Nope. Being async is critical for copyless networking:
- in the transmit path, we need to stop the sender (guest) from touching
the memory until it's on the wire. This means 100% of packets sent will
be blocked.
- in the receive path, you could separate receive notification from the
single copy that must be done (like poll() + read()), but to make use of
dma engines you need to provide the end address beforehand.
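The poll() + read() split mentioned for the receive path can be sketched as below. This is a rough illustration, not kvm or tap code: a Unix datagram socketpair stands in for the real device, and the names tx, rx and guest_buffer are made up for the example. It only shows the control flow, where notification arrives first and the destination address is supplied afterwards; it does not model a dma engine.

```python
import selectors
import socket

# Assumption: a datagram socketpair stands in for the host network device.
tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
rx.setblocking(False)

sel = selectors.DefaultSelector()
sel.register(rx, selectors.EVENT_READ)

# Hypothetical destination buffer, standing in for guest memory.
guest_buffer = bytearray(2048)

tx.send(b"incoming packet")

# Step 1: notification only. Nothing has been copied into guest_buffer yet.
events = sel.select(timeout=1.0)
ready = [key.fileobj for key, _mask in events]

# Step 2: the single copy, into a destination chosen after the notification.
nbytes = rx.recv_into(guest_buffer) if rx in ready else 0

sel.close()
tx.close()
rx.close()
```

Because the destination is only named in step 2, nothing could have started moving data into it earlier, which is the limitation the second bullet points at: an asynchronous interface would take the buffer address up front.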
> I think the first step is to see how much worse a decent userspace net
> driver is compared with the current in-kernel one.
>
A userspace net interface needs to provide the following:
- true async operations
- multiple packets per operation (for interrupt mitigation) (like
lio_listio)
- scatter/gather packets (iovecs)
- configurable wakeup (by packet count/timeout) for queue management
- hacks (tso)
Most of these can be provided by a combination of the pending aio work,
the pending aio/fd integration, and the not-so-pending tap aio work. As
the first two are available as patches and the third is limited to the
tap device, it is not unreasonable to try it out. Maybe it will turn
out not to be as difficult as I predicted just a few lines above.
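As a rough idea of what "configurable wakeup (by packet count/timeout)" could mean, here is a small sketch. The class name WakeupPolicy, its parameters and the explicit "now" argument are all invented for illustration; nothing here corresponds to an existing kernel or tap interface.

```python
class WakeupPolicy:
    """Release a batch after max_packets packets or max_delay seconds."""

    def __init__(self, max_packets, max_delay):
        self.max_packets = max_packets
        self.max_delay = max_delay
        self.pending = []
        self.first_ts = None  # arrival time of the oldest pending packet

    def add(self, packet, now):
        """Queue a packet; return a batch if a wakeup is due, else []."""
        if not self.pending:
            self.first_ts = now
        self.pending.append(packet)
        return self._maybe_flush(now)

    def tick(self, now):
        """Timer callback; return a batch if the timeout has expired."""
        return self._maybe_flush(now)

    def _maybe_flush(self, now):
        if not self.pending:
            return []
        full = len(self.pending) >= self.max_packets
        stale = now - self.first_ts >= self.max_delay
        if full or stale:
            batch, self.pending = self.pending, []
            return batch
        return []


policy = WakeupPolicy(max_packets=3, max_delay=0.5)
r1 = policy.add(b"a", now=0.0)   # queued, no wakeup yet
r2 = policy.add(b"b", now=0.1)   # queued, no wakeup yet
r3 = policy.add(b"c", now=0.2)   # third packet reaches the count threshold
r4 = policy.add(b"d", now=1.0)   # starts a new batch
r5 = policy.tick(now=1.6)        # timeout expires for the lone packet
```

The consumer is woken once per batch rather than once per packet, which is the queue-management behaviour the list asks for; the time is passed in explicitly only to keep the sketch deterministic.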
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <461C6360.1060908-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-11 13:23 ` Rusty Russell
[not found] ` <1176297794.14322.72.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2007-04-11 13:23 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
On Wed, 2007-04-11 at 07:26 +0300, Avi Kivity wrote:
> Nope. Being async is critical for copyless networking:
>
> - in the transmit path, so need to stop the sender (guest) from touching
> the memory until it's on the wire. This means 100% of packets sent will
> be blocked.
Hi Avi,
You keep saying stuff like this, and I keep ignoring it. OK, I'll
bite:
Why would we try to prevent the sender from altering the packets?
> A userspace net interface needs to provide the following:
>
> - true async operations
I'll hold on this pending discussion above.
> - multiple packets per operation (for interrupt mitigation) (like
> lio_listio)
The benefits for interrupt mitigation are less clear to me in a virtual
environment (scheduling tends to make it happen anyway); I'd want to
benchmark it.
Some kind of batching to reduce syscall overhead, perhaps, but TSO would
go a fair way towards that anyway (probably not enough).
> - scatter/gather packets (iovecs)
Yes, and this is already present in the tap device. Anthony suggested a
slightly nasty hack for multiple sg packets in one writev()/readv(), which
could also give us batching.
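For readers unfamiliar with scatter/gather writes, a minimal sketch of the idea follows. It is an assumption-laden stand-in: opening the real tap device needs privileges, so a Unix datagram socketpair plays its role here, and the header and payload contents are placeholders. The multi-packet hack mentioned above is not shown, since its details are not given in this thread.

```python
import os
import socket

# Assumption: a datagram socketpair stands in for the tap file descriptor.
tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

eth_header = bytes(14)               # placeholder 14-byte Ethernet header
payload = b"hello from the guest"    # placeholder packet body

# One syscall gathers two separate buffers into a single packet,
# without first joining them in userspace.
sent = os.writev(tx.fileno(), [eth_header, payload])

# The other end sees one packet containing both pieces back to back.
frame = rx.recv(2048)

tx.close()
rx.close()
```

The same iovec-style call works on any file descriptor, so the userspace side of such a driver would not need to assemble header and data into one contiguous buffer before handing the packet to the kernel.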
> - configurable wakeup (by packet count/timeout) for queue management
I'm not convinced that this is a showstopper, though.
> - hacks (tso)
I'd usually go for a batch interface over TSO, but if the card we're
sending to actually does TSO then TSO will probably win.
> Most of these can be provided by a combination of the pending aio work,
> the pending aio/fd integration, and the not-so-pending tap aio work. As
> the first two are available as patches and the third is limited to the
> tap device, it is not unreasonable to try it out. Maybe it will turn
> out not to be as difficult as I predicted just a few lines above.
Indeed, I don't think we're asking for a revolution a-la VJ-style
channels. But I'm still itching to get back to that, and this might yet
provide an excuse 8)
Cheers,
Rusty.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <1176297794.14322.72.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-04-11 14:28 ` Avi Kivity
[not found] ` <461CF098.3090003-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-11 14:28 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, netdev
Rusty Russell wrote:
> On Wed, 2007-04-11 at 07:26 +0300, Avi Kivity wrote:
>
>> Nope. Being async is critical for copyless networking:
>>
>> - in the transmit path, so need to stop the sender (guest) from touching
>> the memory until it's on the wire. This means 100% of packets sent will
>> be blocked.
>>
>
> Hi Avi,
>
> You keep saying stuff like this, and I keep ignoring it. OK, I'll
> bite:
>
> Why would we try to prevent the sender from altering the packets?
>
>
To avoid data corruption.
The guest wants to send a packet. It calls write(), which causes an skb
to be allocated and data to be copied into it; the entire networking stack
gets into gear, and the guest-side driver instructs the "device" to send
the packet.
With async operations, the saga continues like this: the host-side
driver allocates an skb, get_page()s and attaches the data to the new
skb, this skb crosses the bridge, trickles into the real ethernet
device, gets queued there, sent, interrupts fire, triggering async
completion. On this completion, we send a virtual interrupt to the
guest, which tells it to destroy the skb and reclaim the pages attached
to it.
Without async operations, we don't have a hook to notify the guest when
to reclaim the skb. If we do it too soon, the skb can be reclaimed and
the memory reused before the real device gets to see it, so we end up
sending data that we did not intend. The only way to avoid it is to
copy the data somewhere safe, but that is exactly what we don't want to do.
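The ordering constraint described above can be modeled in a few lines of C. This is only a sketch under stated assumptions: struct tx_buf, host_submit(), host_tx_complete() and guest_can_reclaim() are hypothetical names invented for illustration, not kernel or kvm interfaces, and the pin counter merely stands in for what get_page() does on real pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one guest transmit buffer; not a kernel structure. */
struct tx_buf {
    const char *data;   /* guest memory holding the packet */
    size_t len;
    int pins;           /* host references, standing in for get_page() */
    bool completed;     /* set once the completion "interrupt" is delivered */
};

/* Guest hands the buffer to the host, which pins it instead of copying. */
static void host_submit(struct tx_buf *b)
{
    b->pins++;
    b->completed = false;
}

/* The real device is done with the data: drop the pin, notify the guest. */
static void host_tx_complete(struct tx_buf *b)
{
    assert(b->pins > 0);
    b->pins--;
    if (b->pins == 0)
        b->completed = true;
}

/* The guest may reuse the memory only after the completion has arrived. */
static bool guest_can_reclaim(const struct tx_buf *b)
{
    return b->pins == 0 && b->completed;
}
```

In this model guest_can_reclaim() stays false between host_submit() and host_tx_complete(), which is the window in which the guest must not reuse the memory.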
>> - multiple packets per operation (for interrupt mitigation) (like
>> lio_listio)
>>
>
> The benefits for interrupt mitigation are less clear to me in a virtual
> environment (scheduling tends to make it happen anyway); I'd want to
> benchmark it.
>
>
Yes, the guest will probably submit multiple packets in one hypercall.
It would be nice for the userspace driver to be able to submit them to
the host kernel in one syscall.
> Some kind of batching to reduce syscall overhead, perhaps, but TSO would
> go a fair way towards that anyway (probably not enough).
>
>
For some workloads, sure.
>> - scatter/gather packets (iovecs)
>>
>
> Yes, and this is already present in the tap device. Anthony suggested a
> slightly nasty hack for multiple sg packets in one writev()/readv, which
> could also give us batching.
>
>
No need for hacks if we get list aio support one day.
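For reference, scatter/gather for a single packet with plain writev() looks roughly like the following. It is a minimal sketch, not the hack Anthony suggested: send_sg_packet() and read_packet() are hypothetical helpers, and fd is assumed to be any descriptor that treats one write as one packet (as a tap device does).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Send one packet whose header and payload live in separate buffers.
 * writev() gathers both pieces in a single syscall, without first
 * copying them into one contiguous buffer in userspace.
 */
static ssize_t send_sg_packet(int fd, const void *hdr, size_t hdr_len,
                              const void *payload, size_t payload_len)
{
    struct iovec iov[2];

    assert(hdr != NULL && payload != NULL);
    iov[0].iov_base = (void *)hdr;      /* writev() only reads from iov_base */
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = (void *)payload;
    iov[1].iov_len = payload_len;
    return writev(fd, iov, 2);
}

/* Read back whatever is available, up to cap bytes. */
static ssize_t read_packet(int fd, void *buf, size_t cap)
{
    return read(fd, buf, cap);
}
```

This covers one packet per syscall only; submitting several such packets at once is the part that needs either the writev() trick or list aio.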
>> - configurable wakeup (by packet count/timeout) for queue management
>>
>
> I'm not convinced that this is a showstopper, though.
>
It probably isn't. It's free with aio though.
>
>> - hacks (tso)
>>
>
> I'd usually go for a batch interface over TSO, but if the card we're
> sending to actually does TSO then TSO will probably win.
>
Sure, if tso helps a regular host then it should help one that happens
to be running a virtual machine.
>
>> Most of these can be provided by a combination of the pending aio work,
>> the pending aio/fd integration, and the not-so-pending tap aio work. As
>> the first two are available as patches and the third is limited to the
>> tap device, it is not unreasonable to try it out. Maybe it will turn
>> out not to be as difficult as I predicted just a few lines above.
>>
>
> Indeed, I don't think we're asking for a revolution a-la VJ-style
> channels. But I'm still itching to get back to that, and this might yet
> provide an excuse 8)
>
I'll be happy if this can be made to work. It will make the paravirt
guest-side driver work in kvm-less setups, which are useful for testing,
and of course reduction in kernel code is beneficial. It will be slower
than in-kernel, but if we get the batching right, perhaps not
significantly slower. I'm mostly concerned that this depends on code
that has eluded merging for such a long time.
--
error compiling committee.c: too many arguments to function
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <461CF098.3090003-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-11 23:30 ` Rusty Russell
[not found] ` <1176334200.14322.133.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
0 siblings, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2007-04-11 23:30 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel, netdev
On Wed, 2007-04-11 at 17:28 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > On Wed, 2007-04-11 at 07:26 +0300, Avi Kivity wrote:
> >
> >> Nope. Being async is critical for copyless networking:
> >>
> With async operations, the saga continues like this: the host-side
> driver allocates an skb, get_page()s and attaches the data to the new
> skb, this skb crosses the bridge, trickles into the real ethernet
> device, gets queued there, sent, interrupts fire, triggering async
> completion. On this completion, we send a virtual interrupt to the
> guest, which tells it to destroy the skb and reclaim the pages attached
> to it.
Hi Avi!
Thanks for spelling it out, I now understand your POV. I had
considered it obvious that a (non-async) write which didn't copy would
block until the skb was finished with, which is easy to code up within
the tap device itself. Otherwise it's actually an async write without a
notification mechanism, which I agree is broken.
Note though: if the guest can change the packet headers they can
subvert some firewall rules and possibly crash the host. None of the
networking code I wrote expects packets to change in flight 8(
This applies to a userspace or kernelspace driver.
> > Yes, and this is already present in the tap device. Anthony suggested a
> > slightly nasty hack for multiple sg packets in one writev()/readv, which
> > could also give us batching.
>
> No need for hacks if we get list aio support one day.
As you point out though, aio is not something we want to hold our breath
for. Plus, aio never makes things simpler, and complexity kills
puppies.
Cheers!
Rusty.
* Re: QEMU PIC indirection patch for in-kernel APIC work
[not found] ` <1176334200.14322.133.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-04-12 3:32 ` Avi Kivity
2007-04-16 0:22 ` [kvm-devel] " Rusty Russell
0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2007-04-12 3:32 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel, netdev
Rusty Russell wrote:
> On Wed, 2007-04-11 at 17:28 +0300, Avi Kivity wrote:
>
>> Rusty Russell wrote:
>>
>>> On Wed, 2007-04-11 at 07:26 +0300, Avi Kivity wrote:
>>>
>>>
>>>> Nope. Being async is critical for copyless networking:
>>>>
>>>>
>> With async operations, the saga continues like this: the host-side
>> driver allocates an skb, get_page()s and attaches the data to the new
>> skb, this skb crosses the bridge, trickles into the real ethernet
>> device, gets queued there, sent, interrupts fire, triggering async
>> completion. On this completion, we send a virtual interrupt to the
>> guest, which tells it to destroy the skb and reclaim the pages attached
>> to it.
>>
>
> Hi Avi!
>
> Thanks for spelling it out, I now understand your POV. I had
> considered it obvious that a (non-async) write which didn't copy would
> block until the skb was finished with, which is easy to code up within
> the tap device itself. Otherwise it's actually an async write without a
> notification mechanism, which I agree is broken.
>
>
I hadn't considered an always-blocking (or unbuffered) networking API.
It's very counter to current APIs, but does make sense with things like
syslets. Without syslets, I don't think it's very useful as you need
some artificial threads to keep things humming along.
(How would userspace specify it? O_DIRECT when opening the tap?)
I don't think there's a lot of difference between implementing aio or
always-blocking copyless writes for tap. They just differ in how they
sleep and in how to access user pages.
> Note though: if the guest can change the packet headers they can
> subvert some firewall rules and possibly crash the host. None of the
> networking code I wrote expects packets to change in flight 8(
>
> This applies to a userspace or kernelspace driver.
>
>
Umm, right. We could write-protect the packets (which would be very
expensive). We could set the evil bit on guest-originated packets, and
rewrite the entire networking stack to copy any part which is inspected
if the evil bit is set. We need more head-scratching on this.
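The "copy any part which is inspected" idea amounts to taking a private snapshot of the header before making any decision on it. A minimal sketch follows, assuming hypothetical names throughout: HDR_MAX, snapshot_header() and header_allowed() are made up for illustration, and the one-byte rule is a placeholder for real firewall logic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HDR_MAX 64  /* hypothetical bound on the bytes that get inspected */

/*
 * Copy at most HDR_MAX header bytes out of guest-shared memory into a
 * private buffer. Returns the number of bytes copied.
 */
static size_t snapshot_header(const unsigned char *shared, size_t pkt_len,
                              unsigned char priv[HDR_MAX])
{
    size_t n = pkt_len < HDR_MAX ? pkt_len : HDR_MAX;

    assert(shared != NULL && priv != NULL);
    memcpy(priv, shared, n);
    return n;
}

/* Placeholder rule: accept only a first byte of 0x45 (IPv4, 20-byte header). */
static int header_allowed(const unsigned char *hdr, size_t n)
{
    return n >= 1 && hdr[0] == 0x45;
}
```

A decision made on the snapshot cannot be invalidated by the guest rewriting the shared bytes afterwards. The payload can stay copyless, but the bytes that are actually sent must then also come from the snapshot, otherwise the check and the wire contents can still disagree.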
>>> Yes, and this is already present in the tap device. Anthony suggested a
>>> slightly nasty hack for multiple sg packets in one writev()/readv, which
>>> could also give us batching.
>>>
>> No need for hacks if we get list aio support one day.
>>
>
> As you point out though, aio is not something we want to hold our breath
> for. Plus, aio never makes things simpler, and complexity kills
> puppies.
>
The puppies had better stay away from qemu then, as it is completely async.
Always-blocking writes won't reduce complexity. Suddenly you need a
thread for each request batch and some pleasant code for joining the
threads when done. Syslets do make it go away, though they're more for
the mostly-nonblocking-with-occasional-blockage stuff rather than the
always blocking thingie you describe.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-12 3:32 ` Avi Kivity
@ 2007-04-16 0:22 ` Rusty Russell
2007-04-16 5:13 ` Avi Kivity
0 siblings, 1 reply; 36+ messages in thread
From: Rusty Russell @ 2007-04-16 0:22 UTC (permalink / raw)
To: Avi Kivity; +Cc: Ingo Molnar, kvm-devel, netdev
On Thu, 2007-04-12 at 06:32 +0300, Avi Kivity wrote:
> I hadn't considered an always-blocking (or unbuffered) networking API.
> It's very counter to current APIs, but does make sense with things like
> syslets. Without syslets, I don't think it's very useful as you need
> some artificial threads to keep things humming along.
>
> (How would userspace specify it? O_DIRECT when opening the tap?)
TBH, I hadn't thought that far. Tap already has those IFF_NO_PI etc.
flags, but it might make sense for this to just be the default. From
userspace's POV it's not a semantic change.
OK, just tested: I can get 230,000 packets (28 byte UDP) through the tun
device in a second (130,000 actually out the 100-base-T NIC, 100,000
dropped). If the tun driver's write() blocks until the skb is
destroyed, it's 4,000 packets.
So your intuition was right: skb_free latency on xmit (at least for this
e1000) is far too large for anything but an async solution.
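The figures above translate into per-packet times with simple arithmetic. The sketch below assumes a single writer with one packet in flight at a time, which the message does not state; usec_per_packet() and slowdown() are hypothetical helper names.

```c
#include <assert.h>

/* Mean time per packet, in microseconds, for a given packet rate. */
static double usec_per_packet(double packets_per_sec)
{
    assert(packets_per_sec > 0.0);
    return 1e6 / packets_per_sec;
}

/* How many times lower the slow rate is than the fast rate. */
static double slowdown(double fast_pps, double slow_pps)
{
    assert(slow_pps > 0.0);
    return fast_pps / slow_pps;
}
```

Under that assumption, usec_per_packet(4000.0) gives 250 microseconds spent waiting per packet in the blocking case, against roughly 4.3 microseconds at 230,000 packets per second, and slowdown(230000.0, 4000.0) gives a factor of 57.5.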
Will ponder further.
Thanks!
Rusty.
* Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work
2007-04-16 0:22 ` [kvm-devel] " Rusty Russell
@ 2007-04-16 5:13 ` Avi Kivity
0 siblings, 0 replies; 36+ messages in thread
From: Avi Kivity @ 2007-04-16 5:13 UTC (permalink / raw)
To: Rusty Russell; +Cc: Ingo Molnar, kvm-devel, netdev
Rusty Russell wrote:
> On Thu, 2007-04-12 at 06:32 +0300, Avi Kivity wrote:
>
>> I hadn't considered an always-blocking (or unbuffered) networking API.
>> It's very counter to current APIs, but does make sense with things like
>> syslets. Without syslets, I don't think it's very useful as you need
>> some artificial threads to keep things humming along.
>>
>> (How would userspace specify it? O_DIRECT when opening the tap?)
>>
>
> TBH, I hadn't thought that far. Tap already has those IFF_NO_PI etc
> flags, but it might make sense to just be the default. From userspace's
> POV it's not a semantic change.
>
> OK, just tested: I can get 230,000 packets (28 byte UDP) through the tun
> device in a second (130,000 actually out the 100-base-T NIC, 100,000
> dropped). If the tun driver's write() blocks until the skb is
> destroyed, it's 4,000 packets.
>
> So your intuition was right: skb_free latency on xmit (at least for this
> e1000) is far too large for anything but an async solution.
>
> Will ponder further.
>
I think aio_write (but done copylessly) is the way to go. Not only
is the infrastructure there, but the API already allows for multiple
packet submission and for batching completions. Fitting into that
framework ought to be easier than starting yet another one.
It still misses scatter/gather and integration with fd-based
notification, but there are patches around for that.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.