* [RFC] Virtualization steps
@ 2006-03-24 17:19 Kirill Korotaev
2006-03-24 17:33 ` Nick Piggin
` (2 more replies)
0 siblings, 3 replies; 125+ messages in thread
From: Kirill Korotaev @ 2006-03-24 17:19 UTC (permalink / raw)
To: Eric W. Biederman, haveblue, linux-kernel, herbert, devel, serue,
akpm, sam, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Eric, Herbert,
I think it is quite clear that without some agreement on all these
virtualization issues, we won't be able to commit anything good to
mainstream. My idea is to combine our efforts to get consensus on the
cleanest parts of the code first and commit them one by one.
The proposal is quite simple. We have 4 parties in this conversation
(maybe more?): IBM guys, OpenVZ, VServer and Eric Biederman. We discuss
the areas which should be considered step by step. Send patches for each
area, discuss, come to some agreement and all 4 parties Sign-Off the
patch. After that it goes to Andrew/Linus. Worth trying?
So far (correct me if I'm wrong), we have concluded that some people don't
want containers as a whole, but do want some subsystem namespaces. I
suppose for people who care only about containers it doesn't matter, so
we can proceed with namespaces, yeah?
So the easiest namespaces to discuss, as I see it, are:
- utsname
- sys IPC
- network virtualization
- netfilter virtualization
all of these were already discussed to some extent, and it looks like there
are no fundamental differences in our approaches (at least between OpenVZ
and Eric, for sure).
Right now, I suggest we concentrate on the first 2 namespaces - utsname and
sysvipc. They are small enough and easy. Let's consider them without
sysctl/proc issues, as those can be resolved later. I sent the patches
for these 2 namespaces to all of you. I really hope for some _good_
criticism, so we can work it out quickly.
Thanks,
Kirill
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-24 17:19 [RFC] Virtualization steps Kirill Korotaev
@ 2006-03-24 17:33 ` Nick Piggin
2006-03-24 19:25 ` Dave Hansen
2006-03-28 9:02 ` Kirill Korotaev
2006-03-24 18:36 ` Eric W. Biederman
2006-03-24 21:19 ` Herbert Poetzl
2 siblings, 2 replies; 125+ messages in thread
From: Nick Piggin @ 2006-03-24 17:33 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Eric W. Biederman, haveblue, linux-kernel, herbert, devel, serue,
akpm, sam, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Kirill Korotaev wrote:
> Eric, Herbert,
>
> I think it is quite clear, that without some agreement on all these
> virtualization issues, we won't be able to commit anything good to
> mainstream. My idea is to gather our efforts to get consensus on most
> clean parts of code first and commit them one by one.
>
> The proposal is quite simple. We have 4 parties in this conversation
> (maybe more?): IBM guys, OpenVZ, VServer and Eric Biederman. We discuss
> the areas which should be considered step by step. Send patches for each
> area, discuss, come to some agreement and all 4 parties Sign-Off the
> patch. After that it goes to Andrew/Linus. Worth trying?
Oh, after you come to an agreement and start posting patches, can you
also outline why we want this in the kernel (what it does that low
level virtualization doesn't, etc, etc), and how and why you've agreed
to implement it. Basically, some background and a summary of your
discussions for those who can't follow everything. Or is that a faq
item?
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [RFC] Virtualization steps
2006-03-24 17:19 [RFC] Virtualization steps Kirill Korotaev
2006-03-24 17:33 ` Nick Piggin
@ 2006-03-24 18:36 ` Eric W. Biederman
2006-03-24 21:19 ` Herbert Poetzl
2 siblings, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-24 18:36 UTC (permalink / raw)
To: Kirill Korotaev
Cc: haveblue, linux-kernel, herbert, devel, serue, akpm, sam,
Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Kirill Korotaev <dev@sw.ru> writes:
> Eric, Herbert,
>
> I think it is quite clear, that without some agreement on all these
> virtualization issues, we won't be able to commit anything good to
> mainstream. My idea is to gather our efforts to get consensus on most clean
> parts of code first and commit them one by one.
>
> The proposal is quite simple. We have 4 parties in this conversation (maybe
> more?): IBM guys, OpenVZ, VServer and Eric Biederman. We discuss the areas which
> should be considered step by step. Send patches for each area, discuss, come to
> some agreement and all 4 parties Sign-Off the patch. After that it goes to
> Andrew/Linus. Worth trying?
Yes, this sounds like a path forward that has a reasonable chance of
making progress.
> So far, (correct me if I'm wrong) we concluded that some people don't want
> containers as a whole, but want some subsystem namespaces. I suppose for people
> who care about containers only it doesn't matter, so we can proceed with
> namespaces, yeah?
Yes, I think at one point I have seen all of the major parties receptive
to the concept.
> So the most easy namespaces to discuss I see:
> - utsname
> - sys IPC
> - network virtualization
> - netfilter virtualization
The networking is hard simply because there is so very much of it, and it
is being actively developed :)
> all these were discussed already somehow and looks like there is no fundamental
> differences in our approaches (at least OpenVZ and Eric, for sure).
Yes. I think we agree on what the semantics should be for these parts.
That should avoid the problem we have with the pid namespace.
> Right now, I suggest to concentrate on first 2 namespaces - utsname and
> sysvipc. They are small enough and easy. Let's consider them without sysctl/proc
> issues, as those can be resolved later. I sent the patches for these 2
> namespaces to all of you. I really hope for some _good_ criticism, so we could
> work it out quickly.
Sounds like a plan.
Eric
* Re: [RFC] Virtualization steps
2006-03-24 17:33 ` Nick Piggin
@ 2006-03-24 19:25 ` Dave Hansen
2006-03-24 19:53 ` Eric W. Biederman
` (2 more replies)
2006-03-28 9:02 ` Kirill Korotaev
1 sibling, 3 replies; 125+ messages in thread
From: Dave Hansen @ 2006-03-24 19:25 UTC (permalink / raw)
To: Nick Piggin
Cc: Kirill Korotaev, Eric W. Biederman, linux-kernel, herbert, devel,
serue, akpm, sam, Alexey Kuznetsov, Pavel Emelianov,
Stanislav Protassov
On Sat, 2006-03-25 at 04:33 +1100, Nick Piggin wrote:
> Oh, after you come to an agreement and start posting patches, can you
> also outline why we want this in the kernel (what it does that low
> level virtualization doesn't, etc, etc)
Can you wait for an OLS paper? ;)
I'll summarize it this way: low-level virtualization uses resources
inefficiently.
With this higher-level stuff, you get to share all of the Linux caching,
and can do things like sharing libraries pretty naturally.
They are also much lighter-weight to create and destroy than full
virtual machines. We were planning on doing some performance
comparisons versus some hypervisors like Xen and the ppc64 one to show
scaling with the number of virtualized instances. Creating 100 of these
Linux containers is as easy as a couple of shell scripts, but we still
can't find anybody crazy enough to go create 100 Xen VMs.
Anyway, those are the things that came to my mind first. I'm sure the
others involved have their own motivations.
-- Dave
* Re: [RFC] Virtualization steps
2006-03-24 19:25 ` Dave Hansen
@ 2006-03-24 19:53 ` Eric W. Biederman
2006-03-28 4:28 ` Bill Davidsen
2006-03-28 20:29 ` [Devel] " Jun OKAJIMA
2 siblings, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-24 19:53 UTC (permalink / raw)
To: Dave Hansen
Cc: Nick Piggin, Kirill Korotaev, linux-kernel, herbert, devel, serue,
akpm, sam, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Dave Hansen <haveblue@us.ibm.com> writes:
> On Sat, 2006-03-25 at 04:33 +1100, Nick Piggin wrote:
>> Oh, after you come to an agreement and start posting patches, can you
>> also outline why we want this in the kernel (what it does that low
>> level virtualization doesn't, etc, etc)
>
> Can you wait for an OLS paper? ;)
>
> I'll summarize it this way: low-level virtualization uses resource
> inefficiently.
>
> With this higher-level stuff, you get to share all of the Linux caching,
> and can do things like sharing libraries pretty naturally.
Also it is a major enabler for things such as process migration
between kernels.
> They are also much lighter-weight to create and destroy than full
> virtual machines. We were planning on doing some performance
> comparisons versus some hypervisors like Xen and the ppc64 one to show
> scaling with the number of virtualized instances. Creating 100 of these
> Linux containers is as easy as a couple of shell scripts, but we still
> can't find anybody crazy enough to go create 100 Xen VMs.
One of my favorite test cases is to kill about 100 of them
simultaneously :)
I think on a reasonably beefy dual processor machine I should be able
to get about 1000 of them running all at once.
> Anyway, those are the things that came to my mind first. I'm sure the
> others involved have their own motivations.
The practical aspect is that several groups have found the arguments
compelling enough that they have already done complete
implementations. At which point getting us all to agree on a common
implementation is important. :)
Eric
* Re: [RFC] Virtualization steps
2006-03-24 17:19 [RFC] Virtualization steps Kirill Korotaev
2006-03-24 17:33 ` Nick Piggin
2006-03-24 18:36 ` Eric W. Biederman
@ 2006-03-24 21:19 ` Herbert Poetzl
2006-03-27 18:45 ` Eric W. Biederman
2006-03-28 21:58 ` Eric W. Biederman
2 siblings, 2 replies; 125+ messages in thread
From: Herbert Poetzl @ 2006-03-24 21:19 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Eric W. Biederman, haveblue, linux-kernel, devel, serue, akpm,
sam, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
On Fri, Mar 24, 2006 at 08:19:59PM +0300, Kirill Korotaev wrote:
> Eric, Herbert,
>
> I think it is quite clear, that without some agreement on all these
> virtualization issues, we won't be able to commit anything good to
> mainstream. My idea is to gather our efforts to get consensus on most
> clean parts of code first and commit them one by one.
>
> The proposal is quite simple. We have 4 parties in this conversation
> (maybe more?): IBM guys, OpenVZ, VServer and Eric Biederman. We
> discuss the areas which should be considered step by step. Send
> patches for each area, discuss, come to some agreement and all 4
> parties Sign-Off the patch. After that it goes to Andrew/Linus.
> Worth trying?
sounds good to me, as long as we do not consider
the patches 'final' atm .. because I think we should
try to test them with _all_ currently existing solutions
first ... we do not need to bother Andrew with stuff
which doesn't work for the existing and future 'users'.
so IMHO, we should make a kernel branch (Eric or Sam
are probably willing to maintain that), which we keep
in-sync with mainline (not necessarily git, but at
least snapshot wise), where we put all the patches
we agree on, and each party should then adjust the
existing solution to this kernel, so we get some deep
testing in the process, and everybody can see if it
'works' for him or not ...
things where we agree that it 'just works' for everyone
can always be handed upstream, and would probably make
perfect patches for Andrew ...
> So far, (correct me if I'm wrong) we concluded that some people don't
> want containers as a whole, but want some subsystem namespaces. I
> suppose for people who care about containers only it doesn't matter, so
> we can proceed with namespaces, yeah?
yes, the emphasis here should be on lightweight and
modular, so that those folks interested in full featured
containers can just 'assemble' the pieces, while those
desiring service/space isolation pick their subsystems
one by one ...
> So the most easy namespaces to discuss I see:
> - utsname
yes, that's definitely one we can start with, as it seems
that we already have _very_ similar implementations
> - sys IPC
this is something which is also related to limits and
should get special attention with resource sharing,
isolation and control in mind
> - network virtualization
here I see many issues, as for example Linux-VServer
does not necessarily aim for full virtualization, when
simple and performant isolation is sufficient.
don't get me wrong, we are _not_ against network
virtualization per se, but isolation is just so
much simpler to administer and often much more
performant, so it is very interesting for service
separation as well as security applications
just consider the 'typical' service isolation aspect
where you want to have two apaches, separated on two
IPs, but communicating with a single sql database
> - netfilter virtualization
same as for network virtualization, but not really
an issue if it can be 'disabled'
of course, the ideal solution would be some kind
of hybrid, where you can have virtual interfaces as
well as isolated IPs, side-by-side ...
> all these were discussed already somehow and looks like there is no
> fundamental differences in our approaches (at least OpenVZ and Eric,
> for sure).
>
> Right now, I suggest to concentrate on first 2 namespaces - utsname
> and sysvipc. They are small enough and easy. Let's consider them
> without sysctl/proc issues, as those can be resolved later. I sent the
> patches for these 2 namespaces to all of you. I really hope for some
> _good_ criticism, so we could work it out quickly.
will look into them soon ...
best,
Herbert
> Thanks,
> Kirill
* Re: [RFC] Virtualization steps
2006-03-24 21:19 ` Herbert Poetzl
@ 2006-03-27 18:45 ` Eric W. Biederman
2006-03-28 8:51 ` Kirill Korotaev
2006-03-28 21:58 ` Eric W. Biederman
1 sibling, 1 reply; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-27 18:45 UTC (permalink / raw)
To: Kirill Korotaev
Cc: haveblue, linux-kernel, devel, serue, akpm, sam, Alexey Kuznetsov,
Pavel Emelianov, Stanislav Protassov
Herbert Poetzl <herbert@13thfloor.at> writes:
> On Fri, Mar 24, 2006 at 08:19:59PM +0300, Kirill Korotaev wrote:
>> Eric, Herbert,
>>
>> I think it is quite clear, that without some agreement on all these
>> virtualization issues, we won't be able to commit anything good to
>> mainstream. My idea is to gather our efforts to get consensus on most
>> clean parts of code first and commit them one by one.
>>
>> The proposal is quite simple. We have 4 parties in this conversation
>> (maybe more?): IBM guys, OpenVZ, VServer and Eric Biederman. We
>> discuss the areas which should be considered step by step. Send
>> patches for each area, discuss, come to some agreement and all 4
>> parties Sign-Off the patch. After that it goes to Andrew/Linus.
>> Worth trying?
>
> sounds good to me, as long as we do not consider
> the patches 'final' atm .. because I think we should
> try to test them with _all_ currently existing solutions
> first ... we do not need to bother Andrew with stuff
> which doesn't work for the existing and future 'users'.
>
> so IMHO, we should make a kernel branch (Eric or Sam
> are probably willing to maintain that), which we keep
> in-sync with mainline (not necessarily git, but at
> least snapshot wise), where we put all the patches
> we agree on, and each party should then adjust the
> existing solution to this kernel, so we get some deep
> testing in the process, and everybody can see if it
> 'works' for him or not ...
ACK. A collection of patches that we can all agree
on sounds like something worth aiming for.
It looks like Kirill's last round of patches can form
a nucleus for that. So far I have seen plenty of technical
objections, but no objections to the general direction.
So agreement appears possible.
Eric
* Re: [RFC] Virtualization steps
2006-03-24 19:25 ` Dave Hansen
2006-03-24 19:53 ` Eric W. Biederman
@ 2006-03-28 4:28 ` Bill Davidsen
2006-03-28 5:31 ` Sam Vilain
` (3 more replies)
2006-03-28 20:29 ` [Devel] " Jun OKAJIMA
2 siblings, 4 replies; 125+ messages in thread
From: Bill Davidsen @ 2006-03-28 4:28 UTC (permalink / raw)
To: Dave Hansen
Cc: Kirill Korotaev, Eric W. Biederman, linux-kernel, herbert, devel,
serue, akpm, sam, Alexey Kuznetsov, Pavel Emelianov,
Stanislav Protassov
Dave Hansen wrote:
> On Sat, 2006-03-25 at 04:33 +1100, Nick Piggin wrote:
>> Oh, after you come to an agreement and start posting patches, can you
>> also outline why we want this in the kernel (what it does that low
>> level virtualization doesn't, etc, etc)
>
> Can you wait for an OLS paper? ;)
>
> I'll summarize it this way: low-level virtualization uses resource
> inefficiently.
>
> With this higher-level stuff, you get to share all of the Linux caching,
> and can do things like sharing libraries pretty naturally.
>
> They are also much lighter-weight to create and destroy than full
> virtual machines. We were planning on doing some performance
> comparisons versus some hypervisors like Xen and the ppc64 one to show
> scaling with the number of virtualized instances. Creating 100 of these
> Linux containers is as easy as a couple of shell scripts, but we still
> can't find anybody crazy enough to go create 100 Xen VMs.
But these require a modified O/S, do they not? Or do I read that
incorrectly? Is this going to be real virtualization able to run any O/S?
Frankly I don't see running 100 VMs as a realistic goal; being able to
run Linux, Windows, Solaris and BeOS unmodified in 4-5 VMs would be far
more useful.
>
> Anyway, those are the things that came to my mind first. I'm sure the
> others involved have their own motivations.
>
> -- Dave
>
* Re: [RFC] Virtualization steps
2006-03-28 4:28 ` Bill Davidsen
@ 2006-03-28 5:31 ` Sam Vilain
2006-03-28 6:45 ` [Devel] " Kir Kolyshkin
` (2 subsequent siblings)
3 siblings, 0 replies; 125+ messages in thread
From: Sam Vilain @ 2006-03-28 5:31 UTC (permalink / raw)
To: Bill Davidsen; +Cc: linux-kernel
On Mon, 2006-03-27 at 23:28 -0500, Bill Davidsen wrote:
> Frankly I don't see running 100 VMs as a realistic goal, being able to
> run Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would be far
> more useful.
You misunderstand this approach. It is not about VMs at all. Any VM
approach is the "big hammer" of virtualisation; we are more interested
in a big bag of very precise tools to virtualise one subsystem at a
time.
Sam.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 4:28 ` Bill Davidsen
2006-03-28 5:31 ` Sam Vilain
@ 2006-03-28 6:45 ` Kir Kolyshkin
2006-03-28 21:59 ` Sam Vilain
2006-03-28 8:52 ` Herbert Poetzl
2006-03-28 9:00 ` Kirill Korotaev
3 siblings, 1 reply; 125+ messages in thread
From: Kir Kolyshkin @ 2006-03-28 6:45 UTC (permalink / raw)
To: devel
Cc: Dave Hansen, akpm, linux-kernel, herbert, sam, Eric W. Biederman,
Alexey Kuznetsov, serue
Bill Davidsen wrote:
> Dave Hansen wrote:
>
>> On Sat, 2006-03-25 at 04:33 +1100, Nick Piggin wrote:
>>
>>> Oh, after you come to an agreement and start posting patches, can you
>>> also outline why we want this in the kernel (what it does that low
>>> level virtualization doesn't, etc, etc)
>>
>>
>> Can you wait for an OLS paper? ;)
>>
>> I'll summarize it this way: low-level virtualization uses resource
>> inefficiently.
>>
>> With this higher-level stuff, you get to share all of the Linux caching,
>> and can do things like sharing libraries pretty naturally.
>>
>> They are also much lighter-weight to create and destroy than full
>> virtual machines. We were planning on doing some performance
>> comparisons versus some hypervisors like Xen and the ppc64 one to show
>> scaling with the number of virtualized instances. Creating 100 of these
>> Linux containers is as easy as a couple of shell scripts, but we still
>> can't find anybody crazy enough to go create 100 Xen VMs.
>
>
> But these require a modified O/S, do they not? Or do I read that
> incorrectly? Is this going to be real virtualization able to run any O/S?
This type is called OS-level virtualization, or kernel-level
virtualization, or partitioning. Basically it allows one to create
compartments (in OpenVZ we call them VEs -- Virtual Environments) in
which you can run a full *unmodified* Linux system (except for the kernel
itself -- there is one single kernel common to all compartments). That
means that with this approach you cannot run OSs other than Linux, but
different Linux distributions work just fine.
> Frankly I don't see running 100 VMs as a realistic goal
It is actually not a future goal, but rather a reality. Since OS-level
virtualization overhead is very low (1-2 percent or so), one can run
hundreds of VEs.
Say, on a box with 1GB of RAM OpenVZ [http://openvz.org/] is able to run
about 150 VEs each one having init, apache (serving static content),
sendmail, sshd, cron etc. running. Actually you can run more, but with
aggressive swapping performance drops considerably. So it all
mostly depends on RAM, and I'd say that 500+ VEs on a 4GB box should run
just fine. Of course it all depends on what you run inside those VEs.
> , being able to run Linux, Windows, Solaris and BEOS unmodified in 4-5
> VMs would be far more useful.
This is a different story. If you want to run different OSs on the same
box -- use emulation or paravirtualization.
If you are happy to stick to Linux on this box -- use OS-level
virtualization. Aside from the best possible scalability and
performance, the other benefit of this approach is dynamic resource
management -- since there is a single kernel managing all the resources
such as RAM, you can easily tune all those resources at runtime. What's
more, you can let one VE use more RAM while nobody else is using it,
leading to much better resource usage. And since there is one single
kernel that manages everything, you could do nice tricks like VE
checkpointing, live migration, etc. etc.
Some more info on topic are available from
http://openvz.org/documentation/tech/
Kir.
>>
>> Anyway, those are the things that came to my mind first. I'm sure the
>> others involved have their own motivations.
>>
>> -- Dave
>>
>
* Re: [RFC] Virtualization steps
2006-03-27 18:45 ` Eric W. Biederman
@ 2006-03-28 8:51 ` Kirill Korotaev
2006-03-28 12:53 ` Serge E. Hallyn
` (2 more replies)
0 siblings, 3 replies; 125+ messages in thread
From: Kirill Korotaev @ 2006-03-28 8:51 UTC (permalink / raw)
To: Eric W. Biederman
Cc: haveblue, linux-kernel, devel, serue, akpm, sam, Alexey Kuznetsov,
Pavel Emelianov, Stanislav Protassov
>> so IMHO, we should make a kernel branch (Eric or Sam
>> are probably willing to maintain that), which we keep
>> in-sync with mainline (not necessarily git, but at
>> least snapshot wise), where we put all the patches
>> we agree on, and each party should then adjust the
>> existing solution to this kernel, so we get some deep
>> testing in the process, and everybody can see if it
>> 'works' for him or not ...
>
> ACK. A collection of patches that we can all agree
> on sounds like something worth aiming for.
>
> It looks like Kirill last round of patches can form
> a nucleus for that. So far I have seem plenty of technical
> objects but no objections to the general direction.
yup, I will fix everything and come back with a set of patches for IPC,
so we can select which way is better to do it :)
> So agreement appears possible.
Nice to hear this!
Eric, we have a GIT repo on openvz.org already:
http://git.openvz.org
we will create a separate branch also called -acked, where patches
agreed upon will go.
Thanks,
Kirill
* Re: [RFC] Virtualization steps
2006-03-28 4:28 ` Bill Davidsen
2006-03-28 5:31 ` Sam Vilain
2006-03-28 6:45 ` [Devel] " Kir Kolyshkin
@ 2006-03-28 8:52 ` Herbert Poetzl
2006-03-28 9:00 ` Nick Piggin
2006-03-28 9:00 ` Kirill Korotaev
3 siblings, 1 reply; 125+ messages in thread
From: Herbert Poetzl @ 2006-03-28 8:52 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Linux Kernel ML
On Mon, Mar 27, 2006 at 11:28:12PM -0500, Bill Davidsen wrote:
> Dave Hansen wrote:
> >On Sat, 2006-03-25 at 04:33 +1100, Nick Piggin wrote:
> >>Oh, after you come to an agreement and start posting patches, can you
> >>also outline why we want this in the kernel (what it does that low
> >>level virtualization doesn't, etc, etc)
> >
> >Can you wait for an OLS paper? ;)
> >
> >I'll summarize it this way: low-level virtualization uses resource
> >inefficiently.
> >
> >With this higher-level stuff, you get to share all of the Linux caching,
> >and can do things like sharing libraries pretty naturally.
> >
> >They are also much lighter-weight to create and destroy than full
> >virtual machines. We were planning on doing some performance
> >comparisons versus some hypervisors like Xen and the ppc64 one to show
> >scaling with the number of virtualized instances. Creating 100 of these
> >Linux containers is as easy as a couple of shell scripts, but we still
> >can't find anybody crazy enough to go create 100 Xen VMs.
>
> But these require a modified O/S, do they not? Or do I read that
> incorrectly? Is this going to be real virtualization able to run any
> O/S?
Xen requires slightly modified kernels, while e.g.
Linux-VServer only uses a _single_ kernel for all
virtualized guests ...
> Frankly I don't see running 100 VMs as a realistic goal, being able to
> run Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would be
> far more useful.
well, that largely depends on the 'use' ...
I don't think that vps providers like lycos would be
very happy if they had to multiply the number of
machines they require by 10 or 20 :)
and yes, running 100 and more Linux-VServers on a
single machine _is_ realistic ...
best,
Herbert
> >Anyway, those are the things that came to my mind first. I'm sure the
> >others involved have their own motivations.
> >
> >-- Dave
> >
* Re: [RFC] Virtualization steps
2006-03-28 8:52 ` Herbert Poetzl
@ 2006-03-28 9:00 ` Nick Piggin
2006-03-28 14:26 ` Herbert Poetzl
0 siblings, 1 reply; 125+ messages in thread
From: Nick Piggin @ 2006-03-28 9:00 UTC (permalink / raw)
To: Herbert Poetzl; +Cc: Bill Davidsen, Linux Kernel ML
Herbert Poetzl wrote:
> well, that largely depends on the 'use' ...
>
> I don't think that vps providers like lycos would be
> very happy if they had to multiply the ammount of
> machines they require by 10 or 20 :)
>
> and yes, running 100 and more Linux-VServers on a
> single machine _is_ realistic ...
>
Yep.
And if it is intrusive to the core kernel, then as always we have
to try to evaluate the question "is it worth it"? How many people
want it and what alternatives do they have (eg. maintaining
separate patches, using another approach), what are the costs,
complexities, to other users and developers etc.
--
SUSE Labs, Novell Inc.
* Re: [RFC] Virtualization steps
2006-03-28 4:28 ` Bill Davidsen
` (2 preceding siblings ...)
2006-03-28 8:52 ` Herbert Poetzl
@ 2006-03-28 9:00 ` Kirill Korotaev
2006-03-28 14:41 ` Bill Davidsen
3 siblings, 1 reply; 125+ messages in thread
From: Kirill Korotaev @ 2006-03-28 9:00 UTC (permalink / raw)
To: Bill Davidsen
Cc: Dave Hansen, Eric W. Biederman, linux-kernel, herbert, devel,
serue, akpm, sam, Alexey Kuznetsov, Pavel Emelianov,
Stanislav Protassov
> Frankly I don't see running 100 VMs as a realistic goal, being able to
> run Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would be far
> more useful.
It is more than realistic. Hosting companies run more than 100 VPSs in
reality. There are also other useful scenarios. For example, I know of
universities which run a VPS for every faculty web site, every
department, mail server and so on. Why do you think they would want to
run only 5 VMs on one machine? Much more!
Thanks,
Kirill
* Re: [RFC] Virtualization steps
2006-03-24 17:33 ` Nick Piggin
2006-03-24 19:25 ` Dave Hansen
@ 2006-03-28 9:02 ` Kirill Korotaev
2006-03-28 9:15 ` Nick Piggin
2006-03-28 15:48 ` [Devel] " Matt Ayres
1 sibling, 2 replies; 125+ messages in thread
From: Kirill Korotaev @ 2006-03-28 9:02 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, haveblue, linux-kernel, herbert, devel, serue,
akpm, sam, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
> Oh, after you come to an agreement and start posting patches, can you
> also outline why we want this in the kernel (what it does that low
> level virtualization doesn't, etc, etc), and how and why you've agreed
> to implement it. Basically, some background and a summary of your
> discussions for those who can't follow everything. Or is that a faq
> item?
Nick, I will be glad to shed some light on it.
First of all, here is what it does that low-level virtualization can't:
- it allows running 100 containers on 1GB of RAM
(these are called containers, VEs - Virtual Environments,
or VPSs - Virtual Private Servers).
- it has very little overhead (1-2%), unlike the overhead which is
unavoidable with hardware virtualization. For example, Xen has >20%
overhead on disk I/O.
- it allows creating/deploying a VE in less than a minute; VE start/stop
takes ~1-2 seconds.
- it allows dynamically changing all resource limits/configurations.
In OpenVZ it is even possible to add/remove virtual CPUs to/from a VE.
It is possible to increase/decrease memory limits on the fly, etc.
- it has much more efficient memory usage, with a single template file
in the cache if a COW-like filesystem is used for VE templates.
- it allows you to access VE files from the host easily if needed.
This makes management much more flexible, e.g. you can
upgrade/repair/fix all your VEs from the host, i.e. easy mass management.
OS kernel virtualization
~~~~~~~~~~~~~~~~~~~~~~~~
OS virtualization is a kernel-level solution which replaces the usage
of many global variables with context-dependent counterparts. This
allows isolated private resources to exist in different contexts.
So a VE is essentially a context plus the set of its variables/settings,
which include, but are not limited to, its own process tree, files, IPC
resources, IP routing, network devices and such.
A full virtualization solution consists of:
- virtualization of resources, i.e. private contexts
- resource controls, for limiting contexts
- management tools
Such kind of virtualization solution is implemented in OpenVZ
(http://openvz.org) and Linux-Vserver (http://linux-vserver.org) projects.
Summary of previous discussions on LKML
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- we agreed upon doing virtualization of each kernel subsystem
separately, not as a single virtual environment.
- we almost agreed upon calling virtualization of subsystems
"namespaces".
- we were discussing whether we should have a global namespace context,
like 'current', or pass the context as an argument to all functions
which require it.
- we didn't agree on whether we need a config option and the ability to
compile the kernel w/o virtual namespaces.
Thanks,
Kirill
* Re: [RFC] Virtualization steps
2006-03-28 9:02 ` Kirill Korotaev
@ 2006-03-28 9:15 ` Nick Piggin
2006-03-28 15:35 ` Herbert Poetzl
` (3 more replies)
2006-03-28 15:48 ` [Devel] " Matt Ayres
1 sibling, 4 replies; 125+ messages in thread
From: Nick Piggin @ 2006-03-28 9:15 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Eric W. Biederman, haveblue, linux-kernel, herbert, devel, serue,
akpm, sam, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Kirill Korotaev wrote:
>
> Nick, will be glad to shed some light on it.
>
Thanks very much Kirill.
I don't think I'm qualified to make any decisions about this,
so I don't want to detract from the real discussions, but I
just had a couple more questions:
> First of all, what it does which low level virtualization can't:
> - it allows running 100 containers on 1GB RAM
> (it is called containers, VE - Virtual Environments,
> VPS - Virtual Private Servers).
> - it has very little overhead (<1-2%), unlike hardware
> virtualization, where overhead is unavoidable. For example, Xen has >20% overhead on disk I/O.
Are any future hardware solutions likely to improve these problems?
>
> OS kernel virtualization
> ~~~~~~~~~~~~~~~~~~~~~~~~
Is this considered secure enough that multiple untrusted VEs can be run
on production systems?
What kind of users want this, who can't use alternatives like real
VMs?
> Summary of previous discussions on LKML
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Have there been any discussions between the groups pushing this
virtualization, and important kernel developers who are not part of
a virtualization effort? I.e. is there any consensus about the
future of these patches?
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [RFC] Virtualization steps
2006-03-28 8:51 ` Kirill Korotaev
@ 2006-03-28 12:53 ` Serge E. Hallyn
2006-03-28 22:51 ` Sam Vilain
2006-03-29 20:30 ` Dave Hansen
2 siblings, 0 replies; 125+ messages in thread
From: Serge E. Hallyn @ 2006-03-28 12:53 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Eric W. Biederman, haveblue, linux-kernel, devel, serue, akpm,
sam, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Quoting Kirill Korotaev (dev@sw.ru):
> >>so IMHO, we should make a kernel branch (Eric or Sam
> >>are probably willing to maintain that), which we keep
> >>in-sync with mainline (not necessarily git, but at
> >>least snapshot wise), where we put all the patches
> >>we agree on, and each party should then adjust the
> >>existing solution to this kernel, so we get some deep
> >>testing in the process, and everybody can see if it
> >>'works' for him or not ...
> >
> >ACK. A collection of patches that we can all agree
> >on sounds like something worth aiming for.
> >
> >It looks like Kirill's last round of patches can form
> >a nucleus for that. So far I have seen plenty of technical
> >objections, but no objections to the general direction.
> yup, I will fix everything and will come back with a set of patches for IPC,
> so we could select which way is better to do it :)
>
> >So agreement appears possible.
> Nice to hear this!
>
> Eric, we have a GIT repo on openvz.org already:
> http://git.openvz.org
>
> we will create a separate branch also called -acked, where patches
> agreed upon will go.
That's ok by me. If a more neutral name/site were preferred, we could
use the sf.net site we had finally gotten around to setting up -
www.sf.net/projects/lxc (LinuX Containers). Unfortunately that would
likely be just a quilt patch repository.
A wiki + git repository would be ideal.
-serge
* Re: [RFC] Virtualization steps
2006-03-28 9:00 ` Nick Piggin
@ 2006-03-28 14:26 ` Herbert Poetzl
2006-03-28 14:44 ` Nick Piggin
0 siblings, 1 reply; 125+ messages in thread
From: Herbert Poetzl @ 2006-03-28 14:26 UTC (permalink / raw)
To: Nick Piggin; +Cc: Bill Davidsen, Linux Kernel ML
On Tue, Mar 28, 2006 at 07:00:25PM +1000, Nick Piggin wrote:
> Herbert Poetzl wrote:
>
> >well, that largely depends on the 'use' ...
> >
> >I don't think that vps providers like lycos would be
> >very happy if they had to multiply the amount of
> >machines they require by 10 or 20 :)
> >
> >and yes, running 100 and more Linux-VServers on a
> >single machine _is_ realistic ...
> >
>
> Yep.
>
> And if it is intrusive to the core kernel, then as always we have to
> try to evaluate the question "is it worth it"? How many people want it
> and what alternatives do they have (e.g. maintaining separate patches,
> using another approach), what are the costs, complexities, to other
> users and developers etc.
my words, but let me ask, what do you consider 'intrusive'?
best,
Herbert
* Re: [RFC] Virtualization steps
2006-03-28 9:00 ` Kirill Korotaev
@ 2006-03-28 14:41 ` Bill Davidsen
2006-03-28 15:03 ` Eric W. Biederman
2006-03-28 23:07 ` Sam Vilain
0 siblings, 2 replies; 125+ messages in thread
From: Bill Davidsen @ 2006-03-28 14:41 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Dave Hansen, Eric W. Biederman, linux-kernel, herbert, devel,
serue, akpm, sam, Alexey Kuznetsov, Pavel Emelianov,
Stanislav Protassov
Kirill Korotaev wrote:
>> Frankly I don't see running 100 VMs as a realistic goal, being able
>> to run Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would
>> be far more useful.
>
> It is more than realistic. Hosting companies run more than 100 VPSs in
> reality. There are also other useful scenarios. For example, I know of
> universities which run a VPS for every faculty web site, for every
> department, mail server and so on. Why do you think they want to run
> only 5 VMs on one machine? Much more!
I made no comment on what "they" might want; I want to make the rack of
underutilized Windows, BSD and Solaris servers go away. An approach
which doesn't support unmodified guest installs doesn't solve any of my
current problems. I didn't say it was in any way not useful, just not of
interest to me. What needs I have for Linux environments are answered by
jails and/or UML.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: [RFC] Virtualization steps
2006-03-28 14:26 ` Herbert Poetzl
@ 2006-03-28 14:44 ` Nick Piggin
2006-03-29 6:05 ` Eric W. Biederman
0 siblings, 1 reply; 125+ messages in thread
From: Nick Piggin @ 2006-03-28 14:44 UTC (permalink / raw)
To: Herbert Poetzl; +Cc: Bill Davidsen, Linux Kernel ML
Herbert Poetzl wrote:
> On Tue, Mar 28, 2006 at 07:00:25PM +1000, Nick Piggin wrote:
>>And if it is intrusive to the core kernel, then as always we have to
>>try to evaluate the question "is it worth it"? How many people want it
>>and what alternatives do they have (e.g. maintaining separate patches,
>>using another approach), what are the costs, complexities, to other
>>users and developers etc.
>
>
> my words, but let me ask, what do you consider 'intrusive'?
>
I don't think I could give a complete answer...
I guess it could be stated as the increase in the complexity of
the rest of the code for someone who doesn't know anything about
the virtualization implementation.
Completely non-intrusive is something like 2 extra function calls
to/from generic code, changes to data structures are transparent
(or have simple wrappers), and there is no shared locking or data
with the rest of the kernel. And it goes up from there.
Anyway I'm far from qualified... I just hope that with all the
work you guys are putting in that you'll be able to justify it ;)
--
SUSE Labs, Novell Inc.
* Re: [RFC] Virtualization steps
2006-03-28 14:41 ` Bill Davidsen
@ 2006-03-28 15:03 ` Eric W. Biederman
2006-03-28 17:48 ` Jeff Dike
2006-03-28 23:07 ` Sam Vilain
1 sibling, 1 reply; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-28 15:03 UTC (permalink / raw)
To: Bill Davidsen
Cc: Kirill Korotaev, Dave Hansen, linux-kernel, herbert, devel, serue,
akpm, sam, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Bill Davidsen <davidsen@tmr.com> writes:
> Kirill Korotaev wrote:
>
>>> Frankly I don't see running 100 VMs as a realistic goal, being able to run
>>> Linux, Windows, Solaris and BEOS unmodified in 4-5 VMs would be far more
>>> useful.
>>
>> It is more than realistic. Hosting companies run more than 100 VPSs in
>> reality. There are also other useful scenarios. For example, I know of
>> universities which run a VPS for every faculty web site, for every department,
>> mail server and so on. Why do you think they want to run only 5 VMs on one
>> machine? Much more!
>
> I made no comment on what "they" might want; I want to make the rack of
> underutilized Windows, BSD and Solaris servers go away. An approach which
> doesn't support unmodified guest installs doesn't solve any of my current
> problems. I didn't say it was in any way not useful, just not of interest to
> me. What needs I have for Linux environments are answered by jails and/or UML.
So from one perspective that is what we are building: a full featured
jail capable of running an unmodified linux distro. The cost is
simply making a way to use the same names twice for the global
namespaces. UML may use these features to accelerate its own processes.
Virtualization is really the wrong word to describe what we are building, as
it allows for all kinds of heavyweight implementations and has an association
with much heavier things.
At the extreme end, where you only have one process in each logical instance
of the kernel, a better name would be a heavyweight process, where each
such process sees an environment as if it owned the entire machine.
Eric
* Re: [RFC] Virtualization steps
2006-03-28 9:15 ` Nick Piggin
@ 2006-03-28 15:35 ` Herbert Poetzl
2006-03-28 15:53 ` Nick Piggin
` (2 more replies)
2006-03-28 16:15 ` Eric W. Biederman
` (2 subsequent siblings)
3 siblings, 3 replies; 125+ messages in thread
From: Herbert Poetzl @ 2006-03-28 15:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Kirill Korotaev, Eric W. Biederman, haveblue, linux-kernel, devel,
serue, akpm, sam, Alexey Kuznetsov, Pavel Emelianov,
Stanislav Protassov
On Tue, Mar 28, 2006 at 07:15:17PM +1000, Nick Piggin wrote:
> Kirill Korotaev wrote:
> >
> >Nick, will be glad to shed some light on it.
> >
>
> Thanks very much Kirill.
>
> I don't think I'm qualified to make any decisions about this,
> so I don't want to detract from the real discussions, but I
> just had a couple more questions:
>
> >First of all, what it does which low level virtualization can't:
> >- it allows running 100 containers on 1GB RAM
> > (it is called containers, VE - Virtual Environments,
> > VPS - Virtual Private Servers).
> >- it has very little overhead (<1-2%), unlike hardware
> > virtualization, where overhead is unavoidable. For example, Xen has >20% overhead on disk I/O.
>
> Are any future hardware solutions likely to improve these problems?
not really, but as you know, "640K ought to be enough
for anybody", so maybe future hardware developments will
make shared resources possible (with different kernels)
> >OS kernel virtualization
> >~~~~~~~~~~~~~~~~~~~~~~~~
>
> Is this considered secure enough that multiple untrusted VEs are run
> on production systems?
definitely! there are many, many, hosting providers
using exactly this technology to provide Virtual Private
Servers for their customers, of course, in production
> What kind of users want this, who can't use alternatives like real
> VMs?
well, the same users who do not want to use Bochs for
emulating a PC on a PC, when they can use UML for example,
because it's much faster and easier to use ...
aside from that, Linux-VServer, for example, is not only
designed to create complete virtual servers; it also
works for service separation and increasing security for
many applications, for example:
- test environments (one guest per distro)
- service separation (one service per 'container')
- resource management and accounting
> >Summary of previous discussions on LKML
> >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Have there been any discussions between the groups pushing this
> virtualization, and ...
yes, the discussions are ongoing ... maybe to clarify the
situation for the folks not involved (projects in
alphabetical order):
FreeVPS (Free Virtual Private Server Solution):
===============================================
[http://www.freevps.com/]
not pushing for inclusion; an early Linux-VServer
spinoff, partially maintained, but they seem to have
other interests lately
Alex Lyashkov (FreeVPS kernel maintainer)
[Positive Software Corporation http://www.freevps.com/]
BSD Jail LSM (Linux-Jails security module):
===========================================
[http://kerneltrap.org/node/3823]
Serge E. Hallyn (Patch/Module maintainer) [IBM]
interested in some kind of mainline solution
Dave Hansen (IBM Linux Technology Center)
interested in virtualization for context/container
migration
Linux-VServer (community project, maintained):
==============================================
[http://linux-vserver.org/]
Jacques Gelinas (previous VServer maintainer)
not pushing for inclusion
Herbert Poetzl (Linux-VServer kernel maintainer)
not pushing for inclusion, but I want to make damn
sure that no bloat gets into the kernel
and that the mainline efforts will be usable for
Linux-VServer and similar ...
Sam Vilain (Refactoring Linux-VServer patches)
[Catalyst http://catalyst.net.nz/]
trying hard to provide a simple/minimalistic version
of Linux-VServer for mainline
many others, not really pushing anything here :)
OpenVZ (open project, maintained, subset of Virtuozzo(tm)):
===========================================================
[http://openvz.org/]
Kir Kolyshkin (OpenVZ maintainer):
[SWsoft http://www.swsoft.com I guess?]
maybe pushing for inclusion ...
Kirill Korotaev (OpenVZ/Virtuozzo kernel developer?)
[SWsoft http://www.swsoft.com]
heavily pushing for inclusion ...
Alexey Kuznetsov (Chief Software Engineer)
[SWsoft http://www.swsoft.com]
not pushing but supporting company interests
PID Virtualization (kernel branch for inclusion):
=================================================
Eric W. Biederman (branch developer/maintainer)
[XMission http://xmission.com/]
Virtuozzo(tm) (Commercial solution from SWsoft):
================================================
[http://www.virtuozzo.com/]
not involved yet, except via OpenVZ
Stanislav Protassov (Director of Engineering)
[SWsoft http://www.swsoft.com]
A ton of IBM and VZ folks are not listed here, but I
guess you can figure who is who from the email addresses
there are also a bunch of folks from Columbia and
Princeton universities interested and/or involved in
kernel level virtualization and context migration.
please extend this list where appropriate, I'm pretty
sure I forgot at least five important/involved persons
> important kernel developers who are not part of a virtualization
> effort?
no idea, probably none for now ...
> Ie. is there any consensus about the future of these patches?
what patches? what future?
HTC,
Herbert
> Thanks,
> Nick
>
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 9:02 ` Kirill Korotaev
2006-03-28 9:15 ` Nick Piggin
@ 2006-03-28 15:48 ` Matt Ayres
2006-03-28 16:42 ` Eric W. Biederman
2006-03-29 0:55 ` Kirill Korotaev
1 sibling, 2 replies; 125+ messages in thread
From: Matt Ayres @ 2006-03-28 15:48 UTC (permalink / raw)
To: devel
Cc: Nick Piggin, akpm, linux-kernel, herbert, sam, Eric W. Biederman,
Alexey Kuznetsov, serue, xen-devel@lists.xensource.com
Kirill Korotaev wrote:
>> Oh, after you come to an agreement and start posting patches, can you
>> also outline why we want this in the kernel (what it does that low
>> level virtualization doesn't, etc, etc), and how and why you've agreed
>> to implement it. Basically, some background and a summary of your
>> discussions for those who can't follow everything. Or is that a faq
>> item?
> Nick, will be glad to shed some light on it.
>
> First of all, what it does which low level virtualization can't:
> - it allows to run 100 containers on 1GB RAM
> (it is called containers, VE - Virtual Environments,
> VPS - Virtual Private Servers).
> - it has no much overhead (<1-2%), which is unavoidable with hardware
> virtualization. For example, Xen has >20% overhead on disk I/O.
I think the Xen guys would disagree with you on this. Xen claims <3%
overhead on the XenSource site.
Where did you get these figures from? What Xen version did you test?
What was your configuration? Did you have kernel debugging enabled? You
can't just post numbers without the data to back them up, especially when
they conflict greatly with the Xen developers' statements. AFAIK Xen is
well on its way to inclusion into the mainstream kernel.
Thank you,
Matt Ayres
* Re: [RFC] Virtualization steps
2006-03-28 15:35 ` Herbert Poetzl
@ 2006-03-28 15:53 ` Nick Piggin
2006-03-28 16:31 ` Eric W. Biederman
2006-03-29 21:37 ` Bill Davidsen
2 siblings, 0 replies; 125+ messages in thread
From: Nick Piggin @ 2006-03-28 15:53 UTC (permalink / raw)
To: Herbert Poetzl
Cc: Kirill Korotaev, Eric W. Biederman, haveblue, linux-kernel, devel,
serue, akpm, sam, Alexey Kuznetsov, Pavel Emelianov,
Stanislav Protassov
Herbert Poetzl wrote:
> On Tue, Mar 28, 2006 at 07:15:17PM +1000, Nick Piggin wrote:
[...]
Thanks for the clarifications, Herbert.
>>Ie. is there any consensus about the future of these patches?
>
>
> what patches?
The ones being thrown around lkml, and future ones being talked about.
Patches ~= changes to the kernel.
> what future?
I presume everyone's goal is to get something into the kernel?
--
SUSE Labs, Novell Inc.
* Re: [RFC] Virtualization steps
2006-03-28 9:15 ` Nick Piggin
2006-03-28 15:35 ` Herbert Poetzl
@ 2006-03-28 16:15 ` Eric W. Biederman
2006-03-28 23:04 ` Sam Vilain
2006-03-29 1:39 ` Kirill Korotaev
3 siblings, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-28 16:15 UTC (permalink / raw)
To: Nick Piggin
Cc: Kirill Korotaev, haveblue, linux-kernel, herbert, devel, serue,
akpm, sam, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Nick Piggin <nickpiggin@yahoo.com.au> writes:
> Kirill Korotaev wrote:
>> Nick, will be glad to shed some light on it.
>>
>
> Thanks very much Kirill.
>
> I don't think I'm qualified to make any decisions about this,
> so I don't want to detract from the real discussions, but I
> just had a couple more questions:
>
>> First of all, what it does which low level virtualization can't:
>> - it allows running 100 containers on 1GB RAM
>> (it is called containers, VE - Virtual Environments,
>> VPS - Virtual Private Servers).
>> - it has very little overhead (<1-2%), unlike hardware
>> virtualization, where overhead is unavoidable. For example, Xen has >20% overhead on disk I/O.
>
> Are any future hardware solutions likely to improve these problems?
This isn't a direct competition; both solutions coexist nicely.
The major efficiency differences are fundamental to the approaches and
can only be solved in software, not hardware. The fundamental efficiency
limits of low level virtualization are that it does not share resources between
instances well (think how hard memory hotplug is to solve), the fact
that running a kernel takes at least 1MB for just the kernel, and the
fact that no matter how good your hypervisor is, there will be some
hardware interface it doesn't virtualize.
Whereas what we are aiming at are just enough modifications to the kernel
to allow multiple instances of user space. We aren't virtualizing anything
that isn't already virtualized in the kernel.
>> OS kernel virtualization
>> ~~~~~~~~~~~~~~~~~~~~~~~~
>
> Is this considered secure enough that multiple untrusted VEs are run
> on production systems?
Kirill or Herbert can give a better answer, but that is one of the major
points of BSD jails and their kin, is it not?
> What kind of users want this, who can't use alternatives like real
> VMs?
Well that question assumes a lot. The answer that assumes a lot
in the other direction is that adding additional unnecessary layers
just complicates the problem and slows things down for no reason,
while making it so you can't assume the solution is always present.
In addition it does so in a non-portable way, so it is only available
on a few platforms.
I can't even think of a straight answer to the users question.
My users are in the high performance computing realm, and for that
subset it is easy. Xen and its kin don't virtualize the high
bandwidth low latency communication hardware that is used, and that
may not even be possible. Using a hypervisor in a situation like that
certainly isn't general or easily maintainable. (Think about
what a challenge it has been to get usable infiniband drivers merged).
>> Summary of previous discussions on LKML
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Have their been any discussions between the groups pushing this
> virtualization, and important kernel developers who are not part of
> a virtualization effort? Ie. is there any consensus about the
> future of these patches?
Yes, but just enough to give us hope :)
Unless you count the mount namespace as part of this in which case
pieces are already merged.
The challenge is that writing kernel code that does this is
easy. Writing kernel code that is mergeable and that the different
groups all agree meets their requirements is much harder. It has
taken us until now to reach a basic approach that we all agree on.
Now we get to beat each other up over the technical details :)
Eric
* Re: [RFC] Virtualization steps
2006-03-28 15:35 ` Herbert Poetzl
2006-03-28 15:53 ` Nick Piggin
@ 2006-03-28 16:31 ` Eric W. Biederman
2006-03-29 21:37 ` Bill Davidsen
2 siblings, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-28 16:31 UTC (permalink / raw)
To: Nick Piggin
Cc: Kirill Korotaev, haveblue, linux-kernel, devel, serue, akpm, sam,
Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Herbert Poetzl <herbert@13thfloor.at> writes:
> PID Virtualization (kernel branch for inclusion):
> =================================================
>
> Eric W. Biederman (branch developer/maintainer)
> [XMission http://xmission.com/]
Actually I work for Linux Networx http://www.lnxi.com
XMission is just my ISP. I find it easier to work from
home. :)
Eric
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 15:48 ` [Devel] " Matt Ayres
@ 2006-03-28 16:42 ` Eric W. Biederman
2006-03-28 17:04 ` Matt Ayres
2006-03-29 0:55 ` Kirill Korotaev
1 sibling, 1 reply; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-28 16:42 UTC (permalink / raw)
To: Matt Ayres
Cc: devel, Nick Piggin, akpm, linux-kernel, herbert, sam,
Alexey Kuznetsov, serue, xen-devel@lists.xensource.com
Matt Ayres <matta@tektonic.net> writes:
> I think the Xen guys would disagree with you on this. Xen claims <3% overhead
> on the XenSource site.
>
> Where did you get these figures from? What Xen version did you test? What was
> your configuration? Did you have kernel debugging enabled? You can't just post
> numbers without the data to back them up, especially when they conflict greatly
> with the Xen developers' statements. AFAIK Xen is well on its way to inclusion
> into the mainstream kernel.
It doesn't matter. The proof that Xen has more overhead is trivial:
Xen does more, and Xen clients don't share resources well.
Nor is this about Xen vs what we are doing. These are different,
non-conflicting approaches that operate in completely different
ways and solve different sets of problems.
Xen is about multiple kernels.
The alternative is a souped-up chroot.
Eric
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 16:42 ` Eric W. Biederman
@ 2006-03-28 17:04 ` Matt Ayres
0 siblings, 0 replies; 125+ messages in thread
From: Matt Ayres @ 2006-03-28 17:04 UTC (permalink / raw)
To: Eric W. Biederman
Cc: devel, Nick Piggin, akpm, linux-kernel, herbert, sam,
Alexey Kuznetsov, serue, xen-devel@lists.xensource.com
Eric W. Biederman wrote:
> Matt Ayres <matta@tektonic.net> writes:
>
>> I think the Xen guys would disagree with you on this. Xen claims <3% overhead
>> on the XenSource site.
>>
>> Where did you get these figures from? What Xen version did you test? What was
>> your configuration? Did you have kernel debugging enabled? You can't just post
>> numbers without the data to back them up, especially when they conflict greatly
>> with the Xen developers' statements. AFAIK Xen is well on its way to inclusion
>> into the mainstream kernel.
>
> It doesn't matter. The proof that Xen has more overhead is trivial:
> Xen does more, and Xen clients don't share resources well.
>
I understand the difference. It was more about Kirill grabbing numbers
out of the air. I actually think containers and Xen complement each
other very well. As Xen is now based on 2.6.16 (as are both VServer and
OVZ), it makes sense in some scenarios to run a few Xen domains that
then in turn run containers. As for the last part, Xen doesn't
share resources at all :)
Thank you,
Matt Ayres
* Re: [RFC] Virtualization steps
2006-03-28 15:03 ` Eric W. Biederman
@ 2006-03-28 17:48 ` Jeff Dike
0 siblings, 0 replies; 125+ messages in thread
From: Jeff Dike @ 2006-03-28 17:48 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Bill Davidsen, Kirill Korotaev, Dave Hansen, linux-kernel,
herbert, devel, serue, akpm, sam, Alexey Kuznetsov,
Pavel Emelianov, Stanislav Protassov
On Tue, Mar 28, 2006 at 08:03:34AM -0700, Eric W. Biederman wrote:
> UML may use these features to accelerate it's own processes.
And I'm planning on doing exactly that.
Jeff
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-24 19:25 ` Dave Hansen
2006-03-24 19:53 ` Eric W. Biederman
2006-03-28 4:28 ` Bill Davidsen
@ 2006-03-28 20:29 ` Jun OKAJIMA
2006-03-28 20:50 ` Kir Kolyshkin
2 siblings, 1 reply; 125+ messages in thread
From: Jun OKAJIMA @ 2006-03-28 20:29 UTC (permalink / raw)
To: devel
Cc: Nick Piggin, akpm, linux-kernel, herbert, sam, Eric W. Biederman,
Alexey Kuznetsov, serue
>
>I'll summarize it this way: low-level virtualization uses resources
>inefficiently.
>
>With this higher-level stuff, you get to share all of the Linux caching,
>and can do things like sharing libraries pretty naturally.
>
>They are also much lighter-weight to create and destroy than full
>virtual machines. We were planning on doing some performance
>comparisons versus some hypervisors like Xen and the ppc64 one to show
>scaling with the number of virtualized instances. Creating 100 of these
>Linux containers is as easy as a couple of shell scripts, but we still
>can't find anybody crazy enough to go create 100 Xen VMs.
>
>Anyway, those are the things that came to my mind first. I'm sure the
>others involved have their own motivations.
>
Some questions.
1. Your point is right in some ways, and I agree with you.
Yes, I currently think jails are more practical than Xen.
Xen sounds cool, but is it really practical? I have some doubts.
But maybe that is a narrow view.
How do you estimate future improvements of memory sharing
on VMs (e.g. Xen/VMware)?
I have seen that there are many papers about this issue.
If memory sharing ever gets much more efficient, Xen possibly wins.
2. Folks, what do you think about the other good points of Xen,
like live migration, running Solaris, suspend/resume, and so on?
No Linux jail has such features for now, although I don't think
they are impossible with jails.
My current suggestion is:
1. Don't use Xen for running multiple VMs.
2. Use Xen for better admin/operation/deploy... tools.
3. If you need multiple VMs, use jails on Xen.
--- Okajima, Jun. Tokyo, Japan.
http://www.digitalinfra.co.jp/
http://www.colinux.org/
http://www.machboot.com/
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 20:29 ` [Devel] " Jun OKAJIMA
@ 2006-03-28 20:50 ` Kir Kolyshkin
2006-03-28 21:38 ` Jun OKAJIMA
2006-04-03 16:47 ` Bill Davidsen
0 siblings, 2 replies; 125+ messages in thread
From: Kir Kolyshkin @ 2006-03-28 20:50 UTC (permalink / raw)
To: devel
Cc: akpm, Nick Piggin, sam, linux-kernel, Eric W. Biederman, serue,
Alexey Kuznetsov, herbert
Jun OKAJIMA wrote:
>>I'll summarize it this way: low-level virtualization uses resources
>>inefficiently.
>>
>>With this higher-level stuff, you get to share all of the Linux caching,
>>and can do things like sharing libraries pretty naturally.
>>
>>They are also much lighter-weight to create and destroy than full
>>virtual machines. We were planning on doing some performance
>>comparisons versus some hypervisors like Xen and the ppc64 one to show
>>scaling with the number of virtualized instances. Creating 100 of these
>>Linux containers is as easy as a couple of shell scripts, but we still
>>can't find anybody crazy enough to go create 100 Xen VMs.
>>
>>Anyway, those are the things that came to my mind first. I'm sure the
>>others involved have their own motivations.
>>
>>
>>
>
>Some questions.
>
>1. Your point is right in some ways, and I agree with you.
> Yes, I currently think jails are more practical than Xen.
> Xen sounds cool, but is it really practical? I have some doubts.
> But maybe that is a narrow view.
> How do you estimate future improvements of memory sharing
> on VMs (e.g. Xen/VMware)?
> I have seen that there are many papers about this issue.
> If memory sharing ever gets much more efficient, Xen possibly wins.
>
>
This is not just about memory sharing. Dynamic resource management is
hardly possible in a model where you have multiple kernels running; all
of those kernels were designed to run on dedicated hardware. As was
pointed out, adding/removing memory from a Xen guest during runtime is
tricky.
Finally, the multiple-kernels-on-top-of-hypervisor architecture is just more
complex and has more overhead than one-kernel-with-many-namespaces.
>2. Folks, what do you think about the other good points of Xen,
> like live migration, running Solaris, suspend/resume, and so on?
>
>
OpenVZ will have live zero downtime migration and suspend/resume some
time next month.
> No Linux jail has such features for now, although I don't think
> they are impossible with jails.
>
>
>My current suggestion is:
>
>1. Don't use Xen for running multiple VMs.
>2. Use Xen for better admin/operation/deploy... tools.
>
>
This point is controversial. Tools are tools -- they can be made to
support Xen, Linux-VServer, UML, OpenVZ, VMware -- or even all of them!
But anyway, speaking of tools and better admin operations, what does it take
to create a Xen domain (I mean create all those files needed to run a
new Xen domain), and how much time does it take? Say, in OpenVZ creation of
a VE (Virtual Environment) is a matter of unpacking a ~100MB tarball and
copying a 1K config file -- which essentially means one can create a VE in
a minute. Linux-VServer should be pretty much the same.
Another concern is, yes, manageability. In the OpenVZ model the host system
can easily access all the VPSs' files, making, say, a mass software
update a reality. You can change some settings in 100+ VEs very
easily. In systems based on Xen and, say, VMware one must log in to
each system, one by one, to administer them, which is not unlike the
'separate physical server' model.
>3. If you need multiple VMs, use jail on Xen.
>
>
Indeed, a mixed approach is very interesting. You can run OpenVZ or
Linux-VServer in a Xen domain, that makes a lot of sense.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 20:50 ` Kir Kolyshkin
@ 2006-03-28 21:38 ` Jun OKAJIMA
2006-03-28 21:51 ` Eric W. Biederman
2006-04-03 16:47 ` Bill Davidsen
1 sibling, 1 reply; 125+ messages in thread
From: Jun OKAJIMA @ 2006-03-28 21:38 UTC (permalink / raw)
To: devel
Cc: akpm, Nick Piggin, sam, linux-kernel, Eric W. Biederman,
Alexey Kuznetsov, serue, herbert
>
>>2. Folks, what do you think about other good points of Xen,
>> like live migration, or runs solaris, or has suspend/resume or...
>>
>>
>OpenVZ will have live zero downtime migration and suspend/resume some
>time next month.
>
COOL!!!!
>>
>>1. Don't use Xen for running multiple VMs.
>>2. Use Xen for better admin/operation/deploy... tools.
>>
>>
>This point is controversial. Tools are tools -- they can be made to
>support Xen, Linux-VServer, UML, OpenVZ, VMware -- or even all of them!
>
>But anyway, speaking of tools and better admin operations, what does it take
>to create a Xen domain (I mean create all those files needed to run a
>new Xen domain), and how much time does it take? Say, in OpenVZ creating
>a VE (Virtual Environment) is a matter of unpacking a ~100MB tarball and
>copying a 1K config file -- which essentially means one can create a VE in
>a minute. Linux-VServer should be pretty much the same.
>
>Another concern is, yes, manageability. In the OpenVZ model the host system
>can easily access all the VPSs' files, making, say, a mass software
>update a reality. You can change settings in 100+ VEs very
>easily. In systems based on Xen or, say, VMware one has to log in to
>each system, one by one, to administer them, which is not unlike the
>'separate physical server' model.
>
>>3. If you need multiple VMs, use jail on Xen.
>>
>>
>Indeed, a mixed approach is very interesting. You can run OpenVZ or
>Linux-VServer in a Xen domain, that makes a lot of sense.
>
>
Sorry for the misunderstanding.
What I wanted to say with "2" (use Xen as a tool) is probably the same as
what you are guessing now.
I mean, you make a server like this:
1. Install a jailed Linux (OpenVZ/VServer/...) on Xen.
2. Make only one domU, and many VMs on this domU with jail.
3. Run many (more than 100 or so) VMs with jail, not with Xen.
4. But when, for example, you want to migrate to another PC,
use Xen live migration.
The fourth point would make administration tasks easier. This is what
I meant by "better tools".
There is another use of Xen as an admin tool. For example, if you need a
device driver (e.g. a new iSCSI H/W driver or gigabit Ethernet) from the
2.6 kernel, but don't need any other 2.6 functionality, keep the guest OS
(domU) at 2.4 and make dom0 a 2.6 Xen kernel. This also helps with admin
tasks.
Probably the biggest problem for now is that the Xen patch conflicts with
the VServer/OpenVZ patch.
--- Okajima, Jun. Tokyo, Japan.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 21:38 ` Jun OKAJIMA
@ 2006-03-28 21:51 ` Eric W. Biederman
2006-03-28 23:18 ` Sam Vilain
0 siblings, 1 reply; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-28 21:51 UTC (permalink / raw)
To: Jun OKAJIMA
Cc: devel, akpm, Nick Piggin, sam, linux-kernel, Alexey Kuznetsov,
serue, herbert
Jun OKAJIMA <okajima@digitalinfra.co.jp> writes:
> Probably the biggest problem for now is that the Xen patch conflicts with
> the VServer/OpenVZ patch.
The implementations are significantly different enough that I don't
see Xen and any jail patch really conflicting. There might be some
trivial conflicts in /proc but even that seems unlikely.
Eric
* Re: [RFC] Virtualization steps
2006-03-24 21:19 ` Herbert Poetzl
2006-03-27 18:45 ` Eric W. Biederman
@ 2006-03-28 21:58 ` Eric W. Biederman
1 sibling, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-28 21:58 UTC (permalink / raw)
To: Herbert Poetzl
Cc: haveblue, Kirill Korotaev, linux-kernel, devel, serue, akpm, sam,
Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Herbert Poetzl <herbert@13thfloor.at> writes:
>> - network virtualization
>
> here I see many issues, as for example Linux-VServer
> does not necessarily aim for full virtualization, when
> simple and performant isolation is sufficient.
The current technique employed by vserver is implementable
in a security module today. We are implementing each of
these pieces as a separate namespace, so actually using
any one of them is optional. So implementing your current
method of network isolation in a security module should be
straightforward.
Eric
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 6:45 ` [Devel] " Kir Kolyshkin
@ 2006-03-28 21:59 ` Sam Vilain
2006-03-28 22:24 ` Kir Kolyshkin
0 siblings, 1 reply; 125+ messages in thread
From: Sam Vilain @ 2006-03-28 21:59 UTC (permalink / raw)
To: Kir Kolyshkin; +Cc: devel, linux-kernel
On Tue, 2006-03-28 at 10:45 +0400, Kir Kolyshkin wrote:
> It is actually not a future goal, but rather a reality. Since os-level
> virtualization overhead is very low (1-2 per cent or so), one can run
> hundreds of VEs.
Huh? You managed to measure it!? Or do you just mean "negligible" by
"1-2 per cent" ? :-)
Sam.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 21:59 ` Sam Vilain
@ 2006-03-28 22:24 ` Kir Kolyshkin
2006-03-28 23:28 ` Sam Vilain
0 siblings, 1 reply; 125+ messages in thread
From: Kir Kolyshkin @ 2006-03-28 22:24 UTC (permalink / raw)
To: Sam Vilain; +Cc: Kir Kolyshkin, devel, linux-kernel
Sam Vilain wrote:
>On Tue, 2006-03-28 at 10:45 +0400, Kir Kolyshkin wrote:
>
>
>>It is actually not a future goal, but rather a reality. Since os-level
>>virtualization overhead is very low (1-2 per cent or so), one can run
>>hundreds of VEs.
>>
>>
>
>Huh? You managed to measure it!? Or do you just mean "negligible" by
>"1-2 per cent" ? :-)
>
>
We run various tests to measure OpenVZ/Virtuozzo overhead, as we care
a great deal about that stuff. I do not remember all the gory details at the
moment, but I gave the correct numbers: "1-2 per cent or so".
There are things such as networking (OpenVZ's venet device) overhead,
fair CPU scheduler overhead, and so on.
Why do you think it can not be measured? It either can be, or it is too
low to be measured reliably (a fraction of a per cent or so).
Regards,
Kir.
* Re: [RFC] Virtualization steps
2006-03-28 8:51 ` Kirill Korotaev
2006-03-28 12:53 ` Serge E. Hallyn
@ 2006-03-28 22:51 ` Sam Vilain
2006-03-29 20:30 ` Dave Hansen
2 siblings, 0 replies; 125+ messages in thread
From: Sam Vilain @ 2006-03-28 22:51 UTC (permalink / raw)
To: Kirill Korotaev; +Cc: linux-kernel, devel
On Tue, 2006-03-28 at 12:51 +0400, Kirill Korotaev wrote:
> we will create a separate branch also called -acked, where patches
> agreed upon will go.
No need. Just use Acked-By: comments.
Also, can I give some more feedback on the way you publish your patches:
1. git's replication uses the notion of a forward-only commit list.
So, if you change patches or rebase them then you have to rewind
the base point - which in pure git terms means create a new head.
So, you should use the convention of putting some identifier - a
date, or a version number - in each head.
2. Why do you have a separate repository for your normal openvz and the
-ms trees? You can just use different heads.
3. Apache was doing something weird to the HEAD symlink in your
repository. (mind you, if you adopt notion 1., this becomes
irrelevant :-))
Otherwise, it's a great thing to see your patches published via git!
I can't recommend Stacked Git more highly for performing the 'winding'
of the patch stack necessary for revising patches. Google for "stgit".
Sam.
* Re: [RFC] Virtualization steps
2006-03-28 9:15 ` Nick Piggin
2006-03-28 15:35 ` Herbert Poetzl
2006-03-28 16:15 ` Eric W. Biederman
@ 2006-03-28 23:04 ` Sam Vilain
2006-03-29 1:39 ` Kirill Korotaev
3 siblings, 0 replies; 125+ messages in thread
From: Sam Vilain @ 2006-03-28 23:04 UTC (permalink / raw)
To: Nick Piggin; +Cc: Kirill Korotaev, linux-kernel
On Tue, 2006-03-28 at 19:15 +1000, Nick Piggin wrote:
> Kirill Korotaev wrote:
> > First of all, what it does which low level virtualization can't:
> > - it allows to run 100 containers on 1GB RAM
> > (it is called containers, VE - Virtual Environments,
> > VPS - Virtual Private Servers).
> > - it has little overhead (<1-2%), unlike the unavoidable overhead of
> > hardware virtualization. For example, Xen has >20% overhead on disk I/O.
> Are any future hardware solutions likely to improve these problems?
No, not all of them.
> > OS kernel virtualization
> > ~~~~~~~~~~~~~~~~~~~~~~~~
> Is this considered secure enough that multiple untrusted VEs are run
> on production systems?
Yes, hosting providers have been deploying this technology for years.
> What kind of users want this, who can't use alternatives like real
> VMs?
People who want low overhead and the administrative benefits of only
running a single kernel and not umpteen. For instance visibility from
the host into the guests' filesystems is a huge advantage, even if the
performance benefits can be magically overcome somehow.
> > Summary of previous discussions on LKML
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Have there been any discussions between the groups pushing this
> virtualization, and important kernel developers who are not part of
> a virtualization effort? Ie. is there any consensus about the
> future of these patches?
Plenty recently. Check for threads involving (the people on the CC list
to the head of this thread) this year.
Comparing Xen/VMI with Vserver/OpenVZ is comparing apples with orchards.
May I refer you to some slides for a talk I gave at Linux.conf.au about
Vserver: http://utsl.gen.nz/talks/vserver/slide17a.html
Sam.
* Re: [RFC] Virtualization steps
2006-03-28 14:41 ` Bill Davidsen
2006-03-28 15:03 ` Eric W. Biederman
@ 2006-03-28 23:07 ` Sam Vilain
2006-03-29 20:56 ` Bill Davidsen
1 sibling, 1 reply; 125+ messages in thread
From: Sam Vilain @ 2006-03-28 23:07 UTC (permalink / raw)
To: Bill Davidsen
Cc: Kirill Korotaev, Dave Hansen, Eric W. Biederman, linux-kernel,
herbert, devel, serue, akpm, Alexey Kuznetsov, Pavel Emelianov,
Stanislav Protassov
On Tue, 2006-03-28 at 09:41 -0500, Bill Davidsen wrote:
> > It is more than realistic. Hosting companies run more than 100 VPSs in
> > reality. There are also other usefull scenarios. For example, I know
> > the universities which run VPS for every faculty web site, for every
> > department, mail server and so on. Why do you think they want to run
> > only 5VMs on one machine? Much more!
>
> I made no comment on what "they" might want; I want to make the rack of
> underutilized Windows, BSD and Solaris servers go away. An approach
> which doesn't support unmodified guest installs doesn't solve any of my
> current problems. I didn't say it was in any way not useful, just not of
> interest to me. The needs I have for Linux environments are answered by
> jails and/or UML.
We are talking about adding jail technology, also known as containers on
Solaris and vserver/openvz on Linux, to the mainline kernel.
So, you are obviously interested!
Because, of course, you can take an unmodified guest filesystem and,
assuming the kernels are compatible, run it without changes. I
find this consolidation approach indispensable.
Sam.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 21:51 ` Eric W. Biederman
@ 2006-03-28 23:18 ` Sam Vilain
0 siblings, 0 replies; 125+ messages in thread
From: Sam Vilain @ 2006-03-28 23:18 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jun OKAJIMA, devel, akpm, Nick Piggin, linux-kernel,
Alexey Kuznetsov, serue, herbert
On Tue, 2006-03-28 at 14:51 -0700, Eric W. Biederman wrote:
> Jun OKAJIMA <okajima@digitalinfra.co.jp> writes:
>
> > Probably the biggest problem for now is that the Xen patch conflicts
> > with the VServer/OpenVZ patch.
>
> The implementations are significantly different enough that I don't
> see Xen and any jail patch really conflicting. There might be some
> trivial conflicts in /proc but even that seems unlikely.
This has been done before,
http://list.linux-vserver.org/archive/vserver/msg10235.html
Sam.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 22:24 ` Kir Kolyshkin
@ 2006-03-28 23:28 ` Sam Vilain
2006-03-29 9:13 ` Kirill Korotaev
0 siblings, 1 reply; 125+ messages in thread
From: Sam Vilain @ 2006-03-28 23:28 UTC (permalink / raw)
To: Kir Kolyshkin; +Cc: devel, linux-kernel
On Wed, 2006-03-29 at 02:24 +0400, Kir Kolyshkin wrote:
> >Huh? You managed to measure it!? Or do you just mean "negligible" by
> >"1-2 per cent" ? :-)
> We run various tests to measure OpenVZ/Virtuozzo overhead, as we care
> a great deal about that stuff. I do not remember all the gory details at the
> moment, but I gave the correct numbers: "1-2 per cent or so".
>
> There are things such as networking (OpenVZ's venet device) overhead,
> fair CPU scheduler overhead, and so on.
>
> Why do you think it can not be measured? It either can be, or it is too
> low to be measured reliably (a fraction of a per cent or so).
Well, for instance the fair CPU scheduling overhead is so tiny it may as
well not be there in the VServer patch. It's just a per-vserver TBF
that feeds back into the priority (and hence timeslice length) of the
process. ie, you get "CPU tokens" which deplete as processes in your
vserver run and you either get a boost or a penalty depending on the
level of the tokens in the bucket. This doesn't provide guarantees, but
works well for many typical workloads. And once Herbert fixed the SMP
cacheline problems in my code ;) it was pretty much full speed. That
is, until you want it to sacrifice overall performance for enforcing
limits.
How does your fair scheduler work? Do you just keep a runqueue for each
vps?
To be honest, I've never needed to determine whether its overhead is 1%
or 0.01%, it would just be a meaningless benchmark anyway :-). I know
it's "good enough for me".
Sam.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 15:48 ` [Devel] " Matt Ayres
2006-03-28 16:42 ` Eric W. Biederman
@ 2006-03-29 0:55 ` Kirill Korotaev
1 sibling, 0 replies; 125+ messages in thread
From: Kirill Korotaev @ 2006-03-29 0:55 UTC (permalink / raw)
To: devel
Cc: akpm, Nick Piggin, xen-devel@lists.xensource.com, sam,
linux-kernel, Eric W. Biederman, serue, Alexey Kuznetsov, herbert
> Kirill Korotaev wrote:
>>> Oh, after you come to an agreement and start posting patches, can you
>>> also outline why we want this in the kernel (what it does that low
>>> level virtualization doesn't, etc, etc), and how and why you've agreed
>>> to implement it. Basically, some background and a summary of your
>>> discussions for those who can't follow everything. Or is that a faq
>>> item?
>> Nick, will be glad to shed some light on it.
>>
>> First of all, what it does which low level virtualization can't:
>> - it allows to run 100 containers on 1GB RAM
>> (it is called containers, VE - Virtual Environments,
>> VPS - Virtual Private Servers).
>> - it has little overhead (<1-2%), unlike the unavoidable overhead of
>> hardware virtualization. For example, Xen has >20% overhead on disk I/O.
>
> I think the Xen guys would disagree with you on this. Xen claims <3%
> overhead on the XenSource site.
>
> Where did you get these figures from? What Xen version did you test?
> What was your configuration? Did you have kernel debugging enabled? You
> can't just post numbers without the data to back it up, especially when
> it conflicts greatly with the Xen developers' statements. AFAIK Xen is
> well on its way to inclusion into the mainstream kernel.
I have no exact numbers at hand as I'm in another country right now.
But! We tested Xen not long ago with the iozone test suite and it gave
~20-30% disk I/O overhead. Recently we were testing CPU schedulers, and
the EDF scheduler gave me 33% overhead on some very simple loads with
almost-busy loops inside VMs. It also did not provide any good fairness
on a 2-CPU SMP system, to my surprise. You can object, but better simply
retest it yourself if interested. There were other tests as well, which
reported very different overheads on Xen 3. I suppose the Xen guys do such
measurements themselves, no?
And I'm sure they are constantly improving it; they are doing good
work on it.
Thanks,
Kirill
* Re: [RFC] Virtualization steps
2006-03-28 9:15 ` Nick Piggin
` (2 preceding siblings ...)
2006-03-28 23:04 ` Sam Vilain
@ 2006-03-29 1:39 ` Kirill Korotaev
2006-03-29 13:47 ` Herbert Poetzl
3 siblings, 1 reply; 125+ messages in thread
From: Kirill Korotaev @ 2006-03-29 1:39 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, haveblue, linux-kernel, herbert, devel, serue,
akpm, sam, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Nick,
>> First of all, what it does which low level virtualization can't:
>> - it allows to run 100 containers on 1GB RAM
>> (it is called containers, VE - Virtual Environments,
>> VPS - Virtual Private Servers).
>> - it has little overhead (<1-2%), unlike the unavoidable overhead of
>> hardware virtualization. For example, Xen has >20% overhead on disk I/O.
>
> Are any future hardware solutions likely to improve these problems?
Probably you are aware of the VT-i/VT-x technologies and the virtualized
MMU and I/O MMU planned by Intel and AMD.
These features should improve performance somewhat, but there is
still a limit to how much the overhead can be reduced, since at least disk,
network, video and similar devices must be emulated.
>> OS kernel virtualization
>> ~~~~~~~~~~~~~~~~~~~~~~~~
>
> Is this considered secure enough that multiple untrusted VEs are run
> on production systems?
it is secure enough. What makes it secure? In general:
- virtualization, which makes resources private
- resource control, which limits a VE's resource usage
In more technical detail, virtualization projects make user access (and
capability) checks stricter. Moreover, OpenVZ uses a "denied by
default" approach to make sure it is secure and VE users are not allowed
anything else.
Also, about 2-3 months ago we had a security review of the OpenVZ project
done by Solar Designer. So, in general such a virtualization approach
should be no less secure than a VM-like one. VM core code is bigger, and
there are plenty of chances for bugs there.
> What kind of users want this, who can't use alternatives like real
> VMs?
Many companies; I just can't share their names. But in general, no
enterprise or hosting company needs to run different OSes on the same
machine. For them it is quite natural to use N machines for Linux and M
for Windows. And since VEs are much more lightweight and easier to work
with, they like them very much.
Just for example, the OpenVZ core is running more than 300,000 VEs worldwide.
Thanks,
Kirill
* Re: [RFC] Virtualization steps
2006-03-28 14:44 ` Nick Piggin
@ 2006-03-29 6:05 ` Eric W. Biederman
2006-03-29 6:19 ` Sam Vilain
2006-03-30 3:26 ` Nick Piggin
0 siblings, 2 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-29 6:05 UTC (permalink / raw)
To: Nick Piggin; +Cc: Herbert Poetzl, Bill Davidsen, Linux Kernel ML
Nick Piggin <nickpiggin@yahoo.com.au> writes:
> I don't think I could give a complete answer...
> I guess it could be stated as the increase in the complexity of
> the rest of the code for someone who doesn't know anything about
> the virtualization implementation.
>
> Completely non intrusive is something like 2 extra function calls
> to/from generic code, changes to data structures are transparent
> (or have simple wrappers), and there is no shared locking or data
> with the rest of the kernel. And it goes up from there.
>
> Anyway I'm far from qualified... I just hope that with all the
> work you guys are putting in that you'll be able to justify it ;)
As far as I have been able to survey the work, the most common case
is replacing a global variable with a variable we look up via
current.
That, plus the security module infrastructure, lets you
implement the semantics in a pretty straightforward manner.
The only really intrusive part is that because we tickle the
code differently we see a different set of problems. Such
as the mess that is the proc and sysctl code, and the lack of
good resource limits.
But none of that is inherent to the problem; it is just that when
you use the kernel harder and have more untrusted users, you
see a different set of problems.
Eric
* Re: [RFC] Virtualization steps
2006-03-29 6:05 ` Eric W. Biederman
@ 2006-03-29 6:19 ` Sam Vilain
2006-03-29 18:20 ` Chris Wright
2006-03-30 3:26 ` Nick Piggin
1 sibling, 1 reply; 125+ messages in thread
From: Sam Vilain @ 2006-03-29 6:19 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Nick Piggin, Herbert Poetzl, Bill Davidsen, Linux Kernel ML
Eric W. Biederman wrote:
>That, plus the security module infrastructure, lets you
>implement the semantics in a pretty straightforward manner.
>
>
Yes, this is the essence of it all. Globals are bad, mmm'kay?
This raises a very interesting question. All those LSM globals,
shouldn't those be virtualisable, too? After all, isn't it natural to
want to apply a different security policy to different sets of processes?
I don't think anyone's done any work on this yet...
Man, fork() is going to get really expensive if we don't put in the
"process family" abstraction... but like you say, it comes later,
getting the semantics right comes first.
>The only really intrusive part is that because we tickle the
>code differently we see a different set of problems. Such
>as the mess that is the proc and sysctl code, and the lack of
>good resource limits.
>
>But none of that is inherent to the problem; it is just that when
>you use the kernel harder and have more untrusted users, you
>see a different set of problems.
>
>
Indeed. Lots of old turds to clean up...
Sam.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 23:28 ` Sam Vilain
@ 2006-03-29 9:13 ` Kirill Korotaev
2006-03-29 11:08 ` Sam Vilain
2006-03-29 13:45 ` Herbert Poetzl
0 siblings, 2 replies; 125+ messages in thread
From: Kirill Korotaev @ 2006-03-29 9:13 UTC (permalink / raw)
To: devel; +Cc: Kir Kolyshkin, linux-kernel, sam, Herbert Poetzl
Sam,
>> Why do you think it can not be measured? It either can be, or it is too
>> low to be measured reliably (a fraction of a per cent or so).
>
> Well, for instance the fair CPU scheduling overhead is so tiny it may as
> well not be there in the VServer patch. It's just a per-vserver TBF
> that feeds back into the priority (and hence timeslice length) of the
> process. ie, you get "CPU tokens" which deplete as processes in your
> vserver run and you either get a boost or a penalty depending on the
> level of the tokens in the bucket. This doesn't provide guarantees, but
> works well for many typical workloads.
I wonder what is the value of it if it doesn't do guarantees or QoS?
In our experiments with it we failed to observe any fairness. So I
suppose the only goal of this is to make sure that a malicious user
won't consume all the CPU power, right?
> How does your fair scheduler work? Do you just keep a runqueue for each
> vps?
we keep num_online_cpus runqueues per VPS.
The fair scheduler is an SFQ-like algorithm which selects the VPS to
be scheduled; then the standard Linux scheduler selects a process from
that VPS's runqueues to run.
> To be honest, I've never needed to determine whether its overhead is 1%
> or 0.01%, it would just be a meaningless benchmark anyway :-). I know
> it's "good enough for me".
Sure! We feel the same, but people like numbers :)
Thanks,
Kirill
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-29 9:13 ` Kirill Korotaev
@ 2006-03-29 11:08 ` Sam Vilain
2006-03-29 13:45 ` Herbert Poetzl
1 sibling, 0 replies; 125+ messages in thread
From: Sam Vilain @ 2006-03-29 11:08 UTC (permalink / raw)
To: Kirill Korotaev; +Cc: devel, Kir Kolyshkin, linux-kernel, Herbert Poetzl
On Wed, 2006-03-29 at 13:13 +0400, Kirill Korotaev wrote:
> > Well, for instance the fair CPU scheduling overhead is so tiny it may as
> > well not be there in the VServer patch. It's just a per-vserver TBF
> > that feeds back into the priority (and hence timeslice length) of the
> > process. ie, you get "CPU tokens" which deplete as processes in your
> > vserver run and you either get a boost or a penalty depending on the
> > level of the tokens in the bucket. This doesn't provide guarantees, but
> > works well for many typical workloads.
> I wonder what is the value of it if it doesn't do guarantees or QoS?
It still does "QoS". The TBF has a "fill rate", which is basically N
tokens per M jiffies. Then you just set the size of the "bucket", and
the prio bonus given is between -5 (when bucket is full) and +15 (when
bucket is empty). The normal -10 to +10 'interactive' prio bonus is
reduced to -5 to +5 to compensate.
In other words, it's like a global 'nice' across all of the processes in
the vserver.
So, these characteristics do provide some level of guarantees, but not
all that people expect. eg, people want to say "cap usage at 5%", but
as designed the scheduler does not ever prevent runnable processes from
running if the CPUs have nothing better to do, so they think the
scheduler is broken. It is also possible with a fork bomb (assuming the
absence of appropriate ulimits) that you start enough processes that you
don't care that they are all effectively nice +19.
Herbert later made it add some of these guarantees, but I believe there
is a performance impact of some kind.
> In our experiments with it we failed to observe any fairness.
Well, it does not aim to be 'fair', it aims to be useful for allocating
CPU to vservers. ie, if you allocate X% of the CPU in the system to a
vserver, and it uses more, then try to make it use less via priority
penalties - and give others shortchanged or not using the CPU very much
performance bonuses. That's all.
So, if you under- or over-book CPU allocation, it doesn't work. The
idea was that monitoring it could be shipped out to userland. I just
wanted something flexible enough to allow virtually any policy to be put
into place without wasting too many cycles.
> > How does your fair scheduler work? Do you just keep a runqueue for each
> > vps?
> we keep num_online_cpus runqueues per VPS.
Right. I considered that approach but just couldn't be bothered
implementing it, so went with the TBF because it worked and was
lightweight.
> Fairs scheduler is some kind of SFQ like algorithm which selects VPS to
> be scheduled, than standart linux scheduler selects a process in a VPS
> runqueues to run.
Right.
> > To be honest, I've never needed to determine whether its overhead is 1%
> > or 0.01%, it would just be a meaningless benchmark anyway :-). I know
> > it's "good enough for me".
> Sure! We feel the same, but people like numbers :)
Sometimes the answer has to be "mu".
Sam.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-29 9:13 ` Kirill Korotaev
2006-03-29 11:08 ` Sam Vilain
@ 2006-03-29 13:45 ` Herbert Poetzl
2006-03-29 14:47 ` Kirill Korotaev
1 sibling, 1 reply; 125+ messages in thread
From: Herbert Poetzl @ 2006-03-29 13:45 UTC (permalink / raw)
To: Kirill Korotaev; +Cc: devel, Kir Kolyshkin, linux-kernel, sam
On Wed, Mar 29, 2006 at 01:13:14PM +0400, Kirill Korotaev wrote:
> Sam,
>
> >>Why do you think it can not be measured? It either can be, or it is too
> >>low to be measured reliably (a fraction of a per cent or so).
> >
> >Well, for instance the fair CPU scheduling overhead is so tiny it may as
> >well not be there in the VServer patch. It's just a per-vserver TBF
> >that feeds back into the priority (and hence timeslice length) of the
> >process. ie, you get "CPU tokens" which deplete as processes in your
> >vserver run and you either get a boost or a penalty depending on the
> >level of the tokens in the bucket. This doesn't provide guarantees, but
> >works well for many typical workloads.
> I wonder what is the value of it if it doesn't do guarantees or QoS?
> In our experiments with it we failed to observe any fairness.
probably a misconfiguration on your side ...
> So I suppose the only goal of this is to make sure that a malicious
> user won't consume all the CPU power, right?
the currently used scheduler extensions do much
more than that; basically all kinds of scenarios
can be satisfied with them, at almost no overhead
> >How does your fair scheduler work?
> >Do you just keep a runqueue for each vps?
> we keep num_online_cpus runqueues per VPS.
> The fair scheduler is an SFQ-like algorithm which selects the VPS to
> be scheduled; then the standard Linux scheduler selects a process
> from that VPS's runqueues to run.
>
> >To be honest, I've never needed to determine whether its overhead is 1%
> >or 0.01%, it would just be a meaningless benchmark anyway :-). I know
> >it's "good enough for me".
> Sure! We feel the same, but people like numbers :)
well, do you have numbers?
best,
Herbert
> Thanks,
> Kirill
* Re: [RFC] Virtualization steps
2006-03-29 1:39 ` Kirill Korotaev
@ 2006-03-29 13:47 ` Herbert Poetzl
0 siblings, 0 replies; 125+ messages in thread
From: Herbert Poetzl @ 2006-03-29 13:47 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Nick Piggin, Eric W. Biederman, haveblue, linux-kernel, devel,
serue, akpm, sam, Alexey Kuznetsov, Pavel Emelianov,
Stanislav Protassov
On Wed, Mar 29, 2006 at 05:39:00AM +0400, Kirill Korotaev wrote:
> Nick,
>
> >>First of all, what it does which low level virtualization can't:
> >>- it allows to run 100 containers on 1GB RAM
> >> (it is called containers, VE - Virtual Environments,
> >> VPS - Virtual Private Servers).
> >>- it has little overhead (<1-2%), unlike the unavoidable overhead of
> >> hardware virtualization. For example, Xen has >20% overhead on disk I/O.
> >
> >Are any future hardware solutions likely to improve these problems?
> Probably you are aware of the VT-i/VT-x technologies and the virtualized
> MMU and I/O MMU planned by Intel and AMD.
> These features should improve performance somewhat, but there is
> still a limit to how much the overhead can be reduced, since at least
> disk, network, video and similar devices must be emulated.
>
> >>OS kernel virtualization
> >>~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >Is this considered secure enough that multiple untrusted VEs are run
> >on production systems?
> it is secure enough. What makes it secure? In general:
> - virtualization, which makes resources private
> - resource control, which limits a VE's resource usage
> In more technical detail, virtualization projects make user access (and
> capability) checks stricter. Moreover, OpenVZ uses a "denied by
> default" approach to make sure it is secure and VE users are not allowed
> anything else.
>
> Also, about 2-3 months ago we had a security review of the OpenVZ project
> done by Solar Designer. So, in general such a virtualization approach
> should be no less secure than a VM-like one. VM core code is bigger, and
> there are plenty of chances for bugs there.
>
> >What kind of users want this, who can't use alternatives like real
> >VMs?
> Many companies; I just can't share their names. But in general, no
> enterprise or hosting company needs to run different OSes on the same
> machine. For them it is quite natural to use N machines for Linux and M
> for Windows. And since VEs are much more lightweight and easier to work
> with, they like them very much.
>
> Just for example, OpenVZ core is running more than 300,000 VEs worldwide.
not bad, how did you get to those numbers?
and, more important, how many of those are actually OpenVZ?
(compared to Virtuozzo(tm))
best,
Herbert
> Thanks,
> Kirill
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-29 13:45 ` Herbert Poetzl
@ 2006-03-29 14:47 ` Kirill Korotaev
2006-03-29 17:29 ` Herbert Poetzl
2006-03-29 21:37 ` Sam Vilain
0 siblings, 2 replies; 125+ messages in thread
From: Kirill Korotaev @ 2006-03-29 14:47 UTC (permalink / raw)
To: Herbert Poetzl; +Cc: Kirill Korotaev, devel, Kir Kolyshkin, linux-kernel, sam
>> I wonder what is the value of it if it doesn't do guarantees or QoS?
>> In our experiments with it we failed to observe any fairness.
>
> probably a misconfiguration on your side ...
maybe you can provide some instructions on which kernel version to use
and how to set up the following scenario:
a 2-CPU box with 3 VPSs which should run with a 1:2:3 ratio of CPU usage.
> well, do you have numbers?
just run the above scenario with one busy loop inside each VPS. I was
not able to observe a 1:2:3 CPU distribution. Other scenarios also didn't
show me any fairness. The results varied: sometimes 1:1:2,
sometimes others.
Thanks,
Kirill
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-29 14:47 ` Kirill Korotaev
@ 2006-03-29 17:29 ` Herbert Poetzl
2006-03-29 21:37 ` Sam Vilain
1 sibling, 0 replies; 125+ messages in thread
From: Herbert Poetzl @ 2006-03-29 17:29 UTC (permalink / raw)
To: Kirill Korotaev; +Cc: Kirill Korotaev, devel, Kir Kolyshkin, linux-kernel, sam
On Wed, Mar 29, 2006 at 06:47:58PM +0400, Kirill Korotaev wrote:
> >>I wonder what is the value of it if it doesn't do guarantees or QoS?
> >>In our experiments with it we failed to observe any fairness.
> >
> >probably a misconfiguration on your side ...
> maybe you can provide some instructions on which kernel version to use
> and how to set up the following scenario: a 2-CPU box with 3 VPSs which
> should run with a 1:2:3 ratio of CPU usage.
that is quite simple: you enable the Hard CPU Scheduler
and select Idle Time Skip, then you set the following
token bucket values, depending on what you mean by
'should run with a 1:2:3 ratio of CPU usage':
a) a guaranteed maximum of 16.7%, 33.3% and 50.0%
b) fair sharing according to 1:2:3
c) a guaranteed minimum of 16.7%, 33.3% and 50.0%,
with fair sharing of 1:2:3 for the rest ...
for all cases you would set:
(adjust according to your reserve/boost preferences)
VPS1,2,3: tokens_min = 50, tokens_max = 500
interval = interval2 = 6
a) VPS1: rate = 1, hard, noidleskip
VPS2: rate = 2, hard, noidleskip
VPS3: rate = 3, hard, noidleskip
b) VPS1: rate2 = 1, hard, idleskip
VPS2: rate2 = 2, hard, idleskip
VPS3: rate2 = 3, hard, idleskip
c) VPS1: rate = rate2 = 1, hard, idleskip
VPS2: rate = rate2 = 2, hard, idleskip
VPS3: rate = rate2 = 3, hard, idleskip
of course, adjusting rate/interval while keeping
the ratio might help, depending on the guest load
(i.e. more batch-type load or more interactive stuff)
of course, you can make those adjustments per CPU, so if
you want, for example, to assign one CPU to the third
guest, you can do that easily too ...
> >well, do you have numbers?
> just run the above scenario with one busy loop inside each VPS. I was
> not able to observe a 1:2:3 CPU distribution. Other scenarios also didn't
> show me any fairness. The results varied: sometimes 1:1:2,
> sometimes others.
what was your setup?
best,
Herbert
> Thanks,
> Kirill
* Re: [RFC] Virtualization steps
2006-03-29 6:19 ` Sam Vilain
@ 2006-03-29 18:20 ` Chris Wright
2006-03-29 22:36 ` Sam Vilain
0 siblings, 1 reply; 125+ messages in thread
From: Chris Wright @ 2006-03-29 18:20 UTC (permalink / raw)
To: Sam Vilain
Cc: Eric W. Biederman, Nick Piggin, Herbert Poetzl, Bill Davidsen,
Linux Kernel ML
* Sam Vilain (sam@vilain.net) wrote:
> This raises a very interesting question. All those LSM globals,
> shouldn't those be virtualisable, too? After all, isn't it natural to
> want to apply a different security policy to different sets of processes?
Which globals? Policy could be informed by relevant containers.
thanks,
-chris
* Re: [RFC] Virtualization steps
2006-03-28 8:51 ` Kirill Korotaev
2006-03-28 12:53 ` Serge E. Hallyn
2006-03-28 22:51 ` Sam Vilain
@ 2006-03-29 20:30 ` Dave Hansen
2006-03-29 20:47 ` Eric W. Biederman
` (2 more replies)
2 siblings, 3 replies; 125+ messages in thread
From: Dave Hansen @ 2006-03-29 20:30 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Eric W. Biederman, linux-kernel, devel, serue, akpm, sam,
Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
On Tue, 2006-03-28 at 12:51 +0400, Kirill Korotaev wrote:
> Eric, we have a GIT repo on openvz.org already:
> http://git.openvz.org
Git is great for getting patches and lots of updates out, but I'm not
sure it is ideal for what we're trying to do. We'll need things reviewed
at each step, especially because we're going to be touching so much
common code.
I'd guess a set of quilt (or patch-utils) patches is probably best,
especially if we're trying to get stuff into -mm first.
-- Dave
* Re: [RFC] Virtualization steps
2006-03-29 20:30 ` Dave Hansen
@ 2006-03-29 20:47 ` Eric W. Biederman
2006-03-29 22:44 ` Sam Vilain
2006-03-30 13:51 ` Kirill Korotaev
2 siblings, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-29 20:47 UTC (permalink / raw)
To: Dave Hansen
Cc: Kirill Korotaev, linux-kernel, devel, serue, akpm, sam,
Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Dave Hansen <haveblue@us.ibm.com> writes:
> On Tue, 2006-03-28 at 12:51 +0400, Kirill Korotaev wrote:
>> Eric, we have a GIT repo on openvz.org already:
>> http://git.openvz.org
>
> Git is great for getting patches and lots of updates out, but I'm not
> sure it is ideal for what we're trying to do. We'll need things reviewed
> at each step, especially because we're going to be touching so much
> common code.
>
> I'd guess a set of quilt (or patch-utils) patches is probably best,
> especially if we're trying to get stuff into -mm first.
Git is as good at holding patches as quilt. It isn't quite as
good at working with them as quilt but in the long term that is
fixable.
The important point is that we get a collection of patches that
we can all agree to, and that we publish it.
At this point it sounds like each group will happily publish the
patches, and that might not be a bad double check on agreement.
Then we have someone send them to Andrew, or we have a quilt or
a git tree that Andrew knows he can pull from.
But we do need lots of review, so distribution to Andrew and the other
kernel developers as plain patches appears to be the healthy choice.
I'm going to go bury my head in the sand and finish my OLS paper now.
Eric
* Re: [RFC] Virtualization steps
2006-03-28 23:07 ` Sam Vilain
@ 2006-03-29 20:56 ` Bill Davidsen
0 siblings, 0 replies; 125+ messages in thread
From: Bill Davidsen @ 2006-03-29 20:56 UTC (permalink / raw)
To: Sam Vilain
Cc: Kirill Korotaev, Dave Hansen, Eric W. Biederman, linux-kernel,
herbert, devel, serue, akpm, Alexey Kuznetsov, Pavel Emelianov,
Stanislav Protassov
Sam Vilain wrote:
> On Tue, 2006-03-28 at 09:41 -0500, Bill Davidsen wrote:
>>> It is more than realistic. Hosting companies run more than 100 VPSs in
>>> reality. There are also other useful scenarios. For example, I know of
>>> universities which run a VPS for every faculty web site, for every
>>> department, mail server and so on. Why do you think they want to run
>>> only 5 VMs on one machine? Much more!
>> I made no comment on what "they" might want; I want to make the rack of
>> underutilized Windows, BSD and Solaris servers go away. An approach
>> which doesn't support unmodified guest installs doesn't solve any of my
>> current problems. I didn't say it was in any way not useful, just not of
>> interest to me. What needs I have for Linux environments are answered by
>> jails and/or UML.
>
> We are talking about adding jail technology, also known as containers on
> Solaris and vserver/openvz on Linux, to the mainline kernel.
>
> So, you are obviously interested!
>
> Because of course, you can take an unmodified filesystem of the guest
> and, assuming the kernels are compatible, run it without changes. I
> find this consolidation approach indispensable.
>
The only way to assume kernels are compatible is to run the same distro.
Vendor kernels are certainly not compatible; even running a
kernel.org kernel on Fedora (for instance) reveals that the utilities are
also tweaked to expect the kernel changes, and you wind up with a system
which feels like wearing someone else's hat. It's stable, but little
things just don't work right.
--
-bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
* Re: [RFC] Virtualization steps
2006-03-28 15:35 ` Herbert Poetzl
2006-03-28 15:53 ` Nick Piggin
2006-03-28 16:31 ` Eric W. Biederman
@ 2006-03-29 21:37 ` Bill Davidsen
2 siblings, 0 replies; 125+ messages in thread
From: Bill Davidsen @ 2006-03-29 21:37 UTC (permalink / raw)
To: Nick Piggin, Kirill Korotaev, Eric W. Biederman, haveblue,
linux-kernel, devel, serue, akpm, sam, Alexey Kuznetsov,
Pavel Emelianov, Stanislav Protassov
Herbert Poetzl wrote:
>>> Summary of previous discussions on LKML
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> Have there been any discussions between the groups pushing this
>> virtualization, and ...
>
> yes, the discussions are ongoing ... maybe to clarify the
> situation for the folks not involved (projects in
> alphabetical order):
>
Thank you! Nice to have a scorecard.
--
-bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-29 14:47 ` Kirill Korotaev
2006-03-29 17:29 ` Herbert Poetzl
@ 2006-03-29 21:37 ` Sam Vilain
2006-04-12 8:28 ` Kirill Korotaev
1 sibling, 1 reply; 125+ messages in thread
From: Sam Vilain @ 2006-03-29 21:37 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Herbert Poetzl, Kirill Korotaev, devel, Kir Kolyshkin,
linux-kernel
On Wed, 2006-03-29 at 18:47 +0400, Kirill Korotaev wrote:
> >> I wonder what is the value of it if it doesn't do guarantees or QoS?
> >> In our experiments with it we failed to observe any fairness.
> >
> > probably a misconfiguration on your side ...
> maybe you can provide some instructions on which kernel version to use
> and how to set up the following scenario:
> a 2-CPU box with 3 VPSs which should run with a 1:2:3 ratio of CPU usage.
Ok, I'll call those three VPSes fast, faster and fastest.
"fast" : fill rate 1, interval 3
"faster" : fill rate 2, interval 3
"fastest" : fill rate 3, interval 3
That all adds up to a fill rate of 6 with an interval of 3, but that is
right because with two processors you have 2 tokens to allocate per
jiffie. Also set the bucket size to something of the order of HZ.
You can watch the priority of the processes within each vserver jump up
and down with `vtop' during testing. You should also be able to watch
each vserver's bucket fill and empty in /proc/virtual/XXX/sched (IIRC)
> > well, do you have numbers?
> just run the above scenario with one busy loop inside each VPS. I was
> not able to observe a 1:2:3 CPU distribution. Other scenarios also didn't
> show me any fairness. The results varied: sometimes 1:1:2,
> sometimes others.
I mentioned this earlier, but for the sake of the archives I'll repeat -
if you are running with any of the buckets on empty, the scheduler is
imbalanced and therefore not going to provide the exact distribution you
asked for.
However with a single busy loop in each vserver I'd expect the above to
yield roughly 100% for fastest, 66% for faster and 33% for fast, within
5 seconds or so of starting those processes (assuming you set a bucket
size of HZ).
Sam.
* Re: [RFC] Virtualization steps
2006-03-29 18:20 ` Chris Wright
@ 2006-03-29 22:36 ` Sam Vilain
2006-03-29 22:52 ` Chris Wright
0 siblings, 1 reply; 125+ messages in thread
From: Sam Vilain @ 2006-03-29 22:36 UTC (permalink / raw)
To: Chris Wright
Cc: Eric W. Biederman, Nick Piggin, Herbert Poetzl, Bill Davidsen,
Linux Kernel ML
Chris Wright wrote:
>* Sam Vilain (sam@vilain.net) wrote:
>
>
>>This raises a very interesting question. All those LSM globals,
>>shouldn't those be virtualisable, too? After all, isn't it natural to
>>want to apply a different security policy to different sets of processes?
>>
>>
>
>Which globals? Policy could be informed by relevant containers.
>
>
extern struct security_operations *security_ops; in
include/linux/security.h is the global I refer to.
There is likely to be some contention there between the security folk
who probably won't like the idea that your security module can be
different for different processes, and the people who want to provide
access to security modules on the systems they want to host or consolidate.
Sam.
* Re: [RFC] Virtualization steps
2006-03-29 20:30 ` Dave Hansen
2006-03-29 20:47 ` Eric W. Biederman
@ 2006-03-29 22:44 ` Sam Vilain
2006-03-30 13:51 ` Kirill Korotaev
2 siblings, 0 replies; 125+ messages in thread
From: Sam Vilain @ 2006-03-29 22:44 UTC (permalink / raw)
To: Dave Hansen
Cc: Kirill Korotaev, Eric W. Biederman, linux-kernel, devel, serue,
akpm, Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
Dave Hansen wrote:
>On Tue, 2006-03-28 at 12:51 +0400, Kirill Korotaev wrote:
>
>
>>Eric, we have a GIT repo on openvz.org already:
>>http://git.openvz.org
>>
>>
>
>Git is great for getting patches and lots of updates out, but I'm not
>sure it is idea for what we're trying to do. We'll need things reviewed
>at each step, especially because we're going to be touching so much
>common code.
>
>I'd guess set of quilt (or patch-utils) patches is probably best,
>especially if we're trying to get stuff into -mm first.
>
>
The apparent problem is that the git commit history on a branch cannot
be unwound. However, that is fine - just make another branch and put
your new sequence of commits there.
Tools exist that allow you to wind and unwind the commit history
arbitrarily, so you can revise patches before publishing them on a
branch that you don't want to just delete. For instance:
stacked git
http://www.procode.org/stgit/
or patchy git
http://www.spearce.org/2006/02/pg-version-0111-released.html
are examples of such tools.
I recommend starting with stacked git, it really is nice.
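The branch-based workaround can be sketched with plain git, no extra tools needed (repository path, file names and commit messages here are purely illustrative):

```shell
# Sketch: instead of rewriting a published branch, redo the commit
# series on a new branch; the original branch keeps its history.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"

echo one > a.txt && git add a.txt && git commit -qm "patch 1"
echo two > b.txt && git add b.txt && git commit -qm "patch 2 (needs rework)"

# Branch from the last good commit and redo the second patch there.
git checkout -qb series-v2 HEAD~1
echo two-fixed > b.txt && git add b.txt && git commit -qm "patch 2, reworked"

git log --oneline series-v2
```

stgit and pg automate exactly this kind of series revision, keeping each patch as a named, reorderable unit instead of a raw branch.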
Sam.
* Re: [RFC] Virtualization steps
2006-03-29 22:36 ` Sam Vilain
@ 2006-03-29 22:52 ` Chris Wright
2006-03-29 23:01 ` Sam Vilain
2006-03-30 1:02 ` Eric W. Biederman
0 siblings, 2 replies; 125+ messages in thread
From: Chris Wright @ 2006-03-29 22:52 UTC (permalink / raw)
To: Sam Vilain
Cc: Chris Wright, Eric W. Biederman, Nick Piggin, Herbert Poetzl,
Bill Davidsen, Linux Kernel ML
* Sam Vilain (sam@vilain.net) wrote:
> extern struct security_operations *security_ops; in
> include/linux/security.h is the global I refer to.
OK, I figured that's what you meant. The top-level ops are similar in
nature to inode_ops in that there's not a real compelling reason to make
them per process. The process context is (usually) available, and more
importantly, the object whose access is being mediated is readily
available with its security label.
> There is likely to be some contention there between the security folk
> who probably won't like the idea that your security module can be
> different for different processes, and the people who want to provide
> access to security modules on the systems they want to host or consolidate.
I think the current setup would work fine. It's less likely that we'd
want a separate security module for each container than simply policy
that is container aware.
thanks,
-chris
* Re: [RFC] Virtualization steps
2006-03-29 22:52 ` Chris Wright
@ 2006-03-29 23:01 ` Sam Vilain
2006-03-29 23:13 ` Chris Wright
2006-03-30 1:02 ` Eric W. Biederman
1 sibling, 1 reply; 125+ messages in thread
From: Sam Vilain @ 2006-03-29 23:01 UTC (permalink / raw)
To: Chris Wright
Cc: Eric W. Biederman, Nick Piggin, Herbert Poetzl, Bill Davidsen,
Linux Kernel ML
Chris Wright wrote:
>* Sam Vilain (sam@vilain.net) wrote:
>
>
>>extern struct security_operations *security_ops; in
>>include/linux/security.h is the global I refer to.
>>
>>
>
>OK, I figured that's what you meant. The top-level ops are similar in
>nature to inode_ops in that there's not a real compelling reason to make
>them per process. The process context is (usually) available, and more
>importantly, the object whose access is being mediated is readily
>available with its security label.
>
>
AIUI inode_ops are not globals, they are per FS.
>>There is likely to be some contention there between the security folk
>>who probably won't like the idea that your security module can be
>>different for different processes, and the people who want to provide
>>access to security modules on the systems they want to host or consolidate.
>>
>>
>
>I think the current setup would work fine. It's less likely that we'd
>want a separate security module for each container than simply policy
>that is container aware.
>
>
That to me reads as:
"To avoid having to consider making security_ops non-global we will
force security modules to be container aware".
It also means you could not mix security modules that affect the same
operation in different containers on a system. Personally I don't care; I
don't use them. But perhaps this inflexibility will bring problems later
for some.
I think it's a design decision that is not completely closed, but the
inertia is certainly in favour of your position.
Sam.
* Re: [RFC] Virtualization steps
2006-03-29 23:01 ` Sam Vilain
@ 2006-03-29 23:13 ` Chris Wright
2006-03-29 23:18 ` Sam Vilain
0 siblings, 1 reply; 125+ messages in thread
From: Chris Wright @ 2006-03-29 23:13 UTC (permalink / raw)
To: Sam Vilain
Cc: Chris Wright, Eric W. Biederman, Nick Piggin, Herbert Poetzl,
Bill Davidsen, Linux Kernel ML
* Sam Vilain (sam@vilain.net) wrote:
> AIUI inode_ops are not globals, they are per FS.
Heh, yes really bad example.
> That to me reads as:
>
> "To avoid having to consider making security_ops non-global we will
> force security modules to be container aware".
Not my intention. Rather, I think from a security standpoint there's
sanity in controlling things with a single policy. I'm thinking of
containers as a simple and logical extension of roles. Point being,
the per-object security label can easily include notion of container.
> It also means you could not mix security modules that affect the same
> operation different containers on a system. Personally I don't care, I
> don't use them. But perhaps this inflexibility will bring problems later
> for some.
No issue with addressing these issues as they come.
thanks,
-chris
* Re: [RFC] Virtualization steps
2006-03-29 23:13 ` Chris Wright
@ 2006-03-29 23:18 ` Sam Vilain
2006-03-29 23:28 ` Chris Wright
0 siblings, 1 reply; 125+ messages in thread
From: Sam Vilain @ 2006-03-29 23:18 UTC (permalink / raw)
To: Chris Wright; +Cc: Linux Kernel ML
Chris Wright wrote:
>Not my intention. Rather, I think from a security standpoint there's
>sanity in controlling things with a single policy.
>
Yes, certainly. Providing the features to the users in a different way
is a pragmatic alternative to trying to make sure the contained system
gets to use all the same kernel API calls it could without the
virtualisation. The only people who won't like that are the people
consolidating, so they still have to use Xen.
>I'm thinking of
>containers as a simple and logical extension of roles. Point being,
>the per-object security label can easily include notion of container.
>
>
If it fits the model well, sounds good.
Sam.
* Re: [RFC] Virtualization steps
2006-03-29 23:18 ` Sam Vilain
@ 2006-03-29 23:28 ` Chris Wright
0 siblings, 0 replies; 125+ messages in thread
From: Chris Wright @ 2006-03-29 23:28 UTC (permalink / raw)
To: Sam Vilain; +Cc: Chris Wright, Linux Kernel ML
* Sam Vilain (sam@vilain.net) wrote:
> Yes, certainly. Providing the features to the users in a different way
> is a pragmatic alternative to trying to make sure the contained system
> gets to use all the same kernel API calls it could without the
> virtualisation. The only people who won't like that is are people
> consolidating, so they still have to use Xen.
Works for me ;-)
-chris
* Re: [RFC] Virtualization steps
2006-03-29 22:52 ` Chris Wright
2006-03-29 23:01 ` Sam Vilain
@ 2006-03-30 1:02 ` Eric W. Biederman
2006-03-30 1:36 ` Chris Wright
2006-03-30 2:24 ` Sam Vilain
1 sibling, 2 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-30 1:02 UTC (permalink / raw)
To: Chris Wright
Cc: Sam Vilain, Nick Piggin, Herbert Poetzl, Bill Davidsen,
Linux Kernel ML, Serge E. Hallyn
Chris Wright <chrisw@sous-sol.org> writes:
> * Sam Vilain (sam@vilain.net) wrote:
>> extern struct security_operations *security_ops; in
>> include/linux/security.h is the global I refer to.
>
> OK, I figured that's what you meant. The top-level ops are similar in
> nature to inode_ops in that there's not a real compelling reason to make
> them per process. The process context is (usually) available, and more
> importantly, the object whose access is being mediated is readily
> available with its security label.
>
>> There is likely to be some contention there between the security folk
>> who probably won't like the idea that your security module can be
>> different for different processes, and the people who want to provide
>> access to security modules on the systems they want to host or consolidate.
>
> I think the current setup would work fine. It's less likely that we'd
> want a separate security module for each container than simply policy
> that is container aware.
I think what we really want are stacked security modules.
I have not yet fully digested all of the requirements for multiple servers
on the same machine but increasingly the security aspects look
like a job for a security module.
Enforcing policies like container A cannot send signals to processes
in container B or something like that.
Then inside of each container we could have the code that implements
a containers internal security policy.
At least one implementation, Linux Jails by Serge E. Hallyn, was done
entirely with security modules, and the code was pretty minimal.
Eric
* Re: [RFC] Virtualization steps
2006-03-30 1:02 ` Eric W. Biederman
@ 2006-03-30 1:36 ` Chris Wright
2006-03-30 1:41 ` David Lang
` (2 more replies)
2006-03-30 2:24 ` Sam Vilain
1 sibling, 3 replies; 125+ messages in thread
From: Chris Wright @ 2006-03-30 1:36 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Chris Wright, Sam Vilain, Nick Piggin, Herbert Poetzl,
Bill Davidsen, Linux Kernel ML, Serge E. Hallyn
* Eric W. Biederman (ebiederm@xmission.com) wrote:
> Chris Wright <chrisw@sous-sol.org> writes:
>
> > * Sam Vilain (sam@vilain.net) wrote:
> >> extern struct security_operations *security_ops; in
> >> include/linux/security.h is the global I refer to.
> >
> > OK, I figured that's what you meant. The top-level ops are similar in
> > nature to inode_ops in that there's not a real compelling reason to make
> > them per process. The process context is (usually) available, and more
> > importantly, the object whose access is being mediated is readily
> > available with its security label.
> >
> >> There is likely to be some contention there between the security folk
> >> who probably won't like the idea that your security module can be
> >> different for different processes, and the people who want to provide
> >> access to security modules on the systems they want to host or consolidate.
> >
> > I think the current setup would work fine. It's less likely that we'd
> > want a separate security module for each container than simply policy
> > that is container aware.
>
> I think what we really want are stacked security modules.
I'm not convinced we need a new module for each container. The module
is a policy enforcement engine, so give it a container aware policy and
you shouldn't need another module.
> I have not yet fully digested all of the requirements for multiple servers
> on the same machine but increasingly the security aspects look
> like a job for a security module.
There's two primary security areas here. One is container level
isolation, which is the job of the container itself. Security modules
can effectively introduce containers, but w/out any notion of a virtual
environment (easy example, uts). With namespaces you get isolation w/out
any formal access control check, you simply can't find objects that aren't
in your namespace. The second is object level isolation (objects such
as files, processes, etc), standard access control checks that should
happen within a container. This can be handled quite naturally by the
security module.
> Enforcing policies like container A cannot send signals to processes
> in container B or something like that.
This is a question of visibility. One method of containment is via
LSM. This checks all object access against a label that's aware of
container ids to disallow inter-container, well, anything. However,
if a namespace would mean you simply can't find those other processes,
then there's no need for the LSM side except for intra-container.
> Then inside of each container we could have the code that implements
> a containers internal security policy.
Right, and that's doable as a single top-level policy. It's a bit
interesting when you want to be able to specify policy from within a
container (e.g. virtual hosting), granted.
> At least one implementation, Linux Jails by Serge E. Hallyn, was done
> entirely with security modules, and the code was pretty minimal.
Yes, although the networking area was something that looked better done
via namespaces (at least that's my recollection of my conversations with
Serge on that one a few years back).
thanks,
-chris
* Re: [RFC] Virtualization steps
2006-03-30 1:36 ` Chris Wright
@ 2006-03-30 1:41 ` David Lang
2006-03-30 2:04 ` Chris Wright
2006-03-30 2:48 ` Eric W. Biederman
2006-03-30 13:29 ` Serge E. Hallyn
2 siblings, 1 reply; 125+ messages in thread
From: David Lang @ 2006-03-30 1:41 UTC (permalink / raw)
To: Chris Wright
Cc: Eric W. Biederman, Sam Vilain, Nick Piggin, Herbert Poetzl,
Bill Davidsen, Linux Kernel ML, Serge E. Hallyn
On Wed, 29 Mar 2006, Chris Wright wrote:
>
> * Eric W. Biederman (ebiederm@xmission.com) wrote:
>> Chris Wright <chrisw@sous-sol.org> writes:
>>
>>> * Sam Vilain (sam@vilain.net) wrote:
>>>> extern struct security_operations *security_ops; in
>>>> include/linux/security.h is the global I refer to.
>>>
>>> OK, I figured that's what you meant. The top-level ops are similar in
>>> nature to inode_ops in that there's not a real compelling reason to make
>>> them per process. The process context is (usually) available, and more
>>> importantly, the object whose access is being mediated is readily
>>> available with its security label.
>>>
>>>> There is likely to be some contention there between the security folk
>>>> who probably won't like the idea that your security module can be
>>>> different for different processes, and the people who want to provide
>>>> access to security modules on the systems they want to host or consolidate.
>>>
>>> I think the current setup would work fine. It's less likely that we'd
>>> want a separate security module for each container than simply policy
>>> that is container aware.
>>
>> I think what we really want are stacked security modules.
>
> I'm not convinced we need a new module for each container. The module
> is a policy enforcement engine, so give it a container aware policy and
> you shouldn't need another module.
what if the people administering the container are different from the
people administering the host?
in that case the people working in the container want to be able to
implement and change their own policy, and the people working on the host
don't want to have to implement changes to their main policy config (with
all the auditing that would be involved) every time a container
wants to change its internal policy.
I can definitely see where a container-aware policy on the master would be
useful, but I can also see where the ability to nest separate policies
would be useful.
David Lang
* Re: [RFC] Virtualization steps
2006-03-30 1:41 ` David Lang
@ 2006-03-30 2:04 ` Chris Wright
2006-03-30 14:32 ` Serge E. Hallyn
0 siblings, 1 reply; 125+ messages in thread
From: Chris Wright @ 2006-03-30 2:04 UTC (permalink / raw)
To: David Lang
Cc: Chris Wright, Eric W. Biederman, Sam Vilain, Nick Piggin,
Herbert Poetzl, Bill Davidsen, Linux Kernel ML, Serge E. Hallyn
* David Lang (dlang@digitalinsight.com) wrote:
> what if the people administering the container are different from the
> people administering the host?
Yes, I alluded to that.
> in that case the people working in the container want to be able to
> implement and change their own policy, and the people working on the host
> don't want to have to implement changes to their main policy config (with
> all the auditing that would be involved) every time a container
> wants to change its internal policy.
*nod*
> I can definitely see where a container-aware policy on the master would be
> useful, but I can also see where the ability to nest separate policies
> would be useful.
This is all fine. The question is whether this is a policy management
issue or a kernel infrastructure issue. So far, it's not clear that this
really necessitates kernel infrastructure changes to support
container-aware policies loaded by the physical host admin/owner or the
virtual host admin. The place where it breaks down is if each virtual
host wants to control not only its own policy but also its security
model. Then we are left with stacking modules or heavier isolation (as
in Xen).
thanks,
-chris
* Re: [RFC] Virtualization steps
2006-03-30 1:02 ` Eric W. Biederman
2006-03-30 1:36 ` Chris Wright
@ 2006-03-30 2:24 ` Sam Vilain
2006-03-30 3:01 ` Eric W. Biederman
1 sibling, 1 reply; 125+ messages in thread
From: Sam Vilain @ 2006-03-30 2:24 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Chris Wright, Nick Piggin, Herbert Poetzl, Bill Davidsen,
Linux Kernel ML, Serge E. Hallyn
Eric W. Biederman wrote:
>I think what we really want are stacked security modules.
>
>I have not yet fully digested all of the requirements for multiple servers
>on the same machine but increasingly the security aspects look
>like a job for a security module.
>
>Enforcing policies like container A cannot send signals to processes
>in container B or something like that.
>
>
We could even end up making security modules to implement standard Unix
security, i.e., which processes can send any signal to other processes.
Why hardcode the (!sender.user_id || (sender.user_id == target.user_id)
) rule at all? That rule should be the default rule in a security module
chain.
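As an illustration, the default rule quoted above can be expressed as a plain predicate (a userspace sketch with made-up struct and field names, not the kernel's actual task_struct layout; a real kernel also consults saved/effective uids and capabilities):

```c
#include <stdbool.h>

/* Userspace sketch of the default Unix signal-permission rule:
 * root (uid 0) may signal anyone; otherwise the sender's and
 * target's uids must match.  All names here are illustrative. */
struct task {
    unsigned int user_id;
};

static bool may_signal(const struct task *sender, const struct task *target)
{
    return sender->user_id == 0 || sender->user_id == target->user_id;
}
```

In a security-module chain, a predicate like this would simply be the last (default) hook consulted.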
I just think that doing it this way is the wrong way around, but I guess
I'm hardly qualified to speak on this. Aren't security modules supposed
to be for custom security policy, not standard system semantics?
Sam.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 1:36 ` Chris Wright
2006-03-30 1:41 ` David Lang
@ 2006-03-30 2:48 ` Eric W. Biederman
2006-03-30 19:23 ` Chris Wright
2006-03-30 13:29 ` Serge E. Hallyn
2 siblings, 1 reply; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-30 2:48 UTC (permalink / raw)
To: Chris Wright
Cc: Sam Vilain, Nick Piggin, Herbert Poetzl, Bill Davidsen,
Linux Kernel ML, Serge E. Hallyn
Chris Wright <chrisw@sous-sol.org> writes:
>> At least one implementation Linux Jails by Serge E. Hallyn was done completely
>> with security modules, and the code was pretty minimal.
>
> Yes, although the networking area was something that looked better done
> via namespaces (at least that's my recollection of my conversations with
> Serge on that one a few years back).
For general networking yes the namespace flavor seems to be the sane
way to do it.
As I currently understand the problem, everything goes along nicely,
with nothing really special needed, until you start asking the question:
how do I implement a root user with uid 0 who does not own the machine?
That is when the creepy crawlies come out.
On most virtual filesystems the default owner of files is uid 0.
Additional privilege checks are not applied. Writing to those
files could potentially have global effect.
It is completely unclear how permissions checks should work
between two processes in different uid namespaces. Especially
there are cases where you do want interactions.
If every guest/container/jail is configured so that uids with the same
value mean the same user, there are no security issues even when they
interact through the imperfect isolation. So my gut feeling is to
postpone a bunch of these problems and say that making uids non-global
is a security module issue.
Eric
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 2:24 ` Sam Vilain
@ 2006-03-30 3:01 ` Eric W. Biederman
0 siblings, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-30 3:01 UTC (permalink / raw)
To: Sam Vilain
Cc: Chris Wright, Nick Piggin, Herbert Poetzl, Bill Davidsen,
Linux Kernel ML, Serge E. Hallyn
Sam Vilain <sam@vilain.net> writes:
>
> We could even end up making security modules to implement standard unix
> security. ie, which processes can send any signal to other processes.
> Why hardcode the (!sender.user_id || (sender.user_id == target.user_id)
> ) rule at all? That rule should be the default rule in a security module
> chain.
>
> I just think that doing it this way is the wrong way around, but I guess
> I'm hardly qualified to speak on this. Aren't security modules supposed
> to be for custom security policy, not standard system semantics ?
It is simply my contention that you get into at least a semi-custom
configuration when you have multiple users with the same uid.
Especially when that uid == 0.
For guests you have to change the rule about what permissions
a setuid root executable gets, or else it will have CAP_MKNOD
and CAP_SYS_RAWIO. (Unless I didn't read that code right.)
Plus all of the /proc and sysfs issues.
Now perhaps we can sit down and figure out how to get completely
isolated and only allow a new uid namespace when that is
the case, but that doesn't sound too interesting.
So at least until I can imagine what the semantics of a new uid
namespace are when we don't have perfect isolation that feels
like a job for a security module.
Eric
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-29 6:05 ` Eric W. Biederman
2006-03-29 6:19 ` Sam Vilain
@ 2006-03-30 3:26 ` Nick Piggin
2006-03-30 10:30 ` Eric W. Biederman
2006-04-11 10:32 ` Kirill Korotaev
1 sibling, 2 replies; 125+ messages in thread
From: Nick Piggin @ 2006-03-30 3:26 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Herbert Poetzl, Bill Davidsen, Linux Kernel ML
Eric W. Biederman wrote:
>Nick Piggin <nickpiggin@yahoo.com.au> writes:
>
>
>>I don't think I could give a complete answer...
>>I guess it could be stated as the increase in the complexity of
>>the rest of the code for someone who doesn't know anything about
>>the virtualization implementation.
>>
>>Completely non intrusive is something like 2 extra function calls
>>to/from generic code, changes to data structures are transparent
>>(or have simple wrappers), and there is no shared locking or data
>>with the rest of the kernel. And it goes up from there.
>>
>>Anyway I'm far from qualified... I just hope that with all the
>>work you guys are putting in that you'll be able to justify it ;)
>>
>
>As I have been able to survey the work, the most common case
>is replacing a global variable with a variable we lookup via
>current.
>
>That plus using the security module infrastructure you can
>implement the semantics pretty in a straight forward manner.
>
>The only really intrusive part is that because we tickle the
>code differently we see a different set of problems. Such
>as the mess that is the proc and sysctl code, and the lack of
>good resource limits.
>
>But none of that is inherent to the problem it is just when
>you use the kernel harder and have more untrusted users you
>see a different set of problems.
>
>
Yes... about that; if/when namespaces get into the kernel, you guys
are going to start pushing all sorts of per-container resource
control, right? Or will you be happy to leave most of that to VMs?
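The pattern Eric describes in the quoted text (replacing a global variable with a lookup through current) can be sketched in userspace roughly as follows; the names here are illustrative, loosely modeled on the utsname patches under discussion, not the actual kernel structures:

```c
#include <string.h>

/* Userspace model of the "global becomes per-task" transformation:
 * instead of one global hostname, each task points at a namespace
 * object and the value is read through the current task. */
struct uts_namespace {
    char nodename[65];
};

struct task {
    struct uts_namespace *uts_ns;
};

static struct task *current_task;          /* stand-in for the kernel's `current` */

/* old code would return a global (e.g. system_utsname.nodename) */
static const char *get_nodename(void)
{
    return current_task->uts_ns->nodename; /* new code: per-namespace lookup */
}
```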
--
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 3:26 ` Nick Piggin
@ 2006-03-30 10:30 ` Eric W. Biederman
2006-04-11 10:32 ` Kirill Korotaev
1 sibling, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-30 10:30 UTC (permalink / raw)
To: Nick Piggin; +Cc: Herbert Poetzl, Bill Davidsen, Linux Kernel ML
Nick Piggin <nickpiggin@yahoo.com.au> writes:
> Yes... about that; if/when namespaces get into the kernel, you guys
> are going to start pushing all sorts of per-container resource
> control, right? Or will you be happy to leave most of that to VMs?
That will certainly be an aspect of it, and that is one of the
pieces of the ongoing discussion. The out of tree implementations
already do this.
What flavor of resource limits these will be I don't know. That
is a part of the discussion we are just coming to now.
Eric
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 1:36 ` Chris Wright
2006-03-30 1:41 ` David Lang
2006-03-30 2:48 ` Eric W. Biederman
@ 2006-03-30 13:29 ` Serge E. Hallyn
2006-03-30 13:37 ` Eric W. Biederman
2 siblings, 1 reply; 125+ messages in thread
From: Serge E. Hallyn @ 2006-03-30 13:29 UTC (permalink / raw)
To: Chris Wright
Cc: Eric W. Biederman, Sam Vilain, Nick Piggin, Herbert Poetzl,
Bill Davidsen, Linux Kernel ML, Serge E. Hallyn
Quoting Chris Wright (chrisw@sous-sol.org):
> * Eric W. Biederman (ebiederm@xmission.com) wrote:
> > At least one implementation Linux Jails by Serge E. Hallyn was done completely
> > with security modules, and the code was pretty minimal.
>
> Yes, although the networking area was something that looked better done
> via namespaces (at least that's my recollection of my conversations with
> Serge on that one a few years back).
Yes, namespaces would be better - just as the file system isolation was
moved from a "strong chroot" approach to using pivot-root. Though note
that vserver still uses basically the method that bsdjail uses, and my
two attempts at getting network namespaces considered in the kernel so
far were dismal failures. Hopefully this time we've got some better,
more network-savvy minds on the task :)
-serge
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 13:29 ` Serge E. Hallyn
@ 2006-03-30 13:37 ` Eric W. Biederman
2006-03-30 14:55 ` Serge E. Hallyn
0 siblings, 1 reply; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-30 13:37 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Chris Wright, Sam Vilain, Nick Piggin, Herbert Poetzl,
Bill Davidsen, Linux Kernel ML
"Serge E. Hallyn" <serue@us.ibm.com> writes:
> Quoting Chris Wright (chrisw@sous-sol.org):
>> * Eric W. Biederman (ebiederm@xmission.com) wrote:
>> > At least one implementation Linux Jails by Serge E. Hallyn was done
> completely
>> > with security modules, and the code was pretty minimal.
>>
>> Yes, although the networking area was something that looked better done
>> via namespaces (at least that's my recollection of my conversations with
>> Serge on that one a few years back).
>
> Yes, namespaces would be better - just as the file system isolation was
> moved from a "strong chroot" approach to using pivot-root. Though note
> that vserver still uses basically the method that bsdjail uses, and my
> two attempts at getting network namespaces considered in the kernel so
> far were dismal failures. Hopefully this time we've got some better,
> more network-savvy minds on the task :)
Any pointers to those old discussions?
I'm curious why your network namespace attempts were dismal failures.
Everyone ignored the patch?
Eric
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-29 20:30 ` Dave Hansen
2006-03-29 20:47 ` Eric W. Biederman
2006-03-29 22:44 ` Sam Vilain
@ 2006-03-30 13:51 ` Kirill Korotaev
2 siblings, 0 replies; 125+ messages in thread
From: Kirill Korotaev @ 2006-03-30 13:51 UTC (permalink / raw)
To: Dave Hansen
Cc: Eric W. Biederman, linux-kernel, devel, serue, akpm, sam,
Alexey Kuznetsov, Pavel Emelianov, Stanislav Protassov
OK. This is also easier for us, as it is the usual way of doing things in
OpenVZ. We'll see...
> On Tue, 2006-03-28 at 12:51 +0400, Kirill Korotaev wrote:
>> Eric, we have a GIT repo on openvz.org already:
>> http://git.openvz.org
>
> Git is great for getting patches and lots of updates out, but I'm not
> sure it is ideal for what we're trying to do. We'll need things reviewed
> at each step, especially because we're going to be touching so much
> common code.
>
> I'd guess set of quilt (or patch-utils) patches is probably best,
> especially if we're trying to get stuff into -mm first.
>
> -- Dave
>
>
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 2:04 ` Chris Wright
@ 2006-03-30 14:32 ` Serge E. Hallyn
2006-03-30 15:30 ` Herbert Poetzl
` (3 more replies)
0 siblings, 4 replies; 125+ messages in thread
From: Serge E. Hallyn @ 2006-03-30 14:32 UTC (permalink / raw)
To: Chris Wright
Cc: David Lang, Eric W. Biederman, Sam Vilain, Nick Piggin,
Herbert Poetzl, Bill Davidsen, Linux Kernel ML, Serge E. Hallyn
Quoting Chris Wright (chrisw@sous-sol.org):
> * David Lang (dlang@digitalinsight.com) wrote:
> > what if the people administering the container are different from the
> > people administering the host?
>
> Yes, I alluded to that.
>
> > in that case the people working in the container want to be able to
> > implement and change their own policy, and the people working on the host
> > don't want to have to implement changes to their main policy config (with
> > all the auditing that would be involved with it) every time a container
> > wants to change its internal policy.
>
> *nod*
>
> > I can definitely see where a container aware policy on the master would be
> > useful, but I can also see where the ability to nest separate policies
> > would be useful.
>
> This is all fine. The question is whether this is a policy management
> issue or a kernel infrastructure issue. So far, it's not clear that this
> really necessitates kernel infrastructure changes to support container
> aware policies to be loaded by physical host admin/owner or the virtual
> host admin. The place where it breaks down is if each virtual host
> wants not only to control its own policy, but also its security model.
What do you define as 'policy', and how is it different from the
security model?
> Then we are left with stacking modules or heavier isolation (as in Xen).
Hmm, talking about 'container' in this sense is confusing, because we're
not yet clear on what a container is.
So I'm trying to get a handle on what we really want to do.
Talking about namespaces is tricky. For instance if I do
clone(CLONE_NEWNS), the new process is in a new fs namespace, but the fs
objects are still the same, so if it loads an LSM, then perhaps at most
the new process should only control mount activities in its own
namespace.
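For concreteness, the clone(CLONE_NEWNS) case mentioned above can also be exercised from an existing process via unshare(); a minimal sketch (creating a mount namespace requires CAP_SYS_ADMIN, so an unprivileged caller gets EPERM):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <errno.h>

#ifndef CLONE_NEWNS
#define CLONE_NEWNS 0x00020000  /* value from <linux/sched.h> */
#endif

int unshare(int flags);  /* prototype, in case _GNU_SOURCE was set too late */

/* Move the calling process into a new filesystem (mount) namespace,
 * the same isolation clone(CLONE_NEWNS) gives a child.  Without
 * CAP_SYS_ADMIN this is refused, which is central to the question of
 * what a container root should be allowed to do. */
static int enter_new_mount_ns(void)
{
    if (unshare(CLONE_NEWNS) == -1)
        return -errno;  /* typically -EPERM for an unprivileged caller */
    return 0;
}
```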
Frankly I thought, and still think, that containers owned by someone
other than the system owner would/should never want to load their own
LSMs, so that this wasn't a problem. Isolation, as Chris has mentioned,
would be taken care of by the very nature of namespaces.
There are of course two alternatives... First, we might want to allow the
machine admin to insert per-container/per-namespace LSMs. To support
this case, we would need a way for the admin to tag a container some way
identifying it as being subject to a particular set of security_ops.
Second, we might want container admins to insert LSMs. In addition to
a straightforward way of tagging subjects/objects with their container,
we'd need to implement at least permissions for "may insert global LSM",
"may insert container LSM", and "may not insert any LSM." This might be
sufficient if we trust userspace to always create full containers.
Otherwise we might want to support meta-policy along the lines of "may
authorize ptrace and mount hooks only", or even "not subject to the
global inode_permission hook, and may create its own." (yuck)
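The three permission levels enumerated above could be modeled as a simple lattice check; the names here are purely hypothetical, sketching the idea rather than any existing interface:

```c
#include <stdbool.h>

/* Hypothetical encoding of the three LSM-insertion permissions:
 * none, container-local only, or global.  A global insert requires
 * the global permission; a container-local insert is allowed at
 * either of the two higher levels. */
enum lsm_insert_perm {
    LSM_INSERT_NONE,
    LSM_INSERT_CONTAINER,
    LSM_INSERT_GLOBAL,
};

static bool may_insert_lsm(enum lsm_insert_perm perm, bool global_insert)
{
    if (global_insert)
        return perm == LSM_INSERT_GLOBAL;
    return perm != LSM_INSERT_NONE;
}
```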
But so much of this depends on how the namespaces/containers end up
being implemented...
-serge
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 13:37 ` Eric W. Biederman
@ 2006-03-30 14:55 ` Serge E. Hallyn
0 siblings, 0 replies; 125+ messages in thread
From: Serge E. Hallyn @ 2006-03-30 14:55 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Serge E. Hallyn, Chris Wright, Sam Vilain, Nick Piggin,
Herbert Poetzl, Bill Davidsen, Linux Kernel ML
Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>
> > Quoting Chris Wright (chrisw@sous-sol.org):
> >> * Eric W. Biederman (ebiederm@xmission.com) wrote:
> >> > At least one implementation Linux Jails by Serge E. Hallyn was done
> > completely
> >> > with security modules, and the code was pretty minimal.
> >>
> >> Yes, although the networking area was something that looked better done
> >> via namespaces (at least that's my recollection of my conversations with
> >> Serge on that one a few years back).
> >
> > Yes, namespaces would be better - just as the file system isolation was
> > moved from a "strong chroot" approach to using pivot-root. Though note
> > that vserver still uses basically the method that bsdjail uses, and my
> > two attempts at getting network namespaces considered in the kernel so
> > far were dismal failures. Hopefully this time we've got some better,
> > more network-savvy minds on the task :)
>
> Any pointers to those old discussions?
I can only find the one.
http://marc.theaimsgroup.com/?l=linux-netdev&m=109837694221901&w=2
I thought I'd sent one earlier than this too. Maybe I just got ready to
resend a new version, then decided the code quality wasn't worth it.
> I'm curious why your network namespace attempts were dismal failures.
Ok, I guess "dismal failure" most aptly applies to the patch itself :)
> Everyone ignored the patch?
Well, there was that. Then I briefly tried to rework the patch, but
just ran out of time, and have kept this on my todo list ever since,
but never really gotten back to it. At last it looks like this may
finally be coming back up.
-serge
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 14:32 ` Serge E. Hallyn
@ 2006-03-30 15:30 ` Herbert Poetzl
2006-03-30 16:43 ` Serge E. Hallyn
2006-03-30 18:00 ` Eric W. Biederman
2006-03-30 16:07 ` Stephen Smalley
` (2 subsequent siblings)
3 siblings, 2 replies; 125+ messages in thread
From: Herbert Poetzl @ 2006-03-30 15:30 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Chris Wright, David Lang, Eric W. Biederman, Sam Vilain,
Nick Piggin, Bill Davidsen, Linux Kernel ML
On Thu, Mar 30, 2006 at 08:32:24AM -0600, Serge E. Hallyn wrote:
> Quoting Chris Wright (chrisw@sous-sol.org):
> > * David Lang (dlang@digitalinsight.com) wrote:
> > > what if the people administering the container are different from the
> > > people administering the host?
> >
> > Yes, I alluded to that.
> >
> > > in that case the people working in the container want to be able to
> > > implement and change their own policy, and the people working on the host
> > > don't want to have to implement changes to their main policy config (with
> > > all the auditing that would be involved with it) every time a container
> > > wants to change its internal policy.
> >
> > *nod*
> >
> > > I can definitely see where a container aware policy on the master would be
> > > useful, but I can also see where the ability to nest separate policies
> > > would be useful.
> >
> > This is all fine. The question is whether this is a policy management
> > issue or a kernel infrastructure issue. So far, it's not clear that this
> > really necessitates kernel infrastructure changes to support container
> > aware policies to be loaded by physical host admin/owner or the virtual
> > host admin. The place where it breaks down is if each virtual host
> > wants not only to control its own policy, but also its security model.
>
> What do you define as 'policy', and how is it different from the
> security model?
>
> > Then we are left with stacking modules or heavier isolation (as in Xen).
>
> Hmm, talking about 'container' in this sense is confusing, because we're
> not yet clear on what a container is.
>
> So I'm trying to get a handle on what we really want to do.
>
> Talking about namespaces is tricky. For instance if I do
> clone(CLONE_NEWNS), the new process is in a new fs namespace, but the
> fs objects are still the same, so if it loads an LSM, then perhaps at
> most the new process should only control mount activities in its own
> namespace.
>
> Frankly I thought, and still think, that containers owned by someone
> other than the system owner would/should never want to load their own
> LSMs, so that this wasn't a problem. Isolation, as Chris has mentioned,
> would be taken care of by the very nature of namespaces.
>
> There are of course two alternatives... First, we might want to
> allow the machine admin to insert per-container/per-namespace LSMs.
> To support this case, we would need a way for the admin to tag a
> container some way identifying it as being subject to a particular set
> of security_ops.
>
> Second, we might want container admins to insert LSMs. In addition
> to a straightforward way of tagging subjects/objects with their
> container, we'd need to implement at least permissions for "may insert
> global LSM", "may insert container LSM", and "may not insert any LSM."
> This might be sufficient if we trust userspace to always create full
> containers. Otherwise we might want to support meta-policy along the
> lines of "may authorize ptrace and mount hooks only", or even "not
> subject to the global inode_permission hook, and may create its own."
sorry folks, I don't think that we _ever_ want container
root to be able to load any kernel modules at any time
without having CAP_SYS_ADMIN or so, in which case the
modules can be global as well ... otherwise we end up
as a bad Xen imitation with a lot of security issues,
where it should be a security enhancement ...
best,
Herbert
> (yuck)
>
> But so much of this depends on how the namespaces/containers end up
> being implemented...
>
> -serge
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 14:32 ` Serge E. Hallyn
2006-03-30 15:30 ` Herbert Poetzl
@ 2006-03-30 16:07 ` Stephen Smalley
2006-03-30 16:15 ` Serge E. Hallyn
2006-03-30 18:55 ` Chris Wright
2006-03-30 18:44 ` Eric W. Biederman
2006-03-30 18:53 ` Chris Wright
3 siblings, 2 replies; 125+ messages in thread
From: Stephen Smalley @ 2006-03-30 16:07 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Chris Wright, David Lang, Eric W. Biederman, Sam Vilain,
Nick Piggin, Herbert Poetzl, Bill Davidsen, Linux Kernel ML
On Thu, 2006-03-30 at 08:32 -0600, Serge E. Hallyn wrote:
> Frankly I thought, and still think, that containers owned by someone
> other than the system owner would/should never want to load their own
> LSMs, so that this wasn't a problem. Isolation, as Chris has mentioned,
> would be taken care of by the very nature of namespaces.
>
> There are of course two alternatives... First, we might want to allow the
> machine admin to insert per-container/per-namespace LSMs. To support
> this case, we would need a way for the admin to tag a container some way
> identifying it as being subject to a particular set of security_ops.
>
> Second, we might want container admins to insert LSMs. In addition to
> a straightforward way of tagging subjects/objects with their container,
> we'd need to implement at least permissions for "may insert global LSM",
> "may insert container LSM", and "may not insert any LSM." This might be
> sufficient if we trust userspace to always create full containers.
> Otherwise we might want to support meta-policy along the lines of "may
> authorize ptrace and mount hooks only", or even "not subject to the
> global inode_permission hook, and may create its own." (yuck)
>
> But so much of this depends on how the namespaces/containers end up
> being implemented...
FWIW, SELinux now has a notion of a type hierarchy in its policy, so the
root admin can carve out a portion of the policy space and allow less
privileged admins to then define sub-types that are strictly constrained
by what was allowed to the parent type by the root admin. This is
handled in userspace, with the policy mediation performed by a userspace
agent (daemon, policy management server), which then becomes the focal
point for all policy loading.
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 16:07 ` Stephen Smalley
@ 2006-03-30 16:15 ` Serge E. Hallyn
2006-03-30 18:55 ` Chris Wright
1 sibling, 0 replies; 125+ messages in thread
From: Serge E. Hallyn @ 2006-03-30 16:15 UTC (permalink / raw)
To: Stephen Smalley
Cc: Serge E. Hallyn, Chris Wright, David Lang, Eric W. Biederman,
Sam Vilain, Nick Piggin, Herbert Poetzl, Bill Davidsen,
Linux Kernel ML
Quoting Stephen Smalley (sds@tycho.nsa.gov):
> On Thu, 2006-03-30 at 08:32 -0600, Serge E. Hallyn wrote:
> > Frankly I thought, and still think, that containers owned by someone
> > other than the system owner would/should never want to load their own
> > LSMs, so that this wasn't a problem. Isolation, as Chris has mentioned,
> > would be taken care of by the very nature of namespaces.
> >
> > There are of course two alternatives... First, we might want to allow the
> > machine admin to insert per-container/per-namespace LSMs. To support
> > this case, we would need a way for the admin to tag a container some way
> > identifying it as being subject to a particular set of security_ops.
> >
> > Second, we might want container admins to insert LSMs. In addition to
> > a straightforward way of tagging subjects/objects with their container,
> > we'd need to implement at least permissions for "may insert global LSM",
> > "may insert container LSM", and "may not insert any LSM." This might be
> > sufficient if we trust userspace to always create full containers.
> > Otherwise we might want to support meta-policy along the lines of "may
> > authorize ptrace and mount hooks only", or even "not subject to the
> > global inode_permission hook, and may create its own." (yuck)
> >
> > But so much of this depends on how the namespaces/containers end up
> > being implemented...
>
> FWIW, SELinux now has a notion of a type hierarchy in its policy, so the
> root admin can carve out a portion of the policy space and allow less
> privileged admins to then define sub-types that are strictly constrained
> by what was allowed to the parent type by the root admin. This is
> handled in userspace, with the policy mediation performed by a userspace
> agent (daemon, policy management server), which then becomes the focal
> point for all policy loading.
Yes, my first response (which I cancelled) mentioned this as a possible
solution.
The global admin could assign certain max privileges to 'container_b'.
The admin in container_b could create container_b.root_t,
container_b.user_t, etc, which would be limited by the container_b
max perms.
Presumably the policy daemon, running in container 0, could accept input
from a socket from container 2, labeled appropriately, automatically
ensuring that all types created by the policy in container 2 are
prefixed with container_b, and applying the obvious restrictions.
Or something like that :)
-serge
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 15:30 ` Herbert Poetzl
@ 2006-03-30 16:43 ` Serge E. Hallyn
2006-03-30 18:00 ` Eric W. Biederman
1 sibling, 0 replies; 125+ messages in thread
From: Serge E. Hallyn @ 2006-03-30 16:43 UTC (permalink / raw)
To: Chris Wright, David Lang, Eric W. Biederman, Sam Vilain,
Nick Piggin, Bill Davidsen, Linux Kernel ML, Stephen Smalley
Quoting Herbert Poetzl (herbert@13thfloor.at):
> On Thu, Mar 30, 2006 at 08:32:24AM -0600, Serge E. Hallyn wrote:
> > Quoting Chris Wright (chrisw@sous-sol.org):
> > > * David Lang (dlang@digitalinsight.com) wrote:
> > > > what if the people administering the container are different from the
> > > > people administering the host?
> > >
> > > Yes, I alluded to that.
> > >
> > > > in that case the people working in the container want to be able to
> > > > implement and change their own policy, and the people working on the host
> > > > don't want to have to implement changes to their main policy config (with
> > > > all the auditing that would be involved with it) every time a container
> > > > wants to change its internal policy.
> > >
> > > *nod*
> > >
> > > > I can definitely see where a container aware policy on the master would be
> > > > useful, but I can also see where the ability to nest separate policies
> > > > would be useful.
> > >
> > > This is all fine. The question is whether this is a policy management
> > > issue or a kernel infrastructure issue. So far, it's not clear that this
> > > really necessitates kernel infrastructure changes to support container
> > > aware policies to be loaded by physical host admin/owner or the virtual
> > > host admin. The place where it breaks down is if each virtual host
> > > wants not only to control its own policy, but also its security model.
> >
> > What do you define as 'policy', and how is it different from the
> > security model?
> >
> > > Then we are left with stacking modules or heavier isolation (as in Xen).
> >
> > Hmm, talking about 'container' in this sense is confusing, because we're
> > not yet clear on what a container is.
> >
> > So I'm trying to get a handle on what we really want to do.
> >
> > Talking about namespaces is tricky. For instance if I do
> > clone(CLONE_NEWNS), the new process is in a new fs namespace, but the
> > fs objects are still the same, so if it loads an LSM, then perhaps at
> > most the new process should only control mount activities in its own
> > namespace.
> >
> > Frankly I thought, and still think, that containers owned by someone
> > other than the system owner would/should never want to load their own
> > LSMs, so that this wasn't a problem. Isolation, as Chris has mentioned,
> > would be taken care of by the very nature of namespaces.
> >
> > There are of course two alternatives... First, we might want to
> > allow the machine admin to insert per-container/per-namespace LSMs.
> > To support this case, we would need a way for the admin to tag a
> > container some way identifying it as being subject to a particular set
> > of security_ops.
> >
> > Second, we might want container admins to insert LSMs. In addition
> > to a straightforward way of tagging subjects/objects with their
> > container, we'd need to implement at least permissions for "may insert
> > global LSM", "may insert container LSM", and "may not insert any LSM."
> > This might be sufficient if we trust userspace to always create full
> > containers. Otherwise we might want to support meta-policy along the
> > lines of "may authorize ptrace and mount hooks only", or even "not
> > subject to the global inode_permission hook, and may create its own."
>
> sorry folks, I don't think that we _ever_ want container
> root to be able to load any kernel modules at any time
> without having CAP_SYS_ADMIN or so, in which case the
> modules can be global as well ... otherwise we end up
> as a bad Xen imitation with a lot of security issues,
> where it should be a security enhancement ...
I agree. As Chris points out, at most we should help LSM become
container-aware. But as the selinux example shows, even that should
not be necessary.
And that's for funky setups. For normal setups, the isolation provided
inherently by the namespaces should suffice.
thanks,
-serge
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 15:30 ` Herbert Poetzl
2006-03-30 16:43 ` Serge E. Hallyn
@ 2006-03-30 18:00 ` Eric W. Biederman
2006-03-31 13:40 ` Serge E. Hallyn
1 sibling, 1 reply; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-30 18:00 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Chris Wright, David Lang, Sam Vilain, Nick Piggin, Bill Davidsen,
Linux Kernel ML
Herbert Poetzl <herbert@13thfloor.at> writes:
> sorry folks, I don't think that we _ever_ want container
> root to be able to load any kernel modules at any time
> without having CAP_SYS_ADMIN or so, in which case the
> modules can be global as well ... otherwise we end up
> as a bad Xen imitation with a lot of security issues,
> where it should be a security enhancement ...
Agreed. At least until someone defines a user-mode
linux-security-module. We may want a different security module
in effect for a particular guest. Which modules you get
being defined by the one system administrator is fine.
The primary case I see worth worrying about is using
a security module to ensure isolation of a container,
while still providing the selinux mandatory capabilities
to a container.
Eric
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 14:32 ` Serge E. Hallyn
2006-03-30 15:30 ` Herbert Poetzl
2006-03-30 16:07 ` Stephen Smalley
@ 2006-03-30 18:44 ` Eric W. Biederman
2006-03-30 19:07 ` Chris Wright
2006-03-30 18:53 ` Chris Wright
3 siblings, 1 reply; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-30 18:44 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Chris Wright, David Lang, Sam Vilain, Nick Piggin, Herbert Poetzl,
Bill Davidsen, Linux Kernel ML
"Serge E. Hallyn" <serue@us.ibm.com> writes:
> Frankly I thought, and still believe, that containers owned
> by someone other than the system owner would/should never want to load
> their own LSMs, so that this wasn't a problem. Isolation, as Chris has
> mentioned, would be taken care of by the very nature of namespaces.
Up to uids I agree. Once we hit uids things get very ugly.
And since security modules already seem to touch all of the places
we need to touch to make a UID namespace work, I think it makes sense
to use security modules to implement the strange things we need with uids.
To ensure uid isolation we would need a different copy of every other
namespace. The pid space would need to be completely isolated,
and we couldn't share any filesystem mounts with any other namespace.
This especially includes /proc and sysfs.
> There are of course two alternatives... First, we might want to allow the
> machine admin to insert per-container/per-namespace LSMs. To support
> this case, we would need a way for the admin to tag a container some way
> identifying it as being subject to a particular set of security_ops.
>
> Second, we might want container admins to insert LSMs. In addition to
> a straightforward way of tagging subjects/objects with their container,
> we'd need to implement at least permissions for "may insert global LSM",
> "may insert container LSM", and "may not insert any LSM." This might be
> sufficient if we trust userspace to always create full containers.
> Otherwise we might want to support meta-policy along the lines of "may
> authorize ptrace and mount hooks only", or even "not subject to the
> global inode_permission hook, and may create its own." (yuck)
Security modules that are stackable call mod_reg_security.
Currently we have root_plug, seclvl, and the capability module
implementing this. selinux currently only supports running
as the global security policy.
Allowing a different administrator to load modules is out of
the question, if we actually care about security.
However it is possible to build the capacity to multiplex
compiled-in or already loaded security modules, and to allow which
security modules are in effect to be controlled through securityfs.
With appropriate care we should be able to allow the container
administrator to use this capability to select which security
policies, and mechanisms they want.
That is something we probably want to consider anyway as
currently the security modules break the basic rule that
compiling code in should not affect how the kernel operates
by default.
Until we get to that point, simply specifying in the static
configuration of a container the name of a security module for the
container creation program to use should be enough.
> But so much of this depends on how the namespaces/containers end up
> being implemented...
Agreed. But if I hand-wave and say an upper-level security module will
decide when to call it, then only the details of that upper-level
security module are in question. The stacked module will likely just
work.
So I guess I am leaning towards a security namespace implemented with
an appropriate security module.
Eric
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 14:32 ` Serge E. Hallyn
` (2 preceding siblings ...)
2006-03-30 18:44 ` Eric W. Biederman
@ 2006-03-30 18:53 ` Chris Wright
3 siblings, 0 replies; 125+ messages in thread
From: Chris Wright @ 2006-03-30 18:53 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Chris Wright, David Lang, Eric W. Biederman, Sam Vilain,
Nick Piggin, Herbert Poetzl, Bill Davidsen, Linux Kernel ML
* Serge E. Hallyn (serue@us.ibm.com) wrote:
> Quoting Chris Wright (chrisw@sous-sol.org):
> > This is all fine. The question is whether this is a policy management
> > issue or a kernel infrastructure issue. So far, it's not clear that this
> > really necessitates kernel infrastructure changes to support container
> > aware policies to be loaded by physical host admin/owner or the virtual
> > host admin. The place where it breaks down is if each virtual host
> > wants not only to control its own policy, but also its security model.
>
> What do you define as 'policy', and how is it different from the
> security model?
Model, as in TE, RBAC, or something trivially simple ala Openwall type
protection. Policy, as in rules to drive the model.
> Second, we might want container admins to insert LSMs.
I think we can agree that this way lies madness.
thanks,
-chris
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 16:07 ` Stephen Smalley
2006-03-30 16:15 ` Serge E. Hallyn
@ 2006-03-30 18:55 ` Chris Wright
1 sibling, 0 replies; 125+ messages in thread
From: Chris Wright @ 2006-03-30 18:55 UTC (permalink / raw)
To: Stephen Smalley
Cc: Serge E. Hallyn, Chris Wright, David Lang, Eric W. Biederman,
Sam Vilain, Nick Piggin, Herbert Poetzl, Bill Davidsen,
Linux Kernel ML
* Stephen Smalley (sds@tycho.nsa.gov) wrote:
> FWIW, SELinux now has a notion of a type hierarchy in its policy, so the
> root admin can carve out a portion of the policy space and allow less
> privileged admins to then define sub-types that are strictly constrained
> by what was allowed to the parent type by the root admin. This is
> handled in userspace, with the policy mediation performed by a userspace
> agent (daemon, policy management server), which then becomes the focal
> point for all policy loading.
*nod* this is exactly what I was thinking in terms of a container
specifying policy. It goes through the system/root container and gets
validated before being loaded.
thanks,
-chris
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 18:44 ` Eric W. Biederman
@ 2006-03-30 19:07 ` Chris Wright
2006-03-31 5:36 ` Eric W. Biederman
0 siblings, 1 reply; 125+ messages in thread
From: Chris Wright @ 2006-03-30 19:07 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Serge E. Hallyn, Chris Wright, David Lang, Sam Vilain,
Nick Piggin, Herbert Poetzl, Bill Davidsen, Linux Kernel ML
* Eric W. Biederman (ebiederm@xmission.com) wrote:
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>
> > Frankly I thought, and still believe, that containers owned
> > by someone other than the system owner would/should never want to load
> > their own LSMs, so that this wasn't a problem. Isolation, as Chris has
> > mentioned, would be taken care of by the very nature of namespaces.
>
> Up to uids I agree. Once we hit uids things get very ugly.
> And since security modules already seem to touch all of the places
> we need to touch to make a UID namespace work, I think it makes sense
> to use security modules to implement the strange things we need with uids.
>
> To ensure uid isolation we would need a different copy of every other
> namespace. The pid space would need to be completely isolated,
> and we couldn't share any filesystem mounts with any other namespace.
> This especially includes /proc and sysfs.
Security modules use labels, not uids. The uid is the basis for
traditional DAC checks; labels are used for MAC checks. And it's
easy to imagine a label that includes a notion of container id.
> However it is possible to build the capacity to multiplex
> compiled-in or already loaded security modules, and to allow which
> security modules are in effect to be controlled through securityfs.
Yes, it's been proposed and discussed many times. There are some
fundamental issues with composing security modules. First and foremost
is the notion that arbitrary security models may not compose to form
meaningful (in a security sense) results. Second, at an implementation
level, sharing labels is non-trivial and comes with overhead.
> With appropriate care we should be able to allow the container
> administrator to use this capability to select which security
> policies, and mechanisms they want.
>
> That is something we probably want to consider anyway as
> currently the security modules break the basic rule that
> compiling code in should not affect how the kernel operates
> by default.
Don't follow you on this one.
thanks,
-chris
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 2:48 ` Eric W. Biederman
@ 2006-03-30 19:23 ` Chris Wright
2006-03-31 6:00 ` Eric W. Biederman
0 siblings, 1 reply; 125+ messages in thread
From: Chris Wright @ 2006-03-30 19:23 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Chris Wright, Sam Vilain, Nick Piggin, Herbert Poetzl,
Bill Davidsen, Linux Kernel ML, Serge E. Hallyn
* Eric W. Biederman (ebiederm@xmission.com) wrote:
> As I currently understand the problem everything goes along nicely
> nothing really special needed until you start asking the question
> how do I implement a root user with uid 0 who does not own the
> machine. When you start asking that question is when the creepy
> crawlies come out.
Hehe. uid 0 _and_ full capabilities. So reducing capabilities is one
relatively easy way to handle that. And, if you have a security module
loaded it's going to use security labels, which can be richer than both
uid and capabilities combined.
> On most virtual filesystems the default owner of files is uid 0.
> Additional privilege checks are not applied. Writing to those
> files could potentially have global effect.
Yes, many (albeit far from all) have a capable() check as well.
> It is completely unclear how permissions checks should work
> between two processes in different uid namespaces. Especially
> there are cases where you do want interactions.
Are there? Why put them in different containers then? I'd think
network sockets is the extent of the interaction you'd want. Sharing
a filesystem does leave room for named pipes and unix domain sockets (also
in the abstract namespace). And considering the side channel in unix
domain sockets, they become a potential hole. So for solid isolation,
I'd expect disallowing access to those when the object owner is in a
different security context from the context which is trying to attach.
thanks,
-chris
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 19:07 ` Chris Wright
@ 2006-03-31 5:36 ` Eric W. Biederman
2006-03-31 5:51 ` Chris Wright
0 siblings, 1 reply; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-31 5:36 UTC (permalink / raw)
To: Chris Wright
Cc: Serge E. Hallyn, David Lang, Sam Vilain, Nick Piggin,
Herbert Poetzl, Bill Davidsen, Linux Kernel ML
Chris Wright <chrisw@sous-sol.org> writes:
>> With appropriate care we should be able to allow the container
>> administrator to use this capability to select which security
>> policies, and mechanisms they want.
>>
>> That is something we probably want to consider anyway as
>> currently the security modules break the basic rule that
>> compiling code in should not affect how the kernel operates
>> by default.
>
> Don't follow you on this one.
Very simple: it should be possible to statically compile in
all of the security modules and be able to pick at run time which
security module to use.
Unless I have been very blind and missed something skimming
through the code, if I compile in all of the security
modules, whichever one is initialized first is the one
that will be used.
Eric
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-31 5:36 ` Eric W. Biederman
@ 2006-03-31 5:51 ` Chris Wright
2006-03-31 6:52 ` Eric W. Biederman
0 siblings, 1 reply; 125+ messages in thread
From: Chris Wright @ 2006-03-31 5:51 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Chris Wright, Serge E. Hallyn, David Lang, Sam Vilain,
Nick Piggin, Herbert Poetzl, Bill Davidsen, Linux Kernel ML
* Eric W. Biederman (ebiederm@xmission.com) wrote:
> Very simple: it should be possible to statically compile in
> all of the security modules and be able to pick at run time which
> security module to use.
>
> Unless I have been very blind and missed something skimming
> through the code, if I compile in all of the security
> modules, whichever one is initialized first is the one
> that will be used.
I see. No, you got that correct. That's rather intentional, to make
sure all objects are properly initialized as they are allocated rather
than having to double check at every access control check. That's why
security_initcalls are so early.
thanks,
-chris
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 19:23 ` Chris Wright
@ 2006-03-31 6:00 ` Eric W. Biederman
2006-03-31 14:52 ` Stephen Smalley
0 siblings, 1 reply; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-31 6:00 UTC (permalink / raw)
To: Chris Wright
Cc: Sam Vilain, Nick Piggin, Herbert Poetzl, Bill Davidsen,
Linux Kernel ML, Serge E. Hallyn
Chris Wright <chrisw@sous-sol.org> writes:
> * Eric W. Biederman (ebiederm@xmission.com) wrote:
>> As I currently understand the problem everything goes along nicely
>> nothing really special needed until you start asking the question
>> how do I implement a root user with uid 0 who does not own the
>> machine. When you start asking that question is when the creepy
>> crawlies come out.
>
> Hehe. uid 0 _and_ full capabilities. So reducing capabilities is one
> relatively easy way to handle that.
It comes close, but capabilities are not currently factored correctly.
> And, if you have a security module
> loaded it's going to use security labels, which can be richer than both
> uid and capabilities combined.
Exactly. You can define the semantics with a security module,
but you cannot define the semantics in terms of uids.
>> On most virtual filesystems the default owner of files is uid 0.
>> Additional privilege checks are not applied. Writing to those
>> files could potentially have global effect.
>
> Yes, many (albeit far from all) have a capable() check as well.
Nothing controlled by sysctl has a capable check, except
the capabilities sysctl. The default, if not the norm, is not
to apply capability checks.
>> It is completely unclear how permissions checks should work
>> between two processes in different uid namespaces. Especially
>> there are cases where you do want interactions.
>
> Are there? Why put them in different containers then? I'd think
> network sockets is the extent of the interaction you'd want. Sharing
> a filesystem does leave room for named pipes and unix domain sockets (also
> in the abstract namespace). And considering the side channel in unix
> domain sockets, they become a potential hole. So for solid isolation,
> I'd expect disallowing access to those when the object owner is in a
> different security context from the context which is trying to attach.
Yes. My current implementation has all of that visibility closed
when you create a new network namespace. But there are still
interactions. For me it isn't a real problem though, as I have
a single system administrator and synchronized user ids. For
other use cases it is a different story.
In a more normal use case, the container admin can't get out, but
the box admin can get in. At least for simple things like monitoring
and possibly some debugging.
Or you get weird cases where you want to allow access to some of
the files in /proc to the container but not all.
If I am the machine admin and I have discovered a process in
a container that has a bug and is running wild, it is preferable
to kill that process, or possibly that container, rather than
reboot the box to solve the problem.
All of the normal every day interactions get handled fine and there
is simply no visibility. But I don't ever expect perfect isolation,
from the machine admin.
I do still need to read up on the selinux mandatory access controls.
Although the comment from the NSA selinux FAQ about selinux being
just a proof-of-concept and no security bugs were discovered or
looked for during its implementation scares me.
Eric
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-31 5:51 ` Chris Wright
@ 2006-03-31 6:52 ` Eric W. Biederman
0 siblings, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-31 6:52 UTC (permalink / raw)
To: Chris Wright
Cc: Serge E. Hallyn, David Lang, Sam Vilain, Nick Piggin,
Herbert Poetzl, Bill Davidsen, Linux Kernel ML
Chris Wright <chrisw@sous-sol.org> writes:
> * Eric W. Biederman (ebiederm@xmission.com) wrote:
>> Very simple: it should be possible to statically compile in
>> all of the security modules and be able to pick at run time which
>> security module to use.
>>
>> Unless I have been very blind and missed something skimming
>> through the code, if I compile in all of the security
>> modules, whichever one is initialized first is the one
>> that will be used.
>
> I see. No, you got that correct. That's rather intentional, to make
> sure all objects are properly initialized as they are allocated rather
> than having to double check at every access control check. That's why
> security_initcalls are so early.
Ok. That makes sense. The fact that some of the security modules
besides selinux are tristate in Kconfig had me confused for a moment.
Controlling what to run with a kernel command line makes sense
then.
Having a generic command line like lsm=[selinux|root_plug|capability|seclvl]
would be nice, where supplying nothing would not enable any of
the linux security modules.
Eric
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 18:00 ` Eric W. Biederman
@ 2006-03-31 13:40 ` Serge E. Hallyn
0 siblings, 0 replies; 125+ messages in thread
From: Serge E. Hallyn @ 2006-03-31 13:40 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Serge E. Hallyn, Chris Wright, David Lang, Sam Vilain,
Nick Piggin, Bill Davidsen, Linux Kernel ML
Quoting Eric W. Biederman (ebiederm@xmission.com):
> Herbert Poetzl <herbert@13thfloor.at> writes:
>
> > sorry folks, I don't think that we _ever_ want container
> > root to be able to load any kernel modues at any time
> > without having CAP_SYS_ADMIN or so, in which case the
> > modules can be global as well ... otherwise we end up
> > as a bad Xen imitation with a lot of security issues,
> > where it should be a security enhancement ...
>
> Agreed. At least until someone defines a user-mode
> linux-security-module. We may want a different security module
It's been done before, at least for some hooks (i.e. one implementation by
antivirus folks). But to actually do this with full support for all
hooks would require some changes. For example, the security_task_kill()
hook is called under several potential locks. At least
read_lock(tasklist_lock) and plain rcu_read_lock() (and I thought also
write_lock(tasklist_lock), but can't find that instance right now).
Clearly that can be fixed, but atm a user-mode lsm isn't entirely
possible.
-serge
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-31 6:00 ` Eric W. Biederman
@ 2006-03-31 14:52 ` Stephen Smalley
2006-03-31 16:39 ` Eric W. Biederman
0 siblings, 1 reply; 125+ messages in thread
From: Stephen Smalley @ 2006-03-31 14:52 UTC (permalink / raw)
To: Eric W. Biederman
Cc: James Morris, Chris Wright, Sam Vilain, Nick Piggin,
Herbert Poetzl, Bill Davidsen, Linux Kernel ML, Serge E. Hallyn
On Thu, 2006-03-30 at 23:00 -0700, Eric W. Biederman wrote:
> I do still need to read up on the selinux mandatory access controls.
> Although the comment from the NSA selinux FAQ about selinux being
> just a proof-of-concept and no security bugs were discovered or
> looked for during it's implementation scares me.
Point of clarification: The original SELinux prototype NSA released in
Dec 2000 based on Linux 2.2.x kernels was a proof-of-concept reference
implementation. I wouldn't describe the current implementation in
mainline Linux 2.6 and certain distributions in the same manner. Also,
the separate Q&A about "did you try to fix any vulnerabilities" is just
saying that NSA did not perform a thorough code audit of the entire
Linux kernel; we just implemented the extensions needed for mandatory
access control.
http://selinux.sf.net/resources.php3 has some good pointers for SELinux
resources. There is also a recently created SELinux news site at
http://selinuxnews.org/wp/.
--
Stephen Smalley
National Security Agency
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-31 14:52 ` Stephen Smalley
@ 2006-03-31 16:39 ` Eric W. Biederman
0 siblings, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-03-31 16:39 UTC (permalink / raw)
To: sds
Cc: James Morris, Chris Wright, Sam Vilain, Nick Piggin,
Herbert Poetzl, Bill Davidsen, Linux Kernel ML, Serge E. Hallyn
Stephen Smalley <sds@tycho.nsa.gov> writes:
> On Thu, 2006-03-30 at 23:00 -0700, Eric W. Biederman wrote:
>> I do still need to read up on the selinux mandatory access controls.
>> Although the comment from the NSA selinux FAQ about selinux being
>> just a proof-of-concept and no security bugs were discovered or
>> looked for during its implementation scares me.
>
> Point of clarification: The original SELinux prototype NSA released in
> Dec 2000 based on Linux 2.2.x kernels was a proof-of-concept reference
> implementation. I wouldn't describe the current implementation in
> mainline Linux 2.6 and certain distributions in the same manner. Also,
> the separate Q&A about "did you try to fix any vulnerabilities" is just
> saying that NSA did not perform a thorough code audit of the entire
> Linux kernel; we just implemented the extensions needed for mandatory
> access control.
>
> http://selinux.sf.net/resources.php3 has some good pointers for SELinux
> resources. There is also a recently created SELinux news site at
> http://selinuxnews.org/wp/.
Thanks. I am concerned that there hasn't been an audit of at least
the core kernel.
My first interaction with security modules was that I fixed a bug
where /proc/pid/fd was performing the wrong superuser security
checks and the system became unusable for people using selinux.
Eric
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-28 20:50 ` Kir Kolyshkin
2006-03-28 21:38 ` Jun OKAJIMA
@ 2006-04-03 16:47 ` Bill Davidsen
2006-04-11 10:38 ` Kirill Korotaev
1 sibling, 1 reply; 125+ messages in thread
From: Bill Davidsen @ 2006-04-03 16:47 UTC (permalink / raw)
To: Kir Kolyshkin
Cc: akpm, Nick Piggin, sam, linux-kernel, Eric W. Biederman, serue,
Alexey Kuznetsov, herbert
Kir Kolyshkin wrote:
> OpenVZ will have live zero downtime migration and suspend/resume some
> time next month.
>
Please clarify. Currently a migration involves:
- stopping or suspending the instance
- backing up the instance and all of its data
- creating an environment for the instance on a new machine
- transporting the data to a new machine
- installing the instance and all data
- starting the instance
If you could just briefly cover how you do each of these steps with zero
downtime...
--
-bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-03-30 3:26 ` Nick Piggin
2006-03-30 10:30 ` Eric W. Biederman
@ 2006-04-11 10:32 ` Kirill Korotaev
2006-04-11 11:14 ` Nick Piggin
1 sibling, 1 reply; 125+ messages in thread
From: Kirill Korotaev @ 2006-04-11 10:32 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, Herbert Poetzl, Bill Davidsen, Linux Kernel ML
> Yes... about that; if/when namespaces get into the kernel, you guys
> are going to start pushing all sorts of per-container resource
> control, right? Or will you be happy to leave most of that to VMs?
Nick, OpenVZ, for example, uses the "User Bean Counters" patch originally
developed by Alan Cox. The good thing is that it is fully separate from
virtualization and allows controlling any user or set of processes.
Don't you think it is a valuable and helpful feature in itself? Why are you
afraid of resource management?
Thanks,
Kirill
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-03 16:47 ` Bill Davidsen
@ 2006-04-11 10:38 ` Kirill Korotaev
2006-04-11 16:20 ` Herbert Poetzl
` (2 more replies)
0 siblings, 3 replies; 125+ messages in thread
From: Kirill Korotaev @ 2006-04-11 10:38 UTC (permalink / raw)
To: Bill Davidsen
Cc: Kir Kolyshkin, akpm, Nick Piggin, sam, linux-kernel,
Eric W. Biederman, serue, Alexey Kuznetsov, herbert
Bill,
>> OpenVZ will have live zero downtime migration and suspend/resume some
>> time next month.
>>
> Please clarify. Currently a migration involves:
> - stopping or suspending the instance
> - backing up the instance and all of its data
> - creating an environment for the instance on a new machine
> - transporting the data to a new machine
> - installing the instance and all data
> - starting the instance
> If you could just briefly cover how you do each of these steps with zero
> downtime...
it does exactly what you wrote, with some minor steps such as stopping
networking on the source and starting it on the destination, etc.
So I would detail it like this:
- freeze VPS
- freeze networking
- copy VPS data to destination
- dump VPS
- copy dump to the destination
- restore VPS
- unfreeze VPS
- kill original VPS on source
Moreover, OpenVZ live migration allows migrating 32-bit VPSs between
i686 and x86-64 Linux machines.
Thanks,
Kirill
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-04-11 10:32 ` Kirill Korotaev
@ 2006-04-11 11:14 ` Nick Piggin
2006-04-11 14:44 ` Kirill Korotaev
0 siblings, 1 reply; 125+ messages in thread
From: Nick Piggin @ 2006-04-11 11:14 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Eric W. Biederman, Herbert Poetzl, Bill Davidsen, Linux Kernel ML
Kirill Korotaev wrote:
>> Yes... about that; if/when namespaces get into the kernel, you guys
>> are going to start pushing all sorts of per-container resource
>> control, right? Or will you be happy to leave most of that to VMs?
>
>
> Nick, OpenVZ, for example, uses the "User Bean Counters" patch originally
> developed by Alan Cox. The good thing is that it is fully separate from
> virtualization and allows controlling any user or set of processes.
> Don't you think it is a valuable and helpful feature in itself? Why are you
> afraid of resource management?
I'm afraid of resource management because I've seen things like the
ckrm cpu resource manager.
Considering we tend to mostly have only per-process resource management,
low level virtualisation seems like a much better place to do this.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [RFC] Virtualization steps
2006-04-11 11:14 ` Nick Piggin
@ 2006-04-11 14:44 ` Kirill Korotaev
0 siblings, 0 replies; 125+ messages in thread
From: Kirill Korotaev @ 2006-04-11 14:44 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, Herbert Poetzl, Bill Davidsen, Linux Kernel ML
>> Nick, OpenVZ, for example, uses the "User Bean Counters" patch originally
>> developed by Alan Cox. The good thing is that it is fully separate
>> from virtualization and allows controlling any user or set of
>> processes. Don't you think it is a valuable and helpful feature in itself?
>> Why are you afraid of resource management?
>
>
> I'm afraid of resource management because I've seen things like the
> ckrm cpu resource manager.
Ohhhhh... Now I see :)
CKRM uses too heavy a framework, with hierarchical settings and
so on, but offers little in the way of practical things.
Our approach is totally different: we make it simple and straightforward,
and all resource management features are compile-time configurable.
> Considering we tend to mostly have only per-process resource management,
> low level virtualisation seems like a much better place to do this.
It depends. If you want a truly secure environment in Linux, resource
management is a MUST. Also, per-process management is not natural from
the user's POV.
Thanks,
Kirill
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-11 10:38 ` Kirill Korotaev
@ 2006-04-11 16:20 ` Herbert Poetzl
2006-04-11 18:12 ` Kir Kolyshkin
2006-04-12 5:12 ` Andi Kleen
2006-04-30 13:22 ` Bill Davidsen
2 siblings, 1 reply; 125+ messages in thread
From: Herbert Poetzl @ 2006-04-11 16:20 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Bill Davidsen, Kir Kolyshkin, akpm, Nick Piggin, sam,
linux-kernel, Eric W. Biederman, serue, Alexey Kuznetsov
On Tue, Apr 11, 2006 at 02:38:51PM +0400, Kirill Korotaev wrote:
> Bill,
>
> >>OpenVZ will have live zero downtime migration and suspend/resume some
> >>time next month.
> >>
> >Please clarify. Currently a migration involves:
> >- stopping or suspending the instance
> >- backing up the instance and all of its data
> >- creating an environment for the instance on a new machine
> >- transporting the data to a new machine
> >- installing the instance and all data
> >- starting the instance
>
> >If you could just briefly cover how you do each of these steps with zero
> >downtime...
>
> it does exactly what you wrote, with some minor steps such as stopping
> networking on the source and starting it on the destination, etc.
>
> So I would detail it like this:
> - freeze VPS
> - freeze networking
> - copy VPS data to destination
> - dump VPS
IIRC, Xen does some kind of repetitive sync of
memory pages to allow the 'original' to continue
running for as long as possible, so pages and
structures get resynced until it looks like the
migration would require only a few pages to be
synced for the final move, then it does the actual
move ...
> - copy dump to the destination
> - restore VPS
> - unfreeze VPS
> - kill original VPS on source
>
> Moreover, OpenVZ live migration allows migrating 32-bit VPSs between
> i686 and x86-64 Linux machines.
I think that zero downtime is some kind of marketing
buzzword here, and has nothing to do with the actual
time the migration takes, which will probably be
around 20 seconds or so (for larger guests) ...
best,
Herbert
> Thanks,
> Kirill
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-11 16:20 ` Herbert Poetzl
@ 2006-04-11 18:12 ` Kir Kolyshkin
0 siblings, 0 replies; 125+ messages in thread
From: Kir Kolyshkin @ 2006-04-11 18:12 UTC (permalink / raw)
To: Herbert Poetzl
Cc: Kirill Korotaev, Bill Davidsen, Kir Kolyshkin, akpm, Nick Piggin,
sam, linux-kernel, Eric W. Biederman, serue, Alexey Kuznetsov
Herbert Poetzl wrote:
>I think that zero downtime is some kind of marketing
>buzzword here, and has nothing to do with the actual
>time the migration takes, which will probably be
>around 20 seconds or so (for larger guests) ...
>
>
IMO it is called "zero downtime" because from the user's point of view
there is no downtime, i.e. network connections are preserved, so what
the user sees is a delay in processing, not downtime.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-11 10:38 ` Kirill Korotaev
2006-04-11 16:20 ` Herbert Poetzl
@ 2006-04-12 5:12 ` Andi Kleen
2006-04-12 6:55 ` Kirill Korotaev
2006-04-13 16:54 ` Alexey Kuznetsov
2006-04-30 13:22 ` Bill Davidsen
2 siblings, 2 replies; 125+ messages in thread
From: Andi Kleen @ 2006-04-12 5:12 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Kir Kolyshkin, akpm, Nick Piggin, sam, linux-kernel,
Eric W. Biederman, serue, Alexey Kuznetsov, herbert
Kirill Korotaev <dev@sw.ru> writes:
>
> Moreover, OpenVZ live migration allows migrating 32-bit VPSs
> between i686 and x86-64 Linux machines.
How would that work when x86-64 32bit programs have 4GB of address
space and native programs on i386 only 3GB?
-Andi
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-12 6:55 ` Kirill Korotaev
@ 2006-04-12 6:53 ` Andi Kleen
2006-04-12 7:51 ` Kirill Korotaev
0 siblings, 1 reply; 125+ messages in thread
From: Andi Kleen @ 2006-04-12 6:53 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Kir Kolyshkin, akpm, Nick Piggin, sam, linux-kernel,
Eric W. Biederman, serue, Alexey Kuznetsov, herbert
Kirill Korotaev <dev@sw.ru> writes:
> >>Moreover, in OpenVZ live migration allows to migrate 32bit VPSs
> >>between i686 and x86-64 Linux machines.
>
> > How would that work when x86-64 32bit programs have 4GB of address
> > space and native on i386 programs only 3GB?
> we limit address space of i386 apps on x86-64 to 3GB due to
> compatibility issues - some applications don't work with not 3:1 GB VM
> split.
The only program I'm aware of with this problem is a very old JDK
used in the Oracle installer - and the officially documented fix
for this is to run it with linux32 --3gb
Limiting everybody just for that single bug seems quite excessive.
-Andi
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-12 5:12 ` Andi Kleen
@ 2006-04-12 6:55 ` Kirill Korotaev
2006-04-12 6:53 ` Andi Kleen
2006-04-13 16:54 ` Alexey Kuznetsov
1 sibling, 1 reply; 125+ messages in thread
From: Kirill Korotaev @ 2006-04-12 6:55 UTC (permalink / raw)
To: Andi Kleen
Cc: Kir Kolyshkin, akpm, Nick Piggin, sam, linux-kernel,
Eric W. Biederman, serue, Alexey Kuznetsov, herbert
>>Moreover, in OpenVZ live migration allows to migrate 32bit VPSs
>>between i686 and x86-64 Linux machines.
> How would that work when x86-64 32bit programs have 4GB of address
> space and native on i386 programs only 3GB?
we limit the address space of i386 apps on x86-64 to 3GB due to
compatibility issues - some applications don't work with a non-3:1
VM split.
On the other hand, if you need a 4GB user address space, you can use
the 4GB available to 32-bit apps on x86-64, or a 4G/4G split on i386.
Thanks,
Kirill
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-12 6:53 ` Andi Kleen
@ 2006-04-12 7:51 ` Kirill Korotaev
2006-04-12 17:03 ` Andi Kleen
0 siblings, 1 reply; 125+ messages in thread
From: Kirill Korotaev @ 2006-04-12 7:51 UTC (permalink / raw)
To: Andi Kleen
Cc: Kir Kolyshkin, akpm, Nick Piggin, sam, linux-kernel,
Eric W. Biederman, serue, Alexey Kuznetsov, herbert
>>>How would that work when x86-64 32bit programs have 4GB of address
>>>space and native on i386 programs only 3GB?
>>
>>we limit address space of i386 apps on x86-64 to 3GB due to
>>compatibility issues - some applications don't work with not 3:1 GB VM
>>split.
>
> The only program I'm aware of with this problem is an very old JDK
> used in the Oracle installer - and the official documented fix
> for this is to run it with linux32 --3gb
>
> Limiting everybody just for that single bug seems quite excessive.
Sergey Vlasov recently reported some other apps:
-------------- cut ---------------
Changing PAGE_OFFSET this way would break at least Valgrind (the latest
release 3.1.0 by default is statically linked at address 0xb0000000, and
PIE support does not seem to be present in that release). I remember
that similar changes were also breaking Lisp implementations (cmucl,
sbcl), however, I am not really sure about this.
-------------- cut ---------------
Also, why would one expect 4GB of VM on x86-64 if one normally has 3GB
on i686? Anyway, as it is tunable, people can select which one they prefer.
Thanks,
Kirill
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-03-29 21:37 ` Sam Vilain
@ 2006-04-12 8:28 ` Kirill Korotaev
2006-04-13 1:05 ` Herbert Poetzl
0 siblings, 1 reply; 125+ messages in thread
From: Kirill Korotaev @ 2006-04-12 8:28 UTC (permalink / raw)
To: Sam Vilain; +Cc: Herbert Poetzl, devel, Kir Kolyshkin, linux-kernel
Sam,
> Ok, I'll call those three VPSes fast, faster and fastest.
>
> "fast" : fill rate 1, interval 3
> "faster" : fill rate 2, interval 3
> "fastest" : fill rate 3, interval 3
>
> That all adds up to a fill rate of 6 with an interval of 3, but that is
> right because with two processors you have 2 tokens to allocate per
> jiffie. Also set the bucket size to something of the order of HZ.
>
> You can watch the processes within each vserver's priority jump up and
> down with `vtop' during testing. Also you should be able to watch the
> vserver's bucket fill and empty in /proc/virtual/XXX/sched (IIRC)
>
> I mentioned this earlier, but for the sake of the archives I'll repeat -
> if you are running with any of the buckets on empty, the scheduler is
> imbalanced and therefore not going to provide the exact distribution you
> asked for.
>
> However with a single busy loop in each vserver I'd expect the above to
> yield roughly 100% for fastest, 66% for faster and 33% for fast, within
> 5 seconds or so of starting those processes (assuming you set a bucket
> size of HZ).
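The token-bucket arithmetic quoted above can be sketched as a back-of-the-envelope model. This is a simplified illustration of the fill-rate/interval math, not the actual Linux-VServer scheduler code:

```python
# Simplified model of the token-bucket CPU shares Sam describes.
# Each guest earns fill_rate tokens every interval jiffies; running on a
# CPU costs one token per jiffy, so a guest's long-run CPU usage is
# fill_rate / interval CPUs.

HZ = 1000  # jiffies per second; a bucket size of order HZ absorbs bursts

guests = {"fast": 1, "faster": 2, "fastest": 3}  # fill rates
interval = 3
ncpus = 2

# Tokens granted per jiffy must not exceed what the CPUs can consume,
# otherwise some bucket runs empty and the ratios are not honoured.
total_rate = sum(guests.values()) / interval
assert total_rate <= ncpus  # 6/3 = 2 tokens/jiffy == 2 CPUs: balanced

shares = {g: rate / interval for g, rate in guests.items()}  # CPUs each
print({g: round(100 * s) for g, s in shares.items()})
# fastest gets a full CPU, faster about two thirds, fast about one third
```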
Sam, what we observe is a situation where the Linux CPU scheduler
spreads 2 tasks on the 1st CPU and 1 task on the 2nd CPU. The standard
Linux scheduler doesn't do any rebalancing after that, so no playing
with tokens makes the spread 3:2:1, since the lowest-priority process
gets the full 2nd CPU (100% instead of 33% of a CPU).
Where is my mistake? Can you provide a configuration we could test, or
instructions on how to avoid this?
Thanks,
Kirill
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-12 7:51 ` Kirill Korotaev
@ 2006-04-12 17:03 ` Andi Kleen
2006-04-12 17:20 ` Eric W. Biederman
0 siblings, 1 reply; 125+ messages in thread
From: Andi Kleen @ 2006-04-12 17:03 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Kir Kolyshkin, akpm, Nick Piggin, sam, linux-kernel,
Eric W. Biederman, serue, Alexey Kuznetsov, herbert
Kirill Korotaev <dev@sw.ru> writes:
>
> -------------- cut ---------------
> Changing PAGE_OFFSET this way would break at least Valgrind (the latest
> release 3.1.0 by default is statically linked at address 0xb0000000, and
> PIE support does not seem to be present in that release). I remember
> that similar changes were also breaking Lisp implementations (cmucl,
> sbcl), however, I am not really sure about this.
> -------------- cut ---------------
valgrind only breaks when you decrease TASK_SIZE to 2GB, not when you
enlarge it. In general 2GB VM breaks a lot of apps, that is why
the new CONFIGs that were added for this were a very bad idea imho.
Obviously the x86-64 kernel doesn't support such things.
> Also, why would one expect 4GB of VM on x86-64 if normally have 3GB on
> i686? Anyway, as it is tunable, people can select which one they
> prefer.
Because nearly all programs that need it can actually take advantage of it.
-Andi
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-12 17:03 ` Andi Kleen
@ 2006-04-12 17:20 ` Eric W. Biederman
0 siblings, 0 replies; 125+ messages in thread
From: Eric W. Biederman @ 2006-04-12 17:20 UTC (permalink / raw)
To: Andi Kleen
Cc: Kirill Korotaev, Kir Kolyshkin, akpm, Nick Piggin, sam,
linux-kernel, serue, Alexey Kuznetsov, herbert
Andi Kleen <ak@suse.de> writes:
> Kirill Korotaev <dev@sw.ru> writes:
>>
>> -------------- cut ---------------
>> Changing PAGE_OFFSET this way would break at least Valgrind (the latest
>> release 3.1.0 by default is statically linked at address 0xb0000000, and
>> PIE support does not seem to be present in that release). I remember
>> that similar changes were also breaking Lisp implementations (cmucl,
>> sbcl), however, I am not really sure about this.
>> -------------- cut ---------------
>
> valgrind only breaks when you decrease TASK_SIZE to 2GB, not when you
> enlarge it. In general 2GB VM breaks a lot of apps, that is why
> the new CONFIGs that were added for this were a very bad idea imho.
>
> Obviously the x86-64 kernel doesn't support such things.
>
>> Also, why would one expect 4GB of VM on x86-64 if normally have 3GB on
>> i686? Anyway, as it is tunable, people can select which one they
>> prefer.
>
> Because nearly all programs that need it can actually take advantage of it.
So back to the core aim of this thread.
i386 -> x86_64: 32-bit migration should work, but may be a little
confusing with the increase in VM space. (I wonder how this
interacts with the kernel's vdso).
x86_64 -> i386 is likely to use addresses between 3GB and 4GB,
and thus the migration probably will not work, unless the VM
accessible to user space is capped at 3GB.
Odd. But address space layout looks like one of the easiest migration
problems to solve.
Eric
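The asymmetry Eric summarizes reduces to comparing the two user address-space limits. A back-of-the-envelope sketch; the constants are the well-known defaults under discussion, not values read from any kernel:

```python
# The address-space layouts under discussion, as plain numbers.
GB = 1 << 30
I386_3G_TASK_SIZE = 3 * GB       # 0xC0000000: default 3:1 split on i386
X86_64_COMPAT_LIMIT = 4 * GB     # 32-bit task on an x86-64 kernel

VALGRIND_BASE = 0xB0000000       # static link address of valgrind 3.1.0

# i386 -> x86_64: every address valid under the 3:1 split is also valid
# in the 4GB compat space, so migration in this direction can work.
assert I386_3G_TASK_SIZE <= X86_64_COMPAT_LIMIT

# x86_64 -> i386: a 32-bit task may have mappings in the 3GB-4GB range,
# which an i386 3:1 kernel cannot reproduce -- hence capping to 3GB.
assert X86_64_COMPAT_LIMIT - 1 >= I386_3G_TASK_SIZE

# valgrind's fixed link address fits under the 3:1 split, so enlarging
# the space does not break it (shrinking to 2GB would).
assert VALGRIND_BASE < I386_3G_TASK_SIZE
```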
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-12 8:28 ` Kirill Korotaev
@ 2006-04-13 1:05 ` Herbert Poetzl
2006-04-13 6:52 ` Kirill Korotaev
0 siblings, 1 reply; 125+ messages in thread
From: Herbert Poetzl @ 2006-04-13 1:05 UTC (permalink / raw)
To: Kirill Korotaev; +Cc: Sam Vilain, devel, Kir Kolyshkin, linux-kernel
On Wed, Apr 12, 2006 at 12:28:56PM +0400, Kirill Korotaev wrote:
> Sam,
>
> >Ok, I'll call those three VPSes fast, faster and fastest.
> >
> >"fast" : fill rate 1, interval 3
> >"faster" : fill rate 2, interval 3
> >"fastest" : fill rate 3, interval 3
> >
> >That all adds up to a fill rate of 6 with an interval of 3, but that is
> >right because with two processors you have 2 tokens to allocate per
> >jiffie. Also set the bucket size to something of the order of HZ.
> >
> >You can watch the processes within each vserver's priority jump up and
> >down with `vtop' during testing. Also you should be able to watch the
> >vserver's bucket fill and empty in /proc/virtual/XXX/sched (IIRC)
> >
> >I mentioned this earlier, but for the sake of the archives I'll repeat -
> >if you are running with any of the buckets on empty, the scheduler is
> >imbalanced and therefore not going to provide the exact distribution you
> >asked for.
> >
> >However with a single busy loop in each vserver I'd expect the above to
> >yield roughly 100% for fastest, 66% for faster and 33% for fast, within
> >5 seconds or so of starting those processes (assuming you set a bucket
> >size of HZ).
>
> Sam, what we observe is the situation, when Linux cpu scheduler spreads
> 2 tasks on 1st CPU and 1 task on the 2nd CPU. Std linux scheduler
> doesn't do any rebalancing after that, so no plays with tokens make the
> spread to be 3:2:1, since the lowest priority process gets a full 2nd
> CPU (100% instead of 33% of CPU).
>
> Where is my mistake? Can you provide a configuration where we could test
> or the instuctions on how to avoid this?
well, your mistake seems to be that you probably haven't
tested this yet, because with the following (simple)
setups I seem to get what you consider impossible
(of course, not as precise as your scheduler does it)
vcontext --create --xid 100 ./cpuhog -n 1 100 &
vcontext --create --xid 200 ./cpuhog -n 1 200 &
vcontext --create --xid 300 ./cpuhog -n 1 300 &
vsched --xid 100 --fill-rate 1 --interval 6
vsched --xid 200 --fill-rate 2 --interval 6
vsched --xid 300 --fill-rate 3 --interval 6
vattribute --xid 100 --flag sched_hard
vattribute --xid 200 --flag sched_hard
vattribute --xid 300 --flag sched_hard
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
39 root 25 0 1304 248 200 R 74 0.1 0:46.16 ./cpuhog -n 1 300
38 root 25 0 1308 252 200 H 53 0.1 0:34.06 ./cpuhog -n 1 200
37 root 25 0 1308 252 200 H 28 0.1 0:19.53 ./cpuhog -n 1 100
46 root 0 0 1804 912 736 R 1 0.4 0:02.14 top -cid 20
and here the other way round:
vsched --xid 100 --fill-rate 3 --interval 6
vsched --xid 200 --fill-rate 2 --interval 6
vsched --xid 300 --fill-rate 1 --interval 6
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
36 root 25 0 1304 248 200 R 75 0.1 0:58.41 ./cpuhog -n 1 100
37 root 25 0 1308 252 200 H 54 0.1 0:42.77 ./cpuhog -n 1 200
38 root 25 0 1308 252 200 R 29 0.1 0:25.30 ./cpuhog -n 1 300
45 root 0 0 1804 912 736 R 1 0.4 0:02.26 top -cid 20
note that this was done on a virtual dual-CPU
machine (QEMU 8.0) with 2.6.16-vs2.1.1-rc16, and
that there was roughly 25% idle time, which I'm
unable to explain atm ...
feel free to jump on that fact, but I consider
it unimportant for now ...
best,
Herbert
> Thanks,
> Kirill
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-13 1:05 ` Herbert Poetzl
@ 2006-04-13 6:52 ` Kirill Korotaev
2006-04-13 13:42 ` Herbert Poetzl
0 siblings, 1 reply; 125+ messages in thread
From: Kirill Korotaev @ 2006-04-13 6:52 UTC (permalink / raw)
To: Herbert Poetzl; +Cc: Sam Vilain, devel, Kir Kolyshkin, linux-kernel
Herbert,
Thanks a lot for the details, I will give it a try once again. Looks
like fairness in this scenario simply requires the sched_hard settings.
Herbert... I don't know why you've decided that my goal is to prove that
your scheduler is bad or not precise. My goal is simply to investigate
different approaches and make some measurements. I suppose you can
benefit from such a volunteer, don't you think? Anyway, thanks again,
and don't get stuck on the idea that the OpenVZ folks are such cruel
bad guys :)
Thanks,
Kirill
> well, your mistake seems to be that you probably haven't
> tested this yet, because with the following (simple)
> setups I seem to get what you consider impossible
> (of course, not as precise as your scheduler does it)
>
>
> vcontext --create --xid 100 ./cpuhog -n 1 100 &
> vcontext --create --xid 200 ./cpuhog -n 1 200 &
> vcontext --create --xid 300 ./cpuhog -n 1 300 &
>
> vsched --xid 100 --fill-rate 1 --interval 6
> vsched --xid 200 --fill-rate 2 --interval 6
> vsched --xid 300 --fill-rate 3 --interval 6
>
> vattribute --xid 100 --flag sched_hard
> vattribute --xid 200 --flag sched_hard
> vattribute --xid 300 --flag sched_hard
>
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 39 root 25 0 1304 248 200 R 74 0.1 0:46.16 ./cpuhog -n 1 300
> 38 root 25 0 1308 252 200 H 53 0.1 0:34.06 ./cpuhog -n 1 200
> 37 root 25 0 1308 252 200 H 28 0.1 0:19.53 ./cpuhog -n 1 100
> 46 root 0 0 1804 912 736 R 1 0.4 0:02.14 top -cid 20
>
> and here the other way round:
>
> vsched --xid 100 --fill-rate 3 --interval 6
> vsched --xid 200 --fill-rate 2 --interval 6
> vsched --xid 300 --fill-rate 1 --interval 6
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 36 root 25 0 1304 248 200 R 75 0.1 0:58.41 ./cpuhog -n 1 100
> 37 root 25 0 1308 252 200 H 54 0.1 0:42.77 ./cpuhog -n 1 200
> 38 root 25 0 1308 252 200 R 29 0.1 0:25.30 ./cpuhog -n 1 300
> 45 root 0 0 1804 912 736 R 1 0.4 0:02.26 top -cid 20
>
>
> note that this was done on a virtual dual cpu
> machine (QEMU 8.0) with 2.6.16-vs2.1.1-rc16 and
> that there were roughly 25% idle time, which I'm
> unable to explain atm ...
>
> feel free to jump on that fact, but I consider
> it unimportant for now ...
>
> best,
> Herbert
>
>
>>Thanks,
>>Kirill
>
>
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-13 6:52 ` Kirill Korotaev
@ 2006-04-13 13:42 ` Herbert Poetzl
2006-04-13 21:33 ` Cedric Le Goater
0 siblings, 1 reply; 125+ messages in thread
From: Herbert Poetzl @ 2006-04-13 13:42 UTC (permalink / raw)
To: Kirill Korotaev; +Cc: Sam Vilain, devel, Kir Kolyshkin, linux-kernel
On Thu, Apr 13, 2006 at 10:52:19AM +0400, Kirill Korotaev wrote:
> Herbert,
>
> Thanks a lot for the details, I will give it a try once again. Looks
> like fairness in this scenario simply requires sched_hard settings.
hmm, not precisely, it's a cpu limit you described,
and that is what this configuration does; for fair
scheduling you need to activate the idle skip and
configure it in a similar way ...
> Herbert... I don't know why you've decided that my goal is to prove
> that your scheduler is bad or not precise. My goal is simply to
> investigate different approaches and make some measurements.
fair enough ...
> I suppose you can benefit from such a volunteer, don't you think so?
well, if the 'results' and 'methods' will be made
public, I can, until now all I got was something
along the lines:
"Linux-VServer is not stable! WE (swsoft?) have
a secret but essential test suite running two
weeks to confirm that OUR kernels ARE stable,
and Linux-VServer will never pass those tests,
but of course, we can't tell you what kind of
tests or what results we got"
which doesn't help me at all and which, to be
honest, does not sound very friendly either ...
> Anyway, thanks again and don't be cycled on the idea that OpenVZ are
> so cruel bad guys :)
but what about the Virtuozzo(tm) guys? :)
I'm really trying not to generalize here ...
best,
Herbert
> Thanks,
> Kirill
>
> >well, your mistake seems to be that you probably haven't
> >tested this yet, because with the following (simple)
> >setups I seem to get what you consider impossible
> >(of course, not as precise as your scheduler does it)
> >
> >
> >vcontext --create --xid 100 ./cpuhog -n 1 100 &
> >vcontext --create --xid 200 ./cpuhog -n 1 200 &
> >vcontext --create --xid 300 ./cpuhog -n 1 300 &
> >
> >vsched --xid 100 --fill-rate 1 --interval 6
> >vsched --xid 200 --fill-rate 2 --interval 6
> >vsched --xid 300 --fill-rate 3 --interval 6
> >
> >vattribute --xid 100 --flag sched_hard
> >vattribute --xid 200 --flag sched_hard
> >vattribute --xid 300 --flag sched_hard
> >
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 39 root 25 0 1304 248 200 R 74 0.1 0:46.16 ./cpuhog -n 1 300
> > 38 root 25 0 1308 252 200 H 53 0.1 0:34.06 ./cpuhog -n 1 200
> > 37 root 25 0 1308 252 200 H 28 0.1 0:19.53 ./cpuhog -n 1 100
> > 46 root 0 0 1804 912 736 R 1 0.4 0:02.14 top -cid 20
> >and here the other way round:
> >
> >vsched --xid 100 --fill-rate 3 --interval 6
> >vsched --xid 200 --fill-rate 2 --interval 6
> >vsched --xid 300 --fill-rate 1 --interval 6
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 36 root 25 0 1304 248 200 R 75 0.1 0:58.41 ./cpuhog -n 1 100
> > 37 root 25 0 1308 252 200 H 54 0.1 0:42.77 ./cpuhog -n 1 200
> > 38 root 25 0 1308 252 200 R 29 0.1 0:25.30 ./cpuhog -n 1 300
> > 45 root 0 0 1804 912 736 R 1 0.4 0:02.26 top -cid 20
> >
> >note that this was done on a virtual dual cpu
> >machine (QEMU 8.0) with 2.6.16-vs2.1.1-rc16 and
> >that there were roughly 25% idle time, which I'm
> >unable to explain atm ...
> >
> >feel free to jump on that fact, but I consider
> >it unimportant for now ...
> >
> >best,
> >Herbert
> >
> >
> >>Thanks,
> >>Kirill
> >
> >
>
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-12 5:12 ` Andi Kleen
2006-04-12 6:55 ` Kirill Korotaev
@ 2006-04-13 16:54 ` Alexey Kuznetsov
1 sibling, 0 replies; 125+ messages in thread
From: Alexey Kuznetsov @ 2006-04-13 16:54 UTC (permalink / raw)
To: Andi Kleen
Cc: Kirill Korotaev, Kir Kolyshkin, akpm, Nick Piggin, sam,
linux-kernel, Eric W. Biederman, serue, herbert
Hello!
> How would that work when x86-64 32bit programs have 4GB of address
> space and native on i386 programs only 3GB?
It is not the only obstacle. There are other ones:
1. Different values of segment registers. The __USER32_* selectors should
match the ones on i386.
2. The ia32 vsyscall page. It also must be the same, unless arch/i386 is
changed to allow variable mappings, the way the 32-bit vsyscall page
works on x86_64.
Well, if we want something to migrate, we have to pay.
Alexey
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-13 13:42 ` Herbert Poetzl
@ 2006-04-13 21:33 ` Cedric Le Goater
2006-04-13 22:45 ` Herbert Poetzl
2006-04-13 22:51 ` Kir Kolyshkin
0 siblings, 2 replies; 125+ messages in thread
From: Cedric Le Goater @ 2006-04-13 21:33 UTC (permalink / raw)
To: Kirill Korotaev, Sam Vilain, devel, Kir Kolyshkin, linux-kernel,
Herbert Poetzl
Herbert Poetzl wrote:
> well, if the 'results' and 'methods' will be made
> public, I can, until now all I got was something
> along the lines:
>
> "Linux-VServer is not stable! WE (swsoft?) have
> a secret but essential test suite running two
> weeks to confirm that OUR kernels ARE stable,
> and Linux-VServer will never pass those tests,
> but of course, we can't tell you what kind of
> tests or what results we got"
>
> which doesn't help me anything and which, to be
> honest, does not sound very friendly either ...
Recently, we've been running tests and benchmarks in different
virtualization environments: openvz, vserver, vserver in a minimal
context, and also Xen as a reference in the virtual machine world.
We ran the usual benchmarks - dbench, tbench, lmbench, kernel build - on
the native kernel, on the patched kernel, and in each virtualized
environment. We also did some scalability tests to see how each solution
behaved. And finally, some tests on live migration. We didn't do much on
network nor on resource management behavior.
We'd like to continue in an open way. But first, we want to make sure we
have the right tests, benchmarks, tools, versions, configuration, tuning,
etc, before publishing any results :) We have some materials already,
but before proposing we would like to have your comments and advice on
what we should or shouldn't use.
Thanks for doing such a great job on lightweight containers,
C.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-13 21:33 ` Cedric Le Goater
@ 2006-04-13 22:45 ` Herbert Poetzl
2006-04-14 7:41 ` Kirill Korotaev
2006-04-14 9:56 ` Cedric Le Goater
2006-04-13 22:51 ` Kir Kolyshkin
1 sibling, 2 replies; 125+ messages in thread
From: Herbert Poetzl @ 2006-04-13 22:45 UTC (permalink / raw)
To: Cedric Le Goater
Cc: Kirill Korotaev, Sam Vilain, devel, Kir Kolyshkin, linux-kernel
On Thu, Apr 13, 2006 at 11:33:13PM +0200, Cedric Le Goater wrote:
> Herbert Poetzl wrote:
>
> > well, if the 'results' and 'methods' will be made
> > public, I can, until now all I got was something
> > along the lines:
> >
> > "Linux-VServer is not stable! WE (swsoft?) have
> > a secret but essential test suite running two
> > weeks to confirm that OUR kernels ARE stable,
> > and Linux-VServer will never pass those tests,
> > but of course, we can't tell you what kind of
> > tests or what results we got"
> >
> > which doesn't help me anything and which, to be
> > honest, does not sound very friendly either ...
>
> Recently, we've been running tests and benchmarks in different
> virtualization environments : openvz, vserver, vserver in a minimal
> context and also Xen as a reference in the virtual machine world.
>
> We ran the usual benchmarks, dbench, tbench, lmbench, kernel build,
> on the native kernel, on the patched kernel and in each virtualized
> environment. We also did some scalability tests to see how each
> solution behaved. And finally, some tests on live migration. We didn't
> do much on network nor on resource management behavior.
I would be really interested in getting comparisons
between vanilla kernels and linux-vserver patched
versions, especially vs2.1.1 and vs2.0.2 on the
same test setup with a minimum difference in config
I doubt that you can really compare across the
existing virtualization technologies, as it really
depends on the setup and hardware
> We'd like to continue in an open way. But first, we want to make sure
> we have the right tests, benchmarks, tools, versions, configuration,
> tuning, etc, before publishing any results :) We have some materials
> already but before proposing we would like to have your comments and
> advices on what we should or shouldn't use.
In my experience it is extremely hard to do 'proper'
comparisons, because the slightest change of the
environment can cause big differences ...
here as an example, a kernel build (-j99) on 2.6.16
on a test host, with and without a chroot:
without:
451.03user 26.27system 2:00.38elapsed 396%CPU
449.39user 26.21system 1:59.95elapsed 396%CPU
447.40user 25.86system 1:59.79elapsed 395%CPU
now with:
490.77user 24.45system 2:13.35elapsed 386%CPU
489.69user 24.50system 2:12.60elapsed 387%CPU
490.41user 24.99system 2:12.22elapsed 389%CPU
now, is chroot() really that slow? no, but the change
from /tmp being on a partition vs. tmpfs makes quite
some difference here
even moving from one partition to another will give
measurable difference here, all within a small margin
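The overhead implied by those timings can be worked out directly from the elapsed times (simple arithmetic on the numbers above; the variable names are illustrative):

```python
# Apparent overhead from the kernel-build timings quoted above,
# using the elapsed (wall-clock) times converted to seconds.

def mean(xs):
    return sum(xs) / len(xs)

without_chroot = [120.38, 119.95, 119.79]  # 2:00.38, 1:59.95, 1:59.79
with_chroot = [133.35, 132.60, 132.22]     # 2:13.35, 2:12.60, 2:12.22

overhead = mean(with_chroot) / mean(without_chroot) - 1
print(f"apparent overhead: {overhead:.1%}")
# roughly 10-11% -- which, as noted, is /tmp on a partition vs. tmpfs,
# not the cost of chroot() itself
```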
an interesting aspect is the gain (or loss) you have
when you start several guests basically doing the
same thing (and sharing the same files, etc)
> Thanks for doing such a great job on lightweight containers,
you're welcome!
best,
Herbert
> C.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-13 21:33 ` Cedric Le Goater
2006-04-13 22:45 ` Herbert Poetzl
@ 2006-04-13 22:51 ` Kir Kolyshkin
2006-04-14 10:08 ` Cedric Le Goater
1 sibling, 1 reply; 125+ messages in thread
From: Kir Kolyshkin @ 2006-04-13 22:51 UTC (permalink / raw)
To: Cedric Le Goater
Cc: Kirill Korotaev, Sam Vilain, devel, linux-kernel, Herbert Poetzl
Cedric Le Goater wrote:
> Recently, we've been running tests and benchmarks in different
>
>virtualization environments : openvz, vserver, vserver in a minimal context
>and also Xen as a reference in the virtual machine world.
>
>We ran the usual benchmarks, dbench, tbench, lmbench, kernel build, on the
>native kernel, on the patched kernel and in each virtualized environment.
>We also did some scalability tests to see how each solution behaved. And
>finally, some tests on live migration. We didn't do much on network nor on
>resource management behavior.
>
>We'd like to continue in an open way. But first, we want to make sure we
>have the right tests, benchmarks, tools, versions, configuration, tuning,
>etc, before publishing any results :) We have some materials already but
>before proposing we would like to have your comments and advices on what we
>should or shouldn't use.
>
>Thanks for doing such a great job on lightweight containers,
>
>C.
>
>
Cedrik,
You made my day, I am really happy to hear that! Such testing and
benchmarking should be done by an independent third party, and IBM fits
that requirement just fine. It all makes much sense for everybody who's
involved.
If it is open (not just the results, but also the processes and tools),
and all the projects are able to contribute and help, that would be
just great. We do a lot of testing in-house, and will be happy to
contribute to such an independent testing/benchmarking project.
Speaking of live migration, we in OpenVZ plan to release our
implementation as soon as next week.
Regards,
Kir.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-13 22:45 ` Herbert Poetzl
@ 2006-04-14 7:41 ` Kirill Korotaev
2006-04-14 9:56 ` Cedric Le Goater
1 sibling, 0 replies; 125+ messages in thread
From: Kirill Korotaev @ 2006-04-14 7:41 UTC (permalink / raw)
To: devel
Cc: Cedric Le Goater, Kirill Korotaev, Kir Kolyshkin, Sam Vilain,
linux-kernel
> I would be really interested in getting comparisons
> between vanilla kernels and linux-vserver patched
> versions, especially vs2.1.1 and vs2.0.2 on the
> same test setup with a minimum difference in config
>
> I doubt that you can really compare across the
> existing virtualization technologies, as it really
> depends on the setup and hardware
and kernel .configs :)
for example, I'm pretty sure the OVZ smp kernel is not the same as any
of the prebuilt vserver kernels.
> In my experience it is extremely hard to do 'proper'
> comparisons, because the slightest change of the
> environment can cause big differences ...
>
> here as example, a kernel build (-j99) on 2.6.16
> on a test host, with and without a chroot:
>
> without:
>
> 451.03user 26.27system 2:00.38elapsed 396%CPU
> 449.39user 26.21system 1:59.95elapsed 396%CPU
> 447.40user 25.86system 1:59.79elapsed 395%CPU
>
> now with:
>
> 490.77user 24.45system 2:13.35elapsed 386%CPU
> 489.69user 24.50system 2:12.60elapsed 387%CPU
> 490.41user 24.99system 2:12.22elapsed 389%CPU
>
> now is chroot() that imperformant? no, but the change
> in /tmp being on a partition vs. tmpfs makes quite
> some difference here
filesystem performance also very much depends on disk layout.
If you use different partitions of the same disk for Xen, vserver, and
OVZ, one of them will be quickest while the others can be significantly
slower :/
Thanks,
Kirill
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-13 22:45 ` Herbert Poetzl
2006-04-14 7:41 ` Kirill Korotaev
@ 2006-04-14 9:56 ` Cedric Le Goater
2006-04-15 19:29 ` Herbert Poetzl
1 sibling, 1 reply; 125+ messages in thread
From: Cedric Le Goater @ 2006-04-14 9:56 UTC (permalink / raw)
To: Cedric Le Goater, Kirill Korotaev, Sam Vilain, devel,
Kir Kolyshkin, linux-kernel, Herbert Poetzl
Bonjour !
Herbert Poetzl wrote:
> I would be really interested in getting comparisons
> between vanilla kernels and linux-vserver patched
> versions, especially vs2.1.1 and vs2.0.2 on the
> same test setup with a minimum difference in config
We did the tests last month and used the stable version, vs2.0.2rc9, on
a 2.6.15.4. Using benchmarks like dbench, tbench, and lmbench, the
vserver patch has no impact; vserver overhead in a context is hardly
measurable (<3%), with the same results for a debian sarge running in a
vserver.
It is pretty difficult to follow everyone's patches. This makes the
comparisons difficult, so we chose to normalize all the results against
the native kernel results. But in a way, this is good, because the goal
of these tests isn't to compare technologies but to measure their
overhead and stability. And in the end, we don't care if openvz is
faster than vserver; we want containers in the linux kernel to be fast
and stable, one day :)
> I doubt that you can really compare across the
> existing virtualization technologies, as it really
> depends on the setup and hardware
I agree these are very different technologies, but from a user point of
view they provide a similar service. So it is interesting to see what
the drawbacks and benefits of each solution are. You want fault
containment and strict isolation? here's the price. You want
performance? here's another.
Anyway, there's already enough focus on the virtual machines, so we
should focus only on lightweight containers.
>> We'd like to continue in an open way. But first, we want to make sure
>> we have the right tests, benchmarks, tools, versions, configuration,
>> tuning, etc, before publishing any results :) We have some materials
>> already but before proposing we would like to have your comments and
>> advices on what we should or shouldn't use.
>
> In my experience it is extremely hard to do 'proper'
> comparisons, because the slightest change of the
> environment can cause big differences ...
>
> here as example, a kernel build (-j99) on 2.6.16
> on a test host, with and without a chroot:
>
> without:
>
> 451.03user 26.27system 2:00.38elapsed 396%CPU
> 449.39user 26.21system 1:59.95elapsed 396%CPU
> 447.40user 25.86system 1:59.79elapsed 395%CPU
>
> now with:
>
> 490.77user 24.45system 2:13.35elapsed 386%CPU
> 489.69user 24.50system 2:12.60elapsed 387%CPU
> 490.41user 24.99system 2:12.22elapsed 389%CPU
>
> now is chroot() that imperformant? no, but the change
> in /tmp being on a partition vs. tmpfs makes quite
> some difference here
>
> even moving from one partition to another will give
> measurable difference here, all within a small margin
very interesting thanks.
> an interesting aspect is the gain (or loss) you have
> when you start several guests basically doing the
> same thing (and sharing the same files, etc)
we have these in the pipeline as well; we call them scalability tests: trying
to run as many containers as possible and see how performance drops (when the
kernel survives the test :)
OK, now I guess we want to make some kind of test plan.
C.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-13 22:51 ` Kir Kolyshkin
@ 2006-04-14 10:08 ` Cedric Le Goater
2006-04-15 19:31 ` Herbert Poetzl
0 siblings, 1 reply; 125+ messages in thread
From: Cedric Le Goater @ 2006-04-14 10:08 UTC (permalink / raw)
To: Kir Kolyshkin
Cc: Kirill Korotaev, Sam Vilain, devel, linux-kernel, Herbert Poetzl
Hello!
Kir Kolyshkin wrote:
> You made my day, I am really happy to hear that! Such testing and
> benchmarking should be done by an independent third party, and IBM fits
> that requirement just fine. It all makes much sense for everybody who's
> involved.
>
> If it is opened up (not just the results, but also the processes and
> tools), and all the projects will be able to contribute and help, that
> would be just great. We do a lot of testing in-house, and will be happy
> to contribute to such an independent testing/benchmarking project.
What we have in mind is something like http://test.kernel.org/ for each
patch set. I guess we will start humbly at the beginning :)
Initially, the idea was to test the patch series we've been sending on
lkml. But as we've been running tests on the existing solutions, openvz,
vserver, and our own prototype, we thought that extending the tests to all
of them was interesting and fair.
The goal is to promote lightweight containers in the linux kernel, so this
needs to be open.
> Speaking of live migration, we in OpenVZ plan to release our
> implementation as soon as next week.
We've been working on that topic for a long time, and we are very interested
in seeing what you've achieved! Migration tests are also an interesting topic
we could add to the container tests over time.
thanks,
C.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-14 9:56 ` Cedric Le Goater
@ 2006-04-15 19:29 ` Herbert Poetzl
0 siblings, 0 replies; 125+ messages in thread
From: Herbert Poetzl @ 2006-04-15 19:29 UTC (permalink / raw)
To: Cedric Le Goater
Cc: Kirill Korotaev, Sam Vilain, devel, Kir Kolyshkin, linux-kernel
On Fri, Apr 14, 2006 at 11:56:21AM +0200, Cedric Le Goater wrote:
> Hello!
>
> Herbert Poetzl wrote:
>
> > I would be really interested in getting comparisons
> > between vanilla kernels and linux-vserver patched
> > versions, especially vs2.1.1 and vs2.0.2 on the
> > same test setup with a minimum difference in config
>
> We did the tests last month and used the stable version: vs2.0.2rc9
> on a 2.6.15.4. Using benchmarks like dbench, tbench, and lmbench, the
> vserver patch has no impact; vserver overhead in a context is hardly
> measurable (<3%), with the same results for a Debian sarge running in a
> vserver.
with 2.1.1-rc16 they are not supposed to be measurable
at all, so if you measure any difference here, please
let me know about it, as I consider it an issue :)
> It is pretty difficult to follow everyone's patches. This makes the
> comparisons difficult, so we chose to normalize all the results with
> the native kernel results. But in a way, this is good, because the goal
> of these tests isn't to compare technologies but to measure their
> overhead and stability. And in the end, we don't care if openvz is
> faster than vserver; we want containers in the linux kernel to be fast
> and stable, one day :)
I'm completely with you here ...
> > I doubt that you can really compare across the
> > existing virtualization technologies, as it really
> > depends on the setup and hardware
>
> I agree these are very different technologies, but from a user's point
> of view they provide a similar service. So it is interesting to see
> what the drawbacks and benefits of each solution are. You want
> fault containment and strict isolation: here's the price. You want
> performance: here's another.
precisely, that's why there are different projects
and different aims ...
> Anyway, there's already enough attention on virtual machines, so we
> should concentrate on lightweight containers only.
>
> >> We'd like to continue in an open way. But first, we want to
> >> make sure we have the right tests, benchmarks, tools, versions,
> >> configuration, tuning, etc, before publishing any results :) We
> >> have some materials already but before proposing we would like to
> >> have your comments and advice on what we should or shouldn't use.
> >
> > In my experience it is extremely hard to do 'proper'
> > comparisons, because the slightest change of the
> > environment can cause big differences ...
> >
> > here as example, a kernel build (-j99) on 2.6.16
> > on a test host, with and without a chroot:
> >
> > without:
> >
> > 451.03user 26.27system 2:00.38elapsed 396%CPU
> > 449.39user 26.21system 1:59.95elapsed 396%CPU
> > 447.40user 25.86system 1:59.79elapsed 395%CPU
> >
> > now with:
> >
> > 490.77user 24.45system 2:13.35elapsed 386%CPU
> > 489.69user 24.50system 2:12.60elapsed 387%CPU
> > 490.41user 24.99system 2:12.22elapsed 389%CPU
> >
> > now, is chroot() really that slow? no, but the change
> > in /tmp being on a partition vs. tmpfs makes quite
> > some difference here
> >
> > even moving from one partition to another will give
> > measurable difference here, all within a small margin
>
> very interesting thanks.
>
> > an interesting aspect is the gain (or loss) you have
> > when you start several guests basically doing the
> > same thing (and sharing the same files, etc)
>
> we have these in the pipeline as well; we call them scalability tests:
> trying to run as many containers as possible and see how performance
> drops (when the kernel survives the test :)
yes, might want to check with and without unification
here too, as I think you can reach more than 100% native
speed in the multi guest scenario with that :)
> OK, now I guess we want to make some kind of test plan.
sounds good, please keep me posted ...
best,
Herbert
> C.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-14 10:08 ` Cedric Le Goater
@ 2006-04-15 19:31 ` Herbert Poetzl
0 siblings, 0 replies; 125+ messages in thread
From: Herbert Poetzl @ 2006-04-15 19:31 UTC (permalink / raw)
To: Cedric Le Goater
Cc: Kir Kolyshkin, Kirill Korotaev, Sam Vilain, devel, linux-kernel
On Fri, Apr 14, 2006 at 12:08:05PM +0200, Cedric Le Goater wrote:
> Hello!
>
> Kir Kolyshkin wrote:
>
> > You made my day, I am really happy to hear that! Such testing and
> > benchmarking should be done by an independent third party, and
> > IBM fits that requirement just fine. It all makes much sense for
> > everybody who's involved.
> >
> > If it is opened up (not just the results, but also the processes and
> > tools), and all the projects will be able to contribute and help,
> > that would be just great. We do a lot of testing in-house, and will
> > be happy to contribute to such an independent testing/benchmarking
> > project.
>
> What we have in mind is something like http://test.kernel.org/ for
> each patch set. I guess we will start humbly at the beginning :)
>
> Initially, the idea was to test the patch series we've been sending on
> lkml. But as we've been running tests on the existing solutions, openvz,
> vserver, and our own prototype, we thought that extending the tests to
> all of them was interesting and fair.
would be really great if you could extend that to something
like the PLM where folks (like linux-vserver and openvz) can
test their patches against mainline kernels in a fairly
automated way ...
I guess that would be some initial work, but could improve
many other patches (not only those related to virtualization)
best,
Herbert
> The goal is to promote lightweight containers in the linux kernel, so
> this needs to be open.
>
> > Speaking of live migration, we in OpenVZ plan to release our
> > implementation as soon as next week.
>
> We've been working on that topic for a long time, and we are very
> interested in seeing what you've achieved! Migration tests are also an
> interesting topic we could add to the container tests over time.
>
> thanks,
>
> C.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-11 10:38 ` Kirill Korotaev
2006-04-11 16:20 ` Herbert Poetzl
2006-04-12 5:12 ` Andi Kleen
@ 2006-04-30 13:22 ` Bill Davidsen
2006-04-30 21:34 ` Sam Vilain
2006-05-01 12:27 ` Kirill Korotaev
2 siblings, 2 replies; 125+ messages in thread
From: Bill Davidsen @ 2006-04-30 13:22 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Kir Kolyshkin, akpm, Nick Piggin, sam, linux-kernel,
Eric W. Biederman, serue, Alexey Kuznetsov, herbert
Kirill Korotaev wrote:
> Bill,
>
>>> OpenVZ will have live zero downtime migration and suspend/resume
>>> some time next month.
>>>
>> Please clarify. Currently a migration involves:
>> - stopping or suspending the instance
>> - backing up the instance and all of its data
>> - creating an environment for the instance on a new machine
>> - transporting the data to a new machine
>> - installing the instance and all data
>> - starting the instance
>
>
>> If you could just briefly cover how you do each of these steps with zero
>> downtime...
>
>
> it does exactly what you wrote, with some additional minor steps such as
> stopping networking on the source and starting it on the destination, etc.
>
> So I would detail it like this:
> - freeze VPS
when the VM stops providing services it's down as far as I'm concerned
> - freeze networking
> - copy VPS data to destination
> - dump VPS
> - copy dump to the destination
> - restore VPS
> - unfreeze VPS
and here is where my service is available again. The server may not know
it's been down, but the clients will.
> - kill original VPS on source
>
> Moreover, OpenVZ live migration allows migrating 32-bit VPSs
> between i686 and x86-64 Linux machines.
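The sequence Kirill lists can be sketched as a script. To be clear about what is assumed: the `vps_*` helper commands, the host names, and the paths are placeholders of my own, not real OpenVZ tools, and `SSH="echo +"` keeps this a dry run that only prints each step.

```shell
# Pseudocode sketch of the migration sequence above; vps_* are placeholder
# commands. Replace SSH with "ssh" (and the helpers with real tools) to adapt.
SSH="echo +"
SRC=node1 DST=node2 VPS=101

$SSH $SRC vps_freeze  $VPS                 # 1. freeze VPS processes
$SSH $SRC net_freeze  $VPS                 # 2. hold packets; no RST is sent
$SSH $SRC rsync -a /vz/private/$VPS/ $DST:/vz/private/$VPS/  # 3. copy VPS data
$SSH $SRC vps_dump    $VPS /tmp/$VPS.dump  # 4. checkpoint in-kernel state
$SSH $SRC scp /tmp/$VPS.dump $DST:/tmp/    # 5. copy the dump over
$SSH $DST vps_restore $VPS /tmp/$VPS.dump  # 6. rebuild processes from the dump
$SSH $DST vps_unfreeze $VPS                # 7. resume; clients see only a delay
$SSH $SRC vps_destroy $VPS                 # 8. kill the original on the source
```

The perceived downtime is the window between steps 1 and 7; since the bulk data copy (step 3) happens while data is already frozen in this listing, real implementations typically pre-copy most data before the freeze to keep that window short.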
I guess you're using "zero downtime" as a marketing term rather than a
technical term.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-30 13:22 ` Bill Davidsen
@ 2006-04-30 21:34 ` Sam Vilain
2006-05-01 12:27 ` Kirill Korotaev
1 sibling, 0 replies; 125+ messages in thread
From: Sam Vilain @ 2006-04-30 21:34 UTC (permalink / raw)
To: Bill Davidsen
Cc: Kirill Korotaev, Kir Kolyshkin, akpm, Nick Piggin, linux-kernel,
Eric W. Biederman, serue, Alexey Kuznetsov, herbert
Bill Davidsen wrote:
> Kirill Korotaev wrote:
>> Bill,
>>
>>>> OpenVZ will have live zero downtime migration and suspend/resume
>>>> some time next month.
>>>
>>> Please clarify. Currently a migration involves:
>>> - stopping or suspending the instance
>>> - backing up the instance and all of its data
>>> - creating an environment for the instance on a new machine
>>> - transporting the data to a new machine
>>> - installing the instance and all data
>>> - starting the instance
>>>
>>> If you could just briefly cover how you do each of these steps with
>>> zero downtime...
>>
>> it does exactly what you wrote, with some additional minor steps such
>> as stopping networking on the source and starting it on the
>> destination, etc.
>>
>> So I would detail it like this:
>> - freeze VPS
>
> when the VM stops providing services it's down as far as I'm concerned

You're entirely nitpicking.

Sam.

>> - freeze networking
>> - copy VPS data to destination
>> - dump VPS
>> - copy dump to the destination
>> - restore VPS
>> - unfreeze VPS
>
> and here is where my service is available again. The server may not know
> it's been down, but the clients will.
>
>> - kill original VPS on source
>>
>> Moreover, OpenVZ live migration allows migrating 32-bit VPSs
>> between i686 and x86-64 Linux machines.
>
> I guess you're using "zero downtime" as a marketing term rather than a
> technical term.
* Re: [Devel] Re: [RFC] Virtualization steps
2006-04-30 13:22 ` Bill Davidsen
2006-04-30 21:34 ` Sam Vilain
@ 2006-05-01 12:27 ` Kirill Korotaev
2006-05-03 20:32 ` Bill Davidsen
1 sibling, 1 reply; 125+ messages in thread
From: Kirill Korotaev @ 2006-05-01 12:27 UTC (permalink / raw)
To: Bill Davidsen
Cc: Kir Kolyshkin, akpm, Nick Piggin, sam, linux-kernel,
Eric W. Biederman, serue, Alexey Kuznetsov, herbert
Bill,
>> So I would detail it like this:
>> - freeze VPS
>
> when the VM stops providing services it's down as far as I'm concerned
please note that connections are not dropped, new connections are not
answered with a RESET, and while the VM is migrated all the clients are
serviced as if nothing had happened. From the client's point of view there
is only a small delay in servicing, not real downtime (where clients are
rejected). Maybe that is why some people call it zero downtime. From a
technical POV it is not the best term, for sure; it is better to call it
checkpointing/restore or live migration.
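The drop-versus-reset distinction Kirill makes can be shown with standard iptables rules. This is an illustrative sketch, not how OpenVZ actually freezes networking internally; port 80 is an example, and `IPT="echo iptables"` keeps it a dry run (set `IPT=iptables` as root to apply for real).

```shell
# Why a frozen guest keeps its connections: if packets are silently dropped,
# TCP peers just retransmit and established connections survive the freeze
# window, appearing only as added latency. Answering with a reset instead
# would tear down every client connection immediately.
IPT="echo iptables"   # dry run; use IPT=iptables (as root) on a real host

# freeze window: silently drop inbound traffic to the service
$IPT -A INPUT -p tcp --dport 80 -j DROP

# what NOT to do -- a reset kills the connections the freeze should preserve:
# $IPT -A INPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset

# after restore on the destination, lift the freeze
$IPT -D INPUT -p tcp --dport 80 -j DROP
```

This is the same reason new connections survive: the client's SYN gets no answer and is retried, rather than being refused with an RST.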
>> - freeze networking
>> - copy VPS data to destination
>> - dump VPS
>> - copy dump to the destination
>> - restore VPS
>> - unfreeze VPS
>
> and here is where my service is available again. The server may not know
> it's been down, but the clients will.
>
>> - kill original VPS on source
>>
>> Moreover, OpenVZ live migration allows migrating 32-bit VPSs
>> between i686 and x86-64 Linux machines.
>
> I guess you're using "zero downtime" as a marketing term rather than a
> technical term.
Thanks,
Kirill
* Re: [Devel] Re: [RFC] Virtualization steps
2006-05-01 12:27 ` Kirill Korotaev
@ 2006-05-03 20:32 ` Bill Davidsen
0 siblings, 0 replies; 125+ messages in thread
From: Bill Davidsen @ 2006-05-03 20:32 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Kir Kolyshkin, sam, linux-kernel, Eric W. Biederman, serue,
Alexey Kuznetsov, herbert
Kirill Korotaev wrote:
> Bill,
>
>>> So I would detail it like this:
>>> - freeze VPS
>>
>> when the VM stops providing services it's down as far as I'm concerned
> please note that connections are not dropped, new connections are not
> answered with a RESET, and while the VM is migrated all the clients are
> serviced as if nothing had happened. From the client's point of view
> there is only a small delay in servicing, not real downtime (where
> clients are rejected). Maybe that is why some people call it zero
> downtime. From a technical POV it is not the best term, for sure; it is
> better to call it checkpointing/restore or live migration.
With that I can agree. The argument that it's not down, it's just
unavailable, isn't convincing. I think "live migration" is a really good
description of what takes place; thanks for the nomenclature.
--
-bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
end of thread, other threads:[~2006-05-03 20:35 UTC | newest]
Thread overview: 125+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-03-24 17:19 [RFC] Virtualization steps Kirill Korotaev
2006-03-24 17:33 ` Nick Piggin
2006-03-24 19:25 ` Dave Hansen
2006-03-24 19:53 ` Eric W. Biederman
2006-03-28 4:28 ` Bill Davidsen
2006-03-28 5:31 ` Sam Vilain
2006-03-28 6:45 ` [Devel] " Kir Kolyshkin
2006-03-28 21:59 ` Sam Vilain
2006-03-28 22:24 ` Kir Kolyshkin
2006-03-28 23:28 ` Sam Vilain
2006-03-29 9:13 ` Kirill Korotaev
2006-03-29 11:08 ` Sam Vilain
2006-03-29 13:45 ` Herbert Poetzl
2006-03-29 14:47 ` Kirill Korotaev
2006-03-29 17:29 ` Herbert Poetzl
2006-03-29 21:37 ` Sam Vilain
2006-04-12 8:28 ` Kirill Korotaev
2006-04-13 1:05 ` Herbert Poetzl
2006-04-13 6:52 ` Kirill Korotaev
2006-04-13 13:42 ` Herbert Poetzl
2006-04-13 21:33 ` Cedric Le Goater
2006-04-13 22:45 ` Herbert Poetzl
2006-04-14 7:41 ` Kirill Korotaev
2006-04-14 9:56 ` Cedric Le Goater
2006-04-15 19:29 ` Herbert Poetzl
2006-04-13 22:51 ` Kir Kolyshkin
2006-04-14 10:08 ` Cedric Le Goater
2006-04-15 19:31 ` Herbert Poetzl
2006-03-28 8:52 ` Herbert Poetzl
2006-03-28 9:00 ` Nick Piggin
2006-03-28 14:26 ` Herbert Poetzl
2006-03-28 14:44 ` Nick Piggin
2006-03-29 6:05 ` Eric W. Biederman
2006-03-29 6:19 ` Sam Vilain
2006-03-29 18:20 ` Chris Wright
2006-03-29 22:36 ` Sam Vilain
2006-03-29 22:52 ` Chris Wright
2006-03-29 23:01 ` Sam Vilain
2006-03-29 23:13 ` Chris Wright
2006-03-29 23:18 ` Sam Vilain
2006-03-29 23:28 ` Chris Wright
2006-03-30 1:02 ` Eric W. Biederman
2006-03-30 1:36 ` Chris Wright
2006-03-30 1:41 ` David Lang
2006-03-30 2:04 ` Chris Wright
2006-03-30 14:32 ` Serge E. Hallyn
2006-03-30 15:30 ` Herbert Poetzl
2006-03-30 16:43 ` Serge E. Hallyn
2006-03-30 18:00 ` Eric W. Biederman
2006-03-31 13:40 ` Serge E. Hallyn
2006-03-30 16:07 ` Stephen Smalley
2006-03-30 16:15 ` Serge E. Hallyn
2006-03-30 18:55 ` Chris Wright
2006-03-30 18:44 ` Eric W. Biederman
2006-03-30 19:07 ` Chris Wright
2006-03-31 5:36 ` Eric W. Biederman
2006-03-31 5:51 ` Chris Wright
2006-03-31 6:52 ` Eric W. Biederman
2006-03-30 18:53 ` Chris Wright
2006-03-30 2:48 ` Eric W. Biederman
2006-03-30 19:23 ` Chris Wright
2006-03-31 6:00 ` Eric W. Biederman
2006-03-31 14:52 ` Stephen Smalley
2006-03-31 16:39 ` Eric W. Biederman
2006-03-30 13:29 ` Serge E. Hallyn
2006-03-30 13:37 ` Eric W. Biederman
2006-03-30 14:55 ` Serge E. Hallyn
2006-03-30 2:24 ` Sam Vilain
2006-03-30 3:01 ` Eric W. Biederman
2006-03-30 3:26 ` Nick Piggin
2006-03-30 10:30 ` Eric W. Biederman
2006-04-11 10:32 ` Kirill Korotaev
2006-04-11 11:14 ` Nick Piggin
2006-04-11 14:44 ` Kirill Korotaev
2006-03-28 9:00 ` Kirill Korotaev
2006-03-28 14:41 ` Bill Davidsen
2006-03-28 15:03 ` Eric W. Biederman
2006-03-28 17:48 ` Jeff Dike
2006-03-28 23:07 ` Sam Vilain
2006-03-29 20:56 ` Bill Davidsen
2006-03-28 20:29 ` [Devel] " Jun OKAJIMA
2006-03-28 20:50 ` Kir Kolyshkin
2006-03-28 21:38 ` Jun OKAJIMA
2006-03-28 21:51 ` Eric W. Biederman
2006-03-28 23:18 ` Sam Vilain
2006-04-03 16:47 ` Bill Davidsen
2006-04-11 10:38 ` Kirill Korotaev
2006-04-11 16:20 ` Herbert Poetzl
2006-04-11 18:12 ` Kir Kolyshkin
2006-04-12 5:12 ` Andi Kleen
2006-04-12 6:55 ` Kirill Korotaev
2006-04-12 6:53 ` Andi Kleen
2006-04-12 7:51 ` Kirill Korotaev
2006-04-12 17:03 ` Andi Kleen
2006-04-12 17:20 ` Eric W. Biederman
2006-04-13 16:54 ` Alexey Kuznetsov
2006-04-30 13:22 ` Bill Davidsen
2006-04-30 21:34 ` Sam Vilain
2006-05-01 12:27 ` Kirill Korotaev
2006-05-03 20:32 ` Bill Davidsen
2006-03-28 9:02 ` Kirill Korotaev
2006-03-28 9:15 ` Nick Piggin
2006-03-28 15:35 ` Herbert Poetzl
2006-03-28 15:53 ` Nick Piggin
2006-03-28 16:31 ` Eric W. Biederman
2006-03-29 21:37 ` Bill Davidsen
2006-03-28 16:15 ` Eric W. Biederman
2006-03-28 23:04 ` Sam Vilain
2006-03-29 1:39 ` Kirill Korotaev
2006-03-29 13:47 ` Herbert Poetzl
2006-03-28 15:48 ` [Devel] " Matt Ayres
2006-03-28 16:42 ` Eric W. Biederman
2006-03-28 17:04 ` Matt Ayres
2006-03-29 0:55 ` Kirill Korotaev
2006-03-24 18:36 ` Eric W. Biederman
2006-03-24 21:19 ` Herbert Poetzl
2006-03-27 18:45 ` Eric W. Biederman
2006-03-28 8:51 ` Kirill Korotaev
2006-03-28 12:53 ` Serge E. Hallyn
2006-03-28 22:51 ` Sam Vilain
2006-03-29 20:30 ` Dave Hansen
2006-03-29 20:47 ` Eric W. Biederman
2006-03-29 22:44 ` Sam Vilain
2006-03-30 13:51 ` Kirill Korotaev
2006-03-28 21:58 ` Eric W. Biederman