* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-18 18:55 ` [RFC PATCH 0/4] namespacefs: Proof-of-Concept Eric W. Biederman
@ 2021-11-18 19:02 ` Steven Rostedt
2021-11-18 19:22 ` Eric W. Biederman
2021-11-18 19:24 ` Steven Rostedt
2021-11-19 14:26 ` Yordan Karadzhov
2 siblings, 1 reply; 21+ messages in thread
From: Steven Rostedt @ 2021-11-18 19:02 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
mingo, hagen, rppt, James.Bottomley, akpm, vvs, shakeelb,
christian.brauner, mkoutny, Linux Containers
On Thu, 18 Nov 2021 12:55:07 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Eric
Eric,
As you can see, the subject says "Proof-of-Concept" and every patch in
the series says "RFC". All you did was point out problems with no help in
fixing those problems, and then gave a nasty Nacked-by before it even got
into a conversation.
From this response, I have to say:
It is not correct to nack a proof of concept that is asking for
discussion.
So, I nack your nack, because it's way too early to nack this.
-- Steve
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-18 19:02 ` Steven Rostedt
@ 2021-11-18 19:22 ` Eric W. Biederman
2021-11-18 19:36 ` Steven Rostedt
0 siblings, 1 reply; 21+ messages in thread
From: Eric W. Biederman @ 2021-11-18 19:22 UTC (permalink / raw)
To: Steven Rostedt
Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
mingo, hagen, rppt, James.Bottomley, akpm, vvs, shakeelb,
christian.brauner, mkoutny, Linux Containers
Steven Rostedt <rostedt@goodmis.org> writes:
> On Thu, 18 Nov 2021 12:55:07 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> Eric
>
> Eric,
>
> As you can see, the subject says "Proof-of-Concept" and every patch in
> the series says "RFC". All you did was point out problems with no help in
> fixing those problems, and then gave a nasty Nacked-by before it even got
> into a conversation.
>
> From this response, I have to say:
>
> It is not correct to nack a proof of concept that is asking for
> discussion.
>
> So, I nack your nack, because it's way too early to nack this.
I am refreshing my nack on the concept. My nack has been in place for
good technical reasons since about 2006.
I see no way forward. I do not see a compelling use case.
There have been many conversations in the past attempting to implement
something that requires a namespace of namespaces, and they have never
gotten anywhere.
I see no attempt at due diligence or at actually understanding what
hierarchy already exists in namespaces.
I don't mean to be nasty but I do mean to be clear. Without a
compelling new idea in this space I see no hope of an implementation.
What they are attempting to do makes it impossible to migrate a set of
processes that uses this feature from one machine to another. AKA this
would be a breaking change and a regression if merged.
The breaking and regression are caused by assigning names to namespaces
without putting those names into a namespace of their own. That
appears fundamental to the concept not to the implementation.
Since the concept, if merged, would cause a regression, it qualifies for
a nack.
We can explore what problems they are trying to solve with this and
explore other ways to solve those problems. All I saw was a comment
about monitoring tools and wanting a global view. I did not see
any comments about dealing with all of the reasons why a global view
tends to be a bad idea.
I should have added that we have, to some extent, a way to walk through
namespaces using ioctls on nsfs inodes.
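(For reference, a rough sketch of what walking a namespace's ancestry with those ioctls can look like from userspace: NS_GET_PARENT and NS_GET_USERNS come from <linux/nsfs.h>, as documented in ioctl_ns(2). The Python wrapper below is illustrative only, not part of any patch.)

```python
# Walk a namespace's ancestry with the nsfs ioctls from <linux/nsfs.h>.
# Linux-only; NS_GET_PARENT fails with EPERM once the initial (or an
# inaccessible) namespace is reached, which ends the walk.
import fcntl
import os

NSIO = 0xb7                        # ioctl magic from <linux/nsfs.h>
NS_GET_USERNS = (NSIO << 8) | 0x1  # _IO(NSIO, 0x1): owning user namespace
NS_GET_PARENT = (NSIO << 8) | 0x2  # _IO(NSIO, 0x2): parent pid/user ns

def ns_ancestry(path="/proc/self/ns/pid"):
    """Return the chain of namespace inode numbers, walking parents."""
    chain = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chain.append(os.fstat(fd).st_ino)
            try:
                # ioctl returns a new fd referring to the parent ns
                parent = fcntl.ioctl(fd, NS_GET_PARENT)
            except OSError:        # EPERM: no accessible parent left
                break
            os.close(fd)
            fd = parent
    finally:
        os.close(fd)
    return chain
```

Whether this is robust enough to build tooling on is exactly the question raised later in the thread.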
Eric
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-18 19:22 ` Eric W. Biederman
@ 2021-11-18 19:36 ` Steven Rostedt
0 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2021-11-18 19:36 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
mingo, hagen, rppt, James.Bottomley, akpm, vvs, shakeelb,
christian.brauner, mkoutny, Linux Containers
On Thu, 18 Nov 2021 13:22:16 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:
> Steven Rostedt <rostedt@goodmis.org> writes:
> >
> I am refreshing my nack on the concept. My nack has been in place for
> good technical reasons since about 2006.
I'll admit, we are new to this, as we are now trying to add more visibility
into the workings of things like kubernetes. And having a way of knowing
what containers are running and how to monitor them is needed, and we need
to do this for all container infrastructures.
>
> I see no way forward. I do not see a compelling use case.
What do you use to debug issues in a kubernetes cluster of hundreds of
machines running thousands of containers? Currently, if something is amiss,
a node is restarted in the hopes that the issue does not appear again. But
we would like to add infrastructure that takes advantage of tracing and
profiling to be able to narrow that down. But to do so, we need to
understand what tasks belong to what containers.
>
> There have been many conversations in the past attempting to implement
> something that requires a namespace of namespaces, and they have never
> gotten anywhere.
We are not asking about a "namespace" of namespaces, but a filesystem (one,
not a namespace of one) that holds the information at the system scale,
not a container view.
I would be happy to implement something that makes a container with this
file system available "special", as most containers do not need this.
>
> I see no attempt at due diligence or at actually understanding what
> hierarchy already exists in namespaces.
This is not trivial. What did we miss?
>
> I don't mean to be nasty but I do mean to be clear. Without a
> compelling new idea in this space I see no hope of an implementation.
>
> What they are attempting to do makes it impossible to migrate a set of
> processes that uses this feature from one machine to another. AKA this
> would be a breaking change and a regression if merged.
The point of this is not to allow that migration. I'd be happy to add that
if a container has access to this file system, it is pinned to the system
and can not be migrated. The whole point of this file system is to monitor
all containers on the system, and it makes no sense to migrate it.
We would duplicate it over several systems, but there's no reason to move
it once it is running.
>
> The breaking and regression are caused by assigning names to namespaces
> without putting those names into a namespace of their own. That
> appears fundamental to the concept not to the implementation.
If you think this should be migrated then yes, it is broken. But we don't
want this to work across migrations. That defeats the purpose of this work.
>
> Since the concept, if merged, would cause a regression, it qualifies for
> a nack.
>
> We can explore what problems they are trying to solve with this and
> explore other ways to solve those problems. All I saw was a comment
> about monitoring tools and wanting a global view. I did not see
> any comments about dealing with all of the reasons why a global view
> tends to be a bad idea.
If you only care about a working environment of the system that runs a set
of containers, how is that a bad idea? Again, I'm happy with implementing
something that makes having this file system prevent it from being
migrated. A pinned privileged container.
>
> I should have added that we have, to some extent, a way to walk through
> namespaces using ioctls on nsfs inodes.
How robust is this? And is there a library or tooling around it?
-- Steve
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-18 18:55 ` [RFC PATCH 0/4] namespacefs: Proof-of-Concept Eric W. Biederman
2021-11-18 19:02 ` Steven Rostedt
@ 2021-11-18 19:24 ` Steven Rostedt
2021-11-19 9:50 ` Kirill Tkhai
2021-11-19 12:45 ` James Bottomley
2021-11-19 14:26 ` Yordan Karadzhov
2 siblings, 2 replies; 21+ messages in thread
From: Steven Rostedt @ 2021-11-18 19:24 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
mingo, hagen, rppt, James.Bottomley, akpm, vvs, shakeelb,
christian.brauner, mkoutny, Linux Containers
On Thu, 18 Nov 2021 12:55:07 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:
> It is not correct to use inode numbers as the actual names for
> namespaces.
>
> I can not see anything else you can possibly use as names for
> namespaces.
This is why we used inode numbers.
>
> To allow container migration between machines and similar things
> then you wind up needing a namespace for your names of namespaces.
Is this why you say inode numbers are incorrect?
There's no reason to make this into its own namespace. Ideally, this file
system should only be for privileged containers, as the entire point of this
file system is to monitor the other containers on the system. In other
words, this file system is not to be used like procfs, but instead as a
global view of the containers running on the host.
At first, we were not going to let this file system be part of any
namespace but the host itself, but because we want to wrap up tooling into
a container that we can install on other machines as a way to monitor the
containers on each machine, we had to open that up.
>
> Further you talk about hierarchy and you have not added support for the
> user namespace. Without the user namespace there is not hierarchy with
> any namespace but the pid namespace. There is definitely no meaningful
> hierarchy without the user namespace.
Great, help us implement this.
>
> As far as I can tell merging this will break CRIU and container
> migration in general (as the namespace of namespaces problem is not
> solved).
This is not to be a file system that is to be migrated. As the point of
this file system is to monitor the other containers, it does not make
sense to migrate it.
>
> Since you are not solving the problem of a namespace for namespaces,
> yet implementing something that requires it.
Why is it needed?
>
> Since you are implementing hierarchy and ignoring the user namespace
> which gives structure and hierarchy to the namespaces.
We are not ignoring it, we are RFC'ing for advice on how to implement it.
>
> Since this breaks existing use cases without giving a solution.
You don't understand proof-of-concepts and RFCs, do you?
-- Steve
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-18 19:24 ` Steven Rostedt
@ 2021-11-19 9:50 ` Kirill Tkhai
2021-11-19 12:45 ` James Bottomley
1 sibling, 0 replies; 21+ messages in thread
From: Kirill Tkhai @ 2021-11-19 9:50 UTC (permalink / raw)
To: Steven Rostedt, Eric W. Biederman
Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
mingo, hagen, rppt, James.Bottomley, akpm, vvs, shakeelb,
christian.brauner, mkoutny, Linux Containers
On 18.11.2021 22:24, Steven Rostedt wrote:
> On Thu, 18 Nov 2021 12:55:07 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> It is not correct to use inode numbers as the actual names for
>> namespaces.
>>
>> I can not see anything else you can possibly use as names for
>> namespaces.
>
> This is why we used inode numbers.
The migration problem may be solved if the new filesystem allows rename.
The kernel may use a random UUID as the initial namespace file name. After
the migration, we recreate this namespace, and it will have another UUID
generated by the kernel. Then we just rename it to the correct one.
I sent something like this for /proc fs (except rename):
http://archive.lwn.net:8080/linux-fsdevel/97fdcff1-1cce-7eab-6449-7fe10451162d@virtuozzo.com/T/#m7579f79a6ba8422b57463049f52d2043986b5cac
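(Illustration only: a userspace toy model of the rename scheme, with a plain directory standing in for the proposed filesystem; the helper names here are hypothetical, not anything in the patch set.)

```python
# Toy model of the rename scheme: the kernel exposes a new namespace
# entry under a random UUID, and on restore the migration tool renames
# the freshly generated entry back to the name saved at checkpoint.
import os
import tempfile
import uuid

def create_ns_entry(root):
    """Kernel side: expose a new namespace under a random UUID."""
    name = str(uuid.uuid4())
    os.mkdir(os.path.join(root, name))
    return name

def restore_ns_entry(root, fresh, saved):
    """Restore side: give the recreated namespace its checkpointed name."""
    os.rename(os.path.join(root, fresh), os.path.join(root, saved))

root = tempfile.mkdtemp()
saved = create_ns_entry(root)        # name recorded at checkpoint time
os.rmdir(os.path.join(root, saved))  # the namespace dies with the source host
fresh = create_ns_entry(root)        # restore: kernel picks a new UUID
restore_ns_entry(root, fresh, saved) # tool renames it to the saved name
```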
>>
>> To allow container migration between machines and similar things
>> then you wind up needing a namespace for your names of namespaces.
>
> Is this why you say inode numbers are incorrect?
>
> There's no reason to make this into its own namespace. Ideally, this file
> system should only be for privileged containers, as the entire point of
> this file system is to monitor the other containers on the system. In
> other words, this file system is not to be used like procfs, but instead
> as a global view of the containers running on the host.
>
> At first, we were not going to let this file system be part of any
> namespace but the host itself, but because we want to wrap up tooling into
> a container that we can install on other machines as a way to monitor the
> containers on each machine, we had to open that up.
>
>>
>> Further you talk about hierarchy and you have not added support for the
>> user namespace. Without the user namespace there is no hierarchy with
>> any namespace but the pid namespace. There is definitely no meaningful
>> hierarchy without the user namespace.
>
> Great, help us implement this.
>
>>
>> As far as I can tell merging this will break CRIU and container
>> migration in general (as the namespace of namespaces problem is not
>> solved).
>
> This is not to be a file system that is to be migrated. As the point of
> this file system is to monitor the other containers, it does not make
> sense to migrate it.
>
>>
>> Since you are not solving the problem of a namespace for namespaces,
>> yet implementing something that requires it.
>
> Why is it needed?
>
>>
>> Since you are implementing hierarchy and ignoring the user namespace
>> which gives structure and hierarchy to the namespaces.
>
> We are not ignoring it, we are RFC'ing for advice on how to implement it.
>
>>
>> Since this breaks existing use cases without giving a solution.
>
> You don't understand proof-of-concepts and RFCs, do you?
>
> -- Steve
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-18 19:24 ` Steven Rostedt
2021-11-19 9:50 ` Kirill Tkhai
@ 2021-11-19 12:45 ` James Bottomley
2021-11-19 14:27 ` Steven Rostedt
1 sibling, 1 reply; 21+ messages in thread
From: James Bottomley @ 2021-11-19 12:45 UTC (permalink / raw)
To: Steven Rostedt, Eric W. Biederman
Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
mingo, hagen, rppt, akpm, vvs, shakeelb, christian.brauner,
mkoutny, Linux Containers
On Thu, 2021-11-18 at 14:24 -0500, Steven Rostedt wrote:
> On Thu, 18 Nov 2021 12:55:07 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
> > It is not correct to use inode numbers as the actual names for
> > namespaces.
> >
> > I can not see anything else you can possibly use as names for
> > namespaces.
>
> This is why we used inode numbers.
>
> > To allow container migration between machines and similar things
> > then you wind up needing a namespace for your names of namespaces.
>
> Is this why you say inode numbers are incorrect?
The problem is you seem to have picked on one orchestration system
without considering all the uses of namespaces and how this would
impact them. So let me explain why inode numbers are incorrect and it
will possibly illuminate some of the cans of worms you're opening.
We have a container checkpoint/restore system called CRIU that can be
used to snapshot the state of a pid subtree and restore it. It can be
used for the entire system or piece of it. It is also used by some
orchestration systems to live migrate containers. Any property of a
container system that has meaning must be saved and restored by CRIU.
The inode number is simply a semi-random number assigned to the
namespace. It shows up in /proc/<pid>/ns but nowhere else and isn't
used by anything. When CRIU migrates or restores containers, all the
namespaces that compose them get different inode values on the restore.
If you want to make the inode number equivalent to the container name,
they'd have to restore to the previous number because you've made it a
property of the namespace. The way everything is set up now, that's
just not possible and never will be. Inode numbers are a 32 bit space
and can't be globally unique. If you want a container name, it will
have to be something like a new UUID and that's the first problem you
should tackle.
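(Concretely, the number being discussed is just the nsfs inode that /proc exposes; a quick Linux-only check that the symlink text and st_ino agree. Illustrative snippet, not part of the series.)

```python
# The "name" under discussion is nothing more than the nsfs inode
# number that /proc reports for each task's namespaces.
import os

def ns_inode(path="/proc/self/ns/pid"):
    """Parse the namespace id out of the /proc symlink text."""
    link = os.readlink(path)              # e.g. "pid:[4026531836]"
    return int(link.partition("[")[2].rstrip("]"))

# The symlink text and st_ino agree: the identifier is just an inode
# number, which is exactly what a CRIU restore cannot preserve.
assert ns_inode() == os.stat("/proc/self/ns/pid").st_ino
```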
James
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-19 12:45 ` James Bottomley
@ 2021-11-19 14:27 ` Steven Rostedt
2021-11-19 16:42 ` James Bottomley
[not found] ` <f6ca1f5bdb3b516688f291d9685a6a59f49f1393.camel@HansenPartnership.com>
0 siblings, 2 replies; 21+ messages in thread
From: Steven Rostedt @ 2021-11-19 14:27 UTC (permalink / raw)
To: James Bottomley
Cc: "Eric W. Biederman\" <ebiederm@xmission.com>, "
On Fri, 19 Nov 2021 07:45:01 -0500
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> On Thu, 2021-11-18 at 14:24 -0500, Steven Rostedt wrote:
> > On Thu, 18 Nov 2021 12:55:07 -0600
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> >
> > > It is not correct to use inode numbers as the actual names for
> > > namespaces.
> > >
> > > I can not see anything else you can possibly use as names for
> > > namespaces.
> >
> > This is why we used inode numbers.
> >
> > > To allow container migration between machines and similar things
> > > then you wind up needing a namespace for your names of namespaces.
> >
> > Is this why you say inode numbers are incorrect?
>
> The problem is you seem to have picked on one orchestration system
> without considering all the uses of namespaces and how this would
> impact them. So let me explain why inode numbers are incorrect and it
> will possibly illuminate some of the cans of worms you're opening.
>
> We have a container checkpoint/restore system called CRIU that can be
> used to snapshot the state of a pid subtree and restore it. It can be
> used for the entire system or piece of it. It is also used by some
> orchestration systems to live migrate containers. Any property of a
> container system that has meaning must be saved and restored by CRIU.
>
> The inode number is simply a semi-random number assigned to the
> namespace. It shows up in /proc/<pid>/ns but nowhere else and isn't
> used by anything. When CRIU migrates or restores containers, all the
> namespaces that compose them get different inode values on the restore.
> If you want to make the inode number equivalent to the container name,
> they'd have to restore to the previous number because you've made it a
> property of the namespace. The way everything is set up now, that's
> just not possible and never will be. Inode numbers are a 32 bit space
> and can't be globally unique. If you want a container name, it will
> have to be something like a new UUID and that's the first problem you
> should tackle.
So everyone seems to be all upset about using inode numbers. We could do
what Kirill suggested and just create some random UUID and use that. We
could have a file in the directory called inode that has the inode number
(as that's what both docker and podman use to identify their containers,
and it's nice to have something to map back to them).
On checkpoint restore, only the directories that represent the container
that migrated matter, so as Kirill said, make sure they get the old UUID
name, and expose that as the directory.
If a container is looking at directories of other containers on the system,
and then gets migrated to another system, it should be treated as though
those directories were deleted under it.
I still do not see what the issue is here.
-- Steve
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-19 14:27 ` Steven Rostedt
@ 2021-11-19 16:42 ` James Bottomley
2021-11-19 17:14 ` Yordan Karadzhov
[not found] ` <f6ca1f5bdb3b516688f291d9685a6a59f49f1393.camel@HansenPartnership.com>
1 sibling, 1 reply; 21+ messages in thread
From: James Bottomley @ 2021-11-19 16:42 UTC (permalink / raw)
To: Steven Rostedt
Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
mingo, hagen, rppt, akpm, vvs, shakeelb, christian.brauner,
mkoutny, Linux Containers, Steven Rostedt, Eric W. Biederman
[resend due to header mangling causing loss on the lists]
On Fri, 2021-11-19 at 09:27 -0500, Steven Rostedt wrote:
> On Fri, 19 Nov 2021 07:45:01 -0500
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>
> > On Thu, 2021-11-18 at 14:24 -0500, Steven Rostedt wrote:
> > > On Thu, 18 Nov 2021 12:55:07 -0600
> > > ebiederm@xmission.com (Eric W. Biederman) wrote:
> > >
> > > > It is not correct to use inode numbers as the actual names for
> > > > namespaces.
> > > >
> > > > I can not see anything else you can possibly use as names for
> > > > namespaces.
> > >
> > > This is why we used inode numbers.
> > >
> > > > To allow container migration between machines and similar
> > > > things then you wind up needing a namespace for your names of
> > > > namespaces.
> > >
> > > Is this why you say inode numbers are incorrect?
> >
> > The problem is you seem to have picked on one orchestration system
> > without considering all the uses of namespaces and how this would
> > impact them. So let me explain why inode numbers are incorrect and
> > it will possibly illuminate some of the cans of worms you're
> > opening.
> >
> > We have a container checkpoint/restore system called CRIU that can
> > be used to snapshot the state of a pid subtree and restore it. It
> > can be used for the entire system or piece of it. It is also used
> > by some orchestration systems to live migrate containers. Any
> > property of a container system that has meaning must be saved and
> > restored by CRIU.
> >
> > The inode number is simply a semi-random number assigned to the
> > namespace. It shows up in /proc/<pid>/ns but nowhere else and
> > isn't used by anything. When CRIU migrates or restores containers,
> > all the namespaces that compose them get different inode values on
> > the restore. If you want to make the inode number equivalent to
> > the container name, they'd have to restore to the previous number
> > because you've made it a property of the namespace. The way
> > everything is set up now, that's just not possible and never will
> > be. Inode numbers are a 32 bit space and can't be globally
> > unique. If you want a container name, it will have to be something
> > like a new UUID and that's the first problem you should tackle.
>
> So everyone seems to be all upset about using inode numbers. We could
> do what Kirill suggested and just create some random UUID and use
> that. We could have a file in the directory called inode that has the
> inode number (as that's what both docker and podman use to identify
> their containers, and it's nice to have something to map back to
> them).
>
> On checkpoint restore, only the directories that represent the
> container that migrated matter, so as Kirill said, make sure they get
> the old UUID name, and expose that as the directory.
>
> If a container is looking at directories of other containers on the
> system, and then gets migrated to another system, it should be treated
> as though those directories were deleted under it.
>
> I still do not see what the issue is here.
The issue is you're introducing a new core property for namespaces they
didn't have before. Everyone has different use cases for containers
and we need to make sure the new property works with all of them.
Having a "name" for a namespace has been discussed before, which is the
landmine you stepped on when you advocated using the inode number as
the name, because that's already known to be unworkable.
Can we back up and ask what problem you're trying to solve before we
start introducing new objects like namespace name? The problem
statement just seems to be "Being able to see the structure of the
namespaces can be very useful in the context of the containerized
workloads." which you later expanded on as "trying to add more
visibility into the working of things like kubernetes". If you just
want to see the namespace "tree" you can script that (as root) by
matching the process tree and the /proc/<pid>/ns changes without
actually needing to construct it in the kernel. This can also be done
without introducing the concept of a namespace name. However, there is
a subtlety of doing this matching in the way I described in that you
don't get proper parenting to the user namespace ownership ... but that
seems to be something you don't want anyway?
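(A minimal sketch of the scripting approach described above: group tasks by the target of their /proc/<pid>/ns/pid symlink, with no new kernel interface. Run as root to see every task; illustrative only.)

```python
# Recover the pid-namespace grouping from /proc alone by matching the
# ns symlinks, as described above. No in-kernel support required.
import os
import re
from collections import defaultdict

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")  # e.g. "pid:[4026531836]"

def parse_ns_link(link):
    """Split a namespace symlink target into (ns_type, inode)."""
    m = _NS_LINK.match(link)
    if not m:
        raise ValueError(f"not a namespace link: {link!r}")
    return m.group(1), int(m.group(2))

def pids_by_pid_ns(proc="/proc"):
    """Map each pid-namespace inode to the pids running in it."""
    groups = defaultdict(list)
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            link = os.readlink(f"{proc}/{entry}/ns/pid")
        except OSError:            # task exited, or not permitted
            continue
        groups[parse_ns_link(link)[1]].append(int(entry))
    return dict(groups)
```

As noted, this gives the pid grouping but not the user-namespace ownership hierarchy, which needs the nsfs ioctls on top.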
James
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-19 16:42 ` James Bottomley
@ 2021-11-19 17:14 ` Yordan Karadzhov
2021-11-19 17:22 ` Steven Rostedt
2021-11-19 23:22 ` James Bottomley
0 siblings, 2 replies; 21+ messages in thread
From: Yordan Karadzhov @ 2021-11-19 17:14 UTC (permalink / raw)
To: James Bottomley, Steven Rostedt
Cc: linux-kernel, linux-fsdevel, viro, mingo, hagen, rppt, akpm, vvs,
shakeelb, christian.brauner, mkoutny, Linux Containers,
Eric W. Biederman
On 19.11.21 г. 18:42 ч., James Bottomley wrote:
> [resend due to header mangling causing loss on the lists]
> On Fri, 2021-11-19 at 09:27 -0500, Steven Rostedt wrote:
>> On Fri, 19 Nov 2021 07:45:01 -0500
>> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>>
>>> On Thu, 2021-11-18 at 14:24 -0500, Steven Rostedt wrote:
>>>> On Thu, 18 Nov 2021 12:55:07 -0600
>>>> ebiederm@xmission.com (Eric W. Biederman) wrote:
>>>>
>>>>> It is not correct to use inode numbers as the actual names for
>>>>> namespaces.
>>>>>
>>>>> I can not see anything else you can possibly use as names for
>>>>> namespaces.
>>>>
>>>> This is why we used inode numbers.
>>>>
>>>>> To allow container migration between machines and similar
>>>>> things then you wind up needing a namespace for your names of
>>>>> namespaces.
>>>>
>>>> Is this why you say inode numbers are incorrect?
>>>
>>> The problem is you seem to have picked on one orchestration system
>>> without considering all the uses of namespaces and how this would
>>> impact them. So let me explain why inode numbers are incorrect and
>>> it will possibly illuminate some of the cans of worms you're
>>> opening.
>>>
>>> We have a container checkpoint/restore system called CRIU that can
>>> be used to snapshot the state of a pid subtree and restore it. It
>>> can be used for the entire system or piece of it. It is also used
>>> by some orchestration systems to live migrate containers. Any
>>> property of a container system that has meaning must be saved and
>>> restored by CRIU.
>>>
>>> The inode number is simply a semi-random number assigned to the
>>> namespace. It shows up in /proc/<pid>/ns but nowhere else and
>>> isn't used by anything. When CRIU migrates or restores containers,
>>> all the namespaces that compose them get different inode values on
>>> the restore. If you want to make the inode number equivalent to
>>> the container name, they'd have to restore to the previous number
>>> because you've made it a property of the namespace. The way
>>> everything is set up now, that's just not possible and never will
>>> be. Inode numbers are a 32 bit space and can't be globally
>>> unique. If you want a container name, it will have to be something
>>> like a new UUID and that's the first problem you should tackle.
>>
>> So everyone seems to be all upset about using inode numbers. We could
>> do what Kirill suggested and just create some random UUID and use
>> that. We could have a file in the directory called inode that has the
>> inode number (as that's what both docker and podman use to identify
>> their containers, and it's nice to have something to map back to
>> them).
>>
>> On checkpoint restore, only the directories that represent the
>> container that migrated matter, so as Kirill said, make sure they get
>> the old UUID name, and expose that as the directory.
>>
>> If a container is looking at directories of other containers on the
>> system, and then gets migrated to another system, it should be treated
>> as though those directories were deleted under it.
>>
>> I still do not see what the issue is here.
>
> The issue is you're introducing a new core property for namespaces they
> didn't have before. Everyone has different use cases for containers
> and we need to make sure the new property works with all of them.
>
> Having a "name" for a namespace has been discussed before, which is the
> landmine you stepped on when you advocated using the inode number as
> the name, because that's already known to be unworkable.
>
> Can we back up and ask what problem you're trying to solve before we
> start introducing new objects like namespace name? The problem
> statement just seems to be "Being able to see the structure of the
> namespaces can be very useful in the context of the containerized
> workloads." which you later expanded on as "trying to add more
> visibility into the working of things like kubernetes". If you just
> want to see the namespace "tree" you can script that (as root) by
> matching the process tree and the /proc/<pid>/ns changes without
> actually needing to construct it in the kernel. This can also be done
> without introducing the concept of a namespace name. However, there is
> a subtlety of doing this matching in the way I described in that you
> don't get proper parenting to the user namespace ownership ... but that
> seems to be something you don't want anyway?
>
The major motivation is to be able to hook tracing to individual containers.
We want to be able to quickly discover the PIDs of all containers running on
a system. And when we say all, we mean not only Docker, but really all sorts
of containers that exist now or may exist in the future. We also considered
the solution of brute-forcing all processes in /proc/*/ns/ but we are afraid
that such a solution does not scale. As I stated in the Cover letter, the
problem was discussed at Plumbers (links at the bottom of the Cover letter)
and the conclusion was that the most distinct feature that anything that can
be called 'Container' must have is a separate PID namespace. This is why the
PoC starts with the implementation of this namespace. You can see in the
example script that discovering the name and all PIDs of all containers gets
quick and trivial with the help of this new filesystem. And you need to add
just a few more lines of code in order to make it start tracing a selected
container.
Thanks!
Yordan
> James
>
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-19 17:14 ` Yordan Karadzhov
@ 2021-11-19 17:22 ` Steven Rostedt
2021-11-19 23:22 ` James Bottomley
1 sibling, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2021-11-19 17:22 UTC (permalink / raw)
To: Yordan Karadzhov
Cc: James Bottomley, linux-kernel, linux-fsdevel, viro, mingo, hagen,
rppt, akpm, vvs, shakeelb, christian.brauner, mkoutny,
Linux Containers, Eric W. Biederman
On Fri, 19 Nov 2021 19:14:08 +0200
Yordan Karadzhov <y.karadz@gmail.com> wrote:
> And you need to add just a few more lines of code
> in order to make it start tracing a selected container.
I would like to add that this is not just about tracing a single container;
it could also be tracing several containers, seeing how they interact, and
analyzing the contention between them on shared resources, just to name an
example of what could be done.
-- Steve
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-19 17:14 ` Yordan Karadzhov
2021-11-19 17:22 ` Steven Rostedt
@ 2021-11-19 23:22 ` James Bottomley
2021-11-20 0:07 ` Steven Rostedt
1 sibling, 1 reply; 21+ messages in thread
From: James Bottomley @ 2021-11-19 23:22 UTC (permalink / raw)
To: Yordan Karadzhov, Steven Rostedt
Cc: linux-kernel, linux-fsdevel, viro, mingo, hagen, rppt, akpm, vvs,
shakeelb, christian.brauner, mkoutny, Linux Containers,
Eric W. Biederman
On Fri, 2021-11-19 at 19:14 +0200, Yordan Karadzhov wrote:
> On 19.11.21 г. 18:42 ч., James Bottomley wrote:
[...]
> > Can we back up and ask what problem you're trying to solve before
> > we start introducing new objects like namespace name? The problem
> > statement just seems to be "Being able to see the structure of the
> > namespaces can be very useful in the context of the containerized
> > workloads." which you later expanded on as "trying to add more
> > visibility into the working of things like kubernetes". If you
> > just want to see the namespace "tree" you can script that (as root)
> > by matching the process tree and the /proc/<pid>/ns changes without
> > actually needing to construct it in the kernel. This can also be
> > done without introducing the concept of a namespace name. However,
> > there is a subtlety of doing this matching in the way I described
> > in that you don't get proper parenting to the user namespace
> > ownership ... but that seems to be something you don't want anyway?
> >
>
> The major motivation is to be able to hook tracing to individual
> containers. We want to be able to quickly discover the
> PIDs of all containers running on a system. And when we say all, we
> mean not only Docker, but really all sorts of
> containers that exist now or may exist in the future. We also
> considered the solution of brute-forcing all processes in
> /proc/*/ns/ but we are afraid that such a solution does not scale.
What do you mean, it does not scale? ps and top use the /proc tree to
gather all the real-time interface data for every process; do they not
"scale" either, and should they therefore be done as in-kernel interfaces?
> As I stated in the Cover letter, the problem was
> discussed at Plumbers (links at the bottom of the Cover letter) and
> the conclusion was that the most distinctive feature
> that anything that can be called a 'container' must have is a separate
> PID namespace.
Unfortunately, I think I was fighting matrix fires at the time so
couldn't be there. However, I'd have pushed back on the idea of
identifying containers by the pid namespace (mainly because most of the
unprivileged containers I set up don't have one). Realistically, if
you're not a system container (need for pid 1) and don't have multiple
untrusted tenants (global process tree information leak), you likely
shouldn't be using the pid namespace either ... it just adds isolation
for no value.
> This is why the PoC starts with the implementation of this
> namespace. You can see in the example script that discovering the
> name and all PIDs of all containers becomes quick and trivial with the
> help of this new filesystem. And you need to add just a few more lines
> of code in order to make it start tracing a selected container.
But I could write a script or a tool to gather all the information
without this filesystem. The namespace tree can be reconstructed by
anything that can view the process tree and the /proc/<pid>/ns
directory.
James
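The reconstruction James describes — matching the process tree against the /proc/<pid>/ns links — can be sketched in a few lines of Python. The helper names and output format below are illustrative only, not part of any proposed interface:

```python
#!/usr/bin/env python3
# Sketch: group processes by PID-namespace inode by walking /proc,
# without any new kernel interface. Helper names are illustrative.
import os
from collections import defaultdict


def pidns_of(pid):
    """Return the PID-namespace identifier ("pid:[inum]") of a process,
    or None if the process vanished or we lack permission."""
    try:
        return os.readlink(f"/proc/{pid}/ns/pid")
    except OSError:
        return None


def namespaces():
    """Map each PID-namespace identifier to the PIDs it contains."""
    tree = defaultdict(list)
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        ns = pidns_of(entry)
        if ns is not None:
            tree[ns].append(int(entry))
    return dict(tree)


if __name__ == "__main__":
    for ns, pids in namespaces().items():
        print(f"{ns}: {len(pids)} task(s)")
```

Run as root to cover every task; unprivileged, the readlink() fails for other users' processes and those tasks are simply skipped.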
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-19 23:22 ` James Bottomley
@ 2021-11-20 0:07 ` Steven Rostedt
2021-11-20 0:14 ` James Bottomley
0 siblings, 1 reply; 21+ messages in thread
From: Steven Rostedt @ 2021-11-20 0:07 UTC (permalink / raw)
To: James Bottomley
Cc: Yordan Karadzhov, linux-kernel, linux-fsdevel, viro, mingo, hagen,
rppt, akpm, vvs, shakeelb, christian.brauner, mkoutny,
Linux Containers, Eric W. Biederman
On Fri, 19 Nov 2021 18:22:55 -0500
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> But I could write a script or a tool to gather all the information
> without this filesystem. The namespace tree can be reconstructed by
> anything that can view the process tree and the /proc/<pid>/ns
> directory.
So basically you're stating that we could build the same thing that the
namespacefs would give us from inside a privileged container that had
access to the system procfs?
-- Steve
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-20 0:07 ` Steven Rostedt
@ 2021-11-20 0:14 ` James Bottomley
0 siblings, 0 replies; 21+ messages in thread
From: James Bottomley @ 2021-11-20 0:14 UTC (permalink / raw)
To: Steven Rostedt
Cc: Yordan Karadzhov, linux-kernel, linux-fsdevel, viro, mingo, hagen,
rppt, akpm, vvs, shakeelb, christian.brauner, mkoutny,
Linux Containers, Eric W. Biederman
On Fri, 2021-11-19 at 19:07 -0500, Steven Rostedt wrote:
> On Fri, 19 Nov 2021 18:22:55 -0500
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>
> > But I could write a script or a tool to gather all the information
> > without this filesystem. The namespace tree can be reconstructed
> > by anything that can view the process tree and the /proc/<pid>/ns
> > directory.
>
> So basically you're stating that we could build the same thing that
> the namespacefs would give us from inside a privileged container that
> had access to the system procfs?
I think so, yes ... and if some information is missing, we could export
it for you. This way the kernel doesn't prescribe what the namespace
tree looks like and the tool can display it in many different ways.
For instance, your current RFC patch misses the subtlety of the owning
user namespace, but that could simply be an alternative view presented
by a userspace tool.
James
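For reference, the owning-user-namespace relation mentioned above is already queryable from userspace through the documented ioctl_ns(2) interface (NS_GET_USERNS, Linux 4.9+). A minimal sketch — the helper name is invented for illustration, and the kernel may refuse the ioctl without sufficient privilege:

```python
# Sketch: query the user namespace that owns another namespace via
# ioctl_ns(2). NS_GET_USERNS expands _IO(NSIO, 0x1) with NSIO == 0xb7
# from <linux/nsfs.h>.
import fcntl
import os

NS_GET_USERNS = 0xb701  # _IO(0xb7, 0x1)


def owning_userns(ns_path):
    """Return the inode number of the user namespace owning the namespace
    at ns_path, or None if the kernel refuses (e.g. lack of privilege)."""
    fd = os.open(ns_path, os.O_RDONLY)
    try:
        try:
            # The ioctl returns a new fd referring to the owning user ns.
            userns_fd = fcntl.ioctl(fd, NS_GET_USERNS)
        except OSError:
            return None
        try:
            return os.fstat(userns_fd).st_ino
        finally:
            os.close(userns_fd)
    finally:
        os.close(fd)
```

The returned inode number can then be matched against the /proc/<pid>/ns/user links of every process, giving exactly the alternative "owning user namespace" view a userspace tool could present.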
^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <f6ca1f5bdb3b516688f291d9685a6a59f49f1393.camel@HansenPartnership.com>]
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
2021-11-18 18:55 ` [RFC PATCH 0/4] namespacefs: Proof-of-Concept Eric W. Biederman
2021-11-18 19:02 ` Steven Rostedt
2021-11-18 19:24 ` Steven Rostedt
@ 2021-11-19 14:26 ` Yordan Karadzhov
2 siblings, 0 replies; 21+ messages in thread
From: Yordan Karadzhov @ 2021-11-19 14:26 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-kernel, linux-fsdevel, viro, rostedt, mingo, hagen, rppt,
James.Bottomley, akpm, vvs, shakeelb, christian.brauner, mkoutny,
Linux Containers
Dear Eric,
Thank you very much for pointing out all the weaknesses of this Proof-of-Concept!
I tried to make it clear in the Cover letter that this is nothing more than a PoC. It is OK that you are giving it a
'Nacked-by'. We never had any expectation that this particular version of the code could be merged. Nevertheless, we hope
to receive constructive guidance on how to improve. I will try to comment on your arguments below.
On 18.11.21 г. 20:55 ч., Eric W. Biederman wrote:
>
> Adding the containers mailing list which is for discussions like this.
>
> "Yordan Karadzhov (VMware)" <y.karadz@gmail.com> writes:
>
>> We introduce a simple read-only virtual filesystem that provides
>> direct mechanism for examining the existing hierarchy of namespaces
>> on the system. For the purposes of this PoC, we tried to keep the
>> implementation of the pseudo filesystem as simple as possible. Only
>> two namespace types (PID and UTS) are coupled to it for the moment.
>> Nevertheless, we do not expect having significant problems when
>> adding all other namespace types.
>>
>> When fully functional, 'namespacefs' will allow the user to see all
>> namespaces that are active on the system and to easily retrieve the
>> specific data, managed by each namespace. For example the PIDs of
>> all tasks enclosed in the individual PID namespaces. Any existing
>> namespace on the system will be represented by its corresponding
>> directory in namespacesfs. When a namespace is created a directory
>> will be added. When a namespace is destroyed, its corresponding
>> directory will be removed. The hierarchy of the directories will
>> follow the hierarchy of the namespaces.
>
> It is not correct to use inode numbers as the actual names for
> namespaces.
It is unclear to me why exposing the inode number of a namespace is such a fundamental problem. This information is
already available in /proc/PID/ns. If you are worried that the inode number serves as the name of the
corresponding directory in the filesystem, and that someone may interpret this as the name of the namespace itself, then
we can make the inum available inside the directory (identical to /proc/PID/ns/) and think of some other
naming convention for the directories.
>
> I can not see anything else you can possibly uses as names for
> namespaces.
>
> To allow container migration between machines and similar things
> the you wind up needing a namespace for your names of namespaces.
>
This filesystem aims to provide a snapshot of the current structure of the namespaces on the entire host, so migrating
it to another machine, where this structure will be different anyway, seems meaningless by definition, unless you
really migrate the entire machine.
This may be a stupid question, but are you currently migrating 'debugfs' or 'tracefs' together with a container?
> Further you talk about hierarchy and you have not added support for the
> user namespace. Without the user namespace there is not hierarchy with
> any namespace but the pid namespace. There is definitely no meaningful
> hierarchy without the user namespace.
>
I do agree that the user namespace plays a central role in the global hierarchy of namespaces.
> As far as I can tell merging this will break CRIU and container
> migration in general (as the namespace of namespaces problem is not
> solved).
>
> Since you are not solving the problem of a namespace for namespaces,
> yet implementing something that requires it.
>
> Since you are implementing hierarchy and ignoring the user namespace
> which gives structure and hierarchy to the namespaces.
>
If we provide a second version of the PoC that includes the user namespace, will this make you give the idea a second
consideration?
It is OK if you give us a second "Nacked-by" after this ;-)
Once again, thank you very much for your comments!
Best,
Yordan
> Since this breaks existing use cases without giving a solution.
>
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Eric
>
^ permalink raw reply [flat|nested] 21+ messages in thread