public inbox for containers@lists.linux.dev
* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
       [not found] <20211118181210.281359-1-y.karadz@gmail.com>
@ 2021-11-18 18:55 ` Eric W. Biederman
  2021-11-18 19:02   ` Steven Rostedt
                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Eric W. Biederman @ 2021-11-18 18:55 UTC (permalink / raw)
  To: Yordan Karadzhov (VMware)
  Cc: linux-kernel, linux-fsdevel, viro, rostedt, mingo, hagen, rppt,
	James.Bottomley, akpm, vvs, shakeelb, christian.brauner, mkoutny,
	Linux Containers


Adding the containers mailing list which is for discussions like this.

"Yordan Karadzhov (VMware)" <y.karadz@gmail.com> writes:

> We introduce a simple read-only virtual filesystem that provides a
> direct mechanism for examining the existing hierarchy of namespaces
> on the system. For the purposes of this PoC, we tried to keep the
> implementation of the pseudo filesystem as simple as possible. Only
> two namespace types (PID and UTS) are coupled to it for the moment.
> Nevertheless, we do not expect significant problems when adding all
> other namespace types.
>
> When fully functional, 'namespacefs' will allow the user to see all
> namespaces that are active on the system and to easily retrieve the
> specific data managed by each namespace, for example the PIDs of
> all tasks enclosed in the individual PID namespaces. Any existing
> namespace on the system will be represented by its corresponding
> directory in namespacefs. When a namespace is created, a directory
> will be added. When a namespace is destroyed, its corresponding
> directory will be removed. The hierarchy of the directories will
> follow the hierarchy of the namespaces.

It is not correct to use inode numbers as the actual names for
namespaces.

I cannot see anything else you could possibly use as names for
namespaces.

To allow container migration between machines and similar things,
you wind up needing a namespace for your names of namespaces.

Further, you talk about hierarchy, yet you have not added support for
the user namespace.  Without the user namespace there is no hierarchy
for any namespace except the pid namespace.  There is definitely no
meaningful hierarchy without the user namespace.

As far as I can tell merging this will break CRIU and container
migration in general (as the namespace of namespaces problem is not
solved).

Since you are not solving the problem of a namespace for namespaces,
yet you are implementing something that requires it.

Since you are implementing hierarchy and ignoring the user namespace
which gives structure and hierarchy to the namespaces.

Since this breaks existing use cases without giving a solution.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Eric

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-18 18:55 ` [RFC PATCH 0/4] namespacefs: Proof-of-Concept Eric W. Biederman
@ 2021-11-18 19:02   ` Steven Rostedt
  2021-11-18 19:22     ` Eric W. Biederman
  2021-11-18 19:24   ` Steven Rostedt
  2021-11-19 14:26   ` Yordan Karadzhov
  2 siblings, 1 reply; 21+ messages in thread
From: Steven Rostedt @ 2021-11-18 19:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
	mingo, hagen, rppt, James.Bottomley, akpm, vvs, shakeelb,
	christian.brauner, mkoutny, Linux Containers

On Thu, 18 Nov 2021 12:55:07 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Eric

Eric, 

As you can see, the subject says "Proof-of-Concept" and every patch in
the series says "RFC". All you did was point out problems, with no help
in fixing those problems, and then gave a nasty Nacked-by before it even
got into a conversation.

From this response, I have to say:

  It is not correct to nack a proof of concept that is asking for
  discussion.

So, I nack your nack, because it's way too early to nack this.

-- Steve


* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-18 19:02   ` Steven Rostedt
@ 2021-11-18 19:22     ` Eric W. Biederman
  2021-11-18 19:36       ` Steven Rostedt
  0 siblings, 1 reply; 21+ messages in thread
From: Eric W. Biederman @ 2021-11-18 19:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
	mingo, hagen, rppt, James.Bottomley, akpm, vvs, shakeelb,
	christian.brauner, mkoutny, Linux Containers

Steven Rostedt <rostedt@goodmis.org> writes:

> On Thu, 18 Nov 2021 12:55:07 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> 
>> Eric
>
> Eric, 
>
> As you can see, the subject says "Proof-of-Concept" and every patch in
> the series says "RFC". All you did was point out problems, with no help
> in fixing those problems, and then gave a nasty Nacked-by before it even
> got into a conversation.
>
> From this response, I have to say:
>
>   It is not correct to nack a proof of concept that is asking for
>   discussion.
>
> So, I nack your nack, because it's way too early to nack this.

I am refreshing my nack on the concept.  My nack has been in place for
good technical reasons since about 2006.

I see no way forward.  I do not see a compelling use case.

There have been many conversations in the past attempting to implement
something that requires a namespace of namespaces, and they have never
gotten anywhere.

I see no attempt at due diligence or at actually understanding what
hierarchy already exists in namespaces.

I don't mean to be nasty but I do mean to be clear.  Without a
compelling new idea in this space I see no hope of an implementation.

What they are attempting to do makes it impossible to migrate a set of
processes that uses this feature from one machine to another.  AKA this
would be a breaking change and a regression if merged.

The breaking and regression are caused by assigning names to namespaces
without putting those names into a namespace of their own.  That
appears fundamental to the concept, not merely to the implementation.

Since the concept, if merged, would cause a regression, it qualifies
for a nack.

We can explore what problems they are trying to solve with this and
explore other ways to solve those problems.  All I saw was a comment
about monitoring tools and wanting a global view.  I did not see
any comments about dealing with all of the reasons why a global view
tends to be a bad idea.

I should have added that we have, to some extent, a way to walk through
namespaces using ioctls on nsfs inodes.
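For reference, that walk can be sketched from userspace with the
NS_GET_PARENT and NS_GET_USERNS ioctls from <linux/nsfs.h> (Linux 4.9
and later). A minimal, unofficial Python sketch, with the ioctl numbers
hard-coded from the kernel header:

```python
import fcntl
import os

# ioctl numbers from <linux/nsfs.h>: _IO(0xb7, 0x1) and _IO(0xb7, 0x2)
NS_GET_USERNS = 0xb701  # returns an fd for the owning user namespace
NS_GET_PARENT = 0xb702  # returns an fd for the parent pid/user namespace

def ns_inode(fd):
    """A namespace is identified (per boot) by its nsfs inode number."""
    return os.fstat(fd).st_ino

def walk_parents(path):
    """Yield nsfs inode numbers from the namespace at `path` up to the
    topmost ancestor we are allowed to see; the kernel answers EPERM
    (or EINVAL for non-hierarchical ns types) when the walk must stop."""
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            yield ns_inode(fd)
            try:
                parent = fcntl.ioctl(fd, NS_GET_PARENT)
            except OSError:  # EPERM at the root, EINVAL for flat ns types
                return
            os.close(fd)
            fd = parent
    finally:
        os.close(fd)
```

e.g. `list(walk_parents("/proc/self/ns/pid"))` gives the chain of
pid-namespace inodes visible to the caller.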

Eric



* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-18 18:55 ` [RFC PATCH 0/4] namespacefs: Proof-of-Concept Eric W. Biederman
  2021-11-18 19:02   ` Steven Rostedt
@ 2021-11-18 19:24   ` Steven Rostedt
  2021-11-19  9:50     ` Kirill Tkhai
  2021-11-19 12:45     ` James Bottomley
  2021-11-19 14:26   ` Yordan Karadzhov
  2 siblings, 2 replies; 21+ messages in thread
From: Steven Rostedt @ 2021-11-18 19:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
	mingo, hagen, rppt, James.Bottomley, akpm, vvs, shakeelb,
	christian.brauner, mkoutny, Linux Containers

On Thu, 18 Nov 2021 12:55:07 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> It is not correct to use inode numbers as the actual names for
> namespaces.
> 
> I cannot see anything else you could possibly use as names for
> namespaces.

This is why we used inode numbers.

> 
> To allow container migration between machines and similar things,
> you wind up needing a namespace for your names of namespaces.

Is this why you say inode numbers are incorrect?

There's no reason to make this into its own namespace. Ideally, this file
system should only be for privileged containers, as the entire point of
this file system is to monitor the other containers on the system. In
other words, this file system is not to be used like procfs; instead it
gives a global view of the containers running on the host.

At first, we were not going to let this file system be part of any
namespace but the host itself, but because we want to wrap up tooling into
a container that we can install on other machines as a way to monitor the
containers on each machine, we had to open that up.

> 
> Further, you talk about hierarchy, yet you have not added support for
> the user namespace.  Without the user namespace there is no hierarchy
> for any namespace except the pid namespace.  There is definitely no
> meaningful hierarchy without the user namespace.

Great, help us implement this.

> 
> As far as I can tell merging this will break CRIU and container
> migration in general (as the namespace of namespaces problem is not
> solved).

This is not a file system that is meant to be migrated. The point of
this file system is to monitor the other containers, so it does not make
sense to migrate it.

> 
> Since you are not solving the problem of a namespace for namespaces,
> yet you are implementing something that requires it.

Why is it needed?

> 
> Since you are implementing hierarchy and ignoring the user namespace
> which gives structure and hierarchy to the namespaces.

We are not ignoring it, we are RFC'ing for advice on how to implement it.

> 
> Since this breaks existing use cases without giving a solution.

You don't understand proof-of-concepts and RFCs, do you?

-- Steve


* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-18 19:22     ` Eric W. Biederman
@ 2021-11-18 19:36       ` Steven Rostedt
  0 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2021-11-18 19:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
	mingo, hagen, rppt, James.Bottomley, akpm, vvs, shakeelb,
	christian.brauner, mkoutny, Linux Containers

On Thu, 18 Nov 2021 13:22:16 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Steven Rostedt <rostedt@goodmis.org> writes:

> > 
> I am refreshing my nack on the concept.  My nack has been in place for
> good technical reasons since about 2006.

I'll admit, we are new to this, as we are now trying to add more visibility
into the workings of things like kubernetes. And having a way of knowing
what containers are running and how to monitor them is needed, and we need
to do this for all container infrastructures.

> 
> I see no way forward.  I do not see a compelling use case.

What do you use to debug issues in a kubernetes cluster of hundreds of
machines running thousands of containers? Currently, if something is amiss,
a node is restarted in the hopes that the issue does not appear again. But
we would like to add infrastructure that takes advantage of tracing and
profiling to be able to narrow that down. But to do so, we need to
understand what tasks belong to what containers.
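That task-to-container mapping can already be approximated today from
/proc alone. A hypothetical sketch (not part of the patch set) that
groups every visible task by the inode of its pid namespace:

```python
import os
from collections import defaultdict

def tasks_by_pidns(proc="/proc"):
    """Map pid-namespace inode number -> PIDs of its member tasks, by
    scanning /proc.  Reading other tasks' ns/ links needs privilege;
    unreadable or exiting tasks are silently skipped."""
    groups = defaultdict(list)
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            ino = os.stat(os.path.join(proc, entry, "ns", "pid")).st_ino
        except OSError:  # task exited mid-scan, or ns link not readable
            continue
        groups[ino].append(int(entry))
    return dict(groups)
```

Containers on one host then show up as distinct inode keys, though
those numbers are only stable within a single boot.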

> 
> There have been many conversations in the past attempting to implement
> something that requires a namespace of namespaces, and they have never
> gotten anywhere.

We are not asking for a "namespace" of namespaces, but a filesystem (one,
not a namespace of one) that holds the information at the system scale,
not a container view.

I would be happy to implement something that marks a container that has
this file system available as "special", since most containers do not
need it.

> 
> I see no attempt at due diligence or at actually understanding what
> hierarchy already exists in namespaces.

This is not trivial. What did we miss?

> 
> I don't mean to be nasty but I do mean to be clear.  Without a
> compelling new idea in this space I see no hope of an implementation.
> 
> What they are attempting to do makes it impossible to migrate a set of
> processes that uses this feature from one machine to another.  AKA this
> would be a breaking change and a regression if merged.

The point of this is not to allow that migration. I'd be happy to add
that if a container has access to this file system, it is pinned to the
system and cannot be migrated. The whole point of this file system is to
monitor all containers on the system, and it makes no sense to migrate
it.

We would duplicate it over several systems, but there's no reason to move
it once it is running.

> 
> The breaking and regression are caused by assigning names to namespaces
> without putting those names into a namespace of their own.   That
> appears fundamental to the concept not to the implementation.

If you think this should be migrated then yes, it is broken. But we don't
want this to work across migrations. That defeats the purpose of this work.

> 
> Since the concept, if merged, would cause a regression, it qualifies
> for a nack.
> 
> We can explore what problems they are trying to solve with this and
> explore other ways to solve those problems.  All I saw was a comment
> about monitoring tools and wanting a global view.  I did not see
> any comments about dealing with all of the reasons why a global view
> tends to be a bad idea.

If you only care about a working environment of the system that runs a
set of containers, how is that a bad idea? Again, I'm happy to implement
something that makes having this file system prevent the container from
being migrated. A pinned privileged container.

> 
> I should have added that we have, to some extent, a way to walk through
> namespaces using ioctls on nsfs inodes.

How robust is this? And is there a library or tooling around it?

-- Steve


* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-18 19:24   ` Steven Rostedt
@ 2021-11-19  9:50     ` Kirill Tkhai
  2021-11-19 12:45     ` James Bottomley
  1 sibling, 0 replies; 21+ messages in thread
From: Kirill Tkhai @ 2021-11-19  9:50 UTC (permalink / raw)
  To: Steven Rostedt, Eric W. Biederman
  Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
	mingo, hagen, rppt, James.Bottomley, akpm, vvs, shakeelb,
	christian.brauner, mkoutny, Linux Containers

On 18.11.2021 22:24, Steven Rostedt wrote:
> On Thu, 18 Nov 2021 12:55:07 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
> 
>> It is not correct to use inode numbers as the actual names for
>> namespaces.
>>
>> I cannot see anything else you could possibly use as names for
>> namespaces.
> 
> This is why we used inode numbers.

The migration problem may be solved if the new filesystem allows rename.

The kernel may use a random UUID as the initial namespace name. After the
migration, we recreate this namespace, and it will have another UUID
generated by the kernel. Then we just rename it to the correct one.

I sent something like this for /proc fs (except rename): 

http://archive.lwn.net:8080/linux-fsdevel/97fdcff1-1cce-7eab-6449-7fe10451162d@virtuozzo.com/T/#m7579f79a6ba8422b57463049f52d2043986b5cac

>>
>> To allow container migration between machines and similar things,
>> you wind up needing a namespace for your names of namespaces.
> 
> Is this why you say inode numbers are incorrect?
> 
> There's no reason to make this into its own namespace. Ideally, this file
> system should only be for privileged containers, as the entire point of
> this file system is to monitor the other containers on the system. In
> other words, this file system is not to be used like procfs; instead it
> gives a global view of the containers running on the host.
> 
> At first, we were not going to let this file system be part of any
> namespace but the host itself, but because we want to wrap up tooling into
> a container that we can install on other machines as a way to monitor the
> containers on each machine, we had to open that up.
> 
>>
>> Further, you talk about hierarchy, yet you have not added support for
>> the user namespace.  Without the user namespace there is no hierarchy
>> for any namespace except the pid namespace.  There is definitely no
>> meaningful hierarchy without the user namespace.
> 
> Great, help us implement this.
> 
>>
>> As far as I can tell merging this will break CRIU and container
>> migration in general (as the namespace of namespaces problem is not
>> solved).
> 
> This is not a file system that is meant to be migrated. The point of
> this file system is to monitor the other containers, so it does not make
> sense to migrate it.
> 
>>
>> Since you are not solving the problem of a namespace for namespaces,
>> yet you are implementing something that requires it.
> 
> Why is it needed?
> 
>>
>> Since you are implementing hierarchy and ignoring the user namespace
>> which gives structure and hierarchy to the namespaces.
> 
> We are not ignoring it, we are RFC'ing for advice on how to implement it.
> 
>>
>> Since this breaks existing use cases without giving a solution.
> 
> You don't understand proof-of-concepts and RFCs, do you?
> 
> -- Steve
> 



* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-18 19:24   ` Steven Rostedt
  2021-11-19  9:50     ` Kirill Tkhai
@ 2021-11-19 12:45     ` James Bottomley
  2021-11-19 14:27       ` Steven Rostedt
  1 sibling, 1 reply; 21+ messages in thread
From: James Bottomley @ 2021-11-19 12:45 UTC (permalink / raw)
  To: Steven Rostedt, Eric W. Biederman
  Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
	mingo, hagen, rppt, akpm, vvs, shakeelb, christian.brauner,
	mkoutny, Linux Containers

On Thu, 2021-11-18 at 14:24 -0500, Steven Rostedt wrote:
> On Thu, 18 Nov 2021 12:55:07 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
> 
> > It is not correct to use inode numbers as the actual names for
> > namespaces.
> > 
> > I cannot see anything else you could possibly use as names for
> > namespaces.
> 
> This is why we used inode numbers.
> 
> > To allow container migration between machines and similar things,
> > you wind up needing a namespace for your names of namespaces.
> 
> Is this why you say inode numbers are incorrect?

The problem is you seem to have picked on one orchestration system
without considering all the uses of namespaces and how this would
impact them.  So let me explain why inode numbers are incorrect and it
will possibly illuminate some of the cans of worms you're opening.

We have a container checkpoint/restore system called CRIU that can be
used to snapshot the state of a pid subtree and restore it.  It can be
used for the entire system or a piece of it.  It is also used by some
orchestration systems to live migrate containers.  Any property of a
container system that has meaning must be saved and restored by CRIU.

The inode number is simply a semi-random number assigned to the
namespace.  It shows up in /proc/<pid>/ns but nowhere else and isn't
used by anything.  When CRIU migrates or restores containers, all the
namespaces that compose them get different inode values on the restore.
If you want to make the inode number equivalent to the container name,
they'd have to restore to the previous number because you've made it a
property of the namespace.  The way everything is set up now, that's
just not possible and never will be.  Inode numbers are a 32-bit space
and can't be globally unique.  If you want a container name, it will
have to be something like a new UUID, and that's the first problem you
should tackle.
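To make the above concrete: the only place the number appears is the
/proc/<pid>/ns symlink target, and it matches the st_ino of the link
target. A small illustrative Python helper (not from the patch set):

```python
import os
import re

def ns_id(pid, ns_type):
    """Parse the nsfs inode from a /proc/<pid>/ns/<type> symlink target,
    which looks like 'pid:[4026531836]'.  This number is only unique
    within one boot; it is NOT stable across reboot or CRIU restore."""
    target = os.readlink(f"/proc/{pid}/ns/{ns_type}")
    m = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if m is None:
        raise ValueError(f"unexpected ns link target: {target!r}")
    return int(m.group(2))
```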

James




* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-18 18:55 ` [RFC PATCH 0/4] namespacefs: Proof-of-Concept Eric W. Biederman
  2021-11-18 19:02   ` Steven Rostedt
  2021-11-18 19:24   ` Steven Rostedt
@ 2021-11-19 14:26   ` Yordan Karadzhov
  2 siblings, 0 replies; 21+ messages in thread
From: Yordan Karadzhov @ 2021-11-19 14:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, viro, rostedt, mingo, hagen, rppt,
	James.Bottomley, akpm, vvs, shakeelb, christian.brauner, mkoutny,
	Linux Containers

Dear Eric,

Thank you very much for pointing out all the weaknesses of this Proof-of-Concept!

I tried to make it clear in the cover letter that this is nothing more
than a PoC. It is OK that you are giving it a 'Nacked-by'; we never
expected that this particular version of the code could be merged.
Nevertheless, we hope to receive constructive guidance on how to
improve. I will try to comment on your arguments below.

On 18.11.21 г. 20:55 ч., Eric W. Biederman wrote:
> 
> Adding the containers mailing list which is for discussions like this.
> 
> "Yordan Karadzhov (VMware)" <y.karadz@gmail.com> writes:
> 
>> We introduce a simple read-only virtual filesystem that provides a
>> direct mechanism for examining the existing hierarchy of namespaces
>> on the system. For the purposes of this PoC, we tried to keep the
>> implementation of the pseudo filesystem as simple as possible. Only
>> two namespace types (PID and UTS) are coupled to it for the moment.
>> Nevertheless, we do not expect significant problems when adding all
>> other namespace types.
>>
>> When fully functional, 'namespacefs' will allow the user to see all
>> namespaces that are active on the system and to easily retrieve the
>> specific data managed by each namespace, for example the PIDs of
>> all tasks enclosed in the individual PID namespaces. Any existing
>> namespace on the system will be represented by its corresponding
>> directory in namespacefs. When a namespace is created, a directory
>> will be added. When a namespace is destroyed, its corresponding
>> directory will be removed. The hierarchy of the directories will
>> follow the hierarchy of the namespaces.
> 
> It is not correct to use inode numbers as the actual names for
> namespaces.

It is unclear to me why exposing the inode number of a namespace is such
a fundamental problem. This information is already available in
/proc/PID/ns. If you are worried that, because the inode number gives
the name of the corresponding directory in the filesystem, someone could
interpret it as the name of the namespace itself, then we can make the
inum available inside the directory (and make it identical to
/proc/PID/ns/) and think of some other naming convention for the
directories.


> 
> I cannot see anything else you could possibly use as names for
> namespaces.
> 
> To allow container migration between machines and similar things,
> you wind up needing a namespace for your names of namespaces.
> 

This filesystem aims to provide a snapshot of the current structure of
the namespaces on the entire host, so migrating it to another machine,
where this structure will be different anyway, seems meaningless by
definition, unless you really migrate the entire machine.

This may be a stupid question, but are you currently migrating 'debugfs' or 'tracefs' together with a container?

> Further, you talk about hierarchy, yet you have not added support for
> the user namespace.  Without the user namespace there is no hierarchy
> for any namespace except the pid namespace.  There is definitely no
> meaningful hierarchy without the user namespace.
> 

I do agree that the user namespace plays a central role in the global hierarchy of namespaces.

> As far as I can tell merging this will break CRIU and container
> migration in general (as the namespace of namespaces problem is not
> solved).
> 
> Since you are not solving the problem of a namespace for namespaces,
> yet you are implementing something that requires it.
> 
> Since you are implementing hierarchy and ignoring the user namespace
> which gives structure and hierarchy to the namespaces.
> 

If we provide a second version of the PoC that includes the user
namespace, would this make you reconsider the idea?
It is OK if you give us a second "Nacked-by" after this ;-)

Once again, thank you very much for your comments!

Best,
Yordan


> Since this breaks existing use cases without giving a solution.
> 
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Eric
> 


* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-19 12:45     ` James Bottomley
@ 2021-11-19 14:27       ` Steven Rostedt
  2021-11-19 16:42         ` James Bottomley
       [not found]         ` <f6ca1f5bdb3b516688f291d9685a6a59f49f1393.camel@HansenPartnership.com>
  0 siblings, 2 replies; 21+ messages in thread
From: Steven Rostedt @ 2021-11-19 14:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: "Eric W. Biederman\"  <ebiederm@xmission.com>, "

On Fri, 19 Nov 2021 07:45:01 -0500
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> On Thu, 2021-11-18 at 14:24 -0500, Steven Rostedt wrote:
> > On Thu, 18 Nov 2021 12:55:07 -0600
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> >   
> > > It is not correct to use inode numbers as the actual names for
> > > namespaces.
> > > 
> > > I cannot see anything else you could possibly use as names for
> > > namespaces.
> > 
> > This is why we used inode numbers.
> >   
> > > To allow container migration between machines and similar things,
> > > you wind up needing a namespace for your names of namespaces.
> > 
> > Is this why you say inode numbers are incorrect?  
> 
> The problem is you seem to have picked on one orchestration system
> without considering all the uses of namespaces and how this would
> impact them.  So let me explain why inode numbers are incorrect and it
> will possibly illuminate some of the cans of worms you're opening.
> 
> We have a container checkpoint/restore system called CRIU that can be
> used to snapshot the state of a pid subtree and restore it.  It can be
> used for the entire system or piece of it.  It is also used by some
> orchestration systems to live migrate containers.  Any property of a
> container system that has meaning must be saved and restored by CRIU.
> 
> The inode number is simply a semi-random number assigned to the
> namespace.  It shows up in /proc/<pid>/ns but nowhere else and isn't
> used by anything.  When CRIU migrates or restores containers, all the
> namespaces that compose them get different inode values on the restore.
> If you want to make the inode number equivalent to the container name,
> they'd have to restore to the previous number because you've made it a
> property of the namespace.  The way everything is set up now, that's
> just not possible and never will be.  Inode numbers are a 32-bit space
> and can't be globally unique.  If you want a container name, it will
> have to be something like a new UUID, and that's the first problem you
> should tackle.

So everyone seems to be all upset about using inode numbers. We could do
what Kirill suggested and just create some random UUID and use that. We
could have a file in the directory called 'inode' that holds the inode
number (as that's what both docker and podman use to identify their
containers, and it's nice to have something to map back to them).

On checkpoint restore, only the directories that represent the container
that migrated matter, so as Kirill said, make sure they get the old UUID
name, and expose that as the directory.

If a container is looking at directories of other containers on the
system and then gets migrated to another system, it should be treated as
though those directories were deleted out from under it.

I still do not see what the issue is here.

-- Steve




* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-19 14:27       ` Steven Rostedt
@ 2021-11-19 16:42         ` James Bottomley
  2021-11-19 17:14           ` Yordan Karadzhov
       [not found]         ` <f6ca1f5bdb3b516688f291d9685a6a59f49f1393.camel@HansenPartnership.com>
  1 sibling, 1 reply; 21+ messages in thread
From: James Bottomley @ 2021-11-19 16:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
	mingo, hagen, rppt, akpm, vvs, shakeelb, christian.brauner,
	mkoutny, Linux Containers, Steven Rostedt, Eric W. Biederman

[resend due to header mangling causing loss on the lists]
On Fri, 2021-11-19 at 09:27 -0500, Steven Rostedt wrote:
> On Fri, 19 Nov 2021 07:45:01 -0500
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> > On Thu, 2021-11-18 at 14:24 -0500, Steven Rostedt wrote:
> > > On Thu, 18 Nov 2021 12:55:07 -0600
> > > ebiederm@xmission.com (Eric W. Biederman) wrote:
> > >   
> > > > It is not correct to use inode numbers as the actual names for
> > > > namespaces.
> > > > 
> > > > I cannot see anything else you could possibly use as names for
> > > > namespaces.
> > > 
> > > This is why we used inode numbers.
> > >   
> > > > To allow container migration between machines and similar
> > > > things, you wind up needing a namespace for your names of
> > > > namespaces.
> > > 
> > > Is this why you say inode numbers are incorrect?  
> > 
> > The problem is you seem to have picked on one orchestration system
> > without considering all the uses of namespaces and how this would
> > impact them.  So let me explain why inode numbers are incorrect and
> > it will possibly illuminate some of the cans of worms you're
> > opening.
> > 
> > We have a container checkpoint/restore system called CRIU that can
> > be used to snapshot the state of a pid subtree and restore it.  It
> > can be used for the entire system or piece of it.  It is also used
> > by some orchestration systems to live migrate containers.  Any
> > property of a container system that has meaning must be saved and
> > restored by CRIU.
> > 
> > The inode number is simply a semi-random number assigned to the
> > namespace.  It shows up in /proc/<pid>/ns but nowhere else and
> > isn't used by anything.  When CRIU migrates or restores containers,
> > all the namespaces that compose them get different inode values on
> > the restore.  If you want to make the inode number equivalent to
> > the container name, they'd have to restore to the previous number
> > because you've made it a property of the namespace.  The way
> > everything is set up now, that's just not possible and never will
> > be.  Inode numbers are a 32-bit space and can't be globally
> > unique.  If you want a container name, it will have to be something
> > like a new UUID, and that's the first problem you should tackle.
> 
> So everyone seems to be all upset about using inode numbers. We could
> do what Kirill suggested and just create some random UUID and use
> that. We could have a file in the directory called 'inode' that holds
> the inode number (as that's what both docker and podman use to
> identify their containers, and it's nice to have something to map
> back to them).
> 
> On checkpoint restore, only the directories that represent the
> container that migrated matter, so as Kirill said, make sure they get
> the old UUID name, and expose that as the directory.
> 
> If a container is looking at directories of other containers on the
> system and then gets migrated to another system, it should be treated
> as though those directories were deleted out from under it.
> 
> I still do not see what the issue is here.

The issue is you're introducing a new core property for namespaces that
they didn't have before.  Everyone has different use cases for
containers, and we need to make sure the new property works with all of
them.

Having a "name" for a namespace has been discussed before, which is the
landmine you stepped on when you advocated using the inode number as
the name, because that's already known to be unworkable.

Can we back up and ask what problem you're trying to solve before we
start introducing new objects like a namespace name?  The problem
statement just seems to be "Being able to see the structure of the
namespaces can be very useful in the context of the containerized
workloads," which you later expanded on as "trying to add more
visibility into the working of things like kubernetes".  If you just
want to see the namespace "tree", you can script that (as root) by
matching the process tree and the /proc/<pid>/ns changes, without
actually needing to construct it in the kernel.  This can also be done
without introducing the concept of a namespace name.  However, there is
a subtlety in doing the matching the way I described: you don't get
proper parenting to the user namespace ownership ... but that seems to
be something you don't want anyway?

James




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-19 16:42         ` James Bottomley
@ 2021-11-19 17:14           ` Yordan Karadzhov
  2021-11-19 17:22             ` Steven Rostedt
  2021-11-19 23:22             ` James Bottomley
  0 siblings, 2 replies; 21+ messages in thread
From: Yordan Karadzhov @ 2021-11-19 17:14 UTC (permalink / raw)
  To: James Bottomley, Steven Rostedt
  Cc: linux-kernel, linux-fsdevel, viro, mingo, hagen, rppt, akpm, vvs,
	shakeelb, christian.brauner, mkoutny, Linux Containers,
	Eric W. Biederman



On 19.11.21 at 18:42, James Bottomley wrote:
> [resend due to header mangling causing loss on the lists]
> On Fri, 2021-11-19 at 09:27 -0500, Steven Rostedt wrote:
>> On Fri, 19 Nov 2021 07:45:01 -0500
>> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>>
>>> On Thu, 2021-11-18 at 14:24 -0500, Steven Rostedt wrote:
>>>> On Thu, 18 Nov 2021 12:55:07 -0600
>>>> ebiederm@xmission.com (Eric W. Biederman) wrote:
>>>>    
>>>>> It is not correct to use inode numbers as the actual names for
>>>>> namespaces.
>>>>>
>>>>> I can not see anything else you can possibly use as names for
>>>>> namespaces.
>>>>
>>>> This is why we used inode numbers.
>>>>    
>>>>> To allow container migration between machines and similar
>>>>> things you wind up needing a namespace for your names of
>>>>> namespaces.
>>>>
>>>> Is this why you say inode numbers are incorrect?
>>>
>>> The problem is you seem to have picked on one orchestration system
>>> without considering all the uses of namespaces and how this would
>>> impact them.  So let me explain why inode numbers are incorrect and
>>> it will possibly illuminate some of the cans of worms you're
>>> opening.
>>>
>>> We have a container checkpoint/restore system called CRIU that can
>>> be used to snapshot the state of a pid subtree and restore it.  It
>>> can be used for the entire system or piece of it.  It is also used
>>> by some orchestration systems to live migrate containers.  Any
>>> property of a container system that has meaning must be saved and
>>> restored by CRIU.
>>>
>>> The inode number is simply a semi-random number assigned to the
>>> namespace.  It shows up in /proc/<pid>/ns but nowhere else and
>>> isn't used by anything.  When CRIU migrates or restores containers,
>>> all the namespaces that compose them get different inode values on
>>> the restore.  If you want to make the inode number equivalent to
>>> the container name, they'd have to restore to the previous number
>>> because you've made it a property of the namespace.  The way
>>> everything is set up now, that's just not possible and never will
>>> be.  Inode numbers are a 32 bit space and can't be globally
>>> unique.  If you want a container name, it will have to be something
>>> like a new UUID and that's the first problem you should tackle.
>>
>> So everyone seems to be all upset about using inode number. We could
>> do what Kirill suggested and just create some random UUID and use
>> that. We could have a file in the directory called inode that has the
>> inode number (as that's what both docker and podman use to identify
>> their containers, and it's nice to have something to map back to
>> them).
>>
>> On checkpoint restore, only the directories that represent the
>> container that migrated matter, so as Kirill said, make sure they get
>> the old UUID name, and expose that as the directory.
>>
>> If a container is looking at directories of other containers on the
>> system, then it gets migrated to another system, it should be treated
>> as though those directories were deleted under them.
>>
>> I still do not see what the issue is here.
> 
> The issue is you're introducing a new core property for namespaces they
> didn't have before.  Everyone has different use cases for containers
> and we need to make sure the new property works with all of them.
> 
> Having a "name" for a namespace has been discussed before which is the
> landmine you stepped on when you advocated using the inode number as
> the name, because that's already known to be unworkable.
> 
> Can we back up and ask what problem you're trying to solve before we
> start introducing new objects like namespace name?  The problem
> statement just seems to be "Being able to see the structure of the
> namespaces can be very useful in the context of the containerized
> workloads."  which you later expanded on as "trying to add more
> visibility into the working of things like kubernetes".  If you just
> want to see the namespace "tree" you can script that (as root) by
> matching the process tree and the /proc/<pid>/ns changes without
> actually needing to construct it in the kernel.  This can also be done
> without introducing the concept of a namespace name.  However, there is
> a subtlety of doing this matching in the way I described in that you
> don't get proper parenting to the user namespace ownership ... but that
> seems to be something you don't want anyway?
> 


The major motivation is to be able to hook tracing to individual
containers. We want to be able to quickly discover the PIDs of all
containers running on a system. And when we say all, we mean not only
Docker, but really all sorts of containers that exist now or may exist in
the future. We also considered the solution of brute-forcing all processes
in /proc/*/ns/, but we are afraid that such a solution does not scale. As I
stated in the cover letter, the problem was discussed at Plumbers (links at
the bottom of the cover letter) and the conclusion was that the most
distinctive feature that anything that can be called a 'container' must
have is a separate PID namespace. This is why the PoC starts with the
implementation of this namespace. You can see in the example script that
discovering the name and all PIDs of all containers becomes quick and
trivial with the help of this new filesystem. And you need to add just a
few more lines of code to make it start tracing a selected container.
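For illustration, those extra lines might look roughly like this (a sketch
only: it assumes root and a tracefs mounted at the usual path, and uses
ftrace's generic set_event_pid filter with sched_switch as an example
event; the actual example script in the series may do it differently):

```python
import os

def trace_container_pids(pids, tracefs="/sys/kernel/tracing"):
    """Point ftrace's event PID filter at a container's tasks and
    enable one event, so only those tasks show up in the trace.

    Sketch only: needs root and a mounted tracefs; the event chosen
    here (sched_switch) is just an example.
    """
    with open(os.path.join(tracefs, "set_event_pid"), "w") as f:
        f.write(" ".join(str(p) for p in pids))
    with open(os.path.join(tracefs,
                           "events/sched/sched_switch/enable"), "w") as f:
        f.write("1")
```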

Thanks!
Yordan

> James
> 
> 
> 


* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-19 17:14           ` Yordan Karadzhov
@ 2021-11-19 17:22             ` Steven Rostedt
  2021-11-19 23:22             ` James Bottomley
  1 sibling, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2021-11-19 17:22 UTC (permalink / raw)
  To: Yordan Karadzhov
  Cc: James Bottomley, linux-kernel, linux-fsdevel, viro, mingo, hagen,
	rppt, akpm, vvs, shakeelb, christian.brauner, mkoutny,
	Linux Containers, Eric W. Biederman

On Fri, 19 Nov 2021 19:14:08 +0200
Yordan Karadzhov <y.karadz@gmail.com> wrote:

>  And you need to add just a few more lines of code 
> in order to make it start tracing a selected container.

I would like to add that this is not just about tracing a single container,
but could be tracing several containers and seeing how they interact, and
analyze the contention between them on shared resources. Just to name an
example of what could be done.

-- Steve


* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
       [not found]             ` <20211119114910.177c80d6@gandalf.local.home>
@ 2021-11-19 23:08               ` James Bottomley
  2021-11-22 13:02                 ` Yordan Karadzhov
  0 siblings, 1 reply; 21+ messages in thread
From: James Bottomley @ 2021-11-19 23:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Yordan Karadzhov (VMware), linux-kernel, linux-fsdevel, viro,
	mingo, hagen, rppt, akpm, vvs, shakeelb, christian.brauner,
	mkoutny, Linux Containers, Steven Rostedt, Eric W. Biederman

[trying to reconstruct cc list, since the cc: field is bust again]
> On Fri, 19 Nov 2021 11:47:36 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > > Can we back up and ask what problem you're trying to solve before
> > > we start introducing new objects like namespace name?
> 
> TL;DR version:
> 
> We want to be able to install a container on a machine that will let
> us view all namespaces currently defined on that machine and which
> tasks are associated with them.
> 
> That's basically it.

So you mentioned kubernetes.  Have you tried

kubectl get pods --all-namespaces

?

The point is that orchestration systems usually have interfaces to get
this information, even if the kernel doesn't.  In fact, userspace is
almost certainly the best place to construct this from.

To look at this another way, what if you were simply proposing the
exact same thing but for the process tree.  The push back would be that
we can get that all in userspace and there's even a nice tool (pstree)
to do it which simply walks the /proc interface.  Why, then, do we have
to do nstree in the kernel when we can get all the information in
exactly the same way (walking the process tree)?

James





* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-19 17:14           ` Yordan Karadzhov
  2021-11-19 17:22             ` Steven Rostedt
@ 2021-11-19 23:22             ` James Bottomley
  2021-11-20  0:07               ` Steven Rostedt
  1 sibling, 1 reply; 21+ messages in thread
From: James Bottomley @ 2021-11-19 23:22 UTC (permalink / raw)
  To: Yordan Karadzhov, Steven Rostedt
  Cc: linux-kernel, linux-fsdevel, viro, mingo, hagen, rppt, akpm, vvs,
	shakeelb, christian.brauner, mkoutny, Linux Containers,
	Eric W. Biederman

On Fri, 2021-11-19 at 19:14 +0200, Yordan Karadzhov wrote:
> On 19.11.21 at 18:42, James Bottomley wrote:
[...]
> > Can we back up and ask what problem you're trying to solve before
> > we start introducing new objects like namespace name?  The problem
> > statement just seems to be "Being able to see the structure of the
> > namespaces can be very useful in the context of the containerized
> > workloads."  which you later expanded on as "trying to add more
> > visibility into the working of things like kubernetes".  If you
> > just want to see the namespace "tree" you can script that (as root)
> > by matching the process tree and the /proc/<pid>/ns changes without
> > actually needing to construct it in the kernel.  This can also be
> > done without introducing the concept of a namespace name.  However,
> > there is a subtlety of doing this matching in the way I described
> > in that you don't get proper parenting to the user namespace
> > ownership ... but that seems to be something you don't want anyway?
> > 
> 
> The major motivation is to be able to hook tracing to individual
> containers. We want to be able to quickly discover the 
> PIDs of all containers running on a system. And when we say all, we
> mean not only Docker, but really all sorts of 
> containers that exist now or may exist in the future. We also
> considered the solution of brute-forcing all processes in 
> /proc/*/ns/, but we are afraid that such a solution does not scale.

What do you mean, does not scale?  ps and top use the /proc tree to
gather all the real-time data for every process; do they not "scale"
either, and should they therefore also be done as in-kernel interfaces?

>  As I stated in the Cover letter, the problem was 
> discussed at Plumbers (links at the bottom of the Cover letter) and
> the conclusion was that the most distinct feature 
> that anything that can be called 'Container' must have is a separate
> PID namespace.

Unfortunately, I think I was fighting matrix fires at the time so
couldn't be there.  However, I'd have pushed back on the idea of
identifying containers by the pid namespace (mainly because most of the
unprivileged containers I set up don't have one).  Realistically, if
you're not a system container (need for pid 1) and don't have multiple
untrusted tenants (global process tree information leak), you likely
shouldn't be using the pid namespace either ... it just adds isolation
for no value.

>  This is why the PoC starts with the implementation of this
> namespace. You can see in the example script that discovering the
> name and all PIDs of all  containers gets quick and trivial with the
> help of this new filesystem. And you need to add just few more lines
> of code in order to make it start tracing a selected container.

But I could write a script or a tool to gather all the information
without this filesystem.  The namespace tree can be reconstructed by
anything that can view the process tree and the /proc/<pid>/ns
directory.

James




* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-19 23:22             ` James Bottomley
@ 2021-11-20  0:07               ` Steven Rostedt
  2021-11-20  0:14                 ` James Bottomley
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Rostedt @ 2021-11-20  0:07 UTC (permalink / raw)
  To: James Bottomley
  Cc: Yordan Karadzhov, linux-kernel, linux-fsdevel, viro, mingo, hagen,
	rppt, akpm, vvs, shakeelb, christian.brauner, mkoutny,
	Linux Containers, Eric W. Biederman

On Fri, 19 Nov 2021 18:22:55 -0500
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> But I could write a script or a tool to gather all the information
> without this filesystem.  The namespace tree can be reconstructed by
> anything that can view the process tree and the /proc/<pid>/ns
> directory.

So basically you're stating that we could build the same thing that the
namespacefs would give us from inside a privileged container that had
access to the system procfs?

-- Steve


* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-20  0:07               ` Steven Rostedt
@ 2021-11-20  0:14                 ` James Bottomley
  0 siblings, 0 replies; 21+ messages in thread
From: James Bottomley @ 2021-11-20  0:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Yordan Karadzhov, linux-kernel, linux-fsdevel, viro, mingo, hagen,
	rppt, akpm, vvs, shakeelb, christian.brauner, mkoutny,
	Linux Containers, Eric W. Biederman

On Fri, 2021-11-19 at 19:07 -0500, Steven Rostedt wrote:
> On Fri, 19 Nov 2021 18:22:55 -0500
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> > But I could write a script or a tool to gather all the information
> > without this filesystem.  The namespace tree can be reconstructed
> > by anything that can view the process tree and the /proc/<pid>/ns
> > directory.
> 
> So basically you're stating that we could build the same thing that
> the namespacefs would give us from inside a privileged container that
> had access to the system procfs?

I think so, yes ... and if some information is missing, we could export
it for you.  This way the kernel doesn't prescribe what the namespace
tree looks like and the tool can display it in many different ways. 
For instance, your current RFC patch misses the subtlety of the owning
user namespace, but that could simply be an alternative view presented
by a userspace tool.

James





* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-19 23:08               ` James Bottomley
@ 2021-11-22 13:02                 ` Yordan Karadzhov
  2021-11-22 13:44                   ` James Bottomley
  0 siblings, 1 reply; 21+ messages in thread
From: Yordan Karadzhov @ 2021-11-22 13:02 UTC (permalink / raw)
  To: James Bottomley, Steven Rostedt
  Cc: linux-kernel, linux-fsdevel, viro, mingo, hagen, rppt, akpm, vvs,
	shakeelb, christian.brauner, mkoutny, Linux Containers,
	Eric W. Biederman



On 20.11.21 at 1:08, James Bottomley wrote:
> [trying to reconstruct cc list, since the cc: field is bust again]
>> On Fri, 19 Nov 2021 11:47:36 -0500
>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>>>> Can we back up and ask what problem you're trying to solve before
>>>> we start introducing new objects like namespace name?
>>
>> TL;DR version:
>>
>> We want to be able to install a container on a machine that will let
>> us view all namespaces currently defined on that machine and which
>> tasks are associated with them.
>>
>> That's basically it.
> 
> So you mentioned kubernetes.  Have you tried
> 
> kubectl get pods --all-namespaces
> 
> ?
> 
> The point is that orchestration systems usually have interfaces to get
> this information, even if the kernel doesn't.  In fact, userspace is
> almost certainly the best place to construct this from.
> 
> To look at this another way, what if you were simply proposing the
> exact same thing but for the process tree.  The push back would be that
> we can get that all in userspace and there's even a nice tool (pstree)
> to do it which simply walks the /proc interface.  Why, then, do we have
> to do nstree in the kernel when we can get all the information in
> exactly the same way (walking the process tree)?
> 


I see one important difference between the problem we have and the problem
in your example. /proc contains all the information needed to unambiguously
reconstruct the process tree.

On the other hand, I do not see how one can reconstruct the namespace tree
using only the information in /proc (maybe this is because of my ignorance).

Let's look the following case (oversimplified just to get the idea):
1. The process X is a parent of the process Y and both are in namespace 'A'.
3. "unshare" is used to place process Y (and all its child processes) in a new namespace B (A is a parent namespace of B).
4. "setns" is used to move process X into namespace C.

How would you find the parent namespace of B?

Again, using your arguments, I can reformulate the problem statement this
way: a userspace program is well instrumented to create an arbitrarily
complex tree of namespaces. At the same time, the only place where the
information about the created structure can be retrieved is the userspace
program itself. And when we have multiple userspace programs adding to the
namespace tree, the global picture becomes impossible to recover.

Thanks!
Yordan


> James
> 
> 
> 


* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-22 13:02                 ` Yordan Karadzhov
@ 2021-11-22 13:44                   ` James Bottomley
  2021-11-22 15:00                     ` Yordan Karadzhov
  0 siblings, 1 reply; 21+ messages in thread
From: James Bottomley @ 2021-11-22 13:44 UTC (permalink / raw)
  To: Yordan Karadzhov, Steven Rostedt
  Cc: linux-kernel, linux-fsdevel, viro, mingo, hagen, rppt, akpm, vvs,
	shakeelb, christian.brauner, mkoutny, Linux Containers,
	Eric W. Biederman

On Mon, 2021-11-22 at 15:02 +0200, Yordan Karadzhov wrote:
> 
> On 20.11.21 at 1:08, James Bottomley wrote:
> > [trying to reconstruct cc list, since the cc: field is bust again]
> > > On Fri, 19 Nov 2021 11:47:36 -0500
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > > 
> > > > > Can we back up and ask what problem you're trying to solve
> > > > > before we start introducing new objects like namespace name?
> > > 
> > > TL;DR; verison:
> > > 
> > > We want to be able to install a container on a machine that will
> > > let us view all namespaces currently defined on that machine and
> > > which tasks are associated with them.
> > > 
> > > That's basically it.
> > 
> > So you mentioned kubernetes.  Have you tried
> > 
> > kubectl get pods --all-namespaces
> > 
> > ?
> > 
> > The point is that orchestration systems usually have interfaces to
> > get this information, even if the kernel doesn't.  In fact,
> > userspace is almost certainly the best place to construct this
> > from.
> > 
> > To look at this another way, what if you were simply proposing the
> > exact same thing but for the process tree.  The push back would be
> > that we can get that all in userspace and there's even a nice tool
> > (pstree) to do it which simply walks the /proc interface.  Why,
> > then, do we have to do nstree in the kernel when we can get all the
> > information in exactly the same way (walking the process tree)?
> > 
> 
> I see one important difference between the problem we have and the
> problem in your example. /proc contains all the 
> information needed to unambiguously reconstruct the process tree.
> 
> On the other hand, I do not see how one can reconstruct the namespace
> tree using only the information in /proc (maybe this is because of my
> ignorance).

Well, no, the information may not all exist.  However, the point is we
can add it without adding additional namespace objects.

> Let's look the following case (oversimplified just to get the idea):
> 1. The process X is a parent of the process Y and both are in
> namespace 'A'.
> 3. "unshare" is used to place process Y (and all its child processes)
> in a new namespace B (A is a parent namespace of B).
> 4. "setns" is used to move process X into namespace C.
> 
> How would you find the parent namespace of B?

Actually this one's quite easy: the parent of X in your setup still has
it.

However, I think you're looking to set up a scenario where the
namespace information isn't carried by live processes and that's
certainly possible if we unshare the namespace, bind it to a mount
point and exit the process that unshared it.  It will exist as a bound
namespace with no processes until it gets entered via the binding and
when that happens the parent information can't be deduced from the
process tree.
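(Worth noting: for the hierarchical namespace types — pid and user — the
kernel does expose the parent directly via the NS_GET_PARENT ioctl on an
open ns fd, documented in ioctl_ns(2), which works even for a process-less
bound namespace given CAP_SYS_ADMIN over the parent. A rough sketch:)

```python
import fcntl
import os

# From <linux/nsfs.h>: NS_GET_PARENT = _IO(0xb7, 0x2)
NS_GET_PARENT = 0xb702

def parent_ns_inode(ns_path):
    """Return the nsfs inode of the parent of the namespace at
    ns_path (e.g. "/proc/self/ns/pid"), or None when the kernel
    refuses: at the root of the hierarchy, for a non-hierarchical
    namespace type, or without the needed capability.
    """
    fd = os.open(ns_path, os.O_RDONLY)
    try:
        try:
            # The ioctl returns a new fd referring to the parent ns.
            parent_fd = fcntl.ioctl(fd, NS_GET_PARENT)
        except OSError:
            return None  # root namespace, ENOTTY, or EPERM
        try:
            return os.fstat(parent_fd).st_ino
        finally:
            os.close(parent_fd)
    finally:
        os.close(fd)
```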

There's another problem, that I think you don't care about but someone
will at some point: the owning user_ns can't be deduced from the
current tree either because it depends on the order of entry.  We fixed
unshare so that if you enter multiple namespaces, it enters the user_ns
first so the latter is always the owning namespace, but if you enter
the rest of the namespaces first via one unshare then unshare the
user_ns second, that won't be true.

Neither of the above actually matter for docker like containers because
that's not the way the orchestration system works (it doesn't use mount
bindings or the user_ns) but one day, hopefully, it might.

> Again, using your arguments, I can reformulate the problem statement
> this way: a userspace program is well instrumented 
> to create an arbitrarily complex tree of namespaces. At the same time,
> the only place where the information about the 
> created structure can be retrieved is in the userspace program
> itself. And when we have multiple userspace programs 
> adding to the namespaces tree, the global picture gets impossible to
> recover.

So figure out what's missing in the /proc tree and propose adding it. 
The interface isn't immutable; it's just that what exists today is an
ABI and can't be altered.  I think this is the last time we realised we
needed to add missing information in /proc/<pid>/ns:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eaa0d190bfe1ed891b814a52712dcd852554cb08

So you can use that as the pattern.

James




* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-22 13:44                   ` James Bottomley
@ 2021-11-22 15:00                     ` Yordan Karadzhov
  2021-11-22 15:47                       ` James Bottomley
  0 siblings, 1 reply; 21+ messages in thread
From: Yordan Karadzhov @ 2021-11-22 15:00 UTC (permalink / raw)
  To: James Bottomley, Steven Rostedt
  Cc: linux-kernel, linux-fsdevel, viro, mingo, hagen, rppt, akpm, vvs,
	shakeelb, christian.brauner, mkoutny, Linux Containers,
	Eric W. Biederman



On 22.11.21 at 15:44, James Bottomley wrote:
> Well, no, the information may not all exist.  However, the point is we
> can add it without adding additional namespace objects.
> 
>> Let's look the following case (oversimplified just to get the idea):
>> 1. The process X is a parent of the process Y and both are in
>> namespace 'A'.
>> 3. "unshare" is used to place process Y (and all its child processes)
>> in a new namespace B (A is a parent namespace of B).
>> 4. "setns" is used to move process X into namespace C.
>>
>> How would you find the parent namespace of B?
> Actually this one's quite easy: the parent of X in your setup still has
> it.

Hmm, isn't that true only if we somehow know that (3) happened before (4)?

> However, I think you're looking to set up a scenario where the
> namespace information isn't carried by live processes and that's
> certainly possible if we unshare the namespace, bind it to a mount
> point and exit the process that unshared it.  It will exist as a bound
> namespace with no processes until it gets entered via the binding and
> when that happens the parent information can't be deduced from the
> process tree.
> 
> There's another problem, that I think you don't care about but someone
> will at some point: the owning user_ns can't be deduced from the
> current tree either because it depends on the order of entry.  We fixed
> unshare so that if you enter multiple namespaces, it enters the user_ns
> first so the latter is always the owning namespace, but if you enter
> the rest of the namespaces first via one unshare then unshare the
> user_ns second, that won't be true.
> 
> Neither of the above actually matter for docker like containers because
> that's not the way the orchestration system works (it doesn't use mount
> bindings or the user_ns) but one day, hopefully, it might.
> 
>> Again, using your arguments, I can reformulate the problem statement
>> this way: a userspace program is well instrumented
>> to create an arbitrarily complex tree of namespaces. At the same time,
>> the only place where the information about the
>> created structure can be retrieved is in the userspace program
>> itself. And when we have multiple userspace programs
>> adding to the namespaces tree, the global picture gets impossible to
>> recover.
> So figure out what's missing in the /proc tree and propose adding it.
> The interface isn't immutable; it's just that what exists today is an
> ABI and can't be altered.  I think this is the last time we realised we
> needed to add missing information in /proc/<pid>/ns:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eaa0d190bfe1ed891b814a52712dcd852554cb08
> 
> So you can use that as the pattern.
> 

OK, if everybody agrees that adding extra information to /proc is the right way to go, we will be happy to try 
developing another PoC that implements this approach.

Thank you very much for all your help!
Yordan

> James
> 
> 


* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-22 15:00                     ` Yordan Karadzhov
@ 2021-11-22 15:47                       ` James Bottomley
  2021-11-22 16:15                         ` Yordan Karadzhov
  0 siblings, 1 reply; 21+ messages in thread
From: James Bottomley @ 2021-11-22 15:47 UTC (permalink / raw)
  To: Yordan Karadzhov, Steven Rostedt
  Cc: linux-kernel, linux-fsdevel, viro, mingo, hagen, rppt, akpm, vvs,
	shakeelb, christian.brauner, mkoutny, Linux Containers,
	Eric W. Biederman

On Mon, 2021-11-22 at 17:00 +0200, Yordan Karadzhov wrote:
> 
> On 22.11.21 at 15:44, James Bottomley wrote:
> > Well, no, the information may not all exist.  However, the point is
> > we can add it without adding additional namespace objects.
> > 
> > > Let's look the following case (oversimplified just to get the
> > > idea):
> > > 1. The process X is a parent of the process Y and both are in
> > > namespace 'A'.
> > > 3. "unshare" is used to place process Y (and all its child
> > > processes) in a new namespace B (A is a parent namespace of B).
> > > 4. "setns" is used to move process X into namespace C.
> > > 
> > > How would you find the parent namespace of B?
> > Actually this one's quite easy: the parent of X in your setup still
> > has it.
> 
>> Hmm, isn't that true only if we somehow know that (3) happened before
>> (4)?

This depends.  There are only two parented namespaces: pid and user. 
You said you were only interested in pid for now.  setns on the process
only affects pid_for_children because you have to fork to enter the pid
namespace, so in your scenario X has a new ns/pid_for_children but its
own ns/pid never changed.  It's the ns/pid not the ns/pid_for_children
which is the parent.  This makes me suspect that the specific thing
you're trying to do: trace the pid parentage, can actually be done with
the information we have now.
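The distinction is easy to observe from userspace; a minimal sketch (for a
task that has never done setns(CLONE_NEWPID), the two links agree):

```python
import os

def pid_ns_links(pid="self"):
    """Read the two pid-related ns links for a task: ns/pid is the
    namespace the task itself lives in and never changes after fork;
    ns/pid_for_children is the one its future children will get,
    which is what setns(CLONE_NEWPID) actually moves."""
    return (os.readlink(f"/proc/{pid}/ns/pid"),
            os.readlink(f"/proc/{pid}/ns/pid_for_children"))
```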

If you do this with the user_ns, then you have a problem because it's
not fork on entry.  But, as I listed in the examples, there are a load
of other problems with tracing the user_ns tree.

James




* Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept
  2021-11-22 15:47                       ` James Bottomley
@ 2021-11-22 16:15                         ` Yordan Karadzhov
  0 siblings, 0 replies; 21+ messages in thread
From: Yordan Karadzhov @ 2021-11-22 16:15 UTC (permalink / raw)
  To: James Bottomley, Steven Rostedt
  Cc: linux-kernel, linux-fsdevel, viro, mingo, hagen, rppt, akpm, vvs,
	shakeelb, christian.brauner, mkoutny, Linux Containers,
	Eric W. Biederman



On 22.11.21 at 17:47, James Bottomley wrote:
>> Hmm, isn't that true only if we somehow know that (3) happened before
>> (4)?
> This depends.  There are only two parented namespaces: pid and user.
> You said you were only interested in pid for now.  setns on the process
> only affects pid_for_children because you have to fork to enter the pid
> namespace, so in your scenario X has a new ns/pid_for_children but its
> own ns/pid never changed.  It's the ns/pid not the ns/pid_for_children
> which is the parent.  This makes me suspect that the specific thing
> you're trying to do: trace the pid parentage, can actually be done with
> the information we have now.

This is a very good point indeed. Thank you very much!
Yordan

> 
> If you do this with the user_ns, then you have a problem because it's
> not fork on entry.  But, as I listed in the examples, there are a load
> of other problems with tracing the user_ns tree.


end of thread, other threads:[~2021-11-22 16:15 UTC | newest]

Thread overview: 21+ messages
     [not found] <20211118181210.281359-1-y.karadz@gmail.com>
2021-11-18 18:55 ` [RFC PATCH 0/4] namespacefs: Proof-of-Concept Eric W. Biederman
2021-11-18 19:02   ` Steven Rostedt
2021-11-18 19:22     ` Eric W. Biederman
2021-11-18 19:36       ` Steven Rostedt
2021-11-18 19:24   ` Steven Rostedt
2021-11-19  9:50     ` Kirill Tkhai
2021-11-19 12:45     ` James Bottomley
2021-11-19 14:27       ` Steven Rostedt
2021-11-19 16:42         ` James Bottomley
2021-11-19 17:14           ` Yordan Karadzhov
2021-11-19 17:22             ` Steven Rostedt
2021-11-19 23:22             ` James Bottomley
2021-11-20  0:07               ` Steven Rostedt
2021-11-20  0:14                 ` James Bottomley
     [not found]         ` <f6ca1f5bdb3b516688f291d9685a6a59f49f1393.camel@HansenPartnership.com>
     [not found]           ` <20211119114736.5d9dcf6c@gandalf.local.home>
     [not found]             ` <20211119114910.177c80d6@gandalf.local.home>
2021-11-19 23:08               ` James Bottomley
2021-11-22 13:02                 ` Yordan Karadzhov
2021-11-22 13:44                   ` James Bottomley
2021-11-22 15:00                     ` Yordan Karadzhov
2021-11-22 15:47                       ` James Bottomley
2021-11-22 16:15                         ` Yordan Karadzhov
2021-11-19 14:26   ` Yordan Karadzhov
