NFSd in container

linux-nfs.vger.kernel.org archive mirror

* NFSd in container - it works
@ 2012-11-28 17:13 Stanislav Kinsbursky
  2012-11-28 20:01 ` bfields
  0 siblings, 1 reply; 10+ messages in thread
From: Stanislav Kinsbursky @ 2012-11-28 17:13 UTC (permalink / raw)
  To: bfields@fieldses.org
  Cc: linux-nfs@vger.kernel.org, Jeff Layton,
	Trond.Myklebust@netapp.com

Hi.
I have about 10 more patches that make the NFS server work in a container (mnt + pid + net namespaces). It passes basic tests.
But there are some issues I would like to discuss:
1) NFSd threads run in the init pid namespace. This makes it impossible to stop the NFS server by signals from inside the container. It also makes it possible to stop and
destroy a container without stopping its NFS server (the network namespace will thus stay alive). So there should be some way to destroy these threads
when the container's child reaper exits.
2) We need to solve the issue of registering with the wrong portmapper. Synchronous connects suit both Lockd and NFSd. Bruce, what about the gss daemon? Maybe some other
socket (abstract UNIX or loopback) can be used instead? Or PipeFS?
3) Holding the net by the tracker looks redundant. What was the reason for this?

-- 
Best regards,
Stanislav Kinsbursky


* Re: NFSd in container - it works
  2012-11-28 17:13 NFSd in container - it works Stanislav Kinsbursky
@ 2012-11-28 20:01 ` bfields
  2012-11-28 20:28   ` Jeff Layton
  2012-11-29 11:34   ` Stanislav Kinsbursky
  0 siblings, 2 replies; 10+ messages in thread
From: bfields @ 2012-11-28 20:01 UTC (permalink / raw)
  To: Stanislav Kinsbursky
  Cc: linux-nfs@vger.kernel.org, Jeff Layton,
	Trond.Myklebust@netapp.com

On Wed, Nov 28, 2012 at 09:13:12PM +0400, Stanislav Kinsbursky wrote:
> Hi.
> I have about ~10 more patches, which makes NFS server works in container (mnt + pid + net namespaces). And it passes basic tests.

Good, congratulations.

> But there are some issues I would like to discuss:
> 1) NFSd threads are running in init_pid namespace. This makes
> impossible to stop NFS server by signals from container.

Note "rpc.nfsd 0" (which writes to /proc/fs/nfsd/threads) is what
current Fedora, for example, uses to shut down the server.

It's not ideal, but for now we can tell people "if you're in a container
and want to shut down nfsd, you need to use /proc/fs/nfsd/threads, not
signals."
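
A minimal sketch of that shutdown path from userspace, assuming a small C helper rather than the rpc.nfsd tool itself: it writes a thread count to the control file, and a count of 0 asks the server to stop. The function name nfsd_set_threads and its path parameter are illustrative assumptions, not part of nfs-utils.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Control file named in the message above. */
#define NFSD_THREADS_PATH "/proc/fs/nfsd/threads"

/*
 * Write the desired thread count to an nfsd "threads" control file.
 * A count of 0 asks the server to stop. Returns 0 on success or a
 * negative errno value on failure. The path is a parameter (an
 * assumption of this sketch) so the helper can be tried on a scratch file.
 */
int nfsd_set_threads(const char *path, int nthreads)
{
	char buf[32];
	ssize_t written;
	int len, fd, ret = 0;

	if (nthreads < 0)
		return -EINVAL;

	len = snprintf(buf, sizeof(buf), "%d\n", nthreads);

	/* The control file must already exist; it is never created here. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	written = write(fd, buf, (size_t)len);
	if (written < 0)
		ret = -errno;
	else if (written != len)
		ret = -EIO;

	if (close(fd) < 0 && ret == 0)
		ret = -errno;

	return ret;
}
```

Inside a container, nfsd_set_threads(NFSD_THREADS_PATH, 0) would then take the place of sending signals, provided the nfsd file system is mounted at that location.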

> Also is
> makes possible to stop and destroy container without stopping its
> NFS server (network namespace thus will stay alive). So, there
> should be implemented some way to destroy these threads, when
> container's child reaper is exiting.
> 2) We need to solve this issue with registering in wrong portmapper.
> Sync connects suits both Lockd and NFSd. Bruce, what about gss
> daemon? Maybe some other socket (abstract UNIX or loopback) can be
> used instead? Or PipeFS?

My vague thought was that the gss-proxy can do a write to a special file
to indicate that it's up (and thus that it should be used and not the
old svcgssd interface), and that we could use that process context to do
the connect....  Not sure if that works.

> 3) Holding net by tracker looks redundant. What was the reason for this?

I don't understand, what's tracker?

--b.


* Re: NFSd in container - it works
  2012-11-28 20:01 ` bfields
@ 2012-11-28 20:28   ` Jeff Layton
  2012-11-29 11:53     ` Stanislav Kinsbursky
  2012-11-29 11:34   ` Stanislav Kinsbursky
  1 sibling, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2012-11-28 20:28 UTC (permalink / raw)
  To: bfields@fieldses.org
  Cc: Stanislav Kinsbursky, linux-nfs@vger.kernel.org,
	Trond.Myklebust@netapp.com

On Wed, 28 Nov 2012 15:01:26 -0500
"bfields@fieldses.org" <bfields@fieldses.org> wrote:

> > 3) Holding net by tracker looks redundant. What was the reason for this?
> 
> I don't understand, what's tracker?

I assume he means the clientid tracker. That was necessary for the
nfsdcld upcall because it used rpc_pipefs files, and those were
net-namespacified. Once we deprecate that in 3.10, I don't think we'll
need to worry about the net namespace in the clientid tracker.

We probably *will* need to concern ourselves with the mnt namespace
there though since each container will presumably have its own clientid
database...

-- 
Jeff Layton <jlayton@redhat.com>


* Re: NFSd in container - it works
  2012-11-28 20:01 ` bfields
  2012-11-28 20:28   ` Jeff Layton
@ 2012-11-29 11:34   ` Stanislav Kinsbursky
  1 sibling, 0 replies; 10+ messages in thread
From: Stanislav Kinsbursky @ 2012-11-29 11:34 UTC (permalink / raw)
  To: bfields@fieldses.org
  Cc: linux-nfs@vger.kernel.org, Jeff Layton,
	Trond.Myklebust@netapp.com

On 29.11.2012 00:01, bfields@fieldses.org wrote:
> On Wed, Nov 28, 2012 at 09:13:12PM +0400, Stanislav Kinsbursky wrote:
>> Hi.
>> I have about ~10 more patches, which makes NFS server works in container (mnt + pid + net namespaces). And it passes basic tests.
>
> Good, congratulations.
>

Thanks.

>> But there are some issues I would like to discuss:
>> 1) NFSd threads are running in init_pid namespace. This makes
>> impossible to stop NFS server by signals from container.
>
> Note "rpc.nfsd 0" (which writes to /proc/fs/nfsd/threads) is what
> current Fedora, for example, uses to shut down the server.
>

Yes, this is the only right way. And this raises another issue: in containers with an old operating system (RHEL6, for example), the init scripts have to be updated.

> It's not ideal, but for now we can tell people "if you're in a container
> and want to shut down nfsd, you need to use /proc/fs/nfsd/threads, not
> signals."
>

OK. But there is another issue.
Imagine that you have a container with its own pid and network namespaces (like an OpenVZ container).
You can start an NFS server in such a container and then kill the container's "init" (child reaper) from outside.
The child reaper and all its children will die, but the NFSd kthreads will remain running. Note that they currently hold the network namespace, which
means that the NFS server is still running. Now add one more namespace to this example: the mount namespace. Currently it is not held by the NFSd kthreads. Thus the NFSd
kthreads and the network namespace can disappear from under the NFSd file system (which will be mounted per-net). I'm afraid this will lead to a kernel panic
as soon as the NFS server receives any request.

So I see only one proper solution so far:
1) NFSd doesn't hold network namespace references, but instead registers its callback in the per-net operations, which allows all NFSd kthreads to be shut down properly on
network namespace destruction. This looks sane, because the kthreads are started by the kernel, and this approach allows the NFS server to be shut down properly when its
child reaper has been killed.
2) The NFSd file system holds the network namespace. I don't really like this solution, but it looks like the only way to make sure that we don't hit the kernel panic
mentioned earlier. Moreover, if the NFSd file system is mounted in a separate mount namespace, it (the mount point) will be unmounted during child reaper exit,
before the network namespace is destroyed.

I have to note that if the mount namespace is shared between the host and the container, then the NFSd mount point won't be unmounted on child reaper exit, the container's NFSd
kthreads will keep running, and thus the whole NFS server will stay active after the container stops. The situation doesn't look pleasant, but it's sane, and the whole NFSd
will be properly destroyed when the NFSd fs is unmounted.

One more note: unmounting the NFSd file system on network namespace shutdown (instead of holding a network reference) is another possible solution. This one is
even better, because we can fully shut down the NFS server on child reaper exit.
But there are a couple of problems:
1) we have to tie the network namespace to the mount point (which is not good and not that simple).
2) we have to make sure that the mount point is destroyed before the kthreads are shut down (again, neither good nor simple).
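
The difference between threads that pin the network namespace and threads that are stopped by a per-net exit callback can be modelled in plain userspace C. This is a toy sketch only: struct toy_net and every toy_* function are hypothetical stand-ins, not the kernel's struct net, get_net()/put_net(), or pernet_operations.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a network namespace (hypothetical, not struct net). */
struct toy_net {
	int refcount;				/* references keeping the namespace alive */
	bool destroyed;				/* set once teardown has run */
	int nfsd_threads;			/* server threads bound to this namespace */
	void (*exit_cb)(struct toy_net *net);	/* per-namespace exit hook, may be NULL */
};

/* A fresh namespace is owned by the container's tasks: one reference. */
void toy_net_init(struct toy_net *net)
{
	net->refcount = 1;
	net->destroyed = false;
	net->nfsd_threads = 0;
	net->exit_cb = NULL;
}

/* Drop one reference; run the exit hook and tear down on the last one. */
void toy_put_net(struct toy_net *net)
{
	if (--net->refcount > 0)
		return;
	if (net->exit_cb)
		net->exit_cb(net);
	net->destroyed = true;
}

/* Current behaviour in this model: every server thread pins the namespace. */
void toy_start_threads_holding_net(struct toy_net *net, int n)
{
	net->nfsd_threads = n;
	net->refcount += n;
}

/* Exit hook for the proposed behaviour: stop all server threads. */
static void toy_nfsd_net_exit(struct toy_net *net)
{
	net->nfsd_threads = 0;
}

/* Proposed behaviour: threads take no reference, the exit hook stops them. */
void toy_start_threads_with_exit_hook(struct toy_net *net, int n)
{
	net->nfsd_threads = n;
	net->exit_cb = toy_nfsd_net_exit;
}
```

In this model, dropping the container's own reference (the child reaper dying) leaves the first namespace alive with its threads still running, while the second one is torn down and its threads are stopped by the hook.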

>> Also is
>> makes possible to stop and destroy container without stopping its
>> NFS server (network namespace thus will stay alive). So, there
>> should be implemented some way to destroy these threads, when
>> container's child reaper is exiting.
>> 2) We need to solve this issue with registering in wrong portmapper.
>> Sync connects suits both Lockd and NFSd. Bruce, what about gss
>> daemon? Maybe some other socket (abstract UNIX or loopback) can be
>> used instead? Or PipeFS?
>
> My vague thought was that the gss-proxy can do a write to a special file
> to indicate that it's up (and thus that it should be used and not the
> old svcgssd interface), and that we could use that process context to do
> the connect....  Not sure if that works.
>

Does this mean that you don't object to synchronous transport connects to UNIX sockets?

>> 3) Holding net by tracker looks redundant. What was the reason for this?
>
> I don't understand, what's tracker?
>
> --b.
>

-- 
Best regards,
Stanislav Kinsbursky


* Re: NFSd in container - it works
  2012-11-28 20:28   ` Jeff Layton
@ 2012-11-29 11:53     ` Stanislav Kinsbursky
  2012-11-29 12:13       ` Jeff Layton
  0 siblings, 1 reply; 10+ messages in thread
From: Stanislav Kinsbursky @ 2012-11-29 11:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: bfields@fieldses.org, linux-nfs@vger.kernel.org,
	Trond.Myklebust@netapp.com

On 29.11.2012 00:28, Jeff Layton wrote:
> On Wed, 28 Nov 2012 15:01:26 -0500
> "bfields@fieldses.org" <bfields@fieldses.org> wrote:
>
>>> 3) Holding net by tracker looks redundant. What was the reason for this?
>>
>> I don't understand, what's tracker?
>
> I assume he means the clientid tracker. That was necessary for the
> nfsdcld upcall because it used rpc_pipefs files, and those were
> net-namespacified. Once we deprecate that in 3.10, I don't think we'll
> need to worry about the net namespace in the clientid tracker.
>

Sorry, I don't understand. The rpc_pipefs superblock already holds the network namespace.

> We probably *will* need to concern ourselves with the mnt namespace
> there though since each container will presumably have its own clientid
> database...

Since the NFSd server is network-namespace based, we can create one server for more than one mount namespace.
And the clientid tracker holds files open, and thus holds the mount.
The mount namespace itself doesn't look that important to me.
Or am I wrong?

-- 
Best regards,
Stanislav Kinsbursky


* Re: NFSd in container - it works
  2012-11-29 11:53     ` Stanislav Kinsbursky
@ 2012-11-29 12:13       ` Jeff Layton
  2012-11-29 12:48         ` Stanislav Kinsbursky
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2012-11-29 12:13 UTC (permalink / raw)
  To: Stanislav Kinsbursky
  Cc: bfields@fieldses.org, linux-nfs@vger.kernel.org,
	Trond.Myklebust@netapp.com

On Thu, 29 Nov 2012 15:53:47 +0400
Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:

> On 29.11.2012 00:28, Jeff Layton wrote:
> > On Wed, 28 Nov 2012 15:01:26 -0500
> > "bfields@fieldses.org" <bfields@fieldses.org> wrote:
> >
> >>> 3) Holding net by tracker looks redundant. What was the reason for this?
> >>
> >> I don't understand, what's tracker?
> >
> > I assume he means the clientid tracker. That was necessary for the
> > nfsdcld upcall because it used rpc_pipefs files, and those were
> > net-namespacified. Once we deprecate that in 3.10, I don't think we'll
> > need to worry about the net namespace in the clientid tracker.
> >
> 
> Sorry, I don't understand. Rpc_pipefs superblock already holds network namespace.
> 

...and how do you know which rpc_pipefs superblock you're dealing with?
In any case, I'm ok with ripping out references to the net namespace
there if you think it's valid.

> > We probably *will* need to concern ourselves with the mnt namespace
> > there though since each container will presumably have its own clientid
> > database...
> 
> Since NFSd server in network namespace based, it means, that we can create one server for more than one mount namespace.
> And clientid tracker holds files opened. Thus holds mount.
> Mount namespace itself doesn't look that important to me.
> Or I'm wrong?
> 

I confess I don't understand the design well enough to reasonably
comment here...

Both the new clientid tracker and the legacy one involve storing data
on a local filesystem somewhere. In the case of the new tracker, we
upcall using call_usermodehelper to spawn a process to handle access to
the on disk database. In the legacy tracker, it's done directly by the
kernel using vfs calls.

Presumably, you will have multiple containers serving NFS, so you'll
have multiple sets of client id data being stored. You'll need some
mechanism to ensure that the usermodehelper is spawned within the
correct container or that the legacy tracker accesses the files in the
correct container.

My assumption there was that you'd need to ensure that it's using the
right mount namespace in order to do that...

-- 
Jeff Layton <jlayton@redhat.com>


* Re: NFSd in container - it works
  2012-11-29 12:13       ` Jeff Layton
@ 2012-11-29 12:48         ` Stanislav Kinsbursky
  2012-11-29 12:55           ` Jeff Layton
  0 siblings, 1 reply; 10+ messages in thread
From: Stanislav Kinsbursky @ 2012-11-29 12:48 UTC (permalink / raw)
  To: Jeff Layton
  Cc: bfields@fieldses.org, linux-nfs@vger.kernel.org,
	Trond.Myklebust@netapp.com

On 29.11.2012 16:13, Jeff Layton wrote:
> On Thu, 29 Nov 2012 15:53:47 +0400
> Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:
>
>> On 29.11.2012 00:28, Jeff Layton wrote:
>>> On Wed, 28 Nov 2012 15:01:26 -0500
>>> "bfields@fieldses.org" <bfields@fieldses.org> wrote:
>>>
>>>>> 3) Holding net by tracker looks redundant. What was the reason for this?
>>>>
>>>> I don't understand, what's tracker?
>>>
>>> I assume he means the clientid tracker. That was necessary for the
>>> nfsdcld upcall because it used rpc_pipefs files, and those were
>>> net-namespacified. Once we deprecate that in 3.10, I don't think we'll
>>> need to worry about the net namespace in the clientid tracker.
>>>
>>
>> Sorry, I don't understand. Rpc_pipefs superblock already holds network namespace.
>>
>
> ...and how do you know which rpc_pipefs superblock you're dealing with?
> In any case, I'm ok with ripping out references to the net namespace
> there if you think it's valid.
>

I'm not saying that you don't need a reference to the network namespace.
I'm just not sure that you need to grab a reference to it (i.e. call get_net()).

>>> We probably *will* need to concern ourselves with the mnt namespace
>>> there though since each container will presumably have its own clientid
>>> database...
>>
>> Since NFSd server in network namespace based, it means, that we can create one server for more than one mount namespace.
>> And clientid tracker holds files opened. Thus holds mount.
>> Mount namespace itself doesn't look that important to me.
>> Or I'm wrong?
>>
>
> I confess I don't understand the design well enough to reasonably
> comment here...
>
> Both the new clientid tracker and the legacy one involve storing data
> on a local filesystem somewhere. In the case of the new tracker, we
> upcall using call_usermodehelper to spawn a process to handle access to
> the on disk database. In the legacy tracker, it's done directly by the
> kernel using vfs calls.
>
> Presumably, you will have multiple containers serving NFS, so you'll
> have multiple sets of client id data being stored. You'll need some
> mechanism to ensure that the usermodehelper is spawned within the
> correct container or that the legacy tracker accesses the files in the
> correct container.
>
> My assumption there was that you'd need to ensure that it's using the
> right mount namespace in order to do that...
>

Yes, I see... It looks like it's better to disable this type of tracker in containers for now.
As I see it, the problem is not the mount namespace itself, but the proper root path for khelper.
I.e. the problem is the same as with the portmapper UNIX sockets.
But, luckily, the usermode helper allows passing init/cleanup functions.
The init function could look like this:

init() {
	unshare_fs_struct();	// make sure that we won't affect other kthreads
	swap_root();		// replace root: get the new one, put the old one
}

A cleanup function is not required: the fs struct will be destroyed when the usermode helper thread exits.

-- 
Best regards,
Stanislav Kinsbursky


* Re: NFSd in container - it works
  2012-11-29 12:48         ` Stanislav Kinsbursky
@ 2012-11-29 12:55           ` Jeff Layton
  2012-11-29 13:04             ` Stanislav Kinsbursky
  2012-11-29 14:11             ` Stanislav Kinsbursky
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff Layton @ 2012-11-29 12:55 UTC (permalink / raw)
  To: Stanislav Kinsbursky
  Cc: bfields@fieldses.org, linux-nfs@vger.kernel.org,
	Trond.Myklebust@netapp.com

On Thu, 29 Nov 2012 16:48:43 +0400
Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:

> On 29.11.2012 16:13, Jeff Layton wrote:
> > On Thu, 29 Nov 2012 15:53:47 +0400
> > Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:
> >
> >> On 29.11.2012 00:28, Jeff Layton wrote:
> >>> On Wed, 28 Nov 2012 15:01:26 -0500
> >>> "bfields@fieldses.org" <bfields@fieldses.org> wrote:
> >>>
> >>>>> 3) Holding net by tracker looks redundant. What was the reason for this?
> >>>>
> >>>> I don't understand, what's tracker?
> >>>
> >>> I assume he means the clientid tracker. That was necessary for the
> >>> nfsdcld upcall because it used rpc_pipefs files, and those were
> >>> net-namespacified. Once we deprecate that in 3.10, I don't think we'll
> >>> need to worry about the net namespace in the clientid tracker.
> >>>
> >>
> >> Sorry, I don't understand. Rpc_pipefs superblock already holds network namespace.
> >>
> >
> > ...and how do you know which rpc_pipefs superblock you're dealing with?
> > In any case, I'm ok with ripping out references to the net namespace
> > there if you think it's valid.
> >
> 
> I'm not saying, that you don't need to have a reference to network namespace.
> I'm not sure, that you need to grab the reference to it (i.e. call get_net()).
> 

Hmm..ok. I'll take your word for it on this...

> >>> We probably *will* need to concern ourselves with the mnt namespace
> >>> there though since each container will presumably have its own clientid
> >>> database...
> >>
> >> Since NFSd server in network namespace based, it means, that we can create one server for more than one mount namespace.
> >> And clientid tracker holds files opened. Thus holds mount.
> >> Mount namespace itself doesn't look that important to me.
> >> Or I'm wrong?
> >>
> >
> > I confess I don't understand the design well enough to reasonably
> > comment here...
> >
> > Both the new clientid tracker and the legacy one involve storing data
> > on a local filesystem somewhere. In the case of the new tracker, we
> > upcall using call_usermodehelper to spawn a process to handle access to
> > the on disk database. In the legacy tracker, it's done directly by the
> > kernel using vfs calls.
> >
> > Presumably, you will have multiple containers serving NFS, so you'll
> > have multiple sets of client id data being stored. You'll need some
> > mechanism to ensure that the usermodehelper is spawned within the
> > correct container or that the legacy tracker accesses the files in the
> > correct container.
> >
> > My assumption there was that you'd need to ensure that it's using the
> > right mount namespace in order to do that...
> >
> 
> Yes, I see... Look like it's better to disable this type tracker in containers for now.
> As, I see it, the problem is not in mount namespace itself, but in proper root path for khelper.
> I.e. the problem is the same as with portmapper Unix sockets.
> But, luckily, usermode khelper allow to pass init/cleanup functions.
> Init function could be like these:
> 
> init() {
> 	unshare_fs_struct();	// to make sure, that we won't affect other kthreads
> 	swap_root();		// replace root, get new one, put old one.
> }
> 
> Cleanup function is not required: fs struct will be destroyed on usermode khelper thread exit.
> 

nfsdcld is being ripped out in 3.10. The binaries for it are already
gone from nfs-utils. If you disable the legacy and usermode helper
trackers, then you have nothing left. :)

I think you'll need to come up with some mechanism to ensure that these
things are done in the correct container. What that is, and how that
should work, I'm not sure...

-- 
Jeff Layton <jlayton@redhat.com>


* Re: NFSd in container - it works
  2012-11-29 12:55           ` Jeff Layton
@ 2012-11-29 13:04             ` Stanislav Kinsbursky
  2012-11-29 14:11             ` Stanislav Kinsbursky
  1 sibling, 0 replies; 10+ messages in thread
From: Stanislav Kinsbursky @ 2012-11-29 13:04 UTC (permalink / raw)
  To: Jeff Layton
  Cc: bfields@fieldses.org, linux-nfs@vger.kernel.org,
	Trond.Myklebust@netapp.com

On 29.11.2012 16:55, Jeff Layton wrote:
>
> nfsdcld is being ripped out in 3.10. The binaries for it are already
> gone from nfs-utils. If you disable the legacy and usermode helper
> trackers, then you have nothing left. :)
>

I was going to disable only the usermode helper in containers for a while (it looks like we have at least half a year until the 3.10 kernel).

> I think you'll need to come up with some mechanism to ensure that these
> things are done in the correct container. What that is, and how that
> should work, I'm not sure...
>

This doesn't look that complicated: the root has to be taken from the task that starts the tracker and stored with the NFSd per-net data.
Then it can be passed to the usermode helper. And of course, this path has to be grabbed (path_get() must be called).
Not a great design, but I believe it will work. And it doesn't need any changes to other parts of the kernel tree...

-- 
Best regards,
Stanislav Kinsbursky


* Re: NFSd in container - it works
  2012-11-29 12:55           ` Jeff Layton
  2012-11-29 13:04             ` Stanislav Kinsbursky
@ 2012-11-29 14:11             ` Stanislav Kinsbursky
  1 sibling, 0 replies; 10+ messages in thread
From: Stanislav Kinsbursky @ 2012-11-29 14:11 UTC (permalink / raw)
  To: Jeff Layton
  Cc: bfields@fieldses.org, linux-nfs@vger.kernel.org,
	Trond.Myklebust@netapp.com

On 29.11.2012 16:55, Jeff Layton wrote:
>
> nfsdcld is being ripped out in 3.10. The binaries for it are already
> gone from nfs-utils. If you disable the legacy and usermode helper
> trackers, then you have nothing left. :)
>

BTW, I'm going to remove your init_net check and enable the legacy tracker for containers.



-- 
Best regards,
Stanislav Kinsbursky
