linux-nfs.vger.kernel.org archive mirror
From: Stanislav Kinsbursky <skinsbursky@parallels.com>
To: "bfields@fieldses.org" <bfields@fieldses.org>
Cc: "Myklebust, Trond" <Trond.Myklebust@netapp.com>,
	Jeff Layton <jlayton@redhat.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: Grace period
Date: Tue, 10 Apr 2012 14:56:12 +0400
Message-ID: <4F8411CC.2070105@parallels.com>
In-Reply-To: <20120409181138.GA9581@fieldses.org>

09.04.2012 22:11, bfields@fieldses.org wrote:
> On Mon, Apr 09, 2012 at 08:56:47PM +0400, Stanislav Kinsbursky wrote:
>> 09.04.2012 20:33, Myklebust, Trond wrote:
>>> On Mon, 2012-04-09 at 12:21 -0400, bfields@fieldses.org wrote:
>>>> On Mon, Apr 09, 2012 at 04:17:06PM +0000, Myklebust, Trond wrote:
>>>>> On Mon, 2012-04-09 at 12:11 -0400, bfields@fieldses.org wrote:
>>>>>> On Mon, Apr 09, 2012 at 08:08:57PM +0400, Stanislav Kinsbursky wrote:
>>>>>>> 09.04.2012 19:27, Jeff Layton wrote:
>>>>>>>>
>>>>>>>> If you allow one container to hand out conflicting locks while another
>>>>>>>> container is allowing reclaims, then you can end up with some very
>>>>>>>> difficult to debug silent data corruption. That's the worst possible
>>>>>>>> outcome, IMO. We really need to actively keep people from shooting
>>>>>>>> themselves in the foot here.
>>>>>>>>
>>>>>>>> One possibility might be to only allow filesystems to be exported from
>>>>>>>> a single container at a time (and allow that to be overridable somehow
>>>>>>>> once we have a working active/active serving solution). With that, you
>>>>>>>> may be able to limp along with a per-container grace period handling
>>>>>>>> scheme like you're proposing.
>>>>>>>>
>>>>>>>
>>>>>>> Ok then. Keeping people from shooting themselves here sounds reasonable.
>>>>>>> And I like the idea of exporting a filesystem only from once per
>>>>>>> network namespace.
>>>>>>
>>>>>> Unfortunately that's not going to get us very far, especially not in the
>>>>>> v4 case where we've got the common read-only pseudoroot that everyone
>>>>>> has to share.
>>>>>
>>>>> I don't see how that can work in cases where each container has its own
>>>>> private mount namespace. You're going to have to tie that pseudoroot to
>>>>> the mount namespace somehow.
>>>>
>>>> Sure, but in typical cases it'll still be shared; requiring that they
>>>> not be sounds like a severe limitation.
>>>
>>> I'd expect the typical case to be the non-shared namespace: the whole
>>> point of containers is to provide for complete isolation of processes.
>>> Usually that implies that you don't want them to be able to communicate
>>> via a shared filesystem.
>>>
>>
>> BTW, we DO use one mount namespace for all containers and host in
>> OpenVZ. This allows us to have an access to containers mount points
>> from initial environment. Isolation between containers is done via
>> chroot and some simple tricks on /proc/mounts read operation.
>> Moreover, with one mount namespace, we currently support
>> bind-mounting on NFS from one container into another...
>>
>> Anyway, I'm sorry, but I'm not familiar with this pseudoroot idea.
>
> Since NFSv4 doesn't have a separate MOUNT protocol, clients need to be
> able to do readdir's and lookups to get to exported filesystems.  We
> support this in the Linux server by exporting all the filesystems from
> "/" on down that must be traversed to reach a given filesystem.  These
> exports are very restricted (e.g. only parents of exports are visible).
>

OK, thanks for the explanation.
So this pseudoroot looks like part of the NFS server's internal implementation
rather than part of the standard. That's good.
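
To check my understanding of the quoted explanation, here is a minimal sketch
of which directories such a pseudoroot would expose. It is only an
illustration, not the server's code: the function name pseudoroot_dirs and the
example paths are made up, and it treats every ancestor directory as visible,
whereas the real server works in terms of the filesystems that must be
traversed.

```python
from pathlib import PurePosixPath


def pseudoroot_dirs(exports):
    """Return the directories a client must traverse to reach the exports.

    `exports` is an iterable of absolute export paths (hypothetical input).
    The result holds "/" and every proper ancestor of an export, i.e. a
    read-only, directories-only view in which only parents of exports show up.
    """
    visible = set()
    for export in exports:
        path = PurePosixPath(export)
        # For /srv/nfs/a the parents are /srv/nfs, /srv and /.
        visible.update(str(parent) for parent in path.parents)
    return sorted(visible)
```

With exports /srv/nfs/a and /home/ct1 this yields /, /home, /srv and /srv/nfs;
nothing else under those directories would be visible through the pseudoroot.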

>> Why does it prevent implementing a check for the "superblock-network
>> namespace" pair on NFS server start, and forbidding (?) the start if
>> this pair is already shared in another namespace? I.e. maybe this
>> pseudoroot can be an exception to this rule?
>
> That might work.  It's read-only and consists only of directories, so
> the grace period doesn't affect it.
>

I've just realized that this per-sb grace period won't work.
I.e., it's a valid situation when two or more containers are located on the
same filesystem but share different parts of it. There is no conflict here at all.
I don't see any clear and simple way to handle such cases, because otherwise
we have to tie the network namespace to the filesystem namespace.
I.e., some way would be required to determine whether the export directory
being passed is already shared somewhere else or not.
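
A rough sketch of the check I have in mind for the situation above, as plain
Python rather than kernel code. Everything here is an assumption for
illustration: the (superblock_id, host_path) tuples, the function names, and
the idea that paths are already resolved in the initial filesystem
environment. Two exports conflict only when they sit on the same superblock
and one directory contains (or equals) the other.

```python
from pathlib import PurePosixPath


def subtrees_overlap(a, b):
    """Return True if one path equals the other or is its ancestor."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def exports_conflict(export_a, export_b):
    """Each export is a hypothetical (superblock_id, host_path) tuple.

    A per-superblock check would flag any two exports on the same
    filesystem; this one flags them only when the subtrees overlap.
    """
    sb_a, path_a = export_a
    sb_b, path_b = export_b
    return sb_a == sb_b and subtrees_overlap(path_a, path_b)
```

So two containers exporting /vz/ct1/export and /vz/ct2/export from the same
filesystem would not conflict, while /vz/shared and /vz/shared/sub would.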

Realistic solution: since the export check should be done in the initial
filesystem environment (most probably the container will have its own root),
we have to pass this data somehow to some kernel thread or userspace daemon in
the initial filesystem environment (sockets don't suit here... Shared memory?).

Improbable solution: patching the VFS layer...

-- 
Best regards,
Stanislav Kinsbursky
