NFSv4 high availability setups

* NFSv4 high availability setups
@ 2012-04-05 10:31 Lukas Hejtmanek
  2012-04-05 11:39 ` Jeff Layton
  0 siblings, 1 reply; 10+ messages in thread
From: Lukas Hejtmanek @ 2012-04-05 10:31 UTC (permalink / raw)
  To: linux-nfs

Hi,

We have several front-ends for shared storage. We want to build an HA setup so
that a failed front-end fails over to another front-end (one that is already
serving NFSv4).

As I understand it, NFSv4 keeps its state directory somewhere in
/var/lib/nfs/rpc_pipefs.

Can we put this state directory on a shared volume so that it is common to
all the front-ends serving the same content? Is this supposed to work, and
can NFSv4 merge its state with existing state on a shared disk?

-- 
Lukáš Hejtmánek

* Re: NFSv4 high availability setups
  2012-04-05 10:31 NFSv4 high availability setups Lukas Hejtmanek
@ 2012-04-05 11:39 ` Jeff Layton
  2012-04-10 12:55   ` Lukas Hejtmanek
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2012-04-05 11:39 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: linux-nfs

On Thu, 5 Apr 2012 12:31:24 +0200
Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:

> Hi,
> 
> we got several front-ends for a shared storage. We want to build HA setup so
> that failed front-end fails over to another front-end (that is serving NFSv4
> already). 
> 
> As I understand, NFS4 uses state dir somewhere in /var/lib/nfs/rpc_pipefs.
> 

You're probably thinking of /var/lib/nfs/v4recovery.

> Can we put this state dir on a shared volume so that this state dir is common
> for all the front-ends serving the same content? Is is supposed to work and
> NFSv4 can merge its state with existing state on a shared disk?
> 

Not properly, no. nfsd expects to have complete control over that
directory. There's no locking or merging of the data there. A node will
also clean that directory out in some cases, and that will throw your
state tracking off.
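
To make that concrete, here is a toy sketch in Python of two nodes pointed at
one recovery directory. It is not nfsd's real logic: the ToyNode class, its
method names, and the "one subdirectory per client" layout are assumptions
made only for illustration. It shows how one node's cleanup can silently
remove a record that the other node still relies on.

```python
import tempfile
from pathlib import Path


class ToyNode:
    """Toy stand-in for one nfsd instance (hypothetical, not kernel code)."""

    def __init__(self, name: str, recovery_dir: Path) -> None:
        self.name = name
        self.recovery_dir = recovery_dir

    def record_client(self, client_id: str) -> None:
        # Assumed layout: one subdirectory per client holding state here.
        (self.recovery_dir / client_id).mkdir(exist_ok=True)

    def purge_unreclaimed(self, reclaimed: set) -> None:
        # The node believes every entry is its own, so it removes all
        # entries for clients that did not reclaim with *this* node.
        for entry in list(self.recovery_dir.iterdir()):
            if entry.name not in reclaimed:
                entry.rmdir()

    def may_reclaim(self, client_id: str) -> bool:
        return (self.recovery_dir / client_id).is_dir()


def demo_shared_directory() -> bool:
    """Return whether node A still knows client1 after node B's cleanup."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)
        node_a = ToyNode("A", shared)
        node_b = ToyNode("B", shared)
        node_a.record_client("client1")  # client1 holds state on A
        node_b.record_client("client2")  # client2 holds state on B
        # B restarts; only client2 reclaims with B, so B purges the rest.
        node_b.purge_unreclaimed(reclaimed={"client2"})
        return node_a.may_reclaim("client1")
```

In this sketch demo_shared_directory() returns False: node A has lost its
record of client1 even though client1 never stopped talking to A.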

Kernel 3.4 just got an overhaul of this code to use an upcall instead. At this
point I'm waiting on Steve to merge the userspace portion of that. The
legacy client tracking code will probably never be cluster-aware.

This is actually a very complex problem to solve as you need to
coordinate the grace periods between the different serving cluster
nodes.

I've been looking at this problem for the last few months, and am still
working out a design that would allow active/active NFSv4 serving. For
now, I'd advise against trying it since it won't work properly.

If you want to follow along with the gory details of the design, I've
been sporadically doing blog posts about it here:

    http://jtlayton.wordpress.com/

-- 
Jeff Layton <jlayton@redhat.com>

* Re: NFSv4 high availability setups
  2012-04-05 11:39 ` Jeff Layton
@ 2012-04-10 12:55   ` Lukas Hejtmanek
  2012-04-10 13:13     ` Jeff Layton
  0 siblings, 1 reply; 10+ messages in thread
From: Lukas Hejtmanek @ 2012-04-10 12:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, jiri.horky

Hi Jeff, 

On Thu, Apr 05, 2012 at 07:39:01AM -0400, Jeff Layton wrote:
> > As I understand, NFS4 uses state dir somewhere in /var/lib/nfs/rpc_pipefs.
> > 
> 
> You're probably thinking of /var/lib/nfs/v4recovery.

Yes, sorry for the confusion.

> > Can we put this state dir on a shared volume so that this state dir is common
> > for all the front-ends serving the same content? Is is supposed to work and
> > NFSv4 can merge its state with existing state on a shared disk?
> > 
> 
> Not properly, no. nfsd expects to have complete control over that
> directory. There's no locking or merging of the data there. A node will
> also clean that directory out in some cases, and that will throw your
> state tracking off.

Thank you for the information.

Is there any (preferably simple) way to demonstrate that this does not work
properly? E.g., if I share the same export through two or more NFSv4
front-ends that share the v4recovery directory, do I trigger problems with
this tool http://nfsv4.bullopensource.org/tools/tests/locktest.php?

-- 
Lukáš Hejtmánek

* Re: NFSv4 high availability setups
  2012-04-10 12:55   ` Lukas Hejtmanek
@ 2012-04-10 13:13     ` Jeff Layton
  2012-04-10 18:14       ` Michael Schwartzkopff
  2012-04-17 14:34       ` Lukas Hejtmanek
  0 siblings, 2 replies; 10+ messages in thread
From: Jeff Layton @ 2012-04-10 13:13 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: linux-nfs, jiri.horky

On Tue, 10 Apr 2012 14:55:52 +0200
Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:

> Hi Jeff, 
> 
> On Thu, Apr 05, 2012 at 07:39:01AM -0400, Jeff Layton wrote:
> > > As I understand, NFS4 uses state dir somewhere in /var/lib/nfs/rpc_pipefs.
> > > 
> > 
> > You're probably thinking of /var/lib/nfs/v4recovery.
> 
> yes, sorry for confusion.
> 
> > > Can we put this state dir on a shared volume so that this state dir is common
> > > for all the front-ends serving the same content? Is is supposed to work and
> > > NFSv4 can merge its state with existing state on a shared disk?
> > > 
> > 
> > Not properly, no. nfsd expects to have complete control over that
> > directory. There's no locking or merging of the data there. A node will
> > also clean that directory out in some cases, and that will throw your
> > state tracking off.
> 
> Thank you for information.
> 
> Is there any (preferably simple) way to demonstrate that this does not work
> properly? E.g., if I share the same export through two or more NFSv4
> front-ends that share the v4recovery directory, do I trigger problems with
> this tool http://nfsv4.bullopensource.org/tools/tests/locktest.php?
> 

Nope. It'll all work just great...until it doesn't. I don't have any
specific failure scenarios, but most of the problems will be issues
with state recovery when a server node is restarted.

That may manifest in different ways -- problems reclaiming locks for
instance, or even silent data corruption depending on the application.

For instance, after a server node reboots but before the client that
really "owns" a lock reclaims it, another node might hand that lock out to
a different client, which then releases it. Depending on the application,
that may cause serious problems.
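
A rough Python sketch of that race, under heavy simplification: both nodes
share one lock table (standing in for the clustered filesystem), each node has
its own grace period, and the ToyNode class, the path, and the timings are all
made up for illustration. It is not how nfsd implements grace handling.

```python
class ToyNode:
    """Toy NFS front-end; all nodes share one lock table (hypothetical)."""

    def __init__(self, shared_locks: dict, grace_until: float) -> None:
        self.shared_locks = shared_locks
        self.grace_until = grace_until

    def new_lock(self, client: str, path: str, now: float) -> bool:
        # New locks are refused while this node is in its grace period.
        if now < self.grace_until or path in self.shared_locks:
            return False
        self.shared_locks[path] = client
        return True

    def unlock(self, client: str, path: str) -> None:
        if self.shared_locks.get(path) == client:
            del self.shared_locks[path]

    def reclaim(self, client: str, path: str, now: float) -> bool:
        # Reclaims are only accepted during this node's grace period.
        if now >= self.grace_until or path in self.shared_locks:
            return False
        self.shared_locks[path] = client
        return True


def run_scenario(node_a_grace_until: float) -> tuple:
    """Node B has just rebooted; client1's old lock is gone from the table."""
    shared_locks = {}
    node_a = ToyNode(shared_locks, grace_until=node_a_grace_until)
    node_b = ToyNode(shared_locks, grace_until=90.0)
    # client2 asks node A for the lock that client1 still believes it holds.
    granted = node_a.new_lock("client2", "/export/file", now=5.0)
    node_a.unlock("client2", "/export/file")
    # client1 then reclaims through node B and notices nothing.
    reclaimed = node_b.reclaim("client1", "/export/file", now=20.0)
    return granted, reclaimed
```

With run_scenario(0.0), node A is not in grace, so client2 gets the lock and
client1 still reclaims it afterwards: (True, True). With run_scenario(90.0),
where both nodes observe the same grace period, client2 is refused:
(False, True).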

-- 
Jeff Layton <jlayton@redhat.com>

* Re: NFSv4 high availability setups
  2012-04-10 13:13     ` Jeff Layton
@ 2012-04-10 18:14       ` Michael Schwartzkopff
  2012-04-17 14:34       ` Lukas Hejtmanek
  1 sibling, 0 replies; 10+ messages in thread
From: Michael Schwartzkopff @ 2012-04-10 18:14 UTC (permalink / raw)
  To: linux-nfs; +Cc: Lukas Hejtmanek

> On Tue, 10 Apr 2012 14:55:52 +0200
> 
(...)
> > > > Can we put this state dir on a shared volume so that this state dir
> > > > is common for all the front-ends serving the same content? Is is
> > > > supposed to work and NFSv4 can merge its state with existing state
> > > > on a shared disk?
> > > 
> > > Not properly, no. nfsd expects to have complete control over that
> > > directory. There's no locking or merging of the data there. A node will
> > > also clean that directory out in some cases, and that will throw your
> > > state tracking off.
> > 
> > Thank you for information.
> > 
> > Is there any (preferably simple) way to demonstrate that this does not
> > work properly? E.g., if I share the same export through two or more
> > NFSv4 front-ends that share the v4recovery directory, do I trigger
> > problems with this tool
> > http://nfsv4.bullopensource.org/tools/tests/locktest.php?
> 
> Nope. It'll all work just great...until it doesn't. I don't have any
> specific failure scenarios, but most of the problems will be issues
> with state recovery when a server node is restarted.
> 
> That may manifest in different ways -- problems reclaiming locks for
> instance, or even silent data corruption depending on the application.
> 
> For instance, a node might hand out a lock and the client release it,
> after a server node reboots but before a client that really "owns" it
> reclaims it. Depending on the application, that may cause serious
> problems.

Hi,

I don't think an active/active NFS server is possible with
/var/lib/nfs/v4recovery on shared media. I think you will get into major
trouble if two or more nodes access that directory at the same time.

On the other hand, an active/passive setup is quite easy. There are some HOWTOs
on the internet. I like the one from LINBIT most:
http://www.linbit.com/de/training/tech-guides/highly-available-nfs-with-drbd-and-pacemaker/

The guide provides a basic path to follow. You have to tune it according to 
which distribution you use. Not all distributions have the necessary features.

See: leasetime, grace time, ...
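
As a starting point for checking those values, here is a small Python sketch
that reads the lease and grace times from the nfsd filesystem. It assumes the
filesystem is mounted at /proc/fs/nfsd and exposes the files nfsv4leasetime
and nfsv4gracetime; the function name and the returned dictionary keys are
made up here, and your kernel or distribution may differ.

```python
from pathlib import Path
from typing import Optional


def read_nfsd_timers(base: Path = Path("/proc/fs/nfsd")) -> dict:
    """Return lease and grace times in seconds, or None where unreadable."""
    timers: dict = {}
    for key, filename in (("lease", "nfsv4leasetime"),
                          ("grace", "nfsv4gracetime")):
        value: Optional[int]
        try:
            value = int((base / filename).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not a plain integer.
            value = None
        timers[key] = value
    return timers


if __name__ == "__main__":
    print(read_nfsd_timers())
```

On a machine where the nfsd filesystem is not mounted, both values simply come
back as None.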

Greetings,
-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München

Tel: (0163) 172 50 98
Fax: (089) 620 304 13

* Re: NFSv4 high availability setups
  2012-04-10 13:13     ` Jeff Layton
  2012-04-10 18:14       ` Michael Schwartzkopff
@ 2012-04-17 14:34       ` Lukas Hejtmanek
  2012-04-17 15:14         ` Jeff Layton
  1 sibling, 1 reply; 10+ messages in thread
From: Lukas Hejtmanek @ 2012-04-17 14:34 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, jiri.horky

Hi,

On Tue, Apr 10, 2012 at 09:13:21AM -0400, Jeff Layton wrote:
> Nope. It'll all work just great...until it doesn't. I don't have any
> specific failure scenarios, but most of the problems will be issues
> with state recovery when a server node is restarted.
> 
> That may manifest in different ways -- problems reclaiming locks for
> instance, or even silent data corruption depending on the application.

Would it work if I relaxed the active-active scenario to just active-passive in
the following way:

Server A actively exports  /export/A
Server B actively exports  /export/B

Server B is passive backup for Server A
Server A is passive backup for Server B

Would it work to migrate the failed Server B to Server A so that Server A
serves both /export/A and /export/B?

There will be a problem with the v4recovery dir. Would it be possible just to
merge v4recovery from Server B into Server A (the NFS export would be stopped
while merging v4recovery)?

It seems that cp -r B/v4recovery/* A/v4recovery/ would do everything needed. Am
I right?

Do I need to copy the recovery state if I delay migration of the failed Server B
to Server A for 91 secs, i.e., longer than the lease expiry time? Or do I still
need a record for the client in the v4recovery dir in such a case?

-- 
Lukáš Hejtmánek

* Re: NFSv4 high availability setups
  2012-04-17 14:34       ` Lukas Hejtmanek
@ 2012-04-17 15:14         ` Jeff Layton
  2012-04-24 14:01           ` Jeff Layton
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2012-04-17 15:14 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: linux-nfs, jiri.horky

On Tue, 17 Apr 2012 16:34:48 +0200
Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:

> Hi,
> 
> On Tue, Apr 10, 2012 at 09:13:21AM -0400, Jeff Layton wrote:
> > Nope. It'll all work just great...until it doesn't. I don't have any
> > specific failure scenarios, but most of the problems will be issues
> > with state recovery when a server node is restarted.
> > 
> > That may manifest in different ways -- problems reclaiming locks for
> > instance, or even silent data corruption depending on the application.
> 
> would it work if I relax active-active scenario to just active-passive in the
> following way:
> 
> Server A actively exports  /export/A
> Server B actively exports  /export/B
> 
> Server B is passive backup for Server A
> Server A is passive backup for Server B
> 
> would it work to migrate the failed Server B to Server A so that Server A will
> server both /export/A and /export/B?
> 
> There will be a problem with v4recovery dir. Would it be possible just to
> merge v4recovery from Server B to Server A (nfs export would be stopped while
> merging v4recovery).
> 
> It seems that cp -r B/v4recovery/* A/v4recovery/ would do all the things. Am
> I right?
> 
> Do I need to copy recovery state if I delay migration of the failed Server B to
> Server A for 91 secs? I.e., longer than lease expiry time.. Or do I still need
> a record for the client in v4recovery dir in such a case?
> 

That'll still be dangerous. Suppose (for instance) that client1 lost
communication with server B for a period of time, and server B then expired
client1's lease and handed out to client2 a lock that client1 had been
holding. client2 modifies the file and drops the lock. At the same
time, client1 has uninterrupted communication with server A, and holds
state on it.

Eventually, you fail over server B and merge the directories. client1
attempts to renew its lease, but gets back an error and starts
reclaiming things. Now, server B would have denied reclaim of that lock
-- client1's lease had expired -- but in this case it's allowed because you
merged the directory and client1 held state on server A. client1
reclaims the lock and thinks that it's held the lock the entire time --
data corruption and other hilarity ensues...
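
The timeline can be sketched in a few lines of Python. This is only a toy
model: ToyServer, the 90-second lease, and the idea that the recovery records
are simply "the set of clients with a live lease" are assumptions for
illustration, not a description of nfsd's on-disk format.

```python
from dataclasses import dataclass, field

LEASE_SECONDS = 90.0  # assumed lease time for this sketch


@dataclass
class ToyServer:
    """Toy server that remembers when each client last renewed its lease."""
    name: str
    last_renewal: dict = field(default_factory=dict)

    def renew(self, client: str, now: float) -> None:
        self.last_renewal[client] = now

    def expire_leases(self, now: float) -> None:
        # Drop every client whose lease has run out.
        for client, seen in list(self.last_renewal.items()):
            if now - seen > LEASE_SECONDS:
                del self.last_renewal[client]

    def recovery_records(self) -> set:
        return set(self.last_renewal)


def demo_merge() -> tuple:
    server_a, server_b = ToyServer("A"), ToyServer("B")
    server_a.renew("client1", now=0.0)
    server_b.renew("client1", now=0.0)
    # client1 loses contact with B but keeps renewing with A.
    server_a.renew("client1", now=60.0)
    server_a.renew("client1", now=120.0)
    server_b.renew("client2", now=100.0)
    server_a.expire_leases(now=120.0)
    server_b.expire_leases(now=120.0)  # B drops client1, A does not
    b_only = server_b.recovery_records()
    merged = server_a.recovery_records() | b_only  # the "cp -r" step
    return "client1" in b_only, "client1" in merged
```

demo_merge() returns (False, True): on its own, server B would refuse
client1's reclaim, but the merged records make client1 look like a client that
is entitled to reclaim.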

-- 
Jeff Layton <jlayton@redhat.com>

* Re: NFSv4 high availability setups
  2012-04-17 15:14         ` Jeff Layton
@ 2012-04-24 14:01           ` Jeff Layton
  2012-04-24 14:28             ` Chuck Lever
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2012-04-24 14:01 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: linux-nfs, jiri.horky

On Tue, 17 Apr 2012 11:14:11 -0400
Jeff Layton <jlayton@redhat.com> wrote:

> On Tue, 17 Apr 2012 16:34:48 +0200
> Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:
> 
> > Hi,
> > 
> > On Tue, Apr 10, 2012 at 09:13:21AM -0400, Jeff Layton wrote:
> > > Nope. It'll all work just great...until it doesn't. I don't have any
> > > specific failure scenarios, but most of the problems will be issues
> > > with state recovery when a server node is restarted.
> > > 
> > > That may manifest in different ways -- problems reclaiming locks for
> > > instance, or even silent data corruption depending on the application.
> > 
> > would it work if I relax active-active scenario to just active-passive in the
> > following way:
> > 
> > Server A actively exports  /export/A
> > Server B actively exports  /export/B
> > 
> > Server B is passive backup for Server A
> > Server A is passive backup for Server B
> > 
> > would it work to migrate the failed Server B to Server A so that Server A will
> > server both /export/A and /export/B?
> > 
> > There will be a problem with v4recovery dir. Would it be possible just to
> > merge v4recovery from Server B to Server A (nfs export would be stopped while
> > merging v4recovery).
> > 
> > It seems that cp -r B/v4recovery/* A/v4recovery/ would do all the things. Am
> > I right?
> > 
> > Do I need to copy recovery state if I delay migration of the failed Server B to
> > Server A for 91 secs? I.e., longer than lease expiry time.. Or do I still need
> > a record for the client in v4recovery dir in such a case?
> > 
> 
> That'll still be dangerous. Suppose (for instance) that a client1 lost
> communication with server B for a period of time and then it expired
> the lease and handed out a lock to client2 that it was holding
> previously. client2 modifies the file and drops the lock. At the same
> time, client1 has uninterrupted communication with serverA, and holds
> state on it.
> 
> Eventually, you fail over server B and merge the directories. client1
> attempts to renew its lease, but gets back an error and starts
> reclaiming things. Now, server B would have denied reclaim of that lock
> -- its lease had expired, but in this case it's allowed because you
> merged the directory and it client1 held state on serverA. client1
> reclaims the lock and thinks that it's held the lock the entire time --
> data corruption and other hilarity ensues...
> 

Now that I've had some time to think about this, you may actually be OK
to just merge those directories when you fail over. The caveat is that
you need to know for certain that the clients are using non-uniform
clientid strings when they talk to the server.

When a client makes a SETCLIENTID call to the server, it sends an opaque
identifier string to the server. Traditionally (and I think per a
SHOULD in the RFC) Linux clients have varied that string based on the IP
address of the server. That's called the non-UCS (uniform client string)
based model.

There is some debate on this practice though, as it makes it difficult
to identify clients for recovery purposes in migration scenarios (Dave
Noveck has a paper on this). In order to facilitate that, we're
considering moving to a UCS based model in the Linux client.

The upshot here is that if you do it that way, then a client that holds
state on both server addresses will look like two different clients even
after the service floats to the backup server. In that case, you'd have
no problems with reclaim (in principle, of course!).

The catch here is that if any clients have a UCS based model for
generating client strings (where the client string is invariant vs. the
server's IP address), then you'll be subject to the scenario above.
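
Here is a small Python sketch of the difference between the two models. The
string formats are invented for illustration (the real Linux client string
looks different), as are the function names and the 192.0.2.x addresses. The
records are modelled as plain sets of client strings.

```python
SERVER_A_IP = "192.0.2.10"  # documentation addresses, purely illustrative
SERVER_B_IP = "192.0.2.11"


def non_ucs_client_string(client_name: str, server_ip: str) -> str:
    # Non-UCS: the string varies with the server address being contacted.
    return f"{client_name}/{server_ip}"


def ucs_client_string(client_name: str, server_ip: str) -> str:
    # UCS: the same string is presented to every server address.
    return client_name


def stale_reclaim_allowed(make_string) -> bool:
    """client1's lease on B expired, so only A still has a record of it."""
    records_a = {make_string("client1", SERVER_A_IP)}
    records_b = {make_string("client2", SERVER_B_IP)}
    merged = records_a | records_b  # directories merged at failover
    # After failover client1 tries to reclaim on B's (floated) address.
    return make_string("client1", SERVER_B_IP) in merged
```

stale_reclaim_allowed(non_ucs_client_string) is False, because client1's
record from A does not match the string it presents to B's address.
stale_reclaim_allowed(ucs_client_string) is True, which is the problem
scenario.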

Still, merging those directories is enough of an uncharted territory
that I'd advise against it even if it would theoretically work.

-- 
Jeff Layton <jlayton@redhat.com>

* Re: NFSv4 high availability setups
  2012-04-24 14:01           ` Jeff Layton
@ 2012-04-24 14:28             ` Chuck Lever
  2012-04-24 15:19               ` Jeff Layton
  0 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever @ 2012-04-24 14:28 UTC (permalink / raw)
  To: Jeff Layton, Lukas Hejtmanek; +Cc: Linux NFS Mailing List, jiri.horky


On Apr 24, 2012, at 10:01 AM, Jeff Layton wrote:

> On Tue, 17 Apr 2012 11:14:11 -0400
> Jeff Layton <jlayton@redhat.com> wrote:
> 
>> On Tue, 17 Apr 2012 16:34:48 +0200
>> Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:
>> 
>>> Hi,
>>> 
>>> On Tue, Apr 10, 2012 at 09:13:21AM -0400, Jeff Layton wrote:
>>>> Nope. It'll all work just great...until it doesn't. I don't have any
>>>> specific failure scenarios, but most of the problems will be issues
>>>> with state recovery when a server node is restarted.
>>>> 
>>>> That may manifest in different ways -- problems reclaiming locks for
>>>> instance, or even silent data corruption depending on the application.
>>> 
>>> would it work if I relax active-active scenario to just active-passive in the
>>> following way:
>>> 
>>> Server A actively exports  /export/A
>>> Server B actively exports  /export/B
>>> 
>>> Server B is passive backup for Server A
>>> Server A is passive backup for Server B
>>> 
>>> would it work to migrate the failed Server B to Server A so that Server A will
>>> server both /export/A and /export/B?
>>> 
>>> There will be a problem with v4recovery dir. Would it be possible just to
>>> merge v4recovery from Server B to Server A (nfs export would be stopped while
>>> merging v4recovery).
>>> 
>>> It seems that cp -r B/v4recovery/* A/v4recovery/ would do all the things. Am
>>> I right?
>>> 
>>> Do I need to copy recovery state if I delay migration of the failed Server B to
>>> Server A for 91 secs? I.e., longer than lease expiry time.. Or do I still need
>>> a record for the client in v4recovery dir in such a case?
>>> 
>> 
>> That'll still be dangerous. Suppose (for instance) that a client1 lost
>> communication with server B for a period of time and then it expired
>> the lease and handed out a lock to client2 that it was holding
>> previously. client2 modifies the file and drops the lock. At the same
>> time, client1 has uninterrupted communication with serverA, and holds
>> state on it.
>> 
>> Eventually, you fail over server B and merge the directories. client1
>> attempts to renew its lease, but gets back an error and starts
>> reclaiming things. Now, server B would have denied reclaim of that lock
>> -- its lease had expired, but in this case it's allowed because you
>> merged the directory and it client1 held state on serverA. client1
>> reclaims the lock and thinks that it's held the lock the entire time --
>> data corruption and other hilarity ensues...
>> 
> 
> Now that I've had some time to think about this, you may actually be OK
> to just merge those directories when you fail over. The caveat is that
> you need to know for certain that the clients are using non-uniform
> clientid strings when they talk to the server.

The nfs_client_id4 string is supposed to be entirely opaque to servers.  A server can only compare these for equality.  It's simply not valid for a server to "make certain the client is using non-uniform clientid strings."

In fact, NFSv4.1 clients are supposed to use only UCS client strings, so any server implementation that depends on non-UCS is going to be broken for NFSv4.1.  IMO a server implementation should never depend on clients using non-UCS v. UCS.

> When a client makes a SETCLIENTID call to the server, it sends an opaque
> identifier string to the server. Traditionally (and I think per a
> SHOULD in the RFC) Linux clients have varied that string based on the IP
> address of the server. That's called the non-UCS (uniform client string)
> based model.

We've demonstrated that RFC 3530's recommendation to use IP addresses in a client's ID string is mistaken.  The problem this was designed to solve (that servers would mistakenly purge leases if a client identifies itself the same way on multiple server IP addresses) cannot occur, thanks to the SETCLIENTID boot verifier.
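
A much-simplified Python sketch of the boot verifier's role follows. It leaves out SETCLIENTID_CONFIRM, principals, and callbacks entirely, and ToyClientTable and its return strings are hypothetical names used only for illustration. The point is that the same ID string with the same verifier is recognized as the same client instance, so its state is kept; only a changed verifier is treated as a client reboot.

```python
class ToyClientTable:
    """Toy server-side table keyed by the opaque client ID string."""

    def __init__(self) -> None:
        self._clients = {}  # id string -> (boot verifier, list of state)

    def setclientid(self, id_string: bytes, verifier: bytes) -> str:
        known = self._clients.get(id_string)
        if known is None:
            self._clients[id_string] = (verifier, [])
            return "new client"
        if known[0] == verifier:
            # Same client boot instance: existing state is kept.
            return "same boot instance"
        # Verifier changed: the client rebooted, so old state is discarded.
        self._clients[id_string] = (verifier, [])
        return "client rebooted"

    def add_state(self, id_string: bytes, item: str) -> None:
        self._clients[id_string][1].append(item)

    def state(self, id_string: bytes) -> list:
        return list(self._clients[id_string][1])
```

In this sketch, a second setclientid(b"client1", b"boot-1") arriving through another server address returns "same boot instance" and leaves the recorded state alone; only a call with a different verifier, such as b"boot-2", clears it.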

Aside from that, the intent of RFC 3530 is that a client should have a single lease on each server.  If either a server or client is multi-homed, using IP addresses in the client ID strings means a client can have more than one lease on a server.  That makes transparent state migration challenging, but it's also a scaling issue because it means servers and clients have to manage much more state information.

> There is some debate on this practice though, as it makes it difficult
> to identify clients for recovery purposes in migration scenarios (Dave
> Novak has a paper on this). In order to facilitate that, we're
> considering moving to a UCS based model in the linux client.

Noveck's migration draft is being accepted as a working group draft, so one could say the debate is officially drawing to consensus.

> The upshot here is that if you do it that way, then a client that holds
> state on both server addresses will look like two different clients even
> after the service floats to the backup server. In that case, you'd have
> no problems with reclaim (in principle, of course!).

A better approach to clustering is to virtualize each NFS service.  The network addresses and filesystem hierarchy (and possibly NFSv4 state as well) on each virtual server move between physical hosts, but are never merged with each other.  Then there is no possibility of confusion.
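
As a conceptual Python sketch only (VirtualService, fail_over, the host names, and the addresses are all made up, and nothing here corresponds to an existing implementation): each virtual service carries its own address, export, and recovery records, and a failover moves the whole service to another host without touching any other service's records.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualService:
    """One virtual NFS service with its own, never-merged recovery records."""
    name: str
    address: str
    export: str
    recovery_records: set = field(default_factory=set)

    def may_reclaim(self, client_string: str) -> bool:
        # Only this service's own records are consulted.
        return client_string in self.recovery_records


def fail_over(hosts: dict, failed: str, target: str) -> None:
    # Move every service from the failed host to the target host as-is.
    hosts[target].extend(hosts.pop(failed))


def demo_virtual_services() -> tuple:
    svc_a = VirtualService("svc-a", "192.0.2.10", "/export/A", {"client1"})
    svc_b = VirtualService("svc-b", "192.0.2.11", "/export/B", {"client2"})
    hosts = {"hostA": [svc_a], "hostB": [svc_b]}
    fail_over(hosts, failed="hostB", target="hostA")
    names = [svc.name for svc in hosts["hostA"]]
    return names, svc_b.may_reclaim("client1")
```

demo_virtual_services() returns (["svc-a", "svc-b"], False): hostA now runs both services, yet client1 still cannot reclaim against svc-b, because svc-a's records were never mixed in.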

> The catch here is that if any clients have a UCS based model for
> generating client strings (where the client string is invariant vs. the
> server's IP address), then you'll be subject to the scenario above.
> 
> Still, merging those directories is enough of an uncharted territory
> that I'd advise against it even if it would theoretically work.

Just don't depend on the contents of the client strings.

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





* Re: NFSv4 high availability setups
  2012-04-24 14:28             ` Chuck Lever
@ 2012-04-24 15:19               ` Jeff Layton
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff Layton @ 2012-04-24 15:19 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Lukas Hejtmanek, Linux NFS Mailing List, jiri.horky

On Tue, 24 Apr 2012 10:28:00 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

> 
> On Apr 24, 2012, at 10:01 AM, Jeff Layton wrote:
> 
> > On Tue, 17 Apr 2012 11:14:11 -0400
> > Jeff Layton <jlayton@redhat.com> wrote:
> > 
> >> On Tue, 17 Apr 2012 16:34:48 +0200
> >> Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:
> >> 
> >>> Hi,
> >>> 
> >>> On Tue, Apr 10, 2012 at 09:13:21AM -0400, Jeff Layton wrote:
> >>>> Nope. It'll all work just great...until it doesn't. I don't have any
> >>>> specific failure scenarios, but most of the problems will be issues
> >>>> with state recovery when a server node is restarted.
> >>>> 
> >>>> That may manifest in different ways -- problems reclaiming locks for
> >>>> instance, or even silent data corruption depending on the application.
> >>> 
> >>> would it work if I relax active-active scenario to just active-passive in the
> >>> following way:
> >>> 
> >>> Server A actively exports  /export/A
> >>> Server B actively exports  /export/B
> >>> 
> >>> Server B is passive backup for Server A
> >>> Server A is passive backup for Server B
> >>> 
> >>> would it work to migrate the failed Server B to Server A so that Server A will
> >>> server both /export/A and /export/B?
> >>> 
> >>> There will be a problem with v4recovery dir. Would it be possible just to
> >>> merge v4recovery from Server B to Server A (nfs export would be stopped while
> >>> merging v4recovery).
> >>> 
> >>> It seems that cp -r B/v4recovery/* A/v4recovery/ would do all the things. Am
> >>> I right?
> >>> 
> >>> Do I need to copy recovery state if I delay migration of the failed Server B to
> >>> Server A for 91 secs? I.e., longer than lease expiry time.. Or do I still need
> >>> a record for the client in v4recovery dir in such a case?
> >>> 
> >> 
> >> That'll still be dangerous. Suppose (for instance) that a client1 lost
> >> communication with server B for a period of time and then it expired
> >> the lease and handed out a lock to client2 that it was holding
> >> previously. client2 modifies the file and drops the lock. At the same
> >> time, client1 has uninterrupted communication with serverA, and holds
> >> state on it.
> >> 
> >> Eventually, you fail over server B and merge the directories. client1
> >> attempts to renew its lease, but gets back an error and starts
> >> reclaiming things. Now, server B would have denied reclaim of that lock
> >> -- its lease had expired, but in this case it's allowed because you
> >> merged the directory and it client1 held state on serverA. client1
> >> reclaims the lock and thinks that it's held the lock the entire time --
> >> data corruption and other hilarity ensues...
> >> 
> > 
> > Now that I've had some time to think about this, you may actually be OK
> > to just merge those directories when you fail over. The caveat is that
> > you need to know for certain that the clients are using non-uniform
> > clientid strings when they talk to the server.
> 
> The nfs_client_id4 string is supposed to be entirely opaque to servers.  A server can only compare these for equality.  It's simply not valid for a server to "make certain the client is using non-uniform clientid strings."
> 
> In fact, NFSv4.1 clients are supposed to use only UCS client strings, so any server implementation that depends on non-UCS is going to be broken for NFSv4.1.  IMO a server implementation should never depend on clients using non-UCS v. UCS.
> 

Right, I wasn't suggesting that we or they add any code that checked
that. You'd just have to know beforehand that the clients were non-UCS
and ensure that didn't change in a later kernel or anything.

> > When a client makes a SETCLIENTID call to the server, it sends an opaque
> > identifier string to the server. Traditionally (and I think per a
> > SHOULD in the RFC) Linux clients have varied that string based on the IP
> > address of the server. That's called the non-UCS (uniform client string)
> > based model.
> 
> We've demonstrated that RFC 3530's recommendation to use IP addresses in a client's ID string is mistaken.  The problem this was designed to solve (that servers would mistakenly purge leases if a client identifies itself the same way on multiple server IP addresses) cannot occur, thanks to the SETCLIENTID boot verifier.
> 
> Aside from that, the intent of RFC 3530 is that a client should have a single lease on each server.  If either a server or client is multi-homed, using IP addresses in the client ID strings means a client can have more than one lease on a server.  That makes transparent state migration challenging, but it's also a scaling issue because it means servers and clients have to manage much more state information.
> 
> > There is some debate on this practice though, as it makes it difficult
> > to identify clients for recovery purposes in migration scenarios (Dave
> > Novak has a paper on this). In order to facilitate that, we're
> > considering moving to a UCS based model in the linux client.
> 
> Noveck's migration draft is being accepted as a working group draft, so one could say the debate is officially drawing to consensus.
> 
> > The upshot here is that if you do it that way, then a client that holds
> > state on both server addresses will look like two different clients even
> > after the service floats to the backup server. In that case, you'd have
> > no problems with reclaim (in principle, of course!).
> 
> A better approach to clustering is to virtualize each NFS service.  The network addresses and filesystem hierarchy (and possibly NFSv4 state as well) on each virtual server move between physical hosts, but are never merged with each other.  Then there is no possibility of confusion.
> 

That's also a work-in-progress and won't really be feasible for some
time.

> > The catch here is that if any clients have a UCS based model for
> > generating client strings (where the client string is invariant vs. the
> > server's IP address), then you'll be subject to the scenario above.
> > 
> > Still, merging those directories is enough of an uncharted territory
> > that I'd advise against it even if it would theoretically work.
> 
> Just don't depend on the contents of the client strings.
> 

Agreed. I just wanted to point out that the problem scenario I outlined
is actually contingent on the clients using a UCS model. They should
take into account that although the Linux client today uses a non-UCS
model, that may change in the future and that change could be quite
problematic for their use-case.

-- 
Jeff Layton <jlayton@redhat.com>
