* NFSv4 high availability setups
From: Lukas Hejtmanek @ 2012-04-05 10:31 UTC
To: linux-nfs

Hi,

we have several front-ends for shared storage. We want to build an HA setup
so that a failed front-end fails over to another front-end (one that is
already serving NFSv4).

As I understand it, NFSv4 uses a state dir somewhere in
/var/lib/nfs/rpc_pipefs. Can we put this state dir on a shared volume so that
it is common to all the front-ends serving the same content? Is it supposed
to work, and can NFSv4 merge its state with existing state on a shared disk?

--
Lukáš Hejtmánek
* Re: NFSv4 high availability setups
From: Jeff Layton @ 2012-04-05 11:39 UTC
To: Lukas Hejtmanek; +Cc: linux-nfs

On Thu, 5 Apr 2012 12:31:24 +0200
Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:

> Hi,
>
> we have several front-ends for shared storage. We want to build an HA setup
> so that a failed front-end fails over to another front-end (one that is
> already serving NFSv4).
>
> As I understand it, NFSv4 uses a state dir somewhere in
> /var/lib/nfs/rpc_pipefs.
>

You're probably thinking of /var/lib/nfs/v4recovery.

> Can we put this state dir on a shared volume so that it is common to all
> the front-ends serving the same content? Is it supposed to work, and can
> NFSv4 merge its state with existing state on a shared disk?
>

Not properly, no. nfsd expects to have complete control over that
directory. There's no locking or merging of the data there. A node will
also clean that directory out in some cases, and that will throw your
state tracking off.

The 3.4 kernel just got an overhaul of this code to use an upcall instead.
At this point I'm waiting on Steve to merge the userspace portion of that.
The legacy client tracking code will probably never be cluster-aware.

This is actually a very complex problem to solve, as you need to
coordinate the grace periods between the different cluster nodes doing
the serving. I've been looking at this problem for the last few months,
and am still working out a design that would allow active/active NFSv4
serving.

For now, I'd advise against trying it since it won't work properly. If
you want to follow along with the gory details of the design, I've been
sporadically doing blog posts about it here:

http://jtlayton.wordpress.com/

--
Jeff Layton <jlayton@redhat.com>
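The toy Python model below makes Jeff's point about a shared v4recovery directory concrete. It is not nfsd code. It simplifies the cleanup Jeff mentions by assuming that a node ending its grace period deletes the record of every client that did not reclaim on that node. The `Node` class and its methods are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One NFSv4 front-end; `recdir` stands in for /var/lib/nfs/v4recovery."""
    name: str
    recdir: set[str]                      # shared between nodes in the unsafe setup
    reclaimed: set[str] = field(default_factory=set)

    def record_client(self, client_id: str) -> None:
        # A confirmed client gets a record so it may reclaim after a restart.
        self.recdir.add(client_id)

    def reclaim(self, client_id: str) -> bool:
        # During grace, only clients that have a record may reclaim state.
        if client_id in self.recdir:
            self.reclaimed.add(client_id)
            return True
        return False

    def end_grace(self) -> None:
        # Assumed cleanup: forget every client that did not reclaim on THIS node.
        for client_id in list(self.recdir):
            if client_id not in self.reclaimed:
                self.recdir.discard(client_id)


shared: set[str] = set()                  # one directory on shared storage
node_a = Node("A", shared)
node_b = Node("B", shared)

node_a.record_client("client1")           # client1 mounts through A
node_b.record_client("client2")           # client2 mounts through B

# B restarts: only client2 reclaims on B, then B's grace period ends.
node_b.reclaim("client2")
node_b.end_grace()

print(sorted(shared))                     # -> ['client2']
print(node_a.reclaim("client1"))          # -> False: client1 lost its record
```

The record for client1 is removed by a node that client1 never talked to. Without locking between the nodes, nothing prevents this.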
* Re: NFSv4 high availability setups
From: Lukas Hejtmanek @ 2012-04-10 12:55 UTC
To: Jeff Layton; +Cc: linux-nfs, jiri.horky

Hi Jeff,

On Thu, Apr 05, 2012 at 07:39:01AM -0400, Jeff Layton wrote:
> > As I understand it, NFSv4 uses a state dir somewhere in
> > /var/lib/nfs/rpc_pipefs.
> >
>
> You're probably thinking of /var/lib/nfs/v4recovery.

yes, sorry for the confusion.

> > Can we put this state dir on a shared volume so that it is common to all
> > the front-ends serving the same content? Is it supposed to work, and can
> > NFSv4 merge its state with existing state on a shared disk?
> >
>
> Not properly, no. nfsd expects to have complete control over that
> directory. There's no locking or merging of the data there. A node will
> also clean that directory out in some cases, and that will throw your
> state tracking off.

Thank you for the information.

Is there any (preferably simple) way to demonstrate that this does not work
properly? E.g., if I share the same export through two or more NFSv4
front-ends that share the v4recovery directory, will I trigger problems with
this tool?
http://nfsv4.bullopensource.org/tools/tests/locktest.php

--
Lukáš Hejtmánek
* Re: NFSv4 high availability setups
From: Jeff Layton @ 2012-04-10 13:13 UTC
To: Lukas Hejtmanek; +Cc: linux-nfs, jiri.horky

On Tue, 10 Apr 2012 14:55:52 +0200
Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:

> Hi Jeff,
>
> On Thu, Apr 05, 2012 at 07:39:01AM -0400, Jeff Layton wrote:
[...]
> Thank you for the information.
>
> Is there any (preferably simple) way to demonstrate that this does not work
> properly? E.g., if I share the same export through two or more NFSv4
> front-ends that share the v4recovery directory, will I trigger problems with
> this tool?
> http://nfsv4.bullopensource.org/tools/tests/locktest.php
>

Nope. It'll all work just great...until it doesn't. I don't have any
specific failure scenarios, but most of the problems will be issues
with state recovery when a server node is restarted.

That may manifest in different ways -- problems reclaiming locks, for
instance, or even silent data corruption, depending on the application.

For instance, after a server node reboots but before the client that
really "owns" a lock reclaims it, another node might hand that lock out
to a different client, which then releases it. Depending on the
application, that may cause serious problems.

--
Jeff Layton <jlayton@redhat.com>
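The lock race Jeff describes can be sketched the same way. The model assumes the front-ends sit on a cluster filesystem whose lock table every node sees (the hypothetical `Cluster` class). Each node still runs its own, independent grace period. The error names follow the NFSv4 status codes; everything else is illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field

OK, GRACE, NO_GRACE, DENIED = "NFS4_OK", "NFS4ERR_GRACE", "NFS4ERR_NO_GRACE", "NFS4ERR_DENIED"


@dataclass
class Cluster:
    locks: dict[str, tuple[str, str]] = field(default_factory=dict)  # path -> (client, granting node)
    writes: list[tuple[str, str]] = field(default_factory=list)      # (client, path) history


@dataclass
class Node:
    name: str
    cluster: Cluster
    in_grace: bool = False

    def lock(self, client: str, path: str, reclaim: bool = False) -> str:
        if self.in_grace and not reclaim:
            return GRACE                  # new locks must wait until grace ends
        if reclaim and not self.in_grace:
            return NO_GRACE               # reclaims are only valid during grace
        holder = self.cluster.locks.get(path)
        if holder is not None and holder[0] != client:
            return DENIED
        self.cluster.locks[path] = (client, self.name)
        return OK

    def unlock(self, client: str, path: str) -> None:
        if self.cluster.locks.get(path, (None, None))[0] == client:
            del self.cluster.locks[path]

    def crash_and_restart(self) -> None:
        # Locks granted by this node are lost; it comes back in its grace period.
        for path in [p for p, (_, node) in self.cluster.locks.items() if node == self.name]:
            del self.cluster.locks[path]
        self.in_grace = True


cluster = Cluster()
node_a, node_b = Node("A", cluster), Node("B", cluster)
path = "/export/data"

print(node_a.lock("client1", path))                # -> NFS4_OK
node_a.crash_and_restart()                         # client1 has not reclaimed yet
print(node_b.lock("client2", path))                # -> NFS4_OK: B is not in grace
cluster.writes.append(("client2", path))           # client2 modifies the file...
node_b.unlock("client2", path)                     # ...and drops the lock
print(node_a.lock("client1", path, reclaim=True))  # -> NFS4_OK: reclaim accepted
```

Each node enforces its own grace period correctly, yet client1 ends up believing it held the lock throughout while client2 wrote in between. That is why Jeff says the grace periods must be coordinated across nodes.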
* Re: NFSv4 high availability setups
From: Michael Schwartzkopff @ 2012-04-10 18:14 UTC
To: linux-nfs; +Cc: Lukas Hejtmanek

> On Tue, 10 Apr 2012 14:55:52 +0200
> (...)
> > > > Can we put this state dir on a shared volume so that it is common
> > > > to all the front-ends serving the same content? Is it supposed to
> > > > work, and can NFSv4 merge its state with existing state on a
> > > > shared disk?
> > >
> > > Not properly, no. nfsd expects to have complete control over that
> > > directory. There's no locking or merging of the data there. A node will
> > > also clean that directory out in some cases, and that will throw your
> > > state tracking off.
> (...)
> Nope. It'll all work just great...until it doesn't. I don't have any
> specific failure scenarios, but most of the problems will be issues
> with state recovery when a server node is restarted.
> (...)

Hi,

I don't think an active/active NFS server is possible with
/var/lib/nfs/v4recovery on shared media. I think you will get into major
trouble if two or more nodes access that directory at the same time.

On the other hand, an active/passive setup is quite easy. There are some
HOWTOs on the internet. I like the one from LINBIT best:

http://www.linbit.com/de/training/tech-guides/highly-available-nfs-with-drbd-and-pacemaker/

The guide provides a basic path to follow. You have to tune it according to
the distribution you use; not all distributions have the necessary features.
See: lease time, grace time, ...

Greetings,

--
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München

Tel: (0163) 172 50 98
Fax: (089) 620 304 13
* Re: NFSv4 high availability setups
From: Lukas Hejtmanek @ 2012-04-17 14:34 UTC
To: Jeff Layton; +Cc: linux-nfs, jiri.horky

Hi,

On Tue, Apr 10, 2012 at 09:13:21AM -0400, Jeff Layton wrote:
> Nope. It'll all work just great...until it doesn't. I don't have any
> specific failure scenarios, but most of the problems will be issues
> with state recovery when a server node is restarted.
>
> That may manifest in different ways -- problems reclaiming locks, for
> instance, or even silent data corruption, depending on the application.

Would it work if I relax the active-active scenario to just active-passive
in the following way:

Server A actively exports /export/A
Server B actively exports /export/B

Server B is the passive backup for Server A
Server A is the passive backup for Server B

Would it work to migrate the failed Server B to Server A, so that Server A
serves both /export/A and /export/B?

There will be a problem with the v4recovery dir. Would it be possible just to
merge v4recovery from Server B into Server A's (the NFS export would be
stopped while merging v4recovery)?

It seems that cp -r B/v4recovery/* A/v4recovery/ would do everything. Am
I right?

Do I need to copy the recovery state if I delay migration of the failed
Server B to Server A for 91 secs, i.e., longer than the lease expiry time?
Or do I still need a record for the client in the v4recovery dir in such a
case?

--
Lukáš Hejtmánek
* Re: NFSv4 high availability setups
From: Jeff Layton @ 2012-04-17 15:14 UTC
To: Lukas Hejtmanek; +Cc: linux-nfs, jiri.horky

On Tue, 17 Apr 2012 16:34:48 +0200
Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:

> Hi,
>
> On Tue, Apr 10, 2012 at 09:13:21AM -0400, Jeff Layton wrote:
[...]
> Would it work if I relax the active-active scenario to just active-passive
> in the following way:
>
> Server A actively exports /export/A
> Server B actively exports /export/B
>
> Server B is the passive backup for Server A
> Server A is the passive backup for Server B
>
> Would it work to migrate the failed Server B to Server A, so that Server A
> serves both /export/A and /export/B?
>
> There will be a problem with the v4recovery dir. Would it be possible just to
> merge v4recovery from Server B into Server A's (the NFS export would be
> stopped while merging v4recovery)?
>
> It seems that cp -r B/v4recovery/* A/v4recovery/ would do everything. Am
> I right?
>
> Do I need to copy the recovery state if I delay migration of the failed
> Server B to Server A for 91 secs, i.e., longer than the lease expiry time?
> Or do I still need a record for the client in the v4recovery dir in such a
> case?
>

That'll still be dangerous. Suppose (for instance) that client1 lost
communication with server B for a period of time, and server B then
expired client1's lease and handed a lock that client1 had been holding
to client2. client2 modifies the file and drops the lock. At the same
time, client1 has uninterrupted communication with server A, and holds
state on it.

Eventually, you fail over server B and merge the directories. client1
attempts to renew its lease, but gets back an error and starts
reclaiming things. Now, server B would have denied reclaim of that lock
-- its lease had expired -- but in this case it's allowed because you
merged the directories and client1 held state on server A. client1
reclaims the lock and thinks that it's held the lock the entire time --
data corruption and other hilarity ensues...

--
Jeff Layton <jlayton@redhat.com>
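Jeff's counter-example to the `cp -r` merge comes down to a set union followed by a membership check. The helper names in this sketch are hypothetical. It assumes client1 presents the same client ID string to both servers, which is the assumption the rest of the thread examines.

```python
from __future__ import annotations


def merge_recdirs(target: set[str], source: set[str]) -> set[str]:
    """Model of `cp -r B/v4recovery/* A/v4recovery/`: the result is a plain union."""
    return target | source


def may_reclaim(recdir: set[str], client_id: str) -> bool:
    """During grace, a reclaim is allowed only if the client has a record."""
    return client_id in recdir


# Server B expired client1's lease (dropping its record) and gave the lock to client2.
recdir_b = {"client2"}
# client1 kept renewing with server A, so A still has a record for it.
recdir_a = {"client1"}

merged = merge_recdirs(recdir_a, recdir_b)

print(may_reclaim(recdir_b, "client1"))   # -> False: B on its own would refuse
print(may_reclaim(merged, "client1"))     # -> True: the merged directory lets it through
```

Waiting longer than the lease time before failing over does not change this outcome, because client1's record comes from server A, not from server B.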
* Re: NFSv4 high availability setups
From: Jeff Layton @ 2012-04-24 14:01 UTC
To: Lukas Hejtmanek; +Cc: linux-nfs, jiri.horky

On Tue, 17 Apr 2012 11:14:11 -0400
Jeff Layton <jlayton@redhat.com> wrote:

> On Tue, 17 Apr 2012 16:34:48 +0200
> Lukas Hejtmanek <xhejtman@ics.muni.cz> wrote:
>
[...]
> > It seems that cp -r B/v4recovery/* A/v4recovery/ would do everything. Am
> > I right?
[...]
>
> That'll still be dangerous. Suppose (for instance) that client1 lost
> communication with server B for a period of time, and server B then
> expired client1's lease and handed a lock that client1 had been holding
> to client2. client2 modifies the file and drops the lock. At the same
> time, client1 has uninterrupted communication with server A, and holds
> state on it.
>
> Eventually, you fail over server B and merge the directories. client1
> attempts to renew its lease, but gets back an error and starts
> reclaiming things. Now, server B would have denied reclaim of that lock
> -- its lease had expired -- but in this case it's allowed because you
> merged the directories and client1 held state on server A. client1
> reclaims the lock and thinks that it's held the lock the entire time --
> data corruption and other hilarity ensues...
>

Now that I've had some time to think about this, you may actually be OK
to just merge those directories when you fail over. The caveat is that
you need to know for certain that the clients are using non-uniform
clientid strings when they talk to the server.

When a client makes a SETCLIENTID call to the server, it sends an opaque
identifier string to the server. Traditionally (and I think per a
SHOULD in the RFC) Linux clients have varied that string based on the IP
address of the server. That's called the non-UCS (uniform client string)
based model.

There is some debate on this practice though, as it makes it difficult
to identify clients for recovery purposes in migration scenarios (Dave
Noveck has a paper on this). In order to facilitate that, we're
considering moving to a UCS based model in the Linux client.

The upshot here is that if you do it that way, then a client that holds
state on both server addresses will look like two different clients even
after the service floats to the backup server. In that case, you'd have
no problems with reclaim (in principle, of course!).

The catch here is that if any clients have a UCS based model for
generating client strings (where the client string is invariant vs. the
server's IP address), then you'll be subject to the scenario above.

Still, merging those directories is enough of an uncharted territory
that I'd advise against it even if it would theoretically work.

--
Jeff Layton <jlayton@redhat.com>
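A small Python sketch shows the difference between the two client-string models. `client_string` is a hypothetical stand-in, not the Linux client's actual format. In the non-UCS case it appends the server address; in the UCS case it leaves the string alone. The scenario is the one from Jeff's earlier message, and the addresses come from the RFC 5737 documentation range.

```python
from __future__ import annotations

SERVER_A, SERVER_B = "192.0.2.1", "192.0.2.2"   # example addresses only


def client_string(client: str, server_ip: str, uniform: bool) -> str:
    """Hypothetical ID-string builder.

    UCS: the same string no matter which server address is contacted.
    non-UCS: the server address is folded into the string.
    """
    return client if uniform else f"{client}/{server_ip}"


def reclaim_after_merge(uniform: bool) -> bool:
    # client1 holds state on A; its lease on B expired, so B kept only client2's record.
    recdir_a = {client_string("client1", SERVER_A, uniform)}
    recdir_b = {client_string("client2", SERVER_B, uniform)}
    merged = recdir_a | recdir_b          # the `cp -r` merge onto server A
    # B's address now lives on A; client1 reclaims with the string it uses for B's address.
    return client_string("client1", SERVER_B, uniform) in merged


print(reclaim_after_merge(uniform=False))  # -> False: refused, as B itself would have done
print(reclaim_after_merge(uniform=True))   # -> True: the unsafe reclaim from Jeff's scenario
```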
* Re: NFSv4 high availability setups
From: Chuck Lever @ 2012-04-24 14:28 UTC
To: Jeff Layton, Lukas Hejtmanek; +Cc: Linux NFS Mailing List, jiri.horky

On Apr 24, 2012, at 10:01 AM, Jeff Layton wrote:

> On Tue, 17 Apr 2012 11:14:11 -0400
> Jeff Layton <jlayton@redhat.com> wrote:
>
[...]
>
> Now that I've had some time to think about this, you may actually be OK
> to just merge those directories when you fail over. The caveat is that
> you need to know for certain that the clients are using non-uniform
> clientid strings when they talk to the server.

The nfs_client_id4 string is supposed to be entirely opaque to servers.
A server can only compare these strings for equality. It's simply not
valid for a server to "make certain the client is using non-uniform
clientid strings."

In fact, NFSv4.1 clients are supposed to use only UCS client strings, so
any server implementation that depends on non-UCS is going to be broken
for NFSv4.1. IMO a server implementation should never depend on clients
using non-UCS vs. UCS.

> When a client makes a SETCLIENTID call to the server, it sends an opaque
> identifier string to the server. Traditionally (and I think per a
> SHOULD in the RFC) Linux clients have varied that string based on the IP
> address of the server. That's called the non-UCS (uniform client string)
> based model.

We've demonstrated that RFC 3530's recommendation to use IP addresses in
a client's ID string is mistaken. The problem this was designed to solve
(that servers would mistakenly purge leases if a client identifies itself
the same way on multiple server IP addresses) cannot occur, thanks to the
SETCLIENTID boot verifier.

Aside from that, the intent of RFC 3530 is that a client should have a
single lease on each server. If either a server or client is multi-homed,
using IP addresses in the client ID strings means a client can have more
than one lease on a server. That makes transparent state migration
challenging, but it's also a scaling issue, because it means servers and
clients have to manage much more state information.

> There is some debate on this practice though, as it makes it difficult
> to identify clients for recovery purposes in migration scenarios (Dave
> Noveck has a paper on this). In order to facilitate that, we're
> considering moving to a UCS based model in the Linux client.

Noveck's migration draft is being accepted as a working group draft, so
one could say the debate is officially drawing to a consensus.

> The upshot here is that if you do it that way, then a client that holds
> state on both server addresses will look like two different clients even
> after the service floats to the backup server. In that case, you'd have
> no problems with reclaim (in principle, of course!).

A better approach to clustering is to virtualize each NFS service. The
network addresses and filesystem hierarchy (and possibly NFSv4 state as
well) of each virtual server move between physical hosts, but are never
merged with each other. Then there is no possibility of confusion.

> The catch here is that if any clients have a UCS based model for
> generating client strings (where the client string is invariant vs. the
> server's IP address), then you'll be subject to the scenario above.
>
> Still, merging those directories is enough of an uncharted territory
> that I'd advise against it even if it would theoretically work.

Just don't depend on the contents of the client strings.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
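Chuck's point about the SETCLIENTID boot verifier can be shown with a deliberately simplified server model. The model keys clients by their ID string and compares verifiers only. It leaves out principals, callbacks and SETCLIENTID_CONFIRM. In a real RFC 3530 server, the old state is released as part of confirmation rather than immediately. `Server` and `ClientRecord` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ClientRecord:
    verifier: bytes
    state_count: int = 0        # stand-in for the opens/locks held under the lease


class Server:
    """Simplified SETCLIENTID handling keyed by the opaque client ID string."""

    def __init__(self) -> None:
        self.clients: dict[str, ClientRecord] = {}

    def setclientid(self, id_string: str, verifier: bytes) -> str:
        rec = self.clients.get(id_string)
        if rec is None:
            self.clients[id_string] = ClientRecord(verifier)
            return "new client"
        if rec.verifier == verifier:
            # Same boot instance (e.g. via another server address): keep the state.
            return "same client instance: state kept"
        # Same string, new verifier: the client rebooted, so its old state is dropped.
        self.clients[id_string] = ClientRecord(verifier)
        return "client rebooted: old state purged"


srv = Server()
srv.setclientid("client1", b"boot-0001")          # via the server's first address
srv.clients["client1"].state_count = 3            # client1 opens and locks files
print(srv.setclientid("client1", b"boot-0001"))   # via a second address, same boot
print(srv.clients["client1"].state_count)         # -> 3
print(srv.setclientid("client1", b"boot-0002"))   # after the client really reboots
print(srv.clients["client1"].state_count)         # -> 0
```

A client that uses one uniform string on several addresses of a multi-homed server keeps its lease. Only a changed verifier, meaning a real client reboot, discards state.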
* Re: NFSv4 high availability setups
From: Jeff Layton @ 2012-04-24 15:19 UTC
To: Chuck Lever; +Cc: Lukas Hejtmanek, Linux NFS Mailing List, jiri.horky

On Tue, 24 Apr 2012 10:28:00 -0400
Chuck Lever <chuck.lever@oracle.com> wrote:

>
> On Apr 24, 2012, at 10:01 AM, Jeff Layton wrote:
>
[...]
> > Now that I've had some time to think about this, you may actually be OK
> > to just merge those directories when you fail over. The caveat is that
> > you need to know for certain that the clients are using non-uniform
> > clientid strings when they talk to the server.
>
> The nfs_client_id4 string is supposed to be entirely opaque to servers.
> A server can only compare these strings for equality. It's simply not
> valid for a server to "make certain the client is using non-uniform
> clientid strings."
>
> In fact, NFSv4.1 clients are supposed to use only UCS client strings, so
> any server implementation that depends on non-UCS is going to be broken
> for NFSv4.1. IMO a server implementation should never depend on clients
> using non-UCS vs. UCS.
>

Right, I wasn't suggesting that we or they add any code that checked
that. You'd just have to know beforehand that the clients were non-UCS
and ensure that didn't change in a later kernel or anything.

[...]
> A better approach to clustering is to virtualize each NFS service. The
> network addresses and filesystem hierarchy (and possibly NFSv4 state as
> well) of each virtual server move between physical hosts, but are never
> merged with each other. Then there is no possibility of confusion.
>

That's also a work in progress and won't really be feasible for some
time.

> > The catch here is that if any clients have a UCS based model for
> > generating client strings (where the client string is invariant vs. the
> > server's IP address), then you'll be subject to the scenario above.
> >
> > Still, merging those directories is enough of an uncharted territory
> > that I'd advise against it even if it would theoretically work.
>
> Just don't depend on the contents of the client strings.
>

Agreed. I just wanted to point out that the problem scenario I outlined
is actually contingent on the clients using a UCS model. They should take
into account that although the Linux client today uses a non-UCS model,
that may change in the future, and that change could be quite problematic
for their use case.

--
Jeff Layton <jlayton@redhat.com>
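Chuck's alternative, virtualizing each NFS service, can be pictured as moving a whole bundle (address, exports, recovery records) between hosts instead of merging anything. This is a conceptual sketch with hypothetical names only. As Jeff notes, the real feature was still a work in progress at the time.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VirtualNFSServer:
    """One floating NFS service: address, exports and recovery records travel together."""
    name: str
    address: str
    exports: list[str]
    recovery_records: set[str] = field(default_factory=set)


@dataclass
class Host:
    name: str
    services: list[VirtualNFSServer] = field(default_factory=list)


def fail_over(service_name: str, src: Host, dst: Host) -> VirtualNFSServer:
    """Move a whole virtual server to another host; nothing is merged."""
    svc = next(s for s in src.services if s.name == service_name)
    src.services.remove(svc)
    dst.services.append(svc)
    return svc


vs_a = VirtualNFSServer("nfs-a", "192.0.2.1", ["/export/A"], {"client1"})
vs_b = VirtualNFSServer("nfs-b", "192.0.2.2", ["/export/B"], {"client2"})
host1 = Host("host1", [vs_a])
host2 = Host("host2", [vs_b])

fail_over("nfs-b", host2, host1)           # host2 dies; host1 now runs both services
print([s.name for s in host1.services])    # -> ['nfs-a', 'nfs-b']
print("client1" in vs_b.recovery_records)  # -> False: records stay per service
```

A reclaim arriving at nfs-b's address is checked only against nfs-b's own records. Jeff's merge scenario therefore cannot arise, whichever client-string model the clients use.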
end of thread, other threads: [~2012-04-24 15:26 UTC | newest]

Thread overview: 10+ messages
2012-04-05 10:31 NFSv4 high availability setups Lukas Hejtmanek
2012-04-05 11:39 ` Jeff Layton
2012-04-10 12:55   ` Lukas Hejtmanek
2012-04-10 13:13     ` Jeff Layton
2012-04-10 18:14       ` Michael Schwartzkopff
2012-04-17 14:34       ` Lukas Hejtmanek
2012-04-17 15:14         ` Jeff Layton
2012-04-24 14:01           ` Jeff Layton
2012-04-24 14:28             ` Chuck Lever
2012-04-24 15:19               ` Jeff Layton