* multiple instances of rpc.statd
From: Bernd Schubert @ 2008-04-25 13:31 UTC
To: linux-nfs
Hello,
on servers with heartbeat-managed resources, one quite often exports
different directories from different resources.
All resources may happen to run on one host, but they can also run on
different hosts. The situation gets even more complicated if the server
is also an NFS client.
In principle, having different NFS resources works fine; only the statd state
directory is a problem. Or rather the statd concept as a whole. We would
actually need several instances of statd running, each using a different
directory. These would then have to be migrated from one server to the
other when a resource moves.
However, as far as I understand it, not even the basic concept for this
exists, does it?
Thanks,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH
* Re: multiple instances of rpc.statd
From: Wendy Cheng @ 2008-04-25 13:47 UTC
To: Bernd Schubert; +Cc: linux-nfs

Bernd Schubert wrote:
> Hello,
>
> on servers with heartbeat-managed resources, one quite often exports
> different directories from different resources.
>
> All resources may happen to run on one host, but they can also run on
> different hosts. The situation gets even more complicated if the server
> is also an NFS client.
>
> In principle, having different NFS resources works fine; only the statd
> state directory is a problem. Or rather the statd concept as a whole. We
> would actually need several instances of statd running, each using a
> different directory. These would then have to be migrated from one server
> to the other when a resource moves.
> However, as far as I understand it, not even the basic concept for this
> exists, does it?

Efforts have been made to remedy this issue, and a complete set of patches has been submitted repeatedly over the past two years. Patch acceptance has been very slow (I guess people just don't want to be bothered with cluster issues?).

Anyway, the kernel side has the basic infrastructure to handle the problem (it stores the incoming client's IP address as part of its bookkeeping record) - just a little tweak will do the job. However, the user-side statd directory needs to be restructured. I didn't publish the user-side directory structure script during my last round of submissions. Forking statd into multiple threads does not solve all the issues. Check out:
https://www.redhat.com/archives/cluster-devel/2007-April/msg00028.html

-- Wendy
* Re: multiple instances of rpc.statd
From: Bernd Schubert @ 2008-04-25 14:30 UTC
To: Wendy Cheng; +Cc: linux-nfs

Hello Wendy.

On Friday 25 April 2008 15:47:03 Wendy Cheng wrote:
> Efforts have been made to remedy this issue, and a complete set of
> patches has been submitted repeatedly over the past two years. Patch
> acceptance has been very slow (I guess people just don't want to be
> bothered with cluster issues?).

Well, I think people are just ignorant. I did see your discussions about NLM on the NFS mailing list in the past, but I didn't actually understand the entire point of the discussion ;) I was simply used to active-passive services (mostly due to heartbeat-1.x), and there we just had /var/lib/nfs linked to the exported directory.

After I started to work here, I was confronted with the fact that we do have working active-active clusters here, but nobody besides me ever cared about the locking problem :( NFS failovers are just done ignoring file locks. So far nobody seems to have run into a problem, but maybe the result was so obscure that nobody ever bothered to complain...
I'm just afraid most admins will simply do it like this...

> Anyway, the kernel side has the basic infrastructure to handle the
> problem (it stores the incoming client's IP address as part of its
> bookkeeping record) - just a little tweak will do the job. However,
> the user-side statd directory needs to be restructured. I didn't
> publish the user-side directory structure script during my last round of
> submissions. Forking statd into multiple threads does not solve all the
> issues. Check out:
> https://www.redhat.com/archives/cluster-devel/2007-April/msg00028.html

Thanks, I will read this!

Thanks again,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
* Re: multiple instances of rpc.statd
From: Wendy Cheng @ 2008-04-25 15:39 UTC
To: Bernd Schubert; +Cc: linux-nfs

Bernd Schubert wrote:
> Well, I think people are just ignorant. I did see your discussions about
> NLM on the NFS mailing list in the past, but I didn't actually understand
> the entire point of the discussion ;) I was simply used to active-passive
> services (mostly due to heartbeat-1.x), and there we just had /var/lib/nfs
> linked to the exported directory.
>
> After I started to work here, I was confronted with the fact that we do
> have working active-active clusters here, but nobody besides me ever cared
> about the locking problem :( NFS failovers are just done ignoring file
> locks. So far nobody seems to have run into a problem, but maybe the
> result was so obscure that nobody ever bothered to complain...
> I'm just afraid most admins will simply do it like this...

That's an accurate observation :) .. people are just ignorant until they get bitten by the problem. Then they blurt out nasty words about Linux servers and go for proprietary solutions.

There is an amazing number of workarounds and funny setups to bypass various Linux problems. Admins normally don't care about the details; they just know that if they do certain "tricks", things work. Last week I was looking at a performance issue, trying to find out why clustered mail servers ran miserably slow. As a person who doesn't know much about mail servers, I was surprised to learn that it is common practice for Linux email servers to be configured to grab an flock, followed by a POSIX lock, and then write a lock file whenever a "write" occurs - all three are used concurrently to protect one single file (?). It was a very interesting conversation.

-- Wendy
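To make the triple-locking pattern described above concrete, here is a minimal Python sketch, assuming a Linux host. The function name and the ".lock" suffix are illustrative assumptions, not taken from any particular mail server. It takes a BSD flock, then a POSIX (fcntl) lock, then creates a lock file, all to guard a single append.

```python
import fcntl
import os


def append_with_three_locks(mailbox_path: str, data: bytes) -> None:
    """Append data while holding a flock, a POSIX lock, and a lock file."""
    lockfile_path = mailbox_path + ".lock"  # hypothetical dot-lock naming
    with open(mailbox_path, "ab") as mbox:
        fcntl.flock(mbox, fcntl.LOCK_EX)  # 1. BSD flock on the whole file
        try:
            fcntl.lockf(mbox, fcntl.LOCK_EX)  # 2. POSIX (fcntl) lock, whole file
            try:
                # 3. Lock file: O_EXCL makes creation fail if it already exists.
                fd = os.open(lockfile_path,
                             os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
                try:
                    mbox.write(data)
                    mbox.flush()
                finally:
                    os.close(fd)
                    os.unlink(lockfile_path)
            finally:
                fcntl.lockf(mbox, fcntl.LOCK_UN)
        finally:
            fcntl.flock(mbox, fcntl.LOCK_UN)
```

Each write pays for three lock acquisitions and three releases. On an NFSv2/v3 mount the POSIX lock alone already involves a round trip through the lock manager, which is one plausible reason such setups run slowly on clustered storage.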
* Re: multiple instances of rpc.statd
From: J. Bruce Fields @ 2008-04-25 22:07 UTC
To: Wendy Cheng; +Cc: Bernd Schubert, linux-nfs

On Fri, Apr 25, 2008 at 09:47:03AM -0400, Wendy Cheng wrote:
> Efforts have been made to remedy this issue, and a complete set of
> patches has been submitted repeatedly over the past two years. Patch
> acceptance has been very slow (I guess people just don't want to be
> bothered with cluster issues?).

We definitely want to get this all figured out....

> Anyway, the kernel side has the basic infrastructure to handle the
> problem (it stores the incoming client's IP address as part of its
> bookkeeping record) - just a little tweak will do the job. However,
> the user-side statd directory needs to be restructured. I didn't
> publish the user-side directory structure script during my last round of
> submissions. Forking statd into multiple threads does not solve all the
> issues. Check out:
> https://www.redhat.com/archives/cluster-devel/2007-April/msg00028.html

So for basic v2/v3 failover, what remains is some statd -H scripts, and some form of grace period control? Is there anything else we're missing?

--b.
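For readers unfamiliar with the "statd -H scripts" mentioned above: rpc.statd's -H (--ha-callout) option names a program that statd runs when it starts or stops monitoring a client. According to the rpc.statd man page, the program receives three arguments: add-client or del-client, the client name, and the server name as known to the client. The following is a minimal Python sketch of such a callout. The per-server directory layout and the names STATE_ROOT, handle_callout and clients_for are assumptions made for illustration; this is not the directory structure from the submitted patches.

```python
import os

# Hypothetical location for per-server client lists (an assumption).
STATE_ROOT = "/var/lib/nfs/ha-statd"


def handle_callout(state_root: str, action: str, client: str, server: str) -> None:
    """Record or forget a monitored client under the server name it used."""
    per_server = os.path.join(state_root, server)
    entry = os.path.join(per_server, client)
    if action == "add-client":
        os.makedirs(per_server, exist_ok=True)
        with open(entry, "w"):
            pass  # an empty marker file is enough for this sketch
    elif action == "del-client":
        try:
            os.remove(entry)
        except FileNotFoundError:
            pass  # already gone; nothing to do
    else:
        raise ValueError(f"unexpected action: {action!r}")


def clients_for(state_root: str, server: str) -> list:
    """List the clients recorded for one server name (e.g. one floating IP)."""
    try:
        return sorted(os.listdir(os.path.join(state_root, server)))
    except FileNotFoundError:
        return []


def main(argv: list) -> None:
    # statd invokes the callout as: prog <add-client|del-client> <client> <server>
    handle_callout(STATE_ROOT, argv[1], argv[2], argv[3])
```

Keeping one directory per server name means the client list for a single resource can be moved or replayed on another node when that resource fails over, without touching the lists that belong to other resources.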
* Re: multiple instances of rpc.statd
From: Wendy Cheng @ 2008-04-28 3:59 UTC
To: J. Bruce Fields; +Cc: linux-nfs

J. Bruce Fields wrote:
> So for basic v2/v3 failover, what remains is some statd -H scripts, and
> some form of grace period control? Is there anything else we're
> missing?

The submitted patch set is reasonably complete ... .

There was another thought about the statd patches though, mostly because of concerns over statd's responsiveness: it depends so much on network status and the clients' participation. I was hoping NFS v4 would catch up by the time the v2/v3 grace period patches got accepted into the mainline kernel. Ideally the v2/v3 lock reclaiming logic could use the communication channel established by v4 servers (or at least a similar implementation), that is:

1. Enable the grace period on the secondary server, as in the previously submitted patches.
2. Drop the locks on the primary server (and chain the dropped locks into a lock-list).
3. Send the lock-list via the v4 communication channel (or a similar implementation) from the primary server to the backup server.
4. Reclaim the locks on the backup server, based on the lock-list.

In short, it would be nice to replace the existing statd lock reclaiming logic with the above steps, if at all possible, during active-active failover. Reboot, on the other hand, should stay the same as today's statd logic, without changes.

-- Wendy
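The four-step hand-off proposed above can be illustrated with a toy Python model. Everything here is hypothetical: ToyServer, fail_over and the in-memory lock table stand in for kernel state, and the "channel" is a plain list copy. The sketch only shows the ordering of the steps, not how lockd or statd would implement them.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """In-memory stand-in for an NFS server's per-export lock table."""
    name: str
    in_grace: bool = False
    locks: dict = field(default_factory=dict)  # export path -> list of locks

    def drop_locks(self, export: str) -> list:
        """Step 2: release the export's locks and return them as a lock-list."""
        return self.locks.pop(export, [])

    def reclaim(self, export: str, lock_list: list) -> None:
        """Step 4: re-establish locks; only allowed during the grace period."""
        if not self.in_grace:
            raise RuntimeError("reclaim attempted outside the grace period")
        self.locks.setdefault(export, []).extend(lock_list)


def fail_over(primary: ToyServer, backup: ToyServer, export: str) -> int:
    backup.in_grace = True                  # step 1: grace period on the backup
    lock_list = primary.drop_locks(export)  # step 2: drop and chain the locks
    transferred = list(lock_list)           # step 3: stand-in for the channel
    backup.reclaim(export, transferred)     # step 4: server-side reclaim
    backup.in_grace = False                 # a real server would use a timer
    return len(transferred)
```

Enabling the grace period first matters in this model: it keeps new, conflicting lock requests from being granted on the backup before the transferred locks are back in place.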
* Re: multiple instances of rpc.statd
From: J. Bruce Fields @ 2008-04-28 18:26 UTC
To: Wendy Cheng; +Cc: linux-nfs

On Sun, Apr 27, 2008 at 10:59:11PM -0500, Wendy Cheng wrote:
> The submitted patch set is reasonably complete ... .
>
> There was another thought about the statd patches though, mostly because
> of concerns over statd's responsiveness: it depends so much on network
> status and the clients' participation. I was hoping NFS v4 would catch up
> by the time the v2/v3 grace period patches got accepted into the mainline
> kernel. Ideally the v2/v3 lock reclaiming logic could use the
> communication channel established by v4 servers (or at least a similar
> implementation), that is:
>
> 1. Enable the grace period on the secondary server, as in the previously
> submitted patches.
> 2. Drop the locks on the primary server (and chain the dropped locks into
> a lock-list).

What information exactly would be on that lock list?

> 3. Send the lock-list via the v4 communication channel (or a similar
> implementation) from the primary server to the backup server.
> 4. Reclaim the locks on the backup server, based on the lock-list.

So at this step it's the server itself reclaiming those locks, and you're talking about a completely transparent migration that doesn't look to the client like a reboot?

My feeling has been that that's best done after first making sure we can handle the case where the client reclaims the locks, since the latter is easier, and is likely to involve at least some of the same work. I could be wrong.

Exactly which data has to be transferred from the old server to the new? (Lock types, ranges, fh's, owners, and pid's, for established locks; do we also need to hand off blocking locks? Statd data still needs to be transferred. Ideally rpc reply caches. What else?)

> In short, it would be nice to replace the existing statd lock reclaiming
> logic with the above steps, if at all possible, during active-active
> failover. Reboot, on the other hand, should stay the same as today's
> statd logic, without changes.

--b.
* Re: multiple instances of rpc.statd
From: Wendy Cheng @ 2008-04-28 19:19 UTC
To: J. Bruce Fields; +Cc: linux-nfs

J. Bruce Fields wrote:
> On Sun, Apr 27, 2008 at 10:59:11PM -0500, Wendy Cheng wrote:
>> 1. Enable the grace period on the secondary server, as in the previously
>> submitted patches.
>> 2. Drop the locks on the primary server (and chain the dropped locks into
>> a lock-list).
>
> What information exactly would be on that lock list?

Can't believe I got myself into this ... I'm supposed to be a disk firmware person *now* .. Anyway,

Is the lock state finalized in v4 yet? Can we borrow the concepts (and saved lock states) from v4? We can certainly define the saved state useful for v3 independently of v4, say client IP, file path, lock range, lock type, and user id? I need to re-read the Linux source to make sure it is doable though.

>> 3. Send the lock-list via the v4 communication channel (or a similar
>> implementation) from the primary server to the backup server.
>> 4. Reclaim the locks on the backup server, based on the lock-list.
>
> So at this step it's the server itself reclaiming those locks, and
> you're talking about a completely transparent migration that doesn't
> look to the client like a reboot?

Yes, that's the idea .. I have not implemented any prototype code yet, so I'm not sure how feasible it would be.

> My feeling has been that that's best done after first making sure we can
> handle the case where the client reclaims the locks, since the latter is
> easier, and is likely to involve at least some of the same work. I
> could be wrong.

Makes sense .. so the steps taken may be:

1. Push the patch sets that we originally submitted. This is to make sure we have something working.
2. Prototype the new logic in parallel with v4 development, and observe and learn from the results of step 1 based on user feedback.
3. Integrate the new logic, if it turns out to be good.

> Exactly which data has to be transferred from the old server to the new?
> (Lock types, ranges, fh's, owners, and pid's, for established locks; do
> we also need to hand off blocking locks? Statd data still needs to be
> transferred. Ideally rpc reply caches. What else?)

All statd has is the client network addresses (which are already part of the current NLM state anyway). Yes, the rpc reply cache is important (and that's exactly the motivation for this thread of discussion). Eventually the rpc reply cache needs to get transferred. As long as the communication channel is established, there is no reason for the lock states not to take advantage of it.

>> In short, it would be nice to replace the existing statd lock reclaiming
>> logic with the above steps, if at all possible, during active-active
>> failover. Reboot, on the other hand, should stay the same as today's
>> statd logic, without changes.

As mentioned before, cluster issues are not trivial. Take one step at a time .. So the next task we should focus on may be the grace period patch. Will see what I can do to help out here.

-- Wendy
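As an illustration of the saved state floated in this message (client IP, file path, lock range, lock type, user id), here is a hedged Python sketch of one possible record and a JSON encoding for shipping a list of them between servers. The field names, the JSON wire format and the "length 0 means to end of file" convention (borrowed from POSIX byte-range locks) are assumptions; the thread does not define a format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SavedLock:
    """One entry of a hypothetical v3 lock-list."""
    client_ip: str
    file_path: str
    start: int      # first byte of the locked range
    length: int     # number of bytes; 0 is taken to mean "to end of file"
    lock_type: str  # "read" or "write"
    user_id: int


def encode_lock_list(locks: list) -> str:
    """Serialize a lock-list for transfer from the old server to the new."""
    return json.dumps([asdict(lock) for lock in locks])


def decode_lock_list(payload: str) -> list:
    """Rebuild the lock-list on the receiving server."""
    return [SavedLock(**item) for item in json.loads(payload)]
```

A record keyed by file path rather than file handle assumes both servers see the same paths for the exported file system; whether that holds, and whether owners and pids are also needed, is exactly what the thread leaves open.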
* Re: multiple instances of rpc.statd
From: J. Bruce Fields @ 2008-04-29 16:20 UTC
To: Wendy Cheng; +Cc: linux-nfs

On Mon, Apr 28, 2008 at 03:19:28PM -0400, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> What information exactly would be on that lock list?
>
> Can't believe I got myself into this ... I'm supposed to be a disk
> firmware person *now* .. Anyway,
>
> Is the lock state finalized in v4 yet?

You mean, have we figured out what to send across for a transparent migration? Somebody did a prototype that I think we set aside for a while, but I don't recall if it tried to handle truly transparent migration, or whether it just sent across the v4 equivalent of the statd data; I'll check.

--b.