Fixing NFS

* Fixing NFS
@ 2011-02-03 17:05 Brian Chrisman
  2011-02-03 17:29 ` Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Chrisman @ 2011-02-03 17:05 UTC
  To: ceph-devel

I've looked into the export.c code in the kernel client.
It looks like the primary issue may be incompleteness, as for
non-connected filehandles, the dentry lookup does not query the mds
but instead returns stalefh if it's not in the cache.
For connected filehandles, ceph_mdsc_* methods are called to look up dentries.

I understand there's not a lot of interest in re-exporting a ceph fs over NFS.
But if I were to go ahead and investigate the APIs and find how to
make that query for non-connected filehandles, would I be running into
any obvious roadblocks? (I'd consider a "roadblock" something like:
"there's no interface to make that lookup" or "you'll get
non-deterministic results")

thanks,
Brian


* Re: Fixing NFS
  2011-02-03 17:05 Fixing NFS Brian Chrisman
@ 2011-02-03 17:29 ` Sage Weil
  2011-02-03 17:52   ` Tommi Virtanen
  2011-02-03 23:09   ` Hard links (was: Fixing NFS) Chris Dunlop
  0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2011-02-03 17:29 UTC
  To: Brian Chrisman; +Cc: ceph-devel

On Thu, 3 Feb 2011, Brian Chrisman wrote:
> I've looked into the export.c code in the kernel client.
> It looks like the primary issue may be incompleteness, as for
> non-connected filehandles, the dentry lookup does not query the mds
> but instead returns stalefh if it's not in the cache.
> For connected filehandles, ceph_mdsc_* methods are called to look up dentries.
> 
> I understand there's not a lot of interest in re-exporting a ceph fs over NFS.
> But if I were to go ahead and investigate the APIs and find how to
> make that query for non-connected filehandles, would I be running into
> any obvious roadblocks? (I'd consider a "roadblock" something like:
> "there's no interface to make that lookup" or "you'll get
> non-deterministic results")

There are a couple of levels of difficulty.  The main problem is that the 
only truly stable information in the NFS fh is the inode number, and 
Ceph's architecture simply doesn't support lookup-by-ino.  (It uses an 
extra table to support it for hard-linked files, under the assumption that 
these are relatively rare in the real world.)  Using purely the ino, if we 
miss in the exporting client's icache, we can then try all MDSs.  If those 
all miss too, we're out of luck.
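
A minimal Python sketch of the fallback order described above, assuming each cache can be modelled as a plain dict keyed by inode number. The names `lookup_by_ino`, `client_icache`, `mds_caches` and `StaleFileHandle` are hypothetical and are not part of the Ceph client; the exception merely stands in for ESTALE.

```python
from typing import Dict, List


class StaleFileHandle(Exception):
    """Stands in for ESTALE in this toy model."""


def lookup_by_ino(ino: int,
                  client_icache: Dict[int, str],
                  mds_caches: List[Dict[int, str]]) -> str:
    """Resolve an inode number using only what happens to be cached."""
    # 1. The exporting client's own inode cache.
    if ino in client_icache:
        return client_icache[ino]
    # 2. Ask every MDS in turn; each one only knows inodes in its cache.
    for cache in mds_caches:
        if ino in cache:
            return cache[ino]
    # 3. Nobody has it cached, and there is no ino -> path index to consult.
    raise StaleFileHandle(hex(ino))
```

The point of the sketch is step 3: without a lookup-by-ino index there is nothing left to try once every cache has missed.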

To improve things somewhat, the fh includes as many ancestor inos as 
possible (and the connecting dentry hashes).  That lets us try to look up 
parents too, which are more likely to be cached.  That's what the 
LOOKUPHASH stuff is all about (although I confess I can't remember exactly 
what state that code is in, and it's not well tested).  
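
As a rough illustration of a filehandle that carries one ancestor, the Python sketch below packs an inode number, its parent's inode number and the hash of the connecting dentry name into a fixed-size blob. The field order, the little-endian `<QQI` layout and the names `ConnectedFh`, `encode_fh` and `decode_fh` are assumptions made for the example, not the layout export.c actually uses, and the numbers are arbitrary.

```python
import struct
from typing import NamedTuple

# Assumed layout: 64-bit ino, 64-bit parent ino, 32-bit dentry name hash.
FH_FORMAT = "<QQI"


class ConnectedFh(NamedTuple):
    ino: int
    parent_ino: int
    parent_name_hash: int


def encode_fh(fh: ConnectedFh) -> bytes:
    """Serialize the handle into the opaque bytes handed to the NFS client."""
    return struct.pack(FH_FORMAT, *fh)


def decode_fh(raw: bytes) -> ConnectedFh:
    """Recover the fields when the NFS client presents the handle again."""
    return ConnectedFh(*struct.unpack(FH_FORMAT, raw))


if __name__ == "__main__":
    fh = ConnectedFh(ino=0x1000, parent_ino=0x0FFF, parent_name_hash=0xABCDEF)
    raw = encode_fh(fh)
    print(len(raw), decode_fh(raw))
```

Every extra ancestor costs another ino and hash in the handle, so the size limit on the NFS fh bounds how far up the tree it can reach.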

Also, the situation for directories is a bit better: the directory object 
on disk has ancestor backpointers, so given a _directory_ inode we can, 
with some effort, always find it.  (This isn't implemented, but is 
doable.)

Which leaves us with a final problem: what if the fh is generated for 
/foo/bar, but bar is renamed to /baz/bar, bar drops out of all caches, and 
the client tries to use the fh.  We're still stuck with ESTALE in that 
case.  The only real solution there is to include a backpointer on the 
file's data object.  This is doable, but comes at a cost.  We could make 
it optional, and/or mitigate it somewhat (backpointer is only created once 
a file is renamed, or something like that).

I'm not really sure to what lengths a server is supposed to go to avoid 
ESTALE.  I seem to remember that NFSv4 has a different class of fh's that 
are allowed to expire.  I'm not sure how that helps, though; it seems 
like if a client has a file open that is renamed by another node and then 
idle for long enough and then tries to read it'll still be screwed, 
regardless of what the server does/does not promise the client.

sage


* Re: Fixing NFS
  2011-02-03 17:29 ` Sage Weil
@ 2011-02-03 17:52   ` Tommi Virtanen
  2011-02-08  1:51     ` Brian Chrisman
  2011-02-03 23:09   ` Hard links (was: Fixing NFS) Chris Dunlop
  1 sibling, 1 reply; 13+ messages in thread
From: Tommi Virtanen @ 2011-02-03 17:52 UTC
  To: Sage Weil; +Cc: Brian Chrisman, ceph-devel

On Thu, Feb 03, 2011 at 09:29:32AM -0800, Sage Weil wrote:
> Which leaves us with a final problem: what if the fh is generated for
> /foo/bar, but bar is renamed to /baz/bar, bar drops out of all caches, and
> the client tries to use the fh.  We're still stuck with ESTALE in that
> case.  The only real solution there is to include a backpointer on the
> file's data object.  This is doable, but comes at a cost.  We could make
> it optional, and/or mitigate it somewhat (backpointer is only created once
> a file is renamed, or something like that).
>
> I'm not really sure to what lengths a server is supposed to go to avoid 
> ESTALE.  I seem to remember that NFSv4 has a different class of fh's that 
> are allowed to expire.  I'm not sure how that helps, though; it seems 
> like if a client has a file open that is renamed by another node and then 
> idle for long enough and then tries to read it'll still be screwed, 
> regardless of what the server does/does not promise the client.

NFSv4 volatile filehandles move away from the whole "stale"
terminology into "expiring" filehandles, which a client SHOULD recover
from, and that's said with fairly strong language in RFC3530. The
volatile filehandles may go away at any moment (for FH4_VOLATILE_ANY).

The RFC suggests clients remember the full path of every volatile
filehandle, and points out that doesn't let you recover if someone
else renamed the file.. which means your "final problem" above is
still a problem, and smells unavoidable. But at least shifting
responsibility for remembering the path to the client makes recovery
easy in the typical case.
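
The recovery pattern can be sketched in Python as a toy model. `ToyServer`, `ToyClient` and `FhExpired` are hypothetical names, `FhExpired` stands in for NFS4ERR_FHEXPIRED, and the server here ties each handle to a path, which is an assumption made purely to keep the example short.

```python
from typing import Dict, Tuple


class FhExpired(Exception):
    """Stands in for NFS4ERR_FHEXPIRED."""


class ToyServer:
    def __init__(self, files: Dict[str, bytes]) -> None:
        self.files = dict(files)           # path -> contents
        self.handles: Dict[int, str] = {}  # volatile fh -> path
        self.next_fh = 1

    def lookup(self, path: str) -> int:
        if path not in self.files:
            raise FileNotFoundError(path)
        fh, self.next_fh = self.next_fh, self.next_fh + 1
        self.handles[fh] = path
        return fh

    def read(self, fh: int) -> bytes:
        path = self.handles.get(fh)
        if path is None or path not in self.files:
            raise FhExpired(fh)
        return self.files[path]

    def expire_all(self) -> None:
        """Model every volatile handle being dropped at once."""
        self.handles.clear()

    def rename(self, old: str, new: str) -> None:
        self.files[new] = self.files.pop(old)


class ToyClient:
    def __init__(self, server: ToyServer) -> None:
        self.server = server
        self.paths: Dict[int, str] = {}  # remembered path per handle

    def open(self, path: str) -> int:
        fh = self.server.lookup(path)
        self.paths[fh] = path
        return fh

    def read(self, fh: int) -> Tuple[bytes, int]:
        try:
            return self.server.read(fh), fh
        except FhExpired:
            # Recover by repeating the full path lookup. This raises
            # FileNotFoundError if someone else renamed the file.
            path = self.paths.pop(fh)
            new_fh = self.server.lookup(path)
            self.paths[new_fh] = path
            return self.server.read(new_fh), new_fh
```

In this model an expired handle for an unrenamed file is recovered transparently, while an expired handle for a renamed file fails on the repeated lookup, which is the unavoidable case mentioned above.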

If the real-world support is there, I'd say NFSv4 is the way to go,
for future Ceph re-exporting.

-- 
:(){ :|:&};:


* Hard links (was: Fixing NFS)
  2011-02-03 17:29 ` Sage Weil
  2011-02-03 17:52   ` Tommi Virtanen
@ 2011-02-03 23:09   ` Chris Dunlop
  2011-02-03 23:16     ` Sage Weil
  1 sibling, 1 reply; 13+ messages in thread
From: Chris Dunlop @ 2011-02-03 23:09 UTC
  To: ceph-devel

On 2011-02-03, Sage Weil <sage@newdream.net> wrote:
> There are a couple of levels of difficulty.  The main problem is that the 
> only truly stable information in the NFS fh is the inode number, and 
> Ceph's architecture simply doesn't support lookup-by-ino.  (It uses an 
> extra table to support it for hard-linked files, under the assumption that 
> these are relatively rare in the real world.)

Sorry for the thread hijack, but just so this issue doesn't completely
fall through the cracks...

There are different "real worlds" where hard links are very, very
common. Although, admittedly, ceph may well not be targeted at those
parallel universes.

Backup servers are a classic example. It's very common to have
hard-links between the files for each snapshot. In this situation *most*
files have multiple hard links, and you can easily have almost all files
with 60 or more hard links (for 60 or more snapshots). Rsnapshot,
BackupPC and apparently OSX's Time Machine for instance work this way.

Of course, apps like this would probably be far better off if they
started using proper snapshots (and dedup, if/when that becomes
available, praise the day) provided by file systems such as ceph and
btrfs.

But there are other apps which use significant numbers of hard links,
some examples buried in this thread:

http://thread.gmane.org/gmane.comp.file-systems.btrfs/3427

Cheers,

Chris


* Re: Hard links (was: Fixing NFS)
  2011-02-03 23:09   ` Hard links (was: Fixing NFS) Chris Dunlop
@ 2011-02-03 23:16     ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2011-02-03 23:16 UTC
  To: Chris Dunlop; +Cc: ceph-devel

On Thu, 3 Feb 2011, Chris Dunlop wrote:
> On 2011-02-03, Sage Weil <sage@newdream.net> wrote:
> > There are a couple of levels of difficulty.  The main problem is that the 
> > only truly stable information in the NFS fh is the inode number, and 
> > Ceph's architecture simply doesn't support lookup-by-ino.  (It uses an 
> > extra table to support it for hard-linked files, under the assumption that 
> > these are relatively rare in the real world.)
> 
> Sorry for the thread hijack, but just so this issue doesn't completely
> fall through the cracks...
> 
> There are different "real worlds" where hard links are very, very
> common. Although, admittedly, ceph may well not be targeted at those
> parallel universes.
> 
> Backup servers are a classic example. It's very common to have
> hard-links between the files for each snapshot. In this situation *most*
> files have multiple hard links, and you can easily have almost all files
> with 60 or more hard links (for 60 or more snapshots). Rsnapshot,
> BackupPC and apparently OSX's Time Machine for instance work this way.

Yep.  These apps will work.. mostly.  The current pain point will be the 
anchor table, which isn't built to scale currently (it's pretty trivial, 
everything in memory).  

Assuming that is addressed, though, these workloads won't be too bad.  The 
nice thing is that these links are usually "parallel" in that all the 
dentries in one dir are hard linked to the same targets in another dir.  
Although the MDS pays a cost looking up the anchor for the first name, the 
result is that it loads the second directory into cache, and the 
subsequent links are also resolved for "free."  The result is a lookup cost 
that is more like O(number of dirs) instead of O(number of files).  
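
The cost argument can be checked with a small Python model. It assumes an anchor lookup is needed only when the directory holding the link target is not yet cached, and that the lookup then brings the whole directory into cache; `count_anchor_lookups` and its arguments are hypothetical names.

```python
from typing import Dict, Iterable


def count_anchor_lookups(link_targets: Iterable[int],
                         ino_to_dir: Dict[int, str]) -> int:
    """Count anchor-table lookups needed to resolve a run of hard links."""
    cached_dirs = set()
    lookups = 0
    for ino in link_targets:
        target_dir = ino_to_dir[ino]
        if target_dir not in cached_dirs:
            # First link into this directory pays for the anchor lookup,
            # which loads the directory into the (modelled) MDS cache.
            lookups += 1
            cached_dirs.add(target_dir)
        # Later links into the same directory resolve from cache.
    return lookups


if __name__ == "__main__":
    # "Parallel" links: 1000 files all pointing into one snapshot dir.
    parallel = {ino: "snap.0" for ino in range(1000)}
    # Scattered links: every target lives in a different directory.
    scattered = {ino: f"dir.{ino}" for ino in range(1000)}
    print(count_anchor_lookups(range(1000), parallel))
    print(count_anchor_lookups(range(1000), scattered))
```

With parallel links the count tracks the number of distinct target directories; with scattered targets it degrades to one lookup per file.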

sage


* Re: Fixing NFS
  2011-02-03 17:52   ` Tommi Virtanen
@ 2011-02-08  1:51     ` Brian Chrisman
  2011-02-08  2:06       ` Gregory Farnum
  2011-02-08  3:33       ` Sage Weil
  0 siblings, 2 replies; 13+ messages in thread
From: Brian Chrisman @ 2011-02-08  1:51 UTC
  To: ceph-devel

On Thu, Feb 3, 2011 at 9:52 AM, Tommi Virtanen
<tommi.virtanen@dreamhost.com> wrote:
> On Thu, Feb 03, 2011 at 09:29:32AM -0800, Sage Weil wrote:
>> Which leaves us with a final problem: what if the fh is generated for
>> /foo/bar, but bar is renamed to /baz/bar, bar drops out of all caches, and
>> the client tries to use the fh.  We're still stuck with ESTALE in that
>> case.  The only real solution there is to include a backpointer on the
>> file's data object.  This is doable, but comes at a cost.  We could make
>> it optional, and/or mitigate it somewhat (backpointer is only created once
>> a file is renamed, or something like that).
>>
>> I'm not really sure to what lengths a server is supposed to go to avoid
>> ESTALE.  I seem to remember that NFSv4 has a different class of fh's that
>> are allowed to expire.  I'm not sure how that helps, though; it seems
>> like if a client has a file open that is renamed by another node and then
>> idle for long enough and then tries to read it'll still be screwed,
>> regardless of what the server does/does not promise the client.
>
> NFSv4 volatile filehandles move away from the whole "stale"
> terminology into "expiring" filehandles, which a client SHOULD recover
> from, and that's said with fairly strong language in RFC3530. The
> volatile filehandles may go away at any moment (for FH4_VOLATILE_ANY).
>
> The RFC suggests clients remember the full path of every volatile
> filehandle, and points out that doesn't let you recover if someone
> else renamed the file.. which means your "final problem" above is
> still a problem, and smells unavoidable. But at least shifting
> responsibility for remembering the path to the client makes recovery
> easy in the typical case.
>
> If the real-world support is there, I'd say NFSv4 is the way to go,
> for future Ceph re-exporting.


I was playing around with implementing this.  I was trying to get the
ceph client's export functions to return NFS4ERR_FHEXPIRED instead of
ESTALE (hoping that my nfs4 clients would then attempt the full lookup
again).  I noticed also that the mds itself can also return an ESTALE
to the ceph kernel client, which seems to be getting propagated back
to the NFS client.  I'm wondering where I could intercept that and
send back an expiry notice?

-Brian


>
> --
> :(){ :|:&};:
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Fixing NFS
  2011-02-08  1:51     ` Brian Chrisman
@ 2011-02-08  2:06       ` Gregory Farnum
  2011-02-08  3:35         ` Sage Weil
  2011-02-08  3:33       ` Sage Weil
  1 sibling, 1 reply; 13+ messages in thread
From: Gregory Farnum @ 2011-02-08  2:06 UTC
  To: Brian Chrisman; +Cc: ceph-devel

On Mon, Feb 7, 2011 at 5:51 PM, Brian Chrisman <brchrisman@gmail.com> wrote:
> I was playing around with implementing this.  I was trying to get the
> ceph client's export functions to return NFS4ERR_FHEXPIRED instead of
> ESTALE (hoping that my nfs4 clients would then attempt the full lookup
> again).  I noticed also that the mds itself can also return an ESTALE
> to the ceph kernel client, which seems to be getting propagated back
> to the NFS client.  I'm wondering where I could intercept that and
> send back an expiry notice?
I don't think you want to: the MDS only returns ESTALE to the client
if it can't find the inode, but that can happen normally if
responsibility for the inode has been moved to another MDS. The client
should be able to handle this circumstance and from a quick check
that's the only way you're going to get ESTALE back from the MDS
(unless something is horribly broken).
-Greg

* Re: Fixing NFS
  2011-02-08  1:51     ` Brian Chrisman
  2011-02-08  2:06       ` Gregory Farnum
@ 2011-02-08  3:33       ` Sage Weil
  2011-02-10 18:52         ` Brian Chrisman
  1 sibling, 1 reply; 13+ messages in thread
From: Sage Weil @ 2011-02-08  3:33 UTC
  To: Brian Chrisman; +Cc: ceph-devel

On Mon, 7 Feb 2011, Brian Chrisman wrote:
> On Thu, Feb 3, 2011 at 9:52 AM, Tommi Virtanen
> <tommi.virtanen@dreamhost.com> wrote:
> > On Thu, Feb 03, 2011 at 09:29:32AM -0800, Sage Weil wrote:
> >> Which leaves us with a final problem: what if the fh is generated for
> >> /foo/bar, but bar is renamed to /baz/bar, bar drops out of all caches, and
> >> the client tries to use the fh.  We're still stuck with ESTALE in that
> >> case.  The only real solution there is to include a backpointer on the
> >> file's data object.  This is doable, but comes at a cost.  We could make
> >> it optional, and/or mitigate it somewhat (backpointer is only created once
> >> a file is renamed, or something like that).
> >>
> >> I'm not really sure to what lengths a server is supposed to go to avoid
> >> ESTALE.  I seem to remember that NFSv4 has a different class of fh's that
> >> are allowed to expire.  I'm not sure how that helps, though; it seems
> >> like if a client has a file open that is renamed by another node and then
> >> idle for long enough and then tries to read it'll still be screwed,
> >> regardless of what the server does/does not promise the client.
> >
> > NFSv4 volatile filehandles move away from the whole "stale"
> > terminology into "expiring" filehandles, which a client SHOULD recover
> > from, and that's said with fairly strong language in RFC3530. The
> > volatile filehandles may go away at any moment (for FH4_VOLATILE_ANY).
> >
> > The RFC suggests clients remember the full path of every volatile
> > filehandle, and points out that doesn't let you recover if someone
> > else renamed the file.. which means your "final problem" above is
> > still a problem, and smells unavoidable. But at least shifting
> > responsibility for remembering the path to the client makes recovery
> > easy in the typical case.
> >
> > If the real-world support is there, I'd say NFSv4 is the way to go,
> > for future Ceph re-exporting.
> 
> 
> I was playing around with implementing this.  I was trying to get the
> ceph client's export functions to return NFS4ERR_FHEXPIRED instead of
> ESTALE (hoping that my nfs4 clients would then attempt the full lookup
> again).  I noticed also that the mds itself can also return an ESTALE
> to the ceph kernel client, which seems to be getting propagated back
> to the NFS client.  I'm wondering where I could intercept that and
> send back an expiry notice?

I believe the only place an actual MDS call is exposed to an NFS export is 
in export.c's __cfh_to_dentry().  This is where the ino search is going to 
need to get more sophisticated (at least on the client side).

An ESTALE from the MDS generally means the starting ino in the request 
isn't in the cache.  You can try all MDSs for one that has it.  Beyond 
that, we'll need to implement more smarts on the server side!

sage


* Re: Fixing NFS
  2011-02-08  2:06       ` Gregory Farnum
@ 2011-02-08  3:35         ` Sage Weil
       [not found]           ` <AANLkTik0jz_RXbrLz2m=+EvBwpBwesQ_Lr_iiCuQJHyF@mail.gmail.com>
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2011-02-08  3:35 UTC
  To: Gregory Farnum; +Cc: Brian Chrisman, ceph-devel

On Mon, 7 Feb 2011, Gregory Farnum wrote:
> On Mon, Feb 7, 2011 at 5:51 PM, Brian Chrisman <brchrisman@gmail.com> wrote:
> > I was playing around with implementing this.  I was trying to get the
> > ceph client's export functions to return NFS4ERR_FHEXPIRED instead of
> > ESTALE (hoping that my nfs4 clients would then attempt the full lookup
> > again).  I noticed also that the mds itself can also return an ESTALE
> > to the ceph kernel client, which seems to be getting propagated back
> > to the NFS client.  I'm wondering where I could intercept that and
> > send back an expiry notice?
>
> I don't think you want to: the MDS only returns ESTALE to the client
> if it can't find the inode, but that can happen normally if
> responsibility for the inode has been moved to another MDS. The client
> should be able to handle this circumstance and from a quick check
> that's the only way you're going to get ESTALE back from the MDS
> (unless something is horribly broken).

Normally, yes.  If the ceph client keeps the inode in its cache, it should 
never get an ESTALE it can't mask (I think that'd only happen when a 
request races with a metadata migration between MDSs).  NFS reexport is 
the one case where that doesn't hold true, because the NFS client can hold 
onto a file handle for as long as it wants and then present it back to the 
server.

sage


* Re: Fixing NFS
       [not found]             ` <Pine.LNX.4.64.1102072059260.16784@cobra.newdream.net>
@ 2011-02-09 18:42               ` Brian Chrisman
  2011-02-09 18:52                 ` Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Chrisman @ 2011-02-09 18:42 UTC
  To: ceph-devel

On Mon, Feb 7, 2011 at 9:02 PM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 7 Feb 2011, Brian Chrisman wrote:
>> My goal is merely to push this problem back on the NFSv4 client.  If I
>> can tell it "your filehandle is expired", it should request a full
>> path lookup to re-establish the filehandle, as far as I can tell from
>> the specs.
>
> Oh I see...
>
>> I converted the ESTALE returns to NFS4ERR_FHEXPIRED in export.c to
>> take a stab at accomplishing this, but there are still ESTALEs coming
>> across the wire to the client.  That's why I was looking to see where
>> other things could be going wrong.
>
> Hmm.  Generally speaking, the inode is in the ceph client's cache _only_
> if it has an MDS capability, and as long as it holds the capability it has
> a stateful handle on it and will never get ESTALE.  So my guess is it's
> coming from somewhere else in export.c.  Can you crank up client debugging
> (see below) and reproduce?
>
> sage
>
>
>
> Crank up debugging on just about everything.  Or you can just turn up
> export.c...
>
> $ cat /home/sage/ceph/src/script/kcon_most.sh
> #!/bin/sh -x

Cranking up only export.c, I see mds communication troubles when I get
the NFS stale fh.
I also used your full-debug script and ran the same test (copying over
my build environment, building, deleting), but the copy operation
hosed the cluster with osds going down and all kinds of other stuff
(this is probably an artifact of the kernel client being on the same
node as osds etc).
The previous messages (just export.c debugging):

ceph:         export.c:171  : __cfh_to_dentry 10000005a56 ffff880004176850 dentry ffff88003b788a80
ceph:         export.c:133  : __cfh_to_dentry 10000005a56 (1000000145a/1c60f9f)
ceph:         export.c:171  : __cfh_to_dentry 10000005a56 ffff880004176850 dentry ffff88003b788a80
ceph: mds0 reconnect start
ceph: mds0 reconnect success
ceph:         export.c:133  : __cfh_to_dentry 10000005a56 (1000000145a/1c60f9f)
ceph:         export.c:171  : __cfh_to_dentry 10000005a56 ffff880004176850 dentry ffff88003b788a80
ceph: mds0 recovery completed
libceph: mds0 10.200.98.105:6800 socket closed
libceph: mds0 10.200.98.105:6800 connection failed
ceph: mds0 reconnect start
ceph: mds0 reconnect success
ceph: mds0 recovery completed
libceph: mds0 10.200.98.111:6813 socket closed
libceph: mds0 10.200.98.111:6813 connection failed
libceph: mds0 10.200.98.111:6813 connection failed
libceph: mds0 10.200.98.111:6813 connection failed
libceph: mds0 10.200.98.111:6813 connection failed
libceph: mds0 10.200.98.111:6813 connection failed
libceph: mds0 10.200.98.111:6813 connection failed
ceph:         export.c:133  : __cfh_to_dentry 10000005a56 (1000000145a/1c60f9f)
ceph:         export.c:171  : __cfh_to_dentry 10000005a56 ffff880004176850 dentry ffff88003b788a80
libceph: mds0 10.200.98.111:6813 connection failed
ceph: mds0 caps stale
ceph: mds0 caps stale
libceph: mds0 10.200.98.111:6813 connection failed
ceph:         export.c:133  : __cfh_to_dentry 10000005a56 (1000000145a/1c60f9f)
ceph:         export.c:171  : __cfh_to_dentry 10000005a56 ffff880004176850 dentry ffff88003b788a80
libceph: mds0 10.200.98.111:6813 connection failed

* Re: Fixing NFS
  2011-02-09 18:42               ` Brian Chrisman
@ 2011-02-09 18:52                 ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2011-02-09 18:52 UTC
  To: Brian Chrisman; +Cc: ceph-devel

On Wed, 9 Feb 2011, Brian Chrisman wrote:
> On Mon, Feb 7, 2011 at 9:02 PM, Sage Weil <sage@newdream.net> wrote:
> > On Mon, 7 Feb 2011, Brian Chrisman wrote:
> >> My goal is merely to push this problem back on the NFSv4 client.  If I
> >> can tell it "your filehandle is expired", it should request a full
> >> path lookup to re-establish the filehandle, as far as I can tell from
> >> the specs.
> >
> > Oh I see...
> >
> >> I converted the ESTALE returns to NFS4ERR_FHEXPIRED in export.c to
> >> take a stab at accomplishing this, but there are still ESTALEs coming
> >> across the wire to the client.  That's why I was looking to see where
> >> other things could be going wrong.
> >
> > Hmm.  Generally speaking, the inode is in the ceph client's cache _only_
> > if it has an MDS capability, and as long as it holds the capability it has
> > a stateful handle on it and will never get ESTALE.  So my guess is it's
> > coming from somewhere else in export.c.  Can you crank up client debugging
> > (see below) and reproduce?
> >
> > sage
> >
> >
> >
> > Crank up debugging on just about everything.  Or you can just turn up
> > export.c...
> >
> > $ cat /home/sage/ceph/src/script/kcon_most.sh
> > #!/bin/sh -x
> 
> Cranking up only export.c, I see mds communication troubles when I get
> the NFS stale fh.
> I also used your full-debug script and ran the same test (copying over
> my build environment, building, deleting), but the copy operation
> hosed the cluster with osds going down and all kinds of other stuff
> (this is probably an artifact of the kernel client being on the same
> node as osds etc).
> The previous messages (just export.c debugging):
> 
> ceph:         export.c:171  : __cfh_to_dentry 10000005a56
> ffff880004176850 dentry ffff88003b788a80
> ceph:         export.c:133  : __cfh_to_dentry 10000005a56 (1000000145a/1c60f9f)
> ceph:         export.c:171  : __cfh_to_dentry 10000005a56
> ffff880004176850 dentry ffff88003b788a80
> ceph: mds0 reconnect start
> ceph: mds0 reconnect success
> ceph:         export.c:133  : __cfh_to_dentry 10000005a56 (1000000145a/1c60f9f)
> ceph:         export.c:171  : __cfh_to_dentry 10000005a56
> ffff880004176850 dentry ffff88003b788a80
> ceph: mds0 recovery completed
> libceph: mds0 10.200.98.105:6800 socket closed
> libceph: mds0 10.200.98.105:6800 connection failed

Can you look at the mds logs to see why cmds is crashing?  (Or going into 
an infinite loop, or whatever it is it's doing?)

sage



> ceph: mds0 reconnect start
> ceph: mds0 reconnect success
> ceph: mds0 recovery completed
> libceph: mds0 10.200.98.111:6813 socket closed
> libceph: mds0 10.200.98.111:6813 connection failed
> libceph: mds0 10.200.98.111:6813 connection failed
> libceph: mds0 10.200.98.111:6813 connection failed
> libceph: mds0 10.200.98.111:6813 connection failed
> libceph: mds0 10.200.98.111:6813 connection failed
> libceph: mds0 10.200.98.111:6813 connection failed
> ceph:         export.c:133  : __cfh_to_dentry 10000005a56 (1000000145a/1c60f9f)
> ceph:         export.c:171  : __cfh_to_dentry 10000005a56
> ffff880004176850 dentry ffff88003b788a80
> libceph: mds0 10.200.98.111:6813 connection failed
> ceph: mds0 caps stale
> ceph: mds0 caps stale
> libceph: mds0 10.200.98.111:6813 connection failed
> ceph:         export.c:133  : __cfh_to_dentry 10000005a56 (1000000145a/1c60f9f)
> ceph:         export.c:171  : __cfh_to_dentry 10000005a56
> ffff880004176850 dentry ffff88003b788a80
> libceph: mds0 10.200.98.111:6813 connection failed

* Re: Fixing NFS
  2011-02-08  3:33       ` Sage Weil
@ 2011-02-10 18:52         ` Brian Chrisman
  2011-02-10 19:03           ` Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Chrisman @ 2011-02-10 18:52 UTC
  To: ceph-devel

On Mon, Feb 7, 2011 at 7:33 PM, Sage Weil <sage@newdream.net> wrote:
...
>
> I believe the only place an actual MDS call is exposed to an NFS export is
> in export.c's __cfh_to_dentry().  This is where the ino search is going to
> need to get more sophisticated (at least on the client side).
>
> An ESTALE from the MDS generally means the starting ino in the request
> isn't in the cache.  You can try all MDSs for one that has it.  Beyond
> that, we'll need to implement more smarts on the server side!
>
> sage
>

With further testing, I tracked this down to ESTALEs indeed being
returned from __cfh_to_dentry().
I'm guessing this is because it has been flushed from the MDS cache,
as my max mds is 1 and it hasn't failed/migrated.

It looks like CEPH_MDS_OP_LOOKUPHASH is failing to find the dentry...
I was hoping to see how the rest of the kernel client implements
lookup when LOOKUPHASH fails, but it looks like only export.c is using
that operation.  Is it possible to perform a full lookup (past the
cache) of a file from a cfh?  Would appreciate pointers on
implementation.

I also noticed that NFS4ERR_FHEXPIRED is not referenced anywhere in
the kernel (particularly nfs client), so I'm guessing support for
filehandle expiry is quite a ways off.

Another question: I'd like to reproduce this error more quickly by
reducing the mds cache size.  I wanted to confirm 'mds_cache_size' is
what I'm looking for... and that I'd set it in the mds stanza of the
config with 'mds cache size = ####'?

* Re: Fixing NFS
  2011-02-10 18:52         ` Brian Chrisman
@ 2011-02-10 19:03           ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2011-02-10 19:03 UTC
  To: Brian Chrisman; +Cc: ceph-devel

On Thu, 10 Feb 2011, Brian Chrisman wrote:
> On Mon, Feb 7, 2011 at 7:33 PM, Sage Weil <sage@newdream.net> wrote:
> ...
> >
> > I believe the only place an actual MDS call is exposed to an NFS export is
> > in export.c's __cfh_to_dentry().  This is where the ino search is going to
> > need to get more sophisticated (at least on the client side).
> >
> > An ESTALE from the MDS generally means the starting ino in the request
> > isn't in the cache.  You can try all MDSs for one that has it.  Beyond
> > that, we'll need to implement more smarts on the server side!
> >
> > sage
> >
> 
> With further testing, I tracked this down to ESTALEs indeed being
> returned from __cfh_to_dentry().
> I'm guessing this is because it has been flushed from the MDS cache,
> as my max mds is 1 and it hasn't failed/migrated.
> 
> It looks like CEPH_MDS_OP_LOOKUPHASH is failing to find the dentry...
> I was hoping to see how the rest of the kernel client implements
> lookup when LOOKUPHASH fails, but it looks like only export.c is using
> that operation.  Is it possible to perform a full lookup (past the
> cache) of a file from a cfh?  Would appreciate pointers on
> implementation.

The idea with LOOKUPHASH is to take a dir ino, dentry hash, and ino, and 
try to locate it on the MDS.  The MDS will (currently) start with the dir 
(if it has it; otherwise ESTALE, what you're seeing), find the right 
directory fragment based on the dentry hash, and then look for the given 
ino in that dir frag.
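
A Python sketch of those three steps, under loudly stated assumptions: the MDS cache is a dict of directories, a directory is split into a fixed number of fragments chosen by `name_hash % NUM_FRAGS`, and `toy_hash` (CRC32) stands in for whatever dentry hash the filesystem really uses. None of these names or choices come from the Ceph source. The sketch also raises the stand-in for ESTALE when the ino is missing from the fragment, which is a guess about that case.

```python
import zlib
from typing import Dict, List

NUM_FRAGS = 4  # hypothetical fixed fragment count


class Stale(Exception):
    """Stands in for ESTALE."""


def toy_hash(name: str) -> int:
    """Placeholder dentry-name hash (not the real one)."""
    return zlib.crc32(name.encode())


class Dir:
    def __init__(self) -> None:
        # Each fragment maps dentry name -> inode number.
        self.frags: List[Dict[str, int]] = [{} for _ in range(NUM_FRAGS)]

    def add(self, name: str, ino: int) -> None:
        self.frags[toy_hash(name) % NUM_FRAGS][name] = ino

    def remove(self, name: str) -> None:
        del self.frags[toy_hash(name) % NUM_FRAGS][name]


def lookuphash(mds_cache: Dict[int, Dir], dir_ino: int,
               name_hash: int, ino: int) -> str:
    """Return the dentry name for ino, or raise Stale."""
    # Step 1: start with the directory, if the MDS has it cached.
    directory = mds_cache.get(dir_ino)
    if directory is None:
        raise Stale(f"dir {dir_ino:#x} not in cache")
    # Step 2: pick the fragment from the dentry hash.
    frag = directory.frags[name_hash % NUM_FRAGS]
    # Step 3: scan that fragment for the requested ino.
    for name, child in frag.items():
        if child == ino:
            return name
    raise Stale(f"ino {ino:#x} not in fragment")
```

A handle minted before a cross-directory rename still names the old directory, so in this model it fails at step 3 even when the old directory is cached.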

We can improve LOOKUPHASH to leverage the directory object backpointers on 
the MDS to make the dir location reliable.  That should eliminate ESTALE 
for everything except the case where the file was renamed to a new 
directory and then dropped out of caches.  Good enough, I hope?

> I also noticed that NFS4ERR_FHEXPIRED is not referenced anywhere in
> the kernel (particularly nfs client), so I'm guessing support for
> filehandle expiry is quite a ways off.
> 
> Another question: I'd like to reproduce this error more quickly by
> reducing the mds cache size.  I wanted to confirm 'mds_cache_size' is
> what I'm looking for... and that I'd set it in the mds stanza of the
> config with 'mds cache size = ####'?

Right.  You'll also want to reduce the size of the journal so that the 
dirty inodes are flushed to the dir objects more quickly (so they can be 
expired).  'mds log max segments = 2' should be okay.  You'll need to 
scribble some other metadata to fill up the journal and make the item you 
care about get flushed/trimmed.

sage


Thread overview: 13+ messages
2011-02-03 17:05 Fixing NFS Brian Chrisman
2011-02-03 17:29 ` Sage Weil
2011-02-03 17:52   ` Tommi Virtanen
2011-02-08  1:51     ` Brian Chrisman
2011-02-08  2:06       ` Gregory Farnum
2011-02-08  3:35         ` Sage Weil
     [not found]           ` <AANLkTik0jz_RXbrLz2m=+EvBwpBwesQ_Lr_iiCuQJHyF@mail.gmail.com>
     [not found]             ` <Pine.LNX.4.64.1102072059260.16784@cobra.newdream.net>
2011-02-09 18:42               ` Brian Chrisman
2011-02-09 18:52                 ` Sage Weil
2011-02-08  3:33       ` Sage Weil
2011-02-10 18:52         ` Brian Chrisman
2011-02-10 19:03           ` Sage Weil
2011-02-03 23:09   ` Hard links (was: Fixing NFS) Chris Dunlop
2011-02-03 23:16     ` Sage Weil
