* parent xattrs on file objects
@ 2012-10-16 21:17 Sage Weil
2012-10-16 21:26 ` Gregory Farnum
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Sage Weil @ 2012-10-16 21:17 UTC (permalink / raw)
To: ceph-devel
Hey-
One of the design goals of the ceph fs was to keep metadata separate from
data. This means, among other things, that when a client is creating a
bunch of files, it creates the inode via the mds and writes the file data
to the OSD, but no mds->osd interaction is necessary.
One of the challenges we currently have is that it is difficult to look up
an inode by ino. Normally clients traverse the hierarchy to get there, so
things are fine for native ceph clients, but when reexporting via NFS we
can get ESTALE because an ancient NFS file handle can be presented and
the ceph MDS won't know where to find it. We have a similar problem with
the fsck design in that it is not always possible to discover orphaned
children of a directory that was somehow lost.
One option is to put an ancestor xattr on the first object for each file,
similar to what we do for directories. This basically means that each
file creation will be followed (eventually) by a setxattr osd operation.
This used to scare me, but now it's seeming like a pretty small price to
pay for robust NFS reexport and additional information for fsck to
utilize.
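A minimal sketch of what such an ancestor xattr could hold, to make the idea concrete. Everything here is an assumption for illustration: the JSON encoding, the xattr name "ancestors", the object naming in first_object_name(), and the in-memory FakeObjectStore standing in for the OSDs. It is not the on-disk format ceph uses.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Ancestor:
    parent_ino: int  # inode number of the containing directory
    name: str        # dentry name within that directory


def encode_ancestors(chain):
    """Serialize a child-to-root ancestor chain into an xattr value (bytes)."""
    return json.dumps([[a.parent_ino, a.name] for a in chain]).encode("utf-8")


def decode_ancestors(blob):
    """Inverse of encode_ancestors()."""
    return [Ancestor(p, n) for p, n in json.loads(blob.decode("utf-8"))]


class FakeObjectStore:
    """In-memory stand-in for the OSDs: object name -> {xattr name: bytes}."""

    def __init__(self):
        self.xattrs = {}

    def setxattr(self, obj, key, value):
        self.xattrs.setdefault(obj, {})[key] = value

    def getxattr(self, obj, key):
        return self.xattrs[obj][key]


def first_object_name(ino):
    # Hypothetical naming scheme for the first data object of a file.
    return f"{ino:x}.00000000"


# File /home/notes.txt: ino 0x2000, in directory 0x1000, which is "home" in root (ino 1).
store = FakeObjectStore()
chain = [Ancestor(0x1000, "notes.txt"), Ancestor(0x1, "home")]
store.setxattr(first_object_name(0x2000), "ancestors", encode_ancestors(chain))
recovered = decode_ancestors(store.getxattr(first_object_name(0x2000), "ancestors"))
```

The setxattr call is the extra OSD operation that would follow each file creation.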
It's also nice because it means we could get rid of the anchor table (used
for locating files with multiple hard links) entirely and use the
ancestor xattrs instead. That means one less thing to fsck, and avoids
having to invest any time in making the anchor table scale effectively (it
currently doesn't).
Anyone feel like we shouldn't go ahead and do this?
sage
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: parent xattrs on file objects
2012-10-16 21:17 parent xattrs on file objects Sage Weil
@ 2012-10-16 21:26 ` Gregory Farnum
2012-10-16 21:35 ` Sage Weil
2012-10-16 21:32 ` Mark Nelson
2012-10-16 21:35 ` Matt W. Benjamin
2 siblings, 1 reply; 16+ messages in thread
From: Gregory Farnum @ 2012-10-16 21:26 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On Tue, Oct 16, 2012 at 2:17 PM, Sage Weil <sage@inktank.com> wrote:
> Hey-
>
> One of the design goals of the ceph fs was to keep metadata separate from
> data. This means, among other things, that when a client is creating a
> bunch of files, it creates the inode via the mds and writes the file data
> to the OSD, but no mds->osd interaction is necessary.
>
> One of the challenges we currently have is that it is difficult to lookup
> an inode by ino. Normally clients traverse the hierarchy to get there, so
> things are fine for native ceph clients, but when reexporting via NFS we
> can get ESTALE because an ancient NFS file handle can be presented and
> the ceph MDS won't know where to find it. We have a similar problem with
> the fsck design in that it is not always possible to discover orphaned
> children of a directory that was somehow lost.
>
> One option is to put an ancestor xattr on the first object for each file,
> similar to what we do for directories. This basically means that each
> file creation will be followed (eventually) by a setxattr osd operation.
> This used to scare me, but now it's seeming like a pretty small price to
> pay for robust NFS reexport and additional information for fsck to
> utilize.
Can you talk about this in a bit more detail? Do you expect the
clients or the MDS to be doing the setxattr? What about doing it used
to scare you?
> It's also nice because it means we could get rid of the anchor table (used
> for locating files with multiple hard links) entirely and use the
> ancestor xattrs instead. That means one less thing to fsck, and avoids
> having to invest any time in making the anchor table effectively scale (it
> currently doesn't).
Hurray! I'm not sure how this directly lets us get rid of the anchor
table, though. Is your plan to just stick the inode in every directory
and then mark it so everything that does a stat on that inode goes to
the inode, grabs its primary location out of the inode, and then does a
lookup there? That seems a bit circuitous for a lot of operations...
> Anyone feel like we shouldn't go ahead and do this?
I'm certainly for it with this broad outline. ;)
-Greg
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: parent xattrs on file objects
2012-10-16 21:17 parent xattrs on file objects Sage Weil
2012-10-16 21:26 ` Gregory Farnum
@ 2012-10-16 21:32 ` Mark Nelson
2012-10-16 21:35 ` Matt W. Benjamin
2 siblings, 0 replies; 16+ messages in thread
From: Mark Nelson @ 2012-10-16 21:32 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 10/16/2012 04:17 PM, Sage Weil wrote:
> Hey-
>
> One of the design goals of the ceph fs was to keep metadata separate from
> data. This means, among other things, that when a client is creating a
> bunch of files, it creates the inode via the mds and writes the file data
> to the OSD, but no mds->osd interaction is necessary.
>
> One of the challenges we currently have is that it is difficult to lookup
> an inode by ino. Normally clients traverse the hierarchy to get there, so
> things are fine for native ceph clients, but when reexporting via NFS we
> can get ESTALE because an ancient NFS file handle can be presented and
> the ceph MDS won't know where to find it. We have a similar problem with
> the fsck design in that it is not always possible to discover orphaned
> children of a directory that was somehow lost.
>
> One option is to put an ancestor xattr on the first object for each file,
> similar to what we do for directories. This basically means that each
> file creation will be followed (eventually) by a setxattr osd operation.
> This used to scare me, but now it's seeming like a pretty small price to
> pay for robust NFS reexport and additional information for fsck to
> utilize.
>
Seems like a small price to pay especially for large writes. How much
later does the setxattr happen? For small writes, any idea if this is
going to cause an additional seek if it's delayed?
> It's also nice because it means we could get rid of the anchor table (used
> for locating files with multiple hard links) entirely and use the
> ancestor xattrs instead. That means one less thing to fsck, and avoids
> having to invest any time in making the anchor table effectively scale (it
> currently doesn't).
>
> Anyone feel like we shouldn't go ahead and do this?
>
> sage
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: parent xattrs on file objects
2012-10-16 21:17 parent xattrs on file objects Sage Weil
2012-10-16 21:26 ` Gregory Farnum
2012-10-16 21:32 ` Mark Nelson
@ 2012-10-16 21:35 ` Matt W. Benjamin
2 siblings, 0 replies; 16+ messages in thread
From: Matt W. Benjamin @ 2012-10-16 21:35 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel, aemerson, casey, peter honeyman
Hi Sage,
We've been exploring (experimentally implementing) a different solution to this problem, basically refactoring dirents and inodes, extending fragmentation logic, and adding new metadata location operations. We also remove the anchor table. We were planning to ask for some feedback once we had some initial results, but since you're floating a related idea, we'd like to share what we have so far. CC'ing people.
Regards,
Matt
--
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://linuxbox.com
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: parent xattrs on file objects
2012-10-16 21:26 ` Gregory Farnum
@ 2012-10-16 21:35 ` Sage Weil
2012-10-16 21:47 ` Yehuda Sadeh Weinraub
0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2012-10-16 21:35 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel
On Tue, 16 Oct 2012, Gregory Farnum wrote:
> On Tue, Oct 16, 2012 at 2:17 PM, Sage Weil <sage@inktank.com> wrote:
> > Hey-
> >
> > One of the design goals of the ceph fs was to keep metadata separate from
> > data. This means, among other things, that when a client is creating a
> > bunch of files, it creates the inode via the mds and writes the file data
> > to the OSD, but no mds->osd interaction is necessary.
> >
> > One of the challenges we currently have is that it is difficult to lookup
> > an inode by ino. Normally clients traverse the hierarchy to get there, so
> > things are fine for native ceph clients, but when reexporting via NFS we
> > can get ESTALE because an ancient NFS file handle can be presented and
> > the ceph MDS won't know where to find it. We have a similar problem with
> > the fsck design in that it is not always possible to discover orphaned
> > children of a directory that was somehow lost.
> >
> > One option is to put an ancestor xattr on the first object for each file,
> > similar to what we do for directories. This basically means that each
> > file creation will be followed (eventually) by a setxattr osd operation.
> > This used to scare me, but now it's seeming like a pretty small price to
> > pay for robust NFS reexport and additional information for fsck to
> > utilize.
>
> Can you talk about this in a bit more detail? Do you expect the
> clients or the MDS to be doing the setxattr? What about doing it used
> to scare you?
For untarring small files, it doubles the number of osd operations, and
means we have to think about the setxattr timing wrt warm caches, etc.
> > It's also nice because it means we could get rid of the anchor table (used
> > for locating files with multiple hard links) entirely and use the
> > ancestor xattrs instead. That means one less thing to fsck, and avoids
> > having to invest any time in making the anchor table effectively scale (it
> > currently doesn't).
>
> Hurray! I'm not sure how this directly lets us get rid of the anchor
> table, though. Is your plan to just stick the inode in every directory
> and then mark it so everything that does a stat on that inode goes to
> the inode, grabs its primary location out of the inode, and then do a
> lookup there? That seems a bit circuitous for a lot of operations...
We would build a generic lookup_by_ino framework based on these xattrs
(first try local mds, then try object xattrs, then try other mds caches,
then try object xattr again.. something like that). Like the anchor
lookups, this would iteratively look for parents so that we can
traverse to the given file.
Given that functionality, the anchor table is no longer needed--it
performs exactly the same function by explicitly tracking parents for only
the linked file. This approach may be somewhat slower (the file xattr may
be stale beyond the immediate parent, whereas the anchor table is always
up to date for the full ancestor chain), but we can mitigate that by
lazily updating out-of-date file object xattrs when we see them. I
suspect the end result will be only slightly more complicated than the
anchor table (if at all) and provide a much more generic and useful
service (for hard links, NFS reexport, and fsck alike).
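A rough sketch of that lookup order, assuming each source can be modelled as a function from an ino to its (parent ino, dentry name) or None. The names find_parent, lookup_by_ino, ROOT_INO and the three dictionaries are made up for illustration, and the final "try object xattr again" step is left out for brevity.

```python
ROOT_INO = 1  # assumed inode number of the filesystem root


def find_parent(ino, sources):
    """Return (parent_ino, name) from the first source that knows ino, else None."""
    for source in sources:
        hit = source(ino)
        if hit is not None:
            return hit
    return None


def lookup_by_ino(ino, sources, max_depth=64):
    """Resolve ino to a path by iteratively looking up parents until the root."""
    names = []
    cur = ino
    for _ in range(max_depth):
        if cur == ROOT_INO:
            return "/" + "/".join(reversed(names))
        hit = find_parent(cur, sources)
        if hit is None:
            return None  # no source knows this ino; the caller would report ESTALE
        cur, name = hit
        names.append(name)
    return None  # gave up: chain too deep or looping


# Each dict maps ino -> (parent_ino, name) and stands in for one tier.
local_mds_cache = {0x1000: (ROOT_INO, "home")}
object_xattrs = {0x2000: (0x1000, "notes.txt")}
other_mds_caches = {}

# Order follows the description above: local mds, object xattrs, other mds caches.
sources = [local_mds_cache.get, object_xattrs.get, other_mds_caches.get]
path = lookup_by_ino(0x2000, sources)
```

Here the file's parent comes from the object xattr tier and the directory's parent from the local cache, so one lookup can mix tiers as it climbs.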
> > Anyone feel like we shouldn't go ahead and do this?
>
> I'm certainly for it with this broad outline. ;)
> -Greg
sage
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: parent xattrs on file objects
2012-10-16 21:35 ` Sage Weil
@ 2012-10-16 21:47 ` Yehuda Sadeh Weinraub
2012-10-16 21:54 ` Gregory Farnum
0 siblings, 1 reply; 16+ messages in thread
From: Yehuda Sadeh Weinraub @ 2012-10-16 21:47 UTC (permalink / raw)
To: Sage Weil; +Cc: Gregory Farnum, ceph-devel
On Tue, Oct 16, 2012 at 2:35 PM, Sage Weil <sage@inktank.com> wrote:
> On Tue, 16 Oct 2012, Gregory Farnum wrote:
>> On Tue, Oct 16, 2012 at 2:17 PM, Sage Weil <sage@inktank.com> wrote:
>> > Hey-
>> >
>> > One of the design goals of the ceph fs was to keep metadata separate from
>> > data. This means, among other things, that when a client is creating a
>> > bunch of files, it creates the inode via the mds and writes the file data
>> > to the OSD, but no mds->osd interaction is necessary.
>> >
>> > One of the challenges we currently have is that it is difficult to lookup
>> > an inode by ino. Normally clients traverse the hierarchy to get there, so
>> > things are fine for native ceph clients, but when reexporting via NFS we
>> > can get ESTALE because an ancient NFS file handle can be presented and
>> > the ceph MDS won't know where to find it. We have a similar problem with
>> > the fsck design in that it is not always possible to discover orphaned
>> > children of a directory that was somehow lost.
>> >
>> > One option is to put an ancestor xattr on the first object for each file,
>> > similar to what we do for directories. This basically means that each
>> > file creation will be followed (eventually) by a setxattr osd operation.
>> > This used to scare me, but now it's seeming like a pretty small price to
>> > pay for robust NFS reexport and additional information for fsck to
>> > utilize.
>>
>> Can you talk about this in a bit more detail? Do you expect the
>> clients or the MDS to be doing the setxattr? What about doing it used
>> to scare you?
>
> For untarring small files, it doubles the number of osd operations, and
> means we have to think about the setxattr timing wrt warm caches, etc.
>
>> > It's also nice because it means we could get rid of the anchor table (used
>> > for locating files with multiple hard links) entirely and use the
>> > ancestor xattrs instead. That means one less thing to fsck, and avoids
>> > having to invest any time in making the anchor table effectively scale (it
>> > currently doesn't).
>>
>> Hurray! I'm not sure how this directly lets us get rid of the anchor
>> table, though. Is your plan to just stick the inode in every directory
>> and then mark it so everything that does a stat on that inode goes to
>> the inode, grabs its primary location out of the inode, and then do a
>> lookup there? That seems a bit circuitous for a lot of operations...
>
> We would build a generic lookup_by_ino framework based on these xattrs
> (first try local mds, then try object xattrs, then try other mds caches,
> then try object xattr again.. something like that). Like the anchor
> lookups, this would iteratively look for parents so that we can
> traverse to the given file.
>
Will that be able to cover all cases, or are there still cases where
we'd end up with ESTALE?
Yehuda
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: parent xattrs on file objects
2012-10-16 21:47 ` Yehuda Sadeh Weinraub
@ 2012-10-16 21:54 ` Gregory Farnum
0 siblings, 0 replies; 16+ messages in thread
From: Gregory Farnum @ 2012-10-16 21:54 UTC (permalink / raw)
To: Yehuda Sadeh Weinraub; +Cc: Sage Weil, ceph-devel
On Tue, Oct 16, 2012 at 2:47 PM, Yehuda Sadeh Weinraub
<yehudasa@gmail.com> wrote:
> On Tue, Oct 16, 2012 at 2:35 PM, Sage Weil <sage@inktank.com> wrote:
>> On Tue, 16 Oct 2012, Gregory Farnum wrote:
>>> On Tue, Oct 16, 2012 at 2:17 PM, Sage Weil <sage@inktank.com> wrote:
>>> > Hey-
>>> >
>>> > One of the design goals of the ceph fs was to keep metadata separate from
>>> > data. This means, among other things, that when a client is creating a
>>> > bunch of files, it creates the inode via the mds and writes the file data
>>> > to the OSD, but no mds->osd interaction is necessary.
>>> >
>>> > One of the challenges we currently have is that it is difficult to lookup
>>> > an inode by ino. Normally clients traverse the hierarchy to get there, so
>>> > things are fine for native ceph clients, but when reexporting via NFS we
>>> > can get ESTALE because an ancient NFS file handle can be presented and
>>> > the ceph MDS won't know where to find it. We have a similar problem with
>>> > the fsck design in that it is not always possible to discover orphaned
>>> > children of a directory that was somehow lost.
>>> >
>>> > One option is to put an ancestor xattr on the first object for each file,
>>> > similar to what we do for directories. This basically means that each
>>> > file creation will be followed (eventually) by a setxattr osd operation.
>>> > This used to scare me, but now it's seeming like a pretty small price to
>>> > pay for robust NFS reexport and additional information for fsck to
>>> > utilize.
>>>
>>> Can you talk about this in a bit more detail? Do you expect the
>>> clients or the MDS to be doing the setxattr? What about doing it used
>>> to scare you?
>>
>> For untarring small files, it doubles the number of osd operations, and
>> means we have to think about the setxattr timing wrt warm caches, etc.
>>
>>> > It's also nice because it means we could get rid of the anchor table (used
>>> > for locating files with multiple hard links) entirely and use the
>>> > ancestor xattrs instead. That means one less thing to fsck, and avoids
>>> > having to invest any time in making the anchor table effectively scale (it
>>> > currently doesn't).
>>>
>>> Hurray! I'm not sure how this directly lets us get rid of the anchor
>>> table, though. Is your plan to just stick the inode in every directory
>>> and then mark it so everything that does a stat on that inode goes to
>>> the inode, grabs its primary location out of the inode, and then do a
>>> lookup there? That seems a bit circuitous for a lot of operations...
>>
>> We would build a generic lookup_by_ino framework based on these xattrs
>> (first try local mds, then try object xattrs, then try other mds caches,
>> then try object xattr again.. something like that). Like the anchor
>> lookups, this would iteratively look for parents so that we can
>> traverse to the given file.
>>
>
> Will that be able to cover all cases, or are there still cases where
> we'd end up with ESTALE?
Assuming an ancestor xattr that stores a lazily-updated path in
addition to the actual inode of the parent, and assuming that we
always update the actual parent inode synchronously with a move of the
inode to a different parent, then that lets us cover all lookup cases
since we can just keep hopping back up the object backpointers to the
root.
A malicious workload of inode moves could slow the lookup down quite a
bit; I'm still working through in my head if we can guarantee forward
progress when we have to do lookups from bottom to top but we need to
do locks from top to bottom...
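To illustrate the hopping, here is a small sketch under the two assumptions stated above: entry [0] of each object's backpointer list (the immediate parent) is always current, and the remaining entries may be stale. The dict backptrs, the function names and the hop limit are hypothetical, and the sketch ignores locking entirely, so it says nothing about the forward-progress question.

```python
ROOT_INO = 1  # assumed inode number of the filesystem root


def resolve_chain(ino, backptrs, max_hops=64):
    """Walk to the root, trusting only each object's immediate-parent entry."""
    chain = []
    cur = ino
    for _ in range(max_hops):
        if cur == ROOT_INO:
            return chain
        parent_ino, name = backptrs[cur][0]  # only entry [0] is authoritative
        chain.append((parent_ino, name))
        cur = parent_ino
    raise RuntimeError("too many hops; inodes may be moving concurrently")


def lookup_and_repair(ino, backptrs):
    """Resolve a path and lazily rewrite the stored chain if it was stale."""
    fresh = resolve_chain(ino, backptrs)
    if backptrs[ino] != fresh:
        backptrs[ino] = fresh  # lazy update of the out-of-date ancestors
    return "/" + "/".join(name for _, name in reversed(fresh))


# Directory 0x1000 was renamed from "home" to "users". Its own backpointer was
# updated synchronously, but the file's stored chain still says "home" beyond
# its immediate parent.
backptrs = {
    0x1000: [(ROOT_INO, "users")],
    0x2000: [(0x1000, "notes.txt"), (ROOT_INO, "home")],
}
path = lookup_and_repair(0x2000, backptrs)
```

The hop limit is only a crude guard against a workload that keeps moving inodes while the lookup is climbing.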
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: parent xattrs on file objects
[not found] <2054435269.116.1350502651797.JavaMail.root@thunderbeast.private.linuxbox.com>
@ 2012-10-17 19:40 ` Casey Bodley
2012-10-17 19:53 ` Sage Weil
2012-10-17 20:18 ` Gregory Farnum
0 siblings, 2 replies; 16+ messages in thread
From: Casey Bodley @ 2012-10-17 19:40 UTC (permalink / raw)
To: Matt W. Benjamin; +Cc: ceph-devel, aemerson, peter honeyman, Sage Weil
To expand on what Matt said, we're also trying to address this issue of lookups by inode number for use with NFS.
The design we've been exploring is to create a single system inode, designated the 'inode container' directory, which stores the primary links to all inodes in the filesystem. These links are named by their inode number to satisfy lookups and obviate the need for an anchor table. This design allows the inode container to make use of existing directory fragmentation and load balancing to distribute the inodes over the MDS cluster.
When a new file is created, it then adds two links: a primary link into the inode container, and a remote link into the filesystem namespace. In the case where the parent directory fragment's authority is different than the corresponding inode container fragment's, it is created in the parent directory then exported to the inode container via an asynchronous slave request.
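As a rough single-node illustration of the two links (ignoring fragmentation, MDS authority and the asynchronous export), assuming a container keyed by the hex inode number and a namespace map of remote links. The class and method names are made up for this sketch and are not taken from our implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    ino: int
    size: int = 0


@dataclass
class MetadataStore:
    # Primary links: dentry name (hex ino) in the inode container -> Inode
    container: dict = field(default_factory=dict)
    # Remote links: (parent_ino, name) in the filesystem namespace -> ino
    namespace: dict = field(default_factory=dict)

    def create_file(self, parent_ino, name, ino):
        """Add the primary link to the container and a remote link to the namespace."""
        inode = Inode(ino)
        self.container[f"{ino:x}"] = inode
        self.namespace[(parent_ino, name)] = ino
        return inode

    def lookup_by_ino(self, ino):
        """A lookup by ino is a plain directory lookup in the container."""
        return self.container.get(f"{ino:x}")

    def lookup_by_name(self, parent_ino, name):
        """A lookup by name follows the remote link, then the primary link."""
        ino = self.namespace.get((parent_ino, name))
        return None if ino is None else self.lookup_by_ino(ino)


store = MetadataStore()
created = store.create_file(parent_ino=0x1000, name="notes.txt", ino=0x2000)
```

A hard link in this model is just a second namespace entry pointing at the same ino, which is why no separate anchor table is needed.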
We welcome additional discussion, both on this design specifically and on the general topic of scalable ino lookups.
Casey
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: parent xattrs on file objects
2012-10-17 19:40 ` Casey Bodley
@ 2012-10-17 19:53 ` Sage Weil
2012-10-17 20:18 ` Gregory Farnum
1 sibling, 0 replies; 16+ messages in thread
From: Sage Weil @ 2012-10-17 19:53 UTC (permalink / raw)
To: Casey Bodley; +Cc: Matt W. Benjamin, ceph-devel, aemerson, peter honeyman
On Wed, 17 Oct 2012, Casey Bodley wrote:
> To expand on what Matt said, we're also trying to address this issue of
> lookups by inode number for use with NFS.
>
> The design we've been exploring is to create a single system inode,
> designated the 'inode container' directory, which stores the primary
> links to all inodes in the filesystem. These links are named by their
> inode number to satisfy lookups and obviate the need for an anchor
> table. This design allows the inode container to make use of existing
> directory fragmentation and load balancing to distribute the inodes over
> the MDS cluster.
>
> When a new file is created, it then adds two links: a primary link into
> the inode container, and a remote link into the filesystem namespace. In
> the case where the parent directory fragment's authority is different
> than the corresponding inode container fragment's, it is created in the
> parent directory then exported to the inode container via an
> asynchronous slave request.
>
> We welcome additional discussion, both on this design specifically and
> on the general topic of scalable ino lookups.
This would certainly work. It essentially gives up on the idea of
embedding inodes in directories, however, and the performance advantages
that offers in the common case, because every file has to be resolved via
the inode directory. I also suspect that if that is the end goal, we
could get there more easily without grafting it onto directory fragments.
My hope is that these backpointers on file data objects will satisfy our
need to look up by ino in the rare case where it's necessary, while
avoiding the overhead of an inode table and the second lookup for each
file in the general case.
Do you see issues or pitfalls with that approach, given your experience
with the NFS ESTALE stuff so far?
sage
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: parent xattrs on file objects
2012-10-17 19:40 ` Casey Bodley
2012-10-17 19:53 ` Sage Weil
@ 2012-10-17 20:18 ` Gregory Farnum
1 sibling, 0 replies; 16+ messages in thread
From: Gregory Farnum @ 2012-10-17 20:18 UTC (permalink / raw)
To: Casey Bodley
Cc: Matt W. Benjamin, ceph-devel, aemerson, peter honeyman, Sage Weil
On Wed, Oct 17, 2012 at 12:40 PM, Casey Bodley <casey@linuxbox.com> wrote:
> To expand on what Matt said, we're also trying to address this issue of lookups by inode number for use with NFS.
>
> The design we've been exploring is to create a single system inode, designated the 'inode container' directory, which stores the primary links to all inodes in the filesystem. These links are named by their inode number to satisfy lookups and obviate the need for an anchor table. This design allows the inode container to make use of existing directory fragmentation and load balancing to distribute the inodes over the MDS cluster.
>
> When a new file is created, it then adds two links: a primary link into the inode container, and a remote link into the filesystem namespace. In the case where the parent directory fragment's authority is different than the corresponding inode container fragment's, it is created in the parent directory then exported to the inode container via an asynchronous slave request.
>
> We welcome additional discussion, both on this design specifically and on the general topic of scalable ino lookups.
So if the primary link isn't always in the "inode container", you must
be preserving the anchor table for this setup. Am I understanding that
correctly? Or is there some other mechanism for linking them that's
less expensive?
-Greg
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: parent xattrs on file objects
[not found] <937776470.145.1350510476081.JavaMail.root@thunderbeast.private.linuxbox.com>
@ 2012-10-17 21:51 ` Casey Bodley
2012-10-17 22:04 ` Gregory Farnum
0 siblings, 1 reply; 16+ messages in thread
From: Casey Bodley @ 2012-10-17 21:51 UTC (permalink / raw)
To: Gregory Farnum
Cc: Matt W. Benjamin, ceph-devel, aemerson, peter honeyman, Sage Weil
Hi Greg,
In the case where an inode is created on mds.a and exported to mds.b, there is a potential race on mds.b between a subsequent lookup-by-ino and the primary link actually making it into the inode container.
Our tentative solution was to rely on the way InoTable breaks up the range of inode numbers based on mds nodeid. So when a lookup on the inode container fails, we can determine which mds would have allocated that inode number and attempt to find the inode there. The originating mds.a should always find the inode in its cache while it's pinned for export. Depending on whether the inode is found on mds.a, the lookup-by-ino on mds.b either returns failure or waits for the import to finish.
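A minimal sketch of that fallback, for illustration only. It assumes each MDS rank allocates inode numbers from its own contiguous range; the range size, the function names, and the dict/set stand-ins for the inode container and the export-pinned cache are all hypothetical and not taken from the Ceph source.

```python
from enum import Enum

# Assumed layout: rank r allocates inos from [(r+1) * RANGE, (r+2) * RANGE).
# The range size is a placeholder, not a value quoted in this thread.
INOS_PER_MDS = 1 << 40


class LookupResult(Enum):
    FOUND = "found"
    WAIT_FOR_IMPORT = "wait_for_import"
    NOT_FOUND = "not_found"


def allocating_mds(ino: int) -> int:
    """Return the rank of the MDS whose range contains ino."""
    return ino // INOS_PER_MDS - 1


def lookup_by_ino(ino, container, pinned_for_export):
    """Look up ino on the MDS that holds the inode container fragment.

    container: set of inos already linked into the inode container.
    pinned_for_export: dict of rank -> set of inos that rank still has
        pinned in cache while an export is in flight.
    """
    if ino in container:
        return LookupResult.FOUND
    # Miss: ask the MDS that would have allocated this ino.
    origin = allocating_mds(ino)
    if ino in pinned_for_export.get(origin, set()):
        # The inode exists but its primary link has not arrived yet.
        return LookupResult.WAIT_FOR_IMPORT
    return LookupResult.NOT_FOUND
```

The point of the sketch is that a miss in the container is not final: the ino itself names the one MDS that can tell "not created" apart from "still in flight".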
Casey
* Re: parent xattrs on file objects
2012-10-17 21:51 ` Casey Bodley
@ 2012-10-17 22:04 ` Gregory Farnum
2012-10-17 22:15 ` Adam C. Emerson
0 siblings, 1 reply; 16+ messages in thread
From: Gregory Farnum @ 2012-10-17 22:04 UTC (permalink / raw)
To: Casey Bodley
Cc: Matt W. Benjamin, ceph-devel, aemerson, peter honeyman, Sage Weil
I still don't get it. Putting every inode's primary link in a lookup
directory and then patching the lookup code to go there makes sense to
me. But if you have to go the other way (from the inode directory's
secondary link to some other location as the primary link), you need
an up-to-date path for that primary link, right? How do you handle it
when the path changes — do you have a two-phase commit on the lookup
directory attributes?
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: parent xattrs on file objects
2012-10-17 22:04 ` Gregory Farnum
@ 2012-10-17 22:15 ` Adam C. Emerson
2012-10-19 21:17 ` Sage Weil
0 siblings, 1 reply; 16+ messages in thread
From: Adam C. Emerson @ 2012-10-17 22:15 UTC (permalink / raw)
To: Gregory Farnum
Cc: Casey Bodley, Matt W. Benjamin, ceph-devel, peter honeyman,
Sage Weil
Mr. Farnum,
At Wed, 17 Oct 2012 15:04:23 -0700, Gregory Farnum wrote:
>
> I still don't get it. Putting every inode's primary link in a lookup
> directory and then patching the lookup code to go there makes sense to
> me. But if you have to go the other way (from the inode directory's
> secondary link to some other location as the primary link), you need
> an up-to-date path for that primary link, right? How do you handle it
> when the path changes — do you have a two-phase commit on the lookup
> directory attributes?
Our idea isn't to have the inode directory contain links back to the
primary. Our idea is to have a structure managed by MDSs that is
looked up by inode number and spread across the MDSs in a cluster
similarly to the way CRUSH maps files across OSDs. This structure
contains all the information currently in the inode that's now
incorporated into the dirent.
The dirents would then contain mappings from names to
inodes and possibly cache (but not be the primary for) inode content.
We were also planning to change directory fragmentation to distribute
fragments across MDSs based on a function of the filename, also
similarly to how CRUSH maps objects to OSDs.
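
A rough sketch of the placement idea, under loud assumptions: it uses a plain stable hash modulo the number of MDSs, which is NOT CRUSH (no weights, no hierarchy, and nearly everything moves when the cluster size changes). The function names and key formats are hypothetical. It only illustrates that both inode location and dentry location can be computed from the key alone, with no table lookup.

```python
import hashlib


def _stable_hash(key: bytes) -> int:
    """Hash that is identical on every node (unlike Python's built-in hash())."""
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")


def mds_for_inode(ino: int, num_mds: int) -> int:
    """Pick the MDS rank that would hold the inode record for ino."""
    return _stable_hash(b"ino:%d" % ino) % num_mds


def mds_for_dentry(dir_ino: int, name: str, num_mds: int) -> int:
    """Pick the MDS rank that would hold the dentry 'name' in directory dir_ino."""
    key = b"dn:%d/" % dir_ino + name.encode("utf-8")
    return _stable_hash(key) % num_mds
```

Any client or MDS that knows num_mds can evaluate these functions independently and arrive at the same rank, which is the property being borrowed from CRUSH here.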
Respectfully yours,
Adam C. Emerson <aemerson@linuxbox.com>
* Re: parent xattrs on file objects
2012-10-17 22:15 ` Adam C. Emerson
@ 2012-10-19 21:17 ` Sage Weil
0 siblings, 0 replies; 16+ messages in thread
From: Sage Weil @ 2012-10-19 21:17 UTC (permalink / raw)
To: Adam C. Emerson
Cc: Gregory Farnum, Casey Bodley, Matt W. Benjamin, ceph-devel,
peter honeyman
On Wed, 17 Oct 2012, Adam C. Emerson wrote:
> Mr. Farnum,
>
> At Wed, 17 Oct 2012 15:04:23 -0700, Gregory Farnum wrote:
> >
> > I still don't get it. Putting every inode's primary link in a lookup
> > directory and then patching the lookup code to go there makes sense to
> > me. But if you have to go the other way (from the inode directory's
> > secondary link to some other location as the primary link), you need
> > an up-to-date path for that primary link, right? How do you handle it
> > when the path changes -- do you have a two-phase commit on the lookup
> > directory attributes?
>
> Our idea isn't to have the inode directory contain links back to the
> primary. Our idea is to have a structure managed by MDSs that is
> looked up by inode number and spread across the MDSs in a cluster
> similarly to the way CRUSH maps files across OSDs. This structure
> contains all the information currently in the inode that's now
> incorporated into the dirent.
>
> The dirents would then contain mappings from names to
> inodes and possibly cache (but not be the primary for) inode content.
> We were also planning to change directory fragmentation to distribute
> fragments across MDSs based on a function of the filename, also
> similarly to how CRUSH maps objects to OSDs.
I think this basic approach is viable. However, I'm hesitant to give up
on embedded inodes because of the huge performance wins in the common
cases; I'd rather have a more expensive lookup-by-ino in the rare cases
where nfs filehandles are out of cache and be smokin' fast the rest of the
time.
Are there reasons you're attached to your current approach? Do you see
problems with a generalized "find this ino" function based on the file
objects? I like the latter because it
- means we can scrap the anchor table, which needs additional work anyway
if it is going to scale
- is generally useful for fsck
- solves the NFS fh issue
The only real downsides I see to this approach are:
- more OSD ops (setxattrs.. if we're smart, they'll be cheap)
- lookup-by-ino for resolving hard links may be slower than the anchor
table, which gives you *all* ancestors in one lookup, vs this, which
may range from 1 lookup to (depth of tree) lookups (or possibly more,
in rare cases). For all the reasons that the anchor table was
acceptable for hard links, though (hard link rarity, parallel link
patterns), I can live with it.
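
To make the "1 lookup to (depth of tree) lookups" range concrete, here is a small sketch. It assumes each inode's first object carries a backpointer holding only its immediate parent ino and dentry name, and that the walk stops as soon as it reaches an inode the MDS already has in cache. The dict-based store and all names are hypothetical stand-ins, not the on-disk format.

```python
def resolve_path(ino, backpointers, cached):
    """Resolve ino to a path by following parent backpointers.

    backpointers: dict of ino -> (parent_ino, dentry_name); stands in for
        the xattr on the first object of each file or directory.
    cached: dict of ino -> path for inodes already in the MDS cache.
    Returns (path, number_of_backpointer_reads).
    """
    names = []
    reads = 0
    cur = ino
    while cur not in cached:
        if cur not in backpointers:
            raise KeyError("no backpointer for ino %d" % cur)
        parent, name = backpointers[cur]  # one OSD read per hop
        reads += 1
        names.append(name)
        cur = parent
    base = cached[cur]
    if not names:
        return base, reads
    return base.rstrip("/") + "/" + "/".join(reversed(names)), reads
```

With only the root cached, a file three levels down costs three reads; if its parent directory is already cached, it costs one.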
There are also lots of people who seem to be putting BackupPC (or whatever
it is) on Ceph, which is creating huge messes of hard links, so it will be
really good to solve/avoid the current anchor table scaling problems.
sage
* Re: parent xattrs on file objects
[not found] <1743327214.12.1350731614461.JavaMail.root@thunderbeast.private.linuxbox.com>
@ 2012-10-20 12:09 ` Matt W. Benjamin
2012-10-22 21:27 ` Sage Weil
0 siblings, 1 reply; 16+ messages in thread
From: Matt W. Benjamin @ 2012-10-20 12:09 UTC (permalink / raw)
To: Sage Weil
Cc: Gregory Farnum, Casey Bodley, ceph-devel, peter honeyman,
Adam C. Emerson
Hi Sage,
In the interest of timeliness, I'll post a few thoughts now.
----- "Sage Weil" <sage@inktank.com> wrote:
> On Wed, 17 Oct 2012, Adam C. Emerson wrote:
>
> I think this basic approach is viable. However, I'm hesitant to give up
> on embedded inodes because of the huge performance wins in the common
> cases; I'd rather have a more expensive lookup-by-ino in the rare cases
> where nfs filehandles are out of cache and be smokin' fast the rest of
> the time.
Broadly, for us, lookup-by-ino is a fast path. Fast for name lookups but slow for inode lookups seems out of balance.
>
> Are there reasons you're attached to your current approach? Do you see
> problems with a generalized "find this ino" function based on the file
> objects? I like the latter because it
>
> - means we can scrap the anchor table, which needs additional work anyway
> if it is going to scale
> - is generally useful for fsck
> - solves the NFS fh issue
The proposed approach, if I understand it, is costly. It optimizes for some workloads at the definite expense of others. (The side benefits, e.g., to fsck, might completely justify the cost. But "we need it anyway" is only a decisive argument if we've accepted the premise that inode lookups can be slow.)
By contrast, the additional cost our approach adds is small and constant--but we grant, it's in a fast path. For motivation, we solve both the lookup-by-ino and hard link problems much more satisfactorily, as far as I can see.
Obviously, we -hope- we are not sacrificing "smokin' fast" name lookups for (smokin') fast inode lookups. As in UFS, we can make use of caching, bulkstat [which proved to be a huge win in AFS and DFS], and, given Ceph's design, parallelism to make up the gap in what -we hope- would be the actual common case. Of course we might be wrong. We haven't implemented all of that yet. We would probably need to do some actual performance measurement and comparison to be convincing, and we presume we would.
--
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://linuxbox.com
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
* Re: parent xattrs on file objects
2012-10-20 12:09 ` Matt W. Benjamin
@ 2012-10-22 21:27 ` Sage Weil
0 siblings, 0 replies; 16+ messages in thread
From: Sage Weil @ 2012-10-22 21:27 UTC (permalink / raw)
To: Matt W. Benjamin
Cc: Gregory Farnum, Casey Bodley, ceph-devel, peter honeyman,
Adam C. Emerson
On Sat, 20 Oct 2012, Matt W. Benjamin wrote:
> Hi Sage,
>
> In the interest of timeliness, I'll post a few thoughts now.
>
> ----- "Sage Weil" <sage@inktank.com> wrote:
>
> > On Wed, 17 Oct 2012, Adam C. Emerson wrote:
>
> >
> > I think this basic approach is viable. However, I'm hesitant to give up
> > on embedded inodes because of the huge performance wins in the common
> > cases; I'd rather have a more expensive lookup-by-ino in the rare cases
> > where nfs filehandles are out of cache and be smokin' fast the rest of
> > the time.
>
> Broadly, for us, lookup-by-ino is a fast path. Fast for name lookups but
> slow for inode lookups seems out of balance.
And just to make sure I completely understand, this is specifically in
reference to resolving NFS file handles?
My hope is that because it has to fall through both the client and MDS
caches before it goes to the 'slow' path, this isn't such an issue. It's
really the case of clients presenting ancient fhs that is slow. Even
then, the 'normal' pattern would be a single osd op to the osd data
object, followed by a path lookup or two.
In exchange, you get an 'ls -al' that only takes a single IO to fully
populate the cache... but it is hard to say how often the client will need
to resolve an ino it doesn't have in its cache, or how expensive that will
be.
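
The fall-through order described above could be sketched like this. The cache dicts, the read_backpointer callable, and the tier labels are all hypothetical; the sketch only shows that the OSD is touched when both caches miss, and that a resolved result is then cached at both levels.

```python
def find_ino(ino, client_cache, mds_cache, read_backpointer):
    """Resolve ino, trying the cheapest source first.

    client_cache, mds_cache: dicts of ino -> path.
    read_backpointer: callable(ino) -> path; stands in for the OSD op on
        the file's data object plus any follow-up path lookups.
    Returns (path, name_of_tier_that_answered).
    """
    if ino in client_cache:
        return client_cache[ino], "client-cache"
    if ino in mds_cache:
        client_cache[ino] = mds_cache[ino]
        return mds_cache[ino], "mds-cache"
    # Slow path: only reached for handles that fell out of both caches.
    path = read_backpointer(ino)
    mds_cache[ino] = path
    client_cache[ino] = path
    return path, "osd-backpointer"
```

Whether the slow path matters in practice depends on how often a real workload presents handles that miss both caches, which this sketch cannot answer.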
> > Are there reasons you're attached to your current approach? Do you see
> > problems with a generalized "find this ino" function based on the file
> > objects? I like the latter because it
> >
> > - means we can scrap the anchor table, which needs additional work anyway
> > if it is going to scale
> > - is generally useful for fsck
> > - solves the NFS fh issue
>
> The proposed approach, if I understand it, is costly. It optimizes
> for some workloads at the definite expense of others. (The side
> benefits, e.g., to fsck, might completely justify the cost. But
> "we need it anyway" is only a decisive argument if we've accepted
> the premise that inode lookups can be slow.)
>
> By contrast, the additional cost our approach adds is small and
> constant--but we grant, it's in a fast path. For motivation, we solve
> both the lookup-by-ino and hard link problems much more satisfactorily,
> as far as I can see.
>
> Obviously, we -hope- we are not sacrificing "smokin' fast" name lookups
> for (smokin') fast inode lookups. As in UFS, we can make use of
> caching, bulkstat [which proved to be a huge win in AFS and DFS], and
> given Ceph's design, parallelism to make up the gap in what -we hope-
> would be the actual common case. Of course we might be wrong. We
> haven't implemented all of that yet. We would probably need to do
> some actual performance measurement and comparison to be convincing,
> and we presume we would.
Yep. It's hard to make a convincing argument either way without seeing
what performance looks like on the actual workloads you care about.
I think we will continue to implement the file backpointers, since it will
be useful for fsck regardless, and then we'll be in a position to
experiment with how fast/slow it is in practice.
sage