* [Lustre-devel] Doubly indexed tree / changelogs
@ 2008-09-21 23:40 Peter Braam
2008-09-22 5:52 ` Alex Zhuravlev
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Peter Braam @ 2008-09-21 23:40 UTC (permalink / raw)
To: lustre-devel
Hi Nikita, Nathan -
After some pondering I have come to two conclusions.
To encode filesets, we need a tree that makes two iterations fast:
1. list all filesets that contain a certain object
2. list all objects in a certain fileset
Is there a doubly indexed tree for this?
Secondly, to make the changelogs useful and scalable for filesets we will
need to be able to list all changelog entries associated with a certain
inode efficiently. I see two ways to do this ? one is an auxiliary
directory file mapping inodes to many changelog entries, the second is to
embed forward and backward pointers in the changelog entries to build a
linked list rooted at the inode (using an EA in the inode pointing to the
first and last element of the list). Both have some overheads. What are
your thoughts?
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080922/8c54f19d/attachment.htm>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-21 23:40 [Lustre-devel] Doubly indexed tree / changelogs Peter Braam
@ 2008-09-22 5:52 ` Alex Zhuravlev
2008-09-22 6:58 ` Peter Braam
2008-09-23 3:49 ` Nathaniel Rutman
2008-09-23 7:38 ` Nikita Danilov
2 siblings, 1 reply; 12+ messages in thread
From: Alex Zhuravlev @ 2008-09-22 5:52 UTC (permalink / raw)
To: lustre-devel
can object migrate between filesets? if not, we probably
can use fid's sequence as a record in that index?
thanks, Alex
Peter Braam wrote:
> Hi Nikita, Nathan -
>
> After some pondering I have come to two conclusions.
>
> To encode filesets, we need a tree that makes two iterations fast:
>
> 1. list all filesets that contain a certain object
> 2. list all objects in a certain fileset
>
>
> Is there a doubly indexed tree for this?
>
> Secondly, to make the changelogs useful and scalable for filesets we
> will need to be able to list all changelog entries associated with a
> certain inode efficiently. I see two ways to do this ? one is an
> auxiliary directory file mapping inodes to many changelog entries, the
> second is to embed forward and backward pointers in the changelog
> entries to build a linked list rooted at the inode (using an EA in the
> inode pointing to the first and last element of the list). Both have
> some overheads. What are your thoughts?
>
> Peter
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-22 5:52 ` Alex Zhuravlev
@ 2008-09-22 6:58 ` Peter Braam
2008-09-22 7:05 ` Alex Zhuravlev
0 siblings, 1 reply; 12+ messages in thread
From: Peter Braam @ 2008-09-22 6:58 UTC (permalink / raw)
To: lustre-devel
Objects can be in many filesets, and be added to some, removed from others.
I think this is an almost arbitrary collection of pairs (FID, fileset-id)
and we need it indexed by both.
Peter
On 9/22/08 1:52 PM, "Alex Zhuravlev" <Alex.Zhuravlev@Sun.COM> wrote:
> can object migrate between filesets? if not, we probably
> can use fid's sequence as a record in that index?
>
> thanks, Alex
>
> Peter Braam wrote:
>> Hi Nikita, Nathan -
>>
>> After some pondering I have come to two conclusions.
>>
>> To encode filesets, we need a tree that makes two iterations fast:
>>
>> 1. list all filesets that contain a certain object
>> 2. list all objects in a certain fileset
>>
>>
>> Is there a doubly indexed tree for this?
>>
>> Secondly, to make the changelogs useful and scalable for filesets we
>> will need to be able to list all changelog entries associated with a
>> certain inode efficiently. I see two ways to do this ? one is an
>> auxiliary directory file mapping inodes to many changelog entries, the
>> second is to embed forward and backward pointers in the changelog
>> entries to build a linked list rooted at the inode (using an EA in the
>> inode pointing to the first and last element of the list). Both have
>> some overheads. What are your thoughts?
>>
>> Peter
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-22 6:58 ` Peter Braam
@ 2008-09-22 7:05 ` Alex Zhuravlev
2008-09-22 7:13 ` Peter Braam
0 siblings, 1 reply; 12+ messages in thread
From: Alex Zhuravlev @ 2008-09-22 7:05 UTC (permalink / raw)
To: lustre-devel
IMHO, it'd be useful to insert aggregations into that index.
just to keep the index small. say, subtree?
thanks, Alex
Peter Braam wrote:
> Objects can be in many filesets, and be added to some, removed from others.
>
> I think this is an almost arbitrary collection of pairs (FID, fileset-id)
> and we need it indexed by both.
>
> Peter
>
>
> On 9/22/08 1:52 PM, "Alex Zhuravlev" <Alex.Zhuravlev@Sun.COM> wrote:
>
>> can object migrate between filesets? if not, we probably
>> can use fid's sequence as a record in that index?
>>
>> thanks, Alex
>>
>> Peter Braam wrote:
>>> Hi Nikita, Nathan -
>>>
>>> After some pondering I have come to two conclusions.
>>>
>>> To encode filesets, we need a tree that makes two iterations fast:
>>>
>>> 1. list all filesets that contain a certain object
>>> 2. list all objects in a certain fileset
>>>
>>>
>>> Is there a doubly indexed tree for this?
>>>
>>> Secondly, to make the changelogs useful and scalable for filesets we
>>> will need to be able to list all changelog entries associated with a
>>> certain inode efficiently. I see two ways to do this ? one is an
>>> auxiliary directory file mapping inodes to many changelog entries, the
>>> second is to embed forward and backward pointers in the changelog
>>> entries to build a linked list rooted at the inode (using an EA in the
>>> inode pointing to the first and last element of the list). Both have
>>> some overheads. What are your thoughts?
>>>
>>> Peter
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Lustre-devel mailing list
>>> Lustre-devel at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-22 7:05 ` Alex Zhuravlev
@ 2008-09-22 7:13 ` Peter Braam
2008-09-22 7:26 ` Alex Zhuravlev
0 siblings, 1 reply; 12+ messages in thread
From: Peter Braam @ 2008-09-22 7:13 UTC (permalink / raw)
To: lustre-devel
Sure, when aggregations apply. But they do not apply in general (e.g.
Filesets that are search results) and we need a doubly indexed tree for
that.
Hence my question, what doubly indexed trees exist?
Peter
On 9/22/08 3:05 PM, "Alex Zhuravlev" <Alex.Zhuravlev@Sun.COM> wrote:
> IMHO, it'd be useful to insert aggregations into that index.
> just to keep the index small. say, subtree?
>
> thanks, Alex
>
> Peter Braam wrote:
>> Objects can be in many filesets, and be added to some, removed from others.
>>
>> I think this is an almost arbitrary collection of pairs (FID, fileset-id)
>> and we need it indexed by both.
>>
>> Peter
>>
>>
>> On 9/22/08 1:52 PM, "Alex Zhuravlev" <Alex.Zhuravlev@Sun.COM> wrote:
>>
>>> can object migrate between filesets? if not, we probably
>>> can use fid's sequence as a record in that index?
>>>
>>> thanks, Alex
>>>
>>> Peter Braam wrote:
>>>> Hi Nikita, Nathan -
>>>>
>>>> After some pondering I have come to two conclusions.
>>>>
>>>> To encode filesets, we need a tree that makes two iterations fast:
>>>>
>>>> 1. list all filesets that contain a certain object
>>>> 2. list all objects in a certain fileset
>>>>
>>>>
>>>> Is there a doubly indexed tree for this?
>>>>
>>>> Secondly, to make the changelogs useful and scalable for filesets we
>>>> will need to be able to list all changelog entries associated with a
>>>> certain inode efficiently. I see two ways to do this ? one is an
>>>> auxiliary directory file mapping inodes to many changelog entries, the
>>>> second is to embed forward and backward pointers in the changelog
>>>> entries to build a linked list rooted at the inode (using an EA in the
>>>> inode pointing to the first and last element of the list). Both have
>>>> some overheads. What are your thoughts?
>>>>
>>>> Peter
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> Lustre-devel mailing list
>>>> Lustre-devel at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>
>>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-22 7:13 ` Peter Braam
@ 2008-09-22 7:26 ` Alex Zhuravlev
0 siblings, 0 replies; 12+ messages in thread
From: Alex Zhuravlev @ 2008-09-22 7:26 UTC (permalink / raw)
To: lustre-devel
Peter Braam wrote:
> Sure, when aggregations apply. But they do not apply in general (e.g.
> Filesets that are search results) and we need a doubly indexed tree for
> that.
ah, clear enough
> Hence my question, what doubly indexed trees exist?
there is K-D tree, but I'm not sure it fits here.
if number of filesets is limited, then we probably could build a table of
all possible filesets ovelapping and put table's index into inode? this is
for reverse mapping to find all filesystems for given inode.
thanks, Alex
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-21 23:40 [Lustre-devel] Doubly indexed tree / changelogs Peter Braam
2008-09-22 5:52 ` Alex Zhuravlev
@ 2008-09-23 3:49 ` Nathaniel Rutman
2008-09-23 9:20 ` Peter Braam
2008-09-23 7:38 ` Nikita Danilov
2 siblings, 1 reply; 12+ messages in thread
From: Nathaniel Rutman @ 2008-09-23 3:49 UTC (permalink / raw)
To: lustre-devel
I actually added a "previous record" pointer in each changelog entry,
but fill it in only where it is cheap -- when the metadata object is
already in the cache I record the last changelog entry there. If it's
not in the cache, I don't know where the last record associated with
that fid is. We could store the last record number with the inode (EA?),
but that would potentially be painful if we are recording e.g. file
open/closes.
Forward pointers are also problematic, in that I don't want to go back
and modify the old record every time a new one is recorded (seems like
this will make the disks very seek-y), and I think maybe we don't need
forward pointers anyhow (use case?). Anyhow, this effectively doubles
the changelog write impact. Maybe that's ok: Manoj's measurements put
the changelog overhead at only about 4% using mdsrate.
Peter Braam wrote:
> Hi Nikita, Nathan -
>
> After some pondering I have come to two conclusions.
>
> To encode filesets, we need a tree that makes two iterations fast:
>
> 1. list all filesets that contain a certain object
> 2. list all objects in a certain fileset
>
>
> Is there a doubly indexed tree for this?
>
> Secondly, to make the changelogs useful and scalable for filesets we
> will need to be able to list all changelog entries associated with a
> certain inode efficiently. I see two ways to do this ? one is an
> auxiliary directory file mapping inodes to many changelog entries, the
> second is to embed forward and backward pointers in the changelog
> entries to build a linked list rooted at the inode (using an EA in the
> inode pointing to the first and last element of the list). Both have
> some overheads. What are your thoughts?
>
> Peter
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-21 23:40 [Lustre-devel] Doubly indexed tree / changelogs Peter Braam
2008-09-22 5:52 ` Alex Zhuravlev
2008-09-23 3:49 ` Nathaniel Rutman
@ 2008-09-23 7:38 ` Nikita Danilov
2008-09-24 2:50 ` Peter Braam
2 siblings, 1 reply; 12+ messages in thread
From: Nikita Danilov @ 2008-09-23 7:38 UTC (permalink / raw)
To: lustre-devel
Peter Braam writes:
> Hi Nikita, Nathan -
>
> After some pondering I have come to two conclusions.
>
> To encode filesets, we need a tree that makes two iterations fast:
>
> 1. list all filesets that contain a certain object
> 2. list all objects in a certain fileset
>
> Is there a doubly indexed tree for this?
I don't know of a data-structure that provides ready solution for this.
As Alex pointed out k-d-trees do not fit there (they effectively use a
key composed of the alternating sequences of bits of the original keys),
and `geographical trees' that data bases use to store 2-d data use very
special notion of proximity.
But, thinking about a fileset as a generalized directory, isn't this the
same problem as `list all files in directory' and `find all parent
directories of a file'? It probably makes sense to use the same
mechanism to deal with directories and filesets.
Our current solution is to use EA to keep track of all parent
directories, do you see any problems with applying it to filesets?
>
> Peter
Nikita.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-23 3:49 ` Nathaniel Rutman
@ 2008-09-23 9:20 ` Peter Braam
2008-09-23 21:46 ` Nathaniel Rutman
0 siblings, 1 reply; 12+ messages in thread
From: Peter Braam @ 2008-09-23 9:20 UTC (permalink / raw)
To: lustre-devel
On 9/23/08 11:49 AM, "Nathaniel Rutman" <Nathan.Rutman@Sun.COM> wrote:
> I actually added a "previous record" pointer in each changelog entry,
> but fill it in only where it is cheap -- when the metadata object is
> already in the cache I record the last changelog entry there. If it's
> not in the cache, I don't know where the last record associated with
> that fid is. We could store the last record number with the inode (EA?),
> but that would potentially be painful if we are recording e.g. file
> open/closes.
Previous records are free - you get the previous one from the EA in the
inode, and replace the inode with the record info of the record you are
adding. But for rename operations and others there are multiple pointers
like this needed.
> Forward pointers are also problematic, in that I don't want to go back
> and modify the old record every time a new one is recorded (seems like
> this will make the disks very seek-y), and I think maybe we don't need
> forward pointers anyhow (use case?). Anyhow, this effectively doubles
> the changelog write impact. Maybe that's ok: Manoj's measurements put
> the changelog overhead at only about 4% using mdsrate.
Wow - that is amazingly low.
It is better to think about it before hacking it in I think.
Peter
>
> Peter Braam wrote:
>> Hi Nikita, Nathan -
>>
>> After some pondering I have come to two conclusions.
>>
>> To encode filesets, we need a tree that makes two iterations fast:
>>
>> 1. list all filesets that contain a certain object
>> 2. list all objects in a certain fileset
>>
>>
>> Is there a doubly indexed tree for this?
>>
>> Secondly, to make the changelogs useful and scalable for filesets we
>> will need to be able to list all changelog entries associated with a
>> certain inode efficiently. I see two ways to do this ? one is an
>> auxiliary directory file mapping inodes to many changelog entries, the
>> second is to embed forward and backward pointers in the changelog
>> entries to build a linked list rooted at the inode (using an EA in the
>> inode pointing to the first and last element of the list). Both have
>> some overheads. What are your thoughts?
>>
>> Peter
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-23 9:20 ` Peter Braam
@ 2008-09-23 21:46 ` Nathaniel Rutman
2008-09-23 22:48 ` Peter Braam
0 siblings, 1 reply; 12+ messages in thread
From: Nathaniel Rutman @ 2008-09-23 21:46 UTC (permalink / raw)
To: lustre-devel
Peter Braam wrote:
>
> On 9/23/08 11:49 AM, "Nathaniel Rutman" <Nathan.Rutman@Sun.COM> wrote:
>
>
>> I actually added a "previous record" pointer in each changelog entry,
>> but fill it in only where it is cheap -- when the metadata object is
>> already in the cache I record the last changelog entry there. If it's
>> not in the cache, I don't know where the last record associated with
>> that fid is. We could store the last record number with the inode (EA?),
>> but that would potentially be painful if we are recording e.g. file
>> open/closes.
>>
>
> Previous records are free - you get the previous one from the EA in the
> inode, and replace the inode with the record info of the record you are
> adding. But for rename operations and others there are multiple pointers
> like this needed.
>
Currently I'm not reading or writing any EA for the changelog. Yes, if
you want to tie in the fwd/back ptrs to the inode itself we need to do
this, but I thought we were specifically discussing alternatives to
doing that here (e.g. "auxiliary directory file mapping inodes to many
changelog entries".) If we are e.g. recording every open/close for a
file, do we really want to read/write the EA on the MDT every time, in
addition to the changelog llog entry?
> Secondly, to make the changelogs useful and scalable for filesets we
> will need to be able to list all changelog entries associated with a
> certain inode efficiently. I see two ways to do this ? one is an
> auxiliary directory file mapping inodes to many changelog entries, the
> second is to embed forward and backward pointers in the changelog
> entries to build a linked list rooted at the inode (using an EA in the
> inode pointing to the first and last element of the list). Both have
> some overheads. What are your thoughts?
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-23 21:46 ` Nathaniel Rutman
@ 2008-09-23 22:48 ` Peter Braam
0 siblings, 0 replies; 12+ messages in thread
From: Peter Braam @ 2008-09-23 22:48 UTC (permalink / raw)
To: lustre-devel
On 9/24/08 5:46 AM, "Nathaniel Rutman" <Nathan.Rutman@Sun.COM> wrote:
> Peter Braam wrote:
>>
>> On 9/23/08 11:49 AM, "Nathaniel Rutman" <Nathan.Rutman@Sun.COM> wrote:
>>
>>
>>> I actually added a "previous record" pointer in each changelog entry,
>>> but fill it in only where it is cheap -- when the metadata object is
>>> already in the cache I record the last changelog entry there. If it's
>>> not in the cache, I don't know where the last record associated with
>>> that fid is. We could store the last record number with the inode (EA?),
>>> but that would potentially be painful if we are recording e.g. file
>>> open/closes.
>>>
>>
>> Previous records are free - you get the previous one from the EA in the
>> inode, and replace the inode with the record info of the record you are
>> adding. But for rename operations and others there are multiple pointers
>> like this needed.
>>
> Currently I'm not reading or writing any EA for the changelog. Yes, if
> you want to tie in the fwd/back ptrs to the inode itself we need to do
> this, but I thought we were specifically discussing alternatives to
> doing that here (e.g. "auxiliary directory file mapping inodes to many
> changelog entries".)
Good point.
> If we are e.g. recording every open/close for a
> file, do we really want to read/write the EA on the MDT every time, in
> addition to the changelog llog entry?
You are writing that inode anyway, so it doesn't cost more I/O if the EA is
embedded.
Peter
>
>> Secondly, to make the changelogs useful and scalable for filesets we
>> will need to be able to list all changelog entries associated with a
>> certain inode efficiently. I see two ways to do this ? one is an
>> auxiliary directory file mapping inodes to many changelog entries, the
>> second is to embed forward and backward pointers in the changelog
>> entries to build a linked list rooted at the inode (using an EA in the
>> inode pointing to the first and last element of the list). Both have
>> some overheads. What are your thoughts?
>>
>>
>
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Lustre-devel] Doubly indexed tree / changelogs
2008-09-23 7:38 ` Nikita Danilov
@ 2008-09-24 2:50 ` Peter Braam
0 siblings, 0 replies; 12+ messages in thread
From: Peter Braam @ 2008-09-24 2:50 UTC (permalink / raw)
To: lustre-devel
The idea is ok, but scale may turn out to be different than expected.
There could be a huge amount of filesets containing the word "Bin Laden",
say 100,000 of them.
Would two directories work?
Peter
On 9/23/08 3:38 PM, "Nikita Danilov" <Nikita.Danilov@Sun.COM> wrote:
> Peter Braam writes:
>> Hi Nikita, Nathan -
>>
>> After some pondering I have come to two conclusions.
>>
>> To encode filesets, we need a tree that makes two iterations fast:
>>
>> 1. list all filesets that contain a certain object
>> 2. list all objects in a certain fileset
>>
>> Is there a doubly indexed tree for this?
>
> I don't know of a data-structure that provides ready solution for this.
> As Alex pointed out k-d-trees do not fit there (they effectively use a
> key composed of the alternating sequences of bits of the original keys),
> and `geographical trees' that data bases use to store 2-d data use very
> special notion of proximity.
>
> But, thinking about a fileset as a generalized directory, isn't this the
> same problem as `list all files in directory' and `find all parent
> directories of a file'? It probably makes sense to use the same
> mechanism to deal with directories and filesets.
>
> Our current solution is to use EA to keep track of all parent
> directories, do you see any problems with applying it to filesets?
>
>>
>> Peter
>
> Nikita.
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-09-24 2:50 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-09-21 23:40 [Lustre-devel] Doubly indexed tree / changelogs Peter Braam
2008-09-22 5:52 ` Alex Zhuravlev
2008-09-22 6:58 ` Peter Braam
2008-09-22 7:05 ` Alex Zhuravlev
2008-09-22 7:13 ` Peter Braam
2008-09-22 7:26 ` Alex Zhuravlev
2008-09-23 3:49 ` Nathaniel Rutman
2008-09-23 9:20 ` Peter Braam
2008-09-23 21:46 ` Nathaniel Rutman
2008-09-23 22:48 ` Peter Braam
2008-09-23 7:38 ` Nikita Danilov
2008-09-24 2:50 ` Peter Braam
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.