[Lustre-devel] Replication

* [Lustre-devel] Replication
@ 2008-05-05 14:41 Peter Braam
  0 siblings, 0 replies; 4+ messages in thread
From: Peter Braam @ 2008-05-05 14:41 UTC (permalink / raw)
  To: lustre-devel

Hi Nathan -

I talked through the design with Nikita.  Once he understood our
constraints and I understood his issues, it all narrowed down to one
important improvement that Nikita suggests:  we need a fast way to
compute the pathname of a FID.  The scanning and searching I suggested,
without an index, is not tenable.

We had a couple of suggestions, such as storing parent fid and a name in the
EA, or storing similar information in a large directory file.

Can you connect with Nikita and do this?

Peter

* [Lustre-devel] Replication
       [not found] <482098C9.4020403@sun.com>
@ 2008-05-07  5:57 ` Peter Braam
  2008-05-08 14:48   ` Nikita Danilov
  0 siblings, 1 reply; 4+ messages in thread
From: Peter Braam @ 2008-05-07  5:57 UTC (permalink / raw)
  To: lustre-devel

On 5/6/08 11:43 AM, "Nathaniel Rutman" <Nathan.Rutman@Sun.COM> wrote:

> Peter Braam wrote:
>> Hi Nathan -
>> 
>> I talked through the design with Nikita.  After he had understood our
>> constraints and I had understood his issues it all narrowed down to
>> one important improvement that Nikita suggests:  we must get a fast
>> way to compute the pathname of a FID.  The scanning and searching I
>> suggested without an index is not tenable.
>> 
>> We had a couple of suggestions, such as storing parent fid and a name
>> in the EA, or storing similar information in a large directory file.
>> 
>> Can you connect with Nikita and do this?
> 
> We talked yesterday afternoon.
> Nikita has three concerns:
> 
> 1. Global lock on namespace during pathname reconstruction.
> I think we can eliminate this the following way:
> a. lookup full path from fid, parent fid (remember the list of fids for
> the entire path also)
> b. lookup last transno
> c. verify traversing down the full path name results in the same branch
> and leaf fids all the way back down
>  i. if they don't match, repeat from a
>  ii. if they do match, we can backtrack starting from the transno in b
> to regenerate the original name
> 
> 2. Directory name lookup given the parent fid - this may be inefficient
> if we have to read the parent directory in order to get the name (parent
> object is not likely to be cached at lookup time).
> 
> 3. Someone deletes one of the parents of a hardlinked file.  If we only
> store one parent, there's no way to regenerate a pathname if that parent
> is the one that gets removed.
> 
> For 2 and 3, we could store the directory name for each directory in an
> EA, and all the fids for all the parents in some other manner.
> But it seems to make more sense at this point to put all this
> information (fid, name, parent list) in a database file stored on the
> MDT.  Then we just look through this database to generate our full path
> information; no need to lookup info in the file objects or EAs.
> Generating this database should be no more time consuming than writing
> the changelogs themselves, assuming a reasonable database structure like
> IAM.
> 

Yes, I agree with all of this.

Peter
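
Steps a to c in the quoted proposal amount to an optimistic read with
validation instead of a global namespace lock. A minimal Python sketch,
assuming a toy in-memory namespace with a single stored parent per object:
ToyNamespace, walk_up, walk_down, fid_to_path, ROOT_FID and the max_tries
bound are all hypothetical names chosen for illustration, not Lustre code,
and the replay from the recorded transno (step c.ii) is left out.

```python
from dataclasses import dataclass, field

ROOT_FID = 1  # hypothetical FID of the filesystem root


@dataclass
class ToyNamespace:
    # child fid -> (parent fid, name): the reverse pointer discussed above
    parent_of: dict = field(default_factory=dict)
    # (parent fid, name) -> child fid: the ordinary forward directory index
    entries: dict = field(default_factory=dict)
    last_transno: int = 0

    def link(self, parent, name, fid):
        self.parent_of[fid] = (parent, name)
        self.entries[(parent, name)] = fid
        self.last_transno += 1

    def rename(self, fid, new_parent, new_name):
        del self.entries[self.parent_of[fid]]
        self.link(new_parent, new_name, fid)


def walk_up(ns, fid):
    """Step a: follow parent pointers to the root, keeping names and fids."""
    names, fids = [], []
    while fid != ROOT_FID:
        parent, name = ns.parent_of[fid]
        names.append(name)
        fids.append(fid)
        fid = parent
    names.reverse()
    fids.reverse()
    return names, fids


def walk_down(ns, names):
    """Step c: resolve the names from the root; None if a component is gone."""
    fid, fids = ROOT_FID, []
    for name in names:
        fid = ns.entries.get((fid, name))
        if fid is None:
            return None
        fids.append(fid)
    return fids


def fid_to_path(ns, fid, max_tries=8):
    for _ in range(max_tries):
        names, fids = walk_up(ns, fid)      # step a
        transno = ns.last_transno           # step b
        if walk_down(ns, names) == fids:    # step c
            return "/" + "/".join(names), transno
        # step c.i: the namespace changed underneath us, so start over
    raise RuntimeError("namespace changed on every attempt")


ns = ToyNamespace()
ns.link(ROOT_FID, "a", 2)
ns.link(2, "b", 3)
ns.link(3, "f", 4)
print(fid_to_path(ns, 4))
```

In this single-threaded toy the retry branch never fires; it only matters
when renames can run between the upward and downward walks.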


* [Lustre-devel] Replication
  2008-05-07  5:57 ` [Lustre-devel] Replication Peter Braam
@ 2008-05-08 14:48   ` Nikita Danilov
  2008-05-08 14:57     ` Peter Braam
  0 siblings, 1 reply; 4+ messages in thread
From: Nikita Danilov @ 2008-05-08 14:48 UTC (permalink / raw)
  To: lustre-devel

Peter Braam writes:
 > On 5/6/08 11:43 AM, "Nathaniel Rutman" <Nathan.Rutman@Sun.COM> wrote:

[...]

 > > 
 > > For 2 and 3, we could store the directory name for each directory in an
 > > EA, and all the fids for all the parents in some other manner.
 > > But it seems to make more sense at this point to put all this
 > > information (fid, name, parent list) in a database file stored on the
 > > MDT.  Then we just look through this database to generate our full path

One advantage an EA has over a global database is that it is more
resilient against file system corruption. This becomes more important if
we ever plan to use (parent-fid, name) information for things like fsck.

 > > information; no need to lookup info in the file objects or EAs.
 > > Generating this database should be no more time consuming than writing
 > > the changelogs themselves, assuming a reasonable database structure like
 > > IAM.
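
To make the quoted (fid, name, parent list) idea concrete, here is a small
Python sketch of such a table. It is only an illustration under stated
assumptions: LinkTable, add_link, remove_link, paths and ROOT_FID are
hypothetical names, a plain in-memory dict stands in for an IAM-like index,
and nothing here reflects an actual on-disk format. Keeping every
(parent fid, name) pair means a hard-linked file still has a recoverable
path after one of its links is removed, and a name is found without
reading the parent directory.

```python
from collections import defaultdict

ROOT_FID = 1  # hypothetical FID of the filesystem root


class LinkTable:
    """Toy stand-in for the proposed database: fid -> {(parent fid, name)}."""

    def __init__(self):
        self.links = defaultdict(set)

    def add_link(self, fid, parent, name):
        self.links[fid].add((parent, name))

    def remove_link(self, fid, parent, name):
        self.links[fid].discard((parent, name))
        if not self.links[fid]:
            del self.links[fid]

    def paths(self, fid):
        """Return every pathname that currently reaches fid."""
        if fid == ROOT_FID:
            return ["/"]
        found = []
        for parent, name in sorted(self.links.get(fid, ())):
            for prefix in self.paths(parent):
                found.append(prefix.rstrip("/") + "/" + name)
        return found


table = LinkTable()
table.add_link(2, ROOT_FID, "a")   # directory /a
table.add_link(3, ROOT_FID, "b")   # directory /b
table.add_link(4, 2, "x")          # file linked as /a/x ...
table.add_link(4, 3, "y")          # ... and hard-linked as /b/y
print(table.paths(4))
table.remove_link(4, 2, "x")       # drop one link; the other path remains
print(table.paths(4))
```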

On a lower-level note, I think that changelogs and the parent database
would be better implemented as a new layer, separate from mdd:

    - mdd code is already complicated enough,

    - a separate layer can be inserted into the stack optionally, avoiding
    run-time cost if changelogs are not needed (currently there is no
    way to insert a layer after initial configuration completes, though).
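
The optional-layer argument can be illustrated with a small Python sketch.
It assumes nothing about the real Lustre device stack: MetadataLayer,
ChangelogLayer and build_stack are hypothetical names, and the only point
shown is that when the logging layer is left out of the stack, operations
reach the lower layer directly and pay no logging cost.

```python
class MetadataLayer:
    """Hypothetical stand-in for the lower metadata layer."""

    def __init__(self):
        self.names = {}

    def create(self, parent, name, fid):
        self.names[(parent, name)] = fid
        return fid


class ChangelogLayer:
    """Hypothetical wrapper: forwards each call, then records it."""

    def __init__(self, lower):
        self.lower = lower
        self.records = []

    def create(self, parent, name, fid):
        result = self.lower.create(parent, name, fid)
        self.records.append(("create", parent, name, fid))
        return result


def build_stack(changelog_enabled):
    """Assemble the stack once, at configuration time."""
    base = MetadataLayer()
    return ChangelogLayer(base) if changelog_enabled else base


logged = build_stack(changelog_enabled=True)
logged.create(1, "a", 2)
print(logged.records)
```

As in the remark above, the choice is made when the stack is built; this
sketch does not attempt to insert a layer into a running stack.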

 > > 
 > 
 > Yes I agree with all of this.
 > 
 > Peter
 > 

Nikita.


* [Lustre-devel] Replication
  2008-05-08 14:48   ` Nikita Danilov
@ 2008-05-08 14:57     ` Peter Braam
  0 siblings, 0 replies; 4+ messages in thread
From: Peter Braam @ 2008-05-08 14:57 UTC (permalink / raw)
  To: lustre-devel

On 5/8/08 8:48 AM, "Nikita Danilov" <Nikita.Danilov@Sun.COM> wrote:

> Peter Braam writes:
>> On 5/6/08 11:43 AM, "Nathaniel Rutman" <Nathan.Rutman@Sun.COM> wrote:
> 
> [...]
> 
>>> 
>>> For 2 and 3, we could store the directory name for each directory in an
>>> EA, and all the fids for all the parents in some other manner.
>>> But it seems to make more sense at this point to put all this
>>> information (fid, name, parent list) in a database file stored on the
>>> MDT.  Then we just look through this database to generate our full path
> 
> One advantage EA has over global data-base is that the former is more
> resilient against file system corruption. This becomes more important if
> we ever plan to use (parent-fid, name) information for things like fsck.
> 
>>> information; no need to lookup info in the file objects or EAs.
>>> Generating this database should be no more time consuming than writing
>>> the changelogs themselves, assuming a reasonable database structure like
>>> IAM.
> 
> On a lower level note, I think that changelogs and parent-database are
> better to be implemented as a new layer separate from mdd:
> 
>     - mdd code is already complicated enough,
> 
>     - separate layer can be inserted into stack optionally, avoiding
>     run-time cost if change-logs are not needed (currently there is no
>     way to insert a layer after initial configuration completes though).

Yes, find a good place.

Just remember that things like pNFS integrated with the Lustre servers also
need to replicate.  In fact having this log purely at the DMU / ZFS level
would be a valuable feature - there are no good replication solutions even
for laptops today!

Peter

