* [Lustre-devel] Lustre HSM HLD draft
@ 2008-02-07 16:19 Rick Matthews
2008-02-08 0:03 ` JC.LAFOUCRIERE at CEA.FR
2008-02-08 15:55 ` Aurelien Degremont
0 siblings, 2 replies; 30+ messages in thread
From: Rick Matthews @ 2008-02-07 16:19 UTC (permalink / raw)
To: lustre-devel
All,
I'm new to this list, so I'll start with apologies. My Lustre
background is
also limited; a situation I hope to fix.
As part of the Solaris Software Archiving group, I was asked to review
the HSM HLD
by my management. That review was sent to Peter Bojanic. He suggested I
get involved in
the community discussion.
This is a posting of my original response, based on a copy of the HLD
which seems to
be the one posted. I've made a couple of minor corrections.
Page 1, 1, Define coordinator (space coordinator?),
define agent, (condense Part II intro, page 14)
(for me, MDT, MGS and OST)
Page 8, 3.8, "use" not "used" in second sentence
Page 9, 3.8.2 et.al., "precised" (maybe, explicit or precise)
Page 9, 3.8.4, Lustre ID "if" no path
Page 10, 4.1, 1) When archived? (probably in Space Manager portion)
SAM-QFS archives well ahead of space need.
4) External object reference must be unusable, until 5.
4.2, 2) Implies only one copy per "version"...bad idea
Page 12, 5.3, Last Sentence, This enables, not This ables
6.1, 100,000 migrations make current migration list operations
problematic (lets say want to move last migration to
be next migration).
Page 13, Lustre object mtime may not be good enough. There are several
mechanisms (like touch) to manipulate mtime, which makes it
unusable as a last written time.
Page 15, a variant on 1.5, ask for/return last valid byte offset
(perhaps within a range).
Page 19, Special Path, does this boil down to invisible I/O?
Page 23, 2.3 and 2.4, I'm assuming that lists of tuples can be processed
in any order.
Page 25, 1, Punch - becomes "sparse" not "spare"
I think this spec needs to be more consistent with its use of data range.
It is confusing as laid out.
Page 26, 3.2 space will be exhausted, or space will be low, not space
will be
missing.
Page 28, protection of Lustre extended attributes?
Issues:
The Space manager is likely the most important piece. There is no
detail on it. This is where archive and other policy is enforced.
The described HSM seems to follow the "copy out" when space needed,
then purge, model. This function (a Space Manager function) is
contrary
to SAM, and a shortfall of many HSMs.
File/object association is an important component of SAM.
For example, if I access a file in a source tree, I'm likely
to access the others as well.
The purge (3.2, Space manager needs to make room) and 4.1
"needs to be atomic" is a complex operations. Sequencing is
important.
Coordination between agents seems important. For example,
if agents requested new copy-outs on objects striped on
10 different stores, ordering them on tape seems difficult.
What is the backup story for Lustre? How does that play with
the HSM?
--
---------------------------------------------------------------------
Rick Matthews email: Rick.Matthews at sun.com
Sun Microsystems, Inc. phone:+1(651) 554-1518
1270 Eagan Industrial Road phone(internal): 54418
Suite 160 fax: +1(651) 554-1540
Eagan, MN 55121-1231 USA main: +1(651) 554-1500
---------------------------------------------------------------------
^ permalink raw reply [flat|nested] 30+ messages in thread* [Lustre-devel] Lustre HSM HLD draft
2008-02-07 16:19 [Lustre-devel] Lustre HSM HLD draft Rick Matthews
@ 2008-02-08 0:03 ` JC.LAFOUCRIERE at CEA.FR
2008-02-08 11:52 ` Rick Matthews
2008-02-08 15:55 ` Aurelien Degremont
1 sibling, 1 reply; 30+ messages in thread
From: JC.LAFOUCRIERE at CEA.FR @ 2008-02-08 0:03 UTC (permalink / raw)
To: lustre-devel
Hello
thank you for your review, I add some comments in the following
Page 1, 1, Define coordinator (space coordinator?),
define agent, (condense Part II intro, page 14)
(for me, MDT, MGS and OST)
These are defined in the arch wiki pages
Page 10,
4.2, 2) Implies only one copy per "version"...bad idea
Different versions correspond to different files in the external storage. We take the more recent.
Not sure I understand your remark
Page 13, Lustre object mtime may not be good enough. There are several
mechanisms (like touch) to manipulate mtime, which makes it
unusable as a last written time.
If a user make a touch in the past this change the mtime and can hide previous writes.
If we want to keep real write time we need to add a new time field in Lustre backend
(may be ZFS has it)
Page 19, Special Path, does this boil down to invisible I/O?
The path is /mnt_mount/.lustre/fid/FID_NUMBER. When a file is open through this path a
flag is carried to the OSS to avoid copy in trigger (this used to fill the file)
Page 23, 2.3 and 2.4, I'm assuming that lists of tuples can be processed
in any order.
yes
Issues:
The Space manager is likely the most important piece. There is no
detail on it. This is where archive and other policy is enforced.
The space manager is based on changelogs/feed Lustre feature which are very new (draft HLD has just been
published). This is why it not described at this time.
The described HSM seems to follow the "copy out" when space needed,
then purge, model. This function (a Space Manager function) is contrary
to SAM, and a shortfall of many HSMs.
no spacemanger is doing pre-migration and when free space is needed, it only has to make punc
Coordination between agents seems important. For example,
if agents requested new copy-outs on objects striped on
10 different stores, ordering them on tape seems difficult.
Tape access optimization has to be made by the archival system. We try to put as few external storage knowledge
as possible in Lustre to be external storage independant.
What is the backup story for Lustre? How does that play with
the HSM?
HSM do not backup the namespace. It has to be done with a separate tool like a MDT scannner.
The copy tool can use the FID2PATH() function to save the object pathname with the file.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080208/ac88ef54/attachment.htm>
^ permalink raw reply [flat|nested] 30+ messages in thread* [Lustre-devel] Lustre HSM HLD draft
2008-02-08 0:03 ` JC.LAFOUCRIERE at CEA.FR
@ 2008-02-08 11:52 ` Rick Matthews
0 siblings, 0 replies; 30+ messages in thread
From: Rick Matthews @ 2008-02-08 11:52 UTC (permalink / raw)
To: lustre-devel
JC.LAFOUCRIERE at CEA.FR wrote:
Thanks for allowing me to participate.
> Hello
>
> thank you for your review, I add some comments in the following
>
> Page 1, 1, Define coordinator (space coordinator?),
> define agent, (condense Part II intro, page 14)
> (for me, MDT, MGS and OST)
> These are defined in the arch wiki pages
>
Thank you, I still haven't got to them yet...but plan to.
> Page 10,
> 4.2, 2) Implies only one copy per "version"...bad idea
> Different versions correspond to different files in the external storage. We take the more recent.
> Not sure I understand your remark
>
A basic mantra of SAM-QFS and other data retention systems is that one
image of the data is vulnerable (a tape breaks,
or is otherwise overwritten). While the archival system can be
responsible for making multiple identical images, it
can still represent a single point of failure. Note: I am using version
to represent a point in time image of the files data,
and copy to represent an image of that version. (See LOCKSS for
additional references on copies).
> Page 13, Lustre object mtime may not be good enough. There are several
> mechanisms (like touch) to manipulate mtime, which makes it
> unusable as a last written time.
> If a user make a touch in the past this change the mtime and can hide previous writes.
> If we want to keep real write time we need to add a new time field in Lustre backend
> (may be ZFS has it)
>
What the archival system needs to know is that the copy previously made
(or a first copy need to be made),
which seems to be triggered by a user (not archive or other - like
restore) write operation.
> Page 19, Special Path, does this boil down to invisible I/O?
> The path is /mnt_mount/.lustre/fid/FID_NUMBER. When a file is open through this path a
> flag is carried to the OSS to avoid copy in trigger (this used to fill the file)
>
> Page 23, 2.3 and 2.4, I'm assuming that lists of tuples can be processed
> in any order.
> yes
>
> Issues:
> The Space manager is likely the most important piece. There is no
> detail on it. This is where archive and other policy is enforced.
> The space manager is based on changelogs/feed Lustre feature which are very new (draft HLD has just been
> published). This is why it not described at this time.
>
OK...also consider using change logs as a trigger for need of a new
archive version (not copy). Alleviates the mtime issue above.
> The described HSM seems to follow the "copy out" when space needed,
> then purge, model. This function (a Space Manager function) is contrary
> to SAM, and a shortfall of many HSMs.
> no spacemanger is doing pre-migration and when free space is needed, it only has to make punc
>
OK, so who schedules the pre-migration to the archive system?
> Coordination between agents seems important. For example,
> if agents requested new copy-outs on objects striped on
> 10 different stores, ordering them on tape seems difficult.
> Tape access optimization has to be made by the archival system. We try to put as few external storage knowledge
> as possible in Lustre to be external storage independant.
>
The isolation between archive system and file system is (to me) a good
idea. I'd just like you to
consider that the recall (stage-in) events can be optimized. At least,
make sure the archive system
is allowed to reorder as needed (hence the async - list of tuples in any
order - question above).
Think of other association between files to live storage as 1) a
pre-stage operation, or 2)
a disk cache pre-fetch operation. I hope I'm using understandable words ;>)
> What is the backup story for Lustre? How does that play with
> the HSM?
> HSM do not backup the namespace. It has to be done with a separate tool like a MDT scannner.
> The copy tool can use the FID2PATH() function to save the object pathname with the file.
>
>
One point here is that an HSM + namespace/metadata backup + unarchived
data capture can be used to be a
nearly continuous backup operation with a relatively tiny backup window.
--
---------------------------------------------------------------------
Rick Matthews email: Rick.Matthews at sun.com
Sun Microsystems, Inc. phone:+1(651) 554-1518
1270 Eagan Industrial Road phone(internal): 54418
Suite 160 fax: +1(651) 554-1540
Eagan, MN 55121-1231 USA main: +1(651) 554-1500
---------------------------------------------------------------------
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-07 16:19 [Lustre-devel] Lustre HSM HLD draft Rick Matthews
2008-02-08 0:03 ` JC.LAFOUCRIERE at CEA.FR
@ 2008-02-08 15:55 ` Aurelien Degremont
2008-02-11 18:18 ` Andreas Dilger
1 sibling, 1 reply; 30+ messages in thread
From: Aurelien Degremont @ 2008-02-08 15:55 UTC (permalink / raw)
To: lustre-devel
Hello
First of all, thanks for your remarks.
Information explained in the architecture documents from the Arch Wiki
have not been re-explained in the HLD. So some points could be unclear,
but read or check the arch docs first.
If the HLD must be self sufficent or more details are really needed, let
me know.
I will clarify some points anyway in the new document version.
Rick Matthews a ?crit :
> Page 10, 4.1, 1) When archived? (probably in Space Manager portion)
> SAM-QFS archives well ahead of space need.
Concerning the archived copies vunlerability, I'm not sure this is
Lustre responsability to manage several copies of each of its file
versions into the HSM...
> 6.1, 100,000 migrations make current migration list operations
> problematic (lets say want to move last migration to
> be next migration).
You speak about pending migrations ? This is just pointer manipulation.
I do not see a real problem at this level. This value is only
algorithmic indications, not about resources (memory, ...)
But we could decrease this value to 10,000.
> Page 13, Lustre object mtime may not be good enough. There are several
> mechanisms (like touch) to manipulate mtime, which makes it
> unusable as a last written time.
If fact, this value is only needed for user information, not for Lustre
internals. Lustre will based is comparison on the FID version.
The mtime field is used for listing the file copies in the HSM, and as
the lustre fid version is not relevant for the user, will indicates the
associated file date at this time.
(just a quick example, not the final output)
user$ list_hsm_copies ./foo
Storage Date Size Version
============================================
HSM1 Feb 2 2006 1566162 1
HSM1 Jun 18 2007 1423540 2
HSM1 Jun 18 2007 1900051 54
But the touch could be problematic. Lustre gurus, is there another time
field we could use instead ? Should we add a
"last-modification-field-which-ignore-touch" ? Is this really a problem
is we use display a "touched" time ? In this case, we display what the
user set on the file, we suppose he did it in purpose.
> Page 15, a variant on 1.5, ask for/return last valid byte offset
> (perhaps within a range).
Why not... But do you have use cases were the current "Data available"
feature as explained in 1.5 is not sufficent ?
> Page 28, protection of Lustre extended attributes?
I do not see what you mean.
> Issues:
> The purge (3.2, Space manager needs to make room) and 4.1
> "needs to be atomic" is a complex operations. Sequencing is
> important.
Does "transactionnal" fit ?
I will add a Bugzilla entry and a new updated version the HLD on it,
next Monday.
Regards,
--
Aurelien Degremont
CEA/DAM - DIF/DSSI/SISR
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-08 15:55 ` Aurelien Degremont
@ 2008-02-11 18:18 ` Andreas Dilger
2008-02-11 19:38 ` Peter Braam
2008-02-11 21:11 ` Ricardo M. Correia
0 siblings, 2 replies; 30+ messages in thread
From: Andreas Dilger @ 2008-02-11 18:18 UTC (permalink / raw)
To: lustre-devel
On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
> But the touch could be problematic. Lustre gurus, is there another time
> field we could use instead ? Should we add a
> "last-modification-field-which-ignore-touch" ? Is this really a problem
> is we use display a "touched" time ? In this case, we display what the
> user set on the file, we suppose he did it in purpose.
There was work done in ext4/ldiskfs to add a 64-bit "version" field to
the on-disk inode, for use by lustre and NFSv4. In the ldiskfs case
Lustre was free to store any information in this field it wanted. The
planned use for this field is for "version based recovery" and it has
the semantic that it is an increasing (though not necessarily sequential)
version number that tracks any change to the file. This is stored in
each inode on the MDT and each object on the OSTs.
In ZFS I believe there is also a "last modified transaction group" (txg)
number stored with each dnode that could be used in a similar manner.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-11 18:18 ` Andreas Dilger
@ 2008-02-11 19:38 ` Peter Braam
2008-02-11 21:11 ` Ricardo M. Correia
1 sibling, 0 replies; 30+ messages in thread
From: Peter Braam @ 2008-02-11 19:38 UTC (permalink / raw)
To: lustre-devel
Versions are critical - we need them for multiple things, let's make
sure we get exactly the right thing in ZFS also.
- Peter -
Andreas Dilger wrote:
> On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
>
>> But the touch could be problematic. Lustre gurus, is there another time
>> field we could use instead ? Should we add a
>> "last-modification-field-which-ignore-touch" ? Is this really a problem
>> is we use display a "touched" time ? In this case, we display what the
>> user set on the file, we suppose he did it in purpose.
>>
>
> There was work done in ext4/ldiskfs to add a 64-bit "version" field to
> the on-disk inode, for use by lustre and NFSv4. In the ldiskfs case
> Lustre was free to store any information in this field it wanted. The
> planned use for this field is for "version based recovery" and it has
> the semantic that it is an increasing (though not necessarily sequential)
> version number that tracks any change to the file. This is stored in
> each inode on the MDT and each object on the OSTs.
>
> In ZFS I believe there is also a "last modified transaction group" (txg)
> number stored with each dnode that could be used in a similar manner.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080211/98333d16/attachment.htm>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-11 18:18 ` Andreas Dilger
2008-02-11 19:38 ` Peter Braam
@ 2008-02-11 21:11 ` Ricardo M. Correia
2008-02-11 21:39 ` Andreas Dilger
1 sibling, 1 reply; 30+ messages in thread
From: Ricardo M. Correia @ 2008-02-11 21:11 UTC (permalink / raw)
To: lustre-devel
Hi,
On Seg, 2008-02-11 at 11:18 -0700, Andreas Dilger wrote:
> On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
> > But the touch could be problematic. Lustre gurus, is there another time
> > field we could use instead ? Should we add a
> > "last-modification-field-which-ignore-touch" ? Is this really a problem
> > is we use display a "touched" time ? In this case, we display what the
> > user set on the file, we suppose he did it in purpose.
>
> (snip)
>
> In ZFS I believe there is also a "last modified transaction group" (txg)
> number stored with each dnode that could be used in a similar manner.
Hmm.. I think ZFS only has zp_gen in the dnode/znode, which is the txg
of the file creation. We also cannot use the txg birth time of the block
where the dnode is stored because a metadnode block holds several
dnodes.
I may be missing something here, but isn't the "ctime" the appropriate
value to use here?
Regards,
Ricardo
--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080211/03b52971/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6g_top.gif
Type: image/gif
Size: 1257 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080211/03b52971/attachment.gif>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-11 21:11 ` Ricardo M. Correia
@ 2008-02-11 21:39 ` Andreas Dilger
2008-02-11 22:07 ` Ricardo M. Correia
0 siblings, 1 reply; 30+ messages in thread
From: Andreas Dilger @ 2008-02-11 21:39 UTC (permalink / raw)
To: lustre-devel
On Feb 11, 2008 21:11 +0000, Ricardo Correia wrote:
> On Seg, 2008-02-11 at 11:18 -0700, Andreas Dilger wrote:
>
> > On Feb 08, 2008 16:55 +0100, Aurelien Degremont wrote:
> > > But the touch could be problematic. Lustre gurus, is there another time
> > > field we could use instead ? Should we add a
> > > "last-modification-field-which-ignore-touch" ? Is this really a problem
> > > is we use display a "touched" time ? In this case, we display what the
> > > user set on the file, we suppose he did it in purpose.
> >
> > (snip)
> >
> > In ZFS I believe there is also a "last modified transaction group" (txg)
> > number stored with each dnode that could be used in a similar manner.
>
>
> Hmm.. I think ZFS only has zp_gen in the dnode/znode, which is the txg
> of the file creation. We also cannot use the txg birth time of the block
> where the dnode is stored because a metadnode block holds several
> dnodes.
>
> I may be missing something here, but isn't the "ctime" the appropriate
> value to use here?
The problem with ctime (on Linux as well) is that it is possible for the
system clock to go backward, whether due to ntp, or because the hardware
clock is incorrect/reset, so it cannot be depended upon to be monotonically
increasing for the life of the lustre filesystem.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-11 21:39 ` Andreas Dilger
@ 2008-02-11 22:07 ` Ricardo M. Correia
2008-02-11 22:32 ` Nathaniel Rutman
0 siblings, 1 reply; 30+ messages in thread
From: Ricardo M. Correia @ 2008-02-11 22:07 UTC (permalink / raw)
To: lustre-devel
On Seg, 2008-02-11 at 14:39 -0700, Andreas Dilger wrote:
> The problem with ctime (on Linux as well) is that it is possible for the
> system clock to go backward, whether due to ntp, or because the hardware
> clock is incorrect/reset, so it cannot be depended upon to be monotonically
> increasing for the life of the lustre filesystem.
Ok. In that case, we could either add a new 64-bit version field to the
dnode (or znode) similar to the one in ldiskfs, or we could look at the
birth time (txg nr) of all the block pointers in the dnode.
Using txg numbers might not be very useful if an object is migrated from
one storage device to another, but I have not read the HSM HLD so I'm
not sure if this is a problem or not.
Cheers,
Ricardo
--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080211/4de0ee30/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6g_top.gif
Type: image/gif
Size: 1257 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080211/4de0ee30/attachment.gif>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-11 22:07 ` Ricardo M. Correia
@ 2008-02-11 22:32 ` Nathaniel Rutman
2008-02-11 22:46 ` Rick Matthews
2008-02-12 0:25 ` Ricardo M. Correia
0 siblings, 2 replies; 30+ messages in thread
From: Nathaniel Rutman @ 2008-02-11 22:32 UTC (permalink / raw)
To: lustre-devel
Ricardo M. Correia wrote:
>
> On Seg, 2008-02-11 at 14:39 -0700, Andreas Dilger wrote:
>> The problem with ctime (on Linux as well) is that it is possible for the
>> system clock to go backward, whether due to ntp, or because the hardware
>> clock is incorrect/reset, so it cannot be depended upon to be monotonically
>> increasing for the life of the lustre filesystem.
>>
>
> Ok. In that case, we could either add a new 64-bit version field to
> the dnode (or znode) similar to the one in ldiskfs, or we could look
> at the birth time (txg nr) of all the block pointers in the dnode.
> Using txg numbers might not be very useful if an object is migrated
> from one storage device to another, but I have not read the HSM HLD so
> I'm not sure if this is a problem or not.
I'm missing the point of this discussion. Clearly we shouldn't/can't
use ctime/mtime for anything internal to Lustre; that is what object
versions are all about. Why are we talking about adding new fields or
anything else?
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-11 22:32 ` Nathaniel Rutman
@ 2008-02-11 22:46 ` Rick Matthews
2008-02-12 15:41 ` Aurelien Degremont
2008-02-12 0:25 ` Ricardo M. Correia
1 sibling, 1 reply; 30+ messages in thread
From: Rick Matthews @ 2008-02-11 22:46 UTC (permalink / raw)
To: lustre-devel
I'm probably responsible for opening this can of worms. I inferred from
the HSM HLD that
mtime was proposed to be used for state change, or version of the
file/object. As the discussion
bears out, mtime for this purpose would be a bad idea. A reliable way of
detecting change is
needed, and if it already exists withing Lustre, great!.
What I think is far more significant is the involvement of the community
on issues
such as this. More folks examining (and critiquing) the details, the
better.
Nice to see such an active community.
--
Nathaniel Rutman wrote:
> Ricardo M. Correia wrote:
>>
>> On Seg, 2008-02-11 at 14:39 -0700, Andreas Dilger wrote:
>>> The problem with ctime (on Linux as well) is that it is possible for
>>> the
>>> system clock to go backward, whether due to ntp, or because the
>>> hardware
>>> clock is incorrect/reset, so it cannot be depended upon to be
>>> monotonically
>>> increasing for the life of the lustre filesystem.
>>>
>>
>> Ok. In that case, we could either add a new 64-bit version field to
>> the dnode (or znode) similar to the one in ldiskfs, or we could look
>> at the birth time (txg nr) of all the block pointers in the dnode.
>> Using txg numbers might not be very useful if an object is migrated
>> from one storage device to another, but I have not read the HSM HLD
>> so I'm not sure if this is a problem or not.
> I'm missing the point of this discussion. Clearly we shouldn't/can't
> use ctime/mtime for anything internal to Lustre; that is what object
> versions are all about. Why are we talking about adding new fields or
> anything else?
>
>
--
---------------------------------------------------------------------
Rick Matthews email: Rick.Matthews at sun.com
Sun Microsystems, Inc. phone:+1(651) 554-1518
1270 Eagan Industrial Road phone(internal): 54418
Suite 160 fax: +1(651) 554-1540
Eagan, MN 55121-1231 USA main: +1(651) 554-1500
---------------------------------------------------------------------
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-11 22:46 ` Rick Matthews
@ 2008-02-12 15:41 ` Aurelien Degremont
0 siblings, 0 replies; 30+ messages in thread
From: Aurelien Degremont @ 2008-02-12 15:41 UTC (permalink / raw)
To: lustre-devel
It is important to note that all comparisons and modifications are done
at Lustre-object level: OST stripe object or MDT file object, each of
those objects already has a version field, in the FID.
This is the version inside the FID that we will use for all treatments.
All purges are always requested for a specific FID.
The mtime is stored only for information, for the users. It is simpler
to display to the user:
user$ list_hsm_copies ./foo
Date
============
Feb 2 2006
Jun 18 2007
Jun 19 2007
than:
user$ list_hsm_copies ./foo
Version
============
0x0012356
0x001a250
0x001a011
If the user "touched" the file sometime, he knew what he has done. Just
the output will be different, but internaly, we manipulate Lustre FID
and so we don't care of mtime.
So the "version" in the backend is not a problem. We do not rely on the
ldiskfs/zfs inode versioning.
Aurelien Degremont
Rick Matthews a ?crit :
> I'm probably responsible for opening this can of worms. I inferred from
> the HSM HLD that
> mtime was proposed to be used for state change, or version of the
> file/object. As the discussion
> bears out, mtime for this purpose would be a bad idea. A reliable way of
> detecting change is
> needed, and if it already exists withing Lustre, great!.
>
> What I think is far more significant is the involvement of the community
> on issues
> such as this. More folks examining (and critiquing) the details, the
> better.
> Nice to see such an active community.
> --
>
> Nathaniel Rutman wrote:
>> Ricardo M. Correia wrote:
>>> On Seg, 2008-02-11 at 14:39 -0700, Andreas Dilger wrote:
>>>> The problem with ctime (on Linux as well) is that it is possible for
>>>> the
>>>> system clock to go backward, whether due to ntp, or because the
>>>> hardware
>>>> clock is incorrect/reset, so it cannot be depended upon to be
>>>> monotonically
>>>> increasing for the life of the lustre filesystem.
>>>>
>>> Ok. In that case, we could either add a new 64-bit version field to
>>> the dnode (or znode) similar to the one in ldiskfs, or we could look
>>> at the birth time (txg nr) of all the block pointers in the dnode.
>>> Using txg numbers might not be very useful if an object is migrated
>>> from one storage device to another, but I have not read the HSM HLD
>>> so I'm not sure if this is a problem or not.
>> I'm missing the point of this discussion. Clearly we shouldn't/can't
>> use ctime/mtime for anything internal to Lustre; that is what object
>> versions are all about. Why are we talking about adding new fields or
>> anything else?
>>
>>
>
>
--
Aurelien Degremont
CEA
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-11 22:32 ` Nathaniel Rutman
2008-02-11 22:46 ` Rick Matthews
@ 2008-02-12 0:25 ` Ricardo M. Correia
1 sibling, 0 replies; 30+ messages in thread
From: Ricardo M. Correia @ 2008-02-12 0:25 UTC (permalink / raw)
To: lustre-devel
On Seg, 2008-02-11 at 14:32 -0800, Nathaniel Rutman wrote:
> I'm missing the point of this discussion. Clearly we shouldn't/can't
> use ctime/mtime for anything internal to Lustre; that is what object
> versions are all about. Why are we talking about adding new fields or
> anything else?
If by object versions you are referring to the version field in the
ldiskfs inodes that Andreas mentioned, then we need to add a similar
field/attribute in ZFS.
It seems that Andreas has already filed bug 14865 for this.
Cheers,
Ricardo
--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email Ricardo.M.Correia at Sun.COM
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080212/70ae527c/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6g_top.gif
Type: image/gif
Size: 1257 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080212/70ae527c/attachment.gif>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
@ 2008-02-07 10:52 DEGREMONT Aurelien
2008-02-08 21:18 ` Nathaniel Rutman
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: DEGREMONT Aurelien @ 2008-02-07 10:52 UTC (permalink / raw)
To: lustre-devel
Hello
Here is a first draft for comments of the Lustre HSM HLD.
It is intended to be a support for further analyzes and comments from
CFS/Sun.
The document covers the main parts of the HSM features but some elements
are still lacking.
The policy management and the space manager will be describe later.
Let us know your comments and ideas about it.
Regards,
Aurelien Degremont
CEA
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hld_hsm.pdf
Type: application/pdf
Size: 159329 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080207/dd334e91/attachment.pdf>
^ permalink raw reply [flat|nested] 30+ messages in thread* [Lustre-devel] Lustre HSM HLD draft
2008-02-07 10:52 DEGREMONT Aurelien
@ 2008-02-08 21:18 ` Nathaniel Rutman
2008-02-11 14:59 ` Aurelien Degremont
2008-02-18 21:51 ` Canon, Richard Shane
2008-02-21 15:26 ` Aurelien Degremont
2 siblings, 1 reply; 30+ messages in thread
From: Nathaniel Rutman @ 2008-02-08 21:18 UTC (permalink / raw)
To: lustre-devel
DEGREMONT Aurelien wrote:
> Hello
>
> Here is a first draft for comments of the Lustre HSM HLD.
> It is intended to be a support for further analyzes and comments from
> CFS/Sun.
>
> The document covers the main parts of the HSM features but some elements
> are still lacking.
> The policy management and the space manager will be describe later.
>
> Let us know your comments and ideas about it.
>
> Regards,
5.1 external storage list - is this to be stored on the MGS device or a
separate device? If the coordinator lives on the MGS, why not it's
storage as well? In any case, it should be possible to co-locate the
coordinator on the MGS and used the MGS's storage device, in the same
way that the MGS can currently co-locate with the MDT.
6.3 object ref should include version number. Also include checksum?
How does the coordinator request activity from an agent? If the
coordinator is the RPC server, then it's up to the agents to make
requests; agents aren't listening for RPC requests themselves.
2.1Archiving one Lustre file
There should not be a cache miss when archiving a lustre file; perhaps
open-by-fid is intended to bypass atime updates
so that the file isn't marked as "recently accessed"?
2.2Restoring a file
"External ID" presumably contains all information required to retrieve
the file - tape #, path name, etc?
Once file is copied back, we should probably restore original ctime,
mtime, atime - coordinator is storing this, correct?
IV2 - why not multiple purged windows? Seems like if you're going to
purge 1 object out of a file, you might want to purge more.
Specifically, it will probably be a common case to purge every object of
a file from a particular OST. This is not contiguous in a
striped file.
I don't see any reason to purge anything smaller than an entire object
on an OST - is there good reason for this?
If that's the case, then it the OST must keep track of purged objects,
not ranges within an existing object.
If the MDT is tracking purged areas also, then there's a good potential
synergy here with a missing OST --
If the missing OST's objects are marked as purged, then we can
potentially recover them automatically from
HSM...
4.2 How is a purge request recovered? For example, MDT says purge obj1
from ost1, ost1 replies "ok", but then dies before it actually
does the purge. Reboots, doesn't know anything about purge request now,
but MDT has marked it as purged.
Transparent access - should this avoid modification of atime/mtime?
V2.1 How long does OST wait for completion? Is there a timeout? We
probably need a "no timeout if progress is being
made" kind of function - clients currently do this kind of thing with OSTs.
V2.2 No need to copy-in purged data on full-object-size writes.
Page 13, Lustre object mtime may not be good enough. There are several
mechanisms (like touch) to manipulate mtime, which makes it
unusable as a last written time.
If a user make a touch in the past this change the mtime and can hide
previous writes.
If we want to keep real write time we need to add a new time field in
Lustre backend
(may be ZFS has it)
If a user touches or otherwise modifies the mtime on purpose, they
presumably know what they are doing. Besides, we're using the
object version number, not the mtime, to determine whether a file
is up to date. I think this can be ignored.
^ permalink raw reply [flat|nested] 30+ messages in thread* [Lustre-devel] Lustre HSM HLD draft
2008-02-08 21:18 ` Nathaniel Rutman
@ 2008-02-11 14:59 ` Aurelien Degremont
2008-02-11 20:33 ` Nathaniel Rutman
0 siblings, 1 reply; 30+ messages in thread
From: Aurelien Degremont @ 2008-02-11 14:59 UTC (permalink / raw)
To: lustre-devel
Nathaniel Rutman a ?crit :
> 5.1 external storage list - is this to be stored on the MGS device or a
> separate device? If the coordinator lives on the MGS, why not it's
> storage as well? In any case, it should be possible to co-locate the
> coordinator on the MGS and used the MGS's storage device, in the same
> way that the MGS can currently co-locate with the MDT.
> How does the coordinator request activity from an agent? If the
> coordinator is the RPC server, then it's up to the agents to make
> requests; agents aren't listening for RPC requests themselves.
Presently, it is never said that the coordinator will live on the MGS.
The Coordinator constrains are:
1 - Must receive various migration requests from OST/MDT.
2 - Should be able to communicate with Agents and asks them migrations.
3 - Should store configuration and migration logs.
I think #1 and #2 are two differents API. The coordinator is clearly a
RPC server for the first one. How #2 should be implemented is not so
clear. What would be be the "Lustre-way" here?
For #3, the few logs that will be backed up here are not huge, and it
surely could be colocated with another Target, but I'm not sure this
should be mandatory. This device should be available to several servers,
for failover like the other Targets. We could imagine having more than 1
coordinator at long term. I'm not sure it is a good idea to stick it to
another target.
> 6.3 object ref should include version number. Also include checksum?
For data coherency? Should we add a explicit checksum for those values
(stored in an EA) or used a possible backend feature (Can ZFS and
ldiskfs detect EA value corruption by themselves?) ?
> 2.1Archiving one Lustre file
> There should not be a cache miss when archiving a lustre file; perhaps
> open-by-fid is intended to bypass atime updates
> so that the file isn't marked as "recently accessed"?
> Transparent access - should this avoid modification of atime/mtime?
I would say yes.
> 2.2Restoring a file
> "External ID" presumably contains all information required to retrieve
> the file - tape #, path name, etc?
> Once file is copied back, we should probably restore original ctime,
> mtime, atime - coordinator is storing this, correct?
External ID is an opaque value manage by the archiving tool. If the HSM
can store a lot of metadata, only a ref is needed, if not, the tool is
responsible for storing all the data it needs. Anyway, this is totally
opaque for Lustre.
I hope the HSMs will not need so many data in this field. HPSS does not
need so many data, it uses its internal DB to store them. I suppose SAM
also.
> IV2 - why not multiple purged windows? Seems like if you're going to
> purge 1 object out of a file, you might want to purge more.
> Specifically, it will probably be a common case to purge every object of
> a file from a particular OST. This is not contiguous in a
> striped file.
> I don't see any reason to purge anything smaller than an entire object
> on an OST - is there good reason for this?
Multiple purged window is subtle. If you permit this feature, you could
technically have, in the worst case, one purged window per byte, and
this could be very huge to store. Do you think you will do several holes
in the same file? In which cases?
In fact, the more common case is to totally purge a file which have been
migrated on HSM, and it is only an optimisation to keep the start and
the end of the file on disk, to avoid triggering tons of cache misses
with commands like "file foo/*" or a tool like Nautilus or Windows
Explorer browsing the directory.
The purged window is stored by per object, OST object and MDT object.
So, if several objects are purged, each object will store its own purged
window. But the MDT object describing this file will store a special
purged window which starts at the smallest unavailable bytes and ends at
the first available one. The MDT purged window indicates "if you do I/O
in this range, you're not sure the date are there." or "Outside of this
area, I guarantee data are present."
Maintain multiple purged windows will be an headache, with no real need
I think.
Moreover, people have asked for an OST-object based migration, even if I
think whole file migration will be the most common case.
> If that's the case, then it
> the OST must keep track of purged objects, not ranges within an existing
> object.
Objects are not removed, only their datas. All metadata are kept.
> If the MDT is tracking purged areas also, then there's a good potential
> synergy here with a missing OST --
> If the missing OST's objects are marked as purged, then we can
> potentially recover them automatically from
> HSM...
What do you call a "missing OST" ? A corrupt one ? A offline one?
Unavailable?
Where will you copy back the object data ? On another OST object ?
With the purged window on each OST object and MDT and the file stripping
info, we could easily restore the missing parts.
> 4.2 How is a purge request recovered? For example, MDT says purge obj1
> from ost1, ost1 replies "ok", but then dies before it actually
> does the purge. Reboots, doesn't know anything about purge request now,
> but MDT has marked it as purged.
The OST asynchronously acknowledges the purge when it is done. The MDT
marks it purged only when it is really done. I will clarify this.
> V2.1 How long does OST wait for completion? Is there a timeout? We
> probably need a "no timeout if progress is being
> made" kind of function - clients currently do this kind of thing with OSTs.
I'm sure Lustre already has similar mechanisms for optimized timeout in
this kind of situation we could reused here.
What you describe is a good approach I think.
> V2.2 No need to copy-in purged data on full-object-size writes.
True. We could had such optimization. But this is only useful for small
files or very widely stripped ones, doesn't it?
Thanks for your comments.
--
Aurelien Degremont
CEA
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-11 14:59 ` Aurelien Degremont
@ 2008-02-11 20:33 ` Nathaniel Rutman
2008-02-12 3:55 ` Andreas Dilger
0 siblings, 1 reply; 30+ messages in thread
From: Nathaniel Rutman @ 2008-02-11 20:33 UTC (permalink / raw)
To: lustre-devel
Aurelien Degremont wrote:
> Nathaniel Rutman a ?crit :
>
>> 5.1 external storage list - is this to be stored on the MGS device or a
>> separate device? If the coordinator lives on the MGS, why not it's
>> storage as well? In any case, it should be possible to co-locate the
>> coordinator on the MGS and used the MGS's storage device, in the same
>> way that the MGS can currently co-locate with the MDT.
>> How does the coordinator request activity from an agent? If the
>> coordinator is the RPC server, then it's up to the agents to make
>> requests; agents aren't listening for RPC requests themselves.
>>
>
> Presently, it is never said that the coordinator will live on the MGS.
> The Coordinator constrains are:
> 1 - Must receive various migration requests from OST/MDT.
> 2 - Should be able to communicate with Agents and asks them migrations.
> 3 - Should store configuration and migration logs.
>
> I think #1 and #2 are two differents API. The coordinator is clearly a
> RPC server for the first one. How #2 should be implemented is not so
> clear. What would be be the "Lustre-way" here?
>
With userspace servers, presumably we have some way of passing LNET
messages
from kernel to userspace. We should probably still go through LNET for
#2 in order
to use the broadest range of network fabrics. So it could be the same
or similar
RPC. There is no "Lustre-way" for this area - we've never done this
kind of thing before.
> For #3, the few logs that will be backed up here are not huge, and it
> surely could be colocated with another Target, but I'm not sure this
> should be mandatory. This device should be available to several servers,
> for failover like the other Targets. We could imagine having more than 1
> coordinator at long term. I'm not sure it is a good idea to stick it to
> another target.
>
Not mandatory, but possible is nice. Minimize the number of required
partitions.
>
>> 6.3 object ref should include version number. Also include checksum?
>>
>
> For data coherency? Should we add a explicit checksum for those values
> (stored in an EA) or used a possible backend feature (Can ZFS and
> ldiskfs detect EA value corruption by themselves?) ?
>
ZFS can, ldiskfs cannot. Anyhow, it was just a thought. Doesn't hurt
to allow space for it.
>
>> 2.1Archiving one Lustre file
>> There should not be a cache miss when archiving a lustre file; perhaps
>> open-by-fid is intended to bypass atime updates
>> so that the file isn't marked as "recently accessed"?
>>
> > Transparent access - should this avoid modification of atime/mtime?
>
> I would say yes.
>
>
>> 2.2Restoring a file
>> "External ID" presumably contains all information required to retrieve
>> the file - tape #, path name, etc?
>> Once file is copied back, we should probably restore original ctime,
>> mtime, atime - coordinator is storing this, correct?
>>
>
> External ID is an opaque value manage by the archiving tool. If the HSM
> can store a lot of metadata, only a ref is needed, if not, the tool is
> responsible for storing all the data it needs. Anyway, this is totally
> opaque for Lustre.
> I hope the HSMs will not need so many data in this field. HPSS does not
> need so many data, it uses its internal DB to store them. I suppose SAM
> also.
>
What about restore of original ctime, mtime, atime? I think we must
store it
in the coordinator because we must work with all HSMs, and I think it is
important
to restore it.
>
>> IV2 - why not multiple purged windows? Seems like if you're going to
>> purge 1 object out of a file, you might want to purge more.
>> Specifically, it will probably be a common case to purge every object of
>> a file from a particular OST. This is not contiguous in a
>> striped file.
>> I don't see any reason to purge anything smaller than an entire object
>> on an OST - is there good reason for this?
>>
>
> Multiple purged window is subtle. If you permit this feature, you could
> technically have, in the worst case, one purged window per byte, and
> this could be very huge to store. Do you think you will do several holes
> in the same file? In which cases?
>
Like I said, I don't see any reason to purge anything smaller than a
full object; I
would in fact disallow purging of an arbitrary byte range, and only
allow purging
on full-object boundaries.
> In fact, the more common case is to totally purge a file which have been
> migrated on HSM, and it is only an optimisation to keep the start and
> the end of the file on disk, to avoid triggering tons of cache misses
> with commands like "file foo/*" or a tool like Nautilus or Windows
> Explorer browsing the directory.
>
Again, since Lustre is optimized to work with 1MB chunks anyhow, I don't
think
it helps much to keep less than that in the beginning / end objects, so
I would
say just keep the first and last blocks instead.
> The purged window is stored by per object, OST object and MDT object.
> So, if several objects are purged, each object will store its own purged
> window. But the MDT object describing this file will store a special
> purged window which starts at the smallest unavailable bytes and ends at
> the first available one. The MDT purged window indicates "if you do I/O
> in this range, you're not sure the date are there." or "Outside of this
> area, I guarantee data are present."
> Maintain multiple purged windows will be an headache, with no real need
> I think.
> Moreover, people have asked for an OST-object based migration, even if I
> think whole file migration will be the most common case.
>
>
> > If that's the case, then it
>
>> the OST must keep track of purged objects, not ranges within an existing
>> object.
>>
>
> Objects are not removed, only their datas. All metadata are kept.
>
>
>> If the MDT is tracking purged areas also, then there's a good potential
>> synergy here with a missing OST --
>> If the missing OST's objects are marked as purged, then we can
>> potentially recover them automatically from
>> HSM...
>>
>
> What do you call a "missing OST" ? A corrupt one ? A offline one?
> Unavailable?
>
Yes. All of the above. Obviously we need to distinguish between
"permanently
gone" and "temporarily gone".
> Where will you copy back the object data ? On another OST object ?
>
Yes. Some kind of recovery will take place to generate a new object on
a different OST and
we can restore the data there.
> With the purged window on each OST object and MDT and the file stripping
> info, we could easily restore the missing parts.
>
Exactly. This is why I say we should think about this now, to allow for
this
possibility.
>
>> 4.2 How is a purge request recovered? For example, MDT says purge obj1
>> from ost1, ost1 replies "ok", but then dies before it actually
>> does the purge. Reboots, doesn't know anything about purge request now,
>> but MDT has marked it as purged.
>>
>
> The OST asynchronously acknowledges the purge when it is done. The MDT
> marks it purged only when it is really done. I will clarify this.
>
>
>> V2.1 How long does OST wait for completion? Is there a timeout? We
>> probably need a "no timeout if progress is being
>> made" kind of function - clients currently do this kind of thing with OSTs.
>>
>
> I'm sure Lustre already has similar mechanisms for optimized timeout in
> this kind of situation we could reused here.
> What you describe is a good approach I think.
>
>
>> V2.2 No need to copy-in purged data on full-object-size writes.
>>
>
> True. We could had such optimization. But this is only useful for small
> files or very widely stripped ones, doesn't it?
>
No, we very frequently write entire stripes (objects). Lustre clients
can optimize for this.
>
> Thanks for your comments.
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-11 20:33 ` Nathaniel Rutman
@ 2008-02-12 3:55 ` Andreas Dilger
2008-02-12 11:04 ` Eric Barton
0 siblings, 1 reply; 30+ messages in thread
From: Andreas Dilger @ 2008-02-12 3:55 UTC (permalink / raw)
To: lustre-devel
On Feb 11, 2008 12:33 -0800, Nathaniel Rutman wrote:
> Aurelien Degremont wrote:
> > Nathaniel Rutman a ?crit :
> >> IV2 - why not multiple purged windows? Seems like if you're going to
> >> purge 1 object out of a file, you might want to purge more.
> >> Specifically, it will probably be a common case to purge every object of
> >> a file from a particular OST. This is not contiguous in a
> >> striped file.
> >> I don't see any reason to purge anything smaller than an entire object
> >> on an OST - is there good reason for this?
> >
> > Multiple purged window is subtle. If you permit this feature, you could
> > technically have, in the worst case, one purged window per byte, and
> > this could be very huge to store. Do you think you will do several holes
> > in the same file? In which cases?
One issue is that if you are purging individual objects from a file your
windows will be quite disjoint at the file level. That may not be a serious
problem for applications that only look at the first and last chunks of a
file.
I can imagine use cases for extremely large files and limited-sized caches
where there is a need to access only subsets of the file (i.e. the entire
file cannot be resident at one time). That said, it may be this is too
complex for the initial implementation.
> Like I said, I don't see any reason to purge anything smaller than a
> full object; I would in fact disallow purging of an arbitrary byte range,
> and only allow purging on full-object boundaries.
That is impractical, for the reasons that Aurelien mentioned - we want to
avoid file re-staging for tools like "file" and GUIs that read the start/end
of files to determine file type and icons.
> > In fact, the more common case is to totally purge a file which have been
> > migrated on HSM, and it is only an optimisation to keep the start and
> > the end of the file on disk, to avoid triggering tons of cache misses
> > with commands like "file foo/*" or a tool like Nautilus or Windows
> > Explorer browsing the directory.
>
> Again, since Lustre is optimized to work with 1MB chunks anyhow, I don't
> think it helps much to keep less than that in the beginning / end objects,
> so I would say just keep the first and last blocks instead.
What if file is N*1MB + 1 byte? We need to be able to keep something like
64kB for a windows icon, so having some arbitrary byte range seems reasonable.
> > The purged window is stored by per object, OST object and MDT object.
> > So, if several objects are purged, each object will store its own purged
> > window. But the MDT object describing this file will store a special
> > purged window which starts at the smallest unavailable bytes and ends at
> > the first available one.
I think this should read "ends at the highest range contiguous to the end
of the file" or similar, or it will be misleading in the multi-object case.
> >> the OST must keep track of purged objects, not ranges within an existing
> >> object.
> >
> > Objects are not removed, only their datas. All metadata are kept.
The one drawback with this approach is that it is not possible to HSM
copy-in objects to a different OST than where they were originally stored.
BUT... in conjunction with the migration tool it should be able to migrate
an (empty) object from one OST to another before the copy-in from HSM,
so long as there is no OST-specific data in the HSM identifier (i.e. the
HSM label is truely opaque).
> >> If the MDT is tracking purged areas also, then there's a good potential
> >> synergy here with a missing OST --
> >> If the missing OST's objects are marked as purged, then we can
> >> potentially recover them automatically from
> >> HSM...
> >
> > What do you call a "missing OST" ? A corrupt one ? A offline one?
> > Unavailable?
>
> Yes. All of the above. Obviously we need to distinguish between
> "permanently gone" and "temporarily gone".
I suppose this leads to a requirement to store the object in HSM so
that it can be accessed just by the object FID+version. That would allow
the OST to be restored from HSM even if the entire OST filesystem is lost,
potentially modifying the FLDB to relocate the FID to a different OST.
> > Where will you copy back the object data ? On another OST object ?
>
> Yes. Some kind of recovery will take place to generate a new object on
> a different OST and we can restore the data there.
> > With the purged window on each OST object and MDT and the file stripping
> > info, we could easily restore the missing parts.
>
> Exactly. This is why I say we should think about this now, to allow for
> this possibility.
Right.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-12 3:55 ` Andreas Dilger
@ 2008-02-12 11:04 ` Eric Barton
2008-02-12 15:25 ` Aurelien Degremont
0 siblings, 1 reply; 30+ messages in thread
From: Eric Barton @ 2008-02-12 11:04 UTC (permalink / raw)
To: lustre-devel
Hi,
Sorry if these questions duplicates previous debate.
Have I understood correctly that the design allows individual objects
within a lustre file (i.e. stripes?) to be purged independently?
If so why is this needed?
I would have thought that when you purge a file, you need only record
the purged extent as an attribute of the whole lustre file and punch
its stripes to free the space. Am I missing a use case?
--
Cheers,
Eric
^ permalink raw reply [flat|nested] 30+ messages in thread* [Lustre-devel] Lustre HSM HLD draft
2008-02-12 11:04 ` Eric Barton
@ 2008-02-12 15:25 ` Aurelien Degremont
2008-02-12 17:23 ` Andreas Dilger
0 siblings, 1 reply; 30+ messages in thread
From: Aurelien Degremont @ 2008-02-12 15:25 UTC (permalink / raw)
To: lustre-devel
Eric Barton a ?crit :
> Hi,
>
> Sorry if these questions duplicates previous debate.
>
> Have I understood correctly that the design allows individual objects
> within a lustre file (i.e. stripes?) to be purged independently?
>
> If so why is this needed?
>
> I would have thought that when you purge a file, you need only record
> the purged extent as an attribute of the whole lustre file and punch
> its stripes to free the space. Am I missing a use case?
Since the beginning CFS required this feature. It seems a lab ask for
it. I do not know who. Unfortunately we have no use case for what they
want to do with this.
I'm wondering if their need could not be met with other features like
the internal Lustre migration...
--
Aurelien Degremont
CEA
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-12 15:25 ` Aurelien Degremont
@ 2008-02-12 17:23 ` Andreas Dilger
2008-02-12 19:43 ` Eric Barton
2008-02-12 23:24 ` Nathaniel Rutman
0 siblings, 2 replies; 30+ messages in thread
From: Andreas Dilger @ 2008-02-12 17:23 UTC (permalink / raw)
To: lustre-devel
On Feb 12, 2008 16:25 +0100, Aurelien Degremont wrote:
> Eric Barton a ?crit :
> > Sorry if these questions duplicates previous debate.
> >
> > Have I understood correctly that the design allows individual objects
> > within a lustre file (i.e. stripes?) to be purged independently?
> >
> > If so why is this needed?
> >
> > I would have thought that when you purge a file, you need only record
> > the purged extent as an attribute of the whole lustre file and punch
> > its stripes to free the space. Am I missing a use case?
>
> Since the beginning CFS required this feature. It seems a lab ask for
> it. I do not know who. Unfortunately we have no use case for what they
> want to do with this.
> I'm wondering if their need could not be met with other features like
> the internal Lustre migration...
That is my understanding also - I believe one of the Labs wanted this
(to be able to do HSM on a per-stripe basis instead of a per-file basis).
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-12 17:23 ` Andreas Dilger
@ 2008-02-12 19:43 ` Eric Barton
2008-02-12 23:24 ` Nathaniel Rutman
1 sibling, 0 replies; 30+ messages in thread
From: Eric Barton @ 2008-02-12 19:43 UTC (permalink / raw)
To: lustre-devel
Andreas,
Is this requirement documented? I'd appreciate any pointers...
> -----Original Message-----
> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM]
> On Behalf Of Andreas Dilger
> Sent: 12 February 2008 5:23 PM
> To: Aurelien Degremont
> Cc: Eric Barton; lustre-devel at lists.lustre.org
> Subject: Re: [Lustre-devel] Lustre HSM HLD draft
>
> On Feb 12, 2008 16:25 +0100, Aurelien Degremont wrote:
> > Eric Barton a ?crit :
> > > Sorry if these questions duplicates previous debate.
> > >
> > > Have I understood correctly that the design allows
> individual objects
> > > within a lustre file (i.e. stripes?) to be purged independently?
> > >
> > > If so why is this needed?
> > >
> > > I would have thought that when you purge a file, you need
> only record
> > > the purged extent as an attribute of the whole lustre
> file and punch
> > > its stripes to free the space. Am I missing a use case?
> >
> > Since the beginning CFS required this feature. It seems a
> lab ask for
> > it. I do not know who. Unfortunately we have no use case
> for what they
> > want to do with this.
> > I'm wondering if their need could not be met with other
> features like
> > the internal Lustre migration...
>
> That is my understanding also - I believe one of the Labs wanted this
> (to be able to do HSM on a per-stripe basis instead of a
> per-file basis).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-12 17:23 ` Andreas Dilger
2008-02-12 19:43 ` Eric Barton
@ 2008-02-12 23:24 ` Nathaniel Rutman
1 sibling, 0 replies; 30+ messages in thread
From: Nathaniel Rutman @ 2008-02-12 23:24 UTC (permalink / raw)
To: lustre-devel
Andreas Dilger wrote:
> On Feb 12, 2008 16:25 +0100, Aurelien Degremont wrote:
>
>> Eric Barton a ?crit :
>>
>>> Sorry if these questions duplicates previous debate.
>>>
>>> Have I understood correctly that the design allows individual objects
>>> within a lustre file (i.e. stripes?) to be purged independently?
>>>
>>> If so why is this needed?
>>>
>>> I would have thought that when you purge a file, you need only record
>>> the purged extent as an attribute of the whole lustre file and punch
>>> its stripes to free the space. Am I missing a use case?
>>>
>> Since the beginning CFS required this feature. It seems a lab ask for
>> it. I do not know who. Unfortunately we have no use case for what they
>> want to do with this.
>> I'm wondering if their need could not be met with other features like
>> the internal Lustre migration...
>>
>
> That is my understanding also - I believe one of the Labs wanted this
> (to be able to do HSM on a per-stripe basis instead of a per-file basis).
>
This doesn't make any sense to me. Layouts may change;
a stripe on one filesystem may not correspond to a stripe on a replica
of the filesystem;
exposing stripes to user apps is a bad idea.
I'm going to propose what I think we need:
1. Punch a single, arbitrary byte range from the middle of a file (thus
leaving beginning and end for file type, icons, filesize.
2. No other arbitrary punch patterns.
3. The punched range is stored on the MDT alone.
4. Once punched, the OST may forget about any fully-punched stripes it
used to hold.
5. Clients must take a layout lock (CR) when they retrieve the layout from
the MDT. If the MDT punches from the middle, it revokes the layout
lock,
and clients must re-enqueue it for further read/write on the file.
The MDT
is the sole keeper of the layout, and it must be protected by a lock.
6. Client access within a punched range results in an RPC to the MDT. The
MDT decides where to put the restored data, organizes the restoration
(via the coordinator), and rewrites the layout (under lock, of course).
Client gets the new layout, and can contact the appropriate OST.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-07 10:52 DEGREMONT Aurelien
2008-02-08 21:18 ` Nathaniel Rutman
@ 2008-02-18 21:51 ` Canon, Richard Shane
2008-02-19 17:13 ` Aurelien Degremont
2008-02-25 22:44 ` Peter J Braam
2008-02-21 15:26 ` Aurelien Degremont
2 siblings, 2 replies; 30+ messages in thread
From: Canon, Richard Shane @ 2008-02-18 21:51 UTC (permalink / raw)
To: lustre-devel
Aurelien and JC,
Sorry that my feedback is late. Here are my questions/remarks.
General
* Any thought on how quotas will be handled?
Coordinator
* 3.4 - I was curious what the precise use case was that was driving
this? I don't disagree with it, but I was curious for more background
* 3.7.1 - The coordinator could become a scaling bottleneck. We should
think about how this will be scaled in the future
* 4.1 - Does the coordinator store the ext obj id or does the agent
* 4.3 item 2 - This looks like the coordinator could become a bottle
neck for unlinks and slow down performance. Could this be put in some
type of async queue to be processed later (or some type of attic space)?
Use Cases
* 2.3 (Use cases) - I'm really keen on this feature. I think it is very
important in order to make small file performance work well.
Unfortunately, it isn't clear how the file list gets communicated to the
archive tool. The coordinator and agent seem to only take one file at a
time. So how would this work exactly?
* 2.4 - The copy tool should be allowed to preemptively restage files.
I think this will work with the design, but we should make sure of this.
This would be useful for restaging a whole tar file versus doing things
piece-meal.
Part IV
2 EAs - I'm worried that the EA list could get huge for holes.
3.2 -item 3 - Who insures a file is archived before punches are made?
3.3 - Another use case... The user checks to see if a file has been
archived.
Also, someone earlier made the point about the archive tool being able
to reorder request. This is really important since an archival system
wants to know all the files being restaged in order to order tape mounts
and reads.
Thanks for taking the lead on this. It looks like there is a lot of
interest in it.
--Shane
-----Original Message-----
From: lustre-devel-bounces@lists.lustre.org
[mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of DEGREMONT
Aurelien
Sent: Thursday, February 07, 2008 5:53 AM
To: lustre-devel at lists.lustre.org
Subject: [Lustre-devel] Lustre HSM HLD draft
Hello
Here is a first draft for comments of the Lustre HSM HLD.
It is intended to be a support for further analyzes and comments from
CFS/Sun.
The document covers the main parts of the HSM features but some elements
are still lacking.
The policy management and the space manager will be describe later.
Let us know your comments and ideas about it.
Regards,
Aurelien Degremont
CEA
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-18 21:51 ` Canon, Richard Shane
@ 2008-02-19 17:13 ` Aurelien Degremont
2008-02-25 22:44 ` Peter J Braam
1 sibling, 0 replies; 30+ messages in thread
From: Aurelien Degremont @ 2008-02-19 17:13 UTC (permalink / raw)
To: lustre-devel
Canon, Richard Shane a ?crit :
> General
> * Any thought on how quotas will be handled?
That's a very good question.
I think this point should be discussed.
The purge possibility introduces two values which could be under quota.
1 - File size (current case)
2 - The disk occupation are used (migrated files free quota)
The first point are the simplest to implement and will need fewer
modifications, but users could not free quota even if all their files
are migrated.
The second point could help users but this will be problematic when they
will copy back some of their file, because this will trigger space
issues and purge requests on theirs other files, and so on.
IMO, the best way is to take choice #1 and possibly add a 'real disk
use' quota value that could be tuned by admins.
I'm not a Lustre quota specialist and AFAIK this code is a bit touchy.
> Coordinator
> * 3.4 - I was curious what the precise use case was that was driving
> this? I don't disagree with it, but I was curious for more background
Coordinator is designed to also manage internal Lustre migrations.
> * 3.7.1 - The coordinator could become a scaling bottleneck. We should
> think about how this will be scaled in the future
I think we should be able to have several coordinators in the future.
Each of them dealing with different external storages.
> * 4.1 - Does the coordinator store the ext obj id or does the agent
The agent does not have a storage device. It stores nothing.
The external IDs are in MDT device.
> * 4.3 item 2 - This looks like the coordinator could become a bottle
> neck for unlinks and slow down performance. Could this be put in some
> type of async queue to be processed later (or some type of attic space)?
Yes, I think unlinks should be handled asynchronously.
> Use Cases
> * 2.3 (Use cases) - I'm really keen on this feature. I think it is very
> important in order to make small file performance work well.
> Unfortunately, it isn't clear how the file list gets communicated to the
> archive tool. The coordinator and agent seem to only take one file at a
> time. So how would this work exactly?
In fact, we have presently designed the archiving tool to support this
feature and only it because the archiving tool could be developped by
anyone and we want this API being as stable as possible. The current
Lustre component design does not handle it. But it will be added later,
in a second step, and the copy tool developped since will be already
compatible with it.
> * 2.4 - The copy tool should be allowed to preemptively restage files.
> I think this will work with the design, but we should make sure of this.
> This would be useful for restaging a whole tar file versus doing things
> piece-meal.
That's an interesting point. I think we could avoid it but it is an
interesting feature. I must think how we should modify the design to
permit it. (The tool should be able to warn the coordinator: oh, i'm
staging this file also! please note it)
> 2 EAs - I'm worried that the EA list could get huge for holes.
This part has been redesigned. The data that were stored in EA have been
moved. It will be explained in the new document version.
> 3.2 -item 3 - Who insures a file is archived before punches are made?
The space manager did it. It is the only one which will make punch
request. May be MDT could ensure it before dealing with it.
> 3.3 - Another use case... The user checks to see if a file has been
> archived.
Ok
> Also, someone earlier made the point about the archive tool being able
> to reorder request. This is really important since an archival system
> wants to know all the files being restaged in order to order tape mounts
> and reads.
I do not see any problem with this.
I will add this point in the doc.
> Thanks for taking the lead on this. It looks like there is a lot of
> interest in it.
Thanks you for your very interesting comments.
--
Aurelien Degremont
CEA
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-18 21:51 ` Canon, Richard Shane
2008-02-19 17:13 ` Aurelien Degremont
@ 2008-02-25 22:44 ` Peter J Braam
1 sibling, 0 replies; 30+ messages in thread
From: Peter J Braam @ 2008-02-25 22:44 UTC (permalink / raw)
To: lustre-devel
Just a few initial responses from me, I haven't read things
systematically yet.
Canon, Richard Shane wrote:
> Aurelien and JC,
>
> Sorry that my feedback is late. Here are my questions/remarks.
>
> General
> * Any thought on how quotas will be handled?
>
>
This is very very important and will require a lot of detail. Well
spotted Shane!!!
> Coordinator
> * 3.4 - I was curious what the precise use case was that was driving
> this? I don't disagree with it, but I was curious for more background
>
In internal migrations many objects will be restriped to another set of
objects to move the data. The coordinator handles the completion and
abortion of the agents accomplishing this.
> * 3.7.1 - The coordinator could become a scaling bottleneck. We should
> think about how this will be scaled in the future
>
In my writings I was always anticipating a family of load balancing
coordinators.
> * 4.1 - Does the coordinator store the ext obj id or does the agent
>
Coordinator, I suggest, in view of the fact that many agents may be
required to move one file.
> * 4.3 item 2 - This looks like the coordinator could become a bottle
> neck for unlinks and slow down performance. Could this be put in some
> type of async queue to be processed later (or some type of attic space)?
>
>
I agree with this.
> Use Cases
> * 2.3 (Use cases) - I'm really keen on this feature. I think it is very
> important in order to make small file performance work well.
> Unfortunately, it isn't clear how the file list gets communicated to the
> archive tool. The coordinator and agent seem to only take one file at a
> time. So how would this work exactly?
> * 2.4 - The copy tool should be allowed to preemptively restage files.
> I think this will work with the design, but we should make sure of this.
> This would be useful for restaging a whole tar file versus doing things
> piece-meal.
>
> Part IV
> 2 EAs - I'm worried that the EA list could get huge for holes.
>
The EA merely points to an extent tree (similar to the allocation extent
tree).
> 3.2 -item 3 - Who insures a file is archived before punches are made?
>
The coordinator.
> 3.3 - Another use case... The user checks to see if a file has been
> archived.
>
>
> Also, someone earlier made the point about the archive tool being able
> to reorder request. This is really important since an archival system
> wants to know all the files being restaged in order to order tape mounts
> and reads.
>
> Thanks for taking the lead on this. It looks like there is a lot of
> interest in it.
>
> --Shane
>
> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org
> [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of DEGREMONT
> Aurelien
> Sent: Thursday, February 07, 2008 5:53 AM
> To: lustre-devel at lists.lustre.org
> Subject: [Lustre-devel] Lustre HSM HLD draft
>
> Hello
>
> Here is a first draft for comments of the Lustre HSM HLD.
> It is intended to be a support for further analyzes and comments from
> CFS/Sun.
>
> The document covers the main parts of the HSM features but some elements
> are still lacking.
> The policy management and the space manager will be describe later.
>
> Let us know your comments and ideas about it.
>
> Regards,
>
> Aurelien Degremont
> CEA
>
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-07 10:52 DEGREMONT Aurelien
2008-02-08 21:18 ` Nathaniel Rutman
2008-02-18 21:51 ` Canon, Richard Shane
@ 2008-02-21 15:26 ` Aurelien Degremont
2008-02-25 22:38 ` Peter J Braam
2 siblings, 1 reply; 30+ messages in thread
From: Aurelien Degremont @ 2008-02-21 15:26 UTC (permalink / raw)
To: lustre-devel
Hello
I've got several wondering about some specific point in HSM
implementation and I would like your opinion about them.
Coordinator:
This element will manage migration externally (HSM) and internally of
Lustre (space balancing?). Is the current API acceptable (specific calls
for external migration, and other ones for internal migration)? The best
way could have been to have generic call for migration, but we must also
have generic objects to describe the migration sources and destinations
and those are not simples. We finally conclude with the API presented in
the HLD document. Tell me if this is *really* a bad idea or if only
adjustments are needed.
We presented two modes of migration, explicit and implicit migrations.
The first one result of an administrative request, the second one was
triggered automatically (cache miss by example). Is that ok? (See the
doc for all details).
Agent:
It seems, to support Lustre internal migration, you have planned to
implement specific Agents which will reside on OST. HSM will need
specific agent on clients. Do those two kinds of agent are acceptable ?
The current API only describe HSM-based agent. Maybe we should think of
a generic agent framework and add specialized implementations for
ost,hsm,etc ?
--
Aurelien Degremont
CEA
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-21 15:26 ` Aurelien Degremont
@ 2008-02-25 22:38 ` Peter J Braam
2008-02-27 16:51 ` Aurelien Degremont
0 siblings, 1 reply; 30+ messages in thread
From: Peter J Braam @ 2008-02-25 22:38 UTC (permalink / raw)
To: lustre-devel
Aurelien Degremont wrote:
> Hello
>
> I've got several wondering about some specific point in HSM
> implementation and I would like your opinion about them.
>
> Coordinator:
>
> This element will manage migration externally (HSM) and internally of
> Lustre (space balancing?). Is the current API acceptable (specific calls
> for external migration, and other ones for internal migration)?
I would like to see a parameter indicating what agent will be used and
keep all other parameters the same.
> The best
> way could have been to have generic call for migration, but we must also
> have generic objects to describe the migration sources and destinations
> and those are not simples.
For migration to and from external sources, Lustre must already manage
this data in an extended attribute (e.g. to describe the file on tape to
which a Lustre file was migrated). This data is opaque to Lustre and
can be passed as a blob.
> We finally conclude with the API presented in
> the HLD document. Tell me if this is *really* a bad idea or if only
> adjustments are needed.
>
>
I have not yet looked at these.
> We presented two modes of migration, explicit and implicit migrations.
> The first one result of an administrative request, the second one was
> triggered automatically (cache miss by example). Is that ok? (See the
> doc for all details).
>
Yes, that seems ok.
> Agent:
>
> It seems, to support Lustre internal migration, you have planned to
> implement specific Agents which will reside on OST.
To avoid many complications involving locks, we decided that even the
agents used for internal migrations will layer on the file system. The
Lustre file system will be mounted on the OST's and it will use the
"LOLND" to transport the data efficiently between the OST process and
the client file system cache. In the internal case source and
destination lie in Lustre in the HSM case only one of them.
As a result I believe these two cases are closer together than you may
think, and should be one "type".
The key aspect we/you need to design is what an agent has to make sure
happens, for example in terms of locking file extents and in terms of
avoiding triggering a recursive cache miss (open by fid with a flag?).
- Peter -
> HSM will need
> specific agent on clients. Do those two kinds of agent are acceptable ?
> The current API only describe HSM-based agent. Maybe we should think of
> a generic agent framework and add specialized implementations for
> ost,hsm,etc ?
>
>
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* [Lustre-devel] Lustre HSM HLD draft
2008-02-25 22:38 ` Peter J Braam
@ 2008-02-27 16:51 ` Aurelien Degremont
2008-02-29 4:30 ` Peter Braam
0 siblings, 1 reply; 30+ messages in thread
From: Aurelien Degremont @ 2008-02-27 16:51 UTC (permalink / raw)
To: lustre-devel
Peter J Braam a ?crit :
>> Coordinator:
>>
>> This element will manage migration externally (HSM) and internally of
>> Lustre (space balancing?). Is the current API acceptable (specific
>> calls for external migration, and other ones for internal migration)?
> I would like to see a parameter indicating what agent will be used and
> keep all other parameters the same.
Agreed.
>> The best way could have been to have generic call for migration, but
>> we must also have generic objects to describe the migration sources
>> and destinations and those are not simples.
> For migration to and from external sources, Lustre must already manage
> this data in an extended attribute (e.g. to describe the file on tape to
> which a Lustre file was migrated). This data is opaque to Lustre and
> can be passed as a blob.
>> It seems, to support Lustre internal migration, you have planned to
>> implement specific Agents which will reside on OST.
> To avoid many complications involving locks, we decided that even the
> agents used for internal migrations will layer on the file system. The
> Lustre file system will be mounted on the OST's and it will use the
> "LOLND" to transport the data efficiently between the OST process and
> the client file system cache. In the internal case source and
> destination lie in Lustre in the HSM case only one of them.
>
> As a result I believe these two cases are closer together than you may
> think, and should be one "type".
If we unify the API, we must have a way to request some data movement like:
copy elemA in placeP
copy elemA,stored in placeP bak into Lustre
copy elemA into placeC
move elemB into elemB
The elem could be unified using Lustre FID, but the places could be an
external storage, or a precise OST. If we want a unify API, the API call
should manipulate a generic object which could describe a Lustre storage
element (ost) or a external storage (hsm,...)
ie:
struct storage_place {
...
}
copy(fid,storage_place*)
move(fid,storage_place*)
and their is some specific cases to handle. The other possibity:
ext_copyout(fid, external storage)
ext_copyin(fid, external object)
int_copy(fid, fid, ost)
int_move(fid, fid, ost)
I think this one, even if the design is not the most beautiful one, if
the easiest one.
Instead you want to create some new generic objects to manipulate lustre
object data and generic storage areas, the second case is the best one IMO.
--
Aurelien Degremont
CEA
^ permalink raw reply [flat|nested] 30+ messages in thread* [Lustre-devel] Lustre HSM HLD draft
2008-02-27 16:51 ` Aurelien Degremont
@ 2008-02-29 4:30 ` Peter Braam
0 siblings, 0 replies; 30+ messages in thread
From: Peter Braam @ 2008-02-29 4:30 UTC (permalink / raw)
To: lustre-devel
The discussion below about the API's is a standard element of data
abstraction taught in advanced programming courses (see e.g. Abelson et.
al. Structure and Interpretation of Computer Programs (SICP)).
From this one concludes that the coordinator and agents will use abstract
data types and call abstract methods that accommodate multiple:
- source and destination descriptors for the data
- data movers implementing the methods to move data
If you proceed along the lines you outline you will get a big matrix of
movers and data types to keep track of. If you follow my approach you will
encapsulate things much more cleanly.
Think in terms of virtual classes data movers acting on source and
destination objects.
- peter -
On 2/27/08 9:51 AM, "Aurelien Degremont" <aurelien.degremont@cea.fr> wrote:
>
> Peter J Braam a ?crit :
>>> Coordinator:
>>>
>>> This element will manage migration externally (HSM) and internally of
>>> Lustre (space balancing?). Is the current API acceptable (specific
>>> calls for external migration, and other ones for internal migration)?
>> I would like to see a parameter indicating what agent will be used and
>> keep all other parameters the same.
>
> Agreed.
>
>>> The best way could have been to have generic call for migration, but
>>> we must also have generic objects to describe the migration sources
>>> and destinations and those are not simples.
>> For migration to and from external sources, Lustre must already manage
>> this data in an extended attribute (e.g. to describe the file on tape to
>> which a Lustre file was migrated). This data is opaque to Lustre and
>> can be passed as a blob.
>>> It seems, to support Lustre internal migration, you have planned to
>>> implement specific Agents which will reside on OST.
>> To avoid many complications involving locks, we decided that even the
>> agents used for internal migrations will layer on the file system. The
>> Lustre file system will be mounted on the OST's and it will use the
>> "LOLND" to transport the data efficiently between the OST process and
>> the client file system cache. In the internal case source and
>> destination lie in Lustre in the HSM case only one of them.
>>
>> As a result I believe these two cases are closer together than you may
>> think, and should be one "type".
>
>
> If we unify the API, we must have a way to request some data movement like:
>
> copy elemA in placeP
> copy elemA,stored in placeP bak into Lustre
> copy elemA into placeC
> move elemB into elemB
>
>
> The elem could be unified using Lustre FID, but the places could be an
> external storage, or a precise OST. If we want a unify API, the API call
> should manipulate a generic object which could describe a Lustre storage
> element (ost) or a external storage (hsm,...)
>
> ie:
> struct storage_place {
> ...
> }
> copy(fid,storage_place*)
> move(fid,storage_place*)
>
> and their is some specific cases to handle. The other possibity:
>
> ext_copyout(fid, external storage)
> ext_copyin(fid, external object)
> int_copy(fid, fid, ost)
> int_move(fid, fid, ost)
>
> I think this one, even if the design is not the most beautiful one, if
> the easiest one.
>
> Instead you want to create some new generic objects to manipulate lustre
> object data and generic storage areas, the second case is the best one IMO.
>
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2008-02-29 4:30 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-02-07 16:19 [Lustre-devel] Lustre HSM HLD draft Rick Matthews
2008-02-08 0:03 ` JC.LAFOUCRIERE at CEA.FR
2008-02-08 11:52 ` Rick Matthews
2008-02-08 15:55 ` Aurelien Degremont
2008-02-11 18:18 ` Andreas Dilger
2008-02-11 19:38 ` Peter Braam
2008-02-11 21:11 ` Ricardo M. Correia
2008-02-11 21:39 ` Andreas Dilger
2008-02-11 22:07 ` Ricardo M. Correia
2008-02-11 22:32 ` Nathaniel Rutman
2008-02-11 22:46 ` Rick Matthews
2008-02-12 15:41 ` Aurelien Degremont
2008-02-12 0:25 ` Ricardo M. Correia
-- strict thread matches above, loose matches on Subject: below --
2008-02-07 10:52 DEGREMONT Aurelien
2008-02-08 21:18 ` Nathaniel Rutman
2008-02-11 14:59 ` Aurelien Degremont
2008-02-11 20:33 ` Nathaniel Rutman
2008-02-12 3:55 ` Andreas Dilger
2008-02-12 11:04 ` Eric Barton
2008-02-12 15:25 ` Aurelien Degremont
2008-02-12 17:23 ` Andreas Dilger
2008-02-12 19:43 ` Eric Barton
2008-02-12 23:24 ` Nathaniel Rutman
2008-02-18 21:51 ` Canon, Richard Shane
2008-02-19 17:13 ` Aurelien Degremont
2008-02-25 22:44 ` Peter J Braam
2008-02-21 15:26 ` Aurelien Degremont
2008-02-25 22:38 ` Peter J Braam
2008-02-27 16:51 ` Aurelien Degremont
2008-02-29 4:30 ` Peter Braam
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.