[Lustre-devel] Lustre HSM HLD draft

From: Nathaniel Rutman <Nathan.Rutman@Sun.COM>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] Lustre HSM HLD draft
Date: Mon, 11 Feb 2008 12:33:03 -0800
Message-ID: <47B0B0FF.7060805@sun.com>
In-Reply-To: <47B062BA.1070809@cea.fr>

Aurelien Degremont wrote:
> Nathaniel Rutman wrote:
>   
>> 5.1 external storage list - is this to be stored on the MGS device or a 
>> separate device?  If the coordinator lives on the MGS, why not its 
>> storage as well?  In any case, it should be possible to co-locate the 
>> coordinator on the MGS and use the MGS's storage device, in the same 
>> way that the MGS can currently co-locate with the MDT.
>> How does the coordinator request activity from an agent?  If the 
>> coordinator is the RPC server, then it's up to the agents to make 
>> requests; agents aren't listening for RPC requests themselves.
>>     
>
> Presently, it is never said that the coordinator will live on the MGS.
> The Coordinator constraints are:
>   1 - Must receive various migration requests from OST/MDT.
>   2 - Should be able to communicate with Agents and ask them to perform
>       migrations.
>   3 - Should store configuration and migration logs.
>   
> I think #1 and #2 are two different APIs. The coordinator is clearly an 
> RPC server for the first one. How #2 should be implemented is not so 
> clear. What would be the "Lustre-way" here?
>   
With userspace servers, presumably we have some way of passing LNET
messages from kernel to userspace.  We should probably still go through
LNET for #2 in order to use the broadest range of network fabrics.  So it
could be the same or similar RPC.  There is no "Lustre-way" for this
area - we've never done this kind of thing before.
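
One possible shape for #2, if the coordinator stays a pure RPC server, is a
pull model in which agents ask for work.  The sketch below is only an
illustration of that option, not a proposed design: the class and method
names (Coordinator, submit, get_work, MigrationRequest) are hypothetical and
no LNET transport is modelled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class MigrationRequest:
    fid: str      # hypothetical file identifier
    action: str   # "archive" or "restore"


class Coordinator:
    """Acts only as a server: it queues work and answers agent requests."""

    def __init__(self) -> None:
        self._pending: Deque[MigrationRequest] = deque()

    def submit(self, request: MigrationRequest) -> None:
        # Constraint #1: migration requests arrive from an OST or MDT.
        self._pending.append(request)

    def get_work(self) -> Optional[MigrationRequest]:
        # Constraint #2, pull style: an agent calls this; None means idle.
        return self._pending.popleft() if self._pending else None
```

With this shape the agents never have to listen for incoming requests; the
cost is that an idle agent must poll or hold a long-lived request open.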
> For #3, the few logs that will be backed up here are not huge, and it 
> surely could be colocated with another Target, but I'm not sure this 
> should be mandatory. This device should be available to several servers, 
> for failover like the other Targets. We could imagine having more than 1 
> coordinator in the long term. I'm not sure it is a good idea to tie it to 
> another target.
>   
Not mandatory, but possible is nice.  Minimize the number of required 
partitions.
>   
>> 6.3 object ref should include version number.  Also include checksum?
>>     
>
> For data coherency? Should we add an explicit checksum for those values 
> (stored in an EA) or use a possible backend feature? (Can ZFS and 
> ldiskfs detect EA value corruption by themselves?)
>   
ZFS can, ldiskfs cannot.  Anyhow, it was just a thought.  Doesn't hurt 
to allow space for it.
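
To make "allow space for it" concrete, here is a rough sketch of an object
reference that carries a version number and a checksum.  The field widths
(64-bit object id, 32-bit version, 32-bit CRC32) and the function names are
assumptions for illustration only; the HLD does not define this layout.

```python
import struct
import zlib
from typing import Tuple

# Hypothetical layout: 64-bit object id, 32-bit version, then a 32-bit CRC32.
_BODY = struct.Struct("<QI")
_CRC = struct.Struct("<I")


def pack_object_ref(object_id: int, version: int) -> bytes:
    """Serialize the reference and append a CRC32 over the body."""
    body = _BODY.pack(object_id, version)
    return body + _CRC.pack(zlib.crc32(body))


def unpack_object_ref(blob: bytes) -> Tuple[int, int]:
    """Return (object_id, version), or raise ValueError on corruption."""
    if len(blob) != _BODY.size + _CRC.size:
        raise ValueError("unexpected object reference length")
    body, crc_bytes = blob[:_BODY.size], blob[_BODY.size:]
    (stored_crc,) = _CRC.unpack(crc_bytes)
    if zlib.crc32(body) != stored_crc:
        raise ValueError("object reference checksum mismatch")
    return _BODY.unpack(body)
```

An explicit checksum like this would matter mostly on a backend that cannot
detect EA corruption itself.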
>   
>> 2.1 Archiving one Lustre file
>> There should not be a cache miss when archiving a Lustre file; perhaps 
>> open-by-fid is intended to bypass atime updates
>> so that the file isn't marked as "recently accessed"?
>>     
>> Transparent access - should this avoid modification of atime/mtime?
>
> I would say yes.
>
>   
>> 2.2 Restoring a file
>> "External ID" presumably contains all information required to retrieve 
>> the file - tape #, path name, etc?
>> Once file is copied back, we should probably restore original ctime, 
>> mtime, atime - coordinator is storing this, correct?
>>     
>
> External ID is an opaque value managed by the archiving tool. If the HSM 
> can store a lot of metadata, only a ref is needed; if not, the tool is 
> responsible for storing all the data it needs. Anyway, this is totally 
> opaque to Lustre.
> I hope the HSMs will not need much data in this field. HPSS does not 
> need much; it uses its internal DB to store it. I suppose SAM 
> does as well.
>   
What about restore of original ctime, mtime, atime?  I think we must
store them in the coordinator because we must work with all HSMs, and I
think it is important to restore them.
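
As a userspace illustration of the save/restore step, the sketch below
records atime and mtime and reapplies them after the data is copied back.
The function names are hypothetical, and where the saved values would live
(coordinator or elsewhere) is left open.  Note that os.utime can set only
atime and mtime; ctime cannot be set this way, so restoring it would need
support inside the filesystem.

```python
import os
from typing import Tuple


def save_times(path: str) -> Tuple[int, int]:
    """Return (atime_ns, mtime_ns) to record before the file is purged."""
    st = os.stat(path)
    return st.st_atime_ns, st.st_mtime_ns


def restore_times(path: str, times_ns: Tuple[int, int]) -> None:
    """Reapply the saved atime and mtime after the data is copied back."""
    os.utime(path, ns=times_ns)
```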

>   
>> IV2 - why not multiple purged windows?  Seems like if you're going to 
>> purge 1 object out of a file, you might want to purge more.
>> Specifically, it will probably be a common case to purge every object of 
>> a file from a particular OST.  This is not contiguous in a
>> striped file.
>> I don't see any reason to purge anything smaller than an entire object 
>> on an OST - is there good reason for this? 
>>     
>
> Multiple purged windows are subtle. If you permit this feature, you could 
> technically have, in the worst case, one purged window per byte, and 
> this could be huge to store. Do you think you will make several holes 
> in the same file? In which cases?
>   
Like I said, I don't see any reason to purge anything smaller than a
full object; I would in fact disallow purging of an arbitrary byte range,
and only allow purging on full-object boundaries.
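
To show why purging one whole object is not one contiguous range of the file,
here is a small sketch that lists the file byte ranges held by a single
object.  It assumes plain round-robin (RAID0-style) striping with a fixed
stripe size; the function and parameter names are illustrative.

```python
from typing import List, Tuple


def object_extents(stripe_size: int, stripe_count: int,
                   stripe_index: int, file_size: int) -> List[Tuple[int, int]]:
    """File byte ranges [start, end) held by one object under round-robin striping."""
    extents = []
    start = stripe_index * stripe_size
    while start < file_size:
        extents.append((start, min(start + stripe_size, file_size)))
        # The same object receives its next chunk one full stripe row later.
        start += stripe_size * stripe_count
    return extents
```

For a 10MB file striped over 4 objects with 1MB stripes, the object at index
1 holds the ranges 1-2MB, 5-6MB and 9-10MB, so a single purged window per file
cannot describe it.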
> In fact, the more common case is to totally purge a file which has been 
> migrated to the HSM, and it is only an optimisation to keep the start and 
> the end of the file on disk, to avoid triggering tons of cache misses 
> with commands like "file foo/*" or a tool like Nautilus or Windows 
> Explorer browsing the directory.
>   
Again, since Lustre is optimized to work with 1MB chunks anyhow, I don't
think it helps much to keep less than that in the beginning / end objects,
so I would say just keep the first and last blocks instead.
> The purged window is stored per object, for both OST objects and the MDT object.
> So, if several objects are purged, each object will store its own purged 
> window. But the MDT object describing this file will store a special 
> purged window which starts at the smallest unavailable byte and ends at 
> the first available one. The MDT purged window indicates "if you do I/O 
> in this range, you're not sure the data are there." or "Outside of this 
> area, I guarantee data are present."
> Maintaining multiple purged windows will be a headache, with no real need 
> I think.
> Moreover, people have asked for an OST-object based migration, even if I 
> think whole file migration will be the most common case.
>
>
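
A rough sketch of the single MDT purged window described above, under one
reading of it: the window is the smallest range covering every purged extent
(already mapped to file offsets), and any I/O overlapping it is "not sure".
The names and the half-open [start, end) convention are assumptions.

```python
from typing import Iterable, Optional, Tuple

Extent = Tuple[int, int]  # [start, end) in file offsets


def mdt_purged_window(purged: Iterable[Extent]) -> Optional[Extent]:
    """Single covering window: smallest purged start to largest purged end."""
    extents = [e for e in purged if e[0] < e[1]]
    if not extents:
        return None
    return min(s for s, _ in extents), max(e for _, e in extents)


def may_hit_purged_data(window: Optional[Extent], offset: int, length: int) -> bool:
    """True if an I/O overlaps the window, so the data is not guaranteed present."""
    if window is None or length <= 0:
        return False
    start, end = window
    return offset < end and offset + length > start
```

The window is conservative: an I/O that lands in a gap between two purged
extents still reports True, and the per-object windows then give the exact
answer.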
>> If that's the case, then
>> the OST must keep track of purged objects, not ranges within an existing 
>> object.
>>     
>
> Objects are not removed, only their data. All metadata is kept.
>
>   
>> If the MDT is tracking purged areas also, then there's a good potential 
>> synergy here with a missing OST --
>> If the missing OST's objects are marked as purged, then we can 
>> potentially recover them automatically from
>> HSM...
>>     
>
> What do you call a "missing OST"? A corrupt one? An offline one? 
> An unavailable one?
>   
Yes.  All of the above.  Obviously we need to distinguish between
"permanently gone" and "temporarily gone".
> Where will you copy back the object data ? On another OST object ?
>   
Yes.  Some kind of recovery will take place to generate a new object on
a different OST and we can restore the data there.
> With the purged window on each OST object and MDT and the file striping 
> info, we could easily restore the missing parts.
>   
Exactly.  This is why I say we should think about this now, to allow for
this possibility.
>   
>> 4.2 How is a purge request recovered?  For example, MDT says purge obj1 
>> from ost1, ost1 replies "ok", but then dies before it actually
>> does the purge.  Reboots, doesn't know anything about purge request now, 
>> but MDT has marked it as purged.
>>     
>
> The OST asynchronously acknowledges the purge when it is done. The MDT 
> marks it purged only when it is really done. I will clarify this.
>
>   
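
The purge handshake in 4.2 can be sketched as a small state machine on the
MDT side: the object is marked purged only when the OST's asynchronous
completion arrives, never on the initial "ok".  The class, the state names
and the restart handling are hypothetical; how a lost request is really
detected and reissued is not specified in this thread.

```python
from enum import Enum
from typing import Dict


class PurgeState(Enum):
    PRESENT = "present"
    PURGE_REQUESTED = "purge_requested"
    PURGED = "purged"


class MdtPurgeTracker:
    """Hypothetical MDT-side bookkeeping for per-object purge state."""

    def __init__(self) -> None:
        self._state: Dict[str, PurgeState] = {}

    def state(self, obj: str) -> PurgeState:
        return self._state.get(obj, PurgeState.PRESENT)

    def request_purge(self, obj: str) -> None:
        # The request has been sent; the OST's immediate "ok" changes nothing more.
        if self.state(obj) is PurgeState.PRESENT:
            self._state[obj] = PurgeState.PURGE_REQUESTED

    def on_purge_done(self, obj: str) -> None:
        # Only the asynchronous completion ack marks the object as purged.
        if self.state(obj) is PurgeState.PURGE_REQUESTED:
            self._state[obj] = PurgeState.PURGED

    def on_ost_restart(self, obj: str) -> None:
        # Assumption: an unfinished request is forgotten, so it can be reissued.
        if self.state(obj) is PurgeState.PURGE_REQUESTED:
            self._state[obj] = PurgeState.PRESENT
```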
>> V2.1 How long does OST wait for completion?  Is there a timeout?    We 
>> probably need a "no timeout if progress is being
>> made" kind of function - clients currently do this kind of thing with OSTs.
>>     
>
> I'm sure Lustre already has similar mechanisms for optimized timeouts in 
> this kind of situation that we could reuse here.
> What you describe is a good approach I think.
>
>   
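
A minimal sketch of a "no timeout if progress is being made" check, for
illustration only: the timer restarts whenever the reported byte count
advances.  This is not the existing Lustre mechanism; the class name, the
byte-count progress measure and the injectable clock are assumptions.

```python
import time
from typing import Callable


class ProgressTimeout:
    """Expires only after `idle_limit` seconds with no reported progress."""

    def __init__(self, idle_limit: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._idle_limit = idle_limit
        self._clock = clock
        self._last_progress = clock()
        self._bytes_done = 0

    def report(self, bytes_done: int) -> None:
        # Reset the idle timer only when the transfer actually advanced.
        if bytes_done > self._bytes_done:
            self._bytes_done = bytes_done
            self._last_progress = self._clock()

    def expired(self) -> bool:
        return self._clock() - self._last_progress > self._idle_limit
```

A long restore from tape then never times out as long as each report shows
more bytes than the last, while a stalled copy is abandoned after the idle
limit.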
>> V2.2 No need to copy-in purged data on full-object-size writes.
>>     
>
> True. We could add such an optimization. But this is only useful for small 
> files or very widely striped ones, isn't it?
>   
No, we very frequently write entire stripes (objects).  Lustre clients 
can optimize for this.
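
The check itself is simple; as a sketch, assuming the whole object was purged
and offsets are relative to the start of that object (the function name is
made up for illustration):

```python
def needs_copy_in(write_offset: int, write_length: int, object_size: int) -> bool:
    """True if purged data must be restored before the write can be applied."""
    covers_whole_object = (write_offset == 0 and
                           write_offset + write_length >= object_size)
    # A write that replaces every byte makes the old contents irrelevant.
    return not covers_whole_object
```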
>
> Thanks for your comments.
>
>
