* [Lustre-devel] "Simple" HSM straw man
@ 2008-10-09 22:37 Nathaniel Rutman
  2008-10-13 16:39 ` Aurelien Degremont
  0 siblings, 1 reply; 14+ messages in thread
From: Nathaniel Rutman @ 2008-10-09 22:37 UTC (permalink / raw)
  To: lustre-devel

This is intended to be a starting point for discussion; the concepts 
here have been hashed through a few times and hopefully represent the 
best current thinking.


Baseline concepts
1. all single-file coherency issues are in kernel space (file locking, 
recovery)
2. all policy decisions are in user space (using changelogs, df, etc)
3. coordinator/mover communication will use LNET
4. "simple" refers to
    a. integration with HPSS only
    b. depends on changelog for policy decisions
    c. restore on file open, not data read/write
5. HSM tracks entire files, not stripe objects
6. HSM namespace is flat, all files are addressed by FID only
7. Desired: coordinator and movers can be reused by (non-HSM) replication


Components
1. Mover
    a. combined kernel (LNET comms) and userspace processes
    b. userspace processes will use Lustre clients for data i/o
    c. will use special fid directory for file access (.lustre/fid/XXXX)
    d. interfaces with hardware-specific copy tool to access HSM files
    e. kernel process encompasses service threads listening for 
coordinator requests, passes these up to userspace process via upcall.  
No interaction with the client is needed; this is a simple message 
passing service.
2. Coordinator
    a. decides and dispatches copyin and copyout requests to movers
    b. consolidates repeat requests
    c. re-queues requests to a new agent if an agent becomes unresponsive
    d. kernel space, associated with the MDT for cache-miss
    e. ioctl interface for copyout, purge requests from policy engine
3. Policy engine (aka Space Manager)
    a. makes policy decisions for copyout, purge
    b. normally uses changelogs and 'df' for input; rarely is allowed to 
scan filesystem
    c. userspace process, requests copyout and purge via ioctl to 
coordinator
4. MDT changes
    a. Per-file layout lock

    A new layout lock is created for every file.  Private writer lock is
    taken by the MDT when allocating/changing file layout (LOV EA). 
    Shared reader locks are taken by anyone reading the layout (client
    opens, lfs getstripe).  Anyone taking a new extent lock anywhere in
    the file must first hold the layout lock.
    Problem: Layout lock can't be held by liblustre during i/o? 

    b. lov EA changes
       i.  flags: file_is_purged "purged", copyout_begin, 
file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete.  The purged 
flag is always manipulated under a write layout lock, the other flags 
are not.
       ii: "window" EA range of non-purged data (rev2)
    c. new file ioctls: HSMCopyOut, HSMPurge, HSMCopyinDone


Algorithms
1. copyout
    a. Policy engine decides to copy a file to HSM, executes HSMCopyOut 
ioctl on file
    b. ioctl handled by MDT, which passes request to Coordinator
    c. coordinator dispatches request to mover.  request should include 
file extents (for future purposes)
    d. normal extents read lock is taken by mover running on client
    e. mover sets "copyout_begin" bit and clears "hsm_dirty" bit in EA.
    f. any writes to the file set the "hsm_dirty" bit (may be 
lazy/delayed with mtime or filesize change updates on MDT).  Note that 
file writes need not cancel copyout; for a fs with a single big file, we 
don't want to keep interrupting copyout or it will never finish. 
    g. when done, mover checks hsm_dirty bit.  If set, clears 
copyout_begin, indicating current file is not in HSM.  If not set,  
mover sets "copyout_complete" bit.  File layout write lock is not taken 
during mover flag manipulation.  (Note: file modifications after copyout 
is complete will have both copyout_complete and hsm_dirty bits set.)
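
As a rough illustration of steps e through g, the flag handling could be modelled like this. It is only a sketch: the Python names (HsmFlags, copyout_begin(), file_written(), copyout_done()) are made up for the example and are not a proposed interface, and the real bits would live in the LOV EA.

```python
from enum import Flag, auto


class HsmFlags(Flag):
    """Per-file HSM bits; names follow the flag list in this proposal."""
    PURGED = auto()
    COPYOUT_BEGIN = auto()
    HSM_DIRTY = auto()
    COPYOUT_COMPLETE = auto()


def copyout_begin(flags):
    # Step e: mover sets copyout_begin and clears hsm_dirty.
    return (flags | HsmFlags.COPYOUT_BEGIN) & ~HsmFlags.HSM_DIRTY


def file_written(flags):
    # Step f: any write marks the HSM copy as out of date.
    return flags | HsmFlags.HSM_DIRTY


def copyout_done(flags):
    # Step g: if a write raced with the copy, the file is not in HSM,
    # so copyout_begin is cleared; otherwise the copy is complete.
    if flags & HsmFlags.HSM_DIRTY:
        return flags & ~HsmFlags.COPYOUT_BEGIN
    return flags | HsmFlags.COPYOUT_COMPLETE
```

A write that lands after copyout_done() leaves both COPYOUT_COMPLETE and HSM_DIRTY set, which matches the note in step g.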

2. purge (aka punch)
    a. Policy engine decides to purge a file, executes HSMPurge ioctl on 
file
    b. ioctl handled by MDT
    c. MDT takes a write lock on the file layout lock
    d. MDT enqueues write locks on all extents of the file.  After these 
are granted, no client has any dirty cache and no client can take 
new extent locks until the layout lock is released.  MDT drops all extent locks.
    e. MDT verifies that hsm_dirty bit is clear and copyout_complete bit 
is set
    f. MDT marks the LOV EA as "purged"
    g. MDT sends destroys for the OST objects, using destroy llog entries to 
guard against object leakage during OST failover
    h. MDT drops layout lock.
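
The purge eligibility check could be sketched as follows. This is a toy model under loud assumptions: MdtFile and purge() are hypothetical names, the layout lock is stood in for by a plain mutex, and the extent-lock flush of step d is not modelled at all.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class MdtFile:
    """Hypothetical stand-in for the MDT's view of one file."""
    hsm_dirty: bool = False
    copyout_complete: bool = False
    purged: bool = False
    ost_objects: list = field(default_factory=list)
    layout_lock: threading.Lock = field(default_factory=threading.Lock)


def purge(f):
    """Return True if the file was purged, False if it was not eligible."""
    with f.layout_lock:                  # steps c and h: hold, then drop
        # Step d (flushing client caches via extent locks) is omitted here.
        if f.hsm_dirty or not f.copyout_complete:   # step e
            return False
        f.purged = True                  # step f
        f.ost_objects.clear()            # step g: objects are destroyed
        return True
```

The point of the ordering is that the dirty/complete check and the purged marking both happen under the layout write lock, so no new extent lock can slip in between them.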

3. restore (aka copyin aka cache miss)
    a. Client open intent enqueues layout read lock. 
    b. MDT checks "purged" bit; if purged, lock request response 
includes "wait forever" flag, causing client to block the open.
    c. MDT creates a new layout with a similar stripe pattern as the 
original, allocating new objects on new OSTs.  (We should try to respect 
specific layout settings (pool, stripecount, stripesize), but be 
flexible if e.g. pool doesn't exist anymore.  Maybe we want to ignore 
offset and/or specific ost allocations in order to rebalance.)
    d. MDT sends request to coordinator requesting copyin of the file to 
.lustre/fid/XXXX with extents 0-EOF. Extents may be used in the future 
to (a) copy in part of a file, in low-disk-space situations; (b) copy in 
individual stripes simultaneously on multiple OSTs.
    e. Coordinator distributes that request to an appropriate mover.
    f. Writes into .lustre/fid/* are not required to hold layout read 
lock (or special flag is passed to open, or group write lock on layout 
is passed to mover)
    g. Mover copies data from HSM
    h. When finished, mover calls ioctl HSM_COPYIN_DONE on the file
    i. MDT clears "purged" bit from LOV EA
    j. MDT releases the layout lock
    k. This sends a completion AST to the original client, who now 
completes his open. 
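
The blocking-open behaviour of the restore path could be pictured with the following userspace toy. Everything here is an assumption made for illustration: PurgedFile, client_open() and mover_copyin() are invented names, a threading.Event stands in for the layout lock plus completion AST, and a timeout replaces the "wait forever" flag so the example cannot hang.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class PurgedFile:
    """Hypothetical purged file; 'restored' models the completion AST."""
    purged: bool = True
    data: bytes = b""
    restored: threading.Event = field(default_factory=threading.Event)


def mover_copyin(f, hsm_copy):
    f.data = hsm_copy        # step g: mover copies data from HSM
    f.purged = False         # steps h-i: copyin done, purged bit cleared
    f.restored.set()         # steps j-k: lock released, client is woken


def client_open(f, request_copyin, timeout=5.0):
    if f.purged:             # step b: open blocks while the file is purged
        request_copyin(f)    # steps d-e: request goes out to a mover
        if not f.restored.wait(timeout):
            raise TimeoutError("copyin did not complete")
    return f.data
```

For example, passing a request_copyin callback that starts mover_copyin() in a background thread makes client_open() return the restored data once the mover signals completion.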


State machines
TBD - I think there's enough in here to chew on for a while

Things requiring a more detailed look
1. configuration of HSM/movers
2. policy engine
3. "complex" HSM roadmap
    a. partial access to files during restore
    b. partial purging for file type identification, image thumbnails, ??
    c. integration with other HSM backends (ADM, ??)
4. layout locks and liblustre


* [Lustre-devel] "Simple" HSM straw man
  2008-10-09 22:37 [Lustre-devel] "Simple" HSM straw man Nathaniel Rutman
@ 2008-10-13 16:39 ` Aurelien Degremont
  2008-10-14 19:41   ` Nathaniel Rutman
  2008-10-20 23:09   ` [Lustre-devel] HSM arch wiki Nathaniel Rutman
  0 siblings, 2 replies; 14+ messages in thread
From: Aurelien Degremont @ 2008-10-13 16:39 UTC (permalink / raw)
  To: lustre-devel

Nathaniel Rutman wrote:
>     c. restore on file open, not data read/write

Take care: it may be difficult to move this behaviour to 
restore-on-first-I/O later.

>     d. interfaces with hardware-specific copy tool to access HSM files
rather "HSM-specific"

>     e. kernel process encompasses service threads listening for 
> coordinator requests, passes these up to userspace process via upcall.  
> No interaction with the client is needed; this is a simple message 
> passing service.

It depends on how you manage a user-space process but, AFAIK, 
managing the copy tool process means being able to:
  - start this process
  - send a signal
  - get its output (for "complex" hsm, we will need feedback from copy 
tool process)
  - wait for the process to end
  - ...
All of this is easily doable from userspace, and very hard in 
kernel-space (we cannot use the fire-and-forget call call_usermodehelper).
So I would rather imagine:
  - a kernel space mover, simply getting LNET messages and passing them 
to user-space mover
  - a user-space mover, forking, spawning and managing the copy tool 
process.

Maybe it will need to manage several copy tool processes, so it will 
need queues, process list, etc...

So I think this tool needs a bit more than just a "simple message 
passing service".
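
To make the process-management list above concrete, here is a minimal sketch of what a user-space mover might do, written with Python's subprocess module. CopyToolManager and FAKE_COPYTOOL are hypothetical names, and the "copy tool" is just a stand-in Python one-liner, not a real HSM tool.

```python
import subprocess
import sys

# Stand-in for an HSM-specific copy tool (an assumption for this example).
FAKE_COPYTOOL = [sys.executable, "-c", "print('copied')"]


class CopyToolManager:
    """Tracks several copy tool processes, keyed by PID."""

    def __init__(self):
        self.running = {}

    def start(self, argv):
        # Start the process and capture its output for later feedback.
        proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
        self.running[proc.pid] = proc
        return proc.pid

    def abort(self, pid):
        # Send a signal (SIGTERM on POSIX) to the copy tool.
        self.running[pid].terminate()

    def wait(self, pid):
        # Wait for the process to end; return its exit status and output.
        proc = self.running.pop(pid)
        output, _ = proc.communicate()
        return proc.returncode, output
```

Even this toy needs a process table and a way to collect output, which is the extra state a pure message-passing service would not have.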

>     b. lov EA changes
>        i.  flags: file_is_purged "purged", copyout_begin, 
> file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete.  The purged 
> flag is always manipulated under a write layout lock, the other flags 
> are not.
>        ii: "window" EA range of non-purged data (rev2)

If you add a window EA (will be needed anyway for hsm v2), you do not 
need a purged flag:

window.start == window.end is comparable to a purged flag unset (or 
window.end == 0).


>     c. new file ioctls: HSMCopyOut, HSMPurge, HSMCopyinDone
> 
> 
> Algorithms
> 1. copyout
>     a. Policy engine decides to copy a file to HSM, executes HSMCopyOut 
> ioctl on file
>     b. ioctl handled by MDT, which passes request to Coordinator
>     c. coordinator dispatches request to mover.  request should include 
> file extents (for future purposes)
>     d. normal extents read lock is taken by mover running on client
>     e. mover sets "copyout_begin" bit and clears "hsm_dirty" bit in EA.
>     f. any writes to the file set the "hsm_dirty" bit (may be 
> lazy/delayed with mtime or filesize change updates on MDT).  Note that 
> file writes need not cancel copyout; for a fs with a single big file, we 
> don't want to keep interrupting copyout or it will never finish. 

Is it useful to have a file that is outdated and possibly incoherent?

>     g. when done, mover checks hsm_dirty bit.  If set, clears 
> copyout_begin, indicating current file is not in HSM.  If not set,  
> mover sets "copyout_complete" bit.  File layout write lock is not taken 
> during mover flag manipulation.  (Note: file modifications after copyout 
> is complete will have both copyout_complete and hsm_dirty bits set.)
> 
> 2. purge (aka punch)
>     a. Policy engine decides to purge a file, executes HSMPurge ioctl on 
> file
>     b. ioctl handled by MDT
>     c. MDT takes a write lock on the file layout lock
>     d. MDT enqueues write locks on all extents of the file.  After these 
> are granted, then no client has any dirty cache and no child can take 
> new extent locks until layout lock is released.  MDT drops all extent locks.
>     e. MDT verifies that hsm_dirty bit is clear and copyout_complete bit 
> is set
>     f. MDT marks the LOV EA as "purged"
>     g. MDT sends destroys the OST objects, using destroy llog entries to 
> guard against object leakage during OST failover

Are you sure you want to remove those objects if we will need them 
later, in "complex" HSM?
As this mechanism will need to change a lot when we implement the 
restore-in-place feature, I'm not sure this is the best idea.


>     h. MDT drops layout lock.
> 
> 3. restore (aka copyin aka cache miss)
>     a. Client open intent enqueues layout read lock. 
>     b. MDT checks "purged" bit; if purged, lock request response 
> includes "wait forever" flag, causing client to block the open.
>     c. MDT creates a new layout with a similar stripe pattern as the 
> original, allocating new objects on new OSTs.  (We should try to respect 
> specific layout settings (pool, stripecount, stripesize), but be 
> flexible if e.g. pool doesn't exist anymore.  Maybe we want to ignore 
> offset and/or specific ost allocations in order to rebalance.)
>     d. MDT sends request to coordinator requesting copyin of the file to 
> .lustre/fid/XXXX with extents 0-EOF. Extents may be used in the future 
> to (a) copy in part of a file, in low-disk-space situations; (b) copy in 
> individual stripes simultaneously on multiple OSTs.
>     e. Coordinator distributes that request to an appropriate mover.
>     f. Writes into .lustre/fid/* are not required to hold layout read 
> lock (or special flag is passed to open, or group write lock on layout 
> is passed to mover)
>     g. Mover copies data from HSM
>     h. When finished, mover calls ioctl HSM_COPYIN_DONE on the file
>     i. MDT clears "purged" bit from LOV EA
>     j. MDT releases the layout lock
>     k. This sends a completion AST to the original client, who now 
> completes his open. 




Concerning the new copyout_begin/copyout_complete flags: I'm not an 
ldlm/recovery specialist, but is it possible to have the mover take a 
kind of write extent lock on the area it has to copy in/out and 
downgrade it to a smaller range as the copy tool goes along?

Copy-out
- Mover takes a specific lock on a range (0-EOF for the moment).
- On this range, reads pass; writes raise a callback on the mover.
- On receiving this callback, if the mover releases its lock, the copyout is 
cancelled; if not, the write i/o is blocked.
- When the mover has copied [0 - cursor], it can downgrade its lock to 
[cursor - EOF] and release the lock on [0 - cursor].

Same thing could be done for copy in.
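
To illustrate the idea, the shrinking range could be modelled like this. It is purely a sketch: MoverLock, advance() and write_conflicts() are invented names, and nothing here claims that ldlm offers such an operation.

```python
from dataclasses import dataclass

EOF = float("inf")


@dataclass
class MoverLock:
    """Hypothetical range lock held by the mover over [start, end)."""
    start: int = 0
    end: float = EOF

    def advance(self, cursor):
        # [start, cursor) has been copied: keep only [cursor, end).
        if cursor < self.start:
            raise ValueError("cursor cannot move backwards")
        self.start = min(cursor, self.end)

    def write_conflicts(self, offset, length):
        # True if a write to [offset, offset + length) overlaps the
        # range still locked, i.e. it would raise a callback on the mover.
        return offset + length > self.start and offset < self.end
```

Once the cursor has passed a region, writes there no longer conflict with the copy, so only the not-yet-copied tail of the file is protected.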

The two key points are:
  - Could we have a layout lock on a specific range?
  - Is it possible to downgrade a range lock with ldlm?



-- 
Aurelien Degremont
CEA


* [Lustre-devel] "Simple" HSM straw man
  2008-10-13 16:39 ` Aurelien Degremont
@ 2008-10-14 19:41   ` Nathaniel Rutman
  2008-10-15 21:52     ` Nathaniel Rutman
  2008-10-16 21:56     ` Eric Barton
  2008-10-20 23:09   ` [Lustre-devel] HSM arch wiki Nathaniel Rutman
  1 sibling, 2 replies; 14+ messages in thread
From: Nathaniel Rutman @ 2008-10-14 19:41 UTC (permalink / raw)
  To: lustre-devel

Aurelien Degremont wrote:
> Nathaniel Rutman wrote:
>>     c. restore on file open, not data read/write
>
> take care of the difficulties to move this behaviour to a 
> restore-on-first-I/O later.
Indeed.  From a client point of view, it only changes which locks it is 
waiting on, but from a server point of view the OSTs would need to 
become involved in HSM knowledge.  It is more work, but I don't think 
there would be much "throwaway" code from the former to the latter.
>
>>     d. interfaces with hardware-specific copy tool to access HSM files
> rather "HSM-specific"
>
>>     e. kernel process encompasses service threads listening for 
>> coordinator requests, passes these up to userspace process via 
>> upcall.  No interaction with the client is needed; this is a simple 
>> message passing service.
>
> Depending on how you can manage a user-space process, but, AFAIK, to 
> be able to manage the copy tool process, that means:
>  - start this process
>  - send a signal
>  - get its output (for "complex" hsm, we will need feedback from copy 
> tool process)
>  - wait for the process end
>  - ...
> All of this is easily doable from userspace, and very hard in 
> kernel-space (we cannot use the fire-and-forget call 
> call_usermodehelper).
> So I rather imagine:
>  - a kernel space mover, simply getting LNET messages and passing them 
> to user-space mover
>  - a user-space mover, forking, spawning and managing the copy tool 
> process.
>
> Maybe it will need to manage several copy tool processes, so it will 
> need queues, process list, etc...
>
> So I think this tool needs a bit more than just a "simple message 
> passing service".
As we discussed in the HSM concall this morning, the return path can 
mostly take place through the file itself via ioctl calls.  The mover 
will open the destination file location in Lustre and then can indicate 
status through an ioctl: starting, waiting for HSM, periodic pinging or 
"% complete" messages, copyin complete.  This is the "fire-and-forget" 
model, and can be started from call_usermodehelper.  The in-kernel code 
will only have to deal with one-way requests from coordinator to mover.

We also specified 4 types of requests from coordinator:
1. copyin FID
2. copyout FID
3. abort copy(in|out) FID
4. purge FID from HSM

To accomplish 3, it might make sense to store the PID of the process 
started from the upcall in the kernel (this is returned by the upcall).  
Closing the file could clear the pid from the kernel list.
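
A small model of the four request types and the PID bookkeeping might look like the following. The names Request, UpcallPids and handle() are hypothetical, FIDs are plain strings here, and the spawn callback stands in for the upcall that returns the PID.

```python
from enum import Enum


class Request(Enum):
    COPYIN = 1
    COPYOUT = 2
    ABORT = 3
    PURGE = 4


class UpcallPids:
    """Remembers which upcall process is working on which FID."""

    def __init__(self):
        self._pid_by_fid = {}

    def started(self, fid, pid):
        self._pid_by_fid[fid] = pid

    def closed(self, fid):
        # Closing the file clears the PID from the list.
        self._pid_by_fid.pop(fid, None)

    def abort_target(self, fid):
        return self._pid_by_fid.get(fid)


def handle(table, request, fid, spawn):
    # ABORT only looks up the PID to signal; it starts no new process.
    if request is Request.ABORT:
        return table.abort_target(fid)
    pid = spawn(request, fid)
    table.started(fid, pid)
    return pid
```

An abort for a FID whose file has already been closed finds no PID, so there is nothing left to signal.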
>
>>     b. lov EA changes
>>        i.  flags: file_is_purged "purged", copyout_begin, 
>> file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete.  The purged 
>> flag is always manipulated under a write layout lock, the other flags 
>> are not.
>>        ii: "window" EA range of non-purged data (rev2)
>
> If you add a window EA (will be needed anyway for hsm v2), you do not 
> need a purged flag:
>
> window.start ==window.end is comparable to a purged flag unset. (or 
> window.end == 0)
True, but I don't really see a large market for partially purged files, 
so I don't really believe that it is worth the effort.  One of the 
important points here is that we are deleting stripes off the OSTs, 
freeing up space, and we won't necessarily restore to those same OSTs.  
As soon as we have partially purged files that's no longer the case, and 
I think that complicates things too much.
>
>
>>     c. new file ioctls: HSMCopyOut, HSMPurge, HSMCopyinDone
>>
>>
>> Algorithms
>> 1. copyout
>>     a. Policy engine decides to copy a file to HSM, executes 
>> HSMCopyOut ioctl on file
>>     b. ioctl handled by MDT, which passes request to Coordinator
>>     c. coordinator dispatches request to mover.  request should 
>> include file extents (for future purposes)
>>     d. normal extents read lock is taken by mover running on client
>>     e. mover sets "copyout_begin" bit and clears "hsm_dirty" bit in EA.
>>     f. any writes to the file set the "hsm_dirty" bit (may be 
>> lazy/delayed with mtime or filesize change updates on MDT).  Note 
>> that file writes need not cancel copyout; for a fs with a single big 
>> file, we don't want to keep interrupting copyout or it will never 
>> finish. 
>
> Is it interesting to have a file that is outdated and possibly 
> incoherent?
It is probably useful in some cases -- simulation checkpoints maybe.
>
>>     g. when done, mover checks hsm_dirty bit.  If set, clears 
>> copyout_begin, indicating current file is not in HSM.  If not set,  
>> mover sets "copyout_complete" bit.  File layout write lock is not 
>> taken during mover flag manipulation.  (Note: file modifications 
>> after copyout is complete will have both copyout_complete and 
>> hsm_dirty bits set.)
>>
>> 2. purge (aka punch)
>>     a. Policy engine decides to purge a file, executes HSMPurge ioctl 
>> on file
>>     b. ioctl handled by MDT
>>     c. MDT takes a write lock on the file layout lock
>>     d. MDT enqueues write locks on all extents of the file.  After 
>> these are granted, then no client has any dirty cache and no child 
>> can take new extent locks until layout lock is released.  MDT drops 
>> all extent locks.
>>     e. MDT verifies that hsm_dirty bit is clear and copyout_complete 
>> bit is set
>>     f. MDT marks the LOV EA as "purged"
>>     g. MDT sends destroys the OST objects, using destroy llog entries 
>> to guard against object leakage during OST failover
>
> Are you sure you want to remove those objects if we will need them 
> later, in "complex" HSM?
> As this mechanism will need to change a lot when we will implement the 
> restore-in-place feature, i'm not sure this is the best idea.
Ah, I think it is important that we do NOT restore in place to the old 
OST objects.  The OSTs may now be full, or indeed not exist anymore.  
The restore in place for complex HSM is at the file level; the objects 
may move around.  "Complex" in this case just means that clients will 
have access to partially restored files.
>
>
>>     h. MDT drops layout lock.
>>
>> 3. restore (aka copyin aka cache miss)
>>     a. Client open intent enqueues layout read lock.
>>     b. MDT checks "purged" bit; if purged, lock request response 
>> includes "wait forever" flag, causing client to block the open.
>>     c. MDT creates a new layout with a similar stripe pattern as the 
>> original, allocating new objects on new OSTs.  (We should try to 
>> respect specific layout settings (pool, stripecount, stripesize), but 
>> be flexible if e.g. pool doesn't exist anymore.  Maybe we want to 
>> ignore offset and/or specific ost allocations in order to rebalance.)
>>     d. MDT sends request to coordinator requesting copyin of the file 
>> to .lustre/fid/XXXX with extents 0-EOF. Extents may be used in the 
>> future to (a) copy in part of a file, in low-disk-space situations; 
>> (b) copy in individual stripes simultaneously on multiple OSTs.
>>     e. Coordinator distributes that request to an appropriate mover.
>>     f. Writes into .lustre/fid/* are not required to hold layout read 
>> lock (or special flag is passed to open, or group write lock on 
>> layout is passed to mover)
>>     g. Mover copies data from HSM
>>     h. When finished, mover calls ioctl HSM_COPYIN_DONE on the file
>>     i. MDT clears "purged" bit from LOV EA
>>     j. MDT releases the layout lock
>>     k. This sends a completion AST to the original client, who now 
>> completes his open. 
>
>
>
>
> Concerning the new flag copyout_begin/copyout_complete, I'm not a 
> ldlm/recovery specialist but is it possible to have the mover to take 
> a kind of write extent lock on the area it has to copied in/out and 
> downgrade it on a smaller range as the copy tool goes along.
This is called "lock conversion" and is not yet implemented, but has 
been a general Lustre design goal for some time.  So yes, for "complex" 
HSM this is what we would want to do.
>
> Copy-out
> - Mover take a specific lock on range (0-EOF for the moment)
> - On this range, reads pass, writes raise a callback on the mover.
> - Receiving this callback, if the mover release its lock, the copyout 
> is cancelled, if not, the write i/o is blocked 
I don't think we want to block the write just because the HSM copy isn't 
done yet.  If the data is changing, then the policy engine shouldn't 
have started a copyout process in the first place.  If the customer's 
goal is to do a coherent checkpoint, then it should explicitly wait for 
the copyout to be done.  If it's just the policy engine that got it 
wrong, it doesn't matter if it finishes or not; the file will be marked 
"hsm_dirty", and so the policy engine should re-queue it for copyout 
again later, and it can't be purged in the meantime since the dirty bit 
is set.

> - When the mover has copied [0 - cursor], it can downgrade its lock to 
> [cursor - EOF] and release the lock on [ 0 - cursor].
>
> Same thing could be done for copy in.
>
> The two key points are:
>  - Could we have a layout lock on a specific range?
Not the layout lock - layout means the striping pattern, and must be 
held first before any extent locks can be taken.  So I think what you 
are asking we plan to do with two locks: the layout lock plus another 
extent lock.
>  - Is it possible to downgrade a range lock with ldlm?
>
Not yet, but as I said, lock conversion is a general Lustre goal.


* [Lustre-devel] "Simple" HSM straw man
  2008-10-14 19:41   ` Nathaniel Rutman
@ 2008-10-15 21:52     ` Nathaniel Rutman
  2008-10-16 14:09       ` Aurelien Degremont
  2008-10-16 21:56     ` Eric Barton
  1 sibling, 1 reply; 14+ messages in thread
From: Nathaniel Rutman @ 2008-10-15 21:52 UTC (permalink / raw)
  To: lustre-devel

Nathaniel Rutman wrote:
>
>>>     b. lov EA changes
>>>        i.  flags: file_is_purged "purged", copyout_begin, 
>>> file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete.  The purged 
>>> flag is always manipulated under a write layout lock, the other flags 
>>> are not.
>>>        ii: "window" EA range of non-purged data (rev2)
>>>       
>> If you add a window EA (will be needed anyway for hsm v2), you do not 
>> need a purged flag:
>>
>> window.start ==window.end is comparable to a purged flag unset. (or 
>> window.end == 0)
>>     
> True, but I don't really see a large market for partially purged files, 
> so I don't really believe that it is worth the effort.  One of the 
> important points here is that we are deleting stripes off the OSTs, 
> freeing up space, and we won't necessarily restore to those same OSTs.  
> As soon as we have partially purged files that's no longer the case, and 
> I think complicates things too much.
>   
Ok, I've been told I'm dead wrong here, and this will absolutely be 
required for "complex" HSM (not "simple"), and so therefore we should at 
least think about the arch now.  Supposedly we need to keep X bytes at 
the beginning of the file for the unix "file" command, and supposedly 
icon/preview data, and Y bytes at the end of the file, not sure exactly 
why.
We would still plan on deleting the OST objects in the middle.  And 
clearly, a simple beginning/ending byte count is insufficient for the 
final "complex" requirement of enabling partial file reads while doing a 
copyin (where we would at a minimum need a per-object cursor). 
Anyhow, as I write this none of this sounds like something that can't be 
implemented at a later time, so I think we should stick with the 
simplest of the simple options for now.


* [Lustre-devel] "Simple" HSM straw man
  2008-10-15 21:52     ` Nathaniel Rutman
@ 2008-10-16 14:09       ` Aurelien Degremont
  0 siblings, 0 replies; 14+ messages in thread
From: Aurelien Degremont @ 2008-10-16 14:09 UTC (permalink / raw)
  To: lustre-devel

Nathaniel Rutman wrote:
> Ok, I've been told I'm dead wrong here, and this will absolutely be 
> required for "complex" HSM (not "simple"), and so therefore we should at 
> least think about the arch now.  Supposedly we need to keep X bytes at 
> the beginning of the file for the unix "file" command, and supposedly 
> icon/preview data, and Y bytes at the end of the file, not sure exactly 
> why.
> We would still plan on deleting the OST objects in the middle.  And 
> clearly, a simple beginning/ending byte count is insufficient for the 
> final "complex" requirement of enabling partial file reads while doing a 
> copyin (where we would at a minimum need a per-object cursor). Anyhow, 
> as I write this none of this sounds like something that can't be 
> implemented at a later time, so I think we should stick with the 
> simplest of the simple options for now.
>

Ok.
Can you just sum up the in-place copy-in mechanism that has been decided 
(between the Menlo Park version and the other ones)?


-- 
Aurelien Degremont
CEA


* [Lustre-devel] "Simple" HSM straw man
  2008-10-14 19:41   ` Nathaniel Rutman
  2008-10-15 21:52     ` Nathaniel Rutman
@ 2008-10-16 21:56     ` Eric Barton
  2008-10-17  9:47       ` Aurelien Degremont
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Barton @ 2008-10-16 21:56 UTC (permalink / raw)
  To: lustre-devel

Nathan,

> True, but I don't really see a large market for partially purged files, 
> so I don't really believe that it is worth the effort.  One of the 
> important points here is that we are deleting stripes off the OSTs, 
> freeing up space, and we won't necessarily restore to those same OSTs.  
> As soon as we have partially purged files that's no longer the case, and 
> I think complicates things too much.

Partially purged files are a requirement to allow graphical file browsers
to retrieve icons from within the file.  It's OK to miss this out in the
first version, but it has to be there for the full product.

> >> Algorithms
> >> 1. copyout
> >>     a. Policy engine decides to copy a file to HSM, executes 
> >>        HSMCopyOut ioctl on file
> >>     b. ioctl handled by MDT, which passes request to Coordinator
> >>     c. coordinator dispatches request to mover.  request should 
> >>        include file extents (for future purposes)
> >>     d. normal extents read lock is taken by mover running on client
> >>     e. mover sets "copyout_begin" bit and clears "hsm_dirty" bit in EA.
> >>     f. any writes to the file set the "hsm_dirty" bit (may be 
> >>        lazy/delayed with mtime or filesize change updates on MDT).  Note 
> >>        that file writes need not cancel copyout; for a fs with a single big 
> >>        file, we don't want to keep interrupting copyout or it will never 
> >>        finish. 
> >
> > Is it interesting to have a file that is outdated and possibly 
> > incoherent?
> It is probably useful in some cases -- simulation checkpoints maybe.

A corrupt simulation checkpoint is useless.  We _must_ provide a way to
ensure the HSM copy of a file is a known good snapshot.  We don't necessarily
have to abort the copyout if there is an update that could mean the
HSM copy would be corrupt since we can always just copy it out again,
but it doesn't seem hugely complicated to notify the backend, if not
the agent, and let it decide.

> > Are you sure you want to remove those objects if we will need them 
> > later, in "complex" HSM?
> > As this mechanism will need to change a lot when we will implement the 
> > restore-in-place feature, i'm not sure this is the best idea.
> Ah, I think it is important that we do NOT restore in place to the old 
> OST objects.  The OSTs may now be full, or indeed not exist anymore.  
> The restore in place for complex HSM is at the file level; the objects 
> may move around.  "Complex" in this case just means that clients will 
> have access to partially restored files.

Can't the "complex" HSM restore to new objects?  It just depends on when
the new-being-restored objects become the new contents of the file doesn't it?

> > Copy-out
> > - Mover take a specific lock on range (0-EOF for the moment)
> > - On this range, reads pass, writes raise a callback on the mover.
> > - Receiving this callback, if the mover release its lock, the copyout 
> > is cancelled, if not, the write i/o is blocked 
> I don't think we want to block the write just because the HSM copy isn't 
> done yet.  If the data is changing, then the policy engine shouldn't 
> have started a copyout process in the first place. 

Indeed.

> If the customer's 
> goal is to do a coherent checkpoint, then it should explicitly wait for 
> the copyout to be done.  

Disagree - the customer doesn't have to know a copyout is in progress.
The HSM should abort the copyout or mark the copy corrupt.

> If it's just the policy engine that got it 
> wrong, it doesn't matter if it finishes or not; the file will be marked 
> "hsm_dirty", and so the policy engine should re-queue it for copyout 
> again later, and it can't be purged in the meantime since the dirty bit 
> is set.

Indeed.


    Cheers,
              Eric


* [Lustre-devel] "Simple" HSM straw man
  2008-10-16 21:56     ` Eric Barton
@ 2008-10-17  9:47       ` Aurelien Degremont
  2008-10-17 10:10         ` Eric Barton
  2008-10-17 13:54         ` Peter Braam
  0 siblings, 2 replies; 14+ messages in thread
From: Aurelien Degremont @ 2008-10-17  9:47 UTC (permalink / raw)
  To: lustre-devel

Eric Barton wrote:
 > Partially purged files is a requirement to allow graphical file browsers
 > to retrieve icons from within the file.  It's OK to miss this out in the
 > first version, but it has to be there for the full product.

Think also of commands like

$ file foo*

 >>> Is it interesting to have a file that is outdated and possibly
 >>> incoherent?
 >> It is probably useful in some cases -- simulation checkpoints maybe.
 >
 > A corrupt simulation checkpoint is useless.  We _must_ provide a way to
 > ensure the HSM copy of a file is a known good snapshot.  We don't
 > necessarily have to abort the copyout if there is an update that could
 > mean the HSM copy would be corrupt since we can always just copy it out
 > again, but it doesn't seem hugely complicated to notify the backend, if
 > not the agent and let it decide.

Indeed, this is important

 >> I don't think we want to block the write just because the HSM copy
 >> isn't done yet.  If the data is changing, then the policy engine shouldn't
 >> have started a copyout process in the first place.
 >
 > Indeed.

You were speaking of a FS with only one big file, so we need a way to 
be sure it will be copied at least once, even if people are writing 
to it.
In this case, with a classical policy engine, this file will never be 
copied out because its data is constantly changing.



-- 
Aurelien Degremont
CEA


* [Lustre-devel] "Simple" HSM straw man
  2008-10-17  9:47       ` Aurelien Degremont
@ 2008-10-17 10:10         ` Eric Barton
  2008-10-17 13:54         ` Peter Braam
  1 sibling, 0 replies; 14+ messages in thread
From: Eric Barton @ 2008-10-17 10:10 UTC (permalink / raw)
  To: lustre-devel

Aurelien,

>  >> I don't think we want to block the write just because the HSM
>  >> copy isn't done yet.  If the data is changing, then the policy
>  >> engine shouldn't have started a copyout process in the first
>  >> place.
>  >
>  > Indeed.
> 
> You were speaking of a FS with only one big file and so we need to
> have a way to be sure it will be copied at least once, even if
> people are writing to it.  In this case, with a classical policy
> engine, this file will never be copied out because data is
> constantly changing.

I'm not so sure that's a realistic case.  If this file is so active
that it's impossible to take a consistent copy of it without some
sort of a snapshot facility, does that really mean it's a candidate
for archiving?

    Cheers,
              Eric


* [Lustre-devel] "Simple" HSM straw man
  2008-10-17  9:47       ` Aurelien Degremont
  2008-10-17 10:10         ` Eric Barton
@ 2008-10-17 13:54         ` Peter Braam
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Braam @ 2008-10-17 13:54 UTC (permalink / raw)
  To: lustre-devel




On 10/17/08 3:47 AM, "Aurelien Degremont" <aurelien.degremont@cea.fr> wrote:

> Eric Barton wrote:
>> Partially purged files is a requirement to allow graphical file browsers
>> to retrieve icons from within the file.  It's OK to miss this out in the
>> first version, but it has to be there for the full product.
> 
> Think also of commands like
> 
> $ file foo*
> 
>>>> Is it interesting to have a file that is outdated and possibly
>>>> incoherent?

99.99% (probably more 9's) of backup systems do work this way, with
relatively little harm.

Also remember that many files are append only - for those it might be fine.

Philosophically it is a disaster of course.  I would offer archiving of
files that are active, and in due course use snapshots.

Peter


>>> It is probably useful in some cases -- simulation checkpoints maybe.
>> 
>> A corrupt simulation checkpoint is useless.  We _must_ provide a way to
>> ensure the HSM copy of a file is a known good snapshot.  We don't
>> necessarily have to abort the copyout if there is an update that could
>> mean the HSM copy would be corrupt since we can always just copy it out
>> again, but it doesn't seem hugely complicated to notify the backend, if
>> not the agent and let it decide.
> 
> Indeed, this is important
> 
>>> I don't think we want to block the write just because the HSM copy
>>> isn't done yet.  If the data is changing, then the policy engine
>>> shouldn't have started a copyout process in the first place.
>> 
>> Indeed.
> 
> You were speaking of a FS with only one big file and so we need to have
> a way to be sure it will be copied at least once, even if people are
> writing to it.
> In this case, with a classical policy engine, this file will never be
> copied out because data is constantly changing.
> 
> 


* [Lustre-devel] HSM arch wiki
  2008-10-13 16:39 ` Aurelien Degremont
  2008-10-14 19:41   ` Nathaniel Rutman
@ 2008-10-20 23:09   ` Nathaniel Rutman
  2008-10-21 13:21     ` Aurelien Degremont
  2008-11-25 16:59     ` Alex Kulyavtsev
  1 sibling, 2 replies; 14+ messages in thread
From: Nathaniel Rutman @ 2008-10-20 23:09 UTC (permalink / raw)
  To: lustre-devel

High-level architecture page for the Lustre HSM project
http://arch.lustre.org/index.php?title=HSM_Migration


HSM core team - this is intended to be sufficient to write a full 
HLD/DLD from.  What is it missing?


* [Lustre-devel] HSM arch wiki
  2008-10-20 23:09   ` [Lustre-devel] HSM arch wiki Nathaniel Rutman
@ 2008-10-21 13:21     ` Aurelien Degremont
  2008-11-25 16:59     ` Alex Kulyavtsev
  1 sibling, 0 replies; 14+ messages in thread
From: Aurelien Degremont @ 2008-10-21 13:21 UTC (permalink / raw)
  To: lustre-devel

Nathaniel Rutman wrote:
> High-level architecture page for the Lustre HSM project
> http://arch.lustre.org/index.php?title=HSM_Migration
>
>
> HSM core team - this is intended to be sufficient to write a full 
> HLD/DLD from.  What is it missing?
>
I think that after Tuesday's conf call, all the main elements for writing 
the HLD will be there. Maybe some small points will be missing, but we 
could discuss them by e-mail or at conf calls. I think most of it is 
already there.

-- 
Aurelien Degremont
CEA


* [Lustre-devel] HSM arch wiki
  2008-10-20 23:09   ` [Lustre-devel] HSM arch wiki Nathaniel Rutman
  2008-10-21 13:21     ` Aurelien Degremont
@ 2008-11-25 16:59     ` Alex Kulyavtsev
  2008-11-26 16:29       ` Nathaniel Rutman
  2008-11-27 19:05       ` Andreas Dilger
  1 sibling, 2 replies; 14+ messages in thread
From: Alex Kulyavtsev @ 2008-11-25 16:59 UTC (permalink / raw)
  To: lustre-devel

A few questions:
- For a large existing tape archive (~10,000,000 files) it is 
desirable to import file metadata into the Lustre fs without actually 
copying the files to disk.
The import should complete in a reasonable time (hours rather than 
months) or be done online.

- To provide bandwidth to tape it is desirable to have multiple migrator 
nodes connected to the HSM. What element of the proposed design 
distributes copy-out processes across migrator nodes to provide 
scalability? Is it a function of the HSM-specific copy tool, or does the 
Lustre agent provide it?

- A "smart" HSM system can reorder requests to optimize tape access. It 
is common to have 2000 requests pending in the queue with tens or 
hundreds of IO transfers actually being served. The current limit on 
pending requests is about 30,000. We found that implementing pending 
requests as processes (one copy-out tool process per request waiting 
for IO) is resource-consuming and not scalable. How would ~100,000 
requests waiting for transfer be served?

- How to prestage files? That is, send an asynchronous request to copy 
in a file from tape without blocking on the wait. This is needed to 
stage large data sets for future processing. Prestaging "file sets" is 
desirable.

- What is the proposed scenario for handling an OST going down? Suppose 
a file is present on one of the OSTs and it goes down (striping is one). 
My understanding is that the client will wait for the OST to come back 
(case[1]) and the file will not be staged from tape automatically. If 
the file is not present on any OST, it will be staged immediately 
(case[2]). Is it possible to stage the file automatically (case[1]) to 
another OST and mark the copy on the old OST for removal?

We discussed some of these questions with Peter; he suggested asking on 
the devel list.

Best regards, Alex.

Nathaniel Rutman wrote:
> High-level architecture page for the Lustre HSM project
> http://arch.lustre.org/index.php?title=HSM_Migration
>
>
> HSM core team - this is intended to be sufficient to write a full 
> HLD/DLD from.  What is it missing?


* [Lustre-devel] HSM arch wiki
  2008-11-25 16:59     ` Alex Kulyavtsev
@ 2008-11-26 16:29       ` Nathaniel Rutman
  2008-11-27 19:05       ` Andreas Dilger
  1 sibling, 0 replies; 14+ messages in thread
From: Nathaniel Rutman @ 2008-11-26 16:29 UTC (permalink / raw)
  To: lustre-devel

Alex Kulyavtsev wrote:
> A few questions:
> - For a large existing tape archive (~10,000,000 files) it is 
> desirable to import file metadata into the Lustre fs without actually 
> copying the files to disk.
> The import should complete in a reasonable time (hours rather than 
> months) or be done online.
Agreed.  Probably best done via a special ioctl that would create a stub 
file and populate the metadata.
>
> - To provide bandwidth to tape it is desirable to have multiple 
> migrator nodes connected to the HSM. What element of the proposed 
> design distributes copy-out processes across migrator nodes to provide 
> scalability? Is it a function of the HSM-specific copy tool, or does 
> the Lustre agent provide it?
Lustre agents can run on multiple Lustre clients in parallel.  
Coordinator distributes copyout jobs to different agents.
>
>
> - A "smart" HSM system can reorder requests to optimize tape access. 
> It is common to have 2000 requests pending in the queue with tens or 
> hundreds of IO transfers actually being served. The current limit on 
> pending requests is about 30,000. We found that implementing pending 
> requests as processes (one copy-out tool process per request waiting 
> for IO) is resource-consuming and not scalable. How would ~100,000 
> requests waiting for transfer be served?
>
Coordinator decides when to request copyin/out jobs, and could throttle 
the total number of concurrent accesses.
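
One way to picture the two answers above (jobs spread over several
agents, with a cap on concurrent transfers) is the minimal sketch
below.  It assumes round-robin agent selection and an in-memory queue;
the Coordinator class, its method names and the max_active parameter
are hypothetical and do not describe the actual kernel-space
coordinator.  Pending requests are plain queue entries rather than
processes, so a long backlog costs memory only.

```python
from collections import deque


class Coordinator:
    """Hypothetical user-space model of the coordinator's dispatch loop."""

    def __init__(self, agents, max_active):
        self.agents = list(agents)    # agent identifiers, e.g. client node names
        self.max_active = max_active  # throttle on concurrent copyin/copyout jobs
        self.pending = deque()        # FIDs waiting to be dispatched
        self._queued = set()          # fast membership test for pending FIDs
        self.active = {}              # FID -> agent currently handling it
        self._next = 0                # round-robin cursor

    def submit(self, fid):
        # Consolidate repeat requests for the same file.
        if fid in self.active or fid in self._queued:
            return False
        self.pending.append(fid)
        self._queued.add(fid)
        return True

    def dispatch(self):
        # Start as many jobs as the throttle allows, rotating over agents.
        started = []
        while self.pending and len(self.active) < self.max_active:
            fid = self.pending.popleft()
            self._queued.discard(fid)
            agent = self.agents[self._next % len(self.agents)]
            self._next += 1
            self.active[fid] = agent
            started.append((fid, agent))
        return started

    def complete(self, fid):
        # Called when an agent reports that the job has finished.
        self.active.pop(fid, None)
```

With two agents and max_active=2, five submitted FIDs result in two
running jobs and three queued ones; each completion frees a slot for
the next dispatch() call.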

> - How to prestage files? That is, send an asynchronous request to copy 
> in a file from tape without blocking on the wait. This is needed to 
> stage large data sets for future processing. Prestaging "file sets" is 
> desirable.
Policy engine would request copyin of files before cache miss on open.  
Policy could define file sets.
>
> - What is the proposed scenario for handling an OST going down? 
> Suppose a file is present on one of the OSTs and it goes down 
> (striping is one). My understanding is that the client will wait for 
> the OST to come back (case[1]) and the file will not be staged from 
> tape automatically. If the file is not present on any OST, it will be 
> staged immediately (case[2]). Is it possible to stage the file 
> automatically (case[1]) to another OST and mark the copy on the old 
> OST for removal?
With our V2 HSM, we will have the ability to keep more detailed layouts; 
this optimization could be part of those changes.
>
> We discussed some of these questions with Peter; he suggested asking 
> on the devel list.
We greatly appreciate it!  Please ask/suggest away.
>
> Best regards, Alex.
>
> Nathaniel Rutman wrote:
>> High-level architecture page for the Lustre HSM project
>> http://arch.lustre.org/index.php?title=HSM_Migration
>>
>>
>> HSM core team - this is intended to be sufficient to write a full 
>> HLD/DLD from.  What is it missing?


* [Lustre-devel] HSM arch wiki
  2008-11-25 16:59     ` Alex Kulyavtsev
  2008-11-26 16:29       ` Nathaniel Rutman
@ 2008-11-27 19:05       ` Andreas Dilger
  1 sibling, 0 replies; 14+ messages in thread
From: Andreas Dilger @ 2008-11-27 19:05 UTC (permalink / raw)
  To: lustre-devel

On Nov 25, 2008  10:59 -0600, Alex Kulyavtsev wrote:
> - For a large existing tape archive (~10,000,000 files) it is
> desirable to import file metadata into the Lustre fs without actually
> copying the files to disk.
> The import should complete in a reasonable time (hours rather than
> months) or be done online.

Conceivably this could be done with "mknod" and "setxattr" to store the
striping information into the Lustre inode.  However, one issue will be
how to identify this new file to the HSM.  The current plan is that the
Lustre HSM policy engine database will contain the mapping between the
Lustre FID (~= inode number) and the file in the archive.

Since this is a new file (FID) then we would also need to add an entry to
the policy engine database that contains the mapping from FID->archive
file.
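
A minimal sketch of such a metadata-only import, under several
assumptions: the inode number stands in for the Lustre FID (as in the
"~=" above), a plain dict stands in for the policy engine database,
and the archive identifiers are made up.  The function name and its
parameters are hypothetical.  The "setxattr" step for the striping
information is left as a comment because the thread does not name the
attribute to set.

```python
import os
import stat


def import_archive_entries(entries, root, fid_db, make_stub=None):
    """Create empty stub files and record a FID -> archive file mapping.

    entries:   iterable of (relative_path, archive_id) pairs
    root:      directory under which the stubs are created
    fid_db:    dict standing in for the policy engine database
    make_stub: callable that creates an empty file; defaults to mknod
    """
    if make_stub is None:
        # mknod with S_IFREG creates an empty regular file, no data copied.
        make_stub = lambda p: os.mknod(p, 0o600 | stat.S_IFREG)
    for rel_path, archive_id in entries:
        path = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        make_stub(path)
        # A real import would store the striping information here with
        # setxattr; the attribute name is not given in this thread.
        fid = os.stat(path).st_ino  # assumption: inode number as FID stand-in
        fid_db[fid] = archive_id
    return fid_db
```

Only metadata is touched, so the cost per file is one create, one stat
and one database insert, which is what makes an import of millions of
entries in hours plausible.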

> - A "smart" HSM system can reorder requests to optimize tape access. It
> is common to have 2000 requests pending in the queue with tens or
> hundreds of IO transfers actually being served. The current limit on
> pending requests is about 30,000. We found that implementing pending
> requests as processes (one copy-out tool process per request waiting
> for IO) is resource-consuming and not scalable. How would ~100,000
> requests waiting for transfer be served?

The Lustre HSM design has the policy engine as a mediator between the
copyin/copyout/purge requests and the userspace agents that are specific
to the HSM and do the actual work.

The policy engine is free to reorder all of the requests as it sees
fit.  CEA is supplying their existing policy engine as a starting point
for Lustre HSM+HPSS, and I think this could be made available to interested
parties sooner rather than later.
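
As an illustration of what such reordering could look like, the sketch
below groups pending requests by tape and sorts each group by position
on the tape, serving the tapes with the most requests first.  This is
only one possible ordering, not the behaviour of CEA's policy engine,
and the Request fields (fid, tape, offset) are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Request(NamedTuple):
    """Hypothetical pending request; field names are illustrative."""
    fid: str
    tape: str    # identifier of the tape holding the file
    offset: int  # position of the file on that tape


def reorder_for_tape(requests):
    # Group requests so that each tape is mounted once.
    by_tape = defaultdict(list)
    for req in requests:
        by_tape[req.tape].append(req)

    ordered = []
    # Largest batches first, tape name as a deterministic tie-breaker.
    for tape in sorted(by_tape, key=lambda t: (-len(by_tape[t]), t)):
        # Within a tape, read in increasing position to avoid seeking back.
        ordered.extend(sorted(by_tape[tape], key=lambda r: r.offset))
    return ordered
```

Because the queue is just a list of small records, sorting 100,000
entries is cheap compared with keeping one process per request.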

> - What is the proposed scenario for handling an OST going down? Suppose
> a file is present on one of the OSTs and it goes down (striping is one).
> My understanding is that the client will wait for the OST to come back
> (case[1]) and the file will not be staged from tape automatically. If
> the file is not present on any OST, it will be staged immediately
> (case[2]). Is it possible to stage the file automatically (case[1]) to
> another OST and mark the copy on the old OST for removal?

Since the OST objects will be removed when the file is purged there is
no requirement to store the file on a particular OST during copyin.
The HSM will store the striping attributes (probably only if they do
not match the filesystem defaults) to ensure that wide striped files
retain this property when returned to the filesystem.

In addition to not saving the striping for files that match the default
layout, we may also consider saving the layout of files with
"stripe_count == target_count" as having a stripe_count = -1 (stripe over
all OSTs) so that if there are more OSTs available when the file is
restored it takes advantage of the additional bandwidth.  We might also
consider having a (policy engine?) tunable that files with > N stripes
are restriped over all OSTs when restored.
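
The striping rules in the last two paragraphs can be written down as a
small sketch.  The function names, the widen_threshold parameter (the
"N" tunable above) and the use of None for "nothing stored" are
assumptions made for illustration; only the stripe_count = -1
convention comes from the text.

```python
STRIPE_ALL = -1  # "stripe over all OSTs"


def stripe_count_to_archive(stripe_count, target_count,
                            default_stripe_count, widen_threshold=None):
    """Return the stripe count to store with the archived file.

    None means nothing is stored and the filesystem default applies
    when the file is restored.
    """
    if stripe_count == default_stripe_count:
        return None
    if stripe_count == target_count:
        # The file was striped over every OST: keep that intent, not the number.
        return STRIPE_ALL
    if widen_threshold is not None and stripe_count > widen_threshold:
        # Optional tunable: files with more than N stripes go fully wide.
        return STRIPE_ALL
    return stripe_count


def stripe_count_on_restore(stored, default_stripe_count, target_count):
    """Resolve the stored value against the OSTs available at restore time."""
    if stored is None:
        return default_stripe_count
    if stored == STRIPE_ALL:
        return target_count
    return min(stored, target_count)
```

For example, a file archived with 8 stripes on an 8-OST filesystem is
stored as -1 and comes back with 12 stripes if the filesystem has grown
to 12 OSTs by the time it is restored.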

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


end of thread, other threads:[~2008-11-27 19:05 UTC | newest]

Thread overview: 14+ messages
2008-10-09 22:37 [Lustre-devel] "Simple" HSM straw man Nathaniel Rutman
2008-10-13 16:39 ` Aurelien Degremont
2008-10-14 19:41   ` Nathaniel Rutman
2008-10-15 21:52     ` Nathaniel Rutman
2008-10-16 14:09       ` Aurelien Degremont
2008-10-16 21:56     ` Eric Barton
2008-10-17  9:47       ` Aurelien Degremont
2008-10-17 10:10         ` Eric Barton
2008-10-17 13:54         ` Peter Braam
2008-10-20 23:09   ` [Lustre-devel] HSM arch wiki Nathaniel Rutman
2008-10-21 13:21     ` Aurelien Degremont
2008-11-25 16:59     ` Alex Kulyavtsev
2008-11-26 16:29       ` Nathaniel Rutman
2008-11-27 19:05       ` Andreas Dilger
