* [Lustre-devel] "Simple" HSM straw man
@ 2008-10-09 22:37 Nathaniel Rutman
2008-10-13 16:39 ` Aurelien Degremont
0 siblings, 1 reply; 14+ messages in thread
From: Nathaniel Rutman @ 2008-10-09 22:37 UTC (permalink / raw)
To: lustre-devel
This is intended to be a starting point for discussion; the concepts
here have been hashed through a few times and hopefully represent the
best current thinking.
Baseline concepts
1. all single-file coherency issues are in kernel space (file locking,
recovery)
2. all policy decisions are in user space (using changelogs, df, etc)
3. coordinator/mover communication will use LNET
4. "simple" refers to
a. integration with HPSS only
b. depends on changelog for policy decisions
c. restore on file open, not data read/write
5. HSM tracks entire files, not stripe objects
6. HSM namespace is flat, all files are addressed by FID only
7. Desired: coordinator and movers can be reused by (non-HSM) replication
Components
1. Mover
a. combined kernel (LNET comms) and userspace processes
b. userspace processes will use Lustre clients for data i/o
c. will use special fid directory for file access (.lustre/fid/XXXX)
d. interfaces with hardware-specific copy tool to access HSM files
e. kernel process encompasses service threads listening for
coordinator requests, passes these up to userspace process via upcall.
No interaction with the client is needed; this is a simple message
passing service.
2. Coordinator
a. decides and dispatches copyin and copyout requests to movers
b. consolidates repeat requests
c. re-queues requests to a new agent if an agent becomes unresponsive
d. kernel space, associated with the MDT for cache-miss
e. ioctl interface for copyout, purge requests from policy engine
3. Policy engine (aka Space Manager)
a. makes policy decisions for copyout, purge
b. normally uses changelogs and 'df' for input; rarely is allowed to
scan filesystem
c. userspace process, requests copyout and purge via ioctl to
coordinator
4. MDT changes
a. Per-file layout lock
A new layout lock is created for every file. Private writer lock is
taken by the MDT when allocating/changing file layout (LOV EA).
Shared reader locks are taken by anyone reading the layout (client
opens, lfs getstripe). Anyone taking a new extent lock anywhere in
the file must first hold the layout lock.
Problem: Layout lock can't be held by liblustre during i/o?
b. lov EA changes
i. flags: file_is_purged "purged", copyout_begin,
file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete. The purged
flag is always manipulated under a write layout lock, the other flags
are not.
ii. "window" EA: range of non-purged data (rev2)
c. new file ioctls: HSMCopyOut, HSMPurge, HSMCopyinDone
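As an illustration of the coordinator duties listed above (dispatching, consolidating repeat requests, re-queuing when an agent becomes unresponsive), here is a minimal user-space sketch in Python. It is only a toy model: the real coordinator is described as kernel code associated with the MDT, and every name here (Coordinator, submit, done, agent_failed) is hypothetical.

```python
from collections import deque


class Coordinator:
    """Toy model of the coordinator's request handling (hypothetical names)."""

    def __init__(self, agents):
        self.idle = deque(agents)   # movers with no request assigned
        self.pending = deque()      # (action, fid) waiting for a mover
        self.running = {}           # (action, fid) -> mover handling it

    def submit(self, action, fid):
        """Queue a request; repeats of a known request are consolidated."""
        key = (action, fid)
        if key in self.running or key in self.pending:
            return False            # duplicate, nothing new to do
        self.pending.append(key)
        self._dispatch()
        return True

    def _dispatch(self):
        # Hand queued requests to idle movers, oldest request first.
        while self.pending and self.idle:
            key = self.pending.popleft()
            self.running[key] = self.idle.popleft()

    def done(self, action, fid):
        """A mover reports completion and becomes idle again."""
        self.idle.append(self.running.pop((action, fid)))
        self._dispatch()

    def agent_failed(self, agent):
        """Re-queue everything the unresponsive mover was working on."""
        for key, owner in list(self.running.items()):
            if owner == agent:
                del self.running[key]
                self.pending.appendleft(key)
        self._dispatch()
```

The failed agent is deliberately not returned to the idle list; how an agent is declared unresponsive or rejoins is not covered here.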
Algorithms
1. copyout
a. Policy engine decides to copy a file to HSM, executes HSMCopyOut
ioctl on file
b. ioctl handled by MDT, which passes request to Coordinator
c. coordinator dispatches request to mover. request should include
file extents (for future purposes)
d. normal extents read lock is taken by mover running on client
e. mover sets "copyout_begin" bit and clears "hsm_dirty" bit in EA.
f. any writes to the file set the "hsm_dirty" bit (may be
lazy/delayed with mtime or filesize change updates on MDT). Note that
file writes need not cancel copyout; for a fs with a single big file, we
don't want to keep interrupting copyout or it will never finish.
g. when done, mover checks hsm_dirty bit. If set, clears
copyout_begin, indicating current file is not in HSM. If not set,
mover sets "copyout_complete" bit. File layout write lock is not taken
during mover flag manipulation. (Note: file modifications after copyout
is complete will have both copyout_complete and hsm_dirty bits set.)
2. purge (aka punch)
a. Policy engine decides to purge a file, executes HSMPurge ioctl on
file
b. ioctl handled by MDT
c. MDT takes a write lock on the file layout lock
d. MDT enqueues write locks on all extents of the file. After these
are granted, then no client has any dirty cache and no child can take
new extent locks until layout lock is released. MDT drops all extent locks.
e. MDT verifies that hsm_dirty bit is clear and copyout_complete bit
is set
f. MDT marks the LOV EA as "purged"
g. MDT sends destroys for the OST objects, using destroy llog entries to
guard against object leakage during OST failover
h. MDT drops layout lock.
3. restore (aka copyin aka cache miss)
a. Client open intent enqueues layout read lock.
b. MDT checks "purged" bit; if purged, lock request response
includes "wait forever" flag, causing client to block the open.
c. MDT creates a new layout with a similar stripe pattern as the
original, allocating new objects on new OSTs. (We should try to respect
specific layout settings (pool, stripecount, stripesize), but be
flexible if e.g. pool doesn't exist anymore. Maybe we want to ignore
offset and/or specific ost allocations in order to rebalance.)
d. MDT sends request to coordinator requesting copyin of the file to
.lustre/fid/XXXX with extents 0-EOF. Extents may be used in the future
to (a) copy in part of a file, in low-disk-space situations; (b) copy in
individual stripes simultaneously on multiple OSTs.
e. Coordinator distributes that request to an appropriate mover.
f. Writes into .lustre/fid/* are not required to hold layout read
lock (or special flag is passed to open, or group write lock on layout
is passed to mover)
g. Mover copies data from HSM
h. When finished, mover calls ioctl HSM_COPYIN_DONE on the file
i. MDT clears "purged" bit from LOV EA
j. MDT releases the layout lock
k. This sends a completion AST to the original client, who now
completes his open.
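The flag handling in the three algorithms above can be sketched as a small Python model. This is only an illustration of the steps as written, not MDT code: the flag names come from the proposal, while the class and method names are hypothetical, and locking is not modeled at all.

```python
class HsmFlags:
    """Toy model of the per-file HSM flags kept in the LOV EA."""

    def __init__(self):
        self.purged = False
        self.copyout_begin = False
        self.copyout_complete = False
        self.hsm_dirty = False

    def begin_copyout(self):
        # copyout step e: mover marks the start and clears the dirty bit
        self.copyout_begin = True
        self.hsm_dirty = False

    def write(self):
        # copyout step f: any write marks the HSM copy as out of date
        self.hsm_dirty = True

    def end_copyout(self):
        # copyout step g: only a copy that saw no writes counts as complete
        if self.hsm_dirty:
            self.copyout_begin = False   # current contents are not in HSM
            return False
        self.copyout_complete = True
        return True

    def can_purge(self):
        # purge step e: copy must be complete and not dirtied since
        return self.copyout_complete and not self.hsm_dirty

    def purge(self):
        # purge step f: in the proposal this runs under the layout write lock
        if not self.can_purge():
            raise RuntimeError("file is not safely in HSM")
        self.purged = True

    def copyin_done(self):
        # restore step i: data is back, so the file is no longer purged
        self.purged = False
```

The sketch follows the steps literally. For instance, it does not clear copyout_complete when a second copyout begins on a previously archived file, since the steps above do not say whether it should.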
State machines
TBD - I think there's enough in here to chew on for a while
Things requiring a more detailed look
1. configuration of HSM/movers
2. policy engine
3. "complex" HSM roadmap
a. partial access to files during restore
b. partial purging for file type identification, image thumbnails, ??
c. integration with other HSM backends (ADM, ??)
4. layout locks and liblustre
* [Lustre-devel] "Simple" HSM straw man
2008-10-09 22:37 [Lustre-devel] "Simple" HSM straw man Nathaniel Rutman
@ 2008-10-13 16:39 ` Aurelien Degremont
2008-10-14 19:41 ` Nathaniel Rutman
2008-10-20 23:09 ` [Lustre-devel] HSM arch wiki Nathaniel Rutman
0 siblings, 2 replies; 14+ messages in thread
From: Aurelien Degremont @ 2008-10-13 16:39 UTC (permalink / raw)
To: lustre-devel
Nathaniel Rutman wrote:
> c. restore on file open, not data read/write
Beware of the difficulty of moving this behavior to
restore-on-first-I/O later.
> d. interfaces with hardware-specific copy tool to access HSM files
rather "HSM-specific"
> e. kernel process encompasses service threads listening for
> coordinator requests, passes these up to userspace process via upcall.
> No interaction with the client is needed; this is a simple message
> passing service.
It depends on how you manage a user-space process but, AFAIK, being
able to manage the copy tool process means being able to:
- start this process
- send a signal
- get its output (for "complex" hsm, we will need feedback from copy
tool process)
- wait for the process end
- ...
All of this is easily doable from userspace, and very hard in
kernel-space (we cannot use the fire-and-forget call call_usermodehelper).
So I rather imagine:
- a kernel space mover, simply getting LNET messages and passing them
to user-space mover
- a user-space mover, forking, spawning and managing the copy tool
process.
Maybe it will need to manage several copy tool processes, so it will
need queues, process list, etc...
So I think this tool needs a bit more than just a "simple message
passing service".
> b. lov EA changes
> i. flags: file_is_purged "purged", copyout_begin,
> file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete. The purged
> flag is always manipulated under a write layout lock, the other flags
> are not.
> ii: "window" EA range of non-purged data (rev2)
If you add a window EA (will be needed anyway for hsm v2), you do not
need a purged flag:
window.start ==window.end is comparable to a purged flag unset. (or
window.end == 0)
> c. new file ioctls: HSMCopyOut, HSMPurge, HSMCopyinDone
>
>
> Algorithms
> 1. copyout
> a. Policy engine decides to copy a file to HSM, executes HSMCopyOut
> ioctl on file
> b. ioctl handled by MDT, which passes request to Coordinator
> c. coordinator dispatches request to mover. request should include
> file extents (for future purposes)
> d. normal extents read lock is taken by mover running on client
> e. mover sets "copyout_begin" bit and clears "hsm_dirty" bit in EA.
> f. any writes to the file set the "hsm_dirty" bit (may be
> lazy/delayed with mtime or filesize change updates on MDT). Note that
> file writes need not cancel copyout; for a fs with a single big file, we
> don't want to keep interrupting copyout or it will never finish.
Is it worthwhile to have a file that is outdated and possibly incoherent?
> g. when done, mover checks hsm_dirty bit. If set, clears
> copyout_begin, indicating current file is not in HSM. If not set,
> mover sets "copyout_complete" bit. File layout write lock is not taken
> during mover flag manipulation. (Note: file modifications after copyout
> is complete will have both copyout_complete and hsm_dirty bits set.)
>
> 2. purge (aka punch)
> a. Policy engine decides to purge a file, executes HSMPurge ioctl on
> file
> b. ioctl handled by MDT
> c. MDT takes a write lock on the file layout lock
> d. MDT enqueues write locks on all extents of the file. After these
> are granted, then no client has any dirty cache and no child can take
> new extent locks until layout lock is released. MDT drops all extent locks.
> e. MDT verifies that hsm_dirty bit is clear and copyout_complete bit
> is set
> f. MDT marks the LOV EA as "purged"
> g. MDT sends destroys for the OST objects, using destroy llog entries to
> guard against object leakage during OST failover
Are you sure you want to remove those objects if we will need them
later, in "complex" HSM?
As this mechanism will need to change a lot when we implement the
restore-in-place feature, I'm not sure this is the best idea.
> h. MDT drops layout lock.
>
> 3. restore (aka copyin aka cache miss)
> a. Client open intent enqueues layout read lock.
> b. MDT checks "purged" bit; if purged, lock request response
> includes "wait forever" flag, causing client to block the open.
> c. MDT creates a new layout with a similar stripe pattern as the
> original, allocating new objects on new OSTs. (We should try to respect
> specific layout settings (pool, stripecount, stripesize), but be
> flexible if e.g. pool doesn't exist anymore. Maybe we want to ignore
> offset and/or specific ost allocations in order to rebalance.)
> d. MDT sends request to coordinator requesting copyin of the file to
> .lustre/fid/XXXX with extents 0-EOF. Extents may be used in the future
> to (a) copy in part of a file, in low-disk-space situations; (b) copy in
> individual stripes simultaneously on multiple OSTs.
> e. Coordinator distributes that request to an appropriate mover.
> f. Writes into .lustre/fid/* are not required to hold layout read
> lock (or special flag is passed to open, or group write lock on layout
> is passed to mover)
> g. Mover copies data from HSM
> h. When finished, mover calls ioctl HSM_COPYIN_DONE on the file
> i. MDT clears "purged" bit from LOV EA
> j. MDT releases the layout lock
> k. This sends a completion AST to the original client, who now
> completes his open.
Concerning the new copyout_begin/copyout_complete flags: I'm not an
ldlm/recovery specialist, but would it be possible to have the mover take a
kind of write extent lock on the area it has to copy in/out and
downgrade it to a smaller range as the copy tool goes along?
Copy-out
- Mover takes a specific lock on a range (0-EOF for the moment)
- On this range, reads pass, writes raise a callback on the mover.
- On receiving this callback, if the mover releases its lock, the copyout is
cancelled; if not, the write I/O is blocked
- When the mover has copied [0 - cursor], it can downgrade its lock to
[cursor - EOF] and release the lock on [0 - cursor].
Same thing could be done for copy in.
The two key points are:
- Could we have a layout lock on a specific range?
- Is it possible to downgrade a range lock with ldlm?
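To make the copy-out proposal above concrete, here is a small Python model of a lock that shrinks as the copy advances. It is purely illustrative and does not correspond to any existing ldlm interface; the class, method, and parameter names (CopyoutRangeLock, advance, mover_yields) are hypothetical.

```python
class CopyoutRangeLock:
    """Toy model of the proposed shrinking lock (not a real ldlm API)."""

    def __init__(self, eof):
        self.start = 0          # first byte still covered by the lock
        self.end = eof          # lock covers [start, end)
        self.cancelled = False

    def advance(self, cursor):
        """Mover has copied [0, cursor): keep only [cursor, end) locked."""
        if not self.start <= cursor <= self.end:
            raise ValueError("cursor outside locked range")
        self.start = cursor

    def write(self, offset, length, mover_yields):
        """Return how a client write to [offset, offset + length) is treated."""
        overlaps = offset < self.end and offset + length > self.start
        if self.cancelled or not overlaps:
            return "proceeds"
        if mover_yields:        # mover drops its lock when called back
            self.cancelled = True
            return "proceeds, copyout cancelled"
        return "blocked"
```

In this model a write to the already-copied range [0, cursor) proceeds without any callback, so something else would still have to record that the HSM copy is stale.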
--
Aurelien Degremont
CEA
* [Lustre-devel] "Simple" HSM straw man
2008-10-13 16:39 ` Aurelien Degremont
@ 2008-10-14 19:41 ` Nathaniel Rutman
2008-10-15 21:52 ` Nathaniel Rutman
2008-10-16 21:56 ` Eric Barton
2008-10-20 23:09 ` [Lustre-devel] HSM arch wiki Nathaniel Rutman
1 sibling, 2 replies; 14+ messages in thread
From: Nathaniel Rutman @ 2008-10-14 19:41 UTC (permalink / raw)
To: lustre-devel
Aurelien Degremont wrote:
> Nathaniel Rutman wrote:
>> c. restore on file open, not data read/write
>
> Beware of the difficulty of moving this behavior to
> restore-on-first-I/O later.
Indeed. From a client point of view, it only changes which locks it's
waiting on, but from a server point of view the OSTs would need to
become involved in HSM knowledge. It is more work, but I don't think
there would be much "throwaway" code from the former to the latter.
>
>> d. interfaces with hardware-specific copy tool to access HSM files
> rather "HSM-specific"
>
>> e. kernel process encompasses service threads listening for
>> coordinator requests, passes these up to userspace process via
>> upcall. No interaction with the client is needed; this is a simple
>> message passing service.
>
> Depending on how you can manage a user-space process, but, AFAIK, to
> be able to manage the copy tool process, that means:
> - start this process
> - send a signal
> - get its output (for "complex" hsm, we will need feedback from copy
> tool process)
> - wait for the process end
> - ...
> All of this is easily doable from userspace, and very hard in
> kernel-space (we cannot use the fire-and-forget call
> call_usermodehelper).
> So I rather imagine:
> - a kernel space mover, simply getting LNET messages and passing them
> to user-space mover
> - a user-space mover, forking, spawning and managing the copy tool
> process.
>
> Maybe it will need to manage several copy tool processes, so it will
> need queues, process list, etc...
>
> So I think this tool needs a bit more than just a "simple message
> passing service".
As we discussed in the HSM concall this morning, the return path can
mostly take place through the file itself via ioctl calls. The mover
will open the destination file location in Lustre and then can indicate
status through an ioctl: starting, waiting for HSM, periodic pinging or
"% complete" messages, copyin complete. This is the "fire-and-forget"
model, and can be started from call_usermodehelper. The in-kernel code
will only have to deal with one-way requests from coordinator to mover.
We also specified 4 types of requests from coordinator:
1. copyin FID
2. copyout FID
3. abort copy(in|out) FID
4. purge FID from HSM
To accomplish 3, it might make sense to store the PID of the process
started from the upcall in the kernel (this is returned by the upcall).
Closing the file could clear the pid from the kernel list.
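A rough Python sketch of the four request types and the PID bookkeeping suggested for aborts follows. It is a user-space toy, not the in-kernel design: the upcall and kill callables stand in for whatever the kernel would actually use, and all names (Request, MoverTable, handle, file_closed) are hypothetical.

```python
from enum import Enum


class Request(Enum):
    COPYIN = 1
    COPYOUT = 2
    ABORT = 3     # abort copy(in|out) FID
    PURGE = 4     # purge FID from HSM


class MoverTable:
    """Toy stand-in for the kernel-side list of running copy tools."""

    def __init__(self, upcall, kill):
        self.upcall = upcall      # starts a copy tool, returns its PID
        self.kill = kill          # asks the process with that PID to stop
        self.pids = {}            # fid -> PID of the copy in progress

    def handle(self, request, fid):
        if request is Request.ABORT:
            pid = self.pids.pop(fid, None)
            if pid is not None:
                self.kill(pid)
            return pid
        pid = self.upcall(request, fid)
        if request in (Request.COPYIN, Request.COPYOUT):
            self.pids[fid] = pid  # remembered so a later abort can find it
        return pid

    def file_closed(self, fid):
        """Closing the file clears the PID from the list."""
        self.pids.pop(fid, None)
```

An abort for a FID with no recorded PID is simply ignored here; whether that should be reported back to the coordinator is left open.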
>
>> b. lov EA changes
>> i. flags: file_is_purged "purged", copyout_begin,
>> file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete. The purged
>> flag is always manipulated under a write layout lock, the other flags
>> are not.
>> ii: "window" EA range of non-purged data (rev2)
>
> If you add a window EA (will be needed anyway for hsm v2), you do not
> need a purged flag:
>
> window.start ==window.end is comparable to a purged flag unset. (or
> window.end == 0)
True, but I don't really see a large market for partially purged files,
so I don't really believe that it is worth the effort. One of the
important points here is that we are deleting stripes off the OSTs,
freeing up space, and we won't necessarily restore to those same OSTs.
As soon as we have partially purged files that's no longer the case, and
I think complicates things too much.
>
>
>> c. new file ioctls: HSMCopyOut, HSMPurge, HSMCopyinDone
>>
>>
>> Algorithms
>> 1. copyout
>> a. Policy engine decides to copy a file to HSM, executes
>> HSMCopyOut ioctl on file
>> b. ioctl handled by MDT, which passes request to Coordinator
>> c. coordinator dispatches request to mover. request should
>> include file extents (for future purposes)
>> d. normal extents read lock is taken by mover running on client
>> e. mover sets "copyout_begin" bit and clears "hsm_dirty" bit in EA.
>> f. any writes to the file set the "hsm_dirty" bit (may be
>> lazy/delayed with mtime or filesize change updates on MDT). Note
>> that file writes need not cancel copyout; for a fs with a single big
>> file, we don't want to keep interrupting copyout or it will never
>> finish.
>
> Is it worthwhile to have a file that is outdated and possibly
> incoherent?
It is probably useful in some cases -- simulation checkpoints maybe.
>
>> g. when done, mover checks hsm_dirty bit. If set, clears
>> copyout_begin, indicating current file is not in HSM. If not set,
>> mover sets "copyout_complete" bit. File layout write lock is not
>> taken during mover flag manipulation. (Note: file modifications
>> after copyout is complete will have both copyout_complete and
>> hsm_dirty bits set.)
>>
>> 2. purge (aka punch)
>> a. Policy engine decides to purge a file, executes HSMPurge ioctl
>> on file
>> b. ioctl handled by MDT
>> c. MDT takes a write lock on the file layout lock
>> d. MDT enqueues write locks on all extents of the file. After
>> these are granted, then no client has any dirty cache and no child
>> can take new extent locks until layout lock is released. MDT drops
>> all extent locks.
>> e. MDT verifies that hsm_dirty bit is clear and copyout_complete
>> bit is set
>> f. MDT marks the LOV EA as "purged"
>> g. MDT sends destroys for the OST objects, using destroy llog entries
>> to guard against object leakage during OST failover
>
> Are you sure you want to remove those objects if we will need them
> later, in "complex" HSM?
> As this mechanism will need to change a lot when we implement the
> restore-in-place feature, I'm not sure this is the best idea.
Ah, I think it is important that we do NOT restore in place to the old
OST objects. The OSTs may now be full, or indeed not exist anymore.
The restore in place for complex HSM is at the file level; the objects
may move around. "Complex" in this case just means that clients will
have access to partially restored files.
>
>
>> h. MDT drops layout lock.
>>
>> 3. restore (aka copyin aka cache miss)
>> a. Client open intent enqueues layout read lock.
>> b. MDT checks "purged" bit; if purged, lock request response
>> includes "wait forever" flag, causing client to block the open.
>> c. MDT creates a new layout with a similar stripe pattern as the
>> original, allocating new objects on new OSTs. (We should try to
>> respect specific layout settings (pool, stripecount, stripesize), but
>> be flexible if e.g. pool doesn't exist anymore. Maybe we want to
>> ignore offset and/or specific ost allocations in order to rebalance.)
>> d. MDT sends request to coordinator requesting copyin of the file
>> to .lustre/fid/XXXX with extents 0-EOF. Extents may be used in the
>> future to (a) copy in part of a file, in low-disk-space situations;
>> (b) copy in individual stripes simultaneously on multiple OSTs.
>> e. Coordinator distributes that request to an appropriate mover.
>> f. Writes into .lustre/fid/* are not required to hold layout read
>> lock (or special flag is passed to open, or group write lock on
>> layout is passed to mover)
>> g. Mover copies data from HSM
>> h. When finished, mover calls ioctl HSM_COPYIN_DONE on the file
>> i. MDT clears "purged" bit from LOV EA
>> j. MDT releases the layout lock
>> k. This sends a completion AST to the original client, who now
>> completes his open.
>
>
>
>
> Concerning the new copyout_begin/copyout_complete flags: I'm not an
> ldlm/recovery specialist, but would it be possible to have the mover take
> a kind of write extent lock on the area it has to copy in/out and
> downgrade it to a smaller range as the copy tool goes along?
This is called "lock conversion" and is not yet implemented, but has
been a general Lustre design goal for some time. So yes, for "complex"
HSM this is what we would want to do.
>
> Copy-out
> - Mover takes a specific lock on a range (0-EOF for the moment)
> - On this range, reads pass, writes raise a callback on the mover.
> - On receiving this callback, if the mover releases its lock, the copyout
> is cancelled; if not, the write I/O is blocked
I don't think we want to block the write just because the HSM copy isn't
done yet. If the data is changing, then the policy engine shouldn't
have started a copyout process in the first place. If the customer's
goal is to do a coherent checkpoint, then it should explicitly wait for
the copyout to be done. If it's just the policy engine that got it
wrong, it doesn't matter if it finishes or not; the file will be marked
"hsm_dirty", and so the policy engine should re-queue it for copyout
again later, and it can't be purged in the meantime since the dirty bit
is set.
> - When the mover has copied [0 - cursor], it can downgrade its lock to
> [cursor - EOF] and release the lock on [ 0 - cursor].
>
> Same thing could be done for copy in.
>
> The two key points are:
> - Could we have a layout lock on a specific range?
Not the layout lock - layout means the striping pattern, and must be
held first before any extent locks can be taken. So I think what you
are asking we plan to do with two locks: the layout lock plus another
extent lock.
> - Is it possible to downgrade a range lock with ldlm?
>
Not yet, but as I said, lock conversion is a general Lustre goal.
* [Lustre-devel] "Simple" HSM straw man
2008-10-14 19:41 ` Nathaniel Rutman
@ 2008-10-15 21:52 ` Nathaniel Rutman
2008-10-16 14:09 ` Aurelien Degremont
2008-10-16 21:56 ` Eric Barton
1 sibling, 1 reply; 14+ messages in thread
From: Nathaniel Rutman @ 2008-10-15 21:52 UTC (permalink / raw)
To: lustre-devel
Nathaniel Rutman wrote:
>
>>> b. lov EA changes
>>> i. flags: file_is_purged "purged", copyout_begin,
>>> file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete. The purged
>>> flag is always manipulated under a write layout lock, the other flags
>>> are not.
>>> ii: "window" EA range of non-purged data (rev2)
>>>
>> If you add a window EA (will be needed anyway for hsm v2), you do not
>> need a purged flag:
>>
>> window.start ==window.end is comparable to a purged flag unset. (or
>> window.end == 0)
>>
> True, but I don't really see a large market for partially purged files,
> so I don't really believe that it is worth the effort. One of the
> important points here is that we are deleting stripes off the OSTs,
> freeing up space, and we won't necessarily restore to those same OSTs.
> As soon as we have partially purged files that's no longer the case, and
> I think complicates things too much.
>
Ok, I've been told I'm dead wrong here, and this will absolutely be
required for "complex" HSM (not "simple"), and so therefore we should at
least think about the arch now. Supposedly we need to keep X bytes at
the beginning of the file for the unix "file" command, and supposedly
icon/preview data, and Y bytes at the end of the file, not sure exactly
why.
We would still plan on deleting the OST objects in the middle. And
clearly, a simple beginning/ending byte count is insufficient for the
final "complex" requirement of enabling partial file reads while doing a
copyin (where we would at a minimum need a per-object cursor).
Anyhow, as I write this none of this sounds like something that can't be
implemented at a later time, so I think we should stick with the
simplest of the simple options for now.
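For reference, the "keep X bytes at the beginning and Y bytes at the end" idea reduces to a small range computation, sketched here in Python. The function name and parameters are hypothetical, and the sketch works on file byte offsets only; it does not map the range onto OST objects or model a per-object cursor.

```python
def purge_extent(size, keep_head, keep_tail):
    """Return the byte range [start, end) to purge, or None if nothing.

    keep_head and keep_tail are the X and Y byte counts mentioned above.
    """
    start = min(keep_head, size)          # never start past end of file
    end = max(size - keep_tail, start)    # never end before the start
    return (start, end) if end > start else None
```

A file smaller than X + Y bytes yields None, meaning it would be left fully resident.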
* [Lustre-devel] "Simple" HSM straw man
2008-10-15 21:52 ` Nathaniel Rutman
@ 2008-10-16 14:09 ` Aurelien Degremont
0 siblings, 0 replies; 14+ messages in thread
From: Aurelien Degremont @ 2008-10-16 14:09 UTC (permalink / raw)
To: lustre-devel
Nathaniel Rutman wrote:
> Ok, I've been told I'm dead wrong here, and this will absolutely be
> required for "complex" HSM (not "simple"), and so therefore we should at
> least think about the arch now. Supposedly we need to keep X bytes at
> the beginning of the file for the unix "file" command, and supposedly
> icon/preview data, and Y bytes at the end of the file, not sure exactly
> why.
> We would still plan on deleting the OST objects in the middle. And
> clearly, a simple beginning/ending byte count is insufficient for the
> final "complex" requirement of enabling partial file reads while doing a
> copyin (where we would at a minimum need a per-object cursor). Anyhow,
> as I write this none of this sounds like something that can't be
> implemented at a later time, so I think we should stick with the
> simplest of the simple options for now.
>
Ok.
Can you just sum up the in-place copy-in mechanism that has been decided
(between the Menlo Park version and the other ones)?
--
Aurelien Degremont
CEA
* [Lustre-devel] "Simple" HSM straw man
2008-10-14 19:41 ` Nathaniel Rutman
2008-10-15 21:52 ` Nathaniel Rutman
@ 2008-10-16 21:56 ` Eric Barton
2008-10-17 9:47 ` Aurelien Degremont
1 sibling, 1 reply; 14+ messages in thread
From: Eric Barton @ 2008-10-16 21:56 UTC (permalink / raw)
To: lustre-devel
Nathan,
> True, but I don't really see a large market for partially purged files,
> so I don't really believe that it is worth the effort. One of the
> important points here is that we are deleting stripes off the OSTs,
> freeing up space, and we won't necessarily restore to those same OSTs.
> As soon as we have partially purged files that's no longer the case, and
> I think complicates things too much.
Partially purged files are a requirement to allow graphical file browsers
to retrieve icons from within the file. It's OK to miss this out in the
first version, but it has to be there for the full product.
> >> Algorithms
> >> 1. copyout
> >> a. Policy engine decides to copy a file to HSM, executes
> >> HSMCopyOut ioctl on file
> >> b. ioctl handled by MDT, which passes request to Coordinator
> >> c. coordinator dispatches request to mover. request should
> >> include file extents (for future purposes)
> >> d. normal extents read lock is taken by mover running on client
> >> e. mover sets "copyout_begin" bit and clears "hsm_dirty" bit in EA.
> >> f. any writes to the file set the "hsm_dirty" bit (may be
> >> lazy/delayed with mtime or filesize change updates on MDT). Note
> >> that file writes need not cancel copyout; for a fs with a single big
> >> file, we don't want to keep interrupting copyout or it will never
> >> finish.
> >
> > Is it worthwhile to have a file that is outdated and possibly
> > incoherent?
> It is probably useful in some cases -- simulation checkpoints maybe.
A corrupt simulation checkpoint is useless. We _must_ provide a way to
ensure the HSM copy of a file is a known good snapshot. We don't necessarily
have to abort the copyout if there is an update that could mean the
HSM copy would be corrupt since we can always just copy it out again,
but it doesn't seem hugely complicated to notify the backend, if not
the agent and let it decide.
> > Are you sure you want to remove those objects if we will need them
> > later, in "complex" HSM?
> > As this mechanism will need to change a lot when we implement the
> > restore-in-place feature, I'm not sure this is the best idea.
> Ah, I think it is important that we do NOT restore in place to the old
> OST objects. The OSTs may now be full, or indeed not exist anymore.
> The restore in place for complex HSM is at the file level; the objects
> may move around. "Complex" in this case just means that clients will
> have access to partially restored files.
Can't the "complex" HSM restore to new objects? It just depends on when
the new-being-restored objects become the new contents of the file doesn't it?
> > Copy-out
> > - Mover takes a specific lock on a range (0-EOF for the moment)
> > - On this range, reads pass, writes raise a callback on the mover.
> > - On receiving this callback, if the mover releases its lock, the copyout
> > is cancelled; if not, the write I/O is blocked
> I don't think we want to block the write just because the HSM copy isn't
> done yet. If the data is changing, then the policy engine shouldn't
> have started a copyout process in the first place.
Indeed.
> If the customer's
> goal is to do a coherent checkpoint, then it should explicitly wait for
> the copyout to be done.
Disagree - the customer doesn't have to know a copyout is in progress.
The HSM should abort the copyout or mark the copy corrupt.
> If it's just the policy engine that got it
> wrong, it doesn't matter if it finishes or not; the file will be marked
> "hsm_dirty", and so the policy engine should re-queue it for copyout
> again later, and it can't be purged in the meantime since the dirty bit
> is set.
Indeed.
Cheers,
Eric
* [Lustre-devel] "Simple" HSM straw man
2008-10-16 21:56 ` Eric Barton
@ 2008-10-17 9:47 ` Aurelien Degremont
2008-10-17 10:10 ` Eric Barton
2008-10-17 13:54 ` Peter Braam
0 siblings, 2 replies; 14+ messages in thread
From: Aurelien Degremont @ 2008-10-17 9:47 UTC (permalink / raw)
To: lustre-devel
Eric Barton wrote:
> Partially purged files are a requirement to allow graphical file browsers
> to retrieve icons from within the file. It's OK to miss this out in the
> first version, but it has to be there for the full product.
Think also of commands like
$ file foo*
>>> Is it worthwhile to have a file that is outdated and possibly
>>> incoherent?
>> It is probably useful in some cases -- simulation checkpoints maybe.
>
> A corrupt simulation checkpoint is useless. We _must_ provide a way to
> ensure the HSM copy of a file is a known good snapshot. We don't necessarily
> have to abort the copyout if there is an update that could mean the
> HSM copy would be corrupt since we can always just copy it out again,
> but it doesn't seem hugely complicated to notify the backend, if not
> the agent and let it decide.
Indeed, this is important
>> I don't think we want to block the write just because the HSM copy isn't
>> done yet. If the data is changing, then the policy engine shouldn't
>> have started a copyout process in the first place.
>
> Indeed.
You were speaking of a FS with only one big file and so we need to have
a way to be sure it will be copied at least once, even if people are
writing to it.
In this case, with a classical policy engine, this file will never be
copied out because data is constantly changing.
--
Aurelien Degremont
CEA
* [Lustre-devel] "Simple" HSM straw man
2008-10-17 9:47 ` Aurelien Degremont
@ 2008-10-17 10:10 ` Eric Barton
2008-10-17 13:54 ` Peter Braam
1 sibling, 0 replies; 14+ messages in thread
From: Eric Barton @ 2008-10-17 10:10 UTC (permalink / raw)
To: lustre-devel
Aurelien,
> >> I don't think we want to block the write just because the HSM
> >> copy isn't done yet. If the data is changing, then the policy
> >> engine shouldn't have started a copyout process in the first
> >> place.
> >
> > Indeed.
>
> You were speaking of a FS with only one big file and so we need to
> have a way to be sure it will be copied at least once, even if
> people are writing to it. In this case, with a classical policy
> engine, this file will never be copied out because data is
> constantly changing.
I'm not so sure that's a realistic case. If this file is so active
that it's impossible to take a consistent copy of it without some
sort of a snapshot facility, does that really mean it's a candidate
for archiving?
Cheers,
Eric
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Lustre-devel] "Simple" HSM straw man
2008-10-17 9:47 ` Aurelien Degremont
2008-10-17 10:10 ` Eric Barton
@ 2008-10-17 13:54 ` Peter Braam
1 sibling, 0 replies; 14+ messages in thread
From: Peter Braam @ 2008-10-17 13:54 UTC (permalink / raw)
To: lustre-devel
On 10/17/08 3:47 AM, "Aurelien Degremont" <aurelien.degremont@cea.fr> wrote:
> Eric Barton wrote:
>> Partially purged files is a requirement to allow graphical file browsers
>> to retrieve icons from within the file. It's OK to miss this out in the
>> first version, but it has to be there for the full product.
>
> Think also of commands like
>
> $ file foo*
>
>>>> Is it interesting to have a file that is outdated and possibly
>>>> incoherent?
99.99% (probably more 9's) of backup systems do work this way, with
relatively little harm.
Also remember that many files are append only - for those it might be fine.
Philosophically it is a disaster of course. I would offer archiving of
files that are active, and in due course use snapshots.
Peter
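
To make the trade-off discussed above concrete, here is a minimal sketch of a copyout that does not block writers, copies the file again if it changed during the copy, and after a bounded number of attempts keeps the last copy but reports it as possibly inconsistent. Everything in it is an assumption for illustration: the function names are made up, change detection uses size and mtime rather than any Lustre mechanism, and a plain file copy stands in for the HSM copy tool.

```python
import os
import shutil


def snapshot(path):
    """Return (size, mtime_ns), used here as a cheap change detector."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def copyout(src, dst, max_attempts=3, copy=shutil.copyfile):
    """Copy src to dst without blocking writers; return (attempts, clean).

    clean is True if src looked unchanged across one whole copy.
    After max_attempts the last copy is kept anyway and reported as
    not clean, so a constantly changing file is still archived once
    and the backend can decide what to do with it.
    """
    for attempt in range(1, max_attempts + 1):
        before = snapshot(src)
        copy(src, dst)
        if snapshot(src) == before:
            return attempt, True
    return max_attempts, False
```

The `clean` flag is the piece that corresponds to notifying the backend and letting it decide; a real implementation would need a stronger change signal than size and mtime.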
>>> It is probably useful in some cases -- simulation checkpoints maybe.
>>
>> A corrupt simulation checkpoint is useless. We _must_ provide a way to
>> ensure the HSM copy of a file is a known good snapshot. We don't necessarily
>> have to abort the copyout if there is an update that could mean the
>> HSM copy would be corrupt since we can always just copy it out again,
>> but it doesn't seem hugely complicated to notify the backend, if not
>> the agent and let it decide.
>
> Indeed, this is important
>
>>> I don't think we want to block the write just because the HSM copy isn't
>>> done yet. If the data is changing, then the policy engine shouldn't
>>> have started a copyout process in the first place.
>>
>> Indeed.
>
> You were speaking of a FS with only one big file and so we need to have
> a way to be sure it will be copied at least once, even if people are
> writing to it.
> In this case, with a classical policy engine, this file will never be
> copied out because data is constantly changing.
>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Lustre-devel] HSM arch wiki
2008-10-13 16:39 ` Aurelien Degremont
2008-10-14 19:41 ` Nathaniel Rutman
@ 2008-10-20 23:09 ` Nathaniel Rutman
2008-10-21 13:21 ` Aurelien Degremont
2008-11-25 16:59 ` Alex Kulyavtsev
1 sibling, 2 replies; 14+ messages in thread
From: Nathaniel Rutman @ 2008-10-20 23:09 UTC (permalink / raw)
To: lustre-devel
High-level architecture page for the Lustre HSM project
http://arch.lustre.org/index.php?title=HSM_Migration
HSM core team - this is intended to be sufficient to write a full
HLD/DLD from. What is it missing?
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Lustre-devel] HSM arch wiki
2008-10-20 23:09 ` [Lustre-devel] HSM arch wiki Nathaniel Rutman
@ 2008-10-21 13:21 ` Aurelien Degremont
2008-11-25 16:59 ` Alex Kulyavtsev
1 sibling, 0 replies; 14+ messages in thread
From: Aurelien Degremont @ 2008-10-21 13:21 UTC (permalink / raw)
To: lustre-devel
Nathaniel Rutman wrote:
> High-level architecture page for the Lustre HSM project
> http://arch.lustre.org/index.php?title=HSM_Migration
>
>
> HSM core team - this is intended to be sufficient to write a full
> HLD/DLD from. What is it missing?
>
I think after Tuesday's conf call, all the main elements for writing the
HLD will be there. Maybe some small points will be missing, but we could
discuss them by e-mail or at conf calls. I think most of it is already there.
--
Aurelien Degremont
CEA
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Lustre-devel] HSM arch wiki
2008-10-20 23:09 ` [Lustre-devel] HSM arch wiki Nathaniel Rutman
2008-10-21 13:21 ` Aurelien Degremont
@ 2008-11-25 16:59 ` Alex Kulyavtsev
2008-11-26 16:29 ` Nathaniel Rutman
2008-11-27 19:05 ` Andreas Dilger
1 sibling, 2 replies; 14+ messages in thread
From: Alex Kulyavtsev @ 2008-11-25 16:59 UTC (permalink / raw)
To: lustre-devel
A few questions:
- For a large existing archive of tapes (~10,000,000 files) it is
desirable to import file metadata into the Lustre fs without actually
copying the files to disk.
The import should be done in a reasonable time (hours rather than months)
or online.
- To provide bandwidth to tape it is desirable to have multiple migrator
nodes connected to the HSM. What element of the proposed design distributes
copy-out processes across migrator nodes to provide scalability? Is it a
function of the HSM-specific copy tool, or does the Lustre agent provide it?
- A "smart" HSM system can reorder requests to optimize tape access. It
is common to have 2000 requests pending in the queue with tens or hundreds
of IO transfers actually being served. The current limit on pending requests
is about 30,000. We found that implementing pending requests as processes
(one copy-out tool process per request waiting for IO) is resource
consuming and not scalable. What is the way to serve ~100,000 requests
waiting for transfer?
- How to prestage files? That is, send an asynchronous request to copy in
a file from tape without blocking on the wait. This is needed to stage
large data sets for future processing. Prestaging "file sets" is desirable.
- What is the proposed scenario for handling an OST going down? Suppose a
file is present on one of the OSTs and that OST goes down (stripe count is
one). My understanding is that the client will wait until the OST comes
back (case [1]) and the file will not be staged from tape automatically. If
the file is not present on any OST, it will be staged immediately
(case [2]). Is it possible to stage the file automatically in case [1] to
another OST and mark the copy on the old OST for removal?
We discussed some of these questions with Peter; he suggested asking on
the devel list.
Best regards, Alex.
Nathaniel Rutman wrote:
> High-level architecture page for the Lustre HSM project
> http://arch.lustre.org/index.php?title=HSM_Migration
>
>
> HSM core team - this is intended to be sufficient to write a full
> HLD/DLD from. What is it missing?
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Lustre-devel] HSM arch wiki
2008-11-25 16:59 ` Alex Kulyavtsev
@ 2008-11-26 16:29 ` Nathaniel Rutman
2008-11-27 19:05 ` Andreas Dilger
1 sibling, 0 replies; 14+ messages in thread
From: Nathaniel Rutman @ 2008-11-26 16:29 UTC (permalink / raw)
To: lustre-devel
Alex Kulyavtsev wrote:
> Few questions :
> - For large existing archive of tapes (~10,000,000 files) it is
> desirable to import file metadata to lustre fs without actually
> copying files on disk.
> Import shall be done in reasonable time (hours rather than month) or
> online.
Agreed. Probably best done via a special ioctl that would create a stub
file and populate the metadata.
>
> - to provide bandwidth to tape it is desirable to have multiple
> migrator nodes connected to HSM. What element of proposed design
> distributes copy-out processes across migrator nodes to provide
> scalability ? Is it functionality of HSM specific copy tool or does
> lustre agent provide it ?
Lustre agents can run on multiple Lustre clients in parallel.
Coordinator distributes copyout jobs to different agents.
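
As a rough sketch of that distribution (not the actual coordinator, which is planned as kernel code on the MDT), the following assumes a least-loaded dispatch rule, consolidates repeat requests for the same FID, and re-queues an agent's jobs if it becomes unresponsive. The class, method names, and the least-loaded rule are all assumptions for illustration.

```python
class Coordinator:
    """Toy model: hand copyout jobs (identified by FID) to agents."""

    def __init__(self, agents):
        self.jobs = {agent: set() for agent in agents}  # agent -> in-flight FIDs
        self.owner = {}                                  # FID -> agent

    def _least_loaded(self):
        # Ties are broken by agent name so the result is deterministic.
        # Raises ValueError if no agents are left.
        return min(self.jobs, key=lambda a: (len(self.jobs[a]), a))

    def dispatch(self, fid):
        """Assign fid to an agent; a repeat request reuses the existing job."""
        if fid in self.owner:
            return self.owner[fid]
        agent = self._least_loaded()
        self.jobs[agent].add(fid)
        self.owner[fid] = agent
        return agent

    def complete(self, fid):
        """Record that the job for fid has finished."""
        agent = self.owner.pop(fid)
        self.jobs[agent].discard(fid)

    def agent_failed(self, agent):
        """Drop an unresponsive agent and re-dispatch its jobs elsewhere."""
        orphaned = sorted(self.jobs.pop(agent))
        for fid in orphaned:
            del self.owner[fid]
        return {fid: self.dispatch(fid) for fid in orphaned}
```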
>
>
> - a "smart" HSM system can reorder requests to optimize tape access.
> It is common to have 2000 requests pending in queue with tens or
> hundreds IO transfers actually served. Current limit of pending
> requests is about 30,000. We found implementing of pending requests as
> processes (one copy-out tool process per request waiting for IO) is
> resource consuming and is not scalable. What is the way to serve
> ~100,000 request waiting for transfer ?
>
Coordinator decides when to request copyin/out jobs, and could throttle
the total number of concurrent accesses.
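
One way to picture the throttling, as a sketch only: pending requests are kept as plain queue entries rather than one waiting process each, and only a bounded number are handed out at a time. The class name and the `max_active` limit are hypothetical, not part of any Lustre interface.

```python
from collections import deque


class ThrottledQueue:
    """Hold many pending requests cheaply; run at most max_active at once."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.pending = deque()  # FIDs waiting for a transfer slot
        self.active = set()     # FIDs currently being transferred

    def submit(self, fid):
        """Queue a request; this costs one queue entry, not a process."""
        self.pending.append(fid)

    def start_ready(self):
        """Move requests from pending to active while slots are free."""
        started = []
        while self.pending and len(self.active) < self.max_active:
            fid = self.pending.popleft()
            self.active.add(fid)
            started.append(fid)
        return started

    def finish(self, fid):
        """Free the slot held by a completed transfer."""
        self.active.remove(fid)
```

With this shape, 100,000 waiting requests are 100,000 small entries in memory, and the number of concurrent accesses is set independently of the queue depth.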
> - how to prestage files ? Send asynchronous request for copy-in file
> from tape without blocking on wait. It is needed to stage large data
> sets for future processing. Prestaging "file sets" is desirable.
Policy engine would request copyin of files before cache miss on open.
Policy could define file sets.
>
> - what proposed scenario to handle OST down ? Suppose file is present
> on one of OSTs and it went down (striping is one). My understanding is
> client will wait when OST will come back (case[1]) and file will not
> be staged from tape automatically. IF file is not present on any OST,
> it will be staged immediately (case[2]). Is possible to stage file
> automatically (case[1]) to another OST and mark a copy on old OST for
> removal ?
With our V2 HSM, we will have the ability to keep more detailed layouts;
this optimization could be part of those changes.
>
> We discussed some of these questions with Peter, he suggested to ask
> on devel list.
We greatly appreciate it! Please ask/suggest away.
>
> Best regards, Alex.
>
> Nathaniel Rutman wrote:
>> High-level architecture page for the Lustre HSM project
>> http://arch.lustre.org/index.php?title=HSM_Migration
>>
>>
>> HSM core team - this is intended to be sufficient to write a full
>> HLD/DLD from. What is it missing?
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Lustre-devel] HSM arch wiki
2008-11-25 16:59 ` Alex Kulyavtsev
2008-11-26 16:29 ` Nathaniel Rutman
@ 2008-11-27 19:05 ` Andreas Dilger
1 sibling, 0 replies; 14+ messages in thread
From: Andreas Dilger @ 2008-11-27 19:05 UTC (permalink / raw)
To: lustre-devel
On Nov 25, 2008 10:59 -0600, Alex Kulyavtsev wrote:
> - For large existing archive of tapes (~10,000,000 files) it is
> desirable to import file metadata to lustre fs without actually copying
> files on disk.
> Import shall be done in reasonable time (hours rather than month) or online.
Conceivably this could be done with "mknod" and "setxattr" to store the
striping information into the Lustre inode. However, one issue will be
how to identify this new file to the HSM. The current plan is that the
Lustre HSM policy engine database will contain the mapping between the
Lustre FID (~= inode number) and the file in the archive.
Since this is a new file (FID) then we would also need to add an entry to
the policy engine database that contains the mapping from FID->archive
file.
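
A minimal sketch of the database side of such an import, assuming a single table that maps a FID string to an archive identifier. The schema, the function names, and the use of SQLite are illustrative assumptions; the creation of the stub file itself (the "mknod"/"setxattr" step) is not modelled here.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the (hypothetical) policy engine database with a FID map table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS fid_map ("
        "fid TEXT PRIMARY KEY, archive_id TEXT NOT NULL)"
    )
    return db


def import_entries(db, entries):
    """Bulk-insert (fid, archive_id) pairs in one transaction.

    A duplicate FID raises sqlite3.IntegrityError and rolls the batch back.
    """
    with db:
        db.executemany(
            "INSERT INTO fid_map (fid, archive_id) VALUES (?, ?)", entries
        )


def lookup(db, fid):
    """Return the archive identifier for fid, or None if it is unknown."""
    row = db.execute(
        "SELECT archive_id FROM fid_map WHERE fid = ?", (fid,)
    ).fetchone()
    return row[0] if row else None
```

Batching the inserts in one transaction is what keeps an import of millions of entries in the range of hours rather than months on the database side.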
> - a "smart" HSM system can reorder requests to optimize tape access. It
> is common to have 2000 requests pending in queue with tens or hundreds
> IO transfers actually served. Current limit of pending requests is about
> 30,000. We found implementing of pending requests as processes (one
> copy-out tool process per request waiting for IO) is resource consuming
> and is not scalable. What is the way to serve ~100,000 request waiting
> for transfer ?
The Lustre HSM design has the policy engine as a mediator between the
copyin/copyout/purge requests and the userspace agents that are specific
to the HSM and do the actual work.
The policy engine is free to reorder all of the requests as it sees
fit. CEA is supplying their existing policy engine as a starting point
for Lustre HSM+HPSS, and I think this could be made available to interested
parties sooner rather than later.
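
As an illustration of the kind of reordering meant here, the sketch below groups pending requests by tape and sorts each group by position on the tape, so that each tape is mounted once and read in one pass. The `Request` fields (`tape`, `position`) are assumptions about what an archive backend could report; nothing in this thread specifies them.

```python
from itertools import groupby
from typing import NamedTuple


class Request(NamedTuple):
    fid: str       # FID of the file to copy in (made-up values in any example)
    tape: str      # hypothetical tape volume label
    position: int  # hypothetical offset of the file on that tape


def batches_by_tape(requests):
    """Group requests per tape, each batch sorted by position on the tape."""
    ordered = sorted(requests, key=lambda r: (r.tape, r.position))
    return {
        tape: [r.fid for r in group]
        for tape, group in groupby(ordered, key=lambda r: r.tape)
    }
```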
> - what proposed scenario to handle OST down ? Suppose file is present on
> one of OSTs and it went down (striping is one). My understanding is
> client will wait when OST will come back (case[1]) and file will not be
> staged from tape automatically. IF file is not present on any OST, it
> will be staged immediately (case[2]). Is possible to stage file
> automatically (case[1]) to another OST and mark a copy on old OST for
> removal ?
Since the OST objects will be removed when the file is purged there is
no requirement to store the file on a particular OST during copyin.
The HSM will store the striping attributes (probably only if they do
not match the filesystem defaults) to ensure that wide striped files
retain this property when returned to the filesystem.
In addition to not saving the striping for files that match the default
layout, we may also consider saving the layout of files with
"stripe_count == target_count" as having a stripe_count = -1 (stripe over
all OSTs) so that if there are more OSTs available when the file is
restored it takes advantage of the additional bandwidth. We might also
consider having a (policy engine?) tunable that files with > N stripes
are restriped over all OSTs when restored.
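
The rules above could be sketched roughly as follows. The function names, the `widen_above` tunable, and the clamping to the current OST count when fewer OSTs exist at restore time are all assumptions for illustration, not part of the design.

```python
STRIPE_ALL = -1  # stripe over all OSTs


def layout_to_save(stripe_count, default_count, target_count):
    """Return the stripe count to record in the HSM, or None for 'default'."""
    if stripe_count == default_count:
        return None            # matches filesystem default: store nothing
    if stripe_count == target_count:
        return STRIPE_ALL      # was striped over every OST at copyout time
    return stripe_count


def restore_count(saved, default_count, target_count, widen_above=None):
    """Return the stripe count to use when the file is copied back in.

    target_count is the number of OSTs available at restore time.
    widen_above models the suggested tunable: files with more than that
    many stripes are restriped over all OSTs.
    """
    if saved is None:
        return default_count
    if saved == STRIPE_ALL:
        return target_count
    if widen_above is not None and saved > widen_above:
        return target_count
    return min(saved, target_count)  # assumption: clamp if OSTs were removed
```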
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2008-11-27 19:05 UTC | newest]
Thread overview: 14+ messages
2008-10-09 22:37 [Lustre-devel] "Simple" HSM straw man Nathaniel Rutman
2008-10-13 16:39 ` Aurelien Degremont
2008-10-14 19:41 ` Nathaniel Rutman
2008-10-15 21:52 ` Nathaniel Rutman
2008-10-16 14:09 ` Aurelien Degremont
2008-10-16 21:56 ` Eric Barton
2008-10-17 9:47 ` Aurelien Degremont
2008-10-17 10:10 ` Eric Barton
2008-10-17 13:54 ` Peter Braam
2008-10-20 23:09 ` [Lustre-devel] HSM arch wiki Nathaniel Rutman
2008-10-21 13:21 ` Aurelien Degremont
2008-11-25 16:59 ` Alex Kulyavtsev
2008-11-26 16:29 ` Nathaniel Rutman
2008-11-27 19:05 ` Andreas Dilger