* [Lustre-devel] WBC HLD outline
@ 2009-03-23 21:58 Alexander Zarochentsev
2009-03-23 23:17 ` Robert Read
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Alexander Zarochentsev @ 2009-03-23 21:58 UTC (permalink / raw)
To: lustre-devel
Hello,
here is a wbc hld outline.
Please take a look.
===============================================
WBC HLD OUTLINE
* Definitions
WBC (MD WBC): (Meta Data) Write Back Cache.
MD operation: whole MD operation over an object:
rename/create/open/close/setattr/getattr/link/unlink/mkdir/rmdir +
readdir.
Reintegration: The process of applying accumulated MD operations to the
MD servers.
MDS/RAW: MDS API extension for "raw" fs operations, e.g. inserting a
dir entry without creating an inode.
MD update: a part of MD operation to be executed on one server,
contains one or more MDS/RAW operations.
MD batch: a collection of per-server MD updates.
MDTR: MD translator: translates MD operations into MD/Raw ones.
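A minimal sketch of how the definitions above nest, assuming a batch is simply the per-server grouping of updates. The class names, the raw-op kinds ("insert_dirent", "create_inode") and the server names are hypothetical and are not taken from any Lustre code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RawOp:
    """One MDS/RAW operation (hypothetical kinds: insert_dirent, create_inode)."""
    kind: str
    target: str
    arg: str = ""


@dataclass
class MDUpdate:
    """The part of one MD operation that is executed on a single server."""
    server: str
    raw_ops: list


@dataclass
class MDBatch:
    """All MD updates destined for one server."""
    server: str
    updates: list = field(default_factory=list)


def build_batches(updates):
    """Group updates by target server, keeping their original order."""
    batches = {}
    for update in updates:
        batch = batches.setdefault(update.server, MDBatch(update.server))
        batch.updates.append(update)
    return batches


# A create whose directory entry lives on mds1 and whose inode lives on mds2.
create_f = [
    MDUpdate("mds1", [RawOp("insert_dirent", "dir", "f")]),
    MDUpdate("mds2", [RawOp("create_inode", "f")]),
]
batches = build_batches(create_f)
```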
* Requirements
Client application is able to create 64k files/second.
Reintegration moves the fs from one consistent state to another
consistent state.
Non-WBC client support w/o visible overhead.
Avoid MDS code rewrite if possible.
* Design outline
** Overall picture
[Application]
|
=syscalls=
|
V
[VFS]
|
=vfs hooks=
|
V
[LLITE/MDC]
|
=MD (non-WBC) proto=
|
V
[MD CACHE MANAGER] ---> [LDLM]
|
V
[MDTR]
+-----------+----------+
| | |
=======WBC proto==========
| | |
V V V
[MDS1/RAW] [MDS2/RAW] [MDS3/RAW]
** WBC
A WBC client has an MDTR running on the client side.
It can also be a proxy server, acting as a server for
non-WBC clients and as a client for MD servers.
*** WBC vs non-WBC
While processing an MD operation request (lock enqueue + op intent, as Alex
suggested), the MD server may decide to execute it itself, or grant
only a lock (a subtree one) and allow the client to continue in WBC mode.
*** Locks
Needed LDLM locks are taken before the operation starts and held until the
corresponding batch is re-integrated.
*** Local cache management
WBC client executes operations locally, modifying local in-memory
objects. WBC client has a (redo-)log of all operations.
The cache manager controls process of MD cache re-integration.
*** MDS/RAW operations
Managing directory entries and inodes, without maintaining
fs consistency automatically.
create/update/delete methods for directory entries and inodes.
*** MDTR
MDTR is responsible for converting MD operations into a set of
per-server MD/RAW operations.
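A rough sketch of the translation step for a single create, assuming the directory and the new inode may live on different servers. The function name, the raw-op tuples and the `server_of` placement lookup are all hypothetical; the outline does not specify them.

```python
def translate(op, server_of):
    """Split one MD operation into per-server lists of raw operations.

    `op` is (kind, parent_dir, name); `server_of` maps an object path to the
    server that holds it. Only "create" is handled in this sketch.
    """
    kind, parent, name = op
    child = parent.rstrip("/") + "/" + name
    if kind != "create":
        raise ValueError(f"unsupported MD operation: {kind}")

    raw = [
        (server_of(child), ("create_inode", child)),
        (server_of(parent), ("insert_dirent", parent, name)),
    ]
    per_server = {}
    for server, raw_op in raw:
        per_server.setdefault(server, []).append(raw_op)
    return per_server


# Hypothetical placement: the directory on mds1, the new file's inode on mds2.
placement = {"/d": "mds1", "/d/f": "mds2"}
updates = translate(("create", "/d", "f"), placement.get)
```

When both objects land on the same server, the two raw operations end up in one per-server list, which matches the single-server case of an update.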
*** Client re-integration
Periodically, or because of (sub-)lock release, dirty memory
flushing and so on, the WBC client submits batches to all MD servers involved
in the operations.
The process of re-integration is protected by LDLM locks. MD servers are
updated using the WBC protocol.
*** WBC protocol
WBC request contains a set of MD/RAW operations, tagged with one epoch
number. Bulk transfers are used.
*** File data
Flushing file data to the OST servers is delayed until file creation
is re-integrated.
*** Recovery
The redo-log is preserved until it is no longer needed for recovery (i.e.
until the epoch becomes stable).
The client replays the log and re-executes all operations from it, repeating
MDTR processing (dispatching the operations between MD servers).
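A minimal sketch of the redo-log lifetime described above, assuming each record carries the epoch it was issued in and that the client learns a "last stable epoch" from the servers. The class and method names are hypothetical.

```python
class RedoLog:
    """Client-side redo log: records are kept until their epoch is stable."""

    def __init__(self):
        self._records = []  # (epoch, operation) in issue order

    def append(self, epoch, op):
        self._records.append((epoch, op))

    def prune(self, stable_epoch):
        """Drop records that recovery can no longer need."""
        self._records = [(e, op) for e, op in self._records if e > stable_epoch]

    def replay(self, execute):
        """Re-execute every remaining operation in its original order."""
        for epoch, op in self._records:
            execute(epoch, op)

    def __len__(self):
        return len(self._records)


log = RedoLog()
log.append(1, "create a")
log.append(1, "create b")
log.append(2, "unlink a")
log.prune(stable_epoch=1)   # epoch 1 became stable on all servers

replayed = []
log.replay(lambda epoch, op: replayed.append((epoch, op)))
```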
**** WBC client eviction, uncompleted updates
If the client dies before re-integration is completed, there are three
choices:
a) Cluster-wide rollback: all servers roll back to the last globally
stable epoch, then clients replay their redo-logs.
This scenario should be avoided because a single client failure may
stop the whole cluster for recovery.
b) All servers participating in re-integration coordinate to undo
uncompleted updates.
c) The servers have all information needed to complete re-integration
w/o client.
The recovery strategy is the subject of the CMD Recovery Design document,
but option (c) would need support in the WBC protocol.
** non-WBC
*** MD protocol
MD (non-WBC) protocol remains the same as now.
** Use cases
*** WBC / non-WBC decision
1. Check whether server and client can operate in WBC-mode through
connect flags.
2. If they can, a lock enqueue request may contain a request for
WBC-mode; the server may respond by granting WBC-mode and an STL or PW
lock on the directory. The MD server accepts or rejects a WBC-mode request
depending on server rules and per-object access statistics.
*** File creation
client gets a PW lock on directory.
client fetches directory content.
client does file creation locally, in cache, the operation record is
added to the client redo-log.
Another client wants to read the directory; the lock conflict triggers
re-integration.
MD Cache manager processes the redo-log, prepares batches with MDS/RAW
operations and submits them to the MD servers.
The MD servers integrate the batches.
MD Cache manager frees local cache content and cancels the directory
lock.
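The file creation steps above can be walked through with a toy model. This is only a sketch under simplifying assumptions: one MD server, one directory, and a redo-log that is dropped right after re-integration (the outline keeps it until the epoch is stable). All class and method names are hypothetical.

```python
class ToyServer:
    """Stands in for an MD server holding one directory."""

    def __init__(self):
        self.entries = set()

    def integrate(self, batch):
        for kind, name in batch:
            if kind == "create":
                self.entries.add(name)


class ToyWbcClient:
    def __init__(self, server):
        self.server = server
        self.has_pw_lock = False
        self.cache = set()
        self.redo_log = []

    def lock_directory(self):
        """Take the PW lock and fetch the directory content."""
        self.has_pw_lock = True
        self.cache = set(self.server.entries)

    def create(self, name):
        """Create a file locally and record the operation in the redo-log."""
        if not self.has_pw_lock:
            raise RuntimeError("directory lock is not held")
        if name in self.cache:
            raise FileExistsError(name)
        self.cache.add(name)
        self.redo_log.append(("create", name))

    def on_lock_conflict(self):
        """Another client wants the directory: re-integrate, then drop the lock."""
        self.server.integrate(list(self.redo_log))
        self.redo_log.clear()
        self.cache.clear()
        self.has_pw_lock = False


server = ToyServer()
client = ToyWbcClient(server)
client.lock_directory()
client.create("a")
client.create("b")
visible_before = set(server.entries)   # nothing has reached the server yet
client.on_lock_conflict()
visible_after = set(server.entries)
```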
** Questions
Q: Can several wbc clients work in one directory simultaneously?
A: If extent locks for directories are implemented, each WBC client
can take a lock on a hash interval.
Q: can wbc clients do massive file creation in one directory
efficiently?
A: An idea that may help: if we can guess that the file names created
by a client are lexicographically ordered, a special hash function
may reduce lock conflicts between clients holding locks on
directory extents.
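One possible reading of the "special hash function" idea is an order-preserving hash, so that a run of lexicographically close names maps to one contiguous hash interval. The sketch below is an assumption about what such a function could look like, not a description of any existing Lustre directory hash; the function names and the 8-byte width are hypothetical.

```python
def ordered_hash(name, width=8):
    """Map a name to an integer that preserves lexicographic order.

    Only the first `width` bytes are used, so names sharing that prefix
    collide; a real directory hash would need to handle that.
    """
    prefix = name.encode("utf-8")[:width].ljust(width, b"\0")
    return int.from_bytes(prefix, "big")


def hash_interval(names):
    """Smallest hash interval a client would have to lock for these names."""
    hashes = [ordered_hash(n) for n in names]
    return min(hashes), max(hashes)


# Two clients creating sequentially named files in the same directory.
client_a = [f"a{i:04d}" for i in range(1, 101)]
client_b = [f"b{i:04d}" for i in range(1, 101)]
lo_a, hi_a = hash_interval(client_a)
lo_b, hi_b = hash_interval(client_b)
```

With these name patterns the two intervals do not overlap, so extent locks on the hash intervals would not conflict. A uniformly distributing hash would instead scatter each client's names over the whole hash space.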
Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-03-23 21:58 [Lustre-devel] WBC HLD outline Alexander Zarochentsev
@ 2009-03-23 23:17 ` Robert Read
2009-03-25 8:17 ` Alexander Zarochentsev
2009-03-24 5:06 ` Alex Zhuravlev
2009-04-01 8:17 ` Eric Barton
2 siblings, 1 reply; 16+ messages in thread
From: Robert Read @ 2009-03-23 23:17 UTC (permalink / raw)
To: lustre-devel
Hi Zam,
On Mar 23, 2009, at 14:58 , Alexander Zarochentsev wrote:
> Hello,
>
> here is a wbc hld outline.
> Please take a look.
>
> ===============================================
> WBC HLD OUTLINE
>
> * Definitions
> WBC (MD WBC): (Meta Data) Write Back Cache.
>
> MD operation: whole MD operation over an object:
> rename/create/open/close/setattr/getattr/link/unlink/mkdir/rmdir +
> readdir.
>
> Reintegration: The process of applying accumulated MD operation to the
> MD servers.
>
> MDS/RAW: MDS API extension to do "raw" fs operations: inserting of a
> dir entry w/o creating inode and so.
>
> MD update: a part of MD operation to be executed on one server,
> contains one or more MDS/RAW operations.
Why does the client need to be more granular than an update? It
seems MDS/Raw and update should be the same.
>
> MD batch: a collection of per-server MD updates.
>
> MDTR: MD translator: translates MD operations into MD/Raw ones.
Isn't this essentially what the cmm is doing today? (Breaking down
distributed operations into per-node updates?) Are you expanding on
Alex's idea of creating a new generic MD server stack?
>
> * Requirements
>
> Client application is able to create 64k files/second.
>
> Reintergration moves fs from one consistent state to another
> consistent state.
>
> Non-WBC client support w/o visible overhead.
>
> Avoid MDS code rewrite if possible.
>
> * Design outline
>
> ** Overall picture
>
> [Application]
> |
> =syscalls=
> |
> V
> [VFS]
> |
> =vfs hooks=
> |
> V
> [LLITE/MDC]
> |
> =MD (non-WBC) proto=
> |
> V
> [MD CACHE MANAGER] ---> [LDLM]
> |
> V
> [MDTR]
> +-----------+----------+
> | | |
> =======WBC proto==========
> | | |
> V V V
> [MDS1/RAW] [MDS2/RAW] [MDS3/RAW]
>
> ** WBC
>
> WBC client has a MDTR running on client side,
> it also can be a proxy server, acting as a server for
> non-WBC clients and as a client for MD servers.
>
> *** WBC vs non-WBC
>
> Processing MD operation request (lock enqueue + op intent, by Alex
> suggestion), MD server may decide to execute it by itself, or grant a
> only a lock (subtree one) and allow client to continue in WBC mode.
>
> *** Locks
>
> needed LDLM locks are taken before operation starts and held until the
> corresponded batch is re-integrated.
>
> *** Local cache management
>
> WBC client executes operations locally, modifying local in-memory
> objects. WBC client has a (redo-)log of all operations.
>
> The cache manager controls process of MD cache re-integration.
>
> *** MDS/RAW operations
>
> Managing directory entries and inodes, without maintaining
> fs consistency automatically.
>
> create/update/delete methods for directory entries and inodes.
>
> *** MDTR
>
> MDTR is responsible for converting MD operations into set of
> per-server MD/RAW operations.
>
> *** Client re-integration
>
> Periodically, or because of (sub-)lock releasing, dirty memory
> flushing or so, WBC client submits batches to all MD servers involved
> into the operations.
>
> Process of re-integration is protected by LDLM locks. MD servers are
> updated
> using WBC protocol.
>
> *** WBC protocol
>
> WBC request contains a set of MD/RAW operations, tagged with one epoch
> number. Bulk transfers are used.
All the updates in a single operation must have the same epoch, but I
don't think we can guarantee that all the operations in a batch will
be in the same epoch, unless we stop exchanging messages with all the
MD servers. I don't see a need for them to be in the same epoch, either.
>
> *** File data
> Flushing file data to the OST servers is delayed until file creation
> is re-integrated.
>
> *** Recovery
>
> The redo-log preserved until it is not needed in recovery (i.e. epoch
> gets stable)
>
> Client replay the log and re-execute all operations from it, repeating
> MDTR processing (dispatching the operation between MD servers).
Since the MD servers all roll back before recovery, recovery will be
very similar to the original reintegration, with the exception of
using versions. So we should try to keep the recovery (replay) code
as similar to the normal code as possible, and move recovery higher
into the stack.
>
> **** WBC client eviction, uncompleted updates
>
> If client dies until re-integration is completed, there are three
> choices:
>
> a) Cluster-wide rollback, all servers roll back to the last globally
> stable epoch, then clients to replay heir redo-logs.
>
> This scenario should be avoided because a single client failure may
> may stop whole cluster for recovery.
>
> b) All servers participating in re-integration coordinate to undo
> uncompleted updates.
>
> c) The servers have all information needed to complete re-integration
> w/o client.
You mean by keeping the original operation info in the undo logs?
>
> The recovery strategy is a subject of CMD Recovery Design document,
> but a possibility of (c) need a support in the WBC protocol.
>
> ** non-WBC
>
> *** MD protocol
>
> MD (non-WBC) protocol remains the same as now.
>
> ** Use cases
>
> *** WBC / non-WBC decision
>
> 1. Check whether server and client can operate in WBC-mode through
> connect flags.
>
> 2. I they can, a lock enqueue request may contain a request for
> WBC-mode, the server may respond with granting WBC-mode and STL or PW
> lock on the directory. MD server accepts or rejects WBC-mode request
> depending on server rules and per-object access statistics.
>
> *** File creation
>
> client gets a PW lock on directory.
>
> client fetches directory content.
>
> client does file creation locally, in cache, the operation record is
> added to the client redo-log.
>
> Another client want to read the directory, lock conflict triggers
> re-integration.
>
> MD Cache manager processes the redo-log, prepares batches with MDS/RAW
> operations and submits them to the MD servers.
>
> The MD servers integrate the batches.
>
> MD Cache manager frees local cache content and cancels the directory
> lock.
>
> ** Questions
>
> Q: Can several wbc clients work in one directory simultaneously?
> A: If extent locks for directories are implemented, each WBC client
> can take a lock on a hash interval.
>
> Q: can wbc clients do massive file creation in one directory
> efficiently?
> A: the idea that may help: if we can guess that the file names created
> by a client are lexicographically ordered, a special hash function
> may reduce lock conflicts between clients holding locks on
> directory extents.
cheers,
robert
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-03-23 23:17 ` Robert Read
@ 2009-03-25 8:17 ` Alexander Zarochentsev
2009-03-25 8:33 ` Alex Zhuravlev
0 siblings, 1 reply; 16+ messages in thread
From: Alexander Zarochentsev @ 2009-03-25 8:17 UTC (permalink / raw)
To: lustre-devel
On 24 March 2009 02:17:33 Robert Read wrote:
> Hi Zam,
> > MD update: a part of MD operation to be executed on one server,
> > contains one or more MDS/RAW operations.
>
> Why does the client need to to be more granular than an update? It
> seems MDS/Raw and update should be the same.
well, better to say an update is an MDS op if the operation touches only one
MD server, and an MDS/Raw op in case of a distributed operation.
> > MD batch: a collection of per-server MD updates.
> >
> > MDTR: MD translator: translates MD operations into MD/Raw ones.
>
> Isn't this essentially what the cmm is doing today? (Breaking down
> distributed operations into per-node updates?) Are you expanding on
> Alex's idea of creating a new generic MD server stack?
I just doubt that cmm code reuse is worth MD stack relayering. Can it be
done as a subtask later?
> > *** WBC protocol
> >
> > WBC request contains a set of MD/RAW operations, tagged with one
> > epoch number. Bulk transfers are used.
>
> All the updates in a single operation must have the same epoch, but I
> don't think we can guarantee that all the operations in a batch will
> be in the same epoch, unless we stop exchanging messages with all the
> MD servers. I don't see a need for them to be in the same epoch,
> either.
you are right.
> > *** File data
> > Flushing file data to the OST servers is delayed until file
> > creation is re-integrated.
> >
> > *** Recovery
> >
> > The redo-log preserved until it is not needed in recovery (i.e.
> > epoch gets stable)
> >
> > Client replay the log and re-execute all operations from it,
> > repeating MDTR processing (dispatching the operation between MD
> > servers).
>
> Since the MD servers all roll back before recovery, recovery will be
> very similar to the original reintegration, with the exception of
> using versions. So we should try to keep the recovery (replay) code
> as similar to the normal code as possible, and move recovery higher
> into the stack.
ok.
> > **** WBC client eviction, uncompleted updates
> >
> > If client dies until re-integration is completed, there are three
> > choices:
> >
> > a) Cluster-wide rollback, all servers roll back to the last
> > globally stable epoch, then clients to replay heir redo-logs.
> >
> > This scenario should be avoided because a single client failure may
> > may stop whole cluster for recovery.
> >
> > b) All servers participating in re-integration coordinate to undo
> > uncompleted updates.
> >
> > c) The servers have all information needed to complete
> > re-integration w/o client.
>
> You mean by keeping the original operation info in the undo logs?
I meant the servers receive not updates but whole operations. If the
client failed and didn't send an update to some of the servers, the
operation can be completed w/o the client. It is an alternative to
undoing of partial updates.
Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-03-25 8:17 ` Alexander Zarochentsev
@ 2009-03-25 8:33 ` Alex Zhuravlev
2009-03-25 16:17 ` Alexander Zarochentsev
0 siblings, 1 reply; 16+ messages in thread
From: Alex Zhuravlev @ 2009-03-25 8:33 UTC (permalink / raw)
To: lustre-devel
>>>>> Alexander Zarochentsev (AZ) writes:
AZ> On 24 March 2009 02:17:33 Robert Read wrote:
>> Hi Zam,
>> > MD update: a part of MD operation to be executed on one server,
>> > contains one or more MDS/RAW operations.
>>
>> Why does the client need to to be more granular than an update? It
>> seems MDS/Raw and update should be the same.
AZ> well, better to say an update is MDS op if the operation touch only one
AZ> MD server and MDS/Raw op in case of distributed operation.
I think this just adds an unneeded entity to the system. stating that
we either have updates or operations is simpler.
>> Isn't this essentially what the cmm is doing today? (Breaking down
>> distributed operations into per-node updates?) Are you expanding on
>> Alex's idea of creating a new generic MD server stack?
AZ> I just doubt that cmm code reuse is worth MD stack relayering. Can it be
AZ> done as a subtask later?
I don't think CMM is the right thing because it essentially breaks layering:
instead of sending an object creation request in terms of the OSD API or an index
insert in terms of the OSD API, it introduces some intermediate thing which
is neither an operation nor an update.
>> You mean by keeping the original operation info in the undo logs?
AZ> I meant the servers receive not updates but whole operations. If the
AZ> client failed and didn't send an update to some of the servers, the
AZ> operation can be completed w/o the client. It is an alternative to
AZ> undoing of partial updates.
same can be done with updates if you send them through single server.
and you don't need to put additional cpu processing to parse operation
into updates.
thanks, Alex
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-03-25 8:33 ` Alex Zhuravlev
@ 2009-03-25 16:17 ` Alexander Zarochentsev
2009-03-25 16:26 ` Alex Zhuravlav
2009-03-25 16:32 ` Alex Zhuravlav
0 siblings, 2 replies; 16+ messages in thread
From: Alexander Zarochentsev @ 2009-03-25 16:17 UTC (permalink / raw)
To: lustre-devel
On 25 March 2009 11:33:12 Alex Zhuravlev wrote:
> >>>>> Alexander Zarochentsev (AZ) writes:
>
> AZ> On 24 March 2009 02:17:33 Robert Read wrote:
> >> Hi Zam,
> >>
> >> > MD update: a part of MD operation to be executed on one server,
> >> > contains one or more MDS/RAW operations.
> >>
> >> Why does the client need to to be more granular than an update?
> >> It seems MDS/Raw and update should be the same.
>
> AZ> well, better to say an update is MDS op if the operation touch
> AZ> only one MD server and MDS/Raw op in case of distributed
> AZ> operation.
>
>
> I think this just adds unneeded entity to the system. stating that
> we either have updates or operations is simpler.
>
> >> Isn't this essentially what the cmm is doing today? (Breaking
> >> down distributed operations into per-node updates?) Are you
> >> expanding on Alex's idea of creating a new generic MD server
> >> stack?
>
> AZ> I just doubt that cmm code reuse is worth MD stack relayering.
> AZ> Can it be done as a subtask later?
>
> I don't think CMM is right thing because it essentially breaks
> layering: instead of sending object creation request in terms of OSD
> API or index insert in terms of OSD API it introduces some
> intermediate thing which is neither operation nor update.
The server MD stack has to support both WBC and non-WBC clients for the same
objects. That is why I think the MDT layer should handle MD ops as well as
MDS/RAW ops. Then CMM only passes RAW operations to the MDD layer, where
raw ops are already supported.
Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-03-25 16:17 ` Alexander Zarochentsev
@ 2009-03-25 16:26 ` Alex Zhuravlav
2009-03-25 16:32 ` Alex Zhuravlav
1 sibling, 0 replies; 16+ messages in thread
From: Alex Zhuravlav @ 2009-03-25 16:26 UTC (permalink / raw)
To: lustre-devel
>>>>> Alexander Zarochentsev (AZ) writes:
AZ> Server MD stack has to support both WBC and non-WBC clients for the same
AZ> objects. It is why I think MDT layer should handle MD ops as well as
AZ> MDS/RAW ops. Then CMM only passes RAW operations to MDD layer, where
AZ> raw ops are already supported.
then I don't understand what you mean by CMM. same about RAW operations.
--
thanks, Alex
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-03-25 16:17 ` Alexander Zarochentsev
2009-03-25 16:26 ` Alex Zhuravlav
@ 2009-03-25 16:32 ` Alex Zhuravlav
1 sibling, 0 replies; 16+ messages in thread
From: Alex Zhuravlav @ 2009-03-25 16:32 UTC (permalink / raw)
To: lustre-devel
>>>>> Alexander Zarochentsev (AZ) writes:
AZ> Server MD stack has to support both WBC and non-WBC clients for the same
AZ> objects. It is why I think MDT layer should handle MD ops as well as
AZ> MDS/RAW ops. Then CMM only passes RAW operations to MDD layer, where
AZ> raw ops are already supported.
btw, what's the problem with supporting WBC and non-WBC clients for the same objects?
any time you access some object via the short path (MDT-OSD, for a WBC client) or the long
path (MDT-MDD-OSD, for a non-WBC client), it's initialized at all layers (MDT-MDD-OSD).
--
thanks, Alex
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-03-23 21:58 [Lustre-devel] WBC HLD outline Alexander Zarochentsev
2009-03-23 23:17 ` Robert Read
@ 2009-03-24 5:06 ` Alex Zhuravlev
2009-04-01 8:17 ` Eric Barton
2 siblings, 0 replies; 16+ messages in thread
From: Alex Zhuravlev @ 2009-03-24 5:06 UTC (permalink / raw)
To: lustre-devel
>>>>> Alexander Zarochentsev (AZ) writes:
AZ> MDS/RAW: MDS API extension to do "raw" fs operations: inserting of a
AZ> dir entry w/o creating inode and so.
this seems to be duplication of OSD API's insert/delete/etc.
AZ> MDTR: MD translator: translates MD operations into MD/Raw ones.
and this one seems to duplicate MDD code.
why would we want to duplicate these things?
thanks, Alex
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-03-23 21:58 [Lustre-devel] WBC HLD outline Alexander Zarochentsev
2009-03-23 23:17 ` Robert Read
2009-03-24 5:06 ` Alex Zhuravlev
@ 2009-04-01 8:17 ` Eric Barton
2009-04-06 10:23 ` Alexander Zarochentsev
2 siblings, 1 reply; 16+ messages in thread
From: Eric Barton @ 2009-04-01 8:17 UTC (permalink / raw)
To: lustre-devel
Zam,
Some notes on the WBC HLD outline
1. The requirement is for 32K creates/second on one node of small
files with a random size of up to 64K. It's basically HPCS IO
Scenario 4.
2. Reintegration must change the filesystem from one consistent state
to another consistent state _atomically_.
3. Not all the updates in a batch for 1 server need to have the same
epoch number - i.e. being forced to advance your epoch
(e.g. because you acquired a lock) doesn't force you to create
a new batch.
I think this got mentioned in other emails.
4. Most readers won't know what "bulk transfers are used" means for batches.
5. Is ensuring file data is delayed until file creation is
reintegrated sufficient for correct operation? Are we not
effectively doing create-on-write with a WBC? I'm sure there
are more issues (e.g. orphans).
Does including the OSTs in epoch recovery solve all the issues? If
so, what are the expected bounds on client redo and server undo
storage? Can we avoid needing server undo for data with some
compromises? Can we exploit the DMU at all?
6. The section on recovering from WBC client death seems imprecise.
Is (a) just describing V1-4 in Nikita's original post - similarly
(b) for V1-2, V3'-5'? Also, for (c) I think we may have discussed
the possibility of always sending updates as the full operation +
context to select which updates apply locally so that an operation
can always be recovered from any of its updates.
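The idea in point 6 of sending "the full operation + context" can be sketched as follows, assuming every server can re-derive all per-server updates from the full operation. The message layout, field names and raw-op tuples are hypothetical; the actual protocol is not defined in this thread.

```python
def split_operation(op):
    """Derive every per-server update of a distributed create."""
    updates = {}
    updates.setdefault(op["inode_server"], []).append(
        ("create_inode", op["name"]))
    updates.setdefault(op["dir_server"], []).append(
        ("insert_dirent", op["parent"], op["name"]))
    return updates


def local_updates(message):
    """A server applies only the updates selected by the message's context."""
    return split_operation(message["op"])[message["context"]]


def missing_servers(message, acked):
    """Servers that still need their part; derivable from any single message."""
    return sorted(set(split_operation(message["op"])) - set(acked))


op = {"kind": "create", "parent": "/d", "name": "f",
      "dir_server": "mds1", "inode_server": "mds2"}

# The client died after sending only the mds1 message.
msg_to_mds1 = {"op": op, "context": "mds1"}
applied_on_mds1 = local_updates(msg_to_mds1)
still_needed = missing_servers(msg_to_mds1, acked=["mds1"])
```

Here mds1 holds enough information to tell that mds2 never received its update, which is what completing the operation without the client requires. The cost is that each server parses the whole operation rather than receiving a ready-made update.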
Cheers,
Eric
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-04-01 8:17 ` Eric Barton
@ 2009-04-06 10:23 ` Alexander Zarochentsev
2009-04-07 6:18 ` Andreas Dilger
0 siblings, 1 reply; 16+ messages in thread
From: Alexander Zarochentsev @ 2009-04-06 10:23 UTC (permalink / raw)
To: lustre-devel
Hello Eric,
Thanks for the review,
On 1 April 2009 12:17:17 Eric Barton wrote:
> Zam,
>
> Some notes on the WBC HLD outline
[...]
>
> 5. Is ensuring file data is delayed until file creation is
> reintegrated sufficient for correct operation? Are we not
> effectively doing create-on-write with a WBC? I'm sure there
> are more issues (e.g. orphans).
>
> Does including the OSTs in epoch recovery solve all the issues?
> If so, what are the expected bounds on client redo and server undo
> storage? Can we avoid needing server undo for data with some
> compromises? Can we exploit the DMU at all?
I think we can't avoid tagging OST object creation w/ an epoch counter.
Would Lustre users complain if file writes are out-of-epochs?
So a write to an existing OST object may survive losing the context of the MD
operations where the write operation was issued; object
creation/deletion may not.
The alternative is to implement undo logging for file data. It would
require support from the underlying server fs. It could be done for
ldiskfs, not sure about DMU.
There is a security problem with out-of-epochs writes and setting
file attributes (especially permissions):
chmod 400 foo; cat /etc/secret-file >> foo. Chmod/chown can be a special
case which triggers wbc flush.
> 6. The section on recovering from WBC client death seems imprecise.
> Is (a) just describing V1-4 in Nikita's original post - similarly
> (b) for V1-2, V3'-5'? Also, for (c) I think we may have discussed
> the possibility of always sending updates as the full operation +
> context to select which updates apply locally so that an operation
> can always be recovered from any of its updates.
It is only a rough schema of client eviction to list what support might
be needed in wbc protocol, like sending full MD op instead of update--
what you just mentioned. BTW, I thought Epochs HLD would cover the
detailed algorithm descriptions, no?
> Cheers,
> Eric
Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-04-06 10:23 ` Alexander Zarochentsev
@ 2009-04-07 6:18 ` Andreas Dilger
2009-04-07 6:30 ` Alex Zhuravlev
0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2009-04-07 6:18 UTC (permalink / raw)
To: lustre-devel
On Apr 06, 2009 13:23 +0300, Alexander Zarochentsev wrote:
> On 1 April 2009 12:17:17 Eric Barton wrote:
> I think we can't avoid tagging OST object creation w/ epoch counter.
> Would Lustre users complain if file writes are out-of-epochs?
>
> There is a security problem with out-of-epochs writes and setting
> file attributes (especially permissions):
> chmod 400 foo; cat /etc/secret-file >> foo. Chmod/chown can be a special
> case which triggers wbc flush.
While this example has been given many times as a security issue that
forces many strange actions on the part of Lustre, the example is
fundamentally broken because POSIX allows "foo" to be opened before the
chmod, and kept open until after the write and then read the "secret-file"
content. The "foo" file needs to be created securely in the first place
to be safe.
> > 6. The section on recovering from WBC client death seems imprecise.
> > Is (a) just describing V1-4 in Nikita's original post - similarly
> > (b) for V1-2, V3'-5'? Also, for (c) I think we may have discussed
> > the possibility of always sending updates as the full operation +
> > context to select which updates apply locally so that an operation
> > can always be recovered from any of its updates.
>
> It is only a rough schema of client eviction to list what support might
> be needed in wbc protocol, like sending full MD op instead of update--
> what you just mentioned. BTW, I thought Epochs HLD would cover the
> detailed algorithm descriptions, no?
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-04-07 6:18 ` Andreas Dilger
@ 2009-04-07 6:30 ` Alex Zhuravlev
2009-04-07 7:50 ` Nikita Danilov
2009-04-09 3:04 ` Oleg Drokin
0 siblings, 2 replies; 16+ messages in thread
From: Alex Zhuravlev @ 2009-04-07 6:30 UTC (permalink / raw)
To: lustre-devel
>>>>> Andreas Dilger (AD) writes:
AD> On Apr 06, 2009 13:23 +0300, Alexander Zarochentsev wrote:
>> On 1 April 2009 12:17:17 Eric Barton wrote:
>> I think we can't avoid tagging OST object creation w/ epoch counter.
>> Would Lustre users complain if file writes are out-of-epochs?
>>
>> There is a security problem with out-of-epochs writes and setting
>> file attributes (especially permissions):
>> chmod 400 foo; cat /etc/secret-file >> foo. Chmod/chown can be a special
>> case which triggers wbc flush.
AD> While this example has been given many times as a security issue that
AD> forces many strange actions on the part of Lustre, the example is
AD> fundamentally broken because POSIX allows "foo" to be opened before the
AD> chmod, and kept open until after the write and then read the "secret-file"
AD> content. The "foo" file needs to be created securely in the first place
AD> to be safe.
yup, and there is no way in posix to even check whether a file is opened.
my take on this and similar security related issues is that we probably
should provide two modes:
1) strict, when no optimizations in the order of flush are done
2) relaxed, when order is not guaranteed and the user should use some form of sync
but lustre can improve performance
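A toy illustration of the two modes, assuming "relaxed" means the client may group pending flushes per object. The function, the mode names as strings and the grouping policy are hypothetical; they only show which ordering guarantee is given up.

```python
def flush_order(pending, mode):
    """Return the order in which pending (object, action) pairs are flushed."""
    if mode == "strict":
        # Flush exactly in the order the operations were issued.
        return list(pending)
    if mode == "relaxed":
        # Group by object; sorted() is stable, so order per object is kept,
        # but ordering between different objects is lost.
        return sorted(pending, key=lambda op: op[0])
    raise ValueError(f"unknown mode: {mode}")


pending = [("b", "truncate"), ("a", "chmod"), ("b", "write")]
strict = flush_order(pending, "strict")
relaxed = flush_order(pending, "relaxed")
```

In the relaxed result the chmod on "a" is flushed before the truncate on "b", although it was issued after it. An application that depends on ordering between different objects would need an explicit sync between the two steps.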
--
thanks, Alex
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-04-07 6:30 ` Alex Zhuravlev
@ 2009-04-07 7:50 ` Nikita Danilov
2009-04-08 16:41 ` Alexander Zarochentsev
2009-04-09 3:04 ` Oleg Drokin
1 sibling, 1 reply; 16+ messages in thread
From: Nikita Danilov @ 2009-04-07 7:50 UTC (permalink / raw)
To: lustre-devel
2009/4/7 Alex Zhuravlev <bzzz@sun.com>
> >>>>> Andreas Dilger (AD) writes:
Hello,
>
>
> AD> On Apr 06, 2009 13:23 +0300, Alexander Zarochentsev wrote:
> >> On 1 April 2009 12:17:17 Eric Barton wrote:
> >> I think we can't avoid tagging OST object creation w/ epoch counter.
> >> Would Lustre users complain if file writes are out-of-epochs?
> >>
> >> There is a security problem with out-of-epochs writes and setting
> >> file attributes (especially permissions):
> >> chmod 400 foo; cat /etc/secret-file >> foo. Chmod/chown can be a special
> >> case which triggers wbc flush.
>
> AD> While this example has been given many times as a security issue that
> AD> forces many strange actions on the part of Lustre, the example is
> AD> fundamentally broken because POSIX allows "foo" to be opened before the
> AD> chmod, and kept open until after the write and then read the "secret-file"
> AD> content. The "foo" file needs to be created securely in the first place
> AD> to be safe.
the original "partial write-back" problem was demonstrated with the use case
$ mkdir -m 0700 a # nobody but me can access things under "a"
$ umask 000
$ mkdir -m 0777 -p a/b/c/d
$ echo "secret data" > a/b/c/d/file
$ sync # time passes...
$ echo > a/b/c/d/file # truncate secret data
$ chmod 777 a # relax permissions
Note that here an ordering between data and meta-data updates on _different_
objects is important.
>
> yup, and there is no way in posix to even check whether file is opened.
>
> my take on this and similar security related issues is that we probably
> should provide two modes:
> 1) strict, when no optimizations in order of flush is done
> 2) relaxed, when order is not garanteed and user should use some form of sync
> but lustre can improve performance
The old (and outdated) WBC HLD has a section "Partial write-out" describing
these issues.
--
> thanks, Alex
Nikita.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Lustre-devel] WBC HLD outline
2009-04-07 7:50 ` Nikita Danilov
@ 2009-04-08 16:41 ` Alexander Zarochentsev
2009-04-09 8:58 ` Nikita Danilov
0 siblings, 1 reply; 16+ messages in thread
From: Alexander Zarochentsev @ 2009-04-08 16:41 UTC (permalink / raw)
To: lustre-devel
Hello Nikita!
On 7 April 2009 11:50:29 Nikita Danilov wrote:
> 2009/4/7 Alex Zhuravlev <bzzz@sun.com>
>
> > >>>>> Andreas Dilger (AD) writes:
>
> Hello,
>
> > AD> On Apr 06, 2009 13:23 +0300, Alexander Zarochentsev wrote:
> > >> On 1 April 2009 12:17:17 Eric Barton wrote:
> > >> I think we can't avoid tagging OST object creation w/ epoch
> > >> counter. Would Lustre users complain if file writes are
> > >> out-of-epochs?
> > >>
> > >> There is a security problem with out-of-epochs writes and
> > >> setting file attributes (especially permissions):
> > >> chmod 400 foo; cat /etc/secret-file >> foo. Chmod/chown can be
> > >> a special case which triggers wbc flush.
> >
> > AD> While this example has been given many times as a security
> > AD> issue that forces many strange actions on the part of Lustre,
> > AD> the example is fundamentally broken because POSIX allows "foo"
> > AD> to be opened before the chmod, and kept open until after the
> > AD> write and then read the "secret-file" content. The "foo" file
> > AD> needs to be created securely in the first place to be safe.
>
> the original "partial write-back" problem was demonstrated with the
> use case
>
> $ mkdir -m 0700 a # nobody but me can access things under "a"
> $ umask 000
> $ mkdir -m 0777 -p a/b/c/d
> $ echo "secret data" > a/b/c/d/file
> $ sync # time passes...
> $ echo > a/b/c/d/file # truncate secret data
> $ chmod 777 a # relax permissions
>
> Note that here an ordering between data and meta-data updates on
> _different_ objects is important.
If we only guarantee that MD updates are not reordered, would Lustre
behave like ext3 without data journalling? I think that is not terrible.
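A rough sketch of what such a guarantee would and would not allow, as a toy Python check and not Lustre code. It assumes each cached update is tagged "meta" or "data", and compares a "strict" rule (no reordering at all) with an "md_ordered" rule (only MD updates keep their relative order). The mode names and the respects() helper are hypothetical.

```python
def respects(schedule, program, mode):
    """True if `schedule` (assumed to be a permutation of `program`)
    is a legal write-back order under the given mode."""
    if mode == "strict":
        # No reordering of any kind.
        return schedule == program
    if mode == "md_ordered":
        # Only the relative order of meta-data updates is preserved.
        def md(ops):
            return [op for op in ops if op[0] == "meta"]
        return md(schedule) == md(program)
    raise ValueError(mode)

# Program order of the use case quoted above, after the sync.
program = [
    ("data", "truncate a/b/c/d/file"),
    ("meta", "chmod 777 a"),
]
# Schedule in which the chmod overtakes the truncate.
reordered = [
    ("meta", "chmod 777 a"),
    ("data", "truncate a/b/c/d/file"),
]
```

Under these assumptions respects(reordered, program, "strict") is False while respects(reordered, program, "md_ordered") is True, so an MD-only ordering guarantee still admits the schedule from the quoted use case.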
> > yup, and there is no way in posix to even check whether file is
> > opened.
> >
> > my take on this and similar security related issues is that we
> > probably should provide two modes:
> > 1) strict, when no optimizations in the order of flush are done
> > 2) relaxed, when order is not guaranteed and the user should use some
> > form of sync, but lustre can improve performance
>
> The old (and outdated) WBC HLD has a section "Partial write-out"
> describing these issues.
>
> --
>
> > thanks, Alex
>
> Nikita.
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
* [Lustre-devel] WBC HLD outline
2009-04-08 16:41 ` Alexander Zarochentsev
@ 2009-04-09 8:58 ` Nikita Danilov
0 siblings, 0 replies; 16+ messages in thread
From: Nikita Danilov @ 2009-04-09 8:58 UTC (permalink / raw)
To: lustre-devel
2009/4/8 Alexander Zarochentsev <Alexander.Zarochentsev@sun.com>
> Hello Nikita!
>
> On 7 April 2009 11:50:29 Nikita Danilov wrote:
> > 2009/4/7 Alex Zhuravlev <bzzz@sun.com>
> >
> > > >>>>> Andreas Dilger (AD) writes:
> >
> > Hello,
> >
>
[...]
> > $ echo > a/b/c/d/file # truncate secret data
> > $ chmod 777 a # relax permissions
> >
> > Note that here an ordering between data and meta-data updates on
> > _different_ objects is important.
>
> If we only guarantee no reordering in MD updates, Lustre behavior would
> be like ext3 without data journalling? I think it is not terrible.
It's not terrible, but it is non-intuitive, in my opinion. More enlightened
file systems, like ZFS, reiser4, and NTFS, provide stronger consistency
guarantees, ignoring the petty distinctions between data and meta-data. :-)
But even limiting consistency to meta-data leaves some issues open. For
example, think about an md proxy server acting as a WBC client for a higher
tier server. To be efficient, such a proxy might need to cache a very large
amount of meta-data, and it most likely cannot afford to keep a log of all
operations. In this situation, when a lock on a top-level directory gets a
blocking AST, the proxy would have to write back all cached dirty meta-data
under this directory before the lock can be cancelled (to guarantee ordering
of visible meta-data updates), which might result in unacceptable latency.
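A small sketch of the cost being described, as a toy Python model and not Lustre code. It assumes the proxy tracks dirty cached meta-data only as a set of path strings, with no operation log, so cancelling a lock on a directory means writing back every dirty entry below it. The must_flush() helper and the sample paths are hypothetical.

```python
def must_flush(dirty, locked_dir):
    """Dirty cached paths that have to be written back before the lock
    on `locked_dir` can be cancelled (paths are plain strings here)."""
    root = locked_dir.rstrip("/")
    return sorted(p for p in dirty if p == root or p.startswith(root + "/"))

# Hypothetical dirty set held by the proxy.
dirty = {"a/b/c/d/file", "a/b/x", "a/y", "e/z"}

# A blocking AST on the top-level directory "a" drags in everything
# cached under it; a lock deeper in the tree needs far less.
top_level = must_flush(dirty, "a")
deeper = must_flush(dirty, "a/b/c")
```

The amount of work grows with the size of the cached subtree and not with the single update that triggered the blocking AST, which is where the latency concern comes from.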
>
> > > thanks, Alex
> >
> > Nikita.
>
> --
> Alexander "Zam" Zarochentsev
>
Nikita.
* [Lustre-devel] WBC HLD outline
2009-04-07 6:30 ` Alex Zhuravlev
2009-04-07 7:50 ` Nikita Danilov
@ 2009-04-09 3:04 ` Oleg Drokin
1 sibling, 0 replies; 16+ messages in thread
From: Oleg Drokin @ 2009-04-09 3:04 UTC (permalink / raw)
To: lustre-devel
Hello!
On Apr 7, 2009, at 2:30 AM, Alex Zhuravlev wrote:
> AD> While this example has been given many times as a security issue that
> AD> forces many strange actions on the part of Lustre, the example is
> AD> fundamentally broken because POSIX allows "foo" to be opened before the
> AD> chmod, and kept open until after the write and then read the "secret-file"
> AD> content. The "foo" file needs to be created securely in the first place
> AD> to be safe.
> yup, and there is no way in posix to even check whether file is opened.
I do not know if file leases are POSIX or not (and cannot check right now),
but they do in fact allow you not only to ensure the file is not opened in
a certain mode, but would also allow you to get notified when somebody
attempts to open a file on which you have obtained such a lease.
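For illustration, a minimal sketch of probing a file with a lease on Linux, where leases are exposed through fcntl(2) with F_SETLEASE (a Linux-specific command). It uses Python's fcntl module; the try_write_lease() helper and its return values are hypothetical, and the result depends on the file system and on the caller owning the file.

```python
import errno
import fcntl
import os

def try_write_lease(path):
    """Try to take, then drop, a write lease on `path`.

    Returns "granted" if nobody else has the file open, "busy" if the
    kernel refuses with EAGAIN, and "unsupported" in every other case
    (no F_SETLEASE on this platform, file system without leases, ...).
    """
    if not hasattr(fcntl, "F_SETLEASE"):
        return "unsupported"
    fd = os.open(path, os.O_RDONLY)
    try:
        # A write lease is only granted when there are no other open
        # file descriptors for the file.
        fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_WRLCK)
    except OSError as e:
        return "busy" if e.errno == errno.EAGAIN else "unsupported"
    else:
        # Release the lease again; this sketch only probes.
        fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_UNLCK)
        return "granted"
    finally:
        os.close(fd)
```

While a lease is held, the kernel notifies the holder with a signal (SIGIO by default) when another process tries to open or truncate the file; the sketch leaves that part out.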
Bye,
Oleg
end of thread, other threads:[~2009-04-09 8:58 UTC | newest]
Thread overview: 16+ messages
2009-03-23 21:58 [Lustre-devel] WBC HLD outline Alexander Zarochentsev
2009-03-23 23:17 ` Robert Read
2009-03-25 8:17 ` Alexander Zarochentsev
2009-03-25 8:33 ` Alex Zhuravlev
2009-03-25 16:17 ` Alexander Zarochentsev
2009-03-25 16:26 ` Alex Zhuravlav
2009-03-25 16:32 ` Alex Zhuravlav
2009-03-24 5:06 ` Alex Zhuravlev
2009-04-01 8:17 ` Eric Barton
2009-04-06 10:23 ` Alexander Zarochentsev
2009-04-07 6:18 ` Andreas Dilger
2009-04-07 6:30 ` Alex Zhuravlev
2009-04-07 7:50 ` Nikita Danilov
2009-04-08 16:41 ` Alexander Zarochentsev
2009-04-09 8:58 ` Nikita Danilov
2009-04-09 3:04 ` Oleg Drokin