* [Lustre-devel] Commit on share
@ 2008-05-27 10:44 Peter Braam
  2008-05-27 15:43 ` Mikhail Pershin
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Peter Braam @ 2008-05-27 10:44 UTC (permalink / raw)
  To: lustre-devel

This HLD is definitely not ready at all.  It is very short, lacks
interaction diagrams and the arguments made are not sufficiently detailed.

* the second sentence is not right.  Commit should happen before
un-committed data coming from a client is shared with a 2nd client.
* Is COS dependent on VBR? No, it is not, and it can equally apply to normal
recovery
* Section 3.2 is wrong: the recovery process will not fail with gaps in the
sequence when there is VBR.  It only fails if there are gaps in the
versions, and this is rare.
* 3.3 parallel creations in one directory are protected with different,
independent lock resources.  Isn't that sufficient to allow parallel
operations with COS?
* 3.6 provide a detailed explanation please
* GC thread is the wrong mechanism; this is what we have commit callbacks for
* Why not use the DLM, then we can simply keep the client waiting - the
mechanism already exists for rep-ack; I am not convinced at all by the
reasoning that rep-ack is so different - no real facts are quoted
* It is left completely without explanation how the hash table (which I
think we don't need/want) is used

Regards,

Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: commit-on-sharing-simple-tracking-hld.pdf
Type: application/octet-stream
Size: 63433 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080527/11a2ef8f/attachment.obj>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-05-27 10:44 [Lustre-devel] Commit on share Peter Braam
@ 2008-05-27 15:43 ` Mikhail Pershin
  2008-06-01  5:00   ` Peter Braam
  2008-05-29 17:42 ` Mikhail Pershin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Mikhail Pershin @ 2008-05-27 15:43 UTC (permalink / raw)
  To: lustre-devel

Hello Peter,

Thanks for review. Alexander is on vacation so I will answer as co-author.

On Tue, 27 May 2008 14:44:18 +0400, Peter Braam <Peter.Braam@Sun.COM>  
wrote:

> This HLD is definitely not ready at all.  It is very short, lacks
> interaction diagrams and the arguments made are not sufficiently  
> detailed.
>
> * the second sentence is not right.  Commit should happen before
> un-committed data coming from a client is shared with a 2nd client.

Can you point to any issues with that? Committing data after, rather than
before, the current operation has the following benefits:
1) no need to start and commit a separate transaction, so the code is
simpler
2) fewer syncs. If the commit also covers the current operation, it
resolves the possible sharing case related to that operation, and a later
commit is not needed. Therefore we will have fewer synced operations.
E.g. the worst case for COS is:
client1 operation
client2 operation -> sync
client1 op -> sync
client2 op -> sync
...
COS will commit on every operation if they are on the same object and the
sync happens before the current operation.
But including the current operation in the commit halves the number of
commits:
client1 operation
client2 op -> sync (including this op)
client1 op - no sync because no uncommitted shared data
client2 op -> sync
...
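
The two sequences above can be checked with a small model. This is only a
sketch under stated assumptions: a single object, two clients that strictly
alternate operations, and hypothetical names (cos_policy, cos_count_syncs)
that come from neither the HLD nor the Lustre tree.

```c
#include <stdbool.h>

/* Hypothetical policies, named for this sketch only. */
enum cos_policy {
    COS_SYNC_BEFORE_OP,    /* sync commits earlier changes, not the current op */
    COS_SYNC_INCLUDING_OP  /* sync also commits the current op */
};

/*
 * Count forced syncs for nr_ops operations on one object, issued
 * alternately by two clients (client 0, client 1, client 0, ...).
 */
int cos_count_syncs(int nr_ops, enum cos_policy policy)
{
    bool dirty = false;    /* object carries an uncommitted change */
    int last_client = -1;  /* client that made the latest change */
    int syncs = 0;

    for (int i = 0; i < nr_ops; i++) {
        int client = i % 2;
        /* sharing case: uncommitted change made by a different client */
        bool shared = dirty && client != last_client;

        if (shared)
            syncs++;
        /* the current op stays uncommitted unless the sync covered it */
        dirty = !(shared && policy == COS_SYNC_INCLUDING_OP);
        last_client = client;
    }
    return syncs;
}
```

In this model, 1000 alternating operations cost 999 syncs when the sync
precedes the operation and 500 when the sync includes it.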

> * Is COS dependent on VBR? No, it is not, and it can equally apply to normal
> recovery

Agreed, COS is just a lever that turns sync commit on or off depending on
some conditions. These conditions may be quite simple, as they are now
(just a comparison of clients), or more complex and include checking
operation types, etc.

But COS was initially requested as an optional feature of VBR, so we didn't
review a COS-only configuration. Without VBR, any missing client causes the
eviction of all clients with replays after the gap. Therefore COS will not
help until we change the current recovery so that it doesn't evict clients
when COS is enabled. But then we need to know whether COS was actually
enabled before the server failure, to be sure that skipping the gap
transactions is safe. Do we actually need a COS-only use case?

> * Section 3.2 is wrong: the recovery process will not fail with gaps in  
> the
> sequence when there is VBR.  It only fails if there are gaps in the
> versions, and this is rare.

Section 3.2 talks only about gaps in versions. Maybe it is not
grammatically correct, though:
"... Highly probably we have non-trivial gaps version in the sequence and  
the recovery process fails"
Could you point out what is wrong with 3.2? Should we just rewrite the
sentence to make it clearer which gaps we mean?

> * 3.3 parallel creations in one directory are protected with different,
> independent lock resources.  Isn't that sufficient to allow parallel
> operations with COS?

That is a HEAD feature, but this design is for 1.8 (1.6-based) Lustre with
a single directory lock. If this is not mentioned in the HLD then it should
be fixed.
But the issue is not only about the lock.
The 'simple' COS checks only client NIDs to determine the dependency.
Therefore if two clients are creating objects in the same directory, we
will have frequent syncs due to COS (operations from different NIDs),
although there is no need for a sync at all because the operations are not
dependent.
The same will happen with parallel locking if we do not check the type of
operation to determine the dependency.

> * 3.6 provide a detailed explanation please
"When enabled, COS makes REP-ACK not needed."
Basically, COS is about two things:
1) dependency tracking. This functionality tries to determine whether the
current operation depends on some other uncommitted one. It may be simple
and check only client NIDs, or more sophisticated and include checking the
type of operation or any other additional data.
2) COS itself: doing a sync commit of the current operation if there is a
dependency.

So if we have 1) and 2), only the following cases are possible:
- a dependency is detected and a commit is needed to remove it. No ACK is
needed.
- there is no dependency, and we need neither the ACK, nor the lock, nor a
commit, because the clients' replays are not dependent.
Therefore the ACK is not needed in either case. COS doesn't need to wait on
the rep-ack lock; it detects the sharing case and does the commit.

How the ACK relates to 'simple' COS (only client NIDs matter):
1) client1 does an operation and the object stays locked until the ACK from
client1 arrives at the server
2) client2 waits for the ACK or the commit to access the object
3) if there was no commit yet, then client2 determines that sharing exists
and forces a commit

The only positive effect of the ACK is the delay before doing the sync,
which gives us a chance to wait for the commit without forcing a sync. But
the same result can be achieved with a timer.

In the HLD we propose the following:
1) client1 gets the lock, does the operation, and unlocks the object after
the operation is done
2) client2 gets the lock on the object and checks whether there is a
dependency
3) if there is a dependency, force a commit (or, alternatively, wait for it)
4) otherwise update the dependency info for the next check, and unlock the
object when the operation is done

This is a generic approach and will work with any dependency tracking (on
NIDs, on types of operations, etc.)
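
Steps 2) to 4) could be sketched roughly as follows. This is an
illustration only, assuming NID-based tracking and a forced commit that
covers the current operation; struct cos_obj, struct cos_server and
cos_handle_op() are hypothetical names, not Lustre code, and locking is
left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-object dependency state (not a Lustre structure). */
struct cos_obj {
    uint64_t last_nid;      /* NID of the client that changed it last */
    uint64_t last_transno;  /* transno of that change, 0 if none */
};

/* Hypothetical stand-in for the server transaction state. */
struct cos_server {
    uint64_t next_transno;      /* next transno to hand out */
    uint64_t last_committed;    /* highest transno known to be on disk */
    unsigned int forced_syncs;  /* how often COS forced a commit */
};

/*
 * Execute one modifying operation from client `nid` on `obj`.
 * The caller is assumed to hold the object lock (steps 1 and 2).
 * Returns true if a sync commit was forced.
 */
bool cos_handle_op(struct cos_server *srv, struct cos_obj *obj, uint64_t nid)
{
    uint64_t transno = srv->next_transno++;
    /* step 2: is there an uncommitted change from a different client? */
    bool dep = obj->last_transno > srv->last_committed &&
               obj->last_nid != nid;

    if (dep) {
        /* step 3: force a commit that includes the current operation */
        srv->last_committed = transno;
        srv->forced_syncs++;
    }
    /* step 4: record the latest change for the next check; after a
     * forced commit this record is already committed and harmless */
    obj->last_nid = nid;
    obj->last_transno = transno;
    return dep;
}
```

Swapping the NID comparison for a check on operation types would change
only the computation of `dep`; the rest of the flow stays the same.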

> * GC thread is the wrong mechanism; this is what we have commit callbacks for

Well, with callbacks we have to scan through the whole hash on each
callback to find the data to delete. As Alex said, there can easily be
about 10K uncommitted transactions under high load, so using callbacks may
become the bottleneck. There was a discussion about that recently in
devel@, started by zam. Although I agree the HLD should be clear about why
we chose this way and what is wrong with the other.

> * Why not use the DLM, then we can simply keep the client waiting - the
> mechanism already exists for rep-ack; I am not convinced at all by the
> reasoning that rep-ack is so different - no real facts are quoted

Let's evaluate how suitable the RepACK lock is as a dependency tracking
mechanism. In fact it is more like a 'possible dependency prevention'
mechanism: it always blocks the object because we can't predict the next
operation, so the lock has to be kept for ALL modifying operations. It is
not a 'tracking' but a 'prediction' mechanism; it blocks access to the
object until the client gets the reply just because a conflicting
operation is possible, not because it really happens.
Moreover, in general it conflicts with the dependency tracking we need,
because it serializes operations even when they may not depend on each
other.

With the RepACK lock we enter the operation AFTER the checks and we don't
know the result of those checks: was there an operation from a different
client? Are the changes committed? Should we sync or not? The RepACK lock
doesn't answer these questions, so we can't decide whether a sync is
needed.

For example, client2 will wait for the commit or the ACK before entering
the locked area.
1) The ACK has arrived but there is no commit yet. So client2 enters the
locked area and now must determine whether the commit was done. How to do
that? This is vital because if there was no commit yet then we must do it.
We could possibly use the version of the object and check it against
last_committed, but this will work only with VBR.
So we need extra per-object data, such as a transno.
2) The commit was done. We still have to do the same as in 1) to be sure
whether the commit was done, because it is not known why the lock was
released: due to the ACK or due to the commit.
3) But we still don't know whether there is a conflict, because we should
check client uuids, but we don't store such info anywhere and waiting on
the lock is not recorded anywhere. So again we need extra data (or extra
information from ldlm?) to store the uuid of the client that did the
latest operation on that object.

The only way the DLM can work without any additional data is to unlock
only on commit. But in that case we don't need COS at all. Each
conflicting client will wait on the lock until the previous changes are
committed. But this may lead to huge latency for requests, comparable with
the commit interval, and it is not what we need.

> * It is left completely without explanation how the hash table (which I
> think we don't need/want) is used

The hash table stores the following data per object:
struct lu_dep_info {
         struct ll_fid     di_object;   /* object the entry describes */
         struct obd_uuid   di_client;   /* client that changed it last */
         __u64             di_transno;  /* transno of that change */
};

It contains the uuid of the client and the transno of the last change from
this client. The uuid is compared to determine whether there is a conflict
or not; the transno shows whether that data was already committed. I
described above why it is needed. It is a 1.6-related issue because there
we have only the inode of the object and no extra structure. HEAD has
lu_object wrapping inodes, so the hash will not be needed there; the
dependency info may be stored per lu_object.
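
As an illustration of how such an entry might be consulted, here is a
minimal userspace sketch. The ll_fid and obd_uuid definitions are
simplified stand-ins rather than the real Lustre types, uint64_t replaces
__u64, and dep_needs_commit() is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the Lustre types, for this sketch only. */
struct ll_fid   { uint64_t id; uint32_t generation; };
struct obd_uuid { char uuid[40]; };

struct lu_dep_info {
    struct ll_fid   di_object;
    struct obd_uuid di_client;
    uint64_t        di_transno;  /* __u64 in the original */
};

/* True if the recorded change has not reached disk yet. */
static bool dep_uncommitted(const struct lu_dep_info *di,
                            uint64_t last_committed)
{
    return di->di_transno > last_committed;
}

/*
 * True if `client` touching the object is the sharing case: the last
 * change is still uncommitted and was made by a different client.
 */
bool dep_needs_commit(const struct lu_dep_info *di,
                      const struct obd_uuid *client,
                      uint64_t last_committed)
{
    return dep_uncommitted(di, last_committed) &&
           strncmp(di->di_client.uuid, client->uuid,
                   sizeof(client->uuid)) != 0;
}
```

Both fields are needed for the decision: the uuid alone cannot tell whether
the change is already on disk, and the transno alone cannot tell whose
change it was.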

>
> Regards,
>
> Peter



-- 
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.


* [Lustre-devel] Commit on share
  2008-05-27 10:44 [Lustre-devel] Commit on share Peter Braam
  2008-05-27 15:43 ` Mikhail Pershin
@ 2008-05-29 17:42 ` Mikhail Pershin
  2008-05-31  2:45   ` Andreas Dilger
  2008-06-02  8:42 ` Alex Zhuravlev
  2008-06-11 14:21 ` Alexander Zarochentsev
  3 siblings, 1 reply; 20+ messages in thread
From: Mikhail Pershin @ 2008-05-29 17:42 UTC (permalink / raw)
  To: lustre-devel

Hi,

It seems this mail wasn't received by subscribers, though it is already in
the lustre-devel archive.

-- 
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.


* [Lustre-devel] Commit on share
  2008-05-29 17:42 ` Mikhail Pershin
@ 2008-05-31  2:45   ` Andreas Dilger
  2008-05-31  9:37     ` Alex Zhuravlev
                       ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Andreas Dilger @ 2008-05-31  2:45 UTC (permalink / raw)
  To: lustre-devel

On May 29, 2008  21:42 +0400, Mike Pershin wrote:
> It seems this mail wasn't received by subscribers though it is in  
> lustre-devel archive already. I paste the original answer below.
> 
> Thanks for review. Alexander is on vacation so I will answer as co-author.

Mike, can you please also archive this discussion in the bugzilla bug,
so that it is available for future reference.

> On Tue, 27 May 2008 14:44:18 +0400, Peter Braam <Peter.Braam@Sun.COM>
> 
> How the ACK relates to 'simple' COS (only client NIDs matter):
> 1) client1 does an operation and the object stays locked until the ACK
> from client1 arrives at the server
> 2) client2 waits for the ACK or the commit to access the object
> 3) if there was no commit yet, then client2 determines that sharing exists
> and forces a commit
> 
> The only positive effect of the ACK is the delay before doing the sync,
> which gives us a chance to wait for the commit without forcing a sync. But
> the same result can be achieved with a timer.

On a related note - I just came across bug 3621 - sync outstanding
transaction instead of evicting client when a rep-ack isn't received.
Could you please address this bug at the same time as COS.  With COS
this will always happen, of course, but it should also happen without
it to avoid client eviction if possible.

> > * GC thread is the wrong mechanism; this is what we have commit
> > callbacks for
> 
> Well, with callbacks we have to scan through the whole hash on each
> callback to find the data to delete. As Alex said, there can easily be
> about 10K uncommitted transactions under high load, so using callbacks may
> become the bottleneck. There was a discussion about that recently in
> devel@, started by zam. Although I agree the HLD should be clear about why
> we chose this way and what is wrong with the other.

Maybe I'm misunderstanding something (I didn't read HLD), but the commit
callback can be set per Lustre transaction (in fact multiple callbacks
can exist per transaction) so there should not be any need to do searching
for finding per-transaction cleanup.  What state is the GC thread supposed
to be cleaning up?  Doing GC is also searching, and that is undesirable in
any case.

> > * Why not use the DLM, then we can simply keep the client waiting - the
> > mechanism already exists for rep-ack; I am not convinced at all by the
> > reasoning that rep-ack is so different - no real facts are quoted
> 
> Let's evaluate how suitable the RepACK lock is as a dependency tracking
> mechanism. In fact it is more like a 'possible dependency prevention'
> mechanism: it always blocks the object because we can't predict the next
> operation, so the lock has to be kept for ALL modifying operations. It is
> not a 'tracking' but a 'prediction' mechanism; it blocks access to the
> object until the client gets the reply just because a conflicting
> operation is possible, not because it really happens.

RepACK is currently needed for recovery.  I don't think it is a false
conflict in most cases, though I agree in some cases it is.  If MDS
thread is only e.g. passing through a directory to do some operation
in a previously-existing subdirectory, or wants to stat a file that
existed before the conflicting lock was taken then this is a false
dependency.

> Moreover, in general it conflicts with the dependency tracking we need,
> because it serializes operations even when they may not depend on each
> other.
> 
> With the RepACK lock we enter the operation AFTER the checks and we don't
> know the result of those checks: was there an operation from a different
> client? Are the changes committed? Should we sync or not? The RepACK lock
> doesn't answer these questions, so we can't decide whether a sync is
> needed.

That isn't quite true - if the changes ARE already committed, then the
lock is no longer needed and dropped by the commit callback.  See
ptlrpc_commit_replies->
  schedule_difficult_replies-> (wakeup srv_waitq)
    ptlrpc_main->
      ptlrpc_server_handle_reply->
        ldlm_lock_decref()


> For example, client2 will wait for the commit or the ACK before entering
> the locked area.
> 1) The ACK has arrived but there is no commit yet. So client2 enters the
> locked area and now must determine whether the commit was done. How to do
> that? This is vital because if there was no commit yet then we must do it.
> We could possibly use the version of the object and check it against
> last_committed, but this will work only with VBR. So we need extra
> per-object data, such as a transno.

Yes, this is definitely most efficient with VBR.

> 2) The commit was done. We still have to do the same as in 1) to be sure
> whether the commit was done, because it is not known why the lock was
> released: due to the ACK or due to the commit.
> 3) But we still don't know whether there is a conflict, because we should
> check client uuids, but we don't store such info anywhere and waiting on
> the lock is not recorded anywhere. So again we need extra data (or extra
> information from ldlm?) to store the uuid of the client that did the
> latest operation on that object.

Wouldn't that be in the last_rcvd data for the current client?  If the
req->rq_export->exp_mds_data->med_mcd->mcd_last_transno is the same as
the VBR transno on object being modified then we know this client was
the last one to modify the object and there is no external dependency.
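
A rough sketch of that comparison, with plain integers standing in for
mcd_last_transno and the VBR version of the object. The function names are
hypothetical, and combining the test with last_committed is an assumption
added here for illustration, not something stated above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * client_last_transno: last transno recorded for this client in last_rcvd
 * obj_version:         VBR version of the object, i.e. the transno of its
 *                      latest change
 */
bool cos_same_client_was_last(uint64_t client_last_transno,
                              uint64_t obj_version)
{
    return client_last_transno == obj_version;
}

/*
 * Assumption: an external dependency exists only if the latest change is
 * still uncommitted and was not made by the requesting client.
 */
bool cos_external_dependency(uint64_t client_last_transno,
                             uint64_t obj_version,
                             uint64_t last_committed)
{
    if (obj_version <= last_committed)
        return false;  /* latest change is already on disk */
    return !cos_same_client_was_last(client_last_transno, obj_version);
}
```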

> The only way the DLM can work without any additional data is to unlock
> only on commit. But in that case we don't need COS at all. Each
> conflicting client will wait on the lock until the previous changes are
> committed. But this may lead to huge latency for requests, comparable with
> the commit interval, and it is not what we need.
> 
> > * It is left completely without explanation how the hash table (which I
> > think we don't need/want) is used
> 
> The hash table stores the following data per object:
> struct lu_dep_info {
>           struct ll_fid     di_object;
>           struct obd_uuid   di_client;
>           __u64             di_transno;
> };
> 
> It contains the uuid of the client and the transno of the last change from
> this client. The uuid is compared to determine whether there is a conflict
> or not; the transno shows whether that data was already committed. I
> described above why it is needed. It is a 1.6-related issue because there
> we have only the inode of the object and no extra structure. HEAD has
> lu_object wrapping inodes, so the hash will not be needed there; the
> dependency info may be stored per lu_object.

I think the commit callbacks should be able to free this data, there
should never be any such items on an object with di_transno > last_committed.
Also, isn't it enough to store a single such item per object directly
on the object?  Once we know there is ANY such conflict that is enough
to invoke COS.  For per-object data this can be stored on 1.6 in the
i_filterdata structure that we can attach onto every server inode.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


* [Lustre-devel] Commit on share
  2008-05-31  2:45   ` Andreas Dilger
@ 2008-05-31  9:37     ` Alex Zhuravlev
  2008-06-01  7:03     ` Mikhail Pershin
  2008-06-01 16:54     ` Alex Zhuravlev
  2 siblings, 0 replies; 20+ messages in thread
From: Alex Zhuravlev @ 2008-05-31  9:37 UTC (permalink / raw)
  To: lustre-devel

Andreas Dilger wrote:
> Maybe I'm misunderstanding something (I didn't read HLD), but the commit
> callback can be set per Lustre transaction (in fact multiple callbacks
> can exist per transaction) so there should not be any need to do searching
> for finding per-transaction cleanup.  What state is the GC thread supposed
> to be cleaning up?  Doing GC is also searching, and that is undesirable in
> any case.

I think removing a *few* (usually two, sometimes up to 4) deps from every
callback is quite expensive, as we need to take a spinlock, and within a
short period of time we have to remove thousands of deps. for SMP that's
not good, and it's even worse for NUMA.

instead we can do lazy freeing: in many cases we have to scan the whole
bucket anyway as we search for a possible dependency. so, under a single
spinlock acquisition, and with the search coming for free, we can collect
stale entries and free them (or move them onto a special list to free later).

as entries are supposed to be distributed evenly, this algorithm should work
in most cases. only if we see that some bucket is accumulating too many
stale entries do we start GC.
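
a minimal sketch of the lazy free idea, assuming each bucket is a singly
linked chain and an entry is stale once its transno is <= last_committed.
struct dep_entry and dep_bucket_find_reap() are hypothetical names; the
caller is assumed to hold the bucket spinlock, and the fallback GC for
overfull buckets is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dependency entry chained in one hash bucket. */
struct dep_entry {
    struct dep_entry *next;
    uint64_t object_id;  /* stand-in for the object fid */
    uint64_t transno;    /* transno of the tracked change */
};

/*
 * Search *bucket for a live entry for object_id. While walking the chain,
 * unlink every stale entry and push it onto *stale so that it can be freed
 * later, outside the lock.
 */
struct dep_entry *dep_bucket_find_reap(struct dep_entry **bucket,
                                       uint64_t object_id,
                                       uint64_t last_committed,
                                       struct dep_entry **stale)
{
    struct dep_entry **link = bucket;
    struct dep_entry *found = NULL;

    while (*link != NULL) {
        struct dep_entry *e = *link;

        if (e->transno <= last_committed) {
            *link = e->next;   /* unlink from the bucket */
            e->next = *stale;  /* defer the actual free */
            *stale = e;
            continue;
        }
        if (found == NULL && e->object_id == object_id)
            found = e;
        link = &e->next;
    }
    return found;
}
```

the lookup and the cleanup share one pass over the chain, so no extra
traversal or extra lock acquisition is spent on removing committed entries.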

thanks, Alex


* [Lustre-devel] Commit on share
  2008-05-27 15:43 ` Mikhail Pershin
@ 2008-06-01  5:00   ` Peter Braam
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Braam @ 2008-06-01  5:00 UTC (permalink / raw)
  To: lustre-devel




On 5/27/08 11:43 PM, "Mikhail Pershin" <Mikhail.Pershin@Sun.COM> wrote:

> Hello Peter,
> 
> Thanks for review. Alexander is on vacation so I will answer as co-author.
> 
> On Tue, 27 May 2008 14:44:18 +0400, Peter Braam <Peter.Braam@Sun.COM>
> wrote:
> 
>> This HLD is definitely not ready at all.  It is very short, lacks
>> interaction diagrams and the arguments made are not sufficiently
>> detailed.
>> 
>> * the second sentence is not right.  Commit should happen before
>> un-committed data coming from a client is shared with a 2nd client.
> 
> Can you point to any issues with that? Committing data after, rather than
> before, the current operation has the following benefits:

The document is written without any mention of read-only interaction with
the file system.  On top of that the language was insufficiently clear,
meaning that I only understood what you wanted to do several pages later.
That means other people will encounter the same difficulty.


> 1) no need to start and commit a separate transaction, so the code is
> simpler
> 2) fewer syncs. If the commit also covers the current operation, it
> resolves the possible sharing case related to that operation, and a later
> commit is not needed. Therefore we will have fewer synced operations.
> E.g. the worst case for COS is:
> client1 operation
> client2 operation -> sync
> client1 op -> sync
> client2 op -> sync
> ...
> COS will commit on every operation if they are on the same object and the
> sync happens before the current operation.
> But including the current operation in the commit halves the number of
> commits:
> client1 operation
> client2 op -> sync (including this op)
> client1 op - no sync because no uncommitted shared data
> client2 op -> sync
> ...
> 

This may not be a worthwhile optimization, although it seems correct.
Please provide detailed use cases where it provides value.

For example, with 1000 clients each creating one file in a directory, what
is the quantitative benefit?

With one client creating a file and 999 clients doing a getattr, you now
have 999 locks blocking for completion - not convincing.

>> * Is COS dependent on VBR? No, it is not, and it can equally apply to normal
>> recovery
> 
> Agree, COS is like just lever to turn sync commit on/off depending on some
> conditions. These conditions maybe quite simple like now - just
> comparision of clients - or maybe more complex and include checking of
> operation types, etc.
> 
> But COS was requested initially as optional feature of VBR, so we didn't
> review COS-only configuration. Without VBR any missed client will invoke
> the eviction of all clients with replays after the gap. Therefore COS will
> not helps until we will change current recovery to don't evict clients if
> COS is enabled. But we should know actually was COS enabled before the
> server failure to be sure that the excluding gap transactions is safe. Do
> we need COS-only use-case actually?

Yes, we need COS with traditional recovery and a precise explanation why COS
with VBR adds any value over COS with traditional recovery.  Again, use
numbers and exact facts.


> 
>> * Section 3.2 is wrong: the recovery process will not fail with gaps in
>> the
>> sequence when there is VBR.  It only fails if there are gaps in the
>> versions, and this is rare.
> 
> the 3.2 section is talking only about gap in versions. Maybe it is not
> correct grammatically though.
> "... Highly probably we have non-trivial gaps version in the sequence and
> the recovery process fails"
> Could you mark what is wrong with 3.2? just rewrite the sentence to make
> it more clear about what gaps we mean?

Exact detail, example use cases, no mumbling of complex ideas.

I also want to see precise flow charts of interactions upon reconnection
(this perhaps belongs in the VBR HLD), how does the system transition from
one recovery type to the next.

> 
>> * 3.3 parallel creations in one directory are protected with different,
>> independent lock resources.  Isn't that sufficient to allow parallel
>> operations with COS?
> 
> it is HEAD feature, but this design is for 1.8 (1.6-based) Lustre with one
> dir lock. If this is not mentioned in HLD then it should be fixed.
> But the issue is not about the lock only.
> The 'simple' COS checks only clients nids to determine the dependency.
> Therefore if two clients are creating objects in the same directory then
> we will have frequent syncs due to COS (operations from different nids)
> although there is no need for sync at all becase the operations are not
> dependent.

If they are not dependent then there should be no commits.  But, you have
not defined dependency in a precise way, so the HLD is hand-waving instead
of designing.

In any case I absolutely don't want the hash.  This has to be done with
commit callbacks unless the reasons not to do so become one order of
magnitude clearer.

> The same will be with parallel locking if we will not check type of
> operation to determine the dependency.
> 
>> * 3.6 provide a detailed explanation please
> "When enabled, COS makes REP-ACK not needed."
> Basically the COS is about two things:
> 1) dependency tracking. This functionality which try to determine is
> current operation depending on some other uncommitted one.

"try to?" - so it sometimes fails?

> It may be  
> simple 

What kind of language is this?

> and check only nids of clients, maybe more sophisticated and
> include type of operation checking or any other additional data.

Without a definition of dependency, you can see why I have completely
rejected the HLD, and I will continue to do so.

> 2) COS itself, the doing sync commit of current operation if there is
> dependency.
> 
> So if we have 1) and 2) we have only the following cases:
> - there is dependency determined and commit is needed to remove it. No ACK
> is needed.
> - there is no dependency and we don't need no ACK and lock nor commit
> because client's replays are not dependent
> Therefore the ACK is not needed in both cases. The COS don't need to wait
> on repack lock, it determine the share case and do commit.

In the HLD, state and define in a 100% accurate manner why REP-ACKs are
needed, and prove that with COS they are not.  This clearly depends on precise
definitions.

> 
> how ACK is related to 'simple' COS (the only client NIDs are matter):
> 1) client1 did operation and lock object until ACK from it will come to
> server
> 2) client2 is waiting for ACK or commit to access the object
> 3) if there was no commit yet, then client2 determine the sharing exists
> and force commit
> 
> The only positive effect of ACK is delay before doing sync, that give us
> the chance to wait for commit without doing force sync. But that can be
> done with timer to get the same results.

No timers - end of discussion.

> 
> In HLD we propose the following:
> 1) client1 got lock, did operation, unlock object after operation is done
> 2) client2 got lock on object and check was there the dependency
> 3) if dependency then force commit (or wait for it as alternative way)
> 4) otherwise update dependency info for next check, unlock object when
> operation is done
> 
> This is generic way and will work with any dependency tracking (on NIDs,
> on types of operations, etc.)

An argument based on only two clients is possibly not sufficient.


Please put explanations in the HLD and supply a new one.


> 
>> * GC thread is wrong mechanism this is what we have commit callbacks for

No GC - end of discussion.

> 
> Well, with callbacks we have to scan through all hash to find data to
> delete on each callback. As alex said there can be about 10K uncommitted
> transactions in high load easily, therefore using callback may become the
> bottlneck. There was discussion recently in devel@ about that originated
> by zam. Although I agree the HLD should be clear about why we choose that
> way and what is wrong with another.
> 
>> * Why not use the DLM, then we can simply keep the client waiting - the
>> mechanism already exists for repack; I am not convinced at all by the
>> reasoning that rep-ack is so different - no real facts are quoted
> 
> Let's estimate how RepACK lock is suitable as dependency tracking
> functionality.

Without better definitions, the arguments below cannot be judged.

> In fact it is more like 'possible dependency prevention'
> mechanism, and block object always because we can't predict the next
> operation, so should keep lock taken for ALL modifying operations. It is
> not 'tracking' but 'prediction' mechanism, it blocks access to the object
> until client will got reply just because the conflicting operation is
> possible but not because it really happen.
> Moreover it conflicts in general with dependency tracking we needed,
> because it will serialize operations even when they may not depend.
> 
> With RepACK lock we are entering in operation AFTER the checks and we
> don't know the result of this check - was there operation from different
> client? are changes committed? Should we do sync or not? RepACK lock
> doesn't answer this question and we can't decide about sync is needed or
> not.
> 
> For example, the client2 will wait for commit or ACK before entering in
> locked area.
> 1) ACK is got but no commit yet. So client2 enter in locked area and now
> should determine was commit done or not. How to do that? This is vital
> because if there was no commit yet then we should do it. We may use
> version of object possible and check it against last_committed, but this
> will work only with VBR.
> So we need extra data per-object like transno.
> 2) Commit was done. We should still do the same as for 1) to be sure about
> was commit done or not because it is not known why lock was unlocked - due
> to ACK or commit.
> 3) But we don't know still is there conflict or not because we should
> check client uuids, but we don't store such info anywhere and waiting on
> lock is not reflected somehow. So we need extra data (or extra information
> from ldlm?) again to store uuid of client who did latest operation on that
> object.
> 
> The only way how dlm can work without any additional data is to unlock
> only when commit. But in that case we don't need COS at all. Each
> conflicting client will wait on lock until previous changes will be
> committed. But this may lead to huge latency for requests, comparing with
> commit interval and it is not what we need.
> 
>> * It is left completely without explanation how the hash table (which I
>> think we don't need/want) is used
> 
> hash table store the following data per object:
> struct lu_dep_info {
>          struct ll_fid     di_object;
>          struct obd_uuid   di_client;
>          __u64             di_transno;
> };
> 
> it contains uuid of client and transno of last change from this client.
> The uuid is compared to determine is there is conflict of not, the transno
> shows was that data committed already or not. I described above why it is
> needed. It is 1.6-related issue because we have only inode of object and
> no any extra structure. The HEAD has lu_object enveloping inodes, and hash
> will not needed, the dependency info may be stored per lu_object.
> 
>> 
>> Regards,
>> 
>> Peter
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-05-31  2:45   ` Andreas Dilger
  2008-05-31  9:37     ` Alex Zhuravlev
@ 2008-06-01  7:03     ` Mikhail Pershin
  2008-06-03 18:41       ` Andreas Dilger
  2008-06-01 16:54     ` Alex Zhuravlev
  2 siblings, 1 reply; 20+ messages in thread
From: Mikhail Pershin @ 2008-06-01  7:03 UTC (permalink / raw)
  To: lustre-devel

On Sat, 31 May 2008 06:45:24 +0400, Andreas Dilger <adilger@sun.com> wrote:

> On May 29, 2008  21:42 +0400, Mike Pershin wrote:
>> The only positive effect of ACK is delay before doing sync, that give us
>> the chance to wait for commit without doing force sync. But that can be
>> done with timer to get the same results.
>
> On a related note - I just came across bug 3621 - sync outstanding
> transaction instead of evicting client when a rep-ack isn't received.
> Could you please address this bug at the same time as COS.  With COS
> this will always happen, of course, but it should also happen without
> it to avoid client eviction if possible.
>

OK

> RepACK is currently needed for recovery.  I don't think it is a false
> conflict in most cases, though I agree in some cases it is.  If MDS
> thread is only e.g. passing through a directory to do some operation
> in a previously-existing subdirectory, or wants to stat a file that
> existed before the conflicting lock was taken then this is a false
> dependency.

RepACK is not needed for recovery if COS is enabled: COS syncs the shared
cases, so there is no need to be sure that the client got the reply and will
replay it, as no other replays depend on it.
There are also the cases of creations, or of unlinks, from different clients
(operations of the same type). These are not actually dependent; the only
dependency here may be create vs unlink or unlink vs create. Currently such
cases are not distinguished and we block access for any operation from a
different client.

>
>> Moreover it conflicts in general with dependency tracking we needed,
>> because it will serialize operations even when they may not depend.
>>
>> With RepACK lock we are entering in operation AFTER the checks and we
>> don't know the result of this check - was there operation from different
>> client? are changes committed? Should we do sync or not? RepACK lock
>> doesn't answer this question and we can't decide about sync is needed or
>> not.
>
> That isn't quite true - if the changes ARE already committed, then the
> lock is no longer needed and dropped by the commit callback.

Indeed. But I was talking about a different thing. When we pass the lock
(enter the locked area) we don't know whether the lock was taken at all. Was
it dropped due to commit or because the ACK was received? So we don't know
whether we should commit, whether the commit was already done, or whether it
is not needed at all. Maybe we could use the uncommitted_replies list to
determine that, but that is not a perfect way either.

>> 3) But we don't know still is there conflict or not because we should
>> check client uuids, but we don't store such info anywhere and waiting on
>> lock is not reflected somehow. So we need extra data (or extra information
>> from ldlm?) again to store uuid of client who did latest operation on that
>> object.
>
> Wouldn't that be in the last_rcvd data for the current client?  If the
> req->rq_export->exp_mds_data->med_mcd->mcd_last_transno is the same as
> the VBR transno on object being modified then we know this client was
> the last one to modify the object and there is no external dependency.
>
But this stops working if last_transno is bigger than the object version. Then
we lose the info about who set that version.
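
The check and its blind spot can be written down as a small sketch. It
assumes, as in the exchange above, that the object version is the transno of
the last change to the object and that transnos grow monotonically; the names
cos_last_writer and cos_check_last_writer are hypothetical and not part of
any Lustre API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes for the last_rcvd-based check. */
enum cos_last_writer {
        COS_SAME_CLIENT,        /* this client made the last change to the object */
        COS_OTHER_CLIENT,       /* the object changed after this client's last transno */
        COS_UNKNOWN             /* client moved on elsewhere; the writer cannot be told */
};

static enum cos_last_writer
cos_check_last_writer(uint64_t client_last_transno, uint64_t obj_version)
{
        if (client_last_transno == obj_version)
                return COS_SAME_CLIENT;
        if (client_last_transno < obj_version)
                return COS_OTHER_CLIENT;
        /* last_transno > version: the client has since modified some other
         * object, so its last_rcvd slot no longer says who set this version. */
        return COS_UNKNOWN;
}
```

The COS_UNKNOWN branch is the case described above: some extra per-object
record of the last writer is needed to resolve it.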

>> hash table store the following data per object:
>> struct lu_dep_info {
>>           struct ll_fid     di_object;
>>           struct obd_uuid   di_client;
>>           __u64             di_transno;
>> };
>>
>> it contains uuid of client and transno of last change from this client.
>> The uuid is compared to determine is there is conflict of not, the transno
>> shows was that data committed already or not. I described above why it is
>> needed. It is 1.6-related issue because we have only inode of object and
>> no any extra structure. The HEAD has lu_object enveloping inodes, and hash
>> will not needed, the dependency info may be stored per lu_object.
>
> I think the commit callbacks should be able to free this data, there
> should never be any such items on an object with di_transno > last_committed.

You mean the moment of commit? As far as I know, a new journal_start() may
start after the last batch is committed but before the commit callbacks are
invoked. So a new dep_info may appear with di_transno > last_committed, and we
cannot free all dep_info at once in the commit callback; we have to
distinguish new from old. It would be good to have some notification from
ldiskfs about the batch boundary, but this is good in theory only.
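
A minimal sketch of what "distinguish new from old" could look like in a
commit callback: drop only the entries whose transno is already covered by
last_committed and keep the ones created after the batch closed. The flat
array, the integer stand-ins for struct ll_fid and struct obd_uuid, and the
name dep_prune_committed are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the quoted struct lu_dep_info. */
struct dep_info {
        uint64_t di_object;     /* stand-in for struct ll_fid */
        uint32_t di_client;     /* stand-in for struct obd_uuid */
        uint64_t di_transno;    /* transno of the last change */
};

/* Compact the table in place, keeping only entries that are still
 * uncommitted (di_transno > last_committed). Returns the new count. */
static size_t dep_prune_committed(struct dep_info *tbl, size_t nr,
                                  uint64_t last_committed)
{
        size_t kept = 0;
        size_t i;

        for (i = 0; i < nr; i++) {
                if (tbl[i].di_transno > last_committed)
                        tbl[kept++] = tbl[i];   /* belongs to a newer batch */
        }
        return kept;
}
```

The callback needs the last_committed value of the batch it is reporting,
not the newest transno handed out, otherwise entries from the next batch
would be freed too early.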

> Also, isn't it enough to store a single such item per object directly
> on the object?  Once we know there is ANY such conflict that is enough
> to invoke COS.  For per-object data this can be stored on 1.6 in the
> i_filterdata structure that we can attach onto every server inode.
It is per-object, yes. And this is very valuable advice about i_filterdata. I
thought we had no access to inode_info from the upper level on the server
side. This will remove the need for the hash entirely and simplify things a
lot.

>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel



-- 
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-05-31  2:45   ` Andreas Dilger
  2008-05-31  9:37     ` Alex Zhuravlev
  2008-06-01  7:03     ` Mikhail Pershin
@ 2008-06-01 16:54     ` Alex Zhuravlev
  2 siblings, 0 replies; 20+ messages in thread
From: Alex Zhuravlev @ 2008-06-01 16:54 UTC (permalink / raw)
  To: lustre-devel

Andreas Dilger wrote:
> I think the commit callbacks should be able to free this data, there
> should never be any such items on an object with di_transno > last_committed.
> Also, isn't it enough to store a single such item per object directly
> on the object?  Once we know there is ANY such conflict that is enough
> to invoke COS.  For per-object data this can be stored on 1.6 in the
> i_filterdata structure that we can attach onto every server inode.

we don't control inode's lifetime. in order to use i_filterdata we'd have
to pin inode. then we'd have to unpin inode. we can't do this from commit
callback as iput() may cause inode delete. so, we'd have to use special
thread - not that nice.

I believe the hash table was originally an intermediate solution for 1.6 and
1.8. in the new server we do have our own objects with all needed callbacks to
control data like i_filterdata.

thanks, Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-05-27 10:44 [Lustre-devel] Commit on share Peter Braam
  2008-05-27 15:43 ` Mikhail Pershin
  2008-05-29 17:42 ` Mikhail Pershin
@ 2008-06-02  8:42 ` Alex Zhuravlev
  2008-06-03 18:50   ` Andreas Dilger
  2008-06-11 14:21 ` Alexander Zarochentsev
  3 siblings, 1 reply; 20+ messages in thread
From: Alex Zhuravlev @ 2008-06-02  8:42 UTC (permalink / raw)
  To: lustre-devel

there was an idea to control recovery by postponing replies.
can we use this idea for COS? instead of an immediate sync
we execute the request, but put the reply on a special queue. then
the reply is sent from the queue when all previous transnos
are committed (for COS w/o VBR). if there are no requests to
be handled, but the reply queue isn't empty, the server does a sync.
for VBR, the rule is a bit more complex - we'll have to track
dependency on a per-object basis.
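
a rough sketch of such a queue for the non-VBR case is below. it is only an
illustration: the names reply_queue, reply_queue_add, reply_queue_release and
reply_queue_needs_sync are hypothetical, replies are represented by their
transno alone, and "all previous transnos are committed" is read as
transno <= last_committed + 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLY_QUEUE_MAX 64

/* FIFO of postponed replies, ordered by transno (transnos start at 1). */
struct reply_queue {
        uint64_t transno[REPLY_QUEUE_MAX];
        size_t   head;
        size_t   len;
};

/* Queue a reply instead of sending it; returns -1 if the queue is full. */
static int reply_queue_add(struct reply_queue *q, uint64_t transno)
{
        if (q->len == REPLY_QUEUE_MAX)
                return -1;
        q->transno[(q->head + q->len) % REPLY_QUEUE_MAX] = transno;
        q->len++;
        return 0;
}

/* Release every reply whose predecessors are all committed.
 * Returns the number of replies that may now be sent. */
static size_t reply_queue_release(struct reply_queue *q, uint64_t last_committed)
{
        size_t sent = 0;

        while (q->len > 0 && q->transno[q->head] <= last_committed + 1) {
                q->head = (q->head + 1) % REPLY_QUEUE_MAX;
                q->len--;
                sent++;
        }
        return sent;
}

/* The server forces a sync only when it is idle and replies are waiting. */
static int reply_queue_needs_sync(const struct reply_queue *q,
                                  size_t pending_requests)
{
        return q->len > 0 && pending_requests == 0;
}
```

reply_queue_release would be called from the commit callback, and
reply_queue_needs_sync from the service thread when it runs out of work.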

thanks, Alex


Peter Braam wrote:
> This HLD is definitely not ready at all.  It is very short, lacks 
> interaction diagrams and the arguments made are not sufficiently detailed.
> 
>     * the second sentence is not right.  Commit should happen before
>       un-committed data coming from a client is shared with a 2nd client.
>     * Is COS dependent on VBR ? no it is not, and can equally apply to
>       normal recovery
>     * Section 3.2 is wrong: the recovery process will not fail with gaps
>       in the sequence when there is VBR.  It only fails if there are
>       gaps in the versions, and this is rare.
>     * 3.3 parallel creations in one directory are protected with
>       different, independent lock resources.  Isn't that sufficient to
>       allow parallel operations with COS?
>     * 3.6 provide a detailed explanation please
>     * GC thread is wrong mechanism this is what we have commit callbacks
>       for
>     * Why not use the DLM, then we can simply keep the client waiting -
>       the mechanism already exists for repack; I am not convinced at all
>       by the reasoning that rep-ack is so different - no real facts are
>       quoted
>     * It is left completely without explanation how the hash table
>       (which I think we don't need/want) is used
> 
> 
> Regards,
> 
> Peter
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-06-01  7:03     ` Mikhail Pershin
@ 2008-06-03 18:41       ` Andreas Dilger
  2008-06-03 18:56         ` Alex Zhuravlev
  0 siblings, 1 reply; 20+ messages in thread
From: Andreas Dilger @ 2008-06-03 18:41 UTC (permalink / raw)
  To: lustre-devel

On Jun 01, 2008  11:03 +0400, Mike Pershin wrote:
> On Sat, 31 May 2008 06:45:24 +0400, Andreas Dilger <adilger@sun.com> wrote:
>> RepACK is currently needed for recovery.  I don't think it is a false
>> conflict in most cases, though I agree in some cases it is.  If MDS
>> thread is only e.g. passing through a directory to do some operation
>> in a previously-existing subdirectory, or wants to stat a file that
>> existed before the conflicting lock was taken then this is a false
>> dependency.
>
> RepACK is not needed for recovery if COS is enabled, because COS will sync 
> the share cases so there is no need to be sure that client got reply and 
> will do replay as there are no dependent replays on it.
>
> Also the cases are creations from different clients or unlinks (operations 
> of same type). They are not dependent actually, the only dependency here 
> may be create vs unlink or unlink vs create. Currently such cases are not 
> distinguished and we block access for any operation from different client.

In 2.0 with per-name-hash locking we will remove almost all such false
dependencies, so I don't know whether the intermediate optimization is
worthwhile or not.

>> Also, isn't it enough to store a single such item per object directly
>> on the object?  Once we know there is ANY such conflict that is enough
>> to invoke COS.  For per-object data this can be stored on 1.6 in the
>> i_filterdata structure that we can attach onto every server inode.
>
> It is per-object, yes. And this is very valuable advice about i_filterdata. 
> I thought we have no access to inode_info from upper level at server side. 
> This will reduce need for hash at all and simplify things a lot.

Well, in 1.6/1.8 the layering isn't so strict, and in HEAD the problem
goes away because of per-object data/locking.  I also don't think Alex's
worry about the i_filterdata lifetime is warranted.  If the inode is
being evicted from cache, then it surely must have been written to disk,
so there is no need to cache the last-modified data at all as COS is
not needed anymore.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-06-02  8:42 ` Alex Zhuravlev
@ 2008-06-03 18:50   ` Andreas Dilger
  2008-06-04  1:11     ` Peter Braam
  2008-06-04 10:50     ` Nikita Danilov
  0 siblings, 2 replies; 20+ messages in thread
From: Andreas Dilger @ 2008-06-03 18:50 UTC (permalink / raw)
  To: lustre-devel

On Jun 02, 2008  12:42 +0400, Alex Zhuravlev wrote:
> there was an idea to control recovery postponing replies.
> can we use this idea for COS? instead of immediate sync
> we execute request, but put reply on a special queue. then
> reply is sent from the queue when all previous transno
> are committed (for COS w/o VBR). if there is no requests to
> be handled, but reply queue isn't empty server does sync.

Yes, this was proposed for the DMU OST, so that it can export
synchronous IO semantics to clients that do not know how to do
bulk IO recovery (i.e. all of them right now) without forcing
the DMU to limit the transaction group size too much.

My proposal was that the OST service threads would perform the IO
in the normal manner until they were ready with the reply, but
instead of waiting in the thread context for the commit the RPC
request (with attached reply) would be put as private data into
a commit callback.  Once the commit callbacks are run (whether
because of transaction size, age, or explicit sync operation)
the RPC requests are put into a queue and sent by one or more
threads to the waiting clients.

This ties in fairly nicely with the network request scheduler, as
it can batch requests from the same or multiple clients in different
ways, and then "unplug" the device (sync the transaction).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-06-03 18:41       ` Andreas Dilger
@ 2008-06-03 18:56         ` Alex Zhuravlev
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Zhuravlev @ 2008-06-03 18:56 UTC (permalink / raw)
  To: lustre-devel

Andreas Dilger wrote:
> Well, in 1.6/1.8 the layering isn't so strict, and in HEAD the problem
> goes away because of per-object data/locking.  I also don't think Alex's
> worry about the i_filterdata lifetime is warranted.  If the inode is
> being evicted from cache, then it surely must have been written to disk,
> so there is no need to cache the last-modified data at all as COS is
> not needed anymore.

we can't destroy dependency data until object is committed, right?
but JBD doesn't work with inodes, it works with buffers only. IOW,
inode can be evicted from the cache while the corresponding buffer is
still to be flushed?

thanks, Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-06-03 18:50   ` Andreas Dilger
@ 2008-06-04  1:11     ` Peter Braam
  2008-06-04 10:50     ` Nikita Danilov
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Braam @ 2008-06-04  1:11 UTC (permalink / raw)
  To: lustre-devel

Yes - this is like a commit scheduler.  Don't jump too quickly to an
implementation plan.

It might be necessary to have good control over what gets committed in what
order, and the dependency issues may only get more difficult and not easier.

But currently the focus should be on COS.  Can COS have a design that
mirrors that of REP-ACKs?  I'm not against committing one RPC later, but I
doubt it really helps in relevant use cases (see my questions about getting
to the bottom of it).

Peter


On 6/4/08 3:50 AM, "Andreas Dilger" <adilger@sun.com> wrote:

> On Jun 02, 2008  12:42 +0400, Alex Zhuravlev wrote:
>> there was an idea to control recovery postponing replies.
>> can we use this idea for COS? instead of immediate sync
>> we execute request, but put reply on a special queue. then
>> reply is sent from the queue when all previous transno
>> are committed (for COS w/o VBR). if there is no requests to
>> be handled, but reply queue isn't empty server does sync.
> 
> Yes, this was proposed for the DMU OST, so that it can export
> synchronous IO semantics to clients that do not know how to do
> bulk IO recovery (i.e. all of them right now) without forcing
> the DMU to limit the transaction group size too much.
> 
> My proposal was that the OST service threads would perform the IO
> in the normal manner until they were ready with the reply, but
> instead of waiting in the thread context for the commit the RPC
> request (with attached reply) would be put as private data into
> a commit callback.  Once the commit callbacks are run (whether
> because of transaction size, age, or explicit sync operation)
> the RPC requests are put into a queue and sent by one or more
> threads to the waiting clients.
> 
> This ties in fairly nicely with the network request scheduler, as
> it can batch requests from the same or multiple clients in different
> ways, and then "unplug" the device (sync the transaction).
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-06-03 18:50   ` Andreas Dilger
  2008-06-04  1:11     ` Peter Braam
@ 2008-06-04 10:50     ` Nikita Danilov
  1 sibling, 0 replies; 20+ messages in thread
From: Nikita Danilov @ 2008-06-04 10:50 UTC (permalink / raw)
  To: lustre-devel

Andreas Dilger writes:
 > On Jun 02, 2008  12:42 +0400, Alex Zhuravlev wrote:
 > > there was an idea to control recovery postponing replies.
 > > can we use this idea for COS? instead of immediate sync
 > > we execute request, but put reply on a special queue. then
 > > reply is sent from the queue when all previous transno
 > > are committed (for COS w/o VBR). if there is no requests to
 > > be handled, but reply queue isn't empty server does sync.
 > 
 > Yes, this was proposed for the DMU OST, so that it can export
 > synchronous IO semantics to clients that do not know how to do
 > bulk IO recovery (i.e. all of them right now) without forcing
 > the DMU to limit the transaction group size too much.

This mechanism would be useful for other purposes too. For example, SOM
sometimes wants to reply only when a certain transaction has committed and
currently this is implemented through an explicit sync. It seems that it
is generally better to make server IO as asynchronous as possible,
because this increases server throughput (even if at the expense of
individual request latency).

 > 
 > My proposal was that the OST service threads would perform the IO
 > in the normal manner until they were ready with the reply, but
 > instead of waiting in the thread context for the commit the RPC
 > request (with attached reply) would be put as private data into
 > a commit callback.  Once the commit callbacks are run (whether
 > because of transaction size, age, or explicit sync operation)
 > the RPC requests are put into a queue and sent by one or more
 > threads to the waiting clients.
 > 
 > This ties in fairly nicely with the network request scheduler, as
 > it can batch requests from the same or multiple clients in different
 > ways, and then "unplug" the device (sync the transaction).
 > 
 > Cheers, Andreas

Nikita.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-05-27 10:44 [Lustre-devel] Commit on share Peter Braam
                   ` (2 preceding siblings ...)
  2008-06-02  8:42 ` Alex Zhuravlev
@ 2008-06-11 14:21 ` Alexander Zarochentsev
  2008-06-11 14:35   ` Alex Zhuravlev
  2008-06-11 15:26   ` Peter Braam
  3 siblings, 2 replies; 20+ messages in thread
From: Alexander Zarochentsev @ 2008-06-11 14:21 UTC (permalink / raw)
  To: lustre-devel

Hello,

On 27 May 2008 14:44:18 Peter Braam wrote:
> This HLD is definitely not ready at all.  It is very short, lacks
> interaction diagrams and the arguments made are not sufficiently
> detailed.

is the following definition of dependent operation more clear?

Operation B depends on operation A if:

1. A and B modify the same object
2. B modifies the object after A (LDLM serializes object access) 
3. A isn't committed yet
4. A and B are issued by different clients
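
The four conditions can be written as a predicate. This is only a sketch of
the definition exactly as listed, so it covers modifying operations only; the
struct cos_op layout and the name cos_depends are hypothetical, and ordering
is modelled by comparing transnos.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of one modifying operation. */
struct cos_op {
        uint64_t object;        /* object (or directory hash part) being modified */
        uint32_t client;        /* issuing client */
        uint64_t transno;       /* server-assigned transaction number */
};

/* Returns 1 if operation b depends on operation a, 0 otherwise. */
static int cos_depends(const struct cos_op *a, const struct cos_op *b,
                       uint64_t last_committed)
{
        return a->object == b->object &&        /* 1. same object */
               a->transno < b->transno &&       /* 2. b modifies after a */
               a->transno > last_committed &&   /* 3. a is not committed yet */
               a->client != b->client;          /* 4. different clients */
}
```

When cos_depends() is true for the incoming operation, the server would have
to commit (or wait for the commit of) operation A.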

> * the second sentence is not right.  Commit should happen before
> un-committed data coming from a client is shared with a 2nd client.
> * Is COS dependent on VBR ? no it is not, and can equally apply to
> normal recovery
> * Section 3.2 is wrong: the recovery process will not fail with gaps
> in the sequence when there is VBR.  It only fails if there are gaps
> in the versions, and this is rare.
> * 3.3 parallel creations in one directory are protected with
> different, independent lock resources.  Isn't that sufficient to
> allow parallel operations with COS?

The objects in the definition of dependent operation can be those parts 
of the directory identified by hash.

> * 3.6 provide a detailed explanation please
> * GC thread is wrong mechanism this is what we have commit callbacks for
> * Why not use the DLM, then we can simply keep the client waiting - the
> mechanism already exists for repack;

CoS is just an improved version of rep-ack, using persistent storage 
instead of client replay queue?

> I am not convinced at all by the reasoning that rep-ack is so different -
> no real facts are quoted
> * It is left completely without explanation how the hash table
> (which I think we don't need/want) is used
>
> Regards,
>
> Peter

Thanks,
-- 
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-06-11 14:21 ` Alexander Zarochentsev
@ 2008-06-11 14:35   ` Alex Zhuravlev
  2008-06-11 15:26   ` Peter Braam
  1 sibling, 0 replies; 20+ messages in thread
From: Alex Zhuravlev @ 2008-06-11 14:35 UTC (permalink / raw)
  To: lustre-devel

Alexander Zarochentsev wrote:
>> * 3.6 provide a detailed explanation please
>> * GC thread is wrong mechanism this is what we have commit callbacks for
>> * Why not use the DLM, then we can simply keep the client waiting - the
>> mechanism already exists for repack;
> 
> CoS is just an improved version of rep-ack, using persistent storage 
> instead of client replay queue?

AFAIU, there is another difference - rep-ack doesn't need to care about
"same client" optimization as usually ACK is received before next request
(or at least very soon after). so, cost of ACK for single client is very
small. in contrast the cost of sync is very high, thus we want this "same
client" optimization which can't be implemented without some changes to
LDLM, I think.

thanks, Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Lustre-devel] Commit on share
  2008-06-11 14:21 ` Alexander Zarochentsev
  2008-06-11 14:35   ` Alex Zhuravlev
@ 2008-06-11 15:26   ` Peter Braam
  2008-06-11 16:27     ` Alex Zhuravlev
  2008-06-11 16:46     ` Alexander Zarochentsev
  1 sibling, 2 replies; 20+ messages in thread
From: Peter Braam @ 2008-06-11 15:26 UTC (permalink / raw)
  To: lustre-devel




On 6/11/08 8:21 AM, "Alexander Zarochentsev"
<Alexander.Zarochentsev@Sun.COM> wrote:

> Hello,
> 
> On 27 May 2008 14:44:18 Peter Braam wrote:
>> This HLD is definitely not ready at all.  It is very short, lacks
>> interaction diagrams and the arguments made are not sufficiently
>> detailed.
> 
> is the following definition of dependent operation more clear?
> 
> Operation B depends on operation A if:
> 
> 1. A and B modify the same object
> 2. B modifies the object after A (LDLM serializes object access)
> 3. A isn't committed yet
> 4. A and B are issued by different clients

This shows a fundamental mistake, and one I was afraid of.  If a second
client only reads uncommitted data there is already a dependency.

You need to read the database literature - this is standard stuff, here is a
link to a great book:

http://research.microsoft.com/~philbe/ccontrol/

Peter


> 
>> * the second sentence is not right.  Commit should happen before
>> un-committed data coming from a client is shared with a 2nd client.
>> * Is COS dependent on VBR ? no it is not, and can equally apply to
>> normal recovery
>> * Section 3.2 is wrong: the recovery process will not fail with gaps
>> in the sequence when there is VBR.  It only fails if there are gaps
>> in the versions, and this is rare.
>> * 3.3 parallel creations in one directory are protected with
>> different, independent lock resources.  Isn't that sufficient to
>> allow parallel operations with COS?
> 
> The objects in the definition of dependent operation can be those parts
> of the directory identified by hash.
> 
>> * 3.6 provide a detailed explanation please
>> * GC thread is wrong mechanism this is what we have commit callbacks
>>   for
>> * Why not use the DLM, then we can simply keep the client waiting
>>   - the mechanism already exists for repack;
> 
> CoS is just an improved version of rep-ack, using persistent storage
> instead of client replay queue?
> 
>> I am not convinced at all
>> by the reasoning that rep-ack is so different - no real facts are
>> quoted
>> * It is left completely without explanation how the hash table
>>   (which I think we don't need/want) is used
>> 
>> Regards,
>> 
>> Peter
> 
> Thanks,


* [Lustre-devel] Commit on share
  2008-06-11 15:26   ` Peter Braam
@ 2008-06-11 16:27     ` Alex Zhuravlev
  2008-06-11 16:28       ` Peter Braam
  2008-06-11 16:46     ` Alexander Zarochentsev
  1 sibling, 1 reply; 20+ messages in thread
From: Alex Zhuravlev @ 2008-06-11 16:27 UTC (permalink / raw)
  To: lustre-devel

Peter Braam wrote:
> This shows a fundamental mistake, and one I was afraid of.  If a second
> client only reads uncommitted data there is already a dependency.

actually we've discussed this many times and this is known property.

thanks, Alex


* [Lustre-devel] Commit on share
  2008-06-11 16:27     ` Alex Zhuravlev
@ 2008-06-11 16:28       ` Peter Braam
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Braam @ 2008-06-11 16:28 UTC (permalink / raw)
  To: lustre-devel

I don't know what a known property is but the definition given is not
correct.


On 6/11/08 10:27 AM, "Alex Zhuravlev" <Alex.Zhuravlev@Sun.COM> wrote:

> Peter Braam wrote:
>> This shows a fundamental mistake, and one I was afraid of.  If a second
>> client only reads uncommitted data there is already a dependency.
> 
> actually we've discussed this many times and this is known property.
> 
> thanks, Alex
> 


* [Lustre-devel] Commit on share
  2008-06-11 15:26   ` Peter Braam
  2008-06-11 16:27     ` Alex Zhuravlev
@ 2008-06-11 16:46     ` Alexander Zarochentsev
  1 sibling, 0 replies; 20+ messages in thread
From: Alexander Zarochentsev @ 2008-06-11 16:46 UTC (permalink / raw)
  To: lustre-devel

On 11 June 2008 19:26:19 Peter Braam wrote:
> On 6/11/08 8:21 AM, "Alexander Zarochentsev"
>
> <Alexander.Zarochentsev@Sun.COM> wrote:
> > Hello,
> >
> > On 27 May 2008 14:44:18 Peter Braam wrote:
> >> This HLD is definitely not ready at all.  It is very short, lacks
> >> interaction diagrams and the arguments made are not sufficiently
> >> detailed.
> >
> > is the following definition of dependent operation more clear?
> >
> > Operation B depends on operation A if:
> >
> > 1. A and B modify the same object
> > 2. B modifies the object after A (LDLM serializes object access)
> > 3. A isn't committed yet
> > 4. A and B are issued by different clients
>
> This shows a fundamental mistake, and one I was afraid of.  If a
> second client only reads uncommitted data there is already a
> dependency.

The definition above was written that way intentionally, to fit the
requirements from the arch page, one of which is
".. avoid non-recoverable requests". A non-recoverable request is a
request which cannot be replayed due to an object version / request
version mismatch. CoS doesn't care about requests which are not
replayable and the definition reflects that.

Well, I understand now that you want more than an optimization to VBR,
but I had to explain the mistake.

s/modify/access/ :

1. A and B access the same object, A is a write access.
2. B accesses the object after A (LDLM serializes object access)
3. A isn't committed yet
4. A and B are issued by different clients

A question: the definition still counts parallel file creations as
dependent operations, although they can actually be replayed
independently. Is the definition OK for CoS?

Or we can add (at the cost of implementation complexity):
5. B depends on the result of A: file creation and readdir, creation and
deletion of the same file, and so on.
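
As a rough illustration of the revised four-point definition (with
s/modify/access/ applied), the predicate can be sketched in Python as
follows. The Op record and its field names are hypothetical, introduced
only for this example; "seq" stands in for the order in which LDLM
serialized the accesses, and the optional point 5 is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    client: str              # issuing client (hypothetical identifier)
    obj: str                 # object accessed, e.g. a directory hash bucket
    is_write: bool           # True for a modification, False for a read
    seq: int                 # order in which the access was serialized
    committed: bool = False  # True once the operation is on stable storage


def depends_on(b: Op, a: Op) -> bool:
    """Return True if operation b depends on operation a (points 1-4)."""
    return (
        a.obj == b.obj and a.is_write  # 1. same object, A is a write access
        and b.seq > a.seq              # 2. B accesses the object after A
        and not a.committed            # 3. A isn't committed yet
        and a.client != b.client       # 4. issued by different clients
    )


create = Op(client="c1", obj="dir/bucket7", is_write=True, seq=1)
readdir = Op(client="c2", obj="dir/bucket7", is_write=False, seq=2)
print(depends_on(readdir, create))
```

Under the earlier "modify" wording the readdir above would not count,
because B would also have to be a write; with "access" it does.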

> You need to read the database literature - this is standard stuff,
> here is a link to a great book:
>
> http://research.microsoft.com/~philbe/ccontrol/

> Peter

-- 
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems



Thread overview: 20+ messages
2008-05-27 10:44 [Lustre-devel] Commit on share Peter Braam
2008-05-27 15:43 ` Mikhail Pershin
2008-06-01  5:00   ` Peter Braam
2008-05-29 17:42 ` Mikhail Pershin
2008-05-31  2:45   ` Andreas Dilger
2008-05-31  9:37     ` Alex Zhuravlev
2008-06-01  7:03     ` Mikhail Pershin
2008-06-03 18:41       ` Andreas Dilger
2008-06-03 18:56         ` Alex Zhuravlev
2008-06-01 16:54     ` Alex Zhuravlev
2008-06-02  8:42 ` Alex Zhuravlev
2008-06-03 18:50   ` Andreas Dilger
2008-06-04  1:11     ` Peter Braam
2008-06-04 10:50     ` Nikita Danilov
2008-06-11 14:21 ` Alexander Zarochentsev
2008-06-11 14:35   ` Alex Zhuravlev
2008-06-11 15:26   ` Peter Braam
2008-06-11 16:27     ` Alex Zhuravlev
2008-06-11 16:28       ` Peter Braam
2008-06-11 16:46     ` Alexander Zarochentsev
