* [Lustre-devel] SOM Recovery of open files
@ 2009-01-30 23:32 Andreas Dilger
  2009-01-31  0:51 ` Oleg Drokin
  0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2009-01-30 23:32 UTC (permalink / raw)
  To: lustre-devel

Vitaly Fertman wrote:
> Oleg told me yesterday about one feature which seems to destroy the
> SOM completely.  If a client is evicted and re-connects, we do not
> re-open files so that client thinks files are opened, whereas MDS
> thinks they are closed.

Right.  This issue has been around for a long time.  There is bug 971
dealing with this issue, about changing open file recovery to work by
generating new "open file" requests instead of saving the RPCs and
handling it at the ptlrpc level.  This is (AFAIK) being done for the
simplified interoperability fixes already.

> Thus MDS has no control over opened files, whereas clients may write
> to them.  To fix this we need at least to disable the file modification
> on clients until files are re-opened.

This is also going to be handled by the LOV EA lock that CEA is working
on for HSM and migration.  If the client is evicted from the MDS it will
have the LOV EA lock cancelled, and all IO will block until a new LOV EA
lock is gotten.

> The re-opening itself could be done by application or by us.  In the
> latter case, the recovery mechanism is involved...

This is definitely not an application-level problem, it needs to be
fixed within Lustre.

> it was missed for the recovery, but it is a problem for interoperability
> as well. I remember Eric said that we will evict clients on downgrade
> and he said therefore all the files get closed. however, it seems it
> is not for clients unless we do some extra actions.

Even on upgrade, simplified interoperability will now have the server
requesting that all clients flush their state before the server is shut
down, so that the amount of interoperability needed is minimal.  The only
state that a client cannot completely remove is the open file handles,
so the "replay" of file open will now be driven by the file handles
themselves instead of the "saved RPC" mechanism we use today.  That would
also avoid bugs like 3632, 3633, etc.
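
As a rough illustration of handle-driven recovery, here is a minimal Python sketch. It is a toy model, not Lustre code: the `Mds` and `Client` classes and the `reopen_all` method are hypothetical names. The client rebuilds its open state on the MDS from its own table of open handles instead of replaying saved RPCs.

```python
from dataclasses import dataclass, field


@dataclass
class Mds:
    """Toy metadata server: tracks which (client, path) pairs are open."""
    open_files: set = field(default_factory=set)

    def open(self, client_id, path):
        self.open_files.add((client_id, path))

    def evict(self, client_id):
        # Eviction drops every open handle the MDS held for this client.
        self.open_files = {(c, p) for (c, p) in self.open_files if c != client_id}


@dataclass
class Client:
    client_id: str
    handles: list = field(default_factory=list)  # paths the client believes are open

    def open(self, mds, path):
        mds.open(self.client_id, path)
        self.handles.append(path)

    def reopen_all(self, mds):
        # Hypothetical handle-driven recovery: generate a fresh open request
        # from each local handle instead of replaying a saved RPC.
        for path in self.handles:
            mds.open(self.client_id, path)


mds = Mds()
client = Client("c1")
client.open(mds, "/a")
client.open(mds, "/b")

mds.evict("c1")
# After eviction the client still believes both files are open.
stale = [p for p in client.handles if ("c1", p) not in mds.open_files]

client.reopen_all(mds)
recovered = all(("c1", p) in mds.open_files for p in client.handles)
```

Between `evict` and `reopen_all` the two sides disagree about which files are open, which is the window this thread is concerned with.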


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

* [Lustre-devel] SOM Recovery of open files
  2009-01-30 23:32 [Lustre-devel] SOM Recovery of open files Andreas Dilger
@ 2009-01-31  0:51 ` Oleg Drokin
  2009-02-01 14:45   ` Vitaly Fertman
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Drokin @ 2009-01-31  0:51 UTC (permalink / raw)
  To: lustre-devel

Hello!

On Jan 30, 2009, at 6:32 PM, Andreas Dilger wrote:

> Vitaly Fertman wrote:
>> Oleg told me yesterday about one feature which seems to destroy the
>> SOM completely.  If a client is evicted and re-connects, we do not
>> re-open files so that client thinks files are opened, whereas MDS
>> thinks they are closed.
> Right.  This issue has been around for a long time.  There is bug 971
> dealing with this issue, about changing open file recovery to work by
> generating new "open file" requests instead of saving the RPCs and
> handling it at the ptlrpc level.  This is (AFAIK) being done for the
> simplified interoperability fixes already.

But the problem is that the client might be evicted before such a command
is issued, and knowledge of this client would disappear from the MDS (but
not from the OST, where it is still connected).

>> Thus MDS has no control over opened files, whereas clients may write
>> to them.  To fix this we need at least to disable the file
>> modification on clients until files are re-opened.
> This is also going to be handled by the LOV EA lock that CEA is
> working on for HSM and migration.  If the client is evicted from the
> MDS it will have the LOV EA lock cancelled, and all IO will block
> until a new LOV EA lock is gotten.

The LOV EA lock won't help.  It does not prevent (with the current
design, anyway) the flush of dirty data from the client cache; only new
writes would be blocked.  Even then, since there is no reopen when
obtaining the EA lock, the MDS would still have no idea that there is an
open file handle somewhere.
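
To make the distinction concrete, here is a toy Python sketch. All names (`ToyClient`, `LayoutLockLost`) are hypothetical, and raising an exception stands in for a write that would really block. It only models the point above: new writes are gated on the lock, the flush of already-dirty data is not.

```python
class LayoutLockLost(Exception):
    """Raised when a new write is attempted without the LOV EA lock."""


class ToyClient:
    def __init__(self):
        self.has_layout_lock = True
        self.dirty_pages = []   # data cached on the client
        self.ost_object = []    # stands in for data already on the OST

    def write(self, data):
        # New writes are gated on the lock.  A real client would block
        # here; the sketch raises to keep the control flow visible.
        if not self.has_layout_lock:
            raise LayoutLockLost("new write needs the LOV EA lock")
        self.dirty_pages.append(data)

    def flush(self):
        # The flush path is NOT gated on the lock in this model, so
        # previously cached data still reaches the OST.
        self.ost_object.extend(self.dirty_pages)
        self.dirty_pages.clear()


client = ToyClient()
client.write(b"cached before eviction")

client.has_layout_lock = False  # MDS evicts the client; the lock is cancelled

try:
    client.write(b"new write")
    new_write_blocked = False
except LayoutLockLost:
    new_write_blocked = True

client.flush()  # the dirty data is still written out
```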

>> The re-opening itself could be done by application or by us.  In the
>> latter case, the recovery mechanism is involved...
> This is definitely not an application-level problem, it needs to be
> fixed within Lustre.

Right. But there is no straightforward fix. It is not going to be easy
to reopen a file after eviction. Of course we can just invalidate
local fd, so that the app will start to get something like ESTALE,
but this approach is also not very desirable.

>> it was missed for the recovery, but it is a problem for
>> interoperability as well. I remember Eric said that we will evict
>> clients on downgrade and he said therefore all the files get closed.
>> however, it seems it is not for clients unless we do some extra
>> actions.
> Even on upgrade, simplified interoperability will now have the server
> requesting that all clients flush their state before the server is
> shut down, so that the amount of interoperability needed is minimal.
> The only

Except in this case the client is evicted from e.g. MDS, so it does not
participate in recovery anyway.

Bye,
     Oleg

* [Lustre-devel] SOM Recovery of open files
  2009-01-31  0:51 ` Oleg Drokin
@ 2009-02-01 14:45   ` Vitaly Fertman
  2009-02-01 17:24     ` Vitaly Fertman
  0 siblings, 1 reply; 9+ messages in thread
From: Vitaly Fertman @ 2009-02-01 14:45 UTC (permalink / raw)
  To: lustre-devel

On Jan 31, 2009, at 3:51 AM, Oleg Drokin wrote:

> Hello!
>
> On Jan 30, 2009, at 6:32 PM, Andreas Dilger wrote:
>
>> Vitaly Fertman wrote:
>>> Oleg told me yesterday about one feature which seems to destroy the
>>> SOM completely.  If a client is evicted and re-connects, we do not
>>> re-open files so that client thinks files are opened, whereas MDS
>>> thinks they are closed.
>> Right.  This issue has been around for a long time.  There is bug 971
>> dealing with this issue, about changing open file recovery to work by
>> generating new "open file" requests instead of saving the RPCs and
>> handling it at the ptlrpc level.  This is (AFAIK) being done for the
>> simplified interoperability fixes already.
>
> But the problem is client might be evicted before such command is
> issued and a knowledge about this system would disappear from MDS
> (but not from OST where it is still connected).

Right.  Besides that, the problem exists even without interoperability
being involved, i.e. when the MDS does not even reboot and only an
eviction happens.

>>> Thus MDS has no control over opened files, whereas clients may write
>>> to them.  To fix this we need at least to disable the file
>>> modification on clients until files are re-opened.
>> This is also going to be handled by the LOV EA lock that CEA is
>> working on for HSM and migration.  If the client is evicted from the
>> MDS it will have the LOV EA lock cancelled, and all IO will block
>> until a new LOV EA lock is gotten.
>
> LOV EA lock won't help. It does not prevent (with current design,
> anyway) dirty data flush from client cache, only new writes would be
> not possible.
> Even then since there is no reopen when obtaining EA lock, MDS would
> still have no idea there is an open file handle somewhere.

The dirty cache existing on the client is not such a big problem for SOM.
First of all, client eviction leads to the files being closed on the MDS,
at which point the MDS removes the SOM cache.

Besides that, if MDS failover happens, then during MDS-OST
synchronization the OST may ask the clients to flush their data and tell
the MDS about the existing llog record, so the MDS will be able to clean
the SOM cache as well.

Once the MDS wants to get the SOM cache again and sees that it does not
exist, it asks a client to gather attributes under extent locks, forcing
other clients to flush their data to the OST.

Thus the only problem here is a stale fh on a client, which may let the
client write to the file after the SOM cache has been re-obtained on the
MDS.  This problem consists of 2 parts:

- the ability of a client to write to an opened file without a
  connection to the MDS;
- the absence of file re-opening on re-connection.

>>> The re-opening itself could be done by application or by us.  In the
>>> latter case, the recovery mechanism is involved...
>> This is definitely not an application-level problem, it needs to be
>> fixed within Lustre.
>
> Right. But there is no straightforward fix. It is not going to be easy
> to reopen a file after eviction. Of course we can just invalidate
> local fd, so that the app will start to get something like ESTALE,
> but this approach is also not very desirable.
>
>>> it was missed for the recovery, but it is a problem for
>>> interoperability as well. I remember Eric said that we will evict
>>> clients on downgrade and he said therefore all the files get closed.
>>> however, it seems it is not for clients unless we do some extra
>>> actions.
>> Even on upgrade, simplified interoperability will now have the server
>> requesting that all clients flush their state before the server is
>> shut down, so that the amount of interoperability needed is minimal.
>> The only state that a client cannot completely remove is the open
>> file handles,

the only state needed for SOM ;)
IIRC, what was discussed in Beijing was failover for upgrade and
eviction of all clients for downgrade.  Failover is not a problem here,
as opens will merely be replayed, but eviction is.

>> so the "replay" of file open will now be driven by the file handles
>> themselves instead of the "saved RPC" mechanism we use today.

hopefully not for replay only, but for the re-connection as well.

> Except in this case the client is evicted from e.g. MDS, so it does  
> not
> participate in recovery anyway.

right.

--
Vitaly

* [Lustre-devel] SOM Recovery of open files
  2009-02-01 14:45   ` Vitaly Fertman
@ 2009-02-01 17:24     ` Vitaly Fertman
  2009-02-21  0:21       ` Andreas Dilger
  0 siblings, 1 reply; 9+ messages in thread
From: Vitaly Fertman @ 2009-02-01 17:24 UTC (permalink / raw)
  To: lustre-devel

On Feb 1, 2009, at 5:45 PM, Vitaly Fertman wrote:

> thus the only problem here is a stale fh on a client which may let
> the client to write to the file after the SOM cache will be
> re-obtained on MDS, which consists of 2 parts:
>
> - an ability of a client to write to an opened file without a
>   connection to MDS;
> - an absence of file re-opening on re-connection.

I forgot to mention truncate (locked & lockless) and lockless IO.

The MDS must be aware of the opened IOEpoch for truncate as well;
otherwise obd_punches must be blocked.  The situation is pretty rare, as
we do not cache punches on clients and they go away right after
md_setattr completes.  But consider what happens if, at the time of the
client's eviction from the MDS, the connection between this client and
an OST is unstable, so that punches hang in the re-send list for a
while, long enough for another client to modify the file: the MDS gets
a new SOM cache, and the later punch then modifies the file.

The same goes for lockless IO.

The locked truncate is affected too, as it could hang in the re-send
list together with the lock enqueue, so that enqueue+punch would happen
after the MDS re-validates the SOM cache.

Thus:
- block truncate and lockless IO;
- "re-open" truncate on re-connection as well as regularly opened files.

This must happen even if SOM is disabled but the client already supports
it (clients are upgraded first).  Otherwise, interoperability will be
broken.
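
The ordering of events in this scenario can be written out as a small Python timeline. This is only a sketch of the hypothesized ordering, not a claim that it can occur in practice. The sizes and variable names are made up for illustration, and a punch is modelled simply as shrinking the object.

```python
# Toy timeline of the race; sizes are in bytes, all names are hypothetical.
ost_size = 100        # object size on the OST
mds_som_size = 100    # size cached on the MDS (SOM cache); None = no cache

# 1. Client 1 starts a truncate to 10 bytes, but the punch to the OST is
#    stuck in the re-send list because the OST connection is unstable.
pending_punch_size = 10

# 2. Client 1 is evicted from the MDS; the MDS closes its files and drops
#    the SOM cache.
mds_som_size = None

# 3. Client 2 modifies the file (here: it grows), and the MDS then
#    obtains a new SOM cache from the OST.
ost_size = 200
mds_som_size = ost_size

# 4. The delayed punch from client 1 finally reaches the OST.
ost_size = min(ost_size, pending_punch_size)

# The size cached on the MDS no longer matches the object on the OST.
som_cache_is_stale = mds_som_size != ost_size
```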

--
Vitaly

* [Lustre-devel] SOM Recovery of open files
  2009-02-01 17:24     ` Vitaly Fertman
@ 2009-02-21  0:21       ` Andreas Dilger
  2009-02-23 14:56         ` Eric Barton
       [not found]         ` <49B16509.9000409@sun.com>
  0 siblings, 2 replies; 9+ messages in thread
From: Andreas Dilger @ 2009-02-21  0:21 UTC (permalink / raw)
  To: lustre-devel

On Feb 01, 2009  20:24 +0300, Vitaly Fertman wrote:
> On Feb 1, 2009, at 5:45 PM, Vitaly Fertman wrote:
>> thus the only problem here is a stale fh on a client which may let the 
>> client to write to the file after the SOM cache will be re-obtained on
>> MDS, which consists of 2 parts:
>>
>> - an ability of a client to write to an opened file without a  
>>   connection to MDS;

With the layout lock this would not be possible.  The client would be
required to have the layout lock (hence be connected to the MDS) in
order to generate a new write.

>> - an absence of file re-opening on re-connection.
>
> I forgot to mention about truncate (locked & lockless) and lockless IO.
>
> MDS must be aware about opened IOEpoch for truncate as well, otherwise
> obd_punches must be blocked. The situation is pretty rare as we do not
> cache punches on clients and they go away right md_setattr completes,
> but I think what if at the time of the client eviction from MDS, the  
> connection between this client and an OST is unstable so that punches
> will hang in the re-send list for a while, enough for another client
> to modify the file  

If a second client is trying to modify the file while the first one is
having OST connection problems, then the first client would either
succeed in flushing its cache, or be evicted by the OST, before the
second client can get the extent locks needed to truncate the file.

The same is true whether the truncate is from a remote client (with
client lock) or a lockless truncate (OST holds lock).

> MDS gets a new SOM cache, and later punch will modify the file.
>
> The same for lockless IO.
>
> The locked truncate is involved as it could hang in the re-send list
> with the lock enqueue, so that enqueue+punch will happen after MDS
> re-validates SOM cache.

In this case the client will not even begin to send the truncate RPC
until the lock enqueue has succeeded.

> Thus:
> - block truncate and lockless IO;
> - "re-open" truncate on re-connection as well as regularly opened files.
>
> This must happen even if SOM is disabled but the client already supports 
> it (clients are upgraded first). Otherwise, the interoperability will be  
> broken.

It isn't clear to me why the done_writing RPC needs to be sent separately
for each truncate?  The client is already sending an RPC to the MDS for
each truncate to update the size there, if file is not open (and currently
has no objects), and to verify file write permission (avoid truncate of
in-use executables).

Now, if this only happens on recovery I don't have a huge objection.  If
the "done_writing" RPC needs to be sent to the MDS for every single truncate,
then that is a major performance concern.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

* [Lustre-devel] SOM Recovery of open files
  2009-02-21  0:21       ` Andreas Dilger
@ 2009-02-23 14:56         ` Eric Barton
       [not found]         ` <49B16509.9000409@sun.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Barton @ 2009-02-23 14:56 UTC (permalink / raw)
  To: lustre-devel

Please also consider the security implications.  Can all client
actions be checked without extra message passing?  Are any
special capabilities required?  To what extent must clients
be trusted?  What will go wrong if this trust is abused, etc.?

    Cheers,
              Eric

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Andreas
> Dilger
> Sent: 21 February 2009 12:21 AM
> To: Vitaly Fertman
> Cc: Oleg Drokin; Lustre Development Mailing List
> Subject: Re: [Lustre-devel] SOM Recovery of open files
> 
> On Feb 01, 2009  20:24 +0300, Vitaly Fertman wrote:
> > On Feb 1, 2009, at 5:45 PM, Vitaly Fertman wrote:
> >> thus the only problem here is a stale fh on a client which may let the
> >> client to write to the file after the SOM cache will be re-obtained on
> >> MDS, which consists of 2 parts:
> >>
> >> - an ability of a client to write to an opened file without a
> >>   connection to MDS;
> 
> With the layout lock this would not be possible.  The client would be
> required to have the layout lock (hence be connected to the MDS) in
> order to generate a new write.
> 
> >> - an absence of file re-opening on re-connection.
> >
> > I forgot to mention about truncate (locked & lockless) and lockless IO.
> >
> > MDS must be aware about opened IOEpoch for truncate as well, otherwise
> > obd_punches must be blocked. The situation is pretty rare as we do not
> > cache punches on clients and they go away right md_setattr completes,
> > but I think what if at the time of the client eviction from MDS, the
> > connection between this client and an OST is unstable so that punches
> > will hang in the re-send list for a while, enough for another client
> > to modify the file
> 
> If a second client is trying to modify the file while the first one is
> having OST connection problems, then the first client would either
> succeed to flush its cache, or be evicted by the OST before the second
> client can get the extent locks needed to truncate the file.
> 
> The same is true whether the truncate is from a remote client (with
> client lock) or a lockless truncate (OST holds lock).
> 
> > MDS gets a new SOM cache, and later punch will modify the file.
> >
> > The same for lockless IO.
> >
> > The locked truncate is involved as it could hang in the re-send list
> > with the lock enqueue, so that enqueue+punch will happen after MDS
> > re-validates SOM cache.
> 
> In this case the client will not even begin to send the truncate RPC
> until the lock enqueue has succeeded.
> 
> > Thus:
> > - block truncate and lockless IO;
> > - "re-open" truncate on re-connection as well as regularly opened files.
> >
> > This must happen even if SOM is disabled but the client already supports
> > it (clients are upgraded first). Otherwise, the interoperability will be
> > broken.
> 
> It isn't clear to me why the done_writing RPC needs to be sent separately
> for each truncate?  The client is already sending an RPC to the MDS for
> each truncate to update the size there, if file is not open (and currently
> has no objects), and to verify file write permission (avoid truncate of
> in-use executables).
> 
> Now, if this only happens on recovery I don't have a huge objection.  If
> the "done_writing" RPC needs to be sent to the MDS for every single truncate,
> then that is a major performance concern.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

* [Lustre-devel] layout lock / extent lock interaction
       [not found]         ` <49B16509.9000409@sun.com>
@ 2009-03-06 19:16           ` Andreas Dilger
  2009-03-06 22:59             ` Nathaniel Rutman
  0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2009-03-06 19:16 UTC (permalink / raw)
  To: lustre-devel

On Mar 06, 2009  10:01 -0800, Nathaniel Rutman wrote:
> I think we need to explicitly list the extent / layout lock interactions  
> so we don't miss anything in the implementation:
> 1. Create
>
>    * MDT generates new layout lock at open
>    * client gets Common Reader layout lock
>    * client can get new extents read/write locks as long it holds CR
>      layout lock
>
> 2. Layout change
>
>    * MDT takes PW layout lock, revoking all client CR locks
>    * in parallel, MDT takes PW lock on all extents on all OSTs for this
>      file
>    * Clients drop layout lock and requeue
>    * Clients flush cache and drop their extent locks
>    * MDT changes layout
>    * MDT releases layout lock and extents locks
>    * Clients get CR layout lock and can now requeue their extent locks
>
> 3. Client / MDT network partition
>
>    * client can continue reading/writing to currently held extents
>    * when client determines it has been disconnected from MDT it drops
>      layout lock
>    * client can't get new extent locks, but can continue writing to
>      currently held extents
>    * if MDT changes layout, it first PW locks all extents, causing OSTs
>      to revoke client's extents locks
>    * Client must requeue layout lock before requeueing extents locks
>
>    What if client hasn't noticed it's been disconnected from the MDT by
>    the time it tries to requeue extent locks?  It doesn't know that the
>    layout lock its holding is invalid...

That is a thorny problem.  I'll go through several partial solutions
and see why they do not work, then hopefully a safe solution at the end.

One possibility is that the AST sent to the clients during the extent lock
revocation would contain a flag that indicates "the layout is changing"
(similar to the truncate/discard data flag), so the clients get notified
even if disconnected from the MDS.  It still isn't enough, however,
as the clients will only get this AST if they currently hold an extent
lock, and that isn't always the case.

A second option: if a client holding a layout lock is evicted AND the
layout is being changed, the MDS does not release the extent locks
until at least one ping interval has passed (assuming any still-alive
client would have detected the eviction and tried to reconnect).  This is
also not 100% safe, because the client might have been evicted moments
earlier due to some other lock, and the "wait for one ping interval"
heuristic would no longer apply.

We also cannot depend on the layout change being so drastic that the old
objects no longer exist to be written to (CROW issues aside).  If we are
changing the layout to add a mirror, that wouldn't help, and we would now
have inconsistent data on each half of the mirror.

Another option is something like "imperative eviction", where clients
being evicted are actively told so.  The problem is that the "you are
evicted" RPC will normally be sent to a node which is already dead, which
would slow down the MDS and/or block all of its LNET credits, so it
isn't really a usable option.


A safe option (AFAICS) is to have MDS eviction force OST eviction (via
obd_set_info_async(EVICT_BY_NID)).  That would also resolve some other
recovery problems, but might be overly drastic if e.g. the client is
being evicted from the MDS due to router failure or simple network
partition.  Having a proper health network and also server-side RPC
resending would help avoid such problems.
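
A toy Python sketch of this cascading eviction follows.  The `Mds` and `Ost` classes and the `evict_by_nid` method are hypothetical stand-ins: `evict_by_nid` only models the intended effect of the EVICT_BY_NID call mentioned above and is not the real Lustre interface.

```python
class Ost:
    """Toy OST: remembers which client NIDs are connected."""

    def __init__(self, name):
        self.name = name
        self.exports = set()

    def connect(self, nid):
        self.exports.add(nid)

    def evict_by_nid(self, nid):
        # Hypothetical stand-in for the effect of EVICT_BY_NID: drop the
        # client's export (and, in a real system, all of its locks).
        self.exports.discard(nid)


class Mds:
    """Toy MDS that propagates its evictions to every OST."""

    def __init__(self, osts):
        self.osts = osts
        self.exports = set()

    def connect(self, nid):
        self.exports.add(nid)

    def evict(self, nid):
        self.exports.discard(nid)
        # Cascade: a client evicted from the MDS is evicted from all OSTs.
        for ost in self.osts:
            ost.evict_by_nid(nid)


osts = [Ost("OST0000"), Ost("OST0001")]
mds = Mds(osts)
for nid in ("10.0.0.1@tcp", "10.0.0.2@tcp"):
    mds.connect(nid)
    for ost in osts:
        ost.connect(nid)

mds.evict("10.0.0.1@tcp")
remaining = [sorted(ost.exports) for ost in osts]
```

The cost noted above is visible even in the toy: the evicted client loses its OST connections too, whatever the reason for the MDS eviction was.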

This is one of the main reasons why having DLM servers on one node
controlling resources on another node is a bad idea.  We had similar
issues in the past when we locked all objects via the OST only on
stripe index 0, and we might have similar problems with subtree locks
in the future with CMD or any SNS RAID that is only locking a subset
of objects.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

* [Lustre-devel] layout lock / extent lock interaction
  2009-03-06 19:16           ` [Lustre-devel] layout lock / extent lock interaction Andreas Dilger
@ 2009-03-06 22:59             ` Nathaniel Rutman
  2009-03-06 23:48               ` Andreas Dilger
  0 siblings, 1 reply; 9+ messages in thread
From: Nathaniel Rutman @ 2009-03-06 22:59 UTC (permalink / raw)
  To: lustre-devel

Andreas Dilger wrote:
> On Mar 06, 2009  10:01 -0800, Nathaniel Rutman wrote:
>   
>> I think we need to explicitly list the extent / layout lock interactions  
>> so we don't miss anything in the implementation:
>> 1. Create
>>
>>    * MDT generates new layout lock at open
>>    * client gets Common Reader layout lock
>>    * client can get new extents read/write locks as long it holds CR
>>      layout lock
>>
>> 2. Layout change
>>
>>    * MDT takes PW layout lock, revoking all client CR locks
>>    * in parallel, MDT takes PW lock on all extents on all OSTs for this
>>      file
>>    * Clients drop layout lock and requeue
>>    * Clients flush cache and drop their extent locks
>>    * MDT changes layout
>>    * MDT releases layout lock and extents locks
>>    * Clients get CR layout lock and can now requeue their extent locks
>>
>> 3. Client / MDT network partition
>>
>>    * client can continue reading/writing to currently held extents
>>    * when client determines it has been disconnected from MDT it drops
>>      layout lock
>>    * client can't get new extent locks, but can continue writing to
>>      currently held extents
>>    * if MDT changes layout, it first PW locks all extents, causing OSTs
>>      to revoke client's extents locks
>>    * Client must requeue layout lock before requeueing extents locks
>>
>>    What if client hasn't noticed it's been disconnected from the MDT by
>>    the time it tries to requeue extent locks?  It doesn't know that the
>>    layout lock its holding is invalid...
>>     
>
> That is a thorny problem.  I'll go through several partial solutions
> and see why they do not work, then hopefully a safe solution at the end.
>
> One possibility is that the AST sent to the clients during the extent lock
> revocation would contain a flag that indicates "the layout is changing"
> (similar to the truncate/discard data flag), so the clients get notified
> even if disconnected from the MDS.  It still isn't enough, however,
> as the clients will only get this AST if they currently have an extent
> lock, and it isn't always true.
>   

How about if we introduce the concept of a layout generation?  The
generation is stored in the layout and also with each OST object.  When
the MDT takes the extent locks it sends the new generation to the OSTs.
Clients send the layout generation along with any extent lock enqueue.
The OSTs only grant extents to clients that match the current
generation.  Maybe "match or exceed" in case OST dies before new gen can
be recorded.  And OST increases gen to latest seen whenever any (MDT or
client) extent lock is enqueued.

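A minimal Python sketch of the generation check, under the "match or exceed" rule.  The `OstObject` class and `enqueue_extent` method are hypothetical names for illustration; nothing here is an existing Lustre interface.

```python
class OstObject:
    """Toy OST object that remembers the newest layout generation seen."""

    def __init__(self, layout_gen=0):
        self.layout_gen = layout_gen

    def enqueue_extent(self, requester_gen):
        """Grant an extent lock only if the requester's generation
        matches or exceeds the one stored with the object."""
        if requester_gen < self.layout_gen:
            return False  # requester holds a stale layout
        # Raise the stored generation to the latest seen, whether the
        # enqueue came from the MDT or from a client.
        self.layout_gen = max(self.layout_gen, requester_gen)
        return True


obj = OstObject(layout_gen=1)
granted_before_change = obj.enqueue_extent(1)  # client with current layout

obj.enqueue_extent(2)                  # MDT locks extents for a layout change
stale_client = obj.enqueue_extent(1)   # partitioned client, old layout
fresh_client = obj.enqueue_extent(2)   # client that re-fetched the layout

# An OST that died before recording generation 2 still accepts it later
# and catches up, which is what "match or exceed" allows.
lagging = OstObject(layout_gen=1)
lagging_grant = lagging.enqueue_extent(2)
```

The client that missed its MDT disconnection is refused at the OST, without the OST needing to know anything about the client's MDT connection state.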
> A second option is in case a client holding a layout lock is evicted AND
> the layout is being changed then the MDS can't release the extent locks
> until at least one ping interval (assuming any still-alive client would
> have detected this and try reconnecting).  This is also not 100% safe because
> the client might have been evicted moments earlier due to some other lock
> and the "wait for one ping interval" heuristic would no longer apply.
>
> We cannot depend on the layout change to be drastic and the objects would
> no longer exist to be written to (CROW issues aside).  If we are changing
> the layout to add a mirror that wouldn't help and we would now have
> inconsistent data on each half of the mirror.
>
> Another option is something like "imperative eviction" so that clients
> being evicted are actively told they are being evicted, but that has
> the issue of the "you are evicted" RPC will normally be sent to a node
> which is already dead and slow down the MDS and/or block all of its
> LNET credits so isn't really even a usable option.
>
>
> A safe option (AFAICS) is to have MDS eviction force OST eviction (via
> obd_set_info_async(EVICT_BY_NID)).  That would also resolve some other
> recovery problems, but might be overly drastic if e.g. the client is
> being evicted from the MDS due to router failure or simple network
> partition.  Having a proper health network and also server-side RPC
> resending would help avoid such problems.
>   

This is drastic, but on the other hand we only need to do this if the
layout is being changed.  Of course, since eviction would happen before
layout change we would need to remember who was evicted and hasn't
reconnected...

> This is one of the main reasons why having DLM servers on one node
> controlling resources on another node is a bad idea.  We had similar
> issues in the past when we locked all objects via the OST only on
> stripe index 0, and we might have similar problems with subtree locks
> in the future with CMD or any SNS RAID that is only locking a subset
> of objects.
>   

* [Lustre-devel] layout lock / extent lock interaction
  2009-03-06 22:59             ` Nathaniel Rutman
@ 2009-03-06 23:48               ` Andreas Dilger
  0 siblings, 0 replies; 9+ messages in thread
From: Andreas Dilger @ 2009-03-06 23:48 UTC (permalink / raw)
  To: lustre-devel

On Mar 06, 2009  14:59 -0800, Nathaniel Rutman wrote:
> How about if we introduce the concept of a layout generation?  The  
> generation is stored in the layout and also with each OST object.  When  
> the MDT takes the extent locks it sends the new generation to the OSTs.   
> Clients send the layout generation along with any extent lock enqueue.   
> The OSTs only grant extents to clients that match the current  
> generation.  Maybe "match or exceed" in case OST dies before new gen can  
> be recorded.  And OST increases gen to latest seen whenever any (MDT or  
> client) extent lock is enqueued.

I like this idea.  We would need some place to store this information in
the LOV EA on the MDT, pass it to the client, and pass it to (and store
it on) the OST.  We already have:
- inode versions (VBR; change on each file modification)
- IO epochs (SOM; change slowly as files are written, not persistent)
- recovery epochs (CMD/WBC; change frequently as global epochs are committed)

We could conceivably use the space in "l_ost_gen" in the first stripe,
as we have never implemented OST generations.  Those were intended for
OST replacement and/or OST snapshots.  This has the drawback that the
field is per-stripe, so we would likely be wasting the additional
l_ost_gen values in later stripes, in addition to breaking their
intended use.

Maybe we just bite the bullet and add another LOV EA type?

>> A safe option (AFAICS) is to have MDS eviction force OST eviction (via
>> obd_set_info_async(EVICT_BY_NID)).  That would also resolve some other
>> recovery problems, but might be overly drastic if e.g. the client is
>> being evicted from the MDS due to router failure or simple network
>> partition.  Having a proper health network and also server-side RPC
>> resending would help avoid such problems.
>   
> This is drastic, but on the other hand we only need to do this if the  
> layout is being changed.  Of course, since eviction would happen before  
> layout change we would need to remember who was evicted and hasn't  
> reconnected...

No, I don't think we need to remember recently-evicted clients, since
the MDS would also evict clients from all OSTs immediately.  The goal
to avoid this drastic action would be to avoid evicting the client
from the MDS in the first place (e.g. by request resend, health net),
which is a double win.

>> This is one of the main reasons why having DLM servers on one node
>> controlling resources on another node is a bad idea.  We had similar
>> issues in the past when we locked all objects via the OST only on
>> stripe index 0, and we might have similar problems with subtree locks
>> in the future with CMD or any SNS RAID that is only locking a subset
>> of objects.
>>   

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
