* [Lustre-devel] hiding non-fatal communications errors
@ 2008-06-04 13:25 Eric Barton
From: Eric Barton @ 2008-06-04 13:25 UTC (permalink / raw)
  To: lustre-devel

Something for recovery experts...

Communications may time out for non-fatal reasons, e.g.:

1. Adaptive timeouts were too aggressive (e.g. if server load has
   suddenly become extreme).

2. An LNET router has failed but one or more of its peers hasn't
   detected this yet.

When a Lustre client times out an RPC it sent to a server, it (a) allows
pending signals to be delivered (i.e. you can now ^C the process doing
the I/O) and (b) tries to reconnect and/or fail over.  If it reconnects
and confirms that the server has not rebooted, the RPC is resent and
may now succeed.

This should work for all "normal" RPCs (i.e. all RPCs apart from ldlm
callbacks (ASTs)) since the server knows whether it actually processed
the RPC or not and can handle the resent request appropriately.
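
A minimal sketch of the resend path described above, not Lustre code.
It assumes the server keeps a saved reply per request id so that a
resent request is answered without being executed twice. The names
Server, client_rpc, xid and boot_count are hypothetical and chosen
only for illustration.

```python
from dataclasses import dataclass, field


class Timeout(Exception):
    """Raised when the client gives up waiting for a reply."""


@dataclass
class Server:
    boot_count: int = 1                           # would change on reboot
    replies: dict = field(default_factory=dict)   # xid -> saved reply
    executed: int = 0                             # operations actually run

    def handle(self, xid, op):
        # A resent request carries the same xid, so answer it from the
        # saved reply instead of executing the operation again.
        if xid in self.replies:
            return self.replies[xid]
        self.executed += 1
        self.replies[xid] = f"done:{op}"
        return self.replies[xid]


def client_rpc(server, xid, op, drop_first_reply=False, max_resends=3):
    known_boot = server.boot_count
    for attempt in range(max_resends + 1):
        try:
            reply = server.handle(xid, op)
            if drop_first_reply and attempt == 0:
                raise Timeout()   # simulate a reply lost in transit
            return reply
        except Timeout:
            # "Reconnect": confirm the server has not rebooted.
            if server.boot_count != known_boot:
                raise RuntimeError("server rebooted; full recovery needed")
            # Same server instance, so loop round and resend.
    raise Timeout()
```

In this model a lost reply costs one extra round trip, and the
operation still runs exactly once on the server.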

However, I think there is a problem if the RPC is an ldlm callback.  In
this case, the Lustre server sends the RPC to the Lustre client and,
AFAIK, the request is not resent if it times out.  If the request is a
blocking AST, the Lustre client isn't notified to clean its cache and
cancel locks - and it risks being evicted.
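
A toy model of the failure mode above, assuming (as stated, AFAIK)
that the server sends the blocking AST once and never resends it.
The function name, the tick loop and the lock_timeout parameter are
hypothetical; the sketch only illustrates the problem and does not
propose a fix.

```python
def blocking_ast_outcome(ast_delivered, lock_timeout=3):
    """Return what happens to a client holding a conflicting lock."""
    client_holds_lock = True

    # The server sends the blocking AST exactly once (assumption).
    if ast_delivered:
        # The client cleans its cache and cancels the lock.
        client_holds_lock = False

    # The server waits a bounded time for the cancel to arrive.
    for _tick in range(lock_timeout):
        if not client_holds_lock:
            return "lock cancelled"

    # No cancel arrived in time: a healthy client is evicted anyway.
    return "client evicted"
```

With ast_delivered=False the client never learns it should cancel,
so a single lost callback is enough to get it evicted.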

How should this be handled?


