* [Lustre-devel] imperative recovery
       [not found] <1906DB02-F9DF-4F49-9A9A-23FE7E799EA8@sun.com>
@ 2008-12-15 20:32 ` Eric Barton
  2008-12-18 20:15   ` Nathaniel Rutman
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Barton @ 2008-12-15 20:32 UTC
  To: lustre-devel

Robert,

Comments inline

> -----Original Message-----
> From: Robert.Read at Sun.COM [mailto:Robert.Read at Sun.COM] On Behalf Of Robert Read
> Sent: 11 December 2008 12:25 AM
> To: Eric Barton
> Subject: imperative recovery
> 
> Earlier today you suggested that the server could ping the clients
> after it restarts. Assuming the server had the nids, how would that
> actually work? Clients don't have any services (or even an acceptor
> for the socklnd case), so how would a server initiate communication
> with the client?  We could add a new kind of RPC that doesn't
> require a ptlrpc connection (much like connect itself doesn't
> require a connection), but it seems at least with socklnd there is
> no way to send that message.

Indeed - this overturns the precedent that Lustre servers don't send
unsolicited RPCs to clients.  That precedent is a nod towards network
security, so that client firewalls can trivially block incoming
connection requests.  But it is only assured at the Lustre RPC level
- with redundantly routed networks, connections can be established in
either direction at the LND level.  An RPC reply will most probably
follow a different path back through the network to the request sender
and establish new LND connections as required.  This is fine for
kernel LNDs which both create and accept connections - but userspace
LNDs typically don't run acceptors, so userspace LNET specifically
establishes connections to all known routers on startup to avoid this
issue.

Ignoring this precedent for now - one could argue that when a
rebooting server sees info about a client in the on-disk export, it
could have some expectation that the client is waiting for recovery.
Some way of alerting the client that now is a good time to try to
reconnect therefore seems reasonable.  However I think there is a
wider issue to consider first.

Q. Why can't clients reconnect immediately the server restarts?

A. Because they may not know yet that the server died.

Q. Why don't clients know that the server died?

A. Because server death is not detected until RPCs time out.

Q. Why is the RPC timeout so long?

A. Because server death and congestion are easily confused.

This seems to me to get at some fundamental issues about recovery
handling that not even adaptive timeouts have solved for us...

1. Server failover/recovery should complete in 10s of seconds, not
   minutes or hours.

   . Clients must detect server death promptly - much faster than
     normal RPC latency on a congested cluster

   . Servers must detect client death/absence promptly to ensure
     recovery isn't blocked too long by a client crash.

   . To prevent unrelated traffic from being blocked unduly,
     communications associated with a failed client or server must be
     removed from the network promptly, as if the failing node were
     still responsive.

2. Peer failure must be detected with reasonable accuracy in the
   presence of server congestion, LNET router congestion, and LNET
   router failure.

   . Router failure can cause large numbers of RPCs to fail or time
     out.

   . Mis-diagnosing server death is inefficient but the client can
     reconnect harmlessly.

   . Mis-diagnosing client death can cause lost updates when the
     server evicts the client.
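
To make the second bullet of point 1 concrete, here is a minimal sketch
of why one undetected dead client holds up everyone else.  It assumes
the server waits for every client recorded in the on-disk exports; the
function and argument names are made up for illustration and are not
Lustre code.

```python
def recovery_done(expected, reconnected, evicted):
    """Recovery can finish once every client recorded in the on-disk exports
    has either reconnected or been declared dead (hypothetical model)."""
    return set(expected) <= set(reconnected) | set(evicted)


# c3 crashed: recovery stays open until the server decides c3 is gone.
print(recovery_done(["c1", "c2", "c3"], ["c1", "c2"], []))
print(recovery_done(["c1", "c2", "c3"], ["c1", "c2"], ["c3"]))
```

In this model the time to decide "c3 is dead" is added directly to the
recovery time seen by c1 and c2.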

> Other options I've thought of to explore this idea:
> 
> - MGS notifies clients (somehow) after a server has restarted.
> 
> - A new tcp socket (possibly in userspace) that can receive
> administrative messages like this (messages can be sent from the
> server, from master admin node, etc). Perhaps related to new lproc
> replacement? Updates could be sent from servers themselves or from
> "god" appliance that was keeping track of server nodes.
> 
> - Use "pdsh lctl" to notify all clients a failover has occurred.
> Ugly, but it would allow us to test the basic idea quickly.  (All we
> need is a new lctl command and changes in the ptlrpc client bits to
> support external initiation of recovery to a specific node, which
> we'll need anyway.)
> 
> 
> robert

I'm totally in favour of supporting additional notification methods
that can increase diagnostic accuracy or speed recovery.  However...

1. We can't rely purely on external notifications.  We need a portable
   baseline capability that works well with existing network
   infrastructure.

2. I'm extremely nervous of relying on notifications via 3rd parties
   unless the whole Lustre communications model is changed to
   accommodate them.  Network failures can be observed quite
   differently from different nodes, so I'd like to stick with methods
   that use the same paths as regular communications.

I think some elements of the solution include...

0. Change the point-to-point LNET peer health model from one that
   times out individual messages to one that removes messages blocking
   for a failing peer aggressively.  This has already been
   demonstrated to work successfully to flush congested routers when a
   server dies (bug 16186).
 
1. Health related communications must not be affected by congested
   "normal" communications.  The obvious solution is to provide an
   additional virtual LNET just for this traffic - i.e. implement
   message priority - but this poses further questions...

   a. How much will this complicate the LNET/LND implementation -
      e.g. do _all_ connection-based LNDs have to double up their
      connections to ensure orthogonality or complicate existing
      credit protocols to account for priority messaging?

   b. Are 2 priority levels enough - maybe lock conflict resolution
      could/should benefit?

   c. What effect does this have on security/resilience to attack?

2. Aggregate health related communications between peers to minimize
   the number of health messages in the system.  Also ensure health
   related communications only occur when knowledge of peer health is
   actually required - e.g. a client with no locks on a given server
   doesn't have to be responsive.

   The implementation of these features is fundamental to scalability.
   They determine the level of background health "noise" and its
   effect on "real" traffic at a given client and server count given a
   required failure detection latency and limits (or lack thereof) on
   how much state on how many servers each client can cache.
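
Elements 0 and 1 above can be sketched as a toy per-peer send queue.
This is not LNET code: the class, the method names and the choice of
exactly two priority levels are assumptions made for illustration.
Health messages overtake queued normal traffic, and a peer marked as
failing has its whole queue dropped in one step instead of each message
timing out individually.

```python
import heapq
from collections import defaultdict
from itertools import count

HEALTH, NORMAL = 0, 1  # lower value is sent first; two levels are an assumption


class PeerQueues:
    """Toy per-peer send queues; every name here is hypothetical."""

    def __init__(self):
        self._queues = defaultdict(list)  # peer NID -> heap of (priority, seq, msg)
        self._seq = count()               # tie-breaker keeps FIFO order per priority

    def enqueue(self, nid, msg, priority=NORMAL):
        heapq.heappush(self._queues[nid], (priority, next(self._seq), msg))

    def next_message(self, nid):
        """Pop the next message for a peer: health traffic before normal traffic."""
        queue = self._queues.get(nid)
        if not queue:
            return None
        return heapq.heappop(queue)[2]

    def peer_failed(self, nid):
        """Drop everything queued for a failing peer in one step and return it,
        rather than waiting for each message to time out on its own."""
        return [msg for _, _, msg in sorted(self._queues.pop(nid, []))]
```

A real LND would also have to account for credits and connections per
priority level, which is exactly question 1a.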

    Cheers,
              Eric

* [Lustre-devel] imperative recovery
  2008-12-15 20:32 ` [Lustre-devel] imperative recovery Eric Barton
@ 2008-12-18 20:15   ` Nathaniel Rutman
  2009-01-09 15:27     ` Nicholas Henke
  0 siblings, 1 reply; 7+ messages in thread
From: Nathaniel Rutman @ 2008-12-18 20:15 UTC
  To: lustre-devel

Eric Barton wrote:
>
>> Other options I've thought of to explore this idea:
>>
>> - MGS notifies clients (somehow) after a server has restarted.
>>     
This seems like a no-brainer easy win today, and doesn't depend on any 
advanced features like message priority.  The only scalability issue 
would seem to be the broadcast of the message to all clients, but this 
is no different than the current broadcast mechanism the MGS employs to 
update client configs.  The message from the MGS would be taken as a 
suggestion, "Why don't y'all time out all your current RPCs since I 
noticed OST0004 restarted.  Oh, and use failover nid #2."  Current 
replay/recovery need not be touched.
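
A rough sketch of the client side of that suggestion, assuming the
notice carries a target name and a failover NID index.  The Import
class and on_restart_notice() are hypothetical names, not existing
ptlrpc structures; the point is only that the notice expires in-flight
RPCs early and switches NID, then leaves the rest to the existing
replay/recovery path.

```python
from dataclasses import dataclass, field


@dataclass
class Import:
    """Toy client-side view of one target; all fields are hypothetical."""
    target: str
    nids: list                      # failover NIDs in configured order
    current: int = 0                # index into nids
    in_flight: list = field(default_factory=list)


def on_restart_notice(imports, target, nid_index):
    """Treat an MGS notice as a hint: expire in-flight RPCs for the target now
    and point the import at the suggested NID. Returns the expired RPCs so the
    caller can hand them to the existing replay/recovery path unchanged."""
    imp = imports.get(target)
    if imp is None or not 0 <= nid_index < len(imp.nids):
        return []                   # unknown target or bad hint: ignore it
    expired, imp.in_flight = imp.in_flight, []
    imp.current = nid_index
    return expired
```

Because the notice is only a suggestion, a client that ignores or never
receives it still falls back to its normal RPC timeouts.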

* [Lustre-devel] imperative recovery
  2008-12-18 20:15   ` Nathaniel Rutman
@ 2009-01-09 15:27     ` Nicholas Henke
  2009-01-09 17:04       ` Robert Read
  0 siblings, 1 reply; 7+ messages in thread
From: Nicholas Henke @ 2009-01-09 15:27 UTC
  To: lustre-devel

Nathaniel Rutman wrote:
> Eric Barton wrote:
>>> Other options I've thought of to explore this idea:
>>>
>>> - MGS notifies clients (somehow) after a server has restarted.
>>>     
> This seems like a no-brainer easy win today, and doesn't depend on any 
> advanced features like message priority.  The only scalability issue 
> would seem to be the broadcast of the message to all clients, but this 
> is no different than the current broadcast mechanism the MGS employs to 
> update client configs.  The message from the MGS would be taken as a 
> suggestion, "Why don't y'all time out all your current RPCs since I 
> noticed OST0004 restarted.  Oh, and use failover nid #2."  Current 
> replay/recovery need not be touched.

This would be a great enhancement for OSS failover or reboot; it is really the
only way we'll get to recovery times under ~2.5 x obd_timeout. Adaptive Timeouts 
really aren't buying us much here, as at scale and under load we are seeing the 
timeouts approach the usual static obd_timeout of 300s. It only takes one client 
with a higher timeout to push the recovery time out.
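
For scale, the arithmetic behind those numbers, as a small sketch.  The
2.5 factor and the 300s static obd_timeout are the figures quoted
above; treating the bound as 2.5 times the largest client timeout is a
simplifying assumption, and the function name is made up.

```python
def recovery_bound(client_timeouts, factor=2.5):
    """Rough recovery-time bound in seconds: the slowest client sets the pace."""
    return factor * max(client_timeouts)


# Most clients have adapted to short timeouts, but one sits at the static 300s.
print(recovery_bound([25, 30, 300]))   # 750.0
```

So a single client at 300s puts the bound at 750s, i.e. 12.5 minutes,
however short the other clients' timeouts are.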

I do think this will miss a significant case: combo MGS+MDS. A majority of our 
customers are deploying with this configuration. Perhaps exposing this mechanism 
on the clients via a /proc file would be enough - that way a failover framework 
could manually trigger the timeout and/or nid switching.

Nic

* [Lustre-devel] imperative recovery
  2009-01-09 15:27     ` Nicholas Henke
@ 2009-01-09 17:04       ` Robert Read
  2009-01-09 19:43         ` Nicholas Henke
  2009-01-10  0:50         ` Andreas Dilger
  0 siblings, 2 replies; 7+ messages in thread
From: Robert Read @ 2009-01-09 17:04 UTC
  To: lustre-devel


On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:

> Nathaniel Rutman wrote:
>> Eric Barton wrote:
>>>> Other options I've thought of to explore this idea:
>>>>
>>>> - MGS notifies clients (somehow) after a server has restarted.
>>>>
>> This seems like a no-brainer easy win today, and doesn't depend on  
>> any
>> advanced features like message priority.  The only scalability issue
>> would seem to be the broadcast of the message to all clients, but  
>> this
>> is no different than the current broadcast mechanism the MGS  
>> employs to
>> update client configs.  The message from the MGS would be taken as a
>> suggestion, "Why don't y'all time out all your current RPCs since I
>> noticed OST0004 restarted.  Oh, and use failover nid #2."  Current
>> replay/recovery need not be touched.
>
> This would be a great enhancement for OSS failover or reboot, it is  
> really the
> only way we'll get to recovery times under ~2.5 x obd_timeout.  
> Adaptive Timeouts
> really aren't buying us much here, as at scale and under load we are  
> seeing the
> timeouts approach the usual static obd_timeout of 300s. It only  
> takes one client
> with a higher timeout to push the recovery time out.
>
> I do think this will miss a significant case: combo MGS+MDS. A  
> majority of our
> customers are deploying with this configuration. Perhaps exposing  
> this mechanism
> on the clients via a /proc file would be enough - that way a  
> failover framework
> could manually trigger the timeout and/or nid switching.

Yes, exactly what I was thinking. Exposing this feature via proc (or  
lctl) on the clients is the first step. It has minimal impact,
requires no changes to the server, and should integrate well with  
existing failover frameworks.  We also need to get the server to end  
recovery sooner (without waiting for all the stale exports), but VBR  
should help with that.

robert

>
>
> Nic
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

* [Lustre-devel] imperative recovery
  2009-01-09 17:04       ` Robert Read
@ 2009-01-09 19:43         ` Nicholas Henke
  2009-01-10  0:50         ` Andreas Dilger
  1 sibling, 0 replies; 7+ messages in thread
From: Nicholas Henke @ 2009-01-09 19:43 UTC
  To: lustre-devel

Robert Read wrote:
> 
> On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:

>>
>> I do think this will miss a significant case: combo MGS+MDS. A 
>> majority of our
>> customers are deploying with this configuration. Perhaps exposing this 
>> mechanism
>> on the clients via a /proc file would be enough - that way a failover 
>> framework
>> could manually trigger the timeout and/or nid switching.
> 
> Yes, exactly what I was thinking. Exposing this feature via proc (or 
> lctl) on the clients is the first step. It has minimal impact,
> requires no changes to the server, and should integrate well with 
> existing failover frameworks.  We also need to get the server to end 
> recovery sooner (without waiting for all the stale exports), but VBR 
> should help with that.
> 
> robert

FWIW: we'd prefer /proc. We don't ship lctl on our computes for memory 
(initramfs) usage reasons. Being in /proc makes it easy for someone to use the 
functionality from another kernel module as well; we can just call the .read or 
.write functions directly.

Nic

* [Lustre-devel] imperative recovery
  2009-01-09 17:04       ` Robert Read
  2009-01-09 19:43         ` Nicholas Henke
@ 2009-01-10  0:50         ` Andreas Dilger
  2009-01-10  4:44           ` Robert Read
  1 sibling, 1 reply; 7+ messages in thread
From: Andreas Dilger @ 2009-01-10  0:50 UTC
  To: lustre-devel

On Jan 09, 2009  09:04 -0800, Robert Read wrote:
> On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:
> > This would be a great enhancement for OSS failover or reboot, it is  
> > really the only way we'll get to recovery times under ~2.5 x obd_timeout.  
> >
> > I do think this will miss a significant case: combo MGS+MDS. A  
> > majority of our customers are deploying with this configuration.
> > Perhaps exposing this mechanism on the clients via a /proc file
> > would be enough - that way a failover framework
> > could manually trigger the timeout and/or nid switching.
> 
> Yes, exactly what I was thinking. Exposing this feature via proc (or  
> lctl) on the clients is the first step. It has minimal impact,
> requires no changes to the server, and should integrate well with  
> existing failover frameworks.  We also need to get the server to end  
> recovery sooner (without waiting for all the stale exports), but VBR  
> should help with that.

Hey, wouldn't (essentially) "lctl --device $foo recover" do the trick
today?


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

* [Lustre-devel] imperative recovery
  2009-01-10  0:50         ` Andreas Dilger
@ 2009-01-10  4:44           ` Robert Read
  0 siblings, 0 replies; 7+ messages in thread
From: Robert Read @ 2009-01-10  4:44 UTC
  To: lustre-devel


On Jan 9, 2009, at 4:50 PM, Andreas Dilger wrote:

> On Jan 09, 2009  09:04 -0800, Robert Read wrote:
>> On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:
>>> This would be a great enhancement for OSS failover or reboot, it is
>>> really the only way we'll get to recovery times under ~2.5 x  
>>> obd_timeout.
>>>
>>> I do think this will miss a significant case: combo MGS+MDS. A
>>> majority of our customers are deploying with this configuration.
>>> Perhaps exposing this mechanism on the clients via a /proc file
>>> would be enough - that way a failover framework
>>> could manually trigger the timeout and/or nid switching.
>>
>> Yes, exactly what I was thinking. Exposing this feature via proc (or
>> lctl) on the clients is the first step. It has minimal impact,
>> requires no changes to the server, and should integrate well with
>> existing failover frameworks.  We also need to get the server to end
>> recovery sooner (without waiting for all the stale exports), but VBR
>> should help with that.
>
> Hey, wouldn't (essentially) "lctl --device $foo recover" do the trick
> today?

The main difference is we need to specify the nid to connect to. Also,
since lctl isn't always available we should do this with a /proc file
(and set_param), so something like this:

echo $new_ost_nid > /proc/fs/lustre/osc/OSC_FOO_01/target_nid

or

lctl set_param osc.OSC_FOO_01.target_nid=$new_ost_nid
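
A failover framework could drive the same file from a script.  The
sketch below assumes the target_nid file proposed above, which is a
suggestion in this thread and not an existing interface; the helper
name and the proc_root parameter are hypothetical, and proc_root exists
only so the helper can be tried against a scratch directory.

```python
from pathlib import Path


def set_target_nid(osc_name, new_nid, proc_root="/proc/fs/lustre/osc"):
    """Write new_nid to the proposed (not existing) target_nid file for one OSC.
    Returns False if the file is absent, e.g. on a client without the feature."""
    path = Path(proc_root) / osc_name / "target_nid"
    if not path.exists():
        return False
    path.write_text(new_nid + "\n")
    return True
```

Returning False rather than raising lets the framework skip clients
that do not have the file and leave them to time out as they do today.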

robert

>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
