* [Lustre-devel] hiding non-fatal communications errors
@ 2008-06-04 13:25 Eric Barton
  2008-06-04 21:17 ` Peter Braam
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Barton @ 2008-06-04 13:25 UTC (permalink / raw)
  To: lustre-devel

Something for recovery experts...

Communications may time out for non-fatal reasons, e.g.:

1. Adaptive timeouts were too aggressive (e.g. if server load has
   suddenly become extreme).

2. An LNET router has failed but one or more of its peers hasn't
   detected this yet.

When a lustre client times out an RPC it sent to a server, it (a) allows
pending signals to be delivered (i.e. you can now ^C the process doing
the I/O) and (b) tries to reconnect and/or fail over.  If it reconnects
and confirms that the server has not rebooted, the RPC is resent and
may now succeed.

This should work in all "normal" RPCs (i.e. all RPCs apart from ldlm
callbacks (ASTs)) since the server knows whether it actually processed
the RPC or not and can handle the resent request appropriately.
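A minimal sketch of why a resend is safe for "normal" RPCs, assuming the
server remembers which requests it has already processed. The names used
here (Server, handle, xid, boot_count, send_with_resend) are hypothetical
and only model the behaviour described above; they are not Lustre's actual
data structures or API.

```python
# Illustrative model only: every name here is hypothetical, not Lustre code.

class Server:
    def __init__(self):
        self.boot_count = 1      # would change if the server rebooted
        self.replies = {}        # (client_id, xid) -> cached reply
        self.executed = 0        # number of requests actually executed

    def handle(self, client_id, xid, op):
        key = (client_id, xid)
        if key in self.replies:
            # Resend of a request that was already processed:
            # return the saved reply instead of executing it again.
            return self.replies[key]
        self.executed += 1
        reply = f"done:{op}"
        self.replies[key] = reply
        return reply


def send_with_resend(server, client_id, xid, op, drop_first_reply=True):
    """Client side: on timeout, reconnect, confirm no reboot, then resend."""
    boot_seen = server.boot_count
    reply = server.handle(client_id, xid, op)
    if drop_first_reply:
        reply = None             # simulate the reply being lost (a timeout)
    if reply is None:
        if server.boot_count != boot_seen:
            raise RuntimeError("server rebooted: full recovery needed")
        reply = server.handle(client_id, xid, op)   # resend with the same xid
    return reply
```

Even though the request reaches the server twice in this model, it is
executed only once, which is the property the resend relies on.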

However I think there is a problem if the RPC is an ldlm callback.  In
this case, the lustre server sends the RPC to the lustre client and
AFAIK the request is not resent if it times out.  If the request is a
blocking AST, the lustre client isn't notified to clean its cache and
cancel locks - and it risks being evicted.

How should this be handled?


* [Lustre-devel] hiding non-fatal communications errors
  2008-06-04 13:25 [Lustre-devel] hiding non-fatal communications errors Eric Barton
@ 2008-06-04 21:17 ` Peter Braam
  2008-06-04 22:20   ` Andreas Dilger
  2008-06-04 23:41   ` Eric Barton
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Braam @ 2008-06-04 21:17 UTC (permalink / raw)
  To: lustre-devel

Andreas has been suggesting re-transmission of these callback (aka AST) RPCs
for years.  If we think it through carefully, it might be a simple solution.

Peter




* [Lustre-devel] hiding non-fatal communications errors
  2008-06-04 21:17 ` Peter Braam
@ 2008-06-04 22:20   ` Andreas Dilger
  2008-06-05  4:12     ` Oleg Drokin
  2008-06-04 23:41   ` Eric Barton
  1 sibling, 1 reply; 15+ messages in thread
From: Andreas Dilger @ 2008-06-04 22:20 UTC (permalink / raw)
  To: lustre-devel

On Jun 04, 2008  14:17 -0700, Peter J. Braam wrote:
> Andreas has been suggesting re-transmission of these callback (aka AST) RPCs
> for years.  If we think it through carefully, it might be a simple solution.

Yes, server->client resends at least to a limited extent would help in the
case of short-term network partitioning or e.g. a suddenly-failed router.

We have some amount of "RPC resend before recovery" support for bulk RPCs
in the case of checksum errors - e.g. retry the bulk RPC 5 times for a
checksum error before returning an IO error to the application.

I suspect this could be adapted to allowing a fixed number of retries for
server-originated RPCs also.  In the case of LDLM blocking callbacks sent
to a client, a resend is currently harmless (either the client is already
processing the callback, or the lock was cancelled).
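As a rough sketch of the idea, assuming a fixed retry budget and a client
whose blocking-callback handler is idempotent: the names below (Client,
blocking_ast, send_blocking_ast, max_retries) are hypothetical and only
model the behaviour, they are not the ldlm API.

```python
# Illustrative model only: every name here is hypothetical, not Lustre code.

class Client:
    def __init__(self, drop_first=0):
        self.drop_first = drop_first   # initial callbacks lost in transit
        self.cancelling = set()        # locks whose cancel is already under way
        self.cancels_started = 0

    def blocking_ast(self, lock_id):
        if self.drop_first > 0:
            self.drop_first -= 1
            return None                # simulate a timed-out callback
        if lock_id not in self.cancelling:
            # First delivery: start flushing and cancelling the lock.
            self.cancelling.add(lock_id)
            self.cancels_started += 1
        # A duplicate delivery finds the work already started and is harmless.
        return "ack"


def send_blocking_ast(client, lock_id, max_retries=3):
    """Server side: send once, then resend up to max_retries times."""
    for attempt in range(1 + max_retries):
        if client.blocking_ast(lock_id) == "ack":
            return attempt             # number of resends that were needed
    return None                        # budget exhausted: fall back to eviction
```

With two lost callbacks the third attempt gets through, and delivering
the callback again later does not start a second cancel.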


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


* [Lustre-devel] hiding non-fatal communications errors
  2008-06-04 21:17 ` Peter Braam
  2008-06-04 22:20   ` Andreas Dilger
@ 2008-06-04 23:41   ` Eric Barton
  1 sibling, 0 replies; 15+ messages in thread
From: Eric Barton @ 2008-06-04 23:41 UTC (permalink / raw)
  To: lustre-devel

> -----Original Message-----
> From: Peter.Braam at Sun.COM [mailto:Peter.Braam at Sun.COM] 
> Sent: 04 June 2008 10:17 PM
> To: Eric Barton; 'Lustre Development Mailing List'
> Subject: Re: [Lustre-devel] hiding non-fatal communications errors
> 
> Andreas has been suggesting re-transmission of these callback (aka AST) RPCs
> for years.  If we think it through carefully, it might be a simple solution.

Yes - carefully is the watchword - I suspect lock callback RPCs (aka ASTs)
have some fundamentally different properties.  Nathan and I seemed to
touch on this when we last chatted about related AT issues.

Any volunteers to s/we/me/ ?



* [Lustre-devel] hiding non-fatal communications errors
  2008-06-04 22:20   ` Andreas Dilger
@ 2008-06-05  4:12     ` Oleg Drokin
  2008-06-05 16:42       ` Robert Read
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Drokin @ 2008-06-05  4:12 UTC (permalink / raw)
  To: lustre-devel

Hello!

On Jun 4, 2008, at 6:20 PM, Andreas Dilger wrote:

> I suspect this could be adapted to allowing a fixed number of retries for
> server-originated RPCs also.  In the case of LDLM blocking callbacks sent
> to a client, a resend is currently harmless (either the client is already
> processing the callback, or the lock was cancelled).

We need to be careful here and decide on a good strategy for when to resend.
E.g. in a recent case at ORNL (even if a bit pathological), thousands of
clients pound 4 OSSes via 2 routers. That creates request waiting lists on
the OSSes well into the tens of thousands. When we block on a lock and send
a blocking AST to the client, it quickly turns around and puts its data...
at the end of our list, which takes hundreds of seconds to drain (more than
obd_timeout, obviously). No matter how much you resend, it won't help.

Now, a good argument is that before we kill such clients (or do any sort of
resend), perhaps it makes sense to check the incoming queue to see if there
is anything from them? On the other hand, that would be like half of a
request scheduler, probably, and with such queues it would take ages, I guess.

BTW, the AT code changes lock waiting from obd_timeout to obd_timeout/2
(when AT is disabled); why is that? All this is bug 15332.
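To put rough numbers on the queueing argument, here is a small sketch.
The figures (30000 queued requests, 100 requests served per second, a
100-second lock timeout) are assumptions chosen for illustration, not
measurements from ORNL, and the function names are hypothetical.

```python
# Illustrative arithmetic only: all figures and names are assumptions.

def queue_wait_seconds(queued_requests, requests_per_second):
    """Time a newly arrived request waits behind everything already queued."""
    return queued_requests / requests_per_second


def lock_cancelled_in_time(queued_requests, requests_per_second, lock_timeout):
    """True if the client's writeout clears the queue before the lock times out."""
    wait = queue_wait_seconds(queued_requests, requests_per_second)
    return wait <= lock_timeout


# 30000 requests ahead of the client's data at 100 requests/s is a 300 s wait,
# well past a 100 s timeout. Resending the AST does not move the data forward
# in the queue, so the lock still cannot be cancelled in time.
wait = queue_wait_seconds(30000, 100)
in_time = lock_cancelled_in_time(30000, 100, lock_timeout=100)
```

The wait depends only on queue depth and service rate, which is why the
number of AST resends does not appear anywhere in the calculation.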

Or was the resend meant just for the initial RPC, where we do not get a
confirmation soon? Yes, there it makes sense to retry soon, but the case
above still needs to be considered, since currently we do not retry writeouts
either, which has just as bad an effect on dirty client caches, and of
course all of the above applies in such cases too.

Also, without the lnet patch in 15332, where small messages are prioritized
on routers, it is way too easy to time out an AST response because of router
congestion, and no amount of resending would help then.

Bye,
     Oleg


* [Lustre-devel] hiding non-fatal communications errors
  2008-06-05  4:12     ` Oleg Drokin
@ 2008-06-05 16:42       ` Robert Read
  2008-06-05 16:59         ` Oleg Drokin
  2008-06-06  3:37         ` Peter Braam
  0 siblings, 2 replies; 15+ messages in thread
From: Robert Read @ 2008-06-05 16:42 UTC (permalink / raw)
  To: lustre-devel


On Jun 4, 2008, at 21:12 , Oleg Drokin wrote:

> Hello!
>
> On Jun 4, 2008, at 6:20 PM, Andreas Dilger wrote:
>
>> I suspect this could be adapted to allowing a fixed number of retries for
>> server-originated RPCs also.  In the case of LDLM blocking callbacks sent
>> to a client, a resend is currently harmless (either the client is already
>> processing the callback, or the lock was cancelled).
>
> We need to be careful here and decide on a good strategy on when to resend.
> E.g. recent case at ORNL (even if a bit pathologic) is they pound through
> thousands of clients to 4 OSSes via 2 routers. That creates request waiting
> lists on OSSes well into tens of thousands. When we block on a lock and send
> blocking AST to the client, it quickly turns around and puts in his data...
> at the end of our list that takes hundreds of seconds (more than obd_timeout,
> obviously). No matter how much you resend, it won't help.


This looks like the poster child for adaptive timeouts, although we
might need some version of the early margin update patch on
15501.  Have you tried enabling AT?

>
> Now a good argument is before we kill such clients (or do any sort of resend),
> perhaps it makes sense to check incoming queue to see if there is anything?
> On the other hand that would be like half of request scheduler, probably, and
> with such queues, it would take ages, I guess.
> BTW, AT code changes lock waiting from obd_timeout to obd_timeout/2, why is that?
> (when AT is disabled). All this is bug 15332.
>

Maybe that was done to discourage people from disabling AT?
Seriously, though, I don't know why that was changed. Perhaps it was
done on b1_6 before AT landed?

robert




* [Lustre-devel] hiding non-fatal communications errors
  2008-06-05 16:42       ` Robert Read
@ 2008-06-05 16:59         ` Oleg Drokin
  2008-06-06  3:29           ` Peter Braam
  2008-06-06  3:37         ` Peter Braam
  1 sibling, 1 reply; 15+ messages in thread
From: Oleg Drokin @ 2008-06-05 16:59 UTC (permalink / raw)
  To: lustre-devel

Hello!

On Jun 5, 2008, at 12:42 PM, Robert Read wrote:

>>> I suspect this could be adapted to allowing a fixed number of retries for
>>> server-originated RPCs also.  In the case of LDLM blocking callbacks sent
>>> to a client, a resend is currently harmless (either the client is already
>>> processing the callback, or the lock was cancelled).
>> We need to be careful here and decide on a good strategy on when to resend.
>> E.g. recent case at ORNL (even if a bit pathologic) is they pound through
>> thousands of clients to 4 OSSes via 2 routers. That creates request waiting
>> lists on OSSes well into tens of thousands. When we block on a lock and send
>> blocking AST to the client, it quickly turns around and puts in his data...
>> at the end of our list that takes hundreds of seconds (more than obd_timeout,
>> obviously). No matter how much you resend, it won't help.
> This looks like the poster child for adaptive timeouts, although we
> might need some version of the early margin update patch on
> 15501.  Have you tried enabling AT?

The problem is that AT does not handle this specific case: there is no way
to deliver an "early reply" from a client to the server saying "I am working
on it" other than just sending the dirty data. But the dirty data sits in a
queue for way too long. There are no timed-out requests; the only thing
timing out is the lock that is not cancelled in time.

AT was not tried - this is hard to do at ORNL, as the client side is a Cray
XT4 machine, and updating clients is hard. So they are on 1.4.11 of some sort.
They can easily update servers, but this won't help, of course.

> Maybe that was done to discourage people from disabling AT?
> Seriously, though, I don't know why that was changed. Perhaps it was
> done on b1_6 before AT landed?

hm, indeed. I see this change in 1.6.3.

Bye,
     Oleg


* [Lustre-devel] hiding non-fatal communications errors
  2008-06-05 16:59         ` Oleg Drokin
@ 2008-06-06  3:29           ` Peter Braam
  2008-06-06  3:38             ` Oleg Drokin
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Braam @ 2008-06-06  3:29 UTC (permalink / raw)
  To: lustre-devel

Why can we not send early replies?




* [Lustre-devel] hiding non-fatal communications errors
  2008-06-05 16:42       ` Robert Read
  2008-06-05 16:59         ` Oleg Drokin
@ 2008-06-06  3:37         ` Peter Braam
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Braam @ 2008-06-06  3:37 UTC (permalink / raw)
  To: lustre-devel




On 6/5/08 9:42 AM, "Robert Read" <rread@sun.com> wrote:

> 
> On Jun 4, 2008, at 21:12 , Oleg Drokin wrote:
> 
>> Hello!
>> 
>> On Jun 4, 2008, at 6:20 PM, Andreas Dilger wrote:
>> 
>>> I suspect this could be adapted to allowing a fixed number of retries for
>>> server-originated RPCs also.  In the case of LDLM blocking callbacks sent
>>> to a client, a resend is currently harmless (either the client is already
>>> processing the callback, or the lock was cancelled).
>>
>> We need to be careful here and decide on a good strategy on when to resend.
>> E.g. recent case at ORNL (even if a bit pathologic) is they pound through
>> thousands of clients to 4 OSSes via 2 routers. That creates request waiting
>> lists on OSSes well into tens of thousands. When we block on a lock and send
>> blocking AST to the client, it quickly turns around and puts in his data...
>> at the end of our list that takes hundreds of seconds (more than obd_timeout,
>> obviously). No matter how much you resend, it won't help.
> 

I think this is an SNS issue.  Eric?

Peter




* [Lustre-devel] hiding non-fatal communications errors
  2008-06-06  3:29           ` Peter Braam
@ 2008-06-06  3:38             ` Oleg Drokin
  2008-06-06  3:40               ` Peter Braam
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Drokin @ 2008-06-06  3:38 UTC (permalink / raw)
  To: lustre-devel

Hello!

Because there is no way to deliver them. We send our first acknowledgement
of AST reception and it is delivered fast; this is the reply. What is left
now is to send the actual dirty data and then the cancel request. These are
not replies but stand-alone client-generated RPCs, and we cannot cancel
locks while dirty data is not flushed.

Just inventing some sort of ldlm "I am still alive" RPC to send periodically
instead of cancels is dangerous: the data-sending part could be wedged for
unrelated reasons, not only because of contention but also due to some
client problem, and if we prolong locks by other means, that can potentially
wedge all access to that part of a file forever. And the dirty data itself
takes too long to get to actual server processing.

One of the solutions here is a request scheduler, or some stand-alone part
of it that could peek early into RPCs as they arrive, so that when the
decision is being made about client eviction, we can quickly see what is in
the queue from that client and perhaps postpone the eviction based on this
data. This was discussed on the ORNL call.

Andreas said that AT already looks into incoming RPCs before processing to
get an idea of expected service times; perhaps it would not be too hard to
add some logic that would link requests to the exports they came from, for
further analysis if the need for it arises.
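A rough sketch of that idea, assuming requests are indexed by export as
they arrive and that the eviction decision consults this index. All names
(IncomingQueue, arrive, complete, has_pending_io, should_evict, hard_limit)
are hypothetical, and the hard limit is an added assumption meant to keep
a wedged client from holding a lock forever.

```python
# Illustrative model only: every name here is hypothetical, not Lustre code.
from collections import defaultdict


class IncomingQueue:
    def __init__(self):
        self.pending = defaultdict(list)   # export id -> queued request types

    def arrive(self, export_id, req_type):
        # Peek at the request on arrival and link it to its export.
        self.pending[export_id].append(req_type)

    def complete(self, export_id, req_type):
        self.pending[export_id].remove(req_type)

    def has_pending_io(self, export_id):
        # Writes or cancels already queued show the client is making progress.
        queued = self.pending.get(export_id, ())
        return any(t in ("write", "cancel") for t in queued)


def should_evict(queue, export_id, waited, lock_timeout, hard_limit):
    """Decide whether to evict a client that has not cancelled its lock."""
    if waited < lock_timeout:
        return False                   # lock has not timed out yet
    if waited >= hard_limit:
        return True                    # never postpone indefinitely
    # Timed out, but postpone if the client's data is sitting in our queue.
    return not queue.has_pending_io(export_id)
```

A client whose writes are still queued gets its eviction postponed, while
a client with nothing queued, or one past the hard limit, is evicted as
before.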

Bye,
     Oleg
On Jun 5, 2008, at 11:29 PM, Peter Braam wrote:

> Why can we not send early replies?


* [Lustre-devel] hiding non-fatal communications errors
  2008-06-06  3:38             ` Oleg Drokin
@ 2008-06-06  3:40               ` Peter Braam
  2008-06-06  4:41                 ` Andreas Dilger
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Braam @ 2008-06-06  3:40 UTC (permalink / raw)
  To: lustre-devel

Ah yes.  So monitoring progress is the only thing we can do and with SNS you
will be able to get that information long before the request is being
handled.

Peter




* [Lustre-devel] hiding non-fatal communications errors
  2008-06-06  3:40               ` Peter Braam
@ 2008-06-06  4:41                 ` Andreas Dilger
  2008-06-06 11:13                   ` Eric Barton
  2008-06-06 12:23                   ` Peter Braam
  0 siblings, 2 replies; 15+ messages in thread
From: Andreas Dilger @ 2008-06-06  4:41 UTC (permalink / raw)
  To: lustre-devel

On Jun 05, 2008  20:40 -0700, Peter J. Braam wrote:
> Ah yes.  So monitoring progress is the only thing we can do and with SNS you
> will be able to get that information long before the request is being
> handled.

You mean NRS, instead of SNS, right?

> On 6/5/08 8:38 PM, "Oleg Drokin" <Oleg.Drokin@Sun.COM> wrote:
> > Because there is no way to deliver them. We send our first acknowledge of
> > ast reception and it is delivered fast, this is the reply.
> > What is left now is to send actual dirty data and then cancel request.
> > These are not replies, but stand-alone client-generated RPCs, we cannot
> > cancel locks while dirty data is not flushed. Just inventing some sort of
> > ldlm "I am still alive" RPCs to send periodically instead of cancels is
> > dangerous - data-sending part could be wedged for unrelated reasons, for
> > example, not only because of contention, but due to some client problems,
> > and if we prolong locks by other means, that potentially can wedge all
> > access to that part of a file forever.
> > And dirty data itself takes too long to get to the actual server processing.
> > One of the solutions here is request scheduler, or some stand-alone part of
> > it that could peek early into RPCs as they arrive, so that when the decision
> > is being made about client eviction, we can quickly see what is in the queue
> > from that client and perhaps based on this data to postpone the eviction.
> > This was discussed on ORNL call.
> > Andreas said that AT is currently already looking into incoming RPCs before
> > processing, to get ideas about expected service times, perhaps it would not
> > be too hard to add some logic that would link requests into actual exports
> > they came from for further analysis if the need for it arises.

I think hooking the requests into the exports at arrival time is fairly
straightforward, and is an easy first step toward implementing the NRS.
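
As a rough illustration of linking requests to exports at arrival time,
here is a minimal Python sketch. It is not Lustre code: `Service`,
`Export`, `Request` and the opcode strings are hypothetical names chosen
for the example. Each request is linked to its export's pending list as
it arrives and unlinked when it is handled, so the outstanding work of
one client can be inspected without scanning the whole service queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    client: str   # hypothetical client identifier
    opcode: str   # hypothetical opcode name, e.g. "OST_WRITE"


@dataclass
class Export:
    client: str
    pending: list = field(default_factory=list)  # arrived but not yet handled


class Service:
    def __init__(self):
        self.queue = deque()   # global FIFO of incoming requests
        self.exports = {}      # client -> Export

    def arrive(self, req):
        """Queue the request and link it to its export immediately."""
        exp = self.exports.setdefault(req.client, Export(req.client))
        exp.pending.append(req)
        self.queue.append(req)

    def handle_next(self):
        """Handle the oldest request and unlink it from its export."""
        req = self.queue.popleft()
        self.exports[req.client].pending.remove(req)
        return req

    def pending_opcodes(self, client):
        """What is still waiting from this client, without scanning the queue."""
        exp = self.exports.get(client)
        return [r.opcode for r in exp.pending] if exp else []
```

With this kind of index, a later policy decision (for example, whether
to evict a client) can ask `pending_opcodes(client)` directly.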

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

* [Lustre-devel] hiding non-fatal communications errors
  2008-06-06  4:41                 ` Andreas Dilger
@ 2008-06-06 11:13                   ` Eric Barton
  2008-06-19 20:24                     ` Nathaniel Rutman
  2008-06-06 12:23                   ` Peter Braam
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Barton @ 2008-06-06 11:13 UTC (permalink / raw)
  To: lustre-devel


Oleg's comments about congestion and the ORNL discussions I've been
involved in are effectively presenting arguments for allowing
expedited communications.  This is possible but comes at a cost.

The "proper" implementation effectively holds an uncongested network
in reserve for expedited communications.  That's a high price to pay
because it pretty well means doubling up all the LNET state - twice
the number of queues/sockets/queuepairs/connections.  That's
unavoidable since we're using these structures for backpressure and
once they're "full" you can only bypass with an additional connection.

You can go a long way towards the ideal by allowing prioritised
traffic to "overtake" everywhere apart from the wire - i.e. all
packets serialise once they have been passed to the comms APIs below
the LNDs, but take priority within the LNDs, LNET (including routers)
and ptlrpc.  
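
The "overtake everywhere apart from the wire" idea can be sketched with
a two-level send queue. This is only an illustration in Python, not how
the LNDs are implemented; `OvertakingQueue` and its method names are
hypothetical. Expedited messages jump ahead of queued normal traffic,
but a message that has already been handed to the comms API cannot be
overtaken.

```python
import heapq
import itertools


class OvertakingQueue:
    """Send queue where expedited messages overtake queued normal ones."""

    EXPEDITED, NORMAL = 0, 1

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # keeps FIFO order within one priority

    def post(self, msg, expedited=False):
        prio = self.EXPEDITED if expedited else self.NORMAL
        heapq.heappush(self._heap, (prio, next(self._seq), msg))

    def next_for_wire(self):
        """Pick the next message to pass down; after this it is serialised."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Note that the sketch only reorders what is still queued; it does nothing
about backpressure once the underlying connection is full, which is the
limitation described above.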

It's hard to say without further thought and/or experiment whether
either of these alternatives actually solves the problem in all
envisaged use cases and doesn't just shift it elsewhere.  For example,
even the "proper" implementation could end up with a logjam on both
low and high priority networks in pathological use cases.  And I'm not
ready to believe that increasing the number of priority levels can add
anything fundamental to the argument.

I think our best bet is to find a way to keep congestion to a minimum
in the first place so that peer ping latency in a single-priority
network can be bounded and kept relatively short (seconds, not
minutes).

Unfortunately, the current algorithms for exploiting network and disk
bandwidth are unbelievably simplistic and invite congestion.
Increasing the number of service threads until performance levels off
ignores completely the issue of service latency.  Allowing a single
client to post sufficient traffic to max the network is fine when it's
the only one, but mad when it's one of 100000.  We're tuning systems
to the point of instability, so of course timeouts have to become
unmanageably long.

Scheduling can be a subtle problem where "obvious" solutions can have
non-obvious consequences.  But it might be a start to give servers more
dynamic control over the number of concurrent requests individual clients
are allowed to submit so that when many clients are active individual
clients only submit one RPC at a time, and when few clients are active
concurrency on these clients can increase.  
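
A minimal sketch of that policy, in Python and purely for illustration:
the function name and both tunables (`server_budget`, `client_max`) are
assumptions, not existing Lustre parameters. The server divides a fixed
budget of concurrent requests among the clients that are currently
active, never going below one RPC per client.

```python
def rpcs_per_client(active_clients, server_budget, client_max=8):
    """Concurrent RPCs each client may have in flight.

    server_budget: total concurrent requests the server is willing to
                   have queued (assumed tunable).
    client_max:    upper bound for a single client (assumed tunable).
    """
    if active_clients <= 0:
        return client_max
    # Share the budget evenly, clamped to the range [1, client_max].
    return max(1, min(client_max, server_budget // active_clients))
```

With a budget of 64, four active clients would each be allowed 8 RPCs
in flight, sixteen clients 4 each, and 100000 clients one each.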

    Cheers,
              Eric

* [Lustre-devel] hiding non-fatal communications errors
  2008-06-06  4:41                 ` Andreas Dilger
  2008-06-06 11:13                   ` Eric Barton
@ 2008-06-06 12:23                   ` Peter Braam
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Braam @ 2008-06-06 12:23 UTC (permalink / raw)
  To: lustre-devel

Sorry yes, network request scheduling; which is btw the most basic instance
of a secondary resource management protocol as Eric described in his post.

Peter


On 6/5/08 10:41 PM, "Andreas Dilger" <adilger@sun.com> wrote:

> On Jun 05, 2008  20:40 -0700, Peter J. Braam wrote:
>> Ah yes.  So monitoring progress is the only thing we can do and with SNS you
>> will be able to get that information long before the request is being
>> handled.
> 
> You mean NRS, instead of SNS, right?

* [Lustre-devel] hiding non-fatal communications errors
  2008-06-06 11:13                   ` Eric Barton
@ 2008-06-19 20:24                     ` Nathaniel Rutman
  0 siblings, 0 replies; 15+ messages in thread
From: Nathaniel Rutman @ 2008-06-19 20:24 UTC (permalink / raw)
  To: lustre-devel

Eric Barton wrote:
> Oleg's comments about congestion and the ORNL discussions I've been
> involved in are effectively presenting arguments for allowing
> expedited communications.  This is possible but comes at a cost.
>   
> The "proper" implementation effectively holds an uncongested network
> in reserve for expedited communications.  That's a high price to pay
> because it pretty well means doubling up all the LNET state - twice
> the number of queues/sockets/queuepairs/connections.  That's
> unavoidable since we're using these structures for backpressure and
> once they're "full" you can only bypass with an additional connection.
>   
That's assuming network congestion is the cause of the lock timeout.
What if the server disk is busy doing who knows what, and the client's cache
flush RPCs are all sitting on the server in the request queue just
waiting for some disk time?  Furthermore, assume that a bunch of other
clients are all doing the same thing, so that we can't simply prioritize
this client's RPCs over everybody else's.

I think the method suggested by Oleg has the most potential in this 
case: "sniff" the incoming RPCs to see if they are cache flushes, and do 
not decide to evict those clients until after those RPCs have been 
processed.  As mentioned, we already do sniff the incoming reqs to check 
adaptive timeout deadlines (ptlrpc_server_handle_req_in).
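
The eviction check could look roughly like the following Python sketch.
It is an illustration under assumptions, not the actual server logic:
`should_evict`, `FLUSH_OPCODES` and the opcode strings are hypothetical,
and the list of queued opcodes is assumed to come from sniffing requests
at arrival.

```python
FLUSH_OPCODES = {"OST_WRITE"}  # hypothetical: opcodes that carry dirty data


def should_evict(lock_deadline, now, queued_opcodes):
    """Decide whether to evict a client whose lock callback timer is running.

    queued_opcodes: opcodes of requests from this client that have
    arrived but have not been processed yet.
    """
    if now < lock_deadline:
        return False   # timer has not expired yet
    if any(op in FLUSH_OPCODES for op in queued_opcodes):
        return False   # client is flushing; wait until the writes are processed
    return True
```

In this sketch the deferral lasts only while flush requests from the
client are actually queued, so a client that has stopped sending data is
still evicted once its timer expires.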

One further thing I would like to do is respond to "easy" RPCs 
immediately (in a reserved thread).  "Easy" would certainly include 
pings, maybe others that have no disk access.  This would allow us to 
free up LNET buffers and other resources, prevent us from evicting 
clients "we haven't heard from in X seconds" (although I just realized 
we could fix that right now in ptlrpc_server_handle_req_in), and more 
quickly determine network and server loading remotely.
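
A small Python sketch of that fast path, for illustration only:
`dispatch`, `EASY_OPCODES` and the `last_heard` table are hypothetical
names, and which opcodes count as "easy" is an assumption. Every arrival
refreshes the client's last-heard time, "easy" requests are set aside
for an immediate reply, and everything else goes to the normal queue.

```python
EASY_OPCODES = {"OBD_PING"}  # hypothetical: requests needing no disk access


def dispatch(requests, now, last_heard):
    """Split arrivals into immediate replies and the normal service queue.

    requests:   iterable of (client, opcode) pairs, in arrival order.
    last_heard: dict client -> time of last arrival, updated for every
                request so a queued client is not mistaken for a silent one.
    """
    immediate, queued = [], []
    for client, opcode in requests:
        last_heard[client] = now
        if opcode in EASY_OPCODES:
            immediate.append((client, opcode))
        else:
            queued.append((client, opcode))
    return immediate, queued
```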
