All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Braam <Peter.Braam@Sun.COM>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] hiding non-fatal communications errors
Date: Thu, 05 Jun 2008 20:37:22 -0700	[thread overview]
Message-ID: <C46DFF02.587F%peter.braam@sun.com> (raw)
In-Reply-To: <C756F41E-2CF2-4DAD-9920-F1CB2A2BECA6@sun.com>




On 6/5/08 9:42 AM, "Robert Read" <rread@sun.com> wrote:

> 
> On Jun 4, 2008, at 21:12 , Oleg Drokin wrote:
> 
>> Hello!
>> 
>> On Jun 4, 2008, at 6:20 PM, Andreas Dilger wrote:
>> 
>>> I suspect this could be adapted to allowing a fixed number of
>>> retries for
>>> server-originated RPCs also.  In the case of LDLM blocking callbacks
>>> sent
>>> to a client, a resend is currently harmless (either the client is
>>> already
>>> processing the callback, or the lock was cancelled).
>> 
>> We need to be careful here and decide on a good strategy on when to
>> resend.
>> E.g. recent case at ORNL (even if a bit pathologic) is they pound
>> through
>> thousands of clients to 4 OSSes via 2 routers. That creates request
>> waiting
>> lists on OSSes well into tens of thousands. When we block on a lock
>> and send
>> blocking AST to the client, it quickly turns around and puts in his
>> data...
>> at the end of our list that takes hundreds of seconds (more than
>> obd_timeout,
>> obviously). No matter how much you resend, it won't help.
> 

I think this is an SNS issue.  Eric?

Peter


> 
> This looks like the poster child for adaptive timeouts, although we
> might want need some version of the early margin update patch on
> 15501.  Have you tried enabling AT?
> 
>> 
>> Now a good argument is before we kill such clients (or do any sort of
>> resend),
>> perhaps it makes sense to check incoming queue to see if there is
>> anything?
>> On the other hand that would be like half of request scheduler,
>> probably, and
>> with such queues, it would take ages, I guess.
>> BTW, AT code changes lock waiting from obd_timeout to obd_timeout/2,
>> why is that?
>> (when AT is disabled). All this is bug 15332.
>> 
> 
> Maybe that's was done to discourage people from disabling AT?
> Seriously, though, I don't know why that was changed. Perhaps it was
> done on b1_6 before to AT landed?
> 
> robert
> 
> 
>> Or was the resend meant just for initial RPC where we do not get a
>> confirmation
>> soon? Yes, there it makes sense to retry soon, but this case above
>> needs to be
>> still considered, since currently we do not retry writeouts too, which
>> has as
>> much of a bad effect on dirty client caches, and of course all the
>> above is very
>> true in such cases too.
>> Also without lnet patch in 15332, where small messages are prioritized
>> on routers,
>> it is way too easy to timeout ast response because of router
>> congestion and no
>> amount of resending would help then.
>> 
>> Bye,
>>     Oleg
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
> 
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

  parent reply	other threads:[~2008-06-06  3:37 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-04 13:25 [Lustre-devel] hiding non-fatal communications errors Eric Barton
2008-06-04 21:17 ` Peter Braam
2008-06-04 22:20   ` Andreas Dilger
2008-06-05  4:12     ` Oleg Drokin
2008-06-05 16:42       ` Robert Read
2008-06-05 16:59         ` Oleg Drokin
2008-06-06  3:29           ` Peter Braam
2008-06-06  3:38             ` Oleg Drokin
2008-06-06  3:40               ` Peter Braam
2008-06-06  4:41                 ` Andreas Dilger
2008-06-06 11:13                   ` Eric Barton
2008-06-19 20:24                     ` Nathaniel Rutman
2008-06-06 12:23                   ` Peter Braam
2008-06-06  3:37         ` Peter Braam [this message]
2008-06-04 23:41   ` Eric Barton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=C46DFF02.587F%peter.braam@sun.com \
    --to=peter.braam@sun.com \
    --cc=lustre-devel@lists.lustre.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.