From: Eric Barton <eeb@sun.com>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] server-side resending & bulk transfer
Date: Fri, 05 Feb 2010 17:12:51 +0000
Message-ID: <002d01caa686$72c40d40$584c27c0$@com>
In-Reply-To: <20100205163524.GW236@granier.hd.free.fr>

Johann,

cc-ing lustre-devel.

Yes, the server could retry the bulk if it times out and this
will be safe for the client since its bulk buffer is auto-unlinked,
so only 1 bulk PUT/GET can match it.  But if the problem happens
on the way back to the server rather than the way out to the client,
you're hosed since the bulk has completed from the client's POV.
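
To illustrate the two cases above, here is a toy sketch of the auto-unlink
behaviour. It is not LNet code: the class, the function and the two
"lost" flags are hypothetical names made up for the illustration, and it
models only the PUT direction with a single retry.

```python
# Toy model only: every name here is hypothetical, not the LNet API.

class ClientBulkTable:
    """Client-side table of posted bulk buffers, keyed by match bits."""

    def __init__(self):
        self._posted = {}

    def post(self, match_bits, nbytes):
        # Post one buffer for one bulk transfer.
        self._posted[match_bits] = bytearray(nbytes)

    def incoming_put(self, match_bits, payload):
        # Auto-unlink: the first matching PUT removes the buffer, so a
        # later PUT with the same match bits finds nothing and is dropped.
        buf = self._posted.pop(match_bits, None)
        if buf is None:
            return None
        buf[:len(payload)] = payload
        return bytes(buf)


def server_put_with_retry(client, match_bits, payload,
                          lost_outbound, lost_return):
    """Return (server_saw_completion, number_of_puts_that_matched)."""
    matched = 0
    for attempt in range(2):            # original PUT plus one retry
        if attempt == 0 and lost_outbound:
            continue                    # PUT never reached the client
        if client.incoming_put(match_bits, payload) is not None:
            matched += 1
            if attempt == 0 and lost_return:
                continue                # completion lost on the way back
            return True, matched
    return False, matched
```

If the first PUT is lost on the way out, the retry matches the still-posted
buffer and the server sees completion. If the loss is on the way back, the
buffer is already unlinked, the retry matches nothing, and the server still
times out even though the client considers the bulk done.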

This should be an exceptional circumstance - i.e. a router has
actually failed - so I think it's better just to stick with the
client retrying from scratch rather than tying down a server thread
until it has decided whether there was a router failure or the
client really crashed.

Roll on the health network! :)

    Cheers,
              Eric

> -----Original Message-----
> From: Johann Lombardi [mailto:johann at sun.com]
> Sent: 05 February 2010 4:35 PM
> To: lustre-tech-leads at sun.com
> Subject: server-side resending & bulk transfer
> 
> Hi,
> 
> As you know, the most important part of server-side resending is to resend
> lock callbacks, since the loss of such a message ends in a client eviction
> (except for glimpses, which are resent indefinitely, causing other problems).
> 
> That being said, another aspect is losing a message during bulk transfer, and
> more particularly the start bulk signal issued by LNET.
> Unlike lock callback RPCs, losing the start bulk signal is not fatal since
> the bulk transfer will time out on the server side, the request will be dropped
> and the client will resend after reconnection. This is indeed harmless,
> but still causes slowdown which could be avoided according to LLNL if we
> try to resend the start bulk signal (bug 21714). Brian Behlendorf's
> proposal is to resend the start bulk signal after the first l_wait_event()
> timeout in ost_brw_write(). However, we don't know if this is safe to do,
> e.g. how does the client react if it receives duplicated start bulk signals?
> 
> Johann
