From: John Hammond <jhammond@ices.utexas.edu>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] question about ldlm_server_glimpse_ast
Date: Fri, 30 Apr 2010 08:00:46 -0500
Message-ID: <4BDAD47E.401@ices.utexas.edu>
In-Reply-To: <w2je5fcbd181004291959wb9790719ic4dc3a2bd9b64eb@mail.gmail.com>

On 04/29/2010 09:59 PM, Jeremy Filizetti wrote:
> In our Lustre WAN environment a few times we've had a link drop for an
> extended period of time which causes problems on systems accessing data
> in the same directory as the remote system that becomes unavailable.
> Our OSSes seem to be stuck in a loop in ptlrpc_queue_wait, called from
> ldlm_server_glimpse_ast.  The remote site is accessed through an LNet
> router which is still available.  The OSS resends requests every
> 7 seconds; each send to the router succeeds, but the request then
> times out, which causes it to loop in ptlrpc_queue_wait.
>
> Looking over the ldlm_server_blocking_ast and ldlm_server_completion_ast
> functions I see they set rq_no_resend = 1, but ldlm_server_glimpse_ast
> does not.  I'm not familiar with the locking in Lustre; is there a
> reason that ldlm_server_glimpse_ast doesn't set rq_no_resend = 1?  This
> would get rid of the loop ptlrpc_queue_wait is stuck in until the client
> comes back, but I'm not sure if it would have other unexpected consequences.

We have the same issue at TACC, and there is a bugzilla entry:

https://bugzilla.lustre.org/show_bug.cgi?id=21937

I tested a patch which set rq_no_resend = 1 for glimpses, and found that
clients only had about 6 seconds to reply before eviction.  Since 
eviction creates the possibility for data loss, a 6 second timeout was 
deemed too short for production.  (With the patch applied, it was easy 
for me to create cases where data was indeed lost.)  I was also able to 
observe some file consistency issues which lasted for a few seconds 
after eviction, as well as a failure of the file operations on the 
evicted client to return an error.  See also:

https://bugzilla.lustre.org/show_bug.cgi?id=22360
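
To make the trade-off concrete, here is a toy model in C.  It is not
Lustre code: struct model_request, model_queue_wait, the result enum,
and the max_wait cap are all hypothetical names invented for this
sketch, and the only thing borrowed from the real code is the flag name
rq_no_resend.  It assumes a fixed per-attempt timeout and a client that
becomes reachable again at a known time, and shows that without the
flag the server keeps resending until the client returns, while with
the flag the first timeout ends the wait and leads to eviction.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a server-to-client AST request.
 * This is NOT the real struct ptlrpc_request. */
struct model_request {
    bool rq_no_resend; /* mirrors the flag name discussed in this thread */
    int  timeout_sec;  /* assumed fixed per-attempt timeout */
};

enum model_result {
    MODEL_REPLIED, /* client answered an attempt */
    MODEL_EVICTED, /* first timeout with rq_no_resend set */
    MODEL_GAVE_UP  /* simulation cap reached while still resending */
};

/*
 * Toy model of the wait loop.
 *   client_back_at: simulated second at which the client is reachable again
 *   max_wait:       cap so the simulation always terminates (an assumption
 *                   of this sketch, not a Lustre parameter)
 *   elapsed:        out parameter, simulated seconds spent waiting
 */
static enum model_result
model_queue_wait(const struct model_request *req, int client_back_at,
                 int max_wait, int *elapsed)
{
    int now = 0;

    assert(req->timeout_sec > 0);

    for (;;) {
        /* An attempt sent once the client is reachable gets a reply. */
        if (now >= client_back_at) {
            *elapsed = now;
            return MODEL_REPLIED;
        }

        /* Otherwise this attempt times out. */
        now += req->timeout_sec;

        /* With rq_no_resend set there is no second attempt. */
        if (req->rq_no_resend) {
            *elapsed = now;
            return MODEL_EVICTED;
        }

        /* Without it, keep resending until the cap. */
        if (now >= max_wait) {
            *elapsed = now;
            return MODEL_GAVE_UP;
        }
    }
}
```

In this model, with a 6 second timeout and a client that comes back
after 20 seconds, the resending request eventually gets its reply,
while the no-resend request gives up at the first timeout, which is the
short eviction window described above.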

-John

-- 
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu
(512) 471-9304
