From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from zeniv.linux.org.uk ([195.92.253.2]:34938 "EHLO
	ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1030237AbcBQULY (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 17 Feb 2016 15:11:24 -0500
Date: Wed, 17 Feb 2016 20:11:20 +0000
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Mike Marshall <hubcap@omnibond.com>
Cc: Martin Brandenburg <martin@omnibond.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Stephen Rothwell <sfr@canb.auug.org.au>
Subject: Re: Orangefs ABI documentation
Message-ID: <20160217201120.GN17997@ZenIV.linux.org.uk>
References: <CAOg9mSRgGghk=RAmWZ_MLgp=3DkR+2iV5dU8ffaAFK8aYmM_xQ@mail.gmail.com>
 <20160214234312.GX17997@ZenIV.linux.org.uk>
 <CAOg9mSSsztHyRsQ=BAA9FoEMo7XEdgTwYHosqvh4ZjykDapj2A@mail.gmail.com>
 <20160215184554.GY17997@ZenIV.linux.org.uk>
 <CA+D=wkgGZ9A8Qa5C6q3cROrr+Gp=jsgowvcbOs-22UU=aVT7Wg@mail.gmail.com>
 <20160215230434.GZ17997@ZenIV.linux.org.uk>
 <CAOg9mSSdsVL8tRiAojg=UCNPQ6iPcthtdowH9kWyiWnXUvTEHg@mail.gmail.com>
 <20160216233609.GE17997@ZenIV.linux.org.uk>
 <20160216235441.GF17997@ZenIV.linux.org.uk>
 <CAOg9mSR0QZOy-JAesN32o51omxdOQyfUh-Zj8k7ZohKLaa3GxQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAOg9mSR0QZOy-JAesN32o51omxdOQyfUh-Zj8k7ZohKLaa3GxQ@mail.gmail.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Wed, Feb 17, 2016 at 02:24:34PM -0500, Mike Marshall wrote:
> It is still busted, I've been trying to find clues as to why...

With reinit_completion() added?

> Maybe this is relevant:
> 
> Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
> service_operation: orangefs_create op ffff880015698000
> ffff880015698000 got past is_daemon_in_service
> 
> ... lots of stuff ...
> 
> w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN
> 
> first client core is NOT in service
> second op to get EAGAIN
>           ...
> last client core is NOT in service
> 
> ... lots of stuff ...
> 
> service_operation returns to orangef_create with handle 0 fsid 0 ret 0
> for MAILBOX2.CPT
> 
> I'm guessing you want me to wait to do the switching of my branch
> until we fix this (last?) thing, let me know...

What I'd like to check is the value of op->waitq.done at retry_servicing.
If we get there with a non-zero value, we've a problem.  BTW, do you
hit any of gossip_err() in orangefs_clean_up_interrupted_operation()?