From: Alex Elder <elder@inktank.com>
To: Nick Bartos <nick@pistoncloud.com>
Cc: Sage Weil <sage@inktank.com>, Gregory Farnum <greg@inktank.com>,
	Josh Durgin <josh.durgin@inktank.com>,
	Mandell Degerness <mandell@pistoncloud.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: rbd map command hangs for 15 minutes during system start up
Date: Mon, 31 Dec 2012 12:22:28 -0600
Message-ID: <50E1D7E4.1040305@inktank.com>
In-Reply-To: <50DB6DD7.3050308@inktank.com>

On 12/26/2012 03:36 PM, Alex Elder wrote:
> On 12/26/2012 11:45 AM, Nick Bartos wrote:
>> Here's a log with a hang on the updated branch:
>>
>> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log
> 
> OK, new naming scheme.  Please try:  wip-nick-1

Now that we've got this resolved, I've created an updated
"stable" branch with ceph-related bug fixes, based on the
latest 3.5 stable branch, 3.5.7.  It contains a bunch of
other bug fixes that what you had been working with did
not have.

I'm starting my own testing with this branch now.  But it
would be great if you'd give it a try as well, since I
know you're a "real" user of this code base.

It's available as branch "linux-3.5.7-ceph" on the
ceph-client git repository.  Thanks a lot.

					-Alex

> 
> I added another simple fix, but then collapsed three commits
> into one, and added one more (somewhat unrelated).
> 
> I've done simple testing with this and will subject it to
> more rigorous testing shortly.  I wanted to make it available
> to you quickly though.
> 
> 					-Alex
> 
>>
>> On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder <elder@inktank.com> wrote:
>>> On 12/20/2012 11:48 AM, Nick Bartos wrote:
>>>> Unfortunately, we still have a hang:
>>>>
>>>> https://gist.github.com/4347052/download
>>>
>>> The saga continues, and each time we get a little more
>>> information.  Please try branch: "wip-nick-newerest"
>>>
>>> Thank you.
>>>
>>>                                         -Alex
>>>
>>>
>>>> On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder <elder@inktank.com> wrote:
>>>>> On 12/19/2012 03:25 PM, Alex Elder wrote:
>>>>>> On 12/18/2012 12:05 PM, Nick Bartos wrote:
>>>>>>> I've added the output of "ps -ef" in addition to triggering a trace
>>>>>>> when a hang is detected.  Not much is generally running at that point,
>>>>>>> but you can have a look:
>>>>>>>
>>>>>>> https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
>>>>>>
>>>>>> This helped a lot.  I updated the bug with a little more info.
>>>>>>
>>>>>>     http://tracker.newdream.net/issues/3519
>>>>>>
>>>>>> I also think I have now found something that could explain what you
>>>>>> are seeing, and am developing a fix.  I'll provide you an update
>>>>>> as soon as I have tested what I come up with, almost certainly
>>>>>> this afternoon.
>>>>>
>>>>> Nick, I have a new branch for you to try with a new fix in place.
>>>>> As you might have predicted, it's named "wip-nick-newest".
>>>>>
>>>>> Please give it a try to see if it resolved the hang you've
>>>>> been seeing and let me know how it goes.  If it continues
>>>>> to hang, please provide the logs as you have before, it's
>>>>> been very helpful.
>>>>>
>>>>> Thanks a lot.
>>>>>
>>>>>                                         -Alex
>>>>>>
>>>>>>                                       -Alex
>>>>>>
>>>>>>> Is it possible that there is some sort of deadlock going on?  We are
>>>>>>> doing the rbd maps (and subsequent filesystem mounts) on the same
>>>>>>> systems which are running the ceph-osd and ceph-mon processes.  To get
>>>>>>> around the 'sync' deadlock problem, we are using a patch from Sage
>>>>>>>> which ignores system-wide syncs on filesystems mounted with the
>>>>>>> 'mand' option (and we mount the underlying osd filesystems with
>>>>>>> 'mand').  However I am wondering if there is potential for other types
>>>>>>> of deadlocks in this environment.
>>>>>>>
>>>>>>> Also, we recently saw an rbd hang in a much older version, running
>>>>>>>> kernel 3.5.3 with only the sync hack patch, alongside ceph 0.48.1.
>>>>>>>> It's possible that this issue has been around for some time, and the
>>>>>>>> recent patches just made it happen more often (and thus more reproducibly)
>>>>>>> for us.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Dec 18, 2012 at 8:09 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>> On 12/17/2012 11:12 AM, Nick Bartos wrote:
>>>>>>>>> Here's a log with the rbd debugging enabled:
>>>>>>>>>
>>>>>>>>> https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log
>>>>>>>>>
>>>>>>>>> On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder <elder@inktank.com> wrote:
>>>>>>>>>> On 12/14/2012 10:53 AM, Nick Bartos wrote:
>>>>>>>>>>>> Yes, I was only enabling debugging for libceph.  I'm adding debugging
>>>>>>>>>>> for rbd as well.  I'll do a repro later today when a test cluster
>>>>>>>>>>> opens up.
>>>>>>>>>>
>>>>>>>>>> Excellent, thank you.   -Alex
>>>>>>>>
>>>>>>>> I looked through these debugging messages.  Looking only at the
>>>>>>>> rbd debugging, what I see seems to indicate that rbd is idle at
>>>>>>>> the point the "hang" seems to start.  This suggests that the hang
>>>>>>>> is not due to rbd itself, but rather whatever it is that might
>>>>>>>> be responsible for using the rbd image once it has been mapped.
>>>>>>>>
>>>>>>>> Is that possible?  I don't know what process you have that is
>>>>>>>> mapping the rbd image, and what is supposed to be the next thing
>>>>>>>> it does.  (I realize this may not make a lot of sense, given
>>>>>>>>>> a patch in rbd seems to have caused the hang to begin occurring.)
>>>>>>>>
>>>>>>>> Also note that the debugging information available (i.e., the
>>>>>>>> lines in the code that can output debugging information) may
>>>>>>>> well be incomplete.  So if you don't find anything it may be
>>>>>>>> necessary to provide you with another update which might include
>>>>>>>> more debugging.
>>>>>>>>
>>>>>>>> Anyway, could you provide a little more context about what
>>>>>>>> is going on sort of *around* rbd when activity seems to stop?
>>>>>>>>
>>>>>>>> Thanks a lot.
>>>>>>>>
>>>>>>>>                                         -Alex
>>>>>>
>>>>>
>>>
> 



Thread overview: 56+ messages
2012-11-08 22:10 rbd map command hangs for 15 minutes during system start up Mandell Degerness
2012-11-09  1:43 ` Josh Durgin
2012-11-12 22:19   ` Nick Bartos
2012-11-12 23:16     ` Sage Weil
2012-11-16  0:21       ` Nick Bartos
2012-11-16  0:25         ` Sage Weil
2012-11-16 18:36           ` Nick Bartos
2012-11-16 19:16             ` Sage Weil
2012-11-16 22:01               ` Nick Bartos
2012-11-16 22:13                 ` Sage Weil
2012-11-16 22:16                   ` Nick Bartos
2012-11-16 22:21                     ` Sage Weil
2012-11-19 23:04                       ` Nick Bartos
2012-11-19 23:34                         ` Gregory Farnum
2012-11-20 21:53                           ` Nick Bartos
2012-11-21  1:31                             ` Nick Bartos
2012-11-21 16:50                               ` Sage Weil
2012-11-21 17:02                                 ` Nick Bartos
2012-11-21 17:34                                   ` Nick Bartos
2012-11-21 21:41                                     ` Nick Bartos
2012-11-22  4:47                                       ` Sage Weil
2012-11-22  5:49                                         ` Nick Bartos
2012-11-22 18:04                                           ` Nick Bartos
2012-11-29 20:37                                             ` Alex Elder
2012-11-30 18:49                                               ` Nick Bartos
2012-11-30 19:10                                                 ` Alex Elder
2012-11-30 19:31                                                   ` Sage Weil
2012-11-30 23:22                                               ` Alex Elder
2012-12-02  5:34                                                 ` Nick Bartos
2012-12-03  4:43                                                   ` Alex Elder
2012-12-10 21:57                                                     ` Alex Elder
2012-12-11 17:26                                                       ` Nick Bartos
2012-12-11 18:01                                                         ` Alex Elder
2012-12-11 19:44                                                           ` Alex Elder
2012-12-13  0:57                                                             ` Nick Bartos
2012-12-13 19:00                                                               ` Nick Bartos
2012-12-13 19:07                                                                 ` Alex Elder
2012-12-14 16:46                                                                 ` Alex Elder
2012-12-14 16:53                                                                   ` Nick Bartos
2012-12-14 18:03                                                                     ` Alex Elder
2012-12-17 17:12                                                                       ` Nick Bartos
2012-12-18 16:09                                                                         ` Alex Elder
2012-12-18 18:05                                                                           ` Nick Bartos
2012-12-19 21:25                                                                             ` Alex Elder
2012-12-19 22:42                                                                               ` Alex Elder
2012-12-20 17:48                                                                                 ` Nick Bartos
2012-12-20 21:59                                                                                   ` Alex Elder
2012-12-26 17:45                                                                                     ` Nick Bartos
2012-12-26 17:50                                                                                       ` Alex Elder
2012-12-26 21:36                                                                                       ` Alex Elder
2012-12-27 17:33                                                                                         ` Nick Bartos
2012-12-27 18:43                                                                                           ` Sage Weil
2012-12-27 19:41                                                                                             ` Alex Elder
2012-12-31 18:22                                                                                         ` Alex Elder [this message]
2013-01-02 15:56                                                                                           ` Nick Bartos
2012-11-16 22:23                     ` Gregory Farnum
