From: Alex Elder <elder@inktank.com>
To: Nick Bartos <nick@pistoncloud.com>
Cc: Sage Weil <sage@inktank.com>, Gregory Farnum <greg@inktank.com>,
	Josh Durgin <josh.durgin@inktank.com>,
	Mandell Degerness <mandell@pistoncloud.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: rbd map command hangs for 15 minutes during system start up
Date: Fri, 30 Nov 2012 17:22:50 -0600
Message-ID: <50B93FCA.2060801@inktank.com>
In-Reply-To: <50B7C788.6040404@inktank.com>

On 11/29/2012 02:37 PM, Alex Elder wrote:
> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>> Here are the ceph log messages (including the libceph kernel debug
>> stuff you asked for) from a node boot with the rbd command hung for a
>> couple of minutes:

I'm sorry, but I did something stupid...

Yes, the branch I gave you includes these fixes.  However
it does *not* include the commit that was giving you trouble
to begin with (patch #50 in your list).

So...

I have updated that same branch (wip-nick) to contain:
- Linux 3.5.5
- Plus the first *50* (not 49) patches you listed
- Plus the ones I added before.

The new commit id for that branch begins with be3198d6.
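
For anyone pulling the update, here is a minimal sketch of one
way to confirm that the fetched tip is the new commit and not
the old one.  It assumes it is run inside an existing git tree
and that the repository is reachable at the GitHub location
given in the earlier message (the branch may not exist there
indefinitely).  The function names id_has_prefix and
check_wip_nick are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the full commit id ($1)
# begins with the expected abbreviated id ($2).
id_has_prefix() {
    case "$1" in
        "$2"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Hypothetical helper: fetch the tip of wip-nick and compare it with
# the commit id quoted in this message.  Needs network access.
check_wip_nick() {
    git fetch https://github.com/ceph/ceph-client.git wip-nick || return 1
    tip=$(git rev-parse FETCH_HEAD) || return 1
    if id_has_prefix "$tip" be3198d6; then
        echo "wip-nick is at $tip (updated branch)"
    else
        echo "wip-nick is at $tip, expected be3198d6..." >&2
        return 1
    fi
}
```

Calling check_wip_nick from the top of a kernel tree should
report the updated branch; a tip beginning with dd9323aa would
mean the earlier version of the branch was fetched.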

I'm really sorry for this mistake.  Please try this new
branch and report back what you find.

					-Alex


> Nick, I have put together a branch that includes two fixes
> that might be helpful.  I don't expect these fixes will
> necessarily *fix* what you're seeing, but one of them
> pulls a big hunk of processing out of the picture and
> might help eliminate some potential causes.  I had to
> pull in several other patches as prerequisites in order
> to get those fixes to apply cleanly.
> 
> Would you be able to give it a try, and let us know what
> results you get?  The branch contains:
> - Linux 3.5.5
> - Plus the first 49 patches you listed
> - Plus four patches, which are prerequisites...
>     libceph: define ceph_extract_encoded_string()
>     rbd: define some new format constants
>     rbd: define rbd_dev_image_id()
>     rbd: kill create_snap sysfs entry
> - ...for these two bug fixes:
>     libceph: remove 'osdtimeout' option
>     ceph: don't reference req after put
> 
> The branch is available in the ceph-client git repository
> under the name "wip-nick" and has commit id dd9323aa.
>     https://github.com/ceph/ceph-client/tree/wip-nick
> 
>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
> 
> This full debug output is very helpful.  Please supply
> that again as well.
> 
> Thanks.
> 
> 					-Alex
> 
>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos <nick@pistoncloud.com> wrote:
>>> It's very easy to reproduce now with my automated install script; the
>>> most I've seen it succeed with that patch is 2 in a row before hanging
>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>> when I'm a bit more rested and my brain is working better.
>>>
>>> Yes, during this the OSDs are probably all syncing up.  All the osd and
>>> mon daemons have started by the time the rbd commands are run, though.
>>>
>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil <sage@inktank.com> wrote:
>>>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>>>> FYI the build which included all 3.5 backports except patch #50 is
>>>>> still going strong after 21 builds.
>>>>
>>>> Okay, that one at least makes some sense.  I've opened
>>>>
>>>>         http://tracker.newdream.net/issues/3519
>>>>
>>>> How easy is this to reproduce?  If it is something you can trigger with
>>>> debugging enabled ('echo module libceph +p >
>>>> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>>>>
>>>> I'm guessing that during this startup time the OSDs are still in the
>>>> process of starting?
>>>>
>>>> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
>>>> thrashing OSDs could hit this.
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>>
>>>>>
>>>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>> With 8 successful installs already done, I'm reasonably confident that
>>>>>> it's patch #50.  I'm making another build which applies all patches
>>>>>> from the 3.5 backport branch, excluding that specific one.  I'll let
>>>>>> you know if that turns up any unexpected failures.
>>>>>>
>>>>>> What will the potential fallout be from removing that specific patch?
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos <nick@pistoncloud.com> wrote:
>>>>>>> It's really looking like it's the
>>>>>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>>>>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>>>>>> So far I have gone through 4 successful installs with no hang with
>>>>>>> only 1-49 applied.  I'm still leaving my test running to make sure it's
>>>>>>> not a fluke, but since previously it hung within the first couple of
>>>>>>> builds, it really looks like this is where the problem originated.
>>>>>>>
>>>>>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>>>>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>>>>>> 3-libceph_rename_socket_callbacks.patch
>>>>>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>>>>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>>>>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>>>>>> 7-libceph_start_tracking_connection_socket_state.patch
>>>>>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>>>>>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>>>>>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>>>>>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>>>>>> 12-libceph_init_monitor_connection_when_opening.patch
>>>>>>> 13-libceph_fully_initialize_connection_in_con_init.patch
>>>>>>> 14-libceph_tweak_ceph_alloc_msg.patch
>>>>>>> 15-libceph_have_messages_point_to_their_connection.patch
>>>>>>> 16-libceph_have_messages_take_a_connection_reference.patch
>>>>>>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>>>>>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>>>>>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>>>>>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>>>>>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>>>>>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>>>>>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>>>>>> 24-libceph_use_con_get_put_methods.patch
>>>>>>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>>>>>> 26-libceph_encapsulate_out_message_data_setup.patch
>>>>>>> 27-libceph_encapsulate_advancing_msg_page.patch
>>>>>>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>>>>>> 29-libceph_move_init_bio__functions_up.patch
>>>>>>> 30-libceph_move_init_of_bio_iter.patch
>>>>>>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>>>>>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>>>>>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>>>>>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>>>>>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>>>>>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>>>>>> 37-libceph_clear_NEGOTIATING_when_done.patch
>>>>>>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>>>>>> 39-libceph_separate_banner_and_connect_writes.patch
>>>>>>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>>>>>> 41-libceph_small_changes_to_messenger.c.patch
>>>>>>> 42-libceph_add_some_fine_ASCII_art.patch
>>>>>>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>>>>>>> 44-libceph_initialize_mon_client_con_only_once.patch
>>>>>>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>>>>>>> 46-libceph_initialize_msgpool_message_types.patch
>>>>>>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>>>>>>> 48-libceph_report_socket_read_write_error_message.patch
>>>>>>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>>>>>>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil <sage@inktank.com> wrote:
>>>>>>>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>>>>>>>
>>>>>>>> sage
>>>>>
>>>>>
> 
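
On Sage's quoted suggestion of a test that maps and unmaps in
a loop: a rough sketch of such a loop follows.  It does not
thrash OSDs itself; that would have to be driven separately.
The function name map_unmap_loop, the RBD_CMD override, and
the image and device arguments are all assumptions made for
illustration, and the device an image appears as should be
checked on the system in question.

```shell
#!/bin/sh
# Optionally enable libceph debug output first, using the command
# quoted earlier in this thread (run as root):
#   echo module libceph +p > /sys/kernel/debug/dynamic_debug/control

# RBD_CMD can be pointed at a stub for a dry run (an assumption,
# not an rbd feature).
RBD_CMD=${RBD_CMD:-rbd}

# map_unmap_loop IMAGE DEVICE COUNT
# Map IMAGE and unmap DEVICE, COUNT times in a row.  A hang would
# show up as the loop stalling rather than as an error.
map_unmap_loop() {
    image=$1
    dev=$2
    count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        $RBD_CMD map "$image" || { echo "map failed on pass $i" >&2; return 1; }
        $RBD_CMD unmap "$dev" || { echo "unmap failed on pass $i" >&2; return 1; }
        i=$((i + 1))
    done
    echo "completed $count map/unmap passes"
}
```

For example, map_unmap_loop myimage /dev/rbd0 100 would run
one hundred passes against a hypothetical image named myimage.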


