From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dan Mick
Subject: Re: Possible deadlock condition
Date: Mon, 18 Jun 2012 16:34:09 -0700
Message-ID: <4FDFBAF1.9090109@inktank.com>
References: <4FDFB26E.1060109@inktank.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-pb0-f46.google.com ([209.85.160.46]:33172 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752544Ab2FRXeM (ORCPT ); Mon, 18 Jun 2012 19:34:12 -0400
Received: by pbbrp8 with SMTP id rp8so8850341pbb.19 for ; Mon, 18 Jun 2012 16:34:12 -0700 (PDT)
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Mandell Degerness
Cc: ceph-devel@vger.kernel.org

I don't know enough to know if there's a connection, but I do note this
prior thread that sounds kinda similar:

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/6574

On 06/18/2012 04:08 PM, Mandell Degerness wrote:
> None of the OSDs seem to be more than 82% full. I didn't think we were
> running quite that close to the margin, but it is still far from
> actually full.
>
>
> On Mon, Jun 18, 2012 at 3:57 PM, Dan Mick wrote:
>> Does the xfs on the OSD have plenty of free space left, or could this be an
>> allocation deadlock?
>>
>>
>> On 06/18/2012 03:17 PM, Mandell Degerness wrote:
>>>
>>> Here is, perhaps, a more useful traceback from a different run of
>>> tests that we just ran into:
>>>
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.680815] INFO: task flush-254:0:29582 blocked for more than 120 seconds.
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681040] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681458] flush-254:0 D ffff880bd9ca2fc0 0 29582 2 0x00000000
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681740] ffff88006e51d160 0000000000000046 0000000000000002 ffff88061b362040
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682173] ffff88006e51d160 00000000000120c0 00000000000120c0 00000000000120c0
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682659] ffff88006e51dfd8 00000000000120c0 00000000000120c0 ffff88006e51dfd8
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683088] Call Trace:
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683302] [] schedule+0x5a/0x5c
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683514] [] schedule_timeout+0x36/0xe3
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683784] [] ? physflat_send_IPI_mask+0xe/0x10
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683999] [] ? native_smp_send_reschedule+0x46/0x48
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684219] [] ? list_move_tail+0x27/0x2c
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684432] [] __down_common+0x90/0xd4
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684708] [] ? _xfs_buf_find+0x17f/0x210
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684925] [] __down+0x1d/0x1f
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685139] [] down+0x2d/0x3d
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685350] [] xfs_buf_lock+0x76/0xaf
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685565] [] _xfs_buf_find+0x17f/0x210
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685836] [] xfs_buf_get+0x2a/0x177
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686052] [] xfs_buf_read+0x1f/0xca
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686270] [] xfs_trans_read_buf+0x205/0x308
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686490] [] xfs_btree_read_buf_block.clone.22+0x4f/0xa7
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687015] [] ? xfs_trans_log_buf+0xb2/0xc1
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687232] [] xfs_btree_lookup_get_block+0x84/0xac
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687449] [] xfs_btree_lookup+0x12b/0x3dc
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687721] [] ? xfs_alloc_vextent+0x447/0x469
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687939] [] xfs_bmbt_lookup_eq+0x1f/0x21
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688156] [] xfs_bmap_add_extent_delay_real+0x5b5/0xfec
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688378] [] ? kmem_cache_alloc+0x87/0xf3
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688650] [] ? xfs_bmbt_init_cursor+0x3f/0x107
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688867] [] xfs_bmapi_allocate+0x1f6/0x23a
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689084] [] ? xfs_iext_bno_to_irec+0x95/0xb9
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689301] [] xfs_bmapi_write+0x32d/0x5a2
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689519] [] xfs_iomap_write_allocate+0x1a5/0x29f
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689797] [] xfs_map_blocks+0x13e/0x1dd
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690016] [] xfs_vm_writepage+0x24e/0x410
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690233] [] __writepage+0x17/0x30
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690446] [] write_cache_pages+0x276/0x3c8
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690693] [] ? set_page_dirty+0x60/0x60
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690908] [] generic_writepages+0x45/0x5c
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691123] [] xfs_vm_writepages+0x4d/0x54
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691337] [] do_writepages+0x21/0x2a
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691552] [] writeback_single_inode+0x12a/0x2cc
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691800] [] writeback_sb_inodes+0x174/0x215
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.692016] [] __writeback_inodes_wb+0x78/0xb9
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.692231] [] wb_writeback+0x136/0x22a
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.692444] [] ? determine_dirtyable_memory+0x1d/0x26
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.692692] [] wb_do_writeback+0x19c/0x1b7
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.692907] [] bdi_writeback_thread+0x8c/0x20f
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.693122] [] ? wb_do_writeback+0x1b7/0x1b7
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.693336] [] ? wb_do_writeback+0x1b7/0x1b7
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.693553] [] kthread+0x82/0x8a
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.693803] [] kernel_thread_helper+0x4/0x10
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.694018] [] ? kthread_worker_fn+0x13b/0x13b
>>> Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.694232] [] ? gs_change+0xb/0xb
>>>
>>>
>>> On Mon, Jun 18, 2012 at 11:37 AM, Mandell Degerness wrote:
>>>>
>>>> We've been seeing random issues of apparent deadlocks. We are running
>>>> ceph 0.47 on kernel 3.2.18. OSDs are running on XFS file system.
>>>> mysqld (which ran into the particular problems in the attached kernel
>>>> log) is running on an RBD with XFS (mounted on a system which includes
>>>> OSDs). We have sync_fs, and gcc ver 4.5.3-r2. The mysqld process in
>>>> both instances returned an error to the calling process.
>>>>
>>>> Regards,
>>>> Mandell Degerness
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html