* Circular lock / deadlock in kernel client
@ 2011-11-30 12:21 Amon Ott
  2011-11-30 17:20 ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Amon Ott @ 2011-11-30 12:21 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 917 bytes --]

Hi!

With some kernel debug options for soft and hard lockup detection enabled, I 
got some useful traces. My kernel is 3.1.4, to which I have ported everything 
from the ceph-client for-linus branch that is suitable for 3.1. If needed, I 
can make my exact ceph code available.

Traces are attached. It seems that two dependent locks can be acquired in 
different orders in different parts of the code, which can lead to a deadlock. 
Additionally, I am still trying to reproduce a partial lockup of single 
directories with this debugging enabled. Those lockups are likely related to 
directories being mutex-locked without being unlocked properly.
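
The attached reports all have the same shape: lockdep remembers the order in 
which lock classes were taken and warns as soon as a new acquisition would 
close a cycle. A minimal user-space sketch of that bookkeeping follows. It is 
not the kernel's lockdep code; the LockOrderGraph class and its methods are 
invented for illustration, and only the two lock names come from circ1.log.

```python
from collections import defaultdict


class LockOrderGraph:
    """Toy lock-order tracker: an edge A -> B means B was taken while A was held."""

    def __init__(self):
        self.edges = defaultdict(set)

    def _reaches(self, start, target):
        # Depth-first search: is `target` reachable from `start`?
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def acquire(self, held, new):
        """Record taking `new` while holding `held`; return True if this closes a cycle."""
        circular = any(self._reaches(new, h) for h in held)
        for h in held:
            self.edges[h].add(new)
        return circular


graph = LockOrderGraph()
# Order shown for CPU1 in the "Possible unsafe locking scenario" of circ1.log:
# i_lock first, then s_cap_lock.
first = graph.acquire(held=["i_lock"], new="s_cap_lock")
# Order shown for CPU0: s_cap_lock held, then igrab() wants i_lock.
second = graph.acquire(held=["s_cap_lock"], new="i_lock")
print(first, second)
```

The second call is the one that reports a cycle, which mirrors why the warning 
fires inside igrab() and not at the earlier acquisition.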

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649

[-- Attachment #2: circ1.log --]
[-- Type: text/x-log, Size: 7178 bytes --]

Nov 30 10:40:21 tgpro1 kernel: =======================================================
Nov 30 10:40:21 tgpro1 kernel: [ INFO: possible circular locking dependency detected ]
Nov 30 10:40:21 tgpro1 kernel: 3.1.3-rsbac #1
Nov 30 10:40:21 tgpro1 kernel: -------------------------------------------------------
Nov 30 10:40:21 tgpro1 kernel: kworker/0:0/8787 is trying to acquire lock:
Nov 30 10:40:21 tgpro1 kernel:  (&sb->s_type->i_lock_key#16){+.+...}, at: [<000f7176>] igrab+0x11/0x41
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel: but task is already holding lock:
Nov 30 10:40:21 tgpro1 kernel:  (&(&s->s_cap_lock)->rlock){+.+...}, at: [<006e0de4>] iterate_session_caps+0x40/0x17d [ceph]
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel: which lock already depends on the new lock.
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel: the existing dependency chain (in reverse order) is:
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel: -> #1 (&(&s->s_cap_lock)->rlock){+.+...}:
Nov 30 10:40:21 tgpro1 kernel:        [<0009038d>] lock_acquire+0x42/0x59
Nov 30 10:40:21 tgpro1 kernel:        [<0052f167>] _raw_spin_lock+0x24/0x33
Nov 30 10:40:21 tgpro1 kernel:        [<006dc0bc>] ceph_add_cap+0x302/0x551 [ceph]
Nov 30 10:40:21 tgpro1 kernel:        [<006cfbfd>] fill_inode+0x5ef/0x72e [ceph]
Nov 30 10:40:21 tgpro1 kernel:        [<006d0d6b>] ceph_fill_trace+0x663/0x6c5 [ceph]
Nov 30 10:40:21 tgpro1 kernel:        [<006e4c6e>] dispatch+0xafe/0x10e7 [ceph]
Nov 30 10:40:21 tgpro1 kernel:        [<006a444d>] con_work+0x14f7/0x16c6 [libceph]
Nov 30 10:40:21 tgpro1 kernel:        [<0007a24b>] process_one_work+0x223/0x391
Nov 30 10:40:21 tgpro1 kernel:        [<0007a747>] worker_thread+0x182/0x2d7
Nov 30 10:40:21 tgpro1 kernel:        [<0007d767>] kthread+0x62/0x67
Nov 30 10:40:21 tgpro1 kernel:        [<00530b26>] kernel_thread_helper+0x6/0xd
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel: -> #0 (&sb->s_type->i_lock_key#16){+.+...}:
Nov 30 10:40:21 tgpro1 kernel:        [<0008fa3e>] __lock_acquire+0xe79/0x1425
Nov 30 10:40:21 tgpro1 kernel:        [<0009038d>] lock_acquire+0x42/0x59
Nov 30 10:40:21 tgpro1 kernel:        [<0052f167>] _raw_spin_lock+0x24/0x33
Nov 30 10:40:21 tgpro1 kernel:        [<000f7176>] igrab+0x11/0x41
Nov 30 10:40:21 tgpro1 kernel:        [<006e0e14>] iterate_session_caps+0x70/0x17d [ceph]
Nov 30 10:40:21 tgpro1 kernel:        [<006e3bca>] send_mds_reconnect+0x319/0x46a [ceph]
Nov 30 10:40:21 tgpro1 kernel:        [<006e4035>] ceph_mdsc_handle_map+0x2f2/0x42d [ceph]
Nov 30 10:40:21 tgpro1 kernel:        [<006ce00d>] extra_mon_dispatch+0x18/0x1f [ceph]
Nov 30 10:40:21 tgpro1 kernel:        [<006a63dd>] dispatch+0x4f7/0x52a [libceph]
Nov 30 10:40:21 tgpro1 kernel:        [<006a444d>] con_work+0x14f7/0x16c6 [libceph]
Nov 30 10:40:21 tgpro1 kernel:        [<0007a24b>] process_one_work+0x223/0x391
Nov 30 10:40:21 tgpro1 kernel:        [<0007a747>] worker_thread+0x182/0x2d7
Nov 30 10:40:21 tgpro1 kernel:        [<0007d767>] kthread+0x62/0x67
Nov 30 10:40:21 tgpro1 kernel:        [<00530b26>] kernel_thread_helper+0x6/0xd
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel: other info that might help us debug this:
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel:  Possible unsafe locking scenario:
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel:        CPU0                    CPU1
Nov 30 10:40:21 tgpro1 kernel:        ----                    ----
Nov 30 10:40:21 tgpro1 kernel:   lock(&(&s->s_cap_lock)->rlock);
Nov 30 10:40:21 tgpro1 kernel:                                lock(&sb->s_type->i_lock_key);
Nov 30 10:40:21 tgpro1 kernel:                                lock(&(&s->s_cap_lock)->rlock);
Nov 30 10:40:21 tgpro1 kernel:   lock(&sb->s_type->i_lock_key);
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel:  *** DEADLOCK ***
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel: 5 locks held by kworker/0:0/8787:
Nov 30 10:40:21 tgpro1 kernel:  #0:  (ceph-msgr){.+.+.+}, at: [<0007a1ff>] process_one_work+0x1d7/0x391
Nov 30 10:40:21 tgpro1 kernel:  #1:  ((&(&con->work)->work)){+.+.+.}, at: [<0007a1ff>] process_one_work+0x1d7/0x391
Nov 30 10:40:21 tgpro1 kernel:  #2:  (&s->s_mutex){+.+.+.}, at: [<006e3954>] send_mds_reconnect+0xa3/0x46a [ceph]
Nov 30 10:40:21 tgpro1 kernel:  #3:  (&mdsc->snap_rwsem){+++++.}, at: [<006e3a4c>] send_mds_reconnect+0x19b/0x46a [ceph]
Nov 30 10:40:21 tgpro1 kernel:  #4:  (&(&s->s_cap_lock)->rlock){+.+...}, at: [<006e0de4>] iterate_session_caps+0x40/0x17d [ceph]
Nov 30 10:40:21 tgpro1 kernel: 
Nov 30 10:40:21 tgpro1 kernel: stack backtrace:
Nov 30 10:40:21 tgpro1 kernel: Pid: 8787, comm: kworker/0:0 Tainted: G        W   3.1.3-rsbac #1
Nov 30 10:40:21 tgpro1 kernel: Call Trace:
Nov 30 10:40:21 tgpro1 kernel:  [<0008e776>] print_circular_bug+0x21a/0x227
Nov 30 10:40:21 tgpro1 kernel:  [<0008fa3e>] __lock_acquire+0xe79/0x1425
Nov 30 10:40:21 tgpro1 kernel:  [<0009038d>] lock_acquire+0x42/0x59
Nov 30 10:40:21 tgpro1 kernel:  [<000f7176>] ? igrab+0x11/0x41
Nov 30 10:40:21 tgpro1 kernel:  [<0052f167>] _raw_spin_lock+0x24/0x33
Nov 30 10:40:21 tgpro1 kernel:  [<000f7176>] ? igrab+0x11/0x41
Nov 30 10:40:21 tgpro1 kernel:  [<000f7176>] igrab+0x11/0x41
Nov 30 10:40:21 tgpro1 kernel:  [<006e0e14>] iterate_session_caps+0x70/0x17d [ceph]
Nov 30 10:40:21 tgpro1 kernel:  [<006e5488>] ? ceph_mdsc_submit_request+0x55/0x55 [ceph]
Nov 30 10:40:21 tgpro1 kernel:  [<006a4d03>] ? ceph_pagelist_append+0xbc/0xf9 [libceph]
Nov 30 10:40:21 tgpro1 kernel:  [<006e3bca>] send_mds_reconnect+0x319/0x46a [ceph]
Nov 30 10:40:21 tgpro1 kernel:  [<0008daa1>] ? trace_hardirqs_on_caller+0x10b/0x13c
Nov 30 10:40:21 tgpro1 kernel:  [<0008dadd>] ? trace_hardirqs_on+0xb/0xd
Nov 30 10:40:21 tgpro1 kernel:  [<006e4035>] ceph_mdsc_handle_map+0x2f2/0x42d [ceph]
Nov 30 10:40:21 tgpro1 kernel:  [<000d7cd3>] ? __kmalloc+0x10d/0x137
Nov 30 10:40:21 tgpro1 kernel:  [<00c40015>] ? 0xc40014
Nov 30 10:40:21 tgpro1 kernel:  [<006ce00d>] extra_mon_dispatch+0x18/0x1f [ceph]
Nov 30 10:40:21 tgpro1 kernel:  [<006a63dd>] dispatch+0x4f7/0x52a [libceph]
Nov 30 10:40:21 tgpro1 kernel:  [<0008dadd>] ? trace_hardirqs_on+0xb/0xd
Nov 30 10:40:21 tgpro1 kernel:  [<006a444d>] con_work+0x14f7/0x16c6 [libceph]
Nov 30 10:40:21 tgpro1 kernel:  [<0007a24b>] process_one_work+0x223/0x391
Nov 30 10:40:21 tgpro1 kernel:  [<0007a1ff>] ? process_one_work+0x1d7/0x391
Nov 30 10:40:21 tgpro1 kernel:  [<006a2f56>] ? ceph_fault+0x262/0x262 [libceph]
Nov 30 10:40:21 tgpro1 kernel:  [<0007a747>] worker_thread+0x182/0x2d7
Nov 30 10:40:21 tgpro1 kernel:  [<00002a00>] ? __show_regs+0x6f/0x6f
Nov 30 10:40:21 tgpro1 kernel:  [<00002a00>] ? __show_regs+0x6f/0x6f
Nov 30 10:40:21 tgpro1 kernel:  [<0007a5c5>] ? rescuer_thread+0x20c/0x20c
Nov 30 10:40:21 tgpro1 kernel:  [<0007d767>] kthread+0x62/0x67
Nov 30 10:40:21 tgpro1 kernel:  [<0007d705>] ? __init_kthread_worker+0x42/0x42
Nov 30 10:40:21 tgpro1 kernel:  [<00530b26>] kernel_thread_helper+0x6/0xd
Nov 30 10:40:26 tgpro1 kernel: ceph: mds0 reconnect success
Nov 30 10:40:40 tgpro1 kernel: ceph: mds0 recovery completed

[-- Attachment #3: circ2.log --]
[-- Type: text/x-log, Size: 7686 bytes --]

Nov 30 10:50:53 tgpro1 kernel: libceph: loaded (mon/osd proto 15/24, osdmap 5/6 5/6)
Nov 30 10:50:53 tgpro1 kernel: ceph: loaded (mds proto 32)
Nov 30 10:50:53 tgpro1 kernel: libceph: client6597 fsid 520649ba-50cc-2c58-7db3-4b30d3bb97d3
Nov 30 10:50:53 tgpro1 kernel: libceph: mon0 192.168.111.1:6789 session established
Nov 30 10:56:43 tgpro1 kernel: ceph: mds0 caps stale
Nov 30 10:57:03 tgpro1 kernel: ceph: mds0 caps stale
Nov 30 10:58:24 tgpro1 kernel: ceph: mds0 reconnect start
Nov 30 10:58:24 tgpro1 kernel: 
Nov 30 10:58:24 tgpro1 kernel: =======================================================
Nov 30 10:58:24 tgpro1 kernel: [ INFO: possible circular locking dependency detected ]
Nov 30 10:58:24 tgpro1 kernel: 3.1.4-rsbac #1
Nov 30 10:58:24 tgpro1 kernel: -------------------------------------------------------
Nov 30 10:58:24 tgpro1 kernel: kworker/0:0/4 is trying to acquire lock:
Nov 30 10:58:24 tgpro1 kernel:  (&sb->s_type->i_lock_key#16){+.+...}, at: [<000f7166>] igrab+0x11/0x41
Nov 30 10:58:24 tgpro1 kernel: 
Nov 30 10:58:24 tgpro1 kernel: but task is already holding lock:
Nov 30 10:58:24 tgpro1 kernel:  (&(&s->s_cap_lock)->rlock){+.+...}, at: [<006b2de4>] iterate_session_caps+0x40/0x17d [ceph]
Nov 30 10:58:24 tgpro1 kernel: 
Nov 30 10:58:24 tgpro1 kernel: which lock already depends on the new lock.
Nov 30 10:58:24 tgpro1 kernel: 
Nov 30 10:58:24 tgpro1 kernel: 
Nov 30 10:58:24 tgpro1 kernel: the existing dependency chain (in reverse order) is:
Nov 30 10:58:24 tgpro1 kernel: 
Nov 30 10:58:24 tgpro1 kernel: -> #1 (&(&s->s_cap_lock)->rlock){+.+...}:
Nov 30 10:58:24 tgpro1 kernel:        [<0009038d>] lock_acquire+0x42/0x59
Nov 30 10:58:24 tgpro1 kernel:        [<0052f147>] _raw_spin_lock+0x24/0x33
Nov 30 10:58:24 tgpro1 kernel:        [<006ae0bc>] ceph_add_cap+0x302/0x551 [ceph]
Nov 30 10:58:24 tgpro1 kernel:        [<006a1bfd>] fill_inode+0x5ef/0x72e [ceph]
Nov 30 10:58:24 tgpro1 kernel:        [<006a2d6b>] ceph_fill_trace+0x663/0x6c5 [ceph]
Nov 30 10:58:24 tgpro1 kernel:        [<006b6c6e>] dispatch+0xafe/0x10e7 [ceph]
Nov 30 10:58:24 tgpro1 kernel:        [<0067644d>] con_work+0x14f7/0x16c6 [libceph]
Nov 30 10:58:24 tgpro1 kernel:        [<0007a24b>] process_one_work+0x223/0x391
Nov 30 10:58:24 tgpro1 kernel:        [<0007a747>] worker_thread+0x182/0x2d7
Nov 30 10:58:24 tgpro1 kernel:        [<0007d767>] kthread+0x62/0x67
Nov 30 10:58:24 tgpro1 kernel:        [<00530b06>] kernel_thread_helper+0x6/0xd
Nov 30 10:58:24 tgpro1 kernel: 
Nov 30 10:58:24 tgpro1 kernel: -> #0 (&sb->s_type->i_lock_key#16){+.+...}:
Nov 30 10:58:24 tgpro1 kernel:        [<0008fa3e>] __lock_acquire+0xe79/0x1425
Nov 30 10:58:24 tgpro1 kernel:        [<0009038d>] lock_acquire+0x42/0x59
Nov 30 10:58:24 tgpro1 kernel:        [<0052f147>] _raw_spin_lock+0x24/0x33
Nov 30 10:58:24 tgpro1 kernel:        [<000f7166>] igrab+0x11/0x41
Nov 30 10:58:24 tgpro1 kernel:        [<006b2e14>] iterate_session_caps+0x70/0x17d [ceph]
Nov 30 10:58:24 tgpro1 kernel:        [<006b5bca>] send_mds_reconnect+0x319/0x46a [ceph]
Nov 30 10:58:24 tgpro1 kernel:        [<006b6035>] ceph_mdsc_handle_map+0x2f2/0x42d [ceph]
Nov 30 10:58:24 tgpro1 kernel:        [<006a000d>] extra_mon_dispatch+0x18/0x1f [ceph]
Nov 30 10:58:24 tgpro1 kernel:        [<006783dd>] dispatch+0x4f7/0x52a [libceph]
Nov 30 10:58:24 tgpro1 kernel:        [<0067644d>] con_work+0x14f7/0x16c6 [libceph]
Nov 30 10:58:24 tgpro1 kernel:        [<0007a24b>] process_one_work+0x223/0x391
Nov 30 10:58:24 tgpro1 kernel:        [<0007a747>] worker_thread+0x182/0x2d7
Nov 30 10:58:25 tgpro1 kernel:        [<0007d767>] kthread+0x62/0x67
Nov 30 10:58:25 tgpro1 kernel:        [<00530b06>] kernel_thread_helper+0x6/0xd
Nov 30 10:58:25 tgpro1 kernel: 
Nov 30 10:58:25 tgpro1 kernel: other info that might help us debug this:
Nov 30 10:58:25 tgpro1 kernel: 
Nov 30 10:58:25 tgpro1 kernel:  Possible unsafe locking scenario:
Nov 30 10:58:25 tgpro1 kernel: 
Nov 30 10:58:25 tgpro1 kernel:        CPU0                    CPU1
Nov 30 10:58:25 tgpro1 kernel:        ----                    ----
Nov 30 10:58:25 tgpro1 kernel:   lock(&(&s->s_cap_lock)->rlock);
Nov 30 10:58:25 tgpro1 kernel:                                lock(&sb->s_type->i_lock_key);
Nov 30 10:58:25 tgpro1 kernel:                                lock(&(&s->s_cap_lock)->rlock);
Nov 30 10:58:25 tgpro1 kernel:   lock(&sb->s_type->i_lock_key);
Nov 30 10:58:25 tgpro1 kernel: 
Nov 30 10:58:25 tgpro1 kernel:  *** DEADLOCK ***
Nov 30 10:58:25 tgpro1 kernel: 
Nov 30 10:58:25 tgpro1 kernel: 5 locks held by kworker/0:0/4:
Nov 30 10:58:25 tgpro1 kernel:  #0:  (ceph-msgr){.+.+.+}, at: [<0007a1ff>] process_one_work+0x1d7/0x391
Nov 30 10:58:25 tgpro1 kernel:  #1:  ((&(&con->work)->work)){+.+.+.}, at: [<0007a1ff>] process_one_work+0x1d7/0x391
Nov 30 10:58:25 tgpro1 kernel:  #2:  (&s->s_mutex){+.+.+.}, at: [<006b5954>] send_mds_reconnect+0xa3/0x46a [ceph]
Nov 30 10:58:25 tgpro1 kernel:  #3:  (&mdsc->snap_rwsem){+++++.}, at: [<006b5a4c>] send_mds_reconnect+0x19b/0x46a [ceph]
Nov 30 10:58:25 tgpro1 kernel:  #4:  (&(&s->s_cap_lock)->rlock){+.+...}, at: [<006b2de4>] iterate_session_caps+0x40/0x17d [ceph]
Nov 30 10:58:25 tgpro1 kernel: 
Nov 30 10:58:25 tgpro1 kernel: stack backtrace:
Nov 30 10:58:25 tgpro1 kernel: Pid: 4, comm: kworker/0:0 Tainted: G        W   3.1.4-rsbac #1
Nov 30 10:58:25 tgpro1 kernel: Call Trace:
Nov 30 10:58:25 tgpro1 kernel:  [<0008e776>] print_circular_bug+0x21a/0x227
Nov 30 10:58:25 tgpro1 kernel:  [<0008fa3e>] __lock_acquire+0xe79/0x1425
Nov 30 10:58:25 tgpro1 kernel:  [<0009038d>] lock_acquire+0x42/0x59
Nov 30 10:58:25 tgpro1 kernel:  [<000f7166>] ? igrab+0x11/0x41
Nov 30 10:58:25 tgpro1 kernel:  [<0052f147>] _raw_spin_lock+0x24/0x33
Nov 30 10:58:25 tgpro1 kernel:  [<000f7166>] ? igrab+0x11/0x41
Nov 30 10:58:25 tgpro1 kernel:  [<000f7166>] igrab+0x11/0x41
Nov 30 10:58:25 tgpro1 kernel:  [<006b2e14>] iterate_session_caps+0x70/0x17d [ceph]
Nov 30 10:58:25 tgpro1 kernel:  [<006b7488>] ? ceph_mdsc_submit_request+0x55/0x55 [ceph]
Nov 30 10:58:25 tgpro1 kernel:  [<00676d03>] ? ceph_pagelist_append+0xbc/0xf9 [libceph]
Nov 30 10:58:25 tgpro1 kernel:  [<006b5bca>] send_mds_reconnect+0x319/0x46a [ceph]
Nov 30 10:58:25 tgpro1 kernel:  [<0008daa1>] ? trace_hardirqs_on_caller+0x10b/0x13c
Nov 30 10:58:25 tgpro1 kernel:  [<0008dadd>] ? trace_hardirqs_on+0xb/0xd
Nov 30 10:58:25 tgpro1 kernel:  [<006b6035>] ceph_mdsc_handle_map+0x2f2/0x42d [ceph]
Nov 30 10:58:25 tgpro1 kernel:  [<000d7cd3>] ? __kmalloc+0x10d/0x137
Nov 30 10:58:25 tgpro1 kernel:  [<00c40015>] ? 0xc40014
Nov 30 10:58:25 tgpro1 kernel:  [<006a000d>] extra_mon_dispatch+0x18/0x1f [ceph]
Nov 30 10:58:25 tgpro1 kernel:  [<006783dd>] dispatch+0x4f7/0x52a [libceph]
Nov 30 10:58:25 tgpro1 kernel:  [<0008dadd>] ? trace_hardirqs_on+0xb/0xd
Nov 30 10:58:25 tgpro1 kernel:  [<0067644d>] con_work+0x14f7/0x16c6 [libceph]
Nov 30 10:58:25 tgpro1 kernel:  [<0007a24b>] process_one_work+0x223/0x391
Nov 30 10:58:25 tgpro1 kernel:  [<0007a1ff>] ? process_one_work+0x1d7/0x391
Nov 30 10:58:25 tgpro1 kernel:  [<00674f56>] ? ceph_fault+0x262/0x262 [libceph]
Nov 30 10:58:25 tgpro1 kernel:  [<0007a747>] worker_thread+0x182/0x2d7
Nov 30 10:58:25 tgpro1 kernel:  [<00002a00>] ? __show_regs+0x6f/0x6f
Nov 30 10:58:25 tgpro1 kernel:  [<00002a00>] ? __show_regs+0x6f/0x6f
Nov 30 10:58:25 tgpro1 kernel:  [<0007a5c5>] ? rescuer_thread+0x20c/0x20c
Nov 30 10:58:25 tgpro1 kernel:  [<0007d767>] kthread+0x62/0x67
Nov 30 10:58:25 tgpro1 kernel:  [<0007d705>] ? __init_kthread_worker+0x42/0x42
Nov 30 10:58:25 tgpro1 kernel:  [<00530b06>] kernel_thread_helper+0x6/0xd
Nov 30 10:58:26 tgpro1 kernel: ceph: mds0 reconnect success
Nov 30 10:58:30 tgpro1 kernel: ceph: mds0 recovery completed

[-- Attachment #4: circ3.log --]
[-- Type: text/x-log, Size: 6696 bytes --]

Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel: =======================================================
Nov 30 13:02:05 tgpro1 kernel: [ INFO: possible circular locking dependency detected ]
Nov 30 13:02:05 tgpro1 kernel: 3.1.4-rsbac #1
Nov 30 13:02:05 tgpro1 kernel: -------------------------------------------------------
Nov 30 13:02:05 tgpro1 kernel: kworker/0:1/22099 is trying to acquire lock:
Nov 30 13:02:05 tgpro1 kernel:  (&sb->s_type->i_lock_key#16){+.+...}, at: [<000ff721>] igrab+0x11/0x41
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel: but task is already holding lock:
Nov 30 13:02:05 tgpro1 kernel:  (&(&realm->inodes_with_caps_lock)->rlock){+.+...}, at: [<006ce9b4>] ceph_update_snap_trace+0x258/0x3bc [ceph]
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel: which lock already depends on the new lock.
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel: the existing dependency chain (in reverse order) is:
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel: -> #1 (&(&realm->inodes_with_caps_lock)->rlock){+.+...}:
Nov 30 13:02:05 tgpro1 kernel:        [<00091456>] lock_acquire+0x42/0x59
Nov 30 13:02:05 tgpro1 kernel:        [<0053f187>] _raw_spin_lock+0x24/0x33
Nov 30 13:02:05 tgpro1 kernel:        [<006cc0a3>] ceph_add_cap+0x37d/0x554 [ceph]
Nov 30 13:02:05 tgpro1 kernel:        [<006bfbf7>] fill_inode+0x5ee/0x72d [ceph]
Nov 30 13:02:05 tgpro1 kernel:        [<006c0d61>] ceph_fill_trace+0x663/0x6c5 [ceph]
Nov 30 13:02:05 tgpro1 kernel:        [<006d4bd4>] dispatch+0xaff/0x10e9 [ceph]
Nov 30 13:02:05 tgpro1 kernel:        [<00694459>] con_work+0x14f4/0x16c3 [libceph]
Nov 30 13:02:05 tgpro1 kernel:        [<0007b331>] process_one_work+0x229/0x397
Nov 30 13:02:05 tgpro1 kernel:        [<0007b80e>] worker_thread+0x182/0x2d8
Nov 30 13:02:05 tgpro1 kernel:        [<0007e843>] kthread+0x62/0x67
Nov 30 13:02:05 tgpro1 kernel:        [<00540c86>] kernel_thread_helper+0x6/0xd
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel: -> #0 (&sb->s_type->i_lock_key#16){+.+...}:
Nov 30 13:02:05 tgpro1 kernel:        [<00090b06>] __lock_acquire+0xe7b/0x1428
Nov 30 13:02:05 tgpro1 kernel:        [<00091456>] lock_acquire+0x42/0x59
Nov 30 13:02:05 tgpro1 kernel:        [<0053f187>] _raw_spin_lock+0x24/0x33
Nov 30 13:02:05 tgpro1 kernel:        [<000ff721>] igrab+0x11/0x41
Nov 30 13:02:05 tgpro1 kernel:        [<006ce9d3>] ceph_update_snap_trace+0x277/0x3bc [ceph]
Nov 30 13:02:05 tgpro1 kernel:        [<006cee4e>] ceph_handle_snap+0x336/0x458 [ceph]
Nov 30 13:02:05 tgpro1 kernel:        [<006d4e7d>] dispatch+0xda8/0x10e9 [ceph]
Nov 30 13:02:05 tgpro1 kernel:        [<00694459>] con_work+0x14f4/0x16c3 [libceph]
Nov 30 13:02:05 tgpro1 kernel:        [<0007b331>] process_one_work+0x229/0x397
Nov 30 13:02:05 tgpro1 kernel:        [<0007b80e>] worker_thread+0x182/0x2d8
Nov 30 13:02:05 tgpro1 kernel:        [<0007e843>] kthread+0x62/0x67
Nov 30 13:02:05 tgpro1 kernel:        [<00540c86>] kernel_thread_helper+0x6/0xd
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel: other info that might help us debug this:
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel:  Possible unsafe locking scenario:
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel:        CPU0                    CPU1
Nov 30 13:02:05 tgpro1 kernel:        ----                    ----
Nov 30 13:02:05 tgpro1 kernel:   lock(&(&realm->inodes_with_caps_lock)->rlock);
Nov 30 13:02:05 tgpro1 kernel:                                lock(&sb->s_type->i_lock_key);
Nov 30 13:02:05 tgpro1 kernel:                                lock(&(&realm->inodes_with_caps_lock)->rlock);
Nov 30 13:02:05 tgpro1 kernel:   lock(&sb->s_type->i_lock_key);
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel:  *** DEADLOCK ***
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel: 4 locks held by kworker/0:1/22099:
Nov 30 13:02:05 tgpro1 kernel:  #0:  (ceph-msgr){++++.+}, at: [<0007b2e5>] process_one_work+0x1dd/0x397
Nov 30 13:02:05 tgpro1 kernel:  #1:  ((&(&con->work)->work)){+.+.+.}, at: [<0007b2e5>] process_one_work+0x1dd/0x397
Nov 30 13:02:05 tgpro1 kernel:  #2:  (&mdsc->snap_rwsem){+++++.}, at: [<006cebe5>] ceph_handle_snap+0xcd/0x458 [ceph]
Nov 30 13:02:05 tgpro1 kernel:  #3:  (&(&realm->inodes_with_caps_lock)->rlock){+.+...}, at: [<006ce9b4>] ceph_update_snap_trace+0x258/0x3bc [ceph]
Nov 30 13:02:05 tgpro1 kernel: 
Nov 30 13:02:05 tgpro1 kernel: stack backtrace:
Nov 30 13:02:05 tgpro1 kernel: Pid: 22099, comm: kworker/0:1 Tainted: G        W   3.1.4-rsbac #1
Nov 30 13:02:05 tgpro1 kernel: Call Trace:
Nov 30 13:02:05 tgpro1 kernel:  [<0008f840>] print_circular_bug+0x219/0x226
Nov 30 13:02:05 tgpro1 kernel:  [<00090b06>] __lock_acquire+0xe7b/0x1428
Nov 30 13:02:05 tgpro1 kernel:  [<00090b54>] ? __lock_acquire+0xec9/0x1428
Nov 30 13:02:05 tgpro1 kernel:  [<00091456>] lock_acquire+0x42/0x59
Nov 30 13:02:05 tgpro1 kernel:  [<000ff721>] ? igrab+0x11/0x41
Nov 30 13:02:05 tgpro1 kernel:  [<0053f187>] _raw_spin_lock+0x24/0x33
Nov 30 13:02:05 tgpro1 kernel:  [<000ff721>] ? igrab+0x11/0x41
Nov 30 13:02:05 tgpro1 kernel:  [<000ff721>] igrab+0x11/0x41
Nov 30 13:02:05 tgpro1 kernel:  [<006ce9d3>] ceph_update_snap_trace+0x277/0x3bc [ceph]
Nov 30 13:02:05 tgpro1 kernel:  [<006cee4e>] ceph_handle_snap+0x336/0x458 [ceph]
Nov 30 13:02:05 tgpro1 kernel:  [<007f0312>] ? 0x7f0311
Nov 30 13:02:05 tgpro1 kernel:  [<006d4e7d>] dispatch+0xda8/0x10e9 [ceph]
Nov 30 13:02:05 tgpro1 kernel:  [<0048738c>] ? kernel_recvmsg+0x2a/0x34
Nov 30 13:02:05 tgpro1 kernel:  [<0053dbb6>] ? __mutex_unlock_slowpath+0xdf/0xf8
Nov 30 13:02:05 tgpro1 kernel:  [<0008eb6f>] ? trace_hardirqs_on_caller+0x10b/0x13c
Nov 30 13:02:05 tgpro1 kernel:  [<0008ebab>] ? trace_hardirqs_on+0xb/0xd
Nov 30 13:02:05 tgpro1 kernel:  [<00694459>] con_work+0x14f4/0x16c3 [libceph]
Nov 30 13:02:05 tgpro1 kernel:  [<0000784e>] ? text_poke+0x79/0x95
Nov 30 13:02:05 tgpro1 kernel:  [<0007b331>] process_one_work+0x229/0x397
Nov 30 13:02:05 tgpro1 kernel:  [<0007b2e5>] ? process_one_work+0x1dd/0x397
Nov 30 13:02:05 tgpro1 kernel:  [<00692f65>] ? ceph_fault+0x262/0x262 [libceph]
Nov 30 13:02:05 tgpro1 kernel:  [<0007b80e>] worker_thread+0x182/0x2d8
Nov 30 13:02:05 tgpro1 kernel:  [<00002b40>] ? setup_sigcontext+0x24/0x24
Nov 30 13:02:05 tgpro1 kernel:  [<00002b40>] ? setup_sigcontext+0x24/0x24
Nov 30 13:02:05 tgpro1 kernel:  [<0007b68c>] ? rescuer_thread+0x1ed/0x1ed
Nov 30 13:02:05 tgpro1 kernel:  [<0007e843>] kthread+0x62/0x67
Nov 30 13:02:05 tgpro1 kernel:  [<0007e7e1>] ? __init_kthread_worker+0x42/0x42
Nov 30 13:02:05 tgpro1 kernel:  [<00540c86>] kernel_thread_helper+0x6/0xd


* Re: Circular lock / deadlock in kernel client
  2011-11-30 12:21 Circular lock / deadlock in kernel client Amon Ott
@ 2011-11-30 17:20 ` Sage Weil
  2011-11-30 17:55   ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2011-11-30 17:20 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel

On Wed, 30 Nov 2011, Amon Ott wrote:
> Hi!
> 
> With some kernel debug options for soft and hard lockup detection enabled, I 
> got some useful traces. My kernel is 3.1.4, to which I have ported everything 
> from the ceph-client for-linus branch that is suitable for 3.1. If needed, I 
> can make my exact ceph code available.
> 
> Traces are attached. It seems that two dependent locks can be acquired in 
> different orders in different parts of the code, which can lead to a deadlock. 
> Additionally, I am still trying to reproduce a partial lockup of single 
> directories with this debugging enabled. Those lockups are likely related to 
> directories being mutex-locked without being unlocked properly.

Thanks, I put these in the tracker at 
http://tracker.newdream.net/issues/1762.

sage



* Re: Circular lock / deadlock in kernel client
  2011-11-30 17:20 ` Sage Weil
@ 2011-11-30 17:55   ` Sage Weil
  2011-12-01  8:16     ` Amon Ott
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2011-11-30 17:55 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel

Hi Amon,

I pushed a wip-i-ceph-lock branch to ceph-client.git that replaces our 
(ab?)use of i_lock with a new i_ceph_lock in the ceph inode.  This avoids 
being bitten by the lock ordering constraint imposed by igrab(), which 
requires i_lock to safely take a reference to an inode without racing with 
inode destruction.  This lets us keep two inode list locks logically 
ordered inside i_ceph_lock (with i_lock as an inner lock).
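
The ordering described above can be pictured as a lock hierarchy in which a 
lock may only be taken while holding locks of strictly lower rank. The 
following is a user-space sketch under that reading, not code from the 
wip-i-ceph-lock branch. The RankedLock class, the rank numbers, and the three 
helper functions are hypothetical; taking s_cap_lock as one of the "inode list 
locks" is an assumption based on the attached traces.

```python
import threading

# Assumed hierarchy: outermost lock has the lowest rank, i_lock is innermost.
RANK = {"i_ceph_lock": 0, "s_cap_lock": 1, "i_lock": 2}


class RankedLock:
    """Lock wrapper that refuses acquisitions violating the rank order."""

    _held = threading.local()  # per-thread stack of held lock names

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    def __enter__(self):
        stack = getattr(RankedLock._held, "stack", None)
        if stack is None:
            stack = RankedLock._held.stack = []
        # Ranks strictly increase along the stack, so checking the top suffices.
        if stack and RANK[stack[-1]] >= RANK[self.name]:
            raise RuntimeError(f"lock order violation: {self.name} after {stack[-1]}")
        self._lock.acquire()
        stack.append(self.name)
        return self

    def __exit__(self, *exc):
        RankedLock._held.stack.pop()
        self._lock.release()
        return False


i_ceph_lock = RankedLock("i_ceph_lock")
s_cap_lock = RankedLock("s_cap_lock")
i_lock = RankedLock("i_lock")


def add_cap():
    # Ceph-private inode state sits under i_ceph_lock, list lock inside it.
    with i_ceph_lock, s_cap_lock:
        return "cap added"


def iterate_caps():
    # An igrab()-style step needs i_lock; under the list lock this fits the ranks.
    with s_cap_lock, i_lock:
        return "inode grabbed"


def old_add_cap():
    # Pre-fix pattern: list lock taken while i_lock is held.
    with i_lock, s_cap_lock:
        return "never reached"
```

In this sketch add_cap() and iterate_caps() both succeed, while old_add_cap() 
raises RuntimeError because it takes the list lock inside i_lock.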

I did some very basic testing and it didn't blow up.  If you can give it a 
try, that would be very helpful.

Also, we need to enable lockdep in our qa environment and make sure 
teuthology is erroring out on lockdep warnings.  (#1763)
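
One way a test harness could fail a run on such a warning is to scan the 
kernel log for the lockdep banner. The sketch below is only an illustration 
and does not reflect how teuthology works; find_lockdep_warnings is a 
hypothetical helper, and the pattern covers only the banner text that appears 
in the traces in this thread.

```python
import re

# Banner copied from the traces in this thread; other lockdep report types
# print different banners and are not matched by this pattern.
LOCKDEP_BANNER = re.compile(
    r"\[ INFO: possible circular locking dependency detected \]"
)


def find_lockdep_warnings(log_lines):
    """Return (line_number, line) pairs for every lockdep banner in the log."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if LOCKDEP_BANNER.search(line)
    ]


sample = [
    "Nov 30 10:58:24 tgpro1 kernel: ceph: mds0 reconnect start",
    "Nov 30 10:58:24 tgpro1 kernel: [ INFO: possible circular locking dependency detected ]",
    "Nov 30 10:58:26 tgpro1 kernel: ceph: mds0 reconnect success",
]
hits = find_lockdep_warnings(sample)
# A harness would treat any hit as a failed run.
exit_status = 1 if hits else 0
print(hits, exit_status)
```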

Thanks!
sage


* Re: Circular lock / deadlock in kernel client
  2011-11-30 17:55   ` Sage Weil
@ 2011-12-01  8:16     ` Amon Ott
  2011-12-01 16:12       ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Amon Ott @ 2011-12-01  8:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Wednesday 30 November 2011, Sage Weil wrote:
> I pushed a wip-i-ceph-lock branch to ceph-client.git that replaces our
> (ab?)use of i_lock with a new i_ceph_lock in the ceph inode.  This avoids
> being bitten by the lock ordering constraint imposed by igrab(), which
> requires i_lock to safely take a reference to an inode without racing with
> inode destruction.  This lets us keep two inode list locks logically
> ordered inside i_ceph_lock (with i_lock as an inner lock).

I see the branch, but there is nothing new in it. Is that the right location? 
Maybe you forgot to push?

https://github.com/NewDreamNetwork/ceph-client/tree/wip-i-ceph-lock

Amon Ott
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Circular lock / deadlock in kernel client
  2011-12-01  8:16     ` Amon Ott
@ 2011-12-01 16:12       ` Sage Weil
  2011-12-01 16:24         ` Amon Ott
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2011-12-01 16:12 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1350 bytes --]

On Thu, 1 Dec 2011, Amon Ott wrote:
> On Wednesday 30 November 2011, Sage Weil wrote:
> > I pushed a wip-i-ceph-lock branch to ceph-client.git that replaces our
> > (ab?)use of i_lock with a new i_ceph_lock in the ceph inode.  This avoids
> > being bitten by the lock ordering constraint imposed by igrab(), which
> > requires i_lock to safely take a reference to an inode without racing with
> > inode destruction.  This lets us keep two inode list locks logically
> > ordered inside i_ceph_lock (with i_lock as an inner lock).
> 
> I see the branch, but there is nothing new in it. Is that the right location? 
> Maybe you forgot to push?
> 
> https://github.com/NewDreamNetwork/ceph-client/tree/wip-i-ceph-lock

Oops, I pushed the wrong branch.  It's there now.

Sorry!
sage


* Re: Circular lock / deadlock in kernel client
  2011-12-01 16:12       ` Sage Weil
@ 2011-12-01 16:24         ` Amon Ott
  2011-12-06 13:21           ` Amon Ott
  0 siblings, 1 reply; 11+ messages in thread
From: Amon Ott @ 2011-12-01 16:24 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thursday 01 December 2011, Sage Weil wrote:
> On Thu, 1 Dec 2011, Amon Ott wrote:
> > On Wednesday 30 November 2011, Sage Weil wrote:
> > > I pushed a wip-i-ceph-lock branch to ceph-client.git that replaces our
> > > (ab?)use of i_lock with a new i_ceph_lock in the ceph inode.  This
> > > avoids being bitten by the lock ordering constraint imposed by igrab(),
> > > which requires i_lock to safely take a reference to an inode without
> > > racing with inode destruction.  This lets us keep two inode list locks
> > > logically ordered inside i_ceph_lock (with i_lock as an inner lock).
> >
> > I see the branch, but there is nothing new in it. Is that the right
> > location? Maybe you forgot to push?
> >
> > https://github.com/NewDreamNetwork/ceph-client/tree/wip-i-ceph-lock
>
> Oops, I pushed the wrong branch.  It's there now.

Got the new commit, but it does not apply cleanly to my tree. I will try to 
get it merged tomorrow and retry. I cannot use kernel 3.2-pre here because the 
rest of the system needs to be stable.

Interesting note: Even though I got the deadlock message in syslog, the system 
worked fine for another few hours, until all the MDS processes died (see my 
other mail).

Amon Ott

* Re: Circular lock / deadlock in kernel client
  2011-12-01 16:24         ` Amon Ott
@ 2011-12-06 13:21           ` Amon Ott
  2011-12-07 15:20             ` Amon Ott
  0 siblings, 1 reply; 11+ messages in thread
From: Amon Ott @ 2011-12-06 13:21 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thursday 01 December 2011, Amon Ott wrote:
> On Thursday 01 December 2011, Sage Weil wrote:
> > On Thu, 1 Dec 2011, Amon Ott wrote:
> > > On Wednesday 30 November 2011, Sage Weil wrote:
> > > > I pushed a wip-i-ceph-lock branch to ceph-client.git that replaces
> > > > our (ab?)use of i_lock with a new i_ceph_lock in the ceph inode. 
> > > > This avoids being bitten by the lock ordering constraint imposed by
> > > > igrab(), which requires i_lock to safely take a reference to an inode
> > > > without racing with inode destruction.  This lets us keep two inode
> > > > list locks logically ordered inside i_ceph_lock (with i_lock as an
> > > > inner lock).
> > >
> > > I see the branch, but there is nothing new in it. Is that the right
> > > location? Maybe forgot to push?
> > >
> > > https://github.com/NewDreamNetwork/ceph-client/tree/wip-i-ceph-lock
> >
> > Oops, I pushed the wrong branch.  It's there now.
>
> Got the new commit, but it does not apply cleanly to my tree. I will try to
> get it merged tomorrow and retry. I cannot use kernel 3.2-pre here, the
> rest of the system needs to be stable.

Merged in and bug seems to be fixed. No more deadlock warnings today.
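
As an illustration of the lock-ordering rule quoted above, here is a minimal 
user-space sketch. It is not code from the wip-i-ceph-lock branch. The names 
outer_lock, inner_lock, worker() and run_ordered_demo() are hypothetical, and 
pthread mutexes stand in for the kernel spinlocks. It only shows the general 
idea: when every code path takes the outer lock before the inner one, the 
circular dependency that lockdep warns about cannot form.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins: outer_lock plays the role of i_ceph_lock and
 * inner_lock the role of i_lock. Neither is the kernel's real lock type. */
static pthread_mutex_t outer_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t inner_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

#define ITERATIONS 100000

/* Every thread uses the same order: outer first, then inner. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&outer_lock);
        pthread_mutex_lock(&inner_lock);
        shared_counter++;
        /* Release in reverse order of acquisition. */
        pthread_mutex_unlock(&inner_lock);
        pthread_mutex_unlock(&outer_lock);
    }
    return NULL;
}

/* Runs two workers to completion and returns the final counter value. */
long run_ordered_demo(void)
{
    pthread_t a, b;
    int rc;

    shared_counter = 0;
    rc = pthread_create(&a, NULL, worker, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, worker, NULL);
    assert(rc == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_counter;
}
```

Calling run_ordered_demo() from a main() built with -pthread should always 
return. If one of the two threads took inner_lock first instead, each thread 
could end up holding one lock while waiting for the other, which is the 
pattern the original trace reports.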

Amon Ott

* Re: Circular lock / deadlock in kernel client
  2011-12-06 13:21           ` Amon Ott
@ 2011-12-07 15:20             ` Amon Ott
  2011-12-07 16:33               ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Amon Ott @ 2011-12-07 15:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 541 bytes --]

On Tuesday 06 December 2011, Amon Ott wrote:
> Merged in and bug seems to be fixed. No more deadlock warnings today.

Unfortunately, I got another deadlock message in the log today. The full log 
from one boot is attached.

Amon Ott

[-- Attachment #2: circ4.log.bz2 --]
[-- Type: application/x-bzip2, Size: 14561 bytes --]


* Re: Circular lock / deadlock in kernel client
  2011-12-07 15:20             ` Amon Ott
@ 2011-12-07 16:33               ` Sage Weil
  2011-12-08  9:58                 ` Amon Ott
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2011-12-07 16:33 UTC (permalink / raw)
  To: Amon Ott; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1033 bytes --]

I pushed a patch to wip-d-lock that may fix this one, but unfortunately 
don't have time to test this very carefully right now.  Let us know if 
that helps, or you can wait until next week.

The call path that was triggering both of these can be exercised by 
restarting the ceph-mds daemon.  Try running your client for a bit and 
then doing that and see if you get any more splats.

Thanks!
sage


On Wed, 7 Dec 2011, Amon Ott wrote:

> On Tuesday 06 December 2011, Amon Ott wrote:
> > Merged in and bug seems to be fixed. No more deadlock warnings today.
> 
> Unfortunately, I got another deadlock message in the log today. Full log of 
> one boot time is attached.

* Re: Circular lock / deadlock in kernel client
  2011-12-07 16:33               ` Sage Weil
@ 2011-12-08  9:58                 ` Amon Ott
  2011-12-08 13:39                   ` Amon Ott
  0 siblings, 1 reply; 11+ messages in thread
From: Amon Ott @ 2011-12-08  9:58 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Wednesday 07 December 2011, Sage Weil wrote:
> I pushed a patch to wip-d-lock that may fix this one, but unfortunately
> don't have time to test this very carefully right now.  Let us know if
> that helps, or you can wait until next week.

Rebuilding kernel with that patch.

> The call path that was triggering both of these can be exercised by
> restarting the ceph-mds daemon.  Try running your client for a bit and
> then doing that and see if you get any more splats.

What triggered the kernel problem was bug 1047. ceph-mds crashed on all nodes 
with that assert. When the kernel detected that the main mds connection was 
missing, it tried to reconnect and hung.

We do appreciate how much work you are doing to make Ceph production ready. 
Thank you!

We really want to use Ceph in our product, because the design and experience 
so far have shown a lot of potential.

Amon Ott

* Re: Circular lock / deadlock in kernel client
  2011-12-08  9:58                 ` Amon Ott
@ 2011-12-08 13:39                   ` Amon Ott
  0 siblings, 0 replies; 11+ messages in thread
From: Amon Ott @ 2011-12-08 13:39 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thursday 08 December 2011, Amon Ott wrote:
> On Wednesday 07 December 2011, Sage Weil wrote:
> > I pushed a patch to wip-d-lock that may fix this one, but unfortunately
> > don't have time to test this very carefully right now.  Let us know if
> > that helps, or you can wait until next week.
>
> Rebuilding kernel with that patch.

With the patch the deadlock disappears and the kernel does not hang at the 
locks afterwards.

> > The call path that was triggering both of these can be exercised by
> > restarting the ceph-mds daemon.  Try running your client for a bit and
> > then doing that and see if you get any more splats.
>
> What triggered the kernel problem was bug 1047. ceph-mds crashed on all
> nodes with that assert. When the kernel detected that the main mds
> connection was missing, it tried to reconnect and hung.

This problem remains, and without a working MDS the whole Ceph mount hangs. 
Instead of crashing, the MDS sometimes spins in an endless loop, using 100% 
of a CPU core. This also makes the mount hang.

Amon Ott

Thread overview: 11+ messages
2011-11-30 12:21 Circular lock / deadlock in kernel client Amon Ott
2011-11-30 17:20 ` Sage Weil
2011-11-30 17:55   ` Sage Weil
2011-12-01  8:16     ` Amon Ott
2011-12-01 16:12       ` Sage Weil
2011-12-01 16:24         ` Amon Ott
2011-12-06 13:21           ` Amon Ott
2011-12-07 15:20             ` Amon Ott
2011-12-07 16:33               ` Sage Weil
2011-12-08  9:58                 ` Amon Ott
2011-12-08 13:39                   ` Amon Ott
