Hi,

I was the one who initially stumbled onto this problem. When I realized that it was an ABBA deadlock, I approached Chandra and Mike. Chandra came up with the initial fix, and it worked fine.

Later Chandra pointed me to this patch, and when I tried it, I ran into a system hang. Please note that I am running an -rt kernel based on 2.6.24. I could not apply the patch directly, so I ported it to my kernel. I am attaching the ported version (new_dm_patch). I have already run this ported patch by Chandra.

I ran an I/O stress test with this patch while one of the paths was constantly bounced. I bounced the same path every time (20 min between bounces). The system hung a few hours into the test, and I forced a dump.

I am still analyzing the dump. If you want the dump, please let me know where I can upload it. The core is around 8G.

Here are a few things from the dump that might be interesting.

crash> ps | grep udev | wc -l
425        <<< 425 udevd threads at the time of the hang.
crash> ps | wc -l
782
crash> foreach bt | grep rt_mutex_slowlock | wc -l
416        <<< 416 of the total 782 threads are waiting for a lock.
crash>
crash> struct rt_mutex ffff81024ec6cca0
struct rt_mutex {
  wait_lock = {
    raw_lock = {
      slock = 49858
    },
    break_lock = 0
  },
  wait_list = {
    prio_list = {
      next = 0xffff8101007f5ae0,
      prev = 0xffff8101007f5ae0
    },
    node_list = {
      next = 0xffff8101007f5af0,
      prev = 0xffff81007cb51af0
    }
  },
  owner = 0xffff8102400c2b22
}

The following task is holding the lock that many other udevd threads are waiting for.

PID: 21896  TASK: ffff8102400c2b20  CPU: 6  COMMAND: "udevd"    << Holding
 #0 [ffff810100667848] schedule at ffffffff8128531c
 #1 [ffff810100667900] io_schedule at ffffffff81285859
 #2 [ffff810100667920] sync_buffer at ffffffff810d1fc1
 #3 [ffff810100667930] __wait_on_bit at ffffffff81285ad1
 #4 [ffff810100667970] out_of_line_wait_on_bit at ffffffff81285b71
 #5 [ffff8101006679e0] __wait_on_buffer at ffffffff810d1f41
 #6 [ffff8101006679f0] ext3_find_entry at ffffffff8803c1d2
 #7 [ffff810100667b60] ext3_lookup at ffffffff8803dbae
 #8 [ffff810100667ba0] do_lookup at ffffffff810b78cf
 #9 [ffff810100667bf0] __link_path_walk at ffffffff810b94ef
#10 [ffff810100667c90] link_path_walk at ffffffff810b9f99
#11 [ffff810100667d60] path_walk at ffffffff810ba04b
#12 [ffff810100667d70] do_path_lookup at ffffffff810ba352
#13 [ffff810100667dc0] __path_lookup_intent_open at ffffffff810bae88
#14 [ffff810100667e10] path_lookup_open at ffffffff810baf38
#15 [ffff810100667e20] open_exec at ffffffff810b41e3
#16 [ffff810100667ed0] do_execve at ffffffff810b53a2
#17 [ffff810100667f20] sys_execve at ffffffff8100ac30
#18 [ffff810100667f50] stub_execve at ffffffff8100c5c7

PID: 21946  TASK: ffff81007f090b20  CPU: 4  COMMAND: "udevd"    << One of the udevds waiting for the lock.
 #0 [ffff81007f0e59f8] schedule at ffffffff8128531c
 #1 [ffff81007f0e5ab0] rt_mutex_slowlock at ffffffff81286a95
 #2 [ffff81007f0e5b80] rt_mutex_lock at ffffffff81285f84
 #3 [ffff81007f0e5b90] _mutex_lock at ffffffff812873f9
 #4 [ffff81007f0e5ba0] do_lookup at ffffffff810b788b
 #5 [ffff81007f0e5bf0] __link_path_walk at ffffffff810b94ef
 #6 [ffff81007f0e5c90] link_path_walk at ffffffff810b9f99
 #7 [ffff81007f0e5d60] path_walk at ffffffff810ba04b
 #8 [ffff81007f0e5d70] do_path_lookup at ffffffff810ba352
 #9 [ffff81007f0e5dc0] __path_lookup_intent_open at ffffffff810bae88
#10 [ffff81007f0e5e10] path_lookup_open at ffffffff810baf38
#11 [ffff81007f0e5e20] open_exec at ffffffff810b41e3
#12 [ffff81007f0e5ed0] do_execve at ffffffff810b53a2
#13 [ffff81007f0e5f20] sys_execve at ffffffff8100ac30
#14 [ffff81007f0e5f50] stub_execve at ffffffff8100c5c7

Thanks,
Venkateswararao Jujjuri (JV)
Realtime Team, LTC,
Beaverton, OR 97006

> ------------------------------------------------------------------------
>
> Subject: [PATCHES] new solution for dm_any_congested crash
> From: Mikulas Patocka
> Date: Thu, 13 Nov 2008 20:55:27 -0500 (EST)
> To: Alasdair G Kergon, Chandra Seetharaman
> CC: dm-devel@redhat.com, Milan Broz
>
> Hi
>
> Chandra's patch was correct, but the problem is more serious (the same
> crash could happen in dm_merge_bvec, dm_unplug_all or at some other dm
> places), so I had to rework reference counting.
>
> These are three patches.
> 1. reverts Chandra's changes
> 2. just a little swap of two calls, to prepare for the third
> 3. the reference counting rework
>
> Chandra, please test the patches on your system (without your original
> patch) and verify that they avoid the crashes as well as your patch does.
>
> Mikulas