From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Vladimir V. Saveliev" Subject: Re: reiser4-2.6.18-rc2-mm1: possible circular locking dependency detected in txn_end Date: Mon, 14 Aug 2006 15:50:41 +0400 Message-ID: <200608141550.41663.vs@namesys.com> References: <44CD0115.4010608@free.fr> <44D2112A.7080705@free.fr> <44DDD707.1000506@free.fr> Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Return-path: list-help: list-unsubscribe: list-post: Errors-To: flx@namesys.com In-Reply-To: <44DDD707.1000506@free.fr> Content-Disposition: inline List-Id: Content-Type: text/plain; charset="iso-8859-1" To: Laurent Riffard Cc: reiserfs-list@namesys.com, Alexander Zarochentsev Hello On Saturday 12 August 2006 17:26, Laurent Riffard wrote: > Le 03.08.2006 17:07, Laurent Riffard a =E9crit : > > Le 03.08.2006 08:09, Alexander Zarochentsev a =E9crit : > >> On Tuesday 01 August 2006 01:29, Laurent Riffard wrote: > >>> Le 31.07.2006 21:55, Vladimir V. Saveliev a =E9crit : > >>>> Hello > >>>> > >>>> What kind of load did you run on reiser4 at that time? > >>> > >>> I just formatted a new 2GB Reiser4 FS, then I moved a whole ccache > >>> cache tree to this new FS (cache size was about 20~30 Mbytes). > >>> Something like: > >>> > >>> # mkfs.reiser4 /dev/vglinux1/ccache > >>> # mount -tauto -onoatime /dev/vglinux1/ccache /mnt/disk > >>> # mv ~laurent/.ccache/* /mnt/disk/ > >> > >> I was not able to reproduce it. Can you please try the following patc= h? > >> > >> > >> lock validator friendly locking of new atom in > >> atom_begin_and_assign_to_txnh and locking of two atoms. > >> > >> Signed-off-by: Alexander Zarochentsev > >> --- > >> > >> fs/reiser4/txnmgr.c | 14 ++++++++------ > >> fs/reiser4/txnmgr.h | 15 +++++++++++++++ > >> 2 files changed, 23 insertions(+), 6 deletions(-) > > [patch snipped] > > > I tried this patch: it's slow as hell (CPU is ~100% system) and it > > panics when syncing... > > > > reiser4 panicked cowardly: reiser4[shutdown(1904)]: spin_lock_atom > > (fs/reiser4/txmgr.h:509)[]: > > Hello, > > I tried again with linux 2.6.18-rc3-mm2+hotfixes. > > # booted to runlevel 1 > ~$ mount > ... > /dev/mapper/vglinux1-lvhome on /home type reiserfs (rw) > /dev/mapper/vglinux1-lvccache on /home/laurent/.ccache type reiser4 > (rw,nosuid,nodev,noatime) ... > > ~$ df ~/.ccache > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vglinux1-lvccache > 2.0G 53M 1.9G 3% /home/laurent/.ccache > > ~$ time mv ~/.ccache/* ~/tmp/ccache > 0.10user 6.01system 0:07.92elapsed 77%CPU (0avgtext+0avgdata > 0maxresident)k 0inputs+0outputs (1major+296minor)pagefaults 0swaps > > dmesg output: > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D > [ INFO: possible circular locking dependency detected ] > ------------------------------------------------------- > mv/1255 is trying to acquire lock: > (&txnh->hlock){--..}, at: [] txn_end+0x191/0x368 [reiser4] > > but task is already holding lock: > (&atom->alock){--..}, at: [] txnh_get_atom+0xf6/0x39e > [reiser4] > > which lock already depends on the new lock. > > > the existing dependency chain (in reverse order) is: > > -> #1 (&atom->alock){--..}: > [] lock_acquire+0x60/0x80 > [] _spin_lock+0x19/0x28 > [] try_capture+0x7cf/0x1cd7 [reiser4] > [] longterm_lock_znode+0x427/0x84f [reiser4] > [] seal_validate+0x221/0x5ee [reiser4] > [] find_entry+0x126/0x307 [reiser4] > [] rem_entry_common+0xe9/0x4ba [reiser4] > [] unlink_common+0x258/0x364 [reiser4] > [] vfs_unlink+0x47/0x87 > [] do_unlinkat+0x8c/0x122 > [] sys_unlink+0x10/0x12 > [] sysenter_past_esp+0x56/0x8d > > -> #0 (&txnh->hlock){--..}: > [] lock_acquire+0x60/0x80 > [] _spin_lock+0x19/0x28 > [] txn_end+0x191/0x368 [reiser4] > [] reiser4_exit_context+0x1c2/0x571 [reiser4] > [] unlink_common+0x359/0x364 [reiser4] > [] vfs_unlink+0x47/0x87 > [] do_unlinkat+0x8c/0x122 > [] sys_unlink+0x10/0x12 > [] sysenter_past_esp+0x56/0x8d > > other info that might help us debug this: > > 3 locks held by mv/1255: > #0: (&inode->i_mutex/1){--..}, at: [] > do_unlinkat+0x59/0x122 #1: (&inode->i_mutex){--..}, at: [] > mutex_lock+0x21/0x24 #2: (&atom->alock){--..}, at: [] > txnh_get_atom+0xf6/0x39e [reiser4] > > stack backtrace: > [] show_trace+0xd/0x10 > [] dump_stack+0x19/0x1b > [] print_circular_bug_tail+0x59/0x64 > [] __lock_acquire+0x814/0x9a5 > [] lock_acquire+0x60/0x80 > [] _spin_lock+0x19/0x28 > [] txn_end+0x191/0x368 [reiser4] > [] reiser4_exit_context+0x1c2/0x571 [reiser4] > [] unlink_common+0x359/0x364 [reiser4] > [] vfs_unlink+0x47/0x87 > [] do_unlinkat+0x8c/0x122 > [] sys_unlink+0x10/0x12 > [] sysenter_past_esp+0x56/0x8d > > > ~$ time sync > 0.00user 0.02system 0:00.49elapsed 4%CPU (0avgtext+0avgdata > 0maxresident)k 0inputs+0outputs (1major+202minor)pagefaults 0swaps > > Move the files backward... > ~$ time mv ~/tmp/ccache/* ~/.ccache/ > 0.11user 3.98system 0:04.09elapsed 100%CPU (0avgtext+0avgdata > 0maxresident)k 0inputs+0outputs (0major+286minor)pagefaults 0swaps > > ~$ time sync > 0.00user 0.00system 0:01.86elapsed 0%CPU (0avgtext+0avgdata > 0maxresident)k 0inputs+0outputs (0major+204minor)pagefaults 0swaps > > So this problem still appears in 2.6.18-rc3-mm2+hotfixes. > > I applied the patch zam sent 01 August 2006. Compile, boot to > runlevel 1 and test again. The warning went away but it's now really > really slow: > > ~$ time mv ~/.ccache/* ~/tmp/ccache > 0.09user 15.82system 0:16.08elapsed 98%CPU (0avgtext+0avgdata > 0maxresident)k 0inputs+0outputs (0major+299minor)pagefaults 0swaps > > ouch! 16s versus 8s. > > ~$ time sync > 0.00user 0.10system 0:00.58elapsed 17%CPU (0avgtext+0avgdata > 0maxresident)k 0inputs+0outputs (0major+205minor)pagefaults 0swaps > > Move the files backward... > ~$ time mv ~/tmp/ccache/* ~/.ccache/ > 0.11user 27.20system 0:27.33elapsed 99%CPU (0avgtext+0avgdata > 0maxresident)k 0inputs+0outputs (0major+286minor)pagefaults 0swaps > > 27s versus 4s previously. > > ~$ time sync > 0.00user 0.02system 0:04.35elapsed 0%CPU (0avgtext+0avgdata > 0maxresident)k 0inputs+0outputs (0major+204minor)pagefaults 0swaps > > once again... > ~$ time ~/.ccache/* ~/tmp/ccache > 0.09user 37.43system 0:37.71elapsed 99%CPU (0avgtext+0avgdata > 0maxresident)k 0inputs+0outputs (0major+298minor)pagefaults 0swaps > > ~$ time sync > 0.00user 0.08system 0:00.44elapsed 19%CPU (0avgtext+0avgdata > 0maxresident)k 0inputs+0outputs (0major+204minor)pagefaults 0swaps > > What's going on ? Let's have a look at vmstat output > ~$ vmstat 3 & > [1] 1394 > ~$ procs -----------memory---------- ---swap-- -----io---- --system-- > ----cpu---- r b swpd free buff cache si so bi bo in = =20 > cs us sy id wa 1 0 0 438688 9684 45612 0 0 70 90 26= 9=20 > 50 1 15 81 3 0 0 0 438688 9684 45612 0 0 0 0=20 > 350 58 0 1 99 0 0 0 0 438688 9688 45612 0 0 0 = =20 > 3 251 10 0 0 100 0 0 0 0 438704 9688 45612 0 0 = 0=20 > 0 257 10 0 0 100 0 time mv ~/tmp/ccache/* ~/.ccache/ > > 1 0 0 437944 9728 45684 0 0 0 13 270 25 0 > 76 23 0 1 0 0 437448 9732 46384 0 0 0 4 251 = 8=20 > 0 100 0 0 1 0 0 437448 9784 46096 0 0 0 17 252 = =20 > 16 0 100 0 0 1 0 0 436828 9908 45984 0 0 0 43 2= 54 > 12 0 100 0 0 1 0 0 436580 9908 45844 0 0 0 0= =20 > 250 8 0 100 0 0 1 0 0 436208 10004 46252 0 0 0 = =20 > 42 260 12 0 100 0 0 1 0 0 435836 10004 46176 0 0 = 0 > 0 251 7 1 99 0 0 1 0 0 434968 10096 46868 0 0 = =20 > 0 31 251 11 0 100 0 0 1 0 0 435092 10144 46112 0 = 0=20 > 0 16 251 12 0 100 0 0 0.07user 29.02system 0:29.10elapsed > 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs > (0major+287minor)pagefaults 0swaps > ~$ > 0 0 0 434968 10144 45960 0 0 0 0 251 11 0 > 92 8 0 0 0 0 434968 10192 45960 0 0 0 29 255 1= 1=20 > 0 0 99 1 0 0 0 434984 10192 45960 0 0 0 0 255 = =20 > 27 0 0 100 0 kill %1 > ~$ > [1]+ Complete vmstat 3 > > See? 100% system CPU. Without the patch, I have about 60% user and > 30% system. > > This patch seems to solve the "possible circular locking dependency > detected in txn_end". Moreover, it seems to solve the "possible > circular locking dependency detected wile unmounting reiser4 FS" I > reported in another thread. > > But it still panics while syncing: > reiser4 panicked cowardly: reiser4[sync(23302)]: spin_lock_atom > (fs/reiser4/txnmgr.h:509)[]: Kernel panic - not syncing: > reiser4[sync(23302)]: spin_lock_atom (fs/reiser4/txnmgr.h:509)[]: > Sorry, I am confused. In the above I see that sync completed because you we= re=20 able to time it. Please describe how to reproduce the panic. >=20 > laurent