* [linux-lvm] lvm deadlock with 2.4.x kernel?
@ 2001-05-14 22:11 Tom Otake
  2001-05-15  8:40 ` Joe Thornber
  0 siblings, 1 reply; 16+ messages in thread
From: Tom Otake @ 2001-05-14 22:11 UTC (permalink / raw)
  To: linux-lvm

I'm not sure if this has been brought up yet.  Over the weekend, my
Linux hung up on me twice.  Considering that this never happened before
and only two things have recently changed (LVM and ReiserFS), I did
some reading and came across the bug report for LVM on Sistina's website
about LVM deadlocking Linux.

I'm running kernel 2.4.3 with LVM compiled into the kernel.  LVM is
0.9.1_beta7, reiser is 3.x.0j.  All essential fs (/, /usr, /var, /tmp)
are still using ext2 and Linux partitions; non-essential fs (/home
amongst others) are all on reiserfs with LVM, excluding /usr/local,
which is still on ext2 and a Linux partition.

The first occurrence:
Running vmware (not on lvm/reiser) while browsing the web using netscape
and running seti@home.  The system hung on me when I tried to access a
web page that appeared to be task intensive; whether servlets,
javascript, flash, or something else, I don't know.

The second occurrence:
I was copying a large amount of data from a CDROM to my home dir (on
lvm).  While the copy was in progress, I created a new LV.  This
worked.  The system hung when I ran mkreiserfs on the new LV.

All hard disks and the CD drive are SCSI, no IDE at all.

As I said, I'm not sure if the system hang was caused by the deadlock,
since the system was dead.  If this is related to the deadlock issue,
are there any possible workarounds, besides being mindful of the system
load?

Thanks

--
_______________
Intolerance is the last defense of the insecure.
-- Tom Otake -- totake66@home.com -- #550

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-14 22:11 [linux-lvm] lvm deadlock with 2.4.x kernel? Tom Otake
@ 2001-05-15  8:40 ` Joe Thornber
  2001-05-15 22:35   ` Tom Otake
  0 siblings, 1 reply; 16+ messages in thread
From: Joe Thornber @ 2001-05-15  8:40 UTC
  To: linux-lvm

On Mon, May 14, 2001 at 05:11:47PM -0500, Tom Otake wrote:
> I'm not sure if this has been brought up yet.  Over the weekend, my
> Linux hung up on me twice.  Considering that this never happened before
> and only two things have recently changed (LVM and ReiserFS), I did
> some reading and came across the bug report for LVM on Sistina's website
> about LVM deadlocking Linux.

All the deadlocking issues have been due to either running snapshots
on 2.2 kernels or doing a 'pvmove' on 2.2 or 2.4.  It doesn't sound
like you were doing either.

> I'm running kernel 2.4.3 with LVM compiled into the kernel.  LVM is
> 0.9.1_beta7, reiser is 3.x.0j.  All essential fs (/, /usr, /var, /tmp)
> are still using ext2 and Linux partitions; non-essential fs (/home
> amongst others) are all on reiserfs with LVM, excluding /usr/local,
> which is still on ext2 and a Linux partition.
>
> The first occurrence:
> Running vmware (not on lvm/reiser) while browsing the web using netscape
> and running seti@home.  The system hung on me when I tried to access a
> web page that appeared to be task intensive; whether servlets,
> javascript, flash, or something else, I don't know.

I used to get deadlocks from vmware without using LVM.

> The second occurrence:
> I was copying a large amount of data from a CDROM to my home dir (on
> lvm).  While the copy was in progress, I created a new LV.  This
> worked.  The system hung when I ran mkreiserfs on the new LV.

This sounds more serious.  Can you reproduce it?  If you can, the
quickest way for us to find the problem is for you to build the kernel
with kdb and get stack traces for the relevant threads.

> As I said, I'm not sure if the system hang was caused by the deadlock,
> since the system was dead.  If this is related to the deadlock issue,
> are there any possible workarounds, besides being mindful of the system
> load?

I am not aware of any deadlock issues in beta7.  Has anyone else
experienced problems?

- Joe
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-15  8:40 ` Joe Thornber
@ 2001-05-15 22:35   ` Tom Otake
  2001-05-15 22:49     ` Andreas Dilger
  2001-05-17  2:26     ` Tom Otake
  0 siblings, 2 replies; 16+ messages in thread
From: Tom Otake @ 2001-05-15 22:35 UTC
  To: linux-lvm

Yes, I've been able to recreate the second hang scenario, though I have
to admit it wasn't exactly the same.  I started the copy of the data,
created a new LV, which worked.  I ran mkreiserfs on the new LV, it
worked.  I removed the new LV, also worked, then ran pvscan.  That's
when the system hung.  All the while, the copy from CD to disk was going
on.

I apologize if I sound like an idiot but I've never taken a stack trace
for the linux kernel.  I assume this will require enabling magic sysrq.
I looked through the sysrq.txt but it didn't offer too much help,
especially on how to save stack traces, etc.  Would it be possible to
get a quick rundown on what commands/keys I need to use to get the data
you need?

Thanks

--
_______________
Love cannot be much younger than the lust for murder.
                -- Sigmund Freud
-- Tom Otake -- totake66_nospam@home.com -- Remove _nospam -- #550
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-15 22:35   ` Tom Otake
@ 2001-05-15 22:49     ` Andreas Dilger
  2001-05-15 23:14       ` Chris Mason
  2001-05-17  2:26     ` Tom Otake
  1 sibling, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2001-05-15 22:49 UTC
  To: linux-lvm

Tom Otake writes:
> Yes, I've been able to recreate the second hang scenario, though I have to
> admit it wasn't exactly the same.  I started the copy of the data, created a
> new LV, which worked.  I ran mkreiserfs on the new LV, it worked.  I removed
> the new LV, also worked, then ran pvscan.  That's when the system hung.  All
> the while, the copy from CD to disk was going on.

It may be that this is related to the ext3 problem that is ongoing.
Basically, if pvscan or vgscan (PV_FLUSH ioctl calling invalidate_buffers)
is run, it causes buffers to go into an invalid state for the journal
code, and this breaks the journaling.  On ext3, there are assertions in
the code which detect the invalid state and cause an oops (stack trace),
but this may not be the case with reiserfs.

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-15 22:49     ` Andreas Dilger
@ 2001-05-15 23:14       ` Chris Mason
  2001-05-16  0:32         ` Andreas Dilger
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Mason @ 2001-05-15 23:14 UTC
  To: linux-lvm

On Tuesday, May 15, 2001 04:49:25 PM -0600 Andreas Dilger
<adilger@turbolinux.com> wrote:

> Tom Otake writes:
>> Yes, I've been able to recreate the second hang scenario, though I have
>> to admit it wasn't exactly the same.  I started the copy of the data,
>> created a new LV, which worked.  I ran mkreiserfs on the new LV, it
>> worked.  I removed the new LV, also worked, then ran pvscan.  That's
>> when the system hung.  All the while, the copy from CD to disk was going
>> on.
>
> It may be that this is related to the ext3 problem that is ongoing.
> Basically, if pvscan or vgscan (PV_FLUSH ioctl calling invalidate_buffers)
> is run, it causes buffers to go into an invalid state for the journal
> code, and this breaks the journaling.  On ext3, there are assertions in
> the code which detect the invalid state and cause an oops (stack trace),
> but this may not be the case with reiserfs.

reiserfs should catch blocks that don't have the proper bits set when it
starts i/o, and then it makes sure the block hasn't been relogged while
the i/o was in progress.  It sends warnings rather than an oops though,
so check your log files.  If we were losing journal bits, and the log
code didn't catch it, the result should be silent corruption.

Since he is seeing deadlock, it seems more likely reiserfs is trying to
lock a buffer for i/o, and that is hanging for some reason....

-chris
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-15 23:14       ` Chris Mason
@ 2001-05-16  0:32         ` Andreas Dilger
  2001-05-16  1:17           ` Chris Mason
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2001-05-16  0:32 UTC
  To: linux-lvm

Chris writes:
> On Tuesday, May 15, 2001 04:49:25 PM -0600 Andreas Dilger
> <adilger@turbolinux.com> wrote:
>
> > Tom Otake writes:
> >> Yes, I've been able to recreate the second hang scenario, though I have
> >> to admit it wasn't exactly the same.  I started the copy of the data,
> >> created a new LV, which worked.  I ran mkreiserfs on the new LV, it
> >> worked.  I removed the new LV, also worked, then ran pvscan.  That's
> >> when the system hung.  All the while, the copy from CD to disk was going
> >> on.
> >
> > It may be that this is related to the ext3 problem that is ongoing.
> > Basically, if pvscan or vgscan (PV_FLUSH ioctl calling invalidate_buffers)
> > is run, it causes buffers to go into an invalid state for the journal
> > code, and this breaks the journaling.  On ext3, there are assertions in
> > the code which detect the invalid state and cause an oops (stack trace),
> > but this may not be the case with reiserfs.
>
> reiserfs should catch blocks that don't have the proper bits set when it
> starts i/o, and then it makes sure the block hasn't been relogged while the
> i/o was in progress.  It sends warnings not an oops though, check your log
> files.  If we were losing journal bits, and the log code didn't catch it,
> the result should be silent corruption.
>
> Since he is seeing deadlock, it seems more likely reiserfs is trying to
> lock a buffer for i/o, and that is hanging for some reason....

But what does PV_FLUSH do?  It calls fsync_dev() to flush dirty buffers
to disk, and sync_supers(), and waits for buffer I/O completion.  This is
unlikely to be the cause of a problem, because that happens on each
sync call.

It then calls __invalidate_buffers(dev, 0), which destroys everything
but dirty buffers (on ALL buffer lru lists).  Since reiserfs may have
journaled buffers which are not "dirty" in the normal sense, these may
be thrown out.

It is doing _something_ weird with the ext3 buffers, such that they are
essentially gone from the buffer lists, but still in the journal list.
We have tried tracking it down a bit, but not successfully yet.  I think
some of the debugging tools Andrew Morton made for ext3 on 2.4 will help.
Basically, they allow you to keep a history of what happens to the buffer
through the journal and block layer, so that when you get a problem with
a buffer you can trace back to see who changed it...

I haven't checked yet whether we still have this invalidate_buffers()
issue in 2.4 ext3.

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert
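The buffer-history idea mentioned in the message above can be illustrated with a small userspace sketch. This is not Andrew Morton's ext3 debugging code; the names `buffer_history`, `history_record` and `history_recent` are hypothetical, and the sketch only shows the general technique of keeping the last few call sites that touched a buffer in a fixed-size ring so they can be read back when something goes wrong.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HISTORY_SLOTS 4   /* assumed depth; a real tracer would pick its own */

/* Hypothetical per-buffer trace: remembers the most recent call sites. */
struct buffer_history {
    const char *events[HISTORY_SLOTS]; /* ring of call-site labels */
    size_t next;                       /* slot the next event goes into */
    size_t total;                      /* events recorded since init */
};

void history_init(struct buffer_history *h)
{
    *h = (struct buffer_history){0};
}

/* Record that `where` touched the buffer, overwriting the oldest entry. */
void history_record(struct buffer_history *h, const char *where)
{
    h->events[h->next] = where;
    h->next = (h->next + 1) % HISTORY_SLOTS;
    h->total++;
}

/* Return the n-th most recent event (0 = latest), or NULL if it is no
 * longer (or was never) in the ring. */
const char *history_recent(const struct buffer_history *h, size_t n)
{
    size_t kept = h->total < HISTORY_SLOTS ? h->total : HISTORY_SLOTS;

    if (n >= kept)
        return NULL;
    return h->events[(h->next + HISTORY_SLOTS - 1 - n) % HISTORY_SLOTS];
}
```

With something like this attached to each buffer, a failed assertion could walk `history_recent(h, 0)`, `history_recent(h, 1)`, and so on to see which paths last changed the buffer's state.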
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  0:32         ` Andreas Dilger
@ 2001-05-16  1:17           ` Chris Mason
  2001-05-16  1:50             ` Jay Weber
  2001-05-16  8:39             ` Joe Thornber
  0 siblings, 2 replies; 16+ messages in thread
From: Chris Mason @ 2001-05-16  1:17 UTC
  To: linux-lvm

On Tuesday, May 15, 2001 06:32:24 PM -0600 Andreas Dilger
<adilger@turbolinux.com> wrote:

>> reiserfs should catch blocks that don't have the proper bits set when it
>> starts i/o, and then it makes sure the block hasn't been relogged while
>> the i/o was in progress.  It sends warnings not an oops though, check
>> your log files.  If we were losing journal bits, and the log code didn't
>> catch it, the result should be silent corruption.
>>
>> Since he is seeing deadlock, it seems more likely reiserfs is trying to
>> lock a buffer for i/o, and that is hanging for some reason....
>
> But what does PV_FLUSH do?  Calls fsync_dev() to flush dirty buffers to
> disk, and sync_supers() and waits for buffer I/O completion.  This is
> unlikely to be the cause of a problem, because that happens on each
> sync call.
>
> It then calls __invalidate_buffers(dev, 0), which destroys everything
> but dirty buffers (on ALL buffer lru lists).

Unless I'm reading it wrong (2.4.4), __invalidate_buffers destroys all
buffers that are clean and have b_count == 0.  Reiserfs keeps b_count > 0
for all metadata buffers that have been logged, while ext3 allows the count
to be zero (but keeps them in the dirty list).

__invalidate_buffers also waits on any locked buffers.  Any chance one of
the other LVM ioctls grabs some lvm lock before calling PV_FLUSH?

You're right though, pv_flush certainly doesn't look like it could cause
any deadlocks.

-chris
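The rule Chris describes (only buffers that are clean and have `b_count == 0` get destroyed) can be modelled in userspace as below. This is a sketch under stated assumptions, not kernel code: `struct model_buffer` is a hypothetical stand-in for the two `buffer_head` properties he mentions, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in; NOT the kernel's struct buffer_head. */
struct model_buffer {
    int  b_count; /* references currently held on the buffer */
    bool dirty;   /* true if the buffer still has to be written out */
};

/* Models the rule as described in the message: a buffer is destroyed
 * only when it is clean and nobody holds a reference to it. */
bool would_be_invalidated(const struct model_buffer *bh)
{
    return bh->b_count == 0 && !bh->dirty;
}

/* A logged metadata buffer the way reiserfs is said to keep it:
 * pinned by a non-zero reference count. */
struct model_buffer reiserfs_logged_buffer(void)
{
    struct model_buffer bh = { .b_count = 1, .dirty = false };
    return bh;
}

/* A clean buffer with no references held on it. */
struct model_buffer unpinned_clean_buffer(void)
{
    struct model_buffer bh = { .b_count = 0, .dirty = false };
    return bh;
}
```

In this model the pinned buffer survives the invalidation pass while the unpinned clean one does not, which is why a held reference count is enough to protect logged reiserfs metadata.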
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  1:17           ` Chris Mason
@ 2001-05-16  1:50             ` Jay Weber
  2001-05-16  3:35               ` Jay Weber
  2001-05-16  8:39             ` Joe Thornber
  1 sibling, 1 reply; 16+ messages in thread
From: Jay Weber @ 2001-05-16  1:50 UTC
  To: linux-lvm; +Cc: ext3-users, Joe Thornber, sct

I think I have this one solved, I hope.

I think what Andreas and I are running into are a few different
assertions.  One is the assertion caused by LVM's lvm_do_pv_flush, which
is related directly to invalidate_buffers() being called, which then
triggers refile_buffer() on a journaled buffer that appears clean in all
other ways according to the checks in refile_buffer().

The following is what I've got in __invalidate_buffers() right now.

	if (!bh->b_count && !buffer_journaled(bh) &&
	    (destroy_dirty_buffers || !buffer_dirty(bh)))
		put_last_free(bh);
	if (slept)
		goto again;

Stephen suggested something along the above a bit ago, except he uses
bh->b_jlist == BJ_None.  buffer_journaled() seems to be a function in fs.h
which seems a bit more appropriate.

Next, with the above we'd still see problems.  My next patch included a
suggestion from Heinz to add lock_kernel() and unlock_kernel() around the
fsync_dev() and invalidate_buffers() in lvm.c/lvm_do_pv_flush().
Currently I have this in my working kernel.  I'm going to try again
without it though; it seems that it shouldn't be necessary, since the
other block devices I've looked at don't seem to lock the kernel.

Lastly, I was still getting an assertion generating the "Attempt to refile
free buffer", but this one was actually caused by an ext3 journaling
function calling refile_buffer(), not derived from invalidate_buffers().

In fs/jfs/checkpoint.c/cleanup_transaction(), you'll note it does some
buffer_head bit checks and then calls refile_buffer().  Mine currently
looks like the following:

	if (!buffer_dirty(bh) && !buffer_jdirty(bh) &&
	    !buffer_journaled(bh) &&
	    bh->b_list != BUF_CLEAN) {
		unlock_journal(journal);
		refile_buffer(bh);
		lock_journal(journal);
		return 1;
	}

Note the addition of the !buffer_journaled(bh) check.

Okay, so using all of the above, I have now been running multiple vgscan
loops and a pvscan loop while untarring the kernel, removing the kernel
dir, untarring again, and building the kernel with make -j4 (eating up
my memory and cpu) for nearly an hour with no assertions.

To me it appears that Stephen had it right all along (in the prior thread
on this): he stated that the b_jlist == BJ_None check may be necessary
elsewhere also, to ensure that there are no journaled buffers out there
before handing back to refile_buffer().  I think that's what we were up
against, and as far as I can tell (grepping for refile_buffer() in jfs/*
code) I've added the checks to all the appropriate cases.

Andreas, can you give the above a try and see if it solves the problem on
your end also?  Stephen, does this look good as far as what I've changed?

Sorry, no diffs just yet; the changes are rather small though.

Thanks.
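The effect of the extra `!buffer_journaled(bh)` test in Jay's `__invalidate_buffers()` snippet can be seen in a userspace model. The sketch below is an assumption-laden illustration: `struct model_buffer` and both predicate names are hypothetical, and the "unpatched" predicate is simply Jay's condition with the added check removed, not a copy of any particular kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in; NOT the kernel's struct buffer_head. */
struct model_buffer {
    int  b_count;   /* references held on the buffer */
    bool dirty;     /* models buffer_dirty(bh) */
    bool journaled; /* models buffer_journaled(bh) */
};

/* Same shape as the condition in Jay's snippet, including the added
 * journaled check: only then would put_last_free() be reached. */
bool patched_would_free(const struct model_buffer *bh,
                        bool destroy_dirty_buffers)
{
    return !bh->b_count && !bh->journaled &&
           (destroy_dirty_buffers || !bh->dirty);
}

/* Jay's condition with the journaled check taken out, for comparison. */
bool unpatched_would_free(const struct model_buffer *bh,
                          bool destroy_dirty_buffers)
{
    return !bh->b_count && (destroy_dirty_buffers || !bh->dirty);
}
```

A buffer with `b_count == 0` that is clean but still journaled is freed by the unpatched predicate and kept by the patched one, which is the case the ext3 assertions were tripping over. Buffers that are not journaled are treated identically by both.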
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  1:50             ` Jay Weber
@ 2001-05-16  3:35               ` Jay Weber
  0 siblings, 0 replies; 16+ messages in thread
From: Jay Weber @ 2001-05-16  3:35 UTC
  To: ext3-users; +Cc: linux-lvm, Joe Thornber, sct

Nope, the box died soon after I posted that email.  I'm still hitting
the "Attempt to refile free buffer" assertion, which is caused by
cleanup_transaction().  I reverted to using bh->b_jlist == BJ_None in my
tests also.

Rereading Andrea's prior thread on this makes me think I'm heading down
the same path he did before.  Bummer. :)
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  1:17           ` Chris Mason
  2001-05-16  1:50             ` Jay Weber
@ 2001-05-16  8:39             ` Joe Thornber
  2001-05-16 10:50               ` Jay Weber
                                 ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Joe Thornber @ 2001-05-16  8:39 UTC
  To: linux-lvm

On Tue, May 15, 2001 at 09:17:06PM -0400, Chris Mason wrote:
> You're right though, pv_flush certainly doesn't look like it could cause
> any deadlocks.

I must admit I'm struggling to understand why PV_FLUSH even exists.
It does *exactly* the same thing as a BLKFLSBUF ioctl to the pv device
itself.  As such I agree that it's unlikely to be the culprit.

- Joe
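For readers unfamiliar with the ioctl Joe compares PV_FLUSH to, the sketch below shows how a userspace program can issue `BLKFLSBUF` against a block device. `BLKFLSBUF` is the real constant from `<linux/fs.h>`; the wrapper name `flush_block_device` and its return convention are assumptions made for this example, and the call normally needs root privileges on an actual device.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>   /* BLKFLSBUF */
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to flush and invalidate the buffer
 * cache of the block device at `path`.
 * Returns 0 on success, -1 if the path cannot be opened, is not a block
 * device, or the ioctl fails (for example for lack of privileges). */
int flush_block_device(const char *path)
{
    struct stat st;
    int rc = -1;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    /* Only block devices understand BLKFLSBUF, so check before calling. */
    if (fstat(fd, &st) == 0 && S_ISBLK(st.st_mode))
        rc = (ioctl(fd, BLKFLSBUF, 0) == 0) ? 0 : -1;

    close(fd);
    return rc;
}
```

Calling this on the device node of a physical volume would be the userspace equivalent of what Joe says PV_FLUSH does inside the LVM driver.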
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  8:39             ` Joe Thornber
@ 2001-05-16 10:50               ` Jay Weber
  2001-05-16 11:06                 ` Joe Thornber
  2001-05-16 10:53               ` Heinz J. Mauelshagen
  2001-05-16 13:20               ` Chris Mason
  2 siblings, 1 reply; 16+ messages in thread
From: Jay Weber @ 2001-05-16 10:50 UTC
  To: linux-lvm

On Wed, 16 May 2001, Joe Thornber wrote:

> On Tue, May 15, 2001 at 09:17:06PM -0400, Chris Mason wrote:
> > You're right though, pv_flush certainly doesn't look like it could cause
> > any deadlocks.
>
> I must admit I'm struggling to understand why PV_FLUSH even exists.
> It does *exactly* the same thing as a BLKFLSBUF ioctl to the pv device
> itself.  As such I agree that it's unlikely to be the culprit.

I don't think it is, I think it just appears as such.  I've actually
hacked up my LVM here so that lvm_do_pv_flush() just returns 0.  I don't
get the problem there anymore. :)

I'm digging in the userland code now.  It looks to me as though somewhere
around the vg_copy_to_disk or lseek and write to pv_handle in vg_write.c
is where I see the first instance of a BUF_LOCKED buffer being set to
B_FREE.  I added a printk to my put_last_free() function in buffer.c to
denote when such odd symptoms occur.  Again, only LVM userland tool usage
seems to generate output from that printk, nothing else that I do on the
machine.

And it looks as though vg_write.c in tools/lib is just dropping the VG
offset data and such onto the physical PV itself.  I've noted that
following this I get a lot more printk messages regarding BUF_LOCKED
being set to B_FREE.  The next massive hunk of write is in regards to
lv_write_all_lv, so I gather it's during the writeout of LV information
to a PV?

Not sure why writing data to the raw device would generate these
printks.  That's the best I've been able to come up with overnight
though.
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16 10:50               ` Jay Weber
@ 2001-05-16 11:06                 ` Joe Thornber
  0 siblings, 0 replies; 16+ messages in thread
From: Joe Thornber @ 2001-05-16 11:06 UTC
  To: linux-lvm

On Wed, May 16, 2001 at 03:50:33AM -0700, Jay Weber wrote:
> On Wed, 16 May 2001, Joe Thornber wrote:
>
> > On Tue, May 15, 2001 at 09:17:06PM -0400, Chris Mason wrote:
> > > You're right though, pv_flush certainly doesn't look like it could cause
> > > any deadlocks.
> >
> > I must admit I'm struggling to understand why PV_FLUSH even exists.
> > It does *exactly* the same thing as a BLKFLSBUF ioctl to the pv device
> > itself.  As such I agree that it's unlikely to be the culprit.
>
> I don't think it is, I think it just appears as such.  I've actually
> hacked up my LVM here so that lvm_do_pv_flush() just returns 0.  I don't
> get the problem there anymore. :)

Agreed, I don't think we're seeing a bug with LVM.  It's just that LVM
(or software raid in linear mode?) is the only time you will do a
partial flush, i.e. we flush one PV, but not all of them for the LV.

That's an interesting idea: instead of calling PV_FLUSH, you could try
flushing the whole LV.  Does the problem go away if you do this?  You'll
have to hack quite a bit to try this.  It's probably easiest to get the
userland tools to check whether the PV is part of an LV, and then if it
is, call BLKFLSBUF for the LV, otherwise call BLKFLSBUF for the PV.

- Joe
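The selection step in Joe's suggestion (flush the LV when the PV belongs to one, otherwise flush the PV) is simple enough to sketch. Everything here is hypothetical: `choose_flush_target` is not an LVM tools function, and the way the caller finds out which LV, if any, sits on the PV is left out entirely; the sketch assumes that lookup yields either a device path or `NULL`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decide which device node should receive the
 * BLKFLSBUF ioctl.
 *   pv_path: device node of the physical volume
 *   lv_path: device node of an LV using that PV, or NULL/"" if none
 * Returns lv_path when the PV is in use by an LV, pv_path otherwise. */
const char *choose_flush_target(const char *pv_path, const char *lv_path)
{
    if (lv_path != NULL && lv_path[0] != '\0')
        return lv_path;  /* flush the whole LV rather than one of its PVs */
    return pv_path;      /* unused PV: flushing the PV alone is enough */
}
```

The returned path would then be opened and passed to the `BLKFLSBUF` ioctl, so that a flush never covers only part of a mounted logical volume.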
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  8:39             ` Joe Thornber
  2001-05-16 10:50               ` Jay Weber
@ 2001-05-16 10:53               ` Heinz J. Mauelshagen
  2001-05-16 13:20               ` Chris Mason
  2 siblings, 0 replies; 16+ messages in thread
From: Heinz J. Mauelshagen @ 2001-05-16 10:53 UTC
  To: linux-lvm

On Wed, May 16, 2001 at 09:39:29AM +0100, Joe Thornber wrote:
> On Tue, May 15, 2001 at 09:17:06PM -0400, Chris Mason wrote:
> > You're right though, pv_flush certainly doesn't look like it could cause
> > any deadlocks.
>
> I must admit I'm struggling to understand why PV_FLUSH even exists.
> It does *exactly* the same thing as a BLKFLSBUF ioctl to the pv device
> itself.

Joe is right.  It is a few lines of unnecessary code in the LVM driver ;-)
We can remove it after 1.0 and call the ioctl of the underlying driver
directly.

> As such I agree that it's unlikely to be the culprit.

Correct.

--

Regards,
Heinz    -- The LVM Guy --

*** Software bugs are stupid.
    Nevertheless it needs not so stupid people to solve them ***

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Sistina Software Inc.
Senior Consultant/Developer                       Am Sonnenhang 11
                                                  56242 Marienrachdorf
                                                  Germany
Mauelshagen@Sistina.com                           +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
From: Chris Mason @ 2001-05-16 13:20 UTC
To: linux-lvm; +Cc: jweber

On Wednesday, May 16, 2001 09:39:29 AM +0100 Joe Thornber
<thornber@btconnect.com> wrote:

> On Tue, May 15, 2001 at 09:17:06PM -0400, Chris Mason wrote:
>> You're right though, pv_flush certainly doesn't look like it could cause
>> any deadlocks.
>
> I must admit I'm struggling to understand why PV_FLUSH even exists.
> It does *exactly* the same thing as a BLKFLSBUF ioctl to the pv device
> itself. As such I agree that it's unlikely to be the culprit.

I think there are actually two problems. Calling invalidate_buffers on
part of an active ext3 FS should hose it (unless ext3 doesn't allow b_count
== 0 on buffers that are clean but still need flushing).

Adding the BKL on 2.2.x shouldn't do anything, since sys_ioctl grabs it.
Unless the LVM code drops the BKL somewhere, it should be safe. So, at the
very least ext3 people need Jay's first patch.

The 2.4.x deadlock with reiserfs should be something different. Reiserfs
should have b_count > 0 on any buffer it cares about. If PV_FLUSH is never
called with any other locks held, we're probably best off going in with kdb.

-chris
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
From: Tom Otake @ 2001-05-17 2:26 UTC
To: linux-lvm

I've recreated the system hang exactly as I've described below, with a pvscan
during a copy. I used a new kernel with kernel hacking enabled and redirected 1
into /proc/sys/kernel/sysrq. None of the sysrq commands seemed to work.
The system hung about 20 to 30 seconds into the copy process.

Tom Otake wrote:

> Yes, I've been able to recreate the second hang scenario, though I have to
> admit it wasn't exactly the same. I started the copy of the data, created a
> new LV, which worked. I ran mkreiserfs on the new LV, it worked. I removed
> the new LV, also worked, then ran pvscan. That's when the system hung. All
> the while, the copy from CD to disk was going on.
>
> Joe Thornber wrote:
>
> > > The second occurrence:
> > > I was copying a large amount of data from a CDROM to my home dir (on
> > > lvm). While the copy was in progress, I created a new LV. This
> > > worked. The system hung when I ran mkreiserfs on the new LV.
> >
> > This sounds more serious. Can you reproduce it? If you can, the
> > quickest way for us to find the problem is for you to build the kernel
> > with kdb and get stack traces for the relevant threads.
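For readers who want to repeat the step Tom describes, "redirecting 1 into /proc/sys/kernel/sysrq" amounts to writing the string "1" to that file as root. The following is a minimal sketch of that write; `write_sysctl_value` is a hypothetical helper name introduced here for illustration, and the path is passed in as a parameter so the function can be tried against an ordinary file first.

```c
#include <stdio.h>

/* Hypothetical helper: write a short value string to a sysctl-style file.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
static int write_sysctl_value(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL) {
        perror(path);
        return -1;
    }

    int ok = fputs(value, f) >= 0;

    /* fclose flushes the stream, so a failed write can surface here too. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Run as root, `write_sysctl_value("/proc/sys/kernel/sysrq", "1\n")` would enable the SysRq key on a kernel built with that option. As the reply below notes, SysRq may still not respond if the kernel is stuck in a tight loop.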
* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
From: Andreas Dilger @ 2001-05-17 15:31 UTC
To: linux-lvm

Tom Otake writes:
> I've recreated the system hang exactly as I've described below with a pvscan
> during a copy. I used a new kernel with hacking enabled, redirected 1 into
> /proc/sys/kernel/sysrq. None of the sysrq commands seemed to have worked.
> The system hung about 20 to 30 seconds into the copy process.

Try using the kdb patches (available at Sourceforge). They will allow you
to interrupt the kernel anywhere and do a stack trace (via the "bt" command)
to find out where you are stuck. The SysRq key does not work if you are
in a tight loop somewhere.

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert
Thread overview: 16+ messages

2001-05-14 22:11 [linux-lvm] lvm deadlock with 2.4.x kernel? Tom Otake
2001-05-15  8:40 ` Joe Thornber
2001-05-15 22:35 ` Tom Otake
2001-05-15 22:49 ` Andreas Dilger
2001-05-15 23:14 ` Chris Mason
2001-05-16  0:32 ` Andreas Dilger
2001-05-16  1:17 ` Chris Mason
2001-05-16  1:50 ` Jay Weber
2001-05-16  3:35 ` Jay Weber
2001-05-16  8:39 ` Joe Thornber
2001-05-16 10:50 ` Jay Weber
2001-05-16 11:06 ` Joe Thornber
2001-05-16 10:53 ` Heinz J. Mauelshagen
2001-05-16 13:20 ` Chris Mason
2001-05-17  2:26 ` Tom Otake
2001-05-17 15:31 ` Andreas Dilger