* [linux-lvm] lvm deadlock with 2.4.x kernel?
@ 2001-05-14 22:11 Tom Otake
  2001-05-15  8:40 ` Joe Thornber
  0 siblings, 1 reply; 16+ messages in thread
From: Tom Otake @ 2001-05-14 22:11 UTC (permalink / raw)
  To: linux-lvm

I'm not sure if this has been brought up yet.  Over the weekend, my
Linux system hung on me twice.  Since this never happened before and
only two things have changed recently (LVM and ReiserFS), I did some
reading and came across the bug report on Sistina's website about LVM
deadlocking Linux.

I'm running kernel 2.4.3 with LVM compiled into the kernel.  LVM is
0.9.1_beta7, reiser is 3.x.0j.  All essential filesystems (/, /usr,
/var, /tmp) are still on ext2 on plain Linux partitions.  Non-essential
filesystems (/home among others) are all on reiserfs on LVM, except
/usr/local, which is still on ext2 on a plain Linux partition.

The first occurrence:
Running vmware (not on lvm/reiser) while browsing the web using netscape
and running seti@home.  The system hung when I tried to access a web
page that appeared to be task intensive; whether it was servlets,
javascript, flash, or something else, I don't know.

The second occurrence:
I was copying a large amount of data from a CDROM to my home dir (on
lvm).  While the copy was in progress, I created a new LV.  This
worked.  The system hung when I ran mkreiserfs on the new LV.

All hard disks and the CD drive are SCSI, no IDE at all.

As I said, I'm not sure if the system hang was caused by the deadlock,
since the system was dead.  If this is related to the deadlock issue,
are there any possible workarounds, besides being mindful of the system
load?

Thanks

--
_______________
Intolerance is the last defense of the insecure.

-- Tom Otake
-- totake66@home.com
-- #550

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-14 22:11 [linux-lvm] lvm deadlock with 2.4.x kernel? Tom Otake
@ 2001-05-15  8:40 ` Joe Thornber
  2001-05-15 22:35   ` Tom Otake
  0 siblings, 1 reply; 16+ messages in thread
From: Joe Thornber @ 2001-05-15  8:40 UTC (permalink / raw)
  To: linux-lvm

On Mon, May 14, 2001 at 05:11:47PM -0500, Tom Otake wrote:
> I'm not sure if this has been brought up yet.  Over the weekend, my
> Linux hung up on me twice.  Considering that this never happened before
> and only two things have recently changed (LVM and ReiserFS) , I did
> some reading and came across the bug report for LVM on sistina's website
> about LVM deadlocking Linux.

All the deadlocking issues have been due to either running snapshots
on 2.2 kernels or doing a 'pvmove' on 2.2 or 2.4.  It doesn't sound
like you were doing either.

> 
> I'm running kernel 2.4.3 with LVM compiled into the kernel.  LVM is
> 0.9.1_beta7, reiser is 3.x.0j.  All essential fs (/, /usr, /var, /tmp)
> are still using ext2 and linux partitions, non essential fs (/home
> amongst others) are all on  reiserfs with LVM, excluding /usr/local,
> which is still on ext2 and Linux partition.
> 
> The first occurrence:
> Running vmware (not on lvm/reiser) while browsing the web using netscape
> and running seti@home.  The system hung on me when I tried to access a
> web page that appeared to be task intensive, whether servlets,
> javascript, flash, or something else, I don't know.

I used to get deadlocks from vmware without using LVM.

> The second occurrence:
> I was copying a large amount of data from a CDROM to my home dir (on
> lvm).  While the copy was in progress, I created a new LV.  This
> worked.  The system hung when I ran mkreiserfs on the new LV.

This sounds more serious.  Can you reproduce it?  If you can, the
quickest way for us to find the problem is for you to build the kernel
with kdb and get stack traces for the relevant threads.

> As I said, I'm not sure if the system hang was caused by the deadlock,
> since the system was dead.  If this is related to the deadlock issue,
> are there any possible workarounds, besides being mindful of the system
> load?

I am not aware of any deadlock issues in beta7.  Has anyone else
experienced problems?

- Joe

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-15  8:40 ` Joe Thornber
@ 2001-05-15 22:35   ` Tom Otake
  2001-05-15 22:49     ` Andreas Dilger
  2001-05-17  2:26     ` Tom Otake
  0 siblings, 2 replies; 16+ messages in thread
From: Tom Otake @ 2001-05-15 22:35 UTC (permalink / raw)
  To: linux-lvm

Yes, I've been able to recreate the second hang scenario, though I have to
admit it wasn't exactly the same.  I started the copy of the data, created a
new LV, which worked.  I ran mkreiserfs on the new LV, it worked.  I removed
the new LV, also worked, then ran pvscan.  That's when the system hung.  All
the while, the copy from CD to disk was going on.

I apologize if I sound like an idiot but I've never taken a stack trace for
the linux kernel.  I assume this will require enabling magic sysrq.  I looked
through the sysrq.txt but it didn't offer too much help, especially on how to
save stack traces, etc.  Would it be possible to get a quick rundown on what
commands/keys I need to use to get the data you need?

Thanks

--
_______________
Love cannot be much younger than the lust for murder.
                -- Sigmund Freud

-- Tom Otake
-- totake66_nospam@home.com
-- Remove _nospam
-- #550

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-15 22:35   ` Tom Otake
@ 2001-05-15 22:49     ` Andreas Dilger
  2001-05-15 23:14       ` Chris Mason
  2001-05-17  2:26     ` Tom Otake
  1 sibling, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2001-05-15 22:49 UTC (permalink / raw)
  To: linux-lvm

Tom Otake writes:
> Yes, I've been able to recreate the second hang scenario, though I have to
> admit it wasn't exactly the same.  I started the copy of the data, created a
> new LV, which worked.  I ran mkreiserfs on the new LV, it worked.  I removed
> the new LV, also worked, then ran pvscan.  That's when the system hung.  All
> the while, the copy from CD to disk was going on.

It may be that this is related to the ongoing ext3 problem.
Basically, if pvscan or vgscan is run (the PV_FLUSH ioctl calls
invalidate_buffers), it puts buffers into a state that is invalid for
the journal code, and this breaks the journaling.  On ext3, there are
assertions in the code which detect the invalid state and cause an oops
(stack trace), but this may not be the case with reiserfs.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-15 22:49     ` Andreas Dilger
@ 2001-05-15 23:14       ` Chris Mason
  2001-05-16  0:32         ` Andreas Dilger
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Mason @ 2001-05-15 23:14 UTC (permalink / raw)
  To: linux-lvm


On Tuesday, May 15, 2001 04:49:25 PM -0600 Andreas Dilger
<adilger@turbolinux.com> wrote:

> Tom Otake writes:
>> Yes, I've been able to recreate the second hang scenario, though I have
>> to admit it wasn't exactly the same.  I started the copy of the data,
>> created a new LV, which worked.  I ran mkreiserfs on the new LV, it
>> worked.  I removed the new LV, also worked, then ran pvscan.  That's
>> when the system hung.  All the while, the copy from CD to disk was going
>> on.
> 
> It may be that this is related to the ext3 problem that is ongoing.
> Basically, if pvscan or vgscan (PV_FLUSH ioctl calling invalidate_buffers)
> is run it causes buffers to go into an invalid state for the journal
> code, and this breaks the journaling.  On ext3, there are assertions in
> the code which detect the invalid state and cause an oops (stack trace),
> but this may not be the case with reiserfs.

reiserfs should catch blocks that don't have the proper bits set when it
starts i/o, and then it makes sure the block hasn't been relogged while the
i/o was in progress.  It sends warnings not an oops though, check your log
files.  If we were losing journal bits, and the log code didn't catch it,
the result should be silent corruption.  

Since he is seeing deadlock, it seems more likely reiserfs is trying to
lock a buffer for i/o, and that is hanging for some reason....

-chris

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-15 23:14       ` Chris Mason
@ 2001-05-16  0:32         ` Andreas Dilger
  2001-05-16  1:17           ` Chris Mason
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2001-05-16  0:32 UTC (permalink / raw)
  To: linux-lvm

Chris writes:
> On Tuesday, May 15, 2001 04:49:25 PM -0600 Andreas Dilger
> <adilger@turbolinux.com> wrote:
> 
> > Tom Otake writes:
> >> Yes, I've been able to recreate the second hang scenario, though I have
> >> to admit it wasn't exactly the same.  I started the copy of the data,
> >> created a new LV, which worked.  I ran mkreiserfs on the new LV, it
> >> worked.  I removed the new LV, also worked, then ran pvscan.  That's
> >> when the system hung.  All the while, the copy from CD to disk was going
> >> on.
> > 
> > It may be that this is related to the ext3 problem that is ongoing.
> > Basically, if pvscan or vgscan (PV_FLUSH ioctl calling invalidate_buffers)
> > is run it causes buffers to go into an invalid state for the journal
> > code, and this breaks the journaling.  On ext3, there are assertions in
> > the code which detect the invalid state and cause an oops (stack trace),
> > but this may not be the case with reiserfs.
> 
> reiserfs should catch blocks that don't have the proper bits set when it
> starts i/o, and then it makes sure the block hasn't been relogged while the
> i/o was in progress.  It sends warnings not an oops though, check your log
> files.  If we were losing journal bits, and the log code didn't catch it,
> the result should be silent corruption.  
> 
> Since he is seeing deadlock, it seems more likely reiserfs is trying to
> lock a buffer for i/o, and that is hanging for some reason....

But what does PV_FLUSH do?  Calls fsync_dev() to flush dirty buffers to
disk, and sync_supers() and waits for buffer I/O completion.  This is
unlikely to be the cause of a problem, because that happens on each
sync call.

It then calls __invalidate_buffers(dev, 0), which destroys everything
but dirty buffers (on ALL buffer lru lists).  Since reiserfs may have
journaled buffers which are not "dirty" in the normal sense, these may
be thrown out.  It is doing _something_ weird with the ext3 buffers,
such that they are essentially gone from the buffer lists, but still
in the journal list.  We have tried tracking it down a bit, but not
successfully yet.

I think some of the debugging tools Andrew Morton made for ext3 on 2.4
will help.  Basically, they allow you to keep a history of what happens
to a buffer through the journal and block layer, so that when you get
a problem with a buffer you can trace back to see who changed it...  I
haven't checked yet whether we still have this invalidate_buffers()
issue in 2.4 ext3.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  0:32         ` Andreas Dilger
@ 2001-05-16  1:17           ` Chris Mason
  2001-05-16  1:50             ` Jay Weber
  2001-05-16  8:39             ` Joe Thornber
  0 siblings, 2 replies; 16+ messages in thread
From: Chris Mason @ 2001-05-16  1:17 UTC (permalink / raw)
  To: linux-lvm


On Tuesday, May 15, 2001 06:32:24 PM -0600 Andreas Dilger
<adilger@turbolinux.com> wrote:

>> reiserfs should catch blocks that don't have the proper bits set when it
>> starts i/o, and then it makes sure the block hasn't been relogged while
>> the i/o was in progress.  It sends warnings not an oops though, check
>> your log files.  If we were losing journal bits, and the log code didn't
>> catch it, the result should be silent corruption.  
>> 
>> Since he is seeing deadlock, it seems more likely reiserfs is trying to
>> lock a buffer for i/o, and that is hanging for some reason....
> 
> But what does PV_FLUSH do?  Calls fsync_dev() to flush dirty buffers to
> disk, and sync_supers() and waits for buffer I/O completion.  This is
> unlikely to be the cause of a problem, because that happens on each
> sync call.
> 
> It then calls __invalidate_buffers(dev, 0), which destroys everything
> but dirty buffers (on ALL buffer lru lists).  

Unless I'm reading it wrong (2.4.4), __invalidate_buffers destroys all
buffers that are clean and have b_count == 0.  Reiserfs keeps b_count > 0
for all metadata buffers that have been logged, while ext3 allows the count
to be zero (but keeps them in the dirty list).
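
(A minimal userspace sketch of the rule as read above, not the kernel
code itself.  struct model_bh and model_would_free() are hypothetical
names made up for illustration, and only the b_count and dirty
conditions mentioned in this thread are modeled.)

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct buffer_head; only the
 * two properties discussed in the thread are modeled. */
struct model_bh {
    int  b_count;  /* reference count held by users of the buffer */
    bool dirty;    /* buffer holds data not yet written to disk */
};

/* Returns true if an invalidate pass following the rule described above
 * would free this buffer.  destroy_dirty_buffers models the second
 * argument of __invalidate_buffers(dev, 0). */
bool model_would_free(const struct model_bh *bh, bool destroy_dirty_buffers)
{
    if (bh->b_count != 0)
        return false;               /* still referenced: left alone */
    return destroy_dirty_buffers || !bh->dirty;
}
```

In this model a buffer pinned with b_count > 0 is never freed, a clean
buffer with b_count == 0 is freed, and a dirty one is freed only when
destroy_dirty_buffers is set.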

__invalidate_buffers also waits on any locked buffers.  Any chance one of
the other LVM ioctls grabs some lvm lock before calling PV_FLUSH?

You're right though, pv_flush certainly doesn't look like it could cause
any deadlocks.

-chris

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  1:17           ` Chris Mason
@ 2001-05-16  1:50             ` Jay Weber
  2001-05-16  3:35               ` Jay Weber
  2001-05-16  8:39             ` Joe Thornber
  1 sibling, 1 reply; 16+ messages in thread
From: Jay Weber @ 2001-05-16  1:50 UTC (permalink / raw)
  To: linux-lvm; +Cc: ext3-users, Joe Thornber, sct

I think I have this one solved, I hope.

I think what Andreas and I are running into are a few different
assertions.  One being the LVM lvm_do_pv_flush caused assertion which is
related directly to invalidate_buffers() being called which then triggers
refile_buffer() on a journaled buffer, which appears clean in all other
ways according to the checks in refile_buffer().

The following is what I've got in __invalidate_buffers() right now.

                        if (!bh->b_count && !buffer_journaled(bh) &&
                            (destroy_dirty_buffers || !buffer_dirty(bh)))
                                put_last_free(bh);
                        if (slept)
                                goto again;

Stephen suggested something along the above a bit ago, except he uses
bh->b_jlist == BJ_None.  buffer_journaled() seems to be a function in fs.h
which seems a bit more appropriate.

Next, with the above we'd still see problems.  My next patch included a
suggestion from Heinz to add lock_kernel() and unlock_kernel() around the
fsync_dev() and invalidate_buffers() calls in lvm.c/lvm_do_pv_flush().
I currently have this in my working kernel.  I'm going to try again
without it, though; it seems that it shouldn't be necessary, since the
other block devices I've looked at don't seem to lock the kernel.

Lastly, I was still getting an assertion generating the "Attempt to refile
free buffer", but this one was actually caused by an ext3 journaling
function calling refile_buffer(), not derived from invalidate_buffers().

In fs/jfs/checkpoint.c/cleanup_transaction(), you'll note it does some
buffer_head bit checks and then calls refile_buffer().  Mine currently
looks like the following:

                if (!buffer_dirty(bh) && !buffer_jdirty(bh) &&
                    !buffer_journaled(bh) &&
                    bh->b_list != BUF_CLEAN) {
                        unlock_journal(journal);
                        refile_buffer(bh);
                        lock_journal(journal);
                        return 1;
                }

Note the addition of the !buffer_journaled(bh) check.
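
(To make the effect of that extra test concrete, here is a small
userspace model.  It is only a sketch: struct model_buf, enum model_list
and both functions are hypothetical names, and the "old" check is
assumed to be the same condition minus the journaled test.)

```c
#include <stdbool.h>

/* Hypothetical model of the list a buffer sits on (bh->b_list). */
enum model_list { MODEL_BUF_CLEAN, MODEL_BUF_LOCKED, MODEL_BUF_DIRTY };

/* Hypothetical model of the state bits tested in the snippet above. */
struct model_buf {
    bool dirty;            /* models buffer_dirty(bh) */
    bool jdirty;           /* models buffer_jdirty(bh) */
    bool journaled;        /* models buffer_journaled(bh) */
    enum model_list list;  /* models bh->b_list */
};

/* Assumed check before the change: no journaled test. */
bool model_refile_old(const struct model_buf *b)
{
    return !b->dirty && !b->jdirty && b->list != MODEL_BUF_CLEAN;
}

/* Check after the change: a journaled buffer is never refiled. */
bool model_refile_new(const struct model_buf *b)
{
    return !b->dirty && !b->jdirty && !b->journaled &&
           b->list != MODEL_BUF_CLEAN;
}
```

A buffer that is clean by both dirty tests but still journaled passes
the old check and fails the new one, which is the case the added test
is meant to exclude.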

Okay, so using all of the above, I have now been running multiple vgscan
loops and a pvscan loop while untarring a kernel, removing the kernel dir,
untarring again, and building the kernel with make -j4 (eating up
my memory and cpu) for nearly an hour with no assertions.

To me it appears that Stephen had it right all along (in the prior thread
on this): he stated that the b_jlist == BJ_None check may be necessary
elsewhere as well, to ensure that there are no journaled buffers out there
before handing back to refile_buffer().  I think that's what we were up
against, and as far as I can tell (grepping for refile_buffer() in the
jfs/* code) I've added the checks to all the appropriate cases.

Andreas can you give the above a try and see if it solves the problem on
your end also.  Stephen, does this look good as far as what I've changed?

Sorry, no diffs just yet, the changes are rather smallish though.

Thanks.

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  1:50             ` Jay Weber
@ 2001-05-16  3:35               ` Jay Weber
  0 siblings, 0 replies; 16+ messages in thread
From: Jay Weber @ 2001-05-16  3:35 UTC (permalink / raw)
  To: ext3-users; +Cc: linux-lvm, Joe Thornber, sct

Nope, soon after I posted the email, the box died.  I'm still hitting
Attempt to refile buffer, which is caused by cleanup_transaction().  I
also reverted to using bh->b_jlist == BJ_None in my tests.

Rereading Andrea's prior thread on this makes me think I'm heading down
the same path he did prior also.  Bummer. :)

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  1:17           ` Chris Mason
  2001-05-16  1:50             ` Jay Weber
@ 2001-05-16  8:39             ` Joe Thornber
  2001-05-16 10:50               ` Jay Weber
                                 ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Joe Thornber @ 2001-05-16  8:39 UTC (permalink / raw)
  To: linux-lvm

On Tue, May 15, 2001 at 09:17:06PM -0400, Chris Mason wrote:
> You're right though, pv_flush certainly doesn't look like it could cause
> any deadlocks.

I must admit I'm struggling to understand why PV_FLUSH even exists.
It does *exactly* the same thing as a BLKFLSBUF ioctl to the pv device
itself.  As such I agree that it's unlikely to be the culprit.
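
(For reference, a minimal userspace sketch of issuing BLKFLSBUF against
a block device.  BLKFLSBUF comes from <linux/fs.h>; the helper name
flush_block_device() and any device path passed to it are assumptions
for illustration, and the call normally needs root privileges.)

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKFLSBUF */

/* Ask the kernel to flush and invalidate the buffer cache for one
 * block device.  Returns 0 on success, -1 on failure with errno set
 * by open() or ioctl(). */
int flush_block_device(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = ioctl(fd, BLKFLSBUF, 0);
    int saved_errno = errno;    /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;

    return rc < 0 ? -1 : 0;
}
```

Called on something that is not a block device, or on a path that does
not exist, the helper simply reports failure.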

- Joe

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  8:39             ` Joe Thornber
@ 2001-05-16 10:50               ` Jay Weber
  2001-05-16 11:06                 ` Joe Thornber
  2001-05-16 10:53               ` Heinz J. Mauelshagen
  2001-05-16 13:20               ` Chris Mason
  2 siblings, 1 reply; 16+ messages in thread
From: Jay Weber @ 2001-05-16 10:50 UTC (permalink / raw)
  To: linux-lvm

On Wed, 16 May 2001, Joe Thornber wrote:

> On Tue, May 15, 2001 at 09:17:06PM -0400, Chris Mason wrote:
> > You're right though, pv_flush certainly doesn't look like it could cause
> > any deadlocks.
>
> I must admit I'm struggling to understand why PV_FLUSH even exists.
> It does *exactly* the same thing as a BLKFLSBUF ioctl to the pv device
> itself.  As such I agree that it's unlikely to be the culprit.

I don't think it is, I think it just appears as such.  I've actually
hacked up my LVM here so that lvm_do_pv_flush() just returns 0.  I don't
get the problem there anymore. :)

I'm digging in the userland code now.  It looks to me as though somewhere
around the vg_copy_to_disk or lseek and write to pv_handle in vg_write.c
is where I see the first instance of a BUF_LOCKED buffer being set to
B_FREE.  I added a printk to my put_last_free() function in buffer.c to
denote when such odd symptoms occur.  Again, only LVM userland tool usage
seems to generate output from that printk, nothing else that I do on the
machine.

And it looks as though vg_write.c in tools/lib is just dropping the VG
offset data and such onto the physical PV itself.  I've noted that
following this I get a lot more printk messages regarding BUF_LOCKED being
set to B_FREE.  The next massive hunk of writes comes from
lv_write_all_lv, so I gather it's during the writeout of LV information to
a PV?

Not sure why writing data to the raw device would generate these printks.
That's the best I've been able to come up with overnight though.

* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  8:39             ` Joe Thornber
  2001-05-16 10:50               ` Jay Weber
@ 2001-05-16 10:53               ` Heinz J. Mauelshagen
  2001-05-16 13:20               ` Chris Mason
  2 siblings, 0 replies; 16+ messages in thread
From: Heinz J. Mauelshagen @ 2001-05-16 10:53 UTC (permalink / raw)
  To: linux-lvm

On Wed, May 16, 2001 at 09:39:29AM +0100, Joe Thornber wrote:
> On Tue, May 15, 2001 at 09:17:06PM -0400, Chris Mason wrote:
> > You're right though, pv_flush certainly doesn't look like it could cause
> > any deadlocks.
> 
> I must admit I'm struggling to understand why PV_FLUSH even exists.
> It does *exactly* the same thing as a BLKFLSBUF ioctl to the pv device
> itself.

Joe is right.
It is a few lines of unnecessary code in the LVM driver ;-)
We can remove it after 1.0 and call the ioctl of the underlying driver directly.

> As such I agree that it's unlikely to be the culprit.

Correct.

-- 

Regards,
Heinz    -- The LVM Guy --

*** Software bugs are stupid.
    Nevertheless it needs not so stupid people to solve them ***

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Sistina Software Inc.
Senior Consultant/Developer                       Am Sonnenhang 11
                                                  56242 Marienrachdorf
                                                  Germany
Mauelshagen@Sistina.com                           +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16 10:50               ` Jay Weber
@ 2001-05-16 11:06                 ` Joe Thornber
  0 siblings, 0 replies; 16+ messages in thread
From: Joe Thornber @ 2001-05-16 11:06 UTC (permalink / raw)
  To: linux-lvm

On Wed, May 16, 2001 at 03:50:33AM -0700, Jay Weber wrote:
> On Wed, 16 May 2001, Joe Thornber wrote:
> 
> > On Tue, May 15, 2001 at 09:17:06PM -0400, Chris Mason wrote:
> > > You're right though, pv_flush certainly doesn't look like it could cause
> > > any deadlocks.
> >
> > I must admit I'm struggling to understand why PV_FLUSH even exists.
> > It does *exactly* the same thing as a BLKFLSBUF ioctl to the pv device
> > itself.  As such I agree that it's unlikely to be the culprit.
> 
> I don't think it is, I think it just appears as such.  I've actually
> hacked up my LVM here so that lvm_do_pv_flush() just returns 0.  I don't
> get the problem there anymore. :)

Agreed, I don't think we're seeing a bug with LVM.  It's just that LVM
(or software RAID in linear mode?) is the only case where you will do a
partial flush, i.e. we flush one PV, but not all of them for the LV.

That's an interesting idea: instead of calling PV_FLUSH, you could try
flushing the whole LV.  Does the problem go away if you do this?
You'll have to hack quite a bit to try this; it's probably easiest to get
the userland tools to check whether the PV is part of an LV, and
if it is, call BLKFLSBUF for the LV, otherwise call BLKFLSBUF for
the PV.
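
A minimal userland sketch of that experiment, under stated assumptions:
flush_blockdev() issues the standard BLKFLSBUF ioctl on a block device,
and choose_flush_target() is a hypothetical helper that prefers the LV
path when one is known.  How the tools would discover which LV (if any)
a PV belongs to is not shown here; the caller passes NULL when there is
none.  This is not code from the LVM tools.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKFLSBUF */

/*
 * Hypothetical helper: pick the device to flush.  If the PV is known to
 * belong to an LV, flush the whole LV; otherwise flush the PV itself.
 * lv_path is NULL when the PV is not part of any LV.
 */
const char *choose_flush_target(const char *pv_path, const char *lv_path)
{
    return lv_path != NULL ? lv_path : pv_path;
}

/*
 * Flush the buffer cache for one block device with BLKFLSBUF.
 * Returns 0 on success and -1 on failure, with errno describing the
 * failure.  The ioctl normally requires root privileges.
 */
int flush_blockdev(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = ioctl(fd, BLKFLSBUF, 0);
    int saved_errno = errno;    /* close() may overwrite errno */

    close(fd);
    errno = saved_errno;
    return rc == 0 ? 0 : -1;
}
```

A tool would then call flush_blockdev(choose_flush_target(pv, lv)) at
the point where it currently issues PV_FLUSH.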

- Joe


* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-16  8:39             ` Joe Thornber
  2001-05-16 10:50               ` Jay Weber
  2001-05-16 10:53               ` Heinz J. Mauelshagen
@ 2001-05-16 13:20               ` Chris Mason
  2 siblings, 0 replies; 16+ messages in thread
From: Chris Mason @ 2001-05-16 13:20 UTC (permalink / raw)
  To: linux-lvm; +Cc: jweber


On Wednesday, May 16, 2001 09:39:29 AM +0100 Joe Thornber
<thornber@btconnect.com> wrote:

> On Tue, May 15, 2001 at 09:17:06PM -0400, Chris Mason wrote:
>> You're right though, pv_flush certainly doesn't look like it could cause
>> any deadlocks.
> 
> I must admit I'm struggling to understand why PV_FLUSH even exists.
> It does *exactly* the same thing as a BLKFLSBUF ioctl to the pv device
> itself.  As such I agree that it's unlikely to be the culprit.

I think there are actually two problems.  Calling invalidate_buffers on
part of an active ext3 FS should hose it (unless ext3 doesn't allow b_count
== 0 on buffers that are clean but still need flushing). 

Adding the BKL on 2.2.x shouldn't do anything, since sys_ioctl grabs it.
Unless the LVM code drops the BKL somewhere, it should be safe.  So, at the
very least ext3 people need Jay's first patch.

The 2.4.x deadlock with reiserfs should be something different.  Reiserfs
should have b_count > 0 on any buffer it cares about.  If PV_FLUSH is never
called with any other locks held, we're probably best off going in with kdb.

-chris


* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-15 22:35   ` Tom Otake
  2001-05-15 22:49     ` Andreas Dilger
@ 2001-05-17  2:26     ` Tom Otake
  2001-05-17 15:31       ` Andreas Dilger
  1 sibling, 1 reply; 16+ messages in thread
From: Tom Otake @ 2001-05-17  2:26 UTC (permalink / raw)
  To: linux-lvm

I've recreated the system hang exactly as I described below, with a pvscan
during a copy.  I used a new kernel with the kernel hacking options enabled and
echoed 1 into /proc/sys/kernel/sysrq.  None of the sysrq commands seemed to
work.  The system hung about 20 to 30 seconds into the copy process.

Tom Otake wrote:

> Yes, I've been able to recreate the second hang scenario, though I have to
> admit it wasn't exactly the same.  I started the copy of the data, created a
> new LV, which worked.  I ran mkreiserfs on the new LV, it worked.  I removed
> the new LV, also worked, then ran pvscan.  That's when the system hung.  All
> the while, the copy from CD to disk was going on.
>
> Joe Thornber wrote:
>
> >
> > > The second occurrence:
> > > I was copying a large amount of data from a CDROM to my home dir (on
> > > lvm).  While the copy was in progress, I created a new LV.  This
> > > worked.  The system hung when I ran mkreiserfs on the new LV.
> >
> > This sounds more serious.  Can you reproduce it?  If you can, the
> > quickest way for us to find the problem is for you to build the kernel
> > with kdb and get stack traces for the relevant threads.
> >


* Re: [linux-lvm] lvm deadlock with 2.4.x kernel?
  2001-05-17  2:26     ` Tom Otake
@ 2001-05-17 15:31       ` Andreas Dilger
  0 siblings, 0 replies; 16+ messages in thread
From: Andreas Dilger @ 2001-05-17 15:31 UTC (permalink / raw)
  To: linux-lvm

Tom Otake writes:
> I've recreated the system hang exactly as I described below, with a pvscan
> during a copy.  I used a new kernel with the kernel hacking options enabled and
> echoed 1 into /proc/sys/kernel/sysrq.  None of the sysrq commands seemed to work.
> The system hung about 20 to 30 seconds into the copy process.

Try using the kdb patches (available at Sourceforge).  They will allow you
to interrupt the kernel anywhere and do a stack trace (via the "bt" command)
to find out where you are stuck.  The SysRQ key does not work if you are
in a tight loop somewhere.
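
For reference, the SysRq enable step described above amounts to writing
"1" to /proc/sys/kernel/sysrq as root.  A minimal C sketch of that
write follows; write_proc_value() is a name made up for this
illustration, and the kernel must have been built with magic SysRq
support for the file to exist.

```c
#include <stdio.h>

/*
 * Write a short string to a /proc (or any other) file.
 * Returns 0 on success, -1 if the file could not be opened or written.
 */
int write_proc_value(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = (fputs(value, f) >= 0);

    /* fclose() flushes the stream; a failed flush means the write failed. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling write_proc_value("/proc/sys/kernel/sysrq", "1\n") has the same
effect as the shell redirect.  It does not help with a hang like this
one, where the key presses are never serviced.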

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert


end of thread, other threads:[~2001-05-17 15:31 UTC | newest]

Thread overview: 16+ messages
2001-05-14 22:11 [linux-lvm] lvm deadlock with 2.4.x kernel? Tom Otake
2001-05-15  8:40 ` Joe Thornber
2001-05-15 22:35   ` Tom Otake
2001-05-15 22:49     ` Andreas Dilger
2001-05-15 23:14       ` Chris Mason
2001-05-16  0:32         ` Andreas Dilger
2001-05-16  1:17           ` Chris Mason
2001-05-16  1:50             ` Jay Weber
2001-05-16  3:35               ` Jay Weber
2001-05-16  8:39             ` Joe Thornber
2001-05-16 10:50               ` Jay Weber
2001-05-16 11:06                 ` Joe Thornber
2001-05-16 10:53               ` Heinz J. Mauelshagen
2001-05-16 13:20               ` Chris Mason
2001-05-17  2:26     ` Tom Otake
2001-05-17 15:31       ` Andreas Dilger
