Re: Deadlock in md barrier code? / RAID1 / LVM CoW snapshot + ext3 / Debian 5.0 - lenny 2.6.26 kernel

From: Neil Brown <neilb@suse.de>
To: Tim Small <tim@seoss.co.uk>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
	mike@hartmanipulation.com
Subject: Re: Deadlock in md barrier code? / RAID1 / LVM CoW snapshot + ext3 / Debian 5.0 - lenny 2.6.26 kernel
Date: Sat, 18 Sep 2010 08:59:25 +1000
Message-ID: <20100918085925.5fee83ee@notabene>
In-Reply-To: <4C938103.1010304@seoss.co.uk>

On Fri, 17 Sep 2010 15:53:55 +0100
Tim Small <tim@seoss.co.uk> wrote:

> Hi,
> 
> I have a box with a relatively simple setup:
> 
> sda + sdb are 1TB SATA drives attached to an Intel ICH10.
> Three partitions on each drive, three md raid1s built on top of these:
> 
> md0 /
> md1 swap
> md2 LVM PV
> 
> 
> During resync about a week ago, processes seemed to deadlock on I/O, the 
> machine was still alive but with a load of 100+.  A USB drive happened 
> to be mounted, so I managed to save /var/log/kern.log.  At the time of 
> the problem, the monthly RAID check was in progress.  On reboot, a 
> rebuild commenced, and the same deadlock seemed to occur between roughly 
> 2 minutes and 15 minutes after boot.
> 
> At this point, the server was running on a Dell PE R300 (12G RAM, 
> quad-core), with an LSI SAS controller and 2x 500G SATA drives.  I 
> shifted all the data onto a spare box (Dell PE R210, ICH10R, 1x1TB 
> drive, 8G RAM, quad-core+HT), with only a single drive, so I created the 
> md RAID1s with just a single drive in each.  The original box was put 
> offline with the idea of me debugging it "soon".
> 
> This morning, I added in a second 1TB drive, and during the resync 
> (approx 1 hour in), the deadlock occurred again.  The resync had 
> stopped, and any attempt to write to md2 would deadlock the process in 
> question.  I think it was doing an rsnapshot backup to a USB drive at the 
> time the initial problem occurred - this creates an LVM snapshot device 
> on top of md2 for the duration of the backup for each filesystem backed 
> up (there are two at the moment), and I suppose this results in lots of 
> read-copy-update operations - the mounting of the snapshots shows up in 
> the logs as the fs-mounts, and subsequent orphan_cleanups.  As the 
> snapshot survives the reboot, I assume this is what triggers the 
> subsequent lockup after the machine has rebooted.
> 
> I got a couple of 'echo w > /proc/sysrq-trigger' sets of output this 
> time...  Edited copies of kern.log are attached - looks like it's 
> barrier related.  I'd guess the combination of the LVM CoW snapshot, and 
> the RAID resync are tickling this bug.
> 
> 
> Any thoughts?  Maybe this is related to Debian bug #584881 - 
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584881
> 
> ... since the kernel is essentially the same.
> 
> I can do some debugging on this out-of-office-hours, or can probably 
> resurrect the original hardware to debug that too.
> 
> Logs are here:
> 
> http://buttersideup.com/files/md-raid1-lockup-lvm-snapshot/
> 
> I think vger binned the first version of this email (with the logs 
> attached) - so apologies if you've ended up with two copies of this email...
> 
> Tim.
> 
> 

Hi Tim,

 unfortunately I need more than just the set of blocked tasks to diagnose the
 problem.   If you could get the result of 
         echo t > /proc/sysrq-trigger
 that might help a lot.  This might be bigger than the dmesg buffer, so you
 might try booting with 'log_buf_len=1M' just to be sure.
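
 A minimal sketch of that capture sequence, written as shell functions so
 that nothing runs until called.  The function names, the default output
 path and the 1 MiB read size passed to dmesg -s are assumptions for
 illustration; only the sysrq write and the log_buf_len boot parameter come
 from the text above.

```shell
#!/bin/sh
# Hypothetical helpers; names and default paths are illustrative assumptions.

# Print the log_buf_len= value found on a kernel command line, if any.
# Reads /proc/cmdline by default; pass another file to inspect a saved copy.
log_buf_len_setting() {
    cmdline_file=${1:-/proc/cmdline}
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^log_buf_len=//p'
}

# Dump the state of all tasks into the kernel log, then save the log.
# Must run as root.  The -s value asks dmesg to read up to 1 MiB, matching
# a boot with log_buf_len=1M (assumed here; adjust to the size in use).
capture_task_dump() {
    out_file=${1:-/tmp/sysrq-t.log}
    echo t > /proc/sysrq-trigger || return 1
    dmesg -s 1048576 > "$out_file"
}
```

 After a reboot with log_buf_len=1M, log_buf_len_setting should print 1M.
 Once the hang recurs, capture_task_dump can be given an output path on a
 device that is still writable (a mounted USB drive, for instance), since
 writes to the affected array may block.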

 It looks a bit like a bug that was fixed prior to the release of 2.6.26,
 but as you are running 2.6.26, it cannot be that.

NeilBrown
