From: Neil Brown <neilb@suse.de>
To: Tim Small <tim@seoss.co.uk>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
mike@hartmanipulation.com
Subject: Re: Deadlock in md barrier code? / RAID1 / LVM CoW snapshot + ext3 / Debian 5.0 - lenny 2.6.26 kernel
Date: Sat, 18 Sep 2010 08:59:25 +1000
Message-ID: <20100918085925.5fee83ee@notabene>
In-Reply-To: <4C938103.1010304@seoss.co.uk>
On Fri, 17 Sep 2010 15:53:55 +0100
Tim Small <tim@seoss.co.uk> wrote:
> Hi,
>
> I have a box with a relatively simple setup:
>
> sda + sdb are 1TB SATA drives attached to an Intel ICH10.
> Three partitions on each drive, three md raid1s built on top of these:
>
> md0 /
> md1 swap
> md2 LVM PV
>
>
> During resync about a week ago, processes seemed to deadlock on I/O, the
> machine was still alive but with a load of 100+. A USB drive happened
> to be mounted, so I managed to save /var/log/kern.log. At the time of
> the problem, the monthly RAID check was in progress. On reboot, a
> rebuild commenced, and the same deadlock seemed to occur between roughly
> 2 minutes and 15 minutes after boot.
>
> At this point, the server was running on a Dell PE R300 (12G RAM,
> quad-core), with an LSI SAS controller and 2x 500G SATA drives. I
> shifted all the data onto a spare box (Dell PE R210, ICH10R, 1x1TB
> drive, 8G RAM, quad-core+HT), with only a single drive, so I created the
> md RAID1s with just a single drive in each. The original box was put
> offline with the idea of me debugging it "soon".
>
> This morning, I added in a second 1TB drive, and during the resync
> (approx 1 hour in), the deadlock occurred again. The resync had
> stopped, and any attempt to write to md2 would deadlock the process in
> question. I think it was doing an rsnapshot backup to a USB drive at the
> time the initial problem occurred - this creates an LVM snapshot device
> on top of md2 for the duration of the backup for each filesystem backed
> up (there are two at the moment), and I suppose this results in lots of
> read-copy-update operations - the mounting of the snapshots shows up in
> the logs as the fs-mounts, and subsequent orphan_cleanups. As the
> snapshot survives the reboot, I assume this is what triggers the
> subsequent lockup after the machine has rebooted.
>
> I got a couple of 'echo w > /proc/sysrq-trigger' sets of output this
> time... Edited copies of kern.log are attached - looks like it's
> barrier related. I'd guess the combination of the LVM CoW snapshot, and
> the RAID resync are tickling this bug.
>
>
> Any thoughts? Maybe this is related to Debian bug #584881 -
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584881
>
> ... since the kernel is essentially the same.
>
> I can do some debugging on this out-of-office-hours, or can probably
> resurrect the original hardware to debug that too.
>
> Logs are here:
>
> http://buttersideup.com/files/md-raid1-lockup-lvm-snapshot/
>
> I think vger binned the first version of this email (with the logs
> attached) - so apologies if you've ended up with two copies of this email...
>
> Tim.
>
>
Hi Tim,
unfortunately I need more than just the set of blocked tasks to diagnose the
problem. If you could get the result of

   echo t > /proc/sysrq-trigger

that might help a lot. This might be bigger than the dmesg buffer, so you
might try booting with 'log_buf_len=1M' just to be sure.
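
A minimal sketch of that capture step, assuming a root shell and that the
kernel was booted with 'log_buf_len=1M' as suggested. The function name
capture_task_dump and the SYSRQ_TRIGGER, LOG_CMD and SETTLE_SECS variables
are hypothetical, introduced only so the steps can be dry-run without
touching the real trigger file; they are not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: request a dump of all task states (SysRq 't') and
# save the kernel log to a file before the ring buffer wraps.
# Usage: capture_task_dump [output-file]
capture_task_dump() {
    out=${1:-/tmp/sysrq-t.log}
    # Overridable for dry runs; the defaults are the real interfaces.
    trigger=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    log_cmd=${LOG_CMD:-dmesg}

    # Ask the kernel to print every task's state and stack to the kernel log.
    echo t > "$trigger" || return 1

    # Give the kernel a moment to finish printing (the delay is a guess).
    sleep "${SETTLE_SECS:-2}"

    # Save the kernel log; unquoted on purpose so LOG_CMD may carry arguments.
    $log_cmd > "$out"
}
```

Run as root once the hang has occurred, for example
'capture_task_dump /media/usb/sysrq-t.log' (the path is only an example), and
check that the start of the dump has not been cut off by the buffer size.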
It looks a bit like a bug that was fixed prior to the release of 2.6.26,
but as you are running 2.6.26, it cannot be that.
NeilBrown