From: Neil Brown <neilb@suse.de>
To: Tim Small <tim@seoss.co.uk>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
	mike@hartmanipulation.com
Subject: Re: Deadlock in md barrier code? / RAID1 / LVM CoW snapshot + ext3 / Debian 5.0 - lenny 2.6.26 kernel
Date: Sat, 18 Sep 2010 08:59:25 +1000
Message-ID: <20100918085925.5fee83ee@notabene>
In-Reply-To: <4C938103.1010304@seoss.co.uk>
On Fri, 17 Sep 2010 15:53:55 +0100
Tim Small <tim@seoss.co.uk> wrote:
> Hi,
>
> I have a box with a relatively simple setup:
>
> sda + sdb are 1TB SATA drives attached to an Intel ICH10.
> Three partitions on each drive, three md raid1s built on top of these:
>
> md0 /
> md1 swap
> md2 LVM PV
>
>
> During resync about a week ago, processes seemed to deadlock on I/O, the
> machine was still alive but with a load of 100+. A USB drive happened
> to be mounted, so I managed to save /var/log/kern.log. At the time of
> the problem, the monthly RAID check was in progress. On reboot, a
> rebuild commenced, and the same deadlock seemed to occur between roughly
> 2 minutes and 15 minutes after boot.
>
> At this point, the server was running on a Dell PE R300 (12G RAM,
> quad-core), with an LSI SAS controller and 2x 500G SATA drives. I
> shifted all the data onto a spare box (Dell PE R210, ICH10R, 1x1TB
> drive, 8G RAM, quad-core+HT), with only a single drive, so I created the
> md RAID1s with just a single drive in each. The original box was put
> offline with the idea of me debugging it "soon".
>
> This morning, I added in a second 1TB drive, and during the resync
> (approx 1 hour in), the deadlock occurred again. The resync had
> stopped, and any attempt to write to md2 would deadlock the process in
> question. I think it was doing an rsnapshot backup to a USB drive at the
> time the initial problem occurred - this creates an LVM snapshot device
> on top of md2 for the duration of the backup for each filesystem backed
> up (there are two at the moment), and I suppose this results in lots of
> read-copy-update operations - the mounting of the snapshots shows up in
> the logs as the fs-mounts, and subsequent orphan_cleanups. As the
> snapshot survives the reboot, I assume this is what triggers the
> subsequent lockup after the machine has rebooted.
>
> I got a couple of 'echo w > /proc/sysrq-trigger' sets of output this
> time... Edited copies of kern.log are attached - looks like it's
> barrier related. I'd guess the combination of the LVM CoW snapshot, and
> the RAID resync are tickling this bug.
>
>
> Any thoughts? Maybe this is related to Debian bug #584881 -
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584881
>
> ... since the kernel is essentially the same.
>
> I can do some debugging on this out-of-office-hours, or can probably
> resurrect the original hardware to debug that too.
>
> Logs are here:
>
> http://buttersideup.com/files/md-raid1-lockup-lvm-snapshot/
>
> I think vger binned the first version of this email (with the logs
> attached) - so apologies if you've ended up with two copies of this email...
>
> Tim.
>
>
Hi Tim,
unfortunately I need more than just the set of blocked tasks to diagnose the
problem. If you could get the result of
   echo t > /proc/sysrq-trigger
that might help a lot. This might be bigger than the dmesg buffer, so you
might try booting with 'log_buf_len=1M' just to be sure.
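The capture step above could be scripted roughly as follows. This is only a sketch: the helper names `get_log_buf_len` and `capture_task_dump` and the default output path are made up for illustration, and it assumes a root shell with `dmesg` available. The first helper just checks whether a kernel command line carries `log_buf_len`, and the second triggers the task dump and saves the kernel log.

```shell
#!/bin/sh
# Sketch only: helper names and the default output path are hypothetical.

# Print the log_buf_len value found in a kernel command line string.
# Returns non-zero if the parameter is absent.
get_log_buf_len() {
    for arg in $1; do
        case "$arg" in
            log_buf_len=*)
                printf '%s\n' "${arg#log_buf_len=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Dump the state of all tasks into the kernel log, then save the log.
# Must be run as root; the default output path is an arbitrary choice.
capture_task_dump() {
    out=${1:-/tmp/sysrq-t.log}
    echo t > /proc/sysrq-trigger
    dmesg > "$out"
}

# Possible use on the affected machine (as root):
#   get_log_buf_len "$(cat /proc/cmdline)" || echo "log_buf_len not set"
#   capture_task_dump /mnt/usb/sysrq-t.log
```

If syslog is still able to write to disk or to a USB drive during the hang, the same output should also end up in /var/log/kern.log, so saving that file is an alternative to `dmesg`.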
It looks a bit like a bug that was fixed prior to the release of 2.6.26,
but as you are running 2.6.26, it cannot be that.
NeilBrown