Deadlock in md barrier code? / RAID1 / LVM CoW snapshot + ext3 / Debian 5.0 - lenny 2.6.26 kernel

From: Tim Small <tim@seoss.co.uk>
To: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Cc: mike@hartmanipulation.com
Subject: Deadlock in md barrier code? / RAID1 / LVM CoW snapshot + ext3 / Debian 5.0 - lenny 2.6.26 kernel
Date: Fri, 17 Sep 2010 15:53:55 +0100
Message-ID: <4C938103.1010304@seoss.co.uk>

Hi,

I have a box with a relatively simple setup:

sda + sdb are 1TB SATA drives attached to an Intel ICH10.
Three partitions on each drive, three md raid1s built on top of these:

md0 /
md1 swap
md2 LVM PV

During a resync about a week ago, processes seemed to deadlock on I/O; the 
machine was still alive, but with a load of 100+.  A USB drive happened 
to be mounted, so I managed to save /var/log/kern.log.  At the time of 
the problem, the monthly RAID check was in progress.  On reboot, a 
rebuild commenced, and the same deadlock seemed to occur between roughly 
2 minutes and 15 minutes after boot.
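
A minimal sketch, assuming the usual /proc/mdstat progress-line layout, of how a stalled resync or check could be detected by comparing two samples taken some time apart. The sample text and the helper names (parse_mdstat, is_stalled) are illustrative assumptions and are not taken from the logs in this report.

```python
import re

# Progress line printed by md while a resync/recovery/check is running, e.g.
#   [==>......]  resync = 12.6% (122321408/970470016) finish=... speed=...
PROGRESS_RE = re.compile(
    r"(?P<action>resync|recovery|check|repair|reshape)\s*=\s*"
    r"(?P<percent>[\d.]+)%\s*\((?P<done>\d+)/(?P<total>\d+)\)"
)
# First line of each array stanza, e.g. "md2 : active raid1 sdb3[1] sda3[0]"
ARRAY_RE = re.compile(r"^(?P<name>md\d+)\s*:")

# Illustrative sample only (made-up numbers, NOT from the affected machine).
SAMPLE = """\
Personalities : [raid1]
md2 : active raid1 sdb3[1] sda3[0]
      970470016 blocks [2/2] [UU]
      [==>..................]  resync = 12.6% (122321408/970470016) finish=127.5min speed=110825K/sec

md0 : active raid1 sdb1[1] sda1[0]
      4883648 blocks [2/2] [UU]

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: (action, percent, done, total)} for arrays that are syncing."""
    progress = {}
    current = None
    for line in text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            current = m.group("name")
            continue
        m = PROGRESS_RE.search(line)
        if m and current:
            progress[current] = (
                m.group("action"),
                float(m.group("percent")),
                int(m.group("done")),
                int(m.group("total")),
            )
    return progress


def is_stalled(before, after, array):
    """True if `array` was syncing in both samples and its block count did not move."""
    return (
        array in before
        and array in after
        and before[array][2] == after[array][2]
    )
```

On a live system the two samples would come from reading /proc/mdstat twice, a minute or so apart, and passing both parsed results to is_stalled.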

At that point, the server was running on a Dell PE R300 (12G RAM, 
quad-core), with an LSI SAS controller and 2x 500G SATA drives.  I 
shifted all the data onto a spare box (Dell PE R210, ICH10R, 1x1TB 
drive, 8G RAM, quad-core+HT).  As it had only a single drive, I created 
the md RAID1s with just one drive in each.  The original box was taken 
offline with the idea of me debugging it "soon".

This morning, I added in a second 1TB drive, and during the resync 
(approx 1 hour in), the deadlock occurred again.  The resync had 
stopped, and any attempt to write to md2 would deadlock the process in 
question.  I think it was doing an rsnapshot backup to a USB drive at the 
time the initial problem occurred - this creates an LVM snapshot device 
on top of md2 for the duration of the backup for each filesystem backed 
up (there are two at the moment), and I suppose this results in lots of 
copy-on-write (read, copy, then write) operations - the mounting of the 
snapshots shows up in the logs as the fs-mounts, and subsequent 
orphan_cleanups.  As the 
snapshot survives the reboot, I assume this is what triggers the 
subsequent lockup after the machine has rebooted.
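
For anyone trying to reproduce the workload, here is a rough sketch of the per-filesystem snapshot cycle described above, written as a dry-run command builder. The volume group, logical volume, snapshot name, size, mount point and backup command are all hypothetical placeholders, and the read-only mount is an assumption; the exact commands rsnapshot runs on this machine are not shown in this report.

```python
def snapshot_backup_commands(vg, lv, snap_name, snap_size, mountpoint, backup_cmd):
    """Build (but do not run) the commands for one snapshot-based backup pass.

    While the snapshot exists, the first write to each chunk of the origin
    volume makes device-mapper read the old chunk and copy it into the
    snapshot's CoW store before the new data is written, so every such
    write turns into extra reads and writes on the underlying md device.
    """
    origin = f"/dev/{vg}/{lv}"
    snap = f"/dev/{vg}/{snap_name}"
    return [
        # Create a CoW snapshot of the origin volume.
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap_name, origin],
        # Mount the snapshot (read-only here is an assumption).
        ["mount", "-o", "ro", snap, mountpoint],
        # Copy the frozen view of the filesystem somewhere else.
        list(backup_cmd),
        # Tear everything down again.
        ["umount", mountpoint],
        ["lvremove", "--force", snap],
    ]
```

Passing each returned list to subprocess.run(cmd, check=True) as root would execute the cycle. If the machine goes down between the first and last step, the snapshot persists, which matches the behaviour suspected above of the lockup recurring after reboot.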

This time I captured a couple of sets of 'echo w > /proc/sysrq-trigger' 
output.  Edited copies of kern.log are linked below - it looks 
barrier related.  I'd guess the combination of the LVM CoW snapshot and 
the RAID resync is tickling this bug.
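
A small sketch of how the same information could be gathered from a script: it writes 'w' to the sysrq trigger (the equivalent of the echo command above, which asks the kernel to log all blocked tasks) and separately lists processes in uninterruptible sleep (state D) by reading /proc/<pid>/stat. The function names are assumptions made for illustration; writing to the trigger needs root and sysrq enabled.

```python
import os


def request_blocked_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Equivalent of `echo w > /proc/sysrq-trigger`: dump blocked tasks to the kernel log."""
    with open(trigger_path, "w") as f:
        f.write("w")


def parse_stat(stat_text):
    """Return (comm, state) from the contents of a /proc/<pid>/stat file.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' rather than on spaces.
    """
    comm = stat_text[stat_text.index("(") + 1 : stat_text.rindex(")")]
    state = stat_text[stat_text.rindex(")") + 1 :].split()[0]
    return comm, state


def blocked_tasks(proc_root="/proc"):
    """Return [(pid, comm)] for every process currently in state 'D'."""
    if not os.path.isdir(proc_root):
        return []
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable; skip it
        if state == "D":
            found.append((int(entry), comm))
    return found
```

The stack traces themselves still end up in the kernel log (kern.log here); blocked_tasks() only gives a quick list of which processes are stuck.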

Any thoughts?  Maybe this is related to Debian bug #584881 - 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584881

... since the kernel is essentially the same.

I can do some debugging on this out-of-office-hours, or can probably 
resurrect the original hardware to debug that too.

Logs are here:

http://buttersideup.com/files/md-raid1-lockup-lvm-snapshot/

I think vger binned the first version of this email (with the logs 
attached) - so apologies if you've ended up with two copies of this email...

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309

