* [patch] md superblock update failures
@ 2005-03-16 13:05 Lars Marowsky-Bree
2005-03-16 13:22 ` Lars Marowsky-Bree
2005-03-16 22:15 ` Neil Brown
0 siblings, 2 replies; 6+ messages in thread
From: Lars Marowsky-Bree @ 2005-03-16 13:05 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 343 bytes --]
Mark found a bug where md doesn't handle write failures when trying to
update the superblock.
Attached is the fix he sent to us, which seems to apply fine to
2.6.11 too.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
[-- Attachment #2: md-superblock-failures --]
[-- Type: text/plain, Size: 2050 bytes --]
From: Mark Rustad
Subject: md does not handle write failures for the superblock
Patch-mainline: 2.6.12
References: 65306
Description by Mark:
I have found that superblock updates that hit write failures on a
raid component device do not fail the device out of the raid. Instead,
the superblock update is simply retried 100 times and then gives up.
It takes a different kind of failing access to the device to finally
fail it out of the raid. This can be seen by simply pulling a raid
device out of an idle system (but with sgraidmon & mdadmd running).
The following patch fails the offending device out of the raid after
the attempted superblock update and then retries the update with one
fewer device. This seems to work very well in our system.
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Lars Marowsky-Bree <lmb@suse.de>
Index: linux-2.6.5/drivers/md/md.c
===================================================================
--- linux-2.6.5.orig/drivers/md/md.c 2005-03-16 13:57:10.381445927 +0100
+++ linux-2.6.5/drivers/md/md.c 2005-03-16 13:57:10.714396523 +0100
@@ -1115,6 +1115,7 @@ static void export_array(mddev_t *mddev)
{
struct list_head *tmp;
mdk_rdev_t *rdev;
+ mdk_rdev_t *frdev;
ITERATE_RDEV(mddev,rdev,tmp) {
if (!rdev->mddev) {
@@ -1288,6 +1289,7 @@ repeat:
mdname(mddev),mddev->in_sync);
err = 0;
+ frdev = 0;
ITERATE_RDEV(mddev,rdev,tmp) {
char b[BDEVNAME_SIZE];
dprintk(KERN_INFO "md: ");
@@ -1296,13 +1298,21 @@ repeat:
dprintk("%s ", bdevname(rdev->bdev,b));
if (!rdev->faulty) {
- err += write_disk_sb(rdev);
+ int ret;
+ ret = write_disk_sb(rdev);
+ if (ret) {
+ frdev = rdev; /* Save failed device */
+ err += ret;
+ }
} else
dprintk(")\n");
if (!err && mddev->level == LEVEL_MULTIPATH)
/* only need to write one superblock... */
break;
}
+ if (frdev)
+ md_error(mddev, frdev); /* Fail the failed device */
+
if (err) {
if (--count) {
printk(KERN_ERR "md: errors occurred during superblock"
* Re: [patch] md superblock update failures
2005-03-16 13:05 [patch] md superblock update failures Lars Marowsky-Bree
@ 2005-03-16 13:22 ` Lars Marowsky-Bree
2005-03-16 14:01 ` Michael Tokarev
2005-03-16 22:15 ` Neil Brown
1 sibling, 1 reply; 6+ messages in thread
From: Lars Marowsky-Bree @ 2005-03-16 13:22 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 499 bytes --]
On 2005-03-16T14:05:12, Lars Marowsky-Bree <lmb@suse.de> wrote:
> Mark found a bug where md doesn't handle write failures when trying to
> update the superblock.
>
> Attached is the fix he sent to us, which seems to apply fine to
> 2.6.11 too.
Oops, sorry. Broken diff due to yours truly. The attached patch
actually compiles.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
[-- Attachment #2: md-superblock-failures --]
[-- Type: text/plain, Size: 2050 bytes --]
From: Mark Rustad
Subject: md does not handle write failures for the superblock
Patch-mainline: 2.6.12
References: 65306
Description by Mark:
I have found that superblock updates that hit write failures on a
raid component device do not fail the device out of the raid. Instead,
the superblock update is simply retried 100 times and then gives up.
It takes a different kind of failing access to the device to finally
fail it out of the raid. This can be seen by simply pulling a raid
device out of an idle system (but with sgraidmon & mdadmd running).
The following patch fails the offending device out of the raid after
the attempted superblock update and then retries the update with one
fewer device. This seems to work very well in our system.
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Lars Marowsky-Bree <lmb@suse.de>
Index: linux-2.6.5/drivers/md/md.c
===================================================================
--- linux-2.6.5.orig/drivers/md/md.c 2005-03-16 13:57:10.381445927 +0100
+++ linux-2.6.5/drivers/md/md.c 2005-03-16 13:57:10.714396523 +0100
@@ -1115,6 +1115,7 @@ static void export_array(mddev_t *mddev)
{
struct list_head *tmp;
mdk_rdev_t *rdev;
+ mdk_rdev_t *frdev;
ITERATE_RDEV(mddev,rdev,tmp) {
if (!rdev->mddev) {
@@ -1288,6 +1289,7 @@ repeat:
mdname(mddev),mddev->in_sync);
err = 0;
+ frdev = 0;
ITERATE_RDEV(mddev,rdev,tmp) {
char b[BDEVNAME_SIZE];
dprintk(KERN_INFO "md: ");
@@ -1296,13 +1298,21 @@ repeat:
dprintk("%s ", bdevname(rdev->bdev,b));
if (!rdev->faulty) {
- err += write_disk_sb(rdev);
+ int ret;
+ ret = write_disk_sb(rdev);
+ if (ret) {
+ frdev = rdev; /* Save failed device */
+ err += ret;
+ }
} else
dprintk(")\n");
if (!err && mddev->level == LEVEL_MULTIPATH)
/* only need to write one superblock... */
break;
}
+ if (frdev)
+ md_error(mddev, frdev); /* Fail the failed device */
+
if (err) {
if (--count) {
printk(KERN_ERR "md: errors occurred during superblock"
* Re: [patch] md superblock update failures
2005-03-16 13:22 ` Lars Marowsky-Bree
@ 2005-03-16 14:01 ` Michael Tokarev
0 siblings, 0 replies; 6+ messages in thread
From: Michael Tokarev @ 2005-03-16 14:01 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: linux-raid
Lars Marowsky-Bree wrote:
[]
> err = 0;
> + frdev = 0;
> ITERATE_RDEV(mddev,rdev,tmp) {
> char b[BDEVNAME_SIZE];
> dprintk(KERN_INFO "md: ");
> @@ -1296,13 +1298,21 @@ repeat:
>
> dprintk("%s ", bdevname(rdev->bdev,b));
> if (!rdev->faulty) {
> - err += write_disk_sb(rdev);
> + int ret;
> + ret = write_disk_sb(rdev);
> + if (ret) {
> + frdev = rdev; /* Save failed device */
> + err += ret;
> + }
Why not
+ int ret;
+ ret = write_disk_sb(rdev);
+ if (ret) {
+ md_error(mddev, rdev);
+ err += ret;
+ }
instead (and remove frdev entirely)?
I mean, in theory, multiple disks may fail...
But wait... will md_error() call this routine again?
/mjt
* Re: [patch] md superblock update failures
2005-03-16 13:05 [patch] md superblock update failures Lars Marowsky-Bree
2005-03-16 13:22 ` Lars Marowsky-Bree
@ 2005-03-16 22:15 ` Neil Brown
2005-03-18 10:39 ` Lars Marowsky-Bree
1 sibling, 1 reply; 6+ messages in thread
From: Neil Brown @ 2005-03-16 22:15 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: linux-raid
On Wednesday March 16, lmb@suse.de wrote:
> Mark found a bug where md doesn't handle write failures when trying to
> update the superblock.
>
> Attached is the fix he sent to us, which seems to apply fine to
> 2.6.11 too.
Yes... thanks....
The whole '100 times' thing is completely bogus, isn't it...
By coincidence, I've just recently been modifying this code to do
writes more intelligently. My goal was to get it to write all the
superblocks in parallel rather than in series. The result is below,
and will probably go to Andrew shortly. It adds the required
md_error and removes the 'try 100 times'.
It also loops 'round to re-write the superblock if one of the writes
failed and thus re-dirtied it.
NeilBrown
========================================
Allow md to update multiple superblocks in parallel.
Currently, md updates all superblocks (one on each device)
in series. It waits for one write to complete before starting
the next. This isn't a big problem, as superblock updates don't
happen that often.
However, it is neater to do it in parallel, and if the drives
in the array have gone to "sleep" after a period of idleness, then
waking them in parallel is faster (and someone else should be
worrying about power drain).
Further, we will need parallel superblock updates for a future
patch which keeps the intent-logging bitmap near the superblock.
Also remove the silly code that retried superblock updates up to
100 times. This simply never made sense.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
### Diffstat output
./drivers/md/md.c | 83 ++++++++++++++++++++++++--------------------
./include/linux/raid/md_k.h | 1
2 files changed, 48 insertions(+), 36 deletions(-)
diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~ 2005-03-14 11:51:29.000000000 +1100
+++ ./drivers/md/md.c 2005-03-15 10:50:13.000000000 +1100
@@ -328,6 +328,40 @@ static void free_disk_sb(mdk_rdev_t * rd
}
+static int super_written(struct bio *bio, unsigned int bytes_done, int error)
+{
+ mdk_rdev_t *rdev = bio->bi_private;
+ if (bio->bi_size)
+ return 1;
+
+ if (error || !test_bit(BIO_UPTODATE, &bio->bi_flags))
+ md_error(rdev->mddev, rdev);
+
+ if (atomic_dec_and_test(&rdev->mddev->pending_writes))
+ wake_up(&rdev->mddev->sb_wait);
+ return 0;
+}
+
+void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
+ sector_t sector, int size, struct page *page)
+{
+ /* write first size bytes of page to sector of rdev
+ * Increment mddev->pending_writes before returning
+ * and decrement it on completion, waking up sb_wait
+ * if zero is reached.
+ * If an error occurred, call md_error
+ */
+ struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+
+ bio->bi_bdev = rdev->bdev;
+ bio->bi_sector = sector;
+ bio_add_page(bio, page, size, 0);
+ bio->bi_private = rdev;
+ bio->bi_end_io = super_written;
+ atomic_inc(&mddev->pending_writes);
+ submit_bio((1<<BIO_RW)|(1<<BIO_RW_SYNC), bio);
+}
+
static int bi_complete(struct bio *bio, unsigned int bytes_done, int error)
{
if (bio->bi_size)
@@ -1240,30 +1274,6 @@ void md_print_devices(void)
}
-static int write_disk_sb(mdk_rdev_t * rdev)
-{
- char b[BDEVNAME_SIZE];
- if (!rdev->sb_loaded) {
- MD_BUG();
- return 1;
- }
- if (rdev->faulty) {
- MD_BUG();
- return 1;
- }
-
- dprintk(KERN_INFO "(write) %s's sb offset: %llu\n",
- bdevname(rdev->bdev,b),
- (unsigned long long)rdev->sb_offset);
-
- if (sync_page_io(rdev->bdev, rdev->sb_offset<<1, MD_SB_BYTES, rdev->sb_page, WRITE))
- return 0;
-
- printk("md: write_disk_sb failed for device %s\n",
- bdevname(rdev->bdev,b));
- return 1;
-}
-
static void sync_sbs(mddev_t * mddev)
{
mdk_rdev_t *rdev;
@@ -1278,7 +1288,7 @@ static void sync_sbs(mddev_t * mddev)
static void md_update_sb(mddev_t * mddev)
{
- int err, count = 100;
+ int err;
struct list_head *tmp;
mdk_rdev_t *rdev;
int sync_req;
@@ -1298,6 +1308,7 @@ repeat:
MD_BUG();
mddev->events --;
}
+ mddev->sb_dirty = 2;
sync_sbs(mddev);
/*
@@ -1325,24 +1336,24 @@ repeat:
dprintk("%s ", bdevname(rdev->bdev,b));
if (!rdev->faulty) {
- err += write_disk_sb(rdev);
+ md_super_write(mddev,rdev,
+ rdev->sb_offset<<1, MD_SB_BYTES,
+ rdev->sb_page);
+ dprintk(KERN_INFO "(write) %s's sb offset: %llu\n",
+ bdevname(rdev->bdev,b),
+ (unsigned long long)rdev->sb_offset);
+
} else
dprintk(")\n");
if (!err && mddev->level == LEVEL_MULTIPATH)
/* only need to write one superblock... */
break;
}
- if (err) {
- if (--count) {
- printk(KERN_ERR "md: errors occurred during superblock"
- " update, repeating\n");
- goto repeat;
- }
- printk(KERN_ERR \
- "md: excessive errors occurred during superblock update, exiting\n");
- }
+ wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
+ /* if there was a failure, sb_dirty was set to 1, and we re-write super */
+
spin_lock(&mddev->write_lock);
- if (mddev->in_sync != sync_req) {
+ if (mddev->in_sync != sync_req|| mddev->sb_dirty == 1) {
/* have to write it out again */
spin_unlock(&mddev->write_lock);
goto repeat;
diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~ 2005-03-15 10:26:24.000000000 +1100
+++ ./include/linux/raid/md_k.h 2005-03-15 10:49:06.000000000 +1100
@@ -262,6 +262,7 @@ struct mddev_s
spinlock_t write_lock;
wait_queue_head_t sb_wait; /* for waiting on superblock updates */
+ atomic_t pending_writes; /* number of active superblock writes */
unsigned int safemode; /* if set, update "clean" superblock
* when no writes pending.
* Re: [patch] md superblock update failures
2005-03-16 22:15 ` Neil Brown
@ 2005-03-18 10:39 ` Lars Marowsky-Bree
2005-03-18 12:57 ` Peter T. Breuer
0 siblings, 1 reply; 6+ messages in thread
From: Lars Marowsky-Bree @ 2005-03-18 10:39 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On 2005-03-17T09:15:39, Neil Brown <neilb@cse.unsw.edu.au> wrote:
> Yes... thanks....
> The whole '100 times' thing is completely bogus, isn't it...
Yes.
> By coincidence, I've just recently been modifying this code to do
> writes more intelligently. My goal was to get it to write all the
> superblocks in parallel rather than in series. The result is below,
> and will probably go to Andrew shortly. It adds the required
> md_error and removes the 'try 100 times'.
> It also loops 'round to re-write the superblock if one of the writes
> failed and thus re-dirtied it.
Your patch is much cleaner and nicer. I'll pick that up when it goes to
Andrew.
Minor cleanup:
> @@ -1325,24 +1336,24 @@ repeat:
>
> dprintk("%s ", bdevname(rdev->bdev,b));
> if (!rdev->faulty) {
> - err += write_disk_sb(rdev);
> + md_super_write(mddev,rdev,
> + rdev->sb_offset<<1, MD_SB_BYTES,
> + rdev->sb_page);
> + dprintk(KERN_INFO "(write) %s's sb offset: %llu\n",
> + bdevname(rdev->bdev,b),
> + (unsigned long long)rdev->sb_offset);
> +
> } else
> dprintk(")\n");
> if (!err && mddev->level == LEVEL_MULTIPATH)
> /* only need to write one superblock... */
> break;
> }
The "!err &&" part can probably go away, right?
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* Re: [patch] md superblock update failures
2005-03-18 10:39 ` Lars Marowsky-Bree
@ 2005-03-18 12:57 ` Peter T. Breuer
0 siblings, 0 replies; 6+ messages in thread
From: Peter T. Breuer @ 2005-03-18 12:57 UTC (permalink / raw)
To: linux-raid
Lars Marowsky-Bree <lmb@suse.de> wrote:
> Minor cleanup:
>
> > @@ -1325,24 +1336,24 @@ repeat:
> >
> > dprintk("%s ", bdevname(rdev->bdev,b));
> > if (!rdev->faulty) {
> > - err += write_disk_sb(rdev);
> > + md_super_write(mddev,rdev,
> > + rdev->sb_offset<<1, MD_SB_BYTES,
> > + rdev->sb_page);
> > + dprintk(KERN_INFO "(write) %s's sb offset: %llu\n",
> > + bdevname(rdev->bdev,b),
> > + (unsigned long long)rdev->sb_offset);
> > +
> > } else
> > dprintk(")\n");
> > if (!err && mddev->level == LEVEL_MULTIPATH)
> > /* only need to write one superblock... */
> > break;
> > }
Maintenance-wise, I'd prefer if (write_disk_sb(rdev) != 0) err++, since
seeing signed values accumulated and ultimately tested only for being
nonzero makes me nervous. Who says somebody won't forget how it's tested
here and let write_disk_sb one day return negative for error and zero
for success?
> The "!err &&" part can probably go away, right?
As to your observation, morally I'm with you on that, since we oughtn't
need to write more superblocks if there has been an error than if there
hasn't.
Peter