From: "Steinar H. Gunderson" <sgunderson@bigfoot.com>
To: linux-raid@vger.kernel.org
Subject: [PATCH] Online RAID-5 resizing
Date: Tue, 20 Sep 2005 16:33:46 +0200
Message-ID: <20050920143346.GA5777@uio.no>
[-- Attachment #1: Type: text/plain, Size: 2250 bytes --]
(Please Cc me on any replies, I'm not subscribed)
Hi,
Attached is a patch (against 2.6.12) for adding online RAID-5 resize
capabilities to Linux' RAID code. It needs no changes to mdadm (I've only
tested with mdadm 1.12.0, though); you can just do
mdadm --add /dev/md1 /dev/hd[eg]1
mdadm --grow /dev/md1 -n 4
and it will restripe /dev/md1; you can still use the volume just fine
during the expand process. (cat /proc/mdstat to get the progress; it will
look like a regular sync, and when the restripe is done the volume will
suddenly get larger and do a regular sync of the new parts.)
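To illustrate why adding a disk forces a full restripe, here is a minimal
user-space sketch of where a logical chunk lands for a given number of
disks. It assumes the left-symmetric layout (the usual md default); the
struct and function names are made up for illustration and are not taken
from the patch or from raid5.c.

```c
#include <assert.h>

/* Where one chunk lives: which stripe (row), which disk holds the data,
 * and which disk holds that stripe's parity. Illustrative type only. */
struct chunk_location {
	unsigned long stripe;
	unsigned int data_disk;
	unsigned int parity_disk;
};

/* Map a logical chunk number onto an array of raid_disks devices,
 * ASSUMING the left-symmetric layout. Other layouts use other formulas. */
static struct chunk_location locate_chunk(unsigned long chunk,
					  unsigned int raid_disks)
{
	struct chunk_location loc;
	unsigned int data_disks;

	assert(raid_disks >= 2);
	data_disks = raid_disks - 1;	/* one chunk per stripe is parity */

	loc.stripe = chunk / data_disks;
	/* parity rotates backwards from the last disk, one step per stripe */
	loc.parity_disk = data_disks - (unsigned int)(loc.stripe % raid_disks);
	/* data chunks follow the parity disk, wrapping around */
	loc.data_disk = (loc.parity_disk + 1 +
			 (unsigned int)(chunk % data_disks)) % raid_disks;
	return loc;
}
```

With this mapping, chunk 5 sits in stripe 2 on disk 2 of a three-disk
array, but in stripe 1 on disk 1 once a fourth disk is added, so almost
every chunk has to move when the disk count changes.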
The patch is quite rough -- it's my first trip ever into the md code, the
block layer or really kernel code in general, so expect subtle race
conditions and problems here and there. :-) That being said, it seems to be
quite stable on my (SMP) test system now -- I would really take backups
before testing it, though! You have been warned :-)
Things still to do, off the top of my head:
- It's RAID-5 only; I don't really use RAID-0, and RAID-6 would probably be
more complex.
- It supports only growing, not shrinking. (Not sure if I really care about
fixing this one.)
- It leaks memory; it doesn't properly free up the old stripes etc. at the
end of the resize. (This also makes it impossible to do a grow and then
another grow without stopping and starting the volumes.)
- There is absolutely no crash recovery -- this shouldn't be so hard to do
(just update the superblock every time, with some progress meter, and
restart from that spot in case of a crash), but I have no knowledge of the
on-disk superblock format at all, so some help would be appreciated here.
Also, I'm not really sure what happens if it encounters a bad block during
the restripe.
- It's quite slow; on my test system with old IDE disks, it achieves about
1MB/sec. One could probably make a speed/memory tradeoff here, and move
more chunks at a time instead of just one by one; I'm a bit concerned
about the implications of the kernel allocating something like 64MB in one
go, though :-)
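The crash-recovery idea above (record progress, restart from that spot)
could be sketched roughly as follows. This is only a user-space toy under
stated assumptions: the checkpoint struct is hypothetical and does not
model the real on-disk superblock, and the "crash" is simulated with a
counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical progress record; NOT the real md superblock format. */
struct expand_checkpoint {
	uint64_t next_chunk;	/* first chunk that has not been moved yet */
};

/*
 * Move chunks [cp->next_chunk, total_chunks) one at a time, recording
 * progress after each chunk. fail_after simulates a crash: the run stops
 * once that many chunks have been moved (0 means never crash).
 * Returns the number of chunks moved during this run.
 */
static uint64_t restripe_run(struct expand_checkpoint *cp,
			     uint64_t total_chunks, uint64_t fail_after)
{
	uint64_t moved = 0;

	assert(cp->next_chunk <= total_chunks);
	while (cp->next_chunk < total_chunks) {
		if (fail_after != 0 && moved == fail_after)
			break;		/* simulated crash */
		/* ...copy chunk cp->next_chunk to its new location here... */
		cp->next_chunk++;	/* stands in for "update the superblock" */
		moved++;
	}
	return moved;
}
```

A run interrupted after 4 of 10 chunks leaves next_chunk at 4, and a second
call picks up from there and moves the remaining 6. The sketch ignores the
hard part: a crash between copying a chunk and recording the progress, when
the copy may already have overwritten data that is still needed.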
Comments, patches, fixes etc. would be greatly appreciated. (Again, remember
to Cc me, I'm not on the list.)
/* Steinar */
--
Homepage: http://www.sesse.net/
[-- Attachment #2: raid5-resize.patch --]
[-- Type: text/plain, Size: 55631 bytes --]
diff -ur linux-2.6-2.6.12/drivers/md/raid5.c ../linux-2.6-2.6.12/drivers/md/raid5.c
--- linux-2.6-2.6.12/drivers/md/raid5.c 2005-06-17 21:48:29.000000000 +0200
+++ linux-2.6-2.6.12.patch/drivers/md/raid5.c 2005-09-20 00:13:55.000000000 +0200
@@ -68,19 +68,40 @@
#endif
static void print_raid5_conf (raid5_conf_t *conf);
+#if RAID5_DEBUG
+static void print_sh (struct stripe_head *sh);
+#endif
+static int sync_request (mddev_t *mddev, sector_t sector_nr, int go_faster);
+static void raid5_finish_expand (raid5_conf_t *conf);
+static sector_t raid5_compute_sector(sector_t r_sector, unsigned int raid_disks,
+ unsigned int data_disks, unsigned int * dd_idx,
+ unsigned int * pd_idx, raid5_conf_t *conf);
static inline void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
{
+ PRINTK("__release_stripe, conf=%p\n", conf);
+ BUG_ON(atomic_read(&sh->count) == 0);
if (atomic_dec_and_test(&sh->count)) {
if (!list_empty(&sh->lru))
BUG();
- if (atomic_read(&conf->active_stripes)==0)
- BUG();
+ if (conf->expand_in_progress && sh->disks == conf->raid_disks) {
+ if (atomic_read(&conf->active_stripes_expand)==0)
+ BUG();
+ } else {
+ if (atomic_read(&conf->active_stripes)==0)
+ BUG();
+ }
if (test_bit(STRIPE_HANDLE, &sh->state)) {
- if (test_bit(STRIPE_DELAYED, &sh->state))
+ if (test_bit(STRIPE_DELAY_EXPAND, &sh->state)) {
+ list_add_tail(&sh->lru, &conf->wait_for_expand_list);
+ printk("delaying stripe with sector %llu (expprog=%llu, active=%d)\n", sh->sector,
+ conf->expand_progress, atomic_read(&conf->active_stripes_expand));
+ } else if (test_bit(STRIPE_DELAYED, &sh->state)) {
+// printk("real-delay\n");
list_add_tail(&sh->lru, &conf->delayed_list);
- else
+ } else {
list_add_tail(&sh->lru, &conf->handle_list);
+ }
md_wakeup_thread(conf->mddev->thread);
} else {
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
@@ -88,11 +109,34 @@
if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
md_wakeup_thread(conf->mddev->thread);
}
- list_add_tail(&sh->lru, &conf->inactive_list);
- atomic_dec(&conf->active_stripes);
- if (!conf->inactive_blocked ||
- atomic_read(&conf->active_stripes) < (NR_STRIPES*3/4))
- wake_up(&conf->wait_for_stripe);
+ if (conf->expand_in_progress && sh->disks == conf->raid_disks) {
+ list_add_tail(&sh->lru, &conf->inactive_list_expand);
+ atomic_dec(&conf->active_stripes_expand);
+ } else {
+ list_add_tail(&sh->lru, &conf->inactive_list);
+ if (conf->expand_in_progress == 2) {
+ // we are in the process of finishing up an expand, see
+ // if we have no active stripes left
+ if (atomic_dec_and_test(&conf->active_stripes)) {
+ printk("Finishing up expand\n");
+ raid5_finish_expand(conf);
+ printk("Expand done.\n");
+ }
+ } else {
+ atomic_dec(&conf->active_stripes);
+ }
+ }
+ if (conf->expand_in_progress && sh->disks == conf->raid_disks) {
+ if (!conf->inactive_blocked_expand ||
+ atomic_read(&conf->active_stripes_expand) < (NR_STRIPES*3/4)) {
+ wake_up(&conf->wait_for_stripe_expand);
+ }
+ } else {
+ if (!conf->inactive_blocked ||
+ atomic_read(&conf->active_stripes) < (NR_STRIPES*3/4)) {
+ wake_up(&conf->wait_for_stripe);
+ }
+ }
}
}
}
@@ -133,20 +177,44 @@
/* find an idle stripe, make sure it is unhashed, and return it. */
-static struct stripe_head *get_free_stripe(raid5_conf_t *conf)
+static struct stripe_head *get_free_stripe(raid5_conf_t *conf, int expand)
{
struct stripe_head *sh = NULL;
struct list_head *first;
CHECK_DEVLOCK();
- if (list_empty(&conf->inactive_list))
- goto out;
- first = conf->inactive_list.next;
- sh = list_entry(first, struct stripe_head, lru);
- list_del_init(first);
- remove_hash(sh);
- atomic_inc(&conf->active_stripes);
+
+ if (expand) {
+ if (list_empty(&conf->inactive_list_expand))
+ goto out;
+ first = conf->inactive_list_expand.next;
+ sh = list_entry(first, struct stripe_head, lru);
+ list_del_init(first);
+ remove_hash(sh);
+ atomic_inc(&conf->active_stripes_expand);
+ } else {
+ if (list_empty(&conf->inactive_list))
+ goto out;
+ first = conf->inactive_list.next;
+ sh = list_entry(first, struct stripe_head, lru);
+ list_del_init(first);
+ remove_hash(sh);
+ atomic_inc(&conf->active_stripes);
+ }
out:
+
+ if (sh) {
+ if (conf->expand_in_progress) {
+ if (expand)
+ BUG_ON(sh->disks != conf->raid_disks);
+ else
+ BUG_ON(sh->disks != conf->previous_raid_disks);
+ } else {
+ BUG_ON(expand);
+ BUG_ON(sh->disks != conf->raid_disks);
+ }
+ }
+
return sh;
}
@@ -184,7 +252,7 @@
static inline void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx)
{
raid5_conf_t *conf = sh->raid_conf;
- int disks = conf->raid_disks, i;
+ int disks = sh->disks, i;
if (atomic_read(&sh->count) != 0)
BUG();
@@ -245,21 +313,59 @@
do {
sh = __find_stripe(conf, sector);
+
+ // make sure this is of the right size; if not, remove it from the hash
+ if (sh) {
+ int correct_disks = conf->raid_disks;
+ if (conf->expand_in_progress && sector >= conf->expand_progress) {
+ correct_disks = conf->previous_raid_disks;
+ }
+
+ if (sh->disks != correct_disks) {
+ BUG_ON(atomic_read(&sh->count) != 0);
+
+ remove_hash(sh);
+ sh = NULL;
+ }
+ }
+
if (!sh) {
- if (!conf->inactive_blocked)
- sh = get_free_stripe(conf);
+ if (conf->expand_in_progress && sector * (conf->raid_disks - 1) < conf->expand_progress) {
+ if (!conf->inactive_blocked_expand) {
+ sh = get_free_stripe(conf, 1);
+ }
+ } else {
+ if (!conf->inactive_blocked) {
+ sh = get_free_stripe(conf, 0);
+ }
+ }
if (noblock && sh == NULL)
break;
if (!sh) {
- conf->inactive_blocked = 1;
- wait_event_lock_irq(conf->wait_for_stripe,
- !list_empty(&conf->inactive_list) &&
- (atomic_read(&conf->active_stripes) < (NR_STRIPES *3/4)
- || !conf->inactive_blocked),
- conf->device_lock,
- unplug_slaves(conf->mddev);
- );
- conf->inactive_blocked = 0;
+ if (conf->expand_in_progress && sector * (conf->raid_disks - 1) < conf->expand_progress) {
+// printk("WAITING FOR AN EXPAND STRIPE\n");
+ conf->inactive_blocked_expand = 1;
+ wait_event_lock_irq(conf->wait_for_stripe_expand,
+ !list_empty(&conf->inactive_list_expand) &&
+ (atomic_read(&conf->active_stripes_expand) < (NR_STRIPES *3/4)
+ || !conf->inactive_blocked_expand),
+ conf->device_lock,
+ unplug_slaves(conf->mddev);
+ );
+ conf->inactive_blocked_expand = 0;
+ } else {
+// printk("WAITING FOR A NON-EXPAND STRIPE, sector=%llu\n", sector);
+ conf->inactive_blocked = 1;
+ wait_event_lock_irq(conf->wait_for_stripe,
+ !list_empty(&conf->inactive_list) &&
+ (atomic_read(&conf->active_stripes) < (NR_STRIPES *3/4)
+ || !conf->inactive_blocked),
+ conf->device_lock,
+ unplug_slaves(conf->mddev);
+ );
+ conf->inactive_blocked = 0;
+ }
+// printk("INACTIVITY DONE\n");
} else
init_stripe(sh, sector, pd_idx);
} else {
@@ -267,8 +373,13 @@
if (!list_empty(&sh->lru))
BUG();
} else {
- if (!test_bit(STRIPE_HANDLE, &sh->state))
- atomic_inc(&conf->active_stripes);
+ if (!test_bit(STRIPE_HANDLE, &sh->state)) {
+ if (conf->expand_in_progress && sector < conf->expand_progress) {
+ atomic_inc(&conf->active_stripes_expand);
+ } else {
+ atomic_inc(&conf->active_stripes);
+ }
+ }
if (list_empty(&sh->lru))
BUG();
list_del_init(&sh->lru);
@@ -283,26 +394,34 @@
return sh;
}
-static int grow_stripes(raid5_conf_t *conf, int num)
+static int grow_stripes(raid5_conf_t *conf, int num, int expand)
{
struct stripe_head *sh;
kmem_cache_t *sc;
int devs = conf->raid_disks;
- sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
+ if (expand)
+ sprintf(conf->cache_name, "raid5e/%s", mdname(conf->mddev));
+ else
+ sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
sc = kmem_cache_create(conf->cache_name,
sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
0, 0, NULL, NULL);
if (!sc)
return 1;
- conf->slab_cache = sc;
+ if (expand)
+ conf->slab_cache_expand = sc;
+ else
+ conf->slab_cache = sc;
while (num--) {
sh = kmem_cache_alloc(sc, GFP_KERNEL);
if (!sh)
return 1;
+ printk("alloc stripe: %p\n", sh);
memset(sh, 0, sizeof(*sh) + (devs-1)*sizeof(struct r5dev));
sh->raid_conf = conf;
+ sh->disks = conf->raid_disks;
spin_lock_init(&sh->lock);
if (grow_buffers(sh, conf->raid_disks)) {
@@ -312,10 +431,15 @@
}
/* we just created an active stripe so... */
atomic_set(&sh->count, 1);
- atomic_inc(&conf->active_stripes);
+ if (expand) {
+ atomic_inc(&conf->active_stripes_expand);
+ } else {
+ atomic_inc(&conf->active_stripes);
+ }
INIT_LIST_HEAD(&sh->lru);
release_stripe(sh);
}
+ printk("done growing\n");
return 0;
}
@@ -325,7 +449,7 @@
while (1) {
spin_lock_irq(&conf->device_lock);
- sh = get_free_stripe(conf);
+ sh = get_free_stripe(conf, 0);
spin_unlock_irq(&conf->device_lock);
if (!sh)
break;
@@ -344,7 +468,7 @@
{
struct stripe_head *sh = bi->bi_private;
raid5_conf_t *conf = sh->raid_conf;
- int disks = conf->raid_disks, i;
+ int disks = sh->disks, i;
int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
if (bi->bi_size)
@@ -393,6 +517,8 @@
set_bit(R5_UPTODATE, &sh->dev[i].flags);
#endif
} else {
+ printk("received non-up-to-date information for disk %u, sector %llu!\n",
+ i, sh->sector);
md_error(conf->mddev, conf->disks[i].rdev);
clear_bit(R5_UPTODATE, &sh->dev[i].flags);
}
@@ -411,12 +537,93 @@
return 0;
}
+
+static void raid5_finish_expand (raid5_conf_t *conf)
+{
+ int i;
+ struct disk_info *tmp;
+// shrink_stripes(conf);
+
+ conf->expand_in_progress = 0;
+ conf->active_stripes = conf->active_stripes_expand;
+ conf->inactive_list = conf->inactive_list_expand;
+ conf->wait_for_stripe = conf->wait_for_stripe_expand;
+ conf->slab_cache = conf->slab_cache_expand;
+ conf->inactive_blocked = conf->inactive_blocked_expand;
+
+ // fix up linked list
+ conf->inactive_list.next->prev = &conf->inactive_list;
+ {
+ struct list_head *first = &conf->inactive_list;
+ while (1) {
+ if (first->next == &conf->inactive_list_expand) {
+ first->next = &conf->inactive_list;
+ break;
+ }
+
+ first = first->next;
+ }
+ }
+
+ conf->wait_for_stripe.task_list.next->prev = &conf->wait_for_stripe.task_list;
+ {
+ struct list_head *first = &conf->wait_for_stripe.task_list;
+ while (1) {
+ if (first->next == &conf->wait_for_stripe_expand.task_list) {
+ first->next = &conf->wait_for_stripe.task_list;
+ break;
+ }
+
+ first = first->next;
+ }
+ }
+
+ for (i = conf->previous_raid_disks; i < conf->raid_disks; i++) {
+ tmp = conf->disks + i;
+ if (tmp->rdev
+ && !tmp->rdev->faulty
+ && !tmp->rdev->in_sync) {
+ conf->mddev->degraded--;
+ conf->failed_disks--;
+ conf->working_disks++;
+ tmp->rdev->in_sync = 1;
+ }
+ }
+
+ // hey, mr. md code: we have more space now!
+ {
+ struct block_device *bdev;
+ sector_t sync_sector;
+ unsigned dummy1, dummy2;
+
+ conf->mddev->array_size = conf->mddev->size * (conf->mddev->raid_disks-1);
+ set_capacity(conf->mddev->gendisk, conf->mddev->array_size << 1);
+ conf->mddev->changed = 1;
+
+ sync_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+ conf->raid_disks - 1, &dummy1, &dummy2, conf);
+
+ conf->mddev->recovery_cp = sync_sector << 1; // FIXME: hum, hum
+ set_bit(MD_RECOVERY_NEEDED, &conf->mddev->recovery);
+
+ bdev = bdget_disk(conf->mddev->gendisk, 0);
+ if (bdev) {
+ down(&bdev->bd_inode->i_sem);
+ i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
+ up(&bdev->bd_inode->i_sem);
+ bdput(bdev);
+ }
+ }
+
+ /* FIXME: free old stuff here! (what are we missing?) */
+}
+
static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
int error)
{
struct stripe_head *sh = bi->bi_private;
raid5_conf_t *conf = sh->raid_conf;
- int disks = conf->raid_disks, i;
+ int disks = sh->disks, i;
unsigned long flags;
int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -436,8 +643,11 @@
}
spin_lock_irqsave(&conf->device_lock, flags);
- if (!uptodate)
+ if (!uptodate) {
+ printk("end_write_request ends with error, for disk %u sector %llu\n",
+ i, sh->sector);
md_error(conf->mddev, conf->disks[i].rdev);
+ }
rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
@@ -512,12 +722,14 @@
int sectors_per_chunk = conf->chunk_size >> 9;
/* First compute the information on this sector */
+ PRINTK("r_sector_inp=%llu\n", r_sector);
/*
* Compute the chunk number and the sector offset inside the chunk
*/
chunk_offset = sector_div(r_sector, sectors_per_chunk);
chunk_number = r_sector;
+ PRINTK("r_sector=%llu, chunk_number=%lu\n", r_sector, chunk_number);
BUG_ON(r_sector != chunk_number);
/*
@@ -556,7 +768,7 @@
break;
default:
printk("raid5: unsupported algorithm %d\n",
- conf->algorithm);
+ conf->algorithm);
}
/*
@@ -570,7 +782,7 @@
static sector_t compute_blocknr(struct stripe_head *sh, int i)
{
raid5_conf_t *conf = sh->raid_conf;
- int raid_disks = conf->raid_disks, data_disks = raid_disks - 1;
+ int raid_disks = sh->disks, data_disks = raid_disks - 1;
sector_t new_sector = sh->sector, check;
int sectors_per_chunk = conf->chunk_size >> 9;
sector_t stripe;
@@ -582,7 +794,7 @@
stripe = new_sector;
BUG_ON(new_sector != stripe);
-
+
switch (conf->algorithm) {
case ALGORITHM_LEFT_ASYMMETRIC:
case ALGORITHM_RIGHT_ASYMMETRIC:
@@ -597,7 +809,7 @@
break;
default:
printk("raid5: unsupported algorithm %d\n",
- conf->algorithm);
+ conf->algorithm);
}
chunk_number = stripe * data_disks + i;
@@ -605,7 +817,8 @@
check = raid5_compute_sector (r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf);
if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
- printk("compute_blocknr: map not correct\n");
+ printk("compute_blocknr: map not correct (%llu,%u,%u vs. %llu,%u,%u) disks=%u offset=%u virtual_dd=%u\n",
+ check, dummy1, dummy2, sh->sector, dd_idx, sh->pd_idx, sh->disks, chunk_offset, i);
return 0;
}
return r_sector;
@@ -620,8 +833,8 @@
* All iovecs in the bio must be considered.
*/
static void copy_data(int frombio, struct bio *bio,
- struct page *page,
- sector_t sector)
+ struct page *page,
+ sector_t sector)
{
char *pa = page_address(page);
struct bio_vec *bvl;
@@ -646,7 +859,7 @@
if (len > 0 && page_offset + len > STRIPE_SIZE)
clen = STRIPE_SIZE - page_offset;
else clen = len;
-
+
if (clen > 0) {
char *ba = __bio_kmap_atomic(bio, i, KM_USER0);
if (frombio)
@@ -662,21 +875,21 @@
}
#define check_xor() do { \
- if (count == MAX_XOR_BLOCKS) { \
- xor_block(count, STRIPE_SIZE, ptr); \
- count = 1; \
- } \
- } while(0)
+ if (count == MAX_XOR_BLOCKS) { \
+ xor_block(count, STRIPE_SIZE, ptr); \
+ count = 1; \
+ } \
+} while(0)
static void compute_block(struct stripe_head *sh, int dd_idx)
{
- raid5_conf_t *conf = sh->raid_conf;
- int i, count, disks = conf->raid_disks;
+ // raid5_conf_t *conf = sh->raid_conf;
+ int i, count, disks = sh->disks;
void *ptr[MAX_XOR_BLOCKS], *p;
PRINTK("compute_block, stripe %llu, idx %d\n",
- (unsigned long long)sh->sector, dd_idx);
+ (unsigned long long)sh->sector, dd_idx);
ptr[0] = page_address(sh->dev[dd_idx].page);
memset(ptr[0], 0, STRIPE_SIZE);
@@ -689,8 +902,8 @@
ptr[count++] = p;
else
printk("compute_block() %d, stripe %llu, %d"
- " not present\n", dd_idx,
- (unsigned long long)sh->sector, i);
+ " not present\n", dd_idx,
+ (unsigned long long)sh->sector, i);
check_xor();
}
@@ -702,59 +915,59 @@
static void compute_parity(struct stripe_head *sh, int method)
{
raid5_conf_t *conf = sh->raid_conf;
- int i, pd_idx = sh->pd_idx, disks = conf->raid_disks, count;
+ int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
void *ptr[MAX_XOR_BLOCKS];
struct bio *chosen;
PRINTK("compute_parity, stripe %llu, method %d\n",
- (unsigned long long)sh->sector, method);
+ (unsigned long long)sh->sector, method);
count = 1;
ptr[0] = page_address(sh->dev[pd_idx].page);
switch(method) {
- case READ_MODIFY_WRITE:
- if (!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags))
- BUG();
- for (i=disks ; i-- ;) {
- if (i==pd_idx)
- continue;
- if (sh->dev[i].towrite &&
- test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
- ptr[count++] = page_address(sh->dev[i].page);
- chosen = sh->dev[i].towrite;
- sh->dev[i].towrite = NULL;
-
- if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
- wake_up(&conf->wait_for_overlap);
-
- if (sh->dev[i].written) BUG();
- sh->dev[i].written = chosen;
- check_xor();
+ case READ_MODIFY_WRITE:
+ if (!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags))
+ BUG();
+ for (i=disks ; i-- ;) {
+ if (i==pd_idx)
+ continue;
+ if (sh->dev[i].towrite &&
+ test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
+ ptr[count++] = page_address(sh->dev[i].page);
+ chosen = sh->dev[i].towrite;
+ sh->dev[i].towrite = NULL;
+
+ if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+ wake_up(&conf->wait_for_overlap);
+
+ if (sh->dev[i].written) BUG();
+ sh->dev[i].written = chosen;
+ check_xor();
+ }
}
- }
- break;
- case RECONSTRUCT_WRITE:
- memset(ptr[0], 0, STRIPE_SIZE);
- for (i= disks; i-- ;)
- if (i!=pd_idx && sh->dev[i].towrite) {
- chosen = sh->dev[i].towrite;
- sh->dev[i].towrite = NULL;
+ break;
+ case RECONSTRUCT_WRITE:
+ memset(ptr[0], 0, STRIPE_SIZE);
+ for (i= disks; i-- ;)
+ if (i!=pd_idx && sh->dev[i].towrite) {
+ chosen = sh->dev[i].towrite;
+ sh->dev[i].towrite = NULL;
- if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
- wake_up(&conf->wait_for_overlap);
+ if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+ wake_up(&conf->wait_for_overlap);
- if (sh->dev[i].written) BUG();
- sh->dev[i].written = chosen;
- }
- break;
- case CHECK_PARITY:
- break;
+ if (sh->dev[i].written) BUG();
+ sh->dev[i].written = chosen;
+ }
+ break;
+ case CHECK_PARITY:
+ break;
}
if (count>1) {
xor_block(count, STRIPE_SIZE, ptr);
count = 1;
}
-
+
for (i = disks; i--;)
if (sh->dev[i].written) {
sector_t sector = sh->dev[i].sector;
@@ -769,24 +982,24 @@
}
switch(method) {
- case RECONSTRUCT_WRITE:
- case CHECK_PARITY:
- for (i=disks; i--;)
- if (i != pd_idx) {
- ptr[count++] = page_address(sh->dev[i].page);
- check_xor();
- }
- break;
- case READ_MODIFY_WRITE:
- for (i = disks; i--;)
- if (sh->dev[i].written) {
- ptr[count++] = page_address(sh->dev[i].page);
- check_xor();
- }
+ case RECONSTRUCT_WRITE:
+ case CHECK_PARITY:
+ for (i=disks; i--;)
+ if (i != pd_idx) {
+ ptr[count++] = page_address(sh->dev[i].page);
+ check_xor();
+ }
+ break;
+ case READ_MODIFY_WRITE:
+ for (i = disks; i--;)
+ if (sh->dev[i].written) {
+ ptr[count++] = page_address(sh->dev[i].page);
+ check_xor();
+ }
}
if (count != 1)
xor_block(count, STRIPE_SIZE, ptr);
-
+
if (method != CHECK_PARITY) {
set_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
set_bit(R5_LOCKED, &sh->dev[pd_idx].flags);
@@ -805,16 +1018,18 @@
raid5_conf_t *conf = sh->raid_conf;
PRINTK("adding bh b#%llu to stripe s#%llu\n",
- (unsigned long long)bi->bi_sector,
- (unsigned long long)sh->sector);
+ (unsigned long long)bi->bi_sector,
+ (unsigned long long)sh->sector);
spin_lock(&sh->lock);
spin_lock_irq(&conf->device_lock);
+ PRINTK("lock, DISKS: %u\n", sh->disks);
if (forwrite)
bip = &sh->dev[dd_idx].towrite;
else
bip = &sh->dev[dd_idx].toread;
+ PRINTK("pip, disk=%u, bip=%p, num_disks=%u\n", dd_idx, bip, sh->disks);
while (*bip && (*bip)->bi_sector < bi->bi_sector) {
if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
goto overlap;
@@ -833,16 +1048,16 @@
spin_unlock(&sh->lock);
PRINTK("added bi b#%llu to stripe s#%llu, disk %d.\n",
- (unsigned long long)bi->bi_sector,
- (unsigned long long)sh->sector, dd_idx);
+ (unsigned long long)bi->bi_sector,
+ (unsigned long long)sh->sector, dd_idx);
if (forwrite) {
/* check if page is covered */
sector_t sector = sh->dev[dd_idx].sector;
for (bi=sh->dev[dd_idx].towrite;
- sector < sh->dev[dd_idx].sector + STRIPE_SECTORS &&
- bi && bi->bi_sector <= sector;
- bi = r5_next_bio(bi, sh->dev[dd_idx].sector)) {
+ sector < sh->dev[dd_idx].sector + STRIPE_SECTORS &&
+ bi && bi->bi_sector <= sector;
+ bi = r5_next_bio(bi, sh->dev[dd_idx].sector)) {
if (bi->bi_sector + (bi->bi_size>>9) >= sector)
sector = bi->bi_sector + (bi->bi_size>>9);
}
@@ -851,7 +1066,9 @@
}
return 1;
- overlap:
+overlap:
+ printk("overlap\n");
+
set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
spin_unlock_irq(&conf->device_lock);
spin_unlock(&sh->lock);
@@ -876,11 +1093,11 @@
* get BH_Lock set before the stripe lock is released.
*
*/
-
+
static void handle_stripe(struct stripe_head *sh)
{
raid5_conf_t *conf = sh->raid_conf;
- int disks = conf->raid_disks;
+ int disks = sh->disks;
struct bio *return_bi= NULL;
struct bio *bi;
int i;
@@ -891,12 +1108,13 @@
struct r5dev *dev;
PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n",
- (unsigned long long)sh->sector, atomic_read(&sh->count),
- sh->pd_idx);
+ (unsigned long long)sh->sector, atomic_read(&sh->count),
+ sh->pd_idx);
spin_lock(&sh->lock);
clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
+ clear_bit(STRIPE_DELAY_EXPAND, &sh->state);
syncing = test_bit(STRIPE_SYNCING, &sh->state);
/* Now to look around and see what can be done */
@@ -908,7 +1126,7 @@
clear_bit(R5_Syncio, &dev->flags);
PRINTK("check %d: state 0x%lx read %p write %p written %p\n",
- i, dev->flags, dev->toread, dev->towrite, dev->written);
+ i, dev->flags, dev->toread, dev->towrite, dev->written);
/* maybe we can reply to a read */
if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
struct bio *rbi, *rbi2;
@@ -936,7 +1154,7 @@
if (test_bit(R5_LOCKED, &dev->flags)) locked++;
if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
-
+
if (dev->toread) to_read++;
if (dev->towrite) {
to_write++;
@@ -945,19 +1163,21 @@
}
if (dev->written) written++;
rdev = conf->disks[i].rdev; /* FIXME, should I be looking rdev */
- if (!rdev || !rdev->in_sync) {
+ if (!conf->expand_in_progress && (!rdev || !rdev->in_sync)) {
failed++;
failed_num = i;
+ printk("failing disk %u (%p)!\n", i, rdev);
} else
set_bit(R5_Insync, &dev->flags);
}
- PRINTK("locked=%d uptodate=%d to_read=%d"
- " to_write=%d failed=%d failed_num=%d\n",
- locked, uptodate, to_read, to_write, failed, failed_num);
/* check if the array has lost two devices and, if so, some requests might
* need to be failed
*/
if (failed > 1 && to_read+to_write+written) {
+ printk("Need to fail requests!\n");
+ printk("locked=%d uptodate=%d to_read=%d"
+ " to_write=%d failed=%d failed_num=%d disks=%d\n",
+ locked, uptodate, to_read, to_write, failed, failed_num, disks);
spin_lock_irq(&conf->device_lock);
for (i=disks; i--; ) {
/* fail all writes first */
@@ -1012,7 +1232,7 @@
}
spin_unlock_irq(&conf->device_lock);
}
- if (failed > 1 && syncing) {
+ if (failed > 1 && syncing && !conf->expand_in_progress) {
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
syncing = 0;
@@ -1023,37 +1243,37 @@
*/
dev = &sh->dev[sh->pd_idx];
if ( written &&
- ( (test_bit(R5_Insync, &dev->flags) && !test_bit(R5_LOCKED, &dev->flags) &&
- test_bit(R5_UPTODATE, &dev->flags))
- || (failed == 1 && failed_num == sh->pd_idx))
- ) {
- /* any written block on an uptodate or failed drive can be returned.
- * Note that if we 'wrote' to a failed drive, it will be UPTODATE, but
- * never LOCKED, so we don't need to test 'failed' directly.
- */
- for (i=disks; i--; )
- if (sh->dev[i].written) {
- dev = &sh->dev[i];
- if (!test_bit(R5_LOCKED, &dev->flags) &&
- test_bit(R5_UPTODATE, &dev->flags) ) {
- /* We can return any write requests */
- struct bio *wbi, *wbi2;
- PRINTK("Return write for disc %d\n", i);
- spin_lock_irq(&conf->device_lock);
- wbi = dev->written;
- dev->written = NULL;
- while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
- wbi2 = r5_next_bio(wbi, dev->sector);
- if (--wbi->bi_phys_segments == 0) {
- md_write_end(conf->mddev);
- wbi->bi_next = return_bi;
- return_bi = wbi;
- }
- wbi = wbi2;
- }
- spin_unlock_irq(&conf->device_lock);
- }
- }
+ ( (test_bit(R5_Insync, &dev->flags) && !test_bit(R5_LOCKED, &dev->flags) &&
+ test_bit(R5_UPTODATE, &dev->flags))
+ || (failed == 1 && failed_num == sh->pd_idx))
+ ) {
+ /* any written block on an uptodate or failed drive can be returned.
+ * Note that if we 'wrote' to a failed drive, it will be UPTODATE, but
+ * never LOCKED, so we don't need to test 'failed' directly.
+ */
+ for (i=disks; i--; )
+ if (sh->dev[i].written) {
+ dev = &sh->dev[i];
+ if (!test_bit(R5_LOCKED, &dev->flags) &&
+ test_bit(R5_UPTODATE, &dev->flags) ) {
+ /* We can return any write requests */
+ struct bio *wbi, *wbi2;
+ PRINTK("Return write for disc %d\n", i);
+ spin_lock_irq(&conf->device_lock);
+ wbi = dev->written;
+ dev->written = NULL;
+ while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
+ wbi2 = r5_next_bio(wbi, dev->sector);
+ if (--wbi->bi_phys_segments == 0) {
+ md_write_end(conf->mddev);
+ wbi->bi_next = return_bi;
+ return_bi = wbi;
+ }
+ wbi = wbi2;
+ }
+ spin_unlock_irq(&conf->device_lock);
+ }
+ }
}
/* Now we might consider reading some blocks, either to check/generate
@@ -1064,13 +1284,13 @@
for (i=disks; i--;) {
dev = &sh->dev[i];
if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
- (dev->toread ||
- (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
- syncing ||
- (failed && (sh->dev[failed_num].toread ||
- (sh->dev[failed_num].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
- )
- ) {
+ (dev->toread ||
+ (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
+ syncing ||
+ (failed && (sh->dev[failed_num].toread ||
+ (sh->dev[failed_num].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
+ )
+ ) {
/* we would like to get this block, possibly
* by computing it, but we might not be able to
*/
@@ -1085,23 +1305,303 @@
/* if I am just reading this block and we don't have
a failed drive, or any pending writes then sidestep the cache */
if (sh->bh_read[i] && !sh->bh_read[i]->b_reqnext &&
- ! syncing && !failed && !to_write) {
+ ! syncing && !failed && !to_write) {
sh->bh_cache[i]->b_page = sh->bh_read[i]->b_page;
sh->bh_cache[i]->b_data = sh->bh_read[i]->b_data;
}
#endif
locked++;
PRINTK("Reading block %d (sync=%d)\n",
- i, syncing);
- if (syncing)
+ i, syncing);
+ if (syncing && !conf->expand_in_progress)
md_sync_acct(conf->disks[i].rdev->bdev,
- STRIPE_SECTORS);
+ STRIPE_SECTORS);
}
}
}
set_bit(STRIPE_HANDLE, &sh->state);
}
+ // see if we have the data we need to expand by another block
+ if (conf->expand_in_progress && sh->disks == conf->previous_raid_disks) {
+ int uptodate = 0, delay_to_future=0, d = 0, count = 0, needed_uptodate = 0;
+ for (i=0; i<disks; ++i) {
+ sector_t start_sector, dest_sector;
+ unsigned int dd_idx, pd_idx;
+
+ if (i == sh->pd_idx)
+ continue;
+
+ start_sector = sh->sector * (conf->previous_raid_disks - 1) + d * (conf->chunk_size >> 9);
+ ++d;
+
+ // see what sector this block would land in the new layout
+ dest_sector = raid5_compute_sector(start_sector, conf->raid_disks,
+ conf->raid_disks - 1, &dd_idx, &pd_idx, conf);
+ if (dd_idx > pd_idx)
+ --dd_idx;
+
+/* printk("start_sector = %llu (base=%llu, i=%u, d=%u) || dest_stripe = %llu\n", start_sector, sh->sector,
+ i, d, dest_stripe); */
+
+ if (dest_sector * (conf->raid_disks - 1) >= conf->expand_progress &&
+ dest_sector * (conf->raid_disks - 1) < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+/* printk("UPDATING CHUNK %u FROM DISK %u (sec=%llu, dest_sector=%llu, uptodate=%u)\n",
+ dd_idx, i, start_sector, dest_sector, test_bit(R5_UPTODATE, &sh->dev[i].flags)); */
+ unsigned int buf_sector;
+ sector_t base = conf->expand_progress;
+ sector_div(base, conf->raid_disks - 1);
+
+ buf_sector = dd_idx * (conf->chunk_size / STRIPE_SIZE) + (dest_sector - base) / STRIPE_SECTORS;
+
+ if (test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
+ conf->expand_buffer[buf_sector].up_to_date = 1;
+// printk("memcpy device %u/%u: %p <- %p\n", i, sh->disks,
+// page_address(conf->expand_buffer[buf_sector].page), page_address(sh->dev[i].page));
+ memcpy(page_address(conf->expand_buffer[buf_sector].page), page_address(sh->dev[i].page), STRIPE_SIZE);
+// printk("memcpy done\n");
+ count = 1;
+ PRINTK("Updating %u\n", buf_sector);
+ } else {
+ conf->expand_buffer[buf_sector].up_to_date = 0;
+ }
+ } else if (dest_sector * (conf->raid_disks - 1) >= conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1) &&
+ dest_sector * (conf->raid_disks - 1) < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1) * 2 &&
+ syncing) {
+ delay_to_future = 1;
+ }
+ }
+
+ for (i=0; i < (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE); ++i) {
+ uptodate += conf->expand_buffer[i].up_to_date;
+ }
+ if (count)
+ PRINTK("%u/%lu is up to date\n", uptodate, (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE));
+
+ /*
+ * Figure out how many stripes we need for this chunk to be complete.
+ * In almost all cases, this will be a full destination stripe, but our
+ * original volume might not be big enough for that at the very end --
+ * so use the rest of the volume then.
+ */
+ needed_uptodate = (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE);
+ if (((conf->mddev->size << 1) - conf->expand_progress) / STRIPE_SECTORS < needed_uptodate) {
+ needed_uptodate = ((conf->mddev->size << 1) - conf->expand_progress) / STRIPE_SECTORS;
+// printk("reading partial block at the end: %u\n", needed_uptodate);
+ }
+ if (needed_uptodate > 0 && uptodate == needed_uptodate) {
+ // we can do an expand!
+ struct stripe_head *newsh[256]; // FIXME: dynamic allocation somewhere instead?
+ sector_t dest_sector, advance;
+ unsigned i;
+ unsigned int dummy1, dummy2, pd_idx;
+
+ if ((conf->mddev->size << 1) - conf->expand_progress > (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+ advance = (conf->chunk_size * (conf->raid_disks - 1)) >> 9;
+ } else {
+ advance = (conf->mddev->size << 1) - conf->expand_progress;
+ }
+
+// sector_div(new_sector, (conf->raid_disks - 1));
+// printk("EXPANDING ONTO SECTOR %llu\n", conf->expand_progress);
+// printk("EXPAND => %llu/%llu\n", conf->expand_progress, conf->mddev->size << 1);
+
+ // find the parity disk and starting sector
+ dest_sector = raid5_compute_sector(conf->expand_progress, conf->raid_disks,
+ conf->raid_disks - 1, &dummy1, &pd_idx, conf);
+ printk("Expanding onto %llu\n", (unsigned long long)dest_sector);
+
+ spin_lock_irq(&conf->device_lock);
+
+ /*
+ * Check that we won't try to expand over an area where there are
+ * still active stripes; if we do, we'll risk inconsistency since we
+ * suddenly have two different sets of stripes referring to the
+ * same logical sector.
+ */
+ {
+ struct stripe_head *ash;
+ int activity = 0, i;
+ sector_t first_touched_sector, last_touched_sector;
+
+ first_touched_sector = raid5_compute_sector(conf->expand_progress,
+ conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+ last_touched_sector = raid5_compute_sector(conf->expand_progress + ((conf->chunk_size * (conf->previous_raid_disks - 1)) >> 9) - 1,
+ conf->previous_raid_disks, conf->previous_raid_disks - 1, &dummy1, &dummy2, conf);
+
+ for (i = 0; i < NR_HASH; i++) {
+ ash = conf->stripe_hashtbl[i];
+ for (; ash; ash = ash->hash_next) {
+ if (sh == ash && atomic_read(&ash->count) == 1 && !to_write)
+ continue; // we'll release it shortly, so it's OK (?)
+
+ // is this stripe active, and within the region we're expanding?
+ if (atomic_read(&ash->count) > 0 &&
+ ash->disks == conf->previous_raid_disks &&
+ ash->sector >= first_touched_sector &&
+ ash->sector <= last_touched_sector) {
+ activity = 1;
+ break;
+ }
+ }
+ }
+
+ if (activity) {
+ spin_unlock_irq(&conf->device_lock);
+ goto please_wait;
+ }
+ }
+
+ /*
+ * Check that we have enough free stripes to write out our
+ * entire chunk in the new layout. If not, we'll have to wait
+ * until some writes have been retired. We can't just do
+ * as in get_active_stripe() and sleep here until enough are
+ * free, since all busy stripes might have STRIPE_HANDLE set
+ * and thus won't be retired until somebody (our thread!) takes
+ * care of them.
+ */
+
+ {
+ int not_enough_free = 0;
+
+ for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+ newsh[i] = get_free_stripe(conf, 1);
+ if (newsh[i] == NULL) {
+ not_enough_free = 1;
+ break;
+ }
+ init_stripe(newsh[i], dest_sector + i * STRIPE_SECTORS, pd_idx);
+ }
+
+ if (not_enough_free) {
+ // release all the stripes we allocated
+ for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+ if (newsh[i] == NULL)
+ break;
+ atomic_inc(&newsh[i]->count);
+ __release_stripe(conf, newsh[i]);
+ }
+ spin_unlock_irq(&conf->device_lock);
+ goto please_wait;
+ }
+ }
+
+ for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+ for (d = 0; d < conf->raid_disks; ++d) {
+ unsigned dd_idx = d;
+
+ if (d != pd_idx) {
+ if (dd_idx > pd_idx)
+ --dd_idx;
+
+ memcpy(page_address(newsh[i]->dev[d].page), page_address(conf->expand_buffer[dd_idx * conf->chunk_size / STRIPE_SIZE + i].page), STRIPE_SIZE);
+ }
+ set_bit(R5_Wantwrite, &newsh[i]->dev[d].flags);
+ set_bit(R5_Syncio, &newsh[i]->dev[d].flags);
+ }
+ }
+
+ for (i=0; i < (conf->raid_disks - 1) * (conf->chunk_size / STRIPE_SIZE); ++i) {
+ conf->expand_buffer[i].up_to_date = 0;
+ }
+
+ conf->expand_progress += advance;
+
+ spin_unlock_irq(&conf->device_lock);
+
+ for (i = 0; i < conf->chunk_size / STRIPE_SIZE; ++i) {
+ compute_parity(newsh[i], RECONSTRUCT_WRITE);
+
+ atomic_inc(&newsh[i]->count);
+ set_bit(STRIPE_INSYNC, &newsh[i]->state);
+ set_bit(STRIPE_HANDLE, &newsh[i]->state);
+ release_stripe(newsh[i]);
+ }
+
+ spin_lock_irq(&conf->device_lock);
+ md_done_sync(conf->mddev, advance, 1);
+ wake_up(&conf->wait_for_expand_progress);
+ spin_unlock_irq(&conf->device_lock);
+
+// md_sync_acct(conf->disks[0].rdev->bdev, STRIPE_SECTORS * (conf->raid_disks - 1));
+
+ // see if we have delayed data that we can process now
+ {
+ struct list_head *l, *next;
+
+ spin_lock_irq(&conf->device_lock);
+ l = conf->wait_for_expand_list.next;
+
+// printk("printing delay list:\n");
+ while (l != &conf->wait_for_expand_list) {
+ int i, d = 0;
+ int do_process = 0;
+
+ struct stripe_head *dsh;
+ dsh = list_entry(l, struct stripe_head, lru);
+// printk("sector: %llu\n", dsh->sector);
+
+ for (i=0; i<disks; ++i) {
+ sector_t start_sector, dest_sector;
+ unsigned int dd_idx, pd_idx;
+
+ if (i == dsh->pd_idx)
+ continue;
+
+ start_sector = dsh->sector * (conf->previous_raid_disks - 1) + d * (conf->chunk_size >> 9);
+
+ // see which sector this block would land in under the new layout
+ dest_sector = raid5_compute_sector(start_sector, conf->raid_disks,
+ conf->raid_disks - 1, &dd_idx, &pd_idx, conf);
+ if (/*dest_sector * (conf->raid_disks - 1) >= conf->expand_progress &&*/
+ dest_sector * (conf->raid_disks - 1) < conf->expand_progress + (conf->raid_disks - 1) * (conf->chunk_size >> 9)) {
+ do_process = 1;
+ }
+
+ ++d;
+ }
+
+ next = l->next;
+
+ if (do_process) {
+ list_del_init(l);
+
+ set_bit(STRIPE_HANDLE, &dsh->state);
+ clear_bit(STRIPE_DELAYED, &dsh->state);
+ clear_bit(STRIPE_DELAY_EXPAND, &dsh->state);
+ atomic_inc(&dsh->count);
+ atomic_inc(&dsh->count);
+ printk("pulling in stuff from delayed, sector=%llu\n",
+ (unsigned long long)dsh->sector);
+ __release_stripe(conf, dsh);
+ } else {
+ printk("still there\n");
+ }
+
+ l = next;
+ }
+
+ spin_unlock_irq(&conf->device_lock);
+ }
+
+ // see if we are done
+ if (conf->expand_progress >= conf->mddev->array_size << 1) {
+ printk("expand done, waiting for last activity to settle...\n");
+// conf->mddev->raid_disks = conf->raid_disks;
+// raid5_resize(conf->mddev, conf->mddev->size << 1);
+ conf->expand_in_progress = 2;
+ }
+
+please_wait:
+ 1;
+ }
+
+ if (delay_to_future) { // && atomic_dec_and_test(&sh->count)) {
+ set_bit(STRIPE_DELAY_EXPAND, &sh->state);
+ }
+ }
+
/* now to consider writing and what else, if anything should be read */
if (to_write) {
int rmw=0, rcw=0;
@@ -1237,7 +1737,9 @@
}
}
if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
- md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+ if (!conf->expand_in_progress) {
+ md_done_sync(conf->mddev, STRIPE_SECTORS,1);
+ }
clear_bit(STRIPE_SYNCING, &sh->state);
}
@@ -1279,7 +1781,7 @@
rcu_read_unlock();
if (rdev) {
- if (test_bit(R5_Syncio, &sh->dev[i].flags))
+ if (test_bit(R5_Syncio, &sh->dev[i].flags) && !conf->expand_in_progress)
md_sync_acct(rdev->bdev, STRIPE_SECTORS);
bi->bi_bdev = rdev->bdev;
@@ -1308,6 +1810,7 @@
static inline void raid5_activate_delayed(raid5_conf_t *conf)
{
+ PRINTK("raid5_activate_delayed\n");
if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
while (!list_empty(&conf->delayed_list)) {
struct list_head *l = conf->delayed_list.next;
@@ -1428,8 +1931,15 @@
for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
DEFINE_WAIT(w);
- new_sector = raid5_compute_sector(logical_sector,
- raid_disks, data_disks, &dd_idx, &pd_idx, conf);
+ if (conf->expand_in_progress && logical_sector >= conf->expand_progress) {
+ PRINTK("GEOM: old\n");
+ new_sector = raid5_compute_sector(logical_sector,
+ conf->previous_raid_disks, conf->previous_raid_disks - 1, &dd_idx, &pd_idx, conf);
+ } else {
+ PRINTK("GEOM: new\n");
+ new_sector = raid5_compute_sector(logical_sector,
+ raid_disks, data_disks, &dd_idx, &pd_idx, conf);
+ }
PRINTK("raid5: make_request, sector %llu logical %llu\n",
(unsigned long long)new_sector,
@@ -1488,6 +1998,13 @@
int raid_disks = conf->raid_disks;
int data_disks = raid_disks-1;
+ if (conf->expand_in_progress) {
+ raid_disks = conf->previous_raid_disks;
+ data_disks = raid_disks-1;
+ }
+
+ BUG_ON(data_disks == 0 || raid_disks == 0);
+
if (sector_nr >= mddev->size <<1) {
/* just being told to finish up .. nothing much to do */
unplug_slaves(mddev);
@@ -1499,17 +2016,41 @@
*/
if (mddev->degraded >= 1 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
int rv = (mddev->size << 1) - sector_nr;
+ printk("md_done_sync()\n");
md_done_sync(mddev, rv, 1);
return rv;
}
+
+ /* if we're in an expand, we can't allow the process
+ * to keep reading in stripes; we might not have enough buffer
+ * space to keep it all in RAM.
+ */
+ if (conf->expand_in_progress && sector_nr >= conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1)) {
+ //printk("DELAY\n");
+ //printall(conf);
+ //printk("progress = %llu\n", conf->expand_progress);
+ spin_lock_irq(&conf->device_lock);
+ wait_event_lock_irq(conf->wait_for_expand_progress,
+ sector_nr < conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1),
+ conf->device_lock,
+ unplug_slaves(conf->mddev);
+ );
+ spin_unlock_irq(&conf->device_lock);
+ //printk("DELAY DONE\n");
+ }
x = sector_nr;
chunk_offset = sector_div(x, sectors_per_chunk);
stripe = x;
BUG_ON(x != stripe);
-
+
+ PRINTK("sync_request:%llu/%llu, %u+%u active, pr=%llu v. %llu\n", sector_nr, mddev->size<<1,
+ atomic_read(&conf->active_stripes), atomic_read(&conf->active_stripes_expand),
+ sector_nr,
+ conf->expand_progress + (conf->chunk_size >> 9) * (conf->raid_disks - 1));
+
first_sector = raid5_compute_sector((sector_t)stripe*data_disks*sectors_per_chunk
- + chunk_offset, raid_disks, data_disks, &dd_idx, &pd_idx, conf);
+ + chunk_offset, raid_disks, data_disks, &dd_idx, &pd_idx, conf);
sh = get_active_stripe(conf, sector_nr, pd_idx, 1);
if (sh == NULL) {
sh = get_active_stripe(conf, sector_nr, pd_idx, 0);
@@ -1553,18 +2094,29 @@
while (1) {
struct list_head *first;
+ conf = mddev_to_conf(mddev);
+
if (list_empty(&conf->handle_list) &&
atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
!blk_queue_plugged(mddev->queue) &&
- !list_empty(&conf->delayed_list))
+ !list_empty(&conf->delayed_list)) {
+ PRINTK("activate delayed\n");
raid5_activate_delayed(conf);
+ }
if (list_empty(&conf->handle_list))
break;
first = conf->handle_list.next;
+ PRINTK("first: %p\n", first);
+
sh = list_entry(first, struct stripe_head, lru);
+#if RAID5_DEBUG
+ PRINTK("sh: %p\n", sh);
+ print_sh(sh);
+#endif
+
list_del_init(first);
atomic_inc(&sh->count);
if (atomic_read(&sh->count)!= 1)
@@ -1577,7 +2129,7 @@
spin_lock_irq(&conf->device_lock);
}
- PRINTK("%d stripes handled\n", handled);
+// PRINTK("%d stripes handled\n", handled);
spin_unlock_irq(&conf->device_lock);
@@ -1594,6 +2146,8 @@
struct disk_info *disk;
struct list_head *tmp;
+ printk("run()!\n");
+
if (mddev->level != 5 && mddev->level != 4) {
printk("raid5: %s: raid level not set to 4/5 (%d)\n", mdname(mddev), mddev->level);
return -EIO;
@@ -1650,6 +2204,7 @@
conf->level = mddev->level;
conf->algorithm = mddev->layout;
conf->max_nr_stripes = NR_STRIPES;
+ conf->expand_in_progress = 0;
/* device size must be a multiple of chunk size */
mddev->size &= ~(mddev->chunk_size/1024 -1);
@@ -1691,7 +2246,7 @@
}
memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
- if (grow_stripes(conf, conf->max_nr_stripes)) {
+ if (grow_stripes(conf, conf->max_nr_stripes, 0)) {
printk(KERN_ERR
"raid5: couldn't allocate %dkB for buffers\n", memory);
shrink_stripes(conf);
@@ -1767,8 +2322,8 @@
printk("sh %llu, pd_idx %d, state %ld.\n",
(unsigned long long)sh->sector, sh->pd_idx, sh->state);
- printk("sh %llu, count %d.\n",
- (unsigned long long)sh->sector, atomic_read(&sh->count));
+ printk("sh %llu, count %d, disks %d.\n",
+ (unsigned long long)sh->sector, atomic_read(&sh->count), sh->disks);
printk("sh %llu, ", (unsigned long long)sh->sector);
for (i = 0; i < sh->raid_conf->raid_disks; i++) {
printk("(cache%d: %p %ld) ",
@@ -1865,6 +2420,9 @@
mdk_rdev_t *rdev;
struct disk_info *p = conf->disks + number;
+ printk("we were asked to remove a disk\n");
+ return -EBUSY; // sesse hack
+
print_raid5_conf(conf);
rdev = p->rdev;
if (rdev) {
@@ -1894,27 +2452,37 @@
int disk;
struct disk_info *p;
- if (mddev->degraded > 1)
+ printk("RAID5 ADD DISK PLZ: %p\n", rdev);
+
+ if (mddev->degraded > 1) {
+ printk("GAVE UP\n");
+
/* no point adding a device */
return 0;
+ }
/*
* find the disk ...
*/
- for (disk=0; disk < mddev->raid_disks; disk++)
+ for (disk=0; disk < mddev->raid_disks; disk++) {
if ((p=conf->disks + disk)->rdev == NULL) {
+ printk("adding disk to %u\n", disk);
+
+ rdev->faulty = 0;
rdev->in_sync = 0;
rdev->raid_disk = disk;
found = 1;
p->rdev = rdev;
break;
}
+ }
print_raid5_conf(conf);
return found;
}
static int raid5_resize(mddev_t *mddev, sector_t sectors)
{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
/* no resync is happening, and there is enough space
* on all devices, so we can resize.
* We need to make sure resync covers any new space.
@@ -1922,8 +2490,14 @@
* any io in the removed space completes, but it hardly seems
* worth it.
*/
+ printk("asked to resize\n");
+ if (conf->expand_in_progress)
+ return -EBUSY;
+
sectors &= ~((sector_t)mddev->chunk_size/512 - 1);
+ printk("old array_size: %llu\n", (unsigned long long)mddev->array_size);
mddev->array_size = (sectors * (mddev->raid_disks-1))>>1;
+ printk("new array_size: %llu (%llu x %d)\n", (unsigned long long)mddev->array_size, (unsigned long long)sectors, mddev->raid_disks - 1);
set_capacity(mddev->gendisk, mddev->array_size << 1);
mddev->changed = 1;
if (sectors/2 > mddev->size && mddev->recovery_cp == MaxSector) {
@@ -1934,6 +2508,221 @@
return 0;
}
+static int raid5_reshape(mddev_t *mddev, int raid_disks)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ raid5_conf_t *newconf;
+ struct list_head *tmp;
+ mdk_rdev_t *rdev;
+ unsigned long flags;
+
+ int d, i;
+
+ if (mddev->degraded >= 1 || conf->expand_in_progress)
+ return -EBUSY;
+
+ printk("sesse was here: reshape to %u disks\n", raid_disks);
+ print_raid5_conf(conf);
+
+ newconf = kmalloc (sizeof (raid5_conf_t)
+ + raid_disks * sizeof(struct disk_info),
+ GFP_KERNEL);
+ if (newconf == NULL)
+ return -ENOMEM;
+
+ memset(newconf, 0, sizeof (raid5_conf_t) + raid_disks * sizeof(struct disk_info));
+ memcpy(newconf, conf, sizeof (raid5_conf_t) + conf->raid_disks * sizeof(struct disk_info));
+
+ newconf->expand_in_progress = 1;
+ newconf->expand_progress = 0;
+ newconf->raid_disks = mddev->raid_disks = raid_disks;
+ newconf->previous_raid_disks = conf->raid_disks;
+
+ INIT_LIST_HEAD(&newconf->inactive_list_expand);
+
+
+ spin_lock_irqsave(&conf->device_lock, flags);
+ mddev->private = newconf;
+
+ printk("conf=%p newconf=%p\n", conf, newconf);
+
+ if (newconf->handle_list.next)
+ newconf->handle_list.next->prev = &newconf->handle_list;
+ if (newconf->delayed_list.next)
+ newconf->delayed_list.next->prev = &newconf->delayed_list;
+ if (newconf->inactive_list.next)
+ newconf->inactive_list.next->prev = &newconf->inactive_list;
+
+ if (newconf->handle_list.prev == &conf->handle_list)
+ newconf->handle_list.prev = &newconf->handle_list;
+ if (newconf->delayed_list.prev == &conf->delayed_list)
+ newconf->delayed_list.prev = &newconf->delayed_list;
+ if (newconf->inactive_list.prev == &conf->inactive_list)
+ newconf->inactive_list.prev = &newconf->inactive_list;
+
+ if (newconf->wait_for_stripe.task_list.prev == &conf->wait_for_stripe.task_list)
+ newconf->wait_for_stripe.task_list.prev = &newconf->wait_for_stripe.task_list;
+ if (newconf->wait_for_overlap.task_list.prev == &conf->wait_for_overlap.task_list)
+ newconf->wait_for_overlap.task_list.prev = &newconf->wait_for_overlap.task_list;
+
+ init_waitqueue_head(&newconf->wait_for_stripe_expand);
+ init_waitqueue_head(&newconf->wait_for_expand_progress);
+ INIT_LIST_HEAD(&newconf->wait_for_expand_list);
+
+ // update all the stripes
+ for (i = 0; i < NR_HASH; ++i) {
+ struct stripe_head *sh = newconf->stripe_hashtbl[i];
+ while (sh) {
+ sh->raid_conf = newconf;
+
+ if (sh->lru.next == &conf->inactive_list)
+ sh->lru.next = &newconf->inactive_list;
+ if (sh->lru.next == &conf->handle_list)
+ sh->lru.next = &newconf->handle_list;
+
+ sh = sh->hash_next;
+ }
+ }
+
+ // ...and all on the inactive queue
+ {
+ struct list_head *first = newconf->inactive_list.next;
+
+ while (1) {
+ struct stripe_head *sh = list_entry(first, struct stripe_head, lru);
+ sh->raid_conf = newconf;
+
+ if (sh->lru.next == &conf->inactive_list)
+ sh->lru.next = &newconf->inactive_list;
+ if (sh->lru.next == &conf->handle_list)
+ sh->lru.next = &newconf->handle_list;
+
+ if (first->next == &conf->inactive_list || first->next == &newconf->inactive_list) {
+ first->next = &newconf->inactive_list;
+ break;
+ }
+
+ first = first->next;
+ };
+ }
+
+ // update the pointer for the other lists as well
+ {
+ struct list_head *first = &newconf->handle_list;
+ while (1) {
+ if (first->next == &conf->handle_list) {
+ first->next = &newconf->handle_list;
+ break;
+ }
+
+ first = first->next;
+ };
+ }
+ {
+ struct list_head *first = &newconf->delayed_list;
+ while (1) {
+ if (first->next == &conf->delayed_list) {
+ first->next = &newconf->delayed_list;
+ break;
+ }
+
+ first = first->next;
+ };
+ }
+ {
+ struct list_head *first = &newconf->wait_for_stripe.task_list;
+ while (1) {
+ if (first->next == &conf->wait_for_stripe.task_list) {
+ first->next = &newconf->wait_for_stripe.task_list;
+ break;
+ }
+
+ first = first->next;
+ };
+ }
+ {
+ struct list_head *first = &newconf->wait_for_overlap.task_list;
+ while (1) {
+ if (first->next == &conf->wait_for_overlap.task_list) {
+ first->next = &newconf->wait_for_overlap.task_list;
+ break;
+ }
+
+ first = first->next;
+ };
+ }
+
+ ITERATE_RDEV(mddev,rdev,tmp) {
+ printk("disk: %p\n", rdev);
+ for (d= 0; d < newconf->raid_disks; d++) {
+ if (newconf->disks[d].rdev == rdev) {
+ goto already_there;
+ }
+ }
+
+ raid5_add_disk(mddev, rdev);
+ newconf->failed_disks++;
+
+already_there:
+ 1;
+ }
+
+ // argh! we can't hold this lock while allocating memory
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+
+ // allocate new stripes
+ atomic_set(&newconf->active_stripes_expand, 0);
+ if (grow_stripes(newconf, newconf->max_nr_stripes, 1)) {
+ int memory = newconf->max_nr_stripes * (sizeof(struct stripe_head) +
+ newconf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
+ printk(KERN_ERR "raid5: couldn't allocate %dkB for expand stripes\n", memory);
+ shrink_stripes(newconf);
+ kfree(newconf); /* FIXME: mddev->private and the stripes still point at newconf */
+ return -ENOMEM;
+ }
+
+ // and space for our temporary expansion buffers
+ newconf->expand_buffer = kmalloc (sizeof(struct expand_buf) * (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1), GFP_KERNEL);
+ if (newconf->expand_buffer == NULL) {
+ printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+ (conf->chunk_size * (raid_disks-1)) >> 10);
+ shrink_stripes(newconf);
+ kfree(newconf); /* FIXME: mddev->private and the stripes still point at newconf */
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < (conf->chunk_size / STRIPE_SIZE) * (raid_disks-1); ++i) {
+ newconf->expand_buffer[i].page = alloc_page(GFP_KERNEL);
+ if (newconf->expand_buffer[i].page == NULL) {
+ printk(KERN_ERR "raid5: couldn't allocate %dkB for expand buffer\n",
+ (conf->chunk_size * (raid_disks-1)) >> 10);
+ shrink_stripes(newconf);
+ kfree(newconf); /* FIXME: mddev->private still points at newconf; earlier pages leak */
+ return -ENOMEM;
+ }
+ newconf->expand_buffer[i].up_to_date = 0;
+ }
+
+ spin_lock_irqsave(&conf->device_lock, flags);
+
+ print_raid5_conf(newconf);
+
+ clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+ mddev->recovery_cp = 0;
+ md_wakeup_thread(mddev->thread);
+// md_check_recovery(mddev);
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+
+ kfree(conf);
+
+ printk("Starting expand.\n");
+
+ return 0;
+}
+
+
static mdk_personality_t raid5_personality=
{
.name = "raid5",
@@ -1948,6 +2737,7 @@
.spare_active = raid5_spare_active,
.sync_request = sync_request,
.resize = raid5_resize,
+ .reshape = raid5_reshape
};
static int __init raid5_init (void)
diff -ur linux-2.6-2.6.12/include/linux/raid/raid5.h ../linux-2.6-2.6.12/include/linux/raid/raid5.h
--- linux-2.6-2.6.12/include/linux/raid/raid5.h 2005-06-17 21:48:29.000000000 +0200
+++ linux-2.6-2.6.12.patch/include/linux/raid/raid5.h 2005-09-17 00:47:25.000000000 +0200
@@ -92,7 +92,11 @@
* stripe is also (potentially) linked to a hash bucket in the hash
* table so that it can be found by sector number. Stripes that are
* not hashed must be on the inactive_list, and will normally be at
- * the front. All stripes start life this way.
+ * the front. All stripes start life this way. There is also a
+ * an "inactive_list_expand"; this is only used during an expand, and
+ * it contains stripes with "disks" set to the correct number of disks
+ * after the expand (and with the correct amount of memory allocated,
+ * of course).
*
* The inactive_list, handle_list and hash bucket lists are all protected by the
* device_lock.
@@ -134,6 +138,7 @@
unsigned long state; /* state flags */
atomic_t count; /* nr of active thread/requests */
spinlock_t lock;
+ int disks; /* disks in stripe */
struct r5dev {
struct bio req;
struct bio_vec vec;
@@ -171,6 +176,7 @@
#define STRIPE_INSYNC 4
#define STRIPE_PREREAD_ACTIVE 5
#define STRIPE_DELAYED 6
+#define STRIPE_DELAY_EXPAND 7
/*
* Plugging:
@@ -199,6 +205,10 @@
struct disk_info {
mdk_rdev_t *rdev;
};
+struct expand_buf {
+ struct page *page;
+ int up_to_date;
+};
struct raid5_private_data {
struct stripe_head **stripe_hashtbl;
@@ -208,22 +218,38 @@
int raid_disks, working_disks, failed_disks;
int max_nr_stripes;
+ /* used during an expand */
+ int expand_in_progress;
+ sector_t expand_progress;
+ int previous_raid_disks;
+ struct list_head wait_for_expand_list;
+
+ struct expand_buf *expand_buffer;
+
struct list_head handle_list; /* stripes needing handling */
struct list_head delayed_list; /* stripes that have plugged requests */
atomic_t preread_active_stripes; /* stripes with scheduled io */
char cache_name[20];
+ char cache_name_expand[20];
kmem_cache_t *slab_cache; /* for allocating stripes */
+ kmem_cache_t *slab_cache_expand;
+
/*
* Free stripes pool
*/
atomic_t active_stripes;
+ atomic_t active_stripes_expand;
struct list_head inactive_list;
+ struct list_head inactive_list_expand;
wait_queue_head_t wait_for_stripe;
+ wait_queue_head_t wait_for_stripe_expand;
+ wait_queue_head_t wait_for_expand_progress;
wait_queue_head_t wait_for_overlap;
int inactive_blocked; /* release of inactive stripes blocked,
* waiting for 25% to be free
- */
+ */
+ int inactive_blocked_expand;
spinlock_t device_lock;
struct disk_info disks[0];
};
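
For readers following the window arithmetic in the patch above, the bookkeeping around expand_progress can be sketched in isolation. This is a minimal user-space sketch and not part of the patch: struct expand_state, expand_advance() and data_disks_for() are hypothetical names introduced here for illustration, and limit_sectors simply stands in for the value the patch writes as conf->mddev->size << 1. It shows two things the diff does inline: how far expand_progress moves after one destination stripe has been written, and which geometry (old or new) a logical sector is mapped with while an expand is running.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the raid5_conf_t fields the window logic reads. */
struct expand_state {
	uint64_t expand_progress;     /* logical sectors already restriped */
	uint64_t limit_sectors;       /* what the patch writes as conf->mddev->size << 1 */
	unsigned chunk_sectors;       /* conf->chunk_size >> 9 */
	unsigned raid_disks;          /* disk count after the grow */
	unsigned previous_raid_disks; /* disk count before the grow */
};

/*
 * How far expand_progress moves once one destination stripe is written:
 * a full stripe of the new layout, or whatever is left at the very end.
 */
static uint64_t expand_advance(const struct expand_state *s)
{
	uint64_t full, left;

	assert(s->raid_disks >= 2); /* same spirit as the patch's BUG_ON on data_disks */
	assert(s->expand_progress <= s->limit_sectors);

	full = (uint64_t)s->chunk_sectors * (s->raid_disks - 1);
	left = s->limit_sectors - s->expand_progress;
	return left > full ? full : left;
}

/*
 * Number of data disks to map a logical sector with: sectors at or beyond
 * expand_progress still live in the old layout, everything below it has
 * already been rewritten in the new one.
 */
static unsigned data_disks_for(const struct expand_state *s,
			       uint64_t logical_sector, int expand_in_progress)
{
	if (expand_in_progress && logical_sector >= s->expand_progress)
		return s->previous_raid_disks - 1;
	return s->raid_disks - 1;
}
```

With a 64 KiB chunk (128 sectors) and a grow from 3 to 4 disks, each completed step moves the boundary by 128 * 3 sectors until less than that remains; requests on either side of the boundary then pick 2 or 3 data disks respectively.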