* [PATCH md 0 of 4] Introduction
@ 2005-03-08 5:50 NeilBrown
From: NeilBrown @ 2005-03-08 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid
4 patches for md/raid in 2.6.11-mm1
The first two are trivial and should apply equally to 2.6.11
The second two fix bugs that were introduced by the recent
bitmap-based-intent-logging patches and so are not relevant
to 2.6.11 yet.
[PATCH md 1 of 4] Fix typo in super_1_sync
[PATCH md 2 of 4] Erroneous sizeof use in raid1
[PATCH md 3 of 4] Initialise sync_blocks in raid1 resync
[PATCH md 4 of 4] Fix md deadlock due to md thread processing delayed requests.

* [PATCH md 1 of 4] Fix typo in super_1_sync
@ 2005-03-08 5:50 ` NeilBrown
From: NeilBrown @ 2005-03-08 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

Instead of setting one value lots of times, let's set lots of values
once each, as we should..

From: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-03-07 11:32:41.000000000 +1100
+++ ./drivers/md/md.c	2005-03-07 11:32:41.000000000 +1100
@@ -1010,7 +1010,7 @@ static void super_1_sync(mddev_t *mddev,
 	sb->max_dev = cpu_to_le32(max_dev);
 	for (i=0; i<max_dev;i++)
-		sb->dev_roles[max_dev] = cpu_to_le16(0xfffe);
+		sb->dev_roles[i] = cpu_to_le16(0xfffe);
 
 	ITERATE_RDEV(mddev,rdev2,tmp) {
 		i = rdev2->desc_nr;

* [PATCH md 3 of 4] Initialise sync_blocks in raid1 resync
@ 2005-03-08 5:50 ` NeilBrown
From: NeilBrown @ 2005-03-08 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

Otherwise it could have a random value and might BUG.

This fixes a BUG hit during raid1 resync that was introduced by the
bitmap-based-intent-logging patches.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/raid1.c |    1 +
 1 files changed, 1 insertion(+)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-03-07 15:49:55.000000000 +1100
+++ ./drivers/md/raid1.c	2005-03-07 15:53:55.000000000 +1100
@@ -1235,6 +1235,7 @@ static sector_t sync_request(mddev_t *md
 	}
 
 	nr_sectors = 0;
+	sync_blocks = 0;
 	do {
 		struct page *page;
 		int len = PAGE_SIZE;

* [PATCH md 4 of 4] Fix md deadlock due to md thread processing delayed requests.
@ 2005-03-08 5:50 ` NeilBrown
From: NeilBrown @ 2005-03-08 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

Before completing a 'write' the md superblock might need to be updated.
This is best done by the md_thread.

The current code schedules this update and queues the write request for
later handling by the md_thread.

However, some personalities (raid5/raid6) will deadlock if the md_thread
tries to submit requests to its own array.

So this patch changes things so that the process submitting the request
waits for the superblock to be written, and then submits the request
itself.

This fixes a recently-created deadlock in raid5/raid6.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/md.c           |   45 +++++++++++++++-----------------------------
 ./drivers/md/raid1.c        |    4 +--
 ./drivers/md/raid10.c       |    3 --
 ./drivers/md/raid5.c        |    3 --
 ./drivers/md/raid6main.c    |    3 --
 ./include/linux/raid/md.h   |    2 -
 ./include/linux/raid/md_k.h |    2 -
 7 files changed, 23 insertions(+), 39 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./drivers/md/md.c	2005-03-08 16:11:44.000000000 +1100
@@ -267,8 +267,8 @@ static mddev_t * mddev_find(dev_t unit)
 	INIT_LIST_HEAD(&new->all_mddevs);
 	init_timer(&new->safemode_timer);
 	atomic_set(&new->active, 1);
-	bio_list_init(&new->write_list);
 	spin_lock_init(&new->write_lock);
+	init_waitqueue_head(&new->sb_wait);
 
 	new->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!new->queue) {
@@ -1350,6 +1350,7 @@ repeat:
 		if (!mddev->persistent) {
 			mddev->sb_dirty = 0;
 			spin_unlock(&mddev->write_lock);
+			wake_up(&mddev->sb_wait);
 			return;
 		}
 		spin_unlock(&mddev->write_lock);
@@ -1391,6 +1392,7 @@ repeat:
 	}
 	mddev->sb_dirty = 0;
 	spin_unlock(&mddev->write_lock);
+	wake_up(&mddev->sb_wait);
 
 }
@@ -3489,29 +3491,26 @@ void md_done_sync(mddev_t *mddev, int bl
 
 /* md_write_start(mddev, bi)
  * If we need to update some array metadata (e.g. 'active' flag
- * in superblock) before writing, queue bi for later writing
- * and return 0, else return 1 and it will be written now
+ * in superblock) before writing, schedule a superblock update
+ * and wait for it to complete.
  */
-int md_write_start(mddev_t *mddev, struct bio *bi)
+void md_write_start(mddev_t *mddev, struct bio *bi)
 {
+	DEFINE_WAIT(w);
 	if (bio_data_dir(bi) != WRITE)
-		return 1;
+		return;
 
 	atomic_inc(&mddev->writes_pending);
-	spin_lock(&mddev->write_lock);
-	if (mddev->in_sync == 0 && mddev->sb_dirty == 0) {
-		spin_unlock(&mddev->write_lock);
-		return 1;
-	}
-	bio_list_add(&mddev->write_list, bi);
-	if (mddev->in_sync) {
-		mddev->in_sync = 0;
-		mddev->sb_dirty = 1;
+	if (mddev->in_sync) {
+		spin_lock(&mddev->write_lock);
+		if (mddev->in_sync) {
+			mddev->in_sync = 0;
+			mddev->sb_dirty = 1;
+			md_wakeup_thread(mddev->thread);
+		}
+		spin_unlock(&mddev->write_lock);
 	}
-	spin_unlock(&mddev->write_lock);
-	md_wakeup_thread(mddev->thread);
-	return 0;
+	wait_event(mddev->sb_wait, mddev->sb_dirty==0);
 }
 
 void md_write_end(mddev_t *mddev)
@@ -3808,7 +3807,6 @@ void md_check_recovery(mddev_t *mddev)
 		mddev->sb_dirty ||
 		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
-		mddev->write_list.head ||
 		(mddev->safemode == 1) ||
 		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
 		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
@@ -3817,7 +3815,6 @@ void md_check_recovery(mddev_t *mddev)
 
 	if (mddev_trylock(mddev)==0) {
 		int spares =0;
-		struct bio *blist;
 
 		spin_lock(&mddev->write_lock);
 		if (mddev->safemode && !atomic_read(&mddev->writes_pending) &&
@@ -3827,21 +3824,11 @@ void md_check_recovery(mddev_t *mddev)
 		}
 		if (mddev->safemode == 1)
 			mddev->safemode = 0;
-		blist = bio_list_get(&mddev->write_list);
 		spin_unlock(&mddev->write_lock);
 
 		if (mddev->sb_dirty)
 			md_update_sb(mddev);
 
-		while (blist) {
-			struct bio *b = blist;
-			blist = blist->bi_next;
-			b->bi_next = NULL;
-			generic_make_request(b);
-			/* we already counted this, so need to un-count */
-			md_write_end(mddev);
-		}
-
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
 		    !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./drivers/md/raid1.c	2005-03-07 16:33:42.000000000 +1100
@@ -561,8 +561,8 @@ static int make_request(request_queue_t
 	 * thread has put up a bar for new requests.
 	 * Continue immediately if no resync is active currently.
 	 */
-	if (md_write_start(mddev, bio)==0)
-		return 0;
+	md_write_start(mddev, bio); /* wait on superblock update early */
+
 	spin_lock_irq(&conf->resync_lock);
 	wait_event_lock_irq(conf->wait_resume, !conf->barrier, conf->resync_lock, );
 	conf->nr_pending++;

diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./drivers/md/raid10.c	2005-03-07 16:33:59.000000000 +1100
@@ -700,8 +700,7 @@ static int make_request(request_queue_t
 		return 0;
 	}
 
-	if (md_write_start(mddev, bio) == 0)
-		return 0;
+	md_write_start(mddev, bio);
 
 	/*
 	 * Register the new request and wait if the reconstruction

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./drivers/md/raid5.c	2005-03-07 16:34:09.000000000 +1100
@@ -1411,8 +1411,7 @@ static int make_request (request_queue_t
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
 
-	if (md_write_start(mddev, bi)==0)
-		return 0;
+	md_write_start(mddev, bi);
 
 	if (bio_data_dir(bi)==WRITE) {
 		disk_stat_inc(mddev->gendisk, writes);

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./drivers/md/raid6main.c	2005-03-07 16:34:29.000000000 +1100
@@ -1570,8 +1570,7 @@ static int make_request (request_queue_t
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
 
-	if (md_write_start(mddev, bi)==0)
-		return 0;
+	md_write_start(mddev, bi);
 
 	if (bio_data_dir(bi)==WRITE) {
 		disk_stat_inc(mddev->gendisk, writes);

diff ./include/linux/raid/md.h~current~ ./include/linux/raid/md.h
--- ./include/linux/raid/md.h~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./include/linux/raid/md.h	2005-03-07 16:32:55.000000000 +1100
@@ -69,7 +69,7 @@ extern mdk_thread_t * md_register_thread
 extern void md_unregister_thread (mdk_thread_t *thread);
 extern void md_wakeup_thread(mdk_thread_t *thread);
 extern void md_check_recovery(mddev_t *mddev);
-extern int md_write_start(mddev_t *mddev, struct bio *bi);
+extern void md_write_start(mddev_t *mddev, struct bio *bi);
 extern void md_write_end(mddev_t *mddev);
 extern void md_handle_safemode(mddev_t *mddev);
 extern void md_done_sync(mddev_t *mddev, int blocks, int ok);

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2005-03-08 16:08:10.000000000 +1100
+++ ./include/linux/raid/md_k.h	2005-03-07 16:31:44.000000000 +1100
@@ -261,7 +261,7 @@ struct mddev_s
 	sector_t			recovery_cp;
 
 	spinlock_t			write_lock;
-	struct bio_list			write_list;
+	wait_queue_head_t		sb_wait;	/* for waiting on superblock updates */
 
 	unsigned int			safemode;	/* if set, update "clean" superblock
 							 * when no writes pending.

* [PATCH md 2 of 4] Erroneous sizeof use in raid1
@ 2005-03-08 5:50 ` NeilBrown
From: NeilBrown @ 2005-03-08 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

This isn't a real bug as the smallest slab-size is 32 bytes, but please
apply for consistency.  Found by the Coverity tool.

Signed-off-by: Alexander Nyberg <alexn@dsv.su.se>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

### Diffstat output
 ./drivers/md/raid1.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2005-03-07 15:49:25.000000000 +1100
+++ ./drivers/md/raid1.c	2005-03-07 15:49:55.000000000 +1100
@@ -1494,7 +1494,7 @@ static int raid1_reshape(mddev_t *mddev,
 		if (conf->mirrors[d].rdev)
 			return -EBUSY;
 
-	newpoolinfo = kmalloc(sizeof(newpoolinfo), GFP_KERNEL);
+	newpoolinfo = kmalloc(sizeof(*newpoolinfo), GFP_KERNEL);
 	if (!newpoolinfo)
 		return -ENOMEM;
 	newpoolinfo->mddev = mddev;
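
For readers who have not hit this one before, the distinction the patch
corrects is between the size of a pointer and the size of the structure it
points at.  A tiny standalone illustration (the struct below is made up for
the example and malloc stands in for kmalloc; only the sizeof behaviour
matters):

#include <stdio.h>
#include <stdlib.h>

struct pool_info {        /* stand-in structure, not the kernel's */
	void *mddev;
	int   raid_disks;
};

int main(void)
{
	struct pool_info *newpoolinfo;

	/* sizeof(newpoolinfo) is the size of the pointer (4 or 8 bytes), which is
	 * what the old code allocated; sizeof(*newpoolinfo) is the size of the
	 * structure, which is what the allocation actually needs to hold. */
	printf("sizeof(newpoolinfo)  = %zu\n", sizeof(newpoolinfo));
	printf("sizeof(*newpoolinfo) = %zu\n", sizeof(*newpoolinfo));

	newpoolinfo = malloc(sizeof(*newpoolinfo));  /* the corrected idiom */
	free(newpoolinfo);
	return 0;
}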

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-08 6:10 ` Andrew Morton
From: Andrew Morton @ 2005-03-08 6:10 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

NeilBrown <neilb@cse.unsw.edu.au> wrote:
>
> The first two are trivial and should apply equally to 2.6.11
>
> The second two fix bugs that were introduced by the recent
> bitmap-based-intent-logging patches and so are not relevant
> to 2.6.11 yet.

The changelog for the "Fix typo in super_1_sync" patch doesn't actually
say what the patch does.  What are the user-visible consequences of not
fixing this?

Is the bitmap stuff now ready for Linus?

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-09 3:17 ` Neil Brown
From: Neil Brown @ 2005-03-09 3:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

On Monday March 7, akpm@osdl.org wrote:
> NeilBrown <neilb@cse.unsw.edu.au> wrote:
> >
> > The first two are trivial and should apply equally to 2.6.11
> >
> > The second two fix bugs that were introduced by the recent
> > bitmap-based-intent-logging patches and so are not relevant
> > to 2.6.11 yet.
>
> The changelog for the "Fix typo in super_1_sync" patch doesn't actually say
> what the patch does.  What are the user-visible consequences of not fixing
> this?

-------
This fixes possible inconsistencies that might arise in a version-1
superblock when devices fail and are removed.

Usage of version-1 superblocks is not yet widespread and no actual
problems have been reported.
--------

>
> Is the bitmap stuff now ready for Linus?

I agree with Paul - not yet.

I'd also like to get a bit more functionality in before it goes to
Linus, as the functionality may necessitate an interface change (I'm
not sure).  Specifically, I want the bitmap to be able to live near the
superblock rather than having to be in a file on a different
filesystem.

NeilBrown

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-09 9:27 ` Mike Tran
From: Mike Tran @ 2005-03-09 9:27 UTC (permalink / raw)
To: linux-raid

Hi Neil,

On Tue, 2005-03-08 at 21:17, Neil Brown wrote:
> On Monday March 7, akpm@osdl.org wrote:
> > NeilBrown <neilb@cse.unsw.edu.au> wrote:
> > >
> > > The first two are trivial and should apply equally to 2.6.11
> > >
> > > The second two fix bugs that were introduced by the recent
> > > bitmap-based-intent-logging patches and so are not relevant
> > > to 2.6.11 yet.
> >
> > The changelog for the "Fix typo in super_1_sync" patch doesn't actually say
> > what the patch does.  What are the user-visible consequences of not fixing
> > this?
>
> -------
> This fixes possible inconsistencies that might arise in a version-1
> superblock when devices fail and are removed.
>
> Usage of version-1 superblocks is not yet widespread and no actual
> problems have been reported.
> --------

EVMS 2.5.1 (http://evms.sf.net) has provided support for creation of MD
arrays using a version-1 superblock.  Some EVMS users have actually
tried this new functionality.  You probably remember I posted a problem
and a patch to fix the version-1 superblock update code.

We will continue to test and will report any problems.

--
Regards,
Mike T.

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-08 12:49 ` Peter T. Breuer
From: Peter T. Breuer @ 2005-03-08 12:49 UTC (permalink / raw)
To: linux-raid

NeilBrown <neilb@cse.unsw.edu.au> wrote:
> The second two fix bugs that were introduced by the recent
> bitmap-based-intent-logging patches and so are not relevant

Neil - can you describe for me (us all?) what is meant by
intent-logging here.

Well, I can guess - I suppose the driver marks the bitmap before a write
(or group of writes) and unmarks it when they have completed
successfully.  Is that it?

If so, how does it manage to mark what it is _going_ to do (without
psychic powers) on the disk bitmap?  Unmarking is easy - that needs a
queue of things due to be unmarked in the bitmap, and a point in time
at which they are all unmarked at once on disk.

Then resync would only deal with the marked blocks.

Peter

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-08 17:02 ` Paul Clements
From: Paul Clements @ 2005-03-08 17:02 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-raid

Peter T. Breuer wrote:

> Neil - can you describe for me (us all?) what is meant by
> intent-logging here.

Since I wrote a lot of the code, I guess I'll try...

> Well, I can guess - I suppose the driver marks the bitmap before a write
> (or group of writes) and unmarks it when they have completed
> successfully.  Is that it?

Yes.  It marks the bitmap before writing (actually queues up the bitmap
and normal writes in bunches for the sake of performance).  The code is
actually (loosely) based on your original bitmap (fr1) code.

> If so, how does it manage to mark what it is _going_ to do (without
> psychic powers) on the disk bitmap?

That's actually fairly easy.  The pages for the bitmap are locked in
memory, so you just dirty the bits you want (which doesn't actually
incur any I/O) and then when you're about to perform the normal writes,
you flush the dirty bitmap pages to disk.

Once the writes are complete, a thread (we have the raid1d thread doing
this) comes back along and flushes the (now clean) bitmap pages back to
disk.  If the pages get dirty again in the meantime (because of more
I/O), we just leave them dirty and don't touch the disk.

> Then resync would only deal with the marked blocks.

Right.  It clears the bitmap once things are back in sync.

--
Paul
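
A rough, self-contained model of the sequence Paul describes above (mark bits
in memory, flush the dirty bitmap pages out before the data writes, clear the
bits lazily afterwards).  The helper names and array sizes here are invented
for illustration and are not the md/bitmap kernel API:

#include <stdio.h>
#include <string.h>

#define CHUNKS 64

static unsigned char bitmap_mem[CHUNKS];   /* in-memory intent bitmap                 */
static unsigned char bitmap_disk[CHUNKS];  /* model of the bitmap on stable storage   */

static void flush_bitmap(void)             /* stands in for writing dirty bitmap pages */
{
	memcpy(bitmap_disk, bitmap_mem, sizeof(bitmap_disk));
}

static void mirror_write(int chunk)
{
	bitmap_mem[chunk] = 1;   /* 1. mark intent in memory - no I/O yet              */
	flush_bitmap();          /* 2. dirty bitmap pages hit disk before the data     */
	/* 3. ... the real data writes to all mirrors would be submitted here ...     */
	bitmap_mem[chunk] = 0;   /* 4. clear in memory once the writes have completed  */
	/* 5. a daemon (raid1d in md) later flushes the cleaned pages back, lazily    */
}

int main(void)
{
	mirror_write(7);
	printf("on-disk bit for chunk 7 before lazy clean: %d\n", bitmap_disk[7]);
	flush_bitmap();          /* the lazy clean-up pass */
	printf("on-disk bit for chunk 7 after lazy clean:  %d\n", bitmap_disk[7]);
	return 0;
}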

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-08 19:05 ` Peter T. Breuer
From: Peter T. Breuer @ 2005-03-08 19:05 UTC (permalink / raw)
To: linux-raid

Paul Clements <paul.clements@steeleye.com> wrote:
> Peter T. Breuer wrote:
> > Neil - can you describe for me (us all?) what is meant by
> > intent-logging here.
>
> Since I wrote a lot of the code, I guess I'll try...

Hi, Paul.  Thanks.

> > Well, I can guess - I suppose the driver marks the bitmap before a write
> > (or group of writes) and unmarks it when they have completed
> > successfully.  Is that it?
>
> Yes.  It marks the bitmap before writing (actually queues up the bitmap
> and normal writes in bunches for the sake of performance).  The code is
> actually (loosely) based on your original bitmap (fr1) code.

Yeah, I can see the traces.  I'm a little tired right now, but some
aspects of this idea vaguely worry me.  I'll see if I manage to
articulate those worries here despite my state.  And you can dispel
them :).

Let me first of all guess at the intervals involved.  I assume you will
write the marked parts of the bitmap to disk every 1/100th of a second
or so?  (I'd probably opt for 1/10th of a second or even every second,
just to make sure it's not noticeable on bandwidth, and to heck with
the safety until we learn better what the tradeoffs are.)  Or perhaps
once every hundred transactions in busy times.

Now, there are races here.  You must mark the bitmap in memory before
every write, and unmark it after every complete write.  That is an
ordering constraint.  There is a race, however, to record the bitmap
state to disk.  Without any rendezvous or handshake or other
synchronization, one would simply be snapshotting the in-memory bitmap
to disk every so often, and the on-disk bitmap would not always
accurately reflect the current state of completed transactions to the
mirror.  The question is whether it shows an overly-pessimistic
picture, an overly-optimistic picture, or neither one nor the other.

I would naively imagine straight off that it cannot in general be
(appropriately) pessimistic because it does not know what writes will
occur in the next 1/100th second in order to be able to mark those on
the disk bitmap before they happen.

In the next section of your answer, however, you say this is what
happens, and therefore I deduce that

  a) 1/100th second's worth of writes to the mirror are first queued
  b) the in-memory bitmap is marked for these (if it exists as separate)
  c) the dirty parts of that bitmap are written to disk(s)
  d) the queued writes are carried out on the mirror
  e) the in-memory bitmap is unmarked for these
  f) the newly cleaned parts of that bitmap are written to disk.

You may even have some sort of direct mapping between the on-disk
bitmap and the memory image, which could be quite effective, but may
run into problems with the address range available (bitmap must be
less than 2GB, no?), unless it maps only the necessary parts of the
bitmap at a time.  Well, if the kernel can manage that mapping window
on its own, it would be useful and probably what you have done.

But I digress.  My immediate problem is that writes must be queued
first.  I thought md traditionally did not queue requests, but instead
used its own make_request substitute to dispatch incoming requests as
they arrived.

Have you remodelled the md/raid1 make_request() fn?  And if so, do you
also aggregate them?  And what steps are taken to preserve write
ordering constraints (do some overlying file systems still require
these)?

> > If so, how does it manage to mark what it is _going_ to do (without
> > psychic powers) on the disk bitmap?
>
> That's actually fairly easy.  The pages for the bitmap are locked in
> memory,

That limits the size to about 2GB - oh, but perhaps you are doing as I
did and release bitmap pages when they are not dirty.  Yes, you must.

> so you just dirty the bits you want (which doesn't actually
> incur any I/O) and then when you're about to perform the normal writes,
> you flush the dirty bitmap pages to disk.

Hmm.  I don't know how one can select pages to flush, but clearly one
can!  You maintain a list of dirtied pages, clearly.  This list cannot
be larger than the list of outstanding requests.  If you use the
generic kernel mechanisms, that will be 1000 or so, max.

> Once the writes are complete, a thread (we have the raid1d thread doing
> this) comes back along and flushes the (now clean) bitmap pages back to
> disk.

OK .. there is a potential race here too, however, ...

> If the pages get dirty again in the meantime (because of more
> I/O), we just leave them dirty and don't touch the disk.

Hmm.  This appears to me to be an optimization.  OK.

> > Then resync would only deal with the marked blocks.
>
> Right.  It clears the bitmap once things are back in sync.

Well, OK.  Thinking it through as I write, I see fewer problems.  Thank
you for the explanation, and well done.  I have been meaning to merge
the patches and see what comes out.

I presume you left out the mechanisms I included to allow a mirror
component to aggressively notify the array when it feels sick, and when
it feels better again.  That required the array to be able to notify
the mirror components that they have been included in an array, and
lodge a callback hotline with them.

Thanks again.

Peter

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-09 5:07 ` Neil Brown
From: Neil Brown @ 2005-03-09 5:07 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-raid

On Tuesday March 8, ptb@lab.it.uc3m.es wrote:
>
> But I digress.  My immediate problem is that writes must be queued
> first.  I thought md traditionally did not queue requests, but instead
> used its own make_request substitute to dispatch incoming requests as
> they arrived.
>
> Have you remodelled the md/raid1 make_request() fn?

Somewhat.  Write requests are queued, and raid1d submits them when it
is happy that all bitmap updates have been done.

There is no '1/100th' second or anything like that.

When a write request arrives, the queue is 'plugged', requests are
queued, and bits in the in-memory bitmap are set.

When the queue is unplugged (by the filesystem or timeout) the bitmap
changes (if any) are flushed to disk, then the queued requests are
submitted.

Bits on disk are cleaned lazily.

Note that for many applications, the bitmap does not need to be huge.
4K is enough for 1 bit per 2-3 megabytes on many large drives.  Having
to sync 3 meg when just one block might be out-of-sync may seem like a
waste, but it is heaps better than syncing 100Gig!!  If a resync
without bitmap logging takes 1 hour, I suspect a resync with a 4K
bitmap would have a good chance of finishing in under 1 minute
(depending on locality of references).  That is good enough for me.

Of course, if one mirror is on the other side of the country, and a
normal sync requires 5 days over ADSL, then you would have a strong
case for a finer-grained bitmap.

>
> And if so, do you also aggregate them?  And what steps are taken to
> preserve write ordering constraints (do some overlying file systems
> still require these)?

Filesystems have never had any write ordering constraints, except that
IO must not be processed before it is requested, nor after it has been
acknowledged.  md continues to obey these restraints.

NeilBrown
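
A quick back-of-the-envelope check of the "4K is enough for 1 bit per 2-3
megabytes" figure above, as a standalone snippet; the 100GB member-device
size is only an assumed example, not a number taken from the thread:

#include <stdio.h>

int main(void)
{
	unsigned long long device_mb   = 100ULL * 1024;  /* assume a 100GB member device */
	unsigned long      bitmap_bits = 4096 * 8;       /* one 4K bitmap = 32768 bits   */

	/* each bit has to cover device_size / number_of_bits of the array */
	printf("each bit covers about %.1f MB\n",
	       (double)device_mb / bitmap_bits);          /* ~3.1 MB, i.e. 2-3 megabytes  */
	return 0;
}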

* Re: [PATCH md 0 of 4] Introduction
@ 2005-03-09 15:37 ` Peter T. Breuer
From: Peter T. Breuer @ 2005-03-09 15:37 UTC (permalink / raw)
To: linux-raid

Neil Brown <neilb@cse.unsw.edu.au> wrote:
> On Tuesday March 8, ptb@lab.it.uc3m.es wrote:
> > Have you remodelled the md/raid1 make_request() fn?
>
> Somewhat.  Write requests are queued, and raid1d submits them when
> it is happy that all bitmap updates have been done.

OK - so a slight modification of the kernel generic_make_request (I
haven't looked).  Mind you, I think that Paul said that just before
clearing bitmap entries, incoming requests were checked to see if a
bitmap entry should be marked again..  Perhaps both things happen.
Bitmap pages in memory are updated as clean after pending writes have
finished and then marked as dirty as necessary, and then flushed, and
when the flush finishes new accumulated requests are started.  One can

> There is no '1/100th' second or anything like that.

I was trying in a way to give a definite image to what happens, rather
than speak abstractly.  I'm sure that the ordinary kernel mechanism for
plugging and unplugging is used, as much as it is possible.  If you
unplug when the request struct reservoir is exhausted, then it will be
at 1K requests.  If they are each 4KB, that will be every 4MB.  At say
64MB/s, that will be every 1/16 s.  And unplugging may happen more
frequently because of other kernel magic mumble mumble ...

> When a write request arrives, the queue is 'plugged', requests are
> queued, and bits in the in-memory bitmap are set.

OK.

> When the queue is unplugged (by the filesystem or timeout) the bitmap
> changes (if any) are flushed to disk, then the queued requests are
> submitted.

That accumulates bitmap markings into the minimum number of extra
transactions.  It does impose extra latency, however.  I'm intrigued by
exactly how you exert the memory pressure required to force just the
dirty bitmap pages out.  I'll have to look it up.

> Bits on disk are cleaned lazily.

OK - so the disk bitmap state is always pessimistic.  That's fine.
Very good.

> Note that for many applications, the bitmap does not need to be huge.
> 4K is enough for 1 bit per 2-3 megabytes on many large drives.
> Having to sync 3 meg when just one block might be out-of-sync may seem
> like a waste, but it is heaps better than syncing 100Gig!!

Yes - I used 1 bit per 1K, falling back to 1 bit per 2MB under memory
pressure.

> > And if so, do you also aggregate them?  And what steps are taken to
> > preserve write ordering constraints (do some overlying file systems
> > still require these)?
>
> Filesystems have never had any write ordering constraints, except that
> IO must not be processed before it is requested, nor after it has been
> acknowledged.  md continues to obey these restraints.

Out of curiosity, is aggregation done on the queued requests?  Or are
they all kept at 4KB? (or whatever - 1KB).

Thanks!

Peter

* [PATCH md 0 of 4] Introduction
@ 2004-11-02 3:37 NeilBrown
From: NeilBrown @ 2004-11-02 3:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid
Following are 4 patches for md/raid against 2.6.10-rc1-mm2.
1/ Fix problem with linear arrays if component devices are > 2 terabytes
2/ Fix data corruption in (experimental) RAID6 personality
3/ Fix possible oops with unplug_timer firing at the wrong time.
4/ Add new md personality "faulty".
"Faulty" can be used to inject faults and so test failure modes
of other raid levels and of filesystems.
NeilBrown

* [PATCH md 0 of 4] Introduction
@ 2004-09-03 2:20 NeilBrown
From: NeilBrown @ 2004-09-03 2:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

Following are 4 patches for md in 2.6.8.1-mm4

The first three are minor improvements and modifications either
required by or inspired by the fourth.

The fourth adds a new raid pers

* [PATCH md 0 of 4] Introduction
@ 2004-08-23 3:10 NeilBrown
From: NeilBrown @ 2004-08-23 3:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid

Following are 4 patches for md in 2.6.8.1-mm4

The first three are minor improvements and modifications either
required by or inspired by the fourth.

The fourth adds a new raid personality - raid10.  At 56K, I'm not sure
it will get through the mailing list, but interested parties can find
it at:
   http://neilb.web.cse.unsw.edu.au/patches/linux-devel/2.6/2004-08-23-03

raid10 provides a combination of raid0 and raid1.  It requires mdadm
1.7.0 or later to use.  The next release of mdadm should have better
documentation of raid10, but from the comment in the .c file:

/*
 * RAID10 provides a combination of RAID0 and RAID1 functionality.
 * The layout of data is defined by
 *    chunk_size
 *    raid_disks
 *    near_copies (stored in low byte of layout)
 *    far_copies (stored in second byte of layout)
 *
 * The data to be stored is divided into chunks using chunksize.
 * Each device is divided into far_copies sections.
 * In each section, chunks are laid out in a style similar to raid0, but
 * near_copies copies of each chunk are stored (each on a different drive).
 * The starting device for each section is offset near_copies from the starting
 * device of the previous section.
 * Thus there are (near_copies*far_copies) copies of each chunk, and each is on a different
 * drive.
 * near_copies and far_copies must be at least one, and their product is at most
 * raid_disks.
 */

raid10 is currently marked EXPERIMENTAL, and this should be taken
seriously.  A reasonable amount of basic testing hasn't shown any bugs,
and it seems to resync and rebuild correctly.  However wider testing
would help.

NeilBrown
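
To make those layout rules concrete, here is a small userspace sketch that
simply applies the rules quoted in the comment above (raid0-style striping
within each section, with each section's starting device offset by
near_copies).  It is not the kernel's raid10 mapping code, and the
configuration values are picked only for illustration:

#include <stdio.h>

int main(void)
{
	int raid_disks = 4, near_copies = 2, far_copies = 2;  /* example config only */
	int chunk, n, f;

	for (chunk = 0; chunk < 4; chunk++) {
		printf("chunk %d:", chunk);
		for (f = 0; f < far_copies; f++)          /* which far-copy section      */
			for (n = 0; n < near_copies; n++) /* which near copy in a section */
				printf("  dev%d/section%d",
				       (chunk * near_copies + n + f * near_copies) % raid_disks, f);
		printf("\n");
	}
	return 0;
}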