From mboxrd@z Thu Jan 1 00:00:00 1970 From: Paul Clements Subject: Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes Date: Tue, 04 May 2004 16:08:14 -0400 Sender: linux-raid-owner@vger.kernel.org Message-ID: <4097F82E.60404@steeleye.com> References: <40198E85.29EBC8E0@SteelEye.com> <16422.62911.755570.855200@notabene.cse.unsw.edu.au> <4027E342.D02202F1@SteelEye.com> <16424.8182.876520.280031@notabene.cse.unsw.edu.au> <402D3A86.97CF894F@SteelEye.com> <16456.2775.641721.204171@notabene.cse.unsw.edu.au> <4048F9AA.1BBD67F@SteelEye.com> <406B1024.7BF88C@SteelEye.com> <16528.49083.998593.199805@cse.unsw.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <16528.49083.998593.199805@cse.unsw.edu.au> To: Neil Brown Cc: linux-raid@vger.kernel.org, ptb@it.uc3m.es, mingo@redhat.com, "james.bottomley" List-Id: linux-raid.ids Neil Brown wrote: > On Wednesday March 31, Paul.Clements@SteelEye.com wrote: > and less that a month later I replied .... I've been busy and had > two weeks leave in there. Sorry. Ahh, I noticed you'd been absent from the mailing lists for a bit... Anyway, thanks for taking the time to review this... >>Create a bitmap file: >>-------------------- >> >># mdadm --create-bitmap=65536,3,580480 /tmp/bitmap --force >> > > > Maybe: > mdadm --create-bitmap --chunksize=64 --delay=3 --size==580480 > ??? > while more verbose, it is also easier to remember and more in-keeping > with the rest of mdadm's syntax. OK...and it will probably change slightly if we're not doing bitmaps in files...more on that below... >>1) an is_create flag was added to do_md_run to tell bitmap_create >>whether we are creating or just assembling the array -- this is >>necessary since 0.90 superblocks do not have a UUID until one is >>generated randomly at array creation time, therefore, we must set the >>bitmap UUID equal to this newly generated array UUID when the array is >>created > > > I think this is the wrong approach. You are justifying a design error > by reference to a previous design error. I agree, I don't like it either, but there is no clean solution that I can think of... > I think user-space should be completely responsible for creating the > bitmap file including setting the UUID. > Either > 1/ add the bitmap after the array has been created. We can't do this because the initial resync would have started. > or > 2/ Create the array in user-space and just get the kernel to > assemble it (this is what I will almost certainly do in mdadm > once I get around to supporting version 1 superblocks). I'd gladly support version 1 superblocks, but currently the tools and kernel support for them is not complete. So in order to have working code, right now, unfortunately, my hack is a necessary evil. If there's a cleaner way to make this work, I'm certainly open to suggestions. >>3) code was added to mdadm to allow creation of arrays with >>non-persistent superblocks (also, device size calculation with >>non-persistent superblocks was fixed) >> >>4) a fix was made to the hot_remove code to allow a faulty device to be >>removed >> >>5) various typo and minor bug fixes were also included in the patches > > > > please, Please, PLEASE, keep separate patches separate. It makes them > much easier to review, and hence makes acceptance much more likely. Yes, I should have cleaned the patch up a bit...sorry about that... > In particular, your fix in 4 is, I think, wrong. I agree that there > is a problem here. I don't think your fix is correct. Could you explain what's wrong with it? I'll be glad to alter the patch if there's a better way to fix this problem. > > But, onto the patch. > > > 1/ Why bitmap_alloc_page instead of just kmalloc? That was part of Peter's original patch and I never bothered to change it. Admittedly, there probably is not a valid reason for having the cache. I'll remove it and just use kmalloc. If we find we need it later, it can be added back... > If every kernel subsystem kept it's own private cache of memory > in case of desperate need, then there would be a lot of memory > wastage. Unless you have evidence that times when you need to > allocate a bitmap are frequently times when there is excessive > memory pressure, I think you should just trust kmalloc. > On the other hand, if you have reason to believe that the services > of kmalloc are substantially suboptimal for your needs, you should > explain why in a comment. > > 2/There is a race in bitmap_checkpage. > I assume that the required postcondition is that (if the arguments > are valid) either bitmap->bp[page].hijacked or bitmap->bp[page].map > but not both. If this is the case, then > > + if ((mappage = bitmap_alloc_page(bitmap)) == NULL) { > + PRINTK("%s: bitmap map page allocation failed, hijacking\n", > + bmname(bitmap)); > + /* failed - set the hijacked flag so that we can use the > + * pointer as a counter */ > + spin_lock_irqsave(&bitmap->lock, flags); > + bitmap->bp[page].hijacked = 1; > + goto out_unlock; > + } > > should become: > > + if ((mappage = bitmap_alloc_page(bitmap)) == NULL) { > + PRINTK("%s: bitmap map page allocation failed, hijacking\n", > + bmname(bitmap)); > + /* failed - set the hijacked flag so that we can use the > + * pointer as a counter */ > + spin_lock_irqsave(&bitmap->lock, flags); > + if (!bitmap->bp[page].map) bitmap->bp[page].hijacked = 1; > + goto out_unlock; > + } > > as someone else could have allocated a bitmap while we were trying > and failing. Yes, I'll fix that. > > > 3/ Your bmap_page / sync_page is very filesystem-specific. > > bmap_page sets page->private to a linked-list of buffers. > However page->private "belongs" to the address-space that the page > is in, which means the filesystem. Yes, you're right. > I don't know if any filesystems use page->private for anything > other than a list of buffers, but they certainly could if they > wanted to, and if they did, this code would suddenly break. XFS uses page->private differently... > > I think that if you have your heart set on being able to store the > bitmap in a file, that using a loop-back mount would be easiest. > But if you don't want to do that, at least use the correct > interface. Referring to the code in loop.c would help. We could not do the prepare_write/commit_write (as loop does) because of the current->journal_info limitation in jbd/ext3 (i.e., a single process cannot have two jbd transactions ongoing at a time, even though the two transactions are for different filesystems). In order to work around that limitation, we would have had to create a separate thread to do the bitmap writes, which is complex and probably too slow to be an acceptable solution. I now agree with you that (due to the aforementioned limitations) bitmap files are not going to work. I think, at least for now, we'll go with a bitmap located on a device at a certain offset. So, for a bitmap located on the md partition itself, we specify an offset of sb_offset+4096. For a standalone bitmap device/partition (or loopback mount of a file, as you suggested) we give a 0 offset. > Another alternative would be to use the approach that swapfile uses. > i.e. create a list of extents using bmap information, and then do > direct IO to the device using this extent information. > swapfile chooses to ignore any extent that is less that a page. > You might not want to do that, but you wouldn't have to. > > > 4/ I would avoid extending the file in the kernel. It is too easy > to do that in user space. Just require that the file is the > correct size. OK. That's easy enough to change. > 5/ I find it very confusing that you kmap a page, and then leave it > to some function that you call to kunmap the page (unmap_put_page > or sync_put_page). It looks very unbalanced. I would much > rather see the kunmap in the same function as the kmap. > > 6/ It seems odd that bitmap_set_state calls unmap_put_page instead > of sync_put_page. Surely you want to sync the superblock at this > point. > 7/ The existence of bitmap_get_state confuses me. Why not store the > state in the bitmap structure in bitmap_read_sb?? > > 8/ calling md_force_spare(.., 1) to force a sync is definitely > wrong. You appear to be assuming a raid1 array with exactly two > devices. Can't you just set recovery_cp to 0, or maybe just set > a flag somewhere and test it in md_check_recovery?? This is code that was necessary in 2.4, where it was harder to trigger a resync. I think this can be cleaned up for 2.6. > 9/ why don't you just pass "%s_bitmap" as the thread name to > md_register_thread ? As far as I can tell, it would have exactly > the same effect as the current code without requiring a kmalloc. OK > 10/ > +static void bitmap_stop_daemon(struct bitmap *bitmap) > +{ > + mdk_thread_t *daemon; > + unsigned long flags; > + > + spin_lock_irqsave(&bitmap->lock, flags); > + if (!bitmap->daemon) { > + spin_unlock_irqrestore(&bitmap->lock, flags); > + return; > + } > + daemon = bitmap->daemon; > + bitmap->daemon = NULL; > + spin_unlock_irqrestore(&bitmap->lock, flags); > + md_unregister_thread(daemon); /* destroy the thread */ > +} > > would look better as: > > +static void bitmap_stop_daemon(struct bitmap *bitmap) > +{ > + mdk_thread_t *daemon; > + unsigned long flags; > + > + spin_lock_irqsave(&bitmap->lock, flags); > + daemon = bitmap->daemon; > + bitmap->daemon = NULL; > + spin_unlock_irqrestore(&bitmap->lock, flags); > + if (bitmap->daemon) > + md_unregister_thread(daemon); /* destroy the thread */ > +} OK > > 11/ md_update_sb needs to be called with the mddev locked, and I > think there are times when you call it without the lock. > I would prefer it if you left it 'static' and just set the > sb_dirty flag. raid[156]d will the update date it soon > enough. I think that will be OK, as long as it doesn't open up a window for things to get into an inconsistent state if there's a crash. > 12/ The tests in hot_add_disk to see if the newly added device is > sufficiently up-to-date that the bitmap can be used to get it > the rest of the way up-to-date must be wrong as they don't > contain any reference to 'events'. > You presumably want to be able to fail/remove a drive and then > re-add it and not require a full resync. > For this to work, you need to record an event number when the > bitmap switches from "which blocks on active drives are not > in-sync" to "which blocks active drives have changed since a > drive failed", and when you incorporate a new device, only > allow it to be synced based on the bitmap if the event counter > is at least as new as the saved one (plus the checks you > currently have). Yes, the current code assumes only two partitions, and thus does not do this extra checking. I'll look at adding that. > 13/ The test > + } else if (atomic_read(&mddev->bitmap_file->f_count) > 2) { > in set_bitmap_file is half-hearted at best. > What you probably want is "deny_write_access". I'll check that out. > Or just check that the file is owned by root and isn't world > writable. > > The check against the uuid in the header should be enough to > ensure that operator error doesn't result in the one file > being used for two arrays. > > 14/ In md_do_sync, I think you should move "cond_reshed()" above > > @@ -3312,6 +3506,10 @@ static void md_do_sync(mddev_t *mddev) > goto out; > } > > + /* don't worry about speed limits unless we do I/O */ > + if (!need_sync) > + continue; > + > /* > * this loop exits only if either when we are slower than > * the 'hard' speed limit, or the system was IO-idle for > > to make sure that mdX_sync doesn't steal too much time before > allowing a reschedule. I'll look at doing this. > > and the worst part about it is that the code doesn't support what I > would think would be the most widely used and hence most useful case, > and that is to store the bitmap in the 60K of space after the > superblock. As mentioned above, the next patch will support this configuration (as well as standalone bitmap devices/partitions). Hopefully, the next patch will be more to your liking (and much smaller, too...) -- Paul