Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Paul Clements <paul.clements@steeleye.com>
To: Neil Brown <neilb@cse.unsw.edu.au>
Cc: linux-raid@vger.kernel.org, ptb@it.uc3m.es, mingo@redhat.com,
	"james.bottomley" <james.bottomley@steeleye.com>
Subject: Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async  writes
Date: Tue, 04 May 2004 16:08:14 -0400	[thread overview]
Message-ID: <4097F82E.60404@steeleye.com> (raw)
In-Reply-To: <16528.49083.998593.199805@cse.unsw.edu.au>

Neil Brown wrote:
> On Wednesday March 31, Paul.Clements@SteelEye.com wrote:
>   and less that a month later I replied .... I've been busy and had
>   two weeks leave in there.  Sorry.

Ahh, I noticed you'd been absent from the mailing lists for a bit...

Anyway, thanks for taking the time to review this...

>>Create a bitmap file:
>>--------------------
>>
>># mdadm --create-bitmap=65536,3,580480 /tmp/bitmap --force
>>
> 
> 
> Maybe:
> mdadm --create-bitmap --chunksize=64 --delay=3 --size==580480
> ???
> while more verbose, it is also easier to remember and more in-keeping
> with the rest of mdadm's syntax.

OK...and it will probably change slightly if we're not doing bitmaps in 
files...more on that below...


>>1) an is_create flag was added to do_md_run to tell bitmap_create
>>whether we are creating or just assembling the array -- this is
>>necessary since 0.90 superblocks do not have a UUID until one is
>>generated randomly at array creation time, therefore, we must set the
>>bitmap UUID equal to this newly generated array UUID when the array is
>>created
> 
> 
> I think this is the wrong approach.  You are justifying a design error
> by reference to a previous design error.

I agree, I don't like it either, but there is no clean solution that I 
can think of...

> I think user-space should be completely responsible for creating the
> bitmap file including setting the UUID.

> Either
>   1/ add the bitmap after the array has been created.

We can't do this because the initial resync would have started.

> or
>   2/ Create the array in user-space and just get the kernel to
>     assemble it (this is what I will almost certainly do in mdadm
>     once I get around to supporting version 1 superblocks).

I'd gladly support version 1 superblocks, but currently the tools and 
kernel support for them is not complete.

So in order to have working code, right now, unfortunately, my hack is a 
necessary evil. If there's a cleaner way to make this work, I'm 
certainly open to suggestions.



>>3) code was added to mdadm to allow creation of arrays with
>>non-persistent superblocks (also, device size calculation with
>>non-persistent superblocks was fixed)
>>
>>4) a fix was made to the hot_remove code to allow a faulty device to be
>>removed
>>
>>5) various typo and minor bug fixes were also included in the patches
> 
> 
> 
> please, Please, PLEASE, keep separate patches separate.  It makes them
> much easier to review, and hence makes acceptance much more likely.

Yes, I should have cleaned the patch up a bit...sorry about that...

> In particular, your fix in 4 is, I think, wrong.  I agree that there
> is a problem here. I don't think your fix is correct.

Could you explain what's wrong with it? I'll be glad to alter the patch 
if there's a better way to fix this problem.


> 
> But, onto the patch.
> 
> 
> 1/ Why bitmap_alloc_page instead of just kmalloc?

That was part of Peter's original patch and I never bothered to change 
it. Admittedly, there probably is not a valid reason for having the 
cache. I'll remove it and just use kmalloc. If we find we need it later, 
it can be added back...

>    If every kernel subsystem kept it's own private cache of memory
>    in case of desperate need, then there would be a lot of memory
>    wastage.   Unless you have evidence that times when you need to
>    allocate a bitmap are frequently times when there is excessive
>    memory pressure, I think you should just trust kmalloc.
>    On the other hand, if you have reason to believe that the services
>    of kmalloc are substantially suboptimal for your needs, you should
>    explain why in a comment.
> 
>  2/There is a race in bitmap_checkpage.
>    I assume that the required postcondition is that (if the arguments
>    are valid) either bitmap->bp[page].hijacked or bitmap->bp[page].map
>    but not both.  If this is the case, then
> 
> +	if ((mappage = bitmap_alloc_page(bitmap)) == NULL) {
> +		PRINTK("%s: bitmap map page allocation failed, hijacking\n", 
> +			bmname(bitmap));
> +		/* failed - set the hijacked flag so that we can use the
> +		 * pointer as a counter */
> +		spin_lock_irqsave(&bitmap->lock, flags);
> +		bitmap->bp[page].hijacked = 1;
> +		goto out_unlock;
> +	}
> 
>    should become:
> 
> +	if ((mappage = bitmap_alloc_page(bitmap)) == NULL) {
> +		PRINTK("%s: bitmap map page allocation failed, hijacking\n", 
> +			bmname(bitmap));
> +		/* failed - set the hijacked flag so that we can use the
> +		 * pointer as a counter */
> +		spin_lock_irqsave(&bitmap->lock, flags);
> +		if (!bitmap->bp[page].map) bitmap->bp[page].hijacked = 1;
> +		goto out_unlock;
> +	}
> 
>    as someone else could have allocated a bitmap while we were trying
>    and failing.

Yes, I'll fix that.

> 
> 
>  3/ Your bmap_page / sync_page is very filesystem-specific.
> 
>    bmap_page sets page->private to a linked-list of buffers.
>    However page->private "belongs" to the address-space that the page
>    is in, which means the filesystem.

Yes, you're right.

>    I don't know if any filesystems use page->private for anything
>    other than a list of buffers, but they certainly could if they
>    wanted to, and if they did, this code would suddenly break.

XFS uses page->private differently...

> 
>    I think that if you have your heart set on being able to store the
>    bitmap in a file, that using a loop-back mount would be easiest.
>    But if you don't want to do that, at least use the correct
>    interface.  Referring to the code in loop.c would help.

We could not do the prepare_write/commit_write (as loop does) because of 
the current->journal_info limitation in jbd/ext3 (i.e., a single process 
cannot have two jbd transactions ongoing at a time, even though the two 
transactions are for different filesystems). In order to work around 
that limitation, we would have had to create a separate thread to do the 
bitmap writes, which is complex and probably too slow to be an 
acceptable solution.

I now agree with you that (due to the aforementioned limitations) bitmap 
files are not going to work. I think, at least for now, we'll go with a 
bitmap located on a device at a certain offset. So, for a bitmap located 
on the md partition itself, we specify an offset of sb_offset+4096. For 
a standalone bitmap device/partition (or loopback mount of a file, as 
you suggested) we give a 0 offset.


>    Another alternative would be to use the approach that swapfile uses.
>    i.e. create a list of extents using bmap information, and then do
>    direct IO to the device using this extent information.
>    swapfile chooses to ignore any extent that is less that a page.
>    You might not want to do that, but you wouldn't have to.
> 
> 
>   4/ I would avoid extending the file in the kernel.  It is too easy
>      to do that in user space.  Just require that the file is the
>      correct size.

OK. That's easy enough to change.


>   5/ I find it very confusing that you kmap a page, and then leave it
>      to some function that you call to kunmap the page (unmap_put_page
>      or sync_put_page).  It looks very unbalanced.  I would much
>      rather see the kunmap in the same function as the kmap.
> 
>   6/ It seems odd that bitmap_set_state calls unmap_put_page instead
>      of sync_put_page.  Surely you want to sync the superblock at this
>      point.  
>   7/ The existence of bitmap_get_state confuses me.  Why not store the
>      state in the bitmap structure in bitmap_read_sb??
> 
>   8/ calling md_force_spare(.., 1) to force a sync is definitely
>      wrong.  You appear to be assuming a raid1 array with exactly two
>      devices.  Can't you just set recovery_cp to 0, or maybe just set
>      a flag somewhere and test it in md_check_recovery??

This is code that was necessary in 2.4, where it was harder to trigger a 
resync. I think this can be cleaned up for 2.6.


>   9/ why don't you just pass "%s_bitmap" as the thread name to
>      md_register_thread ?  As far as I can tell, it would have exactly
>      the same effect as the current code without requiring a kmalloc.

OK


>   10/
> +static void bitmap_stop_daemon(struct bitmap *bitmap)
> +{
> +	mdk_thread_t *daemon;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&bitmap->lock, flags);
> +	if (!bitmap->daemon) {
> +		spin_unlock_irqrestore(&bitmap->lock, flags);
> +		return;
> +	}
> +	daemon = bitmap->daemon;
> +	bitmap->daemon = NULL;
> +	spin_unlock_irqrestore(&bitmap->lock, flags);
> +	md_unregister_thread(daemon); /* destroy the thread */
> +}
> 
> would look better as:
> 
> +static void bitmap_stop_daemon(struct bitmap *bitmap)
> +{
> +	mdk_thread_t *daemon;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&bitmap->lock, flags);
> +	daemon = bitmap->daemon;
> +	bitmap->daemon = NULL;
> +	spin_unlock_irqrestore(&bitmap->lock, flags);
> +	if (bitmap->daemon)
> +		md_unregister_thread(daemon); /* destroy the thread */
> +}

OK

>    
>    11/  md_update_sb needs to be called with the mddev locked, and I
>         think there are times when you call it without the lock.
>         I would prefer it if you left it 'static' and just set the
>         sb_dirty flag.  raid[156]d will the update date it soon
>         enough.

I think that will be OK, as long as it doesn't open up a window for 
things to get into an inconsistent state if there's a crash.


>    12/ The tests in hot_add_disk to see if the newly added device is
>        sufficiently up-to-date that the bitmap can be used to get it
>        the rest of the way up-to-date must be wrong as they don't
>        contain any reference to 'events'.
>        You presumably want to be able to fail/remove a drive and then
>        re-add it and not require a full resync.
>        For this to work, you need to record an event number when the
>        bitmap switches from "which blocks on active drives are not
>        in-sync" to "which blocks active drives have changed since a
>        drive failed", and when you incorporate a new device, only
>        allow it to be synced based on the bitmap if the event counter
>        is at least as new as the saved one (plus the checks you
>        currently have).

Yes, the current code assumes only two partitions, and thus does not do 
this extra checking. I'll look at adding that.


>     13/ The test
> +	} else if (atomic_read(&mddev->bitmap_file->f_count) > 2) {
>         in set_bitmap_file is half-hearted at best.
>         What you probably want is "deny_write_access".

I'll check that out.

>         Or just check that the file is owned by root and isn't world
> 	writable. 
> 
>         The check against the uuid in the header should be enough to
>         ensure that operator error doesn't result in the one file
>         being used for two arrays.
> 
>     14/ In md_do_sync, I think you should move "cond_reshed()" above 
> 
> @@ -3312,6 +3506,10 @@ static void md_do_sync(mddev_t *mddev)
>  			goto out;
>  		}
>  
> +		/* don't worry about speed limits unless we do I/O */
> +		if (!need_sync)
> +			continue;
> +
>  		/*
>  		 * this loop exits only if either when we are slower than
>  		 * the 'hard' speed limit, or the system was IO-idle for
> 
>         to make sure that mdX_sync doesn't steal too much time before
>         allowing a reschedule.

I'll look at doing this.


> 
> and the worst part about it is that the code doesn't support what I
> would think would be the most widely used and hence most useful case,
> and that is to store the bitmap in the 60K of space after the
> superblock.

As mentioned above, the next patch will support this configuration (as 
well as standalone bitmap devices/partitions).


Hopefully, the next patch will be more to your liking (and much smaller, 
too...)

--
Paul

next prev parent reply	other threads:[~2004-05-04 20:08 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-01-29 22:51 [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes Paul Clements
2004-01-30 22:52 ` Paul Clements
2004-02-09  2:51 ` Neil Brown
2004-02-09 19:45   ` Paul Clements
2004-02-10  0:04     ` Neil Brown
2004-02-10 16:20       ` Paul Clements
2004-02-10 16:57       ` Paul Clements
2004-02-13 20:58       ` Paul Clements
2004-03-05  5:06         ` Neil Brown
2004-03-05 22:05           ` Paul Clements
2004-03-31 18:38             ` Paul Clements
2004-04-28 18:10               ` Paul Clements
2004-04-28 18:53                 ` Peter T. Breuer
2004-04-29  8:41               ` Neil Brown
2004-05-04 20:08                 ` Paul Clements [this message]
2004-06-08 20:53                 ` Paul Clements
2004-06-08 22:47                   ` Neil Brown
2004-06-14 23:39                   ` Neil Brown
2004-06-14 23:59                     ` James Bottomley
2004-06-15  6:27                   ` Neil Brown
2004-06-17 17:57                     ` Paul Clements
2004-06-18 20:48                     ` Paul Clements
2004-06-23 21:48                     ` Paul Clements
2004-06-23 21:50                       ` Paul Clements
2004-07-06 14:52                       ` Paul Clements
     [not found]                       ` <40F7E50F.2040308@steeleye.com>
     [not found]                         ` <16649.61212.310271.36561@cse.unsw.edu.au>
2004-08-10 21:37                           ` Paul Clements
2004-08-13  3:04                             ` Neil Brown
2004-09-21  3:28                               ` Paul Clements
2004-09-21 19:19                                 ` Paul Clements
2004-10-12  2:15                                   ` Neil Brown
2004-10-12 14:06                                     ` Paul Clements
2004-10-12 21:16                                       ` Paul Clements
2004-11-10  0:37                                     ` md: persistent (file-backed) bitmap Neil Brown
2004-11-10 18:28                                       ` Paul Clements

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4097F82E.60404@steeleye.com \
    --to=paul.clements@steeleye.com \
    --cc=james.bottomley@steeleye.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=neilb@cse.unsw.edu.au \
    --cc=ptb@it.uc3m.es \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.