* [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
@ 2004-01-29 22:51 Paul Clements
2004-01-30 22:52 ` Paul Clements
2004-02-09 2:51 ` Neil Brown
0 siblings, 2 replies; 34+ messages in thread
From: Paul Clements @ 2004-01-29 22:51 UTC (permalink / raw)
To: linux-raid; +Cc: ptb, mingo, james.bottomley, Neil Brown

Description
===========
This patch provides the md driver with the ability to track
resync/rebuild progress with a bitmap. It also gives the raid1 driver
the ability to perform asynchronous writes (i.e., writes are
acknowledged before they actually reach the secondary disk). The bitmap
and asynchronous write capabilities are primarily useful when raid1 is
employed in data replication (e.g., with a remote disk served over nbd
as the secondary device). However, the bitmap is also useful for
reducing resync/rebuild times with ordinary (local) raid1, raid5, and
raid6 arrays.

Background
==========
This patch is an adjunct to Peter T. Breuer's raid1 bitmap code (fr1
v2.14, ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.14.tgz). The code was
originally written for 2.4 (I have patches vs. 2.4.19/20 Red Hat and
SuSE kernels, if anyone is interested). The 2.4 version of this patch
has undergone extensive alpha, beta, and stress testing, including a WAN
setup where a 500MB partition was mirrored across the U.S. The 2.6
version of the patch remains as close to the 2.4 version as possible,
while still allowing it to function properly in the 2.6 kernel. The 2.6
code has also been tested quite a bit and is fairly stable.

Features
========

Persistent Bitmap
-----------------
The bitmap tracks which blocks are out of sync between the primary and
secondary disk in a raid1 array (in raid5, the bitmap would indicate
which stripes need to be rebuilt). The bitmap is stored in memory (for
speed) and on disk (for persistence, so that a full resync is never
needed, even after a failure or reboot).
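The mapping from a write's byte offset to a bitmap bit is straightforward; here is a userspace sketch of the idea (the chunk size, array size, and function names are illustrative, not the patch's actual symbols):

```c
#include <stdint.h>

/* Model of the in-memory dirty bitmap: one bit per "bitmap chunk". */
#define CHUNK_SHIFT  12          /* hypothetical 4KB chunks */
#define BITMAP_BYTES 4096

static unsigned char bitmap[BITMAP_BYTES];

/* Mark the chunk containing byte offset 'off' as out of sync. */
static void bitmap_set_dirty(uint64_t off)
{
    uint64_t chunk = off >> CHUNK_SHIFT;        /* which bit */
    bitmap[chunk >> 3] |= 1u << (chunk & 7);    /* set it */
}

static int bitmap_is_dirty(uint64_t off)
{
    uint64_t chunk = off >> CHUNK_SHIFT;
    return (bitmap[chunk >> 3] >> (chunk & 7)) & 1;
}
```

On resync, only chunks whose bit is set need to be copied to the secondary; clean chunks are skipped entirely.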
There is a kernel daemon that periodically (lazily) clears bits in the
bitmap file (this reduces the number and frequency of disk writes to the
bitmap file).

The bitmap can also be rescaled -- i.e., the amount of data that each
bit represents can be changed. A larger chunk size yields a smaller
bitmap and fewer bitmap updates (increased efficiency) at the cost of
reduced bitmap granularity (coarser resyncs).
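An offline rescale that doubles the chunk size amounts to OR-ing each pair of adjacent bits into one bit of a half-sized bitmap; a minimal sketch under that assumption (not the patch's actual code):

```c
#include <stddef.h>

/* Coarsen a bitmap by doubling the chunk size: a new bit is dirty if
 * either of the two old bits it covers was dirty.  'dst' must be
 * zeroed by the caller. */
static void bitmap_coarsen(const unsigned char *src, unsigned char *dst,
                           size_t nbits_src)
{
    for (size_t b = 0; b + 1 < nbits_src; b += 2) {
        int dirty = ((src[b >> 3] >> (b & 7)) & 1) |
                    ((src[(b + 1) >> 3] >> ((b + 1) & 7)) & 1);
        if (dirty)
            dst[(b / 2) >> 3] |= 1u << ((b / 2) & 7);
    }
}
```

Expanding (halving the chunk size) is the inverse: each old bit is copied into two new bits, erring on the side of marking data dirty.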
Currently, the bitmap code has been implemented only for raid1, but it
could easily be leveraged by other raid drivers (namely raid5 and raid6)
by adding a few calls to the bitmap routines in the appropriate places.

Asynchronous Writes
-------------------
The asynchronous write capability allows the raid1 driver to function
more efficiently in data replication environments (i.e., where the
secondary disk is remote). Asynchronous writes allow us to overcome high
network latency by filling the network pipe.
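A toy model of the write window (the real driver tracks in-flight bios; this only illustrates the bounded-outstanding-writes idea behind the `--async=512` limit):

```c
/* Writes are acknowledged immediately as long as fewer than 'max' are
 * outstanding; beyond that the submitter must wait for remote acks. */
struct async_window {
    unsigned outstanding;   /* writes acked locally, not yet remotely */
    unsigned max;           /* the --async limit */
};

static int async_write_start(struct async_window *w)
{
    if (w->outstanding >= w->max)
        return 0;           /* pipe full: caller must wait */
    w->outstanding++;
    return 1;               /* acknowledged before remote completion */
}

static void async_write_done(struct async_window *w)
{
    if (w->outstanding)
        w->outstanding--;   /* remote ack arrived; free a slot */
}
```

With a window larger than the bandwidth-delay product of the link, the mirror can stream at full bandwidth instead of stalling one round trip per write.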

Modifications to mdadm
----------------------
I have modified Neil's mdadm tool to allow it to configure the
additional bitmap and async parameters. The attached patch is against
the 1.2 mdadm release. Briefly, the new options are:
Creation:
mdadm -C /dev/md0 -l 1 -n 2 --persistent --async=512
--bitmap=/tmp/bitmap_md0_file,4096,5 /dev/xxx /dev/yyy
This creates a raid1 array with:
* 2 disks
* a persistent superblock
* asynchronous writes enabled (maximum of 512 outstanding writes)
* bitmap enabled (using the file /tmp/bitmap_md0_file)
* a bitmap chunksize of 4k (bitmap chunksize determines how much data
each bitmap bit represents)
* the bitmap daemon set to wake up every 5 seconds to clear bits in the
bitmap file (if needed)
* /dev/xxx as the primary disk
* /dev/yyy as the backup disk (when asynchronous writes are enabled, the
second disk in the array is labelled as a "backup", indicating that it
is remote, and thus no reads will be issued to the device)
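For a given chunk size, the size of the bitmap follows directly from the device size; a sketch of the arithmetic (rounding up at each step; names are illustrative):

```c
#include <stdint.h>

/* One bit per chunk, rounded up. */
static uint64_t bitmap_bits(uint64_t device_bytes, uint64_t chunk_bytes)
{
    return (device_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Bytes needed to store those bits in the bitmap file. */
static uint64_t bitmap_file_bytes(uint64_t device_bytes, uint64_t chunk_bytes)
{
    return (bitmap_bits(device_bytes, chunk_bytes) + 7) / 8;
}
```

So a 1GB device at the 4KB chunk size above needs 262144 bits, i.e. a 32KB bitmap file.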
Assembling:
mdadm -A /dev/md0 --bitmap=/tmp/bitmap_md0_file /dev/xxx /dev/yyy
This assembles an existing array and configures it to use a bitmap file.
The bitmap file pathname is not stored in the array superblock, and so
must be specified every time the array is assembled.
Details:
mdadm -D /dev/md0
This will display information about /dev/md0, including some additional
information about the bitmap and async parameters.
I've also added some information to the /proc/mdstat file:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 loop0[0] loop1[1](B)
39936 blocks [2/2] [UU]
async: 0/256 outstanding writes
bitmap: 1/1 pages (15 cached) [64KB], 64KB chunk, file:
/tmp/bitmap_md1
unused devices: <none>
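The "1/1 pages" figure can be reproduced from the numbers above: 39936 1KB blocks at a 64KB bitmap chunk is 624 bits, which fits in a single page of bitmap memory. A sketch of that arithmetic, assuming 4KB pages:

```c
#include <stdint.h>

static uint64_t div_up(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

/* chunks -> bitmap bytes -> pages, rounding up at each step */
static uint64_t pages_needed(uint64_t blocks_kb, uint64_t chunk_kb,
                             uint64_t page_bytes)
{
    uint64_t chunks = div_up(blocks_kb, chunk_kb);  /* one bit per chunk */
    uint64_t bytes  = div_up(chunks, 8);
    return div_up(bytes, page_bytes);
}
```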
More details on the design and implementation can be found in Section 3
of my 2003 OLS Paper:
http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Clements-OLS2003.pdf

Patch Location
==============
Finally, the patches are available here:
kernel patch vs. 2.6.2-rc2-bk3
------------------------------
http://parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_30_2_6_2_RC2_BK3_RELEASE.diff
mdadm patch vs. 1.2.0
---------------------
http://parisc-linux.org/~jejb/md_bitmap/mdadm_1_2_0.diff
So if you're interested, please review, test, ask questions, etc. Any
feedback is welcome.
Thanks,
Paul
^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-01-29 22:51 [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes Paul Clements
@ 2004-01-30 22:52 ` Paul Clements
  2004-02-09  2:51 ` Neil Brown
  1 sibling, 0 replies; 34+ messages in thread
From: Paul Clements @ 2004-01-30 22:52 UTC (permalink / raw)
To: linux-raid, ptb, mingo, james.bottomley, Neil Brown

I've uploaded a patch against the latest mdadm (1.5.0):

http://parisc-linux.org/~jejb/md_bitmap/mdadm_1_5_0.diff

Thanks,
Paul

Paul Clements wrote:
> Description
> ===========
> This patch provides the md driver with the ability to track
> resync/rebuild progress with a bitmap. [...]
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-01-29 22:51 [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes Paul Clements
  2004-01-30 22:52 ` Paul Clements
@ 2004-02-09  2:51 ` Neil Brown
  2004-02-09 19:45   ` Paul Clements
  1 sibling, 1 reply; 34+ messages in thread
From: Neil Brown @ 2004-02-09 2:51 UTC (permalink / raw)
To: Paul Clements; +Cc: linux-raid, ptb, mingo, james.bottomley

On Thursday January 29, Paul.Clements@SteelEye.com wrote:
> Description
> ===========
> This patch provides the md driver with the ability to track
> resync/rebuild progress with a bitmap. [...]

Thanks for this. There is obviously some good functionality here, but
I do have a few issues with the patch and the code that I would like
to see addressed first.

- The patch seems to add several separate, though related, pieces of
  functionality. It would be a lot easier to review if they were
  separate. There is (at least):
   - "intent-logging" in the bitmaps.
   - optimising resync using the intents.
   - async writes to non-primary devices
   - 'hotrepair'

- bitmap storage. I gather these are stored on a separate device (in
  a separate filesystem). It would be nice to be able to store them
  on the raid array directly so the optimised-resync works on a
  root-filesystem device as well. There is nearly 64k after the
  superblock that can be used. I think there should be an option to
  store the bitmap on all the (primary) devices after the superblock.
  (This could be added later - a separate patch)

- A question - can the bitmap be "rescaled" online, or only offline?
  (not that I particularly want on-line rescaling) How is it achieved?

- ioctl - By changing the size of mdu_array_info_t you have changed
  the GET_ARRAY_INFO ioctl, so you lose backwards compatibility with
  user-space tools. If you want to return more info with
  GET_ARRAY_INFO (which is reasonable) we need to handle both the old
  and new mdu_array_info_t sizes. Ditto for SET_ARRAY_INFO

- superblock. If you have an array with a bitmap attached, boot an
  old kernel that doesn't understand bitmaps, use the array for a
  while (after a full sync), have an unclean shutdown, and then boot
  back into the new kernel, the out-of-date bitmap will be used to
  optimise resync - incorrectly. Is this possibility handled at all?

- I don't see the point of the new "array_size" personality method.
  Surely the size of the bitmap depends on the device size
  (mddev->size), rather than the array_size - after all, that is what
  md_do_sync iterates over.

- I don't think it is right for "stop" to return EBUSY if there are
  outstanding async writes. It should either wait for the writes to
  complete, or abort them (or both, with a timeout).

- Write requests tend to come in bursts. So when you get a burst of
  write requests, it would be most efficient to set all the on-disk
  bits for all requests in the burst, flush all blocks which have
  changed, and then release the burst of write requests. If I read
  the code correctly, you do one bitmap-block-write for every bit
  that gets set.

- returning -ENOMEM in the make_request function is a no-no. You need
  to either cope with a kmalloc failure, or use a mempool or similar
  to guarantee some memory either now or soon.

- The bitmap_page structure is amusing. On memory failure, you
  highjack the "*map" pointer to form 2 16bit counters, and spend a
  whole 32bit "unsigned int" to flag if this is the case or not.
  Would it be better to use one bit of "counter" as the highjack
  flag, and thereby save a substantial fraction of the size of
  'bitmap_page'?

- What is the relationship between the 'bitmap_fd' field in
  mdu_array_info_s, and the SET_BITMAP_FILE ioctl? They both seem to
  set the bitmap file.

Finally, do you have any measurements of the performance impact of the
intent logging on various workloads?

That'll do for now.

NeilBrown
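Neil's ioctl point is worth illustrating: Linux packs the argument size into the request code itself, so growing `mdu_array_info_t` silently produces a different `GET_ARRAY_INFO` value. A userspace model of the `_IOR` packing (bit layout per the generic/x86 `<asm-generic/ioctl.h>`; md's `MD_MAJOR` is 9):

```c
#include <stdint.h>

/* Field widths of the generic ioctl encoding: number, type, size, dir. */
#define IOC_NRBITS    8
#define IOC_TYPEBITS  8
#define IOC_SIZEBITS  14
#define IOC_READ      2u

/* Equivalent of _IOR(type, nr, t) with sizeof(t) passed explicitly. */
static uint32_t mock_ior(uint32_t type, uint32_t nr, uint32_t size)
{
    return (IOC_READ << (IOC_NRBITS + IOC_TYPEBITS + IOC_SIZEBITS)) |
           (size     << (IOC_NRBITS + IOC_TYPEBITS)) |
           (type     << IOC_NRBITS) | nr;
}
```

Because `size` occupies bits 16-29, two structs of different sizes behind the same (type, nr) pair yield distinct request codes, which is why the kernel can (and must) recognise both the old and new layouts.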
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-02-09  2:51 ` Neil Brown
@ 2004-02-09 19:45   ` Paul Clements
  2004-02-10  0:04     ` Neil Brown
  0 siblings, 1 reply; 34+ messages in thread
From: Paul Clements @ 2004-02-09 19:45 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley

Neil Brown wrote:
> On Thursday January 29, Paul.Clements@SteelEye.com wrote:
> > This patch provides the md driver with the ability to track
> > resync/rebuild progress with a bitmap. [...]
>
> Thanks for this.

...and thanks for your feedback. There are some really good
ideas/suggestions in here. A couple of them I had thought about, but
wavered on which way to implement them. I'll respond to each of your
points below...

> - The patch seems to add several separate, though related, pieces of
>   functionality. It would be a lot easier to review if they were
>   separate. [...]

I can fairly easily split the bitmap code from the async write code.
Splitting the smaller bitmap pieces from the rest of the bitmap code,
however, is a little more difficult. Also, the "hotrepair" and
"resync" parts would be so small that I'm not sure it would be worth
it. Would it be alright if I split it into two patches and see what
that looks like?

> - bitmap storage. I gather these are stored on a separate device
>   (in a separate filesystem).

Yes.

> It would be nice to be able to store them on the raid array directly
> so the optimised-resync works on a root-filesystem device as well.
> [...] (This could be added later - a separate patch)

Yes, that would be nice. I'll think about this some and maybe submit
it as a follow-on patch.

> - A question - can the bitmap be "rescaled" online, or only offline?
>   (not that I particularly want on-line rescaling) How is it
>   achieved?

The bitmap file can be rescaled offline by simply expanding or
collapsing the bits as necessary, and then writing the enlarged or
shrunken bitmap file. When the array is started (with a new bitmap
chunk size), everything will work correctly. (The bitmap file can also
be truncated, in which case the driver will assume all the bits to be
dirty and perform a full resync.)

> - ioctl - By changing the size of mdu_array_info_t you have changed
>   the GET_ARRAY_INFO ioctl, so you lose backwards compatibility with
>   user-space tools.

Yes, unfortunately...

> If you want to return more info with GET_ARRAY_INFO (which is
> reasonable) we need to handle both the old and new mdu_array_info_t
> sizes. Ditto for SET_ARRAY_INFO

OK, I'll code this up and include it in the next patch. Some type of
versioning scheme (like what you've done with the superblock) would
probably do the trick.

> - superblock. If you have an array with a bitmap attached, boot an
>   old kernel that doesn't understand bitmaps, use the array for a
>   while [...] and then boot back into the new kernel, the
>   out-of-date bitmap will be used to optimise resync - incorrectly.
>   Is this possibility handled at all?

No, I guess I was assuming that a persistent bitmap could always be
trusted, but in this particular situation, that's not the case. I
guess I'll have to add bitmap event counters to the superblock?

> - I don't see the point of the new "array_size" personality method.
>   Surely the size of the bitmap depends on the device size
>   (mddev->size), rather than the array_size [...]

Yikes...thanks for catching that. It got lost in the translation
between 2.4 and 2.6 (size no longer comes directly from the
superblock). I'll be happy to get rid of that bit of ugliness from the
patch.

> - I don't think it is right for "stop" to return EBUSY if there are
>   outstanding async writes. It should either wait for the writes to
>   complete, or abort them (or both, with a timeout).

I'll see what I can do about this. Should be easy enough to wait for
the writes either to complete or error out.

> - Write requests tend to come in bursts. So when you get a burst of
>   write requests, it would be most efficient to set all the on-disk
>   bits for all requests in the burst, flush all blocks which have
>   changed, and then release the burst of write requests. [...]

That would be a very nice optimization. It would also be pretty
involved. If you'd like, I can look at doing this as a follow-on
patch.

> - returning -ENOMEM in the make_request function is a no-no. You
>   need to either cope with a kmalloc failure, or use a mempool or
>   similar to guarantee some memory either now or soon.

Yes, I opted for the simple route here, but I think you're right; a
mempool really is required. I'll code that up and include it in the
next patch.

> - The bitmap_page structure is amusing.

I like to keep my readers entertained... :)

> highjack the "*map" pointer to form 2 16bit counters, and spend a
> whole 32bit "unsigned int" to flag if this is the case or not.
> Would it be better to use one bit of "counter" as the highjack flag,
> and thereby save a substantial fraction of the size of 'bitmap_page'?

In all seriousness though, I opted for the extra flag because I didn't
want to get into figuring out which bits of a pointer could be used as
flags and which could not, across all the various 32 and 64 bit
architectures...it was easier and less risky just to use a couple more
bytes of memory. I suppose if pages are guaranteed to be aligned on
some boundary, the lowest bit could be used as a flag without any risk
of misinterpretation? At any rate, wasting 4 bytes per page didn't
seem like a terrible loss.

> - What is the relationship between the 'bitmap_fd' field in
>   mdu_array_info_s, and the SET_BITMAP_FILE ioctl? They both seem to
>   set the bitmap file.

Yes, they are the same. mdadm opens the bitmap file and passes down a
file descriptor (bitmap_fd). SET_BITMAP_FILE is used in the case of
Assembling an array (where SET_ARRAY_INFO is called with a NULL array
info pointer), while the bitmap_fd is simply passed down inside the
array info struct in the Create case.

> Finally, do you have any measurements of the performance impact of
> the intent logging on various workloads?

No, I don't have any exact numbers, but I haven't observed a slowdown
in the tests I've performed. However, most of my tests were data
replication setups using async writes, which tends to mitigate any
slowdown caused by the bitmap.

--
Paul
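Paul's hijack scheme, modelled in userspace (field names are illustrative, not the patch's): the pointer slot is reused as two 16-bit counters when the page allocation fails, with a separate word recording the reuse:

```c
#include <stdint.h>

/* On alloc failure, the 'map' slot is reinterpreted as two 16-bit
 * counters.  A whole unsigned int flags the reuse -- the "wasted 31
 * bits" Neil points out. */
struct bitmap_page {
    union {
        char *map;                          /* normal case: page of bits */
        struct { uint16_t lo, hi; } count;  /* hijacked: two counters */
    } u;
    unsigned int hijacked;                  /* 1 if counters are in use */
};
```

The union keeps both uses in the same storage; the `hijacked` flag is what tells readers which interpretation is live.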
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-02-09 19:45 ` Paul Clements
@ 2004-02-10  0:04   ` Neil Brown
  2004-02-10 16:20     ` Paul Clements
  ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Neil Brown @ 2004-02-10 0:04 UTC (permalink / raw)
To: Paul Clements; +Cc: Neil Brown, linux-raid, ptb, mingo, james.bottomley

On Monday February 9, Paul.Clements@SteelEye.com wrote:
> I can fairly easily split the bitmap code from the async write code.
> [...] Would it be alright if I split it into two patches and see
> what that looks like?

Two would certainly be better than 1. Splitting off the "resync" stuff
probably isn't necessary, but the 'hotrepair' seems conceptually quite
separate and it would help to see the patch and the justification
separately, but I won't push it.

> The bitmap file can be rescaled offline by simply expanding or
> collapsing the bits as necessary, and then writing the enlarged or
> shrunken bitmap file. [...]

It would seem to make sense to store the "chunk size" in the bitmap
file. In fact, if the bitmap file had a small header containing:

  magic number
  array uuid
  event count
  bitmap chunk size

Then you could rescale the bitmap without touching the superblock and
would at the same time solve the problem with possibly getting an
out-of-date bitmap.

> OK, I'll code this up and include it in the next patch. Some type of
> versioning scheme (like what you've done with the superblock) would
> probably do the trick.

When you change the structure size, you change the ioctl number so you
can differentiate requests based just on the ioctl. So you would need
to define a "struct mdu_array_info_s_old" and have

  #define GET_ARRAY_INFO     _IOR (MD_MAJOR, 0x11, mdu_array_info_t)
  #define GET_ARRAY_INFO_OLD _IOR (MD_MAJOR, 0x11, struct mdu_array_info_s_old)

and handle each ioctl separately (or translate one into the other
early in md_ioctl).

> > - Write requests tend to come in bursts. [...]
>
> That would be a very nice optimization. It would also be pretty
> involved. If you'd like, I can look at doing this as a follow-on
> patch.

You would need an "unplug" concept. When make_request is given a write
request, it puts the request on a queue and schedules a bitmap update.
You set mddev->queue->unplug_fn to a function which flushes the
outstanding bitmap updates, and then moves the queue of outstanding
write requests to somewhere that raid1d can pick them up and initiate
them.

> > - The bitmap_page structure is amusing.
>
> I like to keep my readers entertained... :)
>
> In all seriousness though, I opted for the extra flag because I
> didn't want to get into figuring out which bits of a pointer could
> be used as flags and which could not [...] At any rate, wasting 4
> bytes per page didn't seem like a terrible loss.

True, it isn't a great loss. It just seemed amusing that you squeezed
two 16bit counters into the pointer (seeming to make best use of
available memory) and then wasted 31 bits. If you put the "highjacked"
flag in the counter (not in the pointer) as e.g. the lsb, doing all
real counting by 2s, it should be straightforward.

> Yes, they are the same. mdadm opens the bitmap file and passes down
> a file descriptor (bitmap_fd). SET_BITMAP_FILE is used in the case
> of Assembling an array [...] while the bitmap_fd is simply passed
> down inside the array info struct in the Create case.

Ahh, I see. That's why you need both. I guess you could create the
array "the old way", stop it, and then re-assemble it and pass down
the fd then. If you kept the bitmap parameters in the head of the
bitmap file rather than in the superblock, you could avoid having to
change kernel-side array-creation altogether.

I'm not actually a big fan of having the kernel create the array. You
might note that for new style superblocks, the kernel will not create
the array at all. You have to have user-space write out the initial
superblocks, and then just have the kernel assemble the array. I would
advocate doing the same thing for the bitmap superblock (if you choose
to have one in the bitmap file) - i.e. userspace sets it up and passes
it to the kernel, and the kernel extracts the information it needs.

NeilBrown
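Neil's suggested alternative can be sketched: steal the counter's low bit as the hijacked flag and count in steps of two, so no separate flag word is needed (illustrative helpers, not the eventual kernel code):

```c
/* The real count lives in the upper bits; the lsb says whether the
 * pointer slot has been hijacked.  Counting by 2s leaves the flag
 * undisturbed. */
static unsigned counter_inc(unsigned c)   { return c + 2; }
static unsigned counter_dec(unsigned c)   { return c - 2; }
static unsigned counter_value(unsigned c) { return c >> 1; }
static unsigned set_hijacked(unsigned c)  { return c | 1; }
static int      is_hijacked(unsigned c)   { return c & 1; }
```

The flag and the count coexist in one word, shrinking `bitmap_page` by the 32-bit flag it previously carried.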
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-02-10  0:04 ` Neil Brown
@ 2004-02-10 16:20   ` Paul Clements
  2004-02-10 16:57   ` Paul Clements
  2004-02-13 20:58   ` Paul Clements
  2 siblings, 0 replies; 34+ messages in thread
From: Paul Clements @ 2004-02-10 16:20 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley

Neil Brown wrote:
> When you change the structure size, you change the ioctl number so
> you can differentiate requests based just on the ioctl. [...] and
> handle each ioctl separately (or translate one into the other early
> in md_ioctl).

I was thinking of leaving the original ioctls alone and just adding
two new ioctls that would handle only the new parameters, something
like:

#define GET_EXTRA_ARRAY_INFO _IOR (MD_MAJOR, 0x15, mdu_extra_array_info_t)
#define SET_EXTRA_ARRAY_INFO _IOW (MD_MAJOR, 0x2b, mdu_extra_array_info_t)

/* The extra array info structure is flexible, to allow for future changes */
#define EXTRA_ARRAY_INFO_VERSION 1
typedef struct mdu_extra_array_info_s {
	__u32 extra_info_version;   /* EXTRA_ARRAY_INFO_VERSION */
	__u32 async_max_writes;     /* Max outstanding async writes (0 = sync) */
	__u32 bitmap_fd;            /* The bitmap file descriptor (in only) */
	__u32 bitmap_chunksize;     /* The bitmap chunksize */
	__u32 bitmap_daemon_sleep;  /* The bitmap daemon sleep period */
	char  bitmap_path[256];     /* The bitmap filename (out only) */
	char  pad[4096-256-20];     /* Allow for future additions, empty now */
} mdu_extra_array_info_t;

So for older kernels, the ioctls would just error out and mdadm would
ignore that, and the array would be run using only the old features.
This will also allow me to get rid of the SET_BITMAP_FILE ioctl and
just use the SET_EXTRA_ARRAY_INFO ioctl for Assembling.

Sound OK?

--
Paul
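The fallback Paul describes might look like this in userspace (the ioctl is stubbed so the sketch runs anywhere; which errno an old kernel returns for an unknown md ioctl is an assumption here):

```c
#include <errno.h>

/* Stand-in for ioctl(fd, SET_EXTRA_ARRAY_INFO, &extra) against an old
 * kernel: the request is unrecognised. */
static int fake_ioctl_old_kernel(int fd, unsigned long req, void *arg)
{
    (void)fd; (void)req; (void)arg;
    errno = EINVAL;         /* old md_ioctl rejects unknown requests */
    return -1;
}

/* Returns 1 if extras were configured, 0 if the kernel predates them
 * (run with old features only), -1 on a real error. */
static int set_extra_info(int fd, void *extra)
{
    if (fake_ioctl_old_kernel(fd, 0x2b /* hypothetical */, extra) < 0) {
        if (errno == EINVAL || errno == ENOTTY)
            return 0;       /* old kernel: silently skip the extras */
        return -1;
    }
    return 1;
}
```

This is the degradation path the message describes: mdadm tries the new ioctl, and on a recognisable "unknown ioctl" error simply proceeds without bitmap/async support.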
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-02-10  0:04 ` Neil Brown
  2004-02-10 16:20 ` Paul Clements
@ 2004-02-10 16:57 ` Paul Clements
  2004-02-13 20:58 ` Paul Clements
  2 siblings, 0 replies; 34+ messages in thread
From: Paul Clements @ 2004-02-10 16:57 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley

Neil Brown wrote:
> It would seem to make sense to store the "chunk size" in the bitmap
> file. In fact, if the bitmap file had a small header containing:
>   magic number
>   array uuid
>   event count
>   bitmap chunk size
> Then you could rescale the bitmap without touching the superblock
> and would at the same time solve the problem with possibly getting
> an out-of-date bitmap.

That sounds good, as it would lend some more foolproofing to the
bitmap file operations. The only problem is that we wouldn't be able
to support bitmaps without backing files (in-memory only bitmaps). I
guess if we don't think that's a useful feature, we could do this.

We could also add this bitmap file superblock as a follow-on patch.
The code that reads in the bitmap file would simply look for the magic
string in the superblock to determine if the bitmap file is a flat
file (old style) or contains a superblock (new style).

--
Paul
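A sketch of such a bitmap-file superblock and the magic-string detection discussed above (every field name, width, and the magic value here are hypothetical):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk header at offset 0 of the bitmap file,
 * following Neil's list: magic, array uuid, event count, chunk size. */
struct bitmap_super {
    char     magic[8];    /* "md_bitm" -- invented for this sketch */
    uint8_t  uuid[16];    /* array uuid, matched against the md superblock */
    uint64_t events;      /* md event count when the bitmap was last valid */
    uint32_t chunksize;   /* bytes of data per bitmap bit */
};

/* Old-style flat files have no magic, so absence of the magic string
 * selects the legacy "every byte is bitmap data" interpretation. */
static int has_bitmap_super(const void *file_start)
{
    return memcmp(file_start, "md_bitm", 7) == 0;
}
```

Comparing `events` against the array superblock's event count is what would catch the stale-bitmap-after-old-kernel-boot scenario from earlier in the thread.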
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
From: Paul Clements @ 2004-02-13 20:58 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley

And here is the next set of patches. Tested thoroughly against
2.6.2-rc2-bk3 and compile tested against 2.6.3-rc2.

I've split the kernel code into two patches. The async write patch is
dependent on the bitmap patch. The bitmap patch includes the new
GET/SET_EXTRA_ARRAY_INFO ioctls and a mempool for the bitmap_update
structures (no more -ENOMEM in the I/O request path).

bitmap patch vs. 2.6.3-rc2
--------------------------
http://parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_30_2_6_3_RC2.diff

async write patch vs. 2.6.3-rc2
-------------------------------
http://parisc-linux.org/~jejb/md_bitmap/md_async_2_30_2_6_3_RC2.diff

mdadm patch vs. 1.5.0
---------------------
http://parisc-linux.org/~jejb/md_bitmap/mdadm_1_5_0-2.diff

For simplicity's sake, the mdadm patch includes both the bitmap and
async code (it's a fairly small patch anyway, and probably wouldn't have
benefited much from a split).

--
Paul
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
From: Neil Brown @ 2004-03-05 5:06 UTC (permalink / raw)
To: Paul Clements; +Cc: linux-raid, ptb, mingo, james.bottomley

On Friday February 13, Paul.Clements@SteelEye.com wrote:
> And here is the next set of patches. Tested thoroughly against
> 2.6.2-rc2-bk3 and compile tested against 2.6.3-rc2.
>
> I've split the kernel code into two patches. The async write patch is
> dependent on the bitmap patch. The bitmap patch includes the new
> GET/SET_EXTRA_ARRAY_INFO ioctls and a mempool for the bitmap_update
> structures (no more -ENOMEM in the I/O request path).
>
> bitmap patch vs. 2.6.3-rc2
> --------------------------
> http://parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_30_2_6_3_RC2.diff
>
> async write patch vs. 2.6.3-rc2
> -------------------------------
> http://parisc-linux.org/~jejb/md_bitmap/md_async_2_30_2_6_3_RC2.diff
>
> mdadm patch vs. 1.5.0
> ---------------------
> http://parisc-linux.org/~jejb/md_bitmap/mdadm_1_5_0-2.diff
>
> For simplicity's sake, the mdadm patch includes both the bitmap and
> async code (it's a fairly small patch anyway, and probably wouldn't have
> benefited much from a split).

I finally had a chance to look at this properly -- or at least at the
"bitmap" patch. There is a lot in there that I don't like.

I was just going to rip out the bits that I wasn't happy with and
produce a minimal patch that I was happy with and which did something
useful, but I ran into problems with the file io in bitmap.c. It is
just simply wrong, and needs a substantial rewrite.

Using "generic_file_write" is wrong. It is a service routine for
filesystems to call. Not an interface for clients of a filesystem to
call. You should look at loop.c. It seems to be closer to correct.
You need to see the files as an "address_space" and use address_space
operations to access it.
   ->readpage to read a page
   ->prepare_write / ->commit_write to write a page
     (or maybe write_page, I'm not sure)
   ->sync_page to sync a page

Apart from that....

The more I think about it, the less I like extra fields being added to
the superblock. I think the bitmap file should have a header with
   magic
   uuid (which must match raid array)
   events counter (which must also match)
   chunk size
   other configurables.

Then to assemble an array with a bitmap you create the bitmap file, and
pass it down with a new ioctl. If the array has a persistent
superblock, md checks the header against the superblock, otherwise it
just trusts you.

You don't need any ioctls to get or set extra information about the
array. Just read the header off the bitmap file.

I'm not 100% sure what R1BIO_NotUptodate is supposed to do, but I'm
fairly sure it does it wrongly, and I don't like the name (as it seems
similar to "R1BIO_Uptodate", but is clearly a lot different).

So, if you can produce a patch which *only* adds a persistent bitmap in
a file, uses it to record which blocks are not in-sync, and optimises
resync using the bitmap, and which uses address_space operations for
fileio, then I will see if it is acceptable, and we can then start
adding bits like hot-repair and async-write etc on top of that.

NeilBrown
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
From: Paul Clements @ 2004-03-05 22:05 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley

Neil Brown wrote:
> On Friday February 13, Paul.Clements@SteelEye.com wrote:
> I finally had a chance to look at this properly -- or at least at the
> "bitmap" patch.

Thanks for looking this over...

> Using "generic_file_write" is wrong. It is a service routine for
> filesystems to call. Not an interface for clients of a filesystem to
> call.

Agreed. I'm pretty unfamiliar with some of these interfaces, so I was
not sure which ones were the correct ones to use. After looking over the
interfaces a bit more and with some guidance from a couple of experts, I
have a much better idea about what to do...

> You should look at loop.c. It seems to be closer to correct.
> You need to see the files as an "address_space" and use address_space
> operations to access it.
>    ->readpage to read a page

I'm already using read_cache_page(). Didn't realize that it would
actually extend the file for me automatically. Given that it can do
that, the generic read and write stuff can just be thrown out
completely.

>    ->prepare_write / ->commit_write to write a page

Yes, these combined with read_cache_page are all I need...

> The more I think about it, the less I like extra fields being added
> to the superblock.

OK, fair enough. I'll look at adding a header to the bitmap file and
I'll axe the superblock fields and the new ioctl I created.

> I think the bitmap file should have a header with
>    magic
>    uuid (which must match raid array)
>    events counter (which must also match)
>    chunk size
>    other configurables.

...and maybe a version field and some padding to allow for future
changes.
This all means no more in-memory-only bitmaps, but I think that's not
really a critical feature anyway.

> Then to assemble an array with a bitmap you create the bitmap file,
> and pass it down with a new ioctl.

Right. Probably a lot simpler than the existing method.

> If the array has a persistent superblock, md checks the header
> against the superblock, otherwise it just trusts you.

Yep.

> You don't need any ioctls to get or set extra information about the
> array. Just read the header off the bitmap file.

Yep.

> I'm not 100% sure what R1BIO_NotUptodate is supposed to do, but I'm
> fairly sure it does it wrongly, and I don't like the name (as it
> seems similar to "R1BIO_Uptodate", but is clearly a lot different).

Yes, looking back at this, there were a couple of errors in the way that
bit was handled. I've corrected those and renamed the bit to
R1BIO_Degraded. I think this is a better name as it's closer to what the
bit really signifies. If the bit is set, it means:

1) either we've got a degraded array (only 1 disk), or
2) we've gotten a write failure (and we're about to degrade the array)

In either case, we will not clear the bitmap.

> So, if you can produce a patch which *only* adds a persistent bitmap
> in a file, uses it to record which blocks are not in-sync, and
> optimises resync using the bitmap, and which uses address_space
> operations for fileio, then I will see if it is acceptable, and we can
> then start adding bits like hot-repair and async-write etc on top of
> that.

I'll work on that and get it out as soon as I can...

Thanks again,
Paul
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
From: Paul Clements @ 2004-03-31 18:38 UTC (permalink / raw)
To: Neil Brown, linux-raid, ptb, mingo, james.bottomley

Paul Clements wrote:
> Neil Brown wrote:
> > So, if you can produce a patch which *only* adds a persistent bitmap
> > in a file, uses it to record which blocks are not in-sync, and
> > optimises resync using the bitmap, and which uses address_space
> > operations for fileio, then I will see if it is acceptable, and we can
> > then start adding bits like hot-repair and async-write etc on top of
> > that.
>
> I'll work on that and get it out as soon as I can...

OK, so here it is. The changes from the previous patch are, basically:

1) The patch is a little less intrusive now. The diffs against md/raid1
are smaller as more of the bitmap handling code is now isolated in the
bitmap.c and bitmap.h files. Also, the diff against mdadm, while now a
little larger, is less intrusive as I've created a bitmap.c file in the
mdadm sources, as well.

2) The bitmap file I/O routines have been totally rewritten and are now
much simpler. We now use read_cache_page, prepare_write/commit_write (to
extend the file at init time, if necessary), and bmap/submit_bh (to
"sync" the file pages). The use of bmap and submit_bh is due to a
limitation in jbd that allows only one ongoing transaction per process
(see comment in bitmap.c regarding current->journal_info).

3) The bitmap file now has a superblock containing various configuration
information. The bitmap superblock is checked against the md superblock
when the array is assembled.
Options have been added to mdadm to create and examine bitmap files, and
also to specify the bitmap file to be used with an array:

Create a bitmap file:
--------------------

# mdadm --create-bitmap=65536,3,580480 /tmp/bitmap --force

(this creates a bitmap file with a chunksize of 64KB, a 3 second bitmap
daemon sleep period, and a bitmap (initially dirtied) which is
appropriate for use with an array of 580480 blocks)

Examine a bitmap file:
---------------------

# mdadm --examine-bitmap /tmp/bitmap  (or just: mdadm -X /tmp/bitmap)
        Filename : /tmp/bitmap
           Magic : 6d746962
         Version : 2
            UUID : 997cb579.99d31d20.3014cae8.4bb4bf9d
          Events : 5
           State : OK
       Chunksize : 4 KB
          Daemon : 5s flush period
      Array Size : 580480 (566.88 MiB 594.41 MB)
          Bitmap : 145120 bits (chunks), 145120 dirty (100.0%)

(in addition, mdadm -D <array> gives the bitmap information if a bitmap
is attached to the array)

Create or Assemble array using a bitmap:
---------------------------------------

# mdadm -C /dev/md2 -n 2 -l 1 --bitmap=/tmp/bitmap /dev/sda5 missing

# mdadm -A /dev/md2 --bitmap=/tmp/bitmap /dev/sda5 /dev/sdb5

Patch Location:
--------------

Patch vs. linux 2.6.5-rc2
=========================
http://parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_31_2_6_5_RC2.diff

Patch vs. mdadm 1.5.0
=====================
http://parisc-linux.org/~jejb/md_bitmap/mdadm_1_5_0-bitmap.diff

(no async write patch has been generated, these patches contain only the
bitmap code)

Notes:
-----

1) an is_create flag was added to do_md_run to tell bitmap_create
whether we are creating or just assembling the array -- this is
necessary since 0.90 superblocks do not have a UUID until one is
generated randomly at array creation time, therefore, we must set the
bitmap UUID equal to this newly generated array UUID when the array is
created

2) bitmap.h was not included in the mdadm patch, but a link (or copy)
must be made to the kernel's include/linux/raid/bitmap.h file in order
to build mdadm now

3) code was added to mdadm to allow creation of arrays with
non-persistent superblocks (also, device size calculation with
non-persistent superblocks was fixed)

4) a fix was made to the hot_remove code to allow a faulty device to be
removed

5) various typo and minor bug fixes were also included in the patches
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
From: Paul Clements @ 2004-04-28 18:10 UTC (permalink / raw)
To: linux-raid; +Cc: Neil Brown, ptb, mingo, james.bottomley

Since there hasn't been any negative feedback, I assume that this
version is acceptable for inclusion in mainline/-mm?

Thanks,
Paul

Paul Clements wrote:
> OK, so here it is. The changes from the previous patch are, basically:
>
> 1) The patch is a little less intrusive now. The diffs against md/raid1
> are smaller as more of the bitmap handling code is now isolated in the
> bitmap.c and bitmap.h files. Also, the diff against mdadm, while now a
> little larger, is less intrusive as I've created a bitmap.c file in the
> mdadm sources, as well.
>
> 2) The bitmap file I/O routines have been totally rewritten and are now
> much simpler. We now use read_cache_page, prepare_write/commit_write (to
> extend the file at init time, if necessary), and bmap/submit_bh (to
> "sync" the file pages). The use of bmap and submit_bh is due to a
> limitation in jbd that allows only one ongoing transaction per process
> (see comment in bitmap.c regarding current->journal_info).
>
> 3) The bitmap file now has a superblock containing various configuration
> information. The bitmap superblock is checked against the md superblock
> when the array is assembled.
>
> Options have been added to mdadm to create and examine bitmap files, and
> also to specify the bitmap file to be used with an array:
>
> Create a bitmap file:
> --------------------
>
> # mdadm --create-bitmap=65536,3,580480 /tmp/bitmap --force
>
> (this creates a bitmap file with a chunksize of 64KB, a 3 second bitmap
> daemon sleep period, and a bitmap (initially dirtied) which is
> appropriate for use with an array of 580480 blocks)
>
> Examine a bitmap file:
> ---------------------
>
> # mdadm --examine-bitmap /tmp/bitmap  (or just: mdadm -X /tmp/bitmap)
>         Filename : /tmp/bitmap
>            Magic : 6d746962
>          Version : 2
>             UUID : 997cb579.99d31d20.3014cae8.4bb4bf9d
>           Events : 5
>            State : OK
>        Chunksize : 4 KB
>           Daemon : 5s flush period
>       Array Size : 580480 (566.88 MiB 594.41 MB)
>           Bitmap : 145120 bits (chunks), 145120 dirty (100.0%)
>
> (in addition, mdadm -D <array> gives the bitmap information if a bitmap
> is attached to the array)
>
> Create or Assemble array using a bitmap:
> ---------------------------------------
>
> # mdadm -C /dev/md2 -n 2 -l 1 --bitmap=/tmp/bitmap /dev/sda5 missing
>
> # mdadm -A /dev/md2 --bitmap=/tmp/bitmap /dev/sda5 /dev/sdb5
>
> Patch Location:
> --------------
>
> Patch vs. linux 2.6.5-rc2
> =========================
> http://parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_31_2_6_5_RC2.diff
>
> Patch vs. mdadm 1.5.0
> =====================
> http://parisc-linux.org/~jejb/md_bitmap/mdadm_1_5_0-bitmap.diff
>
> (no async write patch has been generated, these patches contain only the
> bitmap code)
>
> Notes:
> -----
>
> 1) an is_create flag was added to do_md_run to tell bitmap_create
> whether we are creating or just assembling the array -- this is
> necessary since 0.90 superblocks do not have a UUID until one is
> generated randomly at array creation time, therefore, we must set the
> bitmap UUID equal to this newly generated array UUID when the array is
> created
>
> 2) bitmap.h was not included in the mdadm patch, but a link (or copy)
> must be made to the kernel's include/linux/raid/bitmap.h file in order
> to build mdadm now
>
> 3) code was added to mdadm to allow creation of arrays with
> non-persistent superblocks (also, device size calculation with
> non-persistent superblocks was fixed)
>
> 4) a fix was made to the hot_remove code to allow a faulty device to be
> removed
>
> 5) various typo and minor bug fixes were also included in the patches
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
From: Peter T. Breuer @ 2004-04-28 18:53 UTC (permalink / raw)
To: Paul Clements; +Cc: linux-raid, Neil Brown, ptb, mingo, james.bottomley

"Also sprach Paul Clements:"
> Since there hasn't been any negative feedback, I assume that this
> version is acceptable for inclusion in mainline/-mm?

whew ... since I just got out of a %^&% grant application spin, and can
breathe now, you prompted me to take a look.

First thing I notice is md.o is replaced with md-mod.o. Why? That's
gratuitous breakage of every poor admin's scripts that up till now has
been doing an insmod md, or has had a "md" in /etc/modules. Owwww ....
Is the reason perhaps that md.c is no longer the whole story? So that
there is extra linking? I guess so. Why not put the extra stuff in a
separate module and leave md.o be?

I'll not comment on the bitmap.c file (I still need to figure out how it
can possibly work :-). But I like the comments in the file! Looks
comprehensible and very neat. Perhaps a summary of what's in it could be
put up top ... or is that likely to get out of step? Perhaps.

The changes to md.c ... look very benign. I see you add a function,
export a function previously static ... add bitmap sb update to the sb
update fn and export it.

I can't be sure from the diff alone, but it looks as though a bitmap is
always created with bitmap_create, and there is no way to stop it? Is
that right? I'd need to see more context in the diff to figure it out.
Perhaps the calls to bitmap_create and bitmap_destroy don't do anything
unless the sb is marked as "wanting"

get_bitmap_file seems to be intended to talk to a user, but it's static!
Are you going to use it in an ioctl? If so, surely the ioctl part should
do the copy_to_user, and this function should not?
It's also got a kmalloc of a buf inside, and I would have thought you
had loadsa potential bufferspace lying around in the superblocks you
have in memory ... oh well, it's independent this way.

There's a tiny change

-	if (rdev->raid_disk >= 0)
+	if (rdev->raid_disk >= 0 && !rdev->faulty)
 		goto busy;

which needs more context for me to judge! So we don't pretend we're busy
if what we're busy with is faulty? Where are we? It must be a diskop of
some kind. The code following ...

+
+	/* check the sb on the new disk to see whether the bitmap can be used
+	 * to resync it (if not, we dirty the bitmap to cause a full resync) */
+	if (mddev->bitmap) {
+		int bitmap_invalidate = 1;
+		mdk_rdev_t *rdev0 = list_entry(mddev->disks.next,
+			mdk_rdev_t, same_set);

looks entirely opaque to me! Loads of if's.

There's an addition of set_bitmap_file as an allowed diskop when we
aren't initialised yet - entirely harmless and correct.

Ah .. here I think we have the ioctl for get_bitmap_file. Yes, but I
would have preferred the copy_to_user here, not in the called routine.
Maybe set_bitmap_file does the same "trick"?

There's some status report output reformatting which is hard to figure.
Maybe only a \n added? Probably because you follow with a bitmap status
report when there is one. Shouldn't this bitmap report be a function in
the bitmap.c file? Or would that be all cyclic in terms of dependencies?

Here there is something I don't get:

-	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
+	/* we don't use the checkpoint if there's a bitmap */
+	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) && !mddev->bitmap)

Are you trying to avoid the speed throttling problem here? I guess the
checkpoints are where the speed is recalculated too.
When I patched md.c here to get around the throttle (it doesn't know
that some of the synced blocks are "virtual", thanks to the bitmap
code), I did it by adding an extra array of "discounted blocks" that the
sync code could increment every time it skipped something. Then I
changed the throttle speed calculation in md.c to discount the
discounted blocks each time, and only consider the remainder, which had
been "really" written. So the throttle acted on the real i/o speed, not
the virtual speed. But for printout, I used the virtual speed.

Here's another bit that needs explanation:

-		blk_run_queues();
+		if (need_sync)
+			blk_run_queues();

And here ...

+		/* don't worry about speed limits unless we do I/O */
+		if (!need_sync)
+			continue;

Are you saying that in the corner case when the sync since the previous
checkpoint has been entirely "virtual", we just dive right back round
again? What's this "need_sync" *exactly*? When do you think we "need" a
sync?

Hurrr .. I see changes to raid1.c next. I'll consider them after a stiff
drink.

In summary, the md.c changes look almost certainly benign to me, with
just a couple of one-liners that I would need to look at harder. I would
be grateful if you maybe comment on them here (or even in the code).
Perhaps commenting the patch would be appropriate? Can one add text
between hunks? One can do it before the diff starts.

Peter
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
From: Neil Brown @ 2004-04-29 8:41 UTC (permalink / raw)
To: Paul Clements; +Cc: linux-raid, ptb, mingo, james.bottomley

On Wednesday March 31, Paul.Clements@SteelEye.com wrote:

and less than a month later I replied .... I've been busy and had two
weeks leave in there. Sorry.

> Create a bitmap file:
> --------------------
>
> # mdadm --create-bitmap=65536,3,580480 /tmp/bitmap --force

Maybe:
   mdadm --create-bitmap --chunksize=64 --delay=3 --size=580480
???
while more verbose, it is also easier to remember and more in-keeping
with the rest of mdadm's syntax.

> 1) an is_create flag was added to do_md_run to tell bitmap_create
> whether we are creating or just assembling the array -- this is
> necessary since 0.90 superblocks do not have a UUID until one is
> generated randomly at array creation time, therefore, we must set the
> bitmap UUID equal to this newly generated array UUID when the array is
> created

I think this is the wrong approach. You are justifying a design error
by reference to a previous design error.

I think user-space should be completely responsible for creating the
bitmap file including setting the UUID. Either
 1/ add the bitmap after the array has been created.
or
 2/ Create the array in user-space and just get the kernel to
    assemble it (this is what I will almost certainly do in mdadm
    once I get around to supporting version 1 superblocks).
> 3) code was added to mdadm to allow creation of arrays with
> non-persistent superblocks (also, device size calculation with
> non-persistent superblocks was fixed)
>
> 4) a fix was made to the hot_remove code to allow a faulty device to be
> removed
>
> 5) various typo and minor bug fixes were also included in the patches

please, Please, PLEASE, keep separate patches separate. It makes them
much easier to review, and hence makes acceptance much more likely.

In particular, your fix in 4 is, I think, wrong. I agree that there is
a problem here. I don't think your fix is correct.

But, onto the patch.

1/ Why bitmap_alloc_page instead of just kmalloc?
   If every kernel subsystem kept its own private cache of memory in
   case of desperate need, then there would be a lot of memory wastage.
   Unless you have evidence that times when you need to allocate a
   bitmap are frequently times when there is excessive memory pressure,
   I think you should just trust kmalloc.
   On the other hand, if you have reason to believe that the services
   of kmalloc are substantially suboptimal for your needs, you should
   explain why in a comment.

2/ There is a race in bitmap_checkpage.
   I assume that the required postcondition is that (if the arguments
   are valid) either bitmap->bp[page].hijacked or bitmap->bp[page].map
   but not both.
If this is the case, then

+	if ((mappage = bitmap_alloc_page(bitmap)) == NULL) {
+		PRINTK("%s: bitmap map page allocation failed, hijacking\n",
+			bmname(bitmap));
+		/* failed - set the hijacked flag so that we can use the
+		 * pointer as a counter */
+		spin_lock_irqsave(&bitmap->lock, flags);
+		bitmap->bp[page].hijacked = 1;
+		goto out_unlock;
+	}

should become:

+	if ((mappage = bitmap_alloc_page(bitmap)) == NULL) {
+		PRINTK("%s: bitmap map page allocation failed, hijacking\n",
+			bmname(bitmap));
+		/* failed - set the hijacked flag so that we can use the
+		 * pointer as a counter */
+		spin_lock_irqsave(&bitmap->lock, flags);
+		if (!bitmap->bp[page].map) bitmap->bp[page].hijacked = 1;
+		goto out_unlock;
+	}

as someone else could have allocated a bitmap while we were trying and
failing.

3/ Your bmap_page / sync_page is very filesystem-specific.

   bmap_page sets page->private to a linked-list of buffers. However
   page->private "belongs" to the address-space that the page is in,
   which means the filesystem.
   I don't know if any filesystems use page->private for anything other
   than a list of buffers, but they certainly could if they wanted to,
   and if they did, this code would suddenly break.

   I think that if you have your heart set on being able to store the
   bitmap in a file, that using a loop-back mount would be easiest. But
   if you don't want to do that, at least use the correct interface.
   Referring to the code in loop.c would help.

   Another alternative would be to use the approach that swapfile uses.
   i.e. create a list of extents using bmap information, and then do
   direct IO to the device using this extent information. swapfile
   chooses to ignore any extent that is less than a page. You might not
   want to do that, but you wouldn't have to.

4/ I would avoid extending the file in the kernel. It is too easy to do
   that in user space. Just require that the file is the correct size.
5/ I find it very confusing that you kmap a page, and then leave it to
   some function that you call to kunmap the page (unmap_put_page or
   sync_put_page). It looks very unbalanced. I would much rather see
   the kunmap in the same function as the kmap.

6/ It seems odd that bitmap_set_state calls unmap_put_page instead of
   sync_put_page. Surely you want to sync the superblock at this point.

7/ The existence of bitmap_get_state confuses me. Why not store the
   state in the bitmap structure in bitmap_read_sb??

8/ calling md_force_spare(.., 1) to force a sync is definitely wrong.
   You appear to be assuming a raid1 array with exactly two devices.
   Can't you just set recovery_cp to 0, or maybe just set a flag
   somewhere and test it in md_check_recovery??

9/ why don't you just pass "%s_bitmap" as the thread name to
   md_register_thread? As far as I can tell, it would have exactly the
   same effect as the current code without requiring a kmalloc.

10/
+static void bitmap_stop_daemon(struct bitmap *bitmap)
+{
+	mdk_thread_t *daemon;
+	unsigned long flags;
+
+	spin_lock_irqsave(&bitmap->lock, flags);
+	if (!bitmap->daemon) {
+		spin_unlock_irqrestore(&bitmap->lock, flags);
+		return;
+	}
+	daemon = bitmap->daemon;
+	bitmap->daemon = NULL;
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+	md_unregister_thread(daemon); /* destroy the thread */
+}

would look better as:

+static void bitmap_stop_daemon(struct bitmap *bitmap)
+{
+	mdk_thread_t *daemon;
+	unsigned long flags;
+
+	spin_lock_irqsave(&bitmap->lock, flags);
+	daemon = bitmap->daemon;
+	bitmap->daemon = NULL;
+	spin_unlock_irqrestore(&bitmap->lock, flags);
+	if (daemon)
+		md_unregister_thread(daemon); /* destroy the thread */
+}

11/ md_update_sb needs to be called with the mddev locked, and I think
    there are times when you call it without the lock. I would prefer
    it if you left it 'static' and just set the sb_dirty flag.
    raid[156]d will then update it soon enough.
12/ The tests in hot_add_disk to see if the newly added device is
    sufficiently up-to-date that the bitmap can be used to get it the
    rest of the way up-to-date must be wrong as they don't contain any
    reference to 'events'.
    You presumably want to be able to fail/remove a drive and then
    re-add it and not require a full resync. For this to work, you need
    to record an event number when the bitmap switches from "which
    blocks on active drives are not in-sync" to "which blocks active
    drives have changed since a drive failed", and when you incorporate
    a new device, only allow it to be synced based on the bitmap if the
    event counter is at least as new as the saved one (plus the checks
    you currently have).

13/ The test
+	} else if (atomic_read(&mddev->bitmap_file->f_count) > 2) {
    in set_bitmap_file is half-hearted at best. What you probably want
    is "deny_write_access". Or just check that the file is owned by
    root and isn't world writable. The check against the uuid in the
    header should be enough to ensure that operator error doesn't
    result in the one file being used for two arrays.

14/ In md_do_sync, I think you should move "cond_resched()" above

@@ -3312,6 +3506,10 @@ static void md_do_sync(mddev_t *mddev)
 			goto out;
 		}

+		/* don't worry about speed limits unless we do I/O */
+		if (!need_sync)
+			continue;
+
 		/*
 		 * this loop exits only if either when we are slower than
 		 * the 'hard' speed limit, or the system was IO-idle for

    to make sure that mdX_sync doesn't steal too much time before
    allowing a reschedule.

and the worst part about it is that the code doesn't support what I
would think would be the most widely used and hence most useful case,
and that is to store the bitmap in the 60K of space after the
superblock. If you had implemented that first, then you could have
avoided all the mucking about with files, which seems to be a
problematic area, and probably had working and accepted code by now.
Then adding support for separate files could have been layered on top.
That would have meant that each piece of submitted code was
substantially smaller and hence easier to review.

Anyway, enough for now. I'll try to review your next patch with a
little less delay, but it is a substantial effort, and that makes it
easy to put off.

NeilBrown
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
From: Paul Clements @ 2004-05-04 20:08 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley

Neil Brown wrote:
> On Wednesday March 31, Paul.Clements@SteelEye.com wrote:
> and less than a month later I replied .... I've been busy and had
> two weeks leave in there. Sorry.

Ahh, I noticed you'd been absent from the mailing lists for a bit...
Anyway, thanks for taking the time to review this...

>> Create a bitmap file:
>> --------------------
>>
>> # mdadm --create-bitmap=65536,3,580480 /tmp/bitmap --force

> Maybe:
>    mdadm --create-bitmap --chunksize=64 --delay=3 --size=580480
> ???
> while more verbose, it is also easier to remember and more in-keeping
> with the rest of mdadm's syntax.

OK...and it will probably change slightly if we're not doing bitmaps in
files...more on that below...

>> 1) an is_create flag was added to do_md_run to tell bitmap_create
>> whether we are creating or just assembling the array -- this is
>> necessary since 0.90 superblocks do not have a UUID until one is
>> generated randomly at array creation time, therefore, we must set the
>> bitmap UUID equal to this newly generated array UUID when the array is
>> created

> I think this is the wrong approach. You are justifying a design error
> by reference to a previous design error.

I agree, I don't like it either, but there is no clean solution that I
can think of...

> I think user-space should be completely responsible for creating the
> bitmap file including setting the UUID.
> Either
>  1/ add the bitmap after the array has been created.

We can't do this because the initial resync would have started.
> or > 2/ Create the array in user-space and just get the kernel to > assemble it (this is what I will almost certainly do in mdadm > once I get around to supporting version 1 superblocks). I'd gladly support version 1 superblocks, but currently the tools and kernel support for them are not complete. So in order to have working code, right now, unfortunately, my hack is a necessary evil. If there's a cleaner way to make this work, I'm certainly open to suggestions. >>3) code was added to mdadm to allow creation of arrays with >>non-persistent superblocks (also, device size calculation with >>non-persistent superblocks was fixed) >> >>4) a fix was made to the hot_remove code to allow a faulty device to be >>removed >> >>5) various typo and minor bug fixes were also included in the patches > > > > please, Please, PLEASE, keep separate patches separate. It makes them > much easier to review, and hence makes acceptance much more likely. Yes, I should have cleaned the patch up a bit...sorry about that... > In particular, your fix in 4 is, I think, wrong. I agree that there > is a problem here. I don't think your fix is correct. Could you explain what's wrong with it? I'll be glad to alter the patch if there's a better way to fix this problem. > > But, onto the patch. > > > 1/ Why bitmap_alloc_page instead of just kmalloc? That was part of Peter's original patch and I never bothered to change it. Admittedly, there probably is not a valid reason for having the cache. I'll remove it and just use kmalloc. If we find we need it later, it can be added back... > If every kernel subsystem kept its own private cache of memory > in case of desperate need, then there would be a lot of memory > wastage. Unless you have evidence that times when you need to > allocate a bitmap are frequently times when there is excessive > memory pressure, I think you should just trust kmalloc. 
> On the other hand, if you have reason to believe that the services > of kmalloc are substantially suboptimal for your needs, you should > explain why in a comment. > > 2/There is a race in bitmap_checkpage. > I assume that the required postcondition is that (if the arguments > are valid) either bitmap->bp[page].hijacked or bitmap->bp[page].map > but not both. If this is the case, then > > + if ((mappage = bitmap_alloc_page(bitmap)) == NULL) { > + PRINTK("%s: bitmap map page allocation failed, hijacking\n", > + bmname(bitmap)); > + /* failed - set the hijacked flag so that we can use the > + * pointer as a counter */ > + spin_lock_irqsave(&bitmap->lock, flags); > + bitmap->bp[page].hijacked = 1; > + goto out_unlock; > + } > > should become: > > + if ((mappage = bitmap_alloc_page(bitmap)) == NULL) { > + PRINTK("%s: bitmap map page allocation failed, hijacking\n", > + bmname(bitmap)); > + /* failed - set the hijacked flag so that we can use the > + * pointer as a counter */ > + spin_lock_irqsave(&bitmap->lock, flags); > + if (!bitmap->bp[page].map) bitmap->bp[page].hijacked = 1; > + goto out_unlock; > + } > > as someone else could have allocated a bitmap while we were trying > and failing. Yes, I'll fix that. > > > 3/ Your bmap_page / sync_page is very filesystem-specific. > > bmap_page sets page->private to a linked-list of buffers. > However page->private "belongs" to the address-space that the page > is in, which means the filesystem. Yes, you're right. > I don't know if any filesystems use page->private for anything > other than a list of buffers, but they certainly could if they > wanted to, and if they did, this code would suddenly break. XFS uses page->private differently... > > I think that if you have your heart set on being able to store the > bitmap in a file, that using a loop-back mount would be easiest. > But if you don't want to do that, at least use the correct > interface. Referring to the code in loop.c would help. 
We could not do the prepare_write/commit_write (as loop does) because of the current->journal_info limitation in jbd/ext3 (i.e., a single process cannot have two jbd transactions ongoing at a time, even though the two transactions are for different filesystems). In order to work around that limitation, we would have had to create a separate thread to do the bitmap writes, which is complex and probably too slow to be an acceptable solution. I now agree with you that (due to the aforementioned limitations) bitmap files are not going to work. I think, at least for now, we'll go with a bitmap located on a device at a certain offset. So, for a bitmap located on the md partition itself, we specify an offset of sb_offset+4096. For a standalone bitmap device/partition (or loopback mount of a file, as you suggested) we give a 0 offset. > Another alternative would be to use the approach that swapfile uses. > i.e. create a list of extents using bmap information, and then do > direct IO to the device using this extent information. > swapfile chooses to ignore any extent that is less than a page. > You might not want to do that, but you wouldn't have to. > > > 4/ I would avoid extending the file in the kernel. It is too easy > to do that in user space. Just require that the file is the > correct size. OK. That's easy enough to change. > 5/ I find it very confusing that you kmap a page, and then leave it > to some function that you call to kunmap the page (unmap_put_page > or sync_put_page). It looks very unbalanced. I would much > rather see the kunmap in the same function as the kmap. > > 6/ It seems odd that bitmap_set_state calls unmap_put_page instead > of sync_put_page. Surely you want to sync the superblock at this > point. > 7/ The existence of bitmap_get_state confuses me. Why not store the > state in the bitmap structure in bitmap_read_sb?? > > 8/ calling md_force_spare(.., 1) to force a sync is definitely > wrong. 
You appear to be assuming a raid1 array with exactly two > devices. Can't you just set recovery_cp to 0, or maybe just set > a flag somewhere and test it in md_check_recovery?? This is code that was necessary in 2.4, where it was harder to trigger a resync. I think this can be cleaned up for 2.6. > 9/ why don't you just pass "%s_bitmap" as the thread name to > md_register_thread ? As far as I can tell, it would have exactly > the same effect as the current code without requiring a kmalloc. OK > 10/ > +static void bitmap_stop_daemon(struct bitmap *bitmap) > +{ > + mdk_thread_t *daemon; > + unsigned long flags; > + > + spin_lock_irqsave(&bitmap->lock, flags); > + if (!bitmap->daemon) { > + spin_unlock_irqrestore(&bitmap->lock, flags); > + return; > + } > + daemon = bitmap->daemon; > + bitmap->daemon = NULL; > + spin_unlock_irqrestore(&bitmap->lock, flags); > + md_unregister_thread(daemon); /* destroy the thread */ > +} > > would look better as: > > +static void bitmap_stop_daemon(struct bitmap *bitmap) > +{ > + mdk_thread_t *daemon; > + unsigned long flags; > + > + spin_lock_irqsave(&bitmap->lock, flags); > + daemon = bitmap->daemon; > + bitmap->daemon = NULL; > + spin_unlock_irqrestore(&bitmap->lock, flags); > + if (daemon) > + md_unregister_thread(daemon); /* destroy the thread */ > +} OK > > 11/ md_update_sb needs to be called with the mddev locked, and I > think there are times when you call it without the lock. > I would prefer it if you left it 'static' and just set the > sb_dirty flag. raid[156]d will then update it soon > enough. I think that will be OK, as long as it doesn't open up a window for things to get into an inconsistent state if there's a crash. > 12/ The tests in hot_add_disk to see if the newly added device is > sufficiently up-to-date that the bitmap can be used to get it > the rest of the way up-to-date must be wrong as they don't > contain any reference to 'events'. 
> You presumably want to be able to fail/remove a drive and then > re-add it and not require a full resync. > For this to work, you need to record an event number when the > bitmap switches from "which blocks on active drives are not > in-sync" to "which blocks active drives have changed since a > drive failed", and when you incorporate a new device, only > allow it to be synced based on the bitmap if the event counter > is at least as new as the saved one (plus the checks you > currently have). Yes, the current code assumes only two partitions, and thus does not do this extra checking. I'll look at adding that. > 13/ The test > + } else if (atomic_read(&mddev->bitmap_file->f_count) > 2) { > in set_bitmap_file is half-hearted at best. > What you probably want is "deny_write_access". I'll check that out. > Or just check that the file is owned by root and isn't world > writable. > > The check against the uuid in the header should be enough to > ensure that operator error doesn't result in the one file > being used for two arrays. > > 14/ In md_do_sync, I think you should move "cond_resched()" above > > @@ -3312,6 +3506,10 @@ static void md_do_sync(mddev_t *mddev) > goto out; > } > > + /* don't worry about speed limits unless we do I/O */ > + if (!need_sync) > + continue; > + > /* > * this loop exits only if either when we are slower than > * the 'hard' speed limit, or the system was IO-idle for > > to make sure that mdX_sync doesn't steal too much time before > allowing a reschedule. I'll look at doing this. > > and the worst part about it is that the code doesn't support what I > would think would be the most widely used and hence most useful case, > and that is to store the bitmap in the 60K of space after the > superblock. As mentioned above, the next patch will support this configuration (as well as standalone bitmap devices/partitions). Hopefully, the next patch will be more to your liking (and much smaller, too...) 
-- Paul ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes 2004-04-29 8:41 ` Neil Brown 2004-05-04 20:08 ` Paul Clements @ 2004-06-08 20:53 ` Paul Clements 2004-06-08 22:47 ` Neil Brown ` (2 more replies) 1 sibling, 3 replies; 34+ messages in thread From: Paul Clements @ 2004-06-08 20:53 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley Neil, Here's the latest patch...it supports bitmaps in files as well as block devices (disks or partitions), contrary to what I had stated in my previous e-mail. I've tried to address all the issues you've pointed out, and generally cleaned up and fixed the patch some more...details below... Patches available at: bitmap patch for 2.6.6: http://dsl2.external.hp.com/~jejb/md_bitmap/md_bitmap_2_32_2_6_6.diff mdadm patch against v1.6.0: http://dsl2.external.hp.com/~jejb/md_bitmap/mdadm_1_6_0-bitmap.diff (the normal parisc-linux.org URLs are not working right now, for some reason...) Thanks again, Paul Neil Brown wrote: > On Wednesday March 31, Paul.Clements@SteelEye.com wrote: > and less than a month later I replied .... I've been busy and had > two weeks leave in there. Sorry. > > >>Create a bitmap file: >>-------------------- >> >># mdadm --create-bitmap=65536,3,580480 /tmp/bitmap --force >> > > > Maybe: > mdadm --create-bitmap --chunksize=64 --delay=3 --size=580480 > ??? > while more verbose, it is also easier to remember and more in keeping > with the rest of mdadm's syntax. OK, this is done. I reused the existing --chunk, --delay, and --size options. The bitmap "file" can be either a file (e.g., /tmp/my_bitmap) or a block device (e.g., /dev/sdd10). 
>>1) an is_create flag was added to do_md_run to tell bitmap_create >>whether we are creating or just assembling the array -- this is >>necessary since 0.90 superblocks do not have a UUID until one is >>generated randomly at array creation time, therefore, we must set the >>bitmap UUID equal to this newly generated array UUID when the array is >>created > > > I think this is the wrong approach. You are justifying a design error > by reference to a previous design error. > I think user-space should be completely responsible for creating the > bitmap file including setting the UUID. > Either > 1/ add the bitmap after the array has been created. > or > 2/ Create the array in user-space and just get the kernel to > assemble it (this is what I will almost certainly do in mdadm > once I get around to supporting version 1 superblocks). I could not find another way to make this work with the existing code, so this remains as is. >>3) code was added to mdadm to allow creation of arrays with >>non-persistent superblocks (also, device size calculation with >>non-persistent superblocks was fixed) >> >>4) a fix was made to the hot_remove code to allow a faulty device to be >>removed >> >>5) various typo and minor bug fixes were also included in the patches > > > > please, Please, PLEASE, keep separate patches separate. It makes them > much easier to review, and hence makes acceptance much more likely. I've removed all the other bug fixes, spelling corrections, etc. > In particular, your fix in 4 is, I think, wrong. I agree that there > is a problem here. I don't think your fix is correct. > > But, onto the patch. > > > 1/ Why bitmap_alloc_page instead of just kmalloc? > If every kernel subsystem kept its own private cache of memory > in case of desperate need, then there would be a lot of memory > wastage. 
Unless you have evidence that times when you need to > allocate a bitmap are frequently times when there is excessive > memory pressure, I think you should just trust kmalloc. > On the other hand, if you have reason to believe that the services > of kmalloc are substantially suboptimal for your needs, you should > explain why in a comment. This has been changed to simply kmalloc/kfree. > 2/There is a race in bitmap_checkpage. > I assume that the required postcondition is that (if the arguments > are valid) either bitmap->bp[page].hijacked or bitmap->bp[page].map > but not both. If this is the case, then > > + if ((mappage = bitmap_alloc_page(bitmap)) == NULL) { > + PRINTK("%s: bitmap map page allocation failed, hijacking\n", > + bmname(bitmap)); > + /* failed - set the hijacked flag so that we can use the > + * pointer as a counter */ > + spin_lock_irqsave(&bitmap->lock, flags); > + bitmap->bp[page].hijacked = 1; > + goto out_unlock; > + } > > should become: > > + if ((mappage = bitmap_alloc_page(bitmap)) == NULL) { > + PRINTK("%s: bitmap map page allocation failed, hijacking\n", > + bmname(bitmap)); > + /* failed - set the hijacked flag so that we can use the > + * pointer as a counter */ > + spin_lock_irqsave(&bitmap->lock, flags); > + if (!bitmap->bp[page].map) bitmap->bp[page].hijacked = 1; > + goto out_unlock; > + } > > as someone else could have allocated a bitmap while we were trying > and failing. This has been fixed. > 3/ Your bmap_page / sync_page is very filesystem-specific. > > bmap_page sets page->private to a linked-list of buffers. > However page->private "belongs" to the address-space that the page > is in, which means the filesystem. > I don't know if any filesystems use page->private for anything > other than a list of buffers, but they certainly could if they > wanted to, and if they did, this code would suddenly break. 
> > I think that if you have your heart set on being able to store the > bitmap in a file, that using a loop-back mount would be easiest. > But if you don't want to do that, at least use the correct > interface. Referring to the code in loop.c would help. I've basically rewritten the bitmap file read/write code to be exactly like loop.c. This includes the current limitation of needing an extra write thread in which to perform the bitmap file writes (again, due to the jbd current->journal_info problem). > Another alternative would be to use the approach that swapfile uses. > i.e. create a list of extents using bmap information, and then do > direct IO to the device using this extent information. > swapfile chooses to ignore any extent that is less than a page. > You might not want to do that, but you wouldn't have to. > > > 4/ I would avoid extending the file in the kernel. It is too easy > to do that in user space. Just require that the file is the > correct size. The file will no longer be extended in the kernel. > 5/ I find it very confusing that you kmap a page, and then leave it > to some function that you call to kunmap the page (unmap_put_page > or sync_put_page). It looks very unbalanced. I would much > rather see the kunmap in the same function as the kmap. This has been cleaned up and simplified. > 6/ It seems odd that bitmap_set_state calls unmap_put_page instead > of sync_put_page. Surely you want to sync the superblock at this > point. There's now an explicit bitmap_update_sb where needed. > 7/ The existence of bitmap_get_state confuses me. Why not store the > state in the bitmap structure in bitmap_read_sb?? bitmap_get_state is gone...the state now gets recorded in bitmap->flags and read from there. > 8/ calling md_force_spare(.., 1) to force a sync is definitely > wrong. You appear to be assuming a raid1 array with exactly two > devices. Can't you just set recovery_cp to 0, or maybe just set > a flag somewhere and test it in md_check_recovery?? 
This has been updated to the 2.6 style of kicking off recovery (setting MD_RECOVERY_NEEDED and waking up the recovery thread). > 9/ why don't you just pass "%s_bitmap" as the thread name to > md_register_thread ? As far as I can tell, it would have exactly > the same effect as the current code without requiring a kmalloc. Right, fixed. > 10/ > +static void bitmap_stop_daemon(struct bitmap *bitmap) > +{ > + mdk_thread_t *daemon; > + unsigned long flags; > + > + spin_lock_irqsave(&bitmap->lock, flags); > + if (!bitmap->daemon) { > + spin_unlock_irqrestore(&bitmap->lock, flags); > + return; > + } > + daemon = bitmap->daemon; > + bitmap->daemon = NULL; > + spin_unlock_irqrestore(&bitmap->lock, flags); > + md_unregister_thread(daemon); /* destroy the thread */ > +} > > would look better as: > > +static void bitmap_stop_daemon(struct bitmap *bitmap) > +{ > + mdk_thread_t *daemon; > + unsigned long flags; > + > + spin_lock_irqsave(&bitmap->lock, flags); > + daemon = bitmap->daemon; > + bitmap->daemon = NULL; > + spin_unlock_irqrestore(&bitmap->lock, flags); > + if (daemon) > + md_unregister_thread(daemon); /* destroy the thread */ > +} OK, changed. > 11/ md_update_sb needs to be called with the mddev locked, and I > think there are times when you call it without the lock. > I would prefer it if you left it 'static' and just set the > sb_dirty flag. raid[156]d will then update it soon > enough. Agreed. Actually, since the bitmap state is no longer in the md superblock, this is completely unneeded and has been removed. > 12/ The tests in hot_add_disk to see if the newly added device is > sufficiently up-to-date that the bitmap can be used to get it > the rest of the way up-to-date must be wrong as they don't > contain any reference to 'events'. > You presumably want to be able to fail/remove a drive and then > re-add it and not require a full resync. 
> For this to work, you need to record an event number when the > bitmap switches from "which blocks on active drives are not > in-sync" to "which blocks active drives have changed since a > drive failed", and when you incorporate a new device, only > allow it to be synced based on the bitmap if the event counter > is at least as new as the saved one (plus the checks you > currently have). Right. The extra check and event fields in the superblock have been added. > 13/ The test > + } else if (atomic_read(&mddev->bitmap_file->f_count) > 2) { > in set_bitmap_file is half-hearted at best. > What you probably want is "deny_write_access". I now use a slightly modified version of deny_write_access... > Or just check that the file is owned by root and isn't world > writable. > > The check against the uuid in the header should be enough to > ensure that operator error doesn't result in the one file > being used for two arrays. > > 14/ In md_do_sync, I think you should move "cond_resched()" above > > @@ -3312,6 +3506,10 @@ static void md_do_sync(mddev_t *mddev) > goto out; > } > > + /* don't worry about speed limits unless we do I/O */ > + if (!need_sync) > + continue; > + > /* > * this loop exits only if either when we are slower than > * the 'hard' speed limit, or the system was IO-idle for > > to make sure that mdX_sync doesn't steal too much time before > allowing a reschedule. Done. > and the worst part about it is that the code doesn't support what I > would think would be the most widely used and hence most useful case, > and that is to store the bitmap in the 60K of space after the > superblock. Unfortunately, this type of setup performs rather abysmally (generally, about a 5-10x slowdown in write performance). If you think about what is happening to the disk head, it becomes clear why. In fact, having the intent log anywhere on the same physical media as the array components gives very bad performance. 
For this reason, I have not taken extra steps to support this configuration. If anyone is curious, this type of setup can be tested using device mapper (but not loop, because loop does not have the correct sync semantics) to map that area of the disk as a separate device and use it as a bitmap. > If you had implemented that first, then you could have avoided all the > mucking about with files, which seems to be a problematic area, and > probably had working and accepted code by now. Then adding support > for separate files could have been layered on top. > That would have meant that each piece of submitted code was > substantially smaller and hence easier to review. > > > Anyway, enough for now. I'll try to review your next patch with a > little less delay, but it is a substantial effort, and that makes it > easy to put off. > > > > NeilBrown > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes 2004-06-08 20:53 ` Paul Clements @ 2004-06-08 22:47 ` Neil Brown 2004-06-14 23:39 ` Neil Brown 2004-06-15 6:27 ` Neil Brown 2 siblings, 0 replies; 34+ messages in thread From: Neil Brown @ 2004-06-08 22:47 UTC (permalink / raw) To: Paul Clements; +Cc: linux-raid, ptb, mingo, james.bottomley On Tuesday June 8, paul.clements@steeleye.com wrote: > Neil, > > Here's the latest patch...it supports bitmaps in files as well as block > devices (disks or partitions), contrary to what I had stated in my > previous e-mail. I've tried to address all the issues you've pointed > out, and generally cleaned up and fixed the patch some more...details > below... Thanks. I've penciled it into my diary to look through this on Tuesday 15th NeilBrown ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes 2004-06-08 20:53 ` Paul Clements 2004-06-08 22:47 ` Neil Brown @ 2004-06-14 23:39 ` Neil Brown 2004-06-14 23:59 ` James Bottomley 2004-06-15 6:27 ` Neil Brown 2 siblings, 1 reply; 34+ messages in thread From: Neil Brown @ 2004-06-14 23:39 UTC (permalink / raw) To: Paul Clements; +Cc: linux-raid, ptb, mingo, james.bottomley On Tuesday June 8, paul.clements@steeleye.com wrote: > Neil, > > Here's the latest patch...it supports bitmaps in files as well as block > devices (disks or partitions), contrary to what I had stated in my > previous e-mail. I've tried to address all the issues you've pointed > out, and generally cleaned up and fixed the patch some more...details > below... > > Patches available at: > > bitmap patch for 2.6.6: > http://dsl2.external.hp.com/~jejb/md_bitmap/md_bitmap_2_32_2_6_6.diff > > mdadm patch against v1.6.0: > http://dsl2.external.hp.com/~jejb/md_bitmap/mdadm_1_6_0-bitmap.diff > > (the normal parisc-linux.org URLs are not working right now, for some > reason...) Unfortunately, these dsl2 URLs aren't working today - I should have grabbed the patches earlier. I'll check again later in the day. > > >>1) an is_create flag was added to do_md_run to tell bitmap_create > >>whether we are creating or just assembling the array -- this is > >>necessary since 0.90 superblocks do not have a UUID until one is > >>generated randomly at array creation time, therefore, we must set the > >>bitmap UUID equal to this newly generated array UUID when the array is > >>created > > > > > > I think this is the wrong approach. You are justifying a design error > > by reference to a previous design error. > > I think user-space should be completely responsible for creating the > > bitmap file including setting the UUID. > > Either > > 1/ add the bitmap after the array has been created. 
> > or > > 2/ Create the array in user-space and just get the kernel to > > assemble it (this is what I will almost certainly do in mdadm > > once I get around to supporting version 1 superblocks). > > I could not find another way to make this work with the existing code, > so this remains as is. I guess that means you (or someone) needs to write some code. Creating the array, including allocating the UUID, in userspace is trivial. Just make some superblocks and write them into the right location in each device. Then assemble the array as you would any pre-existing array. > > > and the worst part about it is that the code doesn't support what I > > would think would be the most widely used and hence most useful case, > > and that is to store the bitmap in the 60K of space after the > > superblock. > > Unfortunately, this type of setup performs rather abysmally (generally, > about a 5-10x slowdown in write performance). If you think about what is > happening to the disk head, it becomes clear why. In fact, having the > intent log anywhere on the same physical media as the array components > gives very bad performance. For this reason, I have not taken extra > steps to support this configuration. If anyone is curious, this type of > setup can be tested using device mapper (but not loop, because loop does > not have the correct sync semantics) to map that area of the disk as a > separate device and use it as a bitmap. Ahhh... this explains a comment you made some time ago where I thought performance would be pretty poor and you said it wasn't. I was imagining the bitmap in the same device as the array and you weren't. I think I mentioned at the time that this could be addressed by using "plugging". When a write request arrives for a block that isn't flagged as "dirty", we flag it 'dirty' and put the request on a queue. When an "unplug" request is made, we flush all updates to the "dirty" bitmap, and then release the queued write requests. 
There could still be a noticeable performance hit, but it should be substantially less than with the current code. Everything else you mentioned sounds good. When I manage to get a copy of the patch I'll comment further. NeilBrown ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes 2004-06-14 23:39 ` Neil Brown @ 2004-06-14 23:59 ` James Bottomley 0 siblings, 0 replies; 34+ messages in thread From: James Bottomley @ 2004-06-14 23:59 UTC (permalink / raw) To: Neil Brown; +Cc: Paul Clements, linux-raid, ptb, mingo On Mon, 2004-06-14 at 19:39, Neil Brown wrote: > Unfortunately, these dsl2 URLs aren't working today - I should have > grabbed the patches earlier. I'll check again later in the day. I'm sorry, that's my fault. dsl2 is being decommissioned ... it was a crappy x86 box, we're replacing it with a parisc one. The correct URL should have www.parisc-linux.org in place of dsl2.external.hp.com James ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes 2004-06-08 20:53 ` Paul Clements 2004-06-08 22:47 ` Neil Brown 2004-06-14 23:39 ` Neil Brown @ 2004-06-15 6:27 ` Neil Brown 2004-06-17 17:57 ` Paul Clements ` (2 more replies) 2 siblings, 3 replies; 34+ messages in thread From: Neil Brown @ 2004-06-15 6:27 UTC (permalink / raw) To: Paul Clements; +Cc: linux-raid, ptb, mingo, james.bottomley On Tuesday June 8, paul.clements@steeleye.com wrote: > Neil, > > Here's the latest patch...it supports bitmaps in files as well as block > devices (disks or partitions), contrary to what I had stated in my > previous e-mail. I've tried to address all the issues you've pointed > out, and generally cleaned up and fixed the patch some more...details > below... This looks a lot better. There are a few little issues that I noticed on a quick read through: - last_block_device should return "sector_t", not "unsigned long". It would be worth checking for other places where sector_t might be needed. - bitmap_checkpage still isn't quite safe. If "hijacked" gets set while it is allocating a page (unlikely, but possible), it will exit with both hijacked set and an allocated page, which isn't right. - The event comparison + sb->events_hi >= refsb->bitmap_events_hi && + sb->events_lo >= refsb->bitmap_events_lo) { in hot_add_disk is wrong. I'll try to make time to try the patch out in a week or so to get a better feel for it. The changes to mdadm need a bit of work. You have added "--persistent" and "--non-persistent" flags for --create. This is wrong. --create always uses persistent superblocks. --build makes arrays without persistent superblocks. I don't think I like --create-bitmap. A bitmap file should always be created in the context of a particular array (partly so that the size and uuid can be set correctly). I think I would like the bitmap to be specified as a "--bitmap=filename" option to --create, --build, or --grow. 
I haven't thought this through in great detail yet, but that is my leaning. --examine-bitmap is a bit of a pain too. I think I would like --examine to figure out what it has been given to look at, and report on whatever it finds. NeilBrown ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes 2004-06-15 6:27 ` Neil Brown @ 2004-06-17 17:57 ` Paul Clements 2004-06-18 20:48 ` Paul Clements 2004-06-23 21:48 ` Paul Clements 2 siblings, 0 replies; 34+ messages in thread From: Paul Clements @ 2004-06-17 17:57 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley Neil Brown wrote: > The changes to mdadm need a bit of work. > > You have added "--persistent" and "--non-persistent" flags for > --create. This is wrong. > --create always uses persistent superblocks. > --build makes arrays without persistent superblocks. Yes, I think --build didn't use to work for raid1, though, which is why I went the create route. I'll fix this. > I don't think I like --create-bitmap. A bitmap file should always be > created in the context of a particular array (partly so that the size > and uuid can be set correctly). I think I would like the bitmap to be > specified as a "--bitmap=filename" option to --create, --build, or > --grow. I haven't thought this through in great detail yet, but that > is my leaning. That all sounds fine except for --build. How do we know if we're doing the initial build or just a rebuild? We need to know this so we know whether to initialize the bitmap or simply use it. So I think we still need --create-bitmap for the build command (and then build will just use the bitmap, not initialize it). For --create I can make it so that the bitmap is always initialized using the array parameters, so no separate --create-bitmap will be needed. I'll need to work out the conflict that this will create with the --chunk option, though, since --chunk is the array chunk size for --create and --chunk is the bitmap chunk size for --create-bitmap. Any suggestions? > --examine-bitmap is a bit of a pain too. -X is usually what I use...that's a lot easier to remember and type... 
> I think I would like > --examine to figure out what it has been given to look at, and report > on whatever it finds. Well, that gets into a little bit of black magic when you're talking about examining a disk or partition, which could conceivably contain both an md superblock at the end and a bitmap at the front. How do we know which one it is "supposed" to be? Or do you just want examine to print information for both? That would seem a little confusing to me (granted this would only happen when a disk changed roles between bitmap and array component). -- Paul ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-06-15  6:27 ` Neil Brown
  2004-06-17 17:57 ` Paul Clements
@ 2004-06-18 20:48 ` Paul Clements
  2004-06-23 21:48 ` Paul Clements
  2 siblings, 0 replies; 34+ messages in thread
From: Paul Clements @ 2004-06-18 20:48 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley

Neil, responses to your feedback on the kernel code:

Neil Brown wrote:
> On Tuesday June 8, paul.clements@steeleye.com wrote:
> This looks a lot better. There are a few little issues that I noticed
> on a quick read through:
>  - last_block_device should return "sector_t", not "unsigned long".
>    It would be worth checking for other places that sector_t might
>    be needed.

Yes, there were a couple. I've fixed them now.

>  - bitmap_checkpage still isn't quite safe. If "hijacked" gets set
>    while it is allocating a page (unlikely, but possible), it will
>    exit with both hijacked set and an allocated page, which isn't
>    right.

Yes, that's right. Fixed.

>  - The event comparison
>    +	    sb->events_hi >= refsb->bitmap_events_hi &&
>    +	    sb->events_lo >= refsb->bitmap_events_lo) {
>    in hot_add_disk is wrong.

I think it's OK. Basically, I only record bitmap events while the array
is in sync. As soon as the array goes out of sync, we continue to set
bits in the bitmap, but never clear them. This means that the bitmap is
valid for resyncing any disk that was part of the array at the time it
was last in sync. Does that make sense, or am I missing a corner case?

> I'll try to make time to try the patch out in a week or so to get a
> better feel for it.

OK, that sounds great.

Thanks,
Paul

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-06-15  6:27 ` Neil Brown
  2004-06-17 17:57 ` Paul Clements
  2004-06-18 20:48 ` Paul Clements
@ 2004-06-23 21:48 ` Paul Clements
  2004-06-23 21:50 ` Paul Clements
  ` (2 more replies)
  2 siblings, 3 replies; 34+ messages in thread
From: Paul Clements @ 2004-06-23 21:48 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley

Hi Neil,

Here's the next round of patches:

kernel patch against 2.6.7:
--------------------------
http://www.parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_32_2_6_7.diff

mdadm patch against 1.6.0:
-------------------------
http://www.parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_32_2_6_7.diff

Details below...

Neil Brown wrote:
> On Tuesday June 8, paul.clements@steeleye.com wrote:
> This looks a lot better. There are a few little issues that I noticed
> on a quick read through:
>  - last_block_device should return "sector_t", not "unsigned long".
>    It would be worth checking for other places that sector_t might
>    be needed.

Fixed.

>  - bitmap_checkpage still isn't quite safe. If "hijacked" gets set
>    while it is allocating a page (unlikely, but possible), it will
>    exit with both hijacked set and an allocated page, which isn't
>    right.

Fixed.

>  - The event comparison
>    +	    sb->events_hi >= refsb->bitmap_events_hi &&
>    +	    sb->events_lo >= refsb->bitmap_events_lo) {
>    in hot_add_disk is wrong.

This is still there, see my explanation in the previous email...

> The changes to mdadm need a bit of work.
>
> You have added "--persistent" and "--non-persistent" flags for
> --create. This is wrong.
> --create always uses persistent superblocks.
> --build makes arrays without persistent superblocks.

OK, changed. The --build command now also accepts the --bitmap option.

> I don't think I like --create-bitmap. A bitmap file should always be
> created in the context of a particular array (partly so that the size
> and uuid can be set correctly). I think I would like the bitmap to be
> specified as a "--bitmap=filename" option to --create, --build, or
> --grow. I haven't thought this through in great detail yet, but that
> is my leaning.

OK, --create-bitmap remains solely for the --build case, but is no
longer needed for --create. The --create command now does all the array
creation in userspace and simply Assembles the resulting collection of
disks. It also initialises the bitmap correctly for use with the array
(i.e., uuid and size match).

> --examine-bitmap is a bit of a pain too. I think I would like
> --examine to figure out what it has been given to look at, and report
> on whatever it finds.

This also remains as-is for now, pending your reply to my previous
emails.

Thanks,
Paul

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes 2004-06-23 21:48 ` Paul Clements @ 2004-06-23 21:50 ` Paul Clements 2004-07-06 14:52 ` Paul Clements [not found] ` <40F7E50F.2040308@steeleye.com> 2 siblings, 0 replies; 34+ messages in thread From: Paul Clements @ 2004-06-23 21:50 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley Paul Clements wrote: > Hi Neil, > > Here's the next round of patches: [snip] > mdadm patch against 1.6.0: > ------------------------- > http://www.parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_32_2_6_7.diff err, this should be: http://www.parisc-linux.org/~jejb/md_bitmap/mdadm_1_6_0-bitmap-2.diff ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes 2004-06-23 21:48 ` Paul Clements 2004-06-23 21:50 ` Paul Clements @ 2004-07-06 14:52 ` Paul Clements [not found] ` <40F7E50F.2040308@steeleye.com> 2 siblings, 0 replies; 34+ messages in thread From: Paul Clements @ 2004-07-06 14:52 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid, ptb, mingo, james.bottomley Paul Clements wrote: > Hi Neil, > > Here's the next round of patches: > > kernel patch against 2.6.7: > -------------------------- > http://www.parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_32_2_6_7.diff A quick note for anyone who is interested: this patch misses the poolinfo structure changes that were made to raid1 between 2.6.6 and 2.6.7. Unfortunately, the mempool interfaces use casts from void *, so this wasn't caught by the compiler. Trivial changes need to be made to r1bio_pool_free for the code to work with 2.6.7. Thanks, Paul ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <40F7E50F.2040308@steeleye.com>]
[parent not found: <16649.61212.310271.36561@cse.unsw.edu.au>]
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  [not found] ` <16649.61212.310271.36561@cse.unsw.edu.au>
@ 2004-08-10 21:37 ` Paul Clements
  2004-08-13  3:04 ` Neil Brown
  0 siblings, 1 reply; 34+ messages in thread
From: Paul Clements @ 2004-08-10 21:37 UTC (permalink / raw)
To: Neil Brown; +Cc: jejb, linux-raid

Neil,

I've implemented the improvements that you've suggested below, and
preliminary tests are showing some very good results!

The latest patches are available at:

http://www.parisc-linux.org/~jejb/md_bitmap/

Further details below...

Thanks,
Paul

Neil Brown wrote:
> It's looking a lot better. I can start focussing on deeper issues
> now.
>
> It really bothers me that all the updates are synchronous - that isn't
> a good approach for getting good performance.

Yes, that's quite true. In fact, in my simple write performance tests I
used to see a slowdown of around 30% with the bitmap file...now, the
difference is not even measurable! (This is with the bitmap file located
on a dedicated disk, in order to reduce disk head movement, which tends
to degrade performance.)

> This is how I think it should happen:
>
> When a write request arrives for a block where the corresponding bit
> is already set on disk, continue with the request (as currently
> happens).
>
> When a write request arrives for a block where the corresponding bit
> is *not* set on disk, set the bit in ram (if not already set), queue
> the write request for later handling, and queue the bitmap block to
> be written. Also mark the queue as "plugged".
>
> When an unplug request comes, send all queued bitmap blocks to disk,
> then wait for them all to complete, then send all queued raid write
> requests to disk.
>
> When a write request completes, decrement the corresponding
> counter but don't clear the "shadow" bit if the count hits zero.
> Instead flag the block as "might have unsynced-zeros".
>
> The write-back thread slowly walks around the bitmap looking for
> blocks which might have an unsynced zero. They are checked to see
> if they still do. If they do, the disk-bit is cleared and the
> disk-block is queued for writing.
>
> There might need to be a couple of refinements to this, but I think it
> is the right starting point.

I've implemented more or less what you've described above in this latest
patch.

> With this approach you wouldn't need the "bitmap_update" in r1bio_s as
> the write-back daemon doesn't have a list of things to do, but instead
> it periodically scans to see what needs to be done.

Yes, getting rid of the bitmap_update structure was a very good idea.
The code is much cleaner without that...

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-08-10 21:37 ` Paul Clements
@ 2004-08-13  3:04 ` Neil Brown
  2004-09-21  3:28 ` Paul Clements
  0 siblings, 1 reply; 34+ messages in thread
From: Neil Brown @ 2004-08-13 3:04 UTC (permalink / raw)
To: Paul Clements; +Cc: jejb, linux-raid

On Tuesday August 10, paul.clements@steeleye.com wrote:
> Neil,
>
> I've implemented the improvements that you've suggested below, and
> preliminary tests are showing some very good results!
>
> The latest patches are available at:
>
> http://www.parisc-linux.org/~jejb/md_bitmap/
>

This is definitely heading in the right direction. However I think you
have missed some of the subtleties of plugging.

Plugging should be seen as entirely the responsibility of each device.
Each device will plug its own queue if or when it thinks that is a good
idea, and will possibly unplug it spontaneously. Thus if you ever call
generic_make_request, you have to assume that the request could get
serviced at any moment.

The only interest that clients of a device have in plugging is that
they may request the device to unplug, in case it still has some old
requests in a plugged queue.

When a raid1 write request comes, your code passes that request to
generic_make_request, and schedules a bitmap update for later.
Presumably you assume that the underlying devices are or will-be
plugged and only get unplugged when ->unplug_fn is called. However this
isn't true.

What you need to do is:

In the WRITE branch of make_request, schedule the bitmap update as you
currently do, but *don't* call generic_make_request on the bio.
Instead, you should add the bio to a queue using ->bi_next for linkage.

Then in raid1_unplug, you should:
  Call bitmap_unplug (this waits for the bitmaps to be safe).
  Walk through the queue of bios and submit them with
  generic_make_request. As the bit(s) will already be marked dirty,
  the requests will go straight through.
  Finally call unplug_slaves.

That should then have the bitmap updates happening before the data
updates, and should cluster bitmap updates better.

The next step would be to stop write_page from being synchronous - I'm
not comfortable about ->unplug_fn having to wait for disk io.

This looks to be quite awkward as there is no call-back that happens
when writeback on a page is completed. You need to have a kernel thread
to wait for the pages to complete. I would get the bitmap_daemon to do
this.

write_page puts the page on a queue and wakes up bitmap_write_daemon.

bitmap_write_daemon calls prepare_write/commit_write and puts the page
on another queue and wakes up bitmap_daemon.

bitmap_daemon, whenever it is woken up:
 - checks if any clean bitmaps need to be written and puts them on the
   queue for bitmap_write_daemon
 - checks if there is a page in the queue from bitmap_write_daemon and,
   if there is, waits for it (wait_on_page_writeback). When that
   completes, puts the page and decreases a count, and goes round the
   loop.
 - if the count of pending bitmap updates hits zero, bitmap_daemon
   walks the queue of pending write requests of the array (as I
   suggested above could be done in raid1_unplug) and submits them.

Does that make sense?

I would suggest doing the first version first (leaving write_page
synchronous), and then have a go at making write_page asynchronous.

NeilBrown

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes 2004-08-13 3:04 ` Neil Brown @ 2004-09-21 3:28 ` Paul Clements 2004-09-21 19:19 ` Paul Clements 0 siblings, 1 reply; 34+ messages in thread From: Paul Clements @ 2004-09-21 3:28 UTC (permalink / raw) To: Neil Brown; +Cc: jejb, linux-raid OK, here we go again. The patch has been updated to 2.6.9-rc2 and mdadm 1.7.0, and I've implemented the suggestions you've made below... Patches: http://parisc-linux.org/~jejb/md_bitmap/ Thanks, Paul ---- Neil Brown wrote: > This is definitely heading in the right direction. However I think > you have missed some of the subtleties of plugging. [snip] > That should then have the bitmap updates happening before the > data updates, and should cluster bitmap updates better. OK, I've fixed that. > The next step would be to stop write_page from being synchronous - I'm > not comfortable about ->unplug-fn having to wait for disk io. Well, it still does have to wait for the bitmap writes to complete before letting the regular writes go, but I see your point about the writes being submitted simultaneously (nearly) rather than being completely serialized. > This looks to be quite awkward as there is not call-back that happens > when writeback on a page is completed. You need to have a kernel > thread to wait for the pages to complete. I would get the > bitmap_daemon to do this. OK, this is what I've done...there didn't seem to be a better place to do it. > Does that make sense? Certainly does, and seems to work... ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes 2004-09-21 3:28 ` Paul Clements @ 2004-09-21 19:19 ` Paul Clements 2004-10-12 2:15 ` Neil Brown 0 siblings, 1 reply; 34+ messages in thread From: Paul Clements @ 2004-09-21 19:19 UTC (permalink / raw) To: Neil Brown; +Cc: jejb, linux-raid Paul Clements wrote: > OK, here we go again. The patch has been updated to 2.6.9-rc2 and mdadm > 1.7.0, and I've implemented the suggestions you've made below... > > Patches: http://parisc-linux.org/~jejb/md_bitmap/ Well, I ran some simple write performance benchmarks today and I was stunned to find the bitmap performance was terrible. After debugging a while, I realized the problem was having the bitmap daemon do the wait_on_page_writeback...if the daemon got caught at the wrong time, it could take quite a while to come back around and do the writeback. So, I added yet another thread, the bitmap writeback daemon (whose sole purpose is to wait for page writebacks) and now things are running excellently. My simple benchmarks, which previously showed about a 25-30% slowdown when using a bitmap vs. not using one, now show only a 4% slowdown, which is pretty close to the margin of error in the test itself. Check out the new patch here: http://parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_37_2_6_9_rc2.diff Thanks again, Paul > Neil Brown wrote: > >> This is definitely heading in the right direction. However I think >> you have missed some of the subtleties of plugging. > > > [snip] > >> That should then have the bitmap updates happening before the >> data updates, and should cluster bitmap updates better. > > > OK, I've fixed that. > >> The next step would be to stop write_page from being synchronous - I'm >> not comfortable about ->unplug-fn having to wait for disk io. 
> > > Well, it still does have to wait for the bitmap writes to complete > before letting the regular writes go, but I see your point about the > writes being submitted simultaneously (nearly) rather than being > completely serialized. > >> This looks to be quite awkward as there is not call-back that happens >> when writeback on a page is completed. You need to have a kernel >> thread to wait for the pages to complete. I would get the >> bitmap_daemon to do this. > > > OK, this is what I've done...there didn't seem to be a better place to > do it. > >> Does that make sense? > > > Certainly does, and seems to work... > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-09-21 19:19 ` Paul Clements
@ 2004-10-12  2:15 ` Neil Brown
  2004-10-12 14:06 ` Paul Clements
  2004-11-10  0:37 ` md: persistent (file-backed) bitmap Neil Brown
  0 siblings, 2 replies; 34+ messages in thread
From: Neil Brown @ 2004-10-12 2:15 UTC (permalink / raw)
To: Paul Clements; +Cc: jejb, linux-raid

On Tuesday September 21, paul.clements@steeleye.com wrote:
> Paul Clements wrote:
> > OK, here we go again. The patch has been updated to 2.6.9-rc2 and
> > mdadm 1.7.0, and I've implemented the suggestions you've made
> > below...
> >
> > Patches: http://parisc-linux.org/~jejb/md_bitmap/
>
> Well, I ran some simple write performance benchmarks today and I was
> stunned to find the bitmap performance was terrible. After debugging a
> while, I realized the problem was having the bitmap daemon do the
> wait_on_page_writeback...if the daemon got caught at the wrong time,
> it could take quite a while to come back around and do the writeback.
> So, I added yet another thread, the bitmap writeback daemon (whose
> sole purpose is to wait for page writebacks) and now things are
> running excellently. My simple benchmarks, which previously showed
> about a 25-30% slowdown when using a bitmap vs. not using one, now
> show only a 4% slowdown, which is pretty close to the margin of error
> in the test itself. Check out the new patch here:
>
> http://parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_37_2_6_9_rc2.diff
>
> Thanks again,
> Paul
>

Further comments.

bitmap_events
 1/ You have inserted bitmap_events_hi/lo *before* recovery_cp, thus
    moving recovery_cp, and thus breaking backwards compatibility.
 2/ The test in hot_add_disk:
	+	if (refsb && sb && uuid_equal(sb, refsb) &&
	+	    sb->events_hi >= refsb->bitmap_events_hi &&
	+	    sb->events_lo >= refsb->bitmap_events_lo) {
	+		bitmap_invalidate = 0;
    is wrong. The events count must be compared as a 64-bit number,
    i.e. it is only meaningful to compare events_lo if both events_hi
    are equal.

pending_bio_list
 1/ Do you really need a separate pending_bio_lock, or would the
    current device_lock be adequate to the task?
 2/ I think there can be a race with new requests being added to this
    list while bitmap_unplug is running in unplug_slaves. I think you
    should "bio_list_get" before calling bitmap_unplug, so that you
    only then submit requests that were made definitely *before* the
    call to bitmap_unplug. This would have the added advantage that
    you don't need to keep claiming and dropping pending_bio_lock.

random damage
 1/ You have changed r1bio_pool_free without it being a useful change.
 2/ You have removed "static" from sync_page_io for no apparent
    reason.
 3/ You declare and set start_sector_nr, and never use it.

I have removed the "random damage" bit and merged that patch into my
collection. I would really appreciate it if any further changes could
be sent as incremental patches, as reading through a large patch like
that takes a fair bit of effort.

I'll hopefully be finishing up some modifications to mdadm this week.
I'll aim to include support for bitmaps as part of that (based on your
patch, but probably with a number of changes to match changes in
mdadm).

NeilBrown

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-10-12  2:15 ` Neil Brown
@ 2004-10-12 14:06 ` Paul Clements
  2004-10-12 21:16 ` Paul Clements
  2004-11-10  0:37 ` md: persistent (file-backed) bitmap Neil Brown
  1 sibling, 1 reply; 34+ messages in thread
From: Paul Clements @ 2004-10-12 14:06 UTC (permalink / raw)
To: Neil Brown; +Cc: jejb, linux-raid

Neil Brown wrote:
>> Paul Clements wrote:
>> itself. Check out the new patch here:
>>
>> http://parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_37_2_6_9_rc2.diff
>>
> Further comments.
>
> bitmap_events
>  1/ You have inserted bitmap_events_hi/lo *before* recovery_cp, thus
>     moving recovery_cp, and thus breaking backwards compatibility.

Yes. I guess when recovery_cp came along I failed to notice that...

>  2/ The test in hot_add_disk:
>	+	if (refsb && sb && uuid_equal(sb, refsb) &&
>	+	    sb->events_hi >= refsb->bitmap_events_hi &&
>	+	    sb->events_lo >= refsb->bitmap_events_lo) {
>	+		bitmap_invalidate = 0;
>     is wrong. The events count must be compared as a 64-bit number,
>     i.e. it is only meaningful to compare events_lo if both events_hi
>     are equal.

Yes, that is broken.

> pending_bio_list
>  1/ Do you really need a separate pending_bio_lock, or would the
>     current device_lock be adequate to the task?

Probably so...especially with the following change...

>  2/ I think there can be a race with new requests being added to this
>     list while bitmap_unplug is running in unplug_slaves. I think you
>     should "bio_list_get" before calling bitmap_unplug, so that you
>     only then submit requests that were made definitely *before* the
>     call to bitmap_unplug. This would have the added advantage that
>     you don't need to keep claiming and dropping pending_bio_lock.

Yes, that would make sense.

> collection. I would really appreciate it if any further changes could
> be sent as incremental patches, as reading through a large patch like
> that takes a fair bit of effort.

Yes, I certainly understand that. I appreciate all the effort you've put
into reviewing and giving suggestions. Further patches will be on top of
the current bitmap patch. I will send out an incremental patch to fix
the issues above, probably in a couple of days.

Thanks,
Paul

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes
  2004-10-12 14:06 ` Paul Clements
@ 2004-10-12 21:16 ` Paul Clements
  0 siblings, 0 replies; 34+ messages in thread
From: Paul Clements @ 2004-10-12 21:16 UTC (permalink / raw)
To: Neil Brown; +Cc: jejb, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1734 bytes --]

Neil,

patch to fix the issues mentioned below has been tested and is attached.
Should apply on top of 2.6.9-rc2 + md_bitmap.

--
Paul

Paul Clements wrote:
> Neil Brown wrote:
>>> Paul Clements wrote:
>>> itself. Check out the new patch here:
>>>
>>> http://parisc-linux.org/~jejb/md_bitmap/md_bitmap_2_37_2_6_9_rc2.diff
>>>
>> Further comments.
>>
>> bitmap_events
>>  1/ You have inserted bitmap_events_hi/lo *before* recovery_cp, thus
>>     moving recovery_cp, and thus breaking backwards compatibility.
>
> Yes. I guess when recovery_cp came along I failed to notice that...
>
>>  2/ The test in hot_add_disk:
>>	+	if (refsb && sb && uuid_equal(sb, refsb) &&
>>	+	    sb->events_hi >= refsb->bitmap_events_hi &&
>>	+	    sb->events_lo >= refsb->bitmap_events_lo) {
>>	+		bitmap_invalidate = 0;
>>     is wrong. The events count must be compared as a 64-bit number,
>>     i.e. it is only meaningful to compare events_lo if both
>>     events_hi are equal.
>
> Yes, that is broken.
>
>> pending_bio_list
>>  1/ Do you really need a separate pending_bio_lock, or would the
>>     current device_lock be adequate to the task?
>
> Probably so...especially with the following change...
>
>>  2/ I think there can be a race with new requests being added to
>>     this list while bitmap_unplug is running in unplug_slaves. I
>>     think you should "bio_list_get" before calling bitmap_unplug, so
>>     that you only then submit requests that were made definitely
>>     *before* the call to bitmap_unplug. This would have the added
>>     advantage that you don't need to keep claiming and dropping
>>     pending_bio_lock.
>
> Yes, that would make sense.
[-- Attachment #2: md_bitmap_bugfix_2_37_2_6_9_rc2.diff --]
[-- Type: text/plain, Size: 5644 bytes --]

diff -purN --exclude-from /export/public/clemep/tmp/dontdiff linux-2.6.9-rc2-BITMAP/drivers/md/md.c linux-2.6.9-rc2-BITMAP-NEW/drivers/md/md.c
--- linux-2.6.9-rc2-BITMAP/drivers/md/md.c	Tue Sep 14 14:11:07 2004
+++ linux-2.6.9-rc2-BITMAP-NEW/drivers/md/md.c	Tue Oct 12 12:23:13 2004
@@ -2375,8 +2375,9 @@ static int hot_add_disk(mddev_t * mddev,
 	if (rdev->sb_loaded)
 		sb = (mdp_super_t *)page_address(rdev->sb_page);
 	if (refsb && sb && uuid_equal(sb, refsb) &&
-	    sb->events_hi >= refsb->bitmap_events_hi &&
-	    sb->events_lo >= refsb->bitmap_events_lo) {
+	    (sb->events_hi > refsb->bitmap_events_hi ||
+	     (sb->events_hi == refsb->bitmap_events_hi &&
+	      sb->events_lo >= refsb->bitmap_events_lo))) {
 		bitmap_invalidate = 0;
 	} else if (!mddev->persistent) { /* assume bitmap is valid */
 		bitmap_invalidate = 0;
diff -purN --exclude-from /export/public/clemep/tmp/dontdiff linux-2.6.9-rc2-BITMAP/drivers/md/raid1.c linux-2.6.9-rc2-BITMAP-NEW/drivers/md/raid1.c
--- linux-2.6.9-rc2-BITMAP/drivers/md/raid1.c	Tue Sep 14 14:13:47 2004
+++ linux-2.6.9-rc2-BITMAP-NEW/drivers/md/raid1.c	Tue Oct 12 16:23:54 2004
@@ -455,18 +455,21 @@ static void unplug_slaves(mddev_t *mddev
 	struct bio *bio;
 	unsigned long flags;
 
+	/* pull writes off the pending queue and (later) submit them */
+	spin_lock_irqsave(&conf->device_lock, flags);
+	bio = bio_list_get(&conf->pending_bio_list);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+
 	/* flush any pending bitmap writes to disk before proceeding w/ I/O */
 	if (bitmap_unplug(mddev->bitmap) != 0)
 		printk("%s: bitmap file write failed!\n", mdname(mddev));
-	/* pull writes off the pending queue and submit them */
-	spin_lock_irqsave(&conf->pending_bio_lock, flags);
-	while ((bio = bio_list_pop(&conf->pending_bio_list))) {
-		spin_unlock_irqrestore(&conf->pending_bio_lock, flags);
+	while (bio) { /* submit pending writes */
+		struct bio *next = bio->bi_next;
+
 		bio->bi_next = NULL;
 		generic_make_request(bio);
-		spin_lock_irqsave(&conf->pending_bio_lock, flags);
+		bio = next;
 	}
-	spin_unlock_irqrestore(&conf->pending_bio_lock, flags);
 
 	spin_lock_irqsave(&conf->device_lock, flags);
 	for (i=0; i<mddev->raid_disks; i++) {
@@ -666,9 +669,9 @@ static int make_request(request_queue_t
 	atomic_inc(&r1_bio->remaining);
 
 	/* queue the write...it will be submitted when we unplug */
-	spin_lock_irqsave(&conf->pending_bio_lock, flags);
+	spin_lock_irqsave(&conf->device_lock, flags);
 	bio_list_add(&conf->pending_bio_list, mbio);
-	spin_unlock_irqrestore(&conf->pending_bio_lock, flags);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 
 	if (atomic_dec_and_test(&r1_bio->remaining)) {
@@ -965,9 +968,9 @@ static void sync_request_write(mddev_t *
 		       "while resyncing!\n", mdname(mddev), err);
 
 		/* queue the write...it will be submitted when we unplug */
-		spin_lock_irqsave(&conf->pending_bio_lock, flags);
+		spin_lock_irqsave(&conf->device_lock, flags);
 		bio_list_add(&conf->pending_bio_list, wbio);
-		spin_unlock_irqrestore(&conf->pending_bio_lock, flags);
+		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 
 	if (atomic_dec_and_test(&r1_bio->remaining)) {
@@ -1307,7 +1310,6 @@ static int run(mddev_t *mddev)
 	init_waitqueue_head(&conf->wait_idle);
 	init_waitqueue_head(&conf->wait_resume);
 
-	conf->pending_bio_lock = SPIN_LOCK_UNLOCKED;
 	bio_list_init(&conf->pending_bio_list);
 
 	if (!conf->working_disks) {
diff -purN --exclude-from /export/public/clemep/tmp/dontdiff linux-2.6.9-rc2-BITMAP/include/linux/raid/md_p.h linux-2.6.9-rc2-BITMAP-NEW/include/linux/raid/md_p.h
--- linux-2.6.9-rc2-BITMAP/include/linux/raid/md_p.h	Tue Sep 14 12:12:55 2004
+++ linux-2.6.9-rc2-BITMAP-NEW/include/linux/raid/md_p.h	Tue Oct 12 12:21:12 2004
@@ -133,17 +133,20 @@ typedef struct mdp_superblock_s {
 	__u32 events_lo;	/*  8 low-order of superblock update count */
 	__u32 cp_events_hi;	/*  9 high-order of checkpoint update count */
 	__u32 cp_events_lo;	/* 10 low-order of checkpoint update count */
-	__u32 bitmap_events_hi;	/* 11 high-order of bitmap update count */
-	__u32 bitmap_events_lo;	/* 12 low-order of bitmap update count */
 #else
 	__u32 events_lo;	/*  7 low-order of superblock update count */
 	__u32 events_hi;	/*  8 high-order of superblock update count */
 	__u32 cp_events_lo;	/*  9 low-order of checkpoint update count */
 	__u32 cp_events_hi;	/* 10 high-order of checkpoint update count */
-	__u32 bitmap_events_lo;	/* 11 low-order of bitmap update count */
+#endif
+	__u32 recovery_cp;	/* 11 recovery checkpoint sector count */
+#ifdef __BIG_ENDIAN
 	__u32 bitmap_events_hi;	/* 12 high-order of bitmap update count */
+	__u32 bitmap_events_lo;	/* 13 low-order of bitmap update count */
+#else
+	__u32 bitmap_events_lo;	/* 12 low-order of bitmap update count */
+	__u32 bitmap_events_hi;	/* 13 high-order of bitmap update count */
 #endif
-	__u32 recovery_cp;	/* 13 recovery checkpoint sector count */
 
 	__u32 gstate_sreserved[MD_SB_GENERIC_STATE_WORDS - 14];	/*
diff -purN --exclude-from /export/public/clemep/tmp/dontdiff linux-2.6.9-rc2-BITMAP/include/linux/raid/raid1.h linux-2.6.9-rc2-BITMAP-NEW/include/linux/raid/raid1.h
--- linux-2.6.9-rc2-BITMAP/include/linux/raid/raid1.h	Tue Sep 14 14:09:11 2004
+++ linux-2.6.9-rc2-BITMAP-NEW/include/linux/raid/raid1.h	Tue Oct 12 12:30:03 2004
@@ -37,7 +37,6 @@ struct r1_private_data_s {
 
 	/* queue pending writes and submit them on unplug */
 	struct bio_list	pending_bio_list;
-	spinlock_t	pending_bio_lock;
 
 	/* for use when syncing mirrors: */

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: md: persistent (file-backed) bitmap 2004-10-12 2:15 ` Neil Brown 2004-10-12 14:06 ` Paul Clements @ 2004-11-10 0:37 ` Neil Brown 2004-11-10 18:28 ` Paul Clements 1 sibling, 1 reply; 34+ messages in thread From: Neil Brown @ 2004-11-10 0:37 UTC (permalink / raw) To: Paul Clements, jejb, linux-raid Hi Paul, I've been looking through the bitmap code again and found the handling of resync a bit hard to wrap my brain around. The handling of the counter here seems a bit non-obvious and possibly fragile. I would like to make a change, but thought I should bounce it off you first. I would like to change the "RESYNC_MASK" bit to mean: At least one block in this chunk is out-of-sync. Then: - When we read the bitmap from disk we set this bit and the SHADOW_MASK, but leave the counter at zero. - When we get a failed write, we set this bit, but still decrement the counter. - When we are performing a resync, we periodically clear the bit on recently completed chunks. - We only clear the SHADOW_MASK and on-disk bit when the counter hits zero *and* this bit is clear. I would find this approach a lot easier to understand. Are you OK with it? Also, I would like to move the bitmap_testbit in md_do_sync down into the personality. This should make life easier for other personalities like raid10 which use a very different approach for resync than for recovery. NeilBrown ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: md: persistent (file-backed) bitmap 2004-11-10 0:37 ` md: persistent (file-backed) bitmap Neil Brown @ 2004-11-10 18:28 ` Paul Clements 0 siblings, 0 replies; 34+ messages in thread From: Paul Clements @ 2004-11-10 18:28 UTC (permalink / raw) To: Neil Brown; +Cc: jejb, linux-raid Hi Neil, Neil Brown wrote: > I would like to change the "RESYNC_MASK" bit to mean: > At least one block in this chunk is out-of-sync. OK > Then: > - When we read the bitmap from disk we set this bit and the > SHADOW_MASK, but leave the counter at zero. > - When we get a failed write, we set this bit, but still decrement > the counter. > - When we are performing a resync, we periodically clear the bit on > recently completed chunks. I assume that this also means that the counter will get incremented before each read-for-resync is submitted. Perhaps you've already considered that? > - We only clear the SHADOW_MASK and on-disk bit when the counter > hits zero *and* this bit is clear. > > I would find this approach a lot easier to understand. Are you OK > with it? It sounds reasonable. Probably a little simpler too... > Also, I would like to move the bitmap_testbit in md_do_sync down into > the personality. This should make life easier for other > personalities like raid10 which use a very different approach for > resync than for recovery. OK, that sounds fine. -- Paul ^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2004-11-10 18:28 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-01-29 22:51 [ANNOUNCE][PATCH 2.6] md: persistent (file-backed) bitmap and async writes Paul Clements
2004-01-30 22:52 ` Paul Clements
2004-02-09 2:51 ` Neil Brown
2004-02-09 19:45 ` Paul Clements
2004-02-10 0:04 ` Neil Brown
2004-02-10 16:20 ` Paul Clements
2004-02-10 16:57 ` Paul Clements
2004-02-13 20:58 ` Paul Clements
2004-03-05 5:06 ` Neil Brown
2004-03-05 22:05 ` Paul Clements
2004-03-31 18:38 ` Paul Clements
2004-04-28 18:10 ` Paul Clements
2004-04-28 18:53 ` Peter T. Breuer
2004-04-29 8:41 ` Neil Brown
2004-05-04 20:08 ` Paul Clements
2004-06-08 20:53 ` Paul Clements
2004-06-08 22:47 ` Neil Brown
2004-06-14 23:39 ` Neil Brown
2004-06-14 23:59 ` James Bottomley
2004-06-15 6:27 ` Neil Brown
2004-06-17 17:57 ` Paul Clements
2004-06-18 20:48 ` Paul Clements
2004-06-23 21:48 ` Paul Clements
2004-06-23 21:50 ` Paul Clements
2004-07-06 14:52 ` Paul Clements
[not found] ` <40F7E50F.2040308@steeleye.com>
[not found] ` <16649.61212.310271.36561@cse.unsw.edu.au>
2004-08-10 21:37 ` Paul Clements
2004-08-13 3:04 ` Neil Brown
2004-09-21 3:28 ` Paul Clements
2004-09-21 19:19 ` Paul Clements
2004-10-12 2:15 ` Neil Brown
2004-10-12 14:06 ` Paul Clements
2004-10-12 21:16 ` Paul Clements
2004-11-10 0:37 ` md: persistent (file-backed) bitmap Neil Brown
2004-11-10 18:28 ` Paul Clements
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).