* Fast (intelligent) raid1
@ 2003-01-23 17:07 Peter T. Breuer
2003-01-24 9:03 ` Ingo Molnar
0 siblings, 1 reply; 19+ messages in thread
From: Peter T. Breuer @ 2003-01-23 17:07 UTC (permalink / raw)
To: mingo, neilb, linux-raid
Hello Ingo, Neil, ...
Apologies if you receive this twice. I was going to write to you
individually, but when I scanned the kernel maintainers file, I saw the
linux-raid list mentioned, and it didn't seem fair to be secretive, so
I'm ccíng the list on it and you'll probably get this twice.
Sorry!
What I'm writing about is the driver I just put up on
ftp://oboe.it.uc3m.es/pub/Programs/fr1-1.0.tgz
It's an "intelligent RAID1" driver. It only resyncs what's necessary
instead of resyncing the whole disk.
As you surely have experienced, it can take hours to resync a big
device, even if it's local. In my case - as the author and maintainer
of ENBD, a network device - the mirror components are hardly ever local
and I don't get more than about 6MB/s across the net. It takes me a
quarter of an hour to resync an array that passes the 4GB mark, and
that's too long for testing. There are people with arrays out there
approaching 2TB now, and they say they spend half a day resyncing.
So, in self-defence, I put intelligent mirroring into ENBD. But the
result is too big to manage, code-wise. So I spent the last month
separating it out again. Now it's a separate module (2.5KLOC), and I
made it accept the kernel's md ioctls, so it works under raidtools2,
can be listed in raidtab, and so on.
However, raidtools2 is hardcoded to use the md major of 9, which I
can't use. So I made the major of the module adjustable with a major=
parameter when you load it. I also made the trivial patch to
raidtools2 available. Plea: can somebody liberalize the tools? There's
no need to check for major 9, as if the device isn't 9, it won't
understand the ioctls anyway!
The alternative is to recast the fr1 module as a dependency of
the md module. I'd like to do that.
That's where I'd like to ask for somebody's help: I need to be told
how the persistent superblock stuff works. I already emulated the
version, array-info data, and other bits and pieces, but only by
reverse-engineering the calls the raidtools make. I'd really
appreciate any help that could be offered.
I'll append the announcement I made on the ENBD mailing list a short
while ago. It contains some details of operation that may be helpful
in getting the picture. I'll explain more of what happens in further
mail if a conversation develops.
Please cc: me, as I am not on the linux-raid list to my knowledge
(though I am on the kernel list, and many others; the omission is not
particularly deliberate!).
The current code took its first working tests a couple of days ago,
and reached full functionality today. I'm still not sure if it can
detect and react to underlying device errors appropriately (I gave
raidhotgenerateerror some real functionality in order to test, though
I see it's nulled out in the kernel md code). I am not sure if I have
made the buffer heads I send to the mirror components age fast enough,
or if I should wait for each request's completion instead of firing
and forgetting. I haven't throttled the resync, but it should go
slowly enough, as I schedule after every block.
The message below contains various snapshots that tell the tale.
I'll either move on to do intelligent raid4 now, or aim for the
integration with the md code.
Peter
----- Forwarded message from Peter T. Breuer -----
I separated the "intelligent raid1" code out from enbd-2.4.31 and
put it in a separate driver. It's now available as
ftp://oboe.it.uc3m.es/pub/Programs/fr1-1.0.tgz
I've just got it up to working functionality. I haven't tried stressing
it. It runs under the standard raidtools if you load it with major=9.
You have to patch the tools to "liberalize" them if you use another
major. I included a patch.
I'll include the (hastily written in the train last night) README here.
Mmmph .. major limitation: it only has blocksize 1024, like the rest
of softraid. I'll fix that in parallel with other work. It's therefore
limited to 4TB in size, I think, as the block count is a u32 (2^32
blocks of 1KB). Maybe even 2TB, as the sector count is a u32 too (2^32
sectors of 512B).
If anybody would like to make it into a proper md-dependent module,
I'd be very much obliged. That involves understanding the md device's
persistent superblock machinery. At the moment there is no persistent
superblock.
fr1 README (C) Peter T. Breuer Jan 2003.
This is the README for the intelligent fast RAID1 driver, "fr1". It's
"intelligent" in that it doesn't blindly resynchronize a whole mirror
component when only a few blocks need resyncing. That can save hours of
resync time on a large device.
The driver keeps a bitmap of pending writes in memory, and writes them
to the mirror component that's just been repaired when it comes back on
line. The bitmap is two-level and created pagewise on demand, so it's
not too expensive. A terabyte-sized device with 4K blocks costs at
most 32MB of memory per mirror component (2^28 blocks at one bit
each), thus 64MB max for a two-component mirror. The driver is
tolerant of memory faults too: it'll still work if you run out of
memory, just a little less intelligently.
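To picture the structure: a top-level array of page pointers, each
page covering a run of blocks and only materialized when a block in
its range is first dirtied. Here is a minimal sketch of the idea, with
invented names - an illustration only, not the driver's actual code:

  /* Illustrative two-level bitmap: one bit per block, leaf pages
   * allocated only when first dirtied. Names are made up here. */
  #include <stdlib.h>

  #define LEAF_BYTES 4096
  #define LEAF_BITS  (LEAF_BYTES * 8)

  struct bitmap {
          unsigned long npages;   /* top-level slots: blocks / LEAF_BITS */
          unsigned char **page;   /* a NULL slot means "all clean" */
  };

  /* Mark one block dirty. Returns -1 if out of memory; the caller
   * then falls back to a full resync of that range - the "less
   * intelligent" mode mentioned above. */
  static int bitmap_set(struct bitmap *b, unsigned long block)
  {
          unsigned long pg = block / LEAF_BITS, bit = block % LEAF_BITS;

          if (!b->page[pg] && !(b->page[pg] = calloc(1, LEAF_BYTES)))
                  return -1;
          b->page[pg][bit / 8] |= 1u << (bit % 8);
          return 0;
  }

  static int bitmap_test(struct bitmap *b, unsigned long block)
  {
          unsigned long pg = block / LEAF_BITS, bit = block % LEAF_BITS;

          return b->page[pg] && ((b->page[pg][bit / 8] >> (bit % 8)) & 1);
  }

At one bit per 4K block, a 1TB device has 2^28 blocks, i.e. at worst
2^25 bytes = 32MB of leaf pages, which is where the figure above comes
from.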
HOW TO MAKE THE MODULE
Edit the Makefile in this directory, change LINUXDIR to point to the
kernel source for your target kernel, and type "make". Put the fr1.o
module in the misc/ subdirectory of your kernel modules in
/lib/modules/2.4.whatever/. Run /sbin/depmod -a.
HOW TO USE IT:
0) Insert the module into the kernel with "insmod fr1.o". Now, by
default it will take major 240, and the raid tools won't work with
that, so if you want to let it go ahead and use its default major,
then you will have to patch the raidtools. Do it like this ...
i) Get the raidtools2 package
ii) remove the 5 or 6 if clauses in the C code that test that the
major of the block device just stat()ed is MD_MAJOR (9).
iii) compile ("make") and install ("make install") as usual.
Let me just remark that you now have a more tolerant set of raid
tools, and they'll work with fr1 whatever its major. I'll include a
patch for raidtools2 in this directory
(raidtools2-0.90.20010914.patch), and try to persuade the authors to
liberalize the base code, but the changes are obvious.
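For illustration, the clauses being removed have roughly this shape -
a hypothetical reconstruction, not the literal raidtools source:

  /* Roughly the kind of test the patch deletes (reconstructed for
   * illustration; the real raidtools code differs in detail). */
  #include <sys/stat.h>
  #include <sys/sysmacros.h>

  #define MD_MAJOR 9

  int check_md_device(const char *devname)
  {
          struct stat st;

          if (stat(devname, &st) < 0 || !S_ISBLK(st.st_mode))
                  return -1;
          if (major(st.st_rdev) != MD_MAJOR) /* <-- the test to drop: */
                  return -1;                 /* fr1 may sit on major 240 */
          return 0;
  }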
If you don't want to patch the raid tools, then you will have to load
fr1 and make it use major 9, the md major. Like this:
insmod fr1.o major=9
For that to work, the kernel md module must NOT be loaded. You can tell
if it's loaded by doing "cat /proc/devices" and seeing if block major 9
is listed already. If it is, bad luck. You maybe have md.o loaded, and
can unload it with "rmmod md" (preceded by "rmmod raid1" and whatever
other modules are loaded on top of it). Or it may be built in to the
kernel, in which case you're sorely out of luck. Maybe there's a
kernel boot parameter to disable md; I don't know. It would be
"md=off" if anything. To continue ...
Once you have the driver fr1 loaded, you should see it bound to its
major when you do "cat /proc/devices". It'll be visible with lsmod
too.
To use it, you use the (maybe modified, as remarked above) raid
tools.
1) if you are using a non-md major, then you will have to make some
nodes in /dev. Do (for example)
mknod /dev/fr10 b 240 0
mknod /dev/fr11 b 240 1
mknod /dev/fr12 b 240 2
mknod /dev/fr13 b 240 3
otherwise, if using the md major, 9, make sure that /dev/md[0-3]
are present and correct. If not, make them:
mknod /dev/md0 b 9 0
mknod /dev/md1 b 9 1
mknod /dev/md2 b 9 2
mknod /dev/md3 b 9 3
2) edit /etc/raidtab and put in an entry for a typical raid1 mirror
device, for /dev/fr10 or /dev/md0, or whatever corresponds to the
major you are using. Here's an example:
raiddev /dev/fr10
raid-level 1
nr-raid-disks 2
nr-spare-disks 0
persistent-superblock 0
chunk-size 4
device /dev/loop0
raid-disk 0
device /dev/loop1
raid-disk 1
That was for a two-way mirror with two loop devices as components. The
target is /dev/fr10.
3) make the mirror in the usual way with the mkraid utility. For
example:
mkraid --dangerous-no-resync --force /dev/fr10
I don't see the point of NOT using --dangerous-no-resync. You can
always trigger a resync in a moment.
At this point you can "cat /proc/fr1stat" and see how things look.
Here is how they should look for the raidtab configuration detailed
above.
Personalities : [raid1]
read_ahead 4 sectors
fr10 : active fr1 [dev 07:00][0] [dev 07:01][1]
1024 blocks
4) You can now manipulate the mirror with the raidsetfaulty,
raidhotremove, and raidhotadd tools. Raidstop and raidstart might
also be useful.
The only difference with respect to normal usage is that a raidhotadd
will WORK after a raidsetfaulty. You don't have to do a raidhotremove
first. If you do the raidhotadd after a raidsetfaulty, then ONLY THE
BLOCKS NOT WRITTEN IN THE INTERVAL are resynced. Not the whole device.
So you want to do this!
For example, to fault one mirror component:
raidsetfaulty /dev/fr10 /dev/loop0
After this, the output from /proc/fr1stat will show a failed
component. It won't be written to or read:
Personalities : [raid1]
read_ahead 4 sectors
fr10 : active fr1 [dev 07:00][0] [dev 07:01][1](F)
1024 blocks
Then to put the "failed" component back on line:
raidhotadd /dev/fr10 /dev/loop0
and the situation will return to normal, immediately. Only a few
dirtied blocks will have been written to the newly added device.
Personalities : [raid1]
read_ahead 4 sectors
fr10 : active fr1 [dev 07:00][0] [dev 07:01][1]
1024 blocks
If you want to take the "failed" component fully offline, then you must
follow the raidsetfaulty with a
raidhotremove /dev/fr10 /dev/loop0
After this, you can still put the component back with raidhotadd,
but the background resync will be total. You really want to avoid that.
Oh yes. You can now mkfs on the device, mount it, write files to it,
etc. To stop (and deconfigure) the device, do
raidstop /dev/fr10
No, I don't know what raidstart is supposed to do on a non-persistent
array. It doesn't do anything on fr1.
If you fault one device, then write to the device, then hotadd the
faulted device back in, you should be able to see from the kernel
messages (use "dmesg") that the resync is intelligent. Here's some
dmesg output:
fr1 resync starts on device 0 component 1 for 1024 blocks
fr1 resynced dirty blocks 0-9
fr1 resync skipped clean blocks 10-1023
fr1 resync terminates with 0 errs on device 0 component 1
fr1 hotadd component 7.1[1] to device 0
This resync only copied across blocks 0-9, and skipped the rest.
While the resync is happening, /proc/fr1stat will show progress, like
so:
Personalities : [raid1]
read_ahead 4 sectors
fr10 : active fr1 [dev 07:00][0] [dev 07:01][1](F)
1024 blocks
[=======>.............] resync=35.5% (364/1024)
Peter T. Breuer (ptb@it.uc3m.es) Jan 2003.
----- End of forwarded message from Peter T. Breuer -----
* Re: Fast (intelligent) raid1
  2003-01-23 17:07 Fast (intelligent) raid1 Peter T. Breuer
@ 2003-01-24  9:03 ` Ingo Molnar
  2003-01-24  9:34   ` Peter T. Breuer
  2003-02-11 19:47   ` Peter T. Breuer
  0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2003-01-24 9:03 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: neilb, linux-raid

Peter,

your patch looks really interesting.

> The driver keeps a bitmap of pending writes in memory, and writes them
> to the mirror component that's just been repaired when it comes back on
> line. The bitmap is two-level and created pagewise on demand, so it's
> not too expensive. [...]

how do you ensure that the 'repaired' drive indeed only differs in the
dirty-bitmap portions of the data disk? It's perfectly valid to add a
completely new (and unsynced) disk to the system when one disk fails. Or
is this the responsibility of the administrator?

also, your resyncing method does not attempt to address the resync
necessary to be done after an unclean shutdown (e.g. power failure),
correct?

	Ingo
* Re: Fast (intelligent) raid1
  2003-01-24  9:03 ` Ingo Molnar
@ 2003-01-24  9:34   ` Peter T. Breuer
  2003-02-11 19:47   ` Peter T. Breuer
  1 sibling, 0 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-01-24 9:34 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter T. Breuer, neilb, linux-raid

"A month of sundays ago Ingo Molnar wrote:"
> your patch looks really interesting.

Thanks - I had a look at changing it to depend on the md.c module last
night on the train home, but I need some architectural information. I
see the personality struct and its methods, but I need to know the
semantics expected of the methods. There are underlying assumptions -
what is nb_disk (total? actual?)? What should the run method do? Does
it run through a list of disk components? It looks like it does. Etc.
I may try for raid4 first.

> > The driver keeps a bitmap of pending writes in memory, and writes them
> > to the mirror component that's just been repaired when it comes back on
> > line. The bitmap is two-level and created pagewise on demand, so it's
> > not too expensive. [...]
>
> how do you ensure that the 'repaired' drive indeed only differs in the
> dirty-bitmap portions of the data disk? It's perfectly valid to add a

If the disk has been hotremoved and then hotadded, it is completely
resynced. If, OTOH, it has only been setfaulty'ed and then hotadded,
it is repaired according to the bitmap.

I would really prefer that there were a raidhotrepair utility and
ioctl (to repair setfaulty), but in the meantime I use the idea that
fixing a disk "only" in the setfaulty state is always done from the
bitmap. If the disk has been hotremoved, then the repair is made
complete.

I would like to use the superblock information to identify and
differentiate between replacement disks, but unfortunately I need
instruction there too. I could tell from the uuid if the same disk
were being put back or not. If it's the same disk, then the repair
can be done using the bitmap.

> completely new (and unsynced) disk to the system when one disk fails. Or
> is this the responsibility of the administrator?

The admin can always override.

> also, your resyncing method does not attempt to address the resync
> necessary to be done after an unclean shutdown (e.g. power failure),
> correct?

Possibly, but I only say that because I'm not sure what that is. It
has been my experience that the current softraid is itself not very
clear on the issue! If the array is somehow taken down badly, then
there is no very easy way of determining who has the best copy of the
data the next time it is restarted.

I suspect that you mean the "sync on startup" that should in principle
be done, but which can be avoided if there are sufficient "good
indicators" in the various superblocks. Since I need to be taught
about the superblock from zero, yes, I have not used it! The current
code does no sync at start up _at all_, nor does it deal with
persistent superblocks. I need documentation - or instruction. Yes,
the admin will have to identify the correct component and resync from
that, if necessary.

I was thinking of maintaining an mmap'ed copy of the bitmap on disk,
by the way. But that's for later. I could try it. At startup I can
read in the bitmap(s).

Peter
* Re: Fast (intelligent) raid1
  2003-01-24  9:03 ` Ingo Molnar
  2003-01-24  9:34   ` Peter T. Breuer
@ 2003-02-11 19:47   ` Peter T. Breuer
  2003-02-11 20:17     ` Ross Vandegrift
  1 sibling, 1 reply; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-11 19:47 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter T. Breuer, neilb, linux-raid

"A month of sundays ago Ingo Molnar wrote:"
> [ptb wrote]
> > The driver keeps a bitmap of pending writes in memory, and writes them
> > to the mirror component that's just been repaired when it comes back on
> > line. The bitmap is two-level and created pagewise on demand, so it's
> > not too expensive. [...]
>
> how do you ensure that the 'repaired' drive indeed only differs in the
> dirty-bitmap portions of the data disk? It's perfectly valid to add a
> completely new (and unsynced) disk to the system when one disk fails. Or
> is this the responsibility of the administrator?

I've now built the technology into the kernel raid1 code. It's
available from

  ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.0.tgz

list:

  fr1-2.0/src/Makefile
  fr1-2.0/patches/linux-2.4.20-xfs.patch
  fr1-2.0/patches/linux-2.4.19-xfs.patch
  fr1-2.0/patches/linux-2.4.generic.patch
  fr1-2.0/LICENCE
  fr1-2.0/Makefile
  fr1-2.0/README

The patch modifies md.c in order to allow hotadd directly after
setfaulty (which I count as a "hotrepair"). It does this by detecting
the attempt to hotadd a faulty disk, and doing a hotremove first,
after saving the metadata.

It also modifies raid1.c. It makes the resync skip blocks that are
not marked in the bitmap, if there is a bitmap. It makes ordinary
writes mark the bitmap of every mirror component that is currently
not operational. On being marked bad, the bitmap is created for the
component. On being marked active (i.e. after a resync) the bitmap is
taken away.

Then it combines raid1.c with the support in bitmap.c to form a new
module, fr1.o, which replaces raid1.o.

I would be grateful for corrections. I would like it if a "hotrepair"
could be detected automatically from the disk component uuid. I would
also like it if the bitmap could be marked clean block by block in
the raid1_end_io. Well, maybe I could do that - I haven't so far. The
bitmap is vamooshed when the sync is complete, presently. I think
that's the same, since I don't think the resync starts from zero if
interrupted and resumed.

Peter
* Re: Fast (intelligent) raid1
  2003-02-11 19:47   ` Peter T. Breuer
@ 2003-02-11 20:17     ` Ross Vandegrift
  2003-02-11 20:36       ` Peter T. Breuer
  0 siblings, 1 reply; 19+ messages in thread
From: Ross Vandegrift @ 2003-02-11 20:17 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: Ingo Molnar, neilb, linux-raid

On Tue, Feb 11, 2003 at 08:47:04PM +0100, Peter T. Breuer wrote:
> I've now built the technology into the kernel raid1 code. It's
> available from
>
> ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.0.tgz

How much testing has this received? I'd love to try it out, but it
might take me a bit to set up an array that could stand data loss. On
the other hand, if it's stable, I'd move to it pretty quickly. This
is a way cool feature.

--
Ross Vandegrift
ross@willow.seitz.com

A Pope has a Water Cannon. It is a Water Cannon.
He fires Holy-Water from it. It is a Holy-Water Cannon.
He Blesses it. It is a Holy Holy-Water Cannon.
He Blesses the Hell out of it. It is a Wholly Holy Holy-Water Cannon.
He has it pierced. It is a Holey Wholly Holy Holy-Water Cannon.
He makes it official. It is a Canon Holey Wholly Holy Holy-Water Cannon.
Batman and Robin arrive. He shoots them.
* Re: Fast (intelligent) raid1
  2003-02-11 20:17     ` Ross Vandegrift
@ 2003-02-11 20:36       ` Peter T. Breuer
  2003-02-11 23:05         ` Peter T. Breuer
  0 siblings, 1 reply; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-11 20:36 UTC (permalink / raw)
To: Ross Vandegrift; +Cc: Peter T. Breuer, Ingo Molnar, neilb, linux-raid

"A month of sundays ago Ross Vandegrift wrote:"
> On Tue, Feb 11, 2003 at 08:47:04PM +0100, Peter T. Breuer wrote:
> > I've now built the technology into the kernel raid1 code. It's
> > available from
> >
> > ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.0.tgz
>
> How much testing has this received? I'd love to try it out, but it

Oh, virtually none. Why not try it on a couple of 4M loop devices
first. Thus:

  dd if=/dev/zero bs=1024k of=/tmp/foo0 count=4
  dd if=/dev/zero bs=1024k of=/tmp/foo1 count=4
  losetup /dev/loop0 /tmp/foo0
  losetup /dev/loop1 /tmp/foo1

raidtab:

  raiddev /dev/md0
          raid-level            1
          nr-raid-disks         2
          nr-spare-disks        0
          persistent-superblock 0
          chunk-size            4
          device                /dev/loop0
          raid-disk             0
          device                /dev/loop1
          raid-disk             1

and off you go ...

> might take me a bit to set up an array that could stand data loss. On
> the other hand, if it's stable, I'd move to it pretty quickly. This
> is a way cool feature.

I have had a headache understanding the current raid code. I don't
pretend to understand it all, and the effort I have made so far is
directed at letting the kernel raid authors see how to integrate the
extra technology into THEIR code, by putting up my best attempt. The
result works for me, but I would appreciate communication from this
list and the raid authors on what should be done.

Have a look at the patch in the .tgz. I tried to make it as clean as
I could. Every change I made in the md.c code is commented. There are
4 "hunks" of changes to md.c, to allow hotadd after setfaulty, and
about ten significant hunks of changes to raid1.c, inserting the
extra technology. There is some extra debugging code in that, which I
can remove if a minimal patch is required. The rest of the support is
given in a separate, new, bitmap.c file, which supplies
infrastructure.

Peter
* Re: Fast (intelligent) raid1
  2003-02-11 20:36       ` Peter T. Breuer
@ 2003-02-11 23:05         ` Peter T. Breuer
  2003-02-14 11:33           ` Peter T. Breuer
  0 siblings, 1 reply; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-11 23:05 UTC (permalink / raw)
To: ptb; +Cc: Ross Vandegrift, Ingo Molnar, neilb, linux-raid

"Peter T. Breuer wrote:"
> Have a look at the patch in the .tgz. I tried to make it as clean as
> I could. Every change I made in the md.c code is commented. There are
> 4 "hunks" of changes to md.c, to allow hotadd after setfaulty, and
> about ten significant hunks of changes to raid1.c, inserting the
> extra technology. There is some extra debugging code in that, which I
> can remove if a minimal patch is required. The rest of the support is
> given in a separate, new, bitmap.c file, which supplies
> infrastructure.

In fact - I'll publish and go through the patch here. Here we go.

We start with md.c and the addition to the block comment at the head
of the file:

--- linux-2.4.19-xfs.orig/drivers/md/md.c	Sun Feb  9 10:35:53 2003
+++ linux-2.4.19-xfs/drivers/md/md.c	Sun Feb  9 10:45:42 2003
@@ -26,6 +26,12 @@
    You should have received a copy of the GNU General Public License
    (for example /usr/src/linux/COPYING); if not, write to the Free
    Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+   Changes 31/1/2003 by Peter T. Breuer <ptb@it.uc3m.es> to support
+   hotadd directly after setfaulty without intervening hotremove
+   ("hotrepair").
+   - save and restore extra data when hotrepair detected
+
 */
 
 #include <linux/module.h>

Next in md.c is a set of changes in the hotadd function. We're going
to detect a hotadd of a faulty disk, and interpret it as a
"hotrepair", so we need a state variable to signal it. When we spot
it, we will do a hotremove before doing the hotadd, as the raid code
expects that and I don't want to play around. But we'll save and
restore all the extra data that I've put in the "reserved for future
use" part of the md disk data when we declared the disk faulty. Well,
I say "all", but it's only the address of a bitmap. Anyway, the
hotrepair boolean signals what we're doing, and the extra_data array
holds the extra data temporarily.

@@ -2374,6 +2380,9 @@
 	unsigned int size;
 	mdk_rdev_t *rdev;
 	mdp_disk_t *disk;
+	/* do extra hotremove and save/restore extra data in hotrepair */
+	int hotrepair = 0;
+	typeof(disk->reserved) extra_data;
 
 	if (!mddev->pers)
 		return -ENODEV;

Now for the change in the hotadd function that spots an attempt to
hotadd a faulty disk, and adds an extra hotremove before continuing.
We set the hotrepair boolean here, and save the extra_data. In order
to preserve exactly what should happen if we had really done a
hotremove before getting here, I put a label at the start of this
section and jump back to it with a goto after saving the data,
setting the boolean, and calling the hotremove. When we start again
we're exactly in the situation that the original md.c code expects.
We don't reexecute this detection code again because there's a
!hotrepair guard on it. So we just fall through. I could have fallen
through originally, but I wanted to not miss anything by accident.

@@ -2396,14 +2405,48 @@
 		return -ENOSPC;
 	}
 
+start_again:
 	rdev = find_rdev(mddev, dev);
-	if (rdev)
-		return -EBUSY;
+	/*
+	 * Allow "hotrepair" of merely faulty device too.
+	 */
+	if (rdev) {
+		if (!rdev->faulty)
+			return -EBUSY;
+		if (!hotrepair && rdev->dev == dev) {
+			printk(KERN_WARNING "md%d: re-add of faulty disk detected! Will remove first.\n",
+				mdidx(mddev));
+			for (i = 0; i < MD_SB_DISKS; i++) {
+				disk = mddev->sb->disks + i;
+				if (MKDEV(disk->major,disk->minor) == dev) {
+					break;
+				}
+			}
+			if (i < MD_SB_DISKS) {
+				mdp_disk_t * disk = mddev->sb->disks + i;
+				printk(KERN_WARNING "md%d: saving extra data from disk %d!\n",
+					mdidx(mddev), disk->number);
+				memcpy(extra_data,
+					(&mddev->sb->disks[disk->number])->reserved, sizeof(extra_data));
+				printk(KERN_DEBUG "saved data");
+				for (i = 0; i < sizeof(extra_data)/4; i++) {
+					printk(" %d: %x", i, extra_data[i]);
+				}
+				printk("\n");
+			}
+			err = hot_remove_disk(mddev, dev);
+			if (err < 0) {
+				return err;
+			}
+			hotrepair = 1;
+			goto start_again;
+		}
+	}
 
 	err = md_import_device (dev, 0);
 	if (err) {
 		printk(KERN_WARNING "md: error, md_import_device() returned %d\n", err);
 		return -EINVAL;
 	}
 	rdev = find_rdev_all(dev);
 	if (!rdev) {

Further down the function we come to the place where the "new" disk,
which has been added as a spare disk, is finally shifted into place
in the array. At this point we restore the saved data to it (it's
only the address of a bitmap really, but we restore all the data that
it's possible to restore, so as not to know anything about raid1
structures).

@@ -2466,6 +2509,16 @@
 	}
 	mark_disk_spare(disk);
+	if (hotrepair) {
+		printk(KERN_WARNING "md%d: restoring saved extra data to disk %d!\n",
+			mdidx(mddev), disk->number);
+		memcpy((&mddev->sb->disks[disk->number])->reserved, extra_data, sizeof(extra_data));
+		printk(KERN_DEBUG "restored data");
+		for (i = 0; i < sizeof(extra_data)/4; i++) {
+			printk(" %d: %x", i, extra_data[i]);
+		}
+		printk("\n");
+	}
 	mddev->sb->nr_disks++;
 	mddev->sb->spare_disks++;
 	mddev->sb->working_disks++;

That was all the changes in md.c. Now for the changes in raid1.c.
First an addition to the comment at the head:

--- linux-2.4.19-xfs.orig/drivers/md/raid1.c	Sat Feb  8 23:19:06 2003
+++ linux-2.4.19-xfs/drivers/md/raid1.c	Sun Feb  9 09:48:24 2003
@@ -20,6 +20,17 @@
  * You should have received a copy of the GNU General Public License
  * (for example /usr/src/linux/COPYING); if not, write to the Free
  * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Changes by Peter T. Breuer <ptb@it.uc3m.es> 31/1/2003 to support
+ * bitmapped intelligence in resync:
+ *
+ * - bitmap attached on setfaulty (mark bad)
+ * - bitmap marked during normal i/o if faulty disk
+ * - bitmap used to skip nondirty blocks during sync
+ * - bitmap removed on set active
+ *
+ * Minor changes are needed in raid1.h (extra fields in conf) and in
+ * md.c (support hotadd directly after hotremove).
  */
 
 #include <linux/module.h>

OK. I turned on what debugging there was in order to help me. That's
this next hunk.

@@ -39,7 +50,7 @@
 /*
  * The following can be used to debug the driver
  */
-#define RAID1_DEBUG	0
+#define RAID1_DEBUG	1
 
 #if RAID1_DEBUG
 #define PRINTK(x...)	printk(x)

In order to include the bitmap technology, I need some of the
functions declared in bitmap.h. Actually, it's an object/class, and I
need the class declaration with its methods. I didn't want to play
with the raid1 disk info structs, so I used the first 32bit reserved
field in the struct to contain a bitmap address. Sorry.

@@ -49,6 +60,8 @@
 #define PRINTK(x...)	do { } while (0)
 #endif
 
+#include "bitmap.h"
+#define raid1_bitmap(disk) ((struct bitmap *)(disk)->reserved[0])
 
 static mdk_personality_t raid1_personality;
 static md_spinlock_t retry_list_lock = MD_SPIN_LOCK_UNLOCKED;

Now comes a change to the ordinary write code. When we get a write
request we go and search for "nonoperational" mirror components, and
mark the bitmap on each of them for the blocks that we are supposed
to be writing to. Sorry about the search each time, but the only
sensible alternative is to maintain an array of indices of the
nonoperational devices, and that's plain confusing as code.

I'm not sure if nonworking mirror components ARE signalled by the
!operational flag. What's this used_slot field for? Am I supposed to
ignore it, or respect it, or what? I seem to be skipping components
without the used_slot field set. It didn't hurt me, but who knows
what it's for ...

@@ -640,8 +653,31 @@
 	bhl = raid1_alloc_bh(conf, conf->raid_disks);
 	for (i = 0; i < disks; i++) {
 		struct buffer_head *mbh;
-		if (!conf->mirrors[i].operational)
+		/*
+		 * Mark the bitmap of each mirror we can't write to
+		 * (i.e. is not operational).
+		 */
+		if (!conf->mirrors[i].operational) {
+
+			struct bitmap * bitmap = NULL;
+			mdp_super_t *sb = mddev->sb;
+
+			if (!conf->mirrors[i].used_slot)
+				continue;
+
+			/* I'm not sure if mddev always has sb. FIXME. */
+			if (sb) {
+				bitmap =
+				raid1_bitmap(&sb->disks[conf->mirrors[i].number]);
+			}
+			if (bitmap) {
+				bitmap->setbits(bitmap, bh->b_rsector >> 1, bh->b_size >> 10);
+				PRINTK(KERN_DEBUG "raid1: mark mirror %d blk %lu-%lu\n",
+					i, bh->b_rsector >> 1,
+					(bh->b_rsector >> 1) + (bh->b_size >> 10) - 1);
+			}
 			continue;
+		}
 
 		/*
 		 * We should use a private pool (size depending on NR_REQUEST),

The next bit just adds a couple of functions for adding and removing
a bitmap from a raid1 component disk. It's purely an interface to the
underlying bitmap make and destroy methods. The create function
figures out the size of the component and passes it to the init
method, for example.

@@ -744,6 +780,47 @@
 #define ALREADY_SYNCING KERN_INFO \
 "raid1: syncing already in progress.\n"
 
+static int raid1_create_bitmap(mdp_disk_t * disk) {
+
+	struct bitmap * bitmap;
+	unsigned long blocks;
+	int err;
+
+	if (raid1_bitmap(disk) != NULL)
+		return -EINVAL;
+
+	if (!blk_size[disk->major])
+		return -EINVAL;
+
+	blocks = blk_size[disk->major][disk->minor];
+
+	bitmap = kmalloc (sizeof (*bitmap), GFP_KERNEL);
+	if (!bitmap)
+		return -ENOMEM;
+
+	bitmap_init (bitmap, blocks);
+	err = bitmap->make (bitmap);
+	if (err < 0) {
+		kfree (bitmap);
+		return err;
+	}
+	raid1_bitmap(disk) = bitmap;
+	return 0;
+}
+
+static void
+raid1_bitmap_remove (mdp_disk_t * disk) {
+
+	struct bitmap * bitmap = raid1_bitmap(disk);
+
+	if (bitmap == NULL)
+		return;
+
+	raid1_bitmap(disk) = NULL;
+	bitmap->destroy(bitmap);
+	kfree(bitmap);
+}
+
 static void mark_disk_bad (mddev_t *mddev, int failed)
 {
 	raid1_conf_t *conf = mddev_to_conf(mddev);

The mark_disk_bad function is apparently what's called when we do a
setfaulty. It gets altered to add in a bitmap for the component, if
one wasn't already there.

@@ -752,6 +829,20 @@
 	mirror->operational = 0;
 	mark_disk_faulty(sb->disks+mirror->number);
+	/*
+	 * Put the bitmap on a mirror just marked faulty (and
+	 * nonoperational).
+	 */
+	if (raid1_bitmap(&sb->disks[mirror->number]) == NULL) {
+		raid1_create_bitmap(&sb->disks[mirror->number]);
+		PRINTK(KERN_DEBUG "raid1: make bitmap %x on mirror %d\n",
+			(unsigned) raid1_bitmap(&sb->disks[mirror->number]),
+			mirror->number );
+	} else {
+		PRINTK(KERN_DEBUG "raid1: bitmap %x already on mirror %d\n",
+			(unsigned) raid1_bitmap(&sb->disks[mirror->number]),
+			mirror->number );
+	}
 	mark_disk_nonsync(sb->disks+mirror->number);
 	mark_disk_inactive(sb->disks+mirror->number);
 	if (!mirror->write_only)

This hunk is purely for my debugging convenience. I removed repeats
of "zero" entries from the conf map.

@@ -818,6 +909,12 @@
 
 	for (i = 0; i < MD_SB_DISKS; i++) {
 		tmp = conf->mirrors + i;
+		/*
+		 * Remove repeats from debug printout.
+		 */
+		if (i > 0 && memcmp(tmp, &conf->mirrors[i-1], sizeof(*tmp)) == 0) {
+			continue;
+		}
 		printk(" disk %d, s:%d, o:%d, n:%d rd:%d us:%d dev:%s\n",
 			i, tmp->spare,tmp->operational,
 			tmp->number,tmp->raid_disk,tmp->used_slot,

Hohum - more debugging. The diskop function is a complete mystery to
me. I had to add printouts in every case of all its case statements.
I used the PRINTK call, so it's turned off when you turn off
debugging in the code. I left the debug stuff in because it prints
out only when somebody actually does an operation, so it's
human-related.

@@ -878,6 +975,8 @@
 
 	case DISKOP_SPARE_ACTIVE:
 
+		PRINTK(KERN_DEBUG "raid1: diskop SPARE ACTIVE\n");
+
 		/*
 		 * Find the failed disk within the RAID1 configuration ...
 		 * (this can only be in the first conf->working_disks part)

More debugging in diskop, plus a possibly gratuitous change that
allows the code, which goes looking for a spare device to apply one
of its spare_active or spare_inactive or spare_write changes to, to
find a spare device more easily. The device major and minor matching
is sufficient after this change. I didn't know exactly what the disk
"number" signified.

@@ -904,13 +1003,24 @@
 	case DISKOP_SPARE_WRITE:
 	case DISKOP_SPARE_INACTIVE:
 
+		PRINTK(KERN_DEBUG "raid1: diskop SPARE %s\n",
+			state == DISKOP_SPARE_WRITE ? "WRITE" :
+			state == DISKOP_SPARE_INACTIVE ? "INACTIVE" :
+			state == DISKOP_SPARE_ACTIVE ? "ACTIVE" : ""
+		);
 		/*
 		 * Find the spare disk ... (can only be in the 'high'
 		 * area of the array)
 		 */
 		for (i = conf->raid_disks; i < MD_SB_DISKS; i++) {
 			tmp = conf->mirrors + i;
-			if (tmp->spare && tmp->number == (*d)->number) {
+			if (tmp->spare
+			    && (tmp->number == (*d)->number
+			    /*
+			     * I'm not sure we now need to allow match by
+			     * device number too. FIXME.
+			     */
+			    || tmp->dev == MKDEV((*d)->major,(*d)->minor))) {
 				spare_disk = i;
 				break;
 			}

and more debugging in diskop.

@@ -924,6 +1034,8 @@
 
 	case DISKOP_HOT_REMOVE_DISK:
 
+		PRINTK(KERN_DEBUG "raid1: diskop HOT REMOVE\n");
+
 		for (i = 0; i < MD_SB_DISKS; i++) {
 			tmp = conf->mirrors + i;
 			if (tmp->used_slot && (tmp->number == (*d)->number)) {

and more debugging in diskop.

@@ -944,6 +1056,8 @@
 
 	case DISKOP_HOT_ADD_DISK:
 
+		PRINTK(KERN_DEBUG "raid1: diskop HOT ADD\n");
+
 		for (i = conf->raid_disks; i < MD_SB_DISKS; i++) {
 			tmp = conf->mirrors + i;
 			if (!tmp->used_slot) {

and more debugging in diskop.

@@ -964,20 +1078,31 @@
 	 * Switch the spare disk to write-only mode:
 	 */
 	case DISKOP_SPARE_WRITE:
+
 		sdisk = conf->mirrors + spare_disk;
+
+		PRINTK(KERN_DEBUG "raid1: diskop SPARE WRITE disk %d\n",
+			sdisk->number);
+
 		sdisk->operational = 1;
 		sdisk->write_only = 1;
+
 		break;
 	/*
 	 * Deactivate a spare disk:
 	 */
 	case DISKOP_SPARE_INACTIVE:
+
 		if (conf->start_future > 0) {
 			MD_BUG();
 			err = -EBUSY;
 			break;
 		}
 		sdisk = conf->mirrors + spare_disk;
+
+		PRINTK(KERN_DEBUG "raid1: diskop SPARE INACTIVE disk %d\n",
+			sdisk->number);
+
 		sdisk->operational = 0;
 		sdisk->write_only = 0;
 		break;

and more debugging in diskop. One can leave all this out of the
patch. But it would drive me crazy.

@@ -989,12 +1114,17 @@
 	 * property)
 	 */
 	case DISKOP_SPARE_ACTIVE:
+
 		if (conf->start_future > 0) {
 			MD_BUG();
 			err = -EBUSY;
 			break;
 		}
 		sdisk = conf->mirrors + spare_disk;
+
+		PRINTK(KERN_DEBUG "raid1: diskop SPARE ACTIVE disk %d\n",
+			sdisk->number);
+
 		fdisk = conf->mirrors + failed_disk;
 
 		spare_desc = &sb->disks[sdisk->number];

Aha, finally, I think this is when we mark the spare active in
diskop. We remove the bitmap at this point. We've presumably just
done a sync. I'd have preferred to wipe the bitmap during the sync
itself, or at the end, but it appears that spare_active is called as
a diskop always just before integrating the "new" device in the array
and after it has synced. So I trust this is correct.

@@ -1077,9 +1207,17 @@
 
 		conf->working_disks++;
 
+		/*
+		 * We need to vamoosh the bitmap.
+		 */
+		raid1_bitmap_remove( mddev->sb->disks+fdisk->number);
+
 		break;
 
 	case DISKOP_HOT_REMOVE_DISK:
+
+		PRINTK(KERN_DEBUG "raid1: diskop HOT REMOVE\n");
+
 		rdisk = conf->mirrors + removed_disk;
 
 		if (rdisk->spare && (removed_disk < conf->raid_disks)) {

And more diskop debugging.

@@ -1093,6 +1231,9 @@
 		break;
 
 	case DISKOP_HOT_ADD_DISK:
+
+		PRINTK(KERN_DEBUG "raid1: diskop HOT ADD\n");
+
 		adisk = conf->mirrors + added_disk;
 		added_desc = *d;

And some debugging of my own, to show that the bitmap is there. It's
magicked into position because it comes from the data that md.c saved
and restored, and it's just "there" when we look here in diskop. The
bitmap was really created during the mark_disk_bad call, ages ago,
after a setfaulty.

@@ -1113,6 +1254,10 @@
 		adisk->head_position = 0;
 		conf->nr_disks++;
 
+		PRINTK(KERN_DEBUG "raid1: diskop HOT ADDed mirr %d disk %d bitmap %x\n",
+			added_disk, adisk->number,
+			(unsigned)raid1_bitmap(&mddev->sb->disks[adisk->number]));
+
 		break;
 
 	default:

Now here we are in the resync function. The original code synced
every block. We're only going to sync blocks that appear in the
bitmaps of the faulty devices. So I keep an array of the indices of
the faulty devices ("targets"), as well as a "count" of how many
there are of them. "bitmap" is just a temp variable.

@@ -1358,6 +1503,15 @@
 	int disk;
 	int block_nr;
 	int buffs;
+	/*
+	 * Will need to count mirror components currently with a bitmap
+	 * which have been marked faulty and nonoperational at some
+	 * point beforehand, and have been accumulating marks on the
+	 * bitmap to indicate dirty blocks that need syncing.
+	 */
+	struct bitmap * bitmap;
+	int count;
+	int targets[MD_SB_DISKS];
 
 	if (!sector_nr) {
 		/* we want enough buffers to hold twice the window of 128*/

The original code does its setup when it's asked to sync sector 0. We
do the same, but for a couple of extra accounting fields placed in
the "conf" raid1 struct. This is purely for informational output.

@@ -1369,6 +1523,10 @@
 	}
 	spin_lock_irq(&conf->segment_lock);
 	if (!sector_nr) {
+		/* setup extra report counters for skipped/synced blocks */
+		conf->sync_mode = -1;
+		conf->last_clean_sector = -1;
+		conf->last_dirty_sector = -1;
 		/* initialize ...*/
 		conf->start_active = 0;
 		conf->start_ready = 0;

Umm, I fixed a couple of printk field types.

@@ -1382,7 +1540,7 @@
 		MD_BUG();
 	}
 	while (sector_nr >= conf->start_pending) {
-		PRINTK("wait .. sect=%lu start_active=%d ready=%d pending=%d future=%d, cnt_done=%d active=%d ready=%d pending=%d future=%d\n",
+		PRINTK("wait .. sect=%lu start_active=%ld ready=%ld pending=%ld future=%ld, cnt_done=%d active=%d ready=%d pending=%d future=%d\n",
 			sector_nr, conf->start_active, conf->start_ready,
 			conf->start_pending, conf->start_future,
 			conf->cnt_done, conf->cnt_active, conf->cnt_ready,
 			conf->cnt_pending, conf->cnt_future);
 		wait_event_lock_irq(conf->wait_done,

Here we go and find the list of faulted (nonoperational) mirror
components. There appears to be no sensible upper bound on where to
search for these in the existing array. So I look in the range n to
n+f, where n is the number of "raid disks", and f is the number of
the raid disks which have failed. We are presently syncing a device
we have just added, and it gets added in as a spare disk, so it will
be above the standard raid disks in the array. I don't think it can
be above n+f, but maybe I am wrong. I don't know what the effect of
"spare" disks is. f is n-w, of course, where w is the number of
working disks.

If we find some faulted targets, then we check their bitmaps. If they
have a bitmap and it's clean, then we skip the sync of this block. I
signalled md_sync_acct, sync_request_done, md_done_sync, and anything
else I could find. That seems to do the trick. If, OTOH, the bitmaps
are not clean for this block, we fall through and do the normal sync.

@@ -1422,7 +1580,64 @@
 	conf->last_used = disk;
 
 	mirror = conf->mirrors+conf->last_used;
+
+	/* go looking for the faulted (nonoperational) mirrors */
+	count = 0;
+	while (1) {
+		const int maxdisk = 2 * conf->raid_disks - conf->working_disks;
+		if (disk <= 0)
+			disk = maxdisk > MD_SB_DISKS ? MD_SB_DISKS : maxdisk;
+		disk--;
+		if (disk == conf->last_used)
+			break;
+		if (!conf->mirrors[disk].operational)
+			continue;
+		/* We need them to be writable */
+		if (conf->mirrors[disk].write_only) {
+			targets[count++] = disk;
+		}
+	}
+
+	if (count > 0) {
+		int i;
+		int dirty = 0;
+		for (i = 0; i < count; i++) {
+			disk = targets[i];
+			PRINTK(KERN_DEBUG "testing bitmap for disk %d\n", disk);
+			bitmap = mddev->sb ?
+			    raid1_bitmap(&mddev->sb->disks[conf->mirrors[disk].number]) : NULL;
+
+			if (!bitmap
+			    || bitmap->testbit(bitmap, sector_nr >> 1)) {
+				dirty++;
+				break;
+			}
+		}
+		if (dirty <= 0) {
+			const int done = 2 - (sector_nr & 1);
+			md_sync_acct(mirror->dev, done);
+			sync_request_done(sector_nr, conf);
+			md_done_sync(mddev, done, 1);
+			if (conf->sync_mode != 0) {
+				if (conf->sync_mode == 1) {
+					printk(KERN_INFO "raid1: synced dirty sectors %lu-%lu\n",
+						conf->last_clean_sector+1,
+						conf->last_dirty_sector);
+				}
+				conf->sync_mode = 0;
+			}
+			conf->last_clean_sector = sector_nr + done - 1;
+			wake_up(&conf->wait_ready);
+			if (mddev->sb && sector_nr + done >= mddev->sb->size<<1) {
+				printk(KERN_INFO "raid1: skipped clean sectors %lu-%lu\n",
+					conf->last_dirty_sector+1,
+					conf->last_clean_sector);
+			}
+			/* skip remainder of block */
+			return done;
+		}
+	}
+
+	/* read */
 	r1_bh = raid1_alloc_buf (conf);
 	r1_bh->master_bh = NULL;
 	r1_bh->mddev = mddev;

Here's some accounting printout at the end of the resync function.
It's just reporting sequences of clean or dirty blocks. It shouldn't
be too noisy in practice.

@@ -1456,6 +1671,22 @@
 	generic_make_request(READ, bh);
 	md_sync_acct(bh->b_dev, bh->b_size/512);
 
+	/* printout info from time to time */
+	if (conf->sync_mode != 1) {
+		if (conf->sync_mode == 0) {
+			printk(KERN_INFO "raid1: skipped clean sectors %lu-%lu\n",
+				conf->last_dirty_sector+1,
+				conf->last_clean_sector);
+		}
+		conf->sync_mode = 1;
+	}
+	conf->last_dirty_sector = sector_nr + (bsize >> 9) - 1;
+
+	if (mddev->sb && sector_nr + (bsize >> 9) >= mddev->sb->size<<1) {
+		printk(KERN_INFO "raid1: synced dirty sectors %lu-%lu\n",
+			conf->last_clean_sector+1,
+			conf->last_dirty_sector);
+	}
 	return (bsize >> 9);
 
 nomem:

And that was that.

Peter
* Re: Fast (intelligent) raid1
  2003-02-11 23:05         ` Peter T. Breuer
@ 2003-02-14 11:33           ` Peter T. Breuer
  2003-02-14 20:46             ` Ross Vandegrift
  2003-02-17  3:07             ` Neil Brown
  0 siblings, 2 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-14 11:33 UTC (permalink / raw)
To: ptb; +Cc: Ross Vandegrift, Ingo Molnar, neilb, linux-raid

I'll compress this down to an even more abstract summary ...

"Peter T. Breuer wrote:"
> "Peter T. Breuer wrote:"
> > Have a look at the patch in the .tgz. I tried to make it as clean as
> > I could. Every change I made in the md.c code is commented. There are
> > 4 "hunks" of changes to md.c, to allow hotadd after setfaulty, and
> > about ten significant hunks of changes to raid1.c, inserting the
> > extra technology. There is some extra debugging code in that, which I
> > can
>
> In fact - I'll publish and go through the patch here. Here we go.

1) change the hotadd function in md.c with the objective of permitting
   hotadd after setfaulty ("hotrepair"), which should preserve a
   bitmap which has been previously added to the disk metadata in the
   main array (during setfaulty).

2) change the write code in raid1.c to mark the bitmap of every mirror
   component disk which is marked not operational, if it has a bitmap.

3) change the mark_disk_bad code in raid1.c to add a bitmap to the
   disk metadata in the full raid array. This is called by setfaulty,
   and also on error from below, I think.

4) at the point where a spare disk is marked active in the diskop
   function in raid1.c (state SPARE_ACTIVE), remove any bitmap
   associated with the disk metadata in the full raid array. This is
   called after a successful resync, somehow, and possibly on other
   occasions.

5) in the resync function in raid1.c, for each resync block or blocks,
   find all the spare mirror components which are marked
   nonoperational but writable ("write_only"), and if they have a
   bitmap and it is clean for the blocks we are interested in, then
   cheat for that device - report and account for having written to it
   when in fact we have not. This means calling md_sync_acct and
   sync_request_done and md_done_sync and possibly signalling on the
   wait_ready wait queue for the raid device. If we don't cheat then
   fall through and do the normal thing, which is to launch a write
   request for some blocks, do a bit of accounting and leave the done
   functions and signalling for its end_io.

I would be deeply obliged if somebody could indicate to me where to
make some further changes. What I want to do is allow an underlying
block device to notify the raid code when the block device has "fixed
itself". My plan is to

a) get the raid code to signal the underlying block device during a
   hotadd, presumably at the end, what the major and minor of the raid
   device it has become part of is. This will be via an extra ioctl
   which I will declare for all block devices. Possibly it would be
   nice to actually pass the file system inode for the special device
   node of md0 or whatever, if we have it.

b) when the block device feels well again, then it will signal the
   raid code via the inode or more directly via the block_ops array
   and a new ioctl that it has come back up, and the raid code will
   then do a hotadd.

and I would like pointers as to where to insert this in the current
raid codes.

Peter
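To make points a) and b) above concrete, the proposed notification
might look something like the following. This is entirely
hypothetical - neither ioctl exists in any kernel; the names and
numbers are invented here for illustration only:

  /* Hypothetical sketch of the proposed notifications. */
  #include <linux/ioctl.h>

  /* a) md -> component, at the end of a hotadd:
   *    "you are now part of the raid device with this (major,minor)" */
  #define BLKRAIDBIND   _IOW(0x12, 125, unsigned int)

  /* b) component -> md, when the device feels well again:
   *    "I'm back on line; please hot-add me" */
  #define RAIDHOTREPAIR _IOW(0x09, 35, unsigned int)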
* Re: Fast (intelligent) raid1
  2003-02-14 11:33           ` Peter T. Breuer
@ 2003-02-14 20:46             ` Ross Vandegrift
  2003-02-17  3:07             ` Neil Brown
  1 sibling, 0 replies; 19+ messages in thread
From: Ross Vandegrift @ 2003-02-14 20:46 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-raid

[cc list trimmed]

On Fri, Feb 14, 2003 at 12:33:06PM +0100, Peter T. Breuer wrote:
> I would be deeply obliged if somebody could indicate to me where to
> make some further changes.

I'm certainly not the person to be answering your question, as I've
only ever taken a cursory look at the RAID code - I'm just a serious
user.

However, I wasn't able to get your patch code compiled with a kernel
to check it out more thoroughly. What version is the patch against,
and can I compile the kernel separately from your Makefile? (These
are questions I could probably answer myself, but I'm at work now,
and didn't have time the other night when I was playing with it.)

Hopefully I'll have some time this weekend/coming week to get it all
built and working.

--
Ross Vandegrift
ross@willow.seitz.com

A Pope has a Water Cannon. It is a Water Cannon.
He fires Holy-Water from it. It is a Holy-Water Cannon.
He Blesses it. It is a Holy Holy-Water Cannon.
He Blesses the Hell out of it. It is a Wholly Holy Holy-Water Cannon.
He has it pierced. It is a Holey Wholly Holy Holy-Water Cannon.
He makes it official. It is a Canon Holey Wholly Holy Holy-Water Cannon.
Batman and Robin arrive. He shoots them.
* Re: Fast (intelligent) raid1
  2003-02-14 11:33           ` Peter T. Breuer
  2003-02-14 20:46             ` Ross Vandegrift
@ 2003-02-17  3:07             ` Neil Brown
  2003-02-17 13:17               ` Peter T. Breuer
                                 ` (3 more replies)
  1 sibling, 4 replies; 19+ messages in thread
From: Neil Brown @ 2003-02-17 3:07 UTC (permalink / raw)
To: ptb; +Cc: Ross Vandegrift, Ingo Molnar, linux-raid

On Friday February 14, ptb@it.uc3m.es wrote:
> I'll compress this down to an even more abstract summary ...
>
> "Peter T. Breuer wrote:"
> > "Peter T. Breuer wrote:"
> > > Have a look at the patch in the .tgz. I tried to make it as clean as
> > > I could. Every change I made in the md.c code is commented. There are
> > > 4 "hunks" of changes to md.c, to allow hotadd after setfaulty, and
> > > about ten significant hunks of changes to raid1.c, inserting the
> > > extra technology. There is some extra debugging code in that, which I
> > > can
> >
> > In fact - I'll publish and go through the patch here. Here we go.
>
> 1) change the hotadd function in md.c with the objective of permitting
>    hotadd after setfaulty ("hotrepair"), which should preserve a
>    bitmap which has been previously added to the disk metadata in the
>    main array (during setfaulty).
>
> 2) change the write code in raid1.c to mark the bitmap of every mirror
>    component disk which is marked not operational, if it has a bitmap.

Can I suggest a somewhat different approach?

Rather than having several bitmaps, have just one. Normally it is
full of zero and isn't touched.

When a write fails, or when a write is sent to some-but-not-all
devices, set the relevant bit in the bitmap. The first time you set a
bit, record the current 'event' number with the bitmap.

The interpretation of this bitmap is 'only the blocks that are
flagged in this bitmap have changed since event X'.

On hot-add, read the old superblock. If it looks valid and matches
the current array, and has an event counter of X or more, then ignore
blocks that have their bits set in the bitmap which reconstructing,
otherwise do a full reconstruction.

When we have a full complement of devices again, clear the bitmap and
the event record.

The advantages of this include:
 - only need one bitmap
 - don't need the hot_repair concept - what we have is more general.
 - don't need to update the bitmap (which would have to be a
   bus-locked operation) on every write.

Disadvantages:
 - if two devices fail, will resync some blocks on the later one that
   don't require it.

As for the other bits about a block device 'fixing itself' - I think
that belongs in user-space. Have some program monitoring things and
re-adding the device to the array when it appears to be working
again.

NeilBrown
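In pseudocode, the hot-add rule Neil describes comes down to
something like this - all names are hypothetical, and the sketch
follows his stated contract that set bits mark blocks changed since
event X, hence exactly the blocks that must be copied to the
returning disk:

  /* Sketch of the single-bitmap hot-add decision described above. */
  int block_needs_resync(const struct old_sb *sb, const struct array *a,
                         unsigned long block)
  {
          if (!sb_is_valid(sb) || !sb_matches_array(sb, a) ||
              sb->events < a->bitmap_event_x)
                  return 1;                      /* unknown or stale disk:
                                                    full resync */
          return test_bit(block, a->bitmap);     /* known disk: flagged
                                                    blocks only */
  }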
* Re: Fast (intelligent) raid1
  2003-02-17  3:07             ` Neil Brown
@ 2003-02-17 13:17               ` Peter T. Breuer
  2003-02-17 15:08                 ` Peter T. Breuer
  2003-02-17 17:54                 ` Peter T. Breuer
  1 sibling, 2 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-17 13:17 UTC (permalink / raw)
To: Neil Brown; +Cc: ptb, Ross Vandegrift, Ingo Molnar, linux-raid

Hi!

"A month of sundays ago Neil Brown wrote:"
> Rather than having several bitmaps, have just one. Normally it is
> full of zero and isn't touched.

I'm having a little trouble interpreting your suggestion, so I'll
work through it here, and you can maybe correct my understanding.

In the first place, I can say that having ONE bitmap per raid
composite device is OK with me and probably is a worthwhile
simplification over having ONE per mirror component, since the normal
case will be a two-component mirror, and a mark on the bitmap will
indicate unambiguously where the fault is if the mirror is
operational at all.

Or will it .. while the mirror is resyncing there may be a doubt. But
then it's currently subject to some imprecisions at that stage, since
a fault from a normal write while a resync is going on will be
marked, but the mark will be lost when the bitmap is cleared at the
end of the resync (this is a bug - I'm not sure how the raid code
itself reacts in this situation, since there are races here).
Nevertheless the principle is sound. The mark on the bitmap need only
indicate "something may be wrong", not precisely what is wrong. More
precision implies more efficiency at resync time, but it's a
tradeoff.

> When a write fails, or when a write is sent to some-but-not-all
> devices, set the relevant bit in the bitmap.

This is what is currently done, I believe, subject to my correct
implementation of the code, and correct understanding of the extant
source. Except that presently it's done per component instead of per
array. I'm not too sure what whole-array structure to attach the
bitmap to. Suggestions welcome.

> The first time you set a bit, record the current 'event' number with
> the bitmap.

Let me clarify here that you are talking of a counter that is
incremented once per request received on the device, and which is
written to the superblock of all the mirror components at the point
when the global bitmap is first dirtied after previously having been
clean.

I'd observe that we dirty the bitmap because a component has just
dropped out/failed, and so the register of that counter on the failed
component will stay at zero (or whatever it was), since we can/should
no longer write to it. It is not quite clear to me if we should write
the counter to a device which is removed while up to date. Let's
see ...

> The interpretation of this bitmap is 'only the blocks that are
> flagged in this bitmap have changed since event X'.

This is fine by me.

> On hot-add, read the old superblock. If it looks valid and matches
> the current array, and has an event counter of X or more, then ignore
> blocks that have [not got] their bits set in the bitmap [when]
> reconstructing, otherwise do a full reconstruction.

I have trouble understanding the implications here. See items in
square brackets also for possible typos that I've "corrected".

Normally a disk will drop out with the counter set at zero on the
disk component involved, and the in-memory counter actually at X, but
it would not have been possible to write the counter to the sb of the
component, since we find out that we should have written it only
after it's dead... So it will come back with its counter still set to
zero. So when it comes back its counter will NOT be set at "X or
more", so we "do a full reconstruction". This obviously is not the
intention.

I believe that possibly we should note the value of the counter X
when we last wrote it successfully to all the disks, in kernel. This
is a "checkpoint". We can update the checkpoint on all disks (and in
kernel) from time to time, I think. If the bitmap is dirtied, then it
will have been dirtied since the checkpoint was written in kernel AND
on disk. This is one invariant.

When a disk comes back without a replacement having been used in the
meantime, its checkpoint on its sb will match the checkpoint in
kernel, and we can update only the blocks signalled as dirty in the
bitmap.

When a disk comes back after a replacement has been used in the
meantime, then the checkpoint in the kernel will have advanced (umm,
we have to advance it by one at least on the first write after a
hotadd, or do we have a problem when a faulty disk is introduced as a
replacement?) beyond that on the disk, and we will know that we have
to do a full resync on the component.

> When we have a full complement of devices again, clear the bitmap and
> the event record.

Not sure about that. It would lead to confusion when a replacement
disk was used with an old checkpoint value. I don't think one can
ever reset the event counter safely, which means that it needs a
generation counter too.

> The advantages of this include:
>  - only need one bitmap

Agreed.

>  - don't need the hot_repair concept - what we have is more general.

Not sure.

>  - don't need to update the bitmap (which would have to be a
>    bus-locked operation) on every write.

Hmm .. one only updates the bitmap when there is a faulted disk
component, more or less. And I'm not quite sure what "bus-locked"
means .. you mean that the order wrt bus operations must be strictly
preserved? Is that necessary? It's not clear to me.

> Disadvantages:
>  - if two devices fail, will resync some blocks on the later one that
>    don't require it.

Doesn't matter. There are other race conditions which are possibly
worse karma.

> As for the other bits about a block device 'fixing itself' - I think
> that belongs in user-space. Have some program monitoring things and
> re-adding the device to the array when it appears to be working
> again.

I don't agree - programs are simply not reliable enough, and users
are not reliable enough to install and configure them. This can be
done in about 10 lines of kernel code or less, I believe: 1) notify
the new component of which array it is now part, with a generic ioctl
after a hotadd; 2) let the component do a hotadd ioctl through the
blkops array and our struct when it comes back online, if it feels
like it.

Thanks for the comments .. let me know if I've misinterpreted
something.

Peter
* Re: Fast (intelligent) raid1
  2003-02-17 13:17 ` Peter T. Breuer
@ 2003-02-17 15:08   ` Peter T. Breuer
  2003-02-17 17:54   ` Peter T. Breuer
  1 sibling, 0 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-17 15:08 UTC (permalink / raw)
To: ptb; +Cc: Neil Brown, Ross Vandegrift, Ingo Molnar, linux-raid

"Peter T. Breuer wrote:"
> "A month of sundays ago Neil Brown wrote:"
> > Rather than having several bitmaps, have just one. Normally it is
> > full of zero and isn't touched.
>
> I'm not too sure what whole-array structure to attach the bitmap to.
> Suggestions welcome.

This would be the raid1_conf_t ("conf") that is referenced in the
private data field of the mddev_t. I'll do that first, though I
presume it should later be promoted to become an md-wide field.

Peter
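For reference, the route from the md device to that structure is just a
cast of the private field, as raid1.c's mddev_to_conf macro does.
Schematically (struct bodies here are elided stand-ins, not the full
2.4 kernel definitions):

    /* Only the private pointer matters for this illustration. */
    typedef struct raid1_private_data {
            struct bitmap *bitmap;  /* the proposed per-array bitmap */
            /* ... the rest of the raid1 conf ... */
    } raid1_conf_t;

    typedef struct mddev_s {
            void *private;          /* set by the raid personality */
            /* ... */
    } mddev_t;

    #define mddev_to_conf(mddev) ((raid1_conf_t *)(mddev)->private)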
* Re: Fast (intelligent) raid1
  2003-02-17 13:17 ` Peter T. Breuer
  2003-02-17 15:08   ` Peter T. Breuer
@ 2003-02-17 17:54   ` Peter T. Breuer
  1 sibling, 0 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-17 17:54 UTC (permalink / raw)
To: ptb; +Cc: Neil Brown, Ross Vandegrift, Ingo Molnar, linux-raid

"Peter T. Breuer wrote:"
> "A month of sundays ago Neil Brown wrote:"
> > Rather than having several bitmaps, have just one. Normally it is
> > full of zero and isn't touched.

Well, this bit was no trouble - it works fine. I'll publish a patch
presently, when I have a few more things in.

> > On hot-add, read the old superblock. If it looks valid and matches

Yah. Any clues as to what function does that? I need to WRITE the
superblock too, from time to time. I need all the clues I can get.
This must be in md.c somewhere, right?

Peter
* Re: Fast (intelligent) raid1
  2003-02-17  3:07 ` Neil Brown
  2003-02-17 13:17 ` Peter T. Breuer
@ 2003-02-18  0:57 ` Peter T. Breuer
  2003-02-18 15:04 ` Peter T. Breuer
  2003-02-18 22:13 ` Peter T. Breuer
  3 siblings, 0 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-18 0:57 UTC (permalink / raw)
To: Neil Brown; +Cc: ptb, Ross Vandegrift, Ingo Molnar, linux-raid

"Neil Brown wrote:"
> Rather than having several bitmaps, have just one. Normally it is
> full of zero and isn't touched.
> On hot-add, read the old superblock. If it looks valid and matches
> the current array, and has an event counter of X or more, then ignore

I've made a preliminary patch. Available as

  ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.2.tgz

This still needs to have the checkpoint (X) stuff put in, but I'm too
tired to do it right now. What it has is:

1) a single bitmap;
2) recognition of a "hotrepair" via the superblock on a persistent-sb
   array; and also
3) allowing a "hotrepair" on a nonpersistent-sb array when a hotadd
   follows a setfaulty directly, without a hotremove.

I'll complete it tomorrow. I checked basic functionality (i.e. "it
works" (tm)).

Peter
* Re: Fast (intelligent) raid1
  2003-02-17  3:07 ` Neil Brown
  2003-02-17 13:17 ` Peter T. Breuer
  2003-02-18  0:57 ` Peter T. Breuer
@ 2003-02-18 15:04 ` Peter T. Breuer
  2003-02-18 22:13 ` Peter T. Breuer
  3 siblings, 0 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-18 15:04 UTC (permalink / raw)
To: Neil Brown; +Cc: ptb, Ross Vandegrift, Ingo Molnar, linux-raid

"A month of sundays ago Neil Brown wrote:"
> The first time you set a bit, record the current 'event' number with
> the bitmap.

I'm having some trouble with this part. Now, there ARE things called
events_hi and events_lo in the current raid code, and you may be
referring to these. Correct?

> On hot-add, read the old superblock. If it looks valid and matches
> the current array, and has an event counter of X or more, then ignore

And this is the same events_hi and events_lo?

Now this all HAS to be done in md.c, because the read_disk_sb and
write_disk_sb functions are not exported from there.

Peter
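(For concreteness: the hi/lo pair joins into one 64-bit counter along
these lines - a standalone sketch, not the kernel code itself:

    #include <stdint.h>

    /* The 0.90 superblock stores a 64-bit counter as two 32-bit
     * words; joining them looks like this. */
    static uint64_t events64(uint32_t events_hi, uint32_t events_lo)
    {
            return ((uint64_t)events_hi << 32) | events_lo;
    }

The patch later in this thread does exactly this shift-and-or dance on
the reserved fields of the superblock.)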
* Re: Fast (intelligent) raid1
  2003-02-17  3:07 ` Neil Brown
                    ` (2 preceding siblings ...)
  2003-02-18 15:04 ` Peter T. Breuer
@ 2003-02-18 22:13 ` Peter T. Breuer
  3 siblings, 0 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-18 22:13 UTC (permalink / raw)
To: Neil Brown; +Cc: ptb, Ross Vandegrift, Ingo Molnar, linux-raid

I've now completed the preliminary patch which conforms to Neil's
ideas, and I've put it up at

  ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.2.tgz

I'll follow roughly the same procedure as before, and publish the main
part of the patch here, with my comments. Then later I'll abstract one
further time.

First let me go through Neil's ideas and comment on what changes I made
to accommodate them ...

======================================

"Neil Brown wrote:"
> Rather than having several bitmaps, have just one. Normally it is

This was easy, and welcome. I did it at once and several things became
simpler. The bitmap is stuck in the raid1 private data, but it can go
one layer up (down?) into the general raid info at any time.

> full of zero and isn't touched.

At the moment, normally it's not there. It's brought into existence
when needed. I agree that that's wrong and I'll alter it in a
subsequent patch.

> When a write fails, or when a write is sent to some-but-not-all
> devices, set the relevant bit in the bitmap.

Well, I think that was already so.

> The first time you set a bit, record the current 'event' number with
> the bitmap.

OK - now you got me there for a while, because I didn't realize you
were talking about the "events" counter in the current code. At least
I think you are. I saved it into the "reserved" area of the md device
superblock, because it has to be in the md layer in order to avoid a
layering violation. This is another indicator that the bitmap should
be there too.

> The interpretation of this bit map is 'only the blocks that are
> flagged in this bitmap have changed since event X'.
>
> On hot-add, read the old superblock. If it looks valid and matches

I managed that. It seems to more or less work, though I'm a bit wary
of this event counter thing. I've only seen the case 1 <= 1 so far.

> the current array, and has an event counter of X or more, then ignore

I signal this desire via another entry in the reserved area of the md
device superblock. In the resync, the write to the first sector
examines this flag and decides whether to observe or disregard the
bitmap. If the bitmap is to be disregarded (because the disk that's
been added is too old or does not match), then it is removed. It
can't presently be removed in the md code which set the flag, because
it's currently a raid1 thing only. This is the first opportunity.

> blocks that have their bits set in the bitmap when reconstructing,
> otherwise do a full reconstruction.
>
> When we have a full complement of devices again, clear the bitmap and
> the event record.

Well, that was done in one of the diskops already. Spare active, I
think.

> As for the other bits about a block device 'fixing itself' - I think
> that belongs in user-space. Have some program monitoring things and
> re-adding the device to the array when it appears to be working again.

I've mentioned that I disagree, but I haven't had time to do anything
about it yet.

======================================

Now I'll select the main parts of the current patch and go through
them ... We start with raid1.h. It gets a few extra fields in the
private data structure for raid1.
--- linux-2.4.20-xfs/include/linux/raid/raid1.h.orig	Sun Jan 26 20:03:48 2003
+++ linux-2.4.20-xfs/include/linux/raid/raid1.h	Tue Feb 18 17:48:10 2003
@@ -59,6 +59,11 @@
 	md_wait_queue_head_t	wait_done;
 	md_wait_queue_head_t	wait_ready;
 	md_spinlock_t		segment_lock;
+	long last_clean_sector;		/* helps debugging printout */
+	long last_dirty_sector;
+	int sync_mode;			/* clean or dirty in sync ? */
+	struct bitmap * bitmap;		/* the array bitmap */
+	int bitmap_dirty;		/* flag */
 };

 typedef struct raid1_private_data raid1_conf_t;

Now for the changes to md.c. Start with the flag that indicates we have
detected a "hotrepair" situation in hot_add(). It's either a previously
removed disk being readded, while we've been maintaining a bitmap of
pending writes for it, or we can force it (in the case of no persistent
superblock) by doing a hotadd of the same disk immediately after
setfaulty, without an intervening hotremove.

--- linux-2.4.20-xfs/drivers/md/md.c.orig	Sun Jan 26 18:57:20 2003
+++ linux-2.4.20-xfs/drivers/md/md.c	Tue Feb 18 22:07:58 2003
@@ -2374,6 +2382,7 @@
 	unsigned int size;
 	mdk_rdev_t *rdev;
 	mdp_disk_t *disk;
+	int hotrepair = 0;

 	if (!mddev->pers)
 		return -ENODEV;

Here is where we detect the case of no persistent superblock and hotadd
after setfaulty. Neil won't want this. We force an extra hotremove and
restart the hotadd. But near the end of this hunk there is an important
change, of a "0" to a "persistent" in the md_import_device() call. This
is for the case when there IS a persistent superblock. The value 1 to
the call means that the new disk superblock is read in, to be tested.
That's the key.

@@ -2396,11 +2405,38 @@
 		return -ENOSPC;
 	}

-	rdev = find_rdev(mddev, dev);
-	if (rdev)
-		return -EBUSY;
+	/*
+	 * This is a do at most once loop because the remove in the loop will
+	 * cause the test to fail the next time round. And if that
+	 * doesn't break us out, then the hotrepair count will.
+	 */
+	while ((rdev = find_rdev(mddev, dev)) != NULL) {
+
+		if (hotrepair || persistent || rdev->dev != dev || !rdev->faulty) {
+			printk(KERN_WARNING "md%d: cannot add existing component %x\n",
+			       mdidx(mddev), dev);
+			return -EBUSY;
+		}
+		/*
+		 * Allow "hotrepair" of merely faulty device too if no superblock to
+		 * go by. We assume a hotadd after setfaulty of the same device is a
+		 * hotrepair, if there's no persistent superblock to say otherwise.
+		 */
+		printk(KERN_WARNING "md%d: repair of faulty disk %x!\n",
+		       mdidx(mddev), dev);
+
+		/* Remove will cause find_rdev to fail next time */
+		err = hot_remove_disk(mddev, dev);
+		if (err < 0) {
+			printk(KERN_WARNING "md%d: remove disk %x errored\n",
+			       mdidx(mddev), dev);
+			return err;
+		}
+		/* This will inevitably error us out of the loop interior next time */
+		hotrepair = 1;
+	}

-	err = md_import_device (dev, 0);
+	err = md_import_device (dev, persistent);
 	if (err) {
 		printk(KERN_WARNING "md: error, md_import_device() returned %d\n", err);
 		return -EINVAL;

Here's where we test the uuid of the disk superblock just read in.
We're still in hotadd. I also test the event counter. That's kept in
fields 2 and 3 of the "reserved" array on the disk superblock. I'm not
at all sure right now how I got it there ... the counter seems to have
the value 1 in the tests I've done! The idea is that a counter that is
old (low) represents a disk that separated from our lineage way before
the point at which our bitmap was created, which is recorded in the
corresponding reserved fields of the md device superblock in memory.
I know that I write those fields when I dirty the bitmap for the first
time. I've tested by doing a setfaulty, then writing to the device,
then doing hotremove and hotadd, and things work. But I am uneasy about
the semantics here. The end result of this bit in hotadd is merely to
set the local hotrepair flag. We use it nearer the end of the function.

@@ -2416,6 +2452,52 @@
 		err = -EINVAL;
 		goto abort_export;
 	}
+	/* let's check the new disk sb at this point */
+	if (persistent && rdev->sb
+	    && rdev->sb->set_uuid0 == mddev->sb->set_uuid0
+	    && rdev->sb->set_uuid1 == mddev->sb->set_uuid1
+	    && rdev->sb->set_uuid2 == mddev->sb->set_uuid2
+	    && rdev->sb->set_uuid3 == mddev->sb->set_uuid3) {
+		unsigned long long disk_events, bitmap_events;
+		disk_events = rdev->sb->reserved[3];
+		disk_events <<= 32;
+		disk_events |= rdev->sb->reserved[2];
+
+		bitmap_events = mddev->sb->reserved[3];
+		bitmap_events <<= 32;
+		bitmap_events |= mddev->sb->reserved[2];
+
+		if (disk_events < bitmap_events) {
+			/* new disk is too old! */
+			hotrepair = 0;
+			printk(KERN_INFO "md%d: new disk %x too old for repair %Lu < %Lu\n",
+			       mdidx(mddev), dev, disk_events, bitmap_events);
+		} else {
+			hotrepair = 1;
+			printk(KERN_INFO "md%d: repairing old mirror component %x (%Lu >= %Lu)\n",
+			       mdidx(mddev), dev, disk_events, bitmap_events);
+		}
+	} else if (!persistent && hotrepair) {
+		hotrepair = 1;
+		printk(KERN_INFO "md: forced repair of mirror component %x\n",
+		       dev);
+	} else {
+		/* failed match */
+		hotrepair = 0;
+		printk(KERN_INFO "md: adding new mirror component %x\n",
+		       dev);
+		printk(KERN_DEBUG "md: old uuid %x %x %x %x\n",
+		       mddev->sb->set_uuid0,
+		       mddev->sb->set_uuid1,
+		       mddev->sb->set_uuid2,
+		       mddev->sb->set_uuid3);
+		printk(KERN_DEBUG "md: new uuid %x %x %x %x\n",
+		       rdev->sb->set_uuid0,
+		       rdev->sb->set_uuid1,
+		       rdev->sb->set_uuid2,
+		       rdev->sb->set_uuid3);
+	}
+
 	bind_rdev_to_array(rdev, mddev);

At the end of the hotadd function, I set the flag (the 0th reserved
field of the md device superblock) to indicate that we are doing a
hotrepair - an intelligent resync, if possible. It'll be seen in the
raid1 resync.

@@ -2470,6 +2554,19 @@
 	mddev->sb->spare_disks++;
 	mddev->sb->working_disks++;

+	if (hotrepair) {
+		/* maybe say something nice - this means we want to respect
+		 * the bitmap in raid1 resync if there is one
+		 */
+		printk(KERN_DEBUG "md%d: set repair bit on superblock\n",
+		       mdidx(mddev));
+		mddev->sb->reserved[0] = 1;
+	} else {
+		/* this means we need to kill any bitmap that we have been
+		 * saving, but we'll do it in the raid1 resync instead of here
+		 */
+		mddev->sb->reserved[0] = 0;
+	}
 	mddev->sb_dirty = 1;
 	md_update_sb(mddev);

That was the end of the changes to md.c. Now the changes to raid1.c.
First of all, in the make request function, mark the bitmaps of every
nonoperational component of the mirror. This is as it was last time,
except that I decided to take the device spinlock while looking at the
metadata. I see lots of potential races in the current code, and I
don't want to be responsible for any more. Let's try and avoid things
moving around underneath us while we waffle around in here ...

--- linux-2.4.20-xfs/drivers/md/raid1.c.orig	Sun Jan 26 18:57:20 2003
+++ linux-2.4.20-xfs/drivers/md/raid1.c	Tue Feb 18 18:28:12 2003
@@ -645,8 +658,38 @@
 	bhl = raid1_alloc_bh(conf, conf->raid_disks);
 	for (i = 0; i < disks; i++) {
 		struct buffer_head *mbh;
-		if (!conf->mirrors[i].operational)
+		/*
+		 * Mark the bitmap of each mirror we can't write to
+		 * (i.e. is not operational).
+		 */
+
+		spin_lock_irq(&conf->segment_lock);
+		if (!conf->mirrors[i].operational) {
+
+			struct bitmap * bitmap = conf->bitmap;
+
+			if (!conf->mirrors[i].used_slot) {
+				spin_unlock_irq(&conf->segment_lock);
+				continue;
+			}
+
+			if (bitmap) {
+				bitmap->setbits(bitmap, bh->b_rsector >> 1, bh->b_size >> 10);
+				if (!conf->bitmap_dirty) {
+					conf->bitmap_dirty = 1;
+					mddev->sb->reserved[2] =
+						mddev->sb->events_lo;
+					mddev->sb->reserved[3] =
+						mddev->sb->events_hi;
+				}
+				PRINTK(KERN_DEBUG "raid1: mark mirror %d blk %lu-%lu\n",
+				       i, bh->b_rsector >> 1,
+				       (bh->b_rsector >> 1) + (bh->b_size >> 10) - 1);
+			}
+			spin_unlock_irq(&conf->segment_lock);
 			continue;
+		}
+		spin_unlock_irq(&conf->segment_lock);

 		/*
 		 * We should use a private pool (size depending on NR_REQUEST),

As before, in the mark disk bad function, we make a bitmap and attach
it to the device. Mmmph .. I should take the device spinlock. Well, I
probably do - in the functions called.

@@ -757,6 +869,18 @@
 	mirror->operational = 0;
 	mark_disk_faulty(sb->disks+mirror->number);
+	/*
+	 * Put the bitmap on a mirror just marked faulty (and
+	 * nonoperational).
+	 */
+	if (conf->bitmap == NULL) {
+		int err = raid1_create_bitmap(mddev);
+		printk(KERN_DEBUG "raid1: make bitmap %x on mirror %d (%d)\n",
+		       (unsigned) conf->bitmap, mirror->number, err);
+	} else {
+		printk(KERN_WARNING "raid1: bitmap %x already on mirror %d\n",
+		       (unsigned) conf->bitmap, mirror->number);
+	}
 	mark_disk_nonsync(sb->disks+mirror->number);
 	mark_disk_inactive(sb->disks+mirror->number);
 	if (!mirror->write_only)

In the diskop that activates a spare device, we remove the bitmap. All
is well again.

@@ -1082,9 +1223,17 @@
 			conf->working_disks++;

+			/*
+			 * We need to vamoosh the bitmap.
+			 */
+			raid1_remove_bitmap(mddev);
+
 			break;

 		case DISKOP_HOT_REMOVE_DISK:
+
+			PRINTK(KERN_DEBUG "raid1: diskop HOT REMOVE\n");
+
 			rdisk = conf->mirrors + removed_disk;

 			if (rdisk->spare && (removed_disk < conf->raid_disks)) {

Now we get to the resync function. A few local variables are added.
"targets" is the set of faulted (writeonly) disks. "count" is how many
of them there are.

@@ -1363,6 +1515,15 @@
 	int disk;
 	int block_nr;
 	int buffs;
+	/*
+	 * Will need to count mirror components currently with a bitmap
+	 * which have been marked faulty and nonoperational at some
+	 * point beforehand, and have been accumulating marks on the
+	 * bitmap to indicate dirty blocks that need syncing.
+	 */
+	struct bitmap * bitmap;
+	int count;
+	int targets[MD_SB_DISKS];

 	if (!sector_nr) {
 		/* we want enough buffers to hold twice the window of 128*/

When looking at the first sector, we check the "hotrepair" flag that's
been passed down in the md device superblock. If it says 1, OK, we
leave any bitmap in place and use it later in the resync. If the flag
says 0, we remove any bitmap, so the resync will be the ordinary full
resync.

@@ -1371,9 +1532,25 @@
 		if (buffs < 2)
 			goto nomem;
 		conf->window = buffs*(PAGE_SIZE>>9)/2;
+		/* also remove bitmap if not indicated */
+		if (!mddev->sb->reserved[0]) {
+			/* has to be outside spinlock as it takes it */
+			printk(KERN_WARNING "md%d: removed bitmap\n",
+			       mdidx(mddev));
+			raid1_remove_bitmap (mddev);
+		} else {
+			printk(KERN_WARNING "md%d: retained bitmap %x\n",
+			       mdidx(mddev), (unsigned)conf->bitmap);
+		}
+		/* reset the bitmap indicator always */
+		mddev->sb->reserved[0] = 0;
 	}
 	spin_lock_irq(&conf->segment_lock);
 	if (!sector_nr) {
+		/* setup extra report counters for skipped/synced blocks */
+		conf->sync_mode = -1;
+		conf->last_clean_sector = -1;
+		conf->last_dirty_sector = -1;
 		/* initialize ...*/
 		conf->start_active = 0;
 		conf->start_ready = 0;

This is where we find our targets, and skip the current block if we
have a clean bitmap entry for it. I need to fix this - it can now be
made more efficient.

@@ -1427,7 +1604,63 @@
 	conf->last_used = disk;

 	mirror = conf->mirrors+conf->last_used;
+
+	/* go looking for the faulted (nonoperational) mirrors */
+	count = 0;
+	while (1) {
+		const int maxdisk = 2 * conf->raid_disks - conf->working_disks;
+		if (disk <= 0)
+			disk = maxdisk > MD_SB_DISKS ? MD_SB_DISKS : maxdisk;
+		disk--;
+		if (disk == conf->last_used)
+			break;
+		if (!conf->mirrors[disk].operational)
+			continue;
+		/* We need them to be writable */
+		if (conf->mirrors[disk].write_only) {
+			targets[count++] = disk;
+		}
+	}
+
+	if (count > 0) {
+		int i;
+		int dirty = 0;
+		for (i = 0; i < count; i++) {
+			disk = targets[i];
+			bitmap = conf->bitmap;
+
+			if (!bitmap
+			    || bitmap->testbit(bitmap, sector_nr >> 1)) {
+				dirty++;
+				break;
+			}
+		}
+		if (dirty <= 0) {
+			const int done = 2 - (sector_nr & 1);
+			md_sync_acct(mirror->dev, done);
+			sync_request_done(sector_nr, conf);
+			md_done_sync(mddev, done, 1);
+			if (conf->sync_mode != 0) {
+				if (conf->sync_mode == 1) {
+					printk(KERN_INFO "raid1: synced dirty sectors %lu-%lu\n",
+					       conf->last_clean_sector+1,
+					       conf->last_dirty_sector);
+				}
+				conf->sync_mode = 0;
+			}
+			conf->last_clean_sector = sector_nr + done - 1;
+			wake_up(&conf->wait_ready);
+			if (mddev->sb && sector_nr + done >= mddev->sb->size<<1) {
+				printk(KERN_INFO "raid1: skipped clean sectors %lu-%lu\n",
+				       conf->last_dirty_sector+1,
+				       conf->last_clean_sector);
+			}
+			/* skip remainder of block */
+			return done;
+		}
+	}
+
+	/* read */
 	r1_bh = raid1_alloc_buf (conf);
 	r1_bh->master_bh = NULL;
 	r1_bh->mddev = mddev;

And that's that. I'll summarise later.

Peter
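Stripped of the raid1 plumbing, the per-block decision that last hunk
makes reduces to something like the following standalone sketch;
bitmap_test, the target struct and all the names are invented
stand-ins, not the kernel types:

    #include <stdbool.h>

    struct target { bool write_only; }; /* stand-in for a mirror slot */

    static bool bitmap_test(const unsigned char *bitmap, long block)
    {
            return bitmap[block >> 3] & (1 << (block & 7));
    }

    /* A block is copied only if there is no bitmap (full resync) or
     * some rebuilding target still has its bit set; otherwise it is
     * merely accounted and skipped. */
    static bool block_needs_sync(const unsigned char *bitmap,
                                 const struct target *targets, int count,
                                 long block)
    {
            int i;

            if (!bitmap)
                    return true;
            for (i = 0; i < count; i++)
                    if (targets[i].write_only && bitmap_test(bitmap, block))
                            return true;
            return false;
    }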
* Re: Fast (intelligent) raid1
@ 2003-02-19 11:29 Peter T. Breuer
0 siblings, 0 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-19 11:29 UTC (permalink / raw)
To: ptb; +Cc: Neil Brown, Ross Vandegrift, Ingo Molnar, linux-raid
"A month of sundays ago ptb wrote:"
> ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.2.tgz
I'll now try and provide an abstract summary of the code changes
introduced by the patch.
1) change hotadd in md.c to detect a previously known disk via
its superblock uuid (which must match the raid array's) and
signal "hotrepair" if the events count in the disk sb
indicates that it separated from us after the current bitmap was
first set up (first dirtied), which indicates that te bitmap
contains all the changes that are pending for it. We have saved
the events counter at the time the bitmap was set up (first
dirtied) in the reserved area of the array superblock.
If hotrepair is signalled, record that also in the array
superblock reserved area.
If it's not a hotrepair that we're doing in the hotadd, then we
will eventually ditch the bitmap during the resync. For the moment
we just signal it via the flag in the array superblock reserved
area.
2) change make_request in raid1.c so as to mark the bitmap when
there are nonoperational elements in the array. On dirtying the
bitmap for the first time, write the current md device superblock
events value to the reserved area of the md device superblock, and
set the bitmap dirty flag. (A schematic sketch of this step appears
just after this list.)
The bitmap itself is kept in the private raid1 metadata struct,
along with a flag to indicate whether it's dirty or not.
3) in the mark_disk_bad function in raid1.c, activate (or set up)
a bitmap, flagged not dirty.
4) In the diskop function in raid1.c, in the SPARE_ACTIVE case
we remove the bitmap, or deactivate it. Whatever. This is
called at the end of a successful resync.
5) In the resync_block function in raid1.c, we check on handling
the first sector if we are doing a hotrepair or not, by looking
in the reserved area of the md device superblock. If we are not,
we remove any bitmap, or deactivate it, and will then proceed to
do a full resync afterwards. If we are doing hotrepair, we zero
the hotrepair flag and keep any bitmap that happens to be there.
In the rest of the function, if there is a bitmap (active), we
use it to decide which blocks to really sync and which not.
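Here is the schematic sketch of step 2 promised above - plain C
stand-ins for the kernel structures, with invented names throughout:

    /* Not the real md fields; illustration only. */
    struct array_state {
            unsigned char *bitmap;            /* one bit per block */
            int bitmap_dirty;
            unsigned long long events;        /* running event counter */
            unsigned long long bitmap_events; /* latched on first dirtying */
    };

    static void mark_missed_write(struct array_state *a, long block)
    {
            a->bitmap[block >> 3] |= 1 << (block & 7);
            if (!a->bitmap_dirty) {
                    a->bitmap_dirty = 1;
                    /* "the first time you set a bit, record the
                     * current 'event' number with the bitmap" */
                    a->bitmap_events = a->events;
            }
    }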
I've since also started to make the ordinary mirror bh writes
asynchronous, by marking the bitmap on submitting the bh i/o, saying the
i/o is complete when one of the sub-bh's completes, and clearing the
bitmap for those blocks when all the sub-bh's complete. An error in
i/o should fault the component offline, so that we will know later
which component the bitmap marks refer to.
I hope linux-raid is archived. Just my luck if I lose these summaries
..
Peter
* Re: Fast (intelligent) raid1
@ 2003-02-19 20:59 Peter T. Breuer
0 siblings, 0 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-19 20:59 UTC (permalink / raw)
To: ptb; +Cc: Neil Brown, Ross Vandegrift, Ingo Molnar, linux-raid
"ptb wrote:"
> "A month of sundays ago ptb wrote:"
> > ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.2.tgz
I've now added asynchronous writes to the fr1-2.4 patch (note,
2.4, not 2.2). This uses the bitmap as a journal log. Ordinary
writes mark the bitmap, and the bitmap is cleared when the last i/o in
the set completes.
However, success is reported back to the user after ONE successful
completion (i.e. we run the original end_io handed down to md), not
after two or more or however many components are in the array...
We can afford to do that because the bitmap records if we miss
any.
Peter
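Schematically, the journal discipline is: mark before submitting,
acknowledge on the first success, clear only when every mirror has
completed cleanly. A standalone sketch with invented names (the real
code, shown in the next message, does this with buffer heads and
kernel bitops):

    struct write_set {
            int remaining;  /* mirror writes still in flight */
            int acked;      /* user already acknowledged? */
            int failed;     /* any mirror errored? */
    };

    static void submit_mirrored_write(unsigned char *journal, long block,
                                      struct write_set *w, int nmirrors)
    {
            journal[block >> 3] |= 1 << (block & 7);   /* mark first */
            w->remaining = nmirrors;
            w->acked = 0;
            w->failed = 0;
            /* ... then fire the write at each mirror ... */
    }

    static void one_mirror_done(unsigned char *journal, long block,
                                struct write_set *w, int ok)
    {
            if (ok && !w->acked) {
                    w->acked = 1;  /* report success after ONE completion */
                    /* ... run the user's original end_io here ... */
            }
            if (!ok)
                    w->failed = 1; /* the array will fault this component */
            if (--w->remaining == 0 && !w->failed)
                    /* every mirror holds the data; the journal bit can go */
                    journal[block >> 3] &= ~(1 << (block & 7));
            /* on failure the bit stays set, so the block gets resynced
             * when the faulted component is repaired */
    }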
* Re: Fast (intelligent) raid1
@ 2003-02-22 22:51 Peter T. Breuer
0 siblings, 0 replies; 19+ messages in thread
From: Peter T. Breuer @ 2003-02-22 22:51 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, mingo
I'll set out here the basic couple of changes needed in the raid1
driver in order to allow async writes, once one has a bitmap. It's
a short trick, but I'd like somebody to tell me if the accounting
is in the right places still ...
The basic idea is that in an ordinary write, we mark the bitmap
before the write, and clear the bitmap in the last end_io of the
mirrored set of end_ios. So if we don't write all the mirrors,
well, it's because one errored, which means the array will fault it,
and our bitmap will be dirty for that block, and the block will
be resynced when we put in a new disk component.
So we can afford to ack back to the user on the FIRST end_io of a
mirrored set of writes, not the last. The remaining i/os are async.
The first change is in the final end_io function. It no longer
unconditionally acks the user i/o; it acks only if nobody else has
done it yet.
	struct buffer_head *bh = r1_bh->master_bh;
	raid1_conf_t * conf = mddev_to_conf(r1_bh->mddev);

+	if (r1_bh->cmd == WRITE && !test_and_set_bit(R1BH_AsyncPhase, &r1_bh->state)) {
+
 	io_request_done(bh->b_rsector, conf,
 			test_bit(R1BH_SyncPhase, &r1_bh->state));
 	bh->b_end_io(bh, uptodate);
+	}
I think that io_request_done is alright there. I assume it's
accounting. The whole lot of the stuff which has been if'ed above can
now be done in an ordinary (nonfinal) end_io (raid1_end_request)
instead. I just added an if on whether or not the i/o is successful,
with the whole lot above inside. So it will result in a user ack on
the first i/o of a set, not the last.
	/*
	 * WRITE:
	 *
	 * Let's see if all mirrored write operations have finished
	 * already.
	 */
+ if (uptodate && !test_and_set_bit(R1BH_AsyncPhase, &r1_bh->state)) {
+ struct buffer_head *bh = r1_bh->master_bh;
+ raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
+
+ io_request_done(bh->b_rsector, conf,
+ test_bit(R1BH_SyncPhase, &r1_bh->state));
+ bh->b_end_io(bh, uptodate);
+ }
 	if (atomic_dec_and_test(&r1_bh->remaining))
 		raid1_end_bh_io(r1_bh, test_bit(R1BH_Uptodate, &r1_bh->state));
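The test_and_set_bit is the load-bearing part here: two mirror end_ios
can race, and exactly one of them must run the user's end_io. A
standalone analogue, with C11 atomics standing in for the kernel
bitops (names invented):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* However many completions race on this flag, exactly one caller
     * sees "true" and may run the user's end_io. */
    static atomic_flag acked = ATOMIC_FLAG_INIT;

    static bool claim_ack(void)
    {
            return !atomic_flag_test_and_set(&acked);
    }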
Peter