* Questions answered by Neil Brown
From: Peter T. Breuer @ 2003-02-24 20:15 UTC
To: linux-raid

Here's part of an offline conversation with Neil Brown. The answers should
go to the list to be archived, so I'm passing them on ...

----- Forwarded message from Neil Brown -----

On Monday February 24, ptb@it.uc3m.es wrote:
> 1) when is the events count written to the component disks'
>    superblocks, and how can I force an update (at intervals)?

Whenever md_update_sb is called, which is normally:
    when the array starts
    when it stops
    when a drive fails
    when a drive is added
    when a drive is removed.

Just set mddev->sb_dirty and maybe kick the raid thread.

>    I ask because I have seen a case where a disk that I know had been
>    taken out properly was rejected on reinsert as a replacement
>    candidate (bitmapped resync) because its event count was too old
>    compared to the one on the bitmap that supposedly was set up when
>    it was marked faulty, and stamped on the first attempted write
>    afterwards.

I would have to look at the code.

> 2) how can I know how many faulty or removed disks there currently
>    are in the array? I need to mark the bitmap on write as soon as
>    there are any. At the moment I search a certain range in the array
>    for disks marked nonoperational.

The write fork of raid1_make_request iterates through all devices and
ignores the non-operational ones. When you come to the end of the list,
and before you actually submit the write request, you should know if
there are any faulty devices, so you should know what to do with the
bitmap. Is that a suitable answer?

> 3) it is not clear to me that I am doing accounting right on async
>    writes (or indeed when resyncing and skipping blocks).
>
>    The last end_io always did
>
>        io_request_done(bh->b_rsector, conf,
>                test_bit(R1BH_SyncPhase, &r1_bh->state));
>
>    and now this can be done on the first end_io (along with the
>    bh->b_end_io(bh, uptodate);) in an async write. Is that right?

It's not enough. Once you have called bh->b_end_io, you cannot touch
that bh ever again, as it might not even exist.

Also, io_request_done is used to synchronise io requests with resync
requests so they don't tread on each other's feet. io_request_done must
come after all the io is complete, even if you find some way of calling
bh->b_end_io sooner.

>    In resync I simulate having done n sectors by
>
>        md_sync_acct(mirror->dev, n);
>        sync_request_done(sector_nr, conf);
>        md_done_sync(mddev, n, 1);
>        wake_up(&conf->wait_ready);
>
>    Is that right?

It looks good. I would have to spend a little while staring at the code
to be sure, but it is close enough that if it seems to work, then it is
probably right.

NeilBrown

----- End of forwarded message from Neil Brown -----
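A minimal sketch of the "force an update" recipe above (the field and
thread names are assumed from the 2.4 tree and are not guaranteed
exact):

        /* ask md to rewrite the component superblocks soon, rather
         * than calling md_update_sb ourselves; assumes the 2.4-era
         * mddev->sb_dirty flag and raid1's conf->thread */
        mddev->sb_dirty = 1;
        md_wakeup_thread(conf->thread);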
* Re: Questions answered by Neil Brown
From: Paul Clements @ 2003-02-24 21:58 UTC
To: ptb; Cc: linux-raid, Neil Brown

"Peter T. Breuer" wrote:

> > 3) it is not clear to me that I am doing accounting right on async
> >    writes (or indeed when resyncing and skipping blocks).
> >
> >    The last end_io always did
> >        io_request_done(bh->b_rsector, conf,
> >                test_bit(R1BH_SyncPhase, &r1_bh->state));
> >    and now this can be done on the first end_io (along with the
> >    bh->b_end_io(bh, uptodate);) in an async write. Is that right?

> Once you have called bh->b_end_io, you cannot touch that bh ever
> again, as it might not even exist.

So I think this means that we really need to duplicate the buffer memory
(bh->b_page, bh->b_data) and point the mirror I/Os to the duplicated
buffer, so we can allow the original (master_bh) to be freed before all
the I/Os complete. We then free the duplicate when the last mirror I/O
completes, right?

-- 
Paul
* Re: Questions answered by Neil Brown
From: Neil Brown @ 2003-02-25 3:10 UTC
To: Paul Clements; Cc: ptb, linux-raid

On Monday February 24, Paul.Clements@SteelEye.com wrote:
>
> So I think this means that we really need to duplicate the buffer memory
> (bh->b_page, bh->b_data) and point the mirror I/Os to the duplicated
> buffer, so we can allow the original (master_bh) to be freed before all
> the I/Os complete. We then free the duplicate when the last mirror I/O
> completes, right?

Yes, that is right.

NeilBrown
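A sketch of the duplication being agreed here; the variable names and
allocation strategy are illustrative only (the real bh would come from
raid1's pre-allocated pool, and error handling is trimmed):

        /* copy the master's data into a private lowmem page so the
         * master can be completed before the mirror writes finish */
        struct page *page = alloc_page(GFP_NOIO);

        memcpy(page_address(page), master_bh->b_data, master_bh->b_size);
        mbh->b_page    = page;                  /* mbh: per-mirror bh */
        mbh->b_data    = page_address(page);
        mbh->b_size    = master_bh->b_size;
        mbh->b_rsector = master_bh->b_rsector;
        /* b_dev, b_blocknr and b_end_io are set as raid1_make_request
         * already does; __free_page(page) when the last mirror write
         * completes */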
* Re: Questions answered by Neil Brown
From: Peter T. Breuer @ 2003-02-25 9:11 UTC
To: Neil Brown; Cc: Paul Clements, ptb, linux-raid

"A month of sundays ago Neil Brown wrote:"
> On Monday February 24, Paul.Clements@SteelEye.com wrote:
> >
> > So I think this means that we really need to duplicate the buffer memory
> > (bh->b_page, bh->b_data) and point the mirror I/Os to the duplicated
> > buffer, so we can allow the original (master_bh) to be freed before all
> > the I/Os complete. We then free the duplicate when the last mirror I/O
> > completes, right?
>
> Yes, that is right.

It's early in the morning (for me!), but I believe that page data at
least can be encouraged to stay around by hiking a reference counter or
equivalent. It really depends how that b_end_io() got hold of its buffer
and what it does with it on completion.

The end_buffer_io_async() fn in buffer.c will lock the page and walk
through the bh chain hanging off bh->b_this_page (->b_this_page ...) to
see if those other bh's have completed before marking the page up to
date and unlocking it.

So it might be enough to chain all the mirror bh's through
bh->b_this_page. I believe that currently this field is just set to "1"
in raid1_make_request().

Peter
* Re: Questions answered by Neil Brown
From: Paul Clements @ 2003-02-26 7:44 UTC
To: ptb; Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
>
> "A month of sundays ago Neil Brown wrote:"
> > On Monday February 24, Paul.Clements@SteelEye.com wrote:
>
> So it might be enough to chain all the mirror bh's through
> bh->b_this_page.

That's an interesting idea. I looked through the code and I have some
questions:

What if the user is waiting on a page and not a buffer (not sure if that
can/will happen). In that case, we'd be artificially causing him to wait
when it wasn't necessary. Suppose all the I/O for a page really was
complete, but we kept the user waiting until all the mirror I/Os
(including ones to backup devices) for that page had completed.

Another thing I'm not sure about is whether it's safe for raid1 to
modify the b_this_page field (for a buffer that was passed in from
above)...we'd at least have to insert our values into the existing list.
Is it safe to modify the list without any locks held?

> I believe that currently this field is just set to "1" in
> raid1_make_request().

Yeah, I sure wish I knew who did that and why. I wonder if someone had a
clever plan to use that field at some point, but never got around to it.
Setting that field to something besides a real address sure does seem
odd...and I can't see that it's ever used anywhere.

-- 
Paul
* Re: Questions answered by Neil Brown
From: Peter T. Breuer @ 2003-02-26 8:09 UTC
To: Paul Clements; Cc: ptb, Neil Brown, linux-raid

"A month of sundays ago Paul Clements wrote:"
> "Peter T. Breuer" wrote:
> > "A month of sundays ago Neil Brown wrote:"
> > > On Monday February 24, Paul.Clements@SteelEye.com wrote:
> > So it might be enough to chain all the mirror bh's through
> > bh->b_this_page.
>
> What if the user is waiting on a page and not a buffer (not sure if that
> can/will happen). In that case, we'd be artificially causing him to wait

I'm not the person to answer that kind of question. If I understood the
VM I wouldn't be asking questions myself!

> when it wasn't necessary. Suppose all the I/O for a page really was
> complete, but we kept the user waiting until all the mirror I/Os
> (including ones to backup devices) for that page had completed.

You're one up in the abstraction hierarchy from me ... I am just seeing
the concrete b_end_io codes and what they might do. I can offer no
sensible comment.

> Another thing I'm not sure about is whether it's safe for raid1 to
> modify the b_this_page field (for a buffer that was passed in from
> above)...we'd at least have to insert our values into the existing list.
> Is it safe to modify the list without any locks held?

I'm sure it is. We modify it anyway, to write the 1s. And we do it
before submitting any of the bh's involved.

> > I believe that currently this field is just set to "1" in
> > raid1_make_request().
>
> Yeah, I sure wish I knew who did that and why. I wonder if someone had a

A check of the buffer.c code shows that b_this_page is always tested
against real bh addresses in any conditionals in which it appears. It
appears to potentially be a circular list, and they're testing for
having got back to the beginning (or having a trivial circle).

What I want to know is how "1" ever worked. It can't! Not in the
buffer.c code, anyway. That's end_buffer_io_async and discard_bh_page
and __block_write_full_page and __block_prepare_write and
__block_commit_write and block_read_full_page and ... well, lots. All
of these go chasing through that field, trusting it to be the address
of a bh.

> clever plan to use that field at some point, but never got around to it.
> Setting that field to something besides a real address sure does seem
> odd...and I can't see that it's ever used anywhere.

What is also puzzling me is that despite the horrible potential for
what might happen from doing the original user's end_io early, I can't
see any consequences in actual tests!

I am writing stuff to a file on a raid1 mirror mount, then dismantling
the mount, the device, and rmmoding the driver. Then I put it all back
again. On remount the data in the file is all perfect. Yet it was built
with plenty of async writes! Surely the buffers in some of those writes
should have been pointing nowhere and been full of rubbish?

  ...
  raid1: async end i/o on sectors 494-495
  raid1: async end i/o on sectors 496-497
  raid1: async end i/o on sectors 498-499
  raid1: async end i/o on sectors 500-501
  raid1: async end i/o on sectors 502-503
  raid1: async end i/o on sectors 504-505
  raid1: async end i/o on sectors 506-507
  raid1: async end i/o on sectors 508-509
  raid1: async end i/o on sectors 510-511
  raid1: async end i/o on sectors 512-513
  ...

Got an idea for a test that might show up corruption accruing from the
suspect mechanism?

Peter
* Re: Questions answered by Neil Brown
From: Paul Clements @ 2003-02-26 16:41 UTC
To: ptb; Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
> > Yeah, I sure wish I knew who did that and why. I wonder if someone had a
>
> A check of the buffer.c code shows that b_this_page is always tested against

Right, but note that when raid1 is setting those 1s, it's in the bh's
that it's sending down to lower level devices, not in the bh's that came
in from above...a subtle distinction. So these 1s would be consumed only
by raid1's end_io routines. Any changes we make to b_this_page in the
"master_bh" will have to be handled correctly by the end_io of the upper
level code (stuff in buffer.c).

> > clever plan to use that field at some point, but never got around to it.
> > Setting that field to something besides a real address sure does seem
> > odd...and I can't see that it's ever used anywhere.
>
> What is also puzzling me is that despite the horrible potential for
> what might happen from doing the original user's end_io early, I
> can't see any consequences in actual tests!

Probably because the timing is so close...if we were to delay the
completion of I/O to one of the devices by several seconds, say, I
believe we'd see some really bad things happen. Another thing that
probably has to coincide with the I/O delays is memory pressure;
otherwise I think the system will just end up keeping the buffers cached
forever (OK, a really long time...) and nothing bad will happen.

One thought I had for a test (when I get to the point of really
rigorously testing this stuff :)) is to set up an nbd-client/server pair
and insert a sleep into the nbd-server so that completion of I/O is
_always_ delayed by some period...(this will also help in performance
testing, to see how much benefit we get from doing async writes with
high latency).

BTW, I'm working on the code to duplicate the bh (and its memory buffer)
right now. It's basically coded, but not tested. I've based it off your
2.5 code. I'm also working on a simple queueing mechanism (to queue
write requests to backup devices). This will allow us to adjust the bit
to block ratio of the bitmap (intent log) to save disk space and memory.
This mechanism will also be needed if we want to increase the degree of
asynchronicity of the writes (we could just queue all writes and deal
with them later, perhaps in batches).
* Re: Questions answered by Neil Brown
From: Peter T. Breuer @ 2003-02-26 17:26 UTC
To: Paul Clements; Cc: ptb, Neil Brown, linux-raid

"Paul Clements wrote:"
> Right, but note that when raid1 is setting those 1s, it's in the bh's
> that it's sending down to lower level devices, not in the bh's that came
> in from above...a subtle distinction. So these 1s would be consumed only

Too subtle for me.

> by raid1's end_io routines. Any changes we make to b_this_page in the

Umm ... err, I'm not at all sure ... you mean that their b_end_io is
something we implanted? Yes, I'll go along with that.

OK - here's the story. The bh's that we set the this_page field to 1 on
come from raid1_alloc_bh. Normally these come from a pool of
preallocated bh's that stick around in the driver. The b_end_io that we
give them is the raid1_end_request function, which is the common or
garden end_io that does /nothing/. Only on the last of the bh's do we
run raid1_end_bh_io, which calls raid1_free_r1bh on the /master/.
Surprise: raid1_free_r1bh does nothing except call raid1_free_bh on the
list of all the associated bh's that's been sitting chained onto the
mirror_bh_list field of the master, and then raid1_free_bh runs through
the list and calls kmem_cache_free on each of them.

I see the this_page field used nowhere in this. I can't follow what the
raid1_alloc_bh thing does, so I have an info leak here. I don't know how
the preallocated bh cache grows or shrinks under pressure.

Anyway, if we run raid1_end_bh_io early, we will free up bh's and they
may get reused before we have a chance to complete them. So that's a
problem. Uh - we'll have to stop raid1_end_bh_io calling raid1_free_bh
and instead get each individual bh to unchain itself from the linked
list and free itself. Under some lock, of course.

Where in all this do we ack the original end_io on the md array? I
don't see it!

> "master_bh" will have to be handled correctly by the end_io of the upper
> level code (stuff in buffer.c).

> Probably because the timing is so close...if we were to delay the
> completion of I/O to one of the devices by several seconds, say, I
> believe we'd see some really bad things happen. Another thing that
> probably has to coincide with the I/O delays is memory pressure;
> otherwise I think the system will just end up keeping the buffers cached
> forever (OK, a really long time...) and nothing bad will happen.

I don't know how many bh's there are in the preallocated pool. If there
were more than 128 it's fairly certain that they could satisfy all
requests outstanding at any one time. Especially if freed requests go to
the back of the free list (do they?).

> One thought I had for a test (when I get to the point of really
> rigorously testing this stuff :)) is to set up an nbd-client/server pair
> and insert a sleep into the nbd-server so that completion of I/O is
> _always_ delayed by some period...(this will also help in performance

Possibly. My enbd will time out.

> testing, to see how much benefit we get from doing async writes with
> high latency).
>
> BTW, I'm working on the code to duplicate the bh (and its memory buffer)
> right now. It's basically coded, but not tested. I've based it off your
> 2.5 code. I'm also working on a simple queueing mechanism (to queue
> write requests to backup devices). This will allow us to adjust the bit
> to block ratio of the bitmap (intent log) to save disk space and memory.

Don't worry about that - that's not necessary, I think. The bitmap is
already lazy about creating pages for itself. But yes, it needs to
maintain a count of dirty bits per bitmap page, and when the count drops
to zero it needs to free the page. I can do that if you like?

> This mechanism will also be needed if we want to increase the degree of
> asynchronicity of the writes (we could just queue all writes and deal
> with them later, perhaps in batches).

Yes.

Peter
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Paul Clements @ 2003-02-26 18:29 UTC
To: ptb; Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
> "Paul Clements wrote:"
> > BTW, I'm working on the code to duplicate the bh (and its memory buffer)
> > right now. It's basically coded, but not tested. I've based it off your
> > 2.5 code. I'm also working on a simple queueing mechanism (to queue
> > write requests to backup devices). This will allow us to adjust the bit
> > to block ratio of the bitmap (intent log) to save disk space and memory.
>
> Don't worry about that - that's not necessary, I think. The bitmap is
> already lazy about creating pages for itself. But yes, it needs to
> maintain a count of dirty bits per bitmap page, and when the count drops
> to zero it needs to free the page. I can do that if you like?

Yes, the on-demand page freeing sounds like a good idea. If you don't
have that, I think the bitmap eventually grows to maximum size over
time...

As far as the bit to block ratio, some numbers:

  1 bit/64kB @ 1TB -> 2MB maximum bitmap size
  1 bit/512B @ 1TB -> 256MB maximum bitmap size

So I think that if we want to be sure that this will scale, we'll want
the bit to block ratio to be adjustable. Another benefit of having a
large bitmap block size is that it limits the frequency of the disk
writes required to sync the bitmap to disk (1 sync per 128 sectors
written vs. 1 per sector, in the example above).

-- 
Paul
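Spelling out the arithmetic behind those numbers (binary units
assumed):

        bitmap bytes = device size / (granularity * 8 bits per byte)

        2^40 / (2^16 * 8) = 2^21 bytes = 2MB      (1 bit per 64kB)
        2^40 / (2^9  * 8) = 2^28 bytes = 256MB    (1 bit per 512B)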
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Peter T. Breuer @ 2003-02-26 19:15 UTC
To: Paul Clements; Cc: ptb, Neil Brown, linux-raid

"A month of sundays ago Paul Clements wrote:"
> "Peter T. Breuer" wrote:
> > Don't worry about that - that's not necessary, I think. The bitmap is
> > already lazy about creating pages for itself. But yes, it needs to
> > maintain a count of dirty bits per bitmap page, and when the count drops
> > to zero it needs to free the page. I can do that if you like?
>
> Yes, the on-demand page freeing sounds like a good idea. If you don't
> have that, I think the bitmap eventually grows to maximum size over
> time...

Of course. But nobody's complained yet :-) The bitmap is OO. It's easy
to do changes like that. I'll have to test the map before setting the
bit in order to count the number of dirty bits. That's all (the ops are
done under lock).

> As far as the bit to block ratio, some numbers:
>
>   1 bit/64kB @ 1TB -> 2MB maximum bitmap size
>   1 bit/512B @ 1TB -> 256MB maximum bitmap size

Yes, I understand this. But I simply do not believe it is worth the pain
initially. The raid code already disallows all blocksizes apart from 1KB
and so I only allowed one bit per block! That seems fair.

Oversizing the dirty bit to cover more than the minimal transaction unit
means that translations have to be made all over the place. Well, OK, it
could be done by simply calling bitmap->setbit(bitmap, block >> 8)
instead of bitmap->setbit(bitmap, block), and the same for testbit, but
cannot this be left for some other time?

> So, I think that if we want to be sure that this will scale, we'll want
> the bit to block ratio to be adjustable. Another benefit of having a

Well, the simple thing is to change the bitmap calls in the patched
raid1.c to have block >> scale in them instead of just block, and then
set scale as a module parameter.

> large bitmap block size is that it limits the frequency of the disk
> writes required to sync the bitmap to disk (1 sync per 128 sectors
> written vs. 1 per sector, in the example above).

But I don't think this really matters. Kernel request aggregation to the
underlying devices will smooth out everything anyhow. My instinct is to
leave things be for the moment.

Peter
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Neil Brown @ 2003-02-26 22:12 UTC
To: ptb; Cc: Paul Clements, linux-raid

Some observations on the idea of an intent-logging bitmap.

1/ You don't want or need very fine granularity. The value of this is
   to speed up resync time. Currently it is limited by drive bandwidth.
   If you have lots of little updates due to fine granularity, you will
   be limited by seek time.

   One possibly reasonable approach would be to pick a granularity such
   that it takes about as long to read or write a chunk as it would to
   seek to the next one. This probably means a few hundred kilobytes,
   i.e. one bit in the bitmap for every hundred K. This would require a
   125K bitmap for a 100Gig drive.

   Another possibility would be a fixed size bitmap - say 4K or 8K. An
   8K bitmap used to map a 100Gig drive would be 1.5Meg per bit. This
   may seem biggish, but if your bitmap were sparse, resync would still
   be much much faster, and if it were dense, having a finer grain in
   the bitmap isn't going to speed things up much.

2/ You cannot allocate the bitmap on demand. Demand happens when you
   are writing data out, and when writing data out due to high memory
   pressure, kmalloc *will* fail. Relying on kmalloc in the write path
   is BAD. That is why we have mempools which pre-allocate. For the
   bitmap, you simply need to pre-allocate everything.

3/ Internally, you need to store a counter for each 'chunk' (we need a
   better word; this is different from the raid chunk size, this is the
   amount of space that each bit refers to). The counter is needed so
   you know when the bit can be cleared. This too must be pre-allocated
   and so further limits the size of your bitmap.

   16 bit counters would use less ram and would allow 33553920 bytes
   (65535 sectors) per 'chunk' which, with an 8K bitmap, puts an upper
   limit of 2 terabytes per device, which I think is adequate. (That's
   per physical device, not per raid array.) Or you could just use
   32 bit counters.

4/ I would use device plugging to help reduce the number of times you
   have to write the intent bitmap. When a write comes in, you set the
   bit in the bitmap, queue the write on a list of 'plugged' requests,
   and mark the device as 'plugged'. The device will eventually be
   unplugged, at which point you write out the bitmap, then release all
   the requests to the lower devices.

   You could optimise this a bit, and not bother plugging the device if
   it wasn't already plugged and the request only affected bits that
   were already set.

NeilBrown
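A rough sketch of the per-chunk counter in point 3; the types, names
and locking are invented for illustration and are not taken from any
existing patch:

        /* one counter per chunk: bumped before each write to the
         * chunk, dropped when the write completes; the intent bit for
         * a chunk may only be cleared once its counter is back to 0 */
        static void chunk_write_start(conf_t *conf, unsigned long sector)
        {
                unsigned long chunk = sector >> conf->chunk_shift;

                spin_lock_irq(&conf->bitmap_lock);
                if (conf->chunk_count[chunk]++ == 0)
                        intent_set_bit(conf, chunk);   /* schedule bitmap write */
                spin_unlock_irq(&conf->bitmap_lock);
        }

        static void chunk_write_end(conf_t *conf, unsigned long sector)
        {
                unsigned long chunk = sector >> conf->chunk_shift;

                spin_lock_irq(&conf->bitmap_lock);
                if (--conf->chunk_count[chunk] == 0)
                        intent_clear_bit(conf, chunk); /* may be done lazily */
                spin_unlock_irq(&conf->bitmap_lock);
        }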
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Peter T. Breuer @ 2003-02-26 23:24 UTC
To: Neil Brown; Cc: ptb, Paul Clements, linux-raid

Hi!

"A month of sundays ago Neil Brown wrote:"
> 1/ You don't want or need very fine granularity. The value of this is
>    to speed up resync time. Currently it is limited by drive

The map is also used to "journal" async writes. So I think that makes it
necessary to have granularity the same size as a write. Otherwise we
cannot meaningfully mark the map clean again when the async write
completes.

>    bandwidth. If you have lots of little updates due to fine
>    granularity, you will be limited by seek time.

While your observation is undoubtedly correct, since the resync is
monotonic the seeks will all be in the same direction. That cannot be
any slower than actually traversing while writing, can it? And IF it is
the case that it is slower, we can always do "writeahead" in the resync
- that is, when we see a block is marked dirty in the resync, resync it
AND some number of following blocks, then look again. We can even look
at the bitmap fairly carefully and decide what strategy to use in
resync, part by part.

I've now (earlier this evening) altered the bitmap so that it accounts
the number of dirty bits per bitmap page (i.e. how many dirty blocks per
4096 blocks at the moment) and if it's more than some proportion we can
write the whole lot.

>    such that it takes about as long to read or write a chunk as it
>    would to seek to the next one. This probably means a few hundred
>    kilobytes, i.e. one bit in the bitmap for every hundred K. This
>    would require a 125K bitmap for a 100Gig drive.
>
>    Another possibility would be a fixed size bitmap - say 4K or 8K. An
>    8K bitmap used to map a 100Gig drive would be 1.5Meg per bit. This
>    may seem biggish, but if your bitmap were sparse, resync would still
>    be much much faster, and if it were dense, having a finer grain in
>    the bitmap isn't going to speed things up much.

The objection that it couldn't be used for journalling holds, I think.

> 2/ You cannot allocate the bitmap on demand. Demand happens when you
>    are writing data out, and when writing

That was taken account of in the bitmap code design. If the kmalloc
fails for a bitmap page, it will write the address of the page as "1" in
the page array, which the bitmap will then understand as "all 4096
blocks dirty" during resync. Ummm ... yes, it clears the whole thing
correctly after the resync too.

>    data out due to high memory pressure, kmalloc *will* fail.
>    Relying on kmalloc in the write path is BAD. That is why we have

I didn't :-)

>    mempools which pre-allocate. For the bitmap, you simply need to
>    pre-allocate everything.

I think we can breathe easy here - the bitmap code is designed to fail
correctly, so that the semantics is appropriate. It marks a page that
can't be obtained via kmalloc as all-dirty by setting its address as 1.

Curiously enough, I'm slightly more nonplussed by the problem of
kfreeing the bitmap pages when their dirty count drops to zero. I can
foresee that when journalling we will go 0 1 0 1 0 1 in terms of the
number of bits dirty in the map, and if we kfree after the count drops
to zero each time, and kmalloc when we should set the count to 1, then
we will be held up. And maybe sleep when we shouldn't - the bitmap lock
is held for some ops. Needs checking.

What should I do? Maintain a how-many-times-we-have-wanted-to-free-this
page count and only free it on the 10th attempt?

> 3/ Internally, you need to store a counter for each 'chunk' (we need a
>    better word; this is different from the raid chunk size, this is the
>    amount of space that each bit refers to). The counter is needed so
>    you know when the bit can be cleared. This too must be pre-allocated
>    and so further limits the size of your bitmap.

I'm not sure what you mean here. At the moment (since earlier this
evening) there IS a counter for each page in the bitmap saying how many
bits in it are dirty. Oh - I see, you are considering the case when
there is more than one block per bit. Really - I think it's too
complicated to fly, at least for the moment.

I'm really not worried about the memory used by the bitmap! If you would
fix md so it could have a 4K blocksize we'd probably gain more. But that
code has bs=1K and sectors/blk=2 assumptions all over!

Currently, the max blocks in an md device is 2^31. At one bit per block
and 8 bits per byte that's 2^28 bytes of bitmap, or 256MB, if it were to
become completely full. I am sure that people who have 1TB of disk can
afford 256MB of ram to service it with.

>    16 bit counters would use less ram and would allow 33553920 bytes
>    (65535 sectors) per 'chunk' which, with an 8K bitmap, puts an upper
>    limit of 2 terabytes per device, which I think is adequate. (That's
>    per physical device, not per raid array.)

The current counters (dirty bits per bitmap page) are 16 bit. Signed.

>    Or you could just use 32 bit counters.
>
> 4/ I would use device plugging to help reduce the number of times you
>    have to write the intent bitmap. When a write comes in, you set the
>    bit in the bitmap, queue the write on a list of 'plugged' requests,
>    and mark the device as 'plugged'. The device will eventually be
>    unplugged, at which point you write out the bitmap, then release all
>    the requests to the lower devices.

Hmmm ... that's interesting. But bitmap access was meant to be cheap.
Whether I succeeded is another question! If not, someone else can
rewrite.

>    You could optimise this a bit, and not bother plugging the device if
>    it wasn't already plugged and the request only affected bits that
>    were already set.

There are very good ideas indeed there. Thank you very much. I
appreciate it!

Peter
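Roughly what the degraded-mode test described above could look like;
the structure and field names are guesses, not the actual fr1 code:

        /* one pointer per bitmap page: NULL means never touched
         * (clean), (void *)1 means the kmalloc failed, so the whole
         * zone the page would have covered counts as dirty */
        static int bitmap_testbit(bitmap_t *bitmap, unsigned long block)
        {
                void *page = bitmap->pages[block >> bitmap->page_shift];

                if (page == NULL)
                        return 0;               /* clean               */
                if (page == (void *)1)
                        return 1;               /* degraded: all dirty */
                return test_bit(block & bitmap->page_mask,
                                (unsigned long *)page);
        }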
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Paul Clements @ 2003-02-27 7:26 UTC
To: ptb; Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
> Curiously enough, I'm slightly more nonplussed by the problem of
> kfreeing the bitmap pages when their dirty count drops to zero. I can
> foresee that when journalling we will go 0 1 0 1 0 1 in terms of the
> number of bits dirty in the map, and if we kfree after the count drops
> to zero each time, and kmalloc when we should set the count to 1, then
> we will be held up. And maybe sleep when we shouldn't - the bitmap lock
> is held for some ops. Needs checking.
>
> What should I do? Maintain a how-many-times-we-have-wanted-to-free-this
> page count and only free it on the 10th attempt?

Hmm...perhaps an LRU approach? You could store a timestamp (jiffies?),
so that you never deallocate a page unless it hasn't been used in X
amount of time...might be too heavyweight for what you're trying to do.

-- 
Paul
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Peter T. Breuer @ 2003-02-27 8:48 UTC
To: Paul Clements; Cc: ptb, Neil Brown, linux-raid

"Paul Clements wrote:"
> "Peter T. Breuer" wrote:
> > Curiously enough, I'm slightly more nonplussed by the problem of
> > kfreeing the bitmap pages when their dirty count drops to zero.
> > What should I do? Maintain a how-many-times-we-have-wanted-to-free-this
> > page count and only free it on the 10th attempt?
>
> Hmm...perhaps an LRU approach? You could store a timestamp (jiffies?),
> so that you never deallocate a page unless it hasn't been used in X
> amount of time...might be too heavyweight for what you're trying to do.

This is not silly, but is too heavyweight to do each time a bit is set.
The best thing to do is probably either or both of

  a) put the notionally freed pages on a free list and allocate from
     that list first when we want a new page

  b) do a periodic sweep for clean pages and free them.

Peter
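Option (a) as a sketch, with invented names: park notionally freed
bitmap pages on a small private stack and reuse them before asking
kmalloc again.

        struct bitmap_s {
                spinlock_t lock;
                void **pages;                   /* one pointer per zone */
                void *spare[BITMAP_MAX_SPARE];  /* recycled pages       */
                int nr_spare;
                /* ... */
        };

        /* called under bitmap->lock when a page's dirty count hits 0;
         * if the spare stack is full the page is simply left in place
         * for a later periodic sweep (option b) to reclaim */
        static void bitmap_retire_page(struct bitmap_s *bitmap,
                                       unsigned long zone)
        {
                if (bitmap->nr_spare >= BITMAP_MAX_SPARE)
                        return;
                bitmap->spare[bitmap->nr_spare++] = bitmap->pages[zone];
                bitmap->pages[zone] = NULL;
        }

        /* allocation tries the spare stack first; only if it is empty
         * does the caller fall back to kmalloc, outside the lock */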
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Paul Clements @ 2003-02-27 15:47 UTC
To: ptb; Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
>
> "Paul Clements wrote:"
> > "Peter T. Breuer" wrote:
> > > Curiously enough, I'm slightly more nonplussed by the problem of
> > > kfreeing the bitmap pages when their dirty count drops to zero.
> >
> > > What should I do? Maintain a how-many-times-we-have-wanted-to-free-this
> > > page count and only free it on the 10th attempt?
> >
> > Hmm...perhaps an LRU approach? You could store a timestamp (jiffies?),
> > so that you never deallocate a page unless it hasn't been used in X
> > amount of time...might be too heavyweight for what you're trying to do.
>
> This is not silly, but is too heavyweight to do each time a bit is set.

Yeah, I was thinking of things like an ext3 journal, where you really do
not want to _ever_ free the pages. With a simple counter, I don't think
there's any way to get that type of behavior. Of course, if you keep a
pool of pre-allocated pages, the penalty for deallocation/reallocation
becomes a lot lower.

-- 
Paul
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Paul Clements @ 2003-02-27 5:33 UTC
To: Neil Brown; Cc: ptb, linux-raid

Neil,

You've made some really good points and suggestions...thanks...

> 1/ You don't want or need very fine granularity. The value of this is
>    to speed up resync time. Currently it is limited by drive
>    bandwidth. If you have lots of little updates due to fine
>    granularity, you will be limited by seek time.

One more reason why an adjustable bit to block ratio would be nice :)
...

> 2/ You cannot allocate the bitmap on demand.

Hmm...that's a very good point. I had not really thought about that, but
you're right. Maybe there are some advantages to having a simple, flat,
pre-allocated bitmap...although I do really like Peter's two-level
on-demand allocation scheme. Maybe we could do partial pre-allocation,
using the pre-allocated pages when we're under pressure and kmalloc
fails, and doing on-demand allocation the rest of the time?

Another idea, expanding on what Peter has already done with marking the
pointer with an "address" of 1 if the kmalloc fails (this means that the
bitmap degrades from 1 bit/1k to 1 bit/4MB, which is not a terrible
thing, all in all): what if we were clever and used more than just one
bit when the allocation fails, so that the ratio could be kept more
reasonable (I'm thinking the 1 bit/4MB is OK now, but with a default
bit/block ratio that's much higher, it might get out of control)...maybe
something similar to the IS_ERR and ERR_PTR macros that are elsewhere in
the kernel? Those use 1000 values (I think...) that are known not to be
valid pointer values, as error values instead.

> 3/ Internally, you need to store a counter for each 'chunk' (we need a

Yes, in order to use a bit/block ratio other than 1 bit/1k, we need some
mechanism for tracking multiple pending writes to the same 'chunk'. One
way to do this is to keep a counter (say 16 or 32 bits) on each chunk,
rather than just a single bit. Another alternative might be to queue the
writes (using a hash on the 'chunk' number, for quick insertion and
deletion). This would tell us how many writes are pending for a given
'chunk', and allow us to clear the bit at the appropriate time (once the
last pending write for the 'chunk' had finished). This could be expanded
later to queueing entire requests (including data) so we could do full
transaction logging, so that a short network outage (how short depends
on how much $$ (sorry Peter, pesetas^Weuros) you want to lay out for RAM
:)) could be recovered from quickly, by just replaying the already
queued data.

> 4/ I would use device plugging to help reduce the number of times you
>    have to write the intent bitmap.

That's a good idea. I also think that with a large bit/block ratio, the
bitmap syncing will be fairly efficient, since the bitmap will only have
to be synced once per XXXk of data written.

-- 
Paul
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Peter T. Breuer @ 2003-02-27 10:35 UTC
To: Paul Clements; Cc: Neil Brown, ptb, linux-raid

"Paul Clements wrote:"
> > 2/ You cannot allocate the bitmap on demand.
>
> Hmm...that's a very good point. I had not really thought about that, but
> you're right. Maybe there are some advantages to having a simple, flat,
> pre-allocated bitmap...although I do really like Peter's two-level
> on-demand allocation scheme. Maybe we could do partial pre-allocation,
> using the pre-allocated pages when we're under pressure and kmalloc

I am sure that as the code evolves, schemes like this will be added to
it. But are we really sure that it's worth trying to beat the kernel's
memory management? All the calls here are for 4KB. Sure, we can maybe
always maintain a 5% reserve on standby (if the reserve drops to 1% then
ask for some more), and also maintain a free list (for first recourse)
that is swept every so often. But the kernel already does that kind of
management and balancing. What right do we have to say we're better than
the rest of the kernel?

I can up the priority with which we ask for memory, but it seems to me
that in times of memory pressure we /ought to fail/, but fail
gracefully. Currently we degrade to 4MB resolution, instead of 1KB.
That's all.

> fails, and doing on-demand allocation the rest of the time? Another
> idea, expanding on what Peter has already done with marking the pointer
> with an "address" of 1 if the kmalloc fails (this means that the bitmap
> degrades from 1 bit/1k to 1 bit/4MB, which is not a terrible thing, all
> in all): what if we were clever and used more than just one bit when
> the allocation fails, so that the ratio could be kept more reasonable
> (I'm thinking the 1 bit/4MB is OK now, but with a default bit/block ratio

This is OK as an idea. The sign bit can be used as a marker. Or perhaps
the top two bits, since I think kernel memory normally starts at an
address of 0xc.... Can anyone confirm that?

I'm thinking that if the top two bits of the address are low we can take
it that this is not an address, but a small bitmap in itself. Hmm ...
no, that's silly. We need a number of bits that is a power of 2. OK, so
the top 16 bits being zero lets us see that the bottom 16 bits are a
bitmap.

> that's much higher, it might get out of control)...maybe something
> similar to the IS_ERR and ERR_PTR macros that are elsewhere in the
> kernel? Those use 1000 values (I think...) that are known not to be
> valid pointer values, as error values instead.

If 16 bits are available instead of 4096 * 8 = 32K, then we get a
resolution of 4M/16 = 256K per bit instead of 1K per bit (or instead of
4M per bit). This is more reasonable. I'll look at the coding effort.

> > 3/ Internally, you need to store a counter for each 'chunk' (we need a
>
> mechanism for tracking multiple pending writes to the same 'chunk'. One

I really don't see the logic of this argument, so I'll risk your wrath
by saying so in public.

We cannot account for how many bits are really dirty in this chunk. If
you mark the same spot twice you will count 2 by your logic when only 1
should have been added to the count. Nor can you really clear a bit,
because by the same token, if we try to clear a bit that is already
notionally cleared, you will want to subtract 1 from the count, when 0
should really be subtracted.

In other words, you are making assumptions about the mode of use of the
bitmap interface that are difficult to express ... I think you need to
count in "sessions", and promise not to mark the same place twice per
session, never to clear anything that you have not already marked in the
session, not to close the session until you have cleared everything,
etc. ... but in any case these assumptions render the use of a bitmap as
a programming abstraction inappropriate. It's not really good for
maintenance.

> way to do this is to keep a counter (say 16 or 32 bits) on each chunk,
> rather than just a single bit. Another alternative might be to queue the
> writes (using a hash on the 'chunk' number, for quick insertion and
> deletion). This would tell us how many writes are pending for a given
> 'chunk', and allow us to clear the bit at the appropriate time (once the
> last pending write for the 'chunk' had finished). This could be expanded
> later to queueing entire requests (including data) so we could do full
> transaction logging, so that a short network outage (how short depends
> on how much $$ (sorry Peter, pesetas^Weuros) you want to lay out for RAM

I really don't grok the idea here. I am hung up on the fundamental
impossibility of accounting correctly in the situation where writes to
different places on disk will land on the same bit of the bitmap. That
breaks the logic so thoroughly for me that I cannot proceed from there!
Maybe I need to receive more $$!

> :)) could be recovered from quickly, by just replaying the already
> queued data.

Well, if you really want to store the requests instead of merely where
those requests are aimed, that's another story. But it still won't work.
State information that's needed for the accounting will vanish when
those requests complete. The bitmap cannot have worse granularity than
writes can have.

Peter
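For what it's worth, the top-16-bits-zero trick mooted above might look
something like this (illustrative only; it assumes i386, where kernel
pointers start at 0xc0000000, and 1KB blocks, so each of the 16 coarse
bits covers 256KB of a 4MB zone):

        void *p = bitmap->pages[zone];
        int dirty;

        if ((unsigned long)p < 0x10000) {
                /* not a page address: p is itself a 16-bit coarse map */
                dirty = ((unsigned long)p >> (block_in_zone >> 8)) & 1;
        } else {
                dirty = test_bit(block_in_zone, (unsigned long *)p);
        }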
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Peter T. Breuer @ 2003-02-27 10:50 UTC
To: ptb; Cc: Paul Clements, Neil Brown, linux-raid

"Peter T. Breuer wrote:"
> "Paul Clements wrote:"
> > fails, and doing on-demand allocation the rest of the time? Another
> > idea, expanding on what Peter has already done with marking the pointer
> > with an "address" of 1 if the kmalloc fails (this means that the bitmap
> > degrades from 1 bit/1k to 1 bit/4MB, which is not a terrible thing, all
> > in all): what if we were clever and used more than just one bit when the

I forgot to say that when the bitmap page fails to be allocated in order
that we may mark it, then an attempted mark also fails, and the
async-write code notices that and refuses to even try an async write,
but stays sync instead. So this fails safe.

I also forgot to say something else, but I forgot what it was ... ah
yes, it's that I did take care never to call kmalloc (or kfree) with the
bitmap spinlock held. So that if the kernel blocks and sleeps we don't
cause death. Instead we release the lock and regain it after the
kmalloc, and retest the conditions that made us want to do the kmalloc,
aborting if they have been satisfied by anyone else.

Indeed the kmalloc calls are only in the init function and in the
function that is supposed to check the bitmap page and install a page if
one is required, prior to further access. It all /looks/ under control.

Peter
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Paul Clements @ 2003-02-27 16:51 UTC
To: ptb; Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
> "Paul Clements wrote:"
> > > 3/ Internally, you need to store a counter for each 'chunk' (we need a
> > mechanism for tracking multiple pending writes to the same 'chunk'. One
>
> We cannot account for how many bits are really dirty in this chunk.

That's OK, we don't need to know. We really just want to know if there
are any outstanding writes for that chunk. We increment the chunk
counter for every write and decrement it when the write finishes.

Now, if the backup device fails, we'll bring it back online at some
point and do a (partial) resync, using the bitmap as a guide. For every
chunk, if the counter is non-zero, we have to do a full resync of every
block in that chunk. We then set the counter back to 0 only after we're
done resyncing the last block in that chunk.

So we really don't care how many blocks are dirty, just that there are
or are not dirty blocks in the chunk. Does that make sense?

-- 
Paul
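A sketch of the resync walk Paul describes, with invented names and no
locking, as a companion to the write-side counters sketched earlier in
the thread:

        /* any chunk with a non-zero counter (or a set intent bit) gets
         * every block in it copied to the returning backup device */
        for (chunk = 0; chunk < conf->nr_chunks; chunk++) {
                if (conf->chunk_count[chunk] == 0 &&
                    !intent_test_bit(conf, chunk))
                        continue;                 /* clean, skip it */

                resync_whole_chunk(conf, chunk);  /* every block in it */

                conf->chunk_count[chunk] = 0;     /* only after the last block */
                intent_clear_bit(conf, chunk);
        }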
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Peter T. Breuer @ 2003-02-27 17:18 UTC
To: Paul Clements; Cc: ptb, Neil Brown, linux-raid

"Paul Clements wrote:"
> "Peter T. Breuer" wrote:
> > "Paul Clements wrote:"
> > We cannot account for how many bits are really dirty in this chunk.
>
> That's OK, we don't need to know. We really just want to know if there
> are any outstanding writes for that chunk. We increment the chunk

OK. That's a fine semantics.

> So we really don't care how many blocks are dirty, just that there are
> or are not dirty blocks in the chunk. Does that make sense?

Well, yes. We would have tried to mark the bitmap before each write, and
I suppose we would have failed for out-of-memory in creating the bitmap
page, and now you are saying that, OK, let's just count outstanding
writes to that bitmap zone.

I suppose that if the write fails we don't decrement the count?
Otherwise we do? So when we do a mark we increment the count on the
bitmap zone, and when we do a clear we decrement the count, and when the
write fails we never try to clear on the bitmap? Then what we have is a
count of the imbalance between marks and clears on the bitmap per zone,
and you want the zone treated as dirty while that count is nonzero.

Yes, I can go with that.

Peter
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Peter T. Breuer @ 2003-02-28 15:25 UTC
To: Paul Clements; Cc: ptb, Neil Brown, linux-raid

"Paul Clements wrote:"
> That's OK, we don't need to know. We really just want to know if there
> are any outstanding writes for that chunk. We increment the chunk
> counter for every write and decrement it when the write finishes.

I agree. I modified the bitmap so that when a page can't be got to make
a mark on, then we count pending "writes" by adding one to a count for
every attempted write to the zone covered by the missing page and
subtracting one for every attempted clear. The bitmap reports "dirty" if
the count is positive and the page for that zone is absent.

I put the present code up at

  ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.6.tgz

I spent much of the morning checking that mark and clear attempts are
balanced unless the write does not succeed. They are. This means that
under memory pressure the bitmap not only fails only by losing precision
(it did before), but now it can lose less precision, dropping to 256K
instead of 4MB precision, although I haven't implemented the final
change to up the precision just yet. I made all the preparatory moves.
Since it's Friday I'll publish what I have.

Under normal circumstances the count for each zone of the bitmap records
the number of dirty bits. It's used to help free the bitmap pages at the
right time, to stop pages accumulating to the max. Shucks ... I have to
put them on a free list. I'll do that later too.

Peter
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Paul Clements @ 2003-02-28 16:14 UTC
To: ptb; Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
>
> "Paul Clements wrote:"
> > That's OK, we don't need to know. We really just want to know if there
> > are any outstanding writes for that chunk. We increment the chunk
> > counter for every write and decrement it when the write finishes.
>
> I agree. I modified the bitmap so that when a page can't be got to make
> a mark on, then we count pending "writes" by adding one to a count
> for every attempted write to the zone covered by the missing page
> and subtracting one for every attempted clear.

That sounds good. I'll have to merge up to 2.6...ugh...now I really need
to put this all in CVS... I'm making some progress with the mmap stuff,
and I've got the code done for duplicating the buffers for the async
writes to the backup devices. I'll need a pool of memory for that too,
since I can't just fail in the middle of I/O...

-- 
Paul
* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
From: Peter T. Breuer @ 2003-02-28 16:23 UTC
To: Paul Clements; Cc: ptb, Neil Brown, linux-raid

"Paul Clements wrote:"
> That sounds good. I'll have to merge up to 2.6...ugh...now I really need
> to put this all in CVS... I'm making some progress with the mmap stuff,
> and I've got the code done for duplicating the buffers for the async
> writes to the backup devices. I'll need a pool of memory for that too,
> since I can't just fail in the middle of I/O...

Well, you can. Failing to complete will leave the map marked, and we
could fault the device offline too. That will stop any more memory being
wasted, because nothing will go to it! We could then make periodic
attempts to bring it online again using the "hotrepair" trick and it'll
resync itself in the background.

Just an idea. Maybe not a good one, but an idea.

Peter
* Re: Questions answered by Neil Brown
From: Neil Brown @ 2003-02-26 21:45 UTC
To: ptb; Cc: Paul Clements, linux-raid

On Wednesday February 26, ptb@it.uc3m.es wrote:
>
> What is also puzzling me is that despite the horrible potential for
> what might happen from doing the original user's end_io early, I
> can't see any consequences in actual tests!
>
> I am writing stuff to a file on a raid1 mirror mount, then dismantling
> the mount, the device, and rmmoding the driver. Then I put it all back
> again. On remount the data in the file is all perfect. Yet it was built
> with plenty of async writes! Surely the buffers in some of those writes
> should have been pointing nowhere and been full of rubbish?

I suspect that mostly you are writing from a cache, and the data will
probably stay around.

To be able to demonstrate a problem you probably need very high memory
pressure so things don't stay in cache long, lots of metadata updates,
and probably some form of journalling filesystem like I mentioned
previously. Having very long latencies for the delayed write would also
make the problem more likely.

Even if you cannot demonstrate a problem, I'm sure one would be noticed
sooner or later if you released this sort of code into the wild and
people used it.

NeilBrown
* Re: Questions answered by Neil Brown
From: Neil Brown @ 2003-02-26 21:41 UTC
To: Paul Clements; Cc: ptb, linux-raid

On Wednesday February 26, Paul.Clements@SteelEye.com wrote:
> "Peter T. Breuer" wrote:
> >
> > "A month of sundays ago Neil Brown wrote:"
> > > On Monday February 24, Paul.Clements@SteelEye.com wrote:
> > So it might be enough to chain all the mirror bh's through
> > bh->b_this_page.

This sort of thing is very much a layering violation and should not be
done. It is similar in some ways to the design decisions in the 2.2 raid
patches which meant they were unsafe for swap or ext3 (at least while
rebuilding).

The current interface does *not* guarantee that the data will be around
after b_end_io is called, and does not provide any way to ask for the
data to be left around. You might be able to fake it in some cases, but
not all. A particular example that springs to mind is the descriptor
block used when writing out an ext3 journal entry. Once the block is
written, the descriptor block is of no interest to the kernel and will
very likely be re-used. You cannot stop this. You *have* to copy the
data.

> > I believe that currently this field is just set to "1" in
> > raid1_make_request().
>
> Yeah, I sure wish I knew who did that and why. I wonder if someone had a
> clever plan to use that field at some point, but never got around to it.
> Setting that field to something besides a real address sure does seem
> odd...and I can't see that it's ever used anywhere.

The bh which gets b_this_page set to 1 is a bh that is internal to
raid1. It is allocated by raid1 and never used by any filesystem or
buffer cache or anything. Thus raid1 can do whatever it pleases with
this field. No-one else should ever look at it.

I suspect it was set to one so that, if some coding error meant that the
buffer cache saw this buffer, then it would oops pretty quickly.

NeilBrown