Questions answered by Neil Brown

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Questions answered by Neil Brown
@ 2003-02-24 20:15 Peter T. Breuer
  2003-02-24 21:58 ` Paul Clements
  0 siblings, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-24 20:15 UTC (permalink / raw)
  To: linux-raid

Here's part of offline conversation with Neil Brown. The answers should
go to the list to be archived, so I'm passing them on ...

----- Forwarded message from Neil Brown -----
On Monday February 24, ptb@it.uc3m.es wrote:
> 1) when is the events count written to the component disks
>    superblocks, and how can I force an update (at intervals)?

Whenever md_update_sb is called, which is normally:
   When the array starts
   When it stops
   When a drive fails
   When a drive is added
   When a drive is removed.

Just set mddev->sb_dirty and maybe kick the raid thread.

> 
>    I ask because I have seen a case when a disk that I know
>    had been taken out properly was rejected on reinsert as
>    a replacement candidate (bitmapped resync) because its
>    event count was too old compared to the one on the bitmap
>    that supposedly was set up when it was marked faulty,
>    and stamped on first attempted write afterwards.

I would have to look at the code.

> 
> 2) how can I know how many faulty or removed disks there
>    currently are in the array? I need to mark the bitmap on
>    write as soon as there are any. At the moment I search
>    in a certain range in the array for disks marked nonoperational.

The write fork for raid1_make_request iterrates through all devices
and ignores the non-operational ones.  When you come to the end of the
list, and before you actually submit the write request, you should
know if there are any faulty devices, so you should know what to do
with the bitmap.

Is that a suitable answer?

> 
> 3) it is not clear to me that I am doing accounting right on
>    async writes (or indeed when resyncing and skipping blocks).
> 
>    The last end_io always did
>    io_request_done(bh->b_rsector, conf,
>                           test_bit(R1BH_SyncPhase, &r1_bh->state));
>    and now this can be done on the first end_io (along with the
>    bh->b_end_io(bh, uptodate);) in  an async write. Is that right?

It's not enough.
Once you have called bh->b_end_io, you cannot touch that bh every
again, as it might not even exist.

Also, io_request_done is used to synchronise io request with resync
requests so they don't tread on each other's feet.  io_request_done
must come after all the io is complete, even if you find some way of
calling bh->b_end_io sooner.

> 
>    in resync I simulate having done n sectors by
> 
>                    md_sync_acct(mirror->dev, n);
>                    sync_request_done(sector_nr, conf);
>                    md_done_sync(mddev, n, 1);
> 
>                     wake_up(&conf->wait_ready);
> 
>    is that right?

It looks good.  I would have to spend a little while staring at the
code to be sure, but it is close enough that if it seems to work, then
it is probably right.

NeilBrown

----- End of forwarded message from Neil Brown -----

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Questions answered by Neil Brown
  2003-02-24 20:15 Questions answered by Neil Brown Peter T. Breuer
@ 2003-02-24 21:58 ` Paul Clements
  2003-02-25  3:10   ` Neil Brown
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Clements @ 2003-02-24 21:58 UTC (permalink / raw)
  To: ptb; +Cc: linux-raid, Neil Brown

"Peter T. Breuer" wrote:
 
> >
> > 3) it is not clear to me that I am doing accounting right on
> >    async writes (or indeed when resyncing and skipping blocks).
> >
> >    The last end_io always did
> >    io_request_done(bh->b_rsector, conf,
> >                           test_bit(R1BH_SyncPhase, &r1_bh->state));
> >    and now this can be done on the first end_io (along with the
> >    bh->b_end_io(bh, uptodate);) in  an async write. Is that right?

> Once you have called bh->b_end_io, you cannot touch that bh every
> again, as it might not even exist.
 
So I think this means that we really need to duplicate the buffer memory
(bh->b_page, bh->b_data) and point the mirror I/Os to the duplicated
buffer, so we can allow the original (master_bh) to be freed before all
the I/Os complete. We then free the duplicate when the last mirror I/O
completes, right?

--
Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Questions answered by Neil Brown
  2003-02-24 21:58 ` Paul Clements
@ 2003-02-25  3:10   ` Neil Brown
  2003-02-25  9:11     ` Peter T. Breuer
  0 siblings, 1 reply; 27+ messages in thread
From: Neil Brown @ 2003-02-25  3:10 UTC (permalink / raw)
  To: Paul Clements; +Cc: ptb, linux-raid

On Monday February 24, Paul.Clements@SteelEye.com wrote:
>  
> So I think this means that we really need to duplicate the buffer memory
> (bh->b_page, bh->b_data) and point the mirror I/Os to the duplicated
> buffer, so we can allow the original (master_bh) to be freed before all
> the I/Os complete. We then free the duplicate when the last mirror I/O
> completes, right?


Yes, that is right.

NeilBrown

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Questions answered by Neil Brown
  2003-02-25  3:10   ` Neil Brown
@ 2003-02-25  9:11     ` Peter T. Breuer
  2003-02-26  7:44       ` Paul Clements
  0 siblings, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-25  9:11 UTC (permalink / raw)
  To: Neil Brown; +Cc: Paul Clements, ptb, linux-raid

"A month of sundays ago Neil Brown wrote:"
> On Monday February 24, Paul.Clements@SteelEye.com wrote:
> >  
> > So I think this means that we really need to duplicate the buffer memory
> > (bh->b_page, bh->b_data) and point the mirror I/Os to the duplicated
> > buffer, so we can allow the original (master_bh) to be freed before all
> > the I/Os complete. We then free the duplicate when the last mirror I/O
> > completes, right?
> 
> Yes, that is right.

It's early in the morning (for me!), but I believe that page data at
least can be encouraged to stay around by hiking a reference counter
or equivalent.

It really depends how that b_end_io() got hold of its buffer and what
it does with it on completion. The end_buffer_io() fn in buffer.c
will lock the page and walk through the bh chain hanging off
bh->b_this_page (->b_this_page ...) to see if those other bh's have
completed before marking the page up to date and unlocking it.

So it might be enough to chain all the mirror bh's through
bh->b_this_page.

I believe that currently this field is just set to "1" in
raid1_make_request().

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Questions answered by Neil Brown
  2003-02-25  9:11     ` Peter T. Breuer
@ 2003-02-26  7:44       ` Paul Clements
  2003-02-26  8:09         ` Peter T. Breuer
  2003-02-26 21:41         ` Neil Brown
  0 siblings, 2 replies; 27+ messages in thread
From: Paul Clements @ 2003-02-26  7:44 UTC (permalink / raw)
  To: ptb; +Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
> 
> "A month of sundays ago Neil Brown wrote:"
> > On Monday February 24, Paul.Clements@SteelEye.com wrote:

> So it might be enough to chain all the mirror bh's through
> bh->b_this_page.

That's an interesting idea. I looked through the code and I have some
questions:

What if the user is waiting on a page and not a buffer (not sure if that
can/will happen). In that case, we'd be artificially causing him to wait
when it wasn't necessary. Suppose all the I/O for a page really was
complete, but we kept the user waiting until all the mirror I/Os
(including ones to backup devices) for that page had completed. 

Another thing I'm not sure about is whether it's safe for raid1 to
modify the b_this_page field (for a buffer that was passed in from
above)...we'd at least have to insert our values into the existing list.
Is it safe to modify the list without any locks held?

> I believe that currently this field is just set to "1" in
> raid1_make_request().

Yeah, I sure wish I knew who did that and why. I wonder if someone had a
clever plan to use that field at some point, but never got around to it.
Setting that field to something besides a real address sure does seem
odd...and I can't see that it's ever used anywhere.

--
Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Questions answered by Neil Brown
  2003-02-26  7:44       ` Paul Clements
@ 2003-02-26  8:09         ` Peter T. Breuer
  2003-02-26 16:41           ` Paul Clements
  2003-02-26 21:45           ` Questions answered by Neil Brown Neil Brown
  2003-02-26 21:41         ` Neil Brown
  1 sibling, 2 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-26  8:09 UTC (permalink / raw)
  To: Paul Clements; +Cc: ptb, Neil Brown, linux-raid

"A month of sundays ago Paul Clements wrote:"
> "Peter T. Breuer" wrote:
> > "A month of sundays ago Neil Brown wrote:"
> > > On Monday February 24, Paul.Clements@SteelEye.com wrote:
> 
> > So it might be enough to chain all the mirror bh's through
> > bh->b_this_page.
> 

> What if the user is waiting on a page and not a buffer (not sure if that
> can/will happen). In that case, we'd be artificially causing him to wait

I'm not the person to answer that kind of question. If I understood the
VMS I wouldn't be asking questions myself!

> when it wasn't necessary. Suppose all the I/O for a page really was
> complete, but we kept the user waiting until all the mirror I/Os
> (including ones to backup devices) for that page had completed. 

You're one up in the abstraction hierarchy from me .. I just am seeing
the concrete b_end_io codes and what they might do. I can offer no
sensible comment.

> Another thing I'm not sure about is whether it's safe for raid1 to
> modify the b_this_page field (for a buffer that was passed in from
> above)...we'd at least have to insert our values into the existing list.
> Is it safe to modify the list without any locks held?

I'm sure it is. We modify it anyway, to write the 1s. And we do it
before submitting any of the bh's involved.

> > I believe that currently this field is just set to "1" in
> > raid1_make_request().
> 
> Yeah, I sure wish I knew who did that and why. I wonder if someone had a

A check of the buffer.c code shows b_this_page is always tested against
real bh addresses in any conditionals in which it appears.  It appears
to potentially be a circular list and they're testing for having got
back to the beginning (or having a trivial circle). What I want to know
is how "1" ever worked. It can't! Not in the buffer.c code, enayway.
That's end_buffer_io_async and discard_bh_page and
__block_write_full_page and __block_prepare_write and
__block_commit_write and  block_read_full_page and ... well, lots.
All of these go chasing through that field trusting it to be the
address of a bh.

> clever plan to use that field at some point, but never got around to it.
> Setting that field to something besides a real address sure does seem
> odd...and I can't see that it's ever used anywhere.

What is also puzzling me is that despite the horrible potential for
what might happen from doing the original users end_io early, I
can't see any consequences in actual tests!

I am writing stuff to a file on a raid1 mirror mount, then dismantling
the mount, the device, and rmmoding the driver. Then I put it all back
again. On remount the data in the file is all perfect. Yet it was built
with plenty of async writes! Surely the buffers in some of those writes
should have been pointing nowhere and been full of rubbish?

   ...
   raid1: async end i/o on sectors 494-495
   raid1: async end i/o on sectors 496-497
   raid1: async end i/o on sectors 498-499
   raid1: async end i/o on sectors 500-501
   raid1: async end i/o on sectors 502-503
   raid1: async end i/o on sectors 504-505
   raid1: async end i/o on sectors 506-507
   raid1: async end i/o on sectors 508-509
   raid1: async end i/o on sectors 510-511
   raid1: async end i/o on sectors 512-513
   ...

Got an idea for a test that might show up corruption accruing from the
suspect mechanism?

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Questions answered by Neil Brown
  2003-02-26  8:09         ` Peter T. Breuer
@ 2003-02-26 16:41           ` Paul Clements
  2003-02-26 17:26             ` Peter T. Breuer
  2003-02-26 21:45           ` Questions answered by Neil Brown Neil Brown
  1 sibling, 1 reply; 27+ messages in thread
From: Paul Clements @ 2003-02-26 16:41 UTC (permalink / raw)
  To: ptb; +Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:

> > Yeah, I sure wish I knew who did that and why. I wonder if someone had a
> 
> A check of the buffer.c code shows b_this_page is always tested against

Right, but note that when raid1 is setting those 1s, it's in the bh's
that it's sending down to lower level devices, not in the bh's that came
in from above...a subtle distinction. So these 1s would be consumed only
by raid1's end_io routines. Any changes we make to b_this_page in the
"master_bh" will have to be handled correctly by the end_io of the upper
level code (stuff in buffer.c).

> > clever plan to use that field at some point, but never got around to it.
> > Setting that field to something besides a real address sure does seem
> > odd...and I can't see that it's ever used anywhere.
> 
> What is also puzzling me is that despite the horrible potential for
> what might happen from doing the original users end_io early, I
> can't see any consequences in actual tests!

Probably because the timing is so close...if we were to delay the
completion of I/O to one of the devices by several seconds, say, I
believe we'd see some really bad things happen. Another thing that
probably has to coincide with the I/O delays is memory pressure,
otherwise I think the system will just end up keeping the buffers cached
forever (OK, a really long time...) and nothing bad will happen. 

One thought I had for a test (when I get to the point of really
rigorously testing this stuff :)) is to set up an nbd-client/server pair
and insert a sleep into the nbd-server so that completion of I/O is
_always_ delayed by some period...(this will also help in performance
testing, to see how much benefit we get from doing async writes with
high latency).

BTW, I'm working on the code to duplicate the bh (and its memory buffer)
right now. It's basically coded, but not tested. I've based it off your
2.5 code. I'm also working on a simple queueing mechanism (to queue
write requests to backup devices). This will allow us to adjust the bit
to block ratio of the bitmap (intent log) to save disk space and memory.
This mechanism will also be needed if we want to increase the degree of
asynchronicity of the writes (we could just queue all writes and deal
with them later, perhaps in batches).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Questions answered by Neil Brown
  2003-02-26 16:41           ` Paul Clements
@ 2003-02-26 17:26             ` Peter T. Breuer
  2003-02-26 18:29               ` raid1 bitmap code [Was: Re: Questions answered by Neil Brown] Paul Clements
  0 siblings, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-26 17:26 UTC (permalink / raw)
  To: Paul Clements; +Cc: ptb, Neil Brown, linux-raid

"Paul Clements wrote:"
> Right, but note that when raid1 is setting those 1s, it's in the bh's
> that it's sending down to lower level devices, not in the bh's that came
> in from above...a subtle distinction. So these 1s would be consumed only

Too subtle for me.  

> by raid1's end_io routines. Any changes we make to b_this_page in the

Umm ... err, I'm not at all sure .. you mean that their b_end_io
is something we implanted? Yes, I'll go along with that. OK - here's
the story ..

   the bh's that we set the this_page field to 1 on come from
   raid1_alloc_bh.  Normally these come from a pool of preallocated bhs
   that stick around in the driver.

   The b_end_io that we give them is the raid1_end_request function, which
   is the common or garden endio that does /nothing/. 

   Only on the last of the bh's do we run a raid1_end_bh_io, which
   calls raid1_free_r1bh on the /master/,

   surprise, raid1_free_r1bh does nothing except call raid1_free_bh on the
   list of all the associated bh's that's been sitting chained onto the
   mirror_bh_list field of the master,

   and then raid1_free_bh runs through the list and calls kmem_cahe_free on
   each of them.

I see the this_page field used nowhere in this. I can't follow what the
raid1_alloc_bh thing does, so I have an info leak here. I don't know
how the preallocated bh cache grows or shrinks under pressure.

Anyway, if we run raid1_end_bh_io early, we will free up bh's and they
may get reused before we have a chance to complete them. So that's
a problem.

Uh - we'll have to stop raid1_end_bh_io calling raid1_free_bh and
instead get each individual bh to unchain itself from the linked
list and free itself. Under some lock, of course. 

Where in all this do we ack the original end_io on the md array? I
don't see it!

> "master_bh" will have to be handled correctly by the end_io of the upper
> level code (stuff in buffer.c).

> Probably because the timing is so close...if we were to delay the
> completion of I/O to one of the devices by several seconds, say, I
> believe we'd see some really bad things happen. Another thing that
> probably has to coincide with the I/O delays is memory pressure,
> otherwise I think the system will just end up keeping the buffers cached
> forever (OK, a really long time...) and nothing bad will happen. 

I don't know how many bh's there are in the preallocated pool.  If there
were more than 128 it's fairly certain that they could satisfy all
requests outstanding at any one time. Especially if freed requests
go to the back of the free list (do they?). 

> One thought I had for a test (when I get to the point of really
> rigorously testing this stuff :)) is to set up an nbd-client/server pair
> and insert a sleep into the nbd-server so that completion of I/O is
> _always_ delayed by some period...(this will also help in performance

Possibly. My enbd will timeout.

> testing, to see how much benefit we get from doing async writes with
> high latency).
> 
> BTW, I'm working on the code to duplicate the bh (and its memory buffer)
> right now. It's basically coded, but not tested. I've based it off your
> 2.5 code. I'm also working on a simple queueing mechanism (to queue
> write requests to backup devices). This will allow us to adjust the bit
> to block ratio of the bitmap (intent log) to save disk space and memory.

Don't worry about that - that's not necessary, I think.  The bitmap is
already lazy on creating pages for itself.  But yes, it needs to
maintain a count of dirty bits per bitmap page, and when the count drops
to zero it needs to free the page.  I can do that if you like?

> This mechanism will also be needed if we want to increase the degree of
> asynchronicity of the writes (we could just queue all writes and deal
> with them later, perhaps in batches).

Yes.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-26 17:26             ` Peter T. Breuer
@ 2003-02-26 18:29               ` Paul Clements
  2003-02-26 19:15                 ` Peter T. Breuer
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Clements @ 2003-02-26 18:29 UTC (permalink / raw)
  To: ptb; +Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
> "Paul Clements wrote:"

> > BTW, I'm working on the code to duplicate the bh (and its memory buffer)
> > right now. It's basically coded, but not tested. I've based it off your
> > 2.5 code. I'm also working on a simple queueing mechanism (to queue
> > write requests to backup devices). This will allow us to adjust the bit
> > to block ratio of the bitmap (intent log) to save disk space and memory.
> 
> Don't worry about that - that's not necessary, I think.  The bitmap is
> already lazy on creating pages for itself.  But yes, it needs to
> maintain a count of dirty bits per bitmap page, and when the count drops
> to zero it needs to free the page.  I can do that if you like?

Yes, the on-demand page freeing sounds like a good idea. If you don't
have that, I think the bitmap eventually grows to maximum size over
time...

As far as the bit to block ratio, some numbers:

1 bit/64kb @ 1TB ->   2MB maximum bitmap size

1 bit/512b @ 1TB -> 250MB maximum bitmap size

So, I think that if we want to be sure that this will scale, we'll want
the bit to block ratio to be adjustable. Another benefit of having a
large bitmap block size is that it limits the frequency of the disk
writes required to sync the bitmap to disk (1 sync per 128 sectors
written vs. 1 per sector, in the example above).

--
Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-26 18:29               ` raid1 bitmap code [Was: Re: Questions answered by Neil Brown] Paul Clements
@ 2003-02-26 19:15                 ` Peter T. Breuer
  2003-02-26 22:12                   ` Neil Brown
  0 siblings, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-26 19:15 UTC (permalink / raw)
  To: Paul Clements; +Cc: ptb, Neil Brown, linux-raid

"A month of sundays ago Paul Clements wrote:"
> "Peter T. Breuer" wrote:
> > Don't worry about that - that's not necessary, I think.  The bitmap is
> > already lazy on creating pages for itself.  But yes, it needs to
> > maintain a count of dirty bits per bitmap page, and when the count drops
> > to zero it needs to free the page.  I can do that if you like?
> 
> Yes, the on-demand page freeing sounds like a good idea. If you don't
> have that, I think the bitmap eventually grows to maximum size over
> time...

Of course. But nobody's complained yet :-)

The bitmap is OO. It's easy to do changes like that. I'll have to
test the map before setting the bit in order to count the number of
dirty bits. That's all (the ops are done under lock).

> As far as the bit to block ratio, some numbers:
> 
> 1 bit/64kb @ 1TB ->   2MB maximum bitmap size
> 
> 1 bit/512b @ 1TB -> 250MB maximum bitmap size

Yes, I understand this.  But I simply do not believe it is worth the
pain initially.  The raid code already disallows all blksizes apart from
1KB and so I only allowed one bit per block!  That seems fair.
Oversizing the dirty bit to cover more than the minimal transaction unit
means that translations have to be made all over the place.  Well, OK,
it could be done by simply calling bitmap->setbit(bitmap, block >> 8)
instead of bitmap->setbit(bitmap, block >> 8), and the same for testbit,
But cannot this be left for some other time?

> So, I think that if we want to be sure that this will scale, we'll want
> the bit to block ratio to be adjustable. Another benefit of having a

Well, the simple thing is to change the bitmap calls in the patched
raid1.c to have

   block >> scale

in them instead of just block, and then set scale as a module
parameter.

> large bitmap block size is that it limits the frequency of the disk
> writes required to sync the bitmap to disk (1 sync per 128 sectors
> written vs. 1 per sector, in the example above).

But I don't think this really matters. Kernel request aggregation
to the underlying devices will smooth out everything anyhow.

My instinct is to leave things be for the moment.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-26 19:15                 ` Peter T. Breuer
@ 2003-02-26 22:12                   ` Neil Brown
  2003-02-26 23:24                     ` Peter T. Breuer
  2003-02-27  5:33                     ` Paul Clements
  0 siblings, 2 replies; 27+ messages in thread
From: Neil Brown @ 2003-02-26 22:12 UTC (permalink / raw)
  To: ptb; +Cc: Paul Clements, linux-raid

Some observations on the idea of an intent-logging bitmap.

1/ You don't want or need very fine granularity.  The value of this is
   to speed up resync time.  Currently it is limited by drive
   bandwidth.  If you have lots of little updates due to fine
   granularity, you will be limited by seek time.

   One possibly reasonable approach would be to pick a granulatity
   such that it takes about as long to read or write a chunk and it
   would to seek to the next one.  This probably means a few hundred
   kilobytes.  i.e. one bit in the bitmap for every hundred K.
   This would require a 125K bitmap for a 100Gig drive.

   Another possibility would be a fixed size bitmap - say 4K or 8K.

   An 8K bitmap used to map a 100Gig drive would be 1.5Meg per bit.

   This may seem biggish, but if your bitmap were sparse, resync would
   still be much much faster, and if it were dense,  having a finer
   grain in the bitmap isn't going to speed things up much.

2/ You cannot allocate the bitmap on demand.  
   Demand happens where you are writing data out, and when writing
   data out due to high memory pressure, kmalloc *will* fail.
   Relying on kmalloc in the write path is BAD.  That is why we have
   mempools which pre-allocate.
   For the bitmap, you simply need to pre-allocate everything.

3/ Internally, you need to store a counter for each 'chunk' (need a
   better word, this is different from the raid chunksize, this is the
   amount of space that each bit refers to).
   The counter is needed so you know when the bit can be cleared.
   This too must be pre-allocated and so further limits the size of
   your bitmap.

   16 bit counters would use less ram and would allow 33553920 bytes 
   (65535 sectors) per 'chunk' which, with an 8K bitmap, puts an upper
   limit of 2 terabytes per device, which I think is adequate. (that's
   per physical device, not per raid array).

   Or you could just use 32 bit counters.

4/ I would use device plugging to help reduce the number of times you
   have to write the intent bitmap.
   When a write comes in, you set the bit in the bitmap, queue the
   write on a list of 'plugged' requests, and mark the device as
   'plugged'.  The device will eventually be unplugged, at which point
   you write out the bitmap, then release all the requests to the
   lower devices.

   You could optimise this a bit, and not bother plugging the device
   if it wasn't already plugged, and the request only affected bits
   that were already set.

NeilBrown

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-26 22:12                   ` Neil Brown
@ 2003-02-26 23:24                     ` Peter T. Breuer
  2003-02-27  7:26                       ` Paul Clements
  2003-02-27  5:33                     ` Paul Clements
  1 sibling, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-26 23:24 UTC (permalink / raw)
  To: Neil Brown; +Cc: ptb, Paul Clements, linux-raid

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UNKNOWN-8BIT, Size: 6082 bytes --]

Hi!

"A month of sundays ago Neil Brown wrote:"
> 1/ You don't want or need very fine granularity.  The value of this is
>    to speed up resync time.  Currently it is limited by drive

The map is also used to "journal" async writes. So I think that makes
it necessary to have granularity the same size as a write. Otherwise
we cannot meaningfully mark the map clean again when the async write
completes.

>    bandwidth.  If you have lots of little updates due to fine
>    granularity, you will be limited by seek time.

While your observation is undoubtedly correct, since the resync is
monotonic the seeks will all be in the same direction.  That cannot be
any slower than actually traversing while writing, can it?  And IF it is
the case that it is slower, we can always do "writeahead" in the resync
- that is, when we see a block is marked dirty in the resync, resync it
AND some number of following blocks, then look again.

We can even look at the bitmap fairly carefully and decide what strategy
to use in resync, part by part.  I've now (earlier this evening) altered
the bitmap so that it accounts the number of dirty bits per bitmap page
(i.e.  how many dirty blocks per 4096 blocks at the moment) and if it's
more than some proportion we can write the whole lot.

>    such that it takes about as long to read or write a chunk and it
>    would to seek to the next one.  This probably means a few hundred
>    kilobytes.  i.e. one bit in the bitmap for every hundred K.
>    This would require a 125K bitmap for a 100Gig drive.
> 
>    Another possibility would be a fixed size bitmap - say 4K or 8K.
> 
>    An 8K bitmap used to map a 100Gig drive would be 1.5Meg per bit.
>  
>    This may seem biggish, but if your bitmap were sparse, resync would
>    still be much much faster, and if it were dense,  having a finer
>    grain in the bitmap isn't going to speed things up much.

The objection that it couldn't be used for journalling holds, I think.

> 2/ You cannot allocate the bitmap on demand.  
>    Demand happens where you are writing data out, and when writing

That was taken account of in the botmap code design.  If the kmalloc
fails for a bitmap page, it will write the address of the page as "1" in
the page array which the bitmap will then understand as "all 4096 blocks
dirty" during resync.

Ummm ... yes, it clears the whole thing correctly after the resync too.

>    data out due to high memory pressure, kmalloc *will* fail.
>    Relying on kmalloc in the write path is BAD.  That is why we have

I didn't :-)

>    mempools which pre-allocate.
>    For the bitmap, you simply need to pre-allocate everything.

I think we can breathe easy here - the bitmap code is designed to fail
correctly, so that the semantics is appropriate. It marks a page
that can't be obtained via kmalloc as all-dirty by setting its address
as 1.

Curiously enough, I'm slightly more nonplussed by the problem of
kfreeing the bitmap pages when their dirty count drops to zero.
I can foresee that when journalling we will go 0 1 0 1 0 1 in terms
of number of bits dirty in the map, and if we kfree after the
count drops to zero each time, and kmalloc when we should set
the count to 1, then we will be held up. And maybe sleep when we
shouldn't - the bitmap lock is held for some ops. Needs checking.

What should I do? Maintain a how-many-times-we-have-wanted-to-free-this
page count and only free it on the 10th attempt?

> 3/ Internally, you need to store a counter for each 'chunk' (need a
>    better word, this is different from the raid chunksize, this is the
>    amount of space that each bit refers to).
>    The counter is needed so you know when the bit can be cleared.
>    This too must be pre-allocated and so further limits the size of
>    your bitmap.

I'm not sure what you mean here. At the moment (since earlier this
evening) there IS a counter for each page in the bitmap saying how many
bits in it are dirty.

Oh - I see, you are considering the case when there is more than one
block þer bit. 

Really - I think it's too complicated to fly, at least for the moment. 

I'm really not worried about the memory used by the bitmap! If you
would fix md so it could have 4K blocksize we'd probably gain more.
But that code has bs=1K and sectors/blk=2 assumptions all over!

Currently, the max blocks in a md device is 2^31. At one bit per
block and 8bits per byte that's 2^28 bytes of bitmap, or 256MB,
if it were to become completely full. I am sure that people who have
1TB of disk can afford 256MB of ram to service it with.

>    16 bit counters would use less ram and would allow 33553920 bytes 
>    (65535 sectors) per 'chunk' which, with an 8K bitmap, puts an upper
>    limit of 2 terabytes per device, which I think is adequate. (that's
>    per physical device, not per raid array).

The current counters (dirty bits per bitmap page) are 16 bit. SIgned.

>    Or you could just use 32 bit counters.
> 
> 4/ I would use device plugging to help reduce the number of times you
>    have to write the intent bitmap.
>    When a write comes in, you set the bit in the bitmap, queue the
>    write on a list of 'plugged' requests, and mark the device as
>    'plugged'.  The device will eventually be unplugged, at which point
>    you write out the bitmap, then release all the requests to the
>    lower devices.

Hmmm .. that's interesting. But bitmap access was meant to be cheap.
Whether I succeeded is another question! If not someone else can
rewrite.

>    You could optimise this a bit, and not bother plugging the device
>    if it wasn't already plugged, and the request only affected bits
>    that were already set.

There are very good ideas indeed there. Thank you very much. I
appreciate it!

Peter
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-26 23:24                     ` Peter T. Breuer
@ 2003-02-27  7:26                       ` Paul Clements
  2003-02-27  8:48                         ` Peter T. Breuer
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Clements @ 2003-02-27  7:26 UTC (permalink / raw)
  To: ptb; +Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
 
> Curiously enough, I'm slightly more nonplussed by the problem of
> kfreeing the bitmap pages when their dirty count drops to zero.
> I can foresee that when journalling we will go 0 1 0 1 0 1 in terms
> of number of bits dirty in the map, and if we kfree after the
> count drops to zero each time, and kmalloc when we should set
> the count to 1, then we will be held up. And maybe sleep when we
> shouldn't - the bitmap lock is held for some ops. Needs checking.
> 
> What should I do? Maintain a how-many-times-we-have-wanted-to-free-this
> page count and only free it on the 10th attempt?

hmm...perhaps an LRU approach? you could store a timestamp (jiffies?),
so that you never deallocate a page unless it hasn't been used in X
amount of time...might be too heavyweight for what you're trying to do

--
Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-27  7:26                       ` Paul Clements
@ 2003-02-27  8:48                         ` Peter T. Breuer
  2003-02-27 15:47                           ` Paul Clements
  0 siblings, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-27  8:48 UTC (permalink / raw)
  To: Paul Clements; +Cc: ptb, Neil Brown, linux-raid

"Paul Clements wrote:"
> "Peter T. Breuer" wrote:
> > Curiously enough, I'm slightly more nonplussed by the problem of
> > kfreeing the bitmap pages when their dirty count drops to zero.

> > What should I do? Maintain a how-many-times-we-have-wanted-to-free-this
> > page count and only free it on the 10th attempt?
> 
> hmm...perhaps an LRU approach? you could store a timestamp (jiffies?),
> so that you never deallocate a page unless it hasn't been used in X
> amount of time...might be too heavyweight for what you're trying to do

This is not silly, but is too heavyweight to do each time a bit is set.

The best thing to do is probably either or both of

   a) put the notionally freed pages on a free list and allocate from
      that list first when we want a new page

   b) do a periodic sweep for clean pages and free them. 

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-27  8:48                         ` Peter T. Breuer
@ 2003-02-27 15:47                           ` Paul Clements
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Clements @ 2003-02-27 15:47 UTC (permalink / raw)
  To: ptb; +Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
> 
> "Paul Clements wrote:"
> > "Peter T. Breuer" wrote:
> > > Curiously enough, I'm slightly more nonplussed by the problem of
> > > kfreeing the bitmap pages when their dirty count drops to zero.
> 
> > > What should I do? Maintain a how-many-times-we-have-wanted-to-free-this
> > > page count and only free it on the 10th attempt?
> >
> > hmm...perhaps an LRU approach? you could store a timestamp (jiffies?),
> > so that you never deallocate a page unless it hasn't been used in X
> > amount of time...might be too heavyweight for what you're trying to do
> 
> This is not silly, but is too heavyweight to do each time a bit is set.

Yeah, I was thinking of things like an ext3 journal, where you really do
not want to _ever_ free the pages. With a simple counter, I don't think
there's any way to get that type of behavior.

Of course, if you keep a pool of pre-allocated pages, the penalty for
deallocation/reallocation becomes a lot lower.

--
Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-26 22:12                   ` Neil Brown
  2003-02-26 23:24                     ` Peter T. Breuer
@ 2003-02-27  5:33                     ` Paul Clements
  2003-02-27 10:35                       ` Peter T. Breuer
  1 sibling, 1 reply; 27+ messages in thread
From: Paul Clements @ 2003-02-27  5:33 UTC (permalink / raw)
  To: Neil Brown; +Cc: ptb, linux-raid

Neil, 

You've made some really good points and suggestions...thanks...

> 1/ You don't want or need very fine granularity.  The value of this is
>    to speed up resync time.  Currently it is limited by drive
>    bandwidth.  If you have lots of little updates due to fine
>    granularity, you will be limited by seek time.

One more reason why an adjustable bit to block ratio would be nice :)
...

> 2/ You cannot allocate the bitmap on demand.

Hmm...that's a very good point. I had not really thought about that, but
you're right. Maybe there are some advantages to having a simple, flat,
pre-allocated bitmap...although, I do really like Peter's two-level
on-demand allocation scheme. Maybe we could do partial pre-allocation,
using the pre-allocated pages when we're under pressure and kmalloc
fails, and doing on-demand allocation the rest of the time? Another
idea, expanding on what Peter has already done with marking the pointer
with an "address" of 1 if the kmalloc fails (this means that the bitmap
degrades from 1bit/1k to 1bit/4MB, which is not a terrible thing, all in
all). What if we were clever and used more than just one bit when the
allocation fails, so that the ratio could be kept more reasonable (I'm
thinking the 1bit/4MB is OK now, but with a default bit/block ratio
that's much higher, it might get out of control)...maybe something
similar to the IS_ERR and ERR_PTR macros that are elsewhere in the
kernel? Those use 1000 values (I think...) that are known not to be
valid pointer values as error values instead.

> 3/ Internally, you need to store a counter for each 'chunk' (need a

Yes, in order to use a bit/block ratio other than 1 bit/1k, we need some
mechanism for tracking multiple pending writes to the same 'chunk'. One
way to do this is to keep a counter (say 16 or 32 bits) on each chunk,
rather than just a single bit. Another alternative might be to queue the
writes (using a hash on the 'chunk' number, for quick insertion and
deletion). This would tell us how many writes are pending for a given
'chunk', and allow us to clear the bit at the appropriate time (once the
last pending write for the 'chunk' had finished). This could be expanded
later to queueing entire requests (including data) so we could do full
transaction logging, so that a short network outage (how short depends
on how much $$ (sorry Peter, pesetas^Weuros) you want to lay out for RAM
:)) could be recovered from quickly, by just replaying the already
queued data.

> 4/ I would use device plugging to help reduce the number of times you
>    have to write the intent bitmap.

That's a good idea. I also think that with a large bit/block ratio, the
bitmap syncing will be fairly efficient, since the bitmap will
only have to be synced once per XXXk of data written.

--
Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-27  5:33                     ` Paul Clements
@ 2003-02-27 10:35                       ` Peter T. Breuer
  2003-02-27 10:50                         ` Peter T. Breuer
  2003-02-27 16:51                         ` Paul Clements
  0 siblings, 2 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-27 10:35 UTC (permalink / raw)
  To: Paul Clements; +Cc: Neil Brown, ptb, linux-raid

"Paul Clements wrote:"
> > 2/ You cannot allocate the bitmap on demand.
> 
> Hmm...that's a very good point. I had not really thought about that, but
> you're right. Maybe there are some advantages to having a simple, flat,
> pre-allocated bitmap...although, I do really like Peter's two-level
> on-demand allocation scheme. Maybe we could do partial pre-allocation,
> using the pre-allocated pages when we're under pressure and kmalloc

I am sure that as the code evolves, schemes like this will be added to
it. But are we really sure that it's worth trying to beat the kernels
memory management? All the calls here are for 4KB. Sure, we can maybe
always maintain a 5% reserve on standby (if the reserve drops to 1%
then ask for some more), and also maintain a free list (for first
recourse) that is swept every so often.

But the kernel already does that kind of management and balancing.  What
right do we have to say we're better than the rest of the kernel?  I can
up the priority with which we ask for memory, but it seems to me that in
times of memory pressure we /ought to fail/, but fail gracefully.

Currently we degrade to 4MB resolution, instead of 1KB. That's all.

> fails, and doing on-demand allocation the rest of the time? Another
> idea, expanding on what Peter has already done with marking the pointer
> with an "address" of 1 if the kmalloc fails (this means that the bitmap
> degrades from 1bit/1k to 1bit/4MB, which is not a terrible thing, all in
> all). What if we were clever and used more than just one bit when the
> allocation fails, so that the ratio could be kept more reasonable (I'm
> thinking the 1bit/4MB is OK now, but with a default bit/block ratio

This is OK as an idea. The sign bit can be used as a marker. Or perhaps
the top two bits, since I think memory normally starts at an adress of
0xc.... Can anyone confirm that? I'm thinking that if the top two bits
of the address are low we can take it that this is not an address, but
a small bitmap in itself.

Hmm ... no, thats' silly. We need a number of bits that is a power of
2. OK, so the top 16 bits being zero lets us see that the bottom 16
bits are a bitmap.

> that's much higher, it might get out of control)...maybe something
> similar to the IS_ERR and ERR_PTR macros that are elsewhere in the
> kernel? Those use 1000 values (I think...) that are known not to be
> valid pointer values as error values instead.

If 16 bits are available instead of 4096 * 8 = 32K, then we get
resolution of 4M/16 = 256K per bit instead of 1K per bit (or instead of
4M per bit). This is more reasonable. I'll look at the coding effort.

> > 3/ Internally, you need to store a counter for each 'chunk' (need a
> mechanism for tracking multiple pending writes to the same 'chunk'. One

I really don't see the logic of this argument, so I'll risk your wrath
by saying so in public.  We cannot account for how many bits are really
dirty in this chunk.  If you mark the same spot twice you will count 2
by your logic when only 1 should have been added to the count.  Nor can
you really clear a bit, because by the same token, if we try and clear a
bit that is already notionally cleared, you will want to subtract 1 from
the count, when 0 should really be subtracted.

In other words, you are making assumptions on the mode of use of the
bitmap interface that are difficult to express ..  I think you need to
count in "sessions", and promise not to mark the same place twice per
session, and never to clear anything that you have not already marked in
the session, and to not close the session until you have cleared everthing
etc.  ..  but in any case these assumptions render the use of a bitmap
as a programming abstraction inappropriate.  It's not really good for
maintenance.

> way to do this is to keep a counter (say 16 or 32 bits) on each chunk,
> rather than just a single bit. Another alternative might be to queue the
> writes (using a hash on the 'chunk' number, for quick insertion and
> deletion). This would tell us how many writes are pending for a given
> 'chunk', and allow us to clear the bit at the appropriate time (once the
> last pending write for the 'chunk' had finished). This could be expanded
> later to queueing entire requests (including data) so we could do full
> transaction logging, so that a short network outage (how short depends
> on how much $$ (sorry Peter, pesetas^Weuros) you want to lay out for RAM

I really don't grok the idea here. I am hung up by the fundamental
impossibility of accounting correctly in the situation where writes
to different places on disk will land on the same bit of the bitmap.
That breaks the logic so thoroughly for me that I cannot proceed from
there!

Maybe I need to receive more $$!

> :)) could be recovered from quickly, by just replaying the already
> queued data.

Well, if you really want to store the requests instead of merely where
those requests are aimed at, that's another story. But it still won't
work. State information that's needed for the accounting will vanish
when those requests complete. The bitmap cannot have worse granularity
than writes can have.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-27 10:35                       ` Peter T. Breuer
@ 2003-02-27 10:50                         ` Peter T. Breuer
  2003-02-27 16:51                         ` Paul Clements
  1 sibling, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-27 10:50 UTC (permalink / raw)
  To: ptb; +Cc: Paul Clements, Neil Brown, linux-raid

"Peter T. Breuer wrote:"
> "Paul Clements wrote:"
> > fails, and doing on-demand allocation the rest of the time? Another
> > idea, expanding on what Peter has already done with marking the pointer
> > with an "address" of 1 if the kmalloc fails (this means that the bitmap
> > degrades from 1bit/1k to 1bit/4MB, which is not a terrible thing, all in
> > all). What if we were clever and used more than just one bit when the

I forgot to say that when the bitmap page fails to be allocated in
order that we may mark it, then an attempted mark also fails, and the
async-write code notices that and refuses to even try an async write,
but stays sync instead.

So this fails safe.

I also forgot to say something else but I forgot what it was ... ah
yes, it's that I did take care never to call kmalloc (or kfree) with
the bitmap spinlock held. So that if the kernel blocks and sleeps
we don't cause death. Instead we release the lock and regain it after
the kmalloc, and retest the conditions that made us want to do the
kmalloc then, aborting if they have been satisfied by anyone else.

Indeed the kmalloc calls are only in the init function and in the
function that is supposed to check the bitmap page and install a page
if one is required, prior to further access. It all /looks/ under
control.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-27 10:35                       ` Peter T. Breuer
  2003-02-27 10:50                         ` Peter T. Breuer
@ 2003-02-27 16:51                         ` Paul Clements
  2003-02-27 17:18                           ` Peter T. Breuer
  2003-02-28 15:25                           ` Peter T. Breuer
  1 sibling, 2 replies; 27+ messages in thread
From: Paul Clements @ 2003-02-27 16:51 UTC (permalink / raw)
  To: ptb; +Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:

> "Paul Clements wrote:"
> > > 3/ Internally, you need to store a counter for each 'chunk' (need a
> > mechanism for tracking multiple pending writes to the same 'chunk'. One

> We cannot account for how many bits are really dirty in this chunk. 

That's OK, we don't need to know. We really just want to know if there
are any outstanding writes for that chunk. We increment the chunk
counter for every write and decrement it when the write finishes. Now,
if the backup device fails, we'll bring it back online at some point and
do a (partial) resync, using the bitmap as a guide. For every chunk, if
the counter is non-zero, we have to do a full resync of every block in
that chunk. We then set the counter back to 0 only after we're done
resyncing the last block in that chunk.

So we really don't care how many blocks are dirty, just that there are
or are not dirty blocks in the chunk. Does that make sense?

--
Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-27 16:51                         ` Paul Clements
@ 2003-02-27 17:18                           ` Peter T. Breuer
  2003-02-28 15:25                           ` Peter T. Breuer
  1 sibling, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-27 17:18 UTC (permalink / raw)
  To: Paul Clements; +Cc: ptb, Neil Brown, linux-raid

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UNKNOWN-8BIT, Size: 1338 bytes --]

"Paul Clements wrote:"
> "Peter T. Breuer" wrote:
> > "Paul Clements wrote:"
> > We cannot account for how many bits are really dirty in this chunk. 
> 
> That's OK, we don't need to know. We really just want to know if there
> are any outstanding writes for that chunk. We increment the chunk

OK. That's a fine semantics.

> So we really don't care how many blocks are dirty, just that there are
> or are not dirty blocks in the chunk. Does that make sense?

Well, yes.  We would have tried to mark the bitmap before each write,
and I suppose we would have failed for out of memory in creating the
bitmap page, and now you are saying that OK, so let's just count
outstanding writes to thæt bitmap zone.

I suppose that if the write fails we don't decrement the count?
Otherwise we do?

So when we do a mark we increment the count on the bitmap zone, and when
we do a clear we decrement the count, and when the write fails we never
try and clear on the bitmap?  Then what we have is a count of the
unbalance between marks and clears on the bitmap per zone.  And you want
the

Yes, I can go with that.

Peter
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-27 16:51                         ` Paul Clements
  2003-02-27 17:18                           ` Peter T. Breuer
@ 2003-02-28 15:25                           ` Peter T. Breuer
  2003-02-28 16:14                             ` Paul Clements
  1 sibling, 1 reply; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-28 15:25 UTC (permalink / raw)
  To: Paul Clements; +Cc: ptb, Neil Brown, linux-raid

"Paul Clements wrote:"
> That's OK, we don't need to know. We really just want to know if there
> are any outstanding writes for that chunk. We increment the chunk
> counter for every write and decrement it when the write finishes. Now,

I agree. I modified the bitmap so that when a page can't be got to make
a mark on, then we count pending "writes" by summing one to a count
for every attempted write to the zone covered by the missing page
and subtracting one for every attempted clear.

The bitmap reports "dirty" if the count is positive and the page for
that zone is absent.  I put the present code up at

  ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.6.tgz

I spent much of the morning checking that mark and clear attempts
are balanced unless the write does not succeed. They are. This means
that under memory pressure the bitmap not only fails only by losing
precision (it did before), but now it can lose less precision, dropping
to 256K instead of 4M precision,  although I haven't implemented the
final change to up the precision just yet. I made all the preparatory
moves.   Since it's friday I'll publish what I have.

Under normal circumstances the count for each zone of the bitmap
records the number of dirty bits. It's used to help free the bitmap 
pages at the right time, to stop pages accumulating to the max.
Shucks ..  I have to put them on a free list. I'll do that later too.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-28 15:25                           ` Peter T. Breuer
@ 2003-02-28 16:14                             ` Paul Clements
  2003-02-28 16:23                               ` Peter T. Breuer
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Clements @ 2003-02-28 16:14 UTC (permalink / raw)
  To: ptb; +Cc: Neil Brown, linux-raid

"Peter T. Breuer" wrote:
> 
> "Paul Clements wrote:"
> > That's OK, we don't need to know. We really just want to know if there
> > are any outstanding writes for that chunk. We increment the chunk
> > counter for every write and decrement it when the write finishes. Now,
> 
> I agree. I modified the bitmap so that when a page can't be got to make
> a mark on, then we count pending "writes" by summing one to a count
> for every attempted write to the zone covered by the missing page
> and subtracting one for every attempted clear.

That sounds good. I'll have to merge up to 2.6...ugh...now I really need
to put this all in CVS...I'm making some progress with the mmap stuff
and I've got the code done for duplicating the buffers for the async
writes to the backup devices.
I'll need a pool of memory for that too, since I can't just fail in the
middle of I/O...

--
Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
  2003-02-28 16:14                             ` Paul Clements
@ 2003-02-28 16:23                               ` Peter T. Breuer
  0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-02-28 16:23 UTC (permalink / raw)
  To: Paul Clements; +Cc: ptb, Neil Brown, linux-raid

"Paul Clements wrote:"
> That sounds good. I'll have to merge up to 2.6...ugh...now I really need
> to put this all in CVS...I'm making some progress with the mmap stuff
> and I've got the code done for duplicating the buffers for the async
> writes to the backup devices.
> I'll need a pool of memory for that too, since I can't just fail in the
> middle of I/O...

Well, you can. Failing to complete will leave the map marked and
we could fault the device offline too. That will stop any more memory
being wasted because nothing will go to it! We could then make periodic
attempts to bring it online again using the "hotrepair" trick and it'll 
resync itself in the background.

Just an idea. Maybe not a good one, but an idea.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Questions answered by Neil Brown
  2003-02-26  8:09         ` Peter T. Breuer
  2003-02-26 16:41           ` Paul Clements
@ 2003-02-26 21:45           ` Neil Brown
  1 sibling, 0 replies; 27+ messages in thread
From: Neil Brown @ 2003-02-26 21:45 UTC (permalink / raw)
  To: ptb; +Cc: Paul Clements, linux-raid

On Wednesday February 26, ptb@it.uc3m.es wrote:
> 
> What is also puzzling me is that despite the horrible potential for
> what might happen from doing the original users end_io early, I
> can't see any consequences in actual tests!
> 
> I am writing stuff to a file on a raid1 mirror mount, then dismantling
> the mount, the device, and rmmoding the driver. Then I put it all back
> again. On remount the data in the file is all perfect. Yet it was built
> with plenty of async writes! Surely the buffers in some of those writes
> should have been pointing nowhere and been full of rubbish?
> 

I suspect that mostly you are writing from a cache, and the data will
probably stay around.
To be able to demonstrate a problem you probably need very high memory
pressure so things don't stay in cache long, lots of metadata updates,
and probably some for of journalling filesystem like I mentioned
previously.  Have very long latencies for the delayed write would also
make the problem more likely.

Even if you cannot demonstrate a problem, I'm sure one would be
noticed sooner or later if you released this sort of code into the
wild and people used it.

NeilBrown

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Questions answered by Neil Brown
  2003-02-26  7:44       ` Paul Clements
  2003-02-26  8:09         ` Peter T. Breuer
@ 2003-02-26 21:41         ` Neil Brown
  1 sibling, 0 replies; 27+ messages in thread
From: Neil Brown @ 2003-02-26 21:41 UTC (permalink / raw)
  To: Paul Clements; +Cc: ptb, linux-raid

On Wednesday February 26, Paul.Clements@SteelEye.com wrote:
> "Peter T. Breuer" wrote:
> > 
> > "A month of sundays ago Neil Brown wrote:"
> > > On Monday February 24, Paul.Clements@SteelEye.com wrote:
> 
> > So it might be enough to chain all the mirror bh's through
> > bh->b_this_page.

This sort of thing is very much a layering violation and should not be
done.
It is similar in some ways to the design decisions in the 2.2 raid
patches which meant they were un-safe for swap or ext3 (at least while
rebuilding).

The current interface does *not* guarantee that the data will be
around after b_end_io is called, and does not provide any way to ask
for the data to be left around.
You might be able to fake it in some cases, but not all.
A particular example that springs to mind is descriptor block used
when writing out an ext3 journal entry.  Once the block is written,
the descriptor block is of no interest to the kernel and will very
likely be re-used.  You cannot stop this.
You *have* to copy the data.

>  
> > I believe that currently this field is just set to "1" in
> > raid1_make_request().
> 
> Yeah, I sure wish I knew who did that and why. I wonder if someone had a
> clever plan to use that field at some point, but never got around to it.
> Setting that field to something besides a real address sure does seem
> odd...and I can't see that it's ever used anywhere.

The bh which gets b_this_page set to 1 is a bh that is internal to
raid1.  It is allocated by raid1 and never used by any filesystem or
buffer cache or anything.  Thus raid1 can do whatever it pleases with
this field.  No-one else should ever look at it.
I suspect it was set to one so that, if some coding error meant that
the buffer cache saw this buffer, then it would oops pretty quickly.

NeilBrown

> 
> --
> Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
@ 2003-03-01 12:36 Peter T. Breuer
  0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-01 12:36 UTC (permalink / raw)
  To: ptb; +Cc: Paul Clements, Neil Brown, linux-raid

"A month of sundays ago ptb wrote:"
> I agree. I modified the bitmap so that when a page can't be got to make
> a mark on, then we count pending "writes" by summing one to a count
> for every attempted write to the zone covered by the missing page
> and subtracting one for every attempted clear.
> 
> The bitmap reports "dirty" if the count is positive and the page for
> that zone is absent.  I put the present code up at
> 
>   ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.6.tgz

And if anybody cares, I put up a fr1-2.7.tgz in which the bitmap also
has a page cache so that it has a bit of leeway when asking for pages to
put into the map.  I think I put lo/hi water mark at 2/7 by default.
That means it will preallocate 7 pages, and when 5 of those have been
used up it will ask for 5 more, and keep them ready.  But even if it
can't get 5 more it will still serve the 2 it has in reserve.

I still have to put in subzone pending-write counters (for 256KB
subzones) for when the page alloc fails completely.  At the moment the
pending-write counter is for a page, which is a whole 4MB zone ondisk.

When a page is completely cleared it disconnects itself from the bitmap
and puts itself in the page cache, unless the cache is already at
hiwater, when it will kfree itself instead.

But I did let the bitmap "heal itself".  If it tries to mark a page that
was unable to be allocated in the past (because of lack of memory) and
which now has no pending writes on it, then it will be attempted to be
re-alloced.  This may not be a good strategy - I can envisage that some
time needs to pass before retrying.

So, all in all, I guess things get more advanced. As usual when
architecture improves, I took out large snips of code.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
@ 2003-03-13 18:49 Peter T. Breuer
  0 siblings, 0 replies; 27+ messages in thread
From: Peter T. Breuer @ 2003-03-13 18:49 UTC (permalink / raw)
  To: ptb; +Cc: Paul Clements, Neil Brown, linux-raid

Latest news (fr1-2.8) is that

   1) all the fallback bitmap counting is in place, so that if it can't
      get a page for the bitmap it falls back to counting the unbalance
      between mark and clear attempts per zone and reporting a zone
      dirty or clean according to the unbalance.

      (the bitmap pages are 4KB which maps to 32K blocks each, or 32MB
      with 1KB raid blocks.  There are 16 zones of 2MB in the region
      mapped by each page, and so the precision becomes 2MB instead of
      1KB in out-of-memory conditions.  The fallback counters are 16bit,
      so there is 32B of overhead in this scheme per 4KB page, or less
      than 0.1%.  If you have a 1TB device, it will need 32K pages, or
      128MB of bitmap, if all gets dirty.  That's about 0.01% of the
      mapped area.  But 1MB is preallocated for the fallback counters at
      startup).

      Incidentally, I don't know whether to get the counterspace in
      one lump via kmalloc, or to do it a group of 16 counters (32B) at
      a time.

   2) I have added an ioctl which informs a component device when it 
      is added to (or removed from) a raid device.

      The idea is that the informed device should maintain a list of 
      raid devices it is in, and when it changes to an enabled state
      then it should tell us so by calling our hot_add ioctl (or some
      other similar manouever of its choosing).

I'll add the patch for that below. The full thing is at

      ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.8.tgz

And here's the patch to md.c (apply with -b .. tabs probably expanded
in mail):

@@ -587,6 +597,33 @@
        return 0;
 }

+static void
+notify_device (mddev_t * mddev, kdev_t dev)
+{
+#ifndef BLKMDNTFY
+#define BLKMDNTFY _IOW(0x12,133,sizeof(int))
+#endif
+       struct block_device *bdev;
+       printk (KERN_INFO "md%d: notifying dev %x\n", mdidx(mddev), dev);
+        bdev = bdget (dev);
+       if (!bdev)
+                return;
+        ioctl_by_bdev (bdev, BLKMDNTFY, MKDEV (MD_MAJOR, mddev->__minor));
+}
+static void
+unnotify_device (mddev_t * mddev, kdev_t dev)
+{
+#ifndef BLKMDUNTFY
+#define BLKMDUNTFY _IOW(0x12,134,sizeof(int))
+#endif
+       struct block_device *bdev;
+       printk (KERN_INFO "md%d: unnotifying dev %x\n", mdidx(mddev), dev);
+       bdev = bdget (dev);
+       if (!bdev)
+                return;
+        ioctl_by_bdev(bdev, BLKMDUNTFY, MKDEV(MD_MAJOR, mddev->__minor));
+}
+
 static MD_LIST_HEAD(all_raid_disks);
 static MD_LIST_HEAD(pending_raid_disks);

@@ -610,6 +647,7 @@
        rdev->mddev = mddev;
        mddev->nb_dev++;
        printk(KERN_INFO "md: bind<%s,%d>\n", partition_name(rdev->dev), mddev->nb_dev);
+       notify_device(mddev, rdev->dev);
 }

 static void unbind_rdev_from_array(mdk_rdev_t * rdev)
@@ -618,6 +656,7 @@
                MD_BUG();
                return;
        }
+       unnotify_device(rdev->mddev, rdev->dev);
        md_list_del(&rdev->same_set);
        MD_INIT_LIST_HEAD(&rdev->same_set);
        rdev->mddev->nb_dev--;


The additions are in bind/unbind to/from the array.


Peter


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2003-03-13 18:49 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-02-24 20:15 Questions answered by Neil Brown Peter T. Breuer
2003-02-24 21:58 ` Paul Clements
2003-02-25  3:10   ` Neil Brown
2003-02-25  9:11     ` Peter T. Breuer
2003-02-26  7:44       ` Paul Clements
2003-02-26  8:09         ` Peter T. Breuer
2003-02-26 16:41           ` Paul Clements
2003-02-26 17:26             ` Peter T. Breuer
2003-02-26 18:29               ` raid1 bitmap code [Was: Re: Questions answered by Neil Brown] Paul Clements
2003-02-26 19:15                 ` Peter T. Breuer
2003-02-26 22:12                   ` Neil Brown
2003-02-26 23:24                     ` Peter T. Breuer
2003-02-27  7:26                       ` Paul Clements
2003-02-27  8:48                         ` Peter T. Breuer
2003-02-27 15:47                           ` Paul Clements
2003-02-27  5:33                     ` Paul Clements
2003-02-27 10:35                       ` Peter T. Breuer
2003-02-27 10:50                         ` Peter T. Breuer
2003-02-27 16:51                         ` Paul Clements
2003-02-27 17:18                           ` Peter T. Breuer
2003-02-28 15:25                           ` Peter T. Breuer
2003-02-28 16:14                             ` Paul Clements
2003-02-28 16:23                               ` Peter T. Breuer
2003-02-26 21:45           ` Questions answered by Neil Brown Neil Brown
2003-02-26 21:41         ` Neil Brown
  -- strict thread matches above, loose matches on Subject: below --
2003-03-01 12:36 raid1 bitmap code [Was: Re: Questions answered by Neil Brown] Peter T. Breuer
2003-03-13 18:49 Peter T. Breuer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).