* Questions answered by Neil Brown
@ 2003-02-24 20:15 Peter T. Breuer
From: Peter T. Breuer @ 2003-02-24 20:15 UTC (permalink / raw)
  To: linux-raid

Here's part of an offline conversation with Neil Brown. The answers should
go to the list to be archived, so I'm passing them on ...


----- Forwarded message from Neil Brown -----
On Monday February 24, ptb@it.uc3m.es wrote:
> 1) when is the events count written to the component disks
>    superblocks, and how can I force an update (at intervals)?

Whenever md_update_sb is called, which is normally:
   When the array starts
   When it stops
   When a drive fails
   When a drive is added
   When a drive is removed.

Just set mddev->sb_dirty and maybe kick the raid thread.
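
A minimal sketch of forcing such an update at intervals, assuming the
2.4-era md names (mddev->sb_dirty, md_wakeup_thread, raid1's per-array
conf->thread); this is the shape of the idea, not the exact fr1 code:

	/* Mark the superblock dirty so the next md_update_sb() pass
	 * rewrites it (and the event count), then kick the per-array
	 * raid1 service thread so that happens promptly rather than
	 * at the next start/stop/fail/add/remove event. */
	mddev->sb_dirty = 1;
	md_wakeup_thread(conf->thread);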

> 
>    I ask because I have seen a case where a disk that I know
>    had been taken out properly was rejected on reinsert as
>    a replacement candidate (bitmapped resync): its event count
>    was too old compared to the one on the bitmap, which was
>    supposedly set up when the disk was marked faulty and
>    stamped on the first attempted write afterwards.

I would have to look at the code.

> 
> 2) how can I know how many faulty or removed disks there
>    currently are in the array? I need to mark the bitmap on
>    write as soon as there are any. At the moment I search
>    in a certain range in the array for disks marked nonoperational.

The write path in raid1_make_request iterates through all devices
and ignores the non-operational ones.  When you come to the end of the
list, and before you actually submit the write requests, you should
know whether there are any faulty devices, and hence what to do
with the bitmap.

Is that a suitable answer?
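
A sketch of that loop shape, using 2.4 raid1 field names
(conf->mirrors[i].operational, conf->raid_disks); the conf->bitmap
field and the bitmap_mark() hook are hypothetical fr1 names:

	int i, working = 0;

	for (i = 0; i < MD_SB_DISKS; i++) {
		if (!conf->mirrors[i].operational)
			continue;	/* skip faulty/removed members */
		working++;
		/* ... queue the mirror write for device i ... */
	}
	if (working < conf->raid_disks)
		/* some member missed this write: mark the bitmap
		 * (bitmap_mark() is a hypothetical fr1 hook) */
		bitmap_mark(conf->bitmap, bh->b_rsector);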

> 
> 3) it is not clear to me that I am doing accounting right on
>    async writes (or indeed when resyncing and skipping blocks).
> 
>    The last end_io always did
>    io_request_done(bh->b_rsector, conf,
>                           test_bit(R1BH_SyncPhase, &r1_bh->state));
>    and now this can be done on the first end_io (along with the
>    bh->b_end_io(bh, uptodate);) in an async write. Is that right?

It's not enough.
Once you have called bh->b_end_io, you cannot touch that bh ever
again, as it might not even exist.

Also, io_request_done is used to synchronise I/O requests with resync
requests so they don't tread on each other's toes.  io_request_done
must come after all the I/O is complete, even if you find some way of
calling bh->b_end_io sooner.
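
One shape that satisfies both constraints for an async write is to
cache what is still needed from the bh before completing it, and to
defer io_request_done to the final mirror completion.  A sketch using
2.4 raid1 names; the cached sector field and the R1BH_Acked flag are
illustrative additions, not existing fields:

	static void raid1_async_end_write(struct raid1_bh *r1_bh, int uptodate)
	{
		/* first completion: complete the master bh early, caching
		 * what we still need from it beforehand */
		if (!test_and_set_bit(R1BH_Acked, &r1_bh->state)) {
			struct buffer_head *bh = r1_bh->master_bh;
			r1_bh->sector = bh->b_rsector;	/* cache before b_end_io */
			bh->b_end_io(bh, uptodate);
			r1_bh->master_bh = NULL;	/* never touch bh again */
		}
		/* last completion: only now release the resync interlock */
		if (atomic_dec_and_test(&r1_bh->remaining)) {
			io_request_done(r1_bh->sector, mddev_to_conf(r1_bh->mddev),
					test_bit(R1BH_SyncPhase, &r1_bh->state));
			raid1_free_r1bh(r1_bh);
		}
	}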

> 
>    in resync I simulate having done n sectors by
> 
>                    md_sync_acct(mirror->dev, n);
>                    sync_request_done(sector_nr, conf);
>                    md_done_sync(mddev, n, 1);
> 
>                     wake_up(&conf->wait_ready);
> 
>    is that right?

It looks good.  I would have to spend a little while staring at the
code to be sure, but it is close enough that if it seems to work, then
it is probably right.

NeilBrown

----- End of forwarded message from Neil Brown -----

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
@ 2003-03-01 12:36 Peter T. Breuer
From: Peter T. Breuer @ 2003-03-01 12:36 UTC (permalink / raw)
  To: ptb; +Cc: Paul Clements, Neil Brown, linux-raid

"A month of sundays ago ptb wrote:"
> I agree. I modified the bitmap so that when a page can't be got to make
> a mark on, we count pending "writes" instead, adding one to a counter
> for every attempted write to the zone covered by the missing page
> and subtracting one for every attempted clear.
> 
> The bitmap reports "dirty" if the count is positive and the page for
> that zone is absent.  I put the present code up at
> 
>   ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.6.tgz

And if anybody cares, I have put up fr1-2.7.tgz, in which the bitmap
also has a page cache so that it has a bit of leeway when asking for
pages to put into the map.  I set the lo/hi water marks at 2/7 by
default.  That means it preallocates 7 pages, and when 5 of those have
been used up it asks for 5 more and keeps them ready.  But even if it
can't get 5 more, it will still serve the 2 it holds in reserve.
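
The watermark logic amounts to something like the following sketch
(names are illustrative, not the fr1 source):

	#define LOWATER 2
	#define HIWATER 7

	static struct page *reserve[HIWATER];
	static int nreserve;		/* pages currently held in reserve */

	static void pool_refill(void)
	{
		while (nreserve < HIWATER) {
			struct page *p = alloc_page(GFP_NOIO);
			if (!p)
				break;	/* tolerate failure; keep what we have */
			reserve[nreserve++] = p;
		}
	}

	static struct page *pool_get(void)
	{
		if (nreserve <= LOWATER)
			pool_refill();	/* top up at the low-water mark */
		return nreserve ? reserve[--nreserve] : NULL;
	}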

I still have to put in subzone pending-write counters (for 256KB
subzones) for when the page allocation fails completely.  At the moment
the pending-write counter is per page, which covers a whole 4MB zone on
disk.

When a page is completely cleared it disconnects itself from the bitmap
and puts itself in the page cache, unless the cache is already at the
high-water mark, in which case it kfrees itself instead.

But I did let the bitmap "heal itself".  If it tries to mark a page
that could not be allocated in the past (for lack of memory) and which
now has no pending writes on it, the allocation is retried.  This may
not be a good strategy - I can envisage that some time should pass
before retrying.

So, all in all, I guess things are getting more advanced.  As usual
when the architecture improves, I was able to take out large snips of
code.

Peter

* Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
@ 2003-03-13 18:49 Peter T. Breuer
From: Peter T. Breuer @ 2003-03-13 18:49 UTC (permalink / raw)
  To: ptb; +Cc: Paul Clements, Neil Brown, linux-raid

Latest news (fr1-2.8) is that

   1) all the fallback bitmap counting is in place, so that if it can't
      get a page for the bitmap it falls back to counting the imbalance
      between mark and clear attempts per zone, and reports a zone
      dirty or clean according to that imbalance.

      (the bitmap pages are 4KB, which maps 32K blocks each, or 32MB
      with 1KB raid blocks.  There are 16 zones of 2MB in the region
      mapped by each page, and so the precision becomes 2MB instead of
      1KB in out-of-memory conditions.  The fallback counters are 16
      bits, so there is 32B of overhead in this scheme per 4KB page, or
      under 1%.  If you have a 1TB device, it will need 32K pages, or
      128MB of bitmap, if everything gets dirty.  That's about 0.01% of
      the mapped area.  But only 1MB is preallocated for the fallback
      counters at startup).

      Incidentally, I don't know whether to get the counterspace in
      one lump via kmalloc, or to allocate a group of 16 counters
      (32B) at a time; see the sketch after this list.

   2) I have added an ioctl which informs a component device when it 
      is added to (or removed from) a raid device.

      The idea is that the informed device should maintain a list of
      raid devices it is in, and when it changes to an enabled state
      it should tell us so by calling our hot_add ioctl (or some
      other similar manoeuvre of its choosing).  A sketch of the
      component-device side follows the patch below.
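
On the counterspace question in (1), a sketch of the group-at-a-time
option (illustrative names, not the fr1 source); one 32B group of 16
16-bit counters covers the 16 2MB zones behind a missing 4KB page:

	typedef struct zone_counters {
		s16 count[16];	/* write-minus-clear imbalance per 2MB zone */
	} zone_counters_t;

	static zone_counters_t *get_counters(void)
	{
		/* one 32B group per missing page, allocated lazily; the
		 * alternative is one kmalloc'ed lump for all 32K groups */
		return kmalloc(sizeof(zone_counters_t), GFP_ATOMIC);
	}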

I'll add the patch for the notification ioctl in (2) below.  The full
thing is at

      ftp://oboe.it.uc3m.es/pub/Programs/fr1-2.8.tgz

And here's the patch to md.c (apply with -b; tabs were probably
expanded in the mail):

@@ -587,6 +597,33 @@
        return 0;
 }

+static void
+notify_device (mddev_t * mddev, kdev_t dev)
+{
+#ifndef BLKMDNTFY
+#define BLKMDNTFY _IOW(0x12,133,sizeof(int))
+#endif
+	struct block_device *bdev;
+	printk (KERN_INFO "md%d: notifying dev %x\n", mdidx(mddev), dev);
+	bdev = bdget (dev);
+	if (!bdev)
+		return;
+	ioctl_by_bdev (bdev, BLKMDNTFY, MKDEV (MD_MAJOR, mddev->__minor));
+}
+static void
+unnotify_device (mddev_t * mddev, kdev_t dev)
+{
+#ifndef BLKMDUNTFY
+#define BLKMDUNTFY _IOW(0x12,134,sizeof(int))
+#endif
+	struct block_device *bdev;
+	printk (KERN_INFO "md%d: unnotifying dev %x\n", mdidx(mddev), dev);
+	bdev = bdget (dev);
+	if (!bdev)
+		return;
+	ioctl_by_bdev (bdev, BLKMDUNTFY, MKDEV (MD_MAJOR, mddev->__minor));
+}
+
 static MD_LIST_HEAD(all_raid_disks);
 static MD_LIST_HEAD(pending_raid_disks);

@@ -610,6 +647,7 @@
        rdev->mddev = mddev;
        mddev->nb_dev++;
        printk(KERN_INFO "md: bind<%s,%d>\n", partition_name(rdev->dev), mddev->nb_dev);
+	notify_device(mddev, rdev->dev);
 }

 static void unbind_rdev_from_array(mdk_rdev_t * rdev)
@@ -618,6 +656,7 @@
                MD_BUG();
                return;
        }
+	unnotify_device(rdev->mddev, rdev->dev);
        md_list_del(&rdev->same_set);
        MD_INIT_LIST_HEAD(&rdev->same_set);
        rdev->mddev->nb_dev--;


The additions are in bind/unbind to/from the array.
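
For the other side of the handshake, a hedged sketch of how a
component driver might handle these ioctls; only the ioctl numbers
come from the patch above, the list bookkeeping is illustrative:

	#ifndef BLKMDNTFY
	#define BLKMDNTFY  _IOW(0x12,133,sizeof(int))
	#define BLKMDUNTFY _IOW(0x12,134,sizeof(int))
	#endif

	static kdev_t in_arrays[4];	/* md devices we are a member of */
	static int n_arrays;

	static int my_dev_ioctl(struct inode *inode, struct file *file,
				unsigned int cmd, unsigned long arg)
	{
		int i;
		switch (cmd) {
		case BLKMDNTFY:		/* bound into an array: remember it */
			if (n_arrays < 4)
				in_arrays[n_arrays++] = (kdev_t) arg;
			return 0;
		case BLKMDUNTFY:	/* unbound from an array: forget it */
			for (i = 0; i < n_arrays; i++)
				if (in_arrays[i] == (kdev_t) arg)
					in_arrays[i] = in_arrays[--n_arrays];
			return 0;
		}
		return -ENOTTY;
	}

When such a device later becomes usable again, it would walk this list
and invoke each array's hot_add path, as described in (2).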


Peter

