[PATCH 000 of 6] md: udev events and cache bypass for reads

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 000 of 6] md: udev events and cache bypass for reads
@ 2006-10-31  6:00 NeilBrown
  2006-10-31  6:00 ` [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops NeilBrown
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: NeilBrown @ 2006-10-31  6:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Raz Ben-Jehuda(caro), gregkh

Following are 6 patches for md in -lastest which I have been sitting
on for a while because I hadn't had a chance to test them properly.
I now have so there shouldn't be too many bugs left :-)

First is suitable for 2.6.19 (if it isn't too late and gregkh thinks it
is good).  Rest are for 2.6.20.

First two make md play a bit more nicely with udev.  Currently udev gets an
'add' event once and never any remove event or anything else.  And the the 'add'
event comes before the array is ready to be used.

The first patch added online/offline events when the array becomes
ready to be used, and when the array ceases to be usable.

The second patch changes the lifetime rules of md devices so they
disappear when not in use.  This means that udev will normally dev a
'remove' event sometime after the offline, and then another 'add' if a
new array is started with a the same name.  I'm not sure if this helps 
udev particularly, but it is a lot neater.

Then we have 4 patches that cause raid5 to not use the stripe-cache
for reads when that is appropriate.  This avoids copies and improves
read throughput by as much as 30% in some tests.  These patches
are largely due to  "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>.

NeilBrown

 [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
 [PATCH 002 of 6] md: Change lifetime rules for 'md' devices.
 [PATCH 003 of 6] md: Define raid5_mergeable_bvec
 [PATCH 004 of 6] md: Handle bypassing the read cache (assuming nothing fails).
 [PATCH 005 of 6] md: Allow reads that have bypassed the cache to be retried on failure.
 [PATCH 006 of 6] md: Enable bypassing cache for reads.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-10-31  6:00 [PATCH 000 of 6] md: udev events and cache bypass for reads NeilBrown
@ 2006-10-31  6:00 ` NeilBrown
  2006-10-31 21:16   ` Greg KH
  2006-10-31  6:00 ` [PATCH 002 of 6] md: Change lifetime rules for 'md' devices NeilBrown
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2006-10-31  6:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, gregkh


This allows udev to do something intelligent when an
array becomes available.

cc: gregkh@suse.de
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    2 ++
 1 file changed, 2 insertions(+)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2006-10-31 16:40:52.000000000 +1100
+++ ./drivers/md/md.c	2006-10-31 16:41:02.000000000 +1100
@@ -3200,6 +3200,7 @@ static int do_md_run(mddev_t * mddev)
 
 	mddev->changed = 1;
 	md_new_event(mddev);
+	kobject_uevent(&mddev->gendisk->kobj, KOBJ_ONLINE);
 	return 0;
 }
 
@@ -3313,6 +3314,7 @@ static int do_md_stop(mddev_t * mddev, i
 
 			module_put(mddev->pers->owner);
 			mddev->pers = NULL;
+			kobject_uevent(&mddev->gendisk->kobj, KOBJ_OFFLINE);
 			if (mddev->ro)
 				mddev->ro = 0;
 		}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-10-31  6:00 ` [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops NeilBrown
@ 2006-10-31 21:16   ` Greg KH
  2006-11-02 12:13     ` Kay Sievers
  0 siblings, 1 reply; 26+ messages in thread
From: Greg KH @ 2006-10-31 21:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel

On Tue, Oct 31, 2006 at 05:00:46PM +1100, NeilBrown wrote:
> 
> This allows udev to do something intelligent when an
> array becomes available.
> 
> cc: gregkh@suse.de
> Signed-off-by: Neil Brown <neilb@suse.de>

Acked-by: Greg Kroah-Hartman <gregkh@suse.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-10-31 21:16   ` Greg KH
@ 2006-11-02 12:13     ` Kay Sievers
  2006-11-02 12:32       ` Neil Brown
  0 siblings, 1 reply; 26+ messages in thread
From: Kay Sievers @ 2006-11-02 12:13 UTC (permalink / raw)
  To: Greg KH; +Cc: NeilBrown, Andrew Morton, linux-raid, linux-kernel

On 10/31/06, Greg KH <gregkh@suse.de> wrote:
> On Tue, Oct 31, 2006 at 05:00:46PM +1100, NeilBrown wrote:
> >
> > This allows udev to do something intelligent when an
> > array becomes available.
> >
> > cc: gregkh@suse.de
> > Signed-off-by: Neil Brown <neilb@suse.de>
>
> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>

I don't agree with this, and asked several times to change this to
"change" events, like device-mapper is doing it to address the same
problem. Online/offline is not supported by udev/HAL and will not work
as expected. Please fix this.

Thanks,
Kay

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-02 12:13     ` Kay Sievers
@ 2006-11-02 12:32       ` Neil Brown
  2006-11-02 13:51         ` Kay Sievers
  0 siblings, 1 reply; 26+ messages in thread
From: Neil Brown @ 2006-11-02 12:32 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Greg KH, Andrew Morton, linux-raid, linux-kernel

On Thursday November 2, kay.sievers@vrfy.org wrote:
> On 10/31/06, Greg KH <gregkh@suse.de> wrote:
> > On Tue, Oct 31, 2006 at 05:00:46PM +1100, NeilBrown wrote:
> > >
> > > This allows udev to do something intelligent when an
> > > array becomes available.
> > >
> > > cc: gregkh@suse.de
> > > Signed-off-by: Neil Brown <neilb@suse.de>
> >
> > Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
> 
> I don't agree with this, and asked several times to change this to
> "change" events, like device-mapper is doing it to address the same
> problem. Online/offline is not supported by udev/HAL and will not work
> as expected. Please fix this.

I don't remember who suggested "online/offline", and I don't remember
you suggesting "change", but my memory isn't what it used to be(*), so you
probably did.

Is there some document somewhere that explains exactly what each of
the kobject_actions are meant to mean and how they can be
interpreted?

Anyway, I am happy to change it.  What exactly do you want?
KOBJ_CHANGE both when the array is activated and when it is
deactivated?  Or only when it is activated?
Should ONLINE and OFFLINE remain and CHANGE be added, or should they
go away?
If they remain, should CHANGE come before or after ONLINE (and
OFFLINE)?

I must admit that it feels more like an ONLINE/OFFLINE event than a
CHANGE event to me, but they are just words after all.

What does udev/HAL do with ONLINE/OFFLINE?  Could it be changed to do
"the right thing" for ONLINE? (Not implying that it should be, just
wanting to understand as much of the picture as possible).

Thanks,
NeilBrown

(*) At least I think it isn't what it used to be, but I cannot
remember what it used to be, so I'm not sure :-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-02 12:32       ` Neil Brown
@ 2006-11-02 13:51         ` Kay Sievers
  2006-11-03  6:57           ` Neil Brown
  0 siblings, 1 reply; 26+ messages in thread
From: Kay Sievers @ 2006-11-02 13:51 UTC (permalink / raw)
  To: Neil Brown; +Cc: Greg KH, Andrew Morton, linux-raid, linux-kernel

On Thu, 2006-11-02 at 23:32 +1100, Neil Brown wrote:
> On Thursday November 2, kay.sievers@vrfy.org wrote:
> > On 10/31/06, Greg KH <gregkh@suse.de> wrote:
> > > On Tue, Oct 31, 2006 at 05:00:46PM +1100, NeilBrown wrote:
> > > > This allows udev to do something intelligent when an
> > > > array becomes available.
> > > >
> > > Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
> > 
> > I don't agree with this, and asked several times to change this to
> > "change" events, like device-mapper is doing it to address the same
> > problem. Online/offline is not supported by udev/HAL and will not work
> > as expected. Please fix this.
> 
> I don't remember who suggested "online/offline",

It was probably the first version of the patch for device-mapper which
we got into SLE10, but it changed to "change" in the upstream kernel,
after we all met at OLS and talked about it.

> and I don't remember
> you suggesting "change", but my memory isn't what it used to be(*), so you
> probably did.

It was in the Czech Republic, but we got a few beers... :) And in the
"virtual md devices" conversation.

> Is there some document somewhere that explains exactly what each of
> the kobject_actions are meant to mean and how they can be
> interpreted?

No, there isn't. The thing is, that "online/offline" need to be always
symmetric in it's order. There can't be two "online" events without an
"offline" event. We decided at OLS for the device-mapper events, that we
can't be sure, that there will always be "online/offline" sequences and
can't be sure to make them always match the right sequence. Therefore we
decided to go for a simple "change", and let userspace find out the
current state of the device if needed.

> Anyway, I am happy to change it.  What exactly do you want?
> KOBJ_CHANGE both when the array is activated and when it is
> deactivated?  Or only when it is activated?

We couldn't think of any use of an "offline" event. So we removed the
event when the device-mapper device is suspended.

> Should ONLINE and OFFLINE remain and CHANGE be added, or should they
> go away?

The current idea is to send only a "change" event if something happens
that makes it necessary for udev to reinvestigate the device, like
possible filesystem content that creates /dev/disk/by-* links.

Finer grained device-monitoring is likely better placed by using the
poll() infrastructure for a sysfs file, instead of sending pretty
expensive uevents. 

Udev only hooks into "change" and revalidates all current symlinks for
the device. Udev can run programs on "online", but currently, it will
not update any /dev/disk/by-* link, if the device changes its content.

> If they remain, should CHANGE come before or after ONLINE (and
> OFFLINE)?

> I must admit that it feels more like an ONLINE/OFFLINE event than a
> CHANGE event to me, but they are just words after all.

Yeah, "online/offline" sounds nice, but it will get messy, if you have a
case where you don't need to go offline, but still want to notify a
change ... :)

> What does udev/HAL do with ONLINE/OFFLINE?  Could it be changed to do
> "the right thing" for ONLINE? (Not implying that it should be, just
> wanting to understand as much of the picture as possible).

Sure, it's just software, it definitely could be made to match on
anything. It's just that "change" already works fine today. :)

Thanks,
Kay

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-02 13:51         ` Kay Sievers
@ 2006-11-03  6:57           ` Neil Brown
  2006-11-03  8:22             ` Kay Sievers
  0 siblings, 1 reply; 26+ messages in thread
From: Neil Brown @ 2006-11-03  6:57 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Greg KH, Andrew Morton, linux-raid, linux-kernel

On Thursday November 2, kay.sievers@vrfy.org wrote:
> On Thu, 2006-11-02 at 23:32 +1100, Neil Brown wrote:
> > and I don't remember
> > you suggesting "change", but my memory isn't what it used to be(*), so you
> > probably did.
> 
> It was in the Czech Republic, but we got a few beers... :) And in the
> "virtual md devices" conversation.

Hmm... rings a bell.  I guess I didn't appreciate the important
difference between 'change' and 'online' at the time.  Thanks for
clearing that up.

> 
> We couldn't think of any use of an "offline" event. So we removed the
> event when the device-mapper device is suspended.
> 
> > Should ONLINE and OFFLINE remain and CHANGE be added, or should they
> > go away?
> 
> The current idea is to send only a "change" event if something happens
> that makes it necessary for udev to reinvestigate the device, like
> possible filesystem content that creates /dev/disk/by-* links.
> 
> Finer grained device-monitoring is likely better placed by using the
> poll() infrastructure for a sysfs file, instead of sending pretty
> expensive uevents. 
> 
> Udev only hooks into "change" and revalidates all current symlinks for
> the device. Udev can run programs on "online", but currently, it will
> not update any /dev/disk/by-* link, if the device changes its content.
> 

OK.  Makes sense.
I tried it an got an interesting result....

This is with md generating 'CHANGE' events when an array goes on-line
and when it goes off line, and also with another patch which causes md
devices to disappear when not active so that we get ADD and REMOVE
events at reasonably appropriate times.

It all works fine until I stop an array.
We get a CHANGE event and then a REMOVE event.
And then a seemingly infinite series of ADD/REMOVE pairs.

I guess that udev sees the CHANGE and so opens the device to see what
is there.  By that time the device has disappeared so the open causes
an ADD.  udev doesn't find anything and closes the device which causes
it to disappear and we get a REMOVE.
Now udev sees that ADD and so opens the device again to see what it
there, triggering an ADD.  Nothing is there so we close it and get a
REMOVE.
Now udev sees the second ADD and ....

A bit unfortunate really.  This didn't happen when I had
ONLINE/OFFLINE as udev ignored the OFFLINE.
I guess I can removed the CHANGE at shutdown, but as there really is a
change there, that doesn't seem right.

The real problem is that udev opens the device, and md interprets and
'open' as a request to create the device. And udev see the open and an
ADD and so opens the device....

It's not clear to me what the 'right' thing to do here is:
 - I could stop removing the device on last-close, but I still
   think that (the current situation) is ugly.
 - I could delay the remove until udev will have stopped poking,
   but that is even more ugly
 - udev could avoid opening md devices until it has poked in 
   /sys/block/mdX to see what the status is, but that is very specific
   to md

It would be nice if I could delay the add until later, but that would
require major surgery and probably break the model badly.

On the whole, it seems that udev was designed without thought to the
special needs of md, and md was designed (long ago) without thought
the ugliness that "open creates a device" causes.

Any clever ideas anyone?

NeilBrown

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-03  6:57           ` Neil Brown
@ 2006-11-03  8:22             ` Kay Sievers
  2006-11-06  0:18               ` Neil Brown
  0 siblings, 1 reply; 26+ messages in thread
From: Kay Sievers @ 2006-11-03  8:22 UTC (permalink / raw)
  To: Neil Brown; +Cc: Greg KH, Andrew Morton, linux-raid, linux-kernel

On Fri, 2006-11-03 at 17:57 +1100, Neil Brown wrote:
> On Thursday November 2, kay.sievers@vrfy.org wrote:
> > On Thu, 2006-11-02 at 23:32 +1100, Neil Brown wrote:

> > We couldn't think of any use of an "offline" event. So we removed the
> > event when the device-mapper device is suspended.
> > 
> > > Should ONLINE and OFFLINE remain and CHANGE be added, or should they
> > > go away?
> > 
> > The current idea is to send only a "change" event if something happens
> > that makes it necessary for udev to reinvestigate the device, like
> > possible filesystem content that creates /dev/disk/by-* links.
> > 
> > Finer grained device-monitoring is likely better placed by using the
> > poll() infrastructure for a sysfs file, instead of sending pretty
> > expensive uevents. 
> > 
> > Udev only hooks into "change" and revalidates all current symlinks for
> > the device. Udev can run programs on "online", but currently, it will
> > not update any /dev/disk/by-* link, if the device changes its content.
> > 
> 
> OK.  Makes sense.
> I tried it an got an interesting result....
> 
> This is with md generating 'CHANGE' events when an array goes on-line
> and when it goes off line, and also with another patch which causes md
> devices to disappear when not active so that we get ADD and REMOVE
> events at reasonably appropriate times.
> 
> It all works fine until I stop an array.
> We get a CHANGE event and then a REMOVE event.
> And then a seemingly infinite series of ADD/REMOVE pairs.
> 
> I guess that udev sees the CHANGE and so opens the device to see what
> is there.  By that time the device has disappeared so the open causes
> an ADD.  udev doesn't find anything and closes the device which causes
> it to disappear and we get a REMOVE.
> Now udev sees that ADD and so opens the device again to see what it
> there, triggering an ADD.  Nothing is there so we close it and get a
> REMOVE.
> Now udev sees the second ADD and ....

Hmm, why does the open() of device node of a stopped device cause an "add"?
Shouldn't it just return a failure, instead of creating a device?

> A bit unfortunate really.  This didn't happen when I had
> ONLINE/OFFLINE as udev ignored the OFFLINE.
> I guess I can removed the CHANGE at shutdown, but as there really is a
> change there, that doesn't seem right.

Yeah, it's the same problem we had with device-mapper, nobody could
think of any useful action at a dm-device suspend "change"-event, so we
didn't add it. :)   

> The real problem is that udev opens the device, and md interprets and
> 'open' as a request to create the device. And udev see the open and an
> ADD and so opens the device....

Yes, current udev rules are written to to so, md needs to be excluded
from the list of block devices which are handled by the default
persistent naming rules, and moved to its own rules file. We did the
same for device-mapper to ignore some "private" dm-* volumes like
snapshot devices. 

> It's not clear to me what the 'right' thing to do here is:
>  - I could stop removing the device on last-close, but I still
>    think that (the current situation) is ugly.
>  - I could delay the remove until udev will have stopped poking,
>    but that is even more ugly
>  - udev could avoid opening md devices until it has poked in 
>    /sys/block/mdX to see what the status is, but that is very specific
>    to md
> 
> It would be nice if I could delay the add until later, but that would
> require major surgery and probably break the model badly.
> 
> On the whole, it seems that udev was designed without thought to the
> special needs of md, and md was designed (long ago) without thought
> the ugliness that "open creates a device" causes.

The persistent naming rules for /dev/disk/by-* are causing this. Md
devices will probably just get their own rules file, which will handle
this and which can be packaged and installed along with the md tools.

If it's acceptable for you, so leave the shutdown "change" event out for
now, until someone has the need for it.
We will update the rules in the meantime, and read a sysfs file or call
a md-tool to query the current state of the device on "add" and "change"
events, this will prevent the opening of the device when it's not
supposed to do so.

Thanks,
Kay


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-03  8:22             ` Kay Sievers
@ 2006-11-06  0:18               ` Neil Brown
  2006-11-06  8:38                 ` dean gaudet
  2006-11-08 11:14                 ` Kay Sievers
  0 siblings, 2 replies; 26+ messages in thread
From: Neil Brown @ 2006-11-06  0:18 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Greg KH, Andrew Morton, linux-raid, linux-kernel

On Friday November 3, kay.sievers@vrfy.org wrote:
> 
> Hmm, why does the open() of device node of a stopped device cause an "add"?
> Shouldn't it just return a failure, instead of creating a device?

Because that is the API I inherited.  To create an MD array, you open
/dev/mdX and issue some IOCTLs.  Originally I think the devices were
all created at boot/module-load time much like they still are for
loop.c.  But when Al Viro did all that work with kmap and blkdev_get
ages ago he changed it so they didn't have to pre-created but rather
were created on-the-fly by an attempt to open the block device (this
calls in to md_probe which does the add_disk).

This creates a deep disconnect between udev and md.
udev expects a device to appear first, then it created the
device-special-file in /dev.
md expect the device-special-file to exist first, and then created the
device on the first open.

> 
> > A bit unfortunate really.  This didn't happen when I had
> > ONLINE/OFFLINE as udev ignored the OFFLINE.
> > I guess I can removed the CHANGE at shutdown, but as there really is a
> > change there, that doesn't seem right.
> 
> Yeah, it's the same problem we had with device-mapper, nobody could
> think of any useful action at a dm-device suspend "change"-event, so we
> didn't add it. :)   
> 

Yes... the device cannot disappear until no-one is using it, so no-one
will be interested in it going away.

> 
> The persistent naming rules for /dev/disk/by-* are causing this. Md
> devices will probably just get their own rules file, which will handle
> this and which can be packaged and installed along with the md tools.
> 
> If it's acceptable for you, so leave the shutdown "change" event out for
> now, until someone has the need for it.

Yes, I'll get rid of the online/offline events and just put in a
CHANGE when the array becomes available.

I'm still a bit concerned about the open->add->open infinite loop.
If anyone opens /dev/mdX while it isn't active (e.g. to check if it is
active), that will (given a patch that I would like to include) cause
and ADD event which will cause udev to start it's loop again.
Can we make udev ignore ADD for md and only watch for CHANGE?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-06  0:18               ` Neil Brown
@ 2006-11-06  8:38                 ` dean gaudet
  2006-11-07  5:05                   ` Neil Brown
  2006-11-08 11:14                 ` Kay Sievers
  1 sibling, 1 reply; 26+ messages in thread
From: dean gaudet @ 2006-11-06  8:38 UTC (permalink / raw)
  To: Neil Brown; +Cc: Kay Sievers, Greg KH, Andrew Morton, linux-raid, linux-kernel

On Mon, 6 Nov 2006, Neil Brown wrote:

> This creates a deep disconnect between udev and md.
> udev expects a device to appear first, then it created the
> device-special-file in /dev.
> md expect the device-special-file to exist first, and then created the
> device on the first open.

could you create a special /dev/mdx device which is used to 
assemble/create arrays only?  i mean literally "mdx" not "mdX" where X is 
a number.  mdx would always be there if md module is loaded... so udev 
would see the driver appear and then create the /dev/mdx.  then mdadm 
would use /dev/mdx to do assemble/creates/whatever and cause other devices 
to appear/disappear in a manner which udev is happy with.

(much like how /dev/ptmx is used to create /dev/pts/N entries.)

doesn't help legacy mdadm binaries... but seems like it fits the New World 
Order.

or hm i suppose the New World Order is to eschew binary interfaces and 
suggest a /sys/class/md/ hierarchy with a bunch of files you have to splat 
ascii data into to cause an array to be created/assembled.

-dean

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-06  8:38                 ` dean gaudet
@ 2006-11-07  5:05                   ` Neil Brown
  2006-11-09 10:10                     ` Michael Tokarev
  0 siblings, 1 reply; 26+ messages in thread
From: Neil Brown @ 2006-11-07  5:05 UTC (permalink / raw)
  To: dean gaudet; +Cc: Kay Sievers, Greg KH, Andrew Morton, linux-raid, linux-kernel

On Monday November 6, dean@arctic.org wrote:
> On Mon, 6 Nov 2006, Neil Brown wrote:
> 
> > This creates a deep disconnect between udev and md.
> > udev expects a device to appear first, then it created the
> > device-special-file in /dev.
> > md expect the device-special-file to exist first, and then created the
> > device on the first open.
> 
> could you create a special /dev/mdx device which is used to 
> assemble/create arrays only?  i mean literally "mdx" not "mdX" where X is 
> a number.  mdx would always be there if md module is loaded... so udev 
> would see the driver appear and then create the /dev/mdx.  then mdadm 
> would use /dev/mdx to do assemble/creates/whatever and cause other devices 
> to appear/disappear in a manner which udev is happy with.
> 
> (much like how /dev/ptmx is used to create /dev/pts/N entries.)
> 
> doesn't help legacy mdadm binaries... but seems like it fits the New World 
> Order.
> 
> or hm i suppose the New World Order is to eschew binary interfaces and 
> suggest a /sys/class/md/ hierarchy with a bunch of files you have to splat 
> ascii data into to cause an array to be created/assembled.

I have the following patch sitting in my patch queue (since about
March).
It does what you suggest via /sys/module/md-mod/parameters/MAGIC_FILE
which is the only md-specific part of the /sys namespace that I could
find.

However I'm not at all convinced that it is a good idea.  I would much
rather have mdadm control device naming than leave it up to udev.

An in any case, we have the semantic that opening an md device-file
creates the device, and we cannot get rid of that semantic without a
lot of warning and a lot of pain.  And adding a new semantic isn't
really going to help.

We simply need to find the best way for udev and md to play together,
and I think we can achieve something quite workable.  Both sides just
have to give a bit.

NeilBrown


Allow md devices to be created by writing to sysfs.

Until now, to create an md device, you needed to open the relevant 
device-special file.  This created a catch-22 with udev.

This patch provides an alternate.
Options include

 echo 10 > /sys/module/md-mod/paramters/create
to create legacy device with minor 10,
 echo d10 > /sys/module/md-mod/paramters/create
to create partitionable device 10<<6
 cat /sys/module/md-mod/paramters/next_free_legacy
to return major:minor device of an unused legacy array, which will exist.
 cat /sys/module/md-mod/paramters/next_free_partitionable
to return major:minor device of an unused partitionable array which will exist.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-03-27 09:20:58.000000000 +1100
+++ ./drivers/md/md.c	2006-03-27 09:20:56.000000000 +1100
@@ -5525,6 +5525,62 @@ module_param_call(start_ro, set_ro, get_
 module_param(start_dirty_degraded, int, 0644);
 
 
+static int md_create(const char *val, struct kernel_param *kp)
+{
+	/* NN or dNN creates the numbered device */
+	int part = 0;
+	int num;
+	char *e;
+	if (*val == 'd' || *val == 'p') {
+		part = 1;
+		val++;
+	}
+	num = simple_strtoul(val, &e, 10);
+	if (*val && (*e == '\0' || *e == '\n')) {
+		/* success! */
+		dev_t dev;
+		if (part)
+			dev = MKDEV(mdp_major, num << MdpMinorShift);
+		else
+			dev = MKDEV(MD_MAJOR, num);
+		md_probe(dev, NULL, NULL);
+		return 0;
+	}
+	return -EINVAL;
+}
+static int md_next_free(char *buffer, struct kernel_param *kp)
+{
+	mddev_t *mddev;
+	int major = MD_MAJOR;
+	int inc = 1;
+	int next = MKDEV(MD_MAJOR,0);
+	if (kp->arg) {
+		next = MKDEV(mdp_major,0);
+		major = mdp_major;
+		inc = 1 << MdpMinorShift;
+	}
+	spin_lock(&all_mddevs_lock);
+	list_for_each_entry(mddev, &all_mddevs, all_mddevs)
+		if (MAJOR(mddev->unit) == major) {
+			if (atomic_read(&mddev->active)<=1 &&
+				mddev->pers == NULL &&
+				mddev->raid_disks == 0) {
+				next = mddev->unit;
+				break;
+			} else if (mddev->unit >= next)
+				next = mddev->unit + inc;
+		}
+	spin_unlock(&all_mddevs_lock);
+	md_probe(next, NULL, NULL);
+	return sprintf(buffer, "%d:%d", major, MINOR(next));
+}
+static int ignore(const char *val, struct kernel_param *kp) { return -EINVAL; }
+
+
+module_param_call(create, md_create, NULL, NULL, 0200);
+module_param_call(next_free_legacy, ignore, md_next_free, (void*)0, 0400);
+module_param_call(next_free_partitionable, ignore, md_next_free, (void*)1, 0400);
+
 EXPORT_SYMBOL(register_md_personality);
 EXPORT_SYMBOL(unregister_md_personality);
 EXPORT_SYMBOL(md_error);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-07  5:05                   ` Neil Brown
@ 2006-11-09 10:10                     ` Michael Tokarev
  2006-11-09 10:17                       ` Michael Tokarev
  0 siblings, 1 reply; 26+ messages in thread
From: Michael Tokarev @ 2006-11-09 10:10 UTC (permalink / raw)
  To: Neil Brown
  Cc: dean gaudet, Kay Sievers, Greg KH, Andrew Morton, linux-raid,
	linux-kernel

Neil Brown wrote:
[/dev/mdx...]
>> (much like how /dev/ptmx is used to create /dev/pts/N entries.)
[]
> I have the following patch sitting in my patch queue (since about
> March).
> It does what you suggest via /sys/module/md-mod/parameters/MAGIC_FILE
> which is the only md-specific part of the /sys namespace that I could
> find.
> 
> However I'm not at all convinced that it is a good idea.  I would much
> rather have mdadm control device naming than leave it up to udev.

This is again the same "device naming" question as pops up every time
someone mentions udev.  And as usual, I'm suggesting the following, which
should - hopefully - make everyone happy:

  create kernel names *always*, be it /dev/mdN or /dev/sdF or whatever,
  so that things like /proc/partitions, /proc/mdstat etc will be useful.
  For this, the ideal solution - IMHO - is to have mini-devfs-like filesystem
  mounted as /dev, so that it is possible to have "bare" names without any
  help from any external programs like udev, but I don't want to start another
  flamewar here, esp. since it's off-topic to *this* discussion.
  Note /dev/mdN is as good as /dev/md/N - because there are only a few active
  devices wich appear in /dev, there's no "risk" to have "too many" files in
  /dev, hence no need to put them into subdirs like /dev/md/, /dev/sd/ etc.

  if so desired, create *symlinks* at /dev with appropriate user-controlled
  names to those official kernel device nodes.  Be it like /dev/disk/by-label/
  or /dev/cdrom0 or whatever.
  The links can be created by mdadm, OR by udev - in this case, it's really
  irrelevant.  Udev rules does a good job of creating /dev/disk/ hierarchy
  already, and that seems to be sufficient - i see no reason to make other
  device nodes (symlinks) by mdadm.

By the way, unlike /dev/sdE and /dev/hdF entries, /dev/mdN nodes are pretty
stable.  Even if scsi disks gets reordered, mdadm finds the component devices
by UUID (if DEVICE partitions is given in config file), and you have /dev/md1
pointing to the same "logical partition" (have the same filesystem or data)
regardless how you shuffle your disks (IF mdadm was able to find all components
and assemble the array, anyway).  So sometimes, I use md/mdadm on systems
WITHOUT any "raided" drives, but where I suspect disk devices may change for
whatever reason - I just create raid0 "arrays" composed of a single partition
and let mdadm to find them in /dev/sd* and to assemble stable-numbered /dev/mdN
devices - without any help of udev or anything else (I for one dislike udev for
several reasons).

> An in any case, we have the semantic that opening an md device-file
> creates the device, and we cannot get rid of that semantic without a
> lot of warning and a lot of pain.  And adding a new semantic isn't
> really going to help.

I don't think so.  With new semantic in place, we've two options (provided
current semantics stays, and I don't see a strong reason why it should be
removed except of the bloat):

 a) with new mdadm utilizing new semantics, there's nothing to change in udev --
    it will all Just Work, by mdadm opening /dev/md-control-node (how it's called)
    and assembling devices using that, and during assemble, udev will receive proper
    events about new "disks" appearing and will handle that as usual.

 b) without new mdadm, it will work as before (now).  And in this case, let's not
    send any udev events, as mdadm already created the nodes etc.

So if a user wants neat and nice md/udev integration, the way to go is case "a".
If it's not required, either case will do.

Sure, eventually, long term, support for case "b" can be removed.  Or not - depending
on how the things will be implemented, because when done properly, both cases will
call the same routine(s), but case "b" will just skip sending uevents, so ioctl handlers
becomes two- or one-liners (two in case a and one in case b), which isn't bloat really ;)

/mjt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-09 10:10                     ` Michael Tokarev
@ 2006-11-09 10:17                       ` Michael Tokarev
  0 siblings, 0 replies; 26+ messages in thread
From: Michael Tokarev @ 2006-11-09 10:17 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Neil Brown, dean gaudet, Kay Sievers, Greg KH, Andrew Morton,
	linux-raid, linux-kernel

Michael Tokarev wrote:
> Neil Brown wrote:
> [/dev/mdx...]
[]
>> An in any case, we have the semantic that opening an md device-file
>> creates the device, and we cannot get rid of that semantic without a
>> lot of warning and a lot of pain.  And adding a new semantic isn't
>> really going to help.
> 
> I don't think so.  With new semantic in place, we've two options (provided
> current semantics stays, and I don't see a strong reason why it should be
> removed except of the bloat):
> 
>  a) with new mdadm utilizing new semantics, there's nothing to change in udev --
>     it will all Just Work, by mdadm opening /dev/md-control-node (how it's called)
>     and assembling devices using that, and during assemble, udev will receive proper
>     events about new "disks" appearing and will handle that as usual.
> 
>  b) without new mdadm, it will work as before (now).  And in this case, let's not
>     send any udev events, as mdadm already created the nodes etc.

Forgot to add.  This is important point: do NOT change current behavour wrt uevents,
ie, don't add uevents for current semantics at all.  Only send uevents (and in this
case it will be normal "add" and "remove" events) when assembling arrays "the new way",
using (stable!) /dev/mdcontrol misc device, after RUN_ARRAY and STOP_ARRAY actions has
been performed.

/mjt

> So if a user wants neat and nice md/udev integration, the way to go is case "a".
> If it's not required, either case will do.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-06  0:18               ` Neil Brown
  2006-11-06  8:38                 ` dean gaudet
@ 2006-11-08 11:14                 ` Kay Sievers
  2006-11-09  0:17                   ` Neil Brown
  1 sibling, 1 reply; 26+ messages in thread
From: Kay Sievers @ 2006-11-08 11:14 UTC (permalink / raw)
  To: Neil Brown; +Cc: Greg KH, Andrew Morton, linux-raid, linux-kernel

On Mon, 2006-11-06 at 11:18 +1100, Neil Brown wrote:
> On Friday November 3, kay.sievers@vrfy.org wrote:

> > The persistent naming rules for /dev/disk/by-* are causing this. Md
> > devices will probably just get their own rules file, which will handle
> > this and which can be packaged and installed along with the md tools.

> I'm still a bit concerned about the open->add->open infinite loop.
> If anyone opens /dev/mdX while it isn't active (e.g. to check if it is
> active), that will (given a patch that I would like to include) cause
> and ADD event which will cause udev to start it's loop again.
> Can we make udev ignore ADD for md and only watch for CHANGE?

Is there a sysfs file or something similar(we could also call a md-tool)
udev could look at, before it tries to open the device? Like:
  KERNEL=="md*", ATTR{state}=="active", IMPORT{program}= ...

If we currently ignore the "add" event, then we will not hook into the
coldplug logic, where "add" events are requested for all devices to do
the initial setup after bootup.

If we can't read the state of the md device, to see if it's safe to open
the device, we would need to be smarter with the coldplug logic by
requesting "change" events if necessary, or by passing a "coldplug" flag
with the synthesized event.

Thanks,
Kay

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
  2006-11-08 11:14                 ` Kay Sievers
@ 2006-11-09  0:17                   ` Neil Brown
  0 siblings, 0 replies; 26+ messages in thread
From: Neil Brown @ 2006-11-09  0:17 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Greg KH, Andrew Morton, linux-raid, linux-kernel

On Wednesday November 8, kay.sievers@vrfy.org wrote:
> 
> Is there a sysfs file or something similar(we could also call a md-tool)
> udev could look at, before it tries to open the device? Like:
>   KERNEL=="md*", ATTR{state}=="active", IMPORT{program}= ...

If the /sys/block/mdX directory exists at all, it is safe to open the
device-special file.  But that is racy.  It could disappear between
checking that the dir exists, and opening the device-special-file.

I still think it would make SO much sense if /sys/block/md4/dev were a
device-special-file instead of a (silly) ascii file with 9:4.  Then
this race could be closed.  But I feel that is a battle I've never
going to win.

You could look at /sys/block/mdX/md/array_state.  If that contains
'clean' or 'inactive' then there is no point opening the device.
Otherwise there might be a point, and the race would be a lot harder
to lose.

I guess it is time for me to learn about udev config files...

NeilBrown

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 002 of 6] md: Change lifetime rules for 'md' devices.
  2006-10-31  6:00 [PATCH 000 of 6] md: udev events and cache bypass for reads NeilBrown
  2006-10-31  6:00 ` [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops NeilBrown
@ 2006-10-31  6:00 ` NeilBrown
  2006-10-31  8:22   ` Andrew Morton
  2006-10-31  6:00 ` [PATCH 003 of 6] md: Define raid5_mergeable_bvec NeilBrown
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: NeilBrown @ 2006-10-31  6:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, gregkh


Currently md devices are created when first opened and remain in existence
until the module is unloaded.
This isn't a major problem, but it somewhat ugly.

This patch changes the lifetime rules so that an md device will
disappear on the last close if it has no state.

Locking rules depend on bd_mutex being held in do_open and
__blkdev_put, and on setting bd_disk->private_data to 'mddev'.

There is room for a race because md_probe is called early in do_open
(get_gendisk) to create the mddev.  As this isn't protected by
bd_mutex, a concurrent call to md_close can destroy that mddev before
do_open calls md_open to get a reference on it.
md_open and md_close are serialised by md_mutex so the worst that
can happen is that md_open finds that the mddev structure doesn't
exist after all.  In this case bd_disk->private_data will be NULL,
and md_open chooses to exit with -EBUSY in this case, which is
arguable and appropriate result.

The new 'dead' field in mddev is used to track whether it is time
to destroy the mddev (if a last-close happens).  It is cleared when
any state is create (set_array_info) and set when the array is stopped
(do_md_stop).

mddev_put becomes simpler. It just destroys the mddev when the
refcount hits zero.  This will normally be the reference held in
bd_disk->private_data.
  

cc: gregkh@suse.de
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c           |   35 +++++++++++++++++++++++++----------
 ./include/linux/raid/md_k.h |    3 +++
 2 files changed, 28 insertions(+), 10 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2006-10-31 16:41:02.000000000 +1100
+++ ./drivers/md/md.c	2006-10-31 16:41:14.000000000 +1100
@@ -226,13 +226,14 @@ static void mddev_put(mddev_t *mddev)
 {
 	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
 		return;
-	if (!mddev->raid_disks && list_empty(&mddev->disks)) {
-		list_del(&mddev->all_mddevs);
-		spin_unlock(&all_mddevs_lock);
-		blk_cleanup_queue(mddev->queue);
-		kobject_unregister(&mddev->kobj);
-	} else
-		spin_unlock(&all_mddevs_lock);
+	list_del(&mddev->all_mddevs);
+	spin_unlock(&all_mddevs_lock);
+
+	blk_cleanup_queue(mddev->queue);
+	mddev->queue = NULL;
+	del_gendisk(mddev->gendisk);
+	mddev->gendisk = NULL;
+	kobject_unregister(&mddev->kobj);
 }
 
 static mddev_t * mddev_find(dev_t unit)
@@ -273,6 +274,7 @@ static mddev_t * mddev_find(dev_t unit)
 	atomic_set(&new->active, 1);
 	spin_lock_init(&new->write_lock);
 	init_waitqueue_head(&new->sb_wait);
+	new->dead = 1;
 
 	new->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!new->queue) {
@@ -3362,6 +3364,8 @@ static int do_md_stop(mddev_t * mddev, i
 		disk = mddev->gendisk;
 		if (disk)
 			set_capacity(disk, 0);
+		mddev->dead = 1;
+
 		mddev->changed = 1;
 	} else if (mddev->pers)
 		printk(KERN_INFO "md: %s switched to read-only mode.\n",
@@ -4293,7 +4297,8 @@ static int md_ioctl(struct inode *inode,
 					printk(KERN_WARNING "md: couldn't set"
 					       " array info. %d\n", err);
 					goto abort_unlock;
-				}
+				} else
+					mddev->dead = 0;
 			}
 			goto done_unlock;
 
@@ -4377,6 +4382,8 @@ static int md_ioctl(struct inode *inode,
 				err = -EFAULT;
 			else
 				err = add_new_disk(mddev, &info);
+			if (!err)
+				mddev->dead = 0;
 			goto done_unlock;
 		}
 
@@ -4423,8 +4430,12 @@ static int md_open(struct inode *inode, 
 	 * Succeed if we can lock the mddev, which confirms that
 	 * it isn't being stopped right now.
 	 */
-	mddev_t *mddev = inode->i_bdev->bd_disk->private_data;
-	int err;
+	mddev_t *mddev;
+	int err = -EBUSY;
+
+	mddev = inode->i_bdev->bd_disk->private_data;
+	if (!mddev)
+		goto out;
 
 	if ((err = mutex_lock_interruptible_nested(&mddev->reconfig_mutex, 1)))
 		goto out;
@@ -4443,6 +4454,10 @@ static int md_release(struct inode *inod
  	mddev_t *mddev = inode->i_bdev->bd_disk->private_data;
 
 	BUG_ON(!mddev);
+	if (inode->i_bdev->bd_openers == 0 && mddev->dead) {
+		inode->i_bdev->bd_disk->private_data = NULL;
+		mddev_put(mddev);
+	}
 	mddev_put(mddev);
 
 	return 0;

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2006-10-31 16:40:51.000000000 +1100
+++ ./include/linux/raid/md_k.h	2006-10-31 16:41:14.000000000 +1100
@@ -119,6 +119,9 @@ struct mddev_s
 #define MD_CHANGE_PENDING 2	/* superblock update in progress */
 
 	int				ro;
+	int				dead; /* array should be discarded on
+					       * last close
+					       */
 
 	struct gendisk			*gendisk;
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 002 of 6] md: Change lifetime rules for 'md' devices.
  2006-10-31  6:00 ` [PATCH 002 of 6] md: Change lifetime rules for 'md' devices NeilBrown
@ 2006-10-31  8:22   ` Andrew Morton
  2006-10-31  9:09     ` Neil Brown
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2006-10-31  8:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel, gregkh

On Tue, 31 Oct 2006 17:00:51 +1100
NeilBrown <neilb@suse.de> wrote:

> Currently md devices are created when first opened and remain in existence
> until the module is unloaded.
> This isn't a major problem, but it somewhat ugly.
> 
> This patch changes the lifetime rules so that an md device will
> disappear on the last close if it has no state.

This kills the G5:


EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=4 
Modules linked in:
NIP: C0000000001A31B8 LR: C00000000018E5DC CTR: C0000000001A3404
REGS: c0000000017ff4a0 TRAP: 0300   Not tainted  (2.6.19-rc4-mm1)
MSR: 9000000000009032 <EE,ME,IR,DR>  CR: 84000048  XER: 00000000
DAR: 6B6B6B6B6B6B6BB3, DSISR: 0000000040000000
TASK = c00000000ff2b7f0[1899] 'nash' THREAD: c0000000017fc000 CPU: 1
GPR00: 0000000000000008 C0000000017FF720 C0000000006B26D0 6B6B6B6B6B6B6B7B 
GPR04: 0000000000000002 0000000000000001 0000000000000001 00000000000200D0 
GPR08: 0000000000050C00 C0000000001A3404 0000000000000000 C0000000001A318C 
GPR12: 0000000084000044 C000000000535680 00000000100FE350 00000000100FE7B8 
GPR16: 00000000FFFFFFFF 00000000FFFFFFFF 0000000000000000 0000000000000000 
GPR20: 0000000010005CD4 0000000010090000 0000000000000000 0000000010007E54 
GPR24: 0000000000000000 C000000000472C80 6B6B6B6B6B6B6B7B C000000001FD2530 
GPR28: C000000007B8C2F0 6B6B6B6B6B6B6B7B C00000000057DAE8 C0000000079A2550 
NIP [C0000000001A31B8] .kobject_uevent+0xac/0x55c
LR [C00000000018E5DC] .__elv_unregister_queue+0x20/0x44
Call Trace:
[C0000000017FF720] [C000000000562508] read_pipe_fops+0xd0/0xd8 (unreliable)
[C0000000017FF840] [C00000000018E5DC] .__elv_unregister_queue+0x20/0x44
[C0000000017FF8D0] [C000000000195548] .blk_unregister_queue+0x58/0x9c
[C0000000017FF960] [C00000000019683C] .unlink_gendisk+0x1c/0x50
[C0000000017FF9F0] [C000000000122840] .del_gendisk+0x98/0x22c
[C0000000017FFA90] [C00000000035B56C] .mddev_put+0xa0/0xe0
[C0000000017FFB20] [C000000000362178] .md_release+0x84/0x9c
[C0000000017FFBA0] [C0000000000FDDE0] .__blkdev_put+0x204/0x220
[C0000000017FFC50] [C0000000000C765C] .__fput+0x234/0x274
[C0000000017FFD00] [C0000000000C5264] .filp_close+0x6c/0xfc
[C0000000017FFD90] [C0000000000C53B8] .sys_close+0xc4/0x178
[C0000000017FFE30] [C00000000000872C] syscall_exit+0x0/0x40
Instruction dump:
4e800420 00000020 00000250 00000278 00000280 00000258 00000260 00000268 
00000270 3b200000 2fb90000 419e003c <e81a0038> 2fa00000 7c1d0378 409e0070 
 <7>eth0: no IPv6 routers present

It happens during initscripts.  The machine has no MD devices.  config is
at http://userweb.kernel.org/~akpm/config-g5.txt

Also, it'd be nice to enable CONFIG_MUST_CHECK and take a look at a few
things...

drivers/md/md.c: In function `bind_rdev_to_array':
drivers/md/md.c:1379: warning: ignoring return value of `kobject_add', declared with attribute warn_unused_result
drivers/md/md.c:1385: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result
drivers/md/md.c: In function `md_probe':
drivers/md/md.c:2986: warning: ignoring return value of `kobject_register', declared with attribute warn_unused_result
drivers/md/md.c: In function `do_md_run':
drivers/md/md.c:3135: warning: ignoring return value of `sysfs_create_group', declared with attribute warn_unused_result
drivers/md/md.c:3150: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result
drivers/md/md.c: In function `md_check_recovery':
drivers/md/md.c:5446: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 002 of 6] md: Change lifetime rules for 'md' devices.
  2006-10-31  8:22   ` Andrew Morton
@ 2006-10-31  9:09     ` Neil Brown
  2006-10-31  9:15       ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Neil Brown @ 2006-10-31  9:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: jens.axboe, linux-raid, linux-kernel, gregkh

On Tuesday October 31, akpm@osdl.org wrote:
> On Tue, 31 Oct 2006 17:00:51 +1100
> NeilBrown <neilb@suse.de> wrote:
> 
> > Currently md devices are created when first opened and remain in existence
> > until the module is unloaded.
> > This isn't a major problem, but it somewhat ugly.
> > 
> > This patch changes the lifetime rules so that an md device will
> > disappear on the last close if it has no state.
> 
> This kills the G5:
> 
> 
> EXT3-fs: recovery complete.
> EXT3-fs: mounted filesystem with ordered data mode.
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=4 
> Modules linked in:
> NIP: C0000000001A31B8 LR: C00000000018E5DC CTR: C0000000001A3404
> REGS: c0000000017ff4a0 TRAP: 0300   Not tainted  (2.6.19-rc4-mm1)
> MSR: 9000000000009032 <EE,ME,IR,DR>  CR: 84000048  XER: 00000000
> DAR: 6B6B6B6B6B6B6BB3, DSISR: 0000000040000000
> TASK = c00000000ff2b7f0[1899] 'nash' THREAD: c0000000017fc000 CPU: 1
> GPR00: 0000000000000008 C0000000017FF720 C0000000006B26D0 6B6B6B6B6B6B6B7B 
..
> NIP [C0000000001A31B8] .kobject_uevent+0xac/0x55c
> LR [C00000000018E5DC] .__elv_unregister_queue+0x20/0x44
> Call Trace:
> [C0000000017FF720] [C000000000562508] read_pipe_fops+0xd0/0xd8 (unreliable)
> [C0000000017FF840] [C00000000018E5DC] .__elv_unregister_queue+0x20/0x44
> [C0000000017FF8D0] [C000000000195548] .blk_unregister_queue+0x58/0x9c
> [C0000000017FF960] [C00000000019683C] .unlink_gendisk+0x1c/0x50
> [C0000000017FF9F0] [C000000000122840] .del_gendisk+0x98/0x22c

I'm guessing we need

diff .prev/block/elevator.c ./block/elevator.c
--- .prev/block/elevator.c	2006-10-31 20:06:22.000000000 +1100
+++ ./block/elevator.c	2006-10-31 20:06:40.000000000 +1100
@@ -926,7 +926,7 @@ static void __elv_unregister_queue(eleva
 
 void elv_unregister_queue(struct request_queue *q)
 {
-	if (q)
+	if (q && q->elevator)
 		__elv_unregister_queue(q->elevator);
 }


Jens?  md never registers and elevator for its queue.

> 
> Also, it'd be nice to enable CONFIG_MUST_CHECK and take a look at a few
> things...
> 
> drivers/md/md.c: In function `bind_rdev_to_array':
> drivers/md/md.c:1379: warning: ignoring return value of `kobject_add', declared with attribute warn_unused_result
> drivers/md/md.c:1385: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result
> drivers/md/md.c: In function `md_probe':
> drivers/md/md.c:2986: warning: ignoring return value of `kobject_register', declared with attribute warn_unused_result
> drivers/md/md.c: In function `do_md_run':
> drivers/md/md.c:3135: warning: ignoring return value of `sysfs_create_group', declared with attribute warn_unused_result
> drivers/md/md.c:3150: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result
> drivers/md/md.c: In function `md_check_recovery':
> drivers/md/md.c:5446: warning: ignoring return value of `sysfs_create_link', declared with attribute warn_unused_result

I guess... I saw mail a while ago about why we really should be
checking those.  I'll have to dig it up again.

NeilBrown

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 002 of 6] md: Change lifetime rules for 'md' devices.
  2006-10-31  9:09     ` Neil Brown
@ 2006-10-31  9:15       ` Jens Axboe
  2006-10-31  9:26         ` Neil Brown
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2006-10-31  9:15 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-raid, linux-kernel, gregkh

On Tue, Oct 31 2006, Neil Brown wrote:
> On Tuesday October 31, akpm@osdl.org wrote:
> > On Tue, 31 Oct 2006 17:00:51 +1100
> > NeilBrown <neilb@suse.de> wrote:
> > 
> > > Currently md devices are created when first opened and remain in existence
> > > until the module is unloaded.
> > > This isn't a major problem, but it somewhat ugly.
> > > 
> > > This patch changes the lifetime rules so that an md device will
> > > disappear on the last close if it has no state.
> > 
> > This kills the G5:
> > 
> > 
> > EXT3-fs: recovery complete.
> > EXT3-fs: mounted filesystem with ordered data mode.
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > SMP NR_CPUS=4 
> > Modules linked in:
> > NIP: C0000000001A31B8 LR: C00000000018E5DC CTR: C0000000001A3404
> > REGS: c0000000017ff4a0 TRAP: 0300   Not tainted  (2.6.19-rc4-mm1)
> > MSR: 9000000000009032 <EE,ME,IR,DR>  CR: 84000048  XER: 00000000
> > DAR: 6B6B6B6B6B6B6BB3, DSISR: 0000000040000000
> > TASK = c00000000ff2b7f0[1899] 'nash' THREAD: c0000000017fc000 CPU: 1
> > GPR00: 0000000000000008 C0000000017FF720 C0000000006B26D0 6B6B6B6B6B6B6B7B 
> ..
> > NIP [C0000000001A31B8] .kobject_uevent+0xac/0x55c
> > LR [C00000000018E5DC] .__elv_unregister_queue+0x20/0x44
> > Call Trace:
> > [C0000000017FF720] [C000000000562508] read_pipe_fops+0xd0/0xd8 (unreliable)
> > [C0000000017FF840] [C00000000018E5DC] .__elv_unregister_queue+0x20/0x44
> > [C0000000017FF8D0] [C000000000195548] .blk_unregister_queue+0x58/0x9c
> > [C0000000017FF960] [C00000000019683C] .unlink_gendisk+0x1c/0x50
> > [C0000000017FF9F0] [C000000000122840] .del_gendisk+0x98/0x22c
> 
> I'm guessing we need
> 
> diff .prev/block/elevator.c ./block/elevator.c
> --- .prev/block/elevator.c	2006-10-31 20:06:22.000000000 +1100
> +++ ./block/elevator.c	2006-10-31 20:06:40.000000000 +1100
> @@ -926,7 +926,7 @@ static void __elv_unregister_queue(eleva
>  
>  void elv_unregister_queue(struct request_queue *q)
>  {
> -	if (q)
> +	if (q && q->elevator)
>  		__elv_unregister_queue(q->elevator);
>  }
> 
> 
> Jens?  md never registers and elevator for its queue.

Hmm, but blk_unregister_queue() doesn't call elv_unregister_queue()
unless q->request_fn is set. And in that case, you must have an io
scheduler attached.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 002 of 6] md: Change lifetime rules for 'md' devices.
  2006-10-31  9:15       ` Jens Axboe
@ 2006-10-31  9:26         ` Neil Brown
  2006-10-31  9:30           ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Neil Brown @ 2006-10-31  9:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-raid, linux-kernel, gregkh

On Tuesday October 31, jens.axboe@oracle.com wrote:
> On Tue, Oct 31 2006, Neil Brown wrote:
> > 
> > I'm guessing we need
> > 
> > diff .prev/block/elevator.c ./block/elevator.c
> > --- .prev/block/elevator.c	2006-10-31 20:06:22.000000000 +1100
> > +++ ./block/elevator.c	2006-10-31 20:06:40.000000000 +1100
> > @@ -926,7 +926,7 @@ static void __elv_unregister_queue(eleva
> >  
> >  void elv_unregister_queue(struct request_queue *q)
> >  {
> > -	if (q)
> > +	if (q && q->elevator)
> >  		__elv_unregister_queue(q->elevator);
> >  }
> > 
> > 
> > Jens?  md never registers and elevator for its queue.
> 
> Hmm, but blk_unregister_queue() doesn't call elv_unregister_queue()
> unless q->request_fn is set. And in that case, you must have an io
> scheduler attached.

Hmm.. yes.  Oh, I get it.  I have

	blk_cleanup_queue(mddev->queue);
	mddev->queue = NULL;
	del_gendisk(mddev->gendisk);
	mddev->gendisk = NULL;

That's the wrong order, isn't it. :-(

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 002 of 6] md: Change lifetime rules for 'md' devices.
  2006-10-31  9:26         ` Neil Brown
@ 2006-10-31  9:30           ` Jens Axboe
  0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2006-10-31  9:30 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-raid, linux-kernel, gregkh

On Tue, Oct 31 2006, Neil Brown wrote:
> On Tuesday October 31, jens.axboe@oracle.com wrote:
> > On Tue, Oct 31 2006, Neil Brown wrote:
> > > 
> > > I'm guessing we need
> > > 
> > > diff .prev/block/elevator.c ./block/elevator.c
> > > --- .prev/block/elevator.c	2006-10-31 20:06:22.000000000 +1100
> > > +++ ./block/elevator.c	2006-10-31 20:06:40.000000000 +1100
> > > @@ -926,7 +926,7 @@ static void __elv_unregister_queue(eleva
> > >  
> > >  void elv_unregister_queue(struct request_queue *q)
> > >  {
> > > -	if (q)
> > > +	if (q && q->elevator)
> > >  		__elv_unregister_queue(q->elevator);
> > >  }
> > > 
> > > 
> > > Jens?  md never registers and elevator for its queue.
> > 
> > Hmm, but blk_unregister_queue() doesn't call elv_unregister_queue()
> > unless q->request_fn is set. And in that case, you must have an io
> > scheduler attached.
> 
> Hmm.. yes.  Oh, I get it.  I have
> 
> 	blk_cleanup_queue(mddev->queue);
> 	mddev->queue = NULL;
> 	del_gendisk(mddev->gendisk);
> 	mddev->gendisk = NULL;
> 
> That's the wrong order, isn't it. :-(

Yep, you want to reverse that :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 003 of 6] md: Define raid5_mergeable_bvec
  2006-10-31  6:00 [PATCH 000 of 6] md: udev events and cache bypass for reads NeilBrown
  2006-10-31  6:00 ` [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops NeilBrown
  2006-10-31  6:00 ` [PATCH 002 of 6] md: Change lifetime rules for 'md' devices NeilBrown
@ 2006-10-31  6:00 ` NeilBrown
  2006-10-31  6:01 ` [PATCH 004 of 6] md: Handle bypassing the read cache (assuming nothing fails) NeilBrown
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2006-10-31  6:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Raz Ben-Jehuda(caro)


This will encourage read request to be on only one device,
so we will often be able to bypass the cache for read
requests.

cc:  "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2006-10-31 16:40:57.000000000 +1100
+++ ./drivers/md/raid5.c	2006-10-31 16:41:26.000000000 +1100
@@ -2611,6 +2611,28 @@ static int raid5_congested(void *data, i
 	return 0;
 }
 
+/* We want read requests to align with chunks where possible,
+ * but write requests don't need to.
+ */
+static int raid5_mergeable_bvec(request_queue_t *q, struct bio *bio, struct bio_vec *biovec)
+{
+	mddev_t *mddev = q->queuedata;
+	sector_t sector = bio->bi_sector + get_start_sect(bio->bi_bdev);
+	int max;
+	unsigned int chunk_sectors = mddev->chunk_size >> 9;
+	unsigned int bio_sectors = bio->bi_size >> 9;
+
+	if (bio_data_dir(bio))
+		return biovec->bv_len; /* always allow writes to be mergeable */
+
+	max =  (chunk_sectors - ((sector & (chunk_sectors - 1)) + bio_sectors)) << 9;
+	if (max < 0) max = 0;
+	if (max <= biovec->bv_len && bio_sectors == 0)
+		return biovec->bv_len;
+	else
+		return max;
+}
+
 static int make_request(request_queue_t *q, struct bio * bi)
 {
 	mddev_t *mddev = q->queuedata;
@@ -3320,6 +3342,8 @@ static int run(mddev_t *mddev)
 	mddev->array_size =  mddev->size * (conf->previous_raid_disks -
 					    conf->max_degraded);
 
+	blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
+
 	return 0;
 abort:
 	if (conf) {

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 004 of 6] md: Handle bypassing the read cache (assuming nothing fails).
  2006-10-31  6:00 [PATCH 000 of 6] md: udev events and cache bypass for reads NeilBrown
                   ` (2 preceding siblings ...)
  2006-10-31  6:00 ` [PATCH 003 of 6] md: Define raid5_mergeable_bvec NeilBrown
@ 2006-10-31  6:01 ` NeilBrown
  2006-10-31  6:01 ` [PATCH 005 of 6] md: Allow reads that have bypassed the cache to be retried on failure NeilBrown
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2006-10-31  6:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Raz Ben-Jehuda(caro)

cc:  "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c |   78 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2006-10-31 16:41:26.000000000 +1100
+++ ./drivers/md/raid5.c	2006-10-31 16:41:51.000000000 +1100
@@ -2633,6 +2633,84 @@ static int raid5_mergeable_bvec(request_
 		return max;
 }
 
+
+static int in_chunk_boundary(mddev_t *mddev, struct bio *bio)
+{
+	sector_t sector = bio->bi_sector + get_start_sect(bio->bi_bdev);
+	unsigned int chunk_sectors = mddev->chunk_size >> 9;
+	unsigned int bio_sectors = bio->bi_size >> 9;
+
+	return  chunk_sectors >=
+		((sector & (chunk_sectors - 1)) + bio_sectors);
+}
+
+/*
+ *  The "raid5_align_endio" should check if the read succeeded and if it
+ *  did, call bio_endio on the original bio (having bio_put the new bio
+ *  first).
+ *  If the read failed..
+ */
+int raid5_align_endio(struct bio *bi, unsigned int bytes , int error)
+{
+	struct bio* raid_bi  = bi->bi_private;
+	if (bi->bi_size)
+		return 1;
+	bio_put(bi);
+	bio_endio(raid_bi, bytes, error);
+	return 0;
+}
+
+static int chunk_aligned_read(request_queue_t *q, struct bio * raid_bio)
+{
+	mddev_t *mddev = q->queuedata;
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	const unsigned int raid_disks = conf->raid_disks;
+	const unsigned int data_disks = raid_disks - 1;
+	unsigned int dd_idx, pd_idx;
+	struct bio* align_bi;
+	mdk_rdev_t *rdev;
+
+	if (!in_chunk_boundary(mddev, raid_bio)) {
+		printk("chunk_aligned_read : non aligned\n");
+		return 0;
+	}
+	/*
+ 	 * use bio_clone to make a copy of the bio
+	 */
+	align_bi = bio_clone(raid_bio, GFP_NOIO);
+	if (!align_bi)
+		return 0;
+	/*
+	 *   set bi_end_io to a new function, and set bi_private to the
+	 *     original bio.
+	 */
+	align_bi->bi_end_io  = raid5_align_endio;
+	align_bi->bi_private = raid_bio;
+	/*
+	 *	compute position
+	 */
+	align_bi->bi_sector =  raid5_compute_sector(raid_bio->bi_sector,
+					raid_disks,
+					data_disks,
+					&dd_idx,
+					&pd_idx,
+					conf);
+
+	rcu_read_lock();
+	rdev = rcu_dereference(conf->disks[dd_idx].rdev);
+	if (rdev && test_bit(In_sync, &rdev->flags)) {
+		align_bi->bi_bdev =  rdev->bdev;
+		atomic_inc(&rdev->nr_pending);
+		rcu_read_unlock();
+		generic_make_request(align_bi);
+		return 1;
+	} else {
+		rcu_read_unlock();
+		return 0;
+	}
+}
+
+
 static int make_request(request_queue_t *q, struct bio * bi)
 {
 	mddev_t *mddev = q->queuedata;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 005 of 6] md: Allow reads that have bypassed the cache to be retried on failure.
  2006-10-31  6:00 [PATCH 000 of 6] md: udev events and cache bypass for reads NeilBrown
                   ` (3 preceding siblings ...)
  2006-10-31  6:01 ` [PATCH 004 of 6] md: Handle bypassing the read cache (assuming nothing fails) NeilBrown
@ 2006-10-31  6:01 ` NeilBrown
  2006-10-31  6:01 ` [PATCH 006 of 6] md: Enable bypassing cache for reads NeilBrown
  2006-10-31 21:15 ` [PATCH 000 of 6] md: udev events and cache bypass " Greg KH
  6 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2006-10-31  6:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Raz Ben-Jehuda(caro)


If a bypass-the-cache read fails, we simply try again through
the cache.  If it fails again it will trigger normal recovery
precedures.

cc:  "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c         |  150 ++++++++++++++++++++++++++++++++++++++++++-
 ./include/linux/raid/raid5.h |    3 
 2 files changed, 150 insertions(+), 3 deletions(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2006-10-31 16:41:51.000000000 +1100
+++ ./drivers/md/raid5.c	2006-10-31 16:42:30.000000000 +1100
@@ -134,6 +134,8 @@ static void __release_stripe(raid5_conf_
 			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
 				list_add_tail(&sh->lru, &conf->inactive_list);
 				wake_up(&conf->wait_for_stripe);
+				if (conf->retry_read_aligned)
+					md_wakeup_thread(conf->mddev->thread);
 			}
 		}
 	}
@@ -2645,18 +2647,74 @@ static int in_chunk_boundary(mddev_t *md
 }
 
 /*
+ *  add bio to the retry LIFO  ( in O(1) ... we are in interrupt )
+ *  later sampled by raid5d.
+ */
+static void add_bio_to_retry(struct bio *bi,raid5_conf_t *conf)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&conf->device_lock, flags);
+
+	bi->bi_next = conf->retry_read_aligned;
+	conf->retry_read_aligned = bi;
+
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+	md_wakeup_thread(conf->mddev->thread);
+}
+
+
+static struct bio *remove_bio_from_retry(raid5_conf_t *conf)
+{
+	struct bio *bi;
+
+	bi = conf->retry_read_aligned;
+	if (bi) {
+		conf->retry_read_aligned = NULL;
+		return bi;
+	}
+	bi = conf->retry_read_aligned_list;
+	if(bi) {
+		conf->retry_read_aligned = bi->bi_next;
+		bi->bi_next = NULL;
+		bi->bi_phys_segments = 1; /* biased count of active stripes */
+		bi->bi_hw_segments = 0; /* count of processed stripes */
+	}
+
+	return bi;
+}
+
+
+/*
  *  The "raid5_align_endio" should check if the read succeeded and if it
  *  did, call bio_endio on the original bio (having bio_put the new bio
  *  first).
  *  If the read failed..
  */
-int raid5_align_endio(struct bio *bi, unsigned int bytes , int error)
+int raid5_align_endio(struct bio *bi, unsigned int bytes, int error)
 {
 	struct bio* raid_bi  = bi->bi_private;
+	mddev_t *mddev;
+	raid5_conf_t *conf;
+
 	if (bi->bi_size)
 		return 1;
 	bio_put(bi);
-	bio_endio(raid_bi, bytes, error);
+
+	mddev = raid_bi->bi_bdev->bd_disk->queue->queuedata;
+	conf = mddev_to_conf(mddev);
+
+	if (!error && test_bit(BIO_UPTODATE, &bi->bi_flags)) {
+		bio_endio(raid_bi, bytes, 0);
+		if (atomic_dec_and_test(&conf->active_aligned_reads))
+			wake_up(&conf->wait_for_stripe);
+		return 0;
+	}
+
+
+	PRINTK("raid5_align_endio : io error...handing IO for a retry\n");
+
+	add_bio_to_retry(raid_bi, conf);
 	return 0;
 }
 
@@ -2702,6 +2760,14 @@ static int chunk_aligned_read(request_qu
 		align_bi->bi_bdev =  rdev->bdev;
 		atomic_inc(&rdev->nr_pending);
 		rcu_read_unlock();
+
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_stripe,
+				    conf->quiesce == 0,
+				    conf->device_lock, /* nothing */);
+		atomic_inc(&conf->active_aligned_reads);
+		spin_unlock_irq(&conf->device_lock);
+
 		generic_make_request(align_bi);
 		return 1;
 	} else {
@@ -3050,6 +3116,71 @@ static inline sector_t sync_request(mdde
 	return STRIPE_SECTORS;
 }
 
+static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
+{
+	/* We may not be able to submit a whole bio at once as there
+	 * may not be enough stripe_heads available.
+	 * We cannot pre-allocate enough stripe_heads as we may need
+	 * more than exist in the cache (if we allow ever large chunks).
+	 * So we do one stripe head at a time and record in
+	 * ->bi_hw_segments how many have been done.
+	 *
+	 * We *know* that this entire raid_bio is in one chunk, so
+	 * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
+	 */
+	struct stripe_head *sh;
+	int dd_idx, pd_idx;
+	sector_t sector, logical_sector, last_sector;
+	int scnt = 0;
+	int remaining;
+	int handled = 0;
+
+	logical_sector = raid_bio->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
+	sector = raid5_compute_sector(	logical_sector,
+					conf->raid_disks,
+					conf->raid_disks-1,
+					&dd_idx,
+					&pd_idx,
+					conf);
+	last_sector = raid_bio->bi_sector + (raid_bio->bi_size>>9);
+
+	for (; logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
+
+		if (scnt < raid_bio->bi_hw_segments)
+			/* already done this stripe */
+			continue;
+
+		sh = get_active_stripe(conf, sector, conf->raid_disks, pd_idx, 1);
+
+		if (!sh) {
+			/* failed to get a stripe - must wait */
+			raid_bio->bi_hw_segments = scnt;
+			conf->retry_read_aligned = raid_bio;
+			return handled;
+		}
+
+		set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
+		add_stripe_bio(sh, raid_bio, dd_idx, 0);
+		handle_stripe(sh, NULL);
+		release_stripe(sh);
+		handled++;
+	}
+	spin_lock_irq(&conf->device_lock);
+	remaining = --raid_bio->bi_phys_segments;
+	spin_unlock_irq(&conf->device_lock);
+	if (remaining == 0) {
+		int bytes = raid_bio->bi_size;
+
+		raid_bio->bi_size = 0;
+		raid_bio->bi_end_io(raid_bio, bytes, 0);
+	}
+	if (atomic_dec_and_test(&conf->active_aligned_reads))
+		wake_up(&conf->wait_for_stripe);
+	return handled;
+}
+
+
+
 /*
  * This is our raid5 kernel thread.
  *
@@ -3071,6 +3202,7 @@ static void raid5d (mddev_t *mddev)
 	spin_lock_irq(&conf->device_lock);
 	while (1) {
 		struct list_head *first;
+		struct bio *bio;
 
 		if (conf->seq_flush != conf->seq_write) {
 			int seq = conf->seq_flush;
@@ -3087,6 +3219,16 @@ static void raid5d (mddev_t *mddev)
 		    !list_empty(&conf->delayed_list))
 			raid5_activate_delayed(conf);
 
+		while ((bio = remove_bio_from_retry(conf))) {
+			int ok;
+			spin_unlock_irq(&conf->device_lock);
+			ok = retry_aligned_read(conf, bio);
+			spin_lock_irq(&conf->device_lock);
+			if (!ok)
+				break;
+			handled++;
+		}
+
 		if (list_empty(&conf->handle_list))
 			break;
 
@@ -3274,6 +3416,7 @@ static int run(mddev_t *mddev)
 	INIT_LIST_HEAD(&conf->inactive_list);
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->preread_active_stripes, 0);
+	atomic_set(&conf->active_aligned_reads, 0);
 
 	PRINTK("raid5: run(%s) called.\n", mdname(mddev));
 
@@ -3796,7 +3939,8 @@ static void raid5_quiesce(mddev_t *mddev
 		spin_lock_irq(&conf->device_lock);
 		conf->quiesce = 1;
 		wait_event_lock_irq(conf->wait_for_stripe,
-				    atomic_read(&conf->active_stripes) == 0,
+				    atomic_read(&conf->active_stripes) == 0 &&
+				    atomic_read(&conf->active_aligned_reads) == 0,
 				    conf->device_lock, /* nothing */);
 		spin_unlock_irq(&conf->device_lock);
 		break;

diff .prev/include/linux/raid/raid5.h ./include/linux/raid/raid5.h
--- .prev/include/linux/raid/raid5.h	2006-10-31 16:40:56.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-10-31 16:42:30.000000000 +1100
@@ -227,7 +227,10 @@ struct raid5_private_data {
 	struct list_head	handle_list; /* stripes needing handling */
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
+	struct bio		*retry_read_aligned; /* currently retrying aligned bios   */
+	struct bio		*retry_read_aligned_list; /* aligned bios retry list  */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
+	atomic_t		active_aligned_reads;
 
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
 	/* unfortunately we need two cache names as we temporarily have

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 006 of 6] md: Enable bypassing cache for reads.
  2006-10-31  6:00 [PATCH 000 of 6] md: udev events and cache bypass for reads NeilBrown
                   ` (4 preceding siblings ...)
  2006-10-31  6:01 ` [PATCH 005 of 6] md: Allow reads that have bypassed the cache to be retried on failure NeilBrown
@ 2006-10-31  6:01 ` NeilBrown
  2006-10-31 21:15 ` [PATCH 000 of 6] md: udev events and cache bypass " Greg KH
  6 siblings, 0 replies; 26+ messages in thread
From: NeilBrown @ 2006-10-31  6:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Raz Ben-Jehuda(caro)


Call the chunk_aligned_read where appropriate.

cc:  "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c |    5 +++++
 1 file changed, 5 insertions(+)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2006-10-31 16:42:30.000000000 +1100
+++ ./drivers/md/raid5.c	2006-10-31 16:47:53.000000000 +1100
@@ -2798,6 +2798,11 @@ static int make_request(request_queue_t 
 	disk_stat_inc(mddev->gendisk, ios[rw]);
 	disk_stat_add(mddev->gendisk, sectors[rw], bio_sectors(bi));
 
+	if ( bio_data_dir(bi) == READ &&
+	     mddev->reshape_position == MaxSector &&
+	     chunk_aligned_read(q,bi))
+            		return 0;
+
 	logical_sector = bi->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	last_sector = bi->bi_sector + (bi->bi_size>>9);
 	bi->bi_next = NULL;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 000 of 6] md: udev events and cache bypass for reads
  2006-10-31  6:00 [PATCH 000 of 6] md: udev events and cache bypass for reads NeilBrown
                   ` (5 preceding siblings ...)
  2006-10-31  6:01 ` [PATCH 006 of 6] md: Enable bypassing cache for reads NeilBrown
@ 2006-10-31 21:15 ` Greg KH
  6 siblings, 0 replies; 26+ messages in thread
From: Greg KH @ 2006-10-31 21:15 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel, Raz Ben-Jehuda(caro)

On Tue, Oct 31, 2006 at 05:00:40PM +1100, NeilBrown wrote:
> Following are 6 patches for md in -lastest which I have been sitting
> on for a while because I hadn't had a chance to test them properly.
> I now have so there shouldn't be too many bugs left :-)
> 
> First is suitable for 2.6.19 (if it isn't too late and gregkh thinks it
> is good).  Rest are for 2.6.20.

No objections from me.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2006-11-09 10:17 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-10-31  6:00 [PATCH 000 of 6] md: udev events and cache bypass for reads NeilBrown
2006-10-31  6:00 ` [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops NeilBrown
2006-10-31 21:16   ` Greg KH
2006-11-02 12:13     ` Kay Sievers
2006-11-02 12:32       ` Neil Brown
2006-11-02 13:51         ` Kay Sievers
2006-11-03  6:57           ` Neil Brown
2006-11-03  8:22             ` Kay Sievers
2006-11-06  0:18               ` Neil Brown
2006-11-06  8:38                 ` dean gaudet
2006-11-07  5:05                   ` Neil Brown
2006-11-09 10:10                     ` Michael Tokarev
2006-11-09 10:17                       ` Michael Tokarev
2006-11-08 11:14                 ` Kay Sievers
2006-11-09  0:17                   ` Neil Brown
2006-10-31  6:00 ` [PATCH 002 of 6] md: Change lifetime rules for 'md' devices NeilBrown
2006-10-31  8:22   ` Andrew Morton
2006-10-31  9:09     ` Neil Brown
2006-10-31  9:15       ` Jens Axboe
2006-10-31  9:26         ` Neil Brown
2006-10-31  9:30           ` Jens Axboe
2006-10-31  6:00 ` [PATCH 003 of 6] md: Define raid5_mergeable_bvec NeilBrown
2006-10-31  6:01 ` [PATCH 004 of 6] md: Handle bypassing the read cache (assuming nothing fails) NeilBrown
2006-10-31  6:01 ` [PATCH 005 of 6] md: Allow reads that have bypassed the cache to be retried on failure NeilBrown
2006-10-31  6:01 ` [PATCH 006 of 6] md: Enable bypassing cache for reads NeilBrown
2006-10-31 21:15 ` [PATCH 000 of 6] md: udev events and cache bypass " Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).