linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
@ 2002-09-03  0:53 Andries.Brouwer
  2002-09-03  3:55 ` Linus Torvalds
  2002-09-03 15:27 ` Roe Peterson
  0 siblings, 2 replies; 22+ messages in thread
From: Andries.Brouwer @ 2002-09-03  0:53 UTC (permalink / raw)
  To: aebr, torvalds; +Cc: Andries.Brouwer, linux-kernel, linux-raid, neilb

    > I think it important to get rid of partition table reading in the kernel.

    Why?

Let me be more precise.
I think it important to get rid of automatic partition table reading
in the kernel.

Why?
Because in some cases it is undesirable.
Because in some cases it crashes the kernel.
Because it involves guessing and heuristics.
Because policy belongs in user space.


    >  One argument is that our traditional DOS-type partition table will
    >  soon be at the end of its useful life. Yes, maybe it survives
    >  a few more years but our own stability requires slow changes,
    >  so we must start thinking a long time in advance.

    That's a bad argument. It's not as if we want to have random formats for 
    this thing. Partitioning is damn important, and it has to be portable 
    across different machines and different operating systems. That all means 
    that there is absolutely _zero_ incentive to make up a partition format of 
    our own, since there are perfectly fine and existing formats.

That is a separate discussion best left for some other time.
[But every OS has its own partition table type, and the types
are not compatible. We started using the DOS-type partition table.
But it is dying. Windows replaces it with their dynamic disks.
What do we do? Follow Microsoft? Pick the Plan9 format?]

    >  Another argument is that it sometimes takes a *long* time, like several
    >  minutes, especially when this reading triggers hardware bugs.

    This is only an argument for doing it on demand, not for dropping it.

Yes - that is my main point: doing it on demand. On demand only.

    >  Another argument is that nobody knows whether there is
    >  a partition table. (ZIP: "large floppy" vs "removable disk")
    >  Another argument is that tricky things happen with disk managers.

    And none of these work any better in user space.

Well, in fact they do.

The user knows whether she treats her ZIP like a removable disk
or like a big floppy, that is, whether she should ask or refrain
from asking to read the pt.

And yes, if the partitions on the disk are to be shifted by 63 sectors
then partx can notice that and tell the kernel. But if the kernel does
these things automatically it can be difficult to remove Disk Manager.


    You seem to think that kernel space somehow cannot do something that
    user space can. I just don't see the overriding problems you claim.

It is the user who knows and wants to decide.

If my disk has media errors and I want to rescue what still can be read,
then I am very unhappy that the kernel starts reading the first sector
and the last sector and various sectors in the middle.
I want to have very precise control over what I/O happens.

If I insert a SmartMedia card then I know very precisely that it has
a FAT filesystem, a special one. Some cameras will refuse to read
such cards formatted by DOS. If the kernel starts probing, as it does
today, then it will read the first sector and the last sector, etc.
But my reader has a firmware bug, an off-by-one mistake in the reported
capacity, and the kernel tries to read a sector past the end of the card,
gets an error and the SCSI code starts retrying, resetting the device,
the host, the bus, finally takes the device offline. In the meantime
the USB code is entirely confused by aborts and crashes the kernel.
Of course both SCSI and USB code have to be improved, but it would
certainly be nice if I could tell from userspace: probe only for FAT.
No need at all to read this last sector.

I have seen partition tables with a loop. They would poison Linux
so that it was impossible to boot Linux on a system with such a disk.

I have seen disks with random test data causing Linux to go out and
read nonexistent sectors. There is the real possibility that no
partition table is present, and trying to find one may be a bad idea.

I have seen disks that form part of a multi-disk array.
Often the partition tables are meaningless.


Not doing things automatically gives power to the user.
In some situations this power is needed.


And once this partition reading is done on demand only, it does not
matter very much who does the reading. It may be the kernel.
It may be a user space program.

Andries


^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
@ 2002-09-02 21:41 Andries.Brouwer
  2002-09-02 22:00 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Andries.Brouwer @ 2002-09-02 21:41 UTC (permalink / raw)
  To: Andries.Brouwer, torvalds; +Cc: aebr, linux-kernel, linux-raid, neilb

> The point about backwards compatibility is that things WORK.

Must I conclude that you did not read my entire letter?

Since we started this small detour talking about media change,
let me quote that fragment once more.

"[Don't think that I actually propose doing this today as the default,
but it would be a very small patch to add this as an optional
behaviour. But there is today, and there is the faraway goal.
The faraway goal is: no partition table reading in the kernel.
And that influences designing today what to do on media change.
Already today I would consider it entirely reasonable if there
was no automatic partition table reading after a media change.]"

No, my suggested changes would not break a single Linux installation
in the world.

Andries

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
@ 2002-09-02 21:27 Andries.Brouwer
  2002-09-02 21:39 ` Linus Torvalds
  2002-09-02 21:43 ` Thunder from the hill
  0 siblings, 2 replies; 22+ messages in thread
From: Andries.Brouwer @ 2002-09-02 21:27 UTC (permalink / raw)
  To: aebr, torvalds; +Cc: Andries.Brouwer, linux-kernel, linux-raid, neilb

    > But I hope he makes precisely this: a kernel that does not do any
    > partition reading of its own.

    I disagree, if only because of backwards compatibility issues.

    On a conceptual level I think you're right. However, it will break too 
    many standard installations as is.

    If/when we have a reasonable initrd setup that is usable, we could do some 
    automatic partitioning of devices that are available at bootup to minimize 
    the impact, but I don't think it is realistic otherwise.

Compare it with mounting.

It would be very bad if the kernel automatically mounted all
filesystems in sight. So, user space tells what to mount.
But at boot time there is a special situation.
In the end we want to have an initrd that mounts the rootfs,
but today we give kernel command line parameters with
rootfstype= and root=.

In a similar way it is bad that the kernel automatically tries
to interpret some data on a block device as a partition table.
The user can tell the kernel. (Yes, today.)
But at boot time there is a special situation.
In the end we want to have an initrd that does the partition reading,
but now we could give a kernel command line parameter with
rootpttype= and have the kernel only parse the partition table
of the root device.

Andries


[Yes, a shock, but very easy for people to add
      blockdev --rereadpt /dev/foo
(or a partx call) in some bootscripts.]

[Don't think that I actually propose doing this today as the default,
but it would be a very small patch to add this as an optional
behaviour. But there is today, and there is the faraway goal.
The faraway goal is: no partition table reading in the kernel.
And that influences designing today what to do on media change.
Already today I would consider it entirely reasonable if there
was no automatic partition table reading after a media change.]

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
@ 2002-09-02  8:53 Andries.Brouwer
  2002-09-02 17:01 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Andries.Brouwer @ 2002-09-02  8:53 UTC (permalink / raw)
  To: neilb, torvalds; +Cc: linux-kernel, linux-raid

> HOWEVER, that disk change checking really should be done by
> the generic layers, and it should be done after the open() anyway
> (and not by the open)

Are you sure?
I am inclined to think that this would be an undesirable change of
open() semantics. Traditionally, and according to all standards,
open() will return ENXIO when the device does not exist.

Andries

^ permalink raw reply	[flat|nested] 22+ messages in thread
* PATCH - change to blkdev->queue calling triggers BUG in md.c
@ 2002-09-01 23:43 Neil Brown
  2002-09-02  4:13 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Neil Brown @ 2002-09-01 23:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-raid


Changeset 1.573 (just prior to 2.5.33 release) changed the calling
sequence for blk_dev[major].queue so that it is now called before the 
bd_op->open function is called.
This triggers a BUG in md.c which checked that the device was open
whenever ->queue was called.  Patch below removes the BUG.

I'm actually a little disappointed by this change.  I was hoping that
the ->queue might get changed to be passed a 'struct block_device *'
instead of a 'kdev_t' so that the device driver would only have to
interpret the device number in one place: the open.  But now that
->queue is called before ->open, that wouldn't help.

I don't suppose it would make sense to do the default:
	if (!bdev->bd_queue) {
		struct blk_dev_struct *p = blk_dev + major(dev);
		bdev->bd_queue = &p->request_queue;
	}
bit where it is now, and leave the:
		if (p->queue)
			bdev->bd_queue =  p->queue(dev);

bit until after the open?  It would keep floppy happy, and make me
happy too, but I'm not sure that it is actually 'right'...

Anyway, here is the patch that stops md from BUGging out.

NeilBrown

### Comments for ChangeSet
Remove BUG in md.c that change in 2.5.33 triggers.

Since 2.5.33, the blk_dev[].queue is called without
the device open, so md_queue_proc can no-longer assume
that the device is open.


 ----------- Diffstat output ------------
 ./drivers/md/md.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

--- ./drivers/md/md.c	2002/09/01 23:27:10	1.1
+++ ./drivers/md/md.c	2002/09/01 23:28:27	1.2
@@ -3157,11 +3157,11 @@ request_queue_t * md_queue_proc(kdev_t d
 {
 	mddev_t *mddev = mddev_find(minor(dev));
 	request_queue_t *q = BLK_DEFAULT_QUEUE(MAJOR_NR);
-	if (!mddev || atomic_read(&mddev->active)<2)
-		BUG();
-	if (mddev->pers)
-		q = &mddev->queue;
-	mddev_put(mddev); /* the caller must hold a reference... */
+	if (mddev) {
+		if (mddev->pers)
+			q = &mddev->queue;
+		mddev_put(mddev);
+	}
 	return q;
 }
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2002-09-03 15:27 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-09-03  0:53 PATCH - change to blkdev->queue calling triggers BUG in md.c Andries.Brouwer
2002-09-03  3:55 ` Linus Torvalds
2002-09-03  4:06   ` Linus Torvalds
2002-09-03 15:22   ` Andries Brouwer
2002-09-03 15:27 ` Roe Peterson
  -- strict thread matches above, loose matches on Subject: below --
2002-09-02 21:41 Andries.Brouwer
2002-09-02 22:00 ` Linus Torvalds
2002-09-02 23:08   ` Andries Brouwer
2002-09-02 23:27     ` Linus Torvalds
2002-09-02 21:27 Andries.Brouwer
2002-09-02 21:39 ` Linus Torvalds
2002-09-02 21:48   ` Thunder from the hill
2002-09-02 21:43 ` Thunder from the hill
2002-09-02 21:58   ` Andries Brouwer
2002-09-02 22:06   ` Linus Torvalds
2002-09-02 22:39     ` Thunder from the hill
2002-09-02  8:53 Andries.Brouwer
2002-09-02 17:01 ` Linus Torvalds
2002-09-02 20:35   ` Andries Brouwer
2002-09-02 20:50     ` Linus Torvalds
2002-09-01 23:43 Neil Brown
2002-09-02  4:13 ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).