* PATCH - change to blkdev->queue calling triggers BUG in md.c
@ 2002-09-01 23:43 Neil Brown
2002-09-02 4:13 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Neil Brown @ 2002-09-01 23:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, linux-raid
Changeset 1.573 (just prior to 2.5.33 release) changed the calling
sequence for blk_dev[major].queue so that it is now called before the
bd_op->open function is called.
This triggers a BUG in md.c which checked that the device was open
whenever ->queue was called. Patch below removes the BUG.
I'm actually a little disappointed by this change. I was hoping that
the ->queue might get changed to be passed a 'struct block_device *'
instead of a 'kdev_t' so that the device driver would only have to
interpret the device number in one place: the open. But now that
->queue is called before ->open, that wouldn't help.
I don't suppose it would make sense to do the default:
	if (!bdev->bd_queue) {
		struct blk_dev_struct *p = blk_dev + major(dev);
		bdev->bd_queue = &p->request_queue;
	}
bit where it is now, and leave the:
	if (p->queue)
		bdev->bd_queue = p->queue(dev);
bit until after the open? It would keep floppy happy, and make me
happy too, but I'm not sure that it is actually 'right'...
Anyway, here is the patch that stops md from BUGging out.
NeilBrown
### Comments for ChangeSet
Remove BUG in md.c that change in 2.5.33 triggers.
Since 2.5.33, the blk_dev[].queue is called without
the device open, so md_queue_proc can no longer assume
that the device is open.
----------- Diffstat output ------------
./drivers/md/md.c | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)
--- ./drivers/md/md.c 2002/09/01 23:27:10 1.1
+++ ./drivers/md/md.c 2002/09/01 23:28:27 1.2
@@ -3157,11 +3157,11 @@ request_queue_t * md_queue_proc(kdev_t d
 {
 	mddev_t *mddev = mddev_find(minor(dev));
 	request_queue_t *q = BLK_DEFAULT_QUEUE(MAJOR_NR);
-	if (!mddev || atomic_read(&mddev->active)<2)
-		BUG();
-	if (mddev->pers)
-		q = &mddev->queue;
-	mddev_put(mddev); /* the caller must hold a reference... */
+	if (mddev) {
+		if (mddev->pers)
+			q = &mddev->queue;
+		mddev_put(mddev);
+	}
 	return q;
 }
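The effect of the patch can be modelled outside the kernel. The following is only a user-space sketch: the `model_*` names, the `md_table` array and the simplified structs are hypothetical stand-ins, not the real md.c definitions. It shows the one behaviour the patch introduces: a lookup that finds no array, or an array without a personality, falls back to the default queue instead of hitting BUG().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct request_queue { int id; };

struct mddev {
	int has_pers;                 /* models mddev->pers != NULL */
	int refs;                     /* models the reference count */
	struct request_queue queue;   /* the array's own queue */
};

#define MAX_MINORS 4

static struct request_queue default_queue = { 0 };
static struct mddev *md_table[MAX_MINORS];

/* Models mddev_find(): returns a referenced mddev, or NULL if none exists. */
static struct mddev *model_mddev_find(unsigned minor)
{
	if (minor >= MAX_MINORS || md_table[minor] == NULL)
		return NULL;
	md_table[minor]->refs++;
	return md_table[minor];
}

/* Models mddev_put(): drops the reference taken by the lookup. */
static void model_mddev_put(struct mddev *m)
{
	assert(m->refs > 0);
	m->refs--;
}

/* Patched logic: no assumption that the device is already open. */
static struct request_queue *model_md_queue_proc(unsigned minor)
{
	struct mddev *mddev = model_mddev_find(minor);
	struct request_queue *q = &default_queue;

	if (mddev) {
		if (mddev->has_pers)
			q = &mddev->queue;
		model_mddev_put(mddev);
	}
	return q;
}
```

In this model, calling `model_md_queue_proc()` for an unused minor simply returns `&default_queue`, which is the case the old BUG() check rejected.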
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-01 23:43 Neil Brown
@ 2002-09-02 4:13 ` Linus Torvalds
0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2002-09-02 4:13 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-kernel, linux-raid
On Mon, 2 Sep 2002, Neil Brown wrote:
>
> I'm actually a little disappointed by this change. I was hoping that
> the ->queue might get changed to be passed a 'struct block_device *'
> instead of a 'kdev_t' so that the device driver would only have to
> interpret the device number in one place: the open. But now that
> ->queue is called before ->open, that wouldn't help.
We may still do this.
Right now the _only_ reason to call ->queue before open() is that open()
is also doing things like disk change checking, which reasonably needs the
queue because it can need to do IO in order to check the disk change
status. The floppy in fact did exactly this.
HOWEVER, that disk change checking really should be done by the generic
layers, and it should be done after the open() anyway (and not by the
open), and I think Al is actually working on this. That will allow us to
be a bit more flexible about the ordering.
Linus
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
@ 2002-09-02 8:53 Andries.Brouwer
2002-09-02 17:01 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Andries.Brouwer @ 2002-09-02 8:53 UTC (permalink / raw)
To: neilb, torvalds; +Cc: linux-kernel, linux-raid
> HOWEVER, that disk change checking really should be done by
> the generic layers, and it should be done after the open() anyway
> (and not by the open)
Are you sure?
I am inclined to think that this would be an undesirable change of
open() semantics. Traditionally, and according to all standards,
open() will return ENXIO when the device does not exist.
Andries
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 8:53 Andries.Brouwer
@ 2002-09-02 17:01 ` Linus Torvalds
2002-09-02 20:35 ` Andries Brouwer
0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2002-09-02 17:01 UTC (permalink / raw)
To: Andries.Brouwer; +Cc: neilb, linux-kernel, linux-raid
On Mon, 2 Sep 2002 Andries.Brouwer@cwi.nl wrote:
> > HOWEVER, that disk change checking really should be done by
> > the generic layers, and it should be done after the open() anyway
> > (and not by the open)
>
> Are you sure?
> I am inclined to think that this would be an undesirable change of
> open() semantics. Traditionally, and according to all standards,
> open() will return ENXIO when the device does not exist.
Well, one reason I don't want the low-level drivers doing the media change
checking is that there's more to media change than just checking the
media.
For example, the higher levels want to do a partition table re-read if the
media really has changed. We do have this strange "bd_invalidated" thing
for passing that information back, and maybe that is acceptable. It's a
bit subtle, though.
Another reason why it would be good to factor out media change from open()
is that I can well imagine that somebody would want to do a "door open"
ioctl on a device without a media, and we actually do kind of have that
interface: opening with O_NDELAY historically means to not do the media
change checks.
And guess what? Because that test is done inside the low-level driver
right now, it means that these O_NDELAY semantics aren't actually known or
followed by most drivers, _and_ it means that the higher levels don't even
realize that sometimes the media check hasn't gotten done at all (ie
because the low-level "open()" is called only for the _first_ open, the
higher levels right now won't even call "open()" at _all_ later on and so
the media checks aren't done later when they should be).
However, your ENXIO point is a good one, and implies that we really should
have a more expressive "media_change()" function, so that if we'd factor
out open()/media_check(), then we'd still get the right ENXIO thing.
Linus
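The split described in this message can be sketched in user-space C. Everything here is an assumption made for illustration: `model_bdev`, `model_media_check()` and `model_generic_open()` are hypothetical names, and the three-way return value is just one way to make the check "more expressive". The sketch keeps the ENXIO behaviour, honours O_NDELAY by skipping the check, and records a media change in a flag that stands in for "bd_invalidated".

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>    /* O_NDELAY */
#include <stddef.h>

enum media_state { MEDIA_ABSENT, MEDIA_UNCHANGED, MEDIA_CHANGED };

/* Hypothetical model of a block device as seen by the generic layer. */
struct model_bdev {
	enum media_state media;   /* what the hardware would report */
	int invalidated;          /* stands in for the "bd_invalidated" hint */
	int driver_opens;         /* how many driver opens are outstanding */
};

/* Hypothetical check: 0 = unchanged, 1 = changed, -ENXIO = no medium. */
static int model_media_check(const struct model_bdev *b)
{
	switch (b->media) {
	case MEDIA_ABSENT:
		return -ENXIO;
	case MEDIA_CHANGED:
		return 1;
	default:
		return 0;
	}
}

/* Generic open: driver open first, then the media check unless O_NDELAY. */
static int model_generic_open(struct model_bdev *b, int flags)
{
	int ret;

	assert(b != NULL);
	b->driver_opens++;            /* stands in for the driver's ->open */

	if (flags & O_NDELAY)
		return 0;             /* caller asked to skip media checks */

	ret = model_media_check(b);
	if (ret < 0) {
		b->driver_opens--;    /* undo the driver open on failure */
		return ret;
	}
	if (ret > 0)
		b->invalidated = 1;   /* higher level should re-read partitions */
	return 0;
}
```

Because the check lives in the generic function, it runs on every open in this model, not only on the first one that reaches the driver.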
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 17:01 ` Linus Torvalds
@ 2002-09-02 20:35 ` Andries Brouwer
2002-09-02 20:50 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Andries Brouwer @ 2002-09-02 20:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andries.Brouwer, neilb, linux-kernel, linux-raid
On Mon, Sep 02, 2002 at 10:01:46AM -0700, Linus Torvalds wrote:
> For example, the higher levels want to do a partition table re-read
> if the media really has changed.
My original setup made a kernel that does not know anything about
partition tables. User space would tell the kernel about partitions
on some block device.
Roughly speaking the impact is that there is a partx invocation
before a mount.
Now it seems Al is doing all the work, so I can just sit back and watch.
But I hope he makes precisely this: a kernel that does not do any
partition reading of its own.
Andries
[Yes, it is fundamentally wrong when the kernel starts guessing.
Guessing filesystem type is bad. Also guessing partition table type
is bad. Moreover, the kernel probing may lead to device problems
and even to kernel crashes, as I last observed two days ago.
Only the user knows what she wants to do with this disk. Format?
Remove OnTrack Disk Manager? There are all kinds of situations
where partition table re-read is directly harmful.]
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 20:35 ` Andries Brouwer
@ 2002-09-02 20:50 ` Linus Torvalds
0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2002-09-02 20:50 UTC (permalink / raw)
To: Andries Brouwer; +Cc: Andries.Brouwer, neilb, linux-kernel, linux-raid
On Mon, 2 Sep 2002, Andries Brouwer wrote:
>
> Now it seems Al is doing all the work, so I can just sit back and watch.
> But I hope he makes precisely this: a kernel that does not do any
> partition reading of its own.
I disagree, if only because of backwards compatibility issues.
On a conceptual level I think you're right. However, it will break too
many standard installations as is.
If/when we have a reasonable initrd setup that is usable, we could do some
automatic partitioning of devices that are available at bootup to minimize
the impact, but I don't think it is realistic otherwise.
Linus
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
@ 2002-09-02 21:27 Andries.Brouwer
2002-09-02 21:39 ` Linus Torvalds
2002-09-02 21:43 ` Thunder from the hill
0 siblings, 2 replies; 22+ messages in thread
From: Andries.Brouwer @ 2002-09-02 21:27 UTC (permalink / raw)
To: aebr, torvalds; +Cc: Andries.Brouwer, linux-kernel, linux-raid, neilb
> > But I hope he makes precisely this: a kernel that does not do any
> > partition reading of its own.
> I disagree, if only because of backwards compatibility issues.
> On a conceptual level I think you're right. However, it will break too
> many standard installations as is.
> If/when we have a reasonable initrd setup that is usable, we could do some
> automatic partitioning of devices that are available at bootup to minimize
> the impact, but I don't think it is realistic otherwise.
Compare it with mounting.
It would be very bad if the kernel automatically mounted all
filesystems in sight. So, user space tells what to mount.
But at boot time there is a special situation.
In the end we want to have an initrd that mounts the rootfs,
but today we give kernel command line parameters with
rootfstype= and root=.
In a similar way it is bad that the kernel automatically tries
to interpret some data on a block device as a partition table.
The user can tell the kernel. (Yes, today.)
But at boot time there is a special situation.
In the end we want to have an initrd that does the partition reading,
but now we could give a kernel command line parameter with
rootpttype= and have the kernel only parse the partition table
of the root device.
Andries
[Yes, a shock, but very easy for people to add
blockdev --rereadpt /dev/foo
(or a partx call) in some bootscripts.]
[Don't think that I actually propose doing this today as the default,
but it would be a very small patch to add this as an optional
behaviour. But there is today, and there is the faraway goal.
The faraway goal is: no partition table reading in the kernel.
And that influences designing today what to do on media change.
Already today I would consider it entirely reasonable if there
was no automatic partition table reading after a media change.]
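For reference, the `blockdev --rereadpt` request mentioned above is a thin wrapper around the Linux BLKRRPART ioctl, so a bootscript helper could make the same request directly. This is a minimal Linux-only sketch; the function name `reread_partition_table` is made up for illustration, and `/dev/foo` stays the placeholder used in the message.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKRRPART */

/*
 * Ask the kernel to re-read the partition table of the block device at
 * `path`. Returns 0 on success or a negative errno value on failure.
 * The helper name is hypothetical; BLKRRPART is the real ioctl.
 */
static int reread_partition_table(const char *path)
{
	int fd;
	int ret = 0;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	if (ioctl(fd, BLKRRPART) < 0)
		ret = -errno;    /* save errno before close() can change it */

	close(fd);
	return ret;
}
```

A caller would invoke `reread_partition_table("/dev/foo")` with sufficient privileges; on a path that is not a block device the ioctl is expected to fail and the error is passed back unchanged.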
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 21:27 Andries.Brouwer
@ 2002-09-02 21:39 ` Linus Torvalds
2002-09-02 21:48 ` Thunder from the hill
2002-09-02 21:43 ` Thunder from the hill
1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2002-09-02 21:39 UTC (permalink / raw)
To: Andries.Brouwer; +Cc: aebr, linux-kernel, linux-raid, neilb
On Mon, 2 Sep 2002 Andries.Brouwer@cwi.nl wrote:
>
> Compare it with mounting.
NO.
The point about backwards compatibility is that things WORK.
There's no point in comparing things to how you _want_ them to work. The
only thing that matters for backwards compatibility is how they work
_today_.
And your suggestion would break every single installation out there. Not
"maybe a few". Every single one.
(yeah, you could find some NFS-only setup that doesn't break. Big deal).
And backwards compatibility is extremely important.
Linus
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
@ 2002-09-02 21:41 Andries.Brouwer
2002-09-02 22:00 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Andries.Brouwer @ 2002-09-02 21:41 UTC (permalink / raw)
To: Andries.Brouwer, torvalds; +Cc: aebr, linux-kernel, linux-raid, neilb
> The point about backwards compatibility is that things WORK.
Must I conclude that you did not read my entire letter?
Since we started this small detour talking about media change,
let me quote that fragment once more.
"[Don't think that I actually propose doing this today as the default,
but it would be a very small patch to add this as an optional
behaviour. But there is today, and there is the faraway goal.
The faraway goal is: no partition table reading in the kernel.
And that influences designing today what to do on media change.
Already today I would consider it entirely reasonable if there
was no automatic partition table reading after a media change.]"
No, my suggested changes would not break a single Linux installation
in the world.
Andries
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 21:27 Andries.Brouwer
2002-09-02 21:39 ` Linus Torvalds
@ 2002-09-02 21:43 ` Thunder from the hill
2002-09-02 21:58 ` Andries Brouwer
2002-09-02 22:06 ` Linus Torvalds
1 sibling, 2 replies; 22+ messages in thread
From: Thunder from the hill @ 2002-09-02 21:43 UTC (permalink / raw)
To: Andries.Brouwer; +Cc: aebr, torvalds, linux-kernel, linux-raid, neilb
Hi,
On Mon, 2 Sep 2002 Andries.Brouwer@cwi.nl wrote:
> [Yes, a shock, but very easy for people to add
> blockdev --rereadpt /dev/foo
> (or a partx call) in some bootscripts.]
fdisk -r DEV -> read device's partition table
> The faraway goal is: no partition table reading in the kernel.
Why not the faraway goal: no partition tables any more? They're annoying.
Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 21:39 ` Linus Torvalds
@ 2002-09-02 21:48 ` Thunder from the hill
0 siblings, 0 replies; 22+ messages in thread
From: Thunder from the hill @ 2002-09-02 21:48 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andries.Brouwer, aebr, linux-kernel, linux-raid, neilb
Hi,
On Mon, 2 Sep 2002, Linus Torvalds wrote:
> The point about backwards compatibility is that things WORK.
>
> There's no point in comparing things to how you _want_ them to work. The
> only thing that matters for backwards compatibility is how they work
> _today_.
>
> And your suggestion would break every single installation out there. Not
> "maybe a few". Every single one.
>
> (yeah, you could find some NFS-only setup that doesn't break. Big deal).
>
> And backwards compatibility is extremely important.
dep_bool ' New mountalike partitioning code' CONFIG_PARTMOUNTING CONFIG_EXPERIMENTAL CONFIG_WHATEVER
Or, since we're talking about the future:
<bool name="PARTMOUNTING">
<title>
New mount-alike partitioning code
</title>
<dep name="EXPERIMENTAL" sense="include" />
<dep name="WHATEVER" sense="exclude" />
</bool>
See? New Deal is for the ones that were annoyed by the old one.
Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 21:43 ` Thunder from the hill
@ 2002-09-02 21:58 ` Andries Brouwer
2002-09-02 22:06 ` Linus Torvalds
1 sibling, 0 replies; 22+ messages in thread
From: Andries Brouwer @ 2002-09-02 21:58 UTC (permalink / raw)
To: Thunder from the hill
Cc: Andries.Brouwer, torvalds, linux-kernel, linux-raid, neilb
On Mon, Sep 02, 2002 at 03:43:56PM -0600, Thunder from the hill wrote:
> > The faraway goal is: no partition table reading in the kernel.
>
> Why not the faraway goal: no partition tables any more? They're annoying.
As soon as the kernel stops reading partition tables, user space
is entirely free in what it does. One of the possibilities is
then of course: no partition tables.
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 21:41 Andries.Brouwer
@ 2002-09-02 22:00 ` Linus Torvalds
2002-09-02 23:08 ` Andries Brouwer
0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2002-09-02 22:00 UTC (permalink / raw)
To: Andries.Brouwer; +Cc: aebr, linux-kernel, linux-raid, neilb
On Mon, 2 Sep 2002 Andries.Brouwer@cwi.nl wrote:
>
> No, my suggested changes would not break a single Linux installation
> in the world.
.. by making your suggested behaviour not be used. Yes.
But if that is the case, then we _still_ need to fix the media change and
partition read issue. Right? Which brings back _all_ my points for why it
should be done at open time, and by the generic routine. Agreed?
Linus
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 21:43 ` Thunder from the hill
2002-09-02 21:58 ` Andries Brouwer
@ 2002-09-02 22:06 ` Linus Torvalds
2002-09-02 22:39 ` Thunder from the hill
1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2002-09-02 22:06 UTC (permalink / raw)
To: Thunder from the hill
Cc: Andries.Brouwer, aebr, linux-kernel, linux-raid, neilb
On Mon, 2 Sep 2002, Thunder from the hill wrote:
>
> Why not the faraway goal: no partition tables any more? They're annoying.
Yeah, users and real life is annoying.
Guys, Linux is not a research project. Never was, never will be. If you
want to have a research project that does things the way people think they
should be done (as opposed to real life and being practical), look at Hurd
and look at a lot of other projects. But don't look at Linux.
Partition tables are a fact of life. And they are a fundamental part to
being able to parse what the disk contains.
Sure, you can do it in user space too. And you can do TCP in user space.
But some things are just fairly fundamental to the working of the system.
The disk and filesystem layout is one such thing. It had better "just
work".
Linus
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 22:06 ` Linus Torvalds
@ 2002-09-02 22:39 ` Thunder from the hill
0 siblings, 0 replies; 22+ messages in thread
From: Thunder from the hill @ 2002-09-02 22:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Thunder from the hill, Andries.Brouwer, aebr, linux-kernel,
linux-raid, neilb
Hi,
On Mon, 2 Sep 2002, Linus Torvalds wrote:
> On Mon, 2 Sep 2002, Thunder from the hill wrote:
> >
> > Why not the faraway goal: no partition tables any more? They're annoying.
>
> Guys, Linux is not a research project.
>
> Partition tables are a fact of life.
Linus, can you spell "faraway"? I wasn't talking about kicking
partitioning code from Linux 2.5, I was talking about inventing a better
way in 2010.
Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 22:00 ` Linus Torvalds
@ 2002-09-02 23:08 ` Andries Brouwer
2002-09-02 23:27 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Andries Brouwer @ 2002-09-02 23:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andries.Brouwer, linux-kernel, linux-raid, neilb
On Mon, Sep 02, 2002 at 03:00:27PM -0700, Linus Torvalds wrote:
> > No, my suggested changes would not break a single Linux installation
> > in the world.
>
> .. by making your suggested behaviour not be used. Yes.
Not so pessimistic. We go by small steps.
I think it important to get rid of partition table reading in the kernel.
It (pt reading) is wrong in principle, as we agree already.
But there are also all kinds of practical reasons.
One argument is that our traditional DOS-type partition table will soon
be at the end of its useful life. Yes, maybe it survives a few more years
but our own stability requires slow changes, so we must start thinking a
long time in advance.
Another argument is that it sometimes takes a *long* time, like several
minutes, especially when this reading triggers hardware bugs.
Another argument is that nobody knows whether there is a partition table.
In the case of ZIP drives there sometimes is a jumper or special SCSI command
to switch between the "large floppy" and "removable disk" statuses, and
the kernel doesn't know.
Another argument is that tricky things happen in the presence of disk managers.
So stage one is a kernel boot parameter "nopt" or so, that stops parsing
of partition tables other than the root partition. Some people need it
because of special problems, others just want to experiment. That is good,
and we'll get some feedback on partx and family.
Stage two happens a year later, when we have a working initrd. Seen from the
outside the new (kernel + initrd) plays the role of the old kernel.
Ha. That means that we can move the pt reading to initrd, and nobody notices.
Stage three happens when initrd and kernel no longer are so tightly coupled.
Initrd is just early userspace, tools exist to populate it, distributions make
their own. Now the kernel does not need any partition reading code and
nobody ever noticed. And the setup has become much more powerful.
-----
> But if that is the case, then we _still_ need to fix the media change and
> partition read issue. Right? Which brings back _all_ my points for why it
> should be done at open time, and by the generic routine. Agreed?
The above was mainly about the partition reading at boot time.
There are two other situations: partition reading at insmod time,
and partition reading at media change time.
But these are easier situations. There is a functioning userspace already.
As I said, in view of the desired direction, I would not mind at all if
a media change did not trigger partition reading today.
(In fact, for me, under 2.5.33, it doesn't. But blockdev --rereadpt helps.)
Andries
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-02 23:08 ` Andries Brouwer
@ 2002-09-02 23:27 ` Linus Torvalds
0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2002-09-02 23:27 UTC (permalink / raw)
To: Andries Brouwer; +Cc: Andries.Brouwer, linux-kernel, linux-raid, neilb
On Tue, 3 Sep 2002, Andries Brouwer wrote:
>
> I think it important to get rid of partition table reading in the kernel.
Why?
> It (pt reading) is wrong in principle, as we agree already.
No, we don't agree.
I see that some people would like to remove it from the kernel, and I'm
not violently opposed to it if it can be done without breaking existing
behaviour.
But I do _not_ see any really fundamental reason why the kernel shouldn't
parse the partition tables. I see a lot of problems if the kernel were to
stop, and I don't see a lot of advantages to not doing so.
> But there are also all kinds of practical reasons.
>
> One argument is that our traditional DOS-type partition table will soon
> be at the end of its useful life. Yes, maybe it survives a few more years
> but our own stability requires slow changes, so we must start thinking a
> long time in advance.
That's a bad argument. It's not as if we want to have random formats for
this thing. Partitioning is damn important, and it has to be portable
across different machines and different operating systems. That all means
that there is absolutely _zero_ incentive to make up a partition format of
our own, since there are perfectly fine and existing formats.
> Another argument is that it sometimes takes a *long* time, like several
> minutes, especially when this reading triggers hardware bugs.
This is only an argument for doing it on demand, not for dropping it.
> Another argument is that nobody knows whether there is a partition table.
> In the case of ZIP drives there sometimes is a jumper or special SCSI command
> to switch between the "large floppy" and "removable disk" statuses, and
> the kernel doesn't know.
> Another argument is that tricky things happen in the presence of disk managers.
And none of these work any better in user space.
> > But if that is the case, then we _still_ need to fix the media change and
> > partition read issue. Right? Which brings back _all_ my points for why it
> > should be done at open time, and by the generic routine. Agreed?
>
> The above was mainly about the partition reading at boot time.
> There are two other situations: partition reading at insmod time,
> and partition reading at media change time.
>
> But these are easier situations. There is a functioning userspace already.
You seem to think that kernel space somehow cannot do something that user
space can. I just don't see the overriding problems you claim.
Linus
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
@ 2002-09-03 0:53 Andries.Brouwer
2002-09-03 3:55 ` Linus Torvalds
2002-09-03 15:27 ` Roe Peterson
0 siblings, 2 replies; 22+ messages in thread
From: Andries.Brouwer @ 2002-09-03 0:53 UTC (permalink / raw)
To: aebr, torvalds; +Cc: Andries.Brouwer, linux-kernel, linux-raid, neilb
> > I think it important to get rid of partition table reading in the kernel.
> Why?
Let me be more precise.
I think it important to get rid of automatic partition table reading
in the kernel.
Why?
Because in some cases it is undesirable.
Because in some cases it crashes the kernel.
Because it involves guessing and heuristics.
Because policy belongs in user space.
> > One argument is that our traditional DOS-type partition table will
> > soon be at the end of its useful life. Yes, maybe it survives
> > a few more years but our own stability requires slow changes,
> > so we must start thinking a long time in advance.
> That's a bad argument. It's not as if we want to have random formats for
> this thing. Partitioning is damn important, and it has to be portable
> across different machines and different operating systems. That all means
> that there is absolutely _zero_ incentive to make up a partition format of
> our own, since there are perfectly fine and existing formats.
That is a separate discussion best left for some other time.
[But every OS has its own partition table type, and the types
are not compatible. We started using the DOS-type partition table.
But it is dying. Windows replaces it with their dynamic disks.
What do we do? Follow Microsoft? Pick the Plan9 format?]
> > Another argument is that it sometimes takes a *long* time, like several
> > minutes, especially when this reading triggers hardware bugs.
> This is only an argument for doing it on demand, not for dropping it.
Yes - that is my main point: doing it on demand. On demand only.
> > Another argument is that nobody knows whether there is
> > a partition table. (ZIP: "large floppy" vs "removable disk")
> > Another argument is that tricky things happen with disk managers.
> And none of these work any better in user space.
Well, in fact they do.
The user knows whether she treats her ZIP like a removable disk
or like a big floppy, that is, whether she should ask or refrain
from asking to read the pt.
And yes, if the partitions on the disk are to be shifted by 63 sectors
then partx can notice that and tell the kernel. But if the kernel does
these things automatically it can be difficult to remove Disk Manager.
> You seem to think that kernel space somehow cannot do something that
> user space can. I just don't see the overriding problems you claim.
It is the user who knows and wants to decide.
If my disk has media errors and I want to rescue what still can be read,
then I am very unhappy that the kernel starts reading the first sector
and the last sector and various sectors in the middle.
I want to have very precise control over what I/O happens.
If I insert a SmartMedia card then I know very precisely that it has
a FAT filesystem, a special one. Some cameras will refuse to read
such cards formatted by DOS. If the kernel starts probing, as it does
today, then it will read the first sector and the last sector, etc.
But my reader has a firmware bug, an off-by-one mistake in the reported
capacity, and the kernel tries to read a sector past the end of the card,
gets an error and the SCSI code starts retrying, resetting the device,
the host, the bus, finally takes the device offline. In the meantime
the USB code is entirely confused by aborts and crashes the kernel.
Of course both SCSI and USB code have to be improved, but it would
certainly be nice if I could tell from userspace: probe only for FAT.
No need at all to read this last sector.
I have seen partition tables with a loop. They would poison Linux
so that it was impossible to boot Linux on a system with such a disk.
I have seen disks with random test data causing Linux to go out and
read nonexistent sectors. There is the real possibility that no
partition table is present, and trying to find one may be a bad idea.
I have seen disks that form part of a multi-disk array.
Often the partition tables are meaningless.
Not doing things automatically gives power to the user.
In some situations this power is needed.
And once this partition reading is done on demand only, it does not
matter very much who does the reading. It may be the kernel.
It may be a user space program.
Andries
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-03 0:53 PATCH - change to blkdev->queue calling triggers BUG in md.c Andries.Brouwer
@ 2002-09-03 3:55 ` Linus Torvalds
2002-09-03 4:06 ` Linus Torvalds
2002-09-03 15:22 ` Andries Brouwer
2002-09-03 15:27 ` Roe Peterson
1 sibling, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2002-09-03 3:55 UTC (permalink / raw)
To: Andries.Brouwer; +Cc: aebr, linux-kernel, linux-raid, neilb
On Tue, 3 Sep 2002 Andries.Brouwer@cwi.nl wrote:
>
> Why?
> Because in some cases it is undesirable.
Again, Why?
You can always use the flat device as-is.
> Because in some cases it crashes the kernel.
But moving it to user space would cause the kernel to crash anyway. Bugs
are bugs.
> Because it involves guessing and heuristics.
The same guesses and heuristics would have to be in user space.
> Because policy belongs in user space.
It's not policy. It's a fact of life that disks need to be split up into
parts, and the partitioning schemes are well-defined and shared across
multiple operating systems.
> Yes - that is my main point: doing it on demand. On demand only.
But I actually _agree_ with this.
However, that has nothing to do with whether it is in user space or kernel
space. In many ways it is _easier_ to do on demand in kernel space: when
somebody opens /dev/sda1 and it isn't partitioned yet, you know it needs
to be.
The fact that partitioning right now is to some degree handled by device
drivers is a problem, but that's not a user space vs kernel space issue.
It's slowly getting moved to higher levels.
Linus
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
2002-09-03 3:55 ` Linus Torvalds
@ 2002-09-03 4:06 ` Linus Torvalds
2002-09-03 15:22 ` Andries Brouwer
1 sibling, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2002-09-03 4:06 UTC (permalink / raw)
To: Andries.Brouwer; +Cc: aebr, linux-kernel, linux-raid, neilb
On Mon, 2 Sep 2002, Linus Torvalds wrote:
>
> However, that has nothing to do with whether it is in user space or kernel
> space. In many ways it is _easier_ to do on demand in kernel space: when
> somebody opens /dev/sda1 and it isn't partitioned yet, you know it needs
> to be.
Note that this actually allows you to do your own user-space partitioning
if you want to - simply by making sure that you do your partitioning
_before_ somebody tries to open a partition on the device.
And if you look at how fs/block_dev.c looks right now, you'll notice that
we already handle the "main device" vs "sub-partition" cases differently,
so it should be fairly straightforward to eventually do the partitioning
on demand.
We're not there yet, no. But doing it in the open() path of
fs/block_dev.c sure looks like it's the easiest way to maintain sanity wrt
partitioning, _and_ maintain 100% backwards compatibility.
[ Well, the "100% backwards compatibility" is not strictly true. Doing
partition handling on demand will mean that things like /proc/partitions
will obviously also end up being populated on demand, which may break
various sysadmin tools. But at least then it's fairly well localized,
and it's reasonably easy to grep for /proc/partitions in tools to see if
they may care ]
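As a rough illustration of the on-demand idea described above, here is a minimal user-space sketch in C. It is not kernel code and not what fs/block_dev.c does: `struct toy_disk`, `toy_scan_table`, `toy_user_set_partition` and `toy_open` are hypothetical names invented for this example, and the "scan" simply pretends that partitions 1 and 2 exist. It only models the two properties under discussion: opening the whole device never triggers a scan, and a layout supplied by user space before the first partition open suppresses the scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PARTS 4

/* Hypothetical model of a whole disk; not a kernel structure. */
struct toy_disk {
    bool table_known;               /* layout already supplied or read? */
    int  scans;                     /* how often the kernel-side scan ran */
    bool part_valid[MAX_PARTS + 1]; /* index 0 stands for the whole disk */
};

/* Stand-in for reading an on-disk table: pretends partitions 1 and 2 exist. */
static void toy_scan_table(struct toy_disk *d)
{
    d->part_valid[1] = true;
    d->part_valid[2] = true;
    d->table_known = true;
    d->scans++;
}

/* User space supplies a partition itself, so no later scan is needed. */
static void toy_user_set_partition(struct toy_disk *d, int part)
{
    assert(part >= 1 && part <= MAX_PARTS);
    d->part_valid[part] = true;
    d->table_known = true;
}

/*
 * Open path. Part 0 is the whole device and is always usable as-is.
 * A partition open scans the table only if no layout is known yet.
 * Returns 0 on success, -1 if the partition does not exist.
 */
static int toy_open(struct toy_disk *d, int part)
{
    if (part < 0 || part > MAX_PARTS)
        return -1;
    if (part == 0)
        return 0;
    if (!d->table_known)
        toy_scan_table(d);
    return d->part_valid[part] ? 0 : -1;
}
```

In this model, opening part 0 leaves `scans` at zero, the first open of part 1 raises it to one, and a disk whose layout was set through `toy_user_set_partition` is never scanned at all.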
Linus
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
From: Andries Brouwer @ 2002-09-03 15:22 UTC
To: Linus Torvalds; +Cc: Andries.Brouwer, linux-kernel, linux-raid, neilb
On Mon, Sep 02, 2002 at 08:55:47PM -0700, Linus Torvalds wrote:
Discussion so far:
(1) It is wrong when the kernel guesses, because it may guess wrong.
Userspace must tell the kernel what to do.
[The mount call is not "mount dev dir" but "mount -t type dev dir".
The kernel could guess, and often guess right, but some types are
close, like ext2 and ext3, or various ufs types, and some types may be
indistinguishable from the disk image, like msdos and vfat, where the
right type may depend on the intentions of the user.]
[In a similar way it is bad when the kernel, unprovoked, starts trying
to interpret the first few and last few sectors of the disk as an
Acorn, Amiga, Atari, BSD, DOS, EFI, IBM, Mac, Minix, LDM, OSF, SGI,
Sun, Ultrix partition table. Maybe there was no table. Maybe there
is a table of a kind the kernel did not know about, e.g. an AIX or
Plan 9 table, or a newer version of *BSD or Minix while the kernel
only knows about older versions, or ...]
(2) In all kinds of special situations attempts to read a partition
table lead to errors, even to kernel crashes. The kernel should not
start doing I/O unprovoked, guessing where the partition table might be
and what type it might have.
> > Yes - that is my main point: doing it on demand. On demand only.
>
> But I actually _agree_ with this.
>
> However, that has nothing to do with whether it is in user space or kernel
> space. In many ways it is _easier_ to do on demand in kernel space: when
> somebody opens /dev/sda1 and it isn't partitioned yet, you know it needs
> to be.
At first I misread this sentence ("partitioning" for me is something
done with fdisk) but now I take it to mean: If we have /dev/sda
but have not read its partition table, and somebody opens /dev/sda1,
then we decide that we must read a partition table.
If that is what you mean, I disagree.
(Compare: we have /mnt/cdrom and someone opens /mnt/cdrom/foo, should we
decide to automatically mount /dev/cdrom? An automounter in user space
may do such things. The kernel may not.)
> Note that this actually allows you to do your own user-space partitioning
> if you want to - simply by making sure that you do your partitioning
> _before_ somebody tries to open a partition on the device.
You are inventing a can of worms. Suppose user space already told the
kernel where the partitions are, and the kernel knows about sda1, sda2, sda3.
Now somebody refers to sda4. Does the kernel start reading the device,
possibly changing the meaning of sda1 etc?
What if this disk is part of a RAID?
No, we must slowly migrate to the state where the kernel never takes the
initiative to search for a partition table. That initiative belongs to
user space.
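The policy argued for here can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `struct explicit_disk`, `explicit_add` and `explicit_open` are hypothetical names made up for the illustration, not kernel interfaces. The point it shows is the sda4 case from above: when user space has defined sda1, sda2 and sda3, a reference to sda4 just fails, the kernel issues no I/O of its own, and the meaning of sda1 cannot change behind anyone's back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPARTS 4

/* Hypothetical model: the kernel stores only what user space told it. */
struct explicit_disk {
    bool defined[NPARTS + 1];  /* index 0 (whole disk) is unused here */
    long start[NPARTS + 1];    /* starting sector of each defined partition */
    int  kernel_reads;         /* I/O issued on the kernel's own initiative */
};

/* User space declares one partition; the only way the table ever changes. */
static int explicit_add(struct explicit_disk *d, int part, long start_sector)
{
    assert(d != NULL);
    if (part < 1 || part > NPARTS || start_sector < 0)
        return -1;
    d->defined[part] = true;
    d->start[part] = start_sector;
    return 0;
}

/* Opening an undefined partition fails; nothing is read or redefined. */
static int explicit_open(const struct explicit_disk *d, int part)
{
    assert(d != NULL);
    if (part < 1 || part > NPARTS)
        return -1;
    return d->defined[part] ? 0 : -1;
}
```

Because `explicit_open` takes a `const` pointer, the compiler itself enforces that an open cannot alter the table; `kernel_reads` stays at zero throughout since no function in the model ever increments it.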
Andries
* Re: PATCH - change to blkdev->queue calling triggers BUG in md.c
From: Roe Peterson @ 2002-09-03 15:27 UTC
To: linux-raid
I'm a newbie to the list (not unix, though :-), but anyhow...
Andries.Brouwer@cwi.nl wrote:
> > Another argument is that nobody knows whether there is
> > a partition table. (ZIP: "large floppy" vs "removable disk")
> > Another argument is that tricky things happen with disk managers.
>
> And none of these work any better in user space.
>
> Well, in fact they do.
>
> The user knows whether she treats her ZIP like a removable disk
> or like a big floppy, that is, whether she should ask or refrain
> from asking to read the pt.
"The User Knows"? My experience with the vast bulk of users is that
the only thing you can count on is that they _don't_ know. Much of
anything at all, in fact. Depending on a luser to know how her zip disk
is configured is _much_ less reliable than some minor kernel
heuristic (translate: guesswork).