public inbox for linux-kernel@vger.kernel.org
* Linux/Pro  -- clusters
@ 2001-12-03 18:12 Donald Becker
  2001-12-04  1:55 ` Davide Libenzi
  0 siblings, 1 reply; 70+ messages in thread
From: Donald Becker @ 2001-12-03 18:12 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Davide Libenzi (davidel@xmailserver.org) wrote

>And if you're the prophet and you think that the future of multiprocessing 
>is UP on clusters, why instead of spreading your word between us poor 
>kernel fans don't you pull out money from your pocket ( or investors ) and 
>start a new Co. that will have that solution as its primary and unique goal ?

I believe that the future of multiprocessing is clusters of small scale
SMP machines, 2-8 processors each.  And the most important part of
clustering them together isn't single system image from the programmer's
point of view, it's transparent administration for the end user.  Thus
our system has a unified process space and a single point of control,
while imposing no overhead on processes.

You are right that there is no reason to convince people here -- I tried
to do that a few years ago.  Instead I've put lots of my own time and
money, as well as investor money, into a company that does only cluster
system software.

Anyway, my real point is that while I'm a big proponent of designing
consistent interfaces rather than the haphazard, incompatible changes
that have been occurring, this is far from predict-the-future design.

The goal of designing the kernel to support 128 way SMP systems is a
perfect example of the difference.  A few days or weeks of using a
proposed interface change will show if the advantages are worth the cost
of the change.  We won't know for years whether redesigning the kernel for
large scale SMP systems is useful:
  - does it actually work?
  - will big SMP machines be common, or even exist?
  - will big SMP machines have the characteristics we predict?
Let alone whether it is worth the costs, such as
  - a UP performance hit
  - a complexity increase that slows other improvements
  - difficult performance tuning


Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993


* Re: Linux/Pro  -- clusters
@ 2001-12-08  1:50 Andries.Brouwer
  2001-12-08  3:42 ` H. Peter Anvin
  0 siblings, 1 reply; 70+ messages in thread
From: Andries.Brouwer @ 2001-12-08  1:50 UTC (permalink / raw)
  To: alan, torvalds; +Cc: linux-kernel

    From: Alan Cox <alan@lxorguk.ukuu.org.uk>

    > > For those of us who want to run a standards based operating system can
    > > you do the 32bit dev_t.
    > 
    > You asked for an _internal_ data structure. dev_t is the external
    > representation, and has _nothing_ to do with any drivers at all.

    The internal representation is kdev_t, which wants to turn into a pointer
    from what Aeb has been saying for a long time.

Yes and no. If I am not mistaken there are three details:

(i) Linus prefers to separate block and character devices.
I agree that that makes the code a bit cleaner, but dislike
the code duplication: the interface to user space, the allocation,
deallocation, registering is completely identical for the two.
But apparently Linus does not mind a little bloat if that avoids
an ugly cast in two or three places.

(ii) So, we split kdev_t into kbdev_t and kcdev_t.
Al (and/or Linus) baptizes the struct that a kbdev_t is pointing at
"struct block_device". I usually had a two-layer version, with
device_struct and driver_struct, while struct genhd disappeared.
Don't know whether Al has similar ideas.
The current struct block_device is an ordered pair (dev_t, ops *)
and does not seem to give easy access to the partitions, so maybe Al
still has to reshuffle things a bit, or add a pointer to a struct genhd.
We'll see.

(iii) The past months Al has been nibbling away a little at the road
that makes kdev_t (or kbdev_t or so) a pointer to a device_struct.
Instead it looks like he wants to construct a parallel and equivalent
road starting from the already present basis for a struct block_device.

So, yes, internally we'll have a pointer. No, it doesn't look like
the name of the pointer will be kdev_t.

No doubt Linus or Al or somebody will correct me if the above is all wrong.


    A 32bit "dev_t" is needed so that we can label over 65536 file systems
    to things like ls, regardless of how "/dev/sdfoo" is mapped onto a
    driver.

    I'm sure that dev_t (the cookie we feed to user space) going to 32bits is
    going to break something and I'd rather it broke early

Yes, that is an entirely independent matter.
User space uses a 64bit cookie today, and the kernel throws away
three quarters of that. Very little breaks if the kernel throws away less.
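
As an illustration only (this code is not from the thread): the sketch
below assumes the 16-bit device number layout of that era, 8 bits of major
and 8 bits of minor, and shows what cutting a 64-bit user-space cookie down
to 16 bits amounts to. The helper names dev_major, dev_minor, mkdev16 and
kernel_dev_from_user are made up for the example, and keeping the low
16 bits is an assumption about which quarter survives.

```c
#include <assert.h>
#include <stdint.h>

#define MINORBITS 8
#define MINORMASK ((1U << MINORBITS) - 1)

/* Split a 16-bit device number into its 8-bit major and minor parts. */
static inline unsigned int dev_major(uint16_t dev) { return dev >> MINORBITS; }
static inline unsigned int dev_minor(uint16_t dev) { return dev & MINORMASK; }

/* Build a 16-bit device number; both parts are limited to 0..255. */
static inline uint16_t mkdev16(unsigned int major, unsigned int minor)
{
    assert(major <= MINORMASK && minor <= MINORMASK);
    return (uint16_t)((major << MINORBITS) | minor);
}

/* Assumed truncation: of the 64 bits user space hands over,
 * only the low 16 are kept and the other 48 are discarded. */
static inline uint16_t kernel_dev_from_user(uint64_t cookie)
{
    return (uint16_t)(cookie & 0xffffu);
}
```

With this layout no more than 256 majors with 256 minors each can be
named, which is the cramped space the thread wants to leave behind.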

[As you know I like a large dev_t, and Linus hated it before he understood
the use of a large dev_t. (For example, he worried that an "ls" would take
many centuries.) Don't know about current opinions. Such a lot of nice
applications: use any device description you like, take a cryptographic
hash and have a device number. Or, generate a new anonymous device by
incrementing a counter. Or, support full NFS.
It would really be a pity to go only to 32 bits. Indeed, 32 bits is
large but not large enough to be collision-free for random assignments,
so one would need a registry of numbers. With a much larger device
number the registry is superfluous.]
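
As an illustration only (not from the thread), the collision claim can be
checked with the usual birthday-problem product. The function name
collision_probability and the choice of 65536 devices as a test case are
assumptions made for this sketch; device numbers are taken to be drawn
uniformly at random.

```c
#include <assert.h>

/* Probability that n device numbers, drawn uniformly at random from a
 * space of 2^bits values, contain at least one collision.  Computed as
 * 1 - prod_{i=0}^{n-1} (1 - i / 2^bits). */
static double collision_probability(unsigned long n, unsigned int bits)
{
    double space = 1.0;
    double no_collision = 1.0;
    unsigned int b;
    unsigned long i;

    assert(bits > 0 && bits <= 64);
    for (b = 0; b < bits; b++)
        space *= 2.0;
    for (i = 0; i < n; i++)
        no_collision *= 1.0 - (double)i / space;
    return 1.0 - no_collision;
}
```

For 65536 randomly numbered devices, collision_probability(65536, 32)
comes out at about 0.39, so a 32-bit space would indeed need a registry,
while the same call with 64 bits gives a value that is negligible by
comparison.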

Andries

* Re: Linux/Pro  -- clusters
@ 2001-12-08 17:26 Andries.Brouwer
  2001-12-09  4:22 ` Linus Torvalds
  0 siblings, 1 reply; 70+ messages in thread
From: Andries.Brouwer @ 2001-12-08 17:26 UTC (permalink / raw)
  To: alan, torvalds; +Cc: linux-kernel, viro

    From: Linus Torvalds <torvalds@transmeta.com>

    The sad thing is that along the whole path, we actually end
    up needing the structure pointer in different places, so the IO code
    (which is supposed to be timing-critical) ends up doing various lookups on
    the kdev_t several times (both at a higher level and deep down in the IO
    submit layer).

    So now we have to do "bdfind()" for "kdev_t -> block_device", and
    "get_gendisk()" for "kdev_t -> struct gendisk" and about 5 different
    "index various arrays using the MAJOR number" on the way to actually doing
    the IO.

    Even though the filesystems that want to _do_ the IO actually already have
    the structure pointer available, and all the indexing off major would
    actually fairly trivially just be about reading off the fields off that
    structure.

    Oh, well. It _is_ going to be quite painful to switch things around.

I don't understand at all. It is not painful at all.
Things are completely straightforward.

A kdev_t is a pointer to all information needed, nowhere a lookup,
except at open time.

You make it kbdev_t, and then call it struct block_device *.
OK, the name doesn't matter as long as the struct it points to has all
information needed. In my version that is the case, and I would
be rather surprised if it were otherwise in Al's version.

The changes are only of the easy, provably correct, mechanical kind.
Boring work, and a bit slow - each step requires a grep over the
kernel source and there are about a hundred steps.

I am sure also Al will tell you that there is no problem.

Andries

* Re: Linux/Pro  -- clusters
@ 2001-12-09  8:59 Andries.Brouwer
  2001-12-10 16:49 ` Alexander Viro
  0 siblings, 1 reply; 70+ messages in thread
From: Andries.Brouwer @ 2001-12-09  8:59 UTC (permalink / raw)
  To: torvalds, viro; +Cc: Andries.Brouwer, alan, linux-kernel

    From: Alexander Viro <viro@math.psu.edu>

    > > I am sure also Al will tell you that there is no problem.

    <raised brows>  What gave you such impression?  IIRC, I've described
    the problems several months ago.  Three words: object freeing policy.

    The fundamental reason why we can't replace kdev_t with pointer and hope
    to survive is that YOU DON'T FREE NUMBERS.  Integer is an integer - it's
    always valid.  We will need to free the structures and _that_ is where the
    problems will start.

Yes, you are quite right, this is a difficulty, more serious
than the bdev/cdev distinction Linus mentions.

But for me the difficulty is far away.

Let me once more sketch the mechanical change.

Part 1: Invent some random structures, to be changed when needed,
that contain all data we want to refer to via our pointer.
Since the procedure was supposed to be mechanical, take
the arrays indexed today by major or major,minor and make
their contents fields in these structs.

Work to do: global search and replace of
	blk_size[MAJOR(dev)][MINOR(dev)]
by
	dev->size
(possibly with a shift: I was going to bytes instead of blocks;
possibly with an inline function
	get_size(dev)
so that changing the setup of these structs later is easier).
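
As an illustration only (this is not the code from the patches discussed
here), Part 1 might look roughly as follows. The struct layout, the
helper blocks_to_bytes and the assumption that the old array counted
1024-byte blocks are all choices made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device record; it collects what used to live in
 * arrays indexed by major and minor. */
struct device_struct {
    unsigned int dev;           /* user-visible device number */
    unsigned long long size;    /* device size in bytes */
};

/* In this sketch a kdev_t is simply a pointer to the record. */
typedef struct device_struct *kdev_t;

/* The "shift": convert an old block count to bytes, assuming
 * 1024-byte blocks. */
static inline unsigned long long blocks_to_bytes(unsigned long blocks)
{
    return (unsigned long long)blocks << 10;
}

/* Accessor replacing blk_size[MAJOR(dev)][MINOR(dev)]; if the struct
 * layout changes later, only this function needs to follow. */
static inline unsigned long long get_size(kdev_t dev)
{
    assert(dev != NULL);
    return dev->size;
}
```

A caller that used to index the array would then write get_size(dev)
and no longer needs to know where the number is stored.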

Part 2: These structures have to be allocated. Let the allocating
happen in the same place where the arrays like blk_size[][] are
initialized today.

Part 3: These structures have to be found, given a dev_t.
Use a hash table.
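
As an illustration only (not the actual implementation), Parts 2 and 3
could be sketched like this. The names dev_register, dev_lookup and
dev_hashfn, the bucket count and the hash function are all assumptions
made for the example, and malloc stands in for a kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEV_HASH_SIZE 64    /* arbitrary bucket count for the sketch */

struct device_struct {
    unsigned int dev;               /* device number, used as the key */
    unsigned long long size;        /* example payload */
    struct device_struct *next;     /* chain within one hash bucket */
};

static struct device_struct *dev_hash[DEV_HASH_SIZE];

static unsigned int dev_hashfn(unsigned int dev)
{
    return (dev ^ (dev >> 8)) % DEV_HASH_SIZE;
}

/* Part 3: find the record for a device number, or NULL if none. */
static struct device_struct *dev_lookup(unsigned int dev)
{
    struct device_struct *d;

    for (d = dev_hash[dev_hashfn(dev)]; d != NULL; d = d->next)
        if (d->dev == dev)
            return d;
    return NULL;
}

/* Part 2: allocate a record at the point where the driver used to
 * fill in its arrays, and make it findable by device number. */
static struct device_struct *dev_register(unsigned int dev,
                                          unsigned long long size)
{
    struct device_struct *d;
    unsigned int h = dev_hashfn(dev);

    assert(dev_lookup(dev) == NULL);    /* each number registered once */
    d = malloc(sizeof(*d));
    if (d == NULL)
        return NULL;
    d->dev = dev;
    d->size = size;
    d->next = dev_hash[h];
    dev_hash[h] = d;
    return d;
}
```

The lookup is needed only where a dev_t enters the kernel, for example
at open time; after that the pointer itself is passed around.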

Now you see no refcounting, and no freeing.
But my point is that that does not matter.
At least not at first.

I have run for months with systems like this, and typically saw
2000 or so such structures allocated. But they are small structures,
a few dozen bytes, nobody cares - at first.

Result of the mechanical change: a system without arrays,
with large device numbers, so that people can have ten thousand
SCSI disk partitions, should they want to.

In other words, two problems are solved: the arrays are gone,
and the device numbers no longer live in this cramped space.


Yes, now you want, and I want, to go further.
As long as these structs are not located in memory that goes away,
and do not contain pointers that point to stuff that goes away
when a module is unloaded it does not matter much that they are
never freed. But in the long run we of course want to free all
that is allocated. So, later we must audit what happens to them.
I can say more about that, but our difference is that it is your
first worry and my last worry.

(Roughly speaking the situation is still as I ordained six years ago:
things of type kdev_t only live in ROOT_DEV, inode->i_dev, inode->i_rdev,
sb->s_dev, bh->b_dev, req->rq_dev, tty->device.
We change inode->i_rdev back to a dev_t.
One does not want to free the struct upon the last close; soon there
will be an open again. One only wants to free the struct when the module
is unloaded, or perhaps when it is certain that the device will never
be used again, like in my version with 40-bit anonymous device numbers
that are never reused. So, inode, sb, bh, req, tty belonging to a
module that is unloaded must be freed. But we wanted that already,
also without device structs.)

[There is more to say, but I have to go, and maybe you and Linus
can start telling me why this mechanical approach is silly.
Hope to be back twelve hours from now.]

Andries

* Re: Linux/Pro  -- clusters
@ 2001-12-10 19:36 Andries.Brouwer
  2001-12-10 22:55 ` Alexander Viro
  0 siblings, 1 reply; 70+ messages in thread
From: Andries.Brouwer @ 2001-12-10 19:36 UTC (permalink / raw)
  To: Andries.Brouwer, viro; +Cc: alan, linux-kernel, torvalds

    From viro@math.psu.edu Mon Dec 10 17:50:02 2001

    Basically you propose to take the current system, replace it with
    something without clear memory management ("let it leak") and then
    try to fix the resulting mess.

Al - you are using debating tricks instead of logic, using
negative words ("unclear", "leak", "mess") instead of arguments.
Maybe you are unable to refute the soundness of the system I propose?

It is quite possible that I overlook some detail.
On the other hand, I have been running these systems.
You are not able to convince me that something is wrong
just by handwaving. Real arguments are required.


What I do is go from the present situation, in a series of steps,
to a new situation where the source looks different but the
system behaves provably the same. Consequently, no "fixing"
is required. "Mess" is a matter of taste, I'll not discuss that
except by saying that I vastly prefer the situation without arrays.
"Leak" is false. "Dangling pointers" is false.

Andries


[About "leak": What happens today is that a driver like sd.c
allocates arrays and fills them. In my version this driver
allocates structures and fills them. When the module is removed,
today the arrays are freed. In my version the structures are
freed at that point. So, no leakage occurs.
About "dangling pointers": The correctness condition for this
scheme is that no struct that contains kdev_t fields survives
removal of the module.
It seems to me that that is true already, and in any case will
be easy to ensure. If you have other opinions, please come
with explicit examples where fundamental problems would occur.]

[and, Linus, the name of the beast makes no difference; kdev_t
or kbdev_t or struct block_device *; it is the same amount of work]

* Re: Linux/Pro  -- clusters
@ 2001-12-10 19:51 Andries.Brouwer
  2001-12-10 20:34 ` Alan Cox
  0 siblings, 1 reply; 70+ messages in thread
From: Andries.Brouwer @ 2001-12-10 19:51 UTC (permalink / raw)
  To: alan, viro; +Cc: Andries.Brouwer, linux-kernel, torvalds

    From alan@lxorguk.ukuu.org.uk Mon Dec 10 18:01:03 2001

    And it means we can get proper refcounting. Which as the maintainer of
    two block drivers that support dynamic volume create/destroy is remarkably
    good news.

You say this as if that would be a difference between the two
approaches. I don't think it is.

My goal was: allow large device numbers.
The subgoal: get rid of the arrays since these do not allow large indices.
The approach: make kdev_t a pointer to some random structure.

Now that I have achieved my goal, if you come along and want
refcounting, it seems to me that all I have to do is add a field
refcount to this struct, and have xget() and xput() routines
increase or decrease this number.

Maybe you are confused because usually one has a structure
that keeps track of all references to itself, so that the structure
can be freed when the number drops to zero. I do not need such refcounting
for a kdev_t, but it is very easy to keep track of the number of openers,
the number of inodes, or whatever you would like to count.
After all, anything you do with the device gets called with
a kdev_t argument, so nothing is easier than having open() increase
and close() decrease some field.
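
As an illustration only (not code from the thread), the counting
described above could be as small as this. The field name openers is
made up, and a plain int is used for brevity where real kernel code
would need an atomic counter or a lock.

```c
#include <assert.h>

/* Hypothetical device record with one counter added. */
struct device_struct {
    unsigned int dev;
    int openers;        /* number of current openers */
};

typedef struct device_struct *kdev_t;

/* Called from open(): one more user of the device. */
static inline void xget(kdev_t dev)
{
    dev->openers++;
}

/* Called from close(): one user less; returns the remaining count.
 * Reaching zero does not free the struct in this scheme. */
static inline int xput(kdev_t dev)
{
    assert(dev->openers > 0);
    return --dev->openers;
}
```

Here the count is bookkeeping only. The lifetime of the struct is tied
to the module, as argued above, and not to the counter reaching zero.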

Andries

* Re: Linux/Pro  -- clusters
@ 2001-12-10 21:31 Andries.Brouwer
  2001-12-10 21:44 ` Alan Cox
  0 siblings, 1 reply; 70+ messages in thread
From: Andries.Brouwer @ 2001-12-10 21:31 UTC (permalink / raw)
  To: Andries.Brouwer, alan; +Cc: linux-kernel, torvalds, viro

    From: Alan Cox <alan@lxorguk.ukuu.org.uk>

    >     And it means we can get proper refcounting. Which as the maintainer
    >     of two block drivers that support dynamic volume create/destroy is
    >     remarkably good news.
    > 
    > You say this as if that would be a difference between the two
    > approaches. I don't think it is.

    It's easier to make sure it's correct when we have a single structure,
    not a pile of arrays.

I don't understand your reference to arrays. Nobody uses arrays.
That is something of the past.

    Object lifetime becomes explicit, and we don't have to
    worry about re-use races since a new instance of that major,minor
    will have a different object attached to the one in use that is
    about to be refcounted into oblivion by currently active requests

As described, my setup certainly has no re-use races, since
I do not use refcounts as a way to terminate the lifespan of
a kdev_t. So, are you saying that you prefer my version?
I have problems reading your replies.

Andries

* Re: Linux/Pro  -- clusters
@ 2001-12-10 22:48 Andries.Brouwer
  0 siblings, 0 replies; 70+ messages in thread
From: Andries.Brouwer @ 2001-12-10 22:48 UTC (permalink / raw)
  To: Andries.Brouwer, alan; +Cc: linux-kernel, torvalds, viro

> Basically you seem to be saying
> "void *  is cool" (aka kdev_t is basically an opaque magic).

Well, kdev_t is just as opaque as struct inode *.
One refers to what you want to know about a block device.
The other to what you want to know about an inode.

> I don't see what it gains you over "struct block_device *".

That is difficult to say, since the present struct block_device
still has a long way to go. At present it has no facilities
for storing data. Maybe the final results would be the same.
My main objective has always been to do a mechanical,
correctness preserving change (as the first and major step).

This means that very early on the road the objectives
"no arrays" and "large device numbers" are achieved.
Afterwards one can continue restructuring and polishing
as desired. Al's approach (as I understand it) will
achieve the same things, but later, and with more handwork.

Andries

* Re: Linux/Pro  -- clusters
@ 2001-12-10 23:33 Andries.Brouwer
  0 siblings, 0 replies; 70+ messages in thread
From: Andries.Brouwer @ 2001-12-10 23:33 UTC (permalink / raw)
  To: Andries.Brouwer, viro; +Cc: alan, linux-kernel, torvalds

    From: Alexander Viro <viro@math.psu.edu>

    What???  You've just said that on the first stage you are not going to
    free these objects and then add freeing them and audit the whole thing 
    at that point.

    The first is commonly known as leak (objects are allocated but not freed).

You are mistaken. Allocation without freeing is not a leak.
A leak is the situation where an unbounded amount of memory is lost
over time because of repeated allocs without corresponding frees.

Allocation of a known, bounded amount of memory is no leak.

(But this has very little relevance except in a shouting match.
Your next remarks are more interesting.)

    Dangling pointers is what you will have to fight during that audit -
    places where something retains kdev_t after your object had been freed.

    Let me rephrase it: with your plan we will have much more complex audit
    needed at the moment when you introduce freeing your objects.  Reason:
    it will have to involve all subsystems using kdev_t at once.  That's
    my problem with your plan.  Sigh...

I am not as afraid as you are.
Something retains kdev_t after the module has been unloaded?
That would be a bug, sure, both in the present and in the future kernel.
I listed the places where a kdev_t is stored (inode, sb, ..) and for
each of those it is true that these structs should be released before
or at module unload time, so that after module unload time no instances
of corresponding kdev_t are left.

Moreover, the audit happens fully automatically during the boring,
mechanical work. Indeed, already the separation of kdev_t into
kbdev_t and kcdev_t will touch all places where kdev_t occurs,
so that as a side effect one has a list of all places where one
of these is stored.

Andries


end of thread, other threads:[~2001-12-11 20:58 UTC | newest]

Thread overview: 70+ messages
2001-12-03 18:12 Linux/Pro -- clusters Donald Becker
2001-12-04  1:55 ` Davide Libenzi
2001-12-04  2:09   ` Donald Becker
2001-12-04  2:23     ` Davide Libenzi
2001-12-04  2:34       ` Alexander Viro
2001-12-04  9:10     ` Alan Cox
2001-12-04  9:30       ` Thomas Langås
2001-12-04  9:45         ` Alan Cox
2001-12-04 11:34           ` Thomas Langås
2001-12-05 21:57         ` Linus Torvalds
2001-12-05 23:05           ` Andre Hedrick
2001-12-06  4:31             ` Daniel Phillips
2001-12-05 23:49           ` Alan Cox
2001-12-05 23:48             ` Andre Hedrick
2001-12-06 16:58             ` Linus Torvalds
2001-12-06 18:02               ` Alan Cox
2001-12-06 18:07                 ` Linus Torvalds
2001-12-06 18:12                   ` Kai Henningsen
2001-12-06 20:46                     ` Linus Torvalds
2001-12-06 22:40                     ` Alan Cox
2001-12-06 18:33                   ` Alan Cox
2001-12-06 18:55                     ` Linus Torvalds
2001-12-06 19:19                       ` Alan Cox
2001-12-06 20:37                         ` Linus Torvalds
2001-12-06 22:35                           ` Alan Cox
2001-12-06 22:34                             ` Linus Torvalds
2001-12-06 22:58                               ` Alexander Viro
2001-12-07 10:14                     ` Martin Dalecki
2001-12-07 10:37                       ` Alan Cox
2001-12-07 10:56                         ` Martin Dalecki
2001-12-07 12:08                           ` Alan Cox
2001-12-07 20:51                             ` On re-working the major/minor system Erik Andersen
2001-12-07 21:21                               ` H. Peter Anvin
2001-12-07 21:55                                 ` Erik Andersen
2001-12-07 22:04                                   ` H. Peter Anvin
2001-12-07 23:07                                     ` Erik Andersen
2001-12-07 23:12                                       ` H. Peter Anvin
2001-12-08 11:42                                         ` Alan Cox
2001-12-08 20:37                                           ` H. Peter Anvin
2001-12-09 12:06                                   ` Kai Henningsen
2001-12-09 21:57                                     ` H. Peter Anvin
2001-12-11 20:45                                       ` Kai Henningsen
2001-12-06 18:38               ` Linux/Pro -- clusters Doug Ledford
2001-12-04 14:37     ` Daniel Phillips
2001-12-04 15:19       ` Jeff Garzik
2001-12-04 17:16         ` Daniel Phillips
2001-12-04 17:20           ` Jeff Garzik
2001-12-04 18:04           ` Alan Cox
2001-12-04 18:16             ` Daniel Phillips
2001-12-04 20:20               ` Andrew Morton
2001-12-05 13:11               ` Deep look into VFS Martin Dalecki
2001-12-05 15:19                 ` Alexander Viro
2001-12-05 15:30                   ` Martin Dalecki
  -- strict thread matches above, loose matches on Subject: below --
2001-12-08  1:50 Linux/Pro -- clusters Andries.Brouwer
2001-12-08  3:42 ` H. Peter Anvin
2001-12-08 17:26 Andries.Brouwer
2001-12-09  4:22 ` Linus Torvalds
2001-12-09  5:49   ` Alexander Viro
2001-12-09  8:59 Andries.Brouwer
2001-12-10 16:49 ` Alexander Viro
2001-12-10 17:09   ` Alan Cox
2001-12-11  8:39   ` Albert D. Cahalan
2001-12-10 19:36 Andries.Brouwer
2001-12-10 22:55 ` Alexander Viro
2001-12-10 19:51 Andries.Brouwer
2001-12-10 20:34 ` Alan Cox
2001-12-10 21:31 Andries.Brouwer
2001-12-10 21:44 ` Alan Cox
2001-12-10 22:48 Andries.Brouwer
2001-12-10 23:33 Andries.Brouwer
