* Linux/Pro -- clusters
@ 2001-12-03 18:12 Donald Becker
2001-12-04 1:55 ` Davide Libenzi
0 siblings, 1 reply; 53+ messages in thread
From: Donald Becker @ 2001-12-03 18:12 UTC (permalink / raw)
To: Linux Kernel Mailing List
Davide Libenzi (davidel@xmailserver.org) wrote

>And if you're the prophet and you think that the future of multiprocessing
>is UP on clusters, why instead of spreading your word between us poor
>kernel fans don't you pull out money from your pocket ( or investors ) and
>start a new Co. that will have that solution as primary and unique goal ?

I believe that the future of multiprocessing is clusters of small scale
SMP machines, 2-8 processors each. And the most important part of
clustering them together isn't single system image from the programmer's
point of view, it's transparent administration for the end user. Thus
our system has a unified process space and a single point of control,
while imposing no overhead on processes.

You are right that there is no reason to convince people here -- I tried
to do that a few years ago. Instead I've put lots of my own time and
money, as well as investor money, into a company that does only cluster
system software.

Anyway, my real point is that while I'm a big proponent of designing
consistent interfaces rather than the haphazard, incompatible changes
that have been occurring, this is far from predict-the-future design.

The goal of designing the kernel to support 128 way SMP systems is a
perfect example of the difference. A few days or weeks of using a
proposed interface change will show if the advantages are worth the cost
of the change. We won't know for years if redesigning the kernel for
large scale SMP systems is useful
   - does it actually work,
   - will big SMP machines be common, or even exist?
   - will big SMP machines have the characteristics we predict
let alone worth the costs such as
   - UP performance hit
   - complexity increase slows other improvements
   - difficult performance tuning

Donald Becker becker@scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993
^ permalink raw reply [flat|nested] 53+ messages in thread

* Re: Linux/Pro -- clusters
2001-12-03 18:12 Linux/Pro -- clusters Donald Becker
@ 2001-12-04 1:55 ` Davide Libenzi
2001-12-04 2:09 ` Donald Becker
0 siblings, 1 reply; 53+ messages in thread
From: Davide Libenzi @ 2001-12-04 1:55 UTC (permalink / raw)
To: Donald Becker; +Cc: Linux Kernel Mailing List

On Mon, 3 Dec 2001, Donald Becker wrote:

> Davide Libenzi (davidel@xmailserver.org) wrote
>
> >And if you're the prophet and you think that the future of multiprocessing
> >is UP on clusters, why instead of spreading your word between us poor
> >kernel fans don't you pull out money from your pocket ( or investors ) and
> >start a new Co. that will have that solution as primary and unique goal ?
>
> I believe that the future of multiprocessing is clusters of small scale
> SMP machines, 2-8 processors each. And the most important part of
> clustering them together isn't single system image from the programmer's
> point of view, it's transparent administration for the end user. Thus
> our system has a unified process space and a single point of control,
> while imposing no overhead on processes.
>
> You are right that there is no reason to convince people here -- I tried
> to do that a few years ago. Instead I've put lots of my own time and
> money, as well as investor money, into a company that does only cluster
> system software.
>
> Anyway, my real point is that while I'm a big proponent of designing
> consistent interfaces rather than the haphazard, incompatible changes
> that have been occurring, this is far from predict-the-future design.
>
> The goal of designing the kernel to support 128 way SMP systems is a
> perfect example of the difference. A few days or weeks of using a
> proposed interface change will show if the advantages are worth the cost
> of the change. We won't know for years if redesigning the kernel for
> large scale SMP systems is useful
>    - does it actually work,
>    - will big SMP machines be common, or even exist?
>    - will big SMP machines have the characteristics we predict
> let alone worth the costs such as
>    - UP performance hit
>    - complexity increase slows other improvements
>    - difficult performance tuning

Don't get me wrong Donald, I like clusters and for a certain type of
application they fit perfectly and they have a quasi-proportional
scalability over the number of computational units.
So, I like clusters.

Quite sadly the real world has applications that cannot be easily fit in a
cluster environment like, for example, heavily threaded applications ( pls do
not start a thread topic from here ) or a more general
share-everything-over-computational-units kind.
Now this kind of design, that we can or cannot call crappy, exists in the
real world and is currently working for a _lot_ of companies' application
servers.
Now we, as we're all smart guys, know that "work" is GOOD and nobody,
especially heavily paid managers, is willing to change/rearchitect something
that is currently working.
So these guys are happily running their own application server until one
day they suddenly realize that they need more power.
Now they're going to face a dilemma, that is 1) use a standard
architecture big SMP machine ( high price ) 2) embrace a clustered
solution ( lower price ) having in this way to redesign their application.
I feel quite comfortable in excluding SSI for the kind of applications I'm
talking about.
Since I've seen this scenario many many times I'm telling You that, by
following the 1st theorem of "work" that states that "work is GOOD", our
poor high-salary-senior-manager is going to snatch solution 1 w/out even
thinking about it.

No, I do not believe in 128 single CPU SMP machines but, if I've to watch
inside my pretty dirty crystal ball, I see multi-core CPUs as a technology
response to SMP request.
Yes, because after the 1st theorem of "work" there's the 1st lemma of
technology that states that "technology will always follow the
market request".

- Davide

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-04 1:55 ` Davide Libenzi
@ 2001-12-04 2:09 ` Donald Becker
2001-12-04 2:23 ` Davide Libenzi
` (2 more replies)
0 siblings, 3 replies; 53+ messages in thread
From: Donald Becker @ 2001-12-04 2:09 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Linux Kernel Mailing List

On Mon, 3 Dec 2001, Davide Libenzi wrote:

> On Mon, 3 Dec 2001, Donald Becker wrote:
> > of the change. We won't know for years if redesigning the kernel for
> > large scale SMP system is useful
> >    - does it actually work,
> >    - will big SMP machines be common, or even exist?
> >    - will big SMP machines have the characteristics we predict
> > let alone worth the costs such as
> >    - UP performance hit
> >    - complexity increase slows other improvements
> >    - difficult performance tuning
...
> No, I do not believe in 128 single CPU SMP machines but, if I've to watch
> inside my pretty dirty crystal ball, I see multi-core CPUs as a technology
> response to SMP request.
> Yes, because after the 1st theorem of "work" there's the 1st lemma of
> technology that states that "technology will always follow the
> market request".

You haven't addressed the points above.
We haven't established that the market will request substantial numbers
of 128-way SMPs. Even if they do request single-address-space
multiprocessors, it's very likely that the result will be some form of cc-numa
where the structure will strongly influence the OS to treat the machine
as something besides a SMP.

To bring this branch back on point: we should distinguish between
design for an arbitrary and unpredictable goal (e.g. 128 way SMP)
vs. putting some design into things that we are supposed to already
understand
   a SCSI device layer that isn't three half-finished clean-ups
   a VFS layer that doesn't require the kernel to know a priori all of
     the filesystem types that might be loaded

Donald Becker becker@scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-04 2:09 ` Donald Becker
@ 2001-12-04 2:23 ` Davide Libenzi
2001-12-04 2:34 ` Alexander Viro
2001-12-04 9:10 ` Alan Cox
2001-12-04 14:37 ` Daniel Phillips
2 siblings, 1 reply; 53+ messages in thread
From: Davide Libenzi @ 2001-12-04 2:23 UTC (permalink / raw)
To: Donald Becker; +Cc: Linux Kernel Mailing List

On Mon, 3 Dec 2001, Donald Becker wrote:

> On Mon, 3 Dec 2001, Davide Libenzi wrote:
>
> > On Mon, 3 Dec 2001, Donald Becker wrote:
> > > of the change. We won't know for years if redesigning the kernel for
> > > large scale SMP system is useful
> > >    - does it actually work,
> > >    - will big SMP machines be common, or even exist?
> > >    - will big SMP machines have the characteristics we predict
> > > let alone worth the costs such as
> > >    - UP performance hit
> > >    - complexity increase slows other improvements
> > >    - difficult performance tuning
> ...
> > No, I do not believe in 128 single CPU SMP machines but, if I've to watch
> > inside my pretty dirty crystal ball, I see multi-core CPUs as a technology
> > response to SMP request.
> > Yes, because after the 1st theorem of "work" there's the 1st lemma of
> > technology that states that "technology will always follow the
> > market request".
>
> You haven't addressed the points above.
> We haven't established that the market will request substantial numbers
> of 128-way SMPs. Even if they do request single-address-space
> multiprocessors, it's very likely that the result will be some form of cc-numa
> where the structure will strongly influence the OS to treat the machine
> as something besides a SMP.
>
> To bring this branch back on point: we should distinguish between
> design for an arbitrary and unpredictable goal (e.g. 128 way SMP)
> vs. putting some design into things that we are supposed to already
> understand
>    a SCSI device layer that isn't three half-finished clean-ups
>    a VFS layer that doesn't require the kernel to know a priori all of
>      the filesystem types that might be loaded

Donald, I'm not even thinking about planning a 128 CPU scalability for
Linux.
The whole point of this discussion ( if any ) is that we've to design on
what we've or on what we expect to have in a very near future.
We cannot play with technology on long term plans because, no matter how
good we plan, it'll screw us up.

- Davide

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-04 2:23 ` Davide Libenzi
@ 2001-12-04 2:34 ` Alexander Viro
0 siblings, 0 replies; 53+ messages in thread
From: Alexander Viro @ 2001-12-04 2:34 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Donald Becker, Linux Kernel Mailing List

[apologies for over-the-head reply]

> On Mon, 3 Dec 2001, Donald Becker wrote:

> >    a SCSI device layer that isn't three half-finished clean-ups

<nod>

> >    a VFS layer that doesn't require the kernel to know a priori all of
> >      the filesystem types that might be loaded

WTF? The only interpretation I can think of is about unions in struct
inode and struct superblock. _If_ you add a filesystem that
	a) doesn't do separate allocation of fs-private parts of
	   inode/superblock (i.e. doesn't use ->u.generic_ip and
	   ->u.generic_sbp) and
	b) hadn't been known at kernel compile time and
	c) has one of these fields (member in inode->u or sb->u) bigger
	   than all filesystems known at compile time
- yes, you've got a problem. Solution: use ->u.generic_<...>. Works fine.

Not to mention the fact that VFS per se doesn't give a damn for fs types.
All it needs is sizeof(struct inode) and sizeof(struct superblock). And
any fs using ->generic_<...> (i.e. pointer to separately allocated private
objects) is OK, whether it was known at build time or not.

^ permalink raw reply [flat|nested] 53+ messages in thread
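
The union-versus-pointer point in the message above can be illustrated with a small user-space sketch. It is not kernel code: `demo_inode`, `ext2_info`, `minix_info`, `newfs_info`, `newfs_attach` and `newfs_detach` are hypothetical names made up for the example, and only the `u.generic_ip` member name is taken from the message. The sketch assumes a filesystem that was unknown at build time and whose private data is larger than the union.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for per-filesystem private inode data. */
struct ext2_info  { unsigned long block_group; };
struct minix_info { unsigned short zones[10]; };

/* Simplified model of an inode whose union is sized at build time. */
struct demo_inode {
    unsigned long ino;
    union {
        struct ext2_info  ext2_i;
        struct minix_info minix_i;
        void             *generic_ip;  /* pointer to separately allocated data */
    } u;
};

/* A filesystem added later, with private data bigger than the union. */
struct newfs_info { char big_state[256]; };

/* Attach private data through the pointer; returns 0 on success. */
static int newfs_attach(struct demo_inode *inode)
{
    struct newfs_info *info;

    assert(inode != NULL);
    info = calloc(1, sizeof(*info));
    if (info == NULL)
        return -1;
    inode->u.generic_ip = info;
    return 0;
}

/* Release the private data again. */
static void newfs_detach(struct demo_inode *inode)
{
    assert(inode != NULL);
    free(inode->u.generic_ip);
    inode->u.generic_ip = NULL;
}
```

Because the new filesystem only stores a pointer, `sizeof(struct demo_inode)` does not change when it is added, which is the property the message relies on.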
* Re: Linux/Pro -- clusters
2001-12-04 2:09 ` Donald Becker
2001-12-04 2:23 ` Davide Libenzi
@ 2001-12-04 9:10 ` Alan Cox
2001-12-04 9:30 ` Thomas Langås
2001-12-04 14:37 ` Daniel Phillips
2 siblings, 1 reply; 53+ messages in thread
From: Alan Cox @ 2001-12-04 9:10 UTC (permalink / raw)
To: Donald Becker; +Cc: Davide Libenzi, Linux Kernel Mailing List

>    a SCSI device layer that isn't three half-finished clean-ups

Beginning (at last)

>    a VFS layer that doesn't require the kernel to know a priori all of
>      the filesystem types that might be loaded

That was done a while ago. File systems are one by one being moved from using
the union of stuff to the fs specific pointer. New file systems don't have
to go hack the inode etc

Alan

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-04 9:10 ` Alan Cox
@ 2001-12-04 9:30 ` Thomas Langås
2001-12-04 9:45 ` Alan Cox
2001-12-05 21:57 ` Linus Torvalds
0 siblings, 2 replies; 53+ messages in thread
From: Thomas Langås @ 2001-12-04 9:30 UTC (permalink / raw)
To: Alan Cox; +Cc: Linux Kernel Mailing List

Alan Cox:
> >    a SCSI device layer that isn't three half-finished clean-ups
> Beginning (at last)

So there's someone fixing the SCSI-layer code now? (It's marked as
unmaintained in the MAINTAINERS-file for 2.4-kernels, at least)

--
Thomas

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-04 9:30 ` Thomas Langås
@ 2001-12-04 9:45 ` Alan Cox
2001-12-04 11:34 ` Thomas Langås
2001-12-05 21:57 ` Linus Torvalds
1 sibling, 1 reply; 53+ messages in thread
From: Alan Cox @ 2001-12-04 9:45 UTC (permalink / raw)
To: linux-kernel; +Cc: Alan Cox

> Alan Cox:
> > >    a SCSI device layer that isn't three half-finished clean-ups
> > Beginning (at last)
>
> So there's someone fixing the SCSI-layer code now? (It's marked as
> unmaintained in the MAINTAINERS-file for 2.4-kernels, at least)

Take a look at the 2.5 code and you'll notice various bits of old cruft
are vanishing rapidly (old eh support, clustering gloop etc)

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-04 9:45 ` Alan Cox
@ 2001-12-04 11:34 ` Thomas Langås
0 siblings, 0 replies; 53+ messages in thread
From: Thomas Langås @ 2001-12-04 11:34 UTC (permalink / raw)
To: Alan Cox; +Cc: linux-kernel

Alan Cox:
> > So there's someone fixing the SCSI-layer code now? (It's marked as
> > unmaintained in the MAINTAINERS-file for 2.4-kernels, at least)
> Take a look at the 2.5 code and you'll notice various bits of old cruft
> are vanishing rapidly (old eh support, clustering gloop etc)

Ok, I see... I'm trying to read up on the design of the SCSI-internals
(from http://www.andante.org/scsi.html), and it seems like there's much to do
yet... Is the listing in scsi_todo.html still valid, or is much of what's
listed there already done?

--
Thomas

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-04 9:30 ` Thomas Langås
2001-12-04 9:45 ` Alan Cox
@ 2001-12-05 21:57 ` Linus Torvalds
2001-12-05 23:05 ` Andre Hedrick
2001-12-05 23:49 ` Alan Cox
1 sibling, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2001-12-05 21:57 UTC (permalink / raw)
To: linux-kernel

In article <20011204103010.A30650@stud.ntnu.no>,
Thomas Langås <thomas@langaas.org> wrote:
>Alan Cox:
>> >    a SCSI device layer that isn't three half-finished clean-ups
>> Beginning (at last)
>
>So there's someone fixing the SCSI-layer code now? (It's marked as
>unmaintained in the MAINTAINERS-file for 2.4-kernels, at least)

The old SCSI code won't be fixed. It will just be made totally obsolete
by the better generic block layer code. I personally hope that a year
from now, if somebody wants to do a new SCSI driver, he won't even
_think_ about using the SCSI code, the driver will just take the
(generic SCSI) requests directly off the block queue.

Death to middle-men that can't do a good job anyway.

		Linus

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-05 21:57 ` Linus Torvalds
@ 2001-12-05 23:05 ` Andre Hedrick
2001-12-06 4:31 ` Daniel Phillips
2001-12-05 23:49 ` Alan Cox
1 sibling, 1 reply; 53+ messages in thread
From: Andre Hedrick @ 2001-12-05 23:05 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On Wed, 5 Dec 2001, Linus Torvalds wrote:

> The old SCSI code won't be fixed. It will just be made totally obsolete
> by the better generic block layer code. I personally hope that a year
> from now, if somebody wants to do a new SCSI driver, he won't even
> _think_ about using the SCSI code, the driver will just take the
> (generic SCSI) requests directly off the block queue.
>
> Death to middle-men that can't do a good job anyway.

Linus,

Would a three part model be to your liking?
The parts are there for isolation and for testing the integrity of the
driver, to have confidence it can be trusted to do its tasks properly.

	BLOCK IO
----------------------
1) Mainloop
2) Personality Drivers (DEVICE TYPES, but expanded)
3) HOST/Controller-DEVICE

The significance of part three of the driver layer is in satisfying a new
requirement in SCSI4 for (buzzword) "Domain Boundaries". It means to
provide a diagnostic verification of the transport/data-phase layers.
It would require non-serialized block access to perform something like a
direct pattern-block write-read-verification-checksum. This is not
trivial for SCSI, but it can be created.

The strength of this model is that Linux could then isolate the hardware
problems and make corrections on a per-controller basis and not pollute the
purity of the "Domain Boundaries".

This is a model I have been working on for a while for ATA for 2.5;
however, it is no longer possible at this time because the changes in
the Block IO model are not documented: nothing describes what new
request_struct items were added, why, and their usage ruleset.

Regards,

Andre Hedrick
Linux ATA Development

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-05 23:05 ` Andre Hedrick
@ 2001-12-06 4:31 ` Daniel Phillips
0 siblings, 0 replies; 53+ messages in thread
From: Daniel Phillips @ 2001-12-06 4:31 UTC (permalink / raw)
To: Andre Hedrick, Linus Torvalds; +Cc: linux-kernel

On December 6, 2001 12:05 am, Andre Hedrick wrote:
> This is a model, I have been working on for a while for ATA for 2.5,
> however it is no longer possible at this time because of the changes in
> the Block IO model that are not documented describing what and why new
> request_struct items are added and their usage ruleset.

There's a rather nice document by Suparna, have you seen it?

--
Daniel

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-05 21:57 ` Linus Torvalds
2001-12-05 23:05 ` Andre Hedrick
@ 2001-12-05 23:49 ` Alan Cox
2001-12-05 23:48 ` Andre Hedrick
2001-12-06 16:58 ` Linus Torvalds
1 sibling, 2 replies; 53+ messages in thread
From: Alan Cox @ 2001-12-05 23:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

> by the better generic block layer code. I personally hope that a year
> from now, if somebody wants to do a new SCSI driver, he won't even
> _think_ about using the SCSI code, the driver will just take the
> (generic SCSI) requests directly off the block queue.

You still need the scsi code. There are a whole sequence of common, quite
complex and generic functions that the scsi layer handles (in particular
error handling).

Turning it the right way up I definitely agree with. It should be the driver
calling the scsi code to do bio->scsi request, and to do scsi error
recovery, not vice versa.

There are also some tricky relationships
	queues are per logical unit number
	locking is mostly per controller
	resources are often per controller

Alan

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-05 23:49 ` Alan Cox
@ 2001-12-05 23:48 ` Andre Hedrick
2001-12-06 16:58 ` Linus Torvalds
1 sibling, 0 replies; 53+ messages in thread
From: Andre Hedrick @ 2001-12-05 23:48 UTC (permalink / raw)
To: Alan Cox; +Cc: Linus Torvalds, linux-kernel

On Wed, 5 Dec 2001, Alan Cox wrote:

> > by the better generic block layer code. I personally hope that a year
> > from now, if somebody wants to do a new SCSI driver, he won't even
> > _think_ about using the SCSI code, the driver will just take the
> > (generic SCSI) requests directly off the block queue.
>
> You still need the scsi code. There are a whole sequence of common, quite
> complex and generic functions that the scsi layer handles (in particular
> error handling).
>
> Turning it the right way up I definitely agree with. It should be the driver
> calling the scsi code to do bio->scsi request, and to do scsi error
> recovery, not vice versa.
>
> There are also some tricky relationships
> 	queues are per logical unit number
> 	locking is mostly per controller
> 	resources are often per controller

Alan,

Nothing that can not be handled in the core-model described earlier;
however, I am positive that the subtle issues are more sticky than you are
revealing now.

Regards,

Andre Hedrick
Linux Disk Certification Project
Linux ATA Development

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-05 23:49 ` Alan Cox
2001-12-05 23:48 ` Andre Hedrick
@ 2001-12-06 16:58 ` Linus Torvalds
2001-12-06 18:02 ` Alan Cox
2001-12-06 18:38 ` Linux/Pro -- clusters Doug Ledford
1 sibling, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2001-12-06 16:58 UTC (permalink / raw)
To: linux-kernel

In article <E16BlnL-00080m-00@the-village.bc.nu>,
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
>You still need the scsi code. There are a whole sequence of common, quite
>complex and generic functions that the scsi layer handles (in particular
>error handling).

Well, the preliminary patches already handle _some_ common things, like
building the proper command request for reads and writes etc, and that
will probably continue. We'll probably have to have all the old helpers
for things like "this target only wants to be probed on lun 0" etc.

I disagree about the error handling, though.

Traditionally, the timeouts and the reset handling was handled in the
SCSI mid-layer, and it was a complete and utter disaster. Different
hosts simply wanted so different behaviour that it's not even funny.
Timeouts for different commands were so different that people ended up
making most timeouts so long that they no longer made sense for other
commands etc.

Other device drivers have been able to handle timeouts and errors on
their own before, and have _not_ had the kinds of horrendous problems
that the SCSI layer has had.

We'll see what the details will end up being, but I personally think
that it is a major mistake to try to have generic error handling. The
only true generic thing is "this request finished successfully / with an
error", and _no_ high-level retries etc. It's up to the driver to decide
if retries make sense.

(Often retrying _doesn't_ make sense, because the firmware on the
high-end card or disk itself may already have done retries on its own,
and high-level error handling is nothing but a waste of time and causes
the error notification to be even more delayed).

		Linus

^ permalink raw reply [flat|nested] 53+ messages in thread
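
The split described in the message above, where the generic layer only learns "finished successfully / with an error" and the driver alone decides about retries, might look roughly like the following user-space sketch. Every name here (`demo_request`, `demo_end_request`, `demo_driver_issue`, `DEMO_DRIVER_MAX_TRIES`, and the two fake hardware functions) is hypothetical and exists only for illustration; none of it is taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* What the generic layer tracks: only whether the request is done and ok. */
struct demo_request {
    int  attempts;   /* filled in by the driver, for inspection only */
    bool done;
    bool ok;
};

/* The single generic notification: success or error, no retry policy. */
static void demo_end_request(struct demo_request *rq, bool ok)
{
    rq->done = true;
    rq->ok = ok;
}

/* Hypothetical hardware hook: returns true if this attempt succeeded. */
typedef bool (*demo_hw_fn)(int attempt);

/* Retry policy is private to this (hypothetical) driver. */
#define DEMO_DRIVER_MAX_TRIES 3

static void demo_driver_issue(struct demo_request *rq, demo_hw_fn hw)
{
    assert(rq && hw);
    for (rq->attempts = 1; rq->attempts <= DEMO_DRIVER_MAX_TRIES; rq->attempts++) {
        if (hw(rq->attempts)) {
            demo_end_request(rq, true);
            return;
        }
    }
    rq->attempts = DEMO_DRIVER_MAX_TRIES;
    demo_end_request(rq, false);   /* report the error once, after giving up */
}

/* Two fake devices: one recovers on the second try, one never does. */
static bool demo_hw_flaky(int attempt) { return attempt >= 2; }
static bool demo_hw_dead(int attempt)  { (void)attempt; return false; }
```

A driver whose firmware already retries internally could set its own limit to one attempt without touching `demo_end_request`, which is the flexibility the message argues for.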
* Re: Linux/Pro -- clusters
2001-12-06 16:58 ` Linus Torvalds
@ 2001-12-06 18:02 ` Alan Cox
2001-12-06 18:07 ` Linus Torvalds
2001-12-06 18:38 ` Linux/Pro -- clusters Doug Ledford
1 sibling, 1 reply; 53+ messages in thread
From: Alan Cox @ 2001-12-06 18:02 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

> Timeouts for different commands were so different that people ended up
> making most timeouts so long that they no longer made sense for other
> commands etc.

Thats per _target_ not host. Which needs to be common code.

> Other device drivers have been able to handle timeouts and errors on
> their own before, and have _not_ had the kinds of horrendous problems
> that the SCSI layer has had.

Every IDE layer uses the same IDE error handling code, because every IDE
driver would otherwise have to make a copy of it - ditto scsi.

> that it is a major mistake to try to have generic error handling. The
> only true generic thing is "this request finished successfully / with an
> error", and _no_ high-level retries etc. It's up to the driver to decide
> if retries make sense.

Retries and retry handling are target specific not host specific (think
about the ton of logic you need every time your cd rom decides to error
a read). You can have a read turn into a sequence of operations while you
go and work out why it failed, ask it if its ready, tell it to lock the
door, spin up the media, wait for it to be ready, reissue the I/O.

This processing has to be robust because scsi cd-roms for example are
rarely robust themselves.

So its very much

	request->controller
			libscsi -> make me a command block
			issue command
	interrupt->controller
			error ?
			libscsi recommend an action please
			add suggested recovery to queue head
			kick request handling

> (Often retrying _doesn't_ make sense, because the firmware on the
> high-end card or disk itself may already have done retries on its own,
> and high-level error handling is nothing but a waste of time and causes
> the error notification to be even more delayed).

Those devices aren't SCSI controllers, and they don't want to appear as one.
Thats a horrible windows NT habit that harms performance badly. Of course
everyone is now doing it with Linux because someone wouldn't provide more
major numbers.

Which is another thing - can you make the internal dev_t 32 or 64bits now.
You can have 65536 volumes on an S/390 so even with perfectly distributed
devfs allocated device identifiers - we don't have enough.

Alan

^ permalink raw reply [flat|nested] 53+ messages in thread
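
The "driver calls the library" flow outlined in the message above can be sketched in user space as two helper functions that a driver would call: one builds a command block, the other recommends a recovery action. The `lib_*` names and the recovery policy are hypothetical; the message does not specify either. The READ(10) opcode and byte layout and the sense key values are standard SCSI, but mapping NOT READY to "spin up" and UNIT ATTENTION to "retry" is only an illustrative assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical set of actions the library can recommend to a driver. */
enum lib_action { LIB_FAIL, LIB_RETRY, LIB_SPIN_UP };

/* "make me a command block": fill in a 10-byte READ(10) CDB. */
static void lib_build_read10(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
{
    assert(cdb != NULL);
    memset(cdb, 0, 10);
    cdb[0] = 0x28;                    /* READ(10) opcode */
    cdb[2] = (uint8_t)(lba >> 24);    /* logical block address, big-endian */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8); /* transfer length, big-endian */
    cdb[8] = (uint8_t)nblocks;
}

/* "recommend an action please": illustrative policy keyed on the sense key. */
static enum lib_action lib_recommend(uint8_t sense_key)
{
    switch (sense_key) {
    case 0x02: return LIB_SPIN_UP;    /* NOT READY */
    case 0x06: return LIB_RETRY;      /* UNIT ATTENTION */
    default:   return LIB_FAIL;       /* leave everything else to the driver */
    }
}
```

In this arrangement the driver stays in control: it asks for a command block, issues it, and on error asks for a recommendation that it is free to queue or ignore.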
* Re: Linux/Pro -- clusters
2001-12-06 18:02 ` Alan Cox
@ 2001-12-06 18:07 ` Linus Torvalds
2001-12-06 18:12 ` Kai Henningsen
2001-12-06 18:33 ` Alan Cox
0 siblings, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2001-12-06 18:07 UTC (permalink / raw)
To: Alan Cox; +Cc: linux-kernel

On Thu, 6 Dec 2001, Alan Cox wrote:
>
> > Timeouts for different commands were so different that people ended up
> > making most timeouts so long that they no longer made sense for other
> > commands etc.
>
> Thats per _target_ not host. Which needs to be common code.

It hasn't traditionally been "common code".

The old SCSI layer has various fixed timeouts, many of them on the order
of 2-5 minutes, and none of them target-specific.

Some of them are effectively turned off - the format timeout was increased
to 2 hours to make sure that it basically never triggers.

But never fear, we'll have some common routines for error handling. But
they will be library routines, NOT the current crap.

> Those devices aren't SCSI controllers, and they don't want to appear as one.

Ehh.. IDE disks take SCSI commands, and do most error recovery entirely in
disk firmware. There is very little you can do about most errors there.

Don't think "SCSI" as in SCSI controllers. Think SCSI as in "fairly
generic packet protocol that somehow infiltrated most things".

> Which is another thing - can you make the internal dev_t 32 or 64bits now.
> You can have 65536 volumes on an S/390 so even with perfectly distributed
> devfs allocated device identifiers - we don't have enough.

It's called "struct block_device" and "struct genhd". The pointers will
have as many bits as pointers have on the architecture. Low-level drivers
will not even see anything else eventually, there will be no "numbers".

		Linus

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-06 18:07 ` Linus Torvalds
@ 2001-12-06 18:12 ` Kai Henningsen
2001-12-06 20:46 ` Linus Torvalds
2001-12-06 22:40 ` Alan Cox
2001-12-06 18:33 ` Alan Cox
1 sibling, 2 replies; 53+ messages in thread
From: Kai Henningsen @ 2001-12-06 18:12 UTC (permalink / raw)
To: linux-kernel; +Cc: alan, torvalds

torvalds@transmeta.com (Linus Torvalds) wrote on 06.12.01 in
<Pine.LNX.4.33.0112060958450.10625-100000@penguin.transmeta.com>:

> Some of them are effectively turned off - the format timeout was increased
> to 2 hours to make sure that it basically never triggers.

And I recently found out the hard way that wasn't enough, and ended up
kludging a utility to patch a running kernel (don't ask) to increase that
timeout. Turned out the drive needed a little over three hours to tell me
it couldn't format.

Frankly, format should really have NO timeout. Or possibly a user-
specified one.

MfG Kai

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-06 18:12 ` Kai Henningsen
@ 2001-12-06 20:46 ` Linus Torvalds
2001-12-06 22:40 ` Alan Cox
1 sibling, 0 replies; 53+ messages in thread
From: Linus Torvalds @ 2001-12-06 20:46 UTC (permalink / raw)
To: Kai Henningsen; +Cc: linux-kernel

On 6 Dec 2001, Kai Henningsen wrote:
>
> Frankly, format should really have NO timeout. Or possibly a user-
> specified one.

Well, frankly, the interface should be that the user code sends the
command it needs, and waits for it. With no policy in the kernel at all.

Now, for backwards compatibility reasons we cannot do that generically,
and some things (not format, though), may be common enough that the
"library code" to do the normal thing is in the kernel.

But on the whole, the question should not be "how long should the timeout
be", but more along the lines of "how can we make this easy to interface
to existing and new applications _without_ having policy decisions like
timeouts and number of retries".

		Linus

^ permalink raw reply [flat|nested] 53+ messages in thread
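
One way to read "no policy in the kernel" from the exchange above is that the timeout travels with the command and is chosen by whoever issued it. The following user-space sketch assumes a hypothetical `demo_cmd` descriptor and a `demo_timed_out` check, neither of which comes from the thread; the convention that zero means "no timeout" is likewise an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command descriptor: the caller, not the kernel, picks the limit. */
struct demo_cmd {
    unsigned long timeout_secs;   /* 0 means "wait as long as it takes" */
};

/* Returns true once the caller's own limit, if any, has been exceeded. */
static bool demo_timed_out(const struct demo_cmd *cmd, unsigned long elapsed_secs)
{
    assert(cmd != NULL);
    if (cmd->timeout_secs == 0)
        return false;             /* caller asked for no timeout at all */
    return elapsed_secs > cmd->timeout_secs;
}
```

With a fixed two-hour limit, the three-hour format described earlier in the thread would be cut off, while a caller that passes zero simply waits for the drive to answer.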
* Re: Linux/Pro -- clusters
2001-12-06 18:12 ` Kai Henningsen
2001-12-06 20:46 ` Linus Torvalds
@ 2001-12-06 22:40 ` Alan Cox
1 sibling, 0 replies; 53+ messages in thread
From: Alan Cox @ 2001-12-06 22:40 UTC (permalink / raw)
To: Kai Henningsen; +Cc: linux-kernel, alan, torvalds

> kludging a utility to patch a running kernel (don't ask) to increase that
> timeout. Turned out the drive needed a little over three hours to tell me
> it couldn't format.
>
> Frankly, format should really have NO timeout. Or possibly a user-
> specified one.

For generic packet interfaces the "abort" operation becomes something you
can put in the hands of user space, as well as progress reports

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-06 18:07 ` Linus Torvalds
2001-12-06 18:12 ` Kai Henningsen
@ 2001-12-06 18:33 ` Alan Cox
2001-12-06 18:55 ` Linus Torvalds
2001-12-07 10:14 ` Martin Dalecki
1 sibling, 2 replies; 53+ messages in thread
From: Alan Cox @ 2001-12-06 18:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, linux-kernel

> Some of them are effectively turned off - the format timeout was increased
> to 2 hours to make sure that it basically never triggers.

Thats scsi_generic which thankfully puts most of the logic in user space.

> > Those devices aren't SCSI controllers, and they don't want to appear as one.
>
> Don't think "SCSI" as in SCSI controllers. Think SCSI as in "fairly
> generic packet protocol that somehow infiltrated most things".

The scsi controller is akin to a network driver. The stuff that matters is
stuff like the scsi disk, scsi cd and scsi tape drivers. Scsi disk and CD
need to do a lot of error recovery (especially CD-ROM). Disk too has to
because older scsi devices don't have the same kind of "the host is clueless
crap I'll have to try error recovery myself before reporting" mentality.

It would be nice if a lot of the CD error/recovery logic could be in the
cdrom libraries because the logic (close the door, lock the door, try
half speed, ..) is the same in scsi and ide.

> It's called "struct block_device" and "struct genhd". The pointers will
> have as many bits as pointers have on the architecture. Low-level drivers
> will not even see anything else eventually, there will be no "numbers".

For those of us who want to run a standards based operating system can
you do the 32bit dev_t. Otherwise some slightly fundamental things don't
work. You know boring stuff like ls, find, df, and other standard unix
commands. Those export a dev_t cookie.

If you don't want to be able to run stuff like ls, just let me know and
I'll start another kernel tree 8)

Alan

^ permalink raw reply [flat|nested] 53+ messages in thread
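
The arithmetic behind the dev_t complaint can be made concrete. The sketch below assumes the 16-bit device number of kernels from this period, split into an 8-bit major and an 8-bit minor; the `old_dev_t` type and the `OLD_*` macros are hypothetical names for the example, not the kernel's own. Under that assumption the whole system has 65536 device numbers in total, which is exactly the number of volumes mentioned for a single S/390.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 16-bit device number, 8 bits major and 8 bits minor. */
typedef uint16_t old_dev_t;

#define OLD_MKDEV(ma, mi) ((old_dev_t)((((ma) & 0xffu) << 8) | ((mi) & 0xffu)))
#define OLD_MAJOR(dev)    ((unsigned)((dev) >> 8))
#define OLD_MINOR(dev)    ((unsigned)((dev) & 0xffu))

/* Total number of distinct device numbers for every driver combined. */
#define OLD_DEV_SPACE     (1u << 16)
```

A minor number that does not fit in 8 bits silently aliases an existing device, so `OLD_MKDEV(8, 256)` produces the same value as `OLD_MKDEV(8, 0)`.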
* Re: Linux/Pro -- clusters 2001-12-06 18:33 ` Alan Cox @ 2001-12-06 18:55 ` Linus Torvalds 2001-12-06 19:19 ` Alan Cox 2001-12-07 10:14 ` Martin Dalecki 1 sibling, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2001-12-06 18:55 UTC (permalink / raw) To: Alan Cox; +Cc: linux-kernel On Thu, 6 Dec 2001, Alan Cox wrote: > > The scsi controller is akin to a network driver. The stuff that matters is > stuff like the scsi disk, scsi cd and scsi tape drivers. Scsi disk and CD > need to do a lot of error recovery (especially CD-ROM). Ok, we agree here. The problem is that we've done things the "wrong way around". If you think of the problem as a network controller, together with "packets" that have SCSI commands in them, then it is clear how you should NOT have - read/write -> - driver IO request -> SCSI layer -> driver because that is equivalent to doing TCP with - read/write -> driver request -> TCP layer -> driver which is bogus. However, what's bogus about it is not that the old SCSI layer was above the driver, but the fact that it was _below_ the "ll_rw_block" and request queueing interface. That's the _packet_ interface. You don't do TCP or UDP below the packet interface. We should try to have some of the error recovery etc at a really _high_ level, preferably in user space. Especially the "complicated" cases are hard to do any other way, as some IO errors require you to start sending magic "unlock drive using this key" packets to the drive, and just stupidly retrying simply will not work. But that is not something that the SCSI layer should really care about. > It would be nice if a lot of the CD error/recovery logic could be in the > cdrom libraries because the logic (close the door, lock the door, try > half speed, ..) is the same in scsi and ide. Not CD-ROM library. Instead, what I and Jens have been talking about, and what the next pre-patch will actually have is to move some of the higher-level logic _up_, to above the "packet interface". 
Think of "struct request" as a packet, and think of a disk driver as nothing but a specialized network driver. So what do you get? Rip out all of drivers/scsi/scsi_ioctl.c, and replace it with a much higher-level interface that parses the ioctl and passes down the appropriate packets. So "close door" is equivalent to a ICMP packet. Normal read/write is TCP - we do merging, sorting, re-ordering etc, again at a higher level. The packet that makes it to the low-level driver is just a packet. This is the only layer that does retransmit etc. And then you have the old "raw packet" interface, where user-level apps can send commands down to the disk. > For those of us who want to run a standards based operating system can > you do the 32bit dev_t. You asked for an _internal_ data structure. dev_t is the external representation, and has _nothing_ to do with any drivers at all. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters 2001-12-06 18:55 ` Linus Torvalds @ 2001-12-06 19:19 ` Alan Cox 2001-12-06 20:37 ` Linus Torvalds 0 siblings, 1 reply; 53+ messages in thread From: Alan Cox @ 2001-12-06 19:19 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan Cox, linux-kernel > Normal read/write is TCP - we do merging, sorting, re-ordering etc, again > at a higher level. The packet that makes it to the low-level driver is > just a packet. This is the only layer that does retransmit etc. Makes sense yes. I'm not sure how much we can push into user space before we break back compatibility or lose the needed info/security credentials to take action but it makes sense when possible. > > For those of us who want to run a standards based operating system can > > you do the 32bit dev_t. > > You asked for an _internal_ data structure. dev_t is the external > representation, and has _nothing_ to do with any drivers at all. The internal representation is kdev_t, which wants to turn into a pointer from what Aeb has been saying for a long time. A 32bit "dev_t" is need so that we can label over 65536 file systems to things like ls, regardless of how "/dev/sdfoo" is mapped onto a driver I'm sure that dev_t (the cookie we feed to user space) going to 32bits is going to break something and I'd rather it broke early Alan ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters 2001-12-06 19:19 ` Alan Cox @ 2001-12-06 20:37 ` Linus Torvalds 2001-12-06 22:35 ` Alan Cox 0 siblings, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2001-12-06 20:37 UTC (permalink / raw) To: Alan Cox; +Cc: linux-kernel On Thu, 6 Dec 2001, Alan Cox wrote: > > The internal representation is kdev_t, which wants to turn into a pointer No. That kdev_t has been around for years, and is going away. In 2.6 there will _be_ no kdev_t. There is "struct block_device" for internal stuff, and "dev_t" for external stuff. The first one is a real structure, the second one is just a cookie. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-06 20:37 ` Linus Torvalds
@ 2001-12-06 22:35 ` Alan Cox
2001-12-06 22:34 ` Linus Torvalds
0 siblings, 1 reply; 53+ messages in thread
From: Alan Cox @ 2001-12-06 22:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, linux-kernel

> On Thu, 6 Dec 2001, Alan Cox wrote:
> > The internal representation is kdev_t, which wants to turn into a pointer
>
> No.
>
> That kdev_t has been around for years, and is going away. In 2.6 there
> will _be_ no kdev_t.
>
> There is "struct block_device" for internal stuff, and "dev_t" for
> external stuff. The first one is a real structure, the second one is just
> a cookie.

Ok, so kdev_t will split into structs for char and block devices, which
are separate things?

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-06 22:35 ` Alan Cox
@ 2001-12-06 22:34 ` Linus Torvalds
2001-12-06 22:58 ` Alexander Viro
0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2001-12-06 22:34 UTC (permalink / raw)
To: Alan Cox; +Cc: linux-kernel

On Thu, 6 Dec 2001, Alan Cox wrote:
>
> Ok so kdev_t will split into structs for char and block device which are
> separate things ?

Yes. And the name will change to reflect that.

(Ie once char and block are separate, like they logically are in the
namespace anyway, there's no "dev_t" at all, it's all "struct char_device"
or "struct block_device" and they have nothing in common).

We already have pretty much all the infrastructure in place for this, it's
just that a lot of calling conventions have "kdev_t" still (which is
actually ambiguous as-is - you have to look at the function name etc to
figure out if it is a character device or a block device).

The main ones are things like "bread()" down all the way to the bottom of
the IO path. The sad thing is that along the whole path, we actually end
up needing the structure pointer in different places, so the IO code
(which is supposed to be timing-critical) ends up doing various lookups on
the kdev_t several times (both at a higher level and deep down in the IO
submit layer).

So now we have to do "bdfind()" for "kdev_t -> block_device", and
"get_gendisk()" for "kdev_t -> struct gendisk" and about 5 different
"index various arrays using the MAJOR number" on the way to actually doing
the IO.

Even though the filesystems that want to _do_ the IO actually already have
the structure pointer available, and all the indexing off major would
fairly trivially just be about reading the fields off that structure.

(Ugh, just _look_ at the code looking up block size, sector size,
"readonly" status, queue finding, statistics gathering etc. The ro_bits
thing seems to "know" that "long" is 32 bits etc. It's enough to make you
cry ;)

Oh, well. It _is_ going to be quite painful to switch things around.

Linus

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-06 22:34 ` Linus Torvalds
@ 2001-12-06 22:58 ` Alexander Viro
0 siblings, 0 replies; 53+ messages in thread
From: Alexander Viro @ 2001-12-06 22:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, linux-kernel

On Thu, 6 Dec 2001, Linus Torvalds wrote:

> The main ones are things like "bread()" down all the way to the bottom of
> the IO path. The sad thing is that along the whole path, we actually end
> up needing the structure pointer in different places, so the IO code
> (which is supposed to be timing-critical) ends up doing various lookups on
> the kdev_t several times (both at a higher level and deep down in the IO
> submit layer).

I have conversion patches for bread()/getblk()/get_hash_table(). Once the
bio stuff settles down I'll start feeding them to you - they are very
straightforward. A nice side effect is the death of the buffer hash - once
we have block_device in all places in question we can use the page hash
just fine. One level of spinlocks in buffer.c goes to hell...

If you are interested I can feed the preparation part tomorrow - it's a
matter of adding

	struct buffer_head * sb_bread(struct super_block *sb, sector_t block)
	{
		return bread(sb->s_dev, block, sb->s_blocksize);
	}

and replacing instances of that in filesystems with this guy. That alone
reduces the number of places that call bread() by a factor of 80, IIRC. And
it's an obvious cleanup that doesn't break anything and can go into 2.4 as
well as 2.5. Same goes for getblk() and get_hash_table().

After that there is the payload part of the patch - the switch to
struct block_device *, which is now available to all callers - sb_...()
have it in sb->s_bdev and the rest also have a place to get it from. And
that I'd rather postpone until bio is stable. However, that part is
several orders of magnitude smaller than the entire patch - most is the
conversion above.

So if you want it - I can do it as soon as I get some sleep; the last
version of that patch is against 2.4.12, it's split into edible chunks and
it's not hard to update.

Comments?

^ permalink raw reply [flat|nested] 53+ messages in thread
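The wrapper idea in the message above can be tried outside the kernel with
a small mock. This is only a sketch: every type below is a stand-in
invented for illustration (not the real kernel definitions), bread() here
just records its arguments instead of doing I/O, and read_block_one() is a
hypothetical filesystem caller.

```c
#include <stddef.h>

/* Stand-in types for illustration only; NOT the real kernel definitions. */
typedef unsigned short kdev_t;
typedef unsigned long sector_t;

struct block_device { int unused; };

struct buffer_head {
	kdev_t b_dev;
	sector_t b_blocknr;
	unsigned long b_size;
};

struct super_block {
	kdev_t s_dev;
	struct block_device *s_bdev;
	unsigned long s_blocksize;
};

/* Mock of the old interface: records its arguments, performs no I/O. */
static struct buffer_head *bread(kdev_t dev, sector_t block, unsigned long size)
{
	static struct buffer_head bh;	/* one static buffer: fine for a mock */

	bh.b_dev = dev;
	bh.b_blocknr = block;
	bh.b_size = size;
	return &bh;
}

/* The wrapper: callers hand over the superblock, not a device number. */
static struct buffer_head *sb_bread(struct super_block *sb, sector_t block)
{
	return bread(sb->s_dev, block, sb->s_blocksize);
}

/* Hypothetical filesystem code: it no longer mentions kdev_t at all. */
static struct buffer_head *read_block_one(struct super_block *sb)
{
	return sb_bread(sb, 1);
}
```

Once every filesystem goes through sb_bread(), a later switch from
sb->s_dev to sb->s_bdev would touch only the wrapper body, which seems to
be the point of doing the preparation step first.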
* Re: Linux/Pro -- clusters
2001-12-06 18:33 ` Alan Cox
2001-12-06 18:55 ` Linus Torvalds
@ 2001-12-07 10:14 ` Martin Dalecki
2001-12-07 10:37 ` Alan Cox
1 sibling, 1 reply; 53+ messages in thread
From: Martin Dalecki @ 2001-12-07 10:14 UTC (permalink / raw)
To: Alan Cox; +Cc: Linus Torvalds, linux-kernel

> > It's called "struct block_device" and "struct genhd". The pointers will
> > have as many bits as pointers have on the architecture. Low-level drivers
> > will not even see anything else eventually, there will be no "numbers".
>
> For those of us who want to run a standards based operating system can
> you do the 32bit dev_t. Otherwise some slightly fundamental things don't
> work. You know boring stuff like ls, find, df, and other standard unix
> commands. Those export a dev_t cookie.

I don't think this is what Linus was talking about. The current problem
is that in many places the drivers (not the generic layer) know too much
about stuff which should be handled entirely in the generic device type
layer. And changing this is actually a *prerequisite* to changing the
type of dev_t.

For example please grep for the MINOR() macro in the scsi layer...
Most of the places where it's used should be replaced by a simple
driver instance enumerator. I did this once already, so this is for
sure.

> If you don't want to be able to run stuff like ls, just let me know and
> I'll start another kernel tree 8)

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-07 10:14 ` Martin Dalecki
@ 2001-12-07 10:37 ` Alan Cox
2001-12-07 10:56 ` Martin Dalecki
0 siblings, 1 reply; 53+ messages in thread
From: Alan Cox @ 2001-12-07 10:37 UTC (permalink / raw)
To: dalecki; +Cc: Alan Cox, Linus Torvalds, linux-kernel

> > For those of us who want to run a standards based operating system can
> > you do the 32bit dev_t. Otherwise some slightly fundamental things don't
> > work. You know boring stuff like ls, find, df, and other standard unix
> > commands. Those export a dev_t cookie.
>
> I don't think this is what Linus was talking about. The current problem

Linus wasn't talking about what I was talking about. The problem is the
other way around 8)

> For example please grep for the MINOR() macro in the scsi layer...
> Most of the places where it's used should be replaced by a simple
> driver instance enumerator. I did this once already, so this is for
> sure.

It becomes block_device->instance or ->minor.

major/minors for old stuff still end up leaking into user space and
mattering there. I'm not sure what the best option for that is.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-07 10:37 ` Alan Cox
@ 2001-12-07 10:56 ` Martin Dalecki
2001-12-07 12:08 ` Alan Cox
0 siblings, 1 reply; 53+ messages in thread
From: Martin Dalecki @ 2001-12-07 10:56 UTC (permalink / raw)
To: Alan Cox; +Cc: dalecki, Linus Torvalds, linux-kernel

Alan Cox wrote:

> > For example please grep for the MINOR() macro in the scsi layer...
> > Most of the places where it's used should be replaced by a simple
> > driver instance enumerator. I did this once already, so this is for
> > sure.
>
> it become block_device->instance or ->minor

Well, if all the information those functions need were already in place
in block_device, then it could become as easy as just passing
&block_device there.

However, please note that replacing kdev_t in the scsi layer with just
passing the minor can already be done *now* without any pain. The same
applies to the excessive MINOR lookups in the v4l code. I did this
already some time ago (the patch was posted here about one year ago).

> major/minors for old stuff still end up leaking into user space and
> mattering there. I'm not sure the best option for that

That's no problem. But they should be used as hash values on the
syscall implementation level and nowhere else.

--
- phone: +49 214 8656 283
- job: eVision-Ventures AG, LEV .de (MY OPINIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort: ru_RU.KOI8-R

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Linux/Pro -- clusters
2001-12-07 10:56 ` Martin Dalecki
@ 2001-12-07 12:08 ` Alan Cox
2001-12-07 20:51 ` On re-working the major/minor system Erik Andersen
0 siblings, 1 reply; 53+ messages in thread
From: Alan Cox @ 2001-12-07 12:08 UTC (permalink / raw)
To: dalecki; +Cc: Alan Cox, Linus Torvalds, linux-kernel

> > major/minors for old stuff still end up leaking into user space and
> > mattering there. I'm not sure the best option for that
>
> That's no problem. But they should be used as hash values on the
> syscall implementation level and nowhere else.

We have apps that "know" about specific major/minors that need changing,
and that will take time - also some of them are closed source so
unfixable.

For new stuff that bit isn't an issue, although ioctl overlaps mean we
have some other problems to worry about there.

^ permalink raw reply [flat|nested] 53+ messages in thread
* On re-working the major/minor system
2001-12-07 12:08 ` Alan Cox
@ 2001-12-07 20:51 ` Erik Andersen
2001-12-07 21:21 ` H. Peter Anvin
0 siblings, 1 reply; 53+ messages in thread
From: Erik Andersen @ 2001-12-07 20:51 UTC (permalink / raw)
To: Alan Cox; +Cc: dalecki, Linus Torvalds, linux-kernel

On Fri Dec 07, 2001 at 12:08:35PM +0000, Alan Cox wrote:
> > > major/minors for old stuff still end up leaking into user space and
> > > mattering there. I'm not sure the best option for that
> >
> > That's no problem. But they should be used as hash values on the
> > syscall implementation level and nowhere else.
>
> We have apps that "know" about specific major/minors that need changing and
> will take time - also some of them are closed source so unfixable.

Right. Tons of apps have illicit insider knowledge of kernel
major/minor representation and NEED IT to do their job. Try
running 'ls -l' on a device node. Wow, it prints out the major and
minor number. You can pack up a tarball containing all of /dev,
so tar has to have insider major/minor knowledge too -- as does
the structure of every existent tarball! Check out, for example,
Section 10.1.1 (page 210) of the IEEE Std. 1003.1b-1993 (POSIX)
and you will see every tarball in existence stores 8 chars for
the major, and 8 chars for the minor....

So we have POSIX, ls, tar, du, mknod, and mount and tons of other
apps all with illicit insider knowledge of what a dev_t looks
like. A couple of months ago I patched up mkfs.jffs2 so it could
create device nodes on the target filesystem that don't really
exist in the source directory (this avoids the need to be root when
building filesystems).

Right now, you will find that a zillion user space apps currently
have little snippets of code looking like:

	/* FIXME: MKDEV uses illicit insider knowledge of kernel
	 * major/minor representation... */
	#define MINORBITS 8
	#define MKDEV(ma,mi) (((ma) << MINORBITS) | (mi))

So currently, to do pretty much anything nifty related to devices
in userspace, userspace has to peek under the kernel's skirt to
know how to change a major and minor number into a dev_t and/or
to sanely populate a struct stat.

To change things, we 1) need some sort of sane interface by which
userspace can refer sensibly to devices without resorting to evil
illicit macros and 2) we certainly need some sort of a static
mapping such that existing devices end up mapping to the same
thing they always did or 3) we will need a flag day where we say
that all pre-2.5.x created tarballs and user space apps are
declared broken...

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: On re-working the major/minor system 2001-12-07 20:51 ` On re-working the major/minor system Erik Andersen @ 2001-12-07 21:21 ` H. Peter Anvin 2001-12-07 21:55 ` Erik Andersen 0 siblings, 1 reply; 53+ messages in thread From: H. Peter Anvin @ 2001-12-07 21:21 UTC (permalink / raw) To: linux-kernel Followup to: <20011207135100.A17683@codepoet.org> By author: Erik Andersen <andersen@codepoet.org> In newsgroup: linux.dev.kernel > > Right. Tons of apps have illicit insider knowledge of kernel > major/minor representation and NEED IT to do their job. Try > running 'ls -l' on a device node. Wow, it prints out major and > minor number. You can pack up a tarball containing all of /dev > so tar has to has insider major/minor knowledge too -- as does > the structure of every existant tarball! Check out, for example, > Section 10.1.1 (page 210) of the IEEE Std. 1003.1b-1993 (POSIX) > and you will see every tarball in existance stores 8 chars for > the major, and 8 chars for the minor.... > Actually, it's not "tons of apps", it's in the C library itself. These things are defined in <sys/sysmacros.h> and anyone who uses anything else should be taken out and shot. -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com> ^ permalink raw reply [flat|nested] 53+ messages in thread
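As a rough illustration of the point above, userspace can leave the split
entirely to the C library instead of hard-coding shifts. The sketch below
assumes a glibc-style <sys/sysmacros.h> providing major(), minor() and
makedev(); the function names device_numbers() and roundtrip_ok() are
invented for this example.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Fetch the major/minor pair of a device node without knowing how
 * dev_t is laid out. Returns 0 on success, -1 if the path cannot be
 * stat()ed or is not a character or block device.
 */
static int device_numbers(const char *path, unsigned int *ma, unsigned int *mi)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
		return -1;
	*ma = major(st.st_rdev);
	*mi = minor(st.st_rdev);
	return 0;
}

/* Pack and unpack through the library only; no MINORBITS anywhere. */
static int roundtrip_ok(unsigned int ma, unsigned int mi)
{
	dev_t dev = makedev(ma, mi);

	return major(dev) == ma && minor(dev) == mi;
}
```

Code written this way still has to be rebuilt if the library's definitions
change, but the source itself carries no knowledge of the bit layout.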
* Re: On re-working the major/minor system
2001-12-07 21:21 ` H. Peter Anvin
@ 2001-12-07 21:55 ` Erik Andersen
2001-12-07 22:04 ` H. Peter Anvin
2001-12-09 12:06 ` Kai Henningsen
0 siblings, 2 replies; 53+ messages in thread
From: Erik Andersen @ 2001-12-07 21:55 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-kernel

On Fri Dec 07, 2001 at 01:21:58PM -0800, H. Peter Anvin wrote:
> Followup to: <20011207135100.A17683@codepoet.org>
> By author: Erik Andersen <andersen@codepoet.org>
> In newsgroup: linux.dev.kernel
> >
> > Right. Tons of apps have illicit insider knowledge of kernel
> > major/minor representation and NEED IT to do their job. Try
> > running 'ls -l' on a device node. Wow, it prints out major and
> > minor number. You can pack up a tarball containing all of /dev
> > so tar has to have insider major/minor knowledge too -- as does
> > the structure of every existent tarball! Check out, for example,
> > Section 10.1.1 (page 210) of the IEEE Std. 1003.1b-1993 (POSIX)
> > and you will see every tarball in existence stores 8 chars for
> > the major, and 8 chars for the minor....
> >
>
> Actually, it's not "tons of apps", it's in the C library itself.

The C library, and the POSIX standard, etc, etc.

> These things are defined in <sys/sysmacros.h> and anyone who uses
> anything else should be taken out and shot.

Ok, so we go through, change sys/sysmacros.h, tar.h, cpio.h, and
any other offending header file. And guess what? Not only has
nothing changed (since those are macros, not functions), but you
just broke every older .deb and .rpm in existence on your updated
system.

In sys/sysmacros.h it defines major() and minor() as macros, so
just dropping in an updated C library binary isn't going to do
squat until all of userspace gets recompiled. And tar.h and
cpio.h define long standing (well over 10 years now) binary
structures. We can't just go changing this stuff, since now when
a dev_t is some magic cookie, if I go to install something from
my old Debian 1.2 CD or my old RedHat 4.0 CD, my system will puke
trying to install using cookies that in fact are old 8/8 split
device nodes and not cookies at all.

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: On re-working the major/minor system
2001-12-07 21:55 ` Erik Andersen
@ 2001-12-07 22:04 ` H. Peter Anvin
2001-12-07 23:07 ` Erik Andersen
2001-12-09 12:06 ` Kai Henningsen
1 sibling, 1 reply; 53+ messages in thread
From: H. Peter Anvin @ 2001-12-07 22:04 UTC (permalink / raw)
To: andersen; +Cc: linux-kernel

Erik Andersen wrote:

>
> Ok, so we go through, change sys/sysmacros.h, tar.h, cpio.h, and
> any other offending header file. And guess what? Not only has
> nothing changed (since those are macros, not functions), but you
> just broke every older .deb and .rpm in existence on your updated
> system.
>
> In sys/sysmacros.h it defines major() and minor() as macros, so
> just dropping in an updated C library binary isn't going to do
> squat until all of userspace gets recompiled. And tar.h and
> cpio.h define long standing (well over 10 years now) binary
> structures. We can't just go changing this stuff, since now when
> a dev_t is some magic cookie, if I go to install something from
> my old Debian 1.2 CD or my old RedHat 4.0 CD, my system will puke
> trying to install using cookies that in fact are old 8/8 split
> device nodes and not cookies at all.
>

It's clear a painful change is needed. **We don't have a choice.**
However, the fewer places we have to make source code changes the better.

What we agreed upon when this was discussed last year was the following:

dev_t is extended to 12:20 (32-bit size). I personally would rather
have seen a 64-bit size (32:32) but was outvoted :(

New major 0 is reserved, except that dev_t == 0 remains the code for "no
device". The unnamed device major becomes major 256.

If (dev_t & ~0xFFFF) == 0, the dev_t is an old-format dev_t, and is
interpreted according to the following algorithm:

	if ( dev && (dev & ~0xFFFF) == 0 ) {
		major = (dev >> 8) ? (dev >> 8) : 256;
		minor = dev & 0xFF;
	} else {
		major = dev >> 20;
		minor = dev & 0xFFFFF;
	}

-hpa

^ permalink raw reply [flat|nested] 53+ messages in thread
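The decoding rule in the message above can be checked in userspace with a
small sketch. The decode branch follows the quoted algorithm; struct
devnum, dev_decode() and dev_encode() are names invented here, and the
encoder simply assumes the 12:20 layout with major 0 kept reserved, as
described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for a decoded device number. */
struct devnum {
	uint32_t major;
	uint32_t minor;
};

/* Decode a 32-bit dev value following the scheme quoted above. */
static struct devnum dev_decode(uint32_t dev)
{
	struct devnum d;

	if (dev && (dev & ~UINT32_C(0xFFFF)) == 0) {
		/* Old 8:8 format; old major 0 (unnamed devices) maps to 256. */
		d.major = (dev >> 8) ? (dev >> 8) : 256;
		d.minor = dev & 0xFF;
	} else {
		/* New 12:20 format; dev == 0 ("no device") also lands here. */
		d.major = dev >> 20;
		d.minor = dev & 0xFFFFF;
	}
	return d;
}

/*
 * Encode in the new 12:20 layout. New major 0 is reserved, so it is
 * rejected here; otherwise small values would look like old 8:8 ones.
 */
static uint32_t dev_encode(uint32_t major, uint32_t minor)
{
	assert(major >= 1 && major < (UINT32_C(1) << 12));
	assert(minor < (UINT32_C(1) << 20));
	return (major << 20) | minor;
}
```

For instance, the old value 0x0801 and the new value dev_encode(8, 1)
both decode to major 8, minor 1, which is the compatibility property the
scheme is after.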
* Re: On re-working the major/minor system
2001-12-07 22:04 ` H. Peter Anvin
@ 2001-12-07 23:07 ` Erik Andersen
2001-12-07 23:12 ` H. Peter Anvin
0 siblings, 1 reply; 53+ messages in thread
From: Erik Andersen @ 2001-12-07 23:07 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: linux-kernel

On Fri Dec 07, 2001 at 02:04:42PM -0800, H. Peter Anvin wrote:
>
> It's clear a painful change is needed. **We don't have a choice.**
> However, the fewer places we have to make source code changes the better.

Sure. I'm not arguing against the change. Just making sure
everyone 100% understands that we have just thrown away any prayer of
binary compatibility with anything less than 2.5.x....

But let's look on the bright side though. Since we are going to
be having a flag day _anyways_ we may as well make the most of
it. I can think of 20 things off the top of my head that are
being retained in the name of binary compatibility that can easily
move to the trash bucket. :)

For example, I would _love_ for Linux to standardize syscall
numbers across all architectures, guarantee that userspace gets
the exact same stack setup for all arches, we might as well fix up
proc, etc, etc, etc.

> What we agreed upon when this was discussed last year was the following:
>
> dev_t is extended to a 12:20 (32-bit size.) I personally would rather
> have seen a 64-bit size (32:32) but was outvoted :(
>
> New major 0 is reserved, except that dev_t == 0 remains the code for "no
> device". The unnamed device major becomes major 256.
>
> If (dev_t & ~0xFFFF) == 0, the dev_t is interpreted as an old-format
> dev_t, and is interpreted according to the following algorithm:
>
> 	if ( dev && (dev & ~0xFFFF) == 0 ) {
> 		major = (dev >> 8) ? (dev >> 8) : 256;
> 		minor = dev & 0xFF;
> 	} else {
> 		major = dev >> 20;
> 		minor = dev & 0xFFFFF;
> 	}

That works, and should prevent most major problems. Hmm. At
least for cpio there are 6 chars worth of device info in there,
so we could easily go to 48 bits without RPM problems. Or redhat
could fix rpm to use tarballs like debs do, and then we could go
to 64 bit devices no problem.

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: On re-working the major/minor system
2001-12-07 23:07 ` Erik Andersen
@ 2001-12-07 23:12 ` H. Peter Anvin
2001-12-08 11:42 ` Alan Cox
0 siblings, 1 reply; 53+ messages in thread
From: H. Peter Anvin @ 2001-12-07 23:12 UTC (permalink / raw)
To: andersen; +Cc: linux-kernel

Erik Andersen wrote:

> On Fri Dec 07, 2001 at 02:04:42PM -0800, H. Peter Anvin wrote:
>
>>It's clear a painful change is needed. **We don't have a choice.**
>>However, the fewer places we have to make source code changes the better.
>>
>
> Sure. I'm not arguing against the change. Just making sure
> everyone 100% understands that we have just thrown away any prayer of
> binary compatibility with anything less than 2.5.x....
>
> But let's look on the bright side though. Since we are going to
> be having a flag day _anyways_ we may as well make the most of
> it. I can think of 20 things off the top of my head that are
> being retained in the name of binary compatibility that can easily
> move to the trash bucket. :)
>
> For example, I would _love_ for Linux to standardize syscall
> numbers across all architectures, guarantee that userspace gets
> the exact same stack setup for all arches, we might as well fix up
> proc, etc, etc, etc.
>

Not going to happen. Linux deliberately chose against that, because in
Linux, syscall numbers are generally (except x86) compatible with the
dominant vendor Unix on the platform.

>
> That works, and should prevent most major problems. Hmm. At
> least for cpio there are 6 chars worth of device info in there,
> so we could easily go to 48 bits without RPM problems. Or redhat
> could fix rpm to use tarballs like debs do, and then we could go
> to 64 bit devices no problem.
>

The big stumbling block seems to be NFSv2.

-hpa

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: On re-working the major/minor system
2001-12-07 23:12 ` H. Peter Anvin
@ 2001-12-08 11:42 ` Alan Cox
2001-12-08 20:37 ` H. Peter Anvin
0 siblings, 1 reply; 53+ messages in thread
From: Alan Cox @ 2001-12-08 11:42 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: andersen, linux-kernel

> > That works, and should prevent most major problems. Hmm. At
> > least for cpio there are 6 chars worth of device info in there,
> > so we could easily go to 48 bits without RPM problems. Or redhat
> > could fix rpm to use tarballs like debs do, and then we could go

RPM can't easily use tarballs. Too much of a tarball isn't rigidly defined
so you can cryptographically sign it.

> > to 64 bit devices no problem.
>
> The big stumbling block seems to be NFSv2.

Well, 2.5 isn't going to be able to support NFS without a magic daemon
maintained translation table - so that when the kernel randomly changes the
major/minor number of an exported file system (eg a USB reconnect or even
plain boring shutdown/reboot) it can keep consistent file handles.

If you have a file handle table, surely you can remap every NFS file handle
through that down to 32 bits. For device files the problem doesn't matter,
because at the kernel meeting Linus said those were going to change in a
way that meant devices over NFS are a lost cause and clients would have to
use devfs.

Alan

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: On re-working the major/minor system
2001-12-08 11:42 ` Alan Cox
@ 2001-12-08 20:37 ` H. Peter Anvin
0 siblings, 0 replies; 53+ messages in thread
From: H. Peter Anvin @ 2001-12-08 20:37 UTC (permalink / raw)
To: Alan Cox; +Cc: andersen, linux-kernel

Alan Cox wrote:

>>>That works, and should prevent most major problems. Hmm. At
>>>least for cpio there are 6 chars worth of device info in there,
>>>so we could easily go to 48 bits without RPM problems. Or redhat
>>>could fix rpm to use tarballs like debs do, and then we could go
>>>
>
> RPM can't easily use tarballs. Too much of a tarball isn't rigidly defined so
> you can cryptographically sign it.
>

Why does that matter? You're signing a *specific instance* of tar, not
the generic format.

>
>>>to 64 bit devices no problem.
>>>
>>The big stumbling block seems to be NFSv2.
>
> Well 2.5 isn't going to be able to support NFS without a magic daemon
> maintained translation table - so that when the kernel randomly changes the
> major/minor number of an exported file system (eg a USB reconnect or even plain
> boring shutdown/reboot) it can keep consistent file handles.
>
> If you have a file handle table surely you can remap every NFS file handle
> through that down to 32bits. For device files the problem doesn't matter
> because at the kernel meeting Linus said those were going to change in a way
> that meant devices over NFS are a lost cause and clients would have to use
> devfs
>

Yeah, I know what Linus said at the kernel summit. As far as I could
tell he rejected anything that seemed like a sensible approach from here
to there, but that's just my $0.02...

-hpa

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: On re-working the major/minor system 2001-12-07 21:55 ` Erik Andersen 2001-12-07 22:04 ` H. Peter Anvin @ 2001-12-09 12:06 ` Kai Henningsen 2001-12-09 21:57 ` H. Peter Anvin 1 sibling, 1 reply; 53+ messages in thread From: Kai Henningsen @ 2001-12-09 12:06 UTC (permalink / raw) To: linux-kernel andersen@codepoet.org (Erik Andersen) wrote on 07.12.01 in <20011207145535.A18152@codepoet.org>: > The C library, and the POSIX standard, etc, etc. I think you'll find that there is *NOTHING* in either the C standard, POSIX, or the Austin future-{POSIX,UNIX} standard that knows about major or minor numbers. MfG Kai ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: On re-working the major/minor system 2001-12-09 12:06 ` Kai Henningsen @ 2001-12-09 21:57 ` H. Peter Anvin 2001-12-11 20:45 ` Kai Henningsen 0 siblings, 1 reply; 53+ messages in thread From: H. Peter Anvin @ 2001-12-09 21:57 UTC (permalink / raw) To: linux-kernel Followup to: <8EWhHLVmw-B@khms.westfalen.de> By author: kaih@khms.westfalen.de (Kai Henningsen) In newsgroup: linux.dev.kernel > > > The C library, and the POSIX standard, etc, etc. > > I think you'll find that there is *NOTHING* in either the C standard, > POSIX, or the Austin future-{POSIX,UNIX} standard that knows about major > or minor numbers. > It's not "future" anymore... Austin is now IEEE 1003.1-2001 and thus the new POSIX standard. Anyway, look for things like tar, cpio, ISO 9660 and that class of standards. -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com> ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: On re-working the major/minor system
2001-12-09 21:57 ` H. Peter Anvin
@ 2001-12-11 20:45 ` Kai Henningsen
0 siblings, 0 replies; 53+ messages in thread
From: Kai Henningsen @ 2001-12-11 20:45 UTC (permalink / raw)
To: linux-kernel
hpa@zytor.com (H. Peter Anvin) wrote on 09.12.01 in <9v0mo1$ms$1@cesium.transmeta.com>:
> By author: kaih@khms.westfalen.de (Kai Henningsen)
> > > The C library, and the POSIX standard, etc, etc.
> >
> > I think you'll find that there is *NOTHING* in either the C standard,
> > POSIX, or the Austin future-{POSIX,UNIX} standard that knows about major
> > or minor numbers.
> >
>
> It's not "future" anymore... Austin is now IEEE 1003.1-2001 and thus
> the new POSIX standard.
As of this Friday, yes.
> Anyway, look for things like tar, cpio, ISO 9660 and that class of
> standards.
Well, at least in Austin there is neither tar, cpio, nor 9660.
You are, however, right insofar as there's pax, which for ustar format
has devmajor and devminor fields of 8 octets each, which contain
unspecified information. (cpio format just has the rdev field.)
MfG Kai
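[Editorial sketch, not part of the thread] To put a number on the
"8 octets each" remark above, here is a minimal sketch in C. It
assumes the common ustar convention of storing numeric header fields
as octal ASCII digits terminated by NUL or space; the message itself
only says the contents are unspecified, so treat that encoding as an
assumption. `parse_octal_field` is a hypothetical helper name, not a
function from any tar or pax implementation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Parse a fixed-width ustar-style numeric field: octal ASCII digits,
 * optionally padded with leading spaces and terminated by NUL or space.
 * Returns the value, or -1 if a non-octal character is found.
 * (Hypothetical helper; the octal encoding is an assumption.)
 */
long parse_octal_field(const char *field, size_t width)
{
    long value = 0;
    size_t i = 0;

    assert(field != NULL);

    while (i < width && field[i] == ' ')   /* skip leading padding */
        i++;
    for (; i < width; i++) {
        char c = field[i];
        if (c == '\0' || c == ' ')         /* terminator ends the number */
            break;
        if (c < '0' || c > '7')            /* not an octal digit */
            return -1;
        value = value * 8 + (c - '0');
    }
    return value;
}
```

Under that assumption an 8-octet field with one terminator leaves
seven octal digits, i.e. 21 bits (maximum 07777777 = 2097151) for each
of devmajor and devminor.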
* Re: Linux/Pro -- clusters
2001-12-06 16:58 ` Linus Torvalds
2001-12-06 18:02 ` Alan Cox
@ 2001-12-06 18:38 ` Doug Ledford
1 sibling, 0 replies; 53+ messages in thread
From: Doug Ledford @ 2001-12-06 18:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
Linus Torvalds wrote:
>In article <E16BlnL-00080m-00@the-village.bc.nu>,
>Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
>>You still need the scsi code. There are a whole sequence of common, quite
>>complex and generic functions that the scsi layer handles (in particular
>>error handling).
>>
>
>Well, the preliminary patches already handle _some_ common things, like
>building the proper command request for reads and writes etc, and that
>will probably continue. We'll probably have to have all the old helpers
>for things like "this target only wants to be probed on lun 0" etc
>
Personally, I think it would be a mistake to have any of the low level
drivers know a thing about the ll_rw_blk.c request structs or bio
structures or similar stuff. They simply don't *need* to know those
things. Low level drivers need a command that can go to the drive and
they need a sg array. Smart hosts, aka raid controllers and the like,
could theoretically use the higher level structs for reasonable things,
but I won't make any assumptions about which ones could or couldn't use
that information because I haven't looked into it.
So, making the scsi mid layer a helper library may be OK, but for the
largest number of drivers, it really means that the call chain would
likely look something like this:
blk_dev->request_fn()
	scsi_dispatch_request() // map from the major/minor to driver
		driver_queue_routine(bio *)
			Scsi_Cmnd = scsi_make_cmnd(bio *);
			send_command(cmnd); // put it in the hardware
driver_interrupt_routine()
	if (cmnd = get_errored_command()) {
		// Pick out the commands that had a transfer related error
		// and rework or error them out here
		if (retryable_error(cmnd))
			requeue_command(cmnd);
		else {
			cmnd->bio->completion_handler(cmnd->bio);
			scsi_free_command(cmnd);
		}
	}
	if (cmnd = get_completed_command()) {
		retryable = scsi_check_sense(cmnd);
		switch (retryable) {
		case TRANSIENT_ERROR:
			requeue_command(cmnd);
			break;
		case FATAL_ERROR:
		case NO_ERROR:
			cmnd->bio->completion_handler(cmnd->bio);
			scsi_free_command(cmnd);
			break;
		case BUSY:
			requeue_command_with_delay(cmnd);
			break;
		case ...
		}
	}
Personally, I don't like it. The queueing stuff is somewhat OK (but
requires other things to be changed to accommodate and I don't think
those things should change), but I don't want to have to deal with
result parsing in every driver. Proper result parsing is huge and easy
to get wrong. The current code gets it wrong. There are a few ways in
which it could be fixed to do the right thing without disrupting the
world. Duplicating that in all the low level drivers will be a
nightmare though. I also don't like one aspect of putting the queueing
totally in the low level drivers.
That way, all of the low level drivers are going to have to maintain
A) delayable queues with variable delay times and undelay on completion
semantics (my driver already has this, but it's been a source of pain
to maintain, it would be nice if it didn't need it) and
B) ordering requirements when multiple commands for a device operating
in untagged mode are in the queue (this one isn't so hard, but getting
it wrong means things like tape drives will store garbage when you
least expect it).
Anyway, I suspect the end result would be the same either way: drivers
would end up having control over their own flow of commands. The only
difference would be how that control would be achieved.
A) by changing the mid layer to accept more driver specific parameters
(aka, even though MegaRAID may take up to 60 seconds to complete a
normal command, the aic7xxx should never take more than 10 seconds, so
the default command timeout length would be driver specific) and to
honor the low level driver's idea of what to do on a timeout (let the
low level driver tell the mid layer what action to take, and that
should be the limit of the mid layer's error handling "brains",
performing those actions). Or
B) removing the mid layer from the picture altogether (as much as
possible anyway, it (or a similar variant) will still likely be needed
on queueing just to map from major/minor to driver unless we change the
device allocation scheme or something) and then having the low level
driver call into helper routines.
>
>I disagree about the error handling, though.
>
>Traditionally, the timeouts and the reset handling was handled in the
>SCSI mid-layer, and it was a complete and utter disaster. Different
>hosts simply wanted so different behaviour that it's not even funny.
>
That's not really accurate. It was too opaque, sure. But for what it
was it did a decent job.
The mid layer driver was trying to make generic timeout decisions
without the benefit of the low level driver's knowledge of the current
bus state. For example, 20 commands may time out at once, but in
reality, only one of those commands is probably holding up the SCSI
bus. The low level drivers can (usually) look at their card to see
which command is *really* the hold up. Then, they could have all the
other commands simply put back to sleep without doing anything and take
appropriate action against the holdup command. The current mid layer
(at least in the old error handling) couldn't do that. The new_eh code
attempts to allow drivers to tell them these things by use of the
strategy function. In reality, that's all you need. That's the
driver's ability to tell the mid layer *how* to proceed on any given
command.
>
>Timeouts for different commands were so different that people ended up
>making most timeouts so long that they no longer made sense for other
>commands etc.
>
Not accurate here either. For most commands across all controllers,
timeouts are pretty uniform (the timeout to read a CD-ROM for instance
is pretty constant). The timeouts *only* started to vary when you ran
across smart RAID controllers that were doing too much work rebuilding
things and didn't respond to your requests in a timely fashion. And
that only applies to their logical drives. It would be pretty easy to
just allow disk timeouts to be adjusted on a disk by disk basis to a
reasonable default for that controller and solve this problem.
>
>Other device drivers have been able to handle timeouts and errors on
>their own before, and have _not_ had the kinds of horrendous problems
>that the SCSI layer has had.
>
If you use the new eh strategy function (or so I understand it, if I'm
wrong here then I'll take the time to *make* myself right by changing
the code), then you are essentially allowing your driver to take
control of *what* happens, and the mid layer timeout code becomes
nothing more than a glorified timer creation, monitoring, and teardown
framework that happens to be plugged into the queueing and completion
framework so it can notice new commands and when old commands are done.
Nothing more. And that's all a SCSI driver that wants to do its own
thing really needs. Done properly, this could be a "good" thing. Done
poorly, everyone will hate it. But, the ability is there to make your
driver do what you want with the strategy call in.
>
>We'll see what the details will end up being, but I personally think
>that it is a major mistake to try to have generic error handling. The
>only true generic thing is "this request finished successfully / with an
>error", and _no_ high-level retries etc. It's up to the driver to decide
>if retries make sense.
>
Well, I disagree on this. I think asking all those drivers to deal
with sense data and keep up with changing standards and new sense
returns on new devices, etc. is just asking for a lot of out of date
drivers that do the wrong thing. Sense parsing and decision making is
*device* specific, not controller specific, and I don't think it has
any place being in the controller's responsibility list. You might as
well start asking eth drivers to not only check checksums on packets
but also protocol flags before passing the packet up to the IP or TCP
layers and to perform all the TCP retrans operations in themselves
instead of in the TCP layer.
>
>(Often retrying _doesn't_ make sense, because the firmware on the
>high-end card or disk itself may already have done retries on its own,
>and high-level error handling is nothing but a waste of time and causes
>the error notification to be even more delayed).
>
> Linus
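[Editorial sketch, not part of the thread] The division of labour
argued for above (the mid layer only runs the timers, the low level
driver decides what a timeout means) can be sketched in C as below.
Every name here (`eh_action`, `host_template`, `eh_strategy`,
`midlayer_cmd_timed_out`, `example_strategy`) is invented for
illustration and is not taken from the kernel's SCSI code; the sketch
only shows the shape of a per-driver strategy callback with a
driver-specific default timeout.

```c
#include <assert.h>
#include <stddef.h>

/* Actions a low level driver can ask the mid layer to perform (hypothetical). */
enum eh_action {
    EH_SLEEP_AGAIN,   /* command is only queued behind another; re-arm its timer */
    EH_ABORT_CMD,     /* this command is the one holding up the bus */
    EH_RESET_BUS      /* driver cannot identify a culprit; escalate */
};

struct cmd {
    int id;
    int timeout_secs;     /* driver-specific default, e.g. 10 vs 60 */
    int timer_rearmed;    /* bookkeeping for this sketch only */
    int aborted;
};

/* Per-driver hook: the driver knows the bus state, the mid layer does not. */
struct host_template {
    int default_timeout_secs;
    enum eh_action (*eh_strategy)(const struct cmd *c);
};

/* Mid layer: timer plumbing only; it performs whatever the driver chose. */
enum eh_action midlayer_cmd_timed_out(const struct host_template *ht,
                                      struct cmd *c)
{
    assert(ht != NULL && ht->eh_strategy != NULL && c != NULL);

    enum eh_action action = ht->eh_strategy(c);
    switch (action) {
    case EH_SLEEP_AGAIN:
        c->timer_rearmed++;          /* put the command back to sleep */
        break;
    case EH_ABORT_CMD:
        c->aborted = 1;              /* act against the real holdup */
        break;
    case EH_RESET_BUS:
        /* a real mid layer would reset the bus and requeue commands */
        break;
    }
    return action;
}

/* Example driver policy: only command 7 is actually stuck on the bus. */
enum eh_action example_strategy(const struct cmd *c)
{
    return c->id == 7 ? EH_ABORT_CMD : EH_SLEEP_AGAIN;
}
```

With 20 commands timing out at once, a driver policy like
`example_strategy` lets 19 of them go back to sleep while only the
command that really blocks the bus is aborted.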
* Re: Linux/Pro -- clusters
2001-12-04 2:09 ` Donald Becker
2001-12-04 2:23 ` Davide Libenzi
2001-12-04 9:10 ` Alan Cox
@ 2001-12-04 14:37 ` Daniel Phillips
2001-12-04 15:19 ` Jeff Garzik
2 siblings, 1 reply; 53+ messages in thread
From: Daniel Phillips @ 2001-12-04 14:37 UTC (permalink / raw)
To: Donald Becker, Davide Libenzi; +Cc: Linux Kernel Mailing List
On December 4, 2001 03:09 am, Donald Becker wrote:
> To bring this branch back on point: we should distinguish between
> design for an arbitrary and unpredictable goal (e.g. 128 way SMP)
> vs. putting some design into things that we are supposed to already
> understan
> [...]
> a VFS layer that doesn't require the kernel to know a priori all of
> the filesystem types that might be loaded
Right, there's a consensus that the fs includes have to be fixed and
that it should be in 2.5.lownum. The precise plan isn't fully evolved
yet ;)
See fsdevel for the thread, 3-4 months ago. IIRC, the favored idea
(Linus's) was to make the generic struct inode part of the fs-specific
inode instead of the other way around, which resolves the question of
how the compiler calculates the size/layout of an inode.
This is going to be a pervasive change that someone has to do all in
one day, so it remains to be seen when/if that is actually going to
happen.
It's also going to break every out-of-tree filesystem.
--
Daniel
* Re: Linux/Pro -- clusters
2001-12-04 14:37 ` Daniel Phillips
@ 2001-12-04 15:19 ` Jeff Garzik
2001-12-04 17:16 ` Daniel Phillips
0 siblings, 1 reply; 53+ messages in thread
From: Jeff Garzik @ 2001-12-04 15:19 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Donald Becker, Davide Libenzi, Linux Kernel Mailing List
Daniel Phillips wrote:
>
> On December 4, 2001 03:09 am, Donald Becker wrote:
> > To bring this branch back on point: we should distinguish between
> > design for an arbitrary and unpredictable goal (e.g. 128 way SMP)
> > vs. putting some design into things that we are supposed to already
> > understan
> > [...]
> > a VFS layer that doesn't require the kernel to know a priori all of
> > the filesystem types that might be loaded
>
> Right, there's a consensus that the fs includes have to fixed and that it
> should be in 2.5.lownum. The precise plan isn't fully evolved yet ;)
>
> See fsdevel for the thread, 3-4 months ago. IIRC, the favored idea (Linus's)
> was to make the generic struct inode part of the fs-specific inode instead of
> the other way around, which resolves the question of how the compiler
> calculates the size/layout of an inode.
>
> This is going to be a pervasive change that someone has to do all in one
> day, so it remains to be seen when/if that is actually going to happen.
>
> It's also going to break every out-of-tree filesystem.
ug. what's wrong with a single additional alloc for generic_ip? [if a
filesystem needs to do multiple allocs post-conversion, somebody's doing
something wrong]
Using generic_ip in its current form has the advantage of being able to
create a nicely-aligned kmem cache for your private inode data.
Jeff
--
Jeff Garzik      | Only so many songs can be sung
Building 1024    | with two lips, two lungs, and one tongue.
MandrakeSoft     | - nomeansno
* Re: Linux/Pro -- clusters
2001-12-04 15:19 ` Jeff Garzik
@ 2001-12-04 17:16 ` Daniel Phillips
2001-12-04 17:20 ` Jeff Garzik
2001-12-04 18:04 ` Alan Cox
0 siblings, 2 replies; 53+ messages in thread
From: Daniel Phillips @ 2001-12-04 17:16 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Donald Becker, Davide Libenzi, Linux Kernel Mailing List
On December 4, 2001 04:19 pm, Jeff Garzik wrote:
> Daniel Phillips wrote:
> >
> > On December 4, 2001 03:09 am, Donald Becker wrote:
> > > To bring this branch back on point: we should distinguish between
> > > design for an arbitrary and unpredictable goal (e.g. 128 way SMP)
> > > vs. putting some design into things that we are supposed to already
> > > understan
> > > [...]
> > > a VFS layer that doesn't require the kernel to know a priori all of
> > > the filesystem types that might be loaded
> >
> > Right, there's a consensus that the fs includes have to fixed and that it
> > should be in 2.5.lownum. The precise plan isn't fully evolved yet ;)
> >
> > See fsdevel for the thread, 3-4 months ago. IIRC, the favored idea (Linus's)
> > was to make the generic struct inode part of the fs-specific inode instead of
> > the other way around, which resolves the question of how the compiler
> > calculates the size/layout of an inode.
> >
> > This is going to be a pervasive change that someone has to do all in one
> > day, so it remains to be seen when/if that is actually going to happen.
> >
> > It's also going to break every out-of-tree filesystem.
>
> ug. what's wrong with a single additional alloc for generic_ip? [if a
> filesystem needs to do multiple allocs post-conversion, somebody's doing
> something wrong]
Single additional alloc -> twice as many allocs, two slabs, more
cachelines dirty. This was hashed out on fsdevel, though apparently not
to everyone's satisfaction.
> Using generic_ip in its current form has the advantage of being able to
> create a nicely-aligned kmem cache for your private inode data.
I don't see why that's hard with the combined struct.
--
Daniel
* Re: Linux/Pro -- clusters
2001-12-04 17:16 ` Daniel Phillips
@ 2001-12-04 17:20 ` Jeff Garzik
2001-12-04 18:04 ` Alan Cox
1 sibling, 0 replies; 53+ messages in thread
From: Jeff Garzik @ 2001-12-04 17:20 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Linux Kernel Mailing List, linux-fsdevel
Daniel Phillips wrote:
> On December 4, 2001 04:19 pm, Jeff Garzik wrote:
> > ug. what's wrong with a single additional alloc for generic_ip? [if a
> > filesystem needs to do multiple allocs post-conversion, somebody's doing
> > something wrong]
>
> Single additional alloc -> twice as many allocs, two slabs, more cachelines
> dirty. This was hashed out on fsdevel, though apparently not to everyone's
> satisfaction.
>
> > Using generic_ip in its current form has the advantage of being able to
> > create a nicely-aligned kmem cache for your private inode data.
>
> I don't see why that's hard with the combined struct.
The advantage of having two structs means that both struct inode and
the private info can be aligned nicely. Yes it potentially wastes a
tiny bit more memory, but I challenge you to find an architecture where
doing this isn't a win.
In a couple cases I looked at, additional slabs are not even necessary,
as kmalloc's standard ones do the job quite well. 'cat /proc/slabinfo'
for a list of the sizes.
Note this only applies to inodes. There aren't enough superblocks in a
running system to worry about doing anything but simple kmalloc on the
superblock private info (before assigning to generic_sbp).
Jeff
--
Jeff Garzik      | Only so many songs can be sung
Building 1024    | with two lips, two lungs, and one tongue.
MandrakeSoft     | - nomeansno
* Re: Linux/Pro -- clusters
2001-12-04 17:16 ` Daniel Phillips
2001-12-04 17:20 ` Jeff Garzik
@ 2001-12-04 18:04 ` Alan Cox
2001-12-04 18:16 ` Daniel Phillips
1 sibling, 1 reply; 53+ messages in thread
From: Alan Cox @ 2001-12-04 18:04 UTC (permalink / raw)
To: Daniel Phillips
Cc: Jeff Garzik, Donald Becker, Davide Libenzi, Linux Kernel Mailing List
> Single additional alloc -> twice as many allocs, two slabs, more cachelines
> dirty. This was hashed out on fsdevel, though apparently not to everyone's
> satisfaction.
Al Viro's NFS in generic_ip saved me something like 130K of memory.
> > Using generic_ip in its current form has the advantage of being able to
> > create a nicely-aligned kmem cache for your private inode data.
>
> I don't see why that's hard with the combined struct.
Providing you end up with fs->alloc_inode() and the fs allocates a
suitable sized inode + private I see no problem.
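[Editorial sketch, not part of the thread] The "combined struct" idea
debated above (the generic inode embedded in the filesystem-specific
inode, with the filesystem allocating the whole thing in one go) can
be sketched in user-space C as follows. The names `myfs_inode`,
`MYFS_I`, `myfs_alloc_inode` and `myfs_destroy_inode` are hypothetical,
`struct inode` here is a one-field stand-in rather than the kernel
structure, and `calloc`/`free` stand in for a slab cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the generic VFS inode; the field is illustrative only. */
struct inode {
    unsigned long i_ino;
};

/* Filesystem-specific inode with the generic part embedded in it. */
struct myfs_inode {
    unsigned int private_flags;   /* fs-private data lives alongside... */
    struct inode vfs_inode;       /* ...the generic inode, in one allocation */
};

/* Recover the containing fs inode from a pointer to the embedded one. */
struct myfs_inode *MYFS_I(struct inode *inode)
{
    assert(inode != NULL);
    return (struct myfs_inode *)
        ((char *)inode - offsetof(struct myfs_inode, vfs_inode));
}

/* What a per-filesystem alloc_inode() hook might do: a single
 * allocation sized for generic inode plus private data. */
struct inode *myfs_alloc_inode(unsigned long ino)
{
    struct myfs_inode *mi = calloc(1, sizeof(*mi));

    if (mi == NULL)
        return NULL;
    mi->vfs_inode.i_ino = ino;
    return &mi->vfs_inode;
}

/* Free the whole combined object, starting from the generic pointer. */
void myfs_destroy_inode(struct inode *inode)
{
    free(MYFS_I(inode));
}
```

Because the compiler sees the full `struct myfs_inode`, each
filesystem pays only for its own private data, which is the memory
saving discussed in the thread, and there is one allocation per inode
instead of two.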
* Re: Linux/Pro -- clusters
2001-12-04 18:04 ` Alan Cox
@ 2001-12-04 18:16 ` Daniel Phillips
2001-12-04 20:20 ` Andrew Morton
2001-12-05 13:11 ` Deep look into VFS Martin Dalecki
0 siblings, 2 replies; 53+ messages in thread
From: Daniel Phillips @ 2001-12-04 18:16 UTC (permalink / raw)
To: Alan Cox
Cc: Jeff Garzik, Donald Becker, Davide Libenzi, Linux Kernel Mailing List
On December 4, 2001 07:04 pm, Alan Cox wrote:
> > Single additional alloc -> twice as many allocs, two slabs, more cachelines
> > dirty. This was hashed out on fsdevel, though apparently not to everyone's
> > satisfaction.
>
> Al Viro's NFS in generic_ip saved me something like 130K of memory.
Yes, all of these proposals would do that, by getting away from all
inodes being the same size (basically the size of the ext2 inode).
--
Daniel
* Re: Linux/Pro -- clusters
2001-12-04 18:16 ` Daniel Phillips
@ 2001-12-04 20:20 ` Andrew Morton
2001-12-05 13:11 ` Deep look into VFS Martin Dalecki
1 sibling, 0 replies; 53+ messages in thread
From: Andrew Morton @ 2001-12-04 20:20 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Linux Kernel Mailing List
Daniel Phillips wrote:
>
> On December 4, 2001 07:04 pm, Alan Cox wrote:
> > > Single additional alloc -> twice as many allocs, two slabs, more cachelines
> > > dirty. This was hashed out on fsdevel, though apparently not to everyone's
> > > satisfaction.
> >
> > Al Viro's NFS in generic_ip saved me something like 130K of memory.
>
> Yes, all of these proposals would do that, by getting away from all inodes
> being the same size (basically the size of the ext2 inode).
>
ext3 is the pig at present. I think Andreas has half-a-patch
to move it to generic_ip.
* Deep look into VFS
2001-12-04 18:16 ` Daniel Phillips
2001-12-04 20:20 ` Andrew Morton
@ 2001-12-05 13:11 ` Martin Dalecki
2001-12-05 15:19 ` Alexander Viro
1 sibling, 1 reply; 53+ messages in thread
From: Martin Dalecki @ 2001-12-05 13:11 UTC (permalink / raw)
Cc: Linux Kernel Mailing List
Yesterday I had a look into the VFS. Out of this the following
question arises for me. Inside fs/inode.c we have a generic
clear_inode(). All fine, all well. At one place the usage of this
function goes as follows (the function in question is iput() from the
same file):
	if (op && op->delete_inode) {
		void (*delete)(struct inode *) = op->delete_inode;
		if (!is_bad_inode(inode))
			DQUOT_INIT(inode);
		/* s_op->delete_inode internally recalls clear_inode() */
		delete(inode);
	} else
		clear_inode(inode);
Well, my thought was that it would be nice to avoid the explicit call
back into the inode code from driver code in the middle of nowhere,
which would allow us to change the above code sequence into the much
cleaner:
	if (op && op->delete_inode) {
		void (*delete)(struct inode *) = op->delete_inode;
		if (!is_bad_inode(inode))
			DQUOT_INIT(inode);
		delete(inode);
	}
	clear_inode(inode);
Therefore I have looked at all the places where clear_inode is
actually called inside the FS implementation code. shmem told me that
the above change would be entirely fine with it. We have however the
following in ext2/ialloc.c:
/*
 * NOTE! When we get the inode, we're the only people
 * that have access to it, and as such there are no
 * race conditions we have to worry about. The inode
 * is not on the hash-lists, and it cannot be reached
 * through the filesystem because the directory entry
 * has been deleted earlier.
 *
 * HOWEVER: we must make sure that we get no aliases,
 * which means that we have to call "clear_inode()"
 * _before_ we mark the inode not in use in the inode
 * bitmaps. Otherwise a newly created file might use
 * the same inode number (not actually the same pointer
 * though), and then we'd have two inodes sharing the
 * same inode number and space on the harddisk.
 */
void ext2_free_inode (struct inode * inode)
{
	...
	lock_super (sb);
	...
	/* Do this BEFORE marking the inode not in use or returning an error */
	clear_inode (inode);
	...
	unlock_super (sb);
}
Unless I'm completely misguided, the lock on the superblock should
entirely prevent the race described inside the header comment and we
should be able to delete clear_inode from this function.
Question is: can someone with more knowledge of the intimate inner
workings of the VFS tell me whether my suspicion is right or not?
Thanks in advance...
PS. Deleting clear_inode() would help to simplify the delete_inode
parameters quite a significant bit, as well as deleting the tail union
in struct inode - that's the goal.
* Re: Deep look into VFS
2001-12-05 13:11 ` Deep look into VFS Martin Dalecki
@ 2001-12-05 15:19 ` Alexander Viro
2001-12-05 15:30 ` Martin Dalecki
0 siblings, 1 reply; 53+ messages in thread
From: Alexander Viro @ 2001-12-05 15:19 UTC (permalink / raw)
To: Martin Dalecki; +Cc: Linux Kernel Mailing List
On Wed, 5 Dec 2001, Martin Dalecki wrote:
> Unless I'm completely misguided the lock on the superblock
> should entirely prevent the race described inside the header comment
> and we should be able to delete clear_inode from this function.
Huh? We drop that lock before the return from this function. So if you
move clear_inode() after the return, you lose that protection.
What's more, you can't move that lock_super()/unlock_super() into iput()
itself - you need it _not_ taken in the beginning of ext2_delete_inode()
and you don't want it for quite a few filesystems.
Nothing VFS-specific here, just a bog-standard "you lose protection of
semaphore once you call up()"...
> PS. Deleting clear_inode() would help to simplify the
> delete_inode parameters quite a significant bit, as
> well as deleting the tail union in struct inode - that's the goal.
Again, huh?
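[Editorial sketch, not part of the thread] The ordering constraint
explained above can be shown with a toy model in C. It is
single-threaded, so it demonstrates only the required order of
operations inside the locked region, not an actual race. All names
(`toy_sb`, `toy_inode`, `toy_free_inode`, and so on) are invented for
illustration; the lock is a plain flag standing in for
lock_super()/unlock_super().

```c
#include <assert.h>

#define NR_INODES 32

/* Toy superblock: a lock flag and an in-use bitmap (illustrative only). */
struct toy_sb {
    int locked;
    unsigned in_use;     /* bit n set => inode number n is allocated */
};

struct toy_inode {
    struct toy_sb *sb;
    unsigned ino;
    int cleared;
};

static void toy_lock(struct toy_sb *sb)   { assert(!sb->locked); sb->locked = 1; }
static void toy_unlock(struct toy_sb *sb) { assert(sb->locked);  sb->locked = 0; }

/* Stand-in for clear_inode(): afterwards the in-core inode is no longer live. */
static void toy_clear_inode(struct toy_inode *inode)
{
    inode->cleared = 1;
}

/*
 * Mirrors the ordering in the quoted ext2_free_inode(): clear the
 * in-core inode first, then mark the number free, all inside one
 * locked region. Moving toy_clear_inode() after toy_unlock() would
 * leave a window in which the number is reusable while the old
 * in-core inode still exists.
 */
void toy_free_inode(struct toy_inode *inode)
{
    struct toy_sb *sb = inode->sb;

    assert(inode->ino < NR_INODES);
    toy_lock(sb);
    toy_clear_inode(inode);               /* no alias can exist past here */
    assert(inode->cleared);               /* must hold before the bit is freed */
    sb->in_use &= ~(1u << inode->ino);    /* number becomes reusable */
    toy_unlock(sb);                       /* protection ends here */
}
```

The point of the sketch is the last comment: once the unlock has run,
nothing that happens afterwards in the caller is covered by the lock,
which is why the call cannot simply be hoisted into iput().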
* Re: Deep look into VFS
2001-12-05 15:19 ` Alexander Viro
@ 2001-12-05 15:30 ` Martin Dalecki
0 siblings, 0 replies; 53+ messages in thread
From: Martin Dalecki @ 2001-12-05 15:30 UTC (permalink / raw)
To: Alexander Viro; +Cc: Linux Kernel Mailing List
Alexander Viro wrote:
>
> On Wed, 5 Dec 2001, Martin Dalecki wrote:
>
> > Unless I'm completely misguided the lock on the superblock
> > should entirely prevent the race described inside the header comment
> > and we should be able to delete clear_inode from this function.
>
> Huh? We drop that lock before the return from this function. So if you
> move clear_inode() after the return, you lose that protection.
>
> What's more, you can't move that lock_super()/unlock_super() into iput()
> itself - you need it _not_ taken in the beginning of ext2_delete_inode()
> and you don't want it for quite a few filesystems.
>
> Nothing VFS-specific here, just a bog-standard "you lose protection of
> semaphore once you call up()"...
Ummmmm... that is, well, trivially true... of course (I'm slapping a
hand on my forehead).
Thank you for answering!