public inbox for linux-kernel@vger.kernel.org
* RE: 2.5.28 and partitions
@ 2002-07-25 12:43 Petr Vandrovec
  0 siblings, 0 replies; 51+ messages in thread
From: Petr Vandrovec @ 2002-07-25 12:43 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Matt_Domsch, Andries.Brouwer, linux-kernel

On 25 Jul 02 at 7:44, Alexander Viro wrote:
> On Wed, 24 Jul 2002, Linus Torvalds wrote:
> 
> > Note that there is one place where 64 bits is simply _too_ expensive, and
> > that's the page cache. In particular, the "index" in "struct page". We
> > want to make "struct page" _smaller_, not larger.
> > 
> > Right now that means that 16TB really is a hard limit for at least some
> > device access on a 32-bit machine with a 4kB page-size (yes, you could
> > make a filesystem that is bigger, but you very fundamentally cannot make
> > individual files larger than 16TB).
> 
> ITYM "8Tb" - indices are signed, IIRC.  OTOH, it's not 2^31 * PAGE_SIZE -
> it's 2^31 * PAGE_CACHE_SIZE, which can be bigger.
> 
> Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> device should seek professional help of the kind they don't give on l-k...

Don't worry. NetWare (NW6) also uses 32-bit indices into its page cache
and a 4KB page size, but unlike our implementation they
(1) do not verify that the file you create is smaller than 16TB, and
(2) have a signedness bug somewhere too. So if you create a file
larger than 8TB, the data you write is silently discarded, while
the file size is preserved.

I was really surprised when I updated ncpfs to access files > 4GB.
Written data were disappearing after server reboot :-(
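
The 16TB and 8TB figures above follow directly from the index width and
the page size. A minimal sketch of that arithmetic, assuming 4KB pages
(a shift of 12); the function names are made up for illustration and do
not come from any kernel or NetWare source:

```c
#include <assert.h>
#include <stdint.h>

/* Largest file size, in bytes, reachable through a page cache whose
 * index has `index_bits` usable bits and whose pages are
 * 2^page_shift bytes each. */
uint64_t max_cached_file_bytes(unsigned index_bits, unsigned page_shift)
{
    assert(index_bits + page_shift < 64);   /* result must fit in 64 bits */
    return (uint64_t)1 << (index_bits + page_shift);
}

/* Returns 1 if the page index for `byte_offset` has its top bit set,
 * i.e. it would read as negative if stored in a signed 32-bit field. */
int index_looks_negative(uint64_t byte_offset, unsigned page_shift)
{
    uint32_t idx = (uint32_t)(byte_offset >> page_shift);
    return (idx & UINT32_C(0x80000000)) != 0;
}
```

With a full 32-bit index the limit is 2^44 bytes (16TB); with a signed
index only 31 bits are usable, giving 2^43 bytes (8TB), and the first
offset past that point is exactly where the index starts to look
negative.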

Just my two cents.
                                            Petr Vandrovec
                                            vandrove@vc.cvut.cz

* Re: 2.5.28 and partitions
@ 2002-08-02 14:54 Jesse Pollard
  2002-08-02 18:33 ` Kai Henningsen
  0 siblings, 1 reply; 51+ messages in thread
From: Jesse Pollard @ 2002-08-02 14:54 UTC (permalink / raw)
  To: Kai Henningsen, linux-kernel

kaih@khms.westfalen.de (Kai Henningsen):
...
> As for finding where to boot from - either have the bootloader define a  
> partition name it wants to see, or put the relevant name into the boot  
> loader config. No need to define that in the partition format. That's  
> trivial: even MS-DOS did that (finding IO.SYS and MSDOS.SYS from the boot  
> loader)! And neither scanning for '=' and '\n' nor comparing one string  
> nor converting one number from decimal is any kind of hardship. Maybe half  
> a screen of assembler, tops.
> 

Nope.

The problem is different - which file system is the file stored in?
How many different filesystems are there?
Do you think all of them will fit in a boot loader?
Or even one of them?
How many different logical volume structures are there?

To do this you first have to convince the development people to say that
"only xxxx filesystem shall be bootable".

Very unlikely.

And now, you also have to add possible logical volumes on top (or under :)
of it.

Even more unlikely.

That is why LILO doesn't use file names for boots. It only uses block
numbers.
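
A rough sketch of what loading by block numbers amounts to: the loader
carries a precomputed list of sector runs and copies them in order,
with no knowledge of any filesystem. The disk here is modelled as a
flat in-memory image, and `struct block_run` and `load_by_block_list`
are hypothetical names, not LILO's actual map format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u

/* One run of consecutive sectors belonging to the kernel image. */
struct block_run {
    uint32_t start;   /* first sector of the run */
    uint32_t count;   /* number of sectors in the run */
};

/* Copy every run, in list order, from `disk` (an in-memory image of
 * `disk_sectors` sectors) into `dest`.  Returns the number of bytes
 * copied, or 0 if a run lies outside the disk or `dest` is too small. */
size_t load_by_block_list(const uint8_t *disk, uint32_t disk_sectors,
                          const struct block_run *runs, size_t nruns,
                          uint8_t *dest, size_t dest_size)
{
    size_t copied = 0;

    assert(disk != NULL && runs != NULL && dest != NULL);
    for (size_t i = 0; i < nruns; i++) {
        uint64_t end = (uint64_t)runs[i].start + runs[i].count;
        if (end > disk_sectors)
            return 0;                       /* run falls off the disk */
        size_t bytes = (size_t)runs[i].count * SECTOR_SIZE;
        if (bytes > dest_size - copied)
            return 0;                       /* destination too small */
        memcpy(dest + copied,
               disk + (size_t)runs[i].start * SECTOR_SIZE, bytes);
        copied += bytes;
    }
    return copied;
}
```

The price of this simplicity is that the list goes stale whenever the
file moves on disk, so it has to be regenerated after every kernel
update.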

Another alternative (possibly just as hard) is to have LILO only
load a more complex and dynamic loader, which could be configured for
each filesystem structure. Once that "dynamic loader" is loaded, it
could find and load the kernel (passing, of course, the boot command line
from LILO).

I know IRIX gets around the problem by having a tiny filesystem for the
"disk label". This filesystem contains only contiguous files, and has
references to the drive partition table, the complex boot program (sash -
stand alone shell), optional diagnostic boot, and logical volume membership -
one reference per logical volume type and partition .. I think it is
	<lvm type>.<partitionnumber>
The content of the file is the volume name followed by the order of the partition
in the lvm (section 1, 2, 3, ..).

And this is not a "mountable" file system. It is only accessed via special
utilities (like the "mtools" set for non-mounted M$DOS floppies)

At least, I remember IRIX this way - it should be close.
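
A directory of contiguous files is simple enough that a boot program
can search it in a few lines. The sketch below only illustrates the
idea; the field names, the 8-byte name length and the 15-entry table
are assumptions made for the example and are not claimed to match the
real IRIX on-disk layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VH_NAME_LEN  8     /* assumed name field width */
#define VH_MAX_FILES 15    /* assumed number of directory slots */

/* One contiguous file in the label area: name, start, length. */
struct vh_file {
    char     name[VH_NAME_LEN];   /* not necessarily NUL-terminated */
    uint32_t start_block;
    uint32_t nbytes;              /* 0 marks an unused slot */
};

struct vh_directory {
    struct vh_file files[VH_MAX_FILES];
};

/* Return the entry called `name`, or NULL if there is none. */
const struct vh_file *vh_lookup(const struct vh_directory *dir,
                                const char *name)
{
    assert(dir != NULL && name != NULL);
    if (strlen(name) > VH_NAME_LEN)
        return NULL;
    for (int i = 0; i < VH_MAX_FILES; i++) {
        const struct vh_file *f = &dir->files[i];
        if (f->nbytes != 0 && strncmp(f->name, name, VH_NAME_LEN) == 0)
            return f;
    }
    return NULL;
}
```

Because every file is contiguous, the returned start block and byte
count are all a loader needs to read it.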

SunOS had something a little different: the initial boot (at the BIOS level)
uses block numbers to locate a "boot" utility. The "boot" utility knew about
the filesystem type. I think it was a link of the boot object with a fs
utility library, where the library was selected by a "makeboot" command
and by the filesystem type that the kernel(s) was(were) stored on. The
"makeboot" utility modified/replaced the "boot" program, then set the
block numbers in the boot sector.

All of this has truly horrible effects on boot times though. At a minimum
I would expect it to take twice as long.

You pay for the additional flexibility though.

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil

Any opinions expressed are solely my own.

[parent not found: <15688.27022.143541.447952@wombat.chubb.wattle.id.au>]
* RE: 2.5.28 and partitions
@ 2002-07-31 23:38 Matt_Domsch
  0 siblings, 0 replies; 51+ messages in thread
From: Matt_Domsch @ 2002-07-31 23:38 UTC (permalink / raw)
  To: peter; +Cc: pavel, viro, Andries.Brouwer, linux-kernel

> Matt> What's wrong with EFI GUID scheme (GPT) (other than it wasn't
> Matt> invented by Linux folks)?
> 
> Nothing, except it's not used on all platforms yet.

(set boot issues aside for now)
It could.  I use it on x86 and IA-64 now.  I think Richard Hirst found the
last (knock on wood) of my endianness bugs about 6 months ago, so I know it
works on BE and LE non-Intel machines.  It's in the partitioning menu, not
specific to arch.  The only arch dependency in code is on asm-ia64/efi.h for
some typedefs, which is annoying but not hard to fix if desired (move
relevant bits to include/linux/efi.h).

> For my machines the *only* reason for having a legacy partitioning
> scheme is to allow booting.

As you point out, booting is BIOS-specific.  So for now boot a disk with a
native scheme (where your OS resides already) and mount that 64XB file
system for data afterwards.  By the time that doesn't work, 32-bit CPUs will
be dead anyhow.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer, Architect
Dell Linux Solutions www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
#1 US Linux Server provider for 2001 and Q1/2002! (IDC May 2002)


[parent not found: <F44891A593A6DE4B99FDCB7CC537BBBBB839AC@AUSXMPS308.aus.amer.dell.com>]
* RE: 2.5.28 and partitions
@ 2002-07-31 22:47 Matt_Domsch
  0 siblings, 0 replies; 51+ messages in thread
From: Matt_Domsch @ 2002-07-31 22:47 UTC (permalink / raw)
  To: peter, pavel; +Cc: viro, Andries.Brouwer, linux-kernel

Hi Peter.  Thanks for your work on LBD for 2.5.x.  I'm really looking
forward to its inclusion.

> What we really need to be able to do, however, is partition these huge
> discs, if only so that each partition is less than a reasonable number
> of backup tapes/devices/whatever.  And at present the only scheme that
> Linux understands for partitioning huge discs is the EFI GUID scheme.

:-)
 
> Maybe we need to roll our own?

What's wrong with EFI GUID scheme (GPT) (other than it wasn't invented by
Linux folks)?

the 2.5.x kernel understands it today
the 2.4.x kernel could very easily understand it (patch available on
http://domsch.com/linux/patches/gpt against 2.4.19-rc1), and ia64 has had it
for a couple years.
partx understands it today
parted understands it today
(efibootmgr and the EFI environment understand it today, but that's only
relevant to IA-64 at the moment)


> Maybe we need to roll our own?  I suggest something like:
>       struct linux_volume_header {
> 	     char  volname[16];
> 	     __u32 nparts;
> 	     __u32 blocksize;

The disk can tell you its blocksize.  The FS will have its own idea anyhow.

> 	     struct linux_partition {
> 		    char partname[16];
> 		    __u64  start;
> 		    __u64  len;
> 		    __u32  usage;
> 		    __u32  flags;
> 	    } parts[];
>     };
> 
> the whole to fit into a 4k block at the start of the volume, with a
> crc32 at the end.
> 
> Usage to be a magic number that says this is a swap, spare,
> whole-disc, filesystem+type, whatever, partition.
> 
> flags for whatever we want.
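
For reference, sealing a 4k block with a trailing crc32 could look
roughly like the sketch below. It uses the common reflected CRC-32
(polynomial 0xEDB88320) computed bit by bit; the little-endian storage
of the checksum and the names `volhdr_seal` and `volhdr_valid` are
assumptions for the example, since the proposal does not specify them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VOLHDR_BLOCK 4096u

/* Bitwise CRC-32, reflected polynomial 0xEDB88320. */
uint32_t crc32_ieee(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Store the CRC of the first 4092 bytes in the last 4 bytes
 * (little-endian byte order is an assumption of this sketch). */
void volhdr_seal(uint8_t block[VOLHDR_BLOCK])
{
    uint32_t crc = crc32_ieee(block, VOLHDR_BLOCK - 4);
    for (int i = 0; i < 4; i++)
        block[VOLHDR_BLOCK - 4 + i] = (uint8_t)(crc >> (8 * i));
}

/* Returns 1 if the stored CRC matches the block contents. */
int volhdr_valid(const uint8_t block[VOLHDR_BLOCK])
{
    uint32_t stored = 0;
    for (int i = 0; i < 4; i++)
        stored |= (uint32_t)block[VOLHDR_BLOCK - 4 + i] << (8 * i);
    return stored == crc32_ieee(block, VOLHDR_BLOCK - 4);
}
```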

All of this is already done in GPT today, or could be if desired (spare,
etc).  Tagging the FS type inside the partition table isn't pretty, and has
led to the huge table of partition type numbers that Andries maintains,
when fs probing isn't hard.
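
To illustrate how little probing takes, here is a minimal sketch that
recognises two formats from the first bytes of a partition. It relies
on the ext2 superblock living at byte 1024 with a 16-bit little-endian
magic 0xEF53 at offset 56, and on the "SWAPSPACE2" signature at the
end of the first page of a Linux swap area; the 4 KiB page size and
the names `probe_fs` and `enum fs_guess` are assumptions of the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum fs_guess { FS_UNKNOWN, FS_EXT2, FS_SWAP };

/* Guess what a partition holds from its first `len` bytes. */
enum fs_guess probe_fs(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);

    /* ext2/ext3: magic 0xEF53, little-endian, at byte 1024 + 56. */
    if (len >= 1024 + 58 &&
        buf[1024 + 56] == 0x53 && buf[1024 + 57] == 0xEF)
        return FS_EXT2;

    /* Linux swap v2: signature in the last 10 bytes of the first
     * page (assumed here to be 4096 bytes). */
    if (len >= 4096 && memcmp(buf + 4096 - 10, "SWAPSPACE2", 10) == 0)
        return FS_SWAP;

    return FS_UNKNOWN;
}
```

A real prober would check many more formats and validate more than a
magic number, but the structure stays the same: read a few fixed
offsets and compare.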

> I can't see anyone booting from a huge array in the near-term future,
> because you need the BIOS to understand the array.

Sure, so we don't have to fix grub or lilo to understand GPT yet.  :-)

Unless there's something that GPT doesn't do well, I'd prefer not to make
yet another partitioning scheme.  If there is something else it needs, it
can be extended.

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer, Architect
Dell Linux Solutions www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
#1 US Linux Server provider for 2001 and Q1/2002! (IDC May 2002)


[parent not found: <15688.25919.138565.6427@wombat.chubb.wattle.id.au>]
* Re: 2.5.28 and partitions
@ 2002-07-25 17:50 Andries.Brouwer
  0 siblings, 0 replies; 51+ messages in thread
From: Andries.Brouwer @ 2002-07-25 17:50 UTC (permalink / raw)
  To: viro; +Cc: linux-kernel, torvalds

>> and I object to the long instead of u64 or so.

> Separate set of patches.

Good.
Although it is better to design the right data structures first.

> As it is, struct hd_struct is still there and still not modified.
> And it has unsigned long.  It will become sector_t.

You need two things:

(i) A faithful representation of what the partition parser says.
Partition table parsers, in the kernel or in user space, find out
how this disk is partitioned and the information found is stored
in some "parsed partition table" struct. Here offset and length
must be u64 and use byte as a unit.

(ii) A representation of offset and length suitable to use for
block I/O. During block I/O a sector number is tested against
the max to test for errors, and the partition offset is added.
These two must of course use the units the sector number is in.
So a sector_t is reasonable here.
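
The split between (i) and (ii) can be sketched as two small structs
and a conversion between them. Everything below is illustrative: the
struct and function names are invented for the example, sectors are
assumed to be 512 bytes, and sector_t is assumed to be 64 bits wide.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9          /* 512-byte sectors (assumption) */

typedef uint64_t sector_t;      /* assumption: 64-bit sector numbers */

/* (i) What the partition parser reports: bytes, 64 bits. */
struct parsed_part {
    uint64_t start_bytes;
    uint64_t len_bytes;
};

/* (ii) What block I/O uses: same units as the request's sector. */
struct io_part {
    sector_t start_sect;
    sector_t nr_sects;
};

struct io_part to_io_part(struct parsed_part p)
{
    /* partition start is expected to be sector-aligned */
    assert((p.start_bytes & ((1u << SECTOR_SHIFT) - 1)) == 0);
    struct io_part io = {
        p.start_bytes >> SECTOR_SHIFT,
        p.len_bytes >> SECTOR_SHIFT
    };
    return io;
}

/* Check a partition-relative request against the partition size and
 * add the partition offset.  Returns 0 on success, -1 if the request
 * runs past the end of the partition. */
int remap_sector(const struct io_part *io, sector_t rel, sector_t nsect,
                 sector_t *abs_out)
{
    if (rel > io->nr_sects || nsect > io->nr_sects - rel)
        return -1;
    *abs_out = io->start_sect + rel;
    return 0;
}
```

The per-request work is one comparison and one addition in sector
units, while the byte-based struct is only touched when the table is
parsed.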


> Actually, I'm not all that sure that we want u64 here.  The thing being,
> start_sect shouldn't be bigger than sector_t (see how it's used).  And
> 64bit arithmetics on 32bit boxen sucks big way.  I'm not too concerned
> about adding start_sect per se - it's done once per request and it's
> noise compared to the rest of work.  However, long long for sector_t
> will hit in a lot of more interesting code paths.

It will be unavoidable soon. For many applications it is needed today.

> Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> device should seek professional help of the kind they don't give on l-k...

I don't see how this can be relevant. If the device is large and you
make it one big partition then the size of the partition will need more
than 32 bits. If you split it up into lots of tiny 2 TB partitions
then the offsets will need more than 32 bits.

I did my partition stuff seven years ago, and at that time discussion
was possible: is it really necessary to use 64 bits?
Today no discussion is possible. Yes, u64 is needed.

Andries


[As a separate discussion:
I used a sparse setup, that is why the struct describing a partition also
had the partition number. Your version with 256 structs looks a bit clumsy.
In most setups 256 is a waste. In some it is not enough.
Sparseness is useful for user space. But of course I had a 64-bit dev_t.]

* RE: 2.5.28 and partitions
@ 2002-07-25 13:24 Petr Vandrovec
  2002-07-25 13:45 ` Anton Altaparmakov
  0 siblings, 1 reply; 51+ messages in thread
From: Petr Vandrovec @ 2002-07-25 13:24 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Linus Torvalds, Matt_Domsch, Andries.Brouwer, linux-kernel

On 25 Jul 02 at 14:03, Anton Altaparmakov wrote:
> At 12:44 25/07/02, Alexander Viro wrote:
> >Al, still thinking that anybody who does mkfs.<whatever> on a multi-Tb
> >device should seek professional help of the kind they don't give on l-k...
> 
> Why? What is wrong with large devices/file systems? Why do we have to break 
> up everything into multiple devices? Just because the kernel is "too lazy" 
> to implement support for large devices? Nobody cares if 64bit code is 
> 10-20% slower than 32bit code on a storage server. The storage devices are 

But I care whether gcc barfs on code or not, and whether generated code
is correct or not.

I do very trivial 64bit computations in the TV-Out portion of matroxfb,
but I spent two days shifting code up and down, adding temporary variables
and splitting expressions into simpler ones just to make the code compile
at all with gcc-2.95.4 building a module for a PIII kernel (Debian bug #151196).
So I personally cannot recommend doing any 64bit math without making
gcc-3.0 the minimum version for the ia32 architecture.
                                                Petr Vandrovec
                                                vandrove@vc.cvut.cz
                                                

* RE: 2.5.28 and partitions
@ 2002-07-25  3:22 Matt_Domsch
  2002-07-25  5:27 ` Linus Torvalds
  2002-07-25 10:42 ` Alan Cox
  0 siblings, 2 replies; 51+ messages in thread
From: Matt_Domsch @ 2002-07-25  3:22 UTC (permalink / raw)
  To: viro, Andries.Brouwer; +Cc: torvalds, linux-kernel

> That stuff becomes an issue for 2Tb disks.  Do we actually have something
> that large attached to 32bit boxen?

Absolutely.  A single external disk pod with 14 73GB SCSI disks is >1TB,
with 145GB disks expected in the very near future, and 120GB IDE disks
available today.  You can put 4 disk pods on a single 4-channel RAID
controller.  You can have multiple 4-channel RAID controllers and do
software RAID across the lot.  You can attach your server to a multi-TB
Dell|EMC SAN.  All of these configs are limited by the current 32-bit block
address.

> ... and still use i386 with these disks?  

Yep.  We're doing all of this today on our x86 server products, and don't
expect x86 to die any time soon.  I'm on conference calls each week with
customers who have huge data storage requirements, who like the
price/performance of x86 servers and the ever-decreasing cost of storage.
Medical imaging.  Render farms.  CAD/CAM.  Search engines.  Mirror sites.
Scientific compute clusters (they want a real CFS too).  Spam quarantine :-)
I'm excited by Peter Chubb's LBD patch for 2.5.x, but a product with a 2.6.x
kernel is still a long way away, and customers with money are asking for
this today.  "Be patient" isn't something a salesperson likes to hear when
there's a commission on the line. :-)

Right now all of these solutions are being done with multiple ~1TB
partitions and file systems, which for most applications works.  But some of
the above believe they would benefit from, say, a single 10TB shared
clustered file system (with another 10TB of disks to back the thing up).
That isn't possible today, even though one could build such.

> >u64 for sector_t doesn't change anything for 64bit boxen 
> >that might be interested in really large disks and
> >screws 32bit ones that shouldn't have to pay for that...
> 
> True. That's why sector_t should be a compile time option in 
> the kernel

I'd be happy with an option too.  Then the distros can choose to enable it
for some kernels "i686 bigmem-bigdisk", but not for i686 UP.  That does
raise the problem of kernel proliferation, but I'm sure the distros will
have some ideas there.
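
A compile-time choice of this kind could be sketched as below.
DEMO_BIG_SECTORS is a hypothetical switch invented for the example,
not an actual kernel config symbol, and `max_device_bytes` is likewise
made up; 512-byte sectors are assumed.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* DEMO_BIG_SECTORS is a hypothetical build-time switch. */
#ifdef DEMO_BIG_SECTORS
typedef uint64_t sector_t;        /* pay for 64 bits everywhere */
#else
typedef unsigned long sector_t;   /* native word: 32 bits on i386 */
#endif

/* Largest device, in bytes, addressable with 512-byte sectors.
 * Saturates at UINT64_MAX when the true value needs more than 64 bits. */
uint64_t max_device_bytes(void)
{
    unsigned bits = (unsigned)(sizeof(sector_t) * CHAR_BIT);

    assert(bits == 32 || bits == 64);
    if (bits + 9 >= 64)
        return UINT64_MAX;
    return (uint64_t)1 << (bits + 9);
}
```

Built without the switch on a 32-bit machine this reports 2^41 bytes
(2 TiB); with it, or on a 64-bit machine, the limit is far beyond any
current device.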

The promise of 64-bit block addresses eventually was a huge part of why I
worked on the GPT code in the kernel, partx, parted, etc.  I could really
use it today, and it'll be a solid requirement less than a year from now.


Thanks,
Matt
--
Matt Domsch
Sr. Software Engineer, Lead Engineer, Architect
Dell Linux Solutions www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
#1 US Linux Server provider for 2001 and Q1/2002! (IDC May 2002)


* 2.5.28 and partitions
@ 2002-07-24 22:42 Andries.Brouwer
  2002-07-24 23:42 ` Alexander Viro
       [not found] ` <Pine.GSO.4.21.0207241925450.14656-100000@weyl.math.psu.edu>
  0 siblings, 2 replies; 51+ messages in thread
From: Andries.Brouwer @ 2002-07-24 22:42 UTC (permalink / raw)
  To: torvalds, viro; +Cc: linux-kernel

Just saw some new partition code in 2.5.28. Good!
I like almost all I see, except for one thing:

When I did precisely these same things, long ago, I used

struct blkpg_partition {
        long long start;                /* starting offset in bytes */
        long long length;               /* length in bytes */
        int pno;                        /* partition number */
        char devname[BLKPG_DEVNAMELTH]; /* partition name, like sda5 or c0d1p2,
                                           to be used in kernel messages */
        char volname[BLKPG_VOLNAMELTH]; /* volume label */
};

still visible in blkpg.h.

Now I read in 2.5.28:

+struct parsed_partitions {
+       char name[40];
+       struct {
+               unsigned long from;
+               unsigned long size;
+               int flags;
+       } parts[MAX_PART];
+       int next;
+       int limit;
+};

and I object to the long instead of u64 or so.

With 2^32 sectors one can handle up to 2^41 bytes, 2 TiB.
Already today people want RAIDs that are larger, and
a few years from now we'll have single disks that are larger.

The fields from and size really need more bits than 32.
And when they become u64, it is a good idea to measure bytes
instead of 512-byte sectors.
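
The 2 TiB figure, and what happens beyond it, can be checked with a
few lines. This is only a sketch: uint32_t stands in for "unsigned
long" on a 32-bit machine, and both function names are invented for
the example.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Bytes addressable when sector numbers are `sector_bits` wide. */
uint64_t addressable_bytes(unsigned sector_bits)
{
    assert(sector_bits <= 54);    /* keep the product within 64 bits */
    return ((uint64_t)1 << sector_bits) * SECTOR_SIZE;
}

/* What a 32-bit `size` field would record for a partition of
 * `len_bytes` bytes: the high bits are silently lost. */
uint32_t size_field_32(uint64_t len_bytes)
{
    return (uint32_t)(len_bytes / SECTOR_SIZE);
}
```

For example, a 3 TiB partition has 0x180000000 sectors; stored in 32
bits that becomes 0x80000000, so the partition would appear to be
1 TiB long.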

(In the design where all partition reading code is removed
from the kernel, and user space tells the kernel what the
partitions on its disks are, it is also natural that user
space is able to provide names for the partitions.
Both names for the kernel to use in its messages, and names
to be used in mount-by-label. Of course I would like to
remove all mount-by-label code from mount(8).)

Andries



