* lots and lots of disks again
From: Andrew Morton @ 2004-02-04 10:45 UTC
To: linux-scsi

You can't hide from me you know ;)

What to do about this?  Last time it was discussed we had:

Christoph Hellwig <hch@infradead.org> wrote:
>
> - I think the bitmap overhead was killing us on big x86 machines, right?
>   so that might need some work.
> - a config option definitely is the wrong way to go.
> - I don't think assigning more numbers to the last major is a bad idea,
>   better assign them to the first one, maybe with a hole for what we're
>   using the other majors currently for, so when udev gets more mature
>   we can kill the additional majors.

Matthew Wilcox <willy@debian.org> wrote:
>
> LWN were kind enough to archive my previous thoughts on this at
> http://lwn.net/Articles/54840/
>
> I'd've worked on some code for this if anyone had been interested ...

badari <pbadari@us.ibm.com> wrote:
>
> I am not sorry for not replying for so long. I have been on vacation.
>
> 1) I am not really concerned about bitmap overhead. Bitmap of one
>    page (4k) - should be enough to support 32K disks. That should be
>    good enough for most of the machines.
>
> 2) I used config option for 2 reasons
>    - to minimize impact on machines this is not needed.
>    - didn't want to depend on <major, minor> split to decide on
>      how many disks we can support.
>
> 3) The reason, I assigned all the disks to last major is to maintain backward
>    compatibility with current major, minor assignments. Hopefully "udev"
>    will cleanup all these.
>
> My question is, what do we do for current 2.6 ? Hoping to address these
> before distros start making 2.6 distros and adding their own stuff to support
> this.
>
> And also, is there a plan to support more partitions per disk ?

And nothing happened.

From: Badari Pulavarty <pbadari@us.ibm.com> Here is the patch to support large number of SCSI disks. The patch is not fully cooked yet. I was hoping to use this generate discussion. I have not tested it fully on 2.6.0-test6-mm1. As I mentioned earlier, it maintains backward compatibility with existing sd major/minors - by attached all the new disks to last sd major. And also, I made the number of disks to support as configurable - to avoid dependency on how many minor bits. --- drivers/scsi/Kconfig | 8 ++++++++ drivers/scsi/sd.c | 25 +++++++++++++++++++------ 2 files changed, 27 insertions(+), 6 deletions(-) diff -puN drivers/scsi/Kconfig~support-zillions-of-scsi-disks drivers/scsi/Kconfig --- 25/drivers/scsi/Kconfig~support-zillions-of-scsi-disks 2004-01-07 19:10:47.000000000 -0800 +++ 25-akpm/drivers/scsi/Kconfig 2004-01-07 19:10:47.000000000 -0800 @@ -55,6 +55,14 @@ config BLK_DEV_SD In this case, do not compile the driver for your SCSI host adapter (below) as a module either. +config MAX_SD_DISKS + int "Maximum number of SCSI disks to support (256-8192)" + depends on BLK_DEV_SD + default "256" + help + The maximum number SCSI disks to support. Default is 256. + Change this value if you want kernel to support lots of SCSI devices. + config CHR_DEV_ST tristate "SCSI tape support" depends on SCSI diff -puN drivers/scsi/sd.c~support-zillions-of-scsi-disks drivers/scsi/sd.c --- 25/drivers/scsi/sd.c~support-zillions-of-scsi-disks 2004-01-07 19:10:47.000000000 -0800 +++ 25-akpm/drivers/scsi/sd.c 2004-01-07 19:10:47.000000000 -0800 @@ -62,6 +62,7 @@ */ #define SD_MAJORS 16 #define SD_DISKS (SD_MAJORS << 4) +#define TOTAL_SD_DISKS CONFIG_MAX_SD_DISKS /* * Time out in seconds for disks and Magneto-opticals (which are slower). 
@@ -95,7 +96,7 @@ struct scsi_disk { }; -static unsigned long sd_index_bits[SD_DISKS / BITS_PER_LONG]; +static unsigned long sd_index_bits[TOTAL_SD_DISKS / BITS_PER_LONG]; static spinlock_t sd_index_lock = SPIN_LOCK_UNLOCKED; static int sd_revalidate_disk(struct gendisk *disk); @@ -130,6 +131,9 @@ static int sd_major(int major_idx) return SCSI_DISK1_MAJOR + major_idx - 1; case 8 ... 15: return SCSI_DISK8_MAJOR + major_idx - 8; +#define MAX_IDX (TOTAL_SD_DISKS >> 4) + case 16 ... MAX_IDX: + return SCSI_DISK15_MAJOR; default: BUG(); return 0; /* shut up gcc */ @@ -1320,8 +1324,8 @@ static int sd_probe(struct device *dev) goto out_free; spin_lock(&sd_index_lock); - index = find_first_zero_bit(sd_index_bits, SD_DISKS); - if (index == SD_DISKS) { + index = find_first_zero_bit(sd_index_bits, TOTAL_SD_DISKS); + if (index == TOTAL_SD_DISKS) { spin_unlock(&sd_index_lock); error = -EBUSY; goto out_put; @@ -1336,15 +1340,24 @@ static int sd_probe(struct device *dev) sdkp->openers = 0; gd->major = sd_major(index >> 4); - gd->first_minor = (index & 15) << 4; + if (index > SD_DISKS) + gd->first_minor = ((index - SD_DISKS) & 15) << 4; + else + gd->first_minor = (index & 15) << 4; gd->minors = 16; gd->fops = &sd_fops; - if (index >= 26) { + if (index < 26) { + sprintf(gd->disk_name, "sd%c", 'a' + index % 26); + } else if (index < (26*27)) { sprintf(gd->disk_name, "sd%c%c", 'a' + index/26-1,'a' + index % 26); } else { - sprintf(gd->disk_name, "sd%c", 'a' + index % 26); + const unsigned int m1 = (index/ 26 - 1) / 26 - 1; + const unsigned int m2 = (index / 26 - 1) % 26; + const unsigned int m3 = index % 26; + sprintf(gd->disk_name, "sd%c%c%c", + 'a' + m1, 'a' + m2, 'a' + m3); } strcpy(gd->devfs_name, sdp->devfs_name); _ ^ permalink raw reply [flat|nested] 43+ messages in thread
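The naming logic in the sd_probe() hunk above can be exercised outside the kernel. What follows is a minimal user-space sketch, not part of the posted patch: sd_format_name() and SD_MAX_NAMED are hypothetical names, and the final bounds check is an addition. Going by the arithmetic, one to three letters cover 26 + 676 + 17576 = 18278 indices, so any limit above that would seem to need a fourth letter (or a rejected probe) once the index reaches 18278.

```c
#include <stdio.h>

/* Indices that fit in one to three letters: 26 + 26^2 + 26^3 = 18278.
 * Hypothetical constant, not present in the patch. */
#define SD_MAX_NAMED (26u + 26u * 26u + 26u * 26u * 26u)

/* Hypothetical helper mirroring the sprintf() chain in sd_probe().
 * Returns 0 on success, -1 if the index has no three-letter name. */
static int sd_format_name(unsigned int index, char *buf, size_t len)
{
	if (index < 26u) {
		/* sda .. sdz */
		snprintf(buf, len, "sd%c", 'a' + (int)index);
	} else if (index < 26u * 27u) {
		/* sdaa .. sdzz */
		snprintf(buf, len, "sd%c%c",
			 'a' + (int)(index / 26 - 1),
			 'a' + (int)(index % 26));
	} else if (index < SD_MAX_NAMED) {
		/* sdaaa .. sdzzz, same arithmetic as the patch */
		const int m1 = (int)((index / 26 - 1) / 26 - 1);
		const int m2 = (int)((index / 26 - 1) % 26);
		const int m3 = (int)(index % 26);

		snprintf(buf, len, "sd%c%c%c", 'a' + m1, 'a' + m2, 'a' + m3);
	} else {
		/* Added guard: the patch itself has no such check. */
		return -1;
	}
	return 0;
}
```

Called with a buffer of at least 6 bytes, index 26 yields "sdaa" and index 702 yields "sdaaa".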
* Re: lots and lots of disks again
From: Kurt Garloff @ 2004-02-10 11:04 UTC
To: Andrew Morton
Cc: linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley, Christoph Hellwig

Hi Andrew,

On Wed, Feb 04, 2004 at 02:45:12AM -0800, Andrew Morton wrote:
> You can't hide from me you know ;)
>
> What to do about this?

Apply my patch ;-)

Find attached a patch that combines Badari's and Matthew's ideas.
It corrects the typos in Matthew's code snippets.
It even addresses Christoph's criticism, though I'm sure he'll find
new points ...

> Last time it was discussed we had:
>
> Christoph Hellwig <hch@infradead.org> wrote:
> >
> > - I think the bitmap overhead was killing us on big x86 machines, right?
> >   so that might need some work.

I limited the number of disks to 32k, which gives us a 4k bitmap.
Loading scsi_debug with 4k disks takes a minute or so, removing it a
few minutes. I doubt that it's caused by the bitmap, though. sg takes
equally long. Eventually, we'll need to ask oprofile.
As long as we know there are no holes, we could shortcut the bitmap
search ...

The patch would support up to 262144 disks, but I believe we're safe
with just 32k for the next years, even for those hard core multipath
people.

> > - a config option definitely is the wrong way to go.

There's no config option any more.
EMBEDDED  -> 256
otherwise -> 32768

> > - I don't think assigning more numbers to the last major is a bad idea,
> >   better assign them to the first one, maybe with a hole for what we're
> >   using the other majors currently for, so when udev gets more mature
> >   we can kill the additional majors.

Matthew stripes them over the majors. The first sixteen go to major0,
the next to major 1 ...; disks 256 -- 271 go to major0 again etc.

> Matthew Wilcox <willy@debian.org> wrote:
> >
> > LWN were kind enough to archive my previous thoughts on this at
> > http://lwn.net/Articles/54840/
> >
> > I'd've worked on some code for this if anyone had been interested ...

I've taken most from it.
I also left the support for two more partition bits in, though it can't
really be used at the moment. The genhd code does not accommodate
non-contiguous minors for partitions.
If the need arises, this can be done later, though it will be a bit
of a hack. So we can as well look at them as two reserved bits for now.

> badari <pbadari@us.ibm.com> wrote:
> >
> > I am not sorry for not replying for so long. I have been on vacation.
> >
> > 1) I am not really concerned about bitmap overhead. Bitmap of one
> >    page (4k) - should be enough to support 32K disks. That should be
> >    good enough for most of the machines.

Agreed.

> > 2) I used config option for 2 reasons
> >    - to minimize impact on machines this is not needed.
> >    - didn't want to depend on <major, minor> split to decide on
> >      how many disks we can support.

That's why I made it depend on CONFIG_EMBEDDED.

> > 3) The reason, I assigned all the disks to last major is to maintain backward
> >    compatibility with current major, minor assignments. Hopefully "udev"
> >    will cleanup all these.

Matthew's approach does this as well.

> > My question is, what do we do for current 2.6 ? Hoping to address these
> > before distros start making 2.6 distros and adding their own stuff to support
> > this.
> >
> > And also, is there a plan to support more partitions per disk ?

There is, but either we have to hack gendisk or break the old
numbering. For now, I've done neither, but two bits are reserved for
supporting 64 partitions per disk.

> And nothing happened.

Well, I even tested the patch.
I'm a bit worried that unloading scsi_debug with 4k disks takes so long,
but I doubt it's caused by sd, as it also happened with sg only.
For the rest, it survived well.

Note that sg currently limits the number of generic devices to 8k.
We should probably also increase it to 32k.
If the sd patch is found acceptable, I'll submit a patch to sg.

Regards,
-- 
Kurt Garloff <garloff@suse.de> Cologne, DE
SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head)

* Re: lots and lots of disks again 2004-02-10 11:04 ` Kurt Garloff @ 2004-02-10 11:26 ` Kurt Garloff 2004-02-10 13:39 ` Christoph Hellwig 0 siblings, 1 reply; 43+ messages in thread From: Kurt Garloff @ 2004-02-10 11:26 UTC (permalink / raw) To: Andrew Morton, linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley, Christoph Hellwig [-- Attachment #1.1: Type: text/plain, Size: 537 bytes --] Hi, On Tue, Feb 10, 2004 at 12:04:17PM +0100, Kurt Garloff wrote: > Hi Andrew, > > On Wed, Feb 04, 2004 at 02:45:12AM -0800, Andrew Morton wrote: > > What to do about this? > > Apply my patch ;-) > > Find attached a patch that combines Badari's and Matthew's ideas. > It corrects the typos in Matthew's code snippets. Only that the patch was not attached. Regadrs, -- Kurt Garloff <garloff@suse.de> Cologne, DE SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head) [-- Attachment #1.2: scsi-many-26.diff --] [-- Type: text/plain, Size: 4465 bytes --] --- drivers/scsi/sd.c.orig 2004-01-09 07:59:49.000000000 +0100 +++ drivers/scsi/sd.c 2004-02-10 09:44:24.913264768 +0100 @@ -19,6 +19,9 @@ * not being read in sd_open. Fix problem where removable media * could be ejected after sd_open. * - Douglas Gilbert <dgilbert@interlog.com> cleanup for lk 2.5.x + * - Badari Pulavarty <pbadari@us.ibm.com>, Matthew Wilcox + * <willy@debian.org>, Kurt Garloff <garloff@suse.de>: + * Support 256k disks (with potentially 64 partitions, TBD). * * Logging policy (needs CONFIG_SCSI_LOGGING defined): * - setting up transfer: SCSI_LOG_HLQUEUE levels 1 and 2 @@ -61,7 +64,16 @@ * Remaining dev_t-handling stuff */ #define SD_MAJORS 16 -#define SD_DISKS (SD_MAJORS << 4) +/* sd_index_bits array size / disks + * 32 / 256 + * 4096 / 32768 + * 32768 / 262144 + */ +#ifdef CONFIG_EMBEDDED +# define SD_DISKS 256 +#else +# define SD_DISKS 32768 // we can raise this to 262144 if needed +#endif /* * Time out in seconds for disks and Magneto-opticals (which are slower). 
@@ -121,6 +132,22 @@ .init_command = sd_init_command, }; +/* Major / minor to disk mapping, from Matthew Wilcox, corrected + * (mail to linux-scsi@vger.kernel.org from 2003-10-16) + * + * major p2 disc2 disc p1 + * |............|..|..........|....|....| <- dev_t + * 31 20 17 8 7 4 3 0 + * + * We allow 64 partitions per disk, by adding two more bits. + * Inside a major, we have 16k disks, however mapped non- + * contiguously. The first 16 disks are for major0, the next + * ones with major1, ... Disk 256 is for major0 again, disk 272 + * for major1, ... + * We can't currently use the partitions beyond 16, as the + * genhd infrastructure expects contiguous minors. + */ + static int sd_major(int major_idx) { switch (major_idx) { @@ -136,6 +163,35 @@ } } +static int inv_sd_major(int major) +{ + switch (major) { + case SCSI_DISK0_MAJOR: + return 0; + case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR: + return major + 1 - SCSI_DISK1_MAJOR; + case SCSI_DISK8_MAJOR ... SCSI_DISK15_MAJOR: + return major + 8 - SCSI_DISK8_MAJOR; + default: + BUG(); + return 0; /* shut up gcc */ + } +} + +unsigned int dev_to_sd_nr(unsigned int dev) { + return ((dev >> 4) & 15) | (inv_sd_major(dev >> 20) << 4) | + (dev & 0x3ff00); +} + +unsigned int dev_to_sd_part(unsigned int dev) { + return (dev & 15) | ((dev >> 14) & 0x30); +} + +unsigned int make_sd_dev(unsigned int sd_nr, unsigned int part) { + return (part & 0xf) | ((part & 0x30) << 14) | ((sd_nr & 0xf) << 4) | + (sd_major((sd_nr & 0xf0) >> 4) << 20) | (sd_nr & 0x3ff00); +} + #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,kobj); static inline struct scsi_disk *scsi_disk(struct gendisk *disk) @@ -1297,7 +1353,7 @@ struct scsi_disk *sdkp; struct gendisk *gd; u32 index; - int error; + int error, devno; error = -ENODEV; if ((sdp->type != TYPE_DISK) && (sdp->type != TYPE_MOD)) @@ -1315,6 +1371,12 @@ kobject_init(&sdkp->kobj); sdkp->kobj.ktype = &scsi_disk_kobj_type; + /* Note: We can accomodate 64 partitions, but the genhd code + * 
assumes partitions allocate consecutive minors, which they don't. + * So for now stay with max 16 partitions and leave two spare bits. + * Later, we may change the genhd code and the alloc_disk() call + * and the ->minors assignment here. KG, 2004-02-10 + */ gd = alloc_disk(16); if (!gd) goto out_free; @@ -1335,16 +1397,23 @@ sdkp->index = index; sdkp->openers = 0; - gd->major = sd_major(index >> 4); - gd->first_minor = (index & 15) << 4; + devno = make_sd_dev(index, 0); + gd->major = MAJOR(devno); + gd->first_minor = MINOR(devno); gd->minors = 16; gd->fops = &sd_fops; - if (index >= 26) { + if (index < 26) { + sprintf(gd->disk_name, "sd%c", 'a' + index % 26); + } else if (index < (26*27)) { sprintf(gd->disk_name, "sd%c%c", - 'a' + index/26-1,'a' + index % 26); + 'a' + index / 26 - 1,'a' + index % 26); } else { - sprintf(gd->disk_name, "sd%c", 'a' + index % 26); + const unsigned int m1 = (index / 26 - 1) / 26 - 1; + const unsigned int m2 = (index / 26 - 1) % 26; + const unsigned int m3 = index % 26; + sprintf(gd->disk_name, "sd%c%c%c", + 'a' + m1, 'a' + m2, 'a' + m3); } strcpy(gd->devfs_name, sdp->devfs_name); [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
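The striping described in the patch comment above (disks 0-15 on major0, 16-31 on major1, disk 256 back on major0) can be checked with a small user-space sketch of make_sd_dev(). Some assumptions: the major numbers are the values from <linux/major.h>; SD_DEV_MAJOR() and SD_DEV_MINOR() are hypothetical stand-ins for the kernel's MAJOR()/MINOR() macros with the 2.6 split of 20 minor bits; and sd_major() is rewritten with plain comparisons instead of the GCC case ranges and BUG() used in sd.c.

```c
/* SCSI disk majors as listed in <linux/major.h>. */
#define SCSI_DISK0_MAJOR	8
#define SCSI_DISK1_MAJOR	65
#define SCSI_DISK8_MAJOR	128

/* Hypothetical stand-ins for MAJOR()/MINOR(), assuming 20 minor bits. */
#define SD_DEV_MAJOR(dev)	((dev) >> 20)
#define SD_DEV_MINOR(dev)	((dev) & 0xfffffu)

/* Map a major index 0..15 to one of the sixteen sd majors. */
static unsigned int sd_major(unsigned int major_idx)
{
	if (major_idx == 0)
		return SCSI_DISK0_MAJOR;
	if (major_idx < 8)
		return SCSI_DISK1_MAJOR + major_idx - 1;
	return SCSI_DISK8_MAJOR + major_idx - 8;	/* 8..15 */
}

/*
 * Same bit shuffling as in the patch:
 *   bits  0..3   low 4 bits of the partition number
 *   bits  4..7   low 4 bits of the disk number
 *   bits  8..17  disk number bits 8..17, kept in place
 *   bits 18..19  partition bits 4..5 (reserved for now)
 *   bits 20..31  major, chosen by disk number bits 4..7
 */
static unsigned int make_sd_dev(unsigned int sd_nr, unsigned int part)
{
	return (part & 0xf) | ((part & 0x30) << 14) | ((sd_nr & 0xf) << 4) |
	       (sd_major((sd_nr & 0xf0) >> 4) << 20) | (sd_nr & 0x3ff00);
}
```

For example, make_sd_dev(272, 0) lands on major 65 with minor 0x100, whereas make_sd_dev(16, 0) is major 65 with minor 0. Both disks share a major and differ only in the upper minor bits.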
* Re: lots and lots of disks again
From: Christoph Hellwig @ 2004-02-10 13:39 UTC
To: Kurt Garloff, Andrew Morton, linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley, Christoph Hellwig

On Tue, Feb 10, 2004 at 12:26:58PM +0100, Kurt Garloff wrote:
> +/* sd_index_bits array size / disks
> + *     32 /    256
> + *   4096 /  32768
> + *  32768 / 262144
> + */
> +#ifdef CONFIG_EMBEDDED
> +# define SD_DISKS	256
> +#else
> +# define SD_DISKS	32768	// we can raise this to 262144 if needed
> +#endif

Umm, using CONFIG_EMBEDDED doesn't mean no config option - it just means
another config option, and one nobody would expect to change the number
of supported scsi disks.

Really, any solution that requires huge static allocations is wrong.  The
bitmap either needs to be replaced with a saner algorithm or dynamic
allocation and reallocation on growth.

>  }
>
> +static int inv_sd_major(int major)
> +{
> +	switch (major) {
> +	case SCSI_DISK0_MAJOR:
> +		return 0;
> +	case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR:
> +		return major + 1 - SCSI_DISK1_MAJOR;
> +	case SCSI_DISK8_MAJOR ... SCSI_DISK15_MAJOR:
> +		return major + 8 - SCSI_DISK8_MAJOR;
> +	default:
> +		BUG();
> +		return 0;	/* shut up gcc */
> +	}
> +}
> +
> +unsigned int dev_to_sd_nr(unsigned int dev) {
> +	return ((dev >> 4) & 15) | (inv_sd_major(dev >> 20) << 4) |
> +		(dev & 0x3ff00);
> +}
> +unsigned int dev_to_sd_part(unsigned int dev) {
> +	return (dev & 15) | ((dev >> 14) & 0x30);
> +}
> +

Maybe I missed something but this seems completely unused?

Also please follow the coding style guidelines, that is opening brace
for functions on the next line and non-exported functions always static.

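To make the "dynamic allocation and reallocation on growth" alternative concrete, here is a rough user-space sketch of an index bitmap that starts empty and doubles when full. It is only an illustration of the idea, not code from any of the posted patches: struct index_map, index_map_get() and index_map_put() are hypothetical names, realloc() stands in for kernel allocation, and the locking that sd_probe() does under sd_index_lock is left out.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define MAP_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical growable replacement for a static sd_index_bits[] array. */
struct index_map {
	unsigned long *bits;	/* NULL while no index has been handed out */
	size_t nlongs;		/* current length of bits[] in longs */
};

/* Claim the lowest free index below max_index; -1 if none or out of memory. */
static long index_map_get(struct index_map *m, size_t max_index)
{
	size_t i, b, idx, new_n;
	unsigned long *p;

	for (i = 0; i < m->nlongs; i++) {
		if (m->bits[i] == ~0UL)
			continue;	/* word is full, skip all its bits */
		for (b = 0; b < MAP_BITS_PER_LONG; b++) {
			if (m->bits[i] & (1UL << b))
				continue;
			idx = i * MAP_BITS_PER_LONG + b;
			if (idx >= max_index)
				return -1;
			m->bits[i] |= 1UL << b;
			return (long)idx;
		}
	}

	/* Every bit is taken: grow the bitmap, unless the limit is reached. */
	idx = m->nlongs * MAP_BITS_PER_LONG;
	if (idx >= max_index)
		return -1;
	new_n = m->nlongs ? m->nlongs * 2 : 1;
	p = realloc(m->bits, new_n * sizeof(*p));
	if (!p)
		return -1;
	memset(p + m->nlongs, 0, (new_n - m->nlongs) * sizeof(*p));
	p[m->nlongs] |= 1UL;	/* first bit of the newly added region */
	m->bits = p;
	m->nlongs = new_n;
	return (long)idx;
}

/* Release an index so that a later index_map_get() can reuse it. */
static void index_map_put(struct index_map *m, size_t idx)
{
	if (idx / MAP_BITS_PER_LONG < m->nlongs)
		m->bits[idx / MAP_BITS_PER_LONG] &=
			~(1UL << (idx % MAP_BITS_PER_LONG));
}
```

A system with a handful of disks then pays for a single long, and the footprint only approaches the 4k of the static array once disk indices in the tens of thousands have actually been handed out. The cost is an allocation that can fail inside the probe path.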
* Re: lots and lots of disks again 2004-02-10 13:39 ` Christoph Hellwig @ 2004-02-10 15:47 ` Kurt Garloff 2004-02-10 15:52 ` Christoph Hellwig 2004-02-10 18:26 ` Andrew Morton 0 siblings, 2 replies; 43+ messages in thread From: Kurt Garloff @ 2004-02-10 15:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley [-- Attachment #1.1: Type: text/plain, Size: 2226 bytes --] Hi Christoph, On Tue, Feb 10, 2004 at 01:39:32PM +0000, Christoph "Nitpick" Hellwig wrote: > On Tue, Feb 10, 2004 at 12:26:58PM +0100, Kurt Garloff wrote: > > +#ifdef CONFIG_EMBEDDED > > +# define SD_DISKS 256 > > +#else > > +# define SD_DISKS 32768 // we can raise this to 262144 if needed > > +#endif > > Umm, using CONFIG_EMBEDDED doesn't mean no config option - it just mean > another config option which no-one would think of changing the number of > support scsi disks. I don't think a 4k array is too much for any normal machine. Given the overhead that registering the gendisks ... generate. Embedded has special requirements and for them saving 4k is probably worth it. Thus I did them the favour. Of course knowing that you could attack. But you should check for more side-effects of CONFIG_EMBEDDED before you attack this particular one ... and tell us about embedded devices that have access to more than 256 SCSI disks. Anyway, I can throw it out, and wait if the embedded people complain. > Really, any solution that requires huge static allocations is wrong. The > bitmap either needs to be replaced with a saner algorithm or dynamic > allocation and reallocation on growth. Since when is 4k huge? Sorry, I don't see any benefit in adding complexity to avoid a 4k array. > > +static int inv_sd_major(int major) [...] > > +unsigned int dev_to_sd_nr(unsigned int dev) { [...] > > +unsigned int dev_to_sd_part(unsigned int dev) { [...] > > Maybe I missed something but this seems completely unused? 
Right, Matthew provided the reverse mapping, despite the fact that we don't need them, as it's just done by pointer deref ... We may need them later though, if we ever go for 64 partitions per disk. So I left them in. Probably we should comment them out, so the compiler does not see them for now. Or stick them in the docu. > Also please follow the coding style guidelines, that is opening brace > for functions on the next line and non-exported functions always static. Will do. Regards, -- Kurt Garloff <garloff@suse.de> Cologne, DE SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head) [-- Attachment #1.2: scsi-many-26-2.diff --] [-- Type: text/plain, Size: 4369 bytes --] --- drivers/scsi/sd.c.orig 2004-01-09 07:59:49.000000000 +0100 +++ drivers/scsi/sd.c 2004-02-10 16:37:59.354806838 +0100 @@ -19,6 +19,9 @@ * not being read in sd_open. Fix problem where removable media * could be ejected after sd_open. * - Douglas Gilbert <dgilbert@interlog.com> cleanup for lk 2.5.x + * - Badari Pulavarty <pbadari@us.ibm.com>, Matthew Wilcox + * <willy@debian.org>, Kurt Garloff <garloff@suse.de>: + * Support 256k disks (with potentially 64 partitions, TBD). * * Logging policy (needs CONFIG_SCSI_LOGGING defined): * - setting up transfer: SCSI_LOG_HLQUEUE levels 1 and 2 @@ -61,7 +64,7 @@ * Remaining dev_t-handling stuff */ #define SD_MAJORS 16 -#define SD_DISKS (SD_MAJORS << 4) +#define SD_DISKS 32768 // anything between 256 and 262144 /* * Time out in seconds for disks and Magneto-opticals (which are slower). @@ -121,6 +124,22 @@ .init_command = sd_init_command, }; +/* Major / minor to disk mapping, from Matthew Wilcox, corrected + * (mail to linux-scsi@vger.kernel.org from 2003-10-16) + * + * major p2 disc2 disc p1 + * |............|..|..........|....|....| <- dev_t + * 31 20 17 8 7 4 3 0 + * + * We allow 64 partitions per disk, by adding two more bits. + * Inside a major, we have 16k disks, however mapped non- + * contiguously. 
The first 16 disks are for major0, the next + * ones with major1, ... Disk 256 is for major0 again, disk 272 + * for major1, ... + * We can't currently use the partitions beyond 16, as the + * genhd infrastructure expects contiguous minors. + */ + static int sd_major(int major_idx) { switch (major_idx) { @@ -136,6 +155,42 @@ } } +static unsigned int make_sd_dev(unsigned int sd_nr, unsigned int part) +{ + return (part & 0xf) | ((part & 0x30) << 14) | ((sd_nr & 0xf) << 4) | + (sd_major((sd_nr & 0xf0) >> 4) << 20) | (sd_nr & 0x3ff00); +} + +#if 0 +/* Reverse mapping, not needed at the moment */ + +static int inv_sd_major(int major) +{ + switch (major) { + case SCSI_DISK0_MAJOR: + return 0; + case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR: + return major + 1 - SCSI_DISK1_MAJOR; + case SCSI_DISK8_MAJOR ... SCSI_DISK15_MAJOR: + return major + 8 - SCSI_DISK8_MAJOR; + default: + BUG(); + return 0; /* shut up gcc */ + } +} + +static unsigned int dev_to_sd_nr(unsigned int dev) +{ + return ((dev >> 4) & 15) | (inv_sd_major(dev >> 20) << 4) | + (dev & 0x3ff00); +} + +static unsigned int dev_to_sd_part(unsigned int dev) +{ + return (dev & 15) | ((dev >> 14) & 0x30); +} +#endif + #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,kobj); static inline struct scsi_disk *scsi_disk(struct gendisk *disk) @@ -1297,7 +1352,7 @@ struct scsi_disk *sdkp; struct gendisk *gd; u32 index; - int error; + int error, devno; error = -ENODEV; if ((sdp->type != TYPE_DISK) && (sdp->type != TYPE_MOD)) @@ -1315,6 +1370,12 @@ kobject_init(&sdkp->kobj); sdkp->kobj.ktype = &scsi_disk_kobj_type; + /* Note: We can accomodate 64 partitions, but the genhd code + * assumes partitions allocate consecutive minors, which they don't. + * So for now stay with max 16 partitions and leave two spare bits. + * Later, we may change the genhd code and the alloc_disk() call + * and the ->minors assignment here. 
KG, 2004-02-10 + */ gd = alloc_disk(16); if (!gd) goto out_free; @@ -1335,16 +1396,23 @@ sdkp->index = index; sdkp->openers = 0; - gd->major = sd_major(index >> 4); - gd->first_minor = (index & 15) << 4; + devno = make_sd_dev(index, 0); + gd->major = MAJOR(devno); + gd->first_minor = MINOR(devno); gd->minors = 16; gd->fops = &sd_fops; - if (index >= 26) { + if (index < 26) { + sprintf(gd->disk_name, "sd%c", 'a' + index % 26); + } else if (index < (26*27)) { sprintf(gd->disk_name, "sd%c%c", - 'a' + index/26-1,'a' + index % 26); + 'a' + index / 26 - 1,'a' + index % 26); } else { - sprintf(gd->disk_name, "sd%c", 'a' + index % 26); + const unsigned int m1 = (index / 26 - 1) / 26 - 1; + const unsigned int m2 = (index / 26 - 1) % 26; + const unsigned int m3 = index % 26; + sprintf(gd->disk_name, "sd%c%c%c", + 'a' + m1, 'a' + m2, 'a' + m3); } strcpy(gd->devfs_name, sdp->devfs_name); [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
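The reverse mapping that this version of the patch keeps under #if 0 can be tried in user space as well. The sketch below follows the patch's arithmetic under a few assumptions: the major numbers are the values from <linux/major.h>, the dev value uses the 2.6 layout with the major in bits 20..31, and inv_sd_major() returns -1 for a non-sd major instead of calling BUG(), so callers are expected to pass sd device numbers only.

```c
/* SCSI disk majors as listed in <linux/major.h>. */
#define SCSI_DISK0_MAJOR	8
#define SCSI_DISK1_MAJOR	65
#define SCSI_DISK7_MAJOR	71
#define SCSI_DISK8_MAJOR	128
#define SCSI_DISK15_MAJOR	135

/* Map an sd major back to its index 0..15; -1 if it is not an sd major. */
static int inv_sd_major(unsigned int major)
{
	if (major == SCSI_DISK0_MAJOR)
		return 0;
	if (major >= SCSI_DISK1_MAJOR && major <= SCSI_DISK7_MAJOR)
		return (int)(major + 1 - SCSI_DISK1_MAJOR);
	if (major >= SCSI_DISK8_MAJOR && major <= SCSI_DISK15_MAJOR)
		return (int)(major + 8 - SCSI_DISK8_MAJOR);
	return -1;
}

/* Disk number: minor bits 4..7, major index in bits 4..7, minor bits 8..17.
 * Assumes dev carries one of the sixteen sd majors. */
static unsigned int dev_to_sd_nr(unsigned int dev)
{
	return ((dev >> 4) & 15) |
	       ((unsigned int)inv_sd_major(dev >> 20) << 4) |
	       (dev & 0x3ff00);
}

/* Partition number: minor bits 0..3 plus the two reserved bits 18..19. */
static unsigned int dev_to_sd_part(unsigned int dev)
{
	return (dev & 15) | ((dev >> 14) & 0x30);
}
```

With major 65 and minor 0x113, for instance, this gives disk 273, partition 3.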
* Re: lots and lots of disks again
From: Christoph Hellwig @ 2004-02-10 15:52 UTC
To: Kurt Garloff, Christoph Hellwig, Andrew Morton, linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley

On Tue, Feb 10, 2004 at 04:47:51PM +0100, Kurt Garloff wrote:
>
> Right, Matthew provided the reverse mapping, despite the fact that
> we don't need them, as it's just done by pointer deref ...
>
> We may need them later though, if we ever go for 64 partitions per disk.
> So I left them in. Probably we should comment them out, so the compiler
> does not see them for now. Or stick them in the docu.

Please don't add #if 0'ed code.  Just leave it out and whoever needs it
can readd it.

> +#define SD_DISKS	32768	// anything between 256 and 262144

Even if C99 allows C++-style comments now, please don't use them in the
kernel.

Otherwise I'm okay with the patch, although I'm not too happy with having
the huge array, but I think we can declare that a "who complains may fix
it" issue.

* Re: lots and lots of disks again
From: Kurt Garloff @ 2004-02-10 16:08 UTC
To: Christoph Hellwig
Cc: Andrew Morton, linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley

Hi Christoph,

On Tue, Feb 10, 2004 at 03:52:03PM +0000, Christoph Hellwig wrote:
> On Tue, Feb 10, 2004 at 04:47:51PM +0100, Kurt Garloff wrote:
> Please don't add #if 0'ed code.  Just leave it out and whoever needs it
> can readd it.

So be it.

> > +#define SD_DISKS	32768	// anything between 256 and 262144
>
> Even if C99 allows C++-style comments now please don't use them in the kernel

Oh well, being conservative is fashionable these days?

> Else I'm okay with the patch although I'm not too happy with having the
> huge array, but I think we can declare that a "who complains may fix it"
> issue.

Find the patch attached.

Regards,
-- 
Kurt Garloff <garloff@suse.de> Cologne, DE
SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head)

[-- Attachment #1.2: scsi-many-26-3.diff --]

--- drivers/scsi/sd.c.orig	2004-01-09 07:59:49.000000000 +0100
+++ drivers/scsi/sd.c	2004-02-10 17:04:52.194843081 +0100
@@ -19,6 +19,9 @@
  *	   not being read in sd_open. Fix problem where removable media
  *	   could be ejected after sd_open.
  *	 - Douglas Gilbert <dgilbert@interlog.com> cleanup for lk 2.5.x
+ *	 - Badari Pulavarty <pbadari@us.ibm.com>, Matthew Wilcox
+ *	   <willy@debian.org>, Kurt Garloff <garloff@suse.de>:
+ *	   Support 256k disks (with potentially 64 partitions, TBD).
  *
  *	Logging policy (needs CONFIG_SCSI_LOGGING defined):
  *	 - setting up transfer: SCSI_LOG_HLQUEUE levels 1 and 2
@@ -61,7 +64,7 @@
  * Remaining dev_t-handling stuff
  */
 #define SD_MAJORS	16
-#define SD_DISKS	(SD_MAJORS << 4)
+#define SD_DISKS	32768	/* anything between 256 and 262144 */
 
 /*
  * Time out in seconds for disks and Magneto-opticals (which are slower).
@@ -121,6 +124,22 @@
 	.init_command		= sd_init_command,
 };
 
+/* Major / minor to disk mapping, from Matthew Wilcox, corrected
+ * (mail to linux-scsi@vger.kernel.org from 2003-10-16)
+ *
+ *     major      p2   disc2    disc  p1
+ * |............|..|..........|....|....| <- dev_t
+ *  31        20    17       8 7  4 3  0
+ *
+ * We allow 64 partitions per disk, by adding two more bits.
+ * Inside a major, we have 16k disks, however mapped non-
+ * contiguously. The first 16 disks are for major0, the next
+ * ones with major1, ... Disk 256 is for major0 again, disk 272
+ * for major1, ...
+ * We can't currently use the partitions beyond 16, as the
+ * genhd infrastructure expects contiguous minors.
+ */
+
 static int sd_major(int major_idx)
 {
 	switch (major_idx) {
@@ -136,6 +155,14 @@
 	}
 }
 
+static unsigned int make_sd_dev(unsigned int sd_nr, unsigned int part)
+{
+	return  (part & 0xf) | ((part & 0x30) << 14) | ((sd_nr & 0xf) << 4) |
+		(sd_major((sd_nr & 0xf0) >> 4) << 20) | (sd_nr & 0x3ff00);
+}
+
+/* reverse mapping dev -> (sd_nr, part) not currently needed */
+
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,kobj);
 
 static inline struct scsi_disk *scsi_disk(struct gendisk *disk)
@@ -1297,7 +1324,7 @@
 	struct scsi_disk *sdkp;
 	struct gendisk *gd;
 	u32 index;
-	int error;
+	int error, devno;
 
 	error = -ENODEV;
 	if ((sdp->type != TYPE_DISK) && (sdp->type != TYPE_MOD))
@@ -1315,6 +1342,12 @@
 	kobject_init(&sdkp->kobj);
 	sdkp->kobj.ktype = &scsi_disk_kobj_type;
 
+	/* Note: We can accomodate 64 partitions, but the genhd code
+	 * assumes partitions allocate consecutive minors, which they don't.
+	 * So for now stay with max 16 partitions and leave two spare bits.
+	 * Later, we may change the genhd code and the alloc_disk() call
+	 * and the ->minors assignment here. 	KG, 2004-02-10
+	 */
 	gd = alloc_disk(16);
 	if (!gd)
 		goto out_free;
@@ -1335,16 +1368,23 @@
 	sdkp->index = index;
 	sdkp->openers = 0;
 
-	gd->major = sd_major(index >> 4);
-	gd->first_minor = (index & 15) << 4;
+	devno = make_sd_dev(index, 0);
+	gd->major = MAJOR(devno);
+	gd->first_minor = MINOR(devno);
 	gd->minors = 16;
 	gd->fops = &sd_fops;
 
-	if (index >= 26) {
+	if (index < 26) {
+		sprintf(gd->disk_name, "sd%c", 'a' + index % 26);
+	} else if (index < (26*27)) {
 		sprintf(gd->disk_name, "sd%c%c",
-			'a' + index/26-1,'a' + index % 26);
+			'a' + index / 26 - 1,'a' + index % 26);
 	} else {
-		sprintf(gd->disk_name, "sd%c", 'a' + index % 26);
+		const unsigned int m1 = (index / 26 - 1) / 26 - 1;
+		const unsigned int m2 = (index / 26 - 1) % 26;
+		const unsigned int m3 = index % 26;
+		sprintf(gd->disk_name, "sd%c%c%c",
+			'a' + m1, 'a' + m2, 'a' + m3);
 	}
 
 	strcpy(gd->devfs_name, sdp->devfs_name);

* Re: lots and lots of disks again
From: Andries Brouwer @ 2004-02-10 20:10 UTC
To: Kurt Garloff, Christoph Hellwig, Andrew Morton, linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley

Kurt Garloff writes:

+/* Major / minor to disk mapping, from Matthew Wilcox, corrected
+ * (mail to linux-scsi@vger.kernel.org from 2003-10-16)
+ *
+ *     major      p2   disc2    disc  p1
+ * |............|..|..........|....|....| <- dev_t
+ *  31        20    17       8 7  4 3  0
+ *
+ * We allow 64 partitions per disk, by adding two more bits.
+ * Inside a major, we have 16k disks, however mapped non-
+ * contiguously. The first 16 disks are for major0, the next
+ * ones with major1, ... Disk 256 is for major0 again, disk 272
+ * for major1, ...
+ * We can't currently use the partitions beyond 16, as the
+ * genhd infrastructure expects contiguous minors.
+ */

It is true that the first step needed is some agreement on
how to assign device numbers to disks and their partitions.

The above suggestion is really ugly, however.
I can hardly imagine that we would like to start such crap
while still in the design phase.

Andries

* Re: lots and lots of disks again
From: Matthew Wilcox @ 2004-02-10 20:11 UTC
To: Andries Brouwer
Cc: Kurt Garloff, Christoph Hellwig, Andrew Morton, linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley

On Tue, Feb 10, 2004 at 09:10:02PM +0100, Andries Brouwer wrote:
> The above suggestion is really ugly, however.
> I can hardly imagine that we would like to start such crap
> while still in the design phase.

Your alternative suggestion is...?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince
himself that the war is just, and will thank God for the better sleep
he enjoys after this process of grotesque self-deception." -- Mark Twain

* Re: lots and lots of disks again
  2004-02-10 20:10 ` Andries Brouwer
  2004-02-10 20:11 ` Matthew Wilcox
@ 2004-02-10 20:58 ` Kurt Garloff
  2004-02-10 21:21 ` viro
  1 sibling, 1 reply; 43+ messages in thread
From: Kurt Garloff @ 2004-02-10 20:58 UTC (permalink / raw)
  To: Andries Brouwer
  Cc: Christoph Hellwig, Andrew Morton, linux-scsi, Badari Pulavarty,
      Matthew Wilcox, James Bottomley

[-- Attachment #1: Type: text/plain, Size: 1245 bytes --]

Hi Andries,

On Tue, Feb 10, 2004 at 09:10:02PM +0100, Andries Brouwer wrote:
> + *     major     p2   disc2    disc  p1
> + * |............|..|..........|....|....| <- dev_t
> + *  31        20    17       8 7  4 3  0
[...]
>
> It is true that the first step needed is some agreement on
> how to assign device numbers to disks and their partitions.
>
> The above suggestion is really ugly, however.

Nice! Ask two people and you'll get three opinions.

> I can hardly imagine that we would like to start such crap
> while still in the design phase.

We maintain backwards compatibility.
We need some bit shuffling and the rest is straightforward.
As we are pretty sure not to need more than 256k disks, we reserve
two extra bits for the partition (though we can't currently use them).

The only alternative that I see is leaving the old majors alone and
starting a new scheme with new majors. Not really nice either; if you
think about supporting > 16 partitions, it's even very inconsistent.

I don't see what's crappy with the suggestion. Tell us!

Regards,
--
Kurt Garloff <garloff@suse.de>                       Cologne, DE
SUSE LINUX AG, Nuernberg, DE                    SUSE Labs (Head)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-10 20:58 ` Kurt Garloff
@ 2004-02-10 21:21 ` viro
  2004-02-10 21:34 ` Kurt Garloff
  0 siblings, 1 reply; 43+ messages in thread
From: viro @ 2004-02-10 21:21 UTC (permalink / raw)
  To: Kurt Garloff, Andries Brouwer, Christoph Hellwig, Andrew Morton,
      linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley

On Tue, Feb 10, 2004 at 09:58:33PM +0100, Kurt Garloff wrote:
> Hi Andries,
>
> On Tue, Feb 10, 2004 at 09:10:02PM +0100, Andries Brouwer wrote:
> > + *     major     p2   disc2    disc  p1
> > + * |............|..|..........|....|....| <- dev_t
> > + *  31        20    17       8 7  4 3  0
> [...]
> >
> > It is true that the first step needed is some agreement on
> > how to assign device numbers to disks and their partitions.
> >
> > The above suggestion is really ugly, however.

Uh-oh... Rare event: I agree with aeb.

>
> We maintain backwards compatibility.
> We need some bit shuffling and the rest is straightforward.

Bit shuffling where? Since partitioning does *not* reach the driver (nor
should it), I really wonder where exactly do you expect that turd to land.

> I don't see what's crappy with the suggestion. Tell us!

Putting partitioning crap back into driver.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again 2004-02-10 21:21 ` viro @ 2004-02-10 21:34 ` Kurt Garloff 2004-02-10 21:42 ` viro 0 siblings, 1 reply; 43+ messages in thread From: Kurt Garloff @ 2004-02-10 21:34 UTC (permalink / raw) To: viro Cc: Andries Brouwer, Christoph Hellwig, Andrew Morton, linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley [-- Attachment #1: Type: text/plain, Size: 1035 bytes --] Hi, On Tue, Feb 10, 2004 at 09:21:13PM +0000, viro@parcelfarce.linux.theplanet.co.uk wrote: > On Tue, Feb 10, 2004 at 09:58:33PM +0100, Kurt Garloff wrote: > > We maintain backwards compatibility. > > We need some bit shuffling and the rest is straightforward. > > Bit shuffling where? Since partitioning does *not* reach the driver (nor > should it), I really wonder where exactly do you expect that turd to land. We always needed to tell the gendisk code what part of the minor refers to disks and which one to partitions. The knowledge is there, like it or not. > > I don't see what's crappy with the suggestion. Tell us! > > Putting partitioning crap back into driver. Oh well, are you happier if we declare the two bits reserved instead of having the hope of using them to extend the number of partitions per disk to 64 per disk some day? Regards, -- Kurt Garloff <garloff@suse.de> Cologne, DE SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head) [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again 2004-02-10 21:34 ` Kurt Garloff @ 2004-02-10 21:42 ` viro 2004-02-10 22:28 ` Kurt Garloff 0 siblings, 1 reply; 43+ messages in thread From: viro @ 2004-02-10 21:42 UTC (permalink / raw) To: Kurt Garloff, Andries Brouwer, Christoph Hellwig, Andrew Morton, linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley On Tue, Feb 10, 2004 at 10:34:38PM +0100, Kurt Garloff wrote: > > Bit shuffling where? Since partitioning does *not* reach the driver (nor > > should it), I really wonder where exactly do you expect that turd to land. > > We always needed to tell the gendisk code what part of the minor refers > to disks and which one to partitions. The knowledge is there, like it or > not. Hell, no. We tell the upper layers that gendisk covers device numbers from dev to dev + N - 1. That has nothing to do with bitmasks, majors or minors. And _that_ is generic enough to be handled by upper layers. Driver doesn't care about any device numbers past that point. At all. It gets pointers to gendisk and that's it. > > > I don't see what's crappy with the suggestion. Tell us! > > > > Putting partitioning crap back into driver. > > Oh well, are you happier if we declare the two bits reserved instead > of having the hope of using them to extend the number of partitions per > disk to 64 per disk some day? No, I'm not. ^ permalink raw reply [flat|nested] 43+ messages in thread
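The point that a gendisk simply covers device numbers dev to dev + N - 1 can be illustrated with a small sketch. The struct below is a hypothetical stand-in for the relevant part of gendisk, not the kernel's actual definition, and the helper names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the part of a gendisk that matters here. */
struct disk_range {
	uint32_t first_dev; /* device number of the whole disk */
	uint32_t nr_devs;   /* N: whole disk plus its partitions */
};

/* Device number of partition partno (0 = whole disk); false if out of range. */
static bool part_to_dev(const struct disk_range *d, uint32_t partno, uint32_t *dev)
{
	if (partno >= d->nr_devs)
		return false;
	*dev = d->first_dev + partno; /* plain addition, no masks or shifts */
	return true;
}

/* Does dev belong to this disk? Again only a range check. */
static bool dev_in_disk(const struct disk_range *d, uint32_t dev)
{
	/* unsigned subtraction wraps for dev < first_dev, so the test fails */
	return dev - d->first_dev < d->nr_devs;
}
```

Nothing in this sketch needs to know where a "major" ends or which bits are "partition bits", which is the property a sparse partition field would break.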
* Re: lots and lots of disks again 2004-02-10 21:42 ` viro @ 2004-02-10 22:28 ` Kurt Garloff 0 siblings, 0 replies; 43+ messages in thread From: Kurt Garloff @ 2004-02-10 22:28 UTC (permalink / raw) To: viro Cc: Andries Brouwer, Christoph Hellwig, Andrew Morton, linux-scsi, Badari Pulavarty, Matthew Wilcox, James Bottomley [-- Attachment #1: Type: text/plain, Size: 556 bytes --] On Tue, Feb 10, 2004 at 09:42:33PM +0000, viro@parcelfarce.linux.theplanet.co.uk wrote: > > Oh well, are you happier if we declare the two bits reserved instead > > of having the hope of using them to extend the number of partitions per > > disk to 64 per disk some day? > > No, I'm not. So where's your suggestion? -- Kurt Garloff <kurt@garloff.de> [Koeln, DE] Physics:Plasma modeling <garloff@plasimo.phys.tue.nl> [TU Eindhoven, NL] Linux: SUSE Labs (Head) <garloff@suse.de> [SUSE Nuernberg, DE] [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again 2004-02-10 15:47 ` Kurt Garloff 2004-02-10 15:52 ` Christoph Hellwig @ 2004-02-10 18:26 ` Andrew Morton 2004-02-11 14:56 ` Kurt Garloff 1 sibling, 1 reply; 43+ messages in thread From: Andrew Morton @ 2004-02-10 18:26 UTC (permalink / raw) To: Kurt Garloff; +Cc: hch, linux-scsi, pbadari, willy, James.Bottomley Kurt Garloff <garloff@suse.de> wrote: > > Since when is 4k huge? > Sorry, I don't see any benefit in adding complexity to avoid a 4k array. Well yes, but we do have lib/idr.c which I believe does what you need. It is simple to use. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again 2004-02-10 18:26 ` Andrew Morton @ 2004-02-11 14:56 ` Kurt Garloff 2004-02-11 21:28 ` Andrew Morton 0 siblings, 1 reply; 43+ messages in thread From: Kurt Garloff @ 2004-02-11 14:56 UTC (permalink / raw) To: Andrew Morton; +Cc: hch, linux-scsi, pbadari, willy, James.Bottomley [-- Attachment #1: Type: text/plain, Size: 754 bytes --] Hi Andrew, On Tue, Feb 10, 2004 at 10:26:03AM -0800, Andrew Morton wrote: > Kurt Garloff <garloff@suse.de> wrote: > > > > Since when is 4k huge? > > Sorry, I don't see any benefit in adding complexity to avoid a 4k array. > > Well yes, but we do have lib/idr.c which I believe does what you need. It > is simple to use. It can store a pointer per ID. One that I don't need, unless I overlook some possibility to put an already existing pointer in there. If we really allocate thousands of disks, the overhead of this solution will be higher than the bitmap, I'm afraid. Regards, -- Kurt Garloff <garloff@suse.de> Cologne, DE SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head) [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-11 14:56 ` Kurt Garloff
@ 2004-02-11 21:28 ` Andrew Morton
  2004-02-11 22:09 ` Kurt Garloff
  0 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2004-02-11 21:28 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: hch, linux-scsi, pbadari, willy, James.Bottomley

Kurt Garloff <garloff@suse.de> wrote:
>
> Hi Andrew,
>
> On Tue, Feb 10, 2004 at 10:26:03AM -0800, Andrew Morton wrote:
> > Kurt Garloff <garloff@suse.de> wrote:
> > >
> > > Since when is 4k huge?
> > > Sorry, I don't see any benefit in adding complexity to avoid a 4k array.
> >
> > Well yes, but we do have lib/idr.c which I believe does what you need. It
> > is simple to use.
>
> It can store a pointer per ID. One that I don't need, unless I overlook
> some possibility to put an already existing pointer in there.
>
> If we really allocate thousands of disks, the overhead of this solution
> will be higher than the bitmap, I'm afraid.

Four (or eight) bytes per disk! I perceive a lack of perspective here ;)

I'd trade clarity of implementation for that.

(As well as, possibly, reduced memory use on low-end machines).

^ permalink raw reply [flat|nested] 43+ messages in thread
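For readers weighing the two approaches, here is a minimal userland sketch of the static-bitmap side of the argument. It is not the patch's code and does not use the kernel's find_first_zero_bit(); the names index_alloc() and index_free() are hypothetical. A 4 KiB bitmap holds 32768 bits, one per disk index, whereas an allocator that stores a pointer per ID spends 32 or 64 bits per allocated ID but only for IDs actually in use.

```c
#include <limits.h>
#include <stddef.h>

#define MAX_DISKS     32768 /* 4 KiB of bitmap: one bit per disk index */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS        (MAX_DISKS / BITS_PER_WORD)

static unsigned long used[NWORDS]; /* statically allocated, always 4 KiB */

/* Claim the lowest free index, or return -1 if all are taken. */
static int index_alloc(void)
{
	for (size_t w = 0; w < NWORDS; w++) {
		if (used[w] == ~0UL)
			continue; /* this word is full */
		for (size_t b = 0; b < BITS_PER_WORD; b++) {
			if (!(used[w] & (1UL << b))) {
				used[w] |= 1UL << b;
				return (int)(w * BITS_PER_WORD + b);
			}
		}
	}
	return -1;
}

/* Release an index previously returned by index_alloc(). */
static void index_free(int idx)
{
	used[idx / BITS_PER_WORD] &= ~(1UL << (idx % BITS_PER_WORD));
}
```

The trade-off being argued is visible here: the bitmap costs a fixed 4 KiB even on a machine with one disk, while the per-ID cost of a dynamic allocator only exceeds that once thousands of disks are attached.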
* Re: lots and lots of disks again
  2004-02-11 21:28 ` Andrew Morton
@ 2004-02-11 22:09 ` Kurt Garloff
  2004-02-11 22:29 ` Andrew Morton
  0 siblings, 1 reply; 43+ messages in thread
From: Kurt Garloff @ 2004-02-11 22:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hch, linux-scsi, pbadari, willy, James.Bottomley

[-- Attachment #1: Type: text/plain, Size: 3350 bytes --]

Hi Andrew,

On Wed, Feb 11, 2004 at 01:28:48PM -0800, Andrew Morton wrote:
> Kurt Garloff <garloff@suse.de> wrote:
> > If we really allocate thousands of disks, the overhead of this solution
> > will be higher than the bitmap, I'm afraid.
>
> Four (or eight) bytes per disk! I perceive a lack of perspective here ;)

We're fighting about a static 4k array currently ;-)

Well, it's certainly true that the memory we allocate in gendisk per
disk is higher than these wasted 4/8 bytes. It's just that I don't like
wasting anything ... The waste is 32/64 times higher than with our
bitfield, except that it's dynamic.

> I'd trade clarity of implementation for that.

You seem to have used idr often before. It took me much longer to read
the documentation and look at the function declarations than to
understand the array and the meaning of find_first_zero_bit().

> (As well as, possibly, reduced memory use on low-end machines).

This one is true.

If you think this is a relevant issue, tell me. I'll convert to use
idr then. Currently, I'm not sure whether the patch has any chance
to be merged, given the opposition of some people.

Actually, I'm not sure we currently have the right discussion here.

We want to have > 256 disk support, and we need to agree on how we
want to present it to the user. Changing some implementation behind
this is not really painful to do afterwards. Changing the way we
interpret majors/minors would be painful. Thus the suggestion of
Matthew which I liked and adopted. No need for new majors, no
changes to the well known numbers.

Of course there's the long term perspective of having a "disk" major
and udev taking care of everything. This will happen, but nobody
so far has said that this should be done within 2.6. Nor do I see
udev replacing the classical device nodes completely within that
timeframe.

If we want something useful now, we need to keep the old disks
major/minor scheme untouched.
I see two possibilities to get something done:
* Matthew's proposal (whatever we use the two extra "partition"
  bits for in the end)
* Introducing new majors, where we introduce a new numbering scheme.
  One major can accommodate 65536 disks à 16 partitions or 16384 disks
  with 64 partitions. We'd allocate one new major and sort disks
  after 256 there. The 64 partitions is actually out of the race,
  as the number of possible partitions would depend on the order
  that scsi disks are detected :-/

I think adding some bit shifting in the kernel is preferable to
breaking user interfaces. The effort to fix all the apps is much
higher. We should be aware that some bit shifting in the kernel
is really not the most complex part of this picture.

I would appreciate if those disliking Matthew's maj/min layout would
come up with a proposal and tell what they want. Or maybe admit that
they actually don't care?

I'm definitely not religious about how to solve this. But I would
like to have _a_ solution and I would like this solution not to
change anything for already well-known devices.

Once we have that, we can struggle about implementation details that
are easy to change anyway.

Regards,
--
Kurt Garloff <garloff@suse.de>                       Cologne, DE
SUSE LINUX AG, Nuernberg, DE                    SUSE Labs (Head)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-11 22:09 ` Kurt Garloff
@ 2004-02-11 22:29 ` Andrew Morton
  2004-02-11 22:53 ` viro
  0 siblings, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2004-02-11 22:29 UTC (permalink / raw)
  To: Kurt Garloff
  Cc: hch, linux-scsi, pbadari, willy, James.Bottomley,
      viro@parcelfarce.linux.theplanet.co.uk

Kurt Garloff <garloff@suse.de> wrote:
>
> We're fighting about a static 4k array currently ;-)

yup. Sorry I was distracting.

> Actually, I'm not sure we currently have the right discussion here.

Absolutely.

I was rather hoping that Al would follow up on his comments yesterday
actually.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-11 22:29 ` Andrew Morton
@ 2004-02-11 22:53 ` viro
  2004-02-12 15:00 ` Kurt Garloff
  0 siblings, 1 reply; 43+ messages in thread
From: viro @ 2004-02-11 22:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kurt Garloff, hch, linux-scsi, pbadari, willy, James.Bottomley

On Wed, Feb 11, 2004 at 02:29:33PM -0800, Andrew Morton wrote:
> Kurt Garloff <garloff@suse.de> wrote:
> >
> > We're fighting about a static 4k array currently ;-)
>
> yup. Sorry I was distracting.
>
> > Actually, I'm not sure we currently have the right discussion here.
>
> Absolutely.
>
> I was rather hoping that Al would follow up on his comments yesterday
> actually.

OK. Proposed "sparse" partition numbers are Bad Idea(tm), for the same
reasons why 255.255.252.240 is not a good netmask for a subnet in class B.
It might be possible to handle if we do a single ->probe() for entire
major and put the smarts into it (about the only way to do that, actually),
but we'll have a hell of a time hunting down all places assuming that
device number == first device number + partition number. Note that we
are talking about late-boot code, fs/partitions/*, anything that calls
bdget_disk(), etc. _And_ unknown amount of userland code. And code that
exports information to userland (e.g. /proc/partitions or sysfs per-disk
subtrees).

Unless somebody has a good idea of the modifications involved and is willing
to describe them (_not_ on usual aeb level of handwaving, please), I consider
that idea as non-feasible.

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-11 22:53 ` viro
@ 2004-02-12 15:00 ` Kurt Garloff
  2004-02-12 15:20 ` James Bottomley
  ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Kurt Garloff @ 2004-02-12 15:00 UTC (permalink / raw)
  To: viro; +Cc: Andrew Morton, hch, linux-scsi, pbadari, willy, James.Bottomley

[-- Attachment #1: Type: text/plain, Size: 2190 bytes --]

Hi Al Viro,

On Wed, Feb 11, 2004 at 10:53:48PM +0000, viro@parcelfarce.linux.theplanet.co.uk wrote:
> On Wed, Feb 11, 2004 at 02:29:33PM -0800, Andrew Morton wrote:
> > I was rather hoping that Al would follow up on his comments yesterday
> > actually.
>
> OK. Proposed "sparse" partition numbers are Bad Idea(tm), for the same
> reasons why 255.255.252.240 is not a good netmask for a subnet in class B.
> It might be possible to handle if we do a single ->probe() for entire
> major and put the smarts into it (about the only way to do that, actually),
> but we'll have a hell of a time hunting down all places assuming that
> device number == first device number + partition number. Note that we
> are talking about late-boot code, fs/partitions/*, anything that calls
> bdget_disk(), etc. _And_ unknown amount of userland code. And code that
> exports information to userland (e.g. /proc/partitions or sysfs per-disk
> subtrees).

OK, so let's scratch that from the proposal. It means we won't get
64 partitions, but I don't think this is an important feature for many
users. And it looks like we can't do it in a sane way without breaking
the old numbers anyway. And the breakage should be deferred to the
/dev/disk change in 2.7, IMHO.

Let's just use
 31     20 19       4 3    0
|--major--|---disk---|-part-|

Any problems with that?

I would have reserved 2 bits (19,18) in case somebody finds a way to
implement it in a sane way, but probably you're right and it's not
doable in a sane way. As long as we don't bump up the limit to more
than 256k disks, we do not run into conflicts anyway.

Expect a patch later.

> Unless somebody has a good idea of the modifications involved and is willing
> to describe them (_not_ on usual aeb level of handwaving, please), I consider
> that idea as non-feasible.

One could add a hook to gendisk, and do some translation from
user-visible device numbers to kernel device numbers. Thus only
causing hacks at one place ...

Regards,
--
Kurt Garloff <garloff@suse.de>                       Cologne, DE
SUSE LINUX AG, Nuernberg, DE                    SUSE Labs (Head)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
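The simplified field split in that diagram (major in bits 31..20, disk in bits 19..4, partition in bits 3..0) could be expressed as below. This is only a sketch of the bit layout, with hypothetical helper names; it says nothing about how disk indices are distributed across the existing sd majors.

```c
#include <stdint.h>

#define MINORBITS 20 /* minor occupies bits 19..0 */
#define PART_BITS 4  /* partition occupies bits 3..0: 16 minors per disk */

/* Pack (major, disk field, partition) into a 32-bit device number. */
static uint32_t sd_mkdev(unsigned int major, unsigned int disk, unsigned int part)
{
	return ((uint32_t)major << MINORBITS)
	     | ((uint32_t)disk << PART_BITS)
	     | part;
}

/* Extract the individual fields again. */
static unsigned int sd_dev_major(uint32_t dev)
{
	return dev >> MINORBITS;
}

static unsigned int sd_dev_disk(uint32_t dev)
{
	return (dev & ((1u << MINORBITS) - 1)) >> PART_BITS;
}

static unsigned int sd_dev_part(uint32_t dev)
{
	return dev & ((1u << PART_BITS) - 1);
}
```

Because the partition field is contiguous at the bottom, the device number of partition p is still the whole-disk number plus p, which is the invariant the sparse layout would have broken.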
* Re: lots and lots of disks again 2004-02-12 15:00 ` Kurt Garloff @ 2004-02-12 15:20 ` James Bottomley 2004-02-12 15:57 ` viro 2004-02-13 0:05 ` Kurt Garloff 2 siblings, 0 replies; 43+ messages in thread From: James Bottomley @ 2004-02-12 15:20 UTC (permalink / raw) To: Kurt Garloff; +Cc: viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy As I understand it, we can do aliasing in gendisk (two different major/minor sets for the same disc), could we also do this if we have different numbers of partitions on them? What I was thinking is just keep the partition mask for our current majors and allocate a new major for the large disk configurations with a different partition mask. As long as we can alias our old scheme to the new one (and obviously everyone using the old scheme may have inaccessible partitions if the disc were partitioned by the new scheme), it would seem to provide backwards compatibility required by 2.6 and a way of moving to 64 (or however many partitions we want) if the user desires. As long as this won't induce any nasty complications, I think it's workable and we can simply drop all the old SCSI majors in 2.7 James ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-12 15:00 ` Kurt Garloff
  2004-02-12 15:20 ` James Bottomley
@ 2004-02-12 15:57 ` viro
  2004-02-12 16:18 ` Kurt Garloff
  2004-02-13 0:05 ` Kurt Garloff
  2 siblings, 1 reply; 43+ messages in thread
From: viro @ 2004-02-12 15:57 UTC (permalink / raw)
  To: Kurt Garloff, Andrew Morton, hch, linux-scsi, pbadari, willy,
      James.Bottomley

On Thu, Feb 12, 2004 at 04:00:35PM +0100, Kurt Garloff wrote:
> Let's just use
>  31     20 19       4 3    0
> |--major--|---disk---|-part-|
>
> Any problems with that?

What the hell is "major"? Device numbers space is flat - you don't even
need any sort of alignment for allocated chunks...

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-12 15:57 ` viro
@ 2004-02-12 16:18 ` Kurt Garloff
  2004-02-12 16:43 ` James Bottomley
  0 siblings, 1 reply; 43+ messages in thread
From: Kurt Garloff @ 2004-02-12 16:18 UTC (permalink / raw)
  To: viro; +Cc: Andrew Morton, hch, linux-scsi, pbadari, willy, James.Bottomley

[-- Attachment #1: Type: text/plain, Size: 1326 bytes --]

On Thu, Feb 12, 2004 at 03:57:21PM +0000, viro@parcelfarce.linux.theplanet.co.uk wrote:
> On Thu, Feb 12, 2004 at 04:00:35PM +0100, Kurt Garloff wrote:
> > Let's just use
> >  31     20 19       4 3    0
> > |--major--|---disk---|-part-|
> >
> > Any problems with that?
>
> What the hell is "major"?

A number out of 8, 65 -- 71, 128 -- 135.
Together with the disk part of the minor number, it determines the disk
we talk to.

The other way round actually.
When detecting disks, we assign the device numbers according to a scheme

0x00800000, 0x00800010, ... 0x008000f0, 0x04100000, ...
    (0)         (1)            (15)        (16)
0x00800100, ... 0x00800200
   (256)           (512)

The numbers in parens denote the detection order of the SCSI disks.

And we fill the number into the gendisk structure. But you know this.

> Device numbers space is flat - you don't even
> need any sort of alignment for allocated chunks...

... and I suggest to use it the way outlined above.

Unless we go for the gendisk aliasing suggestion from James and allocate
a new major additionally. I did not yet check whether it's feasible.

Regards,
--
Kurt Garloff <garloff@suse.de>                       Cologne, DE
SUSE LINUX AG, Nuernberg, DE                    SUSE Labs (Head)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again 2004-02-12 16:18 ` Kurt Garloff @ 2004-02-12 16:43 ` James Bottomley 2004-02-16 12:40 ` Kurt Garloff 0 siblings, 1 reply; 43+ messages in thread From: James Bottomley @ 2004-02-12 16:43 UTC (permalink / raw) To: Kurt Garloff; +Cc: viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy On Thu, 2004-02-12 at 11:18, Kurt Garloff wrote: > Unless we go for the gendisk aliasing suggestion from James and allocate > a new major additionally. I did not yet check whether it's feasible. Well, I've had my ear bent about gendisk aliasing not being feasible, so to use the new major scheme, we'd have to have something like a boot time switch in the kernel and only register either the old majors or the new major. James ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-12 16:43 ` James Bottomley
@ 2004-02-16 12:40 ` Kurt Garloff
  2004-02-16 22:57 ` Andries Brouwer
  0 siblings, 1 reply; 43+ messages in thread
From: Kurt Garloff @ 2004-02-16 12:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

[-- Attachment #1: Type: text/plain, Size: 1364 bytes --]

Hi James,

On Thu, Feb 12, 2004 at 11:43:03AM -0500, James Bottomley wrote:
> Well, I've had my ear bent about gendisk aliasing not being feasible, so
> to use the new major scheme, we'd have to have something like a boot
> time switch in the kernel and only register either the old majors or the
> new major.

This could be done. So we register one new major, with the 6bit
partitions, allowing for 1<<14 disks with 1<<6 partitions each with
one major.

It looks like in 2.7, we'll have /dev/disk anyway, it would fit in
there perfectly. By then, the world has hopefully moved over to
udev, so we don't need to maintain the compatibility with the old
device node numbering any more.

If we wait with this until /dev/disk is there, we can avoid
registering a major for SCSI now and avoid the boot parameter.

From a distributor's perspective, I currently see demand to have many
(>1000) SCSI disks supported (*), but no serious demand for > 16 partitions,
so we would not have trouble waiting with this until 2.7/2.8.

(*) This includes needs that arise because of multipathing, most people
don't have 1000 really different LUNs exported by their SAN RAID system.

Regards,
--
Kurt Garloff <garloff@suse.de>                       Cologne, DE
SUSE LINUX AG, Nuernberg, DE                    SUSE Labs (Head)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again 2004-02-16 12:40 ` Kurt Garloff @ 2004-02-16 22:57 ` Andries Brouwer 2004-02-17 0:56 ` James Bottomley 0 siblings, 1 reply; 43+ messages in thread From: Andries Brouwer @ 2004-02-16 22:57 UTC (permalink / raw) To: Kurt Garloff, James Bottomley, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy > On Thu, Feb 12, 2004 at 11:43:03AM -0500, James Bottomley wrote: > > Well, I've had my ear bent about gendisk aliasing not being feasible, so > > to use the new major scheme, we'd have to have something like a boot > > time switch in the kernel and only register either the old majors or the > > new major. I am not quite sure what you mean. In my opinion we must continue supporting the old 16-bit space for a long time to come. But can of course do what we want in the new space. It is not either or, but both. Old devices must not change number. On Mon, Feb 16, 2004 at 01:40:47PM +0100, Kurt Garloff wrote: > This could be done. So we register one new major, with the 6bit > partitions, allowing for 1<<14 disks with 1<<6 partitions each with > one major. > > From a distributor's perspective, I currently see demand to have many > (>1000) SCSI disks supported (*), but no serious demand for > 16 partitions, > so we would not have trouble to wait with this until 2.7/2.8. True. As fdisk maintainer I see a few complaints a year about the upper bound 64. (But I like having much more space than is needed today.) But let me stress again that "one major" hardly has any meaning. You register a structureless interval of device numbers. Andries ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again 2004-02-16 22:57 ` Andries Brouwer @ 2004-02-17 0:56 ` James Bottomley 2004-02-17 7:57 ` Kurt Garloff 2004-02-17 14:49 ` Andries Brouwer 0 siblings, 2 replies; 43+ messages in thread From: James Bottomley @ 2004-02-17 0:56 UTC (permalink / raw) To: Andries Brouwer Cc: Kurt Garloff, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy On Mon, 2004-02-16 at 17:57, Andries Brouwer wrote: > I am not quite sure what you mean. > In my opinion we must continue supporting the old 16-bit space > for a long time to come. But can of course do what we want in > the new space. The problem is that we cannot have /dev/sda1 at both (8,1) and (1049, 1) or whatever new major is chosen. We also need to flip the switch entirely one way or the other for increased partition numbers since a mixed scheme would be asking for an admin nightmare. Thus it makes sense to have a boot time switch for this, for 2.6 it would be an "opt in" to the new major, for 2.7 it would be an "opt out". I don't see a reason to keep the old majors hanging around after 2.7 if this is the chosen solution ... any user application relying on hard coded device numbers is by definition broken. > It is not either or, but both. > Old devices must not change number. I don't understand this. The user does not know devices by number but by device node name (which is chosen by the OS vendor who populated /dev/ now, or by udev in future). I agree the node name should not change, but don't see any reason why the major/minor might not. James ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-17 0:56 ` James Bottomley
@ 2004-02-17 7:57 ` Kurt Garloff
  2004-02-17 15:08 ` James Bottomley
  2004-02-17 15:28 ` Matthew Wilcox
  2004-02-17 14:49 ` Andries Brouwer
  1 sibling, 2 replies; 43+ messages in thread
From: Kurt Garloff @ 2004-02-17 7:57 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andries Brouwer, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

[-- Attachment #1: Type: text/plain, Size: 1518 bytes --]

Hi James,

On Mon, Feb 16, 2004 at 07:56:10PM -0500, James Bottomley wrote:
> Thus it makes sense to have a boot time switch for this, for 2.6 it
> would be an "opt in" to the new major, for 2.7 it would be an "opt out".

OK, if you take care to register some major number (or some space for
dev_t in newspeak), I'll write the patch to allow the SCSI disks to
be registered there. We need about 20 bits, IMHO.

> I don't see a reason to keep the old majors hanging around after 2.7 if
> this is the chosen solution ... any user application relying on hard
> coded device numbers is by definition broken.

Agreed.

> On Mon, 2004-02-16 at 17:57, Andries Brouwer wrote:
> > It is not either or, but both.
> > Old devices must not change number.
>
> I don't understand this. The user does not know devices by number but
> by device node name (which is chosen by the OS vendor who populated
> /dev/ now, or by udev in future). I agree the node name should not
> change, but don't see any reason why the major/minor might not.

You underestimate the amount of cruft code out there that does care.
Unfortunately. Some programs (think scsiformat, ...) do want to know
what transport the underlying hardware actually uses. And determine
this by looking at the device number.
I expect it will take some time to sort this all out.

Regards,
--
Kurt Garloff <garloff@suse.de> Cologne, DE
SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-17 7:57 ` Kurt Garloff
@ 2004-02-17 15:08 ` James Bottomley
  2004-02-17 15:28 ` Matthew Wilcox
  1 sibling, 0 replies; 43+ messages in thread
From: James Bottomley @ 2004-02-17 15:08 UTC (permalink / raw)
  To: Kurt Garloff
  Cc: Andries Brouwer, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

On Tue, 2004-02-17 at 02:57, Kurt Garloff wrote:
> You underestimate the amount of cruft code out there that does care.
> Unfortunately. Some programs (think scsiformat, ...) do want to know
> what transport the underlying hardware actually uses. And determine
> this by looking at the device number.
> I expect it will take some time to sort this all out.

Well, yes, that's why the opt in for 2.6, opt out for 2.7 expect all
apps to be fixed by 2.8 roadmap...

James

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-17 7:57 ` Kurt Garloff
  2004-02-17 15:08 ` James Bottomley
@ 2004-02-17 15:28 ` Matthew Wilcox
  1 sibling, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2004-02-17 15:28 UTC (permalink / raw)
  To: Kurt Garloff, James Bottomley, Andries Brouwer, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

On Tue, Feb 17, 2004 at 08:57:24AM +0100, Kurt Garloff wrote:
> OK, if you take care to register some major number (or some space for
> dev_t in newspeak), I'll write the patch to allow the SCSI disks to
> be registered there. We need about 20 bits, IMHO.

20 bits might not be quite big enough. Assuming 6 bits for partitions,
that gives us 14 bits (16384 devices). I'd like a fibrechannel person
to chime in here and tell us about the biggest FC array they've ever
seen, but given multipathing showing us the same device multiple times,
that seems a little tight (given how much pain it is to move to a new
numbering scheme).

If we're talking about the numberspace for /dev/drive, I don't see why we
shouldn't take a quarter of the space; ie 2^30 bits. I suggest we take
64.0.0.0/30 to leave the top half unused for now.

--
"Next the statesmen will invent cheap lies, putting the blame upon
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince
himself that the war is just, and will thank God for the better sleep
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply [flat|nested] 43+ messages in thread
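The bit budget in the message above can be checked with a short C sketch. It assumes a 32-bit dev_t; the helper names `disks_for_bits` and `devnums_in_share` are hypothetical and do not appear anywhere in the thread. With 20 bits in total and 6 of them spent on partitions, 14 bits remain, which is 16384 disks; a quarter of a 32-bit number space is 2^30 device numbers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of whole disks that can be addressed when `total_bits` of
 * device-number space are split so that the low `part_bits` select
 * the partition. (Hypothetical helper, for illustration only.)
 */
static uint64_t disks_for_bits(unsigned int total_bits, unsigned int part_bits)
{
	assert(part_bits <= total_bits && total_bits < 64);
	return UINT64_C(1) << (total_bits - part_bits);
}

/*
 * Device numbers contained in a 1/2^k share of a 32-bit dev_t space,
 * e.g. k = 2 for "a quarter of the space". (Assumes a 32-bit dev_t.)
 */
static uint64_t devnums_in_share(unsigned int k)
{
	assert(k <= 32);
	return UINT64_C(1) << (32 - k);
}
```

Calling `disks_for_bits(20, 6)` gives the 16384 figure quoted in the message, and `devnums_in_share(2)` gives the size of the quarter that is proposed.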
* Re: lots and lots of disks again
  2004-02-17 0:56 ` James Bottomley
  2004-02-17 7:57 ` Kurt Garloff
@ 2004-02-17 14:49 ` Andries Brouwer
  2004-02-17 15:18 ` James Bottomley
  1 sibling, 1 reply; 43+ messages in thread
From: Andries Brouwer @ 2004-02-17 14:49 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andries Brouwer, Kurt Garloff, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

On Mon, Feb 16, 2004 at 07:56:10PM -0500, James Bottomley wrote:
> On Mon, 2004-02-16 at 17:57, Andries Brouwer wrote:
> > I am not quite sure what you mean.
> > In my opinion we must continue supporting the old 16-bit space
> > for a long time to come. But can of course do what we want in
> > the new space.
>
> The problem is that we cannot have /dev/sda1 at both (8,1) and (1049, 1)
> or whatever new major is chosen.

True. In my opinion it must stay at (8,1).

> We also need to flip the switch entirely one way or the other for
> increased partition numbers since a mixed scheme would be asking for an
> admin nightmare.

Why would that be?

Suppose /dev/sdAA is (1049,0) and has 256 partitions.
And /dev/sda is (8,0) and has 16 partitions.
Where is the nightmare?

> > It is not either or, but both.
> > Old devices must not change number.
>
> I don't understand this. The user does not know devices by number but
> by device node name (which is chosen by the OS vendor who populated
> /dev/ now, or by udev in future). I agree the node name should not
> change, but don't see any reason why the major/minor might not.

That is (i) because people do not change device nodes when they
boot a new kernel, and (ii) because the knowledge is built-in
into a nontrivial number of programs.

Andries

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-17 14:49 ` Andries Brouwer
@ 2004-02-17 15:18 ` James Bottomley
  2004-02-17 15:27 ` Kurt Garloff
  2004-02-17 15:50 ` Andries Brouwer
  0 siblings, 2 replies; 43+ messages in thread
From: James Bottomley @ 2004-02-17 15:18 UTC (permalink / raw)
  To: Andries Brouwer
  Cc: Kurt Garloff, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

On Tue, 2004-02-17 at 09:49, Andries Brouwer wrote:
> On Mon, Feb 16, 2004 at 07:56:10PM -0500, James Bottomley wrote:
> > We also need to flip the switch entirely one way or the other for
> > increased partition numbers since a mixed scheme would be asking for an
> > admin nightmare.
>
> Why would that be?
>
> Suppose /dev/sdAA is (1049,0) and has 256 partitions.
> And /dev/sda is (8,0) and has 16 partitions.
> Where is the nightmare?

You're kidding, right?

Think of the code paths: at the moment we tell the gendisk the number of
minors we need when we register it. It then probes the partition tables
and fills in the values. If we have a large partition major, we need to
know *before* we call add_disk(). The only thing that determines this
is the on disc partitioning scheme, so now you need to know the
partition type before you register the gendisk. This type of layering
violation is a sure sign of a bad design.

Even assuming we can come up with a clean coding solution that doesn't
cause everyone to blow chunks when reading it, think what you've done to
the administrator of the system: Accidentally repartition a drive with a
big partition table and it migrates majors and device names. Also, our
device node assignment now isn't simply discovery order, it's partition
type order followed by discovery order.

The correct way to solve this is an all or nothing migration gated by a
boot time flag.

James

^ permalink raw reply [flat|nested] 43+ messages in thread
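The ordering problem described in the message above can be illustrated with a toy userspace model. This is not kernel code: `struct toy_disk`, `toy_alloc_disk` and `toy_partition_fits` are hypothetical names invented for this sketch, and the only assumption carried over from the thread is that the number of minors is fixed when the disk is allocated, before any partition table has been read.

```c
#include <stdbool.h>

/*
 * Toy stand-in for a gendisk: the minor range is fixed at allocation
 * time and cannot grow once partitions are discovered.
 */
struct toy_disk {
	unsigned int first_minor;	/* minor of the whole-disk node */
	unsigned int minors;		/* whole disk plus partitions */
};

/* Reserve `minors` consecutive minors starting at `first_minor`. */
static struct toy_disk toy_alloc_disk(unsigned int first_minor,
				      unsigned int minors)
{
	struct toy_disk d = { first_minor, minors };

	return d;
}

/*
 * A partition (numbered from 1; 0 is the whole disk) can only get a
 * device number if it falls inside the range reserved at allocation.
 */
static bool toy_partition_fits(const struct toy_disk *d, unsigned int partno)
{
	return partno < d->minors;
}
```

In this model a disk allocated with 16 minors can never expose partition 16 or higher, whatever the on-disk table says, so choosing between a 16-partition and a 256-partition layout would have to happen before the table is read.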
* Re: lots and lots of disks again
  2004-02-17 15:18 ` James Bottomley
@ 2004-02-17 15:27 ` Kurt Garloff
  2004-02-29 16:41 ` James Bottomley
  2004-02-17 15:50 ` Andries Brouwer
  1 sibling, 1 reply; 43+ messages in thread
From: Kurt Garloff @ 2004-02-17 15:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andries Brouwer, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

[-- Attachment #1: Type: text/plain, Size: 614 bytes --]

On Tue, Feb 17, 2004 at 10:18:07AM -0500, James Bottomley wrote:
> The correct way to solve this is an all or nothing migration gated by a
> boot time flag.

Agreed. Having the new scheme optional now (defaulting to off),
deprecating the old numbers in 2.7 (defaulting new scheme to on)
and dropping them in 2.8 looks like a reasonable plan to me.

Now we only need to get some device space allocated.

PS: I believe 64 partitions is enough.

Regards,
--
Kurt Garloff <garloff@suse.de> Cologne, DE
SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-17 15:27 ` Kurt Garloff
@ 2004-02-29 16:41 ` James Bottomley
  2004-02-29 23:31 ` Kurt Garloff
  2004-03-03 19:30 ` Mike Anderson
  0 siblings, 2 replies; 43+ messages in thread
From: James Bottomley @ 2004-02-29 16:41 UTC (permalink / raw)
  To: Kurt Garloff
  Cc: Andries Brouwer, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

On Tue, 2004-02-17 at 09:27, Kurt Garloff wrote:
> Agreed. Having the new scheme optional now (defaulting to off),
> deprecating the old numbers in 2.7 (defaulting new scheme to on)
> and dropping them in 2.8 looks like a reasonable plan to me.
>
> Now we only need to get some device space allocated.
>
> PS: I believe 64 partitions is enough.

OK, this all went quiet again.

The choices are between the current patch from Kurt which is limited to
16 partitions, or a new patch (which no-one has yet produced) to raise
us to 64 partitions on a new major number with a switching scheme.

Kurt, since you're pushing the patch, which do you want me to do? Apply
your current patch for 16 or wait for a new one for 64?

James

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-29 16:41 ` James Bottomley
@ 2004-02-29 23:31 ` Kurt Garloff
  2004-03-03 19:30 ` Mike Anderson
  1 sibling, 0 replies; 43+ messages in thread
From: Kurt Garloff @ 2004-02-29 23:31 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andries Brouwer, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

[-- Attachment #1: Type: text/plain, Size: 1182 bytes --]

On Sun, Feb 29, 2004 at 10:41:36AM -0600, James Bottomley wrote:
> OK, this all went quiet again.
>
> The choices are between the current patch from Kurt which is limited to
> 16 partitions, or a new patch (which no-one has yet produced) to raise
> us to 64 partitions on a new major number with a switching scheme.

And those two solutions don't even exclude each other ;-)

> Kurt, since you're pushing the patch, which do you want me to do? Apply
> your current patch for 16 or wait for a new one for 64?

Sticking with the old majors for now has the advantage that we don't
force people with many disks to get all the userspace fixed immediately.
And for those who need it, we introduce ~22 bits (4 majors) with 64
partitions and provide a switch.
Or actually reserve ~24 bits for /dev/disk and start using it with
SCSI ;-)

So, I'd do both. For now, I would like to have my patch applied.
Let's see what Linus thinks about the /dev/disk thing. I know there
are long term plans in that direction.

Regards,
--
Kurt Garloff <garloff@suse.de> Cologne, DE
SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-29 16:41 ` James Bottomley
  2004-02-29 23:31 ` Kurt Garloff
@ 2004-03-03 19:30 ` Mike Anderson
  2004-03-03 19:55 ` Kurt Garloff
  1 sibling, 1 reply; 43+ messages in thread
From: Mike Anderson @ 2004-03-03 19:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Kurt Garloff, Andries Brouwer, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

James Bottomley [James.Bottomley@steeleye.com] wrote:
> OK, this all went quiet again.
>

It looks like this went quiet yet again or maybe I missed
something. It looks like Badari's patch is still in the mm tree and
Kurt's is only in the linux-scsi archives.

Which patch / direction should we be using if we want to go above the
current number of sd's in the mainline?

-andmike
--
Michael Anderson
andmike@us.ibm.com

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-03-03 19:30 ` Mike Anderson
@ 2004-03-03 19:55 ` Kurt Garloff
  0 siblings, 0 replies; 43+ messages in thread
From: Kurt Garloff @ 2004-03-03 19:55 UTC (permalink / raw)
  To: James Bottomley, Andries Brouwer, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

[-- Attachment #1: Type: text/plain, Size: 808 bytes --]

Hi Mike,

On Wed, Mar 03, 2004 at 11:30:41AM -0800, Mike Anderson wrote:
> James Bottomley [James.Bottomley@steeleye.com] wrote:
> It looks like this went quiet yet again or maybe I missed
> something. It looks like Badari's patch is still in the mm tree and
> Kurt's is only in the linux-scsi archives.

and in the SUSE kernel ;-)

> Which patch / direction should we be using if we want to go above the
> current number of sd's in the mainline.

Well, I have an opinion obviously.
But if we can agree on a common solution, I'd happily adopt it, at
least if it does not screw up migration from the old numbering scheme.

Regards,
--
Kurt Garloff <garloff@suse.de> Cologne, DE
SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-17 15:18 ` James Bottomley
  2004-02-17 15:27 ` Kurt Garloff
@ 2004-02-17 15:50 ` Andries Brouwer
  2004-02-17 17:57 ` James Bottomley
  1 sibling, 1 reply; 43+ messages in thread
From: Andries Brouwer @ 2004-02-17 15:50 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andries Brouwer, Kurt Garloff, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

On Tue, Feb 17, 2004 at 10:18:07AM -0500, James Bottomley wrote:
> On Tue, 2004-02-17 at 09:49, Andries Brouwer wrote:
> > On Mon, Feb 16, 2004 at 07:56:10PM -0500, James Bottomley wrote:
> > > We also need to flip the switch entirely one way or the other for
> > > increased partition numbers since a mixed scheme would be asking for an
> > > admin nightmare.
> >
> > Why would that be?
> >
> > Suppose /dev/sdAA is (1049,0) and has 256 partitions.
> > And /dev/sda is (8,0) and has 16 partitions.
> > Where is the nightmare?
>
> You're kidding, right?

I was not.

> Think of the code paths: at the moment we tell the gendisk the number of
> minors we need when we register it. It then probes the partition tables
> and fills in the values. If we have a large partition major, we need to
> know *before* we call add_disk(). The only thing that determines this
> is the on disc partitioning scheme, so now you need to know the
> partition type before you register the gendisk. This type of layering
> violation is a sure sign of a bad design.

Ah, you think that I want to assign the name depending on
what one finds on the disk. That was not my intention at all.
After all, we bought the disk and it had no partition table.
Should it shift major when it is partitioned?

> The correct way to solve this is an all or nothing migration gated by a
> boot time flag.

Yes, I was also thinking of boot time flags, but not all-or-nothing.
Almost nobody actually needs these five thousand disks, and programs
like LILO know about major 8, so I can even imagine that someone with
lots of disks would like the boot disk to remain sda (8,0) even when
all the others become (1049,*)-(1073,*).

Then what is the boot parameter? I don't know. Anything.
Say, scsi-legacy=sda,sdb,sde.

Andries

^ permalink raw reply [flat|nested] 43+ messages in thread
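The `scsi-legacy=sda,sdb,sde` parameter floated in the message above is only a suggestion in the thread and was never specified further. As a minimal sketch of how such a comma-separated list could be matched against a disk name, the following plain C helper may serve; the function name `legacy_list_contains` is hypothetical, and nothing here reflects an actual kernel option or kernel parsing code.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `name` (e.g. "sda") appears as a complete entry in the
 * comma-separated `list` (e.g. "sda,sdb,sde"). Entries are compared in
 * full, so "sda" does not match an entry "sdaa".
 * (Hypothetical helper for the suggested scsi-legacy= value.)
 */
static bool legacy_list_contains(const char *list, const char *name)
{
	size_t name_len = strlen(name);

	if (name_len == 0)
		return false;

	while (*list) {
		/* Length of the current entry, up to the next comma or end. */
		size_t entry_len = strcspn(list, ",");

		if (entry_len == name_len &&
		    strncmp(list, name, name_len) == 0)
			return true;

		list += entry_len;
		if (*list == ',')
			list++;		/* skip the separator */
	}
	return false;
}
```

With the list from the message, `legacy_list_contains("sda,sdb,sde", "sdb")` is true and `legacy_list_contains("sda,sdb,sde", "sdc")` is false, so sdc would be a candidate for the new number space under this scheme.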
* Re: lots and lots of disks again
  2004-02-17 15:50 ` Andries Brouwer
@ 2004-02-17 17:57 ` James Bottomley
  2004-02-17 18:44 ` Andries Brouwer
  0 siblings, 1 reply; 43+ messages in thread
From: James Bottomley @ 2004-02-17 17:57 UTC (permalink / raw)
  To: Andries Brouwer
  Cc: Kurt Garloff, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

On Tue, 2004-02-17 at 10:50, Andries Brouwer wrote:
> On Tue, Feb 17, 2004 at 10:18:07AM -0500, James Bottomley wrote:
> > The correct way to solve this is an all or nothing migration gated by a
> > boot time flag.
>
> Yes, I was also thinking of boot time flags, but not all-or-nothing.
> Almost nobody actually needs these five thousand disks, and programs
> like LILO know about major 8, so I can even imagine that someone with
> lots of disks would like the boot disk to remain sda (8,0) even when
> all the others become (1049,*)-(1073,*).
>
> Then what is the boot parameter? I don't know. Anything.
> Say, scsi-legacy=sda,sdb,sde.

But this is complexity for no gain.

The proposal is very simple: You want large numbers of discs or large
numbers of partitions, you get fixed tools and use a flat space on the
new major number. You want legacy, you don't do anything for 2.6; for
2.7 you'll need to specify a boot flag.

James

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-17 17:57 ` James Bottomley
@ 2004-02-17 18:44 ` Andries Brouwer
  0 siblings, 0 replies; 43+ messages in thread
From: Andries Brouwer @ 2004-02-17 18:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andries Brouwer, Kurt Garloff, viro, Andrew Morton, hch, SCSI Mailing List, pbadari, willy

On Tue, Feb 17, 2004 at 12:57:37PM -0500, James Bottomley wrote:
> > Then what is the boot parameter? I don't know. Anything.
> > Say, scsi-legacy=sda,sdb,sde.
>
> But this is complexity for no gain.
>
> The proposal is very simple: You want large numbers of discs or large
> numbers of partitions, you get fixed tools and use a flat space on the
> new major number. You want legacy, you don't do anything for 2.6; for
> 2.7 you'll need to specify a boot flag.

Yes, I understand what you propose.
I describe something more backwards compatible.
Without a flag day.
Without the need to change device nodes that you have already.
Without the need to change them back when you boot an old kernel.

Change is painful. All-or-nothing changes even more so.
I like keeping support for the old stuff, but making its use
slowly more inconvenient (by making the new things the default).

Andries

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: lots and lots of disks again
  2004-02-12 15:00 ` Kurt Garloff
  2004-02-12 15:20 ` James Bottomley
  2004-02-12 15:57 ` viro
@ 2004-02-13 0:05 ` Kurt Garloff
  2004-02-16 12:31 ` Kurt Garloff
  2 siblings, 1 reply; 43+ messages in thread
From: Kurt Garloff @ 2004-02-13 0:05 UTC (permalink / raw)
  To: viro, Andrew Morton, hch, linux-scsi, pbadari, willy, James.Bottomley

[-- Attachment #1.1: Type: text/plain, Size: 252 bytes --]

On Thu, Feb 12, 2004 at 04:00:35PM +0100, Kurt Garloff wrote:
> Expect a patch later.

Attached.
--
Kurt Garloff <garloff@suse.de> Cologne, DE
SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head)

[-- Attachment #1.2: scsi-many-26-4.diff --]
[-- Type: text/plain, Size: 3553 bytes --]

--- drivers/scsi/sd.c.orig	2004-01-09 07:59:49.000000000 +0100
+++ drivers/scsi/sd.c	2004-02-13 00:58:05.000000000 +0100
@@ -19,6 +19,9 @@
  *	   not being read in sd_open. Fix problem where removable media
  *	   could be ejected after sd_open.
  *	 - Douglas Gilbert <dgilbert@interlog.com> cleanup for lk 2.5.x
+ *	 - Badari Pulavarty <pbadari@us.ibm.com>, Matthew Wilcox
+ *	   <willy@debian.org>, Kurt Garloff <garloff@suse.de>:
+ *	   Support 32k/1M disks.
  *
  *	Logging policy (needs CONFIG_SCSI_LOGGING defined):
  *	 - setting up transfer: SCSI_LOG_HLQUEUE levels 1 and 2
@@ -61,7 +64,7 @@
  * Remaining dev_t-handling stuff
  */
 #define SD_MAJORS	16
-#define SD_DISKS	(SD_MAJORS << 4)
+#define SD_DISKS	32768	/* anything between 256 and 262144 */
 
 /*
  * Time out in seconds for disks and Magneto-opticals (which are slower).
@@ -121,6 +124,20 @@
 	.init_command		= sd_init_command,
 };
 
+/* Device no to disk mapping:
+ *
+ *       major         disc2     disc  p1
+ *   |............|.............|....|....| <- dev_t
+ *    31        20 19          8 7  4 3  0
+ *
+ * Inside a major, we have 16k disks, however mapped non-
+ * contiguously. The first 16 disks are for major0, the next
+ * ones with major1, ... Disk 256 is for major0 again, disk 272
+ * for major1, ...
+ * As we stay compatible with our numbering scheme, we can reuse
+ * the well-know SCSI majors 8, 65--71, 136--143.
+ */
+
 static int sd_major(int major_idx)
 {
 	switch (major_idx) {
@@ -136,6 +153,14 @@
 	}
 }
 
+static unsigned int make_sd_dev(unsigned int sd_nr, unsigned int part)
+{
+	return (part & 0xf) | ((sd_nr & 0xf) << 4) |
+		(sd_major((sd_nr & 0xf0) >> 4) << 20) | (sd_nr & 0xfff00);
+}
+
+/* reverse mapping dev -> (sd_nr, part) not currently needed */
+
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,kobj);
 
 static inline struct scsi_disk *scsi_disk(struct gendisk *disk)
@@ -1297,7 +1322,7 @@
 	struct scsi_disk *sdkp;
 	struct gendisk *gd;
 	u32 index;
-	int error;
+	int error, devno;
 
 	error = -ENODEV;
 	if ((sdp->type != TYPE_DISK) && (sdp->type != TYPE_MOD))
@@ -1315,6 +1340,12 @@
 	kobject_init(&sdkp->kobj);
 	sdkp->kobj.ktype = &scsi_disk_kobj_type;
 
+	/* Note: We can accomodate 64 partitions, but the genhd code
+	 * assumes partitions allocate consecutive minors, which they don't.
+	 * So for now stay with max 16 partitions and leave two spare bits.
+	 * Later, we may change the genhd code and the alloc_disk() call
+	 * and the ->minors assignment here. 	KG, 2004-02-10
+	 */
 	gd = alloc_disk(16);
 	if (!gd)
 		goto out_free;
@@ -1335,16 +1366,23 @@
 	sdkp->index = index;
 	sdkp->openers = 0;
 
-	gd->major = sd_major(index >> 4);
-	gd->first_minor = (index & 15) << 4;
+	devno = make_sd_dev(index, 0);
+	gd->major = MAJOR(devno);
+	gd->first_minor = MINOR(devno);
 	gd->minors = 16;
 	gd->fops = &sd_fops;
 
-	if (index >= 26) {
+	if (index < 26) {
+		sprintf(gd->disk_name, "sd%c", 'a' + index % 26);
+	} else if (index < (26*27)) {
 		sprintf(gd->disk_name, "sd%c%c",
-			'a' + index/26-1,'a' + index % 26);
+			'a' + index / 26 - 1,'a' + index % 26);
 	} else {
-		sprintf(gd->disk_name, "sd%c", 'a' + index % 26);
+		const unsigned int m1 = (index / 26 - 1) / 26 - 1;
+		const unsigned int m2 = (index / 26 - 1) % 26;
+		const unsigned int m3 = index % 26;
+		sprintf(gd->disk_name, "sd%c%c%c",
+			'a' + m1, 'a' + m2, 'a' + m3);
 	}
 
 	strcpy(gd->devfs_name, sdp->devfs_name);

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
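The device-number mapping and the disk naming in the patch above can be tried out in userspace with the following sketch. It copies the expression from `make_sd_dev()` and the naming arithmetic from the patch, but everything around them is an assumption: the 12-bit major / 20-bit minor split behind `DEMO_MAJOR` and `DEMO_MINOR`, the helper name `sd_name`, and above all the body of `sd_major()`, which the diff does not show. The major values used here (8, 65 to 71, and 128 to 135 for the upper eight) are taken to be the usual sd assignments; note that the upper range assumed here differs from the one written in the patch comment.

```c
#include <stdio.h>
#include <string.h>

/* Assumption: dev_t is split into a 12-bit major and a 20-bit minor. */
#define DEMO_MINORBITS	20
#define DEMO_MAJOR(dev)	((unsigned int)((dev) >> DEMO_MINORBITS))
#define DEMO_MINOR(dev)	((unsigned int)((dev) & ((1U << DEMO_MINORBITS) - 1)))

/* Assumed major numbers for the sixteen sd majors (not shown in the diff). */
static unsigned int sd_major(unsigned int major_idx)
{
	if (major_idx == 0)
		return 8;
	if (major_idx < 8)
		return 65 + (major_idx - 1);
	return 128 + (major_idx - 8);
}

/* Same expression as make_sd_dev() in the patch. */
static unsigned int make_sd_dev(unsigned int sd_nr, unsigned int part)
{
	return (part & 0xf) | ((sd_nr & 0xf) << 4) |
		(sd_major((sd_nr & 0xf0) >> 4) << 20) | (sd_nr & 0xfff00);
}

/* Disk name for `index`, following the naming branches in the patch. */
static void sd_name(unsigned int index, char *buf, size_t len)
{
	if (index < 26) {
		snprintf(buf, len, "sd%c", (char)('a' + index % 26));
	} else if (index < 26 * 27) {
		snprintf(buf, len, "sd%c%c",
			 (char)('a' + index / 26 - 1),
			 (char)('a' + index % 26));
	} else {
		const unsigned int m1 = (index / 26 - 1) / 26 - 1;
		const unsigned int m2 = (index / 26 - 1) % 26;
		const unsigned int m3 = index % 26;

		snprintf(buf, len, "sd%c%c%c",
			 (char)('a' + m1), (char)('a' + m2), (char)('a' + m3));
	}
}
```

Under these assumptions disk 0 stays at (8,0) and disk 16 at (65,0), as with the old scheme, while disk 256 wraps back to major 8 with first minor 256, which matches the "Disk 256 is for major0 again" remark in the patch comment. The names run from sda to sdz, then sdaa to sdzz, and continue with sdaaa at index 702.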
* Re: lots and lots of disks again
  2004-02-13 0:05 ` Kurt Garloff
@ 2004-02-16 12:31 ` Kurt Garloff
  0 siblings, 0 replies; 43+ messages in thread
From: Kurt Garloff @ 2004-02-16 12:31 UTC (permalink / raw)
  To: viro, Andrew Morton, hch, linux-scsi, pbadari, willy, James.Bottomley

[-- Attachment #1: Type: text/plain, Size: 450 bytes --]

On Fri, Feb 13, 2004 at 01:05:39AM +0100, Kurt Garloff wrote:
> On Thu, Feb 12, 2004 at 04:00:35PM +0100, Kurt Garloff wrote:
> > Expect a patch later.
>
> Attached.

It looks like people have run out of objections.
Can this patch please be included in the next SCSI patch sets?

Regards,
--
Kurt Garloff <garloff@suse.de> Cologne, DE
SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2004-03-03 19:57 UTC | newest]

Thread overview: 43+ messages
2004-02-04 10:45 lots and lots of disks again Andrew Morton
2004-02-10 11:04 ` Kurt Garloff
2004-02-10 11:26 ` Kurt Garloff
2004-02-10 13:39 ` Christoph Hellwig
2004-02-10 15:47 ` Kurt Garloff
2004-02-10 15:52 ` Christoph Hellwig
2004-02-10 16:08 ` Kurt Garloff
2004-02-10 20:10 ` Andries Brouwer
2004-02-10 20:11 ` Matthew Wilcox
2004-02-10 20:58 ` Kurt Garloff
2004-02-10 21:21 ` viro
2004-02-10 21:34 ` Kurt Garloff
2004-02-10 21:42 ` viro
2004-02-10 22:28 ` Kurt Garloff
2004-02-10 18:26 ` Andrew Morton
2004-02-11 14:56 ` Kurt Garloff
2004-02-11 21:28 ` Andrew Morton
2004-02-11 22:09 ` Kurt Garloff
2004-02-11 22:29 ` Andrew Morton
2004-02-11 22:53 ` viro
2004-02-12 15:00 ` Kurt Garloff
2004-02-12 15:20 ` James Bottomley
2004-02-12 15:57 ` viro
2004-02-12 16:18 ` Kurt Garloff
2004-02-12 16:43 ` James Bottomley
2004-02-16 12:40 ` Kurt Garloff
2004-02-16 22:57 ` Andries Brouwer
2004-02-17 0:56 ` James Bottomley
2004-02-17 7:57 ` Kurt Garloff
2004-02-17 15:08 ` James Bottomley
2004-02-17 15:28 ` Matthew Wilcox
2004-02-17 14:49 ` Andries Brouwer
2004-02-17 15:18 ` James Bottomley
2004-02-17 15:27 ` Kurt Garloff
2004-02-29 16:41 ` James Bottomley
2004-02-29 23:31 ` Kurt Garloff
2004-03-03 19:30 ` Mike Anderson
2004-03-03 19:55 ` Kurt Garloff
2004-02-17 15:50 ` Andries Brouwer
2004-02-17 17:57 ` James Bottomley
2004-02-17 18:44 ` Andries Brouwer
2004-02-13 0:05 ` Kurt Garloff
2004-02-16 12:31 ` Kurt Garloff