* Greater than 16 xvd devices for blkfront
@ 2008-05-06 17:36 Chris Lalancette
  2008-05-06 17:45 ` Daniel P. Berrange
  2008-05-07  1:55 ` Daniel P. Berrange
  0 siblings, 2 replies; 12+ messages in thread
From: Chris Lalancette @ 2008-05-06 17:36 UTC (permalink / raw)
  To: xen-devel

All,
     We've had a number of requests to increase the number of xvd devices that a
PV guest can have.  Currently, if you try to connect > 16 disks, you get an
error from xend.  The problem ends up being that both xend and blkfront assume
that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
bits for major and 22 bits for minor.
     Therefore, it shouldn't really be a problem giving lots of disks to guests.
 The problem is in backwards compatibility, and the details.  What I am
initially proposing to do is to leave things where they are for /dev/xvd[a-p];
that is, still put the xenstore entries in the same place, and use 8 bits for
the major and 8 bits for the minor.  For anything above that, we would end up
putting the xenstore entry in a different place, and pushing the major into the
top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
won't fire when the entry is added, and we will add code to newer guests
blkfront so that they will fire when they see that entry.  Does anyone see any
problems with this setup, or have any ideas how to do it better?

Thanks,
Chris Lalancette

* Re: Greater than 16 xvd devices for blkfront
  2008-05-06 17:36 Greater than 16 xvd devices for blkfront Chris Lalancette
@ 2008-05-06 17:45 ` Daniel P. Berrange
  2008-05-07 16:04   ` Chris Wright
  2008-05-07  1:55 ` Daniel P. Berrange
  1 sibling, 1 reply; 12+ messages in thread
From: Daniel P. Berrange @ 2008-05-06 17:45 UTC (permalink / raw)
  To: Chris Lalancette; +Cc: xen-devel

On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> All,
>      We've had a number of requests to increase the number of xvd devices that a
> PV guest can have.  Currently, if you try to connect > 16 disks, you get an
> error from xend.  The problem ends up being that both xend and blkfront assume
> that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
> bits for major and 22 bits for minor.
>      Therefore, it shouldn't really be a problem giving lots of disks to guests.
>  The problem is in backwards compatibility, and the details.  What I am
> initially proposing to do is to leave things where they are for /dev/xvd[a-p];
> that is, still put the xenstore entries in the same place, and use 8 bits for
> the major and 8 bits for the minor.  For anything above that, we would end up
> putting the xenstore entry in a different place, and pushing the major into the
> top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
> won't fire when the entry is added, and we will add code to newer guests
> blkfront so that they will fire when they see that entry.  Does anyone see any
> problems with this setup, or have any ideas how to do it better?

Putting the xenstore entries in a different place is a non-starter. Too
many things look at that location already. When blktap was added and it
put xenstore entries in a different place it took months to track down 
all the bugs this caused.

Dan.
-- 
|: Red Hat, Engineering, Boston   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

* Re: Greater than 16 xvd devices for blkfront
  2008-05-06 17:36 Greater than 16 xvd devices for blkfront Chris Lalancette
  2008-05-06 17:45 ` Daniel P. Berrange
@ 2008-05-07  1:55 ` Daniel P. Berrange
  2008-05-07  3:47   ` Daniel P. Berrange
  1 sibling, 1 reply; 12+ messages in thread
From: Daniel P. Berrange @ 2008-05-07  1:55 UTC (permalink / raw)
  To: Chris Lalancette; +Cc: xen-devel

On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> All,
>      We've had a number of requests to increase the number of xvd devices that a
> PV guest can have.  Currently, if you try to connect > 16 disks, you get an
> error from xend.  The problem ends up being that both xend and blkfront assume
> that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
> bits for major and 22 bits for minor.
>      Therefore, it shouldn't really be a problem giving lots of disks to guests.
>  The problem is in backwards compatibility, and the details.  What I am
> initially proposing to do is to leave things where they are for /dev/xvd[a-p];
> that is, still put the xenstore entries in the same place, and use 8 bits for
> the major and 8 bits for the minor.  For anything above that, we would end up
> putting the xenstore entry in a different place, and pushing the major into the
> top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
> won't fire when the entry is added, and we will add code to newer guests
> blkfront so that they will fire when they see that entry.  Does anyone see any
> problems with this setup, or have any ideas how to do it better?

Looking at the blkfront code I think we can increase the minor numbers
available for xvdX devices without requiring changes to where stuff
is stored.

The key is that in blkfront we can reliably detect the overflow triggered
by the 17th disk, because the next major number, 203, doesn't clash with
any of the other major numbers blkfront is looking for.

Consider the 17th disk, which has name 'xvdq', this gives a device number
in xenstore of '51968'.

Upon seeing this, current blkfront code will use

   #define BLKIF_MAJOR(dev) ((dev)>>8)
   #define BLKIF_MINOR(dev) ((dev) & 0xff)

And so get back major number of 203 and minor number of '0'. 
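
To make the arithmetic concrete, here is a small Python sketch of the same
8:8 split applied to the xenstore value for 'xvdq'. It is not blkfront code;
the helper names are made up for illustration and simply mirror the two
macros above.

```python
# Illustrative only: mirrors BLKIF_MAJOR / BLKIF_MINOR from blkfront.
def blkif_major(dev):
    return dev >> 8

def blkif_minor(dev):
    return dev & 0xff

# xend's formula for the whole disk 'xvdq' (17th disk, index 16)
xvdq = 202 * 256 + 16 * (ord('q') - ord('a'))

print(xvdq, blkif_major(xvdq), blkif_minor(xvdq))  # 51968 203 0
```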

In the xlbd_get_major_info(int vdevice) function, it has a switch on major
numbers and the xvdX case is handled as the default

        major = BLKIF_MAJOR(vdevice);
        minor = BLKIF_MINOR(vdevice);

        switch (major) {
        case IDE0_MAJOR: index = 0; break;
         ....snipped...
        case IDE9_MAJOR: index = 9; break;
        case SCSI_DISK0_MAJOR: index = 10; break;
        case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR:
                index = 11 + major - SCSI_DISK1_MAJOR;
                break;
        case SCSI_CDROM_MAJOR: index = 18; break;
        default: index = 19; break;
        }


So, the 17th disk in fact gets treated as the 1st disk: the front end assigns
it the name 'xvda', and then promptly kernel panics because xvda already
exists in sysfs. 

kobject_add failed for xvda with -EEXIST, don't try to register things with the same name in the same directory.

Call Trace:
 [<ffffffff80336951>] kobject_add+0x16e/0x199
 [<ffffffff8025ce3c>] exact_lock+0x0/0x14
 [<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
 [<ffffffff802f393e>] register_disk+0x43/0x198
 [<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8032e453>] add_disk+0x34/0x3d
 [<ffffffff88074eb8>] :xenblk:backend_changed+0x110/0x193
 [<ffffffff803a4029>] xenbus_read_driver_state+0x26/0x3b

Now, this kernel panic isn't a huge problem (though it ought to handle the 
kobject_add gracefully), because we can never do anything to make existing
frontends deal with > 16 disks. If an admin tries to add more than 16 disks
to an existing guest they should already expect doom.

For future frontends though, it looks like we can adapt the switch(major)
in xlbd_get_major_info(), so that it detects the overflow of minor numbers,
and re-adjusts the major/minor numbers to their intended value:


eg   change

        case SCSI_CDROM_MAJOR: index = 18; break;
        default: index = 19; break;
        }


to

        case SCSI_CDROM_MAJOR: index = 18; break;
        default: 
              index = 19;
              if (major > XLBD_MAJOR_VBD_START) {
                  minor += 16 * 16 * (major - XLBD_MAJOR_VBD_START);
                  major = XLBD_MAJOR_VBD_START;
              }
              break;
        }


Now, I've not actually tested this, and there are a few other places in blkfront
needing similar tweaks, but I don't see anything in the code which fundamentally
stops this overflow detection & fixup.
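
As a sanity check on the idea, the fixup can be modelled in a few lines of
Python. This is only a sketch under the assumption that each major past 202
stands for another 256 minors (the 8-bit minor wraps every 256);
fixup_major_minor() is a hypothetical name, not a blkfront function.

```python
XLBD_MAJOR_VBD_START = 202  # the xvd major discussed above

def fixup_major_minor(vdevice):
    """Split an 8:8 xenstore device number, then fold any overflow
    past major 202 back into the minor."""
    major = vdevice >> 8
    minor = vdevice & 0xff
    # Majors below 202 (the IDE/SCSI cases) pass through untouched.
    if major > XLBD_MAJOR_VBD_START:
        minor += 256 * (major - XLBD_MAJOR_VBD_START)
        major = XLBD_MAJOR_VBD_START
    return major, minor

print(fixup_major_minor(51968))  # (202, 256) for 'xvdq'
```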

As far as the XenD backend is concerned, all we need to do is edit the XenD 
blkdev_name_to_number() function in tools/python/xen/util/blkif.py to relax
the regex to allow > xvdp, and adapt the math so it overflows onto the major
numbers following XVD's 202.

eg, change

    if re.match( '/dev/xvd[a-p]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)

to

    if re.match( '/dev/xvd[a-z]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)

gets you to 26 disks. This is how I got the guest to boot and the front end to 
crash on the 17th disk 'xvdq'. It is a little more complex to cope with 
2-letter drives, but no show stopper there.
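
For what it's worth, a two-letter variant might look roughly like the Python
sketch below. xvd_name_to_number() is a hypothetical stand-in, not the real
blkif.py function, and it assumes that names continue 'xvdaa', 'xvdab', ...
after 'xvdz' and that the same 202 * 256 + 16 * disk + partition formula
keeps applying.

```python
import re

def xvd_name_to_number(name):
    """Map /dev/xvd<letters>[partition] to the xenstore device number.
    Accepts one- or two-letter disks and partitions 1..15."""
    m = re.match(r'^/dev/xvd([a-z]{1,2})([1-9]|1[0-5])?$', name)
    if m is None:
        raise ValueError("not an xvd device name: %r" % name)
    letters, part = m.group(1), m.group(2)
    # a=0 .. z=25, aa=26, ab=27, ... (assumed naming convention)
    disk = 0
    for ch in letters:
        disk = disk * 26 + (ord(ch) - ord('a') + 1)
    disk -= 1
    return 202 * 256 + 16 * disk + int(part or 0)

print(xvd_name_to_number('/dev/xvdq'))   # 51968
print(xvd_name_to_number('/dev/xvdaa'))  # 52128
```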

So, unless I'm missing something obvious we can keep compatibility with
existing guests for the first 16 disks and still (indirectly) make use 
of the 12:20 major:minor dev_t split for the 17th+ disk, without needing
to change how or where stuff is stored in XenStore. 

Regards,
Daniel.

* Re: Greater than 16 xvd devices for blkfront
  2008-05-07  1:55 ` Daniel P. Berrange
@ 2008-05-07  3:47   ` Daniel P. Berrange
  2008-05-07 16:40     ` Chris Wright
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel P. Berrange @ 2008-05-07  3:47 UTC (permalink / raw)
  To: Chris Lalancette; +Cc: xen-devel

On Wed, May 07, 2008 at 02:55:02AM +0100, Daniel P. Berrange wrote:
> On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> > All,
> >      We've had a number of requests to increase the number of xvd devices that a
> > PV guest can have.  Currently, if you try to connect > 16 disks, you get an
> > error from xend.  The problem ends up being that both xend and blkfront assume
> > that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
> > bits for major and 22 bits for minor.
> >      Therefore, it shouldn't really be a problem giving lots of disks to guests.
> >  The problem is in backwards compatibility, and the details.  What I am
> > initially proposing to do is to leave things where they are for /dev/xvd[a-p];
> > that is, still put the xenstore entries in the same place, and use 8 bits for
> > the major and 8 bits for the minor.  For anything above that, we would end up
> > putting the xenstore entry in a different place, and pushing the major into the
> > top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
> > won't fire when the entry is added, and we will add code to newer guests
> > blkfront so that they will fire when they see that entry.  Does anyone see any
> > problems with this setup, or have any ideas how to do it better?
> 
> Looking at the blkfront code I think we can increase the minor numbers
> available for xvdX devices without requiring changes to where stuff
> is stored.

Have a go with this proof of concept patch to blkfront. I built pv-on-hvm drivers
with this and successfully booted my guest with 25 disks (xvdb -> xvdz) and saw
them registered in /dev as can be seen from /proc/partitions:

major minor  #blocks  name

   3     0    5242880 hda
   3     1     104391 hda1
   3     2    5132767 hda2
 253     0    4096000 dm-0
 253     1    1015808 dm-1
 202    16     102400 xvdb
 202    32     102400 xvdc
 202    48     102400 xvdd
 202    49      48163 xvdd1
 202    50      48195 xvdd2
 202    64     102400 xvde
 202    80     102400 xvdf
 202    96     102400 xvdg
 202   112     102400 xvdh
 202   128     102400 xvdi
 202   144     102400 xvdj
 202   160     102400 xvdk
 202   176     102400 xvdl
 202   192     102400 xvdm
 202   208     102400 xvdn
 202   224     102400 xvdo
 202   240     102400 xvdp
 202   256     102400 xvdq
 202   272     102400 xvdr
 202   288     102400 xvds
 202   304     102400 xvdt
 202   320     102400 xvdu
 202   336     102400 xvdv
 202   352     102400 xvdw
 202   368     102400 xvdx
 202   384     102400 xvdy
 202   400     102400 xvdz
 202   401      96358 xvdz1

NB, requires the regex tweak to blkif.py in XenD to allow xvd[a-z] naming.
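
The listing above is consistent with 16 minors per disk under major 202. A
short Python sketch of that mapping (xvd_name_from_minor() is a hypothetical
helper, and it only handles single-letter names):

```python
def xvd_name_from_minor(minor):
    """16 minors per disk: the low 4 bits are the partition,
    the remaining bits the disk index (0 = xvda)."""
    disk, part = minor >> 4, minor & 0xf
    if disk > 25:
        raise ValueError("only single-letter names handled in this sketch")
    name = "xvd" + chr(ord('a') + disk)
    return name + (str(part) if part else "")

for minor in (16, 49, 240, 256, 400, 401):
    print(minor, xvd_name_from_minor(minor))
```

The loop prints xvdb, xvdd1, xvdp, xvdq, xvdz and xvdz1 for those minors,
matching the /proc/partitions entries.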

Regards,
Daniel.

diff -r 57ab8dd47580 drivers/xen/blkfront/vbd.c
--- a/drivers/xen/blkfront/vbd.c	Sun Jul 01 22:07:32 2007 +0100
+++ b/drivers/xen/blkfront/vbd.c	Tue May 06 23:38:20 2008 -0400
@@ -166,7 +166,14 @@ xlbd_get_major_info(int vdevice)
                 index = 18 + major - SCSI_DISK8_MAJOR;
                 break;
         case SCSI_CDROM_MAJOR: index = 26; break;
-        default: index = 27; break;
+        default:
+		index = 27;
+		if (major >  XLBD_MAJOR_VBD_START) {
+			printk("xen-vbd: fixup major/minor %d -> %d,%d\n", vdevice, major, minor);
+			minor += (16 * 16 * (major - 202));
+			major = 202;
+		}
+		printk("xen-vbd: process major/minor %d -> %d,%d\n", vdevice, major, minor);
 	}
 
 	mi = ((major_info[index] != NULL) ? major_info[index] :
@@ -315,14 +322,42 @@ xlvbd_add(blkif_sector_t capacity, int v
 {
 	struct block_device *bd;
 	int err = 0;
+	int major, minor;
 
-	info->dev = MKDEV(BLKIF_MAJOR(vdevice), BLKIF_MINOR(vdevice));
+	major = BLKIF_MAJOR(vdevice);
+	minor = BLKIF_MINOR(vdevice);
+
+	switch (major) {
+	case IDE0_MAJOR:
+	case IDE1_MAJOR:
+	case IDE2_MAJOR:
+	case IDE3_MAJOR:
+	case IDE4_MAJOR:
+	case IDE5_MAJOR:
+	case IDE6_MAJOR:
+	case IDE7_MAJOR:
+	case IDE8_MAJOR:
+	case IDE9_MAJOR:
+	case SCSI_DISK0_MAJOR:
+	case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR:
+	case SCSI_DISK8_MAJOR ... SCSI_DISK15_MAJOR:
+	case SCSI_CDROM_MAJOR:
+		break;
+
+	default:
+		if (major > 202) {
+			minor += (16 * 16 * (major - 202));
+			major = 202;
+		}
+	}
+
+	info->dev = MKDEV(major, minor);
 
 	bd = bdget(info->dev);
 	if (bd == NULL)
 		return -ENODEV;
 
-	err = xlvbd_alloc_gendisk(BLKIF_MINOR(vdevice), capacity, vdevice,
+	err = xlvbd_alloc_gendisk(minor, capacity, vdevice,
 				  vdisk_info, sector_size, info);
 
 	bdput(bd);



* Re: Greater than 16 xvd devices for blkfront
  2008-05-06 17:45 ` Daniel P. Berrange
@ 2008-05-07 16:04   ` Chris Wright
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Wright @ 2008-05-07 16:04 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: Chris Lalancette, xen-devel

* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> > All,
> >      We've had a number of requests to increase the number of xvd devices that a
> > PV guest can have.  Currently, if you try to connect > 16 disks, you get an
> > error from xend.  The problem ends up being that both xend and blkfront assume
> > that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
> > bits for major and 22 bits for minor.

Just a nit, it's actually 12:20.

> >      Therefore, it shouldn't really be a problem giving lots of disks to guests.
> >  The problem is in backwards compatibility, and the details.  What I am
> > initially proposing to do is to leave things where they are for /dev/xvd[a-p];
> > that is, still put the xenstore entries in the same place, and use 8 bits for
> > the major and 8 bits for the minor.  For anything above that, we would end up
> > putting the xenstore entry in a different place, and pushing the major into the
> > top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
> > won't fire when the entry is added, and we will add code to newer guests
> > blkfront so that they will fire when they see that entry.  Does anyone see any
> > problems with this setup, or have any ideas how to do it better?
> 
> Putting the xenstore entries in a different place is a non-starter. Too
> many things look at that location already. When blktap was added and it
> put xenstore entries in a different place it took months to track down 
> all the bugs this caused.

I'm not sure what you mean?  Since this is blkfront it'd be more like
adding a virtual-device2 to extend the protocol.

/* FIXME: Use dynamic device id if this is not set. */
err = xenbus_scanf(XBT_NIL, dev->nodename, "virtual-device", "%i", &vdevice);
if (err != 1) {
	xenbus_dev_fatal(dev, err, "reading virtual-device");
	return err;
}


IOW something simple like:

err = xenbus_scanf(XBT_NIL, dev->nodename, "virtual-device", "%i", &vdevice);
if (err == -ENOENT)
	err = xenbus_scanf(XBT_NIL, dev->nodename, "virtual-device2", "%i", &vdevice);

Then we can stop propagating the myth that dev_t is 8:8.
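
A rough Python model of that lookup order, using a plain dict in place of
the device's xenstore directory. The dict and read_vdevice() are stand-ins
for illustration, not the xenbus API, and "virtual-device2" is only the key
name proposed above.

```python
def read_vdevice(node):
    """Prefer the legacy key; fall back to the proposed extension key."""
    for key in ("virtual-device", "virtual-device2"):
        if key in node:
            # base 0 accepts decimal and 0x-prefixed hex, roughly like %i
            return int(node[key], 0)
    raise KeyError("reading virtual-device")

print(read_vdevice({"virtual-device": "51712"}))        # old guests and new
print(read_vdevice({"virtual-device2": "0x10001000"}))  # new guests only
```

An old frontend never looks at the second key, so it would simply not see
a device that is published only there.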

thanks,
-chris

* Re: Greater than 16 xvd devices for blkfront
  2008-05-07  3:47   ` Daniel P. Berrange
@ 2008-05-07 16:40     ` Chris Wright
  2008-05-08  9:30       ` Ian Jackson
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Wright @ 2008-05-07 16:40 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: Chris Lalancette, xen-devel

* Daniel P. Berrange (berrange@redhat.com) wrote:
> +	default:
> +		if (major > 202) {
> +			minor += (16 * 16 * (major - 202));
> +			major = 202;
> +		}
> +	}

I didn't think of handling overflow (since the major for scsi/ide/etc
were involved, I expected that to fail).  But, aside from crashing an
older guest with > 16 disks (not ideal, but I think it's possible
already with 0x format), seems good.

thanks,
-chris

* Re: Greater than 16 xvd devices for blkfront
  2008-05-07 16:40     ` Chris Wright
@ 2008-05-08  9:30       ` Ian Jackson
  2008-05-08 15:33         ` Chris Wright
  0 siblings, 1 reply; 12+ messages in thread
From: Ian Jackson @ 2008-05-08  9:30 UTC (permalink / raw)
  To: Chris Wright; +Cc: Chris Lalancette, Daniel P. Berrange, xen-devel

Chris Wright writes ("Re: [Xen-devel] Greater than 16 xvd devices for blkfront"):
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > +	default:
> > +		if (major > 202) {
> > +			minor += (16 * 16 * (major - 202));
> > +			major = 202;
> > +		}
> > +	}

The root cause of the problem is the incorporation of the Linux device
numbering scheme into the xenstore protocol, which is wrong I think.
What Daniel's excellent if rather unpleasant suggestion is doing is to
regard the xenstore number not as a `Linux device number' but rather
as a crazy encoding of the disk number.

I think this is fine but it would be good if we could think about what
the new crazy encoding is, and document it.  I infer that in Daniel's
suggestion it's:

  xenstore number = (202 << 8) + (actual disk number << 4)
                        | partition number

where the actual disk number starts at 0 for xvda and partition
numbers are 0 for whole disk or 1..15.
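
In Python terms, a sketch of that inferred encoding (encode_old() is a
made-up name) shows both that it reproduces the number seen for 'xvdq' and
why partition numbers above 15 cannot be expressed in it:

```python
def encode_old(disk, partition):
    # the formula as inferred above, for the 202 (xvd) major
    return ((202 << 8) + (disk << 4)) | partition

print(encode_old(16, 0))                      # 51968, i.e. 'xvdq'
# Partition 16 of disk 0 is indistinguishable from disk 1, whole disk:
print(encode_old(0, 16) == encode_old(1, 0))  # True
```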

Daniel's solution still doesn't work for partitions >15.  Perhaps,
given that old guests are going to break anyway, we should consider a
different scheme ?  Since disks and partitions not supported by the
old encoding won't work on old guests anyway, we can use a completely
new encoding for that case provided only that it doesn't use numbers
of the form  (202 << 8) | something

Presumably we can safely use at least 31 bits.  If we reserve one to
indicate that this is the new encoding that leaves us with 30 which
should be enough for a reasonable number of disks with many
partitions each.

> I didn't think of handling overflow (since the major for scsi/ide/etc
> were involved, I expected that to fail).  But, aside of crashing an
> older guest with > 16 disks (not ideal, but I think it's possible
> already with 0x format), seems good.

If a guest takes the xenstore number to be the concatenation of its
own major and minor numbers then obviously it is leaving itself open
to breaking in the future.  dom0 admins will just have to Not Do That
Then.  (It's a shame, if true, that the guests don't have actual error
checking.)

Ian.

* Re: Greater than 16 xvd devices for blkfront
  2008-05-08  9:30       ` Ian Jackson
@ 2008-05-08 15:33         ` Chris Wright
  2008-05-08 17:03           ` Ian Jackson
  2008-05-08 22:14           ` Greater than 16 xvd devices for blkfront Jeremy Fitzhardinge
  0 siblings, 2 replies; 12+ messages in thread
From: Chris Wright @ 2008-05-08 15:33 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Chris Wright, Chris Lalancette, Daniel P. Berrange, xen-devel

* Ian Jackson (Ian.Jackson@eu.citrix.com) wrote:
> Chris Wright writes ("Re: [Xen-devel] Greater than 16 xvd devices for blkfront"):
> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > +	default:
> > > +		if (major > 202) {
> > > +			minor += (16 * 16 * (major - 202));
> > > +			major = 202;
> > > +		}
> > > +	}
> 
> The root cause of the problem is the incorporation of the Linux device
> numbering scheme into the xenstore protocol, which is wrong I think.
> What Daniel's excellent if rather unpleasant suggestion is doing is to
> regard the xenstore number not as a `Linux device number' but rather
> as a crazy encoding of the disk number.
> 
> I think this is fine but it would be good if we could think about what
> the new crazy encoding is, and document it.  I infer that in Daniel's
> suggestion it's:
> 
>   xenstore number = (202 << 8) + (actual disk number << 4)
>                         | partition number
> 
> where the actual disk number starts at 0 for xvda and partition
> numbers are 0 for whole disk or 1..15.
> 
> Daniel's solution still doesn't work for partitions >15.  Perhaps,

I think that's OK, and effectively a hard limitation w.r.t. lanana:

202 block	Xen Virtual Block Device
		  0 = /dev/xvda       First Xen VBD whole disk
		  16 = /dev/xvdb      Second Xen VBD whole disk
		  32 = /dev/xvdc      Third Xen VBD whole disk
		    ...
		  240 = /dev/xvdp     Sixteenth Xen VBD whole disk

                Partitions are handled in the same way as for IDE
                disks (see major number 3) except that the limit on
                partitions is 15.


> given that old guests are going to break anyway, we should consider a
> different scheme ?  Since disks and partitions not supported by the
> old encoding won't work on old guests anyway, we can use a completely
> new encoding for that case provided only that it doesn't use numbers
> of the form  (202 << 8) | something

Well, we don't actually need 202, or any minor numbers at all.  The major
is only needed for the case where xvd masquerades as IDE or SCSI.
We ripped this wart out for upstream Linux.  And the guest can happily
dynamically allocate minor numbers on its own behalf.  A disk discovery
event can be completely dynamic, the admin just wouldn't be able to
guarantee which minor slot gets allocated for a particular disk in
a guest.  We do have mount by label or UUID.

> Presumably we can safely use at least 31 bits.  If we reserve one to
> indicate that this is the new encoding that leaves us with 30 which
> should be enough for a reasonable number of disks with many
> partitions each.
> 
> > I didn't think of handling overflow (since the major for scsi/ide/etc
> > were involved, I expected that to fail).  But, aside of crashing an
> > older guest with > 16 disks (not ideal, but I think it's possible
> > already with 0x format), seems good.
> 
> If a guest takes the xenstore number to be the concatenation of its
> own major and minor numbers then obviously it is leaving itself open
> to breaking in the future.  dom0 admins will just have to Not Do That
> Then.  (It's a shame, if true, that the guests don't have actual error
> checking.)

Agreed.

thanks,
-chris

* Re: Greater than 16 xvd devices for blkfront
  2008-05-08 15:33         ` Chris Wright
@ 2008-05-08 17:03           ` Ian Jackson
  2010-02-03 16:50             ` Xen vbd numbering Ian Jackson
  2008-05-08 22:14           ` Greater than 16 xvd devices for blkfront Jeremy Fitzhardinge
  1 sibling, 1 reply; 12+ messages in thread
From: Ian Jackson @ 2008-05-08 17:03 UTC (permalink / raw)
  To: Chris Wright; +Cc: Chris Lalancette, Daniel P. Berrange, xen-devel

Chris Wright writes ("Re: [Xen-devel] Greater than 16 xvd devices for blkfront"):
> Ian Jackson (Ian.Jackson@eu.citrix.com) wrote:
> > Daniel's solution still doesn't work for partitions >15.  Perhaps,
> 
> I think that's OK, and effectively a hard limitation w.r.t. lanana:

No, because not all guests are Linux, and anyway that limitation in
Linux may be improved in the future.  If we're going to invent a new
scheme then we may as well solve the problem properly.

> > given that old guests are going to break anyway, we should consider a
> > different scheme ?  Since disks and partitions not supported by the
> > old encoding won't work on old guests anyway, we can use a completely
> > new encoding for that case provided only that it doesn't use numbers
> > of the form  (202 << 8) | something
> 
> Well, we don't actually need 202, or any minor numbers at all.  The major
> is only needed for the case where xvd masquerades as IDE or SCSI.

I think you're really missing the point.  At the moment the Xen domain
config specifies whether the device is supposed to show up in the
guest as a native xvd, or masquerading as scsi or ide.  This
information is encoded, along with the disk number and partition
number, into the xenstore path.

The xenstore path element is currently a decimal integer, and that
integer supplies this information in an encoding derived from that used
internally by pre-32-bit-devt Linux guests.  That's completely mad.
However, we can't really change it now at least for disks which fit
into the old encoding scheme, because any new scheme won't be
supported by old guests.

For disks and partitions which are out of the range which fit into the
current encodings, we need a new encoding anyway.  Old guests
definitely can't cope with those so we don't need to be compatible.

Daniel Berrange's suggestion amounts to this: rather than invent a
wholly new location in xenstore for these disks, we simply make use of
more of the available values of this integer.

I'm pointing out that when we do that we ought to take into account
our future requirements in general, which may include >15 partitions.


Something like this:

 Old format:

  202 << 8 | disk << 4 | partition	xvd, disks and partitions up to 15
    8 << 8 | disk << 4 | partition      sd, disks and partitions up to 15
    3 << 8 | disk << 6 | partition      hd, disks 0..3, partitions 1..63

 New format:

  1 << 28 | disk << 8 | partition	xvd, disks or partitions 16 onwards

 Reserved for future use:

  2 << 28 onwards
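
A minimal Python sketch of how an encoder and decoder for the xvd rows above
could look. The function names are hypothetical, and the field widths in the
new format (8 bits of partition, 20 bits of disk) are an assumption read off
the shifts, not something the table states.

```python
NEW_FORMAT_FLAG = 1 << 28  # marks the proposed extended xvd encoding

def encode_xvd(disk, partition):
    """Use the old 202-based form when it fits, else the new form."""
    if disk < 16 and partition < 16:
        return (202 << 8) | (disk << 4) | partition
    # Assumed limits: 8 bits of partition, 20 bits of disk.
    if partition >= (1 << 8) or disk >= (1 << 20):
        raise ValueError("out of range for this sketch")
    return NEW_FORMAT_FLAG | (disk << 8) | partition

def decode_xvd(number):
    """Inverse of encode_xvd(); returns (disk, partition)."""
    if number >> 28 == 1:
        return (number >> 8) & 0xfffff, number & 0xff
    if number >> 8 == 202:
        return (number >> 4) & 0xf, number & 0xf
    raise ValueError("not an xvd device number")

print(encode_xvd(0, 0))    # 51712, old format (xvda)
print(encode_xvd(16, 0))   # 268439552, new format (17th disk)
```

Because the old form is kept whenever it fits, existing guests see exactly
the numbers they see today for the first 16 disks.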


Ian.

* Re: Greater than 16 xvd devices for blkfront
  2008-05-08 15:33         ` Chris Wright
  2008-05-08 17:03           ` Ian Jackson
@ 2008-05-08 22:14           ` Jeremy Fitzhardinge
  2008-05-08 23:34             ` Daniel P. Berrange
  1 sibling, 1 reply; 12+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-08 22:14 UTC (permalink / raw)
  To: Chris Wright; +Cc: Daniel P. Berrange, Chris Lalancette, Ian Jackson, xen-devel

Chris Wright wrote:
> Well, we don't actually need 202, or any minor numbers at all.  The major
> is only needed for the case where xvd masquerades as IDE or SCSI.
> We ripped this wart out for upstream Linux. 

I'm considering putting it back in if it makes anyone's life easier.  In 
general using labels/uuids is the best way to make an installation 
device-agnostic, but installers might have an easier time with a forged 
scsi device or something.  I mentioned it in passing to Al Viro, and he 
was surprisingly non-insulting about the notion.

>  And the guest can happily
> dynamically allocate minor numbers on its own behalf.  A disk discovery
> event can be completely dynamic, the admin just wouldn't be able to
> guarantee which minor slot gets allocated for a particular disk in
> a guest.  We do have mount by label or UUID.
>   

That's true for filesystems which have already been initialized.  But if 
you're attaching 4 new devices to a guest and they appear at random 
device nodes, how do you know which is which?  Smell?

    J

* Re: Greater than 16 xvd devices for blkfront
  2008-05-08 22:14           ` Greater than 16 xvd devices for blkfront Jeremy Fitzhardinge
@ 2008-05-08 23:34             ` Daniel P. Berrange
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel P. Berrange @ 2008-05-08 23:34 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Chris Lalancette, Ian Jackson, xen-devel

On Thu, May 08, 2008 at 11:14:34PM +0100, Jeremy Fitzhardinge wrote:
> Chris Wright wrote:
> >Well, we don't actually need 202, or any minor numbers at all.  The major
> >is only needed for the case where xvd masquerades as IDE or SCSI.
> >We ripped this wart out for upstream Linux. 
> 
> I'm considering putting it back in if it makes anyone's life easier.  In 
> general using labels/uuids is the best way to make an installation 
> device-agnostic, but installers might have an easier time with a forged 
> scsi device or something.  I mentioned it in passing to Al Viro, and he 
> was surprisingly non-insulting about the notion.
> 
> > And the guest can happily
> >dynamically allocate minor numbers on its own behalf.  A disk discovery
> >event can be completely dynamic, the admin just wouldn't be able to
> >guarantee which minor slot gets allocated for a particular disk in
> >a guest.  We do have mount by label or UUID.
> >  
> 
> That's true for filesystems which have already been initialized.  But if 
> you're attaching 4 new devices to a guest and they appear at random 
> device nodes, how do you know which is which?  Smell?

Well there's /dev/disk/by-{path,id}. Now there's no udev rules to setup
these links for Xen VBD (afaik), but we could arrange to have some suitable
info used to provide a persistent path under either of those locations.

Dan.

* Xen vbd numbering
  2008-05-08 17:03           ` Ian Jackson
@ 2010-02-03 16:50             ` Ian Jackson
  0 siblings, 0 replies; 12+ messages in thread
From: Ian Jackson @ 2010-02-03 16:50 UTC (permalink / raw)
  To: Chris Wright, Daniel P. Berrange, Chris Lalancette, xen-devel

In May 2008 I wrote:
>  Old format:
> 
>   202 << 8 | disk << 4 | partition	xvd, disks and partitions up to 15
>     8 << 8 | disk << 4 | partition      sd, disks and partitions up to 15
>     3 << 8 | disk << 6 | partition      hd, disks 0..3, partitions 1..63
> 
>  New format:
> 
>   1 << 28 | disk << 8 | partition	xvd, disks or partitions 16 onwards
> 
>  Reserved for future use:
> 
>   2 << 28 onwards

But now that I get down and dirty with some code I discover that
actually what we have is not quite this.  Much Linux-specific stuff
has crept in and the result is a mess.

After consultation, what we intend to implement in libxl is as
follows:

 * The abstract interface specifies, for each VBD:

     * Nominal disk type: Xen virtual disk (aka xvd, the default);
       SCSI (sd); IDE (hd).  This is for use as a hint by the guest's
       device naming scheme.

     * Disk number, which is a nonnegative integer,
       conventionally starting at 0 for the first disk.

     * Partition number, which is a nonnegative integer
       where by convention partition 0 indicates the "whole disk".

       Normally for any disk _either_ partition 0 should be supplied,
       in which case the guest is expected to treat it as it would a
       native whole disk (for example by putting or expecting a
       partition table or disk label on it);

       _Or_ only non-0 partitions should be supplied in which case the
       guest should expect storage management to be done by the host
       and treat each vbd as it would a partition or slice or LVM
       volume (for example by putting or expecting a filesystem on
       it).

 * The syntaxes are, for example:
       d0 d0p0  xvda     Xen virtual disk 0 partition 0 (whole disk)
       d1p2     xvdb2    Xen virtual disk 1 partition 2
       d536p37  xvdtq37  Xen virtual disk 536 partition 37
       sdb3              SCSI disk 1 partition 3
       hdc2              IDE disk 2 partition 2
    The d*p* syntax is not supported by xm/xend.

 * This is encoded in the concrete interface as an integer (in a
   canonical decimal format in xenstore), whose value encodes the
   information above as follows:

    1 << 28 | disk << 8 | partition      xvd, disks or partitions 16 onwards
   202 << 8 | disk << 4 | partition      xvd, disks and partitions up to 15
     8 << 8 | disk << 4 | partition      sd, disks and partitions up to 15
     3 << 8 | disk << 6 | partition      hd, disks 0..1, partitions 0..63
    22 << 8 | (disk-2) << 6 | partition  hd, disks 2..3, partitions 0..63
    2 << 28 onwards                      reserved for future use
   other values less than 1 << 28        deprecated / reserved

   The 1<<28 format handles disks up to (1<<20)-1 and partitions up to
   255.  It will be used only where the 202<<8 format does not have
   enough bits.

   Guests MAY support any subset of the formats above except that if
   they support 1<<28 they MUST also support 202<<8.

   Some software has provided essentially Linux-specific encodings for
   SCSI disks beyond disk 15 partition 15, and IDE disks beyond disk 3
   partition 63.  These vbds, and the corresponding encoded integers,
   are deprecated.

 * Guests SHOULD ignore numbers that they do not understand or
   recognise.  They SHOULD check supplied numbers for validity.

 * We know that not all guests conform to this interface.  For
   example, old Linux systems interpret the integer as
       major << 8 | minor
   where major and minor are the Linux-specific device numbers,
   and some old configurations may depend on deprecated high-numbered
   SCSI and IDE disks.

   We will therefore preserve the existing facility to specify
   the xenstore numerical value directly by putting a single
   number (hex, decimal or octal) in the domain config file instead
   of the disk identifier.
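
As an illustration only, the naming syntax and the integer encoding
described above can be sketched in Python as follows.  This is not the
libxl code; the function names (letters_to_disk, parse_vdev,
encode_vdev, decode_vdev) are hypothetical and chosen for this example.
Only the bit layouts and the example names come from the description
above.

```python
import re

_DP = re.compile(r"d(\d+)(?:p(\d+))?")           # d0, d1p2, d536p37
_NAMED = re.compile(r"(xvd|sd|hd)([a-z]+)(\d*)")  # xvda, sdb3, hdc2


def letters_to_disk(letters):
    """Map a letter suffix to a disk number: a -> 0, z -> 25, aa -> 26."""
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("a") + 1)
    return n - 1


def parse_vdev(name):
    """Return (kind, disk, partition); a missing partition means 0."""
    m = _DP.fullmatch(name)
    if m:
        return "xvd", int(m.group(1)), int(m.group(2) or 0)
    m = _NAMED.fullmatch(name)
    if m is None:
        raise ValueError("unrecognised vdev name: %r" % name)
    kind, letters, part = m.groups()
    return kind, letters_to_disk(letters), int(part or 0)


def encode_vdev(kind, disk, partition):
    """Encode per the table above; xenstore would hold str() of the result."""
    if disk < 0 or partition < 0:
        raise ValueError("disk and partition must be nonnegative")
    if kind == "xvd":
        if disk < 16 and partition < 16:
            return (202 << 8) | (disk << 4) | partition
        # The 1<<28 format is used only when 202<<8 lacks the bits.
        if disk < (1 << 20) and partition < 256:
            return (1 << 28) | (disk << 8) | partition
    elif kind == "sd" and disk < 16 and partition < 16:
        return (8 << 8) | (disk << 4) | partition
    elif kind == "hd" and partition < 64:
        if disk < 2:
            return (3 << 8) | (disk << 6) | partition
        if disk < 4:
            return (22 << 8) | ((disk - 2) << 6) | partition
    raise ValueError("not representable: %s d%dp%d" % (kind, disk, partition))


def decode_vdev(value):
    """Guest side: return (kind, disk, partition), or None to ignore."""
    if value < 0 or value >= (2 << 28):
        return None                      # reserved for future use
    if value >= (1 << 28):
        return "xvd", (value >> 8) & 0xFFFFF, value & 0xFF
    major, minor = value >> 8, value & 0xFF
    if major == 202:
        return "xvd", minor >> 4, minor & 0xF
    if major == 8:
        return "sd", minor >> 4, minor & 0xF
    if major in (3, 22) and (minor >> 6) < 2:
        base = 0 if major == 3 else 2
        return "hd", base + (minor >> 6), minor & 0x3F
    return None                          # deprecated / reserved value
```

For example, encode_vdev(*parse_vdev("xvda")) gives 51712 (202 << 8),
and parse_vdev("xvdtq37") gives ("xvd", 536, 37), which does not fit
the 202<<8 format and is therefore encoded in the 1<<28 format.  The
decoder returns None for anything outside the table, in line with the
rule that guests should ignore numbers they do not recognise.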

Ian.


end of thread, other threads:[~2010-02-03 16:50 UTC | newest]

Thread overview: 12+ messages
2008-05-06 17:36 Greater than 16 xvd devices for blkfront Chris Lalancette
2008-05-06 17:45 ` Daniel P. Berrange
2008-05-07 16:04   ` Chris Wright
2008-05-07  1:55 ` Daniel P. Berrange
2008-05-07  3:47   ` Daniel P. Berrange
2008-05-07 16:40     ` Chris Wright
2008-05-08  9:30       ` Ian Jackson
2008-05-08 15:33         ` Chris Wright
2008-05-08 17:03           ` Ian Jackson
2010-02-03 16:50             ` Xen vbd numbering Ian Jackson
2008-05-08 22:14           ` Greater than 16 xvd devices for blkfront Jeremy Fitzhardinge
2008-05-08 23:34             ` Daniel P. Berrange
