* Greater than 16 xvd devices for blkfront
@ 2008-05-06 17:36 Chris Lalancette
2008-05-06 17:45 ` Daniel P. Berrange
2008-05-07 1:55 ` Daniel P. Berrange
0 siblings, 2 replies; 12+ messages in thread
From: Chris Lalancette @ 2008-05-06 17:36 UTC (permalink / raw)
To: xen-devel
All,
We've had a number of requests to increase the number of xvd devices that a
PV guest can have. Currently, if you try to connect > 16 disks, you get an
error from xend. The problem ends up being that both xend and blkfront assume
that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
bits for major and 22 bits for minor.
Therefore, it shouldn't really be a problem giving lots of disks to guests.
The problem is in backwards compatibility, and the details. What I am
initially proposing to do is to leave things where they are for /dev/xvd[a-p];
that is, still put the xenstore entries in the same place, and use 8 bits for
the major and 8 bits for the minor. For anything above that, we would end up
putting the xenstore entry in a different place, and pushing the major into the
top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
won't fire when the entry is added, and we will add code to newer guests
blkfront so that they will fire when they see that entry. Does anyone see any
problems with this setup, or have any ideas how to do it better?
Thanks,
Chris Lalancette
* Re: Greater than 16 xvd devices for blkfront
2008-05-06 17:36 Greater than 16 xvd devices for blkfront Chris Lalancette
@ 2008-05-06 17:45 ` Daniel P. Berrange
2008-05-07 16:04 ` Chris Wright
2008-05-07 1:55 ` Daniel P. Berrange
1 sibling, 1 reply; 12+ messages in thread
From: Daniel P. Berrange @ 2008-05-06 17:45 UTC (permalink / raw)
To: Chris Lalancette; +Cc: xen-devel
On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> All,
> We've had a number of requests to increase the number of xvd devices that a
> PV guest can have. Currently, if you try to connect > 16 disks, you get an
> error from xend. The problem ends up being that both xend and blkfront assume
> that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
> bits for major and 22 bits for minor.
> Therefore, it shouldn't really be a problem giving lots of disks to guests.
> The problem is in backwards compatibility, and the details. What I am
> initially proposing to do is to leave things where they are for /dev/xvd[a-p];
> that is, still put the xenstore entries in the same place, and use 8 bits for
> the major and 8 bits for the minor. For anything above that, we would end up
> putting the xenstore entry in a different place, and pushing the major into the
> top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
> won't fire when the entry is added, and we will add code to newer guests
> blkfront so that they will fire when they see that entry. Does anyone see any
> problems with this setup, or have any ideas how to do it better?
Putting the xenstore entries in a different place is a non-starter. Too
many things look at that location already. When blktap was added and it
put xenstore entries in a different place it took months to track down
all the bugs this caused.
Dan.
--
|: Red Hat, Engineering, Boston -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
* Re: Greater than 16 xvd devices for blkfront
2008-05-06 17:36 Greater than 16 xvd devices for blkfront Chris Lalancette
2008-05-06 17:45 ` Daniel P. Berrange
@ 2008-05-07 1:55 ` Daniel P. Berrange
2008-05-07 3:47 ` Daniel P. Berrange
1 sibling, 1 reply; 12+ messages in thread
From: Daniel P. Berrange @ 2008-05-07 1:55 UTC (permalink / raw)
To: Chris Lalancette; +Cc: xen-devel
On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> All,
> We've had a number of requests to increase the number of xvd devices that a
> PV guest can have. Currently, if you try to connect > 16 disks, you get an
> error from xend. The problem ends up being that both xend and blkfront assume
> that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
> bits for major and 22 bits for minor.
> Therefore, it shouldn't really be a problem giving lots of disks to guests.
> The problem is in backwards compatibility, and the details. What I am
> initially proposing to do is to leave things where they are for /dev/xvd[a-p];
> that is, still put the xenstore entries in the same place, and use 8 bits for
> the major and 8 bits for the minor. For anything above that, we would end up
> putting the xenstore entry in a different place, and pushing the major into the
> top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
> won't fire when the entry is added, and we will add code to newer guests
> blkfront so that they will fire when they see that entry. Does anyone see any
> problems with this setup, or have any ideas how to do it better?
Looking at the blkfront code I think we can increase the minor numbers
available for xvdX devices without requiring changes to where stuff
is stored.
The key is that in blkfront we can reliably detect the overflow triggered
by the 16th disk, because the next major number 203 doesn't clash with
any of the other major numbers blkfront is looking for.
Consider the 17th disk, named 'xvdq', which gets a device number
in xenstore of '51968'.
Upon seeing this, current blkfront code will use
#define BLKIF_MAJOR(dev) ((dev)>>8)
#define BLKIF_MINOR(dev) ((dev) & 0xff)
And so get back major number of 203 and minor number of '0'.
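The arithmetic is easy to check outside the kernel. A minimal Python sketch of the two macros follows; the function names are illustrative only and do not exist in blkfront:

```python
# Model of the legacy 8:8 split applied by BLKIF_MAJOR / BLKIF_MINOR.
XVD_MAJOR = 202  # major number used for xvd devices


def blkif_major(dev):
    """Mirror of BLKIF_MAJOR: everything above the low 8 bits."""
    return dev >> 8


def blkif_minor(dev):
    """Mirror of BLKIF_MINOR: the low 8 bits only."""
    return dev & 0xFF


# xvdq is the 17th disk (index 16), with 16 minors reserved per disk.
xvdq = XVD_MAJOR * 256 + 16 * 16
print(xvdq, blkif_major(xvdq), blkif_minor(xvdq))
```

The 16 * 16 term no longer fits in the low 8 bits, so it carries into the major field.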
In the xlbd_get_major_info(int vdevice) function, it has a switch on major
numbers and the xvdX case is handled as the default
	major = BLKIF_MAJOR(vdevice);
	minor = BLKIF_MINOR(vdevice);
	switch (major) {
	case IDE0_MAJOR: index = 0; break;
	....snipped...
	case IDE9_MAJOR: index = 9; break;
	case SCSI_DISK0_MAJOR: index = 10; break;
	case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR:
		index = 11 + major - SCSI_DISK1_MAJOR;
		break;
	case SCSI_CDROM_MAJOR: index = 18; break;
	default: index = 19; break;
	}
So, the 17th disk in fact gets treated as 1st disk and the front end assigns
it the name 'xvda', and then promptly kernel panics because xvda already
exists in sysfs.
kobject_add failed for xvda with -EEXIST, don't try to register things with the same name in the same directory.
Call Trace:
[<ffffffff80336951>] kobject_add+0x16e/0x199
[<ffffffff8025ce3c>] exact_lock+0x0/0x14
[<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
[<ffffffff802f393e>] register_disk+0x43/0x198
[<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
[<ffffffff8032e453>] add_disk+0x34/0x3d
[<ffffffff88074eb8>] :xenblk:backend_changed+0x110/0x193
[<ffffffff803a4029>] xenbus_read_driver_state+0x26/0x3b
Now, this kernel panic isn't a huge problem (though it ought to handle the
kobject_add gracefully), because we can never do anything to make existing
frontends deal with > 16 disks. If an admin tries to add more than 16 disks
to an existing guest they should already expect doom.
For future frontends though, it looks like we can adapt the switch(major)
in xlbd_get_major_info(), so that it detects the overflow of minor numbers,
and re-adjusts the major/minor numbers to their intended value:
eg change
	case SCSI_CDROM_MAJOR: index = 18; break;
	default: index = 19; break;
	}
to
	case SCSI_CDROM_MAJOR: index = 18; break;
	default:
		index = 19;
		if (major > XLBD_MAJOR_VBD_START) {
			/* each major past 202 carries 256 minors (16 disks x 16) */
			minor += 16 * 16 * (major - XLBD_MAJOR_VBD_START);
			major = XLBD_MAJOR_VBD_START;
		}
		break;
	}
Now, I've not actually tested this, and there's a few other places in blkfront
needing similar tweaks but I don't see anything in the code which fundamentally
stops this overflow detection & fixup.
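For anyone wanting to sanity-check the fixup without building a kernel, here is a rough Python model of the default branch. It is only a sketch: the function name is made up, and it assumes that any major above 202 is an overflowed xvd device. Because the legacy encoding leaves 8 bits for the minor, each major past 202 stands for 256 minors (16 disks of 16 minors each), so the model scales by 256.

```python
XLBD_MAJOR_VBD_START = 202  # xvd major, as in the blkfront snippet
MINORS_PER_MAJOR = 256      # the legacy encoding leaves 8 bits for the minor


def fixup(vdevice):
    """Return (major, minor) after folding overflowed majors back onto 202.

    Models only the default (xvd) branch of the switch; majors at or
    below 202 are returned unchanged.
    """
    major, minor = vdevice >> 8, vdevice & 0xFF
    if major > XLBD_MAJOR_VBD_START:
        minor += MINORS_PER_MAJOR * (major - XLBD_MAJOR_VBD_START)
        major = XLBD_MAJOR_VBD_START
    return major, minor


# Device numbers as xend would compute them: 202 * 256 + 16 * disk_index.
for name, vdevice in (("xvdp", 51952), ("xvdq", 51968), ("xvdz", 52112)):
    print(name, fixup(vdevice))
```

With this scaling the 17th disk lands on minor 256 rather than colliding with xvda at minor 0.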
As far as the XenD backend is concerned, all we need to do is edit the XenD
blkdev_name_to_number() function in tools/python/xen/util/blkif.py to relax
the regex to allow > xvdp, and adapt the math so it overflows onto the major
numbers following XVD's 202.
eg, change
    if re.match( '/dev/xvd[a-p]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)
to
    if re.match( '/dev/xvd[a-z]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)
gets you to 26 disks. This is how I got the guest to boot and the front end to
crash on the 17th disk 'xvdq'. It is a little more complex to cope with
2-letter drives, but no show stopper there.
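A standalone sketch of the relaxed mapping, for experimenting outside xend. The function name and the module-level pattern are hypothetical, not the actual blkif.py code. One assumption differs from the snippet above: the pattern is anchored with `$`, so a name such as /dev/xvda16 is rejected instead of being matched on its prefix.

```python
import re

XVD_MAJOR = 202
# One drive letter a-z, optionally followed by a partition number 1..15.
_XVD_RE = re.compile(r"/dev/xvd([a-z])([1-9]|1[0-5])?$")


def xvd_name_to_number(name):
    """Map /dev/xvdX[N] to the integer written to xenstore, or None."""
    m = _XVD_RE.match(name)
    if m is None:
        return None
    disk = ord(m.group(1)) - ord("a")      # xvda -> 0, xvdz -> 25
    partition = int(m.group(2) or 0)       # 0 means the whole disk
    return XVD_MAJOR * 256 + 16 * disk + partition


print(xvd_name_to_number("/dev/xvdq"))
```

Two-letter names (xvdaa and beyond) are deliberately left out here and return None.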
So, unless I'm missing something obvious we can keep compatibility with
existing guests for the first 16 disks and still (indirectly) make use
of the 12:20 dev_t split for the 17th+ disk, without needing to change
how or where stuff is stored in XenStore.
Regards,
Daniel.
* Re: Greater than 16 xvd devices for blkfront
2008-05-07 1:55 ` Daniel P. Berrange
@ 2008-05-07 3:47 ` Daniel P. Berrange
2008-05-07 16:40 ` Chris Wright
0 siblings, 1 reply; 12+ messages in thread
From: Daniel P. Berrange @ 2008-05-07 3:47 UTC (permalink / raw)
To: Chris Lalancette; +Cc: xen-devel
On Wed, May 07, 2008 at 02:55:02AM +0100, Daniel P. Berrange wrote:
> On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> > All,
> > We've had a number of requests to increase the number of xvd devices that a
> > PV guest can have. Currently, if you try to connect > 16 disks, you get an
> > error from xend. The problem ends up being that both xend and blkfront assume
> > that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
> > bits for major and 22 bits for minor.
> > Therefore, it shouldn't really be a problem giving lots of disks to guests.
> > The problem is in backwards compatibility, and the details. What I am
> > initially proposing to do is to leave things where they are for /dev/xvd[a-p];
> > that is, still put the xenstore entries in the same place, and use 8 bits for
> > the major and 8 bits for the minor. For anything above that, we would end up
> > putting the xenstore entry in a different place, and pushing the major into the
> > top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
> > won't fire when the entry is added, and we will add code to newer guests
> > blkfront so that they will fire when they see that entry. Does anyone see any
> > problems with this setup, or have any ideas how to do it better?
>
> Looking at the blkfront code I think we can increase the minor numbers
> available for xvdX devices without requiring changes to the where stuff
> is stored.
Have a go with this proof of concept patch to blkfront. I built pv-on-hvm drivers
with this and successfully booted my guest with 25 disks (xvdb -> xvdz) and saw
them registered in /dev as can be seen from /proc/partitions:
major minor #blocks name
3 0 5242880 hda
3 1 104391 hda1
3 2 5132767 hda2
253 0 4096000 dm-0
253 1 1015808 dm-1
202 16 102400 xvdb
202 32 102400 xvdc
202 48 102400 xvdd
202 49 48163 xvdd1
202 50 48195 xvdd2
202 64 102400 xvde
202 80 102400 xvdf
202 96 102400 xvdg
202 112 102400 xvdh
202 128 102400 xvdi
202 144 102400 xvdj
202 160 102400 xvdk
202 176 102400 xvdl
202 192 102400 xvdm
202 208 102400 xvdn
202 224 102400 xvdo
202 240 102400 xvdp
202 256 102400 xvdq
202 272 102400 xvdr
202 288 102400 xvds
202 304 102400 xvdt
202 320 102400 xvdu
202 336 102400 xvdv
202 352 102400 xvdw
202 368 102400 xvdx
202 384 102400 xvdy
202 400 102400 xvdz
202 401 96358 xvdz1
NB, requires the regex tweak to blkif.py in XenD to allow xvd[a-z] naming.
Regards,
Daniel.
diff -r 57ab8dd47580 drivers/xen/blkfront/vbd.c
--- a/drivers/xen/blkfront/vbd.c Sun Jul 01 22:07:32 2007 +0100
+++ b/drivers/xen/blkfront/vbd.c Tue May 06 23:38:20 2008 -0400
@@ -166,7 +166,14 @@ xlbd_get_major_info(int vdevice)
index = 18 + major - SCSI_DISK8_MAJOR;
break;
case SCSI_CDROM_MAJOR: index = 26; break;
- default: index = 27; break;
+ default:
+ index = 27;
+ if (major > XLBD_MAJOR_VBD_START) {
+ printk("xen-vbd: fixup major/minor %d -> %d,%d\n", vdevice, major, minor);
+ minor += (16 * 16 * (major - 202));
+ major = 202;
+ }
+ printk("xen-vbd: process major/minor %d -> %d,%d\n", vdevice, major, minor);
}
mi = ((major_info[index] != NULL) ? major_info[index] :
@@ -315,14 +322,42 @@ xlvbd_add(blkif_sector_t capacity, int v
{
struct block_device *bd;
int err = 0;
+ int major, minor;
- info->dev = MKDEV(BLKIF_MAJOR(vdevice), BLKIF_MINOR(vdevice));
+ major = BLKIF_MAJOR(vdevice);
+ minor = BLKIF_MINOR(vdevice);
+
+ switch (major) {
+ case IDE0_MAJOR:
+ case IDE1_MAJOR:
+ case IDE2_MAJOR:
+ case IDE3_MAJOR:
+ case IDE4_MAJOR:
+ case IDE5_MAJOR:
+ case IDE6_MAJOR:
+ case IDE7_MAJOR:
+ case IDE8_MAJOR:
+ case IDE9_MAJOR:
+ case SCSI_DISK0_MAJOR:
+ case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR:
+ case SCSI_DISK8_MAJOR ... SCSI_DISK15_MAJOR:
+ case SCSI_CDROM_MAJOR:
+ break;
+
+ default:
+ if (major > 202) {
+ minor += (16 * 16 * (major - 202));
+ major = 202;
+ }
+ }
+
+ info->dev = MKDEV(major, minor);
bd = bdget(info->dev);
if (bd == NULL)
return -ENODEV;
- err = xlvbd_alloc_gendisk(BLKIF_MINOR(vdevice), capacity, vdevice,
+ err = xlvbd_alloc_gendisk(minor, capacity, vdevice,
vdisk_info, sector_size, info);
bdput(bd);
* Re: Greater than 16 xvd devices for blkfront
2008-05-06 17:45 ` Daniel P. Berrange
@ 2008-05-07 16:04 ` Chris Wright
0 siblings, 0 replies; 12+ messages in thread
From: Chris Wright @ 2008-05-07 16:04 UTC (permalink / raw)
To: Daniel P. Berrange; +Cc: Chris Lalancette, xen-devel
* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> > All,
> > We've had a number of requests to increase the number of xvd devices that a
> > PV guest can have. Currently, if you try to connect > 16 disks, you get an
> > error from xend. The problem ends up being that both xend and blkfront assume
> > that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
> > bits for major and 22 bits for minor.
Just a nit, it's actually 12:20.
> > Therefore, it shouldn't really be a problem giving lots of disks to guests.
> > The problem is in backwards compatibility, and the details. What I am
> > initially proposing to do is to leave things where they are for /dev/xvd[a-p];
> > that is, still put the xenstore entries in the same place, and use 8 bits for
> > the major and 8 bits for the minor. For anything above that, we would end up
> > putting the xenstore entry in a different place, and pushing the major into the
> > top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
> > won't fire when the entry is added, and we will add code to newer guests
> > blkfront so that they will fire when they see that entry. Does anyone see any
> > problems with this setup, or have any ideas how to do it better?
>
> Putting the xenstore entries in a different place is a non-starter. Too
> many things look at that location already. When blktap was added and it
> put xenstore entries in a different place it took months to track down
> all the bugs this caused.
I'm not sure what you mean? Since this is blkfront it'd be more like
adding a virtual-device2 to extend the protocol.
	/* FIXME: Use dynamic device id if this is not set. */
	err = xenbus_scanf(XBT_NIL, dev->nodename, "virtual-device", "%i", &vdevice);
	if (err != 1) {
		xenbus_dev_fatal(dev, err, "reading virtual-device");
		return err;
	}
IOW something simple like:
	err = xenbus_scanf(XBT_NIL, dev->nodename, "virtual-device", "%i", &vdevice);
	if (err == -ENOENT)
		err = xenbus_scanf(XBT_NIL, dev->nodename, "virtual-device2", "%i", &vdevice);
Then we can stop propagating the myth that dev_t is 8:8.
thanks,
-chris
* Re: Greater than 16 xvd devices for blkfront
2008-05-07 3:47 ` Daniel P. Berrange
@ 2008-05-07 16:40 ` Chris Wright
2008-05-08 9:30 ` Ian Jackson
0 siblings, 1 reply; 12+ messages in thread
From: Chris Wright @ 2008-05-07 16:40 UTC (permalink / raw)
To: Daniel P. Berrange; +Cc: Chris Lalancette, xen-devel
* Daniel P. Berrange (berrange@redhat.com) wrote:
> + default:
> + if (major > 202) {
> + minor += (16 * 16 * (major - 202));
> + major = 202;
> + }
> + }
I didn't think of handling overflow (since the major for scsi/ide/etc
were involved, I expected that to fail). But, aside from crashing an
older guest with > 16 disks (not ideal, but I think it's possible
already with 0x format), seems good.
thanks,
-chris
* Re: Greater than 16 xvd devices for blkfront
2008-05-07 16:40 ` Chris Wright
@ 2008-05-08 9:30 ` Ian Jackson
2008-05-08 15:33 ` Chris Wright
0 siblings, 1 reply; 12+ messages in thread
From: Ian Jackson @ 2008-05-08 9:30 UTC (permalink / raw)
To: Chris Wright; +Cc: Chris Lalancette, Daniel P. Berrange, xen-devel
Chris Wright writes ("Re: [Xen-devel] Greater than 16 xvd devices for blkfront"):
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > + default:
> > + if (major > 202) {
> > + minor += (16 * 16 * (major - 202));
> > + major = 202;
> > + }
> > + }
The root cause of the problem is the incorporation of the Linux device
numbering scheme into the xenstore protocol, which is wrong I think.
What Daniel's excellent if rather unpleasant suggestion is doing is to
regard the xenstore number not as a `Linux device number' but rather
as a crazy encoding of the disk number.
I think this is fine but it would be good if we could think about what
the new crazy encoding is, and document it. I infer that in Daniel's
suggestion it's:
xenstore number = (202 << 8) + (actual disk number << 4)
| partition number
where the actual disk number starts at 0 for xvda and partition
numbers are 0 for whole disk or 1..15.
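As a quick check of that inferred encoding, here is a small Python sketch. The helper names are invented for illustration, and the 0..15 partition limit is the one stated above.

```python
def encode_xvd(disk, partition=0):
    """(202 << 8) + (disk << 4) | partition, with partition in 0..15."""
    if disk < 0 or not 0 <= partition <= 15:
        raise ValueError("disk must be >= 0 and partition in 0..15")
    return (202 << 8) + ((disk << 4) | partition)


def decode_xvd(number):
    """Inverse of encode_xvd: return (disk, partition)."""
    offset = number - (202 << 8)
    if offset < 0:
        raise ValueError("not an xvd number under this encoding")
    return offset >> 4, offset & 0xF


print(encode_xvd(16), decode_xvd(51968))
```

Disk 16 (xvdq) comes out as 51968, the same value xend produces once the regex is relaxed, which is why an old guest reads it as major 203.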
Daniel's solution still doesn't work for partitions >15. Perhaps,
given that old guests are going to break anyway, we should consider a
different scheme ? Since disks and partitions not supported by the
old encoding won't work on old guests anyway, we can use a completely
new encoding for that case provided only that it doesn't use numbers
of the form (202 << 8) | something
Presumably we can safely use at least 31 bits. If we reserve one to
indicate that this is the new encoding that leaves us with 30 which
should be enough for a reasonable number of disks with many
partitions each.
> I didn't think of handling overflow (since the major for scsi/ide/etc
> were involved, I expected that to fail). But, aside of crashing an
> older guest with > 16 disks (not ideal, but I think it's possible
> already with 0x format), seems good.
If a guest takes the xenstore number to be the concatenation of its
own major and minor numbers then obviously it is leaving itself open
to breaking in the future. dom0 admins will just have to Not Do That
Then. (It's a shame, if true, that the guests don't have actual error
checking.)
Ian.
* Re: Greater than 16 xvd devices for blkfront
2008-05-08 9:30 ` Ian Jackson
@ 2008-05-08 15:33 ` Chris Wright
2008-05-08 17:03 ` Ian Jackson
2008-05-08 22:14 ` Greater than 16 xvd devices for blkfront Jeremy Fitzhardinge
0 siblings, 2 replies; 12+ messages in thread
From: Chris Wright @ 2008-05-08 15:33 UTC (permalink / raw)
To: Ian Jackson; +Cc: Chris Wright, Chris Lalancette, Daniel P. Berrange, xen-devel
* Ian Jackson (Ian.Jackson@eu.citrix.com) wrote:
> Chris Wright writes ("Re: [Xen-devel] Greater than 16 xvd devices for blkfront"):
> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > + default:
> > > + if (major > 202) {
> > > + minor += (16 * 16 * (major - 202));
> > > + major = 202;
> > > + }
> > > + }
>
> The root cause of the problem is the incorporation of the Linux device
> numbering scheme into the xenstore protocol, which is wrong I think.
> What Daniel's excellent if rather unpleasant suggestion is doing is to
> regard the xenstore number not as a `Linux device number' but rather
> as a crazy encoding of the disk number.
>
> I think this is fine but it would be good if we could think about what
> the new crazy encoding is, and document it. I infer that in Daniel's
> suggestion it's:
>
> xenstore number = (202 << 8) + (actual disk number << 4)
> | partition number
>
> where the actual disk number starts at 0 for xvda and partition
> numbers are 0 for whole disk or 1..15.
>
> Daniel's solution still doesn't work for partitions >15. Perhaps,
I think that's OK, and effectively a hard limitation w.r.t. lanana:
 202 block	Xen Virtual Block Device
		  0 = /dev/xvda       First Xen VBD whole disk
		 16 = /dev/xvdb       Second Xen VBD whole disk
		 32 = /dev/xvdc       Third Xen VBD whole disk
		    ...
		240 = /dev/xvdp       Sixteenth Xen VBD whole disk

		Partitions are handled in the same way as for IDE
		disks (see major number 3) except that the limit on
		partitions is 15.
> given that old guests are going to break anyway, we should consider a
> different scheme ? Since disks and partitions not supported by the
> old encoding won't work on old guests anyway, we can use a completely
> new encoding for that case provided only that it doesn't use numbers
> of the form (202 << 8) | something
Well, we don't actually need 202, or any minor numbers at all. The major
is only needed for the case where xvd masquerades as IDE or SCSI.
We ripped this wart out for upstream Linux. And the guest can happily
dynamically allocate minor numbers on its own behalf. A disk discovery
event can be completely dynamic, the admin just wouldn't be able to
guarantee which minor slot gets allocated for a particular disk in
a guest. We do have mount by label or UUID.
> Presumably we can safely use at least 31 bits. If we reserve one to
> indicate that this is the new encoding that leaves us with 30 which
> should be enough for a reasonable number of disks with many
> partitions each.
>
> > I didn't think of handling overflow (since the major for scsi/ide/etc
> > were involved, I expected that to fail). But, aside of crashing an
> > older guest with > 16 disks (not ideal, but I think it's possible
> > already with 0x format), seems good.
>
> If a guest takes the xenstore number to be the concatenation of its
> own major and minor numbers then obviously it is leaving itself open
> to breaking in the future. dom0 admins will just have to Not Do That
> Then. (It's a shame, if true, that the guests don't have actual error
> checking.)
Agreed.
thanks,
-chris
* Re: Greater than 16 xvd devices for blkfront
2008-05-08 15:33 ` Chris Wright
@ 2008-05-08 17:03 ` Ian Jackson
2010-02-03 16:50 ` Xen vbd numbering Ian Jackson
2008-05-08 22:14 ` Greater than 16 xvd devices for blkfront Jeremy Fitzhardinge
1 sibling, 1 reply; 12+ messages in thread
From: Ian Jackson @ 2008-05-08 17:03 UTC (permalink / raw)
To: Chris Wright; +Cc: Chris Lalancette, Daniel P. Berrange, xen-devel
Chris Wright writes ("Re: [Xen-devel] Greater than 16 xvd devices for blkfront"):
> Ian Jackson (Ian.Jackson@eu.citrix.com) wrote:
> > Daniel's solution still doesn't work for partitions >15. Perhaps,
>
> I think that's OK, and effectively a hard limitation w.r.t. lanana:
No, because not all guests are Linux, and anyway that limitation in
Linux may be improved in the future. If we're going to invent a new
scheme then we may as well solve the problem properly.
> > given that old guests are going to break anyway, we should consider a
> > different scheme ? Since disks and partitions not supported by the
> > old encoding won't work on old guests anyway, we can use a completely
> > new encoding for that case provided only that it doesn't use numbers
> > of the form (202 << 8) | something
>
> Well, we don't actually need 202, or any minor numbers at all. The major
> is only needed for the case where xvd masquerades as IDE or SCSI.
I think you're really missing the point. At the moment the Xen domain
config specifies whether the device is supposed to show up in the
guest as a native xvd, or masquerading as scsi or ide. This
information is encoded, along with the disk number and partition
number, into the xenstore path.
The xenstore path element is currently a decimal integer, and that
integer supplies this information in an encoding derived from that used
internally by pre-32-bit-devt Linux guests. That's completely mad.
However, we can't really change it now at least for disks which fit
into the old encoding scheme, because any new scheme won't be
supported by old guests.
For disks and partitions which are out of the range which fit into the
current encodings, we need a new encoding anyway. Old guests
definitely can't cope with those so we don't need to be compatible.
Daniel Berrange's suggestion amounts to this: rather than invent a
wholly new location in xenstore for these disks, we simply make use of
more of the available values of this integer.
I'm pointing out that when we do that we ought to take into account
our future requirements in general, which may include >15 partitions.
Something like this:
Old format:
   202 << 8 | disk << 4 | partition     xvd, disks and partitions up to 15
     8 << 8 | disk << 4 | partition     sd, disks and partitions up to 15
     3 << 8 | disk << 6 | partition     hd, disks 0..3, partitions 1..63
New format:
     1 << 28 | disk << 8 | partition    xvd, disks or partitions 16 onwards
Reserved for future use:
     2 << 28 onwards
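To make the proposal concrete, a guest-side decoder for the table above might look roughly like this in Python. It is a sketch under the assumptions of this proposal only; the function name and the returned tuple shape are invented for illustration.

```python
EXT_FLAG = 1 << 28  # marks the proposed new format


def decode(number):
    """Return (kind, disk, partition) for a xenstore device number.

    Follows the proposed table; anything the table does not cover
    raises ValueError, so unknown numbers are rejected rather than guessed.
    """
    if number >= 2 << 28:
        raise ValueError("reserved for future use")
    if number >= EXT_FLAG:
        body = number - EXT_FLAG
        return "xvd", body >> 8, body & 0xFF
    major, minor = number >> 8, number & 0xFF
    if major == 202:
        return "xvd", minor >> 4, minor & 0xF
    if major == 8:
        return "sd", minor >> 4, minor & 0xF
    if major == 3:
        return "hd", minor >> 6, minor & 0x3F
    raise ValueError("not covered by the table")


print(decode(51712), decode((1 << 28) | (16 << 8)))
```

Under this scheme the 17th disk would be sent as (1 << 28) | (16 << 8), and the overflowed value 51968 (major 203) is simply rejected.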
Ian.
* Re: Greater than 16 xvd devices for blkfront
2008-05-08 15:33 ` Chris Wright
2008-05-08 17:03 ` Ian Jackson
@ 2008-05-08 22:14 ` Jeremy Fitzhardinge
2008-05-08 23:34 ` Daniel P. Berrange
1 sibling, 1 reply; 12+ messages in thread
From: Jeremy Fitzhardinge @ 2008-05-08 22:14 UTC (permalink / raw)
To: Chris Wright; +Cc: Daniel P. Berrange, Chris Lalancette, Ian Jackson, xen-devel
Chris Wright wrote:
> Well, we don't actually need 202, or any minor numbers at all. The major
> is only needed for the case where xvd masquerades as IDE or SCSI.
> We ripped this wart out for upstream Linux.
I'm considering putting it back in if it makes anyone's life easier. In
general using labels/uuids is the best way to make an installation
device-agnostic, but installers might have an easier time with a forged
scsi device or something. I mentioned it in passing to Al Viro, and he
was surprisingly non-insulting about the notion.
> And the guest can happily
> dynamically allocate minor numbers on its own behalf. A disk discovery
> event can be completely dynamic, the admin just wouldn't be able to
> guarantee which minor slot gets allocated for a particular disk in
> a guest. We do have mount by label or UUID.
>
That's true for filesystems which have already been initialized. But if
you're attaching 4 new devices to a guest and they appear at random
device nodes, how do you know which is which? Smell?
J
* Re: Greater than 16 xvd devices for blkfront
2008-05-08 22:14 ` Greater than 16 xvd devices for blkfront Jeremy Fitzhardinge
@ 2008-05-08 23:34 ` Daniel P. Berrange
0 siblings, 0 replies; 12+ messages in thread
From: Daniel P. Berrange @ 2008-05-08 23:34 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Chris Wright, Chris Lalancette, Ian Jackson, xen-devel
On Thu, May 08, 2008 at 11:14:34PM +0100, Jeremy Fitzhardinge wrote:
> Chris Wright wrote:
> >Well, we don't actually need 202, or any minor numbers at all. The major
> >is only needed for the case where xvd masquerades as IDE or SCSI.
> >We ripped this wart out for upstream Linux.
>
> I'm considering putting it back in if it makes anyone's life easier. In
> general using labels/uuids is the best way to make an installation
> device-agnostic, but installers might have an easier time with a forged
> scsi device or something. I mentioned it in passing to Al Viro, and he
> was surprisingly non-insulting about the notion.
>
> > And the guest can happily
> >dynamically allocate minor numbers on its own behalf. A disk discovery
> >event can be completely dynamic, the admin just wouldn't be able to
> >guarantee which minor slot gets allocated for a particular disk in
> >a guest. We do have mount by label or UUID.
> >
>
> That's true for filesystems which have already been initialized. But if
> you're attaching 4 new devices to a guest and they appear at random
> device nodes, how do you know which is which? Smell?
Well there's /dev/disk/by-{path,id}. Now there's no udev rules to setup
these links for Xen VBD (afaik), but we could arrange to have some suitable
info used to provide a persistent path under either of those locations.
Dan.
* Xen vbd numbering
2008-05-08 17:03 ` Ian Jackson
@ 2010-02-03 16:50 ` Ian Jackson
0 siblings, 0 replies; 12+ messages in thread
From: Ian Jackson @ 2010-02-03 16:50 UTC (permalink / raw)
To: Chris Wright, Daniel P. Berrange, Chris Lalancette, xen-devel
In May 2008 I wrote:
> Old format:
>
> 202 << 8 | disk << 4 | partition xvd, disks and partitions up to 15
> 8 << 8 | disk << 4 | partition sd, disks and partitions up to 15
> 3 << 8 | disk << 6 | partition hd, disks 0..3, partitions 1..63
>
> New format:
>
> 1 << 28 | disk << 8 | partition xvd, disks or partitions 16 onwards
>
> Reserved for future use:
>
> 2 << 28 onwards
But now that I get down and dirty with some code I discover that
actually what we have is not quite this. Much Linux-specific stuff
has crept in and the result is a mess.
After consultation, what we intend to implement in libxl is as
follows:
* The abstract interface specifies, for each VBD:
* Nominal disk type: Xen virtual disk (aka xvd, the default);
SCSI (sd); IDE (hd). This is for use as a hint by the guest's
device naming scheme.
* Disk number, which is a nonnegative integer,
conventionally starting at 0 for the first disk.
* Partition number, which is a nonnegative integer
where by convention partition 0 indicates the "whole disk".
Normally for any disk _either_ partition 0 should be supplied
in which case the guest is expected to treat it as they would a
native whole disk (for example by putting or expecting a
partition table or disk label on it);
_Or_ only non-0 partitions should be supplied in which case the
guest should expect storage management to be done by the host
and treat each vbd as it would a partition or slice or LVM
volume (for example by putting or expecting a filesystem on
it).
* The syntaxes are, for example:
    d0 d0p0   xvda      Xen virtual disk 0 partition 0 (whole disk)
    d1p2      xvdb2     Xen virtual disk 1 partition 2
    d536p37   xvdtq37   Xen virtual disk 536 partition 37
              sdb3      SCSI disk 1 partition 3
              hdc2      IDE disk 2 partition 2
The d*p* syntax is not supported by xm/xend.
* This is encoded in the concrete interface as an integer (in a
canonical decimal format in xenstore), whose value encodes the
information above as follows:
     1 << 28 | disk << 8 | partition       xvd, disks or partitions 16 onwards
   202 << 8 | disk << 4 | partition        xvd, disks and partitions up to 15
     8 << 8 | disk << 4 | partition        sd, disks and partitions up to 15
     3 << 8 | disk << 6 | partition        hd, disks 0..1, partitions 0..63
    22 << 8 | (disk-2) << 6 | partition    hd, disks 2..3, partitions 0..63
     2 << 28 onwards                       reserved for future use
     other values less than 1 << 28        deprecated / reserved
The 1<<28 format handles disks up to (1<<20)-1 and partitions up to
255. It will be used only where the 202<<8 format does not have
enough bits.
Guests MAY support any subset of the formats above except that if
they support 1<<28 they MUST also support 202<<8.
Some software has provided essentially Linux-specific encodings for
SCSI disks beyond disk 15 partition 15, and IDE disks beyond disk 3
partition 63. These vbds, and the corresponding encoded integers,
are deprecated.
* Guests SHOULD ignore numbers that they do not understand or
recognise. They SHOULD check supplied numbers for validity.
* We know that not all guests conform to this interface. For
example, old Linux systems interpret the integer as
major << 8 | minor
where major and minor are the Linux-specific device numbers,
and some old configurations may depend on deprecated high-numbered
SCSI and IDE disks.
We will therefore preserve the existing facility to specify
the xenstore numerical value directly by putting a single
number (hex, decimal or octal) in the domain config file instead
of the disk identifier.
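For illustration, the identifier syntaxes and the encoding table above can be combined into a short Python sketch. This is not the libxl implementation: the function names are made up, bare identifiers without a /dev/ prefix are assumed, and the raw-number form from the last point is left out.

```python
import re


def _letters_to_disk(letters):
    """a -> 0, z -> 25, aa -> 26, ... (bijective base 26)."""
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("a") + 1)
    return n - 1


def parse_vdev(name):
    """Split a disk identifier into (kind, disk, partition)."""
    m = re.fullmatch(r"d(\d+)(?:p(\d+))?", name)
    if m:
        return "xvd", int(m.group(1)), int(m.group(2) or 0)
    m = re.fullmatch(r"(xvd|sd|hd)([a-z]+)(\d*)", name)
    if m is None:
        raise ValueError("unrecognised disk identifier: %r" % name)
    return m.group(1), _letters_to_disk(m.group(2)), int(m.group(3) or 0)


def encode_vdev(kind, disk, partition):
    """Encode per the table; deprecated or out-of-range cases raise."""
    if kind == "xvd":
        if disk < 16 and partition < 16:
            return 202 << 8 | disk << 4 | partition
        if disk < 1 << 20 and partition < 256:
            return 1 << 28 | disk << 8 | partition
    elif kind == "sd" and disk < 16 and partition < 16:
        return 8 << 8 | disk << 4 | partition
    elif kind == "hd" and disk < 4 and partition < 64:
        if disk < 2:
            return 3 << 8 | disk << 6 | partition
        return 22 << 8 | (disk - 2) << 6 | partition
    raise ValueError("no encoding for %s disk %d partition %d"
                     % (kind, disk, partition))


for ident in ("d0", "xvda", "d536p37", "xvdtq37", "sdb3", "hdc2"):
    print(ident, parse_vdev(ident), encode_vdev(*parse_vdev(ident)))
```

The 202 << 8 form is preferred whenever it has enough bits, so identifiers that old guests already understand keep their existing numbers.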
Ian.