All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Daniel P. Berrange" <berrange@redhat.com>
To: Chris Lalancette <clalance@redhat.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Subject: Re: Greater than 16 xvd devices for blkfront
Date: Wed, 7 May 2008 02:55:02 +0100	[thread overview]
Message-ID: <20080507015502.GA2121@redhat.com> (raw)
In-Reply-To: <48209705.4030005@redhat.com>

On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> All,
>      We've had a number of requests to increase the number of xvd devices that a
> PV guest can have.  Currently, if you try to connect > 16 disks, you get an
> error from xend.  The problem ends up being that both xend and blkfront assume
> that for dev_t, major/minor is 8 bits each, where in fact there are actually 10
> bits for major and 22 bits for minor.
>      Therefore, it shouldn't really be a problem giving lots of disks to guests.
>  The problem is in backwards compatibility, and the details.  What I am
> initially proposing to do is to leave things where they are for /dev/xvd[a-p];
> that is, still put the xenstore entries in the same place, and use 8 bits for
> the major and 8 bits for the minor.  For anything above that, we would end up
> putting the xenstore entry in a different place, and pushing the major into the
> top 10 bits (leaving the bottom 22 bits for the minor); that way old guests
> won't fire when the entry is added, and we will add code to newer guests
> blkfront so that they will fire when they see that entry.  Does anyone see any
> problems with this setup, or have any ideas how to do it better?

Looking at the blkfront code I think we can increase the minor numbers
available for xvdX devices without requiring changes to the where stuff
is stored.

The key is that in blkfront we can reliably detect the overflow triggered
by the 16th disk, because the next major number 203 doesn't clash with
any of the other major numbers blkfront is looking for

Consider the 17th disk, which has name 'xvdq', this gives a device number
in xenstore of '51968'.

Upon seeing this, current blkfront code will use

   #define BLKIF_MAJOR(dev) ((dev)>>8)
   #define BLKIF_MINOR(dev) ((dev) & 0xff)

And so get back major number of 203 and minor number of '0'. 

In the xlbd_get_major_info(int vdevice) function, it has a switch on major
numbers and the xvdX case is handled as the default

        major = BLKIF_MAJOR(vdevice);
        minor = BLKIF_MINOR(vdevice);

        switch (major) {
        case IDE0_MAJOR: index = 0; break;
         ....snipped...
        case IDE9_MAJOR: index = 9; break;
        case SCSI_DISK0_MAJOR: index = 10; break;
        case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR:
                index = 11 + major - SCSI_DISK1_MAJOR;
                break;
        case SCSI_CDROM_MAJOR: index = 18; break;
        default: index = 19; break;
        }


So, the 17th disk in fact gets treated as 1st disk and the front end assigns
it the name 'xvda', and then promptly kernel panics because xvda already
exists in sysfs. 

kobject_add failed for xvda with -EEXIST, don't try to register things with the same name in the same directory.

Call Trace:
 [<ffffffff80336951>] kobject_add+0x16e/0x199
 [<ffffffff8025ce3c>] exact_lock+0x0/0x14
 [<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
 [<ffffffff802f393e>] register_disk+0x43/0x198
 [<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8032e453>] add_disk+0x34/0x3d
 [<ffffffff88074eb8>] :xenblk:backend_changed+0x110/0x193
 [<ffffffff803a4029>] xenbus_read_driver_state+0x26/0x3b

Now, this kernel panic isn't a huge problem (though it ought to handle the 
kobject_add gracefully), because we can never do anything to make existing
frontends deal with > 16 disks. If an admin tries to add more than 16 disks
to an existing guest they should already expect doom.

For future frontends though, it looks like we can adapt the switch(major)
in xlbd_get_major_info(), so that it detects the overflow of minor numbers,
and re-adjusts the major/minor numbers to their intended value:


eg   change

        case SCSI_CDROM_MAJOR: index = 18; break;
        default: index = 19; break;
        }


to

        case SCSI_CDROM_MAJOR: index = 18; break;
        default: 
              index = 19;
              if (major > XLBD_MAJOR_VBD_START) {
                  minor += 16 * (major - XLBD_MAJOR_VBD_START);
                  major = XLBD_MAJOR_VBD_START;
              }
              break;
        }


Now, I've not actually tested this, and there's a few other places in blkfront
needing similar tweaks but I don't see anything in the code which fundamentally
stops this overflow detection & fixup.

As far as the XenD backend is concerned, all we need todo is edit the XenD 
blkdev_name_to_number() function in tools/python/xen/util/blkif.py to relax
the regex to allow > xvdp. And adapt the math so it overflows onto the major
numbers following XVD's 202. In 

eg, change

    if re.match( '/dev/xvd[a-p]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)

to

    if re.match( '/dev/xvd[a-z]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)

gets you to 26 disks. This is how I got the gues to boot and front end to 
crash on the 17th disk 'xvdq'. It is a little more complex to cope with 
2-letter drives, but no show stopper there.

So, unless I'm missing something obvious we can keep compatability with
existing guests for the first 16 disks and still (indirectly) make use 
of a 22/12  dev_t split for the 17th+ disk, without needing to change 
how or where stuff is stored in XenStore. 

Regards,
Daniel.
-- 
|: Red Hat, Engineering, Boston   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

  parent reply	other threads:[~2008-05-07  1:55 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-05-06 17:36 Greater than 16 xvd devices for blkfront Chris Lalancette
2008-05-06 17:45 ` Daniel P. Berrange
2008-05-07 16:04   ` Chris Wright
2008-05-07  1:55 ` Daniel P. Berrange [this message]
2008-05-07  3:47   ` Daniel P. Berrange
2008-05-07 16:40     ` Chris Wright
2008-05-08  9:30       ` Ian Jackson
2008-05-08 15:33         ` Chris Wright
2008-05-08 17:03           ` Ian Jackson
2010-02-03 16:50             ` Xen vbd numbering Ian Jackson
2008-05-08 22:14           ` Greater than 16 xvd devices for blkfront Jeremy Fitzhardinge
2008-05-08 23:34             ` Daniel P. Berrange

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080507015502.GA2121@redhat.com \
    --to=berrange@redhat.com \
    --cc=clalance@redhat.com \
    --cc=xen-devel@lists.xensource.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.