From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from list by lists.gnu.org with archive (Exim 4.71) id 1agmKv-0005nA-EC for mharc-grub-devel@gnu.org; Fri, 18 Mar 2016 00:48:33 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:55242) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1agmKr-0005mq-P7 for grub-devel@gnu.org; Fri, 18 Mar 2016 00:48:31 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1agmKo-0003qX-HN for grub-devel@gnu.org; Fri, 18 Mar 2016 00:48:29 -0400 Received: from sender163-mail.zoho.com ([74.201.84.163]:24664) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1agmKo-0003qS-8o for grub-devel@gnu.org; Fri, 18 Mar 2016 00:48:26 -0400 Received: from JTJLaptop (c-68-59-201-157.hsd1.tn.comcast.net [68.59.201.157]) by mx.zohomail.com with SMTPS id 1458276503782908.387606627065; Thu, 17 Mar 2016 21:48:23 -0700 (PDT) From: "James Johnston" To: "'The development of GNU GRUB'" Subject: RE: [PATCH] getroot: Correctly handle missing btrfs device identifiers. Date: Fri, 18 Mar 2016 04:48:12 -0000 Message-ID: <02a101d180d1$6117e090$2347a1b0$@codenest.com> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Mailer: Microsoft Outlook 14.0 Thread-Index: AdGAxDEk1N5ulx6XQf6rSN6/9N7iyA== Content-Language: en-us X-Zoho-Virus-Status: 1 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 74.201.84.163 X-BeenThere: grub-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: The development of GNU GRUB List-Id: The development of GNU GRUB List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 18 Mar 2016 04:48:31 -0000 Hi Andrei, Thank you for considering and reviewing my patch. I did a little more = checking of the actual btrfs sources in kernel, and also userspace btrfs = tools to try to address your concerns, since the ioctl interface is = basically undocumented. The scenario I am trying to address is a system such as my own, which = reports the following: $ btrfs fi show / Label: 'SystemRoot' uuid: 5c756382-8dc5-4afe-ac87-a604e9e7a9a2 Total devices 2 FS bytes used 2.65GiB devid 2 size 5.60GiB used 3.53GiB path = /dev/mapper/System1RootCrypt devid 3 size 5.60GiB used 3.53GiB path = /dev/mapper/System0RootCrypt btrfs-progs v4.0 As you can see: (1) there are only two devices, (2) they are numbered = with devids #2 and #3. There is no #1 device and if you ask for it with = BTRFS_IOC_DEV_INFO, you get ENODEV. (How did my system get this way? I = converted all the disk partitions to use LVM, one mirror at a time - = this involved deleting/adding btrfs disks. Somehow after this whole = process I realized I had this new numbering scheme.) Yet because there is no device #1, grub_find_root_devices_from_btrfs = returns NULL. This means that "grub-probe --target=3Ddevice /" only = prints one device, not two (and the one device is printed by fallback = code that is called for when grub_find_root_devices_from_btrfs returns = NULL). That is the behavior I am trying to correct here. > -----Original Message----- > From: Andrei Borzenkov > Sent: Thursday, March 17, 2016 17:12 >=20 > 17.03.2016 18:41, James Johnston =D0=BF=D0=B8=D1=88=D0=B5=D1=82: > > From 97bcb9eb8c34329a8ca1ec88fd89296e273ceb24 Mon Sep 17 00:00:00 = 2001 > > From: James Johnston > > Date: Thu, 17 Mar 2016 15:10:49 +0000 > > Subject: [PATCH] getroot: Correctly handle missing btrfs device = identifiers. > > > > The btrfs driver's BTRFS_IOC_FS_INFO ioctl only tells you the = maximum > > device ID and the number of devices. It does not tell you which > > device IDs may no longer be allocated - for example, if a device is > > later deleted from the file system, its device ID won't be allocated > > any more. > > > > You must then probe each device ID with BTRFS_IOC_DEV_INFO to see if > > it is a valid device or not. It is not an error condition for this > > ioctl to return ENODEV: it seems to mean that the device was deleted > > by the user. > > >=20 > Kernel returns ENODEV if device was not found; we do not know why it = was > not found. We need some other way to verify that all devices are > actually present to preserve current behavior. >=20 > Actually we probably need to handle redundant layout where = non-essential > devices are missing but that is another topic. So, I did some investigating to see where the kernel returns ENODEV and = it seems that it just returns this if there just isn't a btrfs device = with that ID in the driver's list of devices. Since the ioctl is not = terribly helpful beyond ENODEV, we just have to dig into the sources to = find out why.=20 Note that it seems the driver list contains physically missing btrfs = devices - if a btrfs device is physically missing, it is not cause to = return ENODEV. Here is the BTRFS_IOC_DEV_INFO ioctl: https://github.com/torvalds/linux/blob/82b666eee71618b7ca812ee529af116582= 617dec/fs/btrfs/ioctl.c#L2746 As you can see, the function only returns ENODEV when btrfs_find_device = returns NULL. Let's examine that and find out: https://github.com/torvalds/linux/blob/82b666eee71618b7ca812ee529af116582= 617dec/fs/btrfs/volumes.c#L6113 As we can see it just examines all devices for the file system and sees = if any match the given devid (since we aren't comparing UUIDs). Seems = obvious enough. What about if a device is missing? Note that a device will still be in = the list even if it is physically missing. It just won't have a device = path (i.e. null). btrfs_find_device_missing_or_by_path function specifically returns = devices marked missing: https://github.com/torvalds/linux/blob/82b666eee71618b7ca812ee529af116582= 617dec/fs/btrfs/volumes.c#L2073 add_missing_dev does the obvious: https://github.com/torvalds/linux/blob/82b666eee71618b7ca812ee529af116582= 617dec/fs/btrfs/volumes.c#L6133 You can search that file for the word "missing" to try to follow along = and see that the driver does track missing devices. In the userspace tools, we can look at btrfs-progs to see how they = enumerate physical devices. The load_device_info function enumerates = devices in basically the same way I do in my patch - I tried to imitate = the userspace tools, figuring I can't go wrong with that: http://git.kernel.org/cgit/linux/kernel/git/kdave/btrfs-progs.git/tree/cm= ds-fi-usage.c?h=3Dv4.4.1#n511 In this loop you can see that ENODEV is not considered a real error. = They just "continue" the loop to the next possible devid. Also you see = they loop to "fi_args.max_id" like how grub does, and just try all the = possible IDs. Noteworthy in the above btrfs userspace load_device_info function is = that they handle physically missing devices: if (!dev_info.path[0]) { strcpy(info[ndevs].path, "missing"); } else { strcpy(info[ndevs].path, (char *)dev_info.path); You can see it when running "btrfs fi show /": Data,RAID1: Size:3.00GiB, Used:2.53GiB /dev/mapper/System0RootCrypt 3.00GiB missing 3.00GiB My patch does no such special check, because the original grub code = didn't do a special check either. In that regard, your request of = "preserving current behavior" has been retained. When grub-probe = encounters a missing btrfs device (as opposed to an unallocated devid), = I get this for "grub-probe -v --target=3Ddevice /" after taking = System1RootCrypt device offline: ./grub-probe: info: cannot open `/boot/grub/device.map': No such file or = directory. ./grub-probe: error: failed to get canonical path of `'. That's because our grub_find_root_devices_from_btrfs function didn't do = any special handling for the null string, which signifies a missing = device. I'm not sure if this is really a desirable error to have happen. But = that's another issue for another patch I think. (How would you like to = see it handled?) In summary: 1. If the device is physically present, you get no error and the path = is returned normally. 2. If the device is absent (e.g. the file system was mounted degraded), = you get a zero length path. 3. If the device ID is not part of the file system to begin with, you = get ENODEV. If I'm getting any of the above wrong, please let me know. :) > Why this temporary variable? Why you cannot check errno directly? >=20 > > + { > > + grub_util_warn(_("btrfs device info could not be read " > > + "for %s, device %d: errno %d"), dir, i, > > + last_error); Because I wanted to log it if there was a problem; this log message will = hopefully help the next guy who has some problem with this function. The "_" is presumably a call to gettext function and I don't know if = gettext might do anything that would reset errno. The safe thing is = just to save it to a temporary and then call gettext via "_". Best regards, James Johnston