* poor performance of mount due to libblkid @ 2007-05-09 22:06 Shapor Naghibzadeh 2007-05-10 0:30 ` Andreas Dilger 0 siblings, 1 reply; 6+ messages in thread From: Shapor Naghibzadeh @ 2007-05-09 22:06 UTC (permalink / raw) To: linux-ext4 There is a serious performance degradation with the mount command after mounting many unique devices when compiled with libblkid support. A simple "mount" command to display the list of mounted filesystem can take minutes to run. This is due to a call to libblkid's blkid_get_cache and a relatively large /etc/blkid.tab (tens of thousands of lines, a few megs in size). The file is able to grow to a large size because it does not know that device mapper devices have been removed and will never be created again. Neither the libblkid api nor blkid command line appear to even provide a facility for removing entries if you wanted to do so manually on device removal. Combined with no (reasonable) bound on the size of the blkid.tab file, this causes the mount command to get slower over time. To make matters worse, the cost of reading the file in to memory is n-squared (which happens every time the mount command is run, even with "-h" for help!). I have provided a simple shell script which reproduces the problem and generates a graph of mount's running time using gnuplot: http://www.shapor.com/libblkid/ I intended to run 100,000 iterations but even 16,000 took almost a day to run on a fast computer (2.8GHz P4). Simply removing the blkid.tab immediately restores the original performance at iteration zero. Conclusions: 1) mount has libblkid support hacked in sloppily. It shouldn't attempt to read the blkid.tab unless it is trying to guess the filesystem type. Even if it is, what is the point? Is reading blkid.tab and parsing xml really an optimization over reading the superblock (which we are about to do when we mount the filesystem) and determining the fs type? It doesn't even seem to help the normal case, and really hurts the worst case badly. If mount is to use the file, it should scan through it only in the case it is actually trying to detect the filesystem type, and stop when it finds the entry. 2) The in-memory data structure of the blkid cache was not designed with scale in mind. It should not have to scan the entire list locate a device, which happens on every insert when reading it. 3) The use of XML in /etc is not very unixy. It is difficult for both computers and humans to parse. This list appeared to be the best place to post my findings with mount and the libblkid component of e2fsprogs. If there is a better place, please let me know. Regards, Shapor ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: poor performance of mount due to libblkid 2007-05-09 22:06 poor performance of mount due to libblkid Shapor Naghibzadeh @ 2007-05-10 0:30 ` Andreas Dilger 2007-05-10 4:45 ` Shapor Naghibzadeh 0 siblings, 1 reply; 6+ messages in thread From: Andreas Dilger @ 2007-05-10 0:30 UTC (permalink / raw) To: Shapor Naghibzadeh; +Cc: linux-ext4 On May 09, 2007 17:06 -0500, Shapor Naghibzadeh wrote: > There is a serious performance degradation with the mount command after > mounting many unique devices when compiled with libblkid support. A simple > "mount" command to display the list of mounted filesystem can take minutes to > run. This is due to a call to libblkid's blkid_get_cache and a > relatively large /etc/blkid.tab (tens of thousands of lines, a few megs in > size). > > The file is able to grow to a large size because it does not know that device > mapper devices have been removed and will never be created again. Is there something unusual about your system or startup scripts that is causing so many entries in /etc/blkid.tab file? > libblkid api nor blkid command line appear to even provide a facility for > removing entries if you wanted to do so manually on device removal. Using "blkid -c /dev/null" skips the cache load, but I also see it doesn't write out a new /etc/blkid.tab file. > Combined with no (reasonable) bound on the size of the blkid.tab file, this > causes the mount command to get slower over time. To make matters worse, the > cost of reading the file in to memory is n-squared (which happens every time > the mount command is run, even with "-h" for help!). I hadn't looked at that previously (it's been a long time since I looked at the blkid code), and I also don't have the mount code handy. Can you be more specific? > 1) mount has libblkid support hacked in sloppily. It shouldn't attempt to > read the blkid.tab unless it is trying to guess the filesystem type. Even if > it is, what is the point? Is reading blkid.tab and parsing xml really an > optimization over reading the superblock (which we are about to do when we > mount the filesystem) and determining the fs type? The reason for libblkid is twofold: - centralize the detection of filesystem types into one library - allow userspace applications to find device content type without needing root or read access to the device (hence reason for /etc/blkid.tab) > It doesn't even seem to > help the normal case, and really hurts the worst case badly. If mount is to > use the file, it should scan through it only in the case it is actually > trying to detect the filesystem type, and stop when it finds the entry. That makes a lot of sense, but that should be sent to the mount(8) maintainer. > 2) The in-memory data structure of the blkid cache was not designed with scale > in mind. It should not have to scan the entire list locate a device, > which happens on every insert when reading it. > > 3) The use of XML in /etc is not very unixy. It is difficult for both > computers and humans to parse. Yeah, but when I wrote it that was what people told me to use. I guess the late 90's was the time when XML was cool. I don't think people would complain too loudly if the blkid code was changed to have a plain-text formatted file, so long as that was not initially the default, and the XML parsing support was kept around for a while to allow apps which are statically linked to libblkid to continue working. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: poor performance of mount due to libblkid 2007-05-10 0:30 ` Andreas Dilger @ 2007-05-10 4:45 ` Shapor Naghibzadeh 2007-05-10 6:44 ` Theodore Tso 0 siblings, 1 reply; 6+ messages in thread From: Shapor Naghibzadeh @ 2007-05-10 4:45 UTC (permalink / raw) To: linux-ext4; +Cc: Andreas Dilger On Wed, May 09, 2007 at 05:30:05PM -0700, Andreas Dilger wrote: > Is there something unusual about your system or startup scripts that is > causing so many entries in /etc/blkid.tab file? This issue came up while doing development work on a snapshot and remote replication project called zumastor (http://zumastor.googlepages.com). Every snapshot is assigned a new snapshot id, and over time the blkid.tab gets polluted with device mapper devices of snapshots that no longer exist named /dev/mapper/vol(n), where n is the snapshot id. It is perfectly reasonable to be able to create, mount, umount, and remove devices without anything left behind. See the shell script at http://www.shapor.com/libblkid/ for a watered down 10 line test case (using dm-linear). > > libblkid api nor blkid command line appear to even provide a facility for > > removing entries if you wanted to do so manually on device removal. > > Using "blkid -c /dev/null" skips the cache load, but I also see it doesn't > write out a new /etc/blkid.tab file. At that point you might as well rm /etc/blkid.tab. If removing the file randomly has no effect on anything, just do without it. Its pretty obvious that it doesn't save I/O or cpu cycles. > > Combined with no (reasonable) bound on the size of the blkid.tab file, this > > causes the mount command to get slower over time. To make matters worse, the > > cost of reading the file in to memory is n-squared (which happens every time > > the mount command is run, even with "-h" for help!). > > I hadn't looked at that previously (it's been a long time since I looked > at the blkid code), and I also don't have the mount code handy. Can you > be more specific? Sure. As blkid_read_cache reads the blkid.tab file, it ends up calling blkid_get_dev for every device name it parses. blkid_get_dev does a linear search on the blkid_cache using strcmp() on each existing entry before adding the new one, hence the n-squared running time. The graph I generated visualizes this quite nicely. > The reason for libblkid is twofold: > - centralize the detection of filesystem types into one library Sounds like a good idea, but it attempts to do far more. > - allow userspace applications to find device content type without needing > root or read access to the device (hence reason for /etc/blkid.tab) A quick 'apt-cache rdepends libblkid1' on Ubuntu returns only the following: pysdm loop-aes-utils dump ocfs2console mount libblkid-dev e2fsprogs I don't think any of those are intended to be used by anyone other than root. Why not just look in /proc/mounts anyway? If its an unmounted device, you must be root in order to do anything with it. The corner-case of a normal user wanting to know the type of filesystem located on a device which was once mounted doesn't make it worth it. It sounds like a solution for a non-problem. It also has the potential to introduce security issues. Its now possible for any user to know the volume label of any usb storage device ever connected to the machine, for example. I doubt users or administrators expect such behavior. > > It doesn't even seem to > > help the normal case, and really hurts the worst case badly. If mount is to > > use the file, it should scan through it only in the case it is actually > > trying to detect the filesystem type, and stop when it finds the entry. > > That makes a lot of sense, but that should be sent to the mount(8) maintainer. The problem is that libblkid doesn't provide that without a n^2 worst case (see above). If the goal is to centralize the detection of filesystem types, it must be used by mount and shouldn't do anything else unless specifically asked to. > > 3) The use of XML in /etc is not very unixy. It is difficult for both > > computers and humans to parse. > > Yeah, but when I wrote it that was what people told me to use. I guess the > late 90's was the time when XML was cool. I don't think people would complain > too loudly if the blkid code was changed to have a plain-text formatted file, > so long as that was not initially the default, and the XML parsing support was > kept around for a while to allow apps which are statically linked to libblkid > to continue working. XML was never cool in my book, especially not anywhere in /etc. ;) I don't see a compelling reason to keep the file around in any format. Switching to plain text doesn't address garbage collecting of removed devices. This definitely worth fixing. I'd be willing to help rid the Linux world of this unix philosophy atrocity. Shapor ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: poor performance of mount due to libblkid 2007-05-10 4:45 ` Shapor Naghibzadeh @ 2007-05-10 6:44 ` Theodore Tso 2007-05-14 21:40 ` Shapor Naghibzadeh 0 siblings, 1 reply; 6+ messages in thread From: Theodore Tso @ 2007-05-10 6:44 UTC (permalink / raw) To: Shapor Naghibzadeh; +Cc: linux-ext4, Andreas Dilger On Wed, May 09, 2007 at 11:45:32PM -0500, Shapor Naghibzadeh wrote: > This issue came up while doing development work on a snapshot and remote > replication project called zumastor (http://zumastor.googlepages.com). Every > snapshot is assigned a new snapshot id, and over time the blkid.tab gets > polluted with device mapper devices of snapshots that no longer exist named > /dev/mapper/vol(n), where n is the snapshot id. OK, this was not a use case we had anticipated --- that there would be a large number of throwaway devices which would appear and then disappear, never to be seen again. Normally device names don't change like this, which is why blkid doesn't end up recording "the volume label of any usb storage device ever connected to the machine", as you put it. The device names of USB storage devices end up getting reused, so in practice what is in blkid.tab is merely the last storage device that was plugged in, not every single one going back forever. The problem is that zumastor is creating names that aren't being reused, and creating more and more of them. That's clearly a problem. One easy way of solving this problem is when we're parsing the file, try to stat the device file, and if it doesn't exist, to skip parsing the line together. This would prevent blkid.tab from growing without bound given your workload. > Sure. As blkid_read_cache reads the blkid.tab file, it ends up > calling blkid_get_dev for every device name it parses. > blkid_get_dev does a linear search on the blkid_cache using strcmp() > on each existing entry before adding the new one, hence the > n-squared running time. The graph I generated visualizes this quite > nicely. Yes, we need to add a better in-memory representation for the blkid.tab file so we don't have to do a linear scan to do the insert. > > The reason for libblkid is twofold: > > - centralize the detection of filesystem types into one library > > > - allow userspace applications to find device content type without needing > > root or read access to the device (hence reason for /etc/blkid.tab) Actually, Andreas missed the most important reason for libblkid, which was to speed up mount-by-label. Before libblkid, if you have 300 filesystems in /etc/fstab all with individual mount labels --- which might be the case if you had a large storage array hooked up to your system, for example --- there mount -a would be an n**2 operation since the mount command for each filesystem would proceed to search all devices looking for the matching label, since the volume label for each device was not being cached. So yes, the O(n**2) of memory operations was bad, but that didn't show up in the case of a few hundred filesystems --- especially compared to the old behavior of where it was O(n**2) disk operations to probe multiple potential superblock locations for each filesyustem. So blkid was solving a very real problem. The whole point of blkid.tab file was so that having searched all of the devices to find the particular filesystem with a specified volume label or UUID, that all of the information that was gathered doesn't have to be searched a next time you need to do a mount-by-uuid or mount-by-label. And if you have a large number of disks that you might have to potentially spin up, you definitely want to keep this cache across boots, which is why we store it in /etc/blkid.tab. > The problem is that libblkid doesn't provide that without a n^2 > worst case (see above). If the goal is to centralize the detection > of filesystem types, it must be used by mount and shouldn't do > anything else unless specifically asked to. The goal of libblkid is a lot more than that. You're right though, it would have been better if mount only tried to read in the blkid cache file if it needs to do a mount-by-label. If it is just trying to do a probe of the filesystem type, the cache doesn't actually help that much. If the last modified entry in the cache is very recent, it will skip revalidation of the entry, but most of the time we always revalidate the cache information before we return it to the user. (It does take less work to revalidate a cache entry, since we don't have to try all possible filesystem types, but instead we only need to verify that the information in the blkid cache file is correct.) > > > 3) The use of XML in /etc is not very unixy. It is difficult for both > > > computers and humans to parse. The reason why I chose XML was that I wanted a format which was relatively easily extensible. In fact the XML parser used by blkid is actually pretty lightweight. I don't particularly care about whether or not humans can parse it easily, since programmatically users should always be going through the blkid library so it can verify the data in the cache as being correct before returning it to the application. So it sounds like the short-term fix is to simply add a test so that if the device isn't present, we should just ignore the entry when we read it into memory. The longer-term fix is use a more sophisticated in-core representation which doesn't have a linear search time, and so that algorithms to detect multiple lines referring to the same device don't take O(n**2). We should also fix mount to avoid having it unconditionally read in the blkid.tab file. The assumption was the overhead for doing so should not be measurable. We could add functions to allow a particular entry to be removed from blkid.tab, but I'd much rather to have that garbage collection be automatically handled without needing any manual calls to specific APi's. Regards, - Ted ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: poor performance of mount due to libblkid 2007-05-10 6:44 ` Theodore Tso @ 2007-05-14 21:40 ` Shapor Naghibzadeh 2007-05-18 3:20 ` Theodore Tso 0 siblings, 1 reply; 6+ messages in thread From: Shapor Naghibzadeh @ 2007-05-14 21:40 UTC (permalink / raw) To: linux-ext4; +Cc: Theodore Tso, Adrian Bunk On Thu, May 10, 2007 at 02:44:48AM -0400, Theodore Tso wrote: > put it. The device names of USB storage devices end up getting > reused, so in practice what is in blkid.tab is merely the last storage > device that was plugged in, not every single one going back forever. My point with the USB example was that it keeps their labels around in a world-readable cache infinitely (or until a device with the same name gets mounted again). Its probably not a security issue in most cases, but its clutter which one doesn't expect to stick around. > One easy way of solving this problem is when we're parsing the file, > try to stat the device file, and if it doesn't exist, to skip parsing > the line together. This would prevent blkid.tab from growing without > bound given your workload. This idea of doing garbage collection every time blkid.tab is read destroys the cache if, for example, you mount /usr or /var before other block devices have been brought up. AoE and nbd come to mind as a potentially large number of devices that might not exist until later in the boot process. > The whole point of blkid.tab file was so that having searched all of > the devices to find the particular filesystem with a specified volume > label or UUID, that all of the information that was gathered doesn't > have to be searched a next time you need to do a mount-by-uuid or > mount-by-label. And if you have a large number of disks that you > might have to potentially spin up, you definitely want to keep this > cache across boots, which is why we store it in /etc/blkid.tab. Ok, but why do we bother caching the filesystem type? The desire to optimize the scanning for UUIDs or labels is indeed a real problem, but caching the filesystem type has the potential for introducing bugs and doesn't seem to have any real payoff. I for one have been bitten by the ext2 to ext3 upgrade bug more than once. There should be a better way of maintaining a UUID and label cache other than having mount keep an XML cache in /etc (which seems to violate the Linux filesystem hierarchy standard). Certainly having it enabled by default when there is no desire to mount by UUID or label is wasteful and probably the most common case. > So it sounds like the short-term fix is to simply add a test so that > if the device isn't present, we should just ignore the entry when we > read it into memory. The longer-term fix is use a more sophisticated > in-core representation which doesn't have a linear search time, and so > that algorithms to detect multiple lines referring to the same device > don't take O(n**2). We should also fix mount to avoid having it > unconditionally read in the blkid.tab file. The assumption was the > overhead for doing so should not be measurable. The first and safest step would seem to be removing the use of blkid.tab from mount except when trying to mount by UUID or volume label to prevent the performance issue when the cache is large. I think garbage collection is more complex to do safely and the whole approach might some re-thinking. Shapor ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: poor performance of mount due to libblkid 2007-05-14 21:40 ` Shapor Naghibzadeh @ 2007-05-18 3:20 ` Theodore Tso 0 siblings, 0 replies; 6+ messages in thread From: Theodore Tso @ 2007-05-18 3:20 UTC (permalink / raw) To: Shapor Naghibzadeh; +Cc: linux-ext4, Adrian Bunk Sorry for the delay in getting back to you; I've been on travel this past week, didn't have much time to keep completely up on e-mail. On Mon, May 14, 2007 at 04:40:26PM -0500, Shapor Naghibzadeh wrote: > My point with the USB example was that it keeps their labels around in a > world-readable cache infinitely (or until a device with the same name gets > mounted again). Its probably not a security issue in most cases, but its > clutter which one doesn't expect to stick around. It's not a lot of clutter in practice. > > try to stat the device file, and if it doesn't exist, to skip parsing > > the line together. This would prevent blkid.tab from growing without > > bound given your workload. > > This idea of doing garbage collection every time blkid.tab is read destroys > the cache if, for example, you mount /usr or /var before other block devices > have been brought up. AoE and nbd come to mind as a potentially large number > of devices that might not exist until later in the boot process. True, for nbd and AoE, that's a real problem. And certainly there's no guarantee that device mapper nodes will be created from the beginning. > > The whole point of blkid.tab file was so that having searched all of > > the devices to find the particular filesystem with a specified volume > > label or UUID, that all of the information that was gathered doesn't > > have to be searched a next time you need to do a mount-by-uuid or > > mount-by-label. And if you have a large number of disks that you > > might have to potentially spin up, you definitely want to keep this > > cache across boots, which is why we store it in /etc/blkid.tab. > > Ok, but why do we bother caching the filesystem type? The desire to optimize > the scanning for UUIDs or labels is indeed a real problem, but caching the > filesystem type has the potential for introducing bugs and doesn't seem to > have any real payoff. I for one have been bitten by the ext2 to ext3 upgrade > bug more than once. First of all, what ext2 to ext3 upgrade bug? What version of blkid/e2fsprogs are you using? It works just fine for me: # blkid /dev/loop0 /dev/loop0: UUID="cc211710-904a-48a4-9073-c84821963931" TYPE="ext2" # tune2fs -O has_journal /dev/loop0 tune2fs 1.40-WIP (14-Nov-2006) Creating journal inode: done This filesystem will be automatically checked every 30 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override. # blkid /dev/loop0 /dev/loop0: UUID="cc211710-904a-48a4-9073-c84821963931" SEC_TYPE="ext2" TYPE="ext3" # tune2fs -O ^has_journal /dev/loop0 tune2fs 1.40-WIP (14-Nov-2006) # blkid /dev/loop0 /dev/loop0: UUID="cc211710-904a-48a4-9073-c84821963931" TYPE="ext2" As far as caching the filesystem type, the goal was to cache everything, since you never know when you might need the information, since doing an exhaustive search could be quite expensive. So we want to cache the label and UUID information whenever we get our hands on it --- and as it turns out, in order to get the filesystem type, getting the label and UUID information comes for free, and contrawise, in order to get the label and UUID information, you need to know the filesystem type first, so you know where to find label and UUID information. Hence, it makes sense for blkid to know how to find the filesystem type information, and in the process of gathering the filesystem type information, it is prudent to the cache the UUID and LABEL information if it is present. After all, if doing so avoids needing to do a brute force search of all of the devicesin the system, it is a net win. > There should be a better way of maintaining a UUID and label cache other than > having mount keep an XML cache in /etc (which seems to violate the Linux > filesystem hierarchy standard). Well, the problem is that /var might not be mounted, and in some applications, it might itself be located on the SAN network. One of the customers that I am working with is doing precisely that, with the blades booting and storing all of their filesystems across a SAN filesystem network. A certain amount of bootstrapping information in /etc is certainly within the sprit of the Linux FHS, and if the root is read-only, it is not a disaster, since the blkid code can handle that case by simply not relying on the cache. Basically, you have to store the information somewhere, and /etc is the only place that is guaranteed to be around when the system is initially coming up. > The first and safest step would seem to be removing the use of blkid.tab from > mount except when trying to mount by UUID or volume label to prevent the > performance issue when the cache is large. I think garbage collection is more > complex to do safely and the whole approach might some re-thinking. I agree that mount should be patched to only read in the blkid information when the blkid library needs to be consulted. I disagree that we shouldn't be caching caching the label and UUID information when it is discovered as a side effect of doing a filesystem type detection; it could be useful later. Probably the right answer is to have an explicit blkid GC operation, callable either from the blkid library API, or via /sbin/blkid -g. This could be called by the init scripts once the system has been brought fully up, or at some other point when a system application finds that it is necessary. Fundamentally, I think the use case that your application brought up where vast number of devices are created and then destroyed is unique enough that if your application needs to explicit request a garbage collection operation, that is an acceptable thing to require of it; it is doing something very, very, strange, after all. - Ted ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2007-05-18 3:20 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-05-09 22:06 poor performance of mount due to libblkid Shapor Naghibzadeh 2007-05-10 0:30 ` Andreas Dilger 2007-05-10 4:45 ` Shapor Naghibzadeh 2007-05-10 6:44 ` Theodore Tso 2007-05-14 21:40 ` Shapor Naghibzadeh 2007-05-18 3:20 ` Theodore Tso
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).