* Monitoring Btrfs
From: Stefan Malte Schumacher
Date: 2016-10-17 16:44 UTC
To: linux-btrfs
Hello
I would like to monitor my btrfs-filesystem for missing drives. On
Debian mdadm uses a script in /etc/cron.daily, which calls mdadm and
sends an email if anything is wrong with the array. I would like to do
the same with btrfs. In my first attempt I grepped and cut the
information from "btrfs fi show" and let the script send an email if
the number of devices was not equal to the preselected number.
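
Roughly along these lines (a sketch of that first attempt; the mount
point, expected count and mail recipient are placeholders):

    #!/bin/sh
    # compare the "Total devices" figure from btrfs fi show against an expected value
    EXPECTED=6
    FOUND=$(btrfs filesystem show /mnt/data | awk '/Total devices/ {print $3}')
    if [ "$FOUND" != "$EXPECTED" ]; then
        echo "btrfs reports $FOUND devices, expected $EXPECTED" | mail -s "btrfs warning" root
    fi
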
Then I saw this:
ubuntu@ubuntu:~$ sudo btrfs filesystem show
Label: none uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
Total devices 6 FS bytes used 5.47TiB
devid 1 size 1.81TiB used 1.71TiB path /dev/sda3
devid 2 size 1.81TiB used 1.71TiB path /dev/sdb3
devid 3 size 1.82TiB used 1.72TiB path /dev/sdc1
devid 4 size 1.82TiB used 1.72TiB path /dev/sdd1
devid 5 size 2.73TiB used 2.62TiB path /dev/sde1
*** Some devices missing
on this page: https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
The number of devices is still at 6, despite the fact that one of the
drives is missing, which means that my first idea doesn't work. I have
two questions:
1) Has anybody already written a script like this? After all, there is
no need to reinvent the wheel a second time.
2) What should I best grep for? In this case I would just go for the
"missing". Does this cover all possible outputs of btrfs fi show in
case of a damaged array? What other outputs do I need to consider for
my script?
Yours sincerely
Stefan
* Re: Monitoring Btrfs
From: Austin S. Hemmelgarn
Date: 2016-10-17 17:23 UTC
To: Stefan Malte Schumacher, linux-btrfs

On 2016-10-17 12:44, Stefan Malte Schumacher wrote:
> Hello
>
> I would like to monitor my btrfs-filesystem for missing drives. On
> Debian mdadm uses a script in /etc/cron.daily, which calls mdadm and
> sends an email if anything is wrong with the array. I would like to do
> the same with btrfs. In my first attempt I grepped and cut the
> information from "btrfs fi show" and let the script send an email if
> the number of devices was not equal to the preselected number.
>
> Then I saw this:
>
> ubuntu@ubuntu:~$ sudo btrfs filesystem show
> Label: none uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
> Total devices 6 FS bytes used 5.47TiB
> devid 1 size 1.81TiB used 1.71TiB path /dev/sda3
> devid 2 size 1.81TiB used 1.71TiB path /dev/sdb3
> devid 3 size 1.82TiB used 1.72TiB path /dev/sdc1
> devid 4 size 1.82TiB used 1.72TiB path /dev/sdd1
> devid 5 size 2.73TiB used 2.62TiB path /dev/sde1
> *** Some devices missing
>
> on this page: https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
> The number of devices is still at 6, despite the fact that one of the
> drives is missing, which means that my first idea doesnt work.
This is actually correct behavior, the filesystem reports that it should
have 6 devices, which is how it knows a device is missing.

> I have
> two questions:
> 1) Has anybody already written a script like this? After all, there is
> no need to reinvent the wheel a second time.
Not that I know of, but I may be wrong.

> 2) What should I best grep for? In this case I would just go for the
> "missing". Does this cover all possible outputs of btrfs fi show in
> case of a damaged array? What other outputs do I need to consider for
> my script.
That should catch any case of a failed device. It will not catch things
like devices being out of sync or at-rest data corruption. In general,
you should be running scrub regularly to check for those conditions (and
fix them if they have happened).

FWIW, what I watch is the filesystem flags (it will go read-only if it
becomes degraded), and the filesystem size (which will change in most
cases as disks are added or removed). I also have regular SMART status
checks on the disks themselves too though, so even aside from BTRFS,
I'll know if a disk has failed (or thinks it's failed) pretty quickly.

Also, you may want to look into something like Monit
(https://mmonit.com/monit/) to handle the monitoring. It lets you define
all your monitoring requirements in a single file (or multiple if you
prefer), and provides the infrastructure to handle e-mail notifications
(including the ability to cache messages when the upstream mail server
is down).
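
The check that such a tool runs can be as small as a script that exits
non-zero when something looks wrong, with Monit or cron doing the
alerting. A minimal sketch (the mount point is just an example):

    #!/bin/sh
    # exit non-zero if "btrfs fi show" reports missing devices for this filesystem
    btrfs filesystem show /mnt/data | grep -qi 'devices missing' && exit 1
    exit 0
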
* Re: Monitoring Btrfs
From: Anand Jain
Date: 2016-10-18 3:23 UTC
To: Austin S. Hemmelgarn, Stefan Malte Schumacher, linux-btrfs, David Sterba

>> I would like to monitor my btrfs-filesystem for missing drives.

> This is actually correct behavior, the filesystem reports that it should
> have 6 devices, which is how it knows a device is missing.

Missing - means missing at the time of mount. So how are you planning
to monitor a disk which fails while in production?

Good luck
Anand
* Re: Monitoring Btrfs
From: Austin S. Hemmelgarn
Date: 2016-10-18 12:39 UTC
To: Anand Jain, Stefan Malte Schumacher, linux-btrfs, David Sterba

On 2016-10-17 23:23, Anand Jain wrote:
>>> I would like to monitor my btrfs-filesystem for missing drives.
>
>> This is actually correct behavior, the filesystem reports that it should
>> have 6 devices, which is how it knows a device is missing.
>
> Missing - means missing at the time of mount. So how are you planning
> to monitor a disk which is failed while in production ?
No, in `btrfs fi show` it means that it can't find the device. All that
`fi show` does is print out what info it can find about the filesystem,
nothing more, nothing less. It's trivial to see from the output (two
different ways I might add) that you're missing devices and how many
you still have. The only way without poking at the FS directly to
figure out how many devices the FS is supposed to have (or at least,
how many it thinks it should have) is the device count output by
`btrfs fi show`.

Now, for production usage, you have three things you should be monitoring:
1. Output from `btrfs dev stats`. This reports per-device error
counters, and is one of the best ways to see if something is wrong, and
also gives you a decent indicator of exactly what is wrong.
2. Status from regular scrub operations. Pretty self explanatory.
3. SMART status of the underlying devices themselves. This will catch
pre-failure conditions, and the direct access from smartctl will error
out when the drive has failed to the point of not being present.

You can additionally monitor:
1. Filesystem flags. These will change when the filesystem goes
degraded, and it's actually good practice for any filesystem, not just
BTRFS.
2. Total filesystem size. If this changes without manual intervention,
something is seriously wrong.
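
As a rough illustration of points 1 and 3 above (untested; the mount
point and device names are examples, and the SMART part only checks the
overall health status via smartctl's exit code):

    #!/bin/sh
    # mail any non-zero btrfs error counters and any SMART health complaints
    FS=/mnt/data
    errs=$(btrfs dev stats "$FS" | awk '$2 != 0')
    [ -n "$errs" ] && printf '%s\n' "$errs" | mail -s "btrfs dev stats: errors on $FS" root
    for dev in /dev/sda /dev/sdb; do
        smartctl -H "$dev" > /dev/null || \
            echo "$dev: smartctl -H reported a problem" | mail -s "SMART warning: $dev" root
    done
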
* Re: Monitoring Btrfs
From: Anand Jain
Date: 2016-10-18 21:36 UTC
To: Austin S. Hemmelgarn, Stefan Malte Schumacher, linux-btrfs, David Sterba

>>>> I would like to monitor my btrfs-filesystem for missing drives.

>>> This is actually correct behavior, the filesystem reports that it should
>>> have 6 devices, which is how it knows a device is missing.

>> Missing - means missing at the time of mount. So how are you planning
>> to monitor a disk which is failed while in production ?

> No, in `btrfs fi show` it means that it can't find the device.

'btrfs fi show' is misleading as compared to 'btrfs fi show -m'.
-m gives the btrfs-kernel perspective of the devices; as of now
there is no code in the kernel which changes the device status
while it is mounted (except for readonly, which is irrelevant in
raid1 with 1 disk failed).

> 1. Filesystem flags. These will change when the filesystem goes
> degraded,

Which flag is in question here?
* Re: Monitoring Btrfs
From: Austin S. Hemmelgarn
Date: 2016-10-19 11:15 UTC
To: Anand Jain, Stefan Malte Schumacher, linux-btrfs, David Sterba

On 2016-10-18 17:36, Anand Jain wrote:
>>>>> I would like to monitor my btrfs-filesystem for missing drives.
>
>>>> This is actually correct behavior, the filesystem reports that it
>>>> should have 6 devices, which is how it knows a device is missing.
>
>>> Missing - means missing at the time of mount. So how are you planning
>>> to monitor a disk which is failed while in production ?
>
>> No, in `btrfs fi show` it means that it can't find the device.
>
> 'btrfs fi show' is miss-leading as compared to 'btrfs fi show -m'
> -m tells btrfs-kernel perspective of the devices, as of now
> there is no code in the kernel which changes the device status
> while its mounted (expect for readonly, which is irrelevant in
> raid1 with 1 disk failed).
Actually, that's exactly how I would expect each of them to behave. We
need some way to get both the state the kernel thinks the FS is in, and
the state it's actually in (according to the tools, not the kernel),
and '-m' reporting kernel state while no '-m' reports actual state is
exactly what I would expect in this case.

That leads also to another way I hadn't thought of to monitor a
filesystem. The output of 'fi show' with and without '-m' should match
if the filesystem was healthy when mounted and is still healthy; if
they don't, then something is wrong.

>> 1. Filesystem flags. These will change when the filesystem goes
>> degraded,
>
> Which flag is in question here. ?
I should clarify here, I mean the mount options, I'm just used to the
monit terminology (which was not well picked in this case). The big one
to watch is the read-only flag, as BTRFS will force a filesystem
read-only (which updates the mount options). Any change to the mount
options though without manual intervention is generally a sign that
_something_ is wrong.
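
Going back to the 'fi show' vs. 'fi show -m' idea above, a crude way to
script the comparison might look like this (an untested bash sketch;
/mnt/data is just an example, and whether the two outputs stay
line-for-line comparable in every failure mode would need checking):

    #!/bin/bash
    # warn if the scanned view and the kernel view of the device list disagree
    if ! diff <(btrfs fi show /mnt/data | grep devid | sort) \
              <(btrfs fi show -m /mnt/data | grep devid | sort) > /dev/null; then
        echo "btrfs fi show and fi show -m disagree for /mnt/data" | mail -s "btrfs warning" root
    fi
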
* Re: Monitoring Btrfs
From: Anand Jain
Date: 2016-10-19 13:06 UTC
To: Austin S. Hemmelgarn, Stefan Malte Schumacher, linux-btrfs, David Sterba

On 10/19/16 19:15, Austin S. Hemmelgarn wrote:
[...]
> Actually, that's exactly how I would expect each of them to behave. We
> need some way to get both the state the kernel thinks the FS is in, and
> the state it's actually in (according to the tools, not the kernel),
> and '-m' reporting kernel state while no '-m' reports actual state is
> exactly what I would expect in this case.
>
> That leads also to another way I hadn't thought of to monitor a
> filesystem. The output of 'fi show' with and without '-m' should match
> if the filesystem was healthy when mounted and is still healthy; if
> they don't, then something is wrong.
>
> I should clarify here, I mean the mount options, I'm just used to the
> monit terminology (which was not well picked in this case). The big one
> to watch is the read-only flag, as BTRFS will force a filesystem
> read-only (which updates the mount options). Any change to the mount
> options though without manual intervention is generally a sign that
> _something_ is wrong.

btrfs-progs shouldn't add its own intelligence in determining the
device state; it should be a transparent tool that reports status from
the btrfs kernel code. So I am opposed to patches such as

  commit 206efb60cbe3049e0d44c6da3c1909aeee18f813
  btrfs-progs: Add missing devices check for mounted btrfs.

There are many ways a device can fail/recover in a SAN environment;
this device-state-managing intelligence should live in one place, and
in the kernel. The volume manager part of the code in the kernel
is incomplete.
* Re: Monitoring Btrfs
From: Austin S. Hemmelgarn
Date: 2016-10-19 13:33 UTC
To: Anand Jain, Stefan Malte Schumacher, linux-btrfs, David Sterba

On 2016-10-19 09:06, Anand Jain wrote:
[...]
> btrfs-progs shouldn't add its own intelligence in determining the
> device state, it should be a transparent tool to report status from
> the btrfs-kernel. So I opposed to the patches such as
>
> commit 206efb60cbe3049e0d44c6da3c1909aeee18f813
> btrfs-progs: Add missing devices check for mounted btrfs.
>
> There are many ways a device can fail/recover in the SAN environment,
> these device state managing intelligence should be at one place and
> in the kernel. The volume manager part of the code in the kernel
> is incomplete.
>
I don't agree that the management should be completely unified or that
the tools should just report kernel state. The tools have to have some
way to check device state for unmounted filesystems because they have
to operate on unmounted filesystems, and because until the kernel gets
smart enough to actually handle device state properly, some method is
needed to check the actual state of the devices. Even once the kernel
is smart enough, it's still helpful to see without mounting a
filesystem whether or not all the devices are there, and if we ever
switch to a real mount helper (which I am in favor of for multiple
reasons), we'll need device state checking in userspace for that too.

Take a look at LVM. The separation of responsibilities there is ideally
what we should be looking at long term for BTRFS. The userspace
components tell the kernel what to do, and list both kernel state _and_
physical state in a readable manner. The kernel tracks limited parts of
the state (only for active LVs, so the equivalent of mounted
filesystems, and even then only what it needs to track: is this RAID
volume in sync? Is that snapshot or thin storage pool getting close to
full?), and sends notifications to a userspace component which then
acts on those conditions (possibly then telling the kernel what to do
in response to them). On top of that, the userspace components don't
require a kernel which supports them for any off-line operations, and
the kernel works fine with older userspace. Both userspace and the
kernel handle missing devices (userspace tools report them, the kernel
refuses to activate LVs that require them).
* Re: Monitoring Btrfs
From: Anand Jain
Date: 2016-10-19 21:38 UTC
To: Austin S. Hemmelgarn, Stefan Malte Schumacher, linux-btrfs, David Sterba

On 10/19/16 21:33, Austin S. Hemmelgarn wrote:
[...]
> I don't agree that the management should be completely unified or that
> the tools should just report kernel state. The tools have to have some
> way to check device state for unmounted filesystems because they have
> to operate on unmounted filesystems, and because until the kernel gets
> smart enough to actually handle device state properly, some method is
> needed to check the actual state of the devices. Even once the kernel
> is smart enough, it's still helpful to see without mounting a
> filesystem whether or not all the devices are there, and if we ever
> switch to a real mount helper (which I am in favor of for multiple
> reasons), we'll need device state checking in userspace for that too.

That is a bit out of context. Here it is about monitoring devices while
the FS is mounted; in this context, if a tool applies its own
intelligence without the kernel, then that's wrong.

> Take a look at LVM. The separation of responsibilities there is ideally
> what we should be looking at long term for BTRFS. [...]
* Re: Monitoring Btrfs
From: Zygo Blaxell
Date: 2016-10-17 17:41 UTC
To: Stefan Malte Schumacher
Cc: linux-btrfs

On Mon, Oct 17, 2016 at 06:44:14PM +0200, Stefan Malte Schumacher wrote:
> Hello
>
> I would like to monitor my btrfs-filesystem for missing drives. On
> Debian mdadm uses a script in /etc/cron.daily, which calls mdadm and
> sends an email if anything is wrong with the array. I would like to do
> the same with btrfs. In my first attempt I grepped and cut the
> information from "btrfs fi show" and let the script send an email if
> the number of devices was not equal to the preselected number.
>
> Then I saw this:
>
> ubuntu@ubuntu:~$ sudo btrfs filesystem show
> Label: none uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
> Total devices 6 FS bytes used 5.47TiB
> devid 1 size 1.81TiB used 1.71TiB path /dev/sda3
> devid 2 size 1.81TiB used 1.71TiB path /dev/sdb3
> devid 3 size 1.82TiB used 1.72TiB path /dev/sdc1
> devid 4 size 1.82TiB used 1.72TiB path /dev/sdd1
> devid 5 size 2.73TiB used 2.62TiB path /dev/sde1
> *** Some devices missing
>
> on this page: https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
> The number of devices is still at 6, despite the fact that one of the
> drives is missing, which means that my first idea doesnt work.
Using fi show for this isn't a good idea. By the time btrfs fi show
tells you something is different from the norm, you've probably already
crashed at least once and are now mounting with the 'degraded' option.

> I have
> two questions:
> 1) Has anybody already written a script like this? After all, there is
> no need to reinvent the wheel a second time.
> 2) What should I best grep for? In this case I would just go for the
> "missing". Does this cover all possible outputs of btrfs fi show in
> case of a damaged array? What other outputs do I need to consider for
> my script.
I monitor the device error counters, i.e. the output of

	for fs in /fs1 /fs2 /fs3... ; do
		btrfs dev stat "$fs" | grep -v " 0$"
	done

and send an email when it isn't empty. When there are errors I
investigate in more detail (is it a failing disk? failed disk? bad
cables? bad RAM? One-off UNC sector that can be ignored?), fix any
problems (i.e. replace hardware, run scrub), and reset the counters to
zero with 'btrfs dev stat -z'.

> Yours sincerely
> Stefan
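
For what it's worth, one way to wire that up as a daily cron job (a
sketch; the mount points and mail recipient are examples):

    #!/bin/sh
    # collect non-zero btrfs device error counters and mail them if any are found
    out=$(for fs in /mnt/a /mnt/b; do btrfs dev stat "$fs" | grep -v " 0$"; done)
    [ -n "$out" ] && printf '%s\n' "$out" | mail -s "btrfs dev stat errors" root
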
* Re: Monitoring Btrfs
From: Kyle Manna
Date: 2016-10-17 17:55 UTC
To: Stefan Malte Schumacher
Cc: linux-btrfs

On Mon, Oct 17, 2016 at 9:44 AM, Stefan Malte Schumacher
<stefan.m.schumacher@gmail.com> wrote:
> Hello
>
> I would like to monitor my btrfs-filesystem for missing drives. On
> Debian mdadm uses a script in /etc/cron.daily, which calls mdadm and
> sends an email if anything is wrong with the array. I would like to do
> the same with btrfs. In my first attempt I grepped and cut the
> information from "btrfs fi show" and let the script send an email if
> the number of devices was not equal to the preselected number.
>
> ...
>
> 1) Has anybody already written a script like this? After all, there is
> no need to reinvent the wheel a second time.

I don't have a solution to your primary question regarding message
parsing, but I do something different which may offer another
perspective on your monitoring and reporting.

I employ systemd with timers to scrub my btrfs volumes[0][1] every
week. I used to use either an OnFailure[2] trigger or my failure
monitor log (aka systemd-journal) parser[3] to send me emails if the
service failed to run. This is a more "modern" approach to cron.weekly
+ custom shell script for people that like systemd, love it or hate it.

Recently I dropped the systemd journal parser in favour of remote
logging with rsyslog + Papertrail[4], with a few alerts for things like
"systemd Failed to start", which indicates that the script returned a
non-zero exit code. Papertrail then emails me when any of a handful of
machines trips up.

It's also worth noting that logstash[5] (or similar) may be another way
to parse log files. It could be a bloated, overkill solution for
something that a 10-line shell script could accomplish; it depends on
whether you leverage it for things beyond basic log parsing.

[0] https://github.com/kylemanna/systemd-utils/blob/master/units/btrfs-scrub.service
[1] https://github.com/kylemanna/systemd-utils/blob/master/units/btrfs-scrub.timer
[2] https://github.com/kylemanna/systemd-utils/tree/master/onfailure
[3] https://github.com/kylemanna/systemd-utils/tree/master/failure-monitor
[4] https://blog.kylemanna.com/linux/logging-all-the-things-with-rsyslog-and-papertrail/
[5] https://www.elastic.co/guide/en/logstash/current/introduction.html
* Re: Monitoring Btrfs
From: Chris Murphy
Date: 2016-10-17 20:40 UTC
To: Btrfs BTRFS

It may be better to use /sys/fs/btrfs/<uuid>/devices to find the
devices to monitor, and then monitor them with blktrace - maybe there's
some coarser granularity available there, I'm not sure. The thing is,
as far as Btrfs alone is concerned, a drive can be "bad" and you're
effectively degraded, while the drive is not missing. Unless it's
physically removed or somehow dead, it'll still be seen but can produce
all kinds of mayhem.

Chris Murphy
* Re: Monitoring Btrfs
From: Austin S. Hemmelgarn
Date: 2016-10-18 12:41 UTC
To: Btrfs BTRFS

On 2016-10-17 16:40, Chris Murphy wrote:
> May be better to use /sys/fs/btrfs/<uuid>/devices to find the device
> to monitor, and then monitor them with blktrace - maybe there's some
> courser granularity available there, I'm not sure. The thing is, as
> far as Btrfs alone is concerned, a drive can be "bad" and you're
> effectively degraded, while the drive is not missing. Unless it's
> physical removed or somehow dead, it'll still be seen but can produce
> all kinds of mayhem.
This is exactly why you should be monitoring the disks themselves, not
just BTRFS. I would not advise using blktrace for monitoring in
production though, it technically risks an information leak, and it's
not exactly low impact.
* Re: Monitoring Btrfs
From: Stefan Malte Schumacher
Date: 2016-10-19 22:46 UTC
To: linux-btrfs

Thanks to everybody for the help. I have been doing regular SMART tests
for quite some time and I also let scrub run on a regular basis. I will
use a simple shell script which inspects the output of "btrfs fi show"
for lines containing "missing", and another which checks "btrfs dev
stats" for non-zero entries. Thanks everybody for your advice.

Stefan

2016-10-17 18:44 GMT+02:00 Stefan Malte Schumacher
<stefan.m.schumacher@gmail.com>:
> Hello
>
> I would like to monitor my btrfs-filesystem for missing drives. [...]
>
> Yours sincerely
> Stefan