* Monitoring Btrfs
From: Stefan Malte Schumacher
Date: 2016-10-17 16:44 UTC
To: linux-btrfs
Hello
I would like to monitor my btrfs-filesystem for missing drives. On
Debian mdadm uses a script in /etc/cron.daily, which calls mdadm and
sends an email if anything is wrong with the array. I would like to do
the same with btrfs. In my first attempt I grepped and cut the
information from "btrfs fi show" and let the script send an email if
the number of devices was not equal to the preselected number.
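
Roughly along these lines (a sketch of that first attempt; the mount
point, expected count and mail recipient are placeholders):

    #!/bin/sh
    # compare the "Total devices" figure from btrfs fi show against an expected value
    EXPECTED=6
    FOUND=$(btrfs filesystem show /mnt/data | awk '/Total devices/ {print $3}')
    if [ "$FOUND" != "$EXPECTED" ]; then
        echo "btrfs reports $FOUND devices, expected $EXPECTED" | mail -s "btrfs warning" root
    fi
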
Then I saw this:
ubuntu@ubuntu:~$ sudo btrfs filesystem show
Label: none uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
Total devices 6 FS bytes used 5.47TiB
devid 1 size 1.81TiB used 1.71TiB path /dev/sda3
devid 2 size 1.81TiB used 1.71TiB path /dev/sdb3
devid 3 size 1.82TiB used 1.72TiB path /dev/sdc1
devid 4 size 1.82TiB used 1.72TiB path /dev/sdd1
devid 5 size 2.73TiB used 2.62TiB path /dev/sde1
*** Some devices missing
on this page: https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
The number of devices is still at 6, despite the fact that one of the
drives is missing, which means that my first idea doesn't work. I have
two questions:
1) Has anybody already written a script like this? After all, there is
no need to reinvent the wheel a second time.
2) What should I best grep for? In this case I would just go for the
"missing". Does this cover all possible outputs of btrfs fi show in
case of a damaged array? What other outputs do I need to consider for
my script?
Yours sincerely
Stefan
* Re: Monitoring Btrfs
From: Austin S. Hemmelgarn
Date: 2016-10-17 17:23 UTC
To: Stefan Malte Schumacher, linux-btrfs

On 2016-10-17 12:44, Stefan Malte Schumacher wrote:
> Hello
>
> I would like to monitor my btrfs-filesystem for missing drives. On
> Debian mdadm uses a script in /etc/cron.daily, which calls mdadm and
> sends an email if anything is wrong with the array. I would like to do
> the same with btrfs. In my first attempt I grepped and cut the
> information from "btrfs fi show" and let the script send an email if
> the number of devices was not equal to the preselected number.
>
> Then I saw this:
>
> ubuntu@ubuntu:~$ sudo btrfs filesystem show
> Label: none uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
> Total devices 6 FS bytes used 5.47TiB
> devid 1 size 1.81TiB used 1.71TiB path /dev/sda3
> devid 2 size 1.81TiB used 1.71TiB path /dev/sdb3
> devid 3 size 1.82TiB used 1.72TiB path /dev/sdc1
> devid 4 size 1.82TiB used 1.72TiB path /dev/sdd1
> devid 5 size 2.73TiB used 2.62TiB path /dev/sde1
> *** Some devices missing
>
> on this page: https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
> The number of devices is still at 6, despite the fact that one of the
> drives is missing, which means that my first idea doesnt work.
This is actually correct behavior, the filesystem reports that it should
have 6 devices, which is how it knows a device is missing.

> I have
> two questions:
> 1) Has anybody already written a script like this? After all, there is
> no need to reinvent the wheel a second time.
Not that I know of, but I may be wrong.

> 2) What should I best grep for? In this case I would just go for the
> "missing". Does this cover all possible outputs of btrfs fi show in
> case of a damaged array? What other outputs do I need to consider for
> my script.
That should catch any case of a failed device. It will not catch things
like devices being out of sync or at-rest data corruption. In general,
you should be running scrub regularly to check for those conditions (and
fix them if they have happened).

FWIW, what I watch is the filesystem flags (it will go read-only if it
becomes degraded), and the filesystem size (which will change in most
cases as disks are added or removed). I also have regular SMART status
checks on the disks themselves too though, so even aside from BTRFS,
I'll know if a disk has failed (or thinks it's failed) pretty quickly.

Also, you may want to look into something like Monit
(https://mmonit.com/monit/) to handle the monitoring. It lets you define
all your monitoring requirements in a single file (or multiple if you
prefer), and provides the infrastructure to handle e-mail notifications
(including the ability to cache messages when the upstream mail server
is down).
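
The check that such a tool runs can be as small as a script that exits
non-zero when something looks wrong, with Monit or cron doing the
alerting. A minimal sketch (the mount point is just an example):

    #!/bin/sh
    # exit non-zero if "btrfs fi show" reports missing devices for this filesystem
    btrfs filesystem show /mnt/data | grep -qi 'devices missing' && exit 1
    exit 0
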
* Re: Monitoring Btrfs
From: Anand Jain
Date: 2016-10-18 3:23 UTC
To: Austin S. Hemmelgarn, Stefan Malte Schumacher, linux-btrfs, David Sterba

>> I would like to monitor my btrfs-filesystem for missing drives.

> This is actually correct behavior, the filesystem reports that it should
> have 6 devices, which is how it knows a device is missing.

Missing - means missing at the time of mount. So how are you planning
to monitor a disk which fails while in production?

Good luck
Anand
* Re: Monitoring Btrfs
From: Austin S. Hemmelgarn
Date: 2016-10-18 12:39 UTC
To: Anand Jain, Stefan Malte Schumacher, linux-btrfs, David Sterba

On 2016-10-17 23:23, Anand Jain wrote:
>>> I would like to monitor my btrfs-filesystem for missing drives.
>
>> This is actually correct behavior, the filesystem reports that it should
>> have 6 devices, which is how it knows a device is missing.
>
> Missing - means missing at the time of mount. So how are you planning
> to monitor a disk which is failed while in production ?
No, in `btrfs fi show` it means that it can't find the device. All that
`fi show` does is print out what info it can find about the filesystem,
nothing more, nothing less. It's trivial to see from the output (two
different ways I might add) that you're missing devices and how many
you still have. The only way without poking at the FS directly to
figure out how many devices the FS is supposed to have (or at least,
how many it thinks it should have) is the device count output by
`btrfs fi show`.

Now, for production usage, you have three things you should be monitoring:
1. Output from `btrfs dev stats`. This reports per-device error
counters, and is one of the best ways to see if something is wrong, and
also gives you a decent indicator of exactly what is wrong.
2. Status from regular scrub operations. Pretty self explanatory.
3. SMART status of the underlying devices themselves. This will catch
pre-failure conditions, and the direct access from smartctl will error
out when the drive has failed to the point of not being present.

You can additionally monitor:
1. Filesystem flags. These will change when the filesystem goes
degraded, and it's actually good practice for any filesystem, not just
BTRFS.
2. Total filesystem size. If this changes without manual intervention,
something is seriously wrong.
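
As a rough illustration of points 1 and 3 above (untested; the mount
point and device names are examples, and the SMART part only checks the
overall health status via smartctl's exit code):

    #!/bin/sh
    # mail any non-zero btrfs error counters and any SMART health complaints
    FS=/mnt/data
    errs=$(btrfs dev stats "$FS" | awk '$2 != 0')
    [ -n "$errs" ] && printf '%s\n' "$errs" | mail -s "btrfs dev stats: errors on $FS" root
    for dev in /dev/sda /dev/sdb; do
        smartctl -H "$dev" > /dev/null || \
            echo "$dev: smartctl -H reported a problem" | mail -s "SMART warning: $dev" root
    done
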
* Re: Monitoring Btrfs
From: Anand Jain
Date: 2016-10-18 21:36 UTC
To: Austin S. Hemmelgarn, Stefan Malte Schumacher, linux-btrfs, David Sterba

>>>> I would like to monitor my btrfs-filesystem for missing drives.

>>> This is actually correct behavior, the filesystem reports that it should
>>> have 6 devices, which is how it knows a device is missing.

>> Missing - means missing at the time of mount. So how are you planning
>> to monitor a disk which is failed while in production ?

> No, in `btrfs fi show` it means that it can't find the device.

'btrfs fi show' is misleading as compared to 'btrfs fi show -m'.
-m gives the btrfs-kernel perspective of the devices; as of now
there is no code in the kernel which changes the device status
while it is mounted (except for readonly, which is irrelevant in
raid1 with 1 disk failed).

> 1. Filesystem flags. These will change when the filesystem goes
> degraded,

Which flag is in question here?
* Re: Monitoring Btrfs
From: Austin S. Hemmelgarn
Date: 2016-10-19 11:15 UTC
To: Anand Jain, Stefan Malte Schumacher, linux-btrfs, David Sterba

On 2016-10-18 17:36, Anand Jain wrote:
>>>>> I would like to monitor my btrfs-filesystem for missing drives.
>
>>>> This is actually correct behavior, the filesystem reports that it
>>>> should have 6 devices, which is how it knows a device is missing.
>
>>> Missing - means missing at the time of mount. So how are you planning
>>> to monitor a disk which is failed while in production ?
>
>> No, in `btrfs fi show` it means that it can't find the device.
>
> 'btrfs fi show' is miss-leading as compared to 'btrfs fi show -m'
> -m tells btrfs-kernel perspective of the devices, as of now
> there is no code in the kernel which changes the device status
> while its mounted (expect for readonly, which is irrelevant in
> raid1 with 1 disk failed).
Actually, that's exactly how I would expect each of them to behave. We
need some way to get both the state the kernel thinks the FS is in, and
the state it's actually in (according to the tools, not the kernel),
and '-m' reporting kernel state while no '-m' reports actual state is
exactly what I would expect in this case.

That leads also to another way I hadn't thought of to monitor a
filesystem. The output of 'fi show' with and without '-m' should match
if the filesystem was healthy when mounted and is still healthy; if
they don't, then something is wrong.

>> 1. Filesystem flags. These will change when the filesystem goes
>> degraded,
>
> Which flag is in question here. ?
I should clarify here, I mean the mount options, I'm just used to the
monit terminology (which was not well picked in this case). The big one
to watch is the read-only flag, as BTRFS will force a filesystem
read-only (which updates the mount options). Any change to the mount
options though without manual intervention is generally a sign that
_something_ is wrong.
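
Going back to the 'fi show' vs. 'fi show -m' idea above, a crude way to
script the comparison might look like this (an untested bash sketch;
/mnt/data is just an example, and whether the two outputs stay
line-for-line comparable in every failure mode would need checking):

    #!/bin/bash
    # warn if the scanned view and the kernel view of the device list disagree
    if ! diff <(btrfs fi show /mnt/data | grep devid | sort) \
              <(btrfs fi show -m /mnt/data | grep devid | sort) > /dev/null; then
        echo "btrfs fi show and fi show -m disagree for /mnt/data" | mail -s "btrfs warning" root
    fi
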
* Re: Monitoring Btrfs
From: Anand Jain
Date: 2016-10-19 13:06 UTC
To: Austin S. Hemmelgarn, Stefan Malte Schumacher, linux-btrfs, David Sterba

On 10/19/16 19:15, Austin S. Hemmelgarn wrote:
[...]
> Actually, that's exactly how I would expect each of them to behave. We
> need some way to get both the state the kernel thinks the FS is in, and
> the state it's actually in (according to the tools, not the kernel),
> and '-m' reporting kernel state while no '-m' reports actual state is
> exactly what I would expect in this case.
>
> That leads also to another way I hadn't thought of to monitor a
> filesystem. The output of 'fi show' with and without '-m' should match
> if the filesystem was healthy when mounted and is still healthy; if
> they don't, then something is wrong.
>
> I should clarify here, I mean the mount options, I'm just used to the
> monit terminology (which was not well picked in this case). The big one
> to watch is the read-only flag, as BTRFS will force a filesystem
> read-only (which updates the mount options). Any change to the mount
> options though without manual intervention is generally a sign that
> _something_ is wrong.

btrfs-progs shouldn't add its own intelligence in determining the
device state; it should be a transparent tool that reports status from
the btrfs kernel code. So I am opposed to patches such as

  commit 206efb60cbe3049e0d44c6da3c1909aeee18f813
  btrfs-progs: Add missing devices check for mounted btrfs.

There are many ways a device can fail/recover in a SAN environment;
this device-state-managing intelligence should live in one place, and
in the kernel. The volume manager part of the code in the kernel
is incomplete.
* Re: Monitoring Btrfs
From: Austin S. Hemmelgarn
Date: 2016-10-19 13:33 UTC
To: Anand Jain, Stefan Malte Schumacher, linux-btrfs, David Sterba

On 2016-10-19 09:06, Anand Jain wrote:
[...]
> btrfs-progs shouldn't add its own intelligence in determining the
> device state, it should be a transparent tool to report status from
> the btrfs-kernel. So I opposed to the patches such as
>
> commit 206efb60cbe3049e0d44c6da3c1909aeee18f813
> btrfs-progs: Add missing devices check for mounted btrfs.
>
> There are many ways a device can fail/recover in the SAN environment,
> these device state managing intelligence should be at one place and
> in the kernel. The volume manager part of the code in the kernel
> is incomplete.
>
I don't agree that the management should be completely unified or that
the tools should just report kernel state. The tools have to have some
way to check device state for unmounted filesystems because they have
to operate on unmounted filesystems, and because until the kernel gets
smart enough to actually handle device state properly, some method is
needed to check the actual state of the devices. Even once the kernel
is smart enough, it's still helpful to see without mounting a
filesystem whether or not all the devices are there, and if we ever
switch to a real mount helper (which I am in favor of for multiple
reasons), we'll need device state checking in userspace for that too.

Take a look at LVM. The separation of responsibilities there is ideally
what we should be looking at long term for BTRFS. The userspace
components tell the kernel what to do, and list both kernel state _and_
physical state in a readable manner. The kernel tracks limited parts of
the state (only for active LVs, so the equivalent of mounted
filesystems, and even then only what it needs to track: is this RAID
volume in sync? Is that snapshot or thin storage pool getting close to
full?), and sends notifications to a userspace component which then
acts on those conditions (possibly then telling the kernel what to do
in response to them). On top of that, the userspace components don't
require a kernel which supports them for any off-line operations, and
the kernel works fine with older userspace. Both userspace and the
kernel handle missing devices (userspace tools report them, the kernel
refuses to activate LVs that require them).
* Re: Monitoring Btrfs
From: Anand Jain
Date: 2016-10-19 21:38 UTC
To: Austin S. Hemmelgarn, Stefan Malte Schumacher, linux-btrfs, David Sterba

On 10/19/16 21:33, Austin S. Hemmelgarn wrote:
[...]
> I don't agree that the management should be completely unified or that
> the tools should just report kernel state. The tools have to have some
> way to check device state for unmounted filesystems because they have
> to operate on unmounted filesystems, and because until the kernel gets
> smart enough to actually handle device state properly, some method is
> needed to check the actual state of the devices. Even once the kernel
> is smart enough, it's still helpful to see without mounting a
> filesystem whether or not all the devices are there, and if we ever
> switch to a real mount helper (which I am in favor of for multiple
> reasons), we'll need device state checking in userspace for that too.

That is a bit out of context. Here it is about monitoring devices while
the FS is mounted; in this context, if a tool applies its own
intelligence without the kernel, then that's wrong.

> Take a look at LVM. The separation of responsibilities there is ideally
> what we should be looking at long term for BTRFS. [...]
* Re: Monitoring Btrfs
From: Zygo Blaxell
Date: 2016-10-17 17:41 UTC
To: Stefan Malte Schumacher
Cc: linux-btrfs

On Mon, Oct 17, 2016 at 06:44:14PM +0200, Stefan Malte Schumacher wrote:
> Hello
>
> I would like to monitor my btrfs-filesystem for missing drives. On
> Debian mdadm uses a script in /etc/cron.daily, which calls mdadm and
> sends an email if anything is wrong with the array. I would like to do
> the same with btrfs. In my first attempt I grepped and cut the
> information from "btrfs fi show" and let the script send an email if
> the number of devices was not equal to the preselected number.
>
> Then I saw this:
>
> ubuntu@ubuntu:~$ sudo btrfs filesystem show
> Label: none uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
> Total devices 6 FS bytes used 5.47TiB
> devid 1 size 1.81TiB used 1.71TiB path /dev/sda3
> devid 2 size 1.81TiB used 1.71TiB path /dev/sdb3
> devid 3 size 1.82TiB used 1.72TiB path /dev/sdc1
> devid 4 size 1.82TiB used 1.72TiB path /dev/sdd1
> devid 5 size 2.73TiB used 2.62TiB path /dev/sde1
> *** Some devices missing
>
> on this page: https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
> The number of devices is still at 6, despite the fact that one of the
> drives is missing, which means that my first idea doesnt work.
Using fi show for this isn't a good idea. By the time btrfs fi show
tells you something is different from the norm, you've probably already
crashed at least once and are now mounting with the 'degraded' option.

> I have
> two questions:
> 1) Has anybody already written a script like this? After all, there is
> no need to reinvent the wheel a second time.
> 2) What should I best grep for? In this case I would just go for the
> "missing". Does this cover all possible outputs of btrfs fi show in
> case of a damaged array? What other outputs do I need to consider for
> my script.
I monitor the device error counters, i.e. the output of

	for fs in /fs1 /fs2 /fs3... ; do
		btrfs dev stat "$fs" | grep -v " 0$"
	done

and send an email when it isn't empty. When there are errors I
investigate in more detail (is it a failing disk? failed disk? bad
cables? bad RAM? One-off UNC sector that can be ignored?), fix any
problems (i.e. replace hardware, run scrub), and reset the counters to
zero with 'btrfs dev stat -z'.

> Yours sincerely
> Stefan
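
For what it's worth, one way to wire that up as a daily cron job (a
sketch; the mount points and mail recipient are examples):

    #!/bin/sh
    # collect non-zero btrfs device error counters and mail them if any are found
    out=$(for fs in /mnt/a /mnt/b; do btrfs dev stat "$fs" | grep -v " 0$"; done)
    [ -n "$out" ] && printf '%s\n' "$out" | mail -s "btrfs dev stat errors" root
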
* Re: Monitoring Btrfs
From: Kyle Manna
Date: 2016-10-17 17:55 UTC
To: Stefan Malte Schumacher
Cc: linux-btrfs

On Mon, Oct 17, 2016 at 9:44 AM, Stefan Malte Schumacher
<stefan.m.schumacher@gmail.com> wrote:
> Hello
>
> I would like to monitor my btrfs-filesystem for missing drives. On
> Debian mdadm uses a script in /etc/cron.daily, which calls mdadm and
> sends an email if anything is wrong with the array. I would like to do
> the same with btrfs. In my first attempt I grepped and cut the
> information from "btrfs fi show" and let the script send an email if
> the number of devices was not equal to the preselected number.
>
> ...
>
> 1) Has anybody already written a script like this? After all, there is
> no need to reinvent the wheel a second time.

I don't have a solution to your primary question regarding message
parsing, but I do something different which may offer another
perspective on your monitoring and reporting.

I employ systemd with timers to scrub my btrfs volumes[0][1] every
week. I used to use either an OnFailure[2] trigger or my failure
monitor log (aka systemd-journal) parser[3] to send me emails if the
service failed to run. This is a more "modern" approach to cron.weekly
+ custom shell script for people that like systemd, love it or hate it.

Recently I dropped the systemd journal parser in favour of remote
logging with rsyslog + Papertrail[4], with a few alerts for things like
"systemd Failed to start", which indicates that the script returned a
non-zero exit code. Papertrail then emails me when any of a handful of
machines trips up.

It's also worth noting that logstash[5] (or similar) may be another way
to parse log files. It could be a bloated, overkill solution for
something that a 10-line shell script could accomplish; it depends on
whether you leverage it for things beyond basic log parsing.

[0] https://github.com/kylemanna/systemd-utils/blob/master/units/btrfs-scrub.service
[1] https://github.com/kylemanna/systemd-utils/blob/master/units/btrfs-scrub.timer
[2] https://github.com/kylemanna/systemd-utils/tree/master/onfailure
[3] https://github.com/kylemanna/systemd-utils/tree/master/failure-monitor
[4] https://blog.kylemanna.com/linux/logging-all-the-things-with-rsyslog-and-papertrail/
[5] https://www.elastic.co/guide/en/logstash/current/introduction.html
* Re: Monitoring Btrfs
From: Chris Murphy
Date: 2016-10-17 20:40 UTC
To: Btrfs BTRFS

It may be better to use /sys/fs/btrfs/<uuid>/devices to find the
devices to monitor, and then monitor them with blktrace - maybe there's
some coarser granularity available there, I'm not sure. The thing is,
as far as Btrfs alone is concerned, a drive can be "bad" and you're
effectively degraded, while the drive is not missing. Unless it's
physically removed or somehow dead, it'll still be seen but can produce
all kinds of mayhem.

Chris Murphy
* Re: Monitoring Btrfs
From: Austin S. Hemmelgarn
Date: 2016-10-18 12:41 UTC
To: Btrfs BTRFS

On 2016-10-17 16:40, Chris Murphy wrote:
> May be better to use /sys/fs/btrfs/<uuid>/devices to find the device
> to monitor, and then monitor them with blktrace - maybe there's some
> courser granularity available there, I'm not sure. The thing is, as
> far as Btrfs alone is concerned, a drive can be "bad" and you're
> effectively degraded, while the drive is not missing. Unless it's
> physical removed or somehow dead, it'll still be seen but can produce
> all kinds of mayhem.
This is exactly why you should be monitoring the disks themselves, not
just BTRFS. I would not advise using blktrace for monitoring in
production though, it technically risks an information leak, and it's
not exactly low impact.
* Re: Monitoring Btrfs
From: Stefan Malte Schumacher
Date: 2016-10-19 22:46 UTC
To: linux-btrfs

Thanks to everybody for the help. I have been doing regular SMART tests
for quite some time and I also let scrub run on a regular basis. I will
use a simple shell script which inspects the output of "btrfs fi show"
for lines containing "missing", and another which checks "btrfs dev
stats" for non-zero entries. Thanks everybody for your advice.

Stefan

2016-10-17 18:44 GMT+02:00 Stefan Malte Schumacher
<stefan.m.schumacher@gmail.com>:
> Hello
>
> I would like to monitor my btrfs-filesystem for missing drives. [...]
>
> Yours sincerely
> Stefan